From 2794519506bd389b33615c8e5dc6f914044832b6 Mon Sep 17 00:00:00 2001
From: Matej Focko
Date: Thu, 16 Nov 2023 10:16:13 +0100
Subject: [PATCH 1/3] feat(algorithms): add Breaking of the Hash Table

Signed-off-by: Matej Focko
---
 .../2023-11-28-breaking/01-python.md          | 230 ++++++++++++++++++
 .../2023-11-28-breaking/02-mitigations.md     | 150 ++++++++++++
 .../2023-11-28-breaking/index.md              | 207 ++++++++++++++++
 .../hash-tables/breaking/benchmark.cpp        | 133 ++++++++++
 .../hash-tables/breaking/benchmark.py         | 118 +++++++++
 5 files changed, 838 insertions(+)
 create mode 100644 algorithms/12-hash-tables/2023-11-28-breaking/01-python.md
 create mode 100644 algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
 create mode 100644 algorithms/12-hash-tables/2023-11-28-breaking/index.md
 create mode 100644 static/files/algorithms/hash-tables/breaking/benchmark.cpp
 create mode 100644 static/files/algorithms/hash-tables/breaking/benchmark.py

diff --git a/algorithms/12-hash-tables/2023-11-28-breaking/01-python.md b/algorithms/12-hash-tables/2023-11-28-breaking/01-python.md
new file mode 100644
index 0000000..c35c626
--- /dev/null
+++ b/algorithms/12-hash-tables/2023-11-28-breaking/01-python.md
@@ -0,0 +1,230 @@
+---
+id: python
+title: Breaking Python
+description: |
+  Actually getting the worst-case time complexity in Python.
+tags:
+  - cpp
+  - python
+  - hash-tables
+last_update:
+  date: 2023-11-28
+---
+
+## Breaking the Hash Table in Python
+
+Our language of choice for bringing the worst out of the hash table is _Python_.
+
+Let's start by talking about the hash function and why we've chosen Python for
+this. The hash function for integers in Python is simply the _identity_, so, as
+you might've guessed, there's no avalanche effect. Another thing that helps us
+is the fact that integers in Python are technically `BigInt`s[^1]. This allows
+us to put a bit more pressure on the hash function.
+
+From the implementation perspective, it is a hash table that uses probing to
+resolve conflicts. This also means that it occupies a contiguous space in
+memory. Indexing works like in the example provided earlier. When the hash
+table reaches a _breaking point_ (defined somewhere in the C code), it
+reallocates the table and rehashes everything.
+
+:::tip
+
+Resizing and rehashing can reduce the number of conflicts. That follows from
+the fact that the position in the table is determined by both the hash and the
+size of the table itself.
+
+:::
+
+## Preparing the attack
+
+Knowing all of the above, it is not that hard to construct a method that causes
+as many conflicts as possible. Let's go over what we know:
+
+1. We know that integers are hashed to themselves.
+2. We also know that only the lower bits of the hash are used as an index into
+   the table.
+3. We also know that there's rehashing on resize, which could possibly fix some
+   of the conflicts.
+
+We will test with different sequences:
+
+1. an ordered one, numbers from 1 to N
+2. an ordered one in reversed order, numbers from N back to 1
+3. numbers that are shifted to the left, so that they create conflicts until
+   a resize happens
+4. numbers that are shifted to the left, where resizing helps only at the very
+   end
+5. numbers that are shifted to the left, but they won't be taken into account
+   even after the final resize
+
+For each of these sequences, we will insert 10⁷ elements and look each of them
+up 10 times in a row.
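+
+To see why the shifted numbers hurt, here is a minimal sketch of the indexing
+scheme described above (the table size and the code are made up for
+illustration; this is the idea, not CPython's actual implementation):
+
+```py
+TABLE_SIZE = 8192  # hypothetical table size; always a power of two
+
+def bucket_index(key: int) -> int:
+    # integers hash to themselves, and only the lower bits pick the slot
+    return hash(key) & (TABLE_SIZE - 1)
+
+# every element of the last sequence lands in the very same bucket,
+# because the left shift zeroes out all of the lower 32 bits
+assert all(bucket_index(x << 32) == 0 for x in range(1, 100))
+```
+
+Every insertion then starts probing from the same slot, which is exactly the
+kind of degradation we want to provoke.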
+
+As a base for our benchmark, we will use a `Strategy` class; each concrete
+strategy then just implements the sequence of numbers that it uses:
+
+```py
+class Strategy:
+    def __init__(self, data_structure=set):
+        self._table = data_structure()
+
+    @cached_property
+    def elements(self):
+        raise NotImplementedError("Implement for each strategy")
+
+    @property
+    def name(self):
+        raise NotImplementedError("Implement for each strategy")
+
+    def run(self):
+        print(f"\nBenchmarking:\t\t{self.name}")
+
+        # Extract the elements here, so that the evaluation of them does not
+        # slow down the relevant part of the benchmark
+        elements = self.elements
+
+        # Insertion phase
+        start = monotonic_ns()
+        for x in elements:
+            self._table.add(x)
+        after_insertion = monotonic_ns()
+
+        print(f"Insertion phase:\t{(after_insertion - start) / 1000000:.2f}ms")
+
+        # Lookup phase
+        start = monotonic_ns()
+        for _ in range(LOOPS):
+            for x in elements:
+                assert x in self._table
+        after_lookups = monotonic_ns()
+
+        print(f"Lookup phase:\t\t{(after_lookups - start) / 1000000:.2f}ms")
+```
+
+### Sequences
+
+Let's have a look at how we generate the numbers to be inserted:
+
+- ordered sequence (ascending)
+  ```py
+  x for x in range(N_ELEMENTS)
+  ```
+- ordered sequence (descending)
+  ```py
+  x for x in reversed(range(N_ELEMENTS))
+  ```
+- progressive sequence that “heals” on resize
+  ```py
+  (x << max(5, x.bit_length())) for x in range(N_ELEMENTS)
+  ```
+- progressive sequence that “heals” in the end
+  ```py
+  (x << max(5, x.bit_length())) for x in reversed(range(N_ELEMENTS))
+  ```
+- conflicts everywhere
+  ```py
+  x << 32 for x in range(N_ELEMENTS)
+  ```
+
+## Results
+
+Let's have a look at the results obtained by running the code:
+
+| Technique                                    | Insertion phase | Lookup phase |
+| :------------------------------------------: | --------------: | -----------: |
+| ordered sequence (ascending)                 |      `558.60ms` |  `3304.26ms` |
+| ordered sequence (descending)                |      `554.08ms` |  `3365.84ms` |
+| progressive sequence that “heals” on resize  |     `3781.30ms` | `28565.71ms` |
+| progressive sequence that “heals” in the end |     `3280.38ms` | `26494.61ms` |
+| conflicts everywhere                         |     `4027.54ms` | `29132.92ms` |
+
+You can see a noticeable “jump” in the times after switching to the
+“progressive” sequences. The last sequence, which conflicts all the time, has
+the worst times overall, even though its insertion phase is rather comparable
+with the first progressive sequence.
+
+If we were to compare the _always conflicting_ sequence with the first one, we
+can see that insertion took over 7× longer and lookups took almost 9× longer.
+
+You can have a look at the code [here](path:///files/algorithms/hash-tables/breaking/benchmark.py).
+
+## Comparing with the tree
+
+:::danger
+
+Source code can be found [here](path:///files/algorithms/hash-tables/breaking/benchmark.cpp).
+
+_Viewer discretion advised._
+
+:::
+
+Python doesn't have a tree-based set/map in its standard library, therefore for
+a comparison we will run a similar benchmark in C++.
+By running the same sequences on both the hash table and the tree (an RB-tree),
+we obtain the following results:
+
+| Technique            | Insertion (hash) | Lookup (hash) | Insertion (tree) | Lookup (tree) |
+| :------------------: | ---------------: | ------------: | ---------------: | ------------: |
+| ordered (ascending)  |          `316ms` |       `298ms` |         `2098ms` |      `5914ms` |
+| ordered (descending) |          `259ms` |       `315ms` |         `1958ms` |     `14747ms` |
+| progressive a)       |         `1152ms` |      `6021ms` |         `2581ms` |     `16074ms` |
+| progressive b)       |         `1041ms` |      `6096ms` |         `2770ms` |     `15986ms` |
+| conflicts            |          `964ms` |      `1633ms` |         `2559ms` |     `13285ms` |
+
+:::note
+
+We can't forget that implementation details may be involved here. The hash
+function is still the identity, to my knowledge.
+
+:::
+
+One interesting thing to notice is the fact that the progressive sequences took
+the most time in lookups (which is not the same as in Python).
+
+Now, if we have a look at the tree implementation, we can notice two very
+distinctive things:
+
+1. Tree implementations are not affected by the input, therefore (except for
+   the first sequence) we can see **very consistent** times.
+2. Compared to the hash table, the times are much higher and far from ideal.
+
+The reason for the 2nd point may not be very obvious. From the technical
+perspective it makes some sense. Let's dive into it!
+
+If we take a hash table, it is backed by an array, i.e., one contiguous piece
+of memory. (For more information, I'd suggest the first entry in the references
+section below, by _Bjarne Stroustrup_.)
+
+On the other hand, if we take a look at the tree, each node holds some
+attributes and pointers to its left and right descendants. Even if we maintain
+a reasonable height of the tree (keep the tree balanced), we still need to
+follow the pointers, which point to nodes _somewhere_ on the heap. When
+traversing the tree, we get a consistent time complexity, but at the expense of
+jumping between the nodes on the heap, which takes some time.
+
+:::danger
+
+This is not meant to favor the hash table and persuade you not to use tree
+representations. There are benefits to each of the respective data structures,
+even if the times are not the best.
+
+Overall, if we compare the worst-case time complexities of the tree and the
+hash table, the tree representation comes off better.
+
+:::
+
+:::tip Challenge
+
+Try to benchmark with a similar approach in Rust. Since Rust uses a different
+hash function by default, it is best to just override the hash; this way you
+can also avoid the hard part of this attack (making up the numbers that will
+collide).
+
+:::
+
+---
+
+## References
+
+1. Bjarne Stroustrup.
+   [Are lists evil?](https://www.stroustrup.com/bs_faq.html#list)
+
+[^1]: Arbitrary-sized integers; they can get as big as your memory allows.
diff --git a/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md b/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
new file mode 100644
index 0000000..8d58239
--- /dev/null
+++ b/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
@@ -0,0 +1,150 @@
+---
+id: mitigations
+title: Possible Mitigations
+description: |
+  Ways to prevent attacks on the hash table.
+tags:
+  - cpp
+  - python
+  - hash-tables
+last_update:
+  date: 2023-11-28
+---
+
+There are multiple ways the issues created above can be mitigated.
+Still, we can only make things better; we cannot guarantee the ideal time
+complexity…
+
+For the sake of simplicity (and for ease of referencing an article by _Neal Wu_
+on the same topic; see the references below) I will use C++ to describe the
+mitigations.
+
+## Random seed
+
+One of the options for avoiding this kind of attack is to introduce a random
+seed into the hash. That way it is not that easy to choose the _nasty_ numbers.
+
+```cpp
+struct custom_hash {
+    size_t operator()(uint64_t x) const {
+        return x + 7529;
+    }
+};
+```
+
+As you may have noticed, this is not very helpful, since it just shifts the
+issue by some constant. A better option is to use a shift that changes from run
+to run:
+
+```cpp
+struct custom_hash {
+    size_t operator()(uint64_t x) const {
+        static const uint64_t FIXED_RANDOM =
+            chrono::steady_clock::now().time_since_epoch().count();
+        return x + FIXED_RANDOM;
+    }
+};
+```
+
+In this case the hash uses a high-precision clock to shift the number, which is
+much harder to break.
+
+## Better random seed
+
+Building on the previous solution, we can do some _bit magic_ instead of just
+shifting:
+
+```cpp
+struct custom_hash {
+    size_t operator()(uint64_t x) const {
+        static const uint64_t FIXED_RANDOM =
+            chrono::steady_clock::now().time_since_epoch().count();
+        x ^= FIXED_RANDOM;
+        return x ^ (x >> 16);
+    }
+};
+```
+
+This not only shifts the number, but also manipulates the underlying bits of
+the hash by applying the `XOR` operation.
+
+## Adjusting the hash function
+
+Another option is to switch up the hash function.
+
+For example Rust uses [_SipHash_](https://en.wikipedia.org/wiki/SipHash) by
+default.
+
+On the other hand, you can usually specify your own hash function; here we will
+follow the article by _Neal_, which uses the so-called _`splitmix64`_.
+
+```cpp
+static uint64_t splitmix64(uint64_t x) {
+    // http://xorshift.di.unimi.it/splitmix64.c
+    x += 0x9e3779b97f4a7c15;
+    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
+    x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
+    return x ^ (x >> 31);
+}
+```
+
+As you can see, this definitely doesn't do identity on the integers :smile:
+
+## Combining both
+
+Can we make it better? Of course! Use multiple mitigations at the same time. In
+our case, we will both inject the random value **and** use the _`splitmix64`_:
+
+```cpp
+struct custom_hash {
+    static uint64_t splitmix64(uint64_t x) {
+        // http://xorshift.di.unimi.it/splitmix64.c
+        x += 0x9e3779b97f4a7c15;
+        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
+        x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
+        return x ^ (x >> 31);
+    }
+
+    size_t operator()(uint64_t x) const {
+        static const uint64_t FIXED_RANDOM =
+            chrono::steady_clock::now().time_since_epoch().count();
+        return splitmix64(x + FIXED_RANDOM);
+    }
+};
+```
+
+## Fallback for extreme cases
+
+As we have mentioned above, Python resolves conflicts by probing (it looks for
+an empty spot somewhere else in the table, but it's deterministic about it, so
+it's not “_oops, this is full, let's go one-by-one and find some spot_”). C++
+and Java, on the other hand, resolve conflicts with linked lists, as in the
+usual textbook depiction of the hash table.
+
+However, Java does something more intelligent: once you go over a threshold of
+conflicts in one spot, it converts the linked list into an RB-tree that is
+sorted by the hash and the key respectively.
+
+:::tip
+
+You may wonder what sense it makes to order the tree by the hash if we're
+dealing with conflicts.
+Well, there are fewer buckets than the range of the hash, so if we take just
+the lower bits, we can get a conflict even though the hashes are not the same.
+
+:::
+
+You might have noticed that if we get a **really bad** hash function, this is
+not very helpful. It is not, **but** it can help in other cases.
+
+:::danger
+
+As the ordering on the keys of the hash table is not required and may not be
+implemented, the tree may be ordered by just the hash.
+
+:::
+
+---
+
+## References
+
+1. Neal Wu.
+   [Blowing up `unordered_map`, and how to stop getting hacked on it](https://codeforces.com/blog/entry/62393).
diff --git a/algorithms/12-hash-tables/2023-11-28-breaking/index.md b/algorithms/12-hash-tables/2023-11-28-breaking/index.md
new file mode 100644
index 0000000..60a4f58
--- /dev/null
+++ b/algorithms/12-hash-tables/2023-11-28-breaking/index.md
@@ -0,0 +1,207 @@
+---
+id: breaking
+title: Breaking Hash Table
+description: |
+  How to get the linear time complexity in a hash table.
+tags:
+  - cpp
+  - python
+  - hash-tables
+last_update:
+  date: 2023-11-28
+---
+
+We will try to break a hash table and discuss possible ways to prevent such
+issues from occurring.
+
+## Introduction
+
+Hash tables are very commonly used to represent sets or dictionaries. Even when
+you look up solution to some problem that requires set or dictionary, it is more
+than likely that you'll find something that references usage of hash table. You
+might think it's the only possible option[^1] or it's the best one[^2].
+
+One of the reasons to prefer hash tables over any other representation is the
+fact that they are **supposed** to be faster than the alternatives, but the
+truth lies somewhere in between.
+
+One of the other possible implementations of the set is a balanced tree. One of
+the most common implementations rely on the _red-black tree_, but you may see
+also others like the _AVL tree_[^3] or _B-tree_[^4].
+
+## Hash Table v. Trees
+
+The interesting part are the differences between those implementations. Why
+should you choose hash table, or why should you choose the tree implementation?
+Let's compare the differences one by one.
+
+### Requirements
+
+We will start with the fundamentals on which the underlying data structures
+rely. We can also consider them as _requirements_ that must be met to be able to
+use the underlying data structure.
+
+Hash table relies on the _hash function_ that is supposed to distribute the keys
+in such way that they're evenly spread across the slots in the array where the
+keys (or pairs, for dictionary) are stored, but at the same time they're
+somewhat unique, so no clustering occurs.
+
+Trees depend on the _ordering_ of the elements. Trees maintain the elements in
+a sorted fashion, so for any pair of the elements that are used as keys, you
+need to be able to decide which one of them is _smaller or equal to_ the other.
+
+Hash function can be easily created by using the bits that _uniquely_ identify
+a unique element. On the other hand, ordering may not be as easy to define.
+
+:::tip Example
+
+If you are familiar with complex numbers, they are a great example of a key that
+does not have ordering (unless you go element-wise for the sake of storing them
+in a tree; though the ordering **is not** defined on them).
+
+Hashing them is much easier though, you can just “combine” the hashes of real
+and imaginary parts of the complex number to get a hash of the complex number
+itself.
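+
+A minimal sketch of such a combination (the mixing constant is an arbitrary
+choice for illustration, not how any particular language implements it):
+
+```py
+def complex_hash(z: complex) -> int:
+    # multiply one component's hash by an odd constant, so that
+    # swapping the real and imaginary parts changes the result
+    return hash(z.real) ^ (hash(z.imag) * 1_000_003)
+```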
+
+:::
+
+### Underlying data structure
+
+The most obvious difference is the _core_ of the idea behind these data
+structures. Hash tables rely on data being stored in one continuous piece of
+memory (the array) where you can “guess” (by using the hash function) the
+location of what you're looking for in constant time and also access that
+location in the, said, constant time[^5]. In case the hash function is
+_not good enough_[^6], you need to go in blind, and if it comes to the worst,
+check everything.
+
+:::tip tl;dr
+
+- I know where I should look
+- I can look there instantaneously
+- If my guesses are very wrong, I might need to check everything
+
+:::
+
+On the other hand, tree implementations rely on the self-balancing trees in
+which you don't get as _amazing_ results as with the hash table, but they're
+consistent. Given that we have self-balancing tree, the height is same for
+**every** input.
+
+:::tip tl;dr
+
+- I don't know where to look
+- I know how to get there
+- Wherever I look, it takes me about the same time
+
+:::
+
+Let's compare side by side:
+
+| time complexity |       hash table       |         tree          |
+| --------------: | :--------------------: | :-------------------: |
+|        expected |        constant        | depends on the height |
+|      worst-case | gotta check everything | depends on the height |
+
+## Major Factors of Hash Tables
+
+Let's have a look at the major factors that affect the efficiency and
+functioning of a hash table. We have already mentioned the hash function that
+plays a crucial role, but there are also different ways you can implement
+a hash table, so we will have a look at those too.
+
+### Hash functions
+
+:::info
+
+We will start with a mathematical definition of the hash function and its type
+signature in some known language:
+
+$$
+  h : T \rightarrow \mathbb{N}
+$$
+
+For a language we will just take the definition from C++[^7]:
+
+```cpp
+std::size_t operator()(const T& key) const;
+```
+
+If you compare with the mathematical definition, it is very similar, except for
+the fact that the memory is not unlimited, so _natural number_ turned into an
+_unsigned integer type_ (on majority of platforms it will be a 64-bit unsigned
+integer).
+
+:::
+
+As we have already touched on above, the hash function gives “a guess” where to
+look for the key (either when doing a look-up, or when looking for a suitable
+spot during insertion).
+
+Hash functions are expected to have a so-called _avalanche effect_ which means
+that the smallest change to the key should result in a massive change of hash.
+
+Avalanche effect technically guarantees that even when your data are clustered
+together, it should lower the amount of conflicts that can occur.
+
+:::tip Exercise for the reader
+
+Try to give an example of a hash function that is not good at all.
+
+:::
+
+### Implementation details
+
+There are different variations of the hash tables. You've most than likely seen
+an implementation that keeps linked lists for buckets. However there are also
+other variations that use probing instead and so on.
+
+With regards to the implementation details, we need to mention the fact that
+even with the bounded hash (as we could've seen above), you're not likely to
+have all the buckets for different hashes available. The most common approach
+to this is having a smaller set of buckets and modifying the hash to fit
+within.
+
+One of the most common approaches is to keep lengths of the hash tables in the
+powers of 2 which allows bit-masking to take place.
+
+:::tip Example
+
+Let's say we're given `h = 0xDEADBEEF` and we have `l = 65536=2^16` spots in our
+hash table. What can we do here?
+
+Well, we definitely have a bigger hash than spots available, so we need to
+“shrink” it somehow. Most common practice is to take the lower bits of the hash
+to represent an index in the table:
+
+```
+h & (l - 1)
+```
+
+_Why does this work?_ Firstly we subtract 1 from the length (indices run from
+`0..=(l - 1)`, since table is zero-indexed). Therefore if we do _binary and_ on
+any number, we always get a valid index within the table. Let's find the index
+for our hash:
+
+```
+0xDEADBEEF & 0xFFFF = 0xBEEF
+```
+
+:::
+
+[^1]: not true
+[^2]: also not true
+[^3]: actually first of its kind (the self-balanced trees)
+[^4]:
+    Rust chose to implement this instead of the common choice of the red-black
+    or AVL tree; main difference lies in the fact that B-trees are not binary
+    trees
+
+[^5]:
+    This, of course, does not hold true for the educational implementations of
+    the hash tables where conflicts are handled by storing the items in the
+    linked lists. In practice linked lists are not that commonly used for
+    addressing this issue as it has even worse impact on the efficiency of the
+    data structure.
+
+[^6]: My guess is not very good, or it's really bad…
+[^7]: https://en.cppreference.com/w/cpp/utility/hash
diff --git a/static/files/algorithms/hash-tables/breaking/benchmark.cpp b/static/files/algorithms/hash-tables/breaking/benchmark.cpp
new file mode 100644
index 0000000..1c666c0
--- /dev/null
+++ b/static/files/algorithms/hash-tables/breaking/benchmark.cpp
@@ -0,0 +1,133 @@
+#include <bit>
+#include <cassert>
+#include <chrono>
+#include <cstdint>
+#include <iostream>
+#include <ranges>
+#include <set>
+#include <string>
+#include <unordered_set>
+#include <utility>
+
+using elem_t = std::uint64_t;
+
+const elem_t N_ELEMENTS = 10000000;
+#define LOOPS 10
+
+template <typename T> struct strategy {
+  virtual std::string name() const = 0;
+  virtual T elements() = 0;
+
+  template <typename C> void run(C &&s) {
+    using namespace std;
+
+    cout << "\nBenchmarking:\t\t" << name() << '\n';
+
+    auto start = chrono::steady_clock::now();
+    for (auto x : elements()) {
+      s.insert(x);
+    }
+    auto after_insertion = chrono::steady_clock::now();
+
+    auto insertion_time =
+        chrono::duration_cast<chrono::milliseconds>(after_insertion - start);
+    cout << "Insertion phase:\t" << insertion_time << "\n";
+
+    start = chrono::steady_clock::now();
+    for (int i = 0; i < LOOPS; ++i) {
+      for (auto x : elements()) {
+        assert(s.contains(x));
+      }
+    }
+    auto after_lookups = chrono::steady_clock::now();
+
+    auto lookup_time =
+        chrono::duration_cast<chrono::milliseconds>(after_lookups - start);
+    cout << "Lookup phase:\t\t" << lookup_time << "\n";
+  }
+
+  virtual ~strategy() = default;
+};
+
+using iota_t =
+    decltype(std::views::iota(static_cast<elem_t>(0), static_cast<elem_t>(0)));
+
+struct ascending_ordered_sequence : public strategy<iota_t> {
+  std::string name() const override { return "ordered sequence (ascending)"; }
+  iota_t elements() override {
+    return std::views::iota(static_cast<elem_t>(0), N_ELEMENTS);
+  }
+};
+
+static elem_t reverse(elem_t x) { return static_cast<elem_t>(N_ELEMENTS) - x; }
+using reversed_iota_t =
+    decltype(std::views::iota(static_cast<elem_t>(0), static_cast<elem_t>(0)) |
+             std::views::transform(reverse));
+
+struct descending_ordered_sequence : public strategy<reversed_iota_t> {
+  std::string name() const override { return "ordered sequence (descending)"; }
+  reversed_iota_t elements() override {
+    return std::views::iota(static_cast<elem_t>(1), N_ELEMENTS + 1) |
+           std::views::transform(reverse);
+  }
+};
+
+// shift x past its own bit width (plus a bit extra), zeroing out the lower
+// bits that small tables use for indexing
+static elem_t attack(elem_t x) { return x << (5 + std::bit_width(x)); }
+using attacked_iota_t =
+    decltype(std::views::iota(static_cast<elem_t>(0), static_cast<elem_t>(0)) |
+             std::views::transform(attack));
+
+struct progressive_ascending_attack : public strategy<attacked_iota_t> {
+  std::string name() const override {
+    return "progressive sequence that self-heals on resize";
+  }
+  attacked_iota_t elements() override {
+    return std::views::iota(static_cast<elem_t>(0), N_ELEMENTS) |
+           std::views::transform(attack);
+  }
+};
+
+using reversed_attacked_iota_t =
+    decltype(std::views::iota(static_cast<elem_t>(0), static_cast<elem_t>(0)) |
+             std::views::transform(reverse) | std::views::transform(attack));
+
+struct progressive_descending_attack
+    : public strategy<reversed_attacked_iota_t> {
+  std::string name() const override {
+    return "progressive sequence that self-heals in the end";
+  }
+  reversed_attacked_iota_t elements() override {
+    return std::views::iota(static_cast<elem_t>(1), N_ELEMENTS + 1) |
+           std::views::transform(reverse) | std::views::transform(attack);
+  }
+};
+
+// with a fixed shift of 32, the lower 32 bits are always zero
+static elem_t shift(elem_t x) { return x << 32; }
+using shifted_iota_t =
+    decltype(std::views::iota(static_cast<elem_t>(0), static_cast<elem_t>(0)) |
+             std::views::transform(shift));
+
+struct hard_attack : public strategy<shifted_iota_t> {
+  std::string name() const override { return "carefully chosen numbers"; }
+  shifted_iota_t elements() override {
+    return std::views::iota(static_cast<elem_t>(0), N_ELEMENTS) |
+           std::views::transform(shift);
+  }
+};
+
+template <typename C> void run_all(const std::string &note) {
+  std::cout << "\n«" << note << "»\n";
+
+  ascending_ordered_sequence{}.run(C{});
+  descending_ordered_sequence{}.run(C{});
+  progressive_ascending_attack{}.run(C{});
+  progressive_descending_attack{}.run(C{});
+  hard_attack{}.run(C{});
+}
+
+int main() {
+  run_all<std::unordered_set<elem_t>>("hash table");
+  run_all<std::set<elem_t>>("red-black tree");
+
+  return 0;
+}
diff --git a/static/files/algorithms/hash-tables/breaking/benchmark.py b/static/files/algorithms/hash-tables/breaking/benchmark.py
new file mode 100644
index 0000000..710ea24
--- /dev/null
+++ b/static/files/algorithms/hash-tables/breaking/benchmark.py
@@ -0,0 +1,118 @@
+#!/usr/bin/env python3
+
+from functools import cached_property
+from time import monotonic_ns
+
+N_ELEMENTS = 10_000_000
+LOOPS = 10
+
+
+class Strategy:
+    def __init__(self, data_structure=set):
+        self._table = data_structure()
+
+    @cached_property
+    def elements(self):
+        raise NotImplementedError("Implement for each strategy")
+
+    @property
+    def name(self):
+        raise NotImplementedError("Implement for each strategy")
+
+    def run(self):
+        print(f"\nBenchmarking:\t\t{self.name}")
+
+        # Extract the elements here, so that the evaluation of them does not
+        # slow down the relevant part of the benchmark
+        elements = self.elements
+
+        # Insertion phase
+        start = monotonic_ns()
+        for x in elements:
+            self._table.add(x)
+        after_insertion = monotonic_ns()
+
+        print(f"Insertion phase:\t{(after_insertion - start) / 1000000:.2f}ms")
+
+        # Lookup phase
+        start = monotonic_ns()
+        for _ in range(LOOPS):
+            for x in elements:
+                assert x in self._table
+        after_lookups = monotonic_ns()
+
+        print(f"Lookup phase:\t\t{(after_lookups - start) / 1000000:.2f}ms")
+
+
+class AscendingOrderedSequence(Strategy):
+    @property
+    def name(self):
+        return "ordered sequence (ascending)"
+
+    @cached_property
+    def elements(self):
+        return [x for x in range(N_ELEMENTS)]
+
+
+class DescendingOrderedSequence(Strategy):
+    @property
+    def name(self):
+        return "ordered sequence (descending)"
+
+    @cached_property
+    def elements(self):
+        return [x for x in reversed(range(N_ELEMENTS))]
+
+
+class ProgressiveAttack(Strategy):
+    @staticmethod
+    def _break(n):
+        return n << max(5, n.bit_length())
+
+
+class ProgressiveAscendingAttack(ProgressiveAttack):
+    @property
+    def name(self):
+        return "progressive sequence that self-heals on resize"
+
+    @cached_property
+    def elements(self):
+        return [self._break(x) for x in range(N_ELEMENTS)]
+
+
+class ProgressiveDescendingAttack(ProgressiveAttack):
+    @property
+    def name(self):
+        return "progressive sequence that self-heals in the end"
+
+    @cached_property
+    def elements(self):
+        return [self._break(x) for x in reversed(range(N_ELEMENTS))]
+
+
+class HardAttack(Strategy):
+    @property
+    def name(self):
+        return "carefully chosen numbers"
+
+    @cached_property
+    def elements(self):
+        return [x << 32 for x in range(N_ELEMENTS)]
+
+
+STRATEGIES = [
+    AscendingOrderedSequence,
+    DescendingOrderedSequence,
+    ProgressiveAscendingAttack,
+    ProgressiveDescendingAttack,
+    HardAttack,
+]
+
+
+def main():
+    for strategy in STRATEGIES:
+        strategy().run()
+
+
+if __name__ == "__main__":
+    main()

From f2810936fade8b59828355cfcb80ecd23aa44ab6 Mon Sep 17 00:00:00 2001
From: Matej Focko
Date: Tue, 28 Nov 2023 19:18:20 +0100
Subject: [PATCH 2/3] fix(algorithms): don't include date in the slug

Signed-off-by: Matej Focko
---
 algorithms/12-hash-tables/2023-11-28-breaking/01-python.md      | 1 +
 algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md | 1 +
 algorithms/12-hash-tables/2023-11-28-breaking/index.md          | 1 +
 3 files changed, 3 insertions(+)

diff --git a/algorithms/12-hash-tables/2023-11-28-breaking/01-python.md b/algorithms/12-hash-tables/2023-11-28-breaking/01-python.md
index c35c626..63c8e6d 100644
--- a/algorithms/12-hash-tables/2023-11-28-breaking/01-python.md
+++ b/algorithms/12-hash-tables/2023-11-28-breaking/01-python.md
@@ -1,5 +1,6 @@
 ---
 id: python
+slug: /hash-tables/breaking/python
 title: Breaking Python
 description: |
   Actually getting the worst-case time complexity in Python.
diff --git a/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md b/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
index 8d58239..e724a71 100644
--- a/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
+++ b/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
@@ -1,5 +1,6 @@
 ---
 id: mitigations
+slug: /hash-tables/breaking/mitigations
 title: Possible Mitigations
 description: |
   Ways to prevent attacks on the hash table.
diff --git a/algorithms/12-hash-tables/2023-11-28-breaking/index.md b/algorithms/12-hash-tables/2023-11-28-breaking/index.md
index 60a4f58..d6b2811 100644
--- a/algorithms/12-hash-tables/2023-11-28-breaking/index.md
+++ b/algorithms/12-hash-tables/2023-11-28-breaking/index.md
@@ -1,5 +1,6 @@
 ---
 id: breaking
+slug: /hash-tables/breaking
 title: Breaking Hash Table
 description: |
   How to get the linear time complexity in a hash table.
From 6117d79454d29f595b83ef9b96d9165dd8841959 Mon Sep 17 00:00:00 2001
From: Matej Focko
Date: Tue, 28 Nov 2023 19:32:38 +0100
Subject: [PATCH 3/3] fix(algorithms): reword some parts of breaking the hash
 table

Signed-off-by: Matej Focko
---
 .../2023-11-28-breaking/02-mitigations.md     | 30 ++++++++++
 .../2023-11-28-breaking/index.md              | 56 +++++++++----------
 2 files changed, 58 insertions(+), 28 deletions(-)

diff --git a/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md b/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
index e724a71..c515320 100644
--- a/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
+++ b/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
@@ -89,6 +89,36 @@ static uint64_t splitmix64(uint64_t x) {
 
 As you can see, this definitely doesn't do identity on the integers :smile:
 
+Another example would be the
+[`HashMap::hash()`](https://github.com/openjdk/jdk/blob/dc256fbc6490f8163adb286dbb7380c10e5e1e06/src/java.base/share/classes/java/util/HashMap.java#L320-L339)
+function in Java:
+
+```java
+/**
+ * Computes key.hashCode() and spreads (XORs) higher bits of hash
+ * to lower. Because the table uses power-of-two masking, sets of
+ * hashes that vary only in bits above the current mask will
+ * always collide. (Among known examples are sets of Float keys
+ * holding consecutive whole numbers in small tables.) So we
+ * apply a transform that spreads the impact of higher bits
+ * downward. There is a tradeoff between speed, utility, and
+ * quality of bit-spreading. Because many common sets of hashes
+ * are already reasonably distributed (so don't benefit from
+ * spreading), and because we use trees to handle large sets of
+ * collisions in bins, we just XOR some shifted bits in the
+ * cheapest possible way to reduce systematic lossage, as well as
+ * to incorporate impact of the highest bits that would otherwise
+ * never be used in index calculations because of table bounds.
+ */
+static final int hash(Object key) {
+    int h;
+    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
+}
+```
+
+You can notice that they try to include the upper bits of the hash by using
+`XOR`; this would render our attack from the previous part helpless.
+
 ## Combining both
 
 Can we make it better? Of course! Use multiple mitigations at the same time. In
diff --git a/algorithms/12-hash-tables/2023-11-28-breaking/index.md b/algorithms/12-hash-tables/2023-11-28-breaking/index.md
index d6b2811..2bf11d8 100644
--- a/algorithms/12-hash-tables/2023-11-28-breaking/index.md
+++ b/algorithms/12-hash-tables/2023-11-28-breaking/index.md
@@ -19,20 +19,20 @@ issues from occurring.
 
 Hash tables are very commonly used to represent sets or dictionaries. Even when
 you look up solution to some problem that requires set or dictionary, it is more
-than likely that you'll find something that references usage of hash table. You
-might think it's the only possible option[^1] or it's the best one[^2].
+than likely that you'll find something that references usage of the hash table.
+You might think it's the only possible option[^1], or it's the best one[^2].
 
 One of the reasons to prefer hash tables over any other representation is the
 fact that they are **supposed** to be faster than the alternatives, but the
 truth lies somewhere in between.
 
-One of the other possible implementations of the set is a balanced tree. One of
-the most common implementations rely on the _red-black tree_, but you may see
-also others like the _AVL tree_[^3] or _B-tree_[^4].
+One of the other possible implementations of the set is a balanced tree. Majorly +occurring implementations rely on the _red-black tree_, but you may see also +others like an _AVL tree_[^3] or _B-tree_[^4]. ## Hash Table v. Trees -The interesting part are the differences between those implementations. Why +The most interesting part are the differences between their implementations. Why should you choose hash table, or why should you choose the tree implementation? Let's compare the differences one by one. @@ -43,11 +43,11 @@ rely. We can also consider them as _requirements_ that must be met to be able to use the underlying data structure. Hash table relies on the _hash function_ that is supposed to distribute the keys -in such way that they're evenly spread across the slots in the array where the -keys (or pairs, for dictionary) are stored, but at the same time they're -somewhat unique, so no clustering occurs. +in such way that they're evenly spread across the slots where the keys (or +pairs, for dictionary) are stored, but at the same time they're somewhat unique, +so no clustering occurs. -Trees depend on the _ordering_ of the elements. Trees maintain the elements in +Trees depend on the _ordering_ of the elements. They maintain the elements in a sorted fashion, so for any pair of the elements that are used as keys, you need to be able to decide which one of them is _smaller or equal to_ the other. @@ -60,9 +60,9 @@ If you are familiar with complex numbers, they are a great example of a key that does not have ordering (unless you go element-wise for the sake of storing them in a tree; though the ordering **is not** defined on them). -Hashing them is much easier though, you can just “combine” the hashes of real -and imaginary parts of the complex number to get a hash of the complex number -itself. +Hashing them is much easier though, you can just “combine” the hashes of the +real and imaginary parts of the complex number to get a hash of the complex +number itself. ::: @@ -71,9 +71,9 @@ itself. The most obvious difference is the _core_ of the idea behind these data structures. Hash tables rely on data being stored in one continuous piece of memory (the array) where you can “guess” (by using the hash function) the -location of what you're looking for in constant time and also access that +location of what you're looking for in a constant time and also access that location in the, said, constant time[^5]. In case the hash function is -_not good enough_[^6], you need to go in blind, and if it comes to the worst, +_not good enough_[^6], you need to go in _blind_, and if it comes to the worst, check everything. :::tip tl;dr @@ -86,8 +86,9 @@ check everything. On the other hand, tree implementations rely on the self-balancing trees in which you don't get as _amazing_ results as with the hash table, but they're -consistent. Given that we have self-balancing tree, the height is same for -**every** input. +**consistent**. Given that we have a self-balancing tree, the height of the tree +is same for **every** input and therefore checking for any element can take the +same time even in the worst case. 
:::tip tl;dr @@ -122,16 +123,16 @@ $$ h : T \rightarrow \mathbb{N} $$ -For a language we will just take the definition from C++[^7]: +For a type signature we will just take the declaration from C++[^7]: ```cpp std::size_t operator()(const T& key) const; ``` If you compare with the mathematical definition, it is very similar, except for -the fact that the memory is not unlimited, so _natural number_ turned into an -_unsigned integer type_ (on majority of platforms it will be a 64-bit unsigned -integer). +the fact that the memory is not unlimited, so the _natural number_ turned into +an _unsigned integer type_ (on majority of platforms it will be a 64-bit +unsigned integer). ::: @@ -141,7 +142,6 @@ spot for the insertion). Hash functions are expected to have a so-called _avalanche effect_ which means that the smallest change to the key should result in a massive change of hash. - Avalanche effect technically guarantees that even when your data are clustered together, it should lower the amount of conflicts that can occur. @@ -153,9 +153,9 @@ Try to give an example of a hash function that is not good at all. ### Implementation details -There are different variations of the hash tables. You've most than likely seen +There are different variations of the hash tables. You've more than likely seen an implementation that keeps linked lists for buckets. However there are also -other variations that use probing instead and so on. +other variations that use probing instead. With regards to the implementation details, we need to mention the fact that even with the bounded hash (as we could've seen above), you're not likely to @@ -171,15 +171,15 @@ Let's say we're given `h = 0xDEADBEEF` and we have `l = 65536=2^16` spots in our hash table. What can we do here? Well, we definitely have a bigger hash than spots available, so we need to -“shrink” it somehow. Most common practice is to take the lower bits of the hash -to represent an index in the table: +“shrink” it somehow. The most common practice is to take the lower bits of the +hash to represent an index in the table: ``` h & (l - 1) ``` _Why does this work?_ Firstly we subtract 1 from the length (indices run from -`0..=(l - 1)`, since table is zero-indexed). Therefore if we do _binary and_ on +`⟨0 ; l - 1⟩`, since table is zero-indexed). Therefore if we do _binary and_ on any number, we always get a valid index within the table. Let's find the index for our hash: @@ -191,7 +191,7 @@ for our hash: [^1]: not true [^2]: also not true -[^3]: actually first of its kind (the self-balanced trees) +[^3]: actually the first of its kind (the self-balanced trees) [^4]: Rust chose to implement this instead of the common choice of the red-black or AVL tree; main difference lies in the fact that B-trees are not binary