diff --git a/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md b/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
index e724a71..c515320 100644
--- a/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
+++ b/algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
@@ -89,6 +89,36 @@ static uint64_t splitmix64(uint64_t x) {

 As you can see, this definitely doesn't do identity on the integers :smile:

+Another example would be the
+[`HashMap::hash()`](https://github.com/openjdk/jdk/blob/dc256fbc6490f8163adb286dbb7380c10e5e1e06/src/java.base/share/classes/java/util/HashMap.java#L320-L339)
+function in Java:
+
+```java
+/**
+ * Computes key.hashCode() and spreads (XORs) higher bits of hash
+ * to lower. Because the table uses power-of-two masking, sets of
+ * hashes that vary only in bits above the current mask will
+ * always collide. (Among known examples are sets of Float keys
+ * holding consecutive whole numbers in small tables.) So we
+ * apply a transform that spreads the impact of higher bits
+ * downward. There is a tradeoff between speed, utility, and
+ * quality of bit-spreading. Because many common sets of hashes
+ * are already reasonably distributed (so don't benefit from
+ * spreading), and because we use trees to handle large sets of
+ * collisions in bins, we just XOR some shifted bits in the
+ * cheapest possible way to reduce systematic lossage, as well as
+ * to incorporate impact of the highest bits that would otherwise
+ * never be used in index calculations because of table bounds.
+ */
+static final int hash(Object key) {
+    int h;
+    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
+}
+```
+
+Notice that they fold the upper bits of the hash into the lower ones with
+`XOR`; this renders our attack from the previous part helpless.
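+
+To see the effect, here is a small sketch of the same bit-spreading trick (our
+own illustration in C++, not code from the JDK): two hashes that differ only in
+their upper bits collide under plain masking, but stop colliding once the
+shifted upper bits are XORed in.
+
+```cpp
+#include <cstdint>
+#include <cstdio>
+
+// Same trick as HashMap.hash(): XOR the upper half of the hash into the
+// lower half, so bits above the table mask still influence the index.
+static uint32_t spread(uint32_t h) { return h ^ (h >> 16); }
+
+int main() {
+    const uint32_t mask = 16 - 1;  // toy table with 16 slots
+    // two hashes that differ only in the upper 16 bits
+    const uint32_t a = 0x00010003, b = 0x00020003;
+    std::printf("%u %u\n", a & mask, b & mask);                  // 3 3 (collision)
+    std::printf("%u %u\n", spread(a) & mask, spread(b) & mask);  // 2 1 (no collision)
+}
+```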
+
 ## Combining both

 Can we make it better? Of course! Use multiple mitigations at the same time. In
diff --git a/algorithms/12-hash-tables/2023-11-28-breaking/index.md b/algorithms/12-hash-tables/2023-11-28-breaking/index.md
index d6b2811..2bf11d8 100644
--- a/algorithms/12-hash-tables/2023-11-28-breaking/index.md
+++ b/algorithms/12-hash-tables/2023-11-28-breaking/index.md
@@ -19,20 +19,20 @@
 issues to occur.

 Hash tables are very commonly used to represent sets or dictionaries. Even when
 you look up solution to some problem that requires set or dictionary, it is more
-than likely that you'll find something that references usage of hash table. You
-might think it's the only possible option[^1] or it's the best one[^2].
+than likely that you'll find something that references usage of a hash table.
+You might think it's the only possible option[^1], or that it's the best one[^2].

 One of the reasons to prefer hash tables over any other representation is the
 fact that they are **supposed** to be faster than the alternatives, but the
 truth lies somewhere in between.

-One of the other possible implementations of the set is a balanced tree. One of
-the most common implementations rely on the _red-black tree_, but you may see
-also others like the _AVL tree_[^3] or _B-tree_[^4].
+One of the other possible implementations of a set is a balanced tree. The most
+common implementations rely on the _red-black tree_, but you may also see
+others like the _AVL tree_[^3] or the _B-tree_[^4].

 ## Hash Table v. Trees

-The interesting part are the differences between those implementations. Why
+The most interesting part is the differences between their implementations. Why
 should you choose hash table, or why should you choose the tree implementation?
 Let's compare the differences one by one.
@@ -43,11 +43,11 @@ rely.
 We can also consider them as _requirements_ that must be met to be able to
 use the underlying data structure.

 Hash table relies on the _hash function_ that is supposed to distribute the keys
-in such way that they're evenly spread across the slots in the array where the
-keys (or pairs, for dictionary) are stored, but at the same time they're
-somewhat unique, so no clustering occurs.
+in such a way that they're evenly spread across the slots where the keys (or
+pairs, for a dictionary) are stored, but at the same time they're somewhat
+unique, so no clustering occurs.

-Trees depend on the _ordering_ of the elements. Trees maintain the elements in
+Trees depend on the _ordering_ of the elements. They maintain the elements in
 a sorted fashion, so for any pair of the elements that are used as keys, you
 need to be able to decide which one of them is _smaller or equal to_ the other.
@@ -60,9 +60,9 @@
 If you are familiar with complex numbers, they are a great example of a key
 that does not have ordering (unless you go element-wise for the sake of storing
 them in a tree; though the ordering **is not** defined on them).

-Hashing them is much easier though, you can just “combine” the hashes of real
-and imaginary parts of the complex number to get a hash of the complex number
-itself.
+Hashing them is much easier though: you can just “combine” the hashes of the
+real and imaginary parts of the complex number to get a hash of the complex
+number itself.

 :::
@@ -71,9 +71,9 @@ itself.
 The most obvious difference is the _core_ of the idea behind these data
 structures. Hash tables rely on data being stored in one continuous piece of
 memory (the array) where you can “guess” (by using the hash function) the
 location of what you're looking for in constant time and also access that
-location in the, said, constant time[^5]. In case the hash function is
-_not good enough_[^6], you need to go in blind, and if it comes to the worst,
+location in said constant time[^5]. In case the hash function is
+_not good enough_[^6], you need to go in _blind_, and if it comes to the worst,
 check everything.

 :::tip tl;dr
@@ -86,8 +86,9 @@ check everything.

 On the other hand, tree implementations rely on the self-balancing trees in
 which you don't get as _amazing_ results as with the hash table, but they're
-consistent. Given that we have self-balancing tree, the height is same for
-**every** input.
+**consistent**. Given that we have a self-balancing tree, the height of the
+tree is the same for **every** input, and therefore looking up any element
+takes the same time even in the worst case.

 :::tip tl;dr
@@ -122,16 +123,16 @@
 $$
 h : T \rightarrow \mathbb{N}
 $$

-For a language we will just take the definition from C++[^7]:
+For a type signature we will just take the declaration from C++[^7]:

 ```cpp
 std::size_t operator()(const T& key) const;
 ```

 If you compare with the mathematical definition, it is very similar, except for
-the fact that the memory is not unlimited, so _natural number_ turned into an
-_unsigned integer type_ (on majority of platforms it will be a 64-bit unsigned
-integer).
+the fact that the memory is not unlimited, so the _natural number_ turned into
+an _unsigned integer type_ (on the majority of platforms it will be a 64-bit
+unsigned integer).
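+
+As a small sketch of a hasher with this exact signature (our own illustration,
+not code from the materials), here is one way the “combine the hashes” idea for
+complex numbers from above could look; the shift-and-XOR combination step is
+just one common choice:
+
+```cpp
+#include <complex>
+#include <cstddef>
+#include <functional>
+
+// Hash a complex number by hashing its real and imaginary parts
+// separately and combining the two hashes into a single one.
+struct complex_hash {
+    std::size_t operator()(const std::complex<double>& key) const {
+        const std::size_t re = std::hash<double>{}(key.real());
+        const std::size_t im = std::hash<double>{}(key.imag());
+        return re ^ (im << 1);  // shift so that (a, b) and (b, a) differ
+    }
+};
+```
+
+Such a hasher can then be plugged into `std::unordered_set` as its `Hash`
+template parameter, while a `std::set` of complex numbers would still be out of
+reach, since it requires the ordering.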

 :::
@@ -141,7 +142,6 @@ spot for the insertion).

 Hash functions are expected to have a so-called _avalanche effect_ which means
 that the smallest change to the key should result in a massive change of hash.
-
 Avalanche effect technically guarantees that even when your data are clustered
 together, it should lower the amount of conflicts that can occur.
@@ -153,9 +153,9 @@ Try to give an example of a hash function that is not good at all.

 ### Implementation details

-There are different variations of the hash tables. You've most than likely seen
+There are different variations of the hash tables. You've more than likely seen
 an implementation that keeps linked lists for buckets. However there are also
-other variations that use probing instead and so on.
+other variations that use probing instead.

 With regards to the implementation details, we need to mention the fact that
 even with the bounded hash (as we could've seen above), you're not likely to
@@ -171,15 +171,15 @@ Let's say we're given `h = 0xDEADBEEF` and we have `l = 65536=2^16` spots in our
 hash table. What can we do here?

 Well, we definitely have a bigger hash than spots available, so we need to
-“shrink” it somehow. Most common practice is to take the lower bits of the hash
-to represent an index in the table:
+“shrink” it somehow. The most common practice is to take the lower bits of the
+hash to represent an index in the table:

 ```
 h & (l - 1)
 ```

 _Why does this work?_ Firstly we subtract 1 from the length (indices run from
-`0..=(l - 1)`, since table is zero-indexed). Therefore if we do _binary and_ on
-any number, we always get a valid index within the table.
+`⟨0 ; l - 1⟩`, since the table is zero-indexed), and since `l` is a power of two,
+`l - 1` has all lower bits set; _binary and_ with it always yields a valid index.

 Let's find the index for our hash:
@@ -191,7 +191,7 @@ for our hash:
 [^1]: not true
 [^2]: also not true
-[^3]: actually first of its kind (the self-balanced trees)
+[^3]: actually the first of its kind (the self-balanced trees)
 [^4]: Rust chose to implement this instead of the common choice of the red-black
 or AVL tree; main difference lies in the fact that B-trees are not binary
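+
+To make this concrete, here is a quick check of the `h & (l - 1)` computation
+from above (our own snippet, not part of the original files):
+
+```cpp
+#include <cstdint>
+#include <cstdio>
+
+int main() {
+    const uint32_t h = 0xDEADBEEF;   // our hash
+    const uint32_t l = 65536;        // 2^16 slots in the table
+    const uint32_t mask = l - 1;     // 0xFFFF, i.e. all lower 16 bits set
+    // masking keeps only the lower 16 bits of the hash
+    std::printf("0x%X\n", h & mask);  // prints 0xBEEF (index 48879)
+}
+```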