fix(algorithms): reword some parts of breaking the hash table

Signed-off-by: Matej Focko <mfocko@redhat.com>
This commit is contained in:
Matej Focko 2023-11-28 19:32:38 +01:00
parent f2810936fa
commit 6117d79454
Signed by: mfocko
GPG key ID: 7C47D46246790496
2 changed files with 58 additions and 28 deletions


@ -89,6 +89,36 @@ static uint64_t splitmix64(uint64_t x) {
As you can see, this definitely doesn't do identity on the integers :smile:
Another example would be
[`HashMap::hash()`](https://github.com/openjdk/jdk/blob/dc256fbc6490f8163adb286dbb7380c10e5e1e06/src/java.base/share/classes/java/util/HashMap.java#L320-L339)
function in Java:
```java
/**
* Computes key.hashCode() and spreads (XORs) higher bits of hash
* to lower. Because the table uses power-of-two masking, sets of
* hashes that vary only in bits above the current mask will
* always collide. (Among known examples are sets of Float keys
* holding consecutive whole numbers in small tables.) So we
* apply a transform that spreads the impact of higher bits
* downward. There is a tradeoff between speed, utility, and
* quality of bit-spreading. Because many common sets of hashes
* are already reasonably distributed (so don't benefit from
* spreading), and because we use trees to handle large sets of
* collisions in bins, we just XOR some shifted bits in the
* cheapest possible way to reduce systematic lossage, as well as
* to incorporate impact of the highest bits that would otherwise
* never be used in index calculations because of table bounds.
*/
static final int hash(Object key) {
int h;
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```
You can notice that they try to include the upper bits of the hash by using
`XOR`; this would render our attack from the previous part helpless.
## Combining both
Can we make it better? Of course! Use multiple mitigations at the same time. In


@ -19,20 +19,20 @@ issues to occur.
Hash tables are very commonly used to represent sets or dictionaries. Even
when you look up a solution to some problem that requires a set or dictionary,
it is more than likely that you'll find something that references usage of a
hash table. You might think it's the only possible option[^1], or it's the
best one[^2].
One of the reasons to prefer hash tables over any other representation is the
fact that they are **supposed** to be faster than the alternatives, but the
truth lies somewhere in between.
One of the other possible implementations of the set is a balanced tree. The
most common implementations rely on the _red-black tree_, but you may also see
others like the _AVL tree_[^3] or _B-tree_[^4].
## Hash Table v. Trees
The most interesting part is the differences between these implementations.
Why should you choose a hash table, and why should you choose the tree
implementation? Let's compare the differences one by one.
@ -43,11 +43,11 @@ rely. We can also consider them as _requirements_ that must be met to be able to
use the underlying data structure.
A hash table relies on a _hash function_ that is supposed to distribute the
keys in such a way that they're evenly spread across the slots where the keys
(or pairs, for a dictionary) are stored, but at the same time they're somewhat
unique, so no clustering occurs.
Trees depend on the _ordering_ of the elements. They maintain the elements in
a sorted fashion, so for any pair of elements that are used as keys, you need
to be able to decide which one of them is _smaller or equal to_ the other.
@ -60,9 +60,9 @@ If you are familiar with complex numbers, they are a great example of a key that
does not have ordering (unless you go element-wise for the sake of storing them
in a tree; though the ordering **is not** defined on them).
Hashing them is much easier though: you can just “combine” the hashes of the
real and imaginary parts of the complex number to get a hash of the complex
number itself.
:::
@ -71,9 +71,9 @@ itself.
The most obvious difference is the _core_ of the idea behind these data
structures. Hash tables rely on data being stored in one continuous piece of
memory (the array) where you can “guess” (by using the hash function) the
location of what you're looking for in constant time and also access that
location in said constant time[^5]. In case the hash function is
_not good enough_[^6], you need to go in _blind_, and if it comes to the worst,
check everything.
:::tip tl;dr
@ -86,8 +86,9 @@ check everything.
On the other hand, tree implementations rely on self-balancing trees, in
which you don't get as _amazing_ results as with the hash table, but they're
**consistent**. Given that we have a self-balancing tree, the height of the
tree is the same for **every** input, and therefore looking up any element
takes the same time even in the worst case.
:::tip tl;dr
@ -122,16 +123,16 @@ $$
h : T \rightarrow \mathbb{N}
$$
For a type signature we will just take the declaration from C++[^7]:
```cpp
std::size_t operator()(const T& key) const;
```
If you compare it with the mathematical definition, it is very similar,
except for the fact that the memory is not unlimited, so the _natural number_
turned into an _unsigned integer type_ (on the majority of platforms it will
be a 64-bit unsigned integer).
:::
@ -141,7 +142,6 @@ spot for the insertion).
Hash functions are expected to have a so-called _avalanche effect_, which
means that the smallest change to the key should result in a massive change
of the hash. The avalanche effect technically guarantees that even when your
data are clustered together, it lowers the number of conflicts that can occur.
@ -153,9 +153,9 @@ Try to give an example of a hash function that is not good at all.
### Implementation details
There are different variations of hash tables. You've more than likely seen
an implementation that keeps linked lists for buckets. However, there are also
other variations that use probing instead.
With regards to the implementation details, we need to mention the fact that
even with the bounded hash (as we could've seen above), you're not likely to
@ -171,15 +171,15 @@ Let's say we're given `h = 0xDEADBEEF` and we have `l = 65536=2^16` spots in our
hash table. What can we do here?
Well, we definitely have a bigger hash than spots available, so we need to
“shrink” it somehow. The most common practice is to take the lower bits of the
hash to represent an index in the table:
```
h & (l - 1)
```
_Why does this work?_ Firstly we subtract 1 from the length (indices run from
`⟨0 ; l - 1⟩`, since the table is zero-indexed). Because `l` is a power of
two, `l - 1` has all of its lower bits set, therefore if we do _binary and_
with it on any number, we always get a valid index within the table. Let's
find the index for our hash:
@ -191,7 +191,7 @@ for our hash:
[^1]: not true
[^2]: also not true
[^3]: actually the first of its kind (the self-balanced trees)
[^4]:
Rust chose to implement this instead of the common choice of the red-black
or AVL tree; the main difference lies in the fact that B-trees are not binary