fix(algorithms): reword some parts of breaking the hash table

Signed-off-by: Matej Focko <mfocko@redhat.com>
Author: Matej Focko
Date: 2023-11-28 19:32:38 +01:00
commit 6117d79454 (parent f2810936fa)
Signed by: mfocko (GPG key ID: 7C47D46246790496)
2 changed files with 58 additions and 28 deletions


@@ -89,6 +89,36 @@ static uint64_t splitmix64(uint64_t x) {
As you can see, this definitely doesn't do identity on the integers :smile:
+Another example would be the
+[`HashMap::hash()`](https://github.com/openjdk/jdk/blob/dc256fbc6490f8163adb286dbb7380c10e5e1e06/src/java.base/share/classes/java/util/HashMap.java#L320-L339)
+function in Java:
+```java
+/**
+ * Computes key.hashCode() and spreads (XORs) higher bits of hash
+ * to lower. Because the table uses power-of-two masking, sets of
+ * hashes that vary only in bits above the current mask will
+ * always collide. (Among known examples are sets of Float keys
+ * holding consecutive whole numbers in small tables.) So we
+ * apply a transform that spreads the impact of higher bits
+ * downward. There is a tradeoff between speed, utility, and
+ * quality of bit-spreading. Because many common sets of hashes
+ * are already reasonably distributed (so don't benefit from
+ * spreading), and because we use trees to handle large sets of
+ * collisions in bins, we just XOR some shifted bits in the
+ * cheapest possible way to reduce systematic lossage, as well as
+ * to incorporate impact of the highest bits that would otherwise
+ * never be used in index calculations because of table bounds.
+ */
+static final int hash(Object key) {
+    int h;
+    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
+}
+```
+You can notice that they try to include the upper bits of the hash by using
+`XOR`; this would render our attack from the previous part helpless.
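To make the spreading concrete, here is a minimal C++ sketch (our own
illustration, not JDK code): two hashes that differ only in bits above the
table mask collide without spreading, but end up in different buckets once the
upper bits are XORed downward.
```cpp
#include <cstdint>
#include <cstdio>

// Spread the upper bits into the lower ones, mirroring the idea behind
// Java's HashMap::hash(); this helper is our own simplified version.
static uint32_t spread(uint32_t h) { return h ^ (h >> 16); }

int main() {
    const uint32_t mask = (1u << 4) - 1; // tiny table with 16 buckets

    // Two hashes that differ only in bits above the mask…
    const uint32_t a = 0x00010003u;
    const uint32_t b = 0x00020003u;

    // …collide without spreading (3 vs 3), but not with it (2 vs 1).
    std::printf("plain:  %u vs %u\n", a & mask, b & mask);
    std::printf("spread: %u vs %u\n", spread(a) & mask, spread(b) & mask);
}
```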
## Combining both
Can we make it better? Of course! Use multiple mitigations at the same time. In


@@ -19,20 +19,20 @@ issues to occur.
Hash tables are very commonly used to represent sets or dictionaries. Even when
you look up a solution to some problem that requires a set or a dictionary, it is more
-than likely that you'll find something that references usage of hash table. You
-might think it's the only possible option[^1] or it's the best one[^2].
+than likely that you'll find something that references usage of the hash table.
+You might think it's the only possible option[^1], or it's the best one[^2].
One of the reasons to prefer hash tables over any other representation is the
fact that they are **supposed** to be faster than the alternatives, but the
truth lies somewhere in between.
-One of the other possible implementations of the set is a balanced tree. One of
-the most common implementations rely on the _red-black tree_, but you may see
-also others like the _AVL tree_[^3] or _B-tree_[^4].
+One of the other possible implementations of the set is a balanced tree. The
+most common implementations rely on the _red-black tree_, but you may also see
+others like an _AVL tree_[^3] or a _B-tree_[^4].
## Hash Table v. Trees
-The interesting part are the differences between those implementations. Why
+The most interesting part is the differences between their implementations. Why
should you choose a hash table, or why should you choose the tree implementation?
Let's compare the differences one by one.
@@ -43,11 +43,11 @@ rely. We can also consider them as _requirements_ that must be met to be able to
use the underlying data structure.
A hash table relies on the _hash function_ that is supposed to distribute the keys
-in such way that they're evenly spread across the slots in the array where the
-keys (or pairs, for dictionary) are stored, but at the same time they're
-somewhat unique, so no clustering occurs.
+in such a way that they're evenly spread across the slots where the keys (or
+pairs, for dictionary) are stored, but at the same time they're somewhat unique,
+so no clustering occurs.
-Trees depend on the _ordering_ of the elements. Trees maintain the elements in
+Trees depend on the _ordering_ of the elements. They maintain the elements in
a sorted fashion, so for any pair of the elements that are used as keys, you
need to be able to decide which one of them is _smaller or equal to_ the other.
@@ -60,9 +60,9 @@ If you are familiar with complex numbers, they are a great example of a key that
does not have ordering (unless you go element-wise for the sake of storing them
in a tree; though the ordering **is not** defined on them).
-Hashing them is much easier though, you can just “combine” the hashes of real
-and imaginary parts of the complex number to get a hash of the complex number
-itself.
+Hashing them is much easier, though: you can just “combine” the hashes of the
+real and imaginary parts of the complex number to get a hash of the complex
+number itself.
:::
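For illustration, a possible way to do that combining in C++ (our own sketch;
the formula in `hash_combine` is the well-known `boost::hash_combine` recipe,
nothing mandated by the standard):
```cpp
#include <complex>
#include <cstddef>
#include <functional>

// Combine two hashes into one; the constant and shifts follow the
// widely used boost::hash_combine recipe.
static std::size_t hash_combine(std::size_t seed, std::size_t value) {
    return seed ^ (value + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Hash a complex number by hashing the real and imaginary parts
// separately and combining the results.
struct complex_hash {
    std::size_t operator()(const std::complex<double>& z) const {
        std::hash<double> h;
        return hash_combine(h(z.real()), h(z.imag()));
    }
};
```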
@@ -71,9 +71,9 @@ itself.
The most obvious difference is the _core_ of the idea behind these data
structures. Hash tables rely on data being stored in one continuous piece of
memory (the array) where you can “guess” (by using the hash function) the
location of what you're looking for in constant time and also access that
location in said constant time[^5]. In case the hash function is
-_not good enough_[^6], you need to go in blind, and if it comes to the worst,
+_not good enough_[^6], you need to go in _blind_, and if it comes to the worst,
check everything.
:::tip tl;dr
@@ -86,8 +86,9 @@ check everything.
On the other hand, tree implementations rely on self-balancing trees in
which you don't get as _amazing_ results as with the hash table, but they're
-consistent. Given that we have self-balancing tree, the height is same for
-**every** input.
+**consistent**. Given that we have a self-balancing tree, the height of the tree
+is the same for **every** input, and therefore checking for any element takes
+the same time even in the worst case.
:::tip tl;dr
@@ -122,16 +123,16 @@ $$
h : T \rightarrow \mathbb{N}
$$
-For a language we will just take the definition from C++[^7]:
+For a type signature we will just take the declaration from C++[^7]:
```cpp
std::size_t operator()(const T& key) const;
```
If you compare with the mathematical definition, it is very similar, except for
-the fact that the memory is not unlimited, so _natural number_ turned into an
-_unsigned integer type_ (on majority of platforms it will be a 64-bit unsigned
-integer).
+the fact that the memory is not unlimited, so the _natural number_ turned into
+an _unsigned integer type_ (on the majority of platforms it will be a 64-bit
+unsigned integer).
:::
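As an example of a functor with exactly this signature, a hasher for a
hypothetical `Point` type (entirely our own illustration) might look like this:
```cpp
#include <cstddef>
#include <functional>

struct Point {
    int x;
    int y;
};

// Matches the declaration above: maps a Point to a std::size_t.
struct PointHash {
    std::size_t operator()(const Point& key) const {
        std::hash<int> h;
        // Shift one part before XORing so that (x, y) and (y, x)
        // usually hash differently.
        return h(key.x) ^ (h(key.y) << 1);
    }
};
```
Such a functor can then be plugged in as the `Hash` template parameter of
`std::unordered_set` or `std::unordered_map`.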
@@ -141,7 +142,6 @@ spot for the insertion).
Hash functions are expected to have a so-called _avalanche effect_ which means
that the smallest change to the key should result in a massive change of hash.
The avalanche effect technically guarantees that even when your data are
clustered together, fewer conflicts should occur.
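To see the avalanche effect in action, here is a small sketch (ours) that uses
`splitmix64` in its commonly published form: flip a single bit of the key and
count how many of the 64 output bits change. With a good avalanche effect,
roughly half of them should flip.
```cpp
#include <cstdint>
#include <cstdio>

// splitmix64 in its commonly published form (constants by Sebastiano Vigna).
static uint64_t splitmix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Count the set bits, i.e., how many bits differ after the XOR below.
static int popcount64(uint64_t x) {
    int count = 0;
    for (; x != 0; x &= x - 1) ++count;
    return count;
}

int main() {
    const uint64_t key = 42;
    const uint64_t flipped = key ^ 1; // flip just the lowest bit
    const int changed = popcount64(splitmix64(key) ^ splitmix64(flipped));
    std::printf("%d of 64 output bits changed\n", changed);
}
```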
@@ -153,9 +153,9 @@ Try to give an example of a hash function that is not good at all.
### Implementation details
-There are different variations of the hash tables. You've most than likely seen
+There are different variations of the hash tables. You've more than likely seen
an implementation that keeps linked lists for buckets. However, there are also
-other variations that use probing instead and so on.
+other variations that use probing instead.
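For contrast with the linked-list buckets, here is a toy sketch (our own) of a
lookup in a table that resolves collisions with _linear probing_, i.e., by
scanning the slots that follow the one the hash points to:
```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Open addressing with linear probing: on a collision, simply try the
// next slot; an empty slot means the key cannot be present.
bool contains(const std::vector<std::optional<int>>& table,
              std::size_t hash, int key) {
    const std::size_t size = table.size();
    for (std::size_t i = 0; i < size; ++i) {
        const std::optional<int>& slot = table[(hash + i) % size];
        if (!slot.has_value()) {
            return false; // reached an empty slot, stop probing
        }
        if (*slot == key) {
            return true;
        }
    }
    return false; // the table is full and the key is not in it
}
```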
With regard to the implementation details, we need to mention the fact that
even with the bounded hash (as we could've seen above), you're not likely to
@@ -171,15 +171,15 @@ Let's say we're given `h = 0xDEADBEEF` and we have `l = 65536=2^16` spots in our
hash table. What can we do here?
Well, we definitely have a bigger hash than spots available, so we need to
-“shrink” it somehow. Most common practice is to take the lower bits of the hash
-to represent an index in the table:
+“shrink” it somehow. The most common practice is to take the lower bits of the
+hash to represent an index in the table:
```
h & (l - 1)
```
_Why does this work?_ Firstly, we subtract 1 from the length (indices run from
-`0..=(l - 1)`, since table is zero-indexed). Therefore if we do _binary and_ on
+`⟨0 ; l - 1⟩`, since the table is zero-indexed). Therefore, if we do _binary and_ on
any number, we always get a valid index within the table. Let's find the index
for our hash:
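Carrying the computation out ourselves (with `l - 1 = 0xFFFF`, the _binary
and_ keeps just the lowest 16 bits of the hash):
```
  0xDEADBEEF & 0xFFFF
= 0xBEEF
= 48879
```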
@@ -191,7 +191,7 @@ for our hash:
[^1]: not true
[^2]: also not true
-[^3]: actually first of its kind (the self-balanced trees)
+[^3]: actually the first of its kind (the self-balanced trees)
[^4]:
Rust chose to implement this instead of the common choice of the red-black
or AVL tree; the main difference lies in the fact that B-trees are not binary