mirror of https://github.com/mfocko/blog.git, synced 2024-11-24 22:11:54 +01:00

fix(algorithms): reword some parts of breaking the hash table

Signed-off-by: Matej Focko <mfocko@redhat.com>

parent f2810936fa
commit 6117d79454

2 changed files with 58 additions and 28 deletions
@@ -89,6 +89,36 @@ static uint64_t splitmix64(uint64_t x) {

As you can see, this definitely doesn't do identity on the integers :smile:

Another example would be the
[`HashMap::hash()`](https://github.com/openjdk/jdk/blob/dc256fbc6490f8163adb286dbb7380c10e5e1e06/src/java.base/share/classes/java/util/HashMap.java#L320-L339)
function in Java:

```java
/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower. Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.) So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```

You can notice that they try to include the upper bits of the hash by using
`XOR`; this would render our attack from the previous part helpless.

## Combining both

Can we make it better? Of course! Use multiple mitigations at the same time. In
@@ -19,20 +19,20 @@ issues to occur.

Hash tables are very commonly used to represent sets or dictionaries. Even when
you look up a solution to some problem that requires a set or a dictionary, it
is more than likely that you'll find something that references usage of the
hash table. You might think it's the only possible option[^1], or that it's the
best one[^2].

One of the reasons to prefer hash tables over any other representation is the
fact that they are **supposed** to be faster than the alternatives, but the
truth lies somewhere in between.

One of the other possible implementations of the set is a balanced tree. The
most commonly occurring implementations rely on the _red-black tree_, but you
may also see others like an _AVL tree_[^3] or a _B-tree_[^4].

## Hash Table v. Trees

The most interesting part is the difference between these implementations. Why
should you choose a hash table, or why should you choose the tree
implementation? Let's compare the differences one by one.
@@ -43,11 +43,11 @@ rely. We can also consider them as _requirements_ that must be met to be able to
use the underlying data structure.

A hash table relies on a _hash function_ that is supposed to distribute the
keys in such a way that they're evenly spread across the slots where the keys
(or pairs, for a dictionary) are stored, but at the same time they're somewhat
unique, so no clustering occurs.

Trees depend on the _ordering_ of the elements. They maintain the elements in
a sorted fashion, so for any pair of the elements that are used as keys, you
need to be able to decide which one of them is _smaller or equal to_ the other.
@@ -60,9 +60,9 @@ If you are familiar with complex numbers, they are a great example of a key that
does not have ordering (unless you go element-wise for the sake of storing them
in a tree; though the ordering **is not** defined on them).

Hashing them is much easier though: you can just “combine” the hashes of the
real and imaginary parts of the complex number to get a hash of the complex
number itself.

:::
@@ -71,9 +71,9 @@ itself.

The most obvious difference is the _core_ of the idea behind these data
structures. Hash tables rely on data being stored in one contiguous piece of
memory (the array) where you can “guess” (by using the hash function) the
location of what you're looking for in constant time and also access that
location in that same constant time[^5]. In case the hash function is
_not good enough_[^6], you need to go in _blind_, and if it comes to the worst,
check everything.

:::tip tl;dr
@@ -86,8 +86,9 @@ check everything.

On the other hand, tree implementations rely on the self-balancing trees in
which you don't get as _amazing_ results as with the hash table, but they're
**consistent**. Given that we have a self-balancing tree, the height of the
tree is the same for **every** input, and therefore checking for any element
takes the same time even in the worst case.

:::tip tl;dr
@@ -122,16 +123,16 @@ $$
h : T \rightarrow \mathbb{N}
$$

For a type signature we will just take the declaration from C++[^7]:

```cpp
std::size_t operator()(const T& key) const;
```

If you compare it with the mathematical definition, it is very similar, except
for the fact that the memory is not unlimited, so the _natural number_ turned
into an _unsigned integer type_ (on the majority of platforms it will be
a 64-bit unsigned integer).

:::
@@ -141,7 +142,6 @@ spot for the insertion).

Hash functions are expected to have a so-called _avalanche effect_, which means
that the smallest change to the key should result in a massive change of the
hash.

The avalanche effect technically guarantees that even when your data are
clustered together, the number of conflicts that can occur stays low.
@@ -153,9 +153,9 @@ Try to give an example of a hash function that is not good at all.

### Implementation details

There are different variations of hash tables. You've more than likely seen
an implementation that keeps linked lists for buckets. However, there are also
other variations that use probing instead.

With regards to the implementation details, we need to mention the fact that
even with the bounded hash (as we could've seen above), you're not likely to
@@ -171,15 +171,15 @@ Let's say we're given `h = 0xDEADBEEF` and we have `l = 65536=2^16` spots in our
hash table. What can we do here?

Well, we definitely have a bigger hash than spots available, so we need to
“shrink” it somehow. The most common practice is to take the lower bits of the
hash to represent an index in the table:

```
h & (l - 1)
```

_Why does this work?_ First we subtract 1 from the length (indices run from
`⟨0 ; l - 1⟩`, since the table is zero-indexed). Therefore if we do a _binary
and_ on any number, we always get a valid index within the table. Let's find
the index for our hash:
@@ -191,7 +191,7 @@ for our hash:

[^1]: not true
[^2]: also not true
[^3]: actually the first of its kind (the self-balanced trees)
[^4]:
    Rust chose to implement this instead of the common choice of the red-black
    or AVL tree; the main difference lies in the fact that B-trees are not
    binary