mirror of
https://github.com/mfocko/blog.git
synced 2024-11-10 00:09:07 +01:00
Publish “Breaking of the Hash Table” (#6)
commit
dadb0d51f7
5 changed files with 871 additions and 0 deletions
231
algorithms/12-hash-tables/2023-11-28-breaking/01-python.md
Normal file
|
@ -0,0 +1,231 @@
---
id: python
slug: /hash-tables/breaking/python
title: Breaking Python
description: |
  Actually getting the worst-case time complexity in Python.
tags:
  - cpp
  - python
  - hash-tables
last_update:
  date: 2023-11-28
---
|
||||||
|
|
||||||
|
## Breaking the Hash Table in Python
|
||||||
|
|
||||||
|
Our language of choice for bringing the worst out of the hash table is _Python_.
|
||||||
|
|
||||||
|
Let's start by talking about the hash function and why we've chosen Python for
this. The hash function for integers in Python is essentially the _identity_
(CPython only reduces integers modulo a large prime, which leaves every value we
use below untouched), so, as you might've guessed, there's no avalanche effect.
Another thing that helps us is the fact that integers in Python are technically
`BigInt`s[^1]. This allows us to put a bit more pressure on the hashing function.
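
A quick way to convince yourself is sketched below. It assumes CPython on
a 64-bit build; the modulus is read from `sys.hash_info` rather than hard-coded,
and the variable names are made up for this illustration.

```py
import sys

# Small integers hash to themselves: no bits get mixed at all.
small = [0, 1, 42, 1 << 32, 10_000_000 << 32]
identity_holds = all(hash(x) == x for x in small)

# CPython reduces integers modulo a prime (2**61 - 1 on 64-bit builds),
# so the identity only breaks for values at or above that modulus.
modulus = sys.hash_info.modulus
wraps_around = hash(modulus) == 0 and hash(modulus + 5) == 5

print(identity_holds, wraps_around)
```

All the numbers used in the benchmark below stay far under that modulus, so for
our purposes the hash really is the identity.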
|
||||||
|
|
||||||
|
From the perspective of the implementation, it is a hash table that uses probing
to resolve conflicts. This also means that it's a contiguous space in memory.
Indexing works like in the bit-masking example from the introduction. When the
hash table reaches a _breaking point_ (a load threshold defined somewhere in the
C code), it reallocates the table and rehashes everything.
|
||||||
|
|
||||||
|
:::tip
|
||||||
|
|
||||||
|
Resizing and rehashing can reduce the conflicts. That comes from the fact that
the position in the table is determined by both the hash and the size of the
table itself.
|
||||||
|
|
||||||
|
:::
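
A minimal sketch of that effect, assuming a power-of-two table where the initial
slot is `hash & (size - 1)` (the helper `slot` is hypothetical, and CPython's
real probing sequence is more involved than this):

```py
def slot(h: int, table_size: int) -> int:
    """Initial slot for hash `h` in a table whose size is a power of two."""
    return h & (table_size - 1)


keys = [8, 16, 24, 32]

# In a table with 8 slots all four keys start in slot 0 ...
before_resize = [slot(k, 8) for k in keys]
# ... but after growing to 64 slots each key gets a slot of its own.
after_resize = [slot(k, 64) for k in keys]

print(before_resize, after_resize)
```

The keys did not change, only the number of bits of the hash that the table
looks at.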
|
||||||
|
|
||||||
|
## Preparing the attack
|
||||||
|
|
||||||
|
Knowing the things above, it is not that hard to come up with a way to cause as
many conflicts as possible. Let's go over it:
|
||||||
|
|
||||||
|
1. We know that integers are hashed to themselves.
|
||||||
|
2. We also know that only the lower bits of that hash are used as the index.
|
||||||
|
3. We also know that there's a rehashing on resize that could possibly fix the
|
||||||
|
conflicts.
|
||||||
|
|
||||||
|
We will test with different sequences:
|
||||||
|
|
||||||
|
1. an ordered one, numbers from 0 to N − 1
2. an ordered one in reversed order, numbers from N − 1 back to 0
3. numbers that are shifted to the left, so they create conflicts until a resize
4. numbers that are shifted to the left, but resizing helps only in the end
5. numbers that are shifted to the left so far that their non-zero bits are not
   taken into account even after the final resize
|
||||||
|
|
||||||
|
For each of these sequences, we will insert 10⁷ elements and look each of them
up 10 times in a row.
|
||||||
|
|
||||||
|
As a base of our benchmark, we will use a `Strategy` class and then for each
|
||||||
|
strategy we will just implement the sequence of numbers that it uses:
|
||||||
|
|
||||||
|
```py
from functools import cached_property
from time import monotonic_ns

LOOPS = 10  # number of lookup rounds, as stated above


class Strategy:
    def __init__(self, data_structure=set):
        self._table = data_structure()

    @cached_property
    def elements(self):
        raise NotImplementedError("Implement for each strategy")

    @property
    def name(self):
        raise NotImplementedError("Implement for each strategy")

    def run(self):
        print(f"\nBenchmarking:\t\t{self.name}")

        # Extract the elements here, so that the evaluation of them does not
        # slow down the relevant part of benchmark
        elements = self.elements

        # Insertion phase
        start = monotonic_ns()
        for x in elements:
            self._table.add(x)
        after_insertion = monotonic_ns()

        print(f"Insertion phase:\t{(after_insertion - start) / 1000000:.2f}ms")

        # Lookup phase
        start = monotonic_ns()
        for _ in range(LOOPS):
            for x in elements:
                assert x in self._table
        after_lookups = monotonic_ns()

        print(f"Lookup phase:\t\t{(after_lookups - start) / 1000000:.2f}ms")
```
|
||||||
|
|
||||||
|
### Sequences
|
||||||
|
|
||||||
|
Let's have a look at how we generate the numbers to be inserted:
|
||||||
|
|
||||||
|
- ordered sequence (ascending)
  ```py
  x for x in range(N_ELEMENTS)
  ```
- ordered sequence (descending)
  ```py
  x for x in reversed(range(N_ELEMENTS))
  ```
- progressive sequence that “heals” on resize
  ```py
  (x << max(5, x.bit_length())) for x in range(N_ELEMENTS)
  ```
- progressive sequence that “heals” in the end
  ```py
  (x << max(5, x.bit_length())) for x in reversed(range(N_ELEMENTS))
  ```
- conflicts everywhere
  ```py
  x << 32 for x in range(N_ELEMENTS)
  ```
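
To get a feeling for why these sequences are nasty, we can count how many
distinct starting slots they occupy. This is only a sketch: it assumes the
starting slot is `hash & (size - 1)`, ignores the probing that follows, uses a
small `N` instead of `N_ELEMENTS`, and `distinct_slots` is a hypothetical helper.

```py
def distinct_slots(keys, table_size):
    """Number of distinct starting slots (hash & mask) the keys fall into."""
    mask = table_size - 1
    return len({hash(k) & mask for k in keys})


N = 1_000  # small stand-in for N_ELEMENTS

ordered = list(range(N))
progressive = [x << max(5, x.bit_length()) for x in range(N)]
shifted = [x << 32 for x in range(N)]

# In a small table of 32 slots both shifted sequences start in a single slot,
# whereas the ordered sequence makes use of all of them.
small_table = [distinct_slots(s, 32) for s in (ordered, progressive, shifted)]

# A bigger table spreads the progressive sequence over more slots (it "heals"
# a bit), but it does not help the `x << 32` sequence at all.
big_table = [distinct_slots(s, 4096) for s in (ordered, progressive, shifted)]

print(small_table, big_table)
```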
|
||||||
|
|
||||||
|
## Results
|
||||||
|
|
||||||
|
Let's have a look at the obtained results after running the code:
|
||||||
|
|
||||||
|
|                  Technique                   | Insertion phase | Lookup phase |
| :------------------------------------------: | --------------: | -----------: |
|         ordered sequence (ascending)         |      `558.60ms` |  `3304.26ms` |
|        ordered sequence (descending)         |      `554.08ms` |  `3365.84ms` |
| progressive sequence that “heals” on resize  |     `3781.30ms` | `28565.71ms` |
| progressive sequence that “heals” in the end |     `3280.38ms` | `26494.61ms` |
|             conflicts everywhere             |     `4027.54ms` | `29132.92ms` |
|
||||||
|
|
||||||
|
You can see a noticeable “jump” in the time after switching to the “progressive”
sequence. The last sequence, which has conflicts all the time, has the worst
time, even though its insertion phase is rather comparable with that of the
first progressive sequence.
|
||||||
|
|
||||||
|
If we were to compare the _always conflicting_ one with the first one, we can
|
||||||
|
see that insertion took over 7× longer and lookups almost 9× longer.
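
The factors come straight from the table; a small sketch of the arithmetic (the
dictionary names are made up for this illustration):

```py
# Timings in milliseconds, copied from the results table.
ordered = {"insertion": 558.60, "lookup": 3304.26}
conflicts = {"insertion": 4027.54, "lookup": 29132.92}

slowdown = {phase: conflicts[phase] / ordered[phase] for phase in ordered}

for phase, factor in slowdown.items():
    print(f"{phase}: {factor:.2f}x slower")
```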
|
||||||
|
|
||||||
|
You can have a look at the code [here](path:///files/algorithms/hash-tables/breaking/benchmark.py).
|
||||||
|
|
||||||
|
## Comparing with the tree
|
||||||
|
|
||||||
|
:::danger
|
||||||
|
|
||||||
|
Source code can be found [here](path:///files/algorithms/hash-tables/breaking/benchmark.cpp).
|
||||||
|
|
||||||
|
_Viewer discretion advised._
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
Python's standard library doesn't provide a tree-based set or map, therefore for
a comparison we will run a similar benchmark in C++. By running the same
sequences on both the hash table and the tree (RB-tree) we obtain the following
results:
|
||||||
|
|
||||||
|
|      Technique       | Insertion (hash) | Lookup (hash) | Insertion (tree) | Lookup (tree) |
| :------------------: | ---------------: | ------------: | ---------------: | ------------: |
| ordered (ascending)  |          `316ms` |       `298ms` |         `2098ms` |      `5914ms` |
| ordered (descending) |          `259ms` |       `315ms` |         `1958ms` |     `14747ms` |
|    progressive a)    |         `1152ms` |      `6021ms` |         `2581ms` |     `16074ms` |
|    progressive b)    |         `1041ms` |      `6096ms` |         `2770ms` |     `15986ms` |
|      conflicts       |          `964ms` |      `1633ms` |         `2559ms` |     `13285ms` |
|
||||||
|
|
||||||
|
:::note
|
||||||
|
|
||||||
|
We can't forget that implementation details are involved. The hash function is
still the identity, to my knowledge.
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
One interesting thing to notice is the fact that the progressive sequences took
the most time in lookups (which is not the same as in Python).
|
||||||
|
|
||||||
|
Now, if we have a look at the tree implementation, we can notice two very
|
||||||
|
distinctive things:
|
||||||
|
|
||||||
|
1. Tree implementations are not affected by the input, therefore (except for the
|
||||||
|
first sequence) we can see **very consistent** times.
|
||||||
|
2. Compared to the hash table the times are much higher and not very ideal.
|
||||||
|
|
||||||
|
The reason for the 2nd point may not be very obvious. From the technical
|
||||||
|
perspective it makes some sense. Let's dive into it!
|
||||||
|
|
||||||
|
If we take a hash table, it is an array in memory, and therefore a contiguous
piece of memory. (For more information I'd suggest looking into the 1st item in
the references section below, by _Bjarne Stroustrup_.)
|
||||||
|
|
||||||
|
On the other hand, if we take a look at the tree, each node holds some
|
||||||
|
attributes and pointers to the left and right descendants of itself. Even if we
|
||||||
|
maintain a reasonable height of the tree (keep the tree balanced), we still need
|
||||||
|
to follow the pointers which point to the nodes _somewhere_ on the heap. When
|
||||||
|
traversing the tree, we get a consistent time complexity, but at the expense of
|
||||||
|
jumping between the nodes on the heap which takes some time.
|
||||||
|
|
||||||
|
:::danger
|
||||||
|
|
||||||
|
This is not supposed to glorify the hash table and persuade people not to use
the tree representations. There are benefits coming from the respective data
structures, even if the time is not the best.
|
||||||
|
|
||||||
|
Overall if we compare the worst-case time complexities of the tree and hash
|
||||||
|
table, tree representation comes off better.
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
:::tip Challenge
|
||||||
|
|
||||||
|
Try to benchmark with a similar approach in Rust. Since Rust uses a different
hash function, it would be best to just override the hash; this way you can also
avoid the hard part of this attack (making up the numbers that will collide).
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
1. Bjarne Stroustrup.
|
||||||
|
[Are lists evil?](https://www.stroustrup.com/bs_faq.html#list)
|
||||||
|
|
||||||
|
[^1]: Arbitrary-sized integers, they can get as big as your memory allows.
|
181
algorithms/12-hash-tables/2023-11-28-breaking/02-mitigations.md
Normal file
|
@ -0,0 +1,181 @@
---
id: mitigations
slug: /hash-tables/breaking/mitigations
title: Possible Mitigations
description: |
  Talking about the ways how to prevent the attacks on the hash table.
tags:
  - cpp
  - python
  - hash-tables
last_update:
  date: 2023-11-28
---
|
||||||
|
|
||||||
|
There are multiple ways the issues created above can be mitigated. Still we can
|
||||||
|
only make it better, we cannot guarantee the ideal time complexity…
|
||||||
|
|
||||||
|
For the sake of simplicity (and referencing an article by _Neal Wu_ on the same
topic; in references below) I will use C++ to describe the mitigations.
|
||||||
|
|
||||||
|
## Random seed
|
||||||
|
|
||||||
|
One way to avoid this kind of attack is to introduce a random seed to the hash.
That way it is not that easy to choose the _nasty_ numbers.
|
||||||
|
|
||||||
|
```cpp
struct custom_hash {
    size_t operator()(uint64_t x) const {
        return x + 7529;
    }
};
```
|
||||||
|
|
||||||
|
As you may have noticed, this is not very helpful, since it just shifts the
issue by some fixed number. A better option is to use a shift that is not known
in advance, for example one taken from a high-precision clock:
|
||||||
|
|
||||||
|
```cpp
struct custom_hash {
    size_t operator()(uint64_t x) const {
        static const uint64_t FIXED_RANDOM =
            chrono::steady_clock::now().time_since_epoch().count();
        return x + FIXED_RANDOM;
    }
};
```
|
||||||
|
|
||||||
|
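In this case the hash uses a high-precision clock to shift the number, so the
shift can no longer be read from the source code. Keep in mind, though, that
adding the same number to every hash moves all of the keys together: keys that
differ by a multiple of the table size still end up in the same spot.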
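
A small sketch of what an additive shift does to colliding keys. It is written
in Python rather than C++ for brevity, assumes that the table reduces the hash
modulo its size, and uses made-up names (`bucket`, `N_BUCKETS`, `SEED`):

```py
def bucket(h: int, n_buckets: int) -> int:
    """Bucket index, assuming the table reduces the hash modulo its size."""
    return h % n_buckets


N_BUCKETS = 1 << 16
SEED = 7529  # stand-in for FIXED_RANDOM; any constant behaves the same way

# Keys crafted to collide: they all differ by a multiple of the table size.
keys = [x * N_BUCKETS for x in range(1_000)]

plain = {bucket(k, N_BUCKETS) for k in keys}
shifted = {bucket(k + SEED, N_BUCKETS) for k in keys}

print(plain, shifted)
```

Both sets contain exactly one bucket: the shift changes _which_ bucket the keys
pile up in, not the fact that they pile up.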
|
||||||
|
|
||||||
|
## Better random seed
|
||||||
|
|
||||||
|
Building on the previous solution, we can do some _bit magic_ instead of the
|
||||||
|
shifting:
|
||||||
|
|
||||||
|
```cpp
struct custom_hash {
    size_t operator()(uint64_t x) const {
        static const uint64_t FIXED_RANDOM =
            chrono::steady_clock::now().time_since_epoch().count();
        x ^= FIXED_RANDOM;
        return x ^ (x >> 16);
    }
};
```
|
||||||
|
|
||||||
|
This not only shifts the number, it also manipulates the underlying bits of the
|
||||||
|
hash. In this case we're also applying the `XOR` operation.
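
The effect of the fold can be sketched in Python (an illustration only: `fold`
mirrors the C++ hash above, `SEED` is an arbitrary stand-in for `FIXED_RANDOM`,
and the 256-slot table is an assumption):

```py
MASK64 = (1 << 64) - 1  # emulate uint64_t arithmetic


def fold(x: int, seed: int) -> int:
    """Python rendition of the C++ hash above: XOR the seed in, then fold."""
    x = (x ^ seed) & MASK64
    return x ^ (x >> 16)


SEED = 0x1234_5678_9ABC_DEF0  # hypothetical stand-in for FIXED_RANDOM
TABLE_MASK = (1 << 8) - 1  # a table with 256 slots

keys = [x << 16 for x in range(8)]  # identical lower 16 bits

slots_identity = {k & TABLE_MASK for k in keys}
slots_folded = {fold(k, SEED) & TABLE_MASK for k in keys}

print(len(slots_identity), len(slots_folded))
```

With the identity all eight keys share one slot, while the fold pulls bits 16
and above down into the index and separates them. Note that a single 16-bit
fold only reaches bits 16 to 31, so keys shifted by 32 bits would still share
their lowest 16 bits.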
|
||||||
|
|
||||||
|
## Adjusting the hash function
|
||||||
|
|
||||||
|
Another option is to switch up the hash function.
|
||||||
|
|
||||||
|
For example Rust uses [_SipHash_](https://en.wikipedia.org/wiki/SipHash) by
|
||||||
|
default.
|
||||||
|
|
||||||
|
On the other hand, you can usually specify your own hash function. Here we will
follow the article by _Neal_, which uses the so-called _`splitmix64`_.
|
||||||
|
|
||||||
|
```cpp
static uint64_t splitmix64(uint64_t x) {
    // http://xorshift.di.unimi.it/splitmix64.c
    x += 0x9e3779b97f4a7c15;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
    x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
    return x ^ (x >> 31);
}
```
|
||||||
|
|
||||||
|
As you can see, this definitely doesn't do identity on the integers :smile:
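
To see what it does to the keys from our attack, here is a Python port of the
function above. It is a sketch under two assumptions: the 64-bit wrap-around is
emulated by masking, and the table is assumed to have 2¹⁶ slots indexed by the
lowest bits.

```py
MASK64 = (1 << 64) - 1  # Python integers are unbounded; emulate uint64_t


def splitmix64(x: int) -> int:
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


TABLE_MASK = (1 << 16) - 1
keys = [x << 32 for x in range(1_000)]  # the "conflicts everywhere" sequence

slots_identity = {k & TABLE_MASK for k in keys}
slots_mixed = {splitmix64(k) & TABLE_MASK for k in keys}

print(len(slots_identity), len(slots_mixed))
```

With the identity all the keys start in the same slot; after mixing they are
scattered over almost as many slots as there are keys.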
|
||||||
|
|
||||||
|
Another example would be
|
||||||
|
[`HashMap::hash()`](https://github.com/openjdk/jdk/blob/dc256fbc6490f8163adb286dbb7380c10e5e1e06/src/java.base/share/classes/java/util/HashMap.java#L320-L339)
|
||||||
|
function in Java:
|
||||||
|
|
||||||
|
```java
/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower. Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.) So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```
|
||||||
|
|
||||||
|
You can notice that they try to include the upper bits of the hash by using
`XOR`. This would render our attack from the previous part ineffective.
|
||||||
|
|
||||||
|
## Combining both
|
||||||
|
|
||||||
|
Can we make it better? Of course! Use multiple mitigations at the same time. In
|
||||||
|
our case, we will both inject the random value **and** use the _`splitmix64`_:
|
||||||
|
|
||||||
|
```cpp
struct custom_hash {
    static uint64_t splitmix64(uint64_t x) {
        // http://xorshift.di.unimi.it/splitmix64.c
        x += 0x9e3779b97f4a7c15;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
        x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
        return x ^ (x >> 31);
    }

    size_t operator()(uint64_t x) const {
        static const uint64_t FIXED_RANDOM =
            chrono::steady_clock::now().time_since_epoch().count();
        return splitmix64(x + FIXED_RANDOM);
    }
};
```
|
||||||
|
|
||||||
|
## Fallback for extreme cases
|
||||||
|
|
||||||
|
As we have mentioned above, Python resolves the conflicts by probing (it looks
|
||||||
|
for empty space somewhere else in the table, but it's deterministic about it, so
|
||||||
|
it's not “_oops, this is full, let's go one-by-one and find some spot_”). In the
|
||||||
|
case of C++ and Java, they resolve the conflicts by linked lists, as is the
|
||||||
|
usual text-book depiction of the hash table.
|
||||||
|
|
||||||
|
However Java does something more intelligent. Once you go over the threshold of
|
||||||
|
conflicts in one spot, it converts the linked list to an RB-tree that is sorted
|
||||||
|
by the hash and key respectively.
|
||||||
|
|
||||||
|
:::tip
|
||||||
|
|
||||||
|
You may wonder what sense it makes to define an ordering on the tree by the
hash if we're dealing with conflicts. Well, there are fewer buckets than the
range of the hash, so if we take the lower bits, we can have a conflict even
though the hashes are not the same.
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
You might have noticed that if we get a **really bad** hashing function, this is
|
||||||
|
not very helpful. It is not, **but** it can help in other cases.
|
||||||
|
|
||||||
|
:::danger
|
||||||
|
|
||||||
|
As the ordering on the keys of the hash table is not required and may not be
|
||||||
|
implemented, the tree may be ordered by just the hash.
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## References
|
||||||
|
|
||||||
|
1. Neal Wu.
|
||||||
|
[Blowing up `unordered_map`, and how to stop getting hacked on it](https://codeforces.com/blog/entry/62393).
|
208
algorithms/12-hash-tables/2023-11-28-breaking/index.md
Normal file
|
@ -0,0 +1,208 @@
---
id: breaking
slug: /hash-tables/breaking
title: Breaking Hash Table
description: |
  How to get the linear time complexity in a hash table.
tags:
  - cpp
  - python
  - hash-tables
last_update:
  date: 2023-11-28
---
|
||||||
|
|
||||||
|
We will try to break a hash table and discuss possible ways to prevent such
issues from occurring.
|
||||||
|
|
||||||
|
## Introduction
|
||||||
|
|
||||||
|
Hash tables are very commonly used to represent sets or dictionaries. Even when
you look up a solution to some problem that requires a set or a dictionary, it
is more than likely that you'll find something that references the usage of
a hash table. You might think it's the only possible option[^1], or it's the
best one[^2].
|
||||||
|
|
||||||
|
One of the reasons to prefer hash tables over any other representation is the
|
||||||
|
fact that they are **supposed** to be faster than the alternatives, but the
|
||||||
|
truth lies somewhere in between.
|
||||||
|
|
||||||
|
One of the other possible implementations of the set is a balanced tree. Most
implementations rely on the _red-black tree_, but you may also see others like
an _AVL tree_[^3] or _B-tree_[^4].
|
||||||
|
|
||||||
|
## Hash Table v. Trees
|
||||||
|
|
||||||
|
The most interesting part lies in the differences between their implementations.
Why should you choose the hash table, or why should you choose the tree
implementation? Let's compare the differences one by one.
|
||||||
|
|
||||||
|
### Requirements
|
||||||
|
|
||||||
|
We will start with the fundamentals on which the underlying data structures
|
||||||
|
rely. We can also consider them as _requirements_ that must be met to be able to
|
||||||
|
use the underlying data structure.
|
||||||
|
|
||||||
|
Hash table relies on the _hash function_ that is supposed to distribute the keys
|
||||||
|
in such way that they're evenly spread across the slots where the keys (or
|
||||||
|
pairs, for dictionary) are stored, but at the same time they're somewhat unique,
|
||||||
|
so no clustering occurs.
|
||||||
|
|
||||||
|
Trees depend on the _ordering_ of the elements. They maintain the elements in
|
||||||
|
a sorted fashion, so for any pair of the elements that are used as keys, you
|
||||||
|
need to be able to decide which one of them is _smaller or equal to_ the other.
|
||||||
|
|
||||||
|
A hash function can be easily created by using the bits that _uniquely_ identify
an element. On the other hand, an ordering may not be as easy to define.
|
||||||
|
|
||||||
|
:::tip Example
|
||||||
|
|
||||||
|
If you are familiar with complex numbers, they are a great example of a key that
|
||||||
|
does not have ordering (unless you go element-wise for the sake of storing them
|
||||||
|
in a tree; though the ordering **is not** defined on them).
|
||||||
|
|
||||||
|
Hashing them is much easier though, you can just “combine” the hashes of the
|
||||||
|
real and imaginary parts of the complex number to get a hash of the complex
|
||||||
|
number itself.
|
||||||
|
|
||||||
|
:::
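
In Python this is easy to observe. The sketch below only illustrates the point:
`combined_hash` is a hypothetical helper and not the way CPython itself hashes
complex numbers.

```py
z, w = complex(1, 2), complex(2, 1)

# Complex numbers are hashable, so they work as set elements or dict keys.
seen = {z, w}


# "Combining" the hashes of both parts, here by hashing a tuple of them.
def combined_hash(c: complex) -> int:
    return hash((c.real, c.imag))


# There is no ordering though: comparing them raises a TypeError.
try:
    z < w
    orderable = True
except TypeError:
    orderable = False

print(z in seen, orderable)
```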
|
||||||
|
|
||||||
|
### Underlying data structure
|
||||||
|
|
||||||
|
The most obvious difference is the _core_ of the idea behind these data
structures. Hash tables rely on data being stored in one contiguous piece of
memory (the array) where you can “guess” (by using the hash function) the
location of what you're looking for in constant time and also access that
location in constant time[^5]. In case the hash function is
_not good enough_[^6], you need to go in _blind_, and if it comes to the worst,
check everything.
|
||||||
|
|
||||||
|
:::tip tl;dr
|
||||||
|
|
||||||
|
- I know where I should look
- I can look there instantaneously
- If my guesses are very wrong, I might need to check everything
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
On the other hand, tree implementations rely on self-balancing trees, in which
you don't get as _amazing_ results as with the hash table, but they're
**consistent**. Given that we have a self-balancing tree, the height of the tree
stays logarithmic in the number of elements for **every** input and therefore
checking for any element takes about the same time even in the worst case.
|
||||||
|
|
||||||
|
:::tip tl;dr
|
||||||
|
|
||||||
|
- I don't know where to look
|
||||||
|
- I know how to get there
|
||||||
|
- Wherever I look, it takes me about the same time
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
Let's compare side by side:
|
||||||
|
|
||||||
|
| time complexity |       hash table       |         tree          |
| --------------: | :--------------------: | :-------------------: |
|        expected |        constant        | depends on the height |
|      worst-case | gotta check everything | depends on the height |
|
||||||
|
|
||||||
|
## Major Factors of Hash Tables
|
||||||
|
|
||||||
|
Let's have a look at the major factors that affect the efficiency and
|
||||||
|
functioning of a hash table. We have already mentioned the hash function that
|
||||||
|
plays a crucial role, but there are also different ways how you can implement
|
||||||
|
a hash table, so we will have a look at those too.
|
||||||
|
|
||||||
|
### Hash functions
|
||||||
|
|
||||||
|
:::info
|
||||||
|
|
||||||
|
We will start with a mathematical definition of the hash function and its type
signature in a well-known language:
|
||||||
|
|
||||||
|
$$
h : T \rightarrow \mathbb{N}
$$
|
||||||
|
|
||||||
|
For a type signature we will just take the declaration from C++[^7]:
|
||||||
|
|
||||||
|
```cpp
|
||||||
|
std::size_t operator()(const T& key) const;
|
||||||
|
```
|
||||||
|
|
||||||
|
If you compare with the mathematical definition, it is very similar, except for
|
||||||
|
the fact that the memory is not unlimited, so the _natural number_ turned into
|
||||||
|
an _unsigned integer type_ (on majority of platforms it will be a 64-bit
|
||||||
|
unsigned integer).
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
As we have already touched on above, the hash function gives “a guess” where to
look for the key (either when doing a lookup, or when guessing a suitable spot
for an insertion).
|
||||||
|
|
||||||
|
Hash functions are expected to have a so-called _avalanche effect_, which means
that the smallest change to the key should result in a massive change of the
hash. The avalanche effect helps to keep the number of conflicts low even when
your data are clustered together.
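
A rough way to observe the avalanche effect is to count how many bits of the
hash change when a single bit of the key changes. In the sketch below SHA-256
is used purely as a stand-in for a well-mixing hash (it is not what hash tables
use in practice), and both helpers are made up for this illustration:

```py
import hashlib


def bit_difference(a: int, b: int) -> int:
    """Number of bit positions in which `a` and `b` differ."""
    return bin(a ^ b).count("1")


def sha256_as_int(x: int) -> int:
    """Hash an integer with SHA-256 and read the digest as one big integer."""
    digest = hashlib.sha256(x.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


x, y = 1024, 1025  # neighbouring keys that differ in a single bit

# Identity "hash": the hashes differ in exactly as many bits as the keys.
identity_diff = bit_difference(x, y)
# Well-mixing hash: roughly half of the 256 output bits are expected to flip.
sha256_diff = bit_difference(sha256_as_int(x), sha256_as_int(y))

print(identity_diff, sha256_diff)
```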
|
||||||
|
|
||||||
|
:::tip Exercise for the reader
|
||||||
|
|
||||||
|
Try to give an example of a hash function that is not good at all.
|
||||||
|
|
||||||
|
:::
|
||||||
|
|
||||||
|
### Implementation details
|
||||||
|
|
||||||
|
There are different variations of the hash tables. You've more than likely seen
|
||||||
|
an implementation that keeps linked lists for buckets. However there are also
|
||||||
|
other variations that use probing instead.
|
||||||
|
|
||||||
|
With regards to the implementation details, we need to mention the fact that
|
||||||
|
even with the bounded hash (as we could've seen above), you're not likely to
|
||||||
|
have all the buckets for different hashes available. Most common approach to
|
||||||
|
this is having a smaller set of buckets and modifying the hash to fit within.
|
||||||
|
|
||||||
|
One of the most common approaches is to keep lengths of the hash tables in the
|
||||||
|
powers of 2 which allows bit-masking to take place.
|
||||||
|
|
||||||
|
:::tip Example
|
||||||
|
|
||||||
|
Let's say we're given `h = 0xDEADBEEF` and we have `l = 65536=2^16` spots in our
|
||||||
|
hash table. What can we do here?
|
||||||
|
|
||||||
|
Well, we definitely have a bigger hash than spots available, so we need to
|
||||||
|
“shrink” it somehow. The most common practice is to take the lower bits of the
|
||||||
|
hash to represent an index in the table:
|
||||||
|
|
||||||
|
```
|
||||||
|
h & (l - 1)
|
||||||
|
```
|
||||||
|
|
||||||
|
_Why does this work?_ Firstly we subtract 1 from the length (indices run from
`⟨0 ; l - 1⟩`, since the table is zero-indexed). Since `l` is a power of two,
`l - 1` has exactly the lower bits set (here the lower 16). Therefore if we do
_binary and_ with any number, we always get a valid index within the table.
Let's find the index for our hash:
|
||||||
|
|
||||||
|
```
|
||||||
|
0xDEADBEEF & 0xFFFF = 0xBEEF
|
||||||
|
```
|
||||||
|
|
||||||
|
:::
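
The same computation as a runnable check, using the values from the example
(the variable names simply mirror `h` and `l` from the text):

```py
h = 0xDEADBEEF
l = 1 << 16  # 65536 slots

index = h & (l - 1)

# For a power-of-two length the mask gives the same result as the modulo.
same_as_modulo = index == h % l

print(hex(index), same_as_modulo)
```

Masking is usually preferred over the modulo because it is a single cheap bit
operation, which is one of the reasons for keeping the lengths in powers of 2.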
|
||||||
|
|
||||||
|
[^1]: not true
|
||||||
|
[^2]: also not true
|
||||||
|
[^3]: actually the first of its kind (the self-balanced trees)
|
||||||
|
[^4]:
|
||||||
|
Rust chose to implement this instead of the common choice of the red-black
|
||||||
|
or AVL tree; main difference lies in the fact that B-trees are not binary
|
||||||
|
trees
|
||||||
|
|
||||||
|
[^5]:
|
||||||
|
This, of course, does not hold true for the educational implementations of
|
||||||
|
the hash tables where conflicts are handled by storing the items in the
|
||||||
|
linked lists. In practice linked lists are not that commonly used for
|
||||||
|
addressing this issue as it has even worse impact on the efficiency of the
|
||||||
|
data structure.
|
||||||
|
|
||||||
|
[^6]: My guess is not very good, or it's really bad…
|
||||||
|
[^7]: https://en.cppreference.com/w/cpp/utility/hash
|
133
static/files/algorithms/hash-tables/breaking/benchmark.cpp
Normal file
|
@ -0,0 +1,133 @@
|
||||||
|
#include <bit>
|
||||||
|
#include <cassert>
|
||||||
|
#include <chrono>
|
||||||
|
#include <cstdint>
|
||||||
|
#include <functional>
|
||||||
|
#include <iostream>
|
||||||
|
#include <ranges>
|
||||||
|
#include <set>
|
||||||
|
#include <string>
|
||||||
|
#include <unordered_set>
|
||||||
|
|
||||||
|
using elem_t = std::uint64_t;
|
||||||
|
|
||||||
|
const elem_t N_ELEMENTS = 10000000;
|
||||||
|
#define LOOPS 10
|
||||||
|
|
||||||
|
template <typename T> struct strategy {
|
||||||
|
virtual std::string name() const = 0;
|
||||||
|
virtual T elements() = 0;
|
||||||
|
|
||||||
|
template <typename C> void run(C &&s) {
|
||||||
|
using namespace std;
|
||||||
|
|
||||||
|
cout << "\nBenchmarking:\t\t" << name() << '\n';
|
||||||
|
|
||||||
|
auto start = chrono::steady_clock::now();
|
||||||
|
for (auto x : elements()) {
|
||||||
|
s.insert(x);
|
||||||
|
}
|
||||||
|
auto after_insertion = chrono::steady_clock::now();
|
||||||
|
|
||||||
|
auto insertion_time =
|
||||||
|
chrono::duration_cast<chrono::milliseconds>(after_insertion - start);
|
||||||
|
cout << "Insertion phase:\t" << insertion_time << "\n";
|
||||||
|
|
||||||
|
start = chrono::steady_clock::now();
|
||||||
|
for (int i = 0; i < LOOPS; ++i) {
|
||||||
|
for (auto x : elements()) {
|
||||||
|
assert(s.contains(x));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
auto after_lookups = chrono::steady_clock::now();
|
||||||
|
|
||||||
|
auto lookup_time =
|
||||||
|
chrono::duration_cast<chrono::milliseconds>(after_lookups - start);
|
||||||
|
cout << "Lookup phase:\t\t" << lookup_time << "\n";
|
||||||
|
}
|
||||||
|
|
||||||
|
  virtual ~strategy() = default;
};

using iota_t =
    decltype(std::views::iota(static_cast<elem_t>(0), static_cast<elem_t>(0)));

struct ascending_ordered_sequence : public strategy<iota_t> {
  std::string name() const override { return "ordered sequence (ascending)"; }
  iota_t elements() override {
    return std::views::iota(static_cast<elem_t>(0), N_ELEMENTS);
  }
};

static elem_t reverse(elem_t x) { return static_cast<elem_t>(N_ELEMENTS) - x; }
using reversed_iota_t =
    decltype(std::views::iota(static_cast<elem_t>(0), static_cast<elem_t>(0)) |
             std::views::transform(reverse));

struct descending_ordered_sequence : public strategy<reversed_iota_t> {
  std::string name() const override { return "ordered sequence (descending)"; }
  reversed_iota_t elements() override {
    return std::views::iota(static_cast<elem_t>(1), N_ELEMENTS + 1) |
           std::views::transform(reverse);
  }
};

static elem_t attack(elem_t x) { return x << (5 + std::bit_width(x)); }
using attacked_iota_t =
    decltype(std::views::iota(static_cast<elem_t>(0), static_cast<elem_t>(0)) |
             std::views::transform(attack));

struct progressive_ascending_attack : public strategy<attacked_iota_t> {
  std::string name() const override {
    return "progressive sequence that self-heals on resize";
  }
  attacked_iota_t elements() override {
    return std::views::iota(static_cast<elem_t>(0), N_ELEMENTS) |
           std::views::transform(attack);
  }
};

using reversed_attacked_iota_t =
    decltype(std::views::iota(static_cast<elem_t>(0), static_cast<elem_t>(0)) |
             std::views::transform(reverse) | std::views::transform(attack));

struct progressive_descending_attack
    : public strategy<reversed_attacked_iota_t> {
  std::string name() const override {
    return "progressive sequence that self-heals in the end";
  }
  reversed_attacked_iota_t elements() override {
    return std::views::iota(static_cast<elem_t>(1), N_ELEMENTS + 1) |
           std::views::transform(reverse) | std::views::transform(attack);
  }
};

static elem_t shift(elem_t x) { return x << 32; }
using shifted_iota_t =
    decltype(std::views::iota(static_cast<elem_t>(0), static_cast<elem_t>(0)) |
             std::views::transform(shift));

struct hard_attack : public strategy<shifted_iota_t> {
  std::string name() const override { return "carefully chosen numbers"; }
  shifted_iota_t elements() override {
    return std::views::iota(static_cast<elem_t>(0), N_ELEMENTS) |
           std::views::transform(shift);
  }
};

template <typename C> void run_all(const std::string &note) {
  std::cout << "\n«" << note << "»\n";

  ascending_ordered_sequence{}.run(C{});
  descending_ordered_sequence{}.run(C{});
  progressive_ascending_attack{}.run(C{});
  progressive_descending_attack{}.run(C{});
  hard_attack{}.run(C{});
}

int main() {
  run_all<std::unordered_set<elem_t>>("hash table");
  run_all<std::set<elem_t>>("red-black tree");

  return 0;
}
118
static/files/algorithms/hash-tables/breaking/benchmark.py
Normal file
@ -0,0 +1,118 @@
#!/usr/bin/env python3

from functools import cached_property
from time import monotonic_ns

N_ELEMENTS = 10_000_000
LOOPS = 10


class Strategy:
    def __init__(self, data_structure=set):
        self._table = data_structure()

    @cached_property
    def elements(self):
        raise NotImplementedError("Implement for each strategy")

    @property
    def name(self):
        raise NotImplementedError("Implement for each strategy")

    def run(self):
        print(f"\nBenchmarking:\t\t{self.name}")

        # Extract the elements here, so that the evaluation of them does not
        # slow down the relevant part of benchmark
        elements = self.elements

        # Insertion phase
        start = monotonic_ns()
        for x in elements:
            self._table.add(x)
        after_insertion = monotonic_ns()

        print(f"Insertion phase:\t{(after_insertion - start) / 1000000:.2f}ms")

        # Lookup phase
        start = monotonic_ns()
        for _ in range(LOOPS):
            for x in elements:
                assert x in self._table
        after_lookups = monotonic_ns()

        print(f"Lookup phase:\t\t{(after_lookups - start) / 1000000:.2f}ms")


class AscendingOrderedSequence(Strategy):
    @property
    def name(self):
        return "ordered sequence (ascending)"

    @cached_property
    def elements(self):
        return [x for x in range(N_ELEMENTS)]


class DescendingOrderedSequence(Strategy):
    @property
    def name(self):
        return "ordered sequence (descending)"

    @cached_property
    def elements(self):
        return [x for x in reversed(range(N_ELEMENTS))]


class ProgressiveAttack(Strategy):
    @staticmethod
    def _break(n):
        return n << max(5, n.bit_length())


class ProgressiveAscendingAttack(ProgressiveAttack):
    @property
    def name(self):
        return "progressive sequence that self-heals on resize"

    @cached_property
    def elements(self):
        return [self._break(x) for x in range(N_ELEMENTS)]


class ProgressiveDescendingAttack(ProgressiveAttack):
    @property
    def name(self):
        return "progressive sequence that self-heals in the end"

    @cached_property
    def elements(self):
        return [self._break(x) for x in reversed(range(N_ELEMENTS))]


class HardAttack(Strategy):
    @property
    def name(self):
        return "carefully chosen numbers"

    @cached_property
    def elements(self):
        return [x << 32 for x in range(N_ELEMENTS)]


STRATEGIES = [
    AscendingOrderedSequence,
    DescendingOrderedSequence,
    ProgressiveAscendingAttack,
    ProgressiveDescendingAttack,
    HardAttack,
]


def main():
    for strategy in STRATEGIES:
        strategy().run()


if __name__ == "__main__":
    main()