2023-11-16 10:16:13 +01:00
|
|
|
---
|
|
|
|
id: mitigations
|
2023-11-28 19:18:20 +01:00
|
|
|
slug: /hash-tables/breaking/mitigations
|
2023-11-16 10:16:13 +01:00
|
|
|
title: Possible Mitigations
|
|
|
|
description: |
|
|
|
|
Talking about the ways how to prevent the attacks on the hash table.
|
|
|
|
tags:
|
|
|
|
- cpp
|
|
|
|
- python
|
|
|
|
- hash-tables
|
|
|
|
last_update:
|
|
|
|
date: 2023-11-28
|
|
|
|
---
|
|
|
|
|
|
|
|
There are multiple ways the issues created above can be mitigated. Still we can
|
|
|
|
only make it better, we cannot guarantee the ideal time complexity…
|
|
|
|
|
|
|
|
For the sake of simplicity (and referencing an article by _Neal Wu_ on the same
|
|
|
|
topic; in references below) I will use the C++ to describe the mitigations.
|
|
|
|
|
|
|
|
## Random seed
|
|
|
|
|
|
|
|
One of the options how to avoid this kind of an attack is to introduce a random
|
|
|
|
seed to the hash. That way it is not that easy to choose the _nasty_ numbers.
|
|
|
|
|
|
|
|
```cpp
|
|
|
|
struct custom_hash {
|
|
|
|
size_t operator()(uint64_t x) const {
|
|
|
|
return x + 7529;
|
|
|
|
}
|
|
|
|
};
|
|
|
|
```
|
|
|
|
|
|
|
|
As you may have noticed, this is not very helpful, since it just shifts the
|
|
|
|
issue by some number. Better option is to use a shift from random number
|
|
|
|
generator:
|
|
|
|
|
|
|
|
```cpp
|
|
|
|
struct custom_hash {
|
|
|
|
size_t operator()(uint64_t x) const {
|
|
|
|
static const uint64_t FIXED_RANDOM =
|
|
|
|
chrono::steady_clock::now().time_since_epoch().count();
|
|
|
|
return x + FIXED_RANDOM;
|
|
|
|
}
|
|
|
|
};
|
|
|
|
```
|
|
|
|
|
|
|
|
In this case the hash is using a high-precision clock to shift the number, which
|
|
|
|
is much harder to break.
|
|
|
|
|
|
|
|
## Better random seed
|
|
|
|
|
|
|
|
Building on the previous solution, we can do some _bit magic_ instead of the
|
|
|
|
shifting:
|
|
|
|
|
|
|
|
```cpp
|
|
|
|
struct custom_hash {
|
|
|
|
size_t operator()(uint64_t x) const {
|
|
|
|
static const uint64_t FIXED_RANDOM =
|
|
|
|
chrono::steady_clock::now().time_since_epoch().count();
|
|
|
|
x ^= FIXED_RANDOM;
|
|
|
|
return x ^ (x >> 16);
|
|
|
|
}
|
|
|
|
};
|
|
|
|
```
|
|
|
|
|
|
|
|
This not only shifts the number, it also manipulates the underlying bits of the
|
|
|
|
hash. In this case we're also applying the `XOR` operation.
|
|
|
|
|
|
|
|
## Adjusting the hash function
|
|
|
|
|
|
|
|
Another option is to switch up the hash function.
|
|
|
|
|
|
|
|
For example Rust uses [_SipHash_](https://en.wikipedia.org/wiki/SipHash) by
|
|
|
|
default.
|
|
|
|
|
|
|
|
On the other hand, you can usually specify your own hash function, here we will
|
|
|
|
follow the article by _Neal_ that uses so-called _`splitmix64`_.
|
|
|
|
|
|
|
|
```cpp
|
|
|
|
static uint64_t splitmix64(uint64_t x) {
|
|
|
|
// http://xorshift.di.unimi.it/splitmix64.c
|
|
|
|
x += 0x9e3779b97f4a7c15;
|
|
|
|
x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
|
|
|
|
x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
|
|
|
|
return x ^ (x >> 31);
|
|
|
|
}
|
|
|
|
```
|
|
|
|
|
|
|
|
As you can see, this definitely doesn't do identity on the integers :smile:
|
|
|
|
|
2023-11-28 19:32:38 +01:00
|
|
|
Another example would be
|
|
|
|
[`HashMap::hash()`](https://github.com/openjdk/jdk/blob/dc256fbc6490f8163adb286dbb7380c10e5e1e06/src/java.base/share/classes/java/util/HashMap.java#L320-L339)
|
|
|
|
function in Java:
|
|
|
|
|
|
|
|
```java
|
|
|
|
/**
|
|
|
|
* Computes key.hashCode() and spreads (XORs) higher bits of hash
|
|
|
|
* to lower. Because the table uses power-of-two masking, sets of
|
|
|
|
* hashes that vary only in bits above the current mask will
|
|
|
|
* always collide. (Among known examples are sets of Float keys
|
|
|
|
* holding consecutive whole numbers in small tables.) So we
|
|
|
|
* apply a transform that spreads the impact of higher bits
|
|
|
|
* downward. There is a tradeoff between speed, utility, and
|
|
|
|
* quality of bit-spreading. Because many common sets of hashes
|
|
|
|
* are already reasonably distributed (so don't benefit from
|
|
|
|
* spreading), and because we use trees to handle large sets of
|
|
|
|
* collisions in bins, we just XOR some shifted bits in the
|
|
|
|
* cheapest possible way to reduce systematic lossage, as well as
|
|
|
|
* to incorporate impact of the highest bits that would otherwise
|
|
|
|
* never be used in index calculations because of table bounds.
|
|
|
|
*/
|
|
|
|
static final int hash(Object key) {
|
|
|
|
int h;
|
|
|
|
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
|
|
|
|
}
|
|
|
|
```
|
|
|
|
|
|
|
|
You can notice that they try to include the upper bits of the hash by using
|
|
|
|
`XOR`, this would render our attack in the previous part helpless.
|
|
|
|
|
2023-11-16 10:16:13 +01:00
|
|
|
## Combining both
|
|
|
|
|
|
|
|
Can we make it better? Of course! Use multiple mitigations at the same time. In
|
|
|
|
our case, we will both inject the random value **and** use the _`splitmix64`_:
|
|
|
|
|
|
|
|
```cpp
|
|
|
|
struct custom_hash {
|
|
|
|
static uint64_t splitmix64(uint64_t x) {
|
|
|
|
// http://xorshift.di.unimi.it/splitmix64.c
|
|
|
|
x += 0x9e3779b97f4a7c15;
|
|
|
|
x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
|
|
|
|
x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
|
|
|
|
return x ^ (x >> 31);
|
|
|
|
}
|
|
|
|
|
|
|
|
size_t operator()(uint64_t x) const {
|
|
|
|
static const uint64_t FIXED_RANDOM =
|
|
|
|
chrono::steady_clock::now().time_since_epoch().count();
|
|
|
|
return splitmix64(x + FIXED_RANDOM);
|
|
|
|
}
|
|
|
|
};
|
|
|
|
```
|
|
|
|
|
|
|
|
## Fallback for extreme cases
|
|
|
|
|
|
|
|
As we have mentioned above, Python resolves the conflicts by probing (it looks
|
|
|
|
for empty space somewhere else in the table, but it's deterministic about it, so
|
|
|
|
it's not “_oops, this is full, let's go one-by-one and find some spot_”). In the
|
|
|
|
case of C++ and Java, they resolve the conflicts by linked lists, as is the
|
|
|
|
usual text-book depiction of the hash table.
|
|
|
|
|
|
|
|
However Java does something more intelligent. Once you go over the threshold of
|
|
|
|
conflicts in one spot, it converts the linked list to an RB-tree that is sorted
|
|
|
|
by the hash and key respectively.
|
|
|
|
|
|
|
|
:::tip
|
|
|
|
|
|
|
|
You may wonder what sense does it make to define an ordering on the tree by the
|
|
|
|
hash, if we're dealing with conflicts. Well, there are less buckets than the
|
|
|
|
range of the hash, so if we take lower bits, we can have a conflict even though
|
|
|
|
the hashes are not the same.
|
|
|
|
|
|
|
|
:::
|
|
|
|
|
|
|
|
You might have noticed that if we get a **really bad** hashing function, this is
|
|
|
|
not very helpful. It is not, **but** it can help in other cases.
|
|
|
|
|
|
|
|
:::danger
|
|
|
|
|
|
|
|
As the ordering on the keys of the hash table is not required and may not be
|
|
|
|
implemented, the tree may be ordered by just the hash.
|
|
|
|
|
|
|
|
:::
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
## References
|
|
|
|
|
|
|
|
1. Neal Wu.
|
|
|
|
[Blowing up `unordered_map`, and how to stop getting hacked on it](https://codeforces.com/blog/entry/62393).
|