---
id: breaking
slug: /hash-tables/breaking
title: Breaking hash table
description: |
  How to get the linear time complexity in a hash table.
tags:
  - cpp
  - python
  - hash-tables
last_update:
  date: 2023-11-28
---

We will try to break a hash table and discuss possible ways to prevent such
issues from occurring.

## Introduction

Hash tables are very commonly used to represent sets or dictionaries. Even when
you look up a solution to some problem that requires a set or a dictionary, it
is more than likely that you'll find something that references the usage of a
hash table. You might think it's the only possible option[^1], or that it's the
best one[^2].

One of the reasons to prefer hash tables over any other representation is the
fact that they are **supposed** to be faster than the alternatives, but the
truth lies somewhere in between.

Another possible implementation of the set is a balanced tree. The most common
implementations rely on the _red-black tree_, but you may also see others, like
an _AVL tree_[^3] or a _B-tree_[^4].

## Hash Table v. Trees

The most interesting part is the differences between their implementations. Why
should you choose a hash table, or why should you choose the tree
implementation? Let's compare the differences one by one.

### Requirements

We will start with the fundamentals on which the underlying data structures
rely. We can also consider them as _requirements_ that must be met to be able to
use the underlying data structure.

A hash table relies on a _hash function_ that is supposed to distribute the keys
in such a way that they're evenly spread across the slots where the keys (or
pairs, for a dictionary) are stored, while the hashes stay somewhat unique, so
that no clustering occurs.

Trees depend on the _ordering_ of the elements. They maintain the elements in
a sorted fashion, so for any pair of the elements that are used as keys, you
need to be able to decide which one of them is _smaller than or equal to_ the
other.

A hash function can be easily created by using the bits that _uniquely_ identify
an element. On the other hand, an ordering may not be as easy to define.

:::tip Example

If you are familiar with complex numbers, they are a great example of a key that
does not have an ordering (unless you compare them element-wise for the sake of
storing them in a tree; though an ordering **is not** defined on them).

Hashing them is much easier though: you can just “combine” the hashes of the
real and imaginary parts of the complex number to get a hash of the complex
number itself.

:::

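The example above can be sketched in Python roughly as follows. The helper
names `hash_complex` and `has_ordering` are made up for this illustration, and
hashing a tuple of the two parts is just one convenient way to “combine” them.
Python's built-in `complex` is already hashable on its own.

```python
def hash_complex(z: complex) -> int:
    """Combine the hashes of the real and imaginary parts into one hash.

    Hashing the tuple of both parts is an illustrative choice, not the only
    (or necessarily the best) way to combine them.
    """
    return hash((z.real, z.imag))


def has_ordering(a: complex, b: complex) -> bool:
    """Return whether `a < b` can be evaluated at all."""
    try:
        a < b
    except TypeError:
        return False
    return True


z = complex(3, 4)
# Equal keys must produce equal hashes.
print(hash_complex(z) == hash_complex(complex(3, 4)))
# Python refuses to order complex numbers, so this reports no ordering.
print(has_ordering(z, complex(4, 3)))
```
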
### Underlying data structure

The most obvious difference is the _core_ of the idea behind these data
structures. Hash tables rely on data being stored in one contiguous piece of
memory (the array) where you can “guess” (by using the hash function) the
location of what you're looking for in constant time and also access that
location in, as said, constant time[^5]. In case the hash function is
_not good enough_[^6], you need to go in _blind_, and if it comes to the worst,
check everything.

:::tip tl;dr

- I know where I should look
- I can look there instantaneously
- If my guesses are very wrong, I might need to check everything

:::

On the other hand, tree implementations rely on self-balancing trees, in which
you don't get as _amazing_ results as with the hash table, but they're
**consistent**. Given that we have a self-balancing tree, the height of the tree
stays logarithmic in the number of elements for **every** input, and therefore
checking for any element takes about the same time, even in the worst case.

:::tip tl;dr

- I don't know where to look
- I know how to get there
- Wherever I look, it takes me about the same time

:::

Let's compare side by side:

| time complexity |       hash table       |         tree          |
| --------------: | :--------------------: | :-------------------: |
|        expected |        constant        | depends on the height |
|      worst-case | gotta check everything | depends on the height |

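To make the “gotta check everything” row more tangible, here is a minimal
sketch of a toy hash set with linked buckets that counts how many stored keys a
lookup has to inspect. The class `ChainedSet` and its methods are hypothetical
names for this illustration. The sketch also assumes CPython, where small
non-negative integers hash to themselves, so keys that are multiples of the
bucket count all collide.

```python
class ChainedSet:
    """Toy hash set with a fixed number of buckets and no resizing."""

    def __init__(self, bucket_count: int = 64) -> None:
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        # Reduce the hash to a bucket index.
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key) -> None:
        bucket = self._bucket(key)
        if key not in bucket:
            bucket.append(key)

    def comparisons(self, key) -> int:
        """Count the stored keys inspected while looking up `key`."""
        inspected = 0
        for stored in self._bucket(key):
            inspected += 1
            if stored == key:
                break
        return inspected


spread = ChainedSet()
clustered = ChainedSet()
for i in range(64):
    spread.add(i)          # every key lands in its own bucket
    clustered.add(64 * i)  # every key lands in bucket 0

print(spread.comparisons(63))          # 1
print(clustered.comparisons(64 * 63))  # 64
```

Both sets hold 64 keys, yet the lookup in the second one degrades to a linear
scan of a single bucket.
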
## Major Factors of Hash Tables

Let's have a look at the major factors that affect the efficiency and
functioning of a hash table. We have already mentioned the hash function that
plays a crucial role, but there are also different ways to implement a hash
table, so we will have a look at those too.

### Hash functions

:::info

We will start with a definition of a hash function, both as a mathematical
definition and as a type signature in a well-known language:

$$
h : T \rightarrow \mathbb{N}
$$

For a type signature we will just take the declaration from C++[^7]:

```cpp
std::size_t operator()(const T& key) const;
```

If you compare it with the mathematical definition, it is very similar, except
for the fact that the memory is not unlimited, so the _natural number_ turned
into an _unsigned integer type_ (on the majority of platforms it will be a
64-bit unsigned integer).

:::

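For comparison, Python exposes the same idea through the built-in `hash()`.
The following small sketch only checks that the returned values are bounded.
Note that, unlike `std::size_t`, Python's hashes are signed. The sample keys are
arbitrary.

```python
import sys

# Width of a hash value in bits; 64 on most current platforms.
width = sys.hash_info.width

for key in ("hash tables", 3.14, (1, 2, 3)):
    value = hash(key)
    # Python's hashes are signed, so they fit into a signed `width`-bit integer.
    assert -(2 ** (width - 1)) <= value < 2 ** (width - 1)
    print(f"{key!r}: {value}")
```
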
As we have already touched on above, a hash function gives “a guess” of where
to look for the key (either when doing a lookup, or when looking for a suitable
spot for an insertion).

Hash functions are expected to have a so-called _avalanche effect_, which means
that the smallest change to the key should result in a massive change of the
hash. The avalanche effect helps to ensure that even when your data are
clustered together, the number of conflicts that can occur stays low.

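The avalanche effect can be observed by counting how many bits of the hash
change when a single bit of the key changes. The sketch below is only an
illustration: `mixed_hash` is a hypothetical helper built on BLAKE2b from
Python's `hashlib`, chosen because it mixes its input well, not because hash
tables typically use it. It also assumes CPython, where small integers hash to
themselves.

```python
import hashlib


def mixed_hash(key: int) -> int:
    """64-bit hash of a non-negative integer key, built from BLAKE2b."""
    digest = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def differing_bits(a: int, b: int) -> int:
    """Count the bit positions in which `a` and `b` differ."""
    return bin(a ^ b).count("1")


key, neighbour = 42, 43  # the two keys differ in a single bit

# CPython hashes small integers to themselves, so there is no avalanche here.
print(differing_bits(hash(key), hash(neighbour)))
# A mixing hash is expected to flip roughly half of the 64 output bits.
print(differing_bits(mixed_hash(key), mixed_hash(neighbour)))
```
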
:::tip Exercise for the reader

Try to give an example of a hash function that is not good at all.

:::

### Implementation details

There are different variations of the hash tables. You've more than likely seen
an implementation that keeps linked lists for buckets. However, there are also
other variations that use probing instead.

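As a rough illustration of the probing variant, here is a minimal sketch of a
set that uses _linear probing_, which is just one of several probing
strategies. The class `ProbingSet` is a hypothetical name, and the sketch
assumes a fixed capacity, no deletion, and that `None` is never stored as a
key.

```python
class ProbingSet:
    """Toy open-addressing set with linear probing (fixed capacity)."""

    def __init__(self, capacity: int = 8) -> None:
        # `None` marks an empty slot, so `None` itself cannot be a key.
        self.slots = [None] * capacity

    def _probe(self, key):
        """Yield slot indices from the key's home slot, wrapping around."""
        start = hash(key) % len(self.slots)
        for offset in range(len(self.slots)):
            yield (start + offset) % len(self.slots)

    def add(self, key) -> None:
        for index in self._probe(key):
            if self.slots[index] is None or self.slots[index] == key:
                self.slots[index] = key
                return
        raise RuntimeError("table is full")

    def __contains__(self, key) -> bool:
        for index in self._probe(key):
            if self.slots[index] is None:
                return False  # an empty slot ends the search
            if self.slots[index] == key:
                return True
        return False


numbers = ProbingSet()
for n in (1, 9, 17):  # all three share the home slot 1 in CPython
    numbers.add(n)

print(numbers.slots[1:4])  # the colliding keys occupy consecutive slots
print(9 in numbers, 25 in numbers)
```
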
With regard to the implementation details, we need to mention the fact that
even with the bounded hash (as we have seen above), you're not likely to have
a bucket available for every possible hash. The most common approach to this is
having a smaller set of buckets and reducing the hash to fit within.

One of the most common approaches is to keep the lengths of the hash tables at
powers of 2, which allows bit-masking to take place.

:::tip Example

Let's say we're given `h = 0xDEADBEEF` and we have `l = 65536 = 2^16` spots in
our hash table. What can we do here?

Well, we definitely have a bigger hash than spots available, so we need to
“shrink” it somehow. The most common practice is to take the lower bits of the
hash to represent an index in the table:

```
h & (l - 1)
```

_Why does this work?_ Firstly, we subtract 1 from the length (indices run over
`⟨0 ; l - 1⟩`, since the table is zero-indexed). As `l` is a power of 2,
`l - 1` has exactly the lower 16 bits set. Therefore, if we do a _binary and_
of it with any number, we always get a valid index within the table. Let's find
the index for our hash:

```
0xDEADBEEF & 0xFFFF = 0xBEEF
```

:::

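The same computation as a short Python sketch; `bucket_index` is a hypothetical
helper name, and the check for a power of 2 is added here only to make the
assumption explicit.

```python
def bucket_index(h: int, length: int) -> int:
    """Map a hash to an index of a table whose length is a power of 2."""
    # A power of 2 has exactly one bit set, so `length & (length - 1)` is zero.
    assert length > 0 and length & (length - 1) == 0, "length must be a power of 2"
    return h & (length - 1)


h = 0xDEADBEEF
length = 2**16

print(hex(bucket_index(h, length)))        # 0xbeef
# For a power-of-2 length, masking gives the same result as the modulo here.
print(bucket_index(h, length) == h % length)
```
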
[^1]: not true
[^2]: also not true
[^3]: actually the first of its kind (the self-balancing trees)
[^4]:
    Rust chose to implement this instead of the common choice of the red-black
    or AVL tree; the main difference lies in the fact that B-trees are not
    binary trees

[^5]:
    This, of course, does not hold true for the educational implementations of
    the hash tables where conflicts are handled by storing the items in
    linked lists. In practice, linked lists are not that commonly used for
    addressing this issue, as they have an even worse impact on the efficiency
    of the data structure.

[^6]: My guess is not very good, or it's really bad…
[^7]: https://en.cppreference.com/w/cpp/utility/hash