fix(intro,rbt,bst): grammarly

Signed-off-by: Matej Focko <mfocko@redhat.com>
Matej Focko 2022-05-18 19:53:49 +02:00
parent 126ece3e84
commit a438df5f8e
Signed by: mfocko
GPG key ID: 7C47D46246790496
3 changed files with 40 additions and 45 deletions


@@ -1,9 +1,9 @@
\chapter*{Introduction}
\addcontentsline{toc}{chapter}{Introduction}
Data structures have become a regular part of the essential toolbox for problem-solving. In many cases, they also help to improve the existing algorithm's performance, e.g. using a priority queue in \textit{Dijkstra's algorithm for the shortest path}. This thesis will mainly discuss the implementation of a set.
Currently, the most commonly used implementations of sets use hash tables, but we will talk about another common alternative, implementation via self-balancing search trees. Compared to a hash table, they provide consistent time complexity, but at the cost of a requirement for ordering on the elements. The most implemented self-balancing binary tree is a \textit{red-black tree}, as described by \textit{Guibas and Sedgewick}~\cite{rbtree}. Among other alternatives, we can find (non-binary) \textit{B-tree}~\cite{btree} and \textit{AVL tree}~\cite{avl}.
This thesis analyses and visualizes the \textit{Weak AVL (WAVL)} tree~\cite{wavl}, which has more relaxed conditions than the AVL tree but still provides better balancing than a red-black tree.


@@ -21,16 +21,15 @@ As we have mentioned at the beginning of \hyperref[chap:rank-balanced-trees]{thi
\textbf{AVL Rule}: Every node is (1, 1) or (1, 2).~\cite{wavl}
In the case of the AVL tree, rank represents height. Here we can notice an ingenious way of using the \textit{(i, j)-node} definition. If we go back to the definition and want to be explicit about the nodes that are allowed with the \textit{AVL Rule}, then we get (1, 1), (1, 2) \textbf{or} (2, 1) nodes. However, it is possible to find implementations of the AVL tree that allow leaning \textbf{to only one side} as opposed to the original requirements given by \textit{Adelson-Velsky and Landis}~\cite{avl}. Forbidding interchangeability of (i, j) with (j, i)-nodes would still yield AVL trees that lean to one side.
The meaning of the \textit{AVL Rule} is quite simple since rank represents the height in that case. We can draw analogies using the notation used for the AVL trees, where we mark nodes with a trit (or a sign) or use a balance factor. We have two cases to discuss:
\begin{itemize}
\item \textbf{(1, 1) node} represents a tree where both of its subtrees have the same height. In this case, we are talking about the nodes with balance factor $0$ (or, in the sign notation, marked with a $0$).
\item \textbf{(1, 2) node} represents a tree where one of its subtrees has a bigger height. In this case, we are talking about the nodes with balance factor $-1$ or $1$ (or, in the sign notation, marked with a $-$ or a $+$).
\end{itemize}
An example of the AVL tree that uses ranks instead of signs or balance factors can be seen in \autoref{fig:ranked:avl}.
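The correspondence between ranks, heights, and balance factors can be sketched in code. The following Python snippet is purely illustrative (the names \texttt{Node}, \texttt{rank\_diffs}, and \texttt{satisfies\_avl\_rule} are ours, not taken from the thesis's pseudocode); it treats the rank of a node as its height and checks the \textit{AVL Rule}:

```python
# Illustrative sketch only: rank-based AVL-rule check, assuming rank == height.
# Convention: missing children (external nodes) have rank -1, leaves rank 0.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def rank(node):
    # the rank of a node equals its height in an AVL tree
    if node is None:
        return -1
    return 1 + max(rank(node.left), rank(node.right))

def rank_diffs(node):
    # the (i, j) of an (i, j)-node: rank differences to its children
    return (rank(node) - rank(node.left), rank(node) - rank(node.right))

def satisfies_avl_rule(node):
    # AVL Rule: every node is a (1, 1)- or (1, 2)-node
    if node is None:
        return True
    if sorted(rank_diffs(node)) not in ([1, 1], [1, 2]):
        return False
    return satisfies_avl_rule(node.left) and satisfies_avl_rule(node.right)

balanced = Node(2, Node(1), Node(3))        # both subtrees of height 0
leaning = Node(3, Node(2, Node(1)), None)   # left subtree two levels deeper

print(rank_diffs(balanced))          # (1, 1)
print(satisfies_avl_rule(balanced))  # True
print(satisfies_avl_rule(leaning))   # False
```

Note that real implementations store the rank in the node and update it during rebalancing instead of recomputing it recursively.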
\begin{figure}
\centering
\begin{tikzpicture}[>=latex',line join=bevel,scale=0.75,]
@@ -75,31 +74,30 @@ Example of the AVL tree that uses ranks instead of signs or balance-factors can
\textbf{Red-Black Rule}: All rank differences are 0 or 1, and no parent of a 0-child is a 0-child.~\cite{wavl}
In the case of red-black trees, rank represents the number of black nodes on a path from the node to a leaf (excluding the node itself). Based on that, we can discuss the \textit{Red-Black Rule} in detail:
\begin{enumerate}
\item \textit{All rank differences are 0 or 1} inductively enforces a monotonically non-decreasing (at most by 1) count of black nodes from the leaves. In detail:
\begin{enumerate}
\item In case the \textbf{current node is black}, the rank difference must be 1, since we have one more black node on the path from the parent to the leaves than from the current node.
\item In case the \textbf{current node is red}, the rank difference must be 0, since, from the parent, the count of black nodes on the path to the leaves has not changed.
\item And finally, all other differences are invalid, since by adding a node to the beginning of a path to the leaf, we can add either a red node (0-child) or a black node (1-child), i.e. the number of black nodes on the path either stays the same or increases by one, which implies the change can be only 0 or 1.
\end{enumerate}
\item \textit{No parent of a 0-child is a 0-child} ensures that there are no two consecutive red nodes, since the 0-child node is equivalent to the red node.
\end{enumerate}
An example of the red-black tree that uses ranks instead of colours can be seen in \autoref{fig:ranked:rbt}. Red nodes are also coloured for convenience.
The majority of red-black tree implementations colour the nodes of the tree. Following that notation and a \textbf{precise} definition of the red-black tree, it is pretty common to ask the following questions:
\begin{enumerate}
\item \textit{Do we count the node itself if it is black?} \\
If we do not count the nodes themselves, we decrease the count of black nodes on every path to the external nodes by $1$.
\item \textit{Do we count the external nodes (leaves that do not hold any value)?} \\
If we do not count the external nodes themselves, we decrease the count of black nodes on every path to the external nodes by $1$.
\end{enumerate}
Overall, they do not matter as long as they are used consistently, since they affect the counts globally.
However, it is also possible to colour edges instead of the nodes, as presented in \textit{Průvodce labyrintem algoritmů} by \textit{Mareš and Valla}~\cite{labyrint}. In this representation, the colour of the edge represents the colour of the child node. This representation is much more ``natural'' for the representation using ranks, as can be seen in \autoref{fig:ranked:rbt}, where edges connecting nodes with rank difference $1$ represent \textit{black edges} and edges connecting nodes with rank difference $0$ represent \textit{red edges}. It is also apparent that, using this representation, the root of the tree does not hold any colour anymore.
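The correspondence between rank differences and colours can also be expressed directly in code. The following snippet is only an illustrative sketch (the function name \texttt{edge\_colour} is ours): a rank difference of $0$ corresponds to a red child (or a red edge leading to it), a rank difference of $1$ to a black one.

```python
# Illustrative sketch: mapping red-black rank differences to colours.
# A 0-child is a red node (red edge from its parent), a 1-child a black one.

def edge_colour(parent_rank, child_rank):
    diff = parent_rank - child_rank
    if diff == 0:
        return "red"
    if diff == 1:
        return "black"
    raise ValueError(f"invalid red-black rank difference: {diff}")

print(edge_colour(1, 1))  # red   (rank difference 0)
print(edge_colour(1, 0))  # black (rank difference 1)
```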
\begin{figure}
\centering
@@ -146,25 +144,24 @@ However it is also possible to color edges instead of the nodes as is presented
\section{Implementation of other balanced trees using rank}
To show that using rank is mostly an implementation detail, we will describe an implementation of the AVL tree using rank.
The implementation of the insertion is trivial, since it is described by \textit{Haeupler et al.}~\cite{wavl} and is used in the WAVL tree. All we need to implement is the deletion from the AVL tree. We will start with a short description of the deletion rebalance as given by \textit{Mareš and Valla} in \textit{Průvodce labyrintem algoritmů}.
When propagating the error, we can encounter 3 cases (we explain them for propagating a deletion from the left subtree; propagation from the right is mirrored, and the roles of the trits $+$ and $-$ swap)~\cite{labyrint}:
\begin{enumerate}
\item \textit{Node was marked with $-$.} In this case, the heights of the left and right subtrees are now equal, and the node is marked with $0$, but the propagation must be continued since the height of the whole subtree has changed.\label{avl:rules:delete:1}
\item \textit{Node was marked with $0$.} In this case, the node is marked with $+$, and the height of the subtree has not changed; therefore, we can stop the propagation.\label{avl:rules:delete:2}
\item \textit{Node was marked with $+$.} In this case, the node would acquire a balance factor of $+2$, which is not allowed. In this situation, we decide based on the mark of the node from which we are propagating the deletion in the following way (let $x$ be the current node marked with $+$ and $y$ be the right child of $x$):\label{avl:rules:delete:3}
\begin{enumerate}
\item If $y$ is marked with $+$, we rotate by $x$ to the left. After that, both $x$ and $y$ can be marked with $0$. The height from the point of view of the parent has changed, so we continue the propagation.\label{avl:rules:delete:3a}
\item If $y$ is marked with $0$, we rotate by $x$ to the left. After the rotation, $x$ can be marked with $+$ and $y$ with $-$. The height of the subtree has not changed, and the propagation can be stopped.\label{avl:rules:delete:3b}
\item $y$ is marked with $-$. Let $z$ be the left son of $y$. We double rotate: first by $z$ to the right and then by $x$ to the left. After the double rotation, $x$ can be marked by either $0$ or $-$, $y$ by $0$ or $+$, and $z$ gets $0$. The height of the subtree has changed; therefore, we must propagate further.\label{avl:rules:delete:3c}
\end{enumerate}
\end{enumerate}\label{avl:rules:delete}
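The case analysis above can be summarised as a small decision table. The following Python sketch is purely illustrative (the function name and the textual actions are ours; rotations and pointer updates are deliberately elided): it models one step of the rebalancing after a deletion propagated from the left subtree and reports whether the propagation continues.

```python
# Illustrative decision table for one step of AVL deletion rebalancing,
# for a deletion propagated from the LEFT subtree (the right case is
# mirrored). Rotations and pointer updates are deliberately elided.

def delete_step_from_left(x_mark, y_mark=None):
    """x_mark is the trit of the current node x; y_mark is the trit of
    x's right child y, needed only when x_mark == '+'.
    Returns (action, continue_propagation)."""
    if x_mark == "-":                     # rule 1
        return ("mark x with 0", True)
    if x_mark == "0":                     # rule 2
        return ("mark x with +", False)
    # x_mark == "+": rule 3, decided by the right child y
    if y_mark == "+":                     # rule 3a
        return ("rotate left by x", True)
    if y_mark == "0":                     # rule 3b
        return ("rotate left by x", False)
    # y_mark == "-"                       # rule 3c
    return ("double rotation: right by z, then left by x", True)

print(delete_step_from_left("-"))       # ('mark x with 0', True)
print(delete_step_from_left("+", "0"))  # ('rotate left by x', False)
```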
Knowing the rules, we have implemented the deletion rebalance by implementing the following functions:
\begin{enumerate}
\item \avlDeleteRebalance{} that handles updating the current node and its parent and iteratively calls a subroutine handling the previously described \textit{one step of rebalancing}.
\item \avlDeleteFixNode{} that handles one adjustment of rebalancing as described above.
\item \avlDeleteRotate{} that handles rotation and updating of ranks, if necessary.
\end{enumerate}
@@ -191,7 +188,7 @@ Having knowledge about rules we have implemented the deletion rebalance by imple
\caption{\texttt{deleteRebalance} algorithm for the AVL tree}\label{algorithm:avl:deleteRebalance}
\end{algorithm}
\texttt{deleteRebalance}, as can be seen in \autoref{algorithm:avl:deleteRebalance}, is relatively straightforward. In the beginning, we return early in case there is nothing to be rebalanced, which happens when deleting the last node from the tree. Then we handle a case where we are given only the parent by correctly setting $y$ and $parent$. Following up on that, as long as we have a node to be checked, we call \autoref{algorithm:avl:deleteFixNode} to fix the balancing of the current node. The algorithm for fixing a node returns $true$ or $false$ depending on the need to propagate the height change further, which is utilized in the \texttt{while}-loop condition.
\begin{algorithm}
\Proc{$\texttt{deleteFixNode}(T, x, parent)$}{
@@ -216,9 +213,9 @@ Having knowledge about rules we have implemented the deletion rebalance by imple
\caption{\texttt{deleteFixNode} algorithm for the AVL tree}\label{algorithm:avl:deleteFixNode}
\end{algorithm}
\texttt{deleteFixNode} implements the algorithm described in \hyperref[avl:rules:delete]{the list} with all possible cases above. We start by checking the balance factor of the given node. In case there is no need to rotate, the rank gets updated if necessary, and then we return the information on whether there is a need to propagate further or not. If the node has acquired a balance factor of $2$, we call \autoref{algorithm:avl:deleteRotate} to fix the balancing locally.
There are two operations that are not described using helper functions, and they are done in the following way:
\begin{itemize}
\item The balance factor of a node $x$ is calculated as \[ rank(x.right) - rank(x.left) \]
@@ -246,6 +243,6 @@ There are two operations that are not described using helper functions and they
\newpage
\texttt{deleteRotate} handles only the fixes where rotations are required. Both \autoref{algorithm:avl:deleteFixNode} and \autoref{algorithm:avl:deleteRotate} include comments to highlight which rules are handled. This function is also written generically, regardless of the subtree from which the height change is being propagated. This is done by passing in the functions used for rotations (since it is mirrored) and the balance factor required for just one rotation.
There is a crucial difference in both \autoref{algorithm:avl:deleteFixNode} and \autoref{algorithm:avl:deleteRotate} compared to the AVL tree implementations without ranks: during the propagation of the height change, the balance factors of the closest nodes are already adjusted, because the balance factor is calculated from the ranks of the node's children, which have already been updated. This fact needs to be reflected in the implementation accordingly, since it shifts the meaning of the rules described above, which were written for implementations that store the trit directly in the nodes and update it manually during rebalancing.


@@ -4,14 +4,14 @@ This chapter will briefly discuss the properties and fundamental ideas behind th
\section{Red-black trees}
As mentioned previously, red-black trees are among the most popular implementations in standard libraries. We have a binary search tree, and each node is given \textit{red} or \textit{black} colour. A red-black tree is kept balanced by enforcing the following set of rules~\cite{rbtree}:
\begin{enumerate}
\item External nodes are black; internal nodes may be red or black.
\item For each internal node, all paths from it to external nodes contain the same number of black nodes.
\item No path from an internal node to an external node contains two red nodes in a row.
\item External nodes do not hold any data.
\item Root has black colour.\footnote{This rule is optional since it increases the count of black nodes from the root to each external node. However, it may be beneficial during insertion, e.g. two insertions into an empty red-black tree.}
\end{enumerate}
Given this knowledge, we can safely deduce the following relation between the height of the red-black tree and nodes stored in it~\cite{cormen2009introduction}:
@@ -19,9 +19,9 @@ Given this knowledge, we can safely deduce the following relation between the he
\log_2{(n + 1)} \leq h \leq 2 \cdot \log_2{(n + 2)} - 2
\]\label{rb-height}
The lower bound is given by a perfect binary tree, and the upper bound is given by the minimal red-black tree.
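We can sanity-check these bounds numerically for a few sizes (an illustrative snippet, not part of the derivation; the function name is ours):

```python
import math

# Height bounds of a red-black tree with n keys:
# log2(n + 1) <= h <= 2 * log2(n + 2) - 2
def rb_height_bounds(n):
    return math.log2(n + 1), 2 * math.log2(n + 2) - 2

for n in (1, 15, 1_000_000):
    low, high = rb_height_bounds(n)
    print(f"n = {n}: {low:.2f} <= h <= {high:.2f}")
```

For $n = 15$, a perfect binary tree indeed has height exactly $\log_2{16} = 4$, which matches the lower bound.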
There are also other variants of the red-black tree that are considered simpler to implement, e.g. the left-leaning red-black tree, as described by \textit{Sedgewick}~\cite{llrb}.
Red-black trees are used to implement sets in C++~\footnote{\url{https://github.com/llvm/llvm-project/blob/main/libcxx/include/__tree} as of commit \texttt{990ea39}}, Java~\footnote{\url{https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/TreeMap.java} as of commit \texttt{fb469fb}} and C\#~\footnote{\url{https://github.com/dotnet/runtime/blob/main/src/libraries/System.Collections/src/System/Collections/Generic/SortedSet.cs} as of commit \texttt{215b39a}}.
@@ -41,13 +41,13 @@
In other words, the heights of the left and right subtrees of each node differ by at most $1$.~\cite{avl}
Similarly, deducing the height of the AVL tree from the original paper by \textit{Adelson-Velsky and Landis}~\cite{avl}, we get:
\[
\left( \log_2{(n + 1)} \leq \right) h < \log_{\varphi}{(n + 1)} < \frac{3}{2} \cdot \log_2{(n + 1)}
\]\label{avl-height}
If we compare the upper bounds for the height of red-black trees and AVL trees, we can see that the AVL rules are stricter than the red-black rules, but at the cost of rebalancing. However, in both cases, the rebalancing still takes $\log_2{n}$.
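The difference between the two upper bounds can be illustrated numerically (the function names are ours; we use the weaker $\frac{3}{2} \cdot \log_2$ form of the AVL bound):

```python
import math

# Upper bounds on the height, taken from the formulas above:
def rb_upper(n):
    return 2 * math.log2(n + 2) - 2       # red-black tree

def avl_upper(n):
    return 1.5 * math.log2(n + 1)         # AVL tree (weaker 3/2 * log2 form)

for n in (100, 10_000, 1_000_000):
    print(f"n = {n}: AVL <= {avl_upper(n):.2f}, red-black <= {rb_upper(n):.2f}")
```

For every listed size, the AVL bound is lower, matching the claim that the AVL rules are stricter.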
Regarding the implementation of AVL trees, we can see them implemented in the standard library of Agda~\footnote{\url{https://github.com/agda/agda-stdlib/blob/master/src/Data/Tree/AVL.agda} as of commit \texttt{d0841d1}}, OCaml~\footnote{\url{https://github.com/ocaml/ocaml/blob/trunk/stdlib/set.ml} as of commit \texttt{f52fdc2}} or Coq~\footnote{\url{https://github.com/coq/coq/blob/master/theories/MSets/MSetAVL.v} as of commit \texttt{c1ddf13}; they also admit to follow the OCaml implementation}.
@@ -65,11 +65,9 @@ B-tree is a self-balancing tree as described by \textit{Bayer and McCreight}~\ci
The minimal number of keys stored in a node does not apply to the root node.
\end{enumerate}
We have chosen the rules from \textit{Introduction to Algorithms}~\cite{cormen2009introduction} because the terminology is more familiar and compact than the rules given in the original paper by \textit{Bayer and McCreight}~\cite{btree}, where they introduce B-trees for the organization of files in filesystems, as the title suggests.
Based on the original paper~\cite{btree} and \textit{Introduction to Algorithms}~\cite{cormen2009introduction}, we have deduced the following height boundaries:
\[ \log_{2t}{(n + 1)} - 1 \leq h \leq \log_t{\frac{n + 1}{2}} \]
%% TODO: ↑ Check this thing once again ↑
Even though the original paper presented the B-tree for use in filesystems or databases, there are also cases where the B-tree is used to represent sets, for example, in Rust~\footnote{\url{https://github.com/rust-lang/rust/blob/master/library/alloc/src/collections/btree/map.rs} as of commit \texttt{9805437}; $t = 6$}.
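Using the upper bound $h \leq \log_t{\frac{n + 1}{2}}$ from above, we can estimate how shallow such a tree stays for Rust's minimum degree $t = 6$ (an illustrative snippet; the function name is ours):

```python
import math

# Upper bound on the height of a B-tree with n keys and minimum degree t:
# h <= log_t((n + 1) / 2)
def btree_max_height(n, t):
    return math.log((n + 1) / 2, t)

for n in (1_000, 1_000_000):
    print(f"n = {n}, t = 6: h <= {btree_max_height(n, 6):.2f}")
```

Even for a million keys, the bound stays below $8$, which illustrates why a large branching factor pays off.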