Data structures have become a regular part of the essential toolbox for problem-solving. In many cases, they also help to improve the existing algorithm's performance, e.g. using a priority queue in Dijkstra's algorithm for the shortest path. This thesis will mainly discuss the implementation of a set (which can also be adjusted to represent a dictionary or map, if you wish).
Currently, the most commonly used implementations of sets use hash tables, but we will talk about another common alternative, implementation via self-balancing search trees. Compared to a hash table, they provide us with \textbf{consistent} time complexity, but at the cost of a requirement for ordering on the elements. The most implemented self-balancing binary tree is a \textit{red-black tree}, as described by Guibas and Sedgewick~\cite{rbtree}. Among other alternatives, we can find (non-binary) \textit{B-tree}~\cite{btree} and \textit{AVL tree}~\cite{knuth1998art}.
This thesis analyses and visualizes the \textit{Weak AVL (WAVL)}\cite{wavl} tree that has more relaxed conditions than the AVL tree but still provides better balancing than a red-black tree.
We start by reiterating through commonly used search trees, explaining basic ideas behind their self-balancing and bounding the height in the worst-case scenario. Then we state used terminology and explain the rank-balanced tree. Given a rank-balanced tree, we can delve into the details behind the representation of previously shown self-balancing binary search trees using rank and the WAVL rule gives us a new self-balancing binary search tree. For the WAVL, we provide pseudocode and explain operations used for rebalancing, including diagrams. Later on, we will discuss different heuristics that can be used for rebalancing and implementing the visualization of the operations on the WAVL tree.
This chapter will briefly discuss the properties and fundamental ideas behind the most used self-balancing search trees in standard libraries to give an idea about current options and how WAVL fits among them.
As mentioned previously, red-black trees are among the most popular implementations in standard libraries. As always, we have a binary search tree, and then each node is given \textit{red} or \textit{black} colour. A red-black tree is kept balanced by enforcing the following set of rules:
\begin{enumerate}
\item External nodes are black; internal nodes may be red or black
\item For each internal node, all paths from it to external nodes contain the same number of black nodes
\item No path from an internal node to an external node contains two red nodes in a row
\end{enumerate}~\cite{rbtree}
Given this knowledge, we can safely deduce the following relation between the height of the red-black tree and nodes stored in it:
where the lower bound is given by a perfect binary tree and upper bound by the minimal red-black tree.
There are also other variants of the red-black tree that are considered to be simpler for implementation, e.g. left-leaning red-black tree, as described by Sedgewick.
Red-black trees are used to implement sets in C++, Java and C\#.
If we compare the upper bounds for the height of the red-black trees and AVL trees, we can see that AVL rules are more strict than red-black rules, but at the cost of rebalancing. However, in both cases, the rebalancing still takes $\log_2{n}$.
Regarding the implementation of AVL trees, we can see them implemented in the standard library of Agda or Coq.
In comparison to nodes in binary search trees, nodes in rank-balanced trees contain one more piece of information, and that is \textit{rank}. Each type of tree that can be implemented using this representation, e.g. red-black, 2-3-4, AVL or WAVL, has a specific set of rules that ensure the resulting tree is balanced.
\textbf{AVL Rule}: Every node is (1, 1) or (1, 2).
In case of the AVL trees rank represents height. Here we can notice a very smart way of using the \textit{(i, j) node} definition. If we go back to the definition and want to be explicit about the nodes that are allowed with the \textit{AVL Rule}, then we get (1, 1), (1, 2) \textbf{or} (2, 1) nodes. However it is possible to find implementations of the AVL tree that allow leaning \textbf{to only one side} as opposed to the original requirements given by Adelson-Velsky and Landis. Forbidding interchangeability of (i, j) with (j, i) nodes would still yield AVL trees that lean to one side.
Meaning of the \textit{AVL Rule} is quite simple, since rank represents height in that case. We can draw analogies using the notation used for the AVL trees, where we mark nodes with a trit (or a sign) or use a balance-factor. We have two cases to discuss:
\begin{itemize}
\item\textbf{(1, 1) node} represents a tree where both of its subtrees have the same height. In this case we are talking about the nodes with balance-factor $0$ (respectively being signed with a $0$).
\item\textbf{(1, 2) node} represents a tree where one of its subtrees has a bigger height. In this case we are talking about the nodes with balance-factor $-1$ or $1$ (respectively being signed with a $-$ or a $+$).
\textbf{Red-Black Rule}: All rank differences are 0 or 1, and no parent of a 0-child is a 0-child. \\
In case of red-black trees, rank represents number of black nodes on a path from the node to a leaf (excluding the node itself). Based on that we can discuss the \textit{Red-Black Rule} in detail:
\begin{enumerate}
\item\textit{All rank differences are 0 or 1} inductively enforces monotonically increasing (at most by 1) count of black nodes from the leaves. \\
In detail:
\begin{enumerate}
\item In case the \textbf{current node is black}, the rank difference must be 1, since we have one more black node on the path from the parent to the leaves than from the current node.
\item In case the \textbf{current node is red}, the rank difference must be 0, since from the parent the count of black nodes on the path to leaves has not changed.
\item And finally all other differences are invalid, since by adding a node to the beginning of a path to the leaf we can either add red node (0-child) or black node (1-child), i.e. there is one more black node on the path or not which implies the change can be only 0 or 1.
\end{enumerate}
\item\textit{No parent of a 0-child is a 0-child} ensures that there are no two consecutive red nodes, since 0-child node is equivalent to the red node.
Majority of the red-black tree implementations color nodes of the tree, following that notation and \textbf{precise} definition of the red-black tree it is quite common to ask the following questions:
However it is also possible to color edges instead of the nodes as is presented in \textit{Průvodce labyrintem algoritmů} by \textit{Mareš and Valla}.~\cite{labyrint} In this representation color of the edge represents color of the child node. This representation is much more „natural“ for the representation using rank as it can be seen in \autoref{fig:ranked:rbt}, where edges connecting nodes with rank-difference $1$ represent \textit{black edges} and edges connecting nodes with rank-difference $0$ represent \textit{red edges}. It is also apparent that using this representation root of the tree does not hold any color anymore.
To show that using rank is mostly an implementation detail, we will describe an implementation of the AVL tree using rank.
Implementation of the insertion is trivial, since it is described by \textit{Haeupler et al.} and is used in the WAVL tree. All we need to implement is the deletion from the AVL tree. We will start by short description of the deletion rebalance as given by \textit{Mareš and Valla} in \textit{Průvodce labyrintem algoritmů}.
When propagating the error, we can encounter 3 cases (we explain them with respect to propagating deletion from the left subtree, propagation from right is mirrored and role of trits $+$ and $-$ swaps):
\item\textit{Node was marked with $-$.} In this case, heights of left and right subtrees are equal now and node is marked with $0$, but propagation must be continued, since the height of the whole subtree has changed.\label{avl:rules:delete:1}
\item\textit{Node was marked with $0$.} In this case, node is marked with $+$ and the height of the subrtree has not changed, therefore we can stop the propagation.\label{avl:rules:delete:2}
\item\textit{Node was marked with $+$.} In this case, node would acquire balance-factor of $+2$, which is not allowed. In this situation we decide based on the mark of the node from which we are propagating the insertion in the following way (let $x$ the current node marked with $+$ and $y$ be the right child of $x$):\label{avl:rules:delete:3}
\item$y$ is marked with $+$, then we rotate by $x$ to the left. After that both $x$ and $y$ can be marked with $0$. Height from the point of the parent has changed, so we continue the propagation.\label{avl:rules:delete:3a}
\item$y$ is marked with $0$, then we rotate by $x$ to the left. After the rotation, $x$ can be marked with $+$ and $y$ with $-$. Height of the subtree has not changed, so propagation can be stopped.\label{avl:rules:delete:3b}
\item$y$ is marked with $-$. Let $z$ be the left son of $y$. We double rotate: first by $z$ to the right and then by $x$ to the left. After the double-rotation $x$ can be marked by either $0$ or $-$, $y$ by $0$ or $+$ and $z$ gets $0$. Height of the subtree has changed, therefore we must propagate further.\label{avl:rules:delete:3c}
\item\avlDeleteRebalance{} that handles updating the current node and its parent and iteratively calls subroutine handling previously described \textit{one step of a rebalancing}
\item\avlDeleteFixNode{} that handles one adjustment of rebalancing as described above
\item\avlDeleteRotate{} that handles rotation and updating of ranks, if necessary
\caption{\texttt{deleteRebalance} algorithm for the AVL tree}\label{algorithm:avl:deleteRebalance}
\end{algorithm}
\texttt{deleteRebalance}, as can be seen in \autoref{algorithm:avl:deleteRebalance}, is quite straightforward. At the beginning we early return in case there is nothing to be rebalanced, which happens when deleting the last node from the tree. Then we handle case when we are given only parent by correctly setting $y$ and $parent$. Following up on that, as long as we have a node to be checked, we call \autoref{algorithm:avl:deleteFixNode} to fix balancing of the current node. Algorithm for fixing node returns $true$ or $false$ depending on the need to propagate the height change further, which is utilized in the condition of the \texttt{while} loop.
\caption{\texttt{deleteFixNode} algorithm for the AVL tree}\label{algorithm:avl:deleteFixNode}
\end{algorithm}
\texttt{deleteFixNode} implements the algorithm as described in \hyperref[avl:rules:delete]{the list} with all possible cases above. We start by checking the balance-factor of the given node, in case there is no need to rotate, the rank gets updated if necessary and then we return the information whether there is a need to propagate further or not. In case the node has acquired balance-factor of $2$ we call \autoref{algorithm:avl:deleteRotate} to fix the balancing locally.
There are two operations that are not described using helper functions and they are done in a following way:
\begin{itemize}
\item Balance-factor of a node $x$ is calculated as \[ rank(x.right)- rank(x.left)\]
\item Updating rank of a node $x$ is done by setting node's rank to \[1+\max\{ rank(x.left), rank(x.right)\}\]
\end{itemize}
\begin{algorithm}
\Proc{$\texttt{deleteRotate}(T, x, y, leaning, rotateL, rotateR)$}{
\caption{\texttt{deleteRotate} algorithm for the AVL tree}\label{algorithm:avl:deleteRotate}
\end{algorithm}
\newpage
\texttt{deleteRotate} is handling only fixes where the rotations are required. Both \autoref{algorithm:avl:deleteFixNode} and \autoref{algorithm:avl:deleteRotate} include comments to highlight which rules are handled. This function is also done generically regardless of the subtree from which the height change is being propagated. This is done passing in functions used for rotations (since it is mirrored) and also by passing in the balance-factor required for just one rotation.
In both \autoref{algorithm:avl:deleteFixNode} and \autoref{algorithm:avl:deleteRotate} there is a key difference compared to the AVL tree implementations without ranks. Comparing the \hyperref[avl:rules:delete]{rules for deletion} with algorithms for rank-balanced implementation, it is apparent that during propagation of height change, the balance-factors of immediate nodes are already adjusted, since the information comes from either of its subtrees and it is calculated using ranks of its children that are already adjusted. This fact needs to be reflected in the implementation accordingly, since it shifts the meaning of rules as they are described above and written for the implementations that store the trit in the nodes directly, which is updated manually during rebalancing.
Based on the rank rules for implementing red-black tree (as described in \ref{chap:rb-rule}) and AVL tree (as described in \ref{chap:avl-rule}), \textit{Haeupler et al.} present a new rank rule:
\textbf{Weak AVL Rule}: All rank differences are 1 or 2, and every leaf has rank 0.~\cite{wavl}
Comparing the \textit{Weak AVL Rule} to the \textit{AVL Rule}, we can come to these conclusions:
\begin{itemize}
\item\textit{Every leaf has rank 0} holds with the AVL Rule, since every node is (1, 1) or (1, 2) and rank of a node represents height of its tree. Rank of \textit{nil} is defined as $-1$ and height of tree rooted at leaf is $0$, therefore leaves are (1, 1)-nodes
\item\textit{All rank differences are 1 or 2} does not hold in one specific case, and that is (2, 2)-node, which is allowed in the WAVL tree, but not in the AVL tree. This difference will be explained more thoroughly later on.
We have described in \autoref{chap:sb-bst} other common self-balanced binary search trees to be able to draw analogies and explain differences between them. Given the boundaries of height for red-black and AVL tree, we can safely assume that the AVL is more strict with regards to the self-balancing than the red-black tree. Let us show how does WAVL fit among them. \textit{Haeupler et al.} present following bounds:
\[ h \leq k \leq2h \text{ and } k \leq2\log_2{n}\]~\cite{wavl}
In those equations we can see $h$ and $n$ in the same context as we used it to lay boundaries for the AVL and red-black trees, but we can also see new variable $k$, which represents the rank of the tree.
One of the core differences between AVL and WAVL lies in the rebalancing after deletion. Insertion into the WAVL tree is realized in the same way as it would in the AVL tree and the benefit of (2, 2)-node is used during deletion rebalancing.
From the previous 2 statements we can come to 2 conclusions and those are:
\begin{itemize}
\item If we commit only insertions to the WAVL tree, it will always yield a valid AVL tree. In that case it means that the height boundaries are same as of the AVL tree (described in \autoref{avl-height}).
\item If we commit deletions too, we can assume the worst-case scenario where \[ h < 2\log_2{n}\] which also holds for the red-black trees.
\end{itemize}
From the two conclusions we can safely deduce that the WAVL tree is in the worst-case scenario as efficient as the red-black tree and in the best-case scenario as efficient as the AVL tree.
Inserting values into WAVL tree is equivalent to inserting values into regular binary-search tree followed up by rebalancing that ensures rank rules hold. This part can be clearly seen in \autoref{algorithm:wavl:insert}. We can also see there two early returns, one of them happens during insertion into the empty tree and other during insertion of duplicate key, which we do not allow.
In the \autoref{algorithm:wavl:insert} we have also utilized a helper function that is used to find parent of the newly inserted node and also prevents insertion of duplicate keys within the tree. Pseudocode of that function can be seen in \autoref{algorithm:findParentNode}.
Rebalancing after insertion in the WAVL tree is equivalent to rebalancing after insertion in the AVL tree. We will start with a short description of the rebalancing within AVL to lay a foundation for analogies and differences compared to the implementation using ranks.
When propagating the error, we can encounter 3 cases (we explain them with respect to propagating insertion from the left subtree, propagation from right is mirrored and role of trits $+$ and $-$ swaps):
\item\textit{Node was marked with $+$.} In this case, heights of left and right subtrees are equal now and node is marked with $0$ and propagation can be stopped.\label{avl:rules:insert:1}
\item\textit{Node was marked with $0$.} In this case, node is marked with $-$, but the height of the tree rooted at the node has changes, which means that we need to propagate the changes further.\label{avl:rules:insert:2}
\item\textit{Node was marked with $-$.} In this case, node would acquire balance-factor of $-2$, which is not allowed. In this situation we decide based on the mark of the node from which we are propagating the insertion in the following way (let $x$ be the node from which the information is being propagated and $z$ the current node marked with $-$):\label{avl:rules:insert:3}
\item$x$ is marked with $-$, then we rotate by $z$ to the right. After that both $z$ and $x$ can be marked with $0$. Height from the point of the parent has not changed, so we can stop the propagation.\label{avl:rules:insert:3a}
\item$x$ is marked with $+$, then we double rotate: first by $x$ to the left and then by $z$ to the right. Here we need to recalculate the balance-factors for $z$ and $x$, where $z$ gets $-$ or $0$ and $x$ gets $0$ or $+$. Node that was a right child to the $x$ before the double-rotation is now marked with $0$ and propagation can be stopped.\label{avl:rules:insert:3b}
\item$x$ is marked with $0$. This case is trivial, since it cannot happen, because we never propagate the height change from a node that acquired sign $0$.
In the following explanation we have to consider that valid nodes in AVL tree implemented via ranks are (1, 1) and (1, 2) and by the time of evaluating rank-differences of parent, they are already affected by the rebalancing done from the inserted leaf.
As a first step, which can be seen in \autoref{algorithm:wavl:insertRebalance}, we iteratively check rank-differences of a parent of the current node. As long as it is a (0, 1) or (1, 0) node, we promote it and propagate further. There is an interesting observation to be made about the way \textit{how parent can fulfill such requirement}. And the answer is simple, since we are adding a leaf or are already propagating the change to the root, it means that we have lowered the rank-difference of the parent, therefore it must have been (1, 1) node. From the algorithm used for usual implementations of AVL trees, this step refers to \hyperref[avl:rules:insert:2]{\textit{rule 2}}. After the promotion the rank of the parent becomes (1, 2) or (2, 1) which means that it gets sign $-$ (or $+$ respectively when propagating from the right subtree), which conforms to the usual algorithm.
After this, we might end up in two situations and those are:
\begin{enumerate}
\item Current node is not a 0-child, which means that after propagation and promotions we have gotten to a parent node that is (1, 2) or (2, 1), which refers to the \hyperref[avl:rules:insert:1]{\textit{rule 1}}.
\item Current node is a 0-child, which means that after propagation and promotions we have a node with a parent that is either (0, 2) or (2, 0) node. This case conforms to the \hyperref[avl:rules:insert:3]{\textit{rule 3}} and must be handled further to fix the broken rank rule.
\end{enumerate}
\hyperref[avl:rules:insert:3]{\textit{Rule 3}} is then handled by a helper function that can be seen in \autoref{algorithm:wavl:fix0Child}.
Here we can see, once again, an interesting pattern. When comparing to the algorithm described above, using the rank representation, we do not need to worry about changing the signs and updating the heights, since by rotating combined with demotion and promotion of the ranks, we are effectively updating the height (represented via rank) of the affected nodes. This observation could be used in \autoref{algorithm:avl:deleteFixNode} and \autoref{algorithm:avl:deleteRotate} where we turned to manual updating of ranks to show the difference.
As described by \textit{Haeupler et al.}, we start the deletion rebalancing by checking for (2, 2) node. If that is the case, we demote it and continue with the deletion rebalancing via \autoref{algorithm:wavl:bottomUpDelete} if we have created a 3-child by the demotion. Demoting the (2, 2) node is imperative, since it enforces part of the \textit{Weak AVL Rule} requiring that leaves have rank equal to zero.
For example consider the following tree in \autoref{fig:wavl:twoElements}. Deletion of key 2 from that tree would result in having only key 1 in the tree with rank equal to 1, which would be (2, 2) node and leaf at the same time. After the demotion of the remaining key, we acquire the tree as shown in \autoref{fig:wavl:twoElementsAfterDelete}
In contrast to the \textit{AVL Rule}, WAVL tree allows us to have (2, 2) nodes present. Therefore we can encounter two key differences during deletion rebalancing:
\begin{enumerate}
\item If anywhere during the deletion rebalancing, \textbf{but not} at the start, we encounter (2, 2) node, we can safely stop the rebalancing process, since rest of the tree must be correct and we have fixed errors on the way to the current node from the leaf.
\item Compared to the AVL tree, during deletion rebalancing we need to fix \textbf{3-child} nodes.
\node (Node{1}) at (31.197bp,192.0bp) [draw,ellipse] {1, 2};
\node (Node{0}) at (31.197bp,105.0bp) [draw,ellipse] {0, 1};
\node (Node{-1}) at (31.197bp,18.0bp) [draw,ellipse] {-1, 0};
\draw [->] (Node{1}) ..controls (31.197bp,162.16bp) and (31.197bp,146.55bp) .. (Node{0});
\definecolor{strokecol}{rgb}{0.0,0.0,0.0};
\pgfsetstrokecolor{strokecol}
\draw (36.197bp,148.5bp) node {1};
\draw [->] (Node{0}) ..controls (31.197bp,75.163bp) and (31.197bp,59.548bp) .. (Node{-1});
\draw (36.197bp,61.5bp) node {1};
%
\end{tikzpicture}
\caption{WAVL tree from \autoref{fig:wavl:deletionA:before} after deletion of 2, value is replaced by one of its children}\label{fig:wavl:deletionA:replacing}
For the implementation of rank-balanced trees we have used two programming languages: C\# and Python. C\# implementation has not been finished and therefore is not part of the submitted attachments. However it has given a valuable insight into the role of preconditions and invariants in algorithms while \texttt{null}-checking is enforced, since, for example, static type control cannot be aware of a node \textbf{not} being a \texttt{null} after checking specific set of conditions that forbid such scenario.
Python has been chosen as the \textit{go-to} language, since it is used for teaching foundations of programming~\cite{ib111} and also introductory course of algorithms and data structures~\cite{ib002} at our faculty.
We have started by implementing a general idea of a rank-balanced tree. Rank-balanced tree is an abstract class that \textbf{does not} implement methods specific to different kinds of trees such as
\begin{enumerate}
\item\texttt{is\_correct\_node} is used to check whether node and its subtrees satisfy rank rules
\item\texttt{\_insert\_rebalance} rebalances the tree from the given node after insertion
\item\texttt{\_delete\_rebalance} rebalances the tree from the given nodes (deleted node and parent) after deletion
Apart from that we have also implemented class for representation of nodes that provides quite rich interface that is utilized during rebalancing and also in generic methods on generic tree.
From the beginning we have employed the techniques of property-based testing (done manually during C\# implementation and uses Hypothesis~\cite{hypothesis} for Python implementation). The elementary requirement for testing and validation of implemented algorithms was a \textbf{correct}\texttt{is\_correct\_node} method that validates properties of a specific rank-balanced tree and a good set of invariants. List of set invariants follows:
We also admit abuse of property-based testing to find a \textit{minimal} sequence of operations when WAVL tree relaxation manifests. For that purpose we have implemented a \textit{comparator} class that takes two different instances of rank-balanced trees and provides rudimentary \texttt{insert} and \texttt{delete} interface enriched with \texttt{are\_same} (evaluates isomorphism and ranks) and \texttt{are\_similar} (evaluates just isomorphism) methods. While trying to find minimal counter-example, we have also discovered a bug in rebalance after deletion of WAVL tree that caused enforcement of the AVL rank rules.