Category: Data Structures

How To: Constructing a Pretty Quick Hash Table

May 21st, 2024

Introduction

This is more of a code-drop, than a formal explanation of the concept. I’ve recently got a chance to work on this and thought I’d share with the world, while also documenting it for future reference.
As the title mentions, the idea is to construct a quick hash table that prioritizes speed, rather than being a general purpose hash-table. If you are familiar with workings of the hash-table, you would probably get most out of this.
The code provided should work right off the shelf. You just gotta add it to your project and place both the files in same folder.

Implementation Details

I’ve supported following operations in my implementation:

Insert
Delete
Search
Read-Only Access using overloaded indexing operator

The hash-table has following the features:

Table prioritizes fast search/retrieval operations.
Re-Hashing occurs only during the insert operation
Open-Addressing is used to resolve hash collisions.
Single data blob is used to store the Key-Value data and is maintained by Hash-Table itself.
Tries to Rehash if there’s too much probing or exceeds the load factor defined.
Load factor should always be less than 1.
If table is optimized for minimal probing, it’ll be faster than collision resolution by chaining.
To minimize number of times table is probed, each key is split into 3 partitions. So, key is basically a logical construct that’s a combination of multiple partitions. The partition is where the Key-Value data resides in.

There’s a small performance analysis comparing the implementation to STL’s unordered map.

Code

HashTable.h

// HashTable.h
#ifndef _H_HASHTABLE
#define _H_HASHTABLE

#include <assert.h>
#include <math.h>
#include <utility>

namespace pd
{
	/*
	* Supports Insert, Delete, Search.
	* Delete assumes there's a valid entry present. Use search to query before deleting if not sure.
	* Cannot modify entries once inserted
	* Supports insertion of multiple similar valued keys, but no way to iterate over them.
	* Entries are unordered during iteration
	* Supports Range-Based for loop through a simple Iterator implementation.
	* 
	* Uses Open-Addressing with double hashing.
	* Tries to Rehash if there's too much probing or if load factor goes beyond the max allowed value.
	* Load factor should always be less than 1.
	* With minimal probing, should be faster than collision resoluton by chaining.
	*/
	template<typename _KEY
		, typename _VALUE
		, typename _HASH = std::hash<_KEY>
	>
	class HashTable
	{
		struct Data
		{
			enum State : uint8_t
			{
				Invalid = 0,
				Occupied = 1,
				Deleted = 2
			};
			struct DataPair
			{
				DataPair(_KEY pKey, _VALUE pValue)
					: key(pKey)
					, value(pValue)
				{}
				_VALUE		value;
				_KEY		key;
			};

			Data(){}
			Data(_KEY pKey, _VALUE pValue)
			: state(Invalid)
			, pair(DataPair(pKey, pValue))
			{}
			Data(const Data&) = default;
			DataPair	pair;
			State		state;
		};

		using PData = Data*;
		using C_PData = const Data*;
		using DataRef = Data&;
		using C_DataRef = const Data&;
		using KeyRef = _KEY&;
		using C_KeyRef = const _KEY&;
		using ValueRef = _VALUE&;
		using C_ValueRef = const _VALUE&;
		
		using PairRef = typename Data::DataPair&;
		using C_PairRef = const typename Data::DataPair&;

#define AS_DATAPTR(buffer) ((PData)buffer)

	public:
		class Iterator
		{
		public:
			Iterator(const Data* const buffer, uint32_t size, uint32_t index);
			
			/*Operator Overload ops*/
			Iterator& operator++();
			Iterator& operator--();
			C_PairRef operator*() const;
			bool operator==(const Iterator& other) const;
			bool operator!=(const Iterator& other) const;
		private:
			uint32_t				m_index = 0;
			uint32_t				m_size = 0;
			const Data* const		m_buffer = nullptr;
		};

	public:
		HashTable();
		HashTable(uint32_t requiredCapacity);
		HashTable(const HashTable<_KEY, _VALUE, _HASH>& other);
		HashTable operator=(const HashTable<_KEY, _VALUE, _HASH>& other);
		~HashTable();

		void Insert(C_KeyRef key, C_ValueRef value);
		void Remove(C_KeyRef key);
		const Iterator Search(C_KeyRef key) const;

		const Iterator begin() const;
		const Iterator end() const;
		C_PairRef operator[](C_KeyRef key);
	private:

		/* 
		* Inserts into the hash table. Rehashes if necessary
		* Asserts if insertion fails
		*/
		void Insert_Internal(DataRef data);
		void Remove_Internal(C_KeyRef key);
		const Iterator Search_Internal(C_KeyRef key) const;

		/* If conditions specified here are met, table will be rehased with old buffer invalidated*/
		void RehashIfRequired();
		/* Requests new memory from OS and rehashes all entries in old buffer to new one*/
		void Rehash();
		bool MaxLoadExceeded() const;
		/* capacity request here is to increase number of keys. Actual capacity will be calculated from it
		Capacity requested must be greater than currently available. Increases in powers of 2*/
		void RequestCapacity(uint32_t capacity);
		void* RequestMemory() const;

	public:
		//Debug functions
		void Print();
		void PrintTableVitals() const;
		void ProbeWholeTable() const;

	private:
		float					m_currentLoad		= 0;
		uint32_t				m_numKeys			= 0;
		uint32_t				m_size				= 0;
		uint32_t				m_capacity			= 0;
		uint32_t				m_maxProbes			= 0;
		void*					m_buffer			= nullptr;

		float					m_maxLoadAllowed	= 0.75;
		const _HASH				c_HashFunObj;
		const uint8_t			c_NumPartitions		= 3;
	};


	/*
	* m should be a power of 2
	* Result of this should be a relatively prime to m. Only then the whole table could be traversed reliably
	* Should return value in range [1 m-1] that's relatively prime to m
	*/
	inline uint32_t secondaryHash(uint32_t m, uint32_t key)
	{
		const uint32_t p = 429484801; // Random prime number
		uint32_t b = (key * p) % (m - 1); // b must be less than m. Doesnt matter even if multiplication overflows
		const uint32_t res = b | 1; // res should be odd. Powers of 2 are relatively prime to all odd numbers
		return res;
	}
	inline uint32_t primaryHash(uint32_t m, uint32_t key)
	{
		return key % m;
	}

	inline uint32_t hash(uint32_t i, uint32_t m, uint32_t key)
	{
		return (primaryHash(m, key) + i * secondaryHash(m, key)) % m;
	}

#include "HashTable.inl"
}

#endif

HashTable.inl

// HashTable.inl
#include "HashTable.h"

#define HASHTABLE_TEMPLATE_DECL template<typename _KEY, typename _VALUE, typename _HASH>
#define HASHTABLE_TEMPLATE_TYPES _KEY, _VALUE, _HASH
#define HASHTABLE_CLASS HashTable<HASHTABLE_TEMPLATE_TYPES>

#define HASHTABLE_ITERATOR_TYPE typename HashTable<HASHTABLE_TEMPLATE_TYPES>::Iterator
#define HASHTABLE_ITERATOR_CLASS HashTable<HASHTABLE_TEMPLATE_TYPES>::Iterator
#define HASHTABLE_DATA_TYPE typename HashTable<HASHTABLE_TEMPLATE_TYPES>::Data
#define HASHTABLE_DATA_CLASS HashTable<HASHTABLE_TEMPLATE_TYPES>::Data
#define HASHTABLE_DATA_PAIR_TYPE typename HashTable<HASHTABLE_TEMPLATE_TYPES>::Data::DataPair
#define HASHTABLE_DATA_PAIR_CLASS HashTable<HASHTABLE_TEMPLATE_TYPES>::Data::DataPair

// Hash Table ===========================================================================
HASHTABLE_TEMPLATE_DECL
HASHTABLE_CLASS::HashTable()
{
	RequestCapacity(10);
}

HASHTABLE_TEMPLATE_DECL
HASHTABLE_CLASS::HashTable(uint32_t requiredCapacity)
{
	RequestCapacity(requiredCapacity);
}

HASHTABLE_TEMPLATE_DECL
HASHTABLE_CLASS::HashTable(const HASHTABLE_CLASS& other)
	: m_currentLoad(other.m_currentLoad)
	, m_numKeys(other.m_numKeys)
	, m_size(other.m_size)
	, m_capacity(other.m_capacity)
	, m_maxProbes(other.m_maxProbes)
	, m_maxLoadAllowed(other.m_maxLoadAllowed)
	, c_HashFunObj(other.c_HashFunObj)
	, c_NumPartitions(other.c_NumPartitions)
{
	RequestCapacity(m_numKeys);
	if (m_buffer != nullptr)
	{
		memcpy(m_buffer, other.m_buffer, m_capacity * sizeof(Data));
	}
}

HASHTABLE_TEMPLATE_DECL
HASHTABLE_CLASS HASHTABLE_CLASS::operator=(const HASHTABLE_CLASS& other)
{
	if (m_buffer != nullptr)
	{
		free(m_buffer);
	}
	RequestCapacity(other.m_numKeys);
	for (auto itr : other)
	{
		Data data(itr.key, itr.value);
		Insert_Internal(data);
	}
	return HASHTABLE_CLASS(other);
}

HASHTABLE_TEMPLATE_DECL
HASHTABLE_CLASS::~HashTable()
{
	if (m_buffer != nullptr)
	{
		delete m_buffer;
	}
}

HASHTABLE_TEMPLATE_DECL
void HASHTABLE_CLASS::Insert(C_KeyRef key, C_ValueRef value)
{
	Data data(key, value);
	Insert_Internal(data);
	RehashIfRequired();
}

HASHTABLE_TEMPLATE_DECL
void HASHTABLE_CLASS::Remove(C_KeyRef key)
{
	Remove_Internal(key);
}

HASHTABLE_TEMPLATE_DECL
const HASHTABLE_ITERATOR_TYPE HASHTABLE_CLASS::Search(C_KeyRef key) const
{
	return Search_Internal(key);
}

HASHTABLE_TEMPLATE_DECL
const HASHTABLE_ITERATOR_TYPE HASHTABLE_CLASS::begin() const
{
	return Iterator(AS_DATAPTR(m_buffer), m_capacity, 0);
}

HASHTABLE_TEMPLATE_DECL
const HASHTABLE_ITERATOR_TYPE HASHTABLE_CLASS::end() const
{
	return Iterator(AS_DATAPTR(m_buffer), m_capacity, m_capacity);
}

HASHTABLE_TEMPLATE_DECL
const HASHTABLE_DATA_PAIR_TYPE& HASHTABLE_CLASS::operator[](C_KeyRef key)
{
	if (Search_Internal(key) == end())
	{
		Data data(key, _VALUE());
		Insert_Internal(data);
	}
	return *Search(key);
}

HASHTABLE_TEMPLATE_DECL
void HASHTABLE_CLASS::Insert_Internal(DataRef data)
{
	uint32_t k = 0;
	bool found = false;

	while (k < m_numKeys && !found)
	{
		// [0 .. m_Keys-1]
		uint32_t keyHash = (uint32_t)c_HashFunObj(data.pair.key);
		uint32_t tableKey = hash(k, m_numKeys, keyHash);
		uint32_t partitionKey = tableKey * c_NumPartitions;
		for (uint32_t i = 0; i < c_NumPartitions; ++i)
		{
			uint32_t dataIndex = partitionKey + i;
			//Search for empty slot
			if (AS_DATAPTR(m_buffer)[dataIndex].state != Data::Occupied)
			{
				Data* newData = new (AS_DATAPTR(m_buffer) + dataIndex) Data(data);
				newData->state = Data::Occupied;
				found = true;
				break;
			}
		}
		++k;
	}
	assert(found && "Hash Table corrupted. Could not insert the provided data");

	++m_size;
	m_currentLoad = (float)m_size / (float)m_capacity;

	if (k > m_maxProbes)
	{
		m_maxProbes = k;
	}
}

HASHTABLE_TEMPLATE_DECL
void HASHTABLE_CLASS::Remove_Internal(C_KeyRef key)
{
	uint32_t k = 0;

	bool deleted = false;
	while (k < m_numKeys && !deleted)
	{
		// Gets expensive if load factor inches close to 1
		uint32_t keyHash = (uint32_t)c_HashFunObj(key);
		uint32_t tableKey = hash(k, m_numKeys, keyHash);
		uint32_t partitionKey = tableKey * c_NumPartitions;
		for (uint32_t i = 0; i < c_NumPartitions; ++i)
		{
			uint32_t dataIndex = partitionKey + i;
			if (AS_DATAPTR(m_buffer)[dataIndex].state == Data::Occupied
				&& AS_DATAPTR(m_buffer)[dataIndex].pair.key == key)
			{
				AS_DATAPTR(m_buffer)[dataIndex].state = Data::Deleted;
				deleted = true;
				break;
			}
		}

		if (deleted) break;

		++k;
		assert(k < m_maxProbes && "Table corrupt. Invalid times probed while deleting");
	}

	assert(deleted && "Trying to delete a non-existent key from HashTable.");
	--m_size;
	m_currentLoad = (float)m_size / (float)m_capacity;
}

HASHTABLE_TEMPLATE_DECL
const HASHTABLE_ITERATOR_TYPE HASHTABLE_CLASS::Search_Internal(C_KeyRef key) const
{
	uint32_t k = 0;

	while (k < m_numKeys)
	{
		// Gets expensive if load factor inches close to 1
		uint32_t keyHash = (uint32_t)c_HashFunObj(key);
		uint32_t tableKey = hash(k, m_numKeys, keyHash);
		uint32_t partitionKey = tableKey * c_NumPartitions;
		for (uint32_t i = 0; i < c_NumPartitions; ++i)
		{
			uint32_t dataIndex = partitionKey + i;
			if (AS_DATAPTR(m_buffer)[dataIndex].state == Data::Occupied
				&& AS_DATAPTR(m_buffer)[dataIndex].pair.key == key)
			{
				return Iterator(AS_DATAPTR(m_buffer), m_capacity, dataIndex);
			}
		}

		++k;
		// We've exceeded the max number of probes registered. No way the required key exists
		if (k > m_maxProbes)
		{
			return end();
		}
	}

	return end();
}

HASHTABLE_TEMPLATE_DECL
void HASHTABLE_CLASS::RehashIfRequired()
{
	// If we are probing alot, it's time to Re-Hash
	if ((m_maxProbes > m_numKeys / 2) && (m_size > m_capacity / 4))
	{
		std::cout << "Rehashing. Reasion: Too many probes into table\n";
		PrintTableVitals();
		Rehash();
		return;
	}
	//If max load goes beyond the threshold, it's also time to rehash
	if (MaxLoadExceeded())
	{
		std::cout << "Rehashing. Reason: Max Load Exceeded\n";
		PrintTableVitals();
		Rehash();
		return;
	}
}

HASHTABLE_TEMPLATE_DECL
void HASHTABLE_CLASS::Rehash()
{
	//This is fater compared to when chaining is used
	uint32_t prevCapacity = m_capacity;
	uint32_t prevNumKeys = m_numKeys;
	void* oldBuffer = m_buffer;
	RequestCapacity(m_numKeys << 1);

	m_maxProbes = 0;
	m_size = 0;
	m_currentLoad = 0;
	assert(oldBuffer != nullptr && "It's not possible that old buffer is null.");

	//Insert all valid elements from old buffer into this
	for (uint32_t i = 0; i < prevCapacity; ++i)
	{
		DataRef cData = AS_DATAPTR(oldBuffer)[i];
		if (cData.state == Data::Occupied)
		{
			Insert_Internal(cData);
		}
	}
	free(oldBuffer);
}

HASHTABLE_TEMPLATE_DECL
bool HASHTABLE_CLASS::MaxLoadExceeded() const
{
	return m_currentLoad >= m_maxLoadAllowed;
}

HASHTABLE_TEMPLATE_DECL
void HASHTABLE_CLASS::RequestCapacity(uint32_t capacity)
{
	//TODO: Add a upper bound check
	assert(capacity < (1ul << (31ul - c_NumPartitions)) && "Cannot allocated requested capacity. Table will overflow");
	m_numKeys = 1 << (uint32_t)std::ceil(std::log2f((float)capacity));
	m_capacity = m_numKeys * c_NumPartitions;
	m_buffer = RequestMemory();
}

HASHTABLE_TEMPLATE_DECL
void* HASHTABLE_CLASS::RequestMemory() const
{
	assert(c_NumPartitions > 0 && c_NumPartitions <= 5 && "Unsupported number of partitions given");

	uint32_t size = (m_numKeys * c_NumPartitions) * sizeof(Data);
	assert(size % alignof(Data) == 0 && "Size should be a multiple of alignment");
	void* data = malloc(size);
	assert(data != nullptr && "OS couldnt grant us required memory");
	if (data != nullptr)
	{
		memset(data, 0, size);
	}
	return data;
}

//=======================================================================================
// Iterator =============================================================================

HASHTABLE_TEMPLATE_DECL
HASHTABLE_ITERATOR_CLASS::Iterator(const Data* const buffer, uint32_t size, uint32_t index)
	: m_buffer(buffer)
	, m_size(size)
	, m_index(index)
{
	//Get to the first valid element
	while (m_buffer[m_index].state != Data::Occupied && m_index < m_size)
	{
		++m_index;
	}
}

HASHTABLE_TEMPLATE_DECL
HASHTABLE_ITERATOR_TYPE& HASHTABLE_ITERATOR_CLASS::operator++()
{
	++m_index;
	while (m_buffer[m_index].state != Data::Occupied && m_index < m_size)
	{
		++m_index;
	}
	return *this;
}

HASHTABLE_TEMPLATE_DECL
HASHTABLE_ITERATOR_TYPE& HASHTABLE_ITERATOR_CLASS::operator--()
{
	--m_index;
	while (m_buffer[m_index].state != Data::Occupied && m_index >= 0)
	{
		--m_index;
	}
	return *this;
}

HASHTABLE_TEMPLATE_DECL
const HASHTABLE_DATA_PAIR_TYPE& HASHTABLE_ITERATOR_CLASS::operator*() const
{
	return (PairRef)m_buffer[m_index].pair;
}

HASHTABLE_TEMPLATE_DECL
bool HASHTABLE_ITERATOR_CLASS::operator==(const Iterator& other) const
{
	return m_index == other.m_index;
}

HASHTABLE_TEMPLATE_DECL
bool HASHTABLE_ITERATOR_CLASS::operator!=(const Iterator& other) const
{
	return !(*this == other);
}

//=======================================================================================
// Debug Functions ======================================================================

HASHTABLE_TEMPLATE_DECL
void HASHTABLE_CLASS::Print()
{
	PrintTableVitals();

	for (auto itr : (*this))
	{
		std::cout << "Key: " << itr.key << "\tValue: " << itr.value << "\n";
	}
}

HASHTABLE_TEMPLATE_DECL
void HASHTABLE_CLASS::PrintTableVitals() const
{
	std::cout << "=================== Printing Vitals ======================\n";
	std::cout << "Num Keys: " << m_numKeys << "\n";
	std::cout << "Num Partitions: " << (uint32_t)c_NumPartitions << "\n";
	std::cout << "Max capacity: " << m_capacity << "\n";
	std::cout << "Size: " << m_size << "\n";
	std::cout << "Max Probes: " << m_maxProbes << "\n";
	std::cout << "Max Load allowed: " << m_maxLoadAllowed << "\n";
	std::cout << "Current Load: " << m_currentLoad << "\n";
	std::cout << "==========================================================\n";
}

HASHTABLE_TEMPLATE_DECL
void HASHTABLE_CLASS::ProbeWholeTable() const
{
	srand((uint32_t)time(0) % 7);
	uint32_t key = rand();

	std::cout << "Started Probing with Key: " << key << "\n";
	uint32_t k = 0;
	while (k < m_numKeys)
	{
		uint32_t hashKey = hash(k, m_numKeys, key);
		uint32_t partitionKey = hashKey * c_NumPartitions;
		std::cout << "Probed Slot: " << hashKey << "\n";
		for (int i = 0; i < c_NumPartitions; ++i)
		{
			uint32_t dataIndex = partitionKey + i;
			AS_DATAPTR(m_buffer)[dataIndex] = { 0, 0, Data::Occupied };
		}
		++k;
	}

	bool wholeTableProbed = true;
	for (uint32_t i = 0; i < m_capacity; ++i)
	{
		if (AS_DATAPTR(m_buffer)[i].state != Data::Occupied)
		{
			wholeTableProbed = false;
			break;
		}
	}
	if (wholeTableProbed)
	{
		std::cout << "Finished probing whole table\n";
	}
	else
	{
		std::cout << "Could not probe whole table. Hash function would lead to bubbles in table\n";
	}
}

//=======================================================================================

Performance Analysis

I’ve done some performance analysis using the following code below:

    HashTable<int, int> hashTable(10);
    std::unordered_map<int, int> stlMap(10);

    // Insert
    auto start = std::chrono::system_clock::now();
    for (int i = 0; i < 10000; ++i)
    {
        hashTable.Insert(i, i);
    }
    auto end = std::chrono::system_clock::now();
    std::cout << "Insert Time for custom HashTable: " << (end - start).count() << "\n";

    start = std::chrono::system_clock::now();
    for (int i = 0; i < 10000; ++i)
    {
        stlMap.insert({ i, i });
    }
    end = std::chrono::system_clock::now();
    std::cout << "Insert Time for STL HashTable: " << (end - start).count() << "\n";

    //Delete
    start = std::chrono::system_clock::now();
    for (int i = 0; i < 5000; ++i)
        hashTable.Remove(i);
    }
    end = std::chrono::system_clock::now();
    std::cout << "Deletion Time for custom HashTable: " << (end - start).count() << "\n";

    start = std::chrono::system_clock::now();
    for (int i = 0; i < 5000; ++i)
    {
        stlMap.erase(stlMap.find(i));
    }
    end = std::chrono::system_clock::now();
    std::cout << "Deletion Time for STL HashTable: " << (end - start).count() << "\n";

For initial size of 10, following is the console output:
– Insert Time for custom HashTable: 396716
– Insert Time for STL HashTable: 502560
– Deletion Time for custom HashTable: 26462
– Deletion Time for STL HashTable: 107265

Upon increasing the initial size of our hash table to 4096, there’s minimal re-hashing. The insert time also improved considerably!!
Following is the output with initial size set to 4096 on both tables:
– Insert Time for custom HashTable: 128295
– Insert Time for STL HashTable: 488191
– Deletion Time for custom HashTable: 18199
– Deletion Time for STL HashTable: 106090

The time taken for deletion remains consistent in both the cases and is quite faster than the STL hash table.
It’s definitely not a general purpose table, but will be quite useful on performance critical portions of code.

Parallel Quick Sort: A Study With Implementation

April 16th, 2023

Introduction

Quick sort is definitely quite an efficient algorithm and is quite efficient if size of data set is low or moderate. That is why I decided it’s worth exploring how it performs under a multi-threaded scenario. While going through “Concurrency In Action” by Anthony Williams(Amazing read btw!!), I had quite some trouble implementing the example presented in Listing 8.1. The basic idea of that implementation was that the “right” recursion step(pivot to last element) of quick sort algorithm was submitted as a batch for any thread to pick up. It is also free to take any work that’s already submitted while it waits for result from other threads.
This quickly went out of hand and my code started exceeding stack limits. So, I went on an expedition to find a clue on how to implement a Parallel Quick Sort and came across this paper and was able to wrap my head around “Shared-Address-Space Parallel Formulation” and decided to give it a shot!

With this algorithm, I was able to achieve speeds nearly ~3-8x faster than serial quick-sort. Speed largely depends on the number of threads we are using and how the elements are finally split into batches. There would be a decrease in performance if all batches are not roughly of same size.

Serial Quick Sort

Quick sort is a Divide and Conquer algorithm. We split the problem into smaller chunks, solve them and combine. Before working on our parallel algorithm, let’s first break down what’s happening in a serial quick sort in following points:

We have an input Array(A) of size N, the first and last indices of elements that should be sorted.
We calculate a pivot element in the array. To make things simple, this can just be the starting element in the array provided.
Using this pivot, we divide (A) into 2 parts. Left half will have elements smaller than (A) and right half will have elements greater than or equal to (A).
We repeat the step above until (first <= last) and then the function returns.
Once the function returns, it would already have sorted the sub array perfectly and no additional step for merging the arrays is required.

    void doSortSerial(int first, int last, int* data)
    {
        numRecursion.fetch_add(1);
        if(first >= last)
        {
            return;
        }
        //First element is our pivot;
        int index = first;//std::rand() % (last - first);
        int pivot = data[index];
        //Split at pivot
        int pivotPos = first;
        for(int i = first + 1; i <= last; i++)
        {
            if(data[i] < pivot)
            {
                //Exchange data with pivot pos
                ++pivotPos;
                int temp = data[i];
                data[i] = data[pivotPos];
                data[pivotPos] = temp;
            }
        }
        //Exchange pivot with beginning element
        data[first] = data[pivotPos];
        data[pivotPos] = pivot;
        
        doSortSerial(first, pivotPos - 1, data);
        doSortSerial(pivotPos + 1, last, data);
    }

Understanding Parallel Implementation

Now that the obvious is out of the way, let’s try and understand what it means to parallelize an algorithm. In it’s simplest essence, parallelizing an algorithm is taking advantage of multiple processors and split the work we are given equally in hopes of achieving an increase in computation speed in proportion to number of processors we have dedicated for the problem.
What if we split the array we are given into uniform chunks and make each thread take care of sorting that particular chunk? Well, that was my initial thought track and there’s one obvious problem with this approach, which is depicted below:

If the problem is not already obvious, if we naively split the given array into multiple chunks and perform quick sort on them, the individual chunks would be sorted, but there’s no guarantee that the entire array would be!
There is a way to make this work if we put in some additional guidelines. The main problem is that, when we split the array arbitrarily, there isn’t any rule imposed on which elements should go into which part of the array. Therefore, we can end up with elements in left split that are greater than elements in right split and vice versa.
Therefore, we need to impose a rule asking all elements smaller than a certain value to take up the left split and elements that are greater to take up the right split. This would let us sort multiple chunks of array parallelly and the final array would just be a combination of all the chunks.

This is the overall idea of how to parallelize quick-sort. There are a few steps from the paper I glossed over, which I’ll enumerate in the subsequent sections. But, we can now formulate a strategy to parallelize our quick sort!!

Parallel Quick Sort: getting Ready

Algorithm can be broken down into following steps:

We have an input Array(A) of size N, the first and last indices of elements that should be sorted.
We decide the number of threads(T) on which algorithm is going to run.
Input array is initially split into 1 “Group” with T “Batches”. Each batch runs on a separate thread, so a data race would be impossible.
A Group is a collection of batches and contains a “Key” that’s used for segregating the batches that fall into the group.
Elements in a batch are split into 2 parts based on Group’s key. Left part contains elements smaller than key and right half contains elements that are greater. Each batch runs on it’s own thread.
Once we have split the individual batches based on group’s key, we now need to split elements in the entire group based on the same key. This can be done with the help of prefix sum arrays and an auxiliary array.
The group is now split into two and steps 3 – 7 are repeated until number of groups becomes equal to the number of threads(T).
Once Groups becomes equal to Threads, we would’ve split the input array in such a way that elements in left batch are always less than the elements in the right batch. Therefore, serial quick sort can now be performed on (T) individual batches to get the final sorted array.

Before implementing the algorithm, let’s define a few things that lets us easily split the work across threads. I created a naïve Function Wrapper that lets me store a function with arguments that I want to call and is easily accessible by all the threads. This is completely optional and you could go with any other way for submitting work to threads. All of the code will reside in the class Parallel_Quicksort.

const int numThreads = 8;
class Parallel_Quicksort
{
    enum class State
    {
        Initializing,
        Working,
        WorkReady,
        Done,
    };

    struct FunctionWrapperBase
    {
        void invoke()
        {
            valid = false;
            callback();
        }

        protected:
        std::function<void()> callback;
        bool valid = false;
    };

    template<typename T, typename OBJECT, typename ...PARAM>
    struct FunctionWrapper : public FunctionWrapperBase
    {
        FunctionWrapper(){};
        FunctionWrapper(T func, OBJECT object, PARAM... param)
        {
            callback = [func, object, param...](){
                (object->*func)(param...);
            };
            valid = true;
        }
    };
    template <typename T, typename ...PARAM>
    struct FunctionWrapper<T, nullptr_t, PARAM...> : public FunctionWrapperBase
    {
        FunctionWrapper(T func, PARAM... param)
        {
            callback = [func, param...](){
                (*func)(param...);
            };
            valid = true;
        }
    };
    
    template<typename T, typename OBJECT, typename ...PARAM>
    static FunctionWrapper<T,OBJECT, PARAM...> createWrapper(T func, OBJECT object, PARAM... param)
    {
            return FunctionWrapper<T,OBJECT, PARAM...>(func, object, param...);
    }
    template<typename T, typename ...PARAM>
    static FunctionWrapper<T, nullptr_t, PARAM...> createGlobalWrapper(T func, PARAM... param)
    {
            return FunctionWrapper<T, nullptr_t, PARAM...>(func, param...);
    }

    struct Work
    {
        Work() : done(true) {}
        Work(FunctionWrapperBase work)
        {
            wrapper = work;
            done = false;
        }
        void doWork()
        {
            done = false;
            wrapper.invoke();
            done = true;
        }

        FunctionWrapperBase wrapper;
        bool done = true;
    };
void run(int id)
    {
        while(state != State::Done)
        {
            std::this_thread::yield();
            doWork(id);
        }
    }

    void doWork(int id)
    {
        if(!workBatches[id].done)
        {
            workBatches[id].doWork();
            updateWork();
        }
    }

    void updateWork()
    {
        bool done = true;
        for(Work& work : workBatches)
        {
            done &= work.done;
        }
        if(done)
        {
            state = State::WorkReady;
        }
    }

    void waitForWork()
    {
        while(state == State::Working)
        {
            std::this_thread::yield();
        }
    }

Parallel Quick Sort: Implementation

Let’s first define two really important structures: Group and Batch. To re-iterate, a group is a collection of batches and a Batch is a portion of input array that is run on a dedicated thread.
In the code below, lowerDistMedian and upperDistMedian are used for calculating Key of the group so that elements can be split uniformly when quick sort is performed on each batch. Batch ID corresponds with the thread ID that this batch is supposed to run on.
We then start off the algorithm by calling “doSortParallel”, which creates batches containing start and end of indices of elements in the array that this batch would operate on. An auxiliary array “tmpData” is also created to help us with global re-arrangement of data.

struct Batch
    {
        int id = -1;
        int start = -1;
        int end = -1;
        bool batchReady = false;
        size_t lowerDistMedian = 0;
        size_t upperDistMedian = 0;
    };

    struct Group
    {
        std::vector<Batch> batches = std::vector<Batch>();
        int key = -1;
        int start = -1;
        int end = -1;
        std::vector<int> lowerElemPrefixSum;
        std::vector<int> higherElemPrefixSum; 
    };

    void doSortParallel(int first, int last, int* data)
    {
        size_t numElements = (last - first) + 1;
        int elementsPerBatch = numElements/numThreads;

        int excess = numElements % numThreads;

        workBatches.resize(numThreads);
        std::vector<Batch> batches(numThreads);
        
        int current = 0;
        for(int i = 0; i < numThreads; i++)
        {
            Batch batch;
            int numElements = elementsPerBatch;
            if(excess != 0)
            {
                --excess;
                ++numElements;
            }
            batch.id = i;

            batch.start = current;
            current += numElements;
            batch.end = current - 1;

            batches[i] = batch;
        }
        Group group;
        group.batches = batches;
        group.start = first;
        group.end = last;
        // Assuming elements are [0, maxElementsNum), key is selected as mid point for uniform distribution
        
        group.key = numElements / 2;
        
        group.lowerElemPrefixSum.resize(batches.size() + 1);
        group.higherElemPrefixSum.resize(batches.size() + 1);
        
        groups.push_back(group);

        size_t size = numElements * sizeof(int);
        int* tmpData = (int*)malloc(size);
        memcpy(tmpData, data, size);

        //Start Threads
        for(size_t i = 1; i < numThreads; i++)
        {
            threads[i-1] = std::thread(&Parallel_Quicksort::run, this, i);
        }

        doSortParallel_Impl(first, last, data, tmpData);

        free(tmpData);
    }

To split elements in a batch based on group’s key, we employ the same technique as in quick sort algorithm by splitting elements based on a pivot. While going through each element in a batch, we keep track of the highest and lowest element of the two halves and store a median value for better key calculation when groups are split!

Once elements are split locally in their respective batches, they must be re-arranged so that all the elements of the group are now split based on the key. This is the global rearrangement step and is done with the help of prefix Sum computations. The code for the two operations is as follows:

//This sorts data in the temp data batch
    //Happens in parallel
    void sortBatch(Group& group, int batchIndex, int* tmpData)
    {
        Batch& batch = std::ref(group.batches[batchIndex]);
        int dataSize = batch.end - batch.start + 1;
        if(dataSize < 2)
        {
            return;
        }

        //An expensive operation
        int swapPos = batch.start;
        int numLowerHalf = 0;

        int minLower = INT32_MAX; int maxLower = 0;
        int minUpper = INT32_MAX; int maxUpper = 0;
        for(int i = batch.start; i <= batch.end; i++)
        {
            if(tmpData[i] < group.key)
            {
                if(tmpData[i] < minLower)
                {
                    minLower = tmpData[i];
                }
                if(tmpData[i] > maxLower)
                {
                    maxLower = tmpData[i]; 
                }

                int temp = tmpData[i];
                tmpData[i] = tmpData[swapPos];
                tmpData[swapPos] = temp;
                
                ++swapPos;
                ++numLowerHalf;
            }
            else
            {
                if(tmpData[i] < minUpper)
                {
                    minUpper = tmpData[i];
                }
                if(tmpData[i] > maxUpper)
                {
                    maxUpper = tmpData[i]; 
                }
            }
        }
        batch.lowerDistMedian = (minLower + maxLower) / 2;   
        batch.upperDistMedian = (minUpper + maxUpper) / 2;   


        group.lowerElemPrefixSum[batchIndex] = numLowerHalf;
        group.higherElemPrefixSum[batchIndex] = dataSize - numLowerHalf;
    }

    //Happens in parallel.
    void mergeSortedWithOriginal(Group& group, int batchIndex, int* data, int* tmpData) 
    {
        int start = group.start;
        Batch& batch = group.batches[batchIndex];

        int pos = start + group.lowerElemPrefixSum[batchIndex];
        size_t size = group.lowerElemPrefixSum[batchIndex + 1] - group.lowerElemPrefixSum[batchIndex];
        int dataSize = sizeof(data[0]);
        
        memcpy((void*)(data + pos), (void*)(tmpData + batch.start), size * dataSize);
        
        int initial = size;
        pos = start + group.higherElemPrefixSum[batchIndex];
        size = group.higherElemPrefixSum[batchIndex + 1] - group.higherElemPrefixSum[batchIndex];
        memcpy((void*)(data + pos), (void*)(tmpData + batch.start + initial) , size * dataSize);
    }

After global re-arrangement, we have a group with elements that are clearly split on the given key. We then split the group into two. First group contains all the elements that are smaller than the key and the second group contains all the elements that are larger than the key.
To make things simple, batches are equally split between the two new groups and we use prefix sum arrays to calculate distribution of elements. This is how it’s done:

    void splitGroups()
    {
        //This should be quite a trivial operation. Loops groups * batches times
        std::vector<Group> newGroups;
        for(size_t i = 0; i < groups.size(); i++)
        {
            //split a group based on key if it has multiple batches 
            size_t numBatches = groups[i].batches.size();
            
            int currentLowerPos = groups[i].start;
            int currentHigherPos = currentLowerPos + groups[i].higherElemPrefixSum[0];

            if(groups[i].batches.size() > 1)
            {
                int firstGroupBatches = groups[i].batches.size() / 2;
                int secondGroupBatches =  groups[i].batches.size() - firstGroupBatches;

                std::vector<int> lowerElements;
                std::vector<int> higherElements;

                Group g1 = Group();
                g1.start = groups[i].start;
                Group g2 = Group();
                g2.start = currentHigherPos;
                g2.end = groups[i].end;
                
                int numLowerHalfElements = groups[i].lowerElemPrefixSum[numBatches];
                int elementsPerBatch = numLowerHalfElements / firstGroupBatches;
                int excess = numLowerHalfElements % firstGroupBatches;

                int key = 0;
                for(int j = 0; j < firstGroupBatches; j++)
                {
                    //Move batches from G. This retains the ID information
                    g1.batches.push_back(groups[i].batches[j]);
                    int elementsInBatch = elementsPerBatch;
                    if(excess)
                    {
                        ++elementsInBatch;
                        --excess;
                    }
                    int endPos = currentLowerPos + elementsInBatch - 1;
                    g1.batches[j].start  = currentLowerPos;
                    g1.batches[j].end  = endPos;
                    g1.end = endPos;
                    
                    key += g1.batches[j].lowerDistMedian;
                    g1.batches[j].lowerDistMedian = 0;
                    
                    currentLowerPos += elementsInBatch;
                }
                g1.key = key / firstGroupBatches;

                int numHigherHalfElements = groups[i].higherElemPrefixSum[numBatches] - numLowerHalfElements;
                elementsPerBatch = numHigherHalfElements / secondGroupBatches;
                excess = numHigherHalfElements % secondGroupBatches;
                
                key = 0;
                for(int j = 0; j < secondGroupBatches; j++)
                {
                    int batchIndex = firstGroupBatches + j;
                    g2.batches.push_back(groups[i].batches[batchIndex]);
                    int elementsInBatch = elementsPerBatch;
                    if(excess)
                    {
                        ++elementsInBatch;
                        --excess;
                    }
                    int endPos = currentHigherPos + elementsInBatch - 1;
                    g2.batches[j].start  = currentHigherPos;
                    g2.batches[j].end  = endPos;
                    
                    key += g2.batches[j].upperDistMedian;
                    g2.batches[j].upperDistMedian = 0;

                    currentHigherPos += elementsInBatch;
                }
                g2.key = key / secondGroupBatches;

                g1.lowerElemPrefixSum.resize(g1.batches.size() + 1);
                g1.higherElemPrefixSum.resize(g1.batches.size() + 1);

                g2.lowerElemPrefixSum.resize(g2.batches.size() + 1);
                g2.higherElemPrefixSum.resize(g2.batches.size() + 1);
                if(g1.batches.size() > 0)
                {
                    newGroups.push_back(g1);
                }
                if(g2.batches.size() > 0)
                {
                    newGroups.push_back(g2);
                }
            }
            if(groups[i].batches.size() == 1)
            {
                newGroups.push_back(groups[i]);
            }
        }
        
        groups = newGroups;
    }

Finally, tying it all together is our main implementation function, which performs following steps:

Loops through groups and
- submits each batch to split elements locally if no. of groups is not equal to no. of threads.
- submits each batch to perform quick sort if no. of groups is equal to no. of threads.
Waits for (1) to complete. If groups is equal to no. of threads here, we would’ve performed quick sort and the function returns.
Calculates prefix sums of lower and upper half elements of all the batches in the group and stores in its respective group.
Performs global re-arrangement step. The portion of original array that group is operating on is modified to contain elements that are split based on the key. This is a parallel operation.
Waits for (4) to complete.
Finally current groups are iterated and split into two with the help of prefix sum arrays. One of the new groups contains all the elements that are lower than the key and the other contains elements that are greater!

void doSortParallel_Impl(int first, int last, int* data, int* tmpData)
    {
        for(size_t i = 0; i < groups.size(); i++)
        {
            if(groups[i].batches.size() == 0)
            {
                continue;
            }

            // Start running all groups at once on each thread.
            // Wastes time if run before groups size becomes equal to numThreads
            if(groups[i].batches.size() == 1 && groups.size() == numThreads)
            {
                if(groups[i].batches[0].batchReady)
                {
                    continue;
                }
                //In Serial sort, eveything is sorted in lower batch step
                int dataSize = groups[i].batches[0].end - groups[i].batches[0].start + 1;
                groups[i].lowerElemPrefixSum[0] = dataSize;
                groups[i].higherElemPrefixSum[0] = 0;
                
                state = State::Working;

                auto wrapper = createWrapper(&Parallel_Quicksort::doSortSerial, this,
                groups[i].batches[0].start, groups[i].batches[0].end, data);
                workBatches[groups[i].batches[0].id] = Work(wrapper);

                groups[i].batches[0].batchReady = true;
            }
            else
            {
                //Check if batches are greater than number of threads and bail out
                int size  = groups[i].batches.size();
                for(size_t j = 0; j < groups[i].batches.size(); j++)
                {
                    if(groups[i].batches[j].start >= groups[i].batches[j].end)
                    {
                        continue;
                    }
                    state = State::Working;
                    auto wrapper = createWrapper(&Parallel_Quicksort::sortBatch, this, std::ref(groups[i]), j, tmpData);
                    workBatches[groups[i].batches[j].id] = Work(wrapper);
                }
            }
        }

        doWork(0);
        waitForWork();
        
        if(groups.size() == numThreads)
        {
            state = State::Done;
            return;
        }

        //Calculate Prefix sums in respective groups
        for(size_t i = 0; i < groups.size(); i++)
        {
            int prevValue = groups[i].lowerElemPrefixSum[0];
            groups[i].lowerElemPrefixSum[0] = 0;

            int totalLower = 0;
            for(size_t j = 1; j < groups[i].lowerElemPrefixSum.size(); j++)
            {
                int oldLower = prevValue;
                prevValue = groups[i].lowerElemPrefixSum[j]; 
                groups[i].lowerElemPrefixSum[j] = groups[i].lowerElemPrefixSum[j - 1] + oldLower;
                totalLower = groups[i].lowerElemPrefixSum[j];
            }

            prevValue = groups[i].higherElemPrefixSum[0];
            groups[i].higherElemPrefixSum[0] = totalLower;
            for(size_t j = 1; j < groups[i].higherElemPrefixSum.size(); j++)
            {
                int oldHigher = prevValue;
                prevValue = groups[i].higherElemPrefixSum[j]; 
                groups[i].higherElemPrefixSum[j] = groups[i].higherElemPrefixSum[j - 1] + oldHigher;
            }

        } 


        for(size_t i = 0; i < groups.size(); i++)
        {
            for(size_t j = 0; j < groups[i].batches.size(); j++)
            {
                int dataSize = groups[i].batches[j].end - groups[i].batches[j].start + 1;
                if(dataSize == 0)
                {
                    continue;
                }
                state = State::Working;
                auto wrapper = createWrapper(&Parallel_Quicksort::mergeSortedWithOriginal,
                    this, std::ref(groups[i]), j, data, tmpData);
                workBatches[groups[i].batches[j].id] = Work(wrapper);
            }
        }

        doWork(0);
        waitForWork();
        
        size_t size = (last - first + 1) * sizeof(int);
        memcpy(tmpData, data, size);

        splitGroups();

        doSortParallel_Impl(first, last, data, tmpData);
    }

That’s all folks! Hope you got to learn something 🙂

Red-Black Trees: Delete and Delete-Fixup Analysis

February 25th, 2023

Deleting a node is the hardest part of implementing an RB-Tree. Let’s write a helper function that moves the parent of node U to node V. This function is called “Transplant”. We will also use another helper function to get the least node in a sub-tree.

void transplant(RBNode* u, RBNode* v)
{
    if(u->m_parent == &sentinel) { root = v; }
    else if(u->m_parent->m_left == u) { u->m_parent->m_left = v; }
    else { u->m_parent->m_right = v; }
    v->m_parent = u->m_parent;
}
RBNode* getMinimumInSubTree(RBNode* node)
{
    RBNode* current = node;
    while(current->m_left != &sentinel)
    {
        current = current->m_left;
    }
    return current;
}

When deleting, we keep track of the color of node that’s being deleted and also a node(X) that takes the place of the node that’s being deleted, which might introduce violations. And, 2 things must be considered when deleting a node:
– If node has a single/no child. In this case, we store the original color of this node and make left/right child or sentinel take the node’s place.
– If node has 2 valid children, we get the least node in the right sub-tree(Y) and have that replace the node being deleted. During this operation, there should be another node that takes Y’s place. That will be Y->Right, our X in this case. Here, we keep track of Y’s(successor ) original color.
Once we delete the node, we call a DeleteFixup function if the original color of deleted node is BLACK.

void remove(RBNode* node)
{
    RBNode* y = node;
    bool yOriginalColorIsRed = y->m_isRed;
    RBNode* x = &sentinel;
    
    if(node->m_left == &sentinel)
    {
        x = node->m_right;
        transplant(node, node->m_right);
    }
    else if(node->m_right == &sentinel)
    {
        x = node->m_left;
        transplant(node, node->m_left);
    }
    else
    {
        //Node has 2 valid children
        y = getMinimumInSubTree(node->m_right);
        yOriginalColorIsRed = y->m_isRed;
        // we color Y the same color as z.
        // y->right take's Y's place.
        x = y->m_right;
        //Node being deleted is not immediate parent of Y.
        if(y->m_parent == node)
        {
            //This check is for sentinel. If y has no children, we store the parent info in sentinel
            x->m_parent = y;
        }
        else
        {
            //Move X To Y's place
            transplant(y, x);
            y->m_right = node->m_right;
            y->m_right->m_parent = y;   
        }
        //Move Y To Z'a Place and take it's color
        transplant(node, y);
        y->m_left = node->m_left;
        node->m_left->m_parent = y;
        y->m_isRed = node->m_isRed;
    }
    if(!yOriginalColorIsRed){ RBDeleteFixup(x); }
}

Delete-Fixup:

Once deleted, we could have following violations:
– If the delete node is root, then the node that replaces it might be RED.
– If the node replacing could be a RED node and it’s parent could also be RED.
– The number of black nodes in any simple path might not be equal anymore.
To fix the third scenario, we introduce “Double Blackness” and “Red/Black”. If the node that replaces the deleted node is BLACK/RED, the sub-tree would be deficit by one black node. If the replacing node is BLACK, then it would be “Double Black”, and “RED-BLACK” otherwise.

Double-Black and Red-Black are kind of abstract properties. They dont need to be represented as data in code. Now that we have established that, let’s look at the Delete Fixup scenarios. Here, I assume the node under focus is left child and “Sibling” is a right child. We are looking at 4 cases here:
1. If sibling is RED, it must mean that our parent is BLACK. We can switch colors of Sibling to BLACK and parent to RED and perform Left Rotation on parent and update the Sibling’s pointer. The might not fix the violations introduced, but it transforms the problem in a way that ensures sibling is BLACK and will be handled by following cases.

2. If sibling is BLACK and both it’s children are also BLACK. This case is simple. We just color sibling RED and make the active node as our current parent. Since we have an Double-Blackness on the left subtree, making sibling RED will make it lose a black node, thereby balancing both sub-trees.

3. If sibling is BLACK and it’s left child is RED. We color sibling RED and it’s left child to BLACK and perform RightRotation on sibling to convert this to Case 4. This might not fix the violations, but leads into Case-4

4. If sibling is BLACK and it’s right child is RED, sibling takes the color of it’s parent and parent is made RED and LeftRotation is performed on Parent. Now, the left sub-tree will have an additional black node, fixing the violation.

Hope this helped! Please like the posts if you’ve learnt something. Really boosts the motivation and drive to post more content.

Here’s the Code Snippet

void RBDeleteFixup(RBNode* node)
{
    while(node != root && !node->m_isRed)
    {
        if(node->m_parent->m_right == node)
        {
            //Uncle should always be present
            RBNode* sibling = node->m_parent->m_left;
            if(sibling->m_isRed)
            {
                sibling->m_isRed = false;
                sibling->m_parent->m_isRed = true;
                RightRotate(sibling->m_parent);
                sibling = node->m_parent->m_left;
            }
            //Case - 2: Uncle is black and Left and right children of uncle are null or red
            if(sibling == &sentinel || (!sibling->m_left->m_isRed && !sibling->m_right->m_isRed))
            {
                if(sibling != &sentinel)
                {
                    sibling->m_isRed = true;
                }
                // Extra blackness is added to parent, making in RB or BB to compensate for removing 
                // a black on uncle and node
                node = node->m_parent; 
            }
            else
            {   
                //Uncle's right is red, transform to make both left and right as black 
                if(sibling->m_right->m_isRed)
                {
                    sibling->m_right->m_isRed = false;
                    sibling->m_isRed = true;
                    LeftRotate(sibling);
                    sibling = node->m_parent->m_left;
                }
                sibling->m_isRed = node->m_parent->m_isRed;
                node->m_parent->m_isRed = false;
                sibling->m_left->m_isRed = false;

                RightRotate(node->m_parent);
                node = root;
            }
        }
        else
        { //Same as previous block with Left and Right exchanged}
    }
    node->m_isRed = false;
}

Red Black Trees Insertion and Insert-Fixup Analysis

February 24th, 2023

To insert a node into an RB-Tree, we first insert it the same way we insert a node into a binary-search tree. We start from root and compare it’s value against the value of node being added. If lesser, we compare it against the left sub-tree and right sub-tree otherwise. This is repeated until a NULL node is encountered, where this new node is inserted. After insertion, color it “Red” and perform an RB-Table fixup to maintain the properties of RB-Tree. Let’s look at the code for Red Black Trees Insertion:

void insert(RBNode* node)
{
    if(root == &sentinel)
    {
        node->m_isRed = false;
        root = node;
        return;
    }
    RBNode* current = root;
    RBNode* target = root;
    while(current != &sentinel)
    {
        target = current;
        if(node->m_size > current->m_size){ current = current->m_right; }
        else if (node->m_size == current->m_size) {
            if(node > current) { current = current->m_right; }
            else { current = current->m_left; }
        }
        else { current = current->m_left; }
    }
    if(node->m_size > target->m_size) { target->m_right = node; }
    else if(node->m_size == target->m_size) {
        if(node > target) { target->m_right = node; }
        else { target->m_left = node; }
    }
    else { target->m_left = node; }
    node->m_parent = target;
    node->m_isRed = true;
    RBInsertFixup(node);
}

Insert-Fixup:

Once we inserted new node and colored it RED, there are 2 possible violations to the properties of RB-Tree. First, if the inserted node is Root, we colored it red, when it should’ve been black. Second, if the parent of inserted node is red, then we have violated 5th property which states that both children of RED nodes should be BLACK.

Now that we have inserted a node into the tree, let’s look at implementing the fix-up function. For this, I consider the node’s parent to be a left-child and “uncle” as node->parent->parent->right. With that, we will have 3 scenarios:
– If uncle is RED, we will color the uncle and inserted node’s parent BLACK. Then, make node’s parent as the next node to be processed in the loop.
– If the uncle is black and node being checked is a right-child. In this case, we try to transform it into 3rd scenario by making the next node to be processed as current node’s parent and performing Left-Rotate operation on inserted node.
– Finally, if uncle is black and node being checked is left child, we color node->parent as BLACK and node->parent->parent as RED.

Here’s the code snippet

void RBInsertFixup(RBNode* node)
{
    while(node->m_parent != &sentinel && node->m_parent->m_isRed)
    {
        //If Node's parent is red, we have a violation in RB-Tree Property
        if(node->m_parent->m_parent->m_right == node->m_parent)
        {
            //If the current node's parent is right child
            //Case - 1: We know that node->parent->parent(NPP) is black. So, if NPP's left child is red, 
            //we can simply switch colors
            RBNode* uncle = node->m_parent->m_parent->m_left;
            if(uncle != &sentinel && uncle->m_isRed)
            {
                node->m_parent->m_parent->m_isRed = true;
                uncle->m_isRed = false;
                node->m_parent->m_isRed = false;
                node = node->m_parent->m_parent;
            }
            // Now we know that NPP's left child is black or nullptr. Now, depending on whether current node is
            // left or right child, we perform rotations
            else
            {
                //The goal here is to make newly added node as a root to node's parent and parent's parent.
                if(node->m_parent->m_left == node)
                {
                    // We right rotate here to transform newly added node as parent.
                    // With this node's parent effectively becomes it's right-child
                    // and Node's parent's parent will become node's parent
                    node = node->m_parent;
                    RightRotate(node);
                }
                // Finally Left Rotate on node's parent.
                // Newly added node will become the new parent.
                node->m_parent->m_isRed = false;
                node->m_parent->m_parent->m_isRed = true;
                LeftRotate(node->m_parent->m_parent);
            }
        }
        else if(node->m_parent->m_parent->m_left == node->m_parent)
        { //Same as the block above with left and right interchanged }
    }
    root->m_isRed = false;
}

Red Black Trees: A brief study with code

February 24th, 2023

Let’s try and de-mystify Red Black trees.
It’s basically a binary-search tree with an additional property, color. Having this property makes it possible to “approximately” balance it. By approximately balance, I mean that going from root to leaf, there’s no path that is twice as long as any other. Let’s first define the structure of RB-tree node as shown on below.
We also use a sentinel node as leaf node to help with the tree fix-up operations. So, all leaf nodes in a RB-Tree are sentinels.

extern RBNode sentinel;
extern RBNode* root;
struct RBNode
{
     RBNode(size_t size) : m_size(size) {}
     RBNode*         m_parent = &sentinel;
     RBNode*         m_left = &sentinel;
     RBNode*         m_right = &sentinel;
     size_t          m_size = { 0 };
     bool            m_isRed = false;
};

A Red-Black tree must satisfy following properties:
– Every node is red or black.
– The root of the node should be black.
– Every leaf should be black.
– Children of a red node should be black.
– For each node, all paths from node to it’s leaves should contain same number of black nodes.

During the course of Insertion and deletion operations, various properties of out RB-Tree might be violated. We shall use Left and Right rotations to restore those traits.

When left-rotating, X should already contain a right-child(Y). Y’s right node will not change and X’s left node will not change.

void LeftRotate(RBNode* x)
{
    // Pre-condition that must be satisfied here is that
    // X already has a valid right child(Y).
    // Y takes X's place
    //Move parents
    RBNode* y = x->m_right;
    y->m_parent = x->m_parent;
    x->m_parent = y;
    if(y->m_parent == &sentinel){ root = y; }
    else if(y->m_parent->m_right == x){ y->m_parent->m_right = y; }
    else { y->m_parent->m_left = y; }
    //x becomes y's left node. y's left node becomes x's right node
    x->m_right = y->m_left;
    if(x->m_right != &sentinel) { x->m_right->m_parent = x; }
    y->m_left = x;
}

When right-rotating, X should already contain a left-child(Y). Y’s left node will not change and X’s right node will not change.

void RightRotate(RBNode* x)
{
    // Pre-Condition that must be satisfied here
    // is that X has a vaild left node(Y). 
    // Y takes X's place
    RBNode* y = x->m_left;
    y->m_parent = x->m_parent;
    x->m_parent = y;
    if(y->m_parent == &sentinel){ root = y; }
    else if(y->m_parent->m_right == x) { y->m_parent->m_right = y; }
    else  { y->m_parent->m_left = y; }

    x->m_left = y->m_right;
    if(x->m_left != &sentinel) { x->m_left->m_parent = x; }
    y->m_right = x;
}