How to craft a database engine

Manthan Chauhan · Published in Dev Genius · Feb 23, 2024

Platform tools like database engines have always fascinated me: they provide the base on which other applications are built, and studying them offers deeper insight into how these engineering marvels work.

In this post, I’ll describe how to implement a persistent key-value database engine. We’ll explore the underlying complexities of data storage, indexing, durability, and data fragmentation.

The design we are going to implement is based on Bitcask, a persistent key-value store designed for high write throughput and fast reads. You can learn more about it in the original paper.

I used Go, but the concepts discussed apply across programming languages. For those interested, the Go code is available on GitHub.

Explore the code repository here: https://github.com/manthanchauhan/key_val_db_engine/tree/hashIndex

Architecture Overview

We are using the log-structured hash table design for data storage due to its simplicity. In this design, data is stored in a sequence of append-only logs or files. When new data is written, it is appended to the end of the latest file.
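
For a concrete picture, here is a minimal sketch of what appending a record to the active file could look like. The line-based "key,value" format and the AppendRecord name are illustrative assumptions; the actual repository encodes records through its dataSegment package, which appears in the CRUD code below.

// AppendRecord appends one "key,value" line to the end of the given file
// (which must be opened in append mode) and returns the number of bytes written.
// The record format here is a simplification for illustration.
func AppendRecord(f *os.File, key string, value string) (int, error) {
    record := key + "," + value + "\n"
    return f.WriteString(record)
}

The byte count returned by the write is what lets us compute each record's byte offset, as the CREATE implementation later in this post shows.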

Why append-only?

  1. Appending to a file is more efficient than searching through it and updating an existing record in place.
  2. Better concurrency handling. Since we never modify an existing record, a reader is far less likely to see a record in a partially overwritten, inconsistent state.
  3. Better data durability, since older data is never modified and can be recovered in the event of a system crash or mid-write failure.

This also means that writing a key several times will not clear the older values associated with that key, wasting disk space. We’ll deal with this garbage collection later.

It is important to note that at any instant there can be multiple data files, but only the latest one is open for writes; the rest are read-only forever.

Data is also indexed using a hash table, where each key is mapped to the filename and the byte offset of its latest entry. The byte offset is the distance, measured in bytes, from the beginning of the file to the respective data entry. Hence, with the filename and byte offset available, we can locate any entry on disk in O(1) time.

This allows for fast key-based lookups, as the hash table provides direct access to the location of the data on disk.
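
As a reference point, a minimal in-memory index could be little more than a Go map guarded by a mutex. The sketch below is an assumption about the shape of such an index, not the repository's exact hashIndex implementation; only the "fileName:byteOffset" data-location format mirrors what the rest of this post uses.

import "sync"

var (
    indexMu sync.RWMutex
    index   = make(map[string]string) // key -> "fileName:byteOffset"
)

// Put records (or overwrites) the latest on-disk location of a key.
func Put(key string, dataLocation string) {
    indexMu.Lock()
    defer indexMu.Unlock()
    index[key] = dataLocation
}

// GetDataLocation returns the stored "fileName:byteOffset" for a key,
// or an empty string if the key has never been written.
func GetDataLocation(key string) string {
    indexMu.RLock()
    defer indexMu.RUnlock()
    return index[key]
}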

Note that the hash table resides in volatile memory and hence needs to be rebuilt at every program startup. This can be done by reading all the data files sequentially and adding each key to the hash table.
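
As a rough sketch, the rebuild can replay every segment through the same parsing helper the compression code uses later in this post. Here, getDataSegmentFileNamesInOrder and hashIndex.Put are hypothetical names assumed for illustration; disk.ParseDataSegment and utils.GetDataLocationFromByteOffset appear in the compression code below.

// Rebuild the in-memory hash index on startup by replaying every data
// segment from oldest to newest, so that later entries for a key
// overwrite earlier ones.
func rebuildHashIndex() {
    for _, fileName := range getDataSegmentFileNamesInOrder() {
        disk.ParseDataSegment(fileName, func(k string, v string, byteOffset int64) {
            dataLocation := utils.GetDataLocationFromByteOffset(fileName, byteOffset)
            hashIndex.Put(k, dataLocation) // keep only the latest location per key
        })
    }
}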

Implementing CRUD

Having understood the underlying data storage architecture, CRUD operations can now be implemented as follows.

CREATE (key, value)

First add the new record to the active data file, then add the key to the hash table, recording the file and the byte offset where the value is stored.

// https://github.com/manthanchauhan/key_val_db_engine/blob/hashIndex/disk/disk.go#L37

func Create(key string, val string) string {
    logFileName := LatestSegmentName

    // Open the active data file
    f, deferFunc := GetLogFile(logFileName, os.O_APPEND|os.O_WRONLY)
    defer deferFunc(f)

    // Write into the active data file and
    // get the number of bytes written in return
    byteCount := dataSegment.Write(key, val, f)
    fileSize := GetSegmentFileSize(logFileName)

    // Calculate the byte offset.
    // The record we just added is the last record in the file,
    // so subtract byteCount from the total bytes in the file.
    byteOffset := fileSize - int64(byteCount)
    dataLocation := utils.GetDataLocationFromByteOffset(logFileName, byteOffset)

    // If the current file is full, create a new active file
    if fileSize >= constants.LogFileMaxSizeBytes {
        createNextDataSegment()
    }

    // This data location will be stored in the hash table
    return dataLocation
}

READ (key)

Get the file name and byte offset from the hash index. Read the record from the file using the byte offset.

// https://github.com/manthanchauhan/key_val_db_engine/blob/hashIndex/disk/disk.go#L18

// Read "file_name:byte_offset" from hash index
// Get data stored at the "file_name:byte_offset"
func Read(dataLocation string) string {
    // Get fileName & byteOffset from the value stored in the hash table
    fileName, byteOffset := ExtractFileNameAndOffset(dataLocation)

    // Open the file
    f, deferFunc := GetLogFile(fileName, os.O_RDONLY)
    defer deferFunc(f)

    // Seek to the byte offset
    _, err := f.Seek(byteOffset, 0)
    if err != nil {
        panic(err)
    }

    // Read a single record starting at the byte offset
    scanner := bufio.NewScanner(f)
    scanner.Split(dataSegment.SplitAt(constants.LogNewLineDelim))
    scanner.Scan()
    s := scanner.Text()

    // Return the data read
    return s
}

UPDATE(key, value) and DELETE(key)

Both UPDATE and DELETE work the same way as CREATE. For DELETE, we’ll write a special tombstone value and handle it in code: if this value is encountered while reading any data file, the key is considered deleted.

func Update(key string, val string) string {
    return Create(key, val)
}

func Delete(key string) string {
    // #!DEL!# is a special tombstone value
    return Create(key, "#!DEL!#")
}
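
To make the tombstone concrete, here is a sketch of how a read path on top of the hash index might filter deleted keys. Get, the Tombstone constant, and the empty-string check for missing keys are illustrative assumptions layered over the Read and GetDataLocation functions shown in this post.

const Tombstone = "#!DEL!#"

// Get returns the latest value for a key, and false if the key
// is missing or has been deleted.
func Get(key string) (string, bool) {
    dataLocation := hashIndex.GetDataLocation(key)
    if dataLocation == "" {
        // The key was never written.
        return "", false
    }

    val := disk.Read(dataLocation)
    if val == Tombstone {
        // The latest entry for this key is a delete marker.
        return "", false
    }
    return val, true
}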

Making It Efficient

Having implemented CRUD operations over persistent storage, we have a working but suboptimal database engine. The engine has two problems:

  1. The storage needed is proportional to the number of writes performed, which can be far larger than the size of the actual data.
  2. The in-memory index stores every key and hence needs a large amount of expensive memory.

Let’s now solve these problems one-by-one.

Garbage Collection

The problem is:

writing a key several times will not clear the older values associated with that key, wasting disk space.

To solve this, we’ll use a technique called “compress and merge”.

Compression

We’ll focus on the read-only data files and remove their garbage data to reduce the storage requirement of the database.

To check whether a record is active and in use, compare the record’s byte offset with the byte offset stored in the hash index for that key. If the two values match, the record is active.

For each read-only segment,

  1. We’ll create a new, blank data segment.
  2. We’ll then iterate over the original data segment record by record. If a record is active, we’ll copy it to the newly created segment; otherwise, we’ll skip it and move on to the next one.
  3. Once all the active records are copied into the new data segment, we’ll iterate over it and, for each record, update the byte offset stored in the hash index for the respective key. This removes all references to the original data segment.
  4. We can now delete the original data segment.

This removes all the garbage data from the given data segment. We’ll repeat the process for every read-only data segment.

// https://github.com/manthanchauhan/key_val_db_engine/blob/hashIndex/compressAndMerge/compress.go

// Call this for all the read-only data segments
func compressSegment(fileName string) {
    // Create the new compressed segment
    newFileName := createCompressedSegment(fileName)

    isCompressed := disk.GetSegmentFileSize(fileName) > disk.GetSegmentFileSize(newFileName)
    newFileIsEmpty := disk.GetSegmentFileSize(newFileName) == constants.DataSegmentMetaDataByteSize

    if isCompressed {
        if newFileIsEmpty {
            disk.DeleteSegment(newFileName)
        } else {
            // Update the hash index to use the compressed segment
            hashIndex.ImportDataSegment(newFileName, hashMapImportSegmentInitValCheckForCompression(fileName))
        }
        disk.DeleteSegment(fileName)
    } else {
        disk.DeleteSegment(newFileName)
    }
}

func createCompressedSegment(originalSegmentFileName string) string {
    newFileName := disk.CreateNewDataSegment()

    f, deferFunc := disk.GetLogFile(newFileName, os.O_WRONLY|os.O_APPEND)
    defer deferFunc(f)

    // Iterate over originalSegmentFileName and
    // write each active record into the new data segment
    disk.ParseDataSegment(originalSegmentFileName, func(k string, v string, byteOffset int64) {
        dataLocation := utils.GetDataLocationFromByteOffset(originalSegmentFileName, byteOffset)
        isUsed := dataLocation == hashIndex.GetDataLocation(k)
        if isUsed {
            dataSegment.Write(k, v, f)
        }
    })

    return newFileName
}

Merging

Over time, our database engine will create more and more data segments, most of which will be smaller than the configured size limit due to the compression process described above. Thus, we’ll have tiny data segments scattered across persistent storage.

This way our database engine will eventually face the problem of External Fragmentation and consequently File System Fragmentation.

External Fragmentation occurs when data is scattered across the storage in very small chunks leaving only small holes of available storage. The result is that, although free storage is available, it is effectively unusable because it is divided into pieces that are too small individually.

When a storage system is externally fragmented, newly inserted data ends up stored in scattered pieces instead of close together. This slows access, due to the seek time and rotational latency of the read/write head, and adds overhead to track the extra locations. This is called File System Fragmentation.

To contain this situation, we’ll periodically merge together small data segments into a larger data segment of reasonable size.

Here is an overview of how we can do this. Suppose the desired segment size is 5MB.

  1. Retrieve the list of read-only segments along with their file sizes, e.g. [(s1.log: 2 MB), (s2.log: 2 MB), (s3.log: 3 MB), (s4.log: 1 MB)].
  2. Identify the first n segments such that their total file size equals or just exceeds the desired segment size (here, 5 MB). For the given example, we’d pick the first 3 segments, with a total file size of 7 MB.
  3. Create a new data segment and copy the records from each of the selected n segments into it. Then iterate over the new data segment and, for each record, update the byte offset stored in the hash index for the respective key.
  4. Now we can safely delete the initial n segments.
  5. Note that if the total file size of the selected n segments exceeds the desired segment size (5 MB), copy only part of the last segment and exempt it from deletion. That segment will eventually be compressed and then merged in later iterations.
  6. Repeat this process starting from the (n+1)-th segment.

This way, we merge many small data segments into a few data segments of reasonable size.

func merge() {
    fileNames := getReadOnlySegmentFileNames()

    if len(fileNames) < 2 {
        return
    }

    for i := 0; i < len(fileNames); {
        // Find the segment range [start, end) to merge together
        start := i
        var end int
        var j int
        var mergedSegmentSize int64 = 0

        // constants.LogFileMaxSizeBytes is the desired segment size
        for j = i + 1; j <= len(fileNames) && mergedSegmentSize < constants.LogFileMaxSizeBytes; j++ {
            mergedSegmentSize += disk.GetSegmentFileSize(fileNames[j-1])
        }
        if mergedSegmentSize > constants.LogFileMaxSizeBytes {
            end = j - 2
        } else {
            end = j - 1
        }

        // Prepare the merged segment
        mergedSegmentFileName := mergeSegments(fileNames[start:end])

        // Update the references in the hash index
        hashIndex.ImportDataSegment(mergedSegmentFileName, hashMapImportSegmentInitValCheckForMerging(fileNames[start:end]))

        // Delete the original data segments
        deleteSegments(fileNames[start:end])

        i = end
    }
}
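
The mergeSegments helper used above isn’t shown in this post; a plausible sketch, reusing the parsing and write helpers from the compression code, might look like the following (the exact signature and behavior are assumptions, not the repository’s implementation).

// mergeSegments copies every record from the selected read-only segments
// into one freshly created segment and returns its file name. Stale records
// are not filtered here; the compression step that runs before merging has
// already removed them.
func mergeSegments(fileNames []string) string {
    mergedFileName := disk.CreateNewDataSegment()

    f, deferFunc := disk.GetLogFile(mergedFileName, os.O_WRONLY|os.O_APPEND)
    defer deferFunc(f)

    for _, fileName := range fileNames {
        disk.ParseDataSegment(fileName, func(k string, v string, byteOffset int64) {
            dataSegment.Write(k, v, f)
        })
    }

    return mergedFileName
}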

We’ll schedule the compress & merge processes to run periodically in the background so that our data is organised neatly in more or less uniform segments without any garbage.

This can be done in Go using goroutines and a ticker.

func CompressionAndMergingGoRoutine() {
    // Create a ticker that fires an event every 10s
    ticker := time.NewTicker(10 * time.Second)

    for {
        select {
        case <-ticker.C:
            // Compress, then merge, whenever an event is received
            compress()
            merge()
        }
    }
}

Sparse Indexing

The last problem yet to be solved is that our in-memory index stores all the keys. This is called dense indexing, and it requires a large amount of in-memory storage.

A dense index contains an index record for every key in the data file, so the number of records in the index is the same as the number of records in the entire dataset.

The alternative to dense indexing is called sparse indexing.

A sparse index contains index records for only some of the items; each index record points to a block of data. Thus, a sparse index uses less memory than a dense one.
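
To make the idea concrete, here is a toy sketch of a sparse index lookup (an illustration, not part of the repository): only the first key of each sorted block of records is kept in memory, and a read scans the block it lands in.

import "sort"

// A sparse index entry points to a whole block of sorted records on disk,
// identified by the first key stored in that block.
type sparseIndexEntry struct {
    firstKey    string
    blockOffset int64
}

// sparseIndex is kept sorted by firstKey.
var sparseIndex []sparseIndexEntry

// findBlockOffset returns the offset of the block that may contain key:
// the last block whose first key is <= key. The caller then scans that
// block sequentially on disk to find the exact record.
func findBlockOffset(key string) (int64, bool) {
    i := sort.Search(len(sparseIndex), func(i int) bool {
        return sparseIndex[i].firstKey > key
    })
    if i == 0 {
        return 0, false // key sorts before every indexed block
    }
    return sparseIndex[i-1].blockOffset, true
}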

To implement sparse indexing in our database, we could use SSTables and LSM-Trees, but those concepts are beyond the scope of this post.

In conclusion, diving into building a database engine offers insights into its workings. If you found this journey enlightening, consider showing support with some claps. Your appreciation fuels further exploration.

Thank you for joining me.
