Making Our Own Database

ACM Roorkee
6 min readJul 3, 2021

--

A simple SQLite Clone in C

Hello everyone!! I am Richa, an enthusiastic CSE first-year undergraduate. The summers are already here and it’s time for all code-lovers to do some exciting new projects. Lately, I’ve been quite curious about the working of databases. So, I decided to learn about those by building a simple database, an SQLite clone.

We all come across several applications in our daily life. Almost each of these applications deals with some sort of data, be it a shopping application working on data to manage product details and customers, or a blog that constantly updates web pages or for some application of banks to store, update, retrieve account details or transactions or some applications keeping track of users requiring login or signup pages. It is quite obvious that creating new HTML pages for each update is quite tedious and infeasible. So here comes the need for a database that smoothens dealing with data.

So What Exactly Is A Database?

A database can be treated as a collection of data (in the form of tables), and a database management system helps to organize this data.

When coming across the idea that something like a database exists it is common for several questions that may creep in our minds. How data is stored in memory or disk? How the transfer of data takes place? How a table is designed or the basis on which the rules for making tables are based.

The best way to answer these questions and know how a database works is to build a clone. So let’s write down our SQLite clone in C. So let’s start!

How SQLite Functions?

To know the working of SQLite the documentation of SQLite is quite useful.

Whenever a user writes a query, a set of operations are performed to modify the data. The query acts as input to the front end, which compiles the SQL text into bytecode which can be run by the backend(virtual machine) until it either completes or forms a row of results to be returned or hits an error. The flow of data can be shown through this picture:

SQLite architecture (https://www.sqlite.org/arch.html)

Tokenizer

The SQL statements are first sent to the tokenizer which as its name breaks the string into tokens which are passed to the parser for further action.

Parser

The tokens are given meanings based on their contexts in the parser. It then generates a syntax tree( you can see a demo of how it works with each command here) which represents the parsed statements. This tree is then analyzed by the code generator.

Code Generator

This is the component where the magic happens! The code generator analyzes the parse tree and generates the required bytecode representation of SQL text which is compiled by the back-end. It is this place where code is generated for the SQL clauses, ‘WHERE’, ‘SELECT’, ‘UPDATE’, ‘DELETE’.

Now the role of back-end begins;

Virtual Machine

This truly acts as the heart of SQLite. This component runs a program in virtual machine language that searches, reads, or modifies the database. The program begins when instruction is 0 and continues until any error or stop instruction is provided. After completion of execution of the program, all open database cursors are closed, memory is freed.

B-Tree

It is a data structure used in databases. B-Tree is a Balanced Tree (A balanced tree is a tree where every leaf is “not more than a certain distance” away from the root than any other leaf). This data structure allows data to be stored in several nodes. Pages have a unique number starting from one.

Each storage unit is a page of fixed length. A page can be a reference to another page by using a page number. At the beginning of the page, page meta details (such as the rightmost child page number, first free cell offset, and first cell offset) are stored. We’ll learn more about ‘B-Tree’ and ‘Pages’ later when we implement them in our code.

Pager

It is responsible for reading/writing data at appropriate offsets in a database file. It also keeps a record of accessed pages and determines when those pages need to be written back to disk.

OS Interface

This helps to provide portability across operating systems. This layer differs depending on the operating system SQLite was compiled for.

This was all about the working of SQLite and gives a basic understanding of how things work. So let’s now start the actual coding of our project🤩. Throughout the coming months I’ll go through the following sequence:

Making a REPL

Whenever we start SQLite from the command line, SQLite reads the instructions, runs them, and then prints the result. SQLite starts this Read-Execute-Print-Loop.

Let us start writing our REPL.

For this, we need to write an infinite loop in the main function. Let us define an input_buffer to take the input instruction. This input_buffer will have a string(char*) to store the statement, length to store buffer size, and another variable to store the input length. This input buffer will interact with getline().

Now using this input_buffer to get the input line let’s write the function for reading the input string and a function for printing a simple prompt which prints db> at start of each input line.

The read_input function simply stores the length of input in input_buffer and stores the line in buffer using the getline(). This getline() works as:

ssize_t getline(char **lineptr, size_t *n, FILE *stream);
  • Here lineptr is a pointer to a variable that will point to the buffer which contains the input line.
  • If the lineptr is assigned NULL, it is mallocated by getline and we need to free it before.
  • Initially while defining input_buffer we assigned NULL to buffer, this ensures getline allocates enough memory for the input line and buffer points to that memory. Otherwise, it would have pointed to an invalid memory location, and further if we tried to reallocate the memory it would have created undefined behavior.
  • We’ll store the input line in input_buffer->buffer.

n is a pointer to a variable that stores the size of the allocated buffer. In our case, this is input_buffer->buffer_length.

stream here is the input stream to read from. We will use stdin standard input stream.

ssize_t is the number of bytes read, which is less than the buffer size, input_buffer->input_length.

As mentioned above we need to free our memory allocated to input_buffer and buffer at some point in time, so let’s write a function to do that:

Now we have all the functions ready to write our main block and see the program running. Until now, we haven’t talked about how we will be dealing with tables let’s treat all commands as unrecognized. We can define one .exit command to terminate the program and for all other commands, let’s, for now, print a message indicating an unrecognized command. The main block simply reads the line, executes it, and prints the result.

So with this, we have completed our basic REPL, let’s try it out!😆

The entire code looks like this now,

We’re done with this first part. Next, we’ll try to make a simple compiler and virtual machine. Till then Happy Coding!!

RESOURCES

Kudos to the original author RICHA for writing this awesome blog!

--

--

ACM Roorkee

We are a bunch of enthusiastic students who aim at uniting the computing fraternity at IIT Roorkee under one tag.