
Building a Document-Oriented Database in Python

In this tutorial, we will learn how to build a Document-Oriented Database from scratch using Python. A Document-Oriented Database is a type of NoSQL database that stores and retrieves data in the form of flexible, JSON-like documents. We will explore the key concepts of Document-Oriented Databases, including creating a database class, adding, deleting, and modifying documents, as well as querying and retrieving data based on specific conditions.

Throughout the tutorial, we will also discuss important topics such as JSON-like documents, performance considerations, and persisting data to disk for long-term storage. By the end of this tutorial, you will have a practical understanding of how Document-Oriented Databases work and be able to create your own basic version using Python. So let's get started and dive into the world of Document-Oriented Databases!

In this section, we create our document database class. This tutorial assumes intermediate programming experience, so you should already be familiar with Object-Oriented Programming (OOP). In OOP, a class is a blueprint for creating objects. Here the class models the database itself, and each instance holds the documents we want to store.

In Python, a class definition begins with the keyword class followed by the class name. Ours is DocumentDB, representing our Document-Oriented Database. The class will have methods for creating, deleting, and modifying key-value pairs, where each value is a document.

The __init__ method initializes an instance's attributes. We will use a Python dictionary named docs to store our documents. This is similar to a HashMap in Java or an Object in JavaScript.

Our class will have the add_doc, delete_doc, and modify_doc methods, covering three of the four basic CRUD operations (Create, Read, Update, Delete); for now, reading is done by accessing docs directly. CRUD operations are fundamental in any persistent storage system. You can think of these operations as similar to trading stocks in finance: adding a document is like buying a stock, deleting a document is like selling, and modifying a document is like adjusting your position.

The add_doc method takes a key and a value and stores them as a key-value pair in the docs dictionary. The delete_doc method removes the key-value pair for the given key, and modify_doc replaces the value stored under the given key. In the code below, delete_doc and modify_doc do nothing when the key does not exist.

For now, these methods operate in memory, meaning that changes are lost when the program exits. We will look at ways to persist these changes in a later section.

class DocumentDB:
    def __init__(self):
        # Maps each key to its stored document.
        self.docs = {}

    def add_doc(self, key, value):
        # Store the document, overwriting any existing one under the same key.
        self.docs[key] = value

    def delete_doc(self, key):
        # Remove the document if present; a missing key is silently ignored.
        if key in self.docs:
            del self.docs[key]

    def modify_doc(self, key, value):
        # Replace the document only if the key already exists.
        if key in self.docs:
            self.docs[key] = value

if __name__ == "__main__":
    db = DocumentDB()
    db.add_doc('1', 'Hello World')
    print(db.docs)
    db.modify_doc('1', 'Hello Algodaily')
    print(db.docs)
    db.delete_doc('1')
    print(db.docs)
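
The class above covers create, update, and delete. A read method would round out CRUD. The sketch below shows one possible way to add it; get_doc and its default parameter are hypothetical names chosen for illustration, not part of the original class, and the class is repeated in shortened form only so the example runs on its own.

```python
class DocumentDB:
    """Minimal in-memory store, repeated here so the example is self-contained."""

    def __init__(self):
        self.docs = {}

    def add_doc(self, key, value):
        self.docs[key] = value

    def get_doc(self, key, default=None):
        # Hypothetical read method: return the stored document, or `default`
        # when the key is absent, instead of raising KeyError.
        return self.docs.get(key, default)


db = DocumentDB()
db.add_doc('1', {'title': 'Hello World'})
found = db.get_doc('1')      # key exists
missing = db.get_doc('2')    # key does not exist
print(found)
print(missing)
```

Returning a default rather than raising keeps the method consistent with delete_doc and modify_doc, which also tolerate missing keys.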

Let's test your knowledge. Click the correct answer from the options.

What is the purpose of the __init__ method in our DocumentDB class?

Click the option that best answers the question.

To add a new document
To initialize attributes of a class
To delete a document
To modify a document

JSON, which stands for JavaScript Object Notation, is a format for storing and transporting data. JSON documents are lightweight, human-readable, and based on a subset of the JavaScript programming language.

JSON is often described as "self-describing" and easy to understand. A JSON object is an unordered collection of key-value pairs. Keys must be strings, while values can be strings, numbers, objects, arrays, booleans, or null. JSON objects and Python dictionaries are quite similar, both being collections of key-value pairs, although Python dictionaries also accept non-string keys.
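
The similarity between JSON and Python dictionaries can be checked with the standard library's json module. The following is a small sketch with made-up sample data; it shows how Python values map to JSON values and what happens to a non-string key.

```python
import json

doc = {'name': 'John Doe', 'age': 30, 'active': True, 'manager': None}

# Serialize the dictionary to JSON text: True becomes true, None becomes null.
text = json.dumps(doc)
print(text)

# Parse the text back into a Python dictionary.
restored = json.loads(text)
print(restored == doc)

# JSON keys are always strings, so an integer key comes back as a string.
round_tripped = json.loads(json.dumps({1: 'one'}))
print(round_tripped)
```

The last step is worth remembering when choosing document IDs: an ID stored as an integer key will be read back as a string after a JSON round trip.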

For example, let's represent a software engineer's profile using a Python dictionary that serves as a JSON-like document. This dictionary includes values of various types (e.g., strings, integers, lists). The profile describes someone interested in AI and finance, and it is the document we will store in our Document-Oriented Database.

Here's how we would represent this structure as a JSON-like Python dictionary:

profile = {
    'name': 'John Doe',
    'age': 30,
    'profession': 'Software Engineer',
    'languages': ['Python', 'JavaScript', 'C++'],
    'interests': ['AI', 'Finance']
}

To represent data hierarchically or to establish relationships among data, we might prefer JSON-like documents over tabular storage since they offer more flexibility and better readability.

if __name__ == "__main__":
    profile = {
        'name': 'John Doe',
        'age': 30,
        'profession': 'Software Engineer',
        'languages': ['Python', 'JavaScript', 'C++'],
        'interests': ['AI', 'Finance']
    }
    print(profile)

Are you sure you're getting this? Click the correct answer from the options.

Which of the following statements about JSON-like documents is incorrect?

Click the option that best answers the question.

They are lightweight and human-readable
They are based on a subset of the JavaScript Programming Language
They can only contain key-value pairs where the values are strings
They are similar to Python dictionaries

Once we've defined our database and the type of documents it will store, the next step is storing these documents.

In a Document-Oriented Database, storing a document involves adding it to the collection of documents. For this, we usually implement a method named add or insert. This method receives a document (a dictionary, in our Python case) and adds it to the database, usually generating an ID to reference it later.

In addition, since our documents are dictionary-like objects, they can contain nested dictionaries, which lets them hold hierarchical data. This is an advantage over traditional SQL databases, where storing such hierarchical data typically requires splitting it across several related tables.
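
To make the idea of hierarchical data concrete, here is a minimal sketch of a nested document. The address and geo fields and their values are hypothetical and serve only as illustration.

```python
profile = {
    'name': 'John Doe',
    'address': {                      # nested document (hypothetical field)
        'city': 'Berlin',
        'geo': {'lat': 52.52, 'lon': 13.40},
    },
    'languages': ['Python', 'JavaScript', 'C++'],
}

# Walk down the hierarchy with chained key lookups.
city = profile['address']['city']
lat = profile['address']['geo']['lat']
print(city, lat)

# .get() with a default avoids a KeyError when an optional field is absent.
zip_code = profile.get('address', {}).get('zip', 'unknown')
print(zip_code)
```

The whole profile, including its nested parts, can be stored under a single ID, whereas a relational design would usually need a separate table for addresses.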

Let's consider an example where we store our earlier created profile of a software engineer to our Document-Oriented Database:

In the Python code snippet below we define our 'database' as a dictionary. We then present a function add_document(db, document) which adds a document to the database and returns a generated ID for the inserted document.

import uuid

def add_document(db, document):
    # Generate a random ID and store the document under it.
    doc_id = str(uuid.uuid4())
    db[doc_id] = document
    return doc_id

if __name__ == '__main__':
    profile = {
        'name': 'John Doe',
        'age': 30,
        'profession': 'Software Engineer',
        'languages': ['Python', 'JavaScript', 'C++'],
        'interests': ['AI', 'Finance']
    }

    database = {}

    doc_id = add_document(database, profile)
    print(f'Document inserted with ID: {doc_id}')

    print('Database content:')
    for key, value in database.items():
        print(f'ID: {key}, Document: {value}')

Let's test your knowledge. Click the correct answer from the options.

Imagine you're building a document-oriented database in Python. Your add_document function takes two parameters: db, representing the database, and document, a JSON-like object you want to store. Suppose db is a dictionary and you intend to add document to it by associating it with a unique id. Select the correct syntax for adding document to db.

Click the option that best answers the question.

`db[id] = document`
`db.add(id, document)`
`db.put(id, document)`
`db.insert(id, document)`

After successfully storing documents in our database, the next step is to actually use that data. To do this, we commonly use a get or retrieve method. This method typically accepts an ID corresponding to the document we are interested in, and returns that document.

Retrieving a document from our document-oriented database is as simple as accessing the value of a key in a Python dictionary. We can implement a function get_document(db, id), where db is our database and id is the identifier for our document.

With these methods in place, it's easy to see how we could build out the operations of a more sophisticated database system, even topping it off with an interface for a language such as SQL.

In complex systems like distributed databases or big data applications, retrieving documents might not be as straightforward due to concerns like data consistency, network latencies or fault tolerance. However, those are more advanced topics that go beyond the scope of this simple tutorial. For now, we're focusing on the fundamental concept of retrieving documents from a document-oriented database.

Below is the Python code that retrieves a document from our database:

def get_document(db, id):
    # Return the document stored under the given ID, or raise if it is absent.
    if id in db:
        return db[id]
    raise KeyError('Document with given ID is not found in the database.')

if __name__ == "__main__":
    db = {'1': {'name': 'Software Engineer', 'skills': ['python', 'java', 'c++', 'data analysis']}}

    try:
        doc = get_document(db, '1')
        print(doc)
    except KeyError as e:
        print(e)

Try this exercise. Is this statement true or false?

Retrieving a document from our document-oriented database is as complex as a SQL query.

Press true if you believe the statement is correct, or false otherwise.

Being able to retrieve a single document is useful, but often we will want to fetch multiple documents based on some condition. For this, we typically use a query method. This is where our document database starts to shine, as the flexibility of a JSON-like structure allows us to easily query nested attributes.

If you have worked on software related to AI or finance, or on any system that fetches data based on specific conditions, you have likely written queries before. For our document database, a basic query operation scans every stored document and checks whether it matches the given condition.

Let's illustrate this with a query operation that retrieves all documents where the key customer_id is equal to 12345. The Python code below implements it.

def query_document(db, key, value):
    # Linear scan: collect every document whose `key` equals `value`.
    result = []
    for document in db.values():
        if document.get(key) == value:
            result.append(document)
    return result

if __name__ == '__main__':
    db = {
        'order1': {'customer_id': 12345, 'product': 'Book'},
        'order2': {'customer_id': 12346, 'product': 'Shirt'},
        'order3': {'customer_id': 12345, 'product': 'Shoes'}
    }

    # Query the database
    print(query_document(db, 'customer_id', 12345))
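
The query above only looks at top-level keys. Since the text mentions querying nested attributes, here is one possible sketch of how that could work using a dotted path such as 'customer.id'. The helper names get_path and query_nested, the dotted-path convention, and the sample data are all assumptions made for this illustration.

```python
def get_path(document, path):
    """Follow a dotted path such as 'customer.id'; return None if any step is missing."""
    current = document
    for part in path.split('.'):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def query_nested(db, path, value):
    """Return every document whose value at `path` equals `value` (linear scan)."""
    return [doc for doc in db.values() if get_path(doc, path) == value]


db = {
    'order1': {'customer': {'id': 12345, 'city': 'Oslo'}, 'product': 'Book'},
    'order2': {'customer': {'id': 12346, 'city': 'Lima'}, 'product': 'Shirt'},
    'order3': {'customer': {'id': 12345, 'city': 'Oslo'}, 'product': 'Shoes'},
}

matches = query_nested(db, 'customer.id', 12345)
print([doc['product'] for doc in matches])
```

One limitation of this sketch: because a missing path yields None, querying for the value None would also match documents that lack the field entirely.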

Let's test your knowledge. Is this statement true or false?

A basic query operation in a document database fetches a single document that exactly matches the given condition.

Press true if you believe the statement is correct, or false otherwise.

Performance is a crucial consideration when designing a database system. While the flexibility of our document-oriented database provides rich querying capabilities, it can also lead to performance issues if not properly managed.

Interactions with the database (like retrieval or storage of a document) can be quite costly in terms of time and resources. The time complexity of the operations can range from O(1) to O(n), depending on how efficiently we design our database.

Fetching documents based on some condition involves scanning through all documents - an operation of O(n) complexity. Iterating through every document is only feasible when we have a small amount of data. As data grows, this method becomes less and less efficient, and alternatives should be considered.

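One performance improvement is to maintain an index, such as a hash table keyed on the queried field, to speed up search operations. For equality lookups, this can reduce the average time complexity from O(n) to O(1), at the cost of extra memory and of keeping the index up to date on every write.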
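
As a rough illustration of the indexing idea, the sketch below builds a hash index over one field using a plain dictionary. The function name build_index and the sample orders are assumptions for this example; it also assumes the indexed values are hashable and does not update the index when documents change.

```python
from collections import defaultdict


def build_index(db, field):
    """Map each value of `field` to the IDs of the documents that hold it."""
    index = defaultdict(list)
    for doc_id, document in db.items():
        if field in document:
            index[document[field]].append(doc_id)
    return index


db = {
    'order1': {'customer_id': 12345, 'product': 'Book'},
    'order2': {'customer_id': 12346, 'product': 'Shirt'},
    'order3': {'customer_id': 12345, 'product': 'Shoes'},
}

# Building the index is a one-time O(n) pass over the documents.
by_customer = build_index(db, 'customer_id')

# An equality lookup is now a dictionary access (O(1) on average),
# plus the cost of collecting the matching documents.
ids = by_customer.get(12345, [])
print(ids)
print([db[i]['product'] for i in ids])
```

A fuller implementation would also have to update the index inside the add, modify, and delete operations so that it never goes stale.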

Performance in Python can be measured using the timeit module, which provides a simple way to time small snippets of Python code. It has both a command-line interface and a callable one. The timeit.timeit function runs the setup statement once, then executes the main statement a configurable number of times (one million by default) and returns the total time taken in seconds.

In the code below, we use timeit.default_timer to time a loop that squares each number in a range. By using this module you can monitor the performance of your document-oriented database as you make changes and optimizations, and check whether each change actually helps.

import timeit

if __name__ == "__main__":
    start_time = timeit.default_timer()
    # Code to measure goes here
    for i in range(1, 100):
        i**2
    elapsed = timeit.default_timer() - start_time

    print('Time elapsed: ', elapsed)
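
The callable interface mentioned above, timeit.timeit, can also be applied directly to a database operation. The sketch below times a linear scan over a synthetic data set; the data set size, the scan function, and the repeat count of 50 are arbitrary choices for illustration.

```python
import timeit

# Synthetic data set: 10,000 orders spread evenly over 100 customers.
db = {f'order{i}': {'customer_id': i % 100} for i in range(10_000)}


def scan(customer_id):
    # Linear scan: inspects every document in the database.
    return [k for k, d in db.items() if d['customer_id'] == customer_id]


# timeit.timeit accepts a callable and runs it `number` times,
# returning the total elapsed time in seconds as a float.
total = timeit.timeit(lambda: scan(42), number=50)
print(f'50 scans took {total:.4f} s ({total / 50:.6f} s per scan)')
```

Running the statement many times and dividing by the count gives a steadier estimate than timing a single run, since individual runs are affected by whatever else the machine is doing.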

Try this exercise. Fill in the missing part by typing it in.

The time complexity of fetching documents based on some condition could be improved from O(n) to __ by using indexes or hashing structures.

Write the missing line below.

Now that we have a working document-oriented database, we need to think about how we can make the data in our database persistent. In its current state, if the Python script is stopped or crashes, all stored data will be lost. This is because our database exists only in the application's memory.

In production usage, data is stored on disk because disk storage is non-volatile; that is, the data doesn't disappear when the system is turned off. This is achieved through methods such as serializing the data and writing it to a file, or using the persistence mechanisms built into a database system.

Since our current implementation is a dictionary, we can use the json module that comes with Python to serialize and deserialize our data. Serialization is the process of converting our data structure into a format that can be stored. Deserialization is the reverse process, where we convert our stored format back into our data structure.

With the json module, we can serialize our dictionary to a JSON file on disk and load it back into memory as needed. This simple strategy turns our volatile in-memory storage into persistent storage. The code below serializes the dictionary to a JSON file and loads it back into memory.

Keep in mind that this is a simplified example, and actual databases use much more sophisticated techniques for managing and persisting data.

import json

if __name__ == "__main__":
    db = {"AI": 1, "finance": 2, "programming": 3}
    print('Before persisting: ', db)

    # Save (serialize) dictionary to a json file
    with open('db.json', 'w') as json_file:
        json.dump(db, json_file)

    # Simulate a situation where the in-memory dictionary disappears
    del db

    # Load (deserialize) dictionary from a json file
    with open('db.json') as json_file:
        db = json.load(json_file)
    print('After persisting: ', db)
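
The same save and load steps can be wrapped in small helper functions so that a program can start from an existing file or from an empty database. This is only a sketch: save_db and load_db are hypothetical names, and the example writes to a temporary directory so it leaves no files behind.

```python
import json
import os
import tempfile


def save_db(db, path):
    """Serialize the whole database dictionary to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(db, f)


def load_db(path):
    """Load the database from disk, or start empty if no file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding='utf-8') as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'db.json')
    first_load = load_db(path)      # no file yet, so we get an empty database
    save_db({'1': {'name': 'John Doe', 'age': 30}}, path)
    second_load = load_db(path)     # read back what was just saved

print(first_load)
print(second_load)
```

Note that this approach rewrites the entire file on every save, which is fine for a small tutorial database but becomes slow as the data grows.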

Are you sure you're getting this? Click the correct answer from the options.

Persistent storage in a document-oriented database prevents data loss when:

Click the option that best answers the question.

When the script is stopped
During server crashes
On system shutdown
All of the above

Throughout this lesson, we've explored Document-Oriented Databases from a practical angle, using JSON-like documents to organize data in the style of popular databases like MongoDB.

We began by setting up a basic Python class that acted as our database, complete with methods to create, delete, and modify key-value pairs. We then looked at JSON-like documents, their structure, and how they can be stored within our Document-Oriented Database. To facilitate interaction with our database, we implemented functions to store, retrieve, and query JSON-like documents.

To make our database more robust, we considered several performance-enhancing ideas that use Python's data structures and built-in functionality. Finally, we discussed how to make our data persistent, extending our database's capabilities beyond volatile memory.

This process shows you a miniature version of how commercial Document-Oriented Databases operate, and by building your own, you have developed an understanding of the under-the-hood workings of these databases.


Are you sure you're getting this? Is this statement true or false?

Document-Oriented Databases are not suitable for handling JSON-like data structures.

Press true if you believe the statement is correct, or false otherwise.
