NOSQL DATABSE EXPLAINED




NoSQL databases have become very popular. Big companies rely on them to store hundreds of petabytes of data and run millions of queries per second. But what is a NoSQL database? How does it work, and why does it scale so much better than traditional, relational databases? Let's start by quickly explaining the problem with relational databases like MySQL, MariaDB, SQL Server, and alike. These are built to store relational data as efficiently as possible. You can have a table for customers, orders, and products, linking together logically: customers place orders and orders contain products. This tight organization is great for managing your data, but it comes at a cost: relational databases have a hard time scaling. They have to maintain these relationships, and that's an intensive process, requiring a lot of memory and compute power. So for a while, you can keep upgrading your database server, but at some point, it won't be able to handle the load. In technical terms, we say that relational databases can scale vertically, but not horizontally, whereas NoSQL databases can scale both vertically and horizontally. You can compare this to a building: vertically scaling means adding more floors to an existing building, while horizontal scaling means adding more buildings. You intuitively understand that vertical scaling is only possible to a certain extend, while horizontal scaling is much more powerful. Why do NoSQL databases scale so well? Well, first of all, they do away with these costly relationships. In NoSQL, every item in the database stands on its own. This simple modification means that they're essentially key-value stores. Each item in the database only has two fields: a unique key and a value. For instance: when you want to store product information, you can use the product's bar code as the key and the product name as the value. This seems restrictive, but the value can be something like a JSON document containing more data, like price and description. This simpler design is why NoSQL databases scale better. If a single database server is not enough to store all your data or handle all the queries, you can split the workload across two or more servers. Each server will then be responsible for only a part of your database. To give an example: Apple runs a NoSQL database that consists of 75,000 servers. In NoSQL terms, these parts of your database are called partitions, and it brings up a question. If your database is split across potentially thousands of partitions, how do you know where an item is stored? That's where the primary key comes in. Remember, NoSQL databases are key-value stores, and the key determines on what partition an item will be stored. Behind-the-scenes, NoSQL databases use a hash function to convert each item's primary key into a number that falls into a fixed range. Say between 0 and 100. This hash value and the range is then used to determine where to store an item. If your database is small enough or doesn't get many requests, you can put everything on a single server. This one will then be responsible for the entire range. If that server is becoming overloaded, you can add a secondary server, which means that the range will be split in half. Server 1 will be responsible for all items with a hash between 0 and 50, while server 2 will store everything between 50 and 100. Theoretically, you've now doubled your database capacity: both in terms of storage and in the number of queries you can execute. This range is also called a keyspace. It's a simple system that solves two problems: where to store new items and where to find existing ones. All you have to do is calculate the hash of an item's key and keep track of which server is responsible for which part of the keyspace. Now, in this example, the range of 0 to 100 is a bit small. It would only allow you to split up your database into 100 pieces at most. So, real NoSQL databases have much bigger key spaces, allowing them to scale almost without restrictions. Besides great scalability, NoSQL is schemaless, which means that items in the database don't need to have the same structure. Each one can be completely different. In a relational database, you have to define your table's structure, and then each item must conform to it. Changing this structure isn't straightforward and could even lead to loss of data. Not having a schema can be a big advantage if your application and data structure is constantly evolving. At this point, it's clear that NoSQL databases have certain advantages over relational ones. But that's not to say that relational databases are obsolete, far from it. NoSQL is more limited in the way you can retrieve your data, only allowing you to retrieve items by their primary key. Finding orders by ID is no problem, but finding all orders above a certain amount would be very inefficient. Relational databases, on the other hand, have no trouble with this. There are workarounds for this issue, but only if you know how you're going to access your data. And that might not always be the case. Another downside is that NoSQL databases are eventually consistent. When you write a new item to the database and try to read it back straight away, it might not be returned. As I've explained, NoSQL splits your database into partitions. But each partition is mirrored across multiple servers. That way, a server can go down without much impact. When you write a new item to the database, one of these mirrors will store the new item and then copy it to the others in the background. This process might take a little bit of time. So when you read that item, the NoSQL database might try to read it from a mirror that doesn't have it yet. This is not a big issue in practice because data is replicated in just a few milliseconds. And if you want consistency, most NoSQL databases do have that option. So, in summary: both NoSQL and relational databases will be around for the foreseeable future. Each with their own strengths and weaknesses. So now you know how NoSQL works, let's look at a few examples. Cloud providers heavily promote NoSQL because they can scale it more easily. AWS has DynamoDB, Google Cloud has BigTable, and Azure has CosmosDB. To give you another example of their scalability: during Amazon Prime Day in 2019, Amazon's NoSQL database peaked at 45 million requests per second. That's mind-boggling! But you can also run NoSQL databases yourself with software like Cassandra (which was developed by Facebook), Scylla, CouchDB, MongoDB, and more. Before ending this post, let's quickly talk about the name "NoSQL." It's a bit confusing as it can be interpreted in two ways. First up: "NoSQL" can mean "not only SQL," pointing to the fact that some NoSQL databases partially understand the SQL query language, on top of their own query capabilities. And secondly, it's often called "NoSQL" in the sense of "non-relational" because it can't easily store relational data.

Post a Comment

Previous Post Next Post