There are two common ways to store and organize table data in a database: by row or by column. The choice affects how the data is read and searched. Storing data by column makes analytical queries more efficient.
Definition of a columnar database
A column-oriented DBMS, or columnar DBMS, is a database management system that stores data tables by column rather than by row. Columnar databases have grown in popularity because they offer performance advantages for analytical queries: they are optimized for reading columns and computing on them efficiently.
Let's look at an example. We have a table Customers with id, first_name, last_name, and sum_price columns.
| id | first_name | last_name | sum_price |
|---|---|---|---|
| 1 | John | Smith | 456 |
| 2 | Mike | Johnson | 787 |
| 3 | Nic | Mary | 984 |
A columnar database stores all the values of each column together. For example, this is how our table would be laid out, with each value tagged by the id of its record:
John:1, Mike:2, Nic:3;
Smith:1, Johnson:2, Mary:3;
456:1, 787:2, 984:3;
As you can see, each value is paired with the id of the record it belongs to. This keeps the values in different columns tied to the same record, so a full row can be reassembled when needed, while a search on one column only has to read that column.
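To make the two layouts concrete, here is a minimal Python sketch of the Customers table stored both ways. The names `rows`, `columns`, and `to_columns` are illustrative assumptions and do not come from any real database engine; an actual columnar database keeps its columns in on-disk blocks rather than Python lists.

```python
# The Customers table in a row-oriented layout: one tuple per record.
rows = [
    (1, "John", "Smith", 456),
    (2, "Mike", "Johnson", 787),
    (3, "Nic", "Mary", 984),
]
column_names = ["id", "first_name", "last_name", "sum_price"]


def to_columns(rows, column_names):
    """Regroup row tuples so that all values of one column sit together."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


# The same table in a column-oriented layout: one list per column.
columns = to_columns(rows, column_names)
print(columns["first_name"])  # ['John', 'Mike', 'Nic']

# A full record is rebuilt by reading the same position in every column.
position = columns["id"].index(2)
record = tuple(columns[name][position] for name in column_names)
print(record)  # (2, 'Mike', 'Johnson', 787)
```

In this sketch the position in each list plays the role of the id tag shown above: it is what ties values from different columns back to one record.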
Benefits of columnar databases
Speed. Storing data by column instead of by row makes many queries run much faster: instead of reading row by row, the database can skip irrelevant fields and home in on only the data it needs. For example, if you need to search a specific column, you can go directly to that column and skip the others.
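The effect of skipping columns can be sketched in Python by counting how many values each layout has to read to collect every sum_price. This is a simplified model that treats one value as one unit of work; the counters and variable names are assumptions made for illustration, not measurements of a real database.

```python
rows = [
    (1, "John", "Smith", 456),
    (2, "Mike", "Johnson", 787),
    (3, "Nic", "Mary", 984),
]
columns = {
    "id": [1, 2, 3],
    "first_name": ["John", "Mike", "Nic"],
    "last_name": ["Smith", "Johnson", "Mary"],
    "sum_price": [456, 787, 984],
}

# Row layout: every field of every row is read to reach sum_price.
values_read_row_store = 0
prices_from_rows = []
for row in rows:
    values_read_row_store += len(row)
    prices_from_rows.append(row[3])

# Column layout: only the sum_price column is read.
prices_from_columns = columns["sum_price"]
values_read_column_store = len(prices_from_columns)

print(values_read_row_store, values_read_column_store)  # 12 3
```

The gap widens as a table gets wider: with the same three records and forty columns, the row layout in this model would read 120 values while the column layout would still read 3.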
Scalability. Another major advantage of columnar databases is scalability. Their architecture makes it possible to build very large systems that stay fast as the data grows. Because a query reads only the columns it needs, the amount of data scanned depends on the size of those columns rather than on the width of the whole table.
Compression. Column-oriented databases allow better compression, which matters when you have a high volume of data. The values in a single column share a type and are often similar or repeated, and compression algorithms work more efficiently on similar data. In addition, it’s easy to add a new column to an existing table without shifting all the data on the page, as a row-oriented database has to do.
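The compression benefit can be illustrated with run-length encoding, one simple scheme that works well on repeated values. The sketch below is an assumption-laden illustration: the `country` column and its data are hypothetical and are not part of the Customers table, and real columnar databases choose among several encodings rather than this exact code.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of equal neighbours into a (value, run_length) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column: values of one column are stored next to each other,
# so equal values often form long runs.
country = ["US", "US", "US", "DE", "DE", "FR", "FR", "FR", "FR"]

encoded = run_length_encode(country)
print(encoded)  # [('US', 3), ('DE', 2), ('FR', 4)]
print(run_length_decode(encoded) == country)  # True
```

In a row layout the same country values would be separated by names and prices, so runs like these would not appear and this encoding would save little.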
Aggregation queries. Thanks to their structure, columnar databases perform particularly well with aggregation queries (e.g., SUM, COUNT, AVG), because an aggregate usually needs only the column being aggregated.
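As a rough sketch of why aggregates fit this layout, the Python below computes SUM, COUNT, and AVG over the sum_price column of the example table. The `columns` dictionary is an illustrative stand-in for column storage, not the API of any particular database.

```python
# Column-oriented copy of the Customers table (illustrative stand-in).
columns = {
    "id": [1, 2, 3],
    "first_name": ["John", "Mike", "Nic"],
    "last_name": ["Smith", "Johnson", "Mary"],
    "sum_price": [456, 787, 984],
}

# Each aggregate reads a single contiguous list; other columns are untouched.
prices = columns["sum_price"]
total = sum(prices)            # SUM(sum_price)
count = len(prices)            # COUNT(sum_price)
average = total / count        # AVG(sum_price)

print(total, count, round(average, 2))  # 2227 3 742.33
```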
Disadvantages of columnar databases
Updating and deleting data. Columnar databases are typically less efficient when data has to be updated or deleted, because the values of a single record are spread across separately stored columns and each of them has to be touched. For the same reason, they are generally less suited to incremental data loading, a method of updating a dataset in which only new or modified records are uploaded.
Adding new data. Writing new data takes more time in a columnar database because the value for each column has to be written separately, whereas a row-oriented database can insert a new record in a single write.
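The write cost described above can be sketched in Python by counting how many separate structures each layout touches when a record is added or removed. This is a simplified model: the function names and the new customer record are hypothetical, and real columnar databases use their own techniques for handling writes that are not shown here.

```python
rows = [
    (1, "John", "Smith", 456),
    (2, "Mike", "Johnson", 787),
    (3, "Nic", "Mary", 984),
]
columns = {
    "id": [1, 2, 3],
    "first_name": ["John", "Mike", "Nic"],
    "last_name": ["Smith", "Johnson", "Mary"],
    "sum_price": [456, 787, 984],
}


def insert_row_store(rows, record):
    """Append the whole record in one step; returns the number of writes."""
    rows.append(tuple(record.values()))
    return 1


def insert_column_store(columns, record):
    """Append one value to every column; returns the number of writes."""
    writes = 0
    for name, value in record.items():
        columns[name].append(value)
        writes += 1
    return writes


def delete_column_store(columns, record_id):
    """Remove one record by id; every column has to be modified."""
    position = columns["id"].index(record_id)
    for values in columns.values():
        values.pop(position)
    return len(columns)


# Hypothetical new customer, invented for this example.
new_customer = {"id": 4, "first_name": "Ann", "last_name": "Lee", "sum_price": 120}

row_writes = insert_row_store(rows, new_customer)
column_writes = insert_column_store(columns, new_customer)
print(row_writes, column_writes)  # 1 4

columns_touched = delete_column_store(columns, 2)
print(columns_touched, columns["first_name"])  # 4 ['John', 'Nic', 'Ann']
```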
Online Transaction Processing (OLTP). Column-oriented databases are a poor fit for OLTP applications. OLTP refers to systems that process large numbers of short transactions, each reading or writing a few complete records. Such systems are commonly used for order entry, financial transactions, customer relationship management, and retail sales via the Internet.
Most popular columnar databases
Because of these advantages, many companies that work with large amounts of data use columnar databases. Here is a list of popular column-oriented databases used by big tech companies:
- MariaDB. A fast, robust, and scalable database server with a rich ecosystem of plugins, storage engines, and other database tools. Its columnar storage is provided by the MariaDB ColumnStore engine.
- ClickHouse. A column-oriented database built for fast analytical queries. It is reported to process hundreds of millions to more than a billion rows and tens of gigabytes of data per server per second. ClickHouse is simple to set up and works out of the box.
- Greenplum Database. An advanced database management system with many features. It provides high speed and performance for large data volumes. Greenplum is also open source, so developers can contribute to it.
Conclusion
In this topic, we've talked about columnar (or column-oriented) databases. They are often chosen for their speed and scalability: they can store large amounts of data and search through it quickly. The most popular columnar databases at this time include MariaDB, ClickHouse, and Greenplum Database.
Hope you are ready for some practice! Let's see what you remember.