PostgreSQL is a powerful object-relational database management system with a growing range of extensions that add features and functionality other databases do not offer. One such extension that has recently become fully open source is Citus. Citus Data originated in 2011; Citus was open sourced as an extension to PostgreSQL in 2016, and Citus Data was acquired by Microsoft in 2019. As of June 2022, all Enterprise features have been made available for free, and Citus is now 100% open source!
Citus transforms PostgreSQL into a distributed database by adding sharding, a distributed SQL engine, reference tables, and distributed tables. The distributed database consists of multiple shards of data stored on different nodes, which can lead to considerable improvements in performance and data storage capacity. Citus runs queries in parallel across nodes and keeps more data in memory across the cluster, which offers higher I/O bandwidth and significant performance improvements for multi-tenant SaaS applications, customer-facing real-time analytics dashboards, and time series workloads.
Read on to learn when you should consider using Citus and how to install it.
Citus can scale out operations significantly by adding worker nodes. One Citus user reports up to 80 billion updates per day with a 20-node cluster on Google Cloud with 2.4 TB of memory, 1280 cores, and 80 TB of data. Another Citus user reports 700+ billion events on a 100-node cluster with 1.4 PB of data. As you can tell, the ability of Citus to scale out PostgreSQL is quite substantial!
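To put the first set of figures in perspective, the short calculation below converts them into per-second and per-node numbers. It assumes the load and resources are spread evenly across the 20 nodes and that 1 TB equals 1000 GB; neither assumption comes from the user report, so treat the results as rough orders of magnitude.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted above for the 20-node cluster.
updates_per_day = 80_000_000_000
nodes = 20
memory_gb = 2_400  # 2.4 TB, assuming 1 TB = 1000 GB
cores = 1_280
data_tb = 80

# Assumption: updates are spread evenly over the day and over the nodes.
updates_per_second = updates_per_day / SECONDS_PER_DAY
updates_per_second_per_node = updates_per_second / nodes

print(f"{updates_per_second:,.0f} updates per second across the cluster")
print(f"{updates_per_second_per_node:,.0f} updates per second per node")
print(f"{memory_gb // nodes} GB of memory, {cores // nodes} cores, "
      f"and {data_tb // nodes} TB of data per node")
```

Under these assumptions the cluster sustains just under a million updates per second.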
One Citus use case is the multi-tenant database model, where one database serves many tenants and each tenant's data is kept separate from that of other tenants. Tenant isolation provides performance guarantees for large tenants. Citus not only allows full SQL coverage for the workload, but also scales to 100K+ tenants. Another notable feature is reference tables: small tables that are stored on every worker node because other tables reference them often (hence the name). Reading these tables locally on a worker node frees resources that would otherwise be spent requesting data from another node. These features allow you to scale out your tenants’ data across multiple machines and easily add more CPU, memory, and disk resources. Additionally, sharing the same database schema across multiple tenants simplifies database management and makes more efficient use of hardware resources by distributing load across multiple instances.
Advantages with Citus for multi-tenant applications are:
- Fast queries for all tenants
- Sharding logic occurs in the database, not the application
- More data can be held than in single-node PostgreSQL, where a single table is limited to 32 TB
- Performance is maintained under high concurrency
- Fast metrics analysis across customer base
- Easy scaling for new customers
- Isolation of resource usage of large and small customers
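As a rough illustration of what "sharding logic occurs in the database" means, the sketch below hashes a tenant identifier to pick a shard, so that all rows for one tenant end up together. It is a simplified model, not Citus's implementation: the CRC32 hash, the shard count of 8, and the `tenant_id` and `order` column names are all assumptions chosen for the example.

```python
import zlib
from collections import defaultdict

SHARD_COUNT = 8  # illustrative value for this sketch


def shard_for_tenant(tenant_id, shard_count=SHARD_COUNT):
    """Pick a shard with a deterministic hash (CRC32 here, not Citus's own hash)."""
    return zlib.crc32(str(tenant_id).encode("utf-8")) % shard_count


def distribute(rows, shard_count=SHARD_COUNT):
    """Group rows into shards by hashing the (hypothetical) tenant_id column."""
    shards = defaultdict(list)
    for row in rows:
        shards[shard_for_tenant(row["tenant_id"], shard_count)].append(row)
    return dict(shards)


# Hypothetical rows from a multi-tenant orders table.
rows = [
    {"tenant_id": 1, "order": "A-100"},
    {"tenant_id": 2, "order": "B-200"},
    {"tenant_id": 1, "order": "A-101"},
]
shards = distribute(rows)

for shard_id, shard_rows in sorted(shards.items()):
    print(shard_id, shard_rows)
```

Because the hash is deterministic, both rows for tenant 1 always land in the same shard, so a query filtered on that tenant only needs to touch one shard. The application never computes the shard itself; in this model that decision lives entirely in `distribute`.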
Citus supports real-time queries on large datasets. Such queries are common in rapidly growing event systems and in systems with time series data. Examples include:
- Analytic dashboards with sub-second response times
- Exploratory queries on unfolding events
- Large dataset archival and reports
- Analyzing sessions with funnel, segmentation, and cohort queries
Citus provides these benefits because it parallelizes query execution and scales linearly with the number of worker nodes in a cluster. Advantages of Citus for real-time applications include:
- Maintain sub-second responses with a growing dataset
- Analyze new events and data as they become available in real-time
- Parallelize SQL queries
- Maintain performance under high concurrency
- Fast responses to dashboard queries
- Use one database, not a patchwork
- Rich PostgreSQL data types and extensions
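The parallel query execution described above follows a scatter-gather pattern: each worker computes a partial result over its own shard, and the partial results are merged into the final answer. The sketch below is a toy model of that pattern, not Citus code. The `SHARDS` data, the `page` and `ms` fields, and both function names are hypothetical, and Python threads only illustrate the fan-out; they do not reproduce the speedup of separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for the event rows held by one worker.
SHARDS = [
    [{"page": "home", "ms": 120}, {"page": "cart", "ms": 340}],
    [{"page": "home", "ms": 95}],
    [{"page": "cart", "ms": 410}, {"page": "home", "ms": 150}],
]


def partial_count(shard, page):
    """Work done on one worker: count matching rows in its local shard."""
    return sum(1 for row in shard if row["page"] == page)


def distributed_count(shards, page):
    """Fan the query out to every shard, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: partial_count(shard, page), shards))
    return sum(partials)


print(distributed_count(SHARDS, "home"))
```

Counts merge with a simple sum. Aggregates such as averages need the workers to return both a sum and a count so the merge step can combine them correctly.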
When is Citus Not Beneficial?
Citus adds distributed functionality to PostgreSQL, but it does not scale out all workloads. Citus’ design mainly benefits the use cases described above. In other environments, Citus may not be appropriate or may not offer performance improvements. Citus is unlikely to help if a single-node PostgreSQL already supports your application and you do not expect to outgrow the limits of a single node. If your analytics applications do not need to support a large number of concurrent users, or your focus is offline analytics without real-time queries or ingestion, Citus will not likely be beneficial. Queries that return data-heavy ETL results instead of summaries also will not benefit from Citus.
Now that we understand what Citus is used for and when it is beneficial, let’s review the structure of Citus. Citus refers to machines/instances as nodes. A Citus cluster has one coordinator node and multiple worker nodes. The coordinator node holds very little data (mainly metadata), while the worker nodes hold the production data, broken into shards that are distributed across the workers. The number of worker nodes you need depends on how much production data you have (or plan on having) and on how you want the data distributed.
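The division of labor between coordinator and workers can be modeled in a few lines. The sketch below is a toy data structure under stated assumptions, not how Citus stores its metadata: the `Cluster` class, the round-robin placement, the worker names, and the `countries` reference table are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    """Toy model: the coordinator keeps metadata, the workers keep the rows."""

    workers: List[str]
    shard_count: int
    reference_tables: List[str] = field(default_factory=list)

    def placements(self) -> Dict[int, str]:
        """Coordinator metadata: which worker holds which shard (round-robin here)."""
        return {s: self.workers[s % len(self.workers)] for s in range(self.shard_count)}

    def shards_per_worker(self) -> Dict[str, int]:
        """Count how many shards each worker is responsible for."""
        counts = {w: 0 for w in self.workers}
        for worker in self.placements().values():
            counts[worker] += 1
        return counts

    def reference_copies(self) -> Dict[str, List[str]]:
        """Reference tables are small, so every worker keeps a full copy."""
        return {w: list(self.reference_tables) for w in self.workers}


small = Cluster(workers=["w1", "w2"], shard_count=8, reference_tables=["countries"])
large = Cluster(workers=["w1", "w2", "w3", "w4"], shard_count=8, reference_tables=["countries"])

print(small.shards_per_worker())
print(large.shards_per_worker())
print(large.reference_copies())
```

Doubling the number of workers halves the shards each one holds, which is the intuition behind sizing the cluster by data volume. The reference table, by contrast, is copied to every worker regardless of cluster size.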
Let’s install a multi-node Citus cluster. The following commands are executed on Ubuntu 22.04 LTS instances. If you do not have curl installed, you can install it with the following command:
sudo apt -y install curl
Next, we will add the repository that provides the packages for the Citus extension. Repeat this step and the installation step that follows on the coordinator node and on each worker node. Add the repository with:
curl https://install.citusdata.com/community/deb.sh | sudo bash
Now that the Citus repository is set up, we can install PostgreSQL 14 along with the Citus 11.1 extension with one command:
sudo apt -y install postgresql-14-citus-11.1
NOTE: This will only work properly on an instance that does not already have PostgreSQL installed.
We will be able to confirm installation after we load Citus into our demo database in our next Citus blog. Stay tuned!
We have covered many important aspects of Citus. We briefly introduced Citus, reviewed its use cases and structure, and installed it along with PostgreSQL 14. Citus is a great fit for multi-tenant applications and real-time analytics. We also covered when Citus is not beneficial. Next in this series of blog posts is Citus Configuration.