Recently I have been working on PostgreSQL benchmarks for its data checksums feature. This incredibly valuable option to initdb -- introduced in version 9.3 in 2013 -- allows quick detection of corrupted data pages on disk. It provides the glorious opportunity to simply fail over to a standby before more of your data becomes corrupted, rather than endure the horror of discovering the corruption long afterward and attempting to recover.
But people care as much about speed as about safety, and the feature comes with a performance cost, since data pages are checksummed every time they are written to or read from disk. Clients are interested in the feature, but they want to know what that cost will be. Our answer so far has been "we don't really know," and if an internet search is any indicator, no one really knows.
There is another challenge with this feature: for PostgreSQL versions before 12, it can only be enabled when the cluster is first created, and if you create a cluster with the option enabled, there is no way to disable it later. PostgreSQL 12 introduces the pg_checksums utility, which can enable or disable data checksums in an existing cluster, but the cluster must still be offline while the utility runs. Clients want to understand the performance implications of this feature before committing a new cluster to it, or before undertaking a migration to a new cluster with the feature enabled.
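For concreteness, here is a minimal sketch of both paths, assuming a hypothetical data directory at /var/lib/pgsql/data and PostgreSQL binaries on the PATH. Note that enabling checksums with pg_checksums rewrites every data file, so expect it to take time proportional to the size of the cluster.

```sh
# Enable checksums at cluster creation time (available since 9.3):
initdb --data-checksums -D /var/lib/pgsql/data

# On PostgreSQL 12 and later, toggle checksums on an existing cluster.
# The cluster must be cleanly shut down while pg_checksums runs.
pg_ctl stop -D /var/lib/pgsql/data
pg_checksums --enable -D /var/lib/pgsql/data    # or --disable, or --check
pg_ctl start -D /var/lib/pgsql/data
```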
Take note: Amazon enables data checksums on all RDS PostgreSQL clusters, and on their platform it is not a configurable option. I intend to leave you with that same clarity and confidence about this feature.
To answer the performance question, my first instinct was to run benchmarks and settle it empirically. I did run benchmarks, and I did get an answer. However, these benchmarks were tricky to set up in a meaningful way, and several different ones are needed to understand the issue, because there are many factors at play:
- CPU time to calculate checksums
- Obscure differences in write-ahead logging that can have dramatic effects on disk writes (see the measurement sketch after this list)
- Caching
- Workload dependence of effects
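To make the write-ahead logging factor observable, here is a minimal measurement sketch, assuming a hypothetical database named bench that has already been populated with pgbench -i. It records how much WAL a fixed pgbench run generates; running the same measurement on two otherwise identical clusters, one initialized with --data-checksums and one without, shows the difference in write volume.

```sh
# Note the WAL position before and after a fixed workload, then diff them.
START=$(psql -At -d bench -c "SELECT pg_current_wal_lsn()")
pgbench -c 8 -j 8 -T 300 bench     # 8 clients, 5-minute run
END=$(psql -At -d bench -c "SELECT pg_current_wal_lsn()")

# Bytes of WAL generated during the run (function names as of PostgreSQL 10+).
psql -At -d bench -c "SELECT pg_size_pretty(pg_wal_lsn_diff('$END', '$START'))"
```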
In the Command Prompt whitepaper Performance Analysis of PostgreSQL Data Checksums, I provide the results of several carefully designed benchmarks that illustrate which common workload types bear significant additional CPU and disk I/O load because of the feature, and which do not. From the Technical Summary of the whitepaper, here are some conclusions about the performance impact of data checksums on different workload types:
- Any application with a high shared buffers hit ratio (see the query sketch after this list): little impact.
- Any application with a high ratio of reads/writes: little impact.
- Data logging application with a low ratio of reads/inserts, and few updates and deletes: little impact.
- Application with an equal ratio of reads/inserts, or many updates or deletes, and a low shared buffers hit ratio (for example, an ETL workload), especially where the rows are scattered among disk pages: expect double or greater CPU and disk I/O use.
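As a rough way to place an existing application on this list, here is a minimal query sketch, assuming a hypothetical database named mydb. It estimates the shared buffers hit ratio from the cumulative counters in pg_stat_database; a value close to 1.0 points toward the low-impact cases above.

```sh
# Estimate the shared buffers hit ratio for the current database.
psql -d mydb -c "
  SELECT round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4)
           AS shared_buffers_hit_ratio
    FROM pg_stat_database
   WHERE datname = current_database();"
```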
Because hardware failure and low-level data corruption are all too common in database operations, and the data checksums feature provides early, reliable detection of such corruption, Command Prompt recommends that administrators plan to deploy clusters with the feature enabled, with foreknowledge of its performance impacts. For a full analysis, please download our whitepaper by entering your email below.