ADR-003: ClickHouse for Analytical Data Storage
Date: 2024-01-08
Status: Accepted
Context
Neuroscience datasets in SynDB contain billions of rows – neurons, synapses, connectivity matrices, and physiological measurements – that are queried with columnar scans, aggregations, and large joins. PostgreSQL handles the OLTP metadata workload well (users, datasets, permissions) but performs poorly on analytical queries at this scale due to its row-oriented storage engine.
A single-database approach would force a choice between metadata flexibility and analytical performance. Scientific data is append-only and immutable once ingested, which relaxes consistency requirements for the analytical store.
Decision
Adopt a dual-database strategy:
- PostgreSQL (via SeaORM) for metadata, user accounts, permissions, and all OLTP operations.
- ClickHouse for analytical neuroscience data, partitioned by
dataset_idusing the MergeTree engine family.
The API service connects to both databases. Metadata queries hit PostgreSQL; data queries are translated to ClickHouse SQL and executed via the native ClickHouse HTTP or TCP client.
Consequences
Positive:
- Orders-of-magnitude faster columnar scans and aggregations compared to PostgreSQL at billion-row scale.
- ClickHouse’s compression (LZ4/ZSTD) dramatically reduces storage for repetitive scientific data.
- Partitioning by
dataset_idenables efficient data lifecycle management (drop partition on dataset deletion). - Eventual consistency is acceptable because scientific data is immutable after ingestion.
Negative:
- Increased operational complexity: two database systems to provision, monitor, back up, and upgrade.
- No cross-database transactions; the application must handle consistency at the orchestration layer.
- ClickHouse’s mutation and update semantics differ from PostgreSQL, requiring developer awareness.