ADR-003: ClickHouse for Analytical Data Storage

Date: 2024-01-08

Status: Accepted

Context

Neuroscience datasets in SynDB contain billions of rows – neurons, synapses, connectivity matrices, and physiological measurements – that are queried with columnar scans, aggregations, and large joins. PostgreSQL handles the OLTP metadata workload well (users, datasets, permissions) but performs poorly on analytical queries at this scale due to its row-oriented storage engine.

A single-database approach would force a choice between metadata flexibility and analytical performance. Scientific data is append-only and immutable once ingested, which relaxes consistency requirements for the analytical store.

Decision

Adopt a dual-database strategy:

PostgreSQL (via SeaORM) for metadata, user accounts, permissions, and all OLTP operations.
ClickHouse for analytical neuroscience data, partitioned by dataset_id using the MergeTree engine family.

The API service connects to both databases. Metadata queries hit PostgreSQL; data queries are translated to ClickHouse SQL and executed via the native ClickHouse HTTP or TCP client.

Consequences

Positive:

Orders-of-magnitude faster columnar scans and aggregations compared to PostgreSQL at billion-row scale.
ClickHouse’s compression (LZ4/ZSTD) dramatically reduces storage for repetitive scientific data.
Partitioning by dataset_id enables efficient data lifecycle management (drop partition on dataset deletion).
Eventual consistency is acceptable because scientific data is immutable after ingestion.

Negative:

Increased operational complexity: two database systems to provision, monitor, back up, and upgrade.
No cross-database transactions; the application must handle consistency at the orchestration layer.
ClickHouse’s mutation and update semantics differ from PostgreSQL, requiring developer awareness.

Keyboard shortcuts

Synapse DB

ADR-003: ClickHouse for Analytical Data Storage

Context

Decision

Consequences