ADR-004: Apache Arrow Flight for Data Transport

Date: 2024-01-10

Status: Accepted

Context

Neuroscience data transfers between SynDB and analytical clients can reach hundreds of megabytes per query result. Serializing this data as JSON over HTTP incurs significant overhead: text encoding inflates payload size, row-oriented JSON must be fully parsed before columnar analysis can begin, and plain request/response HTTP offers no built-in streaming with back-pressure.
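The payload-size point can be checked with simple arithmetic. The sketch below is illustrative only: the helper names and the sample values are assumptions, not part of SynDB, and the binary figure ignores Arrow's schema metadata and buffer padding. It compares the bytes needed for a JSON array of full-precision floats with the bytes needed for the same values as a contiguous buffer of 8-byte floats.

```rust
/// Bytes needed to send the values as a JSON array of numbers, e.g. "[0.5,0.25]".
fn json_array_len(values: &[f64]) -> usize {
    let parts: Vec<String> = values.iter().map(|v| v.to_string()).collect();
    format!("[{}]", parts.join(",")).len()
}

/// Bytes needed for the same values as a contiguous buffer of 8-byte floats,
/// which is how a columnar format stores a float64 column (metadata excluded).
fn columnar_len(values: &[f64]) -> usize {
    values.len() * std::mem::size_of::<f64>()
}

/// Hypothetical stand-in for measured samples: values with full decimal precision.
fn sample_values(n: usize) -> Vec<f64> {
    (1..=n).map(|i| (i as f64).sqrt().sin()).collect()
}
```

Short literals such as 0.5 are cheaper as text, but measured values that use the full precision of a float64 need many more characters than their 8 binary bytes, so `json_array_len(&sample_values(1000))` comes out larger than `columnar_len(&sample_values(1000))`. Row-oriented JSON also repeats the field names in every row, which this sketch leaves out.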

Data science clients in Python, R, and Julia can already work with Apache Arrow as their in-memory columnar format (via pandas, polars, and similar libraries). A transport layer that speaks Arrow natively would avoid most of the cost of converting results to and from an intermediate text format.

Decision

Use Apache Arrow Flight (gRPC-based columnar data transport) via the tonic gRPC framework and arrow-flight crate. The Flight service runs on port 50051 alongside the Axum HTTP API on port 8080, both in the same binary process.

Query results are materialized as Arrow RecordBatches and streamed to clients via do_get.
Dataset ingestion uses do_put for streaming uploads.
Flight tickets encode query parameters as serialized descriptors.
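
A Flight ticket is an opaque byte string as far as the protocol is concerned, so the server is free to choose how query parameters are packed into it. The following is a minimal sketch of one possible encoding, not the format SynDB actually uses: the QueryParams struct, its field names, and the length-prefixed layout are all assumptions made for illustration. A real service might well use protobuf or another schema-driven format instead.

```rust
/// Hypothetical query parameters; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct QueryParams {
    dataset: String,
    columns: Vec<String>,
    limit: u32,
}

/// Append a string as a u32 little-endian length followed by its UTF-8 bytes.
fn put_str(buf: &mut Vec<u8>, s: &str) {
    buf.extend_from_slice(&(s.len() as u32).to_le_bytes());
    buf.extend_from_slice(s.as_bytes());
}

/// Read a little-endian u32 at `pos` and advance it; None if the input is truncated.
fn get_u32(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let end = (*pos).checked_add(4)?;
    let b = bytes.get(*pos..end)?;
    *pos = end;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Read a length-prefixed UTF-8 string at `pos` and advance it.
fn get_str(bytes: &[u8], pos: &mut usize) -> Option<String> {
    let len = get_u32(bytes, pos)? as usize;
    let end = (*pos).checked_add(len)?;
    let s = std::str::from_utf8(bytes.get(*pos..end)?).ok()?.to_owned();
    *pos = end;
    Some(s)
}

/// Pack the parameters into the opaque bytes a ticket would carry.
fn encode_ticket(q: &QueryParams) -> Vec<u8> {
    let mut buf = Vec::new();
    put_str(&mut buf, &q.dataset);
    buf.extend_from_slice(&(q.columns.len() as u32).to_le_bytes());
    for column in &q.columns {
        put_str(&mut buf, column);
    }
    buf.extend_from_slice(&q.limit.to_le_bytes());
    buf
}

/// Unpack ticket bytes on the server side; None for malformed or truncated input.
fn decode_ticket(bytes: &[u8]) -> Option<QueryParams> {
    let mut pos = 0;
    let dataset = get_str(bytes, &mut pos)?;
    let count = get_u32(bytes, &mut pos)? as usize;
    let mut columns = Vec::new();
    for _ in 0..count {
        columns.push(get_str(bytes, &mut pos)?);
    }
    let limit = get_u32(bytes, &mut pos)?;
    // Reject trailing bytes so a ticket has exactly one valid reading.
    if pos != bytes.len() {
        return None;
    }
    Some(QueryParams { dataset, columns, limit })
}
```

In this sketch the client never interprets the bytes: it passes them back unchanged in do_get, and only the server calls decode_ticket. Returning None for malformed input lets the handler reject a bad ticket instead of panicking.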

Consequences

Positive:

Near-zero-copy data transfer: clients receive Arrow RecordBatches directly usable by pyarrow, polars, DataFusion, and similar tools without a format-conversion step.
gRPC streaming, built on HTTP/2 flow control, provides natural back-pressure and supports arbitrarily large result sets without buffering everything in memory.
Columnar format lets the server apply projection pruning, and any filters carried in the ticket, before data leaves the process, so only the requested columns and rows cross the wire.
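
The back-pressure benefit listed above can be illustrated without any gRPC machinery. In the sketch below, a bounded channel from the Rust standard library stands in for the flow-controlled stream, and the Batch struct is a hypothetical placeholder for an Arrow RecordBatch. None of these names come from SynDB or the arrow-flight crate. The point is only that a bounded buffer caps how much data sits in memory, however large the result set is.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Hypothetical stand-in for an Arrow RecordBatch: a sequence id and a row count.
struct Batch {
    id: usize,
    rows: usize,
}

/// Offer batches to a bounded channel that nobody is reading from.
/// Returns how many were accepted before the channel pushed back.
fn accepted_before_backpressure(capacity: usize, offered: usize) -> usize {
    // `_rx` keeps the receiver alive so the channel stays connected.
    let (tx, _rx) = sync_channel::<Batch>(capacity);
    let mut accepted = 0;
    for id in 0..offered {
        match tx.try_send(Batch { id, rows: 1024 }) {
            Ok(()) => accepted += 1,
            // Buffer is full: a streaming producer would wait here.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

/// Stream `total` batches through a buffer of `capacity` slots.
/// Returns (batches received, rows received).
fn stream_all(capacity: usize, total: usize) -> (usize, usize) {
    let (tx, rx) = sync_channel::<Batch>(capacity);
    let producer = thread::spawn(move || {
        for id in 0..total {
            // send() blocks while the buffer is full, pacing the producer
            // to the speed of the consumer.
            if tx.send(Batch { id, rows: 1024 }).is_err() {
                break; // consumer went away
            }
        }
    });
    let mut batches = 0;
    let mut rows = 0;
    for batch in rx.iter() {
        assert_eq!(batch.id, batches); // batches arrive in order
        batches += 1;
        rows += batch.rows;
    }
    producer.join().expect("producer thread panicked");
    (batches, rows)
}
```

Calling `accepted_before_backpressure(4, 10)` accepts four batches and then refuses the fifth, while `stream_all(2, 100)` delivers all 100 batches even though at most two are buffered at any time. In the real service the equivalent pacing would come from HTTP/2 flow control windows and the buffers of the gRPC stack, not from a std channel.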

Negative:

gRPC adds complexity compared to plain HTTP: clients need a Flight-aware library rather than a simple HTTP client.
Binary protocol is harder to debug than JSON; requires tooling like grpcurl or custom Flight clients for inspection.
Running two server protocols in one binary increases shutdown coordination complexity (see ADR-005).
