ADR-004: Apache Arrow Flight for Data Transport
Date: 2024-01-10
Status: Accepted
Context
Neuroscience data transfers between SynDB and analytical clients can reach hundreds of megabytes per query result. Serializing this data as JSON over HTTP incurs significant overhead: text encoding inflates payload size, row-oriented JSON must be fully parsed before any value can be used, and plain request/response HTTP has no native streaming with back-pressure.
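To make the size overhead concrete, the following is a rough sketch that compares a JSON-style text encoding of floating-point samples with a fixed-width binary encoding. It uses only the Rust standard library. The function names are illustrative and not part of SynDB, and the text encoder is hand-rolled for the comparison rather than a real JSON serializer.

```rust
/// Encode samples as a JSON-style array of numbers, e.g. "[1.5,-2.25]".
/// Hand-rolled for illustration; a real service would use a JSON library.
fn encode_as_json_text(samples: &[f64]) -> String {
    let parts: Vec<String> = samples.iter().map(|v| format!("{}", v)).collect();
    format!("[{}]", parts.join(","))
}

/// Encode samples as fixed-width little-endian bytes (8 bytes per f64),
/// similar in spirit to how a columnar buffer stores a Float64 column.
fn encode_as_binary(samples: &[f64]) -> Vec<u8> {
    let mut out = Vec::with_capacity(samples.len() * 8);
    for v in samples {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Decode the binary form back into f64 values with no text parsing.
fn decode_binary(bytes: &[u8]) -> Vec<f64> {
    bytes
        .chunks_exact(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            f64::from_le_bytes(buf)
        })
        .collect()
}
```

The binary form always costs 8 bytes per value, while the text form costs one byte per printed digit plus separators, so full-precision measurements tend to be noticeably larger as text and must also be parsed digit by digit on the client.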
Data science clients in Python, R, and Julia already use Apache Arrow as their in-memory columnar format (via pandas, polars, and similar libraries). A transport layer that speaks Arrow natively would eliminate most serialization costs.
Decision
Use Apache Arrow Flight (gRPC-based columnar data transport) via the tonic
gRPC framework and arrow-flight crate. The Flight service runs on port 50051
alongside the Axum HTTP API on port 8080, both in the same binary process.
- Query results are materialized as Arrow RecordBatches and streamed to clients via do_get.
- Dataset ingestion uses do_put for streaming uploads.
- Flight tickets encode query parameters as serialized descriptors.
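Flight treats a ticket as opaque bytes, so the server decides how query parameters are packed into it. The sketch below shows one possible round trip using only the standard library. The QueryTicket struct, its fields, and the length-prefixed layout are all assumptions made for illustration. This ADR does not specify the actual descriptor format, and a real implementation would more likely use protobuf or another established serialization format.

```rust
/// Hypothetical query parameters carried inside a Flight ticket.
/// Field names are illustrative; the ADR does not define them.
#[derive(Debug, Clone, PartialEq)]
struct QueryTicket {
    dataset: String,
    columns: Vec<String>,
    limit: u32,
}

/// Append a length-prefixed UTF-8 string (u32 little-endian length + bytes).
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Read a u32 at `pos` and advance it; None if the input is truncated.
fn get_u32(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let end = (*pos).checked_add(4)?;
    let slice = bytes.get(*pos..end)?;
    let mut buf = [0u8; 4];
    buf.copy_from_slice(slice);
    *pos = end;
    Some(u32::from_le_bytes(buf))
}

/// Read a length-prefixed UTF-8 string at `pos` and advance it.
fn get_str(bytes: &[u8], pos: &mut usize) -> Option<String> {
    let len = get_u32(bytes, pos)? as usize;
    let end = (*pos).checked_add(len)?;
    let slice = bytes.get(*pos..end)?;
    *pos = end;
    String::from_utf8(slice.to_vec()).ok()
}

impl QueryTicket {
    /// Serialize to the opaque bytes a Flight ticket would carry.
    fn to_ticket_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        put_str(&mut out, &self.dataset);
        out.extend_from_slice(&(self.columns.len() as u32).to_le_bytes());
        for column in &self.columns {
            put_str(&mut out, column);
        }
        out.extend_from_slice(&self.limit.to_le_bytes());
        out
    }

    /// Parse ticket bytes; None on truncated, malformed, or oversized input.
    fn from_ticket_bytes(bytes: &[u8]) -> Option<QueryTicket> {
        let mut pos = 0usize;
        let dataset = get_str(bytes, &mut pos)?;
        let count = get_u32(bytes, &mut pos)? as usize;
        let mut columns = Vec::new();
        for _ in 0..count {
            columns.push(get_str(bytes, &mut pos)?);
        }
        let limit = get_u32(bytes, &mut pos)?;
        if pos != bytes.len() {
            return None; // reject trailing bytes
        }
        Some(QueryTicket { dataset, columns, limit })
    }
}
```

In this sketch a do_get handler would call from_ticket_bytes on the incoming ticket and reject the request when it returns None, so clients cannot smuggle unvalidated parameters into a query.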
Consequences
Positive:
- Near zero-copy data transfer: the wire format matches Arrow's in-memory layout, so clients receive Arrow RecordBatches directly usable by pyarrow, polars, DataFusion, and similar tools without row-by-row deserialization.
- gRPC streaming provides natural back-pressure through HTTP/2 flow control and supports arbitrarily large result sets without buffering everything in memory.
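- Columnar format makes projection pruning cheap at the transport level (only the requested columns are sent), and predicates carried in the ticket can be pushed down and applied server-side before streaming.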
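The back-pressure point can be illustrated with a bounded channel: the producer stalls once the consumer falls behind, so memory use depends on the channel capacity rather than on the size of the result set. This is a standard-library analogy using threads, not the tonic or arrow-flight API. The Batch type alias and the function name are placeholders, and a real Flight handler would rely on gRPC and HTTP/2 flow control or on a bounded async channel instead.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::sync_channel;
use std::sync::Arc;
use std::thread;

/// Stand-in for an Arrow RecordBatch: a single column of f64 values.
type Batch = Vec<f64>;

/// Stream `total` batches through a channel bounded at `capacity`.
/// Returns (rows received, peak number of batches alive at once).
fn stream_with_backpressure(
    total: usize,
    rows_per_batch: usize,
    capacity: usize,
) -> (usize, usize) {
    let (tx, rx) = sync_channel::<Batch>(capacity);
    let alive = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let producer = {
        let alive = Arc::clone(&alive);
        let peak = Arc::clone(&peak);
        thread::spawn(move || {
            for i in 0..total {
                // Count the batch as alive from the moment it is materialized.
                let now = alive.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                let batch: Batch = vec![i as f64; rows_per_batch];
                // Blocks while the channel already holds `capacity` batches.
                tx.send(batch).expect("consumer hung up");
            }
            // `tx` is dropped here, which ends the consumer loop below.
        })
    };

    let mut rows = 0;
    for batch in rx {
        rows += batch.len();
        // The batch has been consumed and can be freed.
        alive.fetch_sub(1, Ordering::SeqCst);
    }
    producer.join().expect("producer panicked");
    (rows, peak.load(Ordering::SeqCst))
}
```

However many batches are streamed, the peak stays at or below capacity + 2: up to capacity batches queued, one waiting in send, and one being processed by the consumer.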
Negative:
- gRPC adds complexity compared to plain HTTP: clients need a Flight-aware library rather than a simple HTTP client.
- Binary protocol is harder to debug than JSON; requires tooling like grpcurl or custom Flight clients for inspection.
- Running two server protocols in one binary increases shutdown coordination complexity (see ADR-005).
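The shutdown coordination concern comes down to fanning one signal out to both servers and waiting for both to finish. The sketch below shows that shape with standard-library threads and a shared flag. The serve and run_and_shutdown functions are hypothetical stand-ins rather than the SynDB implementation, and the actual design is the subject of ADR-005.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Stand-in for a server loop (Flight or HTTP): runs until told to stop,
/// then returns its name so the caller can confirm a clean exit.
fn serve(name: &'static str, shutdown: Arc<AtomicBool>) -> &'static str {
    while !shutdown.load(Ordering::SeqCst) {
        // A real server would accept and handle connections here.
        thread::sleep(Duration::from_millis(1));
    }
    name
}

/// Start both stand-in servers, signal shutdown once, and wait for both.
fn run_and_shutdown() -> Vec<&'static str> {
    let shutdown = Arc::new(AtomicBool::new(false));
    let handles: Vec<thread::JoinHandle<&'static str>> = ["flight:50051", "http:8080"]
        .iter()
        .map(|&name| {
            let flag = Arc::clone(&shutdown);
            thread::spawn(move || serve(name, flag))
        })
        .collect();

    // A single signal reaches both protocols.
    shutdown.store(true, Ordering::SeqCst);

    // The process should exit only after every server has stopped.
    handles
        .into_iter()
        .map(|h| h.join().expect("server thread panicked"))
        .collect()
}
```

With async servers the flag would typically be replaced by a shared future or notification passed to each server's graceful-shutdown hook, but the requirement is the same: neither protocol may be left running after the other has stopped.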