Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

ADR-004: Apache Arrow Flight for Data Transport

Date: 2024-01-10

Status: Accepted

Context

Neuroscience data transfers between SynDB and analytical clients can reach hundreds of megabytes per query result. Serializing this data as HTTP JSON incurs significant overhead: text encoding inflates payload size, row-oriented JSON requires full deserialization, and there is no native streaming support for back-pressure.

Data science clients in Python, R, and Julia already use Apache Arrow as their in-memory columnar format (via pandas, polars, and similar libraries). A transport layer that speaks Arrow natively would eliminate serialization costs.

Decision

Use Apache Arrow Flight (gRPC-based columnar data transport) via the tonic gRPC framework and arrow-flight crate. The Flight service runs on port 50051 alongside the Axum HTTP API on port 8080, both in the same binary process.

  • Query results are materialized as Arrow RecordBatches and streamed to clients via do_get.
  • Dataset ingestion uses do_put for streaming uploads.
  • Flight tickets encode query parameters as serialized descriptors.

Consequences

Positive:

  • Zero-copy data transfer: clients receive Arrow RecordBatches directly usable by pyarrow, polars, DataFusion, and similar tools without deserialization.
  • gRPC bidirectional streaming provides natural back-pressure and supports arbitrarily large result sets without buffering everything in memory.
  • Columnar format enables predicate pushdown and projection pruning at the transport level.

Negative:

  • gRPC adds complexity compared to plain HTTP: clients need a Flight-aware library rather than a simple HTTP client.
  • Binary protocol is harder to debug than JSON; requires tooling like grpcurl or custom Flight clients for inspection.
  • Running two server protocols in one binary increases shutdown coordination complexity (see ADR-005).