Welcome to SynDB

SynDB is a platform for finding, sharing, and analyzing connectomics datasets and derived neuroanatomical tables. It supports federated deployments where institutions retain data sovereignty while participating in cross-institutional analysis.

Resources

Why use SynDB?

SynDB serves three audiences: data owners who produce microscopy data, data scientists who analyze it, and institutions that want to participate in federated analysis without giving up control of their data.

Image data owner

  • Data sharing: Others can use your data to teach, increasing the educational value of the data. SynDB follows the FAIR principles to maximize the impact of shared data.
  • Citations: Whenever your data is used in a publication, you will be cited, increasing your visibility in the scientific community.
  • Provenance tracking: Version history, lineage, and auto-generated citations (BibTeX, RIS) for your datasets.

Data scientist

  • Meta-analysis: Compare data across thousands of experiments using cross-dataset meta-analysis.
  • SyQL queries: A declarative query language that resolves metadata into optimized SQL.
  • Graph analysis: Network analysis on connectome data — motifs, shortest paths, reachability, cross-dataset comparison.
  • Data visualization: Use the data to create visualizations for publications or presentations.
  • Statistical modelling: Use the data to create models that can predict outcomes in future experiments.

Node operator / Institution

  • Data sovereignty: Keep your data on your infrastructure — it never leaves your network.
  • Federated meta-analysis: Participate in cross-institutional queries without transferring data.
  • Minimal footprint: A federation node requires only ClickHouse and the syndb-node binary.
  • Schema sync: The hub pushes DDL migrations to your node automatically.

See Federation Overview for setup details.

Installation

You can use SynDB in three ways:

  • the hosted web app at app.syndb.xyz
  • a local development stack via Docker Compose
  • the syndb CLI built from this repository

Hosted Web App

If you only need to browse datasets, authenticate, or use the UI, no local installation is required. Open app.syndb.xyz in your browser.

Local Stack

For local development, build the project images and start the stack:

git clone --recurse-submodules https://github.com/memorycircuits/SynDB.git
cd SynDB
cp .env.example .env
syndb dev sync-versions
cargo run -p cli --features dev -- stack prepare
cargo run -p cli --features dev -- stack up

If you already cloned the repo without submodules, run:

git submodule update --init --recursive

The local entry points are:

  • API docs: http://localhost:8080/docs
  • UI: http://localhost:8090
  • Health check: http://localhost:8080/health

CLI From Source

Build or run the current CLI directly from the repo:

cargo run -p cli --features full -- --help
cargo run -p cli --features full -- query --help
cargo run -p cli --features dev -- stack --help

If you want a standalone binary, build the crate and run ./target/debug/syndb or ./target/release/syndb afterward.

Direct API Usage

The API can be used directly through:

If you want generated client bindings, use the OpenAPI schema above with your preferred generator.

Quick start

Web app

Use the hosted deployment at app.syndb.xyz, or run the local stack and open http://localhost:8090.

Command line interface

If syndb is already on your PATH:

syndb --help

From this repository, you can run the current CLI without installing it globally:

cargo run -p cli --features full -- --help

If this clone was not created with --recurse-submodules, initialize the nested QueryFabric workspace first:

git submodule update --init --recursive

Useful next commands:

syndb auth register
syndb auth login
syndb query --help
syndb data --help

Next steps

Authentication

SynDB uses PASETO v4 tokens for authentication. Access tokens authorize API requests; refresh tokens obtain new access tokens without re-authenticating.

Account Types

| Type | How to create | Capabilities |
|---|---|---|
| Regular | POST /v1/user/auth/register or CLI syndb auth register | Browse, search datasets |
| Academic | Verify via CILogon (institutional login) | All regular + SyQL, graph analysis, meta-analysis, upload, jobs |
| Service | POST /v1/user/auth/register-service with X-Service-Secret header | Same as Academic (auto-verified) |
| SuperUser | Promoted by existing superuser | All + federation admin, ontology management |

Academic verification is required for compute-intensive operations: query execution, graph analysis, analytics, meta-analysis, and dataset upload.

Registration & Login

CLI:

syndb auth register
syndb auth login

API:

# Register
curl -X POST https://api.syndb.xyz/v1/user/auth/register \
  -H "Content-Type: application/json" \
  -d '{"email": "<your-email>", "password": "..."}'

# Login — returns access_token and refresh_token
curl -X POST https://api.syndb.xyz/v1/user/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email": "<your-email>", "password": "..."}'

Token Lifecycle

  1. Login returns an access token (15 min TTL) and a refresh token (30 day TTL)
  2. Use the access token in requests: Authorization: Bearer <access_token>
  3. When the access token expires, exchange the refresh token for a new pair:
    curl -X POST https://api.syndb.xyz/v1/user/auth/refresh \
      -H "Content-Type: application/json" \
      -d '{"refresh_token": "..."}'
    
  4. Each refresh rotates the token — the old refresh token is invalidated
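
The lifecycle above can be sketched as a small client-side helper. This is a minimal sketch under stated assumptions: `TokenSession`, `refresh_fn`, and the 30-second safety margin are hypothetical and are not part of the SynDB CLI or API. Only the 15-minute access-token TTL and the `Authorization: Bearer` header come from the steps above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple

ACCESS_TTL_SECONDS = 15 * 60  # access-token TTL stated in the lifecycle above


@dataclass
class TokenSession:
    """Hypothetical holder for one access/refresh token pair."""

    access_token: str
    refresh_token: str
    issued_at: float
    # Caller-supplied function that exchanges a refresh token for a new
    # (access_token, refresh_token) pair, e.g. by calling the refresh endpoint.
    refresh_fn: Callable[[str], Tuple[str, str]]
    clock: Callable[[], float] = time.time
    margin: float = 30.0  # assumption: refresh slightly before expiry

    def _expired(self) -> bool:
        return self.clock() - self.issued_at >= ACCESS_TTL_SECONDS - self.margin

    def auth_header(self) -> dict:
        """Return the Authorization header, refreshing the pair first if needed."""
        if self._expired():
            # Each refresh rotates the pair, so both tokens are replaced.
            self.access_token, self.refresh_token = self.refresh_fn(self.refresh_token)
            self.issued_at = self.clock()
        return {"Authorization": f"Bearer {self.access_token}"}
```

Because every refresh invalidates the previous refresh token, the sketch always stores the new pair returned by `refresh_fn` instead of reusing the old refresh token.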

Refresh tokens use family-based rotation: reuse of a revoked token invalidates the entire family, forcing re-authentication.
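
Family-based rotation can be illustrated with a toy in-memory model. This is a sketch of the general technique only: `RefreshTokenStore` and its methods are hypothetical names, and the real server's storage, token format, and error handling are not described in this documentation.

```python
import secrets


class RefreshTokenStore:
    """Toy in-memory model of family-based refresh-token rotation."""

    def __init__(self):
        self._family_of = {}            # token -> family id
        self._active = {}               # family id -> the one currently valid token
        self._revoked_families = set()

    def issue(self) -> str:
        """Start a new family at login and return its first refresh token."""
        return self._mint(secrets.token_hex(8))

    def _mint(self, family: str) -> str:
        token = secrets.token_urlsafe(16)
        self._family_of[token] = family
        self._active[family] = token
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for its successor, or raise PermissionError."""
        family = self._family_of.get(token)
        if family is None or family in self._revoked_families:
            raise PermissionError("unknown token or revoked family")
        if self._active[family] != token:
            # A rotated-out token was replayed: revoke the whole family.
            self._revoked_families.add(family)
            raise PermissionError("token reuse detected; family revoked")
        return self._mint(family)
```

In this model, replaying an old token also invalidates the newest token of the same family, which is what forces the legitimate user to log in again.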

OAuth Providers

Authenticate through institutional or social identity providers:

| Provider | Use case | Scopes |
|---|---|---|
| CILogon | Academic institutional login (universities, research labs) | openid, email, org.cilogon.userinfo |
| GitHub | Social login + ORCID association | user:email |
| Google | Social login | openid, email, profile |
| GitLab | Social login (supports self-hosted instances) | read_user |
| ORCID | Researcher ID association (requires existing account) | openid |

All OAuth flows use PKCE (Proof Key for Code Exchange) with SHA-256.

Academic Verification via CILogon

CILogon links your institutional identity to your SynDB account, automatically verifying you as an academic user:

  1. Log in to SynDB
  2. Navigate to CILogon verification (or GET /v1/user/authenticate/cilogon/authorize)
  3. Authenticate with your institution’s SSO
  4. Your account is marked as verified — unlocking SyQL, graph analysis, and upload

Service Accounts

For automated pipelines and integrations:

curl -X POST https://api.syndb.xyz/v1/user/auth/register-service \
  -H "Content-Type: application/json" \
  -H "X-Service-Secret: <SERVICE_SECRET>" \
  -d '{"email": "<service-email>", "password": "..."}'

Service accounts are auto-verified and bypass academic checks. The X-Service-Secret must match the server’s SERVICE_SECRET environment variable.

Logout

# Revokes the refresh token
curl -X POST https://api.syndb.xyz/v1/user/auth/logout \
  -H "Content-Type: application/json" \
  -d '{"refresh_token": "..."}'

Overview

The SynDB data platform is accessible through the API. Search lets you find and download analysis-ready connectomics tables and derived metrics; upload lets you share your data so that it can become part of a meta-analytical study.

Composition

The SynDB data platform is designed to provide a comprehensive and organized repository of high-resolution microscopy data products and associated metadata. In practice, SynDB is organized around three components: Metadata, Analysis-Ready Tables, and Optional Source Assets.

Metadata

Metadata defines each dataset and is what search uses to retrieve it. It describes the data in the dataset:

  • Brain region
  • Source model animal
  • Genetic manipulations (mutations)
  • Microscopy method
  • Publication information

The metadata is defined by the data owner during upload.

Warning

Dataset

You must split your dataset into individual SynDB datasets if any of these fields differ within your own dataset.
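
The splitting rule can be sketched as a grouping step: records that share every metadata value stay together, and any difference starts a new SynDB dataset. The field names in `METADATA_FIELDS` are illustrative assumptions based on the list above, not the platform's actual column names.

```python
from collections import defaultdict

# Hypothetical keys standing in for the metadata fields listed above.
METADATA_FIELDS = ("brain_region", "animal", "mutations", "microscopy", "publication")


def split_by_metadata(records):
    """Group records so that every group shares identical metadata values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record.get(field) for field in METADATA_FIELDS)
        groups[key].append(record)
    return dict(groups)
```

Each resulting group would then be uploaded as its own SynDB dataset.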

Analysis-ready tables

The primary data in SynDB is not the raw microscopy volume. It is the analysis-ready output derived from that volume: neuron tables, synapse tables, compartment metrics, and other object-level measurements that can be searched and analyzed directly. Each neuronal compartment and structure has its own schema in SynDB.

To facilitate efficient data management, every row is linked to a dataset via its ID. This linkage enables robust search capabilities by filtering through metadata, without requiring users or ETL jobs to move terabytes of raw imaging data around. You can learn more about how dataset metadata filtration works in the article on search.

The flexible data model of SynDB supports this functionality by defining specific parameters for each compartment and structure. These varied tables are unified into comprehensive datasets through dataset metadata, which effectively organizes data groups across the platform.

Optional source assets

SynDB can also attach source-linked assets such as meshes and SWC skeleton files when they are available. These assets are optional and supplement the tabular release. SynDB does not require contributors to hand over raw imaging volumes or segmentation stores in order to ingest a dataset.

Organization & Tracking

  • Collections & Tags: Group datasets into curated collections and apply tags for discovery.
  • Provenance & Citations: Track version history, data lineage, and generate citations in BibTeX/RIS format. Export metadata as JSON-LD for linked data integration.

Search

The search feature filters through datasets based on the search terms provided by the user. The search terms can be combined to narrow down the search results.

By default, every search field is AND-based, meaning every provided term must be present in the resulting dataset.
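
The AND semantics can be illustrated with a few lines of Python. This is a sketch of the filtering logic only; `matches_all`, `search`, and the field names in the usage example are hypothetical and do not reflect the search API's parameters.

```python
def matches_all(dataset: dict, terms: dict) -> bool:
    """True only if every provided search field matches (AND semantics)."""
    return all(dataset.get(field) == value for field, value in terms.items())


def search(datasets, **terms):
    """Return the datasets that satisfy all of the given terms."""
    return [d for d in datasets if matches_all(d, terms)]
```

For example, `search(datasets, animal="mouse", microscopy="EM")` keeps only datasets that are both mouse and EM; adding a term can only narrow the result.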

Download the search results

After a search, you can download the imaging-derived metrics of the datasets in the results. The download is a single .tar.xz archive containing Parquet files, which you can read in Python with the pandas or polars library.
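
Unpacking the archive can be sketched with the Python standard library. The function name and the flat output layout are assumptions; the internal directory structure of a real SynDB download is not documented here, so the sketch simply collects every `.parquet` member.

```python
import tarfile
from pathlib import Path


def extract_parquet(archive_path, out_dir):
    """Extract only the .parquet members of a .tar.xz archive; return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, mode="r:xz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(".parquet"):
                # Flatten to the base name so members cannot escape out_dir.
                target = out_dir / Path(member.name).name
                with archive.extractfile(member) as src:
                    target.write_bytes(src.read())
                extracted.append(target)
    return sorted(extracted)

# With pandas installed, each file can then be loaded with
# pandas.read_parquet(path); polars offers polars.read_parquet(path).
```

Flattening to base names means two members with the same file name in different folders would overwrite each other; keep the relative paths instead if the archive is organized per dataset.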

Note

Other languages

Apache Parquet is a file format supported by most popular programming languages, so you can likely find a library for reading Parquet files in your preferred language.

Upload

Note

Prerequisites

This article assumes that you understand how data is stored on SynDB. If you are uncertain, we recommend reading the overview article first.

Uploading to SynDB is a multistep process and requires an understanding of the SynDB dataset model.

The process

Preparation

We recommend following the steps in this guide in the exact order given.

Terms and conditions

You must accept the terms and conditions before uploading data. The terms include:

  • A statement that the data is not false or misleading
  • Redistribution rights
  • A data licensing agreement under the license of your choice (see the guide to picking a license); the current default is CC BY 4.0.

Data structuring

SynDB utilizes data standardization to facilitate uploads. Your imaging metrics must be in a tabular data format; for instance, .xlsx, .csv, or .parquet. Read more about the data structuring in the contributor’s guide.

Login

When you open the upload page, you will be prompted to log in to your SynDB account if you have not already done so. You must also verify your academic status by logging in through your institution’s account.

The upload

You can upload data using the CLI or the web UI, including mixing both approaches. The UI is usually the simplest path for a first upload, while the CLI is better for reproducible and scripted ingestion.

1. Assign IDs, and correlate relations

Each SynDB unit requires a unique ID that is assigned before it is uploaded to the platform. The web UI assigns these automatically; the CLI does not. When a dataset contains multiple SynDB tables, the rows in those tables are expected to be related to one another.
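
When preparing data by hand, ID assignment and relation mapping might look like the sketch below. It is an illustration only: `source_id`, `pre_source_id`, and `post_source_id` are hypothetical input column names, and whether `syndb data prepare` performs an equivalent step is not stated in this guide. The output names `neuron_id`, `synapse_id`, `pre_neuron_id`, and `post_neuron_id` follow the SyQL table columns.

```python
import uuid


def assign_ids(neurons, synapses):
    """Give every row a UUID and rewrite synapse references to the new neuron IDs."""
    # Map each source identifier to a freshly generated UUID.
    neuron_uuid = {n["source_id"]: str(uuid.uuid4()) for n in neurons}
    out_neurons = [{**n, "neuron_id": neuron_uuid[n["source_id"]]} for n in neurons]
    out_synapses = [
        {
            **s,
            "synapse_id": str(uuid.uuid4()),
            # A KeyError here means the synapse points at an unknown neuron.
            "pre_neuron_id": neuron_uuid[s["pre_source_id"]],
            "post_neuron_id": neuron_uuid[s["post_source_id"]],
        }
        for s in synapses
    ]
    return out_neurons, out_synapses
```

Keeping the original source identifiers alongside the new UUIDs makes it possible to trace rows back to the upstream release.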

Warning

Dataset integrity

Uploading unrelated SynDB tables under the same dataset is not allowed, because it may lead to undefined behaviour.

For example, you cannot upload a table of neurons and a table of synapses under the same dataset unless each synapse is related to a neuron in that neuron table.
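
A quick pre-upload check for this rule can be sketched as follows. The function is hypothetical and is not the platform's validator; it only assumes that neuron rows carry `neuron_id` and synapse rows carry `pre_neuron_id` and `post_neuron_id`, as in the SyQL table columns.

```python
def orphan_synapses(neurons, synapses):
    """Return synapses whose pre- or post-synaptic neuron is missing from `neurons`."""
    known = {n["neuron_id"] for n in neurons}
    return [
        s for s in synapses
        if s["pre_neuron_id"] not in known or s["post_neuron_id"] not in known
    ]
```

An empty result means every synapse refers to neurons in the same dataset; any returned rows should be fixed or moved before upload.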

Web UI

The web UI will automatically assign UUIDs to each SynDB unit. Parent-child relations are checked against the current SynDB table hierarchy during validation; see the data structuring guide for the current dataset model and naming rules.

CLI

The CLI flow is explicit and reproducible:

  1. Create the dataset metadata record and note the returned dataset ID.
    syndb data new \
      --label "My connectome release" \
      --animal "Drosophila melanogaster" \
      --microscopy EM \
      --table 1 \
      --table 6 \
      --brain-structure "mushroom body" \
      --license CC_BY
  2. Prepare raw tabular files into a validated parquet upload directory.
    syndb data prepare \
      --input-dir raw_dataset \
      --output-dir prepared_dataset
  3. Validate the prepared parquet files before upload.
    syndb data validate --input-dir prepared_dataset
  4. Upload the prepared dataset through Arrow Flight.
    syndb data upload \
      --input-dir prepared_dataset \
      --dataset-id <syndb-dataset-id>

This CLI flow mirrors the current validator and upload path used by the rest of the platform.

2. Selecting or creating the SynDB dataset metadata

As mentioned in the overview article, every dataset has metadata that the data owner defines during upload. You can either select an existing dataset or create a new one.

3. Confirm and upload

Before the upload starts, you will be prompted to confirm the dataset and the data you are uploading. Once you confirm, the upload starts; it should be relatively quick.

Delete owned datasets

You may at any time delete datasets that you own. This will remove the dataset and all the data associated with it. The deletion is permanent and cannot be undone.

External Sources

SynDB supports importing connectomics data from 20+ major connectome datasets. This page covers the supported imports grouped by organism. See the CLI Reference for the complete command reference.

Note

Dataset UUID

The <syndb-dataset-id> is the UUID of the SynDB dataset that will be associated with the imported data. You can copy and paste it from the dataset management page in the web UI.

What SynDB Needs From External Groups

When we say that we need the “full dataset” for SynDB ingestion, we do not mean the raw imaging volume. We mean the complete analysis-ready release needed to populate the SynDB tables for a dataset version.

For most connectomics imports, that means:

  • the complete neuron or object table for the release
  • the complete synapse table for the same release, or an aggregated connection table if that is the available downstream artifact
  • stable source identifiers such as root_id, pre_pt_root_id, and post_pt_root_id
  • coordinates and annotations needed to map the source schema into SynDB
  • optional morphology assets such as swc/ or meshes if they are part of the release

What this usually does not mean:

  • raw microscopy image stacks or volume tiles
  • Neuroglancer or CAVE precomputed segmentation volumes by themselves
  • ongoing operational access to the source group’s infrastructure after a static export has been produced

Preferred handoff

The preferred handoff is a static snapshot in an S3 or MinIO-compatible bucket, or an equivalent directory export with the same files. If the source data lives in CAVE, export the materialized tables first and hand off the files; SynDB imports the exported tables, not CAVE itself.

In practical terms, the source group’s involvement is usually limited to:

  • granting permission for SynDB to ingest and redistribute the agreed downstream artifacts
  • providing the exported snapshot in an agreed format
  • answering schema questions if a column needs clarification

For a typical neuron and synapse release, the handoff looks like:

dataset-name/
  neurons.csv.gz
  synapses.csv.gz
  connections.csv.gz   # optional aggregated fallback
  swc/                 # optional morphology assets
  meshes/              # optional geometry assets
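
A source group can sanity-check a snapshot against this layout before handing it off. The sketch below is a hypothetical helper, not the `syndb etl ... validate` command; it assumes the file names shown above, which may differ for individual datasets.

```python
from pathlib import Path


def check_handoff(root):
    """Return a list of problems with a snapshot directory (empty list = looks fine)."""
    root = Path(root)
    problems = []
    if not (root / "neurons.csv.gz").is_file():
        problems.append("missing neurons.csv.gz")
    # Either the full synapse table or the aggregated fallback must be present.
    connectivity = ("synapses.csv.gz", "connections.csv.gz")
    if not any((root / name).is_file() for name in connectivity):
        problems.append("missing synapses.csv.gz or connections.csv.gz")
    return problems
```

The optional `swc/` and `meshes/` directories are deliberately not checked, since a release without morphology assets is still a complete handoff.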

Note

Organelle coverage is separate

Public SynDB support for a dataset’s neuron and synapse import path does not imply that vesicle or mitochondria tables are also available. For the current Hemibrain, MANC/Male CNS, MICrONS, and H01 production paths, SynDB imports neurons and synapses only. Any organelle-backed workflow such as manuscript CF08 needs a separate upstream snapshot or manual export path plus matching ETL wiring before production can populate those tables.


Drosophila melanogaster

FlyWire

Whole-brain Drosophila connectome reconstructed from a full adult female brain (FAFB). Data is exported from CAVE in CSV format.

Source: FlyWire Codex | Publication: Dorkenwald et al., 2024. Nature

Validate your FlyWire data directory:

syndb etl flywire validate --data-dir external_datasets/FlyWire

Import into your dataset:

syndb etl flywire import \
  --data-dir external_datasets/FlyWire \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

FlyWire also supports a synapses-detailed table for individual synapse positions (large, batched import).

Hemibrain

Half-brain connectome of an adult Drosophila from the Janelia FlyEM project (v1.2.1). Contains ~25,000 neurons with traced morphology and synaptic connections.

Source: Janelia FlyEM Hemibrain | Publication: Scheffer et al., 2020. eLife

Download the dataset:

syndb etl hemibrain download --output-dir external_datasets/Hemibrain --extract

Validate and import:

syndb etl hemibrain validate --data-dir external_datasets/Hemibrain

syndb etl hemibrain import \
  --data-dir external_datasets/Hemibrain \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

MANC (Male Adult Nerve Cord)

Connectome of the male Drosophila ventral nerve cord (VNC). Data is distributed as Apache Arrow Feather files.

Source: Janelia FlyEM MANC | Publication: Takemura et al., 2024. Nature

Download the dataset:

syndb etl manc download --output-dir external_datasets/MANC

Validate and import:

syndb etl manc validate --data-dir external_datasets/MANC

syndb etl manc import \
  --data-dir external_datasets/MANC \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

Warning

Download size

The MANC dataset includes the connectome-weights Feather file (~1.1 GB). Ensure sufficient disk space before downloading.

Male CNS

Male Drosophila central nervous system connectome from Janelia FlyEM. Covers the brain and ventral nerve cord with neuron-level connectivity.

Source: Google Cloud Storage | Publication: Takemura et al., 2024. Nature

Download, validate, and import:

syndb etl male-cns download --output-dir external_datasets/MaleCNS

syndb etl male-cns validate --data-dir external_datasets/MaleCNS

syndb etl male-cns import \
  --data-dir external_datasets/MaleCNS \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

FANC (Female Adult Nerve Cord)

Connectome of the female Drosophila ventral nerve cord, enabling sex-specific comparisons with MANC. SynDB imports a static export of the FANC neuron and synapse tables, not the live CAVE deployment itself.

Publication: Phelps et al., 2021. Cell

Download the maintained public export:

syndb etl fanc download --output-dir external_datasets/FANC

If you are working from a custom local export instead, you can still prepare it separately:

syndb-export fanc --out-dir external_datasets/FANC --no-upload

Then validate and import:

syndb etl fanc validate --data-dir external_datasets/FANC

syndb etl fanc import \
  --data-dir external_datasets/FANC \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

Optic Lobe

Drosophila optic lobe connectome from Janelia FlyEM. Maps the visual processing circuitry of the fly brain.

Source: Google Cloud Storage | Publication: Matsliah et al., 2024. Nature

syndb etl optic-lobe download --output-dir external_datasets/OpticLobe

syndb etl optic-lobe validate --data-dir external_datasets/OpticLobe

syndb etl optic-lobe import \
  --data-dir external_datasets/OpticLobe \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

BANC (Brain And Nerve Cord)

Whole-body Drosophila connectome covering the brain and ventral nerve cord in a single female specimen.

Publication: Jasper et al., 2024

syndb etl banc download --output-dir external_datasets/BANC

syndb etl banc validate --data-dir external_datasets/BANC

syndb etl banc import \
  --data-dir external_datasets/BANC \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

L1 Larval

Complete connectome of the first-instar Drosophila larval brain (~3,000 neurons), the first whole-brain connectome of any insect.

Source: GitHub | Publication: Winding et al., 2023. Science

syndb etl larval download --output-dir external_datasets/L1Larval

syndb etl larval validate --data-dir external_datasets/L1Larval

syndb etl larval import \
  --data-dir external_datasets/L1Larval \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

Mouse

MICrONS (Minnie65)

Cubic millimeter of mouse visual cortex reconstructed at synaptic resolution by the MICrONS Consortium. Contains ~80,000 neurons and millions of synapses.

Source: MICrONS Explorer | Publication: MICrONS Consortium et al., 2021. bioRxiv

syndb etl microns download --output-dir external_datasets/MICrONS

syndb etl microns validate --data-dir external_datasets/MICrONS

syndb etl microns import \
  --data-dir external_datasets/MICrONS \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

Spine Morphometry

Dendritic spine morphological measurements from electron microscopy. Three sub-datasets are supported:

| Variant | Key | Source | Publication |
|---|---|---|---|
| Kasthuri | kasthuri | Columbia Academic Commons | Kasthuri et al., 2015. Cell |
| Ofer | ofer-confocal | Zenodo | Ofer et al., 2022 |
| MICrONS | microns | Zenodo | Derived from MICrONS cortical data |

syndb etl spine-morphometry download --source kasthuri --output-dir external_datasets/SpineKasthuri

syndb etl spine-morphometry validate \
  --source kasthuri \
  --data-dir external_datasets/SpineKasthuri

syndb etl spine-morphometry import \
  --data-dir external_datasets/SpineKasthuri \
  --dataset-id <syndb-dataset-id> \
  --source kasthuri

Human

H01

One cubic millimeter of human temporal cortex at nanometer resolution. Contains reconstructed neurons, synapses, and glia from a neurosurgical tissue sample.

Source: Google Cloud Storage | Publication: Shapson-Coe et al., 2024. Science

syndb etl h01 download --output-dir external_datasets/H01

syndb etl h01 validate --data-dir external_datasets/H01

syndb etl h01 import \
  --data-dir external_datasets/H01 \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

C. elegans

C. elegans Hermaphrodite

Complete connectome of the adult hermaphrodite C. elegans (~300 neurons), the first organism with a fully mapped nervous system. Data sourced from the OpenWorm ConnectomeToolbox.

Source: OpenWorm ConnectomeToolbox | Publication: Cook et al., 2019. Nature

syndb etl celegans download --output-dir external_datasets/CElegansHerm

syndb etl celegans validate --data-dir external_datasets/CElegansHerm

syndb etl celegans import \
  --data-dir external_datasets/CElegansHerm \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

C. elegans Male

Complete connectome of the adult male C. elegans, enabling sex-specific neural circuit comparisons.

Source: OpenWorm ConnectomeToolbox | Publication: Cook et al., 2019. Nature

syndb etl celegans-male download --output-dir external_datasets/CElegansMale

syndb etl celegans-male validate --data-dir external_datasets/CElegansMale

syndb etl celegans-male import \
  --data-dir external_datasets/CElegansMale \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

C. elegans Developmental

Connectomes across eight developmental stages of C. elegans, tracking how neural circuits are assembled during growth.

Source: GitHub | Publication: Witvliet et al., 2021. Nature

syndb etl witvliet download --output-dir external_datasets/CElegansDev

syndb etl witvliet validate --data-dir external_datasets/CElegansDev

syndb etl witvliet import \
  --data-dir external_datasets/CElegansDev \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

Other Organisms

Platynereis dumerilii

Whole-body connectome of the marine annelid Platynereis dumerilii, a three-day-old larva with ~5,000 neurons.

Source: GitHub | Publication: Verasztó et al., 2024. bioRxiv

syndb etl platynereis download --output-dir external_datasets/Platynereis

syndb etl platynereis validate --data-dir external_datasets/Platynereis

syndb etl platynereis import \
  --data-dir external_datasets/Platynereis \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

Fish1 (Larval Zebrafish)

Larval zebrafish (Danio rerio) brain connectome. Data is accessed through CAVE and requires manual export with Google OAuth authentication.

Note

Manual export

Fish1 data requires CAVE API access with Google credentials. Export a static neuron table plus either a full synapse table or an aggregated connections table, then use the validate and import commands.

Show the expected export layout:

syndb etl fish1 cave-instructions

Or export directly to a local directory:

syndb-export fish1 --out-dir external_datasets/Fish1 --no-upload
syndb etl fish1 validate --data-dir external_datasets/Fish1

syndb etl fish1 import \
  --data-dir external_datasets/Fish1 \
  --dataset-id <syndb-dataset-id> \
  --table neurons \
  --table synapses

Multi-species Databases

Allen Cell Types

Reference electrophysiology and morphology data from the Allen Institute for Brain Science. Covers mouse and human cortical neuron types with standardized measurements.

Source: Allen Cell Types Database | API: Allen Brain Map API

syndb etl allen-cell-types download --output-dir external_datasets/AllenCellTypes

syndb etl allen-cell-types validate --data-dir external_datasets/AllenCellTypes

syndb etl allen-cell-types import \
  --data-dir external_datasets/AllenCellTypes \
  --dataset-id <syndb-dataset-id> \
  --table neurons

NeuroMorpho

Curated archive of digitally reconstructed neuron morphologies from NeuroMorpho.org. Contains 200,000+ reconstructions across 100+ species.

Source: NeuroMorpho.org | Publication: Ascoli et al., 2007. Journal of Neuroscience

syndb etl neuromorpho download --output-dir external_datasets/NeuroMorpho

syndb etl neuromorpho validate --data-dir external_datasets/NeuroMorpho

syndb etl neuromorpho import \
  --data-dir external_datasets/NeuroMorpho \
  --dataset-id <syndb-dataset-id> \
  --table neurons

Collections & Tags

Organize datasets into curated collections and apply tags for discovery.

Tags

Tags are free-form metadata labels attached to datasets. They surface in search results and help users discover related data.

Add Tags

Tags are assigned during dataset creation or updated afterward via the dataset metadata endpoints.

Search by Tags

curl "https://api.syndb.xyz/v1/search/fulltext?q=drosophila+mushroom+body"

The full-text search indexes dataset tags alongside titles and descriptions. See Search.

Collections

Collections are curated groupings of datasets — for example, “All Drosophila connectomes” or “Lab X publication datasets.”

Create a Collection

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/neurodata/collections \
  -d '{
    "name": "Drosophila Connectomes",
    "notes": "All Drosophila melanogaster connectome datasets",
    "dataset_ids": [
      "11111111-1111-1111-1111-111111111111",
      "22222222-2222-2222-2222-222222222222"
    ]
  }'

List Collections

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/neurodata/collections

Get a Collection

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/neurodata/collections/{collection_id}

Collection membership is currently defined when the collection is created; there is no standalone POST /v1/neurodata/collections/{collection_id}/datasets route in the current API.

Collections are useful for meta-analysis: take the dataset_ids returned by the collection endpoints and pass them to the meta-analysis endpoint.

Provenance & Citations

SynDB tracks dataset lineage, version history, and generates machine-readable citations.

Version History

Each dataset maintains a version history. View all versions:

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/versions

Provenance Chain

The provenance endpoint shows the audit trail — who created, modified, or derived from the dataset:

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/provenance

Lineage

Track derived-from relationships between datasets:

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/lineage

Citations

Generate citations in standard formats:

# BibTeX
curl "https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/citation?format=bibtex"

# RIS (for EndNote, Zotero)
curl "https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/citation?format=ris"

JSON-LD

Export dataset metadata as linked data for integration with knowledge graphs and semantic web tools:

curl "https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/metadata.jsonld"

Returns a JSON-LD document following Schema.org and neuroscience ontology standards. See Data Standards for details on metadata formats.

Access Requests

For restricted datasets, request access from the dataset owner:

# Request access
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/access/request \
  -d '{"purpose": "Reanalysis for comparative morphology study"}'

# Check access status
curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/access

The dataset creator receives the request and can approve or deny it.

SyQL Query Language

SyQL (SynDB Query Language) is a declarative query language for neuroanatomical data. It resolves dataset metadata into optimized ClickHouse SQL, handles access control, and submits queries to the async job system.

Requires Academic verification.

Quick Start

SyQL follows familiar SQL syntax. The simplest query:

FROM neurons LIMIT 10

A more typical query:

SELECT neuron_id, cable_length, cell_type
FROM neurons
WHERE species = 'rat' AND cable_length > 100
ORDER BY cable_length DESC
LIMIT 1000

Query Structure

A full SyQL query can include:

[WITH cte_name AS (...), ...]
SELECT [DISTINCT] columns
FROM table [AS alias]
  [JOIN table [AS alias] ON conditions]
[WHERE predicates]
[GROUP BY columns]
[HAVING predicates]
[ORDER BY columns [ASC|DESC]]
[LIMIT n [OFFSET m]]
[SCOPE local|remote|federation]
[DOWNLOAD arrow|parquet|csv]

SELECT is optional — FROM neurons LIMIT 10 is equivalent to SELECT * FROM neurons LIMIT 10.
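
The equivalence can be pictured as a simple rewrite of the query text. This sketch only illustrates the stated rule; `normalize_syql` is a hypothetical helper, and the actual SyQL parser presumably works on a parsed query, not on raw strings.

```python
import re


def normalize_syql(query: str) -> str:
    """Prepend SELECT * when a query starts directly with FROM."""
    stripped = query.lstrip()
    if re.match(r"from\b", stripped, flags=re.IGNORECASE):
        return "SELECT * " + stripped
    return stripped
```

Queries that already begin with SELECT or WITH are returned unchanged.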


Tables

Base Compartment Tables

These are the primary data tables:

| Table | Primary Key | Description |
|---|---|---|
| neurons | neuron_id | Neuron morphology and metadata |
| neuron_relations | relation_id | Directed non-synaptic neuron-to-neuron relations |
| synapses | synapse_id | Synaptic connections between neurons |
| axons | axon_id | Axonal segments |
| dendrites | dendrite_id | Dendritic segments |
| dendritic_spines | spine_id | Dendritic spines |
| pre_synaptic_terminals | terminal_id | Pre-synaptic terminals |
| vesicles | vesicle_id | Synaptic vesicles |
| mitochondria | mitochondria_id | Mitochondria |

Every table includes dataset_id, created_at, and metadata columns.

Key Columns

| Table | Key columns |
|---|---|
| neurons | neuron_id, name, brain_structure, polarity, cell_type, cell_class, cell_subclass, species, majority_neurotransmitter, gaba_avg, acetylcholine_avg, glutamate_avg, octopamine_avg, serotonin_avg, dopamine_avg, tyramine_avg, betaine_avg, cable_length, is_tree, n_branches, n_skeletons, n_trees, surface_area, max_axis_length, volume, voxel_volume, voxel_radius, mesh_volume, mesh_surface_area, mesh_area_volume_ratio, mesh_sphericity, centroid_x, centroid_y, centroid_z, s3_mesh_location, s3_swb_location |
| neuron_relations | relation_id, pre_neuron_id, post_neuron_id, relation_type, neurotransmitter, strength, relation_count |
| synapses | synapse_id, pre_neuron_id, post_neuron_id, synapse_type, neurotransmitter, strength, synapse_count, centroid_x, centroid_y, centroid_z |

Use POST /v1/syql/plan to inspect the resolved plan before execution. Current plan and explain responses include the logical plan, SQL preview or compiled SQL, optional rewrite target, rewrite advisories, optional federation scatter and gather SQL, optional query_id, and optional typed result_schema.

Materialized Views

Materialized views store pre-aggregated data for fast analytics:

| View | Use Case |
|---|---|
| mv_dataset_summary | Row counts per compartment per dataset |
| mv_neuron_morphometrics | Morphometric averages/stddev per dataset |
| mv_neuron_stats | Neuron stats grouped by dataset + cell type |
| mv_neuron_out_degree | Outgoing connections per neuron |
| mv_neuron_in_degree | Incoming connections per neuron |
| mv_platform_neuron_stats | Global (platform-wide) neuron statistics |
| mv_synapse_connectivity | Synapse statistics per dataset |
| mv_synapse_stats | Synapse count/strength aggregates |
| mv_spatial_density | Spatial binning (bin_x, bin_y, bin_z) |
| mv_neurotransmitter_profile | Neurotransmitter concentrations |
| mv_vesicle_distribution | Vesicle count/volume distributions |
| mv_mitochondria_stats | Mitochondria volume/count statistics |
| mv_vesicle_diameter_histogram | Vesicle diameter binning |
| mv_mitochondria_volume_histogram | Mitochondria volume binning |
| mv_nt_by_region | Neurotransmitter by brain region |

MV columns that store intermediate aggregation state are automatically finalized — e.g., row_count becomes countMerge(row_count) in the compiled SQL. You don’t need to handle this yourself.
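To see why a finalization step exists at all, it helps to picture what an intermediate aggregation state is. The sketch below is a plain-Python analogy, not SynDB or ClickHouse code: `AvgState`, `merge`, and `finalize` are hypothetical names standing in for a stored aggregate state and the merge step that SyQL adds to the compiled SQL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Partial state for an average: enough to combine with other partial states."""
    total: float
    count: int

    @classmethod
    def from_values(cls, values):
        values = list(values)
        return cls(total=float(sum(values)), count=len(values))

    def merge(self, other):
        # Combining two partial states never needs the original rows.
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self):
        # Only at this point does the state become a plain number.
        return self.total / self.count if self.count else None


part_a = AvgState.from_values([10.0, 20.0])  # e.g. one block of inserted rows
part_b = AvgState.from_values([60.0])        # e.g. a later block
merged = part_a.merge(part_b)
print(merged.finalize())
```

Reading `merged` without calling `finalize()` would expose the raw state rather than a usable value, which is the situation automatic finalization spares you from.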

Precomputed Tables

Graph analysis results stored by the ETL pipeline:

| Table | Description |
|---|---|
| precomputed_graph_summary | Network-level statistics (density, reciprocity, clustering, motifs) |
| precomputed_degree_histogram | Degree distribution (in/out) |
| precomputed_celltype_connectivity | Cell-type-to-cell-type connectivity matrix |
| precomputed_bottleneck_neurons | Articulation point annotations |
| precomputed_clique_detail | Maximal cliques with cell-type composition |
| precomputed_dual_network | Chemical vs. electrical subnetwork metrics |
| precomputed_developmental_metrics | Per-stage developmental metrics |

Views

| View | Description |
|---|---|
| vw_celltype_connectivity | Cell-type connectivity (non-materialized view) |

Filtering (WHERE)

Data Column Filters

Standard SQL comparison operators:

WHERE cable_length > 100
WHERE cell_type = 'pyramidal'
WHERE cell_type != 'unknown'
WHERE cable_length BETWEEN 50 AND 200
WHERE cell_type IN ('pyramidal', 'interneuron', 'stellate')
WHERE name LIKE '%mushroom%'
WHERE cell_class IS NULL
WHERE cell_class IS NOT NULL

Boolean Logic

Combine predicates with AND, OR, NOT, and parentheses:

WHERE (cable_length > 100 AND volume < 500)
   OR (cell_type = 'pyramidal' AND NOT cell_class IS NULL)

Metadata Filters

These special columns filter by dataset metadata — they resolve against PostgreSQL and restrict which dataset_id values are included:

| Column | Aliases | Resolves to |
|---|---|---|
| species | | dataset.animal_species |
| brain_region | brain_structure | dataset_brain_region junction |
| license | data_license | dataset.data_license |
| microscopy | microscopy_name | dataset.microscopy_name |
| cluster_name | | federated_cluster.name |

SELECT neuron_id, cable_length
FROM neurons
WHERE species = 'mouse' AND brain_region = 'mushroom_body'
LIMIT 1000

FROM neurons WHERE species IN ('mouse', 'rat') LIMIT 1000
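Conceptually, a metadata filter is answered in two steps: look up the matching datasets, then restrict the data query to those dataset_id values. The sketch below illustrates that idea only. The in-memory `CATALOG`, its field layout, and both function names are hypothetical; the real lookup runs against PostgreSQL inside SynDB and its generated SQL may differ.

```python
# Hypothetical in-memory stand-in for the dataset metadata held in PostgreSQL.
CATALOG = [
    {"dataset_id": "ds-1", "animal_species": "mouse", "brain_regions": {"hippocampus"}},
    {"dataset_id": "ds-2", "animal_species": "rat", "brain_regions": {"cortex"}},
    {"dataset_id": "ds-3", "animal_species": "mouse", "brain_regions": {"cortex"}},
]


def resolve_dataset_ids(catalog, species=None, brain_region=None):
    """Return the dataset_ids whose metadata matches every supplied filter."""
    matches = []
    for row in catalog:
        if species is not None and row["animal_species"] not in species:
            continue
        if brain_region is not None and brain_region not in row["brain_regions"]:
            continue
        matches.append(row["dataset_id"])
    return matches


def dataset_predicate(dataset_ids):
    """Render the restriction that stands in for the metadata filter."""
    if not dataset_ids:
        return "1 = 0"  # no dataset matched, so the query returns no rows
    quoted = ", ".join("'" + d.replace("'", "''") + "'" for d in dataset_ids)
    return f"dataset_id IN ({quoted})"


ids = resolve_dataset_ids(CATALOG, species={"mouse", "rat"}, brain_region="cortex")
print(dataset_predicate(ids))
```

Filters on different metadata columns combine with AND here, which mirrors the first example query above.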

Expression Filters

Arithmetic and function calls work in WHERE:

WHERE cable_length * 2 > 500
WHERE SQRT(volume) < 100
WHERE ABS(centroid_x - centroid_y) < 10

Subquery Filters

WHERE neuron_id IN (SELECT pre_neuron_id FROM synapses WHERE strength > 5)
WHERE neuron_id NOT IN (SELECT neuron_id FROM axons)

SELECT Expressions

Columns and Aliases

SELECT neuron_id, cable_length AS length, cell_type
FROM neurons
LIMIT 100

Arithmetic

SELECT neuron_id,
       mesh_volume / mesh_surface_area AS volume_to_area,
       cable_length * 1000 AS cable_length_nm
FROM neurons
LIMIT 100

Operators: +, -, *, /, %

CASE Expressions

SELECT neuron_id,
       CASE
         WHEN cable_length > 1000 THEN 'long'
         WHEN cable_length > 100 THEN 'medium'
         ELSE 'short'
       END AS size_class
FROM neurons
LIMIT 100

Simple form:

CASE cell_type
  WHEN 'pyramidal' THEN 'excitatory'
  WHEN 'interneuron' THEN 'inhibitory'
  ELSE 'other'
END

DISTINCT

SELECT DISTINCT cell_type, brain_structure
FROM neurons

Aggregate Functions

| Function | Description |
|---|---|
| COUNT(*) | Count rows |
| COUNT(column) | Count non-null values |
| COUNT(DISTINCT column) | Count unique values |
| SUM(column) | Sum |
| AVG(column) | Mean |
| MIN(column) | Minimum |
| MAX(column) | Maximum |
| STDDEV_POP(column) | Population standard deviation |
| VAR_POP(column) | Population variance |
| QUANTILE(p)(column) | Quantile at level p (0.0–1.0) |
| MEDIAN(column) | Median (alias for QUANTILE(0.5)) |
| CORR(col1, col2) | Pearson correlation (two columns) |

SELECT dataset_id,
       COUNT(*) AS neuron_count,
       AVG(cable_length) AS avg_cable,
       STDDEV_POP(cable_length) AS std_cable,
       MEDIAN(mesh_volume) AS median_volume
FROM neurons
GROUP BY dataset_id

GROUP BY and HAVING

Group rows and filter groups:

SELECT cell_type, COUNT(*) AS n, AVG(cable_length) AS avg_cable
FROM neurons
WHERE dataset_id = '...'
GROUP BY cell_type
HAVING COUNT(*) > 10
ORDER BY avg_cable DESC

HAVING supports arithmetic, function calls, and subqueries:

HAVING STDDEV_POP(cable_length) < 50
HAVING COUNT(*) > (SELECT COUNT(*) / 100 FROM neurons WHERE dataset_id = '...')

Scalar Functions

Numeric

| Function | Description |
|---|---|
| ABS(x) | Absolute value |
| ROUND(x [, precision]) | Round |
| FLOOR(x) | Round down |
| CEIL(x) / CEILING(x) | Round up |
| SQRT(x) | Square root |
| POWER(x, y) / POW(x, y) | Exponentiation |
| GREATEST(a, b, ...) | Maximum of arguments |
| LEAST(a, b, ...) | Minimum of arguments |

Conditional

| Function | Description |
|---|---|
| IF(cond, then, else) | Ternary conditional |
| IFNULL(value, default) | Null coalescing |
| COALESCE(a, b, ...) | First non-null value |
| NULLIF(a, b) | Returns null if a = b |

String

| Function | Description |
|---|---|
| LENGTH(s) | String length |
| LOWER(s) | Lowercase |
| UPPER(s) | Uppercase |
| TRIM(s) | Strip whitespace |
| CONCAT(a, b, ...) | Concatenate strings |
| SUBSTRING(s, pos [, len]) / SUBSTR(...) | Extract substring |

Type Casting

| Function | Description |
|---|---|
| TOFLOAT64(x) | Cast to Float64 |
| TOINT32(x) | Cast to Int32 |
| TOINT64(x) | Cast to Int64 |
| TOUINT64(x) | Cast to UInt64 |
| TOSTRING(x) | Cast to String |

Window Functions

Window functions compute values across a set of rows related to the current row.

Syntax

function(...) OVER (
  [PARTITION BY expr, ...]
  [ORDER BY expr [ASC|DESC], ...]
  [ROWS|RANGE BETWEEN start AND end]
)

Ranking Functions

SELECT neuron_id,
       cable_length,
       RANK() OVER (PARTITION BY dataset_id ORDER BY cable_length DESC) AS rank,
       DENSE_RANK() OVER (PARTITION BY dataset_id ORDER BY cable_length DESC) AS dense_rank,
       ROW_NUMBER() OVER (PARTITION BY dataset_id ORDER BY cable_length DESC) AS row_num
FROM neurons

Offset Functions

SELECT neuron_id,
       cable_length,
       LAG(cable_length) OVER (ORDER BY neuron_id) AS prev_cable,
       LEAD(cable_length, 2) OVER (ORDER BY neuron_id) AS next_2_cable,
       FIRST_VALUE(cable_length) OVER (PARTITION BY dataset_id ORDER BY neuron_id) AS first_cable,
       LAST_VALUE(cable_length) OVER (PARTITION BY dataset_id ORDER BY neuron_id) AS last_cable
FROM neurons

Window Frames

SUM(cable_length) OVER (
  ORDER BY neuron_id
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total

Frame bounds: UNBOUNDED PRECEDING, N PRECEDING, CURRENT ROW, N FOLLOWING, UNBOUNDED FOLLOWING.

Frame units: ROWS or RANGE.
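For readers who think more easily in procedural terms, the following Python sketch reproduces what a ROWS frame selects. It is an analogy on hypothetical sample rows, not SyQL output; `moving_sum` is an illustrative helper name.

```python
from itertools import accumulate

# Hypothetical (neuron_id, cable_length) rows, already sorted by neuron_id.
rows = [(1, 120.0), (2, 80.0), (3, 200.0)]
lengths = [length for _, length in rows]

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: everything up to this row.
running_total = list(accumulate(lengths))


def moving_sum(values, preceding, following):
    """ROWS BETWEEN `preceding` PRECEDING AND `following` FOLLOWING, clipped at the edges."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - preceding)
        hi = min(len(values), i + following + 1)
        out.append(sum(values[lo:hi]))
    return out


print(running_total)
print(moving_sum(lengths, preceding=1, following=0))
```

The first list grows with every row, while the second only ever looks one row back, which is the practical difference between an unbounded and a bounded frame.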

Aggregates as Window Functions

Any aggregate function can be used with OVER:

SELECT neuron_id,
       cable_length,
       AVG(cable_length) OVER (PARTITION BY dataset_id) AS dataset_avg
FROM neurons

JOINs

Supported Join Types

  • INNER JOIN (or just JOIN)
  • LEFT JOIN
  • RIGHT JOIN
  • FULL OUTER JOIN
  • CROSS JOIN

ON Conditions

ON clauses support equality conditions chained with AND:

SELECT n.neuron_id, n.cable_length, s.strength
FROM neurons AS n
INNER JOIN synapses AS s
  ON n.dataset_id = s.dataset_id AND n.neuron_id = s.pre_neuron_id
WHERE n.dataset_id = '...'
LIMIT 1000

Self-Joins

Useful for analyzing reciprocal connections:

SELECT
    count() / 2 AS reciprocal_pairs,
    toFloat64(count()) / 2.0
      / greatest(toFloat64((SELECT count() FROM synapses WHERE dataset_id = '...')), 1.0)
      AS reciprocity
FROM synapses AS s1
INNER JOIN synapses AS s2
    ON s1.dataset_id = s2.dataset_id
    AND s1.pre_neuron_id = s2.post_neuron_id
    AND s1.post_neuron_id = s2.pre_neuron_id
WHERE s1.dataset_id = '...'
    AND s1.pre_neuron_id < s1.post_neuron_id

Joining Materialized Views

SELECT o.neuron_id, out_degree, in_degree
FROM mv_neuron_out_degree AS o
FULL OUTER JOIN mv_neuron_in_degree AS i
    ON o.dataset_id = i.dataset_id AND o.neuron_id = i.neuron_id
WHERE o.dataset_id = '...' OR i.dataset_id = '...'
GROUP BY o.neuron_id, i.neuron_id
ORDER BY (out_degree + in_degree) DESC
LIMIT 100

CROSS JOIN with Subqueries

Useful for z-score comparisons against global statistics:

SELECT
    ds.dataset_id,
    (AVG(ds.cable_length) - global.global_avg)
        / greatest(global.global_std, 0.0000000001) AS zscore
FROM neurons AS ds
CROSS JOIN (
    SELECT AVG(cable_length) AS global_avg, STDDEV_POP(cable_length) AS global_std
    FROM neurons
) AS global
WHERE ds.dataset_id IN ('...')
GROUP BY ds.dataset_id
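The arithmetic behind this query can be checked offline. The sketch below recomputes the same quantity in Python on hypothetical cable lengths: the mean of each selected dataset minus the mean over all rows, divided by the population standard deviation over all rows with the same small floor. The data and the function name `dataset_zscores` are assumptions for illustration.

```python
from statistics import fmean, pstdev

# Hypothetical cable lengths keyed by dataset_id.
cable_lengths = {
    "ds-1": [100.0, 120.0, 140.0],
    "ds-2": [200.0, 220.0, 240.0],
}

EPSILON = 1e-10  # same floor as greatest(global.global_std, 0.0000000001)


def dataset_zscores(values_by_dataset, selected):
    """Z-score of each selected dataset's mean against the mean/stddev of all rows."""
    all_values = [v for values in values_by_dataset.values() for v in values]
    global_avg = fmean(all_values)
    global_std = pstdev(all_values)  # population stddev, like STDDEV_POP
    return {
        ds: (fmean(values_by_dataset[ds]) - global_avg) / max(global_std, EPSILON)
        for ds in selected
    }


zscores = dataset_zscores(cable_lengths, ["ds-1", "ds-2"])
print(zscores)
```

The floor matters only when every value is identical: the standard deviation is then zero and the floor keeps the division defined.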

CTEs (WITH Clauses)

Common Table Expressions let you name intermediate result sets:

WITH all_neurons AS (
    SELECT pre_neuron_id AS neuron_id FROM synapses WHERE dataset_id = '...'
    UNION ALL
    SELECT post_neuron_id FROM synapses WHERE dataset_id = '...'
)
SELECT
    neuron_count,
    edge_count,
    toFloat64(edge_count) / greatest(toFloat64(neuron_count) * (toFloat64(neuron_count) - 1), 1) AS density
FROM (
    SELECT
        (SELECT COUNT(DISTINCT neuron_id) FROM all_neurons) AS neuron_count,
        count() AS edge_count
    FROM synapses
    WHERE dataset_id = '...'
) AS t

Later CTEs can reference earlier ones. WITH RECURSIVE is not supported.
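As a cross-check on the density formula in this query, here is the same computation in Python over a hypothetical edge list. It mirrors the query step by step: collect both endpoints, count distinct neurons, count edges, and divide by n * (n - 1) with the denominator floored at 1. `graph_density` is an illustrative name.

```python
def graph_density(edges):
    """Directed density: edge_count / (n * (n - 1)), with the denominator floored at 1."""
    neurons = set()
    for pre, post in edges:
        neurons.add(pre)   # first branch of the UNION ALL
        neurons.add(post)  # second branch of the UNION ALL
    neuron_count = len(neurons)  # COUNT(DISTINCT neuron_id)
    edge_count = len(edges)      # count() over synapses
    denominator = max(neuron_count * (neuron_count - 1), 1)
    return neuron_count, edge_count, edge_count / denominator


# Hypothetical (pre_neuron_id, post_neuron_id) rows for one dataset.
edges = [("a", "b"), ("b", "a"), ("b", "c")]
print(graph_density(edges))
```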


UNION ALL

Combine multiple queries:

SELECT 'pre' AS direction, pre_neuron_id AS neuron_id FROM synapses WHERE dataset_id = '...'
UNION ALL
SELECT 'post' AS direction, post_neuron_id AS neuron_id FROM synapses WHERE dataset_id = '...'
ORDER BY neuron_id
LIMIT 1000

Only UNION ALL is supported. UNION (without ALL), EXCEPT, and INTERSECT are not available. Outer ORDER BY, LIMIT, and OFFSET apply across the combined result.


Ordering and Pagination

ORDER BY cable_length DESC
LIMIT 100
OFFSET 200

ORDER BY supports expressions:

ORDER BY cable_length * 2 DESC
ORDER BY SQRT(volume) ASC

Parameter Binding

Parameterized queries prevent injection and allow reuse.

Positional Parameters

Use ? placeholders — they are assigned 1-based indices left to right:

{
  "query": "SELECT neuron_id, cable_length FROM neurons WHERE cable_length > ? AND volume < ? LIMIT ?",
  "params": [100, 999.5, 1000]
}

Named Parameters

Use :name placeholders:

{
  "query": "SELECT neuron_id FROM neurons WHERE cell_type = :ct AND species = :species LIMIT 1000",
  "named_params": {"ct": "pyramidal", "species": "rat"}
}

Parameters can be used in WHERE, SELECT expressions, HAVING, ORDER BY, and function arguments. The examples in this chapter also bind LIMIT values.
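When request bodies are built in code, it is easy to let placeholders and parameters drift apart. The sketch below is a client-side helper under stated assumptions: the function names are hypothetical, the placeholder scan is deliberately naive (it does not skip `?` or `:name` inside string literals), and only the `query`, `params`, and `named_params` fields shown above are used.

```python
import json
import re

# Matches :name placeholders; naive, it does not skip string literals.
PLACEHOLDER = re.compile(r"(?<![:\w]):([A-Za-z_]\w*)")


def positional_request(query, params):
    """Pair each ? with its 1-based index, left to right, and build the body."""
    if query.count("?") != len(params):
        raise ValueError("placeholder count does not match params")
    binding = dict(enumerate(params, start=1))  # {1: first, 2: second, ...}
    return json.dumps({"query": query, "params": params}), binding


def named_request(query, named_params):
    """Build the body after checking that placeholders and parameters match."""
    wanted = set(PLACEHOLDER.findall(query))
    supplied = set(named_params)
    if wanted != supplied:
        missing, unused = sorted(wanted - supplied), sorted(supplied - wanted)
        raise ValueError(f"missing={missing} unused={unused}")
    return json.dumps({"query": query, "named_params": named_params})


body, binding = positional_request(
    "SELECT neuron_id FROM neurons WHERE cable_length > ? AND volume < ? LIMIT ?",
    [100, 999.5, 1000],
)
print(binding)
```

Failing fast on a mismatch in the client gives a clearer error than a rejected request, at the cost of duplicating a check the server presumably performs as well.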


Output Format (DOWNLOAD)

Control the result format:

SELECT neuron_id, cable_length FROM neurons LIMIT 1000 DOWNLOAD csv

| Format | Description |
|---|---|
| arrow | Apache Arrow (default) |
| parquet | Apache Parquet |
| csv | CSV |

Federation Scope (SCOPE)

Control where the query executes:

SELECT COUNT(*) FROM neurons GROUP BY dataset_id SCOPE federation

| Scope | Description |
|---|---|
| local | Execute on the local database (default) |
| remote | Execute on a remote federated cluster |
| federation | Scatter/gather across all federated nodes |

When using federation scope, the hub automatically decomposes the query into scatter SQL (sent to each node) and gather SQL (merged locally). See Cross-Cluster Queries.
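The scatter/gather pattern itself is easy to picture with the COUNT(*) ... GROUP BY dataset_id example. The Python sketch below shows only the general idea on hypothetical rows: each node reduces its own data to partial counts, and the hub adds the partials together. It does not reproduce the SQL that SynDB actually generates; `scatter` and `gather` are illustrative names.

```python
from collections import Counter


def scatter(node_rows):
    """Scatter step: each node computes its own count per dataset_id."""
    return Counter(row["dataset_id"] for row in node_rows)


def gather(partials):
    """Gather step: the hub sums the partial counts returned by the nodes."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts instead of replacing them
    return dict(total)


# Hypothetical rows held by two nodes; only the per-dataset counts leave each node.
node_a = [{"dataset_id": "ds-1"}, {"dataset_id": "ds-1"}, {"dataset_id": "ds-2"}]
node_b = [{"dataset_id": "ds-2"}, {"dataset_id": "ds-3"}]

print(gather([scatter(node_a), scatter(node_b)]))
```

Counts and sums merge by simple addition. An aggregate such as AVG cannot be merged that way, so a decomposition would have to ship sums and counts separately and divide at the hub.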

Automatic MV Rewriting

SyQL automatically rewrites queries to use materialized views when possible. For example:

SELECT AVG(cable_length) FROM neurons GROUP BY dataset_id

This is automatically rewritten to query mv_neuron_morphometrics instead of scanning the full neurons table — significantly faster for large datasets.

Use the explain endpoint to see whether your query was rewritten and which MV was selected. The response includes advisories explaining why alternative MVs were rejected.


API Workflow

SyQL has a three-stage pipeline:

| Stage | Endpoint | What it does |
|---|---|---|
| Plan | POST /v1/syql/plan | Parse → validate → resolve metadata → return logical plan |
| Explain | POST /v1/syql/explain | Plan + compile to SQL → return compiled query and advisories |
| Execute | POST /v1/syql/exec | Plan + compile + submit to job queue → return job ID |

Use plan to validate syntax and inspect the resolved schema. Use explain to preview the generated SQL before committing to execution. Use exec when you’re ready to run.
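Because the three stages accept the same request body, a client can treat the stage as a parameter. The sketch below builds the requests with Python's standard library and does not send them; `build_request` and the `TOKEN` placeholder are hypothetical, and the response shapes are not modelled.

```python
import json
import urllib.request

BASE_URL = "https://api.syndb.xyz/v1/syql"
STAGES = ("plan", "explain", "exec")


def build_request(stage, query, token):
    """Build (but do not send) the POST request for one SyQL stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return urllib.request.Request(
        f"{BASE_URL}/{stage}",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


query = "SELECT neuron_id, cable_length FROM neurons WHERE species = 'mouse' LIMIT 100"
requests_by_stage = {stage: build_request(stage, query, token="TOKEN") for stage in STAGES}

for stage, req in requests_by_stage.items():
    print(req.get_method(), req.full_url)

# To send one of them: urllib.request.urlopen(requests_by_stage["plan"])
```

Building the body with `json.dumps` also avoids the shell quoting needed to embed single quotes in a curl command.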

Plan

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/syql/plan \
  -d '{"query": "SELECT neuron_id, cable_length FROM neurons WHERE species = '\''mouse'\'' LIMIT 100"}'

Returns the parsed logical plan: resolved tables, columns, filters, metadata, and result schema.

Explain

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/syql/explain \
  -d '{"query": "SELECT AVG(cable_length) FROM neurons GROUP BY dataset_id"}'

Returns:

  • The compiled ClickHouse SQL (with any MV rewrites applied)
  • Query advisories and any selected MV rewrite target
  • Optional federation scatter_sql and gather_sql
  • Optional query_id and typed result_schema

Execute

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/syql/exec \
  -d '{"query": "SELECT neuron_id, cable_length FROM neurons WHERE species = '\''mouse'\'' ORDER BY cable_length DESC LIMIT 1000"}'

Returns a job_id. Track and download results via the Jobs System.

Cancel

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/syql/cancel \
  -d '{"query_id": "..."}'

The query_id may be returned by plan, explain, or execute when the query can be cancelled through ClickHouse.


Examples

Neuron morphometrics per dataset

SELECT dataset_id,
       COUNT(*) AS n,
       AVG(cable_length) AS avg_cable,
       STDDEV_POP(cable_length) AS std_cable,
       AVG(mesh_volume) AS avg_volume,
       AVG(mesh_sphericity) AS avg_sphericity
FROM neurons
GROUP BY dataset_id
ORDER BY n DESC

Top connected neurons

SELECT o.neuron_id, out_degree, in_degree,
       (out_degree + in_degree) AS total_degree
FROM mv_neuron_out_degree AS o
FULL OUTER JOIN mv_neuron_in_degree AS i
    ON o.dataset_id = i.dataset_id AND o.neuron_id = i.neuron_id
WHERE o.dataset_id = '...' OR i.dataset_id = '...'
GROUP BY o.neuron_id, i.neuron_id
ORDER BY total_degree DESC
LIMIT 50

Z-score comparison across datasets

SELECT
    toString(ds.dataset_id) AS dataset_id,
    'cable_length' AS metric,
    (ds.dataset_avg - global.global_avg)
        / greatest(global.global_std, 0.0000000001) AS zscore
FROM (
    SELECT dataset_id, AVG(cable_length) AS dataset_avg
    FROM neurons
    WHERE dataset_id IN ('uuid1', 'uuid2')
    GROUP BY dataset_id
) AS ds
CROSS JOIN (
    SELECT AVG(cable_length) AS global_avg, STDDEV_POP(cable_length) AS global_std
    FROM neurons
) AS global

Network reciprocity

SELECT
    count() / 2 AS reciprocal_pairs,
    toFloat64(count()) / 2.0
      / greatest(toFloat64((SELECT count() FROM synapses WHERE dataset_id = '...')), 1.0)
      AS reciprocity
FROM synapses AS s1
INNER JOIN synapses AS s2
    ON s1.dataset_id = s2.dataset_id
    AND s1.pre_neuron_id = s2.post_neuron_id
    AND s1.post_neuron_id = s2.pre_neuron_id
WHERE s1.dataset_id = '...'
    AND s1.pre_neuron_id < s1.post_neuron_id

Graph density with CTEs

WITH all_neurons AS (
    SELECT pre_neuron_id AS neuron_id FROM synapses WHERE dataset_id = '...'
    UNION ALL
    SELECT post_neuron_id FROM synapses WHERE dataset_id = '...'
)
SELECT
    neuron_count,
    edge_count,
    avg_strength,
    toFloat64(edge_count)
      / greatest(toFloat64(neuron_count) * (toFloat64(neuron_count) - 1), 1)
      AS density
FROM (
    SELECT
        (SELECT COUNT(DISTINCT neuron_id) FROM all_neurons) AS neuron_count,
        count() AS edge_count,
        avg(strength) AS avg_strength
    FROM synapses
    WHERE dataset_id = '...'
) AS t

Ranking neurons within a dataset

SELECT neuron_id, cable_length,
       RANK() OVER (ORDER BY cable_length DESC) AS rank
FROM neurons
WHERE dataset_id = '...'
LIMIT 100

Parameterized query

{
  "query": "SELECT neuron_id, cable_length, cell_type FROM neurons WHERE cell_type = :ct AND cable_length > :min_cable ORDER BY cable_length DESC LIMIT :n",
  "named_params": {"ct": "pyramidal", "min_cable": 500, "n": 100}
}

Unsupported Features

These SQL features are intentionally not supported:

  • INSERT, UPDATE, DELETE, CREATE, DROP — SyQL is read-only
  • WITH RECURSIVE — recursive CTEs
  • UNION (without ALL), EXCEPT, INTERSECT — only UNION ALL is available
  • GROUPS window frame unit — only ROWS and RANGE
  • Arbitrary ClickHouse functions — only whitelisted functions are allowed

Saved Queries

Frequently used SyQL queries can be saved for reuse. See Saved Queries.

Saved Queries

Save SyQL queries for reuse, sharing, and scheduled re-execution.

Requires Academic verification.

Save a Query

For most workflows, save directly from SyQL. The structured POST /v1/queries route is mainly for already-resolved table and dataset selections.

From SyQL

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/queries/from-syql \
  -d '{
    "label": "Mushroom body neuron volumes",
    "query": "SELECT mesh_volume FROM neurons WHERE brain_region = '\''mushroom_body'\''",
    "description": "All neuron mesh volumes in the mushroom body"
  }'

Direct Save

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/queries \
  -d '{
    "label": "Neuron cable lengths",
    "description": "Neuron morphology subset for one dataset",
    "syndb_table": 1,
    "dataset_ids": ["11111111-1111-1111-1111-111111111111"],
    "columns": ["neuron_id", "cable_length"],
    "query_scope": "local"
  }'

syndb_table is the current numeric SyndbTable discriminant. Use the SyQL-backed save route when you want server-side resolution from a query string instead of a pre-selected table and dataset list.

List Saved Queries

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/queries

Get a Query

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/queries/{query_id}

Update

curl -X PUT -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/queries/{query_id} \
  -d '{"label": "Updated label", "description": "Updated description"}'

Delete

curl -X DELETE -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/queries/{query_id}

Run a Saved Query

curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/queries/{query_id}/run

Submits the query to the job system and returns a job ID.

Refresh Run Statuses

curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/queries/{query_id}/refresh

This polls all non-terminal runs attached to the saved query and returns the refreshed saved-query record.

CLI

The CLI saved-query commands are server-backed and operate on the same saved query store as the web UI and API.

syndb query list
syndb query save-syql --label "My query" "SELECT neuron_id FROM neurons LIMIT 100"
syndb query save --label "Neuron subset" --table 1 --dataset-id {dataset_id} --column neuron_id
syndb query show --id {query_id}
syndb query run --id {query_id}
syndb query status --id {query_id}
syndb query update --id {query_id} --label "New label"
syndb query delete --id {query_id}

Analytics

Pre-computed analytics endpoints for dataset exploration. These query ClickHouse materialized views and return results quickly (cached for 5 minutes).

Requires Academic verification.

Dataset Summary

Row counts per compartment type:

curl -H "Authorization: Bearer $TOKEN" \
  "https://api.syndb.xyz/v1/analytics/summary?dataset_ids=uuid1,uuid2"

Returns per-dataset table counts plus total_rows.

Neuron Morphometrics

Morphological statistics for neurons in a dataset:

curl -H "Authorization: Bearer $TOKEN" \
  "https://api.syndb.xyz/v1/analytics/morphometrics?dataset_ids=uuid1,uuid2"

Returns per-dataset means and standard deviations for metrics such as cable length, surface area, volume, mesh volume, mesh sphericity, and branch counts.

Z-Score Comparison

Standardized comparison of a metric across multiple datasets:

curl -H "Authorization: Bearer $TOKEN" \
  "https://api.syndb.xyz/v1/analytics/comparison?dataset_ids=uuid1,uuid2,uuid3&metric=mesh_volume"

Omit metric to return the current six-metric neuron morphometrics comparison; include metric to request a single z-score series.

Graph Summary

Network-level statistics for connectome datasets:

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/analytics/graph/{dataset_id}/summary

Returns neuron_count, edge_count, density, and avg_strength.

Reciprocity

Fraction of bidirectional synaptic connections:

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/analytics/graph/{dataset_id}/reciprocity

Degree Distribution

Top neurons by connectivity:

curl -H "Authorization: Bearer $TOKEN" \
  "https://api.syndb.xyz/v1/analytics/graph/{dataset_id}/degree-distribution?top_k=50"

Returns the top top_k neurons with in-degree, out-degree, and average inbound and outbound strength.

Graph Analysis

In-memory graph analysis on connectome datasets. SynDB constructs a directed graph from synapse data in ClickHouse (up to 10M edges) and runs network algorithms using petgraph.

Requires Academic verification.

Graph Metrics

Basic network statistics:

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/graph/{dataset_id}/metrics

Returns current graph metrics including node count, edge count, density, reciprocity, average in and out degree, maximum in and out degree, and strongly connected component counts.

Motif Analysis (Triadic Census)

Count all 16 three-node subgraph patterns (triadic census):

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/graph/{dataset_id}/motifs \
  -d '{}'

Compare by Synapse Type

Compare motif distributions across synapse types within the same dataset:

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/graph/{dataset_id}/motifs/compare-synapse-types \
  -d '{"sample_size": 500}'

Shortest Path

Find the shortest path between two neurons:

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/graph/{dataset_id}/shortest-path \
  -d '{
    "source_neuron_id": "11111111-1111-1111-1111-111111111111",
    "target_neuron_id": "22222222-2222-2222-2222-222222222222",
    "weight_mode": "hops"
  }'

Uses Dijkstra’s algorithm. Supports configurable edge weight modes.
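For reference, a minimal Dijkstra search over a directed edge list looks like the sketch below. It is not SynDB's implementation. The `"hops"` mode, in which every edge costs 1, comes from the request above; the `"inverse_strength"` mode is an illustrative assumption and not a documented option.

```python
import heapq


def shortest_path(edges, source, target, weight_mode="hops"):
    """Dijkstra over directed (pre, post, strength) edges; returns (cost, path) or None."""
    graph = {}
    for pre, post, strength in edges:
        # "hops" counts every edge as 1. "inverse_strength" is a hypothetical
        # mode in which stronger connections are cheaper to traverse.
        weight = 1.0 if weight_mode == "hops" else 1.0 / max(strength, 1e-12)
        graph.setdefault(pre, []).append((post, weight))

    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            path = [node]
            while path[-1] != source:
                path.append(previous[path[-1]])
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    return None  # target is not reachable from source


# Hypothetical (pre_neuron_id, post_neuron_id, strength) edges.
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 0.1), ("c", "d", 1.0)]
print(shortest_path(edges, "a", "d"))
print(shortest_path(edges, "a", "d", weight_mode="inverse_strength"))
```

The two calls return different routes on the same graph, which is why the choice of weight mode should be stated alongside any reported path.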

Reachability

Find all neurons reachable within N hops:

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/graph/{dataset_id}/reachability \
  -d '{"source_neuron_id": "11111111-1111-1111-1111-111111111111", "max_hops": 3}'

Reachability uses a breadth-first search (BFS) traversal; max_hops can be at most 100.
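A hop-limited breadth-first search can be sketched in a few lines of Python. This is an illustration on a hypothetical edge list rather than SynDB's code. `reachable_within` is an invented name, and excluding the source neuron from the result is an assumption about the desired output.

```python
from collections import deque

MAX_HOPS_LIMIT = 100  # the documented cap on max_hops


def reachable_within(edges, source, max_hops):
    """Return {neuron: hop distance} for neurons reachable in at most max_hops steps."""
    if not 0 <= max_hops <= MAX_HOPS_LIMIT:
        raise ValueError("max_hops must be between 0 and 100")
    graph = {}
    for pre, post in edges:
        graph.setdefault(pre, []).append(post)

    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[source]  # report only the neurons reached, not the start
    return distance


# Hypothetical directed edges forming a 4-cycle.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
print(reachable_within(edges, "a", max_hops=2))
```

Because BFS visits neurons in order of hop distance, the first distance recorded for a neuron is also its smallest.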

Reachability Curve

Sample how reachability grows with hop count:

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/graph/{dataset_id}/reachability-curve \
  -d '{"sample_size": 100, "max_hops": 20, "seed": 42}'

Returns the mean and standard deviation of the reachable fraction at each hop distance. Currently, sample_size is capped at 500 and max_hops at 20.
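One way to compute such a curve is sketched below. Several details are assumptions made for the example, not documented behaviour: source neurons are drawn uniformly without replacement, the reachable fraction is measured against all other neurons in the graph, and the standard deviation is the population form. Function names are hypothetical.

```python
import random
from collections import deque
from statistics import fmean, pstdev


def hop_distances(graph, source):
    """BFS hop distance from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def reachability_curve(edges, sample_size, max_hops, seed):
    """(hop, mean, stddev) of the reachable fraction over sampled source neurons."""
    graph, nodes = {}, set()
    for pre, post in edges:
        graph.setdefault(pre, []).append(post)
        nodes.update((pre, post))
    ordered = sorted(nodes)
    if not ordered:
        return []
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sources = rng.sample(ordered, min(sample_size, len(ordered)))
    distances = [hop_distances(graph, s) for s in sources]
    others = max(len(ordered) - 1, 1)  # assumed denominator: all other neurons

    curve = []
    for hop in range(1, max_hops + 1):
        fractions = [
            sum(1 for d in dist.values() if 0 < d <= hop) / others
            for dist in distances
        ]
        curve.append((hop, fmean(fractions), pstdev(fractions)))
    return curve


# Hypothetical directed 4-cycle: every neuron reaches one more neuron per hop.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
for hop, mean, std in reachability_curve(edges, sample_size=4, max_hops=3, seed=42):
    print(hop, round(mean, 3), round(std, 3))
```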

Full Analysis

Run metrics + motifs + hub neuron detection in one call:

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/graph/{dataset_id}/full-analysis \
  -d '{"max_edges": 5000000, "top_hubs": 20}'

Cross-Dataset Comparison

Compare graph properties across multiple datasets:

curl -H "Authorization: Bearer $TOKEN" \
  "https://api.syndb.xyz/v1/graph/compare?dataset_ids=uuid-1,uuid-2,uuid-3"

Graph Precompute (CLI)

For large datasets, precompute graph metrics and store results in ClickHouse materialized tables:

syndb graph-precompute --dataset flywire

--dataset accepts current dataset keys such as flywire, manc, or h01. This is a batch operation typically run as part of the ETL pipeline or as a Kubernetes job.

Meta-Analysis

Cross-dataset meta-analysis computes effect sizes and heterogeneity statistics across multiple datasets, enabling comparisons that no single dataset can answer.

Requires Academic verification.

Cross-Dataset Analysis

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/meta-analysis \
  -d '{
    "table": "neurons",
    "metric": "mesh_volume",
    "grouping": "brain_structure",
    "dataset_ids": "uuid-1,uuid-2,uuid-3"
  }'

Parameters

| Field | Required | Description |
|---|---|---|
| table | Yes | Target table: neurons, synapses, dendrites, axons, pre_synaptic_terminals, dendritic_spines, vesicles, mitochondria |
| metric | Yes | Column to analyze (e.g., mesh_volume, mesh_surface_area, connection_score) |
| grouping | Yes | Grouping dimension (e.g., species, brain_structure, cell_type, dataset) |
| dataset_ids | Yes | Comma-separated dataset UUIDs |
| scope | No | "local" (default) or "federation" |
| cluster_ids | No | Comma-separated federation cluster UUIDs; required when scope is federation |

Atlas Comparison

Compare dataset metrics against reference atlases (pre-aggregated materialized views):

curl -H "Authorization: Bearer $TOKEN" \
  "https://api.syndb.xyz/v1/meta-analysis/atlas/compare?dataset_ids=uuid-1,uuid-2&grouping=species&metric=mesh_volume"

Federation Scope

To run meta-analysis across federated nodes:

{
  "table": "synapses",
  "metric": "connection_score",
  "grouping": "dataset",
  "scope": "federation",
  "dataset_ids": "uuid-1,uuid-2",
  "cluster_ids": "cluster-uuid-1,cluster-uuid-2"
}

The hub fans the aggregation out to each specified cluster and merges the results. See Cross-Cluster Queries.

Jobs System

Long-running queries execute asynchronously through the job system. Submit a job, check its status, and download results when ready.

Requires Academic verification.

Workflow

Submit job → Job queued → Job running → Job completed → Download result
                                      → Job failed (check error, rerun)

Submit a Query Job

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/jobs \
  -d '{
    "syndb_table": 1,
    "dataset_ids": ["11111111-1111-1111-1111-111111111111"],
    "columns": ["neuron_id", "cable_length"],
    "query_scope": "local",
    "row_limit": 1000
  }'

Returns a job_id for tracking.

For most ad hoc querying, use SyQL execution or syndb query exec; they compile and submit this structured job request for you.

Submit a Graph Job

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/jobs/graph \
  -d '{
    "dataset_id": "11111111-1111-1111-1111-111111111111",
    "max_edges": 5000000,
    "motif_sample_size": 1000,
    "top_hubs": 20
  }'

Check Status

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/jobs/{job_id}

| Status | Meaning |
|---|---|
| pending | Queued, waiting for a worker |
| running | Currently executing |
| completed | Results available for download |
| failed | Execution error (check error) |
| cancelled | Cancelled by user |
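A client typically polls the status endpoint until the job reaches one of the three terminal statuses. The sketch below keeps the HTTP call out of the loop by accepting a `fetch_status` callable, since the exact shape of the status response is not described here. `wait_for_job`, the polling interval, and the poll limit are all hypothetical choices.

```python
import time

TERMINAL = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status, job_id, interval=2.0, max_polls=100, sleep=time.sleep):
    """Poll until the job reaches a terminal status or max_polls is exhausted."""
    for _ in range(max_polls):
        status = fetch_status(job_id)
        if status in TERMINAL:
            return status
        sleep(interval)
    raise TimeoutError(f"job {job_id} still not finished after {max_polls} polls")


# Stand-in for a GET /v1/jobs/{job_id} call; replays a scripted status sequence.
scripted = iter(["pending", "running", "running", "completed"])
final = wait_for_job(lambda job_id: next(scripted), "job-123", sleep=lambda s: None)
print(final)
```

Injecting `sleep` as well keeps the loop testable without real delays; in production the default `time.sleep` applies.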

List Your Jobs

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/jobs

Download Results

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/jobs/{job_id}/result \
  -o result.arrow

Cancel a Job

curl -X DELETE -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/jobs/{job_id}

Rerun a Job

curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/jobs/{job_id}/rerun

Creates a new job with the same parameters.

Configuration

| Parameter | Default | Environment Variable |
|---|---|---|
| Max concurrent workers | 4 | JOB_QUEUE_MAX_WORKERS |
| Result TTL | 24 hours | JOB_RESULT_TTL_HOURS |
| Max result size | 1 GB | JOB_MAX_RESULT_BYTES |

Results are stored in object storage and automatically cleaned up after the TTL expires.

Federation Overview

SynDB federation allows multiple institutions to participate in a shared neuroscience data network while retaining full control of their data. Each institution runs a node with its own ClickHouse instance; a central hub coordinates queries across all nodes.

Why Federate?

| Concern | Without federation | With federation |
|---|---|---|
| Data sovereignty | Upload all data to a central server | Data stays on your infrastructure |
| Meta-analysis | Limited to datasets on one instance | Query across all participating institutions |
| Compliance | Data leaves your network | Data never leaves; only query results cross boundaries |
| Latency | Single point of access | Local reads are fast; cross-cluster queries pay network cost |

Key Concepts

Hub — The coordinating instance that runs the full SynDB stack (API, PostgreSQL, ClickHouse, Meilisearch, S3). It maintains a registry of federated clusters, monitors their health, and routes cross-cluster queries.

Node — A lightweight participant running ClickHouse and the syndb-node binary. Nodes register with the hub via libp2p or HTTP, receive schema migrations, and respond to delegated queries.

Schema versioning — The hub pushes ClickHouse DDL migrations to all nodes. Queries only route to nodes whose schema version is compatible.

Health monitoring — The hub periodically checks each node’s health. Nodes are classified as Healthy, Degraded, Unreachable, or Unknown. Unhealthy nodes are excluded from federation queries.

Federation password — A shared secret that nodes present when registering with the hub. Prevents unauthorized clusters from joining.
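Schema versioning and health monitoring together decide which nodes a federation query can reach. The sketch below shows one possible selection rule in Python. It rests on two assumptions that the text above does not settle: only Healthy nodes are eligible (Degraded could reasonably be allowed too), and "compatible" means an equal schema version. `Node`, `eligible_nodes`, and the integer version are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNREACHABLE = "Unreachable"
    UNKNOWN = "Unknown"


@dataclass(frozen=True)
class Node:
    name: str
    health: Health
    schema_version: int


def eligible_nodes(nodes, required_schema_version, allowed=frozenset({Health.HEALTHY})):
    """Names of nodes a federation query may be routed to under the assumed policy."""
    return [
        n.name
        for n in nodes
        # Assumption: compatibility is modelled as an exact version match.
        if n.health in allowed and n.schema_version == required_schema_version
    ]


nodes = [
    Node("node-a", Health.HEALTHY, 12),
    Node("node-b", Health.HEALTHY, 11),      # schema too old
    Node("node-c", Health.UNREACHABLE, 12),  # excluded by health
    Node("node-d", Health.DEGRADED, 12),     # excluded under the default policy
]
print(eligible_nodes(nodes, required_schema_version=12))
```

Passing a wider `allowed` set, for example one that includes `Health.DEGRADED`, models a more permissive policy without changing the function.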

When to Federate vs. Upload

Federate when:

  • Institutional policy requires data to stay on-premise
  • You have existing ClickHouse infrastructure
  • You want to contribute to cross-institutional meta-analysis without data transfer

Upload directly when:

  • You don’t have infrastructure to maintain
  • Your data has no residency requirements
  • You want the simplest path to sharing

Architecture at a Glance

  ┌─────────────────────────────────┐
  │               Hub               │
  │  API + PostgreSQL + ClickHouse  │
  │  + S3 + Meilisearch + libp2p    │
  └──────┬───────────────────┬──────┘
         │ libp2p/QUIC       │ libp2p/QUIC
┌────────▼────────┐ ┌────────▼────────┐
│     Node A      │ │     Node B      │
│ CH + syndb-node │ │ CH + syndb-node │
└─────────────────┘ └─────────────────┘

Queries flow: User → Hub API → Hub ClickHouse → remote() to Node ClickHouse → results aggregated at Hub.

See Architecture for the full technical breakdown.

Federation Architecture

Components

Hub

The hub runs the full SynDB stack and coordinates the federation:

| Component | Role |
| --- | --- |
| syndb-api | HTTP API (port 8080) + Arrow Flight (port 50051) |
| PostgreSQL | User accounts, dataset metadata, cluster registry, job queue, benchmarks |
| ClickHouse | Local data warehouse + remote() queries to nodes |
| S3/MinIO | Mesh files, job results, ETL staging |
| Meilisearch | Full-text search index |
| HubRegistryActor | libp2p actor managing cluster registration and health |
| FederationHealthMonitor | Periodic health checks with circuit-breaker logic |

Node

Nodes are lightweight — no PostgreSQL, no S3, no Meilisearch:

| Component | Role |
| --- | --- |
| syndb-node | Federation daemon with Arrow Flight server (port 50052) |
| ClickHouse | Local data warehouse (HTTP port 8124, native port 9003/9440) |
| ClusterActor | libp2p actor handling hub communication |

Networking: libp2p

Federation uses libp2p for peer-to-peer communication:

  • Transport: QUIC with built-in TLS 1.3 (encrypted, multiplexed)
  • Discovery: mDNS for LAN (zero-config), DHT for WAN
  • NAT traversal: Relay nodes for peers behind NAT
  • Actor model: kameo actors manage the swarm event loop

DHT Registration

Services register under well-known names in the DHT:

| Name | Actor |
| --- | --- |
| syndb-hub | HubRegistryActor |
| syndb-cluster:{name} | ClusterActor |

The ClusterActor on each node looks up syndb-hub in the DHT to find and register with the hub.

Actor Messages

The ClusterActor handles these message types:

| Message | Direction | Purpose |
| --- | --- | --- |
| HealthPing | Hub → Node | Periodic liveness check |
| SchemaSync | Hub → Node | Push DDL migrations |
| DatasetCatalogRequest | Hub → Node | Discover datasets on node |
| GetFlightEndpoint | Hub → Node | Resolve Flight address for data transfer |
| AnalyticsQuery | Hub → Node | Delegated analytics computation |
| OntologySync | Hub → Node | Push ontology terms |

Data Plane

Two mechanisms move data between hub and nodes:

ClickHouse remote()

For SQL queries, the hub compiles a remote('node-host:port', 'syndb', 'table', 'user', 'password') call that executes directly on the node’s ClickHouse and streams results back.
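As a rough illustration of the shape of that call, the sketch below assembles a remote(...) table expression from its five arguments. It is not the hub's actual compiler; the helper names, the node address, and the credentials are hypothetical, and the quoting is a simple backslash escape chosen for the example.

```python
from __future__ import annotations


def quote_literal(value: str) -> str:
    """Render a Python string as a single-quoted SQL string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def remote_table(host: str, port: int, table: str, user: str,
                 password: str, database: str = "syndb") -> str:
    """Build the remote(...) table expression for one node."""
    args = [f"{host}:{port}", database, table, user, password]
    return "remote(" + ", ".join(quote_literal(a) for a in args) + ")"


# Hypothetical node address and credentials, for illustration only.
source = remote_table("node-a.example.org", 9440, "neurons", "federation", "secret")
print(f"SELECT count() FROM {source}")
```

Quoting every argument in one place keeps a password containing a quote character from breaking the generated statement.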

Arrow Flight (Internal)

For large result sets and non-SQL workloads (graph analysis, analytics), the hub delegates to the node’s internal Flight server (port 50052). Results stream back as Arrow IPC batches.

Schema Versioning

Each ClickHouse DDL migration has a version number. The hub tracks the current version and each node’s version:

  1. Hub receives a schema sync request (POST /v1/federation/schema/sync)
  2. Hub sends pending migrations to each active node via SchemaSync message
  3. Nodes apply migrations and report their new version
  4. Queries only route to nodes whose schema version is compatible
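The version bookkeeping in these steps can be sketched as follows. This is a minimal model, not SynDB's implementation: the Migration type and function names are hypothetical, and "compatible" is assumed to mean "same version as the hub", which the source does not specify.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    version: int
    ddl: str


def pending_migrations(migrations: list[Migration], node_version: int) -> list[Migration]:
    """Migrations the node has not applied yet, in ascending version order."""
    newer = [m for m in migrations if m.version > node_version]
    return sorted(newer, key=lambda m: m.version)


def apply_migrations(node_version: int, pending: list[Migration]) -> int:
    """Advance the node version one migration at a time and return the new version."""
    for migration in pending:
        if migration.version <= node_version:
            raise ValueError(f"migration {migration.version} is not newer than {node_version}")
        node_version = migration.version  # a real node would execute migration.ddl here
    return node_version


def is_compatible(hub_version: int, node_version: int) -> bool:
    # ASSUMPTION: compatibility is modelled as exact version equality.
    return hub_version == node_version
```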

Health Monitoring

The FederationHealthMonitorActor runs on the hub:

| State | Meaning | Query routing |
| --- | --- | --- |
| Healthy | Responds to pings, schema compatible | Included |
| Degraded | Responds but slow or partially failing | Included with lower priority |
| Unreachable | Failed consecutive pings | Excluded |
| Unknown | Newly registered, not yet checked | Excluded until first successful ping |

Health transitions are logged and stored in PostgreSQL for audit.
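The routing column of the table can be read as a small ordering rule. The sketch below is one way to express it, with hypothetical names; how the hub actually weights Degraded nodes is not described in the source, so "lower priority" is modelled simply as "sorted after Healthy nodes".

```python
from __future__ import annotations

from enum import Enum


class Health(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNREACHABLE = "Unreachable"
    UNKNOWN = "Unknown"


# Only these states receive queries; a lower number is tried first.
ROUTING_PRIORITY = {Health.HEALTHY: 0, Health.DEGRADED: 1}


def route_targets(nodes: dict[str, Health]) -> list[str]:
    """Return routable node names, Healthy before Degraded, ties broken by name."""
    eligible = [
        (ROUTING_PRIORITY[state], name)
        for name, state in nodes.items()
        if state in ROUTING_PRIORITY
    ]
    return [name for _, name in sorted(eligible)]
```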

Concurrency Model

  • Lock-free reads: The hub’s cluster registry uses papaya concurrent hash maps — reads never block, even under high query load
  • Actor isolation: Each cluster connection is managed by its own actor, preventing one slow node from blocking others
  • Supervisor trees: Actor failures are caught and restarted by the kameo supervisor

Node Setup

This guide walks through joining the SynDB federation as a node operator.

Prerequisites

  • ClickHouse instance with a syndb database
  • Network reachability to the hub (or mDNS on the same LAN)
  • The federation password (provided by the hub administrator)
  • syndb binary with federation support

Step 1: Initialize

syndb ops federation init \
  --cluster-name "my-lab-node" \
  --clickhouse-endpoint "clickhouse.mylab.edu" \
  --clickhouse-http-port 8123 \
  --clickhouse-port 9440 \
  --federation-password "$SYNDB_FEDERATION_PASSWORD" \
  --institution "My University" \
  --contact-email "<contact-email>"

This command:

  1. Bootstraps a libp2p swarm and discovers the hub via mDNS or configured multiaddrs
  2. Registers the node with the hub (presenting the federation password)
  3. Applies any pending ClickHouse schema migrations
  4. Saves configuration to ~/.config/syndb/federation.json

Optional flags

| Flag | Default | Description |
| --- | --- | --- |
| --listen-addr | OS-assigned | libp2p listen address (e.g., /ip4/0.0.0.0/udp/4001/quic-v1) |
| --description |  | Human-readable cluster description |

Step 2: Verify

# Show federation config
syndb ops federation status

# Test connectivity (3s mDNS discovery + hub + ClickHouse check)
syndb ops federation test

The federation test command:

  1. Bootstraps a temporary libp2p swarm with mDNS discovery
  2. Looks up the hub in the DHT
  3. Tests ClickHouse connectivity

Step 3: Sync Schema

If the hub has newer schema migrations:

# Preview changes
export SYNDB_HUB_URL="https://api.syndb.xyz/v1"
syndb ops federation sync-schema --dry-run

# Apply
syndb ops federation sync-schema

sync-schema currently uses an HTTP fallback endpoint and expects SYNDB_HUB_URL to point at the hub API base.

Step 4: Confirm Registration

List all federated clusters to verify your node appears:

export SYNDB_HUB_URL="https://api.syndb.xyz/v1"
syndb ops federation clusters

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| SYNDB_FEDERATION_PASSWORD | Yes |  | Shared secret for hub registration |
| SYNDB_SERVER_URL | No | https://api.syndb.xyz | Default server URL for the CLI root command tree |
| SYNDB_HUB_URL | For sync-schema and clusters fallback flows |  | Hub API base including /v1 (for example https://api.syndb.xyz/v1) |
| FEDERATION_CLUSTER_NAME | Yes (node mode) |  | Unique cluster identifier |
| FEDERATION_NODE_FLIGHT_PORT | No | 50052 | Internal Flight gRPC port |
| FEDERATION_NODE_FLIGHT_ADVERTISE | No | derived | Advertised Flight endpoint for remote delegation |
| FEDERATION_ENABLE_MDNS | No | true | Enable mDNS for LAN discovery |
| FEDERATION_LISTEN_ADDR | No | OS-assigned | libp2p listen address |
| FEDERATION_HUB_MULTIADDRS | No |  | Comma-separated hub multiaddrs for WAN |
| FEDERATION_CLUSTER_NATIVE_PORT | No | 9440 | ClickHouse native port for remote() queries |
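A preflight check along these lines can catch a missing variable before the node daemon starts. It is a sketch that mirrors the table above, not code from syndb-node; the NodeEnv type and load_node_env function are hypothetical, and treating only the literal string "true" as enabling mDNS is an assumption about boolean parsing.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED = ("SYNDB_FEDERATION_PASSWORD", "FEDERATION_CLUSTER_NAME")


@dataclass(frozen=True)
class NodeEnv:
    cluster_name: str
    flight_port: int
    native_port: int
    enable_mdns: bool
    hub_multiaddrs: tuple


def load_node_env(env: Optional[Mapping[str, str]] = None) -> NodeEnv:
    """Read node settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    raw_addrs = env.get("FEDERATION_HUB_MULTIADDRS", "")
    addrs = tuple(a.strip() for a in raw_addrs.split(",") if a.strip())
    return NodeEnv(
        cluster_name=env["FEDERATION_CLUSTER_NAME"],
        flight_port=int(env.get("FEDERATION_NODE_FLIGHT_PORT", "50052")),
        native_port=int(env.get("FEDERATION_CLUSTER_NATIVE_PORT", "9440")),
        # ASSUMPTION: only the literal "true" (any case) enables mDNS.
        enable_mdns=env.get("FEDERATION_ENABLE_MDNS", "true").strip().lower() == "true",
        hub_multiaddrs=addrs,
    )
```

Passing a plain dictionary instead of the real environment makes the check easy to exercise in tests.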

Docker Compose (Development)

For local development, the federation profile adds one federation node on top of the base stack, which acts as the hub:

docker compose --profile federation up -d

This starts:

  • clickhouse-node — ClickHouse on HTTP 8124, native 9003
  • clickhouse-node-setup — Creates federation user on the node
  • clickhouse-hub-fed-setup — Creates federation user on the hub
  • syndb-node — Federation daemon with Flight on 50052, libp2p on 4001

All services use network_mode: host and discover each other via localhost.

Removing a Node

syndb ops federation logout

This deletes ~/.config/syndb/federation.json. The hub administrator can also deactivate the cluster via DELETE /v1/federation/clusters/{id}.

Hub Administration

All hub administration endpoints require SuperUser authentication.

Federation Status

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/status
{
  "total_clusters": 5,
  "active_clusters": 4,
  "healthy": 3,
  "degraded": 1,
  "unreachable": 0,
  "schema_version": 12
}
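A monitoring script might reduce this response to a few derived numbers. The sketch below works on the sample payload shown above; the summarize_status name is hypothetical, and it assumes the health counters cover active clusters only, which the sample is consistent with but the source does not state.

```python
import json

SAMPLE = """
{
  "total_clusters": 5,
  "active_clusters": 4,
  "healthy": 3,
  "degraded": 1,
  "unreachable": 0,
  "schema_version": 12
}
"""


def summarize_status(status: dict) -> dict:
    """Derive how many clusters are deactivated, routable, or currently excluded."""
    routable = status["healthy"] + status["degraded"]
    return {
        "inactive": status["total_clusters"] - status["active_clusters"],
        "routable": routable,
        # ASSUMPTION: active clusters that are neither Healthy nor Degraded
        # are Unreachable or Unknown and therefore excluded from queries.
        "excluded": status["active_clusters"] - routable,
    }


print(summarize_status(json.loads(SAMPLE)))
```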

Cluster Management

List Clusters

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/clusters

Returns each cluster’s ID, name, endpoint, port, health status, and active flag.

Register a Cluster

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/federation/clusters \
  -d '{
    "name": "partner-lab",
    "endpoint": "ch.partner-lab.edu",
    "federation_password": "shared-secret",
    "description": "Partner Lab ClickHouse node",
    "institution": "Partner University",
    "contact_email": "<contact-email>"
  }'

Clusters can also self-register via POST /v1/federation/register using the federation password (no SuperUser required).

Deactivate a Cluster

curl -X DELETE -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/clusters/{cluster_id}

Sets is_active = false. The cluster is excluded from future queries but its record is preserved.

Health Checks

Single Cluster

curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/clusters/{cluster_id}/health

Verification Tests

Three targeted tests for diagnosing cluster issues:

# Test ClickHouse connectivity and measure latency
curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/clusters/{cluster_id}/test/connectivity

# Verify schema version compatibility
curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/clusters/{cluster_id}/test/schema

# Run a test cross-cluster query
curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/clusters/{cluster_id}/test/query

Schema Sync

Push pending DDL migrations to all active clusters:

curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/schema/sync

Get the current schema version and migrations:

# All migrations
curl -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/schema

# Migrations since version 10
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.syndb.xyz/v1/federation/schema?since_version=10"

Benchmarks

Track federation query performance:

# Submit a benchmark record
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/federation/benchmarks \
  -d '{
    "cluster_id": "...",
    "query_type": "remote_single",
    "latency_ms": 145,
    "row_count": 50000,
    "cluster_count": 1,
    "payload_bytes": 2048000,
    "success": true
  }'

# List benchmarks with filters
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.syndb.xyz/v1/federation/benchmarks?query_type=remote_single&limit=50"

# Aggregate stats grouped by query type
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.syndb.xyz/v1/federation/benchmarks/aggregate?since=2024-01-01"

Query Types

| Type | Description |
| --- | --- |
| remote_single | Query to one remote cluster |
| remote_multi | Query spanning multiple clusters |
| federation_union | Union across all federated clusters |
| federation_search | Federated search |
| health_check | Health check probe |
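When benchmark records are produced by a script, validating the query type locally avoids a rejected request. The sketch below builds the JSON body used in the submit example; the benchmark_record function and its range checks are illustrative assumptions, not documented server-side rules.

```python
import json

QUERY_TYPES = {
    "remote_single",
    "remote_multi",
    "federation_union",
    "federation_search",
    "health_check",
}


def benchmark_record(cluster_id: str, query_type: str, latency_ms: int,
                     row_count: int, cluster_count: int,
                     payload_bytes: int, success: bool) -> str:
    """Return the JSON body for POST /v1/federation/benchmarks."""
    if query_type not in QUERY_TYPES:
        raise ValueError(f"unknown query_type: {query_type!r}")
    if min(latency_ms, row_count, cluster_count, payload_bytes) < 0:
        raise ValueError("numeric fields must be non-negative")
    return json.dumps({
        "cluster_id": cluster_id,
        "query_type": query_type,
        "latency_ms": latency_ms,
        "row_count": row_count,
        "cluster_count": cluster_count,
        "payload_bytes": payload_bytes,
        "success": success,
    })
```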

Cross-Cluster Queries

Federation queries let you analyze data across all participating nodes from a single API call.

How It Works

  1. User submits a query via SyQL or meta-analysis endpoint with federation scope
  2. Hub resolves targets — checks dataset locality index to determine which nodes hold relevant data
  3. Hub compiles remote queries — generates ClickHouse remote('node:port', 'syndb', 'table', 'user', 'pass') calls
  4. Nodes execute locally — each node runs its portion of the query against local data
  5. Hub aggregates — results stream back and are merged at the hub
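The resolve, execute, and merge steps can be modelled in a few lines. This is a toy simulation under stated assumptions: the locality index is represented as a plain mapping from node name to dataset IDs, the per-node executor is an injected callable, and nodes are visited sequentially, whereas a real hub would likely query them concurrently.

```python
from __future__ import annotations

from typing import Callable


def resolve_targets(locality: dict[str, set[str]],
                    dataset_ids: list[str]) -> dict[str, list[str]]:
    """Map each node to the requested datasets it holds; skip nodes with none."""
    targets = {}
    for node, held in locality.items():
        wanted = [d for d in dataset_ids if d in held]
        if wanted:
            targets[node] = wanted
    return targets


def fan_out(targets: dict[str, list[str]],
            run_on_node: Callable[[str, list[str]], list]) -> list:
    """Run the per-node query on every target and merge the rows at the hub."""
    merged = []
    for node in sorted(targets):
        merged.extend(run_on_node(node, targets[node]))
    return merged
```

Because each node only sees its own datasets, this structure also shows why joins across nodes are not possible: no step has rows from two nodes until the final merge.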

SyQL with Federation Scope

SyQL queries can target the federation by specifying scope inside the query text:

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/syql/exec \
  -d '{
    "query": "SCOPE federation\nSELECT neuron_id FROM neurons WHERE brain_region = '\''mushroom_body'\'' LIMIT 1000"
  }'

The hub transparently fans the query out to nodes that hold matching datasets.
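Building the same request from a script avoids the shell quoting around the single-quoted SQL literal. The sketch below prepares the request without sending it; the helper names are hypothetical, and the only documented facts it relies on are the endpoint URL, the headers, and the "query" field.

```python
import json
import urllib.request

SYQL_EXEC_URL = "https://api.syndb.xyz/v1/syql/exec"


def syql_exec_body(select: str, scope: str = "federation") -> bytes:
    """Prefix the statement with a SCOPE line and encode it as the JSON body."""
    query = f"SCOPE {scope}\n{select}"
    return json.dumps({"query": query}).encode("utf-8")


def syql_exec_request(token: str, select: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for /v1/syql/exec."""
    return urllib.request.Request(
        SYQL_EXEC_URL,
        data=syql_exec_body(select),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = syql_exec_request(
    "<access_token>",
    "SELECT neuron_id FROM neurons WHERE brain_region = 'mushroom_body' LIMIT 1000",
)
# To send it: urllib.request.urlopen(request)
```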

Meta-Analysis Across Clusters

Specify cluster_ids to include specific nodes:

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/meta-analysis \
  -d '{
    "table": "neurons",
    "metric": "mesh_volume",
    "grouping": "brain_structure",
    "dataset_ids": "uuid-1,uuid-2,uuid-3",
    "scope": "federation",
    "cluster_ids": "cluster-uuid-1,cluster-uuid-2"
  }'

Data Plane: Arrow Flight

For large result sets and non-SQL workloads (graph analysis, analytics), the hub delegates to each node’s internal Flight server:

  • Hub sends a Flight DoGet request to the node’s advertised Flight endpoint (default port 50052)
  • Results stream back as Arrow IPC record batches
  • The hub merges batches from multiple nodes before returning to the client

Limitations

| Constraint | Detail |
| --- | --- |
| Latency | Cross-cluster queries add network round-trip time per node |
| Schema compatibility | Nodes must be at a compatible schema version; incompatible nodes are excluded |
| Node health | Only Healthy and Degraded nodes receive queries; Unreachable and Unknown nodes are skipped |
| Delegation timeout | Default 30s (FEDERATION_DELEGATION_TIMEOUT_SECS); long-running queries may need async jobs |
| No cross-node joins | Each node executes independently; joins happen only against local data |

Best Practices

  • Use async jobs (POST /v1/jobs) for large federation queries to avoid HTTP timeouts
  • Check federation status before running large queries to know which nodes are available
  • Prefer meta-analysis endpoints for cross-dataset aggregation — they handle fan-out efficiently
  • Monitor benchmarks to track federation query performance over time

Federation Troubleshooting

Node Cannot Find Hub

Symptom: syndb ops federation init or syndb ops federation test hangs during hub discovery.

Causes and fixes:

| Cause | Fix |
| --- | --- |
| mDNS blocked by firewall | Open UDP port 5353 or set FEDERATION_ENABLE_MDNS=false and use explicit multiaddrs |
| Hub and node on different networks | Set FEDERATION_HUB_MULTIADDRS to the hub’s libp2p address (e.g., /ip4/hub-ip/udp/4001/quic-v1) |
| Hub not running | Verify hub process is up and listening on its libp2p port |

Registration Rejected

Symptom: "Invalid federation password" error.

Fix: Ensure SYNDB_FEDERATION_PASSWORD matches the hub’s FEDERATION_PASSWORD exactly. Check for trailing whitespace or newlines in environment variables.
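Stray whitespace is easy to miss when a secret is read from a file or a CI variable. A small check like the following, with a hypothetical check_secret helper, makes the problem visible before registration is attempted.

```python
import os


def check_secret(value: str) -> list:
    """Return a list of problems that commonly cause a password mismatch."""
    problems = []
    if not value:
        problems.append("empty")
    if value != value.strip():
        problems.append("leading or trailing whitespace")
    if "\n" in value or "\r" in value:
        problems.append("contains a newline")
    return problems


if __name__ == "__main__":
    found = check_secret(os.environ.get("SYNDB_FEDERATION_PASSWORD", ""))
    print("OK" if not found else "Problems: " + ", ".join(found))
```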

Schema Version Mismatch

Symptom: Node excluded from federation queries; hub logs show schema incompatibility.

Fix:

# Check current schema
syndb ops federation status

# Sync to latest
syndb ops federation sync-schema

If sync fails, verify the node’s ClickHouse is reachable and the syndb database exists.

Health States

| State | Meaning | Action |
| --- | --- | --- |
| Healthy | All checks pass | None |
| Degraded | Responds but slow or partially failing | Check ClickHouse load, disk space, network |
| Unreachable | Failed consecutive pings | Check firewall, ClickHouse process, network connectivity |
| Unknown | Newly registered | Wait for first health check cycle or trigger manual verify |

Trigger a manual health check from the hub:

curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/clusters/{id}/verify

Docker Compose Issues

Port Conflicts

The federation profile uses network_mode: host. Check for conflicts:

  • Hub ClickHouse: HTTP 8123, native 9002
  • Node ClickHouse: HTTP 8124, native 9003
  • Federation Flight: 50052
  • libp2p: UDP 4001

Node Fails to Start

Check that the ClickHouse setup containers for both the hub and the node completed first:

docker compose --profile federation logs clickhouse-hub-fed-setup
docker compose --profile federation logs clickhouse-node-setup

These create the federation user on each ClickHouse instance. If either one fails, federated remote() queries cannot authenticate.

Connectivity Test Sequence

Run targeted tests to isolate the failure:

# 1. Test ClickHouse connectivity
curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/clusters/{id}/test/connectivity

# 2. Test schema compatibility
curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/clusters/{id}/test/schema

# 3. Test cross-cluster query
curl -X POST -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/federation/clusters/{id}/test/query

Each test returns a pass/fail result with latency and error details. Work through them in order — later tests depend on earlier ones passing.
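The stop-at-first-failure order can be automated with a small driver. In this sketch the HTTP call is injected as a callable so the logic stays testable; the "passed" field name is an assumption, since the source only says each test returns a pass/fail result with latency and error details.

```python
from __future__ import annotations

from typing import Callable

TEST_ORDER = ("connectivity", "schema", "query")


def run_sequence(cluster_id: str,
                 run_test: Callable[[str, str], dict]) -> list[tuple[str, dict]]:
    """Run the tests in order and stop after the first one that fails."""
    results = []
    for name in TEST_ORDER:
        outcome = run_test(cluster_id, name)
        results.append((name, outcome))
        # ASSUMPTION: the response carries a boolean "passed" field.
        if not outcome.get("passed", False):
            break
    return results
```

In practice run_test would POST to /v1/federation/clusters/{id}/test/{name} and return the decoded JSON.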

Docker Compose

Local development and single-machine deployment using Docker Compose.

Base Stack

cargo run -p cli --features dev -- stack up

Starts the core services:

| Service | Port | Description |
| --- | --- | --- |
| syndb-api | 8080 (HTTP), 50051 (Flight) | REST API + Arrow Flight |
| syndb-ui | 8090 | Web frontend |
| postgres | 5433 | Metadata, users, access control |
| clickhouse | 8123 (HTTP), 9002 (native) | ClickHouse data warehouse |
| s3 | 9000 (API), 9001 (console) | MinIO object storage |
| meilisearch | 7700 | Meilisearch full-text search |

All services use network_mode: host — they bind directly to the host network.

Local Search Smoke

The compose stack includes Meilisearch and wires it into the API. To rebuild the local dataset search index from PostgreSQL and query the public search endpoint:

nix develop . -c syndb test meilisearch-local

This smoke path:

  1. ensures the stack is up
  2. runs syndb data search reconcile against the local PostgreSQL and Meilisearch services
  3. queries http://localhost:8080/v1/search/fulltext?q=test&limit=5

The response may legitimately contain zero hits on a fresh stack, but the command should complete successfully and return valid JSON from the API.

Federation Profile

docker compose --profile federation up -d

Adds federation services on top of the base stack:

| Service | Port | Description |
| --- | --- | --- |
| clickhouse-node | 8124 (HTTP), 9003 (native) | Node ClickHouse |
| clickhouse-node-setup |  | Creates federation user on node |
| clickhouse-hub-fed-setup |  | Creates federation user on hub |
| syndb-node | 50052 (Flight), 4001/UDP (libp2p) | Federation node daemon |

Note: The federation and federation-world profiles share port 8124 and are mutually exclusive. federation-world runs 5 regional ClickHouse nodes for benchmarking only.

ETL Profile

Run dataset imports:

docker compose --profile etl run syndb-etl etl <dataset> <command>

Example:

docker compose --profile etl run syndb-etl etl hemibrain download
docker compose --profile etl run syndb-etl etl hemibrain import --data-dir /data/Hemibrain --table neurons --dataset-id <uuid>

Version Management

All service versions are defined in versions.nix, and the images are built with Nix. After changing versions, run:

syndb dev sync-versions

This regenerates .env with the correct image tags.

Image Building

Build container images from Nix:

cargo run -p cli --features dev -- stack prepare

This builds the local development images used by Compose: syndb-api-rust:dev, syndb-etl:dev, and syndb-ui:dev.

Volumes

| Volume | Service | Content |
| --- | --- | --- |
| clickhouse-data | clickhouse | ClickHouse data |
| clickhouse-node-data | clickhouse-node | Node ClickHouse data |
| postgres-data | postgres | PostgreSQL data |
| minio-data | s3 | S3 object storage |
| meilisearch-data | meilisearch | Search index |

Cleanup

ClickHouse creates files with UID 100100 and restrictive permissions. To clean volumes:

podman unshare rm -rf <volume-path>  # requires Podman (https://podman.io)

Prefer keeping data in Docker volumes rather than bind mounts to avoid permission issues.

Kubernetes & Helm

Production deployment on Kubernetes using Helm charts.

Charts Overview

| Chart | Description |
| --- | --- |
| syndb-hub | Hub deployment (API, UI, depends on syndb-clickhouse) |
| syndb-federation-node | Federation node (syndb-node, depends on syndb-clickhouse) |
| syndb-clickhouse | Shared ClickHouse subchart (used by both hub and node) |
| syndb-etl | ETL batch jobs (download, prepare, import, graph-precompute) |
| nautilus | Umbrella chart for the NRP Nautilus cluster deployment |

Charts are located under infrastructure/helm/.

Hub Deployment

The hub chart deploys the full SynDB stack. Key values:

syndb-clickhouse:
  clusterName: syndb-hub
  shardRegions:
    - name: dc1
      region: dc1
      replicas: 3

api:
  image:
    repository: docker.io/caniko/syndb-api
    tag: "0.10.47"
  flightPort: 50051
  resources:
    requests:
      cpu: "1"
      memory: 2Gi

ui:
  image:
    repository: docker.io/caniko/syndb-ui
    tag: "0.10.47"

The chart also creates a remote_servers.xml ConfigMap for ClickHouse cluster topology.

Meilisearch on Nautilus

The Nautilus umbrella chart now deploys Meilisearch as an internal-only production dependency for /v1/search/fulltext.

  • Deployment shape: single-replica StatefulSet
  • Service type: ClusterIP
  • Default storage: rook-ceph-block
  • Default volume: 20Gi
  • Public ingress: none
  • Shared secret: syndb-api-secrets.meilisearch_api_key

The API and the reconcile CronJob both receive:

  • MEILISEARCH_URL=http://syndb-meilisearch:7700
  • MEILISEARCH_API_KEY from syndb-api-secrets

Meilisearch itself receives the same secret as MEILI_MASTER_KEY, with MEILI_NO_ANALYTICS=true.

Reconcile job

Nautilus also deploys an hourly CronJob that runs:

syndb data search reconcile

using the lightweight oci-syndb-cli image. This is the repair mechanism for index drift and missed write-side updates.

Rollout order

For a production cutover:

  1. land the code and image changes
  2. update syndb-api-secrets so it contains meilisearch_api_key
  3. deploy the Nautilus chart
  4. wait for /health to report configured Meilisearch
  5. run one manual reconcile job
  6. verify /v1/search/fulltext through the public API

The manual one-shot reconcile command inside the supported devshell is:

nix develop . -c env \
  POSTGRES_HOST=<host> \
  POSTGRES_READ_HOST=<read-host> \
  POSTGRES_PORT=<port> \
  POSTGRES_USERNAME=<user> \
  POSTGRES_PASSWORD=<password> \
  POSTGRES_PATH=<database> \
  MEILISEARCH_URL=http://syndb-meilisearch:7700 \
  MEILISEARCH_API_KEY=<key> \
  cargo run -p cli --features dataset -- data search reconcile

Node Deployment

Deploy a federation node at your institution:

syndb-clickhouse:
  clusterName: syndb-node
  shardRegions:
    - name: dc1
      region: dc1
      replicas: 2

nodeApi:
  enabled: true
  image: syndb-api-rust:latest
  flightPort: 50052
  libp2pPort: 4001
  hubMultiaddrs: "/ip4/<hub-ip>/udp/4001/quic-v1"
  federationPassword: "<shared-secret>"
  resources:
    requests:
      cpu: 500m
      memory: 512Mi

When nodeApi.enabled=true, the chart deploys:

  • A Deployment running syndb-node with Flight (TCP) and libp2p (UDP) ports
  • A Service exposing both ports
  • Environment variables auto-populated from values (cluster name, endpoints, passwords)

In Kubernetes, mDNS is disabled — use hubMultiaddrs for explicit hub discovery.

ETL Jobs

ETL runs through the syndb-etl chart values, primarily downloadJobs, prepareJobs, seed, and graphPrecompute:

syndb-etl:
  image:
    repository: docker.io/caniko/syndb-etl
    tag: "0.10.47"
  flight:
    enabled: true
    serverUrl: "http://syndb-api-service:80"
    port: "50051"
  downloadJobs:
    - pipeline: hemibrain
      emptyDirSizeLimit: 8Gi
      downloadResources:
        requests: { cpu: "500m", memory: "512Mi" }
        limits: { cpu: "600m", memory: "614Mi" }
  prepareJobs:
    - pipeline: hemibrain
      emptyDirSizeLimit: 25Gi
  graphPrecompute:
    enabled: true

Important: a Kubernetes Job's pod template is immutable once the Job has been created. If resource values have changed, delete failed or running ETL jobs before running helm upgrade:

nix develop . -c kubectl delete job -n syndb -l app=syndb-etl --field-selector status.successful!=1

Skip override semantics: when syndb ops k8s nautilus apply receives explicit syndb-etl.skipPipelines[...] flags, SynDB now unions them with both config/etl-skip.ron and the live skip set derived from current ETL Jobs. Manual skip flags are additive; they do not replace the detected live skip set.

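emptyDir warning: memory-backed emptyDir volumes (medium: Memory) are tmpfs and count against the pod’s memory cgroup limit, whereas the Kubernetes default is node-disk-backed storage. When a volume is memory-backed, add the expected emptyDir data size to the memory limit.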
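For a memory-backed volume, the adjusted limit is simple arithmetic on Kubernetes quantities. The sketch below handles only the binary suffixes used in the values above (Ki, Mi, Gi, Ti); the function names are hypothetical and the result is rounded up to whole MiB.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '614Mi' or '8Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def memory_limit_with_emptydir(base_limit: str, emptydir_size: str) -> str:
    """Add the emptyDir size to the base memory limit, rounded up to MiB."""
    total = parse_quantity(base_limit) + parse_quantity(emptydir_size)
    mib = -(-total // 2**20)  # ceiling division
    return f"{mib}Mi"


# Values taken from the hemibrain download job above.
print(memory_limit_with_emptydir("614Mi", "8Gi"))
```

With the download job's 614Mi limit and 8Gi emptyDir size limit, the combined figure is 8806Mi.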

Applying Changes

nix develop . -c cargo run -p cli --features dev -- ops k8s nautilus apply

Or manually:

nix develop . -c helm upgrade --install syndb-nautilus infrastructure/helm/nautilus/ \
  -n syndb --create-namespace \
  -f infrastructure/helm/nautilus/values.yaml

Pending Helm Releases

SynDB now refuses to apply when syndb-nautilus is already in one of Helm’s pending states (pending-install, pending-upgrade, pending-rollback). This prevents a generic:

another operation (install/upgrade/rollback) is in progress

from landing after ETL reset work has already started.

If the pending revision is newer than 10 minutes, treat it as possibly active and inspect it first:

nix develop . -c helm status syndb-nautilus -n syndb
nix develop . -c helm history syndb-nautilus -n syndb

If the pending revision is older than 10 minutes, treat it as stale and roll back to the newest deployed revision before retrying the apply.

Current example from April 19, 2026:

  • revision 293 was stuck in pending-upgrade
  • Helm reported last_deployed = 2026-04-19T18:51:43.666197216+02:00
  • the newest deployed revision was 291

Recovery:

nix develop . -c helm rollback syndb-nautilus 291 -n syndb
nix develop . -c cargo run -p cli --features dev -- ops k8s nautilus apply
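The 10-minute rule can be applied mechanically to the timestamp Helm reports. The sketch below assumes last_deployed marks when the pending revision started and trims Helm's nanosecond fraction to microseconds so that older Python versions can parse it; the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=10)


def parse_helm_time(timestamp: str) -> datetime:
    """Parse an offset timestamp, truncating fractional seconds to 6 digits."""
    trimmed = re.sub(r"(\.\d{6})\d+", r"\1", timestamp)
    return datetime.fromisoformat(trimmed)


def classify_pending(last_deployed: str, now: datetime) -> str:
    """Return 'stale' if the pending revision is older than 10 minutes."""
    age = now - parse_helm_time(last_deployed)
    return "stale" if age > STALE_AFTER else "possibly-active"


if __name__ == "__main__":
    example = "2026-04-19T18:51:43.666197216+02:00"
    print(classify_pending(example, datetime.now(timezone.utc)))
```

A "possibly-active" result calls for the helm status and helm history inspection; a "stale" result calls for the rollback shown above.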

QueryFabric Rollout

The QueryFabric cutover adds two PostgreSQL metadata invariants that the API now enforces at startup:

  • every saved query must have query_text
  • every pending query job must have sql_plan

Use the SynDB devshell and either run the checks manually:

nix develop . -c syndb test queryfabric-full
nix develop . -c syndb test queryfabric-rollout

or use the convenience wrapper:

nix develop . -c syndb ops k8s nautilus deploy queryfabric

test-queryfabric-rollout checks the PostgreSQL environment described by the current POSTGRES_* / POSTGRES_READ_HOST variables and performs the same saved-query backfill step the API runs at startup. For production, point those variables at the target metadata database before running the preflight.

deploy-bump-queryfabric is a safe wrapper over deploy-bump: it runs the full local QueryFabric + SynDB validation path first, then the target-DB preflight, and only then publishes images and upgrades Helm on trunk.

Environment Reference

All configuration is controlled through environment variables. This page documents the application defaults from crates/services/api/src/settings/mod.rs and calls out the local docker-compose.yaml overrides where they differ.

Database

| Variable | App default | Local compose | Description |
| --- | --- | --- | --- |
| POSTGRES_HOST | localhost | localhost | PostgreSQL host |
| POSTGRES_PORT | 5432 | 5433 | PostgreSQL port |
| POSTGRES_USERNAME | syndb | syndb | PostgreSQL user |
| POSTGRES_PASSWORD | syndb | syndb | PostgreSQL password |
| POSTGRES_PATH | syndb | syndb_test | Database name |
| POSTGRES_READ_HOST | unset | unset | Optional read replica host |
| DB_POOL_MAX | 20 | unchanged | Max connection pool size |
| DB_POOL_MIN | 2 | unchanged | Min idle connections |
| DB_CONNECT_TIMEOUT_SECS | 10 | unchanged | PostgreSQL connect timeout |
| CLICKHOUSE_HOST | localhost | localhost | ClickHouse host |
| CLICKHOUSE_PORT | 8443 | 8123 | ClickHouse HTTP port |
| CLICKHOUSE_USERNAME | default | default | ClickHouse user |
| CLICKHOUSE_DATABASE | syndb | syndb | ClickHouse database |
| CLICKHOUSE_SECURE | true | false | Use HTTPS/TLS for ClickHouse |

Object Storage (S3/MinIO)

| Variable | Default | Description |
| --- | --- | --- |
| S3_ACCESS_KEY |  | Access key |
| S3_SECRET_KEY |  | Secret key |
| S3_ENDPOINT | unset | Custom endpoint for MinIO or other S3-compatible storage |
| S3_REGION | unset | AWS region |

Bucket names: syndb-mesh, syndb-swb, syndb-search, syndb-jobs. No underscores allowed in bucket names.

Authentication

| Variable | Default | Description |
| --- | --- | --- |
| PASSLIB_SECRET |  | PASETO v4.local symmetric key (minimum 32 bytes) |
| SERVICE_SECRET |  | Service account registration secret |
| UI_BASE_URL | http://localhost:8090/ui | OAuth callback redirect base URL |
| ACCESS_TOKEN_LIFETIME | 900 (15 min) | Access token TTL in seconds |
| REFRESH_TOKEN_LIFETIME | 2592000 (30 days) | Refresh token TTL in seconds |
| COOKIE_SAME_SITE | Strict | SameSite attribute for auth cookies |
| COOKIE_SECURE | true | Whether auth cookies require HTTPS |
| REQUIRE_AUTHENTICATION | true | Require auth on protected endpoints |

OAuth Providers

| Variable | Description |
| --- | --- |
| OA_GITHUB_ID, OA_GITHUB_SECRET | GitHub OAuth app credentials |
| OA_GOOGLE_ID, OA_GOOGLE_SECRET | Google OAuth credentials |
| OA_ORCID_ID, OA_ORCID_SECRET | ORCID OAuth credentials |
| OA_CILOGON_ID, OA_CILOGON_SECRET | CILogon OAuth credentials |
| OA_GITLAB_ID, OA_GITLAB_SECRET | GitLab OAuth credentials |
| OA_GITLAB_URL | Custom GitLab instance URL |
| OA_ORCID_SANDBOX | Use sandbox.orcid.org (default: false) |
| OA_CILOGON_SANDBOX | Use test.cilogon.org (default: false) |
| OAUTH_PROVIDER_BASE_URL | Override provider URLs (testing) |

Federation

| Variable | Default | Description |
| --- | --- | --- |
| FEDERATION_LISTEN_ADDR | OS-assigned | libp2p listen address |
| FEDERATION_ENABLE_MDNS | true | Enable mDNS LAN discovery |
| FEDERATION_HUB_MULTIADDRS |  | Comma-separated hub multiaddrs for WAN |
| FEDERATION_CLUSTER_NAME |  | Cluster identifier (required for node mode) |
| FEDERATION_CLUSTER_DESCRIPTION |  | Cluster description |
| FEDERATION_CLUSTER_INSTITUTION |  | Institution name |
| FEDERATION_PASSWORD |  | Shared federation secret |
| FEDERATION_CLUSTER_NATIVE_PORT | 9440 | ClickHouse native port for remote() |
| FEDERATION_NODE_FLIGHT_PORT | 50052 | Internal Flight gRPC port |
| FEDERATION_NODE_FLIGHT_ADVERTISE | unset | Advertised internal Flight endpoint (host:port); defaults to localhost:<FEDERATION_NODE_FLIGHT_PORT> when omitted |
| FEDERATION_DELEGATION_TIMEOUT_SECS | 30 | Timeout for delegated requests |

Server

| Variable | Default | Description |
| --- | --- | --- |
| API_DOMAIN | localhost | Public API host name used for generated links |
| DEV_MODE | false | Permissive CORS, data seeding |
| DEBUG | false | Verbose SQL logging |
| TESTING | false | Skip federation/job queue init |
| REQUEST_TIMEOUT_SECS | 60 | HTTP handler timeout |
| HTTP_CLIENT_TIMEOUT_SECS | 30 | Internal HTTP client timeout |
| UPLOAD_TIMEOUT | 21600 (6 hours) | Upload timeout |
| FLIGHT_PORT | 50051 | Arrow Flight server port |

Rate Limiting

| Variable | Default | Description |
| --- | --- | --- |
| RATE_LIMIT_PER_SECOND | 100 | Sustained request rate per IP |
| RATE_LIMIT_BURST | 200 | Burst capacity per IP |
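How the two values interact is easiest to see in a token bucket model: the burst value is the bucket capacity and the per-second value is the refill rate. The class below is a generic sketch of that technique with the defaults above, not SynDB's middleware; time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Generic token bucket: `burst` tokens of capacity, refilled at `rate` per second."""

    def __init__(self, rate: float = 100.0, burst: float = 200.0, now: float = 0.0):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst  # start full, so an initial burst is allowed
        self.updated = now

    def allow(self, now: float) -> bool:
        """Consume one token if available; `now` is a timestamp in seconds."""
        if now > self.updated:
            refill = (now - self.updated) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
            self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client IP (hypothetical bookkeeping for the example).
buckets = {}


def allow_request(ip: str, now: float) -> bool:
    bucket = buckets.setdefault(ip, TokenBucket(now=now))
    return bucket.allow(now)
```

A client can therefore send 200 requests at once, after which it is held to 100 per second until the bucket refills.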

Job Queue

| Variable | Default | Description |
| --- | --- | --- |
| JOB_QUEUE_MAX_WORKERS | 4 | Max concurrent job workers |
| JOB_RESULT_TTL_HOURS | 24 | Result retention |
| JOB_MAX_RESULT_BYTES | 1073741824 (1 GB) | Max result size |

Search

| Variable | Default | Description |
| --- | --- | --- |
| MEILISEARCH_URL | unset | Base URL for Meilisearch, for example http://localhost:7700 |
| MEILISEARCH_API_KEY |  | Meilisearch API key |

API Overview

Base URL: https://api.syndb.xyz/v1

Interactive OpenAPI documentation: api.syndb.xyz/docs

OpenAPI spec: GET /openapi.json

This page is a curated route map for the current public surface. The generated OpenAPI document is the authoritative exhaustive reference.

Authentication

Pass a PASETO access token in the Authorization header:

Authorization: Bearer <access_token>

See Authentication for how to obtain tokens.

Content Types

  • Requests: application/json
  • Responses: application/json (API), Apache Arrow IPC (job results), BibTeX/RIS (citations)

Error Format

{
  "error": "Human-readable error message"
}

Standard HTTP status codes: 400 (bad request), 401 (unauthenticated), 403 (insufficient permissions), 404 (not found), 409 (conflict), 429 (rate limited).
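A client can turn these responses into readable messages with a small helper. The sketch below relies only on the documented "error" field and status codes; the describe_error name and the message format are illustrative choices.

```python
import json

STATUS_MEANINGS = {
    400: "bad request",
    401: "unauthenticated",
    403: "insufficient permissions",
    404: "not found",
    409: "conflict",
    429: "rate limited",
}


def describe_error(status: int, body: str) -> str:
    """Combine the status meaning with the server's error message, if any."""
    try:
        message = json.loads(body).get("error", "")
    except (ValueError, AttributeError):
        message = ""  # body was not JSON, or not a JSON object
    meaning = STATUS_MEANINGS.get(status, "unexpected status")
    return f"{status} {meaning}: {message}" if message else f"{status} {meaning}"
```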

Route Map

Public routes

  • GET /health — service health check
  • POST /v1/user/auth/register
  • POST /v1/user/auth/login
  • POST /v1/user/auth/register-service
  • POST /v1/user/auth/refresh
  • POST /v1/user/auth/logout
  • GET /v1/search/fulltext
  • GET /v1/federation/ping
  • POST /v1/federation/register
  • GET /v1/ontology/vocabularies
  • GET /v1/ontology/terms
  • GET /v1/ontology/terms/{id}
  • GET /v1/ontology/terms/{id}/children
  • GET /v1/ontology/terms/{id}/ancestors
  • POST /v1/ontology/terms/validate

Authenticated user routes

  • GET /v1/user/profile
  • PATCH /v1/user/profile
  • POST /v1/user/profile/scientist-tag
  • GET /v1/user/profile/{user_id}
  • GET /v1/user/authenticate/cilogon
  • GET /v1/user/authenticate/cilogon/authorize

Academic user routes

  • Dataset metadata and assets: POST /v1/neurodata/datasets, GET /v1/neurodata/datasets/owned, GET /v1/neurodata/datasets/modifiable, GET /v1/neurodata/datasets/incomplete, GET /v1/neurodata/datasets/{dataset_id}, DELETE /v1/neurodata/datasets/{dataset_id}, GET /v1/neurodata/datasets/{dataset_id}/provenance, GET /v1/neurodata/datasets/{dataset_id}/versions, GET /v1/neurodata/datasets/{dataset_id}/metadata.jsonld, GET /v1/neurodata/datasets/{dataset_id}/citation, GET /v1/neurodata/datasets/{dataset_id}/lineage, POST /v1/neurodata/datasets/{dataset_id}/lineage, POST /v1/neurodata/datasets/{dataset_id}/access/request, GET /v1/neurodata/datasets/{dataset_id}/access, GET /v1/neurodata/collections, POST /v1/neurodata/collections, GET /v1/neurodata/collections/{collection_id}, DELETE /v1/neurodata/collections/{collection_id}
  • SyQL: POST /v1/syql/plan, POST /v1/syql/explain, POST /v1/syql/exec, POST /v1/syql/cancel
  • Saved queries: GET /v1/queries, POST /v1/queries, POST /v1/queries/from-syql, GET /v1/queries/{id}, PUT /v1/queries/{id}, DELETE /v1/queries/{id}, POST /v1/queries/{id}/run, POST /v1/queries/{id}/refresh
  • Jobs: POST /v1/jobs, POST /v1/jobs/graph, GET /v1/jobs, GET /v1/jobs/{job_id}, DELETE /v1/jobs/{job_id}, GET /v1/jobs/{job_id}/result, POST /v1/jobs/{job_id}/rerun
  • Analytics: GET /v1/analytics/summary, GET /v1/analytics/morphometrics, GET /v1/analytics/comparison, GET /v1/analytics/graph/{dataset_id}/summary, GET /v1/analytics/graph/{dataset_id}/reciprocity, GET /v1/analytics/graph/{dataset_id}/degree-distribution
  • Graph: GET /v1/graph/{dataset_id}/metrics, POST /v1/graph/{dataset_id}/motifs, POST /v1/graph/{dataset_id}/motifs/compare-synapse-types, POST /v1/graph/{dataset_id}/shortest-path, POST /v1/graph/{dataset_id}/reachability, POST /v1/graph/{dataset_id}/reachability-curve, POST /v1/graph/{dataset_id}/full-analysis, GET /v1/graph/compare
  • Meta-analysis: POST /v1/meta-analysis, GET /v1/meta-analysis/atlas/compare

SuperUser routes

  • Federation administration: GET /v1/federation/status, GET /v1/federation/schema, POST /v1/federation/schema/sync, GET /v1/federation/clusters, POST /v1/federation/clusters, DELETE /v1/federation/clusters/{cluster_id}, POST /v1/federation/clusters/{cluster_id}/verify, POST /v1/federation/clusters/{cluster_id}/test/connectivity, POST /v1/federation/clusters/{cluster_id}/test/schema, POST /v1/federation/clusters/{cluster_id}/test/query, GET /v1/federation/benchmarks, POST /v1/federation/benchmarks, GET /v1/federation/benchmarks/aggregate
  • Ontology writes: POST /v1/ontology/terms, PUT /v1/ontology/terms/{id}, DELETE /v1/ontology/terms/{id}, POST /v1/ontology/import/csv

Middleware Stack

Requests pass through these layers in order:

  1. Request ID — UUID v7, propagated via X-Request-ID
  2. Tracing — structured request/response logging
  3. Rate limiting — per-IP token bucket (see Rate Limiting)
  4. Timeout — 60s default, 408 on expiry
  5. CORS — permissive in dev mode, restricted to api_domain in production
  6. Compression — automatic response compression
  7. Body limit — 100 MB max request body
  8. API version — api-version: v1 response header

Health Check

curl https://api.syndb.xyz/health
{
  "status": "healthy",
  "components": {
    "postgres": { "status": "ok", "latency_ms": 5 },
    "clickhouse": { "status": "ok", "latency_ms": 12 },
    "storage": { "status": "ok", "latency_ms": 8 },
    "meilisearch": { "status": "ok" }
  }
}

Status is degraded if any required component fails, or if Meilisearch is configured but unreachable. Meilisearch itself is optional: when it is not configured, the health payload reports meilisearch.status = "not_configured" and the overall status is not degraded.
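A monitoring script could interpret this payload roughly as follows. This is a minimal sketch, not part of SynDB: the helper name failing_components is hypothetical, and it assumes that any component status other than "ok" or "not_configured" counts as a failure, since the full set of status strings is not listed here.

```python
import json

# Assumption: only these two statuses are treated as healthy.
ACCEPTABLE = {"ok", "not_configured"}

def failing_components(health: dict) -> list:
    """Return the sorted names of components whose status is not acceptable."""
    components = health.get("components", {})
    return sorted(
        name
        for name, info in components.items()
        if info.get("status") not in ACCEPTABLE
    )

# Payload copied from the example response above.
sample = json.loads("""
{
  "status": "healthy",
  "components": {
    "postgres": { "status": "ok", "latency_ms": 5 },
    "clickhouse": { "status": "ok", "latency_ms": 12 },
    "storage": { "status": "ok", "latency_ms": 8 },
    "meilisearch": { "status": "ok" }
  }
}
""")

print(failing_components(sample))
```

The top-level status field remains the authoritative signal; a helper like this only narrows down which component to look at.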

CLI Reference

The SynDB CLI (syndb) provides command-line access to account management, saved queries, dataset upload, federation, ETL, and Kubernetes workflows.

This page documents the current command surface. If you are working from this repository, you can run the CLI directly without a global install:

cargo run -p cli --features full -- --help

If the repository was cloned without submodules, initialize them first:

git submodule update --init --recursive

Global Options

| Option | Environment Variable | Description |
|---|---|---|
| --server-url | SYNDB_SERVER_URL | API base URL |
| --flight-url | SYNDB_FLIGHT_URL | Arrow Flight endpoint |
| --flight-port | SYNDB_FLIGHT_PORT | Arrow Flight port |

Commands

user — account management

  • syndb auth register — create a new account
  • syndb auth login — authenticate and store the token locally
  • syndb auth logout — revoke the current session

query — saved queries and SyQL helpers

These saved-query commands operate on the server-backed QueryFabric path, not a local on-disk query store.

syndb query list
syndb query save --label "Neuron subset" --table 1 --dataset-id <uuid> --column neuron_id --column cell_type
syndb query save-syql --label "Mouse neurons" "FROM neurons WHERE species = 'mouse' LIMIT 1000"
syndb query show <query-id>
syndb query update <query-id> --label "Updated label"
syndb query run <query-id>
syndb query status <query-id>
syndb query delete <query-id>
syndb query exec "FROM neurons LIMIT 10"
syndb query explain "FROM neurons LIMIT 10"

Current subcommands: save, list, show, update, delete, run, status, save-syql, exec, and explain.

dataset — dataset management

syndb data new --label "Example dataset" --animal "Mus musculus" --microscopy EM --table 1 --brain-structure hippocampus
syndb data prepare --input-dir raw_dataset --output-dir prepared_dataset
syndb data validate --input-dir prepared_dataset
syndb data upload --input-dir prepared_dataset --dataset-id <uuid>
syndb data download --dataset-id <uuid> --output-dir download_dir
syndb data mesh-upload --dataset-id <uuid> --input-dir meshes
syndb data swb-upload --dataset-id <uuid> --input-dir swb
syndb data delete --dataset-id <uuid>
syndb data search reconcile --dry-run
syndb data cache-tags
syndb data gen-test-data --output-dir tmp/test-data

Current subcommands: new, prepare, validate, upload, download, mesh-upload, swb-upload, delete, search reconcile, cache-tags, and gen-test-data.

syndb data search reconcile is the repair path for public full-text search. It rebuilds the local portion of the shared datasets Meilisearch index from PostgreSQL, preserves federated documents, and deletes stale local entries. Use --dry-run to inspect the planned changes without mutating Meilisearch.

The reconcile command reads its runtime contract from environment variables:

  • POSTGRES_HOST / POSTGRES_READ_HOST
  • POSTGRES_PORT
  • POSTGRES_USERNAME
  • POSTGRES_PASSWORD
  • POSTGRES_PATH
  • MEILISEARCH_URL
  • MEILISEARCH_API_KEY

Example:

POSTGRES_HOST=localhost \
POSTGRES_READ_HOST=localhost \
POSTGRES_PORT=5433 \
POSTGRES_USERNAME=syndb \
POSTGRES_PASSWORD=syndb \
POSTGRES_PATH=syndb_test \
MEILISEARCH_URL=http://localhost:7700 \
MEILISEARCH_API_KEY=meili_dev_key \
syndb data search reconcile --dry-run

etl — dataset import pipeline

Most datasets support download, validate, and import subcommands. CAVE-backed datasets may use manual export instead of download, followed by validate and import:

syndb etl <dataset> download       # when a static release exists
syndb etl <dataset> validate
syndb etl <dataset> import --data-dir external_datasets/<name> --table neurons --dataset-id <uuid>

spine-morphometry is the main special case: it uses --source kasthuri|ofer-confocal|microns instead of separate dataset keys.

Dataset keys

| Dataset | Key | Description |
|---|---|---|
| FlyWire | flywire | Whole-brain Drosophila connectome |
| Hemibrain | hemibrain | Janelia FlyEM v1.2.1 |
| MANC | manc | Male Adult Nerve Cord |
| Spine Morphometry | spine-morphometry | Dendritic spine morphometry (--source required) |
| C. elegans Hermaphrodite | celegans | Complete hermaphrodite wiring |
| Larval Drosophila | larval | L1 larval brain connectome |
| Allen Cell Types | allen-cell-types | Allen Institute reference |
| NeuroMorpho | neuromorpho | NeuroMorpho.org archive |
| Witvliet Developmental | witvliet | Developmental C. elegans connectome |
| C. elegans Male | celegans-male | Complete male wiring |
| Ciona | ciona | Larval CNS connectome |
| Platynereis | platynereis | Marine annelid connectome |
| MICrONS | microns | Mouse visual cortex |
| H01 | h01 | Human cortical tissue |
| Optic Lobe | optic-lobe | Drosophila optic lobe |
| Male CNS | male-cns | Male central nervous system |
| BANC | banc | Brain And Nerve Cord |
| FANC | fanc | Female Adult Nerve Cord |
| Fish1 | fish1 | Zebrafish brain |

Additional ETL utility commands: seed, update-all, and status.

federation — federation management

syndb ops federation init --cluster-name my-lab-node --clickhouse-endpoint clickhouse.mylab.edu --federation-password "$SYNDB_FEDERATION_PASSWORD"
syndb ops federation status
syndb ops federation sync-schema --dry-run
syndb ops federation test
syndb ops federation clusters
syndb ops federation logout

See Node Setup for detailed usage.

graph-precompute — Batch Graph Computation

syndb graph-precompute --dataset flywire

Pre-computes graph metrics and stores results in ClickHouse materialized tables for one or more current dataset keys.

k8s — Kubernetes administration

Current top-level groups:

  • syndb ops k8s etl — ETL jobs on Kubernetes
  • syndb ops k8s secrets — secret management helpers
  • syndb ops k8s sites — federation site helpers

Common ETL operations:

syndb ops k8s etl status
syndb ops k8s etl report
syndb ops k8s etl watch
syndb ops k8s etl retry
syndb ops k8s etl cleanup
syndb ops k8s etl reset

bench — Benchmarking

Performance testing suite for API and federation queries.

syndb-plot — Manuscript figures

The plotting package exposes a separate syndb-plot CLI for benchmark plots and manuscript figure rendering.

Benchmark-backed manuscript panels read benchmark/results/ by default. To render against a production benchmark bundle elsewhere, pass the directory explicitly:

syndb-plot manuscript-panels \
  --benchmark-results-dir documentation/manuscript/benchmarks/cluster-rebuild-2026-05-23

The same flag is available on focused panel/build commands:

syndb-plot manuscript-inspect-panel 5 A --benchmark-results-dir <results-dir>
syndb-plot manuscript-build-one 5 --benchmark-results-dir <results-dir>

manuscript-panels uses the provided directory for its benchmark preflight and for benchmark-backed composite figures. manuscript-composites only compiles already-rendered panels and does not read benchmark parquet.

ci — CI helpers

Internal build, test, and image automation helpers used by project workflows.

completions — Shell Completions

syndb completions bash > ~/.local/share/bash-completion/completions/syndb
syndb completions zsh > ~/.zfunc/_syndb
syndb completions fish > ~/.config/fish/completions/syndb.fish

Ontology & Vocabularies

SynDB uses controlled vocabularies to standardize dataset metadata — brain regions, species, microscopy techniques, and neurotransmitter types.

Browsing Terms

List All Vocabularies

curl https://api.syndb.xyz/v1/ontology/vocabularies

List Terms in a Vocabulary

curl "https://api.syndb.xyz/v1/ontology/terms?vocabulary=brain_region"

Search Terms

curl "https://api.syndb.xyz/v1/ontology/terms?q=mushroom"

Term Hierarchy

# Get child terms
curl https://api.syndb.xyz/v1/ontology/terms/{term_id}/children

# Get ancestor terms
curl https://api.syndb.xyz/v1/ontology/terms/{term_id}/ancestors

Validating Terms

Before submitting dataset metadata, validate that your terms exist:

curl -X POST -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/ontology/terms/validate \
  -d '{"terms": ["mushroom_body", "lateral_horn"]}'

Returns which terms are valid and which are unrecognized.

Administration (SuperUser)

Create a Term

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/ontology/terms \
  -d '{
    "vocabulary": "brain_region",
    "code": "MB_CALYX",
    "label": "calyx",
    "parent_id": "00000000-0000-0000-0000-000000000001",
    "uri": "https://example.org/terms/mb-calyx",
    "metadata": {
      "source": "manual"
    }
  }'

Update a Term

curl -X PUT -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/ontology/terms/{term_id} \
  -d '{
    "label": "calyx",
    "uri": "https://example.org/terms/mb-calyx",
    "metadata": {
      "status": "reviewed"
    }
  }'

Deprecate a Term

curl -X DELETE -H "Authorization: Bearer $TOKEN" \
  https://api.syndb.xyz/v1/ontology/terms/{term_id}

Deprecated terms remain in the system but are flagged in search results and validation.

Bulk Import

curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/ontology/import/csv \
  -d '{
    "vocabulary": "brain_region",
    "csv_data": "code,label,uri,parent_code\nMB,mushroom body,https://example.org/terms/mb,\nMB_CALYX,calyx,https://example.org/terms/mb-calyx,MB"
  }'

CSV format: code,label,uri,parent_code
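When terms come from a script rather than a hand-written string, the request body can be assembled with the standard csv module. The following is a minimal sketch: build_import_payload and the row dictionaries are hypothetical helpers, and it assumes the endpoint accepts exactly the JSON shape shown in the curl example above, with newline-separated rows.

```python
import csv
import io
import json

def build_import_payload(vocabulary: str, rows: list) -> dict:
    """Build the JSON body for POST /v1/ontology/import/csv."""
    buf = io.StringIO()
    # Use "\n" so the output matches the row separator in the curl example.
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["code", "label", "uri", "parent_code"])
    for row in rows:
        # parent_code is left empty for root terms.
        writer.writerow(
            [row["code"], row["label"], row["uri"], row.get("parent_code", "")]
        )
    return {"vocabulary": vocabulary, "csv_data": buf.getvalue().rstrip("\n")}

payload = build_import_payload(
    "brain_region",
    [
        {"code": "MB", "label": "mushroom body",
         "uri": "https://example.org/terms/mb"},
        {"code": "MB_CALYX", "label": "calyx",
         "uri": "https://example.org/terms/mb-calyx", "parent_code": "MB"},
    ],
)
print(json.dumps(payload, indent=2))
```

Using csv.writer rather than string joins means labels containing commas or quotes are escaped correctly.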

Integration with Datasets

When creating or updating dataset metadata, brain region, species, and microscopy fields are validated against the ontology. Invalid terms are rejected with an error listing the closest matches.

External References

SynDB’s ontology system draws from established biomedical and neuroscience ontologies:

  • OBO Foundry — community-maintained interoperable ontologies for biology and biomedicine
  • UBERON — multi-species anatomy ontology used for brain region terms
  • ChEBI — Chemical Entities of Biological Interest, covering neurotransmitter classifications
  • Allen Brain Atlas — reference atlas for mouse and human brain region parcellations
  • Virtual Fly Brain — Drosophila neuroanatomy ontology browser and data integration hub

Rate Limiting

SynDB enforces per-IP rate limiting using a token bucket algorithm.

Defaults

| Parameter | Default | Environment Variable |
|---|---|---|
| Requests per second | 100 | RATE_LIMIT_PER_SECOND |
| Burst capacity | 200 | RATE_LIMIT_BURST |

The bucket refills at the sustained rate. Burst capacity allows short spikes above the sustained rate.
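To make the interaction between the sustained rate and the burst capacity concrete, here is a generic token bucket in Python. It is an illustrative sketch of the algorithm with the default values above, not SynDB's actual limiter; the TokenBucket class and its explicit now argument are assumptions chosen so the behaviour is easy to reason about.

```python
class TokenBucket:
    """Generic token bucket: `rate` tokens per second, capped at `burst`."""

    def __init__(self, rate: float, burst: float, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=100, burst=200)
# A spike at t=0 can spend the whole burst capacity at once.
accepted_at_start = sum(bucket.allow(0.0) for _ in range(250))
# Half a second later, 0.5 s * 100 tokens/s have been refilled.
accepted_later = sum(bucket.allow(0.5) for _ in range(250))
print(accepted_at_start, accepted_later)
```

With these defaults, a client that has been idle can send up to 200 requests in one burst, after which it is held to the sustained 100 requests per second.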

Client IP Detection

The rate limiter identifies clients by IP address, checked in order:

  1. X-Forwarded-For header (first address)
  2. X-Real-IP header
  3. Localhost (fallback for direct connections)

Behind a reverse proxy, ensure X-Forwarded-For is set correctly.
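The lookup order above can be sketched as a small function. This is an illustration only: client_ip is a hypothetical name, header matching is assumed to be case-insensitive, and the localhost fallback is represented as 127.0.0.1.

```python
def client_ip(headers: dict) -> str:
    """Resolve the client IP using the documented precedence."""
    normalized = {key.lower(): value for key, value in headers.items()}

    # 1. First address in X-Forwarded-For.
    forwarded = normalized.get("x-forwarded-for", "")
    first = forwarded.split(",")[0].strip()
    if first:
        return first

    # 2. X-Real-IP.
    real_ip = normalized.get("x-real-ip", "").strip()
    if real_ip:
        return real_ip

    # 3. Fallback for direct connections (assumed to mean 127.0.0.1).
    return "127.0.0.1"

print(client_ip({"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}))
print(client_ip({"X-Real-IP": "198.51.100.4"}))
print(client_ip({}))
```

Because the first X-Forwarded-For entry wins, a proxy that forwards a client-supplied value unchanged would let clients choose their own rate-limit identity, which is why the proxy configuration matters.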

Response on Limit

When the rate limit is exceeded:

HTTP/1.1 429 Too Many Requests
Retry-After: 1

Too many requests

Client Handling

Respect the Retry-After header and implement exponential backoff:

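import time
import requests

def request_with_backoff(url, headers, max_retries=3):
    """GET url, retrying on 429 with exponential backoff."""
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers)
        if resp.status_code != 429:
            return resp
        if attempt == max_retries - 1:
            break  # no retry follows, so do not sleep again
        try:
            # Retry-After is expected in whole seconds
            base = int(resp.headers.get("Retry-After", 1))
        except ValueError:
            base = 1  # fall back if the header is not an integer
        time.sleep(base * (2 ** attempt))
    raise RuntimeError("Rate limited after retries")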

For batch operations, throttle to well under 100 req/s to leave headroom for interactive use.

Data Standards

SynDB implements open data standards to ensure interoperability, discoverability, and long-term preservation of neuroscience datasets.

FAIR Data Principles

SynDB aligns with the FAIR principles for scientific data management:

  • Findable: Datasets are indexed by Meilisearch full-text search. Each dataset is assigned a persistent UUID. Metadata is exposed via JSON-LD for search engine discovery.
  • Accessible: A RESTful API with an OpenAPI specification provides structured access. Arrow Flight enables high-throughput data transfer. Authentication uses standardized PASETO tokens.
  • Interoperable: Metadata is serialized as JSON-LD using Schema.org vocabulary. Controlled vocabularies draw from OBO Foundry ontologies. Data is exported in Apache Parquet and Apache Arrow formats.
  • Reusable: Licenses are stored as machine-readable SPDX identifiers. Provenance tracking, version history, and auto-generated citations support reproducibility.

Metadata Standards

SynDB dataset metadata follows established web standards:

  • Schema.org: Dataset metadata uses the Schema.org Dataset type, enabling discovery by Google Dataset Search and other aggregators.
  • JSON-LD: Metadata is serialized as JSON-LD – a linked data format that embeds semantic context in standard JSON. Access via GET /v1/neurodata/datasets/{id}/metadata.jsonld.
  • DCAT: Vocabulary alignment with the W3C Data Catalog Vocabulary for catalog interoperability.
  • Dublin Core: Core metadata terms (title, creator, date, rights) follow Dublin Core conventions.
  • SynDB Connectomics Data Profile: Required profile for ontology-backed dataset metadata, DataCite relation types, JSON-LD export, and archival metadata bundles.

Citation Formats

SynDB generates citations in multiple formats via GET /v1/neurodata/datasets/{id}/citation?format=<fmt>:

| Format | Use Case | Specification |
|---|---|---|
| BibTeX | LaTeX documents | .bib entries |
| RIS | Reference managers (Zotero, EndNote, Mendeley) | Tagged text format |
| APA | Inline text citations | APA 7th edition |
| CSL-JSON | Programmatic citation processing | Citation Style Language data model |
| CFF | Software/dataset citation files | CITATION.cff format |

License Identifiers

SynDB uses SPDX license identifiers internally. When you select a license during dataset creation, it is stored as an SPDX expression (e.g., ODC-BY-1.0, CC-BY-4.0). This enables machine-readable license detection and compatibility checking.

See the license selection guide for help choosing a license.

Data Formats

| Format | MIME Type | Used For |
|---|---|---|
| Apache Parquet | application/vnd.apache.parquet | Dataset export and DOWNLOAD parquet in SyQL |
| Apache Arrow IPC | application/vnd.apache.arrow.stream | Job results, Flight data transfer |
| CSV | text/csv | DOWNLOAD csv in SyQL, ontology bulk import |

Arrow IPC and Parquet files can be read with pandas, Polars, DuckDB, or any Arrow-compatible library.

External Integrations

SynDB metadata is designed to interoperate with these neuroscience data ecosystems:

| Platform | Integration |
|---|---|
| DataCite | DOI registration and metadata schema alignment (DataCite Metadata Schema 4.5) |
| DANDI Archive | Complementary neurophysiology data archive |
| OpenNeuro | Complementary neuroimaging data archive |
| Google Dataset Search | Automatic discovery via Schema.org/JSON-LD metadata |

SynDB Connectomics Data Profile

The SynDB Connectomics Data Profile defines the metadata contract required for datasets to be findable, accessible, interoperable, and reusable in SynDB.

Required Dataset Metadata

Every dataset must include:

  • UUIDv7 dataset identifier.
  • DataCite dataset DOI for production F1=3 FAIR claims.
  • Human-readable dataset label.
  • SPDX-compatible data license.
  • Access policy: open, registered, or restricted.
  • Species resolved to an active ontology_term in the ncbi_taxon vocabulary.
  • Microscopy technique resolved to an active ontology_term in the microscopy vocabulary.
  • Brain regions resolved to active ontology terms in uberon, fbbt, or SynDB’s internal brain_region vocabulary.
  • Declared SynDB table list and uploaded table state.
  • Provenance, version, citation, lineage, and archive links.

Ontology Vocabularies

SynDB metadata uses these vocabularies:

| Vocabulary | Use |
|---|---|
| ncbi_taxon | Species and taxonomic identity |
| microscopy | Imaging and reconstruction modality |
| uberon | Vertebrate anatomical structures |
| fbbt | Drosophila anatomical structures |
| brain_region | SynDB terms for structures not yet mapped to external vocabularies |
| chebi | Neurotransmitter and chemical identity |

All required ontology terms must be active, have a URI, and carry a registry version.
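The three conditions can be expressed as a simple pre-submission check. The sketch below is illustrative only: the OntologyTerm dataclass and its field names (active, uri, registry_version) are assumptions made for this example and do not describe SynDB's actual schema or API response.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OntologyTerm:
    # Hypothetical field names, chosen to mirror the rule above.
    vocabulary: str
    code: str
    active: bool
    uri: Optional[str]
    registry_version: Optional[str]

def term_problems(term: OntologyTerm) -> List[str]:
    """List every way a term violates the profile's requirements."""
    problems = []
    if not term.active:
        problems.append("term is not active")
    if not term.uri:
        problems.append("term has no URI")
    if not term.registry_version:
        problems.append("term has no registry version")
    return problems

good = OntologyTerm("ncbi_taxon", "10090", True,
                    "https://example.org/terms/10090", "2024-01")
bad = OntologyTerm("brain_region", "MB_CALYX", False, None, "2024-01")

print(term_problems(good))
print(term_problems(bad))
```

Collecting all problems instead of stopping at the first gives a dataset author one complete list to fix.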

Relations

Dataset lineage and external references use DataCite relation types, including IsDerivedFrom, IsSourceOf, IsPartOf, HasPart, References, IsReferencedBy, IsVersionOf, and HasVersion.

Invalid relation strings are rejected.
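A client can catch obviously invalid relation strings before sending a lineage request. This is a minimal sketch under two assumptions: the set below contains only the relation types named above (the server may accept additional DataCite relation types), and matching is assumed to be case-sensitive. The name check_relation is hypothetical.

```python
# Only the relation types named in this profile; not an exhaustive DataCite list.
KNOWN_RELATIONS = frozenset({
    "IsDerivedFrom", "IsSourceOf",
    "IsPartOf", "HasPart",
    "References", "IsReferencedBy",
    "IsVersionOf", "HasVersion",
})

def check_relation(relation: str) -> str:
    """Return the relation unchanged, or raise if it is not a known type."""
    if relation not in KNOWN_RELATIONS:
        raise ValueError(f"unknown relation type: {relation!r}")
    return relation

print(check_relation("IsDerivedFrom"))
```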

Linked Data And Archive Contract

Each dataset exposes:

  • GET /v1/neurodata/datasets/{dataset_id}/metadata.jsonld
  • GET /v1/neurodata/datasets/{dataset_id}/archive.json
  • GET /v1/neurodata/datasets/{dataset_id}/doi
  • GET /v1/neurodata/datasets/{dataset_id}/provenance
  • GET /v1/neurodata/datasets/{dataset_id}/versions
  • GET /v1/neurodata/datasets/{dataset_id}/lineage
  • GET /v1/neurodata/datasets/{dataset_id}/citation

The JSON-LD document uses Schema.org, DCAT, Dublin Core, DataCite, PROV, SPDX, and SynDB terms. It includes conformsTo pointing to this profile.

The archive bundle is the long-term metadata preservation surface. It includes dataset metadata, JSON-LD, DataCite metadata, citations, provenance, versions, lineage, external references, deletion status, and navigation links.

Dataset DOIs are minted through DataCite when the deployment has DATACITE_ENABLED, repository credentials, and a DOI prefix configured. Local or development deployments must report DOI minting as unavailable rather than fabricating DOI identifiers.

Validation Rules

Dataset creation fails if any required species, microscopy, or brain-region term cannot be resolved to a non-deprecated ontology term with a URI.

Startup validation fails if the ontology registry is incomplete for any neurometa enum variant or persisted dataset metadata term. Missing data must be fixed upstream by adding or repairing the relevant ontology term before SynDB starts.

For production FAIR scoring, each published dataset must have a DataCite DOI record. Publication DOIs remain related identifiers and do not replace the dataset DOI.

Architecture Decisions

These records preserve the rationale behind core SynDB architecture choices. They are historical decision notes, not exhaustive operational guides; use the topic-specific documentation pages for current procedures and route details.

ADR-001: PASETO v4.local for Authentication

Date: 2024-01-15

Status: Accepted

Context

SynDB exposes two server protocols from the same binary: an Axum HTTP API and an Apache Arrow Flight gRPC service. Both require stateless authentication that can be validated without a database round-trip on every request.

JSON Web Tokens (JWT) are the industry default, but they carry well-documented pitfalls: algorithm confusion attacks (alg: none), RSA/HMAC substitution, and an overly flexible header that increases the attack surface. Because both servers run in the same process and share an AppState, there is no need for public-key cryptography or third-party token verification – a symmetric scheme is simpler and sufficient.

Decision

Use PASETO v4.local (symmetric authenticated encryption: XChaCha20 with a keyed BLAKE2b MAC) via the rusty_paseto crate. Tokens carry user claims encrypted with a shared 256-bit secret key.

  • Access tokens are short-lived (15 minutes).
  • Refresh tokens are rotated on each use and stored in PostgreSQL, enabling server-side revocation.
  • The shared secret is loaded once at startup from the application settings and held in AppState.
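The refresh rotation described above can be illustrated with a small in-memory model. This is a sketch of the general technique in Python, not SynDB's Rust implementation: RefreshStore stands in for the PostgreSQL table, the token format is arbitrary, and all names are hypothetical.

```python
import secrets

ACCESS_TTL_SECONDS = 15 * 60  # access tokens live for 15 minutes

class RefreshStore:
    """In-memory stand-in for the server-side refresh-token table."""

    def __init__(self):
        self._active = {}  # refresh token -> user id

    def issue(self, user_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._active[token] = user_id
        return token

    def rotate(self, token: str):
        """Consume a refresh token and hand back a replacement."""
        user_id = self._active.pop(token, None)
        if user_id is None:
            raise PermissionError("refresh token is unknown, revoked, or already used")
        return user_id, self.issue(user_id)

    def revoke(self, token: str) -> None:
        """Server-side revocation: the token can no longer be rotated."""
        self._active.pop(token, None)

def access_expiry(issued_at: float) -> float:
    """Expiry timestamp for an access token issued at `issued_at`."""
    return issued_at + ACCESS_TTL_SECONDS

store = RefreshStore()
first = store.issue("user-1")
user_id, second = store.rotate(first)
print(user_id, second != first)
```

Because each refresh token is single-use, a stolen token that has already been rotated is rejected, and revoking the current token ends the session without touching the shared secret.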

Consequences

Positive:

  • Eliminates the entire class of JWT algorithm confusion vulnerabilities.
  • Symmetric encryption means tokens are opaque to clients – no claim leakage.
  • Single shared secret is trivial to manage when both protocols live in one binary.
  • Short-lived access tokens plus refresh rotation limit the blast radius of a leaked token.

Negative:

  • Tokens cannot be decoded or inspected on the client side (by design, but complicates client-side debugging).
  • If SynDB ever splits into multiple independently deployed services, the shared secret must be distributed securely to each service.
  • Refresh token storage adds a PostgreSQL dependency to the auth flow.

ADR-002: Kameo Actors for Federation Orchestration

Date: 2024-03-10

Status: Accepted

Context

SynDB federation delegates queries to multiple remote nodes simultaneously. A federated query must fan out Flight gRPC calls, enforce per-node timeouts, retry transient failures, and aggregate partial results into a single response stream.

Implementing this with raw tokio::spawn and channels leads to scattered state, ad-hoc cancellation logic, and difficult-to-test concurrency patterns. We need a structured concurrency model that encapsulates per-query state and lifecycle.

Decision

Use the kameo actor framework for federation query orchestration. Each federated query spawns a coordinator actor that in turn spawns per-node worker actors. Workers issue Flight gRPC calls and stream results back to the coordinator via typed messages.

  • Actor mailboxes provide natural back-pressure.
  • Supervision trees handle worker failures without crashing the coordinator.
  • Actor state is private and mutation-free from the caller’s perspective.
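The coordinator and worker pattern can be approximated in a few lines of Python. Note that this sketch uses plain asyncio tasks rather than an actor framework, so it shows only the fan-out, per-node timeout, and partial-result aggregation, not mailboxes or supervision. query_node is a hypothetical stand-in for a Flight gRPC call, and the node names and delays are made up.

```python
import asyncio

async def query_node(name: str, delay: float, rows: list) -> list:
    """Hypothetical worker: pretends to fetch rows from one remote node."""
    await asyncio.sleep(delay)
    return rows

async def fan_out(nodes: list, timeout: float):
    """Coordinator: query all nodes concurrently and keep partial results."""

    async def guarded(name, delay, rows):
        try:
            result = await asyncio.wait_for(query_node(name, delay, rows), timeout)
            return name, result
        except asyncio.TimeoutError:
            return name, None  # this node missed its deadline

    outcomes = await asyncio.gather(*(guarded(*node) for node in nodes))
    results = [row for _, rows in outcomes if rows is not None for row in rows]
    failed = [name for name, rows in outcomes if rows is None]
    return results, failed

# node-b is slower than the per-node timeout and is reported as failed.
nodes = [("node-a", 0.01, [1, 2]), ("node-b", 0.5, [3]), ("node-c", 0.02, [4])]
results, failed = asyncio.run(fan_out(nodes, timeout=0.1))
print(results, failed)
```

A slow node costs at most one timeout interval, and the caller still receives the rows from the nodes that answered in time.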

Consequences

Positive:

  • Clean separation of concerns: each actor owns its state and lifecycle.
  • Message-passing eliminates shared mutable state across concurrent operations.
  • Supervision and timeout semantics are built into the framework rather than hand-rolled.
  • Actors are straightforward to unit-test in isolation by sending messages directly.

Negative:

  • Adds a runtime dependency on the kameo crate and its executor integration.
  • Actor mailbox overhead exists, though it is negligible for the federation workload (tens of messages, not millions).
  • Developers must learn the actor model; it is less familiar than plain async/await to most Rust programmers.

ADR-003: ClickHouse for Analytical Data Storage

Date: 2024-01-08

Status: Accepted

Context

Neuroscience datasets in SynDB contain billions of rows – neurons, synapses, connectivity matrices, and physiological measurements – that are queried with columnar scans, aggregations, and large joins. PostgreSQL handles the OLTP metadata workload well (users, datasets, permissions) but performs poorly on analytical queries at this scale due to its row-oriented storage engine.

A single-database approach would force a choice between metadata flexibility and analytical performance. Scientific data is append-only and immutable once ingested, which relaxes consistency requirements for the analytical store.

Decision

Adopt a dual-database strategy:

  • PostgreSQL (via SeaORM) for metadata, user accounts, permissions, and all OLTP operations.
  • ClickHouse for analytical neuroscience data, partitioned by dataset_id using the MergeTree engine family.

The API service connects to both databases. Metadata queries hit PostgreSQL; data queries are translated to ClickHouse SQL and executed via the native ClickHouse HTTP or TCP client.

Consequences

Positive:

  • Orders-of-magnitude faster columnar scans and aggregations compared to PostgreSQL at billion-row scale.
  • ClickHouse’s compression (LZ4/ZSTD) dramatically reduces storage for repetitive scientific data.
  • Partitioning by dataset_id enables efficient data lifecycle management (drop partition on dataset deletion).
  • Eventual consistency is acceptable because scientific data is immutable after ingestion.

Negative:

  • Increased operational complexity: two database systems to provision, monitor, back up, and upgrade.
  • No cross-database transactions; the application must handle consistency at the orchestration layer.
  • ClickHouse’s mutation and update semantics differ from PostgreSQL, requiring developer awareness.

ADR-004: Apache Arrow Flight for Data Transport

Date: 2024-01-10

Status: Accepted

Context

Neuroscience data transfers between SynDB and analytical clients can reach hundreds of megabytes per query result. Serializing this data as HTTP JSON incurs significant overhead: text encoding inflates payload size, row-oriented JSON requires full deserialization, and there is no native streaming support for back-pressure.

Data science clients in Python, R, and Julia already use Apache Arrow as their in-memory columnar format (via pandas, polars, and similar libraries). A transport layer that speaks Arrow natively would eliminate serialization costs.

Decision

Use Apache Arrow Flight (gRPC-based columnar data transport) via the tonic gRPC framework and arrow-flight crate. The Flight service runs on port 50051 alongside the Axum HTTP API on port 8080, both in the same binary process.

  • Query results are materialized as Arrow RecordBatches and streamed to clients via do_get.
  • Dataset ingestion uses do_put for streaming uploads.
  • Flight tickets encode query parameters as serialized descriptors.

Consequences

Positive:

  • Zero-copy data transfer: clients receive Arrow RecordBatches directly usable by pyarrow, polars, DataFusion, and similar tools without deserialization.
  • gRPC bidirectional streaming provides natural back-pressure and supports arbitrarily large result sets without buffering everything in memory.
  • Columnar format enables predicate pushdown and projection pruning at the transport level.

Negative:

  • gRPC adds complexity compared to plain HTTP: clients need a Flight-aware library rather than a simple HTTP client.
  • Binary protocol is harder to debug than JSON; requires tooling like grpcurl or custom Flight clients for inspection.
  • Running two server protocols in one binary increases shutdown coordination complexity (see ADR-005).

ADR-005: Dual HTTP + gRPC Server Architecture

Date: 2024-01-10

Status: Accepted

Context

SynDB serves two distinct client populations with different transport needs:

  1. Web UI and REST clients require HTTP/JSON endpoints for CRUD operations on metadata, user management, and administrative tasks.
  2. Data science clients require high-performance Apache Arrow Flight (gRPC) for streaming large analytical datasets (see ADR-004).

Running these as separate binaries would double deployment artifacts, duplicate shared state (database connection pools, S3 clients, configuration), and complicate service discovery and health checking.

Decision

Run both Axum HTTP (port 8080) and tonic Flight gRPC (port 50051) servers in the same binary process. Both servers share a single AppState containing database connections, S3 client, PASETO secret, and application settings.

Concurrent operation is achieved via tokio::select! on both server futures, with a shared shutdown signal (broadcast channel) for coordinated graceful termination.
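As a rough analogue in Python, two server loops can share one shutdown signal as shown below. This is an illustration of the coordination idea only: asyncio.Event plays the role of the broadcast channel, serve is a hypothetical placeholder for the Axum and tonic servers, and no real sockets are opened.

```python
import asyncio

async def serve(name: str, shutdown: asyncio.Event, log: list) -> None:
    """Hypothetical server loop that runs until the shared signal fires."""
    log.append(f"{name} started")
    await shutdown.wait()
    # A real server would stop accepting and drain in-flight requests here.
    log.append(f"{name} stopped")

async def run_both() -> list:
    log = []
    shutdown = asyncio.Event()
    http = asyncio.create_task(serve("http", shutdown, log))
    flight = asyncio.create_task(serve("flight", shutdown, log))

    await asyncio.sleep(0.01)  # let both servers start
    shutdown.set()             # one signal reaches both servers
    await asyncio.gather(http, flight)
    return log

log = asyncio.run(run_both())
print(log)
```

The process exits only after both loops have observed the signal, which is also why a stalled stream on one protocol can delay shutdown of the other.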

Consequences

Positive:

  • Single deployment artifact (one container image, one binary) simplifies CI/CD and operational management.
  • Shared connection pools for PostgreSQL, ClickHouse, and S3 reduce total resource consumption versus two separate processes.
  • Both servers validate PASETO tokens with the same in-memory secret – no secret distribution problem.
  • Health checks and readiness probes need only target one process.

Negative:

  • A crash in either server’s accept loop brings down both protocols.
  • Graceful shutdown must coordinate two listeners; a stalled gRPC stream can delay HTTP shutdown and vice versa.
  • Resource contention (thread pool, memory) between HTTP and gRPC workloads is harder to isolate than with separate processes.

ADR-006: libp2p for Peer-to-Peer Federation

Date: 2024-03-10

Status: Accepted

Context

SynDB federation enables multiple institutions to share and query datasets without relying on a central broker or registry. Participating nodes may sit behind NATs, institutional firewalls, or cloud VPCs, so the networking layer must handle peer discovery, NAT traversal, and encrypted transport without requiring manual endpoint configuration.

A centralized hub-and-spoke model would create a single point of failure and raise data-sovereignty concerns for institutions that want to retain control over their datasets.

Decision

Use libp2p with the following configuration for federation networking:

  • QUIC transport for encrypted, multiplexed connections with built-in TLS 1.3.
  • mDNS for zero-configuration local/LAN peer discovery.
  • Relay nodes for NAT traversal when direct connections are not possible.
  • The federation swarm is managed by kameo actors (see ADR-002), with the swarm event loop running in a dedicated actor.
  • The node registry uses papaya lock-free concurrent hash maps for high-throughput reads without contention.

Consequences

Positive:

  • True decentralized federation: no central broker, no single point of failure.
  • QUIC provides encryption and multiplexing out of the box, eliminating the need for a separate TLS termination layer.
  • mDNS enables instant discovery in development and on-premise deployments without configuration.
  • Lock-free maps via papaya allow the node registry to scale to many concurrent readers without mutex contention.

Negative:

  • Distributed systems complexity: the federation must handle network partitions, partial failures, and eventually consistent peer state.
  • libp2p’s Rust implementation has a large dependency tree and can increase compile times.
  • NAT traversal via relay nodes adds latency and requires at least one publicly reachable relay to be available.
  • Debugging peer-to-peer networking issues is harder than debugging client-server HTTP calls.

Glossary

Neuroscience Terms

Axon — The elongated projection of a neuron that conducts electrical impulses away from the cell body. GO:0030424

Brain region — An anatomically or functionally defined subdivision of the brain. SynDB uses terms from the UBERON multi-species anatomy ontology.

Cable length — The total path length of a neuron’s skeletal reconstruction, measured in nanometers.

Cell type — A classification of neurons by morphology, connectivity pattern, molecular markers, or electrophysiology.

Connectome — A comprehensive map of neural connections in a nervous system. Wikipedia

Degree (in/out) — The number of incoming (in-degree) or outgoing (out-degree) synaptic connections of a neuron.

Dendrite — A branched projection of a neuron that receives synaptic input. GO:0030425

Dendritic spine — A small membranous protrusion on a dendrite that forms the postsynaptic side of most excitatory synapses. GO:0043197

Mitochondria — Organelles responsible for ATP production; their density in neurons correlates with synaptic activity. GO:0005739

Neuron — The fundamental unit of the nervous system; an electrically excitable cell that communicates via synapses. CL:0000540

Neurotransmitter — A chemical substance released at a synapse to transmit signals. Common types in SynDB: GABA, glutamate, acetylcholine, dopamine, octopamine, serine.

Pre-synaptic terminal — The axon terminal from which neurotransmitter is released into the synaptic cleft. GO:0098793

Reciprocity — The fraction of synaptic connections in a network that are bidirectional.

Sphericity — A measure of how closely a shape approximates a sphere (1.0 = perfect sphere).

Synapse — A junction between two neurons where signals are transmitted chemically or electrically. GO:0045202

Triadic census — An enumeration of all 16 possible three-node directed subgraph patterns in a network, used to characterize network motifs.

Vesicle — A small membrane-bound compartment in the pre-synaptic terminal that stores neurotransmitter molecules. GO:0099503

Platform Terms

Academic verification — Identity verification via CILogon institutional login, required for compute-intensive operations.

Arrow Flight — A high-performance gRPC protocol from Apache Arrow used for streaming data transfer between SynDB components.

Collection — A curated grouping of datasets for organizational or meta-analysis purposes. See Collections & Tags.

Compartment — A structural subdivision of a neuron (axon, dendrite, spine, terminal, vesicle, mitochondria) that maps to a SynDB table.

Dataset — A collection of neuroanatomical measurements sharing common metadata (species, brain region, microscopy method). The fundamental unit of organization in SynDB.

ETL — Extract, Transform, Load. The pipeline that imports external connectomics datasets into SynDB’s schema. See External sources.

Federation — A decentralized architecture where multiple institutions run independent SynDB nodes while participating in cross-institutional queries. See Federation overview.

Hub — The central coordinating instance in a federation. Runs the full SynDB stack and routes cross-cluster queries.

Job — An asynchronous unit of work (query execution, graph analysis) managed by the job queue. See Jobs system.

Materialized view (MV) — A pre-aggregated ClickHouse table that stores intermediate results for fast analytics. SyQL automatically rewrites eligible queries to use MVs.

Node — A lightweight federation participant running ClickHouse and the syndb-node binary. Data stays on the node’s infrastructure.

Provenance — The audit trail tracking who created, modified, or derived from a dataset. See Provenance & Citations.

SyQL — SynDB Query Language. A declarative SQL-like language that resolves dataset metadata into optimized ClickHouse queries. See SyQL documentation.

Table — A typed schema within SynDB corresponding to a neuronal compartment (e.g., neurons, synapses, axons, dendrites). Each table has its own column definitions.

Choosing a license for your dataset

When sharing datasets derived from microscopy data, the license you choose determines how your work may be used and distributed. Licenses differ in how much freedom they grant and how much control you keep. This article outlines some popular licenses, their key features, and the considerations that can help you choose the right one.

Considerations for Choosing a License

  • Intended Use: Determine whether you want your data to be used freely or with certain restrictions, such as non-commercial use only.
  • Credit and Attribution: Decide if you want to receive credit for your work and if it’s important for you to see how others are using your data.
  • Derivative Works: Consider whether you want derivative works to be allowed and if they should be shared under the same terms.
  • Commercial Use: Reflect on whether you want to permit commercial use of your data. Your institution may have specific policies regarding commercial use.

Licenses

The following are some common licenses used for sharing data on the web, which we also use on the SynDB platform.

Tip

Current default

The current SynDB UI and CLI default to CC BY 4.0. Open Data Commons licenses are still available and may be a better fit for some dataset-sharing policies.

Open Data Commons (ODC) Licenses

Open Data Commons (ODC) licenses are specifically tailored for datasets and databases, focusing on maximizing accessibility and proper attribution in data sharing.

PDDL (Public Domain Dedication and License)

Places the dataset in the public domain, allowing unrestricted use and maximizing openness and usability.

ODC-BY (Attribution License)

Allows use with proper credit to the original creator, ensuring acknowledgment while enabling broad use.

ODC-ODbL (Open Database License)

Permits sharing, modifying, and using the dataset with attribution, and requires derivative databases to be shared under the same license. This promotes collaborative improvement while keeping derivative databases as open as the original.

Creative Commons (CC) Licenses

Creative Commons (CC) licenses are versatile and well-suited for a wide range of creative works, including datasets.

CC0 (Public Domain Dedication)

Allows the use of the dataset without any restrictions, making it ideal for maximizing usability and dissemination.

CC BY (Attribution)

Allows users to use the dataset as long as they provide appropriate credit to the original creator, ensuring wide use while acknowledging the creator’s work.

CC BY-SA (Attribution-ShareAlike)

Permits use of the dataset with appropriate credit and requires derivative works to be shared under the same license, so derivatives stay as open as the original.

CC BY-NC (Attribution-NonCommercial)

Allows non-commercial use with proper credit. Commercial use is not permitted, but academic and research use remains possible.

CC BY-NC-SA (Attribution-NonCommercial-ShareAlike)

Permits non-commercial use with appropriate credit and requires derivative works to be shared under the same license.

Conclusion

Selecting the right license for a dataset derived from microscopy data is essential for controlling how your data is used and for meeting your sharing objectives. By weighing the options against your specific needs, you can choose a license that balances openness, credit, and control.

SynDB stores licenses as SPDX identifiers for machine-readable compatibility. See Data Standards for details.
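As a rough illustration of what an SPDX-based representation looks like, the sketch below maps the license names used in this article to identifiers from the SPDX license list. The chosen versions (4.0 for Creative Commons, 1.0 for Open Data Commons) and the helper name `to_spdx` are assumptions made for this example; they are not taken from SynDB's code.

```python
# License name as used in this article -> SPDX identifier.
# Versions are assumed for illustration: the article only states that
# CC BY 4.0 is the current default.
SPDX_IDS = {
    "PDDL": "PDDL-1.0",
    "ODC-BY": "ODC-By-1.0",
    "ODC-ODbL": "ODbL-1.0",
    "CC0": "CC0-1.0",
    "CC BY": "CC-BY-4.0",
    "CC BY-SA": "CC-BY-SA-4.0",
    "CC BY-NC": "CC-BY-NC-4.0",
    "CC BY-NC-SA": "CC-BY-NC-SA-4.0",
}


def to_spdx(name: str) -> str:
    """Return the SPDX identifier for a license name from this article."""
    key = name.strip()
    if key not in SPDX_IDS:
        raise ValueError(f"no SPDX identifier known for {name!r}")
    return SPDX_IDS[key]
```

Raising on unknown names, rather than passing them through, keeps free-text license strings out of a field that is meant to be machine-readable.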

Metrics structuring for contribution

Note

Prerequisites

This article assumes that you understand how data is stored on SynDB. If you are uncertain, we recommend reading through the overview article first.

This article is a guide for contributors who wish to upload their data to SynDB. Please don’t hesitate to ask for help on the Discord channel if you have any questions; this part can be challenging.

Data structuring

Schema

Each SynDB table has its own expected columns and types. The current CLI and ETL importers are the authoritative validators: structure your data so it passes syndb data validate for direct uploads, or the relevant syndb etl <dataset> validate command for a supported importer.

The column names and the values stored under them must match the current importer schema for the table you are contributing to. Use the glossary at the end of this article as a quick reference, then validate early with the current tooling before preparing a full upload.

Note

Nano

SynDB uses nanometers as the unit for all measurements, including volume, radius, and distance.

Supporting source assets

SynDB expects your primary contribution to be tabular, analysis-ready data. You may also attach supporting source assets such as meshes or SWC skeletons; this does not include raw imaging volumes. For each asset, place the absolute path to the file in the corresponding column of your table file. The following formats are supported:

  • Meshes in .glb format, column name: mesh_path
  • SWC files, .swc, column name: swc_path

This list is the main tracker for the supported formats. You may request additional formats on the Discord channel. The SynDB team will review the request and consider adding the new format to the platform.
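Because asset columns must hold absolute paths, it can help to normalize them before validation. The following is a minimal sketch, assuming rows are plain dictionaries and that asset columns may be left empty; the function name `absolutize_assets` and the `base_dir` parameter are hypothetical and not part of the SynDB CLI.

```python
from pathlib import Path

# Column name -> required file extension, as listed above.
ASSET_COLUMNS = {"mesh_path": ".glb", "swc_path": ".swc"}


def absolutize_assets(row: dict, base_dir: str) -> dict:
    """Return a copy of row with asset columns rewritten as absolute paths."""
    out = dict(row)
    for column, extension in ASSET_COLUMNS.items():
        value = out.get(column)
        if not value:
            continue  # assumption: rows without an asset leave the column empty
        path = Path(value)
        if path.suffix.lower() != extension:
            raise ValueError(f"{column} expects a {extension} file, got {value!r}")
        if not path.is_absolute():
            path = Path(base_dir) / path  # resolve relative to the dataset folder
        out[column] = str(path.resolve())
    return out
```

Checking the extension at this stage surfaces an unsupported format before a full upload is prepared.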

Columns

Most column types are self-explanatory, but some require additional explanation.

Identifiers and relations

The cid column in your table can hold any unique hashable value; it is replaced by a UUID when uploaded to SynDB. When uploading a relational dataset, the cid column in the parent table is used to link children to their parents through parent_id, so each parent_id value in the child table must match a cid value in the parent table. parent_enum can be omitted: compartments are defined at the table level, so it is added automatically.

Example

Notice the parent_id column in the child table: it holds the cid of the corresponding row in the parent table. The parent_enum column is not present in the child table because it is derived from the table file name.

vesicle.csv, child

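| cid | neurotransmitter | voxel_radius | distance_to_active_zone | minimum_normal_length | parent_id | centroid_z | centroid_x | centroid_y |
|---|---|---|---|---|---|---|---|---|
| 0 | glutamate | 26.9129 | 705.2450 | 23 | 1 | 4505.2 | 321996.2 | 244953.6 |
| 1 | glutamate | 25.5388 | 615.0213 | 23 | 1 | 4505.2 | 321996.2 | 244953.6 |
| 2 | glutamate | 29.5260 | 513.0701 | 23 | 1 | 4505.2 | 321996.2 | 244953.6 |
| 3 | glutamate | 30.5131 | 479.9224 | 23 | 1 | 4505.2 | 321996.2 | 244953.6 |
| 4 | glutamate | 28.3977 | 454.8248 | 23 | 1 | 4505.2 | 321996.2 | 244953.6 |
| 5 | glutamate | 30.2033 | 459.7557 | 23 | 2 | 4505.2 | 321996.2 | 244953.6 |
| 6 | glutamate | 33.4548 | 374.8131 | 23 | 2 | 4505.2 | 321996.2 | 244953.6 |
| 7 | glutamate | 32.0890 | 455.9293 | 23 | 4 | 4505.2 | 321996.2 | 244953.6 |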

axon.csv, parent

| voxel_volume | mitochondria_count | total_mitochondria_volume | cid |
|---|---|---|---|
| 385668034.56 | 19 | 3208043.52 | 1 |
| 1492089016.32 | 44 | 12054179.84 | 2 |
| 327740497.92 | 0 | 0 | 4 |
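Before uploading a relational dataset, it may be worth confirming that every parent_id in the child file matches a cid in the parent file. The sketch below does this with the standard csv module; it is not the check that syndb data validate performs, and the function name `orphaned_children` is hypothetical. The sample strings keep only the cid and parent_id columns of the example above.

```python
import csv
import io


def orphaned_children(parent_csv: str, child_csv: str) -> list:
    """Return child cids whose parent_id matches no cid in the parent table."""
    parents = {row["cid"] for row in csv.DictReader(io.StringIO(parent_csv))}
    return [
        row["cid"]
        for row in csv.DictReader(io.StringIO(child_csv))
        if row["parent_id"] not in parents
    ]


# Reduced versions of axon.csv (parent) and vesicle.csv (child).
AXON_CSV = "cid\n1\n2\n4\n"
VESICLE_CSV = "cid,parent_id\n0,1\n1,1\n5,2\n7,4\n"

orphans = orphaned_children(AXON_CSV, VESICLE_CSV)  # empty: every parent exists
```

Values are compared as strings, so an identifier written as 1 in one file and 1.0 in the other would be reported as an orphan.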

Glossary

| Key | Description |
|---|---|
| dataset_id | The unique identifier for the dataset, of type uuid. |
| cid | The unique identifier for a SynDB unit within the dataset, of type uuid. |
| parent_id | The CID of the parent component, of type uuid. |
| parent_enum | An integer representing the type or category of the parent component, of type int. |
| polarity | The polarity of the neuron, of type ascii. |
| voxel_volume | The volume of the voxel, of type double. |
| voxel_radius | The radius of the voxel, of type double. |
| s3_mesh_location | The location of the mesh in S3 storage, of type smallint. |
| mesh_volume | The volume of the mesh, of type double. |
| mesh_surface_area | The surface area of the mesh, of type double. |
| mesh_area_volume_ratio | The ratio of the surface area to the volume of the mesh, of type double. |
| mesh_sphericity | The sphericity of the mesh, of type double. |
| centroid_z | The z-coordinate of the centroid, of type double. |
| centroid_x | The x-coordinate of the centroid, of type double. |
| centroid_y | The y-coordinate of the centroid, of type double. |
| s3_swb_location | The location of the SWB in S3 storage, of type smallint. |
| terminal_count | The count of terminals, of type int. |
| mitochondria_count | The count of mitochondria, of type int. |
| total_mitochondria_volume | The total volume of mitochondria, of type double. |
| neuron_id | The unique identifier for the associated neuron, of type uuid. |
| vesicle_count | The count of vesicles, of type int. |
| total_vesicle_volume | The total volume of vesicles, of type double. |
| forms_synapse_with | The unique identifier of the synapse that the component forms with, of type uuid. |
| connection_score | The score representing the strength or quality of the connection, of type double. |
| cleft_score | The score for the synaptic cleft, of type int. |
| GABA | The concentration or presence of GABA neurotransmitter, of type double. |
| acetylcholine | The concentration or presence of acetylcholine neurotransmitter, of type double. |
| glutamate | The concentration or presence of glutamate neurotransmitter, of type double. |
| octopamine | The concentration or presence of octopamine neurotransmitter, of type double. |
| serine | The concentration or presence of serine neurotransmitter, of type double. |
| dopamine | The concentration or presence of dopamine neurotransmitter, of type double. |
| root_id | The external root identifier from the source platform (e.g. FlyWire), of type int. |
| pre_id | The unique identifier of the pre-synaptic component, of type uuid. |
| post_id | The unique identifier of the post-synaptic component, of type uuid. |
| dendritic_spine_count | The count of dendritic spines, of type int. |
| neurotransmitter | The type of neurotransmitter present in a vesicle, of type ascii. |
| distance_to_active_zone | The distance from the vesicle to the active zone, of type double. |
| minimum_normal_length | The minimum normal length, of type int. |
| ribosome_count | The count of ribosomes within the endoplasmic reticulum, of type int. |

ETL Operations

This guide is for developers operating production ETL, cache population, and graph precompute jobs. It records the operational invariants that are easy to miss when a dataset is large enough that a normal import can partly succeed before the real failure appears.

Core Invariants

SynDB must not synthesize missing graph data to make downstream figures pass. If a graph product is missing, first prove whether the source import is complete and internally consistent.

For any connectome dataset used by graph precompute:

  • Every edge endpoint must resolve to a neuron in the same dataset.
  • dataset_table_state must describe the whole uploaded table, not a partial batch.
  • Dataset-scoped materialized views and precomputed tables must be regenerated from the same canonical source rows.
  • Empty graph products are valid only when the source graph is truly empty or the product is explicitly not applicable.

The most useful validation is an endpoint check against vw_graph_edges:

WITH toUUID('<dataset-id>') AS ds
SELECT 'pre_missing' AS check_name, count() AS rows
FROM syndb.vw_graph_edges AS e
LEFT JOIN syndb.neurons AS n
  ON n.dataset_id = ds AND n.neuron_id = e.pre_neuron_id
WHERE e.dataset_id = ds AND n.neuron_id IS NULL
UNION ALL
SELECT 'post_missing' AS check_name, count() AS rows
FROM syndb.vw_graph_edges AS e
LEFT JOIN syndb.neurons AS n
  ON n.dataset_id = ds AND n.neuron_id = e.post_neuron_id
WHERE e.dataset_id = ds AND n.neuron_id IS NULL;

Both rows must be zero before graph precompute results are trusted.
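The same invariant can be checked on in-memory data before an upload, for example in an importer test. This is a minimal sketch of the anti-join logic, assuming edges are available as (pre, post) pairs; the function name `missing_endpoints` is hypothetical, and the sketch does not replace the production query against vw_graph_edges.

```python
def missing_endpoints(neuron_ids, edges):
    """Count edges whose pre or post endpoint is not a known neuron.

    Mirrors the two LEFT JOIN ... IS NULL checks: an edge is counted under
    pre_missing or post_missing (or both) when that endpoint is unknown.
    """
    known = set(neuron_ids)
    counts = {"pre_missing": 0, "post_missing": 0}
    for pre, post in edges:
        if pre not in known:
            counts["pre_missing"] += 1
        if post not in known:
            counts["post_missing"] += 1
    return counts


def endpoints_ok(neuron_ids, edges) -> bool:
    """True only when both counts are zero, as required before precompute."""
    return all(count == 0 for count in missing_endpoints(neuron_ids, edges).values())
```

A single pass over the edges keeps the check usable with generators, which matters when the edge list is too large to hold in memory twice.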

MANC Repair Case Study

On April 27, 2026, syndb article cache populate skipped manc:precomputed_bottleneck_neurons because the live table was empty. The root cause was not cache population. MANC had repeatedly exposed several architecture problems in the large-dataset path:

  • The MANC weights source contained endpoint body IDs that were not present in the MANC annotation source.
  • The Flight import path opened one DoPut per streaming batch, so the first batch could mark a table as uploaded and later batches then failed as duplicates.
  • The Flight streaming path did not apply the same source-aware pre-filter as the direct ClickHouse path.
  • The default Flight client timeout was too short for large table uploads.
  • Small Arrow record batches created tens of thousands of Flight messages for a single large table.
  • Graph precompute assumed neurons.polarity was UTF-8, but production MANC could expose it as Int8.

The repaired production import had:

| Table | Rows |
|---|---|
| neurons | 211743 |
| synapses | 26036056 |
| precomputed_bottleneck_neurons | 1637 |

Before repair, MANC had 211743 neurons and 247186482 synapses, with 12617642 missing pre endpoints and 121202907 missing post endpoints. Those missing endpoints are a source-model mismatch, not a reason to create placeholder neurons.

MANC Source Model

For MANC v0.9, the canonical SynDB graph universe is the annotated neuron set from:

body-annotations-male-cns-v0.9-minconf-0.5.feather

The weighted connectome source:

connectome-weights-male-cns-v0.9-minconf-0.5.feather

can contain body IDs outside that annotation universe. The importer must filter connection rows to annotated pre and post body IDs before transforming them into SynDB synapses. Do not silently add synthetic neurons for unannotated endpoints: that changes the biological scope of the dataset and corrupts downstream coverage and graph summaries.

This source-aware filter must be applied in every import mode. If both direct ClickHouse upload and Arrow Flight upload exist, both paths must run the same pre-filter before table upload.
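One way to keep the two import paths consistent is to route both through a single filter function. The sketch below assumes connection rows are tuples of (pre body ID, post body ID, weight); the function name `filter_to_annotated` and the row layout are assumptions for illustration, not the importer's actual interface.

```python
def filter_to_annotated(connections, annotated_ids):
    """Keep rows whose pre AND post body IDs are both in the annotation set.

    Returns (kept_rows, dropped_count). Nothing is synthesized: rows with an
    unannotated endpoint are removed, never patched with placeholder neurons.
    """
    annotated = set(annotated_ids)
    kept = []
    dropped = 0
    for row in connections:
        pre, post = row[0], row[1]
        if pre in annotated and post in annotated:
            kept.append(row)
        else:
            dropped += 1
    return kept, dropped
```

Returning the dropped count makes it possible to log how many rows fell outside the annotation universe, so the filter is visible in import logs instead of silent.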

Flight Upload Rules

For streaming tables, a table upload is atomic at the table level from the metadata system’s perspective. Do not open one Flight DoPut per batch. Open one DoPut stream for the table and send all record batches through that stream.

The failure signature for the broken pattern was:

table already exists

after the first streaming batch had already succeeded. That left production in a misleading state: the metadata table could say a source table was uploaded, while ClickHouse only contained the first slice of the table.

Large Flight uploads also need operationally appropriate transport settings:

  • Use a long ETL upload connection timeout, currently two hours.
  • Convert large dataframes to larger Arrow record batches, up to 65536 rows per batch, rather than relying on small default bridge batches.
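The two rules can be sketched together: open one stream per table, then push every batch through it. The example below is a plain-Python stand-in, not SynDB's importer. `open_stream` is a hypothetical callable returning an object with `write(batch)` and `close()`, standing in for a Flight DoPut writer.

```python
MAX_BATCH_ROWS = 65536  # upper bound stated in the rules above


def rebatch(rows, max_rows=MAX_BATCH_ROWS):
    """Yield consecutive slices of rows holding at most max_rows each."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    for start in range(0, len(rows), max_rows):
        yield rows[start:start + max_rows]


def upload_table(open_stream, rows, max_rows=MAX_BATCH_ROWS):
    """Send a whole table through ONE stream and return the batch count.

    open_stream is a hypothetical factory; it is called exactly once per
    table, never once per batch.
    """
    writer = open_stream()
    sent = 0
    try:
        for batch in rebatch(rows, max_rows):
            writer.write(batch)
            sent += 1
    finally:
        writer.close()  # close the stream even if a batch fails
    return sent
```

In a pyarrow-based client, `Table.to_batches(max_chunksize=...)` and the writer returned by `FlightClient.do_put` would fill these roles; whether SynDB's importer uses them is not stated here.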

Cleaning A Partial Import

When repairing a dataset with partial or semantically invalid rows, clean every dataset-scoped product derived from that dataset before re-importing. Do not delete global views that do not have dataset_id.

First, identify the PostgreSQL leader before changing metadata:

kubectl exec -n syndb syndb-postgres-1 -- patronictl list

Use the current leader pod for writes.

For ClickHouse, delete source rows and dataset-scoped derived rows. Use long timeouts and synchronous mutations for large datasets:

kubectl exec -n syndb <chi-pod> -- \
  clickhouse-client \
    --host syndb-cluster \
    --user syndb \
    --password "$CLICKHOUSE_PASSWORD" \
    --receive_timeout 3600 \
    --send_timeout 3600 \
    --query "
      ALTER TABLE syndb.neurons
      DELETE WHERE dataset_id = '<dataset-id>'
      SETTINGS mutations_sync = 2;

      ALTER TABLE syndb.synapses
      DELETE WHERE dataset_id = '<dataset-id>'
      SETTINGS mutations_sync = 2;
    "

Then remove dataset-scoped materialized-view targets and precompute products. Generate the table list from ClickHouse metadata so global tables are not accidentally deleted:

SELECT table
FROM system.columns
WHERE database = 'syndb'
  AND name = 'dataset_id'
  AND (
    table LIKE 'mv_%'
    OR table LIKE 'precomputed_%'
  )
ORDER BY table;

For each returned table:

ALTER TABLE syndb.<table>
DELETE WHERE dataset_id = '<dataset-id>'
SETTINGS mutations_sync = 2;
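When the metadata query returns many tables, the per-table statements can be generated instead of typed by hand. This is a minimal sketch, assuming the table names come from the query above; the function name `delete_statements` is hypothetical. It validates the dataset ID as a UUID and refuses names that do not carry the mv_ or precomputed_ prefix, so a global table cannot slip in by accident.

```python
import re
import uuid

# Only names matching the prefixes used in the metadata query are accepted.
_TABLE_NAME = re.compile(r"^(mv_|precomputed_)[A-Za-z0-9_]+$")


def delete_statements(tables, dataset_id: str) -> list:
    """Build one synchronous DELETE mutation per dataset-scoped table."""
    ds = str(uuid.UUID(dataset_id))  # raises ValueError on a malformed id
    statements = []
    for table in tables:
        if not _TABLE_NAME.match(table):
            raise ValueError(f"refusing unexpected table name: {table!r}")
        statements.append(
            f"ALTER TABLE syndb.{table} "
            f"DELETE WHERE dataset_id = '{ds}' "
            "SETTINGS mutations_sync = 2;"
        )
    return statements
```

The function only builds strings. Reviewing the generated statements before running them is a cheap safeguard for a destructive operation.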

Finally, reset the relevant PostgreSQL upload state on the leader:

UPDATE dataset_table_state
SET upload_state = 'pending',
    row_count = NULL,
    uploaded_at = NULL,
    error_message = NULL
WHERE dataset_id = '<dataset-id>'
  AND table_id IN (<table-ids-to-rerun>);

Only reset the table IDs that will actually be regenerated.

MANC-Only Helm Apply

Kubernetes Jobs are immutable, so delete failed or running ETL jobs before changing their specs:

nix develop . -c kubectl delete job -n syndb \
  -l app=syndb-etl \
  --field-selector status.successful!=1

The helper apply path can overlay registry-generated ETL skip flags. When you need a surgical MANC-only run, use a direct Helm apply and explicitly skip every other pipeline:

pipelines=(
  allen-cell-types banc celegans celegans-dauer celegans-male ciona fanc
  flywire h01 hemibrain larval medulla-7col microns neuromorpho optic-lobe
  platynereis spine-morphometry witvliet wormneuroatlas
)

args=()
for pipeline in "${pipelines[@]}"; do
  args+=(--set "syndb-etl.skipPipelines.${pipeline}=true")
done

nix develop . -c helm upgrade --install syndb-nautilus \
  infrastructure/helm/nautilus \
  -n syndb \
  -f infrastructure/helm/nautilus/values.yaml \
  "${args[@]}"

After the import job completes, validate source rows, upload state, and endpoint integrity before running cache population.

Graph Precompute For Large Graphs

MANC is too large for approaches that assume all products can be materialized in memory. Use the large topology backend for exact bottleneck results over the canonical vw_graph_edges stream.

For local operation through the devshell:

nix develop . -c syndb article graph precompute \
  --dataset manc \
  --resume=false \
  --replace-existing=true \
  --max-edges 200000000 \
  --small-network-threshold 5000 \
  --fail-fast=true

If running in-cluster, make sure the deployed ETL image contains the graph-precompute polarity fix:

toString(polarity) AS polarity

in the large-topology neuron query. Without that cast, production MANC can fail when neurons.polarity is represented as Int8 rather than UTF-8.

Validate the graph products directly:

SELECT count()
FROM syndb.precomputed_bottleneck_neurons
WHERE dataset_id = '<dataset-id>';

SELECT status, backend, row_count, exact
FROM syndb.precomputed_analysis_status
WHERE dataset_id = '<dataset-id>'
ORDER BY product;

For the repaired MANC import on April 27, 2026, precomputed_bottleneck_neurons contained 1637 rows and the analysis status reported exact graph_precompute_streaming_topology output.

Cache Population

syndb article cache populate downloads derived production tables and materialized views into local manuscript cache parquet files. It is not the raw dataset import path. A skipped product during cache population should be treated as a symptom of a missing or empty production data product unless the skip is explicitly expected.

For MANC bottlenecks, the correct order is:

  1. Validate the imported source rows and endpoint integrity.
  2. Regenerate precomputed_bottleneck_neurons.
  3. Validate the precomputed row count and analysis status.
  4. Re-run syndb article cache populate.

Do not patch the manuscript cache with synthetic rows. The cache should reflect the deployed data products.

Build Caching

SynDB uses multiple layers of caching to keep compile times short across local development, CI pipelines, and production deploys.

Cargo compiler flags

Configured in .cargo/config.toml, these flags speed up every local cargo invocation:

| Flag | Effect |
|---|---|
| -C link-arg=-fuse-ld=mold | Mold linker: significantly faster than the default ld or lld (Linux only) |
| -Zshare-generics=y | Share monomorphized generics between crates, reducing codegen work |
| -Zthreads=8 | Parallel compiler frontend (parsing, macro expansion, type checking) |
| codegen-backend = "cranelift" | Dev profile uses Cranelift instead of LLVM for faster debug builds |
| codegen-backend = "llvm" (for deps) | Dependencies still use LLVM for better optimization |

CI caching (GitHub Actions)

In CI, syndb-ci runs tests and builds directly on the host (no Docker for the ci subcommand). Cargo artifacts are cached between runs via Swatinem/rust-cache@v2, which persists target/ and the cargo registry keyed by branch and Cargo.lock hash.

For integration tests (local-stack-test, e2e-test), syndb-ci uses bollard to start ephemeral Docker containers (PostgreSQL, ClickHouse, MinIO) on a shared Docker network. Test binaries run on the host with environment variables pointing at localhost:<port>. No cargo cache volumes are needed inside containers — the host target/ is used directly.

Nix OCI cache

Local stack images (syndb stack prepare) for the API and ETL are built with Nix and cached using syndb-ci nix-oci-cache. This command uses Nix store paths as content-addressed fingerprints to skip unnecessary rebuilds:

  1. nix build .#oci-syndb-api produces a store path (a hash of all inputs)
  2. The script compares the current store path against a stamp file (/tmp/.oci-syndb-api.storepath)
  3. If unchanged, the build is skipped entirely
  4. If changed, the new tarball is copied to /tmp/ and loaded into Docker

This means syndb stack prepare is near-instant when source hasn’t changed.
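The stamp-file comparison in steps 2 to 4 can be sketched as follows. This illustrates the technique and is not the code of syndb-ci nix-oci-cache; the function names and the store paths in the demonstration are placeholders.

```python
import tempfile
from pathlib import Path


def needs_rebuild(store_path: str, stamp_file: Path) -> bool:
    """True when store_path differs from the one recorded in stamp_file."""
    if not stamp_file.exists():
        return True  # nothing recorded yet, so the image must be built
    return stamp_file.read_text().strip() != store_path


def record_build(store_path: str, stamp_file: Path) -> None:
    """Remember the store path of the image that was just loaded."""
    stamp_file.write_text(store_path + "\n")


# Demonstration in a temporary directory (placeholder store paths).
with tempfile.TemporaryDirectory() as tmp:
    stamp = Path(tmp) / ".oci-syndb-api.storepath"
    first = needs_rebuild("/nix/store/aaa-oci-syndb-api", stamp)   # no stamp yet
    record_build("/nix/store/aaa-oci-syndb-api", stamp)
    second = needs_rebuild("/nix/store/aaa-oci-syndb-api", stamp)  # unchanged
    third = needs_rebuild("/nix/store/bbb-oci-syndb-api", stamp)   # inputs changed
```

Because the store path already encodes every build input, comparing one short string is enough to decide whether the image is stale.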

Nix Crane dependency caching

For nix flake check (CI) and Nix-based OCI image builds, the project uses Crane with a split dependency build:

# nix/rust.nix
mkCargoArtifacts = system:
    craneLib.buildDepsOnly (mkCommonArgs system);

buildDepsOnly compiles all workspace dependencies into a cached Nix derivation. Subsequent builds of workspace crates reuse these artifacts, so only the project’s own code is recompiled. Because a derivation’s store path is derived from a hash of its inputs, the dependency cache is automatically invalidated when Cargo.lock changes.

UI Source-Hash Cache

The UI image path used by syndb stack prepare uses a source-hash stamp to skip rebuilds when Rust UI source files, static assets, or the packaged QueryFabric catalog sources haven’t changed:

  1. SHA-256 of all files in crates/services/ui/src/, crates/services/ui/public/, queryfabric/crates/queryfabric/src/, queryfabric/crates/queryfabric-catalog/src/, queryfabric/crates/queryfabric-web/src/, queryfabric/crates/queryfabric-web/assets/, queryfabric/crates/queryfabric-leptos/src/, and queryfabric/crates/queryfabric-dialect-syql/src/, plus the relevant workspace and crate Cargo.toml files and Cargo.lock
  2. Compared against /tmp/.syndb-ui.srchash
  3. If the hash matches and syndb-ui:dev exists in Docker, the build is skipped
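A source-hash stamp of this kind can be sketched with hashlib. The exact scheme used by syndb stack prepare (file ordering, whether paths are hashed) is not described here, so the choices below are assumptions, and `source_hash` is a hypothetical name. The sketch also leaves out the additional check that syndb-ui:dev exists in Docker.

```python
import hashlib
import tempfile
from pathlib import Path


def source_hash(roots) -> str:
    """SHA-256 over the paths and contents of every file under roots."""
    digest = hashlib.sha256()
    for root in roots:
        root = Path(root)
        if root.is_file():
            files = [root]
        else:
            files = sorted(p for p in root.rglob("*") if p.is_file())
        for path in files:  # sorted order keeps the hash deterministic
            digest.update(path.as_posix().encode())  # renames change the hash
            digest.update(b"\0")
            digest.update(path.read_bytes())
    return digest.hexdigest()


# Demonstration: the hash is stable until a source file changes.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "main.rs").write_text("fn main() {}")
    before = source_hash([src])
    unchanged = source_hash([src])
    (src / "main.rs").write_text("fn main() { run(); }")
    after = source_hash([src])
```

The resulting hex digest would then be compared against the contents of the stamp file, in the same way as the store-path stamp used for the API and ETL images.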

The extracted SyQL editor asset is now served from /static/queryfabric_syql_editor.js.

Summary

| Layer | Scope | Mechanism | Invalidation |
|---|---|---|---|
| Cargo flags | Local dev | Mold, Cranelift, parallel frontend | N/A (always active) |
| GitHub Actions cache | CI pipelines | Swatinem/rust-cache@v2 | Branch + Cargo.lock hash |
| Nix OCI cache | syndb stack prepare (API, ETL) | Nix store-path stamps | Content-addressed (any input change) |
| Crane deps | nix flake check, OCI images | buildDepsOnly derivation | Cargo.lock changes |
| UI source hash | syndb stack prepare (UI) | SHA-256 file stamp | Source file changes |

Troubleshooting

Find up-to-date explanations of different types of errors and pointers on how to resolve them.

403, Forbidden

Verification

Academic verification is required for computationally or network-heavy tasks. This is to ensure that the resources are not being misused. You may verify yourself after registering on the platform — see Authentication for details on CILogon verification.

Dataset

A dataset belongs to its creator and to any groups the creator has chosen to share ownership with. If you cannot access a dataset, you are neither its creator nor a member of such a group. You may request access to the dataset from the creator.

429, Too Many Requests

You have exceeded the rate limit (100 requests/second by default). Respect the Retry-After header and implement exponential backoff. See Rate Limiting.
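A client-side delay calculation might look like the sketch below. It prefers the server's Retry-After value when it is given in seconds and otherwise falls back to exponential backoff with jitter. The function name `retry_delay` and the `base` and `cap` defaults are assumptions for illustration, not values documented by SynDB.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # server-provided delay wins
        except ValueError:
            pass  # e.g. an HTTP-date value; fall back to backoff
    ceiling = min(cap, base * (2 ** attempt))  # doubles each attempt, capped
    return random.uniform(0.0, ceiling)  # jitter spreads out retrying clients
```

A caller would sleep for the returned number of seconds before resending the request, and give up after a fixed number of attempts.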

Job Failures

If a submitted job fails:

  1. Check the job status: GET /v1/jobs/{job_id} — the error field describes the failure
  2. Common causes: query timeout, result too large (>1 GB), ClickHouse resource limits
  3. Rerun the job: POST /v1/jobs/{job_id}/rerun

See Jobs System for details.
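The two endpoints from the steps above can be wrapped in a small helper. The sketch below only builds the requests and does not send them. The base URL is a placeholder, authentication headers are omitted, and `job_requests` is a hypothetical name.

```python
import urllib.request


def job_requests(base_url: str, job_id: str):
    """Build (status, rerun) requests for a job without sending them."""
    root = f"{base_url.rstrip('/')}/v1/jobs/{job_id}"
    status = urllib.request.Request(root, method="GET")
    rerun = urllib.request.Request(f"{root}/rerun", method="POST")
    return status, rerun


# Placeholder host; replace with your deployment's API address.
status_req, rerun_req = job_requests("https://syndb.example/", "JOB_ID")
```

Passing either request to `urllib.request.urlopen` would perform the call; inspect the error field of the status response before deciding to rerun.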

SyQL Errors

  • Parse errors: Check SyQL syntax in the error message. Use POST /v1/syql/plan to validate without executing.
  • Resolution errors: A referenced table or column does not exist. Check the Data Structuring guide for valid column names.
  • Timeout: Large queries may exceed the 60s HTTP timeout. Use POST /v1/syql/exec to submit as an async job instead.

Federation Issues

See Federation Troubleshooting for:

  • Node discovery failures (mDNS, multiaddrs)
  • Schema version mismatches
  • Cluster health states
  • Docker Compose federation profile issues