Welcome to SynDB
SynDB is a platform for finding, sharing, and analyzing connectomics datasets and derived neuroanatomical tables. It supports federated deployments where institutions retain data sovereignty while participating in cross-institutional analysis.
Quick Links
- Installation — hosted app, local stack, or CLI-from-source
- Upload — share your data
- Search — find datasets
- SyQL — query neuroanatomical data
- Graph Analysis — network analysis on connectomes
- Federation — join as a federated node
- API Reference — full route map and auth details
- CLI Reference — all commands and options
Resources
Related Platforms
- DANDI Archive — neurophysiology data archive
- OpenNeuro — neuroimaging data archive
- NeuroMorpho.org — neuron morphology database
Why use SynDB?
SynDB serves three audiences: data owners who produce microscopy data, data scientists who analyze it, and institutions that want to participate in federated analysis without giving up control of their data.
Image data owner
- Data sharing: Others can use your data to teach, increasing the educational value of the data. SynDB follows the FAIR principles to maximize the impact of shared data.
- Citations: Whenever your data is used in a publication, you will be cited, increasing your visibility in the scientific community.
- Provenance tracking: Version history, lineage, and auto-generated citations (BibTeX, RIS) for your datasets.
Data scientist
- Meta-analysis: Compare data across thousands of experiments using cross-dataset meta-analysis.
- SyQL queries: A declarative query language that resolves metadata into optimized SQL.
- Graph analysis: Network analysis on connectome data — motifs, shortest paths, reachability, cross-dataset comparison.
- Data visualization: Use the data to create visualizations for publications or presentations.
- Statistical modelling: Use the data to create models that can predict outcomes in future experiments.
Node operator / Institution
- Data sovereignty: Keep your data on your infrastructure — it never leaves your network.
- Federated meta-analysis: Participate in cross-institutional queries without transferring data.
- Minimal footprint: A federation node requires only ClickHouse and the syndb-node binary.
- Schema sync: The hub pushes DDL migrations to your node automatically.
See Federation Overview for setup details.
Installation
You can use SynDB in three ways:
- the hosted web app at app.syndb.xyz
- a local development stack via Docker Compose
- the syndb CLI built from this repository
Hosted Web App
If you only need to browse datasets, authenticate, or use the UI, no local installation is required. Open:
- app.syndb.xyz for the web app
- api.syndb.xyz/docs for the OpenAPI UI
Local Stack
For local development, build the project images and start the stack:
git clone --recurse-submodules https://github.com/memorycircuits/SynDB.git
cd SynDB
cp .env.example .env
syndb dev sync-versions
cargo run -p cli --features dev -- stack prepare
cargo run -p cli --features dev -- stack up
If you already cloned the repo without submodules, run:
git submodule update --init --recursive
The local entry points are:
- API docs: http://localhost:8080/docs
- UI: http://localhost:8090
- Health check: http://localhost:8080/health
CLI From Source
Build or run the current CLI directly from the repo:
cargo run -p cli --features full -- --help
cargo run -p cli --features full -- query --help
cargo run -p cli --features dev -- stack --help
If you want a standalone binary, build the crate and run ./target/debug/syndb or ./target/release/syndb afterward.
Direct API Usage
The API can be used directly through:
- api.syndb.xyz/docs
- GET https://api.syndb.xyz/openapi.json
If you want generated client bindings, use the OpenAPI schema above with your preferred generator.
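As a rough illustration of working with the schema, the sketch below lists the routes in an OpenAPI document using only the Python standard library. The helper names fetch_openapi and list_routes are hypothetical and not part of SynDB; only the schema URL comes from the text above.

```python
import json
import urllib.request

OPENAPI_URL = "https://api.syndb.xyz/openapi.json"  # URL from the section above

# Keys under a path item that denote HTTP operations in OpenAPI.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def fetch_openapi(url: str = OPENAPI_URL) -> dict:
    """Download and parse the OpenAPI document (performs a network request)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def list_routes(schema: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs found under the schema's "paths" key."""
    routes = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Skip non-operation keys such as "parameters" or "summary".
            if method.lower() in HTTP_METHODS:
                routes.append((method.upper(), path))
    return sorted(routes)


# Usage (needs network access):
#   for method, path in list_routes(fetch_openapi()):
#       print(method, path)
```

Listing routes this way is a quick check that the schema is reachable before you hand it to a client generator.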
Quick start
Web app
Use the hosted deployment at app.syndb.xyz, or run the local stack and open http://localhost:8090.
Command line interface
If syndb is already on your PATH:
syndb --help
From this repository, you can run the current CLI without installing it globally:
cargo run -p cli --features full -- --help
If this clone was not created with --recurse-submodules, initialize the nested QueryFabric workspace first:
git submodule update --init --recursive
Useful next commands:
syndb auth register
syndb auth login
syndb query --help
syndb data --help
Next steps
- Authenticate — set up your account and verify for academic access
- Search data
- Upload data
- SyQL queries — query neuroanatomical data (requires academic verification)
- Federation — join as a federated node
- API documentation
Authentication
SynDB uses PASETO v4 tokens for authentication. Access tokens authorize API requests; refresh tokens obtain new access tokens without re-authenticating.
Account Types
| Type | How to create | Capabilities |
|---|---|---|
| Regular | POST /v1/user/auth/register or CLI syndb auth register | Browse, search datasets |
| Academic | Verify via CILogon (institutional login) | All regular + SyQL, graph analysis, meta-analysis, upload, jobs |
| Service | POST /v1/user/auth/register-service with X-Service-Secret header | Same as Academic (auto-verified) |
| SuperUser | Promoted by existing superuser | All + federation admin, ontology management |
Academic verification is required for compute-intensive operations: query execution, graph analysis, analytics, meta-analysis, and dataset upload.
Registration & Login
CLI:
syndb auth register
syndb auth login
API:
# Register
curl -X POST https://api.syndb.xyz/v1/user/auth/register \
-H "Content-Type: application/json" \
-d '{"email": "[email protected]", "password": "..."}'
# Login — returns access_token and refresh_token
curl -X POST https://api.syndb.xyz/v1/user/auth/login \
-H "Content-Type: application/json" \
-d '{"email": "[email protected]", "password": "..."}'
Token Lifecycle
- Login returns an access token (15 min TTL) and a refresh token (30 day TTL)
- Use the access token in requests: Authorization: Bearer <access_token>
- When the access token expires, exchange the refresh token for a new pair:
curl -X POST https://api.syndb.xyz/v1/user/auth/refresh \
-H "Content-Type: application/json" \
-d '{"refresh_token": "..."}'
- Each refresh rotates the token: the old refresh token is invalidated
Refresh tokens use family-based rotation: reuse of a revoked token invalidates the entire family, forcing re-authentication.
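The lifecycle above could be handled on the client side roughly as follows. This is a minimal sketch, not SynDB client code: TokenPair and build_refresh_request are hypothetical names, and it assumes the refresh response carries the same access_token and refresh_token fields that login returns.

```python
import json
import urllib.request
from dataclasses import dataclass

API_BASE = "https://api.syndb.xyz"
ACCESS_TTL_SECONDS = 15 * 60  # access token lifetime stated above


def build_refresh_request(refresh_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/user/auth/refresh request."""
    body = json.dumps({"refresh_token": refresh_token}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/v1/user/auth/refresh",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


@dataclass
class TokenPair:
    access_token: str
    refresh_token: str
    issued_at: float  # seconds since the epoch, e.g. time.time()

    def access_expired(self, now: float, skew: float = 30.0) -> bool:
        """True once the access token is within `skew` seconds of its TTL."""
        return now - self.issued_at >= ACCESS_TTL_SECONDS - skew

    def rotate(self, response: dict, now: float) -> "TokenPair":
        """Replace both tokens; the old refresh token must not be reused.

        Assumption: the response uses the same field names as login.
        """
        return TokenPair(response["access_token"], response["refresh_token"], now)
```

Storing the rotated pair immediately matters here: because of family-based rotation, sending the old refresh token again would invalidate the whole family.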
OAuth Providers
Authenticate through institutional or social identity providers:
| Provider | Use case | Scopes |
|---|---|---|
| CILogon | Academic institutional login (universities, research labs) | openid, email, org.cilogon.userinfo |
| GitHub | Social login + ORCID association | user:email |
| | Social login | openid, email, profile |
| GitLab | Social login (supports self-hosted instances) | read_user |
| ORCID | Researcher ID association (requires existing account) | openid |
All OAuth flows use PKCE (Proof Key for Code Exchange) with SHA-256.
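For readers unfamiliar with PKCE, the S256 method derives the challenge as the unpadded URL-safe Base64 encoding of the SHA-256 hash of the verifier. The sketch below shows that derivation in Python; the function names are illustrative and it does not describe how SynDB clients implement it internally.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes: int = 32) -> str:
    """Random URL-safe verifier; 32 random bytes give 43 characters."""
    raw = secrets.token_bytes(n_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_code_challenge(verifier: str) -> str:
    """S256 challenge: BASE64URL(SHA256(verifier)) without '=' padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()          # kept secret by the client
challenge = make_code_challenge(verifier)  # sent with the authorize request
```

The client sends the challenge when starting the flow and the verifier when exchanging the authorization code, so an intercepted code alone cannot be redeemed.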
Academic Verification via CILogon
CILogon links your institutional identity to your SynDB account, automatically verifying you as an academic user:
- Log in to SynDB
- Navigate to CILogon verification (or GET /v1/user/authenticate/cilogon/authorize)
- Authenticate with your institution’s SSO
- Your account is marked as verified — unlocking SyQL, graph analysis, and upload
Service Accounts
For automated pipelines and integrations:
curl -X POST https://api.syndb.xyz/v1/user/auth/register-service \
-H "Content-Type: application/json" \
-H "X-Service-Secret: <SERVICE_SECRET>" \
-d '{"email": "[email protected]", "password": "..."}'
Service accounts are auto-verified and bypass academic checks. The X-Service-Secret must match the server’s SERVICE_SECRET environment variable.
Logout
# Revokes the refresh token
curl -X POST https://api.syndb.xyz/v1/user/auth/logout \
-H "Content-Type: application/json" \
-d '{"refresh_token": "..."}'
Overview
The SynDB data platform is accessible through the API. Through search, you can find and download analysis-ready connectomics tables and derived metrics; through upload, you can share your data so that it becomes part of meta-analytical studies.
Composition
The SynDB data platform is designed to provide a comprehensive and organized repository of high-resolution microscopy data products and associated metadata. In practice, SynDB is organized around three components: Metadata, Analysis-Ready Tables, and Optional Source Assets.
Metadata
Metadata is used to define and retrieve datasets. It describes the data in each dataset:
- Brain region
- Sourcing model animal
- Genetic manipulations (mutations)
- Microscopy method
- Publication information
The metadata is defined by the data owner during upload.
Warning
Dataset
You must split your dataset into individual SynDB datasets if any of these fields differ within your own dataset.
Analysis-ready tables
The primary data in SynDB is not the raw microscopy volume. It is the analysis-ready output derived from that volume: neuron tables, synapse tables, compartment metrics, and other object-level measurements that can be searched and analyzed directly. Each neuronal compartment and structure has its own schema in SynDB.
To facilitate efficient data management, every row is linked to a dataset via its ID. This linkage enables robust search capabilities by filtering through metadata, without requiring users or ETL jobs to move terabytes of raw imaging data around. You can learn more about how dataset metadata filtration works in the article on search.
The flexible data model of SynDB supports this functionality by defining specific parameters for each compartment and structure. These varied tables are unified into comprehensive datasets through dataset metadata, which effectively organizes data groups across the platform.
Optional source assets
SynDB can also attach source-linked assets such as meshes and SWC skeleton files when they are available. These assets are optional and supplement the tabular release. SynDB does not require contributors to hand over raw imaging volumes or segmentation stores in order to ingest a dataset.
Organization & Tracking
- Collections & Tags: Group datasets into curated collections and apply tags for discovery.
- Provenance & Citations: Track version history, data lineage, and generate citations in BibTeX/RIS format. Export metadata as JSON-LD for linked data integration.
Search
The search feature filters through datasets based on the search terms provided by the user. The search terms can be combined to narrow down the search results.
By default, every search field is AND-based, meaning every provided term must be present in the resulting dataset.
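The AND semantics can be illustrated with a small in-memory filter. This is only a model of the behaviour described above, not the search implementation; the field names animal and microscopy are borrowed from the CLI flags elsewhere in this documentation and the matching rule (case-insensitive equality) is an assumption.

```python
def matches_all(dataset: dict, terms: dict) -> bool:
    """AND semantics: every provided term must match the dataset metadata."""
    return all(
        str(dataset.get(field, "")).lower() == str(value).lower()
        for field, value in terms.items()
    )


def search(datasets: list[dict], **terms) -> list[dict]:
    """Keep only the datasets that satisfy all of the given terms."""
    return [d for d in datasets if matches_all(d, terms)]


# Hypothetical metadata records for illustration only.
datasets = [
    {"label": "A", "animal": "Drosophila melanogaster", "microscopy": "EM"},
    {"label": "B", "animal": "Mus musculus", "microscopy": "EM"},
]

em_only = search(datasets, microscopy="EM")
fly_em = search(datasets, animal="Drosophila melanogaster", microscopy="EM")
```

Adding a term can only shrink the result set, which is why combining terms narrows the search.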
Download the search results
After a search, you can download the imaging-derived metrics of the datasets in the search results. You get a single .tar.xz file containing Parquet files, which you can read with the pandas or polars library in Python.
Note
Other languages
Apache Parquet is a file format supported by most popular programming languages. You can find libraries for reading Parquet files in your preferred language.
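Unpacking the downloaded archive could look like the sketch below, which uses only the Python standard library. The function name and the choice to flatten all Parquet files into one directory are assumptions made for illustration; the internal layout of the archive is not specified here.

```python
import shutil
import tarfile
from pathlib import Path


def extract_parquet_files(archive: str, dest_dir: str) -> list[Path]:
    """Copy every .parquet member of a .tar.xz archive into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive, mode="r:xz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".parquet")):
                continue
            # Use only the base name so a crafted path cannot escape dest_dir.
            # Note: members with the same base name would overwrite each other.
            target = dest / Path(member.name).name
            source = tar.extractfile(member)
            with source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            extracted.append(target)
    return sorted(extracted)


# Afterwards each file can be loaded with, for example,
# pandas.read_parquet(path) or polars.read_parquet(path).
```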
Upload
Note
Prerequisites
This article assumes that you understand how data is stored on SynDB. We recommend reading the overview article first if you are uncertain.
Uploading to SynDB is a multistep process and requires an understanding of the SynDB dataset model.
The process
Preparation
We recommend following this guide in the exact sequence provided so that each step is completed correctly.
Terms and conditions
You must accept the terms and conditions before uploading data. The terms include:
- Statement that the data is not false or misleading
- Redistribution rights
- Data licensing agreement with the license of your choice, see guide to pick license; the current default is CC BY 4.0.
Data structuring
SynDB utilizes data standardization to facilitate uploads. Your imaging metrics must be in a tabular data format; for instance, .xlsx, .csv, or .parquet. Read more about the data structuring in the contributor’s guide.
Login
Once you enter the upload page, you will be prompted to log in to your SynDB account if you are not already logged in. You must also verify your academic status by logging in to your institution’s account.
The upload
You can upload data using the CLI or the web UI, including mixing both approaches. The UI is usually the simplest path for a first upload, while the CLI is better for reproducible and scripted ingestion.
1. Assign IDs, and correlate relations
Each SynDB unit requires a unique ID assigned before being uploaded to the platform. The web UI does this automatically, but the CLI does not. When you have multiple SynDB tables under one dataset, these tables are expected to be related to each other.
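When working outside the web UI, IDs have to be present before upload. One way to fill them in is sketched below. It assumes rows are held as dictionaries, that the ID column is named as in the table's primary key (for example neuron_id), and that random version 4 UUIDs are acceptable; none of these details are confirmed by this guide, so check the data structuring guide before relying on it.

```python
import uuid


def assign_ids(rows: list[dict], id_column: str) -> list[dict]:
    """Return copies of rows with a fresh UUID4 in id_column where it is missing."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        if not row.get(id_column):
            row[id_column] = str(uuid.uuid4())
        out.append(row)
    return out


# Hypothetical rows: the second one already has an ID and keeps it.
neurons = assign_ids(
    [{"name": "n1"}, {"name": "n2", "neuron_id": "keep-me"}],
    id_column="neuron_id",
)
```

Existing IDs are preserved so that relations already expressed in other tables stay valid.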
Warning
Dataset integrity
Uploading unrelated SynDB tables under the same dataset is not allowed, because it may lead to undefined behaviour.
For example, you cannot upload a table of neurons and a table of synapses under the same dataset unless each synapse is related to a neuron in that neuron table.
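A quick local check for this rule might look like the following sketch. It is not the platform validator: the exact rule the validator applies is not stated here, so the sketch flags a synapse only when neither of its endpoints appears in the neuron table. The column names neuron_id, pre_neuron_id, and post_neuron_id are taken from the SyQL table reference.

```python
def orphan_synapses(neurons: list[dict], synapses: list[dict]) -> list[dict]:
    """Return synapses that reference no neuron from the given neuron table."""
    neuron_ids = {n["neuron_id"] for n in neurons}
    return [
        s
        for s in synapses
        if s.get("pre_neuron_id") not in neuron_ids
        and s.get("post_neuron_id") not in neuron_ids
    ]


# Hypothetical rows for illustration.
neurons = [{"neuron_id": "n1"}, {"neuron_id": "n2"}]
synapses = [
    {"synapse_id": "s1", "pre_neuron_id": "n1", "post_neuron_id": "n2"},
    {"synapse_id": "s2", "pre_neuron_id": "x9", "post_neuron_id": "x8"},
]
orphans = orphan_synapses(neurons, synapses)
```

An empty result suggests the two tables belong together; any returned rows should be fixed or moved to a separate dataset before upload.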
Web UI
The web UI will automatically assign UUIDs to each SynDB unit. Parent-child relations are checked against the current SynDB table hierarchy during validation; see the data structuring guide for the current dataset model and naming rules.
CLI
The CLI flow is explicit and reproducible:
- Create the dataset metadata record and note the returned dataset ID.
syndb data new \
--label "My connectome release" \
--animal "Drosophila melanogaster" \
--microscopy EM \
--table 1 \
--table 6 \
--brain-structure "mushroom body" \
--license CC_BY
- Prepare raw tabular files into a validated parquet upload directory.
syndb data prepare \
--input-dir raw_dataset \
--output-dir prepared_dataset
- Validate the prepared parquet files before upload.
syndb data validate --input-dir prepared_dataset
- Upload the prepared dataset through Arrow Flight.
syndb data upload \
--input-dir prepared_dataset \
--dataset-id <syndb-dataset-id>
This CLI flow mirrors the current validator and upload path used by the rest of the platform.
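For scripted ingestion, the prepare, validate, and upload steps can be chained from Python. The sketch below only uses the subcommands and flags shown above; the wrapper functions are hypothetical, and syndb data new is left out because the format in which it reports the dataset ID is not documented here.

```python
import subprocess


def upload_commands(raw_dir: str, prepared_dir: str, dataset_id: str) -> list[list[str]]:
    """Return the prepare, validate, and upload commands as argv lists."""
    return [
        ["syndb", "data", "prepare", "--input-dir", raw_dir, "--output-dir", prepared_dir],
        ["syndb", "data", "validate", "--input-dir", prepared_dir],
        ["syndb", "data", "upload", "--input-dir", prepared_dir, "--dataset-id", dataset_id],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run commands in order; stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)  # raises CalledProcessError on failure


# Usage (requires syndb on PATH and a dataset ID from `syndb data new`):
#   run_all(upload_commands("raw_dataset", "prepared_dataset", "<syndb-dataset-id>"))
```

Using check=True means a failed validation stops the script before anything is uploaded.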
2. Selecting or creating the SynDB dataset metadata
As mentioned in the overview article, every dataset has metadata defined by the data owner during upload. You can either select an existing dataset or create a new one.
3. Confirm and upload
Before the upload starts, you will be prompted to confirm the dataset and the data you are uploading. Once you confirm, the upload starts; it should be relatively quick.
Delete owned datasets
You may at any time delete datasets that you own. This will remove the dataset and all the data associated with it. The deletion is permanent and cannot be undone.
External Sources
SynDB supports importing connectomics data from 20+ major connectome datasets. This page covers the supported imports grouped by organism. See the CLI Reference for the complete command reference.
Note
Dataset UUID
The <syndb-dataset-id> is the UUID of the SynDB dataset that will be associated with the imported data. You can copy and paste it from the dataset management page in the web UI.
What SynDB Needs From External Groups
When we say that we need the “full dataset” for SynDB ingestion, we do not mean the raw imaging volume. We mean the complete analysis-ready release needed to populate the SynDB tables for a dataset version.
For most connectomics imports, that means:
- the complete neuron or object table for the release
- the complete synapse table for the same release, or an aggregated connection table if that is the available downstream artifact
- stable source identifiers such as root_id, pre_pt_root_id, and post_pt_root_id
- coordinates and annotations needed to map the source schema into SynDB
- optional morphology assets such as swc/ or meshes if they are part of the release
What this usually does not mean:
- raw microscopy image stacks or volume tiles
- Neuroglancer or CAVE precomputed segmentation volumes by themselves
- ongoing operational access to the source group’s infrastructure after a static export has been produced
Preferred handoff
The preferred handoff is a static snapshot in an S3 or MinIO-compatible bucket, or an equivalent directory export with the same files. If the source data lives in CAVE, export the materialized tables first and hand off the files; SynDB imports the exported tables, not CAVE itself.
In practical terms, the source group’s involvement is usually limited to:
- granting permission for SynDB to ingest and redistribute the agreed downstream artifacts
- providing the exported snapshot in an agreed format
- answering schema questions if a column needs clarification
For a typical neuron and synapse release, the handoff looks like:
dataset-name/
neurons.csv.gz
synapses.csv.gz
connections.csv.gz # optional aggregated fallback
swc/ # optional morphology assets
meshes/ # optional geometry assets
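A source group could sanity-check a snapshot against this layout before handing it off. The sketch below is a hypothetical helper, not part of the SynDB tooling. It treats neurons.csv.gz as required and accepts either synapses.csv.gz or the connections.csv.gz fallback, following the layout above.

```python
from pathlib import Path


def check_handoff(root: str) -> list[str]:
    """Return a list of problems found in a handoff directory (empty if none)."""
    base = Path(root)
    problems = []
    if not (base / "neurons.csv.gz").is_file():
        problems.append("missing neurons.csv.gz")
    has_synapses = (base / "synapses.csv.gz").is_file()
    has_connections = (base / "connections.csv.gz").is_file()
    if not (has_synapses or has_connections):
        problems.append("missing synapses.csv.gz or connections.csv.gz")
    # swc/ and meshes/ are optional, so their absence is not reported.
    return problems
```

This only checks that files exist; schema-level checks still happen in the validate step of the relevant importer.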
Note
Organelle coverage is separate
Public SynDB support for a dataset’s neuron and synapse import path does not imply that vesicle or mitochondria tables are also available. For the current Hemibrain, MANC/Male CNS, MICrONS, and H01 production paths, SynDB imports neurons and synapses only. Any organelle-backed workflow such as manuscript CF08 needs a separate upstream snapshot or manual export path plus matching ETL wiring before production can populate those tables.
Drosophila melanogaster
FlyWire
Whole-brain Drosophila connectome reconstructed from a full adult female brain (FAFB). Data is exported from CAVE in CSV format.
Source: FlyWire Codex | Publication: Dorkenwald et al., 2024. Nature
Validate your FlyWire data directory:
syndb etl flywire validate --data-dir external_datasets/FlyWire
Import into your dataset:
syndb etl flywire import \
--data-dir external_datasets/FlyWire \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
FlyWire also supports a synapses-detailed table for individual synapse positions (large, batched import).
Hemibrain
Half-brain connectome of an adult Drosophila from the Janelia FlyEM project (v1.2.1). Contains ~25,000 neurons with traced morphology and synaptic connections.
Source: Janelia FlyEM Hemibrain | Publication: Scheffer et al., 2020. eLife
Download the dataset:
syndb etl hemibrain download --output-dir external_datasets/Hemibrain --extract
Validate and import:
syndb etl hemibrain validate --data-dir external_datasets/Hemibrain
syndb etl hemibrain import \
--data-dir external_datasets/Hemibrain \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
MANC (Male Adult Nerve Cord)
Connectome of the male Drosophila ventral nerve cord (VNC). Data is distributed as Apache Arrow Feather files.
Source: Janelia FlyEM MANC | Publication: Takemura et al., 2024. Nature
Download the dataset:
syndb etl manc download --output-dir external_datasets/MANC
Validate and import:
syndb etl manc validate --data-dir external_datasets/MANC
syndb etl manc import \
--data-dir external_datasets/MANC \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
Warning
Download size
The MANC dataset includes the connectome-weights Feather file (~1.1 GB). Ensure sufficient disk space before downloading.
Male CNS
Male Drosophila central nervous system connectome from Janelia FlyEM. Covers the brain and ventral nerve cord with neuron-level connectivity.
Source: Google Cloud Storage | Publication: Takemura et al., 2024. Nature
Download and import:
syndb etl male-cns download --output-dir external_datasets/MaleCNS
syndb etl male-cns validate --data-dir external_datasets/MaleCNS
syndb etl male-cns import \
--data-dir external_datasets/MaleCNS \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
FANC (Female Adult Nerve Cord)
Connectome of the female Drosophila ventral nerve cord, enabling sex-specific comparisons with MANC. SynDB imports a static export of the FANC neuron and synapse tables, not the live CAVE deployment itself.
Publication: Phelps et al., 2021. Cell
Download the maintained public export:
syndb etl fanc download --output-dir external_datasets/FANC
If you are working from a custom local export instead, you can still prepare it separately:
syndb-export fanc --out-dir external_datasets/FANC --no-upload
Then validate and import:
syndb etl fanc validate --data-dir external_datasets/FANC
syndb etl fanc import \
--data-dir external_datasets/FANC \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
Optic Lobe
Drosophila optic lobe connectome from Janelia FlyEM. Maps the visual processing circuitry of the fly brain.
Source: Google Cloud Storage | Publication: Matsliah et al., 2024. Nature
syndb etl optic-lobe download --output-dir external_datasets/OpticLobe
syndb etl optic-lobe validate --data-dir external_datasets/OpticLobe
syndb etl optic-lobe import \
--data-dir external_datasets/OpticLobe \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
BANC (Brain And Nerve Cord)
Whole-body Drosophila connectome covering the brain and ventral nerve cord in a single female specimen.
Publication: Jasper et al., 2024
syndb etl banc download --output-dir external_datasets/BANC
syndb etl banc validate --data-dir external_datasets/BANC
syndb etl banc import \
--data-dir external_datasets/BANC \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
L1 Larval
Complete connectome of the first-instar Drosophila larval brain (~3,000 neurons), the first whole-brain connectome of any insect.
Source: GitHub | Publication: Winding et al., 2023. Science
syndb etl larval download --output-dir external_datasets/L1Larval
syndb etl larval validate --data-dir external_datasets/L1Larval
syndb etl larval import \
--data-dir external_datasets/L1Larval \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
Mouse
MICrONS (Minnie65)
Cubic millimeter of mouse visual cortex reconstructed at synaptic resolution by the MICrONS Consortium. Contains ~80,000 neurons and millions of synapses.
Source: MICrONS Explorer | Publication: MICrONS Consortium et al., 2021. bioRxiv
syndb etl microns download --output-dir external_datasets/MICrONS
syndb etl microns validate --data-dir external_datasets/MICrONS
syndb etl microns import \
--data-dir external_datasets/MICrONS \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
Spine Morphometry
Dendritic spine morphological measurements from electron microscopy. Three sub-datasets are supported:
| Variant | Key | Source | Publication |
|---|---|---|---|
| Kasthuri | kasthuri | Columbia Academic Commons | Kasthuri et al., 2015. Cell |
| Ofer | ofer-confocal | Zenodo | Ofer et al., 2022 |
| MICrONS | microns | Zenodo | Derived from MICrONS cortical data |
syndb etl spine-morphometry download --source kasthuri --output-dir external_datasets/SpineKasthuri
syndb etl spine-morphometry validate \
--source kasthuri \
--data-dir external_datasets/SpineKasthuri
syndb etl spine-morphometry import \
--data-dir external_datasets/SpineKasthuri \
--dataset-id <syndb-dataset-id> \
--source kasthuri
Human
H01
One cubic millimeter of human temporal cortex at nanometer resolution. Contains reconstructed neurons, synapses, and glia from a neurosurgical tissue sample.
Source: Google Cloud Storage | Publication: Shapson-Coe et al., 2024. Science
syndb etl h01 download --output-dir external_datasets/H01
syndb etl h01 validate --data-dir external_datasets/H01
syndb etl h01 import \
--data-dir external_datasets/H01 \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
C. elegans
C. elegans Hermaphrodite
Complete connectome of the adult hermaphrodite C. elegans (~300 neurons), the first organism with a fully mapped nervous system. Data sourced from the OpenWorm ConnectomeToolbox.
Source: OpenWorm ConnectomeToolbox | Publication: Cook et al., 2019. Nature
syndb etl celegans download --output-dir external_datasets/CElegansHerm
syndb etl celegans validate --data-dir external_datasets/CElegansHerm
syndb etl celegans import \
--data-dir external_datasets/CElegansHerm \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
C. elegans Male
Complete connectome of the adult male C. elegans, enabling sex-specific neural circuit comparisons.
Source: OpenWorm ConnectomeToolbox | Publication: Cook et al., 2019. Nature
syndb etl celegans-male download --output-dir external_datasets/CElegansMale
syndb etl celegans-male validate --data-dir external_datasets/CElegansMale
syndb etl celegans-male import \
--data-dir external_datasets/CElegansMale \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
C. elegans Developmental
Connectomes across eight developmental stages of C. elegans, tracking how neural circuits are assembled during growth.
Source: GitHub | Publication: Witvliet et al., 2021. Nature
syndb etl witvliet download --output-dir external_datasets/CElegansDev
syndb etl witvliet validate --data-dir external_datasets/CElegansDev
syndb etl witvliet import \
--data-dir external_datasets/CElegansDev \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
Other Organisms
Platynereis dumerilii
Whole-body connectome of the marine annelid Platynereis dumerilii, a three-day-old larva with ~5,000 neurons.
Source: GitHub | Publication: Verasztó et al., 2024. bioRxiv
syndb etl platynereis download --output-dir external_datasets/Platynereis
syndb etl platynereis validate --data-dir external_datasets/Platynereis
syndb etl platynereis import \
--data-dir external_datasets/Platynereis \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
Fish1 (Larval Zebrafish)
Larval zebrafish (Danio rerio) brain connectome. Data is accessed through CAVE and requires manual export with Google OAuth authentication.
Note
Manual export
Fish1 data requires CAVE API access with Google credentials. Export a static neuron table plus either a full synapse table or an aggregated connections table, then use the validate and import commands.
Show the expected export layout:
syndb etl fish1 cave-instructions
Or export directly to a local directory:
syndb-export fish1 --out-dir external_datasets/Fish1 --no-upload
syndb etl fish1 validate --data-dir external_datasets/Fish1
syndb etl fish1 import \
--data-dir external_datasets/Fish1 \
--dataset-id <syndb-dataset-id> \
--table neurons \
--table synapses
Multi-species Databases
Allen Cell Types
Reference electrophysiology and morphology data from the Allen Institute for Brain Science. Covers mouse and human cortical neuron types with standardized measurements.
Source: Allen Cell Types Database | API: Allen Brain Map API
syndb etl allen-cell-types download --output-dir external_datasets/AllenCellTypes
syndb etl allen-cell-types validate --data-dir external_datasets/AllenCellTypes
syndb etl allen-cell-types import \
--data-dir external_datasets/AllenCellTypes \
--dataset-id <syndb-dataset-id> \
--table neurons
NeuroMorpho
Curated archive of digitally reconstructed neuron morphologies from NeuroMorpho.org. Contains 200,000+ reconstructions across 100+ species.
Source: NeuroMorpho.org | Publication: Ascoli et al., 2007. Journal of Neuroscience
syndb etl neuromorpho download --output-dir external_datasets/NeuroMorpho
syndb etl neuromorpho validate --data-dir external_datasets/NeuroMorpho
syndb etl neuromorpho import \
--data-dir external_datasets/NeuroMorpho \
--dataset-id <syndb-dataset-id> \
--table neurons
Collections & Tags
Organize datasets into curated collections and apply tags for discovery.
Tags
Tags are free-form metadata labels attached to datasets. They surface in search results and help users discover related data.
Add Tags
Tags are assigned during dataset creation or updated afterward via the dataset metadata endpoints.
Search by Tags
curl "https://api.syndb.xyz/v1/search/fulltext?q=drosophila+mushroom+body"
The full-text search indexes dataset tags alongside titles and descriptions. See Search.
Collections
Collections are curated groupings of datasets — for example, “All Drosophila connectomes” or “Lab X publication datasets.”
Create a Collection
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/neurodata/collections \
-d '{
"name": "Drosophila Connectomes",
"notes": "All Drosophila melanogaster connectome datasets",
"dataset_ids": [
"11111111-1111-1111-1111-111111111111",
"22222222-2222-2222-2222-222222222222"
]
}'
List Collections
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/neurodata/collections
Get a Collection
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/neurodata/collections/{collection_id}
Collection membership is currently defined when the collection is created; there is no standalone POST /v1/neurodata/collections/{collection_id}/datasets route in the current API.
Collections are useful for meta-analysis: take the dataset_ids returned by the collection endpoints and pass them to the meta-analysis endpoint.
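The same create call can be prepared from Python, as in the sketch below. The request body mirrors the curl example above. The helper names are hypothetical, and dataset_ids_of assumes that collection responses expose a dataset_ids list, which follows from the sentence above but is not a documented response schema.

```python
import json
import urllib.request

API_BASE = "https://api.syndb.xyz"


def build_create_collection_request(
    token: str, name: str, notes: str, dataset_ids: list[str]
) -> urllib.request.Request:
    """Build (but do not send) POST /v1/neurodata/collections."""
    body = json.dumps({"name": name, "notes": notes, "dataset_ids": dataset_ids})
    return urllib.request.Request(
        f"{API_BASE}/v1/neurodata/collections",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def dataset_ids_of(collection: dict) -> list[str]:
    """Extract dataset IDs from a collection response (assumed field name)."""
    return list(collection.get("dataset_ids", []))


# Usage (needs network access and a valid token):
#   with urllib.request.urlopen(build_create_collection_request(...)) as resp:
#       ids = dataset_ids_of(json.load(resp))
```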
Provenance & Citations
SynDB tracks dataset lineage, version history, and generates machine-readable citations.
Version History
Each dataset maintains a version history. View all versions:
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/versions
Provenance Chain
The provenance endpoint shows the audit trail — who created, modified, or derived from the dataset:
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/provenance
Lineage
Track derived-from relationships between datasets:
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/lineage
Citations
Generate citations in standard formats:
# BibTeX
curl "https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/citation?format=bibtex"
# RIS (for EndNote, Zotero)
curl "https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/citation?format=ris"
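If you build these URLs in a script, a small helper keeps the format parameter in check. The function below is a hypothetical convenience, not part of a SynDB client; it accepts only the two formats named above.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.syndb.xyz"
CITATION_FORMATS = ("bibtex", "ris")


def citation_url(dataset_id: str, fmt: str = "bibtex") -> str:
    """Return the citation URL for a dataset in the requested format."""
    if fmt not in CITATION_FORMATS:
        raise ValueError(f"unsupported citation format: {fmt!r}")
    path = f"/v1/neurodata/datasets/{quote(dataset_id, safe='')}/citation"
    return f"{API_BASE}{path}?{urlencode({'format': fmt})}"


url = citation_url("11111111-1111-1111-1111-111111111111", "ris")
```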
JSON-LD
Export dataset metadata as linked data for integration with knowledge graphs and semantic web tools:
curl "https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/metadata.jsonld"
Returns a JSON-LD document following Schema.org and neuroscience ontology standards. See Data Standards for details on metadata formats.
Access Requests
For restricted datasets, request access from the dataset owner:
# Request access
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/access/request \
-d '{"purpose": "Reanalysis for comparative morphology study"}'
# Check access status
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/neurodata/datasets/{dataset_id}/access
The dataset creator receives the request and can approve or deny it.
SyQL Query Language
SyQL (SynDB Query Language) is a declarative query language for neuroanatomical data. It resolves dataset metadata into optimized ClickHouse SQL, handles access control, and submits queries to the async job system.
Requires Academic verification.
Quick Start
SyQL follows familiar SQL syntax. The simplest query:
FROM neurons LIMIT 10
A more typical query:
SELECT neuron_id, cable_length, cell_type
FROM neurons
WHERE species = 'rat' AND cable_length > 100
ORDER BY cable_length DESC
LIMIT 1000
Query Structure
A full SyQL query can include:
[WITH cte_name AS (...), ...]
SELECT [DISTINCT] columns
FROM table [AS alias]
[JOIN table [AS alias] ON conditions]
[WHERE predicates]
[GROUP BY columns]
[HAVING predicates]
[ORDER BY columns [ASC|DESC]]
[LIMIT n [OFFSET m]]
[SCOPE local|remote|federation]
[DOWNLOAD arrow|parquet|csv]
SELECT is optional — FROM neurons LIMIT 10 is equivalent to SELECT * FROM neurons LIMIT 10.
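The equivalence can be pictured as a simple rewrite. The sketch below is an illustration of the rule only and says nothing about how the SyQL parser works; it handles queries that begin with FROM and leaves everything else, including queries that start with WITH, unchanged.

```python
import re


def with_explicit_select(query: str) -> str:
    """Prepend SELECT * when a query starts directly with FROM."""
    stripped = query.lstrip()
    if re.match(r"(?i)from\b", stripped):
        return "SELECT * " + stripped
    return query


short_form = with_explicit_select("FROM neurons LIMIT 10")
unchanged = with_explicit_select("SELECT neuron_id FROM neurons")
```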
Tables
Base Compartment Tables
These are the primary data tables:
| Table | Primary Key | Description |
|---|---|---|
| neurons | neuron_id | Neuron morphology and metadata |
| neuron_relations | relation_id | Directed non-synaptic neuron-to-neuron relations |
| synapses | synapse_id | Synaptic connections between neurons |
| axons | axon_id | Axonal segments |
| dendrites | dendrite_id | Dendritic segments |
| dendritic_spines | spine_id | Dendritic spines |
| pre_synaptic_terminals | terminal_id | Pre-synaptic terminals |
| vesicles | vesicle_id | Synaptic vesicles |
| mitochondria | mitochondria_id | Mitochondria |
Every table includes dataset_id, created_at, and metadata columns.
Key Columns
neurons — neuron_id, name, brain_structure, polarity, cell_type, cell_class, cell_subclass, species, majority_neurotransmitter, gaba_avg, acetylcholine_avg, glutamate_avg, octopamine_avg, serotonin_avg, dopamine_avg, tyramine_avg, betaine_avg, cable_length, is_tree, n_branches, n_skeletons, n_trees, surface_area, max_axis_length, volume, voxel_volume, voxel_radius, mesh_volume, mesh_surface_area, mesh_area_volume_ratio, mesh_sphericity, centroid_x, centroid_y, centroid_z, s3_mesh_location, s3_swb_location
neuron_relations — relation_id, pre_neuron_id, post_neuron_id, relation_type, neurotransmitter, strength, relation_count
synapses — synapse_id, pre_neuron_id, post_neuron_id, synapse_type, neurotransmitter, strength, synapse_count, centroid_x, centroid_y, centroid_z
Use POST /v1/syql/plan to inspect the resolved plan before execution. Current plan and explain responses include the logical plan, SQL preview or compiled SQL, optional rewrite target, rewrite advisories, optional federation scatter and gather SQL, optional query_id, and optional typed result_schema.
Materialized Views
Materialized views store pre-aggregated data for fast analytics:
| View | Use Case |
|---|---|
| mv_dataset_summary | Row counts per compartment per dataset |
| mv_neuron_morphometrics | Morphometric averages/stddev per dataset |
| mv_neuron_stats | Neuron stats grouped by dataset + cell type |
| mv_neuron_out_degree | Outgoing connections per neuron |
| mv_neuron_in_degree | Incoming connections per neuron |
| mv_platform_neuron_stats | Global (platform-wide) neuron statistics |
| mv_synapse_connectivity | Synapse statistics per dataset |
| mv_synapse_stats | Synapse count/strength aggregates |
| mv_spatial_density | Spatial binning (bin_x, bin_y, bin_z) |
| mv_neurotransmitter_profile | Neurotransmitter concentrations |
| mv_vesicle_distribution | Vesicle count/volume distributions |
| mv_mitochondria_stats | Mitochondria volume/count statistics |
| mv_vesicle_diameter_histogram | Vesicle diameter binning |
| mv_mitochondria_volume_histogram | Mitochondria volume binning |
| mv_nt_by_region | Neurotransmitter by brain region |
MV columns that store intermediate aggregation state are automatically finalized — e.g., row_count becomes countMerge(row_count) in the compiled SQL. You don’t need to handle this yourself.
Precomputed Tables
Graph analysis results stored by the ETL pipeline:
| Table | Description |
|---|---|
| precomputed_graph_summary | Network-level statistics (density, reciprocity, clustering, motifs) |
| precomputed_degree_histogram | Degree distribution (in/out) |
| precomputed_celltype_connectivity | Cell-type-to-cell-type connectivity matrix |
| precomputed_bottleneck_neurons | Articulation point annotations |
| precomputed_clique_detail | Maximal cliques with cell-type composition |
| precomputed_dual_network | Chemical vs. electrical subnetwork metrics |
| precomputed_developmental_metrics | Per-stage developmental metrics |
Views
| View | Description |
|---|---|
| vw_celltype_connectivity | Cell-type connectivity (non-materialized view) |
Filtering (WHERE)
Data Column Filters
Standard SQL comparison operators:
WHERE cable_length > 100
WHERE cell_type = 'pyramidal'
WHERE cell_type != 'unknown'
WHERE cable_length BETWEEN 50 AND 200
WHERE cell_type IN ('pyramidal', 'interneuron', 'stellate')
WHERE name LIKE '%mushroom%'
WHERE cell_class IS NULL
WHERE cell_class IS NOT NULL
Boolean Logic
Combine predicates with AND, OR, NOT, and parentheses:
WHERE (cable_length > 100 AND volume < 500)
OR (cell_type = 'pyramidal' AND NOT cell_class IS NULL)
Metadata Filters
These special columns filter by dataset metadata — they resolve against PostgreSQL and restrict which dataset_id values are included:
| Column | Aliases | Resolves to |
|---|---|---|
| species | — | dataset.animal_species |
| brain_region | brain_structure | dataset_brain_region junction |
| license | data_license | dataset.data_license |
| microscopy | microscopy_name | dataset.microscopy_name |
| cluster_name | — | federated_cluster.name |
SELECT neuron_id, cable_length
FROM neurons
WHERE species = 'mouse' AND brain_region = 'mushroom_body'
LIMIT 1000
FROM neurons WHERE species IN ('mouse', 'rat') LIMIT 1000
Expression Filters
Arithmetic and function calls work in WHERE:
WHERE cable_length * 2 > 500
WHERE SQRT(volume) < 100
WHERE ABS(centroid_x - centroid_y) < 10
Subquery Filters
WHERE neuron_id IN (SELECT pre_neuron_id FROM synapses WHERE strength > 5)
WHERE neuron_id NOT IN (SELECT neuron_id FROM axons)
SELECT Expressions
Columns and Aliases
SELECT neuron_id, cable_length AS length, cell_type
FROM neurons
LIMIT 100
Arithmetic
SELECT neuron_id,
mesh_volume / mesh_surface_area AS volume_to_area,
cable_length * 1000 AS cable_length_nm
FROM neurons
LIMIT 100
Operators: +, -, *, /, %
CASE Expressions
SELECT neuron_id,
CASE
WHEN cable_length > 1000 THEN 'long'
WHEN cable_length > 100 THEN 'medium'
ELSE 'short'
END AS size_class
FROM neurons
LIMIT 100
Simple form:
CASE cell_type
WHEN 'pyramidal' THEN 'excitatory'
WHEN 'interneuron' THEN 'inhibitory'
ELSE 'other'
END
DISTINCT
SELECT DISTINCT cell_type, brain_structure
FROM neurons
Aggregate Functions
| Function | Description |
|---|---|
| COUNT(*) | Count rows |
| COUNT(column) | Count non-null values |
| COUNT(DISTINCT column) | Count unique values |
| SUM(column) | Sum |
| AVG(column) | Mean |
| MIN(column) | Minimum |
| MAX(column) | Maximum |
| STDDEV_POP(column) | Population standard deviation |
| VAR_POP(column) | Population variance |
| QUANTILE(p)(column) | Quantile at level p (0.0–1.0) |
| MEDIAN(column) | Median (alias for QUANTILE(0.5)) |
| CORR(col1, col2) | Pearson correlation (two columns) |
SELECT dataset_id,
COUNT(*) AS neuron_count,
AVG(cable_length) AS avg_cable,
STDDEV_POP(cable_length) AS std_cable,
MEDIAN(mesh_volume) AS median_volume
FROM neurons
GROUP BY dataset_id
GROUP BY and HAVING
Group rows and filter groups:
SELECT cell_type, COUNT(*) AS n, AVG(cable_length) AS avg_cable
FROM neurons
WHERE dataset_id = '...'
GROUP BY cell_type
HAVING COUNT(*) > 10
ORDER BY avg_cable DESC
HAVING supports arithmetic, function calls, and subqueries:
HAVING STDDEV_POP(cable_length) < 50
HAVING COUNT(*) > (SELECT COUNT(*) / 100 FROM neurons WHERE dataset_id = '...')
Scalar Functions
Numeric
| Function | Description |
|---|---|
| ABS(x) | Absolute value |
| ROUND(x [, precision]) | Round |
| FLOOR(x) | Round down |
| CEIL(x) / CEILING(x) | Round up |
| SQRT(x) | Square root |
| POWER(x, y) / POW(x, y) | Exponentiation |
| GREATEST(a, b, ...) | Maximum of arguments |
| LEAST(a, b, ...) | Minimum of arguments |
Conditional
| Function | Description |
|---|---|
| IF(cond, then, else) | Ternary conditional |
| IFNULL(value, default) | Null coalescing |
| COALESCE(a, b, ...) | First non-null value |
| NULLIF(a, b) | Returns null if a = b |
String
| Function | Description |
|---|---|
| LENGTH(s) | String length |
| LOWER(s) | Lowercase |
| UPPER(s) | Uppercase |
| TRIM(s) | Strip whitespace |
| CONCAT(a, b, ...) | Concatenate strings |
| SUBSTRING(s, pos [, len]) / SUBSTR(...) | Extract substring |
Type Casting
| Function | Description |
|---|---|
| TOFLOAT64(x) | Cast to Float64 |
| TOINT32(x) | Cast to Int32 |
| TOINT64(x) | Cast to Int64 |
| TOUINT64(x) | Cast to UInt64 |
| TOSTRING(x) | Cast to String |
Window Functions
Window functions compute values across a set of rows related to the current row.
Syntax
function(...) OVER (
[PARTITION BY expr, ...]
[ORDER BY expr [ASC|DESC], ...]
[ROWS|RANGE BETWEEN start AND end]
)
Ranking Functions
SELECT neuron_id,
cable_length,
RANK() OVER (PARTITION BY dataset_id ORDER BY cable_length DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY dataset_id ORDER BY cable_length DESC) AS dense_rank,
ROW_NUMBER() OVER (PARTITION BY dataset_id ORDER BY cable_length DESC) AS row_num
FROM neurons
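The three ranking functions differ only in how they treat ties. The following sketch reproduces their standard SQL semantics in plain Python for a single partition ordered by value descending; `rank_rows` is an illustrative name and the input values are made up.

```python
def rank_rows(values):
    """Return (value, rank, dense_rank, row_number) tuples, ordered by value descending."""
    ordered = sorted(values, reverse=True)
    rows = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number  # RANK jumps to the row position after a tie
            dense += 1         # DENSE_RANK increases by exactly one
            previous = value
        rows.append((value, rank, dense, row_number))
    return rows


for row in rank_rows([300.0, 200.0, 200.0, 100.0]):
    print(row)
```

With two neurons tied at 200.0, RANK yields 1, 2, 2, 4, DENSE_RANK yields 1, 2, 2, 3, and ROW_NUMBER yields 1, 2, 3, 4.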
Offset Functions
SELECT neuron_id,
cable_length,
LAG(cable_length) OVER (ORDER BY neuron_id) AS prev_cable,
LEAD(cable_length, 2) OVER (ORDER BY neuron_id) AS next_2_cable,
FIRST_VALUE(cable_length) OVER (PARTITION BY dataset_id ORDER BY neuron_id) AS first_cable,
LAST_VALUE(cable_length) OVER (PARTITION BY dataset_id ORDER BY neuron_id) AS last_cable
FROM neurons
Window Frames
SUM(cable_length) OVER (
ORDER BY neuron_id
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
Frame bounds: UNBOUNDED PRECEDING, N PRECEDING, CURRENT ROW, N FOLLOWING, UNBOUNDED FOLLOWING.
Frame units: ROWS or RANGE.
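To make the ROWS frame concrete, this sketch computes the same sums in Python over rows that are already in ORDER BY order. `window_sum` is an illustrative helper, not a SynDB function: `preceding=None` stands for UNBOUNDED PRECEDING, and an integer stands for N PRECEDING.

```python
def window_sum(values, preceding=None):
    """SUM over ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW."""
    out = []
    for i in range(len(values)):
        # The frame starts at the first row, or N rows before the current one.
        start = 0 if preceding is None else max(0, i - preceding)
        out.append(sum(values[start:i + 1]))
    return out


cable_lengths = [120.0, 80.0, 200.0]  # already ordered by neuron_id
print(window_sum(cable_lengths))               # running total
print(window_sum(cable_lengths, preceding=1))  # two-row moving sum
```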
Aggregates as Window Functions
Any aggregate function can be used with OVER:
SELECT neuron_id,
cable_length,
AVG(cable_length) OVER (PARTITION BY dataset_id) AS dataset_avg
FROM neurons
JOINs
Supported Join Types
- INNER JOIN (or just JOIN)
- LEFT JOIN
- RIGHT JOIN
- FULL OUTER JOIN
- CROSS JOIN
ON Conditions
ON clauses support equality conditions chained with AND:
SELECT n.neuron_id, n.cable_length, s.strength
FROM neurons AS n
INNER JOIN synapses AS s
ON n.dataset_id = s.dataset_id AND n.neuron_id = s.pre_neuron_id
WHERE n.dataset_id = '...'
LIMIT 1000
Self-Joins
Useful for analyzing reciprocal connections:
SELECT
count() / 2 AS reciprocal_pairs,
toFloat64(count()) / 2.0
/ greatest(toFloat64((SELECT count() FROM synapses WHERE dataset_id = '...')), 1.0)
AS reciprocity
FROM synapses AS s1
INNER JOIN synapses AS s2
ON s1.dataset_id = s2.dataset_id
AND s1.pre_neuron_id = s2.post_neuron_id
AND s1.post_neuron_id = s2.pre_neuron_id
WHERE s1.dataset_id = '...'
AND s1.pre_neuron_id < s1.post_neuron_id
Joining Materialized Views
SELECT o.neuron_id, out_degree, in_degree
FROM mv_neuron_out_degree AS o
FULL OUTER JOIN mv_neuron_in_degree AS i
ON o.dataset_id = i.dataset_id AND o.neuron_id = i.neuron_id
WHERE o.dataset_id = '...' OR i.dataset_id = '...'
GROUP BY o.neuron_id, i.neuron_id
ORDER BY (out_degree + in_degree) DESC
LIMIT 100
CROSS JOIN with Subqueries
Useful for z-score comparisons against global statistics:
SELECT
ds.dataset_id,
(AVG(ds.cable_length) - global.global_avg)
/ greatest(global.global_std, 0.0000000001) AS zscore
FROM neurons AS ds
CROSS JOIN (
SELECT AVG(cable_length) AS global_avg, STDDEV_POP(cable_length) AS global_std
FROM neurons
) AS global
WHERE ds.dataset_id IN ('...')
GROUP BY ds.dataset_id
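For intuition, the same z-score can be computed in memory. This is a sketch on made-up values, assuming the global mean and population standard deviation are taken over all neurons while only the selected datasets are scored; `dataset_zscores` is an illustrative name.

```python
from statistics import fmean, pstdev


def dataset_zscores(cable_lengths_by_dataset, selected):
    """Z-score of each selected dataset's mean against the global mean and stddev."""
    all_values = [v for values in cable_lengths_by_dataset.values() for v in values]
    global_avg = fmean(all_values)
    # Same guard as greatest(global_std, 0.0000000001) in the query above.
    global_std = max(pstdev(all_values), 1e-10)
    return {
        dataset_id: (fmean(cable_lengths_by_dataset[dataset_id]) - global_avg) / global_std
        for dataset_id in selected
    }


data = {"ds-a": [1.0, 2.0, 3.0], "ds-b": [4.0, 5.0, 6.0]}
print(dataset_zscores(data, ["ds-a", "ds-b"]))
```

With two equally sized datasets, the scores are symmetric around zero, which is a quick sanity check for the query's output.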
CTEs (WITH Clauses)
Common Table Expressions let you name intermediate result sets:
WITH all_neurons AS (
SELECT pre_neuron_id AS neuron_id FROM synapses WHERE dataset_id = '...'
UNION ALL
SELECT post_neuron_id FROM synapses WHERE dataset_id = '...'
)
SELECT
neuron_count,
edge_count,
toFloat64(edge_count) / greatest(toFloat64(neuron_count) * (toFloat64(neuron_count) - 1), 1) AS density
FROM (
SELECT
(SELECT COUNT(DISTINCT neuron_id) FROM all_neurons) AS neuron_count,
count() AS edge_count
FROM synapses
WHERE dataset_id = '...'
) AS t
Later CTEs can reference earlier ones. WITH RECURSIVE is not supported.
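The density query above reduces to a short calculation. The sketch below mirrors it on an in-memory edge list: the neuron count is the number of distinct IDs appearing as either pre- or post-synaptic partner, the edge count is the number of rows, and the denominator is floored at 1 as in the SQL. `graph_density` is an illustrative name.

```python
def graph_density(edges):
    """Directed density: edge_count / (n * (n - 1)), with the denominator floored at 1."""
    neurons = {pre for pre, _ in edges} | {post for _, post in edges}
    neuron_count = len(neurons)
    edge_count = len(edges)  # counts rows, like count() in the query
    denominator = max(neuron_count * (neuron_count - 1), 1)
    return neuron_count, edge_count, edge_count / denominator


print(graph_density([("a", "b"), ("b", "c"), ("c", "a")]))
```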
UNION ALL
Combine multiple queries:
SELECT 'pre' AS direction, pre_neuron_id AS neuron_id FROM synapses WHERE dataset_id = '...'
UNION ALL
SELECT 'post' AS direction, post_neuron_id AS neuron_id FROM synapses WHERE dataset_id = '...'
ORDER BY neuron_id
LIMIT 1000
Only UNION ALL is supported. UNION (without ALL), EXCEPT, and INTERSECT are not available. Outer ORDER BY, LIMIT, and OFFSET apply across the combined result.
Ordering and Pagination
ORDER BY cable_length DESC
LIMIT 100
OFFSET 200
ORDER BY supports expressions:
ORDER BY cable_length * 2 DESC
ORDER BY SQRT(volume) ASC
Parameter Binding
Parameterized queries prevent injection and allow reuse.
Positional Parameters
Use ? placeholders — they are assigned 1-based indices left to right:
{
"query": "SELECT neuron_id, cable_length FROM neurons WHERE cable_length > ? AND volume < ? LIMIT ?",
"params": [100, 999.5, 1000]
}
Named Parameters
Use :name placeholders:
{
"query": "SELECT neuron_id FROM neurons WHERE cell_type = :ct AND species = :species LIMIT 1000",
"named_params": {"ct": "pyramidal", "species": "rat"}
}
Parameters can be used in WHERE, SELECT expressions, HAVING, ORDER BY, and function arguments.
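A client can catch binding mistakes before sending a request. The helpers below are a hypothetical sketch (`positional_payload` and `named_payload` are not SynDB functions): they build the request bodies shown above and check that every placeholder has a value. The placeholder scan is deliberately naive and would miscount a `?` or `:name` that appears inside a string literal.

```python
import json
import re


def positional_payload(query, params):
    """Build a request body for ? placeholders, checking the count matches."""
    expected = query.count("?")  # naive: also counts ? inside string literals
    if expected != len(params):
        raise ValueError(f"query has {expected} placeholders but {len(params)} params were given")
    return {"query": query, "params": list(params)}


def named_payload(query, named_params):
    """Build a request body for :name placeholders, checking every name is bound."""
    names = set(re.findall(r"(?<!:):([A-Za-z_]\w*)", query))
    missing = names - set(named_params)
    if missing:
        raise ValueError(f"unbound parameters: {sorted(missing)}")
    return {"query": query, "named_params": dict(named_params)}


body = named_payload(
    "SELECT neuron_id FROM neurons WHERE cell_type = :ct AND species = :species LIMIT 1000",
    {"ct": "pyramidal", "species": "rat"},
)
print(json.dumps(body))
```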
Output Format (DOWNLOAD)
Control the result format:
SELECT neuron_id, cable_length FROM neurons LIMIT 1000 DOWNLOAD csv
| Format | Description |
|---|---|
| arrow | Apache Arrow (default) |
| parquet | Apache Parquet |
| csv | CSV |
Federation Scope (SCOPE)
Control where the query executes:
SELECT COUNT(*) FROM neurons GROUP BY dataset_id SCOPE federation
| Scope | Description |
|---|---|
| local | Execute on the local database (default) |
| remote | Execute on a remote federated cluster |
| federation | Scatter/gather across all federated nodes |
When using federation scope, the hub automatically decomposes the query into scatter SQL (sent to each node) and gather SQL (merged locally). See Cross-Cluster Queries.
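The scatter/gather idea can be illustrated without any database. In the sketch below, each node returns partial aggregates and the gather step merges them. This illustrates the general technique only; how the hub actually decomposes a given query is not specified here, and the function names are hypothetical.

```python
from collections import defaultdict


def gather_counts(partials):
    """Merge per-node COUNT(*) results keyed by dataset_id by summing them."""
    merged = defaultdict(int)
    for node_result in partials:
        for dataset_id, count in node_result.items():
            merged[dataset_id] += count
    return dict(merged)


def gather_avg(partials):
    """Merge per-node (sum, count) pairs into one average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


print(gather_counts([{"ds-1": 10, "ds-2": 5}, {"ds-2": 7}]))
print(gather_avg([(300.0, 3), (100.0, 1)]))
```

Averages are merged from sums and counts because averaging the per-node averages would weight a node with one row the same as a node with a million.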
Automatic MV Rewriting
SyQL automatically rewrites queries to use materialized views when possible. For example:
SELECT AVG(cable_length) FROM neurons GROUP BY dataset_id
This is automatically rewritten to query mv_neuron_morphometrics instead of scanning the full neurons table — significantly faster for large datasets.
Use the explain endpoint to see whether your query was rewritten and which MV was selected. The response includes advisories explaining why alternative MVs were rejected.
API Workflow
SyQL has a three-stage pipeline:
| Stage | Endpoint | What it does |
|---|---|---|
| Plan | POST /v1/syql/plan | Parse → validate → resolve metadata → return logical plan |
| Explain | POST /v1/syql/explain | Plan + compile to SQL → return compiled query and advisories |
| Execute | POST /v1/syql/exec | Plan + compile + submit to job queue → return job ID |
Use plan to validate syntax and inspect the resolved schema. Use explain to preview the generated SQL before committing to execution. Use exec when you’re ready to run.
Plan
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/syql/plan \
-d '{"query": "SELECT neuron_id, cable_length FROM neurons WHERE species = '\''mouse'\'' LIMIT 100"}'
Returns the parsed logical plan: resolved tables, columns, filters, metadata, and result schema.
Explain
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/syql/explain \
-d '{"query": "SELECT AVG(cable_length) FROM neurons GROUP BY dataset_id"}'
Returns:
- The compiled ClickHouse SQL (with any MV rewrites applied)
- Query advisories and any selected MV rewrite target
- Optional federation scatter_sql and gather_sql
- Optional query_id and typed result_schema
Execute
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/syql/exec \
-d '{"query": "SELECT neuron_id, cable_length FROM neurons WHERE species = '\''mouse'\'' ORDER BY cable_length DESC LIMIT 1000"}'
Returns a job_id. Track and download results via the Jobs System.
Cancel
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/syql/cancel \
-d '{"query_id": "..."}'
The query_id may be returned by plan, explain, or execute when the query can be cancelled through ClickHouse.
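The same requests can be prepared from Python with only the standard library. This sketch builds a request for the plan, explain, or exec stage without sending it; `syql_request` is an illustrative helper, and `TOKEN` is a placeholder for a real bearer token.

```python
import json
import urllib.request

BASE_URL = "https://api.syndb.xyz/v1/syql"


def syql_request(stage, query, token):
    """Build (but do not send) a POST request for the plan, explain, or exec stage."""
    if stage not in {"plan", "explain", "exec"}:
        raise ValueError(f"unknown stage: {stage}")
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{stage}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = syql_request(
    "explain",
    "SELECT AVG(cable_length) FROM neurons GROUP BY dataset_id",
    "TOKEN",
)
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req), then json.load() the response.
```

Serializing the body with `json.dumps` also avoids the shell quoting needed for single quotes inside the curl examples.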
Examples
Neuron morphometrics per dataset
SELECT dataset_id,
COUNT(*) AS n,
AVG(cable_length) AS avg_cable,
STDDEV_POP(cable_length) AS std_cable,
AVG(mesh_volume) AS avg_volume,
AVG(mesh_sphericity) AS avg_sphericity
FROM neurons
GROUP BY dataset_id
ORDER BY n DESC
Top connected neurons
SELECT o.neuron_id, out_degree, in_degree,
(out_degree + in_degree) AS total_degree
FROM mv_neuron_out_degree AS o
FULL OUTER JOIN mv_neuron_in_degree AS i
ON o.dataset_id = i.dataset_id AND o.neuron_id = i.neuron_id
WHERE o.dataset_id = '...'
GROUP BY o.neuron_id, i.neuron_id
ORDER BY total_degree DESC
LIMIT 50
Z-score comparison across datasets
SELECT
toString(ds.dataset_id) AS dataset_id,
'cable_length' AS metric,
(ds.dataset_avg - global.global_avg)
/ greatest(global.global_std, 0.0000000001) AS zscore
FROM (
SELECT dataset_id, AVG(cable_length) AS dataset_avg
FROM neurons
WHERE dataset_id IN ('uuid1', 'uuid2')
GROUP BY dataset_id
) AS ds
CROSS JOIN (
SELECT AVG(cable_length) AS global_avg, STDDEV_POP(cable_length) AS global_std
FROM neurons
) AS global
Network reciprocity
SELECT
count() / 2 AS reciprocal_pairs,
toFloat64(count()) / 2.0
/ greatest(toFloat64((SELECT count() FROM synapses WHERE dataset_id = '...')), 1.0)
AS reciprocity
FROM synapses AS s1
INNER JOIN synapses AS s2
ON s1.dataset_id = s2.dataset_id
AND s1.pre_neuron_id = s2.post_neuron_id
AND s1.post_neuron_id = s2.pre_neuron_id
WHERE s1.dataset_id = '...'
AND s1.pre_neuron_id < s1.post_neuron_id
Graph density with CTEs
WITH all_neurons AS (
SELECT pre_neuron_id AS neuron_id FROM synapses WHERE dataset_id = '...'
UNION ALL
SELECT post_neuron_id FROM synapses WHERE dataset_id = '...'
)
SELECT
neuron_count,
edge_count,
avg_strength,
toFloat64(edge_count)
/ greatest(toFloat64(neuron_count) * (toFloat64(neuron_count) - 1), 1)
AS density
FROM (
SELECT
(SELECT COUNT(DISTINCT neuron_id) FROM all_neurons) AS neuron_count,
count() AS edge_count,
avg(strength) AS avg_strength
FROM synapses
WHERE dataset_id = '...'
) AS t
Ranking neurons within a dataset
SELECT neuron_id, cable_length,
RANK() OVER (ORDER BY cable_length DESC) AS rank
FROM neurons
WHERE dataset_id = '...'
LIMIT 100
Parameterized query
{
"query": "SELECT neuron_id, cable_length, cell_type FROM neurons WHERE cell_type = :ct AND cable_length > :min_cable ORDER BY cable_length DESC LIMIT :n",
"named_params": {"ct": "pyramidal", "min_cable": 500, "n": 100}
}
Unsupported Features
These SQL features are intentionally not supported:
- INSERT, UPDATE, DELETE, CREATE, DROP: SyQL is read-only
- WITH RECURSIVE: recursive CTEs
- UNION (without ALL), EXCEPT, INTERSECT: only UNION ALL is available
- GROUPS window frame unit: only ROWS and RANGE
- Arbitrary ClickHouse functions: only whitelisted functions are allowed
Saved Queries
Frequently used SyQL queries can be saved for reuse. See Saved Queries.
Saved Queries
Save SyQL queries for reuse, sharing, and scheduled re-execution.
Requires Academic verification.
Save a Query
For most workflows, save directly from SyQL. The structured POST /v1/queries route is mainly for already-resolved table and dataset selections.
From SyQL
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/queries/from-syql \
-d '{
"label": "Mushroom body neuron volumes",
"query": "SELECT mesh_volume FROM neurons WHERE brain_region = '\''mushroom_body'\''",
"description": "All neuron mesh volumes in the mushroom body"
}'
Direct Save
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/queries \
-d '{
"label": "Neuron cable lengths",
"description": "Neuron morphology subset for one dataset",
"syndb_table": 1,
"dataset_ids": ["11111111-1111-1111-1111-111111111111"],
"columns": ["neuron_id", "cable_length"],
"query_scope": "local"
}'
syndb_table is the current numeric SyndbTable discriminant. Use the SyQL-backed save route when you want server-side resolution from a query string instead of a pre-selected table and dataset list.
List Saved Queries
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/queries
Get a Query
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/queries/{query_id}
Update
curl -X PUT -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/queries/{query_id} \
-d '{"label": "Updated label", "description": "Updated description"}'
Delete
curl -X DELETE -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/queries/{query_id}
Run a Saved Query
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/queries/{query_id}/run
Submits the query to the job system and returns a job ID.
Refresh Run Statuses
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/queries/{query_id}/refresh
This polls all non-terminal runs attached to the saved query and returns the refreshed saved-query record.
CLI
The CLI saved-query commands are server-backed and operate on the same saved query store as the web UI and API.
syndb query list
syndb query save-syql --label "My query" "SELECT neuron_id FROM neurons LIMIT 100"
syndb query save --label "Neuron subset" --table 1 --dataset-id {dataset_id} --column neuron_id
syndb query show --id {query_id}
syndb query run --id {query_id}
syndb query status --id {query_id}
syndb query update --id {query_id} --label "New label"
syndb query delete --id {query_id}
Analytics
Pre-computed analytics endpoints for dataset exploration. These query ClickHouse materialized views and return results quickly (cached for 5 minutes).
Requires Academic verification.
Dataset Summary
Row counts per compartment type:
curl -H "Authorization: Bearer $TOKEN" \
"https://api.syndb.xyz/v1/analytics/summary?dataset_ids=uuid1,uuid2"
Returns per-dataset table counts plus total_rows.
Neuron Morphometrics
Morphological statistics for neurons in a dataset:
curl -H "Authorization: Bearer $TOKEN" \
"https://api.syndb.xyz/v1/analytics/morphometrics?dataset_ids=uuid1,uuid2"
Returns per-dataset means and standard deviations for metrics such as cable length, surface area, volume, mesh volume, mesh sphericity, and branch counts.
Z-Score Comparison
Standardized comparison of a metric across multiple datasets:
curl -H "Authorization: Bearer $TOKEN" \
"https://api.syndb.xyz/v1/analytics/comparison?dataset_ids=uuid1,uuid2,uuid3&metric=mesh_volume"
Omit metric to return the current six-metric neuron morphometrics comparison; include metric to request a single z-score series.
Graph Summary
Network-level statistics for connectome datasets:
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/analytics/graph/{dataset_id}/summary
Returns neuron_count, edge_count, density, and avg_strength.
Reciprocity
Fraction of bidirectional synaptic connections:
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/analytics/graph/{dataset_id}/reciprocity
Degree Distribution
Top neurons by connectivity:
curl -H "Authorization: Bearer $TOKEN" \
"https://api.syndb.xyz/v1/analytics/graph/{dataset_id}/degree-distribution?top_k=50"
Returns the top top_k neurons with in-degree, out-degree, and average inbound and outbound strength.
Graph Analysis
In-memory graph analysis on connectome datasets. SynDB constructs a directed graph from synapse data in ClickHouse (up to 10M edges) and runs network algorithms using petgraph.
Requires Academic verification.
Graph Metrics
Basic network statistics:
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/graph/{dataset_id}/metrics
Returns current graph metrics including node count, edge count, density, reciprocity, average in and out degree, maximum in and out degree, and strongly connected component counts.
Motif Analysis (Triadic Census)
Count all 16 three-node subgraph patterns (triadic census):
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/graph/{dataset_id}/motifs \
-d '{}'
Compare by Synapse Type
Compare motif distributions across synapse types within the same dataset:
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/graph/{dataset_id}/motifs/compare-synapse-types \
-d '{"sample_size": 500}'
Shortest Path
Find the shortest path between two neurons:
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/graph/{dataset_id}/shortest-path \
-d '{
"source_neuron_id": "11111111-1111-1111-1111-111111111111",
"target_neuron_id": "22222222-2222-2222-2222-222222222222",
"weight_mode": "hops"
}'
Uses Dijkstra’s algorithm. The weight_mode field selects how edge weights are computed; the request above uses hops.
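As a reference for what the endpoint computes, here is a minimal Dijkstra sketch over an in-memory edge list. It is not SynDB's implementation (which runs on petgraph). It assumes that the hops mode gives every edge a cost of 1, and the `inverse_strength` mode is a hypothetical example of a weighted mode, included only to show how the path can change.

```python
import heapq


def edge_cost(strength, weight_mode):
    """Cost of one edge; only "hops" appears in the request above."""
    if weight_mode == "hops":
        return 1.0
    if weight_mode == "inverse_strength":  # hypothetical mode, for illustration only
        return 1.0 / max(strength, 1e-9)
    raise ValueError(f"unknown weight_mode: {weight_mode}")


def shortest_path(edges, source, target, weight_mode="hops"):
    """Dijkstra over directed (pre, post, strength) edges; returns (cost, path) or None."""
    graph = {}
    for pre, post, strength in edges:
        graph.setdefault(pre, []).append((post, edge_cost(strength, weight_mode)))

    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            path = [node]
            while path[-1] != source:
                path.append(previous[path[-1]])
            return dist, path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, cost in graph.get(node, []):
            candidate = dist + cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    return None  # target is not reachable from source


edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 0.1)]
print(shortest_path(edges, "a", "c"))
print(shortest_path(edges, "a", "c", weight_mode="inverse_strength"))
```

Counting hops picks the direct a → c edge, while the strength-based cost prefers the two stronger edges through b.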
Reachability
Find all neurons reachable within N hops:
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/graph/{dataset_id}/reachability \
-d '{"source_neuron_id": "11111111-1111-1111-1111-111111111111", "max_hops": 3}'
Uses breadth-first search (BFS). max_hops can be at most 100.
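A hop-limited breadth-first search can be sketched in a few lines. This is an in-memory illustration rather than the server's code; `reachable_within` is a hypothetical name, and excluding the source neuron from the result is an assumption.

```python
from collections import deque


def reachable_within(edges, source, max_hops):
    """Return {neuron: hop_distance} for neurons reachable from source in <= max_hops."""
    if not 0 <= max_hops <= 100:  # the documented cap
        raise ValueError("max_hops must be between 0 and 100")
    graph = {}
    for pre, post in edges:
        graph.setdefault(pre, []).append(post)

    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[source]  # assumption: the source itself is not reported
    return distance


chain = [("a", "b"), ("b", "c"), ("c", "d")]
print(reachable_within(chain, "a", 2))
```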
Reachability Curve
Sample how reachability grows with hop count:
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/graph/{dataset_id}/reachability-curve \
-d '{"sample_size": 100, "max_hops": 20, "seed": 42}'
Returns the mean and standard deviation of the reachable fraction at each hop distance. Current limits are max 500 samples and max 20 hops.
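The curve can be understood as repeated hop-limited traversals from sampled source neurons. The sketch below is one plausible reading, not the server's implementation. It assumes the reachable fraction is measured over all other neurons in the graph and that the seed fixes which sources are sampled; `reachability_curve` is an illustrative name.

```python
import random
from statistics import fmean, pstdev


def reachability_curve(edges, sample_size, max_hops, seed):
    """Mean and stddev of the reachable fraction at each hop, over sampled sources."""
    graph, nodes = {}, set()
    for pre, post in edges:
        graph.setdefault(pre, set()).add(post)
        nodes.update((pre, post))
    ordered = sorted(nodes)
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sources = rng.sample(ordered, min(sample_size, len(ordered)))
    if not sources:
        return []

    fractions = [[] for _ in range(max_hops)]
    for source in sources:
        seen, frontier = {source}, {source}
        for hop in range(max_hops):
            # Expand one hop and keep only neurons not seen before.
            frontier = {n for node in frontier for n in graph.get(node, ())} - seen
            seen |= frontier
            # Assumed definition: share of the other neurons reached so far.
            fractions[hop].append((len(seen) - 1) / max(len(ordered) - 1, 1))
    return [(fmean(f), pstdev(f)) for f in fractions]


cycle = [("a", "b"), ("b", "c"), ("c", "a")]
print(reachability_curve(cycle, sample_size=3, max_hops=2, seed=42))
```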
Full Analysis
Run metrics + motifs + hub neuron detection in one call:
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/graph/{dataset_id}/full-analysis \
-d '{"max_edges": 5000000, "top_hubs": 20}'
Cross-Dataset Comparison
Compare graph properties across multiple datasets:
curl -H "Authorization: Bearer $TOKEN" \
"https://api.syndb.xyz/v1/graph/compare?dataset_ids=uuid-1,uuid-2,uuid-3"
Graph Precompute (CLI)
For large datasets, precompute graph metrics and store results in ClickHouse materialized tables:
syndb graph-precompute --dataset flywire
--dataset accepts current dataset keys such as flywire, manc, or h01. This is a batch operation typically run as part of the ETL pipeline or as a Kubernetes job.
Meta-Analysis
Cross-dataset meta-analysis computes effect sizes and heterogeneity statistics across multiple datasets, enabling comparisons that no single dataset can answer.
Requires Academic verification.
Cross-Dataset Analysis
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/meta-analysis \
-d '{
"table": "neurons",
"metric": "mesh_volume",
"grouping": "brain_structure",
"dataset_ids": "uuid-1,uuid-2,uuid-3"
}'
Parameters
| Field | Required | Description |
|---|---|---|
| table | Yes | Target table: neurons, synapses, dendrites, axons, pre_synaptic_terminals, dendritic_spines, vesicles, mitochondria |
| metric | Yes | Column to analyze (e.g., mesh_volume, mesh_surface_area, connection_score) |
| grouping | Yes | Grouping dimension (e.g., species, brain_structure, cell_type, dataset) |
| dataset_ids | Yes | Comma-separated dataset UUIDs |
| scope | No | "local" (default) or "federation" |
| cluster_ids | No | Comma-separated federation cluster UUIDs; required when scope is federation |
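The rules in the table can be checked on the client before submitting. This is a hypothetical helper (`meta_analysis_body` is not part of SynDB) that builds the request body, joins the ID lists into the comma-separated form, and enforces that cluster_ids is present for the federation scope.

```python
ALLOWED_TABLES = {
    "neurons", "synapses", "dendrites", "axons",
    "pre_synaptic_terminals", "dendritic_spines", "vesicles", "mitochondria",
}


def meta_analysis_body(table, metric, grouping, dataset_ids, scope="local", cluster_ids=()):
    """Build a meta-analysis request body, enforcing the documented field rules."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unsupported table: {table}")
    if not metric or not grouping:
        raise ValueError("metric and grouping are required")
    if not dataset_ids:
        raise ValueError("at least one dataset ID is required")
    if scope not in {"local", "federation"}:
        raise ValueError(f"unknown scope: {scope}")

    body = {
        "table": table,
        "metric": metric,
        "grouping": grouping,
        "dataset_ids": ",".join(dataset_ids),  # the API takes a comma-separated string
    }
    if scope == "federation":
        if not cluster_ids:
            raise ValueError("cluster_ids is required when scope is federation")
        body["scope"] = scope
        body["cluster_ids"] = ",".join(cluster_ids)
    return body


print(meta_analysis_body("neurons", "mesh_volume", "brain_structure", ["uuid-1", "uuid-2"]))
```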
Atlas Comparison
Compare dataset metrics against reference atlases (pre-aggregated materialized views):
curl -H "Authorization: Bearer $TOKEN" \
"https://api.syndb.xyz/v1/meta-analysis/atlas/compare?dataset_ids=uuid-1,uuid-2&grouping=species&metric=mesh_volume"
Federation Scope
To run meta-analysis across federated nodes:
{
"table": "synapses",
"metric": "connection_score",
"grouping": "dataset",
"scope": "federation",
"dataset_ids": "uuid-1,uuid-2",
"cluster_ids": "cluster-uuid-1,cluster-uuid-2"
}
The hub fans the aggregation out to each specified cluster and merges the results. See Cross-Cluster Queries.
Jobs System
Long-running queries execute asynchronously through the job system. Submit a job, check its status, and download results when ready.
Requires Academic verification.
Workflow
Submit job → Job queued → Job running → Job completed → Download result
                                      → Job failed (check error, rerun)
Submit a Query Job
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/jobs \
-d '{
"syndb_table": 1,
"dataset_ids": ["11111111-1111-1111-1111-111111111111"],
"columns": ["neuron_id", "cable_length"],
"query_scope": "local",
"row_limit": 1000
}'
Returns a job_id for tracking.
For most ad hoc querying, use SyQL execution or syndb query exec; they compile and submit this structured job request for you.
Submit a Graph Job
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/jobs/graph \
-d '{
"dataset_id": "11111111-1111-1111-1111-111111111111",
"max_edges": 5000000,
"motif_sample_size": 1000,
"top_hubs": 20
}'
Check Status
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/jobs/{job_id}
| Status | Meaning |
|---|---|
| pending | Queued, waiting for a worker |
| running | Currently executing |
| completed | Results available for download |
| failed | Execution error (check error) |
| cancelled | Cancelled by user |
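Clients usually poll until a job reaches one of the terminal statuses above. The sketch below shows one way to structure that loop. `wait_for_job` is an illustrative helper, and `fetch_status` is a function you supply that calls GET /v1/jobs/{job_id} and returns the status string; the response shape is not assumed here. The demo uses a fake status sequence so it runs without network access.

```python
import time

TERMINAL = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status, job_id, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch_status(job_id) until the job reaches a terminal status."""
    waited = 0.0
    while True:
        status = fetch_status(job_id)
        if status in TERMINAL:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} still {status} after {timeout} s")
        sleep(interval)
        waited += interval


# Demo with a fake status sequence and a no-op sleep.
fake = iter(["pending", "running", "completed"])
print(wait_for_job(lambda _job_id: next(fake), "job-1", sleep=lambda _seconds: None))
```

A result should only be downloaded when the returned status is completed; failed and cancelled jobs have no result to fetch.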
List Your Jobs
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/jobs
Download Results
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/jobs/{job_id}/result \
-o result.arrow
Cancel a Job
curl -X DELETE -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/jobs/{job_id}
Rerun a Job
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/jobs/{job_id}/rerun
Creates a new job with the same parameters.
Configuration
| Parameter | Default | Environment Variable |
|---|---|---|
| Max concurrent workers | 4 | JOB_QUEUE_MAX_WORKERS |
| Result TTL | 24 hours | JOB_RESULT_TTL_HOURS |
| Max result size | 1 GB | JOB_MAX_RESULT_BYTES |
Results are stored in object storage and automatically cleaned up after the TTL expires.
Federation Overview
SynDB federation allows multiple institutions to participate in a shared neuroscience data network while retaining full control of their data. Each institution runs a node with its own ClickHouse instance; a central hub coordinates queries across all nodes.
Why Federate?
| Concern | Without federation | With federation |
|---|---|---|
| Data sovereignty | Upload all data to a central server | Data stays on your infrastructure |
| Meta-analysis | Limited to datasets on one instance | Query across all participating institutions |
| Compliance | Data leaves your network | Data never leaves — only query results cross boundaries |
| Latency | Single point of access | Local reads are fast; cross-cluster queries pay network cost |
Key Concepts
Hub — The coordinating instance that runs the full SynDB stack (API, PostgreSQL, ClickHouse, Meilisearch, S3). It maintains a registry of federated clusters, monitors their health, and routes cross-cluster queries.
Node — A lightweight participant running ClickHouse and the syndb-node binary. Nodes register with the hub via libp2p or HTTP, receive schema migrations, and respond to delegated queries.
Schema versioning — The hub pushes ClickHouse DDL migrations to all nodes. Queries only route to nodes whose schema version is compatible.
Health monitoring — The hub periodically checks each node’s health. Nodes are classified as Healthy, Degraded, Unreachable, or Unknown. Unhealthy nodes are excluded from federation queries.
Federation password — A shared secret that nodes present when registering with the hub. Prevents unauthorized clusters from joining.
When to Federate vs. Upload
Federate when:
- Institutional policy requires data to stay on-premise
- You have existing ClickHouse infrastructure
- You want to contribute to cross-institutional meta-analysis without data transfer
Upload directly when:
- You don’t have infrastructure to maintain
- Your data has no residency requirements
- You want the simplest path to sharing
Architecture at a Glance
┌──────────────────────────────────────┐
│                 Hub                  │
│    API + PostgreSQL + ClickHouse     │
│    + S3 + Meilisearch + libp2p       │
└────────┬────────────────────┬────────┘
         │ libp2p/QUIC        │ libp2p/QUIC
┌────────▼────────┐  ┌────────▼────────┐
│     Node A      │  │     Node B      │
│ CH + syndb-node │  │ CH + syndb-node │
└─────────────────┘  └─────────────────┘
Queries flow: User → Hub API → Hub ClickHouse → remote() to Node ClickHouse → results aggregated at Hub.
See Architecture for the full technical breakdown.
Federation Architecture
Components
Hub
The hub runs the full SynDB stack and coordinates the federation:
| Component | Role |
|---|---|
| syndb-api | HTTP API (port 8080) + Arrow Flight (port 50051) |
| PostgreSQL | User accounts, dataset metadata, cluster registry, job queue, benchmarks |
| ClickHouse | Local data warehouse + remote() queries to nodes |
| S3/MinIO | Mesh files, job results, ETL staging |
| Meilisearch | Full-text search index |
| HubRegistryActor | libp2p actor managing cluster registration and health |
| FederationHealthMonitor | Periodic health checks with circuit-breaker logic |
Node
Nodes are lightweight — no PostgreSQL, no S3, no Meilisearch:
| Component | Role |
|---|---|
| syndb-node | Federation daemon with Arrow Flight server (port 50052) |
| ClickHouse | Local data warehouse (HTTP port 8124, native port 9003/9440) |
| ClusterActor | libp2p actor handling hub communication |
Networking: libp2p
Federation uses libp2p for peer-to-peer communication:
- Transport: QUIC with built-in TLS 1.3 (encrypted, multiplexed)
- Discovery: mDNS for LAN (zero-config), DHT for WAN
- NAT traversal: Relay nodes for peers behind NAT
- Actor model: kameo actors manage the swarm event loop
DHT Registration
Services register under well-known names in the DHT:
| Name | Actor |
|---|---|
| syndb-hub | HubRegistryActor |
| syndb-cluster:{name} | ClusterActor |
The ClusterActor on each node looks up syndb-hub in the DHT to find and register with the hub.
Actor Messages
The ClusterActor handles these message types:
| Message | Direction | Purpose |
|---|---|---|
| HealthPing | Hub → Node | Periodic liveness check |
| SchemaSync | Hub → Node | Push DDL migrations |
| DatasetCatalogRequest | Hub → Node | Discover datasets on node |
| GetFlightEndpoint | Hub → Node | Resolve Flight address for data transfer |
| AnalyticsQuery | Hub → Node | Delegated analytics computation |
| OntologySync | Hub → Node | Push ontology terms |
Data Plane
Two mechanisms move data between hub and nodes:
ClickHouse remote()
For SQL queries, the hub compiles a remote('node-host:port', 'syndb', 'table', 'user', 'password') call that executes directly on the node’s ClickHouse and streams results back.
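To make the call shape concrete, here is a minimal Python sketch that assembles a remote() table expression with the argument order shown above. The helper names (quote_ch_string, build_remote_call) and the sample host and credentials are hypothetical; this is not the hub's actual query compiler.

```python
def quote_ch_string(value: str) -> str:
    """Quote a value as a ClickHouse string literal, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_remote_call(host: str, port: int, table: str, user: str, password: str,
                      database: str = "syndb") -> str:
    """Return a remote('host:port', 'db', 'table', 'user', 'password') expression."""
    args = [f"{host}:{port}", database, table, user, password]
    return "remote(" + ", ".join(quote_ch_string(a) for a in args) + ")"


# Hypothetical node address and credentials, for illustration only.
expr = build_remote_call("clickhouse.mylab.edu", 9440, "neurons", "federation", "s3cret")
print(f"SELECT count() FROM {expr}")
```

Quoting every argument keeps a password that contains a quote character from breaking the generated SQL.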
Arrow Flight (Internal)
For large result sets and non-SQL workloads (graph analysis, analytics), the hub delegates to the node’s internal Flight server (port 50052). Results stream back as Arrow IPC batches.
Schema Versioning
Each ClickHouse DDL migration has a version number. The hub tracks the current version and each node’s version:
- Hub receives a schema sync request (POST /v1/federation/schema/sync)
- Hub sends pending migrations to each active node via SchemaSync message
- Nodes apply migrations and report their new version
- Queries only route to nodes whose schema version is compatible
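The version bookkeeping in these steps can be sketched in Python as follows. The Migration type and both functions are hypothetical, and the compatibility rule (node version equals hub version) is an assumption, since the exact rule is not spelled out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    version: int
    ddl: str  # placeholder for the ClickHouse DDL text


def pending_migrations(migrations: list[Migration], node_version: int) -> list[Migration]:
    """Migrations the node has not applied yet, in ascending version order."""
    return sorted((m for m in migrations if m.version > node_version),
                  key=lambda m: m.version)


def is_routable(node_version: int, hub_version: int) -> bool:
    """Assumed rule: a node is compatible only when it matches the hub's version."""
    return node_version == hub_version


hub_migrations = [Migration(10, "-- ddl 10"), Migration(11, "-- ddl 11"), Migration(12, "-- ddl 12")]
to_send = pending_migrations(hub_migrations, node_version=10)
print([m.version for m in to_send])
```

A node at version 10 would receive migrations 11 and 12 and become routable once it reports version 12.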
Health Monitoring
The FederationHealthMonitorActor runs on the hub:
| State | Meaning | Query routing |
|---|---|---|
| Healthy | Responds to pings, schema compatible | Included |
| Degraded | Responds but slow or partially failing | Included with lower priority |
| Unreachable | Failed consecutive pings | Excluded |
| Unknown | Newly registered, not yet checked | Excluded until first successful ping |
Health transitions are logged and stored in PostgreSQL for audit.
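The routing column of the table above can be expressed as a small Python sketch. The enum, the priority map, and routable_nodes are hypothetical names; the numeric priorities only encode "Healthy before Degraded" and are not SynDB's actual weighting.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    UNREACHABLE = "Unreachable"
    UNKNOWN = "Unknown"


# Lower number sorts earlier; states missing from this map are excluded from routing.
ROUTING_PRIORITY = {Health.HEALTHY: 0, Health.DEGRADED: 1}


def routable_nodes(states: dict[str, Health]) -> list[str]:
    """Return node names eligible for queries, Healthy first, then Degraded."""
    eligible = [(ROUTING_PRIORITY[state], name)
                for name, state in states.items() if state in ROUTING_PRIORITY]
    return [name for _, name in sorted(eligible)]


print(routable_nodes({
    "node-a": Health.DEGRADED,
    "node-b": Health.HEALTHY,
    "node-c": Health.UNREACHABLE,
    "node-d": Health.UNKNOWN,
}))
```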
Concurrency Model
- Lock-free reads: The hub’s cluster registry uses papaya concurrent hash maps, so reads never block, even under high query load
- Actor isolation: Each cluster connection is managed by its own actor, preventing one slow node from blocking others
- Supervisor trees: Actor failures are caught and restarted by the kameo supervisor
Node Setup
This guide walks through joining the SynDB federation as a node operator.
Prerequisites
- ClickHouse instance with a syndb database
- Network reachability to the hub (or mDNS on the same LAN)
- The federation password (provided by the hub administrator)
- syndb binary with federation support
Step 1: Initialize
syndb ops federation init \
--cluster-name "my-lab-node" \
--clickhouse-endpoint "clickhouse.mylab.edu" \
--clickhouse-http-port 8123 \
--clickhouse-port 9440 \
--federation-password "$SYNDB_FEDERATION_PASSWORD" \
--institution "My University" \
--contact-email "[email protected]"
This command:
- Bootstraps a libp2p swarm and discovers the hub via mDNS or configured multiaddrs
- Registers the node with the hub (presenting the federation password)
- Applies any pending ClickHouse schema migrations
- Saves configuration to ~/.config/syndb/federation.json
Optional flags
| Flag | Default | Description |
|---|---|---|
| --listen-addr | OS-assigned | libp2p listen address (e.g., /ip4/0.0.0.0/udp/4001/quic-v1) |
| --description | — | Human-readable cluster description |
Step 2: Verify
# Show federation config
syndb ops federation status
# Test connectivity (3s mDNS discovery + hub + ClickHouse check)
syndb ops federation test
federation test performs the following steps:
- Bootstraps a temporary libp2p swarm with mDNS discovery
- Looks up the hub in the DHT
- Tests ClickHouse connectivity
Step 3: Sync Schema
If the hub has newer schema migrations:
# Preview changes
export SYNDB_HUB_URL="https://api.syndb.xyz/v1"
syndb ops federation sync-schema --dry-run
# Apply
syndb ops federation sync-schema
sync-schema currently uses an HTTP fallback endpoint and expects SYNDB_HUB_URL to point at the hub API base.
Step 4: Confirm Registration
List all federated clusters to verify your node appears:
export SYNDB_HUB_URL="https://api.syndb.xyz/v1"
syndb ops federation clusters
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| SYNDB_FEDERATION_PASSWORD | Yes | — | Shared secret for hub registration |
| SYNDB_SERVER_URL | No | https://api.syndb.xyz | Default server URL for the CLI root command tree |
| SYNDB_HUB_URL | For sync-schema and clusters fallback flows | — | Hub API base including /v1 (for example https://api.syndb.xyz/v1) |
| FEDERATION_CLUSTER_NAME | Yes (node mode) | — | Unique cluster identifier |
| FEDERATION_NODE_FLIGHT_PORT | No | 50052 | Internal Flight gRPC port |
| FEDERATION_NODE_FLIGHT_ADVERTISE | No | derived | Advertised Flight endpoint for remote delegation |
| FEDERATION_ENABLE_MDNS | No | true | Enable mDNS for LAN discovery |
| FEDERATION_LISTEN_ADDR | No | OS-assigned | libp2p listen address |
| FEDERATION_HUB_MULTIADDRS | No | — | Comma-separated hub multiaddrs for WAN |
| FEDERATION_CLUSTER_NATIVE_PORT | No | 9440 | ClickHouse native port for remote() queries |
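As a rough illustration of how these variables and defaults combine, the Python sketch below reads a subset of them from a mapping. The function load_node_settings and the returned keys are hypothetical, and treating only the literal "true" as enabling mDNS is an assumption; the real syndb-node parsing may differ.

```python
def load_node_settings(env: dict[str, str]) -> dict:
    """Validate required variables and apply the documented defaults."""
    required = ("SYNDB_FEDERATION_PASSWORD", "FEDERATION_CLUSTER_NAME")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    raw_addrs = env.get("FEDERATION_HUB_MULTIADDRS", "")
    return {
        "cluster_name": env["FEDERATION_CLUSTER_NAME"],
        "flight_port": int(env.get("FEDERATION_NODE_FLIGHT_PORT", "50052")),
        "native_port": int(env.get("FEDERATION_CLUSTER_NATIVE_PORT", "9440")),
        # Assumption: only the literal "true" (any case) enables mDNS.
        "enable_mdns": env.get("FEDERATION_ENABLE_MDNS", "true").lower() == "true",
        # Comma-separated list; empty entries are dropped.
        "hub_multiaddrs": [a.strip() for a in raw_addrs.split(",") if a.strip()],
    }


settings = load_node_settings({
    "SYNDB_FEDERATION_PASSWORD": "example-secret",
    "FEDERATION_CLUSTER_NAME": "my-lab-node",
    "FEDERATION_HUB_MULTIADDRS": "/ip4/203.0.113.10/udp/4001/quic-v1",
})
print(settings)
```

In a real process, os.environ can be passed as the env argument.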
Docker Compose (Development)
For local development, the federation profile starts a hub and one node:
docker compose --profile federation up -d
This starts:
- clickhouse-node: ClickHouse on HTTP 8124, native 9003
- clickhouse-node-setup: Creates federation user on the node
- clickhouse-hub-fed-setup: Creates federation user on the hub
- syndb-node: Federation daemon with Flight on 50052, libp2p on 4001
All services use network_mode: host and discover each other via localhost.
Removing a Node
syndb ops federation logout
This deletes ~/.config/syndb/federation.json. The hub administrator can also deactivate the cluster via DELETE /v1/federation/clusters/{id}.
Hub Administration
All hub administration endpoints require SuperUser authentication.
Federation Status
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/status
{
"total_clusters": 5,
"active_clusters": 4,
"healthy": 3,
"degraded": 1,
"unreachable": 0,
"schema_version": 12
}
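A short Python sketch of how a client might condense this response. The function summarize_status is hypothetical; counting Healthy plus Degraded as routable follows the health-state table in the architecture section.

```python
import json


def summarize_status(body: str) -> str:
    """Condense the federation status JSON into a one-line summary."""
    status = json.loads(body)
    inactive = status["total_clusters"] - status["active_clusters"]
    routable = status["healthy"] + status["degraded"]
    return (f"{routable} routable, {status['unreachable']} unreachable, "
            f"{inactive} inactive (schema v{status['schema_version']})")


sample = """{
  "total_clusters": 5, "active_clusters": 4,
  "healthy": 3, "degraded": 1, "unreachable": 0,
  "schema_version": 12
}"""
print(summarize_status(sample))
```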
Cluster Management
List Clusters
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/clusters
Returns each cluster’s ID, name, endpoint, port, health status, and active flag.
Register a Cluster
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/federation/clusters \
-d '{
"name": "partner-lab",
"endpoint": "ch.partner-lab.edu",
"federation_password": "shared-secret",
"description": "Partner Lab ClickHouse node",
"institution": "Partner University",
"contact_email": "[email protected]"
}'
Clusters can also self-register via POST /v1/federation/register using the federation password (no SuperUser required).
Deactivate a Cluster
curl -X DELETE -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/clusters/{cluster_id}
Sets is_active = false. The cluster is excluded from future queries but its record is preserved.
Health Checks
Single Cluster
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/clusters/{cluster_id}/health
Verification Tests
Three targeted tests for diagnosing cluster issues:
# Test ClickHouse connectivity and measure latency
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/clusters/{cluster_id}/test/connectivity
# Verify schema version compatibility
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/clusters/{cluster_id}/test/schema
# Run a test cross-cluster query
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/clusters/{cluster_id}/test/query
Schema Sync
Push pending DDL migrations to all active clusters:
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/schema/sync
Get the current schema version and migrations:
# All migrations
curl -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/schema
# Migrations since version 10
curl -H "Authorization: Bearer $TOKEN" \
"https://api.syndb.xyz/v1/federation/schema?since_version=10"
Benchmarks
Track federation query performance:
# Submit a benchmark record
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/federation/benchmarks \
-d '{
"cluster_id": "...",
"query_type": "remote_single",
"latency_ms": 145,
"row_count": 50000,
"cluster_count": 1,
"payload_bytes": 2048000,
"success": true
}'
# List benchmarks with filters
curl -H "Authorization: Bearer $TOKEN" \
"https://api.syndb.xyz/v1/federation/benchmarks?query_type=remote_single&limit=50"
# Aggregate stats grouped by query type
curl -H "Authorization: Bearer $TOKEN" \
"https://api.syndb.xyz/v1/federation/benchmarks/aggregate?since=2024-01-01"
Query Types
| Type | Description |
|---|---|
| remote_single | Query to one remote cluster |
| remote_multi | Query spanning multiple clusters |
| federation_union | Union across all federated clusters |
| federation_search | Federated search |
| health_check | Health check probe |
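For intuition about what grouping by query type means, here is a client-side Python sketch over records shaped like the submission payload above. The function and the output field names (count, mean_latency_ms, success_rate) are hypothetical; the response format of the aggregate endpoint is not documented here and may differ.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_query_type(records: list[dict]) -> dict[str, dict]:
    """Group benchmark records by query_type and compute simple summary stats."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[record["query_type"]].append(record)
    return {
        query_type: {
            "count": len(rows),
            "mean_latency_ms": mean(r["latency_ms"] for r in rows),
            "success_rate": sum(1 for r in rows if r["success"]) / len(rows),
        }
        for query_type, rows in groups.items()
    }


# Made-up records for illustration only.
stats = aggregate_by_query_type([
    {"query_type": "remote_single", "latency_ms": 100, "success": True},
    {"query_type": "remote_single", "latency_ms": 200, "success": False},
    {"query_type": "health_check", "latency_ms": 5, "success": True},
])
print(stats)
```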
Cross-Cluster Queries
Federation queries let you analyze data across all participating nodes from a single API call.
How It Works
- User submits a query via SyQL or meta-analysis endpoint with federation scope
- Hub resolves targets — checks dataset locality index to determine which nodes hold relevant data
- Hub compiles remote queries — generates ClickHouse remote('node:port', 'syndb', 'table', 'user', 'pass') calls
- Nodes execute locally — each node runs its portion of the query against local data
- Hub aggregates — results stream back and are merged at the hub
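The target-resolution and aggregation steps above can be sketched in Python. Everything here is hypothetical scaffolding: the locality index is modelled as a plain dict from node name to dataset IDs, and run_on_node stands in for the remote() or Flight call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def resolve_targets(locality: dict[str, set[str]], dataset_ids: set[str]) -> list[str]:
    """Nodes holding at least one requested dataset (sorted for determinism)."""
    return sorted(node for node, held in locality.items() if held & dataset_ids)


def fan_out(targets: list[str], run_on_node: Callable[[str], list[dict]]) -> list[dict]:
    """Run the per-node query concurrently and concatenate rows in target order."""
    if not targets:
        return []
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        per_node = list(pool.map(run_on_node, targets))  # map preserves input order
    return [row for rows in per_node for row in rows]


locality = {"node-a": {"d1", "d2"}, "node-b": {"d3"}, "node-c": {"d9"}}
targets = resolve_targets(locality, {"d2", "d3"})
rows = fan_out(targets, lambda node: [{"node": node, "neurons": 1}])
print(targets, rows)
```

Nodes that hold none of the requested datasets (node-c here) never receive the query.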
SyQL with Federation Scope
SyQL queries can target the federation by specifying scope inside the query text:
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/syql/exec \
-d '{
"query": "SCOPE federation\nSELECT neuron_id FROM neurons WHERE brain_region = '\''mushroom_body'\'' LIMIT 1000"
}'
The hub transparently fans the query out to nodes that hold matching datasets.
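Building the same request body programmatically avoids the shell quoting needed in the curl example. The following Python sketch only constructs the request and does not send it; build_syql_request is a hypothetical helper, while the URL, headers, and SCOPE prefix come from the example above.

```python
import json
import urllib.request


def build_syql_request(token: str, select: str, scope: str = "federation",
                       base_url: str = "https://api.syndb.xyz/v1") -> urllib.request.Request:
    """Build (but do not send) a POST /syql/exec request with the scope in the query text."""
    body = json.dumps({"query": f"SCOPE {scope}\n{select}"}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/syql/exec",
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )


req = build_syql_request(
    "example-token",
    "SELECT neuron_id FROM neurons WHERE brain_region = 'mushroom_body' LIMIT 1000",
)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Passing req to urllib.request.urlopen would submit it; that step is omitted because it needs a valid token.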
Meta-Analysis Across Clusters
Specify cluster_ids to include specific nodes:
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://api.syndb.xyz/v1/meta-analysis \
-d '{
"table": "neurons",
"metric": "mesh_volume",
"grouping": "brain_structure",
"dataset_ids": "uuid-1,uuid-2,uuid-3",
"scope": "federation",
"cluster_ids": "cluster-uuid-1,cluster-uuid-2"
}'
Data Plane: Arrow Flight
For large result sets and non-SQL workloads (graph analysis, analytics), the hub delegates to each node’s internal Flight server:
- Hub sends a Flight DoGet request to the node’s advertised Flight endpoint (default port 50052)
- Results stream back as Arrow IPC record batches
- The hub merges batches from multiple nodes before returning to the client
Limitations
| Constraint | Detail |
|---|---|
| Latency | Cross-cluster queries add network round-trip time per node |
| Schema compatibility | Nodes must be at a compatible schema version; incompatible nodes are excluded |
| Node health | Only Healthy and Degraded nodes receive queries; Unreachable nodes are skipped |
| Delegation timeout | Default 30s (FEDERATION_DELEGATION_TIMEOUT_SECS); long-running queries may need async jobs |
| No cross-node joins | Each node executes independently; joins happen only against local data |
Best Practices
- Use async jobs (POST /v1/jobs) for large federation queries to avoid HTTP timeouts
- Check federation status before running large queries to know which nodes are available
- Prefer meta-analysis endpoints for cross-dataset aggregation — they handle fan-out efficiently
- Monitor benchmarks to track federation query performance over time
Federation Troubleshooting
Node Cannot Find Hub
Symptom: syndb ops federation init or syndb ops federation test hangs during hub discovery.
Causes and fixes:
| Cause | Fix |
|---|---|
| mDNS blocked by firewall | Open UDP port 5353 or set FEDERATION_ENABLE_MDNS=false and use explicit multiaddrs |
| Hub and node on different networks | Set FEDERATION_HUB_MULTIADDRS to the hub’s libp2p address (e.g., /ip4/hub-ip/udp/4001/quic-v1) |
| Hub not running | Verify hub process is up and listening on its libp2p port |
Registration Rejected
Symptom: "Invalid federation password" error.
Fix: Ensure SYNDB_FEDERATION_PASSWORD matches the hub’s FEDERATION_PASSWORD exactly. Check for trailing whitespace or newlines in environment variables.
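A quick way to spot the whitespace problem is to inspect the raw value. The Python sketch below is a hypothetical helper, not part of the syndb CLI; it reports stray whitespace or newlines in a secret without printing the secret itself.

```python
def secret_problems(value: str) -> list[str]:
    """List formatting problems that commonly break shared-secret comparison."""
    problems = []
    if value != value.strip():
        problems.append("leading or trailing whitespace")
    if "\n" in value or "\r" in value:
        problems.append("contains a newline")
    return problems


# A value read from a file with a trailing newline, for illustration.
print(secret_problems("shared-secret\n"))
print(secret_problems("shared-secret"))
```

To check a live environment, pass os.environ.get("SYNDB_FEDERATION_PASSWORD", "") instead of a literal.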
Schema Version Mismatch
Symptom: Node excluded from federation queries; hub logs show schema incompatibility.
Fix:
# Check current schema
syndb ops federation status
# Sync to latest
syndb ops federation sync-schema
If sync fails, verify the node’s ClickHouse is reachable and the syndb database exists.
Health States
| State | Meaning | Action |
|---|---|---|
| Healthy | All checks pass | None |
| Degraded | Responds but slow or partially failing | Check ClickHouse load, disk space, network |
| Unreachable | Failed consecutive pings | Check firewall, ClickHouse process, network connectivity |
| Unknown | Newly registered | Wait for first health check cycle or trigger manual verify |
Trigger a manual health check from the hub:
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/clusters/{id}/verify
Docker Compose Issues
Port Conflicts
The federation profile uses network_mode: host. Check for conflicts:
- Hub ClickHouse: HTTP 8123, native 9002
- Node ClickHouse: HTTP 8124, native 9003
- Federation Flight: 50052
- libp2p: UDP 4001
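One way to look for conflicts before starting the profile is to try binding each port. This Python sketch is a hypothetical pre-flight check using the ports listed above; it is best-effort only, since a port can be taken between the check and the actual start.

```python
import socket

# (protocol, port) pairs taken from the list above.
FEDERATION_PORTS = [("tcp", 8123), ("tcp", 9002), ("tcp", 8124),
                    ("tcp", 9003), ("tcp", 50052), ("udp", 4001)]


def port_in_use(proto: str, port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding the port fails, which usually means it is taken."""
    kind = socket.SOCK_DGRAM if proto == "udp" else socket.SOCK_STREAM
    with socket.socket(socket.AF_INET, kind) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return True
    return False


for proto, port in FEDERATION_PORTS:
    state = "in use" if port_in_use(proto, port) else "free"
    print(f"{proto}/{port}: {state}")
```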
Node Fails to Start
Check that the hub and node ClickHouse setup containers completed first:
docker compose --profile federation logs clickhouse-hub-fed-setup
docker compose --profile federation logs clickhouse-node-setup
These create the federation user on each ClickHouse instance. If they fail, the node cannot authenticate for remote() queries.
Connectivity Test Sequence
Run targeted tests to isolate the failure:
# 1. Test ClickHouse connectivity
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/clusters/{id}/test/connectivity
# 2. Test schema compatibility
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/clusters/{id}/test/schema
# 3. Test cross-cluster query
curl -X POST -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/federation/clusters/{id}/test/query
Each test returns a pass/fail result with latency and error details. Work through them in order — later tests depend on earlier ones passing.
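The "work through them in order" advice amounts to stopping at the first failing test. A minimal Python sketch, where run_test is a hypothetical callable standing in for the three POST requests above:

```python
from typing import Callable, Optional

# Order matters: later tests depend on earlier ones passing.
TEST_ORDER = ["connectivity", "schema", "query"]


def first_failure(run_test: Callable[[str], bool]) -> Optional[str]:
    """Run tests in order and return the name of the first failing one, if any."""
    for name in TEST_ORDER:
        if not run_test(name):
            return name
    return None


# Simulated outcome: connectivity passes, schema fails, query is never attempted.
attempted = []


def fake_run(name: str) -> bool:
    attempted.append(name)
    return name == "connectivity"


print(first_failure(fake_run), attempted)
```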
Docker Compose
Local development and single-machine deployment using Docker Compose.
Base Stack
cargo run -p cli --features dev -- stack up
Starts the core services:
| Service | Port | Description |
|---|---|---|
| syndb-api | 8080 (HTTP), 50051 (Flight) | REST API + Arrow Flight |
| syndb-ui | 8090 | Web frontend |
| postgres | 5433 | Metadata, users, access control |
| clickhouse | 8123 (HTTP), 9002 (native) | ClickHouse data warehouse |
| s3 | 9000 (API), 9001 (console) | MinIO object storage |
| meilisearch | 7700 | Meilisearch full-text search |
All services use network_mode: host — they bind directly to the host network.
Local Search Smoke
The compose stack includes Meilisearch and wires it into the API. To rebuild the local dataset search index from PostgreSQL and query the public search endpoint:
nix develop . -c syndb test meilisearch-local
This smoke path:
- ensures the stack is up
- runs syndb data search reconcile against the local PostgreSQL and Meilisearch services
- queries http://localhost:8080/v1/search/fulltext?q=test&limit=5
The response may legitimately contain zero hits on a fresh stack, but the command should complete successfully and return valid JSON from the API.
Federation Profile
docker compose --profile federation up -d
Adds federation services on top of the base stack:
| Service | Port | Description |
|---|---|---|
| clickhouse-node | 8124 (HTTP), 9003 (native) | Node ClickHouse |
| clickhouse-node-setup | — | Creates federation user on node |
| clickhouse-hub-fed-setup | — | Creates federation user on hub |
| syndb-node | 50052 (Flight), 4001/UDP (libp2p) | Federation node daemon |
Note: The federation and federation-world profiles share port 8124 and are mutually exclusive. federation-world runs 5 regional ClickHouse nodes for benchmarking only.
ETL Profile
Run dataset imports:
docker compose --profile etl run syndb-etl etl <dataset> <command>
Example:
docker compose --profile etl run syndb-etl etl hemibrain download
docker compose --profile etl run syndb-etl etl hemibrain import --data-dir /data/Hemibrain --table neurons --dataset-id <uuid>
Version Management
All service versions are defined in versions.nix. After changing versions (built with Nix):
syndb dev sync-versions
This regenerates .env with the correct image tags.
Image Building
Build container images from Nix:
cargo run -p cli --features dev -- stack prepare
This builds the local development images used by Compose: syndb-api-rust:dev, syndb-etl:dev, and syndb-ui:dev.
Volumes
| Volume | Service | Content |
|---|---|---|
| clickhouse-data | clickhouse | ClickHouse data |
| clickhouse-node-data | clickhouse-node | Node ClickHouse data |
| postgres-data | postgres | PostgreSQL data |
| minio-data | s3 | S3 object storage |
| meilisearch-data | meilisearch | Search index |
Cleanup
ClickHouse creates files with UID 100100 and restrictive permissions. To clean volumes:
podman unshare rm -rf <volume-path> # requires Podman (https://podman.io)
Prefer keeping data in Docker volumes rather than bind mounts to avoid permission issues.
Kubernetes & Helm
Production deployment on Kubernetes using Helm charts.
Charts Overview
| Chart | Description |
|---|---|
| syndb-hub | Hub deployment (API, UI, depends on syndb-clickhouse) |
| syndb-federation-node | Federation node (syndb-node, depends on syndb-clickhouse) |
| syndb-clickhouse | Shared ClickHouse subchart (used by both hub and node) |
| syndb-etl | ETL batch jobs (download, prepare, import, graph-precompute) |
| nautilus | Umbrella chart for the NRP Nautilus cluster deployment |
Charts are located under infrastructure/helm/.
Hub Deployment
The hub chart deploys the full SynDB stack. Key values:
syndb-clickhouse:
  clusterName: syndb-hub
  shardRegions:
    - name: dc1
      region: dc1
      replicas: 3
api:
  image:
    repository: docker.io/caniko/syndb-api
    tag: "0.10.47"
  flightPort: 50051
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
ui:
  image:
    repository: docker.io/caniko/syndb-ui
    tag: "0.10.47"
The chart also creates a remote_servers.xml ConfigMap for ClickHouse cluster topology.
Meilisearch on Nautilus
The Nautilus umbrella chart now deploys Meilisearch as an internal-only
production dependency for /v1/search/fulltext.
- Deployment shape: single-replica StatefulSet
- Service type: ClusterIP
- Default storage: rook-ceph-block
- Default volume: 20Gi
- Public ingress: none
- Shared secret: syndb-api-secrets.meilisearch_api_key
The API and the reconcile CronJob both receive:
- MEILISEARCH_URL=http://syndb-meilisearch:7700
- MEILISEARCH_API_KEY from syndb-api-secrets
Meilisearch itself receives the same secret as MEILI_MASTER_KEY, with
MEILI_NO_ANALYTICS=true.
Reconcile job
Nautilus also deploys an hourly CronJob that runs:
syndb data search reconcile
using the lightweight oci-syndb-cli image. This is the repair mechanism for
index drift and missed write-side updates.
Rollout order
For a production cutover:
- land the code and image changes
- update syndb-api-secrets so it contains meilisearch_api_key
- deploy the Nautilus chart
- wait for /health to report configured Meilisearch
- run one manual reconcile job
- verify /v1/search/fulltext through the public API
The manual one-shot reconcile command inside the supported devshell is:
nix develop . -c env \
POSTGRES_HOST=<host> \
POSTGRES_READ_HOST=<read-host> \
POSTGRES_PORT=<port> \
POSTGRES_USERNAME=<user> \
POSTGRES_PASSWORD=<password> \
POSTGRES_PATH=<database> \
MEILISEARCH_URL=http://syndb-meilisearch:7700 \
MEILISEARCH_API_KEY=<key> \
cargo run -p cli --features dataset -- dataset search reconcile
Node Deployment
Deploy a federation node at your institution:
syndb-clickhouse:
  clusterName: syndb-node
  shardRegions:
    - name: dc1
      region: dc1
      replicas: 2
nodeApi:
  enabled: true
  image: syndb-api-rust:latest
  flightPort: 50052
  libp2pPort: 4001
  hubMultiaddrs: "/ip4/<hub-ip>/udp/4001/quic-v1"
  federationPassword: "<shared-secret>"
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
When nodeApi.enabled=true, the chart deploys:
- A Deployment running syndb-node with Flight (TCP) and libp2p (UDP) ports
- A Service exposing both ports
- Environment variables auto-populated from values (cluster name, endpoints, passwords)
In Kubernetes, mDNS is disabled — use hubMultiaddrs for explicit hub discovery.
ETL Jobs
ETL runs through the syndb-etl chart values, primarily downloadJobs, prepareJobs, seed, and graphPrecompute:
syndb-etl:
  image:
    repository: docker.io/caniko/syndb-etl
    tag: "0.10.47"
  flight:
    enabled: true
    serverUrl: "http://syndb-api-service:80"
    port: "50051"
  downloadJobs:
    - pipeline: hemibrain
      emptyDirSizeLimit: 8Gi
  downloadResources:
    requests: { cpu: "500m", memory: "512Mi" }
    limits: { cpu: "600m", memory: "614Mi" }
  prepareJobs:
    - pipeline: hemibrain
      emptyDirSizeLimit: 25Gi
  graphPrecompute:
    enabled: true
Important: Kubernetes Jobs are immutable. Before running helm upgrade after resource values have changed, delete failed or running ETL jobs:
nix develop . -c kubectl delete job -n syndb -l app=syndb-etl --field-selector status.successful!=1
Skip override semantics: when syndb ops k8s nautilus apply receives explicit syndb-etl.skipPipelines[...] flags, SynDB now unions them with both config/etl-skip.ron and the live skip set derived from current ETL Jobs. Manual skip flags are additive; they do not replace the detected live skip set.
emptyDir warning: emptyDir volumes with medium: Memory are backed by tmpfs and count against the pod’s memory cgroup limit. In that case, add the expected emptyDir data size to the memory limit.
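The additive skip semantics described above reduce to a set union. A minimal Python sketch, where the function name is hypothetical and every pipeline name other than hemibrain is made up for illustration:

```python
def effective_skip_set(manual: set[str], config: set[str], live: set[str]) -> set[str]:
    """Manual flags are unioned with the config file and the live skip set."""
    return manual | config | live


skips = effective_skip_set(
    manual={"hemibrain"},             # explicit skipPipelines flags
    config={"example-pipeline-a"},    # stands in for config/etl-skip.ron (hypothetical entry)
    live={"example-pipeline-b"},      # derived from current ETL Jobs (hypothetical entry)
)
print(sorted(skips))
```

Passing a manual flag can therefore only grow the skip set; it never removes a pipeline that the live detection already skipped.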
Applying Changes
nix develop . -c cargo run -p cli --features dev -- ops k8s nautilus apply
Or manually:
nix develop . -c helm upgrade --install syndb-nautilus infrastructure/helm/nautilus/ \
-n syndb --create-namespace \
-f infrastructure/helm/nautilus/values.yaml
Pending Helm Releases
SynDB now refuses to apply when syndb-nautilus is already in one of Helm’s
pending states (pending-install, pending-upgrade, pending-rollback).
This prevents Helm’s generic error:
another operation (install/upgrade/rollback) is in progress
from landing after ETL reset work has already started.
If the pending revision is newer than 10 minutes, treat it as possibly active and inspect it first:
nix develop . -c helm status syndb-nautilus -n syndb
nix develop . -c helm history syndb-nautilus -n syndb
If the pending revision is older than 10 minutes, treat it as stale and roll back to the newest deployed revision before retrying the apply.
Current example from April 19, 2026:
- revision 293 was stuck in pending-upgrade
- Helm reported last_deployed = 2026-04-19T18:51:43.666197216+02:00
- the newest deployed revision was 291
Recovery:
nix develop . -c helm rollback syndb-nautilus 291 -n syndb
nix develop . -c cargo run -p cli --features dev -- ops k8s nautilus apply
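The 10-minute rule above can be written down as a small decision function. This Python sketch is illustrative only: classify_release and its return labels are hypothetical, and a revision that is exactly 10 minutes old is treated as possibly active, a boundary the text does not specify.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=10)
PENDING_STATES = {"pending-install", "pending-upgrade", "pending-rollback"}


def classify_release(status: str, last_deployed: datetime, now: datetime) -> str:
    """Return 'ok', 'possibly-active' (inspect first), or 'stale' (roll back, then retry)."""
    if status not in PENDING_STATES:
        return "ok"
    return "stale" if now - last_deployed > STALE_AFTER else "possibly-active"


# Timestamp from the example above, truncated to whole seconds.
deployed = datetime(2026, 4, 19, 18, 51, 43, tzinfo=timezone(timedelta(hours=2)))
print(classify_release("pending-upgrade", deployed, deployed + timedelta(minutes=3)))
print(classify_release("pending-upgrade", deployed, deployed + timedelta(minutes=25)))
```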
QueryFabric Rollout
The QueryFabric cutover adds two PostgreSQL metadata invariants that the API now enforces at startup:
- every saved query must have query_text
- every pending query job must have sql_plan
Use the SynDB devshell and either run the checks manually:
nix develop . -c syndb test queryfabric-full
nix develop . -c syndb test queryfabric-rollout
or use the convenience wrapper:
nix develop . -c syndb ops k8s nautilus deploy queryfabric
test-queryfabric-rollout checks the PostgreSQL environment described by the
current POSTGRES_* / POSTGRES_READ_HOST variables and performs the same
saved-query backfill step the API runs at startup. For production, point those
variables at the target metadata database before running the preflight.
deploy-bump-queryfabric is a safe wrapper over deploy-bump: it runs the full
local QueryFabric + SynDB validation path first, then the target-DB preflight,
and only then publishes images and upgrades Helm on trunk.
Environment Reference
All configuration is controlled through environment variables. This page documents the application defaults from crates/services/api/src/settings/mod.rs and calls out the local docker-compose.yaml overrides where they differ.
Database
| Variable | App default | Local compose | Description |
|---|---|---|---|
| POSTGRES_HOST | localhost | localhost | PostgreSQL host |
| POSTGRES_PORT | 5432 | 5433 | PostgreSQL port |
| POSTGRES_USERNAME | syndb | syndb | PostgreSQL user |
| POSTGRES_PASSWORD | syndb | syndb | PostgreSQL password |
| POSTGRES_PATH | syndb | syndb_test | Database name |
| POSTGRES_READ_HOST | unset | unset | Optional read replica host |
| DB_POOL_MAX | 20 | unchanged | Max connection pool size |
| DB_POOL_MIN | 2 | unchanged | Min idle connections |
| DB_CONNECT_TIMEOUT_SECS | 10 | unchanged | PostgreSQL connect timeout |
| CLICKHOUSE_HOST | localhost | localhost | ClickHouse host |
| CLICKHOUSE_PORT | 8443 | 8123 | ClickHouse HTTP port |
| CLICKHOUSE_USERNAME | default | default | ClickHouse user |
| CLICKHOUSE_DATABASE | syndb | syndb | ClickHouse database |
| CLICKHOUSE_SECURE | true | false | Use HTTPS/TLS for ClickHouse |
Object Storage (S3/MinIO)
| Variable | Default | Description |
|---|---|---|
| S3_ACCESS_KEY | — | Access key |
| S3_SECRET_KEY | — | Secret key |
| S3_ENDPOINT | unset | Custom endpoint for MinIO or other S3-compatible storage |
| S3_REGION | unset | AWS region |
Bucket names: syndb-mesh, syndb-swb, syndb-search, syndb-jobs. No underscores allowed in bucket names.
Authentication
| Variable | Default | Description |
|---|---|---|
| PASSLIB_SECRET | — | PASETO v4.local symmetric key (minimum 32 bytes) |
| SERVICE_SECRET | — | Service account registration secret |
| UI_BASE_URL | http://localhost:8090/ui | OAuth callback redirect base URL |
| ACCESS_TOKEN_LIFETIME | 900 (15 min) | Access token TTL in seconds |
| REFRESH_TOKEN_LIFETIME | 2592000 (30 days) | Refresh token TTL in seconds |
| COOKIE_SAME_SITE | Strict | SameSite attribute for auth cookies |
| COOKIE_SECURE | true | Whether auth cookies require HTTPS |
| REQUIRE_AUTHENTICATION | true | Require auth on protected endpoints |
OAuth Providers
| Variable | Description |
|---|---|
| OA_GITHUB_ID, OA_GITHUB_SECRET | GitHub OAuth app credentials |
| OA_GOOGLE_ID, OA_GOOGLE_SECRET | Google OAuth credentials |
| OA_ORCID_ID, OA_ORCID_SECRET | ORCID OAuth credentials |
| OA_CILOGON_ID, OA_CILOGON_SECRET | CILogon OAuth credentials |
| OA_GITLAB_ID, OA_GITLAB_SECRET | GitLab OAuth credentials |
| OA_GITLAB_URL | Custom GitLab instance URL |
| OA_ORCID_SANDBOX | Use sandbox.orcid.org (false) |
| OA_CILOGON_SANDBOX | Use test.cilogon.org (false) |
| OAUTH_PROVIDER_BASE_URL | Override provider URLs (testing) |
Federation
| Variable | Default | Description |
|---|---|---|
| FEDERATION_LISTEN_ADDR | OS-assigned | libp2p listen address |
| FEDERATION_ENABLE_MDNS | true | Enable mDNS LAN discovery |
| FEDERATION_HUB_MULTIADDRS | — | Comma-separated hub multiaddrs for WAN |
| FEDERATION_CLUSTER_NAME | — | Cluster identifier (required for node mode) |
| FEDERATION_CLUSTER_DESCRIPTION | — | Cluster description |
| FEDERATION_CLUSTER_INSTITUTION | — | Institution name |
| FEDERATION_PASSWORD | — | Shared federation secret |
| FEDERATION_CLUSTER_NATIVE_PORT | 9440 | ClickHouse native port for remote() |
| FEDERATION_NODE_FLIGHT_PORT | 50052 | Internal Flight gRPC port |
| FEDERATION_NODE_FLIGHT_ADVERTISE | unset | Advertised internal Flight endpoint (host:port); defaults to localhost:<FEDERATION_NODE_FLIGHT_PORT> when omitted |
| FEDERATION_DELEGATION_TIMEOUT_SECS | 30 | Timeout for delegated requests |
Server
| Variable | Default | Description |
|---|---|---|
| API_DOMAIN | localhost | Public API host name used for generated links |
| DEV_MODE | false | Permissive CORS, data seeding |
| DEBUG | false | Verbose SQL logging |
| TESTING | false | Skip federation/job queue init |
| REQUEST_TIMEOUT_SECS | 60 | HTTP handler timeout |
| HTTP_CLIENT_TIMEOUT_SECS | 30 | Internal HTTP client timeout |
| UPLOAD_TIMEOUT | 21600 (6 hours) | Upload timeout |
| FLIGHT_PORT | 50051 | Arrow Flight server port |
Rate Limiting
| Variable | Default | Description |
|---|---|---|
| RATE_LIMIT_PER_SECOND | 100 | Sustained request rate per IP |
| RATE_LIMIT_BURST | 200 | Burst capacity per IP |
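To show how a sustained rate and a burst capacity interact, here is a minimal token-bucket sketch in Python using the defaults above. It is a generic illustration of the technique, not SynDB's middleware; the class and its explicit now parameter are hypothetical, and a real limiter would keep one bucket per client IP.

```python
class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, holds at most `burst` tokens."""

    def __init__(self, rate: float = 100.0, burst: float = 200.0, now: float = 0.0):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst  # start full, so an initial burst is allowed
        self.updated = now

    def allow(self, now: float) -> bool:
        """Consume one token if available; `now` is a timestamp in seconds."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket()
burst_ok = sum(bucket.allow(now=0.0) for _ in range(250))  # 250 requests at the same instant
later_ok = sum(bucket.allow(now=0.5) for _ in range(100))  # 100 more, half a second later
print(burst_ok, later_ok)
```

With these defaults, 200 of the 250 simultaneous requests pass, and half a second later the bucket has refilled enough for another 50.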
Job Queue
| Variable | Default | Description |
|---|---|---|
| JOB_QUEUE_MAX_WORKERS | 4 | Max concurrent job workers |
| JOB_RESULT_TTL_HOURS | 24 | Result retention |
| JOB_MAX_RESULT_BYTES | 1073741824 (1 GB) | Max result size |
Search
| Variable | Default | Description |
|---|---|---|
| MEILISEARCH_URL | unset | Base URL for Meilisearch, for example http://localhost:7700 |
| MEILISEARCH_API_KEY | — | Meilisearch API key |
API Overview
Base URL: https://api.syndb.xyz/v1
Interactive OpenAPI documentation: api.syndb.xyz/docs
OpenAPI spec: GET /openapi.json
This page is a curated route map for the current public surface. The generated OpenAPI document is the authoritative exhaustive reference.
Authentication
Pass a PASETO access token in the Authorization header:
Authorization: Bearer <access_token>
See Authentication for how to obtain tokens.
Content Types
- Requests: application/json
- Responses: application/json (API), Apache Arrow IPC (job results), BibTeX/RIS (citations)
Error Format
{
"error": "Human-readable error message"
}
Standard HTTP status codes: 400 (bad request), 401 (unauthenticated), 403 (insufficient permissions), 404 (not found), 409 (conflict), 429 (rate limited).
Route Map
Public routes
- GET /health: service health check
- POST /v1/user/auth/register
- POST /v1/user/auth/login
- POST /v1/user/auth/register-service
- POST /v1/user/auth/refresh
- POST /v1/user/auth/logout
- GET /v1/search/fulltext
- GET /v1/federation/ping
- POST /v1/federation/register
- GET /v1/ontology/vocabularies
- GET /v1/ontology/terms
- GET /v1/ontology/terms/{id}
- GET /v1/ontology/terms/{id}/children
- GET /v1/ontology/terms/{id}/ancestors
- POST /v1/ontology/terms/validate
Authenticated user routes
- GET /v1/user/profile
- PATCH /v1/user/profile
- POST /v1/user/profile/scientist-tag
- GET /v1/user/profile/{user_id}
- GET /v1/user/authenticate/cilogon
- GET /v1/user/authenticate/cilogon/authorize
Academic user routes
- Dataset metadata and assets: POST /v1/neurodata/datasets, GET /v1/neurodata/datasets/owned, GET /v1/neurodata/datasets/modifiable, GET /v1/neurodata/datasets/incomplete, GET /v1/neurodata/datasets/{dataset_id}, DELETE /v1/neurodata/datasets/{dataset_id}, GET /v1/neurodata/datasets/{dataset_id}/provenance, GET /v1/neurodata/datasets/{dataset_id}/versions, GET /v1/neurodata/datasets/{dataset_id}/metadata.jsonld, GET /v1/neurodata/datasets/{dataset_id}/citation, GET /v1/neurodata/datasets/{dataset_id}/lineage, POST /v1/neurodata/datasets/{dataset_id}/lineage, POST /v1/neurodata/datasets/{dataset_id}/access/request, GET /v1/neurodata/datasets/{dataset_id}/access, GET /v1/neurodata/collections, POST /v1/neurodata/collections, GET /v1/neurodata/collections/{collection_id}, DELETE /v1/neurodata/collections/{collection_id}
- SyQL: POST /v1/syql/plan, POST /v1/syql/explain, POST /v1/syql/exec, POST /v1/syql/cancel
- Saved queries: GET /v1/queries, POST /v1/queries, POST /v1/queries/from-syql, GET /v1/queries/{id}, PUT /v1/queries/{id}, DELETE /v1/queries/{id}, POST /v1/queries/{id}/run, POST /v1/queries/{id}/refresh
- Jobs: POST /v1/jobs, POST /v1/jobs/graph, GET /v1/jobs, GET /v1/jobs/{job_id}, DELETE /v1/jobs/{job_id}, GET /v1/jobs/{job_id}/result, POST /v1/jobs/{job_id}/rerun
- Analytics: GET /v1/analytics/summary, GET /v1/analytics/morphometrics, GET /v1/analytics/comparison, GET /v1/analytics/graph/{dataset_id}/summary, GET /v1/analytics/graph/{dataset_id}/reciprocity, GET /v1/analytics/graph/{dataset_id}/degree-distribution
- Graph: GET /v1/graph/{dataset_id}/metrics, POST /v1/graph/{dataset_id}/motifs, POST /v1/graph/{dataset_id}/motifs/compare-synapse-types, POST /v1/graph/{dataset_id}/shortest-path, POST /v1/graph/{dataset_id}/reachability, POST /v1/graph/{dataset_id}/reachability-curve, POST /v1/graph/{dataset_id}/full-analysis, GET /v1/graph/compare
- Meta-analysis: POST /v1/meta-analysis, GET /v1/meta-analysis/atlas/compare
SuperUser routes
- Federation administration: GET /v1/federation/status, GET /v1/federation/schema, POST /v1/federation/schema/sync, GET /v1/federation/clusters, POST /v1/federation/clusters, DELETE /v1/federation/clusters/{cluster_id}, POST /v1/federation/clusters/{cluster_id}/verify, POST /v1/federation/clusters/{cluster_id}/test/connectivity, POST /v1/federation/clusters/{cluster_id}/test/schema, POST /v1/federation/clusters/{cluster_id}/test/query, GET /v1/federation/benchmarks, POST /v1/federation/benchmarks, GET /v1/federation/benchmarks/aggregate
- Ontology writes: POST /v1/ontology/terms, PUT /v1/ontology/terms/{id}, DELETE /v1/ontology/terms/{id}, POST /v1/ontology/import/csv
Middleware Stack
Requests pass through these layers in order:
- Request ID — UUID v7, propagated via X-Request-ID
- Tracing — structured request/response logging
- Rate limiting — per-IP token bucket (see Rate Limiting)
- Timeout — 60s default, 408 on expiry
- CORS — permissive in dev mode, restricted to api_domain in production
- Compression — automatic response compression
- Body limit — 100 MB max request body
- API version — api-version: v1 response header
Health Check
curl https://api.syndb.xyz/health
{
  "status": "healthy",
  "components": {
    "postgres": { "status": "ok", "latency_ms": 5 },
    "clickhouse": { "status": "ok", "latency_ms": 12 },
    "storage": { "status": "ok", "latency_ms": 8 },
    "meilisearch": { "status": "ok" }
  }
}
Status is degraded if any required component fails, or if Meilisearch is
configured but unreachable. Meilisearch is still optional — when it is unset,
the health payload reports meilisearch.status = "not_configured" without
degrading the overall status.
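As a rough illustration of how a monitoring script might read this payload, the sketch below groups components by their reported status. The function name `summarize_health` is hypothetical, and treating any status other than "ok" or "not_configured" as failing is an assumption, since the failure values are not listed here.

```python
def summarize_health(payload):
    """Split component names by reported status.

    Returns (ok, not_configured, failing) as sorted lists of names.
    """
    ok, not_configured, failing = [], [], []
    for name, info in payload.get("components", {}).items():
        status = info.get("status")
        if status == "ok":
            ok.append(name)
        elif status == "not_configured":
            # Documented as a non-degrading state for Meilisearch.
            not_configured.append(name)
        else:
            # Assumption: every other value signals a problem.
            failing.append(name)
    return sorted(ok), sorted(not_configured), sorted(failing)
```

The top-level `status` field remains the authoritative signal; a breakdown like this is only useful for deciding which component to look at first.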
CLI Reference
The SynDB CLI (syndb) provides command-line access to account management, saved queries, dataset upload, federation, ETL, and Kubernetes workflows.
This page documents the current command surface. If you are working from this repository, you can run the CLI directly without a global install:
cargo run -p cli --features full -- --help
If the repository was cloned without submodules, initialize them first:
git submodule update --init --recursive
Global Options
| Option | Environment Variable | Description |
|---|---|---|
| --server-url | SYNDB_SERVER_URL | API base URL |
| --flight-url | SYNDB_FLIGHT_URL | Arrow Flight endpoint |
| --flight-port | SYNDB_FLIGHT_PORT | Arrow Flight port |
Commands
user — account management
- syndb auth register — create a new account
- syndb auth login — authenticate and store the token locally
- syndb auth logout — revoke the current session
query — saved queries and SyQL helpers
These saved-query commands operate on the server-backed QueryFabric path, not a local on-disk query store.
syndb query list
syndb query save --label "Neuron subset" --table 1 --dataset-id <uuid> --column neuron_id --column cell_type
syndb query save-syql --label "Mouse neurons" "FROM neurons WHERE species = 'mouse' LIMIT 1000"
syndb query show <query-id>
syndb query update <query-id> --label "Updated label"
syndb query run <query-id>
syndb query status <query-id>
syndb query delete <query-id>
syndb query exec "FROM neurons LIMIT 10"
syndb query explain "FROM neurons LIMIT 10"
Current subcommands: save, list, show, update, delete, run, status, save-syql, exec, and explain.
dataset — dataset management
syndb data new --label "Example dataset" --animal "Mus musculus" --microscopy EM --table 1 --brain-structure hippocampus
syndb data prepare --input-dir raw_dataset --output-dir prepared_dataset
syndb data validate --input-dir prepared_dataset
syndb data upload --input-dir prepared_dataset --dataset-id <uuid>
syndb data download --dataset-id <uuid> --output-dir download_dir
syndb data mesh-upload --dataset-id <uuid> --input-dir meshes
syndb data swb-upload --dataset-id <uuid> --input-dir swb
syndb data delete --dataset-id <uuid>
syndb data search reconcile --dry-run
syndb data cache-tags
syndb data gen-test-data --output-dir tmp/test-data
Current subcommands: new, prepare, validate, upload, download, mesh-upload, swb-upload, delete, search reconcile, cache-tags, and gen-test-data.
syndb data search reconcile is the repair path for public full-text
search. It rebuilds the local portion of the shared datasets Meilisearch
index from PostgreSQL, preserves federated documents, and deletes stale local
entries. Use --dry-run to inspect the planned changes without mutating
Meilisearch.
The reconcile command reads its runtime contract from environment variables:
- POSTGRES_HOST / POSTGRES_READ_HOST
- POSTGRES_PORT
- POSTGRES_USERNAME
- POSTGRES_PASSWORD
- POSTGRES_PATH
- MEILISEARCH_URL
- MEILISEARCH_API_KEY
Example:
POSTGRES_HOST=localhost \
POSTGRES_READ_HOST=localhost \
POSTGRES_PORT=5433 \
POSTGRES_USERNAME=syndb \
POSTGRES_PASSWORD=syndb \
POSTGRES_PATH=syndb_test \
MEILISEARCH_URL=http://localhost:7700 \
MEILISEARCH_API_KEY=meili_dev_key \
syndb data search reconcile --dry-run
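Before invoking the command from a script, it can help to confirm that the contract variables are present. The following is a minimal sketch, not part of the CLI: `missing_reconcile_vars` is a hypothetical helper, and accepting either POSTGRES_HOST or POSTGRES_READ_HOST as sufficient is an assumption about how the two host variables relate.

```python
import os

# Variables named in the reconcile runtime contract above.
REQUIRED_VARS = [
    "POSTGRES_PORT",
    "POSTGRES_USERNAME",
    "POSTGRES_PASSWORD",
    "POSTGRES_PATH",
    "MEILISEARCH_URL",
    "MEILISEARCH_API_KEY",
]
# Assumption: either host variable is enough to reach PostgreSQL.
HOST_VARS = ["POSTGRES_HOST", "POSTGRES_READ_HOST"]


def missing_reconcile_vars(env=None):
    """Return the names of contract variables that are unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if not any(env.get(name) for name in HOST_VARS):
        missing.insert(0, "POSTGRES_HOST or POSTGRES_READ_HOST")
    return missing
```

Calling `missing_reconcile_vars()` with no argument checks the current process environment; passing a dictionary makes the check easy to test.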
etl — dataset import pipeline
Most datasets support download, validate, and import subcommands. CAVE-backed datasets may use manual export instead of download, followed by validate and import:
syndb etl <dataset> download # when a static release exists
syndb etl <dataset> validate
syndb etl <dataset> import --data-dir external_datasets/<name> --table neurons --dataset-id <uuid>
spine-morphometry is the main special case: it uses --source kasthuri|ofer-confocal|microns instead of separate dataset keys.
Dataset keys
| Dataset | Key | Description |
|---|---|---|
| FlyWire | flywire | Whole-brain Drosophila connectome |
| Hemibrain | hemibrain | Janelia FlyEM v1.2.1 |
| MANC | manc | Male Adult Nerve Cord |
| Spine Morphometry | spine-morphometry | Dendritic spine morphometry (--source required) |
| C. elegans Hermaphrodite | celegans | Complete hermaphrodite wiring |
| Larval Drosophila | larval | L1 larval brain connectome |
| Allen Cell Types | allen-cell-types | Allen Institute reference |
| NeuroMorpho | neuromorpho | NeuroMorpho.org archive |
| Witvliet Developmental | witvliet | Developmental C. elegans connectome |
| C. elegans Male | celegans-male | Complete male wiring |
| Ciona | ciona | Larval CNS connectome |
| Platynereis | platynereis | Marine annelid connectome |
| MICrONS | microns | Mouse visual cortex |
| H01 | h01 | Human cortical tissue |
| Optic Lobe | optic-lobe | Drosophila optic lobe |
| Male CNS | male-cns | Male central nervous system |
| BANC | banc | Brain And Nerve Cord |
| FANC | fanc | Female Adult Nerve Cord |
| Fish1 | fish1 | Zebrafish brain |
Additional ETL utility commands: seed, update-all, and status.
federation — federation management
syndb ops federation init --cluster-name my-lab-node --clickhouse-endpoint clickhouse.mylab.edu --federation-password "$SYNDB_FEDERATION_PASSWORD"
syndb ops federation status
syndb ops federation sync-schema --dry-run
syndb ops federation test
syndb ops federation clusters
syndb ops federation logout
See Node Setup for detailed usage.
graph-precompute — Batch Graph Computation
syndb graph-precompute --dataset flywire
Pre-computes graph metrics and stores results in ClickHouse materialized tables for one or more current dataset keys.
k8s — Kubernetes administration
Current top-level groups:
- syndb ops k8s etl — ETL jobs on Kubernetes
- syndb ops k8s secrets — secret management helpers
- syndb ops k8s sites — federation site helpers
Common ETL operations:
syndb ops k8s etl status
syndb ops k8s etl report
syndb ops k8s etl watch
syndb ops k8s etl retry
syndb ops k8s etl cleanup
syndb ops k8s etl reset
bench — Benchmarking
Performance testing suite for API and federation queries.
syndb-plot — Manuscript figures
The plotting package exposes a separate syndb-plot CLI for benchmark plots
and manuscript figure rendering.
Benchmark-backed manuscript panels read benchmark/results/ by default. To
render against a production benchmark bundle elsewhere, pass the directory
explicitly:
syndb-plot manuscript-panels \
--benchmark-results-dir documentation/manuscript/benchmarks/cluster-rebuild-2026-05-23
The same flag is available on focused panel/build commands:
syndb-plot manuscript-inspect-panel 5 A --benchmark-results-dir <results-dir>
syndb-plot manuscript-build-one 5 --benchmark-results-dir <results-dir>
manuscript-panels uses the provided directory for its benchmark preflight and
for benchmark-backed composite figures. manuscript-composites only compiles
already-rendered panels and does not read benchmark parquet.
ci — CI helpers
Internal build, test, and image automation helpers used by project workflows.
completions — Shell Completions
syndb completions bash > ~/.local/share/bash-completion/completions/syndb
syndb completions zsh > ~/.zfunc/_syndb
syndb completions fish > ~/.config/fish/completions/syndb.fish
Ontology & Vocabularies
SynDB uses controlled vocabularies to standardize dataset metadata — brain regions, species, microscopy techniques, and neurotransmitter types.
Browsing Terms
List All Vocabularies
curl https://api.syndb.xyz/v1/ontology/vocabularies
List Terms in a Vocabulary
curl "https://api.syndb.xyz/v1/ontology/terms?vocabulary=brain_region"
Search Terms
curl "https://api.syndb.xyz/v1/ontology/terms?q=mushroom"
Term Hierarchy
# Get child terms
curl https://api.syndb.xyz/v1/ontology/terms/{term_id}/children
# Get ancestor terms
curl https://api.syndb.xyz/v1/ontology/terms/{term_id}/ancestors
Validating Terms
Before submitting dataset metadata, validate that your terms exist:
curl -X POST -H "Content-Type: application/json" \
https://api.syndb.xyz/v1/ontology/terms/validate \
-d '{"terms": ["mushroom_body", "lateral_horn"]}'
Returns which terms are valid and which are unrecognized.
Administration (SuperUser)
Create a Term
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/ontology/terms \
  -d '{
    "vocabulary": "brain_region",
    "code": "MB_CALYX",
    "label": "calyx",
    "parent_id": "00000000-0000-0000-0000-000000000001",
    "uri": "https://example.org/terms/mb-calyx",
    "metadata": {
      "source": "manual"
    }
  }'
Update a Term
curl -X PUT -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/ontology/terms/{term_id} \
  -d '{
    "label": "calyx",
    "uri": "https://example.org/terms/mb-calyx",
    "metadata": {
      "status": "reviewed"
    }
  }'
Deprecate a Term
curl -X DELETE -H "Authorization: Bearer $TOKEN" \
https://api.syndb.xyz/v1/ontology/terms/{term_id}
Deprecated terms remain in the system but are flagged in search results and validation.
Bulk Import
curl -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://api.syndb.xyz/v1/ontology/import/csv \
  -d '{
    "vocabulary": "brain_region",
    "csv_data": "code,label,uri,parent_code\nMB,mushroom body,https://example.org/terms/mb,\nMB_CALYX,calyx,https://example.org/terms/mb-calyx,MB"
  }'
CSV format: code,label,uri,parent_code
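Hand-writing the escaped csv_data string gets error-prone once there are more than a few terms. The sketch below shows one way to build the request body with Python's standard csv and json modules; `build_import_payload` is a hypothetical helper, and only the two payload fields shown in the example above are assumed.

```python
import csv
import io
import json


def build_import_payload(vocabulary, rows):
    """Build the JSON body for a CSV bulk import.

    rows is an iterable of (code, label, uri, parent_code) tuples; use an
    empty string for parent_code on root terms.
    """
    buffer = io.StringIO()
    # "\n" line endings match the csv_data string in the example above.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["code", "label", "uri", "parent_code"])
    writer.writerows(rows)
    csv_data = buffer.getvalue().rstrip("\n")
    return json.dumps({"vocabulary": vocabulary, "csv_data": csv_data})
```

Using csv.writer means labels that contain commas or quotes are quoted correctly, and json.dumps takes care of escaping the newlines.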
Integration with Datasets
When creating or updating dataset metadata, brain region, species, and microscopy fields are validated against the ontology. Invalid terms are rejected with an error listing the closest matches.
External References
SynDB’s ontology system draws from established biomedical and neuroscience ontologies:
- OBO Foundry — community-maintained interoperable ontologies for biology and biomedicine
- UBERON — multi-species anatomy ontology used for brain region terms
- ChEBI — Chemical Entities of Biological Interest, covering neurotransmitter classifications
- Allen Brain Atlas — reference atlas for mouse and human brain region parcellations
- Virtual Fly Brain — Drosophila neuroanatomy ontology browser and data integration hub
Rate Limiting
SynDB enforces per-IP rate limiting using a token bucket algorithm.
Defaults
| Parameter | Default | Environment Variable |
|---|---|---|
| Requests per second | 100 | RATE_LIMIT_PER_SECOND |
| Burst capacity | 200 | RATE_LIMIT_BURST |
The bucket refills at the sustained rate. Burst capacity allows short spikes above the sustained rate.
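To make the interaction between the two parameters concrete, here is a small token bucket model in Python. It is a sketch of the general algorithm using the default values from the table, not SynDB's implementation; the class name and the explicit `now` argument (used instead of a real clock so the behaviour is reproducible) are choices made for illustration.

```python
class TokenBucket:
    """Token bucket with a sustained refill rate and a burst capacity."""

    def __init__(self, rate_per_second=100.0, burst=200.0, now=0.0):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = burst  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now):
        """Consume one token if available; `now` is a time in seconds."""
        elapsed = max(0.0, now - self.last)
        # Refill at the sustained rate, never exceeding the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With the defaults, a client that has been idle can send 200 requests at once, after which it is held to the sustained 100 requests per second until the bucket refills.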
Client IP Detection
The rate limiter identifies clients by IP address, checked in order:
- X-Forwarded-For header (first address)
- X-Real-IP header
- Localhost (fallback for direct connections)
Behind a reverse proxy, ensure X-Forwarded-For is set correctly.
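The lookup order can be sketched as a small function. This is an illustration of the documented order rather than the server's code; `client_ip` is a hypothetical name, and representing the localhost fallback as 127.0.0.1 is an assumption.

```python
def client_ip(headers, fallback="127.0.0.1"):
    """Pick the client address using the lookup order described above."""
    # Header names are case-insensitive, so normalize before lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    forwarded = normalized.get("x-forwarded-for", "")
    first = forwarded.split(",")[0].strip()
    if first:
        return first
    real_ip = normalized.get("x-real-ip", "").strip()
    if real_ip:
        return real_ip
    return fallback
```

Because the first X-Forwarded-For entry wins, a proxy that appends to a client-supplied header instead of overwriting it would let clients choose their own rate-limit identity, which is why the proxy configuration matters.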
Response on Limit
When the rate limit is exceeded:
HTTP/1.1 429 Too Many Requests
Retry-After: 1
Too many requests
Client Handling
Respect the Retry-After header and implement exponential backoff:
import time
import requests

def request_with_backoff(url, headers, max_retries=3):
    """GET a URL, backing off exponentially while the API returns 429."""
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers)
        if resp.status_code != 429:
            return resp
        # Scale the server-provided delay by 2^attempt.
        wait = int(resp.headers.get("Retry-After", 1)) * (2 ** attempt)
        time.sleep(wait)
    raise Exception("Rate limited after retries")
For batch operations, throttle to well under 100 req/s to leave headroom for interactive use.
Data Standards
SynDB implements open data standards to ensure interoperability, discoverability, and long-term preservation of neuroscience datasets.
FAIR Data Principles
SynDB aligns with the FAIR principles for scientific data management:
- Findable: Datasets are indexed by Meilisearch full-text search. Each dataset is assigned a persistent UUID. Metadata is exposed via JSON-LD for search engine discovery.
- Accessible: A RESTful API with an OpenAPI specification provides structured access. Arrow Flight enables high-throughput data transfer. Authentication uses standardized PASETO tokens.
- Interoperable: Metadata is serialized as JSON-LD using Schema.org vocabulary. Controlled vocabularies draw from OBO Foundry ontologies. Data is exported in Apache Parquet and Apache Arrow formats.
- Reusable: Licenses are stored as machine-readable SPDX identifiers. Provenance tracking, version history, and auto-generated citations support reproducibility.
Metadata Standards
SynDB dataset metadata follows established web standards:
- Schema.org: Dataset metadata uses the Schema.org Dataset type, enabling discovery by Google Dataset Search and other aggregators.
- JSON-LD: Metadata is serialized as JSON-LD – a linked data format that embeds semantic context in standard JSON. Access via GET /v1/neurodata/datasets/{id}/metadata.jsonld.
- DCAT: Vocabulary alignment with the W3C Data Catalog Vocabulary for catalog interoperability.
- Dublin Core: Core metadata terms (title, creator, date, rights) follow Dublin Core conventions.
- SynDB Connectomics Data Profile: Required profile for ontology-backed dataset metadata, DataCite relation types, JSON-LD export, and archival metadata bundles.
Citation Formats
SynDB generates citations in multiple formats via GET /v1/neurodata/datasets/{id}/citation?format=<fmt>:
| Format | Use Case | Specification |
|---|---|---|
| BibTeX | LaTeX documents | .bib entries |
| RIS | Reference managers (Zotero, EndNote, Mendeley) | Tagged text format |
| APA | Inline text citations | APA 7th edition |
| CSL-JSON | Programmatic citation processing | Citation Style Language data model |
| CFF | Software/dataset citation files | CITATION.cff format |
License Identifiers
SynDB uses SPDX license identifiers internally. When you select a license during dataset creation, it is stored as an SPDX expression (e.g., ODC-BY-1.0, CC-BY-4.0). This enables machine-readable license detection and compatibility checking.
See the license selection guide for help choosing a license.
Data Formats
| Format | MIME Type | Used For |
|---|---|---|
| Apache Parquet | application/vnd.apache.parquet | Dataset export and DOWNLOAD parquet in SyQL |
| Apache Arrow IPC | application/vnd.apache.arrow.stream | Job results, Flight data transfer |
| CSV | text/csv | DOWNLOAD csv in SyQL, ontology bulk import |
Arrow IPC and Parquet files can be read with pandas, Polars, DuckDB, or any Arrow-compatible library.
External Integrations
SynDB metadata is designed to interoperate with these neuroscience data ecosystems:
| Platform | Integration |
|---|---|
| DataCite | DOI registration and metadata schema alignment (DataCite Metadata Schema 4.5) |
| DANDI Archive | Complementary neurophysiology data archive |
| OpenNeuro | Complementary neuroimaging data archive |
| Google Dataset Search | Automatic discovery via Schema.org/JSON-LD metadata |
SynDB Connectomics Data Profile
The SynDB Connectomics Data Profile defines the metadata contract required for datasets to be findable, accessible, interoperable, and reusable in SynDB.
Required Dataset Metadata
Every dataset must include:
- UUIDv7 dataset identifier.
- DataCite dataset DOI for production F1=3 FAIR claims.
- Human-readable dataset label.
- SPDX-compatible data license.
- Access policy: open, registered, or restricted.
- Species resolved to an active ontology_term in the ncbi_taxon vocabulary.
- Microscopy technique resolved to an active ontology_term in the microscopy vocabulary.
- Brain regions resolved to active ontology terms in uberon, fbbt, or SynDB’s internal brain_region vocabulary.
- Declared SynDB table list and uploaded table state.
- Provenance, version, citation, lineage, and archive links.
Ontology Vocabularies
SynDB metadata uses these vocabularies:
| Vocabulary | Use |
|---|---|
| ncbi_taxon | Species and taxonomic identity |
| microscopy | Imaging and reconstruction modality |
| uberon | Vertebrate anatomical structures |
| fbbt | Drosophila anatomical structures |
| brain_region | SynDB terms for structures not yet mapped to external vocabularies |
| chebi | Neurotransmitter and chemical identity |
All required ontology terms must be active, have a URI, and carry a registry version.
Relations
Dataset lineage and external references use DataCite relation types, including IsDerivedFrom, IsSourceOf, IsPartOf, HasPart, References, IsReferencedBy, IsVersionOf, and HasVersion.
Invalid relation strings are rejected.
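A client can catch obvious mistakes before submitting lineage. The sketch below checks a relation string against the types named above; the set is illustrative because the text says "including", so the server may accept further DataCite relation types, and exact case-sensitive matching is an assumption.

```python
# Relation types named in this section. The real accepted list may be
# longer; treat this set as illustrative.
KNOWN_RELATION_TYPES = frozenset({
    "IsDerivedFrom",
    "IsSourceOf",
    "IsPartOf",
    "HasPart",
    "References",
    "IsReferencedBy",
    "IsVersionOf",
    "HasVersion",
})


def is_known_relation(relation):
    """Return True if the string exactly matches a listed relation type."""
    return relation in KNOWN_RELATION_TYPES
```

A False result here is a hint to double-check the spelling, not proof that the server would reject the value.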
Linked Data And Archive Contract
Each dataset exposes:
- GET /v1/neurodata/datasets/{dataset_id}/metadata.jsonld
- GET /v1/neurodata/datasets/{dataset_id}/archive.json
- GET /v1/neurodata/datasets/{dataset_id}/doi
- GET /v1/neurodata/datasets/{dataset_id}/provenance
- GET /v1/neurodata/datasets/{dataset_id}/versions
- GET /v1/neurodata/datasets/{dataset_id}/lineage
- GET /v1/neurodata/datasets/{dataset_id}/citation
The JSON-LD document uses Schema.org, DCAT, Dublin Core, DataCite, PROV, SPDX, and SynDB terms. It includes conformsTo pointing to this profile.
The archive bundle is the long-term metadata preservation surface. It includes dataset metadata, JSON-LD, DataCite metadata, citations, provenance, versions, lineage, external references, deletion status, and navigation links.
Dataset DOIs are minted through DataCite when the deployment has DATACITE_ENABLED, repository credentials, and a DOI prefix configured. Local or development deployments must report DOI minting as unavailable rather than fabricating DOI identifiers.
Validation Rules
Dataset creation fails if any required species, microscopy, or brain-region term cannot be resolved to a non-deprecated ontology term with a URI.
Startup validation fails if the ontology registry is incomplete for any neurometa enum variant or persisted dataset metadata term. Missing data must be fixed upstream by adding or repairing the relevant ontology term before SynDB starts.
For production FAIR scoring, each published dataset must have a DataCite DOI record. Publication DOIs remain related identifiers and do not replace the dataset DOI.
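The dataset-creation rule can be expressed as a short check. This is a sketch under stated assumptions, not the server's validation code: `unresolved_terms` is a hypothetical helper, and the "uri" and "deprecated" field names on a term record are invented for illustration.

```python
def unresolved_terms(requested, registry):
    """Return requested term codes that fail the resolution rule.

    registry maps a term code to a dict with hypothetical "uri" and
    "deprecated" fields. A term passes only if it exists, is not
    deprecated, and has a non-empty URI.
    """
    failures = []
    for code in requested:
        term = registry.get(code)
        if term is None or term.get("deprecated") or not term.get("uri"):
            failures.append(code)
    return failures
```

An empty result means every requested term satisfies the rule; any returned code would need to be added or repaired in the ontology first.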
Architecture Decisions
These records preserve the rationale behind core SynDB architecture choices. They are historical decision notes, not exhaustive operational guides; use the topic-specific documentation pages for current procedures and route details.
- ADR-001: PASETO v4.local for Authentication
- ADR-002: Kameo Actors for Federation Orchestration
- ADR-003: ClickHouse for Analytical Data Storage
- ADR-004: Apache Arrow Flight for Data Transport
- ADR-005: Dual HTTP + gRPC Server Architecture
- ADR-006: libp2p for Peer-to-Peer Federation
ADR-001: PASETO v4.local for Authentication
Date: 2024-01-15
Status: Accepted
Context
SynDB exposes two server protocols from the same binary: an Axum HTTP API and an Apache Arrow Flight gRPC service. Both require stateless authentication that can be validated without a database round-trip on every request.
JSON Web Tokens (JWT) are the industry default, but they carry well-documented
pitfalls: algorithm confusion attacks (alg: none), RSA/HMAC substitution, and
an overly flexible header that increases the attack surface. Because both
servers run in the same process and share an AppState, there is no need for
public-key cryptography or third-party token verification – a symmetric scheme
is simpler and sufficient.
Decision
Use PASETO v4.local (symmetric authenticated encryption: XChaCha20 with a keyed BLAKE2b MAC)
via the rusty_paseto crate. Tokens carry user claims encrypted with a shared
256-bit secret key.
- Access tokens are short-lived (15 minutes).
- Refresh tokens are rotated on each use and stored in PostgreSQL, enabling server-side revocation.
- The shared secret is loaded once at startup from the application settings and held in AppState.
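The refresh-token rotation described above can be modelled in a few lines. The sketch below uses an in-memory dictionary in place of the PostgreSQL table and random strings in place of PASETO tokens; the class and method names are hypothetical and it illustrates the rotation idea only, not SynDB's auth code.

```python
import secrets


class RefreshTokenStore:
    """In-memory stand-in for the PostgreSQL refresh-token table."""

    def __init__(self):
        self._active = {}  # refresh token -> user id

    def issue(self, user_id):
        """Create and record a new refresh token for a user."""
        token = secrets.token_urlsafe(32)
        self._active[token] = user_id
        return token

    def rotate(self, token):
        """Exchange a refresh token for a new one.

        The old token is removed, so presenting it again returns None.
        """
        user_id = self._active.pop(token, None)
        if user_id is None:
            return None
        return self.issue(user_id)

    def revoke_user(self, user_id):
        """Server-side revocation: drop every active token for a user."""
        for token in [t for t, u in self._active.items() if u == user_id]:
            del self._active[token]
```

Because each refresh token works once, a stolen token that has already been rotated by the legitimate client is useless, and one that has not yet been rotated can still be cut off with `revoke_user`.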
Consequences
Positive:
- Eliminates the entire class of JWT algorithm confusion vulnerabilities.
- Symmetric encryption means tokens are opaque to clients – no claim leakage.
- Single shared secret is trivial to manage when both protocols live in one binary.
- Short-lived access tokens plus refresh rotation limit the blast radius of a leaked token.
Negative:
- Tokens cannot be decoded or inspected on the client side (by design, but complicates client-side debugging).
- If SynDB ever splits into multiple independently deployed services, the shared secret must be distributed securely to each service.
- Refresh token storage adds a PostgreSQL dependency to the auth flow.
ADR-002: Kameo Actors for Federation Orchestration
Date: 2024-03-10
Status: Accepted
Context
SynDB federation delegates queries to multiple remote nodes simultaneously. A federated query must fan out Flight gRPC calls, enforce per-node timeouts, retry transient failures, and aggregate partial results into a single response stream.
Implementing this with raw tokio::spawn and channels leads to scattered state,
ad-hoc cancellation logic, and difficult-to-test concurrency patterns. We need a
structured concurrency model that encapsulates per-query state and lifecycle.
Decision
Use the kameo actor framework for federation query orchestration. Each federated query spawns a coordinator actor that in turn spawns per-node worker actors. Workers issue Flight gRPC calls and stream results back to the coordinator via typed messages.
- Actor mailboxes provide natural back-pressure.
- Supervision trees handle worker failures without crashing the coordinator.
- Actor state is private and mutation-free from the caller’s perspective.
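The fan-out, per-node timeout, and aggregation responsibilities can be illustrated without an actor framework. The sketch below uses Python's asyncio rather than kameo, so it shows the coordination pattern only and not the actor model itself; `query_node` is a stand-in that sleeps instead of issuing a Flight call, and all names are hypothetical.

```python
import asyncio


async def query_node(node, delay):
    """Stand-in for a per-node Flight call; sleeps, then returns rows."""
    await asyncio.sleep(delay)
    return [f"{node}-row"]


async def fan_out(node_delays, timeout):
    """Query every node concurrently and collect partial results.

    Returns (rows, failed_nodes); nodes that exceed `timeout` are reported
    instead of failing the whole query.
    """
    async def guarded(node, delay):
        try:
            rows = await asyncio.wait_for(query_node(node, delay), timeout)
            return node, rows
        except asyncio.TimeoutError:
            return node, None

    results = await asyncio.gather(
        *(guarded(node, delay) for node, delay in node_delays.items())
    )
    rows = [row for _, part in results if part is not None for row in part]
    failed = [node for node, part in results if part is None]
    return rows, failed
```

For example, `asyncio.run(fan_out({"a": 0.0, "b": 0.5}, timeout=0.05))` returns the rows from node a and reports node b as failed. Retries and streaming, which the coordinator also has to handle, are left out to keep the sketch short.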
Consequences
Positive:
- Clean separation of concerns: each actor owns its state and lifecycle.
- Message-passing eliminates shared mutable state across concurrent operations.
- Supervision and timeout semantics are built into the framework rather than hand-rolled.
- Actors are straightforward to unit-test in isolation by sending messages directly.
Negative:
- Adds a runtime dependency on the kameo crate and its executor integration.
- Actor mailbox overhead exists, though it is negligible for the federation workload (tens of messages, not millions).
- Developers must learn the actor model; it is less familiar than plain async/await to most Rust programmers.
ADR-003: ClickHouse for Analytical Data Storage
Date: 2024-01-08
Status: Accepted
Context
Neuroscience datasets in SynDB contain billions of rows – neurons, synapses, connectivity matrices, and physiological measurements – that are queried with columnar scans, aggregations, and large joins. PostgreSQL handles the OLTP metadata workload well (users, datasets, permissions) but performs poorly on analytical queries at this scale due to its row-oriented storage engine.
A single-database approach would force a choice between metadata flexibility and analytical performance. Scientific data is append-only and immutable once ingested, which relaxes consistency requirements for the analytical store.
Decision
Adopt a dual-database strategy:
- PostgreSQL (via SeaORM) for metadata, user accounts, permissions, and all OLTP operations.
- ClickHouse for analytical neuroscience data, partitioned by dataset_id using the MergeTree engine family.
The API service connects to both databases. Metadata queries hit PostgreSQL; data queries are translated to ClickHouse SQL and executed via the native ClickHouse HTTP or TCP client.
Consequences
Positive:
- Orders-of-magnitude faster columnar scans and aggregations compared to PostgreSQL at billion-row scale.
- ClickHouse’s compression (LZ4/ZSTD) dramatically reduces storage for repetitive scientific data.
- Partitioning by dataset_id enables efficient data lifecycle management (drop partition on dataset deletion).
- Eventual consistency is acceptable because scientific data is immutable after ingestion.
Negative:
- Increased operational complexity: two database systems to provision, monitor, back up, and upgrade.
- No cross-database transactions; the application must handle consistency at the orchestration layer.
- ClickHouse’s mutation and update semantics differ from PostgreSQL, requiring developer awareness.
ADR-004: Apache Arrow Flight for Data Transport
Date: 2024-01-10
Status: Accepted
Context
Neuroscience data transfers between SynDB and analytical clients can reach hundreds of megabytes per query result. Serializing this data as HTTP JSON incurs significant overhead: text encoding inflates payload size, row-oriented JSON requires full deserialization, and there is no native streaming support for back-pressure.
Data science clients in Python, R, and Julia already use Apache Arrow as their in-memory columnar format (via pandas, polars, and similar libraries). A transport layer that speaks Arrow natively would eliminate serialization costs.
Decision
Use Apache Arrow Flight (gRPC-based columnar data transport) via the tonic
gRPC framework and arrow-flight crate. The Flight service runs on port 50051
alongside the Axum HTTP API on port 8080, both in the same binary process.
- Query results are materialized as Arrow RecordBatches and streamed to clients via do_get.
- Dataset ingestion uses do_put for streaming uploads.
- Flight tickets encode query parameters as serialized descriptors.
Consequences
Positive:
- Zero-copy data transfer: clients receive Arrow RecordBatches directly usable by pyarrow, polars, DataFusion, and similar tools without deserialization.
- gRPC bidirectional streaming provides natural back-pressure and supports arbitrarily large result sets without buffering everything in memory.
- Columnar format enables predicate pushdown and projection pruning at the transport level.
Negative:
- gRPC adds complexity compared to plain HTTP: clients need a Flight-aware library rather than a simple HTTP client.
- Binary protocol is harder to debug than JSON; requires tooling like grpcurl or custom Flight clients for inspection.
- Running two server protocols in one binary increases shutdown coordination complexity (see ADR-005).
ADR-005: Dual HTTP + gRPC Server Architecture
Date: 2024-01-10
Status: Accepted
Context
SynDB serves two distinct client populations with different transport needs:
- Web UI and REST clients require HTTP/JSON endpoints for CRUD operations on metadata, user management, and administrative tasks.
- Data science clients require high-performance Apache Arrow Flight (gRPC) for streaming large analytical datasets (see ADR-004).
Running these as separate binaries would double deployment artifacts, duplicate shared state (database connection pools, S3 clients, configuration), and complicate service discovery and health checking.
Decision
Run both Axum HTTP (port 8080) and tonic Flight gRPC (port 50051)
servers in the same binary process. Both servers share a single AppState
containing database connections, S3 client, PASETO secret, and application
settings.
Concurrent operation is achieved via tokio::select! on both server futures,
with a shared shutdown signal (broadcast channel) for coordinated graceful
termination.
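The shared shutdown signal can be illustrated with a small asyncio model. This is a Python analogue of the coordination idea, not the Rust implementation and not a model of tokio::select! itself: an asyncio.Event plays the role of the broadcast channel, and the two "servers" are placeholder loops with hypothetical names.

```python
import asyncio


async def serve(name, shutdown, log):
    """Stand-in server loop that runs until the shared signal is set."""
    log.append(f"{name} started")
    await shutdown.wait()
    log.append(f"{name} stopped")


async def run_both():
    """Start two placeholder servers and stop both with one signal."""
    shutdown = asyncio.Event()  # plays the role of the broadcast channel
    log = []
    servers = [
        asyncio.create_task(serve("http", shutdown, log)),
        asyncio.create_task(serve("flight", shutdown, log)),
    ]
    await asyncio.sleep(0)  # let both loops start
    shutdown.set()  # one signal reaches both servers
    await asyncio.gather(*servers)
    return log
```

Running `asyncio.run(run_both())` yields a log in which both servers start and both stop, which is the property the shared signal is meant to guarantee.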
Consequences
Positive:
- Single deployment artifact (one container image, one binary) simplifies CI/CD and operational management.
- Shared connection pools for PostgreSQL, ClickHouse, and S3 reduce total resource consumption versus two separate processes.
- Both servers validate PASETO tokens with the same in-memory secret – no secret distribution problem.
- Health checks and readiness probes need only target one process.
Negative:
- A crash in either server’s accept loop brings down both protocols.
- Graceful shutdown must coordinate two listeners; a stalled gRPC stream can delay HTTP shutdown and vice versa.
- Resource contention (thread pool, memory) between HTTP and gRPC workloads is harder to isolate than with separate processes.
ADR-006: libp2p for Peer-to-Peer Federation
Date: 2024-03-10
Status: Accepted
Context
SynDB federation enables multiple institutions to share and query datasets without relying on a central broker or registry. Participating nodes may sit behind NATs, institutional firewalls, or cloud VPCs, so the networking layer must handle peer discovery, NAT traversal, and encrypted transport without requiring manual endpoint configuration.
A centralized hub-and-spoke model would create a single point of failure and raise data-sovereignty concerns for institutions that want to retain control over their datasets.
Decision
Use libp2p with the following configuration for federation networking:
- QUIC transport for encrypted, multiplexed connections with built-in TLS 1.3.
- mDNS for zero-configuration local/LAN peer discovery.
- Relay nodes for NAT traversal when direct connections are not possible.
- The federation swarm is managed by kameo actors (see ADR-002), with the swarm event loop running in a dedicated actor.
- The node registry uses papaya lock-free concurrent hash maps for high-throughput reads without contention.
Consequences
Positive:
- True decentralized federation: no central broker, no single point of failure.
- QUIC provides encryption and multiplexing out of the box, eliminating the need for a separate TLS termination layer.
- mDNS enables instant discovery in development and on-premise deployments without configuration.
- Lock-free maps via papaya allow the node registry to scale to many concurrent readers without mutex contention.
Negative:
- Distributed systems complexity: the federation must handle network partitions, partial failures, and eventually consistent peer state.
- libp2p’s Rust implementation has a large dependency tree and can increase compile times.
- NAT traversal via relay nodes adds latency and requires at least one publicly reachable relay to be available.
- Debugging peer-to-peer networking issues is harder than debugging client-server HTTP calls.
Glossary
Neuroscience Terms
Axon — The elongated projection of a neuron that conducts electrical impulses away from the cell body. GO:0030424
Brain region — An anatomically or functionally defined subdivision of the brain. SynDB uses terms from the UBERON multi-species anatomy ontology.
Cable length — The total path length of a neuron’s skeletal reconstruction, measured in nanometers.
Cell type — A classification of neurons by morphology, connectivity pattern, molecular markers, or electrophysiology.
Connectome — A comprehensive map of neural connections in a nervous system. Wikipedia
Degree (in/out) — The number of incoming (in-degree) or outgoing (out-degree) synaptic connections of a neuron.
Dendrite — A branched projection of a neuron that receives synaptic input. GO:0030425
Dendritic spine — A small membranous protrusion on a dendrite that forms the postsynaptic side of most excitatory synapses. GO:0043197
Mitochondria — Organelles responsible for ATP production; their density in neurons correlates with synaptic activity. GO:0005739
Neuron — The fundamental unit of the nervous system; an electrically excitable cell that communicates via synapses. CL:0000540
Neurotransmitter — A chemical substance released at a synapse to transmit signals. Common types in SynDB: GABA, glutamate, acetylcholine, dopamine, octopamine, serine.
Pre-synaptic terminal — The axon terminal from which neurotransmitter is released into the synaptic cleft. GO:0098793
Reciprocity — The fraction of synaptic connections in a network that are bidirectional.
Sphericity — A measure of how closely a shape approximates a sphere (1.0 = perfect sphere).
Synapse — A junction between two neurons where signals are transmitted chemically or electrically. GO:0045202
Triadic census — An enumeration of all 16 possible three-node directed subgraph patterns in a network, used to characterize network motifs.
Vesicle — A small membrane-bound compartment in the pre-synaptic terminal that stores neurotransmitter molecules. GO:0099503
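Degree and reciprocity, as defined above, can be computed directly from a directed edge list. The following is a minimal sketch, not SynDB's implementation: the function name and the toy edges are hypothetical, duplicate connections are collapsed, and reciprocity is taken as the share of directed connections whose reverse connection also exists (other conventions count neuron pairs instead).

```python
from collections import Counter


def degree_and_reciprocity(edges):
    """Return (in_degree, out_degree, reciprocity) for directed (pre, post) pairs."""
    unique = set(edges)  # each directed connection counts once
    out_degree = Counter(pre for pre, _ in unique)
    in_degree = Counter(post for _, post in unique)
    # A connection is bidirectional when its reverse is also present.
    mutual = sum(
        1 for pre, post in unique if pre != post and (post, pre) in unique
    )
    reciprocity = mutual / len(unique) if unique else 0.0
    return in_degree, out_degree, reciprocity


# Toy network: a<->b is the only bidirectional pair.
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
in_deg, out_deg, r = degree_and_reciprocity(edges)
print(out_deg["a"], in_deg["a"], r)  # prints: 2 1 0.5
```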
Platform Terms
Academic verification — Identity verification via CILogon institutional login, required for compute-intensive operations.
Arrow Flight — A high-performance gRPC protocol from Apache Arrow used for streaming data transfer between SynDB components.
Collection — A curated grouping of datasets for organizational or meta-analysis purposes. See Collections & Tags.
Compartment — A structural subdivision of a neuron (axon, dendrite, spine, terminal, vesicle, mitochondria) that maps to a SynDB table.
Dataset — A collection of neuroanatomical measurements sharing common metadata (species, brain region, microscopy method). The fundamental unit of organization in SynDB.
ETL — Extract, Transform, Load. The pipeline that imports external connectomics datasets into SynDB’s schema. See External sources.
Federation — A decentralized architecture where multiple institutions run independent SynDB nodes while participating in cross-institutional queries. See Federation overview.
Hub — The central coordinating instance in a federation. Runs the full SynDB stack and routes cross-cluster queries.
Job — An asynchronous unit of work (query execution, graph analysis) managed by the job queue. See Jobs system.
Materialized view (MV) — A pre-aggregated ClickHouse table that stores intermediate results for fast analytics. SyQL automatically rewrites eligible queries to use MVs.
Node — A lightweight federation participant running ClickHouse and the syndb-node binary. Data stays on the node’s infrastructure.
Provenance — The audit trail tracking who created, modified, or derived from a dataset. See Provenance & Citations.
SyQL — SynDB Query Language. A declarative SQL-like language that resolves dataset metadata into optimized ClickHouse queries. See SyQL documentation.
Table — A typed schema within SynDB corresponding to a neuronal compartment (e.g., neurons, synapses, axons, dendrites). Each table has its own column definitions.
Choosing a license for your dataset
When sharing datasets derived from microscopy data, selecting an appropriate license is crucial for ensuring the proper use and distribution of your work. Different licenses offer varying degrees of freedom and control over your data. Here, we outline some popular licenses, their key features, and considerations to help you choose the right one for your needs.
Considerations for Choosing a License
- Intended Use: Determine whether you want your data to be used freely or with certain restrictions, such as non-commercial use only.
- Credit and Attribution: Decide if you want to receive credit for your work and if it’s important for you to see how others are using your data.
- Derivative Works: Consider whether you want derivative works to be allowed and if they should be shared under the same terms.
- Commercial Use: Reflect on whether you want to permit commercial use of your data. Your institution may have specific policies regarding commercial use.
Licenses
The following are some common licenses used for sharing data on the web, which we also use on the SynDB platform.
Tip
Current default
The current SynDB UI and CLI default to CC BY 4.0. Open Data Commons licenses are still available and may be a better fit for some dataset-sharing policies.
Open Data Commons (ODC) Licenses
Open Data Commons (ODC) licenses are specifically tailored for datasets and databases, focusing on maximizing accessibility and proper attribution in data sharing.
PDDL (Public Domain Dedication and License)
Places the dataset in the public domain, allowing unrestricted use and maximizing openness and usability.
ODC-BY (Attribution License)
Allows use with proper credit to the original creator, ensuring acknowledgment while enabling broad use.
ODC-ODbL (Open Database License)
Permits sharing, modifying, and using the dataset with attribution and requires derivative databases to be shared under the same license, promoting open access and collaborative improvement while keeping derivative databases equally accessible.
Creative Commons (CC) Licenses
Creative Commons (CC) licenses are versatile and well-suited for a wide range of creative works, including datasets.
CC0 (Public Domain Dedication)
Allows the use of the dataset without any restrictions, making it ideal for maximizing usability and dissemination.
CC BY (Attribution)
Allows users to use the dataset as long as they provide appropriate credit to the original creator, ensuring wide use while acknowledging the creator’s work.
CC BY-SA (Attribution-ShareAlike)
Permits use of the dataset with appropriate credit and requires sharing derivative works under the same license, keeping derivative works open and shareable under the same terms.
CC BY-NC (Attribution-NonCommercial)
Allows use for non-commercial purposes with proper credit, restricting use to non-commercial purposes while still enabling academic and research use.
CC BY-NC-SA (Attribution-NonCommercial-ShareAlike)
Permits non-commercial use with appropriate credit and sharing of derivative works under the same license, ensuring non-commercial use and open sharing under the same terms.
Conclusion
Selecting the right license for a dataset derived from microscopy data is essential for controlling how your data is used and ensuring it meets your sharing objectives. By considering the options and your specific needs, you can choose a license that balances openness, credit, and control, fostering collaboration and advancement in your field.
SynDB stores licenses as SPDX identifiers for machine-readable compatibility. See Data Standards for details.
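As a quick reference, the licenses described in this article correspond to the SPDX identifiers below. This is a sketch rather than SynDB's own lookup table: the dictionary keys and the suggest_cc_license helper are hypothetical, and the Creative Commons versions other than CC BY 4.0 are assumed to be 4.0.

```python
# SPDX identifiers for the licenses described in this article.
# Assumption: CC licenses other than the CC BY 4.0 default are also version 4.0.
SPDX_IDS = {
    "PDDL": "PDDL-1.0",
    "ODC-BY": "ODC-By-1.0",
    "ODC-ODbL": "ODbL-1.0",
    "CC0": "CC0-1.0",
    "CC BY": "CC-BY-4.0",
    "CC BY-SA": "CC-BY-SA-4.0",
    "CC BY-NC": "CC-BY-NC-4.0",
    "CC BY-NC-SA": "CC-BY-NC-SA-4.0",
}


def suggest_cc_license(attribution, share_alike=False, non_commercial=False):
    """Hypothetical helper: map the considerations above to a CC SPDX id."""
    if not attribution:
        if share_alike or non_commercial:
            # The CC licenses listed here only add these terms on top of BY.
            raise ValueError("share-alike and non-commercial require attribution")
        return SPDX_IDS["CC0"]
    name = "CC BY"
    if non_commercial:
        name += "-NC"
    if share_alike:
        name += "-SA"
    return SPDX_IDS[name]


print(suggest_cc_license(attribution=True))  # prints: CC-BY-4.0
```

The helper only covers the Creative Commons family; the Open Data Commons licenses follow the same attribution and share-alike distinctions but are written for databases.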
Metrics structuring for contribution
Note
Prerequisites
This article assumes you understand how data is stored on SynDB. If you are unsure, we recommend reading the overview article first.
This article is a guide for contributors who wish to upload their data to SynDB. Please don’t hesitate to ask for help on the Discord channel if you have any questions; this part can be challenging.
Data structuring
Schema
Each SynDB table has its own expected columns and types. The current CLI and ETL importers are the authoritative validators: structure your data so it passes syndb data validate for direct uploads, or the relevant syndb etl <dataset> validate command for a supported importer.
The column names and the values stored under them must match the current importer schema for the table you are contributing to. Use the glossary at the end of this article as a quick reference, then validate early with the current tooling before preparing a full upload.
Note
Nano
We use nanometers as the unit for all measurements, including volume, radius, and distance.
Supporting source assets
SynDB expects your primary contribution to be tabular, analysis-ready data. You may also attach supporting source assets such as meshes or SWC skeletons. This does not refer to raw imaging volumes. Place the absolute path to the file in your table file. The following are supported:
- Meshes in .glb format, column name: mesh_path
- SWC files, .swc, column name: swc_path
This list is the main tracker for the supported formats. You may request additional formats on the Discord channel. The SynDB team will review the request and consider adding the new format to the platform.
Columns
Most column types are self-explanatory, but some require additional explanation.
Identifiers and relations
The cid column in your table can hold any unique hashable value; it is replaced by a UUID on upload to SynDB. When uploading a relational dataset, the cid column in the parent table is used to link each child row through its parent_id, so the value in the parent cid column must match the parent_id in the child. parent_enum can be omitted: compartments are defined at the table level, so it is added automatically.
Example
Notice the parent_id column in the child table: it holds the cid of the matching row in the parent table. The parent_enum column is absent from the child table because it is derived from the table file name.
vesicle.csv, child
| cid | neurotransmitter | voxel_radius | distance_to_active_zone | minimum_normal_length | parent_id | centroid_z | centroid_x | centroid_y |
|---|---|---|---|---|---|---|---|---|
| 0 | glutamate | 26.9129 | 705.2450 | 23 | 1 | 4505.232 | 1996.224 | 4953.6 |
| 1 | glutamate | 25.5388 | 615.0213 | 23 | 1 | 4505.232 | 1996.224 | 4953.6 |
| 2 | glutamate | 29.5260 | 513.0701 | 23 | 1 | 4505.232 | 1996.224 | 4953.6 |
| 3 | glutamate | 30.5131 | 479.9224 | 23 | 1 | 4505.232 | 1996.224 | 4953.6 |
| 4 | glutamate | 28.3977 | 454.8248 | 23 | 1 | 4505.232 | 1996.224 | 4953.6 |
| 5 | glutamate | 30.2033 | 459.7557 | 23 | 2 | 4505.232 | 1996.224 | 4953.6 |
| 6 | glutamate | 33.4548 | 374.8131 | 23 | 2 | 4505.232 | 1996.224 | 4953.6 |
| 7 | glutamate | 32.0890 | 455.9293 | 23 | 4 | 4505.232 | 1996.224 | 4953.6 |
axon.csv, parent
| voxel_volume | mitochondria_count | total_mitochondria_volume | cid |
|---|---|---|---|
| 385668034.56 | 1 | 93208043.52 | 1 |
| 1492089016.32 | 4 | 412054179.84 | 2 |
| 327740497.92 | 0 | 0 | 4 |
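Before running the official validators, it can help to check locally that every parent_id in a child file resolves to a cid in the parent file. The sketch below uses only the Python standard library and a trimmed copy of the example tables; the orphaned_children function is a hypothetical helper, not part of the syndb CLI.

```python
import csv
import io

# Trimmed copies of the example tables above (only the columns needed here).
AXON_CSV = """cid,voxel_volume,mitochondria_count
1,385668034.56,1
2,1492089016.32,4
4,327740497.92,0
"""

VESICLE_CSV = """cid,neurotransmitter,parent_id
0,glutamate,1
5,glutamate,2
7,glutamate,4
"""


def orphaned_children(parent_csv, child_csv):
    """Return the cid of every child row whose parent_id has no matching parent cid."""
    parent_cids = {row["cid"] for row in csv.DictReader(io.StringIO(parent_csv))}
    return [
        row["cid"]
        for row in csv.DictReader(io.StringIO(child_csv))
        if row["parent_id"] not in parent_cids
    ]


print(orphaned_children(AXON_CSV, VESICLE_CSV))  # prints: []
```

Values are compared as strings, so the same check works whether cid is numeric or any other hashable label.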
Glossary
| Key | Description |
|---|---|
| dataset_id | The unique identifier for the dataset, of type uuid. |
| cid | The unique identifier for a SynDB unit within the dataset, of type uuid. |
| parent_id | The CID of the parent component, of type uuid. |
| parent_enum | An integer representing the type or category of the parent component, of type int. |
| polarity | The polarity of the neuron, of type ascii. |
| voxel_volume | The volume of the voxel, of type double. |
| voxel_radius | The radius of the voxel, of type double. |
| s3_mesh_location | The location of the mesh in S3 storage, of type smallint. |
| mesh_volume | The volume of the mesh, of type double. |
| mesh_surface_area | The surface area of the mesh, of type double. |
| mesh_area_volume_ratio | The ratio of the surface area to the volume of the mesh, of type double. |
| mesh_sphericity | The sphericity of the mesh, of type double. |
| centroid_z | The z-coordinate of the centroid, of type double. |
| centroid_x | The x-coordinate of the centroid, of type double. |
| centroid_y | The y-coordinate of the centroid, of type double. |
| s3_swb_location | The location of the SWB in S3 storage, of type smallint. |
| terminal_count | The count of terminals, of type int. |
| mitochondria_count | The count of mitochondria, of type int. |
| total_mitochondria_volume | The total volume of mitochondria, of type double. |
| neuron_id | The unique identifier for the associated neuron, of type uuid. |
| vesicle_count | The count of vesicles, of type int. |
| total_vesicle_volume | The total volume of vesicles, of type double. |
| forms_synapse_with | The unique identifier of the synapse that the component forms with, of type uuid. |
| connection_score | The score representing the strength or quality of the connection, of type double. |
| cleft_score | The score for the synaptic cleft, of type int. |
| GABA | The concentration or presence of GABA neurotransmitter, of type double. |
| acetylcholine | The concentration or presence of acetylcholine neurotransmitter, of type double. |
| glutamate | The concentration or presence of glutamate neurotransmitter, of type double. |
| octopamine | The concentration or presence of octopamine neurotransmitter, of type double. |
| serine | The concentration or presence of serine neurotransmitter, of type double. |
| dopamine | The concentration or presence of dopamine neurotransmitter, of type double. |
| root_id | The external root identifier from the source platform (e.g. FlyWire), of type int. |
| pre_id | The unique identifier of the pre-synaptic component, of type uuid. |
| post_id | The unique identifier of the post-synaptic component, of type uuid. |
| dendritic_spine_count | The count of dendritic spines, of type int. |
| neurotransmitter | The type of neurotransmitter present in a vesicle, of type ascii. |
| distance_to_active_zone | The distance from the vesicle to the active zone, of type double. |
| minimum_normal_length | The minimum normal length, of type int. |
| ribosome_count | The count of ribosomes within the endoplasmic reticulum, of type int. |
ETL Operations
This guide is for developers operating production ETL, cache population, and graph precompute jobs. It records the operational invariants that are easy to miss when a dataset is large enough that a normal import can partly succeed before the real failure appears.
Core Invariants
SynDB must not synthesize missing graph data to make downstream figures pass. If a graph product is missing, first prove whether the source import is complete and internally consistent.
For any connectome dataset used by graph precompute:
- Every edge endpoint must resolve to a neuron in the same dataset.
- dataset_table_state must describe the whole uploaded table, not a partial batch.
- Dataset-scoped materialized views and precomputed tables must be regenerated from the same canonical source rows.
- Empty graph products are valid only when the source graph is truly empty or the product is explicitly not applicable.
The most useful validation is an endpoint check against vw_graph_edges:
WITH toUUID('<dataset-id>') AS ds
SELECT 'pre_missing' AS check_name, count() AS rows
FROM syndb.vw_graph_edges AS e
LEFT JOIN syndb.neurons AS n
ON n.dataset_id = ds AND n.neuron_id = e.pre_neuron_id
WHERE e.dataset_id = ds AND n.neuron_id IS NULL
UNION ALL
SELECT 'post_missing' AS check_name, count() AS rows
FROM syndb.vw_graph_edges AS e
LEFT JOIN syndb.neurons AS n
ON n.dataset_id = ds AND n.neuron_id = e.post_neuron_id
WHERE e.dataset_id = ds AND n.neuron_id IS NULL;
Both rows must be zero before graph precompute results are trusted.
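The logic of the endpoint check can also be sketched outside ClickHouse, which is handy for unit-testing an importer on a small sample. The function name and the toy IDs below are hypothetical; the check mirrors the two counts returned by the query above.

```python
def missing_endpoints(neuron_ids, edges):
    """Count edges whose pre or post endpoint is not a known neuron.

    neuron_ids: iterable of neuron identifiers in the dataset.
    edges: iterable of (pre_neuron_id, post_neuron_id) pairs.
    """
    known = set(neuron_ids)
    edges = list(edges)
    return {
        "pre_missing": sum(1 for pre, _ in edges if pre not in known),
        "post_missing": sum(1 for _, post in edges if post not in known),
    }


# Toy sample: 96, 97, 98, and 99 are not neurons in the dataset.
result = missing_endpoints(
    neuron_ids=[10, 11, 12],
    edges=[(10, 11), (11, 99), (98, 12), (97, 96)],
)
print(result)  # prints: {'pre_missing': 2, 'post_missing': 2}
```

An edge with both endpoints missing contributes to both counts, matching the two independent anti-joins in the SQL.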
MANC Repair Case Study
On April 27, 2026, syndb article cache populate skipped
manc:precomputed_bottleneck_neurons because the live table was empty. The
root cause was not cache population. MANC had repeatedly exposed several
architecture problems in the large-dataset path:
- The MANC weights source contained endpoint body IDs that were not present in the MANC annotation source.
- The Flight import path opened one DoPut per streaming batch, so the first batch could mark a table as uploaded and later batches then failed as duplicates.
- The Flight streaming path did not apply the same source-aware pre-filter as the direct ClickHouse path.
- The default Flight client timeout was too short for large table uploads.
- Small Arrow record batches created tens of thousands of Flight messages for a single large table.
- Graph precompute assumed neurons.polarity was UTF-8, but production MANC could expose it as Int8.
The repaired production import had:
| Table | Rows |
|---|---|
| neurons | 211743 |
| synapses | 26036056 |
| precomputed_bottleneck_neurons | 1637 |
Before repair, MANC had 211743 neurons and 247186482 synapses, with
12617642 missing pre endpoints and 121202907 missing post endpoints. Those
missing endpoints are a source-model mismatch, not a reason to create
placeholder neurons.
MANC Source Model
For MANC v0.9, the canonical SynDB graph universe is the annotated neuron set from:
body-annotations-male-cns-v0.9-minconf-0.5.feather
The weighted connectome source:
connectome-weights-male-cns-v0.9-minconf-0.5.feather
can contain body IDs outside that annotation universe. The importer must filter connection rows to annotated pre and post body IDs before transforming them into SynDB synapses. Do not silently add synthetic neurons for unannotated endpoints: that changes the biological scope of the dataset and corrupts downstream coverage and graph summaries.
This source-aware filter must be applied in every import mode. If both direct ClickHouse upload and Arrow Flight upload exist, both paths must run the same pre-filter before table upload.
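One way to keep both import modes consistent is to route them through a single filter function. The sketch below illustrates the idea on plain tuples; the function name, the row layout (pre, post, weight), and the sample IDs are assumptions for illustration, not the importer's actual code.

```python
def filter_to_annotated(annotated_body_ids, connections):
    """Keep only connection rows whose pre and post body IDs are both annotated.

    connections: list of (pre_body_id, post_body_id, weight) tuples.
    Returns (kept_rows, dropped_count) so the caller can log what was removed.
    """
    universe = set(annotated_body_ids)
    kept = [
        row for row in connections
        if row[0] in universe and row[1] in universe
    ]
    return kept, len(connections) - len(kept)


annotated = [101, 102, 103]
weights = [
    (101, 102, 5),  # both endpoints annotated: kept
    (101, 999, 2),  # unannotated post: dropped
    (888, 103, 7),  # unannotated pre: dropped
]
kept, dropped = filter_to_annotated(annotated, weights)
print(kept, dropped)  # prints: [(101, 102, 5)] 2
```

Returning the dropped count makes the scope reduction visible in import logs instead of silently shrinking the table.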
Flight Upload Rules
For streaming tables, a table upload is atomic at the table level from the metadata system’s perspective. Do not open one Flight DoPut per batch. Open one DoPut stream for the table and send all record batches through that stream.
The failure signature for the broken pattern was:
table already exists
after the first streaming batch had already succeeded. That left production in a misleading state: the metadata table could say a source table was uploaded, while ClickHouse only contained the first slice of the table.
Large Flight uploads also need operationally appropriate transport settings:
- Use a long ETL upload connection timeout, currently two hours.
- Convert large dataframes to larger Arrow record batches, up to 65536 rows per batch, rather than relying on small default bridge batches.
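The two rules above (one stream per table, large batches) can be sketched without any Flight dependency. In this illustration open_stream is a hypothetical callable standing in for a single DoPut call, and RecordingStream is a test double; neither is part of the SynDB codebase or the Arrow Flight API.

```python
MAX_ROWS_PER_BATCH = 65536  # upper bound named in the rule above


def batch_slices(total_rows, max_rows=MAX_ROWS_PER_BATCH):
    """Yield (start, stop) row ranges that cover total_rows in order."""
    for start in range(0, total_rows, max_rows):
        yield start, min(start + max_rows, total_rows)


def upload_table(rows, open_stream, max_rows=MAX_ROWS_PER_BATCH):
    """Send every batch of one table through a single stream.

    open_stream is hypothetical: it returns an object with write(batch)
    and close(), standing in for one DoPut stream.
    """
    stream = open_stream()  # opened once per table, never once per batch
    try:
        for start, stop in batch_slices(len(rows), max_rows):
            stream.write(rows[start:stop])
    finally:
        stream.close()


class RecordingStream:
    """Test double that records the size of each batch it receives."""

    def __init__(self):
        self.batch_sizes = []
        self.closed = False

    def write(self, batch):
        self.batch_sizes.append(len(batch))

    def close(self):
        self.closed = True


streams = []


def opener():
    streams.append(RecordingStream())
    return streams[-1]


upload_table(list(range(10)), opener, max_rows=4)
print(len(streams), streams[0].batch_sizes)  # prints: 1 [4, 4, 2]
```

For scale, the 26036056 synapse rows of the repaired MANC import fit in 398 batches at 65536 rows per batch.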
Cleaning A Partial Import
When repairing a dataset with partial or semantically invalid rows, clean every
dataset-scoped product derived from that dataset before re-importing. Do not
delete global views that do not have dataset_id.
First, identify the PostgreSQL leader before changing metadata:
kubectl exec -n syndb syndb-postgres-1 -- patronictl list
Use the current leader pod for writes.
For ClickHouse, delete source rows and dataset-scoped derived rows. Use long timeouts and synchronous mutations for large datasets:
kubectl exec -n syndb <chi-pod> -- \
clickhouse-client \
--host syndb-cluster \
--user syndb \
--password "$CLICKHOUSE_PASSWORD" \
--receive_timeout 3600 \
--send_timeout 3600 \
--query "
ALTER TABLE syndb.neurons
DELETE WHERE dataset_id = '<dataset-id>'
SETTINGS mutations_sync = 2;
ALTER TABLE syndb.synapses
DELETE WHERE dataset_id = '<dataset-id>'
SETTINGS mutations_sync = 2;
"
Then remove dataset-scoped materialized-view targets and precompute products. Generate the table list from ClickHouse metadata so global tables are not accidentally deleted:
SELECT table
FROM system.columns
WHERE database = 'syndb'
AND name = 'dataset_id'
AND (
table LIKE 'mv_%'
OR table LIKE 'precomputed_%'
)
ORDER BY table;
For each returned table:
ALTER TABLE syndb.<table>
DELETE WHERE dataset_id = '<dataset-id>'
SETTINGS mutations_sync = 2;
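When the metadata query returns many tables, the per-table statements can be generated rather than typed by hand. The following sketch is one possible approach: cleanup_statements is a hypothetical helper, mv_example is a made-up table name, and the prefix check simply mirrors the LIKE filters in the query above.

```python
import re
import uuid

# Mirrors the mv_% / precomputed_% filters used in the metadata query.
_TABLE_RE = re.compile(r"^(mv_|precomputed_)[A-Za-z0-9_]+$")


def cleanup_statements(dataset_id, tables):
    """Render one synchronous DELETE mutation per dataset-scoped table."""
    ds = str(uuid.UUID(dataset_id))  # raises ValueError on a malformed id
    statements = []
    for table in tables:
        if not _TABLE_RE.match(table):
            raise ValueError(f"refusing to touch unexpected table: {table}")
        statements.append(
            f"ALTER TABLE syndb.{table} "
            f"DELETE WHERE dataset_id = '{ds}' "
            "SETTINGS mutations_sync = 2;"
        )
    return statements


# Placeholder dataset id and table list, for illustration only.
for stmt in cleanup_statements(
    "12345678-1234-5678-1234-567812345678",
    ["precomputed_bottleneck_neurons", "mv_example"],
):
    print(stmt)
```

Validating the dataset id and table prefix before rendering keeps a typo from producing a mutation against a global table.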
Finally, reset the relevant PostgreSQL upload state on the leader:
UPDATE dataset_table_state
SET upload_state = 'pending',
row_count = NULL,
uploaded_at = NULL,
error_message = NULL
WHERE dataset_id = '<dataset-id>'
AND table_id IN (<table-ids-to-rerun>);
Only reset the table IDs that will actually be regenerated.
MANC-Only Helm Apply
Kubernetes Jobs are immutable, so delete failed or running ETL jobs before changing their specs:
nix develop . -c kubectl delete job -n syndb \
-l app=syndb-etl \
--field-selector status.successful!=1
The helper apply path can overlay registry-generated ETL skip flags. When you need a surgical MANC-only run, use a direct Helm apply and explicitly skip every other pipeline:
pipelines=(
allen-cell-types banc celegans celegans-dauer celegans-male ciona fanc
flywire h01 hemibrain larval medulla-7col microns neuromorpho optic-lobe
platynereis spine-morphometry witvliet wormneuroatlas
)
args=()
for pipeline in "${pipelines[@]}"; do
args+=(--set "syndb-etl.skipPipelines.${pipeline}=true")
done
nix develop . -c helm upgrade --install syndb-nautilus \
infrastructure/helm/nautilus \
-n syndb \
-f infrastructure/helm/nautilus/values.yaml \
"${args[@]}"
After the import job completes, validate source rows, upload state, and endpoint integrity before running cache population.
Graph Precompute For Large Graphs
MANC is too large for approaches that assume all products can be materialized in
memory. Use the large topology backend for exact bottleneck results over the
canonical vw_graph_edges stream.
For local operation through the devshell:
nix develop . -c syndb article graph precompute \
--dataset manc \
--resume=false \
--replace-existing=true \
--max-edges 200000000 \
--small-network-threshold 5000 \
--fail-fast=true
If running in-cluster, make sure the deployed ETL image contains the graph-precompute polarity fix:
toString(polarity) AS polarity
in the large-topology neuron query. Without that cast, production MANC can fail
when neurons.polarity is represented as Int8 rather than UTF-8.
Validate the graph products directly:
SELECT count()
FROM syndb.precomputed_bottleneck_neurons
WHERE dataset_id = '<dataset-id>';
SELECT status, backend, row_count, exact
FROM syndb.precomputed_analysis_status
WHERE dataset_id = '<dataset-id>'
ORDER BY product;
For the repaired MANC import on April 27, 2026,
precomputed_bottleneck_neurons contained 1637 rows and the analysis status
reported exact graph_precompute_streaming_topology output.
Cache Population
syndb article cache populate downloads derived production tables and materialized views
into local manuscript cache parquet files. It is not the raw dataset import
path. A skipped product during cache population should be treated as a symptom
of a missing or empty production data product unless the skip is explicitly
expected.
For MANC bottlenecks, the correct order is:
- Validate the imported source rows and endpoint integrity.
- Regenerate precomputed_bottleneck_neurons.
- Validate the precomputed row count and analysis status.
- Re-run syndb article cache populate.
Do not patch the manuscript cache with synthetic rows. The cache should reflect the deployed data products.
Build Caching
SynDB uses multiple layers of caching to keep compile times short across local development, CI pipelines, and production deploys.
Cargo compiler flags
Configured in .cargo/config.toml, these flags speed up every local cargo
invocation:
| Flag | Effect |
|---|---|
| -C link-arg=-fuse-ld=mold | Mold linker, significantly faster than the default ld or lld (Linux only) |
| -Zshare-generics=y | Share monomorphized generics between crates, reducing codegen work |
| -Zthreads=8 | Parallel compiler frontend (parsing, macro expansion, type checking) |
| codegen-backend = "cranelift" | Dev profile uses Cranelift instead of LLVM for faster debug builds |
| codegen-backend = "llvm" (for deps) | Dependencies still use LLVM for better optimization |
CI caching (GitHub Actions)
In CI, syndb-ci runs tests and builds directly on the host (no Docker for
the ci subcommand). Cargo artifacts are cached between runs via
Swatinem/rust-cache@v2, which
persists target/ and the cargo registry keyed by branch and Cargo.lock
hash.
For integration tests (local-stack-test, e2e-test), syndb-ci uses
bollard to start ephemeral Docker containers (PostgreSQL, ClickHouse, MinIO)
on a shared Docker network. Test binaries run on the host with environment
variables pointing at localhost:<port>. No cargo cache volumes are needed
inside containers — the host target/ is used directly.
Nix OCI cache
Local stack images (syndb stack prepare) for the API and ETL are built with
Nix and cached using syndb-ci nix-oci-cache. This command uses Nix
store paths as content-addressed fingerprints to skip unnecessary rebuilds:
- nix build .#oci-syndb-api produces a store path (a hash of all inputs)
- The script compares the current store path against a stamp file (/tmp/.oci-syndb-api.storepath)
- If unchanged, the build is skipped entirely
- If changed, the new tarball is copied to /tmp/ and loaded into Docker
This means syndb stack prepare is near-instant when source hasn’t changed.
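The stamp-file comparison described above can be sketched in a few lines. This is an illustration of the pattern, not the syndb-ci source: the function names and the store paths are made up, and a temporary directory stands in for /tmp/.

```python
import tempfile
from pathlib import Path


def needs_rebuild(store_path, stamp_file):
    """Return True when the stamp is missing or records a different store path."""
    try:
        previous = Path(stamp_file).read_text().strip()
    except FileNotFoundError:
        return True
    return previous != store_path


def record_build(store_path, stamp_file):
    """Write the store path of the image that was just loaded."""
    Path(stamp_file).write_text(store_path + "\n")


with tempfile.TemporaryDirectory() as tmp:
    stamp = Path(tmp, ".oci-syndb-api.storepath")
    path_v1 = "/nix/store/aaaa-oci-syndb-api"  # made-up store paths
    path_v2 = "/nix/store/bbbb-oci-syndb-api"

    first = needs_rebuild(path_v1, stamp)      # no stamp yet
    record_build(path_v1, stamp)
    unchanged = needs_rebuild(path_v1, stamp)  # same inputs
    changed = needs_rebuild(path_v2, stamp)    # inputs changed

print(first, unchanged, changed)  # prints: True False True
```

Because the store path already encodes every build input, the comparison needs no hashing of its own.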
Nix Crane dependency caching
For nix flake check (CI) and Nix-based OCI image builds, the project uses
Crane with a split dependency build:
# nix/rust.nix
mkCargoArtifacts = system:
craneLib.buildDepsOnly (mkCommonArgs system);
buildDepsOnly compiles all workspace dependencies into a cached Nix
derivation. Subsequent builds of workspace crates reuse these artifacts,
so only the project’s own code is recompiled. Since Nix derivations are
content-addressed, the dependency cache is automatically invalidated when
Cargo.lock changes.
UI Source-Hash Cache
The UI image path used by syndb stack prepare uses a source-hash stamp to skip
rebuilds when Rust UI source files, static assets, or the packaged QueryFabric
catalog sources haven’t changed:
- SHA-256 of all files in crates/services/ui/src/, crates/services/ui/public/, queryfabric/crates/queryfabric/src/, queryfabric/crates/queryfabric-catalog/src/, queryfabric/crates/queryfabric-web/src/, queryfabric/crates/queryfabric-web/assets/, queryfabric/crates/queryfabric-leptos/src/, and queryfabric/crates/queryfabric-dialect-syql/src/, plus the relevant workspace and crate Cargo.toml files and Cargo.lock
- Compared against /tmp/.syndb-ui.srchash
- If the hash matches and syndb-ui:dev exists in Docker, the build is skipped
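A source-hash stamp of this kind amounts to one digest over a sorted walk of the input directories. The sketch below shows the general technique under stated assumptions: source_hash is a hypothetical function, the way paths and contents are combined is a choice made here, and the real script may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def source_hash(roots):
    """Return one SHA-256 hex digest over every file under the given roots."""
    digest = hashlib.sha256()
    for root in roots:
        root = Path(root)
        # Sort so the digest does not depend on directory listing order.
        for path in sorted(p for p in root.rglob("*") if p.is_file()):
            # Include the relative path so renames also change the digest.
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(b"\0")
            digest.update(path.read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "src")
    src.mkdir()
    (src / "main.rs").write_text("fn main() {}\n")
    before = source_hash([src])
    same = source_hash([src])
    (src / "main.rs").write_text("// changed\nfn main() {}\n")
    after = source_hash([src])

print(before == same, before != after)  # prints: True True
```

The resulting hex string is what would be compared against the stored stamp before deciding to skip the build.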
The extracted SyQL editor asset is now served from /static/queryfabric_syql_editor.js.
Summary
| Layer | Scope | Mechanism | Invalidation |
|---|---|---|---|
| Cargo flags | Local dev | Mold, Cranelift, parallel frontend | N/A (always active) |
| GitHub Actions cache | CI pipelines | Swatinem/rust-cache@v2 | Branch + Cargo.lock hash |
| Nix OCI cache | syndb stack prepare (API, ETL) | Nix store-path stamps | Content-addressed (any input change) |
| Crane deps | nix flake check, OCI images | buildDepsOnly derivation | Cargo.lock changes |
| UI source hash | syndb stack prepare (UI) | SHA-256 file stamp | Source file changes |
Troubleshooting
Find up-to-date explanations of different types of errors and pointers on how to resolve them.
403, Forbidden
Verification
Academic verification is required for computationally or network-heavy tasks. This is to ensure that the resources are not being misused. You may verify yourself after registering on the platform — see Authentication for details on CILogon verification.
Dataset
A dataset belongs to its creator and to any groups the creator has chosen to share ownership with. If you cannot access a dataset, you are neither the creator nor a member of such a group. You may request access to the dataset from the creator.
429, Too Many Requests
You have exceeded the rate limit (100 requests/second by default). Respect the Retry-After header and implement exponential backoff. See Rate Limiting.
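A client can combine both recommendations by preferring the server's Retry-After value and falling back to exponential backoff when it is absent. The sketch below is a minimal illustration: retry_delay and its base and cap defaults are assumptions, and only the delay-in-seconds form of Retry-After is handled.

```python
def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Return how many seconds to wait before retry number `attempt` (0-based).

    retry_after: value of the Retry-After header in seconds, if present.
    base, cap: assumed defaults for the exponential fallback.
    """
    if retry_after is not None:
        # The server's hint takes priority over the local schedule.
        return max(0.0, float(retry_after))
    return min(cap, base * (2 ** attempt))


# Fallback schedule for the first five retries without a Retry-After header.
print([retry_delay(n) for n in range(5)])  # prints: [0.5, 1.0, 2.0, 4.0, 8.0]
print(retry_delay(2, retry_after="7"))     # prints: 7.0
```

Adding a small random jitter to the fallback delay helps keep many clients from retrying at the same instant.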
Job Failures
If a submitted job fails:
- Check the job status: GET /v1/jobs/{job_id}; the error field describes the failure
- Common causes: query timeout, result too large (>1 GB), ClickHouse resource limits
- Rerun the job: POST /v1/jobs/{job_id}/rerun
See Jobs System for details.
SyQL Errors
- Parse errors: Check SyQL syntax in the error message. Use POST /v1/syql/plan to validate without executing.
- Resolution errors: A referenced table or column does not exist. Check the Data Structuring guide for valid column names.
- Timeout: Large queries may exceed the 60s HTTP timeout. Use POST /v1/syql/exec to submit as an async job instead.
Federation Issues
See Federation Troubleshooting for:
- Node discovery failures (mDNS, multiaddrs)
- Schema version mismatches
- Cluster health states
- Docker Compose federation profile issues