ETL Operations

This guide is for developers operating production ETL, cache population, and graph precompute jobs. It records the operational invariants that are easy to miss when a dataset is large enough that a normal import can partly succeed before the real failure appears.

Core Invariants

SynDB must not synthesize missing graph data to make downstream figures pass. If a graph product is missing, first prove whether the source import is complete and internally consistent.

For any connectome dataset used by graph precompute:

Every edge endpoint must resolve to a neuron in the same dataset.
dataset_table_state must describe the whole uploaded table, not a partial batch.
Dataset-scoped materialized views and precomputed tables must be regenerated from the same canonical source rows.
Empty graph products are valid only when the source graph is truly empty or the product is explicitly not applicable.

The most useful validation is an endpoint check against vw_graph_edges:

WITH toUUID('<dataset-id>') AS ds
SELECT 'pre_missing' AS check_name, count() AS rows
FROM syndb.vw_graph_edges AS e
LEFT JOIN syndb.neurons AS n
  ON n.dataset_id = ds AND n.neuron_id = e.pre_neuron_id
WHERE e.dataset_id = ds AND n.neuron_id IS NULL
UNION ALL
SELECT 'post_missing' AS check_name, count() AS rows
FROM syndb.vw_graph_edges AS e
LEFT JOIN syndb.neurons AS n
  ON n.dataset_id = ds AND n.neuron_id = e.post_neuron_id
WHERE e.dataset_id = ds AND n.neuron_id IS NULL;

Both rows must be zero before graph precompute results are trusted.

MANC Repair Case Study

On April 27, 2026, syndb article cache populate skipped manc:precomputed_bottleneck_neurons because the live table was empty. The root cause was not cache population. MANC had repeatedly exposed several architecture problems in the large-dataset path:

The MANC weights source contained endpoint body IDs that were not present in the MANC annotation source.
The Flight import path opened one DoPut per streaming batch, so the first batch could mark a table as uploaded and later batches then failed as duplicates.
The Flight streaming path did not apply the same source-aware pre-filter as the direct ClickHouse path.
The default Flight client timeout was too short for large table uploads.
Small Arrow record batches created tens of thousands of Flight messages for a single large table.
Graph precompute assumed neurons.polarity was UTF-8, but production MANC could expose it as Int8.

The repaired production import had:

Table	Rows
`neurons`	`211743`
`synapses`	`26036056`
`precomputed_bottleneck_neurons`	`1637`

Before repair, MANC had 211743 neurons and 247186482 synapses, with 12617642 missing pre endpoints and 121202907 missing post endpoints. Those missing endpoints are a source-model mismatch, not a reason to create placeholder neurons.

MANC Source Model

For MANC v0.9, the canonical SynDB graph universe is the annotated neuron set from:

body-annotations-male-cns-v0.9-minconf-0.5.feather

The weighted connectome source:

connectome-weights-male-cns-v0.9-minconf-0.5.feather

can contain body IDs outside that annotation universe. The importer must filter connection rows to annotated pre and post body IDs before transforming them into SynDB synapses. Do not silently add synthetic neurons for unannotated endpoints: that changes the biological scope of the dataset and corrupts downstream coverage and graph summaries.

This source-aware filter must be applied in every import mode. If both direct ClickHouse upload and Arrow Flight upload exist, both paths must run the same pre-filter before table upload.

Flight Upload Rules

For streaming tables, a table upload is atomic at the table level from the metadata system’s perspective. Do not open one Flight DoPut per batch. Open one DoPut stream for the table and send all record batches through that stream.

The failure signature for the broken pattern was:

table already exists

after the first streaming batch had already succeeded. That left production in a misleading state: the metadata table could say a source table was uploaded, while ClickHouse only contained the first slice of the table.

Large Flight uploads also need operationally appropriate transport settings:

Use a long ETL upload connection timeout, currently two hours.
Convert large dataframes to larger Arrow record batches, up to 65536 rows per batch, rather than relying on small default bridge batches.

Cleaning A Partial Import

When repairing a dataset with partial or semantically invalid rows, clean every dataset-scoped product derived from that dataset before re-importing. Do not delete global views that do not have dataset_id.

First, identify the PostgreSQL leader before changing metadata:

kubectl exec -n syndb syndb-postgres-1 -- patronictl list

Use the current leader pod for writes.

For ClickHouse, delete source rows and dataset-scoped derived rows. Use long timeouts and synchronous mutations for large datasets:

kubectl exec -n syndb <chi-pod> -- \
  clickhouse-client \
    --host syndb-cluster \
    --user syndb \
    --password "$CLICKHOUSE_PASSWORD" \
    --receive_timeout 3600 \
    --send_timeout 3600 \
    --query "
      ALTER TABLE syndb.neurons
      DELETE WHERE dataset_id = '<dataset-id>'
      SETTINGS mutations_sync = 2;

      ALTER TABLE syndb.synapses
      DELETE WHERE dataset_id = '<dataset-id>'
      SETTINGS mutations_sync = 2;
    "

Then remove dataset-scoped materialized-view targets and precompute products. Generate the table list from ClickHouse metadata so global tables are not accidentally deleted:

SELECT table
FROM system.columns
WHERE database = 'syndb'
  AND name = 'dataset_id'
  AND (
    table LIKE 'mv_%'
    OR table LIKE 'precomputed_%'
  )
ORDER BY table;

For each returned table:

ALTER TABLE syndb.<table>
DELETE WHERE dataset_id = '<dataset-id>'
SETTINGS mutations_sync = 2;

Finally, reset the relevant PostgreSQL upload state on the leader:

UPDATE dataset_table_state
SET upload_state = 'pending',
    row_count = NULL,
    uploaded_at = NULL,
    error_message = NULL
WHERE dataset_id = '<dataset-id>'
  AND table_id IN (<table-ids-to-rerun>);

Only reset the table IDs that will actually be regenerated.

MANC-Only Helm Apply

Kubernetes Jobs are immutable, so delete failed or running ETL jobs before changing their specs:

nix develop . -c kubectl delete job -n syndb \
  -l app=syndb-etl \
  --field-selector status.successful!=1

The helper apply path can overlay registry-generated ETL skip flags. When you need a surgical MANC-only run, use a direct Helm apply and explicitly skip every other pipeline:

pipelines=(
  allen-cell-types banc celegans celegans-dauer celegans-male ciona fanc
  flywire h01 hemibrain larval medulla-7col microns neuromorpho optic-lobe
  platynereis spine-morphometry witvliet wormneuroatlas
)

args=()
for pipeline in "${pipelines[@]}"; do
  args+=(--set "syndb-etl.skipPipelines.${pipeline}=true")
done

nix develop . -c helm upgrade --install syndb-nautilus \
  infrastructure/helm/nautilus \
  -n syndb \
  -f infrastructure/helm/nautilus/values.yaml \
  "${args[@]}"

After the import job completes, validate source rows, upload state, and endpoint integrity before running cache population.

Graph Precompute For Large Graphs

MANC is too large for approaches that assume all products can be materialized in memory. Use the large topology backend for exact bottleneck results over the canonical vw_graph_edges stream.

For local operation through the devshell:

nix develop . -c syndb article graph precompute \
  --dataset manc \
  --resume=false \
  --replace-existing=true \
  --max-edges 200000000 \
  --small-network-threshold 5000 \
  --fail-fast=true

If running in-cluster, make sure the deployed ETL image contains the graph-precompute polarity fix:

toString(polarity) AS polarity

in the large-topology neuron query. Without that cast, production MANC can fail when neurons.polarity is represented as Int8 rather than UTF-8.

Validate the graph products directly:

SELECT count()
FROM syndb.precomputed_bottleneck_neurons
WHERE dataset_id = '<dataset-id>';

SELECT status, backend, row_count, exact
FROM syndb.precomputed_analysis_status
WHERE dataset_id = '<dataset-id>'
ORDER BY product;

For the repaired MANC import on April 27, 2026, precomputed_bottleneck_neurons contained 1637 rows and the analysis status reported exact graph_precompute_streaming_topology output.

Cache Population

syndb article cache populate downloads derived production tables and materialized views into local manuscript cache parquet files. It is not the raw dataset import path. A skipped product during cache population should be treated as a symptom of a missing or empty production data product unless the skip is explicitly expected.

For MANC bottlenecks, the correct order is:

Validate the imported source rows and endpoint integrity.
Regenerate precomputed_bottleneck_neurons.
Validate the precomputed row count and analysis status.
Re-run syndb article cache populate.

Do not patch the manuscript cache with synthetic rows. The cache should reflect the deployed data products.

Keyboard shortcuts

Synapse DB