Cluster Setup

This guide covers running tsink as a replicated multi-node cluster. The same binary runs in both single-node and cluster mode; clustering is enabled with the --cluster-enabled flag.

Overview

A tsink cluster is a group of nodes that share data via consistent-hash-ring sharding and configurable replication. The cluster handles:

Shard routing — series are hashed to a logical shard, and each shard is owned by one or more replica nodes.
Replication — writes fan out to all replica owners; consistency guarantees are tunable.
Hinted handoff — writes destined for a temporarily unavailable peer are durably queued and replayed when it recovers.
Anti-entropy repair — periodic digest exchange detects and repairs diverged shards.
Online rebalance — shard ownership migrates automatically when nodes join or leave.
Control-plane consensus — membership and ring state are managed by an internal Raft-like log replicated across all storage-capable nodes.

Quick start: three-node cluster

Start three nodes, each binding to its own data path and internal RPC endpoint:

# Node 1
tsink-server \
  --listen 0.0.0.0:9201 \
  --data-path ./var/node1 \
  --cluster-enabled \
  --cluster-node-id node-1 \
  --cluster-bind 0.0.0.0:9211 \
  --cluster-seeds node-2:9212,node-3:9213 \
  --cluster-replication-factor 3

# Node 2
tsink-server \
  --listen 0.0.0.0:9202 \
  --data-path ./var/node2 \
  --cluster-enabled \
  --cluster-node-id node-2 \
  --cluster-bind 0.0.0.0:9212 \
  --cluster-seeds node-1:9211,node-3:9213 \
  --cluster-replication-factor 3

# Node 3
tsink-server \
  --listen 0.0.0.0:9203 \
  --data-path ./var/node3 \
  --cluster-enabled \
  --cluster-node-id node-3 \
  --cluster-bind 0.0.0.0:9213 \
  --cluster-seeds node-1:9211,node-2:9212 \
  --cluster-replication-factor 3

Each node bootstraps by contacting the seed list until the control plane accepts its join. Once all three nodes are active, shard ownership is distributed across them.

CLI flags reference

All cluster flags are only meaningful when --cluster-enabled is set.

Flag	Default	Description
`--cluster-enabled`	`false`	Enable cluster mode.
`--cluster-node-id <ID>`	—	Required. Stable, unique identifier for this node (e.g. `node-1`). Must not be `"unknown"`.
`--cluster-bind <HOST:PORT>`	—	Required. Internal RPC listen/advertise address. Peers connect here.
`--cluster-node-role <ROLE>`	`hybrid`	Node role: `storage`, `query`, or `hybrid`. See node roles.
`--cluster-seeds <HOST:PORT,...>`	—	Comma-separated list of peer endpoints used for initial join. Format: `host:port` or `node-id@host:port`.
`--cluster-shards <N>`	`128`	Number of logical shards. Must be > 0. Set this once at cluster creation and never change it.
`--cluster-replication-factor <N>`	`1`	Number of replicas per shard. Must be > 0 and ≤ number of storage-capable nodes.
`--cluster-write-consistency <MODE>`	`quorum`	Write consistency level: `one`, `quorum`, or `all`.
`--cluster-read-consistency <MODE>`	`eventual`	Read consistency level: `eventual`, `quorum`, or `strict`.
`--cluster-read-partial-response <MODE>`	`allow`	Whether to allow partial results when some shards are unavailable: `allow` or `deny`.
`--cluster-internal-auth-token <TOKEN>`	—	Shared-secret token sent on all internal RPC calls. Mutually exclusive with `--cluster-internal-auth-token-file`.
`--cluster-internal-auth-token-file <PATH>`	—	Path to a file containing the shared-secret token. Mutually exclusive with `--cluster-internal-auth-token`.
`--cluster-internal-mtls-enabled`	`false`	Enable mTLS for internal RPC. When set, token auth is disabled. Requires the three flags below.
`--cluster-internal-mtls-ca-cert <PATH>`	—	PEM CA bundle used to verify peer certificates.
`--cluster-internal-mtls-cert <PATH>`	—	PEM client certificate presented on outbound RPC.
`--cluster-internal-mtls-key <PATH>`	—	PEM private key for `--cluster-internal-mtls-cert`.

Node roles

Each node has a role that determines whether it owns data shards and whether it participates in query fanout.

Role	Owns shards	Handles queries
`hybrid` (default)	Yes	Yes
`storage`	Yes	Via RPC from query nodes
`query`	No	Yes (routes to storage/hybrid)

The hybrid role suits homogeneous clusters where every node does the same work. Separating storage and query roles is useful when you want dedicated query nodes with more memory for merging results, while storage nodes focus on ingest and retention. A query-role node requires at least one storage or hybrid node in its seed list:

tsink-server \
  --cluster-node-role query \
  --cluster-seeds storage-1:9211,storage-2:9212 \
  ...

Sharding

Placement happens in two steps: series are mapped to shards by modulo, and shards are mapped to replica owners on a consistent hash ring:

Series → shard: shard_index = series_id % shard_count
Shard → owners: virtual-node consistent hash ring (xxHash64, seed 0). Each physical node gets 128 virtual nodes on the ring. Walking clockwise from the shard's position, the first replication_factor distinct physical nodes form the shard's replica set.

The --cluster-shards value determines the total number of logical shards and cannot be changed after the cluster is created. The default of 128 is sufficient for most deployments. Use a higher value (e.g. 512 or 1024) if you plan to grow to many nodes and want fine-grained shard migration.
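The two mapping steps above can be sketched as follows. This is a minimal illustration, not tsink's implementation: BLAKE2b stands in for xxHash64 because it ships with Python, and the virtual-node key format (`node#i`) and the shard key format (`shard-N`) are assumptions made for the example, so the owners it computes will not match a real cluster.

```python
import bisect
import hashlib

VNODES_PER_NODE = 128  # virtual nodes per physical node, as stated in the text


def h64(data: str) -> int:
    """Stand-in 64-bit hash. tsink uses xxHash64 (seed 0); BLAKE2b is used
    here only because it is in the standard library."""
    return int.from_bytes(hashlib.blake2b(data.encode(), digest_size=8).digest(), "big")


def shard_for_series(series_id: int, shard_count: int) -> int:
    # Series -> shard: plain modulo over the fixed shard count.
    return series_id % shard_count


def build_ring(nodes):
    # Hypothetical vnode key format "<node>#<i>"; the real layout is not documented here.
    return sorted((h64(f"{n}#{i}"), n) for n in nodes for i in range(VNODES_PER_NODE))


def owners_for_shard(ring, shard: int, replication_factor: int):
    # Walk clockwise from the shard's ring position and keep the first
    # `replication_factor` distinct physical nodes.
    start = bisect.bisect_left(ring, (h64(f"shard-{shard}"), ""))
    owners = []
    for k in range(len(ring)):
        node = ring[(start + k) % len(ring)][1]
        if node not in owners:
            owners.append(node)
            if len(owners) == replication_factor:
                break
    return owners
```

For example, `owners_for_shard(build_ring(["node-1", "node-2", "node-3"]), shard_for_series(1000, 128), 2)` returns two distinct node IDs. Because the shard index is a modulo over the shard count, changing --cluster-shards would remap almost every series, which is why the value is fixed at cluster creation.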

Consistency levels

Write consistency

Controls how many replica acks are required before a write is confirmed.

Level	Acks required	Formula
`one`	1	—
`quorum` (default)	majority	`⌊replication_factor / 2⌋ + 1`
`all`	all replicas	`replication_factor`

The write consistency can be overridden per-request using the HTTP header:

x-tsink-write-consistency: one

Per-request overrides may only weaken (not strengthen) the server-configured level.
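As a worked example of the ack arithmetic and the override rule, here is a small sketch. The function names are hypothetical, and one behavior is assumed: the text does not say whether a request for a stronger level is rejected or clamped, so this sketch clamps it to the server-configured level.

```python
from typing import Optional

LEVELS = ["one", "quorum", "all"]  # ordered weakest to strongest


def required_acks(level: str, replication_factor: int) -> int:
    """Number of replica acks needed before a write is confirmed."""
    if level == "one":
        return 1
    if level == "quorum":
        return replication_factor // 2 + 1
    if level == "all":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def effective_level(server_level: str, requested: Optional[str]) -> str:
    # Assumption: a stronger-than-configured request is clamped, not rejected.
    if requested is None:
        return server_level
    return min(server_level, requested, key=LEVELS.index)
```

With a replication factor of 3, `quorum` needs 2 acks, so one replica can be down without failing writes. With a replication factor of 5, it needs 3.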

Read consistency

Controls how many replica responses are required and whether all replicas are queried.

Level	Replicas queried	Acks required
`eventual` (default)	Primary only	1
`quorum`	All replicas	`⌊replicas / 2⌋ + 1`
`strict`	All replicas	All replicas

Partial response

When --cluster-read-partial-response is set to allow (the default), a query can succeed and return data even if some shards are temporarily unavailable. The response then includes a "partial_response": true field in its metadata. Set --cluster-read-partial-response deny to return an error instead whenever any shard is unavailable.

Hinted handoff

When a write cannot be delivered to a replica because the peer is unreachable, tsink durably queues the data in a per-peer hinted handoff outbox. When the peer recovers, the queued data is replayed automatically. Handoff behavior is tunable via environment variables:

Environment variable	Default	Description
`TSINK_CLUSTER_OUTBOX_MAX_ENTRIES`	`100000`	Maximum number of queued entries across all peers.
`TSINK_CLUSTER_OUTBOX_MAX_BYTES`	`536870912` (512 MiB)	Total size cap for the handoff queue.
`TSINK_CLUSTER_OUTBOX_MAX_PEER_BYTES`	`268435456` (256 MiB)	Per-peer size cap.
`TSINK_CLUSTER_OUTBOX_MAX_LOG_BYTES`	`2147483648` (2 GiB)	Maximum WAL size for the outbox log.
`TSINK_CLUSTER_OUTBOX_REPLAY_INTERVAL_SECS`	`2`	How often to attempt replay to recovered peers (seconds).
`TSINK_CLUSTER_OUTBOX_REPLAY_BATCH_SIZE`	`256`	Rows per replay batch.
`TSINK_CLUSTER_OUTBOX_MAX_BACKOFF_SECS`	`30`	Maximum backoff between replay attempts.
`TSINK_CLUSTER_OUTBOX_MAX_RECORD_BYTES`	`2097152` (2 MiB)	Maximum size of a single outbox record.
`TSINK_CLUSTER_OUTBOX_CLEANUP_INTERVAL_SECS`	`30`	How often stale outbox records are cleaned up.
`TSINK_CLUSTER_OUTBOX_CLEANUP_MIN_STALE_RECORDS`	`1024`	Minimum stale records before cleanup runs.
`TSINK_CLUSTER_OUTBOX_STALLED_PEER_AGE_SECS`	`300`	Age threshold (seconds) at which a peer’s outbox queue is considered stalled.
`TSINK_CLUSTER_OUTBOX_STALLED_PEER_MIN_ENTRIES`	`1`	Minimum entries for a peer queue to be considered stalled.
`TSINK_CLUSTER_OUTBOX_STALLED_PEER_MIN_BYTES`	`1`	Minimum bytes for a peer queue to be considered stalled.
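To show how the size caps in the table interact, here is a minimal admission-check sketch using the default values. The `Outbox` class is hypothetical, and what tsink actually does when a cap is hit (reject the hint, drop older entries, or something else) is not stated in this guide; the sketch simply refuses the record.

```python
from dataclasses import dataclass, field

MAX_ENTRIES = 100_000                # TSINK_CLUSTER_OUTBOX_MAX_ENTRIES
MAX_BYTES = 512 * 1024 * 1024        # TSINK_CLUSTER_OUTBOX_MAX_BYTES
MAX_PEER_BYTES = 256 * 1024 * 1024   # TSINK_CLUSTER_OUTBOX_MAX_PEER_BYTES
MAX_RECORD_BYTES = 2 * 1024 * 1024   # TSINK_CLUSTER_OUTBOX_MAX_RECORD_BYTES


@dataclass
class Outbox:
    entries: int = 0
    total_bytes: int = 0
    peer_bytes: dict = field(default_factory=dict)

    def try_enqueue(self, peer: str, record_bytes: int) -> bool:
        """Return True and account for the record only if it fits every cap."""
        peer_total = self.peer_bytes.get(peer, 0)
        if (
            record_bytes > MAX_RECORD_BYTES
            or self.entries + 1 > MAX_ENTRIES
            or self.total_bytes + record_bytes > MAX_BYTES
            or peer_total + record_bytes > MAX_PEER_BYTES
        ):
            return False
        self.entries += 1
        self.total_bytes += record_bytes
        self.peer_bytes[peer] = peer_total + record_bytes
        return True
```

Note that with the defaults, a single unavailable peer can consume at most half of the total byte budget, which leaves room for hints to a second peer.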

Anti-entropy repair

Repair uses a digest exchange: each node periodically computes per-shard fingerprints and compares them with peers. Diverged shards trigger a targeted data backfill.

Environment variable	Default	Description
`TSINK_CLUSTER_DIGEST_INTERVAL_SECS`	`30`	How often digest exchange runs per node.
`TSINK_CLUSTER_DIGEST_WINDOW_SECS`	`300`	Lookback window included in each digest (seconds).
`TSINK_CLUSTER_DIGEST_MAX_SHARDS_PER_TICK`	`64`	Max shards compared per tick.
`TSINK_CLUSTER_DIGEST_MAX_MISMATCH_REPORTS`	`128`	Max mismatch records tracked in memory.
`TSINK_CLUSTER_DIGEST_MAX_BYTES_PER_TICK`	`262144` (256 KiB)	Max digest payload size per tick.
`TSINK_CLUSTER_REPAIR_MAX_MISMATCHES_PER_TICK`	`2`	Max diverged shards repaired per tick.
`TSINK_CLUSTER_REPAIR_MAX_SERIES_PER_TICK`	`256`	Max series included in a repair round.
`TSINK_CLUSTER_REPAIR_MAX_ROWS_PER_TICK`	`16384`	Max rows transferred per repair tick.
`TSINK_CLUSTER_REPAIR_MAX_RUNTIME_MS_PER_TICK`	`100`	Max wall time per repair tick (ms).
`TSINK_CLUSTER_REPAIR_FAILURE_BACKOFF_SECS`	`30`	Backoff after a failed repair attempt.
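The digest exchange can be pictured with the following sketch. The fingerprint scheme (SHA-256 over sorted rows) and the function names are assumptions for illustration only; the guide does not describe how tsink computes its per-shard fingerprints. The per-tick limit mirrors the default of `TSINK_CLUSTER_REPAIR_MAX_MISMATCHES_PER_TICK`.

```python
import hashlib

REPAIR_MAX_MISMATCHES_PER_TICK = 2  # default from the table above


def shard_fingerprint(rows) -> str:
    """Order-independent digest of (series_id, timestamp, value) rows.
    Hypothetical scheme, chosen only to make the example concrete."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode())
    return h.hexdigest()


def diverged_shards(local: dict, remote: dict) -> list:
    # Shards held by both sides whose fingerprints differ.
    return sorted(s for s in local.keys() & remote.keys() if local[s] != remote[s])


def repair_plan(local: dict, remote: dict, limit: int = REPAIR_MAX_MISMATCHES_PER_TICK) -> list:
    # Only the first `limit` diverged shards are repaired in one tick;
    # the rest wait for later ticks.
    return diverged_shards(local, remote)[:limit]
```

Capping the work per tick keeps repair from competing with foreground traffic, at the cost of slower convergence when many shards have diverged.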

Repair can be paused, resumed, cancelled, or triggered on-demand via the admin API:

# Pause repair
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/pause

# Resume repair
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/resume

# Run repair immediately
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/run

# Check repair status
curl http://node-1:9201/api/v1/admin/cluster/repair/status

Rebalance

When shard ownership changes (node joining, leaving, or ring update), data migrates to new owners. Rebalance runs automatically in the background and is rate-limited to avoid impacting production traffic.

Environment variable	Default	Description
`TSINK_CLUSTER_REBALANCE_INTERVAL_SECS`	`5`	How often the rebalance loop ticks.
`TSINK_CLUSTER_REBALANCE_MAX_ROWS_PER_TICK`	`10000`	Max rows migrated per rebalance tick.
`TSINK_CLUSTER_REBALANCE_MAX_SHARDS_PER_TICK`	`4`	Max shards processed per rebalance tick.
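The rate limits above put a floor on how long a migration takes. The following back-of-the-envelope sketch assumes the row cap is the binding limit and that every tick moves the maximum number of rows; it ignores the shard cap, network throughput, and whether the limits apply per node or cluster-wide, none of which this guide specifies.

```python
import math


def estimate_rebalance_secs(rows_to_move: int,
                            interval_secs: int = 5,
                            max_rows_per_tick: int = 10_000) -> int:
    """Rough lower bound on migration time under the per-tick row cap."""
    ticks = math.ceil(rows_to_move / max_rows_per_tick)
    return ticks * interval_secs
```

With the defaults, the ceiling is 2,000 rows per second, so moving 50 million rows takes at least 25,000 seconds (about 7 hours). Raise `TSINK_CLUSTER_REBALANCE_MAX_ROWS_PER_TICK` if migrations need to finish faster and the cluster has headroom.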

Rebalance management endpoints:

# Pause rebalance
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/pause

# Resume rebalance
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/resume

# Run rebalance immediately
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/run

# Check rebalance status
curl http://node-1:9201/api/v1/admin/cluster/rebalance/status

Adding and removing nodes

Adding a node

Start the new node with --cluster-enabled, a unique --cluster-node-id, and --cluster-seeds pointing at existing nodes.
The new node contacts the seed list and sends a join request to the control plane.
Once accepted, the control plane updates the ring and rebalance begins migrating shards to the new node.

You can also trigger the join manually:

curl -X POST http://new-node:9201/api/v1/admin/cluster/join

Removing a node

Signal the node to drain its shards before stopping:

curl -X POST http://node-to-remove:9201/api/v1/admin/cluster/leave

This initiates a graceful handoff: the node transfers its shard data to the remaining owners before marking itself as removed. Check handoff status:

curl http://node-to-remove:9201/api/v1/admin/cluster/handoff/status

Once the handoff is complete, the node can be stopped safely.

Shard handoff phases

When a shard is migrated, it goes through the following phases:

Phase	Description
`Warmup`	New owner receives data; old owner remains primary.
`Cutover`	Traffic switches to the new owner.
`FinalSync`	Final data sync before the old owner relinquishes the shard.
`Completed`	Migration done.
`Failed`	An error occurred; the handoff can be resumed from `Warmup`.
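The phase table can be read as a small state machine. The sketch below is one interpretation: it assumes the phases run in the order listed, that any in-progress phase can fail, and that a failed handoff restarts from `Warmup`. The `advance` function is hypothetical and not part of tsink.

```python
from enum import Enum


class Phase(Enum):
    WARMUP = "Warmup"
    CUTOVER = "Cutover"
    FINAL_SYNC = "FinalSync"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Happy-path order, taken from the table above.
_NEXT = {
    Phase.WARMUP: Phase.CUTOVER,
    Phase.CUTOVER: Phase.FINAL_SYNC,
    Phase.FINAL_SYNC: Phase.COMPLETED,
}


def advance(phase: Phase, ok: bool = True) -> Phase:
    """Return the next phase; a failed handoff resumes from Warmup."""
    if phase is Phase.COMPLETED:
        raise ValueError("handoff already completed")
    if phase is Phase.FAILED:
        return Phase.WARMUP
    return _NEXT[phase] if ok else Phase.FAILED
```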

Control-plane consensus

Cluster membership and ring state are managed by an internal Raft-like consensus log replicated across all storage-capable nodes. The leader coordinates ring updates, join/leave decisions, and snapshots.

Environment variable	Default	Description
`TSINK_CLUSTER_CONTROL_TICK_INTERVAL_SECS`	`2`	Consensus tick interval.
`TSINK_CLUSTER_CONTROL_MAX_APPEND_ENTRIES`	`64`	Max log entries per append round.
`TSINK_CLUSTER_CONTROL_SNAPSHOT_INTERVAL_ENTRIES`	`128`	Log entries between automatic snapshots.
`TSINK_CLUSTER_CONTROL_SUSPECT_TIMEOUT_SECS`	`6`	Seconds before a silent peer is marked suspect.
`TSINK_CLUSTER_CONTROL_DEAD_TIMEOUT_SECS`	`20`	Seconds before a suspect peer is declared dead.
`TSINK_CLUSTER_CONTROL_LEADER_LEASE_SECS`	`6`	Leader lease duration.

Peer liveness transitions: unknown → healthy → suspect → dead.
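A minimal sketch of how the two timeouts could drive these transitions, using the default values. It assumes both timeouts are measured from the last time the peer was heard from; the guide does not say whether the dead timeout instead starts when the peer becomes suspect.

```python
from typing import Optional

SUSPECT_TIMEOUT_SECS = 6   # TSINK_CLUSTER_CONTROL_SUSPECT_TIMEOUT_SECS
DEAD_TIMEOUT_SECS = 20     # TSINK_CLUSTER_CONTROL_DEAD_TIMEOUT_SECS


def classify_peer(silent_secs: Optional[float]) -> str:
    """Map time since last contact to a liveness state.
    None means the peer has never been heard from."""
    if silent_secs is None:
        return "unknown"
    if silent_secs >= DEAD_TIMEOUT_SECS:
        return "dead"
    if silent_secs >= SUSPECT_TIMEOUT_SECS:
        return "suspect"
    return "healthy"
```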

Snapshots

Snapshot and restore the control-plane state:

# Snapshot this node's control plane
curl -X POST http://node-1:9201/api/v1/admin/cluster/control/snapshot

# Cluster-wide coordinated snapshot
curl -X POST http://node-1:9201/api/v1/admin/cluster/snapshot

# Restore from a cluster-wide snapshot
curl -X POST http://node-1:9201/api/v1/admin/cluster/restore

Internal security

All inter-node communication runs on the internal RPC port (--cluster-bind) under /internal/v1/* endpoints. Two mutually exclusive authentication mechanisms are available.

Shared-secret token (simpler)

Provide a token at startup. All nodes in the cluster must use the same token.

tsink-server \
  --cluster-internal-auth-token "s3cr3t-tok3n" \
  ...

Or load it from a file (useful for secret management):

tsink-server \
  --cluster-internal-auth-token-file /run/secrets/cluster-token \
  ...

The token is sent on every internal RPC call via the x-tsink-internal-auth header. Tokens can be rotated at runtime without restart — see the secret rotation guide.

mTLS (recommended for production)

Enable mTLS to authenticate and encrypt all peer-to-peer traffic using certificates. When mTLS is enabled, shared-secret token auth is disabled. All three paths are required when --cluster-internal-mtls-enabled is set:

tsink-server \
  --cluster-internal-mtls-enabled \
  --cluster-internal-mtls-ca-cert /etc/tsink/ca.pem \
  --cluster-internal-mtls-cert    /etc/tsink/node.crt \
  --cluster-internal-mtls-key     /etc/tsink/node.key \
  ...

Flag	Description
`--cluster-internal-mtls-ca-cert`	PEM CA bundle. Used to verify incoming peer connections and outbound TLS.
`--cluster-internal-mtls-cert`	PEM certificate presented by this node on all outbound RPC calls.
`--cluster-internal-mtls-key`	PEM private key for the certificate above.

mTLS certificates can be rotated at runtime without restart. See the secret rotation guide.

RPC tuning

The following environment variables control internal RPC behavior and resource limits:

Environment variable	Default	Description
`TSINK_CLUSTER_FANOUT_CONCURRENCY`	`16`	Maximum concurrent fanout RPCs per request.
`TSINK_CLUSTER_RPC_TIMEOUT_MS`	`2000`	Timeout per individual RPC call (ms).
`TSINK_CLUSTER_RPC_MAX_RETRIES`	`2`	Maximum retries before falling back to hinted handoff (writes) or error (reads).
`TSINK_CLUSTER_WRITE_MAX_BATCH_ROWS`	`1024`	Maximum rows per write batch sent to a peer.
`TSINK_CLUSTER_WRITE_MAX_INFLIGHT_BATCHES`	`32`	Maximum concurrent write batches in flight to all peers.
`TSINK_CLUSTER_READ_MAX_MERGED_SERIES`	`250000`	Maximum number of series that can be merged in a single distributed read.
`TSINK_CLUSTER_READ_MAX_MERGED_POINTS_PER_SERIES`	`1000000`	Maximum data points per series in a distributed read result.
`TSINK_CLUSTER_READ_MAX_MERGED_POINTS_TOTAL`	`5000000`	Maximum total data points in a distributed read result.
`TSINK_CLUSTER_READ_MAX_INFLIGHT_QUERIES`	`64`	Maximum concurrent distributed read queries.
`TSINK_CLUSTER_READ_MAX_INFLIGHT_MERGED_POINTS`	`20000000`	Maximum total in-flight merged points across all concurrent reads.
`TSINK_CLUSTER_READ_RESOURCE_ACQUIRE_TIMEOUT_MS`	`25`	Timeout acquiring read concurrency slots (ms).
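The interaction between `TSINK_CLUSTER_RPC_MAX_RETRIES` and hinted handoff on the write path can be sketched as below. The function and its callbacks are hypothetical; the sketch assumes one initial attempt plus up to the configured number of retries, and it leaves out timeouts and backoff between attempts.

```python
from typing import Callable

RPC_MAX_RETRIES = 2  # TSINK_CLUSTER_RPC_MAX_RETRIES default


def replicate_write(send: Callable[[], bool],
                    enqueue_hint: Callable[[], None],
                    max_retries: int = RPC_MAX_RETRIES) -> str:
    """Try to deliver a write to one replica; queue a hint if every attempt fails."""
    # One initial attempt plus up to `max_retries` retries (assumed semantics).
    for _ in range(1 + max_retries):
        try:
            if send():
                return "delivered"
        except (ConnectionError, TimeoutError):
            pass  # treat transport errors like a failed attempt
    enqueue_hint()
    return "hinted"
```

On the read path there is no outbox to fall back to, so exhausting the retries surfaces as an error (or as a partial response when that mode is allowed).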

Write deduplication

To prevent double-writes during retry storms, tsink tracks idempotency keys per node in a sliding time window. Duplicate writes within the window are silently dropped on the receiving node.

Environment variable	Default	Description
`TSINK_CLUSTER_DEDUPE_WINDOW_SECS`	`900`	Deduplication window (15 minutes).
`TSINK_CLUSTER_DEDUPE_MAX_ENTRIES`	`250000`	Maximum tracked idempotency keys.
`TSINK_CLUSTER_DEDUPE_MAX_LOG_BYTES`	`67108864` (64 MiB)	Maximum WAL size for the dedupe log.
`TSINK_CLUSTER_DEDUPE_CLEANUP_INTERVAL_SECS`	`30`	Cleanup interval for expired entries.
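A sliding-window idempotency tracker of the kind described above might look like the following sketch. The class is hypothetical: it keeps keys in memory only (no WAL), expects timestamps that never go backwards, and assumes the oldest key is evicted when the entry cap is reached, which the guide does not confirm.

```python
from collections import OrderedDict


class DedupeWindow:
    def __init__(self, window_secs: int = 900, max_entries: int = 250_000):
        self.window_secs = window_secs
        self.max_entries = max_entries
        self._seen = OrderedDict()  # idempotency key -> first-seen time, oldest first

    def is_duplicate(self, key: str, now: float) -> bool:
        """Return True if `key` was already seen inside the window; otherwise record it."""
        # Expire keys that have aged out of the window.
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self.window_secs:
                break
            self._seen.popitem(last=False)
        if key in self._seen:
            return True
        if len(self._seen) >= self.max_entries:
            self._seen.popitem(last=False)  # assumption: evict the oldest key when full
        self._seen[key] = now
        return False
```

A retry that arrives after its key has expired or been evicted is treated as a new write, so the window and entry cap should comfortably exceed the longest expected retry horizon, including hinted-handoff replay.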

Cluster audit log

All control-plane operations (joins, leaves, ring changes, handoffs) are written to an audit log queryable via the admin API.

Environment variable	Default	Description
`TSINK_CLUSTER_AUDIT_RETENTION_SECS`	`2592000` (30 days)	How long audit records are retained.
`TSINK_CLUSTER_AUDIT_MAX_LOG_BYTES`	`134217728` (128 MiB)	Maximum size of the audit log.
`TSINK_CLUSTER_AUDIT_MAX_QUERY_LIMIT`	`1000`	Maximum records returned per audit query.

# Query recent audit records
curl http://node-1:9201/api/v1/admin/cluster/audit

# Export full audit log as NDJSON
curl http://node-1:9201/api/v1/admin/cluster/audit/export

Configuration checklist

Before starting a production cluster:

Every node has a unique, stable --cluster-node-id.
--cluster-bind is reachable by all peers on the network.
--cluster-shards is consistent across all nodes in the cluster.
--cluster-replication-factor ≤ number of storage-capable nodes.
Internal auth is configured: either a shared-secret token (--cluster-internal-auth-token or --cluster-internal-auth-token-file) or mTLS (the --cluster-internal-mtls-* flags).
--cluster-seeds lists enough existing nodes for bootstrap (any subset works).
Firewalls allow traffic on both the public --listen port and the internal --cluster-bind port.
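Several checklist items can be verified mechanically before rollout. The sketch below is a hypothetical pre-flight helper, not a tsink tool: it takes a dict keyed by flag name (without the leading dashes) and applies the constraints stated in the flags reference. Network reachability and firewall rules still need to be checked separately.

```python
def validate_cluster_config(cfg: dict, storage_capable_nodes: int) -> list:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    errors = []
    node_id = cfg.get("cluster-node-id")
    if not node_id or node_id == "unknown":
        errors.append("cluster-node-id must be set and must not be 'unknown'")
    if not cfg.get("cluster-bind"):
        errors.append("cluster-bind is required")
    if cfg.get("cluster-shards", 128) <= 0:
        errors.append("cluster-shards must be > 0")
    rf = cfg.get("cluster-replication-factor", 1)
    if not 0 < rf <= storage_capable_nodes:
        errors.append("cluster-replication-factor must be > 0 and <= storage-capable nodes")
    if cfg.get("cluster-internal-auth-token") and cfg.get("cluster-internal-auth-token-file"):
        errors.append("auth token and auth token file are mutually exclusive")
    if cfg.get("cluster-internal-mtls-enabled"):
        for suffix in ("ca-cert", "cert", "key"):
            if not cfg.get(f"cluster-internal-mtls-{suffix}"):
                errors.append(f"cluster-internal-mtls-{suffix} is required when mTLS is enabled")
    return errors
```

For instance, a config with `cluster-replication-factor` of 3 passes on a three-node cluster but is flagged when only two storage-capable nodes exist.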
