Skip to main content

Cluster Setup

This guide covers running tsink as a replicated multi-node cluster. The same engine binary that runs single-node also runs in cluster mode — clustering is enabled with a flag.

Overview

A tsink cluster is a group of nodes that share data via consistent-hash-ring sharding and configurable replication. The cluster handles:
  • Shard routing — series are hashed to a logical shard, and each shard is owned by one or more replica nodes.
  • Replication — writes fan out to all replica owners; consistency guarantees are tunable.
  • Hinted handoff — writes destined for a temporarily unavailable peer are durably queued and replayed when it recovers.
  • Anti-entropy repair — periodic digest exchange detects and repairs diverged shards.
  • Online rebalance — shard ownership migrates automatically when nodes join or leave.
  • Control-plane consensus — membership and ring state are managed by an internal Raft-like log replicated across all storage-capable nodes.

Quick start: three-node cluster

Start three nodes, each binding to its own data path and internal RPC endpoint:
# Node 1
tsink-server \
  --listen 0.0.0.0:9201 \
  --data-path ./var/node1 \
  --cluster-enabled \
  --cluster-node-id node-1 \
  --cluster-bind 0.0.0.0:9211 \
  --cluster-seeds node-2:9212,node-3:9213 \
  --cluster-replication-factor 3

# Node 2
tsink-server \
  --listen 0.0.0.0:9202 \
  --data-path ./var/node2 \
  --cluster-enabled \
  --cluster-node-id node-2 \
  --cluster-bind 0.0.0.0:9212 \
  --cluster-seeds node-1:9211,node-3:9213 \
  --cluster-replication-factor 3

# Node 3
tsink-server \
  --listen 0.0.0.0:9203 \
  --data-path ./var/node3 \
  --cluster-enabled \
  --cluster-node-id node-3 \
  --cluster-bind 0.0.0.0:9213 \
  --cluster-seeds node-1:9211,node-2:9212 \
  --cluster-replication-factor 3
Each node bootstraps by contacting the seed list until the control plane accepts its join. Once all three nodes are active, shard ownership is distributed across them.

CLI flags reference

All cluster flags are only meaningful when --cluster-enabled is set.
FlagDefaultDescription
--cluster-enabledfalseEnable cluster mode.
--cluster-node-id <ID>Required. Stable, unique identifier for this node (e.g. node-1). Must not be "unknown".
--cluster-bind <HOST:PORT>Required. Internal RPC listen/advertise address. Peers connect here.
--cluster-node-role <ROLE>hybridNode role: storage, query, or hybrid. See node roles.
--cluster-seeds <HOST:PORT,...>Comma-separated list of peer endpoints used for initial join. Format: host:port or node-id@host:port.
--cluster-shards <N>128Number of logical shards. Must be > 0. Set this once at cluster creation and never change it.
--cluster-replication-factor <N>1Number of replicas per shard. Must be > 0 and ≤ number of storage-capable nodes.
--cluster-write-consistency <MODE>quorumWrite consistency level: one, quorum, or all.
--cluster-read-consistency <MODE>eventualRead consistency level: eventual, quorum, or strict.
--cluster-read-partial-response <MODE>allowWhether to allow partial results when some shards are unavailable: allow or deny.
--cluster-internal-auth-token <TOKEN>Shared-secret token sent on all internal RPC calls. Mutually exclusive with --cluster-internal-auth-token-file.
--cluster-internal-auth-token-file <PATH>Path to a file containing the shared-secret token. Mutually exclusive with --cluster-internal-auth-token.
--cluster-internal-mtls-enabledfalseEnable mTLS for internal RPC. When set, token auth is disabled. Requires the three flags below.
--cluster-internal-mtls-ca-cert <PATH>PEM CA bundle used to verify peer certificates.
--cluster-internal-mtls-cert <PATH>PEM client certificate presented on outbound RPC.
--cluster-internal-mtls-key <PATH>PEM private key for --cluster-internal-mtls-cert.

Node roles

Each node has a role that determines whether it owns data shards and whether it participates in query fanout.
RoleOwns shardsHandles queries
hybrid (default)YesYes
storageYesVia RPC from query nodes
queryNoYes (routes to storage/hybrid)
hybrid is appropriate for homogeneous clusters where every node is equal. storage + query separation is useful when you want dedicated query nodes with more memory for merging results, while storage nodes focus on ingest and retention. A query-role node requires at least one storage or hybrid node in its seed list:
tsink-server \
  --cluster-node-role query \
  --cluster-seeds storage-1:9211,storage-2:9212 \
  ...

Sharding

Series are mapped to shards, and shards are mapped to replica owners, using a consistent hash ring:
  1. Series → shard: shard_index = series_id % shard_count
  2. Shard → owners: virtual-node consistent hash ring (xxHash64, seed 0). Each physical node gets 128 virtual nodes on the ring. The replication_factor clockwise-unique physical nodes are the shard’s replica set.
The --cluster-shards value determines the total number of logical shards and cannot be changed after the cluster is created. The default of 128 is sufficient for most deployments. Use a higher value (e.g. 512 or 1024) if you plan to grow to many nodes and want fine-grained shard migration.

Consistency levels

Write consistency

Controls how many replica acks are required before a write is confirmed.
LevelAcks requiredFormula
one1
quorum (default)majority⌊replication_factor / 2⌋ + 1
allall replicasreplication_factor
The write consistency can be overridden per-request using the HTTP header:
x-tsink-write-consistency: one
Per-request overrides may only weaken (not strengthen) the server-configured level.

Read consistency

Controls how many replica responses are required and whether all replicas are queried.
LevelReplicas queriedAcks required
eventual (default)Primary only1
quorumAll replicas⌊replicas / 2⌋ + 1
strictAll replicasAll replicas

Partial response

When --cluster-read-partial-response allow (default), a query can succeed and return data even if some shards are temporarily unavailable. The response will include a "partial_response": true field in the metadata. Set --cluster-read-partial-response deny to return an error instead when any shard is unavailable.

Hinted handoff

When a write cannot be delivered to a replica because the peer is unreachable, tsink durably queues the data in a per-peer hinted handoff outbox. When the peer recovers, the queued data is replayed automatically. Handoff behavior is tunable via environment variables:
Environment variableDefaultDescription
TSINK_CLUSTER_OUTBOX_MAX_ENTRIES100000Maximum number of queued entries across all peers.
TSINK_CLUSTER_OUTBOX_MAX_BYTES536870912 (512 MiB)Total size cap for the handoff queue.
TSINK_CLUSTER_OUTBOX_MAX_PEER_BYTES268435456 (256 MiB)Per-peer size cap.
TSINK_CLUSTER_OUTBOX_MAX_LOG_BYTES2147483648 (2 GiB)Maximum WAL size for the outbox log.
TSINK_CLUSTER_OUTBOX_REPLAY_INTERVAL_SECS2How often to attempt replay to recovered peers (seconds).
TSINK_CLUSTER_OUTBOX_REPLAY_BATCH_SIZE256Rows per replay batch.
TSINK_CLUSTER_OUTBOX_MAX_BACKOFF_SECS30Maximum backoff between replay attempts.
TSINK_CLUSTER_OUTBOX_MAX_RECORD_BYTES2097152 (2 MiB)Maximum size of a single outbox record.
TSINK_CLUSTER_OUTBOX_CLEANUP_INTERVAL_SECS30How often stale outbox records are cleaned up.
TSINK_CLUSTER_OUTBOX_CLEANUP_MIN_STALE_RECORDS1024Minimum stale records before cleanup runs.
TSINK_CLUSTER_OUTBOX_STALLED_PEER_AGE_SECS300Age threshold (seconds) at which a peer’s outbox queue is considered stalled.
TSINK_CLUSTER_OUTBOX_STALLED_PEER_MIN_ENTRIES1Minimum entries for a peer queue to be considered stalled.
TSINK_CLUSTER_OUTBOX_STALLED_PEER_MIN_BYTES1Minimum bytes for a peer queue to be considered stalled.

Anti-entropy repair

Repair uses a digest exchange: each node periodically computes per-shard fingerprints and compares them with peers. Diverged shards trigger a targeted data backfill.
Environment variableDefaultDescription
TSINK_CLUSTER_DIGEST_INTERVAL_SECS30How often digest exchange runs per node.
TSINK_CLUSTER_DIGEST_WINDOW_SECS300Lookback window included in each digest (seconds).
TSINK_CLUSTER_DIGEST_MAX_SHARDS_PER_TICK64Max shards compared per tick.
TSINK_CLUSTER_DIGEST_MAX_MISMATCH_REPORTS128Max mismatch records tracked in memory.
TSINK_CLUSTER_DIGEST_MAX_BYTES_PER_TICK262144 (256 KiB)Max digest payload size per tick.
TSINK_CLUSTER_REPAIR_MAX_MISMATCHES_PER_TICK2Max diverged shards repaired per tick.
TSINK_CLUSTER_REPAIR_MAX_SERIES_PER_TICK256Max series included in a repair round.
TSINK_CLUSTER_REPAIR_MAX_ROWS_PER_TICK16384Max rows transferred per repair tick.
TSINK_CLUSTER_REPAIR_MAX_RUNTIME_MS_PER_TICK100Max wall time per repair tick (ms).
TSINK_CLUSTER_REPAIR_FAILURE_BACKOFF_SECS30Backoff after a failed repair attempt.
Repair can be paused, resumed, cancelled, or triggered on-demand via the admin API:
# Pause repair
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/pause

# Resume repair
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/resume

# Run repair immediately
curl -X POST http://node-1:9201/api/v1/admin/cluster/repair/run

# Check repair status
curl http://node-1:9201/api/v1/admin/cluster/repair/status

Rebalance

When shard ownership changes (node joining, leaving, or ring update), data migrates to new owners. Rebalance runs automatically in the background and is rate-limited to avoid impacting production traffic.
Environment variableDefaultDescription
TSINK_CLUSTER_REBALANCE_INTERVAL_SECS5How often the rebalance loop ticks.
TSINK_CLUSTER_REBALANCE_MAX_ROWS_PER_TICK10000Max rows migrated per rebalance tick.
TSINK_CLUSTER_REBALANCE_MAX_SHARDS_PER_TICK4Max shards processed per rebalance tick.
Rebalance management endpoints:
# Pause rebalance
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/pause

# Resume rebalance
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/resume

# Run rebalance immediately
curl -X POST http://node-1:9201/api/v1/admin/cluster/rebalance/run

# Check rebalance status
curl http://node-1:9201/api/v1/admin/cluster/rebalance/status

Adding and removing nodes

Adding a node

  1. Start the new node with --cluster-enabled, a unique --cluster-node-id, and --cluster-seeds pointing at existing nodes.
  2. The new node contacts the seed list and sends a join request to the control plane.
  3. Once accepted, the control plane updates the ring and rebalance begins migrating shards to the new node.
You can also trigger the join manually:
curl -X POST http://new-node:9201/api/v1/admin/cluster/join

Removing a node

Signal the node to drain its shards before stopping:
curl -X POST http://node-to-remove:9201/api/v1/admin/cluster/leave
This initiates a graceful handoff: the node transfers its shard data to the remaining owners before marking itself as removed. Check handoff status:
curl http://node-to-remove:9201/api/v1/admin/cluster/handoff/status
Once the handoff is complete, the node can be stopped safely.

Shard handoff phases

When a shard is migrated, it goes through the following phases:
PhaseDescription
WarmupNew owner receives data; old owner remains primary.
CutoverTraffic switches to the new owner.
FinalSyncFinal data sync before the old owner relinquishes the shard.
CompletedMigration done.
FailedAn error occurred; the handoff can be resumed from Warmup.

Control-plane consensus

Cluster membership and ring state are managed by an internal Raft-like consensus log replicated across all storage-capable nodes. The leader coordinates ring updates, join/leave decisions, and snapshots.
Environment variableDefaultDescription
TSINK_CLUSTER_CONTROL_TICK_INTERVAL_SECS2Consensus tick interval.
TSINK_CLUSTER_CONTROL_MAX_APPEND_ENTRIES64Max log entries per append round.
TSINK_CLUSTER_CONTROL_SNAPSHOT_INTERVAL_ENTRIES128Log entries between automatic snapshots.
TSINK_CLUSTER_CONTROL_SUSPECT_TIMEOUT_SECS6Seconds before a silent peer is marked suspect.
TSINK_CLUSTER_CONTROL_DEAD_TIMEOUT_SECS20Seconds before a suspect peer is declared dead.
TSINK_CLUSTER_CONTROL_LEADER_LEASE_SECS6Leader lease duration.
Peer liveness transitions: unknown → healthy → suspect → dead.

Snapshots

Snapshot and restore the control-plane state:
# Snapshot this node's control plane
curl -X POST http://node-1:9201/api/v1/admin/cluster/control/snapshot

# Cluster-wide coordinated snapshot
curl -X POST http://node-1:9201/api/v1/admin/cluster/snapshot

# Restore from a cluster-wide snapshot
curl -X POST http://node-1:9201/api/v1/admin/cluster/restore

Internal security

All inter-node communication runs on the internal RPC port (--cluster-bind) under /internal/v1/* endpoints. Two mutually-exclusive authentication mechanisms are available.

Shared-secret token (simpler)

Provide a token at startup. All nodes in the cluster must use the same token.
tsink-server \
  --cluster-internal-auth-token "s3cr3t-tok3n" \
  ...
Or load it from a file (useful for secret management):
tsink-server \
  --cluster-internal-auth-token-file /run/secrets/cluster-token \
  ...
The token is sent on every internal RPC call via the x-tsink-internal-auth header. Tokens can be rotated at runtime without restart — see the secret rotation guide. Enable mTLS to authenticate and encrypt all peer-to-peer traffic using certificates. When mTLS is enabled, shared-secret token auth is disabled. All three paths are required when --cluster-internal-mtls-enabled is set:
tsink-server \
  --cluster-internal-mtls-enabled \
  --cluster-internal-mtls-ca-cert /etc/tsink/ca.pem \
  --cluster-internal-mtls-cert    /etc/tsink/node.crt \
  --cluster-internal-mtls-key     /etc/tsink/node.key \
  ...
FlagDescription
--cluster-internal-mtls-ca-certPEM CA bundle. Used to verify incoming peer connections and outbound TLS.
--cluster-internal-mtls-certPEM certificate presented by this node on all outbound RPC calls.
--cluster-internal-mtls-keyPEM private key for the certificate above.
mTLS certificates can be rotated at runtime without restart. See the secret rotation guide.

RPC tuning

The following environment variables control internal RPC behavior and resource limits:
Environment variableDefaultDescription
TSINK_CLUSTER_FANOUT_CONCURRENCY16Maximum concurrent fanout RPCs per request.
TSINK_CLUSTER_RPC_TIMEOUT_MS2000Timeout per individual RPC call (ms).
TSINK_CLUSTER_RPC_MAX_RETRIES2Maximum retries before falling back to hinted handoff (writes) or error (reads).
TSINK_CLUSTER_WRITE_MAX_BATCH_ROWS1024Maximum rows per write batch sent to a peer.
TSINK_CLUSTER_WRITE_MAX_INFLIGHT_BATCHES32Maximum concurrent write batches in flight to all peers.
TSINK_CLUSTER_READ_MAX_MERGED_SERIES250000Maximum number of series that can be merged in a single distributed read.
TSINK_CLUSTER_READ_MAX_MERGED_POINTS_PER_SERIES1000000Maximum data points per series in a distributed read result.
TSINK_CLUSTER_READ_MAX_MERGED_POINTS_TOTAL5000000Maximum total data points in a distributed read result.
TSINK_CLUSTER_READ_MAX_INFLIGHT_QUERIES64Maximum concurrent distributed read queries.
TSINK_CLUSTER_READ_MAX_INFLIGHT_MERGED_POINTS20000000Maximum total in-flight merged points across all concurrent reads.
TSINK_CLUSTER_READ_RESOURCE_ACQUIRE_TIMEOUT_MS25Timeout acquiring read concurrency slots (ms).

Write deduplication

To prevent double-writes during retry storms, tsink tracks idempotency keys per node in a sliding time window. Duplicate writes within the window are silently dropped on the receiving node.
Environment variableDefaultDescription
TSINK_CLUSTER_DEDUPE_WINDOW_SECS900Deduplication window (15 minutes).
TSINK_CLUSTER_DEDUPE_MAX_ENTRIES250000Maximum tracked idempotency keys.
TSINK_CLUSTER_DEDUPE_MAX_LOG_BYTES67108864 (64 MiB)Maximum WAL size for the dedupe log.
TSINK_CLUSTER_DEDUPE_CLEANUP_INTERVAL_SECS30Cleanup interval for expired entries.

Cluster audit log

All control-plane operations (joins, leaves, ring changes, handoffs) are written to an audit log queryable via the admin API.
Environment variableDefaultDescription
TSINK_CLUSTER_AUDIT_RETENTION_SECS2592000 (30 days)How long audit records are retained.
TSINK_CLUSTER_AUDIT_MAX_LOG_BYTES134217728 (128 MiB)Maximum size of the audit log.
TSINK_CLUSTER_AUDIT_MAX_QUERY_LIMIT1000Maximum records returned per audit query.
# Query recent audit records
curl http://node-1:9201/api/v1/admin/cluster/audit

# Export full audit log as NDJSON
curl http://node-1:9201/api/v1/admin/cluster/audit/export

Configuration checklist

Before starting a production cluster:
  • Every node has a unique, stable --cluster-node-id.
  • --cluster-bind is reachable by all peers on the network.
  • --cluster-shards is consistent across all nodes in the cluster.
  • --cluster-replication-factor ≤ number of storage-capable nodes.
  • Internal auth is configured: either --cluster-internal-auth-token[|-file] or --cluster-internal-mtls-*.
  • --cluster-seeds lists enough existing nodes for bootstrap (any subset works).
  • Firewalls allow traffic on both the public --listen port and the internal --cluster-bind port.