Cluster Setup
This guide covers running tsink as a replicated multi-node cluster. The same engine binary that runs single-node also runs in cluster mode — clustering is enabled with a flag.Overview
A tsink cluster is a group of nodes that share data via consistent-hash-ring sharding and configurable replication. The cluster handles:- Shard routing — series are hashed to a logical shard, and each shard is owned by one or more replica nodes.
- Replication — writes fan out to all replica owners; consistency guarantees are tunable.
- Hinted handoff — writes destined for a temporarily unavailable peer are durably queued and replayed when it recovers.
- Anti-entropy repair — periodic digest exchange detects and repairs diverged shards.
- Online rebalance — shard ownership migrates automatically when nodes join or leave.
- Control-plane consensus — membership and ring state are managed by an internal Raft-like log replicated across all storage-capable nodes.
Quick start: three-node cluster
Start three nodes, each binding to its own data path and internal RPC endpoint:CLI flags reference
All cluster flags are only meaningful when--cluster-enabled is set.
| Flag | Default | Description |
|---|---|---|
--cluster-enabled | false | Enable cluster mode. |
--cluster-node-id <ID> | — | Required. Stable, unique identifier for this node (e.g. node-1). Must not be "unknown". |
--cluster-bind <HOST:PORT> | — | Required. Internal RPC listen/advertise address. Peers connect here. |
--cluster-node-role <ROLE> | hybrid | Node role: storage, query, or hybrid. See node roles. |
--cluster-seeds <HOST:PORT,...> | — | Comma-separated list of peer endpoints used for initial join. Format: host:port or node-id@host:port. |
--cluster-shards <N> | 128 | Number of logical shards. Must be > 0. Set this once at cluster creation and never change it. |
--cluster-replication-factor <N> | 1 | Number of replicas per shard. Must be > 0 and ≤ number of storage-capable nodes. |
--cluster-write-consistency <MODE> | quorum | Write consistency level: one, quorum, or all. |
--cluster-read-consistency <MODE> | eventual | Read consistency level: eventual, quorum, or strict. |
--cluster-read-partial-response <MODE> | allow | Whether to allow partial results when some shards are unavailable: allow or deny. |
--cluster-internal-auth-token <TOKEN> | — | Shared-secret token sent on all internal RPC calls. Mutually exclusive with --cluster-internal-auth-token-file. |
--cluster-internal-auth-token-file <PATH> | — | Path to a file containing the shared-secret token. Mutually exclusive with --cluster-internal-auth-token. |
--cluster-internal-mtls-enabled | false | Enable mTLS for internal RPC. When set, token auth is disabled. Requires the three flags below. |
--cluster-internal-mtls-ca-cert <PATH> | — | PEM CA bundle used to verify peer certificates. |
--cluster-internal-mtls-cert <PATH> | — | PEM client certificate presented on outbound RPC. |
--cluster-internal-mtls-key <PATH> | — | PEM private key for --cluster-internal-mtls-cert. |
Node roles
Each node has a role that determines whether it owns data shards and whether it participates in query fanout.| Role | Owns shards | Handles queries |
|---|---|---|
hybrid (default) | Yes | Yes |
storage | Yes | Via RPC from query nodes |
query | No | Yes (routes to storage/hybrid) |
hybrid is appropriate for homogeneous clusters where every node is equal. storage + query separation is useful when you want dedicated query nodes with more memory for merging results, while storage nodes focus on ingest and retention.
A query-role node requires at least one storage or hybrid node in its seed list:
Sharding
Series are mapped to shards, and shards are mapped to replica owners, using a consistent hash ring:- Series → shard:
shard_index = series_id % shard_count - Shard → owners: virtual-node consistent hash ring (xxHash64, seed 0). Each physical node gets 128 virtual nodes on the ring. The
replication_factorclockwise-unique physical nodes are the shard’s replica set.
--cluster-shards value determines the total number of logical shards and cannot be changed after the cluster is created. The default of 128 is sufficient for most deployments. Use a higher value (e.g. 512 or 1024) if you plan to grow to many nodes and want fine-grained shard migration.
Consistency levels
Write consistency
Controls how many replica acks are required before a write is confirmed.| Level | Acks required | Formula |
|---|---|---|
one | 1 | — |
quorum (default) | majority | ⌊replication_factor / 2⌋ + 1 |
all | all replicas | replication_factor |
Read consistency
Controls how many replica responses are required and whether all replicas are queried.| Level | Replicas queried | Acks required |
|---|---|---|
eventual (default) | Primary only | 1 |
quorum | All replicas | ⌊replicas / 2⌋ + 1 |
strict | All replicas | All replicas |
Partial response
When--cluster-read-partial-response allow (default), a query can succeed and return data even if some shards are temporarily unavailable. The response will include a "partial_response": true field in the metadata.
Set --cluster-read-partial-response deny to return an error instead when any shard is unavailable.
Hinted handoff
When a write cannot be delivered to a replica because the peer is unreachable, tsink durably queues the data in a per-peer hinted handoff outbox. When the peer recovers, the queued data is replayed automatically. Handoff behavior is tunable via environment variables:| Environment variable | Default | Description |
|---|---|---|
TSINK_CLUSTER_OUTBOX_MAX_ENTRIES | 100000 | Maximum number of queued entries across all peers. |
TSINK_CLUSTER_OUTBOX_MAX_BYTES | 536870912 (512 MiB) | Total size cap for the handoff queue. |
TSINK_CLUSTER_OUTBOX_MAX_PEER_BYTES | 268435456 (256 MiB) | Per-peer size cap. |
TSINK_CLUSTER_OUTBOX_MAX_LOG_BYTES | 2147483648 (2 GiB) | Maximum WAL size for the outbox log. |
TSINK_CLUSTER_OUTBOX_REPLAY_INTERVAL_SECS | 2 | How often to attempt replay to recovered peers (seconds). |
TSINK_CLUSTER_OUTBOX_REPLAY_BATCH_SIZE | 256 | Rows per replay batch. |
TSINK_CLUSTER_OUTBOX_MAX_BACKOFF_SECS | 30 | Maximum backoff between replay attempts. |
TSINK_CLUSTER_OUTBOX_MAX_RECORD_BYTES | 2097152 (2 MiB) | Maximum size of a single outbox record. |
TSINK_CLUSTER_OUTBOX_CLEANUP_INTERVAL_SECS | 30 | How often stale outbox records are cleaned up. |
TSINK_CLUSTER_OUTBOX_CLEANUP_MIN_STALE_RECORDS | 1024 | Minimum stale records before cleanup runs. |
TSINK_CLUSTER_OUTBOX_STALLED_PEER_AGE_SECS | 300 | Age threshold (seconds) at which a peer’s outbox queue is considered stalled. |
TSINK_CLUSTER_OUTBOX_STALLED_PEER_MIN_ENTRIES | 1 | Minimum entries for a peer queue to be considered stalled. |
TSINK_CLUSTER_OUTBOX_STALLED_PEER_MIN_BYTES | 1 | Minimum bytes for a peer queue to be considered stalled. |
Anti-entropy repair
Repair uses a digest exchange: each node periodically computes per-shard fingerprints and compares them with peers. Diverged shards trigger a targeted data backfill.| Environment variable | Default | Description |
|---|---|---|
TSINK_CLUSTER_DIGEST_INTERVAL_SECS | 30 | How often digest exchange runs per node. |
TSINK_CLUSTER_DIGEST_WINDOW_SECS | 300 | Lookback window included in each digest (seconds). |
TSINK_CLUSTER_DIGEST_MAX_SHARDS_PER_TICK | 64 | Max shards compared per tick. |
TSINK_CLUSTER_DIGEST_MAX_MISMATCH_REPORTS | 128 | Max mismatch records tracked in memory. |
TSINK_CLUSTER_DIGEST_MAX_BYTES_PER_TICK | 262144 (256 KiB) | Max digest payload size per tick. |
TSINK_CLUSTER_REPAIR_MAX_MISMATCHES_PER_TICK | 2 | Max diverged shards repaired per tick. |
TSINK_CLUSTER_REPAIR_MAX_SERIES_PER_TICK | 256 | Max series included in a repair round. |
TSINK_CLUSTER_REPAIR_MAX_ROWS_PER_TICK | 16384 | Max rows transferred per repair tick. |
TSINK_CLUSTER_REPAIR_MAX_RUNTIME_MS_PER_TICK | 100 | Max wall time per repair tick (ms). |
TSINK_CLUSTER_REPAIR_FAILURE_BACKOFF_SECS | 30 | Backoff after a failed repair attempt. |
Rebalance
When shard ownership changes (node joining, leaving, or ring update), data migrates to new owners. Rebalance runs automatically in the background and is rate-limited to avoid impacting production traffic.| Environment variable | Default | Description |
|---|---|---|
TSINK_CLUSTER_REBALANCE_INTERVAL_SECS | 5 | How often the rebalance loop ticks. |
TSINK_CLUSTER_REBALANCE_MAX_ROWS_PER_TICK | 10000 | Max rows migrated per rebalance tick. |
TSINK_CLUSTER_REBALANCE_MAX_SHARDS_PER_TICK | 4 | Max shards processed per rebalance tick. |
Adding and removing nodes
Adding a node
- Start the new node with
--cluster-enabled, a unique--cluster-node-id, and--cluster-seedspointing at existing nodes. - The new node contacts the seed list and sends a
joinrequest to the control plane. - Once accepted, the control plane updates the ring and rebalance begins migrating shards to the new node.
Removing a node
Signal the node to drain its shards before stopping:Shard handoff phases
When a shard is migrated, it goes through the following phases:| Phase | Description |
|---|---|
Warmup | New owner receives data; old owner remains primary. |
Cutover | Traffic switches to the new owner. |
FinalSync | Final data sync before the old owner relinquishes the shard. |
Completed | Migration done. |
Failed | An error occurred; the handoff can be resumed from Warmup. |
Control-plane consensus
Cluster membership and ring state are managed by an internal Raft-like consensus log replicated across all storage-capable nodes. The leader coordinates ring updates, join/leave decisions, and snapshots.| Environment variable | Default | Description |
|---|---|---|
TSINK_CLUSTER_CONTROL_TICK_INTERVAL_SECS | 2 | Consensus tick interval. |
TSINK_CLUSTER_CONTROL_MAX_APPEND_ENTRIES | 64 | Max log entries per append round. |
TSINK_CLUSTER_CONTROL_SNAPSHOT_INTERVAL_ENTRIES | 128 | Log entries between automatic snapshots. |
TSINK_CLUSTER_CONTROL_SUSPECT_TIMEOUT_SECS | 6 | Seconds before a silent peer is marked suspect. |
TSINK_CLUSTER_CONTROL_DEAD_TIMEOUT_SECS | 20 | Seconds before a suspect peer is declared dead. |
TSINK_CLUSTER_CONTROL_LEADER_LEASE_SECS | 6 | Leader lease duration. |
unknown → healthy → suspect → dead.
Snapshots
Snapshot and restore the control-plane state:Internal security
All inter-node communication runs on the internal RPC port (--cluster-bind) under /internal/v1/* endpoints. Two mutually-exclusive authentication mechanisms are available.
Shared-secret token (simpler)
Provide a token at startup. All nodes in the cluster must use the same token.x-tsink-internal-auth header. Tokens can be rotated at runtime without restart — see the secret rotation guide.
mTLS (recommended for production)
Enable mTLS to authenticate and encrypt all peer-to-peer traffic using certificates. When mTLS is enabled, shared-secret token auth is disabled. All three paths are required when--cluster-internal-mtls-enabled is set:
| Flag | Description |
|---|---|
--cluster-internal-mtls-ca-cert | PEM CA bundle. Used to verify incoming peer connections and outbound TLS. |
--cluster-internal-mtls-cert | PEM certificate presented by this node on all outbound RPC calls. |
--cluster-internal-mtls-key | PEM private key for the certificate above. |
RPC tuning
The following environment variables control internal RPC behavior and resource limits:| Environment variable | Default | Description |
|---|---|---|
TSINK_CLUSTER_FANOUT_CONCURRENCY | 16 | Maximum concurrent fanout RPCs per request. |
TSINK_CLUSTER_RPC_TIMEOUT_MS | 2000 | Timeout per individual RPC call (ms). |
TSINK_CLUSTER_RPC_MAX_RETRIES | 2 | Maximum retries before falling back to hinted handoff (writes) or error (reads). |
TSINK_CLUSTER_WRITE_MAX_BATCH_ROWS | 1024 | Maximum rows per write batch sent to a peer. |
TSINK_CLUSTER_WRITE_MAX_INFLIGHT_BATCHES | 32 | Maximum concurrent write batches in flight to all peers. |
TSINK_CLUSTER_READ_MAX_MERGED_SERIES | 250000 | Maximum number of series that can be merged in a single distributed read. |
TSINK_CLUSTER_READ_MAX_MERGED_POINTS_PER_SERIES | 1000000 | Maximum data points per series in a distributed read result. |
TSINK_CLUSTER_READ_MAX_MERGED_POINTS_TOTAL | 5000000 | Maximum total data points in a distributed read result. |
TSINK_CLUSTER_READ_MAX_INFLIGHT_QUERIES | 64 | Maximum concurrent distributed read queries. |
TSINK_CLUSTER_READ_MAX_INFLIGHT_MERGED_POINTS | 20000000 | Maximum total in-flight merged points across all concurrent reads. |
TSINK_CLUSTER_READ_RESOURCE_ACQUIRE_TIMEOUT_MS | 25 | Timeout acquiring read concurrency slots (ms). |
Write deduplication
To prevent double-writes during retry storms, tsink tracks idempotency keys per node in a sliding time window. Duplicate writes within the window are silently dropped on the receiving node.| Environment variable | Default | Description |
|---|---|---|
TSINK_CLUSTER_DEDUPE_WINDOW_SECS | 900 | Deduplication window (15 minutes). |
TSINK_CLUSTER_DEDUPE_MAX_ENTRIES | 250000 | Maximum tracked idempotency keys. |
TSINK_CLUSTER_DEDUPE_MAX_LOG_BYTES | 67108864 (64 MiB) | Maximum WAL size for the dedupe log. |
TSINK_CLUSTER_DEDUPE_CLEANUP_INTERVAL_SECS | 30 | Cleanup interval for expired entries. |
Cluster audit log
All control-plane operations (joins, leaves, ring changes, handoffs) are written to an audit log queryable via the admin API.| Environment variable | Default | Description |
|---|---|---|
TSINK_CLUSTER_AUDIT_RETENTION_SECS | 2592000 (30 days) | How long audit records are retained. |
TSINK_CLUSTER_AUDIT_MAX_LOG_BYTES | 134217728 (128 MiB) | Maximum size of the audit log. |
TSINK_CLUSTER_AUDIT_MAX_QUERY_LIMIT | 1000 | Maximum records returned per audit query. |
Configuration checklist
Before starting a production cluster:- Every node has a unique, stable
--cluster-node-id. -
--cluster-bindis reachable by all peers on the network. -
--cluster-shardsis consistent across all nodes in the cluster. -
--cluster-replication-factor≤ number of storage-capable nodes. - Internal auth is configured: either
--cluster-internal-auth-token[|-file]or--cluster-internal-mtls-*. -
--cluster-seedslists enough existing nodes for bootstrap (any subset works). - Firewalls allow traffic on both the public
--listenport and the internal--cluster-bindport.