Architecture Overview
This document describes the high-level design of tsink, the component interactions, and the end-to-end data flow for writes and reads.Table of Contents
- Deployment modes
- Repository layout
- Public API surface
- Storage engine overview
- Write path
- Read path
- Write-Ahead Log (WAL)
- Chunk encoding
- Segments and on-disk index
- Series registry
- Compaction
- Tiered storage and lifecycle
- Visibility and MVCC fence
- PromQL engine
- Rollups and downsampling
- Server mode
- Clustering and replication
- Security model
- Observability and background runtime
- Configuration reference summary
1. Deployment modes
tsink ships as three interconnected Rust crates in a single workspace.| Crate | Purpose |
|---|---|
tsink (root src/) | Embeddable library — the complete storage engine with no async runtime dependency. |
tsink-server (crates/tsink-server/) | Standalone HTTP server binary with protocol ingest, clustering, and RBAC. |
tsink-uniffi (crates/tsink-uniffi/) | UniFFI-generated Python bindings that wrap the library crate. |
Storage trait that application code uses directly.
2. Repository layout
3. Public API surface
Core interface — Storage trait (src/storage.rs)
All interaction with the engine goes through the Storage trait. Key methods:
| Method | Description |
|---|---|
insert_rows(&[Row]) | Synchronous batch write. |
insert_rows_with_result | Batch write returning per-row WriteResult. |
select(metric, labels, start, end) | Range read for a single series. |
select_with_options | Range read with QueryOptions (aggregation, downsampling, pagination). |
select_all | Bulk range read across all series matching a selector. |
list_metrics | Enumerate stored metric names. |
select_series | Enumerate series matching a label selector with time-range filtering. |
scan_series_rows / scan_metric_rows | Paginated scan for large result sets. |
delete_series | Tombstone-based deletion with optional time range. |
snapshot | Atomic directory snapshot. |
add_rollup_policy / run_rollup_pass | Manage persistent downsampling policies. |
observability_snapshot | Current internal counters and health state. |
Key data types
Row—(metric: String, labels: Vec<Label>, data_point: DataPoint).DataPoint—(timestamp: i64, value: Value).Value—F64(f64) | I64(i64) | U64(u64) | Bool(bool) | Bytes(Vec<u8>) | Histogram(Box<NativeHistogram>).Label—(name: String, value: String).TimestampPrecision—Nanoseconds | Microseconds | Milliseconds | Seconds.
Async façade — AsyncStorage (src/async.rs)
AsyncStorage wraps Storage without requiring Tokio. It spawns OS threads internally:
- One dedicated write worker receives
WriteCommandmessages over anasync-channel. - A pool of read workers (sized by
cgroup::default_workers_limit()) handlesReadCommandmessages concurrently. - Uses
parking_lot::Mutexandstd::thread— suitable for embedding in any async runtime.
4. Storage engine overview
The engine lives insrc/engine/ and is structured in three layers.
Layer 1 — Foundation libraries
Pure data-structure and codec modules with no cross-cutting concerns:| Module | Role |
|---|---|
chunk | In-memory chunk builder and sealed EncodedChunk. |
encoder | Timestamp and value codec selection and execution. |
segment | On-disk segment format, reader, writer, index postings. |
series | Series identity, interned string dictionaries, Roaring Bitmap postings. |
wal | WAL frame format, segmented log files, replay stream. |
index | In-memory ChunkIndex and PostingsIndex. |
query | Core query primitives: QueryPlan, ChunkSeriesCursor, chunk decoding. |
compactor | L0/L1/L2 compaction planning and execution. |
Layer 2 — Shared shell
State containers and cross-cutting services that the owner modules operate on:| Module | Role |
|---|---|
engine.rs | ChunkStorage — the top-level struct with five state buckets. |
state | All internal state type definitions. |
visibility | MVCC fence and per-series visibility summaries. |
runtime | Background thread management and supervision. |
metrics / observability | Lockless counters and snapshot generation. |
shard_routing | Maps series IDs to 1-of-64 shards; dead-lock-safe lock ordering. |
metadata_lookup | Distributed-shard series resolution. |
process_lock | Exclusive-open file lock. |
Layer 3 — Owner modules
Higher-level operations that drive the shell:| Module | Role |
|---|---|
bootstrap | Six-phase startup: discovery → recovery → WAL open → hydrate → replay → finalize. |
ingest + ingest_pipeline | Five-phase write pipeline (resolve → prepare → apply → commit). |
write_buffer | Active-head flush policies. |
lifecycle | Flush, persist, replay, and shutdown orchestration. |
query_exec + query_read | High-level query execution and low-level read path. |
deletion | Tombstone-based series deletion. |
rollups | Persistent downsampling materialization. |
maintenance + tiering + registry_catalog | Tier lifecycle, catalog validation. |
State buckets in ChunkStorage
To contain lock contention, ChunkStorage organises mutable state into five separate structs:
| Struct | Contents |
|---|---|
CatalogState | SeriesRegistry, 64 write_txn_shards (Mutex), pending series IDs, persistence lock. |
ChunkBufferState | 64-shard active builders, 64-shard sealed chunks, persisted-chunk watermarks, monotone chunk sequence counter. |
VisibilityState | Tombstones, materialized series, per-series visibility summaries, flush visibility RwLock. |
PersistedStorageState | PersistedIndexState, WAL handle, compactors, tiered-storage config, pending segment diff. |
RuntimeConfigState | Retention/partition/skew windows, memory budget, cardinality limit, WAL size limit, concurrency bounds. |
5. Write path
A singleinsert_rows call traverses six steps before returning to the caller.
Row carries a metric, Vec<Label>, and a DataPoint. The 64-shard split means that independent series in the same batch can be admitted concurrently.
Admission control in the prepare phase enforces:
memory_budget_bytes— total in-memory chunk bytes.cardinality_limit— maximum unique series count.- WAL size limit — maximum outstanding un-flushed WAL bytes.
WriteResult when calling insert_rows_with_result; the synchronous insert_rows drops rejected rows silently.
6. Read path
flush_visibility_lock is the consistency fence: writers hold a write lock during flush publication; readers hold a shared read lock for the duration of snapshot capture, guaranteeing a consistent view.
7. Write-Ahead Log (WAL)
The WAL lives under<data_path>/wal/ and is managed by the FramedWal struct.
Frame wire format
SeriesDefinitionFrame — written once per new series: (series_id, metric, labels).
SamplesBatchFrame — written per chunk flush: (series_id, lane, timestamp_codec_id, value_codec_id, point_count, base_timestamp, encoded_timestamps, encoded_values).
Durability modes
| Mode | Behavior |
|---|---|
WalSyncMode::PerAppend | fsync on every frame append. Crash-safe; throughput-limited. |
WalSyncMode::Periodic(d) | Background fsync every duration d. Higher throughput; up to d of data can be lost on hard crash. |
Replay modes
| Mode | Behavior |
|---|---|
WalReplayMode::Strict | Abort if any WAL segment is corrupted or unreadable. |
WalReplayMode::Salvage | Skip damaged segments and continue replay. |
High-water mark
WalHighWatermark { segment: u64, frame: u64 } tracks how far replay has been committed. Each persisted segment stores the WAL high-water mark at the time it was written, so recovery knows which WAL frames to replay and which to skip.
The published high-water file (wal.published) uses magic b"TSHW".
8. Chunk encoding
Chunk lifecycle
Timestamp codecs
| Codec | Use case |
|---|---|
FixedStepRle | Constant-interval series (e.g. scrape-interval metrics). |
DeltaVarint | Variable-interval series with small deltas. |
DeltaOfDeltaBitpack | Gorilla-style double-delta with bit-packing for irregular series. |
choose_best_timestamp_codec after inspecting the actual delta distribution.
Value codecs
| Codec | Use case |
|---|---|
GorillaXorF64 | XOR floating-point compression (Gorilla paper). |
ZigZagDeltaBitpackI64 | Signed integer delta + bit-packing. |
DeltaBitpackU64 | Unsigned integer delta + bit-packing. |
ConstantRle | Run-length encoding for constant-value series. |
BoolBitpack | 1-bit packing for boolean streams. |
BytesDeltaBlock | Block delta encoding for byte-string payloads. |
Timestamp search index
Each sealed chunk builds aTimestampSearchIndex — an array of anchor entries every 64 points. Queries binary-search this index to seek directly into the encoded payload without full decompression, giving O(log N) range access.
9. Segments and on-disk index
Directory structure
Segment manifest
Each segment directory contains:- Encoded chunk files (one per series × value lane).
- A
SegmentManifest(segment_id, level, chunk/point/series counts, min/max timestamps, wal_highwater). - A
SegmentPostingsIndexfor label/metric lookups within the segment. - Checksums; corrupted segments are quarantined automatically.
In-memory index
ChunkIndex — a flat sorted Vec<ChunkIndexEntry>, binary-searched for time-range queries. Each entry stores (series_id, min_ts, max_ts, chunk_offset, chunk_len, point_count, lane, codec IDs, level).
PostingsIndex — a BTreeMap<label_name → BTreeMap<label_value → BTreeSet<SeriesId>>> used to resolve label-matcher queries to candidate series lists.
10. Series registry
The registry maps(metric, labels) pairs to stable numeric SeriesId values and maintains postings for fast label-based lookup.
Internals
- 64-shard split —
series_shards,metric_postings_shards,label_postings_shards,all_series_shards; each shard protected by an independentRwLock. Reads and writes on different series never contend. - Interned string dictionaries —
metric_dict,label_name_dict,label_value_dictstore unique strings once;DictionaryId (u32)is used internally. Reduces memory for high-cardinality label sets. - Roaring Bitmaps — postings lists are
RoaringTreemap— compact, fast union/intersection for label-matcher resolution. SeriesValueFamily— inferred at first write (F64 | I64 | U64 | Bool | Blob | Histogram); used to select the appropriate codec path for all subsequent writes.
Series identity
canonical_series_identity(metric, labels) produces a deterministic binary key: labels are sorted lexicographically, then length-prefixed and concatenated with the metric name. stable_series_identity_hash applies xxh64 to this key for O(1) shard dispatch.
Persistence
The registry is persisted as a binary RIDX v2 file (series_index.bin) with incremental delta checkpoints in series_index.delta.d/. On startup, registry_catalog.rs validates per-segment xxh64 fingerprints against the catalog; a mismatch triggers a full registry rebuild from segment postings files.
11. Compaction
tsink uses a three-level LSM-inspired compaction scheme.- Planning (
compactor/planning.rs) — selects a window of up to 8 source segments at the triggering level. - Execution (
compactor/execution.rs) — merge-sorts all chunks, deduplicates overlapping samples, re-encodes with optimal codecs, writes new target-level segment. - Atomic replacement — the replacement is staged in
.compaction-replacements/and renamed into place. If the process crashes mid-replacement,finalize_pending_compaction_replacementscleans up the staging area at next startup.
12. Tiered storage and lifecycle
Tiers
| Tier | Location | Access |
|---|---|---|
| Hot | Local filesystem (lane_numeric/, lane_blob/) | Direct mmap reads. |
| Warm | Configurable local path or object store | Loaded on demand. |
| Cold | Object store | Loaded on demand; full fetch before query. |
Tier lifecycle
A background maintenance pass computes aPostFlushMaintenancePolicyPlan from RetentionTierPolicy:
expired_actions— delete segments that fall outside the retention window.rewrite_actions— trim segment boundaries at time-split points.move_actions— promote segments from Hot → Warm or Warm → Cold when they cross the configured retention cutoffs.
Query tier selection
TieredQueryPlan { start, end, include_warm, include_cold } is computed from the query time range vs. tier cutoffs. A query entirely within the hot window skips all remote tier I/O.
13. Visibility and MVCC fence
The problem
A flush publishes many new segment entries and visibility summaries at once. Without a fence, a reader could see a partially-published flush: some new segments visible, others not.The solution
flush_visibility_lock is a RwLock<()>:
- Writers acquire the exclusive
writelock for the entire duration of flush publication. - Readers acquire a shared
readlock during snapshot capture (step 3 of the read path).
Per-series visibility summaries
SeriesVisibilitySummary tracks up to SERIES_VISIBILITY_SUMMARY_MAX_RANGES = 32 disjoint time ranges per series. This handles series with gaps (e.g. intermittently reporting nodes) efficiently: readers can skip time ranges that are known to be empty without scanning segment files.
Fields per summary:
latest_visible_timestamp— highest timestamp ingested for this series.latest_bounded_visible_timestamp— highest timestamp whose containing chunk has been sealed and persisted.exhaustive_floor_inclusive— data below this timestamp is complete (no more writes expected).truncated_before_floor— all data before this point has been tombstoned.ranges— list of(min, max)tuples defining continuous data windows.
14. PromQL engine
tsink includes a full in-process PromQL evaluator insrc/promql/.
Components
| Module | Role |
|---|---|
lexer.rs | Hand-written lexer; produces Vec<Token>. |
parser.rs | Recursive-descent Pratt parser; produces ast::Expr. |
ast.rs | Complete AST: VectorSelector, MatrixSelector, SubqueryExpr, BinaryExpr, AggregationExpr, CallExpr, AtModifier, all operators. |
eval/ | Evaluator; implements all PromQL functions and aggregations. |
Evaluator
Engine { storage, default_lookback_delta, timestamp_units_per_second }:
instant_query(query, time)— evaluates at a single timestamp.range_query(query, start, end, step)— vectorized evaluation over a step range.
PrefetchCache is keyed by metric name and populated via select_all before evaluation, amortising storage round-trips across all eval timestamps in a range query.
Supported features: rate, irate, increase, delta, histogram_quantile, all standard aggregations (sum, avg, count, min, max, topk, bottomk, quantile), binary operators with on/ignoring/group_left/group_right, subqueries, @ modifier.
15. Rollups and downsampling
Rollup policies define persistent, automated downsampling of raw data.Policy structure
RollupPolicy:
policy_id— stable identifier.- Source selector — metric name + optional label matchers.
aggregation—Sum | Min | Max | Avg | Count | ....resolution— output bucket width.- Retention rules — when to expire rolled-up data.
Materialization
Materialized series are stored under synthetic metric names of the form__tsink_rollup__:<policy_id>:<original_metric>. A rollup worker runs every 5 seconds, reads raw data for each pending policy/source pair, computes aggregated buckets, and writes the result back via a normal insert_rows call.
Rollup state is checkpointed as JSON files in .rollups/ with per-policy per-source progress cursors. When a tombstone is written to a source series, any rolled-up data covering the deleted range is invalidated and scheduled for re-materialization.
16. Server mode
Thetsink-server crate wraps the library engine in a full HTTP server.
Protocol support
| Protocol | Endpoint |
|---|---|
| Prometheus Remote Write | POST /api/v1/write (Snappy-framed protobuf) |
| Prometheus Remote Read | POST /api/v1/read |
| Prometheus Text Exposition | POST /api/v1/import/prometheus |
| InfluxDB Line Protocol v1/v2 | POST /write, POST /api/v2/write |
| OTLP HTTP | POST /v1/metrics (protobuf; gauges, sums, histograms, summaries) |
| StatsD | UDP (counter, gauge, timer, set) |
| Graphite | TCP (plaintext) |
| PromQL instant query | GET /api/v1/query |
| PromQL range query | GET /api/v1/query_range |
Connection handling
MAX_CONNECTIONS = 1024simultaneous TCP connections.KEEP_ALIVE_TIMEOUT = 30 s,HANDSHAKE_TIMEOUT = 10 s,SHUTDOWN_GRACE_PERIOD = 10 s.- TLS via
tokio-rustls; HTTP/1.1 framing.
Multi-tenancy
Tenant identity is propagated via thex-tsink-tenant HTTP header and injected as a __tsink_tenant__ synthetic label on all writes and reads. Per-tenant admission semaphores govern ingest, query, metadata, and retention concurrency independently.
17. Clustering and replication
Clustering is an opt-in layer over the single-node server. All cluster logic lives incrates/tsink-server/src/cluster/.
Topology
Consistent hash ring (cluster/ring.rs)
ShardRing { shard_count, replication_factor, virtual_nodes_per_node = 128 }:
- Each node contributes 128 virtual tokens distributed by xxh64 hash around the ring.
- Each shard is pre-assigned to
replication_factorowner nodes. stable_series_identity_hash(metric, labels)→ deterministic shard selection — the same series always lands on the same shard regardless of which node receives the write.
Write routing (cluster/replication.rs)
WriteRouter decomposes an incoming batch into a WritePlan { local_rows, remote_batches }. Remote batches are sent in parallel via tokio::task::JoinSet (max 32 in-flight batches, max 1024 rows per batch). Failed remote writes are enqueued to HintedHandoffOutbox for retry.
Consistency levels — ClusterWriteConsistency { One | Quorum | All } — control how many replicas must acknowledge before the write is considered successful.
Control plane (cluster/consensus.rs)
A custom Raft-like log replication layer manages cluster membership and control state:
- Tick interval: 2 s; max append entries: 64; snapshot every 128 entries.
- Leader election with suspect timeout: 6 s; dead timeout: 20 s; leader lease: 6 s.
- Control log persisted as NDJSON to
tsink-control-log.
Anti-entropy repair (cluster/repair.rs)
DigestExchangeRuntime runs background anti-entropy:
- Every 30 s, each node exchanges window digest hashes with its peers for up to 64 shards per tick.
- A digest mismatch triggers a
InternalRepairBackfillRequestto re-replicate missing data. - Rebalance moves shard ownership between nodes in 5 s intervals with configurable row-per-tick limits.
Deduplication
DedupeWindowStore provides idempotency-key deduplication to prevent re-ingestion of retried writes during hinted handoff re-delivery.
18. Security model
TLS
All external HTTP and peer-to-peer traffic can be protected by TLS. Certificates are loaded viatokio-rustls and support hot rotation without server restart via SecurityManager.
Internal cluster traffic uses mTLS with a dedicated cluster CA to authenticate peer nodes.
Authentication
Two methods are supported:- Bearer tokens — loaded from a file or via an exec command; verified on every request.
- OIDC JWT — RS256, ES256, or HS256 tokens issued by a configured identity provider; JWKS fetched at startup with 60 s clock skew tolerance.
RBAC (src/rbac.rs)
RbacRegistry assigns roles to principals. Actions are Read | Write; resource kinds are Tenant | Admin | System. Service accounts carry 32-byte random tokens (base64url-encoded, HMAC-backed) with configurable rotation schedules.
An in-memory ring buffer retains the last 256 RBAC decisions for audit review.
Secret rotation
SecurityManager supports runtime rotation of:
- Public and admin auth tokens.
- Cluster internal auth token.
- TLS listener certificate/key.
- Cluster mTLS certificate/key.
19. Observability and background runtime
Self-instrumentation
StorageObservabilityCounters provides lockless AtomicU64 counters grouped by subsystem:
| Counter group | What it tracks |
|---|---|
WalObservabilityCounters | Replay runs, frames, points, errors, durations. |
FlushObservabilityCounters | Flush runs, backpressure events, active chunk stats. |
CompactionObservabilityCounters | Runs, source/output segment/chunk/point counts. |
QueryObservabilityCounters | Hot/warm/cold tier plans, chunks read, fetch durations. |
RollupObservabilityCounters | Materialization runs, points produced. |
RetentionObservabilityCounters | Expired series, segments, points. |
HealthObservabilityState | fail_fast_triggered, last error strings. |
observability_snapshot() converts raw counters + runtime state into the public StorageObservabilitySnapshot DTO.
The server exposes /metrics in Prometheus exposition format and /healthz / /ready Kubernetes-compatible probes.
Background threads
The runtime (src/engine/runtime.rs) manages three background workers:
| Worker | Interval | Activity |
|---|---|---|
| Flush | 250 ms | Flushes BackgroundEligible or BackgroundBounded active chunks to the sealed queue. |
| Compaction | 5 s (+ triggered) | Runs Compactor::compact_once for numeric and blob lanes. |
| Rollup | 5 s | Executes pending rollup materializations. |
STORAGE_OPEN / CLOSING / CLOSED) and stop themselves when the engine begins shutdown. If background_fail_fast = true (default), a background worker panic sets fail_fast_triggered and causes subsequent writes to return an error.
20. Configuration reference summary
Key engine knobs and their defaults:| Option | Default | Notes | |||
|---|---|---|---|---|---|
timestamp_precision | Nanoseconds | `Ns | µs | ms | s` |
retention_window | 14 days | Data older than this is eligible for expiry. | |||
future_skew_window | 15 min | Writes with future timestamps beyond this are rejected. | |||
partition_window | 1 hour | Time-bucket width for active partition heads. | |||
max_active_partition_heads_per_series | 8 | Maximum concurrent open partitions per series. | |||
max_writers | cgroup CPU count | Write parallelism gate (Semaphore permits). | |||
write_timeout | 30 s | Maximum wait time for a write permit. | |||
memory_budget_bytes | u64::MAX (no limit) | Total in-memory chunk budget. | |||
cardinality_limit | usize::MAX (no limit) | Maximum unique series count. | |||
chunk_points | 2048 | Points per sealed chunk. | |||
compaction_interval | 5 s | Background compaction frequency. | |||
flush_interval | 250 ms | Background flush frequency. | |||
background_fail_fast | true | Worker panic triggers engine failure mode. |
cgroup.rs reads /sys/fs/cgroup/cpu.max and /sys/fs/cgroup/memory.max to detect CPU and memory limits. The TSINK_MAX_CPUS environment variable overrides the detected CPU count.