Architecture Overview

This document describes the high-level design of tsink, the component interactions, and the end-to-end data flow for writes and reads.

Deployment modes
Repository layout
Public API surface
Storage engine overview
Write path
Read path
Write-Ahead Log (WAL)
Chunk encoding
Segments and on-disk index
Series registry
Compaction
Tiered storage and lifecycle
Visibility and MVCC fence
PromQL engine
Rollups and downsampling
Server mode
Clustering and replication
Security model
Observability and background runtime
Configuration reference summary

1. Deployment modes

tsink ships as three interconnected Rust crates in a single workspace.

Crate	Purpose
`tsink` (root `src/`)	Embeddable library — the complete storage engine with no async runtime dependency.
`tsink-server` (`crates/tsink-server/`)	Standalone HTTP server binary with protocol ingest, clustering, and RBAC.
`tsink-uniffi` (`crates/tsink-uniffi/`)	UniFFI-generated Python bindings that wrap the library crate.

All three share the same engine code. The server and Python bindings call into the same Storage trait that application code uses directly.

2. Repository layout

src/
  lib.rs                 — public re-exports
  storage.rs             — Storage trait, StorageBuilder, public DTOs
  async.rs               — runtime-agnostic async façade (AsyncStorage)
  value.rs               — Value enum, NativeHistogram
  label.rs               — Label, canonical identity, stable hash
  wal.rs                 — WalSyncMode, WalReplayMode
  mmap.rs                — PlatformMmap (memmap2 wrapper)
  concurrency.rs         — Semaphore (write gate)
  cgroup.rs              — container-aware CPU/memory detection
  validation.rs          — metric/label boundary checks
  query_aggregation.rs   — Aggregation enum, downsampling
  query_matcher.rs       — CompiledSeriesMatcher (regex → postings)
  query_selection.rs     — PreparedSeriesSelection execution
  error.rs               — StorageError
  engine/                — all internal engine modules (see §4)
  promql/                — PromQL lexer, parser, AST, evaluator

crates/
  tsink-server/src/      — HTTP server, protocol adapters, cluster logic
  tsink-uniffi/src/      — UniFFI bindings

3. Public API surface

Core interface — `Storage` trait (`src/storage.rs`)

All interaction with the engine goes through the Storage trait. Key methods:

Method	Description
`insert_rows(&[Row])`	Synchronous batch write.
`insert_rows_with_result`	Batch write returning per-row `WriteResult`.
`select(metric, labels, start, end)`	Range read for a single series.
`select_with_options`	Range read with `QueryOptions` (aggregation, downsampling, pagination).
`select_all`	Bulk range read across all series matching a selector.
`list_metrics`	Enumerate stored metric names.
`select_series`	Enumerate series matching a label selector with time-range filtering.
`scan_series_rows` / `scan_metric_rows`	Paginated scan for large result sets.
`delete_series`	Tombstone-based deletion with optional time range.
`snapshot`	Atomic directory snapshot.
`add_rollup_policy` / `run_rollup_pass`	Manage persistent downsampling policies.
`observability_snapshot`	Current internal counters and health state.

Key data types

Row — (metric: String, labels: Vec<Label>, data_point: DataPoint).
DataPoint — (timestamp: i64, value: Value).
Value — F64(f64) | I64(i64) | U64(u64) | Bool(bool) | Bytes(Vec<u8>) | Histogram(Box<NativeHistogram>).
Label — (name: String, value: String).
TimestampPrecision — Nanoseconds | Microseconds | Milliseconds | Seconds.
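
As a rough illustration, the row model listed above could be written as the following Rust types. This is a sketch, not the crate's actual definitions: the `NativeHistogram` fields and the `float_row` helper are hypothetical placeholders, and the derives are assumptions.

```rust
/// A label pair attached to a series.
#[derive(Debug, Clone, PartialEq)]
pub struct Label {
    pub name: String,
    pub value: String,
}

/// Placeholder only: the real NativeHistogram layout is not described in this document.
#[derive(Debug, Clone, PartialEq)]
pub struct NativeHistogram {
    pub count: u64,
    pub sum: f64,
}

/// The value variants listed above.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    F64(f64),
    I64(i64),
    U64(u64),
    Bool(bool),
    Bytes(Vec<u8>),
    Histogram(Box<NativeHistogram>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct DataPoint {
    pub timestamp: i64,
    pub value: Value,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub metric: String,
    pub labels: Vec<Label>,
    pub data_point: DataPoint,
}

/// Hypothetical convenience helper that builds a float sample.
pub fn float_row(metric: &str, labels: &[(&str, &str)], timestamp: i64, v: f64) -> Row {
    Row {
        metric: metric.to_string(),
        labels: labels
            .iter()
            .map(|(n, val)| Label { name: n.to_string(), value: val.to_string() })
            .collect(),
        data_point: DataPoint { timestamp, value: Value::F64(v) },
    }
}
```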

Async façade — `AsyncStorage` (`src/async.rs`)

AsyncStorage wraps Storage without requiring Tokio. It spawns OS threads internally:

One dedicated write worker receives WriteCommand messages over an async-channel.
A pool of read workers (sized by cgroup::default_workers_limit()) handles ReadCommand messages concurrently.
Uses parking_lot::Mutex and std::thread — suitable for embedding in any async runtime.
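
The dedicated-worker pattern described above can be sketched with the standard library alone. The example below substitutes `std::sync::mpsc` for async-channel, and its `WriteCommand` variants, row tuple, and function names are assumptions made for illustration; the real command types are not shown in this document.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical command type sent to the write worker.
enum WriteCommand {
    Insert { rows: Vec<(String, i64, f64)>, reply: mpsc::Sender<usize> },
    Shutdown,
}

/// Spawns one dedicated write worker on an OS thread.
/// Returns the command sender and a handle that yields the total rows written.
fn spawn_write_worker() -> (mpsc::Sender<WriteCommand>, thread::JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<WriteCommand>();
    let handle = thread::spawn(move || {
        let mut total = 0usize;
        for cmd in rx {
            match cmd {
                WriteCommand::Insert { rows, reply } => {
                    // A real worker would call Storage::insert_rows here.
                    total += rows.len();
                    let _ = reply.send(rows.len());
                }
                WriteCommand::Shutdown => break,
            }
        }
        total
    });
    (tx, handle)
}

/// Sends a batch and blocks until the worker acknowledges it.
fn insert_blocking(tx: &mpsc::Sender<WriteCommand>, rows: Vec<(String, i64, f64)>) -> usize {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(WriteCommand::Insert { rows, reply: reply_tx })
        .expect("write worker is alive");
    reply_rx.recv().expect("write worker replied")
}
```

Because all writes funnel through one thread, callers on any runtime only need a channel handle; an async caller would await the reply instead of blocking on it.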

4. Storage engine overview

The engine lives in src/engine/ and is structured in three layers.

Layer 1 — Foundation libraries

Pure data-structure and codec modules with no cross-cutting concerns:

Module	Role
`chunk`	In-memory chunk builder and sealed `EncodedChunk`.
`encoder`	Timestamp and value codec selection and execution.
`segment`	On-disk segment format, reader, writer, index postings.
`series`	Series identity, interned string dictionaries, Roaring Bitmap postings.
`wal`	WAL frame format, segmented log files, replay stream.
`index`	In-memory `ChunkIndex` and `PostingsIndex`.
`query`	Core query primitives: `QueryPlan`, `ChunkSeriesCursor`, chunk decoding.
`compactor`	L0/L1/L2 compaction planning and execution.

Layer 2 — Shared shell

State containers and cross-cutting services that the owner modules operate on:

Module	Role
`engine.rs`	`ChunkStorage` — the top-level struct with five state buckets.
`state`	All internal state type definitions.
`visibility`	MVCC fence and per-series visibility summaries.
`runtime`	Background thread management and supervision.
`metrics` / `observability`	Lockless counters and snapshot generation.
`shard_routing`	Maps series IDs to 1-of-64 shards; deadlock-safe lock ordering.
`metadata_lookup`	Distributed-shard series resolution.
`process_lock`	Exclusive-open file lock.

Layer 3 — Owner modules

Higher-level operations that drive the shell:

Module	Role
`bootstrap`	Six-phase startup: discovery → recovery → WAL open → hydrate → replay → finalize.
`ingest` + `ingest_pipeline`	Four-phase write pipeline (resolve → prepare → apply → commit).
`write_buffer`	Active-head flush policies.
`lifecycle`	Flush, persist, replay, and shutdown orchestration.
`query_exec` + `query_read`	High-level query execution and low-level read path.
`deletion`	Tombstone-based series deletion.
`rollups`	Persistent downsampling materialization.
`maintenance` + `tiering` + `registry_catalog`	Tier lifecycle, catalog validation.

State buckets in `ChunkStorage`

To contain lock contention, ChunkStorage organises mutable state into five separate structs:

Struct	Contents
`CatalogState`	`SeriesRegistry`, 64 `write_txn_shards` (`Mutex`), pending series IDs, persistence lock.
`ChunkBufferState`	64-shard active builders, 64-shard sealed chunks, persisted-chunk watermarks, monotone chunk sequence counter.
`VisibilityState`	Tombstones, materialized series, per-series visibility summaries, flush visibility RwLock.
`PersistedStorageState`	`PersistedIndexState`, WAL handle, compactors, tiered-storage config, pending segment diff.
`RuntimeConfigState`	Retention/partition/skew windows, memory budget, cardinality limit, WAL size limit, concurrency bounds.

5. Write path

A single insert_rows call traverses six steps before returning to the caller.

insert_rows(&[Row])
    │
    ▼
[1] Semaphore acquire          (max_writers permits, write_timeout deadline)
    │
    ▼
[2] Shard lock acquisition     (minimal set of write_txn_shards, ascending order)
    │
    ▼
[3] WriteResolver              — validate labels, resolve or mint SeriesId,
    │                            append WAL SeriesDefinitionFrame for new series
    ▼
[4] WritePreparer              — normalise timestamps, reject future skew,
    │                            reject below retention floor,
    │                            admission control: memory budget, cardinality limit, WAL size
    ▼
[5] WriteApplier               — append ChunkPoint to ActivePartitionHead (ChunkBuilder),
    │                            append WAL SamplesBatchFrame
    ▼
[6] WriteCommitter             — seal full chunks, publish visibility,
                                 update tombstone/rollup cross-references,
                                 notify background flush/compaction workers

Each Row carries a metric, Vec<Label>, and a DataPoint. The 64-shard split means that independent series in the same batch can be admitted concurrently. Admission control in the prepare phase enforces:

memory_budget_bytes — total in-memory chunk bytes.
cardinality_limit — maximum unique series count.
WAL size limit — maximum outstanding un-flushed WAL bytes.

Admission failures are reported as a per-Row WriteResult when calling insert_rows_with_result; the plain insert_rows drops rejected rows silently.
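
The three admission checks can be pictured as a single gate function. The sketch below is an assumption about shape only: the struct names, the `Rejection` enum, and the order of the checks are hypothetical, and the real prepare phase may account for bytes differently.

```rust
/// Hypothetical view of the configured limits.
struct AdmissionLimits {
    memory_budget_bytes: u64,
    cardinality_limit: usize,
    wal_size_limit_bytes: u64,
}

/// Hypothetical view of current engine usage.
struct EngineUsage {
    memory_bytes: u64,
    series_count: usize,
    wal_bytes: u64,
}

#[derive(Debug, PartialEq)]
enum Rejection {
    MemoryBudget,
    CardinalityLimit,
    WalSizeLimit,
}

/// Admits a write only if it stays within all three limits.
/// Saturating adds keep the "no limit" defaults (u64::MAX / usize::MAX) from overflowing.
fn admit(
    limits: &AdmissionLimits,
    usage: &EngineUsage,
    incoming_bytes: u64,
    new_series: usize,
) -> Result<(), Rejection> {
    if usage.memory_bytes.saturating_add(incoming_bytes) > limits.memory_budget_bytes {
        return Err(Rejection::MemoryBudget);
    }
    if usage.series_count.saturating_add(new_series) > limits.cardinality_limit {
        return Err(Rejection::CardinalityLimit);
    }
    if usage.wal_bytes.saturating_add(incoming_bytes) > limits.wal_size_limit_bytes {
        return Err(Rejection::WalSizeLimit);
    }
    Ok(())
}
```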

6. Read path

select / select_all / select_with_options
    │
    ▼
[1] candidate_planner          — resolve metric + label matchers → Vec<SeriesId>
    │                            via PostingsIndex (postings bitmaps, regex → finite literal set)
    ▼
[2] TieredQueryPlan            — compute which tiers to include (hot-only if within hot window,
    │                            else hot + warm + cold)
    ▼
[3] QuerySnapshotContext       — take visibility-fenced snapshot:
    │                            acquire flush_visibility_lock (read)
    │                            snapshot active chunks, sealed chunks, persisted index
    │                            release lock
    ▼
[4] Per-series read            — for each SeriesId:
    │    a. Scan active ChunkBuilders (in-memory, unsorted tail)
    │    b. Scan sealed EncodedChunks (flushed from active, pending segment write)
    │    c. Scan PersistedIndexState (mmap'd segment files on disk, hot tier)
    │    d. Optionally scan warm/cold tiers (object store or additional paths)
    ▼
[5] Merge and decode           — ChunkSeriesCursor binary-searches each chunk's
    │                            TimestampSearchIndex, decodes only the needed range,
    │                            merge-sorts streams from all sources
    ▼
[6] Aggregation / downsampling — optional bucket aggregation via query_aggregation
    │
    ▼
[7] Return SeriesPoints

The flush_visibility_lock is the consistency fence: the flush path holds the write lock while it publishes a flush, and readers hold a shared read lock for the duration of snapshot capture, so each snapshot sees either all of a flush or none of it.
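
Step 5 merges already-sorted streams from the active, sealed, and persisted sources. A minimal sketch of such a k-way merge is shown below, assuming each source is a timestamp-sorted `Vec<(i64, f64)>`; the function name and point representation are hypothetical, and the real ChunkSeriesCursor decodes lazily rather than working on materialized vectors.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merges per-source point streams, each already sorted by timestamp,
/// into one stream sorted by timestamp.
fn merge_sorted_streams(sources: &[Vec<(i64, f64)>]) -> Vec<(i64, f64)> {
    // Min-heap of (next timestamp, source index, position in source).
    let mut heap = BinaryHeap::new();
    for (src, points) in sources.iter().enumerate() {
        if let Some(&(ts, _)) = points.first() {
            heap.push(Reverse((ts, src, 0usize)));
        }
    }

    let mut out = Vec::with_capacity(sources.iter().map(Vec::len).sum());
    while let Some(Reverse((_, src, pos))) = heap.pop() {
        out.push(sources[src][pos]);
        // Advance the source we just consumed from.
        if let Some(&(next_ts, _)) = sources[src].get(pos + 1) {
            heap.push(Reverse((next_ts, src, pos + 1)));
        }
    }
    out
}
```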

7. Write-Ahead Log (WAL)

The WAL lives under <data_path>/wal/ and is managed by the FramedWal struct.

Frame wire format

Offset  Size  Field
     0     4  Magic = b"TSFR"
     4     1  Frame type (1 = SeriesDef, 2 = Samples)
     5     4  CRC-32 of payload
     9     4  Payload length (max 64 MiB)
    13    11  Sequence number (monotone u64 + padding)
    24     N  Payload

SeriesDefinitionFrame — written once per new series: (series_id, metric, labels).
SamplesBatchFrame — written per chunk flush: (series_id, lane, timestamp_codec_id, value_codec_id, point_count, base_timestamp, encoded_timestamps, encoded_values).
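
To make the frame layout concrete, here is a sketch of encoding and decoding one frame with a 24-byte header (4 + 1 + 4 + 4 + 11). Several details are assumptions not stated in this document: little-endian integers, the IEEE CRC-32 polynomial, the header fields laid out contiguously in the listed order, and three zero bytes as sequence padding. The function names are hypothetical.

```rust
use std::convert::TryInto;

const MAGIC: &[u8; 4] = b"TSFR";
const HEADER_LEN: usize = 24; // 4 magic + 1 type + 4 crc + 4 len + 11 sequence
const MAX_PAYLOAD_LEN: usize = 64 << 20; // 64 MiB

/// Bitwise CRC-32 (IEEE polynomial, reflected). Assumed variant.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn encode_frame(frame_type: u8, seq: u64, payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(HEADER_LEN + payload.len());
    buf.extend_from_slice(MAGIC);
    buf.push(frame_type);
    buf.extend_from_slice(&crc32(payload).to_le_bytes());
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(&seq.to_le_bytes());
    buf.extend_from_slice(&[0u8; 3]); // assumed padding after the u64
    buf.extend_from_slice(payload);
    buf
}

/// Returns (frame type, sequence number, payload) or a description of the damage.
fn decode_frame(buf: &[u8]) -> Result<(u8, u64, &[u8]), &'static str> {
    if buf.len() < HEADER_LEN || buf[..4] != MAGIC[..] {
        return Err("bad header");
    }
    let frame_type = buf[4];
    let crc = u32::from_le_bytes(buf[5..9].try_into().unwrap());
    let len = u32::from_le_bytes(buf[9..13].try_into().unwrap()) as usize;
    let seq = u64::from_le_bytes(buf[13..21].try_into().unwrap());
    if len > MAX_PAYLOAD_LEN {
        return Err("payload too large");
    }
    let payload = buf.get(HEADER_LEN..HEADER_LEN + len).ok_or("truncated payload")?;
    if crc32(payload) != crc {
        return Err("crc mismatch");
    }
    Ok((frame_type, seq, payload))
}
```

A replay loop would call a decoder like this frame by frame and treat any `Err` as the corruption signal that the replay mode then handles.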

Durability modes

Mode	Behavior
`WalSyncMode::PerAppend`	`fsync` on every frame append. Crash-safe; throughput-limited.
`WalSyncMode::Periodic(d)`	Background `fsync` every duration `d`. Higher throughput; up to `d` of data can be lost on hard crash.

Replay modes

Mode	Behavior
`WalReplayMode::Strict`	Abort if any WAL segment is corrupted or unreadable.
`WalReplayMode::Salvage`	Skip damaged segments and continue replay.

High-water mark

WalHighWatermark { segment: u64, frame: u64 } tracks how far replay has been committed. Each persisted segment stores the WAL high-water mark at the time it was written, so recovery knows which WAL frames to replay and which to skip. The published high-water file (wal.published) uses magic b"TSHW".
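
The skip-or-replay decision amounts to an ordered comparison on (segment, frame). A small sketch follows, assuming the stored mark is inclusive (the frame at the mark itself is already persisted); the `needs_replay` function is hypothetical.

```rust
/// Field order matters: the derived ordering compares `segment` first, then `frame`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct WalHighWatermark {
    segment: u64,
    frame: u64,
}

/// A frame needs replay only if it lies strictly after the persisted mark.
/// Assumption: the mark is inclusive.
fn needs_replay(persisted: WalHighWatermark, frame_pos: WalHighWatermark) -> bool {
    frame_pos > persisted
}
```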

8. Chunk encoding

Chunk lifecycle

ChunkBuilder (active)
    │  accumulates ChunkPoints in 64-point frozen blocks + mutable tail
    │  sealed when point count reaches DEFAULT_CHUNK_POINTS (2048)
    ▼
EncodedChunk (sealed)
    │  raw points cleared; only encoded payload survives
    │  held in sealed_chunks until background flush
    ▼
SegmentWriter → persisted segment file

Timestamp codecs

Codec	Use case
`FixedStepRle`	Constant-interval series (e.g. scrape-interval metrics).
`DeltaVarint`	Variable-interval series with small deltas.
`DeltaOfDeltaBitpack`	Gorilla-style double-delta with bit-packing for irregular series.

The encoder auto-selects the best codec via choose_best_timestamp_codec after inspecting the actual delta distribution.
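
One way to picture codec selection is a size estimate over the observed deltas. The heuristic below is purely illustrative and is not the logic of choose_best_timestamp_codec: it picks FixedStepRle when every delta is equal, and otherwise compares an estimated varint size against an estimated bit-packed delta-of-delta size. All function names and the cost model are assumptions.

```rust
#[derive(Debug, PartialEq)]
enum TimestampCodec {
    FixedStepRle,
    DeltaVarint,
    DeltaOfDeltaBitpack,
}

/// Number of bytes a LEB128-style varint needs for `v`.
fn varint_len(mut v: u64) -> usize {
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

/// ZigZag maps signed values to unsigned so small magnitudes stay small.
fn zigzag(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

/// Hypothetical selection heuristic based on estimated encoded size in bits.
fn choose_timestamp_codec(timestamps: &[i64]) -> TimestampCodec {
    let deltas: Vec<i64> = timestamps.windows(2).map(|w| w[1] - w[0]).collect();
    if deltas.windows(2).all(|w| w[0] == w[1]) {
        return TimestampCodec::FixedStepRle;
    }
    let varint_bits: usize = deltas.iter().map(|&d| 8 * varint_len(zigzag(d))).sum();

    // Bit width needed for the largest zigzag-encoded delta-of-delta.
    let max_dod = deltas.windows(2).map(|w| zigzag(w[1] - w[0])).max().unwrap_or(0);
    let width = (64 - max_dod.leading_zeros()) as usize;
    // 64 bits for the first delta, then `width` bits per delta-of-delta.
    let bitpack_bits = 64 + width * (deltas.len() - 1);

    if bitpack_bits < varint_bits {
        TimestampCodec::DeltaOfDeltaBitpack
    } else {
        TimestampCodec::DeltaVarint
    }
}
```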

Value codecs

Codec	Use case
`GorillaXorF64`	XOR floating-point compression (Gorilla paper).
`ZigZagDeltaBitpackI64`	Signed integer delta + bit-packing.
`DeltaBitpackU64`	Unsigned integer delta + bit-packing.
`ConstantRle`	Run-length encoding for constant-value series.
`BoolBitpack`	1-bit packing for boolean streams.
`BytesDeltaBlock`	Block delta encoding for byte-string payloads.

Timestamp search index

Each sealed chunk builds a TimestampSearchIndex — an array of anchor entries every 64 points. Queries binary-search this index to seek directly into the encoded payload without full decompression, giving O(log N) range access.
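
The anchor lookup can be sketched as a binary search over (timestamp, point offset) pairs. The struct layout and method names below are assumptions; in particular, the real index presumably stores payload byte offsets, which this sketch replaces with point offsets, and it assumes timestamps are strictly increasing.

```rust
const ANCHOR_INTERVAL: usize = 64;

/// Hypothetical index: one anchor per 64 points, holding the first timestamp
/// of the block and the point offset at which the block starts.
struct TimestampSearchIndex {
    anchors: Vec<(i64, usize)>,
}

impl TimestampSearchIndex {
    fn build(timestamps: &[i64]) -> Self {
        let anchors = timestamps
            .iter()
            .enumerate()
            .step_by(ANCHOR_INTERVAL)
            .map(|(i, &ts)| (ts, i))
            .collect();
        TimestampSearchIndex { anchors }
    }

    /// Returns the point offset where decoding must start to cover `start_ts`.
    fn seek_block(&self, start_ts: i64) -> usize {
        // Number of anchors whose first timestamp is <= start_ts.
        let idx = self.anchors.partition_point(|&(ts, _)| ts <= start_ts);
        match idx.checked_sub(1) {
            Some(i) => self.anchors[i].1,
            None => 0, // start_ts precedes the chunk; begin at the first block
        }
    }
}
```

Only the block found by the search and its successors up to the end of the queried range need decoding, which is where the logarithmic seek cost comes from.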

9. Segments and on-disk index

Directory structure

<data_path>/
  lane_numeric/          — float/int/bool/histogram chunks
    l0/                  — freshly flushed segments
    l1/                  — first compaction level
    l2/                  — final compaction level
  lane_blob/             — bytes-value chunks (same level layout)
  wal/                   — WAL segment files
  series_index.bin       — binary RIDX v2 series registry snapshot
  series_index.delta.d/  — incremental series registry delta checkpoints
  series_index.catalog.json — fingerprint catalog for fast registry reload
  tombstones.json.store/ — sharded binary tombstone store (256 shards)
  .rollups/              — rollup state JSON (policies.json, state.json)
  .process.lock          — exclusive-open process lock

Segment manifest

Each segment directory contains:

Encoded chunk files (one per series × value lane).
A SegmentManifest (segment_id, level, chunk/point/series counts, min/max timestamps, wal_highwater).
A SegmentPostingsIndex for label/metric lookups within the segment.
Checksums; corrupted segments are quarantined automatically.

In-memory index

ChunkIndex — a flat sorted Vec<ChunkIndexEntry>, binary-searched for time-range queries. Each entry stores (series_id, min_ts, max_ts, chunk_offset, chunk_len, point_count, lane, codec IDs, level).
PostingsIndex — a BTreeMap<label_name → BTreeMap<label_value → BTreeSet<SeriesId>>> used to resolve label-matcher queries to candidate series lists.
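
The nested-map shape of PostingsIndex lends itself to a short sketch of equality-matcher resolution. The type below mirrors the structure described above, but the method names and the restriction to exact-match matchers are assumptions made for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

type SeriesId = u64;

/// Hypothetical postings index: label name -> label value -> series ids.
#[derive(Default)]
struct PostingsIndex {
    by_label: BTreeMap<String, BTreeMap<String, BTreeSet<SeriesId>>>,
}

impl PostingsIndex {
    fn insert(&mut self, id: SeriesId, labels: &[(&str, &str)]) {
        for &(name, value) in labels {
            self.by_label
                .entry(name.to_string())
                .or_default()
                .entry(value.to_string())
                .or_default()
                .insert(id);
        }
    }

    fn lookup(&self, name: &str, value: &str) -> BTreeSet<SeriesId> {
        self.by_label
            .get(name)
            .and_then(|values| values.get(value))
            .cloned()
            .unwrap_or_default()
    }

    /// Intersects the postings of every equality matcher.
    fn select(&self, matchers: &[(&str, &str)]) -> BTreeSet<SeriesId> {
        let mut iter = matchers.iter();
        let mut acc = match iter.next() {
            Some(&(name, value)) => self.lookup(name, value),
            None => return BTreeSet::new(),
        };
        for &(name, value) in iter {
            let other = self.lookup(name, value);
            acc = acc.intersection(&other).copied().collect();
        }
        acc
    }
}
```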

10. Series registry

The registry maps (metric, labels) pairs to stable numeric SeriesId values and maintains postings for fast label-based lookup.

Internals

64-shard split — series_shards, metric_postings_shards, label_postings_shards, all_series_shards; each shard protected by an independent RwLock. Reads and writes on different series never contend.
Interned string dictionaries — metric_dict, label_name_dict, label_value_dict store unique strings once; DictionaryId (u32) is used internally. Reduces memory for high-cardinality label sets.
Roaring Bitmaps — postings lists are stored as RoaringTreemap values, which are compact and support fast union/intersection for label-matcher resolution.
SeriesValueFamily — inferred at first write (F64 | I64 | U64 | Bool | Blob | Histogram); used to select the appropriate codec path for all subsequent writes.

Series identity

canonical_series_identity(metric, labels) produces a deterministic binary key: labels are sorted lexicographically, then length-prefixed and concatenated with the metric name. stable_series_identity_hash applies xxh64 to this key for O(1) shard dispatch.
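
The identity scheme can be illustrated with a small sketch. It is not the crate's implementation: the byte layout (metric first, 32-bit little-endian length prefixes) is an assumption, and because xxh64 is not in the standard library, FNV-1a is swapped in here as a stand-in hash. The function names `canonical_key` and `shard_for` are hypothetical.

```rust
const SHARD_COUNT: u64 = 64;

fn push_len_prefixed(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Builds a deterministic key: labels are sorted, then every string is
/// length-prefixed and concatenated. The exact layout is assumed.
fn canonical_key(metric: &str, labels: &[(&str, &str)]) -> Vec<u8> {
    let mut sorted = labels.to_vec();
    sorted.sort();
    let mut key = Vec::new();
    push_len_prefixed(&mut key, metric);
    for (name, value) in sorted {
        push_len_prefixed(&mut key, name);
        push_len_prefixed(&mut key, value);
    }
    key
}

/// FNV-1a 64-bit, used only as a stand-in for xxh64.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash = 0xcbf2_9ce4_8422_2325u64;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Maps a series to one of 64 shards in constant time.
fn shard_for(metric: &str, labels: &[(&str, &str)]) -> usize {
    (fnv1a64(&canonical_key(metric, labels)) % SHARD_COUNT) as usize
}
```

Length prefixes keep the key unambiguous: without them, the pairs ("ab", "c") and ("a", "bc") would concatenate to the same bytes.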

Persistence

The registry is persisted as a binary RIDX v2 file (series_index.bin) with incremental delta checkpoints in series_index.delta.d/. On startup, registry_catalog.rs validates per-segment xxh64 fingerprints against the catalog; a mismatch triggers a full registry rebuild from segment postings files.

11. Compaction

tsink uses a three-level LSM-inspired compaction scheme.

Active chunks
    │  background flush (250 ms default)
    ▼
L0 segments    ← trigger: 4 L0 segments → compact → L1
    ▼
L1 segments    ← trigger: 4 L1 segments → compact → L2
    ▼
L2 segments    (cold-compacted, largest, longest time span)

Each compaction pass:

Planning (compactor/planning.rs) — selects a window of up to 8 source segments at the triggering level.
Execution (compactor/execution.rs) — merge-sorts all chunks, deduplicates overlapping samples, re-encodes with optimal codecs, writes new target-level segment.
Atomic replacement — the replacement is staged in .compaction-replacements/ and renamed into place. If the process crashes mid-replacement, finalize_pending_compaction_replacements cleans up the staging area at next startup.

Compaction is tombstone-aware: tombstoned time ranges are excluded from the output segment, physically removing deleted data from disk.
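
The trigger and window rules can be sketched as a planning function plus a deduplicating merge. Both are simplified assumptions: segments are reduced to an id and a level, "oldest first" is taken to mean lowest segment id, and on duplicate timestamps the later source wins. The names `plan_compaction` and `merge_dedupe` are hypothetical.

```rust
use std::collections::BTreeMap;

const TRIGGER_SEGMENTS: usize = 4; // segments at a level that trigger compaction
const MAX_WINDOW: usize = 8; // maximum source segments per pass

#[derive(Debug, Clone, PartialEq)]
struct SegmentMeta {
    segment_id: u64,
    level: u8,
}

/// Returns (source segment ids, target level) for the first level that is due.
fn plan_compaction(segments: &[SegmentMeta]) -> Option<(Vec<u64>, u8)> {
    for level in 0..2u8 {
        let mut ids: Vec<u64> = segments
            .iter()
            .filter(|s| s.level == level)
            .map(|s| s.segment_id)
            .collect();
        if ids.len() >= TRIGGER_SEGMENTS {
            ids.sort_unstable();
            ids.truncate(MAX_WINDOW);
            return Some((ids, level + 1));
        }
    }
    None
}

/// Merges sources into one timestamp-sorted stream with one sample per timestamp.
fn merge_dedupe(sources: &[Vec<(i64, f64)>]) -> Vec<(i64, f64)> {
    let mut merged = BTreeMap::new();
    for source in sources {
        for &(ts, value) in source {
            merged.insert(ts, value); // later sources overwrite earlier ones
        }
    }
    merged.into_iter().collect()
}
```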

12. Tiered storage and lifecycle

Tiers

Tier	Location	Access
Hot	Local filesystem (`lane_numeric/`, `lane_blob/`)	Direct mmap reads.
Warm	Configurable local path or object store	Loaded on demand.
Cold	Object store	Loaded on demand; full fetch before query.

Tier lifecycle

A background maintenance pass computes a PostFlushMaintenancePolicyPlan from RetentionTierPolicy:

expired_actions — delete segments that fall outside the retention window.
rewrite_actions — trim segment boundaries at time-split points.
move_actions — move segments from Hot → Warm or Warm → Cold when they cross the configured retention cutoffs.

Query tier selection

TieredQueryPlan { start, end, include_warm, include_cold } is computed from the query time range vs. tier cutoffs. A query entirely within the hot window skips all remote tier I/O.
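
A possible derivation of the two flags is sketched below. The cutoff semantics are assumptions: `hot_cutoff` is taken as the oldest timestamp still held in the hot tier and `warm_cutoff` as the oldest timestamp held in the warm tier, and the `plan_tiers` function is hypothetical.

```rust
#[derive(Debug, PartialEq)]
struct TieredQueryPlan {
    start: i64,
    end: i64,
    include_warm: bool,
    include_cold: bool,
}

/// Decides which tiers a query over [start, end] must touch.
/// Assumes warm_cutoff <= hot_cutoff.
fn plan_tiers(start: i64, end: i64, hot_cutoff: i64, warm_cutoff: i64) -> TieredQueryPlan {
    TieredQueryPlan {
        start,
        end,
        // Warm data lives in [warm_cutoff, hot_cutoff).
        include_warm: start < hot_cutoff && end >= warm_cutoff,
        // Cold data lives before warm_cutoff.
        include_cold: start < warm_cutoff,
    }
}
```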

13. Visibility and MVCC fence

The problem

A flush publishes many new segment entries and visibility summaries at once. Without a fence, a reader could see a partially-published flush: some new segments visible, others not.

The solution

flush_visibility_lock is a RwLock<()>:

Writers of the lock (the flush publication path, not normal ingest) acquire the exclusive write lock for the entire duration of flush publication.
Readers acquire a shared read lock during snapshot capture (step 3 of the read path).

This is a brief, bounded mutual-exclusion window — not a per-operation lock. Normal write ingest does not touch this lock.
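
The fence can be demonstrated with a `RwLock<()>` guarding two independently locked pieces of state. This is a toy model under stated assumptions: `FenceDemo` and its fields are hypothetical, and the two vectors stand in for the segment entries and visibility summaries that a real flush publishes together.

```rust
use std::sync::{Mutex, RwLock};

/// Toy model: two pieces of state that must always be observed in step.
struct FenceDemo {
    flush_visibility_lock: RwLock<()>,
    segments: Mutex<Vec<u64>>,
    summaries: Mutex<Vec<u64>>,
}

impl FenceDemo {
    fn new() -> Self {
        FenceDemo {
            flush_visibility_lock: RwLock::new(()),
            segments: Mutex::new(Vec::new()),
            summaries: Mutex::new(Vec::new()),
        }
    }

    /// Publishes a flush: both updates happen under the exclusive fence.
    fn publish_flush(&self, new_segments: &[u64]) {
        let _fence = self.flush_visibility_lock.write().unwrap();
        self.segments.lock().unwrap().extend_from_slice(new_segments);
        self.summaries.lock().unwrap().extend_from_slice(new_segments);
    }

    /// Captures a snapshot under the shared fence, so it never sees half a flush.
    fn snapshot(&self) -> (Vec<u64>, Vec<u64>) {
        let _fence = self.flush_visibility_lock.read().unwrap();
        let segments = self.segments.lock().unwrap().clone();
        let summaries = self.summaries.lock().unwrap().clone();
        (segments, summaries)
    }
}
```

Multiple readers can hold the shared fence at once, so concurrent snapshots do not block each other; they only wait while a publication is in progress.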

Per-series visibility summaries

SeriesVisibilitySummary tracks up to SERIES_VISIBILITY_SUMMARY_MAX_RANGES = 32 disjoint time ranges per series. This handles series with gaps (e.g. intermittently reporting nodes) efficiently: readers can skip time ranges that are known to be empty without scanning segment files. Fields per summary:

latest_visible_timestamp — highest timestamp ingested for this series.
latest_bounded_visible_timestamp — highest timestamp whose containing chunk has been sealed and persisted.
exhaustive_floor_inclusive — data below this timestamp is complete (no more writes expected).
truncated_before_floor — all data before this point has been tombstoned.
ranges — list of (min, max) tuples defining continuous data windows.

14. PromQL engine

tsink includes a full in-process PromQL evaluator in src/promql/.

Components

Module	Role
`lexer.rs`	Hand-written lexer; produces `Vec<Token>`.
`parser.rs`	Recursive-descent Pratt parser; produces `ast::Expr`.
`ast.rs`	Complete AST: `VectorSelector`, `MatrixSelector`, `SubqueryExpr`, `BinaryExpr`, `AggregationExpr`, `CallExpr`, `AtModifier`, all operators.
`eval/`	Evaluator; implements all PromQL functions and aggregations.

Evaluator

Engine { storage, default_lookback_delta, timestamp_units_per_second }:

instant_query(query, time) — evaluates at a single timestamp.
range_query(query, start, end, step) — vectorized evaluation over a step range.

A PrefetchCache is keyed by metric name and populated via select_all before evaluation, amortising storage round-trips across all eval timestamps in a range query.

Supported features: rate, irate, increase, delta, histogram_quantile, all standard aggregations (sum, avg, count, min, max, topk, bottomk, quantile), binary operators with on/ignoring/group_left/group_right, subqueries, @ modifier.

15. Rollups and downsampling

Rollup policies define persistent, automated downsampling of raw data.

Policy structure

RollupPolicy:

policy_id — stable identifier.
Source selector — metric name + optional label matchers.
aggregation — Sum | Min | Max | Avg | Count | ....
resolution — output bucket width.
Retention rules — when to expire rolled-up data.

Materialization

Materialized series are stored under synthetic metric names of the form __tsink_rollup__:<policy_id>:<original_metric>. A rollup worker runs every 5 seconds, reads raw data for each pending policy/source pair, computes aggregated buckets, and writes the result back via a normal insert_rows call. Rollup state is checkpointed as JSON files in .rollups/ with per-policy per-source progress cursors. When a tombstone is written to a source series, any rolled-up data covering the deleted range is invalidated and scheduled for re-materialization.
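
The naming scheme and bucket computation can be sketched as follows. The aggregation variants come from the policy description above; everything else is an assumption, including the function names, the use of the bucket start as the output timestamp, and the restriction to float values.

```rust
use std::collections::BTreeMap;

/// Builds the synthetic metric name for a materialized rollup series.
fn rollup_metric_name(policy_id: &str, metric: &str) -> String {
    format!("__tsink_rollup__:{}:{}", policy_id, metric)
}

#[derive(Debug, Clone, Copy)]
enum Aggregation {
    Sum,
    Min,
    Max,
    Avg,
    Count,
}

/// Groups points into buckets of width `resolution` and aggregates each bucket.
/// Each output point is stamped with its bucket start (assumed convention).
fn rollup_buckets(points: &[(i64, f64)], resolution: i64, agg: Aggregation) -> Vec<(i64, f64)> {
    let mut buckets: BTreeMap<i64, Vec<f64>> = BTreeMap::new();
    for &(ts, value) in points {
        let bucket_start = ts.div_euclid(resolution) * resolution;
        buckets.entry(bucket_start).or_default().push(value);
    }
    buckets
        .into_iter()
        .map(|(start, values)| {
            let sum: f64 = values.iter().sum();
            let out = match agg {
                Aggregation::Sum => sum,
                Aggregation::Min => values.iter().copied().fold(f64::INFINITY, f64::min),
                Aggregation::Max => values.iter().copied().fold(f64::NEG_INFINITY, f64::max),
                Aggregation::Avg => sum / values.len() as f64,
                Aggregation::Count => values.len() as f64,
            };
            (start, out)
        })
        .collect()
}
```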

16. Server mode

The tsink-server crate wraps the library engine in a full HTTP server.

Protocol support

Protocol	Endpoint
Prometheus Remote Write	`POST /api/v1/write` (Snappy-compressed protobuf)
Prometheus Remote Read	`POST /api/v1/read`
Prometheus Text Exposition	`POST /api/v1/import/prometheus`
InfluxDB Line Protocol v1/v2	`POST /write`, `POST /api/v2/write`
OTLP HTTP	`POST /v1/metrics` (protobuf; gauges, sums, histograms, summaries)
StatsD	UDP (counter, gauge, timer, set)
Graphite	TCP (plaintext)
PromQL instant query	`GET /api/v1/query`
PromQL range query	`GET /api/v1/query_range`

Connection handling

MAX_CONNECTIONS = 1024 simultaneous TCP connections.
KEEP_ALIVE_TIMEOUT = 30 s, HANDSHAKE_TIMEOUT = 10 s, SHUTDOWN_GRACE_PERIOD = 10 s.
TLS via tokio-rustls; HTTP/1.1 framing.

Multi-tenancy

Tenant identity is propagated via the x-tsink-tenant HTTP header and injected as a __tsink_tenant__ synthetic label on all writes and reads. Per-tenant admission semaphores govern ingest, query, metadata, and retention concurrency independently.

17. Clustering and replication

Clustering is an opt-in layer over the single-node server. All cluster logic lives in crates/tsink-server/src/cluster/.

Topology

Client
  │  writes / reads
  ▼
Any cluster node (HTTP)
  │
  ├── WriteRouter        — splits batch by consistent hash ring → local + remote shards
  │     │
  │     ├── local write  — directly into local ChunkStorage
  │     └── remote write — parallel RPC fanout to replica owners
  │
  └── ReadFanoutExecutor — fans query to owner shards, merges responses

Consistent hash ring (`cluster/ring.rs`)

ShardRing { shard_count, replication_factor, virtual_nodes_per_node = 128 }:

Each node contributes 128 virtual tokens distributed by xxh64 hash around the ring.
Each shard is pre-assigned to replication_factor owner nodes.
stable_series_identity_hash(metric, labels) → deterministic shard selection — the same series always lands on the same shard regardless of which node receives the write.
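
A ring with virtual tokens can be sketched with a `BTreeMap` keyed by token hash. This is an illustration under several assumptions: FNV-1a is swapped in for xxh64, the token and shard key strings are invented, owners are chosen by walking clockwise from the shard's hash, and the shard is selected by taking the series hash modulo the shard count. None of this is confirmed as the behaviour of cluster/ring.rs.

```rust
use std::collections::BTreeMap;

/// FNV-1a 64-bit, used only as a stand-in for xxh64.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash = 0xcbf2_9ce4_8422_2325u64;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

struct ShardRing {
    shard_count: u32,
    replication_factor: usize,
    tokens: BTreeMap<u64, String>, // token hash -> node name
}

impl ShardRing {
    fn new(nodes: &[&str], shard_count: u32, replication_factor: usize, virtual_nodes_per_node: u32) -> Self {
        let mut tokens = BTreeMap::new();
        for node in nodes {
            for v in 0..virtual_nodes_per_node {
                let token = fnv1a64(format!("{}#{}", node, v).as_bytes());
                tokens.insert(token, node.to_string());
            }
        }
        ShardRing { shard_count, replication_factor, tokens }
    }

    /// Walks clockwise from the shard's position, collecting distinct nodes.
    fn owners(&self, shard: u32) -> Vec<String> {
        let start = fnv1a64(format!("shard-{}", shard).as_bytes());
        let mut owners: Vec<String> = Vec::new();
        for (_, node) in self.tokens.range(start..).chain(self.tokens.range(..start)) {
            if !owners.contains(node) {
                owners.push(node.clone());
            }
            if owners.len() == self.replication_factor {
                break;
            }
        }
        owners
    }

    /// Deterministic shard selection from a stable series hash.
    fn shard_for(&self, series_hash: u64) -> u32 {
        (series_hash % self.shard_count as u64) as u32
    }
}
```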

Write routing (`cluster/replication.rs`)

WriteRouter decomposes an incoming batch into a WritePlan { local_rows, remote_batches }. Remote batches are sent in parallel via tokio::task::JoinSet (max 32 in-flight batches, max 1024 rows per batch). Failed remote writes are enqueued to HintedHandoffOutbox for retry. Consistency levels — ClusterWriteConsistency { One | Quorum | All } — control how many replicas must acknowledge before the write is considered successful.
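
The consistency levels translate into a required acknowledgement count. The sketch below assumes the conventional meaning of Quorum as a strict majority of the replication factor; the helper functions are hypothetical.

```rust
#[derive(Debug, Clone, Copy)]
enum ClusterWriteConsistency {
    One,
    Quorum,
    All,
}

/// Number of replica acknowledgements needed before a write counts as successful.
/// Assumption: Quorum means a strict majority of replicas.
fn required_acks(level: ClusterWriteConsistency, replication_factor: usize) -> usize {
    match level {
        ClusterWriteConsistency::One => 1,
        ClusterWriteConsistency::Quorum => replication_factor / 2 + 1,
        ClusterWriteConsistency::All => replication_factor,
    }
}

fn write_succeeded(level: ClusterWriteConsistency, replication_factor: usize, acks: usize) -> bool {
    acks >= required_acks(level, replication_factor)
}
```

With a replication factor of 3, Quorum tolerates one failed replica while All tolerates none; the failed replica's batch would then go through hinted handoff.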

Control plane (`cluster/consensus.rs`)

A custom Raft-like log replication layer manages cluster membership and control state:

Tick interval: 2 s; max append entries: 64; snapshot every 128 entries.
Leader election with suspect timeout: 6 s; dead timeout: 20 s; leader lease: 6 s.
Control log persisted as NDJSON to tsink-control-log.

Anti-entropy repair (`cluster/repair.rs`)

DigestExchangeRuntime runs background anti-entropy:

Every 30 s, each node exchanges window digest hashes with its peers for up to 64 shards per tick.
A digest mismatch triggers an InternalRepairBackfillRequest to re-replicate missing data.
Rebalance moves shard ownership between nodes in 5 s intervals with configurable row-per-tick limits.

Deduplication

DedupeWindowStore provides idempotency-key deduplication to prevent re-ingestion of retried writes during hinted handoff re-delivery.

18. Security model

TLS

All external HTTP and peer-to-peer traffic can be protected by TLS. Certificates are loaded via tokio-rustls and support hot rotation without server restart via SecurityManager. Internal cluster traffic uses mTLS with a dedicated cluster CA to authenticate peer nodes.

Authentication

Two methods are supported:

Bearer tokens — loaded from a file or via an exec command; verified on every request.
OIDC JWT — RS256, ES256, or HS256 tokens issued by a configured identity provider; the JWKS is fetched at startup, and tokens are validated with a 60 s clock-skew tolerance.

RBAC (`src/rbac.rs`)

RbacRegistry assigns roles to principals. Actions are Read | Write; resource kinds are Tenant | Admin | System. Service accounts carry 32-byte random tokens (base64url-encoded, HMAC-backed) with configurable rotation schedules. An in-memory ring buffer retains the last 256 RBAC decisions for audit review.

Secret rotation

SecurityManager supports runtime rotation of:

Public and admin auth tokens.
Cluster internal auth token.
TLS listener certificate/key.
Cluster mTLS certificate/key.

A 300 s overlap grace period (configurable) allows in-flight requests to complete with the old credential before it is invalidated.

19. Observability and background runtime

Self-instrumentation

StorageObservabilityCounters provides lockless AtomicU64 counters grouped by subsystem:

Counter group	What it tracks
`WalObservabilityCounters`	Replay runs, frames, points, errors, durations.
`FlushObservabilityCounters`	Flush runs, backpressure events, active chunk stats.
`CompactionObservabilityCounters`	Runs, source/output segment/chunk/point counts.
`QueryObservabilityCounters`	Hot/warm/cold tier plans, chunks read, fetch durations.
`RollupObservabilityCounters`	Materialization runs, points produced.
`RetentionObservabilityCounters`	Expired series, segments, points.
`HealthObservabilityState`	`fail_fast_triggered`, last error strings.

observability_snapshot() converts raw counters + runtime state into the public StorageObservabilitySnapshot DTO. The server exposes /metrics in Prometheus exposition format and /healthz / /ready Kubernetes-compatible probes.

Background threads

The runtime (src/engine/runtime.rs) manages three background workers:

Worker	Interval	Activity
Flush	250 ms	Flushes `BackgroundEligible` or `BackgroundBounded` active chunks to the sealed queue.
Compaction	5 s (+ triggered)	Runs `Compactor::compact_once` for numeric and blob lanes.
Rollup	5 s	Executes pending rollup materializations.

Workers check three lifecycle states (STORAGE_OPEN / CLOSING / CLOSED) and stop themselves when the engine begins shutdown. If background_fail_fast = true (default), a background worker panic sets fail_fast_triggered and causes subsequent writes to return an error.

20. Configuration reference summary

Key engine knobs and their defaults:

Option	Default	Notes
`timestamp_precision`	Nanoseconds	`Ns` | `µs` | `ms` | `s`
`retention_window`	14 days	Data older than this is eligible for expiry.
`future_skew_window`	15 min	Writes with future timestamps beyond this are rejected.
`partition_window`	1 hour	Time-bucket width for active partition heads.
`max_active_partition_heads_per_series`	8	Maximum concurrent open partitions per series.
`max_writers`	cgroup CPU count	Write parallelism gate (`Semaphore` permits).
`write_timeout`	30 s	Maximum wait time for a write permit.
`memory_budget_bytes`	`u64::MAX` (no limit)	Total in-memory chunk budget.
`cardinality_limit`	`usize::MAX` (no limit)	Maximum unique series count.
`chunk_points`	2048	Points per sealed chunk.
`compaction_interval`	5 s	Background compaction frequency.
`flush_interval`	250 ms	Background flush frequency.
`background_fail_fast`	`true`	Worker panic triggers engine failure mode.

Container-aware defaults: cgroup.rs reads /sys/fs/cgroup/cpu.max and /sys/fs/cgroup/memory.max to detect CPU and memory limits. The TSINK_MAX_CPUS environment variable overrides the detected CPU count.
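
The cgroup v2 cpu.max file holds a quota and a period in microseconds, or the word max in place of the quota when no limit is set. A sketch of turning that into a CPU count follows; the rounding rule, the clamp to the host CPU count, and both function names are assumptions rather than the behaviour of cgroup.rs.

```rust
/// Parses the contents of a cgroup v2 `cpu.max` file ("<quota> <period>" or "max <period>").
/// Returns None when the cgroup is unlimited or the contents are malformed.
fn parse_cpu_max(contents: &str) -> Option<usize> {
    let mut parts = contents.split_whitespace();
    let quota = parts.next()?;
    let period: u64 = parts.next()?.parse().ok()?;
    if quota == "max" || period == 0 {
        return None;
    }
    let quota: u64 = quota.parse().ok()?;
    // Round up so a 1.5-CPU quota yields 2 workers (assumed policy).
    Some(((quota + period - 1) / period).max(1) as usize)
}

/// Resolves the CPU count: environment override first, then cgroup limit, then host count.
/// `env_override` stands for the value of TSINK_MAX_CPUS, if set.
fn effective_cpus(env_override: Option<&str>, cpu_max: Option<&str>, host_cpus: usize) -> usize {
    if let Some(n) = env_override
        .and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
    {
        return n;
    }
    cpu_max
        .and_then(parse_cpu_max)
        .map_or(host_cpus, |limit| limit.min(host_cpus))
}
```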
