
Compaction

tsink uses LSM-style leveled compaction to merge small L0 segments produced by the write pipeline into progressively larger L1 and L2 segments. Compaction reduces the number of segment files, eliminates duplicate and tombstoned data points, and keeps query fan-out bounded as data accumulates over time.

L0 / L1 / L2 levels

Every segment carries a level tag written into its manifest at creation time. There are three levels:
Level   Written by                        Compacted from
L0      Flush pipeline (WAL → segments)   —
L1      Compactor                         L0
L2      Compactor                         L1
Segments at the same level are independent: they may cover overlapping or non-overlapping time ranges and may belong to different series. The compactor merges segments within a level and writes the output one level higher. There is no compaction out of L2; L2 segments age out only through retention expiry. The engine maintains two separate compactors — one for the numeric lane (float64 series) and one for the blob lane (bytes and native histograms). Each compactor operates on its own directory subtree and carries its own segment-ID allocator backed by a shared atomic counter.
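The level scheme above can be sketched as a small enum; the type and method names here are illustrative, not tsink's actual API:

```rust
// Illustrative sketch of the three-level scheme described above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Level {
    L0, // written by the flush pipeline (WAL → segments)
    L1, // compacted from L0
    L2, // compacted from L1; terminal — ages out only via retention
}

impl Level {
    /// The level a compaction output is written at, if any.
    fn compaction_target(self) -> Option<Level> {
        match self {
            Level::L0 => Some(Level::L1),
            Level::L1 => Some(Level::L2),
            Level::L2 => None, // there is no compaction out of L2
        }
    }
}
```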

Trigger conditions

A compaction pass fires for a given level when either of the following is true:
  • Count trigger — the number of eligible segments at the source level reaches the configured threshold (default 4 for both L0→L1 and L1→L2).
  • Time overlap — any two segments at the source level have overlapping time ranges, regardless of segment count.
The overlap trigger exists because overlapping segments break the sorted-segments assumption that queries rely on: merging them eagerly keeps the persisted view coherent and avoids returning duplicates at query time. Each compact_once call checks L0 first, then L1. Only one level is compacted per call.
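The two trigger checks can be sketched as follows; the `Seg` type and function names are assumptions (only the count default of 4 and the overlap rule come from the text above), and segment bounds are taken as inclusive for this sketch:

```rust
// Illustrative segment time bounds (inclusive on both ends).
#[derive(Clone, Copy)]
struct Seg {
    min_ts: i64,
    max_ts: i64,
}

/// True if any two segments have overlapping time ranges.
fn any_overlap(segs: &[Seg]) -> bool {
    // Sort by start time, then check each adjacent pair: with sorted
    // starts, an overlap anywhere implies an overlap between neighbours.
    let mut by_start = segs.to_vec();
    by_start.sort_by_key(|s| s.min_ts);
    by_start.windows(2).any(|w| w[1].min_ts <= w[0].max_ts)
}

const COUNT_TRIGGER: usize = 4; // default for both L0→L1 and L1→L2

/// A pass fires when either trigger condition holds.
fn should_compact(segs: &[Seg]) -> bool {
    segs.len() >= COUNT_TRIGGER || any_overlap(segs)
}
```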

Window selection

Rather than compacting all available source segments at once, each pass selects a window of up to DEFAULT_SOURCE_WINDOW_SEGMENTS (8) segments. The selection algorithm:
  1. If any segments have overlapping time ranges, the algorithm identifies the smallest cluster of overlapping segments and selects up to 8 of them by their original storage order. This ensures overlaps are resolved before anything else.
  2. If there are no overlaps, the algorithm takes the oldest count_trigger segments (a minimum of two), sorted by segment ID.
A window must contain at least two segments; a single-segment window is always a no-op.
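The non-overlapping branch of the selection can be sketched as below; `select_window` is a hypothetical name, while the constants 4 and 8 are the defaults stated above:

```rust
const SOURCE_WINDOW: usize = 8; // DEFAULT_SOURCE_WINDOW_SEGMENTS
const COUNT_TRIGGER: usize = 4;

/// Non-overlapping case only: take the oldest COUNT_TRIGGER segments by
/// segment ID, capped at the source window size.
fn select_window(mut ids: Vec<u64>) -> Option<Vec<u64>> {
    ids.sort();
    let take = COUNT_TRIGGER.min(ids.len()).min(SOURCE_WINDOW);
    if take < 2 {
        return None; // a single-segment window is always a no-op
    }
    Some(ids.into_iter().take(take).collect())
}
```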

Merge algorithm

The merge is a k-way heap merge that streams all source data points in timestamp order across all input chunks for the same series:
  1. Each input chunk is decoded into individual data points and placed behind a cursor.
  2. A min-heap ordered by (timestamp, chunk_order) drives the merge — the earliest timestamp across all cursors is drained first.
  3. A point is skipped if:
    • It is an exact duplicate of the previously emitted point — same (ts, value) pair.
    • Its timestamp falls within a tombstoned range.
    • Its timestamp is older than the retention cutoff (when compaction is used to enforce retention).
  4. Surviving points are accumulated into output chunks. When a chunk reaches chunk_point_cap points it is closed and a new one is started.
  5. When accumulated output points reach the segment point budget (chunk_point_cap × 512), the current set of chunks is written as a new output segment and the accumulator is reset. This caps output segment size even when many source segments are merged at once.
Chunk encoding for output chunks uses the same adaptive codec pipeline as all other segments: delta/XOR encoding for timestamps and values, with optional zstd compression. The WAL highwater carried by an output segment is the maximum highwater across all source segments, allowing the WAL to reclaim space for any data already compacted to persistent segments.
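The heap merge above can be sketched with Rust's standard `BinaryHeap` (a max-heap, so entries are wrapped in `Reverse` to drain the earliest timestamp first). Tombstone and retention filtering, chunk capping, and encoding are elided, and the point representation is illustrative:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Illustrative point: (timestamp, value bits). Using u64 value bits keeps
// points Ord/Eq for the sketch.
type Point = (i64, u64);

/// K-way merge of decoded chunks in (timestamp, chunk_order) order,
/// dropping exact duplicates of the previously emitted point.
fn merge_chunks(chunks: &[Vec<Point>]) -> Vec<Point> {
    // Heap entries: (timestamp, chunk_order, index within chunk), min-first.
    let mut heap = BinaryHeap::new();
    for (order, chunk) in chunks.iter().enumerate() {
        if let Some(&(ts, _)) = chunk.first() {
            heap.push(Reverse((ts, order, 0usize)));
        }
    }
    let mut out: Vec<Point> = Vec::new();
    while let Some(Reverse((_, order, idx))) = heap.pop() {
        let point = chunks[order][idx];
        // Skip exact (ts, value) duplicates of the last emitted point.
        if out.last() != Some(&point) {
            out.push(point);
        }
        // Advance this chunk's cursor and re-enter the heap.
        if let Some(&(ts, _)) = chunks[order].get(idx + 1) {
            heap.push(Reverse((ts, order, idx + 1)));
        }
    }
    out
}
```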

Tombstone handling

Before each compaction pass the tombstone store is loaded from disk. A tombstone is a series-scoped time range (start inclusive, end exclusive). Any point whose timestamp falls within a tombstone range for its series is silently discarded during the merge. Points tombstoned in one compaction pass will never reappear in subsequent reads because they are permanently excluded from the output segments.
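Under these semantics (start inclusive, end exclusive), the per-point check reduces to a range test; the `Tombstone` type here is illustrative:

```rust
// Illustrative series-scoped tombstone range.
struct Tombstone {
    start: i64, // inclusive
    end: i64,   // exclusive
}

/// True if the point at `ts` falls within any tombstoned range and
/// should be discarded during the merge.
fn is_tombstoned(ts: i64, tombstones: &[Tombstone]) -> bool {
    tombstones.iter().any(|t| t.start <= ts && ts < t.end)
}
```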

Atomic segment replacement

Compaction writes output segments to disk first, then atomically swaps them in to replace the source segments. This two-phase protocol makes interruptions safe:
  1. Write outputs — each output segment is written to a new directory under the lane root. If writing fails, any partial output directories are removed (rollback).
  2. Write replacement marker — a JSON file is written atomically to .compaction-replacements/replace-<ts>-<nonce>.json. The marker lists the relative paths of both source and output segments.
  3. Apply replacement — the source segment directories are removed and the marker file is deleted.
If the process is interrupted between steps 2 and 3, the marker file survives on disk. The next compact_once call (or startup) calls finalize_pending_compaction_replacements which replays all pending markers in sorted order, completing any interrupted replacements. An output-side manifest must exist for the replacement to be applied; if it does not, the replacement is treated as a corruption error.
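A replacement marker might look roughly like this; the field names are hypothetical, and only its content (relative paths of source and output segments) comes from the description above:

```json
{
  "sources": ["segment-000041", "segment-000042"],
  "outputs": ["segment-000107"]
}
```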

Background thread

The compaction background thread runs continuously while the storage engine is open. It sleeps for compaction_interval (default 5 seconds) between passes and wakes immediately when the flush pipeline signals that new segments have been written. The thread runs both the numeric compactor and the blob compactor in sequence on each wakeup. A mutex (compaction_lock) serialises the thread against manual compaction calls and against snapshot operations. On close(), the engine acquires all write permits, flushes active state to segments, and then runs up to 128 compaction passes to drain any remaining work before shutting down background threads. When background_fail_fast is enabled (the default), a compaction error marks the storage engine as unhealthy and causes subsequent write and flush operations to return an error.

Observability

The observability_snapshot() method exposes a CompactionObservabilitySnapshot with cumulative counters:
Field                   Description
runs_total              Total compact_once calls
success_total           Runs that produced at least one output segment
noop_total              Runs where no compaction was needed
errors_total            Runs that returned an error
source_segments_total   Cumulative source segments consumed
output_segments_total   Cumulative output segments produced
source_chunks_total     Cumulative source chunks processed
output_chunks_total     Cumulative output chunks written
source_points_total     Cumulative input data points before dedup/tombstone filtering
output_points_total     Cumulative output data points after filtering
duration_nanos_total    Cumulative wall-clock time spent in compaction
The ratio output_points_total / source_points_total indicates how much deduplication or tombstone removal is occurring over time.
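As a worked example of that ratio (the function name is illustrative), guarding against a zero denominator before any compaction has run:

```rust
/// Fraction of input points surviving compaction; values below 1.0
/// indicate dedup, tombstone, or retention filtering is removing points.
fn survival_ratio(output_points: u64, source_points: u64) -> f64 {
    if source_points == 0 {
        return 1.0; // nothing compacted yet
    }
    output_points as f64 / source_points as f64
}
```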

Tuning

StorageBuilder method   Default   Effect
with_chunk_points(n)    2048      Maximum points per chunk. Also controls the
                                  output segment size budget (n × 512 points per
                                  output segment). Larger values produce fewer,
                                  bigger segments and reduce compaction frequency
                                  but increase per-segment memory and I/O cost.
There are no builder methods for the L0/L1 count triggers or the source window size; they are fixed at 4 and 8 respectively. The compaction interval is also fixed at 5 seconds in the engine defaults.

Interaction with retention

Compaction and retention are separate operations. The retention sweep (sweep_expired_persisted_segments) deletes entire segments whose time range falls entirely below the retention cutoff. When a segment partially overlaps the cutoff, the compactor rewrites it (stage_segment_rewrite_with_retention) and drops points below the cutoff during the merge, producing a smaller output segment at the same level rather than advancing it upward.
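The point filter applied by the retention rewrite reduces to a cutoff comparison; this sketch assumes (timestamp, value) pairs and treats the cutoff itself as retained:

```rust
/// Drop points strictly older than the retention cutoff; survivors are
/// rewritten into a smaller segment at the same level.
fn apply_retention(points: &[(i64, f64)], cutoff: i64) -> Vec<(i64, f64)> {
    points.iter().copied().filter(|&(ts, _)| ts >= cutoff).collect()
}
```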