
Performance Tuning

This guide explains which knobs most affect throughput and latency in tsink and how to choose values for your workload. It assumes you have already read the storage engine internals and the configuration reference. Sections are ordered from the highest-leverage tunables (memory, write pipeline) to lower-level ones (compaction, encoding, containers).

Contents

  1. Understanding the write path
  2. Memory budget
  3. Write pipeline
  4. WAL sync mode
  5. Chunk and partition sizing
  6. Compaction tuning
  7. Metadata sharding
  8. Query performance
  9. Rollups for read-heavy workloads
  10. Server-level concurrency limits
  11. Cgroup-aware scheduling
  12. Monitoring performance
  13. Quick-reference table

1. Understanding the write path

Every call to insert_rows travels through five ordered stages before the write is acknowledged:
insert_rows(rows)

    ├─ 1. Resolve     — look up or create series IDs
    ├─ 2. Prepare     — validate, check memory budget, pick WAL codec
    ├─ 3. Stage       — encode and append to WAL (fsync if PerAppend)
    ├─ 4. Apply       — append points to the active ChunkBuilder
    └─ 5. Publish     — advance query visibility
The dominant cost varies by workload:
  • High series cardinality — stage 1 (registry lookup) dominates. Use metadata sharding.
  • Durability-sensitive — stage 3 (WAL fsync) dominates. Switch to periodic WAL sync.
  • Memory pressure — stage 2 (admission control) dominates. Increase the memory budget or reduce the ingestion rate.
  • High concurrent throughput — writer pool saturation. Increase max_writers.

2. Memory budget

The memory budget is the single most important tunable for sustained throughput. The engine tracks all active and sealed chunk bytes and holds writes in admission control when the budget is exhausted.

Setting the budget

Embedded library:
let storage = StorageBuilder::new()
    .with_data_path("./tsink-data")
    .with_memory_limit(2 * 1024 * 1024 * 1024) // 2 GiB
    .build()?;
Server:
tsink-server --memory-limit 2G --data-path ./var/tsink

Sizing guidelines

Deployment | Recommended budget
Laptop / development | 256 MiB – 512 MiB
Single-node server, moderate cardinality (< 100 k series) | 1 GiB – 4 GiB
Single-node server, high cardinality (> 500 k series) | 4 GiB – 16 GiB
Cluster node | 50–70 % of available RAM; leave room for the OS page cache

What uses memory

The budget covers active chunk builders and sealed chunks pending flush. Several other categories consume memory but are excluded from the budget:
Category | Visibility
Series registry (metadata) | tsink_memory_registry_bytes
Metadata caches and indexes | tsink_memory_metadata_cache_bytes
Persisted chunk refs and timestamp search indexes | tsink_memory_persisted_index_bytes
mmap-mapped segment payloads | tsink_memory_persisted_mmap_bytes
Tombstone state | tsink_memory_tombstone_bytes
Add a proportional allowance for these when calculating total process memory.

Backpressure behaviour

When active + sealed bytes exceed the budget:
  1. The flush pipeline is triggered immediately to persist sealed chunks to L0 segments and free their memory.
  2. Incoming writes poll every 10 ms waiting for budget recovery.
  3. If the budget is still exhausted after write_timeout (default 30 s), the write returns an error.
If tsink_flush_admission_backpressure_delays_total is climbing in /metrics, the budget is too small for your write rate. Either increase the budget or add a --data-path to enable background persistence.
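The backpressure steps above can be sketched as a simple polling loop. This is an illustration only, not tsink's implementation: `wait_for_budget`, `AdmissionError`, and the `used_bytes` callback are hypothetical names. Only the 10 ms poll interval and the `write_timeout` deadline come from the text.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum AdmissionError {
    Timeout,
}

/// Polls `used_bytes` until usage no longer exceeds `budget`, or until
/// `timeout` has elapsed. Returns how many times the caller had to sleep.
pub fn wait_for_budget<F>(
    mut used_bytes: F,
    budget: u64,
    poll: Duration,
    timeout: Duration,
) -> Result<u32, AdmissionError>
where
    F: FnMut() -> u64,
{
    let start = Instant::now();
    let mut waits = 0;
    loop {
        // Admit as soon as active + sealed bytes fit in the budget again.
        if used_bytes() <= budget {
            return Ok(waits);
        }
        // Give up once the write timeout has passed.
        if start.elapsed() >= timeout {
            return Err(AdmissionError::Timeout);
        }
        thread::sleep(poll);
        waits += 1;
    }
}
```

With the defaults described above, a caller would pass `Duration::from_millis(10)` as `poll` and `Duration::from_secs(30)` as `timeout`.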

3. Write pipeline

Writes are gated by a semaphore sized to max_writers. Each writer acquires a permit for the duration of stages 1–5 and releases it on exit.
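To make the permit model concrete, here is a minimal counting-semaphore sketch built on the standard library. `WriterGate` and `Permit` are hypothetical names, and the real engine may well use a different primitive. The sketch only shows how a permit is held for the lifetime of a write and how a bounded wait behaves when every slot is taken.

```rust
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical gate with `max_writers` slots (not tsink's actual type).
pub struct WriterGate {
    free: Mutex<usize>,
    cv: Condvar,
}

/// Holding a `Permit` represents one in-flight write; dropping it frees the slot.
pub struct Permit<'a> {
    gate: &'a WriterGate,
}

impl WriterGate {
    pub fn new(max_writers: usize) -> Self {
        Self { free: Mutex::new(max_writers), cv: Condvar::new() }
    }

    /// Waits up to `timeout` for a free slot; returns `None` on timeout.
    pub fn acquire(&self, timeout: Duration) -> Option<Permit<'_>> {
        let deadline = Instant::now() + timeout;
        let mut free = self.free.lock().unwrap();
        while *free == 0 {
            // `None` here means the deadline has already passed.
            let remaining = deadline.checked_duration_since(Instant::now())?;
            if remaining.is_zero() {
                return None;
            }
            let (guard, _timed_out) = self.cv.wait_timeout(free, remaining).unwrap();
            free = guard;
        }
        *free -= 1;
        Some(Permit { gate: self })
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        // Return the slot and wake one waiting writer.
        *self.gate.free.lock().unwrap() += 1;
        self.gate.cv.notify_one();
    }
}
```

In this model the write timeout discussed later in this section corresponds to the `timeout` argument of `acquire`.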

Setting max_writers

Embedded library:
let storage = StorageBuilder::new()
    .with_max_writers(8)
    .build()?;
Server:
tsink-server --max-writers 8
The default is the cgroup-visible CPU count (see cgroup-aware scheduling).

Tuning guidance

  • Low-latency single-writer workloads — keep max_writers at the CPU count or lower. Adding writers above CPU count introduces lock contention in the series registry.
  • Fan-out write workloads (many HTTP clients writing many small batches) — increase max_writers to 2× CPU count. Each write spends time blocked on WAL I/O, leaving CPU headroom for other writers.
  • Batch ingestion (few large batches per second) — lower max_writers to 1–2. Large batches saturate the registry faster; more writers increase memory usage without proportional throughput gains.

Write timeout

If all writer slots are occupied, new writes wait up to write_timeout (default 30 s). On a healthy system this should never trigger. If you see WriteTimeout errors, either increase max_writers or reduce the ingest rate.
use std::time::Duration;

let storage = StorageBuilder::new()
    .with_write_timeout(Duration::from_secs(10))
    .build()?;

4. WAL sync mode

WAL sync mode is the primary durability vs. throughput tradeoff.
Mode | Durability | Typical overhead
PerAppend (default) | fsync per write; crash-safe | High latency per write, excellent for low-rate durable ingestion
Periodic(interval) | OS-buffered; data in the crash window can be lost | 5–20× higher write throughput at equivalent concurrency

Disabling crash-safety for maximum throughput

use tsink::wal::WalSyncMode;
use std::time::Duration;

let storage = StorageBuilder::new()
    .with_wal_sync_mode(WalSyncMode::Periodic(Duration::from_millis(500)))
    .build()?;
Server:
tsink-server --wal-sync-mode periodic
With Periodic, insert_rows_with_result returns WriteAcknowledgement::Appended instead of ::Durable. A background thread fsyncs the WAL at the configured interval; data written within the last interval window can be lost on process crash.
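The difference between the two modes comes down to whether the append path itself waits for the disk. The sketch below models that with plain `std::fs`; `SyncMode`, `Ack`, and `append` are hypothetical stand-ins that mirror the names in the text, and the background sync thread is omitted.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::time::Duration;

/// Hypothetical mirror of the two sync modes described above.
#[derive(Clone, Copy)]
pub enum SyncMode {
    PerAppend,
    Periodic(Duration),
}

/// Hypothetical acknowledgement returned to the writer.
#[derive(Debug, PartialEq)]
pub enum Ack {
    Durable,
    Appended,
}

/// Appends one record. Only `PerAppend` pays for a sync on the write path.
pub fn append(wal: &mut File, record: &[u8], mode: SyncMode) -> io::Result<Ack> {
    wal.write_all(record)?;
    match mode {
        SyncMode::PerAppend => {
            // Block until the data has reached stable storage.
            wal.sync_data()?;
            Ok(Ack::Durable)
        }
        // A background thread (not shown) would call `sync_data` every interval.
        SyncMode::Periodic(_) => Ok(Ack::Appended),
    }
}
```

The throughput gain of the periodic mode comes from amortising one sync over every record appended during the interval.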

Disabling the WAL entirely

For in-memory or ephemeral deployments where crash-safety is not needed:
let storage = StorageBuilder::new()
    .with_wal_enabled(false)
    .build()?;
Without a WAL there is no durability guarantee. Data that has not yet been flushed to persisted segments is lost on process exit; only data already persisted by a flush and compaction cycle is available on recovery.

WAL buffer size

The default per-write I/O buffer is 4096 bytes. On high-throughput workloads with many small writes, increasing this can reduce syscall overhead:
let storage = StorageBuilder::new()
    .with_wal_buffer_size(65536) // 64 KiB
    .build()?;

5. Chunk and partition sizing

Chunk point capacity (chunk_points)

A chunk is the atomic encoded unit. The encoder selects a codec for each chunk independently. The default capacity is 2048 points.
Larger chunks | Smaller chunks
Higher compression ratio (more context for delta/XOR) | Lower read amplification on recent data
Lower flush I/O overhead | Faster availability after write (each flush covers a smaller unit)
More memory held per series before sealed | Shorter WAL high-watermark advancement
For Prometheus-style scrape workloads with a 15 s interval and regular timestamps, FixedStepRle cuts timestamp storage to a constant regardless of chunk size — here larger chunks mainly improve the float XOR compression. A value of 2048–8192 is effective. For irregular or heavily-jittered ingestion (InfluxDB, StatsD, OTLP), smaller chunks (512–1024) reduce merge amplification during compaction.
StorageBuilder::new()
    .with_chunk_points(4096)
    .build()?
Server:
tsink-server --chunk-points 4096

Partition duration

Partitions group data in time windows (default: 1 hour). One ChunkBuilder per partition head is kept open per series.
  • Shorter partitions (e.g., 15 m) reduce the amount of active memory per series but increase the number of small L0 segments, leading to more compaction.
  • Longer partitions (e.g., 6 h) improve compression across more points per series but hold more data in memory before the partition head seals.
Match the partition duration to your typical query range. A common pattern is:
Scrape interval | Retention | Recommended partition
15 s | 7–30 d | 1 h (default)
1 m | 90+ d | 6 h
5 m | 1 year+ | 24 h
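A quick way to sanity-check a partition choice is to estimate how many points each series accumulates in one partition. The helper below is a hypothetical back-of-the-envelope calculation that assumes exactly one sample per scrape interval; it is not part of the tsink API.

```rust
use std::time::Duration;

/// Expected samples per series in one partition, assuming exactly one
/// sample per scrape interval (an idealisation; real scrapes jitter).
pub fn points_per_partition(partition: Duration, scrape_interval: Duration) -> u64 {
    assert!(!scrape_interval.is_zero(), "scrape interval must be non-zero");
    (partition.as_millis() / scrape_interval.as_millis()) as u64
}
```

Under that assumption the three rows of the table work out to 240, 360, and 288 points per series per partition, so the recommendations keep the per-partition point count in a similar range as the scrape interval grows.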

Max active partition heads

max_active_partition_heads_per_series (default: 8) controls how many time-windows can be open simultaneously for a single series. Increase this only when you have significant backfill or unordered ingestion spanning many different historical windows. Keeping it low reduces memory usage.

6. Compaction tuning

Compaction merges L0 segments into L1, then L1 into L2. It runs in a background thread every 5 seconds.

Trigger thresholds

Two thresholds govern when a compaction pass fires for each level transition:
Threshold | Default | Effect
l0_trigger | 4 | Compact L0 → L1 when ≥ 4 L0 segments exist
l1_trigger | 4 | Compact L1 → L2 when ≥ 4 L1 segments exist
Additionally, a compaction pass fires immediately for a level if any two segments in that level have overlapping time ranges, regardless of the count trigger. This keeps query read amplification low. These thresholds are not yet exposed as public API — increasing the flush frequency relative to the ingest rate is the easiest way to control L0 growth.

Source window limit

Each compaction pass merges at most 8 source segments at a time. For burst-heavy workloads this keeps individual compaction jobs bounded. Back-to-back passes converge the level quickly.
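The trigger rules and the source window combine as in the sketch below. `SegmentRange` and `plan_pass` are hypothetical, and treating time ranges as inclusive is an assumption; the count trigger, the overlap rule, and the limit of 8 source segments come from the text.

```rust
/// Inclusive [min_ts, max_ts] range covered by one segment (hypothetical shape).
#[derive(Clone, Copy, Debug)]
pub struct SegmentRange {
    pub min_ts: i64,
    pub max_ts: i64,
}

/// Maximum number of source segments merged in one pass.
pub const SOURCE_WINDOW: usize = 8;

/// True if any two segments in the level cover a common timestamp.
fn any_overlap(segments: &[SegmentRange]) -> bool {
    let mut sorted = segments.to_vec();
    sorted.sort_by_key(|s| s.min_ts);
    let mut max_end = i64::MIN;
    for (i, s) in sorted.iter().enumerate() {
        // After sorting by start, an overlap exists if a segment starts
        // at or before the furthest end seen so far.
        if i > 0 && s.min_ts <= max_end {
            return true;
        }
        max_end = max_end.max(s.max_ts);
    }
    false
}

/// Returns how many source segments the next pass would merge,
/// or `None` if neither the count trigger nor the overlap rule fires.
pub fn plan_pass(level: &[SegmentRange], trigger: usize) -> Option<usize> {
    if level.len() >= trigger || any_overlap(level) {
        Some(level.len().min(SOURCE_WINDOW))
    } else {
        None
    }
}
```

A level with ten segments yields a pass over eight of them; the remaining two are picked up by a following pass, which is the back-to-back convergence described above.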

Monitoring compaction health

tsink_compaction_runs_total
tsink_compaction_success_total
tsink_compaction_errors_total
tsink_compaction_source_segments_total    # segments consumed
tsink_compaction_output_segments_total   # segments produced
tsink_compaction_duration_nanoseconds_total
A healthy deployment sees source_segments / output_segments ≈ 4 (matching the trigger threshold). A ratio close to 1 means compaction is not converging — investigate whether flush is running too frequently or the ingest rate exceeds compaction bandwidth.

Compaction and tombstones

Tombstones are applied lazily during compaction: deleted time ranges are trimmed from merged chunks so that physical disk space is reclaimed. If you issue many deletes, compaction may briefly spike in duration while re-encoding chunks.

7. Metadata sharding

The in-memory series registry is split across 64 shards by default (locked independently, keyed by series_id % 64). For workloads with hundreds of thousands of distinct series, write-transaction lock contention on the registry can become a bottleneck. with_metadata_shard_count enables an additional metadata partitioning layer for series index scans and postings lookups:
StorageBuilder::new()
    .with_metadata_shard_count(16)
    .build()?
Useful values are powers of two between 8 and 64. Higher counts reduce the per-shard scan range, improving label-matcher query latency at high cardinality. There is no strong reason to exceed 64, the size of the built-in write-path sharding. On the server this setting is not yet exposed as a CLI flag; to tune it, use the embedded library or Python bindings.
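The write-path sharding described at the start of this section (independent locks, keyed by series_id modulo the shard count) follows a common pattern that can be sketched as below. `ShardedRegistry` is a hypothetical type that stores a name per series purely for illustration; the real registry holds richer metadata.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical registry split into independently locked shards.
pub struct ShardedRegistry {
    shards: Vec<Mutex<HashMap<u64, String>>>,
}

impl ShardedRegistry {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        Self {
            shards: (0..shard_count).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    /// Shard index for a series: series_id modulo the shard count.
    pub fn shard_of(&self, series_id: u64) -> usize {
        (series_id % self.shards.len() as u64) as usize
    }

    /// Locks only the owning shard, so writers to other shards are not blocked.
    pub fn insert(&self, series_id: u64, name: &str) {
        let idx = self.shard_of(series_id);
        self.shards[idx].lock().unwrap().insert(series_id, name.to_string());
    }

    pub fn get(&self, series_id: u64) -> Option<String> {
        let idx = self.shard_of(series_id);
        self.shards[idx].lock().unwrap().get(&series_id).cloned()
    }
}
```

Two writers contend only when their series IDs map to the same shard, which is why contention falls as the shard count rises.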

8. Query performance

mmap reads

Segment chunk files are opened as read-only memory maps and decoded directly from the mapped address space without a user-space copy. The OS page cache serves as the effective read buffer, so read performance improves as more free RAM is available to cache segment files. For read-heavy deployments:
  • Leave at least 30–50 % of RAM unallocated so the page cache can hold recently-accessed segment files.
  • Avoid the memory budget consuming all available RAM — tsink_memory_persisted_mmap_bytes contributes to the overhead.

Block-level seek index

Each chunk carries an in-memory timestamp search index that allows the decoder to seek directly to the 64-point block containing the desired timestamp without decompressing the whole chunk. Narrower time-range queries benefit most from this.
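One way to picture such an index is a sorted list holding the first timestamp of every 64-point block, searched with a binary search. The functions below are a hypothetical sketch of that idea, assuming timestamps within a chunk are sorted ascending; the actual on-disk and in-memory layout may differ.

```rust
/// Points per block, as described above.
pub const BLOCK_POINTS: usize = 64;

/// Builds the index: the first timestamp of every 64-point block.
/// Assumes `timestamps` is sorted in ascending order.
pub fn build_index(timestamps: &[i64]) -> Vec<i64> {
    timestamps.chunks(BLOCK_POINTS).map(|block| block[0]).collect()
}

/// Returns the index of the block that may contain `ts`,
/// or `None` if `ts` precedes the first point in the chunk.
pub fn seek_block(index: &[i64], ts: i64) -> Option<usize> {
    // Count the blocks whose first timestamp is <= ts; the last of them
    // is the only block that can contain ts.
    let after = index.partition_point(|&first| first <= ts);
    after.checked_sub(1)
}
```

The decoder would then decompress only that block, and any following blocks the query range spans, instead of the whole chunk.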

Series selection and label matchers

The postings bitmap indexes (one per metric name, per label name, per label name+value pair) are loaded into memory on startup and used to resolve label-matchers before any chunk data is read. Regex matchers compile once and are applied against the string dictionary, not against raw label values. For high-cardinality label spaces, prefer exact-match (=) and not-equal (!=) matchers over regex (=~) wherever possible — exact matches walk the postings index in O(1) while regex requires scanning the label dictionary.
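The cost difference between the matcher types can be illustrated with a toy postings index. `Postings` is hypothetical and uses sorted sets rather than bitmaps, and a plain predicate stands in for a compiled regex because the standard library has no regex engine. The point is only that an exact match is one keyed lookup, while a pattern must be tested against every distinct value of the label.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy postings index: label name -> label value -> series IDs.
#[derive(Default)]
pub struct Postings {
    by_label: HashMap<String, HashMap<String, BTreeSet<u32>>>,
}

impl Postings {
    pub fn add(&mut self, name: &str, value: &str, series_id: u32) {
        self.by_label
            .entry(name.to_string())
            .or_default()
            .entry(value.to_string())
            .or_default()
            .insert(series_id);
    }

    /// Exact match (=): a single keyed lookup.
    pub fn exact(&self, name: &str, value: &str) -> BTreeSet<u32> {
        self.by_label
            .get(name)
            .and_then(|values| values.get(value))
            .cloned()
            .unwrap_or_default()
    }

    /// Pattern match (stand-in for =~): tests every distinct value of the label.
    pub fn matching<P>(&self, name: &str, pred: P) -> BTreeSet<u32>
    where
        P: Fn(&str) -> bool,
    {
        let mut out = BTreeSet::new();
        if let Some(values) = self.by_label.get(name) {
            for (value, ids) in values {
                if pred(value.as_str()) {
                    out.extend(ids.iter().copied());
                }
            }
        }
        out
    }
}
```

The work done by `matching` grows with the number of distinct values for the label, which is why regex matchers become expensive in high-cardinality label spaces.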

Metadata sharding for range queries

For select_series queries with time-range filters applied to servers with large histories (many persisted segments), increasing metadata_shard_count reduces the per-shard scan range and can cut latency by 2–5× at > 500 k series.

Rollup materialization

For dashboard-style repeated queries over long time ranges, configure rollup policies. The query engine checks for materialized series before falling back to a raw scan. At query time a rollup hit saves reading and merging the underlying raw chunks entirely.

9. Rollups for read-heavy workloads

If the same long-range aggregation query runs repeatedly (e.g., Grafana dashboards with rate, sum, avg), define a rollup policy to pre-materialize the result at a lower resolution:
curl -X POST http://127.0.0.1:9201/api/v1/rollups \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "http_req_1m",
    "metric": "http_requests_total",
    "interval": 60000,
    "aggregation": "sum"
  }'
Materialized rollup series are stored as regular time series and resolved at query time with no special code path — the engine simply notices the materialized resolution matches the query step and short-circuits the raw scan.
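Conceptually, a sum rollup with a 60000 ms interval groups raw points into fixed windows and stores one aggregated point per window. The function below is a hypothetical sketch of that downsampling step; labelling each bucket by its window start is an assumption, and the engine's actual materialisation logic is not shown in this guide.

```rust
use std::collections::BTreeMap;

/// Sums raw (timestamp_ms, value) points into fixed windows of `interval_ms`.
/// Each output point is labelled with the start of its window (an assumption).
pub fn rollup_sum(points: &[(i64, f64)], interval_ms: i64) -> Vec<(i64, f64)> {
    assert!(interval_ms > 0, "interval must be positive");
    let mut buckets: BTreeMap<i64, f64> = BTreeMap::new();
    for &(ts, value) in points {
        // Align the timestamp down to its window start; rem_euclid keeps
        // the alignment correct for negative timestamps as well.
        let window_start = ts - ts.rem_euclid(interval_ms);
        *buckets.entry(window_start).or_insert(0.0) += value;
    }
    buckets.into_iter().collect()
}
```

A query whose step matches the rollup interval can then read one point per window instead of every raw sample in it.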

10. Server-level concurrency limits

The tsink-server adds an HTTP-layer admission tier on top of the engine-level max_writers semaphore. These environment variables are read once at process startup:
Variable | Default | Description
TSINK_SERVER_WRITE_MAX_INFLIGHT_REQUESTS | 64 | Max concurrent write HTTP requests
TSINK_SERVER_WRITE_MAX_INFLIGHT_ROWS | 200000 | Max total in-flight rows across all write requests
TSINK_SERVER_WRITE_RESOURCE_ACQUIRE_TIMEOUT_MS | 25 | Milliseconds to wait for a write slot before returning 429
TSINK_SERVER_READ_MAX_INFLIGHT_REQUESTS | 64 | Max concurrent read HTTP requests
TSINK_SERVER_READ_MAX_INFLIGHT_QUERIES | 128 | Max total in-flight query slots
TSINK_SERVER_READ_MAX_INFLIGHT_ROWS | (unlimited) | Optionally cap total rows returned across concurrent reads
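The write limits act on two dimensions at once: the number of requests and the total number of rows they carry. The sketch below models that with a single lock. `WriteAdmission` is a hypothetical type, and it rejects immediately instead of waiting for the acquire timeout, which is a simplification of the behaviour in the table.

```rust
use std::sync::Mutex;

/// Hypothetical two-dimensional admission gate for write requests.
pub struct WriteAdmission {
    max_requests: usize,
    max_rows: usize,
    inflight: Mutex<(usize, usize)>, // (requests, rows)
}

impl WriteAdmission {
    pub fn new(max_requests: usize, max_rows: usize) -> Self {
        Self { max_requests, max_rows, inflight: Mutex::new((0, 0)) }
    }

    /// Admits a request carrying `rows` rows if both limits allow it.
    /// A `false` result corresponds to the server answering 429.
    pub fn try_admit(&self, rows: usize) -> bool {
        let mut inflight = self.inflight.lock().unwrap();
        if inflight.0 + 1 > self.max_requests || inflight.1 + rows > self.max_rows {
            return false;
        }
        inflight.0 += 1;
        inflight.1 += rows;
        true
    }

    /// Must be called once for every admitted request when it completes.
    pub fn release(&self, rows: usize) {
        let mut inflight = self.inflight.lock().unwrap();
        inflight.0 = inflight.0.saturating_sub(1);
        inflight.1 = inflight.1.saturating_sub(rows);
    }
}
```

With the defaults in the table, a few very large requests can exhaust the row limit long before the request limit is reached, so both values usually need to be raised together.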

Tuning for high write throughput

For Prometheus remote-write or OTLP workloads where each request carries 1000–10 000 rows:
export TSINK_SERVER_WRITE_MAX_INFLIGHT_REQUESTS=128
export TSINK_SERVER_WRITE_MAX_INFLIGHT_ROWS=500000
export TSINK_SERVER_WRITE_RESOURCE_ACQUIRE_TIMEOUT_MS=100

Tuning for many concurrent queries

For dashboard-heavy setups with many Grafana panels refreshing simultaneously:
export TSINK_SERVER_READ_MAX_INFLIGHT_REQUESTS=256
export TSINK_SERVER_READ_MAX_INFLIGHT_QUERIES=512
export TSINK_SERVER_READ_RESOURCE_ACQUIRE_TIMEOUT_MS=200

11. Cgroup-aware scheduling

tsink reads container CPU and memory limits from the cgroup v2 interface (/sys/fs/cgroup/cpu.max and /sys/fs/cgroup/memory.max) at startup and uses those values to size the default worker pools. Internally:
  • available_cpus() returns the smaller of the cgroup CPU quota (rounded up) and the host CPU count.
  • max_writers defaults to available_cpus().
  • The async storage read worker pool also defaults to available_cpus().
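The detection rule in the first bullet can be sketched as follows. The cgroup v2 `cpu.max` file holds either `max <period>` (no limit) or `<quota> <period>` in microseconds. `parse_cpu_max` and `effective_cpus` are hypothetical helpers that mirror the behaviour described above, not tsink's actual functions, and the TSINK_MAX_CPUS override is left out.

```rust
/// Parses the contents of cgroup v2 `cpu.max`.
/// Returns the quota in whole CPUs (rounded up), or `None` if the cgroup
/// is unlimited or the contents cannot be parsed.
pub fn parse_cpu_max(contents: &str) -> Option<usize> {
    let mut fields = contents.split_whitespace();
    let quota = fields.next()?;
    let period: u64 = fields.next()?.parse().ok()?;
    if quota == "max" || period == 0 {
        return None;
    }
    let quota: u64 = quota.parse().ok()?;
    // Ceiling division: a quota of 1.5 CPUs counts as 2.
    Some(((quota + period - 1) / period).max(1) as usize)
}

/// Smaller of the cgroup quota and the host CPU count.
/// `cpu_max` is the file contents, or `None` if the file could not be read.
pub fn effective_cpus(cpu_max: Option<&str>, host_cpus: usize) -> usize {
    match cpu_max.and_then(parse_cpu_max) {
        Some(quota_cpus) => quota_cpus.min(host_cpus),
        None => host_cpus,
    }
}
```

A caller could obtain the inputs with `std::fs::read_to_string("/sys/fs/cgroup/cpu.max").ok()` and `std::thread::available_parallelism()`.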

Overriding the detected CPU count

If the autodetection is wrong or you want to pin the worker count regardless of the container limit:
export TSINK_MAX_CPUS=4
This overrides available_cpus() globally. All worker pools that default to CPU count will respect this override.

Memory limit awareness

The cgroup memory limit is exposed through cgroup::get_memory_limit() but is not currently applied automatically as the storage memory budget — you must set --memory-limit or with_memory_limit() explicitly. A reasonable starting point is 60–70 % of the container memory limit:
# cgroup limit = 4 GiB → use 2.5 GiB for tsink data
tsink-server --memory-limit 2684354560 --data-path ./var/tsink
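The arithmetic behind the example above can be scripted when deployments vary in size. The cgroup v2 `memory.max` file contains either `max` or a byte count. `parse_memory_max` and `budget_from_limit` are hypothetical helpers for computing a value to pass to --memory-limit; they are not part of tsink.

```rust
/// Parses the contents of cgroup v2 `memory.max`.
/// Returns `None` when the cgroup is unlimited ("max") or unparsable.
pub fn parse_memory_max(contents: &str) -> Option<u64> {
    match contents.trim() {
        "max" => None,
        bytes => bytes.parse().ok(),
    }
}

/// Budget as a share of the container limit, in thousandths (625 = 62.5 %).
pub fn budget_from_limit(limit_bytes: u64, per_mille: u64) -> u64 {
    // Widen to u128 so the multiplication cannot overflow.
    (limit_bytes as u128 * per_mille as u128 / 1000) as u64
}
```

For a 4 GiB container, 62.5 % gives 2684354560 bytes, the 2.5 GiB value used in the command above.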

12. Monitoring performance

All internal counters are exposed at GET /metrics in Prometheus format. The most useful ones for performance diagnosis:

Write throughput and latency

# Rate of points ingested
rate(tsink_wal_append_points_total[1m])

# Write admission delays (backpressure events)
tsink_flush_admission_backpressure_delays_total
tsink_flush_admission_backpressure_delay_nanoseconds_total

# Write timeout errors
tsink_flush_pipeline_timeout_total

Memory pressure

# Budget usage
tsink_memory_used_bytes / tsink_memory_budget_bytes

# mmap footprint (not counted in budget)
tsink_memory_persisted_mmap_bytes

Compaction health

# Compaction rate — should be ≥ flush rate to avoid L0 accumulation
rate(tsink_compaction_success_total[5m])

# L0 segment count (should stay low — near the l0_trigger)
tsink_flush_hot_segments_visible

Flush pipeline

# Segments landing on disk per minute
rate(tsink_flush_persisted_segments_total[1m])

# Seal-to-persist pipeline duration
rate(tsink_flush_persist_duration_nanoseconds_total[1m])
  / rate(tsink_flush_persist_success_total[1m])

Query latency

# Select call duration
rate(tsink_query_select_duration_nanoseconds_total[1m])
  / rate(tsink_query_select_calls_total[1m])

# Rollup hit rate vs raw scans
rate(tsink_query_rollup_query_plans_total[5m])
  / rate(tsink_query_select_with_options_calls_total[5m])

13. Quick-reference table

Goal | Parameter | Recommended change
Maximise write throughput | wal_sync_mode | Periodic(500ms)
Reduce write latency at high concurrency | max_writers | 2× CPU count
Prevent OOM on memory spike | memory_limit | 60–70 % of available RAM
Improve compression ratio | chunk_points | 4096–8192
Reduce read latency on large label spaces | metadata_shard_count | 16–64
Reduce read latency on long time ranges | Rollup policies | 1 m / 5 m resolution
Handle burst write traffic | TSINK_SERVER_WRITE_MAX_INFLIGHT_REQUESTS | 2× concurrent writers
Handle many concurrent queries | TSINK_SERVER_READ_MAX_INFLIGHT_QUERIES | 4–8× CPU count
Cap worker pool in containers | TSINK_MAX_CPUS | Container CPU quota
Survive background failures silently | background_fail_fast | false (dev/test only)
Fast startup after crash | wal_replay_mode | Salvage (accept potential data loss)