Performance Tuning
This guide explains which knobs most affect throughput and latency in tsink and how to choose values for your workload. It assumes you have already read the storage engine internals and the configuration reference. Sections are ordered from the highest-leverage tunables (memory, write pipeline) to lower-level ones (compaction, encoding, containers).Contents
- Understanding the write path
- Memory budget
- Write pipeline
- WAL sync mode
- Chunk and partition sizing
- Compaction tuning
- Metadata sharding
- Query performance
- Rollups for read-heavy workloads
- Server-level concurrency limits
- Cgroup-aware scheduling
- Monitoring performance
- Quick-reference table
1. Understanding the write path
Every call toinsert_rows travels through five ordered stages before the write is acknowledged:
- High series cardinality — stage 1 (registry lookup) dominates. Use metadata sharding.
- Durability-sensitive — stage 3 (WAL fsync) dominates. Switch to periodic WAL sync.
- Memory pressure — stage 2 (admission control) dominates. Increase the memory budget or reduce the ingestion rate.
- High concurrent throughput — writer pool saturation. Increase
max_writers.
2. Memory budget
The memory budget is the single most important tunable for sustained throughput. The engine tracks all active and sealed chunk bytes and holds writes in admission control when the budget is exhausted.Setting the budget
Embedded library:Sizing guidelines
| Deployment | Recommended budget |
|---|---|
| Laptop / development | 256 MiB – 512 MiB |
| Single-node server, moderate cardinality (< 100 k series) | 1 GiB – 4 GiB |
| Single-node server, high cardinality (> 500 k series) | 4 GiB – 16 GiB |
| Cluster node | Set to 50–70 % of available RAM; leave room for the OS page cache |
What uses memory
The budget covers active chunk builders and sealed chunks pending flush. Several other categories consume memory but are excluded from the budget:| Category | Visibility |
|---|---|
| Series registry (metadata) | tsink_memory_registry_bytes |
| Metadata caches and indexes | tsink_memory_metadata_cache_bytes |
| Persisted chunk refs and timestamp search indexes | tsink_memory_persisted_index_bytes |
| mmap-mapped segment payloads | tsink_memory_persisted_mmap_bytes |
| Tombstone state | tsink_memory_tombstone_bytes |
Backpressure behaviour
When active + sealed bytes exceeds the budget:- The flush pipeline is triggered immediately to persist sealed chunks to L0 segments and free their memory.
- Incoming writes poll every 10 ms waiting for budget recovery.
- If the budget is still exhausted after
write_timeout(default 30 s), the write returns an error.
tsink_flush_admission_backpressure_delays_total is climbing in /metrics, the budget is too small for your write rate. Either increase the budget or add a --data-path to enable background persistence.
3. Write pipeline
Writes are gated by a semaphore sized tomax_writers. Each writer acquires a permit for the duration of stages 1–5 and releases it on exit.
Setting max_writers
Embedded library:Tuning guidance
- Low-latency single-writer workloads — keep
max_writersat the CPU count or lower. Adding writers above CPU count introduces lock contention in the series registry. - Fan-out write workloads (many HTTP clients writing many small batches) — increase
max_writersto 2× CPU count. Each write spends time blocked on WAL I/O, leaving CPU headroom for other writers. - Batch ingestion (few large batches per second) — lower
max_writersto 1–2. Large batches saturate the registry faster; more writers increase memory usage without proportional throughput gains.
Write timeout
If all writer slots are occupied, new writes wait up towrite_timeout (default 30 s). On a healthy system this should never trigger. If you see WriteTimeout errors, either increase max_writers or reduce the ingest rate.
4. WAL sync mode
WAL sync mode is the primary durability vs. throughput tradeoff.| Mode | Durability | Typical overhead |
|---|---|---|
PerAppend (default) | fsync per write — crash-safe | High latency per write, excellent for low-rate durable ingestion |
Periodic(interval) | OS-buffered; data in the crash window can be lost | 5–20× higher write throughput at equivalent concurrency |
Disabling crash-safety for maximum throughput
Periodic, insert_rows_with_result returns WriteAcknowledgement::Appended instead of ::Durable. A background thread fsync’s the WAL at the configured interval; data written within the last interval window can be lost on process crash.
Disabling the WAL entirely
For in-memory or ephemeral deployments where crash-safety is not needed:WAL buffer size
The default per-write I/O buffer is 4096 bytes. On high-throughput workloads with many small writes, increasing this can reduce syscall overhead:5. Chunk and partition sizing
Chunk point capacity (chunk_points)
A chunk is the atomic encoded unit. The encoder selects a codec for each chunk independently. The default capacity is 2048 points.
| Larger chunks | Smaller chunks |
|---|---|
| Higher compression ratio (more context for delta/XOR) | Lower read amplification on recent data |
| Lower flush I/O overhead | Faster availability after write (each flush covers a smaller unit) |
| More memory held per series before sealed | Shorter WAL high-watermark advancement |
FixedStepRle cuts timestamp storage to a constant regardless of chunk size — here larger chunks mainly improve the float XOR compression. A value of 2048–8192 is effective.
For irregular or heavily-jittered ingestion (InfluxDB, StatsD, OTLP), smaller chunks (512–1024) reduce merge amplification during compaction.
Partition duration
Partitions group data in time windows (default: 1 hour). OneChunkBuilder per partition head is kept open per series.
- Shorter partitions (e.g., 15 m) reduce the amount of active memory per series but increase the number of small L0 segments, leading to more compaction.
- Longer partitions (e.g., 6 h) improve compression across more points per series but hold more data in memory before the partition head seals.
| Scrape interval | Retention | Recommended partition |
|---|---|---|
| 15 s | 7–30 d | 1 h (default) |
| 1 m | 90+ d | 6 h |
| 5 m | 1 year+ | 24 h |
Max active partition heads
max_active_partition_heads_per_series (default: 8) controls how many time-windows can be open simultaneously for a single series. Increase this only when you have significant backfill or unordered ingestion spanning many different historical windows. Keeping it low reduces memory usage.
6. Compaction tuning
Compaction merges L0 segments into L1, then L1 into L2. It runs in a background thread every 5 seconds.Trigger thresholds
Two thresholds govern when a compaction pass fires for each level transition:| Threshold | Default | Effect |
|---|---|---|
l0_trigger | 4 | Compact L0 → L1 when ≥ 4 L0 segments exist |
l1_trigger | 4 | Compact L1 → L2 when ≥ 4 L1 segments exist |
Source window limit
Each compaction pass merges at most 8 source segments at a time. For burst-heavy workloads this keeps individual compaction jobs bounded. Back-to-back passes converge the level quickly.Monitoring compaction health
source_segments / output_segments ≈ 4 (matching the trigger threshold). A ratio close to 1 means compaction is not converging — investigate whether flush is running too frequently or the ingest rate exceeds compaction bandwidth.
Compaction and tombstones
Tombstones are applied lazily during compaction: deleted time ranges are trimmed from merged chunks so that physical disk space is reclaimed. If you issue many deletes, compaction may briefly spike in duration while re-encoding chunks.7. Metadata sharding
The in-memory series registry is split across 64 shards by default (locked independently, keyed byseries_id % 64). For workloads with hundreds of thousands of distinct series, write-transaction lock contention on the registry can become a bottleneck.
with_metadata_shard_count enables an additional metadata partitioning layer for series index scans and postings lookups:
8. Query performance
mmap reads
Segment chunk files are opened as read-only memory maps and decoded directly from the mapped address space without a user-space copy. The OS page cache serves as the effective read buffer. Performance improves proportionally with available free RAM. For read-heavy deployments:- Leave at least 30–50 % of RAM unallocated so the page cache can hold recently-accessed segment files.
- Avoid the memory budget consuming all available RAM —
tsink_memory_persisted_mmap_bytescontributes to the overhead.
Block-level seek index
Each chunk carries an in-memory timestamp search index that allows the decoder to seek directly to the 64-point block containing the desired timestamp without decompressing the whole chunk. Narrower time-range queries benefit most from this.Series selection and label matchers
The postings bitmap indexes (one per metric name, per label name, per label name+value pair) are loaded into memory on startup and used to resolve label-matchers before any chunk data is read. Regex matchers compile once and are applied against the string dictionary, not against raw label values. For high-cardinality label spaces, prefer exact-match (=) and not-equal (!=) matchers over regex (=~) wherever possible — exact matches walk the postings index in O(1) while regex requires scanning the label dictionary.
Metadata sharding for range queries
Forselect_series queries with time-range filters applied to servers with large histories (many persisted segments), increasing metadata_shard_count reduces the per-shard scan range and can cut latency by 2–5× at > 500 k series.
Rollup materialization
For dashboard-style repeated queries over long time ranges, configure rollup policies. The query engine checks for materialized series before falling back to a raw scan. At query time a rollup hit saves reading and merging the underlying raw chunks entirely.9. Rollups for read-heavy workloads
If the same long-range aggregation query runs repeatedly (e.g., Grafana dashboards withrate, sum, avg), define a rollup policy to pre-materialize the result at a lower resolution:
10. Server-level concurrency limits
The tsink-server adds an HTTP-layer admission tier on top of the engine-levelmax_writers semaphore. These environment variables are read once at process startup:
| Variable | Default | Description |
|---|---|---|
TSINK_SERVER_WRITE_MAX_INFLIGHT_REQUESTS | 64 | Max concurrent write HTTP requests |
TSINK_SERVER_WRITE_MAX_INFLIGHT_ROWS | 200000 | Max total in-flight rows across all write requests |
TSINK_SERVER_WRITE_RESOURCE_ACQUIRE_TIMEOUT_MS | 25 | ms to wait for a write slot before returning 429 |
TSINK_SERVER_READ_MAX_INFLIGHT_REQUESTS | 64 | Max concurrent read HTTP requests |
TSINK_SERVER_READ_MAX_INFLIGHT_QUERIES | 128 | Max total in-flight query slots |
TSINK_SERVER_READ_MAX_INFLIGHT_ROWS | (unlimited) | Optionally cap total rows returned across concurrent reads |
Tuning for high write throughput
For Prometheus remote-write or OTLP workloads where each request carries 1000–10 000 rows:Tuning for many concurrent queries
For dashboard-heavy setups with many Grafana panels refreshing simultaneously:11. Cgroup-aware scheduling
tsink reads container CPU and memory limits from the cgroup v2 interface (/sys/fs/cgroup/cpu.max and /sys/fs/cgroup/memory.max) at startup and uses those values to size the default worker pools.
Internally:
available_cpus()returns the smaller of the cgroup CPU quota (rounded up) and the host CPU count.max_writersdefaults toavailable_cpus().- The async storage read worker pool also defaults to
available_cpus().
Overriding the detected CPU count
If the autodetection is wrong or you want to pin the worker count regardless of the container limit:available_cpus() globally. All worker pools that default to CPU count will respect this override.
Memory limit awareness
The cgroup memory limit is exposed throughcgroup::get_memory_limit() but is not currently applied automatically as the storage memory budget — you must set --memory-limit or with_memory_limit() explicitly. A reasonable starting point is 60–70 % of the container memory limit:
12. Monitoring performance
All internal counters are exposed atGET /metrics in Prometheus format. The most useful ones for performance diagnosis:
Write throughput and latency
Memory pressure
Compaction health
Flush pipeline
Query latency
13. Quick-reference table
| Goal | Parameter | Recommended change |
|---|---|---|
| Maximise write throughput | wal_sync_mode | Periodic(500ms) |
| Reduce write latency at high concurrency | max_writers | 2× CPU count |
| Prevent OOM on memory spike | memory_limit | 60–70 % of available RAM |
| Improve compression ratio | chunk_points | 4096–8192 |
| Reduce read latency on large label spaces | metadata_shard_count | 16–64 |
| Reduce read latency on long time ranges | Rollup policies | 1 m / 5 m resolution |
| Handle burst write traffic | TSINK_SERVER_WRITE_MAX_INFLIGHT_REQUESTS | 2× concurrent writers |
| Handle many concurrent queries | TSINK_SERVER_READ_MAX_INFLIGHT_QUERIES | 4–8× CPU count |
| Cap worker pool in containers | TSINK_MAX_CPUS | Container CPU quota |
| Survive background failures silently | background_fail_fast | false (dev/test only) |
| Fast startup after crash | wal_replay_mode | Salvage (accept potential data loss) |