Prometheus Blog
Introducing the UX Research Working Group
Prometheus has always prioritized solving complex technical challenges to deliver a reliable, performant open-source monitoring system. Over time, however, users have expressed a variety of experience-related pain points. Those pain points range from onboarding and configuration to documentation, mental models, and interoperability across the ecosystem.
At PromCon 2025, a user research study was presented that highlighted several of these issues. Although the central area of investigation involved Prometheus and OpenTelemetry workflows, the broader takeaway was clear: Prometheus would benefit from a dedicated, ongoing effort to understand user needs and improve the overall user experience.
Recognizing this, the Prometheus team established a Working Group focused on improving user experience through design and user research. This group is meant to support all areas of Prometheus by bringing structured research, user insights, and usability perspectives into the community's development and decision-making processes.
How we can help Prometheus maintainers
Building something where the user needs are unclear? Maybe you're looking at two competing solutions and you'd like to understand the user tradeoffs alongside the technical ones.
That's where we can be of help.
The UX Working Group will partner with you to conduct user research or provide feedback on your plans for user outreach. That could include:
- User research reports and summaries
- User journeys, personas, wireframes, prototypes, and other UX artifacts
- Recommendations for improving usability, onboarding, interoperability, and documentation
- Prioritized lists of user pain points
- Suggestions for community discussions or decision-making topics
To get started, tell us what you're trying to do, and we'll work with you to determine what type and scope of research is most appropriate.
How we can help Prometheus end users
We want to hear from you! Let us know if you're interested in participating in a research study and we'll contact you when we're working on one that's a good fit. Having an issue with the Prometheus user experience? We can help you open an issue and direct it to the appropriate community members.
Interested in helping?
New contributors to the working group are always welcome! Get in touch and let us know what you'd like to work on.
Where to find us
Drop us a message in Slack, join a meeting, or raise an issue in GitHub.
- Slack: #prometheus-ux-wg
- Meetings: We meet biweekly, currently on Wednesdays at 14:00 UTC (subject to change depending on contributor availability).
- GitHub: prometheus-community/ux-research
Uncached I/O in Prometheus
Do you find yourself constantly looking up the difference between container_memory_usage_bytes, container_memory_working_set_bytes, and container_memory_rss? Pick the wrong one and your memory limits lie to you, your benchmarks mislead you, and your container gets OOMKilled.
You're not alone. There is even a 9-year-old Kubernetes issue that captures the frustration of users.
The explanation is simple: RAM is not used in just one way. One of the easiest things to miss is the page cache semantics. For some containers, memory taken by page caching can make up most of the reported usage, even though that memory is largely reclaimable, creating surprising differences between those metrics.
NOTE: The feature discussed here currently only supports Linux.
Prometheus writes a lot of data to disk. It is, after all, a database. But not every write benefits from sitting in the page cache. Compaction writes are the clearest example: once a block is written, only a fraction of that data is likely to be queried again soon, and since there is no way to predict which fraction, caching it all offers little return. The use-uncached-io feature flag was built to address exactly this.
Bypassing the cache for those writes reduces Prometheus's page cache footprint, making its memory usage more predictable and easier to reason about. It also relieves pressure on that shared cache, lowering the risk of evicting hot data that queries and other reads actually depend on. A potential bonus is reduced CPU overhead from cache allocations and evictions. The hard constraint throughout was to avoid any measurable regression in CPU or disk I/O.
The flag was introduced in Prometheus v3.5.0 and currently only supports Linux. Under the hood, it uses direct I/O, which requires proper filesystem support and kernel v2.4.10 or newer, though you should be fine, as that version shipped nearly 25 years ago.
If direct I/O helps here, why was it not done earlier, and why is it not used everywhere it would help? Because direct I/O comes with strict alignment requirements. Unlike buffered I/O, you cannot simply write any chunk of memory to any position in a file. The file offset, the memory buffer address, and the transfer size must all be aligned to the logical sector size of the underlying storage device, typically 512 or 4096 bytes.
To satisfy those constraints, a bufio.Writer-like writer, directIOWriter, was implemented. On Linux kernels v6.1 or newer, Prometheus retrieves the exact alignment values via statx; on older kernels, conservative defaults are used.
The directIOWriter currently covers chunk writes during compaction only, but that alone accounts for a substantial portion of Prometheus's I/O. The results are tangible: benchmarks show a 20–50% reduction in page cache usage, as measured by container_memory_cache.
The work is not done yet, and contributions are welcome. Here are a few areas that could help move the feature closer to General Availability:
Covering more write paths
Direct I/O is currently limited to chunk writes during compaction. Index files and WAL writes are natural next candidates, although they would require some additional work.
Building more confidence around directIOWriter
All existing TSDB tests can be run against the directIOWriter using a dedicated build tag: `go test --tags=forcedirectio ./tsdb/`. More tests covering edge cases for the writer itself would be welcome, and there is even an idea of formally verifying that it never violates alignment requirements.
Experimenting with RWF_DONTCACHE
Introduced in Linux kernel v6.14, RWF_DONTCACHE enables uncached buffered I/O, where data still goes through the page cache but the corresponding pages are dropped afterwards. It would be worth benchmarking whether this delivers similar benefits without direct I/O's alignment constraints.
Support beyond Linux
Support is currently Linux-only. Contributions to extend it to other operating systems are welcome.
For more details, see the proposal and the PR that introduced the feature.
Modernizing Prometheus: Native Storage for Composite Types
Over the last year, the Prometheus community has been working hard on several interesting and ambitious changes that previously would have been seen as controversial or not feasible. While these changes may have little visibility from the outside (e.g., it's not an OpenClaw Prometheus plugin, sorry!), Prometheus developers are organically steering Prometheus toward a certain, coherent future. Piece by piece, we unexpectedly get closer to goals we never dreamed we would achieve as an open-source project!
This post (hopefully!) starts a series of blog posts sharing a few ambitious shifts that might be exciting to new and existing Prometheus users and developers. In this post, I'd love to focus on the idea of native storage for composite types, which tidies up a lot of challenges that have piled up over time. Make sure to check the inline links for how you can adopt some of those changes early or contribute!
CAUTION: Disclaimer: This post is intended as a fun overview from my own personal point of view as a Prometheus maintainer. Some of the mentioned changes haven't yet been officially approved by the Prometheus Team; some have not been proven in production.
NOTE: This post was written by humans; AI was used only for cosmetic and grammar fixes.
Classic Representation: Primitive Samples
As you might know, the Prometheus data model (i.e., the server, PromQL, and protocols) supports gauges, counters, histograms, and summaries. OpenMetrics 1.0 extended this with gaugehistogram, info, and stateset types.
Impressively, for a long time Prometheus' TSDB storage implementation had a deliberately clean and simple data model. The TSDB allowed the storage and retrieval of string-labelled primitive samples containing only float64 values and int64 timestamps. It was completely metric-type-agnostic.
The metric types were implied on top of the TSDB, for humans and best effort tooling for PromQL. For simplicity, let's call this way of storing types a classic model or representation. In this model:
We have primitive types:

- `gauge` is a "default" type with no special rules, just a float sample with labels.
- `counter` should have a `_total` suffix in the name for humans to understand its semantics:

  ```
  foo_total 17.0
  ```

- `info` needs an `_info` suffix in the metric name and always has a value of `1`.

We have composite types. This is where the fun begins. In the classic representation, composite metrics are represented as a set of primitive float samples:

- `histogram` is a group of `counter`s with certain mandatory suffixes and `le` labels:

  ```
  foo_bucket{le="0.0"} 0
  foo_bucket{le="1e-05"} 0
  foo_bucket{le="0.0001"} 5
  foo_bucket{le="0.1"} 8
  foo_bucket{le="1.0"} 10
  foo_bucket{le="10.0"} 11
  foo_bucket{le="100000.0"} 11
  foo_bucket{le="1e+06"} 15
  foo_bucket{le="1e+23"} 16
  foo_bucket{le="1.1e+23"} 17
  foo_bucket{le="+Inf"} 17
  foo_count 17
  foo_sum 324789.3
  ```

- `gaugehistogram`, `summary`, and `stateset` types follow the same logic: a group of special `gauge`s or `counter`s that compose a single metric.
The classic model served the Prometheus project well. It significantly simplified the storage implementation, enabling Prometheus to be one of the most optimized open-source time-series databases, with distributed versions based on the same data model available in projects like Cortex, Thanos, and Mimir.
Unfortunately, there are always tradeoffs. This classic model has a few limitations:

- Efficiency: It tends to yield overhead for composite types, because every new piece of data (e.g., a new bucket) takes precious index space (it's a new unique series), whereas samples are significantly more compressible (rarely change, time-oriented).
- Functionality: It poses limitations on the shape and flexibility of the data you store (unless we'd go into some JSON-encoded labels, which have massive downsides).
- Transactionality: Primitive pieces of composite types (separate counters) are processed independently. While we did a lot of work to ensure write isolation and transactionality for scrapes, transactionality completely breaks apart when data is received or sent via remote write or OTLP protocols, or to distributed long-term storage Prometheus solutions. For example, a `foo` histogram might have been partially sent, with its `foo_bucket{le="1.1e+23"} 17` counter series delayed or accidentally dropped, which risks triggering false positive alerts or no alerts, depending on the situation.
- Reliability: Consumers of the TSDB data have to essentially guess the type semantics. There's nothing stopping users from writing a `foo_bucket` gauge or a `foo_total` histogram.
A Glimpse of Native Storage for Composite Types
The classic model was challenged by the introduction of native histograms. The TSDB was extended to store composite histogram samples in addition to floats. We tend to call these native histograms, because the TSDB can now "natively" store a full histogram (with sparse, exponential buckets) as an atomic, composite sample.
At that point, the common wisdom was to stop there. The special advanced histogram that's generally meant to replace the "classic" histograms uses a composite sample, while the rest of the metrics use the classic model. Making other composite types consistent with the new native model felt extremely disruptive to users, with too much work and risks. A common counter-argument was that users will eventually migrate their classic histograms naturally, and summaries are also less useful, given the more powerful bucketing and lower cost of native histograms.
Unfortunately, the migration to native histograms was known to take time, given the slight PromQL change required to use them and the new bucketing and client changes needed (applications have to define new metrics or edit existing ones). There will also be old software, used for a long time, that is never migrated. Eventually, this leaves Prometheus with no chance of deprecating classic histograms, and all the software solutions will be required to support the classic model, likely for decades.
However, native histograms did push TSDB and the ecosystem into that new composite sample pattern. Some of those changes could be easily adapted to all composite types. Native histograms also gave us a glimpse of the many benefits of that native support. It was tempting to ask ourselves: would it be possible to add native counterparts of the existing composite metrics to replace them, ideally transparently?
Organically, in 2024, for transactionality and efficiency, we introduced the native histogram custom buckets (NHCB) concept, which essentially allows storing classic histograms with explicit buckets natively, reusing native histogram composite sample data structures.
NHCB has proven to be at least 30% more efficient than the classic representation, while offering functional parity with classic histograms. However, two practical challenges emerged that slowed down the adoption:
- Combining is often not feasible. Expanding, that is, converting from NHCB to a classic histogram, is relatively trivial, but combining, that is, turning a classic histogram into NHCB, often is not. We don't want to wait for client ecosystem adoption, and, being mindful of legacy, hard-to-change software, we envisioned NHCB being converted (so combined) on scrape from the classic representation. That has proven to be somewhat expensive on scrape. Additionally, combination logic is practically impossible when receiving "pushes" (e.g., remote write with classic histograms), as you could end up having different parts of the same histogram sample (e.g., buckets and count) sent via different remote write shards or sequential messages. This combination challenge is also why OpenTelemetry Collector users see extra overhead in `prometheusreceiver`, as the OpenTelemetry model strictly follows the composite sample model.
- Consumption is slightly different, especially in the PromQL query syntax. Our initial decision was to surface NHCB histograms using a native-histogram-like PromQL syntax. For example, take the following classic histogram:

  ```
  foo_bucket{le="0.0"} 0
  # ...
  foo_bucket{le="1.1e+23"} 17
  foo_bucket{le="+Inf"} 17
  foo_count 17
  foo_sum 324789.3
  ```

  When we convert this to NHCB, you can no longer use `foo_bucket` as your metric name selector. Since the NHCB is now stored as a `foo` metric, you need to use:

  ```
  histogram_quantile(0.9, sum(foo{job="a"}))
  # Old syntax: histogram_quantile(0.9, sum(foo_bucket{job="a"}) by (le))
  ```

  This also has another effect: it violates our "what you see is what you query" rule for the text formats, at least until OpenMetrics 2. On top of that, similar problems occur on other Prometheus outputs (federation, remote read, and remote write).
NOTE: Fun fact: the Prometheus client data model (SDKs) and the `PrometheusProto` scrape protocol already use the composite sample model!
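If you want to experiment with NHCB today, the classic-to-NHCB conversion can be done at scrape time. Below is a minimal sketch of a scrape configuration, assuming a recent Prometheus version with native histogram ingestion enabled; the target name is hypothetical, and the exact option name may evolve while the feature is experimental, so check the current configuration docs:

```yaml
# Requires starting Prometheus with native histogram ingestion enabled,
# e.g. --enable-feature=native-histograms.
scrape_configs:
  - job_name: "demo"
    # Convert scraped classic histograms into native histograms with
    # custom (explicit) buckets (NHCB).
    convert_classic_histograms_to_nhcb: true
    static_configs:
      - targets: ["localhost:9090"]
```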
Transparent Native Representation
Let's get straight to the point. Organically, the Prometheus community seems to align with the following two ideas:
- We want to eventually move to a fully composite sample model on the storage layer, given all the benefits.
- Users need to be able to switch (e.g., on scrape) from the classic to the native form in storage without breaking the consumption layer. Essentially, to help with non-trivial migration pains (finding who uses what, double-writing, synchronizing), to avoid tricky dual-mode protocol changes, and to deprecate the classic model ASAP for the sustainability of the Prometheus codebase, we need to ensure consumption (e.g., PromQL queries) can eventually migrate independently of the storage layer.
Let's go through evidence of this direction, which also represents efforts you can contribute to or adopt early!
- We are discussing "native" summary and stateset types to fully eliminate the classic model for all composite types. Feel free to join and help with that work!
- We are working on OpenMetrics 2.0 to consolidate and improve the pull protocol scene and apply the new learnings. One of the core changes will be the move to composite values in text, which makes the text format trivial to parse for storages that support composite types natively. This solves the combining challenge. Note that, by default, for now, all composite types will still be "expanded" to the classic format on scrape, so there's no breaking change for users. Feel free to join our WG to help or give feedback.
- The Prometheus receive and export protocols have been updated. Remote Write 2.0 allows transporting histograms in the "native" form instead of the classic representation (the classic one is still supported). In future versions (e.g., 2.1), we could easily follow a similar pattern and add native summaries and statesets. Contributions are welcome to make Remote Write 2.0 stable!
- We are experimenting with consumption compatibility modes that translate composite types stored as composite samples back to the classic representation. This is not trivial; there are edge cases, but it might be more feasible (and needed!) than we initially anticipated. See:
  - PromQL compatibility mode for NHCB
  - Expanding on remote write
  - We also need to consider adding expansion for federation, remote read, and other APIs.

  In PromQL it might work as follows, for an NHCB that used to be a classic histogram:

  ```
  # New syntax gives our "foo" NHCB:
  histogram_quantile(0.9, sum(foo{job="a"}))
  # Old syntax still works, expanding "foo" NHCB to classic representation:
  histogram_quantile(0.9, sum(foo_bucket{job="a"}) by (le))
  ```

  Alternatives, like a special label or annotations, are also discussed.
When implemented, it should be possible to fully switch different parts of your metric collection pipeline to native form transparently.
Summary
Moving Prometheus to a native composite type world is not easy and will take time, especially around coding, testing, and optimizing. Notably, it switches the performance characteristics of the metric load from uniform, predictable sample sizes to sample sizes that depend on the type. Another challenge is code architecture: maintaining different sample types has already proven to be very verbose (we need unions, Go!).
However, recent work revealed a very clean and possible path that yields clear benefits around functionality, transactionality, reliability, and efficiency in the relatively near future, which is pretty exciting!
If you have any questions around these changes, feel free to:

- DM me on Slack.
- Visit the `#prometheus-dev` Slack channel and share your questions.
- Comment on related issues, create PRs, and review PRs (the most impactful work!).
The Prometheus community is also at KubeCon EU 2026 in Amsterdam! Make sure to:
- Visit our Prometheus KubeCon booth.
- Attend our contributing workshop on Wednesday, March 25, 2026, at 16:00.
- Attend our Prometheus V3 One Year In: OpenMetrics 2.0 and More! session on Thursday, March 26, 13:45.
I'm hoping we can share stories of other important, orthogonal shifts we see in the community in future posts. No promises (and help welcome!), but there's a lot to cover, such as (random order, not a full list):
- Our native start timestamp feature journey that cleanly unblocks native delta temporality without "hacks" like reusing gauges, a separate layer of metric types, or label annotations, e.g., `__temporality__`.
- Optional schematization of Prometheus metrics that attempts to solve a ton of stability problems with metric naming and shape, building on top of OpenTelemetry semconv.
- Our metadata storage journey that attempts to improve the OpenTelemetry Entities and resource attributes storage and consumption experience.
- Our journey to organize and extend Prometheus scrape pull protocols with the recent ownership move of OpenMetrics.
- An incredible TSDB Parquet effort, coming from the three LTS project groups (Cortex, Thanos, Mimir) working together, attempting to improve high-cardinality cases.
- Fun experiments with PromQL extensions, like PromQL with pipes and variables and some new SQL transpilation ideas.
- Governance changes.
See you in open-source!

