3 Comments
Robert Rider

Really enjoyed this, especially the framing of compression as a tradeoff between CPU and I/O rather than just “making things smaller.” That perspective feels a lot closer to how systems actually behave under load.

The layering you described also stood out to me. The idea of doing some form of semantic or structural encoding before entropy compression feels like where a lot of the untapped potential is. Most systems seem to stop at the entropy stage, even though reshaping the data beforehand could shrink the problem space significantly.

I’ve been experimenting with something similar on a smaller scale, trying to push compression slightly closer to structure and meaning instead of just pattern detection. Not replacing something like gzip or zstd, but giving them a cleaner input to work with.
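
Roughly what I mean, as a toy sketch. The regular 15-second timestamps and zlib standing in for the entropy stage are just illustrative assumptions:

```python
import zlib  # stand-in for gzip/zstd as the entropy stage

# Hypothetical input: a regularly scraped time series (15s interval assumed).
timestamps = [1_700_000_000 + 15 * i for i in range(10_000)]
raw = b"".join(t.to_bytes(8, "little") for t in timestamps)

# Structural step: store the first value plus deltas. For a regular series
# the deltas are tiny and repetitive, so the entropy coder mostly sees zeros.
deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
pre = b"".join(d.to_bytes(8, "little") for d in deltas)

print("raw:", len(zlib.compress(raw)), "bytes")
print("delta-encoded:", len(zlib.compress(pre)), "bytes")
```

Same entropy coder, much smaller output, purely because the input was reshaped first.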

One thing I’m curious about is where you see the boundary between semantic encoding’s gains and its complexity. At what point do the gains get outweighed by the cost of generalizing across different types of data?

It feels like that edge is where most of the interesting work is right now.

almog gavra

Thanks for the kind words, Robert! It means a lot to me.

I wouldn't classify semantic encoding as something that's relatively infrequent, though. Nearly all special-purpose databases apply some type of semantic encoding (e.g. Prometheus uses Gorilla encoding for its time-series metrics, Parquet is famous for RLE, and most data formats with schemas use varints).
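
For anyone who hasn't run into it, varints are a nice illustration of how cheap that semantic layer can be. A minimal LEB128-style sketch (not any particular library's implementation):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int with 7 payload bits per byte; the high
    bit marks "more bytes follow". Small values, the common case once
    you know the field's schema, fit in a single byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

assert encode_varint(1) == b"\x01"        # 1 byte instead of a fixed 4 or 8
assert encode_varint(300) == b"\xac\x02"  # 300 = 0b10_0101100 -> two bytes
```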

IMO the split as it stands works pretty nicely: general-purpose compressors work without knowledge of the data schema, and semantic encoding stays the responsibility of the data system that actually has access to the schema.

I might not be fully aware of the SOTA work here though, so I'd love to hear what you had in mind!

Robert Rider

Very fair point, especially around how much semantic encoding is already a part of systems like Prometheus and formats like Apache Parquet. When you frame it that way, it’s less that semantic encoding is rare and more that it’s just handled upstream, which feels like the real distinction.

What I’ve been exploring with GN sits in that gap between layers: not trying to replace schema-aware systems, but pushing some of that semantic awareness closer to the compression boundary itself. Almost like letting the compressor participate in structure discovery instead of assuming the structure has already been fully expressed.

So instead of a strict split between schema-aware encoding and schema-agnostic compression, I’m curious whether there’s a middle ground where the compressor can adapt to repeating semantic patterns on the fly. Especially in messier domains like chat logs or agent traces, where the structure is clearly there but not formally defined.
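
One concrete reference point here, not claiming it's the answer: zstd's dictionary training, which learns recurring byte patterns from sample records and shares them across small messages. A rough sketch with a made-up chat-log corpus:

```python
import zstandard as zstd

# Hypothetical corpus: small, structurally similar chat-log records.
samples = [
    f'{{"role": "user", "msg": "message number {i}", "ts": {1_700_000_000 + i}}}'.encode()
    for i in range(5000)
]

# Learn a shared dictionary that captures the recurring structure.
dict_data = zstd.train_dictionary(4096, samples)

cctx = zstd.ZstdCompressor(dict_data=dict_data)
record = b'{"role": "user", "msg": "does this actually help?", "ts": 1700005000}'
print(len(record), "->", len(cctx.compress(record)), "bytes with dictionary")
```

It's still schema-agnostic at the API level, but the dictionary effectively encodes the structure it discovered in the samples, which is the middle ground I keep circling.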

I do agree the current separation exists for good reasons. I think the question I’m circling is how far you can push that boundary before the added complexity outweighs the gains, and whether there are domains where that tradeoff actually flips.