<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[bits & pages]]></title><description><![CDATA[a publication for engineers curious about online database systems]]></description><link>https://www.bitsxpages.com</link><image><url>https://substackcdn.com/image/fetch/$s_!Oj32!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd170acc3-cf6c-4948-86e5-b8df318fff03_512x512.png</url><title>bits &amp; pages</title><link>https://www.bitsxpages.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 15 May 2026 00:50:09 GMT</lastBuildDate><atom:link href="https://www.bitsxpages.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Almog]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[almoggavra@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[almoggavra@substack.com]]></itunes:email><itunes:name><![CDATA[almog gavra]]></itunes:name></itunes:owner><itunes:author><![CDATA[almog gavra]]></itunes:author><googleplay:owner><![CDATA[almoggavra@substack.com]]></googleplay:owner><googleplay:email><![CDATA[almoggavra@substack.com]]></googleplay:email><googleplay:author><![CDATA[almog gavra]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[how metrics are stored and queried]]></title><description><![CDATA[the data structures powering your metrics dashboard]]></description><link>https://www.bitsxpages.com/p/how-metrics-are-stored-and-queried</link><guid isPermaLink="false">https://www.bitsxpages.com/p/how-metrics-are-stored-and-queried</guid><dc:creator><![CDATA[almog gavra]]></dc:creator><pubDate>Thu, 23 Apr 2026 
15:28:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2972d822-c97c-46db-bbda-7ec1a643a3d6_1200x628.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve recently been spending my time hacking on an <a href="https://github.com/opendata-oss/opendata">MIT-licensed, object store native timeseries database</a>, and I found the domain fascinating enough to warrant a post about how metrics are stored and retrieved in databases.</p><p>As always, let&#8217;s get straight to the point.</p><h1>what is a time series?</h1><p>If you&#8217;ve spent any time as a software engineer, you have likely looked at a graph that plots what&#8217;s going on over time on the computers that run your software.</p><p>Something that is so intuitive to analyze as a graph isn&#8217;t quite as intuitive to define formally. As a database developer, I think of a timeseries graph in three components:</p><ol><li><p><strong>Samples</strong> are the raw measurements. These are typically polled from a machine that reports them constantly by reading some metric value (stored in something like an atomic <code>f64</code>). Since each sample is taken at a specific time, it is stored with the timestamp at which it was measured.</p></li><li><p><strong>Series</strong> are vectors of samples that are associated with a particular context, which most systems refer to as labels (e.g. <code>{instance='a2', region='us-west-2'}</code>). </p></li><li><p><strong>Queries</strong> distill series into useful information. Typically there is too much information to directly examine series. 
If you have thousands of nodes, you might want to look at the average CPU utilization to know whether your cores are well utilized.</p></li></ol><p>This post covers how we store series and samples in order to serve queries efficiently.</p><h1>samples and indexes</h1><p>To make sense of how the data is laid out on disk, it pays to understand the anatomy of the queries that use that data:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PL-D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PL-D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png 424w, https://substackcdn.com/image/fetch/$s_!PL-D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png 848w, https://substackcdn.com/image/fetch/$s_!PL-D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png 1272w, https://substackcdn.com/image/fetch/$s_!PL-D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PL-D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png" width="609" 
height="283.8484848484849" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:646,&quot;width&quot;:1386,&quot;resizeWidth&quot;:609,&quot;bytes&quot;:70883,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/195176356?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PL-D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png 424w, https://substackcdn.com/image/fetch/$s_!PL-D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png 848w, https://substackcdn.com/image/fetch/$s_!PL-D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png 1272w, https://substackcdn.com/image/fetch/$s_!PL-D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70851229-28b2-43cb-82ec-d08a69643c49_1386x646.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The selector determines which series should be fetched and included in the query while the aggregation defines the logic for combining together samples across different series. </p><p>Think about it like a 2D matrix where each series is a row and each timestamp is a column. The cells hold the measurements at each timestamp. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F_st!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F_st!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png 424w, https://substackcdn.com/image/fetch/$s_!F_st!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png 848w, https://substackcdn.com/image/fetch/$s_!F_st!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png 1272w, https://substackcdn.com/image/fetch/$s_!F_st!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F_st!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png" width="1456" height="932" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:932,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113462,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/195176356?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F_st!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png 424w, https://substackcdn.com/image/fetch/$s_!F_st!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png 848w, https://substackcdn.com/image/fetch/$s_!F_st!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png 1272w, https://substackcdn.com/image/fetch/$s_!F_st!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29bc0a4-2770-4189-9465-d66da962d4ed_1544x988.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Efficiently computing the query requires figuring out how to prune series to only the series matched by the selector, extracting the samples for the time range needed, grouping them by the aggregation and finally computing the results.</p><p>This is done in three steps:</p><ol><li><p>An inverted index is used to narrow down which series need to be fetched</p></li><li><p>A forward index determines which series need to be aggregated together</p></li><li><p>The raw samples are retrieved and aggregated</p></li></ol><h2>the inverted index</h2><p>The query starts here. The inverted index narrows down potentially hundreds of thousands of series into just the ones required to serve the query. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qaox!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qaox!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png 424w, https://substackcdn.com/image/fetch/$s_!Qaox!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png 848w, https://substackcdn.com/image/fetch/$s_!Qaox!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png 1272w, https://substackcdn.com/image/fetch/$s_!Qaox!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qaox!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png" width="528" height="261.4201954397394" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:608,&quot;width&quot;:1228,&quot;resizeWidth&quot;:528,&quot;bytes&quot;:78623,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/195176356?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Qaox!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png 424w, https://substackcdn.com/image/fetch/$s_!Qaox!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png 848w, https://substackcdn.com/image/fetch/$s_!Qaox!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png 1272w, https://substackcdn.com/image/fetch/$s_!Qaox!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F523c8c6f-b484-404b-810d-8c5f52f14159_1228x608.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For example, the query <code>http_requests_count{instance='a2'}</code> needs to fetch all the series for <code>http_requests_count</code> that also have the label <code>instance='a2'</code> and can ignore the rest.</p><p>A naive approach is to store series sorted by name. That would allow us to binary search the entire database to isolate only series with the name <code>http_requests_count</code> and then scan those to find which belong to <code>instance='a2'</code>.</p><p>The problem is that there may be thousands of instances, each collecting hundreds of metrics, so the scan has to walk every series with that name just to find the few that match <code>instance='a2'</code>. 
</p><p>The solution to this problem is borrowed from search systems: the inverted index.</p><p>For every label that exists (a name and value pair such as <code>instance='a2'</code>), the database stores a list of the series that have that label. This list is called a &#8220;posting list&#8221;.</p><p>Finding the required series is then reduced to intersecting posting lists.</p><p>To make this intersection efficient, each series is assigned a unique ID and the posting lists are sorted by this ID. That makes the algorithm simple: you load the posting lists for the labels you care about and advance pointers into the lists until they all align on a series ID. Each ID they align on is a series you need to fetch (in the diagram above, series 1 and 5). You continue until you&#8217;ve exhausted one of the posting lists, and you can ignore posting lists that aren&#8217;t part of your query.</p><h3>roaring bitmaps</h3><p>There are some encoding tricks you can use to store and leverage the inverted index efficiently. The state-of-the-art mechanism for this particular use case is called a <a href="https://arxiv.org/pdf/1603.06549">roaring bitmap</a>.</p><p>The insight with this data structure is that posting lists tend to be either sparse or dense, and the sparse ones tend to be clustered. For timeseries, you can imagine that the posting list for <code>container_id=123</code> (a sparse list) holds significantly fewer series than <code>region=us-west-2</code> (a dense list). If you assign series IDs monotonically, though, it&#8217;s likely that all the series generated by <code>container_id=123</code> have IDs within a small range of the total series ID domain (they all get registered on startup). </p><p>To leverage this observation, roaring bitmaps split the representation of posting lists into &#8220;dense blocks&#8221; and &#8220;sparse blocks&#8221;. Since they only encode 32-bit integer values, they index these blocks by the &#8220;high 16&#8221; bits of each value, and each block then encodes which low 16-bit values are present. 
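</p><p>Two pieces of the machinery described so far are small enough to sketch: the pointer-advancing intersection of two sorted posting lists, and the split of a 32-bit series ID into its high and low 16 bits. This is a minimal illustration, not how any particular database implements it; the function names <code>intersect_sorted</code> and <code>split_id</code> and the sample posting lists are assumptions made up for the example.</p>

```python
def intersect_sorted(a, b):
    """Intersect two posting lists that are sorted by series ID."""
    out = []
    i = j = 0
    # Advance whichever pointer sits on the smaller ID; when both
    # pointers agree, that series matches both labels.
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def split_id(series_id):
    """Split a 32-bit ID into (high 16 bits, low 16 bits)."""
    return series_id >> 16, series_id & 0xFFFF


# Hypothetical posting lists for two labels (illustrative values only).
name_postings = [1, 2, 5, 7]
instance_postings = [1, 3, 5, 8]
matches = intersect_sorted(name_postings, instance_postings)

# 196613 is 3 * 65536 + 5, so it lives in block 3 at position 5.
block_key, low_bits = split_id(196613)
```

<p>The loop stops as soon as either list is exhausted, which is why a short posting list keeps the intersection cheap even when the other list is long.</p>

<p>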
Dense blocks encode bitsets to represent which values are present while sparse blocks just encode the array of values naively.</p><p>In the example below, we have representations for both sparse and dense blocks. The sparse block encodes the values <code>[5, 1000]</code> while the dense block encodes every number between <code>196608</code> and <code>250000</code>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6mRi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6mRi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png 424w, https://substackcdn.com/image/fetch/$s_!6mRi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png 848w, https://substackcdn.com/image/fetch/$s_!6mRi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png 1272w, https://substackcdn.com/image/fetch/$s_!6mRi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6mRi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png" width="582" height="362.95054945054943" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:908,&quot;width&quot;:1456,&quot;resizeWidth&quot;:582,&quot;bytes&quot;:150454,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/195176356?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6mRi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png 424w, https://substackcdn.com/image/fetch/$s_!6mRi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png 848w, https://substackcdn.com/image/fetch/$s_!6mRi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png 1272w, https://substackcdn.com/image/fetch/$s_!6mRi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eb48b49-1a0e-43a0-8fe3-95d7bf7821dd_1524x950.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If we were to encode the sparse block with a full bitset, it would take 8KB, since we need one bit for each of the 65,536 values whose high 16 bits are <code>0x0000</code>. But since there are only two numbers, it&#8217;s more efficient to just encode 4 bytes (the lower 16 bits of each of 5 and 1000). </p><p>On the other hand, if we were to encode the dense block as a simple array, it would require about 104KB (53,393 entries at 2 bytes each). It&#8217;s more efficient to store a single 8KB bitset.</p><p>Roaring bitmaps are not only efficient in storage space. 
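</p><p>The storage comparison above can be checked with a few lines of arithmetic. This is a rough sketch that assumes each array entry costs exactly 2 bytes and ignores any per-block header; the names <code>array_bytes</code> and <code>cheaper_encoding</code> are hypothetical helpers, not part of any roaring bitmap library.</p>

```python
BLOCK_VALUES = 1 << 16            # values that share the same high 16 bits
BITSET_BYTES = BLOCK_VALUES // 8  # one bit per possible value


def array_bytes(count):
    """Bytes needed to store `count` low-16-bit values as a plain array."""
    return 2 * count


def cheaper_encoding(count):
    """Pick whichever encoding is no larger for a block with `count` values."""
    return "array" if array_bytes(count) <= BITSET_BYTES else "bitset"


sparse_count = 2                   # the values 5 and 1000
dense_count = 250000 - 196608 + 1  # every value from 196608 to 250000

sparse_choice = cheaper_encoding(sparse_count)
dense_choice = cheaper_encoding(dense_count)
```

<p>Under these assumptions the break-even point is 4096 values per block: at or below it the array is no larger than the bitset, and above it the bitset wins.</p>

<p>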
Because the lists are sorted, and because bitmap intersections/unions can be done with SIMD instructions, set operations on roaring bitmaps are incredibly efficient.</p><h2>the forward index</h2><p>If you followed the inverted index section closely, you&#8217;ll notice that we made a small decision with big implications: the inverted index only stores series IDs, not the series definitions themselves. </p><p>This means that while the inverted index helped us find which series IDs we need to serve a query, it doesn&#8217;t help us at all in determining which series should be grouped together for an aggregation. For that, we need some mapping of series IDs to their definitions (the label set). 
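</p><p>A minimal sketch of such a mapping, and of how an aggregation might use it to group the series returned by the inverted index. The dictionary layout, the label values, and the names <code>forward_index</code> and <code>group_series</code> are illustrative assumptions, not the on-disk format of any real database.</p>

```python
from collections import defaultdict

# Hypothetical mapping: series ID -> label set.
forward_index = {
    1: {"name": "http_requests_count", "instance": "a2", "region": "us-west-2"},
    5: {"name": "http_requests_count", "instance": "a2", "region": "us-east-1"},
}


def group_series(series_ids, by):
    """Group matched series IDs by the value of one label."""
    groups = defaultdict(list)
    for series_id in series_ids:
        labels = forward_index[series_id]
        # Series that lack the label fall into the empty-string group.
        groups[labels.get(by, "")].append(series_id)
    return dict(groups)


by_region = group_series([1, 5], by="region")
by_instance = group_series([1, 5], by="instance")
```

<p>Here series 1 and 5 land in different groups when grouped by <code>region</code> and in the same group when grouped by <code>instance</code>. 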
</p><p>This is called the forward index:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5BW0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5BW0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png 424w, https://substackcdn.com/image/fetch/$s_!5BW0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png 848w, https://substackcdn.com/image/fetch/$s_!5BW0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png 1272w, https://substackcdn.com/image/fetch/$s_!5BW0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5BW0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png" width="459" height="307.47115384615387" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1248,&quot;resizeWidth&quot;:459,&quot;bytes&quot;:74303,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/195176356?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5BW0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png 424w, https://substackcdn.com/image/fetch/$s_!5BW0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png 848w, https://substackcdn.com/image/fetch/$s_!5BW0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png 1272w, https://substackcdn.com/image/fetch/$s_!5BW0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcabc3562-c1f0-4079-b743-8c9b9926b3d9_1248x836.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Since a query needs to load the forward index for every series it matches, retrieving this data can get expensive.</p><p>Most data that is stored on disk, including indexes like the forward index, is stored in something similar to an SST (see <a href="https://www.bitsxpages.com/p/sorted-string-tables-sst-from-first">this previous post</a> for more detail), which means you choose a key that you sort by. 
This is called the &#8220;clustering&#8221; key, and how you choose it makes a big difference to your storage performance.</p><p>To make this concrete, here are two alternative layout options for storing series and forward indexes:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FiO_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FiO_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png 424w, https://substackcdn.com/image/fetch/$s_!FiO_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png 848w, https://substackcdn.com/image/fetch/$s_!FiO_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png 1272w, https://substackcdn.com/image/fetch/$s_!FiO_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FiO_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png" width="1456" height="500" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156921,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/195176356?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FiO_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png 424w, https://substackcdn.com/image/fetch/$s_!FiO_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png 848w, https://substackcdn.com/image/fetch/$s_!FiO_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png 1272w, https://substackcdn.com/image/fetch/$s_!FiO_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d1f7c98-d6be-41b2-a524-ea261a32afa8_1880x646.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the layout on the left, we simply sort by <code>series_id</code>. This makes it easy to go from the inverted index to the samples and the forward index, because each posting in the inverted index is just a series ID.</p><p>The problem is that series IDs encode nothing about the metric name. If a query targets only a specific metric (which is the case for almost every query), it may have to fetch series that are scattered across the disk, which can dramatically increase the <a href="https://www.bitsxpages.com/p/understanding-lsm-trees-via-read">read amplification</a> of a query.</p><p>The layout on the right sorts the forward index and samples based on a composite key of <code>(metric_name, series_id)</code> instead. 
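</p><p>To see what that does to the on-disk order, here&#8217;s a toy sketch in Python. The series IDs, the metric names and the 3-entries-per-block size are all made up for illustration, and a real SST sorts encoded byte keys rather than Python tuples:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># Hypothetical catalog: series_id to metric name (IDs assigned in arrival order).
series = {1: "cpu", 2: "mem", 3: "cpu", 4: "mem", 5: "cpu", 6: "mem"}

# Layout 1: cluster by series_id alone.
by_id = sorted(series)

# Layout 2: cluster by the composite key (metric_name, series_id).
by_metric = sorted(series, key=lambda sid: (series[sid], sid))


def block_of(layout, sid, block_size=3):
    # Index of the fixed-size block that holds this series in a given layout.
    return layout.index(sid) // block_size


def blocks_touched(layout, metric):
    # Distinct blocks that a query for a single metric has to read.
    wanted = [sid for sid in series if series[sid] == metric]
    return sorted({block_of(layout, sid) for sid in wanted})


print(by_id, blocks_touched(by_id, "cpu"))          # [1, 2, 3, 4, 5, 6] [0, 1]
print(by_metric, blocks_touched(by_metric, "cpu"))  # [1, 3, 5, 2, 4, 6] [0]
</code></pre></div><p>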
This means that consecutive series IDs are not necessarily clustered, but blocks on disk are dedicated to a single metric. If you need to fetch many series for the same metric, this can be significantly more efficient.</p><h2>storing samples</h2><p>Now that we have a way to know which series we need and which ones group together in an aggregation, we just need a mechanism for storing and fetching the raw data.</p><p>Timeseries data is both abundant and repetitive, a combination that makes it particularly amenable to compression (something we discussed in depth in <a href="https://www.bitsxpages.com/p/the-mathematics-of-compression-in">a previous post</a>).</p><p>A data point in a series, called a sample, has two parts: the timestamp at which the measurement was taken and the value of the measurement.</p><h3>timestamps</h3><p>If you&#8217;ve ever configured Prometheus, you know that you set your scraper to scrape at some regular interval (perhaps every 15s). That means it shouldn&#8217;t surprise you when the timestamps in any particular series look a lot like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">&#9484;series A timestamps&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;     &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;&#9474;
&#9474;&#9474;00:00:00&#9474;&#9474;00:00:15&#9474;&#9474;00:00:31&#9474;&#9474;00:00:45&#9474; ... &#9474;01:00:00&#9474;&#9474;
&#9474;&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;     &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;&#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></pre></div><p>You may occasionally see gaps that are wider (e.g. if a pod is overloaded), but in stead state you expect the timestamps to always be ~15s apart.</p><p>The typical encoding of a timestamp is an <code>i64</code> long, which takes up 8 bytes. If you were to encode 100k series every 15s that would take you 192 MB/hour <em>just to store timestamps</em>. In production, however, most systems are able to store this with just ~3 MB/hour (a 60x reduction). How?</p><p>Some smart people figured out that if you store only the first timestamp, and then encode just the difference between them, you can store these points using much less data:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">&#9484;&#9472;series A timestamps (deltas)&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;     &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488; &#9474;
&#9474; &#9474;00:00:00&#9474;&#9474;00:00:15&#9474;&#9474;00:00:31&#9474;&#9474;00:00:45&#9474; ... &#9474;01:00:00&#9474; &#9474;
&#9474; &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;     &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496; &#9474;
&#9474;     &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488; &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488; &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;                    &#9474;
&#9474;     &#9474;  15s   &#9474; &#9474;  16s   &#9474; &#9474;  14s   &#9474; ...                &#9474;
&#9474;     &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496; &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496; &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;                    &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></pre></div><p>Using 2 bytes per delta, you can encode up to gaps of 18 hours (65k seconds) and reduce your overall data by 4x. </p><p>Because the timestamps are reported at 15s intervals, It turns out we can do even better. The <em>difference of the difference</em> between timestamps is even smaller, in fact it&#8217;s almost always pretty close to 0.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">&#9484;series A timestamps (delta of deltas)&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474;&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;     &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488; &#9474;
&#9474;&#9474;00:00:00&#9474;&#9474;00:00:15&#9474;&#9474;00:00:31&#9474;&#9474;00:00:45&#9474; ... &#9474;01:00:00&#9474; &#9474;
&#9474;&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;     &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496; &#9474;
&#9474;    &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488; &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488; &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;                    &#9474;
&#9474;    &#9474;  15s   &#9474; &#9474;  16s   &#9474; &#9474;  14s   &#9474; ...                &#9474;
&#9474;    &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496; &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496; &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;                    &#9474;
&#9474;          &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488; &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;                         &#9474;
&#9474;          &#9474;   1s   &#9474; &#9474;  -2s   &#9474;                         &#9474;
&#9474;          &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496; &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;                         &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></pre></div><p>There are some <a href="https://www.vldb.org/pvldb/vol8/p1816-teller.pdf">fancy ways</a> that you can encode deltas of deltas so that you can store timestamps in approximately 1 bit per sample, a 64x reduction from the original 8 byte format.</p><h3>measurements</h3><p>If you got the gist of the last section, you may have already deduced that measurements might follow a similar pattern. Metrics typically change slowly over time (unless something special is happening, which by definition doesn&#8217;t happen often).</p><p>This sound pretty similar to what we saw when we were discussing timestamps, but the problem is that a measurement is a floating point value. 
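</p><p>To make the contrast concrete, here&#8217;s a small Python sketch. It replays the delta-of-delta trick on the timestamps from the diagrams above (as seconds since the start of the series), then tries plain subtraction on two nearby floats. The helper names are mine and not taken from any particular database:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import struct


def deltas(values):
    # Difference between each value and the one before it.
    return [b - a for a, b in zip(values, values[1:])]


def f64_bits(x):
    # The 64-bit IEEE 754 pattern of a float, as a string of 0s and 1s.
    (as_int,) = struct.unpack("!Q", struct.pack("!d", x))
    return format(as_int, "064b")


# Timestamps 00:00:00, 00:00:15, 00:00:31, 00:00:45 in seconds.
timestamps = [0, 15, 31, 45]
print(deltas(timestamps))          # [15, 16, 14]
print(deltas(deltas(timestamps)))  # [1, -2]

# The same subtraction on floats yields a small number with a busy bit pattern.
diff = 0.0002 - 0.0001
print(diff, f64_bits(diff))
</code></pre></div><p>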
A small difference between two floating point numbers can still require many bits.</p><p>Take, for example, the 64-bit IEEE 754 (the standard) representation of <code>0.0001</code>. Such a simple decimal surprisingly looks like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">0011111100011010001101101110001011101011000111000100001100101101</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
  &#9474;     Component      &#9474;      Bits      &#9474;       Value       &#9474;
  &#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
  &#9474; Sign (1 bit)       &#9474; 0              &#9474; +                 &#9474;
  &#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
  &#9474; Exponent (11 bits) &#9474; 01111110001    &#9474; 1009 &#8722; 1023 = &#8722;14 &#9474;
  &#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
  &#9474; Mantissa (52 bits) &#9474; 1010...0101101 &#9474; fractional part   &#9474;
  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></pre></div><p>The sign bit is simple (positive or negative). The mantissa (with an implicit leading 1) is multiplied by <code>2 ^ exponent</code> to recreate the value. In this case we get:</p><pre><code><code>(&#8722;1)^0 &#215; 1.1010...0101101&#8322; &#215; 2^&#8722;14 &#8776; 0.00010000000000000000479</code></code></pre><p>That&#8217;s not exactly <code>0.0001</code>, but it&#8217;s as close as we can get with a 64-bit float. The interesting thing is that <code>0.0001</code> and <code>0.0002</code> look&#8230; very similar (in fact, the mantissa is the same):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">  &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
  &#9474; Value  &#9474; Sign &#9474;  Exponent   &#9474;  Mantissa                  &#9474;
  &#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
  &#9474; 0.0001 &#9474; 0    &#9474; 01111110001 &#9474; 101000110110111...00101101 &#9474;
  &#9500;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9532;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9508;
  &#9474; 0.0002 &#9474; 0    &#9474; 01111110010 &#9474; 101000110110111...00101101 &#9474;
  &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9524;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;</code></pre></div><p>If you XOR these two values you get a value with a lot of 0s:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">0000000000110000000000000000000000000000000000000000000000000000</code></pre></div><p>This isn&#8217;t a coincidence: many similar floating point numbers have lots of 0s when you XOR them. This means you can store a three-part number: first the number of leading 0s (10), then the length of the meaningful section (2), then the meaningful bits themselves (<code>0b11</code>, only 2 bits). This takes you down from the original 64 bits to just 15 bits in this case (in the scheme from the paper linked earlier, that is 2 control bits, 5 bits for the leading-zero count, 6 bits for the length, and the 2 meaningful bits).</p><p>Together these two concepts (delta of delta encoding for timestamps and XOR encoding for measurement values) allow us to store samples extremely efficiently, averaging only ~11 bits per sample, down from the 16 bytes (128 bits) of the naive format.</p><p>Going back to our original 100k series reported at a 15s interval, you go from 384 MB/h to 33 MB/h, a nearly 12x improvement.</p><h2>series churn</h2><p>Let&#8217;s revisit the 2D matrix model of timeseries where the horizontal dimension is time and the vertical dimension is the series. 
If you look at a few hours of production data you would probably see something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KdVO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KdVO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png 424w, https://substackcdn.com/image/fetch/$s_!KdVO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png 848w, https://substackcdn.com/image/fetch/$s_!KdVO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png 1272w, https://substackcdn.com/image/fetch/$s_!KdVO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KdVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png" width="512" height="287.452566096423" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:722,&quot;width&quot;:1286,&quot;resizeWidth&quot;:512,&quot;bytes&quot;:81054,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/195176356?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KdVO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png 424w, https://substackcdn.com/image/fetch/$s_!KdVO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png 848w, https://substackcdn.com/image/fetch/$s_!KdVO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png 1272w, https://substackcdn.com/image/fetch/$s_!KdVO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb71744c5-3042-467f-b881-71f8481e6d9d_1286x722.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The &#9608;&#9608; represents series with data during that time and a <code>.</code> represents series with no data at that time.</p><p>Series S06-S12 are likely tagged with something like <code>container_id</code> that is not static, so whenever you deploy, a new series is generated with nearly identical labels except for a change in <code>container_id</code>.</p><p>This is called &#8220;series churn&#8221;, and it&#8217;s the biggest problem in timeseries databases. If we were to ignore this pattern in observability systems, we&#8217;d end up with some pretty nasty side effects.</p><p>For illustration, imagine there were only 2 labels on each of these series: <code>name</code> (the metric name) and <code>container_id</code>. 
If we indexed this in a single inverted index you&#8217;d see something like this:</p><pre><code><code>name=metric_1: [1, 3, 5, 7, 9, 11]
name=metric_2: [2, 4, 6, 8, 10, 12]
container_id=a: [1, 2, 3]
container_id=b: [4, 5]
container_id=c: [6, 7]
container_id=d: [8, 9, 10]
container_id=e: [11, 12]</code></code></pre><p>If my query targeted the time range <code>16:00-24:00</code> and asked for <code>metric_1</code>, I&#8217;d need to examine all of <code>[1, 3, 5, 7, 9, 11]</code> because the metric name is the only filtering condition I have. I would then discover that series <code>[7, 9]</code> don&#8217;t have any data associated with the window I&#8217;m querying, but only <em>after</em> I&#8217;ve fetched the samples for those series.</p><p>The solution to this is bucketing: storing a full trio of samples, inverted index and forward index for each &#8220;bucket&#8221; of time. The series IDs are also scoped to a bucket, which allows you to quickly find the series ID for incoming data without looking through the entire history of series.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dy0i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dy0i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png 424w, https://substackcdn.com/image/fetch/$s_!dy0i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png 848w, https://substackcdn.com/image/fetch/$s_!dy0i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dy0i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dy0i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png" width="1456" height="705" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:705,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:195780,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/195176356?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dy0i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png 424w, https://substackcdn.com/image/fetch/$s_!dy0i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png 848w, 
https://substackcdn.com/image/fetch/$s_!dy0i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png 1272w, https://substackcdn.com/image/fetch/$s_!dy0i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53483bcc-5a7a-49ef-93f9-f3f2690fda8d_2118x1026.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The downside of bucketing is that it generates many more distinct series IDs, even though some of them refer to the same 
&#8220;logical&#8221; series. In the example above, A1, B1, and C1 all have the same labels.</p><p>This is a fundamental tradeoff between <a href="https://www.bitsxpages.com/p/understanding-lsm-trees-via-read">read, write and space amplification</a> in timeseries storage. You can manage this by merging series across time blocks for older buckets, which is exactly how many timeseries databases handle the problem.</p><h1>fsync()</h1><p>That&#8217;s it for today! If you want to read more about how we encode and store these data types on disk, we have a detailed <a href="https://github.com/opendata-oss/opendata/blob/main/timeseries/rfcs/0001-tsdb-storage.md">RFC on GitHub</a>.</p><p>We&#8217;ve covered the three main types of data that a timeseries system stores: samples, inverted indices, and forward indices, as well as how they are represented on disk. Coming soon, I&#8217;ll write a full post on the query execution pipelines for timeseries engines, so make sure to subscribe to get it delivered straight to you.</p>]]></content:encoded></item><item><title><![CDATA[the broken economics of databases]]></title><description><![CDATA[Why database companies charge so much, earn so little, and keep making things complicated]]></description><link>https://www.bitsxpages.com/p/the-broken-economics-of-databases</link><guid isPermaLink="false">https://www.bitsxpages.com/p/the-broken-economics-of-databases</guid><dc:creator><![CDATA[almog gavra]]></dc:creator><pubDate>Mon, 23 Mar 2026 18:29:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5bcc4396-42c3-4b06-aa50-bf1386bd8d99_1200x628.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The database industry has a strange economic problem, and this post attempts to lay out the foundation for why that&#8217;s the case. 
It will cover some concepts from economics 101 and hopefully answer the question &#8220;why is it so expensive to pay someone to run my database?&#8221;.</p><p>The short version is that database vendors have incentives that push them away from simplicity and toward defensibility, resulting in databases that seemingly get <em>worse</em> over time.</p><blockquote><p><strong>Disclaimer</strong>: I&#8217;m working on an OSS database project called <a href="https://github.com/opendata-oss/opendata">OpenData</a> that was partly motivated by the dynamics in this post, but this is the only time I&#8217;ll mention it.</p></blockquote><h2>making sense of earnings reports</h2><p>There is some puzzling data available in historical earnings reports of public database companies that were acquired or shut down. <a href="https://www.cloudera.com/about/news-and-blogs/press-releases/2021-03-10-cloudera-reports-fourth-quarter-and-fiscal-year-2021-financial-results.html">Cloudera</a>, <a href="https://fintool.com/app/research/companies/MRDB/earnings/Q1%202024">MariaDB</a>, and <a href="https://www.sec.gov/ix?doc=/Archives/edgar/data/0001699838/000169983825000013/cflt-20250930.htm">Confluent</a> (where I was a relatively early employee) all reported <a href="https://en.wikipedia.org/wiki/Gross_margin">gross margins</a> between 78% and 91%. In the traditional economic sense, gross margins this high are pretty rare and you&#8217;d expect these companies to be wildly profitable. It is clear that none of the referenced companies were.</p><p>For a database company, <strong>gross margin</strong> is the share of revenue left over after subtracting the cost to &#8220;run the database&#8221;. The accounting here is pretty loose, but there are in essence two factors that play into the cost: the hardware cost and the operations cost.</p><p>The hardware cost is the more obviously measurable of the two: it is what the company pays the cloud providers for the hardware. 
The operations cost is estimated based on the number of engineering hours spent on-call and on &#8220;other operational tasks&#8221;.</p><p>When I was at Confluent, this was the result of a survey that accounting sent out to the engineering teams, asking us how much time we spent on operations. As you can imagine, that&#8217;s an imprecise estimate, but investors care a lot about it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yYAs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yYAs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png 424w, https://substackcdn.com/image/fetch/$s_!yYAs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png 848w, https://substackcdn.com/image/fetch/$s_!yYAs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png 1272w, https://substackcdn.com/image/fetch/$s_!yYAs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yYAs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png" width="1306" height="456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:456,&quot;width&quot;:1306,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86040,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/191531898?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yYAs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png 424w, https://substackcdn.com/image/fetch/$s_!yYAs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png 848w, https://substackcdn.com/image/fetch/$s_!yYAs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png 1272w, https://substackcdn.com/image/fetch/$s_!yYAs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe133fb91-80ac-4e34-bbdd-eb9c8fd65f11_1306x456.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>What explains this half of the story is <strong>operating margin</strong>, which indicates how much of the revenue went toward R&amp;D, Sales &amp; Marketing and Administration. 
After accounting for that, these companies generated very modest margin (from 2-8%, and deep in the red if you include all <a href="https://en.wikipedia.org/wiki/Generally_Accepted_Accounting_Principles_(United_States)">GAAP</a> expenses).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VegE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VegE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png 424w, https://substackcdn.com/image/fetch/$s_!VegE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png 848w, https://substackcdn.com/image/fetch/$s_!VegE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png 1272w, https://substackcdn.com/image/fetch/$s_!VegE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VegE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png" width="1306" height="494" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:494,&quot;width&quot;:1306,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93968,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/191531898?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VegE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png 424w, https://substackcdn.com/image/fetch/$s_!VegE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png 848w, https://substackcdn.com/image/fetch/$s_!VegE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png 1272w, https://substackcdn.com/image/fetch/$s_!VegE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea8f5235-456c-4854-ad81-1895d2e673dd_1306x494.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These gross and operating margins are interesting on their own, but together they paint a concerning economic picture: database companies depend on a delicate balance, and even a small sustained dip in gross margins can be fatal.</p><h2>monopoly and competition</h2><blockquote><p>If you aren&#8217;t too interested in economic theory or already have a solid foundation, you can skip this section.</p></blockquote><p>In economic theory there&#8217;s a spectrum between monopoly markets and perfect competition. At one end, a monopolist has meaningful control over price, and therefore demand. 
At the other, firms sell effectively identical products into a market with many competitors, so any one firm has almost no pricing power.</p><p>Economists represent this by relating <strong>marginal revenue</strong> <code>MR</code> (revenue from selling one more unit) and <strong>marginal cost</strong> <code>MC</code> (cost to serve one more unit) with respect to demand. Firms, whether monopolies or not, want to increase output so long as selling one more unit brings in more revenue than it costs to serve. Output stops growing at the point where the two are equal, an equilibrium denoted <code>MR = MC</code>.</p><p>We can explore the intuition using a database example. I love to pick on MongoDB because they&#8217;re generous enough to have detailed public pricing. An M50 instance is priced at $1,460/mo. They chose this price intentionally.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!if6R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!if6R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png 424w, https://substackcdn.com/image/fetch/$s_!if6R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png 848w, https://substackcdn.com/image/fetch/$s_!if6R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png 1272w, 
https://substackcdn.com/image/fetch/$s_!if6R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!if6R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png" width="1456" height="531" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:531,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93629,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/191531898?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!if6R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png 424w, https://substackcdn.com/image/fetch/$s_!if6R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png 848w, https://substackcdn.com/image/fetch/$s_!if6R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png 
1272w, https://substackcdn.com/image/fetch/$s_!if6R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22464a8a-ea9e-4c3c-84a8-a19738a79c8d_1980x722.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Considering the alternatives helps build intuition for the <code>MR = MC</code> condition:</p><ol><li><p>At $2,000/month, they would make more on each sale but likely lose enough customers that total profit falls.</p></li><li><p>At $1,000/month, they would win more customers but probably not enough to make up for the lower revenue per 
cluster.</p></li><li><p>Somewhere in between is the &#8220;best&#8221; point that preserves high margins but remains low enough to avoid losing too much demand.</p></li></ol><p>That math looks very different depending on the shape of the demand curve.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1zui!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1zui!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png 424w, https://substackcdn.com/image/fetch/$s_!1zui!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png 848w, https://substackcdn.com/image/fetch/$s_!1zui!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!1zui!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1zui!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png" width="1400" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56941,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/191531898?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1zui!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png 424w, https://substackcdn.com/image/fetch/$s_!1zui!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png 848w, https://substackcdn.com/image/fetch/$s_!1zui!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!1zui!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b120b0-4d6b-49a1-8aaa-4e22e0cd4123_1400x600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The graphs above show the extreme ends of the spectrum.</p><p>In a perfect monopoly, the firm&#8217;s own pricing decision meaningfully affects the quantity it can sell. If MongoDB were a true monopoly, customers might dislike the M50 price tag, but they would not have a close substitute to jump to immediately (their only alternative is not having MongoDB).</p><p>In perfect competition, the opposite is true. If there were a thousand vendors selling truly identical managed M50 clusters, MongoDB could not raise prices much without losing customers to cheaper alternatives, and it could not keep prices below cost forever either.</p><p>The reality is that MongoDB lives somewhere in the middle. 
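</p><p>To make the two extremes concrete, here is a minimal sketch in Python. It assumes a linear demand curve and a constant marginal cost, and every number in it is made up for illustration (none of them come from MongoDB or any other vendor). The names <code>quantity_at</code>, <code>profit_at</code>, <code>monopoly_price</code> and <code>competitive_price</code> are hypothetical helpers, not part of any library:</p><pre><code>
```python
# Hypothetical linear demand curve: quantity sold falls as price rises.
# All numbers are made up for illustration; none come from a real vendor.
DEMAND_INTERCEPT = 2500.0  # price at which nobody buys (assumed)
DEMAND_SLOPE = 2.0         # price drop needed to sell one more cluster (assumed)
MARGINAL_COST = 500.0      # cost to serve one more cluster (assumed)


def quantity_at(price):
    """Clusters sold per month at a given price under P = a - b*Q."""
    return max(0.0, (DEMAND_INTERCEPT - price) / DEMAND_SLOPE)


def profit_at(price):
    """Monthly profit above marginal cost, ignoring fixed costs."""
    return (price - MARGINAL_COST) * quantity_at(price)


def monopoly_price():
    """Price where MR = MC: a - 2*b*Q = c, so Q = (a - c) / (2*b)."""
    quantity = (DEMAND_INTERCEPT - MARGINAL_COST) / (2 * DEMAND_SLOPE)
    return DEMAND_INTERCEPT - DEMAND_SLOPE * quantity


def competitive_price():
    """With identical rivals, price is pushed down to marginal cost."""
    return MARGINAL_COST


# Compare the two alternative prices with the monopoly and competitive ones.
for price in (1000.0, monopoly_price(), 2000.0, competitive_price()):
    print(f"price={price:7.2f}  clusters={quantity_at(price):6.1f}  profit={profit_at(price):10.2f}")
```
</code></pre><p>Under these assumptions the <code>MR = MC</code> price lands at $1,500 and both the $1,000 and $2,000 alternatives earn less, while the perfectly competitive price leaves no profit above marginal cost at all.</p><p>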
The question for us is: what characteristics does the database market as a whole have in the long term?</p><p>The answer depends on how &#8220;different&#8221; databases really are. Postgres and Kafka are quite different systems, and you&#8217;re usually <a href="https://www.morling.dev/blog/you-dont-need-kafka-just-use-postgres-considered-harmful/">ill-advised</a> to use them interchangeably. Confluent&#8217;s Kafka and Amazon&#8217;s MSK, on the other hand, are close substitutes. In economic models, even a single close substitute can quickly erode margins and force vendors to compete on price (the <a href="https://en.wikipedia.org/wiki/Bertrand_paradox_(economics)">Bertrand paradox</a>).</p><p>That puts significant pressure on vendors to avoid becoming interchangeable, or else risk losing their high gross margins.</p><h2>why vendors need high margins</h2><p>There&#8217;s an interesting parallel to draw between two seemingly unrelated markets: databases and pharmaceuticals. For both, the <em>initial</em> R&amp;D investment to bring a product to market is substantial. 
Developing a serious database or a successful drug takes years of specialized work, but selling one more unit of either isn&#8217;t particularly expensive.</p><p>Economists describe this as high <strong>fixed costs</strong> (the cost to develop the initial product) and low <strong>variable costs</strong> (the cost to manufacture a single instance of the product). This creates a basic requirement where the initial investment has to be earned back somehow. Whether that happens through high margins, high volume, or both depends on the demand curve we discussed earlier.</p><p>There are a few ways that this can play out, and this is where the comparison with pharma becomes useful. In both industries, the company whose R&amp;D investment results in something genuinely innovative can charge what&#8217;s known as <a href="https://en.wikipedia.org/wiki/Schumpeterian_rent">Schumpeterian rent</a>: <em>temporary</em> profits that exist because competitors have not yet caught up. During this period the original company is effectively a monopoly. You can charge as much as you want for insulin or Kafka if you&#8217;re the only one selling it (for now).</p><p>The core tension here is that <em>innovation is not the same thing as sustained value capture</em>. Innovation creates temporary pricing power, which unlocks the temporary high gross margin we see in earnings reports, but maintaining that requires constant innovation and sustained R&amp;D costs.</p><p>In situations like this, who captures value from an innovation is shaped by two factors: <a href="https://www.crb.gov/proceedings/2006-3/riaa-ex-o-101-dp.pdf">appropriability and complementary assets</a>. Appropriability describes how easily an innovation can be imitated, and complementary assets are whatever else is required to monetize it.</p><p>The reason the pharmaceutical industry is a useful foil is that it tends to have a stronger answer to both questions. 
Government-issued patents create a much more formal appropriability regime, and the path from inventing a drug to monetizing it is usually more directly controlled by the innovator (they directly own or have strong relationships with manufacturing).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z_Vx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z_Vx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png 424w, https://substackcdn.com/image/fetch/$s_!z_Vx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png 848w, https://substackcdn.com/image/fetch/$s_!z_Vx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png 1272w, https://substackcdn.com/image/fetch/$s_!z_Vx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z_Vx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png" width="1456" height="655" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:655,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82785,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/191531898?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z_Vx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png 424w, https://substackcdn.com/image/fetch/$s_!z_Vx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png 848w, https://substackcdn.com/image/fetch/$s_!z_Vx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png 1272w, https://substackcdn.com/image/fetch/$s_!z_Vx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc51f350b-a7b2-4753-ba17-399c8e8380a6_1600x720.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Databases do not get the luxury of government protection, so vendors have to rely on weaker defenses like closed source code, restrictive licenses, and operational know-how. None are reliable barriers in the long term, and the half-life of these protections is <a href="https://malus.sh/index.html">quickly degrading</a> with advancements in AI. That means the Schumpeterian rents from database innovation usually decay more gradually and less predictably than they do in pharma.</p><p>The bigger problem is in the complementary assets. In pharma, the innovator usually has a much more direct claim on the assets required to commercialize the product. 
In databases, the crucial complement is the infrastructure and hardware that run the system, which are typically owned by cloud providers like AWS rather than by the inventor of the database.</p><p><strong>This asymmetry is the core of the industry&#8217;s strange economics.</strong></p><p>Bringing it back to the demand curves and what options are available to database vendors, it starts to make sense why they must sell databases at high margins to survive. They need to recover high fixed costs in a market where the innovation diffuses rapidly, the protection is weak, and the owners of the key complements (the hyperscalers) are structurally better positioned to win a price war.</p><p>This leaves them three options: continuously generate innovations that renew the Schumpeterian advantage, differentiate in some other manner, or fundamentally change their business model. But continuously generating innovation requires additional investment, which has a significant fixed cost and starts the cycle all over again.</p><h2>why databases become worse over time</h2><p>This conclusion, on its own, is not especially troubling. It is good that vendors have to keep innovating if they want to earn outsized profits. The problem is that if these innovations run dry, the company ends up resorting to tactics that hurt the product to avoid a price war with AWS.</p><p>The unpalatable truth is that the useful feature set of a database stabilizes relatively quickly. Database developers tend to release a wealth of new features to stay ahead of the competition, but on average these features don&#8217;t get adopted (and at worst cause issues).</p><p>When feature development is no longer enough, the focus shifts from pure product improvement to avoiding commoditization.</p><p>There are three characteristics of a commodity: fungibility, price transparency, and low switching costs. With a mature database, fungibility tends to rise first. 
If most buyers view multiple managed offerings of the same database as &#8220;close enough,&#8221; then the remaining levers of differentiation shift elsewhere.</p><p>Some of those levers can be healthy, and even benefit developers. Sometimes a vendor creates high switching costs by providing such tight, high-quality integration across a bundle of services that untangling it is simply not feasible without significant investment. All database companies want this kind of defensibility, but few achieve it. Databricks is one of the clearer success cases.</p><p>Some levers are toxic. Vendors can, and do, manufacture price opacity by hiding behind complicated pricing schemes. Snowflake pricing, for example, is so complex that <a href="https://www.gartner.com/en/documents/6668634">Gartner released a report</a> (available for purchase) that helps you understand and negotiate your bill.</p><p>But there is another, more subtle lever: operational complexity.</p><p>To see why, we can revisit the MongoDB Atlas example. If you recall, an M50 instance is priced at $1,460/mo, but the equivalent cluster of three 1-year reserved <code>r5.xlarge</code> instances in AWS us-east-1 costs only $360/mo (including 160 GB EBS volumes). 
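</p><p>The arithmetic behind this comparison can be sketched in a few lines of Python. The two dollar figures are the ones quoted above; the function and variable names are illustrative, and the sketch assumes any saving is passed straight through to the list price.</p>
<pre>
```python
# Monthly USD figures quoted in this post; all names here are illustrative.
ATLAS_M50_PRICE = 1460.0  # managed MongoDB Atlas M50
HARDWARE_COST = 360.0     # three reserved r5.xlarge instances plus EBS


def operational_premium(price: float, hardware: float) -> float:
    """Portion of the managed price that is not raw hardware."""
    return price - hardware


def price_drop(component_cost: float, price: float, reduction: float) -> float:
    """Fractional price drop if one cost component shrinks by `reduction`
    and the saving is passed straight through to the customer."""
    return component_cost * reduction / price


premium = operational_premium(ATLAS_M50_PRICE, HARDWARE_COST)
print(f"premium: ${premium:.0f}/mo, {premium / HARDWARE_COST:.1f}x hardware")
print(f"50% cheaper hardware:   {price_drop(HARDWARE_COST, ATLAS_M50_PRICE, 0.5):.1%}")
print(f"50% cheaper operations: {price_drop(premium, ATLAS_M50_PRICE, 0.5):.1%}")
```
</pre>
<p>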
This means that MongoDB thinks it&#8217;s reasonable for you to value the operational work they do for you at $1,100/mo, or 3x the hardware cost.</p><p>This also means a 50% reduction in hardware cost will reduce your price by ~12% while a 50% reduction in operational costs reduces your price by ~38%, assuming savings are passed on directly to you.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mdq0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mdq0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png 424w, https://substackcdn.com/image/fetch/$s_!Mdq0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png 848w, https://substackcdn.com/image/fetch/$s_!Mdq0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png 1272w, https://substackcdn.com/image/fetch/$s_!Mdq0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mdq0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png" width="1456" height="515" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:515,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:123004,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/191531898?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mdq0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png 424w, https://substackcdn.com/image/fetch/$s_!Mdq0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png 848w, https://substackcdn.com/image/fetch/$s_!Mdq0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png 1272w, https://substackcdn.com/image/fetch/$s_!Mdq0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a2a029d-a4f2-40bc-b258-e6fb3bcf0474_1504x532.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>That creates an uncomfortable incentive structure. MongoDB has strong reasons to make operations easier inside Atlas, but weaker reasons to make the database easy for everyone else to operate. Operational expertise is, after all, a large part of what the managed service is selling. If the system becomes dramatically easier to run everywhere, you would be less willing to pay such high margins for their management layer.</p><p>In other words, they are incentivized to do the opposite. The tougher it is to operate a database and the more specialized the expertise required, the harder it is for competitors (including the self-hosting alternative) to offer an equivalent experience.</p><p>This usually does not happen overtly. 
I&#8217;ve been a database engineer my whole career, and I can guarantee you that we are not sitting around trying to make systems harder to run for no reason. The dynamic is subtler. We are rewarded for adding functionality that helps the product stand out, and each additional layer of functionality tends to make the system more operationally complex. Over time, the database becomes more capable, but also more &#8220;interesting&#8221; to debug at 3 AM when the pager goes off.</p><p>To summarize, the market incentivizes defensibility over simplification. This, in turn, results in databases getting &#8220;worse&#8221; over time.</p><h2>the end</h2><p>You may not have sympathy for multibillion-dollar database companies, and that&#8217;s fine, but the inconvenient truth is that most database development (and <a href="https://ieeexplore.ieee.org/document/6759009">OSS at large</a>) is funded by enterprises, which means that your user experience is directly dictated by corporate profits. We&#8217;ve explored why this may not be in your favor.</p><p>You likely also need no proof that the hyperscalers are not a solution to this. Their incentives are not particularly aligned with users either, and they are often happy to capture the value created by upstream database companies and open source communities without taking on the same level of product risk. They are excellent at scaling proven systems, but much less naturally motivated to make the kind of focused, opinionated bets that produce new categories in the first place.</p><p>Since I&#8217;ve already inundated you with economic theory and this post is quite long, I&#8217;ll end it with a glimmer of optimism: I think object storage is already making a different economic model possible. It is generic enough to absorb a large share of the heavy infrastructure burden, while still being simple enough to serve as a foundation for building narrower, more focused systems on top. 
That changes the fixed-cost equation in a way that could matter a lot, opening the door to lower-margin business models. You can already see systems like WarpStream, Turbopuffer, and Quickwit taking advantage of it.</p><p>But that&#8217;s a subject for another post!</p><p>Next up, I&#8217;ll get back to some deep engineering content and let the economics settle for a bit. As always, thanks for your support and for reading bits &amp; pages.</p>]]></content:encoded></item><item><title><![CDATA[the mathematics of compression in database systems]]></title><description><![CDATA[why compression is (almost) always worthwhile]]></description><link>https://www.bitsxpages.com/p/the-mathematics-of-compression-in</link><guid isPermaLink="false">https://www.bitsxpages.com/p/the-mathematics-of-compression-in</guid><dc:creator><![CDATA[almog gavra]]></dc:creator><pubDate>Mon, 09 Feb 2026 19:28:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/12db8f95-24f6-43fc-93e4-f249e4706586_1200x628.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post is a slight departure from our usual architecture discussions, but I promise it <em>is</em> worth understanding if you care about your database performance.</p><p>I started thinking about compression when implementing <a href="http://github.com/slatedb/slatedb/pull/1228">prefix compression for SlateDB</a>. When I ran benchmarks, I noticed that performance seemed &#8220;worse&#8221; despite improved compression ratios.</p><p>This got me thinking more deeply about why databases use compression, and what the right framework is for reasoning about whether that tradeoff is worth it. 
This post is the result of that exploration, and I hope it helps you understand why, when, and how data systems use techniques to reduce data size.</p><h1>understanding &#8220;compression math&#8221;</h1><p>The first thing to understand about why compression matters is that there are effectively three resources that databases can be bottlenecked on<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>:</p><ol><li><p>I/O Bandwidth: bytes/sec from disk, network, or even DRAM into your CPU caches</p></li><li><p>CPU: cycles available on your machine for processing</p></li><li><p>Memory: transient storage that is orders of magnitude faster to access than disk/network</p></li></ol><p>Compression gives you a knob to tune between these resources. More specifically, <em>compression directly trades I/O for CPU</em> because it takes CPU cycles to compress and decompress data but moving compressed data requires less I/O bandwidth.</p><p>Mathematically, we can reason about this tradeoff by computing the cost of compression and weighing it against the resources it saves. 
Let&#8217;s assume that for a given workload and compression codec we have the following properties<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\nS &amp;: \\text{uncompressed size of your dataset} \\\\\nS_c &amp;: \\text{compressed size of your dataset} \\\\\nR = S / S_c &amp;: \\text{compression ratio (higher is better)} \\\\\nT_c \\text{ and } T_d &amp;: \\text{compression / decompression time} \\\\\nB_x &amp;: \\text{physical bandwidth of I/O path } x \\text{ (bytes/sec)}\n\\end{aligned}&quot;,&quot;id&quot;:&quot;OVLOFCPYQP&quot;}" data-component-name="LatexBlockToDOM"></div><h2>is compression about latency?</h2><p>Naively, we can compute a hypothetical breakeven I/O bandwidth &#946; that determines whether compression will speed up a point-to-point transfer.</p><p>An uncompressed data transfer takes <code>S/B_x</code> seconds. Compressing then transferring the same dataset takes <code>T_c + S_c/B_x + T_d</code> seconds<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, since you first need to compress the data before sending it and decompress it after receiving it. 
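</p><p>To see the two expressions side by side, here is a minimal Python sketch. The sizes and timings are made-up round numbers chosen for illustration (they are not measurements from this post), and the function names are hypothetical.</p>
<pre>
```python
def raw_transfer_seconds(size: float, bandwidth: float) -> float:
    """Time to move `size` uncompressed: S / B_x."""
    return size / bandwidth


def compressed_transfer_seconds(
    compressed_size: float, bandwidth: float, t_compress: float, t_decompress: float
) -> float:
    """Compress, send the smaller payload, decompress: T_c + S_c / B_x + T_d."""
    return t_compress + compressed_size / bandwidth + t_decompress


# Made-up numbers for illustration only (MiB and seconds), not measurements.
S, S_C = 1000.0, 250.0
T_C, T_D = 2.0, 1.0

for bandwidth in (100.0, 500.0):  # MiB/s
    raw = raw_transfer_seconds(S, bandwidth)
    packed = compressed_transfer_seconds(S_C, bandwidth, T_C, T_D)
    print(f"B_x = {bandwidth:.0f} MiB/s: raw {raw:.1f} s, compressed {packed:.1f} s")
```
</pre>
<p>With these placeholder values, compression wins on the slow pipe (5.5 s against 10 s) and loses on the fast one (3.5 s against 2 s), because the fixed codec time stops being paid back by the smaller payload.</p><p>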
In other words, compression improves the overall transfer latency when:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;T_c + T_d < \\frac{S-S_c}{B_x}&quot;,&quot;id&quot;:&quot;KDYXRWJFZE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Solving for the bandwidth at which the two sides are equal gives our breakeven bandwidth; compression only speeds up the transfer when the pipe is slower than &#946;:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\beta = \\frac{S-S_c}{T_c+T_d}&quot;,&quot;id&quot;:&quot;FSJSCFXNHT&quot;}" data-component-name="LatexBlockToDOM"></div><p>To make this concrete I timed the (de)compression of a sample <a href="https://github.com/agavra/compression-golf/blob/main/data.json.gz">dataset</a> taken from a <a href="https://github.com/agavra/compression-golf">compression challenge</a> with a baseline of 191 MiB using both <code>zstd</code> level 4 (fast, lighter compression) and level 10 (slower, more aggressive compression):</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qwcU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qwcU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png 424w, https://substackcdn.com/image/fetch/$s_!qwcU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png 848w, 
https://substackcdn.com/image/fetch/$s_!qwcU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png 1272w, https://substackcdn.com/image/fetch/$s_!qwcU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qwcU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png" width="492" height="133.46206896551723" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:236,&quot;width&quot;:870,&quot;resizeWidth&quot;:492,&quot;bytes&quot;:27582,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/187426616?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qwcU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png 424w, 
https://substackcdn.com/image/fetch/$s_!qwcU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png 848w, https://substackcdn.com/image/fetch/$s_!qwcU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png 1272w, https://substackcdn.com/image/fetch/$s_!qwcU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759520b5-b4f5-46c1-ae5d-e9fb10ede522_870x236.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Using the breakeven formula from above, the breakeven points for compressing the data are 328 MiB/s (2.76 Gbps) and 76 MiB/s (0.64 Gbps) for levels 4 and 10 respectively.</p><p>An <code>r5d.xlarge</code> instance has a baseline <code>B_network</code> of 1.25 Gbps, meaning that if latency were all you cared about, <code>zstd</code> level 4 would be worthwhile (1.25 Gbps is below its 2.76 Gbps breakeven) but level 10 would not (1.25 Gbps is above its 0.64 Gbps breakeven): the improved compression is overshadowed by the time it takes to achieve it.</p><p>The very same instance, however, has a speedy NVMe disk with <code>B_NVMe</code> in the tens of Gbps. This makes it seem as though, if you aren&#8217;t transferring data over the network, you shouldn&#8217;t compress with either of these options! And if your only bottleneck were the latency of a one-shot transfer from disk into the CPU, that conclusion would be right&#8230; but the breakeven analysis assumes the I/O pipe is otherwise idle, which is rarely true in a database.</p><h2>saturation and the CPU &#8596; I/O exchange rate</h2><p>In practice, databases are not doing one-shot transfers of a blob from disk to RAM and calling it a day. 
Instead, they need to constantly read significant amounts of data to handle incoming requests, compress old files, fill caches, etc. Put another way, <em>databases will saturate their I/O bandwidth long before they saturate CPU</em>. Sustained throughput matters far more than latency.</p><p>Instead of thinking about a breakeven point for latency, we can think about compression as a way to exchange bandwidth for CPU. I&#8217;ll refer to the effective throughput after accounting for compression as logical bandwidth:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;B_\\text{logical} = B_x \\cdot R&quot;,&quot;id&quot;:&quot;EHEUXNSBUY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Getting back to our example from before, <code>zstd</code> level 4 gives us a ~9.4x compression ratio. This means that our <em>logical</em> bandwidth on a 1.25 Gbps network link is 1.25 &#215; 9.4 = 11.75 Gbps. In other words, we can sustain a throughput of 11.75 Gbps if we compress/decompress, compared to 1.25 Gbps if we send the raw data over.</p><p>Of course, this logical bandwidth only materializes if you can afford the CPU cost. 
A useful way to think about codecs is in terms of (de)compression throughput <code>&#952;</code>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\theta_{c,d}=\\frac{S}{T_{c,d}}&quot;,&quot;id&quot;:&quot;YRZIDXNZLW&quot;}" data-component-name="LatexBlockToDOM"></div><p>To fully saturate a pipe of bandwidth <code>B_x</code> you need:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\theta_{c,d} \\geqslant B_x&quot;,&quot;id&quot;:&quot;KBRGFHXZSA&quot;}" data-component-name="LatexBlockToDOM"></div><p>In other words you need to be able to compress data <em>faster</em> than you can send it over the network.</p><p>This math starts to influence decisions when you look at the increasing amount of time it takes for rather trivial improvements at higher levels of compression (for the same 191MiB dataset from before):</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bwv8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7866c839-ea2b-479b-8afd-038c86861afd_926x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bwv8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7866c839-ea2b-479b-8afd-038c86861afd_926x340.png 424w, https://substackcdn.com/image/fetch/$s_!Bwv8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7866c839-ea2b-479b-8afd-038c86861afd_926x340.png 848w, https://substackcdn.com/image/fetch/$s_!Bwv8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7866c839-ea2b-479b-8afd-038c86861afd_926x340.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Bwv8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7866c839-ea2b-479b-8afd-038c86861afd_926x340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bwv8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7866c839-ea2b-479b-8afd-038c86861afd_926x340.png" width="500" height="183.585313174946" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7866c839-ea2b-479b-8afd-038c86861afd_926x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:340,&quot;width&quot;:926,&quot;resizeWidth&quot;:500,&quot;bytes&quot;:38962,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/187426616?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7866c839-ea2b-479b-8afd-038c86861afd_926x340.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bwv8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7866c839-ea2b-479b-8afd-038c86861afd_926x340.png 424w, https://substackcdn.com/image/fetch/$s_!Bwv8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7866c839-ea2b-479b-8afd-038c86861afd_926x340.png 848w, 
https://substackcdn.com/image/fetch/$s_!Bwv8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7866c839-ea2b-479b-8afd-038c86861afd_926x340.png 1272w, https://substackcdn.com/image/fetch/$s_!Bwv8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7866c839-ea2b-479b-8afd-038c86861afd_926x340.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The higher compression levels show obvious diminishing returns, which is why databases don&#8217;t simply choose the most aggressive compression level. But we can go one step further and put a dollar value on the exchange rate between CPU and logical bandwidth.</p><h2>putting a dollar value on compression</h2><p>The final principle to consider when evaluating whether (and how aggressively) to compress applies to managed services that mark up their cost per byte, putting an artificial premium on bytes that makes compression even more valuable. The typical examples are the egress and ingress fees charged by cloud providers.</p><p>This ends up looking quite similar to the exchange rate between CPU and I/O, except that you are trading CPU for dollars. Let&#8217;s consider the <code>r5d</code> class of instances on AWS; you&#8217;ll have the following rates:</p><ul><li><p>$0.072 / vCPU hour (in <code>us-west-2</code>)</p></li><li><p>$0.02 / GB transferred across zones (ingress + egress)</p></li></ul><p>Using these numbers, sustaining 1 Gbps of <em>physical</em> bandwidth (0.125 GB/s, or 450 GB/h) costs $9/hour in transfer fees alone. To sustain 1 Gbps of <em>logical</em> (uncompressed) throughput, you only need to transfer 1/R Gbps on the wire, so your bandwidth cost drops to $9/R per hour. But to get this you need enough CPU to compress at that rate. 
Since &#952; measures how many Gbps a single vCPU can compress, you need 1/&#952; vCPUs to keep up. At $0.072 per vCPU-hour, your total cost to sustain 1 logical Gbps is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Cost/h} = \\frac{\\$9}{R} + \\frac{\\$0.072}{\\theta} &quot;,&quot;id&quot;:&quot;XBHMRIRSIE&quot;}" data-component-name="LatexBlockToDOM"></div><p>In other words, the higher your compression ratio, the less you pay in transfer costs, but achieving that higher ratio while sustaining the same throughput costs more CPU. Plugging in the observed values from the table in the previous section:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Apvd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79616032-f474-4665-8f08-d60febc24ab1_828x236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Apvd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79616032-f474-4665-8f08-d60febc24ab1_828x236.png 424w, https://substackcdn.com/image/fetch/$s_!Apvd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79616032-f474-4665-8f08-d60febc24ab1_828x236.png 848w, https://substackcdn.com/image/fetch/$s_!Apvd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79616032-f474-4665-8f08-d60febc24ab1_828x236.png 1272w, https://substackcdn.com/image/fetch/$s_!Apvd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79616032-f474-4665-8f08-d60febc24ab1_828x236.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Apvd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79616032-f474-4665-8f08-d60febc24ab1_828x236.png" width="510" height="145.36231884057972" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79616032-f474-4665-8f08-d60febc24ab1_828x236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:236,&quot;width&quot;:828,&quot;resizeWidth&quot;:510,&quot;bytes&quot;:33031,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/187426616?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79616032-f474-4665-8f08-d60febc24ab1_828x236.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Apvd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79616032-f474-4665-8f08-d60febc24ab1_828x236.png 424w, https://substackcdn.com/image/fetch/$s_!Apvd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79616032-f474-4665-8f08-d60febc24ab1_828x236.png 848w, https://substackcdn.com/image/fetch/$s_!Apvd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79616032-f474-4665-8f08-d60febc24ab1_828x236.png 1272w, https://substackcdn.com/image/fetch/$s_!Apvd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79616032-f474-4665-8f08-d60febc24ab1_828x236.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>For these particular compression ratios, choosing <code>zstd</code> level 8 is your best bet (if what you care about is reducing cost).</p><h2>sometimes it is about latency</h2><p>Bringing it back to where we started, it&#8217;s often worth being conservative about compression levels. While the cost equation above suggests that something like <code>zstd</code> level 8 minimizes dollars per unit throughput, databases don&#8217;t live entirely in the world of steady-state bandwidth.</p><p>In many systems, decompression happens directly on the query critical path. In those cases, every additional millisecond of CPU time can show up in tail latencies, and the most cost-efficient compression level may not be the latency-optimal one.</p><p>One useful property (see the data in the latency section) is that decompression time is nearly constant across levels, so the CPU cost is heavily skewed toward the write path. This means you can invest in aggressive compression at write time, where latency is often more tolerable, without penalizing read-side tail latencies. The practical implication is that for write-once, read-many workloads, the compression level decision is primarily a write-path throughput question.</p><p>As with all things databases, these are tradeoffs you need to make with your workload and preferences in mind.</p>
<h1>consider getting hands on</h1><p>Before I get into the techniques used: I find the best way to learn about performance optimizations (including data size) is to play around with them yourself. To that end, I created a challenge to compress a sample dataset of one million rows as much as possible. The naive dataset serialized as JSON is 210MB, <code>zstd</code> at level 22 gets down to 11MB, and after a few hours of bit twiddling I got it down to 9.4MB.</p><p>I&#8217;m maintaining a leaderboard, so you can try your hand at the challenge yourself: <a href="https://github.com/agavra/compression-golf">https://github.com/agavra/compression-golf</a></p><p>In addition, there are some interesting submissions that detail their approaches to compressing the dataset (typically as comments at the top of the file or in the PR they submit). You may want to start by reading this post and then use what you learn to make sense of the leading submissions.</p><h1>techniques for data reduction</h1><p>There are two main techniques for reducing data size:</p><ol><li><p><strong>Semantic Encoding:</strong> these techniques understand the data patterns and change the binary representation to be more compact. Examples include prefix, varint and run-length encoding (discussed more later).</p></li><li><p><strong>Entropy Compressors:</strong> these have no knowledge of the data semantics, but they can analyze the bytes to find redundancy and then use various techniques to reduce the entropy. 
Examples include zstd, snappy and gzip.</p></li></ol><p>These two techniques are often applied together. First you use semantic encoding to reduce your logical data size, and then you apply entropy compression to compress the result further.</p><p>Entropy compression alone gets you pretty far. On the compression golf dataset, <code>zstd -9</code> took 210MB down to 18MB (level 22 reached 11MB). Adding standard semantic encodings got us to 7MB, with the most aggressive techniques getting down to 5MB.</p><p>The more important property is the CPU cost. Since semantic encodings are mostly O(n) passes with simple ops, they are cheap to run. If you apply them first, your entropy compressor has less (and lower-entropy) data to chew on, requiring fewer CPU cycles.</p><p>Connecting this back to the cost model shows how this two-layer compression pays dividends. On the compression-golf dataset, my semantic codec reduced the data from 191 MiB to ~11 MiB. Applying zstd on top brought it down to 7.2 MiB in a total of 1.4s, which is comparable CPU time to zstd level 8 alone (1.3s) but with 2.5x better compression.</p><p>Plugging this into our cost formula gives &#952; &#8776; 1.14 Gbps (similar to level 8) and R &#8776; 26.5 (vs 10.67 for level 8). Since the bandwidth term $9/R dominates the cost equation, the improved ratio brings the total cost down to about $0.40/hr per logical Gbps ($9/26.5 + $0.072/1.14 &#8776; $0.34 + $0.06), <em>less than half of the previous best</em> (see the cost table in the section on the dollar value of compression).</p><p>Entropy compression techniques are well explored<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> and can be applied to your data with little to no modification from you. There are many libraries (zstd, gzip, snappy) that implement entropy-based compression you can pick up off the shelf and apply to your workload. 
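</p><p>As a rough illustration of how little code an off-the-shelf compressor needs, here is a minimal Python sketch using the standard library&#8217;s <code>zlib</code> module (the DEFLATE codec that gzip is built on). The sample rows and the <code>compression_ratio</code> helper are made up for this example and have nothing to do with the compression-golf dataset, so treat the printed ratios as illustrative only:</p>

```python
import json
import zlib

# Hypothetical payload: repetitive JSON rows, loosely shaped like metric samples.
rows = [
    {"instance": f"host-{i % 8}", "region": "us-west-2", "value": i % 100}
    for i in range(10_000)
]
raw = json.dumps(rows).encode("utf-8")


def compression_ratio(data: bytes, level: int) -> float:
    """Return uncompressed size divided by compressed size (R in the cost model)."""
    return len(data) / len(zlib.compress(data, level))


for level in (1, 6, 9):
    compressed = zlib.compress(raw, level)
    # Round-trip check: decompression must give back the original bytes.
    assert zlib.decompress(compressed) == raw
    print(f"level {level}: R = {compression_ratio(raw, level):.1f}")
```

<p>The ratio returned here plays the role of R in the cost model above; timing each call would give you a rough &#952; for the same data.</p><p>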
Since it is unlikely you will need to implement any of these techniques yourself, the rest of this post will focus on how you can apply semantic encoding to your data.</p><h2>small encodings of common datatypes</h2><p>There are standard, in-memory byte representations for many data types that I suspect you are familiar with. </p><p>An example of this is <code>u32</code> (an unsigned<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>, big-endian, 32-bit integer). For example the number 2,031,620 is represented like this:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WWCc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WWCc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png 424w, https://substackcdn.com/image/fetch/$s_!WWCc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png 848w, https://substackcdn.com/image/fetch/$s_!WWCc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png 1272w, https://substackcdn.com/image/fetch/$s_!WWCc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WWCc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png" width="403" height="236.7625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:800,&quot;resizeWidth&quot;:403,&quot;bytes&quot;:32297,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/187426616?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WWCc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png 424w, https://substackcdn.com/image/fetch/$s_!WWCc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png 848w, https://substackcdn.com/image/fetch/$s_!WWCc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png 1272w, https://substackcdn.com/image/fetch/$s_!WWCc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4b3fcaf-6508-4473-8b47-bbdc43f3da80_800x470.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a></figure></div><p>Each of the four bytes (shown here as hexadecimal codes) is 256 times more significant than the next, 256 being the number of values that a single byte can represent.</p><p>If we were to store these as bytes (on disk, in memory, etc&#8230;) each one takes up 32 bits. If I store 1024 of these <code>u32</code> integers I&#8217;m storing 4KB (32 kilobits) of data. Can we do better?</p><p>Turns out when you have a <code>u32</code> field, most values are typically pretty small and the first and second byte would both be <code>0x00</code>. To avoid storing unnecessary <code>0x00</code> bytes on disk, there are strategies such as variable-length integer (varint) encoding that use the most significant bit of each byte to indicate whether or not more bytes are needed past the current byte (and the remaining 7 bits to store data).</p><p>Here is an example of encoding the same integer as above, but using only 3 bytes. Note that we need to encode the least-significant groups of 7 bits first (little-endian order) so that we can stop as soon as we no longer need more bytes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hHoC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hHoC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png 424w, 
https://substackcdn.com/image/fetch/$s_!hHoC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png 848w, https://substackcdn.com/image/fetch/$s_!hHoC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png 1272w, https://substackcdn.com/image/fetch/$s_!hHoC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hHoC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png" width="494" height="277.7213930348259" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:1206,&quot;resizeWidth&quot;:494,&quot;bytes&quot;:58426,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/187426616?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!hHoC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png 424w, https://substackcdn.com/image/fetch/$s_!hHoC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png 848w, https://substackcdn.com/image/fetch/$s_!hHoC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png 1272w, https://substackcdn.com/image/fetch/$s_!hHoC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a41c1d4-58ec-45d8-9272-be675bca7309_1206x678.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>This gets more efficient the smaller the numbers are. Numbers less than 128 only take 1 byte!</p><p>My favorite characteristic of varint encoding is that there is no size limit: you can encode a <code>u16</code>, <code>u32</code> and <code>u64</code> all with the same encoding. This makes it backwards compatible to change the in-memory representation (upcast from a four-byte to an eight-byte integer) without changing the underlying storage, and it is a technique used by many formats such as Parquet and Protobuf.</p><p>There are various other strategies that help achieve similar results for other data types, but varint is the most common of them.</p><h2>delta, run-length, prefix and XOR encoding</h2><p>The next class of techniques exploits the fact that in-memory types are more general than necessary. This makes sense for mutable data since you need headroom for values you haven&#8217;t seen yet. 
Once the data set is fixed (on disk, for example) you can examine it and choose tighter encodings.</p><p>Each of these four techniques help take advantage of the observation that data is typically sorted or clustered in a way where similar values are next to one another.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DH6a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DH6a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png 424w, https://substackcdn.com/image/fetch/$s_!DH6a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png 848w, https://substackcdn.com/image/fetch/$s_!DH6a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png 1272w, https://substackcdn.com/image/fetch/$s_!DH6a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DH6a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png" width="1456" height="658" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:128094,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/187426616?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DH6a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png 424w, https://substackcdn.com/image/fetch/$s_!DH6a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png 848w, https://substackcdn.com/image/fetch/$s_!DH6a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png 1272w, https://substackcdn.com/image/fetch/$s_!DH6a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd98892-d41a-41da-9a73-17db4951d6a7_2074x938.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>I&#8217;ll summarize these techniques below and explain when each should be chosen; for the full algorithms there are more detailed explanations online:</p><ul><li><p><strong>Delta:</strong> use when adjacent integers are close in value. Instead of encoding the raw values, encode only the first raw value and then the differences (deltas) between consecutive values. Often the deltas will be much smaller than the raw data, allowing varint encoding to work better than it would on the raw numbers.</p></li><li><p><strong>Run-Length Encoding (RLE):</strong> use when there are many repeated values or sparse data. When many values repeat, encode the value once and then the number of times it repeats. This is commonly used in columnar data formats.</p></li><li><p><strong>Prefix Encoding:</strong> use with sorted strings. 
Since sorted strings often share prefixes, you only need to encode the length of the prefix shared with the previous string and the remaining suffix. Combined with Delta encoding or RLE you can reduce the size of the encoded prefix lengths.</p></li><li><p><strong>XOR Encoding:</strong> use mostly for similar floating point values. This is similar to Delta encoding, but sometimes a small difference between floating points requires a &#8216;large&#8217; representation. It works by XOR&#8217;ing the bits of consecutive values, so values that are similar to one another will typically produce a result with many 0 bits. There are then techniques you can use (see Gorilla for example) for compressing these mostly-zero results.</p></li></ul><h2>dictionary encoding</h2><p>Dictionary encoding helps when you have repeated data (or data patterns) that aren&#8217;t necessarily close in proximity, meaning you can&#8217;t use one of the techniques previously discussed. For example, the word &#8220;engineer&#8221; might show up frequently as the job title in a database of users, but those users are rarely stored next to one another.</p><p>To solve this, dictionary encoding maps commonly used tokens to numeric identifiers, and then replaces the tokens with the identifiers themselves. 
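</p><p>A minimal Python sketch of that idea, assuming the simplest possible dictionary (a plain list where a token&#8217;s position is its identifier). The function names <code>dictionary_encode</code> and <code>dictionary_decode</code> and the sample job titles are hypothetical, chosen only for illustration:</p>

```python
def dictionary_encode(tokens: list[str]) -> tuple[list[str], list[int]]:
    """Replace each token with a small integer id; return (dictionary, ids)."""
    ids_by_token: dict[str, int] = {}
    dictionary: list[str] = []
    encoded: list[int] = []
    for token in tokens:
        if token not in ids_by_token:
            # First sighting: the next free position becomes this token's id.
            ids_by_token[token] = len(dictionary)
            dictionary.append(token)
        encoded.append(ids_by_token[token])
    return dictionary, encoded


def dictionary_decode(dictionary: list[str], encoded: list[int]) -> list[str]:
    """Look each id back up in the dictionary to recover the original tokens."""
    return [dictionary[i] for i in encoded]


titles = ["engineer", "designer", "engineer", "analyst", "engineer", "designer"]
dictionary, encoded = dictionary_encode(titles)
print(dictionary)  # ['engineer', 'designer', 'analyst']
print(encoded)     # [0, 1, 0, 2, 0, 1]
assert dictionary_decode(dictionary, encoded) == titles
```

<p>The encoded column is now a list of small integers, which is the kind of data that an encoding like varint handles well.</p><p>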
There are many ways to represent the dictionary concisely (FST, Front Coded Lists, Radix/MARISA Tries, etc&#8230;) depending on the data distribution and requirements, but the main concept remains the same.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ByWV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ByWV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png 424w, https://substackcdn.com/image/fetch/$s_!ByWV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png 848w, https://substackcdn.com/image/fetch/$s_!ByWV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png 1272w, https://substackcdn.com/image/fetch/$s_!ByWV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ByWV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png" width="374" height="310.2391857506361" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:786,&quot;resizeWidth&quot;:374,&quot;bytes&quot;:51732,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/187426616?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ByWV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png 424w, https://substackcdn.com/image/fetch/$s_!ByWV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png 848w, https://substackcdn.com/image/fetch/$s_!ByWV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png 1272w, https://substackcdn.com/image/fetch/$s_!ByWV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13642718-1e3c-4b1e-85d3-353363f3ae53_786x652.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>The interesting thing about dictionaries is that they&#8217;re often used by entropy-based compressors (such as <code>zstd</code>) because they can see repeated sequences of byte strings. This means that <em>using a semantic dictionary alone is not very useful if all you care about is compressed data size</em>: you&#8217;re just repeating work that your compressor is going to do anyway.</p><p>The benefits of the dictionary are second order.</p><p>First, semantic dictionary encoding is faster because it can add a string to a dictionary as soon as it encounters it. 
In the meantime an algorithm like <code>zstd</code> needs to consider all possible sequences of bits and doesn&#8217;t know which ones repeat without a complicated lookback mechanism.</p><p>Second, it unlocks other encodings (like delta encoding and bitpacking), which compounds to improve the overall compression ratio.</p><h2>bit-packing</h2><p>Bitpacking is a technique used to ignore byte-alignment and pack multiple values into a single byte. A classic example of bit-packing is to pack eight boolean values into a single byte, using just one bit to represent each boolean.</p><p>The technique extends to more complicated data types as well. For example, if you know that deltas between all of your values are between 0 and 31 you only need 5 bits to represent each delta. The smallest varint is still 1 byte (8 bits), so you can do even better if you pack eight 5-bit values into 5 bytes:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Frk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Frk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png 424w, https://substackcdn.com/image/fetch/$s_!6Frk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png 848w, https://substackcdn.com/image/fetch/$s_!6Frk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6Frk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Frk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png" width="542" height="208.3968992248062" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:496,&quot;width&quot;:1290,&quot;resizeWidth&quot;:542,&quot;bytes&quot;:38035,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/187426616?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Frk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png 424w, https://substackcdn.com/image/fetch/$s_!6Frk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png 848w, 
https://substackcdn.com/image/fetch/$s_!6Frk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png 1272w, https://substackcdn.com/image/fetch/$s_!6Frk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bbe2aa2-deb3-4f28-809d-05742635427e_1290x496.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>You should use bitpacking with extreme caution.</strong></p><p>Not only is it easy to get your encodings wrong, but it can cause your final compressed dataset to be <em>larger</em> than if you hadn&#8217;t bitpacked. This is because most entropy-based compressors (<code>zstd</code> included) are byte-aligned, meaning they look for repetition along byte boundaries. If you have many repeated 7-bit strings the data looks &#8220;scrambled&#8221; to the eyes of a byte-aligned algorithm, so it won&#8217;t be able to do its job properly.</p><p>To show this I ran a quick experiment (<a href="https://gist.github.com/agavra/f35157b04000dc611f167968f5cbc164">see code and full results</a>) that generated 10k values between 0 and 127 (values that fit in 7 bits), sorted them and then compared the final compressed size if I were to just use full 8-bit values or if I bitpacked them into 7-bit values. </p><p>Despite the bitpacked representation being smaller before compression (8,750 bytes vs 10,000 bytes), zstd compresses the raw bytes down to just 246 bytes, which is nearly 3x better than the 662 bytes it achieves on the bitpacked version.</p><h2>lossy compression</h2><p>The techniques described earlier in this blogpost are all &#8220;lossless&#8221;. In other words, you can reconstruct the exact dataset you compressed initially. Sometimes, though, you can accept a lossy compression. 
Most of these techniques are domain-specific, and you are probably familiar with this concept in image and video processing.</p><p>While these techniques are interesting in their own right, I&#8217;ll save covering them for a future post (stay tuned for a post on vector databases, which leverage lossy compression quite elegantly).</p><h1>fsync()</h1><p>And we&#8217;re done for today! If you take away just one thing from this post it should be that compression is (almost) always worth doing, but how <em>much</em> compression depends on certain parameters of your workload.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>If you read my post on <a href="https://www.bitsxpages.com/p/frameworks-for-understanding-databases">database foundations</a>, you might see some parallels with the <a href="http://daslab.seas.harvard.edu/rum-conjecture/">RUM conjecture</a> here.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>All measurements in this blog post use <a href="https://en.wikipedia.org/wiki/Elapsed_real_time">CPU time, not wall clock time</a>, when evaluating compression. 
I had originally attempted using <code>valgrind</code> to count the number of CPU instructions, but (a) this made some heavier compression algorithms take over thirty minutes to run and (b) the conversion from CPU instruction to time depends on the IPS (instructions per second) which depends on the instructions and not the clock speed.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>You can use streaming compression algorithms to pipeline/stream compressed data, which makes the story for compression even more compelling since you can concurrently serialize, transfer and deserialize packets.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>See techniques like <a href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman Encoding</a> and <a href="https://en.wikipedia.org/wiki/Asymmetric_numeral_systems#Entropy_coding">Entropy Coding</a>, or look at implementations of systems like <a href="https://github.com/facebook/zstd">zstd</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Signed numbers are a little trickier to deal with because the standard <a href="https://en.wikipedia.org/wiki/Two%27s_complement">two&#8217;s complement</a> encoding uses the first bit to indicate that a number is negative. This means you need all the bytes to know whether a value is positive or negative. 
There are alternative encodings, the most popular is <a href="https://en.wikipedia.org/wiki/Variable-length_quantity#Zigzag_encoding">zigzag</a>, which help solve this problem.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[understanding LSM trees via read, write, and space amplification]]></title><description><![CDATA[how LSM trees work and why they can be optimized for almost any workload]]></description><link>https://www.bitsxpages.com/p/understanding-lsm-trees-via-read</link><guid isPermaLink="false">https://www.bitsxpages.com/p/understanding-lsm-trees-via-read</guid><dc:creator><![CDATA[almog gavra]]></dc:creator><pubDate>Wed, 21 Jan 2026 23:05:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2a115d75-b0a1-4d03-beeb-7b6eded07945_1200x628.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Our <a href="https://www.bitsxpages.com/p/sorted-string-tables-sst-from-first">last post covered SSTs</a>, which are a powerful building block for data systems with one big limitation: they are immutable and (most) databases are not. To work around this limitation, databases add structure. This post covers the most common structure for using SSTs: Log Structured Merge Trees (or LSM Trees).</p><h2>reasoning about structure</h2><p>As with all indexing structures, what you need to keep in mind is the read, write and space amplification characteristics of a particular structure. If you haven&#8217;t read our <a href="https://www.bitsxpages.com/p/frameworks-for-understanding-databases">database foundations post</a>, I&#8217;ll give a quick summary of what that means here:</p><ol><li><p>Read amplification<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> describes the overhead for reading an entry in a database. 
If I need to fetch 32KB from disk to read a single 1KB row, my read amplification is 32x.</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Read Amplification} = \\frac{\\text{Bytes Read from Disk}}{\\text{Result Size}}&quot;,&quot;id&quot;:&quot;SFYGYTHZLX&quot;}" data-component-name="LatexBlockToDOM"></div><ol start="2"><li><p>Write amplification quantifies the overhead when a database writes more bytes than strictly necessary to store a piece of data. If writing a single 1KB row to disk required rewriting a 32KB block, my write amplification is 32x.</p></li></ol><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Write Amplification} = \\frac{\\text{Bytes Written to Disk}}{\\text{Logical Row Size}}&quot;,&quot;id&quot;:&quot;NXFFGHHSZV&quot;}" data-component-name="LatexBlockToDOM"></div><ol start="3"><li><p>Space amplification captures the ratio between actual storage consumed and the logical data size, accounting for fragmentation, tombstones, and redundant copies. If I need 32GB disk space to comfortably run a workload with 16GB data, my space amplification is 2x.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Space Amplification} = \\frac{\\text{Required Disk Space}} {\\text{Logical Data Size}}&quot;,&quot;id&quot;:&quot;SXUOOKYKJP&quot;}" data-component-name="LatexBlockToDOM"></div></li></ol><p>Since there&#8217;s no perfect data structure on all three (<a href="http://daslab.seas.harvard.edu/rum-conjecture/">the RUM conjecture</a>), the rest of the post will go over different ways to structure SSTs that have different tradeoffs on each of these dimensions.</p><h1>naive strategies</h1><p>No matter what indexing structure you choose, the initial write path for all databases that use SSTs is the same: you buffer data in memory until you have enough to flush out an SST and you write that SST to disk. 
As we discussed in the last post, there are numerous reasons why that write path is an advantageous design.</p><p>What happens after that is called <em>compaction</em>: the process of giving these immutable files some structure and periodically combining the immutable SSTs together.</p><p>The wonderful thing about separating the write path and compaction is that you have an extraordinarily flexible mechanism for trading off between read, write and space amplification. You can design a system that accepts writes as fast as your SSD can handle, or you can have a system that attempts to always find the data you need with a single index lookup and pay for it with write amplification. In practice, you typically want something that lands somewhere in the middle.</p><p>This section outlines the extremes and uses them as anchor points when we look into the typical strategies used in production systems.</p><p>Examples throughout this post assume that SSTs have the following properties (for more details on what these mean, see the <a href="https://www.bitsxpages.com/p/sorted-string-tables-sst-from-first">previous post on SSTs</a>):</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Assh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82536555-2b69-4200-a4ea-e629d086bd50_688x340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Assh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82536555-2b69-4200-a4ea-e629d086bd50_688x340.png 424w, 
https://substackcdn.com/image/fetch/$s_!Assh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82536555-2b69-4200-a4ea-e629d086bd50_688x340.png 848w, https://substackcdn.com/image/fetch/$s_!Assh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82536555-2b69-4200-a4ea-e629d086bd50_688x340.png 1272w, https://substackcdn.com/image/fetch/$s_!Assh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82536555-2b69-4200-a4ea-e629d086bd50_688x340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Assh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82536555-2b69-4200-a4ea-e629d086bd50_688x340.png" width="416" height="205.58139534883722" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82536555-2b69-4200-a4ea-e629d086bd50_688x340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:340,&quot;width&quot;:688,&quot;resizeWidth&quot;:416,&quot;bytes&quot;:30247,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/185351494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82536555-2b69-4200-a4ea-e629d086bd50_688x340.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Assh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82536555-2b69-4200-a4ea-e629d086bd50_688x340.png 424w, https://substackcdn.com/image/fetch/$s_!Assh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82536555-2b69-4200-a4ea-e629d086bd50_688x340.png 848w, https://substackcdn.com/image/fetch/$s_!Assh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82536555-2b69-4200-a4ea-e629d086bd50_688x340.png 1272w, https://substackcdn.com/image/fetch/$s_!Assh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82536555-2b69-4200-a4ea-e629d086bd50_688x340.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h2>a simple log</h2><p>The easiest way to work around the immutability limitation of SSTs is to just write new ones.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qcIu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qcIu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png 424w, https://substackcdn.com/image/fetch/$s_!qcIu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png 848w, 
https://substackcdn.com/image/fetch/$s_!qcIu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png 1272w, https://substackcdn.com/image/fetch/$s_!qcIu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qcIu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png" width="486" height="245.47454175152748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:496,&quot;width&quot;:982,&quot;resizeWidth&quot;:486,&quot;bytes&quot;:36645,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/185351494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qcIu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png 424w, 
https://substackcdn.com/image/fetch/$s_!qcIu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png 848w, https://substackcdn.com/image/fetch/$s_!qcIu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png 1272w, https://substackcdn.com/image/fetch/$s_!qcIu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22e54a66-b2a9-407f-be38-39ecb409a529_982x496.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A log index structure continuously prepends new SSTs to the head of the log. When finding a key within the log, a scan is performed sequentially on each SST, starting from the head, until the key is found.</p><h3>write amplification</h3><p>This structure is surprisingly versatile. The main benefit is that there is no write amplification: rows that are written to a log of SSTs are written exactly once and never rewritten. You pay for this structure with both read and space amplification.</p><h3>read amplification</h3><p>The read amplification for a log is the number of bytes you need to read before you find your key. Since the log is structured by time, you know that the first SST in the log that contains your key holds the most recent version of it. Assuming you have an empty cache and the SST containing <code>key</code> is the <code>N</code>-th SST from the head, you can estimate the number of bytes you will need to read in the worst case:</p><pre><code><code>  N * sizeof(bloom_filter)
+ (N - 1) * bloom_filter_fp_rate * sizeof(index + block_size)
+ sizeof(index + block_size)
</code></code></pre><p>This formula explained in words: you need to read bloom filters for every SST, you will need to read the <code>index + block_size</code> to verify that the key is <em>not</em> in an SST for every false positive, and then read the <code>index + block_size</code> for the SST that contains the final key.</p><p>This means that the cost of reading data <strong>grows linearly with the number of SSTs</strong>.</p><p>If you get unlucky and the key you have is old, you can see how this quickly adds up (assuming the &#8220;real&#8221; data you want to read is a single 1KB row):</p><pre><code><code>Read Amplification with 0.1% False Positive Rate
(Bloom: 16KB, Index+Block: 36KB, 1KB rows)

  SSTs |   Total Read |     Amp
-------|--------------|--------
    10 |       196 KB |    196x
   100 |      1.64 MB |  1,640x
  1000 |      16.1 MB | 16,072x
</code></code></pre><p>The dominant cost in that table is reading the bloom filters. This is why production databases aim to keep the bloom filters and indexes of all frequently accessed SSTs in memory. Doing so drops the disk-read amplification by orders of magnitude (though as the number of SSTs in a log-like system grows, holding all of them in memory becomes infeasible).</p><h3>space amplification</h3><p>This gets us to the space amplification of the log structure. Since we never delete or merge SSTs in a log (more on that later), any SST that contains a duplicate key will have garbage data. How much you pay in space amplification with your log depends on your use case: some (like an audit log use case) would pay nothing since you never overwrite entries, others can pay a significant price.</p><p>This is a good time to introduce the concept of tombstones. How do you delete data if SSTs are immutable? The answer is to, ironically, write more data. If I have an SST 0 on disk that contains <code>keyA -&gt; valueA</code>, I could write a new value in SST 1 that contains <code>keyA -&gt; [deleted]</code>. When I follow my algorithm for &#8220;find first instance of <code>keyA</code> in the database&#8221;, I&#8217;ll find <code>[deleted]</code> in SST 1 and know that I should not continue looking in SST 0.</p><p>As you can tell, deletions and updates cause more disk space amplification in a naive log system.</p><h2>eager merge</h2><p>I have yet to see a production system that exhibits this behavior, but it&#8217;s interesting to illustrate the polar opposite of a simple log. I&#8217;m calling this an &#8220;eager merge tree&#8221;, a data structure that merges every new SST with the old SST as soon as it&#8217;s written, producing a single new merged SST. 
This works around the immutability by constantly rewriting the old data and discarding it:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Ijv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Ijv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png 424w, https://substackcdn.com/image/fetch/$s_!0Ijv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png 848w, https://substackcdn.com/image/fetch/$s_!0Ijv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png 1272w, https://substackcdn.com/image/fetch/$s_!0Ijv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Ijv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png" width="678" height="338.4104347826087" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1150,&quot;resizeWidth&quot;:678,&quot;bytes&quot;:45614,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/185351494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0Ijv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png 424w, https://substackcdn.com/image/fetch/$s_!0Ijv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png 848w, https://substackcdn.com/image/fetch/$s_!0Ijv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png 1272w, https://substackcdn.com/image/fetch/$s_!0Ijv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d759cb6-7220-4058-ab6f-58225f6d68db_1150x574.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>read amplification</h3><p>This system is ideal for reads. You only ever load up the single SST that is on disk, check the bloom filter, verify in the index and read the appropriate block. Assuming your single bloom filter and index are in memory (which they would be, since every query reads the same ones) you read a total of 1 block for your key. Using the same parameters as before, that&#8217;s a 4KB read for a 1KB row, or a 4x read amplification.</p><h3>space amplification</h3><p>This system is excellent for steady-state space amplification. You only ever have a single SST on disk, so there&#8217;s no garbage data maintained. Compaction is also memory-efficient since you hold one block from each input SST at a time, interleaving them and writing out new blocks as you go.</p><p>The catch is temporary disk space during compaction. 
You can&#8217;t delete the old SST until the new merged SST is fully written and synced to disk, or a crash mid-compaction would lose data. This means you briefly need 2x your data size in disk space every time you compact. For a 1TB dataset, you need 2TB of disk available even though you&#8217;ll only use 1TB in steady state<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>In other words, you need to <em>over-provision your disks</em> to keep steady-state utilization below 50%.</p><h3>write amplification</h3><p>I&#8217;m certain you saw this coming. The eager merge tree is so atrociously bad from a write amplification perspective that it becomes infeasible for all but effectively immutable data sets. Every time you write a new SST and merge it with the old one, all data is read and rewritten.</p><pre><code><code>Eager Merge Tree: Write Amplification
(4GB buffer, 1 MB/s writes, 1KB keys)

Data Size | Logical Writes | Physical Writes |   Amp (x)
----------|----------------|-----------------|----------
     4 GB |           4 GB |            4 GB |        1x
    40 GB |           4 GB |           40 GB |       10x
   100 GB |           4 GB |          100 GB |       25x
   400 GB |           4 GB |          400 GB |      100x
     1 TB |           4 GB |            1 TB |      250x
     4 TB |           4 GB |            4 TB |    1,000x
    10 TB |           4 GB |           10 TB |    2,500x
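</code></code></pre><p>The table follows from one line of arithmetic: every flush of the buffer rewrites the entire dataset, so the amplification is roughly the data size divided by the buffer size. Here is a minimal Python sketch of that calculation, assuming decimal units (1 TB = 1,000 GB) and a merged SST that is the same size as the existing data. The function name <code>eager_merge_write_amp</code> is illustrative and not taken from any library.</p><pre><code><code>def eager_merge_write_amp(data_size_gb, buffer_gb=4):
    """Physical bytes written per logical byte when every flush rewrites all data."""
    # Assumption: each merge rewrites the whole dataset; with an empty tree
    # only the buffer itself is written.
    physical_gb = max(data_size_gb, buffer_gb)
    return physical_gb / buffer_gb


for size_gb in (4, 40, 100, 400, 1_000, 4_000, 10_000):
    amp = eager_merge_write_amp(size_gb)
    print(f"{size_gb:6,} GB: {amp:7,.0f}x")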
</code></code></pre><p>There is almost a brutal simplicity to these numbers: even at 400GB, every 4GB of new data costs you 100x write amplification. Your SSDs won&#8217;t last long with this strategy.</p><h1>log structured merge trees</h1><p>Let&#8217;s build intuition for LSM trees by exploring the middle ground between the two extremes of logs and eager merging. The log-based approach shows us that the read amplification is proportional to the number of SSTs. The eager merge approach fixes this, but introduces significant write amplification caused by repeatedly merging small new SSTs with a single large &#8220;main&#8221; SST.</p><p>We could start searching for a middle ground by increasing the size of the &#8220;small new SST&#8221; before merging it with the main one in an eager merge tree. If my dataset is 1GB, I can improve my write amplification by buffering larger &#8220;new&#8221; SSTs before merging them:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xahu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xahu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png 424w, https://substackcdn.com/image/fetch/$s_!Xahu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png 848w, 
https://substackcdn.com/image/fetch/$s_!Xahu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png 1272w, https://substackcdn.com/image/fetch/$s_!Xahu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xahu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png" width="1456" height="590" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71297,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/185351494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xahu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png 424w, https://substackcdn.com/image/fetch/$s_!Xahu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png 
848w, https://substackcdn.com/image/fetch/$s_!Xahu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png 1272w, https://substackcdn.com/image/fetch/$s_!Xahu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd07b1694-6880-4e4d-8c61-84ab5b6bbfb8_1738x704.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Assuming the new SST only updates existing rows so the new compacted &#8220;main&#8221; SST is the same size as the old one, you 
can see that the write amplification is improved the more data I buffer up before compacting. This improvement is helpful but still limiting. Buffering more data before flushing it out causes increased memory pressure and doesn&#8217;t help you if the size of your main SST continues to grow.</p><p>The main insight of an LSM tree is that it provides a continuum between the log and the eager merge structures by maintaining a log for the most recent data, and then &#8220;levels&#8221; of merged data that increase in size to keep amplification manageable.</p><p>SSTs from new writes to the database go into the log. Then you maintain a series of groups of SSTs where each group is larger than the previous group by some multiplier. You can then trade off read and write amplification by tuning the multiplier or the number of SSTs you allow to accumulate in a group.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!exPs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F955a9684-ab20-4125-9f96-a12179391a59_730x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!exPs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F955a9684-ab20-4125-9f96-a12179391a59_730x704.png 424w, https://substackcdn.com/image/fetch/$s_!exPs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F955a9684-ab20-4125-9f96-a12179391a59_730x704.png 848w, https://substackcdn.com/image/fetch/$s_!exPs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F955a9684-ab20-4125-9f96-a12179391a59_730x704.png 1272w, 
https://substackcdn.com/image/fetch/$s_!exPs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F955a9684-ab20-4125-9f96-a12179391a59_730x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!exPs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F955a9684-ab20-4125-9f96-a12179391a59_730x704.png" width="382" height="368.3945205479452" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/955a9684-ab20-4125-9f96-a12179391a59_730x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:730,&quot;resizeWidth&quot;:382,&quot;bytes&quot;:39029,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/185351494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F955a9684-ab20-4125-9f96-a12179391a59_730x704.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!exPs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F955a9684-ab20-4125-9f96-a12179391a59_730x704.png 424w, https://substackcdn.com/image/fetch/$s_!exPs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F955a9684-ab20-4125-9f96-a12179391a59_730x704.png 848w, 
https://substackcdn.com/image/fetch/$s_!exPs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F955a9684-ab20-4125-9f96-a12179391a59_730x704.png 1272w, https://substackcdn.com/image/fetch/$s_!exPs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F955a9684-ab20-4125-9f96-a12179391a59_730x704.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Before we dive into the specifics of how these trees are structured, there are some common concepts that are shared between all of 
them:</p><ol><li><p>A <strong>level</strong> of a tree is typically referred to as <code>LN</code> (e.g. <code>L1</code>), with the number representing the depth of the tree. In other words <code>L0</code> contains the most recent data, and data sinks downward to <code>L1, L2, ...</code> over time through compaction. Note that if you think of trees in terms of &#8220;height&#8221; instead of &#8220;depth&#8221; you&#8217;re going to trip over the number scheme. I recommend you disabuse yourself of that notion whenever dealing with LSM trees.</p></li><li><p>A <strong>sorted run</strong> is a set of SSTs that do not overlap in keys, but together cover the entire keyspace. In other words, if Sorted Run 100 contains SSTs 101, 102 and 103 then those SSTs might cover keys <code>[a-g]</code>, <code>[h-m]</code>, <code>[n-z]</code> respectively (you would not find a key that starts with <code>k</code> in SST 101 or 103, only in SST 102). These are useful because they allow you to maintain smaller filters and indexes for each SST while using the min/max to find the right SST within a sorted run.</p></li></ol><h2>leveled trees</h2><p>A discussion of LSM trees usually starts with Leveled Trees, since this was their first widely used implementation with <a href="https://github.com/google/leveldb">LevelDB</a> from Google<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. The structure of a Leveled Tree looks like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qZkG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qZkG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png 424w, https://substackcdn.com/image/fetch/$s_!qZkG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png 848w, https://substackcdn.com/image/fetch/$s_!qZkG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qZkG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qZkG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png" width="508" height="445.63168316831684" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:886,&quot;width&quot;:1010,&quot;resizeWidth&quot;:508,&quot;bytes&quot;:63763,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/185351494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qZkG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png 424w, https://substackcdn.com/image/fetch/$s_!qZkG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png 848w, 
https://substackcdn.com/image/fetch/$s_!qZkG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png 1272w, https://substackcdn.com/image/fetch/$s_!qZkG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8491e43a-9ac2-41c3-bbf6-58ec26e41f37_1010x886.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The first level (L0) is special<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" 
href="#footnote-4" target="_self">4</a>: each SST in L0 can cover the entire key range. In the diagram above it&#8217;s possible that key <code>foo</code> exists in both <code>SST100</code> and <code>SST101</code>. Every deeper level in the tree (L1+) is a single sorted run, split into SSTs with non-overlapping keys. In other words, if the Sorted Run for L1 is split into three SSTs (with ids 88, 89 &amp; 90) then a key that shows up in SST 88 is guaranteed not to show up in SST 89 or 90:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!czux!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!czux!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png 424w, https://substackcdn.com/image/fetch/$s_!czux!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png 848w, https://substackcdn.com/image/fetch/$s_!czux!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png 1272w, https://substackcdn.com/image/fetch/$s_!czux!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!czux!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png" width="346" height="254.62314540059347" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:496,&quot;width&quot;:674,&quot;resizeWidth&quot;:346,&quot;bytes&quot;:30401,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/185351494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!czux!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png 424w, https://substackcdn.com/image/fetch/$s_!czux!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png 848w, https://substackcdn.com/image/fetch/$s_!czux!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png 1272w, https://substackcdn.com/image/fetch/$s_!czux!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19bba2be-53be-4edc-a054-41f9f7d79823_674x496.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The algorithm for making this happen is called Leveled Compaction, and it looks a lot like a combination of the two &#8220;naive&#8221; strategies covered previously. New writes go into a simple log structure by just appending new SSTs into L0. This allows a Leveled tree to accept writes quickly without paying any cost for compaction.</p><p>Then in order to improve read performance, L0 is periodically compacted when it accumulates too many SSTs. 
Since L0 files can overlap with each other, compaction typically selects multiple L0 files and merges them with any overlapping SSTs in L1 (if any exist). The output is a set of new, non-overlapping SSTs that are placed into L1.</p><p>When a deeper (L1+) level exceeds its size limit (set as a multiple of the previous level&#8217;s limit so that levels grow exponentially; 10x is common), one or more SSTs are selected and merged with the overlapping SSTs from the next level down. This cascades as needed: if L1 then exceeds its size limit, some of its SSTs will be merged into L2 (10x larger than L1), and so on until the tree stabilizes.</p><h3>read amplification</h3><p>Leveled compaction&#8217;s main contribution over a simple log is to improve read amplification by bounding the number of SSTs you need to check. For L0, you still need to check every SST since they can overlap, but in each deeper level you need to check at most one SST. This means the worst-case lookup cost is:</p><pre><code><code>L0_count * sizeof(bloom_filter)
+ num_levels * sizeof(bloom_filter)
+ bf_fp_rate * (L0_count + num_levels) * sizeof(index + block_size)
+ sizeof(index + block_size)
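</code></code></pre><p>The formula can also be written as a small function, which makes it easy to see how each term contributes. This is only a sketch: the 4KB block comes from earlier in the post, but the 16KB Bloom filter, the 28KB index and the 1% false positive rate are illustrative assumptions, so the results will not exactly match the tables in this post. The name <code>worst_case_read_kb</code> is made up for this example.</p><pre><code><code>def worst_case_read_kb(l0_count, num_levels, bloom_kb=16.0, index_kb=28.0,
                       block_kb=4.0, bf_fp_rate=0.01):
    """Worst-case KB read for one point lookup with nothing cached.

    Default sizes are assumed, illustrative values, not the post's exact parameters.
    """
    # Every L0 SST is checked, plus at most one SST in each deeper level.
    filters_checked = l0_count + num_levels
    # Reading an SST for real means loading its index and one data block.
    probe_kb = index_kb + block_kb
    filter_cost = filters_checked * bloom_kb
    false_positive_cost = bf_fp_rate * filters_checked * probe_kb
    # The final term is the one probe that actually finds the key.
    return filter_cost + false_positive_cost + probe_kb


for levels in range(4, 9):
    print(f"{levels} levels, 4 L0 files: {worst_case_read_kb(4, levels):.1f} KB")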
</code></code></pre><p>Assuming nothing is kept in memory (which is unlikely, but helps illustrate the point) we can compare a leveled tree to a simple log when it comes to read amplification:</p><pre><code><code>Read Amplification: Leveled vs Simple Log (10x ratio, 4 L0 files)

Data Size | Levels | Leveled |  Simple Log
----------|--------|---------|-------------
     1 GB |      4 |  160 KB |      196 KB
    10 GB |      5 |  176 KB |    1,640 KB
   100 GB |      6 |  192 KB |   16,036 KB
     1 TB |      7 |  208 KB |  160,036 KB
    10 TB |      8 |  224 KB |    1,600 MB</code></code></pre><p>The leveled tree&#8217;s read amplification grows logarithmically (one bloom filter per level) while the simple log&#8217;s grows linearly (one bloom filter per SST).</p><h3>space amplification</h3><p>Along with read amplification, leveled compaction also improves on the naive designs&#8217; space behavior: it holds far less garbage than a simple log and needs far less temporary disk space than eager merge. There&#8217;s very little garbage data because (a) the SSTs in each level have no key overlap and (b) compaction aggressively merges data downward.</p><p>Roughly 90% of your data lives in the last level (assuming a 10x ratio). This means that in the worst case, when everything in the rest of the tree is an update to a key in the last level, the garbage in the tree is still only about 10% of the data size (space amplification is ~1.1x).</p><p>Temporary disk space during compaction is also well-bounded. Unlike eager merge, which rewrites the entire dataset and needs 2x disk space, leveled compaction only rewrites the SSTs involved in a single compaction. You need enough headroom to hold the old and new SSTs simultaneously, but that&#8217;s a small fraction of your total data rather than all of it. To summarize the comparison:</p><pre><code><code>                   |  Leveled Tree  |  Simple Log  |  Eager Merge
-------------------|----------------|--------------|-------------
Disk overhead      |         ~1.11x |       1-10x+ |       ~1.0x
Memory overhead*   |   logarithmic  |       linear |    constant
Temp disk space**  |     ~11 SSTs** |          0%  |       ~100%

*  Bloom filters + indexes scale with SST count  
** Space for old + new SSTs during a single compaction; multiple 
    concurrent compactions would increase this proportionally
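</code></code></pre><p>The ~90% and ~1.11x figures for the leveled tree come from a geometric series over the level sizes. The sketch below works through that arithmetic under an idealized assumption: every level is exactly full and each level is exactly <code>ratio</code> times the size of the one above it. The function names are illustrative.</p><pre><code><code>def level_fractions(num_levels, ratio=10):
    """Fraction of all on-disk data held by each level, shallowest first."""
    # Assumption: every level is exactly full and grows by `ratio` per level.
    sizes = [ratio ** i for i in range(num_levels)]
    total = sum(sizes)
    return [size / total for size in sizes]


def worst_case_space_amp(num_levels, ratio=10):
    """On-disk bytes divided by live bytes when every key in a shallower
    level is an update to a key that also exists in the last level."""
    last_level_fraction = level_fractions(num_levels, ratio)[-1]
    return 1.0 / last_level_fraction


for levels in (3, 5, 7):
    last = level_fractions(levels)[-1]
    amp = worst_case_space_amp(levels)
    print(f"{levels} levels: last level holds {last:.1%}, space amp {amp:.3f}x")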
</code></code></pre><h3>write amplification</h3><p>Hopefully you&#8217;ve been paying attention, so before you read this section take a step back and guess how the write amplification of a Leveled Tree behaves (generally, compared to a simple log).</p><p>If you applied the RUM conjecture to guess that this is where the cost comes in, you are right! Every time data compacts from one level to the next, it gets rewritten. Because each level is 10x larger than the previous, data from the smaller level must merge into the larger one about 10 times before the larger level crosses its size threshold and compacts down.</p><p>To determine the write amplification, we want to know how many times a key gets rewritten at each level. On average, a key arrives when the level is about half full, so it gets rewritten about 5 times before moving to the next level. This leads us to estimate the write amplification:</p><pre><code>write_amp &#8776; num_levels &#215; (size_ratio / 2)</code></pre><pre><code><code>Leveled Tree: Write Amplification
(10x level ratio, 1KB rows)

Data Size | Levels | Expected Write Amp*  
----------|--------|---------------------
     1 GB |      3 |                  15x 
    10 GB |      4 |                  20x 
   100 GB |      5 |                  25x 
     1 TB |      6 |                  30x 
    10 TB |      7 |                  35x 

* Actual may be lower because not all data makes it to the deepest level, and compaction selection tries to minimize overlap.
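</code></code></pre><p>The estimate is simple enough to script. The sketch below reproduces the expected write amplification column, taking the level counts from the table as given inputs rather than deriving them, since how data size maps to level count depends on L0 and base level sizing that is not modeled here. The function name is illustrative.</p><pre><code><code>def leveled_write_amp(num_levels, size_ratio=10):
    """Estimate: a key is rewritten about size_ratio / 2 times per level."""
    return num_levels * (size_ratio / 2)


# Level counts copied from the table above; they are inputs, not derived.
levels_by_size = {"1 GB": 3, "10 GB": 4, "100 GB": 5, "1 TB": 6, "10 TB": 7}
for size, levels in levels_by_size.items():
    print(f"{size:6} | {levels} levels | {leveled_write_amp(levels):.0f}x")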
</code></code></pre><p>While this write amplification is way worse than the simple log&#8217;s, it is many times better than that of the strict eager merge strategy. For read-heavy workloads leveled compaction is often worth it.</p><h2>size tiered trees</h2><p>The second main type of LSM compaction strategy attempts to improve on the write amplification of a Leveled Tree at the cost of read amplification. The main innovation with size tiered trees is that each tier (the analog to a Level in Leveled Compaction) can have <em>multiple</em> sorted runs which are never partially modified. This is closer to having each tier be its own &#8220;simple log&#8221;, periodically compacting an entire section of that log down into the next tier.</p><p>In the diagram below, you can imagine that <code>Sorted Run 11</code> was created when 4 L0 SSTs each of size 64MB were ready to compact. The next four created <code>Sorted Run 12</code> without touching <code>11</code> at all. When there are 4 sorted runs in Tier 1, they are merged together into a new sorted run in Tier 2 with size 1GB, and so on until the data has merged its way down to Tier N.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9390!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9390!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png 424w, 
https://substackcdn.com/image/fetch/$s_!9390!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png 848w, https://substackcdn.com/image/fetch/$s_!9390!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png 1272w, https://substackcdn.com/image/fetch/$s_!9390!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9390!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png" width="600" height="451.27334465195247" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:886,&quot;width&quot;:1178,&quot;resizeWidth&quot;:600,&quot;bytes&quot;:68719,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/185351494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!9390!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png 424w, https://substackcdn.com/image/fetch/$s_!9390!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png 848w, https://substackcdn.com/image/fetch/$s_!9390!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png 1272w, https://substackcdn.com/image/fetch/$s_!9390!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c03037c-145c-41a4-9683-b111ea042b8e_1178x886.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>write amplification</h3><p>A good description of the comparison between the two comes from the <a href="https://github.com/facebook/rocksdb/wiki/universal-compaction">RocksDB documentation</a>, so I&#8217;ll paraphrase that here:</p><blockquote><p>In size tiered storage, new writes move entries from a smaller sorted run to a much larger one. Every compaction is likely to make the update exponentially closer to the final sorted run, which is the largest.</p><p>In leveled compaction new writes are compacted more as a part of the larger sorted run where a smaller sorted run is merged into, than as a part of the smaller sorted run. As a result, in most of the times an update is compacted, it is not moved to a larger sorted run, so it doesn&#8217;t make much progress towards the final largest run.</p></blockquote><p>If each tier merges <code>R</code> sorted runs into one larger run, then a key is rewritten approximately once per tier. This yields:</p><pre><code><code>write_amp &#8776; number_of_tiers</code></code></pre><p>Using the same assumptions as before, here&#8217;s a write amplification comparison between a Size Tiered and a Leveled Tree. At every data size the tiered tree rewrites each key roughly 5x less often:</p><pre><code><code>Write Amplification: Size-Tiered vs Leveled

Data Size | Tiers | Tiered WA (&#8776; T) | Levels | Leveled WA (&#8776; L&#215;5)
----------|-------|-----------------|--------|---------------------
     1 GB |     3 |             ~3x |      3 |                ~15x
    10 GB |     4 |             ~4x |      4 |                ~20x
   100 GB |     5 |             ~5x |      5 |                ~25x
     1 TB |     6 |             ~6x |      6 |                ~30x
    10 TB |     7 |             ~7x |      7 |                ~35x
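</code></code></pre><p>The table can be regenerated with a few lines of Python. This is only a sketch: the <code>depth_for</code> helper and its rule (3 tiers or levels at 1 GB, plus one for every 10x of data) are assumptions read off the table rather than a property of any real engine, and the per-level factor of 5 is taken from the table header.</p><pre><code><code>
```python
import math

# Rewrites per level for a leveled tree: the "L x 5" factor from the table header.
LEVELED_REWRITES_PER_LEVEL = 5


def depth_for(data_size_gb):
    """Hypothetical sizing rule read off the table: 3 tiers/levels at 1 GB, plus one per 10x."""
    return 3 + round(math.log10(data_size_gb))


def tiered_write_amp(tiers):
    """Size-tiered: a key is rewritten roughly once per tier."""
    return tiers


def leveled_write_amp(levels):
    """Leveled: a key is rewritten several times within each level."""
    return levels * LEVELED_REWRITES_PER_LEVEL


def write_amp_rows(sizes_gb):
    """Return (size_gb, depth, tiered_wa, leveled_wa) for each data size."""
    rows = []
    for size_gb in sizes_gb:
        depth = depth_for(size_gb)
        rows.append((size_gb, depth, tiered_write_amp(depth), leveled_write_amp(depth)))
    return rows


if __name__ == "__main__":
    for size_gb, depth, tiered, leveled in write_amp_rows([1, 10, 100, 1000, 10000]):
        print(f"{size_gb} GB: depth={depth} tiered=~{tiered}x leveled=~{leveled}x")
```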

</code></code></pre><h3>read amplification</h3><p>As expected, the roughly 5x improvement in write amplification is paid for in read amplification. Unlike leveled compaction, size-tiered compaction does not enforce a single sorted run per tier. Instead, each tier may contain up to <code>R</code> overlapping sorted runs (where <code>R</code> is a configurable merge threshold). In the worst case, a point lookup must check all runs in every tier before determining whether a key exists.</p><p>With <code>N</code> tiers, the worst-case read cost becomes:</p><pre><code><code>(N &#215; R) * sizeof(bloom_filter)
+ bloom_filter_fp_rate &#215; (N &#215; R) * sizeof(index + block_size)
+ sizeof(index + block_size)</code></code></pre><p>This grows logarithmically with data size (like leveled), but multiplied by <code>R</code>. Compare:</p><pre><code><code>Strategy    | Read Amplification  
------------|-----------------------
Naive log   | O(data_size)
Size-tiered | O(R &#215; log(data_size))
Leveled     | O(log(data_size))

Read Amplification: Size-Tiered vs Leveled
(nothing in memory) - 4 L0 SSTs

          |       |             SSTs |               Reads
Data Size | Depth | Tiered / Leveled |   Tiered /  Leveled
----------|-------|------------------|---------------------
     1 GB |     3 |          12 /  7 |  ~240 KB /  ~148 KB
    10 GB |     4 |          16 /  8 |  ~320 KB /  ~164 KB
   100 GB |     5 |          20 /  9 |  ~400 KB /  ~180 KB
     1 TB |     6 |          24 / 10 |  ~480 KB /  ~196 KB
    10 TB |     7 |          28 / 11 |  ~560 KB /  ~212 KB
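</code></code></pre><p>Here is a small Python sketch of the worst-case formula above. The byte sizes and false positive rate are placeholders I picked for illustration, so the absolute numbers will not match the table; the run counts assume <code>R = 4</code> for the tiered tree and 4 L0 SSTs plus one sorted run per level for the leveled tree, which is how the SST column in the table appears to be derived.</p><pre><code><code>
```python
KIB = 1024


def worst_case_read_bytes(sorted_runs, bloom_bytes, fp_rate, index_bytes, block_bytes):
    """Worst-case bytes read by a point lookup that has to consider every sorted run."""
    probe_bytes = index_bytes + block_bytes      # cost of actually searching one run
    bloom_cost = sorted_runs * bloom_bytes       # every run's bloom filter is consulted
    false_positive_cost = fp_rate * sorted_runs * probe_bytes
    return bloom_cost + false_positive_cost + probe_bytes  # plus the one real hit


def tiered_runs(tiers, runs_per_tier):
    """Size-tiered: up to R overlapping runs in each of N tiers."""
    return tiers * runs_per_tier


def leveled_runs(levels, l0_ssts):
    """Leveled: the L0 SSTs plus a single sorted run per level."""
    return l0_ssts + levels


if __name__ == "__main__":
    # Placeholder sizes, chosen only for illustration.
    sizes = dict(bloom_bytes=16 * KIB, fp_rate=0.01, index_bytes=32 * KIB, block_bytes=4 * KIB)
    for depth in range(3, 8):
        tiered = worst_case_read_bytes(tiered_runs(depth, 4), **sizes)
        leveled = worst_case_read_bytes(leveled_runs(depth, 4), **sizes)
        print(f"depth={depth} tiered={tiered / KIB:.0f} KiB leveled={leveled / KIB:.0f} KiB")
```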

</code></code></pre><h3>space amplification</h3><p>Size Tiered Trees sit somewhere between a simple log and leveled compaction in their space amplification. Because sorted runs overlap in key space, obsolete versions and tombstones can accumulate within a tier until a full merge happens. In the worst case, each tier can contain up to <code>R</code> copies of the same key (one per run), but in practice real workloads tend to do better than this.</p><h2>time windowed structures</h2><p>The last main type of structure commonly seen in LSM trees is the time-windowed one. It is worth an honorable mention because it serves as an &#8220;asterisk&#8221; on the RUM conjecture: if you know that a workload follows certain patterns, you can improve <em>all three</em> of read, write, and space amplification. If the assumptions are violated, however, you pay a steep price, either in correctness or by rewriting the entire dataset in a major compaction.</p><p>Time-Windowed Compaction Strategy (TWCS) works by assuming that data arrives roughly in time order and that there is a practical bound on how late a write can arrive. With this simplification, the database is split into two parts: a mutable region for the current time window and a set of immutable regions for historical windows. 
New SSTs are written into the current window, and once a window closes, it becomes immutable except for compactions within that window.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xo3v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6671fce2-69b2-498d-b347-c05602c181de_884x1068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xo3v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6671fce2-69b2-498d-b347-c05602c181de_884x1068.png 424w, https://substackcdn.com/image/fetch/$s_!Xo3v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6671fce2-69b2-498d-b347-c05602c181de_884x1068.png 848w, https://substackcdn.com/image/fetch/$s_!Xo3v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6671fce2-69b2-498d-b347-c05602c181de_884x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!Xo3v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6671fce2-69b2-498d-b347-c05602c181de_884x1068.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xo3v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6671fce2-69b2-498d-b347-c05602c181de_884x1068.png" width="406" height="490.50678733031674" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6671fce2-69b2-498d-b347-c05602c181de_884x1068.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1068,&quot;width&quot;:884,&quot;resizeWidth&quot;:406,&quot;bytes&quot;:68437,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/185351494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6671fce2-69b2-498d-b347-c05602c181de_884x1068.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xo3v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6671fce2-69b2-498d-b347-c05602c181de_884x1068.png 424w, https://substackcdn.com/image/fetch/$s_!Xo3v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6671fce2-69b2-498d-b347-c05602c181de_884x1068.png 848w, https://substackcdn.com/image/fetch/$s_!Xo3v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6671fce2-69b2-498d-b347-c05602c181de_884x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!Xo3v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6671fce2-69b2-498d-b347-c05602c181de_884x1068.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This improves read amplification (assuming reads are also time-scoped) because queries can skip entire historical windows instead of scanning the full tree. It improves write amplification because data does not &#8220;bubble down&#8221; through progressively larger structures; once a window is sealed, its data is never merged with newer data again. 
Finally, because compaction is restricted to occur within a single window, both steady-state and temporary space amplification remain bounded by the size of a window rather than the total size of the dataset.</p><p>I&#8217;ll keep the discussion on this short for now, since this is already a lengthy post, but if you&#8217;re curious about more details there&#8217;s an interesting discussion on the original <a href="https://issues.apache.org/jira/browse/CASSANDRA-9666">Cassandra JIRA ticket</a> that introduced it.</p><h1>compact()</h1><p>That&#8217;s it for today! Join us next time for a deep dive on B-Trees and why (I believe) LSM trees are having their moment in the sun for object-storage native systems.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Read and write amplification here is simplified to illustrate the key parts of LSM tree structure. 
In practice, amplification also accounts for other factors such as the <em>number</em> of IOPs (which may matter more than the amount of data read in situations such as using EBS with limited IOP capacity).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>There are strategies for mitigating how much space amplification you need by incrementally deleting old parts of the data while you write new SSTs out.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>This was introduced by Jeff Dean and Sanjay Ghemawat, though the initial LSM paper was written by <a href="https://dsf.berkeley.edu/cs286/papers/lsm-acta1996.pdf">O&#8217;Neil in 1996</a>. Personally, my favorite paper on the subject is <a href="https://nivdayan.github.io/dostoevsky.pdf">Dostoevsky</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>A &#8220;pure&#8221; leveled tree would not have an L0, instead it would flush merge data directly into the lowest level of the tree. In practice, this is often too expensive so leveled trees treat the top level similar to a log. 
See the next section for more details.</p></div></div>]]></content:encoded></item><item><title><![CDATA[sorted string tables (SST) from first principles]]></title><description><![CDATA[why sorted string tables are the swiss army knife for data systems and how they are implemented]]></description><link>https://www.bitsxpages.com/p/sorted-string-tables-sst-from-first</link><guid isPermaLink="false">https://www.bitsxpages.com/p/sorted-string-tables-sst-from-first</guid><dc:creator><![CDATA[almog gavra]]></dc:creator><pubDate>Mon, 05 Jan 2026 19:38:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/bb889d77-4989-402c-b401-a9111a33d428_1200x628.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post is about how data is laid out on disk, specifically about the details of Sorted String Tables (SSTs). Let&#8217;s cut to the chase.</p><h1>SSDs and memory</h1><p>First, it&#8217;s important to understand how data gets from disk into usable memory. Most cloud instances use SSDs, so we&#8217;ll focus on those instead of spinning disks.</p><p>If you haven&#8217;t read the <a href="https://www.bitsxpages.com/p/frameworks-for-understanding-databases">initial frameworks blog post</a> of this series, I recommend starting with that. In this context, the main point to take away is that not all read amplification is created equal. All databases need to abide by the laws of physics: fetching data that&#8217;s already in memory is hundreds of times faster than fetching data from over the network. 
Data structure design is all about minimizing the amount of data you need to fetch from expensive storage tiers while reducing the memory overhead (this is an application of the <a href="http://daslab.seas.harvard.edu/rum-conjecture/">RUM conjecture</a>).</p><h3>pages on storage devices</h3><p>A high performance data system needs to reduce the number of unnecessary bytes read to serve a query, reading only the necessary (&#8220;hot&#8221;) data.</p><p>An ideal system would read <em>exactly</em> the bytes it needs for the data it wants, but the fundamental unit of I/O isn&#8217;t a byte. On SSDs, it&#8217;s a page (typically 4KB). This means that whether you request a single byte, a hundred bytes, or four thousand bytes from your disk, you&#8217;ll still get the full 4KB.</p><p>You can get a feel for this on your own machine:</p><pre><code><code># on macOS, print the filesystem's preferred I/O block size for the root volume
&gt; stat -f %k /
4096</code></code></pre><p>To see this in action, I ran an <a href="https://github.com/agavra/bits-x-pages/tree/main/experiments/fetching_blocks">experiment</a> measuring read latency for 1KB vs 4KB reads using Direct I/O (bypassing the OS page cache). Despite requesting 4x less data, the 1KB reads took about the same time as the 4KB reads (about 9.2&#181;s) on my SSD<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e5HM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e5HM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png 424w, https://substackcdn.com/image/fetch/$s_!e5HM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png 848w, https://substackcdn.com/image/fetch/$s_!e5HM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png 1272w, https://substackcdn.com/image/fetch/$s_!e5HM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!e5HM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png" width="1456" height="317" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:317,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102269,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/183585370?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e5HM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png 424w, https://substackcdn.com/image/fetch/$s_!e5HM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png 848w, https://substackcdn.com/image/fetch/$s_!e5HM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png 1272w, https://substackcdn.com/image/fetch/$s_!e5HM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8860c0b4-be08-4b07-b5fb-9d5dc32b6cba_3466x754.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here&#8217;s where it gets interesting for database design. Imagine you&#8217;re serving a query that needs a single 256B row that lives somewhere in a 4KB page alongside other rows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y635!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y635!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png 424w, https://substackcdn.com/image/fetch/$s_!Y635!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png 848w, https://substackcdn.com/image/fetch/$s_!Y635!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png 1272w, https://substackcdn.com/image/fetch/$s_!Y635!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y635!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png" width="620" height="247.3684210526316" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:1178,&quot;resizeWidth&quot;:620,&quot;bytes&quot;:32806,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/183585370?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y635!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png 424w, https://substackcdn.com/image/fetch/$s_!Y635!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png 848w, https://substackcdn.com/image/fetch/$s_!Y635!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png 1272w, https://substackcdn.com/image/fetch/$s_!Y635!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6ffe51-0dd7-4dbb-a54b-cf1d3b5aaa87_1178x470.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>To read <code>row3</code>, you have to read the entire page. Rows 1 through 10 come along for the ride whether you want them or not. The ratio of &#8220;data you read&#8221; to &#8220;data you needed&#8221; is an instance of read amplification. In our example, if you needed 256 bytes but read 4KB, your read amplification is 16x. That sounds sub-optimal, but it&#8217;s unavoidable given the hardware constraints.</p><h3>spatial &amp; temporal locality</h3><p>The read amplification from page sizes isn&#8217;t all bad news.</p><p>There&#8217;s a fixed cost overhead to reading a single page from disk. On an SSD the actual data transfer part makes up less than 1-2% of the total latency (the rest comes from processing the command, translating the logical address to the physical location on the disk, etc&#8230;). 
This means that it takes approximately the same amount of time to read one page, regardless of whether that page is 512B on a 512B system or 4KB on a 4KB system.</p><p>To take advantage of the page size, databases attempt to place data that is commonly read together physically close together on the storage device, ideally in the same page. Typically this means one of two things: either the keys are similar (spatial locality) or the keys were written around the same time (temporal locality).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9t2D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9t2D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png 424w, https://substackcdn.com/image/fetch/$s_!9t2D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png 848w, https://substackcdn.com/image/fetch/$s_!9t2D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png 1272w, https://substackcdn.com/image/fetch/$s_!9t2D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!9t2D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png" width="1456" height="766" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:766,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73508,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/183585370?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9t2D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png 424w, https://substackcdn.com/image/fetch/$s_!9t2D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png 848w, https://substackcdn.com/image/fetch/$s_!9t2D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png 1272w, https://substackcdn.com/image/fetch/$s_!9t2D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5619b6c-4ca1-461c-9cda-d5661a2b3c51_1486x782.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This technique helps reduce the overhead of read amplification that comes from reading data you won&#8217;t access. As an example, if I&#8217;m executing a query that sums the total cost of orders from a particular customer my query execution path is likely to perform a loop that looks something like:</p><pre><code><code>total = 0
for order_id in [cust2_order1, cust2_order2, cust2_order3]:
&#9;order = db.get(order_id)
&#9;total += order.cost
</code></code></pre><p>If my first call to <code>db.get(cust2_order1)</code> fetches a 4KB page from disk that also contains the orders <code>cust2_order2</code> and <code>cust2_order3</code>, then when I next call <code>db.get(cust2_order2)</code> that data will already be in the OS page cache, which makes it much faster to load.</p><p>This is why designing the key for your database is so important. It is your first opportunity to take advantage of the spatial locality of the data on disk.</p><h3>mutability on SSDs &amp; blocks</h3><p>Unlike spinning hard drives, data that&#8217;s written onto an SSD cannot be directly replaced without first erasing it. The electromagnetic physics behind it is beyond my expertise, but the way I think about it is that while you can write a small (4KB) page, you cannot target erasure at that level. Instead, erasure works at a much coarser granularity called a block (typically 128-256KB).</p><p>Because of this effect, when you rewrite data on an SSD you are actually writing the data to a new location and telling the SSD controller (a piece of firmware that comes loaded on your SSD) where it now lives. Eventually, the controller will erase large chunks of garbage data in a process that looks very similar to garbage collection in a memory allocator.</p><p>SSDs, therefore, much prefer that pages are not modified over and over again. Instead, they work best when data becomes invalid in large ranges at a time. This means that immutable on-disk data structures work well with SSDs and any mutable structure will cause significant write amplification at the hardware level. 
We&#8217;ll get to the implications of this a little more in a future blog post about B-Trees and LSM trees.</p><p>For now, we&#8217;ll draw the conclusion that immutable storage formats have an edge.</p><h3>storing data durably on SSDs</h3><p>To recap the above sections, SSDs push you to use a data structure that:</p><ol><li><p>Is written and deleted in large batches aligned to the internal block size (e.g. 256KB)</p></li><li><p>Is immutable to avoid the overhead associated with rewriting data</p></li><li><p>Organizes data in a way that clusters related data together to take advantage of the page size</p></li></ol><p>There are several ways to organize immutable data durably that meet these requirements, the simplest of which is an append-only log. In a log, the system writes records sequentially (aligned to block size) and scans from the beginning for reads. You can see why this works well with SSDs: </p><ol><li><p>Logs let you batch data together and write it once you have enough.</p></li><li><p>Logs are immutable, typically with some retention period after which you drop entire sections of the log.</p></li><li><p>Logs organize data in a way that is optimized for one particular read pattern: reading data in the order it was written.</p></li></ol><p>The downside of a log is the performance of random reads: you might need to scan the entire file to find a specific key. Some systems (like Bitcask) address this by keeping the entire key set in memory, with each key pointing to the location of its value on disk.</p><p>To make reads directly from disk efficient without exploding memory usage, you need some other kind of structure. There are a number of different strategies, but for this post I&#8217;ll focus on a typical approach used by row-based systems (SSTs).</p><p>Sorted String Tables (SSTs) build off the ideas from the log to play well with the limits of SSDs (you&#8217;ll see the similarity even further when we talk about LSM trees). 
If you think of a log as an array sorted on an implicit timestamp key, SSTs are essentially the same structure but sorted on a user-defined key instead. To make this possible, databases that use SSTs first buffer data sorted by that key in memory until they&#8217;ve collected enough data to write a large, immutable batch.</p><h1>sorted string tables (SSTs)</h1><p>This section will go over the design and implementation of a basic SST: an immutable storage data structure for key-value data. 
Here&#8217;s a reference to look back on that covers the big-picture of an SST layout (this is inspired from the <a href="https://github.com/slatedb/slatedb/blob/main/schemas/sst.fbs">SlateDB SST layout</a>, though other implementations such as RocksDB are quite similar):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CTOd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CTOd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png 424w, https://substackcdn.com/image/fetch/$s_!CTOd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png 848w, https://substackcdn.com/image/fetch/$s_!CTOd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png 1272w, https://substackcdn.com/image/fetch/$s_!CTOd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CTOd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png" width="390" height="437.78145695364236" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:604,&quot;resizeWidth&quot;:390,&quot;bytes&quot;:48358,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/183585370?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CTOd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png 424w, https://substackcdn.com/image/fetch/$s_!CTOd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png 848w, https://substackcdn.com/image/fetch/$s_!CTOd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png 1272w, https://substackcdn.com/image/fetch/$s_!CTOd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d142aea-18f9-4137-9382-3ef042dd5fd3_604x678.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>data blocks</h2><p>The core concept of a sorted string table is to store data on disk sorted by the lexicographical ordering of the byte representation of their keys. In other words, a single data block of an SST can be thought of as a byte array constructed like this:</p><pre><code><code>struct Record {
&#9;key: Vec&lt;u8&gt;,
&#9;val: Vec&lt;u8&gt;,
}

struct SstDataBlock {
&#9;data: Vec&lt;Record&gt;,
}

impl SstDataBlock {
    fn from_unsorted(data: Vec&lt;Record&gt;) -&gt; Self {
        let mut data = data;
        // sorts the records by the lexicographic order of the keys
        data.sort_by(|a, b| a.key.cmp(&amp;b.key));
        Self { data }
    }
}
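</code></code></pre><p>One caveat with sorting on raw bytes: the byte order only matches the order you have in mind if keys are encoded accordingly. As a minimal sketch, assuming numeric <code>u64</code> ids as keys (the helper names below are hypothetical and not taken from SlateDB or RocksDB), big-endian encoding keeps lexicographic order aligned with numeric order while little-endian encoding does not:</p>

```rust
/// Encodes an id as 8 big-endian bytes. Comparing the resulting arrays
/// byte by byte (lexicographically) gives the same order as comparing ids.
fn order_preserving_key(id: u64) -> [u8; 8] {
    id.to_be_bytes()
}

/// Little-endian encoding, shown only for contrast: the least significant
/// byte comes first, so lexicographic order no longer matches numeric order.
fn little_endian_key(id: u64) -> [u8; 8] {
    id.to_le_bytes()
}

/// Returns true when the keys are already in lexicographic order.
fn is_sorted(keys: &[[u8; 8]]) -> bool {
    keys.windows(2).all(|pair| pair[0] <= pair[1])
}
```

<p>Sorting keys built with the big-endian helper therefore behaves the way you would expect:</p><pre><code><code>let mut keys = [order_preserving_key(256), order_preserving_key(1)];
keys.sort();
// keys[0] is now the encoding of 1 and keys[1] the encoding of 256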
</code></code></pre><p>This strategy is nice for two reasons: first, it takes advantage of the spatial locality principle we discussed above. Similar keys are stored next to one another. Second, and perhaps more fundamental, it allows us to binary search the <code>data</code> within an <code>SstDataBlock</code>:</p><pre><code><code>fn find_record(&amp;self, key: &amp;[u8]) -&gt; Option&lt;&amp;Record&gt; {
    self.data.binary_search_by(|record| record.key.as_slice().cmp(key))
        .ok()
        .map(|index| &amp;self.data[index])
}
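</code></code></pre><p>The same sort order also makes range lookups cheap, which is where the spatial locality pays off. As a minimal sketch (using plain <code>&amp;str</code> keys instead of <code>Vec&lt;u8&gt;</code> to keep it short, with a hypothetical helper named <code>prefix_range</code>), finding every key that shares a prefix is one binary search followed by a short scan:</p>

```rust
/// Returns the half-open index range `(start, end)` of the keys that begin
/// with `prefix`. Assumes `sorted_keys` is sorted lexicographically, which
/// is what makes all matching keys sit next to each other.
fn prefix_range(sorted_keys: &[&str], prefix: &str) -> (usize, usize) {
    // Binary search for the first key that is not smaller than the prefix.
    let start = sorted_keys.partition_point(|k| *k < prefix);
    // Matching keys are contiguous, so count them starting at `start`.
    let len = sorted_keys[start..]
        .iter()
        .take_while(|k| k.starts_with(prefix))
        .count();
    (start, start + len)
}
```

<p>Applied to the customer orders from the earlier example, a call might look like this:</p><pre><code><code>let keys = ["cust1_order1", "cust2_order1", "cust2_order2", "cust2_order3", "cust3_order1"];
let (start, end) = prefix_range(keys.as_slice(), "cust2_");
// keys[start..end] covers the three cust2 orders, which sit side by side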
</code></code></pre><p>But this relies on the in-memory representation of the <code>SstDataBlock</code>. We still need some way to get this struct onto disk which only understands simple byte arrays <code>[u8]</code>. There are many strategies for serializing the data block of an SST; production data systems will often use techniques such as <a href="https://github.com/slatedb/slatedb/blob/4296c696b8b7c1cdf2ad79a9900de0d769a75576/slatedb/src/row_codec.rs#L18-L54">prefix encoding for keys</a> and <a href="https://en.wikipedia.org/wiki/Variable-length_quantity#Variants">varint</a> encoding for lengths to reduce the amount of space a block takes up but for this exercise we&#8217;ll do something simpler: we&#8217;ll just encode the block as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UIPw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UIPw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png 424w, https://substackcdn.com/image/fetch/$s_!UIPw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png 848w, https://substackcdn.com/image/fetch/$s_!UIPw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UIPw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UIPw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png" width="444" height="108.92523364485982" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:210,&quot;width&quot;:856,&quot;resizeWidth&quot;:444,&quot;bytes&quot;:17491,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/183585370?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UIPw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png 424w, https://substackcdn.com/image/fetch/$s_!UIPw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png 848w, 
https://substackcdn.com/image/fetch/$s_!UIPw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png 1272w, https://substackcdn.com/image/fetch/$s_!UIPw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F686a41a2-218f-42ff-9b05-c0e1c3b261fb_856x210.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>And a code mockup for reading and writing these <code>SstDataBlock</code> instances:</p><pre><code><code>fn encode(&amp;self) -&gt; Vec&lt;u8&gt; {
    let mut encoded = Vec::new();
    encoded.extend_from_slice(&amp;(self.data.len() as u32).to_le_bytes());

    for record in &amp;self.data {
        encoded.extend_from_slice(&amp;(record.key.len() as u32).to_le_bytes());
        encoded.extend_from_slice(&amp;record.key);
        encoded.extend_from_slice(&amp;(record.val.len() as u32).to_le_bytes());
        encoded.extend_from_slice(&amp;record.val);
    }

    encoded
}

fn decode(data: &amp;[u8]) -&gt; Self {
  // decode is more verbose so it's omitted here, but it just reads
  // the number of records, then deserializes them one by one reading
  // the length of the key, then the key, then length of the value,
  // then the value itself
  ...
}</code></code></pre><h2>indexes</h2><p>The data block covers how to get a record from within a block, and technically this is all you need for a functioning SST.</p><p>Simply storing data as a series of sorted blocks, however, isn&#8217;t ideal. If that&#8217;s all you did, your query algorithm would end up reading blocks in a binary-search pattern, fetching entire blocks and decompressing them just to see whether or not the key you&#8217;re looking for is even within the range of the block.</p><p>To solve this problem, SSTs introduce multiple layers of indexes.</p><h3>the main index</h3><p>The first index (often just called &#8220;the index&#8221;) is a smaller data structure that just contains the first key of every data block and the location of the block (an offset within a larger file). The index is an order of magnitude smaller than the data blocks it covers, so you can load it into memory and binary search it to find the required data block.</p><p>To put the index size for corresponding data blocks into perspective, let&#8217;s imagine you have 4KB blocks with keys that average 8 bytes (a <code>u64</code>) and values that average 256 bytes. This means a single data block can contain approximately 15 records. The index in the most naive format (counting only the 8-byte keys) can contain 512 keys per 4KB index block (with delta compression you can squeeze in significantly more). 
If each key is the first key of a block, a 4KB index block can index 2MB of data blocks, or 7680 records.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l52U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l52U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png 424w, https://substackcdn.com/image/fetch/$s_!l52U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png 848w, https://substackcdn.com/image/fetch/$s_!l52U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png 1272w, https://substackcdn.com/image/fetch/$s_!l52U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l52U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png" width="566" height="304.3951890034364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:1164,&quot;resizeWidth&quot;:566,&quot;bytes&quot;:44508,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/183585370?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l52U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png 424w, https://substackcdn.com/image/fetch/$s_!l52U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png 848w, https://substackcdn.com/image/fetch/$s_!l52U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png 1272w, https://substackcdn.com/image/fetch/$s_!l52U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4fdcb77-4e57-483a-af87-12de0180349d_1164x626.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The code for the index block would look something like this (without the encoding/decoding section, which looks very similar to the data block encoding and decoding):</p><pre><code><code>pub(crate) struct IndexEntry {
    key: Vec&lt;u8&gt;,
    offset: u64,
}

pub(crate) struct SstIndexBlock {
    entries: Vec&lt;IndexEntry&gt;,
}

/// used to find which block in the SST contains the key
/// that you want using the index. once the data block is
/// identified, it should be deserialized into a SstDataBlock
/// and then searched using SstDataBlock::find_record for the
/// exact record
impl SstIndexBlock {
    fn find_entry(&amp;self, key: &amp;[u8]) -&gt; Option&lt;&amp;IndexEntry&gt; {
        match self.entries
            .binary_search_by(|entry| entry.key.as_slice().cmp(key))
        {
            Ok(i) =&gt; Some(&amp;self.entries[i]),
            Err(insertion_point) =&gt; {
                if insertion_point &gt; 0 {
                    Some(&amp;self.entries[insertion_point - 1])
                } else {
                    None 
                }
            }
        }
    }
}
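</code></code></pre><p>To make the sizing estimate from earlier in this section concrete, here is a small sketch of the arithmetic. The constants mirror that example (4KB blocks, 8-byte keys, 256-byte values) and the names are made up for illustration. Counting the 8-byte <code>offset</code> stored in each <code>IndexEntry</code> is an assumption about a slightly less naive layout than the keys-only estimate:</p>

```rust
const BLOCK_SIZE: usize = 4096; // 4KB data and index blocks
const KEY_LEN: usize = 8; // a u64 key
const VAL_LEN: usize = 256; // average value size
const OFFSET_LEN: usize = 8; // the u64 offset in IndexEntry

/// Records that fit in one data block, ignoring per-record length prefixes.
fn records_per_block() -> usize {
    BLOCK_SIZE / (KEY_LEN + VAL_LEN)
}

/// Entries that fit in one index block. Pass 0 to count keys only (the
/// most naive estimate) or OFFSET_LEN to also count each entry's offset.
fn entries_per_index_block(offset_len: usize) -> usize {
    BLOCK_SIZE / (KEY_LEN + offset_len)
}

/// Bytes of data blocks that a single index block can point at.
fn indexed_bytes(entries: usize) -> usize {
    entries * BLOCK_SIZE
}
```

<p>With the keys-only estimate this reproduces the numbers above. Passing <code>OFFSET_LEN</code> instead halves them, to 256 entries per index block:</p><pre><code><code>let entries = entries_per_index_block(0); // 512 keys
let records = entries * records_per_block(); // 512 * 15 = 7680 records
let bytes = indexed_bytes(entries); // 512 * 4096 bytes = 2MB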
</code></code></pre><h3>filter indexes</h3><p>Up until this point, we made the assumption that all the data you&#8217;d need exists in a single SST. We&#8217;ll discuss why you might not want that in a future blog post about LSM trees, but for now I&#8217;ll ask you to accept that it&#8217;s often better to store data in multiple SSTs instead of one big one.</p><p>If you set up your storage like that, it is helpful to know whether or not a given SST has the key you&#8217;re looking for before you even try reading the index and data blocks. This information is stored in a filter block, and there are typically two kinds of filters in use: min/max filters and Bloom filters.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NdBW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NdBW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png 424w, https://substackcdn.com/image/fetch/$s_!NdBW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png 848w, https://substackcdn.com/image/fetch/$s_!NdBW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NdBW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NdBW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png" width="614" height="289.06529209621993" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:548,&quot;width&quot;:1164,&quot;resizeWidth&quot;:614,&quot;bytes&quot;:48347,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/183585370?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NdBW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png 424w, https://substackcdn.com/image/fetch/$s_!NdBW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png 848w, 
https://substackcdn.com/image/fetch/$s_!NdBW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png 1272w, https://substackcdn.com/image/fetch/$s_!NdBW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d405375-1649-4d11-961d-a4dff4e6921e_1164x548.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Min/Max filters are extremely simple filters that just encode the minimum and maximum keys that exist in the SST. 
This very simple data structure takes up very little space and can save a lot of computational work in certain storage systems that lay out their SSTs across very wide key ranges (think using a number such as a timestamp as the key).</p><p>Bloom filters are more complicated and come with some interesting tradeoffs. Many sources explain how these filters work; if you&#8217;re curious, the <a href="https://en.wikipedia.org/wiki/Bloom_filter">Wikipedia</a> page has an excellent overview. If you&#8217;re not, the important thing to understand is that they&#8217;re a probabilistic data structure that can tell you with 100% certainty that a key does <em>not</em> exist in an SST, but cannot guarantee that a key <em>does</em> exist. Their accuracy is measured by their false positive rate: each false positive means you need to dig into the index and data blocks unnecessarily, only to find out that the key does not in fact exist.</p><h3>index space amplification</h3><p>Indexes and filters aren&#8217;t free and, typically, you want to keep the entire index along with any filters in memory to avoid deserializing indexes over and over again.</p><p>The implication is that indexes are a fundamental way to trade off between read and space amplification.</p><p>To tune between the two, you can change the size of the data blocks (smaller blocks mean larger indexes, but less read amplification as less data needs to be paged in). I ran an <a href="https://github.com/agavra/bits-x-pages/tree/main/experiments/lsm-space-amp">experiment</a> on RocksDB to show this in action<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. The experiment set up a data set of 4GB and varied the block size, checking the total size of the SST (on disk), the size of the index (in memory) and the read throughput.</p><pre><code><code>$ space_amp --block_sizes=4096,8192,32768 --read_ops=50000
Raw payload bytes: ~4GB (33554432 entries)

Block Size    Total SST   Table Mem   Reads/s
4.00KB           3.51GB      38.9MB     13144
8.00KB           3.48GB      19.3MB     11310
32.0KB           3.45GB      4.88MB      9311
</code></code></pre><p>This shows what we expect to see: smaller block sizes mean larger indexes and higher memory utilization, but faster reads. It also shows the diminishing returns of large indexes: when we decrease the block size 8x, memory grows by approximately the same factor but read throughput only improves by 1.4x.</p><p>Bloom filters are not exempt from this tradeoff either: to reduce the false positive rate of a bloom filter, you need a larger one (more bits). The good thing is that the math to compute the false positive rate is straightforward: an optimally tuned filter needs roughly 1.44 &#215; log<sub>2</sub>(1/p) bits per key for a false positive rate p, so a 1% rate costs about 9.6 bits per key. Assuming your SST stores 1 billion keys, the diagram below shows the false positive rate and the corresponding bloom filter size.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BDYd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BDYd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png 424w, https://substackcdn.com/image/fetch/$s_!BDYd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png 848w, https://substackcdn.com/image/fetch/$s_!BDYd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BDYd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BDYd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png" width="430" height="280.575" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:800,&quot;resizeWidth&quot;:430,&quot;bytes&quot;:46675,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/183585370?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BDYd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png 424w, https://substackcdn.com/image/fetch/$s_!BDYd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png 848w, https://substackcdn.com/image/fetch/$s_!BDYd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png 1272w, 
https://substackcdn.com/image/fetch/$s_!BDYd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb19c29bf-ad3f-4e12-b59b-41ed5b8e4a1d_800x522.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>metadata block</h2><p>The final block in an SST is the metadata block. 
Despite its humble size, typically just 48-64 bytes, it plays a critical role by storing the offset and length of the index and filter blocks, plus a magic number and version that allow the SST format to evolve over time.</p><p>But here&#8217;s an interesting question: why are the index and bloom filter blocks at the end of the file?</p><p>The answer reveals a constraint of SST construction. When building a large SSTable you face a choice. Your first option is to buffer the entire SST in memory, build the complete index and bloom filter, then write everything in the &#8220;correct&#8221; order: index, bloom filter, data blocks. Your second option is to stream data blocks to disk one at a time as they&#8217;re built, constructing the index and bloom filter incrementally as you go. If you take the second path, you can&#8217;t know the locations of the blocks until the end, so you can&#8217;t write out the index block until then.</p><p>Note that this optimization only really applies when your data is already in sorted order (such as when merging two SSTs); otherwise you need to buffer the data in memory to sort it in the first place.</p><h1>SSTs are general purpose building blocks</h1><p>While SSTs are an obvious fit for key-value storage, sorted rows turn out to be a remarkably general abstraction. Almost any access pattern you care about can be encoded into SSTs if you&#8217;re clever about how you structure your keys.</p><p>The insight is that lexicographic sorting on byte strings gives you hierarchy for free. When your keys share a prefix, they&#8217;re stored physically adjacent. 
This means a &#8220;range scan over all keys starting with X&#8221; is just a sequential read, which is the best possible access pattern for storage devices (remember that the minimum fetch of data from a disk is typically 4KB, even for an SSD, so if all of that data is relevant then you aren&#8217;t paying excess read amplification).</p><h2>encoding access patterns into keys</h2><p>There&#8217;s a spectrum of cleverness in how you shove data models into key-value form. This section goes through strategies you can use, from obvious to devious:</p><h3>level 0: plain key-value lookups</h3><p>The trivial case models the keys exactly the way you look them up. There&#8217;s nothing clever about this, and often it is all you need:</p><pre><code><code>"user:alice" &#8594; {name: "Alice", email: "alice@example.com"}
"user:bob"   &#8594; {name: "Bob", email: "bob@example.com"}
</code></code></pre><h3>level 1: key namespacing</h3><p>The next requirement might be to store different <em>types</em> of data in the same SST rather than split each data type into its own SST. To do this, you can prefix your keys with the type of data they represent and then construct your lookup key based on that.</p><p>The following example has three key types: metadata keys, users and sessions:</p><pre><code><code>"meta:schema_version" &#8594; "3"
"meta:created_at"     &#8594; "2024-01-15T00:00:00Z"
"user:alice"          &#8594; {name: "Alice", ...}
"user:bob"            &#8594; {name: "Bob", ...}  
"session:abc123"      &#8594; {user: "alice", expires: ...}
</code></code></pre><p>Because keys are sorted, all the <code>meta:</code> keys cluster together, all the <code>user:</code> keys cluster together, and so on. A range scan with prefix <code>user:</code> gives you all users. This is effectively multiple logical tables in one physical SST.</p><h3>level 2: composite keys for hierarchy</h3><p>This is when things start getting interesting. You can effectively &#8220;decompose&#8221; a single complicated key into multiple rows for efficient lookups of both the composite key and the single key. Imagine our users had a set of orders, so the logical data would look like:</p><pre><code><code>"user:alice" &#8594; {
&#9;name: "Alice", 
&#9;orders: [
&#9;&#9;{"id": "7", "item": "catnip", "count": 2},
&#9;&#9;{"id": "112", "item": "dog_treats", "count": 11},
&#9;]
}
</code></code></pre><p>Using the level 0 and level 1 techniques outlined above, we could get decently far:</p><ol><li><p>Level 0: Store <code>"user:alice"</code> as the key, and include the full order details in the document value (matching the logical model).</p></li><li><p>Level 1: Store <code>"user:alice"</code> as the key, include the order ids in the value, and then store <code>"order:7"</code> as a separately namespaced key.</p></li></ol><p>Each has downsides. Level 0 has significant read amplification if you don&#8217;t need all the order details (e.g. you just want to retrieve Alice&#8217;s email address). Level 1 has significant read amplification if you <em>do</em> want all the order details, because you need to fetch every order by its id and those rows may not be in the same data blocks (they&#8217;re ordered by id, so <code>order:7</code> and <code>order:112</code> may not be next to one another).</p><p>The third alternative is to use a composite key:</p><pre><code><code>"user:alice" -&gt; {name: "Alice", ...}
"user:alice,order:7" -&gt; {"id": "7", "item": "catnip", "count": 2}
"user:alice,order:112" -&gt; {"id": "112", "item": "dog_treats", "count": 11}
</code></code></pre><p>Now the query for &#8220;get order 7 for Alice&#8221; is a point lookup for <code>user:alice,order:7</code> and the query &#8220;get all orders for Alice&#8221; is a prefix scan on <code>user:alice,order:*</code>. Both are efficient since the sort order gives you the hierarchy for free<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><h3>level 2.5: wide columns</h3><p>The same technique can be used for handling documents with many sparse columns. For example, you could choose to store a <code>phone_number</code> column separately if most users don&#8217;t submit a value for it when creating a profile:</p><pre><code><code>"user:alice" -&gt; {name: "Alice", ...}
"user:alice,col:phone" -&gt; "+1-650-123-4567"
</code></code></pre><p>This has a nice property: sparse columns are free. If Alice has a &#8220;phone&#8221; column and Bob doesn&#8217;t, you simply don&#8217;t store a key for Bob&#8217;s phone. The downside is that reading the full row now requires a prefix scan, which involves more read amplification.</p><h3>level 3: embedding secondary indexes</h3><p>The hierarchy trick only works if you know every element in the hierarchy you want to query for. This doesn&#8217;t help me if I want to find &#8220;what user has the email <code>alice@example.com</code>?&#8221; The solution is to leverage namespacing to store special keys that represent the secondary index:</p><pre><code><code># primary: lookup by user ID
"user:alice" &#8594; {name: "Alice", email: "alice@example.com"}

# secondary index: lookup by email  
"&lt;idx:email:alice@example.com&gt;" &#8594; "user:alice"
</code></code></pre><p>Now looking up a user by email is two steps: look up the index entry to get the ID, then fetch the primary record. This is one of the most powerful tricks you can use, and allows you to implement many different types of systems on top of SSTs.</p><p>The previous example I gave is a simple secondary index, but you can implement more complicated ones:</p><ul><li><p>A <strong>covering index</strong> can store more than just the document id in its value. For example, if I always want to get the user&#8217;s full name when I look up by email, the index value for <code>idx:email:alice@example.com</code> could contain <code>{"id": "user:alice", "full name": "Alice Ecila"}</code>.</p></li><li><p>An <strong>inverted index</strong> can be stored in SSTs where the keys are <code>term:value</code> (e.g. <code>department:engineering</code>) and the value is a sorted list of document ids, for example compressed as a roaring bitmap. If I want to find &#8220;all documents that are in engineering and have the name alice&#8221; I look up <code>department:engineering</code> and <code>name:alice</code>, then retrieve the intersection of the two lists.</p></li><li><p>A <strong>queue</strong> can be stored in SSTs where the key is the offset in the queue and the value is the full row.</p></li></ul><h2>why this matters</h2><p>The deeper point is that SSTs derive their power from a simple focus on playing to the advantages of SSDs and object storage: immutable data and block-aligned reads (index and filter structures that minimize I/O).</p><p>This alignment with hardware means you can map a surprising variety of use cases onto SSTs without fighting the storage layer. Cassandra and ScyllaDB use SSTs for scalable KV storage. Yugabyte and MyRocks use SSTs to implement full SQL engines. <a href="https://www.datadoghq.com/blog/engineering/timeseries-indexing-at-scale/">Datadog&#8217;s metrics backend</a> stores data in SSTs. 
Kafka&#8217;s log segments are conceptually SSTs optimized for append-only access. If you generalize SSTs a little further, the list of systems that have converged on &#8220;sorted immutable files with indexes and filters&#8221; as their storage primitive is long and growing (ClickHouse, etc&#8230;).</p><p>Once you internalize that &#8220;sorted bytes on disk + binary search + immutability&#8221; is the primitive so much else builds on, the design space for databases becomes clearer.</p><p>But the reality is that most databases aren&#8217;t immutable&#8230; stay tuned for the next entry in this series to learn how multiple SSTs are composed together to form LSM trees.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>If anyone knows why 4KB has a longer tail and wider standard deviation I&#8217;d love to hear from you!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>In order to replicate this experiment you need to really force RocksDB to use the disk. This means disabling the block cache and using Direct I/O to avoid the OS page cache as well.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Typically bloom filters will only tell you if a specific key might be in the SST, meaning they aren&#8217;t used for the range scans used to reconstruct the full document. 
You can circumvent this problem and make them useful for range scans by using <a href="https://github.com/facebook/rocksdb/wiki/Prefix-Seek#configure-prefix-bloom-filter">prefix bloom filters</a> (storing the prefix for the keys in the filter instead of the entire key). That way the bloom filter will store the logical key, which is a prefix of the physical key that is stored.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[frameworks for understanding databases]]></title><description><![CDATA[building mental models for tradeoffs in performance, availability and durability in data systems]]></description><link>https://www.bitsxpages.com/p/frameworks-for-understanding-databases</link><guid isPermaLink="false">https://www.bitsxpages.com/p/frameworks-for-understanding-databases</guid><dc:creator><![CDATA[almog gavra]]></dc:creator><pubDate>Mon, 08 Dec 2025 17:18:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/873ce294-1a74-438c-9e9c-e5da45715858_1200x628.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There are nine types of online databases: relational, key-value, time series, graph, search, vector, analytical, streaming, and object. 
<em>This blog series is for engineers that want to learn more about how these systems work underneath the hood.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZQXn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZQXn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png 424w, https://substackcdn.com/image/fetch/$s_!ZQXn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png 848w, https://substackcdn.com/image/fetch/$s_!ZQXn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png 1272w, https://substackcdn.com/image/fetch/$s_!ZQXn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZQXn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png" width="380" height="296.6381766381766" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:548,&quot;width&quot;:702,&quot;resizeWidth&quot;:380,&quot;bytes&quot;:56673,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/180526631?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZQXn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png 424w, https://substackcdn.com/image/fetch/$s_!ZQXn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png 848w, https://substackcdn.com/image/fetch/$s_!ZQXn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png 1272w, https://substackcdn.com/image/fetch/$s_!ZQXn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bea15b6-fa57-4d16-a479-9e2acbee0288_702x548.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This introductory post first presents a mental model for understanding data system tradeoffs and then outlines the common building blocks. We&#8217;ll follow this up with deep dive series on each of the nine data systems types<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><h2>the mental model</h2><p>All databases have one purpose: to store data so you can retrieve it. This core similarity means that, if you squint, all databases look similar. 
Evaluating them becomes a question of understanding tradeoffs that can be made.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uNDe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uNDe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png 424w, https://substackcdn.com/image/fetch/$s_!uNDe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png 848w, https://substackcdn.com/image/fetch/$s_!uNDe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png 1272w, https://substackcdn.com/image/fetch/$s_!uNDe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uNDe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png" width="606" height="265.37919463087246" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:1192,&quot;resizeWidth&quot;:606,&quot;bytes&quot;:58834,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/180526631?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uNDe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png 424w, https://substackcdn.com/image/fetch/$s_!uNDe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png 848w, https://substackcdn.com/image/fetch/$s_!uNDe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png 1272w, https://substackcdn.com/image/fetch/$s_!uNDe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F364d97af-63ac-4094-9a4d-108979d6350c_1192x522.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 
20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I evaluate data systems on three dimensions, and evaluate each of those dimensions with a set of frameworks (we&#8217;ll dive deeper into these throughout the post):</p><ol><li><p>For reasoning about <strong>performance</strong> the <a href="https://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html">Read, Write and Space Amplification</a> of a system determine how much work is done on query, ingest and compaction respectively. 
Then, the indexes and data orientation are evaluated with the <a href="http://daslab.seas.harvard.edu/rum-conjecture/">RUM conjecture</a>.</p></li><li><p>For reasoning about <strong>availability</strong> in a distributed system, the <a href="https://en.wikipedia.org/wiki/PACELC_design_principle">PACELC</a> framework is an evolution of the widely referenced <a href="https://en.wikipedia.org/wiki/CAP_theorem">CAP</a> theorem. In steady state, you trade off between latency and consistency; only during a network partition do you choose between availability and consistency.</p></li><li><p>For reasoning about <strong>durability</strong> (especially in cloud native systems), the <a href="https://materializedview.io/p/cloud-storage-triad-latency-cost-durability">LCD</a> framework illustrates that you may pick two of Latency, Cost and Durability.</p></li></ol><p>Applying these frameworks helps you choose a data system that serves your use case at an acceptable cost (both in hardware requirements and in operational overhead).</p><h2>performance</h2><h3>framework</h3><p>There&#8217;s the famous <a href="https://static.googleusercontent.com/media/sre.google/en//static/pdf/rule-of-thumb-latency-numbers-letter.pdf">Latency Numbers Every Programmer Should Know</a> chart<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> from Jeff Dean that helps reason about why I/O efficiency is so important.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zHdw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!zHdw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png 424w, https://substackcdn.com/image/fetch/$s_!zHdw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png 848w, https://substackcdn.com/image/fetch/$s_!zHdw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png 1272w, https://substackcdn.com/image/fetch/$s_!zHdw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zHdw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png" width="1346" height="1276" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1276,&quot;width&quot;:1346,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:172144,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/180526631?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zHdw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png 424w, https://substackcdn.com/image/fetch/$s_!zHdw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png 848w, https://substackcdn.com/image/fetch/$s_!zHdw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png 1272w, https://substackcdn.com/image/fetch/$s_!zHdw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e2051ed-ed46-46f3-a83a-96b8dc7450b4_1346x1276.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The goal of a data system&#8217;s storage layout is to (a) ingest data into a durable medium as fast as possible and (b) get the data required to serve a query into L1 cache as fast as possible. Sometimes that data is already there; other times it is on the disk of a machine in another AZ. The latency cost of fetching the latter dominates system performance.</p><p>Different systems make tradeoffs on <a href="https://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html">read, write and space amplification</a>:</p><ol><li><p>Read amplification measures how many bytes are read to serve a single logical query, relative to the bytes that query actually needs. Systems optimized for write throughput tend to have more read amplification.</p></li><li><p>Write amplification quantifies the overhead when a database writes more bytes than strictly necessary to store a piece of data, often due to the index structure or compaction.</p></li><li><p>Space amplification captures the ratio between actual storage consumed and the logical data size, accounting for fragmentation, tombstones, and redundant copies. A good example here is the size of a bloom filter (a larger bloom filter reduces false positives, but takes up more memory).</p></li></ol><h3>indexes</h3><p>A naive database with no index must scan the entire raw storage, potentially across many machines, filtering out points that don&#8217;t match the query. Since this is more data than can fit in L1, much of it has to be fetched from slower, more distant storage. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WEkq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WEkq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png 424w, https://substackcdn.com/image/fetch/$s_!WEkq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png 848w, https://substackcdn.com/image/fetch/$s_!WEkq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png 1272w, https://substackcdn.com/image/fetch/$s_!WEkq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WEkq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png" width="1456" height="586" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82700,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/180526631?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WEkq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png 424w, https://substackcdn.com/image/fetch/$s_!WEkq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png 848w, https://substackcdn.com/image/fetch/$s_!WEkq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png 1272w, https://substackcdn.com/image/fetch/$s_!WEkq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2eafa-a777-428b-8151-8dac79d55a33_1556x626.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Indexing is the strategy used by data systems to reduce the amount of data scanned. I think about indexes in three categories: clustering (primary), secondary and filter indexes.</p><ol><li><p><em>Clustering indexes</em> determine how the raw data is organized. For example, an LSM tree (used by SlateDB, Cassandra, etc&#8230;) stores the key-value pairs in SSTs, which sort the keys in lexicographical order so a query can binary search to find the required key.</p></li><li><p><em>Secondary indexes</em> are auxiliary structures that return primary keys<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, which can then be used to look up the raw data. 
For example, an inverted index (used by Lucene) will return the primary keys of the records that contain a particular value (this is particularly useful in search use cases).</p></li><li><p><em>Filter indexes</em> are often embedded within the primary index to help filter out blocks of data. You&#8217;ve likely heard of bloom filters (used by SlateDB, ClickHouse, etc&#8230;), which can guarantee that a key you are attempting to look up does not exist in a particular block of data (a positive answer, by contrast, may be a false positive).</p></li></ol><p>I reference the <a href="http://daslab.seas.harvard.edu/rum-conjecture/">RUM conjecture</a> when evaluating indexes. This grounds the discussion with the understanding that a single index structure cannot be optimized for read, update AND memory overhead simultaneously. In other words, indexes trade off across read, write and space amplification to improve performance of reads, updates or memory utilization.</p><h3>data orientation</h3><p>The previous section discussed how the index layout affects how fast you can get data into L1 cache; this section discusses the raw data format and its impact on the types of computations a system can execute.</p><p>The primary difference between the way databases store raw data is the way that data is oriented: row or columnar. If your workload reconstructs complete rows choose a row-based data system. 
If your workload aggregates values across rows choose a columnar system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2h29!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2h29!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png 424w, https://substackcdn.com/image/fetch/$s_!2h29!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png 848w, https://substackcdn.com/image/fetch/$s_!2h29!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png 1272w, https://substackcdn.com/image/fetch/$s_!2h29!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2h29!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png" width="526" height="364.64251207729467" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:828,&quot;resizeWidth&quot;:526,&quot;bytes&quot;:42394,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/180526631?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2h29!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png 424w, https://substackcdn.com/image/fetch/$s_!2h29!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png 848w, https://substackcdn.com/image/fetch/$s_!2h29!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png 1272w, https://substackcdn.com/image/fetch/$s_!2h29!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9141d696-200d-4ae4-bb97-94680e2e2758_828x574.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 
0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Why? In a row orientation, all the values for a single row are stored in one place, but aggregating a few columns across rows means reading unnecessary data. In a column orientation, the values for a single column across all rows are stored in one place, but reconstructing an original row means reading unnecessary data.</p><p>Data orientation directly impacts read amplification. To illustrate, consider rows of roughly 16KB (32 columns of 500B each) stored in blocks of 64KB. In a row-based system, fetching one row reads a 64KB block but discards 48KB of unnecessary data (read amplification of 4x). In a column-based system, fetching that same row requires reading 32 separate 64KB columnar blocks, 2MB in total, and discarding nearly all of it (read amplification of roughly 128x, since 2MB / 16KB = 128). The opposite is true when aggregating columns across multiple rows. 
</p><p>Beyond minimizing read amplification, aligning data layout with workload patterns enables optimizations like vectorized operations (SIMD) in columnar systems, which are only possible because each loaded memory block is a single-typed array.</p><h3>compaction &amp; garbage collection</h3><p>Compaction and garbage collection (GC) are the background work that accompanies append-only writes, a common strategy for reducing write amplification at the expense of read and space amplification. A data system writes a whole block at a time, which means that once part of a block&#8217;s contents is modified the block is no longer entirely valid. Since rewriting a block in place may be impossible due to fragmentation, the strategy most systems use is to append new blocks that model updates to the old blocks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cY5h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe27b94f-f053-4a3a-b264-dc6633789587_982x912.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cY5h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe27b94f-f053-4a3a-b264-dc6633789587_982x912.png 424w, https://substackcdn.com/image/fetch/$s_!cY5h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe27b94f-f053-4a3a-b264-dc6633789587_982x912.png 848w, https://substackcdn.com/image/fetch/$s_!cY5h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe27b94f-f053-4a3a-b264-dc6633789587_982x912.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cY5h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe27b94f-f053-4a3a-b264-dc6633789587_982x912.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cY5h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe27b94f-f053-4a3a-b264-dc6633789587_982x912.png" width="506" height="469.93075356415477" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe27b94f-f053-4a3a-b264-dc6633789587_982x912.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:912,&quot;width&quot;:982,&quot;resizeWidth&quot;:506,&quot;bytes&quot;:80289,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/180526631?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe27b94f-f053-4a3a-b264-dc6633789587_982x912.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cY5h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe27b94f-f053-4a3a-b264-dc6633789587_982x912.png 424w, https://substackcdn.com/image/fetch/$s_!cY5h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe27b94f-f053-4a3a-b264-dc6633789587_982x912.png 848w, 
https://substackcdn.com/image/fetch/$s_!cY5h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe27b94f-f053-4a3a-b264-dc6633789587_982x912.png 1272w, https://substackcdn.com/image/fetch/$s_!cY5h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe27b94f-f053-4a3a-b264-dc6633789587_982x912.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Compaction takes files that have overlapping data and rewrites them into files with non-overlapping data. 
These compaction jobs don&#8217;t immediately need to clean up the old files that are still sitting around. Garbage collection can run separately, cleaning up orphaned files. This separation allows keeping &#8220;backups&#8221; of old data.</p><h2>availability</h2><h3>framework</h3><p>Few applications <em>need</em> more than 3 nines of availability (~9h of downtime yearly) because downtime beyond that is frequently the result of human, not machine, failure.</p><p>Despite this, more availability is always better, so unless your system falls into the small group that truly requires high availability (HA), it becomes a cost tradeoff: how much does a minute of downtime cost compared to the overhead of operating with HA?</p><p>To make sense of tradeoffs in HA systems, I recommend you read up on <a href="https://en.wikipedia.org/wiki/PACELC_design_principle">PACELC</a>. The summary is that during a network partition the CAP theorem applies; otherwise you trade off between consistency and latency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-I57!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-I57!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png 424w, https://substackcdn.com/image/fetch/$s_!-I57!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png 848w, 
https://substackcdn.com/image/fetch/$s_!-I57!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!-I57!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-I57!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png" width="1456" height="1072" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1072,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125912,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/180526631?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-I57!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png 424w, 
https://substackcdn.com/image/fetch/$s_!-I57!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png 848w, https://substackcdn.com/image/fetch/$s_!-I57!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png 1272w, https://substackcdn.com/image/fetch/$s_!-I57!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff895ae94-84f9-44eb-ae68-8bcae5b292f9_1486x1094.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>system architecture</h3><p>Availability in systems is tied to the system&#8217;s deployment architecture. Consider the two-dimensional matrix below. The first axis is the leader/leaderless dimension which models whether or not all writes must be funneled through a single node. The second axis is the disaggregated/distributed dimension which models whether replication is done within the system or delegated to a separate (typically object storage) system:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hBsx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hBsx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png 424w, https://substackcdn.com/image/fetch/$s_!hBsx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png 848w, https://substackcdn.com/image/fetch/$s_!hBsx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!hBsx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hBsx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png" width="1456" height="995" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:995,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99910,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/180526631?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hBsx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png 424w, https://substackcdn.com/image/fetch/$s_!hBsx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png 848w, https://substackcdn.com/image/fetch/$s_!hBsx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png 1272w, https://substackcdn.com/image/fetch/$s_!hBsx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83c68b2-1742-4ee5-95cb-6dea5a02ae26_1486x1016.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p><em>Single Node Deployment:</em> omitted from the diagram above is a non-HA deployment strategy that makes sense for some data systems like caches and edge database deployments. Example: <a href="https://sqlite.org/">sqlite</a></p></li><li><p><em>Single Writer, Multi-Reader:</em> a single-node system with the additional feature of writing to shared storage (e.g. S3) so other readers can read it. 
Example: <a href="https://slatedb.io/">slatedb</a></p></li><li><p><em>Multi-Writer, Multi-Reader:</em> a deployment mode that allows for writes and reads across many nodes, typically resulting in an async coordination layer. Examples: <a href="https://www.warpstream.com/">warpstream</a>, <a href="https://quickwit.io/">quickwit</a></p></li><li><p><em>Leader / Follower:</em> a distributed system that requires writes to go to a single leader, which will replicate to followers before acknowledging the write. Example: <a href="https://kafka.apache.org/">kafka</a></p></li><li><p><em>Leaderless:</em> a distributed system that can accept writes on any machine, but still replicates them to other nodes in the cluster. Example: <a href="https://www.scylladb.com/">scylla</a></p></li></ol><p>The left half of the quadrant prioritizes consistency. Since valid writes are only handled by a single machine, it is easier to provide ACID semantics. Single-writer systems are always consistent, but allow trading off latency for durability (see next section). Leader-follower systems allow configurable consistency by reducing the number of acks required from followers. Both kinds of systems experience downtime in the event of a writer or leader failure.</p><p>Leaderless systems prioritize availability, since a single node failure does not require a leader election process or a new writer to restart. Consistency, on the other hand, comes at the cost of increased latency because all writes must meet quorum.</p><h2>durability</h2><h3>framework</h3><p>The primary consideration for durability is the tradeoff between <a href="https://materializedview.io/p/cloud-storage-triad-latency-cost-durability">latency, durability and cost</a>. 
You can see the durability guarantees of each of the following durability strategies in the chart below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x5Rh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x5Rh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png 424w, https://substackcdn.com/image/fetch/$s_!x5Rh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png 848w, https://substackcdn.com/image/fetch/$s_!x5Rh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png 1272w, https://substackcdn.com/image/fetch/$s_!x5Rh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x5Rh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png" width="1456" height="658" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:83616,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/180526631?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x5Rh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png 424w, https://substackcdn.com/image/fetch/$s_!x5Rh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png 848w, https://substackcdn.com/image/fetch/$s_!x5Rh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png 1272w, https://substackcdn.com/image/fetch/$s_!x5Rh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28868b06-df6a-4193-80ca-fbbbb7d062e4_1500x678.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you keep data only in-memory on a single node, you have no durability. When you <code>fsync</code> that data to a local disk, you gain a bit more. When you replicate that to a network disk (EBS) or across multiple zones, you get more with each step.</p><p>In coupled storage and compute systems, durability was more closely tied to availability. If you were running with 3 replicas across zones and prioritized consistency then you had three nodes that potentially had the data in memory, making durability of the storage a slightly less important concern. 
In a disaggregated system, where you no longer replicate within the compute subsystem and instead delegate that to object storage, the tradeoff becomes more acute.</p><h3>storage options</h3><p>Once you&#8217;ve decided how durable your data needs to be, the next choice is where to store it. The chart below shows the latency and cost deltas between different solutions:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cBrI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cBrI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png 424w, https://substackcdn.com/image/fetch/$s_!cBrI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png 848w, https://substackcdn.com/image/fetch/$s_!cBrI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png 1272w, https://substackcdn.com/image/fetch/$s_!cBrI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cBrI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png" width="1444" height="678" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:1444,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88314,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.bitsxpages.com/i/180526631?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cBrI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png 424w, https://substackcdn.com/image/fetch/$s_!cBrI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png 848w, https://substackcdn.com/image/fetch/$s_!cBrI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png 1272w, https://substackcdn.com/image/fetch/$s_!cBrI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b04a099-e4f4-4d80-80d2-7030d7eb0434_1444x678.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It is possible to have a tiered write system: recent writes go to a faster storage solution and are eventually tiered to slower ones. This is an example of the LCD tradeoff: if recent writes go to local disk, they can be fast (and cheaper than keeping them in memory) but are less likely to survive an outage.</p><h3>WAL (write ahead log)</h3><p>Write ahead logs (WALs) are a strategy that illustrates the tradeoffs of both the LCD and RUM conjectures. Data systems write changes to the database in a write-optimized format (effectively logging the request with no transformations) to durable storage. In the LCD framework, WALs increase cost (by increasing write amplification) so that writes can be durable with lower latency. 
WALs are also an extreme example of UM data structures (update &amp; memory optimized), but reading from them is inefficient and reserved for failure situations.</p><h2>fsync()</h2><p>That&#8217;s it for today, folks! I hope this has helped shape the way you understand and evaluate different data systems. Next time, we&#8217;ll pick up key value stores as the first data system to examine in more depth, to see how we can apply this framework to a specific type of database.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>If you can&#8217;t wait, I highly recommend <a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/">Designing Data Intensive Applications</a>. It&#8217;s a phenomenal book that covers a bunch of core database concepts. If you can wait a little bit, the second edition is coming out soon and is likely to be updated with important additions.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I&#8217;ve modified the latency numbers chart to include some important cloud-native latency numbers, such as S3/S3 Express, and removed less relevant numbers.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Often secondary indexes will point directly to the block of data containing the corresponding row instead of just the key, which saves one lookup but requires more index maintenance when the data changes (nonfunctional updates to the data layout, such as compaction, require modifying such 
indexes).</p></div></div>]]></content:encoded></item></channel></rss>