{"id":9598,"date":"2022-09-14T21:04:00","date_gmt":"2022-09-14T19:04:00","guid":{"rendered":"https:\/\/monodes.com\/predaelli\/?p=9598"},"modified":"2022-09-14T17:05:22","modified_gmt":"2022-09-14T15:05:22","slug":"choosing-a-postgres-primary-key","status":"publish","type":"post","link":"https:\/\/monodes.com\/predaelli\/2022\/09\/14\/choosing-a-postgres-primary-key\/","title":{"rendered":"Choosing a Postgres Primary Key"},"content":{"rendered":"<p><a href=\"https:\/\/supabase.com\/blog\/choosing-a-postgres-primary-key\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/i0.wp.com\/monodes.com\/predaelli\/wp-content\/uploads\/sites\/4\/2022\/09\/choosing-a-postgres-primary-key-og.jpg?w=910&#038;ssl=1\" alt=\"\"\/><\/a>&nbsp;<em><a href=\"https:\/\/supabase.com\/blog\/choosing-a-postgres-primary-key\">Choosing a Postgres Primary Key<\/a><\/em><\/p>\n<p><!--more--><!--nextpage--><\/p>\n<blockquote>\n<div class=\"mb-16 max-w-5xl space-y-8\">\n<div class=\"space-y-4\">\n<h1 class=\"h1\">Choosing a Postgres Primary Key<\/h1>\n<div class=\"text-scale-900 flex space-x-3 text-sm\">\n<p>2022-09-08<\/p>\n<p>\u2022<\/p>\n<p>24 minute read<\/p>\n<\/div>\n<div class=\"flex flex-col gap-3 pt-6 md:flex-row md:gap-0 lg:gap-3\">\n<div class=\"mr-4 w-max\">\n<div class=\"flex items-center gap-3\">\n<div class=\"w-10\"><img decoding=\"async\" class=\"dark:border-dark rounded-full border\" src=\"https:\/\/supabase.com\/_next\/image?url=https%3A%2F%2Fgithub.com%2Ft3hmrman.png&amp;w=3840&amp;q=75\" sizes=\"100vw\" srcset=\"\/_next\/image?url=https%3A%2F%2Fgithub.com%2Ft3hmrman.png&amp;w=640&amp;q=75 640w, \/_next\/image?url=https%3A%2F%2Fgithub.com%2Ft3hmrman.png&amp;w=750&amp;q=75 750w, \/_next\/image?url=https%3A%2F%2Fgithub.com%2Ft3hmrman.png&amp;w=828&amp;q=75 828w, \/_next\/image?url=https%3A%2F%2Fgithub.com%2Ft3hmrman.png&amp;w=1080&amp;q=75 1080w, \/_next\/image?url=https%3A%2F%2Fgithub.com%2Ft3hmrman.png&amp;w=1200&amp;q=75 1200w, \/_next\/image?url=https%3A%2F%2Fgithub.com%2Ft3hmrman.png&amp;w=1920&amp;q=75 1920w, \/_next\/image?url=https%3A%2F%2Fgithub.com%2Ft3hmrman.png&amp;w=2048&amp;q=75 2048w, \/_next\/image?url=https%3A%2F%2Fgithub.com%2Ft3hmrman.png&amp;w=3840&amp;q=75 3840w\" data-nimg=\"responsive\"\/><\/div>\n<div class=\"flex flex-col\"><span class=\"text-scale-1200 mb-0 text-sm\">Victor<\/span><span class=\"text-scale-900 mb-0 text-xs\">Guest Author<\/span><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"grid grid-cols-12 lg:gap-16 xl:gap-8\">\n<div class=\"col-span-12 lg:col-span-7 xl:col-span-7\">\n<div class=\"relative mb-8 h-96 w-full overflow-auto rounded-lg border\"><img decoding=\"async\" src=\"https:\/\/supabase.com\/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fchoosing-a-postgres-primary-key-og.jpg&amp;w=3840&amp;q=75\" sizes=\"100vw\" srcset=\"\/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fchoosing-a-postgres-primary-key-og.jpg&amp;w=640&amp;q=75 640w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fchoosing-a-postgres-primary-key-og.jpg&amp;w=750&amp;q=75 750w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fchoosing-a-postgres-primary-key-og.jpg&amp;w=828&amp;q=75 828w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fchoosing-a-postgres-primary-key-og.jpg&amp;w=1080&amp;q=75 1080w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fchoosing-a-postgres-primary-key-og.jpg&amp;w=1200&amp;q=75 1200w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fchoosing-a-postgres-primary-key-og.jpg&amp;w=1920&amp;q=75 1920w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fchoosing-a-postgres-primary-key-og.jpg&amp;w=2048&amp;q=75 2048w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fchoosing-a-postgres-primary-key-og.jpg&amp;w=3840&amp;q=75 3840w\" data-nimg=\"fill\"\/><\/div>\n<article class=\"prose prose-docs\">Primary keys are important. They uniquely identify rows of data in tables, and make it easy to fetch data. The job of a database is to archive and recall data and you&#8217;re going to have a hard time finding data without a good primary key or a good index.<\/p>\n<p>Sometimes it makes sense to use a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_key\">\u201cnatural key\u201d<\/a> (like an <code class=\"\" data-line=\"\">email<\/code> column in a <code class=\"\" data-line=\"\">users<\/code> table) and sometimes it\u2019s better to use a \u201c<a href=\"https:\/\/en.wikipedia.org\/wiki\/Surrogate_key\">surrogate key\u201d<\/a>, a value made <em>for the purpose<\/em> of identifying a row (and has no other meaning).<\/p>\n<p>At first glance, the question of <em>which<\/em> primary key to use is easy! Just throw a <code class=\"\" data-line=\"\">integer<\/code>\/<code class=\"\" data-line=\"\">serial<\/code> on there, right? Numeric IDs are cool, but what about random value IDs or <a href=\"https:\/\/en.wikipedia.org\/wiki\/Uuid\">Universally Unique IDentifiers (UUIDs)<\/a>?<\/p>\n<p>Turns out the question of which identifier (and in this case, UUID) to use is complicated &#8212; we&#8217;re going to dive into some of the complexity and inherent trade-offs, and figure things out:<\/p>\n<ul>\n<li>What are the choices for identifiers?<\/li>\n<li>If we choose to use\/add UUIDs, which ones should we choose?<\/li>\n<li>How can we get these UUIDs into postgres?<\/li>\n<li>Which UUIDs perform best?<\/li>\n<\/ul>\n<p>But first, a quick history lesson.<\/p>\n<h2 id=\"a-brief-history-of-identifiers-and-why-we-use-them\">A brief history of identifiers and why we use them<\/h2>\n<h3 id=\"integerbiginteger\"><code class=\"\" data-line=\"\">integer<\/code>\/<code class=\"\" data-line=\"\">biginteger<\/code><\/h3>\n<p>Let&#8217;s think about identifying rows of data from first principles. What&#8217;s the first way you might think of identifying things? Assigning them numbers!<\/p>\n<p>We can set up a table like this:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">-- Let&#039;s enable access to case-insensitive text\nCREATE EXTENSION IF NOT EXISTS citext;\n\n-- Heres a basic users table\nCREATE TABLE users (\n  id integer PRIMARY KEY,\n  email citext NOT NULL CHECK (LENGTH(email) &lt;span class=&quot;hljs-operator&quot;&gt;&lt;&lt;\/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;255&lt;\/span&gt;),\n  name text NOT NULL\n);\n\n-- Let&#039;s assume we don&#039;t want two users with the exact same email\nCREATE UNIQUE INDEX users_email_uniq ON users USING BTREE (email);\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<\/article>\n<\/div>\n<\/div>\n<p>This looks great, but what should <code class=\"\" data-line=\"\">id<\/code> be on new rows? I don&#8217;t know &#8212; maybe the application can figure it out? If they store some value in memory? That doesn&#8217;t seem right.<\/p>\n<p>Maybe we could figure out the next integer from what&#8217;s in the table itself &#8212; we just need to be able to &#8220;count&#8221; upwards. We <em>do<\/em> have all the <code class=\"\" data-line=\"\">users<\/code> tables rows in there, so we should be able to do it:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">INSERT INTO users (id, email, name)\nSELECT COUNT(&lt;span class=&quot;hljs-operator&quot;&gt;*&lt;\/span&gt;) &lt;span class=&quot;hljs-operator&quot;&gt;+&lt;\/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;1&lt;\/span&gt;, &#039;new@example.com&#039;, &#039;new_user&#039; FROM users;\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>After running that query, we can double check our results:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">SELECT &lt;span class=&quot;hljs-operator&quot;&gt;*&lt;\/span&gt; FROM users;\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<table>\n<thead>\n<tr>\n<th>id<\/th>\n<th>email<\/th>\n<th>name<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code class=\"\" data-line=\"\">1<\/code><\/td>\n<td><code class=\"\" data-line=\"\">new@example.com<\/code><\/td>\n<td><code class=\"\" data-line=\"\">new user<\/code><\/td>\n<\/tr>\n<tr>\n<td><code class=\"\" data-line=\"\">2<\/code><\/td>\n<td><code class=\"\" data-line=\"\">new2@example.com<\/code><\/td>\n<td><code class=\"\" data-line=\"\">new user<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Using <code class=\"\" data-line=\"\">COUNT(*)<\/code> in our query is not the most efficient (or even easiest) solution though, and hopefully it&#8217;s clear why &#8212; <strong>counting a sequence of numbers for primary keys is a feature built in to Postgres<\/strong>!<\/p>\n<p><a href=\"https:\/\/www.postgresql.org\/docs\/current\/datatype-numeric.html#DATATYPE-SERIAL\">serial\/bigserial<\/a> is the right tool in our toolbox to maintain a shared, auto-incrementing sequence of numbers. Let&#8217;s pretend we read the <a href=\"https:\/\/www.postgresql.org\/docs\">postgres documentation<\/a> and use those instead.<\/p>\n<h3 id=\"serialbigserial\"><code class=\"\" data-line=\"\">serial<\/code>\/<code class=\"\" data-line=\"\">bigserial<\/code><\/h3>\n<p><code class=\"\" data-line=\"\">serial<\/code> is essentially a convenient macro for using <a href=\"https:\/\/www.postgresql.org\/docs\/current\/sql-createsequence.html\">Postgres sequences<\/a>, a database-managed auto-incrementing stream of <code class=\"\" data-line=\"\">integer<\/code>.<\/p>\n<p>Let&#8217;s hear it from the docs:<\/p>\n<blockquote><p>The data types <code class=\"\" data-line=\"\">smallserial<\/code>, <code class=\"\" data-line=\"\">serial<\/code> and <code class=\"\" data-line=\"\">bigserial<\/code> are not true types, but merely a notational convenience for creating unique identifier columns (similar to the <code class=\"\" data-line=\"\">AUTO_INCREMENT<\/code> property supported by some other databases).<\/p><\/blockquote>\n<p>Using a <code class=\"\" data-line=\"\">serial<\/code> column to create the <code class=\"\" data-line=\"\">users<\/code> table would look like this:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">CREATE TABLE users (\n  id serial PRIMARY KEY,\n  email citext NOT NULL CHECK (LENGTH(email) &lt;span class=&quot;hljs-operator&quot;&gt;&lt;&lt;\/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;255&lt;\/span&gt;),\n  name text NOT NULL\n);\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>OK, now let&#8217;s try inserting into it &#8211; we shouldn&#8217;t have to specify <code class=\"\" data-line=\"\">id<\/code>:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">INSERT INTO users_serial (email, name) \nVALUES (&#039;user@example.com&#039;, &#039;new user&#039;);\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">SELECT &lt;span class=&quot;hljs-operator&quot;&gt;*&lt;\/span&gt; FROM users;\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<table>\n<thead>\n<tr>\n<th>id<\/th>\n<th>email<\/th>\n<th>name<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code class=\"\" data-line=\"\">1<\/code><\/td>\n<td><code class=\"\" data-line=\"\">user@example.com<\/code><\/td>\n<td><code class=\"\" data-line=\"\">new user<\/code><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">(&lt;span class=&quot;hljs-number&quot;&gt;1&lt;\/span&gt; row) \nINSERT &lt;span class=&quot;hljs-number&quot;&gt;0&lt;\/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;1&lt;\/span&gt;\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>It works, as you might expect &#8211; now the application doesn&#8217;t have to <em>somehow magically know<\/em> the right ID to use when inserting.<\/p>\n<p>But what does <code class=\"\" data-line=\"\">serial<\/code> actually do? Using a serial column is operationally similar to the following SQL:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">CREATE SEQUENCE tablename_colname_seq AS integer;\nCREATE TABLE tablename (\n    colname integer NOT NULL DEFAULT nextval(&#039;tablename_colname_seq&#039;)\n);\nALTER SEQUENCE tablename_colname_seq OWNED BY tablename.colname;\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>Back in application land, the <code class=\"\" data-line=\"\">INSERT<\/code> statement returns, and provides the new <code class=\"\" data-line=\"\">id<\/code> the database assigned our new row. Multiple application instances don&#8217;t need to coordinate what ID to use &#8212; they just <em>don&#8217;t<\/em>, and find out from the database.<\/p>\n<p><strong>We&#8217;ve taken a somewhat meandering path to get here, but this is the standard solution for most reasonable database schemas.<\/strong><\/p>\n<h3 id=\"why-not-stop-at-serial\">Why not stop at <code class=\"\" data-line=\"\">serial<\/code>?<\/h3>\n<p>There are few issues with sequences:<\/p>\n<ul>\n<li>When writing automation that simply iterates through id values, note that <code class=\"\" data-line=\"\">serial<\/code> columns can have gaps, even if you never <code class=\"\" data-line=\"\">DELETE<\/code> (e.x. if an <code class=\"\" data-line=\"\">INSERT<\/code> was rolled back \u2014 sequences live <em>outside<\/em> transactions).<\/li>\n<li>When used from outside code <code class=\"\" data-line=\"\">serial<\/code> may leak some data or give attackers an edge (e.x., if <code class=\"\" data-line=\"\">yoursite.com\/users\/50<\/code> works, how about <code class=\"\" data-line=\"\">yoursite.com\/users\/51<\/code>?).<\/li>\n<li><code class=\"\" data-line=\"\">serial<\/code> is PostgreSQL specific (i.e. not SQL standards compliant)<\/li>\n<\/ul>\n<p>Don&#8217;t be too put off by these reasons &#8212; <code class=\"\" data-line=\"\">serial<\/code> is still the go-to for most use-cases.<\/p>\n<p>Even the last point about <code class=\"\" data-line=\"\">serial<\/code> not being standards compliant is solved in Postgres 10+ by using&#8230;<\/p>\n<h3 id=\"integerbiginteger-again\"><code class=\"\" data-line=\"\">integer<\/code>\/<code class=\"\" data-line=\"\">biginteger<\/code> (again!)<\/h3>\n<p>Postgres 10 <a href=\"https:\/\/www.postgresql.org\/docs\/10\/release-10.html#id-1.11.6.26.5.7\">added support<\/a> for the <code class=\"\" data-line=\"\">IDENTITY<\/code> column syntax in <code class=\"\" data-line=\"\">CREATE TABLE<\/code> (EDB has a great writeup on the <a href=\"https:\/\/www.enterprisedb.com\/blog\/postgresql-10-identity-columns-explained\">addition<\/a>).<\/p>\n<p>This means we can happily go back to using <code class=\"\" data-line=\"\">integer<\/code>\/<code class=\"\" data-line=\"\">biginteger<\/code>:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">CREATE TABLE (\n  id integer PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,\n  email citext NOT NULL CHECK (LENGTH(email) &lt;span class=&quot;hljs-operator&quot;&gt;&lt;&lt;\/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;255&lt;\/span&gt;),\n  name text NOT NULL\n)\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>So how does <em>this<\/em> work? Postgres does the same thing under the covers &#8212; it generates a <code class=\"\" data-line=\"\">sequence<\/code>. As this syntax is standards compliant, it&#8217;s generally recommended practice for DBAs going forward, for the sake of the realm.<\/p>\n<p><strong>So <code class=\"\" data-line=\"\">integer<\/code>s are great, but information leakage is still a problem. How do we fix that?<\/strong> Well, make the numbers random, obviously.<\/p>\n<h3 id=\"random-numeric-ids\">Random Numeric IDs<\/h3>\n<p>Let&#8217;s say the application using the DB has some <a href=\"https:\/\/python.org\">Python<\/a> code like the following:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">import randrange from random\nimport User from models;\nMAX_RANDOM_USER_ID &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;\/span&gt; &lt;span class=&quot;hljs-number&quot;&gt;1&lt;\/span&gt;_000_000_000\ndef create_user():\n    &quot;&quot;&quot;\n    Add new user to the database\n    &quot;&quot;&quot;\n    user_id &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;\/span&gt; randrange(&lt;span class=&quot;hljs-number&quot;&gt;1&lt;\/span&gt;, MAX_RANDOM_USER_ID)\n    user &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;\/span&gt; User(id&lt;span class=&quot;hljs-operator&quot;&gt;=&lt;\/span&gt;user_id, email&lt;span class=&quot;hljs-operator&quot;&gt;=&lt;\/span&gt;&quot;new@example.com&quot;, name&lt;span class=&quot;hljs-operator&quot;&gt;=&lt;\/span&gt;&quot;new user&quot;)\n    db.save(user)\n\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>That <em>looks<\/em> good, but there&#8217;s a problem &#8212; <a href=\"https:\/\/docs.python.org\/3\/library\/random.html\">random<\/a> is a <em>pseudorandom<\/em> generator.<\/p>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Pseudorandomness\">Pseudo-random<\/a> numbers are <em>not<\/em> what we want for ensuring user IDs cannot be easily guessed\/collide. It&#8217;s possible to get the exact same sequence of values out of a pseudo-random number generator by using the same <em>seed value<\/em>.<\/p>\n<p>Sometimes you <em>want<\/em> pseudo-random behavior (let&#8217;s say for testing or fuzzing), but it&#8217;s generally not desired for production systems that might run from a duplicated identical application image, since they <em>could<\/em> have weak pseudo-random seed initialization.<\/p>\n<h3 id=\"secure-random-numeric-ids\">(Secure) Random Numeric IDs<\/h3>\n<p><strong>At the very least<\/strong> we need a properly secure random numbers &#8212; we need Python&#8217;s <a href=\"https:\/\/docs.python.org\/3\/library\/secrets.html\">secrets<\/a> module:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-python\" data-line=\"\">from secrets import randbelow\n# ...\n&lt;span class=&quot;hljs-function&quot;&gt;def&lt;\/span&gt; &lt;span class=&quot;hljs-function&quot;&gt;create_user&lt;\/span&gt;&lt;span class=&quot;hljs-function&quot;&gt;():&lt;\/span&gt;\n    &quot;&quot;&quot;\n    Add new user to the database (using secure random numbers)\n    &quot;&quot;&quot;\n    user_id = randrange(&lt;span class=&quot;hljs-number&quot;&gt;1&lt;\/span&gt;, MAX_RANDOM_USER_ID)\n    user = User(id=user_id, email=&quot;new@example.com&quot;, name=&quot;new user&quot;)\n    db.save(user)\n\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>Now we have a secure random value coming in for our user IDs. But having values like <code class=\"\" data-line=\"\">583247<\/code> and <code class=\"\" data-line=\"\">8923916<\/code> get generated are cool and all, but there are a few problems:<\/p>\n<ul>\n<li>These numbers are random <em>and<\/em> quite inscrutable<\/li>\n<li>The keyspace is fairly small (maybe good for comments on a popular website, but not for IDs!)<\/li>\n<li>People can still <em>technically<\/em> check them all (the guessing space is 1 to <code class=\"\" data-line=\"\">MAX_RANDOM_USER_ID<\/code>!)<\/li>\n<\/ul>\n<p>We need something better.<\/p>\n<h3 id=\"secure-random-uuids\">(Secure) Random UUIDs<\/h3>\n<p>Along comes UUIDs &#8212; you&#8217;re probably used to seeing them now, values like this UUIDv4:<\/p>\n<p><code class=\"\" data-line=\"\">468e8075-5815-4fe2-80d3-45a31827954b<\/code> .<\/p>\n<p>They&#8217;re <em>very<\/em> random (almost always generated with secure random sources), and while they&#8217;re even worse for remembering, they&#8217;re near impossible to practically guess &#8212; the search space is just too large!<\/p>\n<p>More importantly, UUIDs introduce methodology to the madness &#8212; different versions of UUID are derived different ways &#8212; combined with other sources of randomness or known values.<\/p>\n<p>There are a lot of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Universally_unique_identifier#Versions\">versions of UUID<\/a>, but let&#8217;s discuss the ones we&#8217;re more likely to use\/see day to day.<\/p>\n<h3 id=\"uuidv1\">UUIDv1<\/h3>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Universally_unique_identifier#Version_1_(date-time_and_MAC_address)\">Version 1 UUIDs<\/a> have three two components:<\/p>\n<ul>\n<li>a 60 bit date-time (at nanosecond precision)<\/li>\n<li>a 48 bit <a href=\"https:\/\/en.wikipedia.org\/wiki\/MAC_address\">MAC address<\/a><\/li>\n<\/ul>\n<p>But where&#8217;s the randomness? Well v1s assume that you <em>won&#8217;t<\/em> generate a ton of values in the same nanosecond (and there are some extra bits reserved for differentiating even when you do), but another source is the MAC address. MAC addresses uniquely (usually) identify network cards &#8212; which is a security risk &#8212; and those bits can be made random.<\/p>\n<p>Here&#8217;s what a UUIDv1 looks like:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"\" data-line=\"\">a9957082-0b47-11ed-8a91-3cf011fe32f1\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>You can generate v1 UUIDs in Postgres natively thanks to the <a href=\"https:\/\/www.postgresql.org\/docs\/current\/uuid-ossp.html\">uuid-ossp contrib module<\/a>. Here&#8217;s how to generate a v1 UUID with random MAC address:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">CREATE EXTENSION IF NOT EXISTS &quot;uuid-ossp&quot;;\nCREATE EXTENSION\n\nSELECT uuid_generate_v1mc();\n\n          uuid_generate_v1mc\n--------------------------------------\n dd1bbf10&lt;span class=&quot;hljs-number&quot;&gt;-0&lt;\/span&gt;b47&lt;span class=&quot;hljs-number&quot;&gt;-11&lt;\/span&gt;ed&lt;span class=&quot;hljs-number&quot;&gt;-80&lt;\/span&gt;de&lt;span class=&quot;hljs-operator&quot;&gt;-&lt;\/span&gt;db48f6faaf86\n \n(&lt;span class=&quot;hljs-number&quot;&gt;1&lt;\/span&gt; row)\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<h3 id=\"uuidv4\">UUIDv4<\/h3>\n<p>Version 4 UUIDs use <em>all<\/em> the available bits for randomness &#8212; <em>122 bits worth<\/em>!.<\/p>\n<p>UUIDv4s look like this:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"\" data-line=\"\">ce0b897d-03a0-4f54-8c97-41d29a325a23\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>These don&#8217;t have a time component, but they don&#8217;t have in time they make up for in randomness &#8212; it is <em>very<\/em> unlikely for them to collide, so they make for excellent Global Unique IDentifiers (&#8220;GUID&#8221;s).<\/p>\n<p>We can generate them in Postgres like this (with <code class=\"\" data-line=\"\">uuid-ossp<\/code>):<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">SELECT uuid_generate_v4();\n\n           uuid_generate_v4\n--------------------------------------\n &lt;span class=&quot;hljs-number&quot;&gt;6&lt;\/span&gt;ca93dde&lt;span class=&quot;hljs-number&quot;&gt;-81&lt;\/span&gt;d4&lt;span class=&quot;hljs-number&quot;&gt;-4&lt;\/span&gt;ea0&lt;span class=&quot;hljs-operator&quot;&gt;-&lt;\/span&gt;bfe1&lt;span class=&quot;hljs-number&quot;&gt;-92&lt;\/span&gt;ecb4d81ee4\n \n(&lt;span class=&quot;hljs-number&quot;&gt;1&lt;\/span&gt; row)\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>Since Postgres would catch a collision on a <code class=\"\" data-line=\"\">PRIMARY KEY<\/code> or <code class=\"\" data-line=\"\">UNIQUE INDEX<\/code> column, we&#8217;re done right? If we want to generate UUIDs all we need to do is choose UUID v1 or V4, and we won&#8217;t leak any schema structure information to the outside world, right?<\/p>\n<p>This is a workable solution, but as you might expect, it&#8217;s not that easy.<\/p>\n<h3 id=\"the-post-uuidv1v4-era-a-cambrian-explosion-of-identifiers\">The Post-UUIDv1\/v4 era: A Cambrian explosion of identifiers<\/h3>\n<div class=\" next-image--dynamic-fill to-scale-400 from-scale-500 rounded-lg border bg-gradient-to-r \"><img decoding=\"async\" class=\"next-image--dynamic-fill rounded-md border\" src=\"https:\/\/supabase.com\/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fuuids.png&amp;w=3840&amp;q=75\" sizes=\"100vw\" srcset=\"\/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fuuids.png&amp;w=640&amp;q=75 640w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fuuids.png&amp;w=750&amp;q=75 750w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fuuids.png&amp;w=828&amp;q=75 828w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fuuids.png&amp;w=1080&amp;q=75 1080w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fuuids.png&amp;w=1200&amp;q=75 1200w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fuuids.png&amp;w=1920&amp;q=75 1920w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fuuids.png&amp;w=2048&amp;q=75 2048w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fuuids.png&amp;w=3840&amp;q=75 3840w\" alt=\"thats-alot-of-uuids.png\" data-nimg=\"fill\"\/><\/div>\n<p>UUIDv1 and v4 were a start, but weren&#8217;t enough for <em>many<\/em> companies out there. There are a couple shortcomings that plague both v1 and v2:<\/p>\n<ul>\n<li>UUIDs are twice the size of <code class=\"\" data-line=\"\">bigint<\/code>\/<code class=\"\" data-line=\"\">bigserial<\/code><\/li>\n<li>UUIDv1s contain a time element but they&#8217;re <em>not<\/em> lexicographically sortable (this means they <code class=\"\" data-line=\"\">SORT<\/code> terribly, relative to <code class=\"\" data-line=\"\">integer<\/code> or a <code class=\"\" data-line=\"\">timestamp<\/code> column)<\/li>\n<li>UUIDv1s are less random than UUIDv4, and can collide\/overlap in close enough time intervals, at large scale<\/li>\n<li>UUIDv4s index <em>terribly<\/em>, as they&#8217;re essentially random values (obviously, they <code class=\"\" data-line=\"\">SORT<\/code> terribly as well)<\/li>\n<\/ul>\n<p>Many of the world&#8217;s biggest companies generated UUIDs at speeds that made all of these deficiencies a problem.<\/p>\n<p>A cambrian explosion of UUIDs resulted, as noticed by the IETF &#8212; this resulted in the <a href=\"https:\/\/www.ietf.org\/archive\/id\/draft-peabody-dispatch-new-uuid-format-01.html\">new UUID formats<\/a> (v6,v7,v8) being published in 2021.<\/p>\n<p>Here&#8217;s a quick list (from that IETF document):<\/p>\n<ul>\n<li>LexicalUUID by <a href=\"https:\/\/blog.twitter.com\/engineering\">Twitter<\/a><\/li>\n<li><a href=\"https:\/\/blog.twitter.com\/engineering\/en_us\/a\/2010\/announcing-snowflake\">Snowflake<\/a> by <a href=\"https:\/\/blog.twitter.com\/engineering\">Twitter<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/boundary\/flake\">Flake<\/a> by <a href=\"http:\/\/www.bmc.com\/\">Boundary (now BMC TrueSight Pulse)<\/a><\/li>\n<li>ShardingID by <a href=\"http:\/\/instagram-engineering.com\/\">Instagram<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/segmentio\/ksuid\">KSUID<\/a> by <a href=\"https:\/\/segment.com\/blog\/engineering\/\">Segment<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/ppearcy\/elasticflake\">Elasticflake<\/a> by <a href=\"https:\/\/github.com\/ppearcy\/elasticflake\">P. Pearcy<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/s-yadav\/FlakeId\">FlakeID<\/a> by <a href=\"https:\/\/github.com\/T-PWK\">T. Pawlak<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/sony\/sonyflake\">Sonyflake<\/a> by <a href=\"https:\/\/github.com\/sony\">Sony<\/a><\/li>\n<li><a href=\"https:\/\/laravel.com\/docs\/5.7\/helpers#method-str-ordered-uuid\">orderedUuid<\/a> by <a href=\"https:\/\/darkghosthunter.medium.com\/\">IT. Cabrera<\/a><\/li>\n<li>COMBGUID by <a href=\"https:\/\/github.com\/richardtallent\">R. Tallent<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/ulid\/spec\">ULID<\/a> by <a href=\"https:\/\/github.com\/alizain\">A. Feerasta<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/chilts\/sid\">SID<\/a> by <a href=\"https:\/\/github.com\/chilts\">A. Chilton<\/a><\/li>\n<li><a href=\"https:\/\/firebase.googleblog.com\/2015\/02\/the-2120-ways-to-ensure-unique_68.html\">pushID<\/a> by <a href=\"https:\/\/google.com\/\">Google<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/rs\/xid\">XID<\/a> by <a href=\"https:\/\/github.com\/rs\">O. Poitrey<\/a><\/li>\n<li><a href=\"https:\/\/www.mongodb.com\/docs\/manual\/reference\/method\/ObjectId\/\">ObjectID<\/a> by <a href=\"https:\/\/www.mongodb.com\/blog\/channel\/engineering-blog\">MongoDB<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/ericelliott\/cuid\">CUID<\/a> by <a href=\"https:\/\/github.com\/ericelliott\">E. Elliott<\/a><\/li>\n<\/ul>\n<p>That&#8217;s&#8230; A lot of UUIDs. They&#8217;re all slightly different, but the innovation was summed up by the IETF:<\/p>\n<blockquote><p>An inspection of these implementations details the following trends that help define this standard:<\/p>\n<ul>\n<li>Timestamps MUST be k-sortable. That is, values within or close to the same timestamp are ordered properly by sorting algorithms.<\/li>\n<li>Timestamps SHOULD be big-endian with the most-significant bits of the time embedded as-is without reordering.<\/li>\n<li>Timestamps SHOULD utilize millisecond precision and Unix Epoch as timestamp source. Although, there is some variation to this among implementations depending on the application requirements.<\/li>\n<li>The ID format SHOULD be Lexicographically sortable while in the textual representation.<\/li>\n<li>IDs MUST ensure proper embedded sequencing to facilitate sorting when multiple UUIDs are created during a given timestamp.<\/li>\n<li>IDs MUST NOT require unique network identifiers as part of achieving uniqueness.<\/li>\n<li>Distributed nodes MUST be able to create collision resistant Unique IDs without a consulting a centralized resource. <strong>The IETF went on to introduce three 3 new types of UUIDs<\/strong> that have these properties these companies were looking for: UUIDv6, UUIDv7, and UUIDv8.<\/li>\n<\/ul>\n<\/blockquote>\n<p>So what&#8217;s the difference you ask?<\/p>\n<ul>\n<li><strong>UUIDv6<\/strong> &#8211; 62 bits of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Gregorian_calendar\">gregorian<\/a> time + 48 bits of randomness<\/li>\n<li><strong>UUIDv7<\/strong> &#8211; 36 bits of big endian unix timestamp (seconds since epoch + leapseconds w\/ optional sub-second precision) + variable randomness up to 62 bits<\/li>\n<li><strong>UUIDv8<\/strong> &#8211; variable size timestamp (32\/48\/60\/64 bits) + variable size clock (8\/12 bits) + variable randomness (54\/62 bits)<\/li>\n<\/ul>\n<p>It&#8217;s not quite easy to work out what this all <em>means<\/em> but let&#8217;s boil it down:<\/p>\n<ul>\n<li>All of these UUIDs sort properly (the &#8220;high bits&#8221; of time are first, like putting the year before the month &#8212; &#8220;2022\/07&#8221;)<\/li>\n<li>UUIDv6 <em>requires<\/em> randomness<\/li>\n<li>The data contained in the UUID can be variable (ex. UUIDv8), this means you can bytes that mean something else (ex. an encoding of the compute region you&#8217;re running in)<\/li>\n<\/ul>\n<p>Alright, done hearing about UUIDs? Let&#8217;s get to the fun part.<\/p>\n<h2 id=\"benchmarking-id-generation-with-uuid-ossp-and-pg_idkit\">Benchmarking ID generation with <code class=\"\" data-line=\"\">uuid-ossp<\/code> and <code class=\"\" data-line=\"\">pg_idkit<\/code><\/h2>\n<p>With the history lesson behind us, let\u2019s benchmark these ID generation mechanisms against each other! For UUIDv1 and UUIDv4 we can use <a href=\"https:\/\/www.postgresql.org\/docs\/current\/uuid-ossp.html\">uuid-ossp<\/a>.<\/p>\n<p>Unfortunately, <code class=\"\" data-line=\"\">uuid-ossp<\/code> isn&#8217;t <em>quite<\/em> so advanced as to have many of these newer UUIDs we&#8217;ve been discussing, so we\u2019ll pull in <a href=\"https:\/\/github.com\/t3hmrman\/pg_idkit\">pg_idkit<\/a> here.<\/p>\n<p><a href=\"https:\/\/github.com\/t3hmrman\/pg_idkit\">pg_idkit<\/a> is built with Rust, so it gives us access to the following ID generation crates:<\/p>\n<ul>\n<li><a href=\"https:\/\/crates.io\/crates\/nanoid\">nanoid<\/a> (a <a href=\"https:\/\/www.npmjs.com\/package\/nanoid\">well known package<\/a> from the the NodeJS ecosystem)<\/li>\n<li><a href=\"https:\/\/crates.io\/crates\/ksuid\">ksuid<\/a><\/li>\n<li><a href=\"https:\/\/crates.io\/crates\/ulid\">ulid<\/a><\/li>\n<li><a href=\"https:\/\/crates.io\/crates\/rs-snowflake\">rs-snowflake<\/a> (Twitter&#8217;s Snowflake algorithm)<\/li>\n<li><a href=\"https:\/\/crates.io\/crates\/timeflake-rs\">timeflake-rs<\/a> (Inspired by Twitter&#8217;s Snowflake, Instagram&#8217;s ID and Firebase&#8217;s PushID)<\/li>\n<li><a href=\"https:\/\/crates.io\/crates\/sonyflake\">sonyflake<\/a><\/li>\n<li><a href=\"https:\/\/crates.io\/crates\/pushid\">pushid<\/a><\/li>\n<li><a href=\"https:\/\/crates.io\/crates\/xid\">xid<\/a><\/li>\n<li><a href=\"https:\/\/crates.io\/crates\/cuid\">cuid<\/a><\/li>\n<li><a href=\"https:\/\/crates.io\/crates\/uuidv6\">uuidv6<\/a><\/li>\n<li><a href=\"https:\/\/crates.io\/crates\/uuid7\">uuid7<\/a><\/li>\n<\/ul>\n<p>For each type of UUID, we can test the following:<\/p>\n<ul>\n<li><strong>Generation speed:<\/strong> **How fast can I generate IDs (let&#8217;s say 1,000,000 of them)?<\/li>\n<li><strong>Table &amp; Index size:<\/strong> How much larger do tables and associated indices get?<\/li>\n<\/ul>\n<aside>\u26a0\ufe0f Grain of salt required for these tests &#8211; we haven&#8217;t audited the implementations of all the underlying crates!<\/p>\n<\/aside>\n<h3 id=\"generation-speed\">Generation speed<\/h3>\n<p>Generation speed is pretty easy to test, we can enable <code class=\"\" data-line=\"\">\\\\timing<\/code> mode on <code class=\"\" data-line=\"\">psql<\/code> and run a simple benchmark with <code class=\"\" data-line=\"\">generate_series<\/code>:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">\\timing -- enable psql timing mode\n\n-- Generate IDs 1 million times with ksuid\nSELECT COUNT(idkit_ksuid_generate()) FROM generate_series(&lt;span class=&quot;hljs-number&quot;&gt;1&lt;\/span&gt;, &lt;span class=&quot;hljs-number&quot;&gt;1000000&lt;\/span&gt;);\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>Running all of the ID generation mechanisms on a <em>single core<\/em> of my machine (which happens to be an <a href=\"https:\/\/system76.com\/laptops\/oryx\">Oryx Pro<\/a>), the lowest of 5 runs for each ID looks like this:<\/p>\n<div class=\" next-image--dynamic-fill to-scale-400 from-scale-500 rounded-lg border bg-gradient-to-r \"><img decoding=\"async\" class=\"next-image--dynamic-fill rounded-md border\" src=\"https:\/\/supabase.com\/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fgeneration-speed-chart.jpg&amp;w=3840&amp;q=75\" sizes=\"100vw\" srcset=\"\/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fgeneration-speed-chart.jpg&amp;w=640&amp;q=75 640w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fgeneration-speed-chart.jpg&amp;w=750&amp;q=75 750w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fgeneration-speed-chart.jpg&amp;w=828&amp;q=75 828w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fgeneration-speed-chart.jpg&amp;w=1080&amp;q=75 1080w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fgeneration-speed-chart.jpg&amp;w=1200&amp;q=75 1200w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fgeneration-speed-chart.jpg&amp;w=1920&amp;q=75 1920w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fgeneration-speed-chart.jpg&amp;w=2048&amp;q=75 2048w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Fgeneration-speed-chart.jpg&amp;w=3840&amp;q=75 3840w\" alt=\"Generation speed test\" data-nimg=\"fill\"\/><\/div>\n<p>To be fair, <strong>generation speed shouldn\u2019t be a deal breaker<\/strong> as it\u2019s unlikely to be the bottle neck for most applications. That said, it is nice to have some data on where each ID generation mechanism lands.<\/p>\n<aside>\u2139\ufe0f Percona and CyberTec have some excellent posts on the topic:<a href=\"https:\/\/www.percona.com\/blog\/2014\/12\/19\/store-uuid-optimized-way\/\">https:\/\/www.percona.com\/blog\/2014\/12\/19\/store-uuid-optimized-way\/<\/a> <a href=\"https:\/\/www.cybertec-postgresql.com\/en\/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys\/\">https:\/\/www.cybertec-postgresql.com\/en\/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys\/<\/a><\/p>\n<\/aside>\n<h3 id=\"table--index-size\">Table &amp; Index size<\/h3>\n<p>We can check the size of our tables &amp; related indices with this query (after running <code class=\"\" data-line=\"\">VACUUM<\/code>):<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">SELECT\n   relname  as table_name,\n   pg_size_pretty(pg_total_relation_size(relid)) As &quot;Table Size&quot;,\n   pg_size_pretty(pg_indexes_size(relid)) as &quot;Index Size&quot;,\n   pg_size_pretty(pg_relation_size(relid)) as &quot;Total Size&quot;\n   FROM pg_catalog.pg_statio_user_tables\nORDER BY pg_total_relation_size(relid) DESC;\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>Here are the sizes in tabular form:<\/p>\n<div class=\" next-image--dynamic-fill to-scale-400 from-scale-500 rounded-lg border bg-gradient-to-r \"><img decoding=\"async\" class=\"next-image--dynamic-fill rounded-md border\" src=\"https:\/\/supabase.com\/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Ftable-and-index-size-chart.jpg&amp;w=3840&amp;q=75\" sizes=\"100vw\" srcset=\"\/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Ftable-and-index-size-chart.jpg&amp;w=640&amp;q=75 640w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Ftable-and-index-size-chart.jpg&amp;w=750&amp;q=75 750w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Ftable-and-index-size-chart.jpg&amp;w=828&amp;q=75 828w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Ftable-and-index-size-chart.jpg&amp;w=1080&amp;q=75 1080w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Ftable-and-index-size-chart.jpg&amp;w=1200&amp;q=75 1200w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Ftable-and-index-size-chart.jpg&amp;w=1920&amp;q=75 1920w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Ftable-and-index-size-chart.jpg&amp;w=2048&amp;q=75 2048w, \/_next\/image?url=%2Fimages%2Fblog%2Fprimary-keys%2Ftable-and-index-size-chart.jpg&amp;w=3840&amp;q=75 3840w\" alt=\"Sizes in tabular form\" data-nimg=\"fill\"\/><\/div>\n<p>These numbers are <em>mostly<\/em> a reflection of the <em>length<\/em> of the default settings of <code class=\"\" data-line=\"\">pg_idkit<\/code> but probably worth having in front of you anyway.<\/p>\n<p>With this, we probably have enough information to make a decision (and a new library to generate our UUIDs with)!<\/p>\n<h2 id=\"which-id-should-you-use\">Which ID should you use?<\/h2>\n<p>As usual, <em>it depends<\/em> &#8212; you didn&#8217;t think it&#8217;d be that easy, did you?<\/p>\n<p>All I can offer are some general rules of thumb that hopefully work for you:<\/p>\n<ul>\n<li><code class=\"\" data-line=\"\">integer<\/code>s and <code class=\"\" data-line=\"\">serial<\/code> have obvious benefits for simplicity, storage, and sortability. You <em>might<\/em> not want to expose them to the world though.<\/li>\n<li>If you want the ultimate in collision avoidance UUIDv4 is OK<\/li>\n<li>UUIDv1 <em>could<\/em> have been great, <em>but<\/em> it doesn&#8217;t lexicographically sort.<\/li>\n<li>The best time-based ID seems to be <code class=\"\" data-line=\"\">xid<\/code>, with good performance <em>and<\/em> sort friendliness<\/li>\n<li>If you want to be a little more standards-oriented, UUID v6\/v7<\/li>\n<\/ul>\n<p>As usual, the best results will come from weighing all the options and finding what&#8217;s best for your use-case, and doing appropriate testing on your data.<\/p>\n<h2 id=\"possible-improvements\">Possible Improvements<\/h2>\n<p>We\u2019ve done some good exploration so far, but here are some ideas for interesting use cases for <code class=\"\" data-line=\"\">pg_idkit<\/code> and measuring the impact of ID generation using it.<\/p>\n<h3 id=\"usecase-generating-our-created_at-columns-from-our-ids\">Usecase: Generating our <code class=\"\" data-line=\"\">created_at<\/code> columns from our IDs<\/h3>\n<p>One interesting feature would be using at least partially time-based UUIDs for <code class=\"\" data-line=\"\">created_at<\/code> columns &#8212; we could save space by <em>virtualizing<\/em> our <code class=\"\" data-line=\"\">created_at<\/code> columns:<\/p>\n<div class=\"not-prose dark overflow-hidden\">\n<div class=\"relative\">\n<pre class=\"rounded-t-lg rounded-b-lg\"><code class=\"language-sql\" data-line=\"\">-- At table creation\nCREATE TABLE users (\n  id text PRIMARY KEY DEFAULT idkit_ksuid_generate(),\n  name text,\n  email text,\n);\n\n-- An example query for a specific KSUID that uses created_at\nSELECT &lt;span class=&quot;hljs-operator&quot;&gt;*&lt;\/span&gt;, idkit_ksuid_extract_timestamptz(id) \nFROM users\nWHERE id &lt;span class=&quot;hljs-operator&quot;&gt;=&lt;\/span&gt; &#039;0F755149A55730412B0AEC0E3B5B089C14B5B58D&#039;;\n<\/code><\/pre>\n<div class=\"dark absolute right-2 top-2\"><\/div>\n<\/div>\n<\/div>\n<p>Ideally we could use the <code class=\"\" data-line=\"\">GENERATED ALWAYS AS ( ... )<\/code> syntax for <a href=\"https:\/\/www.postgresql.org\/docs\/current\/ddl-generated-columns.html\">generated columns<\/a> while creating the table, but as the time of this post <em>Postgres does not yet support virtual generated columns (only stored ones)<\/em>.<\/p>\n<h3 id=\"benchmarking-measuring-index-fragmentation\">Benchmarking: Measuring index fragmentation<\/h3>\n<p>How fragmented do our indices get after use of each of these methods?<\/p>\n<p>Luckily for us Postgres has the <a href=\"https:\/\/www.postgresql.org\/docs\/current\/pgstattuple.html\">pgstattuple extension<\/a> so we can find out &#8212; thanks to <a href=\"https:\/\/dba.stackexchange.com\/questions\/273556\/how-do-we-select-fragmented-indexes-from-postgresql\">Laurenz Albe on StackOverflow<\/a>).<\/p>\n<p>Integrating and including these tests in the <code class=\"\" data-line=\"\">pg_idkit<\/code> README would greatly help people looking to make a decision.<\/p>\n<h3 id=\"benchmarking-measuring-sort-friendliness\">Benchmarking: Measuring <code class=\"\" data-line=\"\">SORT<\/code> friendliness<\/h3>\n<p>Another great metric to measure might be performance of these indices on certain common <code class=\"\" data-line=\"\">SORT<\/code> patterns. While this is inherently workload-specific, it would be great to pick a workload and see what we get.<\/p>\n<p>In most code bases, simple <code class=\"\" data-line=\"\">WHERE<\/code> queries with <code class=\"\" data-line=\"\">SORT<\/code>s abound, and one of the big benefits of UUIDv6, UUIDv7 and the other alternatives is lexicographic sorting, after all.<\/p>\n<p>Knowing <em>just how good<\/em> a certain ID generation method is at maintaining locality would be nice to know.<\/p>\n<p>Creating and using a function like <code class=\"\" data-line=\"\">idkit_uuidv1_extract_timestamptz<\/code> and using it in a <a href=\"https:\/\/www.postgresql.org\/docs\/14\/indexes-expressional.html\">\u201cfunctional index\u201d (an index on an expression)<\/a> could resolve the sort unfriendliness of UUIDv1 as well!<\/p>\n<h2 id=\"wrap-up\">Wrap-up<\/h2>\n<p>Identifiers have a long history and are surprisingly unsolved at present. It can be confusing but thanks to the power of Postgres we don&#8217;t have to over-think it &#8212; tables can be migrated from one ID pattern to another, and we can use DDL statements in transactions.<\/p>\n<p>Hopefully this article helps you head off some bikeshedding with your teammates when you next discuss which IDs are best to use.<\/p>\n<h2 id=\"more-postgres-articles\"><\/h2>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p class=\"excerpt\">&nbsp;Choosing a Postgres Primary Key<\/p>\n<p class=\"more-link-p\"><a class=\"more-link\" href=\"https:\/\/monodes.com\/predaelli\/2022\/09\/14\/choosing-a-postgres-primary-key\/\">Read more &rarr;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"link","meta":{"inline_featured_image":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":4,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[72],"tags":[],"class_list":["post-9598","post","type-post","status-publish","format-link","hentry","category-documentations","post_format-post-format-link"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p6daft-2uO","jetpack-related-posts":[{"id":10940,"url":"https:\/\/monodes.com\/predaelli\/2023\/11\/03\/versioning-data-in-postgres-testing-a-git-like-approach-specfy\/","url_meta":{"origin":9598,"position":0},"title":"Versioning data in Postgres? Testing a git like approach &#8211; Specfy","author":"Paolo Redaelli","date":"2023-11-03","format":false,"excerpt":"Versioning data in Postgres? Testing a git like approach - Specfy is fashinating but I think that most of the time these two proposed alternatives fit most of the needs: In-Table versioning, the Wordpress way of doing thing. Add a a column version (or modify date) and SELECT the maximum\u2026","rel":"","context":"In &quot;Tricks&quot;","block_context":{"text":"Tricks","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/tricks\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":5102,"url":"https:\/\/monodes.com\/predaelli\/2019\/01\/03\/auto-generated-primary-keys-friends-or-foes-3\/","url_meta":{"origin":9598,"position":1},"title":"Auto-generated Primary Keys, friends or foes?","author":"Paolo Redaelli","date":"2019-01-03","format":false,"excerpt":"Time to read againg Auto-generated Primary Keys, friends or foes? of the almighty Davide. I will save it somewhere, hoping that it will never go offline. In that case you will find it here. \u00a0","rel":"","context":"In &quot;Documentations&quot;","block_context":{"text":"Documentations","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":10679,"url":"https:\/\/monodes.com\/predaelli\/2023\/08\/09\/sqledge\/","url_meta":{"origin":9598,"position":2},"title":"SQLedge","author":"Paolo Redaelli","date":"2023-08-09","format":"link","excerpt":"SQLedge: Replicate Postgres to SQLite on the Edge - https:\/\/github.com\/zknill\/sqledge","rel":"","context":"In &quot;Software&quot;","block_context":{"text":"Software","link":"https:\/\/monodes.com\/predaelli\/category\/software\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":3743,"url":"https:\/\/monodes.com\/predaelli\/2018\/02\/03\/postgres-hidden-gems-craig-kerstiens\/","url_meta":{"origin":9598,"position":3},"title":"Postgres hidden gems &#8211; Craig Kerstiens","author":"Paolo Redaelli","date":"2018-02-03","format":false,"excerpt":"citext I've always been fond of PostgreSQL, now this Postgres hidden gems - Craig Kerstiens shows its smart features even more! There are many interesting features of Postgresql that I didn't knew, as I haven't actually used it for a while Postgres has a rich set of features, even when\u2026","rel":"","context":"In &quot;Documentations&quot;","block_context":{"text":"Documentations","link":"https:\/\/monodes.com\/predaelli\/category\/documentations\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":8922,"url":"https:\/\/monodes.com\/predaelli\/2021\/11\/18\/an-introduction-to-the-pg_auto_failover-project\/","url_meta":{"origin":9598,"position":4},"title":"An introduction to the pg_auto_failover project","author":"Paolo Redaelli","date":"2021-11-18","format":"link","excerpt":"\u201cPostgres is awesome, makes developping my application so much easier, it solves so many problems for me!\u201d and the next step \u201cso, how do I run Postgres in Production?\u201d. If you\u2019re not sure how to bridge that gap yourself, how to deploy your first production system with automated failover, then\u2026","rel":"","context":"In &quot;Senza categoria&quot;","block_context":{"text":"Senza categoria","link":"https:\/\/monodes.com\/predaelli\/category\/senza-categoria\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":5155,"url":"https:\/\/monodes.com\/predaelli\/2019\/01\/25\/multiple-fields-primary-keys-in-django\/","url_meta":{"origin":9598,"position":5},"title":"Multiple fields primary keys in Django","author":"Paolo Redaelli","date":"2019-01-25","format":false,"excerpt":"I'm developing a little application in Django. Having developed a subtle dislike for UUIDs used as primary keys I tend to rely of the \"natural keys\" which are almost always identificable in a data model. Often thought those keys span over several fields. Think about a receipt of a multi-store\u2026","rel":"","context":"In &quot;Django&quot;","block_context":{"text":"Django","link":"https:\/\/monodes.com\/predaelli\/category\/python\/django\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/9598","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/comments?post=9598"}],"version-history":[{"count":0,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/posts\/9598\/revisions"}],"wp:attachment":[{"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/media?parent=9598"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/categories?post=9598"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/monodes.com\/predaelli\/wp-json\/wp\/v2\/tags?post=9598"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}