Introduction

Last year, like many others before me, I discovered the FOR UPDATE SKIP LOCKED feature in postgres. It's a handy feature that allows you to solve a troublesome class of problems elegantly by delegating all the work to the postgres machinery. This post documents the end-to-end process that led me to it. I've also added a breakdown button to make any assumptions, napkin math, etc more accessible. Purely inspired by something I saw on Michael Nielsen's blog.

Problem Description

The high-level task was to build a feature to enable clients to provision IP addresses for their virtual machines. This feature was going to be exposed via a public api. In terms of material constraints:

  • The api service was running roughly 50 instances, globally distributed across a dozen regions. Client traffic was routed to the closest instance via anycast.
  • The address supply was basically a couple of /22 IP blocks, from which clients could make their pick. For simplicity, let's assume they were 45.67.80.0/22 and 203.0.112.0/22.
  • There were about 500,000 clients.

This is a reduced version of the problem, but representative enough of my dilemma at the time. [1]

Breakdown

To get a better sense of the problem, I did some napkin maths, starting with the address space.

  • IPv4's 32-bit address space gives us 4.3 billion unique addresses.
  • A single /22 subnet provides 1024 addresses, so two of these blocks give a total of 2048.
  • Matching 2048 addresses to 500k clients results in a 1:244 address-to-client ratio. (this assumes perfect load balancing across addresses)
  • There's going to be high contention if all clients request IPs in parallel, since p(acquire) ≈ 0.4%
=> 2048 addresses, 500k clients
=> 2048:500000 = 1:244 (we perform a division to get the ratio)
=> under equal distribution of load, each address has 244 possible suitors
=> which means each client request only has a 0.4% chance of claiming an address.

internet protocol version 4 (ipv4)

ipv4 defines how data is routed across a network of devices. In order to ensure accurate delivery, each device needs to be uniquely identified. The protocol refers to these identifiers as ip addresses. Here's an example address: 192.168.1.1.

ipv4 address space

The protocol requires all addresses to fit in 32 bits of memory. This means only 2^32 = 4,294,967,296 unique addresses can be represented, which is roughly 4.3 billion.

heuristics

1. The number 32 can be split into 4 equal groups: 8 + 8 + 8 + 8.
2. This implies that 32 bits can also be split into four groups of 8 bits: 2^32 = 2^8 × 2^8 × 2^8 × 2^8.
3. Each group of 8 bits is called an octet and an octet of memory can store numbers from 0 to 255.
4. This is why an ipv4 address is a sequence of 4 octets that look like this: [0-255].[0-255].[0-255].[0-255]
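
To make the octet idea concrete, here's a tiny sketch (assuming Go's net/netip and encoding/binary packages) that splits an address into its four octets and packs them back into a single 32-bit number:

package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

func main() {
	addr := netip.MustParseAddr("192.168.1.1")
	octets := addr.As4()                            // [192 168 1 1]: four 8-bit groups
	as_uint32 := binary.BigEndian.Uint32(octets[:]) // the same address packed into 32 bits
	fmt.Println(octets, as_uint32)                  // [192 168 1 1] 3232235777
}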

subnets

A subnet is a range of addresses that share the same binary prefix. Naturally, when you write out all 4.3 billion ip addresses in binary form, you're going to notice a pattern where a lot of addresses begin with the same sequence of numbers, which we call the prefix. This pattern of shared prefixes enables us to create families of networks, called subnets.
In a given subnet, the fixed prefix is called the network part. The variable part is called the host part. This is analogous to human families, where the network prefix acts as a surname and the host part acts as the first name. Can you see the (surname|network) and (firstname|host) pattern in these examples? 😉
192.168.0.0   -> 11000000.10101000.00000000.00000000
192.168.0.1   -> 11000000.10101000.00000000.00000001
192.168.0.2   -> 11000000.10101000.00000000.00000010
192.168.0.3   -> 11000000.10101000.00000000.00000011
192.168.0.4   -> 11000000.10101000.00000000.00000100
192.168.0.5   -> 11000000.10101000.00000000.00000101
192.168.0.6   -> 11000000.10101000.00000000.00000110
192.168.0.7   -> 11000000.10101000.00000000.00000111
192.168.0.8   -> 11000000.10101000.00000000.00001000
192.168.0.9   -> 11000000.10101000.00000000.00001001
...
Subnet example
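
If you'd like to reproduce that listing yourself, here's a minimal sketch (assuming Go's net/netip package) that prints addresses in the dotted-binary form above:

package main

import (
	"fmt"
	"net/netip"
)

// to_binary renders an ipv4 address as four dot-separated octets in base-2,
// which makes the shared network prefix easy to spot.
func to_binary(a netip.Addr) string {
	o := a.As4()
	return fmt.Sprintf("%08b.%08b.%08b.%08b", o[0], o[1], o[2], o[3])
}

func main() {
	addr := netip.MustParseAddr("192.168.0.0")
	for i := 0; i < 10; i++ {
		fmt.Printf("%-13s -> %s\n", addr, to_binary(addr))
		addr = addr.Next()
	}
}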

cidr notation (192.168.0.0/24)

Thinking in base-2 is a lot of work, so humans invented CIDR notation as a simple way to represent these subnets. Instead of writing out all family members like I did in the example above, we can just write:
192.168.0.0/24
This notation tells us that the network prefix (or network portion) is 24 bits long. To find the size of the host portion, you only need to calculate 32 - 24 = 8 host bits, which gives 2^8 = 256 combinations. This means there are 256 unique addresses in this subnet block.
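
As a quick sanity check of that arithmetic, here's a sketch (again assuming Go's net/netip package):

package main

import (
	"fmt"
	"net/netip"
)

func main() {
	p := netip.MustParsePrefix("192.168.0.0/24")
	host_bits := 32 - p.Bits()  // 32 total bits minus the 24-bit network prefix
	fmt.Println(1 << host_bits) // 2^8 = 256 addresses in this block
}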

Now, an IP address can only be assigned to one client at a time. And so with 50 API instances capable of handling allocation requests concurrently, I needed to put in place a coordination mechanism to prevent multiple instances from assigning the same address to different clients. In the literature, this is the classic use case for a (distributed) mutex.

However, a quick supply-to-demand analysis showed that roughly 497,952 clients would never be allocated an address, regardless of how long they retried or waited. Allowing these clients to wait on a mutex was a waste of compute. This meant I required not just any distributed mutex; it had to be non-blocking, so we could fail fast.

=> Addresses are unique, meaning 1 address can only be claimed by 1 client.
=> This means, if we have 500k clients and 2048 addresses, only 2048 clients can get an address.
=> The remaining (500,000 - 2048) 497,952 were doomed.

Honestly, this section was an exercise in worst-case scenarios, but having no control over clients or the routing meant it wasn't far from reality.

Ideation

Implementing a non-blocking distributed mutex at the service level implied adding new infrastructure like Redis or DynamoDB [2]. In order to justify that operational overhead to management, I needed proof that I had exhausted all current options. Luckily for me, our api service was backed by postgres, so I began researching whether I could bend it to my needs.

I had a few (crude) ideas on my way to a good-enough solution, which I'm going to share in the spirit of transparency. Truth is, falsifying them led me to my final solution.

  • IDEA-1: For each client request, randomly generate an IP address in the subnet and try to claim it. We can rely on the "unique" index in postgres to prevent duplicate allocations. It seems simple enough (there's a rough sketch after this list), but it's not great.
    • It was prone to resource exhaustion: 500k clients guessing within a tiny 2048-address space, each guess coupled with an attempted db write that had an almost 0% chance of success.
    • The lack of book-keeping (pure randomness) meant new client requests could guess an already-allocated address, for which they were destined a rejection. Except a rejection doesn't mean our address supply is depleted, so technically they could retry, or the service could retry on their behalf. But the million dollar question becomes: at what point do they stop? When all 2048 addresses are claimed? When they finally grab one?
  • IDEA-2: Create an IP-allocator class that sequentially handles allocation requests using a buffer. This improves on the previous idea, but introduces new problems I wasn't particularly interested in solving. (unless it was my last resort)
    • It required book-keeping between instances, for consistency.
    • It required some book-keeping between storage and application.
    • Lastly, a sequential queue meant higher latency. (sharding might help, see below)
  • IDEA-3: Shard the subnets into 50 equal groups, and assign each api instance a shard. Honestly, I loved this idea. It's elegant & decentralized—it doesn't require any distributed locking. Common knowledge at the org was that, if you could solve a problem statically (without network + coordination), you should.
    • The shard-to-instance mapping was purely logical, so it still required book-keeping.
    • Due to the anycast routing, a dead instance OR an instance in a low-traffic region could hog valuable IP addresses.
    • Lastly, I would have to figure out rebalancing, for example when we scaled down instances.
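
For what it's worth, here's roughly what IDEA-1 would have looked like in SQL; the allocation table and the specific address are hypothetical:

-- each claimed address lives in its own row; the unique index rejects duplicates
CREATE TABLE allocation (
  ip inet NOT NULL,
  client_id bigint NOT NULL
);
CREATE UNIQUE INDEX index_allocation_on_ip ON allocation (ip);

-- the service guesses a random address in the subnet and tries to claim it
INSERT INTO allocation (ip, client_id)
VALUES ('45.67.80.57', 43943);
-- raises a unique_violation if another client already guessed this address,
-- and the caller has to decide whether (and how long) to keep retrying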

A few of these ideas were workable, but I still had a hunch I could do better. Can you think of any other pros/cons for these ideas? Can you think of other solutions? I would love to hear them.

Breakthrough

Mid-research, I discovered the FOR UPDATE clause. But before I go into the details, I want you to assume we have this theoretical table called ip_address.

  • This table represents the pool of addresses
  • A row from this table represents an IP address
  • A status column on the row indicates if an address is free or not.
  • A client_id column on the row indicates which client owns the address.

FOR UPDATE

The FOR UPDATE clause in postgres allows you to acquire a lock on all rows that match a given SELECT query. For a thought experiment, let's assume we have 2 clients, trx-1 and trx-2, and they're both trying to claim a single address; row-1.

In order for trx-1 to claim row-1 in our address pool, it first needs to check that the address is free (select), and then claim it (update). However, issued as two plain statements, this select-then-update isn't atomic, which means there's a tiny gap between the select and the update where another client—trx-2—could quickly steal the address.

This behaviour can easily be mitigated using FOR UPDATE. Invoking it gives trx-1 exclusive access to row-1. Any other client that attempts to claim the same row—trx-2—will have to wait in line until trx-1 commits or rolls back.

For me, the biggest takeaway from this clause was that, after trx-1 is done with its business, the postgres engine re-evaluates the WHERE clause of trx-2 to ensure row-1 is still free. This enforced some sort of serializability, which was exactly the consistency signal I was looking for.
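
Here's a rough sketch of that interplay, using the theoretical ip_address table from above (the id and client_id values are made up):

-- trx-1
BEGIN;
SELECT * FROM ip_address
  WHERE id = 1 AND client_id IS NULL
  FOR UPDATE;                 -- trx-1 now holds an exclusive lock on row-1
UPDATE ip_address SET client_id = 42 WHERE id = 1;
COMMIT;                       -- lock released

-- trx-2, issued while trx-1 still held the lock
BEGIN;
SELECT * FROM ip_address
  WHERE id = 1 AND client_id IS NULL
  FOR UPDATE;
-- blocks until trx-1 commits, then re-checks the WHERE clause
-- and returns zero rows, because client_id is no longer NULL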

FOR UPDATE NOWAIT

A good signal wasn't enough, because I was specifically looking for a non-blocking mutex. FOR UPDATE required all 243 other clients to wait in line and re-execute their queries for an address they had exactly 0% chance of getting [3]. This is how I found FOR UPDATE NOWAIT, a more interesting variant that doesn't wait for the locked row to become available. This looked like a good fit, I thought, since it would allow all 243 client requests to fail immediately if they landed on the same address. But like I said in idea 1, failure to claim an address doesn't equate to an exhausted pool. With enough retries, a client could still get an address.

FOR UPDATE SKIP LOCKED

This magical clause gives you a non-blocking lock, but instead of early-exiting, it returns the rows that are not locked. A perfect fit for my use case, since it doesn't wait for any locks to be released, but also returns the next available row. This removes the need for a best-effort retry. In prompt engineering terms, this is equivalent to telling the database, "hey psql, try your very best to find me an available IP. Don't make any mistakes".

Early on, I said the FOR UPDATE clause in postgres allows you to acquire a lock on ALL ROWS that match a given SELECT query. The direct implication of this on the 2 variants might not be obvious so let's look at an example.

Let's assume we have a pg table with 5 rows (a, b, c, d, e), where rows a and b are locked by another transaction.

SELECT *
  FROM table
  ORDER BY col ASC
→ Will return all rows (a, b, c, d, e)

SELECT *
  FROM table
  ORDER BY col ASC
  FOR UPDATE NOWAIT
→ Will return an error, because the first row it attempts to lock (row a) is already locked.

SELECT *
  FROM table
  ORDER BY col ASC
  FOR UPDATE SKIP LOCKED
→ Will return rows c, d, e (skips the locked ones)

Hopefully it's clear now!

(image: for update skip locked. "ok claude, make a billion dollar b2b todo app. make no mistakes.")

Solution

I had a high-contention resource allocation problem, and my ideal solution was one that provided both distributed consistency and rapid rejection for failed allocations. Turns out, 4 English words is all you need.

The conceptual model becomes clearer once you:

  • remodel the problem as an exactly-once processing problem. [4]
  • use postgres as a queue & invoke the FOR UPDATE SKIP LOCKED clause to ensure exclusivity.
  • think of the client as a consumer/worker.

With all the pieces in place, it was time to produce shareholder value. The first step was to generate all possible IPs in our subnets and seed our database with them, thereby creating a pool of available ip addresses. I'll provide some sample code to demonstrate the process.

server_ip_migration.sql
CREATE TABLE client (
  id bigint NOT NULL
);
CREATE TABLE ip_address (
  id bigint NOT NULL,
  ip inet NOT NULL,
  client_id bigint
);

CREATE INDEX index_ip_address_on_client_id
  ON ip_address USING btree (client_id);

Then I added code to generate the IP supply and seed the database. In every subnet, you can't use the first & last address. You can figure out why for yourself.

server_seed.go
func generate_all_ips_from_blocks(blocks []string) []string {
    ips := []string{}
    for _, b := range blocks {
        // parse the CIDR block, e.g. "45.67.80.0/22" (uses net/netip)
        prefix := netip.MustParsePrefix(b)
        block := []string{}
        for addr := prefix.Masked().Addr(); prefix.Contains(addr); addr = addr.Next() {
            block = append(block, addr.String())
        }
        // ignore the first and last ip in the block
        ips = append(ips, block[1:len(block)-1]...)
    }
    return ips
}


blocks := []string{
  "45.67.80.0/22",
  "203.0.112.0/22",
}
unique_ips := generate_all_ips_from_blocks(blocks)
database.ip_address.insert_all(unique_ips) // the client_id is null, we are just creating a resource pool

Then, for the actual allocation component, I created a database query to atomically assign available IPs to clients in a non-blocking manner. A query error at this point (or an update that touches no row) meant either we'd exhausted our supply or some other fancy programming problem had happened. As a note, ORDER BY random() helps spread the contention a bit, but it can be very expensive.

server_allocate_ip_address.go
allocation_query = `
  UPDATE ip_address
  SET client_id = :client_id
  WHERE id IN (
    SELECT id
    FROM ip_address
    WHERE client_id IS NULL
    ORDER BY random()
    LIMIT 1
    FOR UPDATE
    SKIP LOCKED
  )
  RETURNING *
`

func allocate_ip_address(client_id int) {
	// note: database.execute here is pseudocode; assume it returns an error
	// when no row was updated (i.e. no free address was found)
	err := database.execute(allocation_query, {client_id})
	if err != nil {
		return client.send("service out of ip addresses, try again later")
	}
}

api.handle("/allocate_ip_address", allocate_ip_address)

Once a client needed an IP, it just had to invoke a single call:

server.rpc.allocate_ip_address(client_id: 43943)

Conclusion

I kept falsifying ideas based on the worst-case scenario of 500k concurrent allocations. Deep down, I knew very well it wasn't practical, especially because features like these always have an alpha & beta release to control traffic and get user feedback. In this case, however, living by it was a very good proxy, which eventually guided me to an elegant solution.

Feel free to reach out if you have any questions or feedback.


IPV6

I added this section so I could mention a neat trick you can do for ipv6 allocations, since my original project included an IPv6 requirement. If you remember, I mentioned the human-friendly nature of the IPv4 address space as a good thing. Now, while that is true, the near-infiniteness of the IPv6 space made for easier allocation. By easier, I mean static.

For IPv6, you can convert the client_id to hexadecimal and embed it directly in the address, enabling static assignments. In fact, in our case, all we did was convert the allocated IPv4 address to hex and embed it in an IPv6 address.
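
As an illustration, here's a minimal sketch of the client_id variant (assuming Go's net/netip package); the 2001:db8:80:1::/64 prefix is just the documentation range standing in for whatever prefix the service actually owns:

package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// ipv6_for_client derives a static ipv6 address by embedding the client id
// in the lower 64 bits (the host part) of a fixed /64 prefix.
func ipv6_for_client(client_id uint64) netip.Addr {
	prefix := netip.MustParseAddr("2001:db8:80:1::").As16()
	var b [16]byte
	copy(b[:8], prefix[:8])                      // keep the 64-bit network prefix
	binary.BigEndian.PutUint64(b[8:], client_id) // client id becomes the host part
	return netip.AddrFrom16(b)
}

func main() {
	fmt.Println(ipv6_for_client(43943)) // 2001:db8:80:1::aba7
}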


[1] The IP subnets used in this post are probably not publicly routable. The real number of clients was several orders of magnitude higher, and the allocation was per client_vm, not per client. And we most likely had more subnets, iirc. (but these are all distractions)

[2] I didn't necessarily need these two, but I wasn't implementing consensus algorithms if it wasn't required. Hence the first option was to "buy" instead of build.

[3] The p(acquisition) = 0 assumes the transaction succeeded. If it fails or rolls back, then someone else stands a good chance of grabbing the address.

[4] Yes, the classic exactly-once problem is a bit inverted, but if you tone down the rationality a bit, you'll see the picture I'm trying to paint.