Exactly-once delivery is one of software engineering’s hardest problems; some even believe it to be impossible. This blog post isn’t about exactly-once delivery itself, but rather about what happens when you assume you’ll receive a message over the internet only once, and how we protected ourselves.

A not-so-short story

I used to work for a fintech. As a quick primer on how we did accounting: we operated a double-entry ledger where we wrote credit/debit entries for transactions, and we computed balances by aggregating these ledger entries. Once a payment was made and we received a callback from the processor, we verified the transaction and updated our ledger. Very simple, what could possibly go wrong?
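
For illustration, here’s a minimal sketch of what such a ledger could look like. The LedgerEntry type and Balance helper below are made up for this post, not our actual schema.

// Illustrative double-entry ledger row: every transaction is recorded as
// credit and/or debit entries, and balances are never stored directly.
type LedgerEntry struct {
	TransactionID string
	AccountID     string
	Credit        int64 // amount credited, in minor units (e.g. cents)
	Debit         int64 // amount debited, in minor units (e.g. cents)
}

// Balance derives an account's balance by aggregating its ledger entries:
// the sum of credits minus the sum of debits.
func Balance(accountID string, entries []LedgerEntry) int64 {
	var total int64
	for _, e := range entries {
		if e.AccountID == accountID {
			total += e.Credit - e.Debit
		}
	}
	return total
}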

One Wednesday, I think, someone from the operations team alerted me that our ledger was out of sync with our partner’s ledger. We were reporting higher numbers than expected. See, balance mismatches drive me crazy. I quickly opened my SQL client to pull up some data, essentially looking for transactions whose amount and ledger value didn’t match. 66 lines and some struggles later, I ended up with quite a number of faulty transactions. By faulty, I mean those transactions had 2x their value in our ledger: a single deposit transaction had 2 different credit entries.

Clearly, we were double-processing some payments. One thing I noticed, however, was that the duplicate entries were milliseconds apart. A race condition? Okay, this was going to be tough.

Luckily, we were doing a few things right. We updated our ledger from only one place, so there was only one place to look. We also had some sanity checks in place to prevent re-processing.

After a thorough review of our payment pipeline, we were pretty confident the problem wasn’t on our end. Our next theory was that the pipeline was being triggered twice for a single transaction, the trigger in this case being the callback from our payment processor. To confirm this, we checked our access logs and there it was: double calls to the callback endpoint for a single transaction.

Clearly, our callback endpoint wasn’t safe for retries. To be fair, we DID have some sanity checks to guard against this exact phenomenon, but the window between the 2 calls was so small (milliseconds) that our checks were practically useless. Had they been a second or more apart, we would have easily ignored the second call.

It goes without saying that services with proper retry mechanisms (almost) always have some exponential backoff in place. This strikes a balance between reliable delivery and not overloading the recipient. Unfortunately, I don’t think our partners were actually retrying; there was probably a race condition in their software, but I digress.
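
For illustration, a retrying sender with exponential backoff boils down to something like the sketch below, using the standard time package. The retryWithBackoff name and the deliver callback are made up for this post.

// Sketch of retrying with exponential backoff: each failed attempt doubles
// the wait before the next one, so the recipient isn't hammered with
// back-to-back retries.
func retryWithBackoff(deliver func() error, maxAttempts int) error {
	backoff := time.Second
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = deliver(); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2 // 1s, 2s, 4s, ...
	}
	return err
}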

Distributed locking

Okay, back to our problem. We were receiving 2 or more callbacks for a single transaction, so the solution was to simply process only the first callback for a transaction. The event payload included a transaction id, which we could use as the idempotency key across requests. All we had to do was require the callback handler to acquire a lock on the transaction id before queueing it for processing. The catch here is that the lock is never released; ergo, the lock can only ever be acquired once for any given value.

But there was another layer to our problem: we needed this idempotency across multiple instances of our backend. The reason was simple: n requests distributed amongst n instance(s) of our service should have only 1 effect on our ledger. We needed a distributed lock. A dead simple one, for that matter.

I had some experience with Redlock, but that meant introducing Redis into our infrastructure, and my CTO at the time had serious trust issues with Redis. Luckily, we were already using DynamoDB in some other parts of our application, so we decided to build our solution around that. I found a few distributed locking libraries like dynamo_lock, but after carefully reviewing them, I realised they were overkill for the problem we had.

We decided to roll out our own lock. Here’s all the code that went into our solution.

Some Terraform to provision the DynamoDB table.

resource "aws_dynamodb_table" "txlock" {
  name         = "txlock"
  billing_mode = "PROVISIONED"
  read_capacity  = 5
  write_capacity = 5
  hash_key = "transaction_id"
  attribute {
    name = "transaction_id"
    type = "S"
  }
  tags = {
    Name        = "txlock"
  }
}

A small API in our application code to let us acquire locks. The most important bit of code is the attribute_not_exists conditional expression. This expression allowed us to acquire the lock in one network call, as opposed to first checking for the lock in one call and then acquiring it in another. The second approach is susceptible to race conditions. You can read more on that here.

// AcquireLock stores an item under the given key (the transaction id in our
// case). The conditional expression makes the write fail if an item with that
// key already exists, so the lock can only ever be acquired once.
func (t DynamoTable) AcquireLock(key string, value interface{}) (*dynamodb.PutItemOutput, error) {
	item, err := dynamodbattribute.MarshalMap(value)
	if err != nil {
		return nil, err
	}

	item[t.partitionKey] = &dynamodb.AttributeValue{S: aws.String(key)}

	input := dynamodb.PutItemInput{
		TableName:           aws.String(t.tableName),
		Item:                item,
		ConditionExpression: aws.String(fmt.Sprintf("attribute_not_exists(%s)", t.partitionKey)),
	}

	return t.db.PutItem(&input)
}

With that in place, all we had to do was call .AcquireLock() before processing a callback and we were good. We fixed the faulty transactions and also created a Metabase view so the Ops team would know whenever our ledgers fell out of sync. I don’t remember getting any reports about double processing ever again.
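
For illustration, here’s roughly what the call site could look like with the AcquireLock signature above. The handler, Transaction type, and queueForProcessing are made-up names; awserr comes from the AWS SDK (github.com/aws/aws-sdk-go/aws/awserr), and dynamodb.ErrCodeConditionalCheckFailedException is the error code the SDK reports when a condition expression fails.

// Sketch of a callback handler guarding itself with the lock: only the first
// call for a given transaction id gets past AcquireLock, while duplicates fail
// the condition check and are acknowledged without being queued.
func (h CallbackHandler) Handle(tx Transaction) error {
	if _, err := h.locks.AcquireLock(tx.ID, tx); err != nil {
		if aerr, ok := err.(awserr.Error); ok && aerr.Code() == dynamodb.ErrCodeConditionalCheckFailedException {
			return nil // duplicate callback: lock already taken, ignore it
		}
		return err
	}
	return h.queueForProcessing(tx)
}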