I lost many hours to this bug, which is why I’m quickly documenting it, so others can be spared. Like most bugs, the program was performing as it should; it just happened that an error was not being bubbled up correctly, and so I was basically flying blind.

How did we get here

I work for a proxy company, we slap a proxy on anything and call it a day. Culture at my workplace, amazing. Lots of flexibility & autonomy, and so there’s a lot of space for experimentation. A week earlier, I decided to experiment with some functionality that would allow you to securely perform remote docker builds over https. Most of the heavy lifting had already been done; we were already performing builds over wireguard tunnels, I just wanted to use the internet directly.

The work here involved booting up a golang service, slapping an authenticated proxy in front of the docker daemon, auth + ratelimit, and then finally, forwarding safe requests from external docker clients to the daemon. Nothing complex (said every software engineer ever).

The honeymoon phase

After tinkering around with a custom proxy for a while with no success, I came across httputil.NewSingleHostReverseProxy, a dead simple reverse proxy implementation hiding in the Go standard library. I quickly hooked it up with almost zero config and I was already building docker images remotely. Here’s a stripped-down version of how my proxy looked (auth, ratelimiting, dockerd bootup, etc. are hidden).

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

// proxy wraps the reverse proxy in a handler; in the real service,
// auth & ratelimiting happen here before the request is forwarded.
func proxy(reverseProxy *httputil.ReverseProxy) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, origReq *http.Request) {
		reverseProxy.ServeHTTP(w, origReq)
	})
}

func main() {
	// Assume dockerd is already running; it usually listens on 2375 on the loopback interface.
	reverseProxy := httputil.NewSingleHostReverseProxy(&url.URL{
		Scheme: "http",
		Host:   "localhost:2375",
	})

	srv := http.Server{
		Addr:    ":80",
		Handler: proxy(reverseProxy),

		// "legacy" code written 3 years ago that I didn't see.
		ReadTimeout:  5 * time.Second,
		WriteTimeout: 10 * time.Second,
	}

	log.Fatal(srv.ListenAndServe())
}

This simple service was already serving builds, info & pings. Push operations, however, kept failing midway. Aggressively retrying the operation seemed to get the remaining docker layers over the line. After peeking at the logs, the only consistent thing I saw across pushes was ERRO[datetime] Not continuing with push after error: context canceled. Honeymoon phase over; it was full-on debugging from here.

The Process

I put these logs into GitHub code search and it pointed me to moby/moby. So I knew for sure those logs were being emitted by the docker daemon running on my build server. Something was cancelling the push operation midway. My own http server logs were super clean, and on the client side, I only saw EOF errors.

I jumped straight into the moby codebase and started trying to work my way backwards. That took some hours of tracing & code reading, but it honestly didn’t lead me anywhere. I couldn’t figure out what exactly was executing the cancel() method of the context that was stopping my image pushes. Unfortunately, the software was already compiled too, so I couldn’t put breakpoints anywhere. I just had to read the source code and execute the program in my head. I combed the entire internet, didn’t find anything solid. I did come across this guy who was seeing the same errors and you could CLEARLY tell he was losing his shit. Anyway, I got into the dirt and started debugging. I didn’t understand why every other request was working, except for pushes. There was barely any custom code in the path. The reverse proxy was from the stdlib, written by people clearly smarter than I am.

In hindsight, I wasted too many hours on this lead, but in my defense, my server was not logging any errors, so that threw me off. You can jump to the learning section to see my takeaway from this ordeal.

We had a middleware that wrapped around handlers and logged the request after the handler had completed. I noticed it logged every request, with the exception of pushes. Which meant the push handler wasn’t returning. To confirm, I added a “request started” log before the handler was called and a “request ended” log after the handler returned, then fired a sample push. I saw the request started log; the request ended one, however, never showed up. I now had 100% assurance that the pushes were failing and yet the handler wasn’t returning. My logs were clean. I was hella confused.
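
The instrumented middleware looked roughly like this. It’s a simplified sketch that slots into the program above (log and net/http are already imported there); the names are made up and the real one logged a lot more than the method and path:

func logRequests(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("request started: %s %s", r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
		// For pushes, this line never showed up.
		log.Printf("request ended: %s %s", r.Method, r.URL.Path)
	})
}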

My next move was to confirm whether our server was the one cancelling the request. I set up a goroutine to permanently poll the Done() channel of the request context we were passing to the reverse proxy. Interestingly, the Done channel closed and the context reported a “context canceled” error. That was a good find: dockerd wasn’t the only one being cancelled. But that still didn’t explain why the handler wasn’t returning. Go context cancellation only propagates from parent to child, never from child to parent. And what that meant was: if the ctx.Done() channel of the original request was reporting a cancelled error, the cancellation didn’t happen in a child context (or child handler). Whoever cancelled it was sitting above our handler, wayyy above the point where we called the reverse proxy with the request & response.
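
The probe looked something like this (a sketch reusing the setup from the program above; the extra goroutine and log line are the only additions):

func proxyWithProbe(reverseProxy *httputil.ReverseProxy) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, origReq *http.Request) {
		ctx := origReq.Context()
		// Watch the incoming request's context and log when (and why) it gets cancelled.
		go func() {
			<-ctx.Done()
			log.Printf("request context done: %v", ctx.Err()) // printed "context canceled" during failing pushes
		}()
		reverseProxy.ServeHTTP(w, origReq)
	})
}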

httpServer -> ourHandler -> reverseProxy -> dockerdaemon
httpServer <- ourHandler <- reverseProxy <- dockerdaemon

And that, my friends, is how I ended up in the net/http portion of the Go standard library, because httpServer is the genesis of it all. The entrypoint is ListenAndServe. Follow the call chain a bit and you end up at serverHandler{c.server}.ServeHTTP(w, w.req); this is what calls our handler and hands it the request along with its context. So whoever was cancelling our http request lived somewhere around this function. If you scan the conn.serve function a bit more, you’ll see the cancellable context being set up and cancelled on defer. Unfortunately, the cancel func is also passed around to other entities, so I went down a rabbit hole trying to identify every actor that could possibly cancel my request.
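
For reference, the relevant shape of conn.serve is roughly this. It’s a loose paraphrase from memory, heavily trimmed; don’t treat it as the actual source:

func (c *conn) serve(ctx context.Context) {
	// ...
	ctx, cancelCtx := context.WithCancel(ctx)
	c.cancelCtx = cancelCtx // the cancel func escapes to other parts of the server
	defer cancelCtx()       // ...and is also called when serve returns
	// ...
	serverHandler{c.server}.ServeHTTP(w, w.req) // eventually reaches our handler
	// ...
}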

It’s important to know that this was compiled code, so I couldn’t really put breakpoints or log statements anywhere to properly determine who cancelled the context; building a map in my head was the best I could do. I was just reading the source code of a library my program was importing and trying to guess where it was being hit. (One day I’ll write about what I think would be the ultimate debugging tool for programmers.)

You’re probably wondering where I’m going with this. See, I knew all the factors that could cause a cancellation. The most likely was this defer call; however, I had 100% certainty ourHandler wasn’t returning, which implied serverHandler{c.server}.ServeHTTP(w, w.req) was also blocked waiting on ourHandler, and so there was NO WAY the defer call that came before it would fire. In hindsight, this assumption is what made me lose even more hours. I was totally wrong. Normally, that’s a solid assumption to hold, but if you’re this deep into the stack already, you don’t want to hold strong opinions on anything. I had hit a dead end, so I decided to take a new approach to investigating the program.

In the new approach, I decided to throw away the context ourHandler was receiving from conn.serve. Instead, I set a context.Background() on the request and sent that to reverseProxy. This, this was the best thing I had done in a week. My theory was simple: a fresh context wasn’t going to get cancelled, so no cancellation should propagate to the daemon. When I ran the test, something interesting happened. Remember the “request ended” log that wasn’t showing up? It started popping up. My handler was unblocked. A new line also popped into my server logs: failed to write from localhost:8080->localhost:98374. i/o timeout. But my docker push was still hanging/failing with the same cancelled error. WHO WAS CANCELLING THE DOCKER PUSH THEN? I was really losing it.
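
The hack looked roughly like this (a sketch only; it needs the context import added to the program above, and you wouldn’t ship it, since it also throws away client-disconnect signals):

func proxyDetached(reverseProxy *httputil.ReverseProxy) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, origReq *http.Request) {
		// Replace the server's cancellable context with a fresh one
		// that nothing above us can cancel.
		detached := origReq.WithContext(context.Background())
		reverseProxy.ServeHTTP(w, detached)
	})
}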

But on the flip side, I had a new logline to investigate. After searching around for a while, I found out it originated from the reverseProxy. If you look at the code, you notice THEY’RE CONDITIONALLY SUPPRESSING ERRORS. That made me really upset; I had lost days just because someone was swallowing up errors instead of bubbling them up or dropping a harmless log. Anyway, I had to figure out why this new log line was popping up all of a sudden. When I looked at the content of shouldPanicOnCopyError, it became obvious why the logs only showed up after I passed a fresh context. The fresh context I passed was missing the serverContextKey, and that caused the log to pop out. This was just pure luck at play. There’s no method to this. I was really irritated.
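
From memory, shouldPanicOnCopyError is shaped roughly like this (paraphrased, not the exact stdlib source):

func shouldPanicOnCopyError(req *http.Request) bool {
	if req.Context().Value(http.ServerContextKey) != nil {
		// Running under net/http's server: panic with ErrAbortHandler
		// and let the server's recover deal with (i.e. swallow) it.
		return true
	}
	// Not under an http.Server -- e.g. a request carrying a bare
	// context.Background() -- so just log the copy error instead.
	// This is why my fresh context made the "failed to write" line appear.
	return false
}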

But see, this thing gets more interesting, and this is where I learnt a big lesson about golang & http servers and how panics are handled. If the error isn’t swallowed by the reverseProxy, it actually panics with the error http.ErrAbortHandler. When a panic occurs, further execution is halted and the function works its way back up to its caller, and along the way, all defers are executed accordingly. In the http serve implementation, there’s a top-level recover. This defer logs EVERY panic except ErrAbortHandler. SIGH
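
You can see this behaviour with a tiny standalone server (my own demo, nothing to do with the proxy): one handler panics with a normal value and gets logged by net/http’s recover; the other panics with http.ErrAbortHandler and the connection is aborted silently.

package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/noisy", func(w http.ResponseWriter, r *http.Request) {
		panic("boom") // server logs: http: panic serving ...: boom
	})
	mux.HandleFunc("/silent", func(w http.ResponseWriter, r *http.Request) {
		panic(http.ErrAbortHandler) // connection aborted, nothing in the logs
	})
	log.Fatal(http.ListenAndServe(":8081", mux))
}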

This explained so many things that were driving me crazy. If you’re confused, let me explain what was actually happening. I make a request to my server, my server calls the reverseProxy to forward the request to the backend, but midway, the reverseProxy breaks down and panics. The http server that wraps around the reverseProxy catches the panic alright, but it swallows it up and shows nothing in the logs.

  • Why wasn’t I seeing the “request ended” log? Why did it seem like my request was hanging? - The reverse proxy panicked, and that prevented the ServeHTTP function from finishing.
  • Why didn’t I see any log from my server? - The http server wrapper that caught the panic swallowed up the error.
  • Why didn’t I see any log from the reverse proxy? - The reverse proxy was also swallowing the error.
  • Why was the docker push hanging midway? - Still didn’t know.
  • Why was it working for other requests but not pushes? - Still didn’t know.

My core issue hadn’t been fixed, but I had narrowed it down a bit. The reverseProxy was one file with barely any external dependencies, so I copied and pasted it into my codebase and edited it to spit out all errors instead of swallowing them. I ran another test, and the error that got logged was “request cancelled”. LMAO WTF IS GOING ON YO.
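
For what it’s worth, a less invasive way to get some visibility (a sketch of my own, not what I actually did) is to recover the panic in your own middleware, log it, and re-panic. You only learn that the handler was aborted, not the underlying copy error, which is exactly why copying and editing the file gave me more detail:

func logAborts(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if v := recover(); v != nil {
				log.Printf("handler aborted: %s %s: %v (ctx err: %v)",
					r.Method, r.URL.Path, v, r.Context().Err())
				panic(v) // re-panic to preserve net/http's abort semantics
			}
		}()
		next.ServeHTTP(w, r)
	})
}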

I knew for a fact that the context was cancelled way before those http defers fired, so it definitely wasn’t them. I added more log lines to the library and followed the code flow a bit, and eventually I narrowed it down some more: res.Body.Read() was returning context.Canceled errors. In the middle of the reverseProxy reading from the backend, someone cancels the context and the read fails. If you’ve ever had to trace implementations of golang’s io.Reader/io.Writer interfaces, you know how difficult it is to actually find the code. All intellisense does is lead you to the interface.
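
If you want to see this failure mode in isolation, here’s a tiny standalone repro (my own sketch, not code from the proxy): cancel a request’s context while you’re still reading the response body, and the read fails with exactly that error.

package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A backend that streams slowly, roughly like dockerd streaming push progress.
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for i := 0; i < 10; i++ {
			fmt.Fprintln(w, "layer chunk", i)
			w.(http.Flusher).Flush()
			time.Sleep(200 * time.Millisecond)
		}
	}))
	defer backend.Close()

	// Attach a cancellable context to the request and cancel it mid-stream.
	ctx, cancel := context.WithCancel(context.Background())
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, backend.URL, nil)
	if err != nil {
		panic(err)
	}
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer res.Body.Close()

	time.AfterFunc(500*time.Millisecond, cancel)

	// The read fails partway through with "context canceled".
	_, err = io.Copy(io.Discard, res.Body)
	fmt.Println("copy error:", err)
}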

In all honesty, I couldn’t narrow down who was cancelling the read, mostly because I was tired. But right before I closed work, I decided to change the unit of the ReadTimeout and WriteTimeout from seconds to minutes. I was too tired to test it, so I just left my desk and went to make dinner. After dinner, I ran another test and BOOOM, the pushes weren’t hanging anymore.
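
Concretely, against the stripped-down server from the top, the change was just the timeout units (the exact values here are illustrative):

	srv := http.Server{
		Addr:    ":80",
		Handler: proxy(reverseProxy),

		// Pushes stream large layers and can easily outlive a 5s/10s
		// deadline, which is what was killing them.
		ReadTimeout:  5 * time.Minute,
		WriteTimeout: 10 * time.Minute,
	}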

In hindsight, it was right there. I had enough indicators to narrow the problem down to read/write timeouts. But like why the FUCK was my error being swallowed???

The other requests were working fine because they were simple requests that returned wayyy before the set deadlines.

WTF man.