I lost many man hours tracking this down. The tldr being, a specific failure mode can lead to golang's standard net/http library suppressing errors and panics. If you don't care about the process, skip to the failure mode

Context

I was an engineer working on deployments. A huge part of this work involved reliably building container images & pushing them to remote registries. These builds either happened locally on customer machines or remotely by using a lightweight docker client to drive a remote docker daemon running in a dedicated vm. These remote builds occured over wireguard, providing security. However, wireguard being a udp-based protocol meant connections for remote builds could get flaky for many other reasons, including upset firewalls.

To increase build reliability, I began experimenting on how to enable customers securely perform remote docker builds over open internet(https), which is more reliable/less policed.

Good Signs

A few failed experiments with a custom proxy led me to httputil.NewSingleHostReverseProxy, which is simple & elegant reverse proxy hidden in the go standard library. With little effort, I was already proxying requests from my local docker client to the remote docker daemon. This simple piece of code could already perform builds and respond to info & ping requests. (Note there's no auth, ratelimiting, dockerd bootup, etc.)

package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func proxy(reverseProxy *ReverseProxy ) http.Handler {
	// docker daemon usually listens on 2375 on loopback interface.
	reverseProxy := NewSingleHostReverseProxy(&url.URL{
		Scheme: "http",
		Host:   "localhost:2375",
	})

	return http.HandlerFunc(func(w http.ResponseWriter, origReq *http.Request) {
		reverseProxy.ServeHTTP(w, origReq)
	})
}

func main() {
    // Assume dockerd is running in the 
    proxy := httputil.NewSingleHostReverseProxy(&url.URL{
		Scheme: "http",
		Host:   "localhost:2375",
	})

    srv := http.Server{
        Addr: ":80", 
        Handler: proxy,

        // "legacy" code written 3 years ago that I didn't see.
        ReadTimeout:  5 * time.Second,
		WriteTimeout: 10 * time.Second,
    }

    srv.ListenAndServe()
}

Push operations however, behaved inconsistently. Some pushes worked, others failed, others worked with enough retries. There wasn't enough logs to work with. The remote build server logs only logged ERRO[datetime] Not continuing with push after error: context canceled. The local client invoking the push operation only printed out EOF errors.

honeymoon The remote build server (everything except the client) ran in a remote VM, hence "remote builder".

The Process

My first step was to run the server logs through github codesearch. This returned some hits in the moby/moby codebase. This is a core library used by docker products, so I knew those logs were being emitted by the "docker daemon" portion of my remote build server. From this, I knew a component of my build server was cancelling the docker daemon's push operation. My aim was to find the code cancelling the push operations. I jumped straight into the moby codebase, trying to work my way backwards. This took hours of tracing & code reading, but led nowhere.

Early Signals

As a next step, I wrapped the http handler that proxies push requests to the docker daemon with a middleware that logs a "request start" and "request end" before and after the handler was executed. I noticed it logged both messages for all docker requests except pushes, for which the "end" log never appeared. From this, I implied the http handler that proxied push requests to the docker daemon was clearly stopping midway, but somehow it wasn't returning control flow to it's caller. That didn't match my understanding of control flow.

In order to confirm the cancellation didn't originate from the client, I needed to prove the http server was the origin of the cancellation. I performed this proof by adding a goroutine to poll the Done() channel of the request context being passed to the reverse-proxy. Interestingly, the done channel returned “context cancelled” error. Context cancellation in golang is top-down, so the ctx.Done() channel of the http_server returning a cancelled error meant the cancellation couldnt have originated from a later call, ie. the reverse-proxy. These were pragmattic assumptions, but I was wrong.

net/http

With the http_server as the culprit, I dove into the net/http codebase, specifically ListenAndServe. The serverHandler{c.server}.ServeHTTP(w, w.req) is the method that calls the handler of the proxy. I studied the code well enough to know all the factors that could cause a cancellation in my scenario. Most likely being this defer call, however because I had logs proving my handler wasn’t returning, I was also certain the ServeHTTP was blocked, waiting our reverse-proxy.

In summary, I had substantial proof that the defer calls declared before the blocked handler couldn't fire. I mean, unless of course a panic happened somewhere. But a panic would show up in the logs right? The lesson here is to hold weak assumptions when you're deep in the trenches.

Breakthrough

Taking a fresh approach, I decided to manually set a fresh context.Background() on the request sent to the reverse proxy handler. My theory was that, a fresh context woundn't get cancelled so the pushes would work. This experiment had a weird result. The docker pushes still failed, with the same cancelled error, but the ghost “request done” middleware logline started showing up all of a sudden. Not just that, there was also new logline that read: failed to write from localhost:8080->localhost:98374. i/o timeout.

The http server had low timeout configurations, so attempts to push large docker images failed midway. Pushes for smaller images randomly succeed based on caching & network conditions. This was the root cause, but why was it only showing up now?

Restoring Sanity

I had enough to stop digging, but the entire experience contradicted my understanding of program execution. I had either uncovered something or had a flawed understanding.

Through more debugging, I found this line in the reverse-proxy source code which CONDITIONALLY SUPPRESSES ERRORS based on a values set in the context. If the context lacks this field, it logs an error and returns. Otherwise, it panics with the error http.ErrAbortHandler. In golang, when a panic occurs, execution is halted, defer calls (cancels + recovers) run, and control is returned to the caller. In my case, the http.serve function was the OG caller, and it had a top level recover that logged every panic except ErrAbortHandler panics

At this point, everything made sense. It made sense why I was seeing cancel error instead of timeout error. The reverse proxy and the http server were working in tandem to suppress errors, as the expense of logging less contextual errors.

Failure mode

  • Client make a request to the remote builder server
  • Server calls the reverse-proxy, which forwards the request to the docker daemon
  • A push is initiated, however midway, the timers for the build server fires and cancels the operation.
  • This cancellation is propagated to res.Body.Read() of the reverse-proxy, whose read operation breaks down and returns.
  • The reverse proxy panics and aborts with an ErrAbortHandler, instead of returning the original error.
  • The top-level defer in the caller of the reverse-proxy (ie. http.serve) cancels all contexts and recovers from the panic, thereby suppressing it due to it being an ErrAbortHandler
  • Original timeout error isn't logged. Server logs only show request cancelled, client logs EOF because the network pipe to the server got broken.

Not fun, very nasty