Debugging on the Network

It’s hard to ignore the fact that almost every piece of software we build nowadays is networked software in one way or another, and a lot of the software we build is distributed.

One of the quirks of building such systems is that you have another component to worry about: the network. You need to have some knowledge of how one component reaches out and touches another, and how messages flow through it. I mean, sure you don’t need to know the exact details of BGP routing, or even the internals of TCP and IP, but you at least need to have some working model of the behavior of the network. It also means being able to leverage some tools to figure out issues between components.

I was helping out my teammates with a pretty interesting problem. We were in the process of onboarding a new reverse proxy in front of one of our services; it’d sit in between the service and a load balancer. Think something like HAProxy for SSL termination – the load balancers shouldn’t be doing SSL termination for HTTPS, but neither do we want our service to handle it.

Anyway, the problem in this case was this: we were fairly certain we’d set up the load balancer correctly, and we were definitely certain our service was working properly (because, really, we’d have bigger problems if it weren’t), but we couldn’t for the life of us figure out why we were getting an HTTP 500 when fetching our service’s health check endpoint.

One thing I suggested was to break the problem down: figure out where in the chain from client to service the problem lies. Although we were certain our load balancer was configured correctly, we couldn’t be 100% sure – we might have missed something after all, even if we did use another load balancer configuration as reference. We were also likely to have misconfigured our SSL termination proxy, as it was a new thing we were adding to the chain. However, we might also be triggering a bug or misconfiguration in our service, unlikely as that was: this was a setup that had never existed before, so there was a chance that certain assumptions about which HTTP headers are being sent (or their contents, for that matter) were being violated.

It was more likely, though, that the problem lay in the interaction between components. A lot of problems show up where components interact, hence our use of integration tests to discover them. In hindsight, we probably should have built this out in an earlier part of our deployment pipeline, maybe in our beta stages – we were trialing this configuration in one of our gamma environments, as a sort of “pre-production-lite”, so we could have more confidence in it working in our production stages – but I digress.

So, we decided to run a simple request to the health check endpoint at every stage – at the load balancer, at the SSL termination proxy, and at our service. Obviously, when queried directly, our service returned fine, and obviously, when queried at the load balancer, we got HTTP 500. We also observed that error at the SSL termination proxy, so it smelled more and more like it was a problem at the SSL proxy end.
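A hop-by-hop check like that can be sketched in a few lines of shell. This is only an illustration: the hop names, hosts, ports, and the /health path in the usage comment are placeholders, not the actual setup.

```shell
#!/bin/sh
# Print the HTTP status code a URL answers with ("000" if nothing answered).
# -s silences progress, -k skips certificate checks (test hosts only),
# -o /dev/null discards the body, -w prints just the status code.
probe_hop() {
    curl -s -k -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Probe each "name=url" argument in order, printing one "name: code" line.
probe_chain() {
    for hop in "$@"; do
        name=${hop%%=*}   # text before the first "="
        url=${hop#*=}     # text after the first "="
        printf '%s: %s\n' "$name" "$(probe_hop "$url")"
    done
}

# Hypothetical invocation, innermost hop first:
#   sh probe.sh service=http://localhost:9080/health \
#               proxy=https://localhost:8443/health \
#               lb=https://lb.example.internal/health
if [ "$#" -gt 0 ]; then
    probe_chain "$@"
fi
```

Probing from the service outward makes the first hop that returns an error stand out. It does not say whether that hop produced the error itself or merely relayed one from behind it.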

Hold on, I said to my teammates. We haven’t yet discounted our service completely.

You see, at this point dear readers, we haven’t yet isolated whether the response is coming from our service, or it’s coming from the SSL termination proxy. Sure, we’re querying the proxy – but did it return immediately with HTTP 500, or is it querying our service, and in turn because our service is getting a bad request (or something), our service is erroring out?

At this point, there are a few ways to diagnose this, but I went for the most natural for me: tcpdump(1). I actually hadn’t expected tcpdump(1) to be available on our infrastructure, so I was pleasantly surprised to find it installed. I mentioned we could watch for packets to the port our service is listening on, and see whether the HTTP requests being made to it made any sense.

One of the things I’d suspected at this point was a mismatch of protocols: the SSL termination proxy may have been making an HTTPS request to our service on our plain HTTP port, and so our service was basically getting a junk request, and the SSL proxy was in turn getting a “junk” response. The best way to have empirical evidence of that or the contrary is, of course, to take a look at what’s actually flowing between the two, and this is where tcpdump(1) comes in.

We ran tcpdump(1) with the bare minimum so we could at least see what was being sent to our service:

  $ tcpdump -n -A -i lo port 9080

Our service was listening for HTTP requests on port 9080, and we’d configured our proxy to talk to it on that port, so if there was any garbage (i.e. SSL traffic instead of plain HTTP) we’d see it immediately.
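For what it’s worth, the two cases are easy to tell apart by the first few bytes of the payload: a TLS handshake record starts with the bytes 0x16 0x03, while a plain HTTP request starts with an ASCII method name. Here is a small sketch of that check, assuming you copy the leading payload bytes as hex out of something like tcpdump -X; the function name is made up for illustration.

```shell
#!/bin/sh
# Classify the first bytes of a TCP payload, given as a hex string
# (for example copied from `tcpdump -X` output with the spaces removed).
classify_payload() {
    hex=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    case "$hex" in
        # TLS record: content type 0x16 (handshake), then major version 0x03.
        1603*) echo "tls" ;;
        # ASCII for common HTTP methods: "GET ", "POST", "HEAD", "PUT ".
        47455420*|504f5354*|48454144*|50555420*) echo "http" ;;
        *) echo "unknown" ;;
    esac
}

classify_payload 160301      # how a TLS ClientHello begins
classify_payload 474554202f  # ASCII "GET /"
```

With -A, the same difference shows up as a readable request line versus a run of mostly unprintable characters.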

No dice. In fact, we found that the SSL proxy wasn’t making any requests to our service at all – another data point, and one that exonerated our service, at least for now.

We eventually traced it to a configuration option on the SSL termination proxy (which our load balancers supported) called Client IP insertion. Specifically, the SSL termination proxy was designed to work with our Citrix NetScaler load balancers in TCP mode – and in this mode, the load balancers inject the IP address of the connecting client in-band as part of the initial TCP handshake. When talking to the SSL termination proxy directly, we obviously weren’t sending such data (we were using curl(1) to test), and so we were causing the proxy to respond with an error.

There are two ways we could fix this, and we opted for the more obvious one, since it’s the behavior we wanted:

1. Switch our load balancer to injecting the client IP, as per documentation;
2. Tell our SSL termination proxy to not expect client IP injection.

We went with the first option: it was the most convenient, and it was what we wanted to have anyway, but it also meant that we couldn’t ever test from the SSL termination proxy end (unless we used netcat to inject raw packets, but that’s an entirely different kind of pain I’d rather not go into).
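To give a feel for what “in-band” client IP data looks like, here is a sketch using a different, publicly documented format: HAProxy’s PROXY protocol version 1, a single text line sent before any other bytes on the connection. To be clear, this is not the NetScaler format our proxy expected, and the addresses and ports are made-up examples.

```shell
#!/bin/sh
# Build a PROXY protocol v1 header line (HAProxy's text format), which
# carries the original client address ahead of the real request bytes.
# NOTE: shown as an analogy only; this is NOT NetScaler's insertion format.
proxy_v1_header() {
    # $1 = client IP, $2 = destination IP, $3 = client port, $4 = destination port
    printf 'PROXY TCP4 %s %s %s %s\r\n' "$1" "$2" "$3" "$4"
}

# Example with documentation-range addresses (hypothetical values).
proxy_v1_header 203.0.113.7 192.0.2.10 51234 443
```

Against a TLS-terminating listener, a line like this would have to be followed by a full TLS handshake on the same connection, which is a large part of why hand-feeding it through netcat is painful.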

After reconfiguring our load balancers and testing with curl(1) at the load balancer end, we were able to see a successful response from our service’s health check – which validated that the whole configuration worked.

This isn’t the first time I’ve used tcpdump(1) to debug a service. In the same way that I think println is the unsung hero of debugging within a service, tcpdump is its networking cousin, and I’ve used it to figure out issues with the increasingly networked services we build.

For instance, I remember trying to debug a component of a backend orchestration middleware we built, whose job was to process SMS messages from end users by triggering various actions, including coordinating billing for an end-user service that our client was building (hence the middleware).

We had an issue where the middleware couldn’t seem to authenticate with the billing service, and since it was a custom protocol provided by the telco, the usual tools for debugging HTTP requests we had at our disposal weren’t available; we were given protocol documentation and a test environment, nothing more.

So being able to at least capture traffic from our middleware to that service via tcpdump(1) allowed us to debug the issue – and in the end it turned out we had a bug in our implementation of the protocol, causing us to incorrectly interleave disparate requests.

Sure, Wireshark would have also worked in this case (and would have been easier to use), but sometimes ssh-ing to a server and running tcpdump(1) is all you need (and all you have).
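The two also combine well: capture on the server with tcpdump(1), then look at the file somewhere more comfortable. A minimal sketch, where the function name, the output path, and the packet count default are my own placeholders:

```shell
#!/bin/sh
# Capture traffic for one port into a pcap file for later inspection.
# $1 = interface, $2 = port, $3 = output file, $4 = packet count (default 200)
capture_to_file() {
    # -n: no name resolution; -s 0: capture full packets;
    # -c: stop after N packets; -w: write raw packets to a file.
    tcpdump -n -s 0 -i "$1" -c "${4:-200}" -w "$3" port "$2"
}

# Hypothetical usage on the server (capturing usually needs root):
#   capture_to_file lo 9080 /tmp/service.pcap
# Then copy the file to a workstation and open it in Wireshark,
# or read it back in place with: tcpdump -n -A -r /tmp/service.pcap
```

The packet limit is there so that a forgotten capture does not quietly fill the disk.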
