the call is coming from inside the house

Receiving a call at 2am is never a good sign, and receiving two calls back to back, a voicemail, and multiple emails at 2am is even worse.

It usually starts with the groggy "so what is going on" question - asked less to understand the situation and more to buy yourself time to wake up a bit and get your bearings for what may or may not be a long day - morning, rather - ahead.

You generally know that if an issue got through L1 and L2 without resolution, and the call is coming at this hour, it can't be a good sign.

It started with a report of "[core service integral to business] is down, and we can't get it back up". It quickly grew to "downstream services are affected and it is delaying the execution of time-sensitive business actions".

After a few minutes of investigation, it was confirmed that a core service ingress was dropping packets, timing out, and at points becoming entirely unresponsive.

The overall network throughput was relatively normal, and there were no glaring anomalies pointing to a root cause.

However, digging into the system logs showed some interesting details - one micro-service in particular was receiving a massive amount of traffic in an extremely short period of time. For an enterprise company operating at scale, and a media company at that, high throughput is to be expected. But this was far outside the normal throughput for this service - on the order of hundreds of thousands of requests, multiple gigabytes of data, within milliseconds, from multiple client IPs. At face value, it had the hallmarks of a DDoS attack.

Continuing that thought: a high-profile company that is often in the media should expect some level of script-kiddie attacks and DDoS attempts.

While our network could handle the traffic, the bursty nature of the requests and their distributed origins were maxing out the number of available *nix sockets on our ingress systems.

However, what stood out was that the traffic was terminating at a solely internal ingress with no internet access, and all of the client IPs were from our internal address space. So either someone had gotten inside our network, or one of our systems was hammering another.

The former was quickly ruled out - we have extensive network security protocols and monitoring in place, and the client IPs fell in a range we had carved out for one particular team's AWS Lambda functions.

Digging deeper, we found that the application in question was queueing requests to another service. When those requests failed, the source application's queue continued to grow, and because the retry logic had no back-offs, it kept hitting the destination system with increasingly large payloads.
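
For illustration, here's a rough Python sketch of the kind of back-off that retry logic was missing - the function names, endpoint, and numbers are hypothetical, not the team's actual code:

```python
import random
import time

import requests  # assumed HTTP client; the real service's client isn't shown here


def send_with_backoff(url: str, payload: dict, max_attempts: int = 5) -> requests.Response:
    """Retry a request with exponential back-off and jitter instead of
    hammering the destination immediately on every failure."""
    for attempt in range(max_attempts):
        try:
            resp = requests.post(url, json=payload, timeout=5)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise
            # Sleep 1s, 2s, 4s, ... plus jitter, so queued retries don't
            # re-align into the same burst that swamped the ingress.
            time.sleep((2 ** attempt) + random.uniform(0, 1))
```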

To make matters worse, for every request it received, the destination system would call back into itself - through the full RESTful API netpath rather than an internal function call - to validate the request.

It was a positive feedback loop of API requests, retries, and callbacks.
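
As a rough sketch of that callback anti-pattern (names and endpoint are hypothetical, assuming a Python service): the first handler below validates each request by calling its own public API over the network, while the second does the same work as an in-process call with no extra trip through the ingress.

```python
import requests

SELF_BASE_URL = "https://api.example.internal"  # hypothetical stand-in for the service's own endpoint


def validate_request(payload: dict) -> bool:
    """Placeholder for the service's own validation logic."""
    return "id" in payload


def handle_request_antipattern(payload: dict) -> dict:
    # Anti-pattern: the service calls back into itself over the full REST
    # netpath, doubling traffic through the ingress for every request.
    resp = requests.post(f"{SELF_BASE_URL}/v1/validate", json=payload, timeout=5)
    if resp.status_code != 200:
        raise ValueError("invalid request")
    return {"status": "accepted"}


def handle_request(payload: dict) -> dict:
    # Same check as an internal function call - no extra hop through the ingress.
    if not validate_request(payload):
        raise ValueError("invalid request")
    return {"status": "accepted"}
```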

The immediate solution was to implement rate limiting and buffering at the network level to free up resources on our ingress systems and reduce the load on the target systems.
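
The actual controls were applied with our network tooling, but conceptually they boil down to something like the token-bucket limiter below - a simplified sketch with made-up rates, not the real implementation:

```python
import time


class TokenBucket:
    """Simplified token-bucket rate limiter: traffic beyond the sustained
    rate is rejected (or could be buffered) instead of reaching the backend."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec   # tokens refilled per second (sustained rate)
        self.capacity = burst      # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


limiter = TokenBucket(rate_per_sec=100, burst=200)
# allowed = limiter.allow()  # gate each inbound request before it hits the backend
```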

With these fixes in place, the load on the systems was reduced and spread out, giving them - and me - some breathing room to get back to work.

By utilizing split DNS, we optimized the net path for the system's self-validation callbacks to reduce load on our ingress systems. And by working with the source system's app team, we reduced the overall load on the system through more strategic use of batching, callbacks, and back-offs.
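
On the batching side, the change amounts to something like the sketch below - one request per chunk of queued items instead of one request (and one socket) per item. The endpoint and batch size are hypothetical:

```python
import requests  # assumed HTTP client

DEST_URL = "https://core-service.example.internal/v1/ingest"  # hypothetical endpoint
BATCH_SIZE = 100  # hypothetical; tuned to what the destination can absorb


def flush_in_batches(queue: list[dict]) -> None:
    """Send queued items as a handful of batched requests instead of
    hitting the destination once per item."""
    for i in range(0, len(queue), BATCH_SIZE):
        batch = queue[i : i + BATCH_SIZE]
        requests.post(DEST_URL, json={"items": batch}, timeout=10)
```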

last updated 2024-03-18