nginx upstream dns caching

For the past few weeks - namely after the AWS Route53 DDoS that caused myriad problems throughout our network - I had been noticing more issues than usual with our NGINX ingress systems.

These issues revolved around NGINX proxy upstreams suddenly failing to resolve, causing a ripple effect of failures throughout our network.

However, when I hopped onto the NGINX servers to investigate, I was able to resolve the upstreams without issue through all of our internal DNS servers. Yet staring me in the face were log files clearly indicating upstream resolution failures - as well as teams wondering why their systems were suddenly unresponsive.

Each time, I would do my usual investigation, pull the required logs for audits / review / grep / grok, restart the service, and go on my way.

However after a particularly large system was affected, I grabbed some coffee and hunkered down for a more in-depth investigation.

Our cloud network architecture takes advantage of VPC DHCP Option Sets wherever possible. This ensures that internal DNS servers are not hardcoded in systems / applications, which makes it much easier to roll and manage DNS configuration from a central location.

While we had set explicit NGINX resolvers for some specific servers, the vast majority of our NGINX servers relied on the host OS / VPC DHCP for DNS resolution.

However, after reviewing my collection of logs from the affected systems, I noticed an interesting trend.

NGINX was reporting that the upstream's resolved IP - which had been working a day before - was no longer responding, particularly for AWS services such as ELB, API Gateway, etc.

Yet when I ran a dig against the DNS record, it resolved perfectly fine, albeit to a new IP.

This raised the question - is NGINX caching DNS records and not respecting TTL? Short answer: yes.

After checking with our NGINX contact, I confirmed that if a resolver is not defined in NGINX, it will use the host OS's network configuration to resolve all upstreams at startup / config reload, and cache those records indefinitely, regardless of the DNS TTL.
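As a minimal sketch of the failure mode (the hostnames here are placeholders, not our actual config), this is the kind of configuration that behaves that way: with no resolver directive anywhere, the proxy_pass hostname is looked up through the host OS when the config is loaded, and the answer is reused until the next reload.

    # No "resolver" directive anywhere in the configuration.
    server {
        listen 80;
        server_name app.example.internal;

        location / {
            # internal-elb.example.com stands in for an AWS endpoint (ELB,
            # API Gateway, etc.) whose IPs rotate behind the DNS name.
            # NGINX resolves it once via the host OS at startup / reload and
            # caches the resulting IPs indefinitely - the record's TTL is
            # ignored.
            proxy_pass https://internal-elb.example.com;
        }
    }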

This meant that for AWS services which had constantly changing IPs behind a DNS record, the TTL was not being respected, and when AWS changed the upstream IPs, NGINX was not re-querying to get the new IPs.

To resolve this issue (no pun intended), a resolver was set in the http block pointing to our DNS servers, while the individual servers that required their own resolvers were still able to override it in their own config blocks.
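A minimal sketch of that change, with placeholder DNS server IPs and hostnames rather than our real values: the resolver in the http block becomes the default, and a server block can still declare its own. One detail worth calling out - when a proxy_pass hostname is written statically, NGINX still resolves it only at config load, so referencing the hostname through a variable is the usual way to make NGINX re-resolve it at request time, honoring the TTL (a valid= parameter on the resolver can shorten or lengthen that cache if desired).

    http {
        # Default resolver for every server; 10.0.0.2 is a placeholder for
        # our internal DNS servers. With a resolver defined, NGINX honors the
        # record's TTL by default.
        resolver 10.0.0.2;

        server {
            listen 80;
            server_name app.example.internal;

            location / {
                # Passing the hostname through a variable forces resolution at
                # request time instead of once at startup / reload.
                set $upstream_host internal-elb.example.com;
                proxy_pass https://$upstream_host;
            }
        }

        server {
            listen 80;
            server_name special.example.internal;

            # This server needs its own resolver and overrides the http-level
            # default.
            resolver 10.0.8.2;

            location / {
                set $upstream_host other-elb.example.com;
                proxy_pass https://$upstream_host;
            }
        }
    }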

With this change in place, the upstream resolution issues have gone away completely.

One would assume that a networking system such as NGINX would respect something as basic as DNS TTL when resolving through the host, but as with anything in this field - or really, in life - assumptions mean nothing in the face of facts.

Moral of the story? Explicitly define DNS resolvers in your NGINX servers, especially when communicating with upstreams that frequently roll their IPs.

last updated 2022-08-20