dotnet lambda internal networking
Attaching Lambdas to a private VPC is a relatively new offering from AWS, and one that is greatly beneficial when running serverless systems that still need to call into internal / on-prem systems.
We have a team that is entirely* serverless, running its DotNet code almost exclusively in Lambdas (the asterisk is there because, for performance / latency reasons, a few select components still run on "old-school" virtualized servers).
Many of their Lambdas must call into internal resources / systems, so placing those Lambdas into our VPCs - which are connected to our internal network - is necessary.
While the vast majority of their invocations complete without issue, for the past few months we have seen some very interesting failures when these Lambdas attempt to connect to any networked resource (internal or otherwise).
After thorough investigation, executing the same logic in other languages, and even bringing AWS on-site to review the infrastructure at a more granular level, we have found some interesting - and not always reproducible - issues with how DotNet specifically operates within internally-networked Lambdas.
A bit of low-level background on Lambdas: Lambda functions run in purpose-built AWS execution environments (not Docker, containerd, Podman, or other "widely available" container systems) on top of Amazon Linux EC2 hosts. AWS has stated publicly that these sandboxes are backed by Firecracker, its KVM-based micro-VM technology, but the layers above that - placement, scaling, and how network interfaces are wired in - are held close to the chest, and the Lambda service itself is closed-source.
DotNet was created to run primarily on Windows systems, and running DotNet code in Lambdas relies on the cross-platform DotNet (Core) runtime to execute managed code in the Linux container world. That runtime has matured considerably, but it is still an additional layer of abstraction sitting between the application code and the operating system it actually runs on.
When looking at the layers of abstraction required to execute DotNet code in a Lambda, the stack looks something like this:
DotNet Code -> DotNet Runtime (CLR) -> AWS Lambda Execution Environment -> Amazon Linux EC2 Host -> Physical Server
For the most part, DotNet Lambdas execute without issue; however, we have found an interesting problem when tying them into our internal network.
When a Lambda is attached to a customer-owned VPC, Amazon removes the AWS-managed network interface and attaches an elastic network interface (ENI) from the customer's VPC to the execution environment. For nearly all other frameworks / languages, this occurs without issue. However, we have found that in certain instances, and seemingly at random times, a race condition occurs in which the Lambda code begins to execute before the customer's network interface has been attached.
This effectively means that any network request made within the Lambda fails with an obscure "Resource temporarily unavailable" error (EAGAIN at the socket level). And because the DotNet networking objects - typically an HttpClient from System.Net.Http, instantiated once when the execution environment is initialized and reused across invocations - hold onto that failed state, all subsequent network requests from the same environment continue to fail, even after the customer's network interface has been attached.
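To make the failure mode concrete, here is a minimal sketch of the pattern involved, assuming the usual "one HttpClient per container" approach. The class name, handler signature, and internal URL are illustrative placeholders, not our actual code.

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using Amazon.Lambda.Core;

// Illustrative only: a hypothetical handler showing the common "one HttpClient
// per container" pattern. The class name, handler, and URL are placeholders.
public class OrderLookupFunction
{
    // Created once, when the execution environment is initialized. If the
    // customer ENI has not yet been attached at this point in the cold start,
    // every request made through this client hits the race described above.
    private static readonly HttpClient Client = new HttpClient();

    public async Task<string> Handler(string orderId, ILambdaContext context)
    {
        // On an affected container this throws an HttpRequestException wrapping
        // a SocketException ("Resource temporarily unavailable") - and, per the
        // behavior above, keeps failing on warm invocations of the same container.
        var response = await Client.GetAsync($"https://internal.example.com/orders/{orderId}");
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```

Because the client is created during initialization and lives for the lifetime of the container, an environment that cold-starts on the wrong side of the race keeps serving failures until the container is recycled.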
After extensive testing, we found that the issue is not consistent and may be related to the underlying hardware and / or the EC2 instance family on which the Lambda is executed; however, since Lambda is a managed, "serverless" service, such low-level details are hidden from the user.
If possible, re-writing the application logic in a more performant / Linux-friendly language is an ideal solution; however, for enterprise applications built on top of the DotNet framework, with a large DotNet development team behind them, this is simply an untenable "solution".
Until a long-term resolution is in place, we have found a viable short-term mitigation in combining three things: waiting during initialization until the network interface is actually available, implementing sufficient retry logic within the Lambda code, and using a queuing system to dispatch work across multiple Lambdas. A rough sketch of the retry piece is shown below.
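The sketch illustrates the retry-and-recreate idea only; the attempt count, backoff delays, and class name are illustrative assumptions rather than tuned values from our production code.

```csharp
using System;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading.Tasks;

// Illustrative sketch of the retry mitigation: back off and recreate the
// HttpClient on socket-level failures so a client created before the ENI was
// attached does not keep poisoning later attempts. Values are placeholders.
public static class ResilientHttp
{
    private static HttpClient _client = new HttpClient();

    public static async Task<HttpResponseMessage> GetWithRetryAsync(string url, int maxAttempts = 4)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await _client.GetAsync(url);
            }
            catch (HttpRequestException ex) when (ex.InnerException is SocketException && attempt < maxAttempts)
            {
                // Exponential backoff (0.5s, 1s, 2s, ...) to give the network
                // interface time to attach, then rebuild the client so we are
                // not reusing whatever state the failed client was holding.
                await Task.Delay(TimeSpan.FromMilliseconds(500 * Math.Pow(2, attempt - 1)));
                _client.Dispose();
                _client = new HttpClient();
            }
        }
    }
}
```

Recreating the client on socket-level failures is what keeps a container that lost the race from staying stuck, per the behavior described above; the queue in front of the Lambdas then re-drives any invocation that exhausts its retries.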
last updated 2024-03-18