Incident Postmortem of 12 October 2019
Between 13:10 and 15:30 UTC on the 12th of October 2019, Vexera's service was disrupted.
A misconfiguration of a library used by some internal services overloaded our local DNS resolver to the point where it became slow to resolve queries, degrading the performance of the internal services relying on it. Additionally, our attempts to recover quickly caused two further issues - we triggered a bug in the same library and a race condition in our internal cache - forcing us to perform two cold restarts of Vexera within 20 minutes to fully return to nominal conditions.
The whole incident was therefore the result of three consecutive issues affecting Vexera's internal services. We decided to write and publish this postmortem to learn from the incident while being fully transparent with our users.
Detailed event timeline
All times are in UTC. To make the events easier to follow, we have included some complementary technical information in square brackets.
Around 13:40 we noticed, via our internal monitoring (which displays the average events per second over 30 minutes), an unusual drop in the number of events received from Discord's real-time gateway. Since these drops are usually the result of a Discord-side outage, we started to monitor the issue closely while watching for signs of such an outage.
At 13:57 we noticed that our local DNS resolver (CoreDNS) which handles requests for all of our services was pinning every CPU core to 100%.
At 14:01 we noticed a heavy increase in RAM usage for our "Sharder" micro-service cluster, which is responsible for forwarding events from Discord's real-time gateway to a queue consumed by our "Worker" cluster. Since the Sharder cluster shouldn't make any DNS requests after startup, we assumed these two events were independent and focused our efforts on bringing the Sharder cluster back to nominal values.
At 14:06 we decided to do a rolling restart of the Sharder cluster. We perform this operation regularly without any impact on end users. [Each instance of this cluster handles 36 shards (36 connections to Discord's gateway), with each shard handling approximately 1,000 guilds - our >300,000 guilds are currently distributed across 288 connections. Restarting an instance consists of shutting down the old instance - closing its 36 connections - then spawning a new instance and sequentially reconnecting its shards using the resume mechanism provided by the Discord gateway, which lets us receive all events missed during the reconnect.]
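For context, Discord's documented routing formula assigns a guild to shard `(guild_id >> 22) % shard_count`. The sketch below shows how 288 shards split into eight instances of 36 line up with that formula; the instance-mapping helper is illustrative, not Vexera's actual code.

```python
# Rough sketch of the shard topology described above. The guild -> shard formula
# is Discord's documented one; the instance grouping is an assumption used only
# to illustrate "288 shards, 36 per Sharder instance".

SHARD_COUNT = 288
SHARDS_PER_INSTANCE = 36  # 288 / 36 = 8 Sharder instances

def shard_for_guild(guild_id: int, shard_count: int = SHARD_COUNT) -> int:
    """Discord's documented guild -> shard routing."""
    return (guild_id >> 22) % shard_count

def instance_for_shard(shard_id: int, shards_per_instance: int = SHARDS_PER_INSTANCE) -> int:
    """Hypothetical mapping of a shard to the Sharder instance that owns it."""
    return shard_id // shards_per_instance

if __name__ == "__main__":
    guild_id = 302094807046684672  # an arbitrary snowflake, for illustration only
    shard = shard_for_guild(guild_id)
    print(f"guild {guild_id} -> shard {shard} -> instance {instance_for_shard(shard)}")
```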
Right after the first instance began its restart, we noticed that the shard reconnections were taking longer than usual. Because the connections are sequential, the resume keys of the second half of this instance's shards were expiring while the first half was still reconnecting. We quickly determined that this delay was caused by the slowness of the DNS resolver (which is used to resolve the gateway IP for each connection).
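To make the failure mode concrete, here is a back-of-envelope sketch of how sequential reconnects plus a slow resolver exhaust the resume window. Every number below is an assumption; only the mechanism (one shard after another, each delayed by slow DNS, against a limited session-validity window) comes from the incident.

```python
# Back-of-envelope sketch: with sequential reconnects, the wait before shard N
# resumes grows linearly, so once each reconnect is slowed down enough, the
# later shards blow past the resume-validity window. Numbers are hypothetical.
SLOW_RECONNECT_S = 30       # assumed per-shard reconnect time with degraded DNS
RESUME_WINDOW_S = 600       # assumed validity window of a resume session
SHARDS_PER_INSTANCE = 36

for shard_index in range(SHARDS_PER_INSTANCE):
    wait = shard_index * SLOW_RECONNECT_S   # shards reconnect one after another
    if wait > RESUME_WINDOW_S:
        print(f"shard #{shard_index} waits {wait}s > {RESUME_WINDOW_S}s: resume key expired")
```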
At 14:12 we decided to restart our DNS resolver in an attempt to temporarily mitigate the issue and continue our rolling restart under more comfortable conditions. Immediately after this restart, the whole Sharder cluster crashed due to a bug in a library used to generate statistics (in short: it was unable to resolve the internal service it sends these stats to, and hard-failed instead of temporarily queuing the stats in memory).
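For illustration, here is a minimal sketch of the fail-soft behaviour we would have expected from the library: buffer metrics in memory when the ingest host can't be resolved, instead of raising and taking the process down. The host name and the class are hypothetical; this is not the actual library.

```python
# Minimal sketch of "queue stats in memory instead of hard-failing" when the
# StatsD host cannot be resolved. Host name, port and class are assumptions.
import socket
from collections import deque

class FailSoftStatsClient:
    def __init__(self, host: str = "statsd.internal", port: int = 8125, max_buffered: int = 10_000):
        self.host, self.port = host, port
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.buffer = deque(maxlen=max_buffered)  # oldest metrics are dropped if full

    def incr(self, metric: str, value: int = 1) -> None:
        packet = f"{metric}:{value}|c".encode()
        try:
            addr = socket.getaddrinfo(self.host, self.port, proto=socket.IPPROTO_UDP)[0][4]
        except socket.gaierror:
            self.buffer.append(packet)   # resolution failed: keep the metric, don't crash
            return
        while self.buffer:               # flush anything queued while DNS was down
            self.sock.sendto(self.buffer.popleft(), addr)
        self.sock.sendto(packet, addr)
```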
At 14:14, since the impact on users was now real, we started communicating about the incident. A message was posted in #status-updates and delivered at 14:16 to the 500+ Discord servers following that announcement channel.
At 14:25, 65% of our shards had reconnected to the gateway.
At 14:27, following the Sharder cluster crash, we were able to identify and confirm that the high CPU usage of the DNS service was due to a misconfiguration of the same statistics library, which was issuing >20,000 queries per second to resolve the IP of our internal StatsD ingest service - this misconfiguration had been introduced in a patch a few days earlier.
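For context, a StatsD-style client that resolves its ingest host on every metric send will hammer the resolver in exactly this way; caching the resolved address for a short TTL reduces that to roughly one DNS query per refresh, no matter how many metrics are sent. A minimal sketch (host name and TTL are assumptions, not our actual values):

```python
# Minimal sketch of the fix implied above: resolve the StatsD ingest host once
# and reuse the cached address for a while, instead of one DNS query per metric.
import socket, time

class CachedResolver:
    def __init__(self, host: str, port: int, ttl: float = 60.0):
        self.host, self.port, self.ttl = host, port, ttl
        self._addr = None
        self._expires = 0.0

    def resolve(self):
        now = time.monotonic()
        if self._addr is None or now >= self._expires:
            info = socket.getaddrinfo(self.host, self.port, proto=socket.IPPROTO_UDP)
            self._addr = info[0][4]
            self._expires = now + self.ttl
        return self._addr

resolver = CachedResolver("statsd.internal", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_metric(metric: str, value: int = 1) -> None:
    # At most one DNS query every `ttl` seconds, however many metrics are sent.
    sock.sendto(f"{metric}:{value}|c".encode(), resolver.resolve())
```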
At 14:33 we patched the Sharder (and three other services) to fix the configuration of this library.
At 14:36 we decided to deploy this patch with a cold restart of Vexera, due to technical dependencies and constraints between the micro-services being updated.
At 14:38, we integrated into our codebase another patch (that had been planned for a later deployment) to speed up reconnection. [Discord recently migrated Vexera to a special sharding system intended for large bots which, among other things, allows us to connect up to 16 shards at once (standard bots can only connect one shard every 5 seconds). This patch replaced our sequential reconnection mechanism with one that spawns shards in batches of 16.]
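To illustrate the difference, here is a rough sketch of batched identifies under a concurrency limit of 16, with the back-of-envelope timing in the comments. The connection function is a placeholder, not our actual gateway client, and the exact bucketing is simplified.

```python
# Back-of-envelope sketch of the batched reconnect described above. At the
# standard limit of 1 identify per 5 s, 288 shards need ~288 * 5 s = 24 minutes
# of identify waits alone (plus connection time, hence the ~30 minutes quoted
# below). With 16 shards per 5-second window, 288 / 16 = 18 batches means only
# ~90 s of waiting plus connection time.
import asyncio

SHARD_COUNT = 288
MAX_CONCURRENCY = 16   # granted by Discord for large bots; 1 for standard bots

async def connect_shard(shard_id: int) -> None:
    """Placeholder for opening one gateway connection and identifying."""
    await asyncio.sleep(0.1)
    print(f"shard {shard_id} identified")

async def connect_all() -> None:
    for batch_start in range(0, SHARD_COUNT, MAX_CONCURRENCY):
        batch = range(batch_start, min(batch_start + MAX_CONCURRENCY, SHARD_COUNT))
        await asyncio.gather(*(connect_shard(s) for s in batch))
        if batch_start + MAX_CONCURRENCY < SHARD_COUNT:
            await asyncio.sleep(5)  # wait out the identify window before the next batch

if __name__ == "__main__":
    asyncio.run(connect_all())
```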
At 14:40 we killed all micro-services and started our cold boot process.
At 14:46, all shards had reconnected [without the patch, this would have taken approximately 30 minutes].
At 14:49 we noticed that our workers were reporting a lot of "Unknown channel/member" errors. These errors are normal right after a cold restart since the internal cache needs to fill up, but they shouldn't last for more than a minute.
At 14:57 we concluded that this was likely due to cache corruption caused by a race condition introduced by the parallel shard connections.
At 15:01 we decided to kill our Sharder cluster again and manually purge the cache to temporarily work around the issue during the reconnection.
At 15:05 the cache was fully purged and we started the first Sharder instance to check that everything was okay. We immediately noticed that Vexera was trying (and failing) to send the "Thanks for adding me…" message to a lot of servers, even though that message is only meant to be sent when Vexera joins a new server. Thankfully, because the cache wasn't fully populated, the function that sends these messages failed.
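As a general safeguard against this class of problem, the "is this guild new?" decision can be based on a persistent record of known guild IDs rather than on the volatile cache, so a cold boot (or a timed-out cache lookup) doesn't look like a wave of new servers. A hedged sketch, with a hypothetical storage layer and function names, not Vexera's actual code:

```python
# Sketch: decide "new guild" from a persistent store instead of the in-memory
# cache, so cache misses after a cold restart don't trigger welcome messages.
import sqlite3

db = sqlite3.connect("known_guilds.db")
db.execute("CREATE TABLE IF NOT EXISTS known_guilds (guild_id INTEGER PRIMARY KEY)")

def is_new_guild(guild_id: int) -> bool:
    """True only the first time a guild ID is ever seen, across restarts."""
    row = db.execute("SELECT 1 FROM known_guilds WHERE guild_id = ?", (guild_id,)).fetchone()
    if row is not None:
        return False
    db.execute("INSERT INTO known_guilds (guild_id) VALUES (?)", (guild_id,))
    db.commit()
    return True

def on_guild_create(guild_id: int) -> None:
    if is_new_guild(guild_id):
        print(f"send 'Thanks for adding me…' to guild {guild_id}")  # placeholder send
```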
At 15:10 we identified that this misbehaviour was due to an internal timeout being exceeded in the worker micro-service because of the large number of events we had to process when booting 16 shards in parallel (this was probably also the cause of the earlier cache corruption). At 15:16 we deployed a patch to our worker cluster to increase this timeout (and temporarily disable the thanks message). We then started the second Sharder instance.
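For illustration, the kind of guard involved looks roughly like the sketch below: event handling wrapped in a timeout generous enough to survive the burst of a 16-shard parallel boot. The handler and the numbers are hypothetical; only the "raise the timeout" idea comes from the incident.

```python
# Sketch of an event-processing call guarded by a configurable timeout that was
# raised to survive the cold-boot event burst. Values and names are hypothetical.
import asyncio

EVENT_TIMEOUT_SECONDS = 300.0   # raised from a much smaller value

async def handle_event(event: dict) -> None:
    """Placeholder for the worker's real event processing."""
    await asyncio.sleep(0)

async def process(event: dict) -> None:
    try:
        await asyncio.wait_for(handle_event(event), timeout=EVENT_TIMEOUT_SECONDS)
    except asyncio.TimeoutError:
        # Better to drop and log one event than to corrupt shared state mid-boot.
        print("event processing timed out; skipping")

if __name__ == "__main__":
    asyncio.run(process({"t": "GUILD_CREATE"}))
```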
At 15:18, after confirming that everything was fine, we started the remaining Sharder instances.
At 15:24, the Sharder cluster was fully booted and in a nominal state.
At 15:26, our internal cache finished filling up. We checked that everything was fine again.
At 15:49, after monitoring the situation, we confirmed that everything was nominal, and posted an update in #status-updates to inform our users that the issues were resolved.
Analysis and actions taken
The original issue - the overload of the DNS resolver - was due to a misconfiguration of a statistics library (in fact, we were relying on an odd default configuration for some options). While we didn't identify this issue before going into production - and we couldn't realistically have caught it in our test environment, because it only appears at scale - we should have been able to detect it quickly after deployment. We monitor a lot of values coming from our own services, but we completely overlooked the basics: monitoring host-level values such as CPU and RAM usage. As a result, we only identified the issue once it had slowed our Sharder cluster down to the point where it started to drop events.
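As a concrete example of "monitoring the basics", a few lines of psutil are enough to watch host CPU and RAM and raise an alert. The thresholds and the alert mechanism (a simple print here) are placeholders, not our actual monitoring stack.

```python
# Minimal sketch of basic host monitoring: sample CPU and RAM with psutil and
# alert when either crosses a threshold. Thresholds and alerting are placeholders.
import time
import psutil   # pip install psutil

CPU_THRESHOLD = 90.0   # percent
RAM_THRESHOLD = 90.0   # percent

def watch(interval: float = 30.0) -> None:
    while True:
        cpu = psutil.cpu_percent(interval=1)       # averaged over 1 second
        ram = psutil.virtual_memory().percent
        if cpu > CPU_THRESHOLD or ram > RAM_THRESHOLD:
            print(f"ALERT: cpu={cpu:.0f}% ram={ram:.0f}%")   # page someone here
        time.sleep(interval)

if __name__ == "__main__":
    watch()
```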
The second issue - the crash caused by the stats library - seems to only happen at scale (in fact, only the production Sharder cluster crashed; other services using the same library continued to work as expected), so it is hard to identify before going into production. We are currently trying to find reliable reproduction steps so we can file a bug report with the library's maintainers.
The last issue - the cache corruption caused by an exceeded internal timeout - is partly related to our recent growth and partly to the patch we deployed in an emergency to reconnect faster. It's a good example of how things can get worse when you deploy big patches without enough analysis and testing. We have raised the responsible timeout significantly - it shouldn't cause any issues for years to come!
During this incident, three developers were available, which gave us a very good reaction time when things got worse. We were able to investigate the different issues in parallel, patch them, and redeploy very quickly.
This is (from my point of view) the biggest incident Vexera has experienced, and I hope this post gives you some insight into how Vexera operates.
If you have any questions regarding this incident, we are available on our official server!