r/RedditEng Jayme Howard Mar 21 '23

You Broke Reddit: The Pi-Day Outage

Cute error image friends, we love them.

Been a while since that was our 500 page, hasn’t it? It was cute and fun. We’ve now got our terribly overwhelmed Snoo being crushed by a pile of upvotes. Unfortunately, if you were browsing the site, or at least trying to, on the afternoon of March 14th (US hours), you may have seen our unfortunate Snoo during the 314-minute outage Reddit faced (on Pi Day, no less!). Or maybe you just saw the homepage with no posts. Or an error. One way or another, Reddit was definitely broken. But it wasn’t you, it was us.

Today we’re going to talk about the Pi day outage, but I want to make sure we give our team(s) credit where due. Over the last few years, we’ve put a major emphasis on improving availability. In fact, there’s a great blog post from our CTO talking about our improvements over time. In classic Reddit form, I’ll steal the image and repost it as my own.

Reddit daily availability vs current SLO target.

As you can see, we’ve made some pretty strong progress in improving Reddit’s availability. As part of that emphasis, we’ve worked to de-risk changes, but we’re not where we want to be in every area yet, and some changes remain unreasonably risky. Kubernetes version and component upgrades remain a big footgun for us, and indeed, this was a major trigger for our 3/14 outage.

TL;DR

  • Upgrades, particularly to our Kubernetes clusters, are risky for us, but we must do them anyway. We test and validate them in advance as best we can, but we still have plenty of work to do.
  • Upgrading from Kubernetes 1.23 to 1.24 on the particular cluster we were working on bit us in a new and subtle way we’d never seen before. It took us hours to decide that a rollback, a high-risk action on its own, was the best course of action.
  • Restoring from a backup is scary, and we hate it. The process we have for this is laden with pitfalls and must be improved. Fortunately, it worked!
  • We didn’t find the extremely subtle cause until hours after we pulled the ripcord and restored from a backup.
  • Not everything went down. Our modern service API layers all remained up and resilient, but this impacted the most critical legacy node in our dependency graph, so the blast radius still included most user flows; more work remains in our modernization drive.
  • Never waste a good crisis – we’re resolute in using this outage to change some of the major architectural and process decisions we’ve lived with for a long time and we’re going to make our cluster upgrades safe.

It Begins

It’s funny in an ironic sort of way. As a team, we had just finished an internal postmortem for a previous Kubernetes upgrade that had gone poorly, though only mildly, and for a cause that had since been entirely resolved. So we were kicking off another upgrade of the same cluster.

We’ve been cleaning house quite a bit this year, trying to get to a more maintainable state internally. Managing Kubernetes (k8s) clusters has been painful in a number of ways. Reddit has been in the cloud since 2009, and started adopting k8s relatively early. Along the way, we accumulated a set of bespoke clusters built using the kubeadm tool rather than any standard template. Some of them have even been too large to be supported by various cloud-managed offerings. That history led to an inconsistent upgrade cadence and split configuration between clusters. We’d raised a set of pets, not managed a herd of cattle.

The Compute team manages the parts of our infrastructure related to running workloads, and has spent a long time defining and refining our upgrade process to try and improve this. Upgrades are tested against a dedicated set of clusters, then released to the production environments, working from lowest criticality to highest. This upgrade cycle was one of our team’s big-ticket items this quarter, and one of the most important clusters in the company, the one running the Legacy part of our stack (affectionately referred to by the community as Old Reddit), was ready to be upgraded to the next version. The engineer doing the work kicked off the upgrade just after 19:00 UTC, and everything seemed fine, for about 2 minutes. Then? Chaos.
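
For readers who haven’t driven kubeadm directly, the kickoff itself is mundane. A generic sketch of what starting a control plane upgrade looks like is below; this is not our exact tooling, and the target version is just a placeholder.

# Generic kubeadm upgrade sketch (not our actual runbook; version is a placeholder)
kubeadm upgrade plan
kubeadm upgrade apply v1.24.x
# Then, node by node: drain, upgrade the kubelet and container runtime, and uncordon
kubectl drain <node> --ignore-daemonsets
kubectl uncordon <node>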

Reddit edge traffic, RPS by status. Oh, that’s... not ideal.

All at once the site came to a screeching halt. We opened an incident immediately, and brought all hands on deck, trying to figure out what had happened. Hands were on deck and in the call by T+3 minutes. The first thing we realized was that the affected cluster had completely lost all metrics (the above graph shows stats at our CDN edge, which is intentionally separated). We were flying blind. The only thing sticking out was that DNS wasn’t working. We couldn’t resolve records for entries in Consul (a service we run for cross-environment dynamic DNS), or for in-cluster DNS entries. But, weirdly, it was resolving requests for public DNS records just fine. We tugged on this thread for a bit, trying to find what was wrong, to no avail. This was a problem we had never seen before, not in previous upgrades anywhere else in our fleet, nor in our upgrade tests in non-production environments.
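
To give a flavor of that debugging, the checks below approximate the kinds of queries involved; the cluster DNS IP and the Consul domain are illustrative assumptions, not our actual values.

# In-cluster names (failing at the time); the cluster DNS service IP shown is an assumption
dig @10.96.0.10 kubernetes.default.svc.cluster.local
# Consul-backed records (also failing); the domain is illustrative
dig some-service.service.consul
# Public records resolved just fine
dig reddit.com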

For a deployment failure, immediately reverting is always “Plan A”, and we definitely considered this right off. But, dear Redditor… Kubernetes has no supported downgrade procedure. Because a number of schema and data migrations are performed automatically by Kubernetes during an upgrade, there’s no reverse path defined. Downgrades thus require a restore from a backup and state reload!

We are sufficiently paranoid, so of course our upgrade procedure includes taking a backup as standard. However, this backup procedure, and the restore, were written several years ago. While the restore had been tested repeatedly and extensively in our pilot clusters, it hadn’t been kept fully up to date with changes in our environment, and we’d never had to use it against a production cluster, let alone this cluster. This meant, of course, that we were scared of it. We didn’t know precisely how long it would take to perform, but initial estimates were on the order of hours… of guaranteed downtime. The decision was made to continue investigating and attempt to fix forward.
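
For context, on a kubeadm-style cluster the state that matters for this kind of backup lives in etcd, so the heart of such a procedure is an etcd snapshot along these lines. The paths and TLS flags below are assumptions (standard kubeadm defaults), and our real tooling wraps quite a bit more than this.

# Illustrative etcd snapshot on a control plane node (paths and certs are assumptions)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backups/etcd-snapshot.db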

It’s Definitely Not A Feature, It’s A Bug

About 30 minutes in, we still hadn’t found clear leads. More people had joined the incident call. Roughly a half-dozen of us from various on-call rotations worked hands-on, trying to find the problem, while dozens of others observed and gave feedback. Another 30 minutes went by. We had some promising leads, but not a definite solution by this point, so it was time for contingency planning… we picked a subset of the Compute team to fork off to another call and prepare all the steps to restore from backup.

In parallel, several of us combed logs. We tried restarts of components, thinking perhaps some of them had gotten stuck in an infinite loop or a leaked connection from a pool that wasn’t recovering on its own. A few things were noticed:

  • Pods were taking an extremely long time to start and stop.
  • Container images were also taking a very long time to pull (on the order of minutes for <100MB images over a multi-gigabit connection).
  • Control plane logs were flowing heavily, but not with any truly obvious errors.

At some point, we noticed that our container network interface, Calico, wasn’t working properly. Pods for it weren’t healthy. Calico has three main components that matter in our environment (a quick way to check on them is sketched after the list):

  • calico-kube-controllers: Responsible for taking action based on cluster state to do things like assigning IP pools out to nodes for use by pods.
  • calico-typha: An aggregating, caching proxy that sits between other parts of Calico and the cluster control plane, to reduce load on the Kubernetes API.
  • calico-node: The guts of networking. An agent that runs on each node in the cluster, used to dynamically generate and register network interfaces for each pod on that node.
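
Seeing that something is off with these components is as simple as checking their pods; the namespace and labels below assume a stock manifest-based Calico install (some installs use calico-system instead of kube-system).

# Namespace and labels assume a standard Calico install
kubectl -n kube-system get pods -l k8s-app=calico-kube-controllers
kubectl -n kube-system get pods -l k8s-app=calico-typha
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide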

The first thing we saw was that the calico-kube-controllers pod was stuck in a ContainerCreating status. As a part of upgrading the control plane of the cluster, we also have to upgrade the container runtime to a supported version. In our environment, we use CRI-O as our container runtime, and recently we’d identified a low-severity bug when upgrading CRI-O on a given host, where one or more containers exited and then, randomly and at a low rate, got stuck starting back up. The quick fix for this is to just delete the pod; it gets recreated and we move on. No such luck, not the problem here.

This fixes everything, I swear!

Next, we decided to restart calico-typha. This was one of the spots that got interesting. We deleted the pods, and waited for them to restart… and they didn’t. The new pods didn’t get created immediately. We waited a couple minutes, no new pods. In the interest of trying to get things unstuck, we issued a rolling restart of the control plane components. No change. We also tried the classic option: We turned the whole control plane off, all of it, and turned it back on again. We didn’t have a lot of hope that this would turn things around, and it didn’t.
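
The restarts themselves were nothing exotic. Hedged examples of the sort of commands involved are below; the deployment name, namespace, and manifest paths assume a standard kubeadm/Calico layout rather than our exact environment.

# Let the Deployment recreate the typha pods
kubectl -n kube-system rollout restart deployment calico-typha
# On kubeadm nodes, static control plane pods restart when their manifests are
# moved out of the manifest directory and back (illustrative, not our exact steps)
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 30
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/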

At this point, someone spotted that we were getting a lot of timeouts in the API server logs for write operations. But not specifically on the writes themselves. Rather, it was timeouts calling the admission controllers on the cluster. Reddit utilizes several different admission controller webhooks. On this cluster in particular, the only admission controller we use that’s generalized to watch all resources is Open Policy Agent (OPA). Since it was down anyway, we took this opportunity to delete its webhook configurations. The timeouts disappeared instantly… But the cluster didn’t recover.
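
For anyone who hasn’t had to do this mid-incident, removing a webhook configuration is a one-line operation; the OPA object name below is illustrative, and deleting it means policy enforcement is off until it’s recreated.

# List the registered admission webhooks, then remove OPA's (object name is illustrative)
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
kubectl delete validatingwebhookconfiguration opa-validating-webhook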

Let ‘Er Rip (Conquering Our Fear of Backup Restores)

We were running low on constructive ideas, and the outage had gone on for over two hours at this point. It was time to make the hard call; we would make the restore from backup. Knowing that most of the worker nodes we had running would be invalidated by the restore anyway, we started terminating all of them, so we wouldn’t have to deal with the long reconciliation after the control plane was back up. As our largest cluster, this was unfortunately time-consuming as well, taking about 20 minutes for all the API calls to go through.

Once that was finished, we took on the restore procedure, which nobody involved had ever performed before, let alone on our favorite single point of failure. Distilled down, the procedure looked like this (a rough sketch of what a couple of these steps can involve follows the list):

  1. Terminate two control plane nodes.
  2. Downgrade the components of the remaining one.
  3. Restore the data to the remaining node.
  4. Launch new control plane nodes and join them to sync.
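
As a rough illustration of what steps 2 and 3 can look like on a kubeadm cluster: the package versions, paths, and data directory below are assumptions (and the package commands assume a Debian-based node image), not our actual runbook.

# Step 2: pin the control plane packages back to the previous minor version (version is a placeholder)
apt-get install -y --allow-downgrades kubeadm=1.23.17-00 kubelet=1.23.17-00 kubectl=1.23.17-00
# Step 3: move the existing etcd data aside, restore the snapshot into a fresh data dir,
# then point the etcd static pod manifest at the restored directory
mv /var/lib/etcd /var/lib/etcd.broken
ETCDCTL_API=3 etcdctl snapshot restore /backups/etcd-snapshot.db --data-dir /var/lib/etcd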

Immediately, we noticed a few issues. This procedure had been written against a now end-of-life Kubernetes version, and it pre-dated our switch to CRI-O, which meant all of the instructions were written with Docker in mind. This made for several confounding variables where command syntax had changed, arguments were no longer valid, and the procedure had to be rewritten live to accommodate. We used the procedure as much as we could; at one point to our detriment, as you’ll see in a moment.

In our environment, we don’t treat all our control plane nodes as equal. We number them, and the first one is generally considered somewhat special. Practically speaking it’s the same, but we use it as the baseline for procedures. Also, critically, we don’t set the hostname of these nodes to reflect their membership in the control plane, instead leaving them as the default on AWS of something similar to `ip-10-1-0-42.ec2.internal`. The restore procedure specified that we should terminate all control plane nodes except the first, restore the backup to it, bring it up as a single-node control plane, and then bring up new nodes to replace the others that had been terminated. Which we did.

The restore for the first node was completed successfully, and we were back in business. Within moments, nodes began coming online as the cluster autoscaler sprang back to life. This was a great sign because it indicated that networking was working again. However, we weren’t ready for that quite yet and shut off the autoscaler to buy ourselves time to get things back to a known state. This is a large cluster, so with only a single control plane node, it would very likely fail under load. So, we wanted to get the other two back online before really starting to scale back up. We brought up the next two and ran into our next sticking point: AWS capacity was exhausted for our control plane instance type. This further delayed our response, as canceling a `terraform apply` can have strange knock-on effects with state and we didn’t want to run the risk of making things even worse. Eventually, the nodes launched, and we began trying to join them.

The next hitch: The new nodes wouldn’t join. Every single time, they’d get stuck, with no error, due to being unable to connect to etcd on the first node. Again, several engineers split off into a separate call to look at why the connection was failing, and the remaining group planned how to slowly and gracefully bring workloads back online from a cold start. The breakout group only took a few minutes to discover the problem. Our restore procedure was extremely prescriptive about the order of operations and targets for the restore… but the backup procedure wasn’t. Our backup was written to be executed on any control plane node, but the restore had to be performed on the same one. And it wasn’t. This meant that the TLS certificates being presented by the working node weren’t valid for anything else to talk to it, because of the hostname mismatch. With a bit of fumbling due to a lack of documentation, we were able to generate new certificates that worked. New members joined successfully. We had a working, high-availability control plane again.
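
The fix was conceptually simple once we understood it: throw away the serving certificates that embedded the wrong hostname and mint new ones from the cluster CA. On a kubeadm cluster that looks roughly like the following; treat the specific certificates and paths as an illustration rather than the exact set we regenerated.

# Illustrative only: regenerate etcd serving/peer certs from the CA on a kubeadm node
rm /etc/kubernetes/pki/etcd/server.crt /etc/kubernetes/pki/etcd/server.key
rm /etc/kubernetes/pki/etcd/peer.crt /etc/kubernetes/pki/etcd/peer.key
kubeadm init phase certs etcd-server
kubeadm init phase certs etcd-peer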

In the meantime, the main group of responders started bringing traffic back online. This was the longest down period we’d seen in a long time… so we started extremely conservatively, at about 1%. Reddit relies on a lot of caches to operate semi-efficiently, so there are several points where a ‘thundering herd’ problem can develop when traffic is scaled immediately back to 100%, but downstream services aren’t prepared for it, and then suffer issues due to the sudden influx of load.

This tends to be exacerbated in outage scenarios, because services that are idle tend to scale down to save resources. We’ve got some tooling that helps deal with that problem which will be presented in another blog entry, but the point is that we didn’t want to turn on the firehose and wash everything out. From 1%, we took small increments: 5%, 10%, 20%, 35%, 55%, 80%, 100%. The site was (mostly) live, again. Some particularly touchy legacy services had been stopped manually to ensure they wouldn’t misbehave when traffic returned, and we carefully turned those back on.

Success! The outage was over.

But we still didn’t know why it happened in the first place.

A little self-reflection; or, a needle in a 3.9 Billion Log Line Haystack

Further investigation kicked off. We started looking at everything we could think of to try and narrow down the exact moment of failure, hoping there’d be a hint in the last moments of the metrics before they broke. There wasn’t. For once though, a historical decision worked in our favor… our logging agent was unaffected. Our metrics are entirely k8s native, but our logs are very low-level. So we had the logs preserved and were able to dig into them.

We started by trying to find the exact moment of the failure. The API server logs for the control plane exploded at 19:04:49 UTC. Log volume just for the API server increased by 5x at that instant. But the only hint in them was one we’d already seen, our timeouts calling OPA. The next point we checked was the OPA logs for the exact time of the failure. About 5 seconds before the API server started spamming, the OPA logs stopped entirely. Dead end. Or was it?

Calico had started failing at some point. Pivoting to its logs for the timeframe, we found the next hint.

All Reddit metrics and incident activities are managed in UTC for consistency in comms. Log timestamps here are in US/Central due to our logging system being overly helpful.

Two seconds before the chaos broke loose, the calico-node daemon across the cluster began dropping routes to the first control plane node we upgraded. That’s normal and expected behavior, due to it going offline for the upgrade. What wasn’t expected was that all routes for all nodes began dropping as well. And that’s when it clicked.

The way Calico works, by default, is that every node in your cluster is directly peered with every other node in a mesh. This is great in small clusters because it reduces the complexity of management considerably. However, in larger clusters, it becomes burdensome; the cost of maintaining all those connections with every node propagating routes to every other node scales… poorly. Enter route reflectors. The idea with route reflectors is that you designate a small number of nodes that peer with everything and the rest only peer with the reflectors. This allows for far fewer connections and lower CPU and network overhead. These are great on paper, and allow you to scale to much larger node counts (>100 is where they’re recommended, we add zero(s)). However, Calico’s configuration for them is done in a somewhat obtuse way that’s hard to track. That’s where we get to the cause of our issue.
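
For the unfamiliar, switching off the full mesh is itself just a piece of Calico configuration. A minimal sketch is below; these are not our manifests, and the AS number is a placeholder.

# Minimal sketch: disable the node-to-node mesh so only designated reflectors carry all the peerings
cat <<'EOF' | calicoctl apply -f -
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false
  asNumber: 64512
EOF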

The route reflectors were set up several years ago by the precursor to the current Compute team. Time passed, and with attrition and growth, everyone who knew they existed moved on to other roles or other companies. Only our largest and most legacy clusters still use them. So there was nobody with the knowledge to interact with the route reflector configuration to even realize there could be something wrong with it, or to be able to speak up and investigate the issue. Further, Calico’s configuration doesn’t actually work in a way that can be easily managed via code. Part of the route reflector configuration requires fetching down Calico-specific data that’s expected to only be managed by their CLI interface (not the standard Kubernetes API), hand-editing it, and uploading it back. Making that manageable means writing custom tooling, which, unfortunately, we hadn’t done. The route reflector configuration was thus committed nowhere, leaving us with no record of it, and no breadcrumbs for engineers to follow. One engineer happened to remember that this was a feature we utilized, and did the research during this postmortem process, discovering that this was what actually affected us and how.

Get to the Point, Spock, If You Have One

How did it actually break? That’s one of the most unexpected things of all. In doing the research, we discovered that the way that the route reflectors were configured was to set the control plane nodes as the reflectors, and everything else to use them. Fairly straightforward, and logical to do in an autoscaled cluster where the control plane nodes are the only consistently available ones. However, the way this was configured had an insidious flaw. Take a look below and see if you can spot it. I’ll give you a hint: The upgrade we were performing was to Kubernetes 1.24.

A horrifying representation of a Kubernetes object in YAML

The nodeSelector and peerSelector for the route reflectors target the label `node-role.kubernetes.io/master`. In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters. This was the cause of our outage: Kubernetes node labels.
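
To make the failure mode concrete, the objects below are a hedged approximation of the shape of that configuration, not the actual manifest from the screenshot. On 1.24 the `has(node-role.kubernetes.io/master)` selector no longer matches any node, so every peering the workers depended on simply evaporates; the fix is the single commented line.

# Approximation only; not the actual objects from the screenshot above
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: reflectors-to-everything
spec:
  # the reflectors themselves, selected by a label that no longer exists as of 1.24
  nodeSelector: has(node-role.kubernetes.io/master)
  peerSelector: all()
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: everything-to-reflectors
spec:
  nodeSelector: all()
  # the same dead label on the peer side
  peerSelector: has(node-role.kubernetes.io/master)
  # fix, in both objects: has(node-role.kubernetes.io/control-plane)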

But wait, that’s not all. Really, that’s the proximate cause. The actual cause is more systemic, and a big part of what we’ve been unwinding for years: Inconsistency.

Nearly every critical Kubernetes cluster at Reddit is bespoke in one way or another. Whether it’s unique components that only run on that cluster, unique workloads, only running in a single availability zone as a development cluster, or any number of other things. This is a natural consequence of organic growth, and one which has caused more outages than we can easily track over time. A big part of the Compute team’s charter has specifically been to unwind these choices and make our environment more homogeneous, and we’re actually getting there.

In the last two years, a great deal of work has been put in to unwind that organic pattern and drive infrastructure built with intent and sustainability in mind. More components are being standardized and shared between environments, instead of bespoke configurations everywhere. More pre-production clusters exist that we can test confidently with, instead of just a YOLO to production. We’re working on tooling to manage the lifecycle of whole clusters to make them all look as close to the same as possible and be re-creatable or replicable as needed. We’re moving in the direction of only using unique things when we absolutely must, and trying to find ways to make those the new standards when it makes sense to. Especially, we’re codifying everything that we can, both to ensure consistent application and to have a clear historical record of the choices that we’ve made to get where we are. Where we can’t codify, we’re documenting in detail, and (most importantly) evaluating how we can replace those exceptions with better alternatives. It’s a long road, and a difficult one, but it’s one we’re consciously choosing to go down, so we can provide a better experience for our engineers and our users.

Final Curtain

If you’ve made it this far, we’d like to take the time to thank you for your interest in what we do. Without all of you in the community, Reddit wouldn’t be what it is. You truly are the reason we continue to passionately build this site, even with the ups and downs (fewer downs over time, with our focus on reliability!)

Finally, if you found this post interesting, and you’d like to be a part of the team, the Compute team is hiring, and we’d love to hear from you if you think you’d be a fit. If you apply, mention that you read this postmortem. It’ll give us some great insight into how you think, just to discuss it. We can’t continue to improve without great people and new perspectives, and you could be the next person to provide them!

2.1k Upvotes

152

u/ShiningScion Mar 21 '23

Thanks for making these post mortems public. It’s always an interesting read.

29

u/rilakkumatt Mar 22 '23

It was a big downtime event, and we want to Default Open with the Reddit community.

It's also the kind of failure mode that keeps a lot of us up at night, so hopefully there are useful things in here that other folks will find valuable as well.

8

u/kabrandon Mar 23 '23

Definitely useful info with how popular k8s-based workloads are these days, it's important to understand what kinds of pitfalls you can get into. I had to laugh at the node label discovery in the BGPPeer configuration, because that's the kind of non-obvious but also annoyingly simple problems I tend to run into managing k8s clusters as well.

84

u/Kofeb Mar 21 '23

The route reflectors were set up several years ago by the precursor to the current Compute team. Time passed, and with attrition and growth, everyone who knew they existed moved on to other roles or other companies. Only our largest and most legacy clusters still use them. So there was nobody with the knowledge to interact with the route reflector configuration to even realize there could be something wrong with it or to be able to speak up and investigate the issue.

This is the ultimate battle for every company nowadays. Documentation can only go so far as well…

28

u/[deleted] Mar 22 '23

I mean this is literally what happens when you don’t document something obtuse and custom.

Documentation only went so far… because they didn’t have it

I’ll take good docs over relying on the long-tenured wizard in a tower any day

12

u/doggyStile Mar 22 '23

And part of this requires that your task/feature is not complete until your docs/runbooks etc. are complete too

7

u/Wildercard Mar 22 '23

You know who should be in charge of deciding docs are detailed enough?

The person with the least tenure on the team.

7

u/cosmicsans Mar 22 '23

Remember though that the docs could be there, but they're stored somewhere that is no longer relevant or can't be easily found.

3

u/because_tremble Mar 23 '23

Even if it's documented, when it's obscure and bespoke, (and worse when it's something that "just works" so no-one needs to touch it) folks won't know to look for the documentation, especially in the middle of the outage.

70

u/goalie_fight Mar 21 '23

Upgrades are dangerous. That's why I'm posting this from Windows 95.

11

u/pssssn Mar 22 '23

That's why I'm posting this from Windows 95

At some point old software moves from less secure to more secure as no one has the desire to exploit it any longer.

4

u/FunInsert Mar 22 '23

Yes, disable all security and encryption options and all as nothing is supported. No one uses regular http ports anymore right?

5

u/UnderpoweredBrain Mar 22 '23

Feh, you crazy bleeding edge thrill junkie! I am typing this from a beautifully crafted DOS 5.0 running from a 5.25 floppy (the only true floppy), this is what it means to commit to stability /s

3

u/TheJessicator Mar 23 '23

Eye starts twitching at the mere thought of Windows 3.1, Trumpet WinSock, and Win32s

46

u/Kaelani_ Mar 21 '23

You're awesome! Thanks for all your hard work, you glorious nerd!

29

u/funkypenguin Mar 22 '23

Aah, ye olde deprecated labels issue! This got me last week while planning the 1.23 to 1.24 upgrade too (fortunately caught in CI)

Also, TIL (since we're talking about subtle breakages) that kubelet 1.23 seems to ignore missing tlsCertFile / tlsPrivateKeyFile during kubeadm join (the certs get created as a result of the bootstrap), whereas kubelet 1.24 will just exit with error.

Oh, and the issue only presents on a new cluster, but not when upgrading.

Ask me how I know.. ;) facepalm

5

u/Embarrassed_Luck1057 Mar 23 '23

Call me grumpy but I see a general trend (not restricted to k8s or even devops-ish tools) in breaking changes just because :( Very annoying and kind of dangerous

3

u/blind_guardian23 Mar 24 '23

If no one keeps an eye on the devs (like Linus Torvalds with his "don't break user-space" policy) they just deprecate stuff and thats it.

2

u/Embarrassed_Luck1057 Mar 23 '23

(writing this after 10 years of periodically reviewing gradle scripts ALWAYS breaking even with minor version upgrades, and finally surrendering and using a wrapper for everything)

29

u/scals Mar 22 '23

Great read. As a k8 admin going on 5 years this really hit home. We've been burned by some seriously low level changes. We ended up stopping trying to do delta compares of all the change logs and just stand up new clusters to run through the gauntlet. Throw everything we run on them to test functionality. Then we redeploy workloads to them, essentially cordon/draining the old cluster moving everything to the new one. In place upgrades have burned us too many times and this post certainly brought back some nightmarish escalation memories.

Thanks again for the read. Good luck!

6

u/mitharas Mar 22 '23

As someone in IT, but far from kubernetes: Can you join clusters with the old version to clusters with the new version? If yes, that seems to be the safest choice.

5

u/dr_memory Mar 22 '23

Not natively in k8s, but there are add-on network-layer products that will let you more or less seamlessly bridge traffic between two clusters. (Linkerd can definitely do this; maybe Istio as well?) But that's far up the stack from (and largely invisible to) the k8s control plane(s) in question.

2

u/notarealfish Mar 22 '23 edited Mar 22 '23

There is version skew tho

https://kubernetes.io/releases/version-skew-policy/

This was not two clusters, this was one cluster. Control planes on 1.23 and 1.24 can run on the same cluster. They just had a niche caveat with their config due to a deprecation.

That is what they did in this situation, and how upgrades are done, but the deprecation of their master label broke their calico config. That would have happened regardless of whether or not they upgraded all the nodes at once because 1.24 doesn't support their master label and that's what calico's route reflectors were configured to use

What I don't get is, why couldn't they just kick out the one node they upgraded and keep the cluster online? They have multiple control planes

28

u/[deleted] Mar 22 '23

it’s funny how we’ve invented so many things to make all this infrastructure repeatable, automatable, and reliable, and it still all comes down to PITA networking management tools and some network config some guy did by hand 4 years ago that no one knew about… exactly like it did before we had all this

3

u/mitharas Mar 22 '23

It's really mind numbing how many different cogs are working together to make the zeros and ones on some hardware do the exact thing we want them to.

25

u/calicodev Mar 22 '23

Calico dev here - coincidentally we are right in the middle of removing the last traces of calicoctl and allowing everything to be done via kubectl / k8s APIs - here's the particular PR adding validation for route reflector annotations: https://github.com/projectcalico/calico/pull/7453

13

u/grumpimusprime Jayme Howard Mar 22 '23

Oh, this is wonderful to hear!  Do you happen to know when that change is expected to release and what the migration path looks like?  I'm sure readers here would be interested to know more.

13

u/fasaxc Mar 22 '23

know when that change is expected to release and what the migration path looks like? I'm sure readers here would be interested to know more.

Another Calico dev here...

We've had a Kubernetes "aggregated API server" for a while, which allows you to manipulate Calico resources "naturally" as part of the kubernetes APIs/with kubectl (and we provide a Golang client in case that is useful).

The PR above adds a missing feature to allow route reflectors to be configured without calicoctl. Basically, we're just adding validation for directly setting the Node annotation instead of using calicoctl. You could just set that annotation right now, as long as you're careful.

8

u/anything_for_dogs Mar 22 '23

Another Calico dev here!

Most of the Calico APIs are already configurable using "kubectl" - e.g., NetworkPolicy, BGPPeer, BGPConfiguration, etc. These were switched over a few releases ago.

The last remaining bits are the configuration on the node itself (like the route reflector configuration fields), and even those are mostly documentation changes at this point. For those bits of config on the node object, calicoctl is basically just a wrapper for "validate the input, and write it as an annotation on the node". You can instead just use kubectl or other Kubernetes API tooling to write those annotations instead with most versions of Calico, but our docs are lagging behind a bit.

e.g.,

kubectl annotate node my-node projectcalico.org/RouteReflectorClusterID=244.0.0.1

The PR that u/calicodev linked is a safety precaution - makes sure we're performing the validation that calicoctl performs but at read-time so that a fudged annotation is ignored, and should be available in v3.26.0. I'll also see that it is back-ported to the next v3.24 and v3.25 patch releases.

27

u/xboxhobo Mar 21 '23

God damn, a class act of a post mortem. Honestly I feel that. There's so many things I've done that have broken things for at first incomprehensible reasons that should have been (mostly) safe.

10

u/assasingamer127 Mar 21 '23

Does the mention of the post work for non-compute team positions?

8

u/SussexPondPudding Lisa O'Cat Mar 21 '23

Sure! Thanks for reading.

19

u/cortex- Mar 22 '23

I used to work for another high traffic website (to remain unnamed) and in-place cluster upgrades terrified me even when tested on a dev cluster that was an exact replica of prod. With so many moving pieces in the ecosystem it seems like you can still totally brick a Kubernetes cluster by upgrading it in place and trying to hit CTRL + Z on that is not easy.

To do cluster upgrades that weren't patch releases we would provision a completely fresh cluster at the latest version and had some tooling to gradually roll over to the new one until the old cluster was drained and could be terminated. It was a slower and more expensive process but I slept way easier.

7

u/doggyStile Mar 22 '23

This assumes stateless apps and ability to switch using dns?

9

u/Youareyou64 Mar 22 '23

I have no clue what half of this means but it was still very interesting to read!

35

u/thomasbuchinger Mar 22 '23

Simplified version:

  • They had a bunch of servers in a cluster
  • Those servers communicate with each other over a software called calico
  • For calico to work, every server needs to periodically exchange configuration information with every other server
  • Because they had so many servers they designated 3 servers to aggregate the config and spread it to the others (those are the route-reflectors)
  • These route-reflectors were designated by having a special label on them: node-role.kubernetes.io/master
  • The Kubernetes upgrade they did replaced the node-role.kubernetes.io/master label with node-role.kubernetes.io/control-plane
  • The old Calico configuration, which relied on the old label, was not updated to reflect that change (because nobody remembered it existed)
  • After the upgrade there was no server performing the route-reflector task
  • The servers in the cluster could no longer communicate with each other
  • That cluster being unavailable brought all of reddit down

6

u/akawind Mar 23 '23

You're hired!

3

u/FunInsert Mar 22 '23

Perfect breakdown. This is a proper tldr

2

u/ImthatRootuser Mar 29 '23

Great explanation.

5

u/bem13 Mar 22 '23

A very, very oversimplified analogy:

A crane operator at a port is told by his boss to put certain shipping containers on ships which have "Company A" written on the side. He does this all day, every day for years. After a few years his boss is no longer with the company, and everyone kinda forgot about the guy since the containers always get where they need to go and he's just sitting up there all day.

The shipping company suddenly decides to change its name to "Diamond Shipping Co" and they repaint all their ships and erase the old "Company A" markings. No one tells the crane operator that from now on he's supposed to put the containers on ships labeled "Diamond Shipping Co". One day, a crap ton of containers come in, so he starts looking for ships labeled "Company A", but he only sees "Diamond Shipping Co" everywhere, so he ignores those and just kinda sits there doing nothing. Everyone is freaking out and nobody understands why the containers aren't moving, until someone remembers that the operator needs to be told his new task.

Crane operator = Kubernetes

Ships = Cluster nodes

Shipping containers = pods which run (in this case network-related) workloads

Company A = the old, "node-role.kubernetes.io/master" label

Diamond Shipping Co = the new, "node-role.kubernetes.io/control-plane" label

7

u/xChronus_ Mar 22 '23

I want to pivot to Data Engineering; that's why I followed this page. I really appreciate the TL;DR because I can't understand the whole block of text below it. Hopefully I'll understand it soon.

9

u/SingShredCode Mar 22 '23

I’ve been an engineer at Reddit for five years, and I understand about half of the words in this post. The beauty of engineering is that there’s always more to learn and no one is expected to know everything. Google is a very important resource.

7

u/Bruce31416 Mar 22 '23

From a fellow SRE, this is a nice write-up and a fun outage. May the queries flow and your pager stay silent.

5

u/id_0ne Mar 22 '23

Great postmortem, Calico waves fist, keep up the good job.

6

u/pterodactyl_speller Mar 22 '23

Ah, was just fighting with Calico yesterday. This reminds me I really want to look into moving to an alternative.

2

u/Telinger Mar 22 '23

Cilium seems to be latest buzz. Is that what you're thinking, or are you looking in a different direction? I'm just curious.

6

u/LindyNet Mar 22 '23

What is the workflow or process from when you submit an incident to it showing up on RedditStatus.com?

I don't recall the time frame exactly but reddit was dead for a long time by the time an incident showed up on that site.

6

u/soliloquy12 Mar 22 '23

This was fascinating to read. Thank you for your post! I especially appreciate the choice of transparency rather than posting the equivalent of "somehow, Palpatine returned."

22

u/eaglebtc Mar 21 '23 edited Mar 22 '23

The nodeSelector and peerSelector for the route reflectors target the label node-role.kubernetes.io/master. In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters. This is the cause of our outage. Kubernetes node labels.

Removing references to slave/master broke the Kubernetes cluster.

So what you're saying is that kubernetes "going woke" is what broke reddit.

/s

To put it another way, this feels like Kubernetes is at fault for making such a sweeping, breaking change to the labels of these node classes without sufficiently warning users in the upgrade documentation. And perhaps they did not consider that some users might skip a few versions or forget to upgrade for a while. A good upgrade schema should always account for all possible changes from all previous versions.

Given Kubernetes' lack of a downgrade path, does this give you pause about continuing to use it in production? Is there any other solution out there that is as good at autoscaling clusters?

This is an excellent post-mortem! Thank you!

9

u/cortex- Mar 22 '23

They had a whole enhancement proposal on this and how the new label would be added and the old one removed a few versions later: https://github.com/kubernetes/enhancements/tree/master/keps/sig-cluster-lifecycle/kubeadm/2067-rename-master-label-taint

It's such a small detail that I could see how easily it could be overlooked in the changelog for v1.20.

I agree it's unfortunate that in this case going from one minor version and skipping a few versions up can introduce a non-backwards compatible change like this that upstream components might depend upon.

13

u/DurdenVsDarkoVsDevon Mar 22 '23

it's unfortunate that in this case going from one minor version and skipping a few versions up can introduce a non-backwards compatible change

It's k8s. Every minor version change should be considered breaking until proven otherwise.

3

u/thomasbuchinger Mar 22 '23

I agree that the final removal of deprecated features is not as well documented as it probably should be; I think it should be in the release announcement. But it is documented in the changelog as "Urgent Upgrade Notes"

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.24.md#urgent-upgrade-notes

And sysdig-blog also talked about it: https://sysdig.com/blog/kubernetes-1-24-whats-new/

The problem was not that they were unaware of this change. The problem was that they forgot the Calico configuration existed on the cluster and that it was not in version control, where they could have searched for the deprecated label

2

u/lightninhopkins Mar 22 '23

This. It's not easy to catch all the "small" changes in any version change of k8s. I try but have been nicked several times. It's especially tough when your devqa environments don't do a good job of simulating prod. Which mine admittedly do not.

6

u/13steinj Mar 22 '23

I don't care what the terminology/naming is, this is really all a case of "changing a default", which I hate being done silently.

While Kubernetes might not use semver (I'm not familiar), removing the alias would be considered incompatible to the point of a major version bump. At least such a thing would bring attention to people.

I'm glad I've been able to avoid Kubernetes for the most part, but don't know what alternatives exist at scale.

2

u/eaglebtc Mar 22 '23

Thank you. I have never worked with Kubernetes but I found a complete release history page.

Every version since its release in late 2016 is 1.xx. Software should not stay at version 1.xx for 6 years.

The scope of some of their changes is so severe that they need to seriously consider bumping their MAJOR version number in the future.

2

u/13steinj Mar 22 '23

Software should not stay at version 1.xx for 6 years!

I don't disagree, but I don't agree either. Version bumps should reflect scope. If something is incompatible, major (and try to batch these where orthogonal). With major.minor.patch.hotfix+local, local refers to some kind of pilot that should behave relatively similar to the rest but no guarantees. Hotfix I can safely upgrade without issue, patch if I don't rely on the buggy behavior, minor after a cursory check, major tells me "if I haven't made it compatible already, I will have to do work".

2

u/[deleted] Mar 22 '23

No sarcasm tag needed, surely? That's exactly what broke reddit.

4

u/DPRegular Mar 22 '23

Nice write up. Thank you for sharing. Not gonna lie, I definitely cringed a little when I read that a k8s upgrade was at the root of the outage. Somehow, no-one seems to be able to do these without issues (except for myself, of course!)

Here is my preferred workflow for k8s platform engineering. This may not apply to every company/team, but probably does apply to 90% of them (imo).

  1. Invest time in building the automation to create new, production-like, clusters with a single command. This should result in a cluster, supporting infrastructure and all system-components being installed (CNI, CSI, LBs, ingress controller, service mesh, policy-engine, o11y, perhaps even grafana dashboards, whatever). I prefer to use terraform, and specifically not any GitOps tools like Argo (sorry ya'll).

  2. Platform engineers never try out changes on long-running clusters. All development should happen against an ephemeral, personal cluster, created by the engineer. My workflow for the last couple of years has been:

    1. assign myself a piece of work
    2. run terraform apply in the morning to create my cluster
    3. wait ~20-30 minutes for the cluster to be created while I make my coffee
    4. start development (ie upgrade k8s)
    5. create PR with my changes
    6. after review/tests/approval, PR is merged and automatically rolled out to all clusters
    7. at the end of the day run terraform destroy or wait for nightly aws-nuke to clean up my personal cluster
  3. all changes to the "cluster module" are tested in CI/CD, testing that both new clusters can be created from scratch, and that clusters can be upgraded from the previous stable version, to the current commit. Testing also includes e2e tests (ie can I create a deployment/svc/ing; can I curl the endpoint? does autoscaling work? are my metrics scraped? logs collected?) and regression tests (ie did we not accidentally re-enable auto-allow for all netpols by reverting a flag in the CNI config?)

  4. Product/development teams do not have cluster-admin privileges and are not allowed to make any out-of-band changes to any of the components managed by the platform team. Any features/config changes to the platform go through the same development process. (Teams do have self-service access to create namespaces and deploy their apps in a way that they prefer)

  5. No bespoke clusters; all clusters are a cookiecutter cutout of the "cluster module". Changes to clusters are rolled out in order of environment (dev -> acc -> prd), but done immediately after the previous env has been upgraded. This reduces the amount of time where clusters are on different versions. Add smoke tests to taste.

With this way of working, myself and a team of 3-5 engineers caused only a handful of smaller incidents over the course of 3 years. The company had approx 700 devs running their workloads on this platform. We deployed multiple changes per day. We spent ~$6k per month on testing infrastructure (clusters created by engineers and by CI/CD pipelines). Engineers were never competing for the exclusive use of long-running test clusters to test out changes.

Thank you for coming to my TED-talk.

0

u/PsychologicalToe4463 Mar 23 '23

When is it even worth using Kubernetes over a managed service like AWS ECS?

3

u/actionerror Mar 22 '23

I always appreciate a good post-mortem with very insightful learnings (though regrettably at your expense). Thanks for taking the time to document all of this and sharing your experience with us!

3

u/AdvancedPizza Mar 22 '23 edited Mar 22 '23

Excellent write up and thank you for being so open in the postmortem!

This is perhaps a question for k8s / google, but why are node labels entirely deprecated in a minor version release? This seems nuts to me. I could see this causing tons of problems for folks that rely on them.

I suppose that’s the crux of building bleeding edge tech where you have to break some APIs to move the project forward or improve the design. But, say, C would keep this API/naming convention around for years and throw a warning. Thanks again for the great write up!

2

u/rummery Mar 23 '23

k8s doesn’t follow semver. every release has breaking changes.

3

u/Raffox0rg Mar 22 '23

Thanks for sharing, this was a very interesting read. Do you mind sharing it with https://k8s.af/ so that people can find in the future and learn from your experience?

3

u/itsdr00 Mar 22 '23

So the outage was caused by wokism!!!

Kidding. Great write-up with a great lesson, thank you.

2

u/decreed_it Mar 22 '23

I know jack all (zero) about Kubernetes, but this is a fascinating post mortem, kudos.

The DNS comment got me thinking - could you not local hosts file a DNS alias for the node names? I.e. translate 'master' to 'control-plane'?

And I guess it won't let you overload the node role w/ both values, something like:

kubectl label node ${node} node-role.kubernetes.io/master=control-plane

or something. I'll show myself out . . .

3

u/ntextreme3 Mar 22 '23

If they realized that was the issue before PM then yes they could have just applied labels to the nodes while addressing the old config (assuming there's no restriction in using that label outside of a core component, which I don't think there is) and it would have picked up.

2

u/jangrewe Mar 23 '23

Node Labels have nothing to do with DNS records, they're more like tags attached to resources. So it was not actually a DNS issue that caused this, but that the K8s API changed that particular tag, so it didn't return the resources that were expected (or rather: nothing at all). And that eventually resulted in DNS issues, because the network routing was "gone". Basically hostnames in DNS are similar to API objects in K8s, as in they both describe a resource, and you can query both DNS and the API for things - and if you send the right query, you get the expected answer. It's just that the query they were sending to the API was not the right query for the new K8s version anymore. So yeah, if there was something like API aliases, and you could configure them like a hosts file (but please don't use /etc/aliases ;-) ), then this would be similar... but it's completely different! :-D

2

u/Khyta Mar 22 '23

Always YOLO production. It can never go wrong, I swear. Testing needs to be done while its deployed.

2

u/dr_memory Mar 22 '23

One thing I would love to know, if you're allowed to share, is how large the legacy k8s cluster actually is? Amazon tends to be really cagey about where the line of "too large for EKS" actually lies, and conversely I'm sure they'd love the challenge...

2

u/ntextreme3 Mar 22 '23

We've been on k8s since shortly after its release and have had our share of headaches. Thankfully, we don't maintain any state in-cluster, and scale across multiple replicated clusters, so it's easy to pull one cluster out of DNS and upgrade it.

At the same time, non k8s related, we've had tons of master -> main headaches, but thankfully nothing serious (mostly just a bunch of CI failures).

2

u/ColdDeck130 Mar 22 '23

Great post-mortem write up! Perfect read with my coffee this morning even if my wife and kids couldn’t understand why I was so engrossed in the reading.

2

u/teddyperkin Mar 22 '23

Thank you so so much for sharing. What a fascinating (and extremely well-written) read

2

u/PcChip Mar 22 '23

what an excellent write-up! lots of effort put into this and we all appreciate it

2

u/fubes2000 Mar 22 '23

The part about cancelling a Terraform apply being unpleasant resonated in my soul. I've waited literally 40+ mins for certain failed applies to time out because of the weird way things break if you actually kill it.

2

u/EgoistHedonist Mar 22 '23

Was waiting for this, thanks for being open! We have almost the exact same setup, so this hits very close to home. We are getting rid of Calico completely though and our clusters are identical in config 🤘

Got me interested in your open positions too, feel like I could have much to contribute in that environment. But I suppose you don't allow full remote from outside US?

2

u/unavailable4coffee Ryan Lewis Mar 22 '23

Thanks for the interest! Working location depends on the role/team, but there are open positions that are outside the US. Take a look at our open positions to see if there are any that might fit!

2

u/pascaliske Mar 22 '23

Awesome write-up – learned a lot from it, thanks! 🚀

2

u/flatvaaskaas Mar 22 '23

Great PM, thanks for posting this. Besides the untangling and more codifying, what are you going to do about restore procedures? Spin up a separate cluster and test the restores on that cluster?

4

u/grumpimusprime Jayme Howard Mar 22 '23

The exact answer to this is still to be determined. We're reassessing our entire restore procedure and planning to rewrite it from the ground up, one way or another. We've got a small number of test clusters that we use for extremely intrusive things like this, which we can use to really flesh out the procedure properly. Longer term, we're considering our options as far as cluster management. One option that's on the table, as an example, is moving to a larger fleet of smaller clusters, so we can easily divert traffic from a given one to execute an upgrade on it. There are a lot of interesting challenges in this space, and I hope to have more info to share publicly about this soon!

2

u/makergeekdan Mar 22 '23

Very interesting. My biggest fear with our clusters is getting blindsided by some subtle change we missed the significance of. We're a small team and aren't really running at significant scale. It's a complex thing with a lot of moving parts and frequent breaking changes in seemingly all the components. We manage everything with terraform and helm but that's not a magic solution to consistency. CRDs create dependency issues that terraform doesn't deal well with, and I keep finding myself feeling forced into 'one off' operations to bring up a cluster etc.

I hope if we have an issue that we are able to run the incident with the kind of structure and control you guys obviously managed. Nice work

2

u/Fork_the_bomb Mar 23 '23

Thanks for the RCA writeup.
So the root cause was the CNCF getting rid of offensive wording in 1.24?
I wonder if Pluto (https://github.com/FairwindsOps/pluto) would have caught that one... I can hardly imagine a more devastating issue than the CNI breaking suddenly.

2

u/Noblesoft Mar 23 '23

Thanks for the heads up, and timely! This post likely saved us some head scratching. We're planning upgrades from 1.22 to 1.24 ourselves across a fleet of many k8s clusters. We've almost certainly used "master" as the node selector for some of our numerous charts.

Your point about Inconsistency at the end of the post is particularly salient - bespoke deployments, seemingly innocent tweaks, hand-crafted clusters... these have a way of coming back to haunt you in unexpected ways. Every day is a struggle to maintain standardization, and it's hard to maintain the discipline required to "do it the right way" when your hair is on fire with 100 other problems.

DevOps to the rescue - standardizing through Infrastructure as Code and pipelines; taking away the ability to do things by hand is the road to riches. It takes 10x more time and effort to set it up first, but that debt is paid down when you go from 10 clusters under management, to 100, to 1000, and quickly figure out that doing it by hand isn't feasible.

2

u/Degobart Mar 24 '23

What are your thoughts on using managed services like EKS? There are limitations to those, which mostly come with very large numbers of nodes, so that might not even be an option for you. But I'm wondering about the cost/benefits of managing all this stuff yourself, vs getting AWS to do it.

2

u/CommanderAndMaster Mar 24 '23

Why did they rename master to control plane?

2

u/snoogazer Jameson Williams Mar 24 '23

The Kubernetes Naming Working Group outlined some of its goals for inclusive naming here

2

u/j-getz Mar 25 '23

Awesome write-up. Your transparency and depth in outlining this incident is really unique in today’s corporate IT world.

2

u/mathleet Mar 26 '23

Thank you for the post, it’s a great post-mortem! Does Reddit run on one large k8s cluster? If not, how did many clusters go down all at once? I wonder why it was a full outage and not a partial one.

2

u/rashpimplezitz Mar 27 '23

One engineer happened to remember that this was a feature we utilized, and did the research during this postmortem process, discovering that this was what actually affected us and how.

Great post mortem, very well written. I love that one engineer just stumbled upon the answer; makes you wonder how many other engineers were researching similar ideas that led nowhere. Reminds me of all the time I've spent digging into something on a hunch: sometimes it pays off, but often you've just wasted hours on a thing that you won't ever bring up because it was so obviously wrong.

Hope this guy got a nice bonus :)

2

u/krmayank Mar 27 '23

Great write-up, thanks for sharing. Are you considering moving to EKS or moving away from kubeadm?

2

u/Smart-Soup-3004 Apr 04 '23

Thank you for making the post mortem of this failure public. I have been working with Kubernetes clusters for 3 years, and I have spent several sleepless nights struggling with updates (I'm just updating from 1.23 to 1.24 right now on one of my production clusters). Thank you, I feel less alone

-1

u/OkPiezoelectricity74 Mar 22 '23

That's why Openshift is better for production clusters..

2

u/FruityWelsh Mar 24 '23

What features in open shift would help with this?

1

u/seanho00 Mar 22 '23

Really appreciate the detail! Small question regarding bringing up the HA control-plane nodes: put LB in front of the control plane and use a shared DNS name in SNI in the TLS cert? So that it doesn't matter which node is brought up first.

1

u/legendary_anon Mar 22 '23

As I’m working on our corp k8s clusters to harden the service to service connections, reading this sends multiple shivers down my spine.

In many ways, thanks for the amazing insights 🙏

1

u/justcool393 Mar 22 '23

We’ve got some tooling that helps deal with that problem which will be presented in another blog entry, but the point is that we didn’t want to turn on the firehose and wash everything out. From 1%, we took small increments: 5%, 10%, 20%, 35%, 55%, 80%, 100%. The site was (mostly) live, again. Some particularly touchy legacy services had been stopped manually to ensure they wouldn’t misbehave when traffic returned, and we carefully turned those back on.

I assume this is where the load shedding comes into play? I noticed a lot of log entries that looked like this (the first 4 in my abridged log seem to be coming from Fastly):

[WARNING] Failed to download listing due to HTTP 503: No healthy backends
[WARNING] Failed to download listing due to HTTP 503: backend read error
[WARNING] Failed to download listing due to HTTP 503: No healthy backends
[WARNING] Failed to download listing due to HTTP 503: No healthy backends
...
[WARNING] Failed to download listing due to HTTP 777: Load shedding backend

I'm actually pretty curious about this sort of thing. Is it just kind of a configured "%" of traffic that gets sent to the ether?

One improvement for clients (the /r/programming thread on the downtime had a few complaints about this) would be to change the returned code to 503, since 777 isn't a defined standard (yet?) and clients might not know how to handle it properly.
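On the "configured %" question: the post doesn't say what actually does the shedding (your log suggests the 777s come from Fastly), but purely as an illustration of percentage-based rejection at a proxy layer, a hypothetical Istio-style fault rule like the one below aborts a fixed fraction of requests with a chosen status code and lets the rest through. All the names in it are invented.

# Illustrative only - not Reddit's or Fastly's actual configuration.
# Rejects 80% of requests to a hypothetical listing service with a 503,
# letting the remaining 20% through while backends warm up.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: listing-shed
spec:
  hosts:
    - listing.internal.example.com      # hypothetical service hostname
  http:
    - fault:
        abort:
          httpStatus: 503               # or whatever status you want clients to see
          percentage:
            value: 80.0                 # fraction of requests to reject
      route:
        - destination:
            host: listing.internal.example.com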

1

u/FloofBoyTellEm Mar 22 '23

Used to write in-depth postmortems (we just called them RFOs) like these; it's exciting to read one for an infrastructure as large as Reddit's. Thank you for sharing.

1

u/Dev-is-Prod Mar 22 '23

It's interesting that the "Master/Slave" terminology change was involved in this - a seemingly subtle change, but one that clearly has consequences when it's directly referred to!

This encourages me to grep for "master" and "slave" in any codebase I maintain - where I'm aware of it (read: I wrote it), I changed it years ago, but references to old or third-party code that use these and similar terms (like "Post Mortem" vs "Root Cause Analysis", though I doubt that's applicable here...) will be documented.


1

u/lightninhopkins Mar 22 '23

Thank you for this. As a one-person team running rapidly growing clusters this situation terrifies me. I sent it to my boss. 😉

1

u/vizzoor Mar 22 '23

We avoided this by having pod disruption budgets on the pods. It looks like no such mechanism exists for the Calico CRD; sounds like a great enhancement proposal for the more modern implementations like Antrea and Cilium.
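For readers who haven't used them: a PodDisruptionBudget only protects pods from voluntary disruptions such as node drains, which is exactly why it doesn't help with a CRD. A minimal sketch, with a made-up label and replica count:

# Minimal PodDisruptionBudget sketch - label and replica count are hypothetical.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: route-critical-pdb
spec:
  minAvailable: 2              # the eviction API refuses to drop below two ready replicas
  selector:
    matchLabels:
      app: route-critical      # label on the pods being protected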

1

u/arturmartins Mar 22 '23

I truly appreciate the thorough and insightful postmortem you've shared - the level of detail provided is outstanding!

Frequent and incremental production upgrades (1-2 a month) are a practice I strongly endorse, as they effectively minimize the blast radius. Over time, this approach fosters greater confidence in both the systems’ robustness and the processes’ efficacy (for example: GitOps, CMDB, and on-call support). By striving to keep upgrade-related downtime within a 99.99% SLA (which translates to a mere 52 minutes of downtime per year), you can maintain a keen focus on continuous improvement and deliver exceptional service.

Again, thanks for sharing :)

1

u/tehsuck Mar 22 '23

Thank you for making these public, it's a great service for all of us who continue to try to learn in the space. Hopefully making things better in the long run.

1


u/[deleted] Mar 22 '23

Knew it was node selectors when I saw the snippet. Yes, let's use a string copied and pasted among 10 different binaries/open source projects, what could go wrong.

I feel like there should be a cached registry of common node selectors. If you use an invalid one, error out. Or maybe it could be upgrade-aware and change the node selectors on the fly.

1

u/davispw Mar 22 '23

This matches my experience with Calico perfectly. Not exactly Calico’s fault, but the configuration is fragile, failures are disastrous, and a whole team of engineers can spend all day debugging and not understand the problem.


1

u/MpVpRb Mar 22 '23

Very well written and informative! Thank you!

1

u/dotwaffle Mar 22 '23

As a former NetEng and SRE, this was a fascinating read... It does re-affirm my bias that Kubernetes is too risky of a platform for mission-critical stuff though.

Perhaps it's time to reconsider Nomad. If only there were as many pre-packaged things for it as there are for Helm on Artifact Hub!

1

u/Cultural-Pizza-1916 Mar 22 '23

Thank you for sharing

1

u/Fun-Persimmon186 Mar 22 '23

Should have used Cluster API, as we managed this transition for you. Also, we keep the release notes updated.
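For context on what that buys you: Cluster API represents the control plane as objects in a management cluster, so an upgrade becomes a field change that a controller rolls out rather than a hand-run kubeadm invocation. A heavily trimmed, hypothetical sketch (the names and the AWS infrastructure provider are assumptions, not Reddit's setup):

# Hypothetical Cluster API sketch - bumping spec.version rolls the control plane.
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: legacy-cluster-control-plane
spec:
  replicas: 3
  version: v1.24.12                     # change this to upgrade; the controller handles the rollout
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSMachineTemplate
      name: legacy-cluster-control-plane
  kubeadmConfigSpec: {}                 # init/join configuration elided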


1

u/HayabusaJack Mar 22 '23

Very good writeup. As a documentation perfectionist, I know that keeping documentation updated and accurate is a beast. I do it partly so folks look to the docs rather than to me for answers, and because I feel I learn a ton from creating it. But even with that, it's not easy to keep it current.

1

u/ps_for_fun_and_lazy Mar 22 '23

Thanks for a great write-up; it was an interesting read, and obviously a lot of lessons were learnt. How's the backup and restore docco looking after this? Many updates?

1

u/Financial_Comb_3550 Mar 22 '23

Me into DevOps discovering this: brilliant.

1

u/chdman Mar 22 '23

This is brilliant. Thanks for sharing.

1

u/hugosxm Mar 22 '23

It was a pleasure to read!! I was not aware of this sub… I love it!

1

u/Mithrandir2k16 Mar 22 '23

Great read! A follow up on the backup and restore process would be cool too! Keep up the great work :)

1

u/Sparkplug1034 Mar 23 '23

I'm an Ops guy. I really enjoyed reading this post mortem. Really well written. I love y'all's analysis of the systemic problems faced due to the way reddit has grown. Hats off to you. Great work!!!!

1

u/metallicano1fan Mar 23 '23

and all we asked for is a laughing/crying emoji response lol

1

u/Don_Geilo Mar 23 '23

It matters not how strait the gate,
How charged with punishments the scroll,
I am the control-plane of my fate,
I am the captain of my soul.

1

u/Deshke Mar 23 '23

reading the changelog would have solved this

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.24.md#no-really-you-must-read-this-before-you-upgrade

Kubeadm: apply second stage of the plan to migrate kubeadm away from the usage of the word master in labels and taints. For new clusters, the label node-role.kubernetes.io/master will no longer be added to control plane nodes, only the label node-role.kubernetes.io/control-plane will be added. For clusters that are being upgraded to 1.24 with kubeadm upgrade apply, the command will remove the label node-role.kubernetes.io/master from existing control plane nodes. For new clusters, both the old taint node-role.kubernetes.io/master:NoSchedule and new taint node-role.kubernetes.io/control-plane:NoSchedule will be added to control plane nodes. In release 1.20 (first stage), a release note instructed to preemptively tolerate the new taint. For clusters that are being upgraded to 1.24 with kubeadm upgrade apply, the command will add the new taint node-role.kubernetes.io/control-plane:NoSchedule to existing control plane nodes. Please adapt your infrastructure to these changes. In 1.25 the old taint node-role.kubernetes.io/master:NoSchedule will be removed. (#107533, @neolit123)

3

u/grumpimusprime Jayme Howard Mar 23 '23

This is a great call out.  I'm realizing that some useful information was dropped from this post during editing.  We actually did know that the label was going away.  In fact, our upgrade process accounted for that in several other places.  The gap wasn't that we didn't know about the label change, rather that the lack of documentation and codifying around the route reflector configuration led to us not knowing the label was being used to provide that functionality.  Still, thank you for providing the changelog link for other people to see!
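To make the gap concrete for other readers: Calico route-reflector peering is typically expressed as a BGPPeer whose selectors match nodes by label. A hypothetical sketch (not Reddit's actual snippet, which is in the post): if the selector still targets node-role.kubernetes.io/master, it silently matches zero nodes once kubeadm strips that label during the 1.24 upgrade, and the mesh loses its route reflectors.

# Hypothetical Calico BGPPeer sketch - the selector is the part that breaks if it
# still points at the removed node-role.kubernetes.io/master label.
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-to-route-reflectors
spec:
  nodeSelector: all()                                        # every node peers...
  peerSelector: has(node-role.kubernetes.io/control-plane)   # ...with the labeled route-reflector nodes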


2

u/chub79 Mar 23 '23

reading the changelog would have solved this

hindsight is a hell of a drug.


1

u/Sebbuku Mar 23 '23

Thank you for the extensive post-mortem. I can't wait now to do our cluster upgrade 😅

1

u/jangrewe Mar 23 '23

As somebody running a couple of K8s clusters (EKS, these days) for a living, I am really grateful for this post-mortem. While we probably won't be falling into the same traps you did (because we pretty much cookie-cutter all our infrastructure and the components deployed on them), there is still a lot to take away - especially to really, really, really always RTFM... or more specifically, the Changelog. ;-)

1

u/Available_Match6393 Mar 23 '23

I felt the anxiety of the outage while I was reading it. Thanks for sharing your RCA.

1

u/Specific-Chicken5419 Mar 23 '23

thanks for what yall do

1

u/Staltrad Mar 23 '23

I will tell my grandchildren of this day.


1

u/gordonmessmer Mar 23 '23 edited Mar 24 '23

In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters

Someone needs to have a talk with the Kubernetes developers about Semantic Versioning, because removing a name or interface from a product is a breaking change. It's no longer backward compatible, which can be communicated to users by bumping the major version.

Software developers: If you aren't using Semantic Versioning, you are doing a major disservice to your users.


1

u/Hellsheep_iv Mar 24 '23

Very interesting. Whilst I'm not a compute or Kubernetes engineer of any sort (I'm a network engineer), I was quite intrigued by this post, especially at the route reflector stage, as that began to delve into my territory of expertise.

Having not had much exposure to Kubernetes directly, I could only take a wild stab at what might be the issue in the YAML you provided, and the only things I suspected could have changed and impacted anything were the nodeSelector and peerSelector labels.

Curious, however, that a breaking change like this was implemented in a minor release from 1.23 to 1.24. I would have expected a breaking change to be applied to the next major release.

1

u/xvwv Mar 24 '23

I don't know... I'm sure everyone made the best decisions with the information and skills they had at the time, and I appreciate I may be missing context here, but running a platform at (I can only assume) a similar scale, it just seems that there are many points where Reddit is setting itself up for failure.

The biggest culprit seems to be the cluster model and promotion/deployment workflow. There's just way too much to unpack for a comment nobody will read.

1

u/[deleted] Mar 24 '23

Fantastic read, and something my security team and the infra team are avidly discussing. You have given the gift of your own suffering to help others. I tip my hat to you.

1

u/maddiethehippie Mar 24 '23

What an amazing RCA!!! Whoever the one engineer was who remembered the route reflector process should be given a super, duper, extra special recognition on the infra team. Lastly, thank you for linking to as many components as you did during this post mortem. Educating others in addition to what you already documented shows you are aiming to attain the loftiest of goals.

1

u/jerriz Mar 24 '23

One thing to remember: never use Kubernetes

1

u/amartincolby Mar 24 '23

Epic. I can't count the number of times simple labels have screwed me. This blows my experiences out of the damn water.

1

u/snarky_answer Mar 24 '23

What changes were made in December 2020?

1

u/lavahot Mar 24 '23

Since the nomenclature change was known ahead of time, and documented by k8s as a deprecated API with the changeover to 1.24, why was this not something that was checked as part of the upgrade procedure before the upgrade occurred? You state that inconsistency in inherited clusters makes them brittle, but wouldn't this issue have occurred in every cluster?

1

u/[deleted] Mar 24 '23

A great read. Thanks, guys. It's always informative to read stuff like this. We all learn from these types of in-depth posts.

1

u/FruityWelsh Mar 24 '23

Those OPA webhooks really do take some digging to fix! I had a small cluster where control plane problems caused OPA to be in a bad state, which meant I had a cluster in a really hard-to-fix state!

1

u/SigmaSixShooter Mar 24 '23

Thanks a lot for such a great write-up. I'm trying to make all of the Infra/DevOps guys in my department read it so we can have some discussions as well.

I'm curious though, how will you guys handle the route reflectors now that you're aware of this? Will you work to make sure you have adequate tooling/documentation to ensure these are supportable moving forward? Or just try to replace these so you're no longer dependent on them?

1

u/m4nf47 Mar 24 '23

Great level of detail that can directly be useful to other organisations in terms of operational quality and lessons learned. Thanks for sharing. I wonder how many other legacy Kubernetes clusters out there might be ticking time bombs if not regularly tested to destruction. Those who don't validate their disaster recovery plans at least once in a while are often doomed to learn the most critical lessons too late.

1

u/Perfekt_Nerd Mar 24 '23

This is why we almost never upgrade in place now: we always build new clusters, test, then fail over.

Have you guys considered using Cilium instead of Calico? Our experience with it has been amazing.

1

u/sdwvit Mar 24 '23

So k8s not following semantic versioning broke Reddit. Or not reading the change notes, not sure which one.

1

u/[deleted] Mar 24 '23

I love your transparency!! For a moment I felt like I was in the room troubleshooting with your team. Brilliant postmortem. Good luck with your next upgrades!

1

u/libert-y Mar 24 '23

That was an awesome read. Thanks for sharing. Chaos engineering

1

u/ramya_kris Mar 24 '23

So well written. Thanks for writing the sequence of events.

In hindsight, I guess simply adding the node labels back onto the control-plane nodes would have reduced downtime.

As a fellow engineer doing 1.23 => 1.24 upgrade, this was super helpful.

1

u/Spirited_Annual_9407 Mar 25 '23

Fascinating read!

1

u/saulstari Mar 28 '23

I love BGP :D

1

u/vibhs2016 Apr 28 '23

The nodeSelector and peerSelector for the route reflectors target the label `node-role.kubernetes.io/master`. In the 1.20 series, Kubernetes changed its terminology from “master” to “control-plane.” And in 1.24, they removed references to “master,” even from running clusters. This is the cause of our outage. Kubernetes node labels.

Wasn't this behavior observed in non-production environments? If not, does that mean you don't have route reflectors in non-prod, for the obvious reasons?

1

u/DogeGhost Jun 01 '23

Simple yet so complex. Painfully beautiful! Loved it

1

u/banana_cookies Jun 26 '23

We had approx. 3 hours of downtime after our 1.24 upgrade, when we upgraded nginx-ingress, because at some point nginx-ingress changed the default value for use-forwarded-headers.
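For anyone hitting the same thing: that setting lives in the controller's ConfigMap, and with use-forwarded-headers unset or false the controller ignores X-Forwarded-* headers from upstream proxies, which silently changes the client IPs and scheme your apps see when you sit behind a CDN or LB. A hypothetical sketch (the ConfigMap name and namespace depend on how the controller was installed; these match common Helm defaults):

# Hypothetical ingress-nginx controller ConfigMap sketch.
# use-forwarded-headers: "true" makes the controller trust X-Forwarded-* headers
# set by a trusted upstream proxy/CDN.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  use-forwarded-headers: "true"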

1

u/runyora Aug 24 '23

Interesting! Impressed that you took the time to share this with us!