this post was submitted on 02 Sep 2024
97 points (94.5% liked)

Asklemmy

Please just provide a short, medium or long story

dan@upvote.au 12 points 2 months ago (last edited 2 months ago)

I took down the home page of one of the top 5 websites for around 5 minutes.

There were two existing functions written by a different team: an encode method that took the name of something (only used internally, never shown to the user) and returned a numeric identifier for it, and a decode method that did the opposite.

Some existing code already used encode, but I had to use decode in my new code. Added the code, rolled it out to 80% of employees, and it seemed to work fine. Next day, I rolled it out to 5% public and it still seemed okay.

Once I rolled it out to everyone, it all broke.

Turns out that while the encode function used a static map built at build-time (and was thus just an O(1) lookup at runtime), decode connected to a database that was only ever designed for internal use. The DB only had ten replicas, which was nowhere near enough to handle hundreds of thousands of concurrent users.
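To make the asymmetry concrete, here's a rough Python sketch. It is not the real code: the names `NAME_TO_ID`, `InternalDB`, and `count_rejected` are made up for illustration, and modelling the ten replicas as ten connection slots is a simplification.

```python
import threading

# Hypothetical data: in the story this mapping was generated at build time.
NAME_TO_ID = {"home_feed": 1, "sidebar": 2, "footer": 3}


def encode(name: str) -> int:
    """Name -> numeric ID: an in-memory dict lookup, no network involved."""
    return NAME_TO_ID[name]


class InternalDB:
    """Stand-in for a small internal database with a fixed connection budget."""

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)
        self._rows = {numeric_id: name for name, numeric_id in NAME_TO_ID.items()}

    def try_connect(self) -> bool:
        # Non-blocking: returns False when every slot is already taken.
        return self._slots.acquire(blocking=False)

    def disconnect(self) -> None:
        self._slots.release()

    def lookup_name(self, numeric_id: int) -> str:
        return self._rows[numeric_id]


def decode(db: InternalDB, numeric_id: int) -> str:
    """Numeric ID -> name: every call needs a free database connection."""
    if not db.try_connect():
        raise ConnectionError("internal DB has no free connections")
    try:
        return db.lookup_name(numeric_id)
    finally:
        db.disconnect()


def count_rejected(db: InternalDB, concurrent_requests: int) -> int:
    """Count requests that find no free connection when all arrive at once."""
    accepted = 0
    for _ in range(concurrent_requests):
        if db.try_connect():
            accepted += 1
    for _ in range(accepted):
        db.disconnect()
    return concurrent_requests - accepted
```

With a budget of 10, `count_rejected(InternalDB(10), 1000)` returns 990, while `encode` never touches the database at all. The two functions look symmetrical from the call site, which is exactly why the difference was easy to miss.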

Luckily, it's commonplace to put changes behind feature flags, which is how I could roll it out just to employees initially. The devops team were able to find stack traces of the error in the prod logs, find my code, find the commit that added it, find the name of the killswitch, and disable my code, all before I even noticed that there was a problem. No code rollback needed.
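For anyone who hasn't used them, a staged rollout with a killswitch can be sketched roughly like this. It's a minimal illustration, not the system I was using: `FeatureFlag`, its fields, and `render_home_page` are all hypothetical names, and real flag systems store this state in a config service rather than in the process.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    name: str
    employee_percent: int = 0  # share of employees who get the new code
    public_percent: int = 0    # share of public users who get the new code
    killed: bool = False       # killswitch: overrides both percentages

    def _bucket(self, user_id: str) -> int:
        # Stable hash so a given user always lands in the same bucket (0-99).
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_enabled(self, user_id: str, is_employee: bool) -> bool:
        if self.killed:
            return False
        percent = self.employee_percent if is_employee else self.public_percent
        return self._bucket(user_id) < percent


def render_home_page(flag: FeatureFlag, user_id: str, is_employee: bool) -> str:
    # The old path stays in place, so disabling the flag needs no deploy.
    if flag.is_enabled(user_id, is_employee):
        return "new code path"
    return "old code path"
```

The rollout in the story would correspond to `employee_percent=80`, then `public_percent=5`, then `public_percent=100`. Setting `killed = True` sends everyone back to the old path immediately, without rolling back any code.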

That was probably 7 years ago now. Thankfully I haven't made any mistakes as large as that one again!

Always use feature flags for major changes, especially if they're risky!