this post was submitted on 21 Jul 2024
-7 points (11.1% liked)

Technology

This is an unpopular opinion, and I get why – people crave a scapegoat. CrowdStrike undeniably pushed a faulty update demanding a low-level fix (booting into recovery). However, this incident lays bare the fragility of corporate IT, particularly for companies entrusted with vast amounts of sensitive personal information.

Robust disaster recovery plans, including automated processes to remotely reboot and remediate thousands of machines, aren't revolutionary. They're basic hygiene, especially when considering the potential consequences of a breach. Yet, this incident highlights a systemic failure across many organizations. While CrowdStrike erred, the real culprit is a culture of shortcuts and misplaced priorities within corporate IT.

Too often, companies throw millions at vendor contracts, lured by flashy promises and neglecting the due diligence necessary to ensure those solutions truly fit their needs. This is exacerbated by a corporate culture where CEOs, vice presidents, and managers are often more easily swayed by vendor kickbacks, gifts, and lavish trips than by investing in innovative ideas with measurable outcomes.

This misguided approach not only results in bloated IT budgets but also leaves companies vulnerable to precisely the kind of disruptions caused by the CrowdStrike incident. When decision-makers prioritize personal gain over the long-term health and security of their IT infrastructure, it's ultimately the customers and their data that suffer.

[–] Boozilla@lemmy.world 0 points 4 months ago (2 children)

I've worked in various and sundry IT jobs for over 35 years. In every job, they paid a lot of lip service and performed a lot of box-checking toward cybersecurity, disaster recovery, and business continuity.

But, as important as those things are, they are not profitable in the minds of a board of directors. Nor are they sexy to a sales and marketing team. They get taken for granted as "just getting done behind the scenes".

Meanwhile, everyone's real time, budget, energy, and attention is almost always focused on the next big release, or bug fixes in app code, and/or routine desktop support issues.

It's a huge problem. Unfortunately it's how the modern management "style" and late-stage capitalism operate. Make a fuss over these things, and you're flagged as a problem, a human obstacle to be run over.

[–] riskable@programming.dev 0 points 4 months ago (1 children)

everyone's real time, budget, energy, and attention is almost always focused on ~~the next big release, or bug fixes in app code, and/or routine desktop support issues~~ pointless meetings, unnecessary approval steps that could've been automated, and bureaucratic tasks that have nothing to do with your actual job.

FTFY.

[–] BearOfaTime@lemm.ee 0 points 4 months ago* (last edited 4 months ago) (1 children)

Yep - it's a CIO/CTO/HR issue.

Those of us designing and managing systems yell till we're blue in the face, and CIO just doesn't listen.

HR is why they use crap like CrowdStrike. The funny thing is, by recording all this stuff, they become legally liable for it. So if an employee intimates they're going to do something illegal, and the company misses it, but it's in the database, they can be held liable in a civil case for not doing something to prevent it.

The huge companies I've worked at were smart enough to not backup any comms besides email. All messaging systems data were ephemeral.

[–] Boozilla@lemmy.world 0 points 4 months ago (1 children)

by recording all this stuff, they become legally liable for it

That is a damned good point and kind of hilarious. Thanks for the meaningful input (and not just being another Internet Reply Guy like some others on here).

[–] Rhaedas@fedia.io 0 points 4 months ago

I don't think it's that uncommon an opinion. An even simpler version is the constant repetition, over years now, of information breaches, often because of inferior protection. As an amateur website creator decades ago I learned that plain-text passwords were a big no-no, so how are corporate IT departments still doing it? Even the non-tech person on the street rolls their eyes at such news, and yet it continues. CrowdStrike is just a more complicated version of the same thing.

[–] istanbullu@lemmy.ml 0 points 4 months ago

The real problem is the monopolization of IT and the Cloud.

[–] computergeek125@lemmy.world 0 points 4 months ago (1 children)

Getting production servers back online with a low-level fix is pretty straightforward if you have your backup system taking regular snapshots of pet VMs. Just roll back a few hours. Properly managed cattle, just redeploy the OS and reconnect to data. For physical servers of either type, you can restore a backup (potentially with IPMI integration so it happens automatically), but you might end up taking hours to restore all data, limited by the bandwidth of your giant spinning-rust NAS that is cost-cut to only sustain a few parallel recoveries. Or you could spend a few hours with your server techs IPMI-booting into safe mode, or write a script that sends reboot commands to the IPMI until the host OS pings back.
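That last idea, hammering the BMC with power-cycle commands until the host answers pings, can be sketched roughly like this (a minimal sketch assuming `ipmitool` is on the PATH and Linux-style `ping` flags; hosts and credentials are placeholders):

```python
import subprocess
import time

def ipmi_power_cycle_cmd(bmc_host: str, user: str, password: str) -> list[str]:
    """Build the ipmitool invocation that power-cycles a host via its BMC."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-P", password, "chassis", "power", "cycle"]

def host_pings(host: str, timeout_s: int = 2) -> bool:
    """True if the host answers a single ICMP echo (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def cycle_until_up(host: str, bmc_host: str, user: str, password: str,
                   attempts: int = 5, boot_wait_s: int = 300) -> bool:
    """Power-cycle via IPMI, wait for POST and boot, repeat until it pings."""
    for _ in range(attempts):
        subprocess.run(ipmi_power_cycle_cmd(bmc_host, user, password),
                       check=False)
        time.sleep(boot_wait_s)  # give the machine time to POST and boot
        if host_pings(host):
            return True
    return False
```

Run it across a fleet in a loop and you have the dumb-but-effective remediation described above; the catch, as the rest of this thread points out, is that this only works on hardware that has a reachable BMC in the first place.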

All that stuff can be added to your DR plan, and many companies now are probably planning for such an event. It's like how the US CDC posted a plan about preparing for the zombie apocalypse to help people think about it, this was a fire drill for a widespread ransomware attack. And we as a world weren't ready. There's options, but they often require humans to be helping it along when it's so widespread.

The stinger of this event is how many workstations were affected in parallel. First, good tools for a firmware-level remote access solution capable of executing power controls over the internet basically don't exist. You have options in an office building for workstations onsite; there are a handful of systems that can do this over existing networks, but most are highly hardware-vendor dependent.

But do you really want to leave PXE enabled on a workstation that will be brought home and rebooted outside of your physical/electronic perimeter? The last few years have showed us that WFH isn't going away, and those endpoints that exist to roam the world need to be configured in a way that does not leave them easily vulnerable to a low level OS replacement the other 99.99% of the time you aren't getting crypto'd or receive a bad kernel update.

Even if you place trust in your users and don't use a firmware password, do you want an untrained user to be walked blindly over the phone to open the firmware settings, plug into their router's Ethernet port, and add https://winfix.companyname.com as a custom network boot option without accidentally deleting the windows bootloader? Plus, any system that does that type of check automatically at startup makes itself potentially vulnerable to a network-based attack by a threat actor on a low security network (such as the network of an untrusted employee or a device that falls into the wrong hands). I'm not saying such a system is impossible - but it's a super huge target for a threat actor to go after and it needs to be ironclad.

Given all of that, a lot of companies may instead opt that their workstations are cattle, and would simply be re-imaged if they were crypto'd. If all of your data is on the SMB server/OneDrive/Google/Nextcloud/Dropbox/SaaS whatever, and your users are following the rules, you can fix the problem by swapping a user's laptop - just like the data problem from paragraph one. You just have a team scale issue that your IT team doesn't have enough members to handle every user having issues at once.

The reality is there are still going to be applications and use cases that may be critical that don't support that methodology (as we collectively as IT slowly try to deprecate their use), and that is going to throw a Windows-sized monkey wrench into your DR plan. Do you force your users onto a VDI solution? Those are pretty dang powerful, but as a Parsec user that has operated their computer from several hundred miles away, you can feel when a responsive application isn't responding quite fast enough. That VDI system could be recovered via paragraph 1, with Chromebooks (or equivalent) that can self-reimage if needed as the thin clients. But would you rather have annoyed users with a slightly less performant system 99.99% of the time, or plan for a widespread issue affecting all systems the other 0.01%? You're probably already spending your energy upgrading from legacy apps to make your workstations more like cattle.

All I'm trying to get at here with this long-winded counterpoint is that this isn't an easy problem to solve. I'd love to see the day that IT shops are valued enough to get the budget they need, informed by the local experts, and I won't deny that "C-suite went to x and came back with a bad idea" exists. In the meantime, I think we're all going to instead be working on ensuring our update policies have better controls on them.

As a closing thought - if you audited a vendor that has a product that could get a system back online into low level recovery after this, would you make a budget request for that product? Or does that create the next CrowdStruckOut event? Do you dual-OS your laptops? How far do you go down the rabbit hole of preparing for the low probability? This is what you have to think about - you have to solve enough problems to get your job done, and not everyone is in an industry regulated to have every problem required to be solved. So you solve what you can by order of probability.

[–] breakingcups@lemmy.world 0 points 4 months ago (15 children)

Please, enlighten me how you'd remotely service a few thousand Bitlocker-locked machines, that won't boot far enough to get an internet connection, with non-tech-savvy users behind them. Pray tell what common "basic hygiene" practices would've helped, especially with Crowdstrike reportedly ignoring and bypassing the rollout policies set by their customers.

Not saying the rest of your post is wrong, but this stood out as easily glossed over.

[–] ramble81@lemm.ee 0 points 4 months ago (5 children)

You’d have to have something even lower level, like an OOB KVM on every workstation, which would be stupid expensive for the ROI, or something at the UEFI layer that could potentially introduce more security holes.

[–] Leeks@lemmy.world 0 points 4 months ago (1 children)

Maybe they should offer a real time patcher for the security vulnerabilities in the OOB KVM, I know a great vulnerability database offered by a company that does this for a lot of systems world wide! /s

[–] A_A@lemmy.world 0 points 4 months ago (1 children)

Lol 😋! Also, I need an "Out-of-Band Keyboard, Video, and Mouse" for your "OOB KVM", so as to ~~rob the bank~~ improve security.

[–] Leeks@lemmy.world 0 points 4 months ago

“It’s turtles all the way down”.

[–] mynamesnotrick@lemmy.zip 0 points 4 months ago* (last edited 4 months ago) (2 children)

Was a Windows sysadmin for a decade. We had thousands of machines with endpoint management and BitLocker encryption. (I have since moved on to cloud Kubernetes DevOps.) Anything on a remote endpoint doesn't have any basic "hygiene" solution that could remotely fix this mess automatically. I guess Intel's BIOS remote connection (forget the name) could in theory allow at least some poor tech to remote in, given there's an internet connection and the company paid the exorbitant price.

All that to say, anything with end-user machines that won't boot is a nightmare. And with BitLocker it's even more complicated. (Hope your BitLocker key synced... lol)

[–] Spuddlesv2@lemmy.ca 0 points 4 months ago

You’re thinking of Intel vPro. I imagine some of the Crowdstrike ~~victims~~ customers have this, and a bunch of poor level 1 techs are slowly grinding their way through every workstation on their networks. But yeah, OP is deluded and/or very inexperienced if they think this could have been mitigated on workstations through some magical “hygiene”.

[–] LrdThndr@lemmy.world 0 points 4 months ago (1 children)

Bro. PXE boot image servers. You can remotely image machines from hundreds of miles away with a few clicks and all it takes on the other end is a reboot.

[–] wizardbeard@lemmy.dbzer0.com 0 points 4 months ago (1 children)

With a few clicks and being connected to the company network. Leaving anyone not able to reach an office location SOL.

[–] JasonDJ@lemmy.zip 0 points 4 months ago (3 children)

Does Windows have a solid native way to remotely re-image a system like macOS does?

[–] catloaf@lemm.ee 0 points 4 months ago (2 children)

No.

Maybe with Intune and Autopilot, but I haven't used it.

[–] Dran_Arcana@lemmy.world 0 points 4 months ago* (last edited 4 months ago) (4 children)

Separate persistent data and operating system partitions, ensure that every local network has small pxe servers, vpned (wireguard, etc) to a cdn with your base OS deployment images, that validate images based on CA and checksum before delivering, and give every user the ability to pxe boot and redeploy the non-data partition.

Bitlocker keys for the OS partition are irrelevant because nothing of value is stored on the OS partition, and keys for the data partition can be stored and passed via AD after the redeploy. If someone somehow deploys an image that isn't ours, it won't have keys to the data partition because it won't have a trust relationship with AD.

(This is actually what I do at work)
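The validate-by-checksum step in a design like that can be sketched as a comparison against a (signed) manifest before anything is written to disk; the file names below are made up, and a real deployment would also verify the manifest's signature against the CA:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-gigabyte OS images don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def image_is_trusted(image: Path, manifest: dict[str, str]) -> bool:
    """Only deploy images whose checksum matches the signed manifest entry."""
    expected = manifest.get(image.name)
    return expected is not None and sha256_of(image) == expected
```

Anything that doesn't match simply never gets deployed, which is what keeps a rogue image from ever gaining a trust relationship with AD in the scheme above.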

[–] I_Miss_Daniel@lemmy.world 0 points 4 months ago (1 children)

Sounds good, but can you trust an OS partition not to store things in %programdata% etc that should be encrypted?

[–] Trainguyrom@reddthat.com 0 points 4 months ago* (last edited 4 months ago) (1 children)

Separate persistent data and operating system partitions, ensure that every local network has small pxe servers, vpned (wireguard, etc) to a cdn with your base OS deployment images, that validate images based on CA and checksum before delivering, and give every user the ability to pxe boot and redeploy the non-data partition.

At that point why not just redirect the data partition to a network share with local caching? Seems like it would simplify this setup greatly (plus makes enabling shadow copy for all users stupid easy)

Edit to add: I worked at a bank that did this for all of our users and it was extremely convenient for termed employees since we could simply give access to the termed employee's share to their manager and toss a them a shortcut to access said employee's files, so if it turned out Janet had some business critical spreadsheet it was easily accessible even after she was termed

[–] Brkdncr@lemmy.world 0 points 4 months ago (1 children)

But your pxe boot server is down, your radius server providing vpn auth is down, your bitlocker keys are in AD which is down because all your domain controllers are down.

[–] Dran_Arcana@lemmy.world 0 points 4 months ago

Yes and no. In the best case, endpoints have enough cached data to get us through that process. In the worst case, that's still a considerably smaller footprint to fix by hand before the rest of the infrastructure can fix itself.

[–] Saik0Shinigami@lemmy.saik0.com 0 points 4 months ago

Please, enlighten me how you’d remotely service a few thousand Bitlocker-locked machines, that won’t boot far enough to get an internet connection,

Intel AMT.

[–] felbane@lemmy.world 0 points 4 months ago (1 children)

Rollout policies are the answer, and CrowdStrike should be made an example of if they were truly overriding policies set by the customer.

It seems more likely to me that nobody was expecting "fingerprint update" to have the potential to completely brick a device, and so none of the affected IT departments were setting staged rollout policies in the first place. Or if they were, they weren't adequately testing.

Then - after the fact - it's easy to claim that rollout policies were ignored when there's no way to prove it.

If there's some evidence that CS was indeed bypassing policies to force their updates I'll eat the egg on my face.

From what I've read/watched, that's the crux of the issue: did they push a 'content' update, i.e. signatures, or did they push a code update?

So you basically had a bunch of companies who absolutely do test all vendor code updates being slipped a code update they weren't aware of, labeled a 'content' update.

[–] LrdThndr@lemmy.world 0 points 4 months ago* (last edited 4 months ago) (5 children)

A decade ago I worked for a regional chain of gyms with locations in 4 states.

I was in TN. When a system would go down in SC or NC, we originally had three options:

  1. (The most common) have them put it in a box and ship it to me.
  2. I go there and fix it (rare)
  3. I walk them through fixing it over the phone (fuck my life)

I got sick of this. So I researched options and found an open source software solution called FOG. I ran a server in our office and had little optiplex 160s running a software client that I shipped to each club. Then each client at each club was configured to PXE boot from the fog client.

The server contained images of every machine we commonly used. I could tell FOG which locations used which models, and it would keep the images cached on the client machines.

If everything was okay, it would chain the boot to the os on the machine. But I could flag a machine for reimage and at next boot, the machine would check in with PXE and get a complete reimage from premade images on the fog server.

So yes, I could completely reimage a computer from hundreds of miles away by clicking a few checkboxes on my computer. Since it ran in PXE, the condition of the os didn’t matter at all. It never loaded the os when it was flagged for reimage.

This was free software. It saved us thousands in shipping fees alone.

There ARE options out there.
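The flag-for-reimage flow described above boils down to one decision the PXE server makes at every network boot; a sketch (function and image names are illustrative, not FOG's actual API):

```python
def pxe_boot_target(mac: str, reimage_flags: set[str],
                    image_for_host: dict[str, str]) -> str:
    """Decide what a PXE client should boot next.

    Normally we chainload straight into the installed OS; if an admin
    has flagged the machine, hand it the reimage payload instead.
    """
    mac = mac.lower()
    if mac in reimage_flags:
        # e.g. a premade image served over TFTP/HTTP from the FOG-style server
        return image_for_host.get(mac, "deploy/default.img")
    return "chainload-local-disk"

# An admin "clicks a few checkboxes": flag two machines for reimage.
flags = {"aa:bb:cc:dd:ee:01", "aa:bb:cc:dd:ee:02"}
images = {"aa:bb:cc:dd:ee:01": "deploy/front-desk.img"}
```

Because the decision happens in PXE, before the local OS ever loads, it doesn't matter that the installed OS is boot-looping; the flagged machine never touches it.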

[–] cyberpunk007@lemmy.ca 0 points 4 months ago (4 children)

This is a good solution for these types of scenarios. Doesn't fit all though. Where I work, 85% of staff work from home. We largely use SaaS. I'm struggling to think of a good method here other than walking them through reinstalling windows on all their machines.

[–] LrdThndr@lemmy.world 0 points 4 months ago (1 children)

That’s still 15% less work though. If I had to manually fix 1000 computers, clicking a few buttons to automatically fix 150 of them sounds like a sweet-ass deal to me even if it’s not universal.

You could also always commandeer a conference room or three and throw a switch on the table. “Bring in your laptop and go to conference room 3. Plug in using any available cable on the table and reboot your computer. Should be ready in an hour or so. There’s donuts and coffee in conference room 4.” Could knock out another few dozen.

Won’t help for people across the country, but if they’re nearish, it’s not too bad.

[–] cyberpunk007@lemmy.ca 0 points 4 months ago

Not a lot of nearish. It would be pretty bad if this happened here.

[–] magikmw@lemm.ee 0 points 4 months ago (7 children)

This works great for stationary PCs and local servers, but does nothing for public-internet-connected laptops in the hands of users.

The only fix here is staggered and tested updates, and apparently this update bypassed even the deferred-update settings that CrowdStrike themselves put into their software.

The only winning move here was to not use CrowdStrike.

[–] LrdThndr@lemmy.world 0 points 4 months ago (1 children)

Absolutely. 100%

But don’t let perfect be the enemy of good. A fix that gets you 40% of the way there is still 40% less work you have to do by hand. Not everything has to be a fix for all situations. There’s no such thing as a panacea.

[–] magikmw@lemm.ee 0 points 4 months ago (4 children)

Sure. At the same time one needs to manage resources.

I was all in on laptop deployment automation. It cut down on a lot of human error issues and having inconsistent configuration popping up all the time.

But it needs constant supervision, even if not constant updates. More systems and solutions lead to neglect if not resourced well. So some "would be good to have" systems just never make the cut, because as overachieving as I am, I also don't want to think everything is taken care of when it clearly isn't.

[–] catloaf@lemm.ee 0 points 4 months ago

Yeah. I find a base image and post-install config with group policy or Ansible to be far more reliable.

[–] Brkdncr@lemmy.world 0 points 4 months ago (1 children)

How removed from IT are you that you think FOG would have helped here?

[–] LrdThndr@lemmy.world 0 points 4 months ago* (last edited 4 months ago) (1 children)

How would it not have? You got an office or field offices?

“Bring your computer by and plug it in over there.” And flag it for reimage. Yeah. It’s gonna be slow, since you have 200 of the damn things running at once, but you really want to go and manually touch every computer in your org?

The damn thing’s even boot looping, so you don’t even have to reboot it.

I’m sure the user saved all their data in one drive like they were supposed to, right?

I get it, it’s not a 100% fix rate. And it’s a bit of a callous answer to their data. And I don’t even know if the project is still being maintained.

But the post I replied to was lamenting the lack of an option to remotely fix unbootable machines. This was an option to remotely fix nonbootable machines. No need to be a jerk about it.

But to actually answer your question and be transparent, I’ve been doing Linux devops for 10 years now. I haven’t touched a windows server since the days of the gymbros. I DID say it’s been a decade.

[–] Brkdncr@lemmy.world 0 points 4 months ago (6 children)

Because your imaging environment would also be down. And you’re still touching each machine and bringing users into the office.

Or your imaging process over the WAN takes 3 hours since it’s dynamically installing apps and updates and not a static “gold” image. Imaging is then even slower because your source disk is a single SSD, and imaging slows down once you get 10+ going at once.

I’m being rude because I see a lot of armchair sysadmins that don’t seem to understand the scale of the crowdstike outage, what crowdstrike even is beyond antivirus, and the workflow needed to recover from it.

[–] FaceDeer@fedia.io 0 points 4 months ago (4 children)

particularly for companies entrusted with vast amounts of sensitive personal information.

I nodded along to most of your comment but this cast a discordant and jarring tone over it. Why particularly those companies? The CrowdStrike failure didn't actually result in sensitive information being deleted or revealed, it just caused computers to shut down entirely. Throwing that in there as an area of particular concern seems clickbaity.

[–] technocrit@lemmy.dbzer0.com 0 points 4 months ago* (last edited 4 months ago) (3 children)

The underlying problem is that legal security is security theatre. Legal security provides legal cover for entities without much actual security.

The point of legal security is not to protect privacy, users, etc., but to protect the liability of legal entities when the inevitable happens.

neglecting the due diligence necessary to ensure those solutions truly fit their needs.

CrowdStrike perfectly met their needs by providing someone else to blame. I don't think anybody is facing any consequences for contracting with CrowdStrike. These bad incentives are the whole point of the system.

[–] riskable@programming.dev 0 points 4 months ago (1 children)

I don't think anybody is facing any consequences for contracting with CrowdStrike.

This is the myth! As we all know there were very serious consequences as a result of this event. End users, customers, downstream companies, entire governments, etc were all severely impacted and they don't give a shit that it was Crowdstrike's mistake that caused the outages.

From their perspective it was the companies that had the upstream outages that caused the problem. The vendor behind the underlying problem is irrelevant. When your plan is to point the proverbial finger at some 3rd party you chose, that finger still, 100% always, points to yourself.

When the CEO of Baxter International testified before Congress to try to explain why people died from using tainted Heparin he tried to hand wave it away, "it was the Chinese supplier that caused this!" Did everyone just say, "oh, then that's understandable!" Fuck no.

Baxter chose that Chinese supplier and didn't test their goods. They didn't do due diligence. Baxter International fucked up royally, not the Chinese vendor! The Chinese vendor scammed them for sure but it was Baxter International's responsibility to ensure the drug was, well, the actual drug and not something else or contaminated.

Reference: https://en.wikipedia.org/wiki/2008_Chinese_heparin_adulteration

[–] scytale@lemm.ee 0 points 4 months ago* (last edited 4 months ago) (3 children)

For sure there is a problem, but this issue caused computers to not be able to boot in the first place, so how are you gonna remotely reboot them if you can’t even connect to them? Sure, there can be a way, like one other comment explained, but it’s so complicated and expensive that not even all of the biggest corporations do it.

Contrary to what a lot of people seem to think, CrowdStrike is pretty effective at what it does, that’s why they are big in the corporate IT world. I’ve worked with companies where the security team had a minority influence on choosing vendors, with the finance team being the major decision maker. So cheapest vendor wins, and CrowdStrike is not exactly cheap. If you ask most IT people, their experience is the opposite of bloated budgets. A lot of IT teams are understaffed and do not have the necessary tools to do their work. Teams have to beg every budget season.

The failure here is hygiene, yes, but in development and testing processes. Something that wasn’t thoroughly tested got pushed into production and released. And that applies to both Crowdstrike and their customers. That is not uncommon (hence the programmer memes); it just happened to be one of the most prevalent endpoint security solutions in the world, one that needed kernel-level access to do its job. I agree with you in that IT departments should be testing software updates before they deploy, so it’s also on them to make sure they at least ran it in a staging environment first. But again, this is a tool that is time critical (anti-malware) and companies need to have the capability to deploy updates fast. So you have to weigh speed vs reliability.
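That speed-versus-reliability tradeoff is exactly what staged (ring) rollouts try to manage. A minimal sketch of the gating logic, with ring sizes and soak thresholds invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Ring:
    name: str
    hosts: int
    soak_hours: int          # how long to watch before promoting further
    max_failure_rate: float  # fraction of hosts allowed to fail

# Invented ring sizes: internal canary fleet first, then progressively wider.
RINGS = [
    Ring("canary", 50, 4, 0.0),   # any canary failure blocks the rollout
    Ring("early", 2_000, 24, 0.001),
    Ring("broad", 50_000, 72, 0.001),
]

def may_promote(ring: Ring, hours_soaked: int, failures: int) -> bool:
    """Promote an update to the next ring only after the current ring has
    soaked long enough with an acceptable failure rate."""
    if hours_soaked < ring.soak_hours:
        return False
    return failures / ring.hosts <= ring.max_failure_rate
```

The tension the comment describes is visible in the numbers: shrink `soak_hours` and a time-critical signature update ships fast, but a faulty one reaches the broad ring before anyone notices.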

[–] Leeks@lemmy.world 0 points 4 months ago (4 children)

bloated IT budgets

Can you point me to one of these companies?

In general IT is run as a “cost center” which means they have to scratch and save everywhere they can. Every IT department I have seen is under staffed and spread too thin. Also, since it is viewed as a cost, getting all teams to sit down and make DR plans (since these involve the entire company, not just IT) is near impossible since “we may spend a lot of time and money on a plan we never need”.
