this post was submitted on 23 Sep 2023
37 points (93.0% liked)
Asklemmy
Anecdote time: I'm an IT dude for the offshore industry. I deal with production server clusters on board ships, and I usually show up when we're mobilizing to make sure the system is ready to go. In Europe this is mostly smooth sailing, pun only partially intended.
Back in April I showed up in Singapore two weeks before anyone else, because there was some maintenance I needed to do beforehand, mostly down to setting up some huge RAID volumes from scratch and (luckily) restoring some stuff from a backup. This would mostly involve a lot of waiting at the hotel while routinely checking in via VPN.
These production clusters consist of four servers mounted in an air-conditioned rack, which in turn is mounted in an air-conditioned shipping container. This way the system can be used all over the world without much adaptation needed.
When preparing to do this maintenance I did what I always do at the start: connect power, but leave the UPS off, just keeping the AC running. I then grabbed a Grab taxi and went to Sim Lim Tower for some odds and ends that I needed, with the intention of just letting the AC do its thing for a few hours.
Upon returning after some shopping, geek browsing, and lunch, I checked that the temperature and humidity in the rack were fine. I then proceeded to power up the stuff. Network: fine, both the 10-gig and 100-gig networks. Various purpose-made proprietary hardware was also fine. I then proceeded to power on the servers, and that's when the issues started: the first one didn't power on. "Huh, odd", I thought. Not too alarming, as I was doing it remotely over LAN, and this sometimes croaks and can often be resolved by flashing the firmware on the management interface. So I went for the button instead; nothing. "Shit...".
I always keep a spare server with the systems for cases like this, so I started manhandling these 70 kg beasts, getting the faulty one out and the spare in, and swapping over all the drives. After powering it on and importing the RAID arrays, it worked as intended. The most important server was up and running.
Second server, almost as important as the first one: same issue. "Shit, that was my one spare". So I started troubleshooting, swapping around CPUs to make sure they weren't the issue (each server has two and can run with just one, so unlikely).
After spending an entire day checking all combinations, I concluded that I had two faulty motherboards on my hands, despite them working fine before they were shipped from Europe. I called my vendor and, to my luck, they had a motherboard of the right type in stock. Only problem was that it was on the other side of the planet.
I then phoned up a coworker of mine. I knew he was flying in the day after, so I offered him free beer on arrival if he could drive for three hours to the vendor, pick up my motherboard, and then hand-carry it around the world to me.
Luckily the third and fourth servers powered on fine, so the RAID operation could churn along while I waited for spares. My coworker arrived with the spare a few hours before I needed everything working, and there was a frantic scramble of tools, personnel and hardware to get the last machine operational. Swapping motherboards in these servers is bloody annoying - imagine swapping the motherboard in a normal PC, but with 10 times as much hardware populating it.
Finally, an hour before the ship was about to sail, everything was powered and ready to go. Investigation indicated that even though the air measured fine, there was most likely too much humidity inside the two servers at the bottom. Moral of the story: electronics don't like the Singaporean climate.
After a stressful week I finally got to properly unwind by overdosing on Burnt Ends and Shiner Bock at Decker BBQ that night.
So what's the new run book for starting your skiffs in other countries? Hygrometer readings in the chassis before applying power?
Great storytelling, by the way - very gripping. I was enthralled the entire time.
Most important servers go on top (we can run relatively fine without one of them, but with reduced capacity, as long as the cluster primaries work), and run the AC for longer during mobilization (a full day and overnight). Also, this utility hatch had a broken gasket and loads of bolts missing. This has now been fixed to make sure the container is reasonably air tight.
I've also considered some sort of heater in the bottom of the rack to prevent condensation during the initial AC cycle.
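For what it's worth, one cheap pre-power-on sanity check along these lines is comparing the chassis temperature against the dew point computed from an air temperature and humidity reading: condensation forms on any surface colder than the dew point. A minimal sketch using the Magnus approximation - the coefficient values are a standard published approximation, but the safety margin and function names here are my own assumptions, not anyone's actual run book:

```python
import math

def dew_point_c(air_temp_c: float, rel_humidity_pct: float) -> float:
    """Approximate dew point via the Magnus formula (reasonable for ~0-60 degC)."""
    b, c = 17.62, 243.12  # Magnus coefficients for water vapour over a flat surface
    gamma = math.log(rel_humidity_pct / 100.0) + (b * air_temp_c) / (c + air_temp_c)
    return (c * gamma) / (b - gamma)

def safe_to_power_on(chassis_temp_c: float, air_temp_c: float,
                     rel_humidity_pct: float, margin_c: float = 3.0) -> bool:
    """Condensation forms on surfaces below the dew point of the surrounding
    air, so require the chassis to sit comfortably above it (hypothetical margin)."""
    return chassis_temp_c >= dew_point_c(air_temp_c, rel_humidity_pct) + margin_c

# Typical tropical harbour air, 30 degC at 80 % RH, has a dew point around 26 degC,
# so metal chilled by an aggressive AC cycle can easily end up below it.
print(round(dew_point_c(30.0, 80.0), 1))
print(safe_to_power_on(chassis_temp_c=24.0, air_temp_c=30.0, rel_humidity_pct=80.0))
```

The idea being that if the check fails, you keep the AC (or a heater at the bottom of the rack) running until the chassis warms above the dew point before flipping anything on.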
If you had been able to get the humidity down before applying power, do you think the servers would have been okay?
Maybe. It could be that the damage was already done, as the container spent six months being slow-shipped around the world. That utility hatch was basically venting in sea breeze continuously during transit.