It's a Bad Time for Computer Trouble

Status
Not open for further replies.
You know, most of the time, when there's a server outage and the site goes down, it's due to some mundane thing like planned software updates or the typical datacenter fire. But this week, we had some rather annoying issues that really give fuel to the old System Admin adage: "Well, I've never seen that before."

Monday started off with a classic server hang. Everything just went offline like the server had been unplugged. And, in a way, it was. And of course, it was my office day. Though I couldn't look into it until the evening, I was prepared for a scenario where something had gone Quite Wrong™. In reality, it hadn't. The server was simply not responding and needed a power cycle. It booted right back up without issue. Hours of downtime that were solved in minutes. C'est la vie. Things were fine. But not for long.

Barely made it 24 hours before the unthinkable happened. Tuesday evening, the site went down again, but this time, not completely down. I could still log in on the console even though the site was offline. Oh, could I ever log in. But it was not pretty.
The database, our precious storage location for every forum game, roleplay, meme, and inappropriate private conversation, was offline and could not start. Why? File corruption. Some pages were missing from our Big Book of RpNation Nonsense Carefully Worded and Thoughtful Roleplays. Our server's filesystem is so good that it can detect when data becomes corrupted, and it is more stubborn than your most hated in-law about refusing to hand over that bad data. Bad data is worse than no data. We store our data in two places at the same time, a pair of SSD drives in the server, each with a full copy of the data. But today, the server regretfully informed me (the server is a tall, imposing, British butler wearing a tuxedo in today's story) that data on both drives had been mangled at the same time. Now, we do also sync the data to another location as an off-site backup. So the ship certainly isn't sunk, but rolling back to a backup means data loss, and that's not ideal. (Can you imagine having to re-create the last several harrowing minutes of the thread where we see how high we can count without a mod??)
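
For the curious, the "two copies plus checksums" idea can be sketched in a few lines of Python. This is a toy model, not the server's actual filesystem: the function names, the use of SHA-256, and the in-memory list standing in for the two SSDs are all assumptions made for illustration. It shows why one bad copy can be repaired automatically, while two bad copies leave nothing trustworthy to return.

```python
import hashlib
from typing import List


class ChecksumError(Exception):
    """Raised when no copy of a block matches its recorded checksum."""


def checksum(data: bytes) -> str:
    # SHA-256 is an assumption here; real filesystems use their own checksums.
    return hashlib.sha256(data).hexdigest()


def read_mirrored(copies: List[bytes], expected: str) -> bytes:
    """Return a verified block from a mirror, repairing bad copies if possible.

    `copies` is a hypothetical stand-in for the same block on each drive.
    `expected` is the checksum recorded when the block was written.
    """
    # Look for any copy whose contents still match the recorded checksum.
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        # Every copy is damaged: refuse to return bad data.
        raise ChecksumError("all copies failed verification")

    # Self-heal: overwrite any damaged copy with the verified one.
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good
    return good
```

With one damaged copy, `read_mirrored` quietly returns the good data and fixes the mirror. With both copies damaged, it raises `ChecksumError`, which is roughly the situation described above.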

It's worth noting that Reginald McServerface over here only really cares about the data that's detected as bad. So I go looking through the logs of the database. It turns out that the file in question with bad data isn't actually part of the data storage, but is nonetheless an essential part of the database's functionality (hence why it won't start). So I check the server logs while experimenting with the database. It seems like the part of the file that is corrupted is fairly small. So I have the server copy out all of the file it can read, while ignoring the bad parts. Only a few bytes were missing. Not that bad, really. Now the question is, can the database take this "reconstructed" file and fix it based on its own recovery mechanisms. Only one way to find out. We replace the bad file with the reconstructed file and start the database.
Everything came right up.
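
The "copy out everything that can be read, skip the bad parts" step can be sketched roughly like this. The post doesn't say which tool was actually used, so treat this as an illustration only: `salvage`, `read_at`, the block size, and the choice to fill unreadable regions with zero bytes are all assumptions. The reader is passed in as a function so a read failure can be simulated without a failing disk.

```python
from typing import Callable, List, Tuple


def salvage(
    read_at: Callable[[int, int], bytes],
    size: int,
    block_size: int = 4096,
) -> Tuple[bytes, List[Tuple[int, int]]]:
    """Copy `size` bytes block by block, skipping blocks that cannot be read.

    `read_at(offset, length)` is a hypothetical reader that returns bytes or
    raises OSError when the underlying storage reports an error.
    Returns the reconstructed data and a list of (offset, length) bad ranges.
    """
    out = bytearray()
    bad_ranges: List[Tuple[int, int]] = []
    offset = 0
    while offset < size:
        length = min(block_size, size - offset)
        try:
            chunk = read_at(offset, length)
            if len(chunk) != length:
                raise OSError("short read")
        except OSError:
            # Assumption: pad the unreadable region with zeros so that every
            # later byte stays at its original offset in the file.
            chunk = b"\x00" * length
            bad_ranges.append((offset, length))
        out += chunk
        offset += length
    return bytes(out), bad_ranges
```

For a real file on a Unix-like system, `read_at` could wrap `os.pread(fd, length, offset)`. A smaller block size loses less data around each bad spot, at the cost of more read calls. Whether the application can then repair the padded file is entirely up to its own recovery mechanisms, as it was for the database here.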

So we've avoided data loss! At least, for the most part. It's possible a post or profile comment or something got lost right around when the server went down. No way to know for sure. But what happened? Why did our redundant data storage scheme fail?
Well to that, I have not much more than guesses. I contacted our hosting company about the issue, and they offered to take a look at the server hardware. Since the site was down anyway, that posed no problem. I shut the server down and they replaced both the power supply and the connectors for the SSD drives. Best guess? Something faulty in the communication (RAM, PCIe bus errors, something else) caused by some random event (cosmic rays, electrical power surge, bad karma, a glitch in the Matrix) made it so that bad data was written to both SSDs at the same time. In that case, the data can't be automatically recovered. With luck, the problem came from the hardware they replaced and we won't have to deal with this again.

But we're back now (as of Wednesday evening), and everything has been caught up with regard to our data backups. If you have any questions or just want more details about the technical side, let me know and I'll see if I can elaborate.

Go RpNation!
 
Ghan, thank you as usual for all the astonishing work you do for us. I notice every time the site goes down it's at a super inconvenient time, like when you have to be in the office.
 
I wish I understood what it means to work with that kind of tech and coding. I was a fish scientist, I only speak professional ocean.
 
Ghan, you are gifted.

Not only do you have the ability to restore what is for possibly all of us unrestorable, you are able to lay the story out in layman's terms with humor and fun along the way. =)

For some of us, these posts are our dreams. This is as close as some of us will ever get to sharing those dreams with friends. And you protect those dreams - that data - as if it were your own. You are terrific, Ghan!

I hope you share your love of Trans-Siberian Orchestra again this year. If so, I would like to enjoy that again! Your readings and enthusiasm make it special! =)

Honor and fun,
Dann =)
 
Thank you so much for restoring it. I don’t know what I’d do if I lost all of my RPs. 😭 You and the staff team rock! :)
 
Thank you so much for restoring it. I don’t know what I’d do if I lost all of my RPs. 😭 You and the staff team rock! :)

At worst, we were only in danger of losing a day's worth in a rollback. We've been going since 2008, and in the site's entire history we've only ever lost data during the site fire. I think it was only a couple of hours' worth of posts, and only because the fire took down the datacenter to the point where we didn't know when the servers would come back, so we opted to restore the site from an older backup. It ended up being the right move, because that one particular server didn't return for weeks.

Something worse than this would have to happen to completely take us down. It's what backups are for :)
 
I'm so happy the problem wasn't too bad. I wouldn't know what I'd do if this site went down 💀 This is literally my life and mental health support system 😭😭💚💚
 
I've been here nearly a decade and am so grateful for the time and effort everyone has always put in to keep this site running! Thanks for keeping our site up and our writing safe :)
 
Thank you to all the RPN staff for all their hard work! I speak for everyone when I say I’m grateful for all you guys do!

<3
 
Thank you very much for the image of a British butler delivering the news, I think every inconvenient thing should have that image attached to it 🤔

Praise be to Ghan!
 
I'm so happy the problem wasn't too bad. I wouldn't know what I'd do if this site went down 💀 This is literally my life and mental health support system 😭😭💚💚

It's all of ours as well <3.

As long as we are alive, the site will never go away. We take backups and other precautions. Always remember, the worst thing we can ever suffer through is just the downtime of not having the site while the issue is resolved.

Snapshots of the site are currently taken every 5 minutes (before the fire in 2021, they were hourly).
 
Well, clearly you've never met my in-laws.

In all seriousness, thank you. I've been less active recently due to some health concerns but I still log in to re-read RPs and occasionally chat with partners. Losing any of that would have led to an extremely sick person becoming extremely upset and even the snazziest of British butlers couldn't properly portray how bad my mood would be. Thank you for all of your time, dedication, and hard work. We all appreciate you and your sense of humor on this site!

You dropped this, King 👑
 
Ok, now I’m imagining the post in a British butler voice thanks Ghan!
 