Server Outage Retrospective: Buried in Cookies!

Hi folks! Kitasu speaking. We've had quite a few hiccups with the server and website recently. They should all be resolved now, but I wrote up an incident report about the situation and I thought I'd share it with y'all. The TL;DR is:

The game's totally fine, it's not going anywhere!
Bugs can take a really, really long time to track down if you can't get an error message.
We sure have a lot of cookies!

Note: The report below has been edited for public viewing -- some parts may be a bit vague as a result.

DAY 1

Early in the morning on November 1st the server ran a scheduled routine item expiration script. This script uses a single operation to remove all cookies from the database. In this case, the operation took too long, which made the database lock up, which made the game crash.

Around the same time, the domain we use to access our servers started failing to resolve DNS. I don't know the exact details but it locked us out of the game server entirely for most of the day. In the meantime we had our web provider reboot the machines, which our servers always ... love. They love it. (They don't love it.)

Eventually I was able to track down the direct IPs for the servers and we were able to access the machine and restart the game server. Kashi Commodore, who owns and maintains the alias server, was also able to work with Emerald Flame to fix its DNS issues. Yay! However, portraits couldn't be downloaded and we didn't know at the time what had caused the server outage -- we couldn't find any logs that indicated what had happened.

DAY 2

We'd thought that perhaps the alias servers' DNS issues had caused the outage, but day 2 proved us wrong. Because the item expiration script didn't successfully delete the cookies in the database the prior night, it had even more cookies to delete. Predictably-in-hindsight it crashed the game again.

I was able to start the game server, but none of the other DEP staff were online so I wasn't confident enough to open the server up to everyone. That caused a delay of a few hours before lockdown mode was turned off and furres were able to log in. I was out most of the day for a prior commitment, but when I got back I looked into portraits not working and concluded that the portrait downloading service probably hadn't started after the machine rebooted on day 1. Dr. Cat found the service wasn't running and restarted it on day 3.

DAY 3

We still hadn't turned off the script, but this time one server shard survived! The other shards came up but were in lockdown mode until we woke up. Dr. Cat put in a temporary fix to turn off lockdown mode for shards until the next full server restart.

Next we received reports of issues with whispering, joining, summoning, and seeing other people online. This happens when specific shards start up incorrectly and thus fail to connect to the shard negotiator -- we found the shards in question and restarted them to fix the errors.

Shortly after that Pounce stopped showing any dreams or furres online. At the same time, our CMS website started showing errors to people who accessed it. I ran a terminal connected to our web server to try to figure out both issues, and while doing so I unintentionally changed the permissions (specifically, which account on the machine was the 'owner') of one of the folders that our CMS software Joomla uses. I didn't realize at the time that this had any effect because I didn't immediately restart the PHP process that runs the website, so my changes did not go into effect until...

After a little while we figured out that one of our web server machine's partitions was at 100% usage. We found that our online checker service, which Pounce accesses to get its list of online furres, was holding onto a massive file that had been deleted. Most systems keep files 'in memory' if a program has them open, so even a deleted file can take up space if it's being held open. At first we restarted the PHP web server via Apache, but this didn't work and we quickly realized our scripts were being run by a different PHP process. We were eventually able to find the right process and kill it, thus resolving the full partition.

Pounce was fixed! But that's when we realized the CMS website was down. All the way down -- none of the pages could be accessed, not even the homepage.

DAY 4

We turned the item removal script off on day 3, so for the first time since day 1 our server didn't go down. Now that we knew the likely cause of the original crashes we turned our attention to the CMS website. Although the website had shown errors before going down, the errors didn't give us a clear lead. Our web coder Aza d'Orano popped by to help and, after we realized the full partition issue we had on day 3 and was the same partition that holds the CMS website, suggested I restore the website files from backup. This was the best solution because a full partition can corrupt files or remove them unexpectedly and it's effectively impossible to manually track them down. Luckily we have our website files backed up to a central version control repository, so I went ahead and restored them. Static pages were back!

We also spent some time looking into Exclusive Turfs. For those who don't know, turfs were a Second Dreaming reward -- furres who bought them may request to reserve a specific spot on a main map for themselves, which gives them the ability to boot other furres out of that spot. We'd learned on day 3 that the exclusive turfs in New FurN had deactivated after one of the earlier server outages. We found that it was most likely due to the dream's PhoenixSpeak database getting corrupted, which is a known issue during server restarts. After exploring a few options on how to restore the turf spots we decided to re-create them manually, which I ended up doing on day 5 after getting the necessary information of whose turf was in which spot.

DAY 5

On day 4 we had agreed that, since the item expiration script was our most likely suspect but wasn't logging its errors to a file, I would start day 5 by running the script manually while watching its output. It took down three of our shards so we knew for sure that script was our target. Luckily it also spat out an error message!

I did what every professional programmer in the past ~10 years has done when they see an error message they're unfamiliar with, which is to say I googled the text of the message to find the top search result about it on StackOverflow. Fortunately the problem our script was causing was a pretty common and straightforward issue -- an operation that deletes rows from a database, like the item expiration script was doing, can lock up if you try to delete too many rows in a single go.

The most straightforward solution is to split the task up into multiple operations (i.e., multiple "DELETE FROM" SQL statements with a "LIMIT" on them), effectively rate-limiting the script to not overwhelm the database. I added a loop that uses a counter based on the number of cookies in the database to re-run the operation until the cookies were gone. On day 6 I tested this script by running it manually and it worked as expected, so it's back to running every night like it should!

All that was left was the website. While static pages worked, trying to log in failed and simply refreshed the page. FurEd also wouldn't load, but after some discussion with Aza d'Orano we determined that the FurEd issue was being caused by the login issue. After hours of debugging -- there were no errors being returned by the server that I could find -- I finally traced the refreshing page behavior down to JSession::checkToken(), which was silently failing before the login function ran.

This line is a part of core Joomla and an important security feature, so I had to figure out why it was breaking rather than simply remove it. Other Joomla users have reported similar behavior implicating that line when something is wrong with their session storage, so I checked the session folder permissions and realized I'd changed them by accident. I changed them back and cleared the old sessions for good measure, and website logins finally worked!

WHAT WE LEARNED

We found a few pain points that we didn't know about and can now fix, like how the portrait downloading service doesn't automatically start if its machine is rebooted and how some of our scripts weren't saving their errors to a log file. I plan on also looking into setting up an alert for when our partitions get close to capacity -- it's not something we expect to happen frequently but it's also really not a thing you want to be surprised by.

Oh, and... don't mess with the Joomla folder. Never. Mess. With the Joomla folder.

I'm a programmer by trade, which is very different than a sysadmin, so there's a lot I didn't know when I started (and a lot I still don't know). That's why I've been building documentation based on everything I learn as I learn it, both for myself and for anyone who in the future may need to know why portraits aren't downloading or what to do if exclusive turf spots disappear. In that spirit, here's some of the useful Linux terminal commands I've learned since becoming staff!

find / -name "*keyword*" -- find everything on the machine containing the keyword
df -a -h -- check disk usage by partition in a human-readable format
du --max-depth=1 --si ./ -- check disk usage of the folders in your current directory
pkill -f keyword -- kill all processes matching the keyword
lsof +L1 -- show deleted files being held open by programs
netstat -ntu | awk ' $5 ~ /^[0-9]/ {print $5}' | cut -d: -f1 | sort | uniq -c | sort -n -- look at all the current connections to the machine and count how many are from each source

This thing of beauty/horror was, of course, from StackOverflow.