Item Normalization was released on 07/07 at 6:00 AM PST. The release required downtime from 1:00 AM to 6:00 AM. Everything went mostly to plan, with a few major problems in the beginning that we worked around.
Gaia Online is like a nuclear reactor: if one part stops working, the failure cascades across all of the servers, slowly taking down everything. We had to figure out a way to shut it all down and then start it all back up. We also needed contingency plans in case something went wrong.
Once we planned out our Launch Procedure, we set up a command center in our main conference room. We had an HDTV displaying our overall ISP bandwidth, but that was mostly just for entertainment. Most participants had several monitors hooked up, all spitting out log information and other useful goodies.
click for video
Front and center is me, running the show. Across from me is the Director of Engineering. To the far right is the Director of Operations (lumberchicken). omgwhat, vryhngry, Panagrammic, and soundfx440 were physically in the war room with us. 72squared and Jakobo joined us remotely, over IRC. Lanzer also made a guest appearance to keep us in good spirits. He was also the only one who remembered how the gift system data structure worked (he wrote it like 4 years ago), so he helped us out a bit during the release.
I had worked straight through the 4th of July and the rest of the weekend because of some last-minute problems. Nevertheless, I showed up at 10:30 PM Sunday night to prepare. At 1:00 AM PST, we started our shutdown procedure by putting up the maintenance page.
We then meticulously followed a couple pages of procedure to slowly bring different systems down. Traffic came to a screeching halt and hard drives went quiet (presumably, the actual servers are colocated).
The next few hours were spent shuffling data around, backing it up, running scripts, running tests, deploying code to the live servers, and redirecting traffic. Pressure was high. I was pacing back and forth trying to figure out what to do about the vendexpiration.php script, which was running too slowly as it canceled all the items in Marketplace.
"When will we know?" the Director of Engineering had asked. Over five hours into the process, we were still not sure if we were going to pull it off. My code had to work perfectly or it was all a flop and we would have to roll back. The effort would be wasted and we'd have to try again. Rolling back was just as ominous a process as actually releasing. Nothing we were doing had ever been fully tested.
Then the moment of truth came, just as the sun was coming up. We started firing up systems. Whoops! I forgot to reserialize the items in users' avatars. It was too late to turn back just for this one issue, so we bit the bullet and had Panagrammic work on communicating this and other issues to the users. vryhngry spent several hours after release trying to resolve this, but had to abort after some complications. A valiant effort, nonetheless.
Users trickled in slowly, at first, using a system designed to limit logins. Then a flood, as people started waking up, logging in, and realizing their inventory was all jacked up. All good though, we found and fixed problems as soon as they came up. The code wasn't perfect, but at least we didn't see corrupted data, and we never came across an issue that required rollback.
I stayed until 7:00 PM on Monday evening, trying to resolve as much as I could. By the end, I was on my 7th wind and had enough caffeine to kill several elephants.
This was, by far, the coolest thing I've ever accomplished at Gaia. Congratulations, Item Normalization Team, and thanks for all the hard work.
We still have many issues to work out! The job system didn't scale the way we wanted. The main database server that holds users' inventories CRASHED (bet you didn't notice that, did you?), taking out some vends with it (I think, not sure yet). We have caching issues, loads of warnings in the logs, and sleep to catch up on. We also have marketplace listings and old bank trades that we are still churning through.
If ya don't mind, please don't comment to report a problem. TRUST ME, I know about every single one of them. Keep it to QA.