It’s about 1 week away from our annual user conference. Well, that is for anyone who is reading this blog entry between today and Wednesday, July 10, 2013. That’s when I’m giving this presentation to, hopefully, a packed room of awesome, interested professionals whose day-to-day life involves managing the application I build. The goal of my presentation is to share a story that started about two weeks after I got back from last year’s user conference. I’m not trying to mislead my users with the title. The story is definitely about these three terrible days we lived through in September, but the story started long before those ill-fated 72 hours. In many ways it continues on today.
Click Here for My Slides
Everybody loves a promotion right? Well that’s what happened last year right after I came back from our user conference. My boss came into my office, closed the door and gave me the news that I was going to become a Vice President in our development organization. It was definitely something I wanted for years, but didn’t think it would come in my current capacity of running our Performance and Security Engineering teams and three of our six development teams. I figured it would come if I left the company, someone retired or I switched divisions. It did come and the premise of the promotion was pretty cool. I would be asked to build a more scalable practice for both Performance and Security across the entire company rather than just within my product division. It would come with a couple of catches. The first catch was that I had to give up my three development teams. The second catch was that I would have to take on a new third team, our Release Engineering and Infrastructure team.
It was kind of like a catch within a catch. I guess they call that a catch-22 because while I would inherit this team, I wouldn’t get the Director who ran the team. He was leaving the team to be part of a new product team, but I digress as that’s a story for some other time. So I accepted the promotion, but of course under my terms. I like to think that I actually influenced the terms, but the truth is what I asked for didn’t really make anyone flinch. Basically I said I would take the change, but it meant I would get all of our DBAs and I would also get to hire a new Director for the team I just inherited. Oh and by the way there was another stipulation. I said I would take on this team if we could rebrand the team DevOps. I’m sure all the “Puritan DevOps” folks who hate others using the term DevOps in their title or team have stopped reading my blog.
It took me a week to decide if I wanted the promotion. Well…it really took me negative 6 seconds, but I did want to do my due diligence to understand the problems facing this team and the challenges that I might face trying to find a strong, capable leader to run the team. If you are wondering, I did eventually find someone to run the group. He’s a great leader and we have become quite good friends over the past year. Unfortunately he didn’t come into the team until after our incident. One might argue that it was the incident that gave the business justification to make the hire last November.
Ok…so let me get back to the story.
I want to say that I ended up taking the team officially the first week of August. I don’t particularly remember all of the things I said to each member of the team. There is one line I told each team member, including the exiting Director. Basically, I said it’s going to take me a few weeks to assess the team, build a road map, sort out our budget and initiate our recruiting process. I continued with a request of each person, which was to continue with their current work and most importantly if something was going to happen, “I wasn’t going to lose any sleep at night.” Basically, I gave each team member the opportunity to let me know if they felt any of our critical applications and/or business processes were fragile and/or at risk of going offline outside of our business hours. If anyone felt concerned, they could speak now and we could re-prioritize our work to address those potential problems. The funny thing is that everyone replied with “Everything is Safe and Sound…Nothing to Look at Here…Just Move Along.”
I took their comments at face value…but deep down inside I knew my proclamation was empty. Of course I would lose sleep if something happened.
The First Alarms
The Outage happened sometime around 2am EST on a Sunday night. Of course nobody knew it happened at the time because the little alert controls that were put in place were masked behind email rules. Nearly every member of the team had set up email rules to keep alerts from Nagios and our Cron jobs out of their Inbox and move them to another folder. Well…everyone except me. That was one of our topics of discussion in our retrospective after the outage. I wasn’t awake at 2am so I didn’t see the nearly 800 emails I received from those two systems until about 6am when I woke up…nor did anyone else for that matter.
I knew something was definitely up when I woke up, but I didn’t know the full context of the issue. Sprinkled within the 800 emails were several emails from team members overseas in Australia, China, India, Amsterdam and the UK. We didn’t know what the outage was because nearly every system was producing an alert. The emails we received from our end users complained about access to our data center, logging into our Atlassian products and syncing our source code system (Perforce). To the naked eye, the issue looked like a clear-cut network problem. We would have someone look at one of the many switches and our core router. We had seen network problems before, quite possibly not long before the outage.
One of my engineers on our DevOps team was responsible for our build process and workflow, as well as our source code system. He tended to keep really early hours so by 6:30am he was already in the office and looking at the issues. Every build failed because of timeouts between our build systems and our source code system. It turns out that our backups for our source code system also failed, but not because of a time-out. We actually didn’t know why they failed at the time as the error code was corrupted. All we knew at the time was that they failed. This engineer went to open up a JIRA ticket for tracking purposes, which he couldn’t at the time because JIRA wouldn’t load. This same engineer had ssh’d to our build system, which took 5 minutes to log in to a Linux shell. Once he got on the system it was taking forever to change directories or look up a process.
You take these four pieces of data (i.e., build time-outs, failed back-ups, slow access to JIRA and sluggish Unix commands) and the likely hypothesis was that we were having a network issue. We weren’t, which is obvious now, but wasn’t obvious then. We didn’t look at the right data. In fact, we hand selected all the wrong data. We made it worse with our next sequence of events.
We actually made more than 1 mistake. This first mistake I’m going to reference was a doozy. When our teammate couldn’t do anything on our source code system (note it’s a Linux system), he decided that maybe it would be a good idea to issue a Reboot from the command line. Looking back I’m not quite sure why he felt a reboot would address a network problem, which was his first diagnosis. In the retrospective, he tried to correlate the latency seen on our source code system with a daemon process that had just run wild. Rather than kill the process, he went for the full shebang with a reboot.
It gets better…not that I mean to laugh. I can laugh now. Back then I had more tears in my eyes than laughs in my belly. So the system would reboot. Well it sort of did. It was stuck in a boot sequence. The machine had been built as bare-metal. It was one of just a few systems in our data center that wasn’t virtualized. It also had 2 TB of storage, but this storage was all local. Keep track of this information, because all of these details are important as the story pieces together.
So when the system wouldn’t come back online and couldn’t be accessed via SSH or even over KVM, the engineer emailed one of our operations guys to see if he could swipe the boot switch and hard boot the box. YIKES!!! Our Ops Engineer complied and that’s when the blinking lights started to go crazy.
The Blinking Lights
The server didn’t come online after the hard-boot. You could say this was a blessing at first because we were so far off the trail and no scooby snack was going to get us back on the trail in time before the damage was done. It was our first clue though. The Ops team attempted to do a reboot, failed and then decided it was best to run a diagnostic test. This test failed with flying colors. We had a disk failure and would require a replacement. We had a RAID-5 configuration and on-premise support from Dell. We should have easily been able to bring the disk offline, start without issue and then have Dell replace the drive. That’s exactly the path we went down.
While the system was offline all of our other problems went away. We couldn’t explain why. Our problems with Atlassian should have been unrelated to our source code system. Our SC repository (which I’ll share now is Perforce) had no integration with Crowd or AD. We managed accounts locally within Perforce. When our Perforce system was down, all of our other problems seemed to disappear. We thought we were in the clear.
Houston We Have a Problem
So after the diagnostic test, we were able to bring the server back online. At first there appeared to be no issues. From a console, the shell was lightning quick. Within about 2 minutes the system came to a crawl. Then a whole bunch of other systems in our data center started to crawl. By this point it was about 8am and I had come into the office. I set up shop in my DevOps room and was quietly watching, while at the same time providing a little bit of direction and support.
Every problem that could go wrong did go wrong. We were getting pummeled by emails of alerts and warning messages. The few folks that were in the office all walked into the DevOps office asking whether we knew what was wrong and whether we had any idea of when the problem would be resolved. We had no idea. We didn’t even know what the issue was other than we knew we had a bad disk on a server, which at this point we couldn’t access because the load average had shot up to some number greater than 70.
What we believed was an isolated incident to our Perforce system simply didn’t add up. This system wasn’t virtual, not that this piece of data really mattered. It had no attached or networked storage. There was no integration with Crowd or AD. So what was up with this system?
Our instincts were telling us that the issue wasn’t network. We were able to ssh to other systems with relative ease. We narrowed down pretty quickly why the other systems were having issues. It turns out that any system using Crowd for authentication was having problems. If Crowd was by-passed, basically navigating to any application that allowed anonymous access, we had no issues. Of course the first thing we did was ssh to Crowd and run a few commands like top and netstat. The Crowd server was doing nothing from a resource perspective. From a netstat perspective, we had a ton of connections to a Postgres database system. There were a handful of connections to our JIRA server, as well as our Fisheye server and our Crucible server.
So once again we had a problem, but we didn’t know what the problem really was.
Knowing is Half the Battle
When you are in a crisis, it is really hard to think rationally and have perspective. I basically gave the guys on my team a free pass from thinking rationally at the time. They didn’t see it coming. They never prepared for the super bowl of outages. Nearly 90% of our enterprise applications used Crowd, which wasn’t functioning correctly. Our source code was basically still timing out, which meant developers couldn’t sync. They could work locally. They couldn’t get to their JIRA issues or to Confluence. You could say it was a confluence of disasters all rolled into one.
It was about 9:30am when we finally started narrowing down the issues. I believe at about 9:15, I strongly urged one of the guys on my team to open up a case with Perforce to see if they could aid and assist us with the time-out issue. That was throwing us for a loop given that we had diagnosed the only issue with Perforce as a storage issue. Between 6:30am and about 9:30am, the Perforce process must have been restarted a dozen times. I remember asking the engineer if I could get on the Perforce server as well, which is when we had our first discovery.
Our Perforce server was aliased as Carbon. I noticed when running top that there were Java processes for Fisheye and Crucible. Whoa…what’s this discovery? I logged into our Fisheye server, which was aliased as Fisheye. I did the same with Crucible. I did a simple who -u and then it dawned on me, we didn’t have 3 servers running these critical processes. Rather, all 3 applications were running on the same server and they just happened to be aliased via DNS. Holy moly is all I could say. By then the damage was really done.
Before I explain what I saw, I had a parallel discovery with Fisheye, Crucible and Crowd. When I noticed the Crowd server was doing nothing but waiting on its database server, I immediately logged into the database server. My background with databases had been limited to Oracle, SQL Server, DB2 and MySQL. Postgres was as foreign to me as the French Legion. The first thing I noticed about this Postgres system was that every single Atlassian system had representation on this system. We must have had 29 different databases including development and test systems, as well as non-essential systems. They were all residing on the production Postgres instance. I didn’t think it could get any worse, but it did.
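The check I did by hand with top and who -u is easy to script. What follows is a minimal sketch in Python, not something we actually ran during the outage: it resolves a list of service names and groups them by address, so aliases that land on the same box stand out. The function names, host names and addresses are made up for illustration, and the resolver is injectable so the sketch can be tried without touching real DNS.

```python
import socket
from collections import defaultdict


def group_by_address(hostnames, resolve=socket.gethostbyname):
    """Map each resolved address to the sorted list of names that point at it."""
    groups = defaultdict(list)
    for name in hostnames:
        try:
            address = resolve(name)
        except OSError:
            # Unresolvable names are grouped under None instead of aborting the audit.
            address = None
        groups[address].append(name)
    return {addr: sorted(names) for addr, names in groups.items()}


def shared_hosts(hostnames, resolve=socket.gethostbyname):
    """Return only the addresses that more than one name resolves to."""
    groups = group_by_address(hostnames, resolve)
    return {addr: names for addr, names in groups.items()
            if addr is not None and len(names) > 1}


if __name__ == "__main__":
    # Hypothetical alias table standing in for real DNS lookups.
    fake_dns = {"carbon": "10.0.0.5", "fisheye": "10.0.0.5",
                "crucible": "10.0.0.5", "crowd": "10.0.0.9"}
    print(shared_hosts(fake_dns, resolve=fake_dns.__getitem__))
```

Run against a real inventory of service names, anything that comes back from shared_hosts is a candidate for the kind of surprise we had with Carbon.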
The Damage Was Already Done
We ended up shutting down the Fisheye and Crucible JVMs on the Perforce system. Within seconds, nearly all of our Crowd problems had disappeared. If you don’t know Fisheye or Crucible, I suggest you brush up by looking at Atlassian’s site rather than getting the skinny from me. In 10 seconds or less: Fisheye indexes the code and gives us insight into DIFFs from within JIRA. Crucible basically layers code review on top of Fisheye. So at a minimum we knew at least one of those applications was accessing Perforce.
It was about 11am by this point if I recall correctly. We shut down Perforce and disabled access so no users could log in. Crucible and Fisheye were offline. Basically JIRA, Confluence and all of our other applications using Crowd were working. We tried several times to bring up Perforce, but we continued to get time-outs when we tried to sync. We finally got Perforce on the phone around 1 or 1:30. That was a long 2 to 2.5 hours, if I recall, of “trusting” my Perforce engineer to sort out the problem and get Perforce up. I finally grew impatient and said we had to get Perforce on the phone with a Sev-1 issue. It turns out that when he had called earlier in the morning, he didn’t say the production system was down, so Perforce didn’t assign an engineer on the phone. Rather, they sent us a few emails asking us for some logs and some other information. We obliged, but clearly our team didn’t move fast enough.
Within minutes of getting our support engineer on the phone, he quickly identified that we couldn’t start Perforce because our Journal file was corrupt. No worries…we could just restore our last checkpoint. Restoring the checkpoint was pretty simple. All you needed was a successful back-up with the correct checksum after the deflate of the file. We spent about 10 minutes moving the checkpoint from our filer back to the Perforce server. We then spent another 20 minutes extracting the tar file. At about the 19th minute, the tar extraction failed. That sucked…the file was corrupt. We tried it for a few other files from previous days. It turns out that our checkpoint process had failed at 2am, hence a bunch of alarms.
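The lesson buried in that paragraph is that a back-up you have never verified is only a hope. As a rough sketch of the idea in Python, assuming a digest was recorded somewhere when the back-up was written, a check like this could run before anyone depends on the file. The function names and the choice of SHA-256 are my assumptions for illustration, not a description of Perforce’s own tooling.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte back-ups need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(path, expected_hex, algorithm="sha256"):
    """Return True only if the file is readable and its digest matches the recorded one."""
    try:
        actual = file_digest(path, algorithm)
    except OSError:
        # A missing or unreadable back-up counts as a failed back-up.
        return False
    return actual.lower() == expected_hex.strip().lower()
```

Had something like this run right after the 2am job, the corrupt files would have been one clear failure message instead of a surprise at the 19th minute of a tar extraction.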
Our Perforce Support Engineer wasn’t alarmed though. He felt like he could still restore and piece together the missing data if we could find a clean checkpoint. While we were un-tarring files, he was digging into our corrupt Journal file. The file was massive. It was hundreds of MBs. It had filled up with nearly a million lines of Crucible debugging statements.
Our Dell engineer who was working on the disk replacement with our Operations team came to the conclusion that Fisheye’s log file must have collided with the Perforce journal file. Basically, Fisheye was dumping thousands of debug statements per minute into this journal. It ran for over an hour, which certainly spelled disaster.
Are You Keeping Track of All of Our Issues?
We tee’d this disaster up for a giant home run in the history of disasters. The system was architected for disaster (no virtualization, local storage, etc…). Nobody was reading alerts. The team rebooted a Unix box. When the reboot failed, the team had to hard-boot the box. Our most important software system was also hosting two other applications, but the majority of the team had no idea because legacy teammates had set it up. Backups had failed and nobody really knew, nor did anybody care. Oh yeah and every database under the sun, whether production, dev or test, resided on a single point of awesome failure.
Keeping Your Users Informed
Somewhere during the day we had a few moments of clarity to keep our users informed. We must have sent out emails every 30 minutes informing our users with a status. I remember at about 7pm, we projected that our restore might be ready by 11pm and Perforce service would be restored about an hour later. That hour came and went and we had no idea if we would be ready. We asked our users for an extension of 90 more minutes. By 12:30, we sent out a note saying we would update status around 6am the next morning.
None of us went to bed at this point. Perforce stayed on the phone with us the entire time. We had started the issue at noon our time on Monday. It was now 2am on Tuesday. We had rotated support engineers now for the third time (East Coast, West Coast and Australia).
Keeping Your Executives Informed
I kept an open email thread and a text conversation with a few key executives letting them know the really gruesome details of what we were dealing with in this outage. I assured them we wouldn’t go to sleep until we resolved the problems. I kept it honest and said we had a really good chance of loss of data. The dominos lined up prior to this outage perfectly. My bet was we would lose at least Monday’s limited data and potentially any check-ins over the weekend. We had forecasted somewhere between 8 and 10 changelists. We thought we could get that data back since we were pretty confident most of the developers had their changelists locally in their cache.
The key was to find a back-up from the weekend that worked.
Keeping Your Team Calm
Yeah…so we couldn’t find a good checkpoint from the weekend, which meant we had to either take Friday’s or Thursday’s checkpoint. That would really suck because we would also have to be dependent on our developers to resubmit their CLs, which meant we had true data loss. For an operations engineer to hear data loss is like stabbing them in the heart. It means their DR initiative never worked, which makes executives ask the question, “Why are we paying you?”
As a manager it was my job to keep all of the players calm and collected. Transparency was key. It was hard after 24 hours with no sleep. It was 6am by this point. While I had gone home around 2am, we still had 3 or 4 guys in the office. They were working the problem, but by now support was with London, their 4th support engineer. Our journal file that had been corrupted was finally clean after about 15 hours of manual effort. We found a checkpoint to restore and with the clean journal file, we were pretty sure we would lose maybe some small metadata about the CLs from the past few days. We wouldn’t lose source or versioning.
One thing we did do about 2pm on Monday was start using a Skype channel. This was required early on as we had 2 guys at our data center and the rest of the team in DC. We needed constant communication. It was that Skype channel that kept the mood up-beat and at times full of humor. Our DevOps team has a lot of personality. When you are up and working for what would be 72 hours, that’s a big deal.
Day Two…Making Decisions
We had to decide a lot on the fly. While the bad disk had been replaced, I wasn’t about to go back into production with the hardware we had. It wasn’t a question about the hardware, but rather concerns over bare-metal configurations with local storage. We made the decision to repurpose some hardware inventory, which we built-out the morning of day 2. We virtualized the server, though we dedicated the entire VM to the system. This at least gave us the ability to VMotion the VM to another server if we had physical issues. Second, we went against the grain and deployed on our enterprise NetApp system. I remember Perforce and some members of my team were against this move. My operations guys were in my camp on both the VM and NFS fronts.
I did some research and found at least two instances of customers who had a) virtualized and b) run over NFS. I reached out via email to a few bloggers who had written about their escapades in setting up with this alternative configuration. Eventually by late evening on day two I received a response from one of the bloggers who encouraged our configuration. It didn’t matter at that point as we had already made the decision to make the move.
We did a lot of game-planning on day two. Every white board in our shared office was filled with notes. We had to send a few of our guys home to get some sleep as well. We got smarter and went to shifts. Each shift had a clear set of goals. If I recall, no shift was really longer than 6 hours.
Day Three…The Home Stretch
I think about 2am on Wednesday morning, we had successfully restored the original production system. We had also built-out our new system. We ended up spending about 12 hours from about 2pm on Tuesday to 2am on Wednesday moving files, un-tarring, starting up, etc…The system was finally up and running. We still had time-out issues on our new system, which meant we weren’t in the clear.
We made the decision to upgrade before we released back to our users. So not only did we recover, but we were also going to upgrade. Talk about taking risks…but at this point our most critical system had been down for two days and we were trying to get back online.
We decided that we wouldn’t do an upgrade on production unless we could confirm a successful backup and restore. By 5am we had done both. We planned to do the upgrade at 7am right after Perforce’s next shift change. We wanted them on the phone with us in case of an issue.
We upgraded the system with a full checkpoint around 10am. Eventually we handed over the system back to our users around noon. Sending out that email to the team was a great feeling. At least the system was back and operational.
What We Lost
Credibility…that’s what we lost. We ended up getting 99.9% of the data back. Our community of users lost nearly every shred of confidence in our team. We were basically down for 3 days. Our developers did their best to be productive, but how productive could they really be working locally.
Our users lost a lot of faith in Perforce as well. It didn’t matter that in our eight previous years of using Perforce, we never had an outage. The thing is, they were our saviors and got us back operational. The problems that led up to our disaster were caused by ourselves. We were just lucky that in the previous eight years we never had an outage. Another key thing to note is this outage came at a time when a lot of our developers were working on a project using Github. Our Perforce users were ready to revolt and they wanted to be off of Perforce ASAP and wanted us to migrate to Github.
We Made Lots of Mistakes
I’m specifically referring to the first 36 hours in which nearly everyone on the team worked around the clock without sleep. We made mistakes before we were tired though. When the outage started on Monday morning, we made mistakes by using the wrong litmus tests. We didn’t understand what our issues were. We didn’t understand how our systems were architected. Probably the biggest mistake was that we didn’t have a true plan to recover. We literally had to plan on the fly. Nobody was ready. It showed during those first 36 hours.
I remember at one point after we made the decision to build-out a second server, one of our teammates who had been up for 32 straight hours was working on recovering the backup. He was supposed to do a restoration on the new server. Not realizing which server he was on, he initiated a recovery on the production system that had failed us the day before. That mistake set us back 4 hours. I remember just 6 hours earlier, another team member fat-fingered a key stroke, which ended up deleting a critical file we had spent 90 minutes copying over the network. That of course set us back an additional 90 minutes.
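Tired people run commands on the wrong box, so it helps when the script refuses for them. Here is a minimal sketch in Python of a guard that could sit at the top of any restore script. The names require_host and WrongHostError, and the host names in the usage note, are hypothetical and only there to illustrate the idea.

```python
import socket


class WrongHostError(RuntimeError):
    """Raised when a destructive step is attempted on an unexpected machine."""


def require_host(expected, current=None):
    """Abort unless the short name of this machine matches the expected target."""
    if current is None:
        current = socket.gethostname()
    # Compare short names so "p4-new" and "p4-new.example.com" are treated alike.
    short_current = current.split(".")[0].lower()
    short_expected = expected.split(".")[0].lower()
    if short_current != short_expected:
        raise WrongHostError(
            f"refusing to continue: on {short_current!r}, expected {short_expected!r}"
        )
    return short_current
```

A restore script would call require_host("p4-new") as its first line, so that running it on the old production box fails immediately instead of four hours later.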
The True Recovery
Getting the system back online was crucial for our users. It took us nearly three days to get our source code system back online. We didn’t end up bringing Fisheye or Crucible back online for nearly 3 weeks after the outage. We experienced a lot of issues getting those systems recovered. The support experience between Atlassian and Perforce was eye opening. I truly enjoy the Atlassian products. I think they deliver very value-driven products. I don’t think they hold a candle to Perforce from a customer experience perspective. Note that when I add software or renew my licenses with Atlassian, I simply make the purchase via a credit card. I get an email confirmation and that’s it. I don’t have a single POC from sales or support. I’m pretty convinced they have neither.
While my users were griping about the Perforce outage and wanting Github, around the same time there were intermittent Github outages. Call it what you want, but from a Perforce perspective, it was fortunate timing to see Github having issues.
Where We Are Now
We have been outage free for nearly 10 months. We put a lot of controls, investment, training and planning in place. We didn’t spend any new dollars in 2012 or even 2013 in hardware or software. We did make a small investment in training, specifically administrative training of Perforce. If you recall, during the outage, we had to have Perforce on the phone for nearly the entire 72 hour ordeal. The fact that this vendor stayed with us around the globe for three days is truly a remarkable success story of a vendor and a client working together to achieve success.
It turns out that we did our training within the first two weeks of our Director of DevOps starting. That was a critical week for us in many regards. First, we set a goal of designing and deploying a highly resilient system that could scale and support our growth and usage. Our mission was accomplished as we have two additional failover points. Going from one to three server instances wasn’t too challenging. We simply had to make the investment. Second, we worked with our vendor to implement their most reliable and true best practices for generating back-ups and check-points. Every day we take successful checkpoints. It’s the only email we get Sunday through Friday. Then on Saturdays we do a full back-up, but with zero down-time.
The best way to put it is that we had been taking bad back-ups. We would bring our users offline every day. Because it happened at 2am, we didn’t think twice about it. We didn’t have a true global team back when Perforce was first implemented. We had developers who were working at 2am, but basically they knew that the system was offline from about 2am to 4am. We never asked our international users how they felt about being offline for 2 hours during the peak of their day. Needless to say, these changes positively affected their productivity.
We also implemented caching servers in our distributed offices to improve their quality and experience with the application. Out of sight…out of mind is definitely a theme that resonated with the DevOps team prior to me joining the group. Don’t get me wrong, they serviced their end users as best they could, but if a user was remote, they had a greater likelihood of a poor experience versus let’s say a local team member out of DC.
I would say our two greatest accomplishments are what’s keeping this product running today. First, our operations guys have pretty much moved into the realm of application management. During the outage, the key folks supporting Perforce were developers. Our operations guys, who are more like system administrators had limited experience running Perforce. They weren’t even power users, just passive users. Today, they pretty much run the Perforce production environment. They handle nearly every aspect from account management to storage provisioning to back-up and recovery. Second, we run through our recovery process every single day in an automated fashion. We literally practice restores in seconds versus hours or days.
Of course there are other aspects that we follow such as maintaining active versions by upgrading when the time is right.
The One Metric that Matters
There is nothing worse than having your end users tell you that you have a problem. We had alerts and alarms. We might have had a dozen or so rules in Nagios to tell us if we had an issue with Perforce. None of them really matter if they are white noise, completely ignored by the team.
When I first took over DevOps and was added to their Distribution Lists, my email would fill-up with two to three hundred automated messages per day. It was nearly impossible to decipher what was real and what was fake. I completely understand why people created Outlook rules to manage these alerts.
Today we keep track of three emails. We expect a daily email summary of the checkpoint and the recovery. It happens around 10am every morning, which was done on purpose. Since the process happens while the system is operational and accessible, we wanted this most critical operation to happen during business hours so we could respond. The second email is the Saturday full back-up and restoration, which is our automated “Super Bowl” of back-up and recovery operations. The one email we don’t want to get is a single Nagios alert that the system is down. We keep track of small things like storage growth, but we don’t alert the whole team.
We really only care about 1 email a day, which is our checkpoints.
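Relying on one success email a day has one weak spot: the day the email simply never shows up. A minimal sketch in Python of how that gap could be covered is below. It classifies the newest successful checkpoint by age. The function name and the 26-hour threshold are assumptions I picked for illustration (a daily job plus a little slack), not the values we use.

```python
from datetime import datetime, timedelta


def checkpoint_status(last_success, now, max_age=timedelta(hours=26)):
    """Classify the newest successful checkpoint as 'ok', 'stale' or 'missing'."""
    if last_success is None:
        # No successful checkpoint has ever been recorded.
        return "missing"
    if now - last_success > max_age:
        # The daily job either failed silently or never ran.
        return "stale"
    return "ok"


if __name__ == "__main__":
    now = datetime(2013, 7, 3, 10, 0)
    print(checkpoint_status(datetime(2013, 7, 2, 10, 0), now))
    print(checkpoint_status(datetime(2013, 6, 30, 10, 0), now))
```

Anything other than "ok" is worth a single, loud alert, which keeps the inbox quiet on the days when everything worked.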
A Plan Means Nothing If You Don’t Practice It
One of the first deliverables the team completed when Perforce came on-site was a DR plan. We didn’t just identify hardware and put together visios of the architecture, though from an operations perspective it’s a critical piece of data. We have documented roles and responsibilities. We understand the sequence of events. Most importantly the whole process is scripted. It is automated and as I mentioned before, we run it every day. We literally practice our DR activities every day between 9:50am and 10am. The system is 100% operational.
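To give a feel for what "scripted" means here, the following is a simplified sketch in Python of a drill runner, not our actual scripts. It runs an ordered list of named steps, stops at the first failure and produces the lines that would go into the daily summary email. The step names and the run_drill function are hypothetical, and the real checkpoint and restore commands are left as placeholders.

```python
def run_drill(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns (succeeded, report_lines) so the caller can mail the report.
    """
    report = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Later steps depend on earlier ones, so do not continue past a failure.
            report.append(f"FAILED {name}: {exc}")
            return False, report
        report.append(f"ok {name}")
    return True, report


if __name__ == "__main__":
    # Placeholder steps; a real drill would call out to the back-up tooling.
    steps = [
        ("take checkpoint", lambda: None),
        ("verify checksum", lambda: None),
        ("restore to scratch server", lambda: None),
    ]
    succeeded, lines = run_drill(steps)
    print("DR drill passed" if succeeded else "DR drill FAILED")
    print("\n".join(lines))
```

The design choice that matters is stopping at the first failure: a restore that runs against a checkpoint that was never verified tells you nothing.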
We have had nearly a dozen live drills as well as part of lab maintenance and patch work to our infrastructure. We have had a few storage maintenance windows requiring us to bring the systems offline. We have also had to assist with a full upgrade of a Perforce instance for our services organization.
We don’t view Perforce as a fragile system anymore. Enough people on the team understand the ins and outs of the system. They understand how to take checkpoints. They understand how to recover a system. They understand how to de-frag the system as well. It’s not just one person, but 4 people on the team. In fact, the team has had a small bit of turn-over. We lost 2 of the software engineers on the team who had previously managed the system primarily. Now our operations team manages the system for the most part. Our faith in the reliability of the service is at an all-time high.
Taking Two Steps Back to Jump Five Steps Forward
We are in a better place today with Perforce than we have ever been before. Those of us who lived through the outage see the story as a badge of honor. Most operations guys want to have two or three of these kinds of war stories as a means of beefing-up their own credibility. Today we are prepared, whereas last year we simply were not. We are not as active in the Perforce community as I would like. We are more stable, reliable and performant.
Essentially our mistake or series of mistakes was the result of human error. We had many mistakes. These mistakes started before the outage, in the early stages of the outage and during the late evening of day one. Human error is often to blame. When you think about the alarms we had setup, even though we were using technology to alert us about a problem, it was human error to ignore them. It was human error to set-up so many alarms.