It’s Not Goodbye, but See You Later…

I’ve never been one for goodbyes over the years, so I will leave you with a blog of hope that one day we will see each other again. We might end up working together away from Blackboard, or maybe one day I will come back. I just don’t know… What I do know is that in this blog I’m going to share some parting words that hopefully will resonate with my reading audience. I’m hoping it will ignite some kind of spark and make an impact on the future.


When I started the Performance Engineering team in the fall of 2003, I set out on a mission to make Blackboard the fastest performing, most scalable e-Learning software platform in the world. I wanted us to be the benchmark in the software space, the company others look at from a distance with deep admiration and respect. That leads me to my first major point…


1) Set high expectations for yourself and your teammates…Do what you can to achieve them.


There were a lot of things I wanted to accomplish when I came to Blackboard, but the one thing I knew I didn’t want was to fail. I came here in my late twenties. I was a mere child in terms of professional experience. I was being entrusted to build a multi-million dollar team for a $100 million company that wanted to go public and become a $1 billion company.


Our CEO, Michael Chasen, had high expectations for me, so I needed to set even higher expectations for myself and the team I was building. Setting expectations is really about setting goals and then being transparent about those goals. Achieving expectations is about being both strategic and operational at the same time.

You don’t have to have 20+ years of experience to be successful in any venture. You have to be smart, committed and resilient. The smarts come from planning, researching and, my personal favorite, continuous learning from experience. The commitment is about executing to plan, as well as a willingness to re-plan after learning from mistakes. The resilience is about perseverance when times are challenging.


2) Every day is a benchmark


I wrote a blog back in June of 2007 to my team about the importance of seizing the moment. Unfortunately, the blog was internal and I never posted it publicly. I was half correct in that blog. The part I nailed was the insistence that every day is a chance to start over. Every day is a chance for a new beginning.


I missed an essential part which, looking back, could have and should have fundamentally changed our team’s purpose. It was an aha moment; if I could do it all over again, I would do it differently.


The focus of that blog was testing and benchmarking. In 2007, we were a very good testing and benchmarking organization. Some in the industry might have said we were one of the best given our maturity, practice and tooling. We should have looked at all of that production data in real time and built an analytics engine that studied live system data. That was the real data we needed more than anything. I’m not talking volumetric sampling. I’m talking APM (Application Performance Management).


We should have built the collection tools and the engine to process the data. That would have been disruptive. It would have been game changing. We didn’t, and as a result we failed to reset expectations and learn from our past experiences.


3) We cannot change the cards we are dealt…we can change how we play the hand


I don’t know the original author of this quote. The context for me is hearing it back in 2008 while watching a YouTube clip of Randy Pausch giving his famous Last Lecture. I think I watched that lecture a dozen times. I bought the book and read it over and over as well.


That quote has been in my head constantly for the last few months as I’ve been deciding whether to leave Blackboard or stick around. I’ve thought about it in the context of my 11 years here. I realized that over and over again my teammates and I were dealt blow after blow. Some of those blows were good…some were bad. Rarely did anything we planned as a group happen in a natural order. More often than not we found ourselves treading water or playing a defensive game of ping pong.


We got through it all. The way we got through it all was by being adaptive and willing to change our plans.


Blackboard will hopefully outlive me for many decades to come. The folks who are a part of its future will hopefully adapt the way my colleagues and I adapted over the years. Looking back, that’s what made this place so special. There was a simpatico ebb and flow, an ability to change on the fly. Hopefully the people and the company won’t forget that going forward.


- Steve Feldman

Blackboard (2003 – 2014)

Continuous Delivery…Continuous Integration…Continuous Deployment…How About Continuous Measurement?

I spend a lot of my free time these days mentoring startups in the Washington, DC and Baltimore, Maryland markets. I mentor a few CEOs who are building software for the first time, as well as a few folks in the roles of VP of Engineering or Director of Development. It’s fun and exciting in so many ways. I feel connected to a lot of these startups and get a lot of satisfaction mentoring some really great people who are willing to put it all out there for the sake of fulfilling an entrepreneurial spirit.

I’m not just partial to startups. I enjoy collaborating with peers and colleagues who work at more established companies. I think it’s important to get alternative perspectives and different outlooks on various subjects such as engineering, organizational management, leadership, quality, etc.


For about four years or so there’s been a common theme amongst many of my peers and the folks I mentor. Everyone wants to be agile. They also want to be lean. There’s a common misconception that agile = lean. Yikes! I’ve also noticed that a lot of them want to follow the principles of Continuous Delivery. Many assume that Continuous Delivery also means Continuous Deployment. The two are related, but they are not one and the same. Many of them miss that Continuous Integration is development oriented, while Continuous Delivery focuses on bridging the gap between Development and Operations (aka…the DevOps movement). Note: DevOps is a movement, people…not a person or a job.

The missing piece…and I say this with the most sincere tongue by the way…is that there still remains this *HUGE* gap with regard to “What Happens to Software in Production?”. My observation is that the DevOps movement and the desire to be Continuous prompted a lot of developers and operations folks to automate their code for deployment. The deployments themselves have become more sophisticated in terms of the packaging and bundling of code, auto-scaling, self-destructing resiliency tools, route-based monitoring, graphing systems galore, automated paging systems that make you extra-strong cappuccinos, etc…Snarky comment: Developers and Operations Engineers can’t be satisfied with deploying an APM and/or RUM tool and calling it a day.


Continuous Measurement is really what I’m getting at. It’s not just the 250k metrics that Etsy collects, although that’s impressive, or maybe a little obsessive to say the least. I would define Continuous Measurement as the process of collecting, analyzing, costing, quantifying and creating action from measurable data points. The consumers of this data have to be more than just Operations folks. Developers, architects, designers and product managers all need to consume this data. They *need* to do something with it. The data needs to be actionable, and the consumer needs to be responsive to the data and thoughtful about how it informs next-generation and future designs.

In the state of Web Operations today, companies like Etsy or Netflix derive a tremendous amount of meaning from the data they collect. The data drives their infrastructure and operations. Their environments are more elastic…more resilient and, most of all, scalable. I would ask some follow-up questions, though. For example, how efficient is the code? Do they measure the Cost of Compute (aka the cost to operate and manage executing code)?

Most companies don’t think about the Cost of Compute. With the rise of metered computing, it’s amazing to consider the lost economic potential and the implied costs of inefficient code. Continuous Measurement should strive to recapture that lost economic opportunity (aka…lost profit). Compute should be measured as best it can be at a service, feature and even patch set level.
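To make the point concrete, here is a minimal, hypothetical sketch in Java of what costing a feature’s compute could look like. None of these numbers are real; the $0.05-per-CPU-hour rate, the request volume and the CPU times per request are made up purely for illustration.

// A back-of-the-envelope model of Cost of Compute for a single feature.
// Every constant here is a made-up, illustrative number.
public class CostOfCompute {

    // Hypothetical metered price for one CPU-hour in a public cloud.
    static final double DOLLARS_PER_CPU_HOUR = 0.05;

    // Monthly compute cost of a feature, given profiled CPU time per request.
    static double monthlyCost(double cpuSecondsPerRequest, long requestsPerMonth) {
        double cpuHours = (cpuSecondsPerRequest * requestsPerMonth) / 3600.0;
        return cpuHours * DOLLARS_PER_CPU_HOUR;
    }

    public static void main(String[] args) {
        // The same hypothetical feature, before and after an efficiency pass.
        double wasteful  = monthlyCost(0.80, 50_000_000L); // roughly $556/month
        double efficient = monthlyCost(0.20, 50_000_000L); // roughly $139/month
        System.out.printf("wasteful: $%.0f  efficient: $%.0f  lost profit: $%.0f%n",
                wasteful, efficient, wasteful - efficient);
    }
}

Multiply that difference across hundreds of features and the lost economic opportunity stops being abstract.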

A lot of software companies measure the Cost to Build. Some companies measure the Cost to Maintain. Even fewer measure the Cost to Compute. Every now and again you see emphasis placed on Cost to Recover. Wouldn’t it be a more complete story with regard to profit if one were able to combine the Cost to Build with the Cost to Maintain and the Cost to Compute?

Maybe the software community worries about the wrong things. Rather than being focused on the speed of delivering code and features, maybe there should be greater emphasis placed on the efficiency of code and the effectiveness of features. Companies like Tesla restrict their volume so that each part and component can be guaranteed. Companies like Nordstrom and the Four Seasons are very focused on profit margins, but at the same time they value brand loyalty just as much. I used to think that of Apple, but it’s painfully obvious that market domination and profitability have gotten in the way of reliable craftsmanship. I love my Mac and iPhone, but I wish they didn’t have so many issues.


I have no magic beans or formula for success per se. I would argue that if additional emphasis were placed on Continuous Measurement, many software organizations would have completely different outcomes in their never-ending quest to achieve Continuous Delivery, Continuous Integration and Continuous Deployment. It just takes a little bit of foresight to consider the notion that Continuous Measurement is equally important.

Whatever Happened to Software Patterns and Anti-Patterns?

Thirteen years have passed since I first read Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software, which, if you know me well, you know I consider my bible of Software Performance Engineering. Note, I’m not a religious guy, so the fact that I referenced the bible is saying a lot. I still keep a copy of it about 5 feet from my desk. To this day I lend it out at least 3 or 4 times a year to members of my team. It’s a book that has clearly maintained its luster. I can’t call out anything in the book that doesn’t apply to today’s computer scientists.


My big takeaway from the book and the teachings of Smith and Williams is the notion of Software Performance Anti-Patterns. Earlier in my career and my studies I was immersed in learning about Software Patterns. I read the Gang of Four’s classic Design Patterns, which was published in the mid-1990s and in awesome fashion is still relevant today. I have a copy of that book sitting next to my copy of Performance Solutions. I wonder if today’s CS graduates are even reading either of these books as solid references. It’s like a journalist or English major making it through undergraduate studies without reading Strunk and White’s Elements of Style. Is it possible to graduate without reading these books?


As a young engineer and computer scientist focused persistently on software performance, I lived and breathed patterns and anti-patterns. I used them for meaning as well as for guidance in helping my fellow developers learn from simple coding mistakes that, in the early days of Java, were easily exposed. Early Java code in the days of Java 1.3 and 1.4 was wasteful. Heck, there’s still a lot of Java code today that’s wasteful as well. By wasteful I am referring to poor memory management and wasteful requests, to name a couple. Simple anti-patterns such as wide loads or tall loads were common. There was blatant disregard for understanding the lifecycle of data, how data was accessed, and whether the data was ephemeral or long-lived. There was too much inter-communication transport between containers and relational data stores. Not that I’m trying to equate every software pattern to memory management or data transport, or advocating the use of caches to solve every problem. I’m just picking a few off the top of my head.
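To illustrate the kind of thing I mean, here is a hypothetical sketch of the sort of load I would flag as an anti-pattern. This isn’t code from our product; the users table and its status column are made up. The wasteful version drags every row and every column across the wire and into the heap just to answer a question the database could have answered with a single number.

import java.sql.*;

public class LoadAntiPattern {

    // Anti-pattern: pull every row and every column over the transport and into
    // memory, then throw most of it away in the JVM. Wasteful requests, wasteful heap.
    static int countActiveUsersWasteful(Connection conn) throws SQLException {
        int count = 0;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM users")) {
            while (rs.next()) {
                if ("ACTIVE".equals(rs.getString("status"))) {
                    count++;
                }
            }
        }
        return count;
    }

    // Let the data store answer the narrow question; a single row comes back.
    static int countActiveUsers(Connection conn) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT COUNT(*) FROM users WHERE status = ?")) {
            ps.setString(1, "ACTIVE");
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }
}

Multiply the first version by a few hundred requests a minute and you get exactly the memory churn and chatty transport I’m talking about.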


So my goal in this post is not to sound like a curmudgeon. I’m too young to be old and crabby. I’m not a purist either. I’m more of a pragmatic engineer who likes to constantly ask questions. That’s the forensic side of my personality, I guess. The question that’s on my mind these days is: do the developers of today…specifically the developers who are going hard after languages such as PHP, NodeJS and Ruby…think about software patterns and anti-patterns? Do we see an absence of thought around challenging our designs because our code can be quickly assembled via frameworks tied to these languages and others? Has Lean Thinking made developers more conscious of code productivity than of code efficiency?

I’m sure I’m going to get lambasted by a few readers who are passionate about these modern web frameworks and the new stacks. That’s cool and fine. I personally am a big fan of these stacks and capabilities, mainly because they make the development of software more accessible to everyone. That’s not my point in being interrogative about design patterns and anti-patterns. I guess I’m more curious about whether developers today are thinking about design patterns and are able to identify anti-patterns, or whether they are more focused on writing code faster and with less effort. Don’t forget code that’s testable from the start.

That’s actually one of the greatest observations I’ve made about today’s developers using these newer languages and frameworks. They are social coders who fork and pull all of the time. They are following more agile practices around TDD and BDD. They write their own automation. A lot of these developers take more accountability for the quality of their code than any other generation of developers I’ve witnessed. Note that I am young (38) and really have only worked with 2 or 3 generations of coders. A lot of these developers are focused on deployment of their code as well. They make use of the accessibility of systems through providers like Amazon, Rackspace, Azure, Google and Heroku. They leverage deployment tools like Capistrano or RunDeck. They write automation with configuration management frameworks like Chef, Puppet and Ansible. They love APM tools like New Relic or AppDynamics. All indications support the thesis that today’s more modern developers take more accountability for so many facets of development.

We should commend those developers for what they are doing. Greater consolidation of languages, frameworks and tools increases the likelihood that the community of contributors to these technologies will give back. It also leaves open, at least in my small sample size, the possibility of blissful unawareness of good design, structure and scale. There are more developers today than at any time in history. There are more outlets for developers: social coding, open source contributions, etc. Is it possible that a larger percentage of developers are really just coders, assembling software as a commodity? This is more passion and theory than empirical analysis…

I did some unscientific research…aka lots of Google searches…and here are my observations:

1. The few relevant postings I saw about software design patterns and anti-patterns were more scholarly. They were traditional research papers written by academia and posted on ACM or IEEE. While I used to be a big ACM and IEEE reader back in the day, few of my contemporaries use them or refer to them. In fact, I haven’t read an ACM or IEEE article since 2010, which is kind of sad.

2. A lot of blog posts and even SlideShare presentations used anti-patterns in the software context to describe developer habits or bad behavior. This annoyed me the most because some of them were written by folks who run in the same circles that I run in. They weren’t talking about code design (good or bad), but rather behavior.

3. The one community that anecdotally had the most entries around design patterns and anti-patterns was the Scala community. That made a lot of sense to me, as every Scala developer I know was a hard-core Java developer who made a run at Ruby for a project or two and decided that Play was even cooler than Rails.

4. The MVVM community, big on BackboneJS, AngularJS, EmberJS, etc., didn’t really write much about patterns or anti-patterns at all. There were blogs and presentations. Some were good…some were so-so. Most were about developer behavior or code efficiency. There was one blog about BackboneJS that was OK. Nothing game changing…nothing that would act like a slap in the face to developers to think about the efficiency of their designs, their ability to scale and the cost of compute.

That last phrase, “cost of compute,” I guess is what blows my mind. I got into the world of software performance and optimization at a time when compute was really expensive. Over the years we watched compute (CPU and memory) become a commodity. If our code couldn’t scale, we would simply add more memory, more CPUs, more bare metal systems, more VMs, more storage, etc. The public cloud makes that access to more compute so simple…

The compute of 2014 and beyond is metered now…or, better yet, metered again. It was metered back in the early days of timeshare computing. Today, I find myself getting out of the game of running private data centers or using colos. I buy less hardware and storage each year. My private data center footprint is the smallest it’s been in years, not because of virtualization and consolidation, but rather because I’m moving more stuff out to the public cloud.

Each month I look closely at the bill I get from Amazon and Rackspace. That meter is constantly running. Pennies add up and you don’t realize it until late in the game. It turns out a lot of that waste is because my cost of compute (aka…the efficiency of our code) isn’t as good as it could be. We write a lot less code, but it’s not necessarily all that efficient.

I’m hoping this blog starts a conversation, not a fight. The thing I see is that innovation in the software world is at an all-time high. We do have more developers than we have ever had in our lifetime. We have more languages and frameworks today than ever before. We have more choice…more variation…more outlets. At the same time, I can’t help but think about questions like:

  • Are we producing more coders than developers because we have a supply/demand problem?
  • Are the CS grads we are producing around the world blissfully unaware of solid architectural design?
  • Do developers focus less on good system design and sustainable, long-lasting architectures meant to be used for years, and more on quick applications?
  • Has profiling become an ancient practice done by a few developers and avoided by coders?
  • Has the accessibility of cheap compute blinded our awareness of cost?

I probably could go on for longer…

HTTP Archives and Playing Around with Piwik

For the last few weeks I’ve been working a problem with a few of my teammates on front-end web performance. I use the term problem in the general sense. It’s not like we have a “performance” problem from the responsiveness perspective. Rather, it’s that we need to do a better job measuring front-end performance as part of our Continuous Delivery goals and objectives. As we attempt to make our product more responsive, modern and ubiquitous from a browser perspective, our product needs to be fast if not near instantaneous.

Our visibility is unnecessarily limited. We have ample opportunities to measure front-end performance. Rather than rest on our laurels, we need to plug away at the data and make something happen to better “capture, analyze and act,” to quote my good friend and teammate, Mike McGarr.

So for the last few weeks I’ve been working with a couple of teammates exploring how we could make use of HAR files. If you are unfamiliar with HAR (HTTP Archive), it’s a specification to encapsulate network, object and page requests over HTTP as an exportable format. The format is JSON and it has become the de facto standard for measuring and capturing web performance timings.

Over the years my team has made ample use of the data that can be exported in a HAR. We’ve used tools like Fiddler, YSlow, HTTPWatch, Firebug, Chrome Developer Tools and WebPageTest to capture page waterfalls, CSS/JS behaviors and object timings. Much of that analysis happens in real time in the tools/proxies that capture the information. Nearly all of the analysis was individual-request driven, meaning that we studied individual pages by themselves and not via a time-series view. When a performance optimization was made, we did compare before/after, but that was done in a fairly analog fashion, by hand.

What we decided to do was look at this problem of HAR analysis in a more scientific, programmatic and analytical way. We have a massive library of Selenium automation that exercises almost 80% of the UI functionality within our product. The coverage is quite remarkable considering the size of our product. We have been prototyping BrowserMob sitting between our SUTs (Systems Under Test) and our Selenium scripts, which we execute via the Fitnesse framework. With BrowserMob we can capture most if not all of the critical HAR information in JSON format. From there, we are moving the JSON object into a JSON-compliant file store.
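For anyone who hasn’t wired this up before, here is roughly what the capture side looks like with BrowserMob Proxy’s embedded Java API sitting in front of a Selenium-driven browser. Treat it as a sketch based on the BrowserMob Proxy 2.x and Selenium 2.x APIs as I understand them; the SUT URL, the transaction name and the output path are placeholders, not our actual harness.

import net.lightbody.bmp.BrowserMobProxy;
import net.lightbody.bmp.BrowserMobProxyServer;
import net.lightbody.bmp.client.ClientUtil;
import net.lightbody.bmp.core.har.Har;
import org.openqa.selenium.Proxy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;

import java.io.File;

public class HarCapture {
    public static void main(String[] args) throws Exception {
        // Start the embedded proxy and route the Selenium-driven browser through it.
        BrowserMobProxy proxy = new BrowserMobProxyServer();
        proxy.start(0);
        Proxy seleniumProxy = ClientUtil.createSeleniumProxy(proxy);

        DesiredCapabilities caps = new DesiredCapabilities();
        caps.setCapability(CapabilityType.PROXY, seleniumProxy);
        WebDriver driver = new FirefoxDriver(caps);

        // Name the HAR so it can be tied back to the page/transaction under test.
        proxy.newHar("course-home");
        driver.get("http://sut.example.edu/course/home"); // placeholder SUT URL

        // Export the captured timings as JSON and ship them off to the file store.
        Har har = proxy.getHar();
        har.writeTo(new File("/tmp/course-home.har"));

        driver.quit();
        proxy.stop();
    }
}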

We are considering a variety of stores for now, with the obvious candidates being Postgres, MongoDB and Redis. Our goal is to archive these JSON files and then be able to analyze their contents both vertically (within a single JSON file) and horizontally (across many JSON files) with relative ease. We want to be able to study and measure regression continuously, as well as alert on or fail builds when the criteria for object/page responsiveness are violated.
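Here is a rough sketch of the “vertical” pass plus the build-failing criteria, parsing the HAR with Jackson against the HAR 1.2 field names. The 2-second-per-request budget and the command-line file argument are placeholders; the real criteria would come out of our baselines.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;

public class HarRegressionCheck {

    // Hypothetical budget: flag any single request that took longer than this.
    static final double MAX_ENTRY_MILLIS = 2000.0;

    public static void main(String[] args) throws Exception {
        // args[0] is the path to a HAR file exported by the capture step.
        JsonNode log = new ObjectMapper().readTree(new File(args[0])).path("log");

        boolean violated = false;
        double pageTotal = 0.0;

        // "Vertical" pass: walk every request captured for this page.
        for (JsonNode entry : log.path("entries")) {
            double millis = entry.path("time").asDouble(); // total time for this request
            pageTotal += millis;
            if (millis > MAX_ENTRY_MILLIS) {
                System.err.printf("SLOW: %.0f ms  %s%n",
                        millis, entry.path("request").path("url").asText());
                violated = true;
            }
        }

        System.out.printf("entries=%d  cumulative=%.0f ms%n",
                log.path("entries").size(), pageTotal);

        // A non-zero exit code is enough for the CI job to fail the build.
        if (violated) System.exit(1);
    }
}

The “horizontal” piece would then be trending that cumulative number for the same page across builds, out of whatever store we settle on.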

It seems simple in nature, but in actuality it’s kind of a tough task. While there is the HTTP Archive (http://httparchive.org/), which does some high-level stats, I can’t for the life of me find anything that’s been built to programmatically evaluate HARs from both a micro and macro perspective, as well as over a time series (iterations and improvements…or worse, degradations). So right now that’s something we are looking to build. It could give us some very powerful data as we capture, analyze and act on it. Once we have something, look for us to put it on our GitHub site so others can fork or extend it.

There’s no reason for us to wait for this kind of data when it may be obtainable in a different format. Back in May, I stumbled upon a tool called Piwik. At the time I posted a quick Tweet to see if any of my friends in the Performance or DevOps community were familiar with it. Nobody ever responded or retweeted. The folks at Piwik didn’t even give me props, which is fine. I’m not in the Twitter game for props. Basically, I posted the tweet and then dropped it from memory.

Just yesterday, a few of my teammates were fortunate to get on a conference call with a client of ours named Terry Patterson. He’s a really knowledgeable Blackboard SME and even wrote a book about how to be a successful Blackboard System Administrator. I’ve read it and I have to say that I’m quite impressed with what he put together. I’m not really certain about the purpose of the call, but one of the outputs was that Terry shared his implementation of Piwik with my teammates.


Years ago we did a proof of concept getting our product to seamlessly integrate with Google Analytics. We even introduced a simple template to inject the JavaScript into our googleAnalyticsSnippet.vm file that we store under our shared content directory. To make use of Piwik, you leverage the same configuration we enabled for Google Analytics. The exact location can vary from system to system depending on the shared content, but basically it can be found under $BLACKBOARD_HOME/$SHARED_CONTENT/web_analytics/googleAnalyticsSnippet.vm. Simply modify this file with the Piwik JavaScript source in place of the Google Analytics snippet. It’s basically the same thing. I’ve pasted an example below which references my localhost as the server.

<!-- Piwik -->
<script type="text/javascript">
  var _paq = _paq || [];
  _paq.push(['trackPageView']);
  _paq.push(['enableLinkTracking']);
  (function() {
    var u=(("https:" == document.location.protocol) ? "https" : "http") + "://localhost//";
    _paq.push(['setTrackerUrl', u+'piwik.php']);
    _paq.push(['setSiteId', 1]);
    var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; g.type='text/javascript';
    g.defer=true; g.async=true; g.src=u+'piwik.js'; s.parentNode.insertBefore(g,s);
  })();
</script>
<noscript><p><img src="http://localhost/piwik.php?idsite=1" style="border:0" alt="" /></p></noscript>
<!-- End Piwik Code -->

You can also modify the velocity.properties file under $BLACKBOARD_HOME/config/internal to change the modificationCheckInterval from 12 hours down to, let’s say, 60 seconds. Note that the parameter is in seconds, so simply change the default value of 43200 seconds (12 hours) to 60. Restart your server instance and it will start sending information to the Piwik server within seconds.
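In other words, the change is a one-liner. The key is spelled here the way I describe it above; depending on your install it may carry a resource-loader prefix, so search the file for it rather than trusting my memory.

# $BLACKBOARD_HOME/config/internal/velocity.properties
# Default is 43200 seconds (12 hours); the value is in seconds.
# modificationCheckInterval = 43200
# Pick up edits to googleAnalyticsSnippet.vm within a minute:
modificationCheckInterval = 60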

This is great and very powerful information, and every customer should leverage it. The question going through my mind is how we can leverage this information within our test laboratories. Of course it would be amazing to have this deployed in most if not all of our customers’ environments. Having client data would be invaluable. Having a way to analyze our own test data can be just as meaningful.

First, I see us leveraging the performance data that comes from the generation time of a page (initiation of the request to full DOM rendering). This data maps pretty cleanly from pages in our product to their business/functional purpose. We could capture this data from our automation tests and perform regression analysis from build to build and over time.


This makes me think about our LoadRunner library. Over the years we have created naming conventions for transactions. Looking at how our product has evolved, pages can be named with relative ease from the markup, so we might as well consider capturing the transaction name from the page name. It’s like mapping Business Transactions from a dynaTrace, New Relic or AppDynamics perspective. Looking back on it, we created a naming convention because at the time none existed that was consistent and uniform, but for years the names of pages have been meaningful. We simply didn’t adapt.

Second, we could compare this data with the HAR data to validate complete page timings. The HAR data will give us page/object level timings, whereas Piwik gives us full page generation timings. Remember that web analytics isn’t really a goal from a testing perspective. All of my load will come from automated SUTs from the same data center. I will see browser variation, but that’s known since we control the automation engine. So my point is that we can see page load behavior from two perspectives.

It’s pretty clear from the randomness of this blog that I still have a lot of thinking to do about the data. At a minimum, I think we should play around with Piwik both internally and externally to see what value we get. It may even make sense for my other teams, like my DevOps group, to consider incorporating this into our Confluence and JIRA environments as well.

The Week After BbWorld

This might be the first BbWorld/DevCon that I’m bummed to be back from. The energy, excitement and awesomeness of DevCon was amazing. I’ve been to 10 DevCons and never felt this way at or after the conference. We had a solid audience of about 350 developers, administrators and, of course, those who straddle the lines of both with the moniker “DevOps”.

I didn’t share the same enthusiasm for the big conference. It has really nothing to do with the size or even the topics of BbWorld, but rather that our DevCon was simply two days of fun and networking. It was also the most I’ve ever worked at DevCon. The team of Mike McGarr, David Ashman, Chris Borales and me put months into planning, rehearsing, changing ideas, etc…Given the volume of changes we made up to even five minutes before the show, it’s amazing we didn’t fall on our faces.

I could spend this entire blog talking about how great DevCon was and how sad I am to be back here, but that would be a total waste of a blog. So rather, let me capture some of the key themes and topics from the conference. The list is long, but I’ll focus this blog on five key topics/takeaways:

  1. DevOps: Culture of shared ownership and accountability
  2. Tools do matter: Get the right tools in everyone’s hands
  3. Not enough folks are monitoring with passion
  4. Where are the open source B2s? Has anyone heard of Github?
  5. The Developer VM is huge

DevOps: Culture of Shared Ownership and Accountability

The term DevOps is not new to me. I’ve been hearing it for years as a member of the Velocity community. I am pretty sure that half of the folks in the audience had never heard the term DevOps before. The other half may have heard it, but never put too much context around the term. Remember that half the audience at DevCon tends to be developers and half are administrators. A lot of folks in the audience wore both hats, but never put too much thought into a new classification of role.

What were we trying to accomplish by spreading more gospel about DevOps? I’ll tell my perspective and then look for Mike to maybe post his message on his blog, “Early and Often”. First, I wanted to blur the lines between developer and administrator. It’s better to go to battle each day as a team of one rather than as two individuals who only think about their individual roles and responsibilities. I want my developers to think about what their code is doing at 2am. I want my administrators to understand the code so they can measure, monitor and log the crap out of it to give good feedback to the developers.

Do I really care about the blame game that happens when disasters strike? Nope…it’s not about blaming. It’s about working together to solve the problem as fast as possible, learning from the mistakes (because there are always human mistakes) and then making the developer/admin alliance even stronger so that disasters simply become small nuisances.

I think a whole bunch of us talked about so many of the DevOps mantras that I can easily say the DevOps message was more than just a token statement about taking a developer in a white shirt and an admin in a black shirt, sticking them in a room for a hug-a-thon and watching them come out in two grey shirts. Mike covered C.A.M.S. (Culture, Automation, Monitoring and Sharing). Gene Kim, author of The Phoenix Project, covered the “Three Ways,” and I brought up the rear with turning on the lights via efficient logging tools.

As I mentioned, we only gave the audience a taste. Now we need to take the message further with more focused sessions that mix culture, infrastructure as code, monitoring and logging, collaborative development, sharing of data, etc… 

There was a great presentation worth checking out called “There’s No Talent Shortage” from a recent DevOps Days in Australia. 

Tools Do Matter: Get the Right Tools in Everyone’s Hands

So a couple of weeks ago I went to my fifth Velocity conference. It was good…not the best, though. Everyone says the best tracks are the hallway tracks, which to some degree I believe as well. I like to go to sessions. I expect to be entertained and to learn something from my peers and our industry’s experts. This year I didn’t learn as much as in years past. Well, maybe this year I learned that a lot of tools meet their fate of non-existence in the Performance community. The tools everyone was talking about were WebPageTest and mPulse (from SOASTA). The tools no one was talking about were YSlow (to a lesser degree), PageSpeed, Fiddler, TrueSight or any tool that attempted to “fix” the so-called Web Performance rules.

On the infrastructure side, we continue to see better tooling in the spaces of both monitoring and logging. Etsy’s presentation about 250k metrics still makes me shake my head. I think they are making matters way too complicated. It’s not like they are launching the space shuttle, but rather are providing store fronts for artists and craftsmen. Heck, maybe I want this level of granular monitoring accessible at any time for forensic recall…who knows. 

I personally did a session on Logstash, which could have gone a lot better. Still, I got my key takeaway out to the audience. I wanted them to take accountability for studying, measuring and automating their logs. I wanted our audience to know the tools that are out there to aggregate and visualize their logs. If I’m lucky we can get the community to contribute to a Logstash initiative. We have to be the lightning in the bottle that sparks the initiative. We need to post our work to the Blackboard GitHub page and get the community to either contribute or fork. Either way we are accomplishing our goals around log centralization, mining and awareness.

Not Enough Folks are Monitoring with Passion

We talk a lot about monitoring at DevCon every year. I don’t think the audience is doing more than the minimum. I don’t think we as presenters are making monitoring tangible enough. It’s one thing to reference a tool, but it’s another thing to put the tool into action. We have done some of that with Zabbix, but not enough. We need the community to start the conference call. It’s almost like we need a monitoring summit to talk about what we need to monitor, how we monitor it, at what frequency we monitor, how we visualize, what we store, etc…etc…

How cool would it be to do a monitoring summit inside of DevCon? I kind of just thought of that as I wrote it, by the way. I’ll take credit for that one, but really it’s this blog entry that stirred up the creative juices. I’ll write another blog about the “Monitoring Summit” in CAPS, meaning it could and should be a real thing or idea.

I personally love monitoring. Heck, McGarr has a laptop sticker that says “I Love Monitoring,” though I just believe he likes the idea of monitoring and what he’s really in love with is automating monitoring ;) I’m the data geek on the team and Mike is the automation geek. Expect a rebuttal blog from McGarr about how he loves monitoring and how I just love myself.

This year there was a little bit of a drop-off on monitoring. There was a session about the Admin Console and then a client presentation by Nick McClure about System Administrator basics which touched upon monitoring. We need more sessions, if not a full-fledged track on monitoring (well, it can continue to live within the Performance and Security track), or at a minimum we should have had 3 or 4 in-depth sessions on monitoring approaches, tools and solutions.

Where are the Open Source B2s? Has Anyone Heard of Github?

Every year I leave DevCon asking why more developers in our community don’t put their stuff out on GitHub for others to contribute to or fork. Why aren’t we collaboratively building more B2s? This year I’m going to ask it differently…why aren’t we doing that first ourselves? I bet if we started putting B2s out there and of course marketed the heck out of them, several developers would download, fork or contribute to the initiative. Isn’t that what we want? I know that’s something I want. I just need to convince others that it’s OK.

The Developer VM is Huge

I knew before I even took off for Vegas that the Developer VM our DevOps team built was going to be a huge hit. Not because it’s running on CentOS…not because it has Postgres…It’s mainly because we provided a fast, efficient and simple way to get developing. Now that I think about it, a lot of Bb’ers outside of development may want to make use of this as well. It’s a game changer in my mind. I had demos up and running within minutes.

Dealing with Outages…Three Days in September

We’re about one week away from our annual user conference. Well, that is, for anyone reading this blog entry between today and Wednesday, July 10, 2013. That’s when I’m giving this presentation to what I hope is a packed room of awesome, interested professionals whose day-to-day life involves managing the application I build. The goal of my presentation is to share a story that started about two weeks after I got back from last year’s user conference. I’m not trying to mislead my users with the title. The story is definitely about those three terrible days we lived through in September, but it started long before those ill-fated 72 hours. In many ways it continues on today.


Click Here for My Slides

The Setup

Everybody loves a promotion, right? Well, that’s what happened last year right after I came back from our user conference. My boss came into my office, closed the door and gave me the news that I was going to become a Vice President in our development organization. It was definitely something I had wanted for years, but I didn’t think it would come in my current capacity of running our Performance and Security Engineering teams and three of our six development teams. I figured it would come if I left the company, someone retired or I switched divisions. It did come, and the premise of the promotion was pretty cool. I would be asked to build a more scalable practice for both Performance and Security across the entire company rather than just within my product division. It would come with a couple of catches. The first catch was that I had to give up my three development teams. The second catch was that I would have to take on a new third team, our Release Engineering and Infrastructure team.

It was kind of like a catch within a catch. I guess they call that a catch-22, because while I would inherit this team, I wouldn’t get the Director who ran it. He was leaving to be part of a new product team, but I digress, as that’s a story for some other time. So I accepted the promotion, but of course under my terms. I like to think that I actually influenced the terms, but the truth is what I asked for didn’t really make anyone flinch. Basically I said I would accept the change, but it meant I would get all of our DBAs and I would also get to hire a new Director for the team I had just inherited. Oh, and by the way, there was another stipulation. I said I would take on this team if we could rebrand it DevOps. I’m sure all the “Puritan DevOps” folks who hate others using the term DevOps in their title or team have stopped reading my blog.

The Introductions

It took me a week to decide if I wanted the promotion. Well…it really took me negative 6 seconds, but I did want to do my due diligence to understand the problems facing this team and the challenges that I might face trying to find a strong, capable leader to run the team. If you are wondering, I did eventually find someone to run the group. He’s a great leader and we have become quite good friends over the past year. Unfortunately he didn’t come into the team until after our incident. One might argue that it was the incident that gave the business justification to make the hire last November.


Ok…so let me get back to the story.


I want to say that I ended up taking the team officially the first week of August. I don’t particularly remember all of the things I said to each member of the team. There is one line I told each team member, including the exiting Director. Basically, I said it was going to take me a few weeks to assess the team, build a road map, sort out our budget and initiate our recruiting process. I continued with a request of each person, which was to continue with their current work and, most importantly, to make sure that if something was going to happen, “I wasn’t going to lose any sleep at night.” Basically, I gave each team member the opportunity to let me know if they felt any of our critical applications and/or business processes were fragile and/or at risk of going offline outside of our business hours. If anyone felt concerned, they could speak now and we could re-prioritize our work to address those potential problems. The funny thing is that everyone replied with “Everything is Safe and Sound…Nothing to Look at Here…Just Move Along.”

I took their comments at face value…but deep down inside I knew my proclamation was empty. Of course I would lose sleep if something happened. 

The First Alarms

The outage happened sometime around 2am EST on a Sunday night. Of course nobody knew it happened at the time, because the little alert controls that were in place were masked behind email rules. Nearly every member of the team had set up email rules to keep alerts from Nagios and our cron jobs out of their Inbox and move them to another folder. Well…everyone except me. That was one of our topics of discussion in our retrospective after the outage. I wasn’t awake at 2am, so I didn’t see the nearly 800 emails I received from those two systems until about 6am when I woke up…nor did anyone else for that matter.

I knew something was definitely up when I woke up, but I didn’t know the full context of the issue. Sprinkled within the 800 emails were several emails from team members overseas in Australia, China, India, Amsterdam and the UK. We didn’t know what the outage was because nearly every system was producing an alert. The emails we received from our end users complained about access to our data center, logging into our Atlassian products and syncing with our source code system (Perforce). To the naked eye, the issue looked like a clear-cut network problem. We would have someone look at one of the many switches and our core router. We had seen network problems before, quite possibly fairly recently before the outage.

One of my engineers on our DevOps team was responsible for our build process and workflow, as well as our source code system. He tended to keep really early hours, so by 6:30am he was already in the office and looking at the issues. Every build had failed because of timeouts between our build systems and our source code system. It turns out that our backups for our source code system had also failed, but not because of a timeout. We actually didn’t know why they failed at the time, as the error code was corrupted. All we knew at the time was that they had failed. This engineer went to open a JIRA ticket for tracking purposes, which he couldn’t do because JIRA wouldn’t load. The same engineer had ssh’d to our build system, where it took 5 minutes just to log in to a Linux shell. Once he got on the system it was taking forever to change directories or look up a process.

You take these pieces of data (i.e., build timeouts, failed backups, slow access to JIRA and sluggish Unix commands) and the likely hypothesis was that we were having a network issue. We weren’t, which is obvious now, but wasn’t obvious then. We didn’t look at the right data. In fact, we hand selected all the wrong data. We made it worse with our next sequence of events.

The Mistake

We actually made more than one mistake, but this first mistake was a doozy. When our teammate couldn’t do anything on our source code system (note: it’s a Linux system), he decided that maybe it would be a good idea to issue a reboot from the command line. Looking back, I’m not quite sure why he felt a reboot would address a network problem, which was his first diagnosis. In the retrospective, he tried to correlate the latency seen on our source code system with a daemon process that had run wild. Rather than kill the process, he went for the full shebang with a reboot.

It gets better…not that I mean to laugh. I can laugh now. Back then I had more tears in my eyes than laughs in my belly. So the system went to reboot. Well, it sort of did. It got stuck in the boot sequence. The machine had been built as bare metal. It was one of just a few systems in our data center that wasn’t virtualized. It also had 2 TB of storage, but that storage was all local. Keep track of this information, because all of these details are important as the story comes together.

So when the system wouldn’t come back online and couldn’t be accessed via SSH or even over KVM, the engineer emailed one of our operations guys to see if he could swipe the boot switch and hard boot the box. YIKES!!! Our Ops Engineer complied and that’s when the blinking lights started to go crazy.

The Blinking Lights

The server didn’t come online after the hard boot. You could say this was a blessing at first, because we were so far off the trail that no scooby snack was going to get us back on it before the damage was done. It was our first clue though. The Ops team attempted another reboot, failed and then decided it was best to run a diagnostic test. That test failed with flying colors. We had a disk failure and would require a replacement. We had a RAID-5 configuration and on-premises support from Dell. We should have easily been able to bring the disk offline, start up without issue and then have Dell replace the drive. That’s exactly the path we went down.


While the system was offline, all of our other problems went away. We couldn’t explain why. Our problems with Atlassian should have been unrelated to our source code system. Our source code repository (which, as I’ll share now, is Perforce) had no integration with Crowd or AD. We managed accounts locally within Perforce. Yet when our Perforce system was down, all of our other problems seemed to disappear. We thought we were in the clear.

Houston, We Have a Problem

So after the diagnostic test, we were able to bring the server back online. At first there appeared to be no issues. From a console, the shell was lightning quick. Within about 2 minutes the system came to a crawl. Then a whole bunch of other systems in our data center started to crawl. By this point it was about 8am and I had come into the office. I set up shop in my DevOps room and was quietly watching, while at the same time providing a little bit of direction and support.


Every problem that could go wrong did go wrong. We were getting pummeled by alert emails and warning messages. The few folks that were in the office all walked into the DevOps office asking whether we knew what was wrong and whether we had any idea when the problem would be resolved. We had no idea. We didn’t even know what the issue was, other than that we had a bad disk on a server which at this point we couldn’t access because the load average had shot up to some number greater than 70.


What we believed was an incident isolated to our Perforce system simply didn’t add up. This system wasn’t virtual, not that this piece of data really mattered. It had no attached or networked storage. There was no integration with Crowd or AD. So what was up with this system?


Our instincts were telling us that the issue wasn’t the network. We were able to ssh to other systems with relative ease. We narrowed down pretty quickly why the other systems were having issues. It turns out that any system using Crowd for authentication was having problems. If Crowd was bypassed, basically by navigating to any application that allowed anonymous access, we had no issues. Of course the first thing we did was ssh to the Crowd server and run a few commands like top and netstat. The Crowd server was doing nothing from a resource perspective. From a netstat perspective, it had a ton of connections to a Postgres database system. There were a handful of connections to our JIRA server, as well as our Fisheye server and our Crucible server.

So once again we had a problem, but we didn’t know what the problem really was. 

Knowing is Half the Battle

When you are in a crisis, it is really hard to think rationally and have perspective. I basically gave the guys on my team a free pass from thinking rationally at the time. They didn’t see it coming. They never prepared for the super bowl of outages. Nearly 90% of our enterprise applications used Crowd, which wasn’t functioning correctly. Our source code was basically still timing out, which meant developers couldn’t sync. They could work locally. They couldn’t get to their JIRA issues or to Confluence. You could say it was a confluence of disasters all rolled into one. 

It was about 9:30am when we finally started narrowing down the issues. I believe at about 9:15, I strongly urged one of the guys on my team to open up a case with Perforce to see if they could aid and assist us with the time-out issue. That was throwing us for a loop given that we had diagnosed the only issue with Perforce as a storage issue. Between 6:30am and about 9:30am, the Perforce process must have been restarted a dozen times. I remember asking the engineer if I could get on the Perforce server as well, which is when we had our first discovery. 

Our Perforce server was aliased as Carbon. I noticed when running top that there were Java processes for Fisheye and Crucible. Whoa…what a discovery. I logged into our Fisheye server, which was aliased as Fisheye. I did the same with Crucible. I ran a simple who -u and then it dawned on me: we didn’t have 3 servers running these critical processes. Rather, all 3 applications were running on the same server, and they just happened to be aliased via DNS. Holy moly is all I could say. By then the damage was really done.

Before I explain what I saw, I had a parallel discovery with Fisheye, Crucible and Crowd. When I noticed the Crowd server was doing nothing but waiting on its database server, I immediately logged into the database server. My background with databases had been limited to Oracle, SQL Server, DB2 and MySQL. Postgres was as foreign to me as the French Foreign Legion. The first thing I noticed about this Postgres system was that every single Atlassian system had representation on it. We must have had 29 different databases, including development and test systems as well as non-essential systems, all residing on the production Postgres instance. I didn’t think it could get any worse, but it did.

The Damage Was Already Done

We ended up shutting down the Fisheye and Crucible JVMs on the Perforce system. Within seconds, nearly all of our Crowd problems had disappeared. If you don’t know Fisheye or Crucible, I suggest you brush up by looking at Atlassian’s site rather than getting the skinny from me. In 10 seconds or less: Fisheye gives us insight into DIFFs from within JIRA, and Crucible basically indexes the code. So at a minimum we knew at least one of those applications was accessing Perforce.

It was about 11am by this point, if I recall correctly. We shut down Perforce and disabled access so no users could log in. Crucible and Fisheye were offline. Basically, JIRA, Confluence and all of our other applications using Crowd were working. We tried several times to bring up Perforce, but we continued to get timeouts when we tried to sync. We finally got Perforce on the phone around 1 or 1:30. That was a long 2 to 2.5 hours, if I recall, of “trusting” my Perforce engineer to sort out the problem and get Perforce up. I finally grew impatient and said we had to get Perforce on the phone with a Sev-1 issue. It turns out that when he had called earlier in the morning, he didn’t say the production system was down, so Perforce didn’t assign an engineer to the phone. Rather, they sent us a few emails asking for some logs and some other information. We obliged, but clearly our team didn’t move fast enough.

Within minutes of getting our support engineer on the phone, he identified that we couldn’t start Perforce because our journal file was corrupt. No worries…we could just restore our last checkpoint. Restoring the checkpoint was pretty simple. All you needed was a successful backup with the correct checksum after deflating the file. We spent about 10 minutes moving the checkpoint from our filer back to the Perforce server. We then spent another 20 minutes extracting the tar file. At about the 19th minute, the tar extraction failed. That sucked…the file was corrupt. We tried a few other files from previous days. It turns out that our checkpoint process had failed at 2am, hence a bunch of the alarms.

Our Perforce support engineer wasn’t alarmed though. He felt he could still restore and piece together the missing data if we could find a clean checkpoint. While we were un-tarring files, he was digging into our corrupt journal file. The file was massive. It was hundreds of MBs. It had filled up with nearly a million lines of Crucible debugging statements.

Our Dell engineer who was working on the disk replacement with our Operations team came to the conclusion that Fisheye’s log file must have collided with the Perforce journal file. Basically, Fisheye was dumping thousands of debug statements per minute into this journal. It ran for over an hour, which certainly spelled disaster. 

Are You Keeping Track of All of Our Issues?

We teed this disaster up for a giant home run in the history of disasters. The system was architected for disaster (no virtualization, local storage, etc.). Nobody was reading alerts. The team rebooted a Unix box. When the reboot failed, the team had to hard-boot the box. Our most important software system was also hosting two other applications, but the majority of the team had no idea because legacy teammates had set it up. Backups had failed and nobody really knew, nor did anybody care. Oh yeah, and every database under the sun, whether production, dev or test, resided on a single point of awesome failure.

Keeping Your Users Informed

Somewhere during the day we had a few moments of clarity to keep our users informed. We must have sent out emails every 30 minutes updating our users with a status. I remember at about 7pm we projected that our restore might be ready by 11pm and that Perforce service would be restored about an hour later. That hour came and went and we had no idea if we would be ready. We asked our users for an extension of 90 more minutes. By 12:30, we sent out a note saying we would update status around 6am the next morning.

None of us went to bed at this point. Perforce stayed on the phone with us the entire time. We had started working the issue at noon our time on Monday. It was now 2am on Tuesday. We had rotated support engineers for the third time (East Coast, West Coast and Australia).

Keeping Your Executives Informed

I kept an open email thread and a text conversation with a few key executives, letting them know the really gruesome details of what we were dealing with in this outage. I assured them we wouldn’t go to sleep until we resolved the problems. I kept it honest and said we had a really good chance of data loss. The dominoes had lined up perfectly prior to this outage. My bet was we would lose at least Monday’s limited data and potentially any check-ins over the weekend. We had forecasted somewhere between 8 and 10 changelists. We thought we could get that data back, since we were pretty confident most of the developers had their changelists locally in their cache.

The key was to find a back-up from the weekend that worked. 

Keeping Your Team Calm

Yeah…so we couldn’t find a good checkpoint from the weekend, which meant we had to either take Friday’s or Thursday’s checkpoint. That would really suck because we would also have to be dependent on our developers to resubmit their CLs, which meant we had true data loss. For an operations engineer to hear data loss is like stabbing them in the heart. It means their DR initiative never worked, which makes executives ask the question, “Why are we paying you?” 

As a manager it was my job to keep all of the players calm and collected. Transparency was key. It was hard after 24 hours with no sleep. It was 6am by this point. While I had gone home around 2am, we still had 3 or 4 guys in the office. They were working the problem, but by now support was with London, their 4th support engineer. Our journal file that had been corrupted was finally clean after about 15 hours of manual effort. We found a checkpoint to restore, and with the clean journal file we were pretty sure we would lose maybe some small bits of metadata about the CLs from the past few days. We wouldn’t lose source or versioning.

One thing we did do, at about 2pm on Monday, was start using a Skype channel. This was required early on as we had 2 guys at our data center and the rest of the team in DC. We needed constant communication. It was that Skype channel that kept the mood upbeat and at times full of humor. Our DevOps team has a lot of personality. When you are up and working for what would be 72 hours, that’s a big deal.

Day Two…Making Decisions

We had to decide a lot on the fly. While the bad disk had been replaced, I wasn’t about to go back into production with the hardware we had. It wasn’t a question about the hardware, but rather concerns over bare-metal configurations with local storage. We made the decision to repurpose some hardware inventory, which we built out the morning of day 2. We virtualized the server, though we dedicated the entire host to that single VM. This at least gave us the ability to VMotion the VM to another server if we had physical issues. Second, we went against the grain and deployed on our enterprise NetApp system over NFS. I remember Perforce and some members of my team were against this move. My operations guys were in my camp on both the VM and NFS fronts.

I did some research and found at least two instances of customers who had a) virtualized and b) run over NFS. I reached out via email to a few bloggers who had written about their escapades in setting up this alternative configuration. Eventually, by late evening on day two, I received a response from one of the bloggers, who encouraged our configuration. It didn’t matter at that point, as we had already made the decision to move.

We did a lot of game-planning on day two. Every whiteboard in our shared office was filled with notes. We had to send a few of our guys home to get some sleep as well. We got smarter and went to shifts. Each shift had a clear set of goals. If I recall, no shift was really longer than 6 hours.

Day Three…The Home Stretch

At about 2am on Wednesday morning, we had successfully restored the original production system. We had also built out our new system. We ended up spending about 12 hours, from about 2pm on Tuesday to 2am on Wednesday, moving files, un-tarring, starting up, etc…The system was finally up and running. We still had time-out issues on our new system, which meant we weren’t in the clear.

We made the decision to upgrade before we released back to our users. So not only did we recover, but we were also going to upgrade. Talk about taking risks…but at this point our most critical system had been down for two days and we were trying to get back online. 

We decided that we wouldn’t do an upgrade on production unless we could confirm a successful backup and restore. By 5am we had done both. We planned to do the upgrade at 7am right after Perforce’s next shift change. We wanted them on the phone with us in case of an issue. 

We upgraded the system with a full checkpoint around 10am. Eventually we handed the system back to our users around noon. Sending out that email to the team was a great feeling. At least the system was back and operational.

What We Lost

Credibility…that’s what we lost. We ended up getting 99.9% of the data back, but our community of users lost nearly every shred of confidence in our team. We were basically down for 3 days. Our developers did their best to be productive, but how productive could they really be working locally?

Our users lost a lot of faith in Perforce as well. It didn’t matter that in our previous eight years of using Perforce, we never had an outage. The thing is, they were our savior and got us back operational. The problems that led up to our disaster were of our own making. We were just lucky that in the previous eight years we never had an outage. Another key thing to note is that this outage came at a time when a lot of our developers were working on a project using GitHub. Our Perforce users were ready to revolt; they wanted to be off of Perforce ASAP and wanted us to migrate to GitHub.

We Made Lots of Mistakes

I’m specifically referring to the first 36 hours in which nearly everyone on the team worked around the clock without sleep. We made mistakes before we were tired though. When the outage started on Monday morning, we made mistakes by using the wrong litmus tests. We didn’t understand what our issues were. We didn’t understand how our systems were architected. Probably the biggest mistake was that we didn’t have a true plan to recover. We literally had to plan on the fly. Nobody was ready. It showed during those first 36 hours. 

I remember at one point, after we made the decision to build out a second server, one of our teammates who had been up for 32 straight hours was working on recovering the backup. He was supposed to do a restoration on the new server. Not realizing which server he was on, he initiated a recovery on the production system that had failed us the day before. That mistake set us back 4 hours. Just 6 hours earlier, another team member had fat-fingered a keystroke, which ended up deleting a critical file we had spent 90 minutes copying over the network. That of course set us back an additional 90 minutes.

The True Recovery

Getting the system back online was crucial for our users. It took us nearly three days to get our source code system back online. We didn’t end up bringing Fisheye or Crucible back online for nearly 3 weeks after the outage. We experienced a lot of issues getting those systems recovered. The difference in support experience between Atlassian and Perforce was eye-opening. I truly enjoy the Atlassian products, and I think they deliver very value-driven products, but I don’t think they hold a candle to Perforce from a customer experience perspective. Note that when I add software or renew my licenses with Atlassian, I simply make the purchase via a credit card. I get an email confirmation and that’s it. I don’t have a single POC from sales or support. I’m pretty convinced they have neither.


While my users were griping about the Perforce outage and wanting GitHub, around the same time there were intermittent GitHub outages. Call it what you want, but from a Perforce perspective, it was fortunate timing to see GitHub having issues.

Where We Are Now

We have been outage-free for nearly 10 months. We put a lot of controls, investment, training and planning in place. We didn’t spend any new dollars in 2012 or even 2013 on hardware or software. We did make a small investment in training, specifically administrative training for Perforce. If you recall, during the outage we had to have Perforce on the phone for nearly the entire 72-hour ordeal. The fact that this vendor stayed with us around the globe for three days is truly a remarkable story of a vendor and a client working together to achieve success.

It turns out that we did our training within the first two weeks of our Director of DevOps starting. That was a critical period for us in many regards. First, we set a goal of designing and deploying a highly resilient system that could scale and support our growth and usage. Mission accomplished: we now have two additional failover points. Going from one to three server instances wasn’t too challenging; we simply had to make the investment. Second, we worked with our vendor to implement their tried-and-true best practices for generating backups and checkpoints. Every day we take successful checkpoints, and that’s the only email we get Sunday through Friday. Then on Saturdays we do a full backup, with zero downtime.
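To give a feel for what that daily checkpoint job looks like, here is a rough sketch. The paths, the distribution list and the mail setup are all hypothetical, and I’m assuming a standard p4d-style live checkpoint (p4d -jc) that writes a checkpoint and rotates the journal without taking the service offline. Treat it as an illustration of the shape of the automation, not our actual script.

    #!/usr/bin/env python
    """Sketch of a daily Perforce checkpoint job (hypothetical paths and addresses)."""
    import smtplib
    import subprocess
    from email.mime.text import MIMEText

    P4ROOT = "/p4/root"             # hypothetical server root
    NOTIFY = "devops@example.com"   # hypothetical distribution list

    def take_checkpoint():
        # -jc checkpoints the database and rotates the journal while the server stays up
        result = subprocess.run(
            ["p4d", "-r", P4ROOT, "-jc"],
            capture_output=True, text=True
        )
        return result.returncode == 0, result.stdout + result.stderr

    def notify(ok, details):
        # The one email the team actually reads every day
        msg = MIMEText(details)
        msg["Subject"] = "Perforce checkpoint %s" % ("OK" if ok else "FAILED")
        msg["From"] = NOTIFY
        msg["To"] = NOTIFY
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        ok, details = take_checkpoint()
        notify(ok, details)

The point isn’t the script itself; it’s that the checkpoint runs every day, on a schedule, and reports in whether it worked or not.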

The best way to put it is that we had been taking bad backups. We would take our users offline every day. Because it happened at 2am, we didn’t think twice about it. We didn’t have a true global team back when Perforce was first implemented. We had developers who were working at 2am, but basically they knew that the system was offline from about 2am to 4am. We never asked our international users how they felt about being offline for 2 hours during the peak of their day. Needless to say, these changes positively affected their productivity.

We also implemented caching servers in our distributed offices to improve their quality of experience with the application. Out of sight…out of mind was definitely a theme with the DevOps team prior to me joining the group. Don’t get me wrong, they serviced their end users as best they could, but a remote user had a greater likelihood of a poor experience versus, let’s say, a local team member out of DC.

I would say our two greatest accomplishments are what’s keeping this product running today. First, our operations guys have pretty much moved into the realm of application management. During the outage, the key folks supporting Perforce were developers. Our operations guys, who are more like system administrators, had limited experience running Perforce. They weren’t even power users, just passive users. Today, they pretty much run the Perforce production environment. They handle nearly every aspect, from account management to storage provisioning to backup and recovery. Second, we run through our recovery process every single day in an automated fashion. We literally practice restores in seconds versus hours or days.


Of course there are other aspects that we follow such as maintaining active versions by upgrading when the time is right.

The One Metric that Matters

There is nothing worse than having your end users tell you that you have a problem. We had alerts and alarms. We might have had a dozen or so rules in Nagios to tell us if we had an issue with Perforce. None of them really matter if they are white noise, completely ignored by the team.

When I first took over DevOps and was added to their distribution lists, my email would fill up with two to three hundred automated messages per day. It was nearly impossible to decipher what was real and what was noise. I completely understand why people created Outlook rules to manage these alerts.

Today we keep track of three emails. The first is a daily email summary of the checkpoint and the recovery. It happens around 10am every morning, which is on purpose: since the process runs while the system is operational and accessible, we wanted this most critical operation to happen during business hours so we could respond. The second is the Saturday full backup and restoration, which is our automated “Super Bowl” of backup and recovery operations. The one email we don’t want to get is a single Nagios alert that the system is down. We keep track of small things like storage growth, but we don’t alert the whole team.

We really only care about 1 email a day: our checkpoint summary.
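For the “system is down” case, the check itself can be dead simple. Here is a sketch of a Nagios-style availability check; the host and port are hypothetical, and all I’m assuming is the standard plugin convention (exit 0 for OK, exit 2 for CRITICAL) plus a plain p4 info call against the server.

    #!/usr/bin/env python
    """Sketch of a Nagios-style availability check for Perforce (hypothetical host/port)."""
    import subprocess
    import sys

    P4PORT = "perforce.example.com:1666"  # hypothetical

    def main():
        try:
            # If the server answers `p4 info`, we call it up
            subprocess.run(
                ["p4", "-p", P4PORT, "info"],
                check=True, capture_output=True, timeout=15
            )
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
            print("CRITICAL - Perforce is not answering on %s" % P4PORT)
            sys.exit(2)
        print("OK - Perforce is up on %s" % P4PORT)
        sys.exit(0)

    if __name__ == "__main__":
        main()

One rule like this, wired to an alert the team actually respects, beats a dozen rules that everyone filters into an Outlook folder.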

A Plan Means Nothing If You Don’t Practice It 

One of the first deliverables the team completed when Perforce came on-site was a DR plan. We didn’t just identify hardware and put together Visio diagrams of the architecture, though from an operations perspective that’s a critical piece of data. We have documented roles and responsibilities. We understand the sequence of events. Most importantly, the whole process is scripted. It is automated and, as I mentioned before, we run it every day. We literally practice our DR activities every day between 9:50am and 10am, and the system stays 100% operational while we do it.
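The shape of that daily drill is roughly the sketch below. The directories are hypothetical and the real drill does a lot more validation; I’m assuming a conventional setup where checkpoints are plain numbered files that p4d -jr can replay into a scratch server root, with a low-level database validation afterwards.

    #!/usr/bin/env python
    """Sketch of an automated daily restore drill (hypothetical paths)."""
    import glob
    import os
    import re
    import subprocess

    CHECKPOINT_DIR = "/p4/checkpoints"   # hypothetical
    SCRATCH_ROOT = "/p4/restore-drill"   # hypothetical scratch server root

    def latest_checkpoint():
        # Assumes plain, uncompressed files named checkpoint.<N>
        pattern = re.compile(r"checkpoint\.(\d+)$")
        best, best_num = None, -1
        for path in glob.glob(os.path.join(CHECKPOINT_DIR, "checkpoint.*")):
            match = pattern.search(os.path.basename(path))
            if match and int(match.group(1)) > best_num:
                best, best_num = path, int(match.group(1))
        if best is None:
            raise RuntimeError("no checkpoint files found in %s" % CHECKPOINT_DIR)
        return best

    def drill():
        os.makedirs(SCRATCH_ROOT, exist_ok=True)
        ckpt = latest_checkpoint()
        # -jr replays the checkpoint into the scratch root
        subprocess.run(["p4d", "-r", SCRATCH_ROOT, "-jr", ckpt], check=True)
        # -xv runs a low-level validation of the restored database files
        subprocess.run(["p4d", "-r", SCRATCH_ROOT, "-xv"], check=True)
        print("Restore drill passed using %s" % ckpt)

    if __name__ == "__main__":
        drill()

Because the drill restores into a scratch root, production never notices, which is how we can practice DR every business day without a maintenance window.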

We have also had nearly a dozen live drills as part of lab maintenance and patch work to our infrastructure. We have had a few storage maintenance windows requiring us to bring the systems offline. We have also had to assist with a full upgrade of a Perforce instance for our services organization.

We don’t view Perforce as a fragile system anymore. Enough people on the team understand the ins and outs of the system. They understand how to take checkpoints. They understand how to recover a system. They understand how to de-frag the system as well. It’s not just one person, but 4 people on the team. In fact, the team has had a small bit of turnover. We lost 2 of the software engineers who had previously been the primary managers of the system. Now our operations team manages the system for the most part. Our faith in the reliability of the service is at an all-time high.

Taking Two Steps Back to Jump Five Steps Forward

We are in a better place today with Perforce than we have ever been before. Those of us who lived through the outage see the story as a badge of honor. Most operations guys want to have two or three of these kinds of war stories as a means of beefing up their own credibility. Today we are prepared, whereas last year we simply were not. We are not as active in the Perforce community as I would like, but we are more stable, reliable and performant.

Essentially, our series of mistakes was the result of human error. The mistakes started before the outage, continued in its early stages and carried into the late evening of day one. Human error is often to blame. When you think about the alarms we had set up, even though we were using technology to alert us about a problem, it was human error to ignore them. It was also human error to set up so many alarms in the first place.

Even Twitter Does Performance CI

I’m not shocked that Twitter does Continuous Integration. You would expect a big technology player like Twitter to be savvy enough to get CI in place not only for functional development, but also for ensuring responsiveness. It took them a while…and it took outside influence, meaning it took a newbie coming on-site to realize there was a huge gap in their CI pipeline. Well, that’s the story that Marcel Duran (@marcelduran) sold today during his very informative talk “Avoiding Performance Regression at Twitter” at the Velocity Conference 2013 in Santa Clara, CA.

Duran’s presentation was very good. I walked away with a lot of validation of my own ideas for our own software CI process, as well as learned a ton of new ideas from Duran himself. Let’s see if I can summarize a few key takeaways in awesome bullet format below, then expand on the ideas with a little bit of creative pen magic.

  • YSlow has now been open-sourced. They have a command line tool as well (npm install yslow -g). You can run YSlow against a HAR file, which is pretty cool.
  • Phantom.js is a really cool little tool that integrates quite well with YSlow.
  • Within the last 3 years, Yahoo shut down its Exceptional Performance team and just recently brought it back. Wowzers…
  • All that work my team did with WebPageTest could be revived…Yeah!!!
  • There’s a new cool performance site called SpeedCurve

I think I knew YSlow was open-sourced a year ago, but never really thought twice about it. When YSlow came out a few years back, I thought it was cool, but I felt it was a little too all-encompassing with its rules of web performance. Well, maybe not too all-encompassing, but understand that my application doesn’t work with a CDN, so I don’t want to be penalized because I don’t have one. Note that in later versions I can filter rules like that out, because I guess other folks like me complained their scores were bad because they didn’t use a CDN.


The command line tool looks very appealing to me. A caveat is that it a) runs on Node.js and b) requires a HAR file already generated as its source. What I found really cool about this is that I would love to go the route of generating HAR files from our Selenium tests. It’s not too difficult, and it would be great for archiving (HAR stands for HTTP Archive, after all).
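As a rough illustration of how that pipeline might hang together, here is a small sketch that feeds a HAR file to the yslow CLI and pulls out the overall score. The HAR path is hypothetical, and I’m assuming the npm-installed yslow accepts --info grade --format json against a HAR file and reports the overall score under the “o” key in its JSON output, which is how the versions I’ve seen behave.

    #!/usr/bin/env python
    """Sketch: run the YSlow CLI against a HAR file and pull out the overall score."""
    import json
    import subprocess

    HAR_FILE = "results/login-page.har"  # hypothetical HAR exported by a Selenium run

    def yslow_grade(har_path):
        # Ask yslow for the grade report in JSON, reading the HAR from disk
        out = subprocess.run(
            ["yslow", "--info", "grade", "--format", "json", har_path],
            check=True, capture_output=True, text=True
        ).stdout
        report = json.loads(out)
        return report.get("o")  # overall score in the JSON reports I have seen

    if __name__ == "__main__":
        print("YSlow overall score: %s" % yslow_grade(HAR_FILE))

Hooked into the same job that produces the HAR, a script like this gives you a number per build, which is all you need to start trending performance over time.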

I’m kind of surprised that it took me until I came to Velocity to find out about PhantomJS. It’s a headless WebKit browser scriptable with JavaScript or CoffeeScript. It has pretty cool integration with YSlow and even Jenkins. Take a look at Marcel’s YSlow.org page on the integration (http://yslow.org/phantomjs/).
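In a Jenkins-style setup, the gate could be as simple as the sketch below: run yslow.js under PhantomJS against a page and fail the build if the grade falls below a threshold. The URL and threshold are hypothetical, and I’m going off the options described on yslow.org/phantomjs (--info, --format, --threshold); check that page for the exact flags your version supports.

    #!/usr/bin/env python
    """Sketch of a CI performance gate using PhantomJS + yslow.js (hypothetical URL/threshold)."""
    import subprocess
    import sys

    URL = "http://localhost:8080/webapps/login"  # hypothetical page under test

    def gate(url, threshold="B"):
        # yslow.js is assumed to exit non-zero when the grade misses the threshold
        result = subprocess.run(
            ["phantomjs", "yslow.js",
             "--info", "grade", "--format", "tap", "--threshold", threshold, url]
        )
        return result.returncode == 0

    if __name__ == "__main__":
        if not gate(URL):
            print("Performance gate failed; breaking the build.")
            sys.exit(1)
        print("Performance gate passed.")

Wired into Jenkins, the non-zero exit is what turns a slow page into a broken build rather than a surprise in production.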


Marcel and I quickly chatted on Twitter about how PhantomJS and YSlow were being used. I kind of envisioned that the functional regression suite generated the HAR and then PhantomJS and YSlow were run via Jenkins to evaluate it. It turns out this was really Marcel’s cole slaw to their real sauerkraut. At Twitter they use WebPageTest instead, which is cool because my team did a ton with WebPageTest back in the day. Problem is, it was “back in the day” like 2 years ago.

I won’t go into details because quite frankly I don’t know them. All I know is that Marcel came to Yahoo after Steve Souders left to go to Google. Marcel then left to go to Twitter. I believe Stoyan Stefanov left Yahoo to go work at Facebook. There were others, but I don’t know all of their names. I guess Yahoo couldn’t keep the talent from leaving. According to Duran, the team is reforming, but I’m not sure who’s going to work there. I cannot confirm or deny that the team was ever dissolved. Either way, good for Yahoo for bringing their team back!

I probably have more to write…don’t worry I’ll be here all week and will have more to post. For now good night!