Monthly Archives: October 2008

SPT Notes About Transactional Analysis

Most of the information in this blog relates to questions that I am asking inside my head. Anyone is free to respond if they have ideas. The blog is kind of an amalgam of random thoughts, initial process notes and some reporting ideas.

I would say that we have some problems that need addressing. The problems I am thinking about relate to our inability to thoroughly perform transactional analysis during the test cycle. There are a number of reasons for it; more than reasons, we have excuses. I’m kind of tired of excuses. I think it’s time to just do it. The questions that come to mind relate to who will perform the analysis and when it will be done. How early in the cycle can something be called out? Who responds to the initial escalation?

I think the key word I just used is escalation. Yesterday I gave a presentation to the Leadership Group within Product Development. One of the slides I presented was about differing skill sets on the Performance Engineering team. Each of our engineers has different skills and interests. Because there are such differences in the skill sets on the team, not everyone on the team can perform all of the functions that a cross-functional team member might ask of a performance engineer. As it relates to a performance forensics problem, which we would define as a problem to be solved from the results of a test simulation, we need to be able to provide a chain of escalation so problems can be effectively raised, solved and prevented from recurring.

Request #1: The Need for Robust Forensics Escalation Process

We need to figure out an effective chain of command in which first-level and second-level forensic analysis can happen directly within our AppLabs team. This means that we must be able to build the hierarchy of response, as well as invest time from a training perspective. There are certain forensics skills that each member of the team has to have. Then there is a set of secondary skills that a second tier is expected to acquire. We as a team will have to decide what those skills are. Beyond our second tier, we must then have a third tier of responsiveness. Who plays this role is still to be determined. It’s simple to say Cerbibo, but I don’t think it’s quite that easy. The Cerbibo team has some very advanced skills that might not be effectively used performing tier-three support. We are going to have to step back and decide whether it’s our North American Engineers or Cerbibo. What I can say is that I would expect a fourth and possibly a fifth tier of escalation. The fourth tier would most likely be whichever team was not selected for tier three (North American Engineers or Cerbibo). Tier five would be the more senior members of the Performance Engineering team. Tier six and higher would be either Steve or an Architect within the Development team. We need to sort out this escalation chain over the course of 9.0 testing. Now is the best time to roll out tiers one and two to AppLabs.

Galileo Report Enhancement #1

I will definitely open up a Galileo ticket before the day ends. We need to develop a report with detailed meta-data calling out the transactions in question. This might sound fairly obvious, and we might even have a report that does some of this already. I would call this a slightly more intelligent report than the existing reports we have today. This new report would be criteria/rule based, firing when any of the conditions below is met:

  • Transaction Response Time is greater than X seconds
  • Mode Percentage is lower than X percent
  • Standard Deviation of a Transaction is greater than X seconds
  • Skew Factor is offset

We obviously need to provide parameters for the inputs I specified above. I don’t think it’s as easy as saying a transaction is flagged when it is greater than 10 seconds or its mode percentage is less than 25%. Then again, maybe it is that easy; otherwise, how are we going to decide whether we need to investigate a problem?
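To make the rule idea a little more concrete, here is a rough sketch of what the either/or check could look like. The threshold values and field names are placeholders I made up for illustration; they are not part of Galileo today.

```python
# Sketch only: thresholds and field names are hypothetical, not the Galileo schema.
THRESHOLDS = {
    "max_response_time_s": 10.0,   # flag if response time exceeds this
    "min_mode_pct": 25.0,          # flag if mode percentage falls below this
    "max_std_dev_s": 5.0,          # flag if standard deviation exceeds this
    "max_abs_skew": 1.0,           # flag if the skew factor is noticeably offset
}

def flag_transaction(sample, thresholds=THRESHOLDS):
    """Return the list of rules a single transaction sample violates."""
    reasons = []
    if sample["response_time"] > thresholds["max_response_time_s"]:
        reasons.append("response time over threshold")
    if sample["mode_pct"] < thresholds["min_mode_pct"]:
        reasons.append("mode percentage under threshold")
    if sample["std_dev"] > thresholds["max_std_dev_s"]:
        reasons.append("standard deviation over threshold")
    if abs(sample["skew"]) > thresholds["max_abs_skew"]:
        reasons.append("skew factor offset")
    return reasons

# Example: a 12-second sample with an otherwise healthy profile gets flagged once.
print(flag_transaction({"response_time": 12.0, "mode_pct": 40.0,
                        "std_dev": 2.0, "skew": 0.3}))
```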

The report itself will need meta-data about the transactions. Specifically, it will need the following (a rough sketch of what one row might look like appears after the list):

  • Transaction Name (Grouped by a Count)
    • If more than one sample of the same transaction meets the criteria, we will need to either drill down into a child report page or a tree.
  • Dimensional Information
  • Response Time
  • Transactional Mean (all samples of this transaction)
  • Transactional Standard Deviation (all samples of this transaction)
  • Transactional Mode (all samples of this transaction)
  • Transactional Mode Percentage (all samples of this transaction)
  • Transactional Skew (all samples of this transaction)
  • CPID
  • Time Stamp
  • Server (if applicable)
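Purely as a strawman, here is one way a single row of that report could be shaped. The field names below are illustrative only and are not the actual Galileo schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class TransactionRow:
    """Hypothetical shape of one row in the proposed report."""
    name: str                     # transaction name (grouped by a count elsewhere)
    dimension: str                # dimensional information, e.g. "XL"
    response_time_s: float        # response time of this sample
    mean_s: float                 # mean across all samples of this transaction
    std_dev_s: float              # standard deviation across all samples
    mode_s: float                 # mode across all samples
    mode_pct: float               # percentage of samples falling on the mode
    skew: float                   # skew across all samples
    cpid: str                     # CPID
    timestamp: datetime           # time stamp of the sample
    server: Optional[str] = None  # server, if applicable
```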

I can envision this report looking a few different ways. I know there are a few things that I am interested in seeing. First, I would love to see a scatterplot represented over time, though I don’t want to rely solely on a scatterplot; I would love to see a table/chart as well. If the data could be easily exported to Excel, that would be of interest. I’m not sure if this is possible, but can we generate image files from our scatterplot?
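On the image-file question, I don’t know what our charting layer can do, but as a proof of concept, rendering a scatterplot straight to a PNG is cheap. The sketch below uses matplotlib purely as an example; I’m not assuming Galileo works this way.

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen so we can write image files
import matplotlib.pyplot as plt

def scatterplot_to_png(elapsed_seconds, response_times, path="transactions.png"):
    """Plot response time over elapsed test time and save it as a PNG."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.scatter(elapsed_seconds, response_times, s=10, alpha=0.6)
    ax.set_xlabel("Elapsed test time (s)")
    ax.set_ylabel("Response time (s)")
    ax.set_title("Transaction response times over the test run")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)

# Example with made-up data
scatterplot_to_png([0, 30, 60, 90, 120], [0.8, 1.1, 9.7, 1.0, 0.9])
```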

To go along with this report, we need to train the team on how to effectively use the Galileo data to dig into the logs and collect enough information to account for a systemic issue. We then need to go into our other forensic data to make a correlation. If we can conclude that the issue was not a resource/interface issue, we might need to take another sample as that user with different forms of instrumentation enabled.

Galileo Report Enhancement #2

This second report enhancement is about transactional performance comparison from dimension to dimension, platform to platform and test to test. Let’s say we have a transaction; we will call it T1. Transaction T1 took 20 seconds during the most recent test. It turns out that the sample came from a Solaris test and was in the XL dimension. All other samples of transaction T1 were less than 1 second. How do we explain this accurately and work the problem out?

The problem isn’t incredibly simple, and I kind of left the listener hanging. I don’t say how many samples of the transaction we have taken, nor whether any of those were in the XL dimension or a larger dimension. Assume the data can provide clarity to this problem. If that assumption could be made, I would love to see something like the following:

  • Color code each dimension
  • Provide different shapes by platform
  • Insert the line for the mode value
    • When comparing multiple tests, we could color code each test’s mode line

This would be an individual-transaction-only report, so it would be linked from other reports in the system that present transaction summary details. Below is a crude attempt at visualizing this chart.
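Since I’m no artist, here is a rough matplotlib sketch of the idea instead. The sample values, colors and shapes are made up; the point is simply color by dimension, shape by platform, and a dashed line at the mode.

```python
import matplotlib
matplotlib.use("Agg")                          # render off-screen, write a PNG
import matplotlib.pyplot as plt
from statistics import mode

# Hypothetical samples of one transaction: (elapsed_s, response_s, dimension, platform)
samples = [
    (10, 0.7, "L",  "Linux"),
    (40, 0.9, "XL", "Solaris"),
    (70, 20.0, "XL", "Solaris"),
    (95, 0.7, "L",  "Linux"),
]

colors = {"L": "tab:blue", "XL": "tab:red"}    # color code each dimension
markers = {"Linux": "o", "Solaris": "^"}       # a different shape per platform

fig, ax = plt.subplots()
for elapsed, resp, dim, plat in samples:
    ax.scatter(elapsed, resp, c=colors[dim], marker=markers[plat])

mode_value = mode(s[1] for s in samples)       # mode of the response times
ax.axhline(mode_value, linestyle="--", color="gray", label=f"mode = {mode_value}s")
ax.set_xlabel("Elapsed test time (s)")
ax.set_ylabel("Response time (s)")
ax.set_title("T1 samples by dimension (color) and platform (shape)")
ax.legend()
fig.savefig("t1_by_dimension_platform.png", dpi=150)
```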

Galileo Report Enhancement #3

One thing that would be great is the ability to dynamically alter a report. Let’s say you are looking at a report that displays all transactions greater than 10 seconds for a particular test. It would be awesome to apply a filter that says: show me every transaction that is greater than, or less than, X seconds. Basically, having the ability to customize a report on the fly would be the ultimate goal.
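Stripped down, the on-the-fly filter really just amounts to re-querying the same result set with user-supplied bounds. A toy sketch (the row structure is made up):

```python
def filter_transactions(rows, min_seconds=None, max_seconds=None):
    """Return rows whose response time falls inside the requested bounds."""
    kept = []
    for row in rows:
        if min_seconds is not None and row["response_time"] < min_seconds:
            continue
        if max_seconds is not None and row["response_time"] > max_seconds:
            continue
        kept.append(row)
    return kept

rows = [{"name": "T1", "response_time": 12.4},
        {"name": "T2", "response_time": 0.6}]
print(filter_transactions(rows, min_seconds=10))  # everything over 10 seconds
print(filter_transactions(rows, max_seconds=1))   # everything under 1 second
```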


An Analogy of Sorts: The KC-135 and the Role of Performance Inspection

This morning I was listening to NPR and a story came on about the KC-135 Stratotanker. You can listen to the story by clicking here. There’s a second audio slideshow worth checking out as well.

The KC-135 Stratotanker is an in-air jet refueling system. It was built in the late 1950s in order to refuel the B-52 bomber in mid-air. It’s been in service ever since, and as the reporter says, the government has no plans to replace the plane. Basically, every 5 years KC-135s are grounded and brought into a hangar at Boeing’s Texas facility, which is a little over a mile long and wide, for weeks of inspection. A fleet of personnel perform depot-level inspections, repairs, maintenance, modifications, re-painting and supply chain services. What’s interesting about this system is that every single screw, bolt, nut, wire, etc. is visually inspected and tested before the plane is deemed airborne-ready again. Part of this inspection regime is due to a number of explosions that occurred in the 1990s because of a faulty fuel line; a number of KC-135s exploded in mid-air.

This got me thinking about the role of performance engineering as it relates to the KC-135. Obviously there is a sense of engineering marvel from the original designers and manufacturers. They constructed a fleet of planes that has stood the test of time. The planes are 50+ years old and still running. They have had some updates here and there (like new engines), but for all intents and purposes the same plane is still flying.

Care and feeding of the aircraft has got to be the #2 reason for its longevity, but I think it goes further than that. When you listen to the reporter talk about the maintenance process, you hear about a rigorous process of inspecting and testing every single part. There is a sense of intimate knowledge of what each part is responsible for. No part or component is overlooked.

So I ask the question…how feasible is it for us to have a similar process, without going to the extreme of inspecting every single component? I would rather spin it a little differently: rather than going through deep inspection, we go through focused inspection. For example, in our last benchmark we focused on database performance. Couldn’t it have been possible to do a deep inspection of just the database? If so, what would we have focused on?

I would suggest taking an approach similar to the inspection process for the KC-135. The engineers start with visual inspection of the physical components (emphasis on form and composition). They then move on to the logical purpose of each component, with an emphasis on consistent, reliable function. Each engineer has intimate knowledge of each component and sub-component of the KC-135. No changes occur to the aircraft unless an inspection activity calls out a risk, flaw or issue.

I’m pretty sure the same applies to us…it’s just a question of choosing the right inspection activities.

Five Articles Worth Reading About Client-Side Performance

The days of focusing solely on server performance appear to be numbered. We need to build up our skills in the area of client-side processing. While I’ve made a number of posts on the subject over the past 18 months, new posts might become more of a daily or weekly pattern. There are five articles that I would like the team to read. They are quick reads; four out of the five should take less than 10 minutes. The PowerPoint presentation from Yahoo might take a little longer. Abstracts are below:

A Study of Ajax Performance Issues

The first article comes from the blog Direct from Web2.0. It covers six points, primarily about the major browsers as of early 2008. Nothing is captured about Google Chrome, as that browser was not available at the time. It’s definitely a good primer to read.

  • Array Is Slow on All Browsers
  • HTML DOM Operation Performance in General
  • Calculating Computed Box Model and Computed Style
  • FireFox Specific Performance Issues
  • IE Specific Performance Issues
  • Safari Specific Performance Issues

Optimizing Page Load Time

The second article is about optimizing page load time in web applications. The author, Aaron Hopkins, covers a lot of the information that the YSlow team has written about over the past two years. He has a very comprehensive list of tips, plus about a half-dozen links to comparative information on the topic of client-side performance. This is definitely worth reading, and worth following the links.

HTTP Caching and Cache Busting Presentation from Yahoo

This third article takes the prize for being the most comprehensive of the group. It’s really a presentation from the Apache Conference back in 2005 by Michael Radwin of Yahoo. For those of you who want to know more about HTTP caching, this is the one to study.

Caching Tutorial

This is a somewhat obscure article from Mark Nottingham. What I like about this posting is how he simplifies the topic of web caching. He doesn’t make the sophisticated reader feel bored or the unknowing reader feel stupid. He simply presents the material in very clear, easy-to-understand terms.

Circumventing Browser Connection Limits for Fun and Profit

This fifth article comes from Ryan Breen at Gomez; he’s the author of Ajax Performance. What Breen points out is that not all browsers behave the same. Most load resources in a largely serial fashion, which still causes latency when interacting with a client-rich page. These browsers can be coaxed into doing more parallel operations, but that requires configuration changes. The author makes a great point about why making these changes can really speed up performance.
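If I remember the article right, the core trick is spreading static assets across a couple of hostnames so the browser will open more connections in parallel. A tiny sketch of the idea; the shard hostnames are obviously hypothetical.

```python
import zlib

# Hypothetical asset hostnames; each would resolve to the same content servers.
SHARDS = ["static1.example.com", "static2.example.com"]

def shard_url(path):
    """Deterministically pin each asset path to one shard so the URL
    (and therefore the browser cache entry) stays stable across pages."""
    shard = SHARDS[zlib.crc32(path.encode("utf-8")) % len(SHARDS)]
    return f"http://{shard}{path}"

print(shard_url("/img/logo.png"))
print(shard_url("/css/site.css"))
```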

Seven Habits of Highly Effective Performance Engineering

For those of you who know me, you probably wonder why I read so many books that have nothing to do with performance engineering and mostly have to do with management, organizational behavior and sociology. At any given time, I am typically reading 3 or 4 books, the majority of which are business books, and of course at least one technical book. One book that I read quite some time ago, Stephen Covey’s The Seven Habits of Highly Effective People, is the one I enjoyed most of all. While driving this evening, some folks on NPR brought up the book and it got me thinking. There are some characteristics that I believe separate a good performance engineer from an average performance engineer. If you know me well, you know that I love what I do. It’s not just a job for me, but a career.

Learn SPE…Don’t Just Memorize It

As a young performance engineer, I lacked any formal guidance. The team I worked on had little process or approach. It wasn’t a bad team; in fact, at our old organization it was one of the most respected teams in the company, almost to the point that my old manager was god-like in the eyes of so many employees at the company. Before I left, I joined a professional group called CMG and quickly learned about an interesting methodology called SPE (Software Performance Engineering). Below is a definition of SPE from Williams and Smith:

Software Performance Engineering (SPE) is a systematic, quantitative approach to the cost-effective development of software systems to meet performance requirements. SPE is a software-oriented approach that focuses on architecture, design, and implementation choices.

SPE gives you the information you need to build software that meets performance requirements on time and within budget.

The SPE process begins early in the software life cycle and uses quantitative methods to identify satisfactory designs and to eliminate those that are likely to have unacceptable performance before developers invest significant time in implementation. SPE continues through the detailed design, coding and testing phases to predict and manage the performance of the evolving software as well as monitor and report actual performance versus specifications and predictions. SPE methods encompass: performance data collection; quantitative analysis techniques; prediction strategies; management of uncertainties; data presentation and tracking; model verification and validation; critical success factors; and performance design principles, patterns and antipatterns.

When I came to Blackboard in the fall of 2003, the first thing I set out to do was build an SPE practice. I’ve been moderately successful, which got me thinking as recently as a month ago that maybe the methodology is just too cumbersome for the team. I’ve been watching the team quite closely, and I realized that I don’t think the methodology is hard to understand. It’s quite practical and can be easily streamlined. Where I think the problem resides is that, for most of the team, learning the methodology has been viewed as a heavy academic exercise. Heck, I hand every new employee on the team a copy of Performance Solutions: A Practical Guide to Creating Responsive and Scalable Software, and I think the book is so big and heavy (it’s a textbook used by a lot of graduate school programs) that the content gets lost. Tell me, when was the last time you snuggled up in bed to read a textbook? Most likely it was college, and I bet within 10 minutes you fell asleep. The thing about this book is that the workflow and process for executing SPE are so simple and so straightforward that getting through the book shouldn’t be such a struggle. I’ve read the book 30 times in 5+ years. That’s 6 times a year, or once every 2 months…

Research Daily

Another tidbit about me…did you know that I am a list guy? That’s right…every day I write a list of everything that I am to work on. I start the list in the car at stop lights, at my desk and sometimes even in meetings. I work through the list each and every day. I don’t necessarily finish my list each day, but I get real close. Before having lists, my days were quite chaotic and I wasn’t quite sure what I had accomplished. One of the items that makes my list each day is a line item to read through Google Reader, which I use as my RSS reader. As of today, I have 186 subscriptions and the list is growing each day. I recommend that you download my OPML file (google-reader-subscriptions_110308) and use it as a starting point. Every day, usually around 12:30pm, I scan through the blogs from the previous day. Some blogs are bogus and I skip over them…others are pretty cool and I tag them. Lately I’ve been starring them in Google Reader, posting them to Scholar and sending them out to the team. Every day we need to learn. Blogs are one way. Books are another…remember that the library is free. So when you think you are having a bad day stuck in meetings or whatever…take 30 minutes to do some reading about our discipline of performance engineering.

Share Your Experiences

Each and every day we run into problems or learn something new (i.e., habit #2). We need to be more vocal about our experiences so that everyone on the team can learn. Our blogs are meant to be our space for sharing these experiences. I’m really starting to believe that if we don’t have an entry a day, it’s a wasted day. That might sound odd, but I really mean it. Let me throw in a caveat: some days you simply don’t have time to write. That happens to me all of the time. In those cases, I save my writing for the days when I can contribute a posting. Often, I rattle off 3 or 4 blog posts in a day as a way of catching up for the days that I’m too busy to post. What does it mean to share your experiences? It really depends. I tend to write as though my reading audience is really a listening audience. My blogs are somewhat conversational and informal. I like to put in lots of links so that if my point(s) aren’t as obvious as they should be, the reader can navigate to one of those links for additional context. I also like to put pictures in my postings as often as I can. Sadly, some of the visuals I want to post are too time-consuming to put together, so you might get a low-budget visual. Most recently, I’ve been doing video blogs so that I can review a topic or tool set. This is something that I want to make more a part of my weekly routine. Watching a video can be extremely powerful…especially with embedded audio.

Show Your Work

This is really an extension of #3, Share Your Experiences. For this point, I want to share a quick story. In high school, I had a Math teacher named Captain McIsaac. My high school was originally a feeder school for the Naval Academy, Coast Guard Academy and the Merchant Marine, so we had a lot of older teachers who were former Navy. Well, anyways…old Cap McIsaac was an interesting guy. He looked like Ted Kennedy’s twin and probably scored the same on most of his breathalyzer tests. He was a terrible Math teacher. Most of us thought he was awesome because he would give us the answers to the exam questions during the exam. We never had to show our work. That’s great for kids who cheat off each other. I have to admit…looking back, the guy was terrible. He didn’t hold us accountable for our work, and it showed in all of my Math classes after Cap’s class. I did well because I love Math, but it takes an awfully long time to break bad habits. You can pick up a bad habit in seconds, but it takes weeks…sometimes years…to break one.

There’s an important reason for showing your work…actually, there are multiple. The number one reason is so that you personally spend the time reviewing what you did and explaining it to your peers in a visual manner. Don’t worry if you change your ideas…you just write new blogs. The second reason is that we are a global team; everyone on the team should get the opportunity to learn from other members of the team. It’s a good way to get feedback and share work. The third reason, which is sadly a bit lame, is that our days become so busy that sometimes we need to be able to comment on a blog rather than have a conversation or an email thread.

Become a Statistics Guru

I love statistics. I think about them night and day. You too should think about them at every opportunity you have. We use statistics in everything we do as performance engineers. Think about it:

  • Response time sampling and analysis
  • Probability models for our performance simulations
  • Profiling and analysis
  • Data modeling
  • SQL tuning

The list goes on and on. If you haven’t taken a statistics class, I strongly recommend you do, and do it fast. Then, once you have finished getting a good primer on statistics, we need you to start making recommendations on more meaningful statistics that will help the team.
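To show how little code it takes to get started, here is the kind of quick per-transaction summary I have in mind. The response-time samples are made up, and the skew here is Pearson’s second skewness coefficient, used only as a rough stand-in for a skew factor.

```python
import statistics

# Hypothetical response-time samples (seconds) for one transaction.
samples = [0.8, 0.9, 0.9, 1.0, 0.9, 7.5, 0.9, 1.1]

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)
mode = statistics.mode(samples)
mode_pct = 100.0 * samples.count(mode) / len(samples)
# Pearson's second skewness coefficient: 3 * (mean - median) / stdev
skew = 3 * (mean - statistics.median(samples)) / stdev

print(f"mean={mean:.2f}s stdev={stdev:.2f}s mode={mode}s "
      f"mode%={mode_pct:.0f}% skew={skew:.2f}")
```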

The Need to Be a Generalist

Systems engineers are generalists. They need to understand hardware, software, integrations, functionality, business process, etc. For all intents and purposes, performance engineering is really just another name for systems engineering. To be a great performance engineer, you need to know a lot about a lot. That’s a lot to digest, but it’s true. I recommend that with each project, you try to learn as much as you can about one thing. As of late, I’ve taken this approach with the team. In Sprint 1 of 9.0 cyclical testing, the team focused on database performance. There were a lot of themes, such as query optimization and execution plan costs. The gist is that the team focused on just the database. In Sprint 2, the team is focusing on ehCache (learning as much as they can in order to calibrate it) and JVM instrumentation. With both projects, I can guarantee that each member of the team will learn something about their area of focus during the sprint; this way they won’t be too overloaded with data. This gets back to habit #2 (research daily). Use these sprints to focus on a given component in the stack or a particular technology, and treat them as an opportunity to do external research. If you are learning about ehCache, one of the first things you should do is search for blogs that talk about ehCache on a regular or irregular basis. Chances are that if someone is writing about it, they are likely to share links or references to others who are writing about the same component or technology. One hint I will give you: look at the blogroll when you identify a good blog worth reading. I tend to build my OPML file by evaluating the other blogs referenced in a blogroll.

Question…Hypothesize and Prove It Out

The seventh habit is my favorite. We work in a lab. It might not be pretty and we don’t wear white lab coats, but no matter how we spin it, the work we do is lab-like. That means we need to be able to do three things really well (aside from researching). First, we need to question. I don’t mean that in a negative way; I mean it in an interrogative way. For example, as a kid I always wondered how a radio worked. It’s a fairly easy thing to learn, but you have to start by asking questions such as: How does a radio work when it’s not stationary? How does it work inside buildings? From there, we go back to habit #2 and partake in research. From our research we develop a hypothesis. A hypothesis is a calculated guess based on some form of empirical data. Then we must prove the hypothesis, and we do that by testing. Where do we test? We test in the lab…so don’t ever forget: you work in a lab.