Getting to the Heart of the Matter…More Data Needs

From my internal blog: Last changed: Dec 07, 2008 16:11

Patrick and I had an awesome conversation on Friday. We were talking about the future of analytics on the team and how we simply have greater needs for continuous access to this data on a day-by-day basis…possibly minute by minute. I want to use this blog to talk about this conversation and some of our ideas. It’s essentially a continuation of the blog I wrote last. This type of data is not necessarily behavioral data; it has more to do with adoption and volumetric profiling. I will explain what I mean later in this blog.

The Beginning of Our Data Model

I first heard the term volumetric sampling back in 2001 when I was working on a project at Staples. We had been hosting Staples’ deployment of the Manugistics Collaborate, Monitor and Market Manager applications. We had about half a dozen clients in the supplier network. In case I forgot to mention it, what we were doing was building a web portal that allowed suppliers to work with Staples on Distribution Level forecasts. The kinds of suppliers who were working with us were Avery, 3M, HP, Lexmark and Ampad. You have probably heard of each of these companies because you have seen some of their supplies at a Staples store. The Staples DBA requested a volumetric summary of these sample clients, who had already spent a year in the program and collected/generated a substantial amount of data. I had no idea what he meant by volumetric summary. Eventually, when I didn’t get him what he wanted, we sat down and he told me exactly what he was trying to accomplish.

The conversation was pretty straightforward, but looking back it was one of the most influential conversations I’ve ever had with anyone as it relates to data modeling. See, the DBA wasn’t too concerned about moving a copy of the current production system from the Manu hosting lab over to Staples. He had enough data to know how big the SID and tablespaces needed to be. What he wanted were answers to questions like the ones below:

  • Which entities were the largest?
  • How were entities sized (minimum, mean, maximum, variance and standard deviation), and what did actual histograms of the entities look like?
  • At what rate was data being created?
  • What were the complex relationships between simple entities?
  • What attributes/characteristics could be identified from the data to better understand the behavior of data growth and system adoption?

There were obviously a dozen or two pieces of data we were ultimately after beyond what I listed above. These were the key points though for us to focus on.
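
To make the idea of a volumetric summary concrete, here is a minimal Python sketch of the per-entity statistics listed above. The function name, the bucket width and the row counts are all made up for illustration; they are not the numbers or the scripts from the Staples project.

```python
import statistics
from collections import Counter


def volumetric_summary(sizes, bucket_width=100):
    """Summarize the size of one entity across a set of sampled clients."""
    # Histogram: count how many samples fall into each fixed-width bucket.
    buckets = Counter((s // bucket_width) * bucket_width for s in sizes)
    return {
        "min": min(sizes),
        "mean": statistics.mean(sizes),
        "max": max(sizes),
        "variance": statistics.pvariance(sizes),  # population variance
        "stdev": statistics.pstdev(sizes),        # population std deviation
        "histogram": dict(sorted(buckets.items())),
    }


# Hypothetical row counts for a single entity, one value per sampled client.
forecast_rows = [120, 340, 360, 980, 1500]
summary = volumetric_summary(forecast_rows)
```

The histogram matters as much as the mean: two entities with the same average size can have very different shapes, and the shape is what a data model has to reproduce.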

So why do I feel the need to tell you this background story? Mainly because one of the most important foundations of a performance engineering team is a supporting data model to simulate against. I learned early in my career as a performance engineer that the best way to construct a data model (or multiple models) was to sample the install-base in order to design data model profiles. Just like in the previous blog, we relied very early on our ASP organization to supply us with enough client samples to construct initial models.

Our first sampling efforts were sufficient given our maturity as a team at the time. We got a lot of counts on entities, mostly statistical summaries of simple entities with a few formulas for complex entities, which gave us a good enough overview for building a 1.0 data model.

Crudeness at First

Our early sampling exercises were really crude. We had a bunch of SQL statements and a spool statement at the beginning and end of the file. For every entity for which we wanted a statistical summary, we would run 5 queries sequentially. Looking back, it was pretty silly how we did it. It took so long to get the data that we would limit ourselves to twenty to thirty customers. We really wanted to get 100% of the install-base (at least the ASP install-base), but it was virtually impossible. Just the sheer manpower we needed from the ASP DBAs was too much to ask for. We even got desperate enough to send the scripts to clients that were not hosted. Some would respond, but for the most part our requests were littered with issues. Every client had an issue of some sort.
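
For contrast, here is a rough sketch of how several per-entity queries could be collapsed into a single aggregate query. It is not the script we actually ran: it uses an in-memory SQLite database so it is self-contained, and the `enrollment` table and its columns are hypothetical. SQLite has no built-in variance function, so the population variance is derived as the mean of squares minus the square of the mean.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment (course_id INTEGER, user_id INTEGER)")

# Hypothetical data: three courses with 10, 20 and 30 enrollments.
rows = (
    [(1, u) for u in range(10)]
    + [(2, u) for u in range(20)]
    + [(3, u) for u in range(30)]
)
conn.executemany("INSERT INTO enrollment VALUES (?, ?)", rows)

# One pass: the inner query sizes each course, the outer query summarizes.
query = """
SELECT COUNT(*), MIN(n), AVG(n), MAX(n), AVG(n * n) - AVG(n) * AVG(n)
FROM (SELECT COUNT(*) AS n FROM enrollment GROUP BY course_id)
"""
count, smallest, mean, largest, variance = conn.execute(query).fetchone()

# Guard against a tiny negative value caused by floating-point rounding.
stdev = math.sqrt(max(variance, 0.0))
conn.close()
```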

We took the initial results and built a really crude data model. It was 5 or 7 course packages that we built by hand and eventually copied via Snapshot when setting up for a benchmark. The problem as we saw it with this model, besides the fact that scaling was painfully slow, was that it was very challenging to include activity data. We could archive packages, but the activity data would be identical from clone to clone. Going this route, the model wouldn’t have some of the key behavioral data we wanted.

A Sense of Accomplishment

What we have accomplished thus far with clp-datagen is simply remarkable. I consider this project one of the most complex projects I’ve been a part of in my history of software performance engineering. While the team has challenges from time to time, and we still curse how long it takes, it’s still a remarkable, flexible tool. We can create fascinating representations of clients. The 10-dimensional data model simplifies our ability to create profiles of models within a single data model. As I will write below, clp-datagen has a lot more maturing ahead of it. We really need to start thinking about what our future looks like for data models.

The Need to Sample More

One of the things I’ve been telling the team a lot lately is that much of our infrastructure is based on historical statistical analysis. I’ve also been saying that at this point in our maturity, we have to have more confidence in our ability to hypothesize, whether that’s about the way we represent a data model or a behavioral model during a benchmark. We can always go back after the fact to determine whether we need to make changes or not.

I’m starting to get equally antsy. I really want to see us have the data at our fingertips just like we have with Galileo. Just this morning, Patrick wrote me that Galileo has over 33 million rows of transactional data. We still have mounds and mounds of data to import into the system from historical tests. When we get to a 100% automated testing model, that number of rows of data is only going to double, triple and quadruple in a very short period of time.

I just finished up an interesting book called Click which is about online Internet search behavior. The author has endless amounts of data at his disposal. It’s actually quite frustrating from my perspective that he has all of this data and can do whatever he wants with it. It’s like having a goose that lays a golden egg.

The thing is, we have data. We have lots of data. We just need a better way to get it, transform it, visualize it and analyze it. As I mentioned in this blog and the previous entry, our hosting organization has hundreds of clients across nearly every profile of our install-base. We just need to figure out how to harness and mine that data. I think the biggest challenge is figuring out how to work best with our hosting group to get the data. The higher-ups from ASP want to share it. The problem is that the team that can make it happen might not be able to help us the way we want them to help us.

Creating Profiles

At the heart of my conversation with Patrick was this idea of identifying a select group of clients who could be identified as the trend setters or followers (could be both) within a unique client profile. I can generalize a profile right now for you, but I want to make it clear that this is an example that might not apply in a few months when this project takes off. One profile might be called Online K-12. What this might be is a profile of clients who are either 100% online oriented, or have instituted laptop initiatives in the classroom. Out of the 900+ clients in ASP, we might have a sample population of 10 clients. We would then identify the 1 client who best represents the population.

If we took an abstract example of, let’s say, popular comedies from the 1980’s, we might sample any number of movies from 1980 through 1989 that fall in the comedy genre. One approach might be to identify the entire population of comedy films produced during that era and rank them based on particular criteria. We simply ignore the idea of time (year). We could easily have film A made in 1983 and film B made in the same year ranked 1 and 2 for the decade. We then might choose the best rated film to study for our sampling exercise. Another approach might be to call out the top 3 films per year based on a subset of criteria, bringing the sampling audience down to 30 total. We could even go lower if necessary. We then look at this smaller universe of films with more stringent criteria in order to identify the comedy movie of the 80’s that best represents the decade.

All I’ve done in these two examples is find a way to identify a “representative” of the comedy genre. It’s no different than trying to find the best representative of Online K-12 schools. In the first approach I take the route of looking at 100% of the population. Because the sample is so large, I have to narrow in on a small subset of criteria that can help make a decision about who best represents the decade. The second example looks similar, but it’s not. See, what I am doing is breaking the population down even further. I’m using year as the differentiator to narrow the sample population down. Once I have a smaller population of data to work with, I can look at a wider array of criteria to help make a decision about the “best” representative.
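
The two approaches can be sketched in a few lines of Python. The film titles and ratings here are invented, and a single `rating` number stands in for whatever ranking criteria we would really use.

```python
from collections import defaultdict

# Hypothetical films as (title, year, rating) tuples.
films = [
    ("Film A", 1983, 9.1),
    ("Film B", 1983, 8.9),
    ("Film C", 1985, 8.0),
    ("Film D", 1985, 7.2),
    ("Film E", 1988, 8.5),
]


def rank_whole_population(items):
    """Approach 1: ignore year and rank the entire population."""
    return sorted(items, key=lambda f: f[2], reverse=True)


def top_per_year(items, per_year=3):
    """Approach 2: use year to break the population down, keeping only
    the top `per_year` items from each year."""
    by_year = defaultdict(list)
    for film in items:
        by_year[film[1]].append(film)
    shortlist = []
    for year in sorted(by_year):
        ranked = sorted(by_year[year], key=lambda f: f[2], reverse=True)
        shortlist.extend(ranked[:per_year])
    return shortlist


overall = rank_whole_population(films)
shortlist = top_per_year(films, per_year=1)
```

With this data the first approach puts two 1983 films at ranks 1 and 2, while the second approach produces a shortlist with one film per year. That shortlist is small enough to be judged against a wider set of criteria.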

It’s essential to create profiles. In my movie example, let’s say I want to know the “best” overall movie from the 80’s. All I’ve looked at is the comedy genre. So I can’t say that I have a good understanding of other genres. It might be that the 80’s was a decade of laughs. We might find that 62% of all movies in the 1980’s were comedies (that’s just a guess by the way). Is it fair to forget about the other 38% of the population? I don’t think it’s fair, because it might be that not a single comedy won the Oscar for best picture in the 80’s. To many, the measure of what is “best” might be that award. To others, the criterion might be gross revenue. Someone might argue for an even more interesting statistic, such as sales since the turn of the century, or maybe even some far-off correlation in which we study movies made since the 1990’s by the actors from our top 5 list of 1980’s comedies.

The key point I’m making is that it’s not fair from a statistical point of view to classify something (in our case it’s Blackboard institutions) based on size alone. A more lucid example would be taking two clients from the Online K-12 example above. Client A could have been a Blackboard customer for 8 years. They might not have a restrictive data policy, so they might have 8 years of historical data. Client B, though, could be a Blackboard client with less than 1 year of use of the application. Both schools might have populations within a few thousand of each other. From a demographic standpoint, they both look and feel the same. It turns out that client B is exhibiting some major adoption differences from client A. Client B’s users are now using Blackboard as much as 7 days a week. They are generating 4x to 5x more data than client A. It’s difficult to see this in a transparent manner because client A has so much history of data. From a straight-up record-for-record count, client A might outnumber client B by 4 to 1. We might foolishly think that client A is the best representative for deeper sampling because we are more familiar with them as a client, and also because their gross count of data is larger. The difference, as I see it, is that we want to be nimble enough to sample client B from a deeper perspective because they are changing the adoption behavior of the system. Most likely this change is something we want to be aware of from a day-to-day perspective (for trending purposes) in order to confirm whether the behavior is becoming the norm for client B. This is something I am going to dig deeper into below.
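
Here is a small sketch of why gross counts and growth rates can point at different clients. The record counts are invented to match the ratios in the example (client A holds 4 times the total records, client B is generating 4.5 times the recent data), and the 30-day window is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ClientSample:
    name: str
    total_records: int         # everything accumulated since go-live
    records_last_30_days: int  # hypothetical recent-activity window

    @property
    def daily_growth(self) -> float:
        """Average number of records created per day over the window."""
        return self.records_last_30_days / 30


clients = [
    ClientSample("Client A", total_records=4_000_000, records_last_30_days=40_000),
    ClientSample("Client B", total_records=1_000_000, records_last_30_days=180_000),
]

largest = max(clients, key=lambda c: c.total_records)  # picks Client A
fastest = max(clients, key=lambda c: c.daily_growth)   # picks Client B
growth_ratio = clients[1].daily_growth / clients[0].daily_growth
```

Ranking on `total_records` rewards history, while ranking on `daily_growth` surfaces the client whose behavior is changing right now.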

Wide Sampling

Wide sampling is what I am describing with the example of client A and B above. In our case, we want to be able to sample as much of the population as possible. 100% is the ultimate goal. We can then determine, based on constraints, whether 100% is realistic or not. Also, we can look at change (of data) as a determining factor. If a subset of the population has no change or an insignificant change in a certain set of data metrics, we have to make a decision as to whether it’s worth sampling those clients on a consistent/daily basis.
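
A change-based rule like this could look something like the sketch below. The 1% threshold and the daily/weekly cadence names are placeholders, not values we have agreed on.

```python
def sampling_cadence(previous: int, current: int, threshold: float = 0.01) -> str:
    """Choose how often to sample a client from the relative change in a
    metric between two consecutive samples (threshold is an assumption)."""
    if previous == 0:
        # No baseline yet: any data at all counts as a significant change.
        return "daily" if current > 0 else "weekly"
    change = abs(current - previous) / previous
    return "daily" if change >= threshold else "weekly"


# Hypothetical course counts from two consecutive samples.
quiet_client = sampling_cadence(1000, 1005)  # 0.5% change
busy_client = sampling_cadence(1000, 1050)   # 5% change
```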

I see the purpose of wide sampling to be about identifying behavioral changes in the install-base. There’s certain data we want to study from a wide sampling perspective. One piece of data is “standard data”, such as course counts, enrollment counts, etc. This type of data helps us differentiate large versus small. There’s a second piece of data that helps us differentiate activity levels. I like to call this data “behavioral data” because it helps us understand activity on a system. A client that has 10,000 users with 95% adoption and usage could produce more work on a system than a client that has 100,000 users with 5% adoption.
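
The arithmetic behind that comparison is quick to check. This sketch treats the number of active users as a rough proxy for work on the system, which is a simplifying assumption.

```python
def active_users(total_users: int, adoption_rate: float) -> int:
    """Estimate active users from total users and an adoption rate (0 to 1)."""
    return round(total_users * adoption_rate)


small_high_adoption = active_users(10_000, 0.95)  # 9500 active users
large_low_adoption = active_users(100_000, 0.05)  # 5000 active users
```

The client that is ten times smaller ends up with almost twice as many active users, which is exactly the kind of difference that size-based “standard data” hides.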

My personal thoughts are that we need to identify a subset of data to collect from as wide a population as we can find. The data collection has to be frequent, meaning no less than weekly, but preferably daily. I’m not sure about what data we want to collect right now. This is something we need to vet over the next few weeks. This data has to help us identify changes in the trend. We want to see who is quick to become a follower and who’s quick to set the trends.

Predictive Modeling

One of the ultimate goals of collecting this data is data model construction. This is without a doubt one of the “long-term” goals of Galileo (note that I am just assuming that the data collection that I am talking about in this blog will naturally go in Galileo), in which Galileo will be an interface for constructing the data model XML files. A second goal is for us to be able to make decisions about simulation goals from a behavioral perspective. What I mean is that by understanding the rate of change of data, we can make decisions about our objectives for a test. We might see that client X generates 1000 online assessments over the course of 3 hours. We might see that client Y performs 20,000 logins every day from 8am to 8:30am. We might see that client Z adopts a particular feature based on data growth, but no other client adopts the same feature in anything like the manner client Z does.
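
Observations like these translate directly into target rates for a test. The sketch below uses the figures from the examples above and assumes events are spread evenly across each window, which real traffic rarely is.

```python
def per_second(events: int, window_seconds: int) -> float:
    """Average event rate over a window, assuming an even spread."""
    return events / window_seconds


# Client X: 1000 online assessments over 3 hours.
assessment_rate = per_second(1_000, 3 * 60 * 60)  # about 0.093 per second

# Client Y: 20,000 logins between 8am and 8:30am.
login_rate = per_second(20_000, 30 * 60)          # about 11.1 per second
```

An average is only a starting point. A login burst packed into the first few minutes of the window would call for a much higher peak target than 11 logins per second.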

