Before I write down some thoughts about the conference, I want to reiterate to myself that it was a bad idea to wait nearly 10 days after the fact to blog about it. I have definitely forgotten some of my notes, and quite frankly I’ve been finding other things to do (sadly, not blogging) instead. I vow not to let such a time lapse happen the next go-around.
The day started off with a really great presentation from Karen Morton. She’s another former Hotsos employee who left to join Cary Millsap at Method-R. The presentation was about managing statistics for optimal query performance. She started off with a bit of math vs. magic, and it was really creative: imagine a deck of cards in which you study different characteristics of the cards. There are 52 cards in a deck. There are two colors: red and black. There are numbers 2 through 10, plus Jack, Queen, King and Ace. Each rank is represented 4 times, once for each suit.
Her main point in presenting the deck-of-cards example was to show the importance of predicate selectivity. The selectivity is the fraction of rows estimated to match the predicate. The predicate is simply the filter condition used to narrow the result set. Selectivity can be computed as 1/NDV (number of distinct values).
So in our example of 52 cards, if we were trying to compute the cardinality (selectivity x number of rows in a table) of, say, Kings in a deck, it would be (1/13) * 52, which equals 4. Yes, 4 out of the 52 cards are Kings. More of this example is broken down in the image below.
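The deck arithmetic is easy to sketch in a few lines of Python; this is just the 1/NDV and selectivity-times-rows math from above, with the card deck as the “table”:

```python
# Cardinality math from the deck-of-cards example:
#   selectivity = 1 / NDV, cardinality = selectivity * row count.
def selectivity(num_distinct_values):
    return 1 / num_distinct_values

def cardinality(sel, num_rows):
    return round(sel * num_rows)

NUM_CARDS = 52

# Rank (2..10, J, Q, K, A) has 13 distinct values.
kings = cardinality(selectivity(13), NUM_CARDS)   # 4 Kings
# Color has 2 distinct values.
black = cardinality(selectivity(2), NUM_CARDS)    # 26 black cards
# Suit has 4 distinct values.
hearts = cardinality(selectivity(4), NUM_CARDS)   # 13 hearts
```

The same two-step computation is what the optimizer does for an equality predicate when it has no histogram: divide by the distinct-value count, multiply by the row count.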
The key point of the paper and presentation is that statistics need to match the data; otherwise the optimizer produces poor cardinality estimates, which lead to poor access paths, poor join methods, poor join order and, ultimately, poor execution times. Statistics should be collected when objects change by at least 10%.
I was looking over my notes and I recall writing that this topic of statistics gathering, and histograms in general, would make a great theme for a design of experiment. There are many different ways to gather statistics, and we take a very one-size-fits-all approach. I think our approach is flawed for a number of reasons. First, we are not taking into account the effects of statistics at the varying levels (schema, table, column, etc.). I believe we do schema-level statistics. We also tend to capture statistics right after data creation, which means we most likely do not have the correct query plans stored. We only have a small subset.
I propose we set one of the benchmark teams off on a short design of experiment in which they study the best approach for handling statistics. This might involve asking questions such as: When is it appropriate to gather statistics? Should it be done after a few primed runs in which we run a lot of SELECT queries against a scaled system? Should we study different approaches to creating statistics? Which initialization settings affect the optimizer? Should we be selective from an object perspective for key entities? How would we make use of 10053 tracing as a validation mechanism?
There are definitely lots of questions here that we need to get to the bottom of. The question is: who will take this on?
Morton ended the paper and the presentation by discussing three common performance problems related to statistics that we need to be aware of: data skew, bind peeking and incorrect high/low statistics.
Data skew matters because the optimizer assumes a uniform distribution of column values. We all know that data is not always uniform. Data skew is best identified with a histogram.
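Here is a small Python sketch of why the uniform assumption goes wrong on skewed data (the column values and row counts are made up for illustration). Without a histogram, the estimate for any value is rows/NDV; a histogram records the actual per-value frequencies:

```python
from collections import Counter

# Hypothetical skewed status column: 95 rows 'DONE', 5 rows 'PENDING'.
rows = ['DONE'] * 95 + ['PENDING'] * 5
num_rows = len(rows)            # 100
ndv = len(set(rows))            # 2 distinct values

# Uniform-distribution estimate (no histogram): rows / NDV for any value.
uniform_estimate = num_rows / ndv          # 50 rows, regardless of value

# A histogram captures the actual value frequencies.
histogram = Counter(rows)
actual_pending = histogram['PENDING']      # 5 rows -- 10x off the estimate
actual_done = histogram['DONE']            # 95 rows -- nearly 2x off
```

The estimate of 50 rows is badly wrong in both directions, which is exactly the situation where the optimizer picks the wrong access path.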
Bind peeking refers to how the optimizer peeks at the bind value and uses it to determine the execution plan. By peeking I mean looking at the literal value behind the BIND. The optimizer checks the value of the bind variable and can compute the selectivity as if the BIND had been coded as a literal. Morton says this is both a good and a bad thing. It’s good because the optimizer will use actual column statistics to determine the best plan. It can be a very bad thing when statistics are stale, or when the selectivity of the peeked literal value is not representative of later executions. In our card example above, we would want INDEX access to find all of the ACES. If we changed our predicate to look for BLACK cards, we would likely prefer a full table scan, since half the deck matches.
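A toy Python model makes the failure mode concrete. This is not how the optimizer is implemented; it is a sketch of the behavior, with made-up selectivities from the card deck and an arbitrary 10% threshold for choosing index access over a full scan. The plan is chosen from the first peeked value and then reused:

```python
# Illustrative per-predicate selectivities from the card deck.
SELECTIVITIES = {'ACE': 4 / 52, 'BLACK': 26 / 52}

def choose_plan(sel, threshold=0.10):
    # Low selectivity -> index access; high selectivity -> full table scan.
    return 'INDEX ACCESS' if sel < threshold else 'FULL TABLE SCAN'

cached_plan = None

def execute(bind_value):
    global cached_plan
    if cached_plan is None:                       # first execution: peek
        cached_plan = choose_plan(SELECTIVITIES[bind_value])
    return cached_plan                            # later executions reuse it

first = execute('ACE')     # peeked 4/52 -> chooses INDEX ACCESS
second = execute('BLACK')  # reuses the cached plan, even though 26/52
                           # on its own would favor a full table scan
```

The second execution is stuck with a plan that was optimal only for the first bind value, which is the “bad” side of bind peeking Morton described.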
To derive the cardinality estimate for range predicates, the optimizer uses the high and low value statistics. Note that depending on how statistics are gathered (for example, only for indexed columns), an unindexed column may have no column statistics collected at all, leaving the optimizer without high/low values to work with.
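A commonly cited form of this estimate for an open-ended range predicate such as `col < limit` is (limit - low) / (high - low); treat this sketch as an approximation of the idea, not the optimizer’s exact formula. Mapping card ranks to the numbers 2 through 14 (Ace high) gives low = 2 and high = 14:

```python
# Approximate range selectivity for "col < limit" from low/high statistics:
#   (limit - low) / (high - low)
def range_selectivity(limit, low, high):
    return (limit - low) / (high - low)

# Card ranks mapped to 2..14 (Ace high): low = 2, high = 14.
sel = range_selectivity(10, 2, 14)     # predicate: rank < 10
estimated_rows = round(sel * 52)       # selectivity * row count
```

If the stored high/low values are stale, say the data now contains values above the recorded high, predicates in that out-of-range region get increasingly poor estimates, which is the “incorrect high/low statistics” problem Morton called out.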