I might be foolish thinking I can capture my notes as eloquently as I would like given I only have 30 minutes or so to post this entry. So I’m going to try my best to get my thoughts down on the topic of Oracle statistics and see what comes out in this blog.
A few weeks back I went to the 2009 Hotsos Symposium. While I openly admit this years was not very good compared to last years, I have to say Karen Morton’s presentation on Managing Statistics for Optimal Query Performance was the best paper and presentation overall. In fact it was so good, I would say it was in my top 5 overall of all papers/presentations I’ve partaken in with regards to Oracle.
Morton’s presentation really got me thinking about how poorly we deal with statistics. First off, we gather statistics only at the schema level. So we take a one size fits all approach. Second and probably the most concerning is that we have not performed any design of experiment to validate that what we recommend has a positive affect. Surprisingly, our test results often reveal that query performance blows after our statistics. Third, I don’t think anyone (including myself) has really spent the necessary time studying everything there is to know about manual gathering of statistics. We have taken a very ignorant approach to Automatic Statistics Gathering which in my opinion is foolish considering we truly don’t understand Oracle’s approach to automatic statistics gathering.
That leads me to where we are now. Statistics help the Oracle CBO (Cost Based Optimizer) make better decisions about access paths, join methods and join order. As Morton says, “Statistics matter!” Our approach contradicts that statistics matter. We sort of put our middle finger up at the CBO and say “let’s play a little roulette”. That’s no way for us to run a performance engineering team.
So I would like us to address statistics, as well as histograms (considering they are incredibly relevant to the topic of statistics) in the form of a formalized design of experiment. Here’s what I would like to do…
First, I would like for someone to take this project on for the next few weeks and champion the design of experiment. This person would be responsible for reading Morton’s paper and summarizing her main arguments without copying word for word what she wrote. I’m seriously looking for someone to interpret and understand her core arguments as though we are trying to argue them in a court of law or a public debate.
Once you are finished reading and reviewing the article (in a blog of course), I would then like you to argue what we are doing from either a positive or negative vantage point. Specifically, we approach statistics gathering from a fairly universal (one-size-fits-all) approach. In the same regard, we do very little from an initialization perspective to manage CBO parameters. Thirdly, we have weak understanding of the use of histograms. I think we understand our data (from a scaled data model perspective), but we ignore our own knowledge of the data in order to have minimized scaling and set-up times. Ask yourself this…If statistics should be updated after 10% changes in data, shouldn’t we reconsider the role statistics as part of clp-datagen? Could lack of statistics be altering our scaling times?
I actually thought about the last question a lot while at the conference. Suppose we scaled a system using the PVT model. One thing we see is that as we permeate from XS to M3 dimension, scaling gets slower and slower. We hypothesize that the cause of the slowdown has to do with the size of the entities we are creating. What if we challenged that notion by doing a few scaling exercises:
- Scenario A: Scale XS, ST and XL in a sequential manner
- Scenario B: Scale XL, ST and XS in a sequential manner
- Scenario C: Scale XS individually followed by ST individually followed by XL individually
- Scenario D: Scale XL individually followed by ST individually followed by XS individually
The idea behind these four scenarios is to measure whether scaling any of these dimensions in any particular order with the absence of updated statistics has any affect on scaling times. I should be able to compare Scenario A against B, as well as Scenario C against D. If possible, I could compare any of the 4 scenarios against each other. Depending on what we find, we might choose to introduce two additional scenarios:
- Scenario E: Scale XS individually and run statistics, scale ST individually and run statistics, followed by scaling XL individually
- Scenario F: Scale XL individually and run statistics, scale ST individually and run statistics, followed by scaling XS individually
Of course we would have to make some decisions about our statistics approach. We could assume ignorance and use our current methods. Or we could use this time to really understand the best approach to statistics gathering.
Upon completion of the article and the experiment, we need to tackle the problem of lack of knowledge with regards to statistics gathering in a more scientific manner. At this point I would suggest reviewing the following presentation from Oracle Open World in 2005.
Our goal is to address two fundamental questions:
- What’s the best way for us to gather statistics as part of the datagen process?
- What’s the best approach for us to use for gathering statistics with the PVT data model?
We might need to address more questions. What I mean by that is we can’t necessarily take a one size fits all approach to this discovery project as well. We are going to have to dig a little deeper into our application and identify a series of use cases (our most critical and complex) to get a better understanding on a few key points:
- How affected are these queries from a CBO perspective when we take a universal approach to gathering statistics?
- How uniform are the data sets for the key entities for these queries? Specifically, what’s the cardinality of the data set?
- Would these entities benefit from the use of histograms due variance and skew of data?
- Would these entities and their schema objects benefit from a more focused statistics gathering approach specific to these entities?
I’m going to propose a few use cases in the JIRA ticket I create for whomever works on this assignment. Most likely the queries will be Grade Center, Discussion Board, Assessment, Content and Layout (My Institution or Course) use cases.
Whoever takes on this project will be responsible for designing multiple experiments to include the scenario above for datagen, some form of PVT execution scenario measuring more throughput and resource oriented metrics and then some focus test metrics for the key use cases. I would expect 10053 tracing would be used as a measurement tool. Execution plans and Estimated plans should be presented within the artifacts. Finally, an argument about what we should do next for this project should be made. I am not sure if we will be able to complete all of our goals with such a short window.