Monthly Archives: March 2008

Microsoft Day One Review

Executive Summary
Bob Taylor came on-site for day 1 of a 5-day consulting engagement between Microsoft and Blackboard. The focus of this engagement is with regards to SQL Server 2005 Performance and Optimization. Bob arrived in the early afternoon. I started the session by reviewing the Blackboard Reference Architecture in order to provide Bob with a high-level summary of the Blackboard Academic Suite. I then discussed our most recent EDB benchmark emphasizing the changes we made in order to demonstrate our tuning and optimization process. We then spent the rest of the afternoon reviewing the system currently housing our Windows EDB benchmark. A breakdown of our discussions can be found below.

General Discussion
As part of the review of EDB, Bob and I jumped around the system (perf-2950-db2) reviewing the changes we had implemented as part of the EDB benchmark thus far. The first thing that came-up was the location of the transaction logs for each of the Blackboard schemas. Bob discussed a point that many others have mentioned before, that being we need to move the transaction log to another LUN. In our current deployment, data and transaction data files reside on the same physical LUN.

The second item we discussed was our recovery model. There are three different recovery modes (simple, bulk and full). All of the schemas are configured to Full, which means everything is being logged. Below are some points from a somewhat relevant, but out-dated article. I am going to discuss this in more detail with Bob tomorrow.

Simple
The simple recovery model allows you to recover data only to the most recent full database or differential backup. Transaction log backups are not available because the contents of the transaction log are truncated each time a checkpoint is issued for the database.

Full
The full recovery model uses database backups and transaction log backups to provide complete protection against failure. Along with being able to restore a full or differential backup, you can recover the database to the point of failure or to a specific point in time. All operations, including bulk operations such as SELECT INTO, CREATE INDEX and bulk-loading data, are fully logged and recoverable.

Bulk-Logged
The bulk-logged recovery model provides protection against failure combined with the best performance. In order to get better performance, the following operations are minimally logged and not fully recoverable: SELECT INTO, bulk-load operations, CREATE INDEX as well as text and image operations. Under the bulk-logged recovery model, a damaged data file can result in having to redo work manually based on the operations that are not fully logged. In addition, the bulk-logged recovery model only allows the database to be recovered to the end of a transaction log backup when the log backup contains bulk changes.

Review of TempDB
TempDB is one of the most critical areas for optimizing performance of SQL Server. Most complex query operations take place in this region of SQL Server. When configured incorrectly, performance can be dramatically affected. By default the space is not meant to shrink, nor is it recommended to auto-shrink the space. There is a penalty for this space growing. According to Bob, we need to do a more effective job at sizing this space from the start based on the needs of our application.

For example, in our environment TempDB is configured to start at 1GB and grow by a small amount based on an auto-growth policy. After a 5-hour 4 server cluster test, the TempDB grew to 2.1GB. The TempLog grew to 76MB in size. A quick test shows that a restart of services resizes the TempDB back to 1GB in size. Bob mentioned that sometimes clients might complain that after maintenance in which the database has been restarted that the system is slow or sluggish. Part of the problem could be the fact that the TempDB needs to grow. The second factor could be the flushing of the procedure cache during the restart operation.

A great resource that Bob pointed me onto is this blog from Craig Freedman. He’s apparently really big on describing how query processing works in SQL Server.

Bob also discussed GAM and SGAM (Global Allocation Map and Shared Global Allocation Map) which are important components of the TempDB. One nugget of information Bob gave us is that it is recommended that we create 1 TempDB data file per core (just data…no need to do log). So on our 8-core system that has a need for about 2+GB of space, we would need to create 8 files roughly 256MB in size each. Below are a couple other links worth reading:

  • http://support.microsoft.com/kb/328551
  • http://www.microsoft.com/technet/prodtechnol/sql/2005/workingwithtempdb.mspx
  • http://msdn2.microsoft.com/en-us/library/ms175195.aspx
  • http://sqlblogcasts.com/blogs/tonyrogerson/archive/2006/08/24/958.aspx
  • Installation and Physical Layout Best Practices
    The next area we covered was a great overview. Bob first discussed where to or better yet (where not to install) SQL Server. He mentioned not installing SQL Server on the same file system as the Operating System. He also mentioned that it would be in our best interest to configure TempDB on a third LUN. He mentioned that this could be the cause of unnecessary paging issues.

    While were talking about paging problems, Bob discussed best practices for virtual memory. Rather then setting virtual memory to 1.5 and 3 times the size of RAM, he mentioned it was in our best interest to set both min/max to 3 times the amount of RAM on the box. He also mentioned defragging the disk before persisting the virtual memory change.

    We jumped into physical file layout of the schemas. Bob wanted to understand why we had 10x4GB files under the PRIMARY File Group rather then 1x40GB file. I was surprised by this question, but essentially splitting the data across multiple files has no impact on performance unless they are on different File Groups and on different LUNs. The conversation jumped into clustered indexes. We looked at a few functions worth spending a little more time reviewing:

  • sys.dm_db_index_physical_stats
  • sys.dm_db_index_operational_stats
  • sys.dm_db_index_usage_stats
  • Other Links to Review

  • http://blogs.msdn.com/psssql/archive/2007/12/18/rml-utilities-for-microsoft-sql-server-released.aspx
  • http://support.microsoft.com/kb/944837
  • http://support.microsoft.com/kb/887057
  • Advertisements

    Hotsos Symposium 2008 Day Three

    Today was my final day at the conference. Well, actually not the day of this posting, but March 5th. Since I was traveling last night I was not able to get a posting into my blog. I had a great time at the conference. It was quite possibly the best conference that I’ve ever attended or participated in. The Hotsos Community is simply the best. I’m going to do a quick recap of day 3 below. I will only cover the sessions that I attended.

    Measure Once…Cut Twice
    Cary Millsap gave a really interesting presentation on Software Development practices based on an analogy of carpentry. Apparently Cary is an accomplished furniture builder. He sees a lot of parallels in building furniture as building software. What’s interesting is that analogy is usually the opposite “Measure Twice…Cut Once”. The idea behind this phrase it’s better to be cautious upfront in order to reduce risk later on. As Cary discusses in his paper, things change over time and all of that upfront planning doesn’t necessarily save you anything in the end. You might find that all of that upfront planning ends up hurting you in in the end.

    For any of our agile programmers…this was a great read…

    Better Visualization Tools
    When Neil Gunther walks into a room, more often then not he’s the smartest person in the room. The problem with Neil Gunther is that he knows he is smarter then everyone else and wants you to know he’s smarter then you. The premise of this presentation was about better visualization of performance data. All of his examples can be seen here. In his presentation, he did call out Apdex, which I am a participating member of. He apparently gave a presentation at the CMG Group last year on Apdex and visualization. Gunther is clearly more accomplished then me. So who am I in the scheme of things? A possible buyer of his books…well never again will I buy any of his books.

    Singing SQL: Natural Data Clustering
    Dan Tow presented my absolute favorite presentation on SQL Performance. This guy is the author of SQL Tuning from O’Rielly. He spent a fair amount of time talking about why Oracle prefers Nested Loops over Hash Joins and why it can be really important to identify when it’s appropriate to use a Hash versus a Nested Loop. Take a look at the article Dan presented.

    One thing about Dan…we need to bring this guy here for a seminar. I did touch base with him and he seemed quite interested.

    Average Active Sessions
    The last presentation I went to was by Kyle Hailey. I really enjoyed hearing from Kyle. He’s pretty straightforward and just makes a lot of sense. His presentation was on the Active Session History metric available in Oracle. He’s written a number of tools that can help pull the metric around Active Session History out of Oracle. Feel free to look at his presentation on wait events as well.

    Blackboard’s Scholar.com

    If you are reading this blog, then you probably know me or you searched some combination of the words performance + blackboard. One of the coolest development efforts we did at Blackboard was the launching of our Social Bookmarking site Scholar. I’m an avid contributor to the site.

    Why do I like the site so much? Well, a couple of reasons. First, we developed the site at Blackboard. Second, it’s incredibly fast and easy to use. Third, it quickly integrates with my browser, as well as our application. Finally, lots of our clients can see what I am posting about.

    If you get a chance, take a look at the site. Sign-up…it’s free!

    Quest Software Webinar on SQL Server Under the Hood

    Every now and then a webinar invitation comes my way. I typically glance at the invite and either forward it on to the team or throw it away. A few weeks back I received an enticing invitation from Quest to figuratively Go Under the Hood of SQL Server 2008. I thought it was going to be a discussion of SQL Server 2008. It turns out it was just another marketing presentation showing off Spotlight for SQL Server.

    The session was pretty well attended. There were probably 200 other callers. A DBA from My Space was the primary presenter. I totally forgot that myspace.com was one of the world’s largest Windows environments. The SQL 2008 learning series I took a few months back actually mentioned that so it wasn’t news to me. There were a couple of key things I got from the session.

    * We have now had this license in place for a year and we are not taking advantage of this tool.
    * Spotlight has the ability to recall information from historical events within the database.

    The fact that we haven’t been using the Spotlight tools is really troubling me. It will be on the team’s 2008 objectives ASAP. In fact, I think I’m going to spend some time coaching our team on how to use the tool as well to help augment their DBA skills. In regards to collecting historical information, this can be pretty enticing. With the EDB model of test at night, analyze during the day, we need to find a way to be able to look into the system after a night of testing. The question that we will need to figure out is whether the data comes from the Spotlight repository or direct from dynamic views within SQL Server. If it goes in Spotlight, then no worries as we can benchmark and restore as much as we want. If it’s in the SQL Server views, then we will need the team to not restore at the end of the night. That would leave us with only one test to review.

    There were a few links worth looking at:

  • SQLServerPedia
  • SQLCrunch
  • Quest SQL Server Community Site
  • How to Clean Up a Solaris Zone Configuration Issue

    I’ve been playing around with CPU Groups within a Solaris Container so that I could promote a bound CPU set to a given zone. Well I ran into an issue “_svcadm: Instance “svc:/system/pools:default” is in maintenance state_” after creating two processor sets and associating the sets to a pool. Because of this issue, I couldn’t get my pooladm command to work. Well, if this ever happens, simply make a copy of /etc/pooladm.conf and run the command below. This will allow you to restore to the original default configuration.

    svcadm disable pools
    delete the /etc/pooladm.conf file
    pooladm -e
    pooladm -s

    Hotsos Symposium 2008 Day Two

    Day two is in the books…well we still have a dinner/reception tonight, but essentially all of the lectures are complete. Today was far better then yesterday. I’ll give my session by session synopsis below.

    Why You Can’t See Real Performance Problems
    Cary Millsap gave an awesome presentation on “Why You Can’t See Your Real Performance Problems.” He started the presentation with a visualization of the execution of a critical business process in the form of a sequence diagram. The diagram was awesome because it helped simplify the meaning behind what is a bottleneck.

    Traditionally, bottlenecks have been defined as the resource with the highest utilization. So in his visual, image a simple sequence diagram with the following components left to right (Application Server — Local Area Network — Database — Operating System). In his visual (I’ll post it once I get it), the database was the most utilized resource. Millsap defined utilization as the (Amount of a Resource Used / Amount of Available Resource). Well anyways, the point of the visualization was that the CPU was the most utilized resource, but it was not necessarily the bottleneck. The bottleneck was where the greatest amount of latency was occurring. In Millsap’s fictitious example, the OS was where the greatest amount of latency existed.

    To figure out the resource demands of each component, Millsap re-emphasized the importance of profiling. A profile is defined as a spanning, non-overlapping account of response time for a given task. So after he went through the example, Millsap redefined bottlenecks. He quoted one of my favorite capacity planners, Daniel Menasce in defining a bottleneck as the component where a task spends most of its time.

    So what does this all mean? Well, basically we need to re-prioritize performance improvements based on business needs. Essentially, make sure your system works for your business and not your business working for the system.

    I recommend that you read his paper. The powerpoint he presented only touched a small part of the paper. The paper talks mostly about skew. Take a look at his example below:
    Example: All of the following lists have a sum of 10, a count of 5, and thus a mean of 2:

    A = (2, 2, 2, 2, 2) Has No Skew
    B = (2, 2, 3, 1, 2) Has a Little Skew
    C = (0, 0, 10, 0, 0) Has a High Skew

    Essentially, if we don’t understand our skew factor, whether it be for response times or resource instrumentation metrics, then we are not effectively looking at our data.

    Trending and Charting Using Sesspack and Excel
    One thing I really like about the Hotsos Community is their sharing and openness. One guy in particular is named Tanel Poder. This guy has written a number of awesome tools that we could easily implement and make use of in our lab. The first tool is called Sesspack. It’s a lot like statspack, but does session level analysis. The tool can be downloaded here. What I like about this tool are the variety of options that can be specified. Below is a snippet from the readme he included about how to use sesspack and the options you can pass to capture data:


    Reference:
    list.sql shows you snapshots with brief descriptions
    srs.sql shows you the session event deltas between selected snapshots
    sesspack.snap_me - snaps current session stats
    sesspack.snap_sid - snaps given SIDs stats
    sesspack.snap_orauser('') - snaps all given ORACLE user's sessions's stats
    sesspack.snap_osuser('') - snaps all given OS user's sessions's stats
    sesspack.snap_program('') - snaps all given programs session stats
    sesspack.snap_machine('') - snaps all given machines session stats
    sesspack.snap_spid('spid') - snaps all given SPID session stats
    sesspack.snap_cpid('cpid') - snaps all given client PID session stats
    sesspack.snap_all - snaps all session stats
    sesspack.snap_bg - snaps background processes sessions stats
    sesspack.snap_fg - snaps foreground processes sessions stats
    sesspack.snap_sidlist_internal('select sid from v$sesstat where ')
    - snaps all SIDs returned by custom SQL

    If you don’t want to install the sesstat package, you can run a lightweight version called the session-snapper. He wrote the tool because of an interesting blog about the lack of session level information. You can [download the script here|http://www.tanelpoder.com/files/scripts/snapper.sql]. He’s written a few blogs about snapper.

    He also introduced a tool called Perfsheet. It’s not quite finished for use by any of us. I can’t find it on his web site yet. He said he would post the tool by tomorrow. This tool is amazing. Basically, he’s built a very simple program within Excel that can connect to a database. Within the program you can run a query and then see exactly what wait events the sql is spending time on in a visual manner. I’ll post an individual blog on this one when it is posted.

    One final thing he showed us was really simple things you can do with formatting in SQL*PLUS. I had no idea you could generate HTML formatted screens with little effort. He provided some scripts for demonstration purposes. You can write this to excel as well.

    RAC Be Nimble…RAC Be Quick
    I finally got to meet Bert Scalzo from Quest Software. Bert was the author of a pretty awesome Dell/Quest paper on Oracle RAC from a few years back that we used during [our benchmark at the Dell lab|http://www.dell.com/downloads/global/solutions/public/White_Papers/hied_blackboard_dell_whitepaper.pdf] back in 2006. Bert did an OK job presenting his paper, but it could have been a little better. It was a marketecture presentation on the three core tools of Quest: Toad, Benchmark Factory and Spotlight. There was one link I would like to investigate from Quest. Oh and he mentioned a paper on Oracle Waits Defined by Kyle Hailey.

    RAC SQL Performance Tax
    I finally sat in a presentation by Michael Erwin from Hotsos. He’s considered by many in the Oracle World as the Father of RAC. We brought him to Blackboard last year to work with ASP. Much of his presentation was about parallelism (degree of parallelism) with queries and how it can have a huge affect on RAC performance.


    ----------------------------------------------------------------------------------------------------
    | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | TQ |IN-OUT| PQ Distrib |
    ----------------------------------------------------------------------------------------------------
    | 0 | SELECT STATEMENT | | 107 | 2782 | 3 (34) | | | |
    | 1 | PX COORDINATOR | | | | | | | |
    | 2 | PX SEND QC (RANDOM) | :TQ10001 | 107 | 2782 | 3 (34) | Q1,01 | P->S | QC (RAND) |
    | 3 | HASH GROUP BY | | 107 | 2782 | 3 (34) | Q1,01 | PCWP | |
    | 4 | PX RECEIVE | | 107 | 2782 | 3 (34) | Q1,01 | PCWP | |
    | 5 | PX SEND HASH | :TQ10000 | 107 | 2782 | 3 (34) | Q1,00 | P->P | HASH |
    | 6 | HASH GROUP BY | | 107 | 2782 | 3 (34) | Q1,00 | PCWP | |
    | 7 | PX BLOCK ITERATOR | | 107 | 2782 | 2 (0) | Q1,00 | PCWP | |
    | 8 | TABLE ACCESS FULL| EMP | 107 | 2782 | 2 (0) | Q1,00 | PCWP | |
    ----------------------------------------------------------------------------------------------------
    • Table is set to Parallel 2
    • PCWP = Parallel Child With Parent
    • P->P = Parallel to Parallel
    • P->S = Parallel to Serial

    The example cost plan above is what you would expect for a simple query with 2 degrees of parallelism. In the example below, we have a situation in which we have a serial to parallel query. Refer to metalink article (263414.1) as to why this is bad. It’s fairly obvious is that we unfairly are taking a serial query and attempting to parallelize. There’s no gain in doing this. It only makes performance worst.


    ------------------------------------------------------------------------------
    |Id| Operation |Name |Rows|Bytes|Cost| TQ |IN-OUT|PQ Distrib|
    ------------------------------------------------------------------------------
    | 0| SELECT STATEMENT | | 106| 2862| 3 | | | |
    |*1| HASH JOIN | | 106| 2862| 3 |23,01| P->S | QC (RAND)|
    | 2| TABLE ACCESS FULL|DEPARTMENTS | 27| 432| 2 |23,00| S->P | BROADCAST|
    | 3| TABLE ACCESS FULL|EMPLOYEES | 107| 1177| 1 |23,01| PCWP | |
    ------------------------------------------------------------------------------

    Hotsos Symposium 2008 Day One

    Day One is in the books for the 2008 Hotsos Symposium. It was pretty crazy getting here. I arrived around mid-night last night after 9 hours of airplane madness. Who would think that Dallas would have massive thunderstorms the first week of March? Certainly not me…

    The conference started off somewhat uneventful. There was no crazy welcome with smoke and flashing lights…just Gary Goodman, the CEO of Hotsos kicked-off the presentation covering logistics and welcomed the keynote speaker, [Cary Millsap|http://carymillsap.blogspot.com/]. For those of you who do not know Cary, well he’s one of the best performance engineers in the world. He wrote Optimizing Oracle Performance back in 2003. This book continues to be my most read performance book in my library. It’s quite frankly my favorite performance engineering book ever.

    Cary’s Key Note was a retrospective of his keynote from 2003. Back in the late 90’s and early 2000’s, “Oracle Tuning” was based on trial and error. The old way of thinking was to study the system as an aggregate, make a change (single change), observe the new performance characteristics and start the process again. Much of this process was based on improving percentages and ratios that had nothing to do with how users were actually perceiving performance. That’s where Method-R comes into play. Method-R is about focusing on the operations/transactions that are most important to the business. With Method-R, the idea of tuning is replaced by the process of optimization.

    Method-R forces the performance engineer to ask better questions:

    * Are the tasks fast? (Quantify/Measure)
    * Are the tasks efficient?
    * Is the system efficient?
    * What would happen if?

    Semantic Query Optimization
    The first presentation I attended was about Semantic Query Optimization by Toon Koppelaars. SQO is when you have two queries that are semantically equivalent if they return the same answer for any database state satisfying a given set of integrity constraints. The overall point about the presentation is to identify weaker predicates and replace them with more sound/stronger predicates. Definitely take a look at the presentation. It includes tons of example scripts in the zip archive.

    Leveraging Oracle’s Extended SQL Trace
    The next presentation I attended was [Leveraging Oracle’s Extended SQL Trace Data to Expedite Software Development. The idea behind this presentation is to use the 10046 Oracle trace as an instrumentation tool for software developers. Included in the archive are a bunch of perl scripts to parse the 10046 trace data. The premise of the presentation is to use 10046 to identify software anti-patterns that are not completely obvious, or are incredibly obvious, but don’t make sense for performance.

    Dealing with Software Agoraphobia
    The third presentation I attended was a Solaris focused system performance engineering presentation on software agoraphobia. This was a heavily sun-focused presentation with an emphasis on prstat, lockstat and d-trace. The emphasis of the presentation was to focus on CPU latency rather then aggregate measures. Take a look at the speaker’s blog.