In the past I have talked about how the two easiest areas of the application stack to study are the front-end client experience and the database. In this post I want to cover a few thoughts about E2E from a query perspective, and I'm hoping that others on the team will add to them.
There's a reason I call out queries rather than the database. The database is a system component, whereas a query is transactional in nature. That's an important distinction, because our E2Es are supposed to be about the transaction, not the system. So when we look at queries from an E2E perspective, we are really trying to isolate latency exhibited at the persistence tier. I like to classify the culprits of that latency in the following four ways:
- Data Access or Persistence Layer Design Anti-Patterns
- Inefficient SQL Design
- Poor database structural design to support querying of one or more entities
- Lack of understanding of the data set the query interacts with
To the naked eye, a query like `SELECT * FROM XXX WHERE YYY = :1` looks harmless. It translates to: select all columns from table XXX where the predicate condition on column YYY equals some value. A query like this is wrong in so many ways. The first and most obvious is that by requesting all columns, we immediately introduce a Wide Load data access pattern. We have a very simple mechanism inside our persistence framework to filter a SimpleSelect so that we retrieve only the columns we need. I can recall a few cases where we requested all columns of a 30-column table only to make use of 2 values. It turned out that one of the columns we retrieved but never used was storing binary data. We essentially returned mounds of unnecessary data over the wire only to throw it away at the presentation layer. What a waste!
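To make the Wide Load pattern concrete, here is a minimal sketch using SQLite via Python's `sqlite3`. The `orders` table and its columns are hypothetical stand-ins for a wide table whose unused BLOB column rides along on every `SELECT *`:

```python
import sqlite3

# Hypothetical table: a stand-in for a wide table where one column holds binary data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, total REAL, payload BLOB)")
conn.execute("INSERT INTO orders VALUES (1, 'OPEN', 9.99, zeroblob(1024))")

# Wide Load: fetches every column, including the 1 KB BLOB the caller never uses.
wide = conn.execute("SELECT * FROM orders WHERE id = ?", (1,)).fetchone()

# Narrow select: only the two values the caller actually needs.
narrow = conn.execute("SELECT status, total FROM orders WHERE id = ?", (1,)).fetchone()

print(len(wide), len(narrow))  # 4 columns returned vs 2
```

The data is identical from the caller's point of view, but the narrow select never moves the unused binary payload over the wire.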
What else is deceiving about this query (and I didn't mention it above) is that the persistence code calling it could be calling it N times for N records. This so-called Round-Tripping pattern has been seen throughout the years in our application, and more often than not it could have been avoided by working through the query needs with greater precision. It's fairly obvious that making unnecessary round trips to the database is simply a waste. It's also a waste from a query perspective, which I will go into in a little more detail below.
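A rough sketch of the Round-Tripping pattern and its set-based fix, again using a hypothetical `items` table in SQLite; the point is the shape of the access, not the specific schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (order_id INTEGER, sku TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, 'A'), (2, 'B'), (3, 'C')])
order_ids = [1, 2, 3]

# Round-Tripping: one query per record -- N trips to the database for N records.
per_row = [conn.execute("SELECT sku FROM items WHERE order_id = ?",
                        (oid,)).fetchone()[0]
           for oid in order_ids]

# Set-based alternative: all N records fetched in a single round trip.
placeholders = ",".join("?" * len(order_ids))
one_trip = [row[0] for row in conn.execute(
    f"SELECT sku FROM items WHERE order_id IN ({placeholders}) ORDER BY order_id",
    order_ids)]

print(per_row == one_trip)  # same data, one trip instead of three
```

Over a real network, the per-row version pays a full client/server round trip per record, which dominates latency long before the query itself does.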
The best developers understand how to write good SQL. I say that with all seriousness, because I am a firm believer that OO developers aren't just coders in their given language. The best OO developers understand data access and entity modeling. To these developers, SQL isn't a foreign language that only DBAs understand; rather, it's a language in which they are fluent.
Sometimes people make mistakes… we all make mistakes, right? Inefficient SQL design is no different than making a mistake. It's not as if the developer responsible for writing a particular query was trying to be malicious; he or she simply made a mistake. Being able to identify inefficient SQL is a priority for a performance engineer.
Inefficient SQL can come in many forms. The most common are the following:
- Incorrect driving table
- Poor join order
- Incorrect join type
There are others as well. I think it's important for us to simplify the problem a few ways. SQL is not inefficient simply because it runs slow; we can have inefficient SQL that runs fast. The inefficiency comes from poor use of interfaces and resources. Queries that perform unnecessary logical I/Os are one example of inefficient SQL; they have most likely chosen a poor access path, join order, and/or join type. Queries that have to be continuously parsed are another example. A third is failing to identify predicates that are necessary for minimizing the available result set.
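The "continuously parsed" case is worth a quick illustration. In this sketch (SQLite via `sqlite3`, hypothetical table `t`), embedding literals produces a distinct statement text per value, so the database must parse each one separately; a bind variable yields one statement text that can be parsed once and reused:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, str(i)) for i in range(100)])

# Literals: every distinct id produces a distinct SQL text, each parsed on its own.
hard_parsed = [f"SELECT val FROM t WHERE id = {i}" for i in range(3)]

# Bind variable: a single SQL text, parsed once and re-executed with new values.
results = [conn.execute("SELECT val FROM t WHERE id = ?", (i,)).fetchone()[0]
           for i in range(3)]

print(len(set(hard_parsed)), results)  # 3 distinct texts vs 1 reusable statement
```

Each literal query runs fast in isolation, which is exactly the point above: the SQL is still inefficient, because it wastes the parse/plan cache rather than elapsed time on any single execution.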
It goes without saying that it's absolutely essential to make sure the structural aspects of the database, specifically the DDL, are appropriately defined to support our queries. Developers have to be aware of their entity structures. They also have to be aware of data access patterns, data orientation (the cardinality or makeup of the data), and frequency of data access in order to make sound design decisions about DDL.
As part of E2E, it's imperative that we review the supporting DDL structures for our queries to ensure efficient query execution. It's important to exercise executions of the different use cases that make use of a given entity or set of entities. The reason is to understand why certain index structures were defined: they may have been created specifically for one use case and never considered for the use case you happen to be working on.
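One cheap way to do that review is to ask the optimizer directly. This sketch uses SQLite's `EXPLAIN QUERY PLAN` (the table, index name, and queries are hypothetical) to show an index serving the use case it was defined for while a second use case against the same entity falls back to a scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, region TEXT, balance REAL)")
# Index presumably defined for an earlier, region-based use case.
conn.execute("CREATE INDEX idx_accounts_region ON accounts(region)")

def plan(sql):
    # Ask SQLite how it would execute the query, without running it.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# The region use case is served by the index...
by_region = plan("SELECT id FROM accounts WHERE region = 'EU'")
# ...but a new use case filtering on balance cannot use it and scans the table.
by_balance = plan("SELECT id FROM accounts WHERE balance > 100")

print(by_region)
print(by_balance)
```

Running each use case's queries through this kind of check makes it obvious when the existing DDL was shaped around someone else's access pattern.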
This last point is what I would call "King of the Obvious" from an SPE perspective. If you get a chance to read my old blog post, plus the attached article from Karen Morton, you will really get a sense of how important it is to understand the data you are interacting with. As I mentioned above, how do you detect a Wide Load anti-pattern if you don't know what data you need? And what about the value of an index? If the data doesn't have much variation, does it really make sense to index a column for which the optimizer is most likely going to choose a Full Table Scan anyway?
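Knowing the makeup of your data can be as simple as measuring it. This sketch (hypothetical `events` table, SQLite) computes a column's selectivity, i.e. distinct values over total rows; a value near 1.0 suggests an index can discriminate well, while a value near 0.0, as with this two-valued flag, is exactly the case where the optimizer tends to favor a full scan:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, flag TEXT)")
random.seed(0)
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, random.choice(["Y", "N"])) for i in range(1000)])

# Selectivity = distinct values / total rows for the candidate index column.
distinct, total = conn.execute(
    "SELECT COUNT(DISTINCT flag), COUNT(*) FROM events").fetchone()
selectivity = distinct / total

print(distinct, total, selectivity)  # 2 distinct values across 1000 rows
```

A one-line query like this, run before proposing an index, answers the "does the data have enough variation?" question with numbers instead of guesses.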
At the heart of SPE, we preach that we understand both the functional and the technical. This last point is really an amalgam of the two: you need to know your data needs (the orientation of the data you are querying) in order to move forward with a SQL optimization or a SQL acceptance.