Monthly Archives: June 2011

Top 10 Users…Why This is Important

Added by Stephen Feldman, last edited by Stephen Feldman on Mar 23, 2011

I’ve been meaning to write this blog for quite a while. Procrastination has definitely held me back…that and the fact that what I want to do is no simple task. When you want a complex task to be done, you’d better have your thoughts organized.

Over the years, I’ve considered so many different ways to forensically study what people do in the system. I’ve looked at logs. I’ve looked at our ACTIVITY_ACCUMULATOR table. I’ve looked at aggregates of data as well. I’ve brought in tools like Coradiant, Dynatrace and Quest User Performance Management. None of these tools has ever met my real needs. The reason is that I haven’t been able to articulate what I am really in search of.

I think I’ve had a few eureka moments as of late about what I’m interested in seeing. I know that I want to see what is being done, and when, in our product. I know that I want to understand the sequence of events and where those events happen in the system. I want to understand the probability of something happening. I want to see the frequency of something happening, and in the case of frequency, I want to understand the related volume. I think all of this data is relevant because it will give us more insight into predicting patterns of usage of a system.

Where a lot of this has come from centers around conversations I’ve had recently about Assessment performance. A lot of customers have been complaining about high-stakes assessments in which they have hundreds of students taking tests all within a lab. They have been complaining about both memory issues (makes sense) and I/O issues (inserts/updates on QTI_RESULT_DATA), which also makes sense. In the case of I/O they didn’t really call the issues out. Rather, after discussing, I pointed out that there likely were some I/O issues based on the behavior of an assessment. One of the things I’ve been suggesting to customers is to query the QTI_RESULT_DATA table to get a result set of rows inserted versus modified. Then put it in a scatter plot (from an isolated period of time) to see the volumes of inserts versus updates and when the timeslices of these events were occurring. From that data, then go into their I/O sub-system, graph their IOps for those same periods of time and overlay the two charts…

SQL> desc qti_result_data;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 QTI_ASI_DATA_PK1                          NOT NULL NUMBER(38)
 PK1                                       NOT NULL NUMBER(38)
 POSITION                                           NUMBER(38)
 ASI_PLIRID                                         VARCHAR2(255)
 ASI_TITLE                                          NVARCHAR2(255)
 DATA                                               BLOB
 BBMD_RESULTTYPE                                    NUMBER(38)
 PARENT_PK1                                         NUMBER(38)
 BBMD_DATE_ADDED                                    DATE
 BBMD_DATE_MODIFIED                                 DATE
 BBMD_GRADE                                         NVARCHAR2(32)
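The scatter-plot preparation described above can be sketched in a few lines of Python. This is a rough illustration under my own assumptions: it presumes you have already fetched (BBMD_DATE_ADDED, BBMD_DATE_MODIFIED) pairs from QTI_RESULT_DATA for an isolated window, and the function name and bucketing rule are mine, not anything shipped with the product.

```python
from collections import Counter
from datetime import datetime

def bin_qti_activity(rows, bucket_seconds=60):
    """Bucket (BBMD_DATE_ADDED, BBMD_DATE_MODIFIED) pairs into insert vs.
    update counts per time slice.

    A row whose modified time equals its added time is counted only as an
    insert (in the added-time bucket); otherwise the later modification is
    also counted as an update (in the modified-time bucket).
    """
    def bucket(ts):
        # Truncate the timestamp to the start of its time slice.
        return int(ts.timestamp()) // bucket_seconds * bucket_seconds

    inserts, updates = Counter(), Counter()
    for added, modified in rows:
        inserts[bucket(added)] += 1
        if modified is not None and modified != added:
            updates[bucket(modified)] += 1
    return inserts, updates
```

The two counters are exactly the series you would scatter-plot and then overlay against IOps graphs from the storage side for the same window.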


Back to My Point

So all of this talk about using scatter plots to isolate when certain events happened en masse got me thinking about why I wasn’t getting what I really wanted (aka…my rambling above). What I really wanted was to create an identity for a user. I didn’t care about their name, just their role. I would call them “Insanely Ambitious Student” or “Constantly Connected Teacher”. It really doesn’t matter. What matters is that you can start building profiles about these users. Before you can build the profile, though, you have to have a starting point.

My starting point is to look at every entity in the system. I would like to be able to directly or indirectly trace back a row of data to a user. It’s not as simple as you might think. First off, not every table has a foreign key relationship to USERS. Some tables tie back to COURSE_USERS instead, which is not a problem per se, but it means this isn’t a straight-up look at each table for a USER_PK1 foreign key.

As a starting point, I would like to do a gap analysis to determine which entities can be directly tied back to the user. From that, we need to know whether the row entry can be presented as a time/date value. In some cases, the entity can even distinguish the initial INSERT from an UPDATE. We really need to understand this system-wide, which means yes, we could/would touch the monster ACTIVITY_ACCUMULATOR table.
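To illustrate what the gap analysis could look like, here is a hedged Python sketch. It assumes you have already harvested (child_table, parent_table) foreign-key pairs, for example from Oracle's ALL_CONSTRAINTS dictionary view; the function name and the DIRECT/INDIRECT/NONE labels are my own, and the example tables in the usage note are only illustrative.

```python
from collections import defaultdict

def user_traceability(fk_edges, user_table="USERS"):
    """Classify each table as DIRECT, INDIRECT, or NONE with respect to
    tracing its rows back to USERS.

    fk_edges: (child_table, parent_table) foreign-key pairs, e.g. harvested
    from Oracle's ALL_CONSTRAINTS (CONSTRAINT_TYPE = 'R') joined to the
    referenced constraint's table.
    """
    parents = defaultdict(set)
    tables = set()
    for child, parent in fk_edges:
        parents[child].add(parent)
        tables.update((child, parent))

    def reaches(table, seen):
        # Walk FK chains upward (e.g. a table -> COURSE_USERS -> USERS).
        if table in seen:
            return False
        seen.add(table)
        return any(p == user_table or reaches(p, seen)
                   for p in parents[table])

    result = {}
    for t in sorted(tables - {user_table}):
        if user_table in parents[t]:
            result[t] = "DIRECT"
        elif reaches(t, set()):
            result[t] = "INDIRECT"
        else:
            result[t] = "NONE"
    return result
```

Run against the real dictionary, the NONE bucket is the interesting one: those are the entities we cannot yet tie back to a user at all.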

We could even start with a single entity as a starting point. I would even settle for an entity that stores USER_PK1 in it. It has to be a table that can present a many-to-one reference of rows to a user. A good example might be MSG_MAIN as a starting point, since it covers all of the criteria.

We could easily look at time-series data by user, as well as aggregate statistics. Both are relevant, but obviously time series is a little more visual. I think you need aggregate statistics, or at a minimum binned data (binned by time series per user), like aggregate counts by user over each week, as a key data point.

SQL> desc msg_main;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 PK1                                       NOT NULL NUMBER(38)
 DTCREATED                                 NOT NULL DATE
 DTMODIFIED                                         DATE
 POSTED_DATE                                        DATE
 LAST_EDIT_DATE                                     DATE
 LIFECYCLE                                 NOT NULL VARCHAR2(64)
 TEXT_FORMAT_TYPE                                   CHAR(1)
 POST_AS_ANNON_IND                         NOT NULL CHAR(1)
 CARTRG_FLAG                               NOT NULL CHAR(1)
 THREAD_LOCKED                             NOT NULL CHAR(1)
 HIT_COUNT                                          NUMBER(38)
 SUBJECT                                            NVARCHAR2(300)
 POSTED_NAME                                        NVARCHAR2(255)
 LINKREFID                                          VARCHAR2(255)
 MSG_TEXT                                           NCLOB
 BODY_LENGTH                                        NUMBER(38)
 USERS_PK1                                          NUMBER(38)
 FORUMMAIN_PK1                             NOT NULL NUMBER(38)
 MSGMAIN_PK1                                        NUMBER(38)
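To make the "binned by time series per user" idea concrete, here is a rough Python sketch (the function name is mine) that turns (USERS_PK1, DTCREATED) rows pulled from MSG_MAIN into weekly counts per user. The same aggregation could of course be pushed down into SQL with a GROUP BY on a week-truncated DTCREATED.

```python
from collections import defaultdict
from datetime import datetime

def weekly_counts_by_user(rows):
    """Aggregate (USERS_PK1, DTCREATED) rows into message counts per user
    per ISO week. Returns {user_pk1: {(iso_year, iso_week): count}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for user, created in rows:
        iso_year, iso_week, _ = created.isocalendar()
        counts[user][(iso_year, iso_week)] += 1
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {user: dict(weeks) for user, weeks in counts.items()}
```

From there it is trivial to plot one line per user, or to rank users by total volume to surface the "Insanely Ambitious Student" profile.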

Journeys Into Forensic Fishing

As performance engineers we often spend our time studying forensics after an incident occurred and, most importantly, after the incident was recognized by one or more people. What happens when an issue occurs and nobody reports it? Lots of issues go unreported within our customer base. It’s amazing when I sit down with customers and they walk me through a scenario. I guess the amazing part is how nonchalant they are when they encounter an issue. Often they ignore the issue or work around it. They recognize the defect is a nuisance, but reporting the issue is often more trouble than it’s worth. We all know that most system administrators don’t have hours and hours of free time to scan log files looking for clues about issues that go unreported. Imagine a scene where cops wander around a busy street looking for a crime that may or may not have actually happened…It simply doesn’t happen.

Who’s to say it doesn’t have to happen? In fact, lots of great software companies and organizations actively seek out issues that go unreported. They build frameworks into their products to notify development and/or support teams that an issue occurred. While it might not be reported by the user, it’s actually being reported by the system. Now that would be a dream for me. Imagine an administrative feature set that sends some form of message when an exception occurs. Granted, we might get bombarded with hundreds or thousands of these messages, but we could easily build the capability to aggregate and analyze the data so that issues could be evaluated both individually and as a holistic trend.

Building an administrative feature wasn’t my intention for this blog. Rather, I wanted to talk about the topic of proactively evaluating issues that may or may not be reported by users. I call this forensic fishing. The idea is to strategically sample a software system with the intent of identifying a performance, scalability or even a functional issue as part of the sampling effort. The goal is to seek out issues, or catch some fish. It’s not like you go fishing with the intent of going home empty-handed, right?

So why do I use the phrase “strategically sample”? Well, my first strategic purpose would be to sample customers who are early adopters of a release. Let’s reward early adopters by responding to issues before customers report them. The second would be to sample a customer who has taken a patch or SP based on a fix we did. Our team works dozens of issues per release. We verify in our lab, but ideally we would verify in the field with an even greater degree of confidence. Third, the whole goal of fishing is to find issues. Issues creep from release to release and go unnoticed. We may find an old issue that gets exposed because our magnifying glass was fortunate enough to pass over it during a sampling period.

Of course our plan is to fish using a pretty awesome rod, aka…Dynatrace. The idea is to build a program with our colleagues in Managed Hosting to selectively pick customers for sampling. We have to identify when we sample and, most importantly, how long we sample. I’m a firm believer that we do not need to sample every aspect of the customer. We could run with a lightweight profile. We could selectively choose 1 node or maybe 2; if a customer had 10 nodes, it’s not worth our time to sample them all. We obviously need some low-hanging fruit to initially focus on and react to. A more sophisticated inspection could and should be performed as well. What would we be looking for? Everything from exceptions to unreleased connections to other forms of software anti-patterns. We would sample over short intervals, like 30 minutes, and use the compare features across profiles.

Broken Window Theory…Exceptions and Log Messages

When I first read The Tipping Point years ago, I remember Gladwell’s fascinating story about the NYC police and how they cracked down on crime by replacing broken windows and painting over graffiti every single day/week. So if a window was broken by a stone, a brick, a bullet, whatever…it was replaced as quickly as possible. If graffiti was on a building wall or a city vehicle, it was quickly painted over. The core idea was that if vandalism was left in place, disorder would invite even more disorder: a small deviation from the norm would set into motion a cascade of more vandalism. When addressed in a timely manner, the fixes established a standard that vandalism would not be tolerated.

So what does that have to do with Blackboard Learn?

Good question, Steve. Let’s start by talking about Exceptions and Log Messages. I’ve written two blogs, one about the cost of an exception and a second about using Dynatrace to evaluate exceptions and logs. Both discuss the great insight we can gain quite easily with Dynatrace.

In the visual below you have a case of neglect. Yuck! We have broken windows everywhere. The visual is a snapshot of almost 1 million exceptions generated by our PVT test, which runs for about an hour. This test had about 4100 unique user sessions that performed 56,000+ page requests. You read that right: nearly 1 million exceptions from an hour’s worth of activity by just 4100 unique users. That’s over 230 exceptions a second. Holy smokes, Batman!


I call out the broken window theory because I see this problem of throwing exceptions as epidemic in the product. Fortunately, I’ve got the Flex team starting to look into why we are throwing so many exceptions. We are wasting a ton of memory and CPU time by throwing these exceptions. Our goal is to get to no exceptions. This could, should and will become the standard going forward.

The visual below is also from the same PVT test. What we are showing here are all WARN, ERROR and SEVERE log messages (stack traces) that were raised during the test. It’s about 1100 log messages, roughly one log message every 3 seconds. Not good…and yes, another standard that we need to address.


Random Java Application Performance Blog Posting…Reader Beware

So I follow a user on Twitter named @JavaTune. I am not sure who @JavaTune is, but I was under the impression that it was a credible source given some of the user’s tweets. This morning I saw a tweet which pointed me to this blog post. It was kind of a random blog post with no substantial data backing any of the statements. Some of the tunables are worth investigating, but I really want to see substantial evidence supporting these tunables first.


Using Galileo and Dynatrace for Forensic Purposes

My intention with this blog is to share the process that I’ve become accustomed to using when analyzing results in Galileo and Dynatrace as part of a performance optimization exercise. It’s by no means the holy grail for forensic investigations, but it’s a solid approach from my perspective to narrowing down the problem.

There’s really no reason to create a “Chicken or Egg” kind of problem with this investigation. For simplicity, I’m going to say that I have a transaction in question. I’ve done an analysis of “Transaction X”, looking at my baseline results and my regression test results. I see that “Transaction X” took a serious hit: it went from 3s to 6s, double the baseline. Not a big deal, but big enough for it to be on my top regression issues list.

The first thing I want to do is map the data in Dynatrace to the data in Galileo. What’s interesting is that Galileo presents the Timestamp and Response Time; the Timestamp looks to be the end time. Dynatrace shows the Duration in the container (which should always be equal to or less than the Galileo response time) plus the start time. You have to do a little bit of math to figure out whether the PurePath in question matches the individual sample. Why am I doing this? A couple of reasons. First, it gives me confidence that the data is correct in both systems. Second, it gives me both the CPID and the user in question so that I can use that information for profiling. I would always profile with the system at rest. Even that’s not strictly necessary, as we can see “Suspension Duration” in the Dynatrace profile to tell whether we were waiting on Garbage Collection. We can also look at the Method time to see if we are ever waiting to get a database connection.
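That little bit of matching math can be captured in a small helper. This is a sketch under my own assumptions: times are epoch seconds, and the half-second slack is an arbitrary tolerance I chose, not anything either tool defines.

```python
def matches(galileo_end, galileo_response_s, dt_start, dt_duration_s,
            slack_s=0.5):
    """Heuristic match between a Galileo sample and a Dynatrace PurePath.

    Galileo reports an end timestamp plus a response time, so the implied
    start is end - response. Dynatrace reports a start time plus the
    container duration, which should be <= the end-to-end response time.
    """
    implied_start = galileo_end - galileo_response_s
    return (abs(dt_start - implied_start) <= slack_s
            and dt_duration_s <= galileo_response_s + slack_s)
```

Running every PurePath through a check like this against the Galileo sample quickly narrows the candidates down to the one transaction in question.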

Below is my Galileo details view as well. I match the Start/End times plus the duration. Now I have the user’s information and the CPID.

Before I go into any kind of profiling exercise, I want to review as much of the data as I have available in the Dynatrace recording. I start with the PurePath of course. This gives me an idea of my hotspots and where to investigate. In this case I can see the class ViewQuestionsActions is where I’m spending my time. Interestingly enough, it’s a Struts call. So most likely I don’t yet have enough information to make a call as to where my problem is or is not.

Next I dig into the tier breakdown so I can be confident of where to spend my time. If I see a lot of time is in Struts, but a ton of time also in JDBC, it could be that my Struts action is being masked by database latency. In the example below it looks like I’m spending about 5s in DB time, but close to 79s in Struts time. I wonder why.

I want to know where to look next. I obviously could go do a deeper YourKit profile with an emphasis on the ViewQuestionsAction. I want to be absolutely sure of the data, so I dig into the Methods view next for confirmation. Interestingly enough, I see 1 execution which accounted for 84 out of 89s of runtime. The key is understanding what this code is doing and why.

I still want to see where my database time is being spent.


Unfortunately I have an issue where I don’t see where the time is being spent. I do have the knowledge that we always reference Proxy9 when we prepare and Proxy10 when we execute. We can do a quick filter on the word Proxy and voilà, we see the statements in question. Note that we always prepare before we execute. So it’s a guarantee that the Proxy10 in question with no text is the statement from Proxy9.

So now I have the 4s+ of JDBC time narrowed down to the query referenced above, and I have my class in question above. That gives me 2 optimization exercises to focus on. The 84s is obviously my priority, but the 4s query is alarming in its own right.

What Should I Do Next?

As I mentioned above, there really are 2 optimization exercises. We really need to get to the bottom of the latency exhibited in the ViewQuestionsAction. Obviously part of my problem is that I’m using an Extreme data set. So I have some kind of workload-variation anti-pattern that gets worse and worse as the data set grows. If the problem were more uniform or static in nature, I wouldn’t see the transaction in question perform better with smaller data sets, which is exactly what we see.

I’m definitely going to run a YourKit profile with as much emphasis placed on that class and related classes. I could do the same with Dynatrace, but it’s not necessarily the intended purpose of Dynatrace. The Dynatrace tool helps us narrow down where to investigate forensically. Not only would I do a deeper profile with YourKit, I would also do a deep code review to better understand what are we doing functionally.

At this point I’m focused on decomposing which anti-pattern is introduced plus what functional problem we are trying to solve. If I truly understand what we are doing from a code perspective, I would then begin a refactoring exercise. If I didn’t understand both functionally and technically what we were doing, I would bring the problem to my architect or development colleagues for their input.

With regards to the second optimization exercise, I would start by investigating more about the query in question. We kind of luck out because it’s a single query execution that consumes 4.2s of time. So it’s not like I have to look at DAO anti-patterns. I can basically focus my attention on a query-tuning exercise.

To HTTP or Not To HTTP…That is the Question

Well, maybe it’s Jeopardy and rather than the question…it’s the answer. I’ve been thinking about this question for quite some time now. It comes in the context of how we should simulate/execute our EoE during a sprint. There really are three core ways we simulate: JUnitPerf, Selenium or HTTP via LoadRunner. For quite some time I pushed JUnitPerf as the primary way to test server-side code. I’ve done this because JUnitPerf is one of the best testing tools used in an agile development environment.

JUnitPerf doesn’t complete the story, though. For one thing, JUnitPerf does not provide browser characteristics. Second, JUnitPerf only gives you part of the picture from an API execution. It doesn’t provide the layers of statistics that are innate to our MVC (Struts/Servlets); we really only see the lower-level method execution. Third, in our own laboratory we don’t have the same maturity with JUnitPerf that we do with, say, HTTP (LoadRunner) or even Selenium.

HTTP simulation isn’t always our best course of action either. For similar reasons as JUnitPerf, we do not see browser characteristics, as it’s one layer removed from the browser. Second, our agile development process doesn’t always leave us enough runway time-wise to test over HTTP. Third, not all scenarios run over HTTP; they could be command-line oriented or service-oriented.

There is a key point about HTTP simulation that we can’t deny. That key point is it’s by far our greatest, most reliable and stable testing capability in our performance lab today.

JUnitPerf: When to Use It

Years ago a childhood friend of mine by the name of Andy Glover wrote this article about JUnitPerf. In the article he calls out that JUnitPerf is a valid testing tool, but it’s not necessarily the most accurate tool. I agree with Andy wholeheartedly. I’ve always seen JUnitPerf as an Agile SPE tool used to facilitate feedback, but not as the quintessential tool for performance feedback. SPEs should leverage JUnitPerf during the development lifecycle for cases when a critical scenario cannot be simulated in its final shape or form. If a scenario is HTTP driven, or even command-line or services-oriented, but is not entirely ready for showtime, JUnitPerf should be leveraged as a stopgap for giving feedback.

“While JUnitPerf is a performance testing framework, the numbers you place around it should be considered rough estimates…”

Selenium: When To Use It

We might have become victims of our own performance goals. I’ve seen a lot of Selenium tests over the past 18 months. In many cases I’ve been promoting heavier use of Selenium to accomplish our testing goals. But why? Selenium is not the be-all, end-all of simulation tools. It has a purpose and we can’t lose sight of what that purpose is. Selenium drives the browser in the same manner a user would. Therefore, you use Selenium as your test driver when you are predominantly concerned with what is happening or executing inside the browser, such as rich client-side interactions.

Justifying the use of Selenium because the scenario is executed via the front-end isn’t enough. We really have to break down the implementation from the perspective of which layer is executing code. If all that’s happening in the browser is rendering, then it doesn’t necessarily warrant a Selenium script. In a case like this, the code is predominantly executing server side, whether that be in the container or in the database.

What happens when the browser is responsible for handling a voluminous amount of data to present from a DOM perspective? For example, maybe we aren’t executing code in the browser, but we return a 4MB data set to render. Isn’t that a problem? Absolutely, and it might require extra work to accomplish. It may make sense to simulate using two tools, with the emphasis placed first on the server-side code and secondary emphasis placed on browser rendering.

HTTP Simulation: When to Use It

In our world, we use LoadRunner for HTTP simulation. So in the context of this section, it makes sense to use LoadRunner in the first place if the ultimate delivery method for the scenario is over HTTP. Second, when the state of the code is ready to be executed over HTTP. Third, when there are attributes of the simulation that are best tested using our HTTP testing infrastructure, such as timed tests, conditional abandonment, concurrency, workload throttling, etc…

We might find that HTTP simulation ends up being the home for 99 out of 100 tests we do. That’s OK in the long run, as long as HTTP simulation is capable of providing server-side feedback in a timely manner that does not jeopardize the browser experience. The key is timely feedback with our application. As an agile SPE shop, we have to be able to provide instant feedback to our teams so that the transition to the next sprint can be planned appropriately.

The Most Important Database Statistic

I can’t believe that in 2011 I’m asking myself to think back over the years as to what I deem the most important database statistic. I’m obviously talking about a performance, scalability, capacity, etc…statistic. I guess I’m not even talking about a statistic so much as a measurement. I’m a firm believer it’s Transactions/Second. It’s such a meaningful metric that can be converted to the minute, hour, day, etc…It can also be studied as time-series data. Oracle has long made this metric easy to capture and to visualize.


SQL Server exposes the Transactions/sec counter in the sys.dm_os_performance_counters DMV. The value stored there is cumulative, so you have to select the metric at Time X, then select it again at a later Time Y, determine the number of seconds that transpired between X and Y, and use the following formula: (Metric Y – Metric X)/(Time Y – Time X) = Transactions/Sec.
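A minimal sketch of that differencing step (the function name and argument order are mine):

```python
def transactions_per_sec(count_x, time_x, count_y, time_y):
    """Rate from two snapshots of a cumulative counter, where the later
    sample is Y. The stored value is a running total, so the rate is the
    difference in counts over the difference in seconds."""
    if time_y <= time_x:
        raise ValueError("sample Y must be taken after sample X")
    return (count_y - count_x) / (time_y - time_x)
```

For example, snapshots of 1000 and 7000 transactions taken 60 seconds apart work out to 100 transactions per second.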

Thoughts people?

Digging Into Dynatrace Exceptions and Logging

We have started to make use of Dynatrace in our day-to-day lives now as performance engineers. I’m trying to do my part as well so that I can contribute to the test analysis lifecycle. Today I’m going to start by providing an overview of two types of Run-Time Events: Exceptions and Logging. Both are critical features of Dynatrace, as they provide us with aggregate statistics about problems that live “under the covers”. Exceptions are used to provide error-handling capabilities for applications. Dynatrace makes use of run-time bytecode instrumentation to capture throwable classes, messages, and stack traces of Java classes that extend java.lang.Exception. Logging APIs and frameworks allow the developer to control which logging information is processed with arbitrary granularity. The Logging dashlet shows an aggregated view of the logging events, grouped by the class and method that issued the logging event, the Dynatrace Agent that recorded the logging event, the message string, severity, and name of the logger.

Quick View Into Exceptions

The visual you are looking at below comes from a Windows PVT run. It appears that we have nearly 800k exceptions that were generated during the course of a test. The first two exception types alone were responsible for 550k of them. We don’t quite know what the cost of an exception is to date; Dynatrace is still working that problem. I did some informal correlations. I’m still working the problem as well, but there are a couple of data points that I’m willing to share.


Let’s say you take 2 identical PurePaths and perform a PurePath comparison. The first PurePath is riddled with exceptions and the second one is completely clean, meaning no exceptions. In my example I was looking at a recorded session called “Win20minscalib_run”, specifically looking at (1) transaction: “LS Grade Center: Grade Detail Attempt Save And Continue Action”. We had 171 samples of this transaction according to Dynatrace. The worst transaction took 5.7s. It had a total of 17 exceptions (6 key-not-found and 11 constraint violations for NULL values).


What’s amazing in my mind is that when you review the exceptions in question, none of the columns throwing exceptions are actually “NOT NULL” columns. It’s a completely wasted exception.

SQL> desc attempt;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 GRADEBOOK_GRADE_PK1                       NOT NULL NUMBER(38)
 PK1                                       NOT NULL NUMBER(38)
 QTI_RESULT_DATA_PK1                                NUMBER(38)
 LINKREFID                                          VARCHAR2(255)
 SCORE                                              NUMBER(15,5)
 GRADE                                              NVARCHAR2(32)
 STUDENT_COMMENTS                                   NCLOB
 INSTRUCTOR_COMMENTS                                NCLOB
 FEEDBACK_FORMAT_TYPE                               CHAR(1)
 COMMENT_PUBLIC_IND                        NOT NULL CHAR(1)
 EXEMPT_IND                                NOT NULL CHAR(1)
 OVERRIDE_IND                              NOT NULL CHAR(1)
 INSTRUCTOR_NOTES                                   NCLOB
 NOTES_FORMAT_TYPE                                  CHAR(1)
 STATUS                                             NUMBER(38)
 ATTEMPT_DATE                              NOT NULL DATE
 DATE_ADDED                                         DATE
 DATE_MODIFIED                                      DATE
 LATEST_IND                                         CHAR(1)
 GROUP_ATTEMPT_PK1                                  NUMBER(38)
 STUDENT_SUBMISSION                                 NCLOB
 TEXT_FORMAT_TYPE                                   CHAR(1)

SQL> desc gradebook_grade;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 GRADEBOOK_MAIN_PK1                        NOT NULL NUMBER(38)
 PK1                                       NOT NULL NUMBER(38)
 COURSE_USERS_PK1                          NOT NULL NUMBER(38)
 COMMENTS                                           NCLOB
 NOTES_FORMAT_TYPE                                  CHAR(1)
 FOR_STUDENT_COMMENTS                               NCLOB
 FEEDBACK_FORMAT_TYPE                               CHAR(1)
 DATE_ADDED                                         DATE
 DATE_MODIFIED                                      DATE
 STATUS                                    NOT NULL NUMBER(38)
 HIGHEST_ATTEMPT_PK1                                NUMBER(38)
 LOWEST_ATTEMPT_PK1                                 NUMBER(38)
 LAST_ATTEMPT_PK1                                   NUMBER(38)
 FIRST_ATTEMPT_PK1                                  NUMBER(38)
 EXEMPT_ATTEMPT_IND                        NOT NULL CHAR(1)
 AVERAGE_SCORE                                      NUMBER(15,5)
 MANUAL_GRADE                                       NVARCHAR2(32)
 MANUAL_SCORE                                       NUMBER(15,5)
 VERSION                                   NOT NULL NUMBER(38)
 EXEMPT_IND                                NOT NULL CHAR(1)
 EXCLUDED_IND                              NOT NULL CHAR(1)

The second transaction is identical from a use-case perspective. It was smaller in object size, 13 versus 163 for the previous one (another data point to consider, btw). It was faster (only 104.52ms) and used less CPU time (5.8ms versus 264.27ms). This one only had a single exception, an exception mind you that did not show up in the first PurePath. This exception was “Entity type not found for key: caliper”.

Below is a quick PurePath Comparison of the two PurePaths in question. You can see that the two PurePaths are structurally completely different. I’m sure some of it was data related. When you see so many more method calls, you have to ask why. There could be a lot of reasons. It could be an application error. It could be a LoadRunner error. It could be an instrumentation error. The conclusion that I’m inferring is that there is a penalty for all of the exceptions, resulting in more method calls to account for them.


Quick View Into Logging

During the course of the run we issued close to 800 WARN, WARNING and ERROR messages via log4j and java.util.logging. These are obviously a lot more manageable than the exceptions. You can literally sort through every single message if necessary and tie it to a PurePath, which can then help you tie it to a LoadRunner transaction (CPID and VUser). From a forensic perspective this is awesome, as we can open up a ticket and have someone from Chris’ team dig into each of these.


I say Chris and AppLabs as a starting point (Tier 1) to review the issues because most of these appear as though they are scripting inefficiencies. You dig into the exceptions and they look as though we are trying to do something we aren’t allowed to. A fair number of them also look like simple defects, such as the “Not Found” resources listed at the bottom.

Digging Into Logging

My approach was to isolate PurePaths responsible for issuing a logging message. The best way to do that is the following:

  • Open Up the Diagnose Events –> Logging for your recorded session
  • Sort Descending by Count
  • Right Mouse Click on the Logging Event and Drill Into PurePaths

For the vast majority you will see that it’s very specific transactions issuing the same logging messages. I would start by evaluating the transaction in question first via LoadRunner to see if the exception can be generated on demand. I would then evaluate the same use case by hand with careful inspection to ensure I am functionally completing the scenario. I would evaluate the LR code to see that it’s in fact posting/getting everything correctly. If I determine that it’s not an LR issue, meaning I can create the same problem by hand, I would open up a bug.

What is the Cost of an Exception

This post should have been made in March. It’s old…

In 2 weeks we have Dynatrace on-site to do some product training. I’ve been fortunate enough to have been introduced over the past year to the training approaches that the Dynatrace consulting team takes with their customers. So I kind of know what’s going to happen when Asad comes back. He’s definitely going to grasp at the low-hanging fruit, and the biggest item I know he’s going to push hard on is unnecessary Java exceptions.

Below are the PurePath details of a teacher Grade Center request. The path is “/webapps/gradebook/do/instructor/getJSONData”, which was called twice by the same thread for the same request. I have a suspicion that it has to do with the Page Processing framework. Either way, if you dig into the request, you see a series of exceptions littered throughout the calls.


The first chart shows the details of three constraint violations. Looking into the exception, you see the chart below, which calls out that certain columns from COURSE_USERS (INTRO, NOTE and PINFO) are NULL. That could be a result of our data model. It could also be something all customers see, because these are table values that might seem irrelevant. I know that for the purpose of this query, this data is completely unnecessary. What’s even more frustrating is that we are raising a NULL value exception even though the table does not restrict NULLs on those three columns.

So it’s a little more complicated than just quantifying the cost. It comes back to poor programming decisions or poor DDL design, and I would argue it’s the first. We really only want PK1 and possibly a couple of other foreign keys from COURSE_USERS for what we are doing. We certainly do not need to unmarshal an NCLOB, let alone three of them, for no purpose. Imagine the memory impact if we actually were populating those fields.

SELECT pk1, dtmodified, users_pk1, crsmain_pk1, data_src_pk1, role, link_name_1, link_url_1, link_desc_1, link_name_2, link_url_2, link_desc_2, link_name_3, link_url_3, link_desc_3, photo_link, intro, note, pinfo, cartridge_ind, available_ind, receive_email_ind, roster_ind, row_status, enrollment_date, last_access_date, child_crsmain_pk1 FROM course_users WHERE  ( crsmain_pk1 = ? )  AND  ( users_pk1 = ? )  AND  ( row_status = ? ) 

Below is the table definition:

SQL> desc course_users;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 PK1                                       NOT NULL NUMBER(38)
 LINK_NAME_3                                        NVARCHAR2(100)
 CRSMAIN_PK1                               NOT NULL NUMBER(38)
 DATA_SRC_PK1                                       NUMBER(38)
 ROLE                                               NVARCHAR2(50)
 LINK_NAME_1                                        NVARCHAR2(100)
 LINK_URL_1                                         VARCHAR2(100)
 LINK_DESC_1                                        NVARCHAR2(255)
 LINK_NAME_2                                        NVARCHAR2(100)
 LINK_URL_2                                         VARCHAR2(100)
 LINK_DESC_2                                        NVARCHAR2(255)
 USERS_PK1                                 NOT NULL NUMBER(38)
 LINK_URL_3                                         VARCHAR2(100)
 LINK_DESC_3                                        NVARCHAR2(255)
 PHOTO_LINK                                         NVARCHAR2(500)
 INTRO                                              NCLOB
 NOTE                                               NCLOB
 PINFO                                              NCLOB
 CARTRIDGE_IND                             NOT NULL CHAR(1)
 AVAILABLE_IND                             NOT NULL CHAR(1)
 RECEIVE_EMAIL_IND                         NOT NULL CHAR(1)
 ROSTER_IND                                NOT NULL CHAR(1)
 SOS_ID_PK2                                NOT NULL NUMBER(38)
 ROW_STATUS                                         NUMBER(1)
 ENROLLMENT_DATE                                    DATE
 LAST_ACCESS_DATE                                   DATE
 CRSMAIN_SOS_ID_PK2                                 NUMBER(38)
 USERS_SOS_ID_PK2                                   NUMBER(38)
 DTMODIFIED                                         DATE
 CHILD_CRSMAIN_PK1                                  NUMBER(38)
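
Given the argument above, the shape of a narrowed query is straightforward. The column list below is illustrative only — my guess at the minimum we actually consume, not a vetted fix:

```sql
-- Hypothetical narrowed version: keys and status only, no NCLOBs.
SELECT pk1, users_pk1, crsmain_pk1, role, row_status
  FROM course_users
 WHERE ( crsmain_pk1 = ? )
   AND ( users_pk1 = ? )
   AND ( row_status = ? )
```

Beyond sidestepping the spurious NULL handling, this keeps the three NCLOB columns out of the result set entirely.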



What’s My Point About Dynatrace

So you are probably asking what my point about Dynatrace is. Well, I know for certain they are going to focus on exceptions, and they are right to do so. Dirty code is bad code; nobody can argue that. Yes, we should really fix these problems. First, this could have been a filtered load that selects only the columns we need. Second, we could avoid raising an exception for a NULL value on columns that do not restrict NULLs. Third, we could make the columns NOT NULL, but I wouldn’t do it.

The key point is that at some point, we are going to need to measure the true cost…
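
Measuring it doesn’t have to wait for tooling, though. Below is a minimal, hedged sketch (the class and method names are mine, not from our codebase) that contrasts a loop which throws and catches an exception on every pass with one that does the equivalent work without throwing. Most of the penalty typically comes from the stack trace being captured when the exception object is constructed.

```java
// ExceptionCostProbe.java -- illustrative microbenchmark, not a rigorous one.
// A serious measurement would control for JIT warmup and run many trials.
public class ExceptionCostProbe {

    static final int ITERATIONS = 100000;

    // Time a loop that throws and catches an exception on every pass.
    static long nanosWithExceptions() {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < ITERATIONS; i++) {
            try {
                throw new IllegalStateException("null column");
            } catch (IllegalStateException e) {
                sink += e.getMessage().length();
            }
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) System.out.println(sink); // defeat dead-code elimination
        return elapsed;
    }

    // Time the equivalent work without any exception being raised.
    static long nanosWithoutExceptions() {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < ITERATIONS; i++) {
            sink += "null column".length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) System.out.println(sink); // defeat dead-code elimination
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.printf("with exceptions:    %,d ns%n", nanosWithExceptions());
        System.out.printf("without exceptions: %,d ns%n", nanosWithoutExceptions());
    }
}
```

I’d expect the exception loop to be dramatically slower on any JVM, almost entirely because of stack-trace capture; just treat the absolute numbers with suspicion until the JIT has warmed up.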

AppDynamics…A Freeware Competitor to Dynatrace, Foglight and Introscope

I had an interesting initial discussion with AppDynamics. They are a competitor to Dynatrace, Quest Foglight and CA Introscope in the area of J2EE application visibility. Their product, also called AppDynamics, comes in both a freeware Lite version and an enterprise Professional edition, and digs into the JVM to provide analysis and measurements of run-time performance.

The product is agent based. Configuration is simple: you modify the Tanuki wrapper (wrapper.conf) to add a small set of start-up parameters. It includes a web-based console interface in which you can evaluate how much time is being spent where in the byte code. It’s pretty standard fare in terms of showing call trees and SQL statements. What I like is that it provides a quick, narrow window into hotspots in the code, so you don’t have to muck around with the interface. It also gives you the control to do deeper profiling (sampling) like Foglight does, so you can get a complete profile (with overhead, of course). You can define user experience thresholds (which all of these tools provide), but here they are right there in plain sight.


Beware that the freeware version is limited to 30 “Business Transactions”, which are analogous to the starting points of a PurePath. They aren’t just servlet requests; I saw JSP and JS references as well. You can prune them down to stay within your 30, and you can also rename them. For example, the request (/webapps/blackboard/execute/courseMain) could be renamed Course Home Page. This way you can target your 30 most important pages and limit your profiling to just those. Of course, this is on a per-JVM basis. The Enterprise version connects all of the JVMs together so that all tiers are covered and continuous monitoring can take place.


There are other features in the Enterprise version which I haven’t been able to evaluate just yet; I will get an Enterprise license shortly. These include the following:

  • Distributed/SOA related bottlenecks
  • Deadlocks
  • Payload/Input Data related errors
  • Memory / GC Behavior Monitoring
  • Memory Leak Detection
  • Memory Content Analysis & Accessor Tracking
  • Object Instance Tracking & Thrash Detection
  • 24/7 Monitoring of Custom Caches
  • Correlated Event Data (Payload, User Data, Hardware)
  • Alerts & Notifications
  • Incident Queue
  • Incident Workbench
  • Other workflow related features


How to Set It Up

The first thing you need to do is download the freeware version and send the zip file to the server you plan to instrument. Unzip it (the viewer and the agent ship together in one package), then unzip each of the two inner packages. I took the simple route and set up the agent under /usr/local/appdynamics and the viewer under /usr/local/appdynamics/viewer.

Modify the Wrapper.Conf

It’s easiest to modify the Tanuki wrapper by adding the following parameters. Make sure to add the host IP for the viewer. In my case, I used the same server that I instrumented, but you do not have to.
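
The exact parameters appeared in the original post as a screenshot that hasn’t survived. As a rough sketch only — the index numbers, agent jar name, and property names here are assumptions to verify against the README that ships in the AppDynamics package — the Tanuki additions look something like this:

```properties
# Illustrative sketch: pick wrapper.java.additional.N indexes that do not
# collide with the entries already in your wrapper.conf.
wrapper.java.additional.20=-javaagent:/usr/local/appdynamics/javaagent.jar
# The agent also needs the host of the viewer; the exact property name
# varies by version, so treat this line as a placeholder:
wrapper.java.additional.21=-Dappdynamics.viewer.hostName=<viewer-host-ip>
```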

Modify the Catalina.Policy

You will get a security exception if you don’t add the following to catalina.policy. It took me a solid 20 minutes to figure this one out. Luckily, I’ve seen this with Foglight and Dynatrace in the past, so it was easy to overcome.

// AppDynamics Grant
grant codeBase "file:/usr/local/appdynamics/-" {
permission java.awt.AWTPermission "accessClipboard";
permission java.awt.AWTPermission "accessEventQueue";
permission java.awt.AWTPermission "showWindowWithoutWarningBanner";
permission java.lang.RuntimePermission "exitVM";
permission java.lang.RuntimePermission "loadLibrary";
permission java.lang.RuntimePermission "queuePrintJob";
permission java.net.SocketPermission "*", "connect";
permission java.net.SocketPermission "localhost:1024-", "accept,listen";
permission java.io.FilePermission "*", "read,write,execute";
permission java.util.PropertyPermission "*", "read";
permission java.lang.RuntimePermission "getenv.*";