Debugging Java Thread Problems using RTI

…This is the first in a series of in-depth answers to user questions about RTI features.

A user wrote, “Once I figured out how to turn on the Thread Dump Snapshot, I was really impressed with all the info that’s presented there.   I’d like to use it to find a deadlock/hang in my Eclipse/RCP Java application.   How would I do that?”

Here’s our answer…

The RTI thread dumps collect information about the thread states, CPU usage, wait time, and blocking time, as well as monitor/lock information.  The first thing to do is to make sure thread dumps are enabled in the RTI Collector.  To do this you need to edit the collector configuration and check the BTO Options.  The THREADDUMP option controls the collection interval.

[Screenshot 1]

The default thread dump interval of 1 minute is designed to give a good overview, with low overhead.  For finer-grained thread issues you may need to set the thread dump collection interval to a short value, like 5 seconds.  This can cause a fair bit of overhead, so use it judiciously, and be certain to set the interval back after collecting some information.

Once you have thread dump snapshots, start by viewing the Thread Dump Timeline to familiarize yourself with the kinds of threads in your application or application server.  If you are lucky, your threads will have useful names; otherwise you will have to figure that out for yourself.  You can examine thread stacks by clicking on a snapshot cell, and examining stack traces can help you figure out what “unnamed” threads are doing.

[Screenshot 2]
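If you control the code that creates these threads, giving them descriptive names up front makes the timeline much easier to read.  A minimal sketch (the thread names here are just illustrative):

    // Minimal sketch: descriptive thread names are easy to recognize
    // in the Thread Dump Timeline (names below are just examples).
    public class NamedThreads {
        public static void main(String[] args) {
            Runnable work = () -> {
                // A pooled or library-created thread can rename itself.
                Thread.currentThread().setName("report-generator");
                // ... application work ...
            };

            // Or name the thread when you create it.
            new Thread(work, "order-processing-worker-1").start();
        }
    }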

If you have enabled the option for the JVM to detect deadlocks, the threads that are deadlocked will be marked in red.  Using the JVM option incurs extra runtime overhead, so you might leave it off in a production environment.  Even when the JVM option is off, RTI will still calculate thread deadlocks and mark them in red, provided lock information is available.  If the JVM option is off and no lock information is available, you will have to identify deadlocked threads manually.

[Screenshot 3]

[Screenshot 4]
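As a point of reference (separate from RTI's own mechanism), the standard java.lang.management API can perform the same kind of check programmatically; a minimal sketch:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class DeadlockCheck {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();

            // Returns the IDs of threads deadlocked on monitors or
            // ownable synchronizers, or null if there is no deadlock.
            long[] ids = mx.findDeadlockedThreads();
            if (ids == null) {
                System.out.println("No deadlock detected");
                return;
            }

            // Print the name, contended lock, and lock owner for each deadlocked thread.
            for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
                System.out.printf("%s is waiting on %s held by %s%n",
                        info.getThreadName(), info.getLockName(), info.getLockOwnerName());
            }
        }
    }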

To identify potential deadlocks manually, look for threads that have locked resources and are also waiting for resources.  You can find them by examining each thread stack for threads that are both holding a lock and waiting for a lock.  The next two images show the thread details for the deadlocked threads identified above.

[Screenshot 5]

[Screenshot 6]
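The classic shape of such a deadlock is two threads that each hold one lock while waiting for the other's.  A minimal, purely hypothetical reproducer of that pattern:

    // Hypothetical minimal reproducer: each thread holds one lock while
    // waiting for the lock the other thread already holds.
    public class LockOrderDeadlock {
        private static final Object lockA = new Object();
        private static final Object lockB = new Object();

        public static void main(String[] args) {
            new Thread(() -> {
                synchronized (lockA) {
                    pause();                      // give the other thread time to take lockB
                    synchronized (lockB) { }      // blocks forever: lockB is held by worker-2
                }
            }, "worker-1").start();

            new Thread(() -> {
                synchronized (lockB) {            // opposite acquisition order
                    pause();
                    synchronized (lockA) { }      // blocks forever: lockA is held by worker-1
                }
            }, "worker-2").start();
        }

        private static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }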

One way to diagnose potential thread deadlocks is to look for a set of threads that are blocked for a long period of time.  Even more suspicion is warranted if the threads make no progress and show the same stack trace snapshot after snapshot.  The Thread Dump Timeline will generally mark these threads in orange and use the less-than (“<”) character to indicate that the thread state is the same as in the previous snapshot.  Checking the Thread Blocked Timeline can confirm that threads are blocking; they will be marked in orange.  (See the resource starvation example below for screenshots of what this might look like.)

Another thread issue is resource starvation.  In this scenario, a number of threads compete for a “scarce” resource.  Generally this leads to one or more threads that get the resource and a number of other threads that wait for it.  A single higher-priority thread may always control the resource, or the set of threads may trade control of it.  Again, the Thread Timeline View and the Thread Blocked and Waited Views will help.

In this example (the Dining Philosophers) three threads are contending for the same set of resources.  The higher priority thread (Plato) can make progress, but the other two (Aristotle and Socrates) are blocked waiting for the resources Plato has.  (Plato is shown as waiting because it uses Thread.sleep() to simulate processing.)

[Screenshot 7]
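As a rough model of that contention (thread and resource names follow the article; the actual demo code may well differ), all three philosopher threads want the same two chopstick locks, and only the one holding both makes progress:

    // Rough sketch of the dining-philosophers contention described above:
    // each philosopher repeatedly grabs the same two chopstick locks and
    // "eats" (sleeps) while holding them, so the others sit blocked.
    public class DiningPhilosophers {
        static final Object chopstick1 = new Object();
        static final Object chopstick2 = new Object();

        static Thread philosopher(String name, int priority) {
            Thread t = new Thread(() -> {
                while (true) {
                    synchronized (chopstick1) {
                        synchronized (chopstick2) {
                            try {
                                Thread.sleep(50);   // simulate processing while holding both locks
                            } catch (InterruptedException e) {
                                return;
                            }
                        }
                    }
                }
            }, name);
            t.setPriority(priority);
            return t;
        }

        public static void main(String[] args) {
            philosopher("Plato", Thread.MAX_PRIORITY).start();      // tends to hold the chopsticks
            philosopher("Aristotle", Thread.NORM_PRIORITY).start(); // mostly BLOCKED on chopstick1
            philosopher("Socrates", Thread.NORM_PRIORITY).start();  // mostly BLOCKED on chopstick1
        }
    }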

Checking the thread state of Aristotle or Socrates shows the resources in contention:

[Screenshot 8]

And checking the monitors directly shows that the Chopsticks are contended (they are locked and have waiting threads):

[Screenshot 9]
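Again purely for reference, the same kind of monitor information (which threads hold which locks, and what each thread is waiting for) is available programmatically from the standard ThreadMXBean API; a minimal sketch:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MonitorInfo;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class MonitorDump {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();

            // Dump all threads, including the monitors each one currently holds.
            for (ThreadInfo info : mx.dumpAllThreads(true, false)) {
                for (MonitorInfo monitor : info.getLockedMonitors()) {
                    System.out.printf("%s holds %s locked at %s%n",
                            info.getThreadName(), monitor, monitor.getLockedStackFrame());
                }
                if (info.getLockName() != null) {
                    System.out.printf("%s is waiting for %s%n",
                            info.getThreadName(), info.getLockName());
                }
            }
        }
    }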

A final thread issue is a CPU hog.  In this scenario a single thread, or a few threads, consume all the CPU for a time slice, starving other threads.  The Thread CPU Timeline can help spot these: threads that use a large percentage of the available CPU are highlighted.  Threads that continually consume a lot of CPU time are good candidates for review.
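A CPU hog is often nothing more than a thread that spins instead of blocking.  A minimal, hypothetical example of the pattern:

    // Minimal example of a CPU-hogging thread: it busy-spins polling a flag
    // instead of blocking, so it burns a full core and starves other threads.
    public class CpuHog {
        private static volatile boolean done = false;

        public static void main(String[] args) throws InterruptedException {
            Thread hog = new Thread(() -> {
                long counter = 0;
                while (!done) {
                    counter++;              // busy-wait: 100% CPU until 'done' flips
                }
                System.out.println("spun " + counter + " times");
            }, "cpu-hog");
            hog.start();

            Thread.sleep(10_000);           // let it spin long enough to show up in the timeline
            done = true;
            hog.join();
        }
    }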

The following example shows “request” threads that block waiting for work, then process the request and terminate.  They show high CPU for the duration of the processing.

[Screenshot 10]

[Screenshot 11]

Checking their thread state shows that they were only using a small percentage of CPU on an unloaded system.

The details show that since the previous snapshot this thread used 150 milliseconds of CPU, which was only 2.26% of the 6.64 seconds available, and that the thread has used a total of 700 milliseconds of CPU time since its creation.
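For reference, the arithmetic is simply CPU time divided by wall-clock time over the snapshot interval (0.150 s / 6.64 s ≈ 2.26%).  The standard ThreadMXBean API exposes the same per-thread CPU counters; a minimal sketch of the calculation (a real measurement would target a busy application thread rather than a sleeping one):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    public class ThreadCpuSample {
        public static void main(String[] args) throws InterruptedException {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            if (!mx.isThreadCpuTimeSupported()) {
                System.out.println("Per-thread CPU time is not supported on this JVM");
                return;
            }
            mx.setThreadCpuTimeEnabled(true);

            long id = Thread.currentThread().getId();
            long cpuBefore  = mx.getThreadCpuTime(id);    // nanoseconds of CPU used so far
            long wallBefore = System.nanoTime();

            Thread.sleep(1_000);                          // the "snapshot interval"

            long cpuDelta  = mx.getThreadCpuTime(id) - cpuBefore;
            long wallDelta = System.nanoTime() - wallBefore;

            // e.g. 150 ms of CPU over a 6.64 s interval is about 2.26%.
            System.out.printf("CPU over interval: %.2f%%%n", 100.0 * cpuDelta / wallDelta);
        }
    }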

In summary:

  • Run your application under RTI and add its collector to your RTI Console.
  • Enable and configure thread dumps in the BTO tab of the Configuration Editor.
  • Run your application such that it shows the problem(s) you’re looking for.
  • Examine the Thread Timeline for red, indicating deadlock; select those threads, examine the details, and then find the referenced methods and resources in your code.
  • If there is no red, look for orange in the Thread Timeline or Thread Blocked tabs, indicating long blocked periods; again, the Thread/Monitor Details will point you to where.
  • Also, look at the Thread CPU tab to see where other threads may be stealing CPU from the apparently stalled thread(s).

 …written by Steve North (RTI Product Manager, OC Systems)

Performance checkpoint…

APM (Application Performance Management).  Small acronym, big task.

Some organizations have it and use it.  Some say, “sounds like something my organization needs.”  Some can't identify problems in their applications because they don't have the resources or the manpower to gain that insight.  Plus, they have their own jobs to do.  Then something happens, and all of a sudden they need a performance management solution in place – yesterday.

Time…money…manpower.  Does this ring a bell?

Like when you are trying to modernize your system and are given an SLA of a 2-second response time for many thousands of internal users.  You need the ability to identify where slowness is occurring on an ongoing basis to maintain the response-time requirement; you have to measure the actual end-user response time and decompose where slow transactions are spending their time.  You need a complete understanding of the user experience and the ability to quickly correct slowness – check.

What if your organization experiences intermittent application crashes?  You suspect database problems.  Until you identify the source, you can't deploy the update into production.  The crashes and delayed releases cause a loss of revenue.  Ugh – what a nightmare.  I can hear management knocking on my door.  You need to find and focus on the actual problem and resolve the issue – check.

Customer support is a major concern.  Not only is it a big cost, but the more users you can drive to online support, the less money you spend on call centers.  Consistently good response time is key, and with low-overhead transaction monitoring and lightweight production profiling you can define monitoring and alerts on any code in your application – legacy, COTS, POJO, or JEE framework.  You can also respond to alerts by enabling diagnostic snapshots, giving you diagnostic-level traces of specific transactions.  Happy customers = returning customers – check.

If you have adopted and embraced the methodology and practice of Application Performance Management, you really can save money and catch correctness problems early in the software lifecycle.  Developers connect directly to local “sandbox” servers to collect performance metrics and profiles for application code under test.  The earlier a problem is detected in the software lifecycle, the less expensive it is to resolve.  Let your developers make sure their code performs well before it is delivered.  Have them analyze transaction profiles, export and share data, import diagnostic snapshots, correlate metrics to business transaction performance, and connect directly to the server.  Yes, application performance does impact the bottom line.  Be proactive, not reactive – check.

Migrations.  Large, tangled webs – and lots of work.  Public-facing, supplier, and internal applications moving to various platforms.  You know you will have application performance problems, and you want a tool in place to deal with these issues as early in the migration as possible.  Monitor and profile migrated applications and deep-dive to track specific performance problems.  Be prepared – check.

Too many cooks in the kitchen?  With software, there are always multiple people working on an application.  Analyze the performance of third-party, open-source code.  Profile and trace the execution of web services in a third-party, open-source system, then evaluate the performance of back-end storage strategies.  Leave the open-source code unmodified and still collect valuable performance information from lightweight profiles and deep-dive traces.  Don't avoid Pandora's box, open it up – check.

Georgia Ferretti handles all RTI sales, marketing and partnerships – contact gbf@ocsystems.com.