The changes to BTM-2.2 are the result of extensive profiling using JProfiler 7.0, including CPU-time analysis as well as lock-contention analysis.  For these analyses, the profiler was restricted as much as possible to just the Bitronix components.

This page compares the BTM-2.1.2 release with the BTM-2.2 branch in a real-world load test.  These profiling runs cover only one of the workloads in my company's product, but it is the most important one for our customers.  That said, I would expect the results of this analysis to be representative of other high-concurrency workloads.

CPU

First, let's look at CPU usage.  BTM-2.2 contains a fair number of concurrency changes, and it also moves away from the BTM-2.1.2 "proxy" wrapper around JDBC objects.  The proxy wrapper was designed to allow Bitronix to support JDBC3 and JDBC4 without a compile-time dependency on JDBC4.  Unfortunately, that design sacrificed runtime performance in favor of build-time simplicity.

BTM-2.1.2

As can be seen here, looking at one of the processing threads in the workload, 594ms were spent by this thread in the reflection-based BaseProxyHandlerClass.invoke method.  Total processing time of this thread was 1149ms.

BTM-2.2

BTM-2.2 replaces the reflection-based approach with concrete JDBC3 and JDBC4 classes that wrap and delegate.  The cost is increased build complexity; the benefit is improved runtime performance.  As can be seen here, looking at the same processing thread as above in the workload, the reflection invocation overhead is gone, and the total CPU time of the thread dropped from 1149ms to 839ms (a 27% improvement).

You might notice in this call tree, when compared to the BTM-2.1.2 call tree, that a lot of new methods appear: JdbcConnectionHandle.close(), various PreparedStatement methods, etc.  These were previously contained "within" the proxy invocation, but now appear as direct calls.  Subtracting the two thread totals (1149ms - 839ms = 310ms), we find that the proxy overhead was approximately 300ms for this thread.  That is roughly a quarter of the original 1149ms, a rather large percentage.
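To make the difference between the two wrapping styles concrete, here is a minimal sketch.  It is not the Bitronix source: SimpleStatement, DriverStatement, ReflectiveHandler and DelegatingStatement are hypothetical names, and a one-method interface stands in for the much larger java.sql interfaces.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Hypothetical stand-in for a JDBC interface such as java.sql.Statement.
interface SimpleStatement {
    int executeUpdate(String sql);
}

// Hypothetical "driver" object that the wrappers delegate to.
class DriverStatement implements SimpleStatement {
    public int executeUpdate(String sql) {
        return sql.length(); // placeholder result, no database involved
    }
}

// Style 1: reflection-based wrapper. Every call is routed through invoke().
class ReflectiveHandler implements InvocationHandler {
    private final Object delegate;

    ReflectiveHandler(Object delegate) {
        this.delegate = delegate;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        try {
            return method.invoke(delegate, args); // reflective dispatch on each call
        } catch (InvocationTargetException e) {
            throw e.getCause(); // surface the delegate's own exception
        }
    }

    static SimpleStatement wrap(SimpleStatement target) {
        return (SimpleStatement) Proxy.newProxyInstance(
                SimpleStatement.class.getClassLoader(),
                new Class<?>[] { SimpleStatement.class },
                new ReflectiveHandler(target));
    }
}

// Style 2: concrete wrapper. Each method is an ordinary direct call.
class DelegatingStatement implements SimpleStatement {
    private final SimpleStatement delegate;

    DelegatingStatement(SimpleStatement delegate) {
        this.delegate = delegate;
    }

    public int executeUpdate(String sql) {
        return delegate.executeUpdate(sql);
    }
}

class WrapperDemo {
    public static void main(String[] args) {
        SimpleStatement target = new DriverStatement();
        SimpleStatement reflective = ReflectiveHandler.wrap(target);
        SimpleStatement direct = new DelegatingStatement(target);
        // Both wrappers return the same value; only the dispatch path differs.
        System.out.println(reflective.executeUpdate("UPDATE t SET x = 1"));
        System.out.println(direct.executeUpdate("UPDATE t SET x = 1"));
    }
}
```

The concrete wrapper has to implement every interface method by hand, which is where the extra build complexity comes from when two JDBC versions must be supported.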

Monitors (lock contention)

While CPU time is important, and directly impacts the overall "load" on a server, another performance killer is lock contention.  In lock contention, very little CPU time is expended, but threads are blocked from performing useful work.  This extends the overall runtime of a given workload.  One area that immediately showed up in profiling Bitronix was the XAPool resource pool.  One or two threads contending for a resource may experience very little lock contention, but as the number of processing threads increases into the tens or hundreds, lock contention can become the primary bottleneck in a system.

BTM-2.1.2

In BTM-2.1.2, the XAPool is largely synchronized by a high-level lock (synchronized on the XAPool).  In addition, while holding this lock various actions are performed.  These include:

Performing these actions while holding the lock increases lock contention substantially in a high-concurrency environment.
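The general pattern of doing work while holding one pool-wide lock can be sketched as follows.  This is an illustration only, not the BTM-2.1.2 XAPool code: CoarseLockPool is a hypothetical class, and its validate step simply stands in for whatever work is performed under the lock.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical pool guarded by a single coarse lock (the pool's own monitor).
class CoarseLockPool {
    private final Deque<String> available = new ArrayDeque<>();
    private int validations = 0;

    CoarseLockPool(int size) {
        for (int i = 0; i < size; i++) {
            available.push("conn-" + i);
        }
    }

    // The entire method runs under the pool lock, so every other thread
    // calling acquire() or release() is blocked until it returns.
    synchronized String acquire() {
        try {
            while (available.isEmpty()) {
                wait(); // releases the monitor while waiting for a release()
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        String conn = available.pop();
        validate(conn); // extra work performed while still holding the lock
        return conn;
    }

    synchronized void release(String conn) {
        available.push(conn);
        notifyAll();
    }

    synchronized int validations() {
        return validations;
    }

    // Placeholder for per-acquire work; only called with the lock held.
    private void validate(String conn) {
        validations++;
    }
}

class CoarseLockDemo {
    public static void main(String[] args) throws Exception {
        CoarseLockPool pool = new CoarseLockPool(2);
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    pool.release(pool.acquire());
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println("validations under lock: " + pool.validations());
    }
}
```

The longer validate takes, the longer each thread holds the monitor, and the more often other threads find it taken.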

Here is a screenshot from JProfiler's Monitor Views showing the number of times the XAPool lock was contended for, and total time threads spent blocked during the load test:

Looking at the total time, at the bottom of the Duration column, you can see that a total of ~76 seconds was spent blocked on XAPool during the workload.  While each thread blocked for only 8ms on average, some blocked for as long as 390ms.

As you can see from the size of the scrollbar on the right side of the view, there were a lot of lock contentions.  In fact, threads were blocked 26269 times while interacting with XAPool.

I encourage you to download the attached zip file, and open the BTM-2.1.2-monitor-history.html file.  Its size (13.4MB) gives an indication of how many lock contentions occurred.

BTM-2.2

In BTM-2.2, the XAPool was refactored, along with small changes to related classes.  This is a summary of the changes:

By splitting the pools and using new collection types, the following advantages are gained (compare one-by-one to the BTM-2.1.2 bullets above):

I include these probably obvious observations about the BTM-2.1.2 XAPool:

While all of these changes sound extensive, the XAPool code is now 565 lines (ignoring added JavaDoc) compared to 524 lines in BTM-2.1.2.  Additionally, while the improvements in concurrency brought by these changes might seem abstract or academic, they are anything but that.  In fact, I don't hesitate to say the results are astounding.

For the exact same workload, here is a screenshot from JProfiler's Monitor Views showing the number of times a lock within XAPool was contended for, and total time threads spent blocked during the load test:

Looking again at the total time, at the bottom of the Duration column, you can see that a total of 372 milliseconds was spent blocked on XAPool during the workload, compared to ~76 seconds in BTM-2.1.2.  That is roughly a 200-fold reduction.

This is not because of shorter "lock duration" but because of far less lock contention.  Where BTM-2.1.2 had 26269 lock contentions in XAPool, BTM-2.2 had only 21.
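As a rough illustration of why splitting a pool and using concurrent collections reduces contention, consider the sketch below.  Assumptions: this page does not say which collection types BTM-2.2 uses, so the java.util.concurrent classes here are chosen only for the example, and SplitPool is a hypothetical class rather than the refactored XAPool.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical pool with separate "available" and "in use" collections.
// There is no pool-wide monitor; each collection handles its own thread safety.
class SplitPool {
    private final BlockingQueue<String> available = new LinkedBlockingQueue<>();
    private final Set<String> inUse = ConcurrentHashMap.newKeySet();

    SplitPool(int size) {
        for (int i = 0; i < size; i++) {
            available.add("conn-" + i);
        }
    }

    String acquire(long timeoutMillis) {
        String conn;
        try {
            // Waits only on the queue, and only when the pool is empty.
            conn = available.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        if (conn == null) {
            throw new IllegalStateException("no connection available");
        }
        inUse.add(conn);
        // Any per-connection work would happen here, outside any shared lock.
        return conn;
    }

    void release(String conn) {
        // Only hand back connections that were actually checked out,
        // which also guards against a double release.
        if (inUse.remove(conn)) {
            available.add(conn);
        }
    }

    int availableCount() {
        return available.size();
    }

    int inUseCount() {
        return inUse.size();
    }
}

class SplitPoolDemo {
    public static void main(String[] args) {
        SplitPool pool = new SplitPool(2);
        String conn = pool.acquire(100);
        System.out.println("in use: " + pool.inUseCount());
        pool.release(conn);
        System.out.println("available: " + pool.availableCount());
    }
}
```

In a design like this, threads only wait when the pool is actually exhausted, rather than whenever another thread happens to be inside the pool.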

Conclusion

To be clear, the difference between BTM-2.1.2 and BTM-2.2 is dramatic for my company's product.  Bitronix went from a component that often showed up high on the radar when profiling our overall application (both in CPU and lock-contention), to a component that is largely transparent to us.  Before, we had to "filter out" Bitronix when searching for performance problems in our application.  With BTM-2.2 that is rarely necessary.

Having seen the gains in BTM-2.2, it would be hard to "go back" to the current release version.  As it stands we intend to use BTM-2.2 in our next product release.