This section provides some tips on collecting performance numbers with Jikes RVM.
Which boot image should I use?
In short, the best-performing configuration of Jikes RVM will almost always be
production. Unless you really know what you are doing, do not use any other configuration for a performance evaluation of Jikes RVM.
Any boot image you use for performance evaluation must have the following characteristics for the results to be meaningful:
- config.assertions=none. Unless this is set, the runtime system and optimizing compiler will perform fairly extensive assertion checking. This introduces significant runtime overhead. By convention, a configuration with the Fast prefix disables assertion checking.
- config.bootimage.compiler=opt. Unless this is set, the boot image will be compiled with the baseline compiler and virtual machine performance will be abysmal. Jikes RVM has been designed under the assumption that aggressive inlining and optimization will be applied to the VM source code.
- Any configuration that performs opt compilation at runtime (config.include.aos=1) should be built with config.include.all-classes=1. This includes the optimizing compiler and associated support classes in the boot image, where they can be optimized by the boot image compiler. By convention, configurations that include the opt compiler in the boot image have the Full or Fast prefix. Configurations that use the optimizing compiler but do not set config.include.all-classes to 1 will load it dynamically, which forces it to be baseline compiled.
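Taken together, the requirements above correspond to settings like the following in a build configuration properties file. This is only a sketch of the relevant properties; the production configuration already sets all of them:

```properties
# Required for meaningful performance results:
config.assertions=none            # disable runtime assertion checking
config.bootimage.compiler=opt     # opt-compile the boot image
# If runtime opt compilation (the adaptive system) is included:
config.include.aos=1
config.include.all-classes=1      # put the opt compiler in the boot image
```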
What command-line arguments should I use?
For best performance we recommend the following:
- -X:processors=all: By default, Jikes RVM uses only one processor. Setting this option tells the runtime system to utilize all available processors.
- Set the heap size generously. We typically set the heap size to at least half the physical memory on a machine.
- Use a dedicated machine with no other users. The Jikes RVM thread and synchronization implementation do not play well with others.
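Putting these recommendations together, a typical invocation might look like the following sketch. The benchmark jar is a placeholder, and the 1000 MB heap assumes a machine with roughly 2 GB of physical memory; -Xms and -Xmx set the initial and maximum heap sizes:

```
rvm -X:processors=all -Xms1000M -Xmx1000M -jar benchmark.jar
```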
The compiler replay methodology
The compiler-replay methodology is deterministic: it eliminates the variations in memory allocation and mutator behavior that arise from non-deterministic application of the adaptive compiler. We need this methodology because the non-determinism of the adaptive compilation system makes it a difficult platform for detailed performance studies. For example, we cannot determine whether a variation is due to the system change being studied or simply to a different application of the adaptive compiler. The profile we record and replay consists of hot-method and hot-block information, together with a dynamic call graph carrying a calling frequency on each edge, which drives inlining decisions.
Here is how to use it:
- Generate the profile information, using the following command line arguments:
For an edge profile
For an adaptive compilation profile
For a dynamic call graph profile (used by adaptive inlining)
Typically you would run a benchmark several times and choose the set of replay data that produced the best performance.
- Use the profile you generated for compiler replay, using the following command line arguments:
Measuring GC performance
MMTk includes a statistics subsystem and a harness mechanism for measuring its performance. If you are using the DaCapo benchmarks, the MMTk harness can be invoked using the '-c MMTkCallback' command line option; for other benchmarks you will need to invoke the harness by calling the static methods at the appropriate places in your code. Other command line switches that affect the collection of statistics are:
- Print statistics for each mutator/GC phase during the run
- Print statistics in an XML format (as opposed to the human-readable format)
- This is incompatible with MMTk's statistics system.
- Disable dynamic resizing of the heap
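For benchmarks other than DaCapo, the harness bracketing described above can be sketched in plain Java as follows. The class and method names (org.mmtk.plan.Plan, harnessBegin, harnessEnd) are assumptions about the MMTk API; using reflection keeps the sketch compilable and runnable on any JVM, degrading to a no-op when MMTk is absent:

```java
public class HarnessedBenchmark {
    // Reflectively invoke an assumed MMTk harness method; no-op off Jikes RVM.
    private static void harness(String method) {
        try {
            Class.forName("org.mmtk.plan.Plan").getMethod(method).invoke(null);
        } catch (ReflectiveOperationException e) {
            // Not running on Jikes RVM/MMTk: statistics collection is skipped.
        }
    }

    // The workload whose behavior we want MMTk to measure (a placeholder).
    static long workload() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        harness("harnessBegin");   // start MMTk statistics collection
        long checksum = workload();
        harness("harnessEnd");     // stop collection and report statistics
        System.out.println("checksum = " + checksum);
    }
}
```

The essential point is that only the timed region of the benchmark sits between the two harness calls, so warm-up and shutdown costs are excluded from the reported statistics.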
Unless you are specifically researching flexible heap sizes, it is best to run benchmarks in a fixed size heap, using a range of heap sizes to produce a curve that reflects the space-time tradeoff. Using replay compilation and measuring the second iteration of a benchmark is a good way to produce results with low noise.
There is an active debate among memory management and VM researchers about how best to measure performance, and this section is not meant to dictate or advocate any particular position, simply to describe one particular methodology.
Jikes RVM is really slow! What am I doing wrong?
Perhaps you are not seeing stellar Jikes RVM performance. If Jikes RVM, configured as described above, is not competitive with the IBM product DK on AIX™ or Linux®/x86-32, we recommend testing your installation with the SPECjvm98 benchmarks. We expect Jikes RVM performance to be competitive with the IBM® DK 1.3.0 on the SPECjvm98 benchmarks.
Of course, SPECjvm98 does not guarantee that Jikes RVM runs all codes well. We have also tested various flavors of pBOB and the Volano benchmarks, and usually see superior or competitive performance.
The x86-32/IA-32 port is somewhat less mature than the PPC port, and does not deliver competitive performance on some codes. In particular, x86 floating-point performance is mediocre.
Some kinds of code will not run fast on Jikes RVM. Known issues include:
- Jikes RVM start-up is slow compared to the IBM product JVM.
- Remember that the non-adaptive configurations (e.g., Fast) opt-compile every method the first time it executes. At aggressive optimization levels, this severely slows down the first execution of each method. For many benchmarks, you can measure the quality of the generated code either by running several iterations and ignoring the first, or by building a warm-up period into the code; the SPEC benchmarks already use these strategies. The adaptive configuration does not have this problem; however, we cannot promise that the adaptive system will compete with the product on short-running codes of a few seconds.
- We expect Jikes RVM to perform well on codes with many threads, such as VolanoMark. However, if you have a code with many threads, each using JNI, Jikes RVM performance will suffer due to factors in the design of the current thread system.
- Performance on tight loops may suffer. The Jikes RVM thread system relies on quasi-preemption; the optimizing compiler inserts a thread-switch test on every back edge. This will hurt tight loops, including many simple microbenchmarks. We should someday alleviate this problem by strip-mining and hoisting the yield point out of hot loops.
- The thread system currently uses a spinning idle thread. If a Jikes RVM virtual processor (i.e., a pthread) has no work to do, it spins, chewing up CPU cycles. Thus, Jikes RVM will only perform well if there is no other activity on the machine.
- The load balancing in the system is naive and unfair. This can hurt some styles of codes, including bulk-synchronous parallel programs.
- The adaptive system may not perform well on SMPs; this may be due to bad interaction with the thread load balancer.
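The strip-mining fix mentioned above, for the yield-point-on-every-back-edge issue, can be illustrated in plain Java. This is only a sketch of the loop transformation the optimizing compiler might someday perform, not actual Jikes RVM code:

```java
public class StripMine {
    static volatile boolean yieldRequested; // stand-in for the thread-switch flag

    // Original loop shape: conceptually, a yield-point test on every back edge.
    static long plain(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            if (yieldRequested) Thread.yield(); // per-iteration yield point
            sum += i;
        }
        return sum;
    }

    // Strip-mined loop: the yield-point test is hoisted out of the inner
    // loop and executed once per strip of STRIP iterations.
    static long stripMined(int n) {
        final int STRIP = 1024;
        long sum = 0;
        for (int i = 0; i < n; ) {
            if (yieldRequested) Thread.yield(); // one yield point per strip
            int limit = Math.min(i + STRIP, n);
            for (; i < limit; i++) sum += i;    // tight inner loop, no test
        }
        return sum;
    }

    public static void main(String[] args) {
        // Both forms compute the same result; only the frequency of the
        // quasi-preemption test differs.
        System.out.println(plain(1_000_000) == stripMined(1_000_000));
    }
}
```

The transformation preserves the loop's result while bounding the delay before a pending thread switch is noticed to one strip's worth of iterations.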
The Jikes RVM developers wish to ensure that Jikes RVM delivers competitive performance. If you can isolate reproducible performance problems, please let us know.