Multi-core may be good for Java!
Henrik Stahl's Blog |
December 15, 2006 1:32 AM
I was rather surprised when I read Billy Newport's blog post "Multi-core may be bad for Java" a couple of weeks back. We do a significant amount of Java benchmarking in the JRockit team, and we have not seen any indication of multi-core chips causing performance issues. One of the comments he makes is "Garbage collection remains heavily dependant on clock speed for small pauses". I take this to mean that he believes GC does not scale well on multi-core chips. Again, this goes against our experience, so I decided to put it to the test.
Test Setup
GC pause times are normally a function of heap size and live data (i.e. non-garbage), but can vary quite a bit depending on the exact heap layout. Looking at single pauses would not produce high confidence data, so I decided to find a benchmark with a fixed amount of live data and a fixed heap size and look at pause times for a large number of GC pauses over a long run.
I decided to use a very well-defined workload, namely the SPECjbb2005 benchmark. Normally, this benchmark ramps up the load linearly throughout the run by increasing the number of warehouses, but I configured it to use a fixed set of warehouses (8) instead. Each warehouse contains a certain amount of live data - around 35 MB - plus some base data. I then ran the benchmark multiple times while ramping up the number of GC threads used by JRockit. For every such run I recorded the median value of the pause times, to decrease the impact of the odd long or short GC pause.
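To illustrate why the median is the more robust summary here, the following is a minimal sketch, not the tooling actually used for these runs. The class name `PauseStats` and the pause values are made up for the example.

```java
import java.util.Arrays;

class PauseStats {
    /** Median of the given pause times (ms); the input array is left unmodified. */
    static double median(double[] pausesMs) {
        if (pausesMs.length == 0) {
            throw new IllegalArgumentException("no pauses recorded");
        }
        double[] sorted = pausesMs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        // Odd count: middle element. Even count: average of the two middle elements.
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Arithmetic mean, shown only for comparison with the median. */
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Hypothetical pause times in ms, with one unusually long pause.
        double[] pauses = {120, 95, 400, 101, 99};
        System.out.println("median = " + median(pauses) + " ms");
        System.out.println("mean   = " + mean(pauses) + " ms");
    }
}
```

With these made-up numbers, the single 400 ms pause pulls the mean up to 163 ms while the median stays at 101 ms.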
The test setup included single-core, dual-core and quad-core Intel systems, plus a 6-core Sun T1 server. Here are the details:
Intel Xeon 2.2 GHz, 2 CPUs with hyperthreading (4 CPU threads). This is a fairly old single-core system.
Intel Xeon "Woodcrest" 2.67 GHz, 2 CPUs, 4 cores
Intel Xeon "Clovertown" 2.33 GHz, 2 CPUs, 8 cores
Sun T2000 1000 MHz, 1 CPU, 6 cores, 24 threads
The JVM command line was:
-Xms2g -Xmx2g -Xgc:parallel -XXgcthreads=N
i.e., a fixed 2 GB heap and a single-spaced stop-the-world parallel compacting mark-and-sweep GC (whew!), varying the number of GC threads.
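As a small sketch of how the thread-count sweep could be scripted, the snippet below only builds the JVM option list from the command line above for each value of N. The class name `GcThreadSweep` and the upper bound of 8 threads are assumptions for illustration; launching the benchmark itself is left out because the exact invocation is not given here.

```java
import java.util.Arrays;
import java.util.List;

class GcThreadSweep {
    /** JVM options from the command line above, with N replaced by gcThreads. */
    static List<String> jvmFlags(int gcThreads) {
        if (gcThreads < 1) {
            throw new IllegalArgumentException("need at least one GC thread");
        }
        return Arrays.asList(
                "-Xms2g", "-Xmx2g",          // fixed 2 GB heap
                "-Xgc:parallel",             // stop-the-world parallel GC
                "-XXgcthreads=" + gcThreads  // the variable under test
        );
    }

    public static void main(String[] args) {
        int maxThreads = 8; // assumption: sweep up to 8 GC threads
        for (int n = 1; n <= maxThreads; n++) {
            System.out.println(String.join(" ", jvmFlags(n)));
        }
    }
}
```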
Before going on to the results, I'd like to point out that the frequency of GCs also varies with CPU performance. A fast chip generates more SPECjbb2005 bops and allocates more memory in a given time frame, so the heap fills up faster and GCs are therefore more frequent. However, this does not affect GC pause times and is thus not relevant for this experiment.
Expected Result
Remember that GC essentially means traversing memory and moving objects around. This means that the performance bottleneck will be either the CPU or the memory subsystem. The result should look something like this:


The first of these two graphs shows the GC pause time as a function of the number of GC threads, and the second plots the GC scalability. If the CPU is the bottleneck, we expect the second GC thread to cut pause time in half, the third GC thread to cut it to a third, and so on. The scalability graph should then scale linearly up to the number of cores. If we see this behavior, we can assume that the GC implementation scales well.
If instead memory is the bottleneck we can still have perfect scaling up to the point where the memory bus is saturated, after which the curve will level out. In this case, we can't be sure if it's the GC implementation in the JVM that is at fault or if it's a hardware limitation without further analysis.
Of course, we can't expect to see a perfect result. Memory access speed is not constant but affected by multi-level CPU caches, and a GC thread could be stalled waiting for memory if the data is not in the CPU cache. We could have serialization issues because JRockit must ensure safe concurrent access to the Java heap by the GC threads. And CPU threading can impact performance in various ways, though hopefully for the better.
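The two idealized cases described above (CPU-bound and memory-bound) can be sketched as a toy model. This is only an illustration of the expected curve shapes, not a description of how JRockit behaves. The `memoryLimit` parameter (the speedup at which the memory bus saturates), the class name and the sample numbers are all assumptions.

```java
class GcScalingModel {
    /**
     * Idealized speedup: linear in the number of GC threads, capped by the
     * number of cores and by a (hypothetical) memory-bandwidth ceiling.
     * Pass Double.POSITIVE_INFINITY as memoryLimit for the purely CPU-bound case.
     */
    static double idealSpeedup(int gcThreads, int cores, double memoryLimit) {
        return Math.min(Math.min(gcThreads, cores), memoryLimit);
    }

    /** Idealized pause time: the single-thread pause divided by the speedup. */
    static double pauseMs(double singleThreadPauseMs, int gcThreads, int cores, double memoryLimit) {
        return singleThreadPauseMs / idealSpeedup(gcThreads, cores, memoryLimit);
    }

    public static void main(String[] args) {
        double singleThreadPauseMs = 800; // made-up single-thread pause
        int cores = 8;
        double memoryLimit = 5.0;         // made-up saturation point
        for (int n = 1; n <= cores; n++) {
            System.out.printf("threads=%d  cpu-bound=%.1f ms  memory-bound=%.1f ms%n",
                    n,
                    pauseMs(singleThreadPauseMs, n, cores, Double.POSITIVE_INFINITY),
                    pauseMs(singleThreadPauseMs, n, cores, memoryLimit));
        }
    }
}
```

In the CPU-bound column the pause keeps shrinking as 1/N up to the core count. In the memory-bound column it follows the same curve until the assumed ceiling is reached and then levels out.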
Intel Xeon
Here are the results from our three Intel machines:


In a perfect case we would see 8x and 4x on Clovertown and Woodcrest, but the real numbers are around 5x and 3x respectively. My guess is that we could probably come closer to perfect scalability through further JVM optimizations, but it's clear that JRockit's GC implementation scales reasonably well with more cores on Intel x86 chips.
Sun T1
Sun says that their hardware is designed for throughput, which means we should expect fairly good scalability but possibly poor single-thread performance. I.e., if the GC implementation in JRockit scales poorly we will get very long GC pause times.
Also, while our server is a 6-core T2000, we must not forget that it has a CPU threading implementation with four threads per core. The threading implementation is Switch on Event Multi-threading (SoEMT), but Sun uses the marketing name "Coolthreads". I'll leave it to someone else to explain how this works, but the point is that we can expect scalability beyond the number of physical cores.


Scalability up to the number of cores (6) is almost linear (5.7x), and we get an additional boost from the CPU threading implementation. The optimal result is achieved using 21 threads - around 3 times as many as physical cores. This also seems to be Sun's choice when they do benchmark submissions, so it is probably a good choice for the hardware. One minor concern about this result is that we actually lose performance as we increase the number of threads beyond the optimal number, which is not something we see on other hardware.
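One way to compare these scalability figures across machines is parallel efficiency, i.e. measured speedup divided by the number of cores. The sketch below uses only the approximate speedups quoted in this post (about 5x on 8 Clovertown cores, about 3x on 4 Woodcrest cores, 5.7x on 6 T1 cores). The class name `ParallelEfficiency` is made up for the example.

```java
class ParallelEfficiency {
    /** Fraction of ideal linear scaling achieved: speedup / cores. */
    static double efficiency(double speedup, int cores) {
        if (cores < 1) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return speedup / cores;
    }

    public static void main(String[] args) {
        // Approximate speedups as quoted in the text, at thread count == core count.
        String[] names = {"Clovertown", "Woodcrest", "T1"};
        double[] speedups = {5.0, 3.0, 5.7};
        int[] cores = {8, 4, 6};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-10s %.1fx on %d cores, efficiency %.3f%n",
                    names[i], speedups[i], cores[i], efficiency(speedups[i], cores[i]));
        }
    }
}
```

That works out to about 62.5% on Clovertown, 75% on Woodcrest and 95% on T1 up to the physical core count, before any extra gain from hardware threading.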
Single vs multi-thread GC performance
Looking at the extreme cases, here are the GC pause times using a single GC thread on the various architectures:

And the pause times using an optimal number of GC threads (on Xeon, equal to the number of cores, or to the number of CPU threads with HT; on T1, about three times the number of cores):
Single-thread performance is very good on Woodcrest and Clovertown CPUs - about twice as fast as the older Xeon chip. On T1 single-thread performance is quite bad, but since the GC scales well that is not an issue in practice.
To summarize, I think these results demonstrate that JRockit's GC implementation scales very well on multi-core systems, and I believe that GC scalability could be improved further, so there should be no issue scaling to even larger numbers of cores in the future. As for the other issues mentioned in Mr. Newport's blog, here are some thoughts:
1) Threading is relatively simple in Java compared to many other languages, and many Java applications are already multi-threaded.
2) Java can use runtime analysis to optimize locks, memory allocation and other low-level operations for multi-threaded usage in ways that a statically compiled language (e.g. C) cannot easily do, which helps scalability on multi-core hardware.
3) Even if the user code does not scale, JVMs can use free CPU cycles for background housekeeping such as concurrent GC or recompiling code with more aggressive optimizations.
Final Words
The trend towards more cores will clearly force Java programmers to adapt. However, JRockit benefits from current multi-core chips for critical tasks such as GC, and includes other optimizations that lessen the scalability impact of running on multi-core systems. I have no reason to believe that other JVMs are worse. With JVMs providing a good base, and the ease-of-use that the language provides for multi-threading, my conclusion is that multi-core is good for Java!
Comments
-
Beautifully clear - thank you.
Posted by: tallsandwich on December 18, 2006 at 8:05 AM
-
very clear, very informative! thanks
Posted by: vedran on December 20, 2006 at 3:57 AM
-
Wow this is great stuff. The DataRush team will definitely use this for our benchmarks.
I must concur. I've run DataRush on other JVM's in a 32-core, 128 GB RAM platform and GC hummed along while the app completed processing 10 million rows of data in seconds. Not many IT managers would look at that and say "Gee, I better fix GC".
Posted by: EmilioB on December 22, 2006 at 1:16 PM