A second look at Java 6 performance

Henrik Stahl's Blog | August 7, 2007   7:33 AM | Comments (5)


When we released JRockit for Java SE6, I promised to present some benchmark numbers. Here they are - only 4 months late...

The results shown here are based on the SPECjbb2005 and XMLMark workloads. We have run the benchmarks in a realistic "base" configuration as well as in a fully tuned configuration and have tried to make as fair apples-to-apples comparisons as possible. See the Q&A section below for more details.

SPECjbb2005

This is the only modern industry-standard JVM benchmark available and the main arena for Java performance comparisons by JVM and hardware vendors.

intel-jbb.PNG

amd-jbb.PNG

Or in table format:

jbb-table.PNG

As you can see, JRockit is much faster than the Sun JVM on both Intel and AMD hardware, and in both the base and tuned configurations. The JRockit performance advantage is between 5 and 65%. The best known config with JRockit is 27% faster on Intel and 19% faster on AMD in our setup.
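For readers who want to check percentages like these against the table, the arithmetic is a simple ratio of two scores. The sketch below assumes a higher score means higher throughput; the class name and the numbers in `main` are placeholders, not measured results.

```java
// Sketch: how a "27% faster" figure is derived from two benchmark scores.
// The scores in main() are made-up placeholders, not results from this post.
final class RelativeAdvantageSketch {

    // Percentage by which score a exceeds baseline score b (higher = better).
    static double advantagePercent(double a, double b) {
        if (b <= 0.0) {
            throw new IllegalArgumentException("baseline score must be positive");
        }
        return (a / b - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double scoreA = 127.0; // hypothetical score for JVM A
        double scoreB = 100.0; // hypothetical score for JVM B
        System.out.printf("A leads B by %.1f%%%n", advantagePercent(scoreA, scoreB));
    }
}
```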

Note: While we have run the benchmarks on both Intel and AMD hardware, using these results to compare the performance of the hardware is misleading since we used a multi-JVM configuration for the Intel result and a single-JVM configuration for the AMD result. The reason for this is that we had too little memory in the AMD machine to run a multi-JVM config.

XMLMark 1.1

This benchmark was originally used by Sun and Microsoft to compare Java to .Net performance. We have not used it to drive performance enhancements in JRockit, so it can be used as one data point to help validate that our JVM optimizations are generic. I used the XMLMark code provided by Microsoft, download here.

The XMLMark benchmark consists of 3 different components for SAX and 6 components for DOM. The results below were calculated as the geometric mean of these components.
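As a rough illustration of how a composite score of this kind can be computed, here is a minimal geometric-mean sketch. The class name and the component scores are hypothetical; this is not the code used to produce the charts.

```java
// Sketch: geometric mean of per-component benchmark scores.
// The scores in main() are made-up placeholders, not measured results.
final class GeometricMeanSketch {

    // Computed via logarithms so that a long list of large scores cannot overflow.
    static double geometricMean(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("need at least one score");
        }
        double logSum = 0.0;
        for (double s : scores) {
            if (s <= 0.0) {
                throw new IllegalArgumentException("scores must be positive");
            }
            logSum += Math.log(s);
        }
        return Math.exp(logSum / scores.length);
    }

    public static void main(String[] args) {
        double[] saxComponents = {100.0, 200.0, 400.0}; // hypothetical values
        System.out.printf("SAX composite: %.1f%n", geometricMean(saxComponents));
    }
}
```

Unlike an arithmetic mean, the geometric mean keeps one unusually fast component from dominating the composite, which is why it is a common choice for multi-part benchmarks.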

intel-xmlmark.PNG

amd-xmlmark.PNG

The result is clear: JRockit is faster in every single combination. On SAX, JRockit's lead is between 32 and 66%; on DOM, it is between 3 and 17%.

Configuration

Hardware and operating system

2-way AMD Opteron 2220 SE, 8 GB RAM, Windows x64 Edition

2-way Intel X5355, 12 GB RAM, Windows x64 Edition

Both systems had large pages enabled in the OS so that both JVMs could use them.

The Intel system had hardware prefetching disabled in the BIOS. This benefits the software prefetchers in JRockit and the Sun JVM (for Sun, software prefetching is, AFAIK, new in 1.6.0_02).

Sun JVM

Sun JDK 1.6.0_02, 32 and 64-bit

For SPECjbb2005:
-Xms1500m -Xmx1500m -server -Xss128k (32-bit JVM, base)
-Xms3500m -Xmx3500m -server -Xss128k (64-bit JVM, base)
-Xms1500m -Xmx1500m -Xmn800m -server -XX:+UseBiasedLocking -XX:+AggressiveOpts -XX:+UseParallelOldGC -Xss128k -XX:+UseLargePages -Xbatch (32-bit JVM, tuned)
-Xms3650m -Xmx3650m -Xmn2000m -server -XX:+UseBiasedLocking -XX:+AggressiveOpts -XX:+UseParallelOldGC -Xss128k -XX:+UseLargePages -Xbatch (64-bit JVM, tuned)

For XMLMark (32-bit JVM only):
-Xms1500m -Xmx1500m -server (base)
-Xms1500m -Xmx1500m -server -XX:+AggressiveOpts -XX:+UseLargePages (tuned)

JRockit

JRockit 1.6.0_01 R27.3.0, 32 and 64-bit

For SPECjbb2005:
-Xms3500m -Xmx3500m (32-bit JVM, base)
-Xms3500m -Xmx3500m (64-bit JVM, base)
-Xms3650m -Xmx3650m -Xns3000m -XXaggressive -XXlazyunlocking -Xlargepages -Xgc:genpar -XXtlasize:min=4k,preferred=1024k -XXcallprofiling (32-bit JVM, tuned)
-Xms3650m -Xmx3650m -Xns3000m -XXaggressive -XXlazyunlocking -Xlargepages -Xgc:genpar -XXtlasize:min=4k,preferred=1024k -XXcallprofiling (64-bit JVM, tuned)

For XMLMark (32-bit JVM only):
-Xms1500m -Xmx1500m -Xgc:parallel (base)
-Xms1500m -Xmx1500m -Xgc:parallel -XXlazyunlocking -XXcallprofiling -Xlargepages (tuned)

Q & A

How did you decide on the tuning parameters?
For our base tuning we only allowed the most commonly used JVM parameters, and those needed to work around inherent weaknesses in each JVM.

For Sun, this varies with platform, but on Windows you typically have to specify -server, the max heap size (-Xmx) and often the stack size (-Xss). The Sun JVM is less sensitive to the initial heap size (-Xms). You also often have to configure a larger perm space, though that didn't affect the benchmarks used for these results. To increase performance, we used the same parameters used by Sun in benchmark publications.

For JRockit, the most important thing is to set the minimum heap size (-Xms), and sometimes select the static throughput-optimized GC (-Xgc:parallel). There is typically less need to set the max heap size (-Xmx). To increase performance, publication options were used.

For both JVMs, we only used the most common tuning parameters for XMLMark even for the "tuned" scenario.
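When reproducing runs like these, it helps to confirm that the JVM actually started with the intended options. The sketch below uses the standard `java.lang.management` API to print the VM name, the startup arguments and the effective max heap; the class name is made up for illustration, and this was not part of our benchmark setup.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Sketch: report which JVM is running and which options it was started with.
final class StartupFlagsSketch {

    // Arguments passed to the JVM itself (e.g. -Xmx..., -server), not to main().
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    // Upper bound on the heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("VM:        " + ManagementFactory.getRuntimeMXBean().getVmName());
        System.out.println("Arguments: " + jvmArguments());
        System.out.println("Max heap:  " + maxHeapMegabytes() + " MB");
    }
}
```

Running it with, for example, `java -Xms1500m -Xmx1500m -server StartupFlagsSketch` should echo those options back in the arguments list.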

Wouldn't it be more fair/realistic to run with an out-of-the-box configuration?
Not really. To start with, most server-side Java installations do use basic tuning options and products like WebLogic Server come with precreated start scripts configured that way. Also, out-of-the-box configuration of the JVM varies with platform so it could yield really poor performance in some configurations. For instance, our tests were run on Windows, and Sun uses the Client VM by default on that platform which performs much worse for typical server-side benchmarks. Also, not configuring things like stack size would mean that the benchmark would terminate during the run; hardly fair to Sun!

Aren't tuned configurations pointless? I mean, people don't take the time to tune anyway!
The configurations used for benchmark publications do tend to be a bit extreme, yes. It is unlikely that many users have the skill or inclination to tune most products to that extent. However, that is less true for Java benchmarks like SPECjbb2005 than for more complex benchmarks since there are so few tuning parameters involved.

Also, while most users may not engage in heavy tuning, it is probably the case that the users that really care about performance do. So peak performance is still interesting.

How did you configure the XMLMark benchmark?
We ran every component benchmark for 10 minutes, of which the last five were used to calculate the score. This seemed to be a reasonable compromise giving us a run long enough to allow the GC to impact the score while keeping the total runtime reasonably low. The benchmark was configured to use one thread per CPU core - 8 on the Intel Xeon server, 4 on the AMD server - in order to fully load the system.
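The run structure described above (a fixed-length run, only the final window scored, one worker thread per core) can be sketched roughly as follows. This is not the XMLMark harness: the class, the method names and the placeholder workload are all assumptions made for illustration, and the code uses Java 8+ syntax.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Sketch: run a workload on N threads for a fixed time and score only the last window.
final class TimedRunSketch {

    // Returns operations per second counted during the final measuredMillis of the run.
    static double run(Runnable operation, int threads, long totalMillis, long measuredMillis) {
        if (threads <= 0 || measuredMillis <= 0 || measuredMillis > totalMillis) {
            throw new IllegalArgumentException("invalid run configuration");
        }
        LongAdder measuredOps = new LongAdder();
        long start = System.nanoTime();
        long measureFrom = start + TimeUnit.MILLISECONDS.toNanos(totalMillis - measuredMillis);
        long end = start + TimeUnit.MILLISECONDS.toNanos(totalMillis);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                long now;
                while ((now = System.nanoTime()) - end < 0) {
                    operation.run();
                    // Count only operations that started inside the measured window;
                    // everything before it is treated as warm-up.
                    if (now - measureFrom >= 0) {
                        measuredOps.increment();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(totalMillis + 10_000, TimeUnit.MILLISECONDS)) {
                pool.shutdownNow();
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return measuredOps.sum() / (measuredMillis / 1000.0);
    }

    // Placeholder workload; a real run would parse XML here.
    static void placeholderWork() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append(i);
        }
        if (sb.length() == 0) {
            throw new AssertionError("unreachable");
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors(); // one thread per core
        // The setup in this post corresponds to 600_000 ms total with the last 300_000 ms
        // measured; the demo below uses 2 s total with the last 1 s measured.
        double opsPerSecond = run(TimedRunSketch::placeholderWork, threads, 2_000, 1_000);
        System.out.printf("%d threads, %.0f ops/s in the measured window%n", threads, opsPerSecond);
    }
}
```

Discarding the first half of the run gives the JIT compiler time to optimize the hot code and lets the heap reach a steady state, so the score reflects sustained rather than start-up performance.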

Can you generalize these results to other platforms? What about other workloads?
Running similar benchmarks on other x86 hardware will probably produce similar results. You can find some comparisons on http://www.spec.org/jbb2005/results/jbb2005.html to back this up if you look hard, or search the web for independent (non-Sun, non-BEA) results.

You can definitely not generalize to non-x86 platforms like SPARC or Itanium.

Generalizing across workloads is very difficult. However, in our own testing we often see that other memory-intensive applications show a similar advantage for JRockit. Examples of such applications include XML processing (including web services) and JSP/Servlet-heavy apps like WLP.

You get really impressive performance with JRockit... What type of optimizations have you made to get these results?
Things that affect SPECjbb2005 performance include: compressed references, lock optimizations, lazy unlocking, software prefetching, memory locality optimizations in the GC, generic compiler enhancements and optimizations to a few core Java APIs like HashMap and BigDecimal. Most of these have been done by the JRockit team, some are inherited from Sun (the Java API changes). Interestingly, almost all of the optimizations that affect SPECjbb2005 have very broad applicability.

Why do you use a larger heap with JRockit than with Sun?
The 32-bit Sun JVM cannot allocate more than approximately 1.5 GB on Windows since it requires a contiguous memory space for the heap. JRockit does not have this limitation, nor does it affect 64-bit JVMs.

We did not use a larger heap for the XMLMark results shown here. If we had increased the heap size from 1.5 GB to 3.5 GB it would have increased the performance another 1-2%.

Why is the 64-bit JRockit so much faster than Sun?
When you run the 64-bit JRockit with a small heap, it compresses the references to 32-bit which decreases the pressure on the memory bus. This yields a 10-20% performance advantage, but does not apply if you run with a larger heap. The current implementation is limited to 4 GB, though this can in theory be extended to a 32 GB heap with a slight loss in performance.

This technique can obviously not be used if you need a larger heap than the limit. However, it is quite common to use a 64-bit JVM even with a small heap; for instance if you want to be able to expand the heap later or if you are integrating with any 64-bit native libraries such as a floating point library.
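The 4 GB and 32 GB figures follow directly from the arithmetic of 32-bit references. The sketch below illustrates that arithmetic only and is not JRockit's implementation: it assumes a reference is stored as an offset from the heap base, and that the 32 GB variant relies on 8-byte object alignment (a shift of 3), which is an assumption consistent with those two numbers.

```java
// Sketch: address arithmetic behind compressed 32-bit references.
// Assumption: references are offsets from the heap base, optionally scaled by 2^shift.
final class CompressedRefSketch {

    // Largest heap (in bytes) addressable with a 32-bit reference scaled by 2^shift.
    static long maxHeapBytes(int shift) {
        return (1L << 32) << shift;
    }

    // Pack a heap offset into 32 bits; the offset must be aligned to 2^shift bytes.
    static int compress(long heapOffset, int shift) {
        long alignmentMask = (1L << shift) - 1;
        if ((heapOffset & alignmentMask) != 0) {
            throw new IllegalArgumentException("offset is not aligned");
        }
        if (heapOffset < 0 || heapOffset >= maxHeapBytes(shift)) {
            throw new IllegalArgumentException("offset is outside the addressable heap");
        }
        return (int) (heapOffset >>> shift);
    }

    // Recover the heap offset; the extra shift is the cost of the larger limit.
    static long decompress(int ref, int shift) {
        return (ref & 0xFFFF_FFFFL) << shift;
    }

    public static void main(String[] args) {
        System.out.println("shift 0: " + (maxHeapBytes(0) >> 30) + " GB addressable");
        System.out.println("shift 3: " + (maxHeapBytes(3) >> 30) + " GB addressable");
        long offset = 24L << 30; // an offset 24 GB into the heap, 8-byte aligned
        int ref = compress(offset, 3);
        System.out.println("round trip ok: " + (decompress(ref, 3) == offset));
    }
}
```

With a shift of 0 a reference can be used as-is, whereas a non-zero shift adds a small amount of work to every reference load and store, which matches the "slight loss in performance" mentioned above.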

Sun used a partially different set of benchmarks and a different configuration for their Java 6 launch. What's your response to that?
I have already provided our view on the tricky "out of the box" question above. The other benchmarks Sun used were scimark2 and volanomark. Both of these are very old and we feel they do not reflect the way our customers use JRockit so we don't work on them. Sun obviously comes to a different conclusion since they keep referring to them every so often.

Also, Sun used a six-month-old version of the 32-bit JRockit 5.0 and a slightly newer 64-bit version in their comparison, disregarding the fact that the class libraries differ, which doesn't make for an apples-to-apples JVM comparison. The comparison above is intentionally made with brand new releases from both Sun and BEA. JRockit is even at a slight disadvantage since the Sun JVM is based on the 1.6.0_02 class libraries but the JRockit version is based on 1.6.0_01.

These results are worthless since JRockit is optimized only for SPECjbb2005!
That's a really, really funny statement made by one of our competitors, though we forgive him since we like our competition and the statement was obviously made in frustration. Make love, not war!

Is the Sun JVM better since it has been used for more SPECjAppServer2004 publications?
Another really funny statement, which was implied in a Sun JavaOne presentation this year. But it's quite obvious that quantity doesn't count when it comes to benchmarks, only results. And since there are no apples-to-apples comparisons available on this benchmark we cannot draw any conclusions about JVM impact on it.

So what JVM is the fastest?
JRockit of course :-)

Seriously, the answer is the much-hated "it depends". For SPECjbb2005 (and XMLMark) on x86 right now JRockit seems to be in the lead, so I would expect it to be used for the upcoming Intel and AMD product launches. If you're running a BEA product you should be using JRockit since that is where we focus our efforts. But at the end of the day, you have to do your own benchmarking if you really want to know.

Disclaimer: The SPECjbb2005 results quoted above were produced by BEA and have been submitted to SPEC for review. SPEC and the benchmark name SPECjbb2005 are trademarks of the Standard Performance Evaluation Corporation. The XMLMark version we used is the property of Microsoft.


Comments

Comments are listed in date ascending order (oldest first)

  • There's a story at javalobby.org pointing to this blog:
    http://www.javalobby.org/java/forums/t99916.html
    And here are some more questions and answers:

    > * Why did you choose UseParallelOldGC for Sun's GC.
    > Are you sure it's the fastest choice for this
    > benchmark?

    We started with what Sun had used for their benchmark
    publications, and then tried various options including the
    brand new -XX:+UseNUMA flag. We couldn't find anything
    substantially better than Sun's own choice.

    > * How much of the performance difference is due to
    > the smaller max heap size (1500 for Sun vs. 3500 for
    > JRockit)?

    For JRockit, I think the difference is 1-2%. Sun is typically
    less sensitive to heap size than we are, but I don't have
    any hard numbers.

    > * You say "To increase performance, publication
    > options were used." What does that mean?

    We used the same options as hardware vendors have
    used for SPECjbb2005 benchmark publications, i.e.
    the "best known method".

    > * I wouldn't be at all surprised if all the
    > performance difference was due to, say, a single
    > enhancement in BigDecimal (I know Classpath uses
    > libgmp for BigDecimal, using very fast, native
    > assembly and C). Can you give any more details on
    > where the performance difference comes from?

    Sorry, but there is no silver bullet. It's a lot of small things
    combined, read the Q&A section in my blog for some more
    detail.

    > * Are you sure you've followed all the spec.org rules
    > for posting benchmark results? :)

    Yes, 100% certain.

    > How come I don't see any entries with BEA for "Tester Name"

    The results have been submitted to SPEC, reviewed and
    accepted. For the time being, BEA has asked not to have
    them published since SPECjbb2005 is typically used to
    compare hardware and not software, and our intention
    with these numbers was not to take a stance on that issue.

    Note that we are not the first JVM vendor to use
    SPECjbb2005 in this fashion. Sun did it for their Java 6
    launch, for instance.

    (Edited for readability -- Henrik)

    Posted by: atripp54321 on August 10, 2007 at 8:29 AM

  • Thanks, Andy! I'd be happy to answer any other question you might have.

    Posted by: hstahl on August 10, 2007 at 9:21 AM

  • Henrik, I just wonder if it is possible to start a jvm with 3,6gb heap and with largepages on a real prod machine where uptime is > one week?

    Besides, don't really understand why the -Xss for sun jvm has been decreased to 128k? jrockit 64bit uses 1m, doesn't it?

    Further, I see that jrockit runs with genpar, also -imo- sun should run with UseConcMarkSweepGC, what do you think about it? I guess it has been tested as well? Wasn't the sun's UseParallelOldGC a little bit buggy?

    Finally, a general conclusion: being a fan of jrockit (I love jrcmd), I think that the most important is context: could be that for some applications 32bit will perform better than 64bit. I think - due to risk of being frozen by gc - it really makes sense to keep the heap as low as possible, and therefore I prefer 32bit jvms. Sorry for that ;).

    regards, makiey

    ps. I guess, talking about "the other benchmarks Sun used" you were referencing to the "Java 6 Leads Out of the Box Server Performance" entry from David's blog...

    Posted by: makiey on January 30, 2008 at 3:53 AM

  • > Henrik, I just wonder if it is possible to start a jvm with 3,6gb heap and with largepages on a real prod machine where uptime is > one week?

    I don't see why not. Large-page allocation is more static than standard memory allocation, so it should give you more predictable performance. The downside is that it is "unfriendly" towards other apps running on the same box, in that it typically prevents the process from being swapped out.

    Running with a heap size close to the address space limits (which vary with OS and JVM, and depend on whether compressed refs are used) can lead to out of memory conditions if you're not careful. But it's typically not a problem for a well-defined Java application that avoids things like constantly generating new classes.

    > Besides, don't really understand why the -Xss for sun jvm has been decreased to 128k? jrockit 64bit uses 1m, doesn't it?

    When we ran with Sun's 32-bit JVM it failed to complete a run if we didn't set this option. Can't remember why we kept it for the 64-bit JVM; that might have been a mistake. But this is a resource consumption option, not a performance tuning option. Changing it would have negligible impact on performance.

    > Further, I see that jrockit runs with genpar, also -imo- sun should run with UseConcMarkSweepGC, what do you think about it? I guess it has been tested as well? Wasn't the sun's UseParallelOldGC a little bit buggy?

    -Xgc:genpar is our multi-generational stop-the-world GC. This is similar to Sun's out of the box configuration. UseConcMarkSweepGC would map to our -Xgc:gencon, which would give you shorter pause times but lower throughput performance on these benchmarks.

    > Finally, a general conclusion: being a fan of jrockit (I love jrcmd), I think that the most important is context: could be that for some applications 32bit will perform better than 64bit. I think - due to risk of being frozen by gc - it really makes sense to keep the heap as low as possible, and therefore I prefer 32bit jvms.

    A valid opinion. But I guess my point is that - for small heap sizes - our 64-bit JVM performs on par with our 32-bit JVM. The same thing cannot be said for the Sun JVM.

    > ps. I guess, talking about "the other benchmarks Sun used" you were referencing to the "Java 6 Leads Out of the Box Server Performance" entry from David's blog...

    Yep, my favorite sparring partner :P

    Posted by: hstahl on January 31, 2008 at 12:34 AM

  • thanks! I really appreciate your efforts of making jrockit the fastest jvm in the world and promise to stay with it for a while ;)

    cheers, makiey

    Posted by: makiey on February 8, 2008 at 1:37 AM


