Monday, March 12, 2012

Java Tuning in a Nutshell - Part 1

While delivering a training session recently, I got a request to put together a JVM tuning cheat sheet. Given the 50+ parameters available on the Sun HotSpot JVM, this request is understandable. The diagram below is what I came up with. I’ve tried to narrow down the most important flags that will solve 80% of JVM performance needs with 20% of the tuning effort. This article assumes basic JVM tuning knowledge: the different generations used in the Sun HotSpot JVM, the different garbage collection algorithms available, etc.

Although this is intended primarily for enterprise-grade Oracle Fusion Middleware products, it applies to most server JVMs with large heaps hosted on server-class, multi-core machines. This is not an exhaustive list, only low-hanging fruit. In fact, many JDK1.6 users need no tuning at all: the JVM picks good defaults and ergonomics does a decent job. Follow this only if the default behavior is not good enough (for instance, frequent garbage collections, low throughput, long GC pauses, etc). In my experience, a non-trivial production topology with Oracle Fusion Middleware products often requires this level of tuning. This includes Oracle WebLogic Server (JavaEE apps), Oracle Coherence, Oracle Service Bus, Oracle SOA Suite, BPM, AIA and other enterprise FMW apps running on the Sun HotSpot JVM.

I’ve used a mind map below to help visualize the relationships and dependencies between the various JVM tuning flags. In the diagram, the flags in black are the ones to try first; the ones in gray are optional; anything not covered here can be ignored! :)

I’ve categorized the flags into 4 groups:
  1. Garbage collection (GC): The garbage collection algorithm is one of the two mandatory tunables for Java performance tuning. Start with UseParallelOldGC. If GC pauses are not acceptable, switch to UseConcMarkSweepGC (which prioritizes low application pause times at the cost of raw application throughput). Specify ParallelGCThreads to limit GC threads (yes, limit: the default is usually too high for multiple WebLogic servers sharing a large, multi-core machine). Recommendations for values and other flags will be covered later.
  2. Heap tuning: This is the other mandatory tunable. I’m using ‘heap’ as an umbrella term for all Java memory spaces. Technically, Perm and Stack are not part of the Java heap in Sun HotSpot. Required flags in my tuning exercise are total heap size (Xmx, Xms), young generation size (Xmn) and permanent generation size (PermSize, MaxPermSize). Xss tuning is optional. I only use it when tuning a 32-bit, heap-constrained JVM, where reducing Xss squeezes memory out of native space so that more is available for Xmx. In any case, never set Xss below 128k for Fusion Middleware (the default is usually 512k to 1m depending on the OS).
  3. Logging: GC logging is mandatory only for the duration of the tuning exercise itself. However, due to its low overhead (typically only one line written per collection, which itself is relatively infrequent), it is highly recommended for production as well. Otherwise, you will not be able to make an educated tuning decision if/when things don't work as expected. 
  4. (Optional) Other Performance: These are only used for fine tuning when performance is the driver for the tuning exercise. Even then, try these only after GC and heap are well tuned to begin with.
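Before changing any of these flags, it helps to confirm what a running JVM was actually started with. The following is a minimal sketch using the standard java.lang.management API; the class name JvmFlagReport is my own, chosen for illustration, and the collector names it prints depend on the JVM version and the GC flags in effect.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class (the name is illustrative, not from the article).
final class JvmFlagReport {

    /** Returns the JVM options (-X..., -XX:..., -D...) this process was started with. */
    static List<String> startupFlags() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** Returns one summary line per active collector: name, collection count, total time. */
    static List<String> collectorSummaries() {
        List<String> lines = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println("Startup flags: " + startupFlags());
        for (String line : collectorSummaries()) {
            System.out.println(line);
        }
        // Upper bound the JVM will use for the Java heap, as seen at runtime.
        System.out.println("Max heap (bytes): " + Runtime.getRuntime().maxMemory());
    }
}
```

Running it with the same flags as your server (for example, java -Xms2g -Xmx2g JvmFlagReport) is a quick sanity check that the options you think you set are the ones the JVM actually received.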
The primary requirement that warrants JVM tuning in production Oracle Fusion Middleware is not performance but unacceptable GC pauses. The culprit is almost always a Full GC that causes a long application pause. Symptoms include temporarily unresponsive servers, client session timeouts, etc. If you’re capturing GC logs using the flags in the diagram, a search for “Full GC” will show how many Full GCs occurred, how frequent they were and how long they took. Following the tunables in the diagram above, this is how you can solve the problem (I have highlighted the parameters to match those in the diagram):
  1. Heap not sized correctly, causing Full GCs
    1. -Xmx should be equal to -Xms. Growing from Xms to Xmx requires Full GCs to resize the heap. Set these to the same value if Full GCs are to be completely eliminated in production.
    2. -XX:PermSize should be equal to -XX:MaxPermSize
      Both params need to be specified and should have the same value. Otherwise, a Full GC is required for each Perm Gen resize while it grows up to MaxPermSize.
    3. -XX:NewSize is specified but not equal to -XX:MaxNewSize
      Like the other heap params, a resize of the new/young gen requires a Full GC. The preferred approach is to avoid these two parameters and use -Xmn instead. This eliminates the problem, as setting, say, "-Xmn1g" is the same as setting "-XX:NewSize=1g -XX:MaxNewSize=1g".
    4. -XX:SurvivorRatio is specified but -XX:-UseAdaptiveSizePolicy is not. The SurvivorRatio specified will not stick if AdaptiveSizePolicy is in effect. By default, the JVM adapts and overrides the value you specified based on runtime heuristics. Use -XX:-UseAdaptiveSizePolicy to disable adaptive sizing of generations (notice the 'minus' sign preceding UseAdaptiveSizePolicy).
  2. -XX:+UseConcMarkSweepGC is almost always used when there is a strict latency requirement or Service Level Agreement (SLA) and long GC pauses are unacceptable. That is, Full GCs are to be avoided at all costs. However, there are many reasons why Full GCs could still occur:
    1. Although UseConcMarkSweepGC is specified, CMS can and often will kick in too late, causing a Full GC when it can’t catch up. In other words, although CMS is collecting garbage, the application threads that are executing concurrently run out of heap for allocation because CMS couldn't free garbage soon enough. At this point, the JVM stops all application threads and does a Full GC. This is called a “concurrent mode failure” in GC logs. The reason for a concurrent mode failure is that the JVM dynamically computes the point at which CMS should be initiated and adjusts this value based on statistics. However, in production, load is often bursty, so the last dynamically computed initiation value can turn out to be too late. To prevent this, provide a static initiation threshold. Use -XX:CMSInitiatingOccupancyFraction (a percentage of old generation occupancy) to tell the JVM at what point it should initiate CMS. A value between 40 and 70 usually works for most Fusion Middleware products. Start with the higher value (70) and tune down only if you still see the string “concurrent mode failure” in GC logs.
    2. Secondly, always specify -XX:+UseCMSInitiatingOccupancyOnly when CMSInitiatingOccupancyFraction is used; otherwise, the value you specify does not stick (the JVM will dynamically change it on the fly again). This is very important and commonly missed.
  3. UseParallelGC is used instead of -XX:+UseParallelOldGC
    1. Unlike UseParallelGC, UseParallelOldGC does old gen collection in parallel. In both cases, young gen (minor) collections are parallel. By having multiple threads do old gen collection, the overall Full GC pause can be reduced.
    2. If no GC params are specified, UseParallelGC is usually the default (this may have changed in later versions of JDK6), so it is safest to always specify -XX:+UseParallelOldGC explicitly when throughput is the goal.
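The checklist above can be turned into a small "lint" over a JVM's startup flags. The sketch below is only an illustration of the rules in this post: the class name GcFlagLint and its warning messages are hypothetical, and it compares sizes as plain text (so "2g" and "2048m" would be reported as different), which is a deliberate simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical checker for the pitfalls listed above (names are illustrative).
final class GcFlagLint {

    /** Returns the text after the prefix for the first flag starting with it, or null. */
    static String valueOf(List<String> flags, String prefix) {
        for (String f : flags) {
            if (f.startsWith(prefix)) {
                return f.substring(prefix.length());
            }
        }
        return null;
    }

    /** True if at least one value is set and the two are not textually equal. */
    static boolean mismatch(String a, String b) {
        if (a == null && b == null) {
            return false;
        }
        return a == null || !a.equalsIgnoreCase(b);
    }

    static List<String> check(List<String> flags) {
        List<String> warnings = new ArrayList<String>();
        if (mismatch(valueOf(flags, "-Xms"), valueOf(flags, "-Xmx"))) {
            warnings.add("-Xms and -Xmx should be set to the same value");
        }
        if (mismatch(valueOf(flags, "-XX:PermSize="), valueOf(flags, "-XX:MaxPermSize="))) {
            warnings.add("-XX:PermSize and -XX:MaxPermSize should be set to the same value");
        }
        if (mismatch(valueOf(flags, "-XX:NewSize="), valueOf(flags, "-XX:MaxNewSize="))) {
            warnings.add("-XX:NewSize and -XX:MaxNewSize differ; consider -Xmn instead");
        }
        if (valueOf(flags, "-XX:SurvivorRatio=") != null
                && !flags.contains("-XX:-UseAdaptiveSizePolicy")) {
            warnings.add("-XX:SurvivorRatio will not stick without -XX:-UseAdaptiveSizePolicy");
        }
        if (valueOf(flags, "-XX:CMSInitiatingOccupancyFraction=") != null
                && !flags.contains("-XX:+UseCMSInitiatingOccupancyOnly")) {
            warnings.add("add -XX:+UseCMSInitiatingOccupancyOnly so the fraction sticks");
        }
        if (flags.contains("-XX:+UseParallelGC") && !flags.contains("-XX:+UseParallelOldGC")) {
            warnings.add("-XX:+UseParallelGC without -XX:+UseParallelOldGC");
        }
        return warnings;
    }

    public static void main(String[] args) {
        // Example flag set with two of the problems described above.
        List<String> flags = Arrays.asList(
                "-Xms1g", "-Xmx2g", "-XX:+UseConcMarkSweepGC",
                "-XX:CMSInitiatingOccupancyFraction=70");
        for (String w : check(flags)) {
            System.out.println("WARNING: " + w);
        }
    }
}
```

To lint a live server rather than a hard-coded list, the same check method could be fed the list returned by ManagementFactory.getRuntimeMXBean().getInputArguments().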
In rare cases, no matter how well you tune your JVM, the heap eventually fills up, resulting in back-to-back Full GCs (again, use GC logs to guide you). If this is the case, your code may have introduced a memory/reference leak. To confirm, take a few heap dumps and compare them to see if the count of any particular object keeps growing over time, even after GC completes. Again, this is very rare, so make sure you do your due diligence with JVM tuning first.
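Searching GC logs for “Full GC” can also be scripted. The following is a rough sketch that counts Full GC lines and adds up their pause times. It assumes each Full GC is logged on one line whose total pause appears as the last ", N.NNN secs]" token, which is how HotSpot logs of this era typically look with detailed GC logging enabled; the exact format varies with JVM version and flags, so treat the pattern as an assumption to adjust. The class name FullGcScan and the sample lines are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical GC log scanner (the name and the assumed log format are illustrative).
final class FullGcScan {

    // Matches pause tokens such as ", 1.2345678 secs]". The trailing
    // "[Times: ... real=1.23 secs]" section does not match because of "real=".
    private static final Pattern PAUSE = Pattern.compile(", ([0-9]+\\.[0-9]+) secs\\]");

    /** Returns {count, totalSeconds, maxSeconds} over all lines containing "Full GC". */
    static double[] summarize(List<String> logLines) {
        int count = 0;
        double total = 0.0;
        double max = 0.0;
        for (String line : logLines) {
            if (!line.contains("Full GC")) {
                continue;
            }
            count++;
            double pause = 0.0;
            Matcher m = PAUSE.matcher(line);
            while (m.find()) {
                // Keep the last match: nested phases come first, the overall pause last.
                pause = Double.parseDouble(m.group(1));
            }
            total += pause;
            max = Math.max(max, pause);
        }
        return new double[] {count, total, max};
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed format; real logs would be read from a file.
        List<String> sample = Arrays.asList(
                "1.0: [GC [PSYoungGen: 100K->10K(200K)] 300K->210K(600K), 0.0100000 secs]",
                "5.0: [Full GC [PSYoungGen: 10K->0K(200K)] [ParOldGen: 400K->200K(400K)] "
                        + "410K->200K(600K), 1.5000000 secs] "
                        + "[Times: user=2.00 sys=0.00, real=1.50 secs]");
        double[] s = summarize(sample);
        System.out.println("Full GCs: " + (int) s[0]
                + ", total secs: " + s[1] + ", longest secs: " + s[2]);
    }
}
```

If the count keeps climbing while each Full GC reclaims less and less, that is the back-to-back pattern described above and a hint to start comparing heap dumps.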

I’d be interested in your comments or questions after you try this out. Happy tuning!

Enterprise-class SOA on Exalogic... what, why and how?

What is it?
At the Oracle Open World 2011 conference, I presented a session titled "Enterprise-class SOA on Exalogic" along with Vikas Anand and Manas Deb. Exalogic is part of Oracle's new breed of engineered systems, and is geared towards the middleware space. For those new to the term, an 'engineered system' is one where a single vendor provides everything (hardware, network, OS, middleware, applications, etc) in one appliance. The hardware and software are fully integrated out of the box and designed for maximum performance, availability, scalability and reliability. The idea is to incorporate architecture best practices and provide a well-tuned stack out of the box. This removes room for error in a customer's data center for that particular application type (Oracle SOA Suite in our case), drastically reducing time-to-production without compromising the qualities of service (QoS) just mentioned.

Why do I need it?
You simply roll a rack into your data center, hook it up... and your middleware teams are ready to deploy their software and take advantage of this high-end hardware. This is, in a way, a move towards commoditization of middleware in an appliance. Of course, this does not take you 100% of the way: picking the application deployment topology and architecture is still up to each end user. However, that is just the 'last mile', and it lets the application architect (e.g. the SOA architect) work on a commodity middleware appliance. Middleware architects, CIOs and CTOs all love the concept since they grasp the value immediately. Their teams can focus on building business applications and not the infrastructure that hosts them. Public cloud, on the other hand, is rarely an option in the SOA/business integration world, where all the moving parts exist within the organization's data centers.

As a former Sun CEO famously said, "the network is the computer". The growth of public clouds like Amazon EC2 and of private clouds within organizations is testament to that. Engineered Systems (like Oracle's Exalogic, Exadata, Exalytics, Exa*) are the building blocks and enablers of this concept. This is why 'scalability' is the keyword in Oracle's Engineered Systems. If the network is the computer, surely it cannot be the bottleneck in the scalability equation.

How does it work?
The single most important enabler of Exa* scalability and performance is Infiniband technology. Infiniband was designed as a new standard for both internal and external communication. What this means is that it provides ‘bus speed’ for external connectivity (the networking layer). For comparison, the bandwidth/speed with Infiniband is higher than that of the PCI Express technology used internally on the motherboard. In the context of Exa*, Infiniband is used exclusively for external communication, specifically between Exa* machines co-located in the same data center. For scalability, bringing in an additional Exalogic rack and connecting the racks via Infiniband is equivalent to, say, taking a new motherboard and soldering it on to an existing one. You can begin to see how this becomes the building block for a private cloud.

In the context of Fusion Middleware applications, Infiniband makes the whole greater than the sum of its parts. Unlike vendors of other engineered systems, Oracle has the advantage that it owns everything: the hardware (Sun hardware), operating system (Solaris, Oracle Linux), JVM (Sun HotSpot, JRockit) and application server (WebLogic Server). Oracle enhanced every layer of this stack (hardware, OS, JVM, app server) to take advantage of the blazing speed of Infiniband. There's a lot more than Infiniband in Exalogic. There is redundancy built into every hardware component so there is no single point of failure... whether it is network, storage or power. Disaster Recovery enablers are also built into the rack.

Counter Arguments:
1. I can build this myself using best-of-breed components from multiple vendors. Why do I need to buy an engineered system from Oracle?
2. Isn't this what mainframes did before we moved away from them?
3. Is this just Oracle's way of selling us something we don't really need?

Well, there is a simple answer to all these questions - do some research on how you would build this on your own. You will quickly realize that Infiniband requires hardware and device drivers that are expensive and not easily available. There is a reason mainstream hardware vendors don't have this technology. Building it into your system is not as simple as plug-and-play. And even if you have the hardware, you most likely don't have the resources to optimize it for your software stack. With the Sun acquisition, Oracle has the expertise and R&D resources to design, implement and integrate this in-house, so that it works seamlessly with Oracle's industry leading middleware software stack.

To summarize, SOA Suite has stepped up to the next level with Exa*. Exalogic as the infrastructure for the application tier and Exadata as the infrastructure for the DB tier, connected via high-speed Infiniband, take enterprise-grade SOA to a whole different level. The real-world performance numbers from customers demonstrate this. Migrating application clusters spanning multiple machines, from like-hardware to Exalogic, has shown a minimum of 2x performance improvement, and in many cases 10x or more!