On heap vs off heap memory usage
Overview
I was recently asked about the benefits and wisdom of using off heap memory in Java. The answers may be of interest to others facing the same choices.
Off heap memory is nothing special. The thread stacks, application code and NIO buffers are all off heap. In fact, in C and C++ you only have unmanaged memory, as they do not have a managed heap by default. The use of managed memory, or "heap", in Java is a special feature of the language. Note: Java is not the only language to do this.
new Object() vs Object pool vs Off Heap memory
new Object()
Before Java 5.0, using object pools was very popular, as creating objects was still very expensive. However, from Java 5.0, object allocation and garbage cleanup were made much cheaper, and developers found they got a performance speed up and a simplification of their code by removing object pools and just creating new objects whenever needed. Before Java 5.0, almost any object pool provided an improvement; from Java 5.0, pooling only made sense for expensive objects, e.g. threads, sockets and database connections.
Object pools
In the low latency space it was still apparent that recycling mutable objects improved performance by reducing pressure on your CPU caches. These objects have to have simple life cycles and a simple structure, but you could see significant improvements in performance and jitter by using them.
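As a hypothetical illustration of the pattern (the Event and EventPool classes are invented for this example), the sketch below recycles a mutable event through a simple free list rather than allocating a new object per message:

import java.util.ArrayDeque;
import java.util.Deque;

// A mutable event with a simple life cycle, reset rather than reallocated.
class Event {
    long timestampMs;
    double price;

    Event reset(long timestampMs, double price) {
        this.timestampMs = timestampMs;
        this.price = price;
        return this;
    }
}

// A minimal single-threaded free list. Real low latency pools are often
// per-thread for exactly this reason: no synchronization is needed.
class EventPool {
    private final Deque<Event> free = new ArrayDeque<>();

    Event acquire(long timestampMs, double price) {
        Event e = free.poll();
        return (e == null ? new Event() : e).reset(timestampMs, price);
    }

    void release(Event e) {
        free.push(e); // the same instance stays hot in the CPU cache
    }
}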
Another area where it made sense to use object pools was when loading large amounts of data with many duplicate objects. With a significant reduction in memory usage and in the number of objects the GC had to manage, you saw shorter GC times and higher throughput.
These object pools were designed to be more lightweight than, say, using a synchronized HashMap, and so they still helped.
Take this StringInterner class as an example. You pass it a recycled, mutable StringBuilder containing the text you want as a String, and it will provide a String which matches. Passing a String would be inefficient, as you would have already created the object; the StringBuilder can be recycled.
Note: this structure has an interesting property: it requires no additional thread safety features, such as volatile or synchronized, beyond the minimum guarantees Java provides, i.e. you can see the final fields of a String correctly and only ever read consistent references.
public class StringInterner {
    private final String[] interner;
    private final int mask;

    public StringInterner(int capacity) {
        // Round the capacity up to a power of two so the hash can be masked instead of mod-ed.
        int n = Maths.nextPower2(capacity, 128);
        interner = new String[n];
        mask = n - 1;
    }

    private static boolean isEqual(@Nullable CharSequence s, @NotNull CharSequence cs) {
        if (s == null) return false;
        if (s.length() != cs.length()) return false;
        for (int i = 0; i < cs.length(); i++)
            if (s.charAt(i) != cs.charAt(i))
                return false;
        return true;
    }

    @NotNull
    public String intern(@NotNull CharSequence cs) {
        long hash = 0;
        for (int i = 0; i < cs.length(); i++)
            hash = 57 * hash + cs.charAt(i);
        int h = (int) Maths.hash(hash) & mask;
        String s = interner[h];
        if (isEqual(s, cs))
            return s;
        // Miss: build the String once and cache it; a colliding entry is simply replaced.
        String s2 = cs.toString();
        return interner[h] = s2;
    }
}
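A minimal usage sketch, assuming a single-threaded parser (the six-character symbol field is invented for illustration); the same StringBuilder is reused for every message, so only the interned Strings survive:

private final StringInterner interner = new StringInterner(1024);
private final StringBuilder temp = new StringBuilder(); // recycled, never discarded

String readSymbol(CharSequence line) {
    temp.setLength(0);            // reuse the same buffer, no new object
    temp.append(line, 0, 6);      // copy the hypothetical symbol field
    return interner.intern(temp); // returns the cached String when one matches
}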
Off heap memory usage
Using off heap memory and using object pools both help reduce GC pauses, but that is their only similarity. Object pools are good for short lived mutable objects, objects which are expensive to create, and long lived immutable objects where there is a lot of duplication. Medium lived mutable objects and complex objects are more likely to be better left to the GC to handle. However, medium to long lived mutable objects suffer in a number of ways which off heap memory solves.
Off heap memory provides:
- Scalability to large memory sizes e.g. over 1 TB and larger than main memory.
- Negligible impact on GC pause times.
- Sharing between processes, reducing duplication between JVMs, and making it easier to split JVMs.
- Persistence for faster restarts, or for replaying production data in test (see the sketch after this list).
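Plain Java can already demonstrate the basics with a memory-mapped file; in this sketch (the file name and layout are invented), the mapped region is off heap, persisted across restarts, and shareable between processes:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class OffHeapCounter {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("counter.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping lives outside the Java heap: the GC never scans it,
            // the OS persists it, and other processes can map the same file.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            long restarts = map.getLong(0) + 1;
            map.putLong(0, restarts); // survives a restart of this JVM
            System.out.println("restart count: " + restarts);
        }
    }
}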
The use of off heap memory gives you more options in terms of how you design your system. The most important improvement is not performance, but determinism.
Off heap and testing
One of the biggest challenges in high performance computing is reproducing obscure bugs and being able to prove you have fixed them. By storing all your input events and data off heap in a persisted way, you can turn your critical systems into a series of complex state machines (or, in simple cases, just one state machine). In this way you get reproducible behaviour and performance between test and production.
A number of investment banks use this technique to replay a system reliably to any event in the day and work out exactly why that event was processed the way it was. More importantly, once you have a fix, you can show that you have fixed the issue which occurred in production, instead of finding an issue and hoping it was the same one.
Along with deterministic behaviour comes deterministic performance. In test environments, you can replay the events with realistic timings and show the latency distribution you expect to get in production. Some system jitter can't be reproduced, especially if the hardware is not the same, but you can get pretty close when you take a statistical view. To avoid taking a day to replay a day of data, you can add a threshold, e.g. if the time between events is more than 10 ms, you might wait only 10 ms. This allows you to replay a day of events with realistic timing in under an hour and see whether your changes have improved your latency distribution or not.
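A sketch of that capped-gap replay loop (TimestampedEvent and process are hypothetical names, and events are assumed to be in timestamp order):

import java.util.List;

static final long MAX_GAP_MS = 10; // the threshold from the text

void replay(List<TimestampedEvent> events) throws InterruptedException {
    long prev = -1;
    for (TimestampedEvent e : events) {
        if (prev >= 0) {
            long gap = e.timestampMs() - prev;
            // Preserve realistic gaps, but cap long quiet periods at 10 ms.
            Thread.sleep(Math.min(Math.max(0, gap), MAX_GAP_MS));
        }
        process(e); // hypothetical event handler under test
        prev = e.timestampMs();
    }
}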
By going more low level, don't you lose some of "compile once, run anywhere"?
To some degree this is true, but it is far less than you might think. When you work closer to the processor, you are more dependent on how the processor or the OS behaves. Fortunately, most systems use AMD/Intel processors, and even ARM processors are becoming more compatible in terms of the low level guarantees they provide. There are also differences between OSes, and these techniques tend to work better on Linux than on Windows. However, if you develop on Mac OS X or Windows and use Linux for production, you shouldn't have any issues. This is what we do at Higher Frequency Trading.
What new problems are we creating by using off heap?
Nothing comes for free, and this is the case with off heap. The biggest issue is that your data structures become less natural. You either need a simple data structure which can be mapped directly off heap, or a complex data structure which is serialized and deserialized to move it off heap. Obviously, serialization has its own headaches and a performance hit, and can be much slower than working with on heap objects.
In the financial world, most high frequency ticking data structures are flat and simple, full of primitives, which map nicely off heap with little overhead. However, this doesn't apply in all applications; with complex nested data structures, e.g. graphs, you can end up having to cache some objects on heap as well.
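A sketch of what "flat and simple" means in practice: a tick record of fixed-offset primitives written straight into a direct (off heap) ByteBuffer. The 24-byte layout is invented for illustration:

import java.nio.ByteBuffer;

// One tick = 24 bytes: timestamp (8) + price (8) + quantity (8).
// No object is created per tick, so there is nothing for the GC to do.
class TickStore {
    static final int RECORD_SIZE = 24;
    private final ByteBuffer buf = ByteBuffer.allocateDirect(1_000_000 * RECORD_SIZE);

    void write(int index, long timestampMs, double price, long quantity) {
        int base = index * RECORD_SIZE;
        buf.putLong(base, timestampMs);
        buf.putDouble(base + 8, price);
        buf.putLong(base + 16, quantity);
    }

    double price(int index) {
        return buf.getDouble(index * RECORD_SIZE + 8);
    }
}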
Another problem is that, normally, the JVM limits how much of the system you can use, so you don't have to worry much about the JVM overloading the machine. With off heap, some of those limitations are lifted: you can use data structures much larger than main memory, and then you have to start worrying about what kind of disk sub-system you have. For example, you don't want to be paging to an HDD which manages 80 IOPS (Input/Output Operations Per Second); instead, you are likely to want an SSD with 80,000 IOPS or better, i.e. 1000x faster.
How does OpenHFT help?
OpenHFT has a number of libraries which hide the fact that you are really using native memory to store your data. These data structures are persisted and can be used with little or no garbage. They are used in applications which run all day without a minor collection.
Chronicle Queue – Persisted queue of events. Supports concurrent writers across JVMs on the same machine and concurrent readers across machines. Micro-second latencies and sustained throughputs in the millions of messages per second.
Chronicle Map – Native or Persisted storage of a key-value Map. Can be shared between JVMs on the same machine, replicated via UDP or TCP and/or accessed remotely via TCP. Micro-second latencies and sustained read/write rates in the millions of operations per second per machine.
Thread Affinity – Binding of critical threads to isolated cores or logical CPUs to minimise jitter. Can reduce jitter by a factor of 1000.
Which API to use?
If you need to record every event -> Chronicle Queue
If you only need the latest result for a unique key -> Chronicle Map
If you care about 20 micro-second jitter -> Thread Affinity
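As a rough illustration only, assuming Chronicle Map 3's builder API (check the OpenHFT documentation for the exact signatures; the file name and data are invented), keeping the latest price per key in a persisted, shareable map might look like:

import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;
import java.io.IOException;

public class LatestPrice {
    public static void main(String[] args) throws IOException {
        // A persisted off heap map; other JVMs on the same machine can open the same file.
        try (ChronicleMap<String, Double> prices = ChronicleMap
                .of(String.class, Double.class)
                .name("prices")
                .averageKey("EURUSD") // String keys vary in size, so give a typical one
                .entries(10_000)
                .createPersistedTo(new File("prices.dat"))) {
            prices.put("EURUSD", 1.0923);
            System.out.println(prices.get("EURUSD"));
        }
    }
}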
Conclusion
Off heap memory has its challenges, but it also comes with a lot of benefits. Where you see the biggest gain is in comparison with other solutions introduced to achieve scalability: off heap is likely to be simpler and much faster than partitioned/sharded on heap caches, messaging solutions, or out of process databases. By being faster, you may find that some of the tricks you previously needed for performance are no longer required, e.g. off heap solutions can support synchronous writes to the OS, instead of having to perform them asynchronously with the risk of data loss.
The biggest gains, however, can be your startup time, giving you a production system which restarts much faster (mapping in a 1 TB data set can take 10 milli-seconds), and ease of reproducibility in test: by replaying every event in order, you get the same behaviour every time. This allows you to produce quality systems you can rely on.
Reference: On heap vs off heap memory usage from our JCG partner Peter Lawrey at the Vanilla Java blog.