Lies, statistics and vendors
Overview
Reading performance results supplied by vendors is a skill in itself. It can be difficult to compare numbers from different vendors on a fair basis, and even more difficult to estimate how a product will behave in your system.
Lies and statistics
Peak Performance – A manufacturer’s guarantee not to exceed a given rating
— Computer Architecture, A Quantitative Approach. (1st edition)
Why is it so hard to give a trustworthy performance number?
- Latencies and throughputs don’t follow a normal distribution, which is the basis of mathematically rigorous statistics. This means you are modelling something for which there is no generally accepted mathematical model.
- There are many different assumptions you can make, ways to test your solution and ways to represent the results.
- You need to use benchmarks to measure something, but those benchmarks are either a) not standard, b) not representative of your use case, or c) can be optimised for in ways which don’t help you.
- Vendors understand their products and sensibly select the best hardware for them. This works best if you only have one product to consider. Multi-product systems may not have a hardware configuration that is optimal for every product, even if your organisation allowed you to buy that optimal hardware.
- It is easy to report only the best results tested and omit results which were not so good.
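The first point above is worth demonstrating. As a sketch (not from the article), the following models latency with a log-normal distribution, which is right-skewed like real latency data; the class and method names are illustrative. It shows why a quoted mean or median says little about the tail:

```java
import java.util.Arrays;
import java.util.Random;

public class SkewedLatency {
    // Generate n log-normally distributed "latencies" in microseconds,
    // sorted, seeded for repeatability. (Illustrative model only.)
    static double[] sampleLogNormal(long seed, int n) {
        Random rnd = new Random(seed);
        double[] lat = new double[n];
        for (int i = 0; i < n; i++)
            lat[i] = Math.exp(3 + rnd.nextGaussian()); // median ~ e^3 ≈ 20 us
        Arrays.sort(lat);
        return lat;
    }

    // Value at the given percentile (0-100) of a sorted sample.
    static double percentile(double[] sorted, double pct) {
        return sorted[(int) (sorted.length * pct / 100.0)];
    }

    public static void main(String[] args) {
        double[] lat = sampleLogNormal(42, 100_000);
        double mean = Arrays.stream(lat).average().orElse(0);
        // For a skewed distribution, mean > median and the tail is far above both.
        System.out.printf("mean %.1f, median %.1f, 99%% %.1f, 99.9%% %.1f (us)%n",
                mean, percentile(lat, 50), percentile(lat, 99), percentile(lat, 99.9));
    }
}
```

For a normal distribution the mean and median would coincide; here the 99th percentile lands many times above the median, which is exactly the behaviour that summary figures hide.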
Any decent vendor will use their benchmarks to optimise their solution. The downside is that the solution will have been optimised more for the benchmarks they report than for use cases the vendor hasn’t tested, e.g. your use case.
BTW: I often find it interesting to see what use cases the vendor had in mind when they benchmark their solutions. This can be a good indication of a) what it is good for, b) the assumptions made in designing the solution, and c) how it is generally used already.
Should we ignore all benchmarks?
Percentiles for latency
Percentile | One in N | Scale | Notes |
---|---|---|---|
50% | “typical” | 1x | This is a good indication of what is possible. It is the most optimistic figure you could use |
90% | one in ten | 2x-3x | This is a better indication of performance if tested on a real, complex system. |
99% | one in 100 | 4x-10x | For benchmarks of simplified systems, this is a better indication of what you can realistically expect to achieve |
99.9% | one in 1,000 | 10x-30x | For benchmarks of simplified systems, this is a conservative indication of what you can expect. |
99.99% | one in 10,000 | 20x-100x | This number is nice to have but difficult to reproduce, even for the same benchmark, let alone for a different use case. See below |
99.999% | one in 100,000 | varies | This number is almost impossible to reproduce between systems. See below |
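To make the table concrete, here is a minimal sketch of computing percentiles from recorded latencies using the nearest-rank method; the class and method names are illustrative, not from the article:

```java
import java.util.Arrays;

public class LatencyPercentiles {
    // Value at the given percentile (0-100) using the nearest-rank method.
    static long percentile(long[] latencies, double pct) {
        long[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        // Hypothetical latencies in microseconds: mostly fast, with a long tail.
        long[] micros = {12, 15, 11, 14, 13, 20, 95, 13, 12, 250};
        System.out.println("50%: " + percentile(micros, 50) + " us");
        System.out.println("90%: " + percentile(micros, 90) + " us");
        System.out.println("99%: " + percentile(micros, 99) + " us");
    }
}
```

Even in this tiny sample the 90th percentile is several times the median, matching the 2x-10x scale factors in the table above.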
A guide to the number of samples you need for reproducible numbers
Java has an additional characteristic: it gets faster as it warms up. In the past I have advocated discarding these warm-up figures, but given that micro-benchmarks already give overly optimistic figures, I am now more inclined to include them, if for no other reason than it is simpler. My rule of thumb for reproducible percentile figures is that for 1 in N, you need N^1.5 samples for simple micro-benchmarks and N^2 samples for complex systems.
Percentile | One in N | Simple test samples | Complex test samples |
---|---|---|---|
90% | one in ten | ~ 30 | ~ 100 |
99% | one in 100 | ~ 1,000 | ~ 10,000 |
99.9% | one in 1,000 | ~ 30,000 | ~ 1 million |
99.99% | one in 10,000 | ~1 million | ~ 100 million |
99.999% | one in 100,000 | ~ 30 million | ~ 10 billion |
99.9999% | one in 1,000,000 | ~ one billion | ~ one trillion |
Maximum or 100% | never | Infinite | Infinite |
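The rule of thumb behind the table can be sketched as follows (the class and method names are my own, for illustration):

```java
public class SamplesNeeded {
    // For the 1-in-N percentile, roughly N^1.5 samples are needed
    // for a simple micro-benchmark...
    static long simpleSamples(long oneInN) {
        return Math.round(Math.pow(oneInN, 1.5));
    }

    // ...and roughly N^2 samples for a complex system.
    static long complexSamples(long oneInN) {
        return oneInN * oneInN;
    }

    public static void main(String[] args) {
        for (long n : new long[]{10, 100, 1_000, 10_000, 100_000, 1_000_000})
            System.out.printf("1 in %,d: simple ~%,d, complex ~%,d%n",
                    n, simpleSamples(n), complexSamples(n));
    }
}
```

The maximum corresponds to 1 in infinity, which is why no finite number of samples can measure it.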
Based on this rule of thumb, I don’t believe a real maximum can be measured empirically. Nevertheless, not reporting it at all isn’t satisfactory either. Some benchmarks report the “worst in sample”, which is better than nothing but very hard to reproduce. To mitigate the cost of warm-up in real systems, I suggest latency-critical classes be pre-loaded, if not warmed up, on start-up of your application.
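Pre-loading can be as simple as forcing class loading and static initialisation at start-up. A minimal sketch, with placeholder class names standing in for your own hot-path classes:

```java
public class Preloader {
    // Load and initialise the named classes now, so the work doesn't happen
    // on the first latency-critical request. Returns how many loaded successfully.
    static int preload(String... classNames) {
        int loaded = 0;
        for (String name : classNames) {
            try {
                // initialize=true also runs static initialisers
                Class.forName(name, true, Preloader.class.getClassLoader());
                loaded++;
            } catch (ClassNotFoundException e) {
                System.err.println("Could not pre-load " + name);
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        // Placeholder examples; substitute the classes on your hot path.
        int n = preload("java.util.concurrent.ConcurrentHashMap",
                        "java.nio.ByteBuffer");
        System.out.println("Pre-loaded " + n + " classes");
    }
}
```

This only avoids class-loading cost; JIT warm-up still requires the code paths to actually run.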
In summary
If you are looking for a performance figure you can use, I suggest the 99th percentile as a good indication of what you can expect in a real system. If you want to be cautious, use the 99.9th percentile. If this number is not given, I would assume you might get about 10x the average or typical latency and 1/10th of the throughput the vendor can achieve under ideal conditions. Usually this is still more than enough. If the vendor quotes performance figures close to what you need, or worse, doesn’t quote figures at all, beware! I am amazed how many vendors will say they are fast, quick, the fastest, efficient, or high performance, yet don’t quote any figures at all.
Reference: Lies, statistics and vendors from our JCG partner Peter Lawrey at the Vanilla Java blog.