Building extremely large in-memory InputStream for testing purposes
For some reason I needed an extremely large, possibly even infinite InputStream that would simply return the same byte[] over and over. This way I could produce an insanely big stream of data by repeating a small sample. Somewhat similar functionality can be found in Guava: Iterable<T> Iterables.cycle(Iterable<T>) and Iterator<T> Iterators.cycle(Iterable<T>). For example, if you need an infinite source of 0 and 1, simply say Iterables.cycle(0, 1) and get 0, 1, 0, 1, 0, 1... infinitely. Unfortunately I haven’t found such a utility for InputStream, so I jumped into writing my own. This article documents the many mistakes I made during that process, mostly due to overcomplicating and overengineering a straightforward solution.
We don’t really need an infinite InputStream; being able to create a very large one (say, 32 GiB) is enough. So we are after the following method:
public static InputStream repeat(byte[] sample, int times)
It basically takes a sample array of bytes and returns an InputStream returning these bytes. However, when sample runs out, it rolls over, returning the same bytes again. This process is repeated the given number of times, until the InputStream signals its end. One solution that I haven’t really tried, but which seems most obvious:
public static InputStream repeat(byte[] sample, int times) {
    final byte[] allBytes = new byte[sample.length * times];
    for (int i = 0; i < times; i++) {
        System.arraycopy(sample, 0, allBytes, i * sample.length, sample.length);
    }
    return new ByteArrayInputStream(allBytes);
}
I see you laughing there! If sample is 100 bytes and we need 32 GiB of input repeating these 100 bytes, the generated InputStream shouldn’t really allocate 32 GiB of memory; we must be more clever here. As a matter of fact, repeat() above has another subtle bug. Arrays in Java are limited to 2^31-1 entries (the index is an int), and 32 GiB is way above that. The program still compiles, and the problem goes unnoticed at first, because of a silent integer overflow here: sample.length * times. This multiplication doesn’t fit in an int.
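To make the overflow concrete, here is a small sketch of the arithmetic for a 100-byte sample. The class name, the helper methods and the repetition count (roughly 32 GiB divided by 100) are illustrative assumptions, not part of the original code.

```java
class OverflowDemo {

    // Multiplies in int, exactly like `sample.length * times` in repeat() above.
    public static int sizeAsInt(int sampleLength, int times) {
        return sampleLength * times;           // silently wraps around on overflow
    }

    // Widening one operand to long first keeps the full result.
    public static long sizeAsLong(int sampleLength, int times) {
        return (long) sampleLength * times;
    }

    public static void main(String[] args) {
        int sampleLength = 100;
        int times = 343_597_384;               // about 32 GiB worth of 100-byte samples

        System.out.println(sizeAsInt(sampleLength, times));   // 32
        System.out.println(sizeAsLong(sampleLength, times));  // 34359738400

        try {
            // Fails loudly instead of wrapping around.
            Math.multiplyExact(sampleLength, times);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

With these numbers the int product wraps around to 32, so the naive repeat() would allocate a 32-byte array and the very first System.arraycopy() of 100 bytes would fail with an IndexOutOfBoundsException.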
OK, let’s try something that at least theoretically can work. My first idea was as follows: what if I create many ByteArrayInputStreams sharing the same byte[] sample (they don’t do an eager copy) and somehow join them together? Thus I needed some InputStream adapter that could take an arbitrary number of underlying InputStreams and chain them together: when the first stream is exhausted, switch to the next one. That awkward moment when you look for something in Apache Commons or Guava and apparently it was in the JDK forever… java.io.SequenceInputStream is almost ideal. However, its two-argument constructor chains precisely two underlying InputStreams (the other constructor takes an Enumeration of streams). Of course, since SequenceInputStream is an InputStream itself, we can use it recursively as an argument to an outer SequenceInputStream. Repeating this process, we can chain an arbitrary number of ByteArrayInputStreams together:
public static InputStream repeat(byte[] sample, int times) {
    if (times <= 1) {
        return new ByteArrayInputStream(sample);
    } else {
        return new SequenceInputStream(
                new ByteArrayInputStream(sample),
                repeat(sample, times - 1)
        );
    }
}
If times is 1, just wrap sample in a ByteArrayInputStream. Otherwise use SequenceInputStream recursively. I think you can immediately spot what’s wrong with this code: the recursion is too deep. The nesting level is the same as the times argument, which will reach millions or even billions. There must be a better way. Luckily, a minor improvement changes the recursion depth from O(n) to O(log n):
public static InputStream repeat(byte[] sample, int times) {
    if (times <= 1) {
        return new ByteArrayInputStream(sample);
    } else {
        return new SequenceInputStream(
                repeat(sample, times / 2),
                repeat(sample, times - times / 2)
        );
    }
}
Honestly, this divide-and-conquer version was the first implementation I tried. It’s a simple application of the divide and conquer principle, where we produce the result by evenly splitting it into two smaller sub-problems. It looks clever, but there is one issue: it’s easy to prove that we create t (t = times) ByteArrayInputStreams and O(t) SequenceInputStreams. While the sample byte array is shared, millions of InputStream instances are wasting memory. This leads us to an alternative implementation, creating just one InputStream regardless of the value of times:
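import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;

import com.google.common.collect.Iterators;
import org.apache.commons.lang3.ArrayUtils;

public static InputStream repeat(byte[] sample, int times) {
    // Box the sample once so that Guava's object-based iterators can work with it
    final Byte[] objArray = ArrayUtils.toObject(sample);
    final Iterator<Byte> infinite = Iterators.cycle(objArray);
    // Note: sample.length * times is an int multiplication and may overflow
    final Iterator<Byte> limited = Iterators.limit(infinite, sample.length * times);
    return new InputStream() {
        @Override
        public int read() throws IOException {
            // Mask with 0xFF so that byte -1 is returned as 255, not as end of stream
            return limited.hasNext() ? limited.next() & 0xFF : -1;
        }
    };
}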
We will use Iterators.cycle() after all. But first we have to translate byte[] into Byte[], since iterators can only work with objects, not primitives. There is no idiomatic way to turn an array of primitives into an array of boxed types, so I use ArrayUtils.toObject(byte[]) from Apache Commons Lang. Having an array of objects, we can create an infinite iterator that cycles through the values of sample. Since we don’t want an infinite stream, we cut off the infinite iterator using Iterators.limit(Iterator<T>, int), again from Guava. Now we just have to bridge from Iterator<Byte> to InputStream; after all, semantically they represent the same thing.
The Guava-based solution suffers from two problems. First of all, it needs a boxed Byte[] copy of the sample and unboxes a Byte on every read(). Since the JDK caches all 256 Byte values, this produces less garbage than it looks, and garbage collection is not that much concerned about dead, short-living objects anyway, but it still seems wasteful. The second issue we have already faced: the sample.length * times multiplication can cause an integer overflow. It can’t be fixed because Iterators.limit() takes an int, not a long, for no good reason. BTW, we avoided a third problem by doing a bitwise and with 0xFF. Otherwise a byte with value -1 would signal the end of the stream, which is not the case. (byte) -1 & 0xFF is correctly translated to the unsigned value 255 (as an int).
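The sign-extension pitfall is easy to see in isolation. The sketch below (class and method names are made up for illustration) widens a byte to an int with and without the mask:

```java
class UnsignedByteDemo {

    // Plain widening: the sign bit is extended, so 0xFF becomes -1.
    public static int widenSigned(byte b) {
        return b;
    }

    // Masking keeps only the low 8 bits, giving a value between 0 and 255.
    public static int widenUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xFF;
        System.out.println(widenSigned(b));    // -1, indistinguishable from end of stream
        System.out.println(widenUnsigned(b));  // 255
    }
}
```

InputStream.read() is specified to return a value in the range 0 to 255, or -1 at the end of the stream, which is why the mask is needed whenever a raw byte is returned from read().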
So even though the implementation above is short and sweet, declarative rather than imperative, it’s too slow and limited. If you have a C background, I can imagine how uncomfortable you were seeing me struggle. After all, the most straightforward, painfully simple and low-level implementation was the one I came up with last:
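public static InputStream repeat(final byte[] sample, final int times) {
    return new InputStream() {
        private long pos = 0;
        // Cast to long before multiplying, so the total cannot overflow an int
        private final long total = (long) sample.length * times;

        @Override
        public int read() throws IOException {
            // Mask with 0xFF so that negative bytes are not mistaken for end of stream
            return pos < total ? sample[(int) (pos++ % sample.length)] & 0xFF : -1;
        }
    };
}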
GC free, pure JDK, fast and simple to understand. Let this be a lesson for you: start with the simplest solution that comes to mind, don’t overengineer and don’t be too smart. My previous solutions (declarative, functional, immutable, etc.) maybe looked clever, but they were neither fast nor easy to understand.
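A quick way to convince yourself that the final version behaves is to drain a modest stream and count the bytes. The sketch below repeats the final repeat() with the 0xFF mask discussed earlier applied; the class name RepeatDemo and the countBytes() helper are hypothetical additions for illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class RepeatDemo {

    // The final repeat(), with negative bytes masked to the 0..255 range.
    public static InputStream repeat(final byte[] sample, final int times) {
        return new InputStream() {
            private long pos = 0;
            private final long total = (long) sample.length * times;

            @Override
            public int read() {
                return pos < total ? sample[(int) (pos++ % sample.length)] & 0xFF : -1;
            }
        };
    }

    // Hypothetical helper: drains the stream and returns the number of bytes seen.
    public static long countBytes(InputStream in) {
        try {
            long count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = {1, 2, 3};
        System.out.println(countBytes(repeat(sample, 1_000_000)));  // 3000000
    }
}
```

The counter is a long on purpose, so the same check keeps working once sample.length * times grows past 2 GiB.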
The utility we just developed was not just a toy project; it will be used later in a subsequent article.
Reference: Building extremely large in-memory InputStream for testing purposes from our JCG partner Tomasz Nurkiewicz at the Java and neighbourhood blog.