Adds a binary IonReader implementation capable of incremental reads.#355
Merged
Conversation
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of Changes
This PR adds a new binary
IonReaderimplementation that is enabled and configured via optionalIonReaderBuilderoptions.Motivation
The motivation is twofold:
implementation in most situations (see the
Performancesection below).always expects a complete value to be available when the user calls
next(). If only part ofa value is available, the existing reader will raise an
UnexpectedEOFExceptionand terminateprocessing, even if data completing the value would have become available later. The new
implementation supports incremental (streaming) use cases by allowing the user to wait for
additional data to become available and call
next()again to proceed.Code reuse was not a goal of this project. I wanted to build a new implementation from scratch
to avoid being influenced or constrained by any potentially performance-limiting code used
by the existing implementation. As such, there is intentionally some logic that probably looks
similar between the implementations.
Testing
The classes added as part of this change are covered by new unit tests. Additionally, an enum
value for the incremental reader implementation was added so that any existing reader tests
(including the tests generated from the
ion-testsfiles) exercise both the old and newimplementations.
Performance
The ion-java-benchmark-cli
was updated to support the new incremental implementation. It was then used to compare
latency, heap usage, and garbage collection stats for the existing and new implementation for
various data. Below are the results, which are for fully-materialized traversals using
default settings unless otherwise stated.
Medium-size log stream
A 22MB stream of struct values descending to depth ~5, containing timestamps, strings, symbol
values, annotations, nested lists and structs, ints, 32-bit floats, and decimals. Each top-level
value is up to a couple kilobytes in size. Local symbol table appends are interspersed with user
values.
Full traversal
The new implementation is 52% faster (515ms -> 246ms) and uses 33% less heap space (30MB -> 20MB).
Full results here.
Sparse read
This run navigated to and materialized a single depth 1 string value from each top level struct.
The new implementation is 31% faster (96ms -> 66ms) and uses 56% less heap space (54MB -> 24MB).
Full results here.
Large log stream
A 7.7GB stream of values with similar shape to the values in the medium-size log stream (see
above), but with significantly different values, field names, and annotations, and with
significantly more symbols and larger top-level values.
The new implementation is 35% faster (110s -> 71s), uses similar heap space (123MB -> 135MB),
and spends 49% less time performing garbage collections (506ms -> 260ms).
Full results here.
Single small value
A single 5KB value from the medium-size log stream (see above).
The new implementation is 24% faster (267us -> 202us).
Full results here.
Single large list
A single 1MB list of random float values. The code used to generate the list can be found
here.
Default initial buffer size (32K)
With the default buffer size of 32K, the incremental reader's internal buffer must grow several
times in order to fit the 1MB value, consuming a lot of heap space and leading to GC churn.
Nevertheless, the new implementation is 14% faster (9342us -> 8019us) despite the fact that
it uses much more heap space (3MB -> 107MB) and spends much more time in GC (67ms -> 603ms).
Full results here.
Sufficient buffer size to fit the value (1.01 MB)
When it is known that the data to parse contains very few values of a certain maximum size that
exceeds the default buffer size, users may configure the incremental reader with a given
initial buffer size to avoid the need for growth.
Doing this made the new implementation 56% faster (9277us -> 4068us), even though heap usage
(3MB -> 34MB) and GC time (71ms -> 159ms) still exceeded the existing implementation.
Full results here.
Many large lists
100MB stream of 1MB lists of random float values. The code used to generate the lists can be found
here.
Default initial buffer size (32K)
The new implementation is 54% faster (818ms -> 372ms), but uses more heap space (3MB -> 13MB).
Full results here.
Sufficient buffer size to fit a value (1.01 MB)
Because the stream contains 100 values, setting an appropriate initial buffer size is less
important because buffer growth occurs during processing of the first value only and is amortized
over all the values.
The new implementation is 57% faster (853ms -> 367ms) and uses similar heap space (2.7MB vs 3.8MB).
Full results here.
Random blobs/clobs
100MB stream of random blobs or clobs of random size between 0 and 512 bytes. The code used to
generate the lobs can be found
here.
Both implementations take almost identical time (69ms -> 71ms), heap usage (50MB -> 50MB), and GC
time (2ms -> 5ms).
Full results here
and here.
Random decimals
108MB stream of random decimals with varying precision. The code used to generate the decimals
can be found
here.
Read via
decimalValue()The new implementation is 24% faster (1669ms -> 1260ms) and uses similar heap space (27MB -> 32MB).
Full results here.
Read via
bigDecimalValue()The new implementation is 36% faster (1748ms -> 1119ms) and uses similar heap space (28 -> 26MB).
Full results here.
Random floats
100MB stream of random 32- and 64-bit floats. The code used to generate the floats can be found
here.
The new implementation is 4% faster (934ms -> 899ms) and uses almost identical heap space (3.8MB
-> 3.8MB), with no time spent in GC in either case.
Full results here.
Random annotations on random floats
100MB stream of random floats annotated with up to 3 annotations each from a pool of 500
symbols of various lengths between 0 and 20.
The code used to generate the annotated floats can be found
here.
Annotations read as strings
The new implementation is 31% faster (1487ms -> 1021ms) and uses similar heap space (30MB -> 31MB).
Full results here.
Annotations read as SymbolTokens
The new implementation is 20% faster (1380ms -> 1106ms) and uses less heap space (26MB -> 18MB).
Full results here.
Random ints
110MB stream of random ints of various sizes. The code used to generate the ints can be found
here.
The new implementation is 24% faster (1474ms -> 1115ms) and uses similar heap space (20MB -> 17MB).
Full results here.
Random strings
174MB stream of random strings of various sizes between 0 and 20. The code used to generate the
strings can be found
here.
The new implementation is 12% faster (1262ms -> 1113ms) and uses identical heap space (16MB) and
GC time (28ms).
Full results here.
Random symbol values
123MB stream of symbol values pulled from a pool of 500 random symbols of various sizes between
0 and 20. The code used to generate the symbol values can be found
here.
Read to strings using
stringValue()The new implementation is 9% faster (3397ms -> 3095ms) and uses identical heap space (4MB) and
zero GC time.
Full results here.
Read to SymbolTokens using
symbolValue()The new implementation is 12% faster (3582ms -> 3130ms), uses much less heap space (28MB -> 4MB),
and less GC time (96ms -> 0ms).
Full results here.
Random timestamps
113MB stream of random timestamps with various precisions. The code used to generate the timestamps
can be found
here.
The new implementation is 20% faster (1818ms -> 1449ms) and uses slightly more heap space (11MB ->
18MB).
Full results here.
Caveats
The new implementation only supports incremental reading at the top level, meaning that it must
buffer an entire top-level value (and any system values that precede it) before allowing the
user to start consuming the value. This means the internal buffer must grow to the size of the
largest top-level value (plus preceding system values) in the stream. The old implementation
buffers data using fixed-size pages and only allocates additional pages if a single leaf value
exceeds the size of a page, which is rare.
As a result, the new implementation is not suitable for use with data that contains values
that approach or exceed the amount of memory allocated to the JVM, and currently cannot support
values larger than 2 GB in any case because the underlying buffer is a byte array indexed by an
int. This should not be a common use case.In order to avoid inadvertently attempting to process a humongous top-level value that would
result in an
OutOfMemoryErrorwith the new implementation, an option is providedto allow the user to configure the maximum size of the internal buffer. Under no circumstances
will the buffer grow beyond this size, even if that means a large value must be dropped. A
callback must be implemented by users to handle this case if it occurs.
As described previously, the new implementation is a normal
IonReaderimplementation except thatit supports incremental reads, which necessitates slightly different behavior when calling
next()at the top level. With the new implementation, receivingnullfromnext()at thetop level means that there is not enough data to complete a value. The user may decide either to
wait and call
next()again in the future, at which pointnext()will return a non-nullIonTypeif a complete value has become available; or, the user may decide that the stream iscomplete and
close()the reader. Ifclose()is called when the reader holds apartially-buffered value,
UnexpectedEOFExceptionwill be raised, mimicking the behavior of theold implementation. If any existing users rely on being able to partially read an incomplete
top-level value before handling an
UnexpectedEOFException, the new implementation will not besuitable. This is also unlikely to be common.
Example
The following example creates an incremental reader over a growing binary Ion file and limits
the size of the internal buffer to 128K.
Suggested Review Order
IonReaderLookaheadBuffer and ResizingPipedInputStream.
IonReaderLookaheadBuffer to ensure a complete top-level value is buffered. Navigates through
the stream and parses values at the direction of the user.
and configure its buffers.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.