Skip to content

Adds a binary IonReader implementation capable of incremental reads.#355

Merged
tgregg merged 21 commits into
masterfrom
incremental-reader-merge
Sep 8, 2021
Merged

Adds a binary IonReader implementation capable of incremental reads.#355
tgregg merged 21 commits into
masterfrom
incremental-reader-merge

Conversation

@tgregg
Copy link
Copy Markdown
Contributor

@tgregg tgregg commented Apr 10, 2021

Description of Changes

This PR adds a new binary IonReader implementation that is enabled and configured via optional
IonReaderBuilder options.

Motivation

The motivation is twofold:

  1. Improve performance. The new implementation provides improved performance over the existing
    implementation in most situations (see the Performance section below).
  2. Support incremental reading. The existing implementation is purely blocking, meaning that it
    always expects a complete value to be available when the user calls next(). If only part of
    a value is available, the existing reader will raise an UnexpectedEOFException and terminate
    processing, even if data completing the value would have become available later. The new
    implementation supports incremental (streaming) use cases by allowing the user to wait for
    additional data to become available and call next() again to proceed.

Code reuse was not a goal of this project. I wanted to build a new implementation from scratch
to avoid being influenced or constrained by any potentially performance-limiting code used
by the existing implementation. As such, there is intentionally some logic that probably looks
similar between the implementations.

Testing

The classes added as part of this change are covered by new unit tests. Additionally, an enum
value for the incremental reader implementation was added so that any existing reader tests
(including the tests generated from the ion-tests files) exercise both the old and new
implementations.

Performance

The ion-java-benchmark-cli
was updated to support the new incremental implementation. It was then used to compare
latency, heap usage, and garbage collection stats for the existing and new implementation for
various data. Below are the results, which are for fully-materialized traversals using
default settings unless otherwise stated.

Medium-size log stream

A 22MB stream of struct values descending to depth ~5, containing timestamps, strings, symbol
values, annotations, nested lists and structs, ints, 32-bit floats, and decimals. Each top-level
value is up to a couple kilobytes in size. Local symbol table appends are interspersed with user
values.

Full traversal

The new implementation is 52% faster (515ms -> 246ms) and uses 33% less heap space (30MB -> 20MB).

Full results here.

Sparse read

This run navigated to and materialized a single depth 1 string value from each top level struct.

The new implementation is 31% faster (96ms -> 66ms) and uses 56% less heap space (54MB -> 24MB).

Full results here.

Large log stream

A 7.7GB stream of values with similar shape to the values in the medium-size log stream (see
above), but with significantly different values, field names, and annotations, and with
significantly more symbols and larger top-level values.

The new implementation is 35% faster (110s -> 71s), uses similar heap space (123MB -> 135MB),
and spends 49% less time performing garbage collections (506ms -> 260ms).

Full results here.

Single small value

A single 5KB value from the medium-size log stream (see above).

The new implementation is 24% faster (267us -> 202us).

Full results here.

Single large list

A single 1MB list of random float values. The code used to generate the list can be found
here.

Default initial buffer size (32K)

With the default buffer size of 32K, the incremental reader's internal buffer must grow several
times in order to fit the 1MB value, consuming a lot of heap space and leading to GC churn.

Nevertheless, the new implementation is 14% faster (9342us -> 8019us) despite the fact that
it uses much more heap space (3MB -> 107MB) and spends much more time in GC (67ms -> 603ms).

Full results here.

Sufficient buffer size to fit the value (1.01 MB)

When it is known that the data to parse contains very few values of a certain maximum size that
exceeds the default buffer size, users may configure the incremental reader with a given
initial buffer size to avoid the need for growth.

Doing this made the new implementation 56% faster (9277us -> 4068us), even though heap usage
(3MB -> 34MB) and GC time (71ms -> 159ms) still exceeded the existing implementation.

Full results here.

Many large lists

100MB stream of 1MB lists of random float values. The code used to generate the lists can be found
here.

Default initial buffer size (32K)

The new implementation is 54% faster (818ms -> 372ms), but uses more heap space (3MB -> 13MB).

Full results here.

Sufficient buffer size to fit a value (1.01 MB)

Because the stream contains 100 values, setting an appropriate initial buffer size is less
important because buffer growth occurs during processing of the first value only and is amortized
over all the values.

The new implementation is 57% faster (853ms -> 367ms) and uses similar heap space (2.7MB vs 3.8MB).

Full results here.

Random blobs/clobs

100MB stream of random blobs or clobs of random size between 0 and 512 bytes. The code used to
generate the lobs can be found
here.

Both implementations take almost identical time (69ms -> 71ms), heap usage (50MB -> 50MB), and GC
time (2ms -> 5ms).

Full results here
and here.

Random decimals

108MB stream of random decimals with varying precision. The code used to generate the decimals
can be found
here.

Read via decimalValue()

The new implementation is 24% faster (1669ms -> 1260ms) and uses similar heap space (27MB -> 32MB).

Full results here.

Read via bigDecimalValue()

The new implementation is 36% faster (1748ms -> 1119ms) and uses similar heap space (28 -> 26MB).

Full results here.

Random floats

100MB stream of random 32- and 64-bit floats. The code used to generate the floats can be found
here.

The new implementation is 4% faster (934ms -> 899ms) and uses almost identical heap space (3.8MB
-> 3.8MB), with no time spent in GC in either case.

Full results here.

Random annotations on random floats

100MB stream of random floats annotated with up to 3 annotations each from a pool of 500
symbols of various lengths between 0 and 20.

The code used to generate the annotated floats can be found
here.

Annotations read as strings

The new implementation is 31% faster (1487ms -> 1021ms) and uses similar heap space (30MB -> 31MB).

Full results here.

Annotations read as SymbolTokens

The new implementation is 20% faster (1380ms -> 1106ms) and uses less heap space (26MB -> 18MB).

Full results here.

Random ints

110MB stream of random ints of various sizes. The code used to generate the ints can be found
here.

The new implementation is 24% faster (1474ms -> 1115ms) and uses similar heap space (20MB -> 17MB).

Full results here.

Random strings

174MB stream of random strings of various sizes between 0 and 20. The code used to generate the
strings can be found
here.

The new implementation is 12% faster (1262ms -> 1113ms) and uses identical heap space (16MB) and
GC time (28ms).

Full results here.

Random symbol values

123MB stream of symbol values pulled from a pool of 500 random symbols of various sizes between
0 and 20. The code used to generate the symbol values can be found
here.

Read to strings using stringValue()

The new implementation is 9% faster (3397ms -> 3095ms) and uses identical heap space (4MB) and
zero GC time.

Full results here.

Read to SymbolTokens using symbolValue()

The new implementation is 12% faster (3582ms -> 3130ms), uses much less heap space (28MB -> 4MB),
and less GC time (96ms -> 0ms).

Full results here.

Random timestamps

113MB stream of random timestamps with various precisions. The code used to generate the timestamps
can be found
here.

The new implementation is 20% faster (1818ms -> 1449ms) and uses slightly more heap space (11MB ->
18MB).

Full results here.

Caveats

The new implementation only supports incremental reading at the top level, meaning that it must
buffer an entire top-level value (and any system values that precede it) before allowing the
user to start consuming the value. This means the internal buffer must grow to the size of the
largest top-level value (plus preceding system values) in the stream. The old implementation
buffers data using fixed-size pages and only allocates additional pages if a single leaf value
exceeds the size of a page, which is rare.

As a result, the new implementation is not suitable for use with data that contains values
that approach or exceed the amount of memory allocated to the JVM, and currently cannot support
values larger than 2 GB in any case because the underlying buffer is a byte array indexed by an
int. This should not be a common use case.

In order to avoid inadvertently attempting to process a humongous top-level value that would
result in an OutOfMemoryError with the new implementation, an option is provided
to allow the user to configure the maximum size of the internal buffer. Under no circumstances
will the buffer grow beyond this size, even if that means a large value must be dropped. A
callback must be implemented by users to handle this case if it occurs.

As described previously, the new implementation is a normal IonReader implementation except that
it supports incremental reads, which necessitates slightly different behavior when calling
next() at the top level. With the new implementation, receiving null from next() at the
top level means that there is not enough data to complete a value. The user may decide either to
wait and call next() again in the future, at which point next() will return a non-null
IonType if a complete value has become available; or, the user may decide that the stream is
complete and close() the reader. If close() is called when the reader holds a
partially-buffered value, UnexpectedEOFException will be raised, mimicking the behavior of the
old implementation. If any existing users rely on being able to partially read an incomplete
top-level value before handling an UnexpectedEOFException, the new implementation will not be
suitable. This is also unlikely to be common.

Example

The following example creates an incremental reader over a growing binary Ion file and limits
the size of the internal buffer to 128K.

InputStream in = new FileInputStream("growingFile.10n");
IonReader reader = IonReaderBuilder.standard()
    .withIncrementalReadingEnabled(true)
    .withBufferConfiguration(
        IonBufferConfiguration.Builder.standard()
            .withMaximumBufferSize(128 * 1024)
            .onOversizedSymbolTable(() -> {
                // A symbol table alone exceeded the maximum buffer size. This is not
                // recoverable because any values following this symbol table may rely on
                // it. Returning normally causes the reader to cleanly exit, but some
                // implementations may wish to throw an exception here.
            })
            .onOversizedValue(() -> {
                // Return normally, causing the oversize value to be skipped and processing
                // to continue. Some implementations may wish to throw an exception here,
                // causing processing to abort.
            })
            .onData(numberOfBytes -> {
                // This implementation does not care about byte count; do nothing. Some
                // implementations may wish to use metrics to track the number of bytes
                // processed.
            })
            .build()
    )
    .build(in);
// Data in the file (text-equivalent): "foo" [123,
reader.next(); // Returns IonType.STRING
reader.next(); // Returns null
// Data is appended to the file: 4.56]
reader.next(); // Returns IonType.LIST
reader.stepIn();
reader.next(); // Returns IonType.INT
reader.next(); // Returns IonType.DECIMAL
reader.next(); // Returns null (end of container)
reader.stepOut();
// The file has stopped growing.
reader.next(); // Returns null
reader.close(); // Succeeds.

Suggested Review Order

  1. ResizingPipedInputStream - manages the growable buffer that holds top-level values.
  2. IonReaderLookaheadBuffer - fills the ResizingPipedInputStream with complete top-level values.
  3. *BufferConfiguration, *BufferEventHandler - allows the user to configure the
    IonReaderLookaheadBuffer and ResizingPipedInputStream.
  4. IonReaderBinaryIncremental - the core incremental IonReader implementation. Uses an
    IonReaderLookaheadBuffer to ensure a complete top-level value is buffered. Navigates through
    the stream and parses values at the direction of the user.
  5. IonReaderBuilder - provides the options that allow users to enable the incremental reader
    and configure its buffers.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants