Benchmark comparing DOM load and path extractor#8
Conversation
| more details. | ||
|
|
||
| To execute the benchmarks run: `gradle --no-daemon jmh`, requires an internet connection as it downloads the data set. | ||
| Results bellow, higher is better. |
| final IonReader reader = newReader(inputStream); | ||
| final IonWriter writer = newBinaryWriter(binaryOut) | ||
| ) { | ||
| // all data is in the `dataset` key as a list, only write out that field to keep the extractor smaller |
There was a problem hiding this comment.
Can you elaborate on this? It looks like the binary and text data has different structure?
There was a problem hiding this comment.
The text data is created by writing the binary data as Ion text so they are the same, see line 89.
Both are different than the original dataset. The original dataset is something like:
{
<some meta field>: <value>,
<other meta field>: <value>,
"dataset": [ <item1>, <item2> ]
}
the bulk of the data is in the dataset struct field, that code is only picking the contents of the dataset and writing them out, e.g. <item1><item2>.
I'm doing this to make the search paths less verbose but not really work it if it caused confusion. Will change the benchmark to work with the data as is.
There was a problem hiding this comment.
The binary and text are the same as bytesText is created by writing out bytesBinary as IonText.
I'm extracting the dataset field from the original dataset for the benchmark test data as the bulk of the data is there, the rest is just some metadata. But thinking more on it it's probably not worth it as can cause confusion, better to have the benchmark to work on the original dataset, will change that
There was a problem hiding this comment.
Ah, I see that now, thanks. I'm fine with it as-is.
| /** | ||
| * Benchmarks comparing the PathExtractor with fully materializing the DOM. | ||
| */ | ||
| public class PathExtractorBenchmark { |
There was a problem hiding this comment.
It would be nice to make this pluggable for different data sets as well, since the performance of the path extractor is highly dependent on the characteristics of the data. It would be good to eventually provide results for several data sets along with a description of the data and what was skipped. This doesn't block the initial release.
| ``` | ||
|
|
||
| Using the path extractor has equivalent performance for both text and binary when fully materializing the document and | ||
| can give significant performance improvements when partially materializing binary documents. This happens due to Ion's |
See README.md changes for description of benchmarks
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.