Reduce code duplication in Arrow code #2746

Merged
rymurr merged 6 commits into apache:master from nastra:arrow-refactoring
Jul 12, 2021

Conversation

@nastra nastra commented Jun 28, 2021

Contributor

No description provided.

@nastra nastra marked this pull request as draft June 28, 2021 09:45
@github-actions github-actions Bot added the arrow label Jun 28, 2021
@nastra nastra force-pushed the arrow-refactoring branch 2 times, most recently from ed4ca7e to 0addb2c Compare June 28, 2021 16:16
@github-actions github-actions Bot added the spark label Jun 28, 2021
@nastra nastra force-pushed the arrow-refactoring branch from 0addb2c to 0652940 Compare June 28, 2021 17:05
@nastra nastra marked this pull request as ready for review June 29, 2021 05:02

nastra commented Jun 29, 2021

Contributor Author

@rymurr when reviewing, it is probably easier to look at the modified files directly than at the diff.

@nastra nastra force-pushed the arrow-refactoring branch from 0652940 to 4ddd3fa Compare June 29, 2021 08:03

nastra commented Jun 29, 2021

Contributor Author

Results on branch master

Benchmark                                                                 Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5  1.933 ± 0.171   s/op
VectorizedReadFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5  1.548 ± 0.040   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  8.692 ± 0.245   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5  7.726 ± 0.163   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5  2.564 ± 0.074   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5  2.419 ± 0.094   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5  2.466 ± 0.128   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5  2.274 ± 0.033   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5  2.385 ± 0.171   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5  2.437 ± 0.125   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5  2.490 ± 0.115   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5  2.654 ± 0.182   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5  4.884 ± 0.381   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5  4.366 ± 0.439   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5  1.729 ± 0.150   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5  1.896 ± 0.174   s/op

Results on branch arrow-refactoring

Benchmark                                                                 Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readDatesIcebergVectorized5k         ss    5  1.611 ± 0.035   s/op
VectorizedReadFlatParquetDataBenchmark.readDatesSparkVectorized5k           ss    5  1.496 ± 0.072   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsIcebergVectorized5k      ss    5  8.461 ± 0.251   s/op
VectorizedReadFlatParquetDataBenchmark.readDecimalsSparkVectorized5k        ss    5  8.250 ± 0.083   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesIcebergVectorized5k       ss    5  2.872 ± 0.050   s/op
VectorizedReadFlatParquetDataBenchmark.readDoublesSparkVectorized5k         ss    5  2.781 ± 0.129   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsIcebergVectorized5k        ss    5  2.752 ± 0.216   s/op
VectorizedReadFlatParquetDataBenchmark.readFloatsSparkVectorized5k          ss    5  2.261 ± 0.101   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersIcebergVectorized5k      ss    5  2.743 ± 0.123   s/op
VectorizedReadFlatParquetDataBenchmark.readIntegersSparkVectorized5k        ss    5  2.652 ± 0.221   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k         ss    5  2.754 ± 0.818   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k           ss    5  2.682 ± 0.136   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k       ss    5  4.611 ± 0.066   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k         ss    5  3.901 ± 0.143   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsIcebergVectorized5k    ss    5  1.661 ± 0.062   s/op
VectorizedReadFlatParquetDataBenchmark.readTimestampsSparkVectorized5k      ss    5  1.572 ± 0.056   s/op

Detailed results are attached below:
master_detailed_results_new.txt
refactoring_detailed_results_new.txt
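
To compare the two tables row by row, one option is to compute the relative change in `s/op` and check whether the reported `Score ± Error` intervals overlap. The sketch below is only an illustration: the class and method names (`BenchmarkDeltaSketch`, `relativeChangePercent`, `intervalsOverlap`) are hypothetical, the three rows are copied from the tables above, and the overlap check is a rough heuristic rather than a statistical test.

```java
class BenchmarkDeltaSketch {
  // Relative change in percent; a negative value means the branch took less time per op.
  static double relativeChangePercent(double masterScore, double branchScore) {
    return (branchScore - masterScore) / masterScore * 100.0;
  }

  // Rough heuristic only: true when the two [score - error, score + error] intervals overlap.
  static boolean intervalsOverlap(double scoreA, double errorA, double scoreB, double errorB) {
    return scoreA - errorA <= scoreB + errorB && scoreB - errorB <= scoreA + errorA;
  }

  public static void main(String[] args) {
    String[] names = {
      "readDatesIcebergVectorized5k",
      "readDecimalsSparkVectorized5k",
      "readLongsIcebergVectorized5k",
    };
    // Each row: {master score, master error, branch score, branch error}, all in s/op.
    double[][] rows = {
      {1.933, 0.171, 1.611, 0.035},
      {7.726, 0.163, 8.250, 0.083},
      {2.490, 0.115, 2.754, 0.818},
    };
    for (int i = 0; i < rows.length; i++) {
      double[] r = rows[i];
      System.out.printf(
          "%s: %+.1f%% (intervals overlap: %b)%n",
          names[i], relativeChangePercent(r[0], r[2]), intervalsOverlap(r[0], r[1], r[2], r[3]));
    }
  }
}
```

Because the benchmarks run in single-shot mode with only 5 measurements each, a difference whose intervals overlap is probably best treated as noise.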

@nastra nastra force-pushed the arrow-refactoring branch from 4ddd3fa to 5ebdd4c Compare July 2, 2021 07:45

@rymurr rymurr left a comment

Contributor

One minor comment, @nastra. Looks great though, the vectorised code is a lot easier to read! I love deleting code!

class DictionaryIdReader extends BaseDictEncodedReader {
  @Override
  protected void nextVal(FieldVector vector, Dictionary dict, int idx, int currentVal, int typeWidth) {
    ((IntVector) vector).set(idx, currentVal);

Contributor

Any reason this is cast to IntVector and the others are directly manipulating the data buffer?

@nastra nastra Jul 8, 2021

Contributor Author

Dictionary encoded vectors are always represented as IntVector, but since BaseDictEncodedReader uses the more generic FieldVector we need to do a cast to IntVector here. Fwiw, here's how it was done in the original code:

and
vectorizedColumnIterator.nextBatchDictionaryIds((IntVector) vec, nullabilityHolder);
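
The cast discussed here follows from a template-method layout: the base reader owns the loop and works against a generic vector type, while each subclass decides how a single value is written. A minimal sketch of that shape, using stand-in types only (`Vec`, `IntVec`, `BaseReader`, and `IdReader` are hypothetical and are not the Arrow or Iceberg classes, and the real `nextVal` takes more parameters):

```java
class DictReaderSketch {
  // Hypothetical stand-in for a generic vector type (plays the role of FieldVector).
  interface Vec {
    int capacity();
  }

  // Hypothetical stand-in for an int-backed vector (plays the role of IntVector).
  static final class IntVec implements Vec {
    private final int[] values;

    IntVec(int capacity) {
      this.values = new int[capacity];
    }

    void set(int idx, int value) {
      values[idx] = value;
    }

    int get(int idx) {
      return values[idx];
    }

    @Override
    public int capacity() {
      return values.length;
    }
  }

  // The shared loop lives in the base class; subclasses only decide how one value is stored.
  abstract static class BaseReader {
    final void readBatch(Vec vector, int[] dictionaryIds) {
      int count = Math.min(vector.capacity(), dictionaryIds.length);
      for (int idx = 0; idx < count; idx++) {
        nextVal(vector, idx, dictionaryIds[idx]);
      }
    }

    protected abstract void nextVal(Vec vector, int idx, int currentVal);
  }

  // Stores the raw dictionary id. The base class only knows the generic type,
  // so this subclass casts; callers are expected to pass an int-backed vector.
  static final class IdReader extends BaseReader {
    @Override
    protected void nextVal(Vec vector, int idx, int currentVal) {
      ((IntVec) vector).set(idx, currentVal);
    }
  }

  public static void main(String[] args) {
    IntVec vector = new IntVec(4);
    new IdReader().readBatch(vector, new int[] {3, 1, 2, 0});
    System.out.println(vector.get(0) + " " + vector.get(3));
  }
}
```

The trade-off in this layout is that the generic signature keeps one loop for every reader, at the cost of a downcast in the one subclass that needs a concrete vector type.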

@rymurr rymurr left a comment

Contributor

Looks awesome @nastra

rymurr commented Jul 9, 2021

Contributor

@rdblue: @nastra suggested merging this w/o squashing as each commit atomically refactors a single class. WDYT?

rdblue commented Jul 9, 2021

Contributor

I'm fine either way. Whatever you'd like to do.

@nastra nastra force-pushed the arrow-refactoring branch from 5ebdd4c to 6c12d43 Compare July 12, 2021 08:39

rymurr commented Jul 12, 2021

Contributor

note - rebasing to keep each atomic class refactor as a separate (revertible) commit

@rymurr rymurr merged commit 8058ec1 into apache:master Jul 12, 2021
@nastra nastra deleted the arrow-refactoring branch July 12, 2021 10:49