HIVE-26809: Upgrade ORC to 1.8.1. by difin · Pull Request #3833 · apache/hive

difin · 2022-12-05T22:11:08Z

What changes were proposed in this pull request?

This PR upgrades ORC to 1.8.1, currently the latest release.
It is based on the changes proposed in the unfinished PR #2853 (ticket https://issues.apache.org/jira/browse/HIVE-25497, "Bump ORC to 1.7.2"), with additional changes on top that allow CI to pass.
Changes done in HIVE-25497:
"LLAP EncodedTreeReaderFactory is implementing its own TreeReaderFactory -- initially we are going to avoid LAZY IO here as everything is kept in memory."

Additional changes: Hive implements its own TreeReaderFactory. In the ORC project, ORC-1060 ("Reduce memory usage when vectorized reading dictionary string encoding columns") changed StringDictionaryTreeReader, and those changes caused exceptions in Hive's EncodedTreeReaderFactory when upgrading to ORC 1.8.1. To handle this, I changed Hive's EncodedTreeReaderFactory to use the version of StringDictionaryTreeReader from before ORC-1060.
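
For context, dictionary string encoding stores each distinct value once and has rows refer to it by index. A minimal Java sketch of that decode step follows, assuming a concatenated dictionary blob, per-entry lengths, and per-row indexes. The class and method names are hypothetical, and this is not the ORC or Hive reader code.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: hypothetical names, not ORC's or Hive's actual reader code.
final class DictionaryDecodeSketch {

  /** Turns per-entry lengths into start offsets; offsets[i + 1] - offsets[i] is entry i's length. */
  static int[] toOffsets(int[] lengths) {
    int[] offsets = new int[lengths.length + 1];
    for (int i = 0; i < lengths.length; i++) {
      offsets[i + 1] = offsets[i] + lengths[i];
    }
    return offsets;
  }

  /** Resolves each row's dictionary index to its string value. */
  static String[] decode(byte[] dictionaryBlob, int[] lengths, int[] rowIndexes) {
    int[] offsets = toOffsets(lengths);
    String[] rows = new String[rowIndexes.length];
    for (int r = 0; r < rowIndexes.length; r++) {
      int entry = rowIndexes[r];
      rows[r] = new String(dictionaryBlob, offsets[entry],
          offsets[entry + 1] - offsets[entry], StandardCharsets.UTF_8);
    }
    return rows;
  }

  public static void main(String[] args) {
    // Three distinct values stored once, four rows pointing at them.
    byte[] blob = "applebananacherry".getBytes(StandardCharsets.UTF_8);
    int[] lengths = {5, 6, 6};
    int[] rows = {2, 0, 0, 1};
    System.out.println(String.join(",", decode(blob, lengths, rows)));
  }
}
```

A reader that materializes the dictionary eagerly versus lazily can keep the same public methods while changing how and when the blob and offsets are filled in, which is the kind of internal change a subclass can be sensitive to.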

Why are the changes needed?

To use the latest ORC release in Hive.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

CI tests passing.

aturoczy · 2022-12-06T00:21:51Z

like it :)

cnauroth

Hello @difin . Thank you for the patch. This looks like a good idea to try to complete before GA of Hive 4.0.

I see Apache ORC has just released version 1.8.1. Can we use that, so Hive gets on the latest release?

There are currently numerous test failures in CI, like this one:

http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-3833/1/tests

I noticed a lot of ArrayIndexOutOfBoundsException, like this:

Caused by: java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryStream(TreeReaderFactory.java:2242)
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:2283)
	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1963)
	at org.apache.hadoop.hive.ql.io.orc.encoded.EncodedTreeReaderFactory$StringStreamReader.nextVector(EncodedTreeReaderFactory.java:313)
	at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.decodeBatch(OrcEncodedDataConsumer.java:196)
	at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.decodeBatch(OrcEncodedDataConsumer.java:66)
	at org.apache.hadoop.hive.llap.io.decode.EncodedDataConsumer.consumeData(EncodedDataConsumer.java:122)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.sendEcbToConsumer(SerDeEncodedDataReader.java:1687)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.processOneSlice(SerDeEncodedDataReader.java:1059)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.processOneFileSplit(SerDeEncodedDataReader.java:908)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.readFileWithCache(SerDeEncodedDataReader.java:859)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.performDataRead(SerDeEncodedDataReader.java:731)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:278)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:275)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.callInternal(SerDeEncodedDataReader.java:275)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.callInternal(SerDeEncodedDataReader.java:115)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at org.apache.hadoop.hive.llap.io.decode.EncodedDataConsumer$CpuRecordingCallable.call(EncodedDataConsumer.java:88)
	at org.apache.hadoop.hive.llap.io.decode.EncodedDataConsumer$CpuRecordingCallable.call(EncodedDataConsumer.java:73)
	... 5 more

Can you please investigate?
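
The trace above ends in System.arraycopy, which throws when the requested copy does not fit in the destination array. The sketch below shows one way such a mismatch can arise, assuming a buffer sized from a declared length while the incoming chunks hold more bytes; ORC-1384, cross-referenced later in this thread, describes a dictionary stream bigger than the dictionary. All names here are hypothetical and this is not the ORC implementation, nor a confirmed diagnosis of the CI failures.

```java
// Illustrative only: hypothetical names, not the ORC implementation.
final class ArrayCopyOverflowSketch {

  /** Copies every chunk into a buffer sized from a declared length, with no bounds check. */
  static byte[] copyUnchecked(byte[][] chunks, int declaredSize) {
    byte[] dest = new byte[declaredSize];
    int pos = 0;
    for (byte[] chunk : chunks) {
      // Throws if pos + chunk.length exceeds dest.length.
      System.arraycopy(chunk, 0, dest, pos, chunk.length);
      pos += chunk.length;
    }
    return dest;
  }

  /** Same copy, but fails with a descriptive error before overflowing. */
  static byte[] copyChecked(byte[][] chunks, int declaredSize) {
    byte[] dest = new byte[declaredSize];
    int pos = 0;
    for (byte[] chunk : chunks) {
      if (chunk.length > dest.length - pos) {
        throw new IllegalStateException(
            "stream holds more than the declared " + declaredSize + " bytes");
      }
      System.arraycopy(chunk, 0, dest, pos, chunk.length);
      pos += chunk.length;
    }
    return dest;
  }

  public static void main(String[] args) {
    byte[][] chunks = {new byte[4], new byte[4]}; // 8 bytes arrive
    try {
      copyUnchecked(chunks, 6);                   // only 6 were declared
    } catch (IndexOutOfBoundsException e) {
      System.out.println("unchecked: " + e.getClass().getSimpleName());
    }
    try {
      copyChecked(chunks, 6);
    } catch (IllegalStateException e) {
      System.out.println("checked: " + e.getMessage());
    }
  }
}
```

The checked variant does not fix the underlying size mismatch; it only turns an opaque array error into a message that names the cause.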

difin · 2022-12-06T22:28:28Z

Hi @cnauroth, thank you for your comments. I am investigating the CI errors and will upgrade to ORC 1.8.1 too.

difin · 2023-01-03T15:05:47Z

Hi @abstractdog, can you please review?

ayushtkn

Minor comments, and a question: why are these new classes added?

ayushtkn · 2023-01-13T01:32:17Z

  }

-  protected static class StringStreamReader extends StringTreeReader
+  public static class StringDictionaryTreeReaderHive extends TreeReader {


Why is this added in the scope of the upgrade?

difin · Jan 16, 2023

This was added to fix the many CI test failures that occurred without it.
The new classes are copies of the ORC classes as they were before ORC-1060 changed StringDictionaryTreeReader.
In more detail: Hive implements its own TreeReaderFactory. In the ORC project, ORC-1060 ("Reduce memory usage when vectorized reading dictionary string encoding columns") changed StringDictionaryTreeReader, and those changes caused exceptions in Hive's EncodedTreeReaderFactory when upgrading to ORC 1.8.1. To handle this, I changed Hive's EncodedTreeReaderFactory to use the version of StringDictionaryTreeReader from the ORC project before ORC-1060.

ayushtkn · Jan 17, 2023

OK, that seems to be an improvement, or say a bug fix, in the ORC project, and we are just implementing our own variant because the original class now causes test failures.
This isn't the ideal approach and will backfire in the future when we try to upgrade and the changes in ORC depend on the ones we ditched.

We should try to adapt to those changes and make sure Hive doesn't crash with them by making changes in Hive, rather than maintaining an old version of an ORC class in Hive.

difin · Jan 17, 2023

Hi @ayushtkn, I agree with you, this is not an ideal approach. Before implementing it I did try to adapt Hive, but I could not find a way to adapt Hive to the ORC-1060 changes, because they are inside the internal implementation of ORC's StringDictionaryTreeReader class. The API of StringDictionaryTreeReader remained the same.

I also agree that this approach will backfire in the future when we try to upgrade and changes in ORC depend on the ones we ditched. However, Hive already depends heavily on internal ORC APIs by implementing its own column readers on top of ORC, and upgrading to a different ORC version often requires adaptations in Hive.

ayushtkn · Jan 20, 2023

@abstractdog any pointers here? Does that sound like a fine thing to do for now?

abstractdog · Jan 23, 2023

I was trying to understand the scenario here, and the way I see it: the current PR code is not the proper one, as we end up with Hive on ORC 1.8.x but without an important optimization introduced in ORC-1060. So if we have to copy some ORC code anyway, let's at least have ORC-1060 here (sometimes I feel we need to port changes in separate jiras, but here we can merge them together).
But what's more important is that I think the basic confusion comes from the fact that in ORC we have a common StringTreeReader which encapsulates different kinds of string readers like StringDirectTreeReader and StringDictionaryTreeReader, while Hive's StringStreamReader has dictionary-related properties like _dictionaryStream and _lengthStream, which is confusing. If we're already subclassing ORC tree readers, we should follow the ORC hierarchy, like:

HIVE -> ORC
StringStreamReader -> StringTreeReader (as it is now)
StringDictionaryStreamReader -> StringDictionaryTreeReader
StringDirectStreamReader -> StringDirectTreeReader

In my opinion this change should be done regardless of the ORC 1.8 upgrade, and prior to it.
Once we follow the ORC tree reader class hierarchy, we have a better chance of adapting to changes like ORC-1060, where e.g. only the dictionary reader has been changed.

Guys, if you agree with this, let's address the above problem in a separate Hive ticket first. It's worth spending the time on it, especially if it turns out that the ORC 1.8 upgrade becomes a clearer thing.
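
The proposed mapping can be sketched in Java as follows. Every class here is a simplified stand-in that only borrows the names from the comment above: the real ORC readers live in org.apache.orc.impl.TreeReaderFactory and have different members, and the decode() method and boolean constructor argument are assumptions made for illustration.

```java
// Illustrative only: every class below is a simplified stand-in, not ORC or Hive source.
final class ReaderHierarchySketch {

  // Stand-ins for the ORC readers named in the comment above.
  abstract static class TreeReader {
    abstract String decode();
  }

  static class StringDictionaryTreeReader extends TreeReader {
    @Override String decode() { return "dictionary"; }
  }

  static class StringDirectTreeReader extends TreeReader {
    @Override String decode() { return "direct"; }
  }

  /** Wrapper that delegates to one concrete reader, chosen by column encoding. */
  static class StringTreeReader extends TreeReader {
    protected TreeReader reader;
    @Override String decode() { return reader.decode(); }
  }

  // Hypothetical Hive subclasses, one per ORC reader.
  static class StringDictionaryStreamReader extends StringDictionaryTreeReader {
    // Dictionary-only state (dictionary and length streams) would live here.
    @Override String decode() { return "hive+" + super.decode(); }
  }

  static class StringDirectStreamReader extends StringDirectTreeReader {
    @Override String decode() { return "hive+" + super.decode(); }
  }

  static class StringStreamReader extends StringTreeReader {
    StringStreamReader(boolean dictionaryEncoded) {
      reader = dictionaryEncoded
          ? new StringDictionaryStreamReader()
          : new StringDirectStreamReader();
    }
  }

  public static void main(String[] args) {
    System.out.println(new StringStreamReader(true).decode());
    System.out.println(new StringStreamReader(false).decode());
  }
}
```

With this layout, a change confined to the dictionary reader on the ORC side would only touch StringDictionaryStreamReader on the Hive side, leaving the wrapper and the direct reader alone.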

ayushtkn · Jan 24, 2023

Makes sense to me, thanx Laszlo!!!

sonarqubecloud · 2023-01-17T00:31:48Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
10 Code Smells

No Coverage information
No Duplication information

github-actions · 2023-03-26T00:21:46Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

kgyrtkirk added the tests pending label Dec 5, 2022

kgyrtkirk added tests failed and removed tests pending labels Dec 6, 2022

cnauroth suggested changes Dec 6, 2022

difin force-pushed the HIVE-26809 branch from c657559 to fa56da1 Compare December 6, 2022 21:40

kgyrtkirk added tests pending and removed tests failed labels Dec 6, 2022

kgyrtkirk added tests failed and removed tests pending labels Dec 7, 2022

difin force-pushed the HIVE-26809 branch from fa56da1 to 8719a4b Compare December 7, 2022 18:50

kgyrtkirk added tests pending tests failed and removed tests failed tests pending labels Dec 7, 2022

difin force-pushed the HIVE-26809 branch from 8719a4b to 58c5d78 Compare December 8, 2022 19:28

kgyrtkirk added tests pending tests failed and removed tests failed tests pending labels Dec 8, 2022

difin force-pushed the HIVE-26809 branch from 58c5d78 to fb6116f Compare December 9, 2022 02:07

kgyrtkirk added tests pending tests failed and removed tests failed tests pending labels Dec 9, 2022

difin force-pushed the HIVE-26809 branch from fb6116f to 276401d Compare December 9, 2022 19:13

github-actions Bot requested a review from abstractdog December 9, 2022 19:14

kgyrtkirk added tests pending and removed tests failed labels Dec 9, 2022

kgyrtkirk added tests failed and removed tests pending labels Dec 15, 2022

difin force-pushed the HIVE-26809 branch from 409b40d to 025793e Compare December 16, 2022 00:07

kgyrtkirk added tests pending tests failed and removed tests failed tests pending labels Dec 16, 2022

difin force-pushed the HIVE-26809 branch from 025793e to 3a5c02a Compare December 16, 2022 22:34

kgyrtkirk added tests pending tests unstable and removed tests failed tests pending labels Dec 16, 2022

difin force-pushed the HIVE-26809 branch from 3a5c02a to 7c5fca1 Compare December 17, 2022 02:40

kgyrtkirk added tests pending tests passed and removed tests unstable tests pending labels Dec 17, 2022

difin changed the title ~~HIVE-26809: Upgrade ORC to 1.8.0.~~ HIVE-26809: Upgrade ORC to 1.8.1. Dec 20, 2022

difin force-pushed the HIVE-26809 branch from 7c5fca1 to 17e78c6 Compare January 5, 2023 15:44

kgyrtkirk added tests pending tests unstable and removed tests passed tests pending labels Jan 5, 2023

ayushtkn reviewed Jan 13, 2023

HIVE-26809: Upgrade ORC to 1.8.1.

b0c2c4a

deshanxiao mentioned this pull request Mar 8, 2023

ORC-1384: Fix ArrayIndexOutOfBoundsException when reading dictionary stream bigger then dictionary apache/orc#1431

Closed

zhangbutao mentioned this pull request Mar 16, 2023

HIVE-26809: Upgrade ORC to 1.8.3 #4121

Merged
