Skip to content

HIVE-26809: Upgrade ORC to 1.8.1.#3833

Closed
difin wants to merge 1 commit into
apache:masterfrom
difin:HIVE-26809
Closed

HIVE-26809: Upgrade ORC to 1.8.1.#3833
difin wants to merge 1 commit into
apache:masterfrom
difin:HIVE-26809

Conversation

@difin

@difin difin commented Dec 5, 2022

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Upgrading ORC version to currently latest version 1.8.1.
This PR is based on the changes proposed in uncompleted PR #2853 (ticket https://issues.apache.org/jira/browse/HIVE-25497 - Bump ORC to 1.7.2) with changes on top of it which enabled CI to pass.
Changes done in HIVE-25497:
"LLAP EncodedTreeReaderFactory is implementing its own TreeReaderFactory -- initially we are going to avoid LAZY IO here as everything is kept in memory."

Additional changes: Hive implements its own TreeReaderFactory. In ORC project, ORC-1060 - "Reduce memory usage when vectorized reading dictionary string encoding columns" introduced changes to StringDictionaryTreeReader which were causing exceptions in Hive EncodedTreeReaderFactory when attempting to upgrade to ORC 1.8.1. To handle that I added changes to Hive's EncodedTreeReaderFactory to use StringDictionaryTreeReader version as without ORC-1060.

Why are the changes needed?

To use latest ORC release in HIVE.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

CI tests passing.

@aturoczy

aturoczy commented Dec 6, 2022

Copy link
Copy Markdown

like it :)

@cnauroth cnauroth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @difin . Thank you for the patch. This looks like a good idea to try to complete before GA of Hive 4.0.

I see Apache ORC has just released version 1.8.1. Can we use that, so Hive gets on the latest release?

There are currently numerous test failures in CI, like this one:

http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-3833/1/tests

I noticed a lot of ArrayIndexOutOfBoundsException, like this:

Caused by: java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryStream(TreeReaderFactory.java:2242)
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:2283)
	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1963)
	at org.apache.hadoop.hive.ql.io.orc.encoded.EncodedTreeReaderFactory$StringStreamReader.nextVector(EncodedTreeReaderFactory.java:313)
	at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.decodeBatch(OrcEncodedDataConsumer.java:196)
	at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.decodeBatch(OrcEncodedDataConsumer.java:66)
	at org.apache.hadoop.hive.llap.io.decode.EncodedDataConsumer.consumeData(EncodedDataConsumer.java:122)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.sendEcbToConsumer(SerDeEncodedDataReader.java:1687)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.processOneSlice(SerDeEncodedDataReader.java:1059)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.processOneFileSplit(SerDeEncodedDataReader.java:908)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.readFileWithCache(SerDeEncodedDataReader.java:859)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.performDataRead(SerDeEncodedDataReader.java:731)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:278)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:275)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.callInternal(SerDeEncodedDataReader.java:275)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.callInternal(SerDeEncodedDataReader.java:115)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at org.apache.hadoop.hive.llap.io.decode.EncodedDataConsumer$CpuRecordingCallable.call(EncodedDataConsumer.java:88)
	at org.apache.hadoop.hive.llap.io.decode.EncodedDataConsumer$CpuRecordingCallable.call(EncodedDataConsumer.java:73)
	... 5 more

Can you please investigate?

@difin

difin commented Dec 6, 2022

Copy link
Copy Markdown
Contributor Author

Hello @difin . Thank you for the patch. This looks like a good idea to try to complete before GA of Hive 4.0.

I see Apache ORC has just released version 1.8.1. Can we use that, so Hive gets on the latest release?

There are currently numerous test failures in CI, like this one:

http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-3833/1/tests

I noticed a lot of ArrayIndexOutOfBoundsException, like this:

Caused by: java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryStream(TreeReaderFactory.java:2242)
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:2283)
	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1963)
	at org.apache.hadoop.hive.ql.io.orc.encoded.EncodedTreeReaderFactory$StringStreamReader.nextVector(EncodedTreeReaderFactory.java:313)
	at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.decodeBatch(OrcEncodedDataConsumer.java:196)
	at org.apache.hadoop.hive.llap.io.decode.OrcEncodedDataConsumer.decodeBatch(OrcEncodedDataConsumer.java:66)
	at org.apache.hadoop.hive.llap.io.decode.EncodedDataConsumer.consumeData(EncodedDataConsumer.java:122)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.sendEcbToConsumer(SerDeEncodedDataReader.java:1687)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.processOneSlice(SerDeEncodedDataReader.java:1059)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.processOneFileSplit(SerDeEncodedDataReader.java:908)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.readFileWithCache(SerDeEncodedDataReader.java:859)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.performDataRead(SerDeEncodedDataReader.java:731)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:278)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader$5.run(SerDeEncodedDataReader.java:275)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.callInternal(SerDeEncodedDataReader.java:275)
	at org.apache.hadoop.hive.llap.io.encoded.SerDeEncodedDataReader.callInternal(SerDeEncodedDataReader.java:115)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at org.apache.hadoop.hive.llap.io.decode.EncodedDataConsumer$CpuRecordingCallable.call(EncodedDataConsumer.java:88)
	at org.apache.hadoop.hive.llap.io.decode.EncodedDataConsumer$CpuRecordingCallable.call(EncodedDataConsumer.java:73)
	... 5 more

Can you please investigate?

Hi @cnauroth, thank you for your comments. I am investigating the CI errors and will upgrade to ORC 1.8.1 too.

@difin

difin commented Jan 3, 2023

Copy link
Copy Markdown
Contributor Author

Hi @abstractdog, can you please review?

@ayushtkn ayushtkn left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments and a question why are these new classes added?

}

protected static class StringStreamReader extends StringTreeReader
public static class StringDictionaryTreeReaderHive extends TreeReader {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this added in scope of upgrade?

@difin difin Jan 16, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is added as a fix to many failed CI tests that happened without this fix.
These new classes are classes from Orc project prior to changes to StringDictionaryTreeReader that were done as part of ORC-1060.
In more detail: Hive implements its own TreeReaderFactory. In ORC project, the ticket ORC-1060 - "Reduce memory usage when vectorized reading dictionary string encoding columns" introduced changes to StringDictionaryTreeReader which were causing exceptions in Hive EncodedTreeReaderFactory when attempting to upgrade to ORC 1.8.1. To handle that I added changes to Hive's EncodedTreeReaderFactory to use StringDictionaryTreeReader version from Orc project prior to changes from ORC-1060.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that seems to be an improvement or say a bug fix in the ORC project and we are just implementing our own varient because now the original class is causing test failures.
This isn't the ideal approach and will backfire in future when we try to upgrade and the changes in ORC depends on the ones which we ditched.

We should try to adapt to those changes and make sure we don't crash with those changes in Hive by making hive changes, rather than maintaining a old version of ORC class at Hive

@difin difin Jan 17, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ayushtkn, I agree with you. It is not ideal approach. Before implementing this approach I did try to adapt Hive, but I didn't succeed to find how Hive could be adapted to ORC-1060 changes because those changes are inside internal implementation of Orc StringDictionaryTreeReader class. The API of StringDictionaryTreeReader class remained the same.

I agree with you that this approach is not ideal and will backfire in future when we try to upgrade and the changes in ORC depends on the ones which we ditched, but Hive already heavily depends on internal ORC API by implementing its own column readers on top of ORC and when upgrading to different ORC version it is often required to make adaptations in Hive.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abstractdog any pointers here? Does that sound fine thing to do for now

@abstractdog abstractdog Jan 23, 2023

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to understand the scenario here and the way I see this: the current PR code is not the proper one as we end up Hive on ORC 1.8.x but without an important optimization introduced in ORC-1060, so if we have to copy some ORC code anyway, let's have ORC-1060 at least here (sometimes I feel we need to port changes on separate jiras, but here we can merge them together)
but what's more important is that I see the basic confusion comes from the fact that in ORC we have a common StringTreeReader which encapsulates different kinds of string readers like StringDirectTreeReader, StringDictionaryTreeReader, but in hive's StringStreamReader we have dictionary-related properties like _dictionaryStream, _lengthStream, which is confusing...if we're already subclassing ORC tree readers, we should follow it like:

HIVE -> ORC
StringStreamReader -> StringTreeReader (as it is now)
StringDictionaryStreamReader -> StringDictionaryTreeReader
StringDirectStreamReader -> StringDirectTreeReader

this is a change that should be done regardless of ORC 1.8 upgrade in my opinion, and prior to ORC 1.8 upgrade
once we follow ORC tree class hierarchy, we have a better chance to adapt changes like ORC-1060, where e.g. only the dictionary reader has been changed

guys, if you agree with this, let's address the above problem in a separate hive ticket first, it's worth spending the time on it, especially if turns out that the ORC 1.8 upgrade becomes a clearer thing

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me, thanx Laszlo!!!

@sonarqubecloud

Copy link
Copy Markdown

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 10 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@github-actions

Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants