Add Dataproc 3.0 and Hive 4 compatibility to Hive lineage script #1398

Open
pathakriya-cyber wants to merge 1 commit into GoogleCloudDataproc:main from pathakriya-cyber:patch-1

@pathakriya-cyber

On Dataproc 3.0 (Hive 4), OpenLineage requires the GCP lineage transport library (transports-gcplineage.jar) to be explicitly copied into Hive's library folder (/usr/lib/hive/lib/) so Hive can send lineage events to GCP Dataplex without throwing ClassNotFoundException.

This change:

  1. Automatically copies Spark's transports-gcplineage.jar into Hive's library folder.
  2. Modernizes legacy "gsutil cp" calls to "gcloud storage cp".

Verified end-to-end against Dataproc 3.0 Debian 13:
http://sponge2/11f79ec0-b6d7-403e-8bc9-f7e923d70635
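
The first change listed above could look roughly like the following sketch. It is not the code from this PR: `copy_transport_jar` is a hypothetical helper name, plain `cp -p` stands in for whatever the script actually does, and treating a missing JAR as non-fatal is an assumption.

```shell
# Hypothetical helper: copy a transport JAR into a library directory if it exists.
# $1 = source JAR path, $2 = destination directory.
copy_transport_jar() {
  src="$1"
  dest_dir="$2"
  if [ -f "$src" ]; then
    # -p preserves the source file's mode and timestamps.
    cp -p "$src" "$dest_dir/"
    echo "Copied $(basename "$src") into $dest_dir"
  else
    # Assumption: a missing JAR is reported but does not abort the script.
    echo "Skipping: $src not found" >&2
  fi
}
```

On a cluster this might be called as `copy_transport_jar /usr/lib/spark/connector/transports-gcplineage.jar /usr/lib/hive/lib`, using the source path that appears in the PR's diff.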

google-cla Bot commented Jul 1, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates hive-lineage.sh to use gcloud storage cp instead of gsutil cp and adds logic to copy the GCP lineage transport JAR into the Hive library directory for Dataproc 3.0 compatibility. The feedback recommends restoring the -P flag in the gcloud storage cp command to preserve POSIX attributes and ensure the downloaded JAR remains readable by the hive user.
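
The concern in this review is that the installed JAR stays readable by the hive user. Whether or not `-P` is restored, a script could also check readability explicitly after the download. The sketch below uses plain `chmod` rather than the `-P` flag; `ensure_readable` is a hypothetical helper and is not part of hive-lineage.sh.

```shell
# Hypothetical helper: make sure a downloaded file is readable by all users,
# so that a service user such as hive can load it.
# $1 = path to the file.
ensure_readable() {
  jar_path="$1"
  if [ ! -f "$jar_path" ]; then
    echo "Missing file: $jar_path" >&2
    return 1
  fi
  # a+r adds read permission for owner, group, and others; other bits are untouched.
  chmod a+r "$jar_path"
}
```

Failing loudly when the file is absent keeps a broken download from going unnoticed until Hive tries to load the hook.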

echo "Installing openlineage-hive hook"
gsutil cp -P "$INSTALLATION_SOURCE/hive-openlineage-hook-$HIVE_OL_HOOK_VERSION.jar" "$HIVE_LIB_DIR/hive-openlineage-hook.jar"
}
gcloud storage cp "$INSTALLATION_SOURCE/hive-openlineage-hook-$HIVE_OL_HOOK_VERSION.jar" "$HIVE_LIB_DIR/hive-openlineage-hook.jar"

This comment was marked as duplicate.

Member

gcloud storage cp -P ...

gcloud storage cp "$INSTALLATION_SOURCE/hive-openlineage-hook-$HIVE_OL_HOOK_VERSION.jar" "$HIVE_LIB_DIR/hive-openlineage-hook.jar"

echo "Copying GCP lineage transport jar into Hive lib folder for Dataproc 3.0 compatibility..."
if [[ -f "/usr/lib/spark/connector/transports-gcplineage.jar" ]]; then

Member

transports-gcplineage:1.27 is bundled in hive-openlineage-hook.jar, and the jar you're copying from the Spark classpath is based on transports-gcplineage:1.34.

This might cause class conflicts. I'm assuming v1.27 is not working well here because some of its classes do not support Hive 4? It is probably not a good idea to have multiple versions of the same class on the Hive classpath. Even if our local tests pass, customer workloads might break, since customers may bring their own classes onto the classpath.
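
One way to see how much the two jars overlap is to compare their entry listings. This is a minimal sketch, assuming each listing is a text file with one archive entry per line; `common_class_entries` is a hypothetical helper, not something from the PR or from OpenLineage.

```shell
# Hypothetical helper: print .class entries that appear in both listings.
# $1 and $2 = text files with one archive entry per line.
common_class_entries() {
  list_a="$(mktemp)"
  list_b="$(mktemp)"
  # Keep only .class entries; comm requires sorted input.
  grep '\.class$' "$1" | sort -u > "$list_a"
  grep '\.class$' "$2" | sort -u > "$list_b"
  # -12 suppresses lines unique to either file, leaving the shared entries.
  comm -12 "$list_a" "$list_b"
  rm -f "$list_a" "$list_b"
}
```

The listings could be produced with `unzip -Z1 some.jar` or `jar tf some.jar`. Shared entries only show that the same class names exist in both jars; they do not say which version wins at load time.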

@codelixir codelixir left a comment

Member

If you could share the fully qualified names of the class(es) that result in ClassNotFoundException, it would help determine whether this is the right fix.
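
A quick way to collect those names is to pull them out of a log that contains the stack traces. The sketch below assumes the standard `ClassNotFoundException: <class name>` message format; `missing_classes` is a hypothetical helper, and the log file to pass in depends on the cluster.

```shell
# Hypothetical helper: list the unique class names reported in
# ClassNotFoundException messages.
# $1 = path to a log file containing the stack traces.
missing_classes() {
  grep -oE 'ClassNotFoundException: [A-Za-z0-9_.$]+' "$1" \
    | sed 's/^ClassNotFoundException: //' \
    | sort -u
}
```

The output is one fully qualified class name per line, which can be pasted directly into the review thread.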
