[SPARK-28723][SQL] Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile #25443

Closed
wangyum wants to merge 24 commits into apache:master from wangyum:test-on-jenkins

Conversation

@wangyum

@wangyum wangyum commented Aug 14, 2019

Member

What changes were proposed in this pull request?

This PR upgrades the built-in Hive to 2.3.6 for hadoop-3.2.

Hive 2.3.6 release notes:

  • HIVE-22096: Backport HIVE-21584 (Java 11 preparation: system class loader is not URLClassLoader)
  • HIVE-21859: Backport HIVE-17466 (Metastore API to list unique partition-key-value combinations)
  • HIVE-21786: Update repo URLs in poms branch 2.3 version

Why are the changes needed?

Make Spark support JDK 11.

Does this PR introduce any user-facing change?

Yes. Please see SPARK-28684 and SPARK-24417 for more details.

How was this patch tested?

Existing unit tests and manual testing.

@wangyum changed the title from "[WIP][test-hadoop3.2] Test on JDK 11 with Hive 2.3.6 on jenkins" to "[WIP][test-hadoop3.2] Test JDK 11 with Hive 2.3.6 on jenkins" on Aug 14, 2019
Comment thread dev/run-tests-jenkins.py Outdated
Comment thread pom.xml Outdated
@dongjoon-hyun

Member

cc @dbtsai

@wangyum changed the title from "[WIP][test-hadoop3.2] Test JDK 11 with Hive 2.3.6 on jenkins" to "[WIP][test-hadoop3.2][test-maven] Test JDK 11 with Hive 2.3.6 on jenkins" on Aug 14, 2019
@wangyum

wangyum commented Aug 14, 2019

Member Author

retest this please

@dongjoon-hyun changed the title from "[WIP][test-hadoop3.2][test-maven] Test JDK 11 with Hive 2.3.6 on jenkins" to "[WIP][test-hadoop3.2][test-maven] Test JDK 11 with Hadoop-3.2/Hive 2.3.6 on jenkins" on Aug 14, 2019
@dongjoon-hyun

Member

Since the test is parallel, could you add the following, too?

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala
-    case "2.3" | "2.3.0" | "2.3.1" | "2.3.2" | "2.3.3" | "2.3.4" | "2.3.5" => hive.v2_3
+    case "2.3" | "2.3.0" | "2.3.1" | "2.3.2" | "2.3.3" | "2.3.4" | "2.3.5" | "2.3.6" => hive.v2_3

@dongjoon-hyun changed the title from "[WIP][test-hadoop3.2][test-maven] Test JDK 11 with Hadoop-3.2/Hive 2.3.6 on jenkins" to "[WIP][SPARK-28723][test-hadoop3.2][test-maven] Test JDK 11 with Hadoop-3.2/Hive 2.3.6 on jenkins" on Aug 14, 2019
@wangyum

wangyum commented Aug 14, 2019

Member Author

Will do it later.

Comment thread sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala Outdated
Comment thread sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala Outdated
@dongjoon-hyun

Member

@wangyum . I believe we should do #25443 (comment) in this PR to be complete.

cc @gatorsmile

@wangyum

wangyum commented Aug 14, 2019

Member Author

Failed with these errors:

ExternalSorterSuite:
- empty data stream with kryo ser
- empty data stream with java ser
- few elements per partition with kryo ser
- few elements per partition with java ser
- empty partitions with spilling with kryo ser
- empty partitions with spilling with java ser
- spilling in local cluster with kryo ser *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.toChunkedByteBuffer(ChunkedByteBufferOutputStream.scala:115)
	at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:307)
	at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:137)
	at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:91)
	at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
	at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:74)
	at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1470)
	at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1086)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$submitStage$5(DAGScheduler.scala:1089)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$submitStage$5$adapted(DAGScheduler.scala:1088)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1088)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1030)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2129)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2121)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2110)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)

@dongjoon-hyun

dongjoon-hyun commented Aug 14, 2019

Member

If we both build and test with JDK11, it will pass. The current Jenkins job seems to build with JDK8 and run on JDK11, and so it hits this known issue.

$ build/sbt "core/testOnly *.ExternalSorterSuite"
[info] ExternalSorterSuite:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/dhyun/PRS/SPARK-HIVE-2.3.6/common/unsafe/target/scala-2.12/spark-unsafe_2.12-3.0.0-SNAPSHOT.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[info] - empty data stream with kryo ser (1 second, 457 milliseconds)
[info] - empty data stream with java ser (89 milliseconds)
[info] - few elements per partition with kryo ser (89 milliseconds)
[info] - few elements per partition with java ser (74 milliseconds)
[info] - empty partitions with spilling with kryo ser (329 milliseconds)
[info] - empty partitions with spilling with java ser (156 milliseconds)
[info] - spilling in local cluster with kryo ser (4 seconds, 296 milliseconds)
[info] - spilling in local cluster with java ser (4 seconds, 372 milliseconds)
[info] - spilling in local cluster with many reduce tasks with kryo ser (5 seconds, 718 milliseconds)
[info] - spilling in local cluster with many reduce tasks with java ser (6 seconds, 51 milliseconds)
[info] - cleanup of intermediate files in sorter (113 milliseconds)
[info] - cleanup of intermediate files in sorter with failures (111 milliseconds)
[info] - cleanup of intermediate files in shuffle (297 milliseconds)
[info] - cleanup of intermediate files in shuffle with failures (121 milliseconds)
[info] - no sorting or partial aggregation with kryo ser (58 milliseconds)
[info] - no sorting or partial aggregation with java ser (53 milliseconds)
[info] - no sorting or partial aggregation with spilling with kryo ser (62 milliseconds)
[info] - no sorting or partial aggregation with spilling with java ser (68 milliseconds)
[info] - sorting, no partial aggregation with kryo ser (63 milliseconds)
[info] - sorting, no partial aggregation with java ser (54 milliseconds)
[info] - sorting, no partial aggregation with spilling with kryo ser (58 milliseconds)
[info] - sorting, no partial aggregation with spilling with java ser (61 milliseconds)
[info] - partial aggregation, no sorting with kryo ser (52 milliseconds)
[info] - partial aggregation, no sorting with java ser (51 milliseconds)
[info] - partial aggregation, no sorting with spilling with kryo ser (55 milliseconds)
[info] - partial aggregation, no sorting with spilling with java ser (49 milliseconds)
[info] - partial aggregation and sorting with kryo ser (44 milliseconds)
[info] - partial aggregation and sorting with java ser (44 milliseconds)
[info] - partial aggregation and sorting with spilling with kryo ser (48 milliseconds)
[info] - partial aggregation and sorting with spilling with java ser (49 milliseconds)
[info] - sort without breaking sorting contracts with kryo ser (1 second, 904 milliseconds)
[info] - sort without breaking sorting contracts with java ser (1 second, 860 milliseconds)
[info] - sort without breaking timsort contracts for large arrays !!! IGNORED !!!
[info] - spilling with hash collisions (208 milliseconds)
[info] - spilling with many hash collisions (589 milliseconds)
[info] - spilling with hash collisions using the Int.MaxValue key (168 milliseconds)
[info] - spilling with null keys and values (226 milliseconds)
[info] - sorting updates peak execution memory (1 second, 347 milliseconds)
[info] - force to spill for external sorter (800 milliseconds)
[info] ScalaTest
[info] Run completed in 34 seconds, 99 milliseconds.
[info] Total number of tests run: 38
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 38, failed 0, canceled 0, ignored 1, pending 0
[info] All tests passed.

cc @srowen

@dongjoon-hyun

dongjoon-hyun commented Aug 14, 2019

Member

Hmm. It's a little confusing. The current Jenkins passes this module, too.

Comment thread dev/run-tests-jenkins Outdated
@HyukjinKwon

Member

Hm, it doesn't seem to be working. Let me check.

@dongjoon-hyun

Member

6821aa5 is still running, isn't it?

@HyukjinKwon

HyukjinKwon commented Aug 14, 2019

Member

Yeah, but my approach in e6508c0 and 2ebdcb9 doesn't seem to be working.

@dongjoon-hyun

Member

Oh, got it.

@dongjoon-hyun

Member

Retest this please.

@dongjoon-hyun changed the title from "[WIP][SPARK-28723][test-hadoop3.2][test-maven] Test JDK 11 with Hadoop-3.2/Hive 2.3.6 on jenkins" to "[SPARK-28723][SQL] Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile" on Aug 23, 2019
@dongjoon-hyun

Member

Retest this please.

@SparkQA

SparkQA commented Aug 24, 2019

Test build #109662 has finished for PR 25443 at commit ff4783c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -17,6 +17,11 @@

package org.apache.spark.sql.hive.thriftserver

@dongjoon-hyun dongjoon-hyun Aug 24, 2019

Member

During JDK11 testing and review, we skipped the renaming in order to focus on the JDK11-related changes and keep the PR diff small. We may need to rename this source directory from v2.3.5 to v2.3.6 later for consistency. If the tests pass, I'd like to merge this PR as is first.

cc @gatorsmile , @srowen

@SparkQA

SparkQA commented Aug 24, 2019

Test build #109658 has finished for PR 25443 at commit ff4783c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun

Member

+1, LGTM. Merged to master.
Thank you so much, @wangyum , @srowen , @HyukjinKwon , @shaneknapp !

@wangyum wangyum deleted the test-on-jenkins branch August 24, 2019 04:39
@HyukjinKwon

Member

+1!

@HyukjinKwon HyukjinKwon left a comment

Member

Late LGTM too!

@dongjoon-hyun

Member

FYI, after this, we have one successful Jenkins result on JDK11.

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
[INFO] 
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.603 s]
[INFO] Spark Project Tags ................................. SUCCESS [  8.820 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 23.616 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  6.317 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 58.109 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 12.534 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  9.939 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  9.372 s]
[INFO] Spark Project Core ................................. SUCCESS [23:13 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [  8.764 s]
[INFO] Spark Project GraphX ............................... SUCCESS [01:17 min]
[INFO] Spark Project Streaming ............................ SUCCESS [05:38 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [10:23 min]
[INFO] Spark Project SQL .................................. SUCCESS [  01:44 h]
[INFO] Spark Project ML Library ........................... SUCCESS [33:00 min]
[INFO] Spark Project Tools ................................ SUCCESS [  1.508 s]
[INFO] Spark Project Hive ................................. SUCCESS [  01:09 h]
[INFO] Spark Project Graph API ............................ SUCCESS [  3.619 s]
[INFO] Spark Project Cypher ............................... SUCCESS [  3.860 s]
[INFO] Spark Project Graph ................................ SUCCESS [  2.397 s]
[INFO] Spark Project REPL ................................. SUCCESS [01:26 min]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [  3.692 s]
[INFO] Spark Project YARN ................................. SUCCESS [07:36 min]
[INFO] Spark Project Mesos ................................ SUCCESS [ 37.176 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [09:03 min]
[INFO] Spark Project Assembly ............................. SUCCESS [  3.331 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [  7.260 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [01:16 min]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [07:36 min]
[INFO] Spark Kinesis Integration .......................... SUCCESS [ 26.717 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 27.544 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [  2.694 s]
[INFO] Spark Avro ......................................... SUCCESS [01:54 min]
[INFO] Spark Project Kinesis Assembly ..................... SUCCESS [  2.481 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  04:40 h
[INFO] Finished at: 2019-08-24T03:35:36-07:00
[INFO] ------------------------------------------------------------------------

cc @gatorsmile , @dbtsai

rshkv pushed a commit to palantir/spark that referenced this pull request Jun 18, 2020
… for JDK 11

### What changes were proposed in this pull request?

This PR proposes to increase the tolerance for the exact value comparison in the `spark.mlp` test. I don't know the root cause, but some tolerance is already expected, and I suspect it is not a big deal considering that all other tests pass.

The values are fairly close:

JDK 8:

```
-24.28415, 107.8701, 16.86376, 1.103736, 9.244488
```

JDK 11:

```
-24.33892, 108.0316, 16.89082, 1.090723, 9.260533
```

### Why are the changes needed?

To fully support JDK 11. See, for instance, apache#25443 and apache#25423 for ongoing efforts.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually tested on top of apache#25472 with JDK 11:

```bash
./build/mvn -DskipTests -Psparkr -Phadoop-3.2 package
./bin/sparkR
```

```R
absoluteSparkPath <- function(x) {
  sparkHome <- sparkR.conf("spark.home")
  file.path(sparkHome, x)
}
df <- read.df(absoluteSparkPath("data/mllib/sample_multiclass_classification_data.txt"),
              source = "libsvm")
model <- spark.mlp(df, label ~ features, blockSize = 128, layers = c(4, 5, 4, 3),
                   solver = "l-bfgs", maxIter = 100, tol = 0.00001, stepSize = 1, seed = 1)
summary <- summary(model)
head(summary$weights, 5)
```

Closes apache#25478 from HyukjinKwon/SPARK-28755.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
senthh pushed a commit to acceldata-io/spark that referenced this pull request Jun 24, 2026
…lient and Hadoop-3.2 profile

### What changes were proposed in this pull request?

This PR upgrades the built-in Hive to 2.3.6 for `hadoop-3.2`.

Hive 2.3.6 release notes:
- [HIVE-22096](https://issues.apache.org/jira/browse/HIVE-22096): Backport [HIVE-21584](https://issues.apache.org/jira/browse/HIVE-21584) (Java 11 preparation: system class loader is not URLClassLoader)
- [HIVE-21859](https://issues.apache.org/jira/browse/HIVE-21859): Backport [HIVE-17466](https://issues.apache.org/jira/browse/HIVE-17466) (Metastore API to list unique partition-key-value combinations)
- [HIVE-21786](https://issues.apache.org/jira/browse/HIVE-21786): Update repo URLs in poms branch 2.3 version

### Why are the changes needed?
Make Spark support JDK 11.

### Does this PR introduce any user-facing change?
Yes. Please see [SPARK-28684](https://issues.apache.org/jira/browse/SPARK-28684) and [SPARK-24417](https://issues.apache.org/jira/browse/SPARK-24417) for more details.

### How was this patch tested?
Existing unit test and manual test.

Closes apache#25443 from wangyum/test-on-jenkins.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

(cherry picked from commit 02a0cde)
senthh added a commit to acceldata-io/spark that referenced this pull request Jun 25, 2026
* ODP-7038|[SPARK-25946][BUILD] Upgrade ASM to 7.x to support JDK11

## What changes were proposed in this pull request?

Upgrade ASM to 7.x to support JDK11

## How was this patch tested?

Existing tests.

Closes apache#22953 from dbtsai/asm7.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>

(cherry picked from commit 3ed91c9)

* ODP-7038 - Improvement - Enable Spark2 with jdk11 runtime support

* ODP-7038 - Improvement - Enable Spark2 with jdk11 runtime support

* ODP-7038: replace String.lines with split for JDK11 compile

JDK11 added java.lang.String#lines() returning java.util.stream.Stream<String>.
Scala 2.11's StringLike implicit also exposes .lines (Iterator[String]),
but the Java instance method takes resolution priority on JDK11+. The
resulting Stream<String>.toArray returns Object[], and the downstream
.size / .forall(_.size <= N) then fail to typecheck:

  value size is not a member of Object

MatricesSuite (both mllib and mllib-local copies) only needs a plain
newline split, so use .split("\\n") which returns Array[String]
unambiguously on every JDK.

* ODP-7038|[SPARK-26839][SQL] Work around classloader changes in Java 9 for Hive isolation

Note, this doesn't really resolve the JIRA, but makes the changes we can make so far that would be required to solve it.

## What changes were proposed in this pull request?

Java 9+ changed how ClassLoaders work. The two most salient points:
- The boot classloader no longer 'sees' the platform classes. A new 'platform classloader' does and should be the parent of new ClassLoaders
- The system classloader is no longer a URLClassLoader, so we can't get the URLs of JARs in its classpath

## How was this patch tested?

We'll see whether Java 8 tests still pass here. Java 11 tests do not fully pass at this point; more notes below. This does make progress on the failures though.

(NB: to test with Java 11, you need to build with Java 8 first, setting JAVA_HOME and java's executable correctly, then switch both to Java 11 for testing.)

Closes apache#24057 from srowen/SPARK-26839.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

(cherry picked from commit c65f9b2)

* ODP-7038|[SPARK-28723][SQL] Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile

### What changes were proposed in this pull request?

This PR upgrades the built-in Hive to 2.3.6 for `hadoop-3.2`.

Hive 2.3.6 release notes:
- [HIVE-22096](https://issues.apache.org/jira/browse/HIVE-22096): Backport [HIVE-21584](https://issues.apache.org/jira/browse/HIVE-21584) (Java 11 preparation: system class loader is not URLClassLoader)
- [HIVE-21859](https://issues.apache.org/jira/browse/HIVE-21859): Backport [HIVE-17466](https://issues.apache.org/jira/browse/HIVE-17466) (Metastore API to list unique partition-key-value combinations)
- [HIVE-21786](https://issues.apache.org/jira/browse/HIVE-21786): Update repo URLs in poms branch 2.3 version

### Why are the changes needed?
Make Spark support JDK 11.

### Does this PR introduce any user-facing change?
Yes. Please see [SPARK-28684](https://issues.apache.org/jira/browse/SPARK-28684) and [SPARK-24417](https://issues.apache.org/jira/browse/SPARK-24417) for more details.

### How was this patch tested?
Existing unit test and manual test.

Closes apache#25443 from wangyum/test-on-jenkins.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

(cherry picked from commit 02a0cde)

* ODP-7038 - Dev - Adding missing orc versions

* ODP-7038: harden Platform.<clinit> Cleaner reflection for JDK11 runtime

On JDK11, jdk.internal.ref is not exported to the unnamed module by
default, so Method.setAccessible() throws InaccessibleObjectException
inside Platform's static block, and spark-shell fails to start with:

  java.lang.ExceptionInInitializerError at ByteArrayMethods.<clinit>
  Caused by: InaccessibleObjectException: Unable to make ...
  jdk.internal.ref.Cleaner.create(Object, Runnable) accessible

Backport the SPARK-26839 graceful-degradation pattern from
upstream 2.4.x+/3.x:

  - Catch InaccessibleObjectException by name (avoids importing the
    JDK9+ class) when setAccessible() on DirectByteBuffer ctor/field
    fails; null both refs.
  - Probe createMethod by calling it with null args; if it throws
    IllegalAccessException, null the method ref.
  - allocateDirectBuffer() now checks for null CLEANER_CREATE_METHOD
    and falls back to ByteBuffer.allocateDirect(size), with a
    helpful OOM message pointing at -XX:MaxDirectMemorySize.

With this, spark-shell on JDK11 starts even without
`--add-opens java.base/jdk.internal.ref=ALL-UNNAMED`. Adding that
add-opens still gives you the bigger off-heap budget.

* ODP-7038: restore hive.version to ODP fork 1.2.1.spark24.0.14.1

The earlier SPARK-28723 cherry-pick (9bbdab0) blindly took upstream's
hive.version=1.2.1.spark2, which is the upstream spark-project.hive
1.2.1 line - NOT the ODP fork that lives in odp-hive-spark and ships
as 1.2.1.spark24.0.14.1.

ODP's deployed jar
  standalone-metastore-1.2.1.spark24.0.14.1-hive3.jar
is built from odp-hive-spark/standalone-metastore at 1.2.1.spark24.0.14.1.
Any JDK11 patches for the embedded HiveMetaStoreClient (e.g. HIVE-21508's
toArray fix) belong in odp-hive-spark, not here.

Keep the rest of SPARK-28723 (hive23.version, hadoop-3.2 profile
overrides, ThriftserverShimUtils) intact - those only kick in when
hadoop-3.2 profile selects the Apache Hive 2.3 path.

* ODP-7038: PySpark + bundled py4j source patches for Python 3.11

Stock Spark 2.4 PySpark targets Python 2.7-3.8. Python 3.10 and 3.11
broke several APIs PySpark and its bundled py4j-0.10.7 / cloudpickle
0.x still relied on. This commit applies source-level patches so a
fresh `pyspark` session runs cleanly under Python 3.11.

The big one: replace the 2017-era single-file pyspark/cloudpickle.py
with the vendored cloudpickle 2.2.1 package (exact backport from
upstream Apache Spark 3.x's python/pyspark/cloudpickle/). cloudpickle
2.2.1 (Aug 2022) is the first release with full Python 3.11 support -
bytecode opcode walker handles the new LOAD_GLOBAL flag encoding,
CodeType construction uses .replace() forward-compat, closure cell
serialization adapted to 3.11 frame layout, and many other 3.10/3.11
fixes that would have required dozens of manual patches to the old
copy.

Verified end-to-end on Python 3.11.15: pyspark imports cleanly, lambda
closure round-trips through cloudpickle.dumps()/loads() succeed for
the patterns that previously raised
  TypeError: code() argument 13 must be str, not int
  IndexError: tuple index out of range  (in extract_code_globals)
  RecursionError in save_function/_fill_function

Source changes
--------------
python/pyspark/cloudpickle.py  ->  python/pyspark/cloudpickle/
  Replace single-file 0.x copy with cloudpickle 2.2.1 vendored as a
  package (matching upstream Apache Spark 3.x layout). Only deltas
  vs upstream PyPI cloudpickle 2.2.1:
    * __init__.py:  `from cloudpickle.X` -> `from pyspark.cloudpickle.X`
      (relocates the package under pyspark)
    * cloudpickle_fast.py:634:  add `len(e.args) > 0` guard to the
      RecursionError fallback (same as Apache Spark 3.x's vendor diff)

python/pyspark/resultiterable.py
  Python 3.10 removed the lazy collections.* abc aliases. Class
  ResultIterable(collections.Iterable) raised AttributeError on import.
  Import from collections.abc with a Python 2 fallback.
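
A minimal sketch of the import fallback described above. `ResultIterableSketch` is a hypothetical stand-in written for illustration, not the actual PySpark class; only the `collections.abc` import with a `collections` fallback comes from the text.

```python
# Prefer the location that still exists on Python 3.10+.
try:
    from collections.abc import Iterable
except ImportError:
    # Python 2 has no collections.abc module, so use the old location.
    from collections import Iterable


class ResultIterableSketch(Iterable):
    """Hypothetical stand-in for a class that subclasses Iterable."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        # Iterable declares __iter__ as abstract, so it must be defined.
        return iter(self.data)

    def __len__(self):
        return len(self.data)
```

Because the base class is resolved once at import time, the class body itself does not need to know which Python version it runs on.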

python/pyspark/sql/types.py
python/pyspark/sql/session.py
  pandas 2.0 removed DataFrame.iteritems(). PySpark uses it in
  timestamp localization (types.py) and Arrow batch creation
  (session.py x2). Replace with .items() (present in pandas 1.x and
  2.x) guarded by a getattr() probe so older pandas keeps working.
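
One way such a probe could be written, as a sketch only. The helper `iter_columns` and the two stand-in classes are hypothetical and are not taken from the patch; they avoid a pandas dependency so the snippet runs on its own.

```python
def iter_columns(frame):
    """Return an iterator of (name, column) pairs for a DataFrame-like object."""
    # Probe for .items() first; fall back to .iteritems() only if it is missing.
    method = getattr(frame, "items", None)
    if method is None:
        method = frame.iteritems
    return method()


class NewStyleFrame(object):
    """Stand-in for an object that only offers .items()."""

    def items(self):
        return iter([("ts", [1, 2])])


class OldStyleFrame(object):
    """Stand-in for an object that only offers .iteritems()."""

    def iteritems(self):
        return iter([("ts", [1, 2])])


new_pairs = list(iter_columns(NewStyleFrame()))
old_pairs = list(iter_columns(OldStyleFrame()))
```

Both calls yield the same pairs, so callers do not need to branch on the installed library version.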

python/pyspark/mllib/linalg/__init__.py
python/pyspark/ml/linalg/__init__.py
  Python 3.9 removed array.array.tostring(). Replace with .tobytes()
  in the DenseVector / SparseVector / DenseMatrix / SparseMatrix
  pickling paths (6+6 sites). Both methods are bytewise-identical so
  serialized payloads stay wire-compatible.
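
A small standard-library sketch of the `.tobytes()` round trip. The variable names are illustrative, and the PySpark pickling code itself is not reproduced here.

```python
import array

values = array.array("d", [1.0, 2.5, -3.0])

# tobytes() returns the machine representation of the array as a bytes object.
payload = values.tobytes()

# Rebuild an array from the payload to confirm the round trip is lossless.
restored = array.array("d")
restored.frombytes(payload)
```

The payload length equals the element count times `values.itemsize`, which is what a reader on the other side of the wire would rely on.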

python/lib/py4j-0.10.7-src.zip
  Bundled py4j 0.10.7 (from 2018) imports MutableMapping, Sequence,
  MutableSequence, MutableSet, Set straight from `collections`. Python
  3.10 removed those aliases, causing
    ImportError: cannot import name 'MutableMapping' from 'collections'
  Patch the bundled zip: java_collections.py uses `from collections.abc`
  with a `from collections` fallback. Bytes-only change to the zip,
  no version bump (py4j Java jar stays at 0.10.7 so wire-protocol
  compat is preserved).

Verification
------------
  $ PYTHONPATH=python:python/lib/py4j-0.10.7-src.zip python3.11 \
      -W ignore -c "import pyspark; print(pyspark.__version__)"
  2.4.8
  $ python3.11 -W ignore -c "
      from pyspark import cloudpickle
      def make():
          x = 42
          return lambda r: (r, r * x)
      f = make()
      assert cloudpickle.loads(cloudpickle.dumps(f))(10) == (10, 420)
      print('closure round-trip OK')"
  closure round-trip OK

* ODP-7038: restore HiveUtils imports + isHive23, fix hadoop-3.2 profile

---------

Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: senthh <senthil.kumar@acceldata.io>
Co-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: Yuming Wang <yumwang@ebay.com>
shubhluck added a commit to acceldata-io/spark that referenced this pull request Jun 25, 2026
* ODP-7038|[SPARK-25946][BUILD] Upgrade ASM to 7.x to support JDK11

## What changes were proposed in this pull request?

Upgrade ASM to 7.x to support JDK11

## How was this patch tested?

Existing tests.

Closes apache#22953 from dbtsai/asm7.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>

(cherry picked from commit 3ed91c9)

* ODP-7038 - Improvement - Enable Spark2 with jdk11 runtime support

* ODP-7038 - Improvement - Enable Spark2 with jdk11 runtime support

* ODP-7038: replace String.lines with split for JDK11 compile

JDK11 added java.lang.String#lines() returning java.util.stream.Stream<String>.
Scala 2.11's StringLike implicit also exposes .lines (Iterator[String]),
but the Java instance method takes resolution priority on JDK11+. The
resulting Stream<String>.toArray returns Object[], and the downstream
.size / .forall(_.size <= N) then fail to typecheck:

  value size is not a member of Object

MatricesSuite (both mllib and mllib-local copies) only needs a plain
newline split, so use .split("\\n") which returns Array[String]
unambiguously on every JDK.

* ODP-7038|[SPARK-26839][SQL] Work around classloader changes in Java 9 for Hive isolation

Note, this doesn't really resolve the JIRA, but makes the changes we can make so far that would be required to solve it.

## What changes were proposed in this pull request?

Java 9+ changed how ClassLoaders work. The two most salient points:
- The boot classloader no longer 'sees' the platform classes. A new 'platform classloader' does and should be the parent of new ClassLoaders
- The system classloader is no longer a URLClassLoader, so we can't get the URLs of JARs in its classpath

## How was this patch tested?

We'll see whether Java 8 tests still pass here. Java 11 tests do not fully pass at this point; more notes below. This does make progress on the failures though.

(NB: to test with Java 11, you need to build with Java 8 first, setting JAVA_HOME and java's executable correctly, then switch both to Java 11 for testing.)

Closes apache#24057 from srowen/SPARK-26839.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

(cherry picked from commit c65f9b2)

* ODP-7038|[SPARK-28723][SQL] Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile

### What changes were proposed in this pull request?

This PR upgrades the built-in Hive to 2.3.6 for `hadoop-3.2`.

Hive 2.3.6 release notes:
- [HIVE-22096](https://issues.apache.org/jira/browse/HIVE-22096): Backport [HIVE-21584](https://issues.apache.org/jira/browse/HIVE-21584) (Java 11 preparation: system class loader is not URLClassLoader)
- [HIVE-21859](https://issues.apache.org/jira/browse/HIVE-21859): Backport [HIVE-17466](https://issues.apache.org/jira/browse/HIVE-17466) (Metastore API to list unique partition-key-value combinations)
- [HIVE-21786](https://issues.apache.org/jira/browse/HIVE-21786): Update repo URLs in poms branch 2.3 version

### Why are the changes needed?
Make Spark support JDK 11.

### Does this PR introduce any user-facing change?
Yes. Please see [SPARK-28684](https://issues.apache.org/jira/browse/SPARK-28684) and [SPARK-24417](https://issues.apache.org/jira/browse/SPARK-24417) for more details.

### How was this patch tested?
Existing unit tests and manual testing.

Closes apache#25443 from wangyum/test-on-jenkins.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

(cherry picked from commit 02a0cde)

* ODP-7038 - Dev - Adding missing orc versions

* ODP-7038: harden Platform.<clinit> Cleaner reflection for JDK11 runtime

On JDK11, jdk.internal.ref is not exported to the unnamed module by
default, so Method.setAccessible() throws InaccessibleObjectException
inside Platform's static block, and spark-shell fails to start with:

  java.lang.ExceptionInInitializerError at ByteArrayMethods.<clinit>
  Caused by: InaccessibleObjectException: Unable to make ...
  jdk.internal.ref.Cleaner.create(Object, Runnable) accessible

Backport the SPARK-26839 graceful-degradation pattern from
upstream 2.4.x+/3.x:

  - Catch InaccessibleObjectException by name (avoids importing the
    JDK9+ class) when setAccessible() on DirectByteBuffer ctor/field
    fails; null both refs.
  - Probe createMethod by calling it with null args; if it throws
    IllegalAccessException, null the method ref.
  - allocateDirectBuffer() now checks for null CLEANER_CREATE_METHOD
    and falls back to ByteBuffer.allocateDirect(size), with a
    helpful OOM message pointing at -XX:MaxDirectMemorySize.

With this, spark-shell on JDK11 starts even without
`--add-opens java.base/jdk.internal.ref=ALL-UNNAMED`. Passing that
flag still enables the reflective Cleaner path and its larger off-heap
budget.

* ODP-7038: restore hive.version to ODP fork 1.2.1.spark24.0.14.1

The earlier SPARK-28723 cherry-pick (9bbdab0) blindly took upstream's
hive.version=1.2.1.spark2, which is the upstream spark-project.hive
1.2.1 line - NOT the ODP fork that lives in odp-hive-spark and ships
as 1.2.1.spark24.0.14.1.

ODP's deployed jar
  standalone-metastore-1.2.1.spark24.0.14.1-hive3.jar
is built from odp-hive-spark/standalone-metastore at 1.2.1.spark24.0.14.1.
Any JDK11 patches for the embedded HiveMetaStoreClient (e.g. HIVE-21508's
toArray fix) belong in odp-hive-spark, not here.

Keep the rest of SPARK-28723 (hive23.version, hadoop-3.2 profile
overrides, ThriftserverShimUtils) intact - those only kick in when
hadoop-3.2 profile selects the Apache Hive 2.3 path.

* ODP-7038: PySpark + bundled py4j source patches for Python 3.11

Stock Spark 2.4 PySpark targets Python 2.7-3.8. Python 3.9 through 3.11
removed or changed several APIs that PySpark and its bundled
py4j-0.10.7 / cloudpickle 0.x still relied on. This commit applies
source-level patches so a fresh `pyspark` session runs cleanly under
Python 3.11.

The big one: replace the 2017-era single-file pyspark/cloudpickle.py
with the vendored cloudpickle 2.2.1 package (exact backport from
upstream Apache Spark 3.x's python/pyspark/cloudpickle/). cloudpickle
2.2.1 (Aug 2022) is the first release with full Python 3.11 support -
bytecode opcode walker handles the new LOAD_GLOBAL flag encoding,
CodeType construction uses .replace() forward-compat, closure cell
serialization adapted to 3.11 frame layout, and many other 3.10/3.11
fixes that would have required dozens of manual patches to the old
copy.

Verified end-to-end on Python 3.11.15: pyspark imports cleanly, lambda
closure round-trips through cloudpickle.dumps()/loads() succeed for
the patterns that previously raised
  TypeError: code() argument 13 must be str, not int
  IndexError: tuple index out of range  (in extract_code_globals)
  RecursionError in save_function/_fill_function

Source changes
--------------
python/pyspark/cloudpickle.py  ->  python/pyspark/cloudpickle/
  Replace single-file 0.x copy with cloudpickle 2.2.1 vendored as a
  package (matching upstream Apache Spark 3.x layout). Only deltas
  vs upstream PyPI cloudpickle 2.2.1:
    * __init__.py:  `from cloudpickle.X` -> `from pyspark.cloudpickle.X`
      (relocates the package under pyspark)
    * cloudpickle_fast.py:634:  add `len(e.args) > 0` guard to the
      RecursionError fallback (same as Apache Spark 3.x's vendor diff)

python/pyspark/resultiterable.py
  Python 3.10 removed the deprecated ABC aliases from `collections`, so
  `class ResultIterable(collections.Iterable)` raised AttributeError on
  import. Import Iterable from collections.abc, with a Python 2 fallback.
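
  A minimal sketch of that import pattern, assuming a simplified
  stand-in class (`ResultIterableSketch` is illustrative and is not the
  actual PySpark source):

```python
try:
    # Python 3.3+: the ABCs live in collections.abc
    # (the only location from Python 3.10 onward).
    from collections.abc import Iterable
except ImportError:
    # Python 2 fallback: the ABCs are exposed directly on collections.
    from collections import Iterable


class ResultIterableSketch(Iterable):
    """Illustrative stand-in that wraps an already-materialized sequence."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)

    def __len__(self):
        return len(self.data)
```

  Because the base class is resolved at import time, the try/except has
  to wrap the import itself rather than the class definition.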

python/pyspark/sql/types.py
python/pyspark/sql/session.py
  pandas 2.0 removed DataFrame.iteritems(). PySpark uses it in
  timestamp localization (types.py) and Arrow batch creation
  (session.py x2). Replace with .items() (present in pandas 1.x and
  2.x) guarded by a getattr() probe so older pandas keeps working.
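
  One way such a probe could look, as a sketch only: the helper name
  `iter_columns` and the exact probe order are assumptions, not the
  code in this commit.

```python
def iter_columns(pdf):
    """Return an iterator of (column_name, column) pairs.

    Hypothetical helper: prefers .items() and only falls back to the
    legacy .iteritems() when .items() is missing on the object.
    """
    column_iter = getattr(pdf, "items", None)
    if column_iter is None:
        # Legacy objects that only expose the old method name.
        column_iter = pdf.iteritems
    return column_iter()
```

  With pandas installed, `for name, series in iter_columns(pdf): ...`
  would then iterate the columns the same way on 1.x and 2.x.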

python/pyspark/mllib/linalg/__init__.py
python/pyspark/ml/linalg/__init__.py
  Python 3.9 removed array.array.tostring(). Replace with .tobytes()
  in the DenseVector / SparseVector / DenseMatrix / SparseMatrix
  pickling paths (6+6 sites). Both methods return the same bytes, so
  serialized payloads stay wire-compatible.
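
  As an illustration of the replacement call (the helper names
  `pack_doubles` and `unpack_doubles` are hypothetical and do not
  appear in PySpark), a round trip through `tobytes()` on Python 3
  might look like:

```python
import array


def pack_doubles(values):
    """Serialize a sequence of floats to raw machine-format bytes."""
    # tobytes() is the Python 3 spelling of the removed tostring().
    return array.array("d", values).tobytes()


def unpack_doubles(payload):
    """Rebuild the array of doubles from bytes produced by pack_doubles."""
    restored = array.array("d")
    restored.frombytes(payload)
    return restored
```

  The payload length is simply the item size times the element count,
  which is what keeps the format stable across the rename.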

python/lib/py4j-0.10.7-src.zip
  Bundled py4j 0.10.7 (from 2018) imports MutableMapping, Sequence,
  MutableSequence, MutableSet, Set straight from `collections`. Python
  3.10 removed those aliases, causing
    ImportError: cannot import name 'MutableMapping' from 'collections'
  Patch the bundled zip: java_collections.py uses `from collections.abc`
  with a `from collections` fallback. Bytes-only change to the zip,
  no version bump (py4j Java jar stays at 0.10.7 so wire-protocol
  compat is preserved).
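
  The commit does not say how the zip was edited. One possible
  approach, sketched with the standard `zipfile` module, is to copy
  every member into a new archive and transform only the target file.
  The helper names and the single-import replacement below are
  assumptions for illustration:

```python
import io
import zipfile


def rewrite_zip_member(src_bytes, member, transform):
    """Return new zip bytes where `member` has been passed through `transform`."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                data = transform(data)
            # Reuse the original ZipInfo so names and timestamps carry over.
            dst.writestr(info, data)
    return out.getvalue()


# Hypothetical before/after text for one of the affected imports.
OLD_IMPORT = b"from collections import MutableMapping\n"
NEW_IMPORT = (
    b"try:\n"
    b"    from collections.abc import MutableMapping\n"
    b"except ImportError:\n"
    b"    from collections import MutableMapping\n"
)


def patch_imports(source):
    """Swap the legacy import for the collections.abc form with a fallback."""
    return source.replace(OLD_IMPORT, NEW_IMPORT)
```

  Rewriting into a fresh archive avoids leaving a stale duplicate
  entry, which appending to the existing zip would do.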

Verification
------------
  $ PYTHONPATH=python:python/lib/py4j-0.10.7-src.zip python3.11 \
      -W ignore -c "import pyspark; print(pyspark.__version__)"
  2.4.8
  $ PYTHONPATH=python:python/lib/py4j-0.10.7-src.zip python3.11 \
      -W ignore -c "
  from pyspark import cloudpickle
  def make():
      x = 42
      return lambda r: (r, r * x)
  f = make()
  assert cloudpickle.loads(cloudpickle.dumps(f))(10) == (10, 420)
  print('closure round-trip OK')"
  closure round-trip OK

* ODP-7038: restore HiveUtils imports + isHive23, fix hadoop-3.2 profile

---------

Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: senthh <senthil.kumar@acceldata.io>
Co-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: Yuming Wang <yumwang@ebay.com>