Skip to content

[SPARK-26839][SQL] Work around classloader changes in Java 9 for Hive isolation#24057

Closed
srowen wants to merge 5 commits into
apache:masterfrom
srowen:SPARK-26839
Closed

[SPARK-26839][SQL] Work around classloader changes in Java 9 for Hive isolation#24057
srowen wants to merge 5 commits into
apache:masterfrom
srowen:SPARK-26839

Conversation

@srowen

@srowen srowen commented Mar 11, 2019

Copy link
Copy Markdown
Member

Note, this doesn't really resolve the JIRA, but makes the changes we can make so far that would be required to solve it.

What changes were proposed in this pull request?

Java 9+ changed how ClassLoaders work. The two most salient points:

  • The boot classloader no longer 'sees' the platform classes. A new 'platform classloader' does and should be the parent of new ClassLoaders
  • The system classloader is no longer a URLClassLoader, so we can't get the URLs of JARs in its classpath

How was this patch tested?

We'll see whether Java 8 tests still pass here. Java 11 tests do not fully pass at this point; more notes below. This does make progress on the failures though.

(NB: to test with Java 11, you need to build with Java 8 first, setting JAVA_HOME and java's executable correctly, then switch both to Java 11 for testing.)

@srowen srowen self-assigned this Mar 11, 2019
extends Logging {

// Check to make sure that the root classloader does not know about Hive.
assert(Try(rootClassLoader.loadClass("org.apache.hadoop.hive.conf.HiveConf")).isFailure)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always failed before with an NPE. The root classloader is represented by null!

name.startsWith("java.lang.") ||
name.startsWith("java.net") ||
name.startsWith("java.") ||
name.startsWith("javax.") ||

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javax.sql et al were not found without this change. I think they should always have been considered non-shared classes? they're in the JDK from Java 8 at least.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I found the same, this seems like a mistake. I also noticed that I needed to add at least

    // needed for at least com.sun.security.auth.module.UnixLoginModule
    name.startsWith("com.sun") ||

When I ran some tests w/ security. I wondered whether I should make that rule even tighter, but I figured everything under com.sun is probably considered shared.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to just do javax.sql in the starts with here given other things like javax.servlet or are those non-shareable as well?

case _: ClassNotFoundException =>
super.loadClass(name, resolve)
if (allJars.isEmpty) {
rootClassLoader

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the tail end of my attempt to get this working; this part doesn't seem quite right.

In Java 9+, we won't have the JARs from the system classloader to construct a new ClassLoader here. (There are risky hacks to get it, but not going there yet.) But this case only comes up in "builtin" mode, where we want to use the JARs already on the classpath. I figured, why build a new ClassLoader? just use the current one.

This gets close but now can't find Spark-Hive integration classes like HiveClientImpl.

I think I need to try baseClassLoader here, and will do so.

Just raising my current state FYI to @squito , maybe @vanzin or @cloud-fan . I know it's been years since you've touched it, but maybe @andrewor14 or @marmbrus wants to see what I'm trying to do here.

@squito squito left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on this and sharing, Sean. This is relatively close to my own wip, I left some inline comments

name.startsWith("java.lang.") ||
name.startsWith("java.net") ||
name.startsWith("java.") ||
name.startsWith("javax.") ||

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I found the same, this seems like a mistake. I also noticed that I needed to add at least

    // needed for at least com.sun.security.auth.module.UnixLoginModule
    name.startsWith("com.sun") ||

When I ran some tests w/ security. I wondered whether I should make that rule even tighter, but I figured everything under com.sun is probably considered shared.

if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9)) {
// Do nothing. The system classloader is no longer a URLClassLoader in Java 9,
// so it won't match the case in allJars above. It no longer exposes URLs of
// the system classpath

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have something slightly different here -- admittedly I wasn't really sure if this was the right thing to do or not, it was just a hack to let me get a little further, not principled at all.

      // For java 11, allJars() does *not* get jars on the classpath added, so we add
      // them here
      val classPathJars = sys.props("java.class.path").split(":").map(new File(_).toURI().toURL())
      logWarning(s"Classpath jars = ${classPathJars.mkString(",")}")
      val jars = allJars(classLoader) ++ classPathJars

I think this will solve your HiveClientImpl issue ... but that doesn't mean its the right thing to do :P

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice call there. I'll try just avoiding a whole new classloader here too just to see if it works; for 'builtin' that seems simpler. But I am wary of changes. But I am wary of depending on something else in the env too, even though that's pretty safe.

@holdenk holdenk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a long time since I looked at the class loader code, so just a quick question with regards to the shared/non-shared logic.

name.startsWith("java.lang.") ||
name.startsWith("java.net") ||
name.startsWith("java.") ||
name.startsWith("javax.") ||

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to just do javax.sql in the starts with here given other things like javax.servlet or are those non-shareable as well?

@srowen

srowen commented Mar 11, 2019

Copy link
Copy Markdown
Member Author

javax.servlet is a good point. I would guess that in practice all the non-JDK javax. classes are shareable w.r.t. Hive, though it could reference older versions I guess. Let me narrow that back down to be safe, once I get a better hold on the other issues here.

@srowen

srowen commented Mar 11, 2019

Copy link
Copy Markdown
Member Author

For those keeping score, the baseClassLoader change I mentioned above lets it make more progress. Now I think we're on to Datanucleus 5.2 being required for Java 11.

The java type java.lang.Long (jdbc-type="", sql-type="") cant be mapped for this datastore. No mapping is available.
org.datanucleus.exceptions.NucleusException: The java type java.lang.Long (jdbc-type="", sql-type="") cant be mapped for this datastore. No mapping is available.
	at org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.getDatastoreMappingClass(RDBMSMappingManager.java:1215)
	at org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.createDatastoreMapping(RDBMSMappingManager.java:1378)
	at org.datanucleus.store.rdbms.table.AbstractClassTable.addDatastoreId(AbstractClassTable.java:392)
	at org.datanucleus.store.rdbms.table.ClassTable.initializePK(ClassTable.java:1087)
	at org.datanucleus.store.rdbms.table.ClassTable.preInitialize(ClassTable.java:247)
	at org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTable(RDBMSStoreManager.java:3118)
	at org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTables(RDBMSStoreManager.java:2909)
	at org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.addClassTablesAndValidate(RDBMSStoreManager.java:3182)
	at org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2841)
	at org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:122)
	at org.datanucleus.store.rdbms.RDBMSStoreManager.addClasses(RDBMSStoreManager.java:1605)
	at org.datanucleus.store.AbstractStoreManager.addClass(AbstractStoreManager.java:954)
	at org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:679)
	at org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:408)
	at org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:947)
	at org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:370)
	at org.datanucleus.store.query.Query.executeQuery(Query.java:1744)
	at org.datanucleus.store.query.Query.executeWithArray(Query.java:1672)
	at org.datanucleus.store.query.Query.execute(Query.java:1654)
	at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:221)
	at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:183)
	at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.<init>(MetaStoreDirectSql.java:137)
	at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:295)
	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)

https://issues.apache.org/jira/browse/HIVE-17632
but:
http://www.datanucleus.org/documentation/news/access_platform_5_2.html
.. may just work when we slip it in.

@SparkQA

SparkQA commented Mar 11, 2019

Copy link
Copy Markdown

Test build #103340 has finished for PR 24057 at commit 9dddcee.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 11, 2019

Copy link
Copy Markdown

Test build #103332 has finished for PR 24057 at commit 31e56ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen

srowen commented Mar 11, 2019

Copy link
Copy Markdown
Member Author

Farther, after updating datanucleus, but I think we might be hitting an incompatibility with Hive, which uses a much older version.

18:42:46.389 WARN DataNucleus.MetaData: Class "org.apache.hadoop.hive.metastore.model.MNotificationLog" field "jdoDetachedState" is an array of (non-serialised) elements of type "java.lang.Object" yet no <join> has been specified. You must have a join table to store an array of non-persistable elements, or serialise the array.
18:42:46.398 WARN DataNucleus.MetaData: Class "org.apache.hadoop.hive.metastore.model.MNotificationNextId" field "jdoDetachedState" is an array of (non-serialised) elements of type "java.lang.Object" yet no <join> has been specified. You must have a join table to store an array of non-persistable elements, or serialise the array.
18:42:46.407 WARN org.apache.hadoop.hive.metastore.MetaStoreDirectSql: Database initialization failed; direct SQL is disabled
javax.jdo.JDOUserException: Identifier name is unresolved (not a static field)
	at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:635)
	at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:456)
	at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:263)
	at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:183)
	at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.<init>(MetaStoreDirectSql.java:137)
	at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:295)
	at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
18:42:46.490 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Retrying creating default database after error: The class "org.apache.hadoop.hive.metastore.model.MVersionTable" is not persistable. This means that it either hasnt been enhanced, or that the enhanced version of the file is not in the CLASSPATH (or is hidden by an unenhanced version), or the Meta-Data/annotations for the class are not found.
org.datanucleus.api.jdo.exceptions.ClassNotPersistenceCapableException: The class "org.apache.hadoop.hive.metastore.model.MVersionTable" is not persistable. This means that it either hasnt been enhanced, or that the enhanced version of the file is not in the CLASSPATH (or is hidden by an unenhanced version), or the Meta-Data/annotations for the class are not found.
	at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:472)
	at org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:717)
	at org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:738)
	at org.apache.hadoop.hive.metastore.ObjectStore.setMetaStoreSchemaVersion(ObjectStore.java:6773)
	at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:6670)
	at org.apache.hadoop.hive.metastore.ObjectStore.verifySchema(ObjectStore.java:6645)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:114)
	at com.sun.proxy.$Proxy27.verifySchema(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:572)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
	at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
	at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)

The jdoDetachedState is something DataNucleus generates into the byte code, and looks like it isn't there, but is now expected.

I'm concerned that we're going to hit a wall with Java 9 compatibility with big dependencies like Hive and Scala. A Hive update may help here, not sure; the JIRA for support of all this in Hive is still open.

@SparkQA

SparkQA commented Mar 11, 2019

Copy link
Copy Markdown

Test build #103342 has finished for PR 24057 at commit b49d65d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen

srowen commented Mar 13, 2019

Copy link
Copy Markdown
Member Author

Yeah, these changes to datanucleus aren't going to work for the Java 8 build, even. I don't think they solve the problem for Java 11 anyway, unfortunately; it may really only work with new Hive versions. I'm going to revert the datanucleus part as I think the rest is, at least, progress and correct for Java 11, and doesn't affect Java 8.

@SparkQA

SparkQA commented Mar 13, 2019

Copy link
Copy Markdown

Test build #103397 has finished for PR 24057 at commit 3ad22b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen

srowen commented Mar 13, 2019

Copy link
Copy Markdown
Member Author

@squito what do you think of this much?

@srowen srowen changed the title [SPARK-26839][WIP][SQL] Work around classloader changes in Java 9 for Hive isolation [SPARK-26839][SQL] Work around classloader changes in Java 9 for Hive isolation Mar 13, 2019
"Unable to locate hive jars to connect to metastore. " +
s"Please set ${HIVE_METASTORE_JARS.key}.")
if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9)) {
// Do nothing. The system classloader is no longer a URLClassLoader in Java 9,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we are unable to detect the missing Hive jar with Java 9+?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code above tries to get the URLs of all the JARs in the classpath from the system classloader. This isn't available in Java 9. However these only seem to be needed in 'builtin' mode, in which case (I believe) we don't need to do this at all -- don't make a new ClassLoader, just use the current ClassLoader.

@squito has an alternative, to read the classpath off of java.class.path. I think that works too. It's less change, but, I was also wondering whether the whole idea of making a new ClassLoader in this mode isn't necessary anyway.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your approach makes more sense. I was a little worried that someone would add a jar directly to classloader, from java code, which wouldn't show up on java.class.path. probably an unsupported gray area anyway, but this might make it work in a few more cases

@squito

squito commented Mar 13, 2019

Copy link
Copy Markdown
Contributor

looks reasonable -- just want to mention again about the extra rule I needed in the list of shared prefixes for security:

    // needed for at least com.sun.security.auth.module.UnixLoginModule
    name.startsWith("com.sun") ||

I don't have any insight yet into the datanucleus issues -- my fork has a different version of hive so I'm not encountering that issue (or at least, not yet). the security thing may also be specific to my fork, though I doubt it.

@srowen

srowen commented Mar 13, 2019

Copy link
Copy Markdown
Member Author

What version of Hive are you looking at? it would be good to know if version X of Hive is even sorta compatible with Java 9+. It seems like Datanucleus only officially supports it all in 5.2+, but older versions might happen to work.

@squito

squito commented Mar 13, 2019

Copy link
Copy Markdown
Contributor

oh the hive version I'm looking at is heavily forked from any upstream version, so I dunno if it would really be that helpful. Nominally its based on hive 2.1.1.

Looks like its using datanucleus 4.1.6, so maybe I'm going to hit the same problems you are.

Can you give me an exact test case where you hit the issues w/ datanucleus? Probably I would hit the datanucleus errors, but its covered up by some other error which I hadn't dug into yet. (Sorry this work has been preempted a bit on my end)

@srowen

srowen commented Mar 13, 2019

Copy link
Copy Markdown
Member Author

@squito OK good to know. If you were to revert the commit 3ad22b6 here and run on Java 11, you'd see the error at #24057 (comment)

Of course, this may mean that we just can't support Hive older than X on Java 11. This particular failure, as I understand, is from testing against 'builtin' Hive, at 1.2.1. It feels like we'd have to update the built-in Hive version (which really ought to happen for 3.0 anyway), and if we need to test against older ones, just disable them for Java 11.

I don't know which Hive version works, so any confirmation you have that later versions work would be a great data point.

(Aside: we still have REPL-related problems in Java 11 which may be hard to resolve as it depends on Scala compile-time support for Java 9+. This may not be the last kind of blocker. Right now seeing if I can get this one in and then move ahead to fix the YARN failure I saw.)

@srowen

srowen commented Mar 19, 2019

Copy link
Copy Markdown
Member Author

@squito what do you think of this much of the change? Don't know if it just works on Hive 2 but I think this much is necessary in any event

@squito

squito commented Mar 19, 2019

Copy link
Copy Markdown
Contributor

It looks reasonable to me, but I'm a hesitant to say the classloader changes to the builtin mode are the right ones till we get everything else working.

But if you feel pretty good about it, I won't object to you committing -- we can always make further adjustments.

@srowen

srowen commented Mar 19, 2019

Copy link
Copy Markdown
Member Author

Yeah that's my take. It may not be the last change here to be sure. It would probably facilitate further testing

@srowen

srowen commented Mar 20, 2019

Copy link
Copy Markdown
Member Author

Merged to master. I didn't resolve the JIRA though.

@srowen srowen closed this in c65f9b2 Mar 20, 2019
@srowen srowen deleted the SPARK-26839 branch March 26, 2019 00:14
sunchao pushed a commit that referenced this pull request Feb 27, 2023
…uiltin' Hive version for metadata client

### What changes were proposed in this pull request?
When using the 'builtin' Hive version for the Hive metadata client, do not create a separate classloader, and rather continue to use the overall user/application classloader (regardless of Java version). This standardizes the behavior for all Java versions with that of Java 9+. See SPARK-42539 for more details on why this approach was chosen.

### Why are the changes needed?
Please see a much more detailed description in SPARK-42539. The tl;dr is that user-provided JARs (such as `hive-exec-2.3.8.jar`) take precedence over Spark/system JARs when constructing the classloader used by `IsolatedClientLoader` on Java 8 in 'builtin' mode, which can cause unexpected behavior and/or breakages. This violates the expectation that, unless user-first classloader mode is used, Spark JARs should be prioritized over user JARs. It also seems that this separate classloader was unnecessary from the start, since the intent of 'builtin' mode is to use the JARs already existing on the regular classloader (as alluded to [here](#24057 (comment))). The isolated clientloader was originally added in #5876 in 2015. This bit in the PR description is the only mention of the behavior for "builtin":
> attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive.

I can't follow the logic here; the user classloader clearly has all of the necessary Hive JARs, since that's where we're getting the JAR URLs from, so we could just use that directly instead of grabbing the URLs. When this was initially added, it only used the JARs from the user classloader, not any of its parents, which I suspect was the motivating factor (to try to avoid more Spark classes being duplicated inside of the isolated classloader, I guess). But that was changed a month later anyway in #6435 / #6459, so I think this may have basically been deadcode from the start. It has also caused at least one issue over the years, e.g. SPARK-21428, which disables the new-classloader behavior in the case of running inside of a CLI session.

### Does this PR introduce _any_ user-facing change?
No, except to protect Spark itself from potentially being broken by bad user JARs.

### How was this patch tested?
This includes a new unit test in `HiveUtilsSuite` which demonstrates the issue and shows that this approach resolves it. It has also been tested on a live cluster running Java 8 and Hive communication functionality continues to work as expected.

Closes #40144 from xkrogen/xkrogen/SPARK-42539/hive-isolatedclientloader-builtin-user-jar-conflict-fix/java9strategy.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Chao Sun <sunchao@apple.com>
sunchao pushed a commit that referenced this pull request Feb 27, 2023
…uiltin' Hive version for metadata client

### What changes were proposed in this pull request?
When using the 'builtin' Hive version for the Hive metadata client, do not create a separate classloader, and rather continue to use the overall user/application classloader (regardless of Java version). This standardizes the behavior for all Java versions with that of Java 9+. See SPARK-42539 for more details on why this approach was chosen.

### Why are the changes needed?
Please see a much more detailed description in SPARK-42539. The tl;dr is that user-provided JARs (such as `hive-exec-2.3.8.jar`) take precedence over Spark/system JARs when constructing the classloader used by `IsolatedClientLoader` on Java 8 in 'builtin' mode, which can cause unexpected behavior and/or breakages. This violates the expectation that, unless user-first classloader mode is used, Spark JARs should be prioritized over user JARs. It also seems that this separate classloader was unnecessary from the start, since the intent of 'builtin' mode is to use the JARs already existing on the regular classloader (as alluded to [here](#24057 (comment))). The isolated clientloader was originally added in #5876 in 2015. This bit in the PR description is the only mention of the behavior for "builtin":
> attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive.

I can't follow the logic here; the user classloader clearly has all of the necessary Hive JARs, since that's where we're getting the JAR URLs from, so we could just use that directly instead of grabbing the URLs. When this was initially added, it only used the JARs from the user classloader, not any of its parents, which I suspect was the motivating factor (to try to avoid more Spark classes being duplicated inside of the isolated classloader, I guess). But that was changed a month later anyway in #6435 / #6459, so I think this may have basically been deadcode from the start. It has also caused at least one issue over the years, e.g. SPARK-21428, which disables the new-classloader behavior in the case of running inside of a CLI session.

### Does this PR introduce _any_ user-facing change?
No, except to protect Spark itself from potentially being broken by bad user JARs.

### How was this patch tested?
This includes a new unit test in `HiveUtilsSuite` which demonstrates the issue and shows that this approach resolves it. It has also been tested on a live cluster running Java 8 and Hive communication functionality continues to work as expected.

Closes #40144 from xkrogen/xkrogen/SPARK-42539/hive-isolatedclientloader-builtin-user-jar-conflict-fix/java9strategy.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Chao Sun <sunchao@apple.com>
xkrogen added a commit to xkrogen/spark that referenced this pull request Feb 28, 2023
…uiltin' Hive version for metadata client

When using the 'builtin' Hive version for the Hive metadata client, do not create a separate classloader, and rather continue to use the overall user/application classloader (regardless of Java version). This standardizes the behavior for all Java versions with that of Java 9+. See SPARK-42539 for more details on why this approach was chosen.

Please see a much more detailed description in SPARK-42539. The tl;dr is that user-provided JARs (such as `hive-exec-2.3.8.jar`) take precedence over Spark/system JARs when constructing the classloader used by `IsolatedClientLoader` on Java 8 in 'builtin' mode, which can cause unexpected behavior and/or breakages. This violates the expectation that, unless user-first classloader mode is used, Spark JARs should be prioritized over user JARs. It also seems that this separate classloader was unnecessary from the start, since the intent of 'builtin' mode is to use the JARs already existing on the regular classloader (as alluded to [here](apache#24057 (comment))). The isolated clientloader was originally added in apache#5876 in 2015. This bit in the PR description is the only mention of the behavior for "builtin":
> attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive.

I can't follow the logic here; the user classloader clearly has all of the necessary Hive JARs, since that's where we're getting the JAR URLs from, so we could just use that directly instead of grabbing the URLs. When this was initially added, it only used the JARs from the user classloader, not any of its parents, which I suspect was the motivating factor (to try to avoid more Spark classes being duplicated inside of the isolated classloader, I guess). But that was changed a month later anyway in apache#6435 / apache#6459, so I think this may have basically been deadcode from the start. It has also caused at least one issue over the years, e.g. SPARK-21428, which disables the new-classloader behavior in the case of running inside of a CLI session.

No, except to protect Spark itself from potentially being broken by bad user JARs.

This includes a new unit test in `HiveUtilsSuite` which demonstrates the issue and shows that this approach resolves it. It has also been tested on a live cluster running Java 8 and Hive communication functionality continues to work as expected.
sunchao pushed a commit that referenced this pull request Mar 1, 2023
…uiltin' Hive version for metadata client

### What changes were proposed in this pull request?
When using the 'builtin' Hive version for the Hive metadata client, do not create a separate classloader, and rather continue to use the overall user/application classloader (regardless of Java version). This standardizes the behavior for all Java versions with that of Java 9+. See SPARK-42539 for more details on why this approach was chosen.

Please note that this is a re-submit of #40144. That one introduced test failures, and potentially a real issue, because the PR works by setting `isolationOn = false` for `builtin` mode. In addition to adjusting the classloader, `HiveClientImpl` relies on `isolationOn` to determine if it should use an isolated copy of `SessionState`, so the PR inadvertently switched to using a shared `SessionState` object. I think we do want to continue to have the isolated session state even in `builtin` mode, so this adds a new flag `sessionStateIsolationOn` which controls whether the session state should be isolated, _separately_ from the `isolationOn` flag which controls whether the classloader should be isolated. Default behavior is for `sessionStateIsolationOn` to be set equal to `isolationOn`, but for `builtin` mode, we override it to enable session state isolated even though classloader isolation is turned off.

### Why are the changes needed?
Please see a much more detailed description in SPARK-42539. The tl;dr is that user-provided JARs (such as `hive-exec-2.3.8.jar`) take precedence over Spark/system JARs when constructing the classloader used by `IsolatedClientLoader` on Java 8 in 'builtin' mode, which can cause unexpected behavior and/or breakages. This violates the expectation that, unless user-first classloader mode is used, Spark JARs should be prioritized over user JARs. It also seems that this separate classloader was unnecessary from the start, since the intent of 'builtin' mode is to use the JARs already existing on the regular classloader (as alluded to [here](#24057 (comment))). The isolated clientloader was originally added in #5876 in 2015. This bit in the PR description is the only mention of the behavior for "builtin":
> attempt to discover the jars that were used to load Spark SQL and use those. This option is only valid when using the execution version of Hive.

I can't follow the logic here; the user classloader clearly has all of the necessary Hive JARs, since that's where we're getting the JAR URLs from, so we could just use that directly instead of grabbing the URLs. When this was initially added, it only used the JARs from the user classloader, not any of its parents, which I suspect was the motivating factor (to try to avoid more Spark classes being duplicated inside of the isolated classloader, I guess). But that was changed a month later anyway in #6435 / #6459, so I think this may have basically been deadcode from the start. It has also caused at least one issue over the years, e.g. SPARK-21428, which disables the new-classloader behavior in the case of running inside of a CLI session.

### Does this PR introduce _any_ user-facing change?
No, except to protect Spark itself from potentially being broken by bad user JARs.

### How was this patch tested?
This includes a new unit test in `HiveUtilsSuite` which demonstrates the issue and shows that this approach resolves it. It has also been tested on a live cluster running Java 8 and Hive communication functionality continues to work as expected.

Unit tests failing in #40144 have been locally tested (`HiveUtilsSuite`, `HiveSharedStateSuite`, `HiveCliSessionStateSuite`, `JsonHadoopFsRelationSuite`).

Closes #40224 from xkrogen/xkrogen/SPARK-42539/hive-isolatedclientloader-builtin-user-jar-conflict-fix/take2.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Chao Sun <sunchao@apple.com>
senthh pushed a commit to acceldata-io/spark that referenced this pull request Jun 24, 2026
… for Hive isolation

Note, this doesn't really resolve the JIRA, but makes the changes we can make so far that would be required to solve it.

## What changes were proposed in this pull request?

Java 9+ changed how ClassLoaders work. The two most salient points:
- The boot classloader no longer 'sees' the platform classes. A new 'platform classloader' does and should be the parent of new ClassLoaders
- The system classloader is no longer a URLClassLoader, so we can't get the URLs of JARs in its classpath

## How was this patch tested?

We'll see whether Java 8 tests still pass here. Java 11 tests do not fully pass at this point; more notes below. This does make progress on the failures though.

(NB: to test with Java 11, you need to build with Java 8 first, setting JAVA_HOME and java's executable correctly, then switch both to Java 11 for testing.)

Closes apache#24057 from srowen/SPARK-26839.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

(cherry picked from commit c65f9b2)
senthh added a commit to acceldata-io/spark that referenced this pull request Jun 25, 2026
* ODP-7038|[SPARK-25946][BUILD] Upgrade ASM to 7.x to support JDK11

## What changes were proposed in this pull request?

Upgrade ASM to 7.x to support JDK11

## How was this patch tested?

Existing tests.

Closes apache#22953 from dbtsai/asm7.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>

(cherry picked from commit 3ed91c9)

* ODP-7038 - Improvement - Enable Spark2 with jdk11 runtime support

* ODP-7038 - Improvement - Enable Spark2 with jdk11 runtime support

* ODP-7038: replace String.lines with split for JDK11 compile

JDK11 added java.lang.String#lines() returning java.util.stream.Stream<String>.
Scala 2.11's StringLike implicit also exposes .lines (Iterator[String]),
but the Java instance method takes resolution priority on JDK11+. The
resulting Stream<String>.toArray returns Object[], and the downstream
.size / .forall(_.size <= N) then fail to typecheck:

  value size is not a member of Object

MatricesSuite (both mllib and mllib-local copies) only needs a plain
newline split, so use .split("\\n") which returns Array[String]
unambiguously on every JDK.

* ODP-7038|[SPARK-26839][SQL] Work around classloader changes in Java 9 for Hive isolation

Note, this doesn't really resolve the JIRA, but makes the changes we can make so far that would be required to solve it.

## What changes were proposed in this pull request?

Java 9+ changed how ClassLoaders work. The two most salient points:
- The boot classloader no longer 'sees' the platform classes. A new 'platform classloader' does and should be the parent of new ClassLoaders
- The system classloader is no longer a URLClassLoader, so we can't get the URLs of JARs in its classpath

## How was this patch tested?

We'll see whether Java 8 tests still pass here. Java 11 tests do not fully pass at this point; more notes below. This does make progress on the failures though.

(NB: to test with Java 11, you need to build with Java 8 first, setting JAVA_HOME and java's executable correctly, then switch both to Java 11 for testing.)

Closes apache#24057 from srowen/SPARK-26839.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

(cherry picked from commit c65f9b2)

* ODP-7038|[SPARK-28723][SQL] Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile

### What changes were proposed in this pull request?

This PR upgrade the built-in Hive to 2.3.6 for `hadoop-3.2`.

Hive 2.3.6 release notes:
- [HIVE-22096](https://issues.apache.org/jira/browse/HIVE-22096): Backport [HIVE-21584](https://issues.apache.org/jira/browse/HIVE-21584) (Java 11 preparation: system class loader is not URLClassLoader)
- [HIVE-21859](https://issues.apache.org/jira/browse/HIVE-21859): Backport [HIVE-17466](https://issues.apache.org/jira/browse/HIVE-17466) (Metastore API to list unique partition-key-value combinations)
- [HIVE-21786](https://issues.apache.org/jira/browse/HIVE-21786): Update repo URLs in poms branch 2.3 version

### Why are the changes needed?
Make Spark support JDK 11.

### Does this PR introduce any user-facing change?
Yes. Please see [SPARK-28684](https://issues.apache.org/jira/browse/SPARK-28684) and [SPARK-24417](https://issues.apache.org/jira/browse/SPARK-24417) for more details.

### How was this patch tested?
Existing unit test and manual test.

Closes apache#25443 from wangyum/test-on-jenkins.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

(cherry picked from commit 02a0cde)

* ODP-7038 - Dev - Adding missing orc versions

* ODP-7038: harden Platform.<clinit> Cleaner reflection for JDK11 runtime

On JDK11, jdk.internal.ref is not exported to the unnamed module by
default, so Method.setAccessible() throws InaccessibleObjectException
inside Platform's static block, and spark-shell fails to start with:

  java.lang.ExceptionInInitializerError at ByteArrayMethods.<clinit>
  Caused by: InaccessibleObjectException: Unable to make ...
  jdk.internal.ref.Cleaner.create(Object, Runnable) accessible

Backport the SPARK-26839 graceful-degradation pattern from
upstream 2.4.x+/3.x:

  - Catch InaccessibleObjectException by name (avoids importing the
    JDK9+ class) when setAccessible() on DirectByteBuffer ctor/field
    fails; null both refs.
  - Probe createMethod by calling it with null args; if it throws
    IllegalAccessException, null the method ref.
  - allocateDirectBuffer() now checks for null CLEANER_CREATE_METHOD
    and falls back to ByteBuffer.allocateDirect(size), with a
    helpful OOM message pointing at -XX:MaxDirectMemorySize.

With this, spark-shell on JDK11 starts even without
`--add-opens java.base/jdk.internal.ref=ALL-UNNAMED`. Adding that
add-opens still gives you the bigger off-heap budget.

* ODP-7038: restore hive.version to ODP fork 1.2.1.spark24.0.14.1

The earlier SPARK-28723 cherry-pick (9bbdab0) blindly took upstream's
hive.version=1.2.1.spark2, which is the upstream spark-project.hive
1.2.1 line - NOT the ODP fork that lives in odp-hive-spark and ships
as 1.2.1.spark24.0.14.1.

ODP's deployed jar
  standalone-metastore-1.2.1.spark24.0.14.1-hive3.jar
is built from odp-hive-spark/standalone-metastore at 1.2.1.spark24.0.14.1.
Any JDK11 patches for the embedded HiveMetaStoreClient (e.g. HIVE-21508's
toArray fix) belong in odp-hive-spark, not here.

Keep the rest of SPARK-28723 (hive23.version, hadoop-3.2 profile
overrides, ThriftserverShimUtils) intact - those only kick in when
hadoop-3.2 profile selects the Apache Hive 2.3 path.

* ODP-7038: PySpark + bundled py4j source patches for Python 3.11

Stock Spark 2.4 PySpark targets Python 2.7-3.8. Python 3.10 and 3.11
broke several APIs PySpark and its bundled py4j-0.10.7 / cloudpickle
0.x still relied on. This commit applies source-level patches so a
fresh `pyspark` session runs cleanly under Python 3.11.

The big one: replace the 2017-era single-file pyspark/cloudpickle.py
with the vendored cloudpickle 2.2.1 package (exact backport from
upstream Apache Spark 3.x's python/pyspark/cloudpickle/). cloudpickle
2.2.1 (Aug 2022) is the first release with full Python 3.11 support -
bytecode opcode walker handles the new LOAD_GLOBAL flag encoding,
CodeType construction uses .replace() forward-compat, closure cell
serialization adapted to 3.11 frame layout, and many other 3.10/3.11
fixes that would have required dozens of manual patches to the old
copy.

Verified end-to-end on Python 3.11.15: pyspark imports cleanly, lambda
closure round-trips through cloudpickle.dumps()/loads() succeed for
the patterns that previously raised
  TypeError: code() argument 13 must be str, not int
  IndexError: tuple index out of range  (in extract_code_globals)
  RecursionError in save_function/_fill_function

Source changes
--------------
python/pyspark/cloudpickle.py  ->  python/pyspark/cloudpickle/
  Replace single-file 0.x copy with cloudpickle 2.2.1 vendored as a
  package (matching upstream Apache Spark 3.x layout). Only deltas
  vs upstream PyPI cloudpickle 2.2.1:
    * __init__.py:  `from cloudpickle.X` -> `from pyspark.cloudpickle.X`
      (relocates the package under pyspark)
    * cloudpickle_fast.py:634:  add `len(e.args) > 0` guard to the
      RecursionError fallback (same as Apache Spark 3.x's vendor diff)

python/pyspark/resultiterable.py
  Python 3.10 removed the lazy collections.* abc aliases. Class
  ResultIterable(collections.Iterable) raised AttributeError on import.
  Import from collections.abc with a Python 2 fallback.

python/pyspark/sql/types.py
python/pyspark/sql/session.py
  pandas 2.0 removed DataFrame.iteritems(). PySpark uses it in
  timestamp localization (types.py) and Arrow batch creation
  (session.py x2). Replace with .items() (present in pandas 1.x and
  2.x) guarded by a getattr() probe so older pandas keeps working.

python/pyspark/mllib/linalg/__init__.py
python/pyspark/ml/linalg/__init__.py
  Python 3.9 removed array.array.tostring(). Replace with .tobytes()
  in the DenseVector / SparseVector / DenseMatrix / SparseMatrix
  pickling paths (6+6 sites). Both methods are bytewise-identical so
  serialized payloads stay wire-compatible.

python/lib/py4j-0.10.7-src.zip
  Bundled py4j 0.10.7 (from 2018) imports MutableMapping, Sequence,
  MutableSequence, MutableSet, Set straight from `collections`. Python
  3.10 removed those aliases, causing
    ImportError: cannot import name 'MutableMapping' from 'collections'
  Patch the bundled zip: java_collections.py uses `from collections.abc`
  with a `from collections` fallback. Bytes-only change to the zip,
  no version bump (py4j Java jar stays at 0.10.7 so wire-protocol
  compat is preserved).

Verification
------------
  $ PYTHONPATH=python:python/lib/py4j-0.10.7-src.zip python3.11 \
      -W ignore -c "import pyspark; print(pyspark.__version__)"
  2.4.8
  $ python3.11 -W ignore -c "
      from pyspark import cloudpickle
      def make():
          x = 42
          return lambda r: (r, r * x)
      f = make()
      assert cloudpickle.loads(cloudpickle.dumps(f))(10) == (10, 420)
      print('closure round-trip OK')"
  closure round-trip OK

* ODP-7038: restore HiveUtils imports + isHive23, fix hadoop-3.2 profile

---------

Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: senthh <senthil.kumar@acceldata.io>
Co-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: Yuming Wang <yumwang@ebay.com>
shubhluck added a commit to acceldata-io/spark that referenced this pull request Jun 25, 2026
* ODP-7038|[SPARK-25946][BUILD] Upgrade ASM to 7.x to support JDK11

## What changes were proposed in this pull request?

Upgrade ASM to 7.x to support JDK11

## How was this patch tested?

Existing tests.

Closes apache#22953 from dbtsai/asm7.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>

(cherry picked from commit 3ed91c9)

* ODP-7038 - Improvement - Enable Spark2 with jdk11 runtime support

* ODP-7038 - Improvement - Enable Spark2 with jdk11 runtime support

* ODP-7038: replace String.lines with split for JDK11 compile

JDK11 added java.lang.String#lines() returning java.util.stream.Stream<String>.
Scala 2.11's StringLike implicit also exposes .lines (Iterator[String]),
but the Java instance method takes resolution priority on JDK11+. The
resulting Stream<String>.toArray returns Object[], and the downstream
.size / .forall(_.size <= N) then fail to typecheck:

  value size is not a member of Object

MatricesSuite (both mllib and mllib-local copies) only needs a plain
newline split, so use .split("\\n") which returns Array[String]
unambiguously on every JDK.

* ODP-7038|[SPARK-26839][SQL] Work around classloader changes in Java 9 for Hive isolation

Note, this doesn't really resolve the JIRA, but makes the changes we can make so far that would be required to solve it.

## What changes were proposed in this pull request?

Java 9+ changed how ClassLoaders work. The two most salient points:
- The boot classloader no longer 'sees' the platform classes. A new 'platform classloader' does and should be the parent of new ClassLoaders
- The system classloader is no longer a URLClassLoader, so we can't get the URLs of JARs in its classpath

## How was this patch tested?

We'll see whether Java 8 tests still pass here. Java 11 tests do not fully pass at this point; more notes below. This does make progress on the failures though.

(NB: to test with Java 11, you need to build with Java 8 first, setting JAVA_HOME and java's executable correctly, then switch both to Java 11 for testing.)

Closes apache#24057 from srowen/SPARK-26839.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

(cherry picked from commit c65f9b2)

* ODP-7038|[SPARK-28723][SQL] Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile

### What changes were proposed in this pull request?

This PR upgrade the built-in Hive to 2.3.6 for `hadoop-3.2`.

Hive 2.3.6 release notes:
- [HIVE-22096](https://issues.apache.org/jira/browse/HIVE-22096): Backport [HIVE-21584](https://issues.apache.org/jira/browse/HIVE-21584) (Java 11 preparation: system class loader is not URLClassLoader)
- [HIVE-21859](https://issues.apache.org/jira/browse/HIVE-21859): Backport [HIVE-17466](https://issues.apache.org/jira/browse/HIVE-17466) (Metastore API to list unique partition-key-value combinations)
- [HIVE-21786](https://issues.apache.org/jira/browse/HIVE-21786): Update repo URLs in poms branch 2.3 version

### Why are the changes needed?
Make Spark support JDK 11.

### Does this PR introduce any user-facing change?
Yes. Please see [SPARK-28684](https://issues.apache.org/jira/browse/SPARK-28684) and [SPARK-24417](https://issues.apache.org/jira/browse/SPARK-24417) for more details.

### How was this patch tested?
Existing unit test and manual test.

Closes apache#25443 from wangyum/test-on-jenkins.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

(cherry picked from commit 02a0cde)

* ODP-7038 - Dev - Adding missing orc versions

* ODP-7038: harden Platform.<clinit> Cleaner reflection for JDK11 runtime

On JDK11, jdk.internal.ref is not exported to the unnamed module by
default, so Method.setAccessible() throws InaccessibleObjectException
inside Platform's static block, and spark-shell fails to start with:

  java.lang.ExceptionInInitializerError at ByteArrayMethods.<clinit>
  Caused by: InaccessibleObjectException: Unable to make ...
  jdk.internal.ref.Cleaner.create(Object, Runnable) accessible

Backport the SPARK-26839 graceful-degradation pattern from
upstream 2.4.x+/3.x:

  - Catch InaccessibleObjectException by name (avoids importing the
    JDK9+ class) when setAccessible() on DirectByteBuffer ctor/field
    fails; null both refs.
  - Probe createMethod by calling it with null args; if it throws
    IllegalAccessException, null the method ref.
  - allocateDirectBuffer() now checks for null CLEANER_CREATE_METHOD
    and falls back to ByteBuffer.allocateDirect(size), with a
    helpful OOM message pointing at -XX:MaxDirectMemorySize.

With this, spark-shell on JDK11 starts even without
`--add-opens java.base/jdk.internal.ref=ALL-UNNAMED`. Adding that
add-opens still gives you the bigger off-heap budget.

* ODP-7038: restore hive.version to ODP fork 1.2.1.spark24.0.14.1

The earlier SPARK-28723 cherry-pick (9bbdab0) blindly took upstream's
hive.version=1.2.1.spark2, which is the upstream spark-project.hive
1.2.1 line - NOT the ODP fork that lives in odp-hive-spark and ships
as 1.2.1.spark24.0.14.1.

ODP's deployed jar
  standalone-metastore-1.2.1.spark24.0.14.1-hive3.jar
is built from odp-hive-spark/standalone-metastore at 1.2.1.spark24.0.14.1.
Any JDK11 patches for the embedded HiveMetaStoreClient (e.g. HIVE-21508's
toArray fix) belong in odp-hive-spark, not here.

Keep the rest of SPARK-28723 (hive23.version, hadoop-3.2 profile
overrides, ThriftserverShimUtils) intact - those only kick in when
hadoop-3.2 profile selects the Apache Hive 2.3 path.

* ODP-7038: PySpark + bundled py4j source patches for Python 3.11

Stock Spark 2.4 PySpark targets Python 2.7-3.8. Python 3.10 and 3.11
broke several APIs PySpark and its bundled py4j-0.10.7 / cloudpickle
0.x still relied on. This commit applies source-level patches so a
fresh `pyspark` session runs cleanly under Python 3.11.

The big one: replace the 2017-era single-file pyspark/cloudpickle.py
with the vendored cloudpickle 2.2.1 package (exact backport from
upstream Apache Spark 3.x's python/pyspark/cloudpickle/). cloudpickle
2.2.1 (Aug 2022) is the first release with full Python 3.11 support -
bytecode opcode walker handles the new LOAD_GLOBAL flag encoding,
CodeType construction uses .replace() forward-compat, closure cell
serialization adapted to 3.11 frame layout, and many other 3.10/3.11
fixes that would have required dozens of manual patches to the old
copy.

Verified end-to-end on Python 3.11.15: pyspark imports cleanly, lambda
closure round-trips through cloudpickle.dumps()/loads() succeed for
the patterns that previously raised
  TypeError: code() argument 13 must be str, not int
  IndexError: tuple index out of range  (in extract_code_globals)
  RecursionError in save_function/_fill_function

Source changes
--------------
python/pyspark/cloudpickle.py  ->  python/pyspark/cloudpickle/
  Replace single-file 0.x copy with cloudpickle 2.2.1 vendored as a
  package (matching upstream Apache Spark 3.x layout). Only deltas
  vs upstream PyPI cloudpickle 2.2.1:
    * __init__.py:  `from cloudpickle.X` -> `from pyspark.cloudpickle.X`
      (relocates the package under pyspark)
    * cloudpickle_fast.py:634:  add `len(e.args) > 0` guard to the
      RecursionError fallback (same as Apache Spark 3.x's vendor diff)

python/pyspark/resultiterable.py
  Python 3.10 removed the lazy collections.* abc aliases. Class
  ResultIterable(collections.Iterable) raised AttributeError on import.
  Import from collections.abc with a Python 2 fallback.

python/pyspark/sql/types.py
python/pyspark/sql/session.py
  pandas 2.0 removed DataFrame.iteritems(). PySpark uses it in
  timestamp localization (types.py) and Arrow batch creation
  (session.py x2). Replace with .items() (present in pandas 1.x and
  2.x) guarded by a getattr() probe so older pandas keeps working.

python/pyspark/mllib/linalg/__init__.py
python/pyspark/ml/linalg/__init__.py
  Python 3.9 removed array.array.tostring(). Replace with .tobytes()
  in the DenseVector / SparseVector / DenseMatrix / SparseMatrix
  pickling paths (6+6 sites). Both methods are bytewise-identical so
  serialized payloads stay wire-compatible.

python/lib/py4j-0.10.7-src.zip
  Bundled py4j 0.10.7 (from 2018) imports MutableMapping, Sequence,
  MutableSequence, MutableSet, Set straight from `collections`. Python
  3.10 removed those aliases, causing
    ImportError: cannot import name 'MutableMapping' from 'collections'
  Patch the bundled zip: java_collections.py uses `from collections.abc`
  with a `from collections` fallback. Bytes-only change to the zip,
  no version bump (py4j Java jar stays at 0.10.7 so wire-protocol
  compat is preserved).

Verification
------------
  $ PYTHONPATH=python:python/lib/py4j-0.10.7-src.zip python3.11 \
      -W ignore -c "import pyspark; print(pyspark.__version__)"
  2.4.8
  $ python3.11 -W ignore -c "
      from pyspark import cloudpickle
      def make():
          x = 42
          return lambda r: (r, r * x)
      f = make()
      assert cloudpickle.loads(cloudpickle.dumps(f))(10) == (10, 420)
      print('closure round-trip OK')"
  closure round-trip OK

* ODP-7038: restore HiveUtils imports + isHive23, fix hadoop-3.2 profile

---------

Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: senthh <senthil.kumar@acceldata.io>
Co-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: Yuming Wang <yumwang@ebay.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants