Skip to content

Build and test hive-metastore with Hive 3 and Hive 4#12681

Closed
wypoon wants to merge 1 commit into
apache:mainfrom
wypoon:hive4
Closed

Build and test hive-metastore with Hive 3 and Hive 4#12681
wypoon wants to merge 1 commit into
apache:mainfrom
wypoon:hive4

Conversation

@wypoon
Copy link
Copy Markdown
Contributor

@wypoon wypoon commented Mar 29, 2025

We define new modules, hive3-metastore and hive4-metastore, that depend on Hive 3.1.3 and Hive 4.0.1 respectively for their Hive dependencies. The existing hive-metastore module continues to depend on Hive 2.3.10.

We keep most of the source files the same for all three modules and avoid duplicating source files by keeping all common files in the hive-metastore directory. Only files which are not the same for all three Hive versions are separated out to hive2-metastore, hive3-metastore and hive4-metastore directories. It turns out only two files (both test files, one being TestHiveMetastore.java) could not be kept the same across all three Hive versions. In order to achieve this, to workaround HIVE-27925 which introduced a backwards incompatibility, instead of using HiveConf.ConfVars enums we use the configuration property names. (This is also the approach taken by Spark in SPARK-47679 to workaround this problem.)

For this PR, we do not touch other modules (such as Flink and Spark modules) that depend on TestHiveMetastore for testing. Those modules continue to depend on hive-metastore, built with Hive 2.3.10.

UPDATE: Alternative implementation keeping all the source files the same for all three modules is in #12721.

@wypoon wypoon changed the title Ignore. Testing WIP. Build and test hive-metastore with Hive 3 and Hive 4 Mar 29, 2025
@wypoon
Copy link
Copy Markdown
Contributor Author

wypoon commented Mar 29, 2025

Note: TestHiveClientPool.java is the same for Hive 3 and 4 due to a change introduced in Hive 3 that is incompatible with Hive 2 (in fact only one test is changed). TestHiveMetastore.java is the same for Hive 2 and 3 and just different for Hive 4. We could deduplicate further but I didn't think it worthwhile. To keep things simple for defining the Gradle sourceSets, if a file cannot be the same for all three Hive versions, I separate it out to three files.

Comment on lines +63 to +64
// create the metastore handlers based on whether we're working with Hive2 or Hive3 dependencies
// we need to do this because there is a breaking API change between Hive2 and Hive3
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is exactly the same as TestHiveMetastore.java in hive2-metastore.
Therefore the DynConstructors and DynMethods machinery is kept/used.

try (InputStream inputStream =
TestHiveMetastore.class
.getClassLoader()
.getResourceAsStream("hive-schema-3.1.0.derby.sql");
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be used for both Hive 2 and Hive 3.

Comment on lines +214 to +226
baseHandler = new HMSHandler("new db based metaserver", serverConf);
IHMSHandler handler = HMSHandlerProxyFactory.getProxy(serverConf, baseHandler, false);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this file is only for Hive 4, we don't need to use the DynConstructors and DynMethods machinery. We simply call the relevant constructor and static method directly!

try (InputStream inputStream =
TestHiveMetastore.class
.getClassLoader()
.getResourceAsStream("hive-schema-4.0.0.derby.sql");
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New schema needed for Hive 4.

Comment on lines -448 to +447
Namespace namespace1 = Namespace.of("noLocation");
Namespace namespace1 = Namespace.of("nolocation");
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, in Hive 4, the org.apache.hadoop.hive.metastore.api.Database below (database1) returns all-lowercase "nolocation.db" when getLocationUri() is called.

Copy link
Copy Markdown
Contributor Author

@wypoon wypoon Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With Namespace.of("noLocation"), when HiveCatalog::createNamespace is called, the metastore client calls createDatabase with a Database containing name "noLocation" and locationUri ending with "noLocation.db", but when the client calls getDatabase, it gets back a Database containing name "nolocation" and locationUri ending with "nolocation.db". On disk, however, I can see that a directory "noLocation.db" has been created in the warehouse directory.
@pvary @deniskuzZ could you shed light on what changed in Hive 4 that causes this behavior? Is it anything we need to do something about on the Iceberg side?

Copy link
Copy Markdown
Member

@deniskuzZ deniskuzZ Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to check, however, as I recall Hive is case-insensitive for database, table, and partition identifiers, typically converting them to lowercase.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, in my test create database iCe; directory is lower-cased:

aws s3 ls s3://xxx/warehouse/tablespace/external/hive/ice.db
                           PRE ice.db/

Copy link
Copy Markdown
Member

@deniskuzZ deniskuzZ Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

difference is probably because Hive doesn't normally use HiveCatalog#createNamespace, only HMS RestCatalog does

database.setLocationUri(new Path(getExternalWarehouseLocation(), namespace.level(0)).toString() + ".db");

HMS (code hasn't changed from 2018)

private String dbDirFromDbName(Database db) throws MetaException {
  return db.getName().toLowerCase() + DATABASE_WAREHOUSE_SUFFIX;
}

@Test
public void testInvalidObjectException() {
TableIdentifier badTi = TableIdentifier.of(DB_NAME, "`tbl`");
TableIdentifier badTi = TableIdentifier.of(DB_NAME, "tábl");
Copy link
Copy Markdown
Contributor Author

@wypoon wypoon Mar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before Hive 4, the only special character allowed is '/':
https://github.com/apache/hive/blob/branch-3.1/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L155

  private static final char[] specialCharactersInTableNames = new char[] { '/' };

In Hive 4, all non-alphanumeric characters in a US keyboard, including backtick ('`'), are allowed special characters:
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L175-L181

  private static final char[] SPECIAL_CHARACTERS_IN_TABLE_NAMES = new char[] {
      // standard
      ' ', '"', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '[', ']',
      '_', '|', '{', '}', '$', '^',
      // non-standard
      '!', '~', '#', '@', '`'
  };

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to test changes before and after hive4?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we do not. The purpose of the test is simply to test exception handling when the table name is not accepted by Hive. For this purpose, any name not accepted by Hive would do. Previously, a name containing the backtick character ('`') fit the bill, but now we need to use a different kind of name; that's all.

@pvary
Copy link
Copy Markdown
Contributor

pvary commented Apr 3, 2025

Hi @wypoon,
Thanks for taking a look at this. I really would like to see HiveCatalog testing against Hive4, but Spark has constraints around testing newer Hive versions which we need to solve, or wait until all of the supported Spark version will have an updated Hive.

OTOH, if possible, I would like to see only a single TestHiveMetastore class. Could we do some changes internally to the class, which based on the classes available on the classpath, or some configuration would use the correct Hive versions (maybe with DynMethods)?

@wypoon
Copy link
Copy Markdown
Contributor Author

wypoon commented Apr 4, 2025

@pvary I replied on the dev list.

I really would like to see HiveCatalog testing against Hive4, but Spark has constraints around testing newer Hive versions which we need to solve, or wait until all of the supported Spark version will have an updated Hive.

I think testing Flink and Spark with a Hive 3/Hive 4 metastore should be a separate PR. This is the first step. I also think this step makes it possible for the Hive project to drop its fork of this module from the Hive repo.
I also do not think we should wait until all supported Spark versions support Hive 4, for this is never going to happen for Spark 3.4 or even Spark 3.5 (which is a LTS version and which I expect Iceberg to support for a while).

OTOH, if possible, I would like to see only a single TestHiveMetastore class. Could we do some changes internally to the class, which based on the classes available on the classpath, or some configuration would use the correct Hive versions (maybe with DynMethods)?

The problem is with the HMSHandler class. Yes, I can use DynConstructors to construct the HMSHandler using the classname based on the Hive version, but TestHiveMetastore holds an instance of HMSHandler in a field, baseHandler, and in TestHiveMetastore::stop(), it calls baseHandler.shutdown(). We need to declare this field, but the type of the field is org.apache.hadoop.hive.metastore.HMSHandler in Hive 4 and org.apache.hadoop.hive.metastore.HiveMetaStore.HMSHandler in Hive 2 and 3. I don't see how to declare that field in a way independent of the Hive version. The interface that HMSHandler implements, IHMSHandler, does not have a shutdown() method. I need to call that method on HMSHandler, not IHMSHandler.

@wypoon
Copy link
Copy Markdown
Contributor Author

wypoon commented Apr 4, 2025

@pvary see #12721.

@wypoon
Copy link
Copy Markdown
Contributor Author

wypoon commented Apr 4, 2025

@pvary I have rebased on main.
I see that this picked up c661a71 / #12637. In that change, a new test was added: TestHiveCommitLocks::testMultipleAlterTableForNoLock. This fails for Hive 3.1.3. I checked https://issues.apache.org/jira/browse/HIVE-26882 and see that it is fixed in 2.3.10 and 4.0.0-beta-1; it does not seem to be fixed in Hive 3 as there is no release after 3.1.3. Based on the date of the last commit in https://github.com/apache/hive/commits/rel/release-3.1.3/ and the date apache/hive#3944 was merged, I conclude that HIVE-26882 is not available in 3.1.3.
Based on my understanding, it is not safe to set engine.hive.lock-enabled to false if HIVE-26882 is not available in the HMS version. @lirui-apache can you confirm? I have disabled TestHiveCommitLocks::testMultipleAlterTableForNoLock for Hive 3 to avoid the failure.

Comment thread build.gradle
}
}

project(':iceberg-hive3-metastore') {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to put everything under hive-metastore directory.

hive-metastore/v2
hive-metastore/v3
hive-metastore/v4

// generate key elements in a certain order, so that the Key instances are comparable
List<Object> elements = Lists.newArrayList();
elements.add(conf.get(HiveConf.ConfVars.METASTOREURIS.varname, ""));
elements.add(conf.get("hive.metastore.uris", ""));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to define this constant in an Iceberg util class.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not necessary and I don't see the benefit. Spark had to deal with this issue too, and they used the conf name or HiveConf.getConfVars("...") with the conf name if they needed a enum instance. They didn't go so far as to define a collection of constants for the conf names.

@lirui-apache
Copy link
Copy Markdown
Contributor

@wypoon The no-lock feature needs both HIVE-26882 and HIVE-28121 to work, which are not available in Hive 3.1.3. So +1 to skipping the test for Hive 3.x.

@manuzhang
Copy link
Copy Markdown
Member

I'm +1 for this approach, like how we've supported multiple versions of Flink and Spark. I'm suggesting to keep everything under hive-metatore directory.

@wypoon
Copy link
Copy Markdown
Contributor Author

wypoon commented Apr 7, 2025

@manuzhang thanks for reviewing!
I'd prefer not to follow the approach taken by Spark and Flink, to have versioned directories with duplicated code. I'm trying to keep a single version of code all in one directory as far as possible. The model is more the old mr and hive3 directories.
If you mean keep versioned directories under hive-metastore, but do not duplicate code unless necessary, do you suggest a directory for common? E.g.,

hive-metastore/common
hive-metastore/v2
hive-metastore/v3
hive-metastore/v4

The idea has some appeal, but I didn't do this as I was trying to avoid too much disruption to the existing code base.
It seems a lot more moving files around without substantial benefit.

@manuzhang
Copy link
Copy Markdown
Member

manuzhang commented Apr 8, 2025

If you mean keep versioned directories under hive-metastore, but do not duplicate code unless necessary

Yes, this is what I mean.

If you don't want to move the existing hive-metastore module. How about adding an hive-metastore-tests directory and put v2, v3 and v4 under it? This directory can be later merged into hive-metastore when we deprecate hive2 and hive3 connectors.

@wypoon
Copy link
Copy Markdown
Contributor Author

wypoon commented Apr 9, 2025

@manuzhang thanks for your feedback.
@pvary was very keen to try to keep a single code base if possible. Please take a look at #12721. After some iterations with Peter, I have now implemented that objective. So now all the code and resources are in the hive-metastore directory.
I'm still open to the idea of having builds under

hive-metastore/v2
hive-metastore/v3
hive-metastore/v4

and there is probably some clever way to do that in gradle. Unfortunately, my knowledge of gradle is limited. That can also be a follow-up nice-to-do.

@github-actions
Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label May 11, 2025
@github-actions
Copy link
Copy Markdown

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this May 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants