Build and test hive-metastore with Hive 3 and Hive 4 #12681
Note: TestHiveClientPool.java is the same for Hive 3 and 4 due to a change introduced in Hive 3 that is incompatible with Hive 2 (in fact only one test is changed). TestHiveMetastore.java is the same for Hive 2 and 3 and just different for Hive 4. We could deduplicate further but I didn't think it worthwhile. To keep things simple for defining the Gradle |
| // create the metastore handlers based on whether we're working with Hive2 or Hive3 dependencies | ||
| // we need to do this because there is a breaking API change between Hive2 and Hive3 |
This file is exactly the same as TestHiveMetastore.java in hive2-metastore.
Therefore the DynConstructors and DynMethods machinery is kept/used.
| try (InputStream inputStream = | ||
| TestHiveMetastore.class | ||
| .getClassLoader() | ||
| .getResourceAsStream("hive-schema-3.1.0.derby.sql"); |
This can be used for both Hive 2 and Hive 3.
| baseHandler = new HMSHandler("new db based metaserver", serverConf); | ||
| IHMSHandler handler = HMSHandlerProxyFactory.getProxy(serverConf, baseHandler, false); |
Since this file is only for Hive 4, we don't need to use the DynConstructors and DynMethods machinery. We simply call the relevant constructor and static method directly!
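To illustrate the trade-off described in this comment, here is a minimal sketch of version-dependent construction using only JDK reflection. It is not Iceberg's DynConstructors API and does not touch any Hive class; `ReflectiveHandlerSketch`, `Handler`, `TwoArgHandler` and the class name `com.example.MissingHandler` are hypothetical stand-ins chosen for illustration.

```java
import java.lang.reflect.Constructor;

// Hypothetical sketch: pick whichever handler class is present on the classpath.
final class ReflectiveHandlerSketch {
  interface Handler {
    String describe();
  }

  // Stand-in for a handler whose constructor takes (name, conf).
  static final class TwoArgHandler implements Handler {
    private final String name;

    public TwoArgHandler(String name, String conf) {
      this.name = name;
    }

    @Override
    public String describe() {
      return "two-arg:" + name;
    }
  }

  // Tries each candidate class name in order and instantiates the first one
  // that can be loaded and has a (String, String) constructor.
  static Handler newHandler(String name, String conf, String... candidateClassNames) {
    for (String className : candidateClassNames) {
      try {
        Class<?> cls = Class.forName(className);
        Constructor<?> ctor = cls.getDeclaredConstructor(String.class, String.class);
        ctor.setAccessible(true);
        return (Handler) ctor.newInstance(name, conf);
      } catch (ReflectiveOperationException e) {
        // This candidate is not usable in the current environment; try the next one.
      }
    }
    throw new IllegalStateException("No handler implementation found on the classpath");
  }

  public static void main(String[] args) {
    Handler handler =
        newHandler(
            "new db based metaserver",
            "conf",
            "com.example.MissingHandler", // assumed to be absent
            TwoArgHandler.class.getName());
    System.out.println(handler.describe());
  }
}
```

When a source file is compiled against a single Hive version only, as with the Hive 4 copy of this test, the lookup loop is unnecessary and a direct `new` call is simpler and checked at compile time.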
| try (InputStream inputStream = | ||
| TestHiveMetastore.class | ||
| .getClassLoader() | ||
| .getResourceAsStream("hive-schema-4.0.0.derby.sql"); |
New schema needed for Hive 4.
| Namespace namespace1 = Namespace.of("noLocation"); | ||
| Namespace namespace1 = Namespace.of("nolocation"); |
For some reason, in Hive 4, the org.apache.hadoop.hive.metastore.api.Database below (database1) returns all-lowercase "nolocation.db" when getLocationUri() is called.
With Namespace.of("noLocation"), when HiveCatalog::createNamespace is called, the metastore client calls createDatabase with a Database containing name "noLocation" and locationUri ending with "noLocation.db", but when the client calls getDatabase, it gets back a Database containing name "nolocation" and locationUri ending with "nolocation.db". On disk, however, I can see that a directory "noLocation.db" has been created in the warehouse directory.
@pvary @deniskuzZ could you shed light on what changed in Hive 4 that causes this behavior? Is it anything we need to do something about on the Iceberg side?
I need to check, however, as I recall Hive is case-insensitive for database, table, and partition identifiers, typically converting them to lowercase.
btw, in my test, running create database iCe; produces a lower-cased directory:
aws s3 ls s3://xxx/warehouse/tablespace/external/hive/ice.db
PRE ice.db/
The difference is probably because Hive doesn't normally use HiveCatalog#createNamespace; only the HMS RestCatalog does:
database.setLocationUri(new Path(getExternalWarehouseLocation(), namespace.level(0)).toString() + ".db");
HMS (this code hasn't changed since 2018):
private String dbDirFromDbName(Database db) throws MetaException {
return db.getName().toLowerCase() + DATABASE_WAREHOUSE_SUFFIX;
}
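The mismatch discussed in this thread can be reproduced in isolation. The sketch below mirrors only the string handling of the two snippets quoted above; `DbLocationSketch` and its method names are hypothetical, and `Locale.ROOT` is an addition here (the quoted HMS code calls `toLowerCase()` without a locale).

```java
import java.util.Locale;

// Hypothetical sketch of how the two sides derive a database directory name.
final class DbLocationSketch {
  static final String DATABASE_WAREHOUSE_SUFFIX = ".db";

  // Catalog side: the namespace name is used verbatim.
  static String catalogSideDir(String namespace) {
    return namespace + DATABASE_WAREHOUSE_SUFFIX;
  }

  // Metastore side: the database name is lower-cased first.
  static String metastoreSideDir(String dbName) {
    return dbName.toLowerCase(Locale.ROOT) + DATABASE_WAREHOUSE_SUFFIX;
  }

  public static void main(String[] args) {
    String namespace = "noLocation";
    System.out.println(catalogSideDir(namespace));
    System.out.println(metastoreSideDir(namespace));
  }
}
```

For a mixed-case namespace the two results differ only in case, which would be consistent with the test seeing "nolocation.db" from getLocationUri() while a "noLocation.db" directory exists on disk.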
| @Test | ||
| public void testInvalidObjectException() { | ||
| TableIdentifier badTi = TableIdentifier.of(DB_NAME, "`tbl`"); | ||
| TableIdentifier badTi = TableIdentifier.of(DB_NAME, "tábl"); |
Before Hive 4, the only special character allowed is '/':
https://github.com/apache/hive/blob/branch-3.1/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L155
private static final char[] specialCharactersInTableNames = new char[] { '/' };
In Hive 4, all non-alphanumeric characters on a US keyboard, including backtick ('`'), are allowed special characters:
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L175-L181
private static final char[] SPECIAL_CHARACTERS_IN_TABLE_NAMES = new char[] {
// standard
' ', '"', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '[', ']',
'_', '|', '{', '}', '$', '^',
// non-standard
'!', '~', '#', '@', '`'
};
Do we need to test the behavior both before and after Hive 4?
No, we do not. The purpose of the test is simply to test exception handling when the table name is not accepted by Hive. For this purpose, any name not accepted by Hive would do. Previously, a name containing the backtick character ('`') fit the bill, but now we need to use a different kind of name; that's all.
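As a rough illustration of why the test name had to change, the sketch below checks names against a character class built from ASCII word characters plus a set of allowed special characters. The shape of the check is an assumption made for illustration, not Hive's actual validation code; `TableNameCheckSketch` and `isValid` are hypothetical names, and the two character arrays are copied from the snippets quoted above.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: validate a table name against word characters plus special characters.
final class TableNameCheckSketch {
  // The single special character allowed before Hive 4, per the branch-3.1 snippet.
  static final char[] HIVE3_SPECIAL = {'/'};

  // The special characters allowed in Hive 4, per the master snippet.
  static final char[] HIVE4_SPECIAL = {
    ' ', '"', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?',
    '[', ']', '_', '|', '{', '}', '$', '^', '!', '~', '#', '@', '`'
  };

  static boolean isValid(String name, char[] specialCharacters) {
    // \w matches only [a-zA-Z_0-9] here because UNICODE_CHARACTER_CLASS is not enabled.
    StringBuilder allowed = new StringBuilder("\\w");
    for (char c : specialCharacters) {
      // A backslash before a non-alphabetic character always yields that literal character.
      allowed.append('\\').append(c);
    }
    return Pattern.compile("[" + allowed + "]+").matcher(name).matches();
  }

  public static void main(String[] args) {
    System.out.println(isValid("`tbl`", HIVE3_SPECIAL));
    System.out.println(isValid("`tbl`", HIVE4_SPECIAL));
    System.out.println(isValid("t\u00e1bl", HIVE4_SPECIAL)); // "tábl"
  }
}
```

Under these assumptions a backtick-quoted name is rejected with the Hive 3 character set but accepted with the Hive 4 set, while a name containing a non-ASCII letter is rejected by both.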
|
Hi @wypoon, OTOH, if possible, I would like to see only a single TestHiveMetastore class. Could we make some changes inside the class so that, based on the classes available on the classpath or on some configuration, it uses the correct Hive version (maybe with DynMethods)?
|
@pvary I replied on the dev list.
I think testing Flink and Spark with a Hive 3/Hive 4 metastore should be a separate PR. This is the first step. I also think this step makes it possible for the Hive project to drop its fork of this module from the Hive repo.
The problem is with the HMSHandler class. Yes, I can use |
|
@pvary I have rebased on main. |
| } | ||
| } | ||
|
|
||
| project(':iceberg-hive3-metastore') { |
It might be better to put everything under hive-metastore directory.
hive-metastore/v2
hive-metastore/v3
hive-metastore/v4
| // generate key elements in a certain order, so that the Key instances are comparable | ||
| List<Object> elements = Lists.newArrayList(); | ||
| elements.add(conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "")); | ||
| elements.add(conf.get("hive.metastore.uris", "")); |
It might be better to define this constant in an Iceberg util class.
I think it is not necessary and I don't see the benefit. Spark had to deal with this issue too, and they used the conf name, or HiveConf.getConfVars("...") with the conf name if they needed an enum instance. They didn't go so far as to define a collection of constants for the conf names.
|
@wypoon The no-lock feature needs both HIVE-26882 and HIVE-28121 to work, which are not available in Hive 3.1.3. So +1 to skipping the test for Hive 3.x. |
|
I'm +1 for this approach, like how we've supported multiple versions of Flink and Spark. I'm suggesting to keep everything under |
|
@manuzhang thanks for reviewing! The idea has some appeal, but I didn't do this as I was trying to avoid too much disruption to the existing code base. |
Yes, this is what I mean. If you don't want to move the existing |
|
@manuzhang thanks for your feedback. and there is probably some clever way to do that in Gradle. Unfortunately, my knowledge of Gradle is limited. That can also be a nice-to-have follow-up. |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
We define new modules, hive3-metastore and hive4-metastore, that depend on Hive 3.1.3 and Hive 4.0.1 respectively for their Hive dependencies. The existing hive-metastore module continues to depend on Hive 2.3.10.
We keep most of the source files the same for all three modules and avoid duplicating source files by keeping all common files in the hive-metastore directory. Only files that differ between the three Hive versions are separated out into the hive2-metastore, hive3-metastore and hive4-metastore directories. It turns out only two files (both test files, one being TestHiveMetastore.java) could not be kept the same across all three Hive versions. To achieve this, and to work around HIVE-27925, which introduced a backwards incompatibility, we use the configuration property names instead of the HiveConf.ConfVars enums. (This is also the approach taken by Spark in SPARK-47679 to work around this problem.)
For this PR, we do not touch other modules (such as the Flink and Spark modules) that depend on TestHiveMetastore for testing. Those modules continue to depend on hive-metastore, built with Hive 2.3.10.
UPDATE: An alternative implementation that keeps all the source files the same for all three modules is in #12721.