Python: Add Hive Catalog by Fokko · Pull Request #5391 · apache/iceberg

Fokko · 2022-07-29T22:02:42Z

Adds a Hive catalog with the basic operations. Some parts are still stubbed:

The Table object, which is also being altered in the REST catalog PR.
Fetching the metadata JSON. It would be good to do this in a separate PR to keep both manageable in size.

This PR adds two vendored packages that are compiled from Thrift definitions. Instructions can be found in python/vendor/README.md.

rdblue · 2022-07-30T21:06:23Z

+    def create_namespace(self, namespace: Union[str, Identifier], properties: Optional[Properties] = None) -> None:
+        database_name = self.identifier_to_tuple(namespace)[0]
+
+        hive_database = HiveDatabase(name=database_name, parameters=properties)


This should handle the location URI and description like the Java HiveCatalog does.

Nice, wasn't aware of that!

Hmm, this also sets the location:

    private String getWarehouseLocation() {
      String warehouseLocation = conf.get(HiveConf.ConfVars.METASTOREWAREHOUSE.varname);
      Preconditions.checkNotNull(
          warehouseLocation, "Warehouse location is not set: hive.metastore.warehouse.dir=null");
      return warehouseLocation;
    }

We don't have the HiveContext in Python.

Warehouse location is used to create default database paths. But if a database has a specific path we should set it in the metastore like Java does.

It's okay that we don't have a config that can tell us the warehouse location. Iceberg has a standard catalog configuration property, warehouse, that we use across catalogs. We should use that here as well.
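A minimal sketch of how the `warehouse` property could stand in for the Hive configuration lookup. The function name `resolve_namespace_location`, the `location` property key, and the `<name>.db` suffix are assumptions for illustration, not code from this PR:

```python
from typing import Dict, Optional

WAREHOUSE = "warehouse"  # standard Iceberg catalog property named in the discussion
LOCATION = "location"  # assumed key for an explicit namespace location


def resolve_namespace_location(
    database_name: str,
    catalog_properties: Dict[str, str],
    namespace_properties: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Prefer an explicit location, else derive a default one from the warehouse."""
    namespace_properties = namespace_properties or {}
    if LOCATION in namespace_properties:
        # A database with a specific path keeps that path.
        return namespace_properties[LOCATION]
    warehouse = catalog_properties.get(WAREHOUSE)
    if warehouse is None:
        # Nothing configured: leave the decision to the metastore.
        return None
    # Assumed default layout: <warehouse>/<database>.db
    return f"{warehouse.rstrip('/')}/{database_name}.db"
```

With this shape, a missing `warehouse` is not an error on the Python side, which differs from the `Preconditions.checkNotNull` in the Java snippet above.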

rdblue · 2022-07-30T21:20:09Z

+    def _columns(self, schema: Schema) -> List[FieldSchema]:
+        return [FieldSchema(field.name, self._convert_hive_type(field.field_type), field.doc) for field in schema.fields]
+
+    def _convert_hive_type(self, col_type: IcebergType) -> str:


I think we'll want a real implementation for this using a type visitor, since the map is very basic.

Since this is part of the write path, I'd postpone it until then. The visitor will be simple, but we need to test it thoroughly :)
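For reference, a rough sketch of what a visitor-style conversion could look like, here using `functools.singledispatch`. The type classes below are simplified stand-ins defined only for this example, not the real Iceberg classes, and `to_hive_type` is a hypothetical name:

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import Tuple


class IcebergType:
    """Stand-in base class for this sketch."""


class IntegerType(IcebergType):
    pass


class LongType(IcebergType):
    pass


class StringType(IcebergType):
    pass


@dataclass(frozen=True)
class ListType(IcebergType):
    element: IcebergType


@dataclass(frozen=True)
class MapType(IcebergType):
    key: IcebergType
    value: IcebergType


@dataclass(frozen=True)
class NestedField:
    name: str
    field_type: IcebergType


@dataclass(frozen=True)
class StructType(IcebergType):
    fields: Tuple[NestedField, ...]


@singledispatch
def to_hive_type(iceberg_type: IcebergType) -> str:
    # Fail loudly instead of silently falling back to a default type.
    raise NotImplementedError(f"Unsupported type: {iceberg_type!r}")


@to_hive_type.register(IntegerType)
def _(iceberg_type: IntegerType) -> str:
    return "int"


@to_hive_type.register(LongType)
def _(iceberg_type: LongType) -> str:
    return "bigint"


@to_hive_type.register(StringType)
def _(iceberg_type: StringType) -> str:
    return "string"


@to_hive_type.register(ListType)
def _(iceberg_type: ListType) -> str:
    # Nested types recurse into their children.
    return f"array<{to_hive_type(iceberg_type.element)}>"


@to_hive_type.register(MapType)
def _(iceberg_type: MapType) -> str:
    return f"map<{to_hive_type(iceberg_type.key)},{to_hive_type(iceberg_type.value)}>"


@to_hive_type.register(StructType)
def _(iceberg_type: StructType) -> str:
    inner = ",".join(f"{f.name}:{to_hive_type(f.field_type)}" for f in iceberg_type.fields)
    return f"struct<{inner}>"
```

Unlike a flat lookup map, the recursion handles nested lists, maps, and structs, which is the part that would need the thorough testing mentioned above.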


rdblue · 2022-08-01T23:32:02Z


    @abstractmethod
-    def list_tables(self, namespace: str | Identifier | None = None) -> list[Identifier]:
+    def list_tables(self, namespace: str | Identifier) -> list[Identifier]:


I think that having None here was correct. You can list tables in the root namespace.

Technically I understand it is possible, but this isn't the case for Hive and also not for the REST catalog (namespace is required): https://github.com/apache/iceberg/blob/master/open-api/rest-catalog-open-api.yaml#L401

Therefore I decided to remove it for now. WDYT?

This isn't a change I would mix in with the current PR, and I probably wouldn't bother making it anyway. The Catalog contract should allow this, so I don't see a reason to remove it from the contract documentation just because we don't have an implementation in the Python code yet.
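One way to reconcile both positions, sketched under assumptions: the abstract contract keeps the optional namespace, and a catalog that cannot list the root namespace rejects `None` itself. `Identifier` is simplified to a tuple of strings and `SingleLevelCatalog` is a hypothetical class for illustration:

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple, Union

Identifier = Tuple[str, ...]  # simplified stand-in for the real alias


class Catalog(ABC):
    @abstractmethod
    def list_tables(self, namespace: Optional[Union[str, Identifier]] = None) -> List[Identifier]:
        """List tables in a namespace; None refers to the root namespace."""


class SingleLevelCatalog(Catalog):
    """Hypothetical catalog that, like Hive, always needs a database name."""

    def __init__(self, tables: Dict[str, List[str]]) -> None:
        self._tables = tables  # database name -> table names

    def list_tables(self, namespace: Optional[Union[str, Identifier]] = None) -> List[Identifier]:
        if namespace is None:
            # The contract allows None, but this implementation cannot serve it.
            raise ValueError("This catalog requires a namespace to list tables")
        database = namespace if isinstance(namespace, str) else namespace[0]
        return [(database, table) for table in self._tables.get(database, [])]
```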

rdblue · 2022-08-01T23:43:50Z

Looking great! Just a few more things and I think this will be ready to go.

Also, did you see my comment about removing the CLI utilities? I don't think that we should include those in the vendored code.

rdblue · 2022-08-02T22:38:42Z

+        if key == "comment":
+            database.description = value
+        elif key == "comment":
+            database.description = value


I think you probably intended to update this to location and set the location URI?
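A sketch of what the suggested fix might look like. `HiveDatabaseStub` is a stand-in for the vendored Thrift class, the `location` key is taken from the comment above, and the `else` branch that keeps the remaining properties as parameters is an assumption about the surrounding code, which the diff excerpt does not show:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HiveDatabaseStub:
    """Stand-in for the Thrift-generated database object."""

    name: str
    description: Optional[str] = None
    locationUri: Optional[str] = None
    parameters: Dict[str, str] = field(default_factory=dict)


def annotate_namespace(database: HiveDatabaseStub, properties: Dict[str, str]) -> HiveDatabaseStub:
    for key, value in properties.items():
        if key == "comment":
            database.description = value
        elif key == "location":
            # Second branch now handles the location instead of repeating "comment".
            database.locationUri = value
        else:
            # Assumed behavior: everything else is stored as a plain parameter.
            database.parameters[key] = value
    return database
```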

rdblue · 2022-08-02T22:46:40Z

@Fokko, I tried to fix _annotate_namespace to get this in, but the tests are failing. Looks like it is caused by merging #5421.

Also, I think we should make sure there's a test that catches the locationUri problem. Otherwise, this is good to merge when tests are passing.
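A rough idea of a check that would have caught this, assuming a stand-in database object. `FakeDatabase`, `buggy_annotate`, and `location_is_propagated` are hypothetical names; `buggy_annotate` reproduces the duplicated branch from the diff so the check can be seen failing on it:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class FakeDatabase:
    """Stand-in for the Thrift-generated database object."""

    name: str
    description: Optional[str] = None
    locationUri: Optional[str] = None
    parameters: Dict[str, str] = field(default_factory=dict)


def buggy_annotate(database: FakeDatabase, properties: Dict[str, str]) -> FakeDatabase:
    for key, value in properties.items():
        if key == "comment":
            database.description = value
        elif key == "comment":  # bug: unreachable, the location is never set
            database.description = value
    return database


def location_is_propagated(annotate: Callable[[FakeDatabase, Dict[str, str]], FakeDatabase]) -> bool:
    """Return True only if both the comment and the location reach the database."""
    database = annotate(
        FakeDatabase(name="default"),
        {"comment": "test db", "location": "s3://bucket/default.db"},
    )
    return database.description == "test db" and database.locationUri == "s3://bucket/default.db"
```

Asserting on `locationUri` as well as `description` is the important part: a test that only checks the description passes against the buggy version.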


Fokko · 2022-08-02T22:59:16Z

@rdblue Not a problem at all, I've fixed the conflicts.

#5391 (comment) is still open, but we can also handle it in a backport.

rdblue

+1 when tests are passing!

rdblue · 2022-08-02T23:50:24Z

Merged! Awesome work, @Fokko!

Fokko added 2 commits July 28, 2022 22:21

80f72a5 First version

d65a6dd Working on testing

github-actions Bot added docs python labels Jul 29, 2022