Python: Add Hive Catalog#5391
Conversation
| def create_namespace(self, namespace: Union[str, Identifier], properties: Optional[Properties] = None) -> None: | ||
| database_name = self.identifier_to_tuple(namespace)[0] | ||
|
|
||
| hive_database = HiveDatabase(name=database_name, parameters=properties) |
There was a problem hiding this comment.
This should handle the location URI and description like the Java HiveCatalog does.
There was a problem hiding this comment.
Nice, wasn't aware of that!
There was a problem hiding this comment.
Hmm, this also sets the location:
private String getWarehouseLocation() {
String warehouseLocation = conf.get(HiveConf.ConfVars.METASTOREWAREHOUSE.varname);
Preconditions.checkNotNull(
warehouseLocation, "Warehouse location is not set: hive.metastore.warehouse.dir=null");
return warehouseLocation;
}We don't have the HiveContext in Python.
There was a problem hiding this comment.
Warehouse location is used to create default database paths. But if a database has a specific path we should set it in the metastore like Java does.
It's okay that we don't have a config that can tell us the warehouse location. Iceberg has a standard catalog configuration property, warehouse, that we use across catalogs. We should use that here as well.
| def _columns(self, schema: Schema) -> List[FieldSchema]: | ||
| return [FieldSchema(field.name, self._convert_hive_type(field.field_type), field.doc) for field in schema.fields] | ||
|
|
||
| def _convert_hive_type(self, col_type: IcebergType) -> str: |
There was a problem hiding this comment.
I think we'll want a real implementation for this using a type visitor, since the map is very basic.
There was a problem hiding this comment.
Since this is part of the write path, I'd postpone this until then. The visitor will be simple, but we need to test it thorough :)
|
|
||
| @abstractmethod | ||
| def list_tables(self, namespace: str | Identifier | None = None) -> list[Identifier]: | ||
| def list_tables(self, namespace: str | Identifier) -> list[Identifier]: |
There was a problem hiding this comment.
I think that having None here was correct. You can list tables in the root namespace.
There was a problem hiding this comment.
Technically I understand it is possible, but this isn't the case for Hive and also not for the REST catalog (namespace is required): https://github.com/apache/iceberg/blob/master/open-api/rest-catalog-open-api.yaml#L401
Therefore I decided to remove it for now. WDYT?
There was a problem hiding this comment.
This isn't a change I would mix in with the current PR, and I probably wouldn't bother making it anyway. The Catalog contract does should allow this, so I don't see a reason to remove it from the contract documentation just because we don't have an implementation in the Python code yet.
|
Looking great! Just a few more things and I think this will be ready to go. Also, did you see my comment about removing the CLI utilities? I don't think that we should include those in the vendored code. |
3d47248 to
6310e66
Compare
6310e66 to
1b656e5
Compare
| if key == "comment": | ||
| database.description = value | ||
| elif key == "comment": | ||
| database.description = value |
There was a problem hiding this comment.
I think you probably intended to update this to location and set the location URI?
|
@rdblue Not a problem at all, I've fixed the conflicts. #5391 (comment) is still open, but we can also handle it in a backport. |
rdblue
left a comment
There was a problem hiding this comment.
+1 when tests are passing!
|
Merged! Awesome work, @Fokko! |
Adds a hive catalog with the basic operations. Still some parts are stubbed, such as:
This PR includes adding two vendored packages, which are compiled from thrift definitions, instructions can be found in python/vendor/README.md.