[C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata #31723

Description

@asfimport

Context: I ran into this issue when reading Parquet files created by GDAL (using the Arrow C++ APIs, OSGeo/gdal#5477), which writes files that have custom key_value_metadata, but without storing ARROW:schema in those metadata (cc @paleolimbot)

When both reading and writing files, I expected that we would map Arrow Schema::metadata to Parquet FileMetaData::key_value_metadata. But apparently this doesn't (always) happen out of the box; it only happens through the "ARROW:schema" field (which stores the original Arrow schema, and thus the metadata stored in that schema).

For example, when writing a Table with schema metadata, the metadata is not always stored directly in the Parquet FileMetaData (the code below uses the branch from ARROW-16337, which adds the store_schema keyword):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
pq.write_table(table, "test_metadata_without_arrow_schema.parquet", store_schema=False)

# original schema has metadata
>>> table.schema
a: int64
-- schema metadata --
key: 'value'

# reading back only has the metadata in case we stored ARROW:schema
>>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
a: int64
-- schema metadata --
key: 'value'
# and not if ARROW:schema is absent
>>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
a: int64

It seems that if we store the ARROW:schema, we also store the schema metadata separately. But if store_schema is False, we also stop writing that metadata, which explains the output above (I'm not sure whether this is the intended behaviour):

# when storing the ARROW:schema, we ALSO store key:value metadata
>>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
{b'ARROW:schema': b'/////7AAAAAQAAAAAAAKAA4ABgAFAA...',
 b'key': b'value'}
# when not storing the schema, we also don't store the key:value
>>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata is None
True
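To check which metadata is embedded in the serialized schema itself, the "ARROW:schema" value can be decoded by hand. The following is only a sketch: it assumes the value is a base64-encoded Arrow IPC schema message (the leading '/////' in the output above is consistent with that), and the helper names extract_serialized_schema and embedded_schema_metadata are made up for this illustration, not pyarrow APIs.

```python
import base64

ARROW_SCHEMA_KEY = b"ARROW:schema"


def extract_serialized_schema(file_metadata):
    """Return the decoded bytes stored under ARROW:schema, or None if absent.

    `file_metadata` is the dict returned by pq.read_metadata(path).metadata
    (which may be None when the file has no key/value metadata).
    """
    encoded = (file_metadata or {}).get(ARROW_SCHEMA_KEY)
    if encoded is None:
        return None
    # Assumption: the value is base64-encoded, as the output above suggests.
    return base64.b64decode(encoded)


def embedded_schema_metadata(path):
    """Return the metadata of the Arrow schema embedded in a Parquet file."""
    # Imported lazily so the pure helper above works without pyarrow.
    import pyarrow as pa
    import pyarrow.parquet as pq

    raw = extract_serialized_schema(pq.read_metadata(path).metadata)
    if raw is None:
        return None
    # Parse the serialized IPC schema message and return its metadata.
    return pa.ipc.read_schema(pa.py_buffer(raw)).metadata
```

Comparing the result of embedded_schema_metadata(path) with pq.read_metadata(path).metadata shows which keys exist only at the Parquet file level.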

On the reading side, it seems that we generally do read custom key/value metadata into schema metadata. We don't have the pyarrow APIs at the moment to create such a file (given the above), but with a small patch I could create such a file:

# a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
>>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
{b'key': b'value'}

# this metadata is now correctly mapped to the Arrow schema metadata
>>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
a: int64
-- schema metadata --
key: 'value'

But if you have a file that has both custom key/value metadata and an "ARROW:schema" key, we actually ignore the custom keys, and only look at the "ARROW:schema" one.
This is the case I ran into with GDAL: the file has both keys, but the custom "geo" key is not included in the Arrow schema serialized under the "ARROW:schema" key:

# includes both keys in the Parquet file
>>> pq.read_metadata("test_gdal.parquet").metadata
{b'geo': b'{"version":"0.1.0","...',
 b'ARROW:schema': b'/////3gBAAAQ...'}
# the "geo" key is lost in the Arrow schema
>>> pq.read_table("test_gdal.parquet").schema.metadata is None
True
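Until this is addressed, a user-side workaround is to merge the file-level key/value metadata into the schema metadata after reading. This is a minimal sketch, not a proposed fix for the C++ reader: the function names are hypothetical, and letting keys already present in the Arrow schema take precedence is an assumption made for this example.

```python
ARROW_SCHEMA_KEY = b"ARROW:schema"


def merge_file_metadata(schema_metadata, file_metadata):
    """Merge Parquet key/value metadata into Arrow schema metadata.

    Keys already present in the schema metadata win (an assumption of this
    sketch); the ARROW:schema entry itself is never copied over.
    """
    merged = dict(schema_metadata or {})
    for key, value in (file_metadata or {}).items():
        if key == ARROW_SCHEMA_KEY:
            continue
        merged.setdefault(key, value)
    return merged


def read_table_with_file_metadata(path):
    """Read a Parquet file and attach custom file-level keys to the schema."""
    # Imported lazily so the pure helper above works without pyarrow.
    import pyarrow.parquet as pq

    table = pq.read_table(path)
    file_metadata = pq.read_metadata(path).metadata
    merged = merge_file_metadata(table.schema.metadata, file_metadata)
    # Pass None rather than an empty dict when there is nothing to attach.
    return table.replace_schema_metadata(merged or None)
```

With the GDAL file above, read_table_with_file_metadata("test_gdal.parquet").schema.metadata would then be expected to contain the b'geo' key.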

Reporter: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-16339. Please see the migration documentation for further details.
