Skip to content

writing to partitioned table uses the wrong column as partition key #7892

@alamb

Description

@alamb

Describe the bug

writing to partitioned table uses the wrong column as partition key

To Reproduce

❯ create external table test(partition varchar, trace_id varchar) stored as parquet partitioned by (partition_id) location '/tmp/test'  options (create_local_path 'true');
Arrow error: Schema error: Unable to get field named "partition_id". Valid fields: ["partition", "trace_id"]
❯ create external table test(partition varchar, trace_id varchar) stored as parquet partitioned by (partition) location '/tmp/test'  options (create_local_path 'true');
0 rows in set. Query took 0.001 seconds.

❯ insert into test select * from 'input.parquet';
Object Store error: Generic LocalFileSystem error: Unable to open file /private/tmp/test/partition=00000000000000003088e9e74cf166bd/QZSOEmePoAsQkvzU.parquet#1: Too many open files (os error 24)

It looks like it used the wrong column as the partition key:

alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ ls /tmp/test | wc -l
   19992

alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ ls /tmp/test | head
partition=0000000000000000000102576ce2faea/
partition=00000000000000000004d8eb49424f0d/
partition=000000000000000000051e89839e2bb0/
partition=00000000000000000005406b87ebcb41/
partition=000000000000000000065226d41eba99/
partition=000000000000000000066e556bea9c68/
partition=0000000000000000000688030531c6ff/
partition=000000000000000000068839bb143b45/
partition=00000000000000000006aaa220390696/
partition=00000000000000000006b0ebabd460a3/

Here is the input: input.parquet

Here is how I made it:

❯ copy (select 'a' as "partition", trace_id from traces UNION ALL select 'b' as "partition", trace_id from traces UNION ALL select 'c' as "partition", trace_id from traces) to 'input.parquet';
+----------+
| count    |
+----------+
| 15557151 |
+----------+
1 row in set. Query took 3.639 seconds.

❯ select * from 'input.parquet' limit 1;
+-----------+----------------------------------+
| partition | trace_id                         |
+-----------+----------------------------------+
| b         | 000000000000000028bf4438cad62275 |
+-----------+----------------------------------+
1 row in set. Query took 0.009 seconds.

❯ describe 'input.parquet';
+-------------+-------------------------+-------------+
| column_name | data_type               | is_nullable |
+-------------+-------------------------+-------------+
| partition   | Utf8                    | NO          |
| trace_id    | Dictionary(Int32, Utf8) | YES         |
+-------------+-------------------------+-------------+
2 rows in set. Query took 0.001 seconds.

Expected behavior

I expect a three output directories to be created, for each of the distinct values of partition_id, a, b and c

/tmp/test/partition_id=a/<uuid>.parquet
/tmp/test/partition_id=b/<uuid>.parquet
/tmp/test/partition_id=c/<uuid>.parquet

Additional context

Found as part of #7801 (review)

@devinjdangelo did some initial investigation: #7801 (comment)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions