-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
writing to partitioned table uses the wrong column as partition key
To Reproduce
❯ create external table test(partition varchar, trace_id varchar) stored as parquet partitioned by (partition_id) location '/tmp/test' options (create_local_path 'true');
Arrow error: Schema error: Unable to get field named "partition_id". Valid fields: ["partition", "trace_id"]
❯ create external table test(partition varchar, trace_id varchar) stored as parquet partitioned by (partition) location '/tmp/test' options (create_local_path 'true');
0 rows in set. Query took 0.001 seconds.
❯ insert into test select * from 'input.parquet';
Object Store error: Generic LocalFileSystem error: Unable to open file /private/tmp/test/partition=00000000000000003088e9e74cf166bd/QZSOEmePoAsQkvzU.parquet#1: Too many open files (os error 24)
❯It looks like it used the wrong column as the partition key:
alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ ls /tmp/test | wc -l
19992
alamb@MacBook-Pro-8:~/Software/arrow-datafusion$ ls /tmp/test | head
partition=0000000000000000000102576ce2faea/
partition=00000000000000000004d8eb49424f0d/
partition=000000000000000000051e89839e2bb0/
partition=00000000000000000005406b87ebcb41/
partition=000000000000000000065226d41eba99/
partition=000000000000000000066e556bea9c68/
partition=0000000000000000000688030531c6ff/
partition=000000000000000000068839bb143b45/
partition=00000000000000000006aaa220390696/
partition=00000000000000000006b0ebabd460a3/Here is the input: input.parquet
Here is how I made it:
❯ copy (select 'a' as "partition", trace_id from traces UNION ALL select 'b' as "partition", trace_id from traces UNION ALL select 'c' as "partition", trace_id from traces) to 'input.parquet';
+----------+
| count |
+----------+
| 15557151 |
+----------+
1 row in set. Query took 3.639 seconds.
❯ select * from 'input.parquet' limit 1;
+-----------+----------------------------------+
| partition | trace_id |
+-----------+----------------------------------+
| b | 000000000000000028bf4438cad62275 |
+-----------+----------------------------------+
1 row in set. Query took 0.009 seconds.
❯ describe 'input.parquet';
+-------------+-------------------------+-------------+
| column_name | data_type | is_nullable |
+-------------+-------------------------+-------------+
| partition | Utf8 | NO |
| trace_id | Dictionary(Int32, Utf8) | YES |
+-------------+-------------------------+-------------+
2 rows in set. Query took 0.001 seconds.Expected behavior
I expect a three output directories to be created, for each of the distinct values of partition_id, a, b and c
/tmp/test/partition_id=a/<uuid>.parquet
/tmp/test/partition_id=b/<uuid>.parquet
/tmp/test/partition_id=c/<uuid>.parquet
Additional context
Found as part of #7801 (review)
@devinjdangelo did some initial investigation: #7801 (comment)
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working