-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: respect DataFrameWriteOptions::with_single_file_output for paths without extensions #19931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
ab87058
d4e67d3
e53f51f
bb71255
9731717
d30cf77
5824db2
e442ad3
e83fc8e
ba54292
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -23,7 +23,7 @@ use std::sync::Arc; | |
|
|
||
| use crate::datasource::file_format::file_type_to_format; | ||
| use crate::datasource::listing::ListingTableUrl; | ||
| use crate::datasource::physical_plan::FileSinkConfig; | ||
| use crate::datasource::physical_plan::{FileOutputMode, FileSinkConfig}; | ||
| use crate::datasource::{DefaultTableSource, source_as_provider}; | ||
| use crate::error::{DataFusionError, Result}; | ||
| use crate::execution::context::{ExecutionProps, SessionState}; | ||
|
|
@@ -549,8 +549,30 @@ impl DefaultPhysicalPlanner { | |
| } | ||
| }; | ||
|
|
||
| // Parse single_file_output option if explicitly set | ||
| let file_output_mode = match source_option_tuples | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wonder if it would be better to push this enum into
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I kept |
||
| .get("single_file_output") | ||
| .map(|v| v.trim()) | ||
| { | ||
| None => FileOutputMode::Automatic, | ||
| Some("true") => FileOutputMode::SingleFile, | ||
| Some("false") => FileOutputMode::Directory, | ||
| Some(value) => { | ||
| return Err(DataFusionError::Configuration(format!( | ||
| "provided value for 'single_file_output' was not recognized: \"{value}\"" | ||
| ))); | ||
| } | ||
| }; | ||
|
|
||
| // Filter out sink-related options that are not format options | ||
| let format_options: HashMap<String, String> = source_option_tuples | ||
|
Comment on lines
+567
to
+568
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the need for filtering the option here?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The |
||
| .iter() | ||
| .filter(|(k, _)| k.as_str() != "single_file_output") | ||
| .map(|(k, v)| (k.clone(), v.clone())) | ||
| .collect(); | ||
|
|
||
| let sink_format = file_type_to_format(file_type)? | ||
| .create(session_state, source_option_tuples)?; | ||
| .create(session_state, &format_options)?; | ||
|
|
||
| // Determine extension based on format extension and compression | ||
| let file_extension = match sink_format.compression_type() { | ||
|
|
@@ -571,6 +593,7 @@ impl DefaultPhysicalPlanner { | |
| insert_op: InsertOp::Append, | ||
| keep_partition_by_columns, | ||
| file_extension, | ||
| file_output_mode, | ||
| }; | ||
|
|
||
| let ordering = input_exec.properties().output_ordering().cloned(); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it need to be called .parquet even though the dataframe explicitly says
write_parquet? Or is this just to clean up the code?Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no file extension and there is no configuration (
Some(true)) then the heuristics decide that this is a folder and creates a Parquet partition file.If there is no config and the path has file extension then it writes all the content in this file (no partitions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually why is this change needed for the example? Is it fixing some behaviour made by this change or just clarifying the example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before this PR,
with_single_file_output(true)was silently ignored for paths without extensions it would create a directorycars_encrypted/with parquet files inside. Theread_parquet()call then found these files in the directory. After this PR,with_single_file_output(true)is correctly respected, creating a single file at the exact pathcars_encrypted. However,read_parquet()with default options expects a.parquetextension and fails validation. Adding .parquet to the filename fixes this and makes the example explicit about the expected file format.