Spark-3.5: Support CTAS and RTAS to preserve schema nullability. #10074
@amogh-jahagirdar @aokolnychyi @RussellSpitzer can you please review this? Thanks!
 * SELECT ... and creating the table. If false, fields' nullability will be preserved when
 * creating the table.
 */
private static final String TABLE_CREATE_NULLABLE_QUERY_SCHEMA = "use-nullable-query-schema";
I have mixed feelings about the name. On one hand, it is not very descriptive. On the other hand, it matches the Spark API. Let me think about it.
I'd be in favor of slight renaming of the variable name and removing the doc if the name is clear enough for better grouping. This is a private variable. We should add it to our docs, though.
private static final String USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS = "use-nullable-query-schema";
private static final boolean USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT = true;
private boolean useNullableQuerySchema = USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT;
Hm, if there's a pointer to where use-nullable-query-schema is in the Spark API that would be good to see (I couldn't find anything with searching). IMO something that mentions "preserve" would be a better verb rather than "use" as well as something that clarifies this applies for CTAS/RTAS, since it's a bit more clear that we are essentially preserving the nullability from the source query. So something like preserve-ctas-rtas-nullability feels a bit more direct.
Not super opinionated though, if there's something in Spark that's already following this naming then I'd agree to just follow that since it's less of a burden on a user to be aware of these different namings.
I was referring to this method that we have to overload.
/**
 * If true, mark all the fields of the query schema as nullable when executing
 * CREATE/REPLACE TABLE ... AS SELECT ... and creating the table.
 */
default boolean useNullableQuerySchema() {
  return true;
}
I agree the name is not very clear but I also don't know if there is a lot of value in deviating from Spark.
That method can be used in more use cases in the future so it is probably best to stick with what Spark calls it.
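To make the wiring discussed in this thread concrete, here is a minimal sketch of how a catalog might read the option and return it from the overridden method. The QuerySchemaAware interface and SketchCatalog class are hypothetical stand-ins, not the Spark or Iceberg classes. Only the constant names, the default value, and the default method body come from the snippets above.

```java
import java.util.Map;

// Hypothetical stand-in for the Spark interface quoted above; only the
// default method from that snippet is reproduced.
interface QuerySchemaAware {
  default boolean useNullableQuerySchema() {
    return true;
  }
}

// Hypothetical catalog that reads the option once, when it is initialized.
final class SketchCatalog implements QuerySchemaAware {
  private static final String USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS = "use-nullable-query-schema";
  private static final boolean USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT = true;

  private boolean useNullableQuerySchema = USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS_DEFAULT;

  // Assumption: catalog options arrive as a plain string-to-string map.
  void initialize(Map<String, String> options) {
    String value = options.get(USE_NULLABLE_QUERY_SCHEMA_CTAS_RTAS);
    if (value != null) {
      this.useNullableQuerySchema = Boolean.parseBoolean(value);
    }
  }

  @Override
  public boolean useNullableQuerySchema() {
    return useNullableQuerySchema;
  }
}
```

With no option set, the sketch keeps Spark's default of true. Passing "false" flips the override so that the query schema's nullability would be preserved.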
ImmutableMap.of(
    "type", "hive",
    "default-namespace", "default",
    "use-nullable-query-schema", "false")
Why always false? Don't we want to test both values?
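One way to cover both values, sketched here with plain JDK collections rather than the test harness used in this PR, is to build one catalog configuration per flag value and run the same assertions against each. CatalogConfigSketch and its methods are hypothetical names. The property keys are taken from the snippet above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CatalogConfigSketch {
  // Builds catalog properties like the ones in the snippet above for one flag value.
  static Map<String, String> catalogConfig(boolean useNullableQuerySchema) {
    Map<String, String> config = new LinkedHashMap<>();
    config.put("type", "hive");
    config.put("default-namespace", "default");
    config.put("use-nullable-query-schema", String.valueOf(useNullableQuerySchema));
    return config;
  }

  // One configuration per flag value, so a parameterized test can iterate over both.
  static List<Map<String, String>> bothConfigs() {
    return List.of(catalogConfig(true), catalogConfig(false));
  }
}
```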
aokolnychyi left a comment
This looks mostly good to me, a few minor comments.
@aokolnychyi @amogh-jahagirdar Thanks for reviewing, comments have been addressed, please take a look when you have time.
Thanks, @zhongyujiang! Thanks for reviewing, @amogh-jahagirdar!
This PR adds a new catalog parameter use-nullable-query-schema to control whether all fields are marked as nullable in CTAS and RTAS operations.
Currently, when using CTAS and RTAS to create tables, the fields of new tables are always marked as optional, even if their source fields are marked as required in the original table. By utilizing the parameter use-nullable-query-schema, we can control whether to preserve the nullability of fields when creating a new table using CTAS or RTAS.
Set use-nullable-query-schema to false to preserve the nullability of fields.
Related Spark PR: SPARK-43390
Close #7771
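The behavior described in the PR description can be illustrated with a small sketch. It assumes a simplified field model (a name plus a required flag). The Field class and the createdTableSchema helper are hypothetical and are not how Iceberg or Spark implement the conversion.

```java
import java.util.ArrayList;
import java.util.List;

final class NullabilitySketch {
  // Hypothetical minimal field model: a column name and whether it is required.
  static final class Field {
    final String name;
    final boolean required;

    Field(String name, boolean required) {
      this.name = name;
      this.required = required;
    }
  }

  // Returns the schema a CTAS/RTAS would create from the query schema.
  // With the flag set to true, every field becomes optional; with false,
  // each field keeps the nullability it had in the query schema.
  static List<Field> createdTableSchema(List<Field> querySchema, boolean useNullableQuerySchema) {
    List<Field> result = new ArrayList<>();
    for (Field field : querySchema) {
      boolean required = !useNullableQuerySchema && field.required;
      result.add(new Field(field.name, required));
    }
    return result;
  }
}
```

For a query schema with a required id and an optional data column, the flag set to true yields two optional fields, while false keeps id required.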