Skip to content

[SPARK-29508][SQL] Implicitly cast strings in datetime arithmetic operations#26165

Closed
MaxGekk wants to merge 6 commits into
apache:masterfrom
MaxGekk:string-to-interval-implicit-conv
Closed

[SPARK-29508][SQL] Implicitly cast strings in datetime arithmetic operations#26165
MaxGekk wants to merge 6 commits into
apache:masterfrom
MaxGekk:string-to-interval-implicit-conv

Conversation

@MaxGekk

@MaxGekk MaxGekk commented Oct 18, 2019

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

  • Cast strings to intervals in interval - string. For example:
spark-sql> select interval '1 month' - '1 day 2 hours';
interval 1 months -1 days -2 hours
  • Cast strings to intervals in datetime + string or string + datetime. For example:
spark-sql> select timestamp'today' + '1 month';
2019-11-18 00:00:00
  • Cast strings to timestamps in datetime - string or string - datetime. For example:
spark-sql> select date'today' - '2019-10-01';
interval 2 weeks 3 days

Why are the changes needed?

To improve user experience with Spark SQL

Does this PR introduce any user-facing change?

Yes, previously the operations fails with the errors:

spark-sql> select timestamp'today' + '1 month';
Error in query: cannot resolve '(TIMESTAMP('2019-10-18 00:00:00') + CAST('1 month' AS DOUBLE))' due to data type mismatch: differing types in '(TIMESTAMP('2019-10-18 00:00:00') + CAST('1 month' AS DOUBLE))' (timestamp and double).; line 1 pos 7;
'Project [unresolvedalias((1571346000000000 + cast(1 month as double)), None)]
+- OneRowRelation

How was this patch tested?

  • Add tests to TypeCoercionSuite to check rules
  • End-to-end tests in dateTimeOperations.sql

@MaxGekk

MaxGekk commented Oct 18, 2019

Copy link
Copy Markdown
Member Author

@wangyum Please, review this PR.

@SparkQA

SparkQA commented Oct 18, 2019

Copy link
Copy Markdown

Test build #112279 has finished for PR 26165 at commit 9a2ce13.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum

wangyum commented Oct 18, 2019

Copy link
Copy Markdown
Member

I tried to add this feature before: #25190
cc @maropu @dongjoon-hyun

@MaxGekk

MaxGekk commented Oct 18, 2019

Copy link
Copy Markdown
Member Author

I tried to add this feature before: #25190

Why was it closed? This casting is useful independently from PostgreSQL. I believe this improves user experience with Spark SQL. Don't think it is better to fail on casting a string to DOUBLE.

@SparkQA

SparkQA commented Oct 18, 2019

Copy link
Copy Markdown

Test build #112282 has finished for PR 26165 at commit 01a42a4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 18, 2019

Copy link
Copy Markdown

Test build #112290 has finished for PR 26165 at commit b001177.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum

wangyum commented Oct 18, 2019

Copy link
Copy Markdown
Member

@MaxGekk Please see @maropu's comment:
#25190 (comment)

@MaxGekk

MaxGekk commented Oct 19, 2019

Copy link
Copy Markdown
Member Author

Can we consider the feature out of the scope of compatibility with PostgreSQL?

@MaxGekk

MaxGekk commented Oct 21, 2019

Copy link
Copy Markdown
Member Author

@srowen @cloud-fan @HyukjinKwon WDYT?

@srowen

srowen commented Oct 21, 2019

Copy link
Copy Markdown
Member

I'm just trying to think if this causes any surprising behavior, either because some string is unintentionally cast to a date or interval type, or because it doesn't fully work in all cases (see the other ticket). How much do we need to support this? If it's not a standard behavior across DBs, I'd kind of prefer to have people write more explicit casts for such a thing. Do we cast strings to other types implicitly though?

@cloud-fan

Copy link
Copy Markdown
Contributor

Is it really a useful feature? Users can simply put the INTERVAL keyword before the string literal, like timestamp'today' + interval'1 month'

For timestamp_col + string_col, I don't think it's reasonable to implicitly cast string_col to interval. People may get null result silently and hide the mistake(they may pick a wrong column to add).

@MaxGekk

MaxGekk commented Oct 22, 2019

Copy link
Copy Markdown
Member Author

Users can simply put the INTERVAL keyword before the string literal ...

You are right, users can put it before literals but what about string column. Users have to explicitly cast it, correct?

People may get null result silently and hide the mistake(they may pick a wrong column to add).

Yeh, though the same argument is applicable to other implicit cast from strings, for example:

spark-sql> select 2 * '3';
6.0

Why does Spark implicitly cast '3' to a numeric. Maybe an user wants to repeat '3' two times, and get '33'?

Let's consider proposed implicit casts one by one:

  1. select interval '1 month' - <string>. What else could we subtract from interval besides of another interval. Don't think casting to interval is ambiguous, here.
  2. select timestamp'today' + <string>. What else could we add to a timestamp or date? Definitely, not another timestamp or date. Maybe we could add a number but what is the number - days, seconds or microseconds? Obviously, a timestamp can be adjusted by an interval.
  3. select date'today' - <string>. Here, I follow PostgreSQL which cast the string to datetime implicitly:
maxim=# select date'today' - '2019-10-01';
 ?column?
----------
       21
(1 row)

@maropu

maropu commented Oct 23, 2019

Copy link
Copy Markdown
Member

Yea, I'm still worried that this feature might cause unexpected results in complicated SQL queries as others suggested..., btw, any dbms-like systems support this implicit casts?

@cloud-fan

Copy link
Copy Markdown
Contributor

Why does Spark implicitly cast '3' to a numeric

To be honest I think this is a mistake as well. This kind of implicit cast is really risky as Spark returns null for invalid cast.

Even if invalid cast can throw runtime exception, we should still not allow this kind of implicit cast to be safe. FYI this is the result of pgsql

cloud0fan=# select 1 + s from t;
ERROR:  operator does not exist: integer + character varying
LINE 1: select 1 + s from t;
                 ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.

AFAIK pgsql(and some other DBs) only apply this kind of implicit cast for string literal. We'd either update the type coercion module to handle string literal specially, or not do it at all.

@MaxGekk

MaxGekk commented Oct 25, 2019

Copy link
Copy Markdown
Member Author

I am closing this PR.

@MaxGekk MaxGekk closed this Oct 25, 2019
@MaxGekk MaxGekk deleted the string-to-interval-implicit-conv branch June 5, 2020 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants