Skip to content

[SPARK-26540][SQL] Support PostgreSQL numeric arrays without precision/scale#23458

Closed
dongjoon-hyun wants to merge 2 commits into
apache:masterfrom
dongjoon-hyun:SPARK-26540-DECIMAL-ARRAY
Closed

[SPARK-26540][SQL] Support PostgreSQL numeric arrays without precision/scale#23458
dongjoon-hyun wants to merge 2 commits into
apache:masterfrom
dongjoon-hyun:SPARK-26540-DECIMAL-ARRAY

Conversation

@dongjoon-hyun

@dongjoon-hyun dongjoon-hyun commented Jan 5, 2019

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Currently, Spark cannot handle numeric[] type with decimal data because both precision and scale are considered as 0.

postgres=# CREATE TABLE t (v numeric[], d  numeric);
CREATE TABLE
postgres=# INSERT INTO t VALUES('{1111.222,2222.332}', 222.4555);
INSERT 0 1

BEFORE

scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
pgTable: org.apache.spark.sql.DataFrame = [v: array<decimal(0,0)>, d: decimal(38,18)]

scala> pgTable.show
19/01/04 18:07:38 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 0

AFTER

scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
pgTable: org.apache.spark.sql.DataFrame = [v: array<decimal(38,18)>, d: decimal(38,18)]

scala> pgTable.show()
+--------------------+--------------------+
|                   v|                   d|
+--------------------+--------------------+
|[1111.22200000000...|222.4555000000000...|
+--------------------+--------------------+

How was this patch tested?

Manual. Run the PostgresIntegrationSuite.

Closes #23456

@dongjoon-hyun

Copy link
Copy Markdown
Member Author

cc @maropu

@maropu

maropu commented Jan 5, 2019

Copy link
Copy Markdown
Member

Thanks, @dongjoon-hyun! I’ll check tonight (away from a keyboard now)

@maropu maropu left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior is the same with v1.6?

@dongjoon-hyun

Copy link
Copy Markdown
Member Author

Yes. Exactly. It uses decimal(38,18).

@maropu

maropu commented Jan 5, 2019

Copy link
Copy Markdown
Member

Also, did you check if PostgresIntegrationSuite passed in your env? It seems Jenkins doesnot run the integration tests before.

@dongjoon-hyun

Copy link
Copy Markdown
Member Author

Yes. I ran it inside IntelliJ manually.

@dongjoon-hyun

Copy link
Copy Markdown
Member Author

Also, cc @gatorsmile and @mgaido91 .

@SparkQA

SparkQA commented Jan 5, 2019

Copy link
Copy Markdown

Test build #100764 has finished for PR 23458 at commit 9e1b8ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 5, 2019

Copy link
Copy Markdown

Test build #100765 has finished for PR 23458 at commit 74215de.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun

Copy link
Copy Markdown
Member Author

Retest this please.

@SparkQA

SparkQA commented Jan 5, 2019

Copy link
Copy Markdown

Test build #100772 has finished for PR 23458 at commit 74215de.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun

Copy link
Copy Markdown
Member Author

Retest this please.

case "timestamp" | "timestamptz" | "time" | "timetz" => Some(TimestampType)
case "date" => Some(DateType)
case "numeric" | "decimal" => Some(DecimalType.bounded(precision, scale))
case "numeric" | "decimal" if precision != 0 || scale != 0 =>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to check scale != 0, too?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I see. Thanks.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a very important point, but I think we best use precision > 0 || scale > 0. I mean, there may happen that scale is negative. In that case, precision must be > 0. In the above code a numeric(0, -10) which is an invalid combination is parsed as DECIMAL(0, -10). Probably, we can rely that this case never happens, but I think having code robust also for error conditions is better.

@dongjoon-hyun dongjoon-hyun Jan 5, 2019

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. If we consider negative scale cases, we should change all of the existing code together to be consistent.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this. We can probably also do that in a followup PR.

@maropu

maropu commented Jan 5, 2019

Copy link
Copy Markdown
Member

LGTM except for one question.

@maropu

maropu commented Jan 5, 2019

Copy link
Copy Markdown
Member

Can you also add 'Closes #23456' in the description? I found the duplicate jira ticket.

@SparkQA

SparkQA commented Jan 5, 2019

Copy link
Copy Markdown

Test build #100777 has finished for PR 23458 at commit 74215de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun

Copy link
Copy Markdown
Member Author

Oh, I didn't notice that. I see. Thanks.

@dongjoon-hyun

Copy link
Copy Markdown
Member Author

Retest this please

@maropu

maropu commented Jan 5, 2019

Copy link
Copy Markdown
Member

my bad, I didn't notice the existing jira ticket, sorry.

@dongjoon-hyun

Copy link
Copy Markdown
Member Author

BTW, we need to backport this to older branches, right?

@maropu

maropu commented Jan 5, 2019

Copy link
Copy Markdown
Member

Yea, I think so. Since v1.6 accepts this query, this is a kind of regression?

@dongjoon-hyun

Copy link
Copy Markdown
Member Author

Yep. I agree.

@mgaido91 mgaido91 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just not sure why we have this PR duplicating #23456. Is that PR stale? It has been opened few hours ago, so I wouldn't consider it like that. Probably I am missing something. Thanks.

case "timestamp" | "timestamptz" | "time" | "timetz" => Some(TimestampType)
case "date" => Some(DateType)
case "numeric" | "decimal" => Some(DecimalType.bounded(precision, scale))
case "numeric" | "decimal" if precision != 0 || scale != 0 =>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a very important point, but I think we best use precision > 0 || scale > 0. I mean, there may happen that scale is negative. In that case, precision must be > 0. In the above code a numeric(0, -10) which is an invalid combination is parsed as DECIMAL(0, -10). Probably, we can rely that this case never happens, but I think having code robust also for error conditions is better.

@dongjoon-hyun

Copy link
Copy Markdown
Member Author

Actually, two JIRA issues are created in a short time and I received a call on SPARK-26540. So, I didn't check other JIRA and other PR.

@dongjoon-hyun

dongjoon-hyun commented Jan 5, 2019

Copy link
Copy Markdown
Member Author

If @mgaido91 has objection for this PR, I'll close this one. I'm okay. It's fair.
Just tell him to receive this code together.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-26540-DECIMAL-ARRAY branch January 5, 2019 15:45
@dongjoon-hyun

Copy link
Copy Markdown
Member Author

@maropu . I close this PR and JIRA issue together~

@mgaido91 mgaido91 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mine was just a question @dongjoon-hyun. I was wondering if the other PR had something wrong which I couldn't figure out. I'd really appreciate if you could provide your feedback and help there then. Thanks.

case "timestamp" | "timestamptz" | "time" | "timetz" => Some(TimestampType)
case "date" => Some(DateType)
case "numeric" | "decimal" => Some(DecimalType.bounded(precision, scale))
case "numeric" | "decimal" if precision != 0 || scale != 0 =>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this. We can probably also do that in a followup PR.

@SparkQA

SparkQA commented Jan 5, 2019

Copy link
Copy Markdown

Test build #100783 has finished for PR 23458 at commit 74215de.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants