[SPARK-22814][SQL] Support Date/Timestamp in a JDBC partition column#21834
maropu wants to merge 4 commits into apache:master
Conversation
Test build #93381 has finished for PR 21834 at commit
ping
```
case LogicalRelation(JDBCRelation(_, parts, _), _, _, _) =>
  val whereClauses = parts.map(_.asInstanceOf[JDBCPartition].whereClause).toSet
  assert(whereClauses === Set(
    """"T" < '2018-07-15 20:50:32.5' or "T" is null""",
```
Add a test case to OracleIntegrationSuite.scala? I am afraid this does not work.
@maropu Do we have a log message that lets users see the generated where clauses? If not, could you add one?
Currently, no. Is it ok that the log level is
Test build #93532 has finished for PR 21834 at commit
retest this please
Test build #93537 has finished for PR 21834 at commit
Test build #93542 has finished for PR 21834 at commit
My only concern is the time zone issues. However, they will not affect correctness since the logical partitioning will cover the whole range anyway. The worst case is that the partitioning might not start and end at the expected boundaries due to local/remote time zone differences, if any exist. Since this is a new feature, we can address these issues and improve the solution in the following releases.
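A minimal, self-contained sketch (plain JVM code, no Spark; the class and method names are illustrative, not Spark internals) of why the local time zone can shift the boundaries: `Timestamp.valueOf` interprets the bound string in the JVM's default time zone, so the same literal maps to different epoch instants on hosts configured with different zones.

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class TimeZoneBoundSketch {
    // Parse the same bound string under a given default time zone and
    // return the resulting epoch millis, restoring the zone afterwards.
    static long epochMillisIn(String zoneId, String bound) {
        TimeZone saved = TimeZone.getDefault();
        try {
            TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
            return Timestamp.valueOf(bound).getTime();
        } finally {
            TimeZone.setDefault(saved);
        }
    }

    public static void main(String[] args) {
        long utc = epochMillisIn("UTC", "2018-07-15 20:50:32");
        long jst = epochMillisIn("Asia/Tokyo", "2018-07-15 20:50:32");
        // Tokyo is UTC+9, so the same literal is nine hours earlier on the epoch line.
        System.out.println(utc - jst); // 32400000
    }
}
```

This is why the partitioning might not start and end exactly where a user expects when the driver's zone differs from the database's, even though the generated clauses still cover the whole range.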
retest this please
LGTM pending tests
Test build #93778 has finished for PR 21834 at commit
Thanks! Merged to master.
Thanks for the merge!
## What changes were proposed in this pull request?
This PR supports Date/Timestamp in a JDBC partition column (currently, only numeric columns are supported in master). It also modifies the code to verify the partition column type:
```
val jdbcTable = spark.read
.option("partitionColumn", "text")
.option("lowerBound", "aaa")
.option("upperBound", "zzz")
.option("numPartitions", 2)
.jdbc("jdbc:postgresql:postgres", "t", options)
// with this pr
org.apache.spark.sql.AnalysisException: Partition column type should be numeric, date, or timestamp, but string found.;
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.verifyAndGetNormalizedPartitionColumn(JDBCRelation.scala:165)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:85)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:317)
// without this pr
java.lang.NumberFormatException: For input string: "aaa"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:277)
```
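For context, here is a hedged sketch of the equal-width stride logic this feature builds on. The names are illustrative only, not Spark's actual `JDBCRelation` code: once the bounds parse to epoch millis, the range is split into `numPartitions` equal strides, which is the shape behind the generated WHERE clauses.

```java
import java.sql.Timestamp;

public class TimestampStrideSketch {
    // Split [lower, upper) into numPartitions equal-width timestamp strides.
    // The last stride absorbs any remainder so the full range is covered.
    static Timestamp[][] strides(Timestamp lower, Timestamp upper, int numPartitions) {
        long lo = lower.getTime();
        long hi = upper.getTime();
        long stride = (hi - lo) / numPartitions;
        Timestamp[][] out = new Timestamp[numPartitions][2];
        for (int i = 0; i < numPartitions; i++) {
            long start = lo + i * stride;
            long end = (i == numPartitions - 1) ? hi : start + stride;
            out[i][0] = new Timestamp(start);
            out[i][1] = new Timestamp(end);
        }
        return out;
    }

    public static void main(String[] args) {
        Timestamp[][] parts = strides(
            Timestamp.valueOf("2018-01-01 00:00:00"),
            Timestamp.valueOf("2018-01-05 00:00:00"), 4);
        for (Timestamp[] p : parts) {
            // Roughly the shape of the generated per-partition predicates.
            System.out.println("\"T\" >= '" + p[0] + "' AND \"T\" < '" + p[1] + "'");
        }
    }
}
```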
Closes apache#19999
## How was this patch tested?
Added tests in `JDBCSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes apache#21834 from maropu/SPARK-22814.
```
private def toInternalBoundValue(value: String, columnType: DataType): Long = columnType match {
  case _: NumericType => value.toLong
  case DateType => DateTimeUtils.fromJavaDate(Date.valueOf(value)).toLong
  case TimestampType => DateTimeUtils.fromJavaTimestamp(Timestamp.valueOf(value))
```
@maropu The `Timestamp.valueOf` method expects timestamps in the format `yyyy-[m]m-[d]d hh:mm:ss[.f...]`. Was it selected intentionally, or just because it is the default pattern for `Timestamp`?
For example, it cannot parse time zones as `Cast` can:
```
Timestamp.valueOf("1973-02-27 02:30:00.102030+01:00")
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
  at java.sql.Timestamp.valueOf(Timestamp.java:251)
```
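To make the limitation concrete, a small self-contained sketch (the helper names are hypothetical): `Timestamp.valueOf` rejects a trailing zone offset, while `java.time.OffsetDateTime` can parse the offset form and convert back to a `java.sql.Timestamp`.

```java
import java.sql.Timestamp;
import java.time.OffsetDateTime;

public class TimestampParseSketch {
    // True if Timestamp.valueOf rejects the string with IllegalArgumentException.
    static boolean rejectsOffset(String s) {
        try {
            Timestamp.valueOf(s);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    // One way to accept the offset form: parse with java.time, then convert.
    static Timestamp parseWithOffset(String isoWithOffset) {
        return Timestamp.from(OffsetDateTime.parse(isoWithOffset).toInstant());
    }

    public static void main(String[] args) {
        System.out.println(rejectsOffset("1973-02-27 02:30:00.102030"));        // false: plain form parses
        System.out.println(rejectsOffset("1973-02-27 02:30:00.102030+01:00")); // true: offset form rejected
        System.out.println(parseWithOffset("1973-02-27T02:30:00.102030+01:00"));
    }
}
```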
I think it'd be nice to accept more general patterns for date/timestamp. Can you make a PR to fix that?
Yes, I am going to prepare a PR for that.
@gatorsmile @maropu this currently does not work with pyspark due to this line: spark/python/pyspark/sql/readwriter.py, lines 563 to 564 in d9bcacf. It tries to convert
The resulting traceback is:
I think just removing the
EDIT: looks like the pyspark docs explicitly say to use an integer column at the moment.
When I chose INSERTION_DATE as my partitionColumn with the below dates, I get the error: ORA-01861: literal does not match format string
@shatestest
@MaxGekk thanks for the reply, but I get the same error with `.option("dateFormat", "yyyy-MM-dd")`.
@SparkQA, may I know how you tested partitionColumn for date/timestamp? Any sample test cases?
There are tests in the PR, see https://github.com/apache/spark/pull/21834/files#diff-5e0cadf526662f9281aa26315b3750adR442. @shatestest Could you open a JIRA ticket and provide a reproducible example, please?
If anyone finds themselves here looking to do this in pyspark, until support for this feature is added, here is a workaround:
```
import datetime
from itertools import tee

def date_range(start, end, intv):
    diff = (end - start) / intv
    for i in range(intv):
        yield start + diff * i
    yield end

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

partition_column = 'mypartitioncol'
now = datetime.datetime.now(datetime.timezone.utc)
num_partitions = sc.defaultParallelism
lower_bound = now + datetime.timedelta(-30)
upper_bound = now

predicates = []
for start, end in pairwise(date_range(lower_bound, upper_bound, num_partitions)):
    predicates.append(f'{partition_column} >= \'{start.isoformat()}\' AND {partition_column} < \'{end.isoformat()}\'')

df = (spark.read.jdbc(
    url='myjdbcuri',
    table='mytable',
    predicates=predicates,
    properties={'driver': 'org.postgresql.Driver'}))
```
Associated JIRA comment.
@MaxGekk sorry for the delay. I raised a ticket: https://issues.apache.org/jira/browse/SPARK-27723. Please let me know if you need any more info.