[SPARK-28741][SQL]Optional mode: throw exceptions when casting to integers causes overflow#25461
[SPARK-28741][SQL]Optional mode: throw exceptions when casting to integers causes overflow#25461gengliangwang wants to merge 24 commits into
Conversation
|
Test build #109151 has finished for PR 25461 at commit
|
|
retest this please. |
|
retest this please |
|
Test build #109156 has finished for PR 25461 at commit
|
|
Test build #109178 has finished for PR 25461 at commit
|
|
retest this please. |
|
Also cc @cloud-fan @mgaido91 |
|
Test build #109196 has started for PR 25461 at commit |
|
Test build #109193 has finished for PR 25461 at commit
|
|
|
||
| private[this] def castDecimalToIntegerCode( | ||
| ctx: CodegenContext, | ||
| intType: String): CastFunction = { |
There was a problem hiding this comment.
intType? Do you mean inType?
There was a problem hiding this comment.
It's actually integerType. But the full name makes some code longer than 100 characters.
I can change it if you think it is misleading.
| case x: NumericType if failOnIntegerOverflow => | ||
| b => | ||
| val intValue = try { | ||
| x.exactNumeric.asInstanceOf[Numeric[Any]].toInt(b) |
There was a problem hiding this comment.
Why do you cast it into int once?
There was a problem hiding this comment.
The trait Numeric doesn't have the method toInt. Before this code change, the value is also casted to int.
There was a problem hiding this comment.
We cannot check the valid value range in a single place instead of the current two checks in line 520 and 525?
There was a problem hiding this comment.
Well, we can do it by match it case by case. Then the code is a bit long. Casting to short/byte should be minor usage. Also, The previous code also cast to Int before cast to Short.
| checkEvaluation(cast(Literal(value, TimestampType), LongType), | ||
| Math.floorDiv(value, MICROS_PER_SECOND)) | ||
| } | ||
| checkEvaluation(cast(9223372036854775807.9f, LongType), 9223372036854775807L) |
There was a problem hiding this comment.
How about doing boundary tests like this?
checkEvaluation(cast(java.lang.Math.nextDown(9223372036854775807.9f), LongType), 9223372036854775807L)
--> non-overflow case
checkEvaluation(cast(java.lang.Math.nextUp(9223372036854775807.9f), LongType), 9223372036854775807L)
--> overflow case
There was a problem hiding this comment.
Why would we do that?
scala> java.lang.Math.nextDown(9223372036854775807.9D) < 9223372036854775807.9D
res23: Boolean = true
There was a problem hiding this comment.
Ah, its ok to do it like this instead;
checkEvaluation(cast(9223372036854775807.9f, LongType), 9223372036854775807L)
--> non-overflow case
checkEvaluation(cast(java.lang.Math.nextUp(9223372036854775807.9f), LongType), 9223372036854775807L)
--> overflow case
What I'm a little worried about is that 9223372036854775807.9f is implicitly truncated (to 9223372036854776000.0f?) by a compiler because it cannot be packed in the float IEEE754 format as you said before. So, IIUC the test is actually the same with cast(9223372036854776000.0f, LongType)?
What I understand is as follows(sorted by values desc) and is this correct?
IEEE754 continuous float values
------------------------------------------
overflow case: 9223373136366404000.0f <--- Math.nextUp(9223372036854775807.9f)
non-overflow case: 9223372036854776000.0f <--- 9223372036854775807.9f
non-overflow case: 9223371487098961900.0f <--- Math.nextDown(9223372036854775807.9f)
|
Test build #109297 has finished for PR 25461 at commit
|
|
retest this please. |
|
Test build #109300 has finished for PR 25461 at commit
|
|
|
||
| test("Cast to byte with option FAIL_ON_INTEGER_OVERFLOW enabled") { | ||
| withSQLConf(SQLConf.FAIL_ON_INTEGER_OVERFLOW.key -> "true") { | ||
| testIntMaxAndMin(ByteType) |
There was a problem hiding this comment.
why we need to test cast int.max +1 to byte? I think it's good enough to test cast byte.max +1 to byte
There was a problem hiding this comment.
I think it is always good to have more test cases here if it doesn't increase the testing time by a few seconds.
For example,
If casting double to byte is implemented as:
val x = doubleValue.toShort
if (x.toByte == x) {
x.toByte
} else {
throw new ...
}
We can find that it is wrong with this test case, because
(Int.MaxValue+1.0).toShort.toByte == (Int.MaxValue+1.0).toShort
is true.
|
|
||
| test("Cast to short with option FAIL_ON_INTEGER_OVERFLOW enabled") { | ||
| withSQLConf(SQLConf.FAIL_ON_INTEGER_OVERFLOW.key -> "true") { | ||
| testIntMaxAndMin(ShortType) |
| } | ||
| checkEvaluation(cast(9223372036854775807.9f, LongType), 9223372036854775807L) | ||
| checkEvaluation(cast(-9223372036854775808.9f, LongType), -9223372036854775808L) | ||
| checkEvaluation(cast(9223372036854775807.9D, LongType), 9223372036854775807L) |
There was a problem hiding this comment.
how about checkEvaluation(cast(0.9D + Long.Max, LongType), Long.Max)?
|
LGTM except a few minor comments |
| buildCast[Boolean](_, b => if (b) 1 else 0) | ||
| case DateType => | ||
| buildCast[Int](_, d => null) | ||
| case TimestampType if failOnIntegerOverflow => |
There was a problem hiding this comment.
do we really need this? AFAIK a timestamp cannot overflow, can it?
There was a problem hiding this comment.
It is possible in theory.
| (c, evPrim, evNull) => code"$evPrim = $c != 0;" | ||
| } | ||
|
|
||
| private[this] def castTimestampToIntegerCode( |
There was a problem hiding this comment.
| private[this] def castTimestampToIntegerCode( | |
| private[this] def castTimestampToIntegralCode( |
There was a problem hiding this comment.
Integral is adjective. "Integral code" seems weird to me.
We can call it castTimestampToIntegralTypeCode if you insist. I didn't use it because the name is a bit long.
There was a problem hiding this comment.
well, Integer is a specific data type, so I think this name is misleading...your suggested one is fine to me
| (c, evPrim, evNull) => code"$evPrim = $c.to${integralType.capitalize}($failOnIntegerOverflow);" | ||
| } | ||
|
|
||
| private[this] def castIntegerToIntegerExactCode(integralType: String): CastFunction = { |
There was a problem hiding this comment.
| private[this] def castIntegerToIntegerExactCode(integralType: String): CastFunction = { | |
| private[this] def castIntegerToIntegralExactCode(integralType: String): CastFunction = { |
There was a problem hiding this comment.
(oh, I didn't know this functionality, ```suggestion, cool)
| (min.toString + typeIndicator, max.toString + typeIndicator) | ||
| } | ||
|
|
||
| private[this] def castFractionToIntegerExactCode( |
There was a problem hiding this comment.
| private[this] def castFractionToIntegerExactCode( | |
| private[this] def castFractionToIntegralExactCode( |
| * @throws ArithmeticException if checkOverflow is true and | ||
| * the decimal too big to fit in Long type. | ||
| */ | ||
| def toLong(checkOverflow: Boolean): Long = { |
There was a problem hiding this comment.
a lot of duplicated code doing this and an additional function call which may be avoided. Can't we just add a boolean with a default value to the existing functions?
There was a problem hiding this comment.
Then all the other function calls toLong should be at least add parentheses as toLong(). Since Scala only allows Arity-0 https://docs.scala-lang.org/style/method-invocation.html#arity-0 to omit parentheses.
My two concerns here:
- The existing external code calling
Decimal.toLongwill fail - The usage will be different from the trait
Numeric
There was a problem hiding this comment.
BTW, renaming to toLongExact won't accurate either. For example, toLongExact(1.1) should be false, while we are actually doing rounding in toLong.
There was a problem hiding this comment.
for 2, anyway, the usage is already different and here we're not in a Numeric-like class. On 1, I am not sure it is a problem. Decimal is Unstable and this patch will go in Spark 3.0, so it is a major release (best place where to put a breaking change!). And the benefit to avoid extra function calls and a lot of duplicated code is worth the change IMHO.
cc @cloud-fan @maropu for their opinion on this too, but I feel quite strong about this.
There was a problem hiding this comment.
Actually, I am not super comfortable with the code changes in Decimal.scala here.
I did this to address comments in #25461 (comment) .
How about just adding a new method roundToLong. The name is same as
https://github.com/google/guava/blob/master/guava/src/com/google/common/math/DoubleMath.java#L156 . So that we can leave toLong as it is.
There was a problem hiding this comment.
We can't break public classes just to make it easier to write code in Spark.
There was a problem hiding this comment.
I think we can remove the additional function call by codegen. We can remove def toLong(checkOverflow: Boolean): Long, and in the codegen:
if (nullOnOverFlow) code"decimal.toLong" else code"decimal.roundToLong"
There was a problem hiding this comment.
+1 for @cloud-fan 's suggestion. At least we remove the extra method call.
| import scala.math.Numeric._ | ||
| import scala.math.Ordering | ||
|
|
||
| import org.apache.spark.sql.types.Decimal.DecimalIsConflicted |
There was a problem hiding this comment.
nit: I am wondering about moving it here...
There was a problem hiding this comment.
Do you mean moving DecimalIsConflicted into numerics.scala? I think it is fine keeping it in Decimal.scala for now.
There was a problem hiding this comment.
Yes, I meant that. Not a big deal, just to have everything colocated. We can also do in another PR
|
Test build #109596 has finished for PR 25461 at commit
|
|
Test build #109601 has finished for PR 25461 at commit
|
| buildConf("spark.sql.arithmeticOperations.failOnOverFlow") | ||
| .doc("If it is set to true, all arithmetic operations on non-decimal fields throw an " + | ||
| val FAIL_ON_INTEGER_OVERFLOW = | ||
| buildConf("spark.sql.failOnIntegerOverFlow") |
There was a problem hiding this comment.
failOnIntegerOverFlow -> failOnIntegralTypeOverFlow? To me, Integer is a bit ambiguous.
|
|
||
| private lazy val dateFormatter = DateFormatter() | ||
| private lazy val timestampFormatter = TimestampFormatter.getFractionFormatter(zoneId) | ||
| private val failOnIntegerOverflow = SQLConf.get.failOnIntegralTypeOverflow |
There was a problem hiding this comment.
to be consistent, shall we also rename it to failOnIntegralTypeOverflow?
maropu
left a comment
There was a problem hiding this comment.
Looks nice and I have no comment now except for the current left ones.
|
Test build #109617 has finished for PR 25461 at commit
|
|
Test build #109628 has finished for PR 25461 at commit
|
|
Test build #109635 has finished for PR 25461 at commit
|
|
Test build #109636 has finished for PR 25461 at commit
|
|
thanks, merging to master! |
|
@cloud-fan @maropu @mgaido91 Thanks for the review! |
What changes were proposed in this pull request?
To follow ANSI SQL, we should support a configurable mode that throws exceptions when casting to integers causes overflow.
The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, which throws exceptions on arithmetical operation overflow.
To unify it, the configuration is renamed from "spark.sql.arithmeticOperations.failOnOverFlow" to "spark.sql.failOnIntegerOverFlow"
How was this patch tested?
Unit test