native_datafusion: silent wrong-answer paths for decimal-to-decimal precision/scale narrowing Spark rejects #4343

@andygrove

Description

native_datafusion silently accepts decimal-to-decimal Parquet reads where the requested read type narrows the precision or scale below what is needed to represent the file's values. Spark's vectorized reader rejects these conversions with SchemaColumnConvertNotSupportedException because the file values cannot be safely represented in the requested type. native_datafusion instead returns wrong (truncated/overflowed) values.
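To make "cannot be safely represented" concrete, here is a minimal sketch in plain JVM arithmetic. It is not Spark or Comet code, and `DecimalFit.fitsExactly` is a hypothetical helper introduced only for illustration. It checks whether a value can be stored in DECIMAL(precision, scale) without rounding away fractional digits or overflowing the precision.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}
import scala.util.Try

object DecimalFit {
  // Hypothetical helper (not part of Spark or Comet): true if `v` can be
  // stored in DECIMAL(precision, scale) with no rounding and no overflow.
  def fitsExactly(v: JBigDecimal, precision: Int, scale: Int): Boolean =
    // RoundingMode.UNNECESSARY makes setScale throw ArithmeticException
    // whenever fractional digits would have to be dropped.
    Try(v.setScale(scale, RoundingMode.UNNECESSARY))
      .map(_.precision() <= precision) // total digit count must still fit
      .getOrElse(false)
}
```

With this helper, `fitsExactly(new JBigDecimal("123.45"), 10, 2)` holds, while the same value checked against DECIMAL(5,0) does not, because the two fractional digits would be lost.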

This is the decimal-to-decimal counterpart to #4297 (primitive-to-primitive numeric/date conversions), and a generalisation of the specific case that was tracked in #4089.

Affected tests (Spark 4.1.1, dev/diffs/4.1.1.diff)

Currently tagged IgnoreCometNativeDataFusion pointing at the umbrella #3720:

  • ParquetQuerySuite: SPARK-34212 Parquet should read decimals correctly
    Asserts SchemaColumnConvertNotSupportedException when reading e.g. DECIMAL(18,2) as DECIMAL(3,0).
  • ParquetTypeWideningSuite: parquet decimal precision change Decimal($fromPrecision, 2) -> Decimal($toPrecision, 2)
    Iterates precision pairs across INT32 / INT64 / FIXED_LEN_BYTE_ARRAY backed decimals; expects an error whenever the vectorized reader is enabled and fromPrecision > toPrecision.
  • ParquetTypeWideningSuite: parquet decimal precision and scale change Decimal($fromPrecision, $fromScale) -> Decimal($toPrecision, $toScale)
    Same idea but varies both precision and scale.

The same tests exist in the 3.4 / 3.5 / 4.0 diffs and are ignored under #3720 there as well.

Reproduction

import org.apache.comet.CometConf
import org.apache.spark.sql.internal.SQLConf

withSQLConf(
  CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
  SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    Seq(BigDecimal("123.45")).toDF("d")
      .selectExpr("cast(d as decimal(10,2)) as d")
      .write.parquet(path)
    spark.read.schema("d decimal(5,0)").parquet(path).show()
    // Expected: SparkException(SchemaColumnConvertNotSupportedException)
    // Actual: silent wrong/truncated value
  }
}

native_iceberg_compat correctly throws SparkException for this case.

Suggested approach

Same direction as #4297: extend the allowlist used by replace_with_spark_cast / the decimal branch of the schema adapter so that decimal-to-decimal coercions match Spark's ParquetVectorUpdaterFactory rules. Accept only widening conversions (including equal-scale precision widening) and reject everything else with SparkError::ParquetSchemaConvert.
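The check could look roughly like the sketch below. It is written in Scala only to illustrate the predicate, since the actual fix would live in the native (Rust) schema adapter. The rule encoded here is an assumption about what ParquetVectorUpdaterFactory accepts in recent Spark versions: the scale must not decrease, and the precision must grow by at least as much as the scale. It should be verified against the Spark source for each supported version. `Dec` and `isLosslessRead` are hypothetical names.

```scala
object DecimalReadCompat {
  // Hypothetical stand-in for a decimal type: DECIMAL(precision, scale).
  final case class Dec(precision: Int, scale: Int)

  // Assumed rule (verify against Spark's ParquetVectorUpdaterFactory):
  // the scale may not shrink, and the precision must grow by at least as
  // much as the scale so that no integer digits are lost.
  def isLosslessRead(file: Dec, requested: Dec): Boolean = {
    val scaleIncrease = requested.scale - file.scale
    val precisionIncrease = requested.precision - file.precision
    scaleIncrease >= 0 && precisionIncrease >= scaleIncrease
  }
}
```

Under this assumed rule, the reproduction's DECIMAL(10,2) read as DECIMAL(5,0) is rejected, as is DECIMAL(18,2) read as DECIMAL(3,0), while DECIMAL(5,2) read as DECIMAL(10,2) is accepted. Any pair for which the predicate is false would map to SparkError::ParquetSchemaConvert.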

Parent issue

Split from umbrella #3720 (and #4089, which fixed a single decimal narrowing case but did not unblock the broader test coverage).
