[SPARK-48622][SQL] get SQLConf once when resolving column names by andrewxue-db · Pull Request #46979 · apache/spark

andrewxue-db · 2024-06-13T20:01:03Z

What changes were proposed in this pull request?

SQLConf.caseSensitiveAnalysis is currently being retrieved for every column when resolving column names. This is expensive if there are many columns. We can instead retrieve it once before the loop, and reuse the result.
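The idea can be illustrated with a minimal, self-contained sketch. It is not the actual patch: `Conf`, `lookups`, and `resolveColumns` are hypothetical stand-ins invented for illustration, and only the pattern (read the setting once, reuse it for every column) reflects the description above.

```scala
object ResolveColumnsSketch {
  // Hypothetical stand-in for a session config whose reads are not free.
  // `lookups` counts how often the setting is read.
  final class Conf(settings: Map[String, String]) {
    var lookups: Int = 0
    def caseSensitiveAnalysis: Boolean = {
      lookups += 1
      settings.getOrElse("spark.sql.caseSensitive", "false").toBoolean
    }
  }

  // Resolve each requested name against the schema. The config is read
  // exactly once, before the per-column work, instead of once per column.
  def resolveColumns(
      requested: Seq[String],
      schema: Seq[String],
      conf: Conf): Seq[Option[String]] = {
    val caseSensitive = conf.caseSensitiveAnalysis // single read, hoisted out of the loop
    val matches: (String, String) => Boolean =
      if (caseSensitive) (a: String, b: String) => a == b
      else (a: String, b: String) => a.equalsIgnoreCase(b)
    requested.map(name => schema.find(field => matches(field, name)))
  }
}
```

With this shape, the number of config reads stays at one regardless of how many columns are resolved, which is where the saving on very wide tables would come from.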

Why are the changes needed?

Performance improvement.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Profiles of adding 1 column on an empty 10k column table (hms-parquet):

Before: 55s

After: 13s

Was this patch authored or co-authored using generative AI tooling?

No

yaooqinn · 2024-06-14T02:05:44Z

Merged to master, thank you @andrewxue-db

### What changes were proposed in this pull request?
This PR migrates Scala logging to comply with the scala style changes in [#46979](#46947)

### Why are the changes needed?
This makes development and PR review of the structured logging migration easier.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested by ensuring `dev/scalastyle` checks pass

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46980 from asl3/logging-migrationscala.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>

…gEntry)

### What changes were proposed in this pull request?
This PR proposes two micro-optimizations for `SparkConf.get(ConfigEntry)`:

1. Avoid costly `Regex.replaceAllIn` for variable substitution: if the config value does not contain the substring `${` then it cannot possibly contain any variables, so we can completely skip the regex evaluation in such cases.
2. Avoid Scala collections operations, including `List.flatten` and `AbstractIterable.mkString`, for the common case where a configuration does not define a prepended configuration key.

### Why are the changes needed?
Improve performance.

This is primarily motivated by unit testing and benchmarking scenarios but it will also slightly benefit production queries.

Spark tries to avoid excessive configuration reading in hot paths (e.g. via changes like #46979). If we do accidentally introduce such read paths, though, then this PR's optimizations will help to greatly reduce the associated perf. penalty.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Correctness should be covered by existing unit tests.

To measure performance, I did some manual benchmarking by running

```
val conf = new SparkConf()
conf.set("spark.network.crypto.enabled", "true")
```

followed by

```
conf.get(NETWORK_CRYPTO_ENABLED)
```

10 million times in a loop. On my laptop, the optimized code is ~7.5x higher throughput than the original.

We can also compare the before-and-after flamegraphs from a `while(true)` configuration reading loop, showing a clear difference in hotspots before and after this change:

**Before**:
![image](https://github.com/apache/spark/assets/50748/a60cec03-2400-46a5-95f5-f33b88a4872a)

**After**:
![image](https://github.com/apache/spark/assets/50748/10a45575-148b-4f5f-a431-b8036fe59866)

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Github Copilot

Closes #47049 from JoshRosen/SPARK-48678-sparkconf-perf-optimizations.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
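
The first micro-optimization described in that commit message (skip the regex when the value cannot contain a variable reference) might look roughly like the sketch below. This is an illustration under assumptions, not Spark's code: `SubstitutionSketch`, `RefPattern`, `substitute`, and the `lookup` parameter are hypothetical names, and the pattern is a simplified stand-in for whatever regex Spark actually uses.

```scala
import scala.util.matching.Regex

object SubstitutionSketch {
  // Hypothetical, simplified pattern for `${name}` references.
  private val RefPattern: Regex = """\$\{([^}]+)\}""".r

  // Expand `${name}` references in `value` using `lookup`.
  // Unknown references are left in place unchanged.
  def substitute(value: String, lookup: String => Option[String]): String = {
    if (!value.contains("${")) {
      // Fast path: without the substring "${" there can be no reference,
      // so the regex evaluation is skipped entirely.
      value
    } else {
      RefPattern.replaceAllIn(
        value,
        m => Regex.quoteReplacement(lookup(m.group(1)).getOrElse(m.matched)))
    }
  }
}
```

A cheap `String.contains` check in front of the regex keeps the common case (plain values such as `"true"`) free of regex work, while values that do contain `${` still go through the full substitution.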

github-actions Bot added the SQL label Jun 13, 2024

get sql config once

16a4bc5

andrewxue-db force-pushed the andrewxue-db/spark-48622 branch from 6d5fb53 to 16a4bc5 Compare June 13, 2024 20:07

yaooqinn approved these changes Jun 14, 2024

yaooqinn closed this in be154a3 Jun 14, 2024

JoshRosen mentioned this pull request Jun 21, 2024

[SPARK-48678][CORE] Performance optimizations for SparkConf.get(ConfigEntry) #47049

Closed

andrewxue-db wants to merge 1 commit into apache:master from andrewxue-db:andrewxue-db/spark-48622