[SPARK-48622][SQL] get SQLConf once when resolving column names #46979
Closed
andrewxue-db wants to merge 1 commit into
yaooqinn
approved these changes
Jun 14, 2024
Member
Merged to master, thank you @andrewxue-db
gengliangwang
pushed a commit
that referenced
this pull request
Jun 19, 2024
### What changes were proposed in this pull request?
This PR migrates Scala logging to comply with the scala style changes in [#46979](#46947)
### Why are the changes needed?
This makes development and PR review of the structured logging migration easier.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested by ensuring `dev/scalastyle` checks pass
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #46980 from asl3/logging-migrationscala.
Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
HyukjinKwon
pushed a commit
that referenced
this pull request
Jun 23, 2024
…gEntry)
### What changes were proposed in this pull request?
This PR proposes two micro-optimizations for `SparkConf.get(ConfigEntry)`:
1. Avoid costly `Regex.replaceAllIn` for variable substitution: if the config value does not contain the substring `${` then it cannot possibly contain any variables, so we can completely skip the regex evaluation in such cases.
2. Avoid Scala collections operations, including `List.flatten` and `AbstractIterable.mkString`, for the common case where a configuration does not define a prepended configuration key.
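As a rough illustration of the first optimization, the following minimal sketch guards a regex-based substitution with a cheap substring check. It is not Spark's actual implementation: the object name `SubstitutionSketch`, the pattern `RefPattern`, and the `lookup` parameter are all hypothetical names chosen for this example.

```scala
import scala.util.matching.Regex

object SubstitutionSketch {
  // Hypothetical pattern for `${name}` references; not Spark's actual regex.
  private val RefPattern: Regex = """\$\{([^}]+)\}""".r

  /** Slow path: always runs the regex over the value. */
  def substituteAlways(value: String, lookup: String => Option[String]): String =
    RefPattern.replaceAllIn(
      value,
      // Unknown references are left untouched; quoteReplacement keeps `$` literal.
      m => Regex.quoteReplacement(lookup(m.group(1)).getOrElse(m.matched)))

  /** Fast path: a value without "${" cannot contain a reference, so skip the regex. */
  def substitute(value: String, lookup: String => Option[String]): String =
    if (value.contains("${")) substituteAlways(value, lookup) else value
}
```

For a plain value such as `"true"`, `substitute` returns the input string unchanged without touching the regex engine, which is the common case for most configuration values.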
### Why are the changes needed?
Improve performance.
This is primarily motivated by unit testing and benchmarking scenarios but it will also slightly benefit production queries.
Spark tries to avoid excessive configuration reading in hot paths (e.g. via changes like #46979). If we do accidentally introduce such read paths, though, this PR's optimizations will greatly reduce the associated performance penalty.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Correctness should be covered by existing unit tests.
To measure performance, I did some manual benchmarking by running
```
val conf = new SparkConf()
conf.set("spark.network.crypto.enabled", "true")
```
followed by
```
conf.get(NETWORK_CRYPTO_ENABLED)
```
10 million times in a loop.
On my laptop, the optimized code achieves ~7.5x higher throughput than the original.
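A manual benchmark of this kind could be structured roughly as follows. This is only a sketch: `ConfReadBenchmarkSketch` and `timeNanos` are hypothetical names, and a plain `Map` stands in for `SparkConf` so that the example runs without Spark on the classpath.

```scala
object ConfReadBenchmarkSketch {
  /** Runs `body` `iterations` times and returns elapsed wall-clock time in nanoseconds. */
  def timeNanos(iterations: Int)(body: => Unit): Long = {
    val start = System.nanoTime()
    var i = 0
    while (i < iterations) {
      body
      i += 1
    }
    System.nanoTime() - start
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for a configuration object; the original benchmark used SparkConf.
    val settings = Map("spark.network.crypto.enabled" -> "true")
    var sink = false // keep the result observable so the read is not trivially dead
    val elapsed = timeNanos(10000000) {
      sink = settings("spark.network.crypto.enabled").toBoolean
    }
    println(s"10M reads took ${elapsed / 1e6} ms (last value: $sink)")
  }
}
```

A single timed run like this includes JIT warm-up, so repeating the measurement a few times and discarding the first runs gives a steadier comparison between two implementations.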
We can also compare the before-and-after flamegraphs from a `while(true)` configuration reading loop, showing a clear difference in hotspots before and after this change:
**Before**:

**After**:

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Github Copilot
Closes #47049 from JoshRosen/SPARK-48678-sparkconf-perf-optimizations.
Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
`SQLConf.caseSensitiveAnalysis` is currently being retrieved for every column when resolving column names. This is expensive if there are many columns. We can instead retrieve it once before the loop, and reuse the result.
### Why are the changes needed?
Performance improvement.
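The idea of reading the setting once rather than once per column can be sketched as below. This is not the code changed by the PR: `Conf`, `resolvePerColumn`, `resolveOnce`, and `matches` are hypothetical stand-ins, and the `reads` counter exists only to make the number of config lookups visible.

```scala
object ResolveColumnsSketch {
  // Hypothetical stand-in for a config object whose lookups are not free.
  final class Conf(caseSensitive: Boolean) {
    var reads: Int = 0
    def caseSensitiveAnalysis: Boolean = { reads += 1; caseSensitive }
  }

  private def matches(col: String, name: String, caseSensitive: Boolean): Boolean =
    if (caseSensitive) col == name else col.equalsIgnoreCase(name)

  /** Before: the config is read again for every column being resolved. */
  def resolvePerColumn(conf: Conf, schema: Seq[String], wanted: Seq[String]): Seq[Option[String]] =
    wanted.map { name =>
      val caseSensitive = conf.caseSensitiveAnalysis // one read per column
      schema.find(col => matches(col, name, caseSensitive))
    }

  /** After: the config is read once before the loop and the result is reused. */
  def resolveOnce(conf: Conf, schema: Seq[String], wanted: Seq[String]): Seq[Option[String]] = {
    val caseSensitive = conf.caseSensitiveAnalysis // single read
    wanted.map(name => schema.find(col => matches(col, name, caseSensitive)))
  }
}
```

Both functions return the same results as long as the setting does not change during resolution; the second performs one config lookup regardless of how many columns are resolved.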
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Profiles of adding 1 column on an empty 10k column table (hms-parquet):
Before (55s):

After (13s):

### Was this patch authored or co-authored using generative AI tooling?
No