[SPARK-57354][SQL] Add ignoredPathSegmentRegex data source option and config #56374
CalvQ wants to merge 6 commits into
| object FileSourceOptions { | ||
| val IGNORE_CORRUPT_FILES = "ignoreCorruptFiles" | ||
| val IGNORE_MISSING_FILES = "ignoreMissingFiles" | ||
| val LIST_HIDDEN_FILES = "listHiddenFiles" |
I prefer this option name. Open to other suggestions. cc: @cloud-fan
| val LIST_HIDDEN_FILES = "listHiddenFiles" | |
| val IGNORE_METADATA_FILES = "ignoreMetadataFiles" |
I prefer having a hiddenFileRegex instead that by default is ^[\.\_]. That would give the users flexibility to keep files that start with . as a hidden file but keep _ as readable, etc.
This is possible with minor logic changes: there's a full-path filter and a path segment filter. To keep behavior consistent we would need to walk the full-path to check our regex against each segment for filtering.
I'd keep the boolean listHiddenFiles.
Re hiddenFileRegex: the current filter isn't actually expressible as a name-prefix regex, which makes a regex option more misleading than flexible. Today's rule also drops *._COPYING_ by suffix, exempts _metadata/_common_metadata (Parquet summary files), and special-cases _x=y names (startsWith("_") && !contains("=")). So ^[\.\_] as the "behavior-preserving default" is already subtly wrong, and exposing the rule as a user-supplied regex either loses these special cases or forces users to understand them. The flexibility use case (e.g. surface _ files but keep . files hidden) is already covered by combining this option with pathGlobFilter. If a real need shows up later, a regex option can still be added alongside the boolean without breaking anything.
Re ignoreMetadataFiles: the filtered set is broader than metadata files (a .foo.json sidecar isn't metadata), and the default would be true, giving users a double negative. "Hidden files" is the established Hadoop convention (FileInputFormat.hiddenFileFilter is exactly the _/. prefix rule), so listHiddenFiles describes what the mechanism does.
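To make the mismatch concrete, here is a minimal Python sketch of the rule as described in this comment, compared against a bare `^[._]` prefix regex. It is an illustration only, not the Spark source: the function names are made up for this example, and Python's `re.search` is assumed to behave like Java's `Matcher.find` for these simple patterns.

```python
import re

def should_filter_out_path_name(name: str) -> bool:
    """Sketch of the hidden-name rule described above (not the Spark source)."""
    exclude = (
        (name.startswith("_") and "=" not in name)  # "_x=y" names are not excluded
        or name.startswith(".")
        or name.endswith("._COPYING_")              # in-flight copies, matched by suffix
    )
    # Parquet summary files are exempt from the exclusion.
    include = name.startswith("_metadata") or name.startswith("_common_metadata")
    return exclude and not include

PREFIX_REGEX = re.compile(r"^[._]")

def prefix_regex_filters(name: str) -> bool:
    """What a plain name-prefix regex would filter."""
    return PREFIX_REGEX.search(name) is not None

names = ["_SUCCESS", ".part.crc", "data.parquet", "part-0._COPYING_", "_metadata", "_col=1"]
disagreements = [n for n in names if should_filter_out_path_name(n) != prefix_regex_filters(n)]
print(disagreements)  # ['part-0._COPYING_', '_metadata', '_col=1']
```

The three names that disagree are exactly the special cases listed above: the suffix rule, the summary-file exemption, and the `_x=y` exemption.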
No, the regex would apply to every directory and file name independently. So you would walk up the directory path if you got a full list; if you're performing a recursive list, you would just filter by path.getName. The regex doesn't apply to the full path, it applies to the individual directory/file names.
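A minimal sketch of that per-segment check, in Python for brevity. The helper name `is_ignored` and the use of relative paths split on `/` are assumptions made for this illustration; the actual listing code works on Hadoop `Path` objects.

```python
import re

def is_ignored(relative_path: str, ignored_regex: str = r"^[._]") -> bool:
    """Return True if any directory or file name in the path matches the regex."""
    pattern = re.compile(ignored_regex)
    segments = [s for s in relative_path.split("/") if s]
    # The regex is tested against each name on its own, never the joined path.
    return any(pattern.search(segment) is not None for segment in segments)

path = "year=2024/_tmp/part-0.parquet"
per_segment = is_ignored(path)                        # sees the "_tmp" directory
whole_path = re.search(r"^[._]", path) is not None    # anchored at "y", so no match
print(per_segment, whole_path)  # True False
```

Matching the whole path would miss the hidden `_tmp` directory because `^` anchors at the start of the full string, which is why the check has to run per name.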
I'm thinking to apply the regex after the hardcoded edge cases: we always keep _metadata/_common_metadata, always drop *._COPYING_, and always keep _x=y names. The regex only replaces the generic _/. rule, so we can keep our default as ^[._], and a user-supplied regex cannot change the special rules we hardcode. WDYT? @cloud-fan
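Roughly, the proposed ordering could look like the sketch below. It is hypothetical: the function name is invented, the order in which the three hardcoded rules are checked is an assumption (the comment does not say which wins for a name such as `_metadata._COPYING_`), and `re.search` stands in for Java's `find`.

```python
import re

def should_ignore_name(name: str, ignored_regex: str = r"^[._]") -> bool:
    """Hardcoded rules first; the user regex only replaces the generic _/. rule."""
    if name.startswith("_metadata") or name.startswith("_common_metadata"):
        return False  # Parquet summary files are always kept
    if name.endswith("._COPYING_"):
        return True   # in-flight copies are always dropped
    if name.startswith("_") and "=" in name:
        return False  # "_x=y" names are always kept
    return re.search(ignored_regex, name) is not None

print(should_ignore_name("_SUCCESS"))                   # True
print(should_ignore_name("_SUCCESS", r"^\."))           # False
print(should_ignore_name("part-0._COPYING_", r"(?!)"))  # True
```

With this shape a user-supplied regex can surface `_SUCCESS` or `.crc` files, but it cannot re-enable `._COPYING_` files or hide `_metadata`.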
I think we should make the user intention clear here. The hidden file definition comes from the file system, and it's a fact. Whether to read hidden files is the user's intention. I'm fine with a regex option, but let's make sure the option name reflects the user intention.
Makes sense -- what do you think about ignoredPathSegmentRegex? This would represent the path segments that a user wants ignored during listing, and would default to the normal hidden filters for _/. There has also been no demand for listing Spark-specific hidden files, so I want to keep those rules hardcoded unless that demand shows up.
ignoredPathSegmentRegex SGTM!
cloud-fan left a comment
1 blocking, 3 non-blocking, 2 nits.
Solid, pattern-consistent implementation with thorough tests; the one real defect is the flag-unaware FileStatusCache interaction for catalog tables.
Correctness (1)
- InMemoryFileIndex.scala:159: toggling the session conf on a catalog table returns stale FileStatusCache listings — see inline
Design / architecture (1)
- HadoopFSUtils.scala:363: two callers of `shouldFilterOutPathName` still hardcode the default (ArchiveReader parity claim, DataSource warning); see inline
Suggestions (2)
- ParquetUtils.scala:148: zero-length inference change deserves a migration-guide entry — see inline
- FileBasedDataSourceSuite.scala:1465: add the partitioned Spark-written-layout test — see inline
Nits: 2 minor items (see inline comments).
| new FileSourceOptions(CaseInsensitiveMap(parameters)).ignoreMissingFiles | ||
| val fileSourceOptions = new FileSourceOptions(CaseInsensitiveMap(parameters)) | ||
| val ignoreMissingFiles = fileSourceOptions.ignoreMissingFiles | ||
| val listHiddenFiles = fileSourceOptions.listHiddenFiles |
This interacts badly with FileStatusCache: cache entries are keyed by path only, but the listing they hold now depends on listHiddenFiles. Path-based reads are safe (each read gets a fresh cache client), but CatalogFileIndex keeps one cache client for the table's lifetime and resolves the flag from the session conf at listing time — so toggling spark.sql.files.listHiddenFiles between queries on a catalog table silently returns the listing cached under the old value (in both directions) until REFRESH TABLE or eviction.
I'd bypass the cache (skip both getLeafFiles and putLeafFiles in listLeafFiles) when listHiddenFiles is true, so cached entries always hold the canonical hidden-filtered listing. That's simpler than widening the cache key, and re-listing on an opt-in flag is an acceptable cost.
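The staleness and the proposed bypass can be illustrated with a toy model. Everything here is hypothetical (class and function names, the in-memory `FILES` map standing in for a file system); it only mirrors the shape of the problem: a cache keyed by path whose cached value depends on a flag.

```python
class PathKeyedCache:
    """Toy stand-in for a cache whose key is the path only."""
    def __init__(self):
        self._entries = {}

    def get(self, path):
        return self._entries.get(path)

    def put(self, path, files):
        self._entries[path] = files

FILES = {"/t": ["part-0.parquet", "_SUCCESS"]}  # pretend file system

def raw_list(path, list_hidden):
    return [f for f in FILES[path] if list_hidden or not f.startswith(("_", "."))]

def list_leaf_files(path, list_hidden, cache, bypass_when_hidden):
    # With the bypass, the cache is neither read nor written when the flag is on,
    # so cached entries always hold the default hidden-filtered listing.
    use_cache = not (bypass_when_hidden and list_hidden)
    if use_cache:
        cached = cache.get(path)
        if cached is not None:
            return cached
    files = raw_list(path, list_hidden)
    if use_cache:
        cache.put(path, files)
    return files

shared = PathKeyedCache()
list_leaf_files("/t", False, shared, bypass_when_hidden=False)
stale = list_leaf_files("/t", True, shared, bypass_when_hidden=False)

bypassing = PathKeyedCache()
list_leaf_files("/t", False, bypassing, bypass_when_hidden=True)
fresh = list_leaf_files("/t", True, bypassing, bypass_when_hidden=True)
print(stale, fresh)
```

Without the bypass the second call returns the listing cached under the old flag value; with it, the flagged call re-lists and the cache contents stay canonical.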
We don't need a SQL conf if FileSourceOptions is used. And CatalogFileIndex will ignore FileSourceOptions. So we should be ok with enabling caching.
Do you mean removing the SQLConf entirely? Existing patterns such as ignoreCorruptFiles and ignoreMissingFiles keep both a session conf and a per-read option. Removing the conf would be cache safe, but to match existing patterns I think it would make sense to keep both the conf and per-read option -- this would require the caching bypass
| # $example on:ignored_path_segment_regex$ | ||
| # "(?!)" surfaces files that are hidden by default (e.g. names starting with "_" or ".") | ||
| surfaced_df = spark.read.format("parquet")\ | ||
| .option("ignoredPathSegmentRegex", "(?!)")\ |
Wouldn't empty string value be simpler?
Empty string using the .find pattern actually matches all strings, which would mean we filter everything.
I agree that the "" pattern intuitively means allow everything, so I've added a specific edge-case for this, and swapped all examples to this
PTAL if the edge case is better or you want to instead just explain the "" pattern behavior
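For reference, a small Python sketch of the trap and of the special case. `compile_ignored_regex` is a hypothetical helper name for this illustration, and Python's `re.search` is assumed to mirror Java's `find` semantics for these patterns.

```python
import re

# "(?!)" is an empty negative lookahead, so it can never match.
NEVER_MATCHES = re.compile(r"(?!)")

def compile_ignored_regex(value: str) -> "re.Pattern[str]":
    """Hypothetical helper: map the empty string to 'ignore nothing'."""
    return NEVER_MATCHES if value == "" else re.compile(value)

names = ["_SUCCESS", ".hidden", "data.parquet"]

# Naive: an empty pattern finds a (zero-length) match in every name.
naive_kept = [n for n in names if re.search("", n) is None]
# Special-cased: the empty string disables the generic filter.
special_cased_kept = [n for n in names if compile_ignored_regex("").search(n) is None]
# Default regex for comparison.
default_kept = [n for n in names if compile_ignored_regex(r"^[._]").search(n) is None]

print(naive_kept)          # []
print(special_cased_kept)  # ['_SUCCESS', '.hidden', 'data.parquet']
print(default_kept)        # ['data.parquet']
```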
| Regardless of the regex, three rules always apply: names starting with `_metadata` or `_common_metadata` (Parquet summary files) are always listed, names ending in | ||
| `._COPYING_` (in-flight copies) are always skipped, and `_`-prefixed names containing `=` (partition directories) are always kept. | ||
| A regex that never matches, such as `(?!)`, disables the generic hidden-file filtering and surfaces hidden files, including Spark-internal marker files such as |
An issue with empty pattern string ""?
Explained #56374 (comment), but the empty pattern string "" actually matches every string, meaning we would filter out everything.
Currently special-casing it to match nothing (disabling filtering), but lmk if you don't think the edge case is necessary
cloud-fan left a comment
6 addressed, 0 remaining, 3 new. (3 newly introduced, 0 late catches.) All six prior findings resolved; the redesign to a regex is clean and very well-tested. The two new blockers are housekeeping that rode in with the redesign.
Design / architecture (1)
- PR title/description still describe the old boolean `listHiddenFiles` option + `spark.sql.files.listHiddenFiles` conf; the implementation is the `ignoredPathSegmentRegex` regex. Please update both (see PR description suggestions).
Correctness (1)
- SQLConf.scala:2793 / sql-migration-guide.md:25: version should be `4.3.0` (next open feature release), not `5.0.0`; see inline.
Suggestions (1)
- FileSourceOptions.scala:58: compile + validate the `Pattern` once here instead of re-compiling at ~5 sites (reinforces @sandip-db's CSVDataSource thread); see inline.
Verification
Traced the HadoopFSUtils.shouldFilterOutPath rewrite (recursive _metadata logic → per-segment split("/") walk): equivalent to the pre-PR logic for the default ^[._] across _SUCCESS / .-prefixed files / _x=y partition dirs / _metadata leaf+dir cases; the two divergences (/_foo/_metadata, /x/_metadata.json) are the intentional unifications the new HadoopFSUtilsSuite cases assert. isDataPath and the FileStatusCache bypass key off the same resolved regex as the listing filter.
PR description suggestions
- Rewrite the title `Add listHiddenFiles data source option` to the `ignoredPathSegmentRegex` design.
- Rewrite the body: it describes a boolean `listHiddenFiles` option + a `spark.sql.files.listHiddenFiles` conf (default false). Replace with the regex option/conf (default `^[._]`, `find` semantics, option-overrides-conf), the three hardcoded carve-outs (`_metadata`/`_common_metadata` kept, `*._COPYING_` skipped, `_x=y` partition dirs kept), and the `(?!)` disable-filtering escape hatch.
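As a reading aid for "option-overrides-conf", here is a hedged Python sketch of the resolution order. The function name and the plain-dict inputs are assumptions for this example; the case-insensitive option lookup mirrors the `CaseInsensitiveMap` used when building `FileSourceOptions` in the diff above.

```python
DEFAULT_IGNORED_REGEX = r"^[._]"
CONF_KEY = "spark.sql.files.ignoredPathSegmentRegex"
OPTION_KEY = "ignoredPathSegmentRegex"

def resolve_ignored_regex(read_options: dict, session_conf: dict) -> str:
    """Per-read option wins over the session conf, which wins over the default."""
    lowered = {key.lower(): value for key, value in read_options.items()}
    if OPTION_KEY.lower() in lowered:
        return lowered[OPTION_KEY.lower()]
    return session_conf.get(CONF_KEY, DEFAULT_IGNORED_REGEX)

print(resolve_ignored_regex({}, {}))                                      # ^[._]
print(resolve_ignored_regex({}, {CONF_KEY: r"^\."}))                      # ^\.
print(resolve_ignored_regex({"IGNOREDPATHSEGMENTREGEX": "(?!)"}, {CONF_KEY: r"^\."}))  # (?!)
```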
Empty string should allow everything not filter it out. Otherwise how do you allow ingesting everything?
On Thu, Jun 18, 2026, 11:02 PM Calvin Qin wrote:
Commented on this pull request, in docs/sql-data-sources-generic-options.md:
> @@ -97,6 +97,46 @@ you can use:
</div>
</div>
+### Ignored Path Segment Regex
+
+Spark allows you to use the configuration `spark.sql.files.ignoredPathSegmentRegex` or the data source option `ignoredPathSegmentRegex` to control which files are treated as
+hidden during file listing. The value is a Java regular expression that is matched (with find semantics, i.e. `java.util.regex.Matcher.find`) against each individual
+directory and file name below the path being read; names in which the regex finds a match are skipped from file listing, partition discovery, and reads, and a matching
+directory name excludes its whole subtree. The default value is `^[._]`, which skips files and directories whose names start with `_` or `.`. The data source option
+takes precedence over the configuration when both are set.
+
+Regardless of the regex, three rules always apply: names starting with `_metadata` or `_common_metadata` (Parquet summary files) are always listed, names ending in
+`._COPYING_` (in-flight copies) are always skipped, and `_`-prefixed names containing `=` (partition directories) are always kept.
+
+A regex that never matches, such as `(?!)`, disables the generic hidden-file filtering and surfaces hidden files, including Spark-internal marker files such as
Explained #56374 (comment), but the empty pattern string "" actually matches every string, meaning we would filter out everything.
Add a `listHiddenFiles` data source option plus a `spark.sql.files.listHiddenFiles` session configuration (default false) for file-based sources such as Parquet, JSON, ORC and CSV. When enabled, files and directories whose names start with `_` or `.` -- including `._COPYING_` files -- participate in file listing, partition discovery, and reads instead of being skipped. The per-read option overrides the session configuration. Both batch and Structured Streaming reads are covered.
The flag is threaded through the listing path (HadoopFSUtils.listFiles / parallelListLeafFiles / listLeafFiles), the SQL file index (InMemoryFileIndex.bulkListLeafFiles and PathFilterWrapper), partition discovery (PartitioningAwareFileIndex.isDataPath), and file streaming options (FileStreamOptions). When true, the hidden-file filter is bypassed entirely; users can narrow results with `pathGlobFilter`.
Co-authored-by: Isaac
…ter, version 4.3.0
- Add FileSourceOptions.ignoredPathSegmentRegexPattern lazy val and shared compileIgnoredPathSegmentRegex; drop the 5 duplicate Pattern.compile sites and the duplicate require in PartitioningAwareFileIndex
- Empty string now disables hidden-file filtering (replaces require(nonEmpty) and the '(?!)' escape hatch); compile maps empty to a never-matching pattern
- Simplify PathFilterWrapper: HadoopFSUtils.listFiles/parallelListLeafFiles already apply the regex per path segment, so the wrapper check was redundant
- SQLConf version 4.3.0; migration-guide bullet moved into the 4.3 section
- Docs and Scala/Java/Python/R examples use "" as the disable value
- Update tests for the new semantics; wrap a few over-100-char lines
Co-authored-by: Isaac
Agreed that I've now special cased |
cloud-fan left a comment
2 addressed, 1 remaining, 0 new. (0 newly introduced, 0 late catches.) Round-2 findings #2 (version 4.3.0) and #3 (centralized Pattern compile) are resolved. Clean otherwise; the one remaining item is the stale PR-description section.
Design / architecture (1)
- PR description's "Does this PR introduce any user-facing change?" section still describes the removed `listHiddenFiles = true` boolean instead of `ignoredPathSegmentRegex`; see PR description suggestions. (Unresolved remainder of the round-2 description finding; non-blocking.)
Verification
Re-checked that this round's housekeeping is behavior-preserving: the centralized compileIgnoredPathSegmentRegex + lazy Pattern produces the same effective filter as the prior per-site Pattern.compile; "" maps to a never-matching pattern (avoiding the empty-regex-matches-everything trap), so it disables the generic filter while the three hardcoded carve-outs still apply; and the simplified PathFilterWrapper drops only a check that was a strict subset of the per-segment filtering HadoopFSUtils.listFiles/parallelListLeafFiles already apply with the same threaded regex. useFileStatusCache correctly compares the resolved regex string against the default.
PR description suggestions
- Fix the "Does this PR introduce any user-facing change?" section to describe the `ignoredPathSegmentRegex` option/conf (default `^[._]`; empty string or `(?!)` disables filtering; per-read option overrides the session conf) instead of the removed `listHiddenFiles = true` boolean.
thanks, merging to master/4.x
… config

### What changes were proposed in this pull request?

Add an `ignoredPathSegmentRegex` data source option plus a `spark.sql.files.ignoredPathSegmentRegex` session config (default `^[._]`) for file-based sources. The value is a Java regex matched with find semantics against each individual directory/file name during listing; matching names are skipped from file listing, partition discovery, and reads (a matching directory excludes its whole subtree). The per-read option overrides the session config. Both batch and Structured Streaming reads are covered.

Regardless of the regex, three rules are hardcoded:
- `_metadata`/`_common_metadata` (Parquet summary files) are always listed
- `*._COPYING_` (in-flight Hadoop FS-shell copies) are always skipped
- `_`-prefixed partition dirs containing `=` are always kept

A regex that never matches (`(?!)`, or empty string -- pending decision below) disables the generic hidden-file filter.

Two carve-outs: table statistics (CommandUtils) pin the default regex so ANALYZE sizes are unaffected by the session conf; and Parquet schema inference now excludes zero-length files from footer candidates so a directory containing a 0-byte _SUCCESS marker stays readable when hidden files are surfaced.

User documentation is added to `sql-data-sources-generic-options.md` and the Structured Streaming guide.

### Why are the changes needed?

Hidden files listing is being requested as a feature. We want to enable a user-facing option for this.

### Does this PR introduce _any_ user-facing change?

Yes. By setting the new option `ignoredPathSegmentRegex = ""`, hidden files will now show up inside all file listing operations. The default value preserves the current behavior of skipping hidden files.

### How was this patch tested?

Added unit tests covering the data source option and session conf, exercising the file-listing path, partition discovery, batch reads, and structured streaming. Also added tests for the session conf in streaming, partition discovery with a hidden directory next to partition dirs, and reading an unmodified Spark-written parquet directory with the option enabled.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

Closes #56374 from CalvQ/list-hidden-files-option.

Lead-authored-by: Calvin Qin <calvin.qin.ca@gmail.com>
Co-authored-by: Calvin Qin <calvin.qin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0417f91)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
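The zero-length carve-out mentioned in the commit message can be pictured with a short sketch. The `FileStatus` record and the `footer_candidates` helper are hypothetical stand-ins written for this illustration; they only show the idea of dropping 0-byte files before attempting to read Parquet footers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for a listed file: path plus size in bytes."""
    path: str
    length: int

def footer_candidates(files):
    # A 0-byte file cannot contain a Parquet footer, so it is skipped
    # instead of failing schema inference.
    return [f for f in files if f.length > 0]

listed = [
    FileStatus("/t/_SUCCESS", 0),            # surfaced once hidden files are listed
    FileStatus("/t/part-0.parquet", 1024),
]
candidates = [f.path for f in footer_candidates(listed)]
print(candidates)  # ['/t/part-0.parquet']
```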