[SPARK-57354][SQL] Add ignoredPathSegmentRegex data source option and config by CalvQ · Pull Request #56374 · apache/spark

CalvQ · 2026-06-08T20:19:07Z

What changes were proposed in this pull request?

Add an ignoredPathSegmentRegex data source option plus a spark.sql.files.ignoredPathSegmentRegex session config (default ^[._]) for file-based sources. The value is a Java regex matched with find semantics against each individual directory/file name during listing; matching names are skipped from file listing, partition discovery, and reads (a matching directory excludes its whole subtree). The per-read option overrides the session config. Both batch and Structured Streaming reads are covered.

Regardless of the regex, three rules are hardcoded:

- _metadata/_common_metadata (Parquet summary files) are always listed.
- *._COPYING_ (in-flight Hadoop FS-shell copies) are always skipped.
- _-prefixed partition dirs containing = are always kept.

A regex that never matches, such as (?!), disables the generic hidden-file filter. The empty string is special-cased to behave the same way.

Two carve-outs: table statistics (CommandUtils) pin the default regex so ANALYZE sizes are unaffected by the session conf; and Parquet schema inference now excludes zero-length files from footer candidates so a directory containing a 0-byte _SUCCESS marker stays readable when hidden files are surfaced.
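
As a rough illustration of the per-name decision described above, here is a minimal Python sketch. It is not the Spark implementation: `should_ignore_name` is a hypothetical helper, Python's `re.search` stands in for Java's `Matcher.find`, and the order in which the three hardcoded rules are checked is an assumption.

```python
import re

DEFAULT_IGNORED_REGEX = r"^[._]"  # default stated in the PR description


def should_ignore_name(name: str, ignored_regex: str = DEFAULT_IGNORED_REGEX) -> bool:
    """Return True if a single directory or file name should be skipped.

    Hypothetical helper: the rule order below is an assumption.
    """
    # Hardcoded rule: Parquet summary files are always listed.
    if name.startswith("_metadata") or name.startswith("_common_metadata"):
        return False
    # Hardcoded rule: in-flight copies are always skipped.
    if name.endswith("._COPYING_"):
        return True
    # Hardcoded rule: "_"-prefixed names containing "=" are always kept.
    if name.startswith("_") and "=" in name:
        return False
    # Generic rule: skip the name if the regex finds a match anywhere in it.
    return re.search(ignored_regex, name) is not None
```

With the default regex this skips `_SUCCESS` and `.hidden` but keeps `_metadata` and `_col=1`; with the never-matching `(?!)` only the hardcoded `._COPYING_` rule still skips anything.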

User documentation is added to sql-data-sources-generic-options.md and the Structured Streaming guide.

Why are the changes needed?

Users have requested the ability to list hidden files. This PR adds a user-facing option for it.

Does this PR introduce any user-facing change?

Yes. Setting the new option ignoredPathSegmentRegex = "" makes hidden files show up in all file listing operations. The default value preserves the current behavior of skipping hidden files.

How was this patch tested?

Added unit tests covering the data source option and session conf, exercising the file-listing path, partition discovery, batch reads, and structured streaming. Also added tests for the session conf in streaming, partition discovery with a hidden directory next to partition dirs, and reading an unmodified Spark-written parquet directory with the option enabled.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

sandip-db · 2026-06-10T19:41:48Z

 object FileSourceOptions {
  val IGNORE_CORRUPT_FILES = "ignoreCorruptFiles"
  val IGNORE_MISSING_FILES = "ignoreMissingFiles"
+  val LIST_HIDDEN_FILES = "listHiddenFiles"


I prefer this option name. Open to other suggestions. cc: @cloud-fan

Suggested change

val LIST_HIDDEN_FILES = "listHiddenFiles"

val IGNORE_METADATA_FILES = "ignoreMetadataFiles"

I prefer having a hiddenFileRegex instead that by default is ^[\.\_]. That would give users the flexibility to, for example, keep files that start with . hidden while making files that start with _ readable.

This is possible with minor logic changes: there's a full-path filter and a path segment filter. To keep behavior consistent we would need to walk the full-path to check our regex against each segment for filtering.

I'd keep the boolean listHiddenFiles.

Re hiddenFileRegex: the current filter isn't actually expressible as a name-prefix regex, which makes a regex option more misleading than flexible. Today's rule also drops *._COPYING_ by suffix, exempts _metadata/_common_metadata (Parquet summary files), and special-cases _x=y names (startsWith("_") && !contains("=")). So ^[\.\_] as the "behavior-preserving default" is already subtly wrong, and exposing the rule as a user-supplied regex either loses these special cases or forces users to understand them. The flexibility use case (e.g. surface _ files but keep . files hidden) is already covered by combining this option with pathGlobFilter. If a real need shows up later, a regex option can still be added alongside the boolean without breaking anything.

Re ignoreMetadataFiles: the filtered set is broader than metadata files (a .foo.json sidecar isn't metadata), and the default would be true, giving users a double negative. "Hidden files" is the established Hadoop convention (FileInputFormat.hiddenFileFilter is exactly the _/. prefix rule), so listHiddenFiles describes what the mechanism does.

No, the regex would apply to every directory and file name independently. If you got a full list of paths, you would walk up each directory path; if you're performing a recursive listing, you would just filter by path.getName. The regex doesn't apply to the full path; it applies to the individual directory and file names.
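
The per-segment idea can be sketched in Python as follows. This is only an illustration: `path_is_ignored` is a hypothetical name, only the segments below the root being read are checked (an assumption), and the hardcoded special cases discussed in this thread are left out for brevity.

```python
import re
from pathlib import PurePosixPath


def path_is_ignored(path: str, root: str, ignored_regex: str = r"^[._]") -> bool:
    """Return True if any name below `root` on the way to `path` matches the regex.

    Hypothetical helper: the regex is applied to each segment on its own,
    never to the full path string.
    """
    pattern = re.compile(ignored_regex)
    # Only the segments below the root being read are examined.
    relative = PurePosixPath(path).relative_to(PurePosixPath(root))
    # A matching directory name excludes everything underneath it.
    return any(pattern.search(segment) for segment in relative.parts)
```

Because each segment is tested separately, `/data/t/_tmp/part-0.parquet` is ignored when reading `/data/t`, while reading `/data/_staging/t` directly does not ignore the files inside it.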

I'm thinking of applying the regex after the hardcoded edge cases: we always keep _metadata/_common_metadata, always drop *._COPYING_, and always keep _x=y names. The regex only replaces the generic _/. rule, so we can keep our default as ^[._], and a user-supplied regex cannot change the special rules we hardcode. WDYT? @cloud-fan

I think we should make the user intention clear here. Which files are hidden is defined by the file system; it's a fact. Whether to read hidden files is the user's intention. I'm fine with a regex option, but let's make sure the option name reflects the user intention.

Makes sense. What do you think about ignoredPathSegmentRegex? It would represent the path segments that a user wants ignored during listing, and would default to the usual hidden-file filter for names starting with _ or . There was also no demand for listing Spark-specific hidden files, so I want to keep those rules hardcoded unless that demand appears.

ignoredPathSegmentRegex SGTM!

cloud-fan

1 blocking, 3 non-blocking, 2 nits.
Solid, pattern-consistent implementation with thorough tests; the one real defect is the flag-unaware FileStatusCache interaction for catalog tables.

Correctness (1)

InMemoryFileIndex.scala:159: toggling the session conf on a catalog table returns stale FileStatusCache listings — see inline

Design / architecture (1)

HadoopFSUtils.scala:363: two callers of shouldFilterOutPathName still hardcode the default (ArchiveReader parity claim, DataSource warning) — see inline

Suggestions (2)

ParquetUtils.scala:148: zero-length inference change deserves a migration-guide entry — see inline
FileBasedDataSourceSuite.scala:1465: add the partitioned Spark-written-layout test — see inline

Nits: 2 minor items (see inline comments).

cloud-fan · 2026-06-11T23:29:01Z

-      new FileSourceOptions(CaseInsensitiveMap(parameters)).ignoreMissingFiles
+    val fileSourceOptions = new FileSourceOptions(CaseInsensitiveMap(parameters))
+    val ignoreMissingFiles = fileSourceOptions.ignoreMissingFiles
+    val listHiddenFiles = fileSourceOptions.listHiddenFiles


This interacts badly with FileStatusCache: cache entries are keyed by path only, but the listing they hold now depends on listHiddenFiles. Path-based reads are safe (each read gets a fresh cache client), but CatalogFileIndex keeps one cache client for the table's lifetime and resolves the flag from the session conf at listing time — so toggling spark.sql.files.listHiddenFiles between queries on a catalog table silently returns the listing cached under the old value (in both directions) until REFRESH TABLE or eviction.

I'd bypass the cache (skip both getLeafFiles and putLeafFiles in listLeafFiles) when listHiddenFiles is true, so cached entries always hold the canonical hidden-filtered listing. That's simpler than widening the cache key, and re-listing on an opt-in flag is an acceptable cost.
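
A toy Python model of the suggested bypass, to make the stale-listing problem concrete. The names here (`PathKeyedCache`, `list_leaf_files`) are hypothetical stand-ins, not Spark's classes, and the hidden-file check is reduced to a simple prefix test.

```python
from typing import Dict, List, Optional


class PathKeyedCache:
    """Toy cache keyed by path only, like the cache described above."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[str]] = {}

    def get(self, path: str) -> Optional[List[str]]:
        return self._entries.get(path)

    def put(self, path: str, files: List[str]) -> None:
        self._entries[path] = files


def list_leaf_files(path: str, names_on_disk: List[str],
                    list_hidden_files: bool, cache: PathKeyedCache) -> List[str]:
    # Only the canonical (hidden-filtered) listing may touch the cache.
    if not list_hidden_files:
        cached = cache.get(path)
        if cached is not None:
            return cached
    listing = [n for n in names_on_disk
               if list_hidden_files or not n.startswith(("_", "."))]
    if not list_hidden_files:
        cache.put(path, listing)
    return listing
```

Since the opt-in listing never reads or writes the cache, toggling the flag between calls on the same path cannot return a listing produced under the other value; the cost is a fresh listing whenever the flag is on.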

We don't need a SQL conf if FileSourceOptions is used. And CatalogFileIndex will ignore FileSourceOptions. So we should be ok with enabling caching.

Do you mean removing the SQLConf entirely? Existing patterns such as ignoreCorruptFiles and ignoreMissingFiles keep both a session conf and a per-read option. Removing the conf would be cache-safe, but to match existing patterns I think it makes sense to keep both the conf and the per-read option, which would require the caching bypass.

sandip-db · 2026-06-17T05:24:41Z

+    # $example on:ignored_path_segment_regex$
+    # "(?!)" surfaces files that are hidden by default (e.g. names starting with "_" or ".")
+    surfaced_df = spark.read.format("parquet")\
+        .option("ignoredPathSegmentRegex", "(?!)")\


Wouldn't empty string value be simpler?

With find semantics, the empty pattern actually matches every string, which would mean we filter out everything.

I agree that the "" pattern intuitively means allow everything, so I've added a specific edge case for it and switched all examples to use it.

PTAL if the edge case is better or you want to instead just explain the "" pattern behavior
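
The trap is easy to reproduce. The sketch below uses Python's `re.search` as a stand-in for Java's `Matcher.find` (both find the empty pattern in every string); `compile_ignored_regex` is a hypothetical helper illustrating the special case, not the code in this PR.

```python
import re

NEVER_MATCHES = "(?!)"  # an empty negative lookahead can never succeed


def compile_ignored_regex(regex: str) -> re.Pattern:
    """Compile the user-supplied regex, treating "" as "match nothing"."""
    return re.compile(NEVER_MATCHES if regex == "" else regex)


names = ["_SUCCESS", ".part-0.crc", "part-0.parquet"]

# Without the special case, "" is found in every name, so nothing survives.
naive = [n for n in names if re.search("", n) is None]

# With the special case, "" disables filtering and every name survives.
kept = [n for n in names if compile_ignored_regex("").search(n) is None]
```

Here `naive` ends up empty while `kept` contains all three names, which is the difference between "filter everything" and "filter nothing" discussed above.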

sandip-db · 2026-06-17T05:33:03Z

+Regardless of the regex, three rules always apply: names starting with `_metadata` or `_common_metadata` (Parquet summary files) are always listed, names ending in
+`._COPYING_` (in-flight copies) are always skipped, and `_`-prefixed names containing `=` (partition directories) are always kept.
+
+A regex that never matches, such as `(?!)`, disables the generic hidden-file filtering and surfaces hidden files, including Spark-internal marker files such as


An issue with empty pattern string ""?

As explained in #56374 (comment), the empty pattern string "" actually matches every string, meaning we would filter out everything.

Currently special-casing it to match nothing (disabling filtering), but lmk if you don't think the edge case is necessary

sandip-db · 2026-06-17T06:23:27Z

-      new FileSourceOptions(CaseInsensitiveMap(parameters)).ignoreMissingFiles
+    val fileSourceOptions = new FileSourceOptions(CaseInsensitiveMap(parameters))
+    val ignoreMissingFiles = fileSourceOptions.ignoreMissingFiles
+    val listHiddenFiles = fileSourceOptions.listHiddenFiles


We don't need a SQL conf if FileSourceOptions is used. And CatalogFileIndex will ignore FileSourceOptions. So we should be ok with enabling caching.

cloud-fan

6 addressed, 0 remaining, 3 new. (3 newly introduced, 0 late catches.) All six prior findings resolved; the redesign to a regex is clean and very well-tested. The two new blockers are housekeeping that rode in with the redesign.

Design / architecture (1)

PR title/description still describe the old boolean listHiddenFiles option + spark.sql.files.listHiddenFiles conf; the implementation is the ignoredPathSegmentRegex regex. Please update both (see PR description suggestions).

Correctness (1)

SQLConf.scala:2793 / sql-migration-guide.md:25: version should be 4.3.0 (next open feature release), not 5.0.0 — see inline.

Suggestions (1)

FileSourceOptions.scala:58: compile + validate the Pattern once here instead of re-compiling at ~5 sites (reinforces @sandip-db's CSVDataSource thread) — see inline.

Verification

Traced the HadoopFSUtils.shouldFilterOutPath rewrite (recursive _metadata logic → per-segment split("/") walk): equivalent to the pre-PR logic for the default ^[._] across _SUCCESS / .-prefixed files / _x=y partition dirs / _metadata leaf+dir cases; the two divergences (/_foo/_metadata, /x/_metadata.json) are the intentional unifications the new HadoopFSUtilsSuite cases assert. isDataPath and the FileStatusCache bypass key off the same resolved regex as the listing filter.

PR description suggestions

Rewrite the title "Add listHiddenFiles data source option" to reflect the ignoredPathSegmentRegex design.
Rewrite the body: it describes a boolean listHiddenFiles option + a spark.sql.files.listHiddenFiles conf (default false). Replace with the regex option/conf (default ^[._], find semantics, option-overrides-conf), the three hardcoded carve-outs (_metadata/_common_metadata kept, *._COPYING_ skipped, _x=y partition dirs kept), and the (?!) disable-filtering escape hatch.

brkyvz · 2026-06-18T21:14:53Z

Empty string should allow everything not filter it out. Otherwise how do you allow ingesting everything?

…

On Thu, Jun 18, 2026, 11:02 PM Calvin Qin wrote:

In docs/sql-data-sources-generic-options.md:

> @@ -97,6 +97,46 @@ you can use:
>  </div>
>  </div>
> +### Ignored Path Segment Regex
> +
> +Spark allows you to use the configuration `spark.sql.files.ignoredPathSegmentRegex` or the data source option `ignoredPathSegmentRegex` to control which files are treated as
> +hidden during file listing. The value is a Java regular expression that is matched (with find semantics, i.e. `java.util.regex.Matcher.find`) against each individual
> +directory and file name below the path being read; names in which the regex finds a match are skipped from file listing, partition discovery, and reads, and a matching
> +directory name excludes its whole subtree. The default value is `^[._]`, which skips files and directories whose names start with `_` or `.`. The data source option
> +takes precedence over the configuration when both are set.
> +
> +Regardless of the regex, three rules always apply: names starting with `_metadata` or `_common_metadata` (Parquet summary files) are always listed, names ending in
> +`._COPYING_` (in-flight copies) are always skipped, and `_`-prefixed names containing `=` (partition directories) are always kept.
> +
> +A regex that never matches, such as `(?!)`, disables the generic hidden-file filtering and surfaces hidden files, including Spark-internal marker files such as

Explained #56374 (comment), but the empty pattern string "" actually matches every string, meaning we would filter out everything.

Add a `listHiddenFiles` data source option plus a `spark.sql.files.listHiddenFiles` session configuration (default false) for file-based sources such as Parquet, JSON, ORC and CSV. When enabled, files and directories whose names start with `_` or `.` -- including `._COPYING_` files -- participate in file listing, partition discovery, and reads instead of being skipped. The per-read option overrides the session configuration. Both batch and Structured Streaming reads are covered.

The flag is threaded through the listing path (HadoopFSUtils.listFiles / parallelListLeafFiles / listLeafFiles), the SQL file index (InMemoryFileIndex.bulkListLeafFiles and PathFilterWrapper), partition discovery (PartitioningAwareFileIndex.isDataPath), and file streaming options (FileStreamOptions). When true, the hidden-file filter is bypassed entirely; users can narrow results with `pathGlobFilter`.

Co-authored-by: Isaac

…ter, version 4.3.0

- Add FileSourceOptions.ignoredPathSegmentRegexPattern lazy val and shared compileIgnoredPathSegmentRegex; drop the 5 duplicate Pattern.compile sites and the duplicate require in PartitioningAwareFileIndex
- Empty string now disables hidden-file filtering (replaces require(nonEmpty) and the '(?!)' escape hatch); compile maps empty to a never-matching pattern
- Simplify PathFilterWrapper: HadoopFSUtils.listFiles/parallelListLeafFiles already apply the regex per path segment, so the wrapper check was redundant
- SQLConf version 4.3.0; migration-guide bullet moved into the 4.3 section
- Docs and Scala/Java/Python/R examples use "" as the disable value
- Update tests for the new semantics; wrap a few over-100-char lines

Co-authored-by: Isaac

CalvQ · 2026-06-18T21:27:27Z

Empty string should allow everything not filter it out. Otherwise how do you allow ingesting everything?

Agreed that "" intuitively means surface everything. Originally the empty pattern matched every string, so using "" would have filtered everything out. Previously I used (?!) as the example, which matches nothing and so disables filtering.

I've now special-cased "" to the match-nothing pattern, so both values disable filtering and list all files.

cloud-fan

2 addressed, 1 remaining, 0 new. (0 newly introduced, 0 late catches.) Round-2 findings #2 (version 4.3.0) and #3 (centralized Pattern compile) are resolved. Clean otherwise; the one remaining item is the stale PR-description section.

Design / architecture (1)

PR description's "Does this PR introduce any user-facing change?" section still describes the removed listHiddenFiles = true boolean instead of ignoredPathSegmentRegex — see PR description suggestions. (Unresolved remainder of the round-2 description finding; non-blocking.)

Verification

Re-checked that this round's housekeeping is behavior-preserving: the centralized compileIgnoredPathSegmentRegex + lazy Pattern produces the same effective filter as the prior per-site Pattern.compile; "" maps to a never-matching pattern (avoiding the empty-regex-matches-everything trap), so it disables the generic filter while the three hardcoded carve-outs still apply; and the simplified PathFilterWrapper drops only a check that was a strict subset of the per-segment filtering HadoopFSUtils.listFiles/parallelListLeafFiles already apply with the same threaded regex. useFileStatusCache correctly compares the resolved regex string against the default.

PR description suggestions

Fix the "Does this PR introduce any user-facing change?" section to describe the ignoredPathSegmentRegex option/conf (default ^[._]; empty string or (?!) disables filtering; per-read option overrides the session conf) instead of the removed listHiddenFiles = true boolean.

cloud-fan · 2026-06-23T20:28:09Z

thanks, merging to master/4.x

… config

Closes #56374 from CalvQ/list-hidden-files-option.

Lead-authored-by: Calvin Qin <calvin.qin.ca@gmail.com>
Co-authored-by: Calvin Qin <calvin.qin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0417f91)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

CalvQ force-pushed the list-hidden-files-option branch 3 times, most recently from e8890da to 44a033d Compare June 9, 2026 01:37

CalvQ changed the title from "[SPARK-XXXXX][SQL] Add listHiddenFiles data source option" to "[SPARK-57354][SQL] Add listHiddenFiles data source option" on Jun 9, 2026

CalvQ force-pushed the list-hidden-files-option branch from 44a033d to 8788bf3 Compare June 9, 2026 21:53

sandip-db reviewed Jun 10, 2026

xiaonanyang-db reviewed Jun 10, 2026

Comment thread core/src/test/scala/org/apache/spark/util/HadoopFSUtilsSuite.scala Outdated

Comment thread sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala Outdated

cloud-fan reviewed Jun 11, 2026

sandip-db reviewed Jun 17, 2026

cloud-fan reviewed Jun 17, 2026

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Outdated

Comment thread docs/sql-migration-guide.md Outdated

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/FileSourceOptions.scala

CalvQ changed the title from "[SPARK-57354][SQL] Add listHiddenFiles data source option" to "[SPARK-57354][SQL] Add ignoredPathSegmentRegex data source option" on Jun 18, 2026

CalvQ changed the title from "[SPARK-57354][SQL] Add ignoredPathSegmentRegex data source option" to "[SPARK-57354][SQL] Add ignoredPathSegmentRegex data source option and config" on Jun 18, 2026

Calvin Qin added 5 commits June 18, 2026 21:19

update tests and docs, remove dead filestream codepath

bbeec35

minor fix, pass bool through shouldFilterOut* functions

be873b0

refactor to ignoredPathSegmentRegex and address feedback

ebc527f

CalvQ force-pushed the list-hidden-files-option branch from 4199c6e to 83543a0 Compare June 22, 2026 21:25

cloud-fan approved these changes Jun 23, 2026

Fix scalastyle line-length violations (100-char limit)

d262adf

cloud-fan closed this in 0417f91 Jun 23, 2026
