[SPARK-57354][SQL] Add ignoredPathSegmentRegex data source option and config#56374

Closed
CalvQ wants to merge 6 commits into apache:master from CalvQ:list-hidden-files-option

Conversation

@CalvQ CalvQ commented Jun 8, 2026

Contributor

What changes were proposed in this pull request?

Add an ignoredPathSegmentRegex data source option plus a spark.sql.files.ignoredPathSegmentRegex session config (default ^[._]) for file-based sources. The value is a Java regex matched with find semantics against each individual directory/file name during listing; matching names are skipped from file listing, partition discovery, and reads (a matching directory excludes its whole subtree). The per-read option overrides the session config. Both batch and Structured Streaming reads are covered.

Regardless of the regex, three rules are hardcoded:

  • _metadata/_common_metadata (Parquet summary files) are always listed
  • *._COPYING_ (in-flight Hadoop FS-shell copies) are always skipped
  • _-prefixed partition dirs containing = are always kept.

A regex that never matches, such as (?!), disables the generic hidden-file filter; the empty string is special-cased to behave the same way.
Two carve-outs: table statistics (CommandUtils) pin the default regex so ANALYZE sizes are unaffected by the session conf; and Parquet schema inference now excludes zero-length files from footer candidates so a directory containing a 0-byte _SUCCESS marker stays readable when hidden files are surfaced.
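
To make the interaction between the three hardcoded rules and the regex concrete, here is a minimal Python sketch for a single name. It is not the code in this PR: `should_skip_name` is a hypothetical helper, Python's `re.search` stands in for Java's find semantics, and the relative order of the hardcoded checks is an assumption.

```python
import re

DEFAULT_REGEX = r"^[._]"  # default stated in the description above


def should_skip_name(name, regex=DEFAULT_REGEX):
    """Return True if a single file or directory name would be skipped."""
    # Rule 1: Parquet summary files are always listed.
    if name.startswith("_metadata") or name.startswith("_common_metadata"):
        return False
    # Rule 2: in-flight FS-shell copies are always skipped.
    if name.endswith("._COPYING_"):
        return True
    # Rule 3: "_"-prefixed partition directories (containing "=") are always kept.
    if name.startswith("_") and "=" in name:
        return False
    # Otherwise the regex decides, matched anywhere in the name (find semantics).
    return re.search(regex, name) is not None
```

With the default, `should_skip_name("_SUCCESS")` is true and `should_skip_name("_col=1")` is false; with `"(?!)"`, `_SUCCESS` is kept while `data.csv._COPYING_` is still skipped.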

User documentation is added to sql-data-sources-generic-options.md and the Structured Streaming guide.

Why are the changes needed?

Hidden files listing is being requested as a feature. We want to enable a user-facing option for this.

Does this PR introduce any user-facing change?

Yes. Setting the new option ignoredPathSegmentRegex = "" makes hidden files show up in all file listing operations. The default value preserves the current behavior of skipping hidden files.

How was this patch tested?

Added unit tests covering the data source option and session conf, exercising the file-listing path, partition discovery, batch reads, and structured streaming. Also added tests for the session conf in streaming, partition discovery with a hidden directory next to partition dirs, and reading an unmodified Spark-written parquet directory with the option enabled.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

@CalvQ CalvQ force-pushed the list-hidden-files-option branch 3 times, most recently from e8890da to 44a033d Compare June 9, 2026 01:37
@CalvQ CalvQ changed the title [SPARK-XXXXX][SQL] Add listHiddenFiles data source option [SPARK-57354][SQL] Add listHiddenFiles data source option Jun 9, 2026
@CalvQ CalvQ force-pushed the list-hidden-files-option branch from 44a033d to 8788bf3 Compare June 9, 2026 21:53
object FileSourceOptions {
  val IGNORE_CORRUPT_FILES = "ignoreCorruptFiles"
  val IGNORE_MISSING_FILES = "ignoreMissingFiles"
  val LIST_HIDDEN_FILES = "listHiddenFiles"

Contributor

I prefer this option name. Open to other suggestions. cc: @cloud-fan

Suggested change
- val LIST_HIDDEN_FILES = "listHiddenFiles"
+ val IGNORE_METADATA_FILES = "ignoreMetadataFiles"

Contributor

I prefer having a hiddenFileRegex instead that by default is ^[\.\_]. That would give the users flexibility to keep files that start with . as a hidden file but keep _ as readable, etc.
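
A small sketch of the flexibility being proposed here, in Python with `re.search` standing in for the Java regex match. The file names and the `visible` helper are made up for illustration, and the hardcoded special cases discussed later in this thread are deliberately left out.

```python
import re

names = ["_staging", ".DS_Store", "part-0.parquet"]  # hypothetical directory contents


def visible(names, regex):
    """Keep the names that the regex does not match."""
    return [n for n in names if not re.search(regex, n)]


default_view = visible(names, r"^[._]")  # hides both "_" and "." names
dot_only_view = visible(names, r"^\.")   # hides only "." names, so "_staging" is readable
```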

Contributor Author

This is possible with minor logic changes: there's a full-path filter and a path segment filter. To keep behavior consistent we would need to walk the full-path to check our regex against each segment for filtering.

Contributor

I'd keep the boolean listHiddenFiles.

Re hiddenFileRegex: the current filter isn't actually expressible as a name-prefix regex, which makes a regex option more misleading than flexible. Today's rule also drops *._COPYING_ by suffix, exempts _metadata/_common_metadata (Parquet summary files), and special-cases _x=y names (startsWith("_") && !contains("=")). So ^[\.\_] as the "behavior-preserving default" is already subtly wrong, and exposing the rule as a user-supplied regex either loses these special cases or forces users to understand them. The flexibility use case (e.g. surface _ files but keep . files hidden) is already covered by combining this option with pathGlobFilter. If a real need shows up later, a regex option can still be added alongside the boolean without breaking anything.

Re ignoreMetadataFiles: the filtered set is broader than metadata files (a .foo.json sidecar isn't metadata), and the default would be true, giving users a double negative. "Hidden files" is the established Hadoop convention (FileInputFormat.hiddenFileFilter is exactly the _/. prefix rule), so listHiddenFiles describes what the mechanism does.

Contributor

no, the regex would apply on every directory and file name independently. So you would walk up the directory path if you got a full list, or if you're performing a recursive list, you would just filter out by path.getName. the regex doesn't apply on the full path, it applies on the directory/file names
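
The per-name matching described in this comment could look roughly like the following when only a full path is available. This is a sketch, not the PR's implementation: `path_is_ignored` is a hypothetical name, Python's `re.search` stands in for the Java match, and the hardcoded special cases are omitted.

```python
import re


def path_is_ignored(path, regex=r"^[._]"):
    """True if any directory or file name along the path matches the regex."""
    pattern = re.compile(regex)
    # Split into individual names; the regex never sees the full path.
    segments = [s for s in path.split("/") if s]
    return any(pattern.search(segment) for segment in segments)
```

Because each name is matched on its own, `^` anchors to the start of a segment, so `/data/_tmp/part-0.parquet` is ignored while `/data/a_b/file.name.parquet` is not.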

Contributor Author

I'm thinking to apply the regex after the hardcoded edge cases: we always keep _metadata/_common_metadata, always drop *._COPYING_, and always keep _x=y names. The regex only replaces the generic _/. rule, so we can keep our default as ^[._], and a user-supplied regex cannot change the special rules we hardcode. WDYT? @cloud-fan

Contributor

I think we should make the user intention clear here. The hidden file definition is from the file system, and it's a fact. Reading hidden file or not is the user intention. I'm fine with a regex option but let's make sure the option name reflects the user intention.

Contributor Author

Makes sense. What do you think about ignoredPathSegmentRegex? It would represent the path segments that a user wants ignored during listing, and would default to the normal hidden-file filter for _/. names. There has also been no demand for listing Spark-specific hidden files, so I want to keep those rules hardcoded unless demand for that capability shows up.

Contributor

ignoredPathSegmentRegex SGTM!

Comment thread core/src/test/scala/org/apache/spark/util/HadoopFSUtilsSuite.scala Outdated

@cloud-fan cloud-fan left a comment

Contributor

1 blocking, 3 non-blocking, 2 nits.
Solid, pattern-consistent implementation with thorough tests; the one real defect is the flag-unaware FileStatusCache interaction for catalog tables.

Correctness (1)

  • InMemoryFileIndex.scala:159: toggling the session conf on a catalog table returns stale FileStatusCache listings — see inline

Design / architecture (1)

  • HadoopFSUtils.scala:363: two callers of shouldFilterOutPathName still hardcode the default (ArchiveReader parity claim, DataSource warning) — see inline

Suggestions (2)

  • ParquetUtils.scala:148: zero-length inference change deserves a migration-guide entry — see inline
  • FileBasedDataSourceSuite.scala:1465: add the partitioned Spark-written-layout test — see inline

Nits: 2 minor items (see inline comments).

new FileSourceOptions(CaseInsensitiveMap(parameters)).ignoreMissingFiles
val fileSourceOptions = new FileSourceOptions(CaseInsensitiveMap(parameters))
val ignoreMissingFiles = fileSourceOptions.ignoreMissingFiles
val listHiddenFiles = fileSourceOptions.listHiddenFiles

Contributor

This interacts badly with FileStatusCache: cache entries are keyed by path only, but the listing they hold now depends on listHiddenFiles. Path-based reads are safe (each read gets a fresh cache client), but CatalogFileIndex keeps one cache client for the table's lifetime and resolves the flag from the session conf at listing time — so toggling spark.sql.files.listHiddenFiles between queries on a catalog table silently returns the listing cached under the old value (in both directions) until REFRESH TABLE or eviction.

I'd bypass the cache (skip both getLeafFiles and putLeafFiles in listLeafFiles) when listHiddenFiles is true, so cached entries always hold the canonical hidden-filtered listing. That's simpler than widening the cache key, and re-listing on an opt-in flag is an acceptable cost.
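
A toy model of the staleness problem and the proposed bypass, written against the regex form the option eventually took rather than the boolean under review here. Everything in it is hypothetical (the `FILES` dict, the `cache` dict, `list_leaf_files`); it only mimics a cache keyed by path and is not Spark's FileStatusCache.

```python
import re

DEFAULT_REGEX = r"^[._]"
FILES = {"/tbl": ["_SUCCESS", "part-0.parquet"]}  # toy file system
cache = {}  # keyed by path only, like the cache described above


def list_leaf_files(path, regex, bypass_non_default):
    # With the bypass on, only the default (canonical) listing touches the cache.
    use_cache = not (bypass_non_default and regex != DEFAULT_REGEX)
    if use_cache and path in cache:
        return cache[path]
    listing = [n for n in FILES[path] if not re.search(regex, n)]
    if use_cache:
        cache[path] = listing
    return listing


# Without the bypass, the second call returns the listing cached under the old regex.
stale = (list_leaf_files("/tbl", DEFAULT_REGEX, False),
         list_leaf_files("/tbl", "(?!)", False))
cache.clear()
# With the bypass, the non-default regex triggers a fresh listing.
fresh = (list_leaf_files("/tbl", DEFAULT_REGEX, True),
         list_leaf_files("/tbl", "(?!)", True))
```

In this model the bypass trades a re-listing on every non-default read for a cache that only ever holds the default-filtered listing.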

Contributor

We don't need a SQL conf if FileSourceOptions is used. And CatalogFileIndex will ignore FileSourceOptions. So we should be ok with enabling caching.

Contributor Author

Do you mean removing the SQLConf entirely? Existing patterns such as ignoreCorruptFiles and ignoreMissingFiles keep both a session conf and a per-read option. Removing the conf would be cache safe, but to match existing patterns I think it would make sense to keep both the conf and per-read option -- this would require the caching bypass
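
For reference, the option-overrides-conf pattern can be sketched as below, using the final option and conf names from the PR description. `resolve_ignored_regex` is a hypothetical helper operating on plain dicts, not Spark API; the case-insensitive lookup is an assumption modelled on the CaseInsensitiveMap in the diff above.

```python
DEFAULT_REGEX = r"^[._]"
OPTION_KEY = "ignoredPathSegmentRegex"
CONF_KEY = "spark.sql.files.ignoredPathSegmentRegex"


def resolve_ignored_regex(read_options, session_conf):
    """Per-read option wins; otherwise fall back to the session conf, then the default."""
    # Option keys are compared case-insensitively (assumption, see lead-in).
    lowered = {k.lower(): v for k, v in read_options.items()}
    if OPTION_KEY.lower() in lowered:
        return lowered[OPTION_KEY.lower()]
    return session_conf.get(CONF_KEY, DEFAULT_REGEX)
```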

Comment thread core/src/main/scala/org/apache/spark/util/HadoopFSUtils.scala Outdated
Comment thread sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala Outdated
Comment thread docs/sql-data-sources-generic-options.md Outdated
Comment thread docs/sql-data-sources-generic-options.md Outdated
# $example on:ignored_path_segment_regex$
# "(?!)" surfaces files that are hidden by default (e.g. names starting with "_" or ".")
surfaced_df = spark.read.format("parquet")\
.option("ignoredPathSegmentRegex", "(?!)")\

Contributor

Wouldn't empty string value be simpler?

@CalvQ CalvQ Jun 18, 2026

Contributor Author

Empty string using the .find pattern actually matches all strings, which would mean we filter everything.

I agree that the "" pattern intuitively means allow everything, so I've added a specific edge-case for this, and swapped all examples to this

PTAL if the edge case is better or you want to instead just explain the "" pattern behavior
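
The behavior in question can be reproduced outside Spark. The sketch below uses Python's `re.search`, which behaves like find semantics for these two patterns; `compile_ignored_regex` is a hypothetical helper that mirrors the special case described here and is not the PR's code.

```python
import re

NEVER_MATCHES = "(?!)"  # empty negative lookahead: fails at every position


def compile_ignored_regex(regex):
    # Special case: "" is treated as "ignore nothing" instead of "match everything".
    return re.compile(NEVER_MATCHES if regex == "" else regex)


names = ["_SUCCESS", ".part.crc", "part-00000.parquet"]

# Without the special case, an empty pattern finds a match in every name,
# so every name would be filtered out.
naive = [n for n in names if re.search("", n)]
# With it, nothing matches, so nothing is filtered out.
special_cased = [n for n in names if compile_ignored_regex("").search(n)]
```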

Regardless of the regex, three rules always apply: names starting with `_metadata` or `_common_metadata` (Parquet summary files) are always listed, names ending in
`._COPYING_` (in-flight copies) are always skipped, and `_`-prefixed names containing `=` (partition directories) are always kept.

A regex that never matches, such as `(?!)`, disables the generic hidden-file filtering and surfaces hidden files, including Spark-internal marker files such as

Contributor

An issue with empty pattern string ""?

@CalvQ CalvQ Jun 18, 2026

Contributor Author

Explained #56374 (comment), but the empty pattern string "" actually matches every string, meaning we would filter out everything.

Currently special-casing it to match nothing (disabling filtering), but lmk if you don't think the edge case is necessary

@cloud-fan cloud-fan left a comment

Contributor

6 addressed, 0 remaining, 3 new. (3 newly introduced, 0 late catches.) All six prior findings resolved; the redesign to a regex is clean and very well-tested. The two new blockers are housekeeping that rode in with the redesign.

Design / architecture (1)

  • PR title/description still describe the old boolean listHiddenFiles option + spark.sql.files.listHiddenFiles conf; the implementation is the ignoredPathSegmentRegex regex. Please update both (see PR description suggestions).

Correctness (1)

  • SQLConf.scala:2793 / sql-migration-guide.md:25: version should be 4.3.0 (next open feature release), not 5.0.0 — see inline.

Suggestions (1)

  • FileSourceOptions.scala:58: compile + validate the Pattern once here instead of re-compiling at ~5 sites (reinforces @sandip-db's CSVDataSource thread) — see inline.
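
The compile-once suggestion amounts to a lazily computed, cached pattern on the options object. A minimal Python analogue, with a hypothetical class that is not Spark's FileSourceOptions:

```python
import re
from functools import cached_property


class FileSourceOptionsSketch:
    """Hypothetical stand-in for an options holder."""

    def __init__(self, ignored_path_segment_regex=r"^[._]"):
        self.ignored_path_segment_regex = ignored_path_segment_regex

    @cached_property
    def ignored_path_segment_pattern(self):
        # Compiled (and thereby validated) on first access, then reused by all callers.
        try:
            return re.compile(self.ignored_path_segment_regex)
        except re.error as e:
            raise ValueError(
                f"Invalid ignoredPathSegmentRegex: {self.ignored_path_segment_regex!r}"
            ) from e
```

Callers then share one compiled pattern, and an invalid regex fails in one place with one error message.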

Verification

Traced the HadoopFSUtils.shouldFilterOutPath rewrite (recursive _metadata logic → per-segment split("/") walk): equivalent to the pre-PR logic for the default ^[._] across _SUCCESS / .-prefixed files / _x=y partition dirs / _metadata leaf+dir cases; the two divergences (/_foo/_metadata, /x/_metadata.json) are the intentional unifications the new HadoopFSUtilsSuite cases assert. isDataPath and the FileStatusCache bypass key off the same resolved regex as the listing filter.

PR description suggestions

  • Rewrite the title Add listHiddenFiles data source option to the ignoredPathSegmentRegex design.
  • Rewrite the body: it describes a boolean listHiddenFiles option + a spark.sql.files.listHiddenFiles conf (default false). Replace with the regex option/conf (default ^[._], find semantics, option-overrides-conf), the three hardcoded carve-outs (_metadata/_common_metadata kept, *._COPYING_ skipped, _x=y partition dirs kept), and the (?!) disable-filtering escape hatch.

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Outdated
Comment thread docs/sql-migration-guide.md Outdated
@CalvQ CalvQ changed the title [SPARK-57354][SQL] Add listHiddenFiles data source option [SPARK-57354][SQL] Add ignoredPathSegmentRegex data source option Jun 18, 2026
@CalvQ CalvQ changed the title [SPARK-57354][SQL] Add ignoredPathSegmentRegex data source option [SPARK-57354][SQL] Add ignoredPathSegmentRegex data source option and config Jun 18, 2026
brkyvz commented Jun 18, 2026 via email

Contributor

Calvin Qin added 5 commits June 18, 2026 21:19
Add a `listHiddenFiles` data source option plus a
`spark.sql.files.listHiddenFiles` session configuration (default false) for
file-based sources such as Parquet, JSON, ORC and CSV. When enabled, files and
directories whose names start with `_` or `.` -- including `._COPYING_` files --
participate in file listing, partition discovery, and reads instead of being
skipped. The per-read option overrides the session configuration. Both batch and
Structured Streaming reads are covered.

The flag is threaded through the listing path (HadoopFSUtils.listFiles /
parallelListLeafFiles / listLeafFiles), the SQL file index
(InMemoryFileIndex.bulkListLeafFiles and PathFilterWrapper), partition discovery
(PartitioningAwareFileIndex.isDataPath), and file streaming options
(FileStreamOptions). When true, the hidden-file filter is bypassed entirely; users
can narrow results with `pathGlobFilter`.

Co-authored-by: Isaac
…ter, version 4.3.0

- Add FileSourceOptions.ignoredPathSegmentRegexPattern lazy val and shared
  compileIgnoredPathSegmentRegex; drop the 5 duplicate Pattern.compile sites
  and the duplicate require in PartitioningAwareFileIndex
- Empty string now disables hidden-file filtering (replaces require(nonEmpty)
  and the '(?!)' escape hatch); compile maps empty to a never-matching pattern
- Simplify PathFilterWrapper: HadoopFSUtils.listFiles/parallelListLeafFiles
  already apply the regex per path segment, so the wrapper check was redundant
- SQLConf version 4.3.0; migration-guide bullet moved into the 4.3 section
- Docs and Scala/Java/Python/R examples use "" as the disable value
- Update tests for the new semantics; wrap a few over-100-char lines

Co-authored-by: Isaac
CalvQ commented Jun 18, 2026

Contributor Author

> Empty string should allow everything not filter it out. Otherwise how do you allow ingesting everything?

Agreed that "" makes intuitive sense to surface everything. Originally the empty pattern would match every string, so using "" would filter everything out. Previously I used (?!) as an example, which matches nothing and disables filtering.

I've now special cased "" to the match-nothing pattern, so both options disable filtering and list all files

@CalvQ CalvQ force-pushed the list-hidden-files-option branch from 4199c6e to 83543a0 Compare June 22, 2026 21:25

@cloud-fan cloud-fan left a comment

Contributor

2 addressed, 1 remaining, 0 new. (0 newly introduced, 0 late catches.) Round-2 findings #2 (version 4.3.0) and #3 (centralized Pattern compile) are resolved. Clean otherwise; the one remaining item is the stale PR-description section.

Design / architecture (1)

  • PR description's "Does this PR introduce any user-facing change?" section still describes the removed listHiddenFiles = true boolean instead of ignoredPathSegmentRegex — see PR description suggestions. (Unresolved remainder of the round-2 description finding; non-blocking.)

Verification

Re-checked that this round's housekeeping is behavior-preserving: the centralized compileIgnoredPathSegmentRegex + lazy Pattern produces the same effective filter as the prior per-site Pattern.compile; "" maps to a never-matching pattern (avoiding the empty-regex-matches-everything trap), so it disables the generic filter while the three hardcoded carve-outs still apply; and the simplified PathFilterWrapper drops only a check that was a strict subset of the per-segment filtering HadoopFSUtils.listFiles/parallelListLeafFiles already apply with the same threaded regex. useFileStatusCache correctly compares the resolved regex string against the default.

PR description suggestions

  • Fix the "Does this PR introduce any user-facing change?" section to describe the ignoredPathSegmentRegex option/conf (default ^[._]; empty string or (?!) disables filtering; per-read option overrides the session conf) instead of the removed listHiddenFiles = true boolean.

cloud-fan commented
Contributor

thanks, merging to master/4.x

@cloud-fan cloud-fan closed this in 0417f91 Jun 23, 2026
cloud-fan pushed a commit that referenced this pull request Jun 23, 2026
… config

### What changes were proposed in this pull request?

Add an `ignoredPathSegmentRegex` data source option plus a `spark.sql.files.ignoredPathSegmentRegex` session config (default `^[._]`) for file-based sources. The value is a Java regex matched with find semantics against each individual directory/file name during listing; matching names are skipped from file listing, partition discovery, and reads (a matching directory excludes its whole subtree). The per-read option overrides the session config. Both batch and Structured Streaming reads are covered.

Regardless of the regex, three rules are hardcoded:
- `_metadata/_common_metadata` (Parquet summary files) are always listed
- `*._COPYING_` (in-flight Hadoop FS-shell copies) are always skipped
- `_`-prefixed partition dirs containing `=` are always kept.

A regex that never matches, such as `(?!)`, disables the generic hidden-file filter; the empty string is special-cased to behave the same way.
Two carve-outs: table statistics (CommandUtils) pin the default regex so ANALYZE sizes are unaffected by the session conf; and Parquet schema inference now excludes zero-length files from footer candidates so a directory containing a 0-byte _SUCCESS marker stays readable when hidden files are surfaced.

User documentation is added to `sql-data-sources-generic-options.md` and the Structured Streaming guide.

### Why are the changes needed?

Hidden files listing is being requested as a feature. We want to enable a user-facing option for this.

### Does this PR introduce _any_ user-facing change?

Yes. Setting the new option `ignoredPathSegmentRegex = ""` makes hidden files show up in all file listing operations. The default value preserves the current behavior of skipping hidden files.

### How was this patch tested?

Added unit tests covering the data source option and session conf, exercising the file-listing path, partition discovery, batch reads, and structured streaming. Also added tests for the session conf in streaming, partition discovery with a hidden directory next to partition dirs, and reading an unmodified Spark-written parquet directory with the option enabled.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.8

Closes #56374 from CalvQ/list-hidden-files-option.

Lead-authored-by: Calvin Qin <calvin.qin.ca@gmail.com>
Co-authored-by: Calvin Qin <calvin.qin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0417f91)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>