From 2618e91079a81f71c14a8fba0a2bf477e80ae793 Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Wed, 29 Apr 2026 18:31:07 +0000
Subject: [PATCH 1/2] [SPARK-56666][INFRA] Reduce unidoc CI log noise with -Xdoclint:-missing and -verbose post-filter

### What changes were proposed in this pull request?

Refines the unidoc javacOptions in `JavaUnidoc / unidoc / javacOptions` and the post-process stream filter in `docs/_plugins/build_api_docs.rb` so that the Documentation generation CI log is small enough to scan visually while still surfacing per-file `error: reference not found` diagnostics on broken `{@link}` references.

Builds on the `-Xmaxerrs` and `-verbose` insight from https://github.com/apache/spark/pull/55581 (SPARK-56630 follow-up): javadoc's default `-Xmaxerrs 100` cap was hit by the ~100 inert genjavadoc-stub errors during source loading, so doclint never ran on the real sources, and the per-file `error: reference not found` diagnostics surfaced only with `-verbose`. That PR's flag set (`-Xmaxerrs 999999`, `-Xmaxwarns 999999`, `-verbose`) achieved the diagnostic goal but at a ~77K-line CI log per run.

This PR keeps the diagnostic visibility and brings the visible CI log down to ~4K lines (95% reduction), with four changes:

1. **`-Xmaxerrs 0`** instead of `-Xmaxerrs 999999`. The `0` value is treated as unlimited by javadoc (locally verified) and reads cleaner than the magic number.
2. **`-Xdoclint:all` + `-Xdoclint:-missing`** (two separate flags, matching the existing `Compile / doc / javacOptions` pattern in `SparkBuild.scala`). Suppresses the `missing` doclint group at javadoc level: the ~22K `no comment` / `no @param` / `no @return` / `no @throws` warnings (each rendered as a 3-line block) that dominate the log on every Spark unidoc run.
   The two-flag form is load-bearing -- bare `-Xdoclint:-missing` alone demotes other doclint groups (notably `reference`) to warning level, making broken `{@link}` non-fatal; the explicit `-Xdoclint:all` first keeps reference at error level. Locally verified.
3. **Drop `-Xmaxwarns 999999`.** Warnings don't fail CI; error visibility is governed by `-Xmaxerrs`, not `-Xmaxwarns`. javadoc's default cap of 100 is sufficient -- shows a sample of any remaining warnings without flooding. Saves ~4K lines beyond `-Xdoclint:-missing` alone.
4. **Post-filter `-verbose` progress lines from the build_api_docs.rb stream.** `-verbose` itself stays (it is load-bearing for per-file `error: reference not found` emission per #55581), but its progress noise -- `Loading source file ...`, `[parsing started/completed]`, `[loading /path/X.class]`, `Generating /path/X.html` -- carries no diagnostic signal. The existing stream filter is extended with a `verbose_line` regex that drops these single-line progress entries from stdout. Saves ~13K lines.

### Why are the changes needed?

Documentation generation CI logs were ~77K lines per run after SPARK-56630's flag set. That is large enough that scanning for diagnostics by eye is impractical, and grep-piping is the only reasonable workflow. Most of the volume is structural noise (genjavadoc stub errors, `no comment` warnings, `-verbose` progress markers) with no diagnostic signal.

After this PR the log is ~4K lines on a real-failure run; the per-file `error: reference not found` diagnostics PR #55581 added are the dominant content.

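For readers who want to see the shape of the `-verbose` post-filter in isolation, here is a minimal standalone sketch. It is not the code in `build_api_docs.rb`: the constant names, the `drop_verbose_progress` helper, and the sample log lines (including their paths) are hypothetical, and the regex is a trimmed variant of the `verbose_line` pattern in the diff.

```ruby
# Hypothetical, simplified stand-in for the stream filter; names are illustrative only.
ANSI = /\e\[[0-9;]*[A-Za-z]/
VERBOSE_LINE = %r{
  \[(?:error|warn)\]\s+
  (?:Loading\s+source\s+file\s
    |\[parsing\s+(?:started|completed)\s
    |\[loading\s
    |Generating\s+\S+\.html
  )
}x

# Returns the lines that are NOT -verbose progress noise.
# ANSI color codes are stripped only for matching; kept lines are returned unchanged.
def drop_verbose_progress(lines)
  lines.reject { |line| line.gsub(ANSI, '') =~ VERBOSE_LINE }
end

# Made-up sample of sbt-prefixed javadoc output (paths are assumptions).
sample = [
  "[error] Loading source file /tmp/Foo.java...\n",
  "[error] [parsing started SimpleFileObject[/tmp/Foo.java]]\n",
  "[error] [loading /modules/java.base/java/lang/Object.class]\n",
  "[error] /tmp/Foo.java:12: error: reference not found\n",
  "[warn] Generating /tmp/api/Foo.html...\n",
]

kept = drop_verbose_progress(sample)
puts kept
```

Under these assumptions only the `error: reference not found` line survives, which is the property the filter is meant to preserve: progress lines are dropped by message shape, while any diagnostic that carries a `file:line:` prefix passes through.
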
Empirical breakdown of the reduction (verified via test PR #55605 with deliberately broken `{@link}` plants in both a real `.java` source and a Scala source):

| State                              | Log lines | Vs baseline |
| ---------------------------------- | --------: | ----------: |
| PR #55581's flag set (baseline)    |       77K |             |
| Add `-Xdoclint:all,-missing`       |       22K |        -71% |
| Drop `-Xmaxwarns 999999`           |       18K |        -77% |
| Post-filter `-verbose` progress    |   **~4K** |    **-95%** |

All four diagnostic targets remain visible in the final form: 2 broken `{@link}`s in `ColumnarMap.java` (Java source) and 2 broken `[[Class.member]]`-style refs in a Scala source via the genjavadoc stub.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested end-to-end on PR #55605 (testing-only fork PR) with planted broken `{@link}` references in both code paths:

- `ColumnarMap.java` (real Java source): `{@link org.apache.spark.deliberately.NoSuchClass}` and `{@link ColumnVector#nonExistentMethod()}`.
- `Partition.scala` (Scala source via genjavadoc): `[[Partition.index]]` -- the wrong `.` separator that javadoc reads as inner-class lookup and fails to resolve. This is the case PR #55581's AGENTS.md note documents as the most common scaladoc-side cause of unidoc failure.

Both surfaced as per-file `error: reference not found` diagnostics in the CI log on the test branch, doc gen failed as expected, log size dropped to 3,977 lines, and zero `Loading source file` / `[parsing started]` / `[loading X.class]` / `Generating *.html` / `no comment` lines remained visible.

`-Xmaxerrs 0` and the bare-`-Xdoclint:-missing` demotion behavior were verified locally with standalone javadoc invocations on a minimal test file.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic)
---
 docs/_plugins/build_api_docs.rb | 41 ++++++++++++++++++++++-----------
 project/SparkBuild.scala        |  5 +++-
 2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/docs/_plugins/build_api_docs.rb b/docs/_plugins/build_api_docs.rb
index a75727d8b65db..a3a376ef1b58d 100644
--- a/docs/_plugins/build_api_docs.rb
+++ b/docs/_plugins/build_api_docs.rb
@@ -134,20 +134,20 @@ def build_spark_scala_and_java_docs_if_necessary
   command = "build/sbt -Pkinesis-asl unidoc"
   puts "Running '#{command}'..."
 
-  # Suppress genjavadoc-stub diagnostic blocks from the visible log. javadoc
-  # emits ~3500 `[error]` lines per unidoc run on stubs under `target/java/`
-  # -- all benign because `--ignore-source-errors` is set, but they bury
-  # everything else. Each diagnostic is a header line followed by 3-5
-  # `[error|warn]`-prefixed continuation lines (snippet, caret,
-  # symbol/location); the state machine drops both.
+  # Two filter passes on the unidoc output:
   #
-  # Match by *message text*, not just by `target/java/` path. Otherwise
-  # legitimate doclint diagnostics on stub paths would be hidden too --
-  # those messages are real signal. The patterns below are the known-benign
-  # genjavadoc structural errors; anything else on a `target/java/` path is
-  # echoed. Diagnostic mirror lines from `SparkUnidocDoclet` use the
-  # `[unidoc-doclet]` prefix and don't match either regex, so they always
-  # pass through.
+  # 1. Genjavadoc-stub diagnostic blocks (~28 `[error]` lines on stubs under
+  #    `target/java/`, plus 3-5 continuation lines each). Inert because
+  #    `--ignore-source-errors` is set; matched by message text so legitimate
+  #    doclint diagnostics on stub paths still pass through.
+  #
+  # 2. `-verbose` progress lines (~13K total): `Loading source file ...`,
+  #    `[parsing started/completed ...]`, `[loading /path/X.class]`,
+  #    `Generating .../X.html`. These are dominant in the log when `-verbose`
+  #    is set (which it is in `JavaUnidoc / unidoc / javacOptions` to surface
+  #    per-file `error: reference not found` diagnostics) but carry no signal
+  #    of their own. Suppressing them brings the visible log from ~17K to ~5K
+  #    lines on a typical run while leaving every diagnostic untouched.
   ansi = /\e\[[0-9;]*[A-Za-z]/
   stub_header = %r{
     \[(?:error|warn)\]\s+
@@ -159,11 +159,24 @@ def build_spark_scala_and_java_docs_if_necessary
       |.*?\s+is\s+not\s+public\s+in\s+\S+;\s+cannot\s+be\s+accessed\s+from\s+outside\s+package)
   }x
   stub_cont = %r{\A\s*\[(?:error|warn)\]\s+(?!/\S+\.java:\d+(?::\d+)?:\s)}
+  verbose_line = %r{
+    \[(?:error|warn)\]\s+
+    (?:Loading\s+source\s+file\s
+      |\[parsing\s+(?:started|completed)\s
+      |\[loading\s
+      |\[checking\s
+      |\[wrote\s
+      |Generating\s+\S+\.html
+    )
+  }x
   in_stub = false
   IO.popen("#{command} 2>&1", 'r') do |pipe|
     pipe.each_line do |line|
       plain = line.gsub(ansi, '')
-      if plain =~ stub_header
+      if plain =~ verbose_line
+        in_stub = false
+        # suppress -verbose progress line
+      elsif plain =~ stub_header
         in_stub = true
       elsif in_stub && plain =~ stub_cont
         # continuation of a stub block; suppress
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 33648bccc29e6..737f0f9ba0610 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -1698,7 +1698,10 @@ object Unidoc {
         "-tag", "todo:X",
         "-tag", "groupname:X",
         "-tag", "inheritdoc",
-        "--ignore-source-errors", "-notree"
+        "--ignore-source-errors", "-notree",
+        "-Xmaxerrs", "0",
+        "-verbose",
+        "-Xdoclint:all", "-Xdoclint:-missing"
       )
     },
 
From 524bfa84523261e6f0cf324102f02e2f5a5a812b Mon Sep 17 00:00:00 2001
From: Wenchen Fan
Date: Wed, 29 Apr 2026 18:49:42 +0000
Subject: [PATCH 2/2] Drop orphan scalaStyleOn{Compile,Test}/logLevel settings

These pre-dated SPARK-14790 (2016) and quieted the per-task log output
when scalastyle was hooked into (Compile/compile) and (Test/compile).
After SPARK-56636 decoupled scalastyle from compile, the tasks are only
invoked from dev/lint-scala, so the per-task logLevel settings reference
unread keys (sbt's lintUnused surfaces them as warnings).
---
 project/SparkBuild.scala | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 737f0f9ba0610..f9694dbb89191 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -253,9 +253,7 @@ object SparkBuild extends PomBuild {
   // violations surface, with file/line annotations from `dev/scalastyle`.
   def enableScalaStyle: Seq[sbt.Def.Setting[_]] = Seq(
     scalaStyleOnCompile := cachedScalaStyle(Compile).value,
-    scalaStyleOnTest := cachedScalaStyle(Test).value,
-    (scalaStyleOnCompile / logLevel) := Level.Warn,
-    (scalaStyleOnTest / logLevel) := Level.Warn
+    scalaStyleOnTest := cachedScalaStyle(Test).value
   )
 
   lazy val compilerWarningSettings: Seq[sbt.Def.Setting[_]] = Seq(