From 2618e91079a81f71c14a8fba0a2bf477e80ae793 Mon Sep 17 00:00:00 2001
From: Wenchen Fan <wenchen@databricks.com>
Date: Wed, 29 Apr 2026 18:31:07 +0000
Subject: [PATCH 1/2] [SPARK-56666][INFRA] Reduce unidoc CI log noise with
 -Xdoclint:-missing and -verbose post-filter

### What changes were proposed in this pull request?

Refines the unidoc javacOptions in `JavaUnidoc / unidoc / javacOptions`
and the post-process stream filter in `docs/_plugins/build_api_docs.rb`
so that the Documentation generation CI log is small enough to scan
visually while still surfacing per-file `error: reference not found`
diagnostics on broken `{@link}` references.

Builds on the `-Xmaxerrs` and `-verbose` insight from
https://github.com/apache/spark/pull/55581 (SPARK-56630 follow-up):
javadoc's default `-Xmaxerrs 100` cap was hit by the ~100 inert
genjavadoc-stub errors during source loading, so doclint never ran on
the real sources, and the per-file `error: reference not found`
diagnostics surfaced only with `-verbose`. That PR's flag set
(`-Xmaxerrs 999999`, `-Xmaxwarns 999999`, `-verbose`) achieved the
diagnostic goal but at a ~77K-line CI log per run.

This PR keeps the diagnostic visibility and brings the visible CI log
down to ~4K lines (95% reduction), with four changes:

1. **`-Xmaxerrs 0`** instead of `-Xmaxerrs 999999`. The `0` value is
   treated as unlimited by javadoc (locally verified) and reads
   cleaner than the magic number.

2. **`-Xdoclint:all` + `-Xdoclint:-missing`** (two separate flags,
   matching the existing `Compile / doc / javacOptions` pattern in
   `SparkBuild.scala`). Suppresses the `missing` doclint group at
   javadoc level: the ~22K `no comment` / `no @param` / `no @return`
   / `no @throws` warnings (each rendered as a 3-line block) that
   dominate the log on every Spark unidoc run. The two-flag form is
   load-bearing: a bare `-Xdoclint:-missing` demotes the other
   doclint groups (notably `reference`) to warning level, making
   broken `{@link}` references non-fatal; passing `-Xdoclint:all`
   first keeps `reference` at error level. Locally verified.

3. **Drop `-Xmaxwarns 999999`.** Warnings don't fail CI; error
   visibility is governed by `-Xmaxerrs`, not `-Xmaxwarns`. javadoc's
   default cap of 100 warnings is sufficient: it shows a sample of
   any remaining warnings without flooding the log. Saves ~4K lines
   beyond `-Xdoclint:-missing` alone.

4. **Post-filter `-verbose` progress lines from the build_api_docs.rb
   stream.** `-verbose` itself stays (it is load-bearing for per-file
   `error: reference not found` emission per #55581), but its
   progress noise -- `Loading source file ...`, `[parsing
   started/completed]`, `[loading /path/X.class]`, `Generating
   /path/X.html` -- carries no diagnostic signal. The existing stream
   filter is extended with a `verbose_line` regex that drops these
   single-line progress entries from stdout. Saves ~13K lines.
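
As a minimal sketch of the fourth change, the snippet below applies the `verbose_line` pattern from this patch to a handful of log lines. The sample lines, their `/tmp/example/...` paths, and the `drop_verbose_progress` helper are illustrative assumptions, not captured CI output or code from `build_api_docs.rb`; the real plugin filters the sbt stream line by line inside `IO.popen`.

```ruby
# Sketch only: same pattern as the patch's verbose_line, applied to made-up lines.
ANSI = /\e\[[0-9;]*[A-Za-z]/
VERBOSE_LINE = %r{
  \[(?:error|warn)\]\s+
  (?:Loading\s+source\s+file\s
   |\[parsing\s+(?:started|completed)\s
   |\[loading\s
   |\[checking\s
   |\[wrote\s
   |Generating\s+\S+\.html
  )
}x

# Hypothetical helper: strip ANSI colour codes, then drop -verbose progress lines.
def drop_verbose_progress(lines)
  lines.reject { |line| line.gsub(ANSI, '') =~ VERBOSE_LINE }
end

# Assumed sample input; only the last line is a real diagnostic.
SAMPLE = [
  "[error] Loading source file /tmp/example/Foo.java...",
  "[error] [parsing started /tmp/example/Foo.java]",
  "[error] [loading /tmp/example/Foo.class]",
  "[warn] Generating /tmp/example/Foo.html...",
  "[error] /tmp/example/Foo.java:12:5: error: reference not found",
]

KEPT = drop_verbose_progress(SAMPLE)
puts KEPT
```

Only the `error: reference not found` line survives, because none of the progress alternatives match a `path:line:col:` diagnostic header.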

### Why are the changes needed?

Documentation generation CI logs were ~77K lines per run after
SPARK-56630's flag set. That is large enough that scanning for
diagnostics by eye is impractical, and grep-piping is the only
reasonable workflow. Most of the volume is structural noise (genjavadoc
stub errors, `no comment` warnings, `-verbose` progress markers) with
no diagnostic signal. After this PR the log is ~4K lines on a
real-failure run, and the per-file `error: reference not found`
diagnostics that PR #55581 added are its dominant content.

Empirical breakdown of the reduction (verified via test PR #55605
with deliberately broken `{@link}` plants in both a real `.java`
source and a Scala source):

| State (cumulative)                         | Log lines | Vs baseline |
| ------------------------------------------ | --------: | ----------: |
| PR #55581's flag set (baseline)            |       77K |             |
| Add `-Xdoclint:all` + `-Xdoclint:-missing` |       22K |        -71% |
| Drop `-Xmaxwarns 999999`                   |       18K |        -77% |
| Post-filter `-verbose` progress            |  **~4K**  |    **-95%** |

All four diagnostic targets remain visible in the final form: 2
broken `{@link}`s in `ColumnarMap.java` (Java source) and 2 broken
`[[Class.member]]`-style refs in a Scala source via the genjavadoc
stub.
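
The percentages in the table above can be re-derived from the rounded line counts. The sketch below is only an arithmetic check; the step labels and the `reduction_pct` helper are names chosen here for illustration.

```ruby
# Rounded log sizes (in thousands of lines) copied from the table above.
BASELINE_K = 77
STEPS_K = {
  'two-flag doclint' => 22,
  'drop -Xmaxwarns'  => 18,
  'post-filter'      => 4,
}

# Percent reduction relative to the baseline, rounded to the nearest integer.
def reduction_pct(lines_k, baseline_k)
  ((1 - lines_k.to_f / baseline_k) * 100).round
end

REDUCTIONS = STEPS_K.transform_values { |k| reduction_pct(k, BASELINE_K) }
REDUCTIONS.each { |step, pct| puts "#{step}: -#{pct}%" }
```

Each row is measured against the 77K baseline, not against the previous row, which is why the steps read -71%, -77%, -95%.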

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested end-to-end on PR #55605 (testing-only fork PR) with planted
broken `{@link}` references in both code paths:

- `ColumnarMap.java` (real Java source): `{@link
  org.apache.spark.deliberately.NoSuchClass}` and `{@link
  ColumnVector#nonExistentMethod()}`.

- `Partition.scala` (Scala source via genjavadoc): `[[Partition.index]]`
  -- the wrong `.` separator that javadoc reads as inner-class lookup
  and fails to resolve. This is the case PR #55581's AGENTS.md note
  documents as the most common scaladoc-side cause of unidoc failure.

Both code paths surfaced per-file `error: reference not found`
diagnostics in the CI log on the test branch, and doc gen failed as
expected. Log size dropped to 3,977 lines, and zero
`Loading source file` / `[parsing started]` / `[loading X.class]` /
`Generating *.html` / `no comment` lines remained visible.

`-Xmaxerrs 0` and the bare-`-Xdoclint:-missing` demotion behavior
were verified locally with standalone javadoc invocations on a
minimal test file.
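
A local check along those lines might be set up as in the sketch below, which only assembles the two javadoc command lines to compare. The `javadoc_argv` helper, the `Demo.java` file name, and the `out` directory are hypothetical; the exact invocations used for the verification are not recorded in this PR.

```ruby
# Hypothetical helper: builds a javadoc argv for a throwaway source file.
def javadoc_argv(doclint_flags, source: 'Demo.java', out_dir: 'out')
  ['javadoc', '-d', out_dir, '-Xmaxerrs', '0', *doclint_flags, source]
end

# Variant 1: bare -Xdoclint:-missing (the form this PR avoids).
BARE = javadoc_argv(['-Xdoclint:-missing'])
# Variant 2: the two-flag form added to JavaUnidoc / unidoc / javacOptions.
TWO_FLAG = javadoc_argv(['-Xdoclint:all', '-Xdoclint:-missing'])

[BARE, TWO_FLAG].each { |argv| puts argv.join(' ') }

# To run a variant for real (needs a JDK on PATH and a Demo.java
# containing a broken {@link}), one could call: system(*TWO_FLAG)
```

Comparing the exit status and output of the two runs on the same file would show whether a broken `{@link}` is reported as an error or only as a warning.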

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic)
---
 docs/_plugins/build_api_docs.rb | 41 ++++++++++++++++++++++-----------
 project/SparkBuild.scala        |  5 +++-
 2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/docs/_plugins/build_api_docs.rb b/docs/_plugins/build_api_docs.rb
index a75727d8b65db..a3a376ef1b58d 100644
--- a/docs/_plugins/build_api_docs.rb
+++ b/docs/_plugins/build_api_docs.rb
@@ -134,20 +134,20 @@ def build_spark_scala_and_java_docs_if_necessary
   command = "build/sbt -Pkinesis-asl unidoc"
   puts "Running '#{command}'..."
 
-  # Suppress genjavadoc-stub diagnostic blocks from the visible log. javadoc
-  # emits ~3500 `[error]` lines per unidoc run on stubs under `target/java/`
-  # -- all benign because `--ignore-source-errors` is set, but they bury
-  # everything else. Each diagnostic is a header line followed by 3-5
-  # `[error|warn]`-prefixed continuation lines (snippet, caret,
-  # symbol/location); the state machine drops both.
+  # Two filter passes on the unidoc output:
   #
-  # Match by *message text*, not just by `target/java/` path. Otherwise
-  # legitimate doclint diagnostics on stub paths would be hidden too --
-  # those messages are real signal. The patterns below are the known-benign
-  # genjavadoc structural errors; anything else on a `target/java/` path is
-  # echoed. Diagnostic mirror lines from `SparkUnidocDoclet` use the
-  # `[unidoc-doclet]` prefix and don't match either regex, so they always
-  # pass through.
+  # 1. Genjavadoc-stub diagnostic blocks (~28 `[error]` lines on stubs under
+  #    `target/java/`, plus 3-5 continuation lines each). Inert because
+  #    `--ignore-source-errors` is set; matched by message text so legitimate
+  #    doclint diagnostics on stub paths still pass through.
+  #
+  # 2. `-verbose` progress lines (~13K total): `Loading source file ...`,
+  #    `[parsing started/completed ...]`, `[loading /path/X.class]`,
+  #    `Generating .../X.html`. These are dominant in the log when `-verbose`
+  #    is set (which it is in `JavaUnidoc / unidoc / javacOptions` to surface
+  #    per-file `error: reference not found` diagnostics) but carry no signal
+  #    of their own. Suppressing them brings the visible log from ~17K to ~5K
+  #    lines on a typical run while leaving every diagnostic untouched.
   ansi = /\e\[[0-9;]*[A-Za-z]/
   stub_header = %r{
     \[(?:error|warn)\]\s+
@@ -159,11 +159,24 @@ def build_spark_scala_and_java_docs_if_necessary
      |.*?\s+is\s+not\s+public\s+in\s+\S+;\s+cannot\s+be\s+accessed\s+from\s+outside\s+package)
   }x
   stub_cont = %r{\A\s*\[(?:error|warn)\]\s+(?!/\S+\.java:\d+(?::\d+)?:\s)}
+  verbose_line = %r{
+    \[(?:error|warn)\]\s+
+    (?:Loading\s+source\s+file\s
+     |\[parsing\s+(?:started|completed)\s
+     |\[loading\s
+     |\[checking\s
+     |\[wrote\s
+     |Generating\s+\S+\.html
+    )
+  }x
   in_stub = false
   IO.popen("#{command} 2>&1", 'r') do |pipe|
     pipe.each_line do |line|
       plain = line.gsub(ansi, '')
-      if plain =~ stub_header
+      if plain =~ verbose_line
+        in_stub = false
+        # suppress -verbose progress line
+      elsif plain =~ stub_header
         in_stub = true
       elsif in_stub && plain =~ stub_cont
         # continuation of a stub block; suppress
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 33648bccc29e6..737f0f9ba0610 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -1698,7 +1698,10 @@ object Unidoc {
         "-tag", "todo:X",
         "-tag", "groupname:X",
         "-tag", "inheritdoc",
-        "--ignore-source-errors", "-notree"
+        "--ignore-source-errors", "-notree",
+        "-Xmaxerrs", "0",
+        "-verbose",
+        "-Xdoclint:all", "-Xdoclint:-missing"
       )
     },
 

From 524bfa84523261e6f0cf324102f02e2f5a5a812b Mon Sep 17 00:00:00 2001
From: Wenchen Fan <wenchen@databricks.com>
Date: Wed, 29 Apr 2026 18:49:42 +0000
Subject: [PATCH 2/2] Drop orphan scalaStyleOn{Compile,Test}/logLevel settings

These pre-dated SPARK-14790 (2016) and quieted the per-task log
output when scalastyle was hooked into (Compile/compile) and
(Test/compile). After SPARK-56636 decoupled scalastyle from compile,
the tasks are only invoked from dev/lint-scala, so the per-task
logLevel settings reference unread keys (sbt's lintUnused surfaces
them as warnings).
---
 project/SparkBuild.scala | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 737f0f9ba0610..f9694dbb89191 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -253,9 +253,7 @@ object SparkBuild extends PomBuild {
   // violations surface, with file/line annotations from `dev/scalastyle`.
   def enableScalaStyle: Seq[sbt.Def.Setting[_]] = Seq(
     scalaStyleOnCompile := cachedScalaStyle(Compile).value,
-    scalaStyleOnTest := cachedScalaStyle(Test).value,
-    (scalaStyleOnCompile / logLevel) := Level.Warn,
-    (scalaStyleOnTest / logLevel) := Level.Warn
+    scalaStyleOnTest := cachedScalaStyle(Test).value
   )
 
   lazy val compilerWarningSettings: Seq[sbt.Def.Setting[_]] = Seq(