|
| 1 | +# Root Cause Analysis: Issue #96 |
| 2 | + |
| 3 | +## The Bug |
| 4 | + |
| 5 | +`agent --version` (or any quick-completing command) run through screen isolation |
| 6 | +produces no output, even though the command succeeds with exit code 0. |
| 7 | + |
| 8 | +--- |
| 9 | + |
| 10 | +## Root Cause 1: GNU Screen's Periodic Log Flush (Native Logging Path, Screen ≥ 4.5.1) |
| 11 | + |
| 12 | +### GNU Screen log flush architecture |
| 13 | + |
| 14 | +GNU Screen's `-L` logging feature writes output to a libc `FILE*` buffer (not directly |
| 15 | +to disk). Flush is controlled by the `log_flush` variable (default: `10` seconds): |
| 16 | + |
| 17 | +```c |
| 18 | +// src/screen.c |
| 19 | +int log_flush = 10; |
| 20 | + |
| 21 | +// src/ansi.c — WLogString (called for every logged character) |
| 22 | +static void WLogString(Window *win, char *buf, size_t len) |
| 23 | +{ |
| 24 | + if (!win->w_log) return; |
| 25 | + logfwrite(win->w_log, buf, len); // writes to FILE* buffer |
| 26 | + if (!log_flush) |
| 27 | + logfflush(win->w_log); // only flushes immediately if log_flush == 0 |
| 28 | +} |
| 29 | +``` |
| 30 | +
|
| 31 | +With `log_flush = 10` (the default), `logfflush` is **not called** after each write. |
| 32 | +Instead, the periodic flush timer fires every 10 seconds: |
| 33 | +
|
| 34 | +```c |
| 35 | +// src/window.c — DoStartLog |
| 36 | +SetTimeout(&logflushev, n * 1000); // n = log_flush = 10 |
| 37 | +evenq(&logflushev); |
| 38 | +``` |
| 39 | + |
| 40 | +For `agent --version` which runs in ~50ms, the screen session exits before this timer |
| 41 | +ever fires. |
| 42 | + |
| 43 | +### Why fclose should flush (but may not) |
| 44 | + |
| 45 | +On graceful shutdown, `Finit()` → `FreeWindow()` → `logfclose()` → `fclose()` is called. |
| 46 | +POSIX guarantees that `fclose()` flushes the stdio buffer before closing. So in the |
| 47 | +**normal exit path**, data is not lost. |
| 48 | + |
| 49 | +However, if screen is terminated via a signal that invokes `_exit()` rather than `exit()`, |
| 50 | +stdio buffers are NOT flushed. This can happen when: |
| 51 | +1. The child process terminates and screen receives SIGCHLD |
| 52 | +2. Screen's signal handler calls `_exit()` directly in certain code paths |
| 53 | + |
| 54 | +This makes the issue **intermittent** — it depends on the OS, screen version, and exact |
| 55 | +timing of the signal handling. |
| 56 | + |
| 57 | +### The fix |
| 58 | + |
| 59 | +Add `logfile flush 0` to a temporary screenrc passed via `screen -c`: |
| 60 | + |
| 61 | +``` |
| 62 | +logfile flush 0 |
| 63 | +``` |
| 64 | + |
| 65 | +This sets `log_flush = 0` in screen's internal variable, causing `WLogString` to call |
| 66 | +`logfflush()` after every write: |
| 67 | + |
| 68 | +```c |
| 69 | +if (!log_flush) |
| 70 | + logfflush(win->w_log); // now always executes |
| 71 | +``` |
| 72 | +
|
| 73 | +This completely eliminates the flush delay for log file writes. |
| 74 | +
|
| 75 | +--- |
| 76 | +
|
| 77 | +## Root Cause 2: Tee Pipe Buffering Race (Tee Fallback, Screen < 4.5.1) |
| 78 | +
|
| 79 | +### macOS bundled screen 4.0.3 |
| 80 | +
|
| 81 | +On macOS, the system-bundled GNU Screen is version 4.0.3, which predates the `-Logfile` |
| 82 | +option (added in 4.5.1). The code falls back to: |
| 83 | +
|
| 84 | +```js |
| 85 | +effectiveCommand = `(${effectiveCommand}) 2>&1 | tee "${logFile}"`; |
| 86 | +``` |
| 87 | + |
| 88 | +The screen session runs: |
| 89 | +``` |
| 90 | +screen -dmS session /bin/zsh -c "(agent --version) 2>&1 | tee /tmp/logfile" |
| 91 | +``` |
| 92 | + |
| 93 | +### The race condition |
| 94 | + |
| 95 | +When `agent --version` completes in ~5ms: |
| 96 | +1. The agent process writes `0.13.2\n` to stdout |
| 97 | +2. The tee process receives the data and writes it to the log file |
| 98 | +3. The zsh process exits |
| 99 | +4. Screen detects the child exit and terminates the session |
| 100 | + |
| 101 | +The RACE: between step 2 (tee writes to OS page cache) and our poller reading the file, |
| 102 | +there may be a brief window where: |
| 103 | +- The log file exists but the `write()` syscall from tee hasn't been committed to the |
| 104 | + page cache yet (on some OS implementations) |
| 105 | +- OR the file still has 0 bytes because tee's userspace buffer hasn't been `fwrite`'d |
| 106 | + yet when screen terminates it |
| 107 | + |
| 108 | +This is a TOCTOU (time-of-check-time-of-use) race between the session appearing gone in |
| 109 | +`screen -ls` and the log file having its complete content. |
| 110 | + |
| 111 | +### The fix |
| 112 | + |
| 113 | +The `logfile flush 0` fix in the screenrc does NOT directly help the tee fallback path |
| 114 | +(since tee is an external process, not governed by screen's log flush). However, an |
| 115 | +additional retry mechanism can be added: if the log file is empty when first read but the |
| 116 | +session just terminated, retry the read after a brief delay to let the OS flush complete. |
| 117 | + |
| 118 | +In practice, on Linux the tee fallback is not used (screen >= 4.5.1 is available), and |
| 119 | +on macOS the `logfile flush 0` option works on screen 4.0.3 as well (it's a screenrc |
| 120 | +command, not a version-gated feature). |
| 121 | + |
| 122 | +--- |
| 123 | + |
| 124 | +## Why This Is Intermittent |
| 125 | + |
| 126 | +The issue doesn't always reproduce because: |
| 127 | + |
| 128 | +1. **On Linux with screen 4.09.01**: The normal `fclose()` path usually flushes the buffer |
| 129 | + correctly. Only under certain timing conditions (SIGCHLD handling, `_exit`) does it fail. |
| 130 | + |
| 131 | +2. **On macOS with screen 4.0.3**: The tee fallback's race window is very small (~1ms). |
| 132 | + Most of the time the file has content when read. But for very fast commands with a |
| 133 | + busy system, the window widens. |
| 134 | + |
| 135 | +3. **The `screen -ls` timing**: The check `!sessions.includes(sessionName)` returns true |
| 136 | + as soon as the session process exits, but the OS may still be in the middle of flushing |
| 137 | + the log file's write buffers to the page cache. |
| 138 | + |
| 139 | +--- |
| 140 | + |
| 141 | +## Test Coverage Gap |
| 142 | + |
| 143 | +There were no tests specifically for: |
| 144 | +1. Quick-completing commands (like `--version` flags) in screen isolation |
| 145 | +2. Verification that version-string output (short, non-whitespace text) is captured |
| 146 | + |
| 147 | +The existing tests use `echo "hello"` which is also fast but the string is longer and |
| 148 | +may flush more reliably in certain conditions. |
0 commit comments