Skip to content

Commit f34db75

Browse files
authored
Merge pull request #97 from link-foundation/issue-96-efde41b8c023
fix: capture output from quick-completing commands in screen isolation (issue #96)
2 parents 6d3559e + afd6b82 commit f34db75

File tree

15 files changed

+1243
-395
lines changed

15 files changed

+1243
-395
lines changed
Lines changed: 70 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,70 @@
1+
# Case Study: Issue #96`agent --version` output missing in screen isolation
2+
3+
## Summary
4+
5+
When running a quick-completing command (like `agent --version`) through screen isolation:
6+
7+
```
8+
$ --isolated screen -- agent --version
9+
```
10+
11+
the command executes successfully (exit code 0) but the output is **not displayed** — the
12+
version string `0.13.2` is silently swallowed.
13+
14+
## Reproduction
15+
16+
```
17+
$ --isolated screen -- agent --version
18+
│ session 1d4a5afb-00f3-40cb-b506-0edde9288b77
19+
│ isolation screen
20+
│ mode attached
21+
│ screen screen-1773491715194-1ppv87
22+
23+
$ agent --version
24+
25+
26+
27+
│ exit 0
28+
│ duration 0.391s
29+
```
30+
31+
The expected output `0.13.2` is missing between `$ agent --version` and the `` marker.
32+
33+
## Fix (v0.24.9 / PR #97)
34+
35+
Two changes were made:
36+
37+
### 1. Add `logfile flush 0` to screen's configuration
38+
39+
For the **native logging path** (`-L -Logfile`, screen ≥ 4.5.1), screen uses a periodic
40+
flush timer that fires every **10 seconds** by default. If the command completes and the
41+
screen session terminates before this timer fires, output buffered inside screen's internal
42+
`FILE*` buffer may not be flushed to the log file before the session ends.
43+
44+
The fix passes a temporary screenrc with `logfile flush 0` via screen's `-c` option. This
45+
forces screen to flush the log after every write, eliminating the race condition.
46+
47+
Before fix:
48+
```
49+
screen -dmS <session> -L -Logfile <logfile> <shell> -c '<command>'
50+
```
51+
52+
After fix:
53+
```
54+
screen -dmS <session> -c <screenrc> -L -Logfile <logfile> <shell> -c '<command>'
55+
```
56+
57+
where `<screenrc>` contains `logfile flush 0`.
58+
59+
### 2. Add integration test for quick-completing commands
60+
61+
Added a test case specifically for the issue: `runInScreen('agent --version')` must
62+
capture the version output correctly.
63+
64+
## See Also
65+
66+
- [root-cause.md](root-cause.md) — Detailed root cause analysis
67+
- [timeline.md](timeline.md) — Sequence of events
68+
- [solutions.md](solutions.md) — Solutions considered and chosen
69+
- Related: [Case Study issue-15](../issue-15/README.md) — Original screen output capture fix
70+
- Related: [Case Study issue-25](../issue-25/README.md) — Screen quoting fix
Lines changed: 148 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,148 @@
1+
# Root Cause Analysis: Issue #96
2+
3+
## The Bug
4+
5+
`agent --version` (or any quick-completing command) run through screen isolation
6+
produces no output, even though the command succeeds with exit code 0.
7+
8+
---
9+
10+
## Root Cause 1: GNU Screen's Periodic Log Flush (Native Logging Path, Screen ≥ 4.5.1)
11+
12+
### GNU Screen log flush architecture
13+
14+
GNU Screen's `-L` logging feature writes output to a libc `FILE*` buffer (not directly
15+
to disk). Flush is controlled by the `log_flush` variable (default: `10` seconds):
16+
17+
```c
18+
// src/screen.c
19+
int log_flush = 10;
20+
21+
// src/ansi.c — WLogString (called for every logged character)
22+
static void WLogString(Window *win, char *buf, size_t len)
23+
{
24+
if (!win->w_log) return;
25+
logfwrite(win->w_log, buf, len); // writes to FILE* buffer
26+
if (!log_flush)
27+
logfflush(win->w_log); // only flushes immediately if log_flush == 0
28+
}
29+
```
30+
31+
With `log_flush = 10` (the default), `logfflush` is **not called** after each write.
32+
Instead, the periodic flush timer fires every 10 seconds:
33+
34+
```c
35+
// src/window.c — DoStartLog
36+
SetTimeout(&logflushev, n * 1000); // n = log_flush = 10
37+
evenq(&logflushev);
38+
```
39+
40+
For `agent --version` which runs in ~50ms, the screen session exits before this timer
41+
ever fires.
42+
43+
### Why fclose should flush (but may not)
44+
45+
On graceful shutdown, `Finit()``FreeWindow()``logfclose()``fclose()` is called.
46+
POSIX guarantees that `fclose()` flushes the stdio buffer before closing. So in the
47+
**normal exit path**, data is not lost.
48+
49+
However, if screen is terminated via a signal that invokes `_exit()` rather than `exit()`,
50+
stdio buffers are NOT flushed. This can happen when:
51+
1. The child process terminates and screen receives SIGCHLD
52+
2. Screen's signal handler calls `_exit()` directly in certain code paths
53+
54+
This makes the issue **intermittent** — it depends on the OS, screen version, and exact
55+
timing of the signal handling.
56+
57+
### The fix
58+
59+
Add `logfile flush 0` to a temporary screenrc passed via `screen -c`:
60+
61+
```
62+
logfile flush 0
63+
```
64+
65+
This sets `log_flush = 0` in screen's internal variable, causing `WLogString` to call
66+
`logfflush()` after every write:
67+
68+
```c
69+
if (!log_flush)
70+
logfflush(win->w_log); // now always executes
71+
```
72+
73+
This completely eliminates the flush delay for log file writes.
74+
75+
---
76+
77+
## Root Cause 2: Tee Pipe Buffering Race (Tee Fallback, Screen < 4.5.1)
78+
79+
### macOS bundled screen 4.0.3
80+
81+
On macOS, the system-bundled GNU Screen is version 4.0.3, which predates the `-Logfile`
82+
option (added in 4.5.1). The code falls back to:
83+
84+
```js
85+
effectiveCommand = `(${effectiveCommand}) 2>&1 | tee "${logFile}"`;
86+
```
87+
88+
The screen session runs:
89+
```
90+
screen -dmS session /bin/zsh -c "(agent --version) 2>&1 | tee /tmp/logfile"
91+
```
92+
93+
### The race condition
94+
95+
When `agent --version` completes in ~5ms:
96+
1. The agent process writes `0.13.2\n` to stdout
97+
2. The tee process receives the data and writes it to the log file
98+
3. The zsh process exits
99+
4. Screen detects the child exit and terminates the session
100+
101+
The RACE: between step 2 (tee writes to OS page cache) and our poller reading the file,
102+
there may be a brief window where:
103+
- The log file exists but the `write()` syscall from tee hasn't been committed to the
104+
page cache yet (on some OS implementations)
105+
- OR the file still has 0 bytes because tee's userspace buffer hasn't been `fwrite`'d
106+
yet when screen terminates it
107+
108+
This is a TOCTOU (time-of-check-time-of-use) race between the session appearing gone in
109+
`screen -ls` and the log file having its complete content.
110+
111+
### The fix
112+
113+
The `logfile flush 0` fix in the screenrc does NOT directly help the tee fallback path
114+
(since tee is an external process, not governed by screen's log flush). However, an
115+
additional retry mechanism can be added: if the log file is empty when first read but the
116+
session just terminated, retry the read after a brief delay to let the OS flush complete.
117+
118+
In practice, on Linux the tee fallback is not used (screen >= 4.5.1 is available), and
119+
on macOS the `logfile flush 0` option works on screen 4.0.3 as well (it's a screenrc
120+
command, not a version-gated feature).
121+
122+
---
123+
124+
## Why This Is Intermittent
125+
126+
The issue doesn't always reproduce because:
127+
128+
1. **On Linux with screen 4.09.01**: The normal `fclose()` path usually flushes the buffer
129+
correctly. Only under certain timing conditions (SIGCHLD handling, `_exit`) does it fail.
130+
131+
2. **On macOS with screen 4.0.3**: The tee fallback's race window is very small (~1ms).
132+
Most of the time the file has content when read. But for very fast commands with a
133+
busy system, the window widens.
134+
135+
3. **The `screen -ls` timing**: The check `!sessions.includes(sessionName)` returns true
136+
as soon as the session process exits, but the OS may still be in the middle of flushing
137+
the log file's write buffers to the page cache.
138+
139+
---
140+
141+
## Test Coverage Gap
142+
143+
There were no tests specifically for:
144+
1. Quick-completing commands (like `--version` flags) in screen isolation
145+
2. Verification that version-string output (short, non-whitespace text) is captured
146+
147+
The existing tests use `echo "hello"` which is also fast but the string is longer and
148+
may flush more reliably in certain conditions.
Lines changed: 108 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,108 @@
1+
# Solutions: Issue #96
2+
3+
## Problem Statement
4+
5+
Quick-completing commands (like `agent --version`) run through screen isolation
6+
produce no output. The screen session exits cleanly but the log file is empty.
7+
8+
---
9+
10+
## Solutions Considered
11+
12+
### Option A: Add `sleep` after the command in screen
13+
14+
**Approach:** Append `; sleep 0.1` to every command run in screen to give the log a
15+
chance to flush before the session exits.
16+
17+
**Pros:** Simple, one-line change.
18+
19+
**Cons:**
20+
- Slows every command by at least 100ms (unacceptable for interactive use)
21+
- Hacky — fights symptoms rather than addressing the root cause
22+
- Still a race condition (just smaller)
23+
24+
### Option B: Retry reading the log file with exponential backoff
25+
26+
**Approach:** After detecting session completion, if the log file is empty, retry
27+
reading it up to 3 times with 10ms, 20ms, 40ms delays.
28+
29+
**Pros:** Addresses the TOCTOU race condition for the tee fallback path.
30+
31+
**Cons:**
32+
- Does NOT fix the native logging path's `log_flush = 10` issue
33+
- Adds complexity to the polling logic
34+
- Still a race for very slow systems
35+
36+
### Option C: Use a temporary screenrc with `logfile flush 0` (CHOSEN)
37+
38+
**Approach:** For the native logging path (`-L -Logfile`, screen ≥ 4.5.1), create a
39+
temporary screenrc file containing `logfile flush 0` and pass it to screen via the
40+
`-c` option.
41+
42+
```
43+
echo "logfile flush 0" > /tmp/screen-rc-XXXX.rc
44+
screen -dmS session -c /tmp/screen-rc-XXXX.rc -L -Logfile logfile shell -c command
45+
```
46+
47+
`logfile flush 0` sets `log_flush = 0` in screen's internal state, which causes
48+
`WLogString()` in `ansi.c` to call `logfflush()` (equivalent to `fflush()`) after
49+
every write. This completely eliminates the 10-second buffering delay.
50+
51+
**Pros:**
52+
- Eliminates the flush delay entirely for the native logging path
53+
- Standard approach recommended in GNU Screen documentation
54+
- No performance impact on the command itself (the flush is async in screen's event loop)
55+
- Clean implementation — the temp file is created and deleted by our code
56+
- Works on all screen versions that support the `-c` option (very old)
57+
58+
**Cons:**
59+
- Requires creating/deleting an additional temp file per invocation
60+
- The `-c` option overrides the user's `~/.screenrc` — but since we only put
61+
`logfile flush 0` in it, there's no conflict for our use case
62+
63+
**Note on the tee fallback path (screen < 4.5.1):**
64+
The `logfile flush 0` setting in a screenrc also works on older screen versions (it's
65+
a screenrc directive, not version-gated). However, for the tee fallback path, screen
66+
itself doesn't write the log — `tee` does. So `logfile flush 0` doesn't help for
67+
the tee path. For this path, Option B (retry) is added as a complementary fix.
68+
69+
### Option D: Use `script` instead of screen's `-L`
70+
71+
**Approach:** Replace screen's log capture with `script -c "command" logfile`, which
72+
uses a pty and has more reliable flush-on-exit behavior.
73+
74+
**Pros:** `script` is designed for output capture and handles flush correctly.
75+
76+
**Cons:**
77+
- `script` behavior varies significantly between Linux (util-linux) and macOS (BSD)
78+
- macOS `script` uses different flags (`-q`, `-c` not available)
79+
- Complex to handle cross-platform
80+
- Loses screen's session management features
81+
82+
---
83+
84+
## Chosen Solution: Option C + Option B
85+
86+
**Implementation:**
87+
88+
For native logging path (screen ≥ 4.5.1):
89+
```js
90+
// Create temporary screenrc with immediate log flush
91+
const screenrcFile = path.join(os.tmpdir(), `screenrc-${sessionName}`);
92+
fs.writeFileSync(screenrcFile, 'logfile flush 0\n');
93+
94+
const logArgs = ['-dmS', sessionName, '-c', screenrcFile, '-L', '-Logfile', logFile];
95+
screenArgs = [...logArgs, shell, shellArg, effectiveCommand];
96+
97+
// Clean up screenrc on completion
98+
try { fs.unlinkSync(screenrcFile); } catch {}
99+
```
100+
101+
For tee fallback path (screen < 4.5.1): add a brief retry when the log file is empty
102+
immediately after session completion.
103+
104+
**Why this is correct:**
105+
- Setting `logfile flush 0` is the standard, official way to control screen's log
106+
flush behavior, documented in the GNU Screen manual
107+
- It directly addresses the root cause (buffered writes) rather than fighting symptoms
108+
- The retry for the tee path handles the TOCTOU race for the remaining edge case

0 commit comments

Comments
 (0)