Propolis commit: c455784
Host OS:
Guest OS: Debian 11 nocloud, Linux debian 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
Repro steps (using stress-ng in the guest):
stress-ng --vm 1 --vm-bytes 2G --verify -v &
live migrate the VM a few times, back and forth between the same two Propolis servers on a single host
fg, kill the stress-ng job
stress-ng --timer 32 --timer-freq 1000000 &
migrate some more
fg, kill the stress-ng job
Expected: guest is generally happy
Observed: guest gets dyspepsia after running the timer stress test:
root@debian:~# [ 501.250937] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 501.254890] rcu: 1-...!: (0 ticks this GP) idle=7f0/0/0x0 softirq=2364/2364 fqs=1 (false positive?)
[ 501.254890] (detected by 3, t=21009 jiffies, g=1749, q=1077)
[ 501.254890] Sending NMI from CPU 3 to CPUs 1:
[ 501.266638] NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0xe/0x20
[ 501.254890] rcu: rcu_sched kthread starved for 15759 jiffies! g1749 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[ 501.254890] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 501.254890] rcu: RCU grace-period kthread stack dump:
[ 501.254890] task:rcu_sched state:I stack: 0 pid: 12 ppid: 2 flags:0x00004000
[ 501.254890] Call Trace:
[ 501.254890] __schedule+0x282/0x870
[ 501.254890] schedule+0x46/0xb0
[ 501.254890] schedule_timeout+0x8b/0x150
[ 501.254890] ? __next_timer_interrupt+0x110/0x110
[ 501.254890] rcu_gp_kthread+0x51b/0xbc0
[ 501.254890] ? rcu_cpu_kthread+0x190/0x190
[ 501.254890] kthread+0x11b/0x140
[ 501.254890] ? __kthread_bind_mask+0x60/0x60
[ 501.254890] ret_from_fork+0x22/0x30
Other observations:
There are a few of these messages in the serial logs (at guest uptimes 501.254, 438.230, 417.186, and 354.166).
Apart from the VM/migration work, the host machine was not especially heavily loaded during this time.
The messages are not correlated with migrations; the last couple of them occurred in the same Propolis server a couple of minutes after it had been migrated into.
This guest complained about TSC inaccuracy at about 65 seconds of uptime. This is likely because the prior runs of stress-ng --vm tanked migration performance by dirtying a bunch of pages (another case to investigate in #324, "Investigate/profile live migration performance").
@jmpesp saw a similar issue in local testing earlier this week, but that was without the bits needed to enable the interrupt state transfer implemented in #367. Unless I've missed something, that should have been enabled here (both the Propolis bits and the necessary bhyve bits were present).
This VM no longer seems to be producing any RCU complaints, but I'll hold it in its current state for now.
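As a rough cross-check of the timings reported above, the Python sketch below computes the gaps between the four stall messages and converts the jiffy counts from the log into seconds. It assumes the guest kernel ticks at 250 Hz (CONFIG_HZ=250); the report does not state this, so it should be confirmed in the guest before reading anything into the numbers. All function and variable names are illustrative.

```python
# Guest uptimes (in seconds) at which the RCU stall messages appeared,
# as listed under "Other observations".
stall_uptimes = [354.166, 417.186, 438.230, 501.254]

# ASSUMPTION: the guest kernel was built with CONFIG_HZ=250. This is not
# stated in the report; check the guest's kernel config to confirm.
ASSUMED_HZ = 250


def gaps(times):
    """Return the gaps, in seconds, between consecutive sorted timestamps."""
    ordered = sorted(times)
    return [round(later - earlier, 3) for earlier, later in zip(ordered, ordered[1:])]


def jiffies_to_seconds(jiffies, hz=ASSUMED_HZ):
    """Convert a jiffy count to seconds for a given tick rate."""
    return jiffies / hz


stall_gaps = gaps(stall_uptimes)
print("gaps between stall reports (s):", stall_gaps)

# Jiffy counts taken from the log excerpt above.
print("detected after t=21009 jiffies (s):", jiffies_to_seconds(21009))
print("kthread starved for 15759 jiffies (s):", jiffies_to_seconds(15759))
```

Comparing the converted intervals with the spacing of the messages may help tell whether the guest saw one long stall reported several times or several separate stalls; this is only a way of reading the existing log, not new data.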