[pull] master from gregkh:master#17
Merged
pull[bot] merged 205 commits into OpenGamingCollective:master from Mar 4, 2026
debugobjects uses __GFP_HIGH for allocations as it might be invoked within locked regions. That worked perfectly fine until v6.18. It still works correctly when deferred page initialization is disabled and works by chance when no page allocation is required before deferred page initialization has completed.
Since v6.18 allocations w/o a reclaim flag cause new_slab() to end up in alloc_frozen_pages_nolock_noprof(), which returns early when deferred page initialization has not yet completed. As the deferred page initialization takes quite a while the debugobject pool is depleted and debugobjects are disabled.
This can be worked around when PREEMPT_COUNT is enabled as that allows debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when the context is preemptible. When PREEMPT_COUNT is disabled the context is unknown and the reclaim bit can't be set because the caller might hold locks which might deadlock in the allocator.
In preemptible context the reclaim bit is harmless and not a performance issue as that's usually invoked from slow path initialization context.
That makes debugobjects depend on PREEMPT_COUNT || !DEFERRED_STRUCT_PAGE_INIT.
Fixes: af92793 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/87pl6gznti.ffs@tglx
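The flag selection described in this commit message can be modelled in user-space C. This is only a sketch: the `SKETCH_*` constants and the function name are hypothetical stand-ins, not the kernel's GFP definitions or the actual debugobjects code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's GFP bits; the values are illustrative only. */
#define SKETCH_GFP_HIGH           0x1u
#define SKETCH_GFP_KSWAPD_RECLAIM 0x2u

/*
 * Model of the decision in the commit message: always request the "high"
 * bit, and add the kswapd reclaim bit only when the context is known to be
 * preemptible. Without PREEMPT_COUNT the context is unknown, so the reclaim
 * bit has to stay clear.
 */
static unsigned int sketch_debugobjects_gfp(bool preempt_count_enabled,
                                            bool preemptible)
{
    unsigned int gfp = SKETCH_GFP_HIGH;

    if (preempt_count_enabled && preemptible)
        gfp |= SKETCH_GFP_KSWAPD_RECLAIM;
    return gfp;
}
```

The second argument is only meaningful when the first is true, which mirrors why the commit adds the PREEMPT_COUNT || !DEFERRED_STRUCT_PAGE_INIT dependency.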
For common cases (HZ=100, 250 or 1000), these helpers are at most one multiply, so there is no point calling a tiny function. Keep them out of line for HZ=300 and others. This saves cycles in TCP fast path, among other things.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/8 grow/shrink: 25/89 up/down: 530/-3474 (-2944)
...
nla_put_msecs 193 - -193
message_stats_print 2131 920 -1211
Total: Before=25365208, After=25362264, chg -0.01%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260210170226.57209-1-edumazet@google.com
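To illustrate why such a helper collapses to one multiply, here is a minimal sketch of a tick-to-milliseconds conversion. The commit message does not name the helpers, so the function below and the `SKETCH_HZ` constant are assumptions chosen for illustration, not the kernel's code.

```c
#include <stdint.h>

/*
 * Assumed compile-time tick rate. When it divides 1000 evenly (100, 250 or
 * 1000), 1000 / SKETCH_HZ is an exact integer constant.
 */
#define SKETCH_HZ 250u

/*
 * Hypothetical inline helper: ticks to milliseconds. With SKETCH_HZ = 250
 * the compiler sees a single multiply by the constant 4, so a function call
 * would cost more than the arithmetic itself.
 */
static inline uint32_t sketch_ticks_to_msecs(uint32_t ticks)
{
    return ticks * (1000u / SKETCH_HZ);
}
```

For a rate such as 300, 1000 / 300 is not exact and a longer conversion is needed, which matches the commit's choice to keep those cases out of line.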
When a task is migrated out of a css_set, cgroup_migrate_add_task()
first moves it from cset->tasks to cset->mg_tasks via:
list_move_tail(&task->cg_list, &cset->mg_tasks);
If a css_task_iter currently has it->task_pos pointing to this task,
css_set_move_task() calls css_task_iter_skip() to keep the iterator
valid. However, since the task has already been moved to ->mg_tasks,
the iterator is advanced relative to the mg_tasks list instead of the
original tasks list. As a result, remaining tasks on cset->tasks, as
well as tasks queued on cset->mg_tasks, can be skipped by iteration.
Fix this by calling css_set_skip_task_iters() before unlinking
task->cg_list from cset->tasks. This advances all active iterators to
the next task on cset->tasks, so iteration continues correctly even
when a task is concurrently being migrated.
This race is hard to hit in practice without instrumentation, but it
can be reproduced by artificially slowing down cgroup_procs_show().
For example, on an Android device a temporary
/sys/kernel/cgroup/cgroup_test knob can be added to inject a delay
into cgroup_procs_show(), and then:
1) Spawn three long-running tasks (PIDs 101, 102, 103).
2) Create a test cgroup and move the tasks into it.
3) Enable a large delay via /sys/kernel/cgroup/cgroup_test.
4) In one shell, read cgroup.procs from the test cgroup.
5) Within the delay window, in another shell migrate PID 102 by
writing it to a different cgroup.procs file.
Under this setup, cgroup.procs can intermittently show only PID 101
while skipping PID 103. Once the migration completes, reading the
file again shows all tasks as expected.
Note that this change does not allow removing the existing
css_set_skip_task_iters() call in css_set_move_task(). The new call
in cgroup_migrate_add_task() only handles iterators that are racing
with migration while the task is still on cset->tasks. Iterators may
also start after the task has been moved to cset->mg_tasks. If we
dropped css_set_skip_task_iters() from css_set_move_task(), such
iterators could keep task_pos pointing to a migrating task, causing
css_task_iter_advance() to malfunction on the destination css_set,
up to and including crashes or infinite loops.
The race window between migration and iteration is very small, and
css_task_iter is not on a hot path. In the worst case, when an
iterator is positioned on the first thread of the migrating process,
cgroup_migrate_add_task() may have to skip multiple tasks via
css_set_skip_task_iters(). However, this only happens when migration
and iteration actually race, so the performance impact is negligible
compared to the correctness fix provided here.
Fixes: b636fd3 ("cgroup: Implement css_task_iter_skip()")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Qingye Zhao <zhaoqingye@honor.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
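The ordering problem in this commit can be shown with a small user-space model. This is a simplified sketch: the list helpers, `struct sketch_iter` and both `sketch_migrate_*` functions are hypothetical, and the real css_task_iter code tracks considerably more state.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modelled on list_head. */
struct sketch_node {
    struct sketch_node *prev, *next;
    int pid;
};

static void sketch_list_init(struct sketch_node *head)
{
    head->prev = head->next = head;
}

static void sketch_list_add_tail(struct sketch_node *n, struct sketch_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void sketch_list_del(struct sketch_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

/* The iterator remembers the next task it will report. */
struct sketch_iter {
    struct sketch_node *pos;
};

/* Order after the fix: advance the iterator while the task is still linked
 * on the source list, then move the task. */
static void sketch_migrate_fixed(struct sketch_node *task,
                                 struct sketch_node *mg_tasks,
                                 struct sketch_iter *it)
{
    if (it->pos == task)
        it->pos = task->next;          /* next entry on the source list */
    sketch_list_del(task);
    sketch_list_add_tail(task, mg_tasks);
}

/* Order before the fix: the task is moved first, so "next" is taken
 * relative to the mg_tasks list instead of the source list. */
static void sketch_migrate_buggy(struct sketch_node *task,
                                 struct sketch_node *mg_tasks,
                                 struct sketch_iter *it)
{
    sketch_list_del(task);
    sketch_list_add_tail(task, mg_tasks);
    if (it->pos == task)
        it->pos = task->next;          /* now points into mg_tasks */
}
```

With tasks 101, 102 and 103 on the source list and the iterator positioned on 102, the fixed order leaves the iterator on 103, while the buggy order leaves it on the head of the migration list, so 103 is never visited.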
When CONFIG_ARM64_POE is disabled, KVM does not save/restore POR_EL1. However, ID_AA64MMFR3_EL1 sanitisation currently exposes the feature to guests whenever the hardware supports it, ignoring the host kernel configuration. If a guest detects this feature and attempts to use it, the host will fail to context-switch POR_EL1, potentially leading to state corruption. Fix this by masking ID_AA64MMFR3_EL1.S1POE in the sanitised system registers, preventing KVM from advertising the feature when the host does not support it (i.e. system_supports_poe() is false). Fixes: 70ed723 ("KVM: arm64: Sanitise ID_AA64MMFR3_EL1") Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260213143815.1732675-2-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
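The masking step can be sketched as plain bit manipulation. The field position used here (a 4-bit S1POE field at bits [19:16]) is an assumption made for the example, and `sketch_sanitise_mmfr3()` is a hypothetical name, not the KVM function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout for illustration: 4-bit S1POE field at bits [19:16]. */
#define SKETCH_S1POE_SHIFT 16
#define SKETCH_S1POE_MASK  (UINT64_C(0xf) << SKETCH_S1POE_SHIFT)

/*
 * Return the ID register value a guest would see. When the host cannot
 * context-switch POR_EL1 (host_supports_poe is false), the S1POE field is
 * cleared so the guest never detects the feature.
 */
static uint64_t sketch_sanitise_mmfr3(uint64_t hw_val, bool host_supports_poe)
{
    if (!host_supports_poe)
        hw_val &= ~SKETCH_S1POE_MASK;
    return hw_val;
}
```

All other fields pass through unchanged, so only the one feature the host cannot back is hidden.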
Although ID register sanitisation prevents guests from seeing the feature, adding this check to the helper allows the compiler to entirely eliminate S1POE-specific code paths (such as context switching POR_EL1) when the host kernel is compiled without support (CONFIG_ARM64_POE is disabled). This aligns with the pattern used for other optional features like SVE (kvm_has_sve()) and FPMR (kvm_has_fpmr()), ensuring no POE logic if the host lacks support, regardless of the guest configuration state. Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260213143815.1732675-3-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
In protected mode, the hypervisor maintains a separate instance of the `kvm` structure for each VM. For non-protected VMs, this structure is initialized from the host's `kvm` state. Currently, `pkvm_init_features_from_host()` copies the `KVM_ARCH_FLAG_ID_REGS_INITIALIZED` flag from the host without the underlying `id_regs` data being initialized. This results in the hypervisor seeing the flag as set while the ID registers remain zeroed. Consequently, `kvm_has_feat()` checks at EL2 fail (return 0) for non-protected VMs. This breaks logic that relies on feature detection, such as `ctxt_has_tcrx()` for TCR2_EL1 support. As a result, certain system registers (e.g., TCR2_EL1, PIR_EL1, POR_EL1) are not saved/restored during the world switch, which could lead to state corruption. Fix this by explicitly copying the ID registers from the host `kvm` to the hypervisor `kvm` for non-protected VMs during initialization, since we trust the host with its non-protected guests' features. Also ensure `KVM_ARCH_FLAG_ID_REGS_INITIALIZED` is cleared initially in `pkvm_init_features_from_host` so that `vm_copy_id_regs` can properly initialize them and set the flag once done. Fixes: 41d6028 ("KVM: arm64: Convert the SVE guest vcpu flag to a vm flag") Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260213143815.1732675-4-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
The `sve_state` pointer in `hyp_vcpu->vcpu.arch` is initialized as a hypervisor virtual address during vCPU initialization in `pkvm_vcpu_init_sve()`. `unpin_host_sve_state()` calls `kern_hyp_va()` on this address. Since `kern_hyp_va()` is idempotent, it's not a bug. However, it is unnecessary and potentially confusing. Remove the redundant conversion. Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260213143815.1732675-5-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
… type In preparation for making the kmalloc family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void *", which can be implicitly cast to any pointer type.) The assigned type is "struct gic_kvm_info", but the returned type, while matching, is const qualified. To get them exactly matching, just use the dereferenced pointer for the sizeof(). Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20260206223022.it.052-kees@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
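The sizeof() idiom mentioned here looks roughly like the following in user-space C. `struct sketch_info` and `sketch_alloc_info()` are hypothetical, and `calloc()` stands in for the kernel allocator.

```c
#include <stdlib.h>

/* Hypothetical structure standing in for the one named in the commit. */
struct sketch_info {
    int maint_irq;
    int flags;
};

/*
 * Size the allocation from the dereferenced destination pointer. The
 * allocated type then follows the variable being assigned, rather than a
 * separately spelled (and possibly differently qualified) type name.
 */
static struct sketch_info *sketch_alloc_info(void)
{
    struct sketch_info *info;

    info = calloc(1, sizeof(*info));
    return info;
}
```

If the declaration of `info` changes later, the allocation size changes with it, which is what a type-aware allocator would check.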
syzbot reported memory leak of struct cred. [0]
nfsd_nl_threads_set_doit() passes get_current_cred() to
nfsd_svc(), but put_cred() is not called after that.
The cred is finally passed down to _svc_xprt_create(),
which calls get_cred() with the cred for struct svc_xprt.
The ownership of the refcount by get_current_cred() is not
transferred to anywhere and is just leaked.
nfsd_svc() is also called from write_threads(), but it does
not bump file->f_cred there.
nfsd_nl_threads_set_doit() is called from sendmsg() and
current->cred does not go away.
Let's use current_cred() in nfsd_nl_threads_set_doit().
[0]:
BUG: memory leak
unreferenced object 0xffff888108b89480 (size 184):
comm "syz-executor", pid 5994, jiffies 4294943386
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 369454a7):
kmemleak_alloc_recursive include/linux/kmemleak.h:44 [inline]
slab_post_alloc_hook mm/slub.c:4958 [inline]
slab_alloc_node mm/slub.c:5263 [inline]
kmem_cache_alloc_noprof+0x412/0x580 mm/slub.c:5270
prepare_creds+0x22/0x600 kernel/cred.c:185
copy_creds+0x44/0x290 kernel/cred.c:286
copy_process+0x7a7/0x2870 kernel/fork.c:2086
kernel_clone+0xac/0x6e0 kernel/fork.c:2651
__do_sys_clone+0x7f/0xb0 kernel/fork.c:2792
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xa4/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: 924f4fb ("NFSD: convert write_threads to netlink command")
Cc: stable@vger.kernel.org
Reported-by: syzbot+dd3b43aa0204089217ee@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69744674.a00a0220.33ccc7.0000.GAE@google.com/
Tested-by: syzbot+dd3b43aa0204089217ee@syzkaller.appspotmail.com
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
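The leak comes down to reference ownership, which a small counter model makes visible. Everything below is a hypothetical sketch: `struct sketch_cred` and the helper names only imitate the get/put pattern and are not the kernel's cred API.

```c
/* Toy reference-counted object standing in for struct cred. */
struct sketch_cred {
    int usage;
};

static struct sketch_cred *sketch_get_cred(struct sketch_cred *c)
{
    c->usage++;
    return c;
}

static void sketch_put_cred(struct sketch_cred *c)
{
    c->usage--;
}

/* The callee takes its own reference for as long as it needs the cred and
 * drops that reference when it is done. */
static void sketch_xprt_create_and_destroy(struct sketch_cred *c)
{
    struct sketch_cred *held = sketch_get_cred(c);

    sketch_put_cred(held);
}

/* Leaky caller: takes an extra reference (like get_current_cred()) that
 * nobody ever drops. */
static void sketch_caller_leaky(struct sketch_cred *current_cred)
{
    sketch_xprt_create_and_destroy(sketch_get_cred(current_cred));
}

/* Fixed caller: only borrows the pointer (like current_cred()), which is
 * safe because the current task keeps its cred alive during the call. */
static void sketch_caller_fixed(struct sketch_cred *current_cred)
{
    sketch_xprt_create_and_destroy(current_cred);
}
```

After the leaky caller returns, the count is one higher than before; after the fixed caller returns, it is unchanged.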
nfsd_nl_listener_set_doit() uses get_current_cred() without put_cred(). As we can see from other callers, svc_xprt_create_from_sa() does not require the extra refcount. nfsd_nl_listener_set_doit() is always in the process context, sendmsg(), and current->cred does not go away. Let's use current_cred() in nfsd_nl_listener_set_doit(). Fixes: 16a4711 ("NFSD: add listener-{set,get} netlink command") Cc: stable@vger.kernel.org Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The PLIC ignores the interrupt completion message for a disabled interrupt, as
explained by the specification:
The PLIC signals it has completed executing an interrupt handler by
writing the interrupt ID it received from the claim to the
claim/complete register. The PLIC does not check whether the completion
ID is the same as the last claim ID for that target. If the completion
ID does not match an interrupt source that is currently enabled for
the target, the completion is silently ignored.
This caused problems in the past, because an interrupt can be disabled
while still being handled and plic_irq_eoi() had no effect. That was fixed
by checking if the interrupt is disabled, and if so enable it, before
sending the completion message. That check is done with irqd_irq_disabled().
However, that is not sufficient because the enable bit for the handling
hart can be zero despite irqd_irq_disabled(d) being false. This can happen
when affinity setting is changed while a hart is still handling the
interrupt.
This problem is easily reproducible by dumping a large file to uart (which
generates lots of interrupts) while repeatedly changing the uart
interrupt's affinity setting. The uart port becomes frozen almost
instantaneously.
Fix this by checking PLIC's enable bit instead of irqd_irq_disabled().
Fixes: cc9f04f ("irqchip/sifive-plic: Implement irq_set_affinity() for SMP host")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260212114125.3148067-1-namcao@linutronix.de
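The behaviour described above can be modelled with two bitmaps per hart. This is a user-space sketch under stated assumptions: `struct sketch_plic_hart` and both functions are hypothetical, interrupt numbers are assumed to be below 32, and no real register layout is implied.

```c
#include <stdint.h>

/* Per-hart state: one bit per interrupt source (sources 0..31 only). */
struct sketch_plic_hart {
    uint32_t enable;      /* hardware enable bits for this hart */
    uint32_t in_service;  /* sources claimed but not yet completed */
};

/* Hardware model: a completion for a source that is not enabled for this
 * hart is silently ignored, as the quoted specification text says. */
static void sketch_hw_complete(struct sketch_plic_hart *h, unsigned int irq)
{
    uint32_t bit = UINT32_C(1) << irq;

    if (h->enable & bit)
        h->in_service &= ~bit;
}

/* EOI in the spirit of the fix: look at the hardware enable bit itself
 * rather than at software state, and enable the source temporarily if
 * needed so that the completion takes effect. */
static void sketch_plic_eoi(struct sketch_plic_hart *h, unsigned int irq)
{
    uint32_t bit = UINT32_C(1) << irq;

    if (!(h->enable & bit)) {
        h->enable |= bit;
        sketch_hw_complete(h, irq);
        h->enable &= ~bit;
    } else {
        sketch_hw_complete(h, irq);
    }
}
```

A bare completion with the enable bit clear leaves the source stuck in service, which corresponds to the frozen uart in the reproduction; the EOI path above clears it and restores the enable bit to its previous value.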
…ITS supports The ITS driver blindly assumes that EventIDs are in abundant supply, to the point where it never checks how many the hardware actually supports. It turns out that some pretty esoteric integrations make it so that only a few bits are available, all the way down to a single bit. Enforce the advertised limitation at the point of allocating the device structure, and hope that the endpoint driver can deal with such limitation. Fixes: 84a6a2e ("irqchip: GICv3: ITS: device allocation and configuration") Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Zenghui Yu <zenghui.yu@linux.dev> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260206154816.3582887-1-maz@kernel.org
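A capacity check of this kind can be sketched as follows. The function name and parameters are hypothetical, and the commit message does not say whether the driver rejects or truncates an oversized request, so this sketch only reports whether the request fits.

```c
#include <stdbool.h>

/*
 * Return true when nvecs events fit into an EventID space that is
 * event_id_bits wide. Assumes event_id_bits is below 64 so the shift is
 * well defined.
 */
static bool sketch_eventids_fit(unsigned int nvecs, unsigned int event_id_bits)
{
    return (unsigned long long)nvecs <= (1ULL << event_id_bits);
}
```

With a single EventID bit, as in the integrations mentioned above, only two events fit.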
File-scope 'icu_irq_chip' is not used outside of this unit and is not modified anywhere, so make it static const to silence sparse warning: irq-mmp.c:139:17: warning: symbol 'icu_irq_chip' was not declared. Should it be static? Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260216110449.160277-2-krzysztof.kozlowski@oss.qualcomm.com
Using set_memory_wc() to enable write-combining for the DPP portion of the MMIO mapping is wrong as set_memory_*() is meant to operate on RAM only, not MMIO mappings. In fact, as currently used, it triggers a BUG_ON() when CONFIG_DEBUG_VIRTUAL is enabled.
Simply map the DPP region separately and in addition to the already existing mappings, avoiding any possible negative side effects for these.
Fixes: 1351e69 ("scsi: lpfc: Add push-to-adapter support to sli4")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Justin Tee <justin.tee@broadcom.com>
Reviewed-by: Mathias Krause <minipli@grsecurity.net>
Link: https://patch.msgid.link/20260212192327.141104-1-justintee8345@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
This resolves the following splat and lock-up when running with PREEMPT_RT
enabled on Hyper-V:
[ 415.140818] BUG: scheduling while atomic: stress-ng-iomix/1048/0x00000002
[ 415.140822] INFO: lockdep is turned off.
[ 415.140823] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery pmt_class intel_pmc_ssram_telemetry intel_vsec ghash_clmulni_intel aesni_intel rapl binfmt_misc nls_ascii nls_cp437 vfat fat snd_pcm hyperv_drm snd_timer drm_client_lib drm_shmem_helper snd sg soundcore drm_kms_helper pcspkr hv_balloon hv_utils evdev joydev drm configfs efi_pstore nfnetlink vsock_loopback vmw_vsock_virtio_transport_common hv_sock vmw_vsock_vmci_transport vsock vmw_vmci efivarfs autofs4 ext4 crc16 mbcache jbd2 sr_mod sd_mod cdrom hv_storvsc serio_raw hid_generic scsi_transport_fc hid_hyperv scsi_mod hid hv_netvsc hyperv_keyboard scsi_common
[ 415.140846] Preemption disabled at:
[ 415.140847] [<ffffffffc0656171>] storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
[ 415.140854] CPU: 8 UID: 0 PID: 1048 Comm: stress-ng-iomix Not tainted 6.19.0-rc7 #30 PREEMPT_{RT,(full)}
[ 415.140856] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 09/04/2024
[ 415.140857] Call Trace:
[ 415.140861] <TASK>
[ 415.140861] ? storvsc_queuecommand+0x2e1/0xbe0 [hv_storvsc]
[ 415.140863] dump_stack_lvl+0x91/0xb0
[ 415.140870] __schedule_bug+0x9c/0xc0
[ 415.140875] __schedule+0xdf6/0x1300
[ 415.140877] ? rtlock_slowlock_locked+0x56c/0x1980
[ 415.140879] ? rcu_is_watching+0x12/0x60
[ 415.140883] schedule_rtlock+0x21/0x40
[ 415.140885] rtlock_slowlock_locked+0x502/0x1980
[ 415.140891] rt_spin_lock+0x89/0x1e0
[ 415.140893] hv_ringbuffer_write+0x87/0x2a0
[ 415.140899] vmbus_sendpacket_mpb_desc+0xb6/0xe0
[ 415.140900] ? rcu_is_watching+0x12/0x60
[ 415.140902] storvsc_queuecommand+0x669/0xbe0 [hv_storvsc]
[ 415.140904] ? HARDIRQ_verbose+0x10/0x10
[ 415.140908] ? __rq_qos_issue+0x28/0x40
[ 415.140911] scsi_queue_rq+0x760/0xd80 [scsi_mod]
[ 415.140926] __blk_mq_issue_directly+0x4a/0xc0
[ 415.140928] blk_mq_issue_direct+0x87/0x2b0
[ 415.140931] blk_mq_dispatch_queue_requests+0x120/0x440
[ 415.140933] blk_mq_flush_plug_list+0x7a/0x1a0
[ 415.140935] __blk_flush_plug+0xf4/0x150
[ 415.140940] __submit_bio+0x2b2/0x5c0
[ 415.140944] ? submit_bio_noacct_nocheck+0x272/0x360
[ 415.140946] submit_bio_noacct_nocheck+0x272/0x360
[ 415.140951] ext4_read_bh_lock+0x3e/0x60 [ext4]
[ 415.140995] ext4_block_write_begin+0x396/0x650 [ext4]
[ 415.141018] ? __pfx_ext4_da_get_block_prep+0x10/0x10 [ext4]
[ 415.141038] ext4_da_write_begin+0x1c4/0x350 [ext4]
[ 415.141060] generic_perform_write+0x14e/0x2c0
[ 415.141065] ext4_buffered_write_iter+0x6b/0x120 [ext4]
[ 415.141083] vfs_write+0x2ca/0x570
[ 415.141087] ksys_write+0x76/0xf0
[ 415.141089] do_syscall_64+0x99/0x1490
[ 415.141093] ? rcu_is_watching+0x12/0x60
[ 415.141095] ? finish_task_switch.isra.0+0xdf/0x3d0
[ 415.141097] ? rcu_is_watching+0x12/0x60
[ 415.141098] ? lock_release+0x1f0/0x2a0
[ 415.141100] ? rcu_is_watching+0x12/0x60
[ 415.141101] ? finish_task_switch.isra.0+0xe4/0x3d0
[ 415.141103] ? rcu_is_watching+0x12/0x60
[ 415.141104] ? __schedule+0xb34/0x1300
[ 415.141106] ? hrtimer_try_to_cancel+0x1d/0x170
[ 415.141109] ? do_nanosleep+0x8b/0x160
[ 415.141111] ? hrtimer_nanosleep+0x89/0x100
[ 415.141114] ? __pfx_hrtimer_wakeup+0x10/0x10
[ 415.141116] ? xfd_validate_state+0x26/0x90
[ 415.141118] ? rcu_is_watching+0x12/0x60
[ 415.141120] ? do_syscall_64+0x1e0/0x1490
[ 415.141121] ? do_syscall_64+0x1e0/0x1490
[ 415.141123] ? rcu_is_watching+0x12/0x60
[ 415.141124] ? do_syscall_64+0x1e0/0x1490
[ 415.141125] ? do_syscall_64+0x1e0/0x1490
[ 415.141127] ? irqentry_exit+0x140/0x7e0
[ 415.141129] entry_SYSCALL_64_after_hwframe+0x76/0x7e
get_cpu() disables preemption, while the spinlock that hv_ringbuffer_write()
uses is converted to an rt-mutex under PREEMPT_RT.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Tested-by: Florian Bezdeka <florian.bezdeka@siemens.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Link: https://patch.msgid.link/0c7fb5cd-fb21-4760-8593-e04bade84744@siemens.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Older UFS spec devices (2.2 and earlier) do not expose per-region RPMB
sizes, as only one RPMB region is supported. In such cases, the size of the
single RPMB region can be deduced from the Logical Block Count and Logical
Block Size fields in the RPMB Unit Descriptor.
Add a fallback mechanism to calculate the RPMB region size from these
fields if the device implements an older spec, so that the RPMB driver can
work with such devices - otherwise it silently skips the whole RPMB.
Section 14.1.4.6 (RPMB Unit Descriptor)
Link: https://www.jedec.org/system/files/docs/JESD220C-2_2.pdf
Cc: stable@vger.kernel.org
Fixes: b06b8c4 ("scsi: ufs: core: Add OP-TEE based RPMB driver for UFS devices")
Reviewed-by: Bean Huo <beanhuo@micron.com>
Signed-off-by: Alexey Charkov <alchark@flipper.net>
Link: https://patch.msgid.link/20260209-ufs-rpmb-v3-1-b1804e71bd38@flipper.net
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
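The fallback calculation amounts to multiplying the block count by the block size. The sketch below assumes that the descriptor encodes the logical block size as a power-of-two exponent; that encoding, the function name and the parameter names are assumptions for illustration and should be checked against the cited descriptor section.

```c
#include <stdint.h>

/*
 * Size in bytes of the single RPMB region, derived from the two descriptor
 * fields. Assumption: the block size field holds an exponent n, meaning
 * 2^n bytes per logical block.
 */
static uint64_t sketch_rpmb_region_bytes(uint64_t logical_block_count,
                                         uint8_t logical_block_size_exp)
{
    return logical_block_count << logical_block_size_exp;
}
```

For example, 65536 blocks with an exponent of 8 (256-byte blocks) give 16777216 bytes, that is 16 MiB.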
On a multipath SAS system some devices don't end up with correct symlinks from the SCSI device to its enclosure. Some devices even have enclosure links pointing to enclosures attached to different SCSI hosts. ses_match_to_enclosure() calls enclosure_for_each_device() which iterates over all enclosures on the system, not just enclosures attached to the current SCSI host. Replace the iteration with a direct call to ses_enclosure_find_by_addr(). Reviewed-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Tomas Henzl <thenzl@redhat.com> Link: https://patch.msgid.link/20260210191850.36784-1-thenzl@redhat.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Ensure that UFS Runtime PM can achieve power saving after System PM suspend by resetting hba->urgent_bkops_lvl. Also modify the ufshcd_bkops_exception_event_handler to avoid setting urgent_bkops_lvl when status is 0, which helps maintain optimal power management.
On UFS devices supporting UFSHCD_CAP_AUTO_BKOPS_SUSPEND, a BKOPS exception event can lead to a situation where UFS Runtime PM can't enter low-power mode states even after the BKOPS exception has been resolved.
When a BKOPS exception with bkops status 0 occurs, the driver logs:
"ufshcd_bkops_exception_event_handler: device raised urgent BKOPS exception for bkops status 0"
When a BKOPS exception occurs, ufshcd_bkops_exception_event_handler() reads the BKOPS status and sets hba->urgent_bkops_lvl to BKOPS_STATUS_NO_OP(0). This allows the device to perform Runtime PM without changing the UFS power mode. (__ufshcd_wl_suspend(hba, UFS_RUNTIME_PM))
During system PM suspend, ufshcd_disable_auto_bkops() is called, disabling auto bkops. After UFS System PM Resume, when runtime PM attempts to suspend again, ufshcd_urgent_bkops() is invoked. Since hba->urgent_bkops_lvl remains at BKOPS_STATUS_NO_OP(0), ufshcd_enable_auto_bkops() is triggered.
However, in ufshcd_bkops_ctrl(), the driver compares the current BKOPS status with hba->urgent_bkops_lvl, and only enables auto bkops if curr_status >= hba->urgent_bkops_lvl. Since both values are 0, the condition is met.
As a result, __ufshcd_wl_suspend(hba, UFS_RUNTIME_PM) skips power mode transitions and remains in an active state, preventing power saving even though no urgent BKOPS condition exists.
Signed-off-by: Won Jung <wone.jung@samsung.com>
Reviewed-by: Peter Wang <peter.wang@mediatek.com>
Link: https://patch.msgid.link/1891546521.01770806581968.JavaMail.epsvc@epcpadp2new
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
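The comparison at the heart of this report is easy to check in isolation. The sketch below uses hypothetical names and an assumed default level of 2; the second function is one possible reading of "avoid setting urgent_bkops_lvl when status is 0", not the driver's actual handler.

```c
#include <stdbool.h>

/* Assumed status values for illustration; 0 means no operation is needed. */
enum {
    SKETCH_BKOPS_NO_OP = 0,
    SKETCH_BKOPS_PERF_IMPACT = 2,
};

/* Mirrors the comparison quoted above: auto bkops is enabled when the
 * current status reaches the stored urgent level. */
static bool sketch_should_enable_auto_bkops(int curr_status, int urgent_lvl)
{
    return curr_status >= urgent_lvl;
}

/* Assumed handler behaviour after the fix: a reported status of 0 leaves
 * the stored level untouched; any other status replaces it. */
static int sketch_update_urgent_lvl(int urgent_lvl, int reported_status)
{
    if (reported_status == SKETCH_BKOPS_NO_OP)
        return urgent_lvl;
    return reported_status;
}
```

With the stored level at 0, the comparison is true for every status, including 0, which is the stuck state the commit describes. With the level left at 2, a status of 0 no longer enables auto bkops.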
The driver encountered a crash during resource cleanup when the reply and request queues were NULL due to freed memory. This issue occurred when the creation of reply or request queues failed, and the driver freed the memory first, but then attempted to memset the content of the freed memory, leading to a system crash.
Add NULL pointer checks for reply and request queues before accessing the reply/request memory during cleanup.
Signed-off-by: Ranjan Kumar <ranjan.kumar@broadcom.com>
Link: https://patch.msgid.link/20260212070026.30263-1-ranjan.kumar@broadcom.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Commit e29c47f ("scsi: pm8001: Simplify pm8001_task_exec()") refactors pm8001_queue_command(), however it introduces a potential cause of a double free scenario when it changes the function to return -ENODEV in case of phy down/device gone state. In this path, pm8001_queue_command() updates task status and calls task_done to indicate to upper layer that the task has been handled. However, this also frees the underlying SAS task. A -ENODEV is then returned to the caller. When libsas sas_ata_qc_issue() receives this error value, it assumes the task wasn't handled/queued by LLDD and proceeds to clean up and free the task again, resulting in a double free. Since pm8001_queue_command() handles the SAS task in this case, it should return 0 to the caller indicating that the task has been handled. Fixes: e29c47f ("scsi: pm8001: Simplify pm8001_task_exec()") Signed-off-by: Salomon Dushimirimana <salomondush@google.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://patch.msgid.link/20260213192806.439432-1-salomondush@google.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The (struct vnic_dev).linkstatus buffer is freed in svnic_dev_unregister() and referenced in svnic_dev_link_status() but never allocated. This means (struct vnic_dev).linkstatus is always NULL, so both the deallocation and the reference in svnic_dev_link_status() are dead code.
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Acked-by: Karan Tilak Kumar <kartilak@cisco.com>
Link: https://patch.msgid.link/20260216141056.59429-2-fourier.thomas@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Update snic maintainers. Signed-off-by: Karan Tilak Kumar <kartilak@cisco.com> Link: https://patch.msgid.link/20260217204658.5465-1-kartilak@cisco.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Since 3669ddd ("KVM: arm64: Add a range to pkvm_mappings"), pKVM tracks the memory that has been mapped into a guest in a side data structure. Crucially, it uses it to find out whether a page has already been mapped, and therefore refuses to map it twice. So far, so good.
However, this very patch completely breaks non-4kB page support, with guests being unable to boot. The most obvious symptom is that we take the same fault repeatedly, without making forward progress. A quick investigation shows that this is because of the above rejection code.
As it turns out, there are multiple issues at play:
- while the HPFAR_EL2 register gives you the faulting IPA minus the bottom 12 bits, it will still give you the extra bits that are part of the page offset for anything larger than 4kB, even for a level-3 mapping
- pkvm_pgtable_stage2_map() assumes that the address passed as a parameter is aligned to the size of the intended mapping
- the faulting address is only aligned for a non-page mapping
When the planets are suitably aligned (pun intended), the guest faults on a page by accessing it past the bottom 4kB, and extra bits get set in the HPFAR_EL2 register. If this results in a page mapping (which is likely with large granule sizes), nothing aligns it further down, and pkvm_mapping_iter_first() finds an intersection that doesn't really exist. We assume this is a spurious fault and return -EAGAIN. And again...
This doesn't hit outside of the protected code, as the page table code always aligns the IPA down to a page boundary, hiding the issue for everyone else.
Fix it by always forcing the alignment on vma_pagesize, irrespective of the value of vma_pagesize.
Fixes: 3669ddd ("KVM: arm64: Add a range to pkvm_mappings")
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20260222141000.3084258-1-maz@kernel.org
Cc: stable@vger.kernel.org
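The alignment issue can be reproduced with plain arithmetic. The helpers below are hypothetical and only model the two steps in the description: a fault address that has lost its bottom 12 bits, and an align-down to the mapping size.

```c
#include <stdint.h>

/* Model of what the fault register reports: the faulting IPA with only the
 * bottom 12 bits dropped, whatever the page size is. */
static uint64_t sketch_fault_ipa(uint64_t full_ipa)
{
    return full_ipa & ~UINT64_C(0xfff);
}

/* Round an address down to a multiple of size; size must be a power of two. */
static uint64_t sketch_align_down(uint64_t ipa, uint64_t size)
{
    return ipa & ~(size - 1);
}
```

With 64kB pages, an access at 0x40005123 is reported as 0x40005000, which is not aligned to the 0x10000 mapping size. Aligning it down gives 0x40000000, the start of the page that is actually being mapped. With 4kB pages the reported address is already aligned, which is why the problem only shows up with larger granules.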
cifs_pick_channel uses (start % chan_count) when channels are equally loaded, but that can return a channel that failed the eligibility checks. Drop the fallback and return the scan-selected channel instead. If none is eligible, keep the existing behavior of using the primary channel. Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Acked-by: Meetakshi Setiya <msetiya@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
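The selection rule described here can be sketched as a scan that only ever returns an eligible channel. The structure, its fields and the function name are hypothetical; the real eligibility checks in the client are not reproduced.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy channel state: whether it passed the eligibility checks and how many
 * requests are currently in flight on it. */
struct sketch_chan {
    bool eligible;
    unsigned int in_flight;
};

/*
 * Scan all channels beginning at 'start' and return the index of the
 * least-loaded eligible one. Ties keep the first eligible channel found by
 * the scan. If no channel is eligible, or the array is empty, fall back to
 * index 0, the primary channel.
 */
static size_t sketch_pick_channel(const struct sketch_chan *chans,
                                  size_t count, size_t start)
{
    size_t best = 0;
    bool found = false;

    if (count == 0)
        return 0;

    for (size_t i = 0; i < count; i++) {
        size_t idx = (start + i) % count;

        if (!chans[idx].eligible)
            continue;
        if (!found || chans[idx].in_flight < chans[best].in_flight) {
            best = idx;
            found = true;
        }
    }
    return found ? best : 0;
}
```

With three equally loaded channels where channel 1 is ineligible and start is 1, a bare `start % count` would return channel 1, while the scan above returns channel 2.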
When device_get_child_node_count() got split to the fwnode and device respective APIs, the fwnode didn't inherit the ability to traverse over the secondary fwnode. Hence any user that switches from the device to the fwnode API misses this feature. In particular, this was revealed by the commit 1490cbb ("device property: Split fwnode_get_child_node_count()") that effectively broke the GPIO enumeration on Intel Galileo boards.
Fix this by moving the secondary lookup from device to fwnode API.
Note, in general no device_*() API should go into the depth of the fwnode implementation.
Fixes: 114dbb4 ("drivers property: When no children in primary, try secondary")
Cc: stable@vger.kernel.org
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Link: https://patch.msgid.link/20260210135822.47335-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Update the example devicetree with unique regulator names for all regulators. This reflects the same change made to the actual .dtsi file. Signed-off-by: David Lechner <dlechner@baylibre.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Link: https://patch.msgid.link/20260219-mtk-mt6359-fix-regulator-names-v1-2-ee0fcebfe1d9@baylibre.com Signed-off-by: Mark Brown <broonie@kernel.org>
Currently, the define_read!() and define_write!() I/O macros are crate public. The only user outside of the I/O module is PCI (for the configuration space I/O backend). Consequently, when CONFIG_PCI=n this causes a compile time warning [1].
In order to fix this, rename the macros to io_define_read!() and io_define_write!() and use #[macro_export] to export them.
This is better than making the crate public visibility conditional, as eventually subsystems will have their own crate.
Also, I/O backends are valid to be implemented by drivers as well. For instance, there are devices (such as GPUs) that run firmware that allows programming other devices which are only accessible through the primary device via indirect I/O.
Since the macros are now public, also add the corresponding documentation.
Fixes: 121d87b ("rust: io: separate generic I/O helpers from MMIO implementation")
Reported-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Closes: https://lore.kernel.org/driver-core/CANiq72khOYkt6t5zwMvSiyZvWWHMZuNCMERXu=7K=_5tT-8Pgg@mail.gmail.com/ [1]
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Link: https://patch.msgid.link/20260216131534.65008-1-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
array_index_nospec() is no use if the result gets spilled to the stack, as it makes the believed safe-under-speculation value subject to memory predictions. For all practical purposes, this means array_index_nospec() must be used in the expression that accesses the array. As the code currently stands, it's the wrong side of irqentry_enter(), and 'index' is put into %ebp across the function call. Remove the index variable and reposition array_index_nospec(), so it's calculated immediately before the array access. Fixes: 14619d9 ("x86/fred: FRED entry/exit and dispatch code") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260106131504.679932-1-andrew.cooper3@citrix.com
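The point about placement can be illustrated with a user-space model of the masking idiom. This is a sketch only: the mask expression follows the well-known generic form, relies on arithmetic right shift of negative values (implementation-defined in C, but what common compilers do), and a real kernel build additionally hides the index from the optimizer. The function names are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/*
 * All-ones when index < size, zero otherwise, computed without a branch.
 * Assumes arithmetic right shift for negative signed values.
 */
static unsigned long sketch_index_mask(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
                           (SKETCH_BITS_PER_LONG - 1));
}

/*
 * Bounds-checked lookup. The clamp is applied inside the expression that
 * indexes the array, so the clamped value is not held in a variable across
 * other calls, where it could be spilled to the stack.
 */
static int sketch_lookup(const int *table, unsigned long size,
                         unsigned long index)
{
    if (index >= size)
        return -1;
    return table[index & sketch_index_mask(index, size)];
}
```

Computing the clamped index early and calling another function before the array access would reintroduce the situation the commit describes, since the value may then live in memory rather than in a register.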
The commit 5b472b6 ("x86_64/bug: Implement __WARN_printf()") implemented __WARN_printf(), which changed the mechanism to use UD1 instead of UD2. However, it only handles the trap in the runtime IDT handler, while the early booting IDT handler lacks this handling. As a result, the usage of WARN() before the runtime IDT setup can lead to kernel crashes. Since KMSAN is enabled after the runtime IDT setup, it is safe to use handle_bug() directly in early_fixup_exception() to address this issue. Fixes: 5b472b6 ("x86_64/bug: Implement __WARN_printf()") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/c4fb3645f60d3a78629d9870e8fcc8535281c24f.1768016713.git.houwenlong.hwl@antgroup.com
Rustam reported his clang builds did not boot properly; turns out his .config has: CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y set. Fix up the FineIBT code to deal with this unusual alignment. Fixes: 931ab63 ("x86/ibt: Implement FineIBT") Reported-by: Rustam Kovhaev <rkovhaev@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Rustam Kovhaev <rkovhaev@gmail.com>
…ctly alloc_pages_bulk() skips filling entries of the page array that already hold values, yet the 1394 OHCI PCI driver passes the page array without initializing it. This could cause an invalid state at PFN validation in vmap(). Fixes: f2ae927 ("firewire: ohci: split page allocation from dma mapping") Reported-by: John Ogness <john.ogness@linutronix.de> Reported-and-tested-by: Harald Arnesen <linux@skogtun.org> Reported-and-tested-by: David Gow <david@davidgow.net> Closes: https://lore.kernel.org/lkml/87tsv1vig5.fsf@jogness.linutronix.de/ Signed-off-by: Takashi Sakamoto <o-takashi@sakamocchi.jp> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
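To illustrate why the array has to be initialized, here is a small userspace model of a bulk allocator that only fills empty slots. It is not the kernel API: bulk_fill() is a hypothetical function backed by malloc(), and it only models the "skip populated entries" behavior described above.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical model of a bulk allocator that leaves non-NULL slots alone.
 * Returns the number of populated slots after the call.
 */
static size_t bulk_fill(void **slots, size_t n, size_t obj_size)
{
	size_t populated = 0;

	for (size_t i = 0; i < n; i++) {
		/* A slot that already holds a value is trusted as-is. */
		if (!slots[i])
			slots[i] = malloc(obj_size);
		if (slots[i])
			populated++;
	}
	return populated;
}
```

With this contract, a caller that passes an array of stack or heap garbage gets that garbage back as if it were valid allocations, so the array needs to be zeroed (for example with a `{ NULL }` initializer or a zeroing allocator) before the call.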
…cifs-2.6 Pull smb client fixes from Steve French: - Two multichannel fixes - Locking fix for superblock flags - Fix to remove debug message that could log password - Cleanup fix for setting credentials * tag 'v7.0rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: client: Use snprintf in cifs_set_cifscreds smb: client: Don't log plaintext credentials in cifs_set_cifscreds smb: client: fix broken multichannel with krb5+signing smb: client: use atomic_t for mnt_cifs_flags smb: client: fix cifs_pick_channel when channels are equally loaded
…/kernel/git/driver-core/driver-core Pull driver core fixes from Danilo Krummrich: - Do not register imx_clk_scu_driver in imx8qxp_clk_probe(); besides fixing two other issues, this avoids a deadlock in combination with commit dc23806 ("driver core: enforce device_lock for driver_match_device()") - Move secondary node lookup from device_get_next_child_node() to fwnode_get_next_child_node(); this avoids issues when users switch from the device API to the fwnode API - Export io_define_{read,write}!() to avoid unused import warnings when CONFIG_PCI=n * tag 'driver-core-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: clk: scu/imx8qxp: do not register driver in probe() rust: io: macro_export io_define_read!() and io_define_write!() device property: Allow secondary lookup in fwnode_get_next_child_node()
…t/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Fix alignment of arm64 JIT buffer to prevent atomic tearing (Fuad Tabba) - Fix invariant violation for single value tnums in the verifier (Harishankar Vishwanathan, Paul Chaignon) - Fix a bunch of issues found by ASAN in selftests/bpf (Ihor Solodrai) - Fix race in devmap and cpumap on PREEMPT_RT (Jiayuan Chen) - Fix show_fdinfo of kprobe_multi when cookies are not present (Jiri Olsa) - Fix race in freeing special fields in BPF maps to prevent memory leaks (Kumar Kartikeya Dwivedi) - Fix OOB read in dmabuf_collector (T.J. Mercier) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (36 commits) selftests/bpf: Avoid simplification of crafted bounds test selftests/bpf: Test refinement of single-value tnum bpf: Improve bounds when tnum has a single possible value bpf: Introduce tnum_step to step through tnum's members bpf: Fix race in devmap on PREEMPT_RT bpf: Fix race in cpumap on PREEMPT_RT selftests/bpf: Add tests for special fields races bpf: Retire rcu_trace_implies_rcu_gp() from local storage bpf: Delay freeing fields in local storage bpf: Lose const-ness of map in map_check_btf() bpf: Register dtor for freeing special fields selftests/bpf: Fix OOB read in dmabuf_collector selftests/bpf: Fix a memory leak in xdp_flowtable test bpf: Fix stack-out-of-bounds write in devmap bpf: Fix kprobe_multi cookies access in show_fdinfo callback bpf, arm64: Force 8-byte alignment for JIT buffer to prevent atomic tearing selftests/bpf: Don't override SIGSEGV handler with ASAN selftests/bpf: Check BPFTOOL env var in detect_bpftool_path() selftests/bpf: Fix out-of-bounds array access bugs reported by ASAN selftests/bpf: Fix array bounds warning in jit_disasm_helpers ...
…it/jejb/scsi Pull SCSI fixes from James Bottomley: "All changes in drivers (well technically SES is enclosure services, but its change is minor). The biggest is the write combining change in lpfc followed by the additional NULL checks in mpi3mr" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: ufs: core: Fix shift out of bounds when MAXQ=32 scsi: ufs: core: Move link recovery for hibern8 exit failure to wl_resume scsi: ufs: core: Fix possible NULL pointer dereference in ufshcd_add_command_trace() scsi: snic: MAINTAINERS: Update snic maintainers scsi: snic: Remove unused linkstatus scsi: pm8001: Fix use-after-free in pm8001_queue_command() scsi: mpi3mr: Add NULL checks when resetting request and reply queues scsi: ufs: core: Reset urgent_bkops_lvl to allow runtime PM power mode scsi: ses: Fix devices attaching to different hosts scsi: ufs: core: Fix RPMB region size detection for UFS 2.2 scsi: storvsc: Fix scheduling while atomic on PREEMPT_RT scsi: lpfc: Properly set WC for DPP mapping
…ux/kernel/git/tip/tip Pull irqchip driver fixes from Ingo Molnar: - Fix frozen interrupt bug in the sifive-plic driver - Limit per-device MSI interrupts on uncommon gic-v3-its hardware variants - Address Sparse warning by constifying a variable in the MMP driver - Revert broken commit and also fix an error check in the ls-extirq driver * tag 'irq-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/ls-extirq: Fix devm_of_iomap() error check Revert "irqchip/ls-extirq: Use for_each_of_imap_item iterator" irqchip/mmp: Make icu_irq_chip variable static const irqchip/gic-v3-its: Limit number of per-device MSIs to the range the ITS supports irqchip/sifive-plic: Fix frozen interrupt due to affinity setting
…/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Now that LLVM 22 has been released officially, require a release version to use the new CONFIG_WARN_CONTEXT_ANALYSIS feature. In particular this avoids the widely used Android clang 22.0.1 pre-release build which is known to be broken for this usecase" * tag 'locking-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lib/Kconfig.debug: Require a release version of LLVM 22 for context analysis
…nux/kernel/git/tip/tip Pull perf events fixes from Ingo Molnar: - Fix lock ordering bug found by lockdep in perf_event_wakeup() - Fix uncore counter enumeration on Granite Rapids and Sierra Forest - Fix perf_mmap() refcount bug found by Syzkaller - Fix __perf_event_overflow() vs perf_remove_from_context() race * tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix __perf_event_overflow() vs perf_remove_from_context() race perf/core: Fix refcount bug and potential UAF in perf_mmap perf/x86/intel/uncore: Add per-scheduler IMC CAS count events perf/core: Fix invalid wait context in ctx_sched_in()
…inux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix zero_vruntime tracking when there's a single task running - Fix slice protection logic - Fix the ->vprot logic for reniced tasks - Fix lag clamping in mixed slice workloads - Fix objtool uaccess warning (and bug) in the !CONFIG_RSEQ_SLICE_EXTENSION case caused by unexpected un-inlining, which triggers with older compilers - Fix a comment in the rseq registration rseq_size bound check code - Fix a legacy RSEQ ABI quirk that handled 32-byte area sizes differently, which special size we now reached naturally and want to avoid. The visible ugliness of the new reserved field will be avoided the next time the RSEQ area is extended. * tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq: slice ext: Ensure rseq feature size differs from original rseq size rseq: Clarify rseq registration rseq_size bound check comment sched/core: Fix wakeup_preempt's next_class tracking rseq: Mark rseq_arm_slice_extension_timer() __always_inline sched/fair: Fix lag clamp sched/eevdf: Update se->vprot in reweight_entity() sched/fair: Only set slice protection at pick time sched/fair: Fix zero_vruntime tracking
…linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for the common HZ=100, 250 or 1000 cases. Only use a function call for odd HZ values like HZ=300 that generate more code. The function call overhead showed up in performance tests of the TCP code" * tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()
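For the common tick rates the conversion reduces to a single multiply, which is why inlining pays off. A minimal sketch, assuming a tick rate that divides 1000 evenly; HZ is fixed to 250 here purely for illustration, and the _sketch function names are hypothetical rather than the kernel's definitions.

```c
/* Hypothetical tick rate for this sketch; the kernel takes HZ from its config. */
#define HZ 250
#define MSEC_PER_SEC 1000U
#define USEC_PER_SEC 1000000U

#if (MSEC_PER_SEC % HZ) != 0
#error "this sketch only covers HZ values that divide 1000 evenly"
#endif

/* One multiply by a compile-time constant (4 when HZ is 250). */
static inline unsigned int jiffies_to_msecs_sketch(unsigned long j)
{
	return (MSEC_PER_SEC / HZ) * j;
}

/* Same idea for microseconds (4000 per jiffy when HZ is 250). */
static inline unsigned int jiffies_to_usecs_sketch(unsigned long j)
{
	return (USEC_PER_SEC / HZ) * j;
}
```

For a value such as HZ=300 the factor is not an integer, so the conversion needs more code, and that case stays out of line.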
…ux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Fix speculative safety in fred_extint() - Fix __WARN_printf() trap in early_fixup_exception() - Fix clang-build boot bug for unusual alignments, triggered by CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y - Replace the final few __ASSEMBLY__ stragglers that snuck in lately into non-UAPI x86 headers and use __ASSEMBLER__ consistently (again) * tag 'x86-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/headers: Replace __ASSEMBLY__ stragglers with __ASSEMBLER__ x86/cfi: Fix CFI rewrite for odd alignments x86/bug: Handle __WARN_printf() trap in early_fixup_exception() x86/fred: Correct speculative safety in fred_extint()
…scm/linux/kernel/git/tip/tip Pull debugobjects fix from Thomas Gleixner: "A single fix for debugobjects. The deferred page initialization prevents debug objects from allocating slab pages until the initialization is complete. That causes depletion of the pool and disabling of debugobjects. The reason is that debugobjects uses __GFP_HIGH for allocations as it might be invoked from arbitrary contexts. When PREEMPT_COUNT is disabled there is no way to know whether the context is safe to set __GFP_KSWAPD_RECLAIM. This worked until v6.18. Since then allocations w/o a reclaim flag cause new_slab() to end up in alloc_frozen_pages_nolock_noprof(), which returns early when deferred page initialization has not yet completed. Work around that when PREEMPT_COUNT is enabled as the preempt counter allows debugobjects to add __GFP_KSWAPD_RECLAIM to the GFP flags when the context is preemptible. When PREEMPT_COUNT is disabled the context is unknown and the reclaim bit can't be set because the caller might hold locks which might deadlock in the allocator. That makes debugobjects depend on PREEMPT_COUNT || !DEFERRED_STRUCT_PAGE_INIT, which limits the coverage slightly, but keeps it functional for most cases" * tag 'core-debugobjects-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: debugobject: Make it work with deferred page initialization - again
Pull kvm fixes from Paolo Bonzini:
"Arm:
- Make sure we don't leak any S1POE state from guest to guest when
the feature is supported on the HW, but not enabled on the host
- Propagate the ID registers from the host into non-protected VMs
managed by pKVM, ensuring that the guest sees the intended feature
set
- Drop double kern_hyp_va() from unpin_host_sve_state(), which could
bite us if we were to change kern_hyp_va() to not being idempotent
- Don't leak stage-2 mappings in protected mode
- Correctly align the faulting address when dealing with single page
stage-2 mappings for PAGE_SIZE > 4kB
- Fix detection of virtualisation-capable GICv5 IRS, due to the
maintainer being obviously fat fingered... [his words, not mine]
- Remove duplication of code retrieving the ASID for the purpose of
S1 PT handling
- Fix slightly abusive const-ification in vgic_set_kvm_info()
Generic:
- Remove internal Kconfigs that are now set on all architectures
- Remove per-architecture code to enable KVM_CAP_SYNC_MMU, all
architectures finally enable it in Linux 7.0"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: always define KVM_CAP_SYNC_MMU
KVM: remove CONFIG_KVM_GENERIC_MMU_NOTIFIER
KVM: arm64: Deduplicate ASID retrieval code
irqchip/gic-v5: Fix inversion of IRS_IDR0.virt flag
KVM: arm64: Revert accidental drop of kvm_uninit_stage2_mmu() for non-NV VMs
KVM: arm64: Fix protected mode handling of pages larger than 4kB
KVM: arm64: vgic: Handle const qualifier from gic_kvm_info allocation type
KVM: arm64: Remove redundant kern_hyp_va() in unpin_host_sve_state()
KVM: arm64: Fix ID register initialization for non-protected pKVM guests
KVM: arm64: Optimise away S1POE handling when not supported by host
KVM: arm64: Hide S1POE from guests when not supported by the host
…it/cel/linux Pull nfsd fixes from Chuck Lever: - Restore previous nfsd thread count reporting behavior - Fix credential reference leaks in the NFSD netlink admin protocol * tag 'nfsd-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: report the requested maximum number of threads instead of number running nfsd: Fix cred ref leak in nfsd_nl_listener_set_doit(). nfsd: Fix cred ref leak in nfsd_nl_threads_set_doit().
scx_bpf_dsq_nr_queued() reads dsq->nr via READ_ONCE() without holding
any lock, making dsq->nr a lock-free concurrently accessed variable.
However, dsq_mod_nr(), the sole writer of dsq->nr, only uses
WRITE_ONCE() on the write side without the matching READ_ONCE() on the
read side:
WRITE_ONCE(dsq->nr, dsq->nr + delta);
^^^^^^^
plain read -- KCSAN data race
The KCSAN documentation requires that if one accessor uses READ_ONCE()
or WRITE_ONCE() on a variable to annotate lock-free access, all other
accesses must also use the appropriate accessor. A plain read on the
right-hand side of WRITE_ONCE() leaves the pair incomplete and will
trigger KCSAN warnings.
Fix by using READ_ONCE() for the read side of the update:
WRITE_ONCE(dsq->nr, READ_ONCE(dsq->nr) + delta);
This is consistent with scx_bpf_dsq_nr_queued() and makes the
concurrent access annotation complete and KCSAN-clean.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
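The annotated update can be tried out in userspace with simplified stand-ins for the accessors. The READ_ONCE()/WRITE_ONCE() macros below are reduced volatile-cast versions (the kernel's real definitions do more), they rely on the GCC/Clang __typeof__ extension, and struct dsq_sketch with its helpers is hypothetical.

```c
/* Simplified stand-ins for the kernel accessors (assumption: scalar types only). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical, reduced version of a dispatch queue with a lockless counter. */
struct dsq_sketch {
	unsigned int nr;
};

/* Writer: both the load and the store of nr are annotated. */
static void dsq_mod_nr(struct dsq_sketch *dsq, int delta)
{
	WRITE_ONCE(dsq->nr, READ_ONCE(dsq->nr) + delta);
}

/* Lockless reader, matching the annotated writer. */
static unsigned int dsq_nr_queued(struct dsq_sketch *dsq)
{
	return READ_ONCE(dsq->nr);
}
```

The annotations only mark the accesses as intentionally lockless; they do not make the read-modify-write atomic, so a single writer (or external serialization of writers) is still assumed.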
scx_attr_ops_show() and scx_uevent() access scx_root->ops.name directly.
This is problematic for two reasons:
1. The file-level comment explicitly identifies naked scx_root
dereferences as a temporary measure that needs to be replaced
with proper per-instance access.
2. scx_attr_events_show(), the neighboring sysfs show function in
the same group, already uses the correct pattern:
struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj);
Having inconsistent access patterns in the same sysfs/uevent
group is error-prone.
The kobject embedded in struct scx_sched is initialized as:
kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root");
so container_of(kobj, struct scx_sched, kobj) correctly retrieves
the owning scx_sched instance in both callbacks.
Replace the naked scx_root dereferences with container_of()-based
access, consistent with scx_attr_events_show() and in preparation
for proper multi-instance scx_sched support.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
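The container_of() pattern referred to above can be sketched in plain C. The macro below is a reduced version without the kernel's type checking, and the _sketch structures and ops_name_from_kobj() are hypothetical stand-ins for the real types.

```c
#include <stddef.h>

/* Reduced container_of(): step back from a member to its enclosing struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for struct kobject and struct scx_sched. */
struct kobject_sketch {
	const char *name;
};

struct sched_sketch {
	const char *ops_name;
	struct kobject_sketch kobj;	/* embedded, not a pointer */
};

/*
 * A callback that only receives the kobject recovers the owning instance
 * from it instead of reaching for a global root pointer.
 */
static const char *ops_name_from_kobj(struct kobject_sketch *kobj)
{
	struct sched_sketch *sch = container_of(kobj, struct sched_sketch, kobj);

	return sch->ops_name;
}
```

Because the lookup starts from the object the callback was invoked on, it keeps working if more than one instance exists, which a global pointer cannot guarantee.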
If a 'const struct foo __user *ptr' is used for the address passed to scoped_user_read_access() then you get a warning/error uaccess.h:691:1: error: initialization discards 'const' qualifier from pointer target type [-Werror=discarded-qualifiers] for the void __user *_tmpptr = __scoped_user_access_begin(mode, uptr, size, elbl) assignment. Fix by using 'auto' for both _tmpptr and the redeclaration of uptr. Replace the CLASS() with explicit __cleanup() functions on uptr. Fixes: e497310 ("uaccess: Provide scoped user access regions") Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-and-tested-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
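The qualifier problem can be reproduced outside the kernel. The sketch below uses __auto_type, a GCC/Clang extension, as an assumption about what 'auto' amounts to here, plus a GNU statement expression; struct foo's val field, FOO_VAL() and the two wrappers are hypothetical.

```c
struct foo {
	int val;
};

/*
 * Hypothetical macro that captures its pointer argument in a temporary.
 * __auto_type gives _tmp the exact type of the argument, so a pointer to
 * const keeps its qualifier. Declaring the temporary as "struct foo *"
 * instead would discard const for the const caller below.
 */
#define FOO_VAL(ptr) ({ __auto_type _tmp = (ptr); _tmp->val; })

static int foo_val_const(const struct foo *p)
{
	return FOO_VAL(p);
}

static int foo_val_mutable(struct foo *p)
{
	return FOO_VAL(p);
}
```

The same macro body serves both callers without a cast, which is the property the fix relies on.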
scx_watchdog_timeout is written with WRITE_ONCE() in scx_enable():
WRITE_ONCE(scx_watchdog_timeout, timeout);
However, three read-side accesses use plain reads without the matching
READ_ONCE():
/* check_rq_for_timeouts() - L2824 */
last_runnable + scx_watchdog_timeout
/* scx_watchdog_workfn() - L2852 */
scx_watchdog_timeout / 2
/* scx_enable() - L5179 */
scx_watchdog_timeout / 2
The KCSAN documentation requires that if one accessor uses WRITE_ONCE()
to annotate lock-free access, all other accesses must also use the
appropriate accessor. Plain reads alongside WRITE_ONCE() leave the pair
incomplete and can trigger KCSAN warnings.
Note that scx_tick() already uses the correct READ_ONCE() annotation:
last_check + READ_ONCE(scx_watchdog_timeout)
Fix the three remaining plain reads to match, making all accesses to
scx_watchdog_timeout consistently annotated and KCSAN-clean.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Similar to commit 835a507 ("selftests/bpf: Add -fms-extensions to bpf build flags") and commit 639f58a ("bpftool: Fix build warnings due to MS extensions").
The kernel is now built with -fms-extensions, therefore the generated vmlinux.h contains types like:
    struct aes_key {
        struct aes_enckey;
        union aes_invkey_arch inv_k;
    };
    struct ns_common {
        ...
        union {
            struct ns_tree;
            struct callback_head ns_rcu;
        };
    };
These raise warnings like the one below when building the scx schedulers:
    tools/sched_ext/build/include/vmlinux.h:50533:3: warning: declaration does not declare anything [-Wmissing-declarations]
    50533 | struct ns_tree;
          | ^
Fix it by using the -fms-extensions and -Wno-microsoft-anon-tag flags to build bpf programs that #include "vmlinux.h".
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Similar to commit 835a507 ("selftests/bpf: Add -fms-extensions to bpf build flags") and commit 639f58a ("bpftool: Fix build warnings due to MS extensions") Fix "declaration does not declare anything" warning by using -fms-extensions and -Wno-microsoft-anon-tag flags to build bpf programs that #include "vmlinux.h" Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
When compiling the sched_ext selftests with clang 17.0.6, the compiler crashes with a build error:
Error at line 68: Unsupport signed division for DAG: 0x55b2f9a60240: i64 = sdiv 0x55b2f9a609b0, Constant:i64<100>, peek_dsq.bpf.c:68:25 @[ peek_dsq.bpf.c:95:4 @[ peek_dsq.bpf.c:169:8 @[ peek_dsq.bpf.c:140:6 ] ] ]Please convert to unsigned div/mod
After digging, this turns out not to be a compiler bug: clang supports signed division only with -mcpu=v4, while we currently use -mcpu=v3. The better way is to use unsigned division; see [1] for details.
[1] llvm/llvm-project#70433
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
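The general shape of such a conversion is shown below as a plain C sketch. It is not the code from peek_dsq.bpf.c: div100_nonneg() is a hypothetical helper, and the cast is only valid under the stated assumption that the value is never negative.

```c
/*
 * Hypothetical helper: divide a value by 100 using unsigned arithmetic so
 * that a compiler has no reason to emit a signed divide.
 * Assumption: callers never pass a negative value; a negative input would
 * be reinterpreted as a large unsigned number and give a wrong result.
 */
static inline long div100_nonneg(long value)
{
	return (long)((unsigned long)value / 100UL);
}
```

Where negative inputs are possible, the sign has to be handled separately before dividing, or the operands should be declared unsigned from the start.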
The iterator css_for_each_descendant_pre() walks the cgroup hierarchy under cgroup_lock(). It does not increment the reference counts on yielded css structs. According to the cgroup documentation, css_put() should only be used to release a reference obtained via css_get() or css_tryget_online(). Since the iterator does not use either of these to acquire a reference, calling css_put() in the error path of scx_cgroup_init() causes a refcount underflow. Remove the unbalanced css_put() to prevent a potential Use-After-Free (UAF) vulnerability. Fixes: 8195136 ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
…nel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"One-liner or short fixes for minor/moderate problems reported recently:
- fixes or level adjustments of error messages
- fix leaked transaction handles after aborted transactions, when
using the remap tree feature
- fix a few leaked chunk maps after errors
- fix leaked page array in io_uring encoded read if an error occurs
and the 'finished' is not called
- fix double release of reserved extents when doing a range COW
- don't commit super block when the filesystem is in shutdown state
- fix squota accounting condition when checking members vs parent
usage
- other error handling fixes"
* tag 'for-7.0-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: check block group lookup in remove_range_from_remap_tree()
btrfs: fix transaction handle leaks in btrfs_last_identity_remap_gone()
btrfs: fix chunk map leak in btrfs_map_block() after btrfs_translate_remap()
btrfs: fix chunk map leak in btrfs_map_block() after btrfs_chunk_map_num_copies()
btrfs: fix compat mask in error messages in btrfs_check_features()
btrfs: print correct subvol num if active swapfile prevents deletion
btrfs: fix warning in scrub_verify_one_metadata()
btrfs: fix objectid value in error message in check_extent_data_ref()
btrfs: fix incorrect key offset in error message in check_dev_extent_item()
btrfs: fix error message order of parameters in btrfs_delete_delayed_dir_index()
btrfs: don't commit the super block when unmounting a shutdown filesystem
btrfs: free pages on error in btrfs_uring_read_extent()
btrfs: fix referenced/exclusive check in squota_check_parent_usage()
btrfs: remove pointless WARN_ON() in cache_save_setup()
btrfs: convert log messages to error level in btrfs_replay_log()
btrfs: remove btrfs_handle_fs_error() after failure to recover log trees
btrfs: remove redundant warning message in btrfs_check_uuid_tree()
btrfs: change warning messages to error level in open_ctree()
btrfs: fix a double release on reserved extents in cow_one_range()
btrfs: handle discard errors in in btrfs_finish_extent_commit()
During scx_enable(), the READY -> ENABLED task switching loop changes the calling thread's sched_class from fair to ext. Since fair has higher priority than ext, saturating fair-class workloads can indefinitely starve the enable thread, hanging the system. This was introduced when the enable path switched from preempt_disable() to scx_bypass() which doesn't protect against fair-class starvation. Note that the original preempt_disable() protection wasn't complete either - in partial switch modes, the calling thread could still be starved after preempt_enable() as it may have been switched to ext class. Fix it by offloading the enable body to a dedicated system-wide RT (SCHED_FIFO) kthread which cannot be starved by either fair or ext class tasks. scx_enable() lazily creates the kthread on first use and passes the ops pointer through a struct scx_enable_cmd containing the kthread_work, then synchronously waits for completion. The workfn runs on a different kthread from sch->helper (which runs disable_work), so it can safely flush disable_work on the error path without deadlock. Fixes: 8c2090c ("sched_ext: Initialize in bypass mode") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>
…cm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix starvation of scx_enable() under fair-class saturation by offloading the enable path to an RT kthread - Fix out-of-bounds access in idle mask initialization on systems with non-contiguous NUMA node IDs - Fix a preemption window during scheduler exit and a refcount underflow in cgroup init error path - Fix SCX_EFLAG_INITIALIZED being a no-op flag - Add READ_ONCE() annotations for KCSAN-clean lockless accesses and replace naked scx_root dereferences with container_of() in kobject callbacks - Tooling and selftest fixes: compilation issues with clang 17, strtoul() misuse, unused options cleanup, and Kconfig sync * tag 'sched_ext-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix starvation of scx_enable() under fair-class saturation sched_ext: Remove redundant css_put() in scx_cgroup_init() selftests/sched_ext: Fix peek_dsq.bpf.c compile error for clang 17 selftests/sched_ext: Add -fms-extensions to bpf build flags tools/sched_ext: Add -fms-extensions to bpf build flags sched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeout sched_ext: Replace naked scx_root dereferences in kobject callbacks sched_ext: Use READ_ONCE() for the read side of dsq->nr update tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq() sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flag sched_ext: Fix out-of-bounds access in scx_idle_init_masks() sched_ext: Disable preemption between scx_claim_exit() and kicking helper work tools/sched_ext: Add Kconfig to sync with upstream tools/sched_ext: Sync README.md Kconfig with upstream scx selftests/sched_ext: Remove duplicated unistd.h include in rt_stall.c tools/sched_ext: scx_sdt: Remove unused '-f' option tools/sched_ext: scx_central: Remove unused '-p' option selftests/sched_ext: Fix unused-result warning for read() selftests/sched_ext: Abort test loop on signal
…linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix circular locking dependency in cpuset partition code by deferring housekeeping_update() calls to a workqueue instead of calling them directly under cpus_read_lock - Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when generate_sched_domains() returns NULL due to kmalloc failure - Fix incorrect cpuset behavior for effective_xcpus in partition_xcpus_del() and cpuset_update_tasks_cpumask() in update_cpumasks_hier() - Fix race between task migration and cgroup iteration * tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed cgroup/cpuset: Clarify exclusion rules for cpuset internal variables cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier() cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del() cgroup: fix race between task migration and iteration
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)