Commit bf2c313

Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD
KVM mediated PMU support for 6.20

Add support for mediated PMUs, where KVM gives the guest full ownership of PMU hardware (context switched around the fastpath run loop) and allows direct access to data MSRs and PMCs (restricted by the vPMU model), but intercepts access to control registers, e.g. to enforce event filtering and to prevent the guest from profiling sensitive host state.

To keep overall complexity reasonable, mediated PMU usage is all or nothing for a given instance of KVM (controlled via module param).

The mediated PMU is disabled by default, partly to maintain backwards compatibility for existing setups, and partly because there are tradeoffs when running with a mediated PMU that may be non-starters for some use cases, e.g. the host loses the ability to profile guests with mediated PMUs, the fastpath run loop is also a blind spot, entry/exit transitions are more expensive, etc.

Versus the emulated PMU, where KVM is "just another perf user", the mediated PMU delivers more accurate profiling and monitoring (no risk of contention and thus dropped events), with significantly less overhead (fewer exits and faster emulation/programming of event selectors).

E.g. when running Specint-2017 on a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from within the guest:

Perf commands:
  a. basic-sampling:     perf record -F 1000 -e 6-instructions -a --overwrite
  b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite

Guest performance overhead:

---------------------------------------------------------------------------
| Test case           | emulated vPMU | all passthrough | passthrough with |
|                     |               |                 | event filters    |
---------------------------------------------------------------------------
| basic-sampling      |    33.62%     |      4.24%      |      6.21%       |
---------------------------------------------------------------------------
| multiplex-sampling  |    79.32%     |      7.34%      |     10.45%       |
---------------------------------------------------------------------------
2 parents 1b13885 + d374b89 commit bf2c313

File tree

44 files changed: +1463 −312 lines


Documentation/admin-guide/kernel-parameters.txt

Lines changed: 49 additions & 0 deletions
@@ -3079,6 +3079,26 @@ Kernel parameters
 
 			Default is Y (on).
 
+	kvm.enable_pmu=	[KVM,X86]
+			If enabled, KVM will virtualize PMU functionality based
+			on the virtual CPU model defined by userspace. This
+			can be overridden on a per-VM basis via
+			KVM_CAP_PMU_CAPABILITY.
+
+			If disabled, KVM will not virtualize PMU functionality,
+			e.g. MSRs, PMCs, PMIs, etc., even if userspace defines
+			a virtual CPU model that contains PMU assets.
+
+			Note, KVM's vPMU support implicitly requires running
+			with an in-kernel local APIC, e.g. to deliver PMIs to
+			the guest. Running without an in-kernel local APIC is
+			not supported, though KVM will allow such a combination
+			(with severely degraded functionality).
+
+			See also enable_mediated_pmu.
+
+			Default is Y (on).
+
 	kvm.enable_virt_at_load=[KVM,ARM64,LOONGARCH,MIPS,RISCV,X86]
 			If enabled, KVM will enable virtualization in hardware
 			when KVM is loaded, and disable virtualization when KVM

@@ -3125,6 +3145,35 @@ Kernel parameters
 			If the value is 0 (the default), KVM will pick a period based
 			on the ratio, such that a page is zapped after 1 hour on average.
 
+	kvm-{amd,intel}.enable_mediated_pmu=[KVM,AMD,INTEL]
+			If enabled, KVM will provide a mediated virtual PMU,
+			instead of the default perf-based virtual PMU (if
+			kvm.enable_pmu is true and PMU is enumerated via the
+			virtual CPU model).
+
+			With a perf-based vPMU, KVM operates as a user of perf,
+			i.e. emulates guest PMU counters using perf events.
+			KVM-created perf events are managed by perf as regular
+			(guest-only) events, e.g. are scheduled in/out, contend
+			for hardware resources, etc. Using a perf-based vPMU
+			allows guest and host usage of the PMU to co-exist, but
+			incurs non-trivial overhead and can result in silently
+			dropped guest events (due to resource contention).
+
+			With a mediated vPMU, hardware PMU state is context
+			switched around the world switch to/from the guest.
+			KVM mediates which events the guest can utilize, but
+			gives the guest direct access to all other PMU assets
+			when possible (KVM may intercept some accesses if the
+			virtual CPU model provides a subset of hardware PMU
+			functionality). Using a mediated vPMU significantly
+			reduces PMU virtualization overhead and eliminates lost
+			guest events, but is mutually exclusive with using perf
+			to profile KVM guests and adds latency to most VM-Exits
+			(to context switch PMU state).
+
+			Default is N (off).
+
 	kvm-amd.nested=	[KVM,AMD] Control nested virtualization feature in
 			KVM/SVM. Default is 1 (enabled).
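Putting the two parameters above together, a host administrator could opt in to the mediated vPMU either on the kernel command line or at module load time. This is a sketch based on the parameter names documented above; whether the option takes effect also depends on kvm.enable_pmu and on hardware/perf support for the mediated vPMU.

```
# Kernel command line (Intel host):
kvm.enable_pmu=1 kvm-intel.enable_mediated_pmu=1

# Or when loading the vendor module:
modprobe kvm-intel enable_mediated_pmu=1

# Verify the current setting:
cat /sys/module/kvm_intel/parameters/enable_mediated_pmu
```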

arch/arm64/kvm/arm.c

Lines changed: 1 addition & 1 deletion
@@ -2413,7 +2413,7 @@ static int __init init_subsystems(void)
 	if (err)
 		goto out;
 
-	kvm_register_perf_callbacks(NULL);
+	kvm_register_perf_callbacks();
 
 out:
 	if (err)

arch/loongarch/kvm/main.c

Lines changed: 1 addition & 1 deletion
@@ -402,7 +402,7 @@ static int kvm_loongarch_env_init(void)
 	}
 
 	kvm_init_gcsr_flag();
-	kvm_register_perf_callbacks(NULL);
+	kvm_register_perf_callbacks();
 
 	/* Register LoongArch IPI interrupt controller interface. */
 	ret = kvm_loongarch_register_ipi_device();

arch/riscv/kvm/main.c

Lines changed: 1 addition & 1 deletion
@@ -174,7 +174,7 @@ static int __init riscv_kvm_init(void)
 
 	kvm_riscv_setup_vendor_features();
 
-	kvm_register_perf_callbacks(NULL);
+	kvm_register_perf_callbacks();
 
 	rc = kvm_init(sizeof(struct kvm_vcpu), 0, THIS_MODULE);
 	if (rc) {

arch/x86/entry/entry_fred.c

Lines changed: 1 addition & 0 deletions
@@ -114,6 +114,7 @@ static idtentry_t sysvec_table[NR_SYSTEM_VECTORS] __ro_after_init = {
 	SYSVEC(IRQ_WORK_VECTOR,			irq_work),
 
+	SYSVEC(PERF_GUEST_MEDIATED_PMI_VECTOR,	perf_guest_mediated_pmi_handler),
 	SYSVEC(POSTED_INTR_VECTOR,		kvm_posted_intr_ipi),
 	SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	kvm_posted_intr_wakeup_ipi),
 	SYSVEC(POSTED_INTR_NESTED_VECTOR,	kvm_posted_intr_nested_ipi),

arch/x86/events/amd/core.c

Lines changed: 2 additions & 0 deletions
@@ -1439,6 +1439,8 @@ static int __init amd_core_pmu_init(void)
 
 	amd_pmu_global_cntr_mask = x86_pmu.cntr_mask64;
 
+	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_MEDIATED_VPMU;
+
 	/* Update PMC handling functions */
 	x86_pmu.enable_all = amd_pmu_v2_enable_all;
 	x86_pmu.disable_all = amd_pmu_v2_disable_all;

arch/x86/events/core.c

Lines changed: 36 additions & 2 deletions
@@ -30,6 +30,7 @@
 #include <linux/device.h>
 #include <linux/nospec.h>
 #include <linux/static_call.h>
+#include <linux/kvm_types.h>
 
 #include <asm/apic.h>
 #include <asm/stacktrace.h>

@@ -56,6 +57,8 @@ DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
 	.pmu = &pmu,
 };
 
+static DEFINE_PER_CPU(bool, guest_lvtpc_loaded);
+
 DEFINE_STATIC_KEY_FALSE(rdpmc_never_available_key);
 DEFINE_STATIC_KEY_FALSE(rdpmc_always_available_key);
 DEFINE_STATIC_KEY_FALSE(perf_is_hybrid);

@@ -1760,13 +1763,43 @@ void perf_events_lapic_init(void)
 	apic_write(APIC_LVTPC, APIC_DM_NMI);
 }
 
+#ifdef CONFIG_PERF_GUEST_MEDIATED_PMU
+void perf_load_guest_lvtpc(u32 guest_lvtpc)
+{
+	u32 masked = guest_lvtpc & APIC_LVT_MASKED;
+
+	apic_write(APIC_LVTPC,
+		   APIC_DM_FIXED | PERF_GUEST_MEDIATED_PMI_VECTOR | masked);
+	this_cpu_write(guest_lvtpc_loaded, true);
+}
+EXPORT_SYMBOL_FOR_KVM(perf_load_guest_lvtpc);
+
+void perf_put_guest_lvtpc(void)
+{
+	this_cpu_write(guest_lvtpc_loaded, false);
+	apic_write(APIC_LVTPC, APIC_DM_NMI);
+}
+EXPORT_SYMBOL_FOR_KVM(perf_put_guest_lvtpc);
+#endif /* CONFIG_PERF_GUEST_MEDIATED_PMU */
+
 static int
 perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 {
 	u64 start_clock;
 	u64 finish_clock;
 	int ret;
 
+	/*
+	 * Ignore all NMIs when the CPU's LVTPC is configured to route PMIs to
+	 * PERF_GUEST_MEDIATED_PMI_VECTOR, i.e. when an NMI at this time can't
+	 * be due to a PMI.  Attempting to handle a PMI while the guest's
+	 * context is loaded will generate false positives and clobber guest
+	 * state.  Note, the LVTPC is switched to/from the dedicated mediated
+	 * PMI IRQ vector while host events are quiesced.
+	 */
+	if (this_cpu_read(guest_lvtpc_loaded))
+		return NMI_DONE;
+
 	/*
 	 * All PMUs/events that share this PMI handler should make sure to
 	 * increment active_events for their events.

@@ -3073,11 +3106,12 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
 	cap->version = x86_pmu.version;
 	cap->num_counters_gp = x86_pmu_num_counters(NULL);
 	cap->num_counters_fixed = x86_pmu_num_counters_fixed(NULL);
-	cap->bit_width_gp = x86_pmu.cntval_bits;
-	cap->bit_width_fixed = x86_pmu.cntval_bits;
+	cap->bit_width_gp = cap->num_counters_gp ? x86_pmu.cntval_bits : 0;
+	cap->bit_width_fixed = cap->num_counters_fixed ? x86_pmu.cntval_bits : 0;
 	cap->events_mask = (unsigned int)x86_pmu.events_maskl;
 	cap->events_mask_len = x86_pmu.events_mask_len;
 	cap->pebs_ept = x86_pmu.pebs_ept;
+	cap->mediated = !!(pmu.capabilities & PERF_PMU_CAP_MEDIATED_VPMU);
 }
 EXPORT_SYMBOL_FOR_KVM(perf_get_x86_pmu_capability);

arch/x86/events/intel/core.c

Lines changed: 6 additions & 0 deletions
@@ -5695,6 +5695,8 @@ static void intel_pmu_check_hybrid_pmus(struct x86_hybrid_pmu *pmu)
 	else
 		pmu->intel_ctrl &= ~GLOBAL_CTRL_EN_PERF_METRICS;
 
+	pmu->pmu.capabilities |= PERF_PMU_CAP_MEDIATED_VPMU;
+
 	intel_pmu_check_event_constraints_all(&pmu->pmu);
 
 	intel_pmu_check_extra_regs(pmu->extra_regs);

@@ -7314,6 +7316,9 @@ __init int intel_pmu_init(void)
 		pr_cont(" AnyThread deprecated, ");
 	}
 
+	/* The perf side of core PMU is ready to support the mediated vPMU. */
+	x86_get_pmu(smp_processor_id())->capabilities |= PERF_PMU_CAP_MEDIATED_VPMU;
+
 	/*
 	 * Many features on and after V6 require dynamic constraint,
	 * e.g., Arch PEBS, ACR.

@@ -7405,6 +7410,7 @@ __init int intel_pmu_init(void)
 	case INTEL_ATOM_SILVERMONT_D:
 	case INTEL_ATOM_SILVERMONT_MID:
 	case INTEL_ATOM_AIRMONT:
+	case INTEL_ATOM_AIRMONT_NP:
 	case INTEL_ATOM_SILVERMONT_MID2:
 		memcpy(hw_cache_event_ids, slm_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));

arch/x86/events/intel/cstate.c

Lines changed: 26 additions & 7 deletions
@@ -41,7 +41,7 @@
  * MSR_CORE_C1_RES: CORE C1 Residency Counter
  *		     perf code: 0x00
  *		     Available model: SLM,AMT,GLM,CNL,ICX,TNT,ADL,RPL,
- *				      MTL,SRF,GRR,ARL,LNL,PTL
+ *				      MTL,SRF,GRR,ARL,LNL,PTL,WCL,NVL
  *		     Scope: Core (each processor core has a MSR)
  * MSR_CORE_C3_RESIDENCY: CORE C3 Residency Counter
  *			  perf code: 0x01

@@ -53,19 +53,20 @@
  *			  Available model: SLM,AMT,NHM,WSM,SNB,IVB,HSW,BDW,
  *					   SKL,KNL,GLM,CNL,KBL,CML,ICL,ICX,
  *					   TGL,TNT,RKL,ADL,RPL,SPR,MTL,SRF,
- *					   GRR,ARL,LNL,PTL
+ *					   GRR,ARL,LNL,PTL,WCL,NVL
  *			  Scope: Core
  * MSR_CORE_C7_RESIDENCY: CORE C7 Residency Counter
  *			  perf code: 0x03
  *			  Available model: SNB,IVB,HSW,BDW,SKL,CNL,KBL,CML,
  *					   ICL,TGL,RKL,ADL,RPL,MTL,ARL,LNL,
- *					   PTL
+ *					   PTL,WCL,NVL
  *			  Scope: Core
  * MSR_PKG_C2_RESIDENCY: Package C2 Residency Counter.
  *			 perf code: 0x00
  *			 Available model: SNB,IVB,HSW,BDW,SKL,KNL,GLM,CNL,
  *					  KBL,CML,ICL,ICX,TGL,TNT,RKL,ADL,
- *					  RPL,SPR,MTL,ARL,LNL,SRF,PTL
+ *					  RPL,SPR,MTL,ARL,LNL,SRF,PTL,WCL,
+ *					  NVL
  *			 Scope: Package (physical package)
  * MSR_PKG_C3_RESIDENCY: Package C3 Residency Counter.
  *			 perf code: 0x01

@@ -78,7 +79,7 @@
  *			 Available model: SLM,AMT,NHM,WSM,SNB,IVB,HSW,BDW,
  *					  SKL,KNL,GLM,CNL,KBL,CML,ICL,ICX,
  *					  TGL,TNT,RKL,ADL,RPL,SPR,MTL,SRF,
- *					  ARL,LNL,PTL
+ *					  ARL,LNL,PTL,WCL,NVL
  *			 Scope: Package (physical package)
  * MSR_PKG_C7_RESIDENCY: Package C7 Residency Counter.
  *			 perf code: 0x03

@@ -97,11 +98,12 @@
  * MSR_PKG_C10_RESIDENCY: Package C10 Residency Counter.
  *			  perf code: 0x06
  *			  Available model: HSW ULT,KBL,GLM,CNL,CML,ICL,TGL,
- *					   TNT,RKL,ADL,RPL,MTL,ARL,LNL,PTL
+ *					   TNT,RKL,ADL,RPL,MTL,ARL,LNL,PTL,
+ *					   WCL,NVL
  *			  Scope: Package (physical package)
  * MSR_MODULE_C6_RES_MS: Module C6 Residency Counter.
  *			 perf code: 0x00
- *			 Available model: SRF,GRR
+ *			 Available model: SRF,GRR,NVL
  *			 Scope: A cluster of cores shared L2 cache
  *
  */

@@ -527,6 +529,18 @@ static const struct cstate_model lnl_cstates __initconst = {
 				  BIT(PERF_CSTATE_PKG_C10_RES),
 };
 
+static const struct cstate_model nvl_cstates __initconst = {
+	.core_events		= BIT(PERF_CSTATE_CORE_C1_RES) |
+				  BIT(PERF_CSTATE_CORE_C6_RES) |
+				  BIT(PERF_CSTATE_CORE_C7_RES),
+
+	.module_events		= BIT(PERF_CSTATE_MODULE_C6_RES),
+
+	.pkg_events		= BIT(PERF_CSTATE_PKG_C2_RES) |
+				  BIT(PERF_CSTATE_PKG_C6_RES) |
+				  BIT(PERF_CSTATE_PKG_C10_RES),
+};
+
 static const struct cstate_model slm_cstates __initconst = {
 	.core_events		= BIT(PERF_CSTATE_CORE_C1_RES) |
 				  BIT(PERF_CSTATE_CORE_C6_RES),

@@ -599,6 +613,7 @@ static const struct x86_cpu_id intel_cstates_match[] __initconst = {
 	X86_MATCH_VFM(INTEL_ATOM_SILVERMONT,	&slm_cstates),
 	X86_MATCH_VFM(INTEL_ATOM_SILVERMONT_D,	&slm_cstates),
 	X86_MATCH_VFM(INTEL_ATOM_AIRMONT,	&slm_cstates),
+	X86_MATCH_VFM(INTEL_ATOM_AIRMONT_NP,	&slm_cstates),
 
 	X86_MATCH_VFM(INTEL_BROADWELL,		&snb_cstates),
 	X86_MATCH_VFM(INTEL_BROADWELL_D,	&snb_cstates),

@@ -638,6 +653,7 @@ static const struct x86_cpu_id intel_cstates_match[] __initconst = {
 	X86_MATCH_VFM(INTEL_EMERALDRAPIDS_X,	&icx_cstates),
 	X86_MATCH_VFM(INTEL_GRANITERAPIDS_X,	&icx_cstates),
 	X86_MATCH_VFM(INTEL_GRANITERAPIDS_D,	&icx_cstates),
+	X86_MATCH_VFM(INTEL_DIAMONDRAPIDS_X,	&srf_cstates),
 
 	X86_MATCH_VFM(INTEL_TIGERLAKE_L,	&icl_cstates),
 	X86_MATCH_VFM(INTEL_TIGERLAKE,		&icl_cstates),

@@ -654,6 +670,9 @@ static const struct x86_cpu_id intel_cstates_match[] __initconst = {
 	X86_MATCH_VFM(INTEL_ARROWLAKE_U,	&adl_cstates),
 	X86_MATCH_VFM(INTEL_LUNARLAKE_M,	&lnl_cstates),
 	X86_MATCH_VFM(INTEL_PANTHERLAKE_L,	&lnl_cstates),
+	X86_MATCH_VFM(INTEL_WILDCATLAKE_L,	&lnl_cstates),
+	X86_MATCH_VFM(INTEL_NOVALAKE,		&nvl_cstates),
+	X86_MATCH_VFM(INTEL_NOVALAKE_L,		&nvl_cstates),
 	{ },
 };
 MODULE_DEVICE_TABLE(x86cpu, intel_cstates_match);

arch/x86/events/msr.c

Lines changed: 1 addition & 0 deletions
@@ -78,6 +78,7 @@ static bool test_intel(int idx, void *data)
 	case INTEL_ATOM_SILVERMONT:
 	case INTEL_ATOM_SILVERMONT_D:
 	case INTEL_ATOM_AIRMONT:
+	case INTEL_ATOM_AIRMONT_NP:
 
 	case INTEL_ATOM_GOLDMONT:
 	case INTEL_ATOM_GOLDMONT_D:
