Commit d7a5da7

KAGA-KOKO authored and Peter Zijlstra committed
rseq: Add fields and constants for time slice extension

Aside of a Kconfig knob add the following items:

   - Two flag bits for the rseq user space ABI, which allow user space to
     query the availability and enablement without a syscall.

   - A new member to the user space ABI struct rseq, which is going to be
     used to communicate request and grant between kernel and user space.

   - A rseq state struct to hold the kernel state of this

   - Documentation of the new mechanism

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251215155708.669472597@linutronix.de

1 parent 4fe82cf commit d7a5da7

6 files changed: +220 -1 lines changed


Documentation/userspace-api/index.rst

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ System calls
    ebpf/index
    ioctl/index
    mseal
+   rseq
 
 Security-related interfaces
 ===========================
Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
+=====================
+Restartable Sequences
+=====================
+
+Restartable Sequences allow registering a per-thread userspace memory area
+to be used as an ABI between kernel and userspace for three purposes:
+
+ * userspace restartable sequences
+
+ * quick access to read the current CPU number, node ID from userspace
+
+ * scheduler time slice extensions
+
+Restartable sequences (per-cpu atomics)
+---------------------------------------
+
+Restartable sequences allow userspace to perform update operations on
+per-cpu data without requiring heavyweight atomic operations. The actual
+ABI is unfortunately only available in the code and selftests.
+
+Quick access to CPU number, node ID
+-----------------------------------
+
+Allows per-CPU data to be implemented efficiently. Documentation is in code
+and selftests. :(
+
+Scheduler time slice extensions
+-------------------------------
+
+This allows a thread to request a time slice extension when it enters a
+critical section to avoid contention on a resource when the thread is
+scheduled out inside of the critical section.
+
+The prerequisites for this functionality are:
+
+    * Enabled in Kconfig
+
+    * Enabled at boot time (default is enabled)
+
+    * A rseq userspace pointer has been registered for the thread
+
+The thread has to enable the functionality via prctl(2)::
+
+    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
+          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);
+
+prctl() returns 0 on success or otherwise with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL    Functionality not available or invalid function arguments.
+          Note: arg4 and arg5 must be zero
+ENOTSUPP  Functionality was disabled on the kernel command line
+ENXIO     Available, but no rseq user struct registered
+========= ==============================================================
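
The enable call and its documented error codes could be wrapped roughly as follows. This is a minimal sketch, not part of the patch: the `PR_RSEQ_SLICE_*` constants are not defined by this commit, so the fallback values below are assumptions that must be checked against the `<linux/prctl.h>` of a kernel that carries the prctl() support. `ENOTSUPP` is a kernel-internal errno (524) that libc headers do not provide, and the helper names `slice_ext_enable()` and `slice_ext_strerror()` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/*
 * ASSUMPTION: fallback definitions for headers that predate the feature.
 * The numeric values are illustrative and not taken from this commit;
 * verify them against <linux/prctl.h> of a kernel with prctl() support.
 */
#ifndef PR_RSEQ_SLICE_EXTENSION
# define PR_RSEQ_SLICE_EXTENSION	79
# define PR_RSEQ_SLICE_EXTENSION_GET	1
# define PR_RSEQ_SLICE_EXTENSION_SET	2
# define PR_RSEQ_SLICE_EXT_ENABLE	0x01
#endif

/* ENOTSUPP is kernel-internal (524); libc headers do not define it. */
#ifndef ENOTSUPP
# define ENOTSUPP 524
#endif

/* Hypothetical helper: map the documented error codes to a reason. */
const char *slice_ext_strerror(int err)
{
	assert(err >= 0);	/* expects 0 or a positive errno value */

	switch (err) {
	case 0:		return "enabled";
	case EINVAL:	return "not available or invalid arguments";
	case ENOTSUPP:	return "disabled on the kernel command line";
	case ENXIO:	return "no rseq user struct registered";
	default:	return "unexpected error";
	}
}

/*
 * Hypothetical helper: request enablement for the calling thread.
 * Returns 0 on success, otherwise the errno value set by prctl().
 */
int slice_ext_enable(void)
{
	/* arg4 and arg5 must be zero according to the documentation */
	if (prctl(PR_RSEQ_SLICE_EXTENSION,
		  (unsigned long)PR_RSEQ_SLICE_EXTENSION_SET,
		  (unsigned long)PR_RSEQ_SLICE_EXT_ENABLE, 0UL, 0UL) == 0)
		return 0;
	return errno;
}
```

A caller would typically invoke `slice_ext_enable()` once per thread after rseq registration and fall back to running without extensions if it returns non-zero.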
+
+The state can also be queried via prctl(2)::
+
+    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);
+
+prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
+disabled. Otherwise it returns with the following error codes:
+
+========= ==============================================================
+Errorcode Meaning
+========= ==============================================================
+EINVAL    Functionality not available or invalid function arguments.
+          Note: arg3 and arg4 and arg5 must be zero
+========= ==============================================================
+
+The availability and status are also exposed via the rseq ABI struct flags
+field via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
+``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
+space and only for informational purposes.
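
The syscall-free query could look like the sketch below. It operates on a flags value that the caller has already read from its registered `struct rseq`; how a thread locates that area is outside the scope of this commit. The bit numbers are the ones this commit adds to `enum rseq_cs_flags_bit`, redefined locally so the example compiles without updated UAPI headers. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as added to enum rseq_cs_flags_bit by this commit. */
#define RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT	4
#define RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT	5

#define RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE \
	(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT)
#define RSEQ_CS_FLAG_SLICE_EXT_ENABLED \
	(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT)

static_assert(RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE == 0x10, "bit 4 expected");
static_assert(RSEQ_CS_FLAG_SLICE_EXT_ENABLED == 0x20, "bit 5 expected");

/* Hypothetical helper: the kernel advertises the mechanism. */
bool slice_ext_available(uint32_t rseq_flags)
{
	return (rseq_flags & RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE) != 0;
}

/* Hypothetical helper: the mechanism is enabled for this thread. */
bool slice_ext_enabled(uint32_t rseq_flags)
{
	return (rseq_flags & RSEQ_CS_FLAG_SLICE_EXT_ENABLED) != 0;
}
```

Because the bits are informational and read-only, user space should only test them and never write them back.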
+
+If the mechanism was enabled via prctl(), the thread can request a time
+slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
+interrupted and the interrupt results in a reschedule request in the
+kernel, then the kernel can grant a time slice extension and return to
+userspace instead of scheduling out. The length of the extension is
+determined by the ``rseq_slice_extension_nsec`` sysctl.
+
+The kernel indicates the grant by clearing rseq::slice_ctrl::request and
+setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
+thread after granting the extension, the kernel clears the granted bit to
+indicate that to userspace.
+
+If the request bit is still set when leaving the critical section,
+userspace can clear it and continue.
+
+If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
+leaving the critical section to relinquish the CPU. The kernel enforces
+this by arming a timer to prevent misbehaving userspace from abusing this
+mechanism.
+
+If both the request bit and the granted bit are false when leaving the
+critical section, then this indicates that a grant was revoked and no
+further action is required by userspace.
+
+The required code flow is as follows::
+
+    rseq->slice_ctrl.request = 1;
+    barrier();  // Prevent compiler reordering
+    critical_section();
+    barrier();  // Prevent compiler reordering
+    rseq->slice_ctrl.request = 0;
+    if (rseq->slice_ctrl.granted)
+            rseq_slice_yield();
+
+As all of this is strictly CPU local, there are no atomicity requirements.
+Checking the granted state is racy, but that cannot be avoided at all::
+
+    if (rseq->slice_ctrl.granted)
+      -> Interrupt results in schedule and grant revocation
+            rseq_slice_yield();
+
+So there is no point in pretending that this might be solved by an atomic
+operation.
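
The code flow above can be exercised in isolation with a mock control word. The sketch below is an illustration under stated assumptions, not the selftest from the kernel tree: `struct slice_ctrl` mirrors the layout of `struct rseq_slice_ctrl`, `rseq_slice_yield_stub()` stands in for the real `rseq_slice_yield(2)` call, and `fake_kernel_grant()` simulates what the kernel would do on an interrupt. All of these names, and `run_with_slice_request()`, are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the layout of struct rseq_slice_ctrl from the UAPI header. */
struct slice_ctrl {
	union {
		uint32_t all;
		struct {
			uint8_t request;
			uint8_t granted;
			uint16_t reserved;
		};
	};
};

/* Compiler-only barrier, the same idea as the kernel's barrier(). */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* ASSUMPTION: stand-in for the real rseq_slice_yield(2) wrapper. */
unsigned int yield_calls;
static void rseq_slice_yield_stub(void)
{
	yield_calls++;
}

/*
 * Run fn(arg) as a critical section with an extension request pending.
 * Returns 1 if a grant was observed and the CPU was yielded, else 0.
 */
int run_with_slice_request(volatile struct slice_ctrl *ctrl,
			   void (*fn)(void *), void *arg)
{
	ctrl->request = 1;
	barrier();		/* keep the store before the section */
	fn(arg);
	barrier();		/* keep the section before the clear */
	ctrl->request = 0;
	if (ctrl->granted) {	/* racy by design, see the text above */
		rseq_slice_yield_stub();
		return 1;
	}
	return 0;
}

/* Simulation only: the kernel grants an extension mid-section. */
void fake_kernel_grant(void *arg)
{
	volatile struct slice_ctrl *ctrl = arg;

	ctrl->request = 0;
	ctrl->granted = 1;
}

/* Simulation only: a critical section that is never interrupted. */
void uninterrupted(void *arg)
{
	(void)arg;
}
```

In the uninterrupted case the request byte is simply cleared on exit. Only an observed grant leads to a yield, which matches the three exit states described in the documentation.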
+
+If the thread issues a syscall other than rseq_slice_yield(2) within the
+granted timeslice extension, the grant is also revoked and the CPU is
+relinquished immediately when entering the kernel. This is required as
+syscalls might consume arbitrary CPU time until they reach a scheduling
+point when the preemption model is either NONE or VOLUNTARY and therefore
+might exceed the grant by far.
+
+The preferred solution for user space is to use rseq_slice_yield(2) which
+is side effect free. The support for arbitrary syscalls is required to
+support onion layer architectured applications, where the code handling the
+critical section and requesting the time slice extension has no control
+over the code within the critical section.
+
+The kernel enforces flag consistency and terminates the thread with SIGSEGV
+if it detects a violation.

include/linux/rseq_types.h

Lines changed: 27 additions & 1 deletion
@@ -72,20 +72,46 @@ struct rseq_ids {
 	};
 };
 
+/**
+ * union rseq_slice_state - Status information for rseq time slice extension
+ * @state: Compound to access the overall state
+ * @enabled: Time slice extension is enabled for the task
+ * @granted: Time slice extension was granted to the task
+ */
+union rseq_slice_state {
+	u16 state;
+	struct {
+		u8 enabled;
+		u8 granted;
+	};
+};
+
+/**
+ * struct rseq_slice - Status information for rseq time slice extension
+ * @state: Time slice extension state
+ */
+struct rseq_slice {
+	union rseq_slice_state state;
+};
+
 /**
  * struct rseq_data - Storage for all rseq related data
  * @usrptr: Pointer to the registered user space RSEQ memory
  * @len: Length of the RSEQ region
- * @sig: Signature of critial section abort IPs
+ * @sig: Signature of critical section abort IPs
  * @event: Storage for event management
  * @ids: Storage for cached CPU ID and MM CID
+ * @slice: Storage for time slice extension data
  */
 struct rseq_data {
 	struct rseq __user *usrptr;
 	u32 len;
 	u32 sig;
 	struct rseq_event event;
 	struct rseq_ids ids;
+#ifdef CONFIG_RSEQ_SLICE_EXTENSION
+	struct rseq_slice slice;
+#endif
 };
 
 #else /* CONFIG_RSEQ */

include/uapi/linux/rseq.h

Lines changed: 38 additions & 0 deletions
@@ -23,9 +23,15 @@ enum rseq_flags {
 };
 
 enum rseq_cs_flags_bit {
+	/* Historical and unsupported bits */
 	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
 	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+	/* (3) Intentional gap to put new bits into a separate byte */
+
+	/* User read only feature flags */
+	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT = 4,
+	RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT = 5,
 };
 
 enum rseq_cs_flags {
@@ -35,6 +41,11 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
 	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+
+	RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE =
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT),
+	RSEQ_CS_FLAG_SLICE_EXT_ENABLED =
+		(1U << RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT),
 };
 
 /*
@@ -53,6 +64,27 @@ struct rseq_cs {
 	__u64 abort_ip;
 } __attribute__((aligned(4 * sizeof(__u64))));
 
+/**
+ * rseq_slice_ctrl - Time slice extension control structure
+ * @all: Compound value
+ * @request: Request for a time slice extension
+ * @granted: Granted time slice extension
+ *
+ * @request is set by user space and can be cleared by user space or kernel
+ * space. @granted is set and cleared by the kernel and must only be read
+ * by user space.
+ */
+struct rseq_slice_ctrl {
+	union {
+		__u32 all;
+		struct {
+			__u8 request;
+			__u8 granted;
+			__u16 __reserved;
+		};
+	};
+};
+
 /*
  * struct rseq is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line.
@@ -141,6 +173,12 @@ struct rseq {
 	 */
 	__u32 mm_cid;
 
+	/*
+	 * Time slice extension control structure. CPU local updates from
+	 * kernel and user space.
+	 */
+	struct rseq_slice_ctrl slice_ctrl;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
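
The layout of `struct rseq_slice_ctrl` in the hunk above can be checked from user space with a small mirror type. This is a sketch under assumptions: `struct rseq_slice_ctrl_demo` and the two helpers are hypothetical names, and `<stdint.h>` types stand in for the kernel's `__u8`, `__u16` and `__u32`. It shows why the compound `all` member is convenient: one 32-bit store clears both bytes, which is what the registration path in kernel/rseq.c does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct rseq_slice_ctrl, for illustration only. */
struct rseq_slice_ctrl_demo {
	union {
		uint32_t all;
		struct {
			uint8_t request;
			uint8_t granted;
			uint16_t reserved;
		};
	};
};

/* The control word is exactly one 32-bit value with two byte flags. */
static_assert(sizeof(struct rseq_slice_ctrl_demo) == sizeof(uint32_t),
	      "control word must be 4 bytes");
static_assert(offsetof(struct rseq_slice_ctrl_demo, request) == 0,
	      "request is the first byte");
static_assert(offsetof(struct rseq_slice_ctrl_demo, granted) == 1,
	      "granted is the second byte");

/* Hypothetical helper: clear request and granted with a single store. */
void slice_ctrl_reset(struct rseq_slice_ctrl_demo *ctrl)
{
	ctrl->all = 0;
}

/* Hypothetical helper: true if no byte of the control word is set. */
int slice_ctrl_idle(const struct rseq_slice_ctrl_demo *ctrl)
{
	return ctrl->all == 0;
}
```

The example deliberately avoids asserting a specific non-zero value of `all`, because that value depends on the byte order of the machine.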

init/Kconfig

Lines changed: 12 additions & 0 deletions
@@ -1938,6 +1938,18 @@ config RSEQ
 
 	  If unsure, say Y.
 
+config RSEQ_SLICE_EXTENSION
+	bool "Enable rseq-based time slice extension mechanism"
+	depends on RSEQ && HIGH_RES_TIMERS && GENERIC_ENTRY && HAVE_GENERIC_TIF_BITS
+	help
+	  Allows userspace to request a limited time slice extension when
+	  returning from an interrupt to user space via the RSEQ shared
+	  data ABI. If granted, that allows to complete a critical section,
+	  so that other threads are not stuck on a conflicted resource,
+	  while the task is scheduled out.
+
+	  If unsure, say N.
+
 config RSEQ_STATS
 	default n
 	bool "Enable lightweight statistics of restartable sequences" if EXPERT

kernel/rseq.c

Lines changed: 7 additions & 0 deletions
@@ -389,6 +389,8 @@ static bool rseq_reset_ids(void)
  */
 SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
 {
+	u32 rseqfl = 0;
+
 	if (flags & RSEQ_FLAG_UNREGISTER) {
 		if (flags & ~RSEQ_FLAG_UNREGISTER)
 			return -EINVAL;
@@ -440,6 +442,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
 
+	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+		rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
+
 	scoped_user_write_access(rseq, efault) {
 		/*
 		 * If the rseq_cs pointer is non-NULL on registration, clear it to
@@ -449,11 +454,13 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
 		 * clearing the fields. Don't bother reading it, just reset it.
 		 */
 		unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+		unsafe_put_user(rseqfl, &rseq->flags, efault);
 		/* Initialize IDs in user space */
 		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
 		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
 		unsafe_put_user(0U, &rseq->node_id, efault);
 		unsafe_put_user(0U, &rseq->mm_cid, efault);
+		unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
 	}
 
 	/*
