|
| 1 | +===================== |
| 2 | +Restartable Sequences |
| 3 | +===================== |
| 4 | + |
| 5 | +Restartable Sequences allow a thread to register a per-thread userspace |
| 6 | +memory area used as an ABI between kernel and userspace for three purposes: |
| 7 | + |
| 8 | + * userspace restartable sequences |
| 9 | + |
| 10 | + * quick access to the current CPU number and node ID from userspace |
| 11 | + |
| 12 | + * scheduler time slice extensions |
| 13 | + |
| 14 | +Restartable sequences (per-cpu atomics) |
| 15 | +--------------------------------------- |
| 16 | + |
| 17 | +Restartable sequences allow userspace to perform update operations on |
| 18 | +per-cpu data without requiring heavyweight atomic operations. The actual |
| 19 | +ABI is unfortunately only available in the code and selftests. |
| 20 | + |
| 21 | +Quick access to CPU number, node ID |
| 22 | +----------------------------------- |
| 23 | + |
| 24 | +This allows per-CPU data to be implemented efficiently. Documentation is in |
| 25 | +the code and selftests. :( |
| 26 | + |
| 27 | +Scheduler time slice extensions |
| 28 | +------------------------------- |
| 29 | + |
| 30 | +This allows a thread to request a time slice extension when it enters a |
| 31 | +critical section to avoid contention on a resource when the thread is |
| 32 | +scheduled out inside of the critical section. |
| 33 | + |
| 34 | +The prerequisites for this functionality are: |
| 35 | + |
| 36 | + * Enabled in Kconfig |
| 37 | + |
| 38 | + * Enabled at boot time (default is enabled) |
| 39 | + |
| 40 | + * An rseq userspace pointer has been registered for the thread |
| 41 | + |
| 42 | +The thread has to enable the functionality via prctl(2):: |
| 43 | + |
| 44 | +    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET, |
| 45 | +          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0); |
| 46 | + |
| 47 | +prctl() returns 0 on success, or fails with one of the following error codes: |
| 48 | + |
| 49 | +========= ============================================================== |
| 50 | +Errorcode Meaning |
| 51 | +========= ============================================================== |
| 52 | +EINVAL    Functionality not available or invalid function arguments. |
| 53 | +          Note: arg4 and arg5 must be zero |
| 54 | +ENOTSUPP  Functionality was disabled on the kernel command line |
| 55 | +ENXIO     Available, but no rseq user struct registered |
| 56 | +========= ============================================================== |
| 57 | + |
| 58 | +The state can also be queried via prctl(2):: |
| 59 | + |
| 60 | +    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0); |
| 61 | + |
| 62 | +prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if |
| 63 | +disabled. Otherwise it fails with the following error code: |
| 64 | + |
| 65 | +========= ============================================================== |
| 66 | +Errorcode Meaning |
| 67 | +========= ============================================================== |
| 68 | +EINVAL    Functionality not available or invalid function arguments. |
| 69 | +          Note: arg3, arg4 and arg5 must be zero |
| 70 | +========= ============================================================== |
| 71 | + |
| 72 | +Availability and status are also exposed in the flags field of the rseq ABI |
| 73 | +struct via ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and |
| 74 | +``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for |
| 75 | +userspace and are purely informational. |
| 76 | + |
| 77 | +If the mechanism was enabled via prctl(), the thread can request a time |
| 78 | +slice extension by setting rseq::slice_ctrl::request to 1. If the thread is |
| 79 | +interrupted and the interrupt results in a reschedule request in the |
| 80 | +kernel, then the kernel can grant a time slice extension and return to |
| 81 | +userspace instead of scheduling out. The length of the extension is |
| 82 | +determined by the ``rseq_slice_extension_nsec`` sysctl. |
| 83 | + |
| 84 | +The kernel indicates the grant by clearing rseq::slice_ctrl::request and |
| 85 | +setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the |
| 86 | +thread after granting the extension, the kernel clears the granted bit to |
| 87 | +indicate that to userspace. |
| 88 | + |
| 89 | +If the request bit is still set when leaving the critical section, |
| 90 | +userspace can clear it and continue. |
| 91 | + |
| 92 | +If the granted bit is set, then userspace invokes rseq_slice_yield(2) when |
| 93 | +leaving the critical section to relinquish the CPU. The kernel enforces |
| 94 | +this by arming a timer to prevent misbehaving userspace from abusing this |
| 95 | +mechanism. |
| 96 | + |
| 97 | +If both the request bit and the granted bit are false when leaving the |
| 98 | +critical section, then this indicates that a grant was revoked and no |
| 99 | +further action is required by userspace. |
| 100 | + |
| 101 | +The required code flow is as follows:: |
| 102 | + |
| 103 | +    rseq->slice_ctrl.request = 1; |
| 104 | +    barrier(); // Prevent compiler reordering |
| 105 | +    critical_section(); |
| 106 | +    barrier(); // Prevent compiler reordering |
| 107 | +    rseq->slice_ctrl.request = 0; |
| 108 | +    if (rseq->slice_ctrl.granted) |
| 109 | +        rseq_slice_yield(); |
| 110 | + |
| 111 | +As all of this is strictly CPU local, there are no atomicity requirements. |
| 112 | +Checking the granted state is racy, but that cannot be avoided at all:: |
| 113 | + |
| 114 | +    if (rseq->slice_ctrl.granted) |
| 115 | +      -> Interrupt results in schedule and grant revocation |
| 116 | +        rseq_slice_yield(); |
| 117 | + |
| 118 | +So there is no point in pretending that this might be solved by an atomic |
| 119 | +operation. |
| 120 | + |
| 121 | +If the thread issues a syscall other than rseq_slice_yield(2) within the |
| 122 | +granted time slice extension, the grant is also revoked and the CPU is |
| 123 | +relinquished immediately when entering the kernel. This is required as |
| 124 | +syscalls might consume arbitrary CPU time until they reach a scheduling |
| 125 | +point when the preemption model is either NONE or VOLUNTARY and therefore |
| 126 | +might exceed the grant by far. |
| 127 | + |
| 128 | +The preferred solution for userspace is to use rseq_slice_yield(2), which |
| 129 | +is side effect free. Handling of arbitrary syscalls is required to support |
| 130 | +applications with an onion-layer architecture, where the code handling the |
| 131 | +critical section and requesting the time slice extension has no control |
| 132 | +over the code within the critical section. |
| 133 | + |
| 134 | +The kernel enforces flag consistency and terminates the thread with SIGSEGV |
| 135 | +if it detects a violation. |
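
The request/granted flow described in the patch can be exercised without kernel support by mocking both sides. The following is a minimal sketch under stated assumptions, not the real ABI: `struct mock_slice_ctrl`, `mock_slice_yield()`, `run_critical()` and the two `section_*` helpers are hypothetical names invented for illustration, and the "kernel" is emulated by a callback that flips the bits the way the documentation describes a grant.

```c
#include <assert.h>
#include <stddef.h>

/* Mock of the two control bits; this is NOT the real rseq ABI layout. */
struct mock_slice_ctrl {
    volatile unsigned char request;
    volatile unsigned char granted;
};

/* Counts how often the mocked yield was invoked. */
static int yield_calls;

/* Hypothetical stand-in for rseq_slice_yield(2): record the call and
 * drop the grant, emulating the CPU being relinquished. */
static void mock_slice_yield(struct mock_slice_ctrl *ctrl)
{
    yield_calls++;
    ctrl->granted = 0;
}

/* Compiler-only barrier (GCC/Clang syntax), as in the documented flow. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Runs fn() as the critical section, following the documented flow:
 * set request, run, clear request, yield if an extension was granted. */
static void run_critical(struct mock_slice_ctrl *ctrl,
                         void (*fn)(struct mock_slice_ctrl *))
{
    assert(ctrl != NULL && fn != NULL);

    ctrl->request = 1;
    barrier();
    fn(ctrl);
    barrier();
    ctrl->request = 0;
    if (ctrl->granted)
        mock_slice_yield(ctrl);
}

/* Critical section that is never interrupted: no grant happens. */
static void section_no_grant(struct mock_slice_ctrl *ctrl)
{
    (void)ctrl;
}

/* Critical section during which the emulated kernel grants an
 * extension: request is cleared and granted is set. */
static void section_with_grant(struct mock_slice_ctrl *ctrl)
{
    ctrl->request = 0;
    ctrl->granted = 1;
}
```

Calling `run_critical()` with `section_no_grant` leaves the yield counter untouched, while `section_with_grant` triggers exactly one mocked yield. The real mechanism additionally needs the prctl(2) enablement and a registered rseq area, which this mock deliberately leaves out.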