Process Changes Meta Pull Request #10429
Conversation
…e process that is really currently running, regardless of any effective process installed. It is critical to ignore the simulation trick of the debugger (the effective process) during normal completion of the process. This method ensures that every process is able to complete. See the method comment.
Fix terminating state of normally completed process by moving the flag logic to final termination methods.
…recise criterion: when suspendedContext points to the endPC of the Process>>doTerminatinOfYourself method. The last instruction there is the final suspend operation, which removes the process from the scheduler. It means that the process is not terminated until it is really not running anymore (when the suspend is executed). Helper methods are added to support safe stepping (simulation) in the debugger: Process>>canBeStepped and Context>>canBeSteppedSafelly. For the moment #canBeSteppedSafelly is only about preventing stepping over "Processor terminateRealActive", but it can be extended if we find other cases. Now the debugger should use these methods as the criterion for finished execution.
…thod. Now the step always delegates the simulation to the #terminateActive method, to affect the effective process instead of the process running the simulation. It fixes all the simulation tests discovered before this commit. The Sindarin changes are reverted back to the #terminating flag to prevent stepping over the final #suspend. That should also be supported, but it's a different story.
…which are triggered from the process itself as a result of unhandled exception.
…onFinished. The actual problem was at the kernel level. Simulating the process up to its self-termination was not working using the stepInto, stepOver and stepThrough methods. Corresponding tests are added to cover these basic scenarios, and they are fixed. The important change is in the primitive simulation: the process #suspend must do nothing during a primitive (see the comment). This makes debugging over a #suspend call safe for the real active process.
… that the current context is actually the end of process termination. Using #hasSender: is not a reliable criterion for stepping, because simulated code can perform context manipulations that break this condition in the middle of its logic, while continuing the execution can still return to the home context.
It removes the duplication of checking both criteria. And it really makes sense in other cases where only #isDead was used. It will possibly fix some corner cases during debugger operations over #terminateFromYourself (not normally accessed by users).
…a criterion for the isTerminated method
…ests expectations
…n - skips multiple unwind blocks and corrupts the image
Add tests, modify outerMost sender, replace tempAt with accessors. Move two failing tests to expectedFailures. Reset outerMost context as top context.
… [ #some ] ensure: [ ^#unwindReturn ]
Hi all, thanks for looking at the PRs; here are my comments:

#9137 is an isolated adjustment of an inconsistency in exception handling that I expect will not interfere with anything here.

#8509: I suggest you add #10383 into the mix, as it addresses (and fixes) the root cause of the "stepOver bug" (#8443), and I think #10383 supersedes the fix in #8509 (the test in #8509 should of course remain valid for #10383 as well).

#8993: this is part of a larger project (of mine) rewriting #terminate. I extracted this part for Pharo to work without clashes with other code; however, #8567 was so big that I couldn't foresee all possible interactions, so sorry if it does clash. The main objective of this PR was to eliminate the terrible "unwind error" and extend the scope of unwinding during termination a bit (complete ALL half-way-through unwind contexts).

#8845 was an attempt to fix a bug in #isTerminated and avoid hardcoding exact conditions like "pc + 1 = self endPC". This PR will definitely interfere with #8567; I don't know exactly what the "grand plan" to make this super-PR work is, but I consider #8845 of lesser importance: it could be withdrawn if it meant substantial adjustments to #8567, and an alternative solution merged later.

Best regards,
Thanks for the quick reply, I'll check #10383 and put it together here
I'm super sad about the current state of process termination. I spent my lunch break sketching a different solution. The idea is that no process can terminate another process: a process can only terminate itself. This would imply, for example:
```smalltalk
Process >> join
	"Make the calling process wait until the receiver process terminates.
	If the receiver is already terminated, return directly.
	Otherwise, wait on a Semaphore."

	self isTerminated ifTrue: [ ^ self ].
	(joinSemaphore ifNil: [ joinSemaphore := Semaphore new ]) wait
```

```smalltalk
Process >> terminate2
	"Request termination of the receiver process.
	The receiver process will eventually terminate itself."

	suspendedContext := [
		self doTerminationFromYourself.
		joinSemaphore signalAll.
	] asContextWithSender: suspendedContext
```

```smalltalk
Process >> endProcess
	"When I reach this method, I'm terminated. Suspending or terminating me is harmless."

	<isTerminated>
	thisContext terminateTo: nil. "set thisContext sender to nil"
	--> joinSemaphore ifNotNil: [ :s | s signalAll ].
	self suspend.
```
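The join protocol sketched above (return immediately if the process is already terminated, otherwise block until the terminating process signals its own end) can be illustrated with a minimal Python analogue. The `Proc` class and all names in it are hypothetical, chosen for illustration only; Pharo's real API differs:

```python
import threading

class Proc:
    """Toy analogue of the sketched Process>>join protocol (hypothetical names)."""

    def __init__(self, target):
        # plays the role of joinSemaphore: signalled exactly once, at termination
        self._terminated = threading.Event()
        self._thread = threading.Thread(target=self._run, args=(target,))

    def _run(self, target):
        try:
            target()
        finally:
            # analogue of endProcess signalling joinSemaphore:
            # wake up every process blocked in join()
            self._terminated.set()

    def start(self):
        self._thread.start()

    def is_terminated(self):
        return self._terminated.is_set()

    def join(self):
        # return immediately if already terminated, otherwise block
        self._terminated.wait()

p = Proc(lambda: None)
p.start()
p.join()   # blocks until the child has run to completion
p.join()   # already terminated: returns directly
```

The key property, as in the Smalltalk sketch, is that the signal is raised by the terminating process itself, never forced from outside.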
```smalltalk
Semaphore >> critical: mutuallyExcludedBlock
	| blockValue caught |
	caught := false.
	[
		caught := true.
		self wait.
		blockValue := mutuallyExcludedBlock value
	] ensure: [ caught ifTrue: [ self signal ] ].
	^blockValue
```
===>
```smalltalk
Semaphore >> critical: mutuallyExcludedBlock
	| blockValue caught |
	[
		caught := self wait.
		blockValue := mutuallyExcludedBlock value
	] ensure: [ caught ifNotNil: [ self signal ] ].
	^blockValue
```

Another thing to take into account is that many languages don't support forcing the termination of processes.
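The point of the `caught` flag in both variants is that the ensure block must signal the semaphore only if the wait actually completed; otherwise a process terminated while still waiting would signal a semaphore it never acquired. A minimal Python sketch of the same guard, with names of my own (`critical` here is my illustrative function, not a library API):

```python
import threading

sem = threading.Semaphore(1)

def critical(block):
    """Analogue of Semaphore>>critical:: release only if the acquire happened."""
    caught = False
    try:
        caught = sem.acquire()   # like the proposed `caught := self wait`
        return block()
    finally:
        if caught:               # never release an acquire we never made
            sem.release()

result = critical(lambda: 42)
```

If the flag were skipped and `finally` released unconditionally, a process killed before the acquire would push the semaphore count above its intended maximum.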
Maybe a larger discussion should be held on this topic and a proposal made as a PhEP (https://github.com/pharo-project/pheps).
When debugging things like `[^2] ensure: [Transcript cr; show: 'done']` (and similar cases described in the issue), if we step into the protected block [^2] and then step over ^2, we crash the image or get an infinite recursion, etc. This happens when #return:from: supplies firstUnwindContext to #aboutToReturn:through:, but before this argument can be used in #resume:through:, the #runUntilErrorOrReturnFrom: method inserts another unwind context before this firstUnwindContext! As a result, firstUnwindContext is no longer the first unwind context and the computation derails. The proposed fix communicates the fact that we are in a simulation by sending nil instead of the firstUnwindContext, causing a fresh search for the actual first unwind context at the right time: just before executing #resume:through:. This creates a close link between the simulation environment and the execution environment, which is not ideal, so I've added a note explicitly pointing at the relationship. In theory, even the VM could send nil as the firstUnwindContext value and the code would still work.
Hmm, it can be tricky to manage such a composite PR. If we at least knew the CI status of each separate PR, that would help. But the main changes were not mergeable for a long time. Do we know the CI status of #8993? I can't find it on Jenkins.
Why not use the pharo-dev mailing list for this?
It sounds very elegant to me, but there are probably a lot of hidden pitfalls.
I was always amazed by this Java "feature". And digging around the internet never showed me any proper explanation.
Denis, we can use both the mailing list and a PhEP.
It's not that simple to just change the #wait primitive to return a flag. See the example:
We have a long history with critical sections, together with @bencoman. The final approach was proposed here: https://pharo.fogbugz.com/f/cases/19186. It requires new lock primitives with new semantics. Generally speaking, we should deprecate Semaphore>>critical: and always use Mutex instead, but that requires a proper implementation based on the lock primitives.
Well, here with my model this does not hold: process C will never force termination on process A.
We are investigating the tests failing on the CI. Those tests do not fail locally.

```smalltalk
assertTerminatedFailedChildProcessesFor: aTestCase
	| failedProcesses |
	failedProcesses := aTestCase failedChildProcesses.
	self assert: failedProcesses notEmpty.
	self assert: (failedProcesses allSatisfy: #isTerminated)

SUnitTest >> failedChildProcesses
	^Process allInstances
		select: [ :each | each name = ('failed child for ', testSelector) ]
```

Moreover, this really depends on the GC: if the GC runs and collects the processes just before the assertion, the test fails! By changing the assertion method to:

```smalltalk
SUnitTest >> assertTerminatedFailedChildProcessesFor: aTestCase
	| failedProcesses |
	--> Smalltalk garbageCollect.
	failedProcesses := aTestCase failedChildProcesses.
	self assert: failedProcesses notEmpty.
	self assert: (failedProcesses allSatisfy: #isTerminated)
```

I'm able to reproduce 4 of the CI failing tests locally.
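A small Python analogue (all names are mine) of why the assertion is GC-sensitive: finding dead objects through an allInstances-style query only works while the GC has not collected them yet, so forcing a collection first reproduces the failure deterministically:

```python
import gc
import weakref

class FailedChild:
    """Stand-in for a terminated 'failed child' process."""

# keep only a weak reference: like a dead process that is only
# reachable through a Process allInstances scan
ref = weakref.ref(FailedChild())

gc.collect()  # like inserting `Smalltalk garbageCollect` before the assertion

# the instance is gone, so a `notEmpty`-style assertion on survivors fails
survivors = [] if ref() is None else [ref()]
```

Whether the test passes thus depends purely on GC timing, which matches the flaky CI behaviour described above.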
Right. The SUnitTests for process monitoring are kind of integration tests. Forked processes from the related methods can be accumulated into an explicit instance variable. For example:

```smalltalk
failedChildProcessTest
	"During this test the forked process should signal an error.
	It means that after the fork we should give the process control"

	| process |
	process := [ self error: 'error from child process' ] newProcessNamed: 'failed child for ', testSelector.
	forkedProcesses add: process.
	process resume.
	Processor yield
```

Then the assertion will check these local processes, and it will be stable.
SindarinDebuggerTest requires changes in Sindarin project: pharo-spec/ScriptableDebugger#14 |
src/Kernel/Process.class.st
```smalltalk
^suspendedContext isNil or: [
	suspendedContext isDead or: [
		(suspendedContext method hasPragmaNamed: #isTerminated) ] ]
```
Our PRs #8567 and #8845 were not compatible with each other, and I think that leads to some broken tests here. I would exclude #8845 from this PR for now.
But the correct merge would be:
- use #isTerminated from my PR [Very critical fix] Processor>>#terminateRealActive #8567;
- modify #isEndOfProcessTermination to use Jaromir's logic based on the pragma (hasPragmaNamed: here).
The reason I suggest postponing #8845 is that in my PR I followed the idea that fully stepping over the process should be equivalent to its normal execution, which always ends up at the final #suspend message. Therefore the real termination status requires an explicit check of the #pc to detect that condition; see the #isEndOfProcessTermination method.
Thus the simple test proposed in #8845 is not enough. Stepping the process would end up at the start of the method where the pragma is detected, which is not equivalent to the normal execution of a process: stepping would not execute the "thisContext terminateTo: nil" part of the #endProcess method.
But the correct merge would be:
- use #isTerminated from my PR [Very critical fix] Processor>>#terminateRealActive #8567.
- modify #isEndOfProcessTermination to use Jaromir logic based on pragma (hasPragma here)
Yes, that's what I'd suggest as well. #8845 is not that important: it doesn't fix bugs, it just simplifies the logic.
Do you mean this model? It will not solve the problem. Process A will resume and perform the doTermination logic, which will not set the caught flag in the #critical: method, so the ensure block will incorrectly skip the semaphore signal.
…mall image. We need to keep a record of that.
…he latter uses the former.
…thod. To do so, we need to have support for them. This support is not available in the minimal image. Process termination should not depend on pragmas.
```smalltalk
^suspendedContext isNil or: [
	suspendedContext isDead or: [
		suspendedContext method == (self class >> #endProcess) ] ]
```
See my previous comments. #isDead already incorporates this logic, but it needs to be adjusted for the #endProcess method.
… the correct signalling context when resuming. - Return to the previous change introduced in pharo-project#10429 - Fix pharo-project#10651 - Add tests
It's been really difficult to review the batch of issues (#8567 , #8845, #9137, #8993, #8509, #10383).
Over the last internal sprints we have spent several hours looking at them, but the reality is that those changes are difficult to review: some, like #8567, are very big, and their effects on the system are difficult to foresee independently, and even more so in combination.
This PR rebases and joins together all changes in question with the objective of testing the combination of all of them.
We therefore propose to eagerly merge this (meta) PR and be ready to quickly apply hot-fixes as soon as possible, or revert it in case of fire.