
Process Changes Meta Pull Request #10429

Merged
guillep merged 31 commits into pharo-project:Pharo10 from guillep:process-meta-pr
Dec 2, 2021

Conversation

@guillep (Member) commented Nov 24, 2021

It's been really difficult to review this batch of issues (#8567, #8845, #9137, #8993, #8509, #10383).
We have spent several hours looking at them during the last internal sprints, but the reality is that these changes are hard to review: some, like #8567, are very big, their effects on the system are difficult to foresee independently, and even more so in combination.

This PR rebases and joins together all the changes in question, with the objective of testing their combination.
We then propose that we eagerly merge this (meta) PR and be ready to quickly apply hot-fixes, or to revert it in case of fire.

dionisiydk and others added 19 commits November 24, 2021 10:36
…e process that is really currently running, regardless of any effective process installed.

It is critical to ignore the simulation trick of the debugger (the effective process) during normal completion of the process. This method ensures that every process is able to complete.
See the method comment.
Fix the terminating state of a normally completed process by moving the flag logic to the final termination methods.
…recise criteria: when suspendedContext points to the endPC of the Process>>doTerminationFromYourself method.

The last instruction there is the final suspend operation, which removes the process from the scheduler.
It means that the process is not terminated until it is really no longer running (i.e., when the suspend is executed).
Helper methods are added to support safe stepping (simulation) in the debugger:
- Process>>canBeStepped
- Context>>canBeSteppedSafelly
For the moment #canBeSteppedSafelly is only about preventing stepping over "Processor terminateRealActive", but it can be extended if we find other cases.
Now the debugger should use these methods as the criterion for finished execution.
…thod. Now the step always delegates the simulation to the #terminateActive method to affect the effective process instead of the process running the simulation. This fixes all the simulation tests (failures discovered before this commit).

Sindarin changes are reverted back to the #terminating flag to prevent stepping over the final #suspend. It should also be supported, but that is a different story.
…which are triggered from the process itself as a result of an unhandled exception.
…onFinished.

The actual problem was at the kernel level. The simulation of the process up to its self-termination was not working using the stepInto, stepOver and stepThrough methods.
Corresponding tests are added to cover these basic scenarios, and they are fixed.
The important change is in the primitive simulation: the process #suspend must do nothing during the primitive (see the comment). It makes debugging over a #suspend call safe for the real active process.
… that current context is actually the end of process termination.

Using #hasSender: is not a reliable criterion for stepping, because simulated code can perform context manipulations that break this condition in the middle of the logic, while continuing the execution can still return to the home context.
It removes the duplication of checking both criteria. And it really makes sense in other cases where only #isDead was used. Possibly it will fix some corner cases during debugger operations over #terminateFromYourself (not normally accessed by users).
…n - skips multiple unwind blocks and corrupts the image
Add tests, modify outerMost sender, replace tempAt with accessors

Move two failing tests to expectedFailures

Reset outerMost context as top context
@isCzech (Contributor) commented Nov 24, 2021

Hi all,

Thanks for looking at the PRs, here are my comments:

#9137 is an isolated adjustment of an inconsistency in exceptions handling I expect will not interfere with anything here.

#8509: I suggest you add #10383 into the mix as it addresses (and fixes) the root cause of the "stepOver bug" (#8443), and I think #10383 supersedes the fix in #8509 (the test in #8509 should of course be valid for #10383 as well).

#8993: this is part of a larger project (of mine) rewriting #terminate. I extracted this part for Pharo to work without clashes with other code; however, #8567 was so big I couldn't foresee all possible interactions - so sorry if it does clash. The main objective of this PR was to eliminate the terrible "unwind error" and extend a bit the scope of unwinding during termination (completing ALL half-way-through unwind contexts).

#8845 was an attempt to fix a bug in #isTerminated and avoid hardcoding exact conditions like "pc + 1 = self endPC". This PR will definitely interfere with #8567; I don't know exactly what the "grand plan" to make this super-PR work is, but I consider #8845 of less importance: it could be withdrawn if it meant substantial adjustments to #8567, and an alternative solution merged later.

Summary:
If I might suggest (but I'm fully aware your experience exceeds mine by orders of magnitude) I'd get rid of the "easy" non-disruptive stuff first - i.e. include #10383 to get rid of the "stepOver bug" (or even merge it separately first; the impact of the change is very isolated - but the consequences of the fix quite substantial).

Best regards,
Jaromir

@guillep (Member, Author) commented Nov 24, 2021

Thanks for the quick reply, I'll check #10383 and put it together here.

#8993: this is part of a larger project (of mine) rewriting #terminate. I extracted this part for Pharo to work without clashes with other code; however, #8567 was so big I couldn't foresee all possible interactions - so sorry if it does clash. The main objective of this PR was to eliminate the terrible "unwind error" and extend a bit the scope of unwinding during termination (completing ALL half-way-through unwind contexts).

I'm super sad about the current state of process termination. Over lunch I sketched a different solution, the idea being that no process can terminate another process: a process can only terminate itself. This would imply, for example:

  • processes should be able to join others
Process >> join

	"Make the calling process wait until the receiver process terminates.
	If the receiver is already terminated, return directly.
	Otherwise, wait on a Semaphore"
	self isTerminated ifTrue: [ ^ self ].

	(joinSemaphore ifNil: [ joinSemaphore := Semaphore new ]) wait
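A hypothetical usage sketch, assuming the join method above were installed on Process (the worker block is made up):

| worker |
worker := [ 30 timesRepeat: [ 100 factorial ] ] fork.
worker join.	"blocks the calling process until worker terminates"
Transcript show: 'worker finished'; cr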
  • termination is eventual: when the terminated process runs, it will terminate itself. This means the semantics are weaker, because terminate no longer forces a termination but requests one. The process will terminate when it can (the catch is that it will only do it if it can). A pro: the code is much simpler and we can throw away a lot of complex code.
Process >> terminate2
	
	"Request termination of the receiver process.
	
	The current process will eventually terminate itself"
	suspendedContext := [
		self doTerminationFromYourself.
		joinSemaphore signalAll.
	] asContextWithSender: suspendedContext
  • The above, though, does not work with the changes in this PR, and needs to be defined as something like:
Process >> endProcess
	"When I reach this method, I'm terminated. Suspending or terminating me is harmless."

	<isTerminated>
	thisContext terminateTo: nil.   "set thisContext sender to nil"
-->	joinSemaphore ifNotNil: [ :s | s signalAll ].
	self suspend.
  • Another detail we were looking at is how process termination interacts with Semaphores, which we find rather fragile, because ensure blocks need to know the internal structure of the semaphore methods to work properly. The problem is that when executing the ensure block, we really don't know whether the critical section was actually taken or not. An idea would be to use the result of self wait as a token of the taken critical section, as follows. With that change, the ensure block should be safe to execute regardless of when/how the stack unwind happens.
Semaphore >> critical: mutuallyExcludedBlock
	| blockValue caught |
	caught := false.
	[
		caught := true.
		self wait.
		blockValue := mutuallyExcludedBlock value
	] ensure: [caught ifTrue: [self signal]].
	^blockValue
===>
Semaphore >> critical: mutuallyExcludedBlock			
	| blockValue caught |
	[
		caught := self wait.
		blockValue := mutuallyExcludedBlock value
	] ensure: [caught ifNotNil: [self signal]].
	^blockValue

Another thing to take into account is that many languages don't support forcing the termination of processes.

Maybe a larger discussion should happen on this topic and be proposed as a phep (https://github.com/pharo-project/pheps)

When debugging things like this (and similar, described in the Issue):
	[^2] ensure: [Transcript cr; show: 'done']
if we step into the protected block [^2] and then step over ^2, we crash the image or get an infinite recursion, etc.
This behavior happens when #return:from: supplies firstUnwindContext to #aboutToReturn:through:, but before this argument can be used in #resume:through:, the #runUntilErrorOrReturnFrom: method inserts another unwind context before this firstUnwindContext! As a result, firstUnwindContext is no longer the first unwind context and the computation derails.
The proposed fix communicates the fact that we are in simulation by sending nil instead of the firstUnwindContext, causing a fresh search for the actual first unwind context at the right time - before executing resume:through:.
This creates a close link between the simulation environment and the execution environment, which is not ideal, so I've suggested a note explicitly pointing at the relationship. In theory even the VM could send nil as the firstUnwindContext value and the code would still work.
@dionisiydk (Contributor):

Hm, it can be tricky to manage such a composite PR.
CI shows strange test failures in the bootstrap image while they pass on the full image (see the SUnitTest failures).
The analysis will require disabling some of the changes here. But we (the original authors) have no rights to modify this PR.

If we at least knew the CI status of each separate PR, that would help. But the main changes were not mergeable for a long time. Do we know the CI status of #8993? I can't find it on Jenkins.

@dionisiydk (Contributor):

Maybe a larger discussion should happen on this topic and be proposed as a phep (https://github.com/pharo-project/pheps)

Why not use the pharo-dev mailing list for this?

@dionisiydk (Contributor):

  • termination is eventual: when the terminated process runs, it will terminate itself. This means the semantics are weaker, because terminate no longer forces a termination but requests one. The process will terminate when it can (the catch is that it will only do it if it can). A pro: the code is much simpler and we can throw away a lot of complex code.
Process >> terminate2
	
	"Request termination of the receiver process.
	
	The current process will eventually terminate itself"
	suspendedContext := [
		self doTerminationFromYourself.
		joinSemaphore signalAll.
	] asContextWithSender: suspendedContext

It sounds very elegant to me. But there are probably a lot of hidden pitfalls there.

@dionisiydk (Contributor) commented Nov 24, 2021

Another thing to take into account is that many languages don't support forcing the termination of processes.

I was always amazed by this Java "feature", and digging around the internet never showed me any proper explanation.
At the OS level you are able to kill processes, and nobody would ever tell you that it is a bad thing. But in the thread model of these nice languages it is forbidden or unimplemented. It makes no sense to me.

Raises a ThreadAbortException in the thread on which it is invoked, to begin the process of terminating the thread. Calling this method usually terminates the thread.

@Ducasse (Member) commented Nov 24, 2021

Denis, we can use both the mailing list and a phep;
the phep should be able to summarize the discussions,
because discussions can be long to follow and remember.

@dionisiydk (Contributor):

  • Another detail we were looking at is how process termination interacts with Semaphores, which we find rather fragile, because ensure blocks need to know the internal structure of the semaphore methods to work properly. The problem is that when executing the ensure block, we really don't know whether the critical section was actually taken or not. An idea would be to use the result of self wait as a token of the taken critical section, as follows. With that change, the ensure block should be safe to execute regardless of when/how the stack unwind happens.
Semaphore >> critical: mutuallyExcludedBlock
	| blockValue caught |
	caught := false.
	[
		caught := true.
		self wait.
		blockValue := mutuallyExcludedBlock value
	] ensure: [caught ifTrue: [self signal]].
	^blockValue
===>
Semaphore >> critical: mutuallyExcludedBlock			
	| blockValue caught |
	[
		caught := self wait.
		blockValue := mutuallyExcludedBlock value
	] ensure: [caught ifNotNil: [self signal]].
	^blockValue

It's not that simple to just change the #wait primitive to return a flag. See the example (a code sketch follows the list):

  • processes A and B wait on the critical section of a semaphore - they are suspended and added to the semaphore's waiting list.
  • the active process C, with higher priority, finishes the critical section and signals the semaphore. This resumes process A (first in the semaphore's list).
  • process C continues execution due to its high priority, and it decides to terminate process A.
  • process A still has not continued execution (#wait did not return), and therefore the caught variable is still nil.
  • process A terminates without the #signal in the #ensure section.
  • the state of the semaphore becomes inconsistent - process B will never wake up.
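A minimal sketch of this race, assuming the token-returning #wait and the #critical: rewrite proposed above (priorities and the delay duration are made up for illustration):

| sem a |
sem := Semaphore new.
sem signal.	"the section is free: one process may enter"
sem wait.	"C (this process) takes the section"
a := [ sem critical: [ #A ] ] forkAt: Processor activePriority - 1.
[ sem critical: [ #B ] ] forkAt: Processor activePriority - 1.
(Delay forMilliseconds: 50) wait.	"let A and B block inside #wait"
sem signal.	"C leaves the section: A becomes runnable, but C keeps the CPU"
a terminate.	"C terminates A before A's #wait has returned"
"caught is still nil in A, so A's ensure block skips the signal: B sleeps forever"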

We had a long story about critical sections with @bencoman, and the final approach was proposed here: https://pharo.fogbugz.com/f/cases/19186. It requires new lock primitives with new semantics.

Generally speaking, we should deprecate Semaphore>>critical: and always use Mutex instead. But that requires a proper implementation based on locks.
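For reference, a minimal example of the existing Mutex API that would replace direct Semaphore>>critical: usage (the block contents are made up):

| mutex |
mutex := Mutex new.
mutex critical: [
	"only one process at a time runs this protected section"
	Transcript show: 'in the critical section'; cr ]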

@guillep (Member, Author) commented Nov 25, 2021

  • process C continues execution due to its high priority, and it decides to terminate process A.

Well, here with my model this does not hold. Process C will never force termination on process A.
So process A can resume, take its critical section, terminate itself, free the critical section, and die.

@guillep (Member, Author) commented Nov 25, 2021

We are investigating the tests failing in the CI. Those tests do not fail locally.
I've seen there is a potential problem here:

assertTerminatedFailedChildProcessesFor: aTestCase

	| failedProcesses |
	failedProcesses := aTestCase failedChildProcesses.
	self assert: failedProcesses notEmpty.
	self assert: (failedProcesses allSatisfy: #isTerminated) 
  1. First, having many assertions in a single test makes it difficult to follow. We are not sure which assertion is failing in the CI.
  2. The implementation of failedChildProcesses is fragile and inefficient:
SUnitTest >> failedChildProcesses
	^ Process allInstances
		select: [ :each | each name = ('failed child for ', testSelector) ]

Moreover, this really depends on the GC: if the GC runs just before failedChildProcesses gathers the processes, they have already been collected and the test fails!

If I change the assertion method to:

SUnitTest >> assertTerminatedFailedChildProcessesFor: aTestCase

	| failedProcesses |
-->   Smalltalk garbageCollect.
	failedProcesses := aTestCase failedChildProcesses.
	self assert: failedProcesses notEmpty.
	self assert: (failedProcesses allSatisfy: #isTerminated) 

I'm able to reproduce 4 of the CI failing tests locally.

@dionisiydk (Contributor) commented Nov 25, 2021

I'm able to reproduce 4 of the CI failing tests locally.

Right. The SUnitTests for process monitoring are kind of integration tests. Forked processes from the related methods can be accumulated in an explicit instance variable. For example:

failedChildProcessTest
	"During this test the forked process should signal an error.
	It means that after the fork we should give the process control"
	| process |
	process := [ self error: 'error from child process' ] newProcessNamed: 'failed child for ', testSelector.
	forkedProcesses add: process.
	process resume.
	Processor yield.

Then the assertion will check these local processes, and it will be stable.
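A hypothetical companion assertion for this approach, assuming forkedProcesses is an instance variable initialized in setUp and exposed by an accessor:

SUnitTest >> assertTerminatedFailedChildProcessesFor: aTestCase
	| failedProcesses |
	failedProcesses := aTestCase forkedProcesses.
	self assert: failedProcesses notEmpty.
	self assert: (failedProcesses allSatisfy: [ :each | each isTerminated ])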

@dionisiydk (Contributor):

SindarinDebuggerTest requires changes in the Sindarin project: pharo-spec/ScriptableDebugger#14


^ suspendedContext isNil or: [
	suspendedContext isDead or: [
		suspendedContext method hasPragmaNamed: #isTerminated ] ]
Contributor:

Our PRs #8567 and #8845 were not compatible with each other, and I think that leads to some broken tests here. I would exclude #8845 from this PR for now.
But the correct merge would be:

The reason why I suggest postponing #8845 is that in my PR I followed the idea that fully stepping over the process should be equivalent to the normal execution, which always ends up at the final #suspend message. Therefore the real termination status requires an explicit check of the #pc to detect such a condition. See the #isEndOfProcessTermination method.
Thus the simple test proposed in #8845 is not enough. Stepping the process would end up at the start of the method where the pragma is detected, and that would not be equivalent to the normal execution of a process: stepping would not execute the "thisContext terminateTo: nil" part of the #endProcess method.
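A sketch of the pc-based criterion described here; the selector #isEndOfProcessTermination comes from the comment above, the termination method name from other snippets in this thread, and the exact body is an assumption built from the "pc + 1 = self endPC" condition mentioned earlier:

Process >> isEndOfProcessTermination
	"Assumed sketch: true when the suspended context sits on the last
	instruction (the final suspend) of the self-termination method"
	^ suspendedContext method == (Process >> #doTerminationFromYourself)
		and: [ suspendedContext pc + 1 = suspendedContext method endPC ]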

Contributor:

But the correct merge would be:

Yes, that's what I'd suggest as well; #8845 is not that important - it doesn't fix bugs, it just simplifies the logic.

@dionisiydk (Contributor):

  • process C continues execution due to its high priority, and it decides to terminate process A.

Well, here with my model this does not hold. Process C will never force termination on process A. So process A can resume, take its critical section, terminate itself, free the critical section, and die.

Do you mean this model?

Process >> terminate2
	"Request termination of the receiver process.	
	The current process will eventually terminate itself"
	suspendedContext := [
		self doTerminationFromYourself.
		joinSemaphore signalAll.
	] asContextWithSender: suspendedContext

It will not solve the problem. Process A will resume and perform the doTermination logic, which will not set the caught flag in the #critical: method, and the ensure block will incorrectly skip the semaphore signal.

…thod. To do so, we need to have support for them. This support is not available in the minimal image.

Process termination should not depend on pragmas.

^ suspendedContext isNil or: [
	suspendedContext isDead or: [
		suspendedContext method == (self class >> #endProcess) ] ]
Contributor:

See my previous comments. #isDead already incorporates this logic, but it needs to be adjusted for the #endProcess method.

guillep merged commit 8fc6c8b into pharo-project:Pharo10 on Dec 2, 2021
guillep deleted the process-meta-pr branch on December 2, 2021 at 09:44
tesonep added a commit to tesonep/pharo that referenced this pull request Dec 13, 2021
… the correct signalling context when resuming.

- Returning to previous change introduced in pharo-project#10429
- Fix pharo-project#10651
- Adding Tests
