
Process Changes Meta Pull Request #10429

Merged
guillep merged 31 commits into pharo-project:Pharo10 from guillep:process-meta-pr
Dec 2, 2021

Conversation

@guillep (Member) commented Nov 24, 2021

It's been really difficult to review this batch of issues (#8567, #8845, #9137, #8993, #8509, #10383).
We have spent several hours looking at them during the last internal sprints, but the reality is that these changes are hard to review: some, like #8567, are very big, their effects on the system are difficult to foresee independently, and even more so in combination.

This PR rebases and joins together all the changes in question, with the objective of testing their combination.
We then propose that we eagerly merge this (meta) PR and be ready to quickly apply hot-fixes, or to revert it in case of fire.

dionisiydk and others added 19 commits November 24, 2021 10:36
…e process that is really currently running, regardless of any effective process installed.

It is critical to ignore the simulation trick of the debugger (the effective process) during normal completion of the process. This method ensures that every process is able to complete.
See the method comment.
Fix the terminating state of a normally completed process by moving the flag logic to the final termination methods.
…recise criteria: when suspendedContext points to the endPC of the Process>>doTerminationFromYourself method.

The last instruction there is the final suspend operation, which removes the process from the scheduler.
It means that the process is not terminated until it is really no longer running (i.e., when the suspend is executed).
Helper methods are added to support safe stepping (simulation) in the debugger:
- Process>>canBeStepped
- Context>>canBeSteppedSafelly
For the moment #canBeSteppedSafelly is only about preventing stepping over "Processor terminateRealActive", but it can be extended if we find other cases.
Now the debugger should use these methods as the criterion for finished execution.
…thod. Now the step always delegates the simulation to the #terminateActive method to affect the effective process instead of the process running the simulation. This fixes all the simulation tests (failures discovered before this commit).

Sindarin changes are reverted back to the #terminating flag to prevent stepping over the final #suspend. It should also be supported, but that is a different story.
…which are triggered from the process itself as a result of an unhandled exception.
…onFinished.

The actual problem was at the kernel level. The simulation of the process up to its self-termination was not working using the stepInto, stepOver and stepThrough methods.
Corresponding tests are added to cover these basic scenarios, and they are fixed.
The important change is in the primitive simulation: the process #suspend must do nothing during the primitive (see the comment). It makes debugging over a #suspend call safe for the real active process.
… that current context is actually the end of process termination.

Using #hasSender: is not a reliable criterion for stepping, because simulated code can perform context manipulations that break this condition in the middle of the logic, while continuing the execution can still return to the home context.
It removes the duplication of checking both criteria. And it really makes sense in other cases where only #isDead was used. Possibly it will fix some corner cases during debugger operations over #terminateFromYourself (not normally accessed by users).
…n - skips multiple unwind blocks and corrupts the image
Add tests, modify outerMost sender, replace tempAt with accessors

Move two failing tests to expectedFailures

Reset outerMost context as top context
@isCzech (Contributor) commented Nov 24, 2021

Hi all,

Thanks for looking at the PRs, here are my comments:

#9137 is an isolated adjustment of an inconsistency in exceptions handling I expect will not interfere with anything here.

#8509: I suggest you add #10383 into the mix as it addresses (and fixes) the root cause of the "stepOver bug" (#8443), and I think #10383 supersedes the fix in #8509 (the test in #8509 should of course be valid for #10383 as well).

#8993: this is part of a larger project (of mine) rewriting #terminate. I extracted this part for Pharo to work without clashes with other code; however, #8567 was so big I couldn't foresee all possible interactions - so sorry if it does clash. The main objective of this PR was to eliminate the terrible "unwind error" and extend a bit the scope of unwinding during termination (completing ALL half-way-through unwind contexts).

#8845 was an attempt to fix a bug in #isTerminated and avoid hardcoding exact conditions like "pc + 1 = self endPC". This PR will definitely interfere with #8567; I don't know exactly what the "grand plan" to make this super-PR work is, but I consider #8845 of less importance: it could be withdrawn if it meant substantial adjustments to #8567, and an alternative solution merged later.

Summary:
If I might suggest (but I'm fully aware your experience exceeds mine by orders of magnitude) I'd get rid of the "easy" non-disruptive stuff first - i.e. include #10383 to get rid of the "stepOver bug" (or even merge it separately first; the impact of the change is very isolated - but the consequences of the fix quite substantial).

Best regards,
Jaromir

@guillep (Member, Author) commented Nov 24, 2021

Thanks for the quick reply, I'll check #10383 and put it together here.

#8993: this is part of a larger project (of mine) rewriting #terminate. I extracted this part for Pharo to work without clashes with other code; however, #8567 was so big I couldn't foresee all possible interactions - so sorry if it does clash. The main objective of this PR was to eliminate the terrible "unwind error" and extend a bit the scope of unwinding during termination (completing ALL half-way-through unwind contexts).

I'm super sad about the current state of process termination. Over lunch I sketched a different solution, the idea being that no process can terminate another process: a process can only terminate itself. This would imply, for example:

  • processes should be able to join others
Process >> join

	"Make the calling process wait until the receiver process terminates.
	If the receiver is already terminated, return directly.
	Otherwise, wait on a Semaphore"
	self isTerminated ifTrue: [ ^ self ].

	(joinSemaphore ifNil: [ joinSemaphore := Semaphore new ]) wait
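A hypothetical usage sketch, assuming the join method above were installed on Process (the worker block is made up):

| worker |
worker := [ 30 timesRepeat: [ 100 factorial ] ] fork.
worker join.	"blocks the calling process until worker terminates"
Transcript show: 'worker finished'; cr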
  • termination is eventual: when the terminated process runs, it will terminate itself. This means the semantics are weaker, because terminate no longer forces a termination but requests one. The process will terminate when it can (the catch is that it will only do it if it can). A pro: the code is much simpler and we can throw away a lot of complex code.
Process >> terminate2
	
	"Request termination of the receiver process.
	
	The current process will eventually terminate itself"
	suspendedContext := [
		self doTerminationFromYourself.
		joinSemaphore signalAll.
	] asContextWithSender: suspendedContext
  • The above, though, does not work with the changes in this PR, and needs to be defined as something like:
Process >> endProcess
	"When I reach this method, I'm terminated. Suspending or terminating me is harmless."

	<isTerminated>
	thisContext terminateTo: nil.   "set thisContext sender to nil"
-->	joinSemaphore ifNotNil: [ :s | s signalAll ].
	self suspend.
  • Another detail we were looking at is how process termination interacts with Semaphores, which we find rather fragile, because ensure blocks need to know the internal structure of the semaphore methods to work properly. The problem is that when executing the ensure block, we really don't know whether the critical section was actually taken or not. An idea would be to use the result of self wait as a token of the taken critical section, as follows. With that change, the ensure block should be safe to execute regardless of when/how the stack unwind happens.
Semaphore >> critical: mutuallyExcludedBlock
	| blockValue caught |
	caught := false.
	[
		caught := true.
		self wait.
		blockValue := mutuallyExcludedBlock value
	] ensure: [caught ifTrue: [self signal]].
	^blockValue
===>
Semaphore >> critical: mutuallyExcludedBlock			
	| blockValue caught |
	[
		caught := self wait.
		blockValue := mutuallyExcludedBlock value
	] ensure: [caught ifNotNil: [self signal]].
	^blockValue

Another thing to take into account is that many languages don't support forcing the termination of processes.

Maybe a larger discussion should happen on this topic and be proposed as a phep (https://github.com/pharo-project/pheps)

When debugging things like this (and similar, described in the Issue):
	[^2] ensure: [Transcript cr; show: 'done']
if we step into the protected block [^2] and then step over ^2, we crash the image or get an infinite recursion, etc.
This behavior happens when #return:from: supplies firstUnwindContext to #aboutToReturn:through:, but before this argument can be used in #resume:through:, the #runUntilErrorOrReturnFrom: method inserts another unwind context before this firstUnwindContext! As a result, firstUnwindContext is no longer the first unwind context and the computation derails.
The proposed fix communicates the fact that we are in simulation by sending nil instead of the firstUnwindContext, causing a fresh search for the actual first unwind context at the right time - before executing resume:through:.
This creates a close link between the simulation environment and the execution environment, which is not ideal, so I've suggested a note explicitly pointing at the relationship. In theory even the VM could send nil as the firstUnwindContext value and the code would still work.
@dionisiydk (Contributor):

Hm, it can be tricky to manage such a composite PR.
CI shows strange test failures in the bootstrap image while they pass on the full image (see the SUnitTest failures).
The analysis will require disabling some of the changes here. But we (the original authors) have no rights to modify this PR.

If we at least knew the CI status of each separate PR, that would help. But the main changes were not mergeable for a long time. Do we know the CI status of #8993? I can't find it on Jenkins.

@dionisiydk (Contributor):

Maybe a larger discussion should happen on this topic and be proposed as a phep (https://github.com/pharo-project/pheps)

Why not use the pharo-dev mailing list for this?

@dionisiydk (Contributor):

  • termination is eventual: when the terminated process runs, it will terminate itself. This means the semantics are weaker, because terminate no longer forces a termination but requests one. The process will terminate when it can (the catch is that it will only do it if it can). A pro: the code is much simpler and we can throw away a lot of complex code.
Process >> terminate2
	
	"Request termination of the receiver process.
	
	The current process will eventually terminate itself"
	suspendedContext := [
		self doTerminationFromYourself.
		joinSemaphore signalAll.
	] asContextWithSender: suspendedContext

It sounds very elegant to me. But there are probably a lot of hidden pitfalls there.

@dionisiydk (Contributor) commented Nov 24, 2021

Another thing to take into account is that many languages don't support forcing the termination of processes.

I was always amazed by this Java "feature", and digging around the internet never showed me any proper explanation.
At the OS level you are able to kill processes, and nobody would ever tell you that it is a bad thing. But in the thread model of these nice languages it is forbidden or unimplemented. It makes no sense to me.

Raises a ThreadAbortException in the thread on which it is invoked, to begin the process of terminating the thread. Calling this method usually terminates the thread.

@Ducasse (Member) commented Nov 24, 2021

Denis, we can use both the mailing list and a phep;
the phep should be able to summarize the discussions,
because discussions can be long to follow and remember.

@dionisiydk (Contributor):

  • Another detail we were looking at is how process termination interacts with Semaphores, which we find rather fragile, because ensure blocks need to know the internal structure of the semaphore methods to work properly. The problem is that when executing the ensure block, we really don't know whether the critical section was actually taken or not. An idea would be to use the result of self wait as a token of the taken critical section, as follows. With that change, the ensure block should be safe to execute regardless of when/how the stack unwind happens.
Semaphore >> critical: mutuallyExcludedBlock
	| blockValue caught |
	caught := false.
	[
		caught := true.
		self wait.
		blockValue := mutuallyExcludedBlock value
	] ensure: [caught ifTrue: [self signal]].
	^blockValue
===>
Semaphore >> critical: mutuallyExcludedBlock			
	| blockValue caught |
	[
		caught := self wait.
		blockValue := mutuallyExcludedBlock value
	] ensure: [caught ifNotNil: [self signal]].
	^blockValue

It's not that simple to just change the #wait primitive to return a flag. See the example (a code sketch follows the list):

  • processes A and B wait on the critical section of a semaphore - they are suspended and added to the semaphore's waiting list.
  • the active process C, with higher priority, finishes the critical section and signals the semaphore. This resumes process A (first in the semaphore's list).
  • process C continues execution due to its high priority, and it decides to terminate process A.
  • process A still has not continued execution (#wait did not return), and therefore the caught variable is still nil.
  • process A terminates without the #signal in the #ensure section.
  • the state of the semaphore becomes inconsistent - process B will never wake up.
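A minimal sketch of this race, assuming the token-returning #wait and the #critical: rewrite proposed above (priorities and the delay duration are made up for illustration):

| sem a |
sem := Semaphore new.
sem signal.	"the section is free: one process may enter"
sem wait.	"C (this process) takes the section"
a := [ sem critical: [ #A ] ] forkAt: Processor activePriority - 1.
[ sem critical: [ #B ] ] forkAt: Processor activePriority - 1.
(Delay forMilliseconds: 50) wait.	"let A and B block inside #wait"
sem signal.	"C leaves the section: A becomes runnable, but C keeps the CPU"
a terminate.	"C terminates A before A's #wait has returned"
"caught is still nil in A, so A's ensure block skips the signal: B sleeps forever"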

We had a long story about critical sections with @bencoman, and the final approach was proposed here: https://pharo.fogbugz.com/f/cases/19186. It requires new lock primitives with new semantics.

Generally speaking, we should deprecate Semaphore>>critical: and always use Mutex instead. But that requires a proper implementation based on locks.
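For reference, a minimal example of the existing Mutex API that would replace direct Semaphore>>critical: usage (the block contents are made up):

| mutex |
mutex := Mutex new.
mutex critical: [
	"only one process at a time runs this protected section"
	Transcript show: 'in the critical section'; cr ]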

@guillep (Member, Author) commented Nov 25, 2021

  • process C continues execution due to its high priority, and it decides to terminate process A.

Well, here with my model this does not hold. Process C will never force termination on process A.
So process A can resume, take its critical section, terminate itself, free the critical section, and die.

@guillep (Member, Author) commented Nov 25, 2021

We are investigating the tests failing in the CI. Those tests do not fail locally.
I've seen there is a potential problem here:

assertTerminatedFailedChildProcessesFor: aTestCase

	| failedProcesses |
	failedProcesses := aTestCase failedChildProcesses.
	self assert: failedProcesses notEmpty.
	self assert: (failedProcesses allSatisfy: #isTerminated) 
  1. First, having many assertions in a single test makes it difficult to follow. We are not sure which assertion is failing in the CI.
  2. The implementation of failedChildProcesses is fragile and inefficient:
SUnitTest >> failedChildProcesses
	^ Process allInstances
		select: [ :each | each name = ('failed child for ', testSelector) ]

Moreover, this really depends on the GC: if the GC runs just before failedChildProcesses gathers the processes, they have already been collected and the test fails!

If I change the assertion method to:

SUnitTest >> assertTerminatedFailedChildProcessesFor: aTestCase

	| failedProcesses |
-->   Smalltalk garbageCollect.
	failedProcesses := aTestCase failedChildProcesses.
	self assert: failedProcesses notEmpty.
	self assert: (failedProcesses allSatisfy: #isTerminated) 

I'm able to reproduce 4 of the CI failing tests locally.

@dionisiydk (Contributor) commented Nov 25, 2021

I'm able to reproduce 4 of the CI failing tests locally.

Right. The SUnitTests for process monitoring are kind of integration tests. Forked processes from the related methods can be accumulated in an explicit instance variable. For example:

failedChildProcessTest
	"During this test the forked process should signal an error.
	It means that after the fork we should give the process control"
	| process |
	process := [ self error: 'error from child process' ] newProcessNamed: 'failed child for ', testSelector.
	forkedProcesses add: process.
	process resume.
	Processor yield.

Then the assertion will check these local processes, and it will be stable.
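A hypothetical companion assertion for this approach, assuming forkedProcesses is an instance variable initialized in setUp and exposed by an accessor:

SUnitTest >> assertTerminatedFailedChildProcessesFor: aTestCase
	| failedProcesses |
	failedProcesses := aTestCase forkedProcesses.
	self assert: failedProcesses notEmpty.
	self assert: (failedProcesses allSatisfy: [ :each | each isTerminated ])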

@dionisiydk (Contributor):

SindarinDebuggerTest requires changes in the Sindarin project: pharo-spec/ScriptableDebugger#14


^ suspendedContext isNil or: [
	suspendedContext isDead or: [
		suspendedContext method hasPragmaNamed: #isTerminated ] ]
Contributor:

Our PRs #8567 and #8845 were not compatible with each other, and I think that leads to some broken tests here. I would exclude #8845 from this PR for now.
But the correct merge would be:

The reason why I suggest postponing #8845 is that in my PR I followed the idea that fully stepping over the process should be equivalent to the normal execution, which always ends up at the final #suspend message. Therefore the real termination status requires an explicit check of the #pc to detect such a condition. See the #isEndOfProcessTermination method.
Thus the simple test proposed in #8845 is not enough. Stepping the process would end up at the start of the method where the pragma is detected, and that would not be equivalent to the normal execution of a process: stepping would not execute the "thisContext terminateTo: nil" part of the #endProcess method.
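A sketch of the pc-based criterion described here; the selector #isEndOfProcessTermination comes from the comment above, the termination method name from other snippets in this thread, and the exact body is an assumption built from the "pc + 1 = self endPC" condition mentioned earlier:

Process >> isEndOfProcessTermination
	"Assumed sketch: true when the suspended context sits on the last
	instruction (the final suspend) of the self-termination method"
	^ suspendedContext method == (Process >> #doTerminationFromYourself)
		and: [ suspendedContext pc + 1 = suspendedContext method endPC ]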

Contributor:

But the correct merge would be:

Yes, that's what I'd suggest as well; #8845 is not that important - it doesn't fix bugs, it just simplifies the logic.

@dionisiydk (Contributor):

  • process C continues execution due to its high priority, and it decides to terminate process A.

Well, here with my model this does not hold. Process C will never force termination on process A. So process A can resume, take its critical section, terminate itself, free the critical section, and die.

Do you mean this model?

Process >> terminate2
	"Request termination of the receiver process.	
	The current process will eventually terminate itself"
	suspendedContext := [
		self doTerminationFromYourself.
		joinSemaphore signalAll.
	] asContextWithSender: suspendedContext

It will not solve the problem. Process A will resume and perform the doTermination logic, which will not set the caught flag in the #critical: method, and the ensure block will incorrectly skip the semaphore signal.

…thod. To do so, we need to have support for them. This support is not available in the minimal image.

Process termination should not depend on pragmas.

^ suspendedContext isNil or: [
	suspendedContext isDead or: [
		suspendedContext method == (self class >> #endProcess) ] ]
Contributor:

See my previous comments. #isDead already incorporates this logic, but it needs to be adjusted for the #endProcess method.

guillep merged commit 8fc6c8b into pharo-project:Pharo10 on Dec 2, 2021
guillep deleted the process-meta-pr branch on December 2, 2021 at 09:44
tesonep added a commit to tesonep/pharo that referenced this pull request Dec 13, 2021
… the correct signalling context when resuming.

- Returning to previous change introduced in pharo-project#10429
- Fix pharo-project#10651
- Adding Tests
