Skip to content

improve netns cleanup#2112

Merged
openshift-merge-bot[bot] merged 4 commits intocontainers:mainfrom
Luap99:netns-cleanup
Aug 8, 2024
Merged

improve netns cleanup#2112
openshift-merge-bot[bot] merged 4 commits intocontainers:mainfrom
Luap99:netns-cleanup

Conversation

@Luap99
Copy link
Member

@Luap99 Luap99 commented Aug 6, 2024

see commits

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Aug 6, 2024
@Luap99 Luap99 marked this pull request as draft August 6, 2024 14:34
@rhatdan
Copy link
Member

rhatdan commented Aug 7, 2024

LGTM

// the file will work and the kernel will destroy the bind mount in the
// other ns because of this. We also need it so pasta doesn't leak.
rErr = fmt.Errorf("failed to unmount NS: at %s: %w", nsPath, err)
// EINVAL means the path exists but is not mounted, just try to remove the path below
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we just use containers/storage/pkg/system.EnsureRemoveAll() here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhh, I guess it would work but the logic there seems much more complicated especially in the normal case were a unmount+remove works. We don't need any of the recursive dir logic, EnsureRemoveAll() would then read /proc/thread-self/mountinfo every time which seems wasteful

Luap99 added 4 commits August 8, 2024 16:45
When I wrote this originally I thought we must avoid leaking the netns
so I tried to decrement first. However now I think this wrong because
podman actially calls into the cleanup function again if it returned an
error on the next cleanup attempt. As such we ended up doing a double
decrement and the ref counter went below zero causing a sort of issues[1].

Now if we have a bug the other way around were we not decrement
correctly this is much less of a problem. It simply means we leak once
netns file and the pasta/slirp4netns process which isn't a problem other
than needed a bit of resources.

[1] containers/podman#21569

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The Run() function is used to run long running command in the netns,
namly podman unshare --rootless-netns used that. As such the function
actually unlocks for the main command as otherwise a user could hold the
lock forever effectively causing deadlocks.

Now because we unlock the ref count might change during that time. And
just because we create the netns doesn't mean there are now other users
of the netns. Therefore the cleanup in runInner() was wrong in that
case causing problems for other running containers.

To fix this make sure we do not cleanup in the Run() case unless the
count is 0.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Podman might call us more than once on the same path. If the path is not
mounted or does not exists simply return no error.

Second, retry the unmount/remove until the unmount succeeded. For some
reason we must use MNT_DETACH as otherwise the unmount call will fail
all time the time. However MNT_DETACH means it unmounts async in the
background. Now if we call remove on the file and the unmount was not
done yet it will fail with EBUSY. In this case we try again until it
works or we get another error.

This should help containers/podman#19721

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99 Luap99 marked this pull request as ready for review August 8, 2024 14:47
@Luap99 Luap99 changed the title try to improve netns cleanup to fix podman CI flakes improve netns cleanup Aug 8, 2024
@Luap99
Copy link
Member Author

Luap99 commented Aug 8, 2024

Ok this should be good to go, it will not fix the flake though

@Luap99
Copy link
Member Author

Luap99 commented Aug 8, 2024

@mheon PTAL

@mheon
Copy link
Member

mheon commented Aug 8, 2024

LGTM

@rhatdan
Copy link
Member

rhatdan commented Aug 8, 2024

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

5.2 Wanted for Podman v5.2 approved lgtm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants