Memory leak/OOM with "Received update for IP range I own" messages in log #3659
Description
What you expected to happen?
Memory usage of the weave process is expected to be stable and not grow unbounded over time.
What happened?
I had a stable Weave 2.5.0 network in my Kubernetes 1.9 cluster of about 100 nodes. Weave was initially installed by kops with a memory limit of 200 MB. There were no occurrences of "Received update for IP range I own" in the log files, and memory usage for the weave pods in the cluster had been stable for weeks.
As part of refactoring some services, about 30 nodes were removed from the cluster (bringing the cluster size down to 71 nodes). After this, the memory usage of the weave pods started growing until it exceeded the memory limit, at which point the pod was OOM killed and restarted. Each restart briefly disrupts the node on which it occurs. At the same time, the "Received update for IP range I own" message started appearing in the logs (although not from all pods; this was not discovered until later).
After looking at some related tickets here (#3650, #3600, #2797), the following actions were taken:
- The "status ipam" output was checked and showed many "unreachable" peers.
- The unreachable nodes listed by "status ipam" were removed with rmpeer on one node. This did not clear all the unreachable entries on all nodes, so the same list-and-remove process was repeated on a couple of other systems until every system showed all 71 nodes in the list, all reachable.
- Updated to 2.5.2, as that release mentioned some related-looking tickets.
- Increased the memory limit from 200 MB to 1 GB so that OOM kills might happen less frequently.
Weave pods continue to grow in memory usage. The new 2.5.2 pods have not hit their 1 GB limit yet but look to be heading that way. The "Received update for IP range I own" messages are still being seen; however, on closer inspection, they are coming from only 3 of the 71 pods.
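For anyone repeating the list-and-remove step described above, here is a minimal Python sketch that extracts unreachable peers from saved "status ipam" output and prints the matching rmpeer commands for review. It assumes each peer line starts with `MAC(nickname)` and that unreachable peers carry the word "unreachable" on the same line; the sample text, the MAC addresses, and the helper names `unreachable_peers` and `rmpeer_commands` are hypothetical and are not taken from the attached files.

```python
import re
from typing import List

# Assumed shape of a peer line in "status ipam" output:
#   aa:bb:cc:dd:ee:02(node-b)   8192 IPs (00.4% of total) - unreachable!
PEER_LINE = re.compile(r"^\s*((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\(([^)]*)\)")

# Hypothetical sample, for illustration only.
SAMPLE = """\
aa:bb:cc:dd:ee:01(node-a)   32768 IPs (01.6% of total) (5 active)
aa:bb:cc:dd:ee:02(node-b)    8192 IPs (00.4% of total) - unreachable!
aa:bb:cc:dd:ee:03(node-c)   16384 IPs (00.8% of total)
"""


def unreachable_peers(ipam_output: str) -> List[str]:
    """Return the peer names (MACs) on lines marked as unreachable."""
    peers = []
    for line in ipam_output.splitlines():
        if "unreachable" not in line:
            continue
        match = PEER_LINE.match(line)
        if match:
            peers.append(match.group(1))
    return peers


def rmpeer_commands(ipam_output: str) -> List[str]:
    """Build one 'weave rmpeer' command per unreachable peer (not executed)."""
    return ["weave rmpeer " + mac for mac in unreachable_peers(ipam_output)]


if __name__ == "__main__":
    for command in rmpeer_commands(SAMPLE):
        print(command)
```

The script only prints the commands rather than running them, so the list can be checked against the actual cluster membership before anything is removed.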
How to reproduce it?
Have a working kubernetes cluster and delete some nodes out of it.
Anything else we need to know?
Versions:
Version: 2.5.2 (up to date; next check at 2019/07/12 18:43:12)
Service: router
Protocol: weave 1..2
Name: ea:38:6f:58:7b:81(ip-10-32-124-236.us-west-2.compute.internal)
Encryption: disabled
PeerDiscovery: enabled
Targets: 71
Connections: 71 (70 established, 1 failed)
Peers: 71 (with 4966 established, 4 pending connections)
TrustedSubnets: none
Service: ipam
Status: ready
Range: 100.96.0.0/11
DefaultSubnet: 100.96.0.0/11
admin@ip-10-32-92-49:~$ docker version
Client:
Version: 17.03.2-ce
API version: 1.27
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 02:09:56 2017
OS/Arch: linux/amd64
Server:
Version: 17.03.2-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.7.5
Git commit: f5ec1e2
Built: Tue Jun 27 02:09:56 2017
OS/Arch: linux/amd64
Experimental: false
Linux ip-10-32-92-49 4.4.121-k8s #1 SMP Sun Mar 11 19:39:47 UTC 2018 x86_64 GNU/Linux
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.8", GitCommit:"c138b85178156011dc934c2c9f4837476876fb07", GitTreeState:"clean", BuildDate:"2018-05-21T18:53:18Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Logs:
These are the logs from one of the weave pods that is showing the "Received update for IP range I own" messages:
weave-net-q56hl.log
This is the pprof/heap output for the above node:
weave-net-q56hl.heap.gz
This is status ipam from the above node:
weave-net-q56hl.ipam.txt
This is status peers from the above node:
weave-net-q56hl.peers.txt
These are the logs from one of the weave pods not showing that message:
weave-net-9t7d8.log
This is the pprof/heap output for the above node:
weave-net-9t7d8.heap.gz
And here's a picture showing the history of memory usage from these pods
