Preface: I haven't yet debugged this enough to know precisely where the problem lies, but it is trivial to reproduce, so I wanted to at least get the ticket filed and the conversation going. It may be related to some combination of:
- LCOW (or LCOW image / kernel / opengcs / etc)
- Alpine 3.9
- Environment - containers are running inside a Server 2019 Hyper-V VM that has nested virtualization enabled
- Docker version / some nuance of the Docker DNS resolver
I'm pretty sure this has something to do with Alpine in particular, since running the failing scenario with Ubuntu containers instead does not fail.
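One way to narrow this down further might be to take nslookup out of the picture and call the libc resolver directly (musl on Alpine, glibc on Ubuntu). The following is only a sketch under assumptions: it presumes a Python interpreter is available in the container, which the images used in this report do not ship by default, and the `probe` helper and its parameters are hypothetical names introduced for illustration.

```python
import socket
import time


def probe(name, attempts=20, delay=1.0, resolve=socket.getaddrinfo):
    """Resolve `name` repeatedly and return a (successes, failures) tuple.

    `resolve` defaults to socket.getaddrinfo, which goes through the
    container's libc resolver. It is a parameter only so the loop can be
    exercised without a network.
    """
    ok = failed = 0
    for i in range(attempts):
        try:
            resolve(name, None)
            ok += 1
        except socket.gaierror:
            # Raised when the lookup fails, e.g. "Name does not resolve".
            failed += 1
        # Pause between attempts, but not after the last one.
        if delay and i + 1 < attempts:
            time.sleep(delay)
    return ok, failed
```

Calling `probe("bar.internal")` from the foo container in both an Alpine-based and an Ubuntu-based image would show whether the failure rate follows the libc rather than the nslookup binary.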
docker info
Client:
Debug Mode: false
Plugins:
app: Docker Application (Docker Inc., v0.8.0-beta2)
buildx: Build with BuildKit (Docker Inc., v0.2.0-6-g509c4b6-tp)
Server:
Containers: 2
Running: 0
Paused: 0
Stopped: 2
Images: 138
Server Version: master-dockerproject-2019-04-28
Storage Driver: windowsfilter (windows) lcow (linux)
Windows:
LCOW:
Logging Driver: json-file
Plugins:
Volume: local
Network: ics l2bridge l2tunnel nat null overlay transparent
Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
Swarm: inactive
Default Isolation: hyperv
Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
Operating System: Windows 10 Enterprise Version 1809 (OS Build 17763.437)
OSType: windows
Architecture: x86_64
CPUs: 2
Total Memory: 16GiB
Name: ci-lcow-prod-1
ID: 0ac02c9d-aaba-42f4-8749-5a64af3068d8
Docker Root Dir: C:\ProgramData\docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
The LCOW image is built from linuxkit/lcow@d5dfdbc, which includes kernel 4.19.27 among other components. A PR updating the kernel image has been merged, with newer versions of OpenGCS, Alpine, the kernel and runc, but the image I built from it failed to launch containers and I had to revert (more info in linuxkit/lcow#45 (comment)).
compose file to demonstrate the problem
version: '3'
services:
  foo:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup bar.internal && sleep 1s; done"
    networks:
      default:
        aliases:
          - foo.internal
  bar:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup foo.internal && sleep 1s; done"
    networks:
      default:
        aliases:
          - bar.internal
Output from compose up
The problem is that DNS resolution fails fairly regularly, i.e. foo cannot resolve bar.internal and vice versa. The log shows some successes, but also a number of failures, and which lookups fail varies from run to run.
PS C:\source\alpine-test> docker-compose -f .\docker-compose-bad.yml up
Creating network "alpine-test_default" with the default driver
Creating alpine-test_bar_1 ... done
Creating alpine-test_foo_1 ... done
Attaching to alpine-test_foo_1, alpine-test_bar_1
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | nslookup: can't resolve 'bar.internal': Name does not resolve
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 |
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 |
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 |
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 |
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 |
bar_1 | nslookup: can't resolve 'foo.internal': Name does not resolve
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 |
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 |
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 |
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | nslookup: can't resolve 'foo.internal': Name does not resolve
bar_1 |
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | nslookup: can't resolve 'bar.internal': Name does not resolve
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 |
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 |
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | nslookup: can't resolve 'bar.internal': Name does not resolve
foo_1 |
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | nslookup: can't resolve 'bar.internal': Name does not resolve
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 |
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
Gracefully stopping... (press Ctrl+C again to force)
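Because the failures are intermittent and differ between runs, it can help to count them rather than eyeball the log. The following is a minimal sketch, assuming the log has been saved as text in the `service | message` format shown above. The `tally` helper is a hypothetical name, and skipping the `'(null)'` lines is an assumption: they appear on every lookup, successful or not, so they are treated here as noise rather than as failures.

```python
import re
from collections import Counter

# Matches the service prefix docker-compose puts on each log line, e.g. "foo_1 | ".
PREFIX = re.compile(r"^(?P<svc>\S+)\s*\|\s?(?P<msg>.*)$")


def tally(log_text):
    """Count successful and failed lookups per service in compose log text."""
    counts = {}
    for line in log_text.splitlines():
        m = PREFIX.match(line.strip())
        if not m:
            continue  # not a container log line (e.g. "Creating network ...")
        svc, msg = m.group("svc"), m.group("msg").strip()
        per_svc = counts.setdefault(svc, Counter())
        if msg.startswith("Name:"):
            # A "Name:" line is printed once per successful lookup.
            per_svc["ok"] += 1
        elif msg.startswith("nslookup: can't resolve") and "'(null)'" not in msg:
            # A failure for the requested name; the '(null)' lines are skipped.
            per_svc["failed"] += 1
    return counts
```

Running `tally(open("compose.log").read())` on a saved log (the file name is illustrative) gives a per-service count that can be compared between Alpine and Ubuntu runs.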
Workaround
One way to work around the problem is to have each Alpine container install bind-tools and run dig for the other container's name first, which presumably caches the DNS record for subsequent nslookup calls.
compose file
version: '3'
services:
  foo:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "apk add bind-tools; dig bar.internal; while true; do nslookup bar.internal; sleep 2s; done"
    networks:
      default:
        aliases:
          - foo.internal
  bar:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "apk add bind-tools; dig foo.internal; while true; do nslookup foo.internal; sleep 2s; done"
    networks:
      default:
        aliases:
          - bar.internal
Output from compose up
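The nslookup output also changes considerably. The new format looks like that of the BIND nslookup, which suggests the lookups may now be handled by the nslookup from bind-tools rather than the BusyBox applet. It goes from: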
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25
To
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
Here's a longer run from the above compose file showing that nslookup no longer fails intermittently.
PS C:\source\alpine-test> docker-compose up
Creating network "alpine-test_default" with the default driver
Creating alpine-test_bar_1 ... done
Creating alpine-test_foo_1 ... done
Attaching to alpine-test_foo_1, alpine-test_bar_1
foo_1 | fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
bar_1 | fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
foo_1 | fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
bar_1 | fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
foo_1 | (1/10) Installing libgcc (8.3.0-r0)
bar_1 | (1/10) Installing libgcc (8.3.0-r0)
bar_1 | (2/10) Installing krb5-conf (1.0-r1)
foo_1 | (2/10) Installing krb5-conf (1.0-r1)
bar_1 | (3/10) Installing libcom_err (1.44.5-r0)
foo_1 | (3/10) Installing libcom_err (1.44.5-r0)
bar_1 | (4/10) Installing keyutils-libs (1.6-r0)
foo_1 | (4/10) Installing keyutils-libs (1.6-r0)
bar_1 | (5/10) Installing libverto (0.3.0-r1)
bar_1 | (6/10) Installing krb5-libs (1.15.5-r0)
foo_1 | (5/10) Installing libverto (0.3.0-r1)
foo_1 | (6/10) Installing krb5-libs (1.15.5-r0)
bar_1 | (7/10) Installing json-c (0.13.1-r0)
bar_1 | (8/10) Installing libxml2 (2.9.9-r1)
foo_1 | (7/10) Installing json-c (0.13.1-r0)
foo_1 | (8/10) Installing libxml2 (2.9.9-r1)
bar_1 | (9/10) Installing bind-libs (9.12.4_p1-r1)
foo_1 | (9/10) Installing bind-libs (9.12.4_p1-r1)
foo_1 | (10/10) Installing bind-tools (9.12.4_p1-r1)
bar_1 | (10/10) Installing bind-tools (9.12.4_p1-r1)
foo_1 | Executing busybox-1.29.3-r10.trigger
bar_1 | Executing busybox-1.29.3-r10.trigger
bar_1 | OK: 12 MiB in 24 packages
foo_1 | OK: 12 MiB in 24 packages
foo_1 |
foo_1 | ; <<>> DiG 9.12.4-P1 <<>> bar.internal
foo_1 | ;; global options: +cmd
foo_1 | ;; Got answer:
foo_1 | ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62166
foo_1 | ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
foo_1 |
foo_1 | ;; QUESTION SECTION:
foo_1 | ;bar.internal. IN A
foo_1 |
foo_1 | ;; ANSWER SECTION:
foo_1 | bar.internal. 600 IN A 172.25.137.174
foo_1 |
foo_1 | ;; Query time: 0 msec
foo_1 | ;; SERVER: 172.25.128.1#53(172.25.128.1)
foo_1 | ;; WHEN: Fri May 03 18:26:29 UTC 2019
foo_1 | ;; MSG SIZE rcvd: 58
foo_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 |
bar_1 | ; <<>> DiG 9.12.4-P1 <<>> foo.internal
bar_1 | ;; global options: +cmd
bar_1 | ;; Got answer:
bar_1 | ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34929
bar_1 | ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
bar_1 |
bar_1 | ;; QUESTION SECTION:
bar_1 | ;foo.internal. IN A
bar_1 |
bar_1 | ;; ANSWER SECTION:
bar_1 | foo.internal. 600 IN A 172.25.139.149
bar_1 |
bar_1 | ;; Query time: 0 msec
bar_1 | ;; SERVER: 172.25.128.1#53(172.25.128.1)
bar_1 | ;; WHEN: Fri May 03 18:26:29 UTC 2019
bar_1 | ;; MSG SIZE rcvd: 58
bar_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
Ubuntu results
Compose file
version: '3'
services:
  foo:
    image: ubuntu:latest
    dns_search: internal
    entrypoint: sh -c "apt-get update && apt-get install -y dnsutils; while true; do nslookup 'bar.internal'; sleep 2s; done"
    networks:
      default:
        aliases:
          - foo.internal
  bar:
    image: ubuntu:latest
    dns_search: internal
    entrypoint: sh -c "apt-get update && apt-get install -y dnsutils; while true; do nslookup 'foo.internal'; sleep 2s; done"
    networks:
      default:
        aliases:
          - bar.internal
I'll spare the full log here, but after switching to Ubuntu containers, nslookup succeeds from the outset:
foo_1 | Server: 172.30.16.1
foo_1 | Address: 172.30.16.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.30.18.190
foo_1 |
bar_1 | Server: 172.30.16.1
bar_1 | Address: 172.30.16.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.30.28.25
bar_1 |