Ops Manager 18.104.22.168
Elastic Runtime 22.214.171.124
Ops Metrics 126.96.36.199
I fired up PCF in my lab and have been playing with it. One thing that bothered me is that the smoke-tests errand would fail on Elastic Runtime. I tried both Elastic Runtime 188.8.131.52 and 184.108.40.206, and my workaround was simply to uncheck it so it did not run as a Post Install Errand. But I was not happy with that. The exact errors given by the installer look like this:
cf api https://api.cf.lab.local --skip-ssl-validation
Setting api endpoint to https://api.cf.lab.local...
What is happening is that the smoke-tests run fails to establish a connection to the API. When I tested with cf from my laptop, however, it worked fine and responded quickly.
So I simply finished the install and unchecked the smoke-tests errand. Installation completed just fine.
Determined to troubleshoot the issue, I manually kicked off the errand from the director using
bosh run errand smoke-tests
So you have your bearings, my environment is:
PCF Infrastructure network: 172.16.200.0/24
PCF Deployment network: 172.16.201.0/24
I then logged into the VM where the smoke-tests errand was running and tried the API commands that were failing:
vcap@a1e1b7bf-ae1a-4e29-9cfb-34cd1a62be07:~$ /var/vcap/packages/cli/bin/cf api https://api.cf.lab.local --skip-ssl-validation
Setting api endpoint to https://api.cf.lab.local...
FAILED
i/o timeout

vcap@a1e1b7bf-ae1a-4e29-9cfb-34cd1a62be07:~$ /var/vcap/packages/cli/bin/cf api https://api.cf.lab.local --skip-ssl-validation
Setting api endpoint to https://api.cf.lab.local...
OK

API endpoint: https://api.cf.lab.local (API version: 2.13.0)
Not logged in. Use 'cf login' to log in.
Sure enough, you can see it fails. But then, when you run it again, it succeeds! So I went off troubleshooting. Across three tries, the pattern was fail, succeed, succeed. That pattern drew me toward a possible DNS issue.
So I inspected my DNS and found it working properly. I have wildcard DNS set up for
*.cf.lab.local, and it returns the HA Proxy address with no problem. So I looked at the DNS configuration on the smoke-tests machine, and that's where I saw the issue.
When I log into my smoke-tests VM, here is what /etc/resolv.conf looks like:
vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ cat /etc/resolv.conf
nameserver 172.16.200.2
nameserver 172.16.5.30
nameserver 172.16.201.4
172.16.200.2 is the Infrastructure Network address for Ops Manager Director
172.16.5.30 is my internal DNS server, the one I supplied whenever asked for a DNS server (OVA deployment, Ops Mgr install, ER install, etc.)
172.16.201.4 is the Deployment Network address for Ops Manager Director
The host will attempt to use the first address listed in /etc/resolv.conf to resolve names, and it may then alternate name servers on subsequent requests. This explains the pattern: the first API call tries the first name server and times out, but run it again and it succeeds, and again it succeeds, and then it fails again. Below I test resolving against all three hosts listed in resolv.conf:
vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.200.2
;; reply from unexpected source: 172.16.201.4#53, expected 172.16.200.2#53
Server:   172.16.200.2
Address:  172.16.200.2#53

Name:     api.cf.lab.local
Address:  172.16.201.5

vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.5.30
Server:   172.16.5.30
Address:  172.16.5.30#53

Name:     api.cf.lab.local
Address:  172.16.201.5

vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.201.4
Server:   172.16.201.4
Address:  172.16.201.4#53

Name:     api.cf.lab.local
Address:  172.16.201.5
When attempting to resolve against 172.16.200.2, you can see it outputs "reply from unexpected source". There is also a considerable delay, enough to cause a timeout. Resolutions against the other sources are instantaneous, with no delay or strange output.
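The fail/succeed/succeed cadence is consistent with a resolver that rotates through the configured name servers. The sketch below is a toy model, not the actual glibc or Go resolver: it simply assumes round-robin selection over the three resolv.conf entries and that only the 172.16.200.2 entry times out.

```python
# Toy model of the observed lookup pattern, assuming the stub resolver
# rotates through the nameservers in /etc/resolv.conf on successive
# lookups. The addresses mirror the resolv.conf shown above; BROKEN
# marks the Infrastructure-side director address whose replies are
# rejected, so queries to it time out.

NAMESERVERS = ["172.16.200.2", "172.16.5.30", "172.16.201.4"]
BROKEN = {"172.16.200.2"}

def lookup(attempt):
    """Return (nameserver_used, outcome) for the Nth lookup (0-based)."""
    server = NAMESERVERS[attempt % len(NAMESERVERS)]
    return server, ("FAILED i/o timeout" if server in BROKEN else "OK")

for i in range(4):
    server, outcome = lookup(i)
    print(f"lookup {i + 1}: asked {server} -> {outcome}")
# lookup 1 fails, lookups 2 and 3 succeed, lookup 4 fails again,
# matching the behavior seen from the smoke-tests VM.
```

This is only a model of the symptom, but it shows why a single bad entry in a rotated nameserver list produces an intermittent rather than constant failure.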
I test again against 172.16.200.2 for good measure:
vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.200.2
;; reply from unexpected source: 172.16.201.4#53, expected 172.16.200.2#53
Server:   172.16.200.2
Address:  172.16.200.2#53

Name:     api.cf.lab.local
Address:  172.16.201.5
We can see the routing table on the Ops Manager Director below:
vcap@bm-f80e2644-c1d2-4c30-af89-4885bacf1a98:~$ netstat -rn
Kernel IP routing table
Destination   Gateway       Genmask        Flags  MSS Window  irtt Iface
0.0.0.0       172.16.200.1  0.0.0.0        UG       0 0          0 eth0
172.16.200.0  0.0.0.0       255.255.255.0  U        0 0          0 eth0
172.16.201.0  0.0.0.0       255.255.255.0  U        0 0          0 eth1
It is dual-homed on both the Deployment and Infrastructure networks, as expected, and its default gateway is on the Infrastructure network. The Ops Mgr should be able to communicate on either network, using either address. If the Ops Mgr were to receive traffic from a network not local to it, it would need to respond using its Infrastructure address, 172.16.200.2, since that is the interface that shares a network with the default gateway.
It sees a packet coming from the smoke-tests VM at 172.16.201.24, destined for its 172.16.200.2 address. The Ops Mgr responds from its 172.16.201.4 address, since that interface is local to the requester. But this is not correct from the requester's point of view: it is not prepared to see a reply from a different address than the one it queried. This asymmetric routing creates the issue.
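The rejection itself can be demonstrated in isolation. The sketch below is a hypothetical stand-in using plain UDP sockets on 127.0.0.1 (different ports substitute for the director's two addresses; this is not real DNS): a client sends a "query" to one address, the "reply" comes back from another, and the client detects the source mismatch just as nslookup did.

```python
import socket

# query_target stands in for the director's Infrastructure address
# (172.16.200.2:53); other_iface stands in for its Deployment address
# (172.16.201.4:53). All addresses/ports here are illustrative.
query_target = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
query_target.bind(("127.0.0.1", 0))

other_iface = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
other_iface.bind(("127.0.0.1", 0))

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.bind(("127.0.0.1", 0))
client.settimeout(2.0)

expected = query_target.getsockname()
client.sendto(b"query", expected)             # ask the "Infrastructure" address
_, client_addr = query_target.recvfrom(64)
other_iface.sendto(b"reply", client_addr)     # but answer from the "Deployment" one

data, source = client.recvfrom(64)
if source != expected:
    # A well-behaved resolver discards this datagram and eventually
    # times out, which is exactly the "i/o timeout" the cf CLI reported.
    print(f"reply from unexpected source: {source}, expected {expected}")
```

The key point is that a UDP client matches replies to queries by source address; a reply from the "wrong" interface is as good as no reply at all.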
It turns out that the Ops Mgr's Infrastructure network IP address being listed in
/etc/resolv.conf on the smoke-tests VM is a bug. You will not see this issue if you deploy just one single network. But if you split the Infrastructure and Deployment networks, the Ops Mgr is multi-homed and you will see this bug.
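If you want to check a VM for this condition, a small helper like the following can flag a resolv.conf that lists more than one of the director's addresses. This is an illustrative sketch, not part of any Pivotal tooling; the function name and addresses are my own.

```python
# Hedged sketch: given the text of /etc/resolv.conf and the set of
# addresses the Ops Manager Director holds on its networks, return
# the director addresses that appear as nameservers. Seeing both of a
# multi-homed director's addresses is the symptom described above.

def multihomed_director_entries(resolv_conf, director_ips):
    nameservers = [
        line.split()[1]
        for line in resolv_conf.splitlines()
        if line.startswith("nameserver") and len(line.split()) > 1
    ]
    return [ns for ns in nameservers if ns in director_ips]

conf = """nameserver 172.16.200.2
nameserver 172.16.5.30
nameserver 172.16.201.4
"""
suspect = multihomed_director_entries(conf, {"172.16.200.2", "172.16.201.4"})
print(suspect)  # -> ['172.16.200.2', '172.16.201.4'] : the buggy configuration
```

A healthy single-network deployment would show at most one director address here.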
This is fixed in Pivotal Cloud Foundry 1.4. Thanks to Pivotal support, who replied with the fix below:
To change this behaviour in Pivotal CF v1.3.x, on the Ops Manager VM, change
"dns" => [microbosh_dns_ip] + network.parsed_dns,
to
"dns" => network.parsed_dns,
Now you can re-enable the smoke-tests errand, re-apply changes, and all will be well!