Recently I built a Pivotal Cloud Foundry lab. I used the following builds:
Ops Manager 1.3.4.0
Elastic Runtime 1.3.5.0
Ops Metrics 1.3.3.0
I fired up PCF in my lab and have been playing with it. One thing that bothered me was that the smoke-tests errand would fail on Elastic Runtime. I tried both Elastic Runtime 1.3.4.0 and 1.3.5.0, and my workaround was simply to uncheck it so it did not run as a Post-Install Errand. But I was not happy with that. The exact error given by the installer looks like this:
```
cf api https://api.cf.lab.local --skip-ssl-validation
Setting api endpoint to https://api.cf.lab.local...

FAILED
i/o timeout
```
Basically, while running the smoke-tests the errand fails to establish a connection to the API. When I tested with cf from my laptop, however, it worked fine and responded quickly.
So I simply finished the install and unchecked the smoke-tests errand. Installation completed just fine.
Determined to troubleshoot the issue, I manually kicked off the errand from the Director using bosh run errand smoke-tests.
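For reference, the full flow with the BOSH v1 CLI that ships with PCF 1.3 looks roughly like this (a sketch; the Director address is the one from my lab below, and cf-xxxxxxxx is a placeholder for the deployment name, which varies per install):

```
# sketch: running the smoke-tests errand manually with the BOSH v1 CLI
bosh target https://172.16.200.2:25555        # the Ops Manager Director
bosh deployments                              # note the cf-* deployment name
bosh download manifest cf-xxxxxxxx /tmp/cf.yml
bosh deployment /tmp/cf.yml
bosh run errand smoke-tests
```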
So that you have your bearings, my environment is:
PCF Infrastructure network: 172.16.200.0/24
PCF Deployment network: 172.16.201.0/24
I then logged into the VM where the smoke-tests errand was running, and tried the API commands that were failing:
```
vcap@a1e1b7bf-ae1a-4e29-9cfb-34cd1a62be07:~$ /var/vcap/packages/cli/bin/cf api https://api.cf.lab.local --skip-ssl-validation
Setting api endpoint to https://api.cf.lab.local...

FAILED
i/o timeout
vcap@a1e1b7bf-ae1a-4e29-9cfb-34cd1a62be07:~$
vcap@a1e1b7bf-ae1a-4e29-9cfb-34cd1a62be07:~$ /var/vcap/packages/cli/bin/cf api https://api.cf.lab.local --skip-ssl-validation
Setting api endpoint to https://api.cf.lab.local...
OK

API endpoint:   https://api.cf.lab.local (API version: 2.13.0)
Not logged in. Use 'cf login' to log in.
```
Sure enough, you can see it fails, but when you run it again it succeeds! So off I went troubleshooting. Over three attempts the pattern was fail, succeed, succeed, which drew me toward a possible DNS issue.
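A quick loop (my own sketch, not part of the original session) makes the pattern easy to see:

```
# sketch: repeat the API call a few times to observe the fail/succeed pattern
for i in 1 2 3 4 5 6; do
  echo "--- attempt $i ---"
  /var/vcap/packages/cli/bin/cf api https://api.cf.lab.local --skip-ssl-validation
done
```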
So I inspected my DNS and found it working properly. I have a wildcard DNS record set up for *.cf.lab.local, and it returns the HAProxy address with no problem.
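For example, resolving an arbitrary name under the wildcard against my internal DNS server comes straight back with the HAProxy address (an illustrative check, not from my original notes):

```
# any hostname under *.cf.lab.local should resolve to the HAProxy address
nslookup anything.cf.lab.local 172.16.5.30
```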
So I looked at the DNS configuration on the smoke-tests machine itself, and that's where I found the issue. When I log into my smoke-tests VM, here is what /etc/resolv.conf looks like:
```
vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ cat /etc/resolv.conf
nameserver 172.16.200.2
nameserver 172.16.5.30
nameserver 172.16.201.4
```
172.16.200.2 is the Infrastructure network address of the Ops Manager Director.
172.16.5.30 is my internal DNS server, the one I provided whenever I was asked for DNS (OVA deployment, Ops Manager install, Elastic Runtime install, etc.).
172.16.201.4 is the Deployment network address of the Ops Manager Director.
Obviously the host will attempt to use the first address listed in /etc/resolv.conf to resolve names, and it may then alternate name servers on subsequent requests. This explains why it fails the first time: it tries the first name server. But then if you run the API call again it succeeds, and again it succeeds, and then it fails again. Below I test resolution against all three hosts listed in resolv.conf:
```
vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.200.2
;; reply from unexpected source: 172.16.201.4#53, expected 172.16.200.2#53
Server:   172.16.200.2
Address:  172.16.200.2#53

Name:     api.cf.lab.local
Address:  172.16.201.5

vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.5.30
Server:   172.16.5.30
Address:  172.16.5.30#53

Name:     api.cf.lab.local
Address:  172.16.201.5

vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.201.4
Server:   172.16.201.4
Address:  172.16.201.4#53

Name:     api.cf.lab.local
Address:  172.16.201.5
```
When attempting to resolve against 172.16.200.2, you see it output "reply from unexpected source". There is also a considerable delay, enough to cause a timeout. Resolution against the other two servers is instantaneous, with no delay or strange output.
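Wrapping the lookups in time makes the difference obvious (illustrative; exact figures will vary):

```
# illustrative: compare lookup latency against each name server
time nslookup api.cf.lab.local 172.16.200.2   # slow, "reply from unexpected source"
time nslookup api.cf.lab.local 172.16.5.30    # fast
```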
I test again against 172.16.200.2 for good measure:
```
vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.200.2
;; reply from unexpected source: 172.16.201.4#53, expected 172.16.200.2#53
Server:   172.16.200.2
Address:  172.16.200.2#53

Name:     api.cf.lab.local
Address:  172.16.201.5
```
We can see the routing table on the Ops Manager Director below:
```
vcap@bm-f80e2644-c1d2-4c30-af89-4885bacf1a98:~$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         172.16.200.1    0.0.0.0         UG        0 0          0 eth0
172.16.200.0    0.0.0.0         255.255.255.0   U         0 0          0 eth0
172.16.201.0    0.0.0.0         255.255.255.0   U         0 0          0 eth1
```
It is dual-homed on both the Deployment and Infrastructure networks, as expected. Its default gateway is on the Infrastructure network. The Ops Manager Director should be able to communicate on either network, using either address. Obviously, if it were to receive traffic from a network not local to it, it would need to respond from its Infrastructure address 172.16.200.2, since that is the interface sharing a network with the default gateway.
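You can see that decision with ip route get on the Director, which shows the source address the kernel would pick for a non-local destination (illustrative output, using my internal DNS server as an example destination):

```
# illustrative: a non-local destination goes via the default gateway on eth0,
# so the reply is sourced from the Infrastructure address
vcap@bm-f80e2644-c1d2-4c30-af89-4885bacf1a98:~$ ip route get 172.16.5.30
172.16.5.30 via 172.16.200.1 dev eth0  src 172.16.200.2
```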
It see’s a packet coming to it from the smoke-test VM at 172.16.201.24
, and destined to its 172.16.200.2 address. The Ops Mgr responds from its 172.16.201.4
address as its local to the requester. But this does not seem correct, as the requester is not prepared to see a reply from a different address than it requested. This asymmetric routing creates an issue.
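The same check against the smoke-tests VM's address shows why the reply comes from the Deployment interface instead (again illustrative):

```
# illustrative: a destination on the Deployment network is directly connected
# via eth1, so the kernel sources the reply from 172.16.201.4
vcap@bm-f80e2644-c1d2-4c30-af89-4885bacf1a98:~$ ip route get 172.16.201.24
172.16.201.24 dev eth1  src 172.16.201.4
```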
It turns out that the Ops Manager Director's Infrastructure network IP address being listed in /etc/resolv.conf on the smoke-tests VM is a bug. You will not see this issue if you deploy just one network, but if you split the Infrastructure and Deployment networks, the Director is multi-homed and you will hit this bug.
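A quick way to check whether a given VM is affected is simply to look for the Director's Infrastructure address in its resolver configuration (a sketch, using my lab's addressing):

```
# sketch: run on any Deployment-network VM; a hit means it is affected
grep 172.16.200.2 /etc/resolv.conf
```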
This is fixed in Pivotal Cloud Foundry 1.4. Thanks to Pivotal support, who replied with the fix below:
To change this behaviour in Pivotal CF v1.3.x, on the Ops Manager VM, edit /home/tempest-web/tempest/app/models/tempest/manifests/network_section.rb and change line 20 from

```
"dns" => [microbosh_dns_ip] + network.parsed_dns,
```

to

```
"dns" => network.parsed_dns,
```
Now you can re-enable the smoke-tests errand, re-apply changes, and all will be well!
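Once the change is in place and changes have been re-applied, re-running the errand by hand is a quick confirmation:

```
# confirm from the Director that the errand now passes
bosh run errand smoke-tests
```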