smoke-tests fail in Pivotal Cloud Foundry 1.3 (Solution)

Recently I built a Pivotal Cloud Foundry lab. I used the following builds:

Ops Manager 1.3.4.0
Elastic Runtime 1.3.5.0
Ops Metrics 1.3.3.0

I fired up PCF in my lab and have been playing with it. One thing that bothered me was that the smoke-tests errand would fail on Elastic Runtime. I tried both Elastic Runtime 1.3.4.0 and 1.3.5.0, and my workaround was simply to uncheck the errand so it did not run as a Post Install Errand. But I was not happy with that. The exact error given by the installer looks like this:

cf api https://api.cf.lab.local --skip-ssl-validation
Setting api endpoint to https://api.cf.lab.local...
FAILED
i/o timeout

What is happening is that the smoke-tests errand fails to establish a connection to the API. When I tested with cf from my laptop, however, it worked fine and was quick.

So I simply finished the install and unchecked the smoke-tests errand. Installation completed just fine.

Determined to troubleshoot the issue, I manually kicked off the errand from the director using bosh run errand smoke-tests.

So you have your bearings, my environment is:

PCF Infrastructure network: 172.16.200.0/24

PCF Deployment network: 172.16.201.0/24

 

I then logged into the VM where the smoke-tests errand was running, and tried the API command that was failing:

vcap@a1e1b7bf-ae1a-4e29-9cfb-34cd1a62be07:~$ /var/vcap/packages/cli/bin/cf api https://api.cf.lab.local --skip-ssl-validation
Setting api endpoint to https://api.cf.lab.local...
FAILED
i/o timeout
vcap@a1e1b7bf-ae1a-4e29-9cfb-34cd1a62be07:~$
vcap@a1e1b7bf-ae1a-4e29-9cfb-34cd1a62be07:~$ /var/vcap/packages/cli/bin/cf api https://api.cf.lab.local --skip-ssl-validation
Setting api endpoint to https://api.cf.lab.local...
OK
API endpoint: https://api.cf.lab.local (API version: 2.13.0)
Not logged in. Use 'cf login' to log in.

 

Sure enough, you can see it fails. But then, when you run it again, it succeeds! So I went off troubleshooting. Running it three times in a row, it fails, succeeds, and succeeds. That pattern pointed me toward a possible DNS issue.

So I inspected my DNS and found it was working properly. I have wildcard DNS set up for *.cf.lab.local, and it returns the HA Proxy address with no problem. So I looked at the DNS configuration on the smoke-tests machine, and that's where I saw the issue.

 

When I log into my smoke-tests VM, here is what the /etc/resolv.conf looks like:

vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ cat /etc/resolv.conf
nameserver 172.16.200.2
nameserver 172.16.5.30
nameserver 172.16.201.4

172.16.200.2 is the Infrastructure Network address for Ops Manager Director

172.16.5.30 is my internal DNS server, the one I configured whenever asked for DNS (ova deployment, Ops Mgr install, ER install, etc)

172.16.201.4 is the Deployment Network address for Ops Manager Director

The host will attempt to use the first address listed in /etc/resolv.conf to resolve names. It may then alternate name servers with subsequent requests. This explains why the API call fails the first time: it tries the first name server. Run the call again and it succeeds, and succeeds once more, and then it fails again. Below I test resolving against all three hosts listed in resolv.conf:

vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.200.2
;; reply from unexpected source: 172.16.201.4#53, expected 172.16.200.2#53
Server: 172.16.200.2
Address: 172.16.200.2#53
 
Name: api.cf.lab.local
Address: 172.16.201.5
 
vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.5.30
Server: 172.16.5.30
Address: 172.16.5.30#53
 
Name: api.cf.lab.local
Address: 172.16.201.5
 
vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.201.4
Server: 172.16.201.4
Address: 172.16.201.4#53
 
Name: api.cf.lab.local
Address: 172.16.201.5

When attempting to resolve against 172.16.200.2, you see it outputs “reply from unexpected source”.

There is also a considerable delay, enough to cause a timeout. Resolutions against the other two servers are instantaneous, with no delay or any strange output.

I test again against 172.16.200.2 for good measure:

vcap@497b8af0-dc12-49e4-a702-ad59c6348d59:~$ nslookup api.cf.lab.local 172.16.200.2
;; reply from unexpected source: 172.16.201.4#53, expected 172.16.200.2#53
Server: 172.16.200.2
Address: 172.16.200.2#53
 
Name: api.cf.lab.local
Address: 172.16.201.5
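
The same check can be scripted. What follows is a minimal Python sketch, not anything shipped with the errand or provided by Pivotal: it reads the nameserver lines from resolv.conf text, sends a hand-built A-record query to a server over UDP, and reports which address the reply actually came from and how long it took. The function names (parse_nameservers, build_a_query, classify, probe) are made up for illustration, and it assumes plain UDP DNS on port 53.

```python
import socket
import struct
import time


def parse_nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they appear in resolv.conf."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_a_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    )
    # Question: encoded name, terminating zero byte, QTYPE=A (1), QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def classify(server, source_ip):
    """Describe a reply in roughly the terms nslookup uses."""
    if source_ip is None:
        return "no reply"
    if source_ip != server:
        return "reply from unexpected source " + source_ip
    return "ok"


def probe(server, hostname, timeout=5.0):
    """Send one query to server; return (reply_source_ip or None, seconds)."""
    # The socket is left unconnected on purpose, so a reply arriving from a
    # different address is still delivered and can be reported.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    started = time.monotonic()
    try:
        sock.sendto(build_a_query(hostname), (server, 53))
        _, (source_ip, _) = sock.recvfrom(512)
        return source_ip, time.monotonic() - started
    except socket.timeout:
        return None, time.monotonic() - started
    finally:
        sock.close()
```

On the smoke-tests VM, looping over parse_nameservers(open("/etc/resolv.conf").read()) and passing each probe result to classify should, in a setup like mine, flag the first name server in the same way nslookup does. The sketch does not validate the reply contents; it only looks at where the reply came from.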

We can see the routing table on the Ops Manager Director below:

vcap@bm-f80e2644-c1d2-4c30-af89-4885bacf1a98:~$ netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 172.16.200.1 0.0.0.0 UG 0 0 0 eth0
172.16.200.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
172.16.201.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1

 

It is dual-homed on both the Deployment and Infrastructure networks, as expected. Its default gateway is on the Infrastructure network. The Ops Mgr should be able to communicate on either network, using either address. If the Ops Mgr were to receive traffic from a network not local to it, it would need to respond using its Infrastructure address, 172.16.200.2, since that is the one that shares a network with the default gateway.

It sees a packet coming from the smoke-tests VM at 172.16.201.24, destined for its 172.16.200.2 address. The Ops Mgr responds from its 172.16.201.4 address, since that address is local to the requester. But this is a problem, because the requester is not prepared to see a reply from a different address than the one it queried. This asymmetric routing creates an issue.
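
To make the source-address choice concrete, here is a small Python model of it. This is a simplified sketch under my own assumptions, not the kernel's actual algorithm: it assumes the reply leaves from the interface whose directly connected network contains the destination, and falls back to the default route's interface otherwise. The interface addresses come from the routing table and resolv.conf notes above; the names DIRECTOR_INTERFACES and reply_source are hypothetical.

```python
import ipaddress

# Interfaces on the Ops Manager Director, per the routing table above.
DIRECTOR_INTERFACES = {
    "eth0": ipaddress.ip_interface("172.16.200.2/24"),  # Infrastructure network
    "eth1": ipaddress.ip_interface("172.16.201.4/24"),  # Deployment network
}
DEFAULT_ROUTE_INTERFACE = "eth0"  # default gateway 172.16.200.1 is on eth0


def reply_source(destination):
    """Pick the source address for a packet sent to destination.

    Simplified model: use the interface whose connected network contains
    the destination, otherwise the interface that holds the default route.
    """
    target = ipaddress.ip_address(destination)
    for interface in DIRECTOR_INTERFACES.values():
        if target in interface.network:
            return str(interface.ip)
    return str(DIRECTOR_INTERFACES[DEFAULT_ROUTE_INTERFACE].ip)


queried = "172.16.200.2"                  # address the smoke-tests VM asked
source = reply_source("172.16.201.24")    # smoke-tests VM on the Deployment network
print(source, source == queried)          # 172.16.201.4 False
```

The mismatch between the queried address and the reply's source is exactly what nslookup reports as a reply from an unexpected source. A client on a network that is not directly attached (for example 172.16.5.30) would get its reply from 172.16.200.2 in this model, which is why the problem only shows up for VMs on the Deployment network.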

It turns out that Ops Mgr’s Infrastructure network IP address being listed in /etc/resolv.conf on the smoke-tests VM is a bug. You will not see this issue if you deploy a single network. But if you split the Infrastructure and Deployment networks, then Ops Mgr is multi-homed and you will see this bug.

 

This is fixed in Pivotal Cloud Foundry 1.4. Thanks to Pivotal support, who replied with the fix below:

To change this behaviour in Pivotal CF v1.3.x, on the Ops Manager VM, change /home/tempest-web/tempest/app/models/tempest/manifests/network_section.rb

Line 20: "dns" => [microbosh_dns_ip] + network.parsed_dns,

to "dns" => network.parsed_dns,

 

Now you can re-enable the smoke-tests errand and re-apply changes and all will be well!
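
If you want to confirm the change took effect, one option is to check resolv.conf on a deployed VM for the address that should no longer be there. The following Python sketch is only an illustration: unwanted_nameservers is a hypothetical helper, and the sample text is the resolv.conf I captured before applying the fix. I would expect an empty list once the fix is in place, but I have not shown that output here.

```python
def unwanted_nameservers(resolv_conf_text, unwanted):
    """Return nameserver entries from resolv.conf text that appear in unwanted."""
    found = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver" and fields[1] in unwanted:
            found.append(fields[1])
    return found


# resolv.conf from the smoke-tests VM before the fix (taken from this post).
BEFORE_FIX = """\
nameserver 172.16.200.2
nameserver 172.16.5.30
nameserver 172.16.201.4
"""

# The Director's Infrastructure address is the entry that caused the timeouts.
print(unwanted_nameservers(BEFORE_FIX, {"172.16.200.2"}))  # ['172.16.200.2']
```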
