The Genesis Blog

Debugging BOSH IaaS Issues

Thursday, November 8, 2018

A well-tuned BOSH director is a thing of beauty. When all the pieces and parts are humming and clicking properly, everything just works and you start to forget that provisioning operating systems and software stacks, especially in the cloud provider of your choice, was ever difficult in the first place.

Meet The T3, Same As The T2

On August 21st, 2018, Amazon introduced T3, the next round of upgrades to the T-series of instance types that include the venerable t2.micro and the miserly t2.nano. If you, like me, do a lot of cloud demos and want to keep costs down, you've probably run into a Tx.something or five.

I just so happened to be sitting down to fire up a lab for installing Genesis when I saw the announcement, so I decided to take them for a spin. I was also writing distributable Terraform configurations for Genesis at the time, so I incorporated the T3s into the cloud config template, deployed a VPC, and promptly forgot about the instance type.

All was not well, however, and my first deployment on the demo BOSH director was failing to compile packages, with this error:

Task 21 | 14:08:28 | Compiling packages: vault/4f754614313fc27928476ec02e248efd1622c94d (00:15:33)
               L Error: Timed out pinging to 4a320dab-328d-48d0-bd6f-f7776f821b53 after 600 seconds

We see this a lot when networking is wonky: firewalls, bad NAT configurations, and AWS security group shenanigans can all cause it. When BOSH says it's pinging the instance, it's actually waiting patiently for the instance to send an "All Good" on the BOSH NATS message bus.
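
A quick way to rule the network in or out, once you have a shell on the instance (more on that in a second), is to poke the director's NATS port directly. A minimal check, assuming the director lives at 10.0.0.4 (substitute yours) and NATS is on its default port of 4222:

# Can the instance reach the director's NATS endpoint?
# (10.0.0.4 is a placeholder for your director's IP; 4222 is the default port)
nc -z -v -w 5 10.0.0.4 4222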

In Amazon, I can use my EC2 SSH key pair to remotely access the instance, as the vcap user:

SSH_AUTH_SOCK= ssh -i <path-to-ec2-private-key> vcap@<compilation-vm-ip>

(<path-to-ec2-private-key> is wherever you stashed the private key for your EC2 key pair. I also unset $SSH_AUTH_SOCK because I have a lot of keys in my SSH agent, and I don't want to run out of authentication attempts.)
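
If you'd rather not blank out the agent socket, ssh's IdentitiesOnly option gets you the same result, by offering only the key you name with -i:

# Offer only the named key, ignoring everything in the agent
ssh -o IdentitiesOnly=yes -i <path-to-ec2-private-key> vcap@<compilation-vm-ip>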

Either way, that lets me in just fine. The instance is clearly being created, and there are at least no weird networking restrictions between my jumpbox and the compilation VM. We need to dig in more.

The next step is to review the BOSH agent logs (in /var/vcap/bosh/log) and see what the agent is doing, but to do that I need root privileges.

$ sudo -i
Password for vcap:

Dang. So close. I don't know what the vcap user's password is. Normally, BOSH randomizes it for each instance it deploys, to help secure the entire platform. We need to knock that off for a bit, since it is interfering with our troubleshooting.

A Detour To "Soften" The VMs

To get our sudo back in working order, we have to soften up these VMs a little bit, in the security department.

First, we are going to need a suitable, encrypted password:

$6$lFvaAqSVJbrigByR$XRtYFRHw2CdNQ54g3sCew5aNpRIrU9gPC8lY8b6ZVM8Er9j16HNyIlJERj4RAPf96p8Ggn/94PUrzEJot0nV4/

BOSH can set a password for the vcap user, if you give it one in crypt(3) / shadow(5) format. In case you don't read the SHA-512 dialect of crypt, that says "debug".
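
If you want to mint your own hash instead of borrowing mine, mkpasswd will do it (a sketch; mkpasswd ships with the whois package on Debian/Ubuntu, and Python's standard library works in a pinch):

# Generate a SHA-512 crypt(3) hash for the password "debug"
mkpasswd -m sha-512 debug

# Or, without mkpasswd:
python3 -c 'import crypt; print(crypt.crypt("debug", crypt.mksalt(crypt.METHOD_SHA512)))'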

Now, we just need to inject that into the deployment recipe, via cloud config. Because I'm specifically troubleshooting my compilation VMs, I'm putting this in the compilation: stanza.

compilation:
  #
  # ... the rest of the compilation block ...
  #

  env:
    bosh:
      password: $6$lFvaAqSVJbrigByR$XRtYFRHw2CdNQ54g3sCew5aNpRIrU9gPC8lY8b6ZVM8Er9j16HNyIlJERj4RAPf96p8Ggn/94PUrzEJot0nV4/
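
To pick up the change, upload the modified cloud config and redeploy. Assuming an environment alias of demo and a deployment named vault (yours will differ):

# Upload the cloud config with the new compilation env, then redeploy
bosh -e demo update-cloud-config cloud-config.yml
bosh -e demo -d vault deploy manifest.yml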

You can also put the env: stuff in your instance_groups: stanza in the deployment manifest, if compilation is fine but you're hitting similar "pinging" issues elsewhere.
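
That version looks something like this (a sketch; the group name and the rest of the manifest are whatever yours already says):

instance_groups:
- name: vault
  #
  # ... instances, jobs, vm_type, networks, etc. ...
  #
  env:
    bosh:
      password: $6$lFvaAqSVJbrigByR$XRtYFRHw2CdNQ54g3sCew5aNpRIrU9gPC8lY8b6ZVM8Er9j16HNyIlJERj4RAPf96p8Ggn/94PUrzEJot0nV4/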

One re-deploy later, and I'm back on a new (failing) compilation VM, ONLY THIS TIME I CAN SUDO!

$ SSH_AUTH_SOCK= ssh -i <path-to-ec2-private-key> vcap@<compilation-vm-ip>
$ sudo -i
Password for vcap: (debug)

# whoami
root

Now we're getting somewhere.
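
With root in hand, the agent can finally tell its story. On a stock stemcell, the live log is the runit-style current file under /var/vcap/bosh/log:

# Follow the agent's live log
tail -f /var/vcap/bosh/log/current

# Or zero in on the device probing
grep 'Checking if file exists /dev/' /var/vcap/bosh/log/current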

Remember that T3?

Turns out the T3 line is not at all the same as the T2. Namely, the disk device names under /dev/ are all kinds of different: nvme* instead of sd*. My stemcell was too old for that kind of change, and the BOSH agent was quite insistent on mounting /dev/sdb and /dev/sdf, which just plain weren't there.

Here's the relevant log snippet, in case it helps:

2018-11-08_15:53:49.01246 [File System] 2018/11/08 15:53:49 DEBUG - Checking if file exists /dev/xvdb
2018-11-08_15:53:49.01248 [File System] 2018/11/08 15:53:49 DEBUG - Stat '/dev/xvdb'
2018-11-08_15:53:49.01248 [File System] 2018/11/08 15:53:49 DEBUG - Checking if file exists /dev/vdb
2018-11-08_15:53:49.01249 [File System] 2018/11/08 15:53:49 DEBUG - Stat '/dev/vdb'
2018-11-08_15:53:49.01249 [File System] 2018/11/08 15:53:49 DEBUG - Checking if file exists /dev/sdb
2018-11-08_15:53:49.01249 [File System] 2018/11/08 15:53:49 DEBUG - Stat '/dev/sdb'
2018-11-08_15:53:49.01298 [unlimitedRetryStrategy] 2018/11/08 15:53:49 DEBUG - Making attempt #271
2018-11-08_15:53:49.01300 [DelayedAuditLogger] 2018/11/08 15:53:49 ERROR - Unix syslog delivery error
2018-11-08_15:53:49.11261 [File System] 2018/11/08 15:53:49 DEBUG - Checking if file exists /dev/xvdb
2018-11-08_15:53:49.11264 [File System] 2018/11/08 15:53:49 DEBUG - Stat '/dev/xvdb'
2018-11-08_15:53:49.11264 [File System] 2018/11/08 15:53:49 DEBUG - Checking if file exists /dev/vdb
2018-11-08_15:53:49.11264 [File System] 2018/11/08 15:53:49 DEBUG - Stat '/dev/vdb'
2018-11-08_15:53:49.11265 [File System] 2018/11/08 15:53:49 DEBUG - Checking if file exists /dev/sdb
2018-11-08_15:53:49.11265 [File System] 2018/11/08 15:53:49 DEBUG - Stat '/dev/sdb'
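
You can see the name change for yourself from a shell on the instance (a quick check; exact device names vary by instance type and attached volume count). On a T2, EBS volumes show up under /dev/xvd* or /dev/sd*; on a Nitro-based T3 they come up as NVMe devices instead:

# List block devices under both naming schemes; only one family will exist
ls /dev/xvd* /dev/sd* /dev/nvme* 2>/dev/null

# lsblk tells the same story, with sizes and mount points
lsblk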

Luckily, I was able to get into the compilation VM via SSH and diagnose the problem in under two hours; a newer stemcell that understood NVMe device names set everything right. And the demo lived happily ever after, the end.