GMP-CF-0001 - Database Scheme Fix Migration

Overview

We recently discovered an issue affecting versions 1.1.0, 1.1.1 and 1.1.2 of the Genesis Cloud Foundry Kit. Locket and Diego were incorrectly configured to place their tables in the postgres database, instead of their dedicated locketdb and diegodb databases.

While this does not affect day-to-day operation of Cloud Foundry, we still see this an issue that needs to be corrected as soon as possible.

Please note that although your PostgreSQL instance may already contain locketdb and diegodb databases, they are empty and unused. All tables that should be in those databases were instead created in the postgres database.

Analysis

A major rewrite of internal Genesis YAML occurred for the 1.1.0 kit release, as various components were upgraded and required configuration changes. Namely, one of these changes was to the database connection information.

For diego and locket, their BOSH release configuration uses the term db_schema to refer to the name of the desired database, and db_driver to set the type of database used (typically mysql or postgres)

sql.db_schema:
    description: "Database name to use for connecting to SQL backend"
    default: ""
sql.db_driver:
    description: "Database driver to use for SQL backend (for example: mysql,postgres)"
    default: mysql

Whereas for most other SQL configuration YAML blocks used a variation of db_name or database.name to refer to the desired database name, and db_type or database.type to refer to the type of database used.

database.name:
    description: "Name of logical database to use."
database.type:
    description: "Type of database: postgres or mysql"

Mistakenly, we set sql.db_schema to an internal db_scheme variable that Genesis sets according to operator configuration. Depending on environment settings, sql.db_schema was either set to postgres or mysql:

sql:
    db_host:      (( grab params.locketdb_host ))
    db_port:      (( grab params.locketdb_port ))
    db_schema:    (( grab params.locketdb_scheme ))
    db_username:  (( grab params.locketdb_user ))
    db_password:  (( grab params.locketdb_password ))
    db_driver:    (( grab params.locketdb_scheme ))

Because PostgreSQL has a database named postgres, and MySQL has a database named mysql by default, our mistake went unnoticed because Cloud Foundry still operated normally as it was able to access and write to those databases.

To fix this, we are making changes to db_schema to reference the proper db_name variable set. To mitigate future occurrences of internal YAML variable misuse, we plan to organize the Cloud Foundry kit's params variables into meta and params, a paradigm that we've used successfully for newer kits.

Impact

We believe the impact to be minimal to the stability and runtime of your Cloud Foundry deployments. However, it is in the best interest to correct this problem as confusion stemming from unorganized data may arise. Specifically, it's possible that selective backups of Cloud Foundry databases expected Diego and Locket data to be in their proper database names; however backups taken of diegodb and locketdb are empty.

The fix for this issue will require downtime. In our testing, applications were inaccessible for 15 minutes while the fix was being applied.

The Process

The following sections contain step-by-step instructions on how to successfully migrate Locket and Diego data from postgres and into their respective databases. This guide will mention when downtime approximately starts and ends.

There are different sets of steps to take depending on your Cloud Foundry environment. Please choose the one that matches your deployment, using the matrix below:

Kit Version Genesis Feature Relevant Section
1.1.0
1.1.1
1.1.2
local-db Standalone Local PostgreSQL
local-ha-db Highly-Available Local PostgreSQL
postgres External PostgreSQL Databases
mysql External MySQL Databases

Standalone Local PostgreSQL

This process applies if you deployed with the local-db feature.

Am I Affected?

Execute the following command, which will grab the tables currently in the postgres database name.

bosh ssh -e [environment name] -d [deployment name] postgres/0 -rc \
  "/var/vcap/packages/postgres/bin/psql -U vcap -p 5432 postgres -t -c '\dt' 2>&1"

Next to the column stdout, there will be a list of tables. Look for any of the following:

public | actual_lrps    | table | diegoadmin
public | configurations | table | diegoadmin
public | desired_lrps   | table | diegoadmin
public | domains        | table | diegoadmin
public | locks          | table | locketadmin
public | tasks          | table | diegoadmin

If any of your output matches above, you are affected and must follow the rest of this guide to upgrade to 1.2.0 or beyond. Otherwise, you can go brew yourself a cup of tea and pursue other endeavors.

Step A1: Getting Started

We'll frequently reference [environment name] and [deployment name] during this guide. [environment name] refers to the name of your BOSH director where Cloud Foundry is deployed to. [deployment name] refers to the name of your Cloud Foundry deployment.

Before starting, ensure your Cloud Foundry deployment is running properly by running smoke tests. If tests pass, continue onto step 2. Otherwise, please diagnose and correct your errors; do not continue until smoke tests pass.

Once you finish this guide, you will once again execute smoke tests to validate your Cloud Foundry deployment. It's a good idea to run smoke tests now so that if your final smoke tests were to fail, you can determine if it was this migration process that caused it and not previous changes

To execute smoke tests, run the following command on your Cloud Foundry deployment:

bosh -e [environment name] -d [deployment name] run-errand smoke-tests

Step A2: Stage Genesis Environment File for Deploy

The Genesis Cloud Foundry Kit version 1.2.0 includes the fix to GMP-CF-001. Download it via genesis download cf/1.2.0 in your Cloud Foundry deployments folder. Then, edit your environment file and set version: to version: 1.2.0. Do not deploy yet, that step will come last.

Step A3: Stop All BBS VMs

In order to migrate the data, we first need to stop the processes that access the affected DB tables. We'll do this via BOSH SSHing into all the BBS instances

To do so, gather how many BBS instances running via:

bosh -e [environment name] -d [deplyoment name] vms | grep bbs

This will print out a list of all BBS instances deployed in your CF environment. With each entry on the list, copy the bbs/instance-guid-here and run:

bosh -e [environment name] -d [deployment name] ssh `bbs/instance-guid-here`

Once connected to your VM, become the root user:

sudo -i

Stop all processess on the VM:

monit stop all

Logout of the root user:

exit

End the SSH session:

exit

Repeat these steps for each BBS instance listed from above.

Approximately 2 minutes after stopping all processes on all VMs,routes to your apps will no longer work and downtime begins.

Step A4: Connect to your Postgres VM

To BOSH SSH into your Postgres VM, execute the following:

bosh -e [environment name] -d [deployment name] ssh postgres

Once connected, become the root user:

sudo -i

It is necessary to add a directory to your PATH, as the commands in step 4 will require it. To do so, run:

export PATH=$PATH:/var/vcap/packages/postgres/bin

Step A5: Migrate the Data

We'll need a place to store SQL dumps, so create a directory under the permanent data directory by running:

mkdir /var/vcap/store/migration
cd /var/vcap/store/migration

Dump the data in the postgres database into files named after their proper database name:

pg_dump -U vcap -p 5432    -t locks postgres > locketdb.sql;
pg_dump -U vcap -p 5432 -c -t actual_lrps -t configurations -t desired_lrps -t domains -t tasks postgres > diegodb.sql;

You will then have two files under /var/vcap/store/migration, locket.sql will contain the Locket database data, and diegodb.sql will contain the Diego database data.

Now that we've exported the data from postgres, we want to import back into the properly named databases:

psql -U vcap -p 5432 locketdb < locketdb.sql;
psql -U vcap -p 5432 diegodb  < diegodb.sql;

Step A6: Verify the Data Migration was Successful

To make certain that no GUIDs were changed and that all primary keys stayed the same, we want to re-export the recently imported data and verify against our original export:

pg_dump -U vcap -p 5432    -t locks locketdb > locketdb.new.sql;
pg_dump -U vcap -p 5432 -c -t actual_lrps -t configurations -t desired_lrps -t domains -t tasks postgres > diegodb.new.sql;

Verify the recently exported data has no difference from the originally exported data:

diff locketdb.sql  locketdb.new.sql;
diff diegodb.sql   diegodb.new.sql;

If all went well, there should be no output from any of those diff commands.

Step A7: Rename Old Databases

Now that we've migrated the data to their proper databases, we want to prevent any old configurations from referencing data from postgres. To do so:

psql -U vcap -p 5432 postgres -c "ALTER TABLE locks          RENAME TO locks_old";
psql -U vcap -p 5432 postgres -c "ALTER TABLE actual_lrps    RENAME TO actual_lrps_old";
psql -U vcap -p 5432 postgres -c "ALTER TABLE configurations RENAME TO configurations_old";
psql -U vcap -p 5432 postgres -c "ALTER TABLE desired_lrps   RENAME TO desired_lrps_old";
psql -U vcap -p 5432 postgres -c "ALTER TABLE domains        RENAME TO domains_old";
psql -U vcap -p 5432 postgres -c "ALTER TABLE tasks          RENAME TO tasks_old";

This way, any software that references the old tables should fail, making it easier to discover improperly configured processes.

Step A8: Redeploy Genesis Environment

If all went well, the next step is to redeploy your Cloud Foundry environment with the v1.2 fix, which will point Locket and Diego to their properly named databases:

genesis deploy [genesis environment name]

If all went well, after redeploy BBS will begin running again and routes will restore within a minute after deploying. Downtime ends here.

Step A9: Cleanup & Verification (Final step)

Execute smoke tests to verify that your redeployed Cloud Foundry environment is operating properly. Once you are satisfied with the results of the migration, you may delete the old tables:

psql -U vcap -p 5432 postgres -c "DROP TABLE locks_old CASCADE";
psql -U vcap -p 5432 postgres -c "DROP TABLE actual_lrps_old CASCADE";
psql -U vcap -p 5432 postgres -c "DROP TABLE configurations_old CASCADE";
psql -U vcap -p 5432 postgres -c "DROP TABLE desired_lrps_old CASCADE";
psql -U vcap -p 5432 postgres -c "DROP TABLE domains_old CASCADE";
psql -U vcap -p 5432 postgres -c "DROP TABLE tasks_old CASCADE";

Highly-Available Local PostgreSQL

This process applies if you deployed with the local-ha-db feature.

Am I Affected?

Execute the following command, which will grab the tables currently in the postgres database name.

bosh ssh -e [environment name] -d [deployment name] postgres/0 -rc \
  "/var/vcap/packages/postgres/bin/psql -U vcap -p 6432 postgres -t -c '\dt' 2>&1"

Next to the column stdout, there will be a list of tables. Look for any of the following:

public | actual_lrps    | table | diegoadmin
public | configurations | table | diegoadmin
public | desired_lrps   | table | diegoadmin
public | domains        | table | diegoadmin
public | locks          | table | locketadmin
public | tasks          | table | diegoadmin

If any of your output matches above, you are affected and must follow the rest of this guide to upgrade to 1.2.0 or beyond. Otherwise, you can go brew yourself a cup of tea and pursue other endeavors.

Step B1: Getting Started

We'll frequently reference [environment name] and [deployment name] during this guide. [environment name] refers to the name of your BOSH director where Cloud Foundry is deployed to. [deployment name] refers to the name of your Cloud Foundry deployment.

Before starting, ensure your Cloud Foundry deployment is running properly by running smoke tests. If tests pass, continue onto step 2. Otherwise, please diagnose and correct your errors; do not continue until smoke tests pass.

Once you finish this guide, you will once again execute smoke tests to validate your Cloud Foundry deployment. It's a good idea to run smoke tests now so that if your final smoke tests were to fail, you can determine if it was this migration process that caused it and not previous changes

To execute smoke tests, run the following command on your Cloud Foundry deployment:

bosh -e [environment name] -d [deployment name] run-errand smoke-tests

Step B2: Stage Genesis Environment File for Deploy

The Genesis Cloud Foundry Kit version 1.2.0 includes the fix to GMP-CF-0001. Download it via genesis download cf/1.2.0 in your Cloud Foundry deployments folder. Then, edit your environment file and set version: to version: 1.2.0. Do not deploy yet, that step will come last.

Step B3: Stop All BBS VMs

In order to migrate the data, we first need to stop the processes that access the affected DB tables. We'll do this via BOSH SSHing into all the BBS instances

To do so, gather how many BBS instances running via:

bosh -e [environment name] -d [deplyoment name] vms | grep bbs

This will print out a list of all BBS instances deployed in your CF environment. With each entry on the list, copy the bbs/instance-guid-here and run:

bosh -e [environment name] -d [deployment name] ssh `bbs/instance-guid-here`

Once connected to your VM, become the root user:

sudo -i

Stop all processess on the VM:

monit stop all

Logout of the root user:

exit

End the SSH session:

exit

Repeat these steps for each BBS instance listed from above.

Approximately 2 minutes after stopping all processes on all VMs,routes to your apps will no longer work and downtime begins.

Step B4: Connect to your Postgres VM

For Highly-Available PostgreSQL installations, all data manipulation commands must execute against the write master. To determine which of your two postgres/* nodes is the master, execute the following command:

bosh ssh postgres/0 -rc \
  "/var/vcap/packages/postgres/bin/psql -U vcap -p 6432 postgres -t -c 'SELECT pg_is_in_recovery()' 2>&1"

If the column stdout reads f, then postgres/0 is your master. If the printed output is t, then postgres/1 is your master.

To BOSH SSH into your Postgres VM, execute the following, substituting # with the number of your master node.

bosh -e [environment name] -d [deployment name] ssh postgres/#

Once connected, become the root user:

sudo -i

It is necessary to add a directory to your PATH, as the commands in step 4 will require it. To do so, run:

export PATH=$PATH:/var/vcap/packages/postgres/bin

Step B5: Migrate the Data

We'll need a place to store SQL dumps, so create a directory under the permanent data directory by running:

mkdir /var/vcap/store/migration
cd /var/vcap/store/migration

Dump the data in the postgres database into files named after their proper database name:

pg_dump -U vcap -p 6432    -t locks postgres > locketdb.sql;
pg_dump -U vcap -p 6432 -c -t actual_lrps -t configurations -t desired_lrps -t domains -t tasks postgres > diegodb.sql;

You will then have two files under /var/vcap/store/migration, locket.sql will contain the Locket database data, and diegodb.sql will contain the Diego database data.

Now that we've exported the data from postgres, we want to import back into the properly named databases:

psql -U vcap -p 6432 locketdb < locketdb.sql;
psql -U vcap -p 6432 diegodb  < diegodb.sql;

Step B6: Verify the Data Migration was Successful

To make certain that no GUIDs were changed and that all primary keys stayed the same, we want to re-export the recently imported data and verify against our original export:

pg_dump -U vcap -p 6432    -t locks locketdb > locketdb.new.sql;
pg_dump -U vcap -p 6432 -c -t actual_lrps -t configurations -t desired_lrps -t domains -t tasks postgres > diegodb.new.sql;

Verify the recently exported data has no difference from the originally exported data:

diff locketdb.sql  locketdb.new.sql;
diff diegodb.sql   diegodb.new.sql;

If all went well, there should be no output from any of those diff commands.

Step B7: Rename Old Databases

Now that we've migrated the data to their proper databases, we want to prevent any old configurations from referencing data from postgres. To do so:

psql -U vcap -p 6432 postgres -c "ALTER TABLE locks          RENAME TO locks_old";
psql -U vcap -p 6432 postgres -c "ALTER TABLE actual_lrps    RENAME TO actual_lrps_old";
psql -U vcap -p 6432 postgres -c "ALTER TABLE configurations RENAME TO configurations_old";
psql -U vcap -p 6432 postgres -c "ALTER TABLE desired_lrps   RENAME TO desired_lrps_old";
psql -U vcap -p 6432 postgres -c "ALTER TABLE domains        RENAME TO domains_old";
psql -U vcap -p 6432 postgres -c "ALTER TABLE tasks          RENAME TO tasks_old";

This way, any software that references the old tables should fail, making it easier to discover improperly configured processes.

Step B8: Redeploy Genesis Environment

If all went well, the next step is to redeploy your Cloud Foundry environment with the v1.2 fix, which will point Locket and Diego to their properly named databases:

genesis deploy [genesis environment name]

If all went well, after redeploy BBS will begin running again and routes will restore within a minute after deploying. Downtime ends here.

Step B9: Cleanup & Verification (Final step)

Execute smoke tests to verify that your redeployed Cloud Foundry environment is operating properly. Once you are satisfied with the results of the migration, you may delete the old tables:

psql -U vcap -p 6432 postgres -c "DROP TABLE locks_old CASCADE";
psql -U vcap -p 6432 postgres -c "DROP TABLE actual_lrps_old CASCADE";
psql -U vcap -p 6432 postgres -c "DROP TABLE configurations_old CASCADE";
psql -U vcap -p 6432 postgres -c "DROP TABLE desired_lrps_old CASCADE";
psql -U vcap -p 6432 postgres -c "DROP TABLE domains_old CASCADE";
psql -U vcap -p 6432 postgres -c "DROP TABLE tasks_old CASCADE";

External PostgreSQL

This process applies if you deployed with the postgres feature.

We do not currently have migration steps for this scenario.

Please reach out to us on Genesis Slack

External MySQL

This process applies if you deployed with the mysql feature.

We do not currently have migration steps for this scenario.

Please reach out to us on Genesis Slack

Help & Support

If you have concerns about the impact of this migration process, or need assistance running through it, please don't hesitate to find us in #help on Slack.