How Bitnami uses kops to manage Kubernetes clusters on AWS

Bitnami runs Kubernetes in production. All of our new services and websites are deployed to Kubernetes. To help us manage the cluster lifecycle, visibility and surrounding infrastructure, we turned to kops and added some extra tooling around it. In this article, we’ll share Bitnami’s experience with kops and infrastructure automation. If you are interested in running a Kubernetes cluster in production, but have non-trivial operational requirements for reliability, reproducibility and developer usability, this article is for you.

Historically, the quick-and-dirty way to launch a Kubernetes cluster on AWS was to use a script in the Kubernetes repository, called kube-up.sh, with a collection of specific environment variables. The “hello world” cluster works, but if you plan on maintaining multiple clusters (e.g. development, staging and production), keeping them up-to-date and configuring their security, you might be disappointed. While setting up a basic cluster is simple with kube-up.sh, Bitnami’s infrastructure team anticipated that, given the rapid evolution of Kubernetes and the need to support the full software development lifecycle, setup wasn’t going to be a one-off, fire-and-forget event. Tools to manage the cluster beyond launch in a way that provides transparency are also vital. If you have ever examined a distributed systems infrastructure whose configuration didn’t have a clear lineage, you can appreciate the “how did the cluster get into that state?” question. Beyond Kubernetes itself, there is also a plethora of surrounding cloud resources that the cluster must operate within: VPCs, subnets, gateways, network ACLs, DHCP options, SSH keys and so forth. To help us manage the surrounding infrastructure, we added some extra tooling around our use of kops.

kops: Kubernetes Operations

kops is a command line interface (CLI) tool that has implemented the major verbs of managing a cluster; as they say on the kops GitHub page, “We like to think of it as kubectl for clusters.” Among the features that make kops an attractive option:

  • Automates provisioning that would otherwise have to be done the hard way
  • Externalizes cluster configuration in editable parameters
  • Applies updates to Kubernetes configuration parameters
  • Simplifies deploying Kubernetes masters in high-availability (HA) mode
  • Supports Kubernetes version upgrades
  • Built on a state-sync model for dry-runs and automatic idempotency
  • Ability to generate Terraform configuration
  • Supports custom add-ons for kubectl

There’s a strong community around kops, very active on Slack (#kops, #sig-aws), and wide adoption, so we feel confident that we’re not excessively reliant on a tool destined for obscurity. The CLI supports the full range of cluster CRUD activities. For instance, kops create cluster sets down the tent stakes for your cluster, bootstrapping the desired configuration state in S3; no cloud resources are created unless the --yes option is applied, and it’s recommended to create the cloud resources in a separate kops update cluster step instead, as sketched below. Of course you can kops get clusters and kops delete cluster as well. The complete set of CLI operations is well documented in the kops GitHub repo.
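
For reference, a minimal bootstrap cycle looks roughly like this; the state-store bucket, cluster name and zones below are placeholders, not our actual values:

# Placeholders only: substitute your own state-store bucket, cluster name and zones
export KOPS_STATE_STORE=s3://example-kops-state-store

# Write the desired cluster configuration to the state store; no cloud resources are created yet
kops create cluster --zones us-east-1a,us-east-1b,us-east-1c k8s.example.com

# Preview the planned changes, then create the cloud resources
kops update cluster k8s.example.com
kops update cluster k8s.example.com --yes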

Configuration As Code

While kops supports just about everything needed to configure a cluster as command line arguments, one of the key wins we enjoy with it is that the desired state for a cluster is represented in a set of files. Using command line options is adequate for ad-hoc operations, but for repeatable and transparent ones, code wins. At Bitnami, we manage cluster configuration changes with a git and Jenkins workflow pipeline. And, of course, we not only want the cluster configuration files in git, we want the driver code in git too. With Jenkins, the pipeline-as-code is manifested as a Jenkinsfile. We use the Jenkinsfile DSL, which is just Groovy, to codify our continuous integration and other automation workloads. Here’s a snippet that drives a kops cluster configuration replace operation where the files for a particular cluster are in our code (in the CLUSTER_PATH folder):

// Sync every cluster spec file under CLUSTER_PATH into the kops state store
def replaceClusterDefinition() {
    sh(
        """
        for f in `ls ${CLUSTER_PATH}`; do
          kops replace -f ${CLUSTER_PATH}/\$f;
        done
        """
    )
}

The replace operation updates the state store, which is an S3 bucket; applying the changes to the running cluster is a separate update operation:

// Dry-run by default; pass dryRun = false to apply the changes with --yes
def updateCluster(dryRun = true) {
    sh "kops update cluster ${CLUSTER_NAME} " + (dryRun ? "" : "--yes")
}

With the cluster state expressed as code in git, established best practices for continuous integration and continuous delivery (CI/CD) can be employed. When an infrastructure team member wants to update the cluster, they edit the config file and push it to git; validation rules are checked in a continuous integration job and, assuming the tests pass, the changes are merged into master. Applying the changes to the cluster is a separate operation, also invoked in Jenkins, that runs the replace and update above.
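
To give a feel for how those pieces fit together, here is a simplified scripted-pipeline sketch of the apply job; the stage names and the manual approval prompt are illustrative, not our exact pipeline:

// Illustrative scripted pipeline; stage names and the approval prompt are examples
node {
    stage('Checkout') {
        checkout scm                // fetch the cluster definition files from git
    }
    stage('Replace cluster definition') {
        replaceClusterDefinition()  // sync git -> kops state store in S3
    }
    stage('Preview changes') {
        updateCluster()             // dry run: show what kops would change
    }
    stage('Apply changes') {
        input message: 'Apply these changes to the cluster?'
        updateCluster(false)        // kops update cluster ... --yes
    }
}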

Filling In The Gaps

At Bitnami we have specific network connectivity and isolation requirements. Peering VPCs across AWS accounts and adding routes, pre-assigning subnets by availability zone, configuring NAT gateways, propagating private DNS and other specific networking needs may not all be addressed by kops. But that’s not a problem; we sandwich our use of kops with some ruby code to ensure the network provisioning is valid. The concerns could just as well be addressed with some Terraform templates.

Validity

The infrastructure surrounding a cluster must have valid network routes, security group ingress rules and so forth. To that end, our validation rules check that:

  • the VPC that Kubernetes runs in has a complete and correct definition
  • peering is set up between the VPC that Kubernetes runs in and the other appropriate VPCs
  • subnets are defined without IP space collisions
  • a default route is configured via the internet gateway
  • outbound access is available via a NAT gateway

By programmatically validating the surrounding infrastructure, we can provision a cluster with confidence that it will work in a repeatable, consistent way.
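
As a simplified illustration of what such a rule looks like (our real validation is ruby code against the AWS APIs; VPC_ID and PEER_VPC_ID below are placeholders), a couple of these checks can be approximated with the AWS CLI:

// Illustrative only: spot checks with the AWS CLI; our actual validation is ruby code
// VPC_ID and PEER_VPC_ID are placeholder variables
def validateSurroundingInfrastructure() {
    sh(
        """
        # An active peering connection must exist between the cluster VPC and the peer VPC
        aws ec2 describe-vpc-peering-connections \\
          --filters Name=requester-vpc-info.vpc-id,Values=${VPC_ID} Name=status-code,Values=active \\
          | grep -q ${PEER_VPC_ID}

        # The cluster VPC must have a route via an internet gateway
        aws ec2 describe-route-tables --filters Name=vpc-id,Values=${VPC_ID} \\
          | grep -q '"GatewayId": "igw-'
        """
    )
}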

Integrity Checking

While we have constraints on who has privileges to update our cloud resources, we check for rogue changes by comparing the contents of the kops state store (an S3 object, s3://<bucket name>/<cluster name>/config) with what’s in git. We periodically check that the state in S3 has not diverged; if it has been modified out-of-band from the pipeline, an alert is posted for the infrastructure team.
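
A minimal sketch of that check, using the same Jenkinsfile conventions as above (STATE_BUCKET is a placeholder, and diffing a single config file simplifies our actual comparison):

// Sketch of the integrity check; STATE_BUCKET is a placeholder and the
// single-file diff simplifies our real comparison against git
def checkStateStoreIntegrity() {
    def drift = sh(
        returnStatus: true,
        script: """
        aws s3 cp s3://${STATE_BUCKET}/${CLUSTER_NAME}/config - \\
          | diff - ${CLUSTER_PATH}/config
        """
    )
    if (drift != 0) {
        // Out-of-band change detected; surface it to the infrastructure team
        error("kops state store for ${CLUSTER_NAME} has diverged from git")
    }
}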

Onward

Using git, Jenkins, kops and a little bit of glue code, we have our cluster configurations fully automated. Our lifecycle operations are repeatable, their history is visible in Jenkins job logs and access to operations is controlled by ACLs in Jenkins. The cluster states are maintained confidently, reproducibly and idempotently. Most of all, this management stack has made the hard things easy. We’re excited to be using, and contributing to, the kops project going forward.

Want to reach the next level in Kubernetes?

Contact us for Kubernetes training