Starting the Cloud Native Infrastructure Revolution

by Fabian Müller, Sebastian Spanner & Adalbert Sklorz / January 22nd 2019

Hi everyone,

We are the platform engineering and operations team. In this blog post we want to give you a little insight into how we tackled the challenges on our way towards a Cloud Native Infrastructure. After we’ve explained what Cloud Native actually means to us, we’ll tell you something about the technical and cultural problems we faced along the way.

Hopefully this will just be the first of many blog posts. We’d like to talk about more technical topics and challenges as well as give you some examples that may inspire you. If you have any questions, feedback or ideas for new topics, feel free to send us an email at peo@p7s1.net.

What is this Cloud Native thing?

We know the term is quite overloaded and used as a buzzword in many of the “blabla” CTO/CIO speeches, often without any real meaning. For us, Cloud Native is not tied to a specific piece of software or a product; nor does it mean that we need to rush to cloud providers and migrate all of our business applications there. As far as we are concerned, these are the major benefits:

  • Speed
  • Scalability
  • Reliability
  • Use of OSS (Open Source Software)

Coming from a background of traditional operations in a legacy enterprise environment, we noticed that most of the applications we managed weren’t able to provide all of these attributes in one place. Shipping new features from development into production was slow and error-prone. Given that test and live systems are never 100% equal, every change came with a risk of breaking functionality or causing a complete outage. Spinning up new environments for testing took a very long time; sometimes it was even impossible because of the costs, time and effort involved. In terms of reliability, the applications had some basic high availability options, but most of them didn’t survive a simple database failover; the apps needed to be restarted manually in order to bring them back up. Scaling an app was also quite frustrating: in most cases there was no real option to scale, so we just upgraded the hardware or resized the virtual machine. Making an application more reliable by changing the source code was nearly impossible as well, since we primarily used proprietary software and modifying the code would have violated our support agreements.

Starting the Journey

We started our journey in early 2015 by researching potential solutions to the problems described above. When looking at the (lack of) speed, we quickly realized that this was not only a technical problem, but also an organizational and cultural one. The technical teams worked in silos and had their own priorities and goals. Creating value for our customers involved almost every team, adding considerable overhead in terms of communication, shifting priorities and planning for change windows. In order to minimize this overhead and increase speed, the development teams needed more control over certain parts of the application lifecycle. For example, engineers on the development side needed extended rights to manage servers or databases.

If teams are able to perform certain tasks internally, some dependencies on other teams can be eliminated. But this change also comes with a few downsides: someone in the dev team needs to expand their remit and learn new skills, such as operating a database. You also need a certain level of trust, as two or more teams could potentially be managing one service. Another way to improve speed is to move from manual processes to more automation. To do this, common infrastructure services (like provisioning virtual machines, storage and network) need to be accessible via APIs.

When considering our options, we quickly realized that it would take the organization a long time to adjust its way of thinking and adapt to the level of technology we desired. So we needed to start researching the technologies that matched our requirements. We began by comparing the big cloud providers, classical VMs and the rising container technologies against our constraints and requirements.

Since our team worked in the Content and Production domain, we had to consider a lot of media content workflows, including incoming and outgoing content transfers as well as archiving. Our content archive is located in on-site datacenters and its size is in the double-digit petabyte range, so moving all services to a public cloud provider would be far too expensive. More than two-thirds of our workflows need to retrieve content from the archive systems and then deliver it to a destination, modify it or archive it again. Since every gigabyte of data transferred out of the cloud would add to the cost, we decided against a public cloud. In addition, most of the workflows grow in a linear manner and have no need for quick scalability. Our conclusion was that the technology we chose would need to run in our own datacenters.

Virtual Machines vs. Containers

For anyone who’s not familiar with containers, the biggest difference compared to a virtual machine, or VM, is the fact that a container is much more lightweight. A VM needs a hypervisor and a full operating system of its own, while offering strong isolation from the other VMs on the same host. Containers, on the other hand, have weaker isolation but don’t carry the VM overhead, which makes them faster and lets them utilize the underlying hardware more effectively.

VMs vs. Containers

reference: https://techburst.io/containers-vs-vms-what-is-the-battle-all-about-a17fb2162826

Coming back to our story: We already had virtualization platforms in our IT department, so why not make use of them? Well, most of their APIs were not available to other departments, and making them available to everyone is quite a challenge that couldn’t be realized within a reasonable timeframe. Building a new virtualization platform didn’t make sense either. At the same time, we noticed that some software projects at ProSiebenSat.1 were aiming for a more service-oriented (SOA) and microservice-style architecture. With that in mind, we started looking at Docker as a container runtime and Docker Swarm as a container orchestrator (the Swarm mode we tested is now referred to as legacy Swarm). Swarm was simple and easy to set up, and worked well for small services with low complexity. However, when we started to migrate and set up a real application with all its requirements, like multiple environments for development and testing, user management, access control, security and so on, we soon realized that Swarm reached its limits. A more promising solution was Kubernetes by Google, whose 1.0 release arrived in July 2015.

Kubernetes – A Container Orchestrator and Abstraction Layer

Kubernetes (K8s) is a container orchestrator and abstraction layer. K8s itself is a very complex, distributed system that helps automate container deployment, management and scaling. It offers a rich API that allows administrators to define how software is run and determine which dependencies are required. This includes compute resources, storage, configuration files, secrets, load balancing for external access, health checks and so on. To configure these items, operators define API objects with YAML or JSON, and apply these files against the K8s API. In other words: You describe your infrastructure as code. This lets you manage it in the same way as you manage your application code.
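
Since everything in K8s comes down to such API objects, it helps to see one. The following is a minimal, purely illustrative Deployment manifest; the name, image, port and resource values are placeholders rather than one of our actual workloads:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-app
    spec:
      replicas: 2                     # K8s keeps two copies running and replaces failed ones
      selector:
        matchLabels:
          app: example-app
      template:
        metadata:
          labels:
            app: example-app
        spec:
          containers:
            - name: example-app
              image: registry.example.com/example-app:1.0.0
              ports:
                - containerPort: 8080
              resources:
                requests:             # compute resources the scheduler reserves for the pod
                  cpu: 100m
                  memory: 128Mi
              livenessProbe:          # health check; K8s restarts the container if it fails
                httpGet:
                  path: /healthz
                  port: 8080

Applied with ‘kubectl apply -f deployment.yaml’, this one file tells the cluster what to run, how many replicas to keep alive and how to check their health – which is exactly what we mean by describing infrastructure as code.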

Running Kubernetes on-site can be quite challenging. There are already packaged solutions out there, but we decided against them. We didn’t want any proprietary software, as this just adds complexity and makes the whole system even harder to run. Another issue was vendor lock-in: we wanted to stick as closely as possible to the vanilla K8s API to stay conformant, which makes switching between environments and providers much easier.

Kubernetes as an abstraction layer

To actually deploy Kubernetes, we had to write our own Ansible roles and playbooks. This was quite an effort, but totally worth the investment, as we are now able to control every single parameter across the entire setup. For our production cluster, we used only bare metal servers and a simple layer 2 network topology to minimize complexity. Over time, the Ansible playbooks evolved and we achieved a highly available cluster setup, including multiple master nodes and workers spread over different datacenters. With the high availability setup, we are also able to perform rolling updates of Kubernetes on our production clusters without any interruption to operations. We established a release process for Kubernetes which aims to always have the previous release deployed to production. For example: when Kubernetes 1.13 is released, we update our clusters to the latest 1.12 version. Every new feature or change to configs or operating system settings has to go through our git-flow model.
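
Just to illustrate the shape of such a playbook (our real roles and inventory layout differ, and the group and role names below – masters, workers, etcd, kube-master, kube-worker – are hypothetical), a rolling update across the cluster can be expressed like this:

    # Hypothetical sketch, not our actual playbooks
    - hosts: masters
      become: true
      serial: 1                 # run against the masters one at a time to keep the API available
      roles:
        - etcd
        - kube-master

    - hosts: workers
      become: true
      serial: "25%"             # run against 25% of the workers per batch
      roles:
        - kube-worker

The ‘serial’ keyword is what makes the rolling part possible: Ansible only moves on to the next batch of hosts once the current one has been updated successfully.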

Git-Flow

reference: https://nvie.com/posts/a-successful-git-branching-model/

The standard procedure is to create a feature branch and test the changes in our minimal local Vagrant/VirtualBox setup. If the change looks OK, we create a merge request for our dev environment and assign it to another person on our team. Once the merge request is accepted, the change is deployed to the dev system, where it is tested both manually and automatically. We also make sure that we stay conformant with the official Kubernetes cluster capabilities by running the Kubernetes test suite from the official GitHub repos. If there are no major problems, we create another merge request for the production system, and the deployment and testing process starts again.
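
Purely as an illustration of this promotion flow – the job names, inventory paths and the conformance wrapper script below are assumptions for the sake of the example, not our actual pipeline – a CI configuration following the git-flow model could look roughly like this:

    # Hypothetical sketch, not our real pipeline
    stages:
      - deploy
      - test

    deploy-dev:
      stage: deploy
      script:
        - ansible-playbook -i inventories/dev site.yml
      only:
        - develop                      # a merge into develop rolls the change out to the dev cluster

    conformance-dev:
      stage: test
      script:
        - ./run-conformance-tests.sh   # wrapper around the upstream Kubernetes test suite
      only:
        - develop

    deploy-prod:
      stage: deploy
      script:
        - ansible-playbook -i inventories/prod site.yml
      only:
        - master                       # a merge into master promotes the change to production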

When it comes to running containers, we quickly realized that a standard K8s deployment didn’t satisfy all our needs. Pretty early on we had requests from our customers to run not only stateless containers, but also stateful applications like databases and message queues. Other requests included, for example, making it easier to expose services running inside the clusters to the outside world. At some point, more and more teams wanted to use our services, so we also needed to think about multi-tenancy. Over time, we found solutions to most of these problems, but we are still trying to improve our clusters and make them easier to use. We learned that, in the end, it takes way more effort than just deploying a basic Kubernetes installation, especially if you’re not running everything inside a public cloud.
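
To give an idea of what the stateful part looks like, here is a hedged sketch of a StatefulSet, the vanilla-Kubernetes object for workloads that need stable identities and their own storage; the image, names and sizes are placeholders, not our configuration:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: example-db
    spec:
      serviceName: example-db          # headless Service that gives each pod a stable DNS name
      replicas: 3
      selector:
        matchLabels:
          app: example-db
      template:
        metadata:
          labels:
            app: example-db
        spec:
          containers:
            - name: db
              image: postgres:11       # placeholder image
              volumeMounts:
                - name: data
                  mountPath: /var/lib/postgresql/data
      volumeClaimTemplates:            # each replica gets its own PersistentVolumeClaim
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 10Gi

On bare metal, the persistent volumes behind those claims don’t appear by themselves; providing a storage backend for them is one example of the extra effort a basic installation doesn’t cover.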

Supporting the Container Lifecycle

To really make use of container technologies, you need more tools than just something that runs containers. Software needs to be built, packaged and stored somewhere. You need to take care of metrics and logs, and also think about security. At the very least you need a place where you can store code and configuration. Over the course of the last 4 years we’ve tackled a lot of these problems and have gradually (kind of) figured most of them out. The problems we had to solve along the way will provide plenty of material for future blog posts. But for now, we’ll just take a look at how code becomes running software.

Container Lifecycle

From Code to Software

All our code is stored in an on-site GitLab instance; this was a prerequisite even before we started on our way towards a Cloud Native Infrastructure. We began with Jenkins as our CI/CD tool, as at the time it was (and still is) the go-to tool for these kinds of jobs. We had Jenkins running as our sole CI/CD tool for a year or two and were mostly happy with it. Over time we realized that taking care of Jenkins can become really tedious, and no one wanted to take full responsibility for its maintenance. We had hundreds of plugins installed; some of them were in use and some of them were not, but no one really knew which. We also created a shared Groovy library for all projects, which was intended to abstract common functions and make it easier to get started. But this ultimately also became hard for us to maintain, as people added functions to the library that we could not use or even understand. Jenkins became the elephant in the room that no one wanted to touch or be responsible for.

While we were battling with Jenkins, Gitlab introduced GitlabCI and – once we realized Jenkins was not our ideal solution – we started playing around with it. Our workloads were relatively simple: most of the time we just built code, ran a few tests, built a Docker image and deployed it afterwards. It took a few tries, but we soon found a solution that works for us to this day. We currently use a custom build container, which is based on Docker but also includes basic tools like make, kubectl and helm. It also contains a wrapper CLI written in Go, which abstracts some common functions but has a narrower scope than the previous Jenkins shared library. The exact steps for running builds or tests are defined in a Makefile. So whichever project you are looking at, ‘make build’ followed by ‘make push’ will always build and push its Docker image.

Our workflow looks like this (a sketch of a matching pipeline configuration follows the list):

  • When code is committed, the GitlabCI pipeline is triggered
  • A build container with our custom base image is started in one of our clusters
  • Code is checked out, tested and built by running ‘make build’
  • The built image is pushed to our container registry
  • Optionally: The software is deployed with helm
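
A hedged sketch of what such a .gitlab-ci.yml can look like is shown below; the image name, stage layout and helm release are illustrative assumptions, while ‘make build’ and ‘make push’ are the Makefile targets described above:

    stages:
      - build
      - deploy

    build:
      stage: build
      image: registry.example.com/peo/build-image:latest    # hypothetical custom build image with make, kubectl and helm
      script:
        - make build              # run the tests and build the Docker image
        - make push               # push the image to the container registry

    deploy:
      stage: deploy
      image: registry.example.com/peo/build-image:latest
      script:
        - helm upgrade --install example-app ./chart        # optional deployment via helm
      when: manual                # the deployment step is optional and triggered on demand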

This works pretty well for most of our use cases. That being said, we still have Jenkins running for other workloads, for example those which involve a lot of UI testing and other additional stages. As always, there is no perfect tool for every job; it always depends on the workload and the problem you’re facing. GitlabCI currently works pretty well for our needs, but we are also considering other tools, as it too has its limitations. After all, adapting to change is one of the most important aspects of modern IT culture.

To Summarize

In closing this blog post, we wanted to give you an impression of our scale by summarizing our platform’s moving parts. Our team currently consists of three people, and we actively maintain the following components:

  • 4 Kubernetes clusters composed of 12 VMs and 40 bare metal servers, with an overall capacity of 2,700 CPU cores and 15 TB of memory
  • Around 120 Git repos
  • Around 70 Docker Images
  • Nearly 30,000 lines of Ansible roles and playbooks

We hope you liked our post. Stay tuned for more!

The PEO team.
