Through a combination of carefully structured remote backends, thoughtful output design, and targeted use of the terraform_remote_state data source, you can establish a controlled information flow between different tenant levels - all without compromising the isolation of individual tenants.
Effectively using remote state for information exchange between organizational units requires a well-thought-out configuration of the Terraform environment. Central to this is the selection and setup of a suitable storage backend for storing state data in what are known as state files.
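The pattern described above can be sketched in a few lines of HCL. This is a minimal, illustrative example assuming an S3 backend; the bucket name, key, and resource names are hypothetical, and the two configurations shown here would live in separate root modules (the producing "platform" level and a consuming "tenant" level):

```hcl
# --- Platform-level configuration (producer) ---
# State is written to a shared remote backend; bucket/key are illustrative.
terraform {
  backend "s3" {
    bucket = "example-terraform-state"
    key    = "platform/terraform.tfstate"
    region = "eu-central-1"
  }
}

# Outputs are the only part of the state consumers should rely on.
output "vpc_id" {
  value = aws_vpc.main.id
}

# --- Tenant-level configuration (consumer, separate root module) ---
# Reads the platform outputs via the terraform_remote_state data source;
# access to the backend can be granted read-only, preserving isolation.
data "terraform_remote_state" "platform" {
  backend = "s3"
  config = {
    bucket = "example-terraform-state"
    key    = "platform/terraform.tfstate"
    region = "eu-central-1"
  }
}

resource "aws_subnet" "tenant" {
  vpc_id     = data.terraform_remote_state.platform.outputs.vpc_id
  cidr_block = "10.0.42.0/24"
}
```

Because the tenant only reads published outputs rather than the platform's resources directly, the platform team remains free to refactor its internals as long as the output contract stays stable.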
Read more: Terraform @ Scale - Part 1c: Practical Implementation of Remote State Data Flows
In the previous part of this series, we explained the basics of the remote state concept in Terraform and how it can be used for information inheritance in multi-tenancy environments. Now we will illustrate this with a concrete architectural example.
Scaling Terraform across organizational boundaries requires a careful balance between standardization and flexibility. With clear team structures, well-thought-out governance, automated CI/CD processes, and appropriate tooling support, even complex multi-tenant infrastructures can be effectively managed. With this foundation, you can expand your Terraform practice from individual teams to the entire organization while ensuring consistency, security, and efficiency.
This is the first part of a series on designing multi-tenancy as Infrastructure-as-Code in large-scale infrastructures.
Target, one of the largest retailers in the USA with over 1,800 stores, faced a complex challenge: orchestrating workloads across multiple environments - from the public cloud to its own data centers and edge locations in stores. Kubernetes was already in use in some areas but was too complex and too expensive in terms of overall operational costs. The decision was ultimately made in favor of HashiCorp Nomad, which led to a significant acceleration of development cycles and a simplification of the infrastructure. This success story highlights a recurring pattern in the industry: companies are increasingly recognizing the value of lean, efficient orchestration solutions that focus on the essentials.
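To give a sense of what "focusing on the essentials" looks like in practice, here is a minimal Nomad job specification. This is an illustrative sketch, not taken from Target's setup; the job name, image, and sizing are hypothetical:

```hcl
# A complete, schedulable Nomad job in ~20 lines - no YAML manifests,
# controllers, or CRDs required. All names here are illustrative.
job "web" {
  datacenters = ["dc1"]
  type        = "service"

  group "frontend" {
    count = 2

    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.27"
      }

      resources {
        cpu    = 200 # MHz
        memory = 128 # MB
      }
    }
  }
}
```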
Read more: Nomad: Modern and Lightweight Workload Orchestration for Enterprises
On July 19, 2024, a severe IT outage caused by a faulty update to CrowdStrike's Falcon platform led to widespread disruptions across sectors including air travel, hospitals, and government agencies. The platform is designed to enhance security by preventing attacks in real time. To achieve this, its monitoring sensors are embedded deep in the operating system and require the highest administrative privileges. That level of access is questionable in itself, but it also introduced an additional attack surface: these deeply integrated sensors received updates via a global distribution system controlled by CrowdStrike - implemented with the good intention of achieving the most consistent global security coverage possible, rather than relying on customers to take action themselves.
Such a centralized approach is, of course, only unproblematic as long as it functions correctly and causes no damage. That is precisely where things went wrong: a faulty update was distributed at scale to systems running Microsoft Windows. Because of the sensor's deep integration, a malformed content update (Channel File 291), processed by the Falcon sensor's kernel driver, triggered the crashes known as Blue Screens, and through this single point of failure the intended consistent global security posture turned into a global outage. Airlines were hit particularly hard - approximately 1,500 flights were canceled - along with banks, retail, and healthcare. Although the update was rolled back, affected systems had to be repaired manually in safe mode. The incident highlighted the fragility of centralized update distribution and the chain reactions such a single point of failure can cause.
Furthermore, the incident showed what happens when fundamental failover safeguards are neglected: robust service health monitoring with automated failover, mechanisms to contain the blast radius, and comprehensive disaster recovery capabilities. Customers who had taken such precautions could simply activate their standby systems - but very few had thought that far ahead. Architectural principles like these are becoming essential for mission-critical systems.
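Health monitoring of the kind described above can be expressed declaratively. As a hedged sketch, here is a Consul service definition with an HTTP health check; the service name, port, and endpoint are assumptions for illustration, not details from the incident:

```hcl
# Illustrative Consul service registration with a health check.
service {
  name = "payments-api"
  port = 8080

  check {
    id       = "payments-api-http"
    http     = "http://localhost:8080/health"
    interval = "10s"
    timeout  = "2s"

    # After 30s in critical state, deregister the instance so traffic
    # fails over to healthy instances instead of a dead endpoint.
    deregister_critical_service_after = "30s"
  }
}
```

The point is that failover is not an afterthought here: once an instance stops passing its check, service discovery stops routing to it, limiting the blast radius of a single failing node.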
Read more: HashiCorp Consul: Modern Enterprise Zero Trust Networking - An Overview
More Articles …
- Securing Modern Enterprise Infrastructure with HashiCorp Vault
- Terraform for Enterprises: Understanding Modern Infrastructure Provisioning, or Lessons from a $460M Mistake
- Introduction to Retrieval-Augmented Generation (RAG) - Part 2
- Everything-as-Code Mindset: A Comprehensive Approach to IT Operations and Beyond