
    Terraform @ Scale - Part 1a: Multi-Tenancy - Inheriting Information Across Organizational Units and Customers

    Scaling Terraform across organizational boundaries requires a careful balance between standardization and flexibility. With clear team structures, well-thought-out governance, automated CI/CD processes, and appropriate tooling support, even complex multi-tenant infrastructures can be effectively managed. With this foundation, you can expand your Terraform practice from individual teams to the entire organization while ensuring consistency, security, and efficiency.

    This is the first part of a series on designing multi-tenancy as Infrastructure-as-Code in large-scale infrastructures.

     

    Introduction

    In the world of modern cloud infrastructures, "multi-tenancy" initially sounds like one of those buzzwords you hear at conferences while secretly checking your emails on your phone. However, for companies that serve dozens or even hundreds of different tenants, departments, or customers on a shared infrastructure, it is a harsh reality. As the number of tenants increases, complexity grows faster than linearly, because dependencies and interfaces multiply: n tenants can already have up to n(n-1)/2 pairwise relationships among them.

    Imagine having to manage 20 separate Terraform projects - each with its own variables, states, backends, and module versions.

    Now imagine having to share fundamental networking information across these projects without using duplicated code blocks that inevitably diverge with the next change.

    And if that still seems manageable, add the organizational dimension: now imagine that different employees and CI/CD pipelines across various projects all define infrastructure as code. These could even be teams from other business units, with different decision-makers and managers in their respective organizational structures.

    Welcome to the world of multi-tenancy management with Terraform.

    Traditional Terraform approaches quickly reach their limits here. The classic pattern - one repository, one state, one workspace - works excellently for manageable environments. However, as soon as you need to share information across logically separate infrastructure domains, things become complicated. Questions arise such as:

    • How do you ensure that all teams have access to the same fundamental network configurations?
    • How do you make sure that changes to such foundational configurations are automatically propagated to dependent teams?
    • How can you prevent different teams from creating configurations that overlap or even conflict, such as identical network addresses?
    • How do you avoid one team accidentally overwriting another team's resources?
    • And how do you maintain oversight despite growing complexity?

    In a previous article, we explored how easily infrastructure disasters can occur. The Knight Capital case - where inconsistent deployment across just eight servers led to a $460 million loss - serves as a stark warning. If a single system with only a few components can have such dramatic consequences, what does that mean for complex multi-tenant environments?

    The good news: with the right patterns and techniques, this complexity can be managed. Terraform offers an elegant way to share information across different infrastructure domains without compromising tenant isolation through its Remote State concept.

    In this first part of our series "Terraform at Scale", we will focus on precisely this topic:

    • How can we effectively use Terraform to manage multi-tenancy environments?
    • How can we strategically inherit information across organizational units?

    We will outline practical patterns that have been successfully applied in various customer projects and that will help you take your own Terraform infrastructure to the next level - without any $460 million mistakes.

    Understanding Multi-Tenancy - More Than Just Separate Environments

    When we talk about multi-tenancy - or tenant capability - in the cloud world, many initially think of fully isolated environments: Customer A gets their own network, Customer B gets their own network, and the two shall never meet. This perspective is not wrong, but it falls short, especially when it comes to Infrastructure-as-Code (IaC).

    In the context of Terraform, multi-tenancy is not just about the technical separation of resources but also about the organizational structuring of code, states, and workflows. It is about modularizing our infrastructure in a way that reflects the reality of our organization - whether it is a service provider with external customers or a company with various business units and projects.

    Logical vs. Physical Tenant Separation

    With physical separation, each tenant receives its own dedicated resources - separate servers, networks, and storage systems. This provides maximum isolation but is costly and often unnecessarily strict. Logical separation, on the other hand, uses shared physical resources while segregating data and access at the application level. In practice, we often see hybrid approaches:

    • Network Layer: Dedicated VPCs or subnets per tenant, but shared physical infrastructure
    • Compute Layer: Dedicated VM instances but on shared hardware
    • Data Layer: Separate databases or schemas but possibly on shared database servers
    • Identity Layer: Separate IAM policies and roles but within a shared identity system

    With Terraform, we can elegantly model these different separation layers but must carefully design the information flows between them.
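
    As a minimal sketch of how such a separation layer can be modeled, the following configuration derives one non-overlapping network range per tenant from a shared base block. The tenant names, the base CIDR, and the variable and output names are illustrative assumptions. The sketch uses only built-in Terraform functions and no provider, so it does not create any resources.

    ```hcl
    variable "base_cidr" {
      description = "Shared address space that all tenant networks are carved from (assumed value)."
      type        = string
      default     = "10.0.0.0/16"
    }

    variable "tenants" {
      description = "Hypothetical tenant names; the list position determines the assigned range."
      type        = list(string)
      default     = ["finance", "hr", "sales"]
    }

    locals {
      # Give every tenant its own /24 by adding 8 bits to the /16 base prefix.
      tenant_cidrs = {
        for index, name in var.tenants : name => cidrsubnet(var.base_cidr, 8, index)
      }
    }

    output "tenant_cidrs" {
      description = "Map of tenant name to its dedicated network range."
      value       = local.tenant_cidrs
    }
    ```

    Because each range depends on the tenant's position in the list, new tenants should be appended rather than inserted in this sketch; otherwise existing ranges would shift.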

    The Organization of Tenants in Reality

    In practice, tenant structures are rarely one-dimensional. A typical company, for example, has:

    • Business units (Finance, HR, Production, Sales)
    • Functional teams within these units (Development, QA, Operations)
    • Projects, which are often cross-functional
    • Environments (Development, Testing, Production) for each project
    • Regional structures due to legal or latency-based requirements

    These organizational dimensions overlap, and our Terraform structure must be able to reflect this complexity.

    While cloud providers such as AWS with Organizations or OCI with Compartments offer hierarchical structures for this, our Terraform code often needs to be even more granularly organized. This is because different departments and their teams are composed of different people, with varying levels of expertise, responsibilities, and sometimes even different expectations, goals, and - in the case of global players - diverse cultural backgrounds.

    All of this contributes to a complexity that can lead to constant surprises when translating it into code - both positive and negative.

    The challenge is also that the various dimensions provide or require different types of information. For example:

    • The network team defines core networks and subnets,
    • the security team configures firewalls, policies, and certificates,
    • the storage team manages filers, block storage, object storage, and backups,
    • the database team provides the databases,
    • and the application teams must be able to access these relevant resources without duplicating them.

    The Problem of Information Inheritance

    And this is precisely where the real challenge lies: How do we ensure that changes to fundamental infrastructure components are automatically propagated to all dependent components?

    • If our network team changes a subnet CIDR, how do all VMs become aware of it?
    • If we add new cloud regions, how do we ensure that all teams use the same default settings?

    In a small environment, we could manage all of this within a single Terraform state. However, this quickly leads to a state monolith with:

    • Hundreds or thousands of resources
    • Slow planning times
    • The risk that one team accidentally impacts another team's resources

    The alternative - completely separate states without information exchange - results in:

    • Duplication,
    • Inconsistencies,
    • and the notorious "Copy & Paste DevOps" that we all strive to avoid.

    Fortunately, Terraform offers an elegant solution to this dilemma with its Remote State concept. In the next section, we will explore how the strategic use of Remote States enables flexible, scalable information inheritance across tenants - without compromising tenant isolation.

    The Terraform Remote State

    Terraform stores information about your infrastructure in a state file, which serves as the single source of truth for the infrastructure it manages.

    Anyone who has ever accidentally deleted a local .tfstate file is painfully aware of this. This seemingly inconspicuous file is Terraform's memory - without it, Terraform no longer knows which resources have already been created and would attempt to provision everything from scratch, often with catastrophic consequences.

    What Is Terraform Remote State and Why Is It Needed?

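    With Remote State, the state file is stored in a shared remote backend (for example, an object storage bucket or HCP Terraform) instead of on a local disk. This provides several key advantages: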

    • Collaboration: Multiple team members can work on the same infrastructure without manually exchanging state files.
    • Security: State files often contain sensitive information - central storage allows for better protection.
    • State Locking: Prevents multiple people from making changes simultaneously, which could lead to inconsistencies.
    • Information Sharing: Remote states can serve as a data source for other Terraform projects.

    In large infrastructures, this last point is particularly crucial. It allows us to build a scalable and modular architecture:

    • Each team or project manages its own Terraform states without embedding dependencies on other teams directly into the code.
    • Shared infrastructure components (e.g., networks, identity and access controls, shared services) can be provided as a remote state by their respective responsible teams.
    • Changes in the global infrastructure reach dependent projects through these outputs: each dependent project picks up the new values the next time it runs a plan or apply.
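
    To make this concrete, here is a minimal sketch of how a project responsible for shared components could store its state remotely. It assumes HCP Terraform or Terraform Enterprise as the backend; the organization and workspace names are hypothetical placeholders.

    ```hcl
    terraform {
      # Store the state centrally instead of in a local .tfstate file.
      backend "remote" {
        organization = "my-org" # hypothetical organization name

        workspaces {
          name = "network-global" # hypothetical workspace of the network team
        }
      }
    }
    ```

    Newer Terraform versions also offer a cloud block for the same purpose, and other backends such as object storage buckets follow the same principle with their own settings.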

    Importance of Remote State for Multi-Tenancy Architectures

    In a typical multi-tenancy environment, we have different layers of infrastructure that build upon each other:

    1. Global Infrastructure: Core networks, IAM configurations, shared services
    2. Tenant-Specific Resources: Databases, application servers, storage solutions, ...
    3. Application Layer: The tenant’s actual workloads

    With remote states, we can cleanly separate these layers without requiring higher layers to duplicate core information from lower layers.

    For example, an application team can access network information managed by the networking team without having to define its own network resources.

    Remote State in Practice

    Let’s take a look at how this works in practice:

    • A central infrastructure team manages core network configurations and storage systems for all teams within a dedicated Terraform project.
    • The central Security NOC defines firewalls, associated rules, and policies.
    • The database team manages databases and provides tenants with isolated database partitions.
    • An application team can then access these network configurations, storage systems, security policies, and databases within its own Terraform projects for various value streams without redefining them. Additionally, pipelines across different value streams can exchange data via remote state.
    • If the network, storage, security, or database team makes changes, downstream projects pick up the updated values on their next plan or apply run, without any manual copying.

    The "terraform_remote_state" Data Source in Detail

    At the heart of this information exchange is the terraform_remote_state data source. It enables a Terraform project to access the state of another project. Here’s a simple example:


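    # Read the outputs of the network team's workspace (read-only).
    data "terraform_remote_state" "network" {
      backend = "remote"
      config = {
        organization = "my-org"
        workspaces = {
          name = "network-global"
        }
      }
    }

    # "cloud_instance" is a generic placeholder; substitute your provider's instance resource type.
    resource "cloud_instance" "app_server" {
      image_id      = "image-12345678"
      instance_type = "standard"
      # The subnet ID comes from the network team's state instead of being hard-coded.
      subnet_id     = data.terraform_remote_state.network.outputs.private_subnet_id
    }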

    In this example, the application team retrieves the subnet ID defined by the network team. The key point here is that only values explicitly declared as outputs in the network project's root module are accessible - this establishes a clear interface between projects.
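
    The producing side of this interface could look like the following sketch, placed in the network team's project. The local value is only a stand-in for the real subnet resource, which depends on the cloud provider in use, and the ID shown is a made-up placeholder.

    ```hcl
    locals {
      # Stand-in for a reference to the actual subnet resource (hypothetical ID).
      private_subnet_id = "subnet-0123456789"
    }

    # Only values declared as root module outputs are visible to other projects.
    output "private_subnet_id" {
      description = "ID of the private subnet that application teams may deploy into."
      value       = local.private_subnet_id
    }
    ```

    Anything the network team does not declare as an output stays invisible to the terraform_remote_state data source in consuming projects.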

    Another powerful feature: Terraform allows access to states from different backends, facilitating information exchange between different environments (e.g., AWS and OCI). This is particularly valuable for organizations pursuing multi-cloud strategies.
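
    A minimal sketch of such a mixed setup, assuming one state stored in an S3 bucket and another in an HCP Terraform workspace. The bucket, key, region, workspace, and output names are all hypothetical, and the project needs valid credentials for both backends.

    ```hcl
    # State of a network project stored in an S3 bucket (hypothetical names).
    data "terraform_remote_state" "network_aws" {
      backend = "s3"
      config = {
        bucket = "example-terraform-states"
        key    = "network/terraform.tfstate"
        region = "eu-central-1"
      }
    }

    # State of another network project stored in an HCP Terraform workspace.
    data "terraform_remote_state" "network_oci" {
      backend = "remote"
      config = {
        organization = "my-org"
        workspaces = {
          name = "network-oci"
        }
      }
    }

    locals {
      # Combine outputs from both states; "vpc_cidr" and "vcn_cidr" are assumed output names.
      peered_cidrs = [
        data.terraform_remote_state.network_aws.outputs.vpc_cidr,
        data.terraform_remote_state.network_oci.outputs.vcn_cidr,
      ]
    }
    ```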

    Security Considerations for Exposing State Information

    As powerful as remote states are, they require careful security planning. State files can contain potentially sensitive information:

    1. Access Control: Not every team should have access to every remote state. Use the access controls of your backend provider (e.g., IAM policies for S3).
    2. Sensitive Data: State files may contain passwords and other secrets in plain text. Marking a value as sensitive = true only redacts it in Terraform's CLI output; it does not encrypt it in the state. Keep critical secrets in a dedicated secret store such as HashiCorp Vault.
    3. Encryption: Ensure that your state files are encrypted at rest and in transit.
    4. Output Control: Only expose the information that is truly needed through outputs. Each output creates a dependency and a potential security risk. Avoid passing secrets through outputs altogether: sensitive = true does not prevent anyone with read access to the state from retrieving the value.

    An especially effective pattern is the creation of dedicated "interface states" or "proxy states" that provide curated information for other teams. These do not contain actual resources but only data sources that retrieve relevant information from the main infrastructure state and expose it as filtered outputs. Tenants gain access only to this state with selected information, not to the original state, which may contain sensitive or tenant-specific data.
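
    Such an interface state could be sketched as follows. The project contains no resources; it reads the main network state and re-exports a single curated value. The workspace and output names are hypothetical, and the sketch assumes the interface project is stored in its own workspace to which tenants receive read access.

    ```hcl
    # Read the full network state; only the interface project needs this access.
    data "terraform_remote_state" "network_main" {
      backend = "remote"
      config = {
        organization = "my-org"
        workspaces = {
          name = "network-global"
        }
      }
    }

    # Re-export only the value that tenants are allowed to see.
    output "private_subnet_id" {
      description = "Curated subnet ID for tenant projects."
      value       = data.terraform_remote_state.network_main.outputs.private_subnet_id
    }
    ```

    Tenant projects then point their own terraform_remote_state data source at the interface workspace instead of at network-global.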

    With these fundamentals in place, we can now examine an architectural example: What does multi-tenancy look like in practice, and how can we use remote states to propagate information across organizational units?

    We will explore this topic in the next part of this series.