Infrastructure-as-Code is no longer optional. Companies that are serious about running and scaling their cloud infrastructure rely on Terraform. But with growing success and increasing complexity, a critical question arises: how large or small should a Terraform state actually be?
A state that is too large blocks teams, slows down processes, and creates unnecessary risk. A state that is too small, on the other hand, leads to unnecessary overhead and fragile consistency. The goal is to find the right balance - not too much, not too little, but just right. Welcome to the Goldilocks principle for Terraform.
The Goldilocks Principle in the IaC World
The so-called Goldilocks principle originates from an English fairy tale. The main character, Goldilocks, tries different options - too hot, too cold, and then just right - and finally chooses the middle one. This image translates perfectly to the world of Terraform.
Here too, we are looking for the ideal balance. The goal is a setup that is neither too granular nor too monolithic. Three key objectives are in focus:
- Maximum efficiency in the provisioning and management of infrastructure
- Avoidance of unnecessary redundancy, as often seen in overly large states
- Avoidance of incomplete states, which arise through excessive fragmentation
In practice, this means we need a structure that grants teams the necessary independence while also allowing for a consistent overview of the entire system.
The Challenge of Proper State Sizing
The question of the "right" size of a Terraform state arises sooner or later in every larger environment. There is no universal formula - what works depends heavily on the specific use case, and this is always unique to each customer's needs.
Still, after numerous projects for various clients, typical challenges have become clearly recognizable.
What Happens When States Become Too Large?
At first glance, large Terraform states seem attractive: everything in one place, easy to version, neatly organized... the data center at the push of a button - the dream of many managers and decision-makers - suddenly seems within reach through Terraform. I speak from experience here, because when I first started working with Terraform nearly a decade ago, I fell into this very trap. Yes, it looks impressive and makes a strong first impression, and many problems seem resolved. But this impression is misleading. Reality quickly paints a different picture.
Soon you encounter:
- Performance issues: terraform plan or apply takes forever, as large states consume a lot of compute time and network bandwidth. The largest case I’ve encountered was a terraform apply that attempted to provision not only networks and VMs, but also an entire Oracle database cluster - and in the best case, it finished after 75 minutes. That was the epitome of a beginner's mistake.
- Blocking locks: When multiple teams or individuals work on the same state, conflicts and delays arise. The result is usually frustration, vocal complaints, and lost productivity.
- Complex risk management: Errors in one module can impact wide areas of the infrastructure. This is especially critical in production environments. The more resources a Terraform module tries to implement, the greater the blast radius when something goes wrong. Even with Terraform, the old rule from operations still applies: the more complex something is, the faster and louder it breaks.
- Slower development cycles: Changes take longer, time-to-market increases, and continuous delivery becomes a challenge. Feedback loops should be as short as possible. Sixty minutes for a single attempt to roll out infrastructure is not short.
And When States Are Too Small?
The opposite extreme is just as problematic. Those who break infrastructure down into too many small states quickly end up with a different kind of mess:
- Administrative overhead: Dozens or hundreds of states must be maintained, versioned, and coordinated. This consumes time and creates complexity. In the worst case, complexity turns into complication, and then no one can keep track anymore. The last thing you want to recreate with Infrastructure-as-Code is an old problem from the handcrafted IT world - something so aged and historically grown that no one wants to touch it, and it gradually becomes a critical time bomb.
- Consistency issues: Dependencies between states - for example, VPCs, subnets, or security groups - become difficult to track. Sources of error multiply. You make a change in one corner of the infrastructure, and unexpectedly something breaks somewhere else. Not a pleasant scenario if you're trying to avoid late-night emergency patching of states.
- Code duplication: Reusable logic is no longer maintained centrally, but rewritten across many modules. This contradicts the DRY principle and causes the infrastructure’s state to drift from a well-defined condition into one that, although still defined, becomes increasingly obscure. Reverse-engineering infrastructure is something we had hoped to leave behind, but if everyone reinvents the wheel, that’s exactly where we’re headed over time.
- Fragmented resource management: Without a centralized view of the infrastructure, the overall understanding becomes difficult. Who changed what, where, and when? Often the only answer is a shrug.
Architecture Examples for Optimized Terraform States
The theory of the "right balance" is only as good as its practical implementation. That is why it's worth taking a closer look at proven architectural patterns that have established themselves in scalable Terraform setups. Two approaches have proven particularly effective: the layered approach for multi-account strategies, and a domain-based structure for microservice environments.
Example 1: Multi-Account Cloud Strategy with Layered Approach
A well-tested model is the division of states into functional layers. This principle is guided by the lifecycle and rate of change of infrastructure components. The result is a clearly structured state layout:
├── Foundation Layer
│   ├── network-state.tf            # VPCs, Subnets, Transit Gateways
│   ├── security-baseline-state.tf  # Security Groups, NACLs, IAM Baseline
│   └── monitoring-state.tf         # CloudWatch, Logging, Alarming
│
├── Platform Layer
│   ├── data-services-state.tf      # Databases, Queues, Storage
│   ├── kubernetes-state.tf         # EKS/OKE Cluster, Node Groups
│   └── ci-cd-state.tf              # CI/CD Infrastructure
│
└── Application Layer
    ├── app-team-a-state.tf         # Team A's Applications
    ├── app-team-b-state.tf         # Team B's Applications
    └── shared-services-state.tf    # Shared Application Services
What makes this approach so attractive?
This model matches the way engineers think. There are three layers:
- Foundation layer: Contains stable resources with a low rate of change. Networks, security baselines, and observability components are long-lived and subject more to infrastructure than business requirements. The change rate is low - no one regularly renames CIDR ranges or rewrites security policies.
- Platform layer: Components like databases or Kubernetes clusters change at a moderate pace. They are business-relevant but not subject to daily modifications. Classic change management is often still in place here, with release cycles of 3, 6, or 12 months depending on business and compliance requirements.
- Application layer: This part is highly dynamic. Feature development and product innovation lead to frequent deployments and testing, which in turn results in frequent infrastructure changes. In this layer, "on demand" is key - resources are provisioned only when truly needed, and then destroyed again to save costs. States here are small and isolated per team. If Team A breaks something, Team B is not affected.
This structure supports both a clear separation of responsibilities and efficient parallel work between teams. At the same time, it reduces the risk of unintended side effects from changes.
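In practice, the separation often comes down to nothing more than each layer state carrying its own backend configuration. A minimal sketch, assuming an S3 backend with DynamoDB locking; the bucket, key, and table names are illustrative:

# foundation/network/backend.tf
terraform {
  backend "s3" {
    bucket         = "acme-terraform-states"                  # illustrative bucket name
    key            = "foundation/network/terraform.tfstate"   # one key per state
    region         = "eu-central-1"
    dynamodb_table = "terraform-locks"
  }
}

# application/app-team-a/backend.tf
terraform {
  backend "s3" {
    bucket         = "acme-terraform-states"
    key            = "application/app-team-a/terraform.tfstate"
    region         = "eu-central-1"
    dynamodb_table = "terraform-locks"
  }
}

Because each state has its own key, a lock on Team A's application state never blocks a change in the foundation layer, and vice versa.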
Example 2: Domain-Driven Terraform for Microservice Environments
A very different, but equally powerful approach can be found in microservice-oriented organizations: domain-driven state partitioning. Here, Infrastructure-as-Code is organized along business domains - similar to how it is done in Domain-Driven Design.
├── Infrastructure Domain
│   ├── networking-state.tf      # Shared Networking
│   └── security-state.tf        # Security Controls
│
├── Service Domains
│   ├── user-service-domain.tf   # User Management Services + Infrastructure
│   ├── payment-domain.tf        # Payment Processing + Infrastructure
│   └── content-domain.tf        # Content Management + Infrastructure
│
└── Cross-cutting Concerns
    ├── monitoring-state.tf      # Observability Infrastructure
    ├── backup-state.tf          # Backup and Recovery Systems
    └── compliance-state.tf      # Compliance Controls
Why does this work well?
This approach aligns with how IT managers think.
- Alignment with business logic: The technical infrastructure reflects the organization of business domains. This simplifies communication between Dev, Ops, and management.
- Team autonomy: Each team manages its own domain, including the associated infrastructure. Dependence on central teams is minimized. Instead of separate silos (network, databases, storage, etc.) each touching the same service, the responsibilities are bundled per service. This increases agility and shortens time-to-market - both popular buzzwords in management circles.
- Less coordination needed: Infrastructure changes within a domain rarely affect other teams, significantly reducing coordination overhead. This is also popular with managers because it emphasizes accountability within value streams and Scrum teams.
A particularly significant advantage of this approach is the flexibility in team organization. New services or domains can be added without disrupting or reworking existing structures. Staff can be assigned to both core infrastructure teams and service teams (virtual teams). However, this flexibility also increases the risk of code duplication, which is why the Infrastructure Domain and Cross-Cutting Concerns layers are essential. They help avoid code conflicts and define clear guidelines and base modules that service teams must use. For this, introducing Policy-as-Code is indispensable, because policies not only need to be defined, but their enforcement at the Terraform level must be ensured.
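What "base modules that service teams must use" can look like in code is sketched below; the registry path, module name, and inputs are purely illustrative:

# payment-domain/networking.tf
module "service_network" {
  source  = "app.terraform.io/acme/service-network/oci"   # hypothetical base module from a private registry
  version = "~> 1.4"

  domain      = "payment"
  environment = var.environment
  cidr_block  = var.cidr_block
}

Policy-as-Code - for example Sentinel in Terraform Enterprise - can then enforce that service domains consume these base modules instead of defining the underlying resources themselves.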
Value Streams and Terraform State Management
The structuring of Terraform states should not be based solely on technical layers or modules. Instead, it is worth considering the value streams of an organization. Structuring states along these value streams means aligning strictly with what actually generates value. This applies to both infrastructure provisioning and the rollout of new features.
Two key value streams have proven particularly relevant in practice: Infrastructure Provisioning and Application Deployment.
Value Stream 1: Infrastructure Provisioning
When provisioning a complete environment - for example, for a new project or a new region - it is crucial that Terraform states are logically layered but exhibit as few interdependencies as possible. The goal is to provision large parts of the infrastructure independently of one another.
A proven model looks like this:
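One way to sketch this layering (the state names are illustrative):

├── Remote State Layer
│   └── global-state.tf      # IAM policies, secrets management, state backends
├── Network Layer
│   └── network-state.tf     # VPCs, subnets, transit gateways
├── Platform Layer
│   └── platform-state.tf    # Kubernetes clusters, databases, messaging
└── Application Layer
    └── app-state.tf         # Workloads, CI/CD configuration, application logic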
What’s behind it?
- Remote State Layer: Global configurations such as IAM policies, centralized secrets management, or state backends form the foundation.
- Network Layer: Regional resources like VPCs, subnets, or transit gateways are built on top of this.
- Platform Layer: Cloud-native services like Kubernetes clusters, databases, or messaging systems follow next.
- Application Layer: This layer includes workloads, CI/CD configurations, and application logic. It should be as independently deployable as possible.
This modular setup allows teams to quickly and safely bootstrap entire environments. This is especially useful for new customers, regions, or business units.
Value Stream 2: Application Deployment
A second, equally critical value stream is feature deployment. Here, the focus is on speed, stability, and minimal impact on other components. Ideally, changes to an application can be rolled out in isolation and without side effects on other states.
The target structure then looks like this:
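One way to sketch such a structure (directory and state names are illustrative):

├── dev/
│   ├── frontend-state.tf
│   └── backend-state.tf
├── test/
│   ├── frontend-state.tf
│   └── backend-state.tf
└── prod/
    ├── frontend-state.tf
    └── backend-state.tf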
What matters here?
- Environment separation: Development, testing, and production environments each have their own states. This allows new features to be validated in isolated test environments before going live. While it is desirable for environments to be identical and representative of each other, state separation is essential to prevent changes in test from accidentally affecting production.
- Minimal dependencies: States are intentionally kept lean to allow for changes with as little risk as possible. Resources like databases or networks are referenced via outputs, but not modified directly.
- High parallelism: Feature branches in Git lead to isolated changes in a few states. This allows the deployment team to respond quickly without running into conflicts with other teams.
The combination of value stream thinking and intentionally designed state structures creates a stable framework for efficient, secure, and scalable deployments. From the infrastructure foundation to the go-live of a new feature, this combination enables a high degree of flexibility while maintaining control over the infrastructure and the blast radius in the event of an error. A well-defined state of the entire data center becomes attainable.
The Practical Application of the Goldilocks Principle
The Goldilocks principle is more than just a pleasant metaphor - it is a practical tool for designing robust, maintainable Terraform modules. The key is to choose the right number of variables, configurations, and outputs. Not too many, not too few, but just right.
Picture the tension between the extremes as a spectrum: on the left lies under-dimensioning, on the right overengineering. In the middle sits the so-called Goldilocks zone - the range in which modules are flexible, stable, and easy to maintain.
When it’s too little...
An under-dimensioned module often seems appealing at first, because it appears simple. But this is deceptive. Typical signs include:
- Monoliths: A single, central state file that contains everything - from networking and security to workloads
- Incompleteness: Few or missing parameters, resulting in reliance on provider defaults and hardcoded values
- Static code: Limited reusability because everything is tailored to a single use case and the module is not fully dynamic
- Poor return values: Rigid output structures that cannot be easily integrated into other modules
- Missing guardrails: Input validation (variable validation blocks), lifecycle conditions, and runtime checks that validate computed or API-returned values are incomplete or absent, even though they are essential for reliable interaction with and between provisioned resources (see the sketch after this list)
Modules like these are difficult to adapt and tend to fall apart at the first demand for flexibility.
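To make the "missing guardrails" point concrete, here is a minimal sketch of a runtime check, assuming Terraform 1.5+ and the hashicorp/http provider; the health endpoint variable is a placeholder:

variable "app_health_url" {
  type        = string
  description = "Health endpoint of the provisioned application (placeholder)."
}

check "application_health" {
  # Scoped data source: evaluated during plan and apply; a failed assertion is reported as a warning.
  data "http" "health" {
    url = var.app_health_url
  }

  assert {
    condition     = data.http.health.status_code == 200
    error_message = "The application health endpoint did not return HTTP 200."
  }
}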
When it becomes too much...
On the other end of the spectrum lies the risk of overstructuring. Signs of this include:
- Micromanagement: A multitude of small modules and states that are difficult to oversee
- Excessive parameterization: Every detail is externally controlled, even trivial configurations. Defaults are either not meaningful or missing altogether
- Complex dependencies: Outputs reference each other excessively, leading to a true "output orgy"
- Data overload: Redundant or unnecessarily generic outputs that add little value, instead of a few clear outputs with referenceable maps (key/value pairs) as content
This creates unnecessary overhead, both in the code and cognitively. The learning curve for new team members increases, and the risk of errors grows.
The Goldilocks zone: Just right
The goal lies in the middle: Terraform modules that are modular, understandable, and team-compatible. Typical characteristics include:
- Modularized states with clear boundaries: States and modules are separated by layers, domains, or environments
- Sufficient but not excessive parameters: Modules use meaningful defaults that cover 80% of use cases (sketched below)
- Reusable modules: Full dynamization creates building blocks that can be used across teams thanks to thoughtful structure
- Consistent output structures: Outputs are logically named and easy to reference
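A brief sketch of what meaningful defaults and consistent outputs can look like; the resource referenced in the output is illustrative:

variable "log_retention_days" {
  type        = number
  default     = 30    # sensible default that covers the typical case
  description = "Retention period for application logs in days."

  validation {
    condition     = var.log_retention_days >= 1 && var.log_retention_days <= 3650
    error_message = "log_retention_days must be between 1 and 3650."
  }
}

output "subnet_ids" {
  description = "Map of subnet names to their IDs, easy to reference from consuming states."
  value       = { for name, subnet in oci_core_subnet.this : name => subnet.id }   # illustrative resource
}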
Those who find this middle ground lay the foundation for stable IaC architectures that scale with the company instead of working against it.
Best Practices for Scalable Terraform Implementations
Terraform is a powerful tool - but its true strength is only revealed when it is used systematically and strategically. Especially in scaling environments, certain core principles are crucial for building maintainable, robust, and efficient setups in the long term. Below, we outline four proven best practices that have stood the test of time in complex cloud landscapes.
1. State Segmentation by Change Velocity
Not all infrastructure elements change at the same frequency - and this should be reflected in the Terraform structure. Segmenting by change velocity helps to avoid conflicts and accelerate development cycles.
An example of such a structure in an AWS environment:
└── AWS Infrastructure
    ├── low-velocity/        # Rarely changed (e.g. networks, IAM)
    │   ├── network/
    │   ├── security/
    │   └── dns/
    │
    ├── medium-velocity/     # Occasional changes (e.g. databases)
    │   ├── rds/
    │   ├── elasticache/
    │   └── sqs/
    │
    └── high-velocity/       # Frequent changes (e.g. compute, workloads)
        ├── app-cluster-a/
        ├── app-cluster-b/
        └── batch-processing/
Why this works: Teams can work independently without blocking each other. Infrastructure elements that are rarely modified are decoupled from rapid deployments - increasing both stability and agility.
2. Dynamic Terraform Modules with Validated Variables
A well-structured module is more than just a container for resources. It should be designed to adapt flexibly to different contexts - without becoming unmanageable.
What this includes:
- Variable validation: Every variable should - where possible - be validated using validation blocks. This helps catch invalid values early.
- Type safety: Use precise type definitions (string, number, bool, list(object), etc.) to prevent misunderstandings.
- Optional variables with null support: Many variables are optional - but must be explicitly null-able so they can be used cleanly in conditional expressions.
An example:
variable "mounts" { type = map(object({ audit_non_hmac_request_keys = optional(list(string)) audit_non_hmac_response_keys = optional(list(string)) allowed_managed_keys = optional(set(string)) default_lease_ttl_seconds = optional(number) description = optional(string) external_entropy_access = optional(bool) identity_token_key = optional(string) listing_visibility = optional(string) local = optional(bool) max_lease_ttl_seconds = optional(number) namespace = optional(string) options = optional(map(any)) passthrough_request_headers = optional(list(string)) plugin_version = optional(string) seal_wrap = optional(bool) allowed_response_headers = optional(list(string)) delegated_auth_accessors = optional(list(string)) path = string type = string })) default = null description = <<-EOT Defines the configurations for Vault mounts. Each mount configuration should specify the following keys:
[...]
  EOT

  validation {
    condition = (
      var.mounts == null ? true : alltrue([
        for mount in var.mounts : (
          mount.listing_visibility == null ||
          mount.listing_visibility == "unauth" ||
          mount.listing_visibility == "hidden"
        )
      ])
    )
    error_message = "The 'listing_visibility' value must be either 'unauth' or 'hidden' if specified."
  }

  validation {
    condition = (
      var.mounts == null ? true : alltrue([
        for mount in var.mounts : (
          mount.path == null || (
            !startswith(mount.path, "/") &&
            !endswith(mount.path, "/")
          )
        )
      ])
    )
    error_message = "The 'path' value must not start or end with a '/'."
  }
}
Modules like this can be used across teams and significantly reduce support effort.
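Consumed by a team, a call to such a module could look like this; the registry path and values are illustrative:

module "vault_mounts" {
  source  = "app.terraform.io/acme/mounts/vault"   # hypothetical private registry path
  version = "~> 3.0"

  mounts = {
    team_a_kv = {
      path               = "team-a/kv"
      type               = "kv-v2"
      description        = "Key/value secrets for Team A"
      listing_visibility = "hidden"
    }
  }
}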
3. Cross-State Dependencies via Remote State
In complex environments, dependencies between resources can rarely be avoided entirely. Instead of modeling them directly, it's better to use remote state outputs. The core principle is simple: one state produces relevant outputs, another consumes them via the terraform_remote_state data source.
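A minimal sketch of the pattern; bucket, key, and resource names are illustrative:

# Producing state (e.g. foundation/network): expose what others need
output "private_subnet_ids" {
  value = module.network.private_subnet_ids
}

# Consuming state (e.g. application/app-team-a): read it, never modify it
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-states"
    key    = "foundation/network/terraform.tfstate"
    region = "eu-central-1"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
}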
For a deep dive, refer to the first part of our “Terraform @ Scale” article series, where we explored this in depth across several subchapters.
4. The HashiCorp Tool Stack as an Enabler for State Strategies
Scalable Terraform implementations rely on more than just good code. They are built on an ecosystem that supports security, orchestration, and reusability. At ICT.technology, we consistently rely on the HashiCorp stack to meet these requirements. The most important components here include:
- Terraform / Terraform Enterprise
  Centralized state management, policy-as-code (e.g. with Sentinel), and workspace isolation for teams
- Vault / Vault Enterprise
  Secure secrets management, dynamic credentials for cloud access, rotation, and audit trails
- Consul / Consul Enterprise
  Service discovery, health checks, dynamic configuration, and integrations with platform services
- Nomad / Nomad Enterprise
  Orchestration for containers, VMs, and bare metal - with a smaller footprint than Kubernetes
- Packer
  Creation of standardized images (golden images) with versioning, idempotency, and CI/CD integration
- Boundary / Boundary Enterprise (only if you already run PostgreSQL clusters at an enterprise level with corresponding SLAs and OLAs)
  Secure, identity-based access to infrastructure components - especially relevant for production databases or regulated workloads
These tools enable a consistent, auditable, and secure Terraform practice - even in heterogeneous, highly dynamic enterprise environments.
Practical Guide to State Sizing
The optimal size of a Terraform state does not come from a template, but from a structured process that aligns technical realities with organizational requirements. Below, we outline a practical, proven guide that helps in planning and implementation.
1. Analyze the Organizational Structure
The first step is to understand your organization in detail:
- Who manages which resources? Are there centralized infrastructure teams, or do application teams operate independently?
- Where is autonomy required? Teams with high deployment frequency should have their own states to avoid mutual blocking.
- What do the lifecycles look like? Resources that are developed or removed together should ideally reside in the same state.
2. Dependency Analysis
The second step involves identifying technical dependencies between resources:
- Which resources are tightly coupled? For example, Security Groups and EC2 instances, or Subnets and Load Balancers.
- What can be decoupled? Monitoring, IAM roles, or backup systems can often be managed separately - reducing complexity.
3. Trial Run and Iteration
Before launching a large-scale reorganization, a focused test is recommended:
- Pilot projects with variable state structures: For example, a team experimenting with both monolithic and segmented approaches.
- Measurable criteria: Performance (Plan/Apply), error rate, team satisfaction, release velocity.
Case Study: Cloud Migration of an Enterprise Client
A compelling example comes from a financial services provider that ICT.technology supported in transitioning from an on-premise infrastructure to Oracle Cloud Infrastructure (OCI).
Initial situation:
The Terraform environment consisted of a single, large state. The result:
- Almost 45 minutes runtime per terraform apply
- Frequent conflicts due to state locking
- Slow development and release cycles
Solution approach:
We applied the Goldilocks principle and segmented the infrastructure as follows (with tenancy in Terraform Enterprise used to assign teams and services under each segment according to individual needs):
├── Foundation
│   ├── network/
│   │   ├── transit/
│   │   ├── network/
│   │   └── interconnect/
│   └── security/
│       ├── iam/
│       ├── security-groups/
│       └── baseline/
├── Data Services
│   ├── databases/
│   ├── storage/
│   └── analytics/
└── Applications
    ├── frontend/
    ├── backend/
    ├── batch/
    └── monitoring/
Results:
- Apply time reduced to under 5 minutes, so infrastructure changes could be rolled out much faster
- Parallel teamwork became possible, as teams could deploy independently without blocking one another
- 70% faster releases due to shorter feedback cycles and less friction
- Greater stability due to smaller, clearly defined changes and a minimized blast radius in case of misconfigurations or human error in operations
- Massive cost savings through dynamic and demand-driven provisioning of bare metal servers and VMs - deployments only happened when actually needed, followed by automated teardown
Conclusion: The Path to Optimal Terraform Scaling
Sizing Terraform states is not a one-time architectural task. It is a continuous maturity process. The Goldilocks principle helps avoid extreme structures: your codebase should be neither too monolithic nor too fragmented.
The key is to understand your organization, separate responsibilities cleanly, and implement technical solutions based on real workflows.
ICT.technology guides organizations along this path - from initial cloud adoption to full Infrastructure-as-Code excellence. With deep expertise in Terraform, the HashiCorp ecosystem, and cloud architectures, we build solutions that are technically robust and organizationally sustainable.
Well-structured Terraform states are not an end in themselves. They are a key lever for scalability, speed, and operational stability - and therefore a critical success factor for any modern IT organization.