Sägetstrasse 18, 3123 Belp, Switzerland +41 79 173 36 84 info@ict.technology

      Terraform @ Scale - Part 1d: Pitfalls and Best Practices in Multi-Tenant Environments

      Remote states are a powerful tool for controlled information sharing across teams and tenants. Especially in complex cloud environments with multiple areas of responsibility, they enable transparency, reusability and scalability. At the same time, they pose risks: faulty states, access issues and unresolved dependencies can compromise the stability of the entire infrastructure. This article demonstrates how to avoid these challenges and how to lay the foundation for reliable, automated infrastructure through clear structures and proven practices.

      State Locking in Multi-Tenant Environments

      In an environment where multiple teams work simultaneously on different parts of the infrastructure, state locking is indispensable. Without locking, it is possible for multiple Terraform operations to try updating the same state in parallel, overwriting each other in the process. This almost always ends in disaster. Instead, a mechanism is essential that reserves a state file exclusively for a single user until they have completed their write operations and releases it again for other accessing processes.

      Best Practices for State Locking

      • Choose a backend with a robust locking mechanism:

        As explained in the previous part of this article series, apart from using a local file in a single-user environment, only Consul as a remote backend for state files, as well as Terraform Cloud and Enterprise, are supported and officially certified. Terraform also ships other backends such as http or S3, but these are not covered by support agreements, their use is explicitly at your own risk and responsibility, and many of them do not support robust, error-free locking. In contrast, Consul and Terraform Enterprise offer reliable enterprise-grade locking mechanisms.

        When using Consul, Terraform automatically creates a session to protect the state during the plan and apply process.


        terraform {
          backend "consul" {
            address = "consul.example.com:8500"
            path    = "terraform/customer-a"
            lock    = true  # Explicitly enables locking
          }
        }

        This locking feature therefore requires no manual setup of sessions or other mechanisms - Terraform takes care of this automatically.

      • Identify and resolve orphaned locks:

        Orphaned locks typically occur when Terraform processes terminate unexpectedly (e.g. due to CI/CD interruptions, network outages or user cancellations).

        To automatically clean up orphaned locks, you should implement a preventive cleanup process. In practice, a simple check before each Terraform run is sufficient:


        consul kv delete "terraform/customer-a/.lock" || true
        consul kv delete "terraform/customer-a/.lockinfo" || true


        These commands remove any leftover lock entries (the consul backend stores its lock and the associated lock metadata under the state path) before a new Terraform process starts - efficient and without unnecessary administrative overhead.

      • Configure automatic lock timeouts: Consul sessions expire after their TTL if they are not renewed, and in the agent configuration (consul.hcl) you can enforce a lower bound for that TTL. This ensures that stale locks are automatically removed.


        A recommended lower bound for production environments is:


        session_ttl_min = "15m"

        Sessions whose TTL elapses without renewal are invalidated and release their locks, which prevents orphaned locks from blocking resources over the long term.

      • Account for lock conflicts in CI/CD pipelines:

        In CI/CD environments, conflicts can arise if multiple pipelines attempt to acquire the same lock simultaneously.

        A proven pattern here is a retry mechanism with a backoff strategy:


        for i in {1..5}; do
          terraform apply && break
          echo "Lock conflict detected. Retrying in $((i * 10)) seconds..."
          sleep $((i * 10))
        done

        This mechanism ensures that your deployment remains stable even in the case of short-term lock contention. An explanation of what the script does:

        1. It loops five times: (for i in {1..5})
        2. On each iteration, it attempts to run terraform apply
        3. If terraform apply succeeds (&&), it exits the loop using break
        4. If terraform apply fails, it waits a defined period (sleep) before the next attempt
        5. The wait time increases with each attempt ($((i * 10))), i.e. 10, 20, 30, 40 and 50 seconds


        This way, the deployment using Terraform is automatically retried in the event of temporary issues such as network errors, API limits or resource conflicts that might cause a single attempt to fail.
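        If you use this pattern in several pipelines, it is worth extracting it into a small helper. The following sketch generalizes the loop above into a reusable function; the function name and the BACKOFF_BASE variable are illustrative conventions, not established ones:

```shell
#!/bin/sh
# Retry a command up to 5 times with a linearly growing backoff.
# BACKOFF_BASE (seconds, default 10) controls the wait: i * BACKOFF_BASE.
retry_with_backoff() {
  for i in 1 2 3 4 5; do
    "$@" && return 0
    if [ "$i" -lt 5 ]; then
      echo "Attempt $i failed. Retrying in $((i * ${BACKOFF_BASE:-10})) seconds..."
      sleep $((i * ${BACKOFF_BASE:-10}))
    fi
  done
  return 1
}

# Typical usage in a pipeline:
#   retry_with_backoff terraform apply -auto-approve || exit 1
```

        Unlike the inline loop, the function also reports failure via its exit status after the last attempt, so the pipeline step fails visibly instead of continuing silently.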

        A robust locking system reduces the risk of errors and prevents Terraform runs from blocking each other. With the measures described, you can reliably safeguard your environment against orphaned locks and unnecessary delays.

      Robust versioning is essential. In case of problems, it allows you to easily revert to previous versions. Terraform Cloud and Enterprise offer this feature by default, whereas with Consul, you need to implement additional backup mechanisms (snapshots).
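      With Consul, such a backup mechanism can be as simple as a scheduled snapshot job. A minimal sketch, assuming a reachable local Consul agent; the backup directory is an assumption and should be adjusted to your environment:

```shell
#!/bin/sh
# Write a timestamped snapshot of the Consul datastore, which includes all
# Terraform states stored in its KV store. BACKUP_DIR is an assumption -
# adjust it to your environment.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/consul}"
SNAPSHOT="$BACKUP_DIR/terraform-states-$(date +%Y%m%d-%H%M%S).snap"
# Only attempt the snapshot when the consul CLI is available on this host.
if command -v consul >/dev/null 2>&1; then
  consul snapshot save "$SNAPSHOT"
fi
```

      Restoring works analogously with consul snapshot restore.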

      Handling State Dependencies and Avoiding Cycles

      Dependencies between states can quickly become complex and, in the worst case, lead to circular dependencies that make infrastructure updates impossible.

      Since remote state dependencies in terraform_remote_state are static, dependent configurations must be updated manually whenever relevant outputs change.
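      For reference, such a cross-state read against the Consul backend from the earlier locking example looks like this; the state path and the output name are illustrative:

```hcl
data "terraform_remote_state" "network" {
  backend = "consul"
  config = {
    address = "consul.example.com:8500"
    path    = "terraform/network"
  }
}

# The outputs are read once per plan/apply - they do not update
# automatically when the upstream state changes afterwards.
locals {
  vcn_id = data.terraform_remote_state.network.outputs.vcn_id
}
```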

      Best practices to avoid cycles:

      • Hierarchical dependency structure: Global resources → Regional resources → Tenant-specific resources → Application-specific resources. This clear one-way dependency prevents cycles.
      • Use indirect references: If a direct reference causes cycles, route the information through an intermediate level:
        # Instead of direct reference from A → C and C → A:
        # A → B → C (with B as intermediary)
        # In configuration B:
        output "information_from_a" {
          value = data.terraform_remote_state.a.outputs.needed_value
        }

        and then


        # In configuration C:
        data "terraform_remote_state" "b" {
          # ...
        }

        locals {
          value_from_a = data.terraform_remote_state.b.outputs.information_from_a
        }

      • Prefer data sources over remote state: If possible, use native data sources. These are more dynamic and reduce dependencies between states.

        # Instead of:
        data "terraform_remote_state" "network" {
          # ...
        }
        
        # Preferably, if possible:
        data "oci_core_subnet" "app_subnet" {
          subnet_id = "ocid1.subnet.oc1..."
        }

        However, use data sources selectively and avoid placing them inside for_each or other loops, as each instance triggers an API call on every run and thus does not scale well. For the same reason, data sources inside foundational modules are often a bad idea - and since modules called by your root modules should not access state files either, place your data sources at the root module level as well.
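        In practice this means performing the lookup once at the root and passing the result into child modules as a plain variable; the module name and variable here are illustrative:

```hcl
# Root module: a single lookup, no data sources inside the called module.
data "oci_core_subnet" "app_subnet" {
  subnet_id = "ocid1.subnet.oc1..."
}

module "app" {
  source    = "./modules/app"
  subnet_id = data.oci_core_subnet.app_subnet.id
}
```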

      Minimizing State Information to Improve Performance

      The larger your state file becomes, the longer each terraform plan takes. Terraform reads the entire state, and in complex environments this can significantly slow down operations.

      Best practices for state optimization:

      • Granular state separation: A good rule of thumb: one state should include no more than 100 to 250 resources to ensure acceptable planning times. Apply the Goldilocks Principle (we will cover this in more detail in a later article in this series).
      • Be cautious with complex outputs: Limit outputs to what is essential and necessary:

        # Avoid:
        output "entire_vcn" {
          value = oci_core_vcn.main
        }
        
        # Better:
        output "vcn_essential_info" {
          value = {
            id         = oci_core_vcn.main.id
            cidr_block = oci_core_vcn.main.cidr_block
          }
        }

      • Avoid sensitive outputs when not necessary: They increase the size of the state, and Terraform stores additional metadata for them. Sensitive values also propagate their sensitivity to everything that references them, which restricts how they can be displayed and used downstream, so in most cases they can simply be omitted.
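      Where a sensitive value genuinely has to be exported, mark it explicitly so Terraform redacts it in the CLI output; the resource and names below are illustrative:

```hcl
output "db_admin_password" {
  value     = random_password.db_admin.result
  sensitive = true
}
```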

      Debugging Complex State Dependencies

      Debugging remote state issues can be tricky. Here are a few tips:

      • Enable detailed logs:

        export TF_LOG=DEBUG
        export TF_LOG_PATH=./terraform.log

      • Include state validation in CI/CD:

        terraform state pull | jq '.outputs.network_config.value | has("vcn_id")'

      • Use terraform console:

        $ terraform console
        > data.terraform_remote_state.network.outputs.subnet_ids
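      The state validation check above only prints true or false; in a pipeline you usually want a missing output to fail the job. A small sketch that turns the check into a hard gate - the output names are the ones from the example check, and the helper name is an assumption:

```shell
#!/bin/sh
# Fail unless the state's network_config output contains vcn_id.
# Reads the state JSON on stdin so it can be fed by "terraform state pull".
validate_state() {
  jq -e '.outputs.network_config.value | has("vcn_id")' >/dev/null
}

# Typical CI usage:
#   terraform state pull | validate_state \
#     || { echo "state validation failed: vcn_id missing" >&2; exit 1; }
```

      The -e flag makes jq's exit status reflect the boolean result, so the shell can branch on it directly.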

      Provider Aliasing for Complex Scenarios

      When working with multiple accounts (or multiple regions within a single account), provider aliasing greatly simplifies configuration:


      provider "oci" {
        alias  = "global"
        region = "eu-frankfurt-1"
      }
      
      provider "oci" {
        alias  = "customer_a"
        region = "eu-amsterdam-1"
      }
      
      module "customer_a_instance" {
        source    = "./modules/instance"
        # Provider configurations are passed to modules via the providers map
        providers = {
          oci = oci.customer_a
        }
        subnet_id = module.global.subnet_id
      }

      Conclusion

      Mastering these best practices will help you run your multi-tenant infrastructure in a stable and efficient manner. As you gain experience, you will notice that these patterns not only prevent problems, they also improve collaboration between teams and enhance the overall quality of your infrastructure.