Terraform @ Scale - Part 5b: API Gateways

In the previous article, 5a, we saw how quickly large Terraform rollouts run into API limits, for example when DR tests create hundreds of resources in parallel and 429 errors trigger an avalanche of retries. This follow-up picks up from there and shows how to use the Oracle Cloud Infrastructure API Gateway and Amazon API Gateway to manage limits deliberately, observe them cleanly, and make them operationally robust with "Policy as Code".

API Gateway: The Ultima Ratio?

API gateways help make API limits predictable. Used correctly, they bundle API calls, enforce quotas and throttling, deliver consistent observability data, and provide a central point for operations and governance.

For us, one point matters most: a gateway does not merely move the rate-limit problem elsewhere, it lets you actively control it per team, per deployment, and per route.

On Oracle Cloud Infrastructure, you establish technical guardrails with usage plans and entitlements. These apply directly to API Gateway deployments, for example a hard rate limit per second plus quotas per minute or per month. For enforcement and transparency, service-specific metrics such as HttpResponses with the dimensions deploymentId and httpStatusCode are available and lend themselves well to alarms. (Oracle Documentation)

The service log categories access and execution are the service's intended logging channels; they are attached directly to the API deployment and are preferable to legacy log archiving in buckets. (Oracle Documentation)

Here is an example for OCI (an AWS example follows further below):

# Terraform >= 1.10, OCI provider >= 7.14.0
terraform {
  required_version = ">= 1.10"
  required_providers {
    oci = { source = "oracle/oci", version = ">= 7.14.0" }
  }
}

provider "oci" {
  region = var.region
}

variable "region" {
  type        = string
  description = "OCI region, e.g., eu-frankfurt-1"
  validation {
    condition     = can(regex("^[a-z]+-[a-z0-9]+-[0-9]+$", var.region))
    error_message = "Region must match a pattern like 'eu-frankfurt-1'."
  }
}

variable "compartment_id" {
  type        = string
  description = "Compartment OCID used for gateway, logs, and alarms"
}

# Optional: Many organizations manage the API deployment separately.
# We intentionally reference it via a variable to keep the example focused.
variable "api_deployment_id" {
  type        = string
  description = "OCID of the API Gateway deployment"
  validation {
    condition     = can(regex("^ocid1\\..+", var.api_deployment_id))
    error_message = "api_deployment_id must be a valid OCID."
  }
}

# Enable service logs for 'access' and 'execution'
resource "oci_logging_log_group" "apigw" {
  compartment_id = var.compartment_id
  display_name   = "apigw-logs"
}

resource "oci_logging_log" "apigw_access" {
  log_group_id = oci_logging_log_group.apigw.id
  display_name = "apigateway-access"
  log_type     = "SERVICE"
  is_enabled   = true

  configuration {
    source {
      category    = "access"
      resource    = var.api_deployment_id
      service     = "apigateway"
      source_type = "OCISERVICE"
    }
  }
}

resource "oci_logging_log" "apigw_execution" {
  log_group_id = oci_logging_log_group.apigw.id
  display_name = "apigateway-execution"
  log_type     = "SERVICE"
  is_enabled   = true

  configuration {
    source {
      category    = "execution"
      resource    = var.api_deployment_id
      service     = "apigateway"
      source_type = "OCISERVICE"
    }
  }
}

# Usage plan with rate limit & minute quota
resource "oci_apigateway_usage_plan" "team_plan" {
  compartment_id = var.compartment_id
  display_name   = "team-standard-plan"

  entitlements {
    name        = "default"
    description = "Standard quota for CI runs"

    rate_limit {
      unit  = "SECOND"
      value = 50
    }

    quota {
      unit                = "MINUTE"
      value               = 2000
      reset_policy        = "CALENDAR"
      operation_on_breach = "REJECT"
    }

    targets {
      deployment_id = var.api_deployment_id
    }
  }

  lifecycle {
    prevent_destroy = true
  }
}
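
A usage plan only takes effect once API clients are registered as subscribers to it. As a minimal sketch, assuming the oci_apigateway_subscriber resource of the OCI provider and a hypothetical variable ci_client_token holding the token the CI pipeline presents, a subscriber for the plan above might look like this:

```hcl
# Hypothetical variable: client token that the CI pipeline sends to the gateway
variable "ci_client_token" {
  type        = string
  sensitive   = true
  description = "Token identifying the CI client towards the API gateway"
}

# Registers one client ("ci-runs") and subscribes it to the usage plan above
resource "oci_apigateway_subscriber" "ci_team" {
  compartment_id = var.compartment_id
  display_name   = "ci-team"

  clients {
    name  = "ci-runs"
    token = var.ci_client_token
  }

  usage_plans = [oci_apigateway_usage_plan.team_plan.id]
}
```

The names ci_team and ci-runs are illustrative. The API deployment itself also has to tell the gateway where to look for the client token in incoming requests; check the provider documentation for the exact deployment settings.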

With Amazon API Gateway you combine three levers: stage- and method-level throttling, usage plans with API keys, and rate-based rules in AWS WAF that aggregate by source IP. The CloudWatch metrics 4XXError and 5XXError provide a robust early-warning system at stage level.

Important: AWS WAFv2 can currently be associated only with REST API stages, not with HTTP APIs. (AWS Documentation, Terraform Registry)

# Amazon API Gateway (REST) – stage throttling, usage plan, WAF
terraform {
  required_version = ">= 1.10"
  required_providers {
    aws = { source = "hashicorp/aws", version = ">= 5.0" }
  }
}

provider "aws" {
  region = var.aws_region
}

variable "aws_region" {
  type        = string
  description = "AWS region, e.g., eu-central-1"
}

data "aws_region" "current" {}

resource "aws_api_gateway_rest_api" "tf_api" {
  name = "terraform-at-scale"
}

resource "aws_api_gateway_resource" "status" {
  rest_api_id = aws_api_gateway_rest_api.tf_api.id
  parent_id   = aws_api_gateway_rest_api.tf_api.root_resource_id
  path_part   = "status"
}

resource "aws_api_gateway_method" "get_status" {
  rest_api_id   = aws_api_gateway_rest_api.tf_api.id
  resource_id   = aws_api_gateway_resource.status.id
  http_method   = "GET"
  authorization = "NONE"
}

resource "aws_api_gateway_integration" "get_status_mock" {
  rest_api_id = aws_api_gateway_rest_api.tf_api.id
  resource_id = aws_api_gateway_resource.status.id
  http_method = aws_api_gateway_method.get_status.http_method
  type        = "MOCK"
}

resource "aws_api_gateway_deployment" "this" {
  rest_api_id = aws_api_gateway_rest_api.tf_api.id
  depends_on  = [aws_api_gateway_integration.get_status_mock]
}

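resource "aws_api_gateway_stage" "prod" {
  rest_api_id   = aws_api_gateway_rest_api.tf_api.id
  deployment_id = aws_api_gateway_deployment.this.id
  stage_name    = "prod"
}

# Stage-wide method settings live in their own resource; "*/*" applies to all methods.
# Note: logging_level requires a CloudWatch Logs role configured for API Gateway in the account.
resource "aws_api_gateway_method_settings" "all" {
  rest_api_id = aws_api_gateway_rest_api.tf_api.id
  stage_name  = aws_api_gateway_stage.prod.stage_name
  method_path = "*/*"

  settings {
    metrics_enabled        = true
    logging_level          = "INFO"
    data_trace_enabled     = false
    throttling_burst_limit = 100
    throttling_rate_limit  = 50
  }
}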

resource "aws_api_gateway_usage_plan" "plan" {
  name = "team-standard-plan"

  api_stages {
    api_id = aws_api_gateway_rest_api.tf_api.id
    stage  = aws_api_gateway_stage.prod.stage_name
  }

  throttle_settings {
    burst_limit = 100
    rate_limit  = 50
  }

  quota_settings {
    limit  = 2000
    period = "DAY"
  }
}

resource "aws_api_gateway_api_key" "ci_key" {
  name    = "ci-runs"
  enabled = true
  # If 'value' is omitted, the service generates a secure key automatically.
}

resource "aws_api_gateway_usage_plan_key" "ci_key_bind" {
  key_id        = aws_api_gateway_api_key.ci_key.id
  key_type      = "API_KEY"
  usage_plan_id = aws_api_gateway_usage_plan.plan.id
}

# WAFv2 rate-based rule (REGIONAL) – only for REST API stages, not HTTP APIs
resource "aws_wafv2_web_acl" "apigw_waf" {
  name        = "apigw-waf"
  description = "Rate limit per source IP"
  scope       = "REGIONAL"

  default_action { allow {} }

  rule {
    name     = "rate-limit"
    priority = 1
    action { block {} }

    statement {
      rate_based_statement {
        limit              = 500
        aggregate_key_type = "IP"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "apigw-waf"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "apigw-waf"
    sampled_requests_enabled   = true
  }
}

resource "aws_wafv2_web_acl_association" "stage_assoc" {
  # The stage exports its ARN, so there is no need to assemble it by hand
  resource_arn = aws_api_gateway_stage.prod.arn
  web_acl_arn  = aws_wafv2_web_acl.apigw_waf.arn
}

Stage-wide throttles, usage plans, and the WAF association are the load-bearing building blocks on the AWS side. CloudWatch offers, among others, 4XXError metrics with the dimensions ApiName and Stage, which simplifies alarming per stage. (AWS Documentation)
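
Method-level throttling, the second half of the first lever, can override the stage-wide defaults for a single route. The following is a minimal sketch that assumes the REST API, resource, method, and stage defined in the example above; the resource name status_get and the lower limits are illustrative assumptions, not recommendations:

```hcl
# Tighter throttle for GET /status only; all other methods keep the stage-wide defaults.
# method_path has the form "<resource path without leading slash>/<HTTP method>".
resource "aws_api_gateway_method_settings" "status_get" {
  rest_api_id = aws_api_gateway_rest_api.tf_api.id
  stage_name  = aws_api_gateway_stage.prod.stage_name
  method_path = "${trimprefix(aws_api_gateway_resource.status.path, "/")}/${aws_api_gateway_method.get_status.http_method}"

  settings {
    throttling_burst_limit = 20 # illustrative value
    throttling_rate_limit  = 10 # illustrative value
  }
}
```

Deriving method_path from the existing resources keeps the override in sync if the path or HTTP method is renamed later.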

Testing and Validation with Terraform 1.10+

For fast, reproducible safety nets, Terraform's native testing framework is a good fit. Mock providers encapsulate external dependencies, and assertions check project-specific rules such as maximum batch sizes or the behavior when limits are set too low.

Pro tip: Deliberately write short, expressive tests that harden your modules against misconfiguration. (HashiCorp Developer)

# tests/api_limits.tftest.hcl

variables {
  # Global default variables for all runs in this test file
  max_batch_size = 50
}

# Example: the plan must never try to create more than 50 new resources.
# Assumption: the module under test exposes a (hypothetical) output
# 'planned_resource_count'. Test assertions can reference module objects such
# as variables, outputs, and resources, but not the raw plan.
run "enforce_small_batches" {
  command = plan

  assert {
    condition     = output.planned_resource_count <= var.max_batch_size
    error_message = "Too many new resources in a single run: split the deployment into smaller batches."
  }
}

# Example: we expect a named custom condition to fail.
# Assumption: the module defines a check block 'api_limits_reasonable' and the
# variables of this run are chosen so that the check fails.
run "expect_check_failure" {
  command = plan

  expect_failures = [
    check.api_limits_reasonable,
  ]
}

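Notes from practice:

Keep each assertion condition to a single expression; HCL lets it span several lines only inside parentheses or brackets,
expect_failures refers to checkable objects (input variables, outputs, check blocks, and resources or data sources with custom conditions), not to general type errors.
As of today (Terraform 1.12.0), ephemeral resources are mainly useful for short-lived tokens and lookups, but not as a universal replacement for mocks.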
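
To make the expect_failures note concrete, here is a minimal sketch of a checkable object inside a module. It assumes a check block named api_limits_reasonable and two hypothetical input variables; the names and defaults are illustrative and not taken from any real module:

```hcl
# Hypothetical inputs: names and defaults are illustrative only
variable "rate_limit_per_second" {
  type    = number
  default = 50
}

variable "quota_per_minute" {
  type    = number
  default = 2000
}

# Fails when the minute quota could never be reached at the configured rate,
# or when the rate is set below one request per second.
check "api_limits_reasonable" {
  assert {
    condition     = var.rate_limit_per_second >= 1 && var.quota_per_minute <= var.rate_limit_per_second * 60
    error_message = "Rate must be >= 1 per second and the minute quota must not exceed rate * 60."
  }
}
```

A run block in a test file could then set rate_limit_per_second to a deliberately low value and list check.api_limits_reasonable under expect_failures to prove that the guard actually triggers.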

Monitoring + Alerting

Observability is the operational backbone of your API limit strategy.

On OCI, the most reliable approach is to work directly with the API Gateway service metrics in combination with alarms from the Monitoring service. The dimensions deploymentId and httpStatusCode allow unambiguous filtering on 429 responses. The MQL syntax is shown below; pay attention to the correct dimension names: (Oracle Documentation)

# OCI: Alarm on sustained HTTP 429 responses at deployment level
resource "oci_ons_notification_topic" "ops" {
  compartment_id = var.compartment_id
  name           = "ops-alerts"
}

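variable "alert_email" {
  type        = string
  description = "E-mail address that receives the alarm notifications"
}

resource "oci_ons_subscription" "ops_mail" {
  compartment_id = var.compartment_id
  topic_id       = oci_ons_notification_topic.ops.id
  protocol       = "EMAIL"
  endpoint       = var.alert_email
}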

resource "oci_monitoring_alarm" "apigw_429" {
  compartment_id        = var.compartment_id
  metric_compartment_id = var.compartment_id
  namespace             = "oci_apigateway" # metric namespace of the API Gateway service
  display_name          = "APIGW 429 bursts"
  is_enabled            = true
  severity              = "CRITICAL"
  destinations          = [oci_ons_notification_topic.ops.id]
  message_format        = "ONS_OPTIMIZED"
  pending_duration      = "PT1M" # 1 minute
  resolution            = "1m"

  # Correct dimensions according to API Gateway metrics: deploymentId, httpStatusCode
  query = "HttpResponses[1m]{deploymentId = \"${var.api_deployment_id}\", httpStatusCode = \"429\"}.sum() > 5"

  body = "Increased rate of HTTP 429 on API Gateway deployment: {{triggerValue}}/min"
}

On AWS, you define simple, dependable alarms on 4XXError and 5XXError, complemented by stage-wide throttling. In practice, alarms on 4XXError fire early and broadly, while WAF rate limits absorb load spikes as they occur. (AWS Documentation)

# AWS: CloudWatch alarm on 4XX errors (stage-wide)
resource "aws_cloudwatch_metric_alarm" "api_4xx_spike" {
  alarm_name          = "apigw-prod-4xx-spike"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  period              = 60
  statistic           = "Sum"
  threshold           = 50
  namespace           = "AWS/ApiGateway"
  metric_name         = "4XXError"

  dimensions = {
    ApiName = aws_api_gateway_rest_api.tf_api.name
    Stage   = aws_api_gateway_stage.prod.stage_name
  }

  alarm_description = "Elevated client errors on 'prod' stage"
}
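
The text mentions 5XXError as the second signal, so a matching alarm is sketched here. It assumes the REST API and stage from the earlier AWS example; the SNS topic name, the threshold, and the three evaluation periods are illustrative assumptions:

```hcl
# Hypothetical notification target for operations alerts
resource "aws_sns_topic" "ops" {
  name = "ops-alerts"
}

# Fires when server-side errors stay above the threshold for three consecutive minutes
resource "aws_cloudwatch_metric_alarm" "api_5xx_spike" {
  alarm_name          = "apigw-prod-5xx-spike"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  period              = 60
  statistic           = "Sum"
  threshold           = 5
  namespace           = "AWS/ApiGateway"
  metric_name         = "5XXError"
  treat_missing_data  = "notBreaching" # no traffic means no data points, which should not alarm

  dimensions = {
    ApiName = aws_api_gateway_rest_api.tf_api.name
    Stage   = aws_api_gateway_stage.prod.stage_name
  }

  alarm_description = "Sustained server errors on 'prod' stage"
  alarm_actions     = [aws_sns_topic.ops.arn]
}
```

Server errors are usually rarer and more serious than client errors, which is why this sketch pairs a lower threshold with a longer evaluation window than the 4XX alarm.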

Best Practices for Production Operations

Plan Before You Optimize

API gateways should fit your architecture and operating model, not the other way around. The following practices have proven themselves and build on article 5a of this series:

Staged deployments: Separate foundation, platform, and application workloads so that individual runs stay small and quotas are not exhausted cumulatively.

Circuit breakers for IaC: Implement preconditions and checks that abort runs as soon as error rates rise. That way you do not burn through other teams' quotas.

Use time windows: Large rollouts should take place outside peak load windows. CI schedules are an operational tool, not cosmetics.

Provider timeouts and retries: Extend timeouts only where needed instead of inflating them globally. For OCI resources you can set per-resource timeouts, for example on the deployment:

resource "oci_apigateway_deployment" "depl" {
  # ... your configuration ...
  timeouts {
    create = "30m"
    update = "30m"
    delete = "30m"
  }
}

Control parallelism deliberately: In Terraform Enterprise, set TFE_PARALLELISM per workspace instead of hard-wiring -parallelism flags into command lines everywhere. This prevents uncontrolled load spikes and is auditable.

Graceful degradation: Build optional paths that fall back to simpler operating modes when limits are hit, instead of letting the entire run fail.

Documented quotas: Centrally maintained quotas per provider and service are mandatory. Only those who know their quotas can deploy within them.
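
The circuit-breaker practice from the list above could be sketched with a precondition on a terraform_data resource. This is a minimal sketch under stated assumptions: the variable recent_429_count is hypothetical and would have to be injected by the CI pipeline (for example from a monitoring query before the run), and the threshold of 5 is illustrative:

```hcl
# Hypothetical input: number of HTTP 429 responses observed in the last minutes,
# supplied by the CI pipeline before the run starts
variable "recent_429_count" {
  type        = number
  default     = 0
  description = "Recently observed HTTP 429 responses (injected by CI)"
}

# Acts as a gate: planning this resource fails if the error count is too high
resource "terraform_data" "api_circuit_breaker" {
  input = var.recent_429_count

  lifecycle {
    precondition {
      condition     = var.recent_429_count <= 5
      error_message = "Circuit breaker open: too many recent HTTP 429 responses. Retry the run later."
    }
  }
}
```

Because the precondition is evaluated during planning, the run stops before any API-heavy apply begins. Resources that should wait for the gate can reference it with depends_on = [terraform_data.api_circuit_breaker].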

Policy as Code with Sentinel

Policies protect platform quality. The following Sentinel policy limits the maximum number of new resources per run. It can be registered as a must-have guardrail in Terraform Enterprise and, at high volumes below the hard limit, prints a meaningful warning instead of failing the run.

# sentinel/policies/api_limit_guard.sentinel
import "tfplan/v2" as tfplan

max_resources_per_run = 50
warn_threshold = 30

resources_to_create = filter tfplan.resource_changes as _, rc {
  rc.change.actions contains "create"
}

create_count = length(resources_to_create)

# Advisory rule: print() always returns true, so this rule never fails the policy.
warn_high_resource_count = rule when create_count > warn_threshold {
  print("WARNING: High resource volume detected.") and
  print("Consider reducing parallelism or splitting the deployment.")
}

# main references the warning rule so that it is actually evaluated.
main = rule {
  warn_high_resource_count and create_count <= max_resources_per_run
}

Integration with Terraform Enterprise

Many of the measures discussed in article 5a only take full effect inside the pipeline.

Terraform Enterprise lets you codify parallelism, runtime settings, and gateway client configuration as an organization-wide standard. For customers within the EU with data sovereignty requirements, TFE is (currently the only) option of choice.

terraform {
  required_version = ">= 1.10"
  required_providers {
    tfe = { source = "hashicorp/tfe", version = ">= 0.65.0" }
  }
}

provider "tfe" {
  hostname = var.tfe_hostname   # e.g., tfe.example.eu
  token    = var.tfe_token
}

resource "tfe_workspace" "prod" {
  name              = "production-infra"
  organization      = var.tfe_org
  queue_all_runs    = true    # Consider 'false' if your maturity model requires manual gates
  terraform_version = "1.10.5"
  working_directory = "live/prod"
}

resource "tfe_variable_set" "api_limits" {
  name         = "api-limit-controls"
  description  = "Controls for parallelism and API client defaults"
  organization = var.tfe_org
}

# Control Terraform parallelism via TFE_PARALLELISM
resource "tfe_variable" "parallelism" {
  key             = "TFE_PARALLELISM"
  value           = "5"
  category        = "env"
  description     = "Terraform parallelism for API limit control"
  variable_set_id = tfe_variable_set.api_limits.id
}

# Example of passing a client header for downstream API gateway policies.
# Note: timestamp() yields a new value on every run, so this variable is
# updated on each apply; use a static value if that churn is unwanted.
resource "tfe_variable" "client_header" {
  key             = "TF_VAR_apigw_client_header"
  value           = "X-CI-Run: ${timestamp()}"
  category        = "env"
  description     = "Example header for downstream API gateway policies"
  variable_set_id = tfe_variable_set.api_limits.id
}
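
A variable set only affects runs in workspaces it is attached to. As a minimal sketch, assuming the workspace and variable set from the example above (the resource name prod_api_limits is illustrative), the attachment could look like this:

```hcl
# Attach the variable set to the production workspace so that
# TFE_PARALLELISM and the client header apply to its runs
resource "tfe_workspace_variable_set" "prod_api_limits" {
  workspace_id    = tfe_workspace.prod.id
  variable_set_id = tfe_variable_set.api_limits.id
}
```

Attaching per workspace keeps the rollout explicit; marking the variable set as global would apply it to every workspace in the organization, which is a broader decision.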

Control via TFE_PARALLELISM is documented and field-proven. Keep the values conservative and measure the effect on plan and apply duration.

Caution: Blindly raising the value often degrades performance because it provokes more 429/5xx responses.

Conclusion: Respect the API

API limits are often perceived as an obstacle, but they are really something like an operating contract between your code and the platform. A Terraform-centric approach with clear rate limits, quotas, and alerting at gateway level brings predictability to CI pipelines, protects resources shared across teams, and noticeably raises the success rate of your runs.

The measures discussed in article 5a remain the first lever. Additional API gateways deepen control, harmonize observability, and anchor your rules in one central place.

Remember: Those who respect limits deploy more sustainably and more robustly.

Ralf Ramge

Founder, Cloud Architect & IT Consultant
