在上一篇文章 5a 中,我们看到大规模的 Terraform Rollouts 很快会触碰到 API 限制,例如当 DR 测试需要并行创建上百个资源时,429 错误会像雪崩一样触发大量 Retries。本篇续篇正是从这里切入,展示如何通过 Oracle Cloud Infrastructure 的 API Gateway 以及 Amazon API Gateway 来有意识地管理这些限制,实现干净的可观测性,并通过「Policy as Code」将其落实到稳定的运营实践中。
API Gateway: 最后的武器?
API-Gateways 帮助我们使 API 限制变得可控。正确使用时,它们能够汇聚 API 调用、强制实施配额和 Throttling、提供一致的可观测性数据,并在运营和治理上形成一个集中入口。
对我们来说,最重要的是:一个 Gateway 不仅仅是转移 Rate-Limit 问题,而是使其能够在团队、部署和路由层面得到主动控制。
在 Oracle Cloud Infrastructure 中,您可以通过 Usage Plans 和 Entitlements 设置技术护栏。这些规则直接作用于 API Gateway 部署,例如每秒的硬性速率限制,以及每分钟或每月的配额。为保证执行和透明性,服务还提供了如 HttpResponses 这样的专用指标,并带有 deploymentId 和 httpStatusCode 维度,可以干净地接入告警系统。(Oracle Documentation)。
服务日志类别 access 和 execution 是该服务预设的通道;它们直接关联到 API 部署,相较于传统的 Bucket 日志归档,这是首选方式。(Oracle Documentation)
以下是一个 OCI 示例(AWS 示例将在后文展示):
# Terraform >= 1.10, OCI Provider 7.14.0
terraform {
required_version = ">= 1.10"
required_providers {
oci = { source = "oracle/oci", version >= "7.14.0" }
}
}
provider "oci" {
region = var.region
}
variable "region" {
type = string
description = "OCI region, e.g., eu-frankfurt-1"
validation {
condition = can(regex("^[a-z]+-[a-z0-9]+-[0-9]+$", var.region))
error_message = "Region must match a pattern like 'eu-frankfurt-1'."
}
}
variable "compartment_id" {
type = string
description = "Compartment OCID used for gateway, logs, and alarms"
}
# Optional: Many organizations manage the API deployment separately.
# We intentionally reference it via a variable to keep the example focused.
variable "api_deployment_id" {
type = string
description = "OCID of the API Gateway deployment"
validation {
condition = can(regex("^ocid1\\..+", var.api_deployment_id))
error_message = "api_deployment_id must be a valid OCID."
}
}
# Enable service logs for 'access' and 'execution'
resource "oci_logging_log_group" "apigw" {
compartment_id = var.compartment_id
display_name = "apigw-logs"
}
resource "oci_logging_log" "apigw_access" {
log_group_id = oci_logging_log_group.apigw.id
display_name = "apigateway-access"
log_type = "SERVICE"
is_enabled = true
configuration {
source {
category = "access"
resource = var.api_deployment_id
service = "apigateway"
}
}
}
resource "oci_logging_log" "apigw_execution" {
log_group_id = oci_logging_log_group.apigw.id
display_name = "apigateway-execution"
log_type = "SERVICE"
is_enabled = true
configuration {
source {
category = "execution"
resource = var.api_deployment_id
service = "apigateway"
}
}
}
# Usage plan with rate limit & minute quota
resource "oci_apigateway_usage_plan" "team_plan" {
compartment_id = var.compartment_id
display_name = "team-standard-plan"
entitlements {
name = "default"
description = "Standard quota for CI runs"
rate_limit {
unit = "SECOND"
value = 50
}
quota {
unit = "MINUTE"
value = 2000
reset_policy = "CALENDAR"
operation_on_breach = "REJECT"
}
targets {
deployment_id = var.api_deployment_id
}
}
lifecycle {
prevent_destroy = true
}
}
在 Amazon API Gateway 中,您可以结合三种手段:Stage 与 Method Throttling、带有 API Keys 的 Usage Plans,以及基于速率的 AWS WAF 规则来实现 IP 聚合控制。CloudWatch 指标 4XXError 和 5XXError 能够在 Stage 层面提供一个稳健的早期预警系统。
重要提示: AWS WAFv2 目前只能与 REST-API Stages 关联,无法应用于 HTTP APIs。(AWS Documentation, Terraform Registry)
# Amazon API Gateway (REST) – stage throttling, usage plan, WAF
terraform {
required_version = ">= 1.10"
required_providers {
aws = { source = "hashicorp/aws", version = ">= 5.0" }
}
}
provider "aws" {
region = var.aws_region
}
data "aws_region" "current" {}
resource "aws_api_gateway_rest_api" "tf_api" {
name = "terraform-at-scale"
}
resource "aws_api_gateway_resource" "status" {
rest_api_id = aws_api_gateway_rest_api.tf_api.id
parent_id = aws_api_gateway_rest_api.tf_api.root_resource_id
path_part = "status"
}
resource "aws_api_gateway_method" "get_status" {
rest_api_id = aws_api_gateway_rest_api.tf_api.id
resource_id = aws_api_gateway_resource.status.id
http_method = "GET"
authorization = "NONE"
}
resource "aws_api_gateway_integration" "get_status_mock" {
rest_api_id = aws_api_gateway_rest_api.tf_api.id
resource_id = aws_api_gateway_resource.status.id
http_method = aws_api_gateway_method.get_status.http_method
type = "MOCK"
}
resource "aws_api_gateway_deployment" "this" {
rest_api_id = aws_api_gateway_rest_api.tf_api.id
depends_on = [aws_api_gateway_integration.get_status_mock]
}
resource "aws_api_gateway_stage" "prod" {
rest_api_id = aws_api_gateway_rest_api.tf_api.id
deployment_id = aws_api_gateway_deployment.this.id
stage_name = "prod"
method_settings {
resource_path = "/*"
http_method = "*"
metrics_enabled = true
logging_level = "INFO"
data_trace_enabled = false
throttling_burst_limit = 100
throttling_rate_limit = 50
}
}
resource "aws_api_gateway_usage_plan" "plan" {
name = "team-standard-plan"
api_stages {
api_id = aws_api_gateway_rest_api.tf_api.id
stage = aws_api_gateway_stage.prod.stage_name
}
throttle_settings {
burst_limit = 100
rate_limit = 50
}
quota_settings {
limit = 2000
period = "DAY"
}
}
resource "aws_api_gateway_api_key" "ci_key" {
name = "ci-runs"
enabled = true
# If 'value' is omitted, the service generates a secure key automatically.
}
resource "aws_api_gateway_usage_plan_key" "ci_key_bind" {
key_id = aws_api_gateway_api_key.ci_key.id
key_type = "API_KEY"
usage_plan_id = aws_api_gateway_usage_plan.plan.id
}
# WAFv2 rate-based rule (REGIONAL) – only for REST API stages, not HTTP APIs
resource "aws_wafv2_web_acl" "apigw_waf" {
name = "apigw-waf"
description = "Rate limit per source IP"
scope = "REGIONAL"
default_action { allow {} }
rule {
name = "rate-limit"
priority = 1
action { block {} }
statement {
rate_based_statement {
limit = 500
aggregate_key_type = "IP"
}
}
visibility_config {
cloudwatch_metrics_enabled = true
metric_name = "apigw-waf"
sampled_requests_enabled = true
}
}
visibility_config {
cloudwatch_metrics_enabled = true
metric_name = "apigw-waf"
sampled_requests_enabled = true
}
}
resource "aws_wafv2_web_acl_association" "stage_assoc" {
resource_arn = "arn:aws:apigateway:${data.aws_region.current.name}::/restapis/${aws_api_gateway_rest_api.tf_api.id}/stages/${aws_api_gateway_stage.prod.stage_name}"
web_acl_arn = aws_wafv2_web_acl.apigw_waf.arn
}
Stage 范围的 Throttling、Usage Plans 以及 WAF 关联是 AWS 端的核心构建模块。CloudWatch 还提供了包括 4XXError 在内的指标,并带有 ApiName 和 Stage 维度,这使得在每个 Stage 层面触发告警变得更加简单。(AWS Documentation)
Testing 与验证 (Terraform 1.10+)
为了实现快速且可重复的安全保障,推荐使用 Terraform 的原生 Testing-Framework。通过 Mock-Provider 封装外部依赖,并使用 Assertions 来检查项目特定规则,例如最大批处理大小或在限制过低时的行为。
专业提示: 请有意识地编写简短且有表现力的测试,用于增强模块对错误配置的防护。(HashiCorp Developer)
# tests/api_limits.tftest.hcl
test {
# optional name and timeouts can be added here
}
variables {
# Global default variables for all runs in this test file
max_batch_size = 50
}
# Example: The plan must never try to create more than 50 new resources
run "enforce_small_batches" {
command = plan
assert {
condition = length([for rc in run.plan.resource_changes : rc if contains(rc.change.actions, "create")]) <= var.max_batch_size
error_message = "Too many new resources in a single run – split the deployment into smaller batches."
}
}
# Example: We expect a failure of a named precondition
# (Preconditions are defined in your modules/resources)
run "expect_precondition_failure" {
command = plan
expect_failures = [
precondition.api_limits_reasonable
]
}
实践中的提示:
- Assertions 必须是单行表达式,
- expect_failures 引用的是已命名的 Preconditions,而不是一般的类型错误。
- Ephemeral 资源截至目前(Terraform 1.12.0)主要适用于临时 Token 和查询,但不能作为 Mocks 的通用替代方案。
Monitoring + Alerting
可观测性是您的 API 限制策略的运营支柱。
在 OCI 上,最可靠的方式是直接使用 API Gateway 的服务指标,并结合监控平台的告警。维度 deploymentId 与 httpStatusCode 可用于唯一过滤 429 响应。MQL 语法如下,请注意维度名称的正确性:(Oracle Documentation)
# OCI: Alarm on sustained HTTP 429 responses at deployment level
resource "oci_ons_notification_topic" "ops" {
compartment_id = var.compartment_id
name = "ops-alerts"
}
resource "oci_ons_subscription" "ops_mail" {
compartment_id = var.compartment_id
topic_id = oci_ons_notification_topic.ops.id
protocol = "EMAIL"
endpoint = var.alert_email
}
resource "oci_monitoring_alarm" "apigw_429" {
compartment_id = var.compartment_id
metric_compartment_id = var.compartment_id
display_name = "APIGW 429 bursts"
is_enabled = true
severity = "CRITICAL"
destinations = [oci_ons_notification_topic.ops.id]
message_format = "ONS_OPTIMIZED"
pending_duration = "PT1M" # 1 minute
resolution = "1m"
# Correct dimensions according to API Gateway metrics: deploymentId, httpStatusCode
query = <<-EOT
HttpResponses[1m]{deploymentId="${var.api_deployment_id}", httpStatusCode="429"}.sum() > 5
EOT
body = "Increased rate of HTTP 429 on API Gateway deployment: {{triggerValue}}/min"
}
在 AWS 上,您可以定义简单且稳健的告警,针对 4XXError 与 5XXError,并辅以 Stage 范围的 Throttling。在实际运行中,基于 4XXError 的告警触发得更早更广,而 WAF 的速率限制则用于拦截突发的流量峰值。(AWS Documentation)
# AWS: CloudWatch alarm on 4XX errors (stage-wide)
resource "aws_cloudwatch_metric_alarm" "api_4xx_spike" {
alarm_name = "apigw-prod-4xx-spike"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
period = 60
statistic = "Sum"
threshold = 50
namespace = "AWS/ApiGateway"
metric_name = "4XXError"
dimensions = {
ApiName = aws_api_gateway_rest_api.tf_api.name
Stage = aws_api_gateway_stage.prod.stage_name
}
alarm_description = "Elevated client errors on 'prod' stage"
}
生产环境最佳实践
规划先于优化
API-Gateways 应当契合您的架构与运营模型,而不是迫使模型去适配 Gateway。以下实践已被验证有效,并且是基于本系列第 5a 篇文章的延伸:
分层部署:将 Foundation、平台与应用工作负载分开,这样单个 Run 保持小规模,避免配额叠加超限。
IaC 的 Circuit-Breaker:实现 Preconditions 与 Checks,一旦错误率上升就中止 Runs,从而避免消耗其他团队的配额。
利用时间窗口:大规模 Rollouts 应该安排在主负载窗口之外。CI 时间表是运营手段,而不是装饰。
Provider-Timeouts 与 Retries:仅在必要时延长 Timeouts,而不是全局放宽。对于 OCI 资源,您可以在资源级别设置时间限制,例如 Deployment:
resource "oci_apigateway_deployment" "depl" {
# ... your configuration ...
timeouts {
create = "30m"
update = "30m"
delete = "30m"
}
}
有意识地控制并行度:在 Terraform Enterprise 中,请在 Workspace 层面设置 TFE_PARALLELISM,而不是在命令行处到处硬编码 -parallelism Flags。这样能够避免不可控的流量高峰,并且便于审计。
Graceful Degradation:设计可选路径,在触发 Limits 时退化为更简单的运行模式,而不是让整个 Run 失败。
文档化配额:每个 Provider 与 Service 的 Quotas 必须集中管理。只有清楚配额的人,才能有限度地进行部署。
Policy as Code 与 Sentinel
Policies 用于保护平台质量。以下 Sentinel-Policy 限制每个 Run 的最大新建资源数。它可作为 Must-Have Guardrail 集成在 Terraform Enterprise 中,并在高负载时生成有价值的警告,而不是直接报错失败。
# sentinel/policies/api_limit_guard.sentinel
import "tfplan/v2" as tfplan
max_resources_per_run = 50
resources_to_create = filter tfplan.resource_changes as _, rc {
rc.change.actions contains "create"
}
main = rule {
length(resources_to_create) <= max_resources_per_run
}
warn_high_resource_count = rule when length(resources_to_create) > 30 {
print("WARNING: High resource volume detected.")
print("Consider reducing parallelism or splitting the deployment.")
true
}
与 Terraform Enterprise 的集成
在流水线中,文章 5a 中讨论的许多措施才能真正发挥效果。
Terraform Enterprise 允许您将并行度、运行时设置以及 Gateway-Client 配置编码为组织标准。对于位于欧盟、对数据主权有要求的客户而言,TFE 是(目前唯一的)首选方案。
terraform {
required_version = ">= 1.10"
required_providers {
tfe = { source = "hashicorp/tfe", version = ">= 0.65.0" }
}
}
provider "tfe" {
hostname = var.tfe_hostname # e.g., tfe.example.eu
token = var.tfe_token
}
resource "tfe_workspace" "prod" {
name = "production-infra"
organization = var.tfe_org
queue_all_runs = true # Consider 'false' if your maturity model requires manual gates
terraform_version = "1.10.5"
working_directory = "live/prod"
}
resource "tfe_variable_set" "api_limits" {
name = "api-limit-controls"
description = "Controls for parallelism and API client defaults"
organization = var.tfe_org
}
# Control Terraform parallelism via TFE_PARALLELISM
resource "tfe_variable" "parallelism" {
key = "TFE_PARALLELISM"
value = "5"
category = "env"
description = "Terraform parallelism for API limit control"
variable_set_id = tfe_variable_set.api_limits.id
}
# Example of passing a client header for downstream API gateway policies
resource "tfe_variable" "client_header" {
key = "TF_VAR_apigw_client_header"
value = "X-CI-Run: ${timestamp()}"
category = "env"
description = "Example header for downstream API gateway policies"
variable_set_id = tfe_variable_set.api_limits.id
}
通过 TFE_PARALLELISM 进行的控制是有文档支撑且经实践验证的。请保持保守的取值,并衡量其对 Plan- 与 Apply- 时间的影响。
注意:盲目提高并行度往往会因更多的 429/5xx 响应而导致性能下降。
结论:对 API 的尊重
API-Limits 虽然常被视为障碍,但实际上它们是一种在您的代码与平台之间的运营契约。基于 Terraform 的方法,结合清晰的 Rate-Limits、Quotas 和 Gateway 层面的告警机制,可以让 CI-Pipelines 更具可预测性,保护跨团队的资源,并显著提高 Runs 的成功率。
在第 5a 篇文章中讨论的措施依然是首选的抓手。额外引入 API-Gateways 则可以进一步增强控制力,统一 Observability,并集中固化您的规则。
记住:尊重 Limits,才能部署得更可持续、更稳健。




