Terraform @ Scale - Part 3c: Monitoring and Alerting for Blast Radius Events

Details: Read Time: 8 mins; Created: 10 July 2025

Even the most sophisticated infrastructure architecture cannot prevent every error. That is why it is essential to monitor Terraform operations proactively - especially those with potentially destructive impact. The goal is to detect critical changes early and trigger automated alerts before an uncontrolled blast radius occurs.

Sure - your system engineer will undoubtedly point out that Terraform displays the full plan before executing an apply, and that execution must be confirmed by entering "yes".

What your engineer does not mention: they do not actually read the plan before allowing it to proceed.

“It'll be fine.”

Early Warning System: Automated Plan Analysis

Terraform provides a way to evaluate plan information programmatically: a saved plan file can be rendered as JSON with terraform show -json. This allows detection of planned deletions (the "delete" action) and the automated initiation of appropriate actions, such as a Slack alert or automatic termination of the CI/CD pipeline.

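An alternative early indicator is the exit code of terraform plan -detailed-exitcode: 0 means the plan is empty, 1 means planning failed, and 2 means the plan contains changes. Exit code 2 covers any kind of change, not only deletions, so it is a trigger for closer analysis rather than a deletion detector on its own.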
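
As a minimal sketch of how a pipeline step might branch on that exit code, the helper below maps the code to a label. The function name classify_plan_exit_code is a hypothetical helper for illustration, not part of Terraform.

```shell
#!/bin/bash
# Sketch: translate the exit code of "terraform plan -detailed-exitcode"
# into a label a pipeline step can branch on.
# classify_plan_exit_code is a hypothetical helper, not a Terraform feature.

classify_plan_exit_code() {
    case "$1" in
        0) echo "no-changes" ;;   # plan succeeded, diff is empty
        1) echo "error" ;;        # planning itself failed
        2) echo "changes" ;;      # plan succeeded, diff is non-empty
        *) echo "unknown" ;;      # anything else is unexpected
    esac
}

# Usage in a pipeline step (commented out so the sketch has no side effects):
#   code=0
#   terraform plan -out=tfplan -detailed-exitcode || code=$?
#   status=$(classify_plan_exit_code "$code")
```

Capturing the code with "|| code=$?" keeps a shell running under "set -e" from aborting on exit code 2.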

Example: Bash Script for Plan Evaluation

This script can be integrated as a hook into the CI/CD pipeline. If planned deletions are detected, immediate notification follows - or optionally, an automatic stop of the rollout.

An example script for reference:

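#!/bin/bash
# Automated plan analysis: stop the pipeline when the plan contains deletions

set -euo pipefail

# Assumption: WORKSPACE is provided by the CI system; fall back to a placeholder
WORKSPACE="${WORKSPACE:-unknown}"

# Create the plan. With -detailed-exitcode, exit code 2 means "changes present",
# so capture the code explicitly instead of letting "set -e" abort the script.
PLAN_EXIT_CODE=0
terraform plan -out=tfplan -detailed-exitcode || PLAN_EXIT_CODE=$?

# Exit code 1 signals that planning itself failed
if [ "$PLAN_EXIT_CODE" -eq 1 ]; then
    echo "Error: terraform plan failed" >&2
    exit 1
fi

# Exit code 2 signals a non-empty plan
if [ "$PLAN_EXIT_CODE" -eq 2 ]; then
    # Export the saved plan in JSON format
    terraform show -json tfplan > plan.json

    # Collect the addresses of all resources whose actions include "delete"
    # (this also covers replacements, which are delete + create)
    DELETIONS=$(jq -r '.resource_changes[]? | select(.change.actions | index("delete")) | .address' plan.json)

    if [ -n "$DELETIONS" ]; then
        echo "BLAST RADIUS ALERT: Planned deletions detected:"
        while read -r resource; do
            echo "  - $resource"
        done <<< "$DELETIONS"

        # Build the JSON payload with jq so that newlines and quotes are escaped
        MESSAGE="ALERT: Terraform Destroy detected in $WORKSPACE:"$'\n'"$DELETIONS"
        PAYLOAD=$(jq -n --arg text "$MESSAGE" '{text: $text}')

        # Send the alert; a failed notification must not mask the finding
        if ! curl -f -X POST "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK" \
            -H 'Content-type: application/json' \
            --data "$PAYLOAD"; then
            echo "Warning: Failed to send Slack notification"
        fi

        exit 2
    fi
fi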
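
The script above stops on any planned deletion. Where that is too strict, one possible refinement is a gate that tolerates a small number of deletions but always stops when a protected part of the infrastructure is affected. The following is only a sketch: MAX_DELETIONS, PROTECTED_PREFIX and gate_deletions are hypothetical names chosen for illustration, and the default prefix module.database. is an assumed example.

```shell
#!/bin/bash
# Sketch: decide whether a list of planned deletions should stop the rollout.
# MAX_DELETIONS and PROTECTED_PREFIX are hypothetical knobs for illustration.

MAX_DELETIONS="${MAX_DELETIONS:-0}"
PROTECTED_PREFIX="${PROTECTED_PREFIX:-module.database.}"

# Takes newline-separated resource addresses as the first argument.
# Returns 0 (proceed) or 2 (stop), mirroring the exit code of the script above.
gate_deletions() {
    local deletions="$1" count=0 resource
    while IFS= read -r resource; do
        [ -z "$resource" ] && continue
        # Any deletion below the protected prefix stops the rollout immediately
        case "$resource" in
            "$PROTECTED_PREFIX"*) return 2 ;;
        esac
        count=$((count + 1))
    done <<< "$deletions"
    # Otherwise stop only if the number of deletions exceeds the threshold
    [ "$count" -gt "$MAX_DELETIONS" ] && return 2
    return 0
}
```

In the script above, the unconditional "exit 2" could then be replaced by a call such as gate_deletions "$DELETIONS" || exit 2. With the default threshold of 0, the behavior is unchanged.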

Cloud-based Log Monitoring with Alerting

For production environments, centralized monitoring is recommended. This can be implemented, for example, with Splunk running in your own data center, or with cloud services such as AWS CloudWatch or OCI Logging. The goal is to capture suspicious log entries containing destructive keywords like “destroy” and trigger real-time alerts.

Note: The following examples are provided for guidance and include the necessary resource declarations, but are not yet fully operational end-to-end. Missing elements such as versions.tf and variables.tf are left to the sufficiently skilled reader.

Example: AWS CloudWatch Integration

The alerts can be connected directly to an aws_sns_topic, which in turn can send notifications via email or, through the appropriate integrations, to Slack, PagerDuty or other systems. This makes it far less likely that a critical terraform destroy goes unnoticed.

provider "aws" {
  region = "eu-central-1"
}

resource "aws_cloudwatch_log_group" "terraform_logs" {
  name              = "/terraform/cicd"
  retention_in_days = 7

  tags = {
    Environment = "production"
    Purpose     = "terraform-monitoring"
  }
}

resource "aws_cloudwatch_metric_filter" "terraform_destroy_filter" {
  name           = "terraform-destroy-keyword"
  log_group_name = aws_cloudwatch_log_group.terraform_logs.name
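  # Matches any log event containing the term "destroy". This also matches
  # plan summaries such as "0 to destroy", so tighten it for production use.
  pattern        = "\"destroy\""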

  metric_transformation {
    name      = "DestroyMatches"
    namespace = "Terraform/CI"
    value     = "1"
    unit      = "Count"  
  }
}

resource "aws_sns_topic" "alerts" {
  name = "terraform-blast-radius-alerts"
  
  tags = {
    Environment = "production"
    Purpose     = "terraform-alerts"
  }
}

resource "aws_sns_topic_subscription" "email_alert" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email 
}

resource "aws_cloudwatch_metric_alarm" "blast_radius_alarm" {
  alarm_name          = "Terraform-Destroy-Detected"
  alarm_description   = "Detects destroy operations in Terraform CI output"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  threshold           = 0
  metric_name         = "DestroyMatches"
  namespace           = "Terraform/CI"
  statistic           = "Sum"
  period              = 60
  treat_missing_data  = "notBreaching"
  
  insufficient_data_actions = []
  alarm_actions             = [aws_sns_topic.alerts.arn]
  ok_actions                = [aws_sns_topic.alerts.arn]

  tags = {
    Environment = "production"
    Purpose     = "blast-radius-monitoring"
  }
}
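
The keyword filter above fires on every log line that contains "destroy", including the summary line Terraform prints for harmless plans ("Plan: 1 to add, 0 to change, 0 to destroy."). One way to reduce such false positives is to let the pipeline emit a distinct marker only when the destroy count is greater than zero, and to filter on that marker instead. The following is a sketch under assumptions: planned_destroy_count and flag_destroy_lines are hypothetical helpers, and BLAST_RADIUS_DESTROY is an assumed marker string, not something Terraform prints.

```shell
#!/bin/bash
# Sketch: extract the number of planned destroys from Terraform's plan summary,
# e.g. "Plan: 1 to add, 0 to change, 2 to destroy.", so that a log line is
# flagged only when the count is greater than zero.
# planned_destroy_count is a hypothetical helper name.

planned_destroy_count() {
    local line="$1"
    local pattern='([0-9]+) to destroy'
    if [[ "$line" =~ $pattern ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        echo 0
    fi
}

# Reads log lines from stdin and emits a distinct marker for each summary line
# with a non-zero destroy count.
# BLAST_RADIUS_DESTROY is an assumed marker string, not a Terraform output.
flag_destroy_lines() {
    local line count
    while IFS= read -r line; do
        count=$(planned_destroy_count "$line")
        if [ "$count" -gt 0 ]; then
            echo "BLAST_RADIUS_DESTROY count=$count"
        fi
    done
}
```

A pipeline step could then run something like terraform plan -no-color | tee plan.log | flag_destroy_lines and ship the marker lines to the log group, with the filter pattern matching the marker rather than the bare keyword.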

Example: OCI Logging with Alerting

In Oracle Cloud Infrastructure, use the Logging service in combination with a logging query, an alarm and the Notifications service. This allows you to detect destructive actions like terraform destroy based on keywords in the CI/CD pipeline log stream or in audit logs.

Configuration steps:

Create a log group for your build logs or audit logs
Define a Logging Search with a query such as data.message CONTAINS "destroy"
Define an alarm that triggers on matches
Connect the alarm to a notification topic (email, PagerDuty, etc.)

Example alarm using Terraform:

resource "oci_logging_log_group" "terraform_logs" {
  display_name   = "terraform-ci-logs"
  compartment_id = var.compartment_id
  
  freeform_tags = {
    "Environment" = "production"
    "Purpose"     = "terraform-monitoring"
  }
}

resource "oci_logging_log" "cicd_log" {
  display_name = "terraform-cicd-log"
  log_group_id = oci_logging_log_group.terraform_logs.id
  log_type     = "CUSTOM"

  # A custom log receives entries pushed by your CI/CD tooling, so it needs no
  # service "source" configuration; that block applies to log_type = "SERVICE".

  is_enabled         = true
  retention_duration = 30
}

resource "oci_ons_notification_topic" "alerts" {
  name           = "terraform-destroy-alerts"
  compartment_id = var.compartment_id
  description    = "Alerts for blast-radius related events"
  
  freeform_tags = {
    "Environment" = "production"
    "Purpose"     = "terraform-alerts"
  }
}

resource "oci_ons_subscription" "email_subscription" {
  compartment_id = var.compartment_id
  topic_id       = oci_ons_notification_topic.alerts.id
  protocol       = "EMAIL"
  endpoint       = var.alert_email
}

resource "oci_monitoring_alarm" "terraform_destroy_alarm" {
  display_name         = "Terraform-Destroy-Detected"
  compartment_id       = var.compartment_id
  metric_compartment_id = var.compartment_id
  
  query = <<-EOQ
    LoggingAnalytics[1m]{
      logGroup = "${oci_logging_log_group.terraform_logs.display_name}",
      log = "${oci_logging_log.cicd_log.display_name}"
    } | where data.message =~ ".*destroy.*" | count()
  EOQ
  
  severity     = "CRITICAL"
  body         = "Terraform destroy operation detected in CI/CD pipeline!"
  is_enabled   = true
  
  pending_duration             = "PT1M"
  repeat_notification_duration = "PT15M"
  resolution                   = "1m"

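  # Optional: suppress notifications during a planned maintenance window.
  # The block needs both timestamps, so it is left commented out here.
  # suppression {
  #   description         = "Planned maintenance window"
  #   time_suppress_from  = var.maintenance_start # hypothetical variable
  #   time_suppress_until = var.maintenance_end   # hypothetical variable
  # }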

  destinations = [oci_ons_notification_topic.alerts.id]
  
  freeform_tags = {
    "Environment" = "production"
    "Purpose"     = "blast-radius-monitoring"
  }
}

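Note: The query in the alarm is illustrative and uses a simple text match. OCI Monitoring alarms evaluate metric queries and expect a metric namespace, so depending on your setup the log matches may first have to be emitted as a metric (for example via a connector from Logging to Monitoring) before the alarm can fire. For production environments, you may also want to use more precise filters, such as regular expressions or structured log fields, assuming your CI tools produce structured logs.
Alternatively, the simpler LoggingSearch query engine may be used if Logging Analytics is not enabled in your tenancy.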

Additional benefit: In OCI, this method can also be extended to detect apply actions, policy violations or drift, provided the logs are properly populated (e.g. via Terraform plan output, Sentinel warnings or audit events).

✅ Checklist: Blast Radius Readiness

This checklist can help you build your infrastructure to be as resilient as possible.

✅ Preventive Measures

[ ] States segmented by blast radius impact
[ ] Lifecycle rules implemented for critical resources
[ ] Remote state validations in place
[ ] Policy-as-Code for destroy operations
[ ] Automated plan analysis enabled
[ ] Cross-state dependency mapping created

🚨 Preparations for Emergencies

[ ] State backup strategy implemented
[ ] Import scripts for critical resources tested
[ ] Incident response playbooks available
[ ] Team training for state surgery completed
[ ] Monitoring and alerting for blast radius events active

✍️ Careful Planning and Mindsets

Successful enterprise-level Terraform implementations also require:

Proactive architecture: design states based on blast radius impact
Defensive programming: implement guardrails and validations
Monitoring and alerting: detect blast radius events early
Recovery preparedness: be ready for critical situations

Conclusion: Controlled Explosions Instead of Chaos

Important: Blast radius management is not a one-time setup, but a continuous process.

The key is to strike the right balance between flexibility and control - just like the Goldilocks principle, which we have discussed in detail in a previous article.

Because the best explosion is the one that never happens.

Ralf Ramge

Founder, Cloud Architect & IT Consultant

ICT.technology