20 Projects: From Python SDK to Production Engineering

Aligned with Google Cloud Associate Cloud Engineer (ACE) Certification


Cost Notice: Most projects can be completed within the free tier or for a few dollars if you delete resources promptly. Phase 3–4 projects (MIGs, Load Balancers, cross-region DR) can cost more if left running. Each project includes a cost estimate.


Quick Start Guide

| Your Background | Start Here | Time Estimate |
| --- | --- | --- |
| New to GCP | Projects 1–3 (L1: Single VM basics) | 1 weekend |
| Know Python, new to GCE | Projects 2–6 (L1–L2: SDK automation) | 1 weekend |
| Comfortable with Terraform | Projects 8–14 (L2–L3: IaC + Security) | 1 week |
| Production experience | Projects 15–20 (L3–L4: SRE + DR) | 1 week |
| 2 weeks until ACE exam | Projects 2–5, 8–11, 13, 16 | Focus path |
| 1 weekend crash course | Projects 1–6 | Essentials only |

Level Definitions

  • L1 — Single VM, simple firewall, manual/scripted operations
  • L2 — MIGs, basic load balancing, Python SDK automation
  • L3 — IaC (Terraform/Pulumi), CI/CD, security hardening
  • L4 — SRE features (autoscaling, SLOs, incident response, DR)

Interaction Philosophy

Primary interactions via Python SDK and IaC tools; Console used mainly for verification and dashboards.


Phase 1: The Python Infrastructure Builder

Focus: Mastering the google-cloud-compute Python client and understanding resource lifecycles.

ACE Focus: Section 2.1 (Launching compute instances, choosing storage) and Section 3.1 (Working with snapshots and images).


Project 1: The One-Time Manual Launch [Phase 1, L1]

Goal: Spin up a VM manually in the Console to observe the default Service Account, scopes, network tags, and firewall rules.

Why This Matters:

  • ACE questions often test your knowledge of default configurations
  • Understanding defaults helps you know what to explicitly override in automation

Requirements:

  • Note the default compute service account email format: PROJECT_NUMBER-compute@developer.gserviceaccount.com
  • Identify which APIs are enabled by default
  • Delete the VM immediately after observation

Common Mistakes:

  • Forgetting to note the default network tags (none) vs. the default firewall rules
  • Not checking the Service Account scopes vs. IAM roles distinction (the Console's "Allow default access" grants a narrow legacy scope set such as devstorage.read_only and logging.write; "Allow full access to all Cloud APIs" grants the broad cloud-platform scope and relies on IAM roles to limit access)

Key Concepts: Console navigation, default resource values, VM lifecycle states

ACE Mapping: 1.1 (Setting up projects), 2.1 (Launching compute instance)

Done = Can describe the default SA format, default network tags, and why the VM has no external HTTP access initially.

Stretch: Enable OS Login and note how SSH key management changes.

Cost: Free tier eligible. ~$0 if deleted within minutes.


Project 2: The Python Spawner [Phase 1, L1]

Goal: Write a Python script (create_vm.py) using google-cloud-compute that creates a VM with an injected startup script that installs Nginx and writes a custom index.html.

Why This Matters:

  • The foundation for all programmatic GCE management
  • ACE tests your ability to launch instances with correct configurations
  • Startup scripts are the simplest form of bootstrapping

Requirements:

  • Script must use InstancesClient from google.cloud.compute_v1
  • Startup script injected via metadata key startup-script
  • Script must wait for operation completion using operation.result()
  • Print the external IP when VM is RUNNING
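
A minimal sketch of those requirements with google.cloud.compute_v1; the project, zone, VM name, and the Debian image family are placeholders to adjust:

from google.cloud import compute_v1

PROJECT, ZONE, NAME = "my-project", "us-central1-a", "my-vm"      # placeholders
STARTUP = (
    "#!/bin/bash\n"
    "apt-get update && apt-get install -y nginx\n"
    "echo '<html>Hello from GCE!</html>' > /var/www/html/index.html\n"
)

instance = compute_v1.Instance()
instance.name = NAME
instance.machine_type = f"zones/{ZONE}/machineTypes/e2-micro"

boot = compute_v1.AttachedDisk()
boot.boot = True
boot.auto_delete = True
boot.initialize_params = compute_v1.AttachedDiskInitializeParams(
    source_image="projects/debian-cloud/global/images/family/debian-12")
instance.disks = [boot]

nic = compute_v1.NetworkInterface()
nic.network = "global/networks/default"
external = compute_v1.AccessConfig()
external.type_ = "ONE_TO_ONE_NAT"            # allocates an ephemeral external IP
nic.access_configs = [external]
instance.network_interfaces = [nic]

# The metadata key must be exactly "startup-script".
instance.metadata = compute_v1.Metadata(
    items=[compute_v1.Items(key="startup-script", value=STARTUP)])

client = compute_v1.InstancesClient()
client.insert(project=PROJECT, zone=ZONE, instance_resource=instance).result()

created = client.get(project=PROJECT, zone=ZONE, instance=NAME)
# nat_i_p is the proto-plus name for the REST natIP field.
print(f"Created instance '{NAME}' with external IP:",
      created.network_interfaces[0].access_configs[0].nat_i_p)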

Common Mistakes:

  • Not waiting for the operation to complete before querying for IP
  • Forgetting to create firewall rule for HTTP (port 80)
  • Using wrong metadata key (startup_script vs startup-script)

Key Concepts: Asynchronous API operations, Instance resource structure, metadata bootstrapping

ACE Mapping: 2.1 (Launching compute instance, SSH keys, availability policy)

Done = Running python create_vm.py creates a VM; curl <external-ip> returns custom HTML; script handles errors gracefully.

Verification Commands:

python create_vm.py
# Output: Created instance 'my-vm' with external IP: 35.x.x.x
curl http://35.x.x.x
# Output: <html>Hello from GCE!</html>

Stretch: Add argument parsing for zone, machine type, and custom HTML content.

Cost: e2-micro is free tier eligible. ~$0.01/hour otherwise.


Project 3: The Disk Lifecycle Manager [Phase 1, L1]

Goal: Write a Python script that creates a standalone Persistent Disk, attaches it to your running VM, and formats/mounts it remotely without rebooting.

Why This Matters:

  • ACE frequently tests disk attachment, detachment, and lifecycle
  • Understanding that disks and VMs have independent lifecycles is crucial
  • Day-2 operations (modifying running infrastructure) are production reality

Requirements:

  • Create a 10GB pd-standard disk via DisksClient
  • Attach to VM using AttachedDisk with mode=READ_WRITE
  • Use subprocess with gcloud compute ssh --command to run mkfs.ext4 and mount
  • Alternative: use paramiko for direct SSH
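
One way the create/attach/mount flow can look, assuming the Project 2 VM is still running and using a hypothetical disk name of data-disk:

import subprocess
from google.cloud import compute_v1

PROJECT, ZONE, VM, DISK = "my-project", "us-central1-a", "my-vm", "data-disk"  # placeholders

disk = compute_v1.Disk()
disk.name = DISK
disk.size_gb = 10
disk.type_ = f"zones/{ZONE}/diskTypes/pd-standard"
compute_v1.DisksClient().insert(project=PROJECT, zone=ZONE, disk_resource=disk).result()

# An explicit device_name makes the disk appear as /dev/disk/by-id/google-data-disk.
attached = compute_v1.AttachedDisk(
    source=f"projects/{PROJECT}/zones/{ZONE}/disks/{DISK}",
    mode="READ_WRITE",
    device_name=DISK)
compute_v1.InstancesClient().attach_disk(
    project=PROJECT, zone=ZONE, instance=VM,
    attached_disk_resource=attached).result()

# Format and mount remotely; only ever format a blank disk.
subprocess.run([
    "gcloud", "compute", "ssh", VM, f"--zone={ZONE}", "--command",
    f"sudo mkfs.ext4 -F /dev/disk/by-id/google-{DISK} && "
    f"sudo mkdir -p /mnt/data && sudo mount /dev/disk/by-id/google-{DISK} /mnt/data",
], check=True)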

Common Mistakes:

  • Forgetting to unmount before detaching (causes data corruption)
  • Assuming disk auto-detaches when VM stops (it doesn’t by default)
  • Not specifying device_name and then struggling to find it in /dev/
  • Formatting a disk that already has data (always check first)

Key Concepts: Day-2 operations, separate lifecycles for compute vs storage, live disk attachment

ACE Mapping: 2.1 (Choosing storage: zonal PD, regional PD, Hyperdisk), 3.1 (Working with snapshots)

Done = Disk is attached and mounted at /mnt/data; writing a file to /mnt/data persists across VM stop/start.

Verification Commands:

# On the VM after script runs:
lsblk  # Shows new disk
df -h /mnt/data  # Shows mounted filesystem
echo "test" > /mnt/data/testfile
# Stop and start VM, then:
cat /mnt/data/testfile  # Returns "test"

Stretch: Implement disk detachment and reattachment to a different VM.

Cost: 10GB pd-standard ≈ $0.01 for a few hours.


Project 4: The Image Bakery [Phase 1, L2]

Goal: Write a script that stops your VM from Project 2, creates a Custom Image from its boot disk, and deletes the original VM.

Why This Matters:

  • “Golden Images” are foundational to immutable infrastructure
  • ACE tests image creation, families, and deprecation policies
  • Faster instance startup vs. startup scripts every time

Requirements:

  • Stop instance using InstancesClient.stop()
  • Create image using ImagesClient.insert()
  • Wait for image to be READY before deleting VM
  • Output the image self_link for future use
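
A sketch of the stop/bake/delete sequence, assuming the boot disk shares the VM's name (the default when the boot disk was auto-created):

from google.cloud import compute_v1

PROJECT, ZONE, VM = "my-project", "us-central1-a", "my-vm"   # placeholders

instances = compute_v1.InstancesClient()
instances.stop(project=PROJECT, zone=ZONE, instance=VM).result()

image = compute_v1.Image(
    name="my-golden-image",
    family="my-golden",                        # families make "latest" lookups easy
    source_disk=f"projects/{PROJECT}/zones/{ZONE}/disks/{VM}")
images = compute_v1.ImagesClient()
images.insert(project=PROJECT, image_resource=image).result()   # images are global

baked = images.get(project=PROJECT, image="my-golden-image")
print("Image ready:", baked.status, baked.self_link)            # expect READY

instances.delete(project=PROJECT, zone=ZONE, instance=VM).result()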

Common Mistakes:

  • Trying to create image from running VM (must be stopped or use snapshots)
  • Not using image families for versioning
  • Forgetting that images are global resources even though they are created from zonal (or regional) disks

Key Concepts: Immutable infrastructure, Golden Images, image families

ACE Mapping: 3.1 (Create, view, delete images), 2.1 (Boot disk image)

Done = Custom image exists; launching new VM from this image serves the same Nginx page without any startup script.

Verification Commands:

gcloud compute images list --filter="name~my-golden"
gcloud compute instances create test-from-image --image=my-golden-image
curl http://<new-vm-ip>  # Returns same HTML without startup script

Stretch: Implement image family versioning with automatic deprecation of old images.

Cost: Image storage ≈ $0.05/GB/month. Minimal for small images.


Project 5: The Snapshot Sentinel [Phase 1, L2]

Goal: Write a Python function that iterates through all disks in a zone and triggers immediate snapshots for disks labeled backup: true.

Why This Matters:

  • ACE tests snapshot scheduling and retention
  • Label-based resource management is a production best practice
  • Understanding snapshot vs. image use cases

Requirements:

  • Use DisksClient.list() to enumerate disks
  • Filter by labels using disk.labels.get('backup') == 'true'
  • Create snapshot with SnapshotsClient
  • Include timestamp in snapshot name (e.g., disk-name-20240115-143022)
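
The label-filtered loop might look like this sketch (the backup: true label and the timestamp format follow the requirements above):

from datetime import datetime, timezone
from google.cloud import compute_v1

def snapshot_labeled_disks(project: str, zone: str) -> None:
    disks = compute_v1.DisksClient()
    snapshots = compute_v1.SnapshotsClient()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")

    for disk in disks.list(project=project, zone=zone):
        if disk.labels.get("backup") != "true":
            continue                                  # only labeled disks
        snap = compute_v1.Snapshot(
            name=f"{disk.name}-{stamp}",              # timestamp avoids name collisions
            source_disk=disk.self_link)
        snapshots.insert(project=project, snapshot_resource=snap).result()
        print(f"Snapshot created: {snap.name}")

snapshot_labeled_disks("my-project", "us-central1-a")   # placeholders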

Common Mistakes:

  • Snapshot names must be unique; not including timestamps causes collisions
  • Forgetting that snapshots are incremental (only first is full-size)
  • Not setting up snapshot schedule resource policies for automation

Key Concepts: Resource iteration, label-based filtering, backup strategies, incremental snapshots

ACE Mapping: 3.1 (Schedule a snapshot, create/view snapshots), 3.2 (Backing up database instances)

Done = Only labeled disks get snapshots; snapshot names include ISO timestamp; script is idempotent.

Verification Commands:

python snapshot_sentinel.py --zone=us-central1-a
gcloud compute snapshots list --filter="sourceDisk~my-disk"
# Shows snapshot with timestamp in name

Stretch: Add snapshot retention policy that deletes snapshots older than 7 days.

Cost: Snapshots ≈ $0.026/GB/month (incremental after first).


Project 6: The Spot Instance Watchdog [Phase 1, L2]

Goal: Create a Spot VM (preemptible). Write a Python script on a separate standard VM that monitors the Spot VM and automatically recreates it if preempted.

Why This Matters:

  • ACE questions frequently ask when Spot/Preemptible is appropriate
  • Understanding preemption behavior is crucial for cost optimization
  • Real architectures must handle Spot interruptions gracefully

Requirements:

  • Create Spot VM with scheduling.preemptible = True and scheduling.provisioning_model = "SPOT"
  • Monitor script polls instance status every 60 seconds
  • On TERMINATED status with preemption reason, recreate with same configuration
  • Log all state transitions with timestamps
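
A sketch of the Spot scheduling block plus a bare-bones watchdog loop; recreate_spot_vm is a placeholder for your own Project 2-style creation function, and a stricter check of the preemption reason would inspect the zone operations for compute.instances.preempted:

import time
from google.api_core.exceptions import NotFound
from google.cloud import compute_v1

PROJECT, ZONE, NAME = "my-project", "us-central1-a", "spot-vm"   # placeholders

def spot_scheduling() -> compute_v1.Scheduling:
    # Attach this to the Instance resource before calling insert().
    return compute_v1.Scheduling(
        provisioning_model="SPOT",
        preemptible=True,
        automatic_restart=False,
        on_host_maintenance="TERMINATE",
        instance_termination_action="DELETE")

def watchdog(recreate_spot_vm, poll_seconds: int = 60) -> None:
    client = compute_v1.InstancesClient()
    while True:
        try:
            status = client.get(project=PROJECT, zone=ZONE, instance=NAME).status
        except NotFound:
            status = "MISSING"        # instance already cleaned up (termination_action=DELETE)
        print(time.strftime("%H:%M:%S"), NAME, status)
        if status in ("TERMINATED", "MISSING"):
            print("Preemption suspected; recreating with the same configuration")
            recreate_spot_vm()
        time.sleep(poll_seconds)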

Common Mistakes:

  • Using Spot for stateful workloads without checkpointing
  • Not handling the 30-second shutdown warning
  • Assuming Spot VMs have a 24-hour runtime limit (that applies to legacy preemptible VMs; Spot VMs have no maximum runtime but can be preempted at any time)

Key Concepts: Spot/Preemptible VMs, failure detection, self-healing patterns, cost optimization

ACE Mapping: 2.1 (Using Spot VM instances), 3.1 (Viewing current running instances)

Done = Spot VM runs; manually triggering preemption (or waiting) results in automatic recreation within 2 minutes.

Verification Commands:

# Simulate preemption:
gcloud compute instances simulate-maintenance-event spot-vm --zone=us-central1-a
# Watch logs from monitor script showing recreation

Stretch: Implement exponential backoff for recreation failures; add Pub/Sub notification on preemption.

Cost: Spot VMs are 60-91% cheaper than on-demand. ~$0.003/hour for e2-micro.


Phase 2: Infrastructure as Code (The Declarative Shift)

Focus: Moving from imperative Python scripts to declarative IaC. Terraform is recommended for ACE exam alignment; Pulumi (Python) is an excellent alternative.

ACE Focus: Section 2.3 (VPC, firewall policies), Section 2.4 (Infrastructure as code tooling), and Sections 4.1–4.2 (IAM and service accounts).


Project 7: The Great Rewrite [Phase 2, L2]

Goal: Destroy everything from Phase 1. Recreate the VM, firewall rule, and reference your custom image using Terraform or Pulumi.

Why This Matters:

  • ACE Section 2.4 explicitly covers IaC tooling (Terraform, Config Connector, Helm)
  • Understanding state management is crucial for production IaC
  • The imperative→declarative shift is a key mental model change

Requirements:

  • Define google_compute_instance with startup script
  • Define google_compute_firewall allowing HTTP from 0.0.0.0/0
  • Reference existing custom image by name or family
  • Use variables for project, region, zone
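
If you choose the Pulumi (Python) alternative, the same pieces look roughly like the sketch below; the zone default and image name are placeholders, and the Terraform equivalents are the google_compute_* resources listed above:

import pulumi
import pulumi_gcp as gcp

cfg = pulumi.Config()
zone = cfg.get("zone") or "us-central1-a"            # variable with a default

vm = gcp.compute.Instance(
    "web-vm",
    machine_type="e2-micro",
    zone=zone,
    boot_disk=gcp.compute.InstanceBootDiskArgs(
        initialize_params=gcp.compute.InstanceBootDiskInitializeParamsArgs(
            image="my-golden-image")),               # custom image from Project 4
    network_interfaces=[gcp.compute.InstanceNetworkInterfaceArgs(
        network="default",
        access_configs=[gcp.compute.InstanceNetworkInterfaceAccessConfigArgs()])],
    metadata_startup_script="#!/bin/bash\nsystemctl enable --now nginx")

gcp.compute.Firewall(
    "allow-http",
    network="default",
    allows=[gcp.compute.FirewallAllowArgs(protocol="tcp", ports=["80"])],
    source_ranges=["0.0.0.0/0"])

pulumi.export("external_ip", vm.network_interfaces.apply(
    lambda nics: nics[0].access_configs[0].nat_ip))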

Common Mistakes:

  • Not initializing Terraform backend before apply
  • Hardcoding project ID instead of using variables
  • Forgetting that terraform destroy is needed to avoid charges

Key Concepts: Imperative vs Declarative paradigms, state management, terraform plan/apply/destroy

ACE Mapping: 2.4 (Planning and implementing resources through infrastructure as code: tooling, state management, updates)

Done = terraform apply creates identical infrastructure; terraform destroy removes all resources; terraform plan shows no drift on re-apply.

Verification Commands:

terraform init
terraform plan  # Shows 2 resources to create
terraform apply -auto-approve
curl http://$(terraform output -raw external_ip)
terraform destroy -auto-approve

Stretch: Add terraform output for external IP; implement remote state backend with GCS.

Cost: Same as Project 2. Infrastructure cost only while running.


Project 8: The Network Architect [Phase 2, L3]

Goal: Define a custom VPC with two subnets in different regions. Create a firewall rule allowing SSH only from your home IP. Deploy a VM in each subnet.

Why This Matters:

  • ACE heavily tests VPC and firewall rule configuration
  • Network isolation is fundamental to cloud security
  • Understanding CIDR ranges and firewall priority is essential

Requirements:

  • Create VPC with auto_create_subnetworks = false
  • Create subnets in us-east1 (10.0.1.0/24) and us-west1 (10.0.2.0/24)
  • Firewall rule with source_ranges = ["<your-ip>/32"]
  • VMs must communicate via internal IPs across regions (custom VPCs have no default allow rules, so add an allow-internal firewall rule; see the sketch below)
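
Continuing with the Pulumi (Python) alternative, a sketch of the network skeleton; the /32 source range is a placeholder for your own IP, and the allow-internal rule is what lets the two VMs reach each other:

import pulumi_gcp as gcp

vpc = gcp.compute.Network("lab-vpc", auto_create_subnetworks=False)

subnet_east = gcp.compute.Subnetwork(
    "subnet-east", network=vpc.id, region="us-east1", ip_cidr_range="10.0.1.0/24")
subnet_west = gcp.compute.Subnetwork(
    "subnet-west", network=vpc.id, region="us-west1", ip_cidr_range="10.0.2.0/24")

# SSH only from your own address.
gcp.compute.Firewall(
    "allow-ssh-home",
    network=vpc.id,
    direction="INGRESS",
    allows=[gcp.compute.FirewallAllowArgs(protocol="tcp", ports=["22"])],
    source_ranges=["203.0.113.10/32"])               # placeholder: your home IP

# Custom VPCs have no implicit allow rules, so internal traffic needs its own rule.
gcp.compute.Firewall(
    "allow-internal",
    network=vpc.id,
    direction="INGRESS",
    allows=[
        gcp.compute.FirewallAllowArgs(protocol="icmp"),
        gcp.compute.FirewallAllowArgs(protocol="tcp", ports=["0-65535"]),
    ],
    source_ranges=["10.0.0.0/16"])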

Common Mistakes:

  • Overlapping CIDR ranges between subnets
  • Forgetting that VPC is global but subnets are regional
  • Setting firewall priority incorrectly (lower number = higher priority)
  • Using 0.0.0.0/0 for SSH instead of restricting to your IP

Key Concepts: Custom VPCs, CIDR planning, firewall rule priority and direction, network isolation

ACE Mapping: 2.3 (Creating VPC with subnets, Cloud NGFW policies, ingress/egress rules)

Done = VMs can ping each other by internal IP; SSH works only from your IP; SSH from other IPs times out.

Verification Commands:

# From VM in us-east1:
ping 10.0.2.x  # Internal IP of us-west1 VM - should succeed
# From your machine:
ssh user@<vm-external-ip>  # Should succeed
# From a different IP (e.g., Cloud Shell in different project):
ssh user@<vm-external-ip>  # Should timeout

Stretch: Add a deny-all egress rule and allow only specific ports (80, 443).

Cost: VPC and firewall rules are free. VM costs only.


Security & Access Cluster (Projects 9–11, 14)

The next three projects (plus Project 14) form a security-focused cluster. If preparing for ACE with limited time, prioritize these.


Project 9: The Access Pattern Showdown [Phase 2, L3]

Goal: Deploy a private VM with no external IP. Connect via IAP tunneling. Then deploy a traditional Bastion host and configure SSH jumping. Compare both approaches.

Why This Matters:

  • IAP is the GCP-preferred method for secure access (frequently tested on ACE)
  • Understanding when bastion hosts are still appropriate
  • Zero-trust networking concepts

Requirements:

  • Private VM: network_interface without access_config
  • Cloud NAT for the private VM to pull updates (no external IP but can egress)
  • Enable IAP TCP forwarding firewall rule (tcp:22 from 35.235.240.0/20)
  • Connect: gcloud compute ssh --tunnel-through-iap
  • Bastion: public IP, configure ~/.ssh/config ProxyJump

Common Mistakes:

  • Opening SSH from 0.0.0.0/0 “just to test” instead of using IAP from the start
  • Using wrong IAP source range (must be exactly 35.235.240.0/20)
  • Forgetting to grant roles/iap.tunnelResourceAccessor IAM role
  • Not setting up Cloud NAT, then wondering why apt update fails

Key Concepts: IAP TCP forwarding, zero-trust access, SSH jumping, Cloud NAT for egress

ACE Mapping: 3.1 (Remotely connecting to instance), 4.1 (IAM policies), 2.3 (Firewall rules)

Done = Both methods work; can articulate when IAP is preferred (audit logging, no exposed bastion, IAM-controlled) vs bastion (legacy compliance, specific network requirements).

Verification Commands:

# IAP method:
gcloud compute ssh private-vm --tunnel-through-iap --zone=us-central1-a
# Bastion method (after configuring ~/.ssh/config):
ssh private-vm  # Automatically jumps through bastion
# Verify no external IP:
gcloud compute instances describe private-vm --format="get(networkInterfaces[0].accessConfigs)"
# Should return empty

Stretch: Enable OS Login with 2FA requirement.

Cost: Cloud NAT ≈ $0.045/hour + data processing. Medium cost if left running.


Project 10: The Least-Privilege Identity [Phase 2, L3]

Goal: Create a custom Service Account via IaC with only roles/storage.objectViewer. Attach it to a VM. Verify it can read GCS but cannot write or access Compute APIs.

Why This Matters:

  • ACE Section 4.2 is entirely about managing service accounts
  • Scopes vs. IAM roles confusion is a common exam trap
  • Principle of least privilege is fundamental to cloud security

Requirements:

  • Create SA: google_service_account
  • Bind role: google_project_iam_member with roles/storage.objectViewer
  • Attach to VM: service_account { email = ... scopes = ["cloud-platform"] }
  • Use cloud-platform scope (broad) but narrow IAM role (restrictive)

Common Mistakes:

  • Confusing scopes (legacy, ceiling) with IAM roles (actual permissions)
  • Using default compute SA instead of creating custom SA
  • Granting roles/storage.admin when only read is needed
  • Forgetting that SA changes require VM restart to take effect

Key Concepts: Service Account vs user identity, scopes vs IAM roles, principle of least privilege

ACE Mapping: 4.2 (Creating SA, minimum permissions, assigning to resources)

Done = VM can read GCS; VM cannot write to GCS or list compute instances.

Verification Commands:

# SSH to the VM, then:
gsutil cat gs://your-bucket/test-object.txt
# Output: (contents of the file)
 
gsutil cp /tmp/localfile gs://your-bucket/
# Output: AccessDeniedException: 403 ... does not have storage.objects.create access
 
gcloud compute instances list
# Output: ERROR: (gcloud.compute.instances.list) ... does not have compute.instances.list permission

Stretch: Implement service account impersonation from your local machine.

Cost: Service accounts are free. VM costs only.


Project 11: Trusted Images Policy Lockdown [Phase 2, L3]

Goal: Configure an Organization Policy that restricts the project to only use approved images. Verify that creating a VM from an unapproved image fails.

Why This Matters:

  • Supply chain security is increasingly important
  • ACE 1.1 covers applying organizational policies
  • Understanding policy inheritance and enforcement

Note: This project requires organization-level access. If you only have project-level access, read through conceptually and practice with project-level constraints (like restricting instance creation to specific zones) where possible.

Requirements:

  • Enable constraints/compute.trustedImageProjects
  • Allow only your project and projects/debian-cloud
  • Attempt to create VM from projects/ubuntu-os-cloud (should fail)
  • Document the exact error message

Common Mistakes:

  • Forgetting to include your own project in the allowed list (can’t use your custom images)
  • Not understanding policy inheritance from org → folder → project
  • Expecting policy to apply retroactively (only affects new resources)

Key Concepts: Organization Policies, Trusted Images, supply chain security, policy inheritance

ACE Mapping: 1.1 (Applying org policies to resource hierarchy), 4.1 (IAM policies)

Done = VMs from debian-cloud succeed; VMs from ubuntu-os-cloud fail with policy violation error.

Verification Commands:

# Should succeed:
gcloud compute instances create test-debian --image-project=debian-cloud --image-family=debian-11
 
# Should fail:
gcloud compute instances create test-ubuntu --image-project=ubuntu-os-cloud --image-family=ubuntu-2204-lts
# Output: ERROR: ... Constraint constraints/compute.trustedImageProjects violated

Stretch: Add constraint requiring Shielded VM at org level.

Cost: Organization policies are free.


Project 12: The Cost Watchdog [Phase 2, L2]

Goal: Write a Python Cloud Function that lists all VMs in the project. If a VM has label env:dev and current time is after 7 PM local, stop the instance.

Why This Matters:

  • FinOps is increasingly important (ACE 1.2 covers billing)
  • Serverless + scheduled automation is a common pattern
  • Label-based policies are production best practice

Requirements:

  • Trigger: Cloud Scheduler (cron: 0 19 * * *) via Pub/Sub
  • Function uses google-cloud-compute library
  • Filter by labels: instance.labels.get('env') == 'dev'
  • Log all actions to Cloud Logging with instance names and actions taken
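
A sketch of a 1st-gen, Pub/Sub-triggered function body; the project, zone, and timezone constants are placeholders that a real deployment would read from environment variables:

from datetime import datetime
from zoneinfo import ZoneInfo

from google.cloud import compute_v1

PROJECT = "my-project"                    # placeholder
ZONE = "us-central1-a"                    # placeholder
LOCAL_TZ = ZoneInfo("America/New_York")   # placeholder: use your local timezone

def stop_dev_vms(event, context):
    """Entry point for the Pub/Sub trigger fired by Cloud Scheduler at 19:00."""
    now = datetime.now(LOCAL_TZ)
    if now.hour < 19:
        print(f"{now.isoformat()}: before 7 PM local, nothing to do")
        return
    client = compute_v1.InstancesClient()
    for instance in client.list(project=PROJECT, zone=ZONE):
        if instance.labels.get("env") != "dev":
            continue
        if instance.status != "RUNNING":
            print(f"Skipping {instance.name}: already {instance.status}")
            continue
        print(f"Stopping {instance.name} (env=dev, after hours)")
        # Not waiting on the operation keeps the function fast; the stop completes asynchronously.
        client.stop(project=PROJECT, zone=ZONE, instance=instance.name)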

Common Mistakes:

  • Hardcoding timezone instead of using proper timezone handling
  • Not granting the function’s SA permission to stop instances
  • Stopping instances that are already stopped (handle gracefully)

Key Concepts: FinOps, serverless compute, scheduled automation, label-based policies

ACE Mapping: 1.2 (Billing budgets and alerts), 3.1 (Managing compute resources)

Done = Function deploys successfully; dev VMs stop after 7 PM; production VMs unaffected; Cloud Logging shows decisions.

Verification Commands:

# Deploy function
gcloud functions deploy cost-watchdog --trigger-topic=cost-watchdog-trigger ...
# Manually trigger for testing:
gcloud pubsub topics publish cost-watchdog-trigger --message="test"
# Check logs:
gcloud functions logs read cost-watchdog --limit=20
# Verify instance stopped:
gcloud compute instances describe dev-vm --format="get(status)"
# Output: TERMINATED

Stretch: Add Slack/email notification when VMs are stopped; implement weekend-only shutdown.

Cost: Cloud Functions: 2M free invocations/month. Effectively free for this use case.


Phase 3: Production Engineering

Focus: High Availability, Load Balancing, Observability, and Security Hardening.

ACE Focus: Section 2.1 (Autoscaled MIGs, instance templates), Section 2.3 (Load balancers), and Section 3.4 (Monitoring alerts, Ops Agent).


Project 13: Scalable Cluster: MIG + LB + Monitoring [Phase 3, L3]

Goal: Deploy a Managed Instance Group behind a Global HTTP(S) Load Balancer. Include Cloud Monitoring Uptime Check and Alert Policy in your IaC.

Why This Matters:

  • MIGs + LB is the canonical GCE high-availability pattern
  • ACE heavily tests this architecture (2.1, 2.3, 3.4)
  • “Monitoring as Code” ensures observability isn’t an afterthought

Requirements:

  • Instance Template with startup script serving unique instance ID in response
  • Regional MIG with target_size = 2
  • Global HTTP(S) LB with backend service pointing to MIG
  • Health check on port 80, path /health
  • Uptime Check pinging LB external IP
  • Alert Policy triggering if uptime check fails for 1 minute

Common Mistakes:

  • Health check path returning non-200 (instances marked unhealthy, never serve traffic)
  • Forgetting named port mapping on instance group
  • Not waiting for LB provisioning (can take 5-10 minutes)
  • Backend service timeout shorter than application response time

Key Concepts: Instance Templates, MIGs, Global LB architecture, health checks, Monitoring as Code

ACE Mapping: 2.1 (Creating autoscaled MIG with template), 2.3 (Deploying load balancers), 3.4 (Creating alerts based on metrics)

Done = curl to LB IP returns responses from different instances (refresh multiple times); Uptime Check shows green in console; manually killing Nginx on one VM triggers alert.

Verification Commands:

# Get LB IP:
LB_IP=$(gcloud compute forwarding-rules describe my-lb --global --format="get(IPAddress)")
# Test load balancing (should see different instance IDs):
for i in {1..10}; do curl -s http://$LB_IP | grep "Instance:"; done
# Check uptime check status:
gcloud monitoring uptime-check-configs list

Stretch: Add Cloud Armor security policy blocking specific countries or IP ranges.

Cost: Global LB ≈ $18/month minimum + data processing. MIG VMs at hourly rate. Medium-high cost if left running.


Project 14: The Hardened Vault [Phase 3, L3]

Goal: Deploy a VM with Shielded VM enabled (Secure Boot, vTPM, Integrity Monitoring) and Confidential Computing. Verify security features via Cloud Logging.

Why This Matters:

  • ACE tests Shielded VM configuration options
  • Compliance requirements increasingly mandate hardware-based security
  • Understanding integrity monitoring for detecting boot-level attacks

Requirements:

  • Enable shielded_instance_config: enable_secure_boot, enable_vtpm, enable_integrity_monitoring
  • Enable confidential_instance_config: enable_confidential_compute (requires N2D machine type)
  • Query integrity monitoring logs in Cloud Logging
  • Document that disabling Secure Boot requires VM stop

Common Mistakes:

  • Using machine type that doesn’t support Confidential Computing (N2D is the primary family; C2D also supported)
  • Assuming Secure Boot is on by default (Shielded VM features are available on most modern images, but Secure Boot typically must be explicitly enabled depending on your image and tooling)
  • Not knowing where to find integrity logs in Cloud Logging

Key Concepts: Hardware-based security, vTPM, integrity monitoring, Confidential VMs, SEV encryption

ACE Mapping: 4.1 (Security configurations), 2.1 (Shielded VM, availability policy)

Done = VM boots with all security features enabled; can find earlyBootReportEvent in Cloud Logging; attempting to change Secure Boot while running fails.

Verification Commands:

# Verify Shielded VM config:
gcloud compute instances describe hardened-vm \
  --format="get(shieldedInstanceConfig)"
# Output: enableIntegrityMonitoring: true, enableSecureBoot: true, enableVtpm: true
 
# Find integrity logs in Cloud Logging:
gcloud logging read 'resource.type="gce_instance" AND 
  logName:"compute.googleapis.com%2Fshielded_vm_integrity" AND 
  jsonPayload.earlyBootReportEvent:*' --limit=5
 
# Attempt to disable Secure Boot while running (should fail):
gcloud compute instances update hardened-vm --no-shielded-secure-boot
# Output: ERROR: Shielded VM config can only be updated when instance is stopped

Stretch: Simulate integrity violation by modifying boot config; observe alert.

Cost: N2D machine types are priced slightly below comparable N2 types; enabling Confidential Computing adds a premium on top of the base VM price.


Project 15: The Resiliency Auditor [Phase 3, L4]

Goal: Write a Python chaos script that randomly deletes a VM in your MIG. The script must measure and report exact recovery time.

Why This Matters:

  • Chaos engineering validates your self-healing architecture
  • Measuring recovery time gives you concrete SLO data
  • Understanding MIG autohealing behavior is crucial for production

Requirements:

  • Randomly select one instance from the MIG using InstanceGroupManagersClient.list_managed_instances()
  • Delete it via InstancesClient.delete(), not the MIG API, to simulate an unexpected failure
  • Poll MIG status every second using InstanceGroupManagersClient.list_managed_instances() (see the sketch below)
  • Log: deletion time, time MIG detected unhealthy, time replacement reached RUNNING
  • Output: Total Recovery Time: X seconds
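
A condensed sketch of the measurement loop for a zonal MIG; a fuller version would also record when the MIG marks the group unhealthy and when the load balancer health check starts passing again:

import random
import time
from google.cloud import compute_v1

def chaos_round(project: str, zone: str, mig: str) -> float:
    mig_client = compute_v1.InstanceGroupManagersClient()
    instances = compute_v1.InstancesClient()

    managed = list(mig_client.list_managed_instances(
        project=project, zone=zone, instance_group_manager=mig))
    victim_url = random.choice(managed).instance       # full self-link
    victim_name = victim_url.rsplit("/", 1)[-1]

    print(f"[{time.strftime('%H:%M:%S')}] Deleting {victim_name} (bypassing the MIG API)")
    start = time.monotonic()
    instances.delete(project=project, zone=zone, instance=victim_name).result()

    while True:                     # poll every second until the group is whole again
        current = list(mig_client.list_managed_instances(
            project=project, zone=zone, instance_group_manager=mig))
        names = {m.instance for m in current}
        if victim_url not in names and all(
                m.instance_status == "RUNNING" for m in current):
            break
        time.sleep(1)

    recovery = time.monotonic() - start
    print(f"Total Recovery Time: {recovery:.0f} seconds")
    return recovery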

Common Mistakes:

  • Deleting via MIG API (which gracefully drains) instead of simulating crash
  • Not accounting for health check interval in recovery time
  • Polling too infrequently to get accurate measurements

Key Concepts: Chaos engineering, self-healing infrastructure, measuring availability, autohealing health checks

ACE Mapping: 3.1 (Managing compute resources), 2.1 (MIG autohealing)

Done = Script outputs recovery timeline; MIG automatically recreates instance; typical recovery time documented (usually 60-120 seconds depending on health check settings).

Verification Commands:

python chaos_auditor.py --mig=my-mig --zone=us-central1-a
# Output:
# [14:32:01] Deleted instance: my-mig-abc123
# [14:32:15] MIG detected unhealthy state
# [14:32:45] Replacement instance created: my-mig-def456
# [14:33:12] Replacement instance RUNNING
# [14:33:18] Health check passed, serving traffic
# Total Recovery Time: 77 seconds

Stretch: Run chaos script 10 times and calculate mean/p95 recovery time.

Cost: Same as Project 13 (MIG infrastructure).


Project 16: The Autoscaling Stress Test [Phase 3, L4]

Goal: Configure CPU-based autoscaling on your MIG (target 60%). Use stress-ng to spike CPU and observe automatic scale-out and scale-in.

Why This Matters:

  • ACE 2.1 explicitly covers creating autoscaled MIGs
  • Understanding cooldown periods prevents thrashing
  • Autoscaler behavior with different metrics (CPU, LB utilization, custom)

Requirements:

  • Configure autoscaler: min_replicas=2, max_replicas=6, cpu_utilization_target=0.6
  • Install stress-ng via startup script
  • Run: stress-ng --cpu 4 --timeout 300s on all instances
  • Monitor instance count via API or Console
  • Observe scale-in after stress stops (respects cooldown)

Common Mistakes:

  • Setting CPU target too low (constant thrashing)
  • Not understanding stabilization/cooldown periods
  • Expecting immediate scale-in (default cooldown is 10 minutes)

Key Concepts: Horizontal scaling, autoscaler policies (CPU, LB, custom metrics), cooldown periods

ACE Mapping: 2.1 (Creating autoscaled MIG), 3.4 (Cloud Monitoring metrics)

Done = MIG scales from 2→5+ during stress; scales back to 2 after cooldown (10+ minutes); scaling events visible in Monitoring.

Verification Commands:

# Start stress on all instances:
for vm in $(gcloud compute instance-groups managed list-instances my-mig --zone=us-central1-a --format="value(instance.basename())"); do
  gcloud compute ssh $vm --zone=us-central1-a --command="nohup stress-ng --cpu 4 --timeout 300s >/dev/null 2>&1 &"
done
 
# Watch instance count:
watch -n 10 'gcloud compute instance-groups managed describe my-mig --zone=us-central1-a --format="get(targetSize)"'
# Should increase from 2 to 5-6
 
# After stress stops, wait 10+ minutes for scale-in

Stretch: Configure schedule-based autoscaling for predictable load patterns.

Cost: Autoscaling to 6 instances = 3x base cost during stress test.


Phase 4: SRE & Disaster Recovery

Focus: Production-grade reliability, disaster recovery, and CI/CD pipelines.

ACE Focus: Section 2.2 (Multi-region redundancy), Section 2.4 (IaC versioning and updates), and Section 3.4 (Cloud diagnostics, log routers).


Project 17: Cross-Region Disaster Recovery [Phase 4, L4]

Goal: Deploy a primary application in us-east1. Write a Disaster Recovery script that restores service in us-west1 from the latest snapshot within 10 minutes.

Why This Matters:

  • ACE tests understanding of RTO/RPO concepts
  • Cross-region redundancy is required for true disaster recovery
  • Programmatic DR enables faster recovery than manual runbooks

Requirements:

  • Primary: VM + disk with a snapshot schedule whose storage location includes the DR region (multi-region or us-west1)
  • Snapshot schedules are configured via resource policies (google_compute_resource_policy in Terraform)
  • DR script: find the latest snapshot with SnapshotsClient.list(), filtered by source disk (see the helper sketch below)
  • Create disk from snapshot in us-west1
  • Create VM from disk, configure networking/firewall
  • Verify HTTP response within 10-minute RTO
  • Script must fail (exit 1) if RTO exceeded
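
Two helpers the DR script might be built from, shown as a sketch; creationTimestamp is parsed rather than compared as a raw string because it carries a UTC offset:

from datetime import datetime
from google.cloud import compute_v1

def latest_snapshot(project: str, source_disk: str) -> compute_v1.Snapshot:
    """Return the most recent snapshot taken from the given disk name."""
    client = compute_v1.SnapshotsClient()
    candidates = [s for s in client.list(project=project)
                  if s.source_disk.endswith(f"/disks/{source_disk}")]
    if not candidates:
        raise RuntimeError(f"No snapshots found for disk {source_disk}")
    return max(candidates,
               key=lambda s: datetime.fromisoformat(s.creation_timestamp))

def restore_disk(project: str, target_zone: str,
                 snapshot: compute_v1.Snapshot, disk_name: str) -> compute_v1.Disk:
    """Materialize the snapshot as a new disk in the DR zone."""
    client = compute_v1.DisksClient()
    disk = compute_v1.Disk(name=disk_name, source_snapshot=snapshot.self_link)
    client.insert(project=project, zone=target_zone, disk_resource=disk).result()
    return client.get(project=project, zone=target_zone, disk=disk_name)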

Common Mistakes:

  • Snapshot storage location not including DR region
  • Not accounting for disk creation time from snapshot
  • Forgetting to recreate firewall rules in DR region
  • DNS/IP changes not accounted for in recovery

Key Concepts: RTO (Recovery Time Objective), RPO (Recovery Point Objective), cross-region replication, snapshot storage locations

ACE Mapping: 3.1 (Working with snapshots), 2.2 (Multi-region redundancy)

Done = DR script completes in < 10 minutes; new VM serves application; script logs RTO achievement.

Verification Commands:

python disaster_recovery.py --source-region=us-east1 --target-region=us-west1
# Output:
# [14:00:00] Starting DR procedure
# [14:00:05] Found latest snapshot: my-disk-20240115-120000
# [14:01:30] Disk created in us-west1 from snapshot
# [14:03:45] VM created and booting
# [14:05:12] Health check passed - application responding
# [14:05:12] DR SUCCESSFUL - RTO: 5m 12s (target: 10m)

Stretch: Implement automated failover triggered by health check failure + Pub/Sub.

Cost: Cross-region data transfer ≈ $0.08/GB. Snapshot storage in multiple regions doubles storage cost. Plan carefully.


Project 18: The Golden Pipeline [Phase 4, L4]

Goal: Set up Cloud Build to trigger on GitHub push. Pipeline bakes a new image, updates Instance Template, and performs rolling MIG update with zero downtime.

Why This Matters:

  • ACE 2.4 covers IaC deployments including versioning and updates
  • Immutable infrastructure via image baking is a production best practice
  • Rolling updates maintain availability during deployments

Requirements:

  • Cloud Build trigger on push to main branch
  • Build step 1: Build application, create new image with version tag
  • Build step 2: Create new Instance Template referencing new image
  • Build step 3: Start rolling update on MIG with max_surge=1, max_unavailable=0
  • Build step 4: Clean up Instance Templates older than 5 versions

Common Mistakes:

  • Not using max_unavailable=0 (allows brief downtime)
  • Image naming collisions (must include unique identifier like commit SHA)
  • Not cleaning up old templates (clutter + potential cost)

Key Concepts: GitOps, immutable deployments, rolling updates, zero-downtime deployments

ACE Mapping: 2.4 (Planning and implementing resources through IaC: versioning, updates), 3.1 (Managing compute resources)

Done = Git push triggers build; new image created; MIG updates with zero dropped requests; build history shows success.

Verification Commands:

# Push a change:
git commit -am "Update index.html" && git push
 
# Watch Cloud Build:
gcloud builds list --limit=1
 
# Monitor rolling update:
gcloud compute instance-groups managed describe my-mig --zone=us-central1-a \
  --format="get(status.versionTarget)"
 
# Verify zero downtime (run during update):
while true; do curl -s -o /dev/null -w "%{http_code}\n" http://$LB_IP; sleep 0.5; done
# Should show continuous 200s

Stretch: Add canary deployment: route 10% traffic to new version, promote after health check passes.

Cost: Cloud Build: 120 free build-minutes/day. Image storage per version.


Project 19: The Observability Stack [Phase 4, L4]

Goal: Deploy comprehensive observability: Ops Agent for metrics/logs, custom metrics from application, log-based metrics, and a Cloud Monitoring dashboard.

Why This Matters:

  • ACE 3.4 extensively covers monitoring and logging
  • Ops Agent is the modern replacement for legacy agents
  • Custom metrics enable application-specific observability

Requirements:

  • Install Ops Agent via startup script (official installation script)
  • Application writes custom metric custom.googleapis.com/myapp/request_count to Monitoring API
  • Create log-based metric counting severity>=ERROR from application logs
  • Build Monitoring dashboard with 4 widgets: CPU, memory, custom metric, error rate
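
Writing the custom metric from the application follows the standard monitoring_v3 pattern; a sketch is shown below (the metric type matches the requirement above, and the gce_instance resource labels are what tie the data point to your VM):

import time
from google.cloud import monitoring_v3

def report_request_count(project_id: str, instance_id: str, zone: str, value: int) -> None:
    client = monitoring_v3.MetricServiceClient()

    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/myapp/request_count"
    series.resource.type = "gce_instance"
    series.resource.labels["project_id"] = project_id
    series.resource.labels["instance_id"] = instance_id
    series.resource.labels["zone"] = zone

    now = time.time()
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}})
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"int64_value": value}})
    series.points = [point]

    client.create_time_series(name=f"projects/{project_id}", time_series=[series])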

Common Mistakes:

  • Using legacy monitoring agent instead of Ops Agent
  • Custom metric descriptor not created before writing data
  • Log-based metric filter syntax errors
  • Dashboard JSON/Terraform syntax for complex layouts

Key Concepts: Ops Agent, custom metrics, log-based metrics, dashboards as code, metric descriptors

ACE Mapping: 3.4 (Monitoring and logging: deploying Ops Agent, creating custom metrics, configuring log routers, creating alerts)

Done = Dashboard shows all four metric types updating live; can correlate CPU spike with error rate increase.

Verification Commands:

# Verify Ops Agent running:
gcloud compute ssh my-vm --command="sudo systemctl status google-cloud-ops-agent"
 
# Check custom metric exists:
gcloud monitoring metrics-descriptors list \
  --filter="type=custom.googleapis.com/myapp/request_count"
 
# View dashboard (get URL):
gcloud monitoring dashboards list --format="get(name)"

Stretch: Export logs to BigQuery; create SQL-based analysis of error patterns.

Cost: Cloud Logging ingestion has a free allotment (50 GiB/project/month). Custom metrics: first 150 MiB of chargeable metric data free. Dashboards free.


Project 20: The Incident Response Drill [Phase 4, L4]

Goal: Simulate a production incident: degrade application, observe alerts fire, follow runbook to diagnose, remediate, and write a post-mortem.

Why This Matters:

  • Incident response skills are crucial for production engineering
  • Post-mortem culture drives continuous improvement
  • ACE 3.4 covers using diagnostics to research application issues

Requirements:

  • Inject fault: choose one of
    • Fill disk: dd if=/dev/zero of=/tmp/fill bs=1M count=10000
    • Exhaust memory: stress-ng --vm 2 --vm-bytes 90%
    • Break health check: sudo systemctl stop nginx
  • Observe: alert fires, MIG attempts recreation (if applicable), LB drains traffic
  • Diagnose: use Cloud Logging, serial console, SSH to identify root cause
  • Remediate: fix root cause or rollback
  • Write post-mortem using template: timeline, impact, root cause, action items

Common Mistakes:

  • Not having alerts configured before the drill
  • Remediation that fixes symptom but not root cause
  • Post-mortem that assigns blame instead of identifying systemic issues

Key Concepts: Incident response, post-mortem culture, runbooks, blameless retrospectives

ACE Mapping: 3.4 (Cloud diagnostics, Cloud Trace, viewing logs), 3.1 (Managing compute resources)

Done = Alert fired and acknowledged; incident timeline documented to the minute; post-mortem includes 3+ concrete action items; runbook updated with new failure mode.

Deliverables:

## Post-Mortem: [Date] Disk Full Incident
 
### Timeline
- 14:32 - Disk usage alert fired (>90%)
- 14:35 - On-call acknowledged
- 14:38 - Root cause identified: log rotation misconfigured
- 14:42 - Temporary fix: deleted old logs
- 14:45 - Service fully recovered
 
### Impact
- 10 minutes degraded performance
- 0 data loss
 
### Root Cause
Log rotation cron job was set to weekly instead of daily.
 
### Action Items
1. [ ] Fix log rotation config in base image
2. [ ] Add disk usage alert at 70% (earlier warning)
3. [ ] Document log locations in runbook

Stretch: Implement automated remediation for the specific failure mode (e.g., auto-delete old logs when disk > 80%).

Cost: No additional cost beyond existing infrastructure.


Appendix A: ACE Exam Coverage Matrix

This roadmap covers the following ACE exam objectives. Weights are from the official exam guide; project mappings are approximate.

| ACE Section | Weight | Projects |
| --- | --- | --- |
| 1. Setting up cloud environment | ~23% | 1, 7, 8, 11, 12 |
| 2. Planning/implementing solution | ~30% | 2–6, 8, 13, 16, 17, 18 |
| 3. Ensuring successful operation | ~27% | 3–5, 12, 15, 18–20 |
| 4. Configuring access/security | ~20% | 9–11, 14 |

High-Yield ACE Practice Set: If you’re short on time, prioritize Projects 2–5, 8–11, 13, and 16. These cover the core 2.x (compute, networking, IaC) and 3.x (operations, monitoring) objectives that make up ~57% of the exam.


Appendix B: Key GCE Features Covered

| Feature | Project(s) |
| --- | --- |
| Machine Type Selection | 1, 2, 14 |
| Persistent Disk (zonal, regional) | 3, 5, 17 |
| Custom Images & Snapshots | 4, 5, 17, 18 |
| Spot/Preemptible VMs | 6 |
| Custom VPC & Firewall Rules | 8, 9 |
| IAP TCP Forwarding | 9 |
| Service Accounts & IAM | 10, 11, 12 |
| Organization Policies | 11 |
| MIGs & Instance Templates | 13, 15, 16, 18 |
| Load Balancing (HTTP/S) | 13 |
| Shielded & Confidential VMs | 14 |
| Autoscaling Policies | 16 |
| Ops Agent & Custom Metrics | 19 |
| Cloud Build CI/CD | 18 |

Appendix C: Documentation Requirements

For each project, create:

  1. README.md — Architecture diagram (even ASCII art), key commands, failure modes, recovery steps
  2. Code files — Python scripts, Terraform/Pulumi configs, Cloud Build YAML, all version controlled
  3. Post-mortem (Projects 15, 20) — Timeline, root cause, 3+ action items
  4. Cost log — Actual cost incurred, compared to estimate

Appendix D: Cost Summary

| Cost Level | Projects | Estimated Cost |
| --- | --- | --- |
| Free/Minimal | 1–5, 7–8, 10–12, 20 | < $5 total if deleted promptly |
| Low | 6, 14 | $5–15 if run for a day |
| Medium | 9, 15–16, 18–19 | $15–40 if run for a day |
| Higher | 13, 17 | $40–100 if run for a day (Global LB + cross-region) |

Critical: Always run terraform destroy or equivalent cleanup after each project session. Set a calendar reminder!


Last updated: January 2026