20 Projects: From Python SDK to Production Engineering

Aligned with Google Cloud Associate Cloud Engineer (ACE) Certification


Cost Notice: Most projects can be completed within the free tier or for a few dollars if you delete resources promptly. Phase 3–4 projects (MIGs, Load Balancers, cross-region DR) can cost more if left running. Each project includes a cost estimate.


Quick Start Guide

| Your Background | Start Here | Time Estimate |
| --- | --- | --- |
| New to GCP | Projects 1–3 (L1: Single VM basics) | 1 weekend |
| Know Python, new to GCE | Projects 2–6 (L1–L2: SDK automation) | 1 weekend |
| Comfortable with Terraform | Projects 8–14 (L2–L3: IaC + Security) | 1 week |
| Production experience | Projects 15–20 (L3–L4: SRE + DR) | 1 week |
| 2 weeks until ACE exam | Projects 2–5, 8–11, 13, 16 | Focus path |
| 1 weekend crash course | Projects 1–6 | Essentials only |

Level Definitions

  • L1 — Single VM, simple firewall, manual/scripted operations
  • L2 — MIGs, basic load balancing, Python SDK automation
  • L3 — IaC (Terraform/Pulumi), CI/CD, security hardening
  • L4 — SRE features (autoscaling, SLOs, incident response, DR)

Interaction Philosophy

Primary interactions via Python SDK and IaC tools; Console used mainly for verification and dashboards.


Phase 1: The Python Infrastructure Builder

Focus: Mastering the google-cloud-compute Python client and understanding resource lifecycles.

ACE Focus: Section 2.1 (Launching compute instances, choosing storage) and Section 3.1 (Working with snapshots and images).


Project 1: The One-Time Manual Launch [Phase 1, L1]

Goal: Spin up a VM manually in the Console to observe the default Service Account, scopes, network tags, and firewall rules.

Why This Matters:

  • ACE questions often test your knowledge of default configurations
  • Understanding defaults helps you know what to explicitly override in automation

Requirements:

  • Note the default compute service account email format: PROJECT_NUMBER-compute@developer.gserviceaccount.com
  • Identify which APIs are enabled by default
  • Delete the VM immediately after observation

Common Mistakes:

  • Forgetting to note the default network tags (none) vs. the default firewall rules
  • Not checking the Service Account scopes vs. IAM roles distinction (the Console's "Allow default access" grants a narrow legacy scope set such as devstorage.read_only and logging.write; "Allow full access to all Cloud APIs" grants the broad cloud-platform scope and relies on IAM roles to limit access)

Key Concepts: Console navigation, default resource values, VM lifecycle states

ACE Mapping: 1.1 (Setting up projects), 2.1 (Launching compute instance)

Done = Can describe the default SA format, default network tags, and why the VM has no external HTTP access initially.

Stretch: Enable OS Login and note how SSH key management changes.

Cost: Free tier eligible. ~$0 if deleted within minutes.


Project 2: The Python Spawner [Phase 1, L1]

Goal: Write a Python script (create_vm.py) using google-cloud-compute that creates a VM with an injected startup script that installs Nginx and writes a custom index.html.

Why This Matters:

  • The foundation for all programmatic GCE management
  • ACE tests your ability to launch instances with correct configurations
  • Startup scripts are the simplest form of bootstrapping

Requirements:

  • Script must use InstancesClient from google.cloud.compute_v1
  • Startup script injected via metadata key startup-script
  • Script must wait for operation completion using operation.result()
  • Print the external IP when VM is RUNNING
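
A minimal sketch of those requirements with google.cloud.compute_v1; the project, zone, VM name, and the Debian image family are placeholders to adjust:

from google.cloud import compute_v1

PROJECT, ZONE, NAME = "my-project", "us-central1-a", "my-vm"      # placeholders
STARTUP = (
    "#!/bin/bash\n"
    "apt-get update && apt-get install -y nginx\n"
    "echo '<html>Hello from GCE!</html>' > /var/www/html/index.html\n"
)

instance = compute_v1.Instance()
instance.name = NAME
instance.machine_type = f"zones/{ZONE}/machineTypes/e2-micro"

boot = compute_v1.AttachedDisk()
boot.boot = True
boot.auto_delete = True
boot.initialize_params = compute_v1.AttachedDiskInitializeParams(
    source_image="projects/debian-cloud/global/images/family/debian-12")
instance.disks = [boot]

nic = compute_v1.NetworkInterface()
nic.network = "global/networks/default"
external = compute_v1.AccessConfig()
external.type_ = "ONE_TO_ONE_NAT"            # allocates an ephemeral external IP
nic.access_configs = [external]
instance.network_interfaces = [nic]

# The metadata key must be exactly "startup-script".
instance.metadata = compute_v1.Metadata(
    items=[compute_v1.Items(key="startup-script", value=STARTUP)])

client = compute_v1.InstancesClient()
client.insert(project=PROJECT, zone=ZONE, instance_resource=instance).result()

created = client.get(project=PROJECT, zone=ZONE, instance=NAME)
# nat_i_p is the proto-plus name for the REST natIP field.
print(f"Created instance '{NAME}' with external IP:",
      created.network_interfaces[0].access_configs[0].nat_i_p)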

Common Mistakes:

  • Not waiting for the operation to complete before querying for IP
  • Forgetting to create firewall rule for HTTP (port 80)
  • Using wrong metadata key (startup_script vs startup-script)

Key Concepts: Asynchronous API operations, Instance resource structure, metadata bootstrapping

ACE Mapping: 2.1 (Launching compute instance, SSH keys, availability policy)

Done = Running python create_vm.py creates a VM; curl <external-ip> returns custom HTML; script handles errors gracefully.

Verification Commands:

python create_vm.py
# Output: Created instance 'my-vm' with external IP: 35.x.x.x
curl http://35.x.x.x
# Output: <html>Hello from GCE!</html>

Stretch: Add argument parsing for zone, machine type, and custom HTML content.

Cost: e2-micro is free tier eligible. ~$0.01/hour otherwise.


Project 3: The Disk Lifecycle Manager [Phase 1, L1]

Goal: Write a Python script that creates a standalone Persistent Disk, attaches it to your running VM, and formats/mounts it remotely without rebooting.

Why This Matters:

  • ACE frequently tests disk attachment, detachment, and lifecycle
  • Understanding that disks and VMs have independent lifecycles is crucial
  • Day-2 operations (modifying running infrastructure) are production reality

Requirements:

  • Create a 10GB pd-standard disk via DisksClient
  • Attach to VM using AttachedDisk with mode=READ_WRITE
  • Use subprocess with gcloud compute ssh --command to run mkfs.ext4 and mount
  • Alternative: use paramiko for direct SSH
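
One way the create/attach/mount flow can look, assuming the Project 2 VM is still running and using a hypothetical disk name of data-disk:

import subprocess
from google.cloud import compute_v1

PROJECT, ZONE, VM, DISK = "my-project", "us-central1-a", "my-vm", "data-disk"  # placeholders

disk = compute_v1.Disk()
disk.name = DISK
disk.size_gb = 10
disk.type_ = f"zones/{ZONE}/diskTypes/pd-standard"
compute_v1.DisksClient().insert(project=PROJECT, zone=ZONE, disk_resource=disk).result()

# An explicit device_name makes the disk appear as /dev/disk/by-id/google-data-disk.
attached = compute_v1.AttachedDisk(
    source=f"projects/{PROJECT}/zones/{ZONE}/disks/{DISK}",
    mode="READ_WRITE",
    device_name=DISK)
compute_v1.InstancesClient().attach_disk(
    project=PROJECT, zone=ZONE, instance=VM,
    attached_disk_resource=attached).result()

# Format and mount remotely; only ever format a blank disk.
subprocess.run([
    "gcloud", "compute", "ssh", VM, f"--zone={ZONE}", "--command",
    f"sudo mkfs.ext4 -F /dev/disk/by-id/google-{DISK} && "
    f"sudo mkdir -p /mnt/data && sudo mount /dev/disk/by-id/google-{DISK} /mnt/data",
], check=True)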

Common Mistakes:

  • Forgetting to unmount before detaching (causes data corruption)
  • Assuming disk auto-detaches when VM stops (it doesn’t by default)
  • Not specifying device_name and then struggling to find it in /dev/
  • Formatting a disk that already has data (always check first)

Key Concepts: Day-2 operations, separate lifecycles for compute vs storage, live disk attachment

ACE Mapping: 2.1 (Choosing storage: zonal PD, regional PD, Hyperdisk), 3.1 (Working with snapshots)

Done = Disk is attached and mounted at /mnt/data; writing a file to /mnt/data persists across VM stop/start.

Verification Commands:

# On the VM after script runs:
lsblk  # Shows new disk
df -h /mnt/data  # Shows mounted filesystem
echo "test" > /mnt/data/testfile
# Stop and start VM, then:
cat /mnt/data/testfile  # Returns "test"

Stretch: Implement disk detachment and reattachment to a different VM.

Cost: 10GB pd-standard ≈ $0.01 for a few hours.


Project 4: The Image Bakery [Phase 1, L2]

Goal: Write a script that stops your VM from Project 2, creates a Custom Image from its boot disk, and deletes the original VM.

Why This Matters:

  • “Golden Images” are foundational to immutable infrastructure
  • ACE tests image creation, families, and deprecation policies
  • Faster instance startup vs. startup scripts every time

Requirements:

  • Stop instance using InstancesClient.stop()
  • Create image using ImagesClient.insert()
  • Wait for image to be READY before deleting VM
  • Output the image self_link for future use
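
A sketch of the stop/bake/delete sequence, assuming the boot disk shares the VM's name (the default when the boot disk was auto-created):

from google.cloud import compute_v1

PROJECT, ZONE, VM = "my-project", "us-central1-a", "my-vm"   # placeholders

instances = compute_v1.InstancesClient()
instances.stop(project=PROJECT, zone=ZONE, instance=VM).result()

image = compute_v1.Image(
    name="my-golden-image",
    family="my-golden",                        # families make "latest" lookups easy
    source_disk=f"projects/{PROJECT}/zones/{ZONE}/disks/{VM}")
images = compute_v1.ImagesClient()
images.insert(project=PROJECT, image_resource=image).result()   # images are global

baked = images.get(project=PROJECT, image="my-golden-image")
print("Image ready:", baked.status, baked.self_link)            # expect READY

instances.delete(project=PROJECT, zone=ZONE, instance=VM).result()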

Common Mistakes:

  • Trying to create image from running VM (must be stopped or use snapshots)
  • Not using image families for versioning
  • Forgetting that images are global resources even though they are created from zonal (or regional) disks

Key Concepts: Immutable infrastructure, Golden Images, image families

ACE Mapping: 3.1 (Create, view, delete images), 2.1 (Boot disk image)

Done = Custom image exists; launching new VM from this image serves the same Nginx page without any startup script.

Verification Commands:

gcloud compute images list --filter="name~my-golden"
gcloud compute instances create test-from-image --image=my-golden-image
curl http://<new-vm-ip>  # Returns same HTML without startup script

Stretch: Implement image family versioning with automatic deprecation of old images.

Cost: Image storage ≈ $0.05/GB/month. Minimal for small images.


Project 5: The Snapshot Sentinel [Phase 1, L2]

Goal: Write a Python function that iterates through all disks in a zone and triggers immediate snapshots for disks labeled backup: true.

Why This Matters:

  • ACE tests snapshot scheduling and retention
  • Label-based resource management is a production best practice
  • Understanding snapshot vs. image use cases

Requirements:

  • Use DisksClient.list() to enumerate disks
  • Filter by labels using disk.labels.get('backup') == 'true'
  • Create snapshot with SnapshotsClient
  • Include timestamp in snapshot name (e.g., disk-name-20240115-143022)
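
The label-filtered loop might look like this sketch (the backup: true label and the timestamp format follow the requirements above):

from datetime import datetime, timezone
from google.cloud import compute_v1

def snapshot_labeled_disks(project: str, zone: str) -> None:
    disks = compute_v1.DisksClient()
    snapshots = compute_v1.SnapshotsClient()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")

    for disk in disks.list(project=project, zone=zone):
        if disk.labels.get("backup") != "true":
            continue                                  # only labeled disks
        snap = compute_v1.Snapshot(
            name=f"{disk.name}-{stamp}",              # timestamp avoids name collisions
            source_disk=disk.self_link)
        snapshots.insert(project=project, snapshot_resource=snap).result()
        print(f"Snapshot created: {snap.name}")

snapshot_labeled_disks("my-project", "us-central1-a")   # placeholders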

Common Mistakes:

  • Snapshot names must be unique; not including timestamps causes collisions
  • Forgetting that snapshots are incremental (only first is full-size)
  • Not setting up snapshot schedule resource policies for automation

Key Concepts: Resource iteration, label-based filtering, backup strategies, incremental snapshots

ACE Mapping: 3.1 (Schedule a snapshot, create/view snapshots), 3.2 (Backing up database instances)

Done = Only labeled disks get snapshots; snapshot names include ISO timestamp; script is idempotent.

Verification Commands:

python snapshot_sentinel.py --zone=us-central1-a
gcloud compute snapshots list --filter="sourceDisk~my-disk"
# Shows snapshot with timestamp in name

Stretch: Add snapshot retention policy that deletes snapshots older than 7 days.

Cost: Snapshots ≈ $0.026/GB/month (incremental after first).


Project 6: The Spot Instance Watchdog [Phase 1, L2]

Goal: Create a Spot VM (preemptible). Write a Python script on a separate standard VM that monitors the Spot VM and automatically recreates it if preempted.

Why This Matters:

  • ACE questions frequently ask when Spot/Preemptible is appropriate
  • Understanding preemption behavior is crucial for cost optimization
  • Real architectures must handle Spot interruptions gracefully

Requirements:

  • Create Spot VM with scheduling.preemptible = True and scheduling.provisioning_model = "SPOT"
  • Monitor script polls instance status every 60 seconds
  • On TERMINATED status with preemption reason, recreate with same configuration
  • Log all state transitions with timestamps
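
A sketch of the Spot scheduling block plus a bare-bones watchdog loop; recreate_spot_vm is a placeholder for your own Project 2-style creation function, and a stricter check of the preemption reason would inspect the zone operations for compute.instances.preempted:

import time
from google.api_core.exceptions import NotFound
from google.cloud import compute_v1

PROJECT, ZONE, NAME = "my-project", "us-central1-a", "spot-vm"   # placeholders

def spot_scheduling() -> compute_v1.Scheduling:
    # Attach this to the Instance resource before calling insert().
    return compute_v1.Scheduling(
        provisioning_model="SPOT",
        preemptible=True,
        automatic_restart=False,
        on_host_maintenance="TERMINATE",
        instance_termination_action="DELETE")

def watchdog(recreate_spot_vm, poll_seconds: int = 60) -> None:
    client = compute_v1.InstancesClient()
    while True:
        try:
            status = client.get(project=PROJECT, zone=ZONE, instance=NAME).status
        except NotFound:
            status = "MISSING"        # instance already cleaned up (termination_action=DELETE)
        print(time.strftime("%H:%M:%S"), NAME, status)
        if status in ("TERMINATED", "MISSING"):
            print("Preemption suspected; recreating with the same configuration")
            recreate_spot_vm()
        time.sleep(poll_seconds)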

Common Mistakes:

  • Using Spot for stateful workloads without checkpointing
  • Not handling the 30-second shutdown warning
  • Assuming Spot VMs have a 24-hour runtime limit (that applies to legacy preemptible VMs; Spot VMs have no maximum runtime but can be preempted at any time)

Key Concepts: Spot/Preemptible VMs, failure detection, self-healing patterns, cost optimization

ACE Mapping: 2.1 (Using Spot VM instances), 3.1 (Viewing current running instances)

Done = Spot VM runs; manually triggering preemption (or waiting) results in automatic recreation within 2 minutes.

Verification Commands:

# Simulate preemption:
gcloud compute instances simulate-maintenance-event spot-vm --zone=us-central1-a
# Watch logs from monitor script showing recreation

Stretch: Implement exponential backoff for recreation failures; add Pub/Sub notification on preemption.

Cost: Spot VMs are 60-91% cheaper than on-demand. ~$0.003/hour for e2-micro.


Phase 2: Infrastructure as Code (The Declarative Shift)

Focus: Moving from imperative Python scripts to declarative IaC. Terraform is recommended for ACE exam alignment; Pulumi (Python) is an excellent alternative.

ACE Focus: Section 2.3 (VPC, firewall policies), Section 2.4 (Infrastructure as code tooling), and Sections 4.1–4.2 (IAM and service accounts).


Project 7: The Great Rewrite [Phase 2, L2]

Goal: Destroy everything from Phase 1. Recreate the VM, firewall rule, and reference your custom image using Terraform or Pulumi.

Why This Matters:

  • ACE Section 2.4 explicitly covers IaC tooling (Terraform, Config Connector, Helm)
  • Understanding state management is crucial for production IaC
  • The imperative→declarative shift is a key mental model change

Requirements:

  • Define google_compute_instance with startup script
  • Define google_compute_firewall allowing HTTP from 0.0.0.0/0
  • Reference existing custom image by name or family
  • Use variables for project, region, zone
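
If you choose the Pulumi (Python) alternative, the same pieces look roughly like the sketch below; the zone default and image name are placeholders, and the Terraform equivalents are the google_compute_* resources listed above:

import pulumi
import pulumi_gcp as gcp

cfg = pulumi.Config()
zone = cfg.get("zone") or "us-central1-a"            # variable with a default

vm = gcp.compute.Instance(
    "web-vm",
    machine_type="e2-micro",
    zone=zone,
    boot_disk=gcp.compute.InstanceBootDiskArgs(
        initialize_params=gcp.compute.InstanceBootDiskInitializeParamsArgs(
            image="my-golden-image")),               # custom image from Project 4
    network_interfaces=[gcp.compute.InstanceNetworkInterfaceArgs(
        network="default",
        access_configs=[gcp.compute.InstanceNetworkInterfaceAccessConfigArgs()])],
    metadata_startup_script="#!/bin/bash\nsystemctl enable --now nginx")

gcp.compute.Firewall(
    "allow-http",
    network="default",
    allows=[gcp.compute.FirewallAllowArgs(protocol="tcp", ports=["80"])],
    source_ranges=["0.0.0.0/0"])

pulumi.export("external_ip", vm.network_interfaces.apply(
    lambda nics: nics[0].access_configs[0].nat_ip))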

Common Mistakes:

  • Not initializing Terraform backend before apply
  • Hardcoding project ID instead of using variables
  • Forgetting that terraform destroy is needed to avoid charges

Key Concepts: Imperative vs Declarative paradigms, state management, terraform plan/apply/destroy

ACE Mapping: 2.4 (Planning and implementing resources through infrastructure as code: tooling, state management, updates)

Done = terraform apply creates identical infrastructure; terraform destroy removes all resources; terraform plan shows no drift on re-apply.

Verification Commands:

terraform init
terraform plan  # Shows 2 resources to create
terraform apply -auto-approve
curl http://$(terraform output -raw external_ip)
terraform destroy -auto-approve

Stretch: Add terraform output for external IP; implement remote state backend with GCS.

Cost: Same as Project 2. Infrastructure cost only while running.


Project 8: The Network Architect [Phase 2, L3]

Goal: Define a custom VPC with two subnets in different regions. Create a firewall rule allowing SSH only from your home IP. Deploy a VM in each subnet.

Why This Matters:

  • ACE heavily tests VPC and firewall rule configuration
  • Network isolation is fundamental to cloud security
  • Understanding CIDR ranges and firewall priority is essential

Requirements:

  • Create VPC with auto_create_subnetworks = false
  • Create subnets in us-east1 (10.0.1.0/24) and us-west1 (10.0.2.0/24)
  • Firewall rule with source_ranges = ["<your-ip>/32"]
  • VMs must communicate via internal IPs across regions (custom VPCs have no default allow rules, so add an allow-internal firewall rule; see the sketch below)
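
Continuing with the Pulumi (Python) alternative, a sketch of the network skeleton; the /32 source range is a placeholder for your own IP, and the allow-internal rule is what lets the two VMs reach each other:

import pulumi_gcp as gcp

vpc = gcp.compute.Network("lab-vpc", auto_create_subnetworks=False)

subnet_east = gcp.compute.Subnetwork(
    "subnet-east", network=vpc.id, region="us-east1", ip_cidr_range="10.0.1.0/24")
subnet_west = gcp.compute.Subnetwork(
    "subnet-west", network=vpc.id, region="us-west1", ip_cidr_range="10.0.2.0/24")

# SSH only from your own address.
gcp.compute.Firewall(
    "allow-ssh-home",
    network=vpc.id,
    direction="INGRESS",
    allows=[gcp.compute.FirewallAllowArgs(protocol="tcp", ports=["22"])],
    source_ranges=["203.0.113.10/32"])               # placeholder: your home IP

# Custom VPCs have no implicit allow rules, so internal traffic needs its own rule.
gcp.compute.Firewall(
    "allow-internal",
    network=vpc.id,
    direction="INGRESS",
    allows=[
        gcp.compute.FirewallAllowArgs(protocol="icmp"),
        gcp.compute.FirewallAllowArgs(protocol="tcp", ports=["0-65535"]),
    ],
    source_ranges=["10.0.0.0/16"])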

Common Mistakes:

  • Overlapping CIDR ranges between subnets
  • Forgetting that VPC is global but subnets are regional
  • Setting firewall priority incorrectly (lower number = higher priority)
  • Using 0.0.0.0/0 for SSH instead of restricting to your IP

Key Concepts: Custom VPCs, CIDR planning, firewall rule priority and direction, network isolation

ACE Mapping: 2.3 (Creating VPC with subnets, Cloud NGFW policies, ingress/egress rules)

Done = VMs can ping each other by internal IP; SSH works only from your IP; SSH from other IPs times out.

Verification Commands:

# From VM in us-east1:
ping 10.0.2.x  # Internal IP of us-west1 VM - should succeed
# From your machine:
ssh user@<vm-external-ip>  # Should succeed
# From a different IP (e.g., Cloud Shell in different project):
ssh user@<vm-external-ip>  # Should timeout

Stretch: Add a deny-all egress rule and allow only specific ports (80, 443).

Cost: VPC and firewall rules are free. VM costs only.


Security & Access Cluster (Projects 9–11, 14)

The next three projects (plus Project 14) form a security-focused cluster. If preparing for ACE with limited time, prioritize these.


Project 9: The Access Pattern Showdown [Phase 2, L3]

Goal: Deploy a private VM with no external IP. Connect via IAP tunneling. Then deploy a traditional Bastion host and configure SSH jumping. Compare both approaches.

Why This Matters:

  • IAP is the GCP-preferred method for secure access (frequently tested on ACE)
  • Understanding when bastion hosts are still appropriate
  • Zero-trust networking concepts

Requirements:

  • Private VM: network_interface without access_config
  • Cloud NAT for the private VM to pull updates (no external IP but can egress)
  • Enable IAP TCP forwarding firewall rule (tcp:22 from 35.235.240.0/20)
  • Connect: gcloud compute ssh --tunnel-through-iap
  • Bastion: public IP, configure ~/.ssh/config ProxyJump

Common Mistakes:

  • Opening SSH from 0.0.0.0/0 “just to test” instead of using IAP from the start
  • Using wrong IAP source range (must be exactly 35.235.240.0/20)
  • Forgetting to grant roles/iap.tunnelResourceAccessor IAM role
  • Not setting up Cloud NAT, then wondering why apt update fails

Key Concepts: IAP TCP forwarding, zero-trust access, SSH jumping, Cloud NAT for egress

ACE Mapping: 3.1 (Remotely connecting to instance), 4.1 (IAM policies), 2.3 (Firewall rules)

Done = Both methods work; can articulate when IAP is preferred (audit logging, no exposed bastion, IAM-controlled) vs bastion (legacy compliance, specific network requirements).

Verification Commands:

# IAP method:
gcloud compute ssh private-vm --tunnel-through-iap --zone=us-central1-a
# Bastion method (after configuring ~/.ssh/config):
ssh private-vm  # Automatically jumps through bastion
# Verify no external IP:
gcloud compute instances describe private-vm --format="get(networkInterfaces[0].accessConfigs)"
# Should return empty

Stretch: Enable OS Login with 2FA requirement.

Cost: Cloud NAT ≈ $0.045/hour + data processing. Medium cost if left running.


Project 10: The Least-Privilege Identity [Phase 2, L3]

Goal: Create a custom Service Account via IaC with only roles/storage.objectViewer. Attach it to a VM. Verify it can read GCS but cannot write or access Compute APIs.

Why This Matters:

  • ACE Section 4.2 is entirely about managing service accounts
  • Scopes vs. IAM roles confusion is a common exam trap
  • Principle of least privilege is fundamental to cloud security

Requirements:

  • Create SA: google_service_account
  • Bind role: google_project_iam_member with roles/storage.objectViewer
  • Attach to VM: service_account { email = ... scopes = ["cloud-platform"] }
  • Use cloud-platform scope (broad) but narrow IAM role (restrictive)

Common Mistakes:

  • Confusing scopes (legacy, ceiling) with IAM roles (actual permissions)
  • Using default compute SA instead of creating custom SA
  • Granting roles/storage.admin when only read is needed
  • Forgetting that SA changes require VM restart to take effect

Key Concepts: Service Account vs user identity, scopes vs IAM roles, principle of least privilege

ACE Mapping: 4.2 (Creating SA, minimum permissions, assigning to resources)

Done = VM can read GCS; VM cannot write to GCS or list compute instances.

Verification Commands:

# SSH to the VM, then:
gsutil cat gs://your-bucket/test-object.txt
# Output: (contents of the file)
 
gsutil cp /tmp/localfile gs://your-bucket/
# Output: AccessDeniedException: 403 ... does not have storage.objects.create access
 
gcloud compute instances list
# Output: ERROR: (gcloud.compute.instances.list) ... does not have compute.instances.list permission

Stretch: Implement service account impersonation from your local machine.

Cost: Service accounts are free. VM costs only.


Project 11: Trusted Images Policy Lockdown [Phase 2, L3]

Goal: Configure an Organization Policy that restricts the project to only use approved images. Verify that creating a VM from an unapproved image fails.

Why This Matters:

  • Supply chain security is increasingly important
  • ACE 1.1 covers applying organizational policies
  • Understanding policy inheritance and enforcement

Note: This project requires organization-level access. If you only have project-level access, read through conceptually and practice with project-level constraints (like restricting instance creation to specific zones) where possible.

Requirements:

  • Enable constraints/compute.trustedImageProjects
  • Allow only your project and projects/debian-cloud
  • Attempt to create VM from projects/ubuntu-os-cloud (should fail)
  • Document the exact error message

Common Mistakes:

  • Forgetting to include your own project in the allowed list (can’t use your custom images)
  • Not understanding policy inheritance from org → folder → project
  • Expecting policy to apply retroactively (only affects new resources)

Key Concepts: Organization Policies, Trusted Images, supply chain security, policy inheritance

ACE Mapping: 1.1 (Applying org policies to resource hierarchy), 4.1 (IAM policies)

Done = VMs from debian-cloud succeed; VMs from ubuntu-os-cloud fail with policy violation error.

Verification Commands:

# Should succeed:
gcloud compute instances create test-debian --image-project=debian-cloud --image-family=debian-11
 
# Should fail:
gcloud compute instances create test-ubuntu --image-project=ubuntu-os-cloud --image-family=ubuntu-2204-lts
# Output: ERROR: ... Constraint constraints/compute.trustedImageProjects violated

Stretch: Add constraint requiring Shielded VM at org level.

Cost: Organization policies are free.


Project 12: The Cost Watchdog [Phase 2, L2]

Goal: Write a Python Cloud Function that lists all VMs in the project. If a VM has label env:dev and current time is after 7 PM local, stop the instance.

Why This Matters:

  • FinOps is increasingly important (ACE 1.2 covers billing)
  • Serverless + scheduled automation is a common pattern
  • Label-based policies are production best practice

Requirements:

  • Trigger: Cloud Scheduler (cron: 0 19 * * *) via Pub/Sub
  • Function uses google-cloud-compute library
  • Filter by labels: instance.labels.get('env') == 'dev'
  • Log all actions to Cloud Logging with instance names and actions taken
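
A sketch of a 1st-gen, Pub/Sub-triggered function body; the project, zone, and timezone constants are placeholders that a real deployment would read from environment variables:

from datetime import datetime
from zoneinfo import ZoneInfo

from google.cloud import compute_v1

PROJECT = "my-project"                    # placeholder
ZONE = "us-central1-a"                    # placeholder
LOCAL_TZ = ZoneInfo("America/New_York")   # placeholder: use your local timezone

def stop_dev_vms(event, context):
    """Entry point for the Pub/Sub trigger fired by Cloud Scheduler at 19:00."""
    now = datetime.now(LOCAL_TZ)
    if now.hour < 19:
        print(f"{now.isoformat()}: before 7 PM local, nothing to do")
        return
    client = compute_v1.InstancesClient()
    for instance in client.list(project=PROJECT, zone=ZONE):
        if instance.labels.get("env") != "dev":
            continue
        if instance.status != "RUNNING":
            print(f"Skipping {instance.name}: already {instance.status}")
            continue
        print(f"Stopping {instance.name} (env=dev, after hours)")
        # Not waiting on the operation keeps the function fast; the stop completes asynchronously.
        client.stop(project=PROJECT, zone=ZONE, instance=instance.name)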

Common Mistakes:

  • Hardcoding timezone instead of using proper timezone handling
  • Not granting the function’s SA permission to stop instances
  • Stopping instances that are already stopped (handle gracefully)

Key Concepts: FinOps, serverless compute, scheduled automation, label-based policies

ACE Mapping: 1.2 (Billing budgets and alerts), 3.1 (Managing compute resources)

Done = Function deploys successfully; dev VMs stop after 7 PM; production VMs unaffected; Cloud Logging shows decisions.

Verification Commands:

# Deploy function
gcloud functions deploy cost-watchdog --trigger-topic=cost-watchdog-trigger ...
# Manually trigger for testing:
gcloud pubsub topics publish cost-watchdog-trigger --message="test"
# Check logs:
gcloud functions logs read cost-watchdog --limit=20
# Verify instance stopped:
gcloud compute instances describe dev-vm --format="get(status)"
# Output: TERMINATED

Stretch: Add Slack/email notification when VMs are stopped; implement weekend-only shutdown.

Cost: Cloud Functions: 2M free invocations/month. Effectively free for this use case.


Phase 3: Production Engineering

Focus: High Availability, Load Balancing, Observability, and Security Hardening.

ACE Focus: Section 2.1 (Autoscaled MIGs, instance templates), Section 2.3 (Load balancers), and Section 3.4 (Monitoring alerts, Ops Agent).


Project 13: Scalable Cluster: MIG + LB + Monitoring [Phase 3, L3]

Goal: Deploy a Managed Instance Group behind a Global HTTP(S) Load Balancer. Include Cloud Monitoring Uptime Check and Alert Policy in your IaC.

Why This Matters:

  • MIGs + LB is the canonical GCE high-availability pattern
  • ACE heavily tests this architecture (2.1, 2.3, 3.4)
  • “Monitoring as Code” ensures observability isn’t an afterthought

Requirements:

  • Instance Template with startup script serving unique instance ID in response
  • Regional MIG with target_size = 2
  • Global HTTP(S) LB with backend service pointing to MIG
  • Health check on port 80, path /health
  • Uptime Check pinging LB external IP
  • Alert Policy triggering if uptime check fails for 1 minute

Common Mistakes:

  • Health check path returning non-200 (instances marked unhealthy, never serve traffic)
  • Forgetting named port mapping on instance group
  • Not waiting for LB provisioning (can take 5-10 minutes)
  • Backend service timeout shorter than application response time

Key Concepts: Instance Templates, MIGs, Global LB architecture, health checks, Monitoring as Code

ACE Mapping: 2.1 (Creating autoscaled MIG with template), 2.3 (Deploying load balancers), 3.4 (Creating alerts based on metrics)

Done = curl to LB IP returns responses from different instances (refresh multiple times); Uptime Check shows green in console; manually killing Nginx on one VM triggers alert.

Verification Commands:

# Get LB IP:
LB_IP=$(gcloud compute forwarding-rules describe my-lb --global --format="get(IPAddress)")
# Test load balancing (should see different instance IDs):
for i in {1..10}; do curl -s http://$LB_IP | grep "Instance:"; done
# Check uptime check status:
gcloud monitoring uptime-check-configs list

Stretch: Add Cloud Armor security policy blocking specific countries or IP ranges.

Cost: Global LB ≈ $18/month minimum + data processing. MIG VMs at hourly rate. Medium-high cost if left running.


Project 14: The Hardened Vault [Phase 3, L3]

Goal: Deploy a VM with Shielded VM enabled (Secure Boot, vTPM, Integrity Monitoring) and Confidential Computing. Verify security features via Cloud Logging.

Why This Matters:

  • ACE tests Shielded VM configuration options
  • Compliance requirements increasingly mandate hardware-based security
  • Understanding integrity monitoring for detecting boot-level attacks

Requirements:

  • Enable shielded_instance_config: enable_secure_boot, enable_vtpm, enable_integrity_monitoring
  • Enable confidential_instance_config: enable_confidential_compute (requires N2D machine type)
  • Query integrity monitoring logs in Cloud Logging
  • Document that disabling Secure Boot requires VM stop

Common Mistakes:

  • Using machine type that doesn’t support Confidential Computing (N2D is the primary family; C2D also supported)
  • Assuming Secure Boot is on by default (Shielded VM features are available on most modern images, but Secure Boot typically must be explicitly enabled depending on your image and tooling)
  • Not knowing where to find integrity logs in Cloud Logging

Key Concepts: Hardware-based security, vTPM, integrity monitoring, Confidential VMs, SEV encryption

ACE Mapping: 4.1 (Security configurations), 2.1 (Shielded VM, availability policy)

Done = VM boots with all security features enabled; can find earlyBootReportEvent in Cloud Logging; attempting to change Secure Boot while running fails.

Verification Commands:

# Verify Shielded VM config:
gcloud compute instances describe hardened-vm \
  --format="get(shieldedInstanceConfig)"
# Output: enableIntegrityMonitoring: true, enableSecureBoot: true, enableVtpm: true
 
# Find integrity logs in Cloud Logging:
gcloud logging read 'resource.type="gce_instance" AND 
  logName:"compute.googleapis.com%2Fshielded_vm_integrity" AND 
  jsonPayload.earlyBootReportEvent:*' --limit=5
 
# Attempt to disable Secure Boot while running (should fail):
gcloud compute instances update hardened-vm --no-shielded-secure-boot
# Output: ERROR: Shielded VM config can only be updated when instance is stopped

Stretch: Simulate integrity violation by modifying boot config; observe alert.

Cost: N2D machine types are priced slightly below comparable N2 types; enabling Confidential Computing adds a premium on top of the base VM price.


Project 15: The Resiliency Auditor [Phase 3, L4]

Goal: Write a Python chaos script that randomly deletes a VM in your MIG. The script must measure and report exact recovery time.

Why This Matters:

  • Chaos engineering validates your self-healing architecture
  • Measuring recovery time gives you concrete SLO data
  • Understanding MIG autohealing behavior is crucial for production

Requirements:

  • Randomly select one instance from the MIG using InstanceGroupManagersClient.list_managed_instances()
  • Delete it via InstancesClient.delete(), not the MIG API, to simulate an unexpected failure
  • Poll MIG status every second using InstanceGroupManagersClient.list_managed_instances() (see the sketch below)
  • Log: deletion time, time MIG detected unhealthy, time replacement reached RUNNING
  • Output: Total Recovery Time: X seconds
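
A condensed sketch of the measurement loop for a zonal MIG; a fuller version would also record when the MIG marks the group unhealthy and when the load balancer health check starts passing again:

import random
import time
from google.cloud import compute_v1

def chaos_round(project: str, zone: str, mig: str) -> float:
    mig_client = compute_v1.InstanceGroupManagersClient()
    instances = compute_v1.InstancesClient()

    managed = list(mig_client.list_managed_instances(
        project=project, zone=zone, instance_group_manager=mig))
    victim_url = random.choice(managed).instance       # full self-link
    victim_name = victim_url.rsplit("/", 1)[-1]

    print(f"[{time.strftime('%H:%M:%S')}] Deleting {victim_name} (bypassing the MIG API)")
    start = time.monotonic()
    instances.delete(project=project, zone=zone, instance=victim_name).result()

    while True:                     # poll every second until the group is whole again
        current = list(mig_client.list_managed_instances(
            project=project, zone=zone, instance_group_manager=mig))
        names = {m.instance for m in current}
        if victim_url not in names and all(
                m.instance_status == "RUNNING" for m in current):
            break
        time.sleep(1)

    recovery = time.monotonic() - start
    print(f"Total Recovery Time: {recovery:.0f} seconds")
    return recovery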

Common Mistakes:

  • Deleting via MIG API (which gracefully drains) instead of simulating crash
  • Not accounting for health check interval in recovery time
  • Polling too infrequently to get accurate measurements

Key Concepts: Chaos engineering, self-healing infrastructure, measuring availability, autohealing health checks

ACE Mapping: 3.1 (Managing compute resources), 2.1 (MIG autohealing)

Done = Script outputs recovery timeline; MIG automatically recreates instance; typical recovery time documented (usually 60-120 seconds depending on health check settings).

Verification Commands:

python chaos_auditor.py --mig=my-mig --zone=us-central1-a
# Output:
# [14:32:01] Deleted instance: my-mig-abc123
# [14:32:15] MIG detected unhealthy state
# [14:32:45] Replacement instance created: my-mig-def456
# [14:33:12] Replacement instance RUNNING
# [14:33:18] Health check passed, serving traffic
# Total Recovery Time: 77 seconds

Stretch: Run chaos script 10 times and calculate mean/p95 recovery time.

Cost: Same as Project 13 (MIG infrastructure).


Project 16: The Autoscaling Stress Test [Phase 3, L4]

Goal: Configure CPU-based autoscaling on your MIG (target 60%). Use stress-ng to spike CPU and observe automatic scale-out and scale-in.

Why This Matters:

  • ACE 2.1 explicitly covers creating autoscaled MIGs
  • Understanding cooldown periods prevents thrashing
  • Autoscaler behavior with different metrics (CPU, LB utilization, custom)

Requirements:

  • Configure autoscaler: min_replicas=2, max_replicas=6, cpu_utilization_target=0.6
  • Install stress-ng via startup script
  • Run: stress-ng --cpu 4 --timeout 300s on all instances
  • Monitor instance count via API or Console
  • Observe scale-in after stress stops (respects cooldown)

Common Mistakes:

  • Setting CPU target too low (constant thrashing)
  • Not understanding stabilization/cooldown periods
  • Expecting immediate scale-in (default cooldown is 10 minutes)

Key Concepts: Horizontal scaling, autoscaler policies (CPU, LB, custom metrics), cooldown periods

ACE Mapping: 2.1 (Creating autoscaled MIG), 3.4 (Cloud Monitoring metrics)

Done = MIG scales from 2→5+ during stress; scales back to 2 after cooldown (10+ minutes); scaling events visible in Monitoring.

Verification Commands:

# Start stress on all instances:
for vm in $(gcloud compute instance-groups managed list-instances my-mig --zone=us-central1-a --format="value(instance.basename())"); do
  gcloud compute ssh $vm --zone=us-central1-a --command="nohup stress-ng --cpu 4 --timeout 300s >/dev/null 2>&1 &"
done
 
# Watch instance count:
watch -n 10 'gcloud compute instance-groups managed describe my-mig --zone=us-central1-a --format="get(targetSize)"'
# Should increase from 2 to 5-6
 
# After stress stops, wait 10+ minutes for scale-in

Stretch: Configure schedule-based autoscaling for predictable load patterns.

Cost: Autoscaling to 6 instances = 3x base cost during stress test.


Phase 4: SRE & Disaster Recovery

Focus: Production-grade reliability, disaster recovery, and CI/CD pipelines.

ACE Focus: Section 2.2 (Multi-region redundancy), Section 2.4 (IaC versioning and updates), and Section 3.4 (Cloud diagnostics, log routers).


Project 17: Cross-Region Disaster Recovery [Phase 4, L4]

Goal: Deploy a primary application in us-east1. Write a Disaster Recovery script that restores service in us-west1 from the latest snapshot within 10 minutes.

Why This Matters:

  • ACE tests understanding of RTO/RPO concepts
  • Cross-region redundancy is required for true disaster recovery
  • Programmatic DR enables faster recovery than manual runbooks

Requirements:

  • Primary: VM + disk with a snapshot schedule whose storage location includes the DR region (multi-region or us-west1)
  • Snapshot schedules are configured via resource policies (google_compute_resource_policy in Terraform)
  • DR script: find the latest snapshot with SnapshotsClient.list(), filtered by source disk (see the helper sketch below)
  • Create disk from snapshot in us-west1
  • Create VM from disk, configure networking/firewall
  • Verify HTTP response within 10-minute RTO
  • Script must fail (exit 1) if RTO exceeded
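
Two helpers the DR script might be built from, shown as a sketch; creationTimestamp is parsed rather than compared as a raw string because it carries a UTC offset:

from datetime import datetime
from google.cloud import compute_v1

def latest_snapshot(project: str, source_disk: str) -> compute_v1.Snapshot:
    """Return the most recent snapshot taken from the given disk name."""
    client = compute_v1.SnapshotsClient()
    candidates = [s for s in client.list(project=project)
                  if s.source_disk.endswith(f"/disks/{source_disk}")]
    if not candidates:
        raise RuntimeError(f"No snapshots found for disk {source_disk}")
    return max(candidates,
               key=lambda s: datetime.fromisoformat(s.creation_timestamp))

def restore_disk(project: str, target_zone: str,
                 snapshot: compute_v1.Snapshot, disk_name: str) -> compute_v1.Disk:
    """Materialize the snapshot as a new disk in the DR zone."""
    client = compute_v1.DisksClient()
    disk = compute_v1.Disk(name=disk_name, source_snapshot=snapshot.self_link)
    client.insert(project=project, zone=target_zone, disk_resource=disk).result()
    return client.get(project=project, zone=target_zone, disk=disk_name)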

Common Mistakes:

  • Snapshot storage location not including DR region
  • Not accounting for disk creation time from snapshot
  • Forgetting to recreate firewall rules in DR region
  • DNS/IP changes not accounted for in recovery

Key Concepts: RTO (Recovery Time Objective), RPO (Recovery Point Objective), cross-region replication, snapshot storage locations

ACE Mapping: 3.1 (Working with snapshots), 2.2 (Multi-region redundancy)

Done = DR script completes in < 10 minutes; new VM serves application; script logs RTO achievement.

Verification Commands:

python disaster_recovery.py --source-region=us-east1 --target-region=us-west1
# Output:
# [14:00:00] Starting DR procedure
# [14:00:05] Found latest snapshot: my-disk-20240115-120000
# [14:01:30] Disk created in us-west1 from snapshot
# [14:03:45] VM created and booting
# [14:05:12] Health check passed - application responding
# [14:05:12] DR SUCCESSFUL - RTO: 5m 12s (target: 10m)

Stretch: Implement automated failover triggered by health check failure + Pub/Sub.

Cost: Cross-region data transfer ≈ $0.08/GB. Snapshot storage in multiple regions doubles storage cost. Plan carefully.


Project 18: The Golden Pipeline [Phase 4, L4]

Goal: Set up Cloud Build to trigger on GitHub push. Pipeline bakes a new image, updates Instance Template, and performs rolling MIG update with zero downtime.

Why This Matters:

  • ACE 2.4 covers IaC deployments including versioning and updates
  • Immutable infrastructure via image baking is a production best practice
  • Rolling updates maintain availability during deployments

Requirements:

  • Cloud Build trigger on push to main branch
  • Build step 1: Build application, create new image with version tag
  • Build step 2: Create new Instance Template referencing new image
  • Build step 3: Start rolling update on MIG with max_surge=1, max_unavailable=0
  • Build step 4: Clean up Instance Templates older than 5 versions

Common Mistakes:

  • Not using max_unavailable=0 (allows brief downtime)
  • Image naming collisions (must include unique identifier like commit SHA)
  • Not cleaning up old templates (clutter + potential cost)

Key Concepts: GitOps, immutable deployments, rolling updates, zero-downtime deployments

ACE Mapping: 2.4 (Planning and implementing resources through IaC: versioning, updates), 3.1 (Managing compute resources)

Done = Git push triggers build; new image created; MIG updates with zero dropped requests; build history shows success.

Verification Commands:

# Push a change:
git commit -am "Update index.html" && git push
 
# Watch Cloud Build:
gcloud builds list --limit=1
 
# Monitor rolling update:
gcloud compute instance-groups managed describe my-mig --zone=us-central1-a \
  --format="get(status.versionTarget)"
 
# Verify zero downtime (run during update):
while true; do curl -s -o /dev/null -w "%{http_code}\n" http://$LB_IP; sleep 0.5; done
# Should show continuous 200s

Stretch: Add canary deployment: route 10% traffic to new version, promote after health check passes.

Cost: Cloud Build: 120 free build-minutes/day. Image storage per version.


Project 19: The Observability Stack [Phase 4, L4]

Goal: Deploy comprehensive observability: Ops Agent for metrics/logs, custom metrics from application, log-based metrics, and a Cloud Monitoring dashboard.

Why This Matters:

  • ACE 3.4 extensively covers monitoring and logging
  • Ops Agent is the modern replacement for legacy agents
  • Custom metrics enable application-specific observability

Requirements:

  • Install Ops Agent via startup script (official installation script)
  • Application writes custom metric custom.googleapis.com/myapp/request_count to Monitoring API
  • Create log-based metric counting severity>=ERROR from application logs
  • Build Monitoring dashboard with 4 widgets: CPU, memory, custom metric, error rate
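
Writing the custom metric from the application follows the standard monitoring_v3 pattern; a sketch is shown below (the metric type matches the requirement above, and the gce_instance resource labels are what tie the data point to your VM):

import time
from google.cloud import monitoring_v3

def report_request_count(project_id: str, instance_id: str, zone: str, value: int) -> None:
    client = monitoring_v3.MetricServiceClient()

    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/myapp/request_count"
    series.resource.type = "gce_instance"
    series.resource.labels["project_id"] = project_id
    series.resource.labels["instance_id"] = instance_id
    series.resource.labels["zone"] = zone

    now = time.time()
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}})
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"int64_value": value}})
    series.points = [point]

    client.create_time_series(name=f"projects/{project_id}", time_series=[series])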

Common Mistakes:

  • Using legacy monitoring agent instead of Ops Agent
  • Custom metric descriptor not created before writing data
  • Log-based metric filter syntax errors
  • Dashboard JSON/Terraform syntax for complex layouts

Key Concepts: Ops Agent, custom metrics, log-based metrics, dashboards as code, metric descriptors

ACE Mapping: 3.4 (Monitoring and logging: deploying Ops Agent, creating custom metrics, configuring log routers, creating alerts)

Done = Dashboard shows all four metric types updating live; can correlate CPU spike with error rate increase.

Verification Commands:

# Verify Ops Agent running:
gcloud compute ssh my-vm --command="sudo systemctl status google-cloud-ops-agent"
 
# Check custom metric exists:
gcloud monitoring metrics-descriptors list \
  --filter="type=custom.googleapis.com/myapp/request_count"
 
# View dashboard (get URL):
gcloud monitoring dashboards list --format="get(name)"

Stretch: Export logs to BigQuery; create SQL-based analysis of error patterns.

Cost: Cloud Logging ingestion has a free allotment (50 GiB/project/month). Custom metrics: first 150 MiB of chargeable metric data free. Dashboards free.


Project 20: The Incident Response Drill [Phase 4, L4]

Goal: Simulate a production incident: degrade application, observe alerts fire, follow runbook to diagnose, remediate, and write a post-mortem.

Why This Matters:

  • Incident response skills are crucial for production engineering
  • Post-mortem culture drives continuous improvement
  • ACE 3.4 covers using diagnostics to research application issues

Requirements:

  • Inject fault: choose one of
    • Fill disk: dd if=/dev/zero of=/tmp/fill bs=1M count=10000
    • Exhaust memory: stress-ng --vm 2 --vm-bytes 90%
    • Break health check: sudo systemctl stop nginx
  • Observe: alert fires, MIG attempts recreation (if applicable), LB drains traffic
  • Diagnose: use Cloud Logging, serial console, SSH to identify root cause
  • Remediate: fix root cause or rollback
  • Write post-mortem using template: timeline, impact, root cause, action items

Common Mistakes:

  • Not having alerts configured before the drill
  • Remediation that fixes symptom but not root cause
  • Post-mortem that assigns blame instead of identifying systemic issues

Key Concepts: Incident response, post-mortem culture, runbooks, blameless retrospectives

ACE Mapping: 3.4 (Cloud diagnostics, Cloud Trace, viewing logs), 3.1 (Managing compute resources)

Done = Alert fired and acknowledged; incident timeline documented to the minute; post-mortem includes 3+ concrete action items; runbook updated with new failure mode.

Deliverables:

## Post-Mortem: [Date] Disk Full Incident
 
### Timeline
- 14:32 - Disk usage alert fired (>90%)
- 14:35 - On-call acknowledged
- 14:38 - Root cause identified: log rotation misconfigured
- 14:42 - Temporary fix: deleted old logs
- 14:45 - Service fully recovered
 
### Impact
- 10 minutes degraded performance
- 0 data loss
 
### Root Cause
Log rotation cron job was set to weekly instead of daily.
 
### Action Items
1. [ ] Fix log rotation config in base image
2. [ ] Add disk usage alert at 70% (earlier warning)
3. [ ] Document log locations in runbook

Stretch: Implement automated remediation for the specific failure mode (e.g., auto-delete old logs when disk > 80%).

Cost: No additional cost beyond existing infrastructure.


Appendix A: ACE Exam Coverage Matrix

This roadmap covers the following ACE exam objectives. Weights are from the official exam guide; project mappings are approximate.

| ACE Section | Weight | Projects |
| --- | --- | --- |
| 1. Setting up cloud environment | ~23% | 1, 7, 8, 11, 12 |
| 2. Planning/implementing solution | ~30% | 2–6, 8, 13, 16, 17, 18 |
| 3. Ensuring successful operation | ~27% | 3–5, 12, 15, 18–20 |
| 4. Configuring access/security | ~20% | 9–11, 14 |

High-Yield ACE Practice Set: If you’re short on time, prioritize Projects 2–5, 8–11, 13, and 16. These cover the core 2.x (compute, networking, IaC) and 3.x (operations, monitoring) objectives that make up ~57% of the exam.


Appendix B: Key GCE Features Covered

| Feature | Project(s) |
| --- | --- |
| Machine Type Selection | 1, 2, 14 |
| Persistent Disk (zonal, regional) | 3, 5, 17 |
| Custom Images & Snapshots | 4, 5, 17, 18 |
| Spot/Preemptible VMs | 6 |
| Custom VPC & Firewall Rules | 8, 9 |
| IAP TCP Forwarding | 9 |
| Service Accounts & IAM | 10, 11, 12 |
| Organization Policies | 11 |
| MIGs & Instance Templates | 13, 15, 16, 18 |
| Load Balancing (HTTP/S) | 13 |
| Shielded & Confidential VMs | 14 |
| Autoscaling Policies | 16 |
| Ops Agent & Custom Metrics | 19 |
| Cloud Build CI/CD | 18 |

Appendix C: Documentation Requirements

For each project, create:

  1. README.md — Architecture diagram (even ASCII art), key commands, failure modes, recovery steps
  2. Code files — Python scripts, Terraform/Pulumi configs, Cloud Build YAML, all version controlled
  3. Post-mortem (Projects 15, 20) — Timeline, root cause, 3+ action items
  4. Cost log — Actual cost incurred, compared to estimate

Appendix D: Cost Summary

| Cost Level | Projects | Estimated Cost |
| --- | --- | --- |
| Free/Minimal | 1–5, 7–8, 10–12, 20 | < $5 total if deleted promptly |
| Low | 6, 14 | $5–15 if run for a day |
| Medium | 9, 15–16, 18–19 | $15–40 if run for a day |
| Higher | 13, 17 | $40–100 if run for a day (Global LB + cross-region) |

Critical: Always run terraform destroy or equivalent cleanup after each project session. Set a calendar reminder!


Last updated: January 2026