20 Projects: From Python SDK to Production Engineering
Aligned with Google Cloud Associate Cloud Engineer (ACE) Certification
Cost Notice: Most projects can be completed within the free tier or for a few dollars if you delete resources promptly. Phase 3–4 projects (MIGs, Load Balancers, cross-region DR) can cost more if left running. Each project includes a cost estimate.
Quick Start Guide
| Your Background | Start Here | Time Estimate |
|---|---|---|
| New to GCP | Projects 1–3 (L1: Single VM basics) | 1 weekend |
| Know Python, new to GCE | Projects 2–6 (L1–L2: SDK automation) | 1 weekend |
| Comfortable with Terraform | Projects 8–14 (L2–L3: IaC + Security) | 1 week |
| Production experience | Projects 15–20 (L3–L4: SRE + DR) | 1 week |
| 2 weeks until ACE exam | Projects 2–5, 8–11, 13, 16 | Focus path |
| 1 weekend crash course | Projects 1–6 | Essentials only |
Level Definitions
- L1 — Single VM, simple firewall, manual/scripted operations
- L2 — MIGs, basic load balancing, Python SDK automation
- L3 — IaC (Terraform/Pulumi), CI/CD, security hardening
- L4 — SRE features (autoscaling, SLOs, incident response, DR)
Interaction Philosophy
Primary interactions via Python SDK and IaC tools; Console used mainly for verification and dashboards.
Phase 1: The Python Infrastructure Builder
Focus: Mastering the google-cloud-compute Python client and understanding resource lifecycles.
ACE Focus: Section 2.1 (Launching compute instances, choosing storage) and Section 3.1 (Working with snapshots and images).
Project 1: The One-Time Manual Launch [Phase 1, L1]
Goal: Spin up a VM manually in the Console to observe the default Service Account, scopes, network tags, and firewall rules.
Why This Matters:
- ACE questions often test your knowledge of default configurations
- Understanding defaults helps you know what to explicitly override in automation
Requirements:
- Note the default compute service account email format: PROJECT_NUMBER-compute@developer.gserviceaccount.com
- Identify which APIs are enabled by default
- Delete the VM immediately after observation
Common Mistakes:
- Forgetting to note the default network tags (none) vs. the default firewall rules
- Not checking the Service Account scopes vs. IAM roles distinction (newer projects default to the cloud-platform scope; older patterns used narrower scopes like compute-ro)
Key Concepts: Console navigation, default resource values, VM lifecycle states
ACE Mapping: 1.1 (Setting up projects), 2.1 (Launching compute instance)
Done = Can describe the default SA format, default network tags, and why the VM has no external HTTP access initially.
Stretch: Enable OS Login and note how SSH key management changes.
Cost: Free tier eligible. ~$0 if deleted within minutes.
Project 2: The Python Spawner [Phase 1, L1]
Goal: Write a Python script (create_vm.py) using google-cloud-compute that creates a VM with an injected startup script that installs Nginx and writes a custom index.html.
Why This Matters:
- The foundation for all programmatic GCE management
- ACE tests your ability to launch instances with correct configurations
- Startup scripts are the simplest form of bootstrapping
Requirements:
- Script must use InstancesClient from google.cloud.compute_v1 (see the sketch after this list)
- Startup script injected via metadata key startup-script
- Script must wait for operation completion using operation.result()
- Print the external IP when VM is RUNNING
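A minimal sketch of what create_vm.py can look like, assuming the default VPC, a Debian image family, and placeholder project/zone/VM names (my-project, us-central1-a, my-vm are illustrative):

```python
from google.cloud import compute_v1

PROJECT, ZONE = "my-project", "us-central1-a"   # placeholders
STARTUP = """#!/bin/bash
apt-get update && apt-get install -y nginx
echo '<html>Hello from GCE!</html>' > /var/www/html/index.html
"""

client = compute_v1.InstancesClient()
instance = compute_v1.Instance(
    name="my-vm",
    machine_type=f"zones/{ZONE}/machineTypes/e2-micro",
    disks=[compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_image="projects/debian-cloud/global/images/family/debian-12",
        ),
    )],
    network_interfaces=[compute_v1.NetworkInterface(
        network="global/networks/default",
        # An access config gives the VM an ephemeral external IP
        access_configs=[compute_v1.AccessConfig(
            name="External NAT", type_="ONE_TO_ONE_NAT")],
    )],
    metadata=compute_v1.Metadata(
        items=[compute_v1.Items(key="startup-script", value=STARTUP)],
    ),
)

# insert() is asynchronous; result() blocks until the operation completes
client.insert(project=PROJECT, zone=ZONE, instance_resource=instance).result()

vm = client.get(project=PROJECT, zone=ZONE, instance="my-vm")
print("External IP:", vm.network_interfaces[0].access_configs[0].nat_i_p)
```

Remember that the HTTP firewall rule (tcp:80) is a separate resource; without it, curl to the external IP will hang (see Common Mistakes below).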
Common Mistakes:
- Not waiting for the operation to complete before querying for IP
- Forgetting to create firewall rule for HTTP (port 80)
- Using the wrong metadata key (startup_script vs startup-script)
Key Concepts: Asynchronous API operations, Instance resource structure, metadata bootstrapping
ACE Mapping: 2.1 (Launching compute instance, SSH keys, availability policy)
Done = Running python create_vm.py creates a VM; curl <external-ip> returns custom HTML; script handles errors gracefully.
Verification Commands:
python create_vm.py
# Output: Created instance 'my-vm' with external IP: 35.x.x.x
curl http://35.x.x.x
# Output: <html>Hello from GCE!</html>
Stretch: Add argument parsing for zone, machine type, and custom HTML content.
Cost: e2-micro is free tier eligible. ~$0.01/hour otherwise.
Project 3: The Disk Lifecycle Manager [Phase 1, L1]
Goal: Write a Python script that creates a standalone Persistent Disk, attaches it to your running VM, and formats/mounts it remotely without rebooting.
Why This Matters:
- ACE frequently tests disk attachment, detachment, and lifecycle
- Understanding that disks and VMs have independent lifecycles is crucial
- Day-2 operations (modifying running infrastructure) are production reality
Requirements:
- Create a 10GB pd-standard disk via DisksClient
- Attach to VM using AttachedDisk with mode=READ_WRITE (see the sketch after this list)
- Use subprocess with gcloud compute ssh --command to run mkfs.ext4 and mount
- Alternative: use paramiko for direct SSH
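A minimal sketch of the create-and-attach steps, assuming the Project 2 VM and an illustrative disk name (data-disk); formatting and mounting still happen inside the guest as described above:

```python
from google.cloud import compute_v1

PROJECT, ZONE, VM_NAME = "my-project", "us-central1-a", "my-vm"   # placeholders

disks = compute_v1.DisksClient()
instances = compute_v1.InstancesClient()

# 1. Create a standalone 10 GB pd-standard disk
disk = compute_v1.Disk(
    name="data-disk",
    size_gb=10,
    type_=f"projects/{PROJECT}/zones/{ZONE}/diskTypes/pd-standard",
)
disks.insert(project=PROJECT, zone=ZONE, disk_resource=disk).result()

# 2. Attach it to the running VM (no reboot required)
attached = compute_v1.AttachedDisk(
    source=f"projects/{PROJECT}/zones/{ZONE}/disks/data-disk",
    device_name="data-disk",   # appears as /dev/disk/by-id/google-data-disk
    mode="READ_WRITE",
)
instances.attach_disk(
    project=PROJECT, zone=ZONE, instance=VM_NAME,
    attached_disk_resource=attached,
).result()
```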
Common Mistakes:
- Forgetting to unmount before detaching (causes data corruption)
- Assuming disk auto-detaches when VM stops (it doesn’t by default)
- Not specifying device_name and then struggling to find it in /dev/
- Formatting a disk that already has data (always check first)
Key Concepts: Day-2 operations, separate lifecycles for compute vs storage, live disk attachment
ACE Mapping: 2.1 (Choosing storage: zonal PD, regional PD, Hyperdisk), 3.1 (Working with snapshots)
Done = Disk is attached and mounted at /mnt/data; writing a file to /mnt/data persists across VM stop/start.
Verification Commands:
# On the VM after script runs:
lsblk # Shows new disk
df -h /mnt/data # Shows mounted filesystem
echo "test" > /mnt/data/testfile
# Stop and start VM, then:
cat /mnt/data/testfile # Returns "test"
Stretch: Implement disk detachment and reattachment to a different VM.
Cost: 10GB pd-standard ≈ $0.01 for a few hours.
Project 4: The Image Bakery [Phase 1, L2]
Goal: Write a script that stops your VM from Project 2, creates a Custom Image from its boot disk, and deletes the original VM.
Why This Matters:
- “Golden Images” are foundational to immutable infrastructure
- ACE tests image creation, families, and deprecation policies
- Faster instance startup than re-running startup scripts on every boot
Requirements:
- Stop instance using instances().stop()
- Create image using ImagesClient.insert() (see the sketch after this list)
- Wait for image to be READY before deleting VM
- Output the image self_link for future use
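A minimal sketch of the bake step, assuming the Project 2 VM, an illustrative image/family name, and the default behavior that the boot disk shares the VM's name:

```python
from google.cloud import compute_v1

PROJECT, ZONE, VM_NAME = "my-project", "us-central1-a", "my-vm"   # placeholders

instances = compute_v1.InstancesClient()
images = compute_v1.ImagesClient()

# Stop the VM so the boot disk is in a consistent state
instances.stop(project=PROJECT, zone=ZONE, instance=VM_NAME).result()

image = compute_v1.Image(
    name="my-golden-image-v1",
    family="my-golden",   # a family lets new VMs always pick the latest version
    # assumes the boot disk shares the VM's name (the default)
    source_disk=f"projects/{PROJECT}/zones/{ZONE}/disks/{VM_NAME}",
)
images.insert(project=PROJECT, image_resource=image).result()   # wait until READY

print("Image:", images.get(project=PROJECT, image="my-golden-image-v1").self_link)

# Only delete the source VM once the image operation has completed
instances.delete(project=PROJECT, zone=ZONE, instance=VM_NAME).result()
```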
Common Mistakes:
- Trying to create image from running VM (must be stopped or use snapshots)
- Not using image families for versioning
- Forgetting that images are global but are created from zonal disks
Key Concepts: Immutable infrastructure, Golden Images, image families
ACE Mapping: 3.1 (Create, view, delete images), 2.1 (Boot disk image)
Done = Custom image exists; launching new VM from this image serves the same Nginx page without any startup script.
Verification Commands:
gcloud compute images list --filter="name~my-golden"
gcloud compute instances create test-from-image --image=my-golden-image
curl http://<new-vm-ip> # Returns same HTML without startup script
Stretch: Implement image family versioning with automatic deprecation of old images.
Cost: Image storage ≈ $0.05/GB/month. Minimal for small images.
Project 5: The Snapshot Sentinel [Phase 1, L2]
Goal: Write a Python function that iterates through all disks in a zone and triggers immediate snapshots for disks labeled backup: true.
Why This Matters:
- ACE tests snapshot scheduling and retention
- Label-based resource management is a production best practice
- Understanding snapshot vs. image use cases
Requirements:
- Use DisksClient.list() to enumerate disks
- Filter by labels using disk.labels.get('backup') == 'true'
- Create snapshot with SnapshotsClient (see the sketch after this list)
- Include timestamp in snapshot name (e.g., disk-name-20240115-143022)
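A minimal sketch of the sentinel loop, assuming placeholder project/zone values and waiting on each snapshot operation in turn:

```python
import datetime
from google.cloud import compute_v1

PROJECT, ZONE = "my-project", "us-central1-a"   # placeholders

disks = compute_v1.DisksClient()
snapshots = compute_v1.SnapshotsClient()

for disk in disks.list(project=PROJECT, zone=ZONE):
    if disk.labels.get("backup") != "true":
        continue   # only labeled disks are backed up
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    snap = compute_v1.Snapshot(
        name=f"{disk.name}-{stamp}",   # timestamp keeps names unique
        source_disk=disk.self_link,
    )
    snapshots.insert(project=PROJECT, snapshot_resource=snap).result()
    print(f"Snapshot created: {snap.name}")
```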
Common Mistakes:
- Snapshot names must be unique; not including timestamps causes collisions
- Forgetting that snapshots are incremental (only first is full-size)
- Not setting up snapshot schedule resource policies for automation
Key Concepts: Resource iteration, label-based filtering, backup strategies, incremental snapshots
ACE Mapping: 3.1 (Schedule a snapshot, create/view snapshots), 3.2 (Backing up database instances)
Done = Only labeled disks get snapshots; snapshot names include ISO timestamp; script is idempotent.
Verification Commands:
python snapshot_sentinel.py --zone=us-central1-a
gcloud compute snapshots list --filter="sourceDisk~my-disk"
# Shows snapshot with timestamp in name
Stretch: Add snapshot retention policy that deletes snapshots older than 7 days.
Cost: Snapshots ≈ $0.026/GB/month (incremental after first).
Project 6: The Spot Instance Watchdog [Phase 1, L2]
Goal: Create a Spot VM (preemptible). Write a Python script on a separate standard VM that monitors the Spot VM and automatically recreates it if preempted.
Why This Matters:
- ACE questions frequently ask when Spot/Preemptible is appropriate
- Understanding preemption behavior is crucial for cost optimization
- Real architectures must handle Spot interruptions gracefully
Requirements:
- Create Spot VM with scheduling.preemptible = True and scheduling.provisioning_model = "SPOT"
- Monitor script polls instance status every 60 seconds
- On TERMINATED status with preemption reason, recreate with same configuration (see the sketch after this list)
- Log all state transitions with timestamps
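A minimal sketch of the watchdog loop; recreate_spot_vm() is a hypothetical helper whose body mirrors the Project 2 creation code plus the Spot scheduling block shown here:

```python
import time
from google.api_core.exceptions import NotFound
from google.cloud import compute_v1

PROJECT, ZONE, SPOT_NAME = "my-project", "us-central1-a", "spot-vm"   # placeholders
client = compute_v1.InstancesClient()

# Scheduling block to include when (re)creating the Spot VM
spot_scheduling = compute_v1.Scheduling(
    provisioning_model="SPOT",
    preemptible=True,
    instance_termination_action="DELETE",
)

def recreate_spot_vm():
    """Hypothetical helper: build the Instance as in Project 2 and add
    scheduling=spot_scheduling before calling insert()."""

while True:
    try:
        status = client.get(project=PROJECT, zone=ZONE, instance=SPOT_NAME).status
    except NotFound:
        status = "DELETED"   # termination_action=DELETE removes the instance
    print(time.strftime("%H:%M:%S"), SPOT_NAME, status)
    if status in ("TERMINATED", "DELETED"):
        print("Preemption detected: recreating")
        recreate_spot_vm()
    time.sleep(60)
```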
Common Mistakes:
- Using Spot for stateful workloads without checkpointing
- Not handling the 30-second shutdown warning
- Assuming Spot VMs have the 24-hour runtime limit (that is legacy preemptible behavior; Spot VMs have no maximum runtime)
Key Concepts: Spot/Preemptible VMs, failure detection, self-healing patterns, cost optimization
ACE Mapping: 2.1 (Using Spot VM instances), 3.1 (Viewing current running instances)
Done = Spot VM runs; manually triggering preemption (or waiting) results in automatic recreation within 2 minutes.
Verification Commands:
# Simulate preemption:
gcloud compute instances simulate-maintenance-event spot-vm --zone=us-central1-a
# Watch logs from monitor script showing recreation
Stretch: Implement exponential backoff for recreation failures; add Pub/Sub notification on preemption.
Cost: Spot VMs are 60-91% cheaper than on-demand. ~$0.003/hour for e2-micro.
Phase 2: Infrastructure as Code (The Declarative Shift)
Focus: Moving from imperative Python scripts to declarative IaC. Terraform is recommended for ACE exam alignment; Pulumi (Python) is an excellent alternative.
ACE Focus: Section 2.3 (VPC, firewall policies), Section 2.4 (Infrastructure as code tooling), and Sections 4.1–4.2 (IAM and service accounts).
Project 7: The Great Rewrite [Phase 2, L2]
Goal: Destroy everything from Phase 1. Recreate the VM, firewall rule, and reference your custom image using Terraform or Pulumi.
Why This Matters:
- ACE Section 2.4 explicitly covers IaC tooling (Terraform, Config Connector, Helm)
- Understanding state management is crucial for production IaC
- The imperative→declarative shift is a key mental model change
Requirements:
- Define google_compute_instance with startup script (a Pulumi-Python sketch follows this list)
- Define google_compute_firewall allowing HTTP from 0.0.0.0/0
- Reference existing custom image by name or family
- Use variables for project, region, zone
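If you take the Pulumi (Python) route, a minimal sketch of the equivalent two resources might look like this; it assumes the pulumi_gcp provider is already configured with your project, and that my-golden-image exists from Project 4:

```python
import pulumi
import pulumi_gcp as gcp

# Firewall rule: allow HTTP from anywhere to instances tagged http-server
firewall = gcp.compute.Firewall(
    "allow-http",
    network="default",
    allows=[gcp.compute.FirewallAllowArgs(protocol="tcp", ports=["80"])],
    source_ranges=["0.0.0.0/0"],
    target_tags=["http-server"],
)

# VM booted from the custom image baked in Project 4
vm = gcp.compute.Instance(
    "web-vm",
    machine_type="e2-micro",
    zone="us-central1-a",
    tags=["http-server"],
    boot_disk=gcp.compute.InstanceBootDiskArgs(
        initialize_params=gcp.compute.InstanceBootDiskInitializeParamsArgs(
            image="my-golden-image",
        ),
    ),
    network_interfaces=[gcp.compute.InstanceNetworkInterfaceArgs(
        network="default",
        access_configs=[gcp.compute.InstanceNetworkInterfaceAccessConfigArgs()],
    )],
)

pulumi.export("external_ip",
              vm.network_interfaces[0].access_configs[0].nat_ip)
```

The two resources map one-to-one onto google_compute_instance and google_compute_firewall in Terraform; only the state workflow (pulumi up/destroy vs terraform apply/destroy) differs.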
Common Mistakes:
- Not initializing the Terraform backend before apply
- Hardcoding project ID instead of using variables
- Forgetting that terraform destroy is needed to avoid charges
Key Concepts: Imperative vs Declarative paradigms, state management, terraform plan/apply/destroy
ACE Mapping: 2.4 (Planning and implementing resources through infrastructure as code: tooling, state management, updates)
Done = terraform apply creates identical infrastructure; terraform destroy removes all resources; terraform plan shows no drift on re-apply.
Verification Commands:
terraform init
terraform plan # Shows 2 resources to create
terraform apply -auto-approve
curl http://$(terraform output -raw external_ip)
terraform destroy -auto-approve
Stretch: Add terraform output for external IP; implement remote state backend with GCS.
Cost: Same as Project 2. Infrastructure cost only while running.
Project 8: The Network Architect [Phase 2, L3]
Goal: Define a custom VPC with two subnets in different regions. Create a firewall rule allowing SSH only from your home IP. Deploy a VM in each subnet.
Why This Matters:
- ACE heavily tests VPC and firewall rule configuration
- Network isolation is fundamental to cloud security
- Understanding CIDR ranges and firewall priority is essential
Requirements:
- Create VPC with auto_create_subnetworks = false
- Create subnets in us-east1 (10.0.1.0/24) and us-west1 (10.0.2.0/24)
- Firewall rule with source_ranges = ["<your-ip>/32"]
- VMs must communicate via internal IPs across regions
Common Mistakes:
- Overlapping CIDR ranges between subnets
- Forgetting that VPC is global but subnets are regional
- Setting firewall priority incorrectly (lower number = higher priority)
- Using 0.0.0.0/0 for SSH instead of restricting to your IP
Key Concepts: Custom VPCs, CIDR planning, firewall rule priority and direction, network isolation
ACE Mapping: 2.3 (Creating VPC with subnets, Cloud NGFW policies, ingress/egress rules)
Done = VMs can ping each other by internal IP; SSH works only from your IP; SSH from other IPs times out.
Verification Commands:
# From VM in us-east1:
ping 10.0.2.x # Internal IP of us-west1 VM - should succeed
# From your machine:
ssh user@<vm-external-ip> # Should succeed
# From a different IP (e.g., Cloud Shell in different project):
ssh user@<vm-external-ip> # Should timeout
Stretch: Add a deny-all egress rule and allow only specific ports (80, 443).
Cost: VPC and firewall rules are free. VM costs only.
Security & Access Cluster (Projects 9–11, 14)
The next three projects (plus Project 14) form a security-focused cluster. If preparing for ACE with limited time, prioritize these.
Project 9: The Access Pattern Showdown [Phase 2, L3]
Goal: Deploy a private VM with no external IP. Connect via IAP tunneling. Then deploy a traditional Bastion host and configure SSH jumping. Compare both approaches.
Why This Matters:
- IAP is the GCP-preferred method for secure access (frequently tested on ACE)
- Understanding when bastion hosts are still appropriate
- Zero-trust networking concepts
Requirements:
- Private VM: network_interface without access_config
- Cloud NAT for the private VM to pull updates (no external IP but can egress)
- Enable IAP TCP forwarding firewall rule (tcp:22 from 35.235.240.0/20)
- Connect: gcloud compute ssh --tunnel-through-iap
- Bastion: public IP, configure ~/.ssh/config with ProxyJump
Common Mistakes:
- Opening SSH from 0.0.0.0/0 "just to test" instead of using IAP from the start
- Using the wrong IAP source range (must be exactly 35.235.240.0/20)
- Forgetting to grant the roles/iap.tunnelResourceAccessor IAM role
- Not setting up Cloud NAT, then wondering why apt update fails
Key Concepts: IAP TCP forwarding, zero-trust access, SSH jumping, Cloud NAT for egress
ACE Mapping: 3.1 (Remotely connecting to instance), 4.1 (IAM policies), 2.3 (Firewall rules)
Done = Both methods work; can articulate when IAP is preferred (audit logging, no exposed bastion, IAM-controlled) vs bastion (legacy compliance, specific network requirements).
Verification Commands:
# IAP method:
gcloud compute ssh private-vm --tunnel-through-iap --zone=us-central1-a
# Bastion method (after configuring ~/.ssh/config):
ssh private-vm # Automatically jumps through bastion
# Verify no external IP:
gcloud compute instances describe private-vm --format="get(networkInterfaces[0].accessConfigs)"
# Should return empty
Stretch: Enable OS Login with 2FA requirement.
Cost: Cloud NAT ≈ $0.045/hour + data processing. Medium cost if left running.
Project 10: The Least-Privilege Identity [Phase 2, L3]
Goal: Create a custom Service Account via IaC with only roles/storage.objectViewer. Attach it to a VM. Verify it can read GCS but cannot write or access Compute APIs.
Why This Matters:
- ACE Section 4.2 is entirely about managing service accounts
- Scopes vs. IAM roles confusion is a common exam trap
- Principle of least privilege is fundamental to cloud security
Requirements:
- Create SA: google_service_account
- Bind role: google_project_iam_member with roles/storage.objectViewer
- Attach to VM: service_account { email = ... scopes = ["cloud-platform"] }
- Use the cloud-platform scope (broad) but a narrow IAM role (restrictive)
Common Mistakes:
- Confusing scopes (legacy, ceiling) with IAM roles (actual permissions)
- Using default compute SA instead of creating custom SA
- Granting roles/storage.admin when only read is needed
- Forgetting that SA changes require a VM restart to take effect
Key Concepts: Service Account vs user identity, scopes vs IAM roles, principle of least privilege
ACE Mapping: 4.2 (Creating SA, minimum permissions, assigning to resources)
Done = VM can read GCS; VM cannot write to GCS or list compute instances.
Verification Commands:
# SSH to the VM, then:
gsutil cat gs://your-bucket/test-object.txt
# Output: (contents of the file)
gsutil cp /tmp/localfile gs://your-bucket/
# Output: AccessDeniedException: 403 ... does not have storage.objects.create access
gcloud compute instances list
# Output: ERROR: (gcloud.compute.instances.list) ... does not have compute.instances.list permission
Stretch: Implement service account impersonation from your local machine.
Cost: Service accounts are free. VM costs only.
Project 11: Trusted Images Policy Lockdown [Phase 2, L3]
Goal: Configure an Organization Policy that restricts the project to only use approved images. Verify that creating a VM from an unapproved image fails.
Why This Matters:
- Supply chain security is increasingly important
- ACE 1.1 covers applying organizational policies
- Understanding policy inheritance and enforcement
Note: This project requires organization-level access. If you only have project-level access, read through conceptually and practice with project-level constraints (like restricting instance creation to specific zones) where possible.
Requirements:
- Enable constraints/compute.trustedImageProjects
- Allow only your project and projects/debian-cloud
- Attempt to create VM from projects/ubuntu-os-cloud (should fail)
- Document the exact error message
Common Mistakes:
- Forgetting to include your own project in the allowed list (can’t use your custom images)
- Not understanding policy inheritance from org → folder → project
- Expecting policy to apply retroactively (only affects new resources)
Key Concepts: Organization Policies, Trusted Images, supply chain security, policy inheritance
ACE Mapping: 1.1 (Applying org policies to resource hierarchy), 4.1 (IAM policies)
Done = VMs from debian-cloud succeed; VMs from ubuntu-os-cloud fail with policy violation error.
Verification Commands:
# Should succeed:
gcloud compute instances create test-debian --image-project=debian-cloud --image-family=debian-11
# Should fail:
gcloud compute instances create test-ubuntu --image-project=ubuntu-os-cloud --image-family=ubuntu-2204-lts
# Output: ERROR: ... Constraint constraints/compute.trustedImageProjects violated
Stretch: Add constraint requiring Shielded VM at org level.
Cost: Organization policies are free.
Project 12: The Cost Watchdog [Phase 2, L2]
Goal: Write a Python Cloud Function that lists all VMs in the project. If a VM has label env:dev and current time is after 7 PM local, stop the instance.
Why This Matters:
- FinOps is increasingly important (ACE 1.2 covers billing)
- Serverless + scheduled automation is a common pattern
- Label-based policies are production best practice
Requirements:
- Trigger: Cloud Scheduler (cron: 0 19 * * *) via Pub/Sub
- Function uses the google-cloud-compute library
- Filter by labels: instance.labels.get('env') == 'dev' (see the sketch after this list)
- Log all actions to Cloud Logging with instance names and actions taken
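A minimal sketch of the function body using the 1st-gen Pub/Sub-triggered signature, assuming a placeholder project ID and a single zone to scan; the 7 PM scheduling is delegated to Cloud Scheduler's cron and timezone settings:

```python
from google.cloud import compute_v1

PROJECT = "my-project"        # placeholder
ZONES = ["us-central1-a"]     # zones to scan

def stop_dev_vms(event, context):
    """Entry point for the Pub/Sub-triggered Cloud Function."""
    client = compute_v1.InstancesClient()
    for zone in ZONES:
        for instance in client.list(project=PROJECT, zone=zone):
            if instance.labels.get("env") != "dev":
                continue
            if instance.status != "RUNNING":
                continue   # handle already-stopped instances gracefully
            print(f"Stopping {instance.name} in {zone}")   # lands in Cloud Logging
            client.stop(project=PROJECT, zone=zone, instance=instance.name)
```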
Common Mistakes:
- Hardcoding timezone instead of using proper timezone handling
- Not granting the function’s SA permission to stop instances
- Stopping instances that are already stopped (handle gracefully)
Key Concepts: FinOps, serverless compute, scheduled automation, label-based policies
ACE Mapping: 1.2 (Billing budgets and alerts), 3.1 (Managing compute resources)
Done = Function deploys successfully; dev VMs stop after 7 PM; production VMs unaffected; Cloud Logging shows decisions.
Verification Commands:
# Deploy function
gcloud functions deploy cost-watchdog --trigger-topic=cost-watchdog-trigger ...
# Manually trigger for testing:
gcloud pubsub topics publish cost-watchdog-trigger --message="test"
# Check logs:
gcloud functions logs read cost-watchdog --limit=20
# Verify instance stopped:
gcloud compute instances describe dev-vm --format="get(status)"
# Output: TERMINATED
Stretch: Add Slack/email notification when VMs are stopped; implement weekend-only shutdown.
Cost: Cloud Functions: 2M free invocations/month. Effectively free for this use case.
Phase 3: Production Engineering
Focus: High Availability, Load Balancing, Observability, and Security Hardening.
ACE Focus: Section 2.1 (Autoscaled MIGs, instance templates), Section 2.3 (Load balancers), and Section 3.4 (Monitoring alerts, Ops Agent).
Project 13: Scalable Cluster: MIG + LB + Monitoring [Phase 3, L3]
Goal: Deploy a Managed Instance Group behind a Global HTTP(S) Load Balancer. Include Cloud Monitoring Uptime Check and Alert Policy in your IaC.
Why This Matters:
- MIGs + LB is the canonical GCE high-availability pattern
- ACE heavily tests this architecture (2.1, 2.3, 3.4)
- “Monitoring as Code” ensures observability isn’t an afterthought
Requirements:
- Instance Template with startup script serving unique instance ID in response
- Regional MIG with target_size = 2
- Global HTTP(S) LB with backend service pointing to MIG
- Health check on port 80, path /health
- Uptime Check pinging LB external IP
- Alert Policy triggering if uptime check fails for 1 minute
Common Mistakes:
- Health check path returning non-200 (instances marked unhealthy, never serve traffic)
- Forgetting named port mapping on instance group
- Not waiting for LB provisioning (can take 5-10 minutes)
- Backend service timeout shorter than application response time
Key Concepts: Instance Templates, MIGs, Global LB architecture, health checks, Monitoring as Code
ACE Mapping: 2.1 (Creating autoscaled MIG with template), 2.3 (Deploying load balancers), 3.4 (Creating alerts based on metrics)
Done = curl to LB IP returns responses from different instances (refresh multiple times); Uptime Check shows green in console; manually killing Nginx on one VM triggers alert.
Verification Commands:
# Get LB IP:
LB_IP=$(gcloud compute forwarding-rules describe my-lb --global --format="get(IPAddress)")
# Test load balancing (should see different instance IDs):
for i in {1..10}; do curl -s http://$LB_IP | grep "Instance:"; done
# Check uptime check status:
gcloud monitoring uptime-check-configs list
Stretch: Add Cloud Armor security policy blocking specific countries or IP ranges.
Cost: Global LB ≈ $18/month minimum + data processing. MIG VMs at hourly rate. Medium-high cost if left running.
Project 14: The Hardened Vault [Phase 3, L3]
Goal: Deploy a VM with Shielded VM enabled (Secure Boot, vTPM, Integrity Monitoring) and Confidential Computing. Verify security features via Cloud Logging.
Why This Matters:
- ACE tests Shielded VM configuration options
- Compliance requirements increasingly mandate hardware-based security
- Understanding integrity monitoring for detecting boot-level attacks
Requirements:
- Enable shielded_instance_config: enable_secure_boot, enable_vtpm, enable_integrity_monitoring
- Enable confidential_instance_config: enable_confidential_compute (requires an N2D machine type; see the sketch after this list)
- Query integrity monitoring logs in Cloud Logging
- Document that disabling Secure Boot requires a VM stop
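In Python, the security features are two extra messages on the Instance body from Project 2; a fragment showing only those fields (boot disk and network interface omitted for brevity, zone is a placeholder):

```python
from google.cloud import compute_v1

ZONE = "us-central1-a"   # placeholder

hardened = compute_v1.Instance(
    name="hardened-vm",
    machine_type=f"zones/{ZONE}/machineTypes/n2d-standard-2",   # Confidential VM needs N2D/C2D
    shielded_instance_config=compute_v1.ShieldedInstanceConfig(
        enable_secure_boot=True,
        enable_vtpm=True,
        enable_integrity_monitoring=True,
    ),
    confidential_instance_config=compute_v1.ConfidentialInstanceConfig(
        enable_confidential_compute=True,
    ),
    # Confidential VMs must terminate (not live-migrate) on host maintenance
    scheduling=compute_v1.Scheduling(on_host_maintenance="TERMINATE"),
    # ...disks and network_interfaces as in Project 2...
)
```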
Common Mistakes:
- Using machine type that doesn’t support Confidential Computing (N2D is the primary family; C2D also supported)
- Assuming Secure Boot is on by default (Shielded VM is the default for current VM families, but Secure Boot may need to be explicitly enabled depending on your image and tooling)
- Not knowing where to find integrity logs in Cloud Logging
Key Concepts: Hardware-based security, vTPM, integrity monitoring, Confidential VMs, SEV encryption
ACE Mapping: 4.1 (Security configurations), 2.1 (Shielded VM, availability policy)
Done = VM boots with all security features enabled; can find earlyBootReportEvent in Cloud Logging; attempting to change Secure Boot while running fails.
Verification Commands:
# Verify Shielded VM config:
gcloud compute instances describe hardened-vm \
--format="get(shieldedInstanceConfig)"
# Output: enableIntegrityMonitoring: true, enableSecureBoot: true, enableVtpm: true
# Find integrity logs in Cloud Logging:
gcloud logging read 'resource.type="gce_instance" AND
logName:"shielded_vm_integrity" AND
jsonPayload.earlyBootReportEvent:*' --limit=5
# Attempt to disable Secure Boot while running (should fail):
gcloud compute instances update hardened-vm --no-shielded-secure-boot
# Output: ERROR: Shielded VM config can only be updated when instance is stopped
Stretch: Simulate integrity violation by modifying boot config; observe alert.
Cost: N2D instances are priced slightly below comparable N2 instances. Confidential VM adds a ~20% premium.
Project 15: The Resiliency Auditor [Phase 3, L4]
Goal: Write a Python chaos script that randomly deletes a VM in your MIG. The script must measure and report exact recovery time.
Why This Matters:
- Chaos engineering validates your self-healing architecture
- Measuring recovery time gives you concrete SLO data
- Understanding MIG autohealing behavior is crucial for production
Requirements:
- Randomly select one instance from the MIG using instance_group_managers().list_managed_instances() (see the sketch after this list)
- Delete it via instances().delete() (not via the MIG API; this simulates an unexpected failure)
- Poll MIG status every second using instance_group_managers().list_managed_instances()
- Log: deletion time, time MIG detected unhealthy, time replacement reached RUNNING
- Output: Total Recovery Time: X seconds
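A minimal sketch of the chaos loop with the compute_v1 clients (placeholder names; the detailed timestamped transitions are trimmed to a single recovery-time measurement):

```python
import random
import time
from google.cloud import compute_v1

PROJECT, ZONE, MIG, TARGET_SIZE = "my-project", "us-central1-a", "my-mig", 2  # placeholders

migs = compute_v1.InstanceGroupManagersClient()
instances = compute_v1.InstancesClient()

managed = list(migs.list_managed_instances(
    project=PROJECT, zone=ZONE, instance_group_manager=MIG))
victim = random.choice(managed).instance.rsplit("/", 1)[-1]   # instance URL -> name

t0 = time.time()
# Delete directly (not through the MIG) to simulate an unexpected crash
instances.delete(project=PROJECT, zone=ZONE, instance=victim)
print(f"Deleted {victim}")

# Poll until the MIG is back at target size with no actions in flight
while True:
    current = list(migs.list_managed_instances(
        project=PROJECT, zone=ZONE, instance_group_manager=MIG))
    stable = [mi for mi in current if mi.current_action == "NONE"]
    if len(stable) >= TARGET_SIZE:
        break
    time.sleep(1)

print(f"Total Recovery Time: {time.time() - t0:.0f} seconds")
```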
Common Mistakes:
- Deleting via MIG API (which gracefully drains) instead of simulating crash
- Not accounting for health check interval in recovery time
- Polling too infrequently to get accurate measurements
Key Concepts: Chaos engineering, self-healing infrastructure, measuring availability, autohealing health checks
ACE Mapping: 3.1 (Managing compute resources), 2.1 (MIG autohealing)
Done = Script outputs recovery timeline; MIG automatically recreates instance; typical recovery time documented (usually 60-120 seconds depending on health check settings).
Verification Commands:
python chaos_auditor.py --mig=my-mig --zone=us-central1-a
# Output:
# [14:32:01] Deleted instance: my-mig-abc123
# [14:32:15] MIG detected unhealthy state
# [14:32:45] Replacement instance created: my-mig-def456
# [14:33:12] Replacement instance RUNNING
# [14:33:18] Health check passed, serving traffic
# Total Recovery Time: 77 seconds
Stretch: Run chaos script 10 times and calculate mean/p95 recovery time.
Cost: Same as Project 13 (MIG infrastructure).
Project 16: The Autoscaling Stress Test [Phase 3, L4]
Goal: Configure CPU-based autoscaling on your MIG (target 60%). Use stress-ng to spike CPU and observe automatic scale-out and scale-in.
Why This Matters:
- ACE 2.1 explicitly covers creating autoscaled MIGs
- Understanding cooldown periods prevents thrashing
- Autoscaler behavior with different metrics (CPU, LB utilization, custom)
Requirements:
- Configure autoscaler: min_replicas=2, max_replicas=6, cpu_utilization_target=0.6
- Install stress-ng via startup script
- Run stress-ng --cpu 4 --timeout 300s on all instances
- Monitor instance count via API or Console
- Observe scale-in after stress stops (respects cooldown)
Common Mistakes:
- Setting CPU target too low (constant thrashing)
- Not understanding stabilization/cooldown periods
- Expecting immediate scale-in (the autoscaler's scale-in stabilization window is 10 minutes)
Key Concepts: Horizontal scaling, autoscaler policies (CPU, LB, custom metrics), cooldown periods
ACE Mapping: 2.1 (Creating autoscaled MIG), 3.4 (Cloud Monitoring metrics)
Done = MIG scales from 2→5+ during stress; scales back to 2 after cooldown (10+ minutes); scaling events visible in Monitoring.
Verification Commands:
# Start stress on all instances:
for vm in $(gcloud compute instance-groups managed list-instances my-mig --zone=us-central1-a --format="get(instance)"); do
gcloud compute ssh $vm --command="stress-ng --cpu 4 --timeout 300s &"
done
# Watch instance count:
watch -n 10 'gcloud compute instance-groups managed describe my-mig --zone=us-central1-a --format="get(targetSize)"'
# Should increase from 2 to 5-6
# After stress stops, wait 10+ minutes for scale-in
Stretch: Configure schedule-based autoscaling for predictable load patterns.
Cost: Autoscaling to 6 instances = 3x base cost during stress test.
Phase 4: SRE & Disaster Recovery
Focus: Production-grade reliability, disaster recovery, and CI/CD pipelines.
ACE Focus: Section 2.2 (Multi-region redundancy), Section 2.4 (IaC versioning and updates), and Section 3.4 (Cloud diagnostics, log routers).
Project 17: Cross-Region Disaster Recovery [Phase 4, L4]
Goal: Deploy a primary application in us-east1. Write a Disaster Recovery script that restores service in us-west1 from the latest snapshot within 10 minutes.
Why This Matters:
- ACE tests understanding of RTO/RPO concepts
- Cross-region redundancy is required for true disaster recovery
- Programmatic DR enables faster recovery than manual runbooks
Requirements:
- Primary: VM + disk with snapshot schedule replicating to us-west1 (multi-regional storage)
- Snapshot schedules are configured via resource policies (google_compute_resource_policy in Terraform)
- DR script: find latest snapshot with snapshots().list() filtered by source disk (see the sketch after this list)
- Create disk from snapshot in us-west1
- Create VM from disk, configure networking/firewall
- Verify HTTP response within 10-minute RTO
- Script must fail (exit 1) if RTO exceeded
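A minimal sketch of the snapshot-lookup and disk-restore steps, with placeholder names; VM creation, firewall rules, and the HTTP/RTO check follow the Project 2 pattern in the DR region:

```python
from google.cloud import compute_v1

PROJECT = "my-project"         # placeholder
SOURCE_DISK = "primary-disk"   # disk protected by the snapshot schedule
DR_ZONE = "us-west1-b"

snapshots = compute_v1.SnapshotsClient()
disks = compute_v1.DisksClient()

# Find the newest snapshot taken from the primary disk (filtered client-side)
candidates = [s for s in snapshots.list(project=PROJECT)
              if s.source_disk.endswith(f"/disks/{SOURCE_DISK}")]
latest = max(candidates, key=lambda s: s.creation_timestamp)
print("Restoring from", latest.name)

# Create a disk in the DR region's zone from that snapshot
restored = compute_v1.Disk(
    name="dr-disk",
    source_snapshot=latest.self_link,
    type_=f"projects/{PROJECT}/zones/{DR_ZONE}/diskTypes/pd-balanced",
)
disks.insert(project=PROJECT, zone=DR_ZONE, disk_resource=restored).result()
# Next: create the VM from dr-disk, open the firewall, and probe HTTP until the RTO deadline
```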
Common Mistakes:
- Snapshot storage location not including DR region
- Not accounting for disk creation time from snapshot
- Forgetting to recreate firewall rules in DR region
- DNS/IP changes not accounted for in recovery
Key Concepts: RTO (Recovery Time Objective), RPO (Recovery Point Objective), cross-region replication, snapshot storage locations
ACE Mapping: 3.1 (Working with snapshots), 2.2 (Multi-region redundancy)
Done = DR script completes in < 10 minutes; new VM serves application; script logs RTO achievement.
Verification Commands:
python disaster_recovery.py --source-region=us-east1 --target-region=us-west1
# Output:
# [14:00:00] Starting DR procedure
# [14:00:05] Found latest snapshot: my-disk-20240115-120000
# [14:01:30] Disk created in us-west1 from snapshot
# [14:03:45] VM created and booting
# [14:05:12] Health check passed - application responding
# [14:05:12] DR SUCCESSFUL - RTO: 5m 12s (target: 10m)
Stretch: Implement automated failover triggered by health check failure + Pub/Sub.
Cost: Cross-region data transfer ≈ $0.08/GB. Snapshot storage in multiple regions doubles storage cost. Plan carefully.
Project 18: The Golden Pipeline [Phase 4, L4]
Goal: Set up Cloud Build to trigger on GitHub push. Pipeline bakes a new image, updates Instance Template, and performs rolling MIG update with zero downtime.
Why This Matters:
- ACE 2.4 covers IaC deployments including versioning and updates
- Immutable infrastructure via image baking is a production best practice
- Rolling updates maintain availability during deployments
Requirements:
- Cloud Build trigger on push to the main branch
- Build step 1: Build application, create new image with version tag
- Build step 2: Create new Instance Template referencing new image
- Build step 3: Start rolling update on MIG with max_surge=1, max_unavailable=0
- Build step 4: Clean up Instance Templates older than 5 versions
Common Mistakes:
- Not using max_unavailable=0 (allows brief downtime)
- Image naming collisions (must include a unique identifier like the commit SHA)
- Not cleaning up old templates (clutter + potential cost)
Key Concepts: GitOps, immutable deployments, rolling updates, zero-downtime deployments
ACE Mapping: 2.4 (Planning and implementing resources through IaC: versioning, updates), 3.1 (Managing compute resources)
Done = Git push triggers build; new image created; MIG updates with zero dropped requests; build history shows success.
Verification Commands:
# Push a change:
git commit -am "Update index.html" && git push
# Watch Cloud Build:
gcloud builds list --limit=1
# Monitor rolling update:
gcloud compute instance-groups managed describe my-mig --zone=us-central1-a \
--format="get(status.versionTarget)"
# Verify zero downtime (run during update):
while true; do curl -s -o /dev/null -w "%{http_code}\n" http://$LB_IP; sleep 0.5; done
# Should show continuous 200s
Stretch: Add canary deployment: route 10% traffic to new version, promote after health check passes.
Cost: Cloud Build: 120 free build-minutes/day. Image storage per version.
Project 19: The Observability Stack [Phase 4, L4]
Goal: Deploy comprehensive observability: Ops Agent for metrics/logs, custom metrics from application, log-based metrics, and a Cloud Monitoring dashboard.
Why This Matters:
- ACE 3.4 extensively covers monitoring and logging
- Ops Agent is the modern replacement for legacy agents
- Custom metrics enable application-specific observability
Requirements:
- Install Ops Agent via startup script (official installation script)
- Application writes custom metric custom.googleapis.com/myapp/request_count to the Monitoring API (see the sketch after this list)
- Create log-based metric counting severity>=ERROR from application logs
- Build Monitoring dashboard with 4 widgets: CPU, memory, custom metric, error rate
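A minimal sketch of writing one point of the custom metric with the google-cloud-monitoring client; project and instance labels are placeholders, and metric-descriptor creation (see Common Mistakes below) is omitted:

```python
import time
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"                              # placeholder
INSTANCE_ID, ZONE = "1234567890123456789", "us-central1-a"  # placeholders

client = monitoring_v3.MetricServiceClient()

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/myapp/request_count"
series.resource.type = "gce_instance"
series.resource.labels["project_id"] = PROJECT_ID
series.resource.labels["instance_id"] = INSTANCE_ID
series.resource.labels["zone"] = ZONE

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}})
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 42}})
series.points = [point]

# Write one data point; in the real app this runs per request or in batches
client.create_time_series(name=f"projects/{PROJECT_ID}", time_series=[series])
```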
Common Mistakes:
- Using legacy monitoring agent instead of Ops Agent
- Custom metric descriptor not created before writing data
- Log-based metric filter syntax errors
- Dashboard JSON/Terraform syntax for complex layouts
Key Concepts: Ops Agent, custom metrics, log-based metrics, dashboards as code, metric descriptors
ACE Mapping: 3.4 (Monitoring and logging: deploying Ops Agent, creating custom metrics, configuring log routers, creating alerts)
Done = Dashboard shows all four metric types updating live; can correlate CPU spike with error rate increase.
Verification Commands:
# Verify Ops Agent running:
gcloud compute ssh my-vm --command="sudo systemctl status google-cloud-ops-agent"
# Check custom metric exists:
gcloud monitoring metrics-descriptors list \
--filter="type=custom.googleapis.com/myapp/request_count"
# View dashboard (get URL):
gcloud monitoring dashboards list --format="get(name)"
Stretch: Export logs to BigQuery; create SQL-based analysis of error patterns.
Cost: Logging: first 50 GiB ingested per project/month free. Custom metrics: first 150 MiB/month free. Dashboards free.
Project 20: The Incident Response Drill [Phase 4, L4]
Goal: Simulate a production incident: degrade application, observe alerts fire, follow runbook to diagnose, remediate, and write a post-mortem.
Why This Matters:
- Incident response skills are crucial for production engineering
- Post-mortem culture drives continuous improvement
- ACE 3.4 covers using diagnostics to research application issues
Requirements:
- Inject fault: choose one of
  - Fill disk: dd if=/dev/zero of=/tmp/fill bs=1M count=10000
  - Exhaust memory: stress-ng --vm 2 --vm-bytes 90%
  - Break health check: sudo systemctl stop nginx
- Observe: alert fires, MIG attempts recreation (if applicable), LB drains traffic
- Diagnose: use Cloud Logging, serial console, SSH to identify root cause
- Remediate: fix root cause or rollback
- Write post-mortem using template: timeline, impact, root cause, action items
Common Mistakes:
- Not having alerts configured before the drill
- Remediation that fixes symptom but not root cause
- Post-mortem that assigns blame instead of identifying systemic issues
Key Concepts: Incident response, post-mortem culture, runbooks, blameless retrospectives
ACE Mapping: 3.4 (Cloud diagnostics, Cloud Trace, viewing logs), 3.1 (Managing compute resources)
Done = Alert fired and acknowledged; incident timeline documented to the minute; post-mortem includes 3+ concrete action items; runbook updated with new failure mode.
Deliverables:
## Post-Mortem: [Date] Disk Full Incident
### Timeline
- 14:32 - Disk usage alert fired (>90%)
- 14:35 - On-call acknowledged
- 14:38 - Root cause identified: log rotation misconfigured
- 14:42 - Temporary fix: deleted old logs
- 14:45 - Service fully recovered
### Impact
- 10 minutes degraded performance
- 0 data loss
### Root Cause
Log rotation cron job was set to weekly instead of daily.
### Action Items
1. [ ] Fix log rotation config in base image
2. [ ] Add disk usage alert at 70% (earlier warning)
3. [ ] Document log locations in runbook
Stretch: Implement automated remediation for the specific failure mode (e.g., auto-delete old logs when disk > 80%).
Cost: No additional cost beyond existing infrastructure.
Appendix A: ACE Exam Coverage Matrix
This roadmap covers the following ACE exam objectives. Weights are from the official exam guide; project mappings are approximate.
| ACE Section | Weight | Projects |
|---|---|---|
| 1. Setting up cloud environment | ~23% | 1, 7, 8, 11, 12 |
| 2. Planning/implementing solution | ~30% | 2–6, 8, 13, 16, 17, 18 |
| 3. Ensuring successful operation | ~27% | 3–5, 12, 15, 18–20 |
| 4. Configuring access/security | ~20% | 9–11, 14 |
High-Yield ACE Practice Set: If you’re short on time, prioritize Projects 2–5, 8–11, 13, and 16. These cover the core 2.x (compute, networking, IaC) and 3.x (operations, monitoring) objectives that make up ~57% of the exam.
Appendix B: Key GCE Features Covered
| Feature | Project(s) |
|---|---|
| Machine Type Selection | 1, 2, 14 |
| Persistent Disk (zonal, regional) | 3, 5, 17 |
| Custom Images & Snapshots | 4, 5, 17, 18 |
| Spot/Preemptible VMs | 6 |
| Custom VPC & Firewall Rules | 8, 9 |
| IAP TCP Forwarding | 9 |
| Service Accounts & IAM | 10, 11, 12 |
| Organization Policies | 11 |
| MIGs & Instance Templates | 13, 15, 16, 18 |
| Load Balancing (HTTP/S) | 13 |
| Shielded & Confidential VMs | 14 |
| Autoscaling Policies | 16 |
| Ops Agent & Custom Metrics | 19 |
| Cloud Build CI/CD | 18 |
Appendix C: Documentation Requirements
For each project, create:
- README.md — Architecture diagram (even ASCII art), key commands, failure modes, recovery steps
- Code files — Python scripts, Terraform/Pulumi configs, Cloud Build YAML, all version controlled
- Post-mortem (Projects 15, 20) — Timeline, root cause, 3+ action items
- Cost log — Actual cost incurred, compared to estimate
Appendix D: Cost Summary
| Cost Level | Projects | Estimated Cost |
|---|---|---|
| Free/Minimal | 1–5, 7–8, 10–12, 20 | < $5 total if deleted promptly |
| Low | 6, 14 | $5–15 if run for a day |
| Medium | 9, 15–16, 18–19 | $15–40 if run for a day |
| Higher | 13, 17 | $40–100 if run for a day (Global LB + cross-region) |
Critical: Always run terraform destroy or equivalent cleanup after each project session. Set a calendar reminder!
Last updated: January 2026
Related
- google-compute-engine — GCE concepts and exam prep
- ACE Certification Plan — Full study plan
- gcp-learning-path — GCP learning roadmap
- gcp-resources — Courses and labs
- Python for DevOps — Python automation basics
- docker-to-cloud-run — Container deployment lab