Prerequisites
This note assumes familiarity with core GCP concepts. WikiLinks like [[Cloud IAM]] and [[VPC Networks]] reference other notes in this digital garden—if those pages don't exist yet, treat them as topics to explore in the official documentation.
At a Glance
| Topic | What You’ll Learn | ACE Exam Weight |
|---|---|---|
| Machine Families | Choosing N4/E2/C4/M3 for scenarios | High (Section 2.1) |
| Storage | PD vs Hyperdisk decision matrix | High (Section 2.1) |
| MIGs & Autoscaling | Designing for 99.99% availability | High (Sections 2.1, 3.1) |
| Security | Shielded/Confidential VMs, OS Login | Medium (Section 4.1–4.2) |
| Cost Optimization | Spot vs CUD vs On-Demand | Medium (Section 1.2) |
🚀 If You Only Have 30 Minutes Before Practice Exam
Must read (15 min):
- The Five Machine Families table + its Exam Strategy callout
- Managed Instance Groups (MIGs) table + Autoscaling Reactive vs Predictive
- The Cost Spectrum diagram
Quick skim (10 min):
- Hyperdisk Variants table + Storage Exam Strategy callout
- Security Threat Model comparison table
If time permits (5 min):
- Exam Traps Cheat Sheet at the bottom
- Practice Scenarios — try answering before checking
Why GCE Still Matters in a Serverless World
Here’s a dirty little secret that the “serverless everything” crowd doesn’t want you to think about: every GKE node is just a Compute Engine VM under the hood. Every Cloud Run container ultimately executes on Google’s virtualized infrastructure. When you peel back the abstraction layers of any “serverless” platform, you find VMs—and understanding those VMs gives you superpowers when things go sideways.
For those of us coming from a malware analysis background, GCE is particularly fascinating. It’s where you can spin up isolated sandbox environments, create honeypots with precise network controls, and build forensic workstations with exactly the hardware profile you need. But more broadly, GCE is the Infrastructure as a Service (IaaS) foundation that everything else in GCP sits on.
Think of it this way: if Google Kubernetes Engine is a luxury apartment complex with a concierge (Google) handling maintenance, GCE is buying raw land and building exactly the house you want. More work? Yes. More control? Absolutely.
Where GCE Fits in the Cloud Spectrum
GCE is “hiring a contractor to build a custom house.” You tell them the specs (vCPUs, memory, storage), they provide the land and foundation (hardware, virtualization), and you’re responsible for everything from the OS up—runtime, scaling, application code, and data.
The Hardware Deep Dive: Understanding Machine Families
Goal: Pick the right machine family in under 30 seconds for any exam scenario.
Maps to ACE 2.1: Planning and implementing compute resources — “Selecting appropriate compute choices for a given workload.”
The biggest conceptual shift in modern GCE (2024-2026) is understanding that not all vCPUs are created equal. Google now offers highly specialized machine families optimized for radically different workloads. Picking the wrong family is like bringing a scalpel to a demolition job—or a sledgehammer to surgery.
The Five Machine Families
graph TD subgraph "Machine Type Families" GP[🖥️ General Purpose<br/>N4, E2, Tau T2A] CO[⚡ Compute Optimized<br/>C4, C3] MO[🧠 Memory Optimized<br/>X4, M3] AO[🎮 Accelerator Optimized<br/>A3, A2 - GPUs] SO[💾 Storage Optimized<br/>Z3] end GP --> |"Best for"| GPuse[Web servers, Dev/Test<br/>Microservices, Databases] CO --> |"Best for"| COuse[HPC, Gaming servers<br/>Single-threaded apps, CI/CD] MO --> |"Best for"| MOuse[SAP HANA, In-memory DBs<br/>Large analytics workloads] AO --> |"Best for"| AOuse[ML Training, Rendering<br/>Scientific simulation] SO --> |"Best for"| SOuse[High-IOPS databases<br/>Real-time analytics] style GP fill:#4CAF50,color:white style CO fill:#2196F3,color:white style MO fill:#9C27B0,color:white style AO fill:#FF9800,color:white style SO fill:#607D8B,color:white
General Purpose (N4, E2, Tau T2A/T2D)
The workhorses. These are your go-to for 80% of workloads.
| Series | Architecture | Sweet Spot | Notes |
|---|---|---|---|
| N4 | 4th-gen Intel Xeon + Titanium offloads | Production web apps, balanced workloads | Newest generation, best price/performance |
| E2 | Intel or AMD; shared-core options that can burst | Dev/test, cost-sensitive workloads | Up to 32 vCPUs, typically the lowest-cost GP option |
| Tau T2A | Arm-based (Ampere Altra) | Scale-out microservices, containerized apps | ~20% better price/performance for Arm-compatible workloads |
| Tau T2D | AMD EPYC (x86) | Scale-out web serving, containerized apps | x86 option for Tau family |
Previous-Generation GP Options You'll still encounter N2, N2D, and N1 in production environments and exam questions. These remain fully supported but are considered previous-generation for new deployments. N4 offers better performance per dollar for most new workloads.
Exam Strategy: General Purpose The ACE exam loves to test whether you understand when to use E2 vs. N4. Key differentiator: E2 uses shared cores for smaller instances (cost savings through CPU bursting), while N4 provides dedicated cores. For consistent, production workloads, N4. For bursty dev/test, E2.
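To make the comparison concrete, you can list the shapes of both families side by side; a minimal sketch using standard gcloud filters (the zone is just an example):
# Compare E2 and N4 standard shapes available in one zone (zone is illustrative)
gcloud compute machine-types list \
  --zones=us-central1-a \
  --filter="name~'^e2-standard' OR name~'^n4-standard'" \
  --format="table(name, guestCpus, memoryMb)"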
Compute Optimized (C4, C3)
When raw CPU clock speed and single-threaded performance matter more than anything else.
- C4: Built on Titanium with Intel Emerald Rapids. Highest per-core performance available.
- C3: Previous generation, still excellent for HPC and gaming servers.
Use cases: Game servers (think Minecraft or Valheim hosting), High-Performance Computing (HPC) batch jobs, Electronic Design Automation (EDA), single-threaded legacy applications that can’t be parallelized.
Memory Optimized (X4, M3)
These are the beasts. We’re talking machines with multiple terabytes of RAM (up to tens of TB) and hundreds of vCPUs.
- X4: Latest generation, Intel Sapphire Rapids, massive memory capacity.
- M3: Excellent for SAP HANA certified workloads.
If you’re running in-memory databases like Redis clusters at scale, SAP HANA, or doing genomics analysis where the entire dataset needs to fit in RAM, this is your family.
Storage Optimized (Z3)
The newest family (2024+), designed for workloads that need extreme local SSD IOPS—think distributed databases like Cassandra, CockroachDB, or time-series databases.
Accelerator Optimized (A3, A2)
GPU-attached instances for ML/AI training, inference, rendering, and scientific simulation. The A3 series with NVIDIA H100 GPUs is the current flagship for large language model training.
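Before committing to an A3 or A2 shape, it's worth checking which GPU models a given zone actually offers; a quick sketch (zone is illustrative):
# List the accelerator types (GPU models) offered in a zone
gcloud compute accelerator-types list --filter="zone:us-central1-a"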
The Titanium Evolution
Here’s where it gets interesting for systems engineers. Google’s Titanium is a custom silicon offload chip that handles I/O operations—network virtualization, storage I/O, security functions—off the host CPU.
Why Titanium matters:
- More usable CPU cycles — Your vCPUs aren’t wasting time on packet processing
- Consistent performance — I/O-heavy workloads don’t starve compute
- Enhanced security — Hardware-isolated operations reduce attack surface
When Titanium matters for your choice:
- ✅ Choose N4/C4 (Titanium-enabled) when: Network-intensive apps, storage-heavy databases, security-sensitive workloads
- ⚠️ Titanium doesn’t matter when: Simple dev/test VMs, cost is primary concern (E2 is fine)
Think of it like having a dedicated mail room staff (Titanium) so your employees (vCPUs) don’t have to stop working every time a package arrives.
Exam Traps: Machine Families
- Trap 1: Picking C4 (compute-optimized) when E2 would suffice. The exam often emphasizes cost efficiency over maximum performance.
- Trap 2: Forgetting that Tau T2A is Arm-based. If the scenario mentions “existing x86 binaries,” T2A won’t work without recompilation.
- Trap 3: Choosing memory-optimized (M3/X4) for a “large database” without checking if it’s actually memory-bound. Most databases are I/O-bound first.
Legacy Machine Types on the Exam You'll still see N1, N2, and N2D in exam scenarios. These aren't deprecated, but for new deployments, the N4/C4/M3/X4 families offer better performance per dollar. The exam may present scenarios where you need to identify the appropriate generation for a migration or cost optimization question.
The Storage Layer: Why Hyperdisk is the 2026 Standard
Goal: Instantly distinguish when to use PD-Balanced, Regional PD, or Hyperdisk based on scenario keywords.
Maps to ACE 2.1: “Choosing the appropriate storage for Compute Engine (e.g., zonal Persistent Disk, regional Persistent Disk, Google Cloud Hyperdisk).”
The Critical Decision: Regional PD vs Hyperdisk
Before diving into details, memorize this decision rule:
| Scenario Keyword | Best Choice | Why |
|---|---|---|
| ”Cross-zone redundancy” / “survive zone failure” | Regional PD | Synchronously replicated across 2 zones |
| ”Tune IOPS independently” / “SAN-like flexibility” | Hyperdisk | Decoupled capacity and performance |
| ”Cost-effective general workload” | PD-Balanced | Good default, scales with size |
| ”Maximum database IOPS” | Hyperdisk Extreme | Up to ~350k IOPS, independently tunable |
The Old World: Persistent Disk
Traditional Persistent Disk options still exist and you’ll encounter them:
| Type | IOPS | Throughput | Use Case |
|---|---|---|---|
| PD-Standard | Low (0.75 per GB) | Low | Boot disks, cold storage |
| PD-Balanced | Medium (6 per GB) | Medium | General workloads |
| PD-SSD | High (30 per GB) | High | Databases, high-performance apps |
The critical limitation: performance scales linearly with capacity. Want more IOPS? You need a bigger disk, even if you don’t need the space.
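In practice that means "more IOPS on a PD" is usually a resize, not a migration; a minimal sketch (disk name, zone, and size are placeholders):
# Grow a Persistent Disk to raise its IOPS/throughput ceiling
gcloud compute disks resize my-data-disk \
  --zone=us-central1-a \
  --size=500GB
# Then grow the partition/filesystem inside the VM, e.g. with growpart and resize2fs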
The New World: Hyperdisk
Hyperdisk is Google’s answer to enterprise SAN (Storage Area Network) flexibility. The revolutionary concept: decoupled performance and capacity.
graph LR subgraph "Traditional PD" PD[100GB PD-SSD] --> IOPS1[3,000 IOPS<br/>Fixed to size] end subgraph "Hyperdisk" HD[100GB Hyperdisk] --> IOPS2[3,000 to 350,000 IOPS<br/>You choose] HD --> TP[Throughput: You choose] HD --> CAP[Capacity: You choose] end style PD fill:#FFA726 style HD fill:#66BB6A
Hyperdisk Variants
| Type | Max IOPS | Max Throughput | Sweet Spot |
|---|---|---|---|
| Hyperdisk Balanced | Up to ~160,000 | Up to ~2.4 GB/s | General workloads needing flexibility |
| Hyperdisk Extreme | Up to ~350,000 | Up to ~5 GB/s | Databases, SAP HANA |
| Hyperdisk Throughput | Scales with throughput | Up to ~2.4 GB/s | Analytics, large sequential reads/writes |
| Hyperdisk ML | Optimized for parallel reads | Massive parallelism | ML training data, multi-node access |
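The table's key property, performance decoupled from capacity, looks like this in practice; a hedged sketch (names and values are illustrative, and exact limits depend on the variant and disk size):
# Create a Hyperdisk Balanced volume with explicitly provisioned performance
gcloud compute disks create my-hyperdisk \
  --zone=us-central1-a \
  --type=hyperdisk-balanced \
  --size=100GB \
  --provisioned-iops=10000 \
  --provisioned-throughput=400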
Hyperdisk Throughput I/O Profile Hyperdisk Throughput is tuned for large sequential workloads rather than random I/O. IOPS scales with provisioned throughput (roughly 4 IOPS per MiB/s), making it cost-effective for analytics pipelines and data lakes.
Exam Strategy: Hyperdisk Variants For the ACE exam, 90% of storage questions boil down to three choices: PD-Standard (cheap/cold data), PD-Balanced or Hyperdisk Balanced (default for web apps), and PD-SSD or Hyperdisk Extreme (databases needing high IOPS). Hyperdisk ML and Throughput are real products but rarely appear on the associate-level exam—save that knowledge for the Professional Cloud Architect track.
Storage Pools: The Cost Optimization Secret
Here’s where it gets financially interesting. Storage Pools allow you to provision a pool of Hyperdisk capacity and performance, then carve out individual disks from that pool.
Why does this matter? Thin provisioning.
Imagine you have 10 VMs that each might need 100GB and 10,000 IOPS during peak load, but typically use 20GB and 2,000 IOPS. Without pools:
- 10 × 100GB × 10,000 IOPS = 1TB provisioned, 100,000 IOPS provisioned
- Cost: $$
With Storage Pools:
- Pool: 400GB capacity, 30,000 IOPS (sized for realistic aggregate demand)
- Individual disks: 100GB each, but drawing from shared pool
- Cost: $ (you pay for the pool, not theoretical maximums)
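A rough sketch of how this looks in practice; note that the storage-pool flag names here are from memory and worth double-checking against the current gcloud reference before relying on them:
# Create a Hyperdisk storage pool sized for realistic aggregate demand (values illustrative)
gcloud compute storage-pools create my-pool \
  --zone=us-central1-a \
  --storage-pool-type=hyperdisk-balanced \
  --provisioned-capacity=400GB \
  --provisioned-iops=30000 \
  --provisioned-throughput=1024
# Carve an individual disk out of the pool
gcloud compute disks create vm1-data \
  --zone=us-central1-a \
  --type=hyperdisk-balanced \
  --size=100GB \
  --storage-pool=my-pool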
Exam Traps: Storage
- Trap 1: Ignoring Regional PD when “cross-zone durability” or “zone failure” is mentioned. Hyperdisk is flexible but doesn’t inherently replicate across zones.
- Trap 2: Choosing Hyperdisk for a simple web server boot disk. PD-Balanced is usually the right default—Hyperdisk shines when you need to tune IOPS/throughput independently.
- Trap 3: Forgetting that PD performance scales with size. If someone says “need more IOPS on PD-SSD,” the answer might be “increase disk size,” not “switch to Hyperdisk.”
Exam Strategy: Storage The exam loves to trick you with scenarios asking about Balanced PD vs. Hyperdisk Balanced. Key insight: if the question mentions needing to scale IOPS independently of capacity, or talks about "enterprise SAN-like flexibility," the answer is Hyperdisk. If it's a simple "what's the default for a web server?" question, PD-Balanced is usually the cost-effective answer.
Life of a VM: From Boot to Production
Goal: Understand what happens at each boot stage and how Shielded VMs protect that process.
Maps to ACE 2.1: “Launching a compute instance (e.g., availability policy, SSH keys)” and ACE 3.1: “Working with snapshots and images.”
Understanding the VM lifecycle is crucial for both the exam and real-world troubleshooting.
The Boot Process
sequenceDiagram
    participant User
    participant GCE API
    participant Scheduler
    participant Host
    participant VM
    User->>GCE API: gcloud compute instances create
    GCE API->>Scheduler: Find suitable host
    Scheduler->>Host: Allocate resources
    Host->>VM: Create VM from image
    VM->>VM: UEFI/Shielded VM verification
    VM->>VM: Boot OS
    VM->>VM: Execute startup script
    VM->>VM: Guest Agent registers
    VM-->>User: RUNNING status
- Image Selection: You specify a boot disk image (public like debian-12 or custom).
- Host Placement: Google’s scheduler finds a host with available resources in your specified zone.
- Shielded VM Verification (if enabled, which is default): The boot process is verified against known-good measurements.
- OS Boot: The operating system loads.
- Startup Script Execution: Any metadata-specified startup scripts run.
- Guest Agent: The GCE guest agent registers, enabling features like OS Login and metadata queries.
Shielded VM: Security by Default
Since ~2020, all VM families are Shielded VMs by default. This is huge for security-conscious deployments.
What Shielded VMs provide:
| Feature | What It Does | Why It Matters |
|---|---|---|
| Secure Boot | Verifies bootloader and kernel signatures | Prevents boot-level rootkits |
| vTPM | Virtual Trusted Platform Module | Enables measured boot, secure key storage |
| Integrity Monitoring | Compares boot measurements to baseline | Detects tampering in real-time |
For Malware Analysts The vTPM and integrity monitoring features are goldmines for incident response. If you suspect a VM has been compromised at the boot level, you can compare its current boot measurements against the baseline. A mismatch is a strong indicator of boot-level persistence mechanisms.
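A hedged sketch of what that looks like operationally: Shielded features can be stated explicitly at create time, and integrity events can be pulled from Cloud Logging (the log filter below is approximate; verify the exact log name in the Logging docs):
# Shielded VM features are on by default on supported images; shown explicitly here
gcloud compute instances create forensics-vm \
  --zone=us-central1-a \
  --shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring
# Pull recent Shielded VM integrity events for review
gcloud logging read 'logName:"compute.googleapis.com%2Fshielded_vm_integrity"' --limit=20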
Startup Scripts: Automation at Boot
Startup scripts run every time a VM boots (not just first boot). They’re specified via instance metadata:
gcloud compute instances create my-vm \
--metadata startup-script='#!/bin/bash
apt-get update
apt-get install -y nginx
systemctl enable nginx
systemctl start nginx'
Or reference a script in Cloud Storage:
gcloud compute instances create my-vm \
--metadata startup-script-url=gs://my-bucket/startup.sh
Startup Script Pitfalls Startup scripts run as root on every boot, and long-running scripts delay your application being ready to serve (the instance reports RUNNING before the script finishes). For complex provisioning, consider using startup scripts only to bootstrap a configuration management tool like Ansible or Puppet.
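When a startup script misbehaves, the quickest diagnostics are the guest logs and the serial console; a sketch for Debian/Ubuntu guest images (the systemd unit name can differ on other images):
# From inside the VM: view startup-script output captured by the guest agent
sudo journalctl -u google-startup-scripts.service --no-pager
# From outside the VM: read the serial console output
gcloud compute instances get-serial-port-output my-vm --zone=us-central1-a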
Resiliency: Designing for 99.99% SLA
Goal: Design a multi-zone architecture that survives zone failures using MIGs, health checks, and autoscaling.
Maps to ACE 2.1: “Creating an autoscaled managed instance group by using an instance template” and ACE 3.1: “Working with GKE node pools… autoscaling node pool.”
The uploaded materials emphasize Managed Instance Groups (MIGs) as the core of high availability, and they’re absolutely right.
The Hierarchy of Resilience
graph TD subgraph "GCP Project" subgraph "Region: us-central1" subgraph "Zone A" MIG1[Regional MIG<br/>Instance 1] HD1[Hyperdisk 1] MIG1 --> HD1 end subgraph "Zone B" MIG2[Regional MIG<br/>Instance 2] HD2[Hyperdisk 2] MIG2 --> HD2 end subgraph "Zone C" MIG3[Regional MIG<br/>Instance 3] HD3[Hyperdisk 3] MIG3 --> HD3 end end LB[Global Load Balancer] --> MIG1 LB --> MIG2 LB --> MIG3 end style LB fill:#4285F4,color:white style MIG1 fill:#34A853,color:white style MIG2 fill:#34A853,color:white style MIG3 fill:#34A853,color:white
Managed Instance Groups (MIGs)
A MIG is a collection of identical VMs created from the same Instance Template. Google manages them as a unit.
Key features:
| Feature | Zonal MIG | Regional MIG |
|---|---|---|
| Distribution | Single zone | Across 3 zones |
| Failure domain | Zone failure = total outage | Zone failure = ~66% capacity remains |
| Use case | Stateful apps, cost-sensitive | Production, HA required |
| Availability | Lower (single-zone dependency) | Higher (can reach 99.99% with proper LB and health checks) |
Autohealing: Self-Repairing Infrastructure
MIGs can automatically detect and replace unhealthy instances using health checks:
gcloud compute health-checks create http my-health-check \
--port 80 \
--request-path /health \
--check-interval 10s \
--timeout 5s \
--unhealthy-threshold 3
When an instance fails 3 consecutive health checks, the MIG automatically:
- Deletes the unhealthy instance
- Creates a new instance from the template
- Waits for it to pass health checks
- Adds it to the load balancer
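Creating the health check alone is not enough; autohealing only activates once the check is attached to the MIG, roughly like this (the initial delay gives new instances time to boot before being judged unhealthy):
# Attach the health check to the MIG to enable autohealing
gcloud compute instance-groups managed update my-mig \
  --region=us-central1 \
  --health-check=my-health-check \
  --initial-delay=300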
Autoscaling: Reactive vs. Predictive
Reactive autoscaling (traditional) responds to current metrics:
- CPU utilization exceeds 70%? Add instances.
- CPU drops below 40%? Remove instances.
Predictive autoscaling (the 2024+ game-changer) uses ML to forecast load based on historical patterns:
Predictive Autoscaling
If your workload has predictable patterns (daily traffic spikes at 9 AM, weekly batch jobs on Sundays), predictive autoscaling learns these patterns and pre-provisions instances before the load arrives. This eliminates the “cold start” lag where users experience slowness while new instances spin up.
gcloud compute instance-groups managed set-autoscaling my-mig \
  --region=us-central1 \
  --max-num-replicas=10 \
  --target-cpu-utilization=0.7 \
  --cpu-utilization-predictive-method=optimize-availability
Live Migration: Google’s Magic Trick
Here’s something that still amazes systems engineers: Google can move your running VM to a different physical host without any downtime.
When live migration happens:
- Hardware maintenance scheduled
- Software updates on the host
- Resource rebalancing
What you experience: Maybe a brief network blip (sub-second). Your VM keeps running, processes stay alive, network connections persist.
Exam Strategy: Availability The exam tests whether you understand the availability policy options:
- MIGRATE (default): Live migration during maintenance
- TERMINATE: VM is stopped, then restarted on a new host
Spot VMs (preemptible) only support TERMINATE because they can be reclaimed at any time.
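The policy is set per instance at create time (or later with instances set-scheduling); a minimal sketch with placeholder names:
# Explicitly choose maintenance behavior and automatic restart for an instance
gcloud compute instances create my-vm \
  --zone=us-central1-a \
  --maintenance-policy=MIGRATE \
  --restart-on-failure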
Security Deep Dive
Goal: Distinguish which security feature addresses which threat: boot tampering, memory snooping, or supply-chain risks.
Maps to ACE 4.1: “Managing Identity and Access Management (IAM)” and ACE 4.2: “Managing service accounts.”
For those of us from security backgrounds, this section is critical.
Security Threat Model
Before diving into features, understand what each one protects against:
| Feature | Threat Addressed | Attack Example |
|---|---|---|
| Shielded VMs | Boot-level tampering | Bootkit/rootkit persistence in MBR/UEFI |
| Confidential VMs | Memory snooping | Hypervisor compromise, cold boot attacks, physical access |
| Trusted Images Policy | Supply-chain / image hygiene | Developer spins up VM from untrusted ISO with backdoor |
| OS Login | Key management sprawl | Leaked SSH keys, orphaned access after employee departure |
The Security Layer Cake
graph TB subgraph "Defense in Depth" L1[Trusted Images Policy<br/>What can be deployed?] L2[Shielded VMs<br/>Is the boot process clean?] L3[Confidential VMs<br/>Is memory encrypted?] L4[IAM & Firewall<br/>Who can access?] L5[OS Login<br/>How do users authenticate?] end L1 --> L2 --> L3 --> L4 --> L5 style L1 fill:#FFCDD2 style L2 fill:#F8BBD9 style L3 fill:#E1BEE7 style L4 fill:#C5CAE9 style L5 fill:#BBDEFB
Confidential VMs: Memory Encryption in Use
Standard encryption protects data at rest (disk) and in transit (network). Confidential VMs add encryption for data in use—the data in RAM while your application processes it.
Technologies:
- AMD SEV-SNP: Secure Encrypted Virtualization with Secure Nested Paging
- Intel TDX: Trust Domain Extensions (newer option)
Why this matters: Even if an attacker compromises the hypervisor or has physical access to the host, they cannot read your VM’s memory contents.
gcloud compute instances create my-confidential-vm \
--confidential-compute \
--machine-type n2d-standard-4 \
--zone us-central1-a
Confidential VM Limitations
Not all machine types support Confidential VMs. As of 2026, you’re primarily looking at N2D (AMD) and C3 (Intel TDX) families. The exam may ask which machine types support confidential computing—remember the AMD/Intel split.
Trusted Images Policy: Organizational Control
Prevent developers from spinning up VMs with arbitrary images (like that sketchy ISO someone found on the internet):
# Organization Policy
constraint: compute.trustedImageProjects
listPolicy:
allowedValues:
- projects/debian-cloud
- projects/ubuntu-os-cloud
- projects/my-golden-images
This is enforced at the Organization or Folder level, ensuring that even project owners can only use approved images.
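Applying it is an org-policy operation; a hedged sketch using the legacy resource-manager surface that matches the listPolicy YAML above (the organization ID is a placeholder, and the constraint may need the constraints/ prefix in the file):
# Apply the policy file above at the organization level
gcloud resource-manager org-policies set-policy trusted-images.yaml \
  --organization=123456789012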
OS Login: Centralized Access Control
Instead of managing SSH keys manually on each VM, OS Login ties VM access to Cloud Identity / Google Workspace:
- User authenticates with their Google account
- Their permissions are checked against IAM
- A POSIX account is automatically created/managed on the VM
- SSH session is established
gcloud compute instances create my-vm \
--metadata enable-oslogin=TRUE
Exam Strategy: Security The exam frequently tests the difference between:
- Project-wide SSH keys: Stored in project metadata, apply to all VMs
- Instance-specific SSH keys: Stored in instance metadata
- OS Login: IAM-based, no key management required
For enterprise/secure environments, OS Login is almost always the right answer.
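With OS Login, granting SSH access becomes an IAM binding instead of key distribution; a minimal sketch (project and user are placeholders):
# Grant a user SSH access to VMs in the project via IAM
gcloud projects add-iam-policy-binding my-project \
  --member="user:analyst@example.com" \
  --role="roles/compute.osLogin"
# Use roles/compute.osAdminLogin instead if the user also needs sudo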
Cost Optimization: A Decision Framework
Goal: Instantly classify a workload into Spot / CUD / On-Demand based on two questions: “Can it be interrupted?” and “Is usage predictable?”
Maps to ACE 1.2: “Managing billing configuration” and touches on ACE 2.1 (Spot VM instances).
The Cost Spectrum
Quick decision rules:
- ✅ Use Spot VMs when: Batch jobs, CI/CD, fault-tolerant distributed systems, anything with checkpointing
- ❌ Avoid Spot when: User-facing apps, stateful workloads without external state, SLA requirements
- ✅ Use CUDs when: Always-on production, predictable baseline (even if traffic varies), 1+ year commitment is acceptable
- ❌ Avoid CUDs when: Uncertain capacity needs, short-term projects, rapidly evolving architecture
- ✅ Use On-Demand when: Unpredictable workloads, short experiments, need maximum flexibility
- ❌ Avoid On-Demand when: Sustained usage over 25% of month (you’re leaving money on the table)
graph TD
    Start[New Workload] --> Q1{Can it tolerate<br/>interruption?}
    Q1 -->|Yes| Q2{Batch or<br/>continuous?}
    Q1 -->|No| Q3{Predictable<br/>usage?}
    Q2 -->|Batch| SPOT[🎰 Spot VMs<br/>Up to 80-91% off]
    Q2 -->|Continuous| Q3
    Q3 -->|Yes, 1+ year| CUD[📝 Committed Use<br/>Up to ~57% off]
    Q3 -->|Variable| SUD[🔄 Sustained Use<br/>Automatic up to ~30% off]
    Q3 -->|Unpredictable| OD[💰 On-Demand<br/>Full price, max flexibility]
    style SPOT fill:#4CAF50,color:white
    style CUD fill:#2196F3,color:white
    style SUD fill:#9C27B0,color:white
    style OD fill:#FF5722,color:white
Discount Mechanisms Explained
Sustained Use Discounts (Automatic)
If you run an instance for more than 25% of the month, Google automatically applies discounts:
| Usage | Discount |
|---|---|
| 25-50% of month | ~10% off |
| 50-75% of month | ~20% off |
| 75-100% of month | ~30% off |
No commitment required. This is pure “thank you for being a good customer.”
Committed Use Discounts (CUDs)
Commit to 1 or 3 years of specific resource usage:
| Term | Discount |
|---|---|
| 1 year | Up to ~37% off |
| 3 years | Up to ~57% off |
Key insight: CUDs are for resource types (vCPUs, memory), not specific VMs. You can change instance types, zones, even machine families, as long as you’re using the committed resources.
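Purchasing one is a regional operation against resource amounts; a hedged sketch (values are illustrative and the --resources syntax is worth confirming in the gcloud reference):
# Commit to 1 year of 32 vCPUs and 128 GB of memory in a region
gcloud compute commitments create my-commitment \
  --region=us-central1 \
  --plan=12-month \
  --resources=vcpu=32,memory=128GB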
Spot VMs (Formerly Preemptible)
Up to 80–91% off (depending on machine type and region), but Google can reclaim them with 30 seconds notice.
Ideal for:
- Batch processing (render farms, ETL jobs)
- CI/CD pipelines
- Fault-tolerant distributed systems
- Anything with checkpointing
Not suitable for:
- User-facing applications
- Stateful workloads without external state
- Anything that can’t handle sudden termination
gcloud compute instances create my-spot-vm \
--provisioning-model=SPOT \
--instance-termination-action=DELETE
Exam Strategy: Cost
The exam loves scenarios like: “A company runs nightly data processing jobs that take 4 hours and can restart if interrupted. What’s the most cost-effective option?”
Answer: Spot VMs (fault-tolerant batch = Spot).
Contrast with: “A company runs a critical e-commerce site 24/7 with predictable traffic.”
Answer: 3-year CUD (always-on, predictable = committed use).
Right-Sizing: Don’t Overprovision
Google’s Active Assist analyzes your VM utilization and recommends right-sizing:
gcloud recommender recommendations list \
--recommender=google.compute.instance.MachineTypeRecommender \
--project=my-project \
--location=us-central1-a
Common findings:
- “This n2-standard-8 averages 12% CPU. Consider n2-standard-2.”
- “This VM has 32GB RAM but uses 4GB. Consider custom machine type.”
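Acting on the second finding means recreating the VM with a custom shape; a minimal sketch (the machine-type string encodes vCPUs and memory in MB):
# Custom machine type on the N2 family: 2 vCPUs, 8 GB RAM
gcloud compute instances create my-rightsized-vm \
  --zone=us-central1-a \
  --machine-type=n2-custom-2-8192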
Quick Reference: gcloud Commands
Instance Management
# Create a basic instance
gcloud compute instances create my-vm \
--zone=us-central1-a \
--machine-type=n4-standard-4 \
--image-family=debian-12 \
--image-project=debian-cloud
# Create with Hyperdisk
gcloud compute instances create my-vm \
--zone=us-central1-a \
--machine-type=n4-standard-8 \
--create-disk=name=my-data,size=200GB,type=hyperdisk-balanced,provisioned-iops=10000
# SSH into instance
gcloud compute ssh my-vm --zone=us-central1-a
# List running instances
gcloud compute instances list
# Stop/Start
gcloud compute instances stop my-vm --zone=us-central1-a
gcloud compute instances start my-vm --zone=us-central1-a
# Delete
gcloud compute instances delete my-vm --zone=us-central1-a
🧪 Micro-Lab: Your First GCE VM Try this hands-on sequence (15 minutes):
- Create an E2-micro VM with PD-Balanced boot disk
- SSH in and run lsblk to see the disk layout
- In Console: Compute Engine → Disks → Resize the disk to +10GB
- Back in SSH: run sudo growpart /dev/sda 1 and then sudo resize2fs /dev/sda1
- Note how the disk's performance limits change with size (PD performance scales with capacity; check the disk metrics to confirm)
- Delete the VM when done to avoid charges
Snapshots and Images
# Create snapshot
gcloud compute disks snapshot my-disk \
--zone=us-central1-a \
--snapshot-names=my-snapshot
# Create image from snapshot
gcloud compute images create my-image \
--source-snapshot=my-snapshot
# Create image from disk
gcloud compute images create my-image \
--source-disk=my-disk \
--source-disk-zone=us-central1-a
Managed Instance Groups
# Create instance template
gcloud compute instance-templates create my-template \
--machine-type=n4-standard-2 \
--image-family=debian-12 \
--image-project=debian-cloud
# Create regional MIG
gcloud compute instance-groups managed create my-mig \
--template=my-template \
--size=3 \
--region=us-central1
# Configure autoscaling
gcloud compute instance-groups managed set-autoscaling my-mig \
--region=us-central1 \
--min-num-replicas=2 \
--max-num-replicas=10 \
--target-cpu-utilization=0.7
🧪 Micro-Lab: Watch Autohealing in Action Try this to see MIG self-healing (20 minutes):
- Create a regional MIG with 2 instances using the commands above
- Add an HTTP health check: gcloud compute health-checks create http my-hc --port 80
- Attach it: gcloud compute instance-groups managed update my-mig --region=us-central1 --health-check=my-hc
- SSH into one instance and run sudo systemctl stop nginx (or kill whatever process the health check expects)
- Watch in Console: the instance goes UNHEALTHY → gets deleted → new one created
- Clean up: delete the MIG, template, and health check
🧪 Micro-Lab: See Right-Sizing Recommendations Quick cost optimization exercise (5 minutes):
- Go to Console → Billing → Reports to see spend by SKU
- Go to Console → Recommender → VM Right-sizing
- Look for any VMs with <20% average CPU—these are candidates for downsizing
- Note: recommendations take ~24 hours of usage data to appear
Related Topics
- Google Kubernetes Engine - Container orchestration built on GCE
- Cloud Run - Serverless containers (abstracted GCE)
- Persistent Disk - Legacy storage deep dive
- VPC Networks - Networking for Compute Engine
- Cloud IAM - Access control for GCE resources
- Cloud Monitoring - Observability for your VMs
- Instance Templates - Reproducible VM configurations
- Cloud Load Balancing - Distributing traffic to MIGs
Summary: The ACE Exam Mental Model
When approaching GCE questions on the Associate Cloud Engineer exam, think in layers:
- What family? Match workload characteristics to machine family
- What storage? Default to PD-Balanced unless Hyperdisk flexibility is needed
- What availability? Single VM < Zonal MIG < Regional MIG (cost vs. SLA tradeoff)
- What cost model? Spot (interruptible) → CUD (predictable) → On-Demand (flexible)
- What security? Shielded (default) → Confidential (memory encryption) → Trusted Images (org control)
Master these decision points and the exam scenarios become straightforward pattern matching.
Exam Traps Cheat Sheet
Quick reference of common ACE exam pitfalls for GCE:
| Topic | Trap | Correct Thinking |
|---|---|---|
| Machine families | Picking C4 for “needs good performance” | Check if cost matters—E2/N4 often sufficient |
| Machine families | Choosing M3 for “large database” | Most DBs are I/O-bound, not memory-bound—check the bottleneck |
| Machine families | Forgetting Tau T2A is Arm | x86 binaries won’t run without recompilation |
| Storage | Hyperdisk for simple boot disk | PD-Balanced is the right default |
| Storage | Hyperdisk when “zone failure” mentioned | Regional PD provides cross-zone replication |
| Storage | Ignoring disk size → IOPS relationship | On PD, bigger disk = more IOPS |
| MIGs | Zonal MIG for “high availability” | Regional MIG survives zone failures |
| MIGs | Forgetting health checks for autohealing | No health check = no autohealing |
| Autoscaling | Predictive for spiky/unpredictable loads | Predictive needs historical patterns; use reactive for chaos |
| Cost | CUD for uncertain future capacity | CUD is a commitment—use on-demand if unsure |
| Cost | On-demand for 24/7 production | You’re leaving 30-57% savings on the table |
| Security | Confidential VM for “secure boot” | Shielded VM handles boot integrity; Confidential is for memory |
| Security | SSH keys for enterprise access | OS Login integrates with IAM—prefer it |
Practice Scenarios
Test your mental model with these one-liner scenarios. Try to answer before checking.
1. Scenario: A startup runs nightly ML training jobs that process 500GB of data. Jobs can restart from checkpoints. They want to minimize costs.
2. Scenario: A bank needs VMs for a trading application where even the hypervisor shouldn’t be able to read memory contents.
3. Scenario: An e-commerce company has predictable traffic with 3x spikes during sales events. They need to survive a zone failure.
4. Scenario: A team needs to run legacy Windows x86 applications with consistent CPU performance for EDA workloads.
5. Scenario: A company wants to prevent developers from launching VMs using random public images from the internet.
Answers are in the Practice Answers section below.
Practice Answers
1. ML training with checkpoints, cost-sensitive: Spot VMs (fault-tolerant batch) + possibly Storage Optimized Z3 or Hyperdisk Throughput for the 500GB data reads.
2. Memory contents hidden from hypervisor: Confidential VMs (AMD SEV-SNP or Intel TDX for memory encryption in use).
3. Predictable traffic, survives zone failure: Regional MIG with predictive autoscaling + load balancer. CUD for the baseline capacity.
4. Legacy Windows x86, consistent CPU, EDA: Compute Optimized C4 (not Tau T2A—that’s Arm). Consider sole-tenant nodes if licensing requires it.
5. Prevent random public images: Trusted Images Policy via Organization Policy constraint compute.trustedImageProjects.
Related
- GCE Mastery Roadmap — 20 hands-on projects
- ACE Certification Plan — Full study plan
- gcp-overview — Why choose GCP
- gcp-learning-path — GCP learning roadmap
- gcp-resources — Courses and labs
- GCP Hierarchy Explorer — Python project
- docker-to-cloud-run — Container deployment lab
Last updated: 2026-01-09