Prerequisites

This note assumes familiarity with core GCP concepts. WikiLinks like [[Cloud IAM]] and [[VPC Networks]] reference other notes in this digital garden—if those pages don't exist yet, treat them as topics to explore in the official documentation.


At a Glance

| Topic | What You’ll Learn | ACE Exam Weight |
| --- | --- | --- |
| Machine Families | Choosing N4/E2/C4/M3 for scenarios | High (Section 2.1) |
| Storage | PD vs Hyperdisk decision matrix | High (Section 2.1) |
| MIGs & Autoscaling | Designing for 99.99% availability | High (Sections 2.1, 3.1) |
| Security | Shielded/Confidential VMs, OS Login | Medium (Section 4.1–4.2) |
| Cost Optimization | Spot vs CUD vs On-Demand | Medium (Section 1.2) |

🚀 If You Only Have 30 Minutes Before Practice Exam

Must read (15 min):

  1. The Five Machine Families table + its Exam Strategy callout
  2. Managed Instance Groups (MIGs) table + Autoscaling Reactive vs Predictive
  3. The Cost Spectrum diagram

Quick skim (10 min):

  4. Hyperdisk Variants table + Storage Exam Strategy callout
  5. Security Threat Model comparison table

If time permits (5 min):

  6. Exam Traps Cheat Sheet at the bottom
  7. Practice Scenarios — try answering before checking


Why GCE Still Matters in a Serverless World

Here’s a dirty little secret that the “serverless everything” crowd doesn’t want you to think about: every GKE node is just a Compute Engine VM under the hood. Every Cloud Run container ultimately executes on Google’s virtualized infrastructure. When you peel back the abstraction layers of any “serverless” platform, you find VMs—and understanding those VMs gives you superpowers when things go sideways.

For those of us coming from a malware analysis background, GCE is particularly fascinating. It’s where you can spin up isolated sandbox environments, create honeypots with precise network controls, and build forensic workstations with exactly the hardware profile you need. But more broadly, GCE is the Infrastructure as a Service (IaaS) foundation that everything else in GCP sits on.

Think of it this way: if Google Kubernetes Engine is a luxury apartment complex with a concierge (Google) handling maintenance, GCE is buying raw land and building exactly the house you want. More work? Yes. More control? Absolutely.

Where GCE Fits in the Cloud Spectrum

GCE is “hiring a contractor to build a custom house.” You tell them the specs (vCPUs, memory, storage), they provide the land and foundation (hardware, virtualization), and you’re responsible for everything from the OS up—runtime, scaling, application code, and data.


The Hardware Deep Dive: Understanding Machine Families

Goal: Pick the right machine family in under 30 seconds for any exam scenario.

Maps to ACE 2.1: Planning and implementing compute resources — “Selecting appropriate compute choices for a given workload.”

The biggest conceptual shift in modern GCE (2024-2026) is understanding that not all vCPUs are created equal. Google now offers highly specialized machine families optimized for radically different workloads. Picking the wrong family is like bringing a scalpel to a demolition job—or a sledgehammer to surgery.

The Five Machine Families

graph TD
    subgraph "Machine Type Families"
        GP[🖥️ General Purpose<br/>N4, E2, Tau T2A]
        CO[⚡ Compute Optimized<br/>C4, C3]
        MO[🧠 Memory Optimized<br/>X4, M3]
        AO[🎮 Accelerator Optimized<br/>A3, A2 - GPUs]
        SO[💾 Storage Optimized<br/>Z3]
    end
    
    GP --> |"Best for"| GPuse[Web servers, Dev/Test<br/>Microservices, Databases]
    CO --> |"Best for"| COuse[HPC, Gaming servers<br/>Single-threaded apps, CI/CD]
    MO --> |"Best for"| MOuse[SAP HANA, In-memory DBs<br/>Large analytics workloads]
    AO --> |"Best for"| AOuse[ML Training, Rendering<br/>Scientific simulation]
    SO --> |"Best for"| SOuse[High-IOPS databases<br/>Real-time analytics]
    
    style GP fill:#4CAF50,color:white
    style CO fill:#2196F3,color:white
    style MO fill:#9C27B0,color:white
    style AO fill:#FF9800,color:white
    style SO fill:#607D8B,color:white

General Purpose (N4, E2, Tau T2A/T2D)

The workhorses. These are your go-to for 80% of workloads.

| Series | Architecture | Sweet Spot | Notes |
| --- | --- | --- | --- |
| N4 | 4th-gen Intel Xeon + Titanium offloads | Production web apps, balanced workloads | Newest generation, best price/performance |
| E2 | Shared-core or dedicated, with CPU bursting | Dev/test, cost-sensitive workloads | Up to 32 vCPUs, typically the lowest-cost GP option |
| Tau T2A | Arm-based (Ampere Altra) | Scale-out microservices, containerized apps | ~20% better price/performance for Arm-compatible workloads |
| Tau T2D | AMD EPYC (x86) | Scale-out web serving, containerized apps | x86 option for the Tau family |

Previous-Generation GP Options

You'll still encounter N2, N2D, and N1 in production environments and exam questions. These remain fully supported but are considered previous-generation for new deployments. N4 offers better performance per dollar for most new workloads.

Exam Strategy: General Purpose

The ACE exam loves to test whether you understand when to use E2 vs. N4. Key differentiator: E2 uses shared cores for smaller instances (cost savings through CPU bursting), while N4 provides dedicated cores. For consistent, production workloads, N4. For bursty dev/test, E2.
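
If you want to see which E2 and N4 shapes actually exist in a given zone before choosing, you can query the machine types directly. A small sketch; the zone and name filter below are just examples, adjust as needed:

# Compare available E2 and N4 standard shapes in one zone
gcloud compute machine-types list \
  --zones=us-central1-a \
  --filter="name~^e2-standard OR name~^n4-standard"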

Compute Optimized (C4, C3)

When raw CPU clock speed and single-threaded performance matter more than anything else.

  • C4: Built on Titanium with Intel Emerald Rapids. Highest per-core performance available.
  • C3: Previous generation, still excellent for HPC and gaming servers.

Use cases: Game servers (think Minecraft or Valheim hosting), High-Performance Computing (HPC) batch jobs, Electronic Design Automation (EDA), single-threaded legacy applications that can’t be parallelized.

Memory Optimized (X4, M3)

These are the beasts. We’re talking machines with multiple terabytes of RAM (up to tens of TB) and hundreds of vCPUs.

  • X4: Latest generation, Intel Sapphire Rapids, massive memory capacity.
  • M3: Excellent for SAP HANA certified workloads.

If you’re running in-memory databases like Redis clusters at scale, SAP HANA, or doing genomics analysis where the entire dataset needs to fit in RAM, this is your family.

Storage Optimized (Z3)

The newest family (2024+), designed for workloads that need extreme local SSD IOPS—think distributed databases like Cassandra, CockroachDB, or time-series databases.

Accelerator Optimized (A3, A2)

GPU-attached instances for ML/AI training, inference, rendering, and scientific simulation. The A3 series with NVIDIA H100 GPUs is the current flagship for large language model training.
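
For context (the ACE exam doesn't drill on GPU syntax), attaching a GPU to a general-purpose VM looks like the sketch below; A2/A3 shapes skip the `--accelerator` flag because their GPUs are part of the machine type. The names, zone, and GPU model here are placeholders:

# N1 VM with a single NVIDIA T4 attached; GPU VMs generally can't live-migrate,
# so the maintenance policy is set to TERMINATE
gcloud compute instances create my-gpu-vm \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=debian-12 \
  --image-project=debian-cloud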

The Titanium Evolution

Here’s where it gets interesting for systems engineers. Google’s Titanium is a custom silicon offload chip that handles I/O operations—network virtualization, storage I/O, security functions—off the host CPU.

Why Titanium matters:

  • More usable CPU cycles — Your vCPUs aren’t wasting time on packet processing
  • Consistent performance — I/O-heavy workloads don’t starve compute
  • Enhanced security — Hardware-isolated operations reduce attack surface

When Titanium matters for your choice:

  • ✅ Choose N4/C4 (Titanium-enabled) when: Network-intensive apps, storage-heavy databases, security-sensitive workloads
  • ⚠️ Titanium doesn’t matter when: Simple dev/test VMs, cost is primary concern (E2 is fine)

Think of it like having a dedicated mail room staff (Titanium) so your employees (vCPUs) don’t have to stop working every time a package arrives.

Exam Traps: Machine Families

  • Trap 1: Picking C4 (compute-optimized) when E2 would suffice. The exam often emphasizes cost efficiency over maximum performance.
  • Trap 2: Forgetting that Tau T2A is Arm-based. If the scenario mentions “existing x86 binaries,” T2A won’t work without recompilation.
  • Trap 3: Choosing memory-optimized (M3/X4) for a “large database” without checking if it’s actually memory-bound. Most databases are I/O-bound first.

Legacy Machine Types on the Exam

You'll still see N1, N2, and N2D in exam scenarios. These aren't deprecated, but for new deployments, the N4/C4/M3/X4 families offer better performance per dollar. The exam may present scenarios where you need to identify the appropriate generation for a migration or cost optimization question.


The Storage Layer: Why Hyperdisk is the 2026 Standard

Goal: Instantly distinguish when to use PD-Balanced, Regional PD, or Hyperdisk based on scenario keywords.

Maps to ACE 2.1: “Choosing the appropriate storage for Compute Engine (e.g., zonal Persistent Disk, regional Persistent Disk, Google Cloud Hyperdisk).”

The Critical Decision: Regional PD vs Hyperdisk

Before diving into details, memorize this decision rule:

| Scenario Keyword | Best Choice | Why |
| --- | --- | --- |
| “Cross-zone redundancy” / “survive zone failure” | Regional PD | Synchronously replicated across 2 zones |
| “Tune IOPS independently” / “SAN-like flexibility” | Hyperdisk | Decoupled capacity and performance |
| “Cost-effective general workload” | PD-Balanced | Good default, scales with size |
| “Maximum database IOPS” | Hyperdisk Extreme | Up to ~350k IOPS, independently tunable |

The Old World: Persistent Disk

Traditional Persistent Disk options still exist and you’ll encounter them:

| Type | IOPS | Throughput | Use Case |
| --- | --- | --- | --- |
| PD-Standard | Low (0.75 per GB) | Low | Boot disks, cold storage |
| PD-Balanced | Medium (6 per GB) | Medium | General workloads |
| PD-SSD | High (30 per GB) | High | Databases, high-performance apps |

The critical limitation: performance scales linearly with capacity. Want more IOPS? You need a bigger disk, even if you don’t need the space.
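
To make the coupling concrete: a 100GB PD-SSD tops out around 30 × 100 = 3,000 IOPS, so reaching ~9,000 IOPS means growing that same disk to roughly 300GB whether or not you need the space. The resize itself is online; disk name and zone below are placeholders:

# Grow the disk to raise its IOPS ceiling; extend the partition/filesystem
# inside the guest afterwards (e.g., growpart + resize2fs)
gcloud compute disks resize my-disk \
  --size=300GB \
  --zone=us-central1-a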

The New World: Hyperdisk

Hyperdisk is Google’s answer to enterprise SAN (Storage Area Network) flexibility. The revolutionary concept: decoupled performance and capacity.

graph LR
    subgraph "Traditional PD"
        PD[100GB PD-SSD] --> IOPS1[3,000 IOPS<br/>Fixed to size]
    end
    
    subgraph "Hyperdisk"
        HD[100GB Hyperdisk] --> IOPS2[3,000 to 350,000 IOPS<br/>You choose]
        HD --> TP[Throughput: You choose]
        HD --> CAP[Capacity: You choose]
    end
    
    style PD fill:#FFA726
    style HD fill:#66BB6A
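
A minimal sketch of that decoupling in practice: capacity, IOPS, and throughput each chosen independently. The names and numbers are illustrative, and the allowed ranges depend on disk type and size:

# 100GB of capacity with IOPS and throughput (MB/s) provisioned separately
gcloud compute disks create my-hyperdisk \
  --zone=us-central1-a \
  --type=hyperdisk-balanced \
  --size=100GB \
  --provisioned-iops=10000 \
  --provisioned-throughput=600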

Hyperdisk Variants

| Type | Max IOPS | Max Throughput | Sweet Spot |
| --- | --- | --- | --- |
| Hyperdisk Balanced | Up to ~160,000 | Up to ~2.4 GB/s | General workloads needing flexibility |
| Hyperdisk Extreme | Up to ~350,000 | Up to ~5 GB/s | Databases, SAP HANA |
| Hyperdisk Throughput | Scales with provisioned throughput | Up to ~2.4 GB/s | Analytics, large sequential reads/writes |
| Hyperdisk ML | Optimized for parallel reads | Massive parallelism | ML training data, multi-node access |

Hyperdisk Throughput I/O Profile

Hyperdisk Throughput is tuned for large sequential workloads rather than random I/O. IOPS scales with provisioned throughput (roughly 4 IOPS per MiB/s), making it cost-effective for analytics pipelines and data lakes.

Exam Strategy: Hyperdisk Variants

For the ACE exam, 90% of storage questions boil down to three choices: PD-Standard (cheap/cold data), PD-Balanced or Hyperdisk Balanced (default for web apps), and PD-SSD or Hyperdisk Extreme (databases needing high IOPS). Hyperdisk ML and Throughput are real products but rarely appear on the associate-level exam—save that knowledge for the Professional Cloud Architect track.

Storage Pools: The Cost Optimization Secret

Here’s where it gets financially interesting. Storage Pools allow you to provision a pool of Hyperdisk capacity and performance, then carve out individual disks from that pool.

Why does this matter? Thin provisioning.

Imagine you have 10 VMs that each might need 100GB and 10,000 IOPS during peak load, but typically use 20GB and 2,000 IOPS. Without pools:

  • 10 × 100GB × 10,000 IOPS = 1TB provisioned, 100,000 IOPS provisioned
  • Cost: $$

With Storage Pools:

  • Pool: 400GB capacity, 30,000 IOPS (sized for realistic aggregate demand)
  • Individual disks: 100GB each, but drawing from shared pool
  • Cost: $ (you pay for the pool, not theoretical maximums)
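
A hedged sketch of the same idea in gcloud. The pool and disk names are made up, the sizing mirrors the scenario above purely for illustration (real pools have minimum capacity requirements), and the exact flag names are worth verifying with `gcloud compute storage-pools create --help` before relying on them:

# Provision the shared pool once (capacity + performance)
gcloud compute storage-pools create my-pool \
  --zone=us-central1-a \
  --storage-pool-type=hyperdisk-balanced \
  --provisioned-capacity=400GB \
  --provisioned-iops=30000 \
  --provisioned-throughput=1024

# Carve an individual disk out of the pool; its 100GB draws from the shared capacity
gcloud compute disks create vm1-data \
  --zone=us-central1-a \
  --type=hyperdisk-balanced \
  --size=100GB \
  --storage-pool=my-pool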

Exam Traps: Storage

  • Trap 1: Ignoring Regional PD when “cross-zone durability” or “zone failure” is mentioned. Hyperdisk is flexible but doesn’t inherently replicate across zones.
  • Trap 2: Choosing Hyperdisk for a simple web server boot disk. PD-Balanced is usually the right default—Hyperdisk shines when you need to tune IOPS/throughput independently.
  • Trap 3: Forgetting that PD performance scales with size. If someone says “need more IOPS on PD-SSD,” the answer might be “increase disk size,” not “switch to Hyperdisk.”

Exam Strategy: Storage

The exam loves to trick you with scenarios asking about Balanced PD vs. Hyperdisk Balanced. Key insight: if the question mentions needing to scale IOPS independently of capacity, or talks about "enterprise SAN-like flexibility," the answer is Hyperdisk. If it's a simple "what's the default for a web server?" question, PD-Balanced is usually the cost-effective answer.


Life of a VM: From Boot to Production

Goal: Understand what happens at each boot stage and how Shielded VMs protect that process.

Maps to ACE 2.1: “Launching a compute instance (e.g., availability policy, SSH keys)” and ACE 3.1: “Working with snapshots and images.”

Understanding the VM lifecycle is crucial for both the exam and real-world troubleshooting.

The Boot Process

sequenceDiagram
    participant User
    participant GCE API
    participant Scheduler
    participant Host
    participant VM
    
    User->>GCE API: gcloud compute instances create
    GCE API->>Scheduler: Find suitable host
    Scheduler->>Host: Allocate resources
    Host->>VM: Create VM from image
    VM->>VM: UEFI/Shielded VM verification
    VM->>VM: Boot OS
    VM->>VM: Execute startup script
    VM->>VM: Guest Agent registers
    VM-->>User: RUNNING status

  1. Image Selection: You specify a boot disk image (public like debian-12 or custom).
  2. Host Placement: Google’s scheduler finds a host with available resources in your specified zone.
  3. Shielded VM Verification (if enabled, which is default): The boot process is verified against known-good measurements.
  4. OS Boot: The operating system loads.
  5. Startup Script Execution: Any metadata-specified startup scripts run.
  6. Guest Agent: The GCE guest agent registers, enabling features like OS Login and metadata queries.
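
When a VM stalls somewhere in that sequence, the serial console log is usually the fastest way to see which stage it died at (UEFI, OS boot, or a startup script). Instance name and zone below are placeholders:

# Dump the boot log for a stuck or misbehaving instance
gcloud compute instances get-serial-port-output my-vm \
  --zone=us-central1-a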

Shielded VM: Security by Default

Since ~2020, new VMs across all machine families are Shielded VMs by default (when created from a supported image). This is huge for security-conscious deployments.

What Shielded VMs provide:

| Feature | What It Does | Why It Matters |
| --- | --- | --- |
| Secure Boot | Verifies bootloader and kernel signatures | Prevents boot-level rootkits |
| vTPM | Virtual Trusted Platform Module | Enables measured boot, secure key storage |
| Integrity Monitoring | Compares boot measurements to baseline | Detects tampering in real time |

For Malware Analysts

The vTPM and integrity monitoring features are goldmines for incident response. If you suspect a VM has been compromised at the boot level, you can compare its current boot measurements against the baseline. A mismatch is a strong indicator of boot-level persistence mechanisms.
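
These features are on by default for supported images, but you can state them explicitly in automation (or re-enable one that was switched off). A minimal sketch:

# Explicitly request all three Shielded VM features at creation time
gcloud compute instances create my-hardened-vm \
  --zone=us-central1-a \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring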

Startup Scripts: Automation at Boot

Startup scripts run every time a VM boots (not just first boot). They’re specified via instance metadata:

gcloud compute instances create my-vm \
  --metadata startup-script='#!/bin/bash
apt-get update
apt-get install -y nginx
systemctl enable nginx
systemctl start nginx'

Or reference a script in Cloud Storage:

gcloud compute instances create my-vm \
  --metadata startup-script-url=gs://my-bucket/startup.sh

Startup Script Pitfalls

Startup scripts run as root, and nothing downstream should assume the application is ready until they finish: the instance shows RUNNING as soon as the OS boots, but long-running scripts delay your app (and its health checks) from coming up. For complex provisioning, consider using startup scripts only to bootstrap a configuration management tool like Ansible or Puppet.


Resiliency: Designing for 99.99% SLA

Goal: Design a multi-zone architecture that survives zone failures using MIGs, health checks, and autoscaling.

Maps to ACE 2.1: “Creating an autoscaled managed instance group by using an instance template” and ACE 3.1: “Working with GKE node pools… autoscaling node pool.”

Exam prep materials emphasize Managed Instance Groups (MIGs) as the core of high availability, and they’re absolutely right.

The Hierarchy of Resilience

graph TD
    subgraph "GCP Project"
        subgraph "Region: us-central1"
            subgraph "Zone A"
                MIG1[Regional MIG<br/>Instance 1]
                HD1[Hyperdisk 1]
                MIG1 --> HD1
            end
            subgraph "Zone B"
                MIG2[Regional MIG<br/>Instance 2]
                HD2[Hyperdisk 2]
                MIG2 --> HD2
            end
            subgraph "Zone C"
                MIG3[Regional MIG<br/>Instance 3]
                HD3[Hyperdisk 3]
                MIG3 --> HD3
            end
        end
        LB[Global Load Balancer] --> MIG1
        LB --> MIG2
        LB --> MIG3
    end
    
    style LB fill:#4285F4,color:white
    style MIG1 fill:#34A853,color:white
    style MIG2 fill:#34A853,color:white
    style MIG3 fill:#34A853,color:white

Managed Instance Groups (MIGs)

A MIG is a collection of identical VMs created from the same Instance Template. Google manages them as a unit.

Key features:

| Feature | Zonal MIG | Regional MIG |
| --- | --- | --- |
| Distribution | Single zone | Across 3 zones |
| Failure domain | Zone failure = total outage | Zone failure = ~66% capacity remains |
| Use case | Stateful apps, cost-sensitive | Production, HA required |
| Availability | Lower (single-zone dependency) | Higher (can reach 99.99% with proper LB and health checks) |

Autohealing: Self-Repairing Infrastructure

MIGs can automatically detect and replace unhealthy instances using health checks:

gcloud compute health-checks create http my-health-check \
  --port 80 \
  --request-path /health \
  --check-interval 10s \
  --timeout 5s \
  --unhealthy-threshold 3

When an instance fails 3 consecutive health checks, the MIG automatically:

  1. Deletes the unhealthy instance
  2. Creates a new instance from the template
  3. Waits for it to pass health checks
  4. Adds it to the load balancer
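
Creating the health check alone doesn't enable autohealing; it has to be attached to the MIG, usually with an initial delay so freshly booted instances aren't replaced before they finish starting up:

# Attach the health check and give new instances 5 minutes before health-based replacement
gcloud compute instance-groups managed update my-mig \
  --region=us-central1 \
  --health-check=my-health-check \
  --initial-delay=300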

Autoscaling: Reactive vs. Predictive

Reactive autoscaling (traditional) responds to current metrics:

  • CPU utilization exceeds 70%? Add instances.
  • CPU drops below 40%? Remove instances.

Predictive autoscaling (the 2024+ game-changer) uses ML to forecast load based on historical patterns:

Predictive Autoscaling

If your workload has predictable patterns (daily traffic spikes at 9 AM, weekly batch jobs on Sundays), predictive autoscaling learns these patterns and pre-provisions instances before the load arrives. This eliminates the “cold start” lag where users experience slowness while new instances spin up.

# Predictive mode works with CPU-utilization-based autoscaling policies
gcloud compute instance-groups managed update-autoscaling my-mig \
  --region=us-central1 \
  --cpu-utilization-predictive-method=optimize-availability

Live Migration: Google’s Magic Trick

Here’s something that still amazes systems engineers: Google can move your running VM to a different physical host without any downtime.

When live migration happens:

  • Hardware maintenance scheduled
  • Software updates on the host
  • Resource rebalancing

What you experience: Maybe a brief network blip (sub-second). Your VM keeps running, processes stay alive, network connections persist.

Exam Strategy: Availability

The exam tests whether you understand the availability policy options:

  • MIGRATE (default): Live migration during maintenance
  • TERMINATE: VM is stopped, then restarted on new host

Spot VMs (preemptible) only support TERMINATE because they can be reclaimed at any time.
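
You can inspect or change the policy per instance; a small sketch (instance name and zone are placeholders):

# Check the current on-host-maintenance behavior
gcloud compute instances describe my-vm \
  --zone=us-central1-a \
  --format="value(scheduling.onHostMaintenance)"

# Switch to TERMINATE for workloads that can't tolerate live migration
gcloud compute instances set-scheduling my-vm \
  --zone=us-central1-a \
  --maintenance-policy=TERMINATE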


Security Deep Dive

Goal: Distinguish which security feature addresses which threat: boot tampering, memory snooping, or supply-chain risks.

Maps to ACE 4.1: “Managing Identity and Access Management (IAM)” and ACE 4.2: “Managing service accounts.”

For those of us from security backgrounds, this section is critical.

Security Threat Model

Before diving into features, understand what each one protects against:

| Feature | Threat Addressed | Attack Example |
| --- | --- | --- |
| Shielded VMs | Boot-level tampering | Bootkit/rootkit persistence in MBR/UEFI |
| Confidential VMs | Memory snooping | Hypervisor compromise, cold boot attacks, physical access |
| Trusted Images Policy | Supply-chain / image hygiene | Developer spins up VM from untrusted ISO with backdoor |
| OS Login | Key management sprawl | Leaked SSH keys, orphaned access after employee departure |

The Security Layer Cake

graph TB
    subgraph "Defense in Depth"
        L1[Trusted Images Policy<br/>What can be deployed?]
        L2[Shielded VMs<br/>Is the boot process clean?]
        L3[Confidential VMs<br/>Is memory encrypted?]
        L4[IAM & Firewall<br/>Who can access?]
        L5[OS Login<br/>How do users authenticate?]
    end
    
    L1 --> L2 --> L3 --> L4 --> L5
    
    style L1 fill:#FFCDD2
    style L2 fill:#F8BBD9
    style L3 fill:#E1BEE7
    style L4 fill:#C5CAE9
    style L5 fill:#BBDEFB

Confidential VMs: Memory Encryption in Use

Standard encryption protects data at rest (disk) and in transit (network). Confidential VMs add encryption for data in use—the data in RAM while your application processes it.

Technologies:

  • AMD SEV-SNP: Secure Encrypted Virtualization with Secure Nested Paging
  • Intel TDX: Trust Domain Extensions (newer option)

Why this matters: Even if an attacker compromises the hypervisor or has physical access to the host, they cannot read your VM’s memory contents.

gcloud compute instances create my-confidential-vm \
  --confidential-compute \
  --machine-type n2d-standard-4 \
  --zone us-central1-a

Confidential VM Limitations

Not all machine types support Confidential VMs. As of 2026, you’re primarily looking at N2D (AMD) and C3 (Intel TDX) families. The exam may ask which machine types support confidential computing—remember the AMD/Intel split.
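
As a hedged illustration of requesting a specific confidential computing technology: the `--confidential-compute-type` flag values and the supported machine-series and image pairings have shifted across gcloud releases, so verify against current docs before relying on this. Names, zone, and image below are placeholders:

# Intel TDX on a C3 shape; requires a Confidential VM-compatible image
gcloud compute instances create my-tdx-vm \
  --zone=us-central1-a \
  --machine-type=c3-standard-4 \
  --confidential-compute-type=TDX \
  --maintenance-policy=TERMINATE \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud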

Trusted Images Policy: Organizational Control

Prevent developers from spinning up VMs with arbitrary images (like that sketchy ISO someone found on the internet):

# Organization Policy
constraint: compute.trustedImageProjects
listPolicy:
  allowedValues:
    - projects/debian-cloud
    - projects/ubuntu-os-cloud
    - projects/my-golden-images

This is enforced at the Organization or Folder level, ensuring that even project owners can only use approved images.
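
Applying it would look roughly like this, assuming the YAML above is saved locally as trusted-images-policy.yaml and you hold an Organization Policy admin role; the organization ID is a placeholder:

# Set the trusted image projects constraint at the organization level
gcloud resource-manager org-policies set-policy trusted-images-policy.yaml \
  --organization=123456789012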

OS Login: Centralized Access Control

Instead of managing SSH keys manually on each VM, OS Login ties VM access to Cloud Identity / Google Workspace:

  1. User authenticates with their Google account
  2. Their permissions are checked against IAM
  3. A POSIX account is automatically created/managed on the VM
  4. SSH session is established

gcloud compute instances create my-vm \
  --metadata enable-oslogin=TRUE
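
Turning on the metadata flag is only half the setup: the user also needs an OS Login IAM role on the project (or instance). The account and project below are placeholders:

# Grant standard (non-root) SSH access through OS Login
gcloud projects add-iam-policy-binding my-project \
  --member="user:alice@example.com" \
  --role="roles/compute.osLogin"

# Use roles/compute.osAdminLogin instead if the user needs sudo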

Exam Strategy: Security

The exam frequently tests the difference between:

  • Project-wide SSH keys: Stored in project metadata, apply to all VMs
  • Instance-specific SSH keys: Stored in instance metadata
  • OS Login: IAM-based, no key management required

For enterprise/secure environments, OS Login is almost always the right answer.


Cost Optimization: A Decision Framework

Goal: Instantly classify a workload into Spot / CUD / On-Demand based on two questions: “Can it be interrupted?” and “Is usage predictable?”

Maps to ACE 1.2: “Managing billing configuration” and touches on ACE 2.1 (Spot VM instances).

The Cost Spectrum

Quick decision rules:

  • Use Spot VMs when: Batch jobs, CI/CD, fault-tolerant distributed systems, anything with checkpointing
  • Avoid Spot when: User-facing apps, stateful workloads without external state, SLA requirements
  • Use CUDs when: Always-on production, predictable baseline (even if traffic varies), 1+ year commitment is acceptable
  • Avoid CUDs when: Uncertain capacity needs, short-term projects, rapidly evolving architecture
  • Use On-Demand when: Unpredictable workloads, short experiments, need maximum flexibility
  • Avoid On-Demand when: Sustained usage over 25% of month (you’re leaving money on the table)

graph TD
    Start[New Workload] --> Q1{Can it tolerate<br/>interruption?}
    
    Q1 -->|Yes| Q2{Batch or<br/>continuous?}
    Q1 -->|No| Q3{Predictable<br/>usage?}
    
    Q2 -->|Batch| SPOT[🎰 Spot VMs<br/>Up to 80-91% off]
    Q2 -->|Continuous| Q3
    
    Q3 -->|Yes, 1+ year| CUD[📝 Committed Use<br/>Up to ~57% off]
    Q3 -->|Variable| SUD[🔄 Sustained Use<br/>Automatic up to ~30% off]
    Q3 -->|Unpredictable| OD[💰 On-Demand<br/>Full price, max flexibility]
    
    style SPOT fill:#4CAF50,color:white
    style CUD fill:#2196F3,color:white
    style SUD fill:#9C27B0,color:white
    style OD fill:#FF5722,color:white

Discount Mechanisms Explained

Sustained Use Discounts (Automatic)

If you run an instance for more than 25% of the month, Google automatically applies discounts:

| Usage | Discount |
| --- | --- |
| 25-50% of month | ~10% off |
| 50-75% of month | ~20% off |
| 75-100% of month | ~30% off |

No commitment required. This is pure “thank you for being a good customer.” One caveat: not every machine series is eligible. E2, for example, doesn’t receive sustained use discounts; its lower list price stands in for them.

Committed Use Discounts (CUDs)

Commit to 1 or 3 years of specific resource usage:

| Term | Discount |
| --- | --- |
| 1 year | Up to ~37% off |
| 3 years | Up to ~57% off |

Key insight: resource-based CUDs apply to resource types (vCPUs, memory) in a chosen region and machine series, not to specific VMs. You can resize or replace instances and move them between zones in that region and still consume the commitment; spend-based (“flexible”) CUDs go further, spanning multiple machine families and regions.
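
For reference, purchasing a resource-based commitment is itself a gcloud operation; the numbers and name below are invented, and the exact `--resources` syntax is worth confirming with `gcloud compute commitments create --help`:

# Commit to 8 vCPUs and 32GB of memory in one region for 12 months
gcloud compute commitments create my-commitment \
  --region=us-central1 \
  --plan=12-month \
  --resources=vcpu=8,memory=32GB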

Spot VMs (Formerly Preemptible)

Up to 80–91% off (depending on machine type and region), but Google can reclaim them with 30 seconds notice.

Ideal for:

  • Batch processing (render farms, ETL jobs)
  • CI/CD pipelines
  • Fault-tolerant distributed systems
  • Anything with checkpointing

Not suitable for:

  • User-facing applications
  • Stateful workloads without external state
  • Anything that can’t handle sudden termination

gcloud compute instances create my-spot-vm \
  --provisioning-model=SPOT \
  --instance-termination-action=DELETE

Exam Strategy: Cost

The exam loves scenarios like: “A company runs nightly data processing jobs that take 4 hours and can restart if interrupted. What’s the most cost-effective option?”

Answer: Spot VMs (fault-tolerant batch = Spot).

Contrast with: “A company runs a critical e-commerce site 24/7 with predictable traffic.”

Answer: 3-year CUD (always-on, predictable = committed use).

Right-Sizing: Don’t Overprovision

Google’s Active Assist analyzes your VM utilization and recommends right-sizing:

gcloud recommender recommendations list \
  --recommender=google.compute.instance.MachineTypeRecommender \
  --project=my-project \
  --location=us-central1-a

Common findings:

  • “This n2-standard-8 averages 12% CPU. Consider n2-standard-2.”
  • “This VM has 32GB RAM but uses 4GB. Consider custom machine type.”

Quick Reference: gcloud Commands

Instance Management

# Create a basic instance
gcloud compute instances create my-vm \
  --zone=us-central1-a \
  --machine-type=n4-standard-4 \
  --image-family=debian-12 \
  --image-project=debian-cloud
 
# Create with Hyperdisk
gcloud compute instances create my-vm \
  --zone=us-central1-a \
  --machine-type=n4-standard-8 \
  --create-disk=name=my-data,size=200GB,type=hyperdisk-balanced,provisioned-iops=10000
 
# SSH into instance
gcloud compute ssh my-vm --zone=us-central1-a
 
# List running instances
gcloud compute instances list
 
# Stop/Start
gcloud compute instances stop my-vm --zone=us-central1-a
gcloud compute instances start my-vm --zone=us-central1-a
 
# Delete
gcloud compute instances delete my-vm --zone=us-central1-a

🧪 Micro-Lab: Your First GCE VM

Try this hands-on sequence (15 minutes):

  • Create an E2-micro VM with PD-Balanced boot disk
  • SSH in and run lsblk to see disk layout
  • In Console: Compute Engine → Disks → Resize the disk to +10GB
  • Back in SSH: run sudo growpart /dev/sda 1 and sudo resize2fs /dev/sda1
  • Note: the IOPS ceiling barely moved; PD performance scales with size, but +10GB only buys a small bump (check the disk metrics to confirm)
  • Delete the VM when done to avoid charges

Snapshots and Images

# Create snapshot
gcloud compute disks snapshot my-disk \
  --zone=us-central1-a \
  --snapshot-names=my-snapshot
 
# Create image from snapshot
gcloud compute images create my-image \
  --source-snapshot=my-snapshot
 
# Create image from disk
gcloud compute images create my-image \
  --source-disk=my-disk \
  --source-disk-zone=us-central1-a

Managed Instance Groups

# Create instance template
gcloud compute instance-templates create my-template \
  --machine-type=n4-standard-2 \
  --image-family=debian-12 \
  --image-project=debian-cloud
 
# Create regional MIG
gcloud compute instance-groups managed create my-mig \
  --template=my-template \
  --size=3 \
  --region=us-central1
 
# Configure autoscaling
gcloud compute instance-groups managed set-autoscaling my-mig \
  --region=us-central1 \
  --min-num-replicas=2 \
  --max-num-replicas=10 \
  --target-cpu-utilization=0.7

🧪 Micro-Lab: Watch Autohealing in Action

Try this to see MIG self-healing (20 minutes):

  • Create a regional MIG with 2 instances using the commands above
  • Add an HTTP health check: gcloud compute health-checks create http my-hc --port 80
  • Attach it: gcloud compute instance-groups managed update my-mig --region=us-central1 --health-check=my-hc
  • SSH into one instance and run sudo systemctl stop nginx (or kill whatever the health check expects)
  • Watch in Console: the instance goes UNHEALTHY → gets deleted → new one created
  • Clean up: delete the MIG, template, and health check

🧪 Micro-Lab: See Right-Sizing Recommendations

Quick cost optimization exercise (5 minutes):

  • Go to Console → Billing → Reports to see spend by SKU
  • Go to Console → Recommender → VM Right-sizing
  • Look for any VMs with <20% average CPU—these are candidates for downsizing
  • Note: recommendations take ~24 hours of usage data to appear


Summary: The ACE Exam Mental Model

When approaching GCE questions on the Associate Cloud Engineer exam, think in layers:

  1. What family? Match workload characteristics to machine family
  2. What storage? Default to PD-Balanced unless Hyperdisk flexibility is needed
  3. What availability? Single VM < Zonal MIG < Regional MIG (cost vs. SLA tradeoff)
  4. What cost model? Spot (interruptible) → CUD (predictable) → On-Demand (flexible)
  5. What security? Shielded (default) → Confidential (memory encryption) → Trusted Images (org control)

Master these decision points and the exam scenarios become straightforward pattern matching.


Exam Traps Cheat Sheet

Quick reference of common ACE exam pitfalls for GCE:

| Topic | Trap | Correct Thinking |
| --- | --- | --- |
| Machine families | Picking C4 for “needs good performance” | Check if cost matters—E2/N4 often sufficient |
| Machine families | Choosing M3 for “large database” | Most DBs are I/O-bound, not memory-bound—check the bottleneck |
| Machine families | Forgetting Tau T2A is Arm | x86 binaries won’t run without recompilation |
| Storage | Hyperdisk for simple boot disk | PD-Balanced is the right default |
| Storage | Hyperdisk when “zone failure” mentioned | Regional PD provides cross-zone replication |
| Storage | Ignoring disk size → IOPS relationship | On PD, bigger disk = more IOPS |
| MIGs | Zonal MIG for “high availability” | Regional MIG survives zone failures |
| MIGs | Forgetting health checks for autohealing | No health check = no autohealing |
| Autoscaling | Predictive for spiky/unpredictable loads | Predictive needs historical patterns; use reactive for chaos |
| Cost | CUD for uncertain future capacity | CUD is a commitment—use on-demand if unsure |
| Cost | On-demand for 24/7 production | You’re leaving 30-57% savings on the table |
| Security | Confidential VM for “secure boot” | Shielded VM handles boot integrity; Confidential is for memory |
| Security | SSH keys for enterprise access | OS Login integrates with IAM—prefer it |

Practice Scenarios

Test your mental model with these one-liner scenarios. Try to answer before checking.

  1. Scenario: A startup runs nightly ML training jobs that process 500GB of data. Jobs can restart from checkpoints. They want to minimize costs.

  2. Scenario: A bank needs VMs for a trading application where even the hypervisor shouldn’t be able to read memory contents.

  3. Scenario: An e-commerce company has predictable traffic with 3x spikes during sales events. They need to survive a zone failure.

  4. Scenario: A team needs to run legacy Windows x86 applications with consistent CPU performance for EDA workloads.

  5. Scenario: A company wants to prevent developers from launching VMs using random public images from the internet.


Practice Answers

  1. ML training with checkpoints, cost-sensitive: Spot VMs (fault-tolerant batch) + possibly Storage Optimized Z3 or Hyperdisk Throughput for the 500GB data reads.

  2. Memory contents hidden from hypervisor: Confidential VMs (AMD SEV-SNP or Intel TDX for memory encryption in use).

  3. Predictable traffic, survives zone failure: Regional MIG with predictive autoscaling + load balancer. CUD for the baseline capacity.

  4. Legacy Windows x86, consistent CPU, EDA: Compute Optimized C4 (not Tau T2A—that’s Arm). Consider sole-tenant nodes if licensing requires it.

  5. Prevent random public images: Trusted Images Policy via Organization Policy constraint compute.trustedImageProjects.



Last updated: 2026-01-09