Chapter 16 of 27
Managing Compute Engine: Updates, Availability, and Troubleshooting
Once VMs are running, they must be patched, scaled, and debugged; build the operational skills to keep Compute Engine workloads healthy.
Compute Engine Instance Lifecycle: The Big Picture
Why Lifecycle Matters
Operating Compute Engine means understanding what happens when you start, stop, reset, suspend, or delete a VM, and how that affects data, IPs, and billing.
Main Lifecycle States
Key states: PROVISIONING, STAGING, RUNNING, STOPPING, TERMINATED, SUSPENDING, SUSPENDED. Billing and behavior differ in each state.
Billing Basics
In RUNNING you pay for vCPU, RAM, and disks. In TERMINATED you stop paying for vCPU/RAM but still pay for persistent disks and static IPs.
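As a rough illustration of the billing rule above, the sketch below models which components are billed in the two states discussed here. The component names and the `billed_components` helper are hypothetical labels, not Compute Engine API fields, and real pricing has more line items than this.

```python
# Simplified model of the RUNNING vs TERMINATED billing rule.
# Component names are illustrative labels, not API fields.
COMPUTE = {"vcpu", "ram"}


def billed_components(state, has_static_ip=False):
    """Return the set of components billed for a VM in the given state."""
    if state not in ("RUNNING", "TERMINATED"):
        raise ValueError(f"state not modeled in this sketch: {state}")
    billed = {"persistent_disk"}      # disks are billed in both states
    if has_static_ip:
        billed.add("static_ip")       # a reserved static IP keeps costing money
    if state == "RUNNING":
        billed |= COMPUTE             # vCPU and RAM are billed only while running
    return billed


print(sorted(billed_components("TERMINATED", has_static_ip=True)))
```

Stopping the VM removes only the compute components from the set; everything else stays billable until it is deleted or released.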
Common Operations
Start, Stop, Reset, Suspend/Resume, and Delete control transitions between states and determine what resources and data are preserved.
Exam Angle
Expect questions about what survives each operation (disks, IPs, metadata, service accounts) and what is billable in each state.
Hands-On: Start, Stop, Reset, Suspend, Delete
Stopping a VM
Stop a dev VM when idle: in the console click Stop, or use `gcloud compute instances stop`. You stop paying for vCPU/RAM but disks and static IPs still cost money.
Reset vs Stop
Reset power-cycles a RUNNING VM immediately, without a clean guest OS shutdown. Unlike Stop, the VM stays in RUNNING and never passes through TERMINATED. It keeps the same instance ID and attached disks, unlike delete/recreate.
Suspend and Resume
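Suspend saves the VM's memory to storage and pauses the VM. Resume reloads that memory so the VM continues where it left off, which is faster than a cold boot. Not every VM configuration supports suspend, so check the limitations for your machine type.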
Delete and Keep Disk
When deleting, you can keep the boot disk for later reuse. With the CLI, pass `--keep-disks=boot` to `gcloud compute instances delete` to avoid losing the OS and data.
Common Exam Trap
Stopping a VM does not stop disk or static IP charges. Deleting a VM may delete the boot disk unless you explicitly keep it.
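The operations in this section map onto `gcloud compute instances` subcommands. The sketch below only builds and prints the command lines rather than running them; the `lifecycle_command` helper, the VM name `dev-vm`, and the zone are hypothetical values chosen for illustration.

```python
import shlex

# Subcommands of `gcloud compute instances` covered in this section.
OPERATIONS = {"start", "stop", "reset", "suspend", "resume", "delete"}


def lifecycle_command(operation, instance, zone, keep_boot_disk=False):
    """Build (but do not run) the gcloud argv for a lifecycle operation."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    argv = ["gcloud", "compute", "instances", operation, instance, f"--zone={zone}"]
    if keep_boot_disk:
        if operation != "delete":
            raise ValueError("--keep-disks only applies to delete")
        argv.append("--keep-disks=boot")  # preserve the boot disk on delete
    return argv


# Hypothetical VM name and zone, for illustration only.
print(shlex.join(lifecycle_command("stop", "dev-vm", "us-central1-a")))
print(shlex.join(lifecycle_command("delete", "dev-vm", "us-central1-a", keep_boot_disk=True)))
```

Building the argument list separately makes it easy to review exactly what would run before passing it to a shell or to `subprocess.run`.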
Managed Instance Groups: Rolling Updates and Patch Strategy
What is a MIG?
A managed instance group (MIG) runs many identical VMs from one instance template, ideal for stateless, horizontally scaled workloads.
Instance Template Role
The template defines machine type, image, disks, metadata, startup script, and service account used for all VMs in the group.
Patch Flow Overview
To patch: create a new template, point the MIG to it, then perform a rolling update to gradually replace old VMs with new ones.
Rolling Update Controls
`maxSurge`, `maxUnavailable`, and `minReadySec` control how many VMs update at once and how much capacity stays available.
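To make the two capacity controls concrete, here is a minimal sketch of the bounds they imply during an update. `update_bounds` is a hypothetical helper, and it ignores `minReadySec`, health checks, and autoscaling.

```python
def update_bounds(target_size, max_surge, max_unavailable):
    """Return (lowest_available, highest_total) instance counts during an update."""
    if max_surge == 0 and max_unavailable == 0:
        # With no surge and nothing allowed offline, no VM could be replaced.
        raise ValueError("max_surge and max_unavailable cannot both be 0")
    lowest_available = target_size - max_unavailable  # floor on serving capacity
    highest_total = target_size + max_surge           # ceiling on group size
    return lowest_available, highest_total


print(update_bounds(5, max_surge=1, max_unavailable=0))  # (5, 6)
```

A surge of 1 with nothing unavailable keeps all 5 VMs serving and temporarily pays for a sixth.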
Exam Reminder
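You do not hand-edit MIG instances for config changes. Update the template and roll. Know zonal vs regional MIG behavior: a zonal MIG keeps all VMs in one zone, while a regional MIG spreads them across multiple zones in a region.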
Example: Safe Rolling Update with Autoscaling
Scenario Setup
You have a regional MIG `web-mig` behind an HTTP(S) load balancer. You want to deploy a new app version safely.
Step 1: New Template
Create `web-template-v3` with updated image or startup script using `gcloud compute instance-templates create`.
Step 2: Point MIG to Template
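Use `gcloud compute instance-groups managed set-instance-template` so that VMs the group creates from now on use `web-template-v3`. Existing VMs keep their old configuration until they are replaced by an update.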
Step 3: Rolling Update
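Start a rolling update with `--max-unavailable=0` so no existing capacity is removed before its replacement is ready, and use `--max-surge` to add new VMs alongside the old ones. In a zonal MIG, `--max-surge=1` adds one new VM at a time; in a regional MIG such as `web-mig`, a fixed surge value must be 0 or at least the number of zones, so a three-zone group would use `--max-surge=3`.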
Autoscaling Interaction
Autoscaling may change group size based on CPU or load, but rolling update limits still control how many VMs update at once.
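The rollout in this example can be pictured with a simplified round-based model. This is a sketch under stated assumptions, not how the MIG updater is actually implemented: `simulate_rolling_update` is a hypothetical function, the group size of 6 is invented, and health checks, `minReadySec`, and autoscaling are ignored.

```python
def simulate_rolling_update(target_size, max_surge, max_unavailable):
    """Return (rounds, lowest_serving, peak_total) for a simplified rollout."""
    batch_limit = max_surge + max_unavailable
    if batch_limit <= 0:
        raise ValueError("need max_surge or max_unavailable > 0 to make progress")
    old = target_size
    rounds = 0
    lowest_serving = target_size
    peak_total = target_size
    while old > 0:
        batch = min(old, batch_limit)             # VMs replaced in this round
        taken_down = min(max_unavailable, batch)  # removed before replacements are ready
        surged = batch - taken_down               # extra VMs above the target size
        lowest_serving = min(lowest_serving, target_size - taken_down)
        peak_total = max(peak_total, target_size + surged)
        old -= batch
        rounds += 1
    return rounds, lowest_serving, peak_total


# Hypothetical six-VM group with nothing allowed to be unavailable.
print(simulate_rolling_update(6, max_surge=1, max_unavailable=0))  # (6, 6, 7)
print(simulate_rolling_update(6, max_surge=3, max_unavailable=0))  # (2, 6, 9)
```

In both runs serving capacity never drops below 6. A larger surge finishes in fewer rounds at the cost of more temporary VMs.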
Availability, Maintenance, Live Migration, and Preemptible VMs
Maintenance Policy
On host maintenance, you can choose MIGRATE for live migration or TERMINATE to stop the VM during maintenance.
Automatic Restart
Automatic restart brings a VM back up after it is stopped by an unexpected host failure or a maintenance event, not after a user-initiated stop. It is typically enabled for production workloads.
Live Migration
Live migration moves a running VM to another host with no reboot, handling planned maintenance with minimal disruption.
Preemptible / Spot VMs
These are low-cost, interruptible VMs that Google can reclaim at any time. Use them only for fault-tolerant, non-critical jobs.
Exam Pointers
Know flags like `--maintenance-policy` and `--restart-on-failure`, and that preemptible/Spot VMs offer no SLA and are not automatically restarted after they are preempted.
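As a sketch of how these settings show up on the command line, the helper below returns the scheduling flags for `gcloud compute instances create`. The `scheduling_flags` function, the VM name `api-vm`, and the zone are hypothetical; the command is only printed, not executed, and the Spot flag combination is an assumption to verify against the current `gcloud` reference.

```python
def scheduling_flags(spot=False):
    """Return gcloud scheduling flags for a standard or Spot VM."""
    if spot:
        # Spot VMs are stopped for maintenance and are not auto-restarted.
        return [
            "--provisioning-model=SPOT",
            "--maintenance-policy=TERMINATE",
            "--no-restart-on-failure",
        ]
    # Standard production VM: live-migrate and restart after host failures.
    return ["--maintenance-policy=MIGRATE", "--restart-on-failure"]


# Hypothetical VM name and zone, for illustration only.
argv = ["gcloud", "compute", "instances", "create", "api-vm",
        "--zone=us-central1-a"] + scheduling_flags()
print(" ".join(argv))
```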
Troubleshooting Basics: Serial Console, Logs, and Connectivity
Serial Console
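Serial console shows low-level boot and kernel logs, even when SSH fails. You can view serial port output from the VM details page or with `gcloud compute instances get-serial-port-output`; interactive access must first be enabled with the `serial-port-enable` metadata key.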
Cloud Logging
Use Cloud Logging (Logs Explorer) with `resource.type="gce_instance"` to view system and app logs from Compute Engine VMs.
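The base filter above can be narrowed to one VM or to higher severities. The sketch below assembles such a filter string for Logs Explorer; `gce_log_filter` is a hypothetical helper and the instance ID is a made-up placeholder.

```python
def gce_log_filter(instance_id=None, min_severity=None):
    """Build a Logs Explorer filter for Compute Engine VM logs."""
    clauses = ['resource.type="gce_instance"']
    if instance_id:
        # Restrict to one VM by its numeric instance ID.
        clauses.append(f'resource.labels.instance_id="{instance_id}"')
    if min_severity:
        # Keep entries at or above the given severity, e.g. ERROR.
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)


# The instance ID below is a placeholder, not a real VM.
print(gce_log_filter(instance_id="1234567890", min_severity="ERROR"))
```

The resulting string can be pasted into the Logs Explorer query box or passed to `gcloud logging read`.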
Connectivity Checks
Check firewall rules, VPC routes, and use Connectivity Tests to simulate traffic paths and find where packets are blocked.
SSH Troubleshooting Flow
If SSH fails: confirm VM is RUNNING, verify firewall on port 22, check external IP, then inspect serial console for boot issues.
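The flow above is an ordered checklist: stop at the first check that fails. A minimal sketch, assuming the VM facts have already been gathered into a dictionary (the keys and the `first_ssh_problem` helper are hypothetical, not API fields):

```python
# Ordered checks: (name, predicate, advice). Order matters.
SSH_CHECKS = [
    ("status", lambda vm: vm["status"] == "RUNNING",
     "Start the VM; it is not RUNNING."),
    ("firewall", lambda vm: vm["tcp22_allowed"],
     "Add or fix an ingress firewall rule for TCP port 22."),
    ("external_ip", lambda vm: vm["external_ip"] is not None,
     "Check the external IP, or connect from inside the VPC."),
]


def first_ssh_problem(vm):
    """Return (check_name, advice) for the first failing check."""
    for name, passes, advice in SSH_CHECKS:
        if not passes(vm):
            return name, advice
    # Network-level checks passed, so look at the guest itself.
    return "boot", "Inspect serial console output for boot issues."


vm = {"status": "RUNNING", "tcp22_allowed": False, "external_ip": "203.0.113.10"}
print(first_ssh_problem(vm))
```

Here the VM is running, so the first failure reported is the firewall rule.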
Disk Rescue Approach
For severe OS misconfigurations, stop the VM, detach the boot disk, attach it to a healthy VM as a secondary disk, fix the configs, then reattach it to the original VM and start it.
Thought Exercise: Choose the Right Availability and Update Strategy
Work through these scenarios and pick an approach. Think in terms of MIGs, maintenance policies, and preemptible vs standard VMs.
- Scenario 1: 24/7 API service with strict uptime
  - Requirements: minimal downtime during maintenance, automatic recovery from host failures, traffic served across zones.
  - Options:
    - A. Zonal MIG with preemptible VMs and maintenance policy TERMINATE
    - B. Regional MIG with standard VMs, maintenance policy MIGRATE, automatic restart enabled
    - C. Single VM with maintenance policy TERMINATE and no automatic restart
  - Which is best and why?
- Scenario 2: Batch data processing job that runs nightly
  - Requirements: cost-optimized, can tolerate restarts or interruptions, no strict uptime.
  - Options:
    - A. Regional MIG with Spot or preemptible VMs and an instance template that makes the job idempotent
    - B. Single large VM with automatic restart and maintenance policy MIGRATE
    - C. Zonal MIG with only standard VMs and `maxUnavailable=0` during updates
- Scenario 3: Patching a fleet of web servers
  - You operate a MIG of 20 web servers behind a load balancer. You need to roll out a security patch with no more than 10% capacity loss at any time.
  - How would you set `maxSurge` and `maxUnavailable`?
  - Explain briefly why your choice meets the requirement.
Pause and answer these in your own words. Then compare:
- Scenario 1: B is correct. Regional MIG + MIGRATE + auto restart gives multi-zone redundancy and seamless maintenance.
- Scenario 2: A is correct. Preemptible/Spot VMs in a MIG minimize cost for fault-tolerant batch jobs.
- Scenario 3: With 20 instances, 10% is 2. Set `maxUnavailable=2` (or less) and `maxSurge>=0`. A common safe choice is `maxSurge=2`, `maxUnavailable=2`.
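The Scenario 3 arithmetic generalizes to any group size. The sketch below rounds down so the capacity-loss budget is never exceeded; `max_unavailable_for` is a hypothetical helper, and integer percentages are used to avoid floating-point rounding surprises.

```python
def max_unavailable_for(group_size, max_loss_percent):
    """Largest maxUnavailable that stays within the capacity-loss budget."""
    # Integer floor division rounds down, so the budget is never exceeded.
    return group_size * max_loss_percent // 100


print(max_unavailable_for(20, 10))  # 2
print(max_unavailable_for(15, 10))  # 1
```

If the result is 0, no VM may be taken offline, so the rollout has to rely on `maxSurge` alone.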
Quiz: Lifecycle, MIG Updates, and Maintenance
Test your understanding of lifecycle operations, MIG updates, and maintenance behavior.
You manage a zonal managed instance group of 5 VMs serving a web app. You created a new instance template with a patched image and want to roll it out with zero downtime while keeping the group size at 5. Which `gcloud` rolling update configuration is MOST appropriate?
- A. Use `rolling-action start-update` with `--max-surge=0 --max-unavailable=5`
- B. Use `rolling-action start-update` with `--max-surge=1 --max-unavailable=0`
- C. Use `rolling-action start-update` with `--max-surge=5 --max-unavailable=5`
- D. Edit each VM manually to match the new template, then update the MIG later
Answer: B) Use `rolling-action start-update` with `--max-surge=1 --max-unavailable=0`
To maintain full capacity and avoid downtime, you must not have any unavailable instances. `--max-unavailable=0` enforces that. `--max-surge=1` allows one extra instance above the target size, so the group can create a new VM, wait for it to be healthy, then remove an old one. The other options either allow all instances to be unavailable, cause large disruptive surges, or ignore the correct MIG workflow.
Quiz: Troubleshooting and Quotas
Check your understanding of troubleshooting tools and quota-related errors.
You attempt to create a new e2-standard-4 instance in `us-central1-a` and receive an error: "Quota 'CPUS' exceeded in region us-central1." What is the BEST immediate action?
- A. Retry the command with a different machine type in the same zone
- B. Request a CPU quota increase for region `us-central1` and consider temporarily creating the VM in a region with available quota
- C. Disable automatic restart on existing instances to free CPU quota
- D. Increase the disk size of existing instances to reduce CPU usage
Answer: B) Request a CPU quota increase for region `us-central1` and consider temporarily creating the VM in a region with available quota
A CPU quota error means you have reached your regional vCPU limit. Changing machine type in the same region usually will not help if you are already at or near the limit. The correct response is to request a quota increase for the region and, if needed, deploy temporarily in another region where you have available quota. Automatic restart and disk size do not affect quota usage.
Quotas, Limits, and Common Error Messages
Compute Engine Quotas
Key quotas: vCPUs, persistent disk GB, static external IPs, and number of instances per project per region.
Recognizing Errors
Look for messages like `Quota 'CPUS' exceeded` or `Quota 'IN_USE_ADDRESSES' exceeded` in console or CLI output.
Short-Term Fixes
Delete unused VMs, disks, and static IPs, or use smaller machine types to fit within current quotas.
Request Increases
For ongoing needs, request a quota increase for the specific metric and region instead of repeatedly hitting limits.
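Before creating a VM or sizing a quota request, it helps to work out how far over the limit a request would land. A minimal sketch with made-up numbers (the limit of 24 and usage of 22 are assumptions; an e2-standard-4 has 4 vCPUs), using a hypothetical `quota_shortfall` helper:

```python
def quota_shortfall(limit, usage, requested):
    """Return how many units over the quota a request would be (0 if it fits)."""
    return max(0, usage + requested - limit)


# Hypothetical regional CPUS quota: limit 24, 22 already in use.
# An e2-standard-4 needs 4 vCPUs.
print(quota_shortfall(limit=24, usage=22, requested=4))  # 2
```

A shortfall of 2 means either freeing 2 vCPUs in the region or requesting an increase of at least that much.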
Exam Trap: Quota vs IAM
Quota errors differ from IAM errors. Quotas are about capacity; IAM errors are permission-related and fixed via IAM roles.
Key Term Flashcards: Compute Engine Operations
Review these cards to reinforce core terms and behaviors for Compute Engine management.
- Instance lifecycle: RUNNING vs TERMINATED billing
- RUNNING: you pay for vCPU, RAM, and attached disks. TERMINATED: you stop paying for vCPU/RAM, but persistent disks and static external IPs still incur charges until deleted or released.
- Managed instance group (MIG)
- A group of identical VM instances created from an instance template. It supports autoscaling, autohealing, and rolling updates, and is used for stateless, horizontally scaled workloads.
- Instance template
- A reusable configuration for VM instances, including machine type, image, disks, metadata, startup scripts, and service account. MIGs use it to create and update instances.
- Rolling update: maxSurge
- The maximum number of additional instances (above the target size) that can be created during an update. Higher values speed up updates but can increase cost and capacity spikes.
- Rolling update: maxUnavailable
- The maximum number of instances that can be unavailable during an update. Lower values protect availability but slow the rollout.
- On host maintenance: MIGRATE vs TERMINATE
- MIGRATE uses live migration to move VMs to another host during maintenance without rebooting. TERMINATE stops the VM; it may restart later if automatic restart is enabled.
- Preemptible / Spot VM characteristics
- Low-cost, interruptible VMs that Compute Engine can reclaim at any time. They have no SLA and should only run fault-tolerant or batch workloads. Preemptible VMs also have a 24-hour maximum lifetime; Spot VMs do not.
- Serial console usage
- A low-level console that shows boot and kernel logs, useful when SSH fails. Serial port output can be viewed from the VM details page; interactive access must be enabled via the `serial-port-enable` metadata key.
- Common CPU quota error message
- Quota 'CPUS' exceeded in region <region>. It means you have reached the regional vCPU limit; you may need to free resources or request a quota increase.
- Firewall rule basics for SSH
- To allow SSH, create an ingress rule on the VM's VPC network that permits TCP port 22 from the desired source ranges, targeting the VM by network tag or service account.
Key Terms
- Quota
- A configurable limit on resource usage in a Google Cloud project or region, such as vCPUs, disk space, or IP addresses, used to prevent accidental overconsumption.
- Live migration
- A Compute Engine capability that moves a running VM from one physical host to another during maintenance without requiring a reboot of the guest OS.
- Rolling update
- A deployment strategy where instances in a managed instance group are gradually replaced with new instances based on an updated template, controlling how many are updated at once.
- Serial console
- A low-level console interface for a VM instance that exposes boot and kernel logs and enables troubleshooting when network access like SSH is unavailable.
- Autoscaling (MIG)
- A feature of managed instance groups that automatically adjusts the number of VM instances based on metrics such as CPU utilization or HTTP load balancing capacity.
- Instance template
- A resource that defines the configuration of VM instances (machine type, image, disks, metadata, service account, etc.) and is used by MIGs to create instances.
- Connectivity Tests
- A tool in Network Intelligence Center that simulates network traffic between two endpoints and analyzes firewall rules, routes, and other factors to diagnose connectivity issues.
- Preemptible VM / Spot VM
- A low-cost, interruptible Compute Engine instance type that can be reclaimed by Google Cloud at any time and is suitable only for fault-tolerant or batch workloads.
- On host maintenance policy
- A VM setting that determines whether a VM is live-migrated (MIGRATE) or stopped (TERMINATE) when its underlying host requires maintenance.
- Managed instance group (MIG)
- A Compute Engine feature that manages a group of identical VM instances created from an instance template, supporting autoscaling, autohealing, and rolling updates.