Skarp

Chapter 16 of 27

Managing Compute Engine: Updates, Availability, and Troubleshooting

Once VMs are running, they must be patched, scaled, and debugged; build the operational skills to keep Compute Engine workloads healthy.

Compute Engine Instance Lifecycle: The Big Picture

Why Lifecycle Matters

Operating Compute Engine means understanding what happens when you start, stop, reset, suspend, or delete a VM, and how that affects data, IPs, and billing.

Main Lifecycle States

Key states: PROVISIONING, STAGING, RUNNING, STOPPING, TERMINATED, SUSPENDING, SUSPENDED. Billing and behavior differ in each state.

Billing Basics

In RUNNING you pay for vCPU, RAM, and disks. In TERMINATED you stop paying for vCPU/RAM but still pay for persistent disks and static IPs.

Common Operations

Start, Stop, Reset, Suspend/Resume, and Delete control transitions between states and determine what resources and data are preserved.

Exam Angle

Expect questions about what survives each operation (disks, IPs, metadata, service accounts) and what is billable in each state.

Hands-On: Start, Stop, Reset, Suspend, Delete

Stopping a VM

Stop a dev VM when idle: in the console click Stop, or use `gcloud compute instances stop`. You stop paying for vCPU/RAM but disks and static IPs still cost money.
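As a minimal sketch of the CLI path, the helper below wraps the stop command. The function name `stop_vm` and the VM name and zone in the usage note are hypothetical placeholders.

```shell
# Stop a VM so that vCPU and RAM billing ends. Attached persistent disks
# and any reserved static IP continue to be billed.
# Usage: stop_vm VM_NAME ZONE
stop_vm() {
  gcloud compute instances stop "$1" --zone="$2"
}
```

For example, `stop_vm dev-vm us-central1-a`. Afterwards, `gcloud compute instances describe dev-vm --zone=us-central1-a --format='value(status)'` should report `TERMINATED`.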

Reset vs Stop

Reset forcibly restarts a RUNNING VM, like a power cycle, without a clean guest shutdown. Unlike stop, the VM stays in RUNNING and keeps being billed; unlike delete/recreate, it keeps the same instance ID and attached disks.

Suspend and Resume

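Suspend saves the VM's memory and device state to storage and stops the VM. Resume restores that state, so applications continue where they left off instead of booting from scratch. A suspended VM still incurs charges for the preserved memory, persistent disks, and static IPs. Not all machine types/zones support this.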
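A short sketch of the two operations from the CLI, assuming the VM's machine type supports suspend. The function names are hypothetical wrappers.

```shell
# Suspend a running VM: memory contents are preserved and the VM moves
# through SUSPENDING to SUSPENDED.
# Usage: suspend_vm VM_NAME ZONE
suspend_vm() {
  gcloud compute instances suspend "$1" --zone="$2"
}

# Resume a suspended VM: the preserved memory is restored and the VM
# returns to RUNNING.
# Usage: resume_vm VM_NAME ZONE
resume_vm() {
  gcloud compute instances resume "$1" --zone="$2"
}
```

For example, `suspend_vm dev-vm us-central1-a` at the end of the day and `resume_vm dev-vm us-central1-a` the next morning.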

Delete and Keep Disk

When deleting, you can keep the boot disk for later reuse. In CLI use `--keep-disks=boot` to avoid losing the OS and data.
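The flag can be sketched in a small helper. The function name is hypothetical; `--keep-disks` also accepts `data` and `all`.

```shell
# Delete a VM but preserve its boot disk so the OS and data can be reused.
# gcloud asks for confirmation unless --quiet is added.
# Usage: delete_vm_keep_boot VM_NAME ZONE
delete_vm_keep_boot() {
  gcloud compute instances delete "$1" --zone="$2" --keep-disks=boot
}
```

The preserved disk keeps costing money until it is deleted, so it is worth confirming afterwards with `gcloud compute disks list` that it is still needed.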

Common Exam Trap

Stopping a VM does not stop disk or static IP charges. Deleting a VM may delete the boot disk unless you explicitly keep it.

Managed Instance Groups: Rolling Updates and Patch Strategy

What is a MIG?

A managed instance group (MIG) runs many identical VMs from one instance template, ideal for stateless, horizontally scaled workloads.

Instance Template Role

The template defines machine type, image, disks, metadata, startup script, and service account used for all VMs in the group.

Patch Flow Overview

To patch: create a new template, point the MIG to it, then perform a rolling update to gradually replace old VMs with new ones.

Rolling Update Controls

`maxSurge`, `maxUnavailable`, and `minReadySec` control how many VMs update at once and how much capacity stays available.

Exam Reminder

You do not hand-edit MIG instances for config changes. Update the template and roll. Know zonal vs regional MIG behavior.

Example: Safe Rolling Update with Autoscaling

Scenario Setup

You have a regional MIG `web-mig` behind an HTTP(S) load balancer. You want to deploy a new app version safely.

Step 1: New Template

Create `web-template-v3` with updated image or startup script using `gcloud compute instance-templates create`.
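One way this step might look is sketched below. The machine type, image family, and startup script path are assumptions chosen for illustration, and the function name is hypothetical; substitute the values your current template uses.

```shell
# Create a new instance template version. Templates are immutable, so each
# change (new image, new startup script) becomes a new template.
# Usage: create_web_template TEMPLATE_NAME STARTUP_SCRIPT_PATH
create_web_template() {
  gcloud compute instance-templates create "$1" \
    --machine-type=e2-medium \
    --image-family=debian-12 \
    --image-project=debian-cloud \
    --metadata-from-file=startup-script="$2"
}
```

For example, `create_web_template web-template-v3 ./startup.sh`.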

Step 2: Point MIG to Template

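Use `set-instance-template` to make `web-template-v3` the group's template. On its own this affects only VMs created from then on; existing VMs keep their old configuration until a rolling update replaces them.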

Step 3: Rolling Update

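Start a rolling update with `--max-unavailable=0` and a small `--max-surge` so all existing capacity stays in service while new VMs are added. Because `web-mig` is regional, a fixed surge value must be 0 or at least the number of zones the group spans, for example `--max-surge=3` for a three-zone group.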
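A sketch of the rollout command for this scenario follows. It assumes the group lives in `us-central1` and spans three zones, which is why the usage note passes a surge of 3: on a regional MIG a fixed surge value has to be 0 or at least the number of zones. The function name and region are hypothetical.

```shell
# Roll a MIG onto a new template without reducing serving capacity.
# --max-unavailable=0 keeps every existing VM in service until its
# replacement is ready; --max-surge sets how many extra VMs may exist.
# Usage: roll_out_template MIG_NAME TEMPLATE_NAME REGION MAX_SURGE
roll_out_template() {
  gcloud compute instance-groups managed rolling-action start-update "$1" \
    --version=template="$2" \
    --region="$3" \
    --max-surge="$4" \
    --max-unavailable=0
}
```

For example, `roll_out_template web-mig web-template-v3 us-central1 3`. To block until the group settles, `gcloud compute instance-groups managed wait-until web-mig --stable --region=us-central1` can follow.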

Autoscaling Interaction

Autoscaling may change group size based on CPU or load, but rolling update limits still control how many VMs update at once.

Availability, Maintenance, Live Migration, and Preemptible VMs

Maintenance Policy

On host maintenance, you can choose MIGRATE for live migration or TERMINATE to stop the VM during maintenance.

Automatic Restart

Automatic restart brings a VM back up after it crashes or is stopped by an unexpected host failure; it does not apply to a stop you initiate yourself. It is typically enabled for production workloads.

Live Migration

Live migration moves a running VM to another host with no reboot, handling planned maintenance with minimal disruption.

Preemptible / Spot VMs

These are low-cost, interruptible VMs that Google can reclaim at any time. Use them only for fault-tolerant, non-critical jobs.
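As an illustration, a Spot VM could be created as sketched below. The machine type and the choice of `STOP` as the termination action are assumptions; `DELETE` is the other option. The function name is hypothetical.

```shell
# Create a Spot VM for interruptible work. When Compute Engine reclaims
# the capacity, the VM is stopped (not deleted), so its disk survives.
# Usage: create_spot_vm VM_NAME ZONE
create_spot_vm() {
  gcloud compute instances create "$1" \
    --zone="$2" \
    --machine-type=e2-medium \
    --provisioning-model=SPOT \
    --instance-termination-action=STOP
}
```

For example, `create_spot_vm batch-worker-1 us-central1-a`. The job running on it should checkpoint its progress so a reclaimed VM can pick up where it left off.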

Exam Pointers

Know flags like `--maintenance-policy` and `--restart-on-failure`, and that preemptible/Spot VMs offer no SLA and cannot use live migration or automatic restart.
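Both flags can be applied to an existing standard VM with `set-scheduling`, as in this sketch. The function name is a hypothetical wrapper.

```shell
# Configure a standard VM for high availability: live-migrate during host
# maintenance and restart automatically after an unexpected failure.
# Usage: set_ha_scheduling VM_NAME ZONE
set_ha_scheduling() {
  gcloud compute instances set-scheduling "$1" \
    --zone="$2" \
    --maintenance-policy=MIGRATE \
    --restart-on-failure
}
```

For example, `set_ha_scheduling api-vm-1 us-central1-a`. The opposite settings are `--maintenance-policy=TERMINATE` and `--no-restart-on-failure`.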

Troubleshooting Basics: Serial Console, Logs, and Connectivity

Serial Console

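Serial console output shows low-level boot and kernel logs, even when SSH fails, and can be viewed from the VM details page without extra setup. Interactive serial console access must first be enabled by setting the metadata key `serial-port-enable` to `TRUE`.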
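The same two tasks from the CLI, as a rough sketch. The function names are hypothetical wrappers around standard commands.

```shell
# Print the boot and kernel log captured on the VM's serial port.
# Usage: show_boot_log VM_NAME ZONE
show_boot_log() {
  gcloud compute instances get-serial-port-output "$1" --zone="$2"
}

# Turn on interactive serial console access for one VM via metadata.
# Usage: enable_serial_console VM_NAME ZONE
enable_serial_console() {
  gcloud compute instances add-metadata "$1" --zone="$2" \
    --metadata=serial-port-enable=TRUE
}
```

After `enable_serial_console dev-vm us-central1-a`, an interactive session can be opened with `gcloud compute connect-to-serial-port dev-vm --zone=us-central1-a`.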

Cloud Logging

Use Cloud Logging (Logs Explorer) with `resource.type="gce_instance"` to view system and app logs from Compute Engine VMs.
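The same filter works from the CLI. The sketch below adds a severity clause and a one-hour default window, both of which are illustrative choices rather than requirements; the function name is hypothetical.

```shell
# Show recent error-level log entries from Compute Engine VMs.
# Usage: recent_vm_errors [FRESHNESS]   (for example 30m, 1h, 2d)
recent_vm_errors() {
  gcloud logging read 'resource.type="gce_instance" AND severity>=ERROR' \
    --freshness="${1:-1h}" \
    --limit=20
}
```

Calling `recent_vm_errors` with no argument looks back one hour; `recent_vm_errors 1d` widens the window to a day.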

Connectivity Checks

Check firewall rules, VPC routes, and use Connectivity Tests to simulate traffic paths and find where packets are blocked.

SSH Troubleshooting Flow

If SSH fails: confirm VM is RUNNING, verify firewall on port 22, check external IP, then inspect serial console for boot issues.

Disk Rescue Approach

For severe OS misconfigurations, stop the VM, detach its boot disk, attach the disk to a healthy VM as a secondary disk, fix the configs, then reattach it as the boot disk and start the original VM.
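The rescue flow might be scripted roughly as follows, assuming both VMs are in the same zone. All function names and the resource names in the usage note are hypothetical.

```shell
# Move a broken VM's boot disk onto a healthy rescue VM for repair.
# Usage: move_boot_disk_to_rescue BROKEN_VM DISK_NAME RESCUE_VM ZONE
move_boot_disk_to_rescue() {
  gcloud compute instances stop "$1" --zone="$4" &&
  gcloud compute instances detach-disk "$1" --disk="$2" --zone="$4" &&
  gcloud compute instances attach-disk "$3" --disk="$2" --zone="$4"
}

# Return the repaired disk to the original VM as its boot disk and start it.
# Usage: return_boot_disk BROKEN_VM DISK_NAME RESCUE_VM ZONE
return_boot_disk() {
  gcloud compute instances detach-disk "$3" --disk="$2" --zone="$4" &&
  gcloud compute instances attach-disk "$1" --disk="$2" --zone="$4" --boot &&
  gcloud compute instances start "$1" --zone="$4"
}
```

For example, run `move_boot_disk_to_rescue web-1 web-1-boot rescue-vm us-central1-a`, mount the disk inside `rescue-vm`, fix the files, unmount it, and then run `return_boot_disk web-1 web-1-boot rescue-vm us-central1-a`.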

Thought Exercise: Choose the Right Availability and Update Strategy

Work through these scenarios and pick an approach. Think in terms of MIGs, maintenance policies, and preemptible vs standard VMs.

  1. 24/7 API service with strict uptime
  • Requirements: minimal downtime during maintenance, automatic recovery from host failures, traffic served across zones.
  • Your choices:
  • A. Zonal MIG with preemptible VMs and maintenance policy TERMINATE
  • B. Regional MIG with standard VMs, maintenance policy MIGRATE, automatic restart enabled
  • C. Single VM with maintenance policy TERMINATE and no automatic restart

Which is best and why?

  2. Batch data processing job that runs nightly
  • Requirements: cost-optimized, can tolerate restarts or interruptions, no strict uptime.
  • Options:
  • A. Regional MIG with Spot or preemptible VMs and an instance template that makes the job idempotent
  • B. Single large VM with automatic restart and maintenance policy MIGRATE
  • C. Zonal MIG with only standard VMs and `maxUnavailable=0` during updates

  3. Patching a fleet of web servers

You operate a MIG of 20 web servers behind a load balancer. You need to roll out a security patch with no more than 10% capacity loss at any time.

  • How would you set `maxSurge` and `maxUnavailable`?
  • Explain briefly why your choice meets the requirement.

Pause and answer these in your own words. Then compare:

  • Scenario 1: B is correct. Regional MIG + MIGRATE + auto restart gives multi-zone redundancy and seamless maintenance.
  • Scenario 2: A is correct. Preemptible/Spot VMs in a MIG minimize cost for fault-tolerant batch jobs.
  • Scenario 3: With 20 instances, 10% is 2. Set `maxUnavailable=2` (or less) and `maxSurge>=0`. A common safe choice is `maxSurge=2`, `maxUnavailable=2`.
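The arithmetic behind Scenario 3 can be checked with a few lines of shell. The helper name is hypothetical; integer division rounds down, which is the safe direction for a capacity-loss budget.

```shell
# Largest whole number of instances that may be unavailable at once
# without exceeding a capacity-loss budget given in percent.
# Usage: max_unavailable_for GROUP_SIZE PERCENT
max_unavailable_for() {
  echo $(( $1 * $2 / 100 ))
}

max_unavailable_for 20 10   # prints 2
```

With `maxUnavailable=2`, at least 18 of the 20 servers keep serving throughout the rollout. For small groups the result can be 0 (for example, 5 instances at 10%), in which case the update can only make progress if `maxSurge` is greater than 0.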

Quiz: Lifecycle, MIG Updates, and Maintenance

Test your understanding of lifecycle operations, MIG updates, and maintenance behavior.

You manage a zonal managed instance group of 5 VMs serving a web app. You created a new instance template with a patched image and want to roll it out with zero downtime while keeping the group size at 5. Which `gcloud` rolling update configuration is MOST appropriate?

  A. Use `rolling-action start-update` with `--max-surge=0 --max-unavailable=5`
  B. Use `rolling-action start-update` with `--max-surge=1 --max-unavailable=0`
  C. Use `rolling-action start-update` with `--max-surge=5 --max-unavailable=5`
  D. Edit each VM manually to match the new template, then update the MIG later

Answer: B) Use `rolling-action start-update` with `--max-surge=1 --max-unavailable=0`

To maintain full capacity and avoid downtime, you must not have any unavailable instances. `--max-unavailable=0` enforces that. `--max-surge=1` allows one extra instance above the target size, so the group can create a new VM, wait for it to be healthy, then remove an old one. The other options either allow all instances to be unavailable, cause large disruptive surges, or ignore the correct MIG workflow.

Quiz: Troubleshooting and Quotas

Check your understanding of troubleshooting tools and quota-related errors.

You attempt to create a new e2-standard-4 instance in `us-central1-a` and receive an error: "Quota 'CPUS' exceeded in region us-central1." What is the BEST immediate action?

  A. Retry the command with a different machine type in the same zone
  B. Request a CPU quota increase for region `us-central1` and consider temporarily creating the VM in a region with available quota
  C. Disable automatic restart on existing instances to free CPU quota
  D. Increase the disk size of existing instances to reduce CPU usage

Answer: B) Request a CPU quota increase for region `us-central1` and consider temporarily creating the VM in a region with available quota

A CPU quota error means you have reached your regional vCPU limit. Changing machine type in the same region usually will not help if you are already at or near the limit. The correct response is to request a quota increase for the region and, if needed, deploy temporarily in another region where you have available quota. Automatic restart and disk size do not affect quota usage.

Quotas, Limits, and Common Error Messages

Compute Engine Quotas

Key quotas: vCPUs, persistent disk GB, static external IPs, and number of instances per project per region.

Recognizing Errors

Look for messages like `Quota 'CPUS' exceeded` or `Quota 'IN_USE_ADDRESSES' exceeded` in console or CLI output.

Short-Term Fixes

Delete unused VMs, disks, and static IPs, or use smaller machine types to fit within current quotas.

Request Increases

For ongoing needs, request a quota increase for the specific metric and region instead of repeatedly hitting limits.
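Before requesting an increase, it helps to see how close each regional quota is to its limit. The sketch below is one way to list them; the function name is hypothetical.

```shell
# List every quota metric for a region with current usage and limit.
# --flatten turns the quotas list into one table row per metric.
# Usage: show_region_quotas REGION
show_region_quotas() {
  gcloud compute regions describe "$1" \
    --flatten="quotas[]" \
    --format="table(quotas.metric,quotas.usage,quotas.limit)"
}
```

For example, `show_region_quotas us-central1` and look at the `CPUS` and `IN_USE_ADDRESSES` rows. Project-wide quotas appear under `gcloud compute project-info describe` instead.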

Exam Trap: Quota vs IAM

Quota errors differ from IAM errors. Quotas are about capacity; IAM errors are permission-related and fixed via IAM roles.

Key Term Flashcards: Compute Engine Operations

Flip through these cards to reinforce core terms and behaviors for Compute Engine management.

Instance lifecycle: RUNNING vs TERMINATED billing
RUNNING: you pay for vCPU, RAM, and attached disks. TERMINATED: you stop paying for vCPU/RAM, but persistent disks and static external IPs still incur charges until deleted or released.
Managed instance group (MIG)
A group of identical VM instances created from an instance template. It supports autoscaling, autohealing, and rolling updates, and is used for stateless, horizontally scaled workloads.
Instance template
A reusable configuration for VM instances, including machine type, image, disks, metadata, startup scripts, and service account. MIGs use it to create and update instances.
Rolling update: maxSurge
The maximum number of additional instances (above the target size) that can be created during an update. Higher values speed up updates but can increase cost and capacity spikes.
Rolling update: maxUnavailable
The maximum number of instances that can be unavailable during an update. Lower values protect availability but slow the rollout.
On host maintenance: MIGRATE vs TERMINATE
MIGRATE uses live migration to move VMs to another host during maintenance without rebooting. TERMINATE stops the VM; it may restart later if automatic restart is enabled.
Preemptible / Spot VM characteristics
Low-cost, interruptible VMs that Compute Engine can reclaim at any time. They have no SLA, preemptible VMs run for at most 24 hours (Spot VMs have no fixed maximum runtime), and both should only run fault-tolerant or batch workloads.
Serial console usage
A low-level console that shows boot and kernel logs, useful when SSH fails. Its output can be viewed from the VM details page; interactive access must be enabled via the `serial-port-enable` metadata key.
Common CPU quota error message
Quota 'CPUS' exceeded in region <region>. It means you have reached the regional vCPU limit; you may need to free resources or request a quota increase.
Firewall rule basics for SSH
To allow SSH, create an ingress rule that permits TCP port 22 from the desired source ranges to the VM’s network tag or service account on the correct VPC/subnet.
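The SSH firewall card can be sketched as a single command. The function name and every value in the usage note (rule name, network, source range, tag) are hypothetical; `203.0.113.0/24` is a documentation-only range standing in for your real admin network.

```shell
# Create an ingress rule allowing SSH (TCP 22) from a source range to
# VMs carrying a given network tag on one VPC network.
# Usage: allow_ssh RULE_NAME NETWORK SOURCE_RANGE TARGET_TAG
allow_ssh() {
  gcloud compute firewall-rules create "$1" \
    --network="$2" \
    --direction=INGRESS \
    --allow=tcp:22 \
    --source-ranges="$3" \
    --target-tags="$4"
}
```

For example, `allow_ssh allow-ssh-admin default 203.0.113.0/24 ssh-enabled`. Only VMs tagged `ssh-enabled` are affected, so avoid `0.0.0.0/0` as the source range unless that exposure is intended.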

Key Terms

Quota
A configurable limit on resource usage in a Google Cloud project or region, such as vCPUs, disk space, or IP addresses, used to prevent accidental overconsumption.
Live migration
A Compute Engine capability that moves a running VM from one physical host to another during maintenance without requiring a reboot of the guest OS.
Rolling update
A deployment strategy where instances in a managed instance group are gradually replaced with new instances based on an updated template, controlling how many are updated at once.
Serial console
A low-level console interface for a VM instance that exposes boot and kernel logs and enables troubleshooting when network access like SSH is unavailable.
Autoscaling (MIG)
A feature of managed instance groups that automatically adjusts the number of VM instances based on metrics such as CPU utilization or HTTP load balancing capacity.
Instance template
A resource that defines the configuration of VM instances (machine type, image, disks, metadata, service account, etc.) and is used by MIGs to create instances.
Connectivity Tests
A tool in Network Intelligence Center that simulates network traffic between two endpoints and analyzes firewall rules, routes, and other factors to diagnose connectivity issues.
Preemptible VM / Spot VM
A low-cost, interruptible Compute Engine instance type that can be reclaimed by Google Cloud at any time and is suitable only for fault-tolerant or batch workloads.
On host maintenance policy
A VM setting that determines whether a VM is live-migrated (MIGRATE) or stopped (TERMINATE) when its underlying host requires maintenance.
Managed instance group (MIG)
A Compute Engine feature that manages a group of identical VM instances created from an instance template, supporting autoscaling, autohealing, and rolling updates.
