Chapter 17 of 26

Operating Compute Engine: Lifecycle Management, Patching, and Troubleshooting

Keep your VMs healthy by mastering instance lifecycle operations, OS patching, and common troubleshooting patterns tested on the exam.

Compute Engine Lifecycle Basics

Lifecycle States Overview

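A Compute Engine VM moves through states such as PROVISIONING, STAGING, RUNNING, STOPPING, and TERMINATED, plus SUSPENDING and SUSPENDED when you suspend it, and REPAIRING when Compute Engine is recovering the VM from an internal error or host problem. You see these in the Console or via `gcloud compute instances list`.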
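
To see these states from the command line, a formatted listing is usually enough. The sketch below wraps two such listings in helper functions; the function names are illustrative and not part of `gcloud`.

```shell
# List every VM in the current project with its zone and lifecycle state.
list_vm_states() {
  gcloud compute instances list \
    --format="table(name,zone.basename(),status)"
}

# Show only VMs that are not RUNNING (for example TERMINATED or SUSPENDED).
list_non_running_vms() {
  gcloud compute instances list \
    --filter="status!=RUNNING" \
    --format="table(name,zone.basename(),status)"
}
```

Call `list_non_running_vms` when you want to spot stopped VMs whose disks are still accruing cost.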

Start, Stop, Suspend

Start boots a stopped VM and resumes vCPU/RAM billing. Stop gracefully shuts down the OS, moves the VM to TERMINATED, and stops vCPU/RAM billing while disks still accrue cost. Suspend preserves the contents of RAM and pauses the VM; you stop paying for vCPU/RAM, but you are billed for storing the preserved memory and for the disks.
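
A minimal sketch of these operations with `gcloud`, assuming you pass the VM name and zone as arguments (the helper names are hypothetical):

```shell
# Stop a VM to end vCPU/RAM charges, then print its reported state.
# Arguments: VM name, zone.
stop_and_check() {
  local vm="$1" zone="$2"
  gcloud compute instances stop "$vm" --zone="$zone" &&
  gcloud compute instances describe "$vm" --zone="$zone" \
    --format="value(status)"
}

# Suspend keeps memory contents; resume (not start) brings the VM back.
suspend_vm() { gcloud compute instances suspend "$1" --zone="$2"; }
resume_vm()  { gcloud compute instances resume "$1" --zone="$2"; }

# Example: stop_and_check web-prod-1 us-central1-a
```

After a successful stop, the status printed by `stop_and_check` should be TERMINATED.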

Reset and Delete

Reset is a hard reboot with no OS shutdown; unsaved in-memory data is lost. Delete permanently removes the VM. Disks may be deleted or kept depending on the "Delete disk" flag; metadata and labels are lost on delete.
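
The "Delete disk" behavior can also be controlled from the CLI. The following sketch shows two ways to keep a boot disk when a VM goes away; the helper names are illustrative, and the disk name is whatever your VM actually uses.

```shell
# Turn off auto-delete for one attached disk so it survives VM deletion.
# Arguments: VM name, disk name, zone.
keep_disk_on_delete() {
  local vm="$1" disk="$2" zone="$3"
  gcloud compute instances set-disk-auto-delete "$vm" \
    --disk="$disk" --zone="$zone" --no-auto-delete
}

# Delete the VM but keep its boot disk, regardless of the auto-delete flag.
# --quiet skips the interactive confirmation prompt.
delete_vm_keep_boot_disk() {
  local vm="$1" zone="$2"
  gcloud compute instances delete "$vm" --zone="$zone" \
    --keep-disks=boot --quiet
}
```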

Billing and IP Nuances

vCPU and memory charges apply while a VM is RUNNING; a VM in REPAIRING is not billed for compute. Disks are always billed while they exist. Ephemeral external IPs are released when you stop or delete a VM; static external IPs stay reserved and billable even without a running VM.
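
One practical consequence is that reserved static IPs can quietly cost money after their VM is gone. A small sketch for finding them (the helper name is hypothetical):

```shell
# List static addresses that are reserved but not attached to any resource.
# Addresses in use report status IN_USE; idle ones report RESERVED.
list_idle_static_ips() {
  gcloud compute addresses list \
    --filter="status=RESERVED" \
    --format="table(name,region.basename(),address,status)"
}
```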

Changing Machine Types, Disks, and Live Maintenance

Changing Machine Types

To change a VM’s machine type, you usually stop it, then edit the machine type in the Console or run `gcloud compute instances set-machine-type`. Start the VM again to apply the new vCPU/RAM configuration.
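
The stop, change, start sequence can be sketched as a single helper. The function name and the example machine type are assumptions for illustration.

```shell
# Change a VM's machine type: stop, set the new type, start again.
# Each step runs only if the previous one succeeded.
# Arguments: VM name, zone, new machine type.
change_machine_type() {
  local vm="$1" zone="$2" machine_type="$3"
  gcloud compute instances stop "$vm" --zone="$zone" &&
  gcloud compute instances set-machine-type "$vm" --zone="$zone" \
    --machine-type="$machine_type" &&
  gcloud compute instances start "$vm" --zone="$zone"
}

# Example: change_machine_type web-prod-1 us-central1-a e2-standard-4
```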

Resizing Disks

You can only increase persistent disk size. Grow the disk in the Console or via `gcloud`, then expand the filesystem inside the guest OS (for example, `resize2fs` on Linux or Disk Management on Windows).
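
As a rough sketch, resizing is a two-step job: one command from your workstation and one pair of commands inside the guest. The helper names are hypothetical, and the guest step assumes an ext4 filesystem on partition 1 of a device such as `/dev/sdb`; NVMe device names and other filesystems need different commands.

```shell
# Step 1 (workstation): grow the persistent disk. Size can only go up.
# Arguments: disk name, zone, new size (for example 200GB).
grow_disk() {
  local disk="$1" zone="$2" new_size="$3"
  gcloud compute disks resize "$disk" --zone="$zone" \
    --size="$new_size" --quiet
}

# Step 2 (inside the Linux guest): grow partition 1, then the ext4 filesystem.
# Argument: block device, for example /dev/sdb.
grow_ext4_filesystem() {
  local device="$1"
  sudo growpart "$device" 1 &&
  sudo resize2fs "${device}1"
}
```

Many public images grow the root filesystem automatically at boot, so the guest step often matters most for data disks.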

Disk Type Changes

Changing disk type (standard to SSD) requires a new disk. Common pattern: create a snapshot, create a new disk of the desired type from that snapshot, attach it, and optionally make it the new boot disk.
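
The snapshot-based pattern can be sketched as follows. The helper name and the `-migrate` snapshot suffix are assumptions chosen for this example.

```shell
# Copy a disk to a new disk of a different type via a snapshot.
# Arguments: source disk, zone, new disk name, new type (for example pd-ssd).
clone_disk_as_type() {
  local src="$1" zone="$2" new_disk="$3" new_type="$4"
  local snap="${src}-migrate"   # hypothetical snapshot naming scheme
  gcloud compute disks snapshot "$src" --zone="$zone" \
    --snapshot-names="$snap" &&
  gcloud compute disks create "$new_disk" --zone="$zone" \
    --source-snapshot="$snap" --type="$new_type"
}

# Example: clone_disk_as_type data-1 us-central1-a data-1-ssd pd-ssd
```

Attaching the new disk and retiring the old one is left as a separate, deliberate step.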

Live Migration

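Google Cloud periodically performs host maintenance. With the default Migrate policy, VMs are live-migrated to another host with minimal disruption. If the maintenance policy is Terminate, VMs are stopped for the maintenance event; they come back automatically only if automatic restart is enabled, otherwise you must start them yourself.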
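
A minimal sketch for setting and inspecting the maintenance policy on an existing VM. The helper names are hypothetical; the argument check is there only to catch typos before calling `gcloud`.

```shell
# Set the host maintenance policy and enable automatic restart.
# Arguments: VM name, zone, policy (MIGRATE or TERMINATE).
set_maintenance_policy() {
  local vm="$1" zone="$2" policy="$3"
  case "$policy" in
    MIGRATE|TERMINATE) ;;
    *) echo "policy must be MIGRATE or TERMINATE" >&2; return 1 ;;
  esac
  gcloud compute instances set-scheduling "$vm" --zone="$zone" \
    --maintenance-policy="$policy" --restart-on-failure
}

# Print the scheduling block of a VM to confirm the current settings.
show_scheduling() {
  gcloud compute instances describe "$1" --zone="$2" \
    --format="yaml(scheduling)"
}
```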

Disks, Snapshots, and Images for Backup and Migration

Persistent Disk vs Local SSD

Persistent disks are durable, network-attached, and can be zonal or regional. Local SSDs are host-attached, extremely fast, but ephemeral; their data is lost when the VM stops or is deleted.

What Snapshots Do

A snapshot is an incremental, point-in-time backup of a persistent disk. You can create new disks from snapshots in any compatible zone or region, making them ideal for backup and recovery.

Using Snapshots

Before risky changes, create a snapshot of the boot disk. If the change fails, create a new disk from the snapshot and attach it to a new VM, effectively rolling back to the previous state.

What Images Do

An image is a bootable disk template containing an OS and optional software. Use images to create standardized VMs across projects or regions, or to clone a configured VM many times.

Hands-on: Backup and Migration with Snapshots and Images

Scenario 1: Pre-Change Snapshot

Before a risky deployment on `web-prod-1`, create a snapshot of its boot disk named `web-prod-1-predeploy`. If the update breaks the VM, you can later create a new disk from this snapshot.

Scenario 1: Rollback Steps

If the VM becomes unstable, stop it, create a new disk from `web-prod-1-predeploy`, and either attach that disk to a new VM or swap it in as the boot disk, effectively rolling back the OS and app state.
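
Scenario 1 can be sketched end to end as two helpers. The function names and the new disk name in the example are assumptions, and the sketch assumes the boot disk shares the VM's name, which is the default when you do not name it yourself.

```shell
# Before the change: snapshot the boot disk.
# Arguments: disk name, zone, snapshot name.
snapshot_before_change() {
  local disk="$1" zone="$2" snapshot="$3"
  gcloud compute disks snapshot "$disk" --zone="$zone" \
    --snapshot-names="$snapshot"
}

# Rollback: stop the VM, build a disk from the snapshot, swap boot disks.
# Arguments: VM name, zone, snapshot, current boot disk, new disk name.
rollback_boot_disk() {
  local vm="$1" zone="$2" snapshot="$3" old_disk="$4" new_disk="$5"
  gcloud compute instances stop "$vm" --zone="$zone" &&
  gcloud compute disks create "$new_disk" --zone="$zone" \
    --source-snapshot="$snapshot" &&
  gcloud compute instances detach-disk "$vm" --zone="$zone" \
    --disk="$old_disk" &&
  gcloud compute instances attach-disk "$vm" --zone="$zone" \
    --disk="$new_disk" --boot &&
  gcloud compute instances start "$vm" --zone="$zone"
}

# Example:
#   snapshot_before_change web-prod-1 us-central1-a web-prod-1-predeploy
#   rollback_boot_disk web-prod-1 us-central1-a web-prod-1-predeploy \
#     web-prod-1 web-prod-1-restored
```

The old boot disk is only detached, not deleted, so you can still inspect it after the rollback.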

Scenario 2: Create an Image

To clone a tuned VM to another region, stop `app-template`, then create an image from its boot disk (for example, `app-template-image`). This image becomes a reusable boot template.

Scenario 2: Deploy in New Region

In `europe-west1`, create a new VM and select `app-template-image` as the boot disk image. You now have a copy of the original configuration in a different region, ideal for multi-region deployments.
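
Scenario 2 might look like this in `gcloud`. The helper names, the new VM name, the zones, and the project ID in the example are placeholders; the sketch again assumes the boot disk is named after the VM.

```shell
# Stop the template VM and create an image from its boot disk.
# Arguments: VM name (also used as boot disk name), zone, image name.
create_image_from_vm() {
  local vm="$1" zone="$2" image="$3"
  gcloud compute instances stop "$vm" --zone="$zone" &&
  gcloud compute images create "$image" \
    --source-disk="$vm" --source-disk-zone="$zone"
}

# Create a VM from a custom image in any zone, for example in europe-west1.
# Arguments: new VM name, zone, image name, project that owns the image.
deploy_from_image() {
  local vm="$1" zone="$2" image="$3" image_project="$4"
  gcloud compute instances create "$vm" --zone="$zone" \
    --image="$image" --image-project="$image_project"
}

# Example:
#   create_image_from_vm app-template us-central1-a app-template-image
#   deploy_from_image app-eu-1 europe-west1-b app-template-image my-project
```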

OS Patching, Guest Environment, and Patch Management

What Is the Guest Environment?

The guest environment is a set of scripts and agents inside the VM (for example, `google-guest-agent`) that integrates the OS with Google Cloud for SSH keys, OS Login, metadata, and patch/inventory reporting.

Manual vs Managed Patching

You can patch manually using package managers or Windows Update, but this does not scale. VM Manager’s OS patch management lets you centrally schedule and monitor patch jobs across many VMs.

Patch Job Workflow

Enable the OS Config agent, then in the Console create a patch deployment. Target VMs by labels or zones, set the schedule, and define reboot behavior. VM Manager runs the patch job and reports results.
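
The same targeting ideas apply to an on-demand patch job started from the CLI. The sketch below runs a one-off job rather than a scheduled patch deployment; the helper name and the example label are assumptions.

```shell
# Run a one-off patch job against VMs that carry a given label.
# Arguments: label selector (for example env=staging), display name.
# --reboot-config=default reboots only when the OS reports it is needed.
run_patch_job() {
  local label="$1" display_name="$2"
  gcloud compute os-config patch-jobs execute \
    --instance-filter-group-labels="$label" \
    --display-name="$display_name" \
    --reboot-config=default
}

# Example: run_patch_job env=staging staging-monthly-patch
```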

Labels and Maintenance Windows

Apply labels like `env=prod` or `role=db` so patch jobs can select the right VMs. For production, schedule patches during maintenance windows and test on staging VMs first to reduce risk.
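
Labeling and checking what a label selects can be sketched like this (helper names are illustrative):

```shell
# Add or update labels on a VM.
# Arguments: VM name, zone, comma-separated labels (for example env=prod,role=db).
label_vm() {
  local vm="$1" zone="$2" labels="$3"
  gcloud compute instances add-labels "$vm" --zone="$zone" \
    --labels="$labels"
}

# List the VMs a label would select, to preview a patch job's targets.
# Arguments: label key, label value.
list_vms_by_label() {
  gcloud compute instances list \
    --filter="labels.$1=$2" \
    --format="value(name,zone.basename())"
}
```

Running `list_vms_by_label env prod` before scheduling a production patch is a cheap way to confirm the blast radius.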

Troubleshooting SSH and Serial Console Access

Start with Basics

When SSH fails, first confirm the VM is RUNNING, has the correct IP, and you are connecting to port 22. Many issues are simple IP or state mismatches.

Firewall Rules for SSH

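Ensure an ingress firewall rule allows TCP 22 to the VM. The rule must target the VM (by network tag or service account, or by applying to all instances in the network) and list a source range that includes your client. Firewall rules are evaluated by priority, where a lower number means higher priority, so confirm that no matching deny rule has a lower number than your allow rule.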
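
A sketch of creating such a rule and reviewing what already exists. The helper names, rule name, and tag are placeholders; the example source range `35.235.240.0/20` is the range Google documents for IAP TCP forwarding, which browser-based SSH can use.

```shell
# Create an ingress rule allowing SSH from one source range to tagged VMs.
# Arguments: rule name, network, source CIDR range, target network tag.
allow_ssh_from() {
  local rule="$1" network="$2" source_range="$3" target_tag="$4"
  gcloud compute firewall-rules create "$rule" \
    --network="$network" --direction=INGRESS --action=ALLOW \
    --rules=tcp:22 --source-ranges="$source_range" \
    --target-tags="$target_tag" --priority=1000
}

# List ingress rules sorted by priority (lowest number is evaluated first).
list_ingress_rules() {
  gcloud compute firewall-rules list \
    --filter="direction=INGRESS" --sort-by=priority \
    --format="table(name,network.basename(),priority,sourceRanges.list(),targetTags.list())"
}

# Example: allow_ssh_from allow-ssh-iap default 35.235.240.0/20 ssh-enabled
```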

OS Login and IAM

If OS Login is enabled, SSH is controlled via IAM roles like `roles/compute.osLogin` or `roles/compute.osAdminLogin`. Without the right role, users cannot log in even if a firewall rule exists.
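
Granting access might look like the sketch below. The helper names, project ID, and email address are placeholders. If the VM runs as a service account, the user typically also needs `roles/iam.serviceAccountUser` on that account.

```shell
# Grant a user an OS Login role on a project.
# Arguments: project ID, user email, optional role (defaults to non-admin login).
grant_os_login() {
  local project="$1" user_email="$2" role="${3:-roles/compute.osLogin}"
  gcloud projects add-iam-policy-binding "$project" \
    --member="user:${user_email}" --role="$role"
}

# Turn OS Login on for a single VM through instance metadata.
enable_os_login() {
  gcloud compute instances add-metadata "$1" --zone="$2" \
    --metadata=enable-oslogin=TRUE
}

# Example: grant_os_login my-project alice@example.com roles/compute.osAdminLogin
```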

Serial Console Recovery

If SSH is broken, use the serial console to access the VM. From there, repair `sshd` configuration, user accounts, or guest firewalls. This is a common exam answer for "SSH misconfiguration" scenarios.
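
Interactive serial console access is off by default, so a typical session enables it first. A minimal sketch, with a hypothetical helper name:

```shell
# Enable interactive serial console access on a VM, then connect to it.
# Arguments: VM name, zone.
open_serial_console() {
  local vm="$1" zone="$2"
  gcloud compute instances add-metadata "$vm" --zone="$zone" \
    --metadata=serial-port-enable=TRUE &&
  gcloud compute connect-to-serial-port "$vm" --zone="$zone"
}

# Example: open_serial_console prod-api-1 us-central1-a
```

Logging in at the serial prompt requires a local account with a password inside the guest, so this works best when such an account was prepared in advance.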

Thought Exercise: SSH Troubleshooting Flow

Work through this mental playbook. Imagine you are on call and receive an alert: "Cannot SSH into `prod-api-1`." Take 2–3 minutes to think through each question before reading the suggested reasoning.

Question 1: What do you check first?

Try to list the first 3 checks you would do.

Suggested reasoning:

  1. In the Console, confirm `prod-api-1` is in RUNNING state.
  2. Confirm its IP address (internal/external) and that you are using the correct one.
  3. Try Cloud Console SSH (browser-based) versus your local SSH client. If Console SSH works but local does not, suspect local network or firewall.

Question 2: Firewall or IAM?

Suppose browser-based SSH also fails. How do you distinguish a firewall issue from an IAM/OS Login issue?

Suggested reasoning:

  • Check VPC firewall rules for an ingress rule allowing TCP 22 to the VM. If no rule exists or a deny rule has higher priority, fix the firewall.
  • If rules look correct and OS Login is enabled, check that your user has `roles/compute.osLogin` or `roles/compute.osAdminLogin`.

Question 3: SSH daemon misconfiguration

Assume firewall and IAM are correct, but SSH still fails after an `sshd_config` change.

Suggested reasoning:

  • Use serial console to access the VM.
  • Inspect `/var/log/auth.log` or `/var/log/secure` (Linux) for SSH errors.
  • Revert or fix `sshd_config`, restart the SSH service, and test again.
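
Once you are inside the guest (for example over the serial console), the fix-and-restart step can be sketched as below. The helper name is hypothetical, and the service name is an assumption: it is usually `sshd` on RHEL-family systems and `ssh` on Debian or Ubuntu.

```shell
# Validate sshd_config and restart the SSH service only if it parses cleanly.
# Argument: systemd service name (defaults to sshd).
check_and_restart_sshd() {
  local service="${1:-sshd}"
  # "sshd -t" checks the configuration for errors without starting a daemon.
  if sudo sshd -t; then
    sudo systemctl restart "$service"
  else
    echo "sshd_config has errors; fix them before restarting" >&2
    return 1
  fi
}

# Example (Debian/Ubuntu): check_and_restart_sshd ssh
```

Validating before restarting avoids replacing a working daemon with one that refuses to start.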

Question 4: Exam mapping

Which of your actions map directly to exam-style answers?

Key mappings:

  • "Verify firewall rules and tags".
  • "Check IAM roles for OS Login".
  • "Use serial console to repair SSH configuration".
  • "Ensure the VM is in RUNNING state and has a valid IP".

Monitoring Performance and Cost: Cloud Monitoring and Logging

Cloud Monitoring Basics

Cloud Monitoring tracks VM metrics like CPU, disk, and network out of the box, and memory once the Ops Agent is installed. Use Metrics Explorer with resource type GCE VM Instance to visualize utilization and identify bottlenecks.

Alerting on VM Health

Create alerting policies that trigger when metrics cross thresholds, such as CPU > 80% for 5 minutes. Send notifications to email or incident tools to catch issues early.

Cloud Logging and Serial Output

Cloud Logging ingests VM logs and serial port output. Use Logs Explorer to search for errors, and inspect serial logs when a VM fails to boot or shows kernel panics.
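
Serial output is also available directly from the CLI, which is handy when a VM never gets far enough to ship logs. The helper name and the search pattern below are assumptions; adjust the pattern to the failures you expect.

```shell
# Print the last 20 suspicious lines from a VM's serial port output.
# Arguments: VM name, zone.
find_boot_errors() {
  local vm="$1" zone="$2"
  gcloud compute instances get-serial-port-output "$vm" --zone="$zone" \
    2>/dev/null |
    grep -iE "kernel panic|error|fail" |
    tail -n 20
}

# Example: find_boot_errors prod-api-1 us-central1-a
```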

Performance vs Cost

Monitoring data reveals over- or under-provisioning. Use the Google Cloud pricing calculator to estimate cost changes when right-sizing machine types or resizing disks.

Quiz: Lifecycle and Storage Operations

Check your understanding of lifecycle and storage concepts.

You have a VM running a production workload. You want to stop paying for vCPU/RAM but keep the boot disk and be able to restart the VM later with the same data. What should you do?

  1. Delete the VM and its boot disk, then recreate it later from a public image
  2. Stop the VM and ensure the boot disk is not set to auto-delete
  3. Suspend the VM and then detach the boot disk
  4. Reset the VM and then remove its external IP

Answer: 2. Stop the VM and ensure the boot disk is not set to auto-delete

Stopping the VM releases vCPU/RAM billing while preserving the disk. Ensuring the boot disk is not set to auto-delete guarantees the data remains. Deleting the VM and disk loses data; suspend still bills for suspend storage; reset does not change billing or disk state.

Quiz: SSH Troubleshooting and Monitoring

Test yourself on SSH troubleshooting and monitoring.

A user cannot SSH into a Linux VM. The VM is RUNNING, and a firewall rule allows TCP 22 from the user’s IP. OS Login is enabled on the project. What is the most likely next step to restore access?

  1. Assign the user the roles/compute.osLogin IAM role on the project
  2. Disable OS Login on the VM and rely on project-wide SSH keys
  3. Create a snapshot of the boot disk and restore it
  4. Change the VM’s machine type to a larger size

Answer: 1. Assign the user the roles/compute.osLogin IAM role on the project

With OS Login enabled, SSH access is controlled by IAM. Without a role like roles/compute.osLogin or roles/compute.osAdminLogin, the user cannot log in. Disabling OS Login is not usually recommended; snapshots and machine type changes do not address SSH auth.

Key Term Review

Review these core terms and concepts before moving on.

Compute Engine VM lifecycle: Start vs Stop vs Reset vs Delete
**Start** boots a stopped VM and resumes compute billing. **Stop** gracefully shuts down the OS and stops vCPU/RAM billing while disks remain. **Reset** is a hard reboot with no OS shutdown. **Delete** permanently removes the VM; disks may or may not be deleted depending on settings.
Persistent disk snapshot
An incremental, point-in-time backup of a persistent disk stored in Cloud Storage. Used for backup and recovery, and to create new disks in the same or different zones/regions.
Image (Compute Engine)
A bootable disk template containing an OS and optional software. Used to create new VMs and boot disks, ideal for cloning standardized configurations across regions or projects.
Guest environment / guest agent
Scripts and agents inside the VM (for example, google-guest-agent) that integrate the OS with Google Cloud, handling metadata, SSH keys, OS Login, and feeding data to features like OS patch management.
Serial console usage
Out-of-band access to a VM’s serial port. Used to troubleshoot boot issues, kernel panics, and broken SSH by logging in via the console even when network access fails.
Cloud Monitoring role for VMs
Collects metrics such as CPU, memory (with agent), disk, and network for GCE instances. Supports dashboards, metrics explorer, and alerting policies to track performance and availability.
Cloud Logging role for VMs
Centralizes logs from VM agents and serial output. Used to search for errors, debug incidents, and inspect boot-time issues when VMs fail to start correctly.
OS patch management (VM Manager)
A feature that inventories OS versions and schedules patch jobs across VMs. You target instances by labels or other criteria and define when and how patches are applied.
Live migration vs terminate maintenance policy
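With **Migrate**, Google Cloud live-migrates VMs to another host during maintenance with minimal disruption. With **Terminate**, VMs are stopped for the maintenance event, causing downtime; they restart automatically only if automatic restart is enabled, otherwise you must start them manually.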
Network Service Tiers
A Google Cloud networking feature that lets you trade network performance against cost by choosing between the Premium and Standard tiers for a resource's internet-facing traffic.

Key Terms

Image
A bootable disk template containing an OS and optional software, used to create new VMs and boot disks.
OS Login
A feature that manages Linux SSH access to VMs using IAM roles and user identities instead of project-wide SSH keys.
Snapshot
An incremental, point-in-time backup of a persistent disk, used for backup, recovery, and creating new disks.
Local SSD
High-performance, physically attached SSD storage with very low latency but ephemeral data that is lost when the VM is stopped or deleted.
VM Manager
A suite of tools for managing VMs at scale, including OS inventory, OS patch management, and configuration management.
Guest agent
Software running inside a VM that is part of the guest environment, enabling integration with Google Cloud features such as metadata, OS Login, and patch management.
Cloud Logging
Google Cloud’s centralized logging service that stores and lets you query logs from VMs and other resources, including serial port output.
Compute Engine
Google Cloud’s Infrastructure-as-a-Service (IaaS) offering that provides virtual machine instances running on Google’s infrastructure.
Serial console
An interface to a VM’s serial port that provides out-of-band access for troubleshooting boot and SSH issues.
Persistent disk
Durable, network-attached block storage for Compute Engine VMs. It persists independently of VM lifecycle and can be zonal or regional.
Cloud Monitoring
Google Cloud’s monitoring service that collects metrics, supports dashboards, and enables alerting for resources including Compute Engine VMs.
