Chapter 18 of 26

Operating GKE and Serverless: Scaling, Updates, and Reliability

Manage running GKE clusters, Cloud Run services, and Cloud Functions so they scale smoothly and recover gracefully from failures.

Big Picture: Operating GKE and Serverless in Production

From Deploying to Operating

Here you move from "I can deploy things" to "I can keep them healthy". You must operate GKE, Cloud Run, and Cloud Functions so they scale, update safely, and recover from failures.

Three Themes

We focus on: 1) Scaling: HPA, node pools, Cloud Run autoscaling, Cloud Functions throughput. 2) Updates: GKE upgrades, rollouts, revisions. 3) Reliability: SLO-aware ops, error handling, retries.

Three Layers

Visualize: Infrastructure (GKE clusters/node pools), Platform (Kubernetes workloads, Cloud Run, Cloud Functions), and Operations (autoscaling, rollouts, monitoring, error handling).

Exam Orientation

Expect scenario questions that mix layers, like choosing which autoscaling or rollout setting to adjust when a new version causes errors or scaling delays.

GKE Cluster and Node Pool Upgrades

GKE Version Basics

GKE has a control plane version and node pool versions. Release channels (Rapid, Regular, Stable) define how quickly you receive new Kubernetes versions.

Safe Upgrade Flow

Common pattern: use a release channel, test in staging, upgrade control plane, then upgrade node pools. GKE drains nodes and respects PodDisruptionBudgets during node upgrades.

Surge and Availability

Surge upgrades add temporary nodes so pods can move before old nodes are drained. You can tune `--max-surge-upgrade` and `--max-unavailable-upgrade` for each node pool.
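
To make the two surge settings concrete, here is a minimal Python sketch of the arithmetic. The function name and the pool size of 6 are illustrative assumptions, not GKE APIs; the sketch only models node counts, not the drain process itself.

```python
def surge_upgrade_plan(pool_size: int, max_surge: int, max_unavailable: int) -> dict:
    """Summarize how surge settings affect node counts (illustrative model only)."""
    if max_surge < 0 or max_unavailable < 0:
        raise ValueError("surge settings must be non-negative")
    return {
        # Nodes upgraded at the same time is the sum of the two settings.
        "upgraded_in_parallel": max_surge + max_unavailable,
        # Temporary surge nodes raise the peak node count (quota and cost).
        "peak_node_count": pool_size + max_surge,
        # Existing nodes that stay available while others are replaced.
        "min_available_nodes": max(pool_size - max_unavailable, 0),
    }


# Hypothetical 6-node pool with max surge 2 and max unavailable 0.
plan = surge_upgrade_plan(pool_size=6, max_surge=2, max_unavailable=0)
print(plan)
```

With max unavailable set to 0, capacity never drops below the original pool size, at the price of two extra nodes during the upgrade.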

Rollbacks in Practice

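Control plane rollbacks are handled by Google. For node issues, create a new node pool with the previous version and migrate workloads. Remember: upgrading the control plane does not upgrade node pools in the same operation; node pools are upgraded separately, either manually or later by node auto-upgrade.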

Example: Safely Upgrading a GKE Node Pool and Recovering

Step 1–2: Plan the Upgrade

List clusters and node pools, identify critical vs non-critical pools, and ensure PodDisruptionBudgets exist so upgrades do not evict too many pods at once.

Step 3: Upgrade Command

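First set the surge strategy on the pool: `gcloud container node-pools update web-pool --cluster=my-cluster --max-surge-upgrade=2 --max-unavailable-upgrade=0`. Then upgrade the pool: `gcloud container clusters upgrade my-cluster --node-pool=web-pool --cluster-version=1.27`. Add your cluster's location flag to both commands if no default is configured.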

Step 4: Monitor the Upgrade

Use `kubectl get nodes -w` and `kubectl get pods -o wide` plus Monitoring dashboards to confirm pods reschedule and app error rate stays healthy.

Step 5: Recover via New Pool

If issues appear, create a new node pool with the previous version, cordon/drain the bad nodes, and migrate workloads. This is the typical exam-correct rollback pattern.

GKE Workload Autoscaling: HPA, Requests, and Limits

Requests, Limits, and HPA

Requests are guaranteed CPU/memory; limits are maximums. HPA usually targets CPU utilization relative to requests, so missing or wrong requests break scaling.

Basic HPA Setup

Pattern: set realistic requests/limits in your Deployment, then run `kubectl autoscale deployment web --cpu-percent=80 --min=3 --max=15` for CPU-based scaling.
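
The scaling decision behind that command can be sketched in Python. This is a simplified model of the documented HPA formula (desired = ceil(current × currentMetric / targetMetric)) with the default 10% tolerance; the function names and sample numbers are assumptions for illustration, and the real controller also applies stabilization windows and readiness rules that are omitted here.

```python
import math
from typing import Optional


def cpu_utilization(usage_millicores: float, request_millicores: Optional[float]) -> Optional[float]:
    """CPU utilization as a percentage of the request; None if no request is set."""
    if not request_millicores:
        return None  # without a request, utilization cannot be computed
    return usage_millicores / request_millicores * 100


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int, tolerance: float = 0.1) -> int:
    """Simplified HPA calculation, clamped to the min/max replica bounds."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


# Pods using 400m CPU against a 250m request are at 160% utilization.
util = cpu_utilization(usage_millicores=400, request_millicores=250)
print(desired_replicas(current=3, current_util=util, target_util=80,
                       min_replicas=3, max_replicas=15))
```

Note how `cpu_utilization` returns `None` when the request is missing: that is the situation in which CPU-based HPA has nothing to act on.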

Cluster Autoscaler

When new pods cannot be scheduled due to lack of capacity (requests do not fit), cluster autoscaler can add nodes in auto-scaled node pools.

Exam Traps

HPA not scaling? Check requests. Pods Pending? Likely need node autoscaling. OOMKilled pods? Increase memory requests/limits and adjust HPA targets.

Quiz: GKE Autoscaling and Upgrades

Test your understanding of GKE autoscaling and upgrade behavior.

Your GKE Deployment has HPA configured, but under heavy load the number of replicas never increases. CPU usage in logs looks high. Which configuration change is most likely to fix the issue?

  1. Enable cluster autoscaler on the node pool.
  2. Add CPU requests to the container spec in the Deployment.
  3. Increase the HPA maxReplicas value.
  4. Upgrade the cluster control plane to a newer version.

Answer: 2) Add CPU requests to the container spec in the Deployment.

HPA for CPU measures utilization relative to CPU requests. If requests are missing, HPA cannot compute utilization (the current value shows as `<unknown>`), so it will not scale. Adding realistic CPU requests lets HPA see high utilization and increase replicas. Cluster autoscaler only helps once HPA has already added pods that cannot be scheduled, and maxReplicas only matters if HPA is already scaling.

Cloud Run Autoscaling and Concurrency Tuning

Cloud Run Scaling Basics

Cloud Run scales container instances based on HTTP traffic. It can scale to zero when idle and up to many instances when load increases.

Concurrency

Concurrency is how many requests each instance handles at once. Low concurrency improves isolation and latency; high concurrency reduces instance count and cost.

Min and Max Instances

Min instances keep warm capacity to avoid cold starts. Max instances cap total instances to protect backends and control cost during traffic spikes.
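
As a rough way to reason about how concurrency, min instances, and max instances interact, the following Python sketch estimates instance count from in-flight requests (requests per second × average latency). It is a back-of-the-envelope model under stated assumptions, not the Cloud Run autoscaler's actual algorithm, which also considers CPU utilization; the function name and default bounds are hypothetical.

```python
import math


def estimate_instances(rps: float, avg_latency_s: float, concurrency: int,
                       min_instances: int = 0, max_instances: int = 100) -> int:
    """Estimate instances needed for a steady load, clamped to min/max bounds."""
    in_flight = rps * avg_latency_s           # average requests being served at once
    needed = math.ceil(in_flight / concurrency)
    return max(min_instances, min(max_instances, needed))


# 400 requests/s at 0.5 s each means about 200 requests in flight.
print(estimate_instances(400, 0.5, concurrency=80))   # fewer, busier instances
print(estimate_instances(400, 0.5, concurrency=10))   # more, lighter instances
print(estimate_instances(0, 0.5, concurrency=80, min_instances=5))  # warm floor when idle
```

Lowering concurrency raises the instance count (and cost) for the same traffic, while min and max instances set the floor and ceiling.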

Typical Exam Scenarios

High latency? Lower concurrency or raise min instances. Database overload? Lower max instances or concurrency. High cost with low CPU? Increase concurrency.

Example: Tuning a Cloud Run Service Under Load

The Problem

A Cloud Run service with concurrency 80 and no max instances cap floods Cloud SQL with connections during spikes, driving database CPU to 100% and slowing uploads for users.

Step 1–2: Protect Cloud SQL

Lower concurrency to 10 and cap max instances (for example 50) so Cloud Run cannot create thousands of concurrent DB connections.
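
The effect of these two settings on the database can be checked with simple arithmetic. The sketch below assumes, for illustration only, that each in-flight request holds one database connection and that the database allows 1000 connections; connection pooling in your app would change the numbers.

```python
def worst_case_db_connections(max_instances: int, concurrency: int,
                              connections_per_request: int = 1) -> int:
    """Upper bound on simultaneous DB connections under the stated assumption."""
    return max_instances * concurrency * connections_per_request


DB_CONNECTION_LIMIT = 1000  # hypothetical limit, not a Cloud SQL default

tuned = worst_case_db_connections(max_instances=50, concurrency=10)
untuned_at_same_cap = worst_case_db_connections(max_instances=50, concurrency=80)

print(tuned, tuned <= DB_CONNECTION_LIMIT)
print(untuned_at_same_cap, untuned_at_same_cap <= DB_CONNECTION_LIMIT)
```

Without any max instances cap there is no such upper bound at all, which is why the cap matters as much as the concurrency value.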

Step 3: Reduce Cold Start Latency

Set min instances (for example 5) so some instances are always warm, reducing latency at the beginning of traffic spikes.

Exam Pattern

Database overload? Lower concurrency or max instances. Cold-start latency? Raise min instances. High cost with idle CPU? Increase concurrency.

Cloud Functions: Retries, Idempotency, and Error Handling

Cloud Functions and Retries

Event-driven functions (Pub/Sub, Storage) can retry failed invocations with backoff when retries are enabled on the trigger; with retries disabled, a failed event is dropped. HTTP functions do not auto-retry; the client decides whether to retry.
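
Because HTTP functions leave retries to the caller, a client typically wraps the call in its own retry loop. The following is a minimal Python sketch of exponential backoff; the helper name, attempt count, and delays are illustrative assumptions, and production code should retry only on errors known to be transient.

```python
import time


def call_with_retries(operation, max_attempts=4, base_delay_s=0.5, sleep=time.sleep):
    """Call operation(), retrying failures with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:  # sketch only: real code should catch retryable errors
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** (attempt - 1))  # 0.5 s, 1 s, 2 s, ...


# Demo with a stand-in operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


delays = []  # record delays instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.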

Idempotency Defined

An operation is idempotent if running it multiple times has the same effect as once. With retries, your function must handle duplicate events safely.

Designing Idempotent Handlers

Use unique event IDs, store processed IDs, favor upserts, and design writes so repeating them does not corrupt data or double-charge users.
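
A minimal sketch of this pattern in Python follows. The in-memory set and list stand in for a durable store such as a database table, and the handler name and event fields are hypothetical; in a real system the "check and record" step should be a single atomic operation (for example a transaction) so two concurrent deliveries cannot both pass the check.

```python
processed_ids = set()  # stand-in for a durable store of handled event IDs
charges = []           # stand-in for the side effect (charging a customer)


def handle_event(event_id: str, customer: str, amount_cents: int) -> bool:
    """Apply the charge once per event ID; return False for duplicates."""
    if event_id in processed_ids:
        return False  # duplicate delivery: skip the side effect
    charges.append((customer, amount_cents))
    processed_ids.add(event_id)
    return True


# The same event delivered twice results in a single charge.
first = handle_event("evt-1", "alice", 500)
second = handle_event("evt-1", "alice", 500)
print(first, second, charges)
```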

Error Handling Patterns

Use dead-letter topics for repeatedly failing messages, log structured errors, and do not swallow all exceptions or you might block useful retries.
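
The dead-letter idea can be illustrated with a small in-memory simulation in Python. In Pub/Sub itself, dead-lettering is configured on the subscription rather than written by hand; the function below is a hypothetical model of the behavior, with a list standing in for the dead-letter topic.

```python
def deliver(message, handler, max_delivery_attempts, dead_letter):
    """Try the handler up to max_delivery_attempts times, then dead-letter the message.

    Returns the attempt number that succeeded, or None if the message was dead-lettered.
    """
    last_error = None
    for attempt in range(1, max_delivery_attempts + 1):
        try:
            handler(message)
            return attempt
        except Exception as exc:  # the error propagates as a failed delivery
            last_error = exc
    # Keep the message and a structured error for later inspection.
    dead_letter.append({
        "message": message,
        "error": str(last_error),
        "attempts": max_delivery_attempts,
    })
    return None


def always_fails(message):
    raise ValueError("bad payload")


dead_letter_topic = []
outcome = deliver({"order": 42}, always_fails, max_delivery_attempts=5,
                  dead_letter=dead_letter_topic)
print(outcome, dead_letter_topic)
```

Note that the handler lets the exception propagate; if it swallowed the error, the delivery would look successful and the message would never be retried or dead-lettered.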

Quiz: Cloud Run vs Cloud Functions Behavior

Check your understanding of Cloud Run and Cloud Functions operational behavior.

You have a Pub/Sub-triggered Cloud Function that charges a customer when an event arrives. Occasionally, the function times out and Pub/Sub retries the message, causing duplicate charges. What is the best fix?

  1. Disable retries on the Cloud Function trigger.
  2. Increase the function timeout so it never fails.
  3. Make the charge operation idempotent by tracking processed event IDs in a database.
  4. Switch from Cloud Functions to Cloud Run.

Answer: 3) Make the charge operation idempotent by tracking processed event IDs in a database.

The correct operational pattern is to design idempotent handlers. Track event IDs in a database and skip processing if an ID was already handled. Disabling retries risks data loss, increasing timeout does not remove the possibility of failure, and switching to Cloud Run does not solve the core issue.

Rollouts, Rollbacks, and SLO-Aware Operations

Safe Rollouts in GKE

Deployments use rolling updates. `maxUnavailable` and `maxSurge` control speed vs availability. Use `kubectl rollout status` and `kubectl rollout undo` to monitor and roll back.
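
To see how these two settings bound a rollout, here is a small Python sketch of the arithmetic. It follows the Kubernetes rule that a percentage `maxUnavailable` rounds down and a percentage `maxSurge` rounds up; the function name and replica counts are assumptions for illustration.

```python
def rollout_bounds(replicas: int, max_unavailable, max_surge) -> dict:
    """Compute the pod-count bounds a rolling update stays within."""

    def resolve(value, round_up: bool) -> int:
        # Accept an absolute number (2) or a percentage string ("25%").
        if isinstance(value, str) and value.endswith("%"):
            scaled = int(value[:-1]) * replicas
            return -(-scaled // 100) if round_up else scaled // 100
        return int(value)

    unavailable = resolve(max_unavailable, round_up=False)
    surge = resolve(max_surge, round_up=True)
    return {
        "min_available": replicas - unavailable,  # pods guaranteed to keep serving
        "max_total": replicas + surge,            # old + new pods at the peak
    }


print(rollout_bounds(10, "25%", "25%"))  # 25% defaults on a 10-replica Deployment
print(rollout_bounds(4, 0, 1))           # conservative: never drop below 4 pods
```

Setting `maxUnavailable` to 0 with a small `maxSurge` favors availability over speed, which suits latency-sensitive services.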

Cloud Run Revisions

Each deploy creates a revision. You can split traffic (for example 90/10) to canary test. To roll back, route 100% of traffic to a stable revision.
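
The traffic split and rollback described above can be modeled in a few lines of Python. This sketch only represents the percentages as a dictionary; the revision names `web-v1` and `web-v2` and both helper functions are hypothetical and do not call any Cloud Run API.

```python
def expected_requests(split: dict, total_requests: int) -> dict:
    """Approximate how many requests each revision receives for a given split."""
    if sum(split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    return {rev: total_requests * pct // 100 for rev, pct in split.items()}


def rollback_split(split: dict, stable_revision: str) -> dict:
    """Route all traffic to the stable revision, leaving others at 0%."""
    if stable_revision not in split:
        raise KeyError(stable_revision)
    return {rev: (100 if rev == stable_revision else 0) for rev in split}


canary = {"web-v1": 90, "web-v2": 10}  # 90/10 canary on the new revision
print(expected_requests(canary, 1000))
print(rollback_split(canary, "web-v1"))
```

Because revisions are immutable, the rollback is only a routing change: the previous revision is still there, ready to take 100% of traffic.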

SLO-Aware Decisions

Define SLOs like latency and error-rate targets. During rollouts, monitor metrics. If SLOs are breached, pause or roll back instead of only scaling up.
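
One way to turn that guidance into a repeatable check is a small decision function. The Python sketch below is an illustrative assumption, not a Google Cloud feature: the thresholds, the 80% warning margin, and the three outcomes are hypothetical choices you would adapt to your own SLOs.

```python
def rollout_decision(error_rate: float, p95_latency_ms: float,
                     slo_error_rate: float, slo_p95_ms: float,
                     warning_fraction: float = 0.8) -> str:
    """Return 'rollback', 'pause', or 'continue' for the current rollout step."""
    # Any SLO breach: stop exposing users to the new version.
    if error_rate > slo_error_rate or p95_latency_ms > slo_p95_ms:
        return "rollback"
    # Close to a limit: hold the rollout and investigate before going further.
    if (error_rate >= warning_fraction * slo_error_rate
            or p95_latency_ms >= warning_fraction * slo_p95_ms):
        return "pause"
    return "continue"


# Hypothetical SLO: error rate at most 1%, p95 latency at most 500 ms.
print(rollout_decision(0.02, 300, slo_error_rate=0.01, slo_p95_ms=500))
print(rollout_decision(0.009, 300, slo_error_rate=0.01, slo_p95_ms=500))
print(rollout_decision(0.001, 200, slo_error_rate=0.01, slo_p95_ms=500))
```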

Exam Angle

New version causing errors? The preferred answer is often rollback or traffic shift, not just increasing resources. Canary-style rollouts are favored for critical apps.

Thought Exercise: Choosing the Right Scaling and Rollout Strategy

Work through these scenarios mentally and pick an approach. You do not need to write code; focus on reasoning like an Associate Cloud Engineer.

Scenario 1: GKE API backend

  • You run a GKE Deployment that serves mobile app traffic.
  • Symptoms: During daily traffic spikes, p95 latency increases sharply and some requests time out. CPU on nodes is around 40%; pods show 95% CPU utilization relative to requests.

Questions to consider:

  1. Would you adjust HPA targets, resource requests, or node pool size first? Why?
  2. How might you change `maxUnavailable` during deployments to protect latency?

Scenario 2: Cloud Run public API

  • A Cloud Run service backs a public REST API.
  • Symptoms: Users complain of slow responses right after marketing campaigns start, but performance is fine once traffic stabilizes.

Questions:

  1. Which two Cloud Run settings would you tweak to reduce cold start impact?
  2. How would you balance those against cost?

Scenario 3: Pub/Sub Cloud Function processing orders

  • A Pub/Sub-triggered Cloud Function writes orders to Firestore.
  • A bug in a new deployment causes some orders to be written twice.

Questions:

  1. What is your immediate action to stop damage?
  2. What long-term change do you make to the function design?

Pause and answer these in your own words. Then compare:

  • Scenario 1: Likely increase HPA max replicas or lower CPU target (more pods), ensure realistic requests, and maybe enable node autoscaling. During rollouts, keep `maxUnavailable` low.
  • Scenario 2: Increase min instances and possibly lower concurrency. Accept higher baseline cost in exchange for better launch-time latency.
  • Scenario 3: Immediately roll back to the previous version. Long term, implement idempotency based on event IDs and consider a dead-letter topic for problematic messages.

Key Term Review: GKE, Cloud Run, Cloud Functions

Use these flashcards to reinforce core terms and behaviors.

Horizontal Pod Autoscaler (HPA)
A Kubernetes feature that automatically adjusts the number of pod replicas in a replication controller, deployment, or replica set based on observed metrics such as CPU utilization or custom metrics.
Cluster autoscaler (GKE)
A GKE feature that automatically adjusts the size of a node pool based on the scheduling needs of pods. It adds nodes when pods are unschedulable due to lack of resources and removes nodes when they are underutilized.
GKE node pool
A group of nodes within a GKE cluster that all have the same configuration, including machine type and node image. You can upgrade, scale, and configure autoscaling per node pool.
Cloud Run concurrency
The maximum number of concurrent requests that a single Cloud Run instance can handle. Lower values improve isolation and latency; higher values reduce instance count and cost.
Cloud Run revision
An immutable snapshot of a Cloud Run service configuration and container image created with each deployment. Traffic can be routed between revisions for canary or rollback.
Cloud Functions idempotency
The property of a Cloud Function's handler that ensures processing the same event multiple times has the same effect as processing it once, which is crucial because event deliveries may be retried.
Dead-letter topic (Pub/Sub)
A Pub/Sub topic used to store messages that could not be successfully processed after a configured number of delivery attempts, allowing later inspection and manual handling.
Rolling update (Kubernetes Deployment)
The default update strategy for Deployments where pods are gradually replaced with new ones, controlled by maxUnavailable and maxSurge to balance availability and rollout speed.
Min instances (Cloud Run)
A configuration that keeps a minimum number of Cloud Run instances warm at all times to reduce cold start latency, at the cost of some baseline resource usage.
Max instances (Cloud Run)
A configuration that caps the maximum number of Cloud Run instances that can be created, protecting downstream systems and controlling costs during traffic spikes.

Key Terms

Idempotency
A property of an operation where performing it multiple times has the same effect as performing it once, which is critical for safely handling retries in event-driven systems.
Release channel (GKE)
A configuration that determines the Kubernetes version track and upgrade cadence for a GKE cluster, such as Rapid, Regular, or Stable.
Service level objective (SLO)
A target level of reliability or performance for a service, such as a specific error rate or latency percentile, used to guide operational decisions like rollouts and rollbacks.
