Chapter 20 of 26
Monitoring with Cloud Monitoring: Metrics, Dashboards, and Alerting
Detect issues before your users do by building dashboards and alerts with Cloud Monitoring for compute, storage, and application workloads.
Cloud Monitoring Overview and Workspaces
What is Cloud Monitoring?
Cloud Monitoring is part of the Google Cloud Operations Suite. It collects metrics, logs, and events so you can see how your infrastructure and apps are behaving in real time.
Workspaces: The Big Picture
A Cloud Monitoring workspace is a container for dashboards, alerting policies, uptime checks, and SLOs. It has a host project and can monitor multiple Google Cloud projects.
Workspace Patterns
Typical patterns: small setups use one project per workspace; larger orgs use a central monitoring project whose workspace monitors several prod, staging, and dev projects.
Access Control and the Exam
IAM on the host project controls who can view or edit monitoring configs. On the exam, centralize visibility by creating a workspace in a central project and adding the other projects as monitored projects.
Metrics Fundamentals: Types, Labels, and Resources
What is a Metric?
A metric is a numeric measurement over time, like CPU utilization or request latency. Cloud Monitoring stores metrics as time series: timestamp–value pairs plus metadata.
Metric Types and Resources
Each metric has a metric type, such as `compute.googleapis.com/instance/cpu/utilization`, and a monitored resource type like `gce_instance` or `cloud_run_revision`.
Labels Add Detail
Labels are key-value pairs like `instance_id`, `zone`, or `response_code` that let you filter and group metrics, for example by zone or HTTP status code.
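To make filtering and grouping by label concrete, here is a minimal sketch in plain Python. It does not call Cloud Monitoring; the sample values and the helper names `filter_by_label` and `group_mean_by_label` are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample points: (labels, CPU utilization as a 0-1 fraction).
# Real data would come from Cloud Monitoring, not a hard-coded list.
SAMPLES = [
    ({"instance_id": "vm-1", "zone": "us-central1-a"}, 0.25),
    ({"instance_id": "vm-2", "zone": "us-central1-a"}, 0.75),
    ({"instance_id": "vm-3", "zone": "us-central1-b"}, 0.90),
]


def filter_by_label(samples, key, value):
    """Keep only the samples whose label `key` equals `value`."""
    return [(labels, v) for labels, v in samples if labels.get(key) == value]


def group_mean_by_label(samples, key):
    """Group samples by the label `key` and average the values in each group."""
    groups = defaultdict(list)
    for labels, v in samples:
        groups[labels[key]].append(v)
    return {group: mean(values) for group, values in groups.items()}


if __name__ == "__main__":
    print(filter_by_label(SAMPLES, "zone", "us-central1-b"))
    print(group_mean_by_label(SAMPLES, "zone"))
```

Grouping by `zone` collapses three per-instance series into two per-zone values, which is the same idea as choosing a "group by" label in a chart.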
Exam Traps with Metrics
CPU utilization is usually a 0–1 fraction, not a percentage. Many service metrics are cumulative counters; Cloud Monitoring can derive per-second rates or deltas from them.
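Both traps come down to simple arithmetic. The sketch below is plain Python with hypothetical helper names; it assumes the counter never resets between samples, which a real pipeline would need to handle.

```python
def utilization_to_percent(fraction):
    """Convert a 0-1 utilization fraction to a percentage."""
    return fraction * 100.0


def per_second_rates(points):
    """Derive per-second rates from a cumulative counter.

    `points` is a list of (timestamp_seconds, cumulative_value) tuples
    sorted by time. Each rate is the delta divided by the elapsed seconds.
    Assumes the counter does not reset between samples.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((t1, (v1 - v0) / elapsed))
    return rates


if __name__ == "__main__":
    # A threshold of 0.8 on the raw metric corresponds to 80% CPU.
    print(utilization_to_percent(0.5))
    # Cumulative request counts sampled once per minute.
    print(per_second_rates([(0, 100), (60, 160), (120, 280)]))
```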
Exploring Metrics for Compute Engine, GKE, and Cloud Run
Compute Engine CPU Example
In Metrics Explorer, pick resource `VM instance (gce_instance)` and metric `CPU utilization`, then group by `zone` to see if some zones run hotter than others.
GKE Pod Metrics Example
Switch to resource `Kubernetes Container (k8s_container)` and metrics like CPU or memory usage, filter by `cluster_name`, and group by `pod_name` to spot hot pods.
Cloud Run Latency Example
Use resource `cloud_run_revision` and `Request latency` metric. Look at p95 latency, filtered by `service_name`, to see if users experience slow responses.
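If p95 is unfamiliar, the following sketch shows what the number means using the nearest-rank method on raw samples. This is an illustration only: Cloud Monitoring stores latency as a distribution and estimates percentiles from histogram buckets, so its values will not match a raw-sample calculation exactly. The function name is hypothetical.

```python
import math


def percentile(values, pct):
    """Return the nearest-rank percentile of `values` (pct in (0, 100])."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest value such that at least pct% of the
    # samples are less than or equal to it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds.
    latencies_ms = [120, 95, 110, 105, 2300, 130, 100, 115, 98, 125]
    print("p50:", percentile(latencies_ms, 50))
    print("p95:", percentile(latencies_ms, 95))
```

A single slow request barely moves the median but dominates the p95, which is why tail percentiles are the better signal for "some users see slow responses".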
Why This Matters for the Exam
Scenarios often ask which resource or metric to use. Map Compute Engine to `gce_instance`, GKE to `k8s_*`, and Cloud Run to `cloud_run_revision` to choose correctly.
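One way to internalize the mapping is to see how a metric type and a resource type combine into a Monitoring filter string. The sketch below only builds the string; the dictionary keys and the `build_filter` helper are hypothetical names, and `k8s_container` stands in for the wider `k8s_*` family.

```python
# Hypothetical lookup from a workload kind to its monitored resource type.
RESOURCE_TYPES = {
    "compute_engine": "gce_instance",
    "gke_container": "k8s_container",
    "cloud_run": "cloud_run_revision",
}


def build_filter(metric_type, resource_type, resource_labels=None):
    """Build a Monitoring filter string from a metric type, a resource
    type, and optional resource label constraints."""
    parts = [
        f'metric.type = "{metric_type}"',
        f'resource.type = "{resource_type}"',
    ]
    # Sort labels so the output is stable regardless of dict ordering.
    for key, value in sorted((resource_labels or {}).items()):
        parts.append(f'resource.labels.{key} = "{value}"')
    return " AND ".join(parts)


if __name__ == "__main__":
    print(
        build_filter(
            "compute.googleapis.com/instance/cpu/utilization",
            RESOURCE_TYPES["compute_engine"],
            {"zone": "us-central1-a"},
        )
    )
```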
Dashboards and Charts: Building a Single Pane of Glass
What is a Dashboard?
A dashboard is a collection of charts and widgets that visualize metrics for a specific purpose, like VM health, GKE clusters, or Cloud Run services.
Steps to Create a Dashboard
Go to Monitoring → Dashboards → Create dashboard, name it, then add charts by selecting metrics, filters, aggregations, and a visualization type.
Designing Effective Dashboards
Highlight key KPIs: CPU, memory, traffic, errors, and latency. Use scorecards for top-level SLO indicators and group related charts on the same row.
Exam Angle: Access vs Control
Dashboards are read-only views. To give a team visibility without control, assign Monitoring Viewer roles and share dashboards, not editor roles on workloads.
Thought Exercise: Design a Dashboard for an E-commerce App
Imagine you operate a small e-commerce platform running on Google Cloud:
- Frontend: Cloud Run service `store-frontend` in region `us-central1`.
- Backend: GKE cluster `prod-gke` running a `cart-service` and `orders-service`.
- Database: Cloud SQL for PostgreSQL.
You want a single dashboard for on-call engineers to quickly see if users can browse, add to cart, and place orders.
Your task: On a sheet of paper or in a note, sketch what widgets you would add. Use the prompts below.
- Cloud Run `store-frontend`
- Which 3 metrics would you chart to detect user-facing issues?
- How would you group or filter them (by service, revision, region)?
- GKE `cart-service` and `orders-service`
- Which metrics would reveal if these services are overloaded or failing?
- Would you group by `namespace`, `pod_name`, or `response_code`?
- Cloud SQL
- Which basic metrics would tell you if the database is the bottleneck (e.g., CPU, connections, latency)?
- Summary row
- What 2–3 scorecards would you add at the top as “red/green” indicators for management?
After you sketch, compare against this checklist:
- Do you cover traffic (request count/QPS)?
- Do you cover latency (p95 or p99)?
- Do you cover errors (4xx/5xx rates)?
- Do you cover resource saturation (CPU, memory, DB connections)?
Use this exercise to practice thinking like an on-call engineer. The exam often describes these kinds of scenarios and asks what to monitor or where to look first.
Uptime Checks and SLOs: From Basic Health to Reliability Targets
What is an Uptime Check?
An uptime check is an external probe that regularly hits your HTTP(S) or TCP endpoint to verify it is reachable and returning expected responses.
Configuring Uptime Checks
In Monitoring, define the protocol, URL or IP/port, frequency, timeout, and optional content match string. Later, attach alerting policies to this check.
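The settings above boil down to a pass/fail decision for each probe. The following is a simplified sketch of that decision in plain Python, not how the service is implemented: the function name, the default timeout, and the "any 2xx status passes" rule are assumptions made for illustration.

```python
def check_passes(status_code, body, elapsed_seconds,
                 timeout_seconds=10.0, content_match=None):
    """Decide whether a single probe result counts as a pass.

    Assumed rules for this sketch: the response must arrive within the
    timeout, return a 2xx status, and contain the optional match string.
    """
    if elapsed_seconds > timeout_seconds:
        return False
    if not 200 <= status_code < 300:
        return False
    if content_match is not None and content_match not in body:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical probe results for a health endpoint.
    print(check_passes(200, '{"status": "ok"}', 0.3, content_match="ok"))
    print(check_passes(503, "Service Unavailable", 0.2))
    print(check_passes(200, '{"status": "ok"}', 12.0))
```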
What is an SLO?
A Service Level Objective is a reliability target like 99.9% success or p95 latency under 400 ms over a given window, built from metrics called SLIs.
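The arithmetic behind an SLO target is worth working through once. This sketch uses hypothetical helper names and treats the error budget as 1 minus the SLO target; the request counts are invented.

```python
def error_budget_fraction(slo_target):
    """Fraction of requests (or time) allowed to be bad, e.g. about 0.001 for 0.999."""
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target, window_days):
    """Minutes of unavailability the budget allows over the window."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, total_requests, failed_requests):
    """Failed requests still allowed before the SLO is violated.

    A negative result means the budget is already spent.
    """
    allowed = error_budget_fraction(slo_target) * total_requests
    return allowed - failed_requests


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43 minutes of downtime.
    print(round(allowed_bad_minutes(0.999, 30), 1))
    # Hypothetical traffic: 2,000,000 requests so far, 1,500 of them failed.
    print(round(budget_remaining(0.999, 2_000_000, 1_500)))
```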
Exam Focus: Uptime vs SLOs
Uptime checks answer “is it up right now?”; SLOs answer “has it been reliable enough over time?” Remember this distinction in scenario questions.
Alerting Policies and Notification Channels
Alerting Policy Basics
An alerting policy combines conditions on metrics or uptime checks with notification channels and documentation to inform responders when something is wrong.
Example: High CPU Alert
Create a policy on `compute.googleapis.com/instance/cpu/utilization`, trigger when mean CPU > 0.8 for 5 minutes, and send email or Pub/Sub notifications.
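As a rough picture of how that policy looks as data, the sketch below builds a dictionary shaped like an alerting policy in the Cloud Monitoring API's JSON form and prints it. Treat it as an approximation to check against the API reference: the display names, the 60-second alignment period, and the channel resource name are placeholders chosen for this example, and nothing here is sent to Google Cloud.

```python
import json


def high_cpu_policy(notification_channels):
    """Return a dict shaped like an alerting policy for mean CPU > 0.8 for 5 minutes."""
    return {
        "displayName": "High CPU utilization",  # placeholder name
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "Mean CPU above 0.8 for 5 minutes",
                "conditionThreshold": {
                    "filter": (
                        'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
                        'AND resource.type = "gce_instance"'
                    ),
                    # Average each series over 60-second windows (assumed period).
                    "aggregations": [
                        {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN"}
                    ],
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": 0.8,  # a 0-1 fraction, not a percentage
                    "duration": "300s",  # must stay above threshold for 5 minutes
                },
            }
        ],
        "notificationChannels": list(notification_channels),
    }


if __name__ == "__main__":
    # Placeholder channel resource name; a real one comes from your project.
    channels = ["projects/YOUR_PROJECT_ID/notificationChannels/CHANNEL_ID"]
    print(json.dumps(high_cpu_policy(channels), indent=2))
```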
Notification Channels
Configure email, SMS, mobile app, webhooks, or Pub/Sub under Monitoring → Alerting → Notification channels, then attach them to policies.
Reducing Alert Noise
Align and aggregate metrics, require a minimum duration above threshold, and alert on user-impacting metrics like error rate or latency, not just CPU spikes.
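The effect of a minimum duration is easy to see in a small simulation. This sketch assumes one aligned value per minute and uses a hypothetical helper name; it is an illustration of the idea, not the service's evaluation logic.

```python
def sustained_breach(values, threshold, required_points):
    """Return True if `required_points` consecutive values exceed `threshold`.

    `values` is assumed to hold one aligned data point per minute, so
    required_points=5 means "above threshold for 5 minutes".
    """
    streak = 0
    for value in values:
        streak = streak + 1 if value > threshold else 0
        if streak >= required_points:
            return True
    return False


if __name__ == "__main__":
    brief_spike = [0.50, 0.95, 0.50, 0.60, 0.50]
    sustained_load = [0.85, 0.90, 0.88, 0.92, 0.86]
    # The one-minute spike does not fire; five minutes of high CPU does.
    print(sustained_breach(brief_spike, 0.8, 5))
    print(sustained_breach(sustained_load, 0.8, 5))
```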
Quiz: Uptime Checks vs Metric Alerts
Test your understanding of uptime checks and alerting policies.
You run a public API on Cloud Run. You want to be alerted when the API becomes unreachable from the internet, regardless of CPU or memory usage. What is the MOST appropriate configuration in Cloud Monitoring?
- Create a metric-based alert on Cloud Run CPU utilization above 90% for 5 minutes.
- Create an uptime check targeting the API URL and an alerting policy that triggers when the uptime check fails.
- Create an SLO on request latency and alert when p95 latency exceeds 500 ms for 10 minutes.
- Create a dashboard chart for request count and manually watch it during business hours.
Answer: B) Create an uptime check targeting the API URL and an alerting policy that triggers when the uptime check fails.
An uptime check is an external probe that verifies reachability. Pairing an uptime check with an alerting policy directly addresses internet reachability. CPU-based alerts or latency SLOs may not fire if the service is down at the network level, and manual dashboards are not reliable for timely detection.
Custom Metrics and Integration with GKE and Cloud Run
What are Custom Metrics?
Custom metrics are user-defined metrics you send via the Cloud Monitoring API, such as `custom.googleapis.com/orders/created_count` for business KPIs.
Custom Metrics with GKE
GKE workloads can expose Prometheus metrics. The GKE–Monitoring integration ingests them as custom metrics tied to `k8s_*` resource types with labels like `namespace_name`.
Custom Metrics with Cloud Run
Cloud Run services can use client libraries to write custom metrics, often using the `cloud_run_revision` resource so you can filter by service or revision.
Exam Scenarios
If you must alert on a business KPI (like orders per minute), create a custom metric and an alert. Be wary of high-cardinality labels that create too many time series.
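To see why high-cardinality labels are a problem, note that each distinct combination of label values becomes its own time series. The sketch below computes the worst case, where every combination actually occurs; the label names and counts are invented for illustration.

```python
from math import prod


def estimated_time_series(label_value_counts):
    """Upper bound on time series for one metric: the product of the
    number of distinct values per label."""
    return prod(label_value_counts.values())


if __name__ == "__main__":
    # Hypothetical label sets for an orders metric.
    low_cardinality = {"env": 3, "region": 4}
    high_cardinality = {"env": 3, "region": 4, "user_id": 10_000}
    print(estimated_time_series(low_cardinality))
    print(estimated_time_series(high_cardinality))
```

Adding a per-user label multiplies the series count by the number of users, so identifiers like user or request IDs are better kept in logs than in metric labels.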
Code Example: Writing a Custom Metric (Python)
This simplified Python example shows how a Cloud Run service or GKE pod might write a custom counter metric to Cloud Monitoring. In practice, you would handle authentication using a service account attached to the workload.
A service account is a special kind of account used by an application or compute workload, not a person, to make authorized API calls and access Google Cloud resources.
```python
import time

from google.cloud import monitoring_v3

project_id = "YOUR_PROJECT_ID"
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"

# Define the custom metric type
metric_type = "custom.googleapis.com/orders/created_count"
series = monitoring_v3.TimeSeries()
series.metric.type = metric_type
series.resource.type = "global"  # or "cloud_run_revision", "k8s_container", etc.

# Optional labels
series.metric.labels["env"] = "prod"

# Set the data point value (e.g., 1 order created)
now = time.time()
seconds = int(now)
nanos = int((now - seconds) * 10**9)
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": seconds, "nanos": nanos}}
)

# For a counter, use INT64
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 1}})
series.points = [point]

# Write the time series data
client.create_time_series(name=project_name, time_series=[series])
print("Wrote 1 data point to", metric_type)
```
On the exam, you will not write code, but you should recognize that:
- Custom metrics are pushed via the Monitoring API.
- Workloads authenticate using a service account.
- Once written, custom metrics appear in Metrics Explorer under the `custom.googleapis.com` namespace.
Quiz: Choosing the Right Monitoring Approach
Check your understanding of metrics, dashboards, and alerts in a realistic scenario.
Your GKE-based payments service sometimes returns HTTP 500 errors during traffic spikes, but CPU and memory look normal. You want to detect this quickly and alert the on-call engineer. What is the BEST approach using Cloud Monitoring?
- Create a dashboard showing node CPU and memory usage and ask on-call engineers to watch it during peak hours.
- Create an uptime check against the service endpoint and alert if it fails from any region.
- Create a metric-based alert on the HTTP 5xx error rate metric for the service, with a threshold over a short window.
- Create a custom metric that counts successful payments and alert when the count is zero over 24 hours.
Answer: C) Create a metric-based alert on the HTTP 5xx error rate metric for the service, with a threshold over a short window.
The issue is HTTP 500 errors despite normal CPU and memory. The correct signal is the 5xx error rate metric for the service. An alert on error rate catches user-visible failures quickly. Uptime checks only detect total unreachability, and watching dashboards manually is unreliable. A 24-hour zero-success alert is far too slow.
Troubleshooting Scenario: Slow Cloud Run Service
Work through this mental troubleshooting exercise to practice using Cloud Monitoring data.
Scenario
A Cloud Run service `image-resizer` is used by a mobile app. Users report that image uploads sometimes take 10–15 seconds. Autoscaling seems to work, and there are no obvious errors.
Using only Cloud Monitoring (no code changes), think through these questions:
- Which metrics do you check first?
- Hint: Consider request latency distributions (p95, p99), request count, and concurrency metrics for `cloud_run_revision`.
- How do you confirm if the issue is global or regional?
- Hint: Check labels like `location` or use filters by region in Metrics Explorer.
- How do you distinguish between app slowness and downstream dependency slowness (like Cloud Storage or a database)?
- Hint: Look at metrics for those dependencies (for example, Cloud Storage latency, Cloud SQL CPU/connection usage) on their own dashboards.
- What alerting policy could you create to catch this faster next time?
- Hint: Consider an alert on p95 latency above a threshold (for example, 2 seconds) for a few minutes, filtered to this service.
Write down your answers, then compare to this reference checklist:
- Checked Cloud Run request latency (p95/p99) and request count.
- Compared latency across regions or revisions using labels.
- Checked metrics for downstream services to see if they spike at the same time.
- Designed an alert on latency, not just CPU or error rate.
Being able to reason through these steps is exactly what the exam tests in scenario questions.
Key Term Review: Cloud Monitoring Essentials
Use these flashcards to reinforce the core concepts before moving on.
- Cloud Monitoring workspace
- A logical container in a host project that stores dashboards, alerting policies, uptime checks, SLOs, and metrics from one or more monitored projects.
- Metric type
- The identifier for what is being measured, such as `compute.googleapis.com/instance/cpu/utilization` or `run.googleapis.com/request_count`.
- Monitored resource type
- The kind of resource a metric is attached to, like `gce_instance`, `k8s_container`, or `cloud_run_revision`.
- Uptime check
- An external probe that regularly tests whether an endpoint (HTTP(S) or TCP) is reachable and responding correctly, often used to drive availability alerts.
- Service Level Objective (SLO)
- A target level of reliability or performance over time (for example, 99.9% success rate over 30 days), built from service-level indicators like error rate or latency.
- Alerting policy
- A configuration in Cloud Monitoring that defines conditions on metrics, uptime checks, or SLOs and sends notifications through channels like email or Pub/Sub when triggered.
- Notification channel
- A destination for alerts, such as email, SMS, mobile app notifications, webhooks, or Pub/Sub topics, configured in the Monitoring alerting settings.
- Custom metric
- A user-defined metric written via the Cloud Monitoring API, typically under the `custom.googleapis.com` namespace, used for app-specific or business KPIs.
- Service account
- A service account is a special kind of account used by an application or compute workload, not a person, to make authorized API calls and access Google Cloud resources.
Key Terms
- Label
- A key-value pair attached to metrics or resources, used to filter and group data (for example, zone, instance_id, response_code).
- Metric
- A numeric measurement over time, stored as a time series in Cloud Monitoring, such as CPU utilization, request count, or latency.
- Scorecard
- A dashboard widget in Cloud Monitoring that displays a single aggregated metric value, often used for high-level KPIs or SLO indicators.
- Workspace
- A Cloud Monitoring construct in a host project that organizes dashboards, alerting policies, uptime checks, and SLOs, and can monitor multiple Google Cloud projects.
- Time series
- A sequence of metric data points, each with a timestamp, value, and associated labels and monitored resource metadata.
- Error budget
- The amount of unreliability a service is allowed within an SLO period, computed as 1 minus the SLO target (for example, 0.1% for a 99.9% SLO).
- Uptime check
- An external Cloud Monitoring probe that periodically tests whether an endpoint is reachable and responding as expected.
- Custom metric
- A user-defined metric created and written via the Cloud Monitoring API, often used for application-specific or business-level measurements.
- Alerting policy
- A Cloud Monitoring configuration that evaluates metric, uptime, or SLO conditions and sends notifications when thresholds are breached.
- Service account
- A service account is a special kind of account used by an application or compute workload, not a person, to make authorized API calls and access Google Cloud resources.
- Cloud Monitoring
- A Google Cloud Operations Suite service that collects metrics, logs, and events from Google Cloud, AWS, and applications, and provides dashboards, alerting, uptime checks, and SLOs.
- Metrics Explorer
- A Cloud Monitoring tool in the console that lets you browse, filter, and visualize metrics, and experiment with aggregations and groupings.
- Notification channel
- A configured destination for alerts, such as email, SMS, mobile notifications, webhooks, or Pub/Sub topics.
- Service Level Objective (SLO)
- A defined target for service reliability or performance, such as availability or latency, over a specified time window.