Monitoring with Cloud Monitoring: Metrics, Dashboards, and Alerting — Google Cloud Associate Cloud Engineer: Complete Exam-Ready Masterclass

Cloud Monitoring Overview and Workspaces

What is Cloud Monitoring?

Cloud Monitoring is part of the Google Cloud Operations Suite. It collects metrics, logs, and events so you can see how your infrastructure and apps are behaving in real time.

Workspaces: The Big Picture

A Cloud Monitoring workspace is a container for dashboards, alerting policies, uptime checks, and SLOs. It has a host project and can monitor multiple Google Cloud projects.

Workspace Patterns

Typical patterns: small setups use one project per workspace; larger orgs use a central monitoring project whose workspace monitors several prod, staging, and dev projects.

Access Control and the Exam

IAM on the host project controls who can view or edit monitoring configs. On the exam, the way to centralize visibility is to create a workspace in a central project and add the other projects as monitored projects.

Metrics Fundamentals: Types, Labels, and Resources

What is a Metric?

A metric is a numeric measurement over time, like CPU utilization or request latency. Cloud Monitoring stores metrics as time series: timestamp–value pairs plus metadata.

Metric Types and Resources

Each metric has a metric type, such as `compute.googleapis.com/instance/cpu/utilization`, and a monitored resource type like `gce_instance` or `cloud_run_revision`.

Labels Add Detail

Labels are key-value pairs like `instance_id`, `zone`, or `response_code` that let you filter and group metrics, for example by zone or HTTP status code.
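As a rough illustration of how labels drive grouping, the sketch below models a few time series as plain Python dictionaries and averages their latest values by one label. The sample values and the `group_mean_by_label` helper are hypothetical; Cloud Monitoring performs this kind of aggregation for you when you pick a "group by" label.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical snapshot: the latest CPU utilization value for three VMs.
series = [
    {"labels": {"instance_id": "vm-1", "zone": "us-central1-a"}, "value": 0.25},
    {"labels": {"instance_id": "vm-2", "zone": "us-central1-a"}, "value": 0.75},
    {"labels": {"instance_id": "vm-3", "zone": "us-central1-b"}, "value": 0.9},
]


def group_mean_by_label(series, label):
    """Average the values of all series that share the same label value."""
    buckets = defaultdict(list)
    for ts in series:
        buckets[ts["labels"][label]].append(ts["value"])
    return {key: mean(values) for key, values in buckets.items()}


by_zone = group_mean_by_label(series, "zone")
print(by_zone)
```

Grouping by `zone` collapses three series into two, which is the same effect a "group by zone" setting has on a chart.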

Exam Traps with Metrics

CPU utilization is usually a 0–1 fraction, not a percentage. Many service metrics are cumulative counters; Cloud Monitoring can derive per-second rates or deltas from them.
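Both traps come down to simple arithmetic. The sketch below is illustrative only, with made-up sample values and hypothetical helper names: it converts a 0-1 utilization fraction to a percentage and derives per-second rates from a cumulative counter.

```python
def to_percent(fraction):
    """Convert a 0-1 utilization fraction to a percentage."""
    return fraction * 100


def per_second_rates(points):
    """Turn cumulative (timestamp_seconds, count) samples into per-second rates."""
    rates = []
    for (t0, c0), (t1, c1) in zip(points, points[1:]):
        rates.append((c1 - c0) / (t1 - t0))
    return rates


# Hypothetical cumulative request counter sampled once a minute.
samples = [(0, 1000), (60, 1600), (120, 2500)]

print(to_percent(0.5))            # 50.0
print(per_second_rates(samples))  # [10.0, 15.0]
```

The practical consequence is that a threshold meant as "80% CPU" is written as 0.8, not 80.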

Exploring Metrics for Compute Engine, GKE, and Cloud Run

Compute Engine CPU Example

In Metrics Explorer, pick resource `VM instance (gce_instance)` and metric `CPU utilization`, then group by `zone` to see if some zones run hotter than others.

GKE Pod Metrics Example

Switch to resource `Kubernetes Container (k8s_container)` and metrics like CPU or memory usage, filter by `cluster_name`, and group by `pod_name` to spot hot pods.

Cloud Run Latency Example

Use resource `cloud_run_revision` and the `Request latency` metric. Look at p95 latency, filtered by `service_name`, to see if users experience slow responses.
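To see why p95 is a better signal than the average, here is a minimal sketch using a nearest-rank percentile over hypothetical latency samples. Cloud Monitoring estimates percentiles from distribution buckets rather than raw samples, so treat this as an illustration of the idea, not of the service's exact calculation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds: mostly fast, a few slow.
latencies_ms = [120] * 18 + [900, 4000]

print("mean:", sum(latencies_ms) / len(latencies_ms))  # 353.0
print("p50:", percentile(latencies_ms, 50))             # 120
print("p95:", percentile(latencies_ms, 95))             # 900
```

The median says most users are fine, while p95 exposes the slow tail that the affected users actually experience.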

Why This Matters for the Exam

Scenarios often ask which resource or metric to use. Map Compute Engine to `gce_instance`, GKE to `k8s_*`, and Cloud Run to `cloud_run_revision` to choose correctly.
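One way to internalize the mapping is to write it down as data and compose a Monitoring filter string from it. The `build_filter` helper below is hypothetical, and the zone value is only an example; the filter keys (`metric.type`, `resource.type`, `resource.labels.*`) follow the Monitoring filter syntax as I understand it.

```python
# Monitored resource types named in this lesson.
RESOURCE_TYPES = {
    "Compute Engine": "gce_instance",
    "GKE": "k8s_container",
    "Cloud Run": "cloud_run_revision",
}


def build_filter(metric_type, resource_type, resource_labels=None):
    """Compose a Monitoring filter string from a metric type, resource type, and labels."""
    clauses = [
        f'metric.type = "{metric_type}"',
        f'resource.type = "{resource_type}"',
    ]
    for key, value in (resource_labels or {}).items():
        clauses.append(f'resource.labels.{key} = "{value}"')
    return " AND ".join(clauses)


cpu_filter = build_filter(
    "compute.googleapis.com/instance/cpu/utilization",
    RESOURCE_TYPES["Compute Engine"],
    {"zone": "us-central1-a"},
)
print(cpu_filter)
```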

Dashboards and Charts: Building a Single Pane of Glass

What is a Dashboard?

A dashboard is a collection of charts and widgets that visualize metrics for a specific purpose, like VM health, GKE clusters, or Cloud Run services.

Steps to Create a Dashboard

Go to Monitoring → Dashboards → Create dashboard, name it, then add charts by selecting metrics, filters, aggregations, and a visualization type.

Designing Effective Dashboards

Highlight key KPIs: CPU, memory, traffic, errors, and latency. Use scorecards for top-level SLO indicators and group related charts on the same row.

Exam Angle: Access vs Control

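Dashboards only display data; viewing one grants no control over the workloads it charts. To give a team visibility without control, grant the Monitoring Viewer role (`roles/monitoring.viewer`) and share dashboards rather than granting editor roles on workloads.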

Thought Exercise: Design a Dashboard for an E-commerce App

Imagine you operate a small e-commerce platform running on Google Cloud:

Frontend: Cloud Run service `store-frontend` in region `us-central1`.
Backend: GKE cluster `prod-gke` running a `cart-service` and `orders-service`.
Database: Cloud SQL for PostgreSQL.

You want a single dashboard for on-call engineers to quickly see if users can browse, add to cart, and place orders.

Your task: On a sheet of paper or in a note, sketch what widgets you would add. Use the prompts below.

Cloud Run `store-frontend`

Which 3 metrics would you chart to detect user-facing issues?
How would you group or filter them (by service, revision, region)?

GKE `cart-service` and `orders-service`

Which metrics would reveal if these services are overloaded or failing?
Would you group by `namespace`, `pod_name`, or `response_code`?

Cloud SQL

Which basic metrics would tell you if the database is the bottleneck (e.g., CPU, connections, latency)?

Summary row

What 2–3 scorecards would you add at the top as “red/green” indicators for management?

After you sketch, compare against this checklist:

Do you cover traffic (request count/QPS)?
Do you cover latency (p95 or p99)?
Do you cover errors (4xx/5xx rates)?
Do you cover resource saturation (CPU, memory, DB connections)?

Use this exercise to practice thinking like an on-call engineer. The exam often describes these kinds of scenarios and asks what to monitor or where to look first.

Uptime Checks and SLOs: From Basic Health to Reliability Targets

What is an Uptime Check?

An uptime check is an external probe that regularly hits your HTTP(S) or TCP endpoint to verify it is reachable and returning expected responses.

Configuring Uptime Checks

In Monitoring, define the protocol, URL or IP/port, frequency, timeout, and optional content match string. Later, attach alerting policies to this check.
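The settings above map onto a simple pass/fail decision for each probe. The sketch below is a simplified model, not the service's actual logic: `probe_passes` is a hypothetical helper, the 10-second default timeout is an assumption, and real uptime checks offer richer options such as configurable acceptable status codes and other matcher types.

```python
def probe_passes(status_code, body, elapsed_s, timeout_s=10.0, content_match=None):
    """Return True when a single probe result counts as healthy."""
    if elapsed_s > timeout_s:
        return False  # response arrived too late
    if not 200 <= status_code < 300:
        return False  # non-success HTTP status
    if content_match is not None and content_match not in body:
        return False  # expected string missing from the response body
    return True


print(probe_passes(200, "status: ok", 0.3, content_match="ok"))     # True
print(probe_passes(200, "maintenance", 0.3, content_match="ok"))    # False
print(probe_passes(503, "", 0.3))                                   # False
print(probe_passes(200, "status: ok", 12.0))                        # False
```

The content match is what catches an endpoint that returns HTTP 200 while serving an error or maintenance page.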

What is an SLO?

A Service Level Objective is a reliability target like 99.9% success or p95 latency under 400 ms over a given window, built from metrics called SLIs.
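An SLO target implies an error budget: the share of the window that is allowed to fail. The following sketch works through that arithmetic with hypothetical helper names and made-up request counts.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of full unavailability allowed by an availability SLO."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of a request-based error budget that is still unspent."""
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# 99.9% over 30 days leaves roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999, 30), 1))          # 43.2

# 250 failures out of 1,000,000 requests spends a quarter of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # 0.75
```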

Exam Focus: Uptime vs SLOs

Uptime checks answer “is it up right now?”; SLOs answer “has it been reliable enough over time?” Remember this distinction in scenario questions.

Alerting Policies and Notification Channels

Alerting Policy Basics

An alerting policy combines conditions on metrics or uptime checks with notification channels and documentation to inform responders when something is wrong.

Example: High CPU Alert

Create a policy on `compute.googleapis.com/instance/cpu/utilization`, trigger when mean CPU > 0.8 for 5 minutes, and send email or Pub/Sub notifications.
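Expressed as configuration, the same policy might look like the dictionary below. This is a sketch under assumptions: the field names follow the AlertPolicy REST resource as I understand it, the display names are invented, and the notification channel path is a placeholder rather than a real resource.

```python
import json

# Assumed mapping of "mean CPU > 0.8 for 5 minutes" onto AlertPolicy fields.
cpu_alert_policy = {
    "displayName": "High CPU on VM instances",
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "Mean CPU utilization above 0.8 for 5 minutes",
            "conditionThreshold": {
                "filter": (
                    'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
                    'AND resource.type = "gce_instance"'
                ),
                # Align each series to one-minute means before comparing.
                "aggregations": [
                    {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN"}
                ],
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.8,  # 0-1 fraction, i.e. 80%
                "duration": "300s",     # condition must hold for 5 minutes
            },
        }
    ],
    # Placeholder channel path; replace with a channel that exists in your project.
    "notificationChannels": [
        "projects/YOUR_PROJECT_ID/notificationChannels/CHANNEL_ID"
    ],
}

print(json.dumps(cpu_alert_policy, indent=2))
```

Keeping the policy as data makes the three moving parts easy to spot: the metric filter, the threshold with its duration, and the channels to notify.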

Notification Channels

Configure email, SMS, mobile app, webhooks, or Pub/Sub under Monitoring → Alerting → Notification channels, then attach them to policies.

Reducing Alert Noise

Align and aggregate metrics, require a minimum duration above threshold, and alert on user-impacting metrics like error rate or latency, not just CPU spikes.
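The effect of a minimum duration can be sketched in a few lines. The `should_fire` helper and the sample values below are hypothetical and only model the idea that every aligned value in the window must breach the threshold before the alert fires.

```python
def should_fire(aligned_values, threshold, min_consecutive):
    """True if the last `min_consecutive` aligned values all exceed the threshold."""
    if len(aligned_values) < min_consecutive:
        return False
    return all(v > threshold for v in aligned_values[-min_consecutive:])


# Hypothetical one-minute mean CPU values (0-1 fractions).
brief_spike = [0.4, 0.95, 0.5, 0.45, 0.4]
sustained = [0.5, 0.85, 0.9, 0.88, 0.92, 0.86]

print(should_fire(brief_spike, 0.8, 5))  # False: one hot minute is ignored
print(should_fire(sustained, 0.8, 5))    # True: five minutes above 0.8
```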

Quiz: Uptime Checks vs Metric Alerts

Test your understanding of uptime checks and alerting policies.

You run a public API on Cloud Run. You want to be alerted when the API becomes unreachable from the internet, regardless of CPU or memory usage. What is the MOST appropriate configuration in Cloud Monitoring?

A) Create a metric-based alert on Cloud Run CPU utilization above 90% for 5 minutes.
B) Create an uptime check targeting the API URL and an alerting policy that triggers when the uptime check fails.
C) Create an SLO on request latency and alert when p95 latency exceeds 500 ms for 10 minutes.
D) Create a dashboard chart for request count and manually watch it during business hours.

Answer: B) Create an uptime check targeting the API URL and an alerting policy that triggers when the uptime check fails.

An uptime check is an external probe that verifies reachability. Pairing an uptime check with an alerting policy directly addresses internet reachability. CPU-based alerts or latency SLOs may not fire if the service is down at the network level, and manual dashboards are not reliable for timely detection.

Custom Metrics and Integration with GKE and Cloud Run

What are Custom Metrics?

Custom metrics are user-defined metrics you send via the Cloud Monitoring API, such as `custom.googleapis.com/orders/created_count` for business KPIs.

Custom Metrics with GKE

GKE workloads can expose Prometheus metrics. The GKE–Monitoring integration ingests them as custom metrics tied to `k8s_*` resource types with labels like `namespace_name`.

Custom Metrics with Cloud Run

Cloud Run services can use client libraries to write custom metrics, often using the `cloud_run_revision` resource so you can filter by service or revision.

Exam Scenarios

If you must alert on a business KPI (like orders per minute), create a custom metric and an alert. Be wary of high-cardinality labels that create too many time series.
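The cardinality warning is easy to quantify: in the worst case, the number of time series is the product of the distinct values of each label. The label names and counts below are hypothetical.

```python
from math import prod


def max_time_series(label_cardinalities):
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_cardinalities.values())


low = {"env": 3, "region": 5}
high = {"env": 3, "region": 5, "user_id": 100_000}

print(max_time_series(low))   # 15
print(max_time_series(high))  # 1500000
```

Adding a single per-user label turns 15 possible series into 1.5 million, which is why identifiers such as user or request IDs are poor label choices.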

Code Example: Writing a Custom Metric (Python)

This simplified Python example shows how a Cloud Run service or GKE pod might write a custom counter metric to Cloud Monitoring. In practice, you would handle authentication using a service account attached to the workload.

A service account is a special kind of account used by an application or compute workload, not a person, to make authorized API calls and access Google Cloud resources.

```python
import time

from google.cloud import monitoring_v3

project_id = "YOUR_PROJECT_ID"
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"

# Define the custom metric type
metric_type = "custom.googleapis.com/orders/created_count"

series = monitoring_v3.TimeSeries()
series.metric.type = metric_type
series.resource.type = "global"  # or "cloud_run_revision", "k8s_container", etc.

# Optional labels
series.metric.labels["env"] = "prod"

# Set the data point value (e.g., 1 order created)
now = time.time()
seconds = int(now)
nanos = int((now - seconds) * 10**9)
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": seconds, "nanos": nanos}}
)

# For a counter, use INT64
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 1}})
series.points = [point]

# Write the time series data
client.create_time_series(name=project_name, time_series=[series])
print("Wrote 1 data point to", metric_type)
```

On the exam, you will not write code, but you should recognize that:

Custom metrics are pushed via the Monitoring API.
Workloads authenticate using a service account.
Once written, custom metrics appear in Metrics Explorer under the `custom.googleapis.com` namespace.

Quiz: Choosing the Right Monitoring Approach

Check your understanding of metrics, dashboards, and alerts in a realistic scenario.

Your GKE-based payments service sometimes returns HTTP 500 errors during traffic spikes, but CPU and memory look normal. You want to detect this quickly and alert the on-call engineer. What is the BEST approach using Cloud Monitoring?

A) Create a dashboard showing node CPU and memory usage and ask on-call engineers to watch it during peak hours.
B) Create an uptime check against the service endpoint and alert if it fails from any region.
C) Create a metric-based alert on the HTTP 5xx error rate metric for the service, with a threshold over a short window.
D) Create a custom metric that counts successful payments and alert when the count is zero over 24 hours.

Answer: C) Create a metric-based alert on the HTTP 5xx error rate metric for the service, with a threshold over a short window.

The issue is HTTP 500 errors despite normal CPU and memory, so the correct signal is the 5xx error rate metric for the service. An alert on error rate catches user-visible failures quickly. Uptime checks probe one endpoint at fixed intervals, so they can easily miss intermittent errors that only some requests hit during spikes, and watching dashboards manually is unreliable. A 24-hour zero-success alert is far too slow.
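To make the chosen signal concrete, the sketch below computes a 5xx error rate from request counts grouped by response code. The counts, the 2% threshold, and the `error_rate_5xx` helper are hypothetical.

```python
def error_rate_5xx(counts_by_code):
    """Share of requests whose status code is in the 5xx range."""
    total = sum(counts_by_code.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in counts_by_code.items() if 500 <= code <= 599)
    return errors / total


# Hypothetical request counts for a one-minute window during a traffic spike.
window = {200: 940, 404: 10, 500: 45, 503: 5}

rate = error_rate_5xx(window)
print(rate)         # 0.05
print(rate > 0.02)  # True: a 2% threshold would fire
```

A ratio stays meaningful as traffic grows or shrinks, whereas a raw count of 5xx responses would need a different threshold at every traffic level.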

Troubleshooting Scenario: Slow Cloud Run Service

Work through this mental troubleshooting exercise to practice using Cloud Monitoring data.

Scenario

A Cloud Run service `image-resizer` is used by a mobile app. Users report that image uploads sometimes take 10–15 seconds. Autoscaling seems to work, and there are no obvious errors.

Using only Cloud Monitoring (no code changes), think through these questions:

Which metrics do you check first?

Hint: Consider request latency distributions (p95, p99), request count, and concurrency metrics for `cloud_run_revision`.

How do you confirm if the issue is global or regional?

Hint: Check labels like `location` or use filters by region in Metrics Explorer.

How do you distinguish between app slowness and downstream dependency slowness (like Cloud Storage or a database)?

Hint: Look at metrics for those dependencies (for example, Cloud Storage latency, Cloud SQL CPU/connection usage) on their own dashboards.

What alerting policy could you create to catch this faster next time?

Hint: Consider an alert on p95 latency above a threshold (for example, 2 seconds) for a few minutes, filtered to this service.

Write down your answers, then compare to this reference checklist:

Checked Cloud Run request latency (p95/p99) and request count.
Compared latency across regions or revisions using labels.
Checked metrics for downstream services to see if they spike at the same time.
Designed an alert on latency, not just CPU or error rate.

Being able to reason through these steps is exactly what the exam tests in scenario questions.

Key Term Review: Cloud Monitoring Essentials

Use these flashcards to reinforce the core concepts before moving on.

Cloud Monitoring workspace: A logical container in a host project that stores dashboards, alerting policies, uptime checks, SLOs, and metrics from one or more monitored projects.
Metric type: The identifier for what is being measured, such as `compute.googleapis.com/instance/cpu/utilization` or `run.googleapis.com/request_count`.
Monitored resource type: The kind of resource a metric is attached to, like `gce_instance`, `k8s_container`, or `cloud_run_revision`.
Uptime check: An external probe that regularly tests whether an endpoint (HTTP(S) or TCP) is reachable and responding correctly, often used to drive availability alerts.
Service Level Objective (SLO): A target level of reliability or performance over time (for example, 99.9% success rate over 30 days), built from service-level indicators like error rate or latency.
Alerting policy: A configuration in Cloud Monitoring that defines conditions on metrics, uptime checks, or SLOs and sends notifications through channels like email or Pub/Sub when triggered.
Notification channel: A destination for alerts, such as email, SMS, mobile app notifications, webhooks, or Pub/Sub topics, configured in the Monitoring alerting settings.
Custom metric: A user-defined metric written via the Cloud Monitoring API, typically under the `custom.googleapis.com` namespace, used for app-specific or business KPIs.
Service account: A service account is a special kind of account used by an application or compute workload, not a person, to make authorized API calls and access Google Cloud resources.

Key Terms

Label: A key-value pair attached to metrics or resources, used to filter and group data (for example, zone, instance_id, response_code).
Metric: A numeric measurement over time, stored as a time series in Cloud Monitoring, such as CPU utilization, request count, or latency.
Scorecard: A dashboard widget in Cloud Monitoring that displays a single aggregated metric value, often used for high-level KPIs or SLO indicators.
Workspace: A Cloud Monitoring construct in a host project that organizes dashboards, alerting policies, uptime checks, and SLOs, and can monitor multiple Google Cloud projects.
Time series: A sequence of metric data points, each with a timestamp, value, and associated labels and monitored resource metadata.
Error budget: The amount of unreliability a service is allowed within an SLO period, computed as 1 minus the SLO target (for example, 0.1% for a 99.9% SLO).
Uptime check: An external Cloud Monitoring probe that periodically tests whether an endpoint is reachable and responding as expected.
Custom metric: A user-defined metric created and written via the Cloud Monitoring API, often used for application-specific or business-level measurements.
Alerting policy: A Cloud Monitoring configuration that evaluates metric, uptime, or SLO conditions and sends notifications when thresholds are breached.
Service account: A service account is a special kind of account used by an application or compute workload, not a person, to make authorized API calls and access Google Cloud resources.
Cloud Monitoring: A Google Cloud Operations Suite service that collects metrics, logs, and events from Google Cloud, AWS, and applications, and provides dashboards, alerting, uptime checks, and SLOs.
Metrics Explorer: A Cloud Monitoring tool in the console that lets you browse, filter, and visualize metrics, and experiment with aggregations and groupings.
Notification channel: A configured destination for alerts, such as email, SMS, mobile notifications, webhooks, or Pub/Sub topics.
Service Level Objective (SLO): A defined target for service reliability or performance, such as availability or latency, over a specified time window.
