Skarp

Chapter 20 of 27

Cloud Logging and Cloud Monitoring: Observability Foundations

Visibility is essential for reliable operations; learn how to capture logs and metrics, build dashboards, and configure alerts across your Google Cloud estate.

Observability Foundations on Google Cloud

Why Observability Matters

Observability means having enough visibility into your systems to know what is happening, why it is happening, and how to fix or improve it in production.

Core Google Cloud Tools

Google Cloud’s observability stack centers on two services: Cloud Logging for logs, and Cloud Monitoring for metrics, dashboards, uptime checks, and alerting.

Logs vs Metrics

Logs are detailed, timestamped records of events. Metrics are numeric measurements over time, such as CPU utilization, request latency, or error rate.

Link to Previous Modules

App Engine, Cloud Functions, Cloud Storage, Cloud SQL, and BigQuery all emit logs and metrics that feed into Cloud Logging and Cloud Monitoring.

What You Will Be Able To Do

You will learn to navigate logs, identify metrics, create log-based metrics, build dashboards and uptime checks, and configure alerts for operators.

Cloud Logging: Concepts, Structure, and Navigation

What Is Cloud Logging?

Cloud Logging is Google Cloud’s central logging service. Managed services write logs to it automatically when APIs and default agents or integrations are enabled.

Anatomy of a Log Entry

Each log entry has a timestamp, logName, resource, severity, and a payload (textPayload or jsonPayload) that contains the actual message or structured data.
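As a rough illustration of the fields listed above, a log entry can be pictured as a nested record. The sketch below uses a plain Python dictionary; the project ID, function name, region, and message are made-up values, and `payload_kind` is a hypothetical helper, not part of any Google library.

```python
import json

# Hypothetical log entry shaped like the fields named above.
# Every value here is invented for illustration.
entry = {
    "timestamp": "2024-01-15T10:32:07.123Z",
    "logName": "projects/my-project/logs/cloudfunctions.googleapis.com%2Fcloud-functions",
    "resource": {
        "type": "cloud_function",
        "labels": {"function_name": "image-processor", "region": "us-central1"},
    },
    "severity": "ERROR",
    "jsonPayload": {"message": "thumbnail generation failed", "latency_ms": 412},
}


def payload_kind(log_entry):
    """Return the name of the payload field the entry carries, or None."""
    for key in ("textPayload", "jsonPayload", "protoPayload"):
        if key in log_entry:
            return key
    return None


print(payload_kind(entry))
print(json.dumps(entry["resource"], indent=2))
```

Structured payloads (jsonPayload) are usually easier to filter on than free text, because individual fields such as `latency_ms` can be referenced directly in a query.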

Buckets, Views, Retention

Logs are stored in log buckets, which have retention settings. Log views restrict which logs in a bucket are visible to different teams or tools.

Using Logs Explorer

In the console, go to Logging > Logs Explorer, choose project and time range, then filter by resource type and use the query builder to refine results.

Typical Filters

Common filters include resource.type for the service, logName for the log, and severity>=ERROR to focus on failures and critical events.
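To make the meaning of severity>=ERROR concrete, here is a minimal local sketch of how such a filter selects entries. It assumes the standard severity ordering used by Cloud Logging; the sample entries and the `severity_at_least` helper are hypothetical and only imitate the filter in memory.

```python
# Severity levels in increasing order of urgency.
SEVERITIES = [
    "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
    "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
]


def severity_at_least(entry_severity, minimum):
    """True if entry_severity ranks at or above minimum."""
    return SEVERITIES.index(entry_severity) >= SEVERITIES.index(minimum)


# Hypothetical entries, reduced to the two fields the filter inspects.
entries = [
    {"resource": {"type": "gce_instance"}, "severity": "INFO"},
    {"resource": {"type": "gce_instance"}, "severity": "CRITICAL"},
    {"resource": {"type": "cloud_function"}, "severity": "ERROR"},
    {"resource": {"type": "gce_instance"}, "severity": "ERROR"},
]

# Local equivalent of: resource.type="gce_instance" severity>=ERROR
matching = [
    e for e in entries
    if e["resource"]["type"] == "gce_instance"
    and severity_at_least(e["severity"], "ERROR")
]
print(len(matching))  # 2
```

Because the comparison is ordinal, severity>=ERROR also returns CRITICAL, ALERT, and EMERGENCY entries, not only those tagged ERROR.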

Hands-on: Finding Errors for a Cloud Function

Scenario: Failing Cloud Function

You deployed a Cloud Function called image-processor. Users say some uploads fail, and you need to find recent errors in Cloud Logging.

Navigating to Logs Explorer

Open Logging > Logs Explorer, select the correct project, and set the time range to Last 1 hour to keep the search focused on recent issues.

Filtering by Resource and Function

Choose resource type Cloud Function and function name image-processor, then switch to the Query tab to refine the filter further.

Adding Severity Filter

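Use a query like resource.type="cloud_function" resource.labels.function_name="image-processor" severity>=ERROR to focus on failures. Note the underscores in cloud_function and function_name; without them the filter matches nothing.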
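If you assemble filters like this one in scripts, a small helper keeps the quoting and the underscore-separated field names (cloud_function, function_name) consistent. This is a sketch that only builds the filter string; `build_filter` is a hypothetical helper and does not call any Google Cloud API.

```python
def build_filter(resource_type, min_severity=None, **resource_labels):
    """Assemble a Logging query string from a resource type, labels, and severity."""
    clauses = [f'resource.type="{resource_type}"']
    for name, value in resource_labels.items():
        clauses.append(f'resource.labels.{name}="{value}"')
    if min_severity:
        clauses.append(f"severity>={min_severity}")
    # Clauses are combined with an explicit AND for readability.
    return " AND ".join(clauses)


query = build_filter(
    "cloud_function",
    min_severity="ERROR",
    function_name="image-processor",
)
print(query)
```

The resulting string can be pasted into the Query tab of Logs Explorer to check that it returns the entries you expect.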

Inspecting and Grouping Logs

Run the query, expand error entries to see details and stack traces, and use Group by to see which part of the function fails most often.

Cloud Monitoring: Metrics, Workspaces, and Metric Types

What Is Cloud Monitoring?

Cloud Monitoring is the core monitoring service for Google Cloud. It collects metrics and metadata, and lets you build dashboards, uptime checks, and alerts.

Metrics and Monitored Resources

A metric is time-series data like CPU utilization. Each metric is tied to a monitored resource, such as a gce_instance or cloud_function.

Metric Types: Gauge

Gauge metrics represent the current value at a point in time, such as CPU utilization or memory usage on a VM or container.

Metric Types: Cumulative and Delta

Cumulative metrics increase over time and reset occasionally, like total bytes sent. Delta metrics capture the change since the last measurement.
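The relationship between the two kinds can be sketched in a few lines: given successive cumulative samples, the deltas are the differences between neighbours, with a reset handled as a fresh start. The sample values and the `cumulative_to_deltas` name are invented for illustration, and treating any drop as a reset is a simplifying assumption.

```python
def cumulative_to_deltas(samples):
    """Convert cumulative samples into per-interval deltas.

    A drop in value is treated as a counter reset, so the new value
    is taken as the amount accumulated since the reset.
    """
    deltas = []
    for previous, current in zip(samples, samples[1:]):
        if current < previous:
            deltas.append(current)  # counter restarted from zero
        else:
            deltas.append(current - previous)
    return deltas


# Hypothetical "total bytes sent" readings; the counter resets after 400.
cumulative_bytes = [100, 250, 400, 30, 90]
print(cumulative_to_deltas(cumulative_bytes))  # [150, 150, 30, 60]
```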

Monitoring Workspaces

Monitoring provides cross-project visibility through a metrics scope (formerly called a Workspace), which is hosted by a scoping project. A metrics scope can include metrics from multiple projects.

Log-based Metrics: Turning Events into Numbers

Why Log-based Metrics?

Use log-based metrics when you want to alert or chart based on specific log patterns, such as HTTP 500 errors or custom application error messages.

What Is a Log-based Metric?

A log-based metric is a metric derived from log entries that match a filter. Logging evaluates logs against the filter and updates the metric over time.

Metric Types: Counter and Distribution

Counter metrics count matching log entries. Distribution metrics capture numeric values from logs, such as latency extracted from a JSON field.
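The difference between the two types can be imitated locally. In the sketch below, a counter tallies entries that match a predicate, while a distribution sorts an extracted numeric field into buckets. The entries, the `latency_ms` field, and the bucket bounds are all assumptions chosen for the example; Cloud Logging performs the equivalent work server-side.

```python
from collections import Counter


def counter_metric(entries, matches):
    """Count the entries for which matches(entry) is true."""
    return sum(1 for e in entries if matches(e))


def distribution_metric(entries, matches, extract, bounds):
    """Bucket the extracted value of each matching entry by upper bound."""
    buckets = Counter()
    for e in entries:
        if not matches(e):
            continue
        value = extract(e)
        # First bound the value falls under, else the overflow bucket.
        label = next((f"<{b}" for b in bounds if value < b), f">={bounds[-1]}")
        buckets[label] += 1
    return dict(buckets)


# Hypothetical structured log entries.
entries = [
    {"severity": "ERROR", "jsonPayload": {"latency_ms": 120}},
    {"severity": "INFO", "jsonPayload": {"latency_ms": 40}},
    {"severity": "ERROR", "jsonPayload": {"latency_ms": 950}},
    {"severity": "INFO", "jsonPayload": {"latency_ms": 310}},
]

error_count = counter_metric(entries, lambda e: e["severity"] == "ERROR")
latency_buckets = distribution_metric(
    entries,
    matches=lambda e: True,
    extract=lambda e: e["jsonPayload"]["latency_ms"],
    bounds=[100, 500],
)
print(error_count)      # 2
print(latency_buckets)
```

A counter answers "how many times did this happen?", while a distribution answers "how were the values spread?", which is what percentile charts need.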

Steps to Create One

In Logs Explorer, build a query, click Create metric, choose counter or distribution, name it, optionally add labels, and save the metric.

Where It Appears

The new metric shows up in Cloud Monitoring as a user-defined metric under logging.googleapis.com/user/, usable in dashboards and alerts.

Example: Log-based Metric and Alert for App Engine Errors

Goal: Alert on App Engine 500s

You run an App Engine API and want an alert when it returns many HTTP 500 errors in a short period, indicating a serious outage.

Building the Log Filter

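In Logs Explorer, use a query like resource.type="gae_app" severity>=ERROR protoPayload.status=500 and verify it returns recent failing requests. App Engine standard request logs record the HTTP status in protoPayload.status; if your application writes its own structured logs, the status may instead live in a jsonPayload field.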

Creating the Counter Metric

Click Create metric, choose Counter, name it gae_500_count, and optionally add a label for the App Engine service from resource.labels.module_id.

Alerting on the Metric

In Monitoring > Alerting, create a policy with a metric condition on logging/user/gae_500_count (full metric type logging.googleapis.com/user/gae_500_count), such as more than 10 errors per minute.

Notifications in Action

When errors spike, Logging updates the metric, Monitoring evaluates the condition, and operators receive notifications via configured channels.
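The chain described above (count matching entries, then compare the count against a threshold per time window) can be sketched locally. This example assumes the "more than 10 errors per minute" condition from this section; the timestamps are generated test data and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

THRESHOLD_PER_MINUTE = 10  # from the example condition in this section


def errors_per_minute(timestamps):
    """Group ISO 8601 timestamps into one-minute buckets and count them."""
    buckets = Counter()
    for ts in timestamps:
        minute = datetime.fromisoformat(ts).replace(second=0, microsecond=0)
        buckets[minute] += 1
    return buckets


def minutes_in_violation(timestamps, threshold=THRESHOLD_PER_MINUTE):
    """Return the minutes whose error count exceeds the threshold."""
    counts = errors_per_minute(timestamps)
    return sorted(minute for minute, n in counts.items() if n > threshold)


# Hypothetical data: 12 errors at 10:00 (a spike), 3 errors at 10:01.
spike = [f"2024-01-15T10:00:{s:02d}" for s in range(12)]
quiet = [f"2024-01-15T10:01:{s:02d}" for s in range(3)]

for minute in minutes_in_violation(spike + quiet):
    print("would notify for", minute.isoformat())
```

Only the 10:00 bucket crosses the threshold, so only that minute would trigger a notification in this simplified model.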

Dashboards and Uptime Checks in Cloud Monitoring

What Are Dashboards?

Dashboards in Cloud Monitoring are collections of widgets like charts and scorecards that visualize metrics and status for your resources.

Building a Custom Dashboard

In Monitoring > Dashboards, create a dashboard, add a line chart, select a metric like instance CPU utilization, filter as needed, and save it.

What Are Uptime Checks?

Uptime checks are synthetic probes that periodically test if an endpoint, such as an HTTP URL, is reachable and responding correctly.

Creating an HTTP Uptime Check

In Monitoring > Uptime checks, create a check, choose HTTP or HTTPS, enter the URL, select frequency and regions, and optionally link an alert.
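Conceptually, each run of an HTTP uptime check is a request followed by a pass/fail decision. The sketch below shows that idea with the Python standard library. It is a local illustration, not how Cloud Monitoring implements its probes; the `probe` function, the 2xx success rule, and the example URL are assumptions.

```python
import urllib.error
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with a 2xx status within the timeout.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP errors.
        return False


if __name__ == "__main__":
    # Hypothetical health endpoint; replace with a URL you control.
    print(probe("https://example.com/"))
```

A managed uptime check adds what this sketch lacks: scheduled runs, probes from several regions, and a metric that alerting policies can evaluate.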

Why They Matter for the Exam

Expect scenarios where you must monitor application health. Dashboards show trends, while uptime checks detect and alert on outages.

Alerting Policies: Thresholds, Conditions, and Notifications

What Is an Alerting Policy?

An alerting policy links metrics or uptime checks to notifications. It defines when to alert and who should be notified about issues.

Conditions and Thresholds

Conditions specify rules like CPU utilization > 80% for 5 minutes or an uptime check failing for multiple consecutive runs.
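A condition such as "CPU utilization > 80% for 5 minutes" fires only when the breach is sustained, not on a single spike. The following sketch models that with one sample per minute; the sample values and the `breaches_for_duration` name are hypothetical.

```python
def breaches_for_duration(samples, threshold, duration):
    """True if `duration` consecutive samples all exceed `threshold`."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= duration:
            return True
    return False


# Hypothetical per-minute CPU utilization (fraction of 1.0).
sustained = [0.85, 0.90, 0.70, 0.82, 0.88, 0.91, 0.95, 0.83]
brief_spike = [0.85, 0.90, 0.95, 0.70, 0.90]

print(breaches_for_duration(sustained, threshold=0.80, duration=5))    # True
print(breaches_for_duration(brief_spike, threshold=0.80, duration=5))  # False
```

The same shape covers the uptime-check case: treat each failed run as a breach and set `duration` to the number of consecutive failures you want before alerting.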

Notification Channels

Notification channels are destinations such as email, SMS, Slack, mobile app, webhooks, or Pub/Sub, configured in Monitoring.

Creating a Basic Alert

In Monitoring > Alerting, create a policy, add a metric condition, set thresholds and windows, choose notification channels, and save it.

Common Exam Traps

Do not confuse uptime checks with alerts: checks probe endpoints, but alerting policies send notifications based on check results or metrics.

Thought Exercise: Designing an Observability Setup

Imagine you are the Associate Cloud Engineer for a small startup running this architecture on Google Cloud:

  • Frontend: React app served from Cloud Storage behind a Cloud CDN-enabled HTTPS load balancer.
  • Backend API: Cloud Run service.
  • Background processing: Cloud Functions triggered by Pub/Sub messages.
  • Database: Cloud SQL for PostgreSQL.

Your task: sketch an observability setup using Cloud Logging and Cloud Monitoring.

Reflect on these prompts and, if possible, write your answers down:

  1. Logs to focus on
  • Which log types would you review regularly for each component? (Think: load balancer logs, Cloud Run logs, Cloud Functions logs, Cloud SQL audit logs.)
  • Which ones are most critical for security or compliance vs. performance troubleshooting?
  2. Key metrics to chart on dashboards
  • For Cloud Run and Cloud Functions: what metrics show performance and reliability? (Hint: request count, latency, error rate, instance count.)
  • For Cloud SQL: what metrics indicate stress? (Hint: CPU utilization, connections, disk usage.)
  3. Log-based metrics you might create
  • Identify at least one log pattern per service that you would turn into a log-based counter metric (for example, specific error codes or “payment failed” messages).
  4. Uptime checks and alerts
  • Which public endpoints would you probe with uptime checks?
  • For each metric or log-based metric, what is a reasonable threshold for an alert (e.g., 5xx error rate > 1% for 5 minutes)?
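
When choosing a threshold like "5xx error rate > 1%", it helps to be clear about the arithmetic: the rate is the share of responses in the window whose status is in the 500 range. A minimal sketch with invented status codes and a hypothetical `error_rate` helper:

```python
def error_rate(status_codes):
    """Fraction of responses with a 5xx status; 0.0 for an empty window."""
    if not status_codes:
        return 0.0
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return server_errors / len(status_codes)


# Hypothetical window of 100 responses: 97 OK, two 5xx, one 404.
window = [200] * 97 + [500, 503, 404]

rate = error_rate(window)
print(f"{rate:.2%}")  # 2.00%
print(rate > 0.01)    # True
```

Client errors such as 404 are excluded from the numerator, which keeps the alert focused on failures the service itself is responsible for.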

When you finish, compare your design to the concepts in this module and adjust. This kind of design thinking is close to scenario questions you will see on the certification exam.

Quick Check: Logging and Log-based Metrics

Test your understanding of Cloud Logging and log-based metrics.

You need to be notified when a Cloud Function writes log entries containing a specific error string more than 20 times in 5 minutes. Which combination of services and features should you use?

  A) Create a filter in Logs Explorer and periodically check it manually.
  B) Create a log-based counter metric in Cloud Logging and an alerting policy in Cloud Monitoring based on that metric.
  C) Create an uptime check against the Cloud Function URL and alert when it fails.
  D) Enable VPC Flow Logs and create a firewall rule to block the traffic causing errors.

Answer: B) Create a log-based counter metric in Cloud Logging and an alerting policy in Cloud Monitoring based on that metric.

You want automated notifications based on how often a specific log pattern occurs. The correct approach is to create a log-based counter metric that counts matching log entries, then use a Cloud Monitoring alerting policy on that metric. Uptime checks only test endpoint reachability, not specific log contents, and VPC Flow Logs are unrelated to application error strings.

Quick Check: Metrics, Dashboards, and Uptime Checks

Test your understanding of Cloud Monitoring concepts.

Which statement best describes the role of an uptime check in Cloud Monitoring?

  A) It continuously scans your logs for specific error messages.
  B) It tests whether an endpoint is reachable and responding correctly from selected regions.
  C) It automatically creates dashboards for all Google Cloud services in your project.
  D) It is required before you can view any metrics in Metrics Explorer.

Answer: B) It tests whether an endpoint is reachable and responding correctly from selected regions.

An uptime check is a synthetic probe that periodically tests whether an endpoint (HTTP(S), TCP, or gRPC) is reachable and responding as expected from chosen regions. It does not scan logs, create dashboards, or gate access to metrics.

Key Terms Review: Cloud Logging and Monitoring

Use these flashcards to reinforce the core terminology.

Cloud Logging
Google Cloud’s central logging service used to collect, store, view, and route logs from Google Cloud services and applications.
Cloud Monitoring
Google Cloud’s core monitoring service used to collect metrics, build dashboards, configure uptime checks, and create alerting policies.
Log-based metric
A metric derived from log entries that match a specified filter, typically used as a counter or distribution in Cloud Monitoring.
Gauge metric
A metric type that represents the current value at a point in time, such as CPU utilization or memory usage.
Cumulative metric
A metric type that represents a value that increases over time and may reset, such as total bytes sent.
Logs Explorer
The Cloud Logging interface where you can query, filter, and inspect log entries using a UI builder or logging query language.
Monitoring workspace
A logical grouping in Cloud Monitoring, now called a metrics scope, that aggregates metrics, dashboards, and alerts from one or more Google Cloud projects.
Uptime check
A synthetic probe in Cloud Monitoring that periodically tests whether an endpoint (HTTP(S), TCP, gRPC) is reachable and responding correctly.
Alerting policy
A Cloud Monitoring configuration that defines conditions on metrics or uptime checks and sends notifications via configured channels when conditions are met.
Severity (in logs)
A field in log entries that indicates importance or urgency. In increasing order, the levels are DEFAULT, DEBUG, INFO, NOTICE, WARNING, ERROR, CRITICAL, ALERT, and EMERGENCY.

Key Terms

Log view
A filtered view of a log bucket that controls which logs are visible to different users or tools.
Log bucket
A storage container in Cloud Logging that holds log entries with configurable retention and access controls.
Delta metric
A metric type that represents the change in a value since the last measurement.
Monitored resource
The entity that a metric describes, such as a gce_instance, cloud_function, gae_instance, cloudsql_database, or k8s_container.
