Chapter 20 of 27
Cloud Logging and Cloud Monitoring: Observability Foundations
Visibility is essential for reliable operations; learn how to capture logs and metrics, build dashboards, and configure alerts across your Google Cloud estate.
Observability Foundations on Google Cloud
Why Observability Matters
Observability means having enough visibility into your systems to know what is happening, why it is happening, and how to fix or improve it in production.
Core Google Cloud Tools
Google Cloud’s observability stack centers on two services: Cloud Logging for logs, and Cloud Monitoring for metrics, dashboards, uptime checks, and alerting.
Logs vs Metrics
Logs are detailed, timestamped records of events. Metrics are numeric measurements over time, such as CPU utilization, request latency, or error rate.
Link to Previous Modules
App Engine, Cloud Functions, Cloud Storage, Cloud SQL, and BigQuery all emit logs and metrics that feed into Cloud Logging and Cloud Monitoring.
What You Will Be Able To Do
You will learn to navigate logs, identify metrics, create log-based metrics, build dashboards and uptime checks, and configure alerts for operators.
Cloud Logging: Concepts, Structure, and Navigation
What Is Cloud Logging?
Cloud Logging is Google Cloud’s central logging service. Managed services write logs to it automatically when APIs and default agents or integrations are enabled.
Anatomy of a Log Entry
Each log entry has a timestamp, logName, resource, severity, and a payload (textPayload, jsonPayload, or protoPayload) that contains the actual message or structured data.
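As a rough illustration, a structured log entry can be pictured as a Python dictionary. The field names follow the ones listed above; the project ID, log name, timestamp, and payload values are invented for the example, and payload_of is a hypothetical helper, not part of any Google library.

```python
# A hypothetical log entry written as a plain dictionary.
# Field names mirror the log entry fields; every value is invented.
log_entry = {
    "timestamp": "2024-01-15T10:32:07.123Z",
    "logName": "projects/example-project/logs/example-log",
    "resource": {
        "type": "cloud_function",
        "labels": {"function_name": "image-processor", "region": "us-central1"},
    },
    "severity": "ERROR",
    "jsonPayload": {"message": "Failed to resize image", "file": "cat.png"},
}


def payload_of(entry):
    """Return whichever payload field the entry carries, or None."""
    for key in ("textPayload", "jsonPayload", "protoPayload"):
        if key in entry:
            return entry[key]
    return None


print(log_entry["severity"], payload_of(log_entry)["message"])
```

An entry carries only one payload field, which is why the helper returns the first one it finds.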
Buckets, Views, Retention
Logs are stored in log buckets, which have retention settings. Log views restrict which logs in a bucket are visible to different teams or tools.
Using Logs Explorer
In the console, go to Logging > Logs Explorer, choose project and time range, then filter by resource type and use the query builder to refine results.
Typical Filters
Common filters include resource.type for the service, logName for the log, and severity>=ERROR to focus on failures and critical events.
Hands-on: Finding Errors for a Cloud Function
Scenario: Failing Cloud Function
You deployed a Cloud Function called image-processor. Users say some uploads fail, and you need to find recent errors in Cloud Logging.
Navigating to Logs Explorer
Open Logging > Logs Explorer, select the correct project, and set the time range to Last 1 hour to keep the search focused on recent issues.
Filtering by Resource and Function
Choose resource type Cloud Function and function name image-processor, then switch to the Query tab to refine the filter further.
Adding Severity Filter
Use a query like: resource.type="cloud_function" resource.labels.function_name="image-processor" severity>=ERROR to focus on failures.
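If you build such queries often, a small helper can assemble them. This is a minimal sketch in Python: build_filter is a hypothetical function, and it only produces the query text, using the underscore spellings cloud_function and function_name that Cloud Logging expects. Clauses separated by spaces are combined with an implicit AND.

```python
def build_filter(resource_type, labels=None, min_severity=None):
    """Join filter clauses with spaces; adjacent clauses are ANDed together."""
    clauses = [f'resource.type="{resource_type}"']
    for key, value in (labels or {}).items():
        clauses.append(f'resource.labels.{key}="{value}"')
    if min_severity:
        clauses.append(f"severity>={min_severity}")
    return " ".join(clauses)


query = build_filter(
    "cloud_function",
    labels={"function_name": "image-processor"},
    min_severity="ERROR",
)
print(query)
```

The printed string can be pasted into the Query tab of Logs Explorer.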
Inspecting and Grouping Logs
Run the query, expand error entries to see details and stack traces, and use Group by to see which part of the function fails most often.
Cloud Monitoring: Metrics, Workspaces, and Metric Types
What Is Cloud Monitoring?
Cloud Monitoring is the core monitoring service for Google Cloud. It collects metrics and metadata, and lets you build dashboards, uptime checks, and alerts.
Metrics and Monitored Resources
A metric is time-series data like CPU utilization. Each metric is tied to a monitored resource, such as a gce_instance or cloud_function.
Metric Types: Gauge
Gauge metrics represent the current value at a point in time, such as CPU utilization or memory usage on a VM or container.
Metric Types: Cumulative and Delta
Cumulative metrics increase over time and reset occasionally, like total bytes sent. Delta metrics capture the change since the last measurement.
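The relationship between cumulative and delta values can be sketched in a few lines of Python. The sample numbers are invented, and the reset handling (treating a drop as a restart from zero) is an assumption made for this illustration rather than a description of how Cloud Monitoring processes data internally.

```python
def cumulative_to_deltas(samples):
    """Convert cumulative readings into per-interval deltas.

    If a reading is lower than the previous one, the counter is assumed to
    have reset, so the new reading itself is taken as the delta.
    """
    deltas = []
    for previous, current in zip(samples, samples[1:]):
        deltas.append(current - previous if current >= previous else current)
    return deltas


total_bytes_sent = [100, 250, 400, 30, 90]  # the counter resets after 400
print(cumulative_to_deltas(total_bytes_sent))  # [150, 150, 30, 60]
```

A gauge, by contrast, needs no conversion: each sample is already the value at that moment.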
Monitoring Workspaces
Monitoring groups projects through a metrics scope (formerly called a workspace), which is hosted by a scoping project. A metrics scope can include metrics from multiple projects for cross-project visibility.
Log-based Metrics: Turning Events into Numbers
Why Log-based Metrics?
Use log-based metrics when you want to alert or chart based on specific log patterns, such as HTTP 500 errors or custom application error messages.
What Is a Log-based Metric?
A log-based metric is a metric derived from log entries that match a filter. Logging evaluates logs against the filter and updates the metric over time.
Metric Types: Counter and Distribution
Counter metrics count matching log entries. Distribution metrics capture numeric values from logs, such as latency extracted from a JSON field.
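To make the difference concrete, the following Python sketch computes both kinds of metric over a handful of invented log entries. The latency_ms field, the bucket bounds, and both function names are assumptions for this example; in Cloud Logging the counting and bucketing happen inside the service.

```python
entries = [
    {"severity": "ERROR", "jsonPayload": {"latency_ms": 120}},
    {"severity": "INFO", "jsonPayload": {"latency_ms": 35}},
    {"severity": "ERROR", "jsonPayload": {"latency_ms": 480}},
    {"severity": "INFO", "jsonPayload": {"latency_ms": 60}},
]


def counter_metric(entries, matches):
    """Count the entries that satisfy the filter function."""
    return sum(1 for entry in entries if matches(entry))


def distribution_metric(entries, extract, bounds):
    """Build a histogram of extracted values.

    Bucket i counts values v with bounds[i-1] <= v < bounds[i]; the first
    bucket has no lower bound and the last has no upper bound.
    `bounds` must be sorted in ascending order.
    """
    counts = [0] * (len(bounds) + 1)
    for entry in entries:
        value = extract(entry)
        index = sum(1 for bound in bounds if value >= bound)
        counts[index] += 1
    return counts


error_count = counter_metric(entries, lambda e: e["severity"] == "ERROR")
latency_histogram = distribution_metric(
    entries, lambda e: e["jsonPayload"]["latency_ms"], bounds=[50, 100, 500]
)
print(error_count)        # 2
print(latency_histogram)  # [1, 1, 2, 0]
```

The counter answers "how many", while the distribution keeps enough shape to estimate percentiles.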
Steps to Create One
In Logs Explorer, build a query, click Create metric, choose counter or distribution, name it, optionally add labels, and save the metric.
Where It Appears
The new metric shows up in Cloud Monitoring as a user-defined metric under logging.googleapis.com/user/, usable in dashboards and alerts.
Example: Log-based Metric and Alert for App Engine Errors
Goal: Alert on App Engine 500s
You run an App Engine API and want an alert when it returns many HTTP 500 errors in a short period, indicating a serious outage.
Building the Log Filter
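In Logs Explorer, use a query like resource.type="gae_app" severity>=ERROR jsonPayload.status=500 and verify it returns recent failing requests. The field that holds the status code depends on how the entry was written: App Engine request logs, for example, record it in protoPayload.status, so adjust the last clause until the query matches your entries.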
Creating the Counter Metric
Click Create metric, choose Counter, name it gae_500_count, and optionally add a label for the App Engine service from resource.labels.module_id.
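For reference, the same metric can be described as a LogMetric resource for the Cloud Logging API. The Python sketch below only builds that definition as a dictionary and prints it as JSON; it does not call the API. The metric name gae_500_count, the label name service, and the description text are illustrative choices, so check the payload against the current API reference before using it.

```python
import json

# Hypothetical definition of a counter log-based metric.
# Counter metrics use the DELTA kind with INT64 values.
gae_500_count = {
    "name": "gae_500_count",
    "description": "Count of App Engine log entries reporting HTTP 500.",
    "filter": 'resource.type="gae_app" severity>=ERROR jsonPayload.status=500',
    "metricDescriptor": {
        "metricKind": "DELTA",
        "valueType": "INT64",
        "labels": [{"key": "service", "valueType": "STRING"}],
    },
    # Each label needs an extractor that says where its value comes from.
    "labelExtractors": {"service": "EXTRACT(resource.labels.module_id)"},
}

print(json.dumps(gae_500_count, indent=2))
```

Keeping the definition in a file like this makes it easier to review the filter and labels than clicking through the console each time.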
Alerting on the Metric
In Monitoring > Alerting, create a policy with a metric condition on logging.googleapis.com/user/gae_500_count, such as more than 10 errors per minute.
Notifications in Action
When errors spike, Logging updates the metric, Monitoring evaluates the condition, and operators receive notifications via configured channels.
Dashboards and Uptime Checks in Cloud Monitoring
What Are Dashboards?
Dashboards in Cloud Monitoring are collections of widgets like charts and scorecards that visualize metrics and status for your resources.
Building a Custom Dashboard
In Monitoring > Dashboards, create a dashboard, add a line chart, select a metric like instance CPU utilization, filter as needed, and save it.
What Are Uptime Checks?
Uptime checks are synthetic probes that periodically test if an endpoint, such as an HTTP URL, is reachable and responding correctly.
Creating an HTTP Uptime Check
In Monitoring > Uptime checks, create a check, choose HTTP or HTTPS, enter the URL, select frequency and regions, and optionally link an alert.
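Conceptually, an uptime check is a scheduled HTTP request whose result is judged pass or fail. The Python sketch below imitates one round of probing with the standard library. It runs wherever you execute it, so the region names are only labels here, whereas real uptime checks are sent from Google-operated checkers; http_fetch and run_uptime_check are hypothetical names.

```python
import urllib.error
import urllib.request


def http_fetch(url, timeout=10):
    """Return the HTTP status code for a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, or timeout


def run_uptime_check(url, regions, fetch=http_fetch):
    """Probe the URL once per region label and report which probes passed."""
    results = {}
    for region in regions:
        status = fetch(url)
        results[region] = status is not None and 200 <= status < 300
    return results


# A stand-in fetcher so the example runs without network access.
def fake_fetch(url):
    return 200 if url.endswith("/healthz") else 503


print(run_uptime_check("https://example.com/healthz", ["usa", "europe"], fetch=fake_fetch))
```

Passing the fetcher as a parameter keeps the pass/fail rule testable; drop the fetch argument to probe a real endpoint.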
Why They Matter for the Exam
Expect scenarios where you must monitor application health. Dashboards show trends, while uptime checks detect and alert on outages.
Alerting Policies: Thresholds, Conditions, and Notifications
What Is an Alerting Policy?
An alerting policy links metrics or uptime checks to notifications. It defines when to alert and who should be notified about issues.
Conditions and Thresholds
Conditions specify rules like CPU utilization > 80% for 5 minutes or an uptime check failing for multiple consecutive runs.
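The "for 5 minutes" part of such a rule means the threshold must be breached for the whole window, not just once. A minimal Python sketch of that logic, assuming one sample per minute and invented CPU values (condition_met is a hypothetical helper, not a Cloud Monitoring API):

```python
def condition_met(samples, threshold, duration_points):
    """Return True if the last `duration_points` samples all exceed the threshold."""
    if len(samples) < duration_points:
        return False
    return all(value > threshold for value in samples[-duration_points:])


# One CPU utilization sample per minute (as fractions of 1.0), newest last.
cpu = [0.62, 0.85, 0.88, 0.91, 0.83, 0.87]

print(condition_met(cpu, threshold=0.80, duration_points=5))  # True
print(condition_met(cpu, threshold=0.80, duration_points=6))  # False
```

Requiring a sustained breach filters out brief spikes that would otherwise page an operator for no reason.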
Notification Channels
Notification channels are destinations such as email, SMS, Slack, mobile app, webhooks, or Pub/Sub, configured in Monitoring.
Creating a Basic Alert
In Monitoring > Alerting, create a policy, add a metric condition, set thresholds and windows, choose notification channels, and save it.
Common Exam Traps
Do not confuse uptime checks with alerts: checks probe endpoints, but alerting policies send notifications based on check results or metrics.
Thought Exercise: Designing an Observability Setup
Imagine you are the Associate Cloud Engineer for a small startup running this architecture on Google Cloud:
- Frontend: React app served from Cloud Storage behind a Cloud CDN-enabled HTTPS load balancer.
- Backend API: Cloud Run service.
- Background processing: Cloud Functions triggered by Pub/Sub messages.
- Database: Cloud SQL for PostgreSQL.
Your task: sketch an observability setup using Cloud Logging and Cloud Monitoring.
Reflect on these prompts and, if possible, write your answers down:
- Logs to focus on
- Which log types would you review regularly for each component? (Think: load balancer logs, Cloud Run logs, Cloud Functions logs, Cloud SQL audit logs.)
- Which ones are most critical for security or compliance vs. performance troubleshooting?
- Key metrics to chart on dashboards
- For Cloud Run and Cloud Functions: what metrics show performance and reliability? (Hint: request count, latency, error rate, instance count.)
- For Cloud SQL: what metrics indicate stress? (Hint: CPU utilization, connections, disk usage.)
- Log-based metrics you might create
- Identify at least one log pattern per service that you would turn into a log-based counter metric (for example, specific error codes or “payment failed” messages).
- Uptime checks and alerts
- Which public endpoints would you probe with uptime checks?
- For each metric or log-based metric, what is a reasonable threshold for an alert (e.g., 5xx error rate > 1% for 5 minutes)?
When you finish, compare your design to the concepts in this module and adjust. This kind of design thinking is close to scenario questions you will see on the certification exam.
Quick Check: Logging and Log-based Metrics
Test your understanding of Cloud Logging and log-based metrics.
You need to be notified when a Cloud Function writes log entries containing a specific error string more than 20 times in 5 minutes. Which combination of services and features should you use?
- A) Create a filter in Logs Explorer and periodically check it manually.
- B) Create a log-based counter metric in Cloud Logging and an alerting policy in Cloud Monitoring based on that metric.
- C) Create an uptime check against the Cloud Function URL and alert when it fails.
- D) Enable VPC Flow Logs and create a firewall rule to block the traffic causing errors.
Answer: B) Create a log-based counter metric in Cloud Logging and an alerting policy in Cloud Monitoring based on that metric.
You want automated notifications based on how often a specific log pattern occurs. The correct approach is to create a log-based counter metric that counts matching log entries, then use a Cloud Monitoring alerting policy on that metric. Uptime checks only test endpoint reachability, not specific log contents, and VPC Flow Logs are unrelated to application error strings.
Quick Check: Metrics, Dashboards, and Uptime Checks
Test your understanding of Cloud Monitoring concepts.
Which statement best describes the role of an uptime check in Cloud Monitoring?
- A) It continuously scans your logs for specific error messages.
- B) It tests whether an endpoint is reachable and responding correctly from selected regions.
- C) It automatically creates dashboards for all Google Cloud services in your project.
- D) It is required before you can view any metrics in Metrics Explorer.
Answer: B) It tests whether an endpoint is reachable and responding correctly from selected regions.
An uptime check is a synthetic probe that periodically tests whether an endpoint (HTTP, HTTPS, or TCP) is reachable and responding as expected from chosen regions. It does not scan logs, create dashboards, or gate access to metrics.
Key Terms Review: Cloud Logging and Monitoring
Use these flashcards to reinforce the core terminology.
- Cloud Logging
- Google Cloud’s central logging service used to collect, store, view, and route logs from Google Cloud services and applications.
- Cloud Monitoring
- Google Cloud’s core monitoring service used to collect metrics, build dashboards, configure uptime checks, and create alerting policies.
- Log-based metric
- A metric derived from log entries that match a specified filter, typically used as a counter or distribution in Cloud Monitoring.
- Gauge metric
- A metric type that represents the current value at a point in time, such as CPU utilization or memory usage.
- Cumulative metric
- A metric type that represents a value that increases over time and may reset, such as total bytes sent.
- Logs Explorer
- The Cloud Logging interface where you can query, filter, and inspect log entries using a UI builder or logging query language.
- Monitoring workspace
- A logical grouping in Cloud Monitoring, now called a metrics scope, that aggregates metrics, dashboards, and alerts from one or more Google Cloud projects.
- Uptime check
- A synthetic probe in Cloud Monitoring that periodically tests whether an endpoint (HTTP, HTTPS, or TCP) is reachable and responding correctly.
- Alerting policy
- A Cloud Monitoring configuration that defines conditions on metrics or uptime checks and sends notifications via configured channels when conditions are met.
- Severity (in logs)
- A field in log entries that indicates importance or urgency, such as DEBUG, INFO, WARNING, ERROR, CRITICAL, ALERT, or EMERGENCY.
Key Terms
- Log view
- A filtered view of a log bucket that controls which logs are visible to different users or tools.
- Log bucket
- A storage container in Cloud Logging that holds log entries with configurable retention and access controls.
- Delta metric
- A metric type that represents the change in a value since the last measurement.
- Monitored resource
- The entity that a metric describes, such as a gce_instance, cloud_function, gae_instance, cloudsql_database, or k8s_container.