Chapter 21 of 27
Advanced Logging, Metrics, and Troubleshooting Across Services
Complex issues often span multiple services; practice using advanced logging and monitoring features to trace, diagnose, and resolve cross-cutting problems.
Big Picture: Why Advanced Logging and Metrics Matter
Why This Matters
Complex Google Cloud issues often span multiple services. You need Cloud Logging and Cloud Monitoring together to trace problems end‑to‑end across compute, storage, and networking.
Exam Context
These skills live mainly in "Ensuring successful operation of a cloud solution" and "Deploying and implementing a cloud solution" for the Associate Cloud Engineer exam.
What You Will Do
You will configure log buckets and routing, export logs with sinks, create custom metrics, hook them to alerts, and practice full troubleshooting workflows using logs, metrics, and traces.
Cloud Logging Architecture: Buckets, Routers, and Sinks
Core Pieces
Cloud Logging is built from four pieces: log entries, log buckets, the Log Router, and log sinks. The router receives every log entry, and each sink decides which entries are stored in a log bucket or exported to another destination.
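To make the router and sink relationship concrete, here is a toy model in Python. It is only a sketch of the idea, not the Cloud Logging API: the `Sink` class, the `route` function, and the sink name `errors-to-bq` are all hypothetical, and the real `_Default` sink has a more detailed filter than "match everything".

```python
from dataclasses import dataclass, field
from typing import Callable

LogEntry = dict  # toy stand-in for a Cloud Logging log entry


@dataclass
class Sink:
    """Toy sink: a name, a destination label, and an inclusion filter."""
    name: str
    destination: str
    matches: Callable[[LogEntry], bool]
    delivered: list = field(default_factory=list)


def route(entry: LogEntry, sinks: list) -> list:
    """Offer the entry to every sink; each sink decides independently."""
    hit = []
    for sink in sinks:
        if sink.matches(entry):
            sink.delivered.append(entry)
            hit.append(sink.name)
    return hit


sinks = [
    # Simplified: treat the default sink as matching every entry.
    Sink("_Default", "log bucket _Default", lambda e: True),
    # Hypothetical export sink that only wants errors.
    Sink("errors-to-bq", "BigQuery dataset", lambda e: e["severity"] == "ERROR"),
]

print(route({"severity": "ERROR", "msg": "boom"}, sinks))
print(route({"severity": "INFO", "msg": "ok"}, sinks))
```

The point of the model is that sinks do not compete for entries: one entry can land in several destinations if several sink filters match it.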
Default vs Custom Buckets
Every project has two predefined log buckets, `_Required` and `_Default`. You can add user‑defined buckets with their own region, retention period, and access control.
Common Exam Trap
Do not confuse Cloud Logging log buckets with Cloud Storage buckets. They are different services. Log sinks can export from log buckets into Cloud Storage buckets.
Hands‑On: Creating a Regional Log Bucket with Custom Retention
Scenario
You must keep production audit logs for 1 year in the EU, while other logs use default retention. You create a dedicated regional log bucket for these logs.
Create the Bucket
In Logging → Log buckets, create `prod-audit-eu`, choose an EU region like europe-west1, and set retention to 365 days to meet compliance.
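The same bucket can be created from the command line. The sketch below only builds and prints the `gcloud logging buckets create` command so you can review it before running it; the project ID `my-prod-project` is a placeholder, not something from this scenario.

```python
import shlex


def create_log_bucket_cmd(bucket_id, location, retention_days, project):
    """Build (but do not run) the gcloud command that creates a log bucket."""
    if retention_days < 1:
        raise ValueError("retention_days must be a positive number of days")
    return [
        "gcloud", "logging", "buckets", "create", bucket_id,
        f"--location={location}",
        f"--retention-days={retention_days}",
        f"--project={project}",
    ]


# "my-prod-project" is a placeholder project ID.
cmd = create_log_bucket_cmd("prod-audit-eu", "europe-west1", 365, "my-prod-project")
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to hand to `subprocess.run(cmd, check=True)` later, once you are happy with the values.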
Route Logs with a Sink
In Logging → Log router, create a sink `route-prod-audit-to-eu`, choose destination = Cloud Logging bucket `prod-audit-eu`, and filter audit logs from your prod project.
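As a command-line counterpart, the sketch below builds the `gcloud logging sinks create` command for this sink. It assumes the production project ID is `my-prod-project` (a placeholder) and uses a broad audit-log filter, `logName:"cloudaudit.googleapis.com"`; your compliance rules may call for a narrower one.

```python
import shlex


def log_bucket_destination(project, location, bucket_id):
    """Destination string for a sink that routes into a Cloud Logging bucket."""
    return (
        f"logging.googleapis.com/projects/{project}"
        f"/locations/{location}/buckets/{bucket_id}"
    )


# Matches any log whose name contains the audit log service name.
AUDIT_FILTER = 'logName:"cloudaudit.googleapis.com"'


def create_sink_cmd(sink_name, destination, log_filter, project):
    """Build (but do not run) the gcloud command that creates a log sink."""
    return [
        "gcloud", "logging", "sinks", "create", sink_name, destination,
        f"--log-filter={log_filter}",
        f"--project={project}",
    ]


# "my-prod-project" is a placeholder project ID.
destination = log_bucket_destination("my-prod-project", "europe-west1", "prod-audit-eu")
sink_cmd = create_sink_cmd("route-prod-audit-to-eu", destination, AUDIT_FILTER, "my-prod-project")
print(shlex.join(sink_cmd))
```

Note that the location in the destination path has to match the region chosen when the bucket was created.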
Log Sinks and Exports: Cloud Storage, BigQuery, Pub/Sub
What a Sink Does
A log sink defines a destination plus filters. It exports selected logs from Cloud Logging to Cloud Storage, BigQuery, Pub/Sub, or another log bucket.
Choosing Destinations
Cloud Storage is for cheap long‑term archives, BigQuery for SQL analytics and dashboards, and Pub/Sub for real‑time streaming or event‑driven processing.
IAM and Service Accounts
Each sink writes with its own service account (its writer identity). That account needs a role such as `roles/storage.objectCreator`, `roles/bigquery.dataEditor`, or `roles/pubsub.publisher` on the destination resource.
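The pairing of destination and role is easy to mix up, so here is a small lookup sketch. The destination string formats are the ones used when creating a sink; the helper name `sink_target` and the resource names in the example are hypothetical.

```python
# Destination string template and the role the sink's service account
# needs on that destination.
DESTINATIONS = {
    "storage": (
        "storage.googleapis.com/{bucket}",
        "roles/storage.objectCreator",
    ),
    "bigquery": (
        "bigquery.googleapis.com/projects/{project}/datasets/{dataset}",
        "roles/bigquery.dataEditor",
    ),
    "pubsub": (
        "pubsub.googleapis.com/projects/{project}/topics/{topic}",
        "roles/pubsub.publisher",
    ),
}


def sink_target(kind, **names):
    """Return (destination string, required role) for a destination kind."""
    template, role = DESTINATIONS[kind]
    return template.format(**names), role


# Placeholder resource names for illustration.
print(sink_target("storage", bucket="my-log-archive"))
print(sink_target("bigquery", project="my-project", dataset="app_logs"))
print(sink_target("pubsub", project="my-project", topic="log-events"))
```

A sink that is created without the matching role on its destination will exist but fail to deliver, which is a common reason exported logs never show up.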
Quick Check: Log Buckets and Sinks
Test your understanding of log buckets and sinks.
You need to keep all production HTTP request logs for 7 years, but you rarely query them. Which configuration best balances cost and requirements?
- A) Increase retention on the _Default log bucket to 7 years.
- B) Create a log sink that exports matching logs to a Cloud Storage bucket with lifecycle rules for 7‑year retention.
- C) Export logs to BigQuery and keep them for 7 years.
- D) Create a log sink to a Pub/Sub topic and store logs in a custom application database.
Answer: B) Create a log sink that exports matching logs to a Cloud Storage bucket with lifecycle rules for 7‑year retention.
Cloud Storage is the low‑cost choice for long‑term retention when you do not need frequent querying. You create a sink from Cloud Logging to a Cloud Storage bucket and manage 7‑year retention with lifecycle rules. Extending _Default retention is more expensive; BigQuery is for analytics; Pub/Sub alone does not provide long‑term storage.
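One way to picture the lifecycle rule from the answer is as the `lifecycle` section of a Cloud Storage bucket resource. The sketch below generates a rule that deletes objects once they are about 7 years old. It assumes 365-day years, so add a margin if your requirement counts leap days, and it is an illustration of the structure rather than a complete retention setup.

```python
import json


def delete_after_years(years):
    """Lifecycle section that deletes objects older than `years` (365-day years)."""
    days = years * 365
    return {
        "lifecycle": {
            "rule": [
                {"action": {"type": "Delete"}, "condition": {"age": days}}
            ]
        }
    }


config = delete_after_years(7)
print(json.dumps(config, indent=2))
```

A lifecycle rule only removes objects after the age is reached. If the requirement is also that nobody may delete the logs earlier, that is a separate bucket setting to look into.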
Custom Metrics in Cloud Monitoring
Why Custom Metrics
System metrics alone are often not enough. You also need app‑specific signals such as failed logins or 500 errors. Log‑based metrics turn patterns in logs into metrics.
Metric Types
Log‑based metrics can be Counters (count events) or Distributions (track numeric values like latency). They are defined by a log filter in Logs Explorer.
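The difference between the two types is easier to see on sample data. This toy sketch computes both from a handful of made-up log entries; the field names `status` and `latency_ms` and the bucket boundaries are assumptions for illustration, not Cloud Logging fields.

```python
# Made-up log entries for illustration.
entries = [
    {"status": 200, "latency_ms": 120},
    {"status": 500, "latency_ms": 950},
    {"status": 503, "latency_ms": 1800},
    {"status": 200, "latency_ms": 80},
]


def is_5xx(entry):
    return 500 <= entry["status"] <= 599


def counter(entries, matches):
    """Counter: how many entries match the filter."""
    return sum(1 for e in entries if matches(e))


def distribution(entries, matches, field, bounds):
    """Distribution: histogram of a numeric field from matching entries.

    Bucket i holds values below bounds[i] (and at or above bounds[i-1]);
    the last bucket holds everything at or above the final bound.
    """
    counts = [0] * (len(bounds) + 1)
    for e in entries:
        if matches(e):
            value = e[field]
            index = next((k for k, b in enumerate(bounds) if value < b), len(bounds))
            counts[index] += 1
    return counts


print(counter(entries, is_5xx))                                       # 2
print(distribution(entries, lambda e: True, "latency_ms", [100, 1000]))  # [1, 2, 1]
```

A counter answers "how often did this happen", while a distribution answers "what values did this field take", which is why latency is the usual distribution example.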
Creation Flow
Filter logs in Logs Explorer, click Create metric, choose type, name it, and optionally map a numeric field. The metric then appears in Cloud Monitoring.
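The console flow above has a command-line equivalent for counter metrics. The sketch below builds a `gcloud logging metrics create` command; the metric name `error_count`, its description, and the filter `severity>=ERROR` are placeholders chosen for illustration.

```python
def create_counter_metric_cmd(name, description, log_filter):
    """Build (but do not run) the gcloud command for a counter log-based metric."""
    return [
        "gcloud", "logging", "metrics", "create", name,
        f"--description={description}",
        f"--log-filter={log_filter}",
    ]


# Placeholder name, description, and filter.
metric_cmd = create_counter_metric_cmd(
    "error_count",
    "Count of log entries at ERROR severity or higher",
    "severity>=ERROR",
)
print(metric_cmd)
```

The command is kept as an argument list because the filter contains characters such as `>` that a shell would otherwise interpret; passing the list to `subprocess.run` avoids quoting problems.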
From Logs to Alerts: Building a 5xx Error Alert for Cloud Run
Step 1: Metric from Logs
Filter Cloud Run logs for 5xx status codes in Logs Explorer, then create a Counter log‑based metric named `cloud_run_5xx_errors` from that filter.
Step 2: View the Metric
In Metrics explorer, search for `cloud_run_5xx_errors` and plot it by `service_name` to see error counts per Cloud Run service over time.
Step 3: Alerting Policy
Create an alerting policy that triggers when `cloud_run_5xx_errors` exceeds a threshold, such as more than 5 errors in 5 minutes, and attach notification channels.
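The three steps can be sketched in a few lines of Python. The filter string is one reasonable way to select Cloud Run 5xx responses, and user-defined log-based metrics appear in Cloud Monitoring under the `logging.googleapis.com/user/` prefix. The `should_alert` function is only a local illustration of the "more than 5 errors in 5 minutes" rule; real alerting policies evaluate aligned time series inside Cloud Monitoring, and the metric name below is a placeholder for whatever you chose in Step 1.

```python
from datetime import datetime, timedelta

# Step 1: a Logs Explorer filter for Cloud Run responses with 5xx status.
CLOUD_RUN_5XX_FILTER = 'resource.type="cloud_run_revision" AND httpRequest.status>=500'


def monitoring_metric_type(metric_name):
    """Step 2: the metric type to search for in Metrics explorer."""
    return f"logging.googleapis.com/user/{metric_name}"


def should_alert(error_times, now, threshold=5, window=timedelta(minutes=5)):
    """Step 3 (toy version): True if more than `threshold` errors fall in the window."""
    recent = [t for t in error_times if now - window < t <= now]
    return len(recent) > threshold


now = datetime(2024, 1, 1, 12, 0, 0)
six_errors = [now - timedelta(seconds=30 * i) for i in range(6)]

print(monitoring_metric_type("cloud_run_5xx_errors"))  # placeholder metric name
print(should_alert(six_errors, now))       # True: 6 errors in the last 5 minutes
print(should_alert(six_errors[:5], now))   # False: exactly 5 is not "more than 5"
```

Choosing "more than" versus "at least" matters at the boundary, so it is worth stating the threshold condition explicitly when you configure the policy.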
Thought Exercise: Picking the Right Observability Tool
Scenario A: Latency Spikes
Cloud Run requests sometimes take 10 seconds, but CPU and memory look fine. Which tool best shows where request time is spent across services?
Scenario B: Pipeline Failures
A data pipeline sometimes fails to write to BigQuery because of permissions. Where do you look to see the error details and which identity failed?
Scenario C: SSH Security Alert
Security wants near real‑time alerts for failed SSH logins on any VM. Which combination of logs, metrics, and alerts would you use?
End‑to‑End Troubleshooting Workflow Across Services
Step 1: Start from Symptoms
Check Cloud Monitoring dashboards, alerting policies, and uptime checks to see which services show increased errors or latency during the incident.
Steps 2–4: Logs, Chain, Quotas
Use Logs Explorer to find 5xx errors, follow dependencies (Cloud SQL, Pub/Sub), and look for `PERMISSION_DENIED` or `QUOTA_EXCEEDED` to catch IAM or quota issues.
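During an incident it helps to narrow Logs Explorer to the incident window first. This sketch assembles a query that combines a severity floor, a time range, and a text search for the error codes mentioned above. The function name `incident_filter` and the timestamps are hypothetical.

```python
def incident_filter(start, end, terms, min_severity="ERROR"):
    """Build a Logs Explorer query for an incident window.

    start/end are RFC 3339 timestamps; terms are strings to search for.
    """
    text = " OR ".join(f'"{term}"' for term in terms)
    return (
        f"severity>={min_severity} "
        f'AND timestamp>="{start}" AND timestamp<="{end}" '
        f"AND ({text})"
    )


query = incident_filter(
    "2024-05-01T10:00:00Z",
    "2024-05-01T10:30:00Z",
    ["PERMISSION_DENIED", "QUOTA_EXCEEDED"],
)
print(query)
```

Paste the printed query into Logs Explorer, then add a `resource.type` clause for each dependency in turn as you follow the chain.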
Steps 5–6: Trace and Verify
Use Cloud Trace to pinpoint slow operations, apply fixes like scaling or IAM changes, then monitor metrics and logs to confirm recovery and stability.
Common Error Patterns in Logs and How to Interpret Them
IAM and Quota Errors
`PERMISSION_DENIED` and insufficient scopes point to IAM or service account issues. `RESOURCE_EXHAUSTED` or 429 errors indicate quota or rate limit problems.
Network and Resource Limits
`Connection timed out` or `Connection refused` suggests networking or firewall issues. `OOMKilled` or `exceeded memory limit` means you must adjust resource sizing or fix leaks.
Config Mistakes
Messages like `Bucket not found` or `Access denied` usually mean wrong project, wrong resource name, or missing IAM permissions for that resource.
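The patterns in this section can be turned into a simple first-pass triage helper. This is a sketch: the category labels and regular expressions are assumptions based on the messages listed above, and real log messages vary by service, so treat the result as a hint for where to look next.

```python
import re

# Ordered list: the first matching pattern wins.
PATTERNS = [
    ("iam", re.compile(r"PERMISSION_DENIED|insufficient.*scopes", re.I)),
    ("quota", re.compile(r"RESOURCE_EXHAUSTED|quota exceeded|\b429\b", re.I)),
    ("network", re.compile(r"connection (timed out|refused)", re.I)),
    ("memory", re.compile(r"OOMKilled|exceeded.*memory limit", re.I)),
    ("config", re.compile(r"bucket not found|access denied", re.I)),
]


def classify(message):
    """Return a rough category for a log message, or 'unknown'."""
    for label, pattern in PATTERNS:
        if pattern.search(message):
            return label
    return "unknown"


samples = [
    "PERMISSION_DENIED: Caller does not have permission storage.objects.get",
    "HTTP 429 Too Many Requests",
    "dial tcp 10.0.0.5:5432: connection refused",
    "Container terminated: OOMKilled",
    "Bucket not found: my-bukcet",
]
for message in samples:
    print(classify(message), "<-", message)
```

Because `Access denied` can stem from either a wrong resource name or missing IAM permissions, a `config` result should still lead you to check both.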
Quiz: Interpreting Logs and Choosing Actions
Apply what you learned about error patterns and troubleshooting.
Your Cloud Run service logs show: `PERMISSION_DENIED: Caller does not have permission storage.objects.get on bucket my-prod-bucket` and `principalEmail: my-service@my-project.iam.gserviceaccount.com`. What is the most appropriate next step?
- A) Increase the memory allocated to the Cloud Run service.
- B) Grant the service account the Storage Object Viewer role on my-prod-bucket.
- C) Create a new log-based metric counting PERMISSION_DENIED errors.
- D) Move the bucket to a different region.
Answer: B) Grant the service account the Storage Object Viewer role on my-prod-bucket.
The error clearly indicates a permission issue on Cloud Storage. You should grant the Cloud Run service account `my-service@my-project.iam.gserviceaccount.com` a role that allows reading objects, such as Storage Object Viewer, on `my-prod-bucket`. Changing memory, creating a metric, or moving the bucket will not fix the access denial.
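The fix in this answer can be derived mechanically from the log line. The sketch below extracts the bucket name from the error text with a regular expression written for this specific message shape, then builds (without running) a `gcloud storage buckets add-iam-policy-binding` command. The helper names are hypothetical.

```python
import re
import shlex

LOG_MESSAGE = (
    "PERMISSION_DENIED: Caller does not have permission "
    "storage.objects.get on bucket my-prod-bucket"
)
PRINCIPAL = "my-service@my-project.iam.gserviceaccount.com"


def parse_denied(message):
    """Extract (permission, bucket) from the error text, or None if no match."""
    match = re.search(r"permission (\S+) on bucket (\S+)", message)
    return match.groups() if match else None


def grant_object_viewer_cmd(bucket, service_account):
    """Build the gcloud command that grants Storage Object Viewer on a bucket."""
    return [
        "gcloud", "storage", "buckets", "add-iam-policy-binding", f"gs://{bucket}",
        f"--member=serviceAccount:{service_account}",
        "--role=roles/storage.objectViewer",
    ]


permission, bucket = parse_denied(LOG_MESSAGE)
grant_cmd = grant_object_viewer_cmd(bucket, PRINCIPAL)
print(permission, bucket)
print(shlex.join(grant_cmd))
```

Granting the role on the single bucket rather than on the whole project keeps the service account's access as narrow as the error requires.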
Key Terms Review
Use these flashcards to reinforce core concepts from this module.
- Log bucket (Cloud Logging): A storage container in Cloud Logging that holds log entries with its own location, retention, and access control, separate from Cloud Storage buckets.
- Log sink: A routing rule attached to the Cloud Logging router that selects log entries with filters and sends them to a destination such as a log bucket, Cloud Storage, BigQuery, or Pub/Sub.
- Log-based metric: A Cloud Monitoring metric derived from log entries that match a filter in Cloud Logging, used to count events or measure distributions like latency.
- Counter vs Distribution log-based metrics: Counter metrics count how many log entries match a filter. Distribution metrics extract a numeric value from each matching entry and track its distribution over time.
- Common IAM error pattern in logs: Messages containing PERMISSION_DENIED or insufficient authentication scopes, often with principalEmail, indicating missing or incorrect IAM roles or service accounts.
- Common quota error pattern in logs: Messages containing RESOURCE_EXHAUSTED, Quota exceeded, or HTTP 429, indicating that a service-specific quota or rate limit has been reached.
- Service account (canonical definition): A service account is a special kind of account used by an application or compute workload, not a person, to make authorized API calls and access Google Cloud resources.
- End-to-end troubleshooting workflow: A structured process: start from symptoms and high-level metrics, drill into logs, follow dependencies across services, check IAM and quotas, use tracing for latency, then apply fixes and verify with metrics and logs.
Key Terms
- Log sink: A Cloud Logging resource that defines how selected log entries are routed or exported, including a destination and inclusion/exclusion filters.
- Log router: The Cloud Logging component that receives all log entries and, based on configured sinks, routes them to log buckets or export destinations.
- Cloud Trace: A Google Cloud service that collects and displays latency data for requests, helping you analyze and optimize performance across distributed services.
- Counter metric: A log-based metric type that counts the number of log entries matching a filter over time.
- Log-based metric: A Cloud Monitoring metric derived from log entries that match a specified filter in Cloud Logging.
- PERMISSION_DENIED: A common error code in logs indicating the caller lacks the required Identity and Access Management (IAM) permissions for an operation.
- RESOURCE_EXHAUSTED: An error code that typically indicates quota or rate limits have been reached for a particular Google Cloud resource or API.
- Distribution metric: A log-based metric type that records the distribution of a numeric field extracted from matching log entries, such as latency.
- Cloud Logging log bucket: A storage container in Cloud Logging that holds log entries with its own location, retention period, and access control configuration.
- Cloud Monitoring custom metric: A metric defined by the user, such as a log-based metric, that captures application-specific signals not provided by default system metrics.