Chapter 19 of 26
Operating Storage and Databases: Capacity, Performance, and Data Protection
Run storage systems in production by tuning performance, managing capacity, and protecting data with backups and lifecycle policies.
Storage and Databases in Operations: What This Module Covers
Why This Module Matters
Here you connect your compute skills to the data layer: Cloud Storage, Cloud SQL, and BigQuery. In production, you must keep them fast, control cost, and protect data.
Five Operational Themes
We focus on: Cloud Storage operations, Cloud SQL performance and maintenance, Cloud SQL backups and HA, BigQuery performance and cost, and exam-style decision making.
Your Mental Model
Picture a web app using Cloud SQL, files in Cloud Storage, and analytics in BigQuery. When things slow down or break, which knobs do you turn first?
Learning Outcomes
You will be able to run Cloud Storage with lifecycle and signed URLs, tune and protect Cloud SQL, and operate BigQuery with partitioning and clustering.
Operating Cloud Storage: Classes, Lifecycle Rules, and Versioning
Cloud Storage Levers
In Cloud Storage operations, your main tools are storage class, lifecycle rules, and object versioning. All classes share the same API but differ in cost and usage pattern.
Choosing Storage Classes
Standard is for hot data, Nearline for monthly access, Coldline for a few reads per year, Archive for very rare reads and long-term retention at lowest cost.
Lifecycle Rules Basics
Lifecycle rules run at bucket level and can change storage class or delete objects based on age, creation date, live status, or prefix/suffix filters.
Versioning for Protection
With versioning enabled, deletes and overwrites create noncurrent versions. You can restore older versions and use lifecycle to clean up old ones.
Common Exam Trap
Object versioning is not a cross-region backup; it just keeps historical versions in the same bucket location.
Cloud Storage Lifecycle and Versioning: Worked Scenarios
Scenario 1: Log Lifecycle
A bucket `app-logs-prod` stores logs. You keep them in Standard for 30 days, then move to Coldline, and delete after 400 days using two lifecycle actions.
Designing the Rules
Rule 1: Set storage class to COLDLINE when Age > 30 and prefix is logs/. Rule 2: Delete when Age > 400 and prefix is logs/ to enforce retention.
Scenario 2: Overwrite Protection
Bucket `customer-docs` holds PDFs. Overwrites happen by mistake. Enable object versioning so old copies become noncurrent instead of being lost.
Cleaning Up Old Versions
Add a lifecycle rule that deletes noncurrent versions when Age > 90 days. This keeps recovery possible while controlling storage costs.
Retention vs Versioning
If requirements mention legal hold or preventing deletion before N days, think Bucket Lock (retention policy), not just versioning or lifecycle.
Signed URLs and Access Patterns for Cloud Storage
Two Access Patterns
Cloud Storage access is usually via IAM for internal users and services, or via signed URLs to give temporary access to specific objects.
IAM for Storage
IAM roles like storage.objectViewer can be granted at project, bucket, or object level to identities that authenticate with Google Cloud.
What Are Signed URLs?
A signed URL is a time-limited URL that lets anyone who has it read or write a specific object, without needing a Google account or IAM role.
Operational Flow
Your backend authenticates the user, checks permissions, then generates a short-lived signed URL so the client can directly download or upload the file.
Common Traps
For truly public files, make them public instead of using signed URLs. To revoke a signed URL early, you must rotate keys or change the object name.
Cloud SQL Performance: Sizing, Connections, and Query Basics
Cloud SQL Tuning Areas
You tune Cloud SQL by adjusting instance sizing and storage, managing connections, and ensuring queries and indexes are reasonable.
Instance Sizing
More vCPUs help with CPU-bound workloads and concurrency; more memory helps caching and large joins; SSD is typical for OLTP workloads.
Connections Matter
Each instance has a max connections limit. Use connection pooling so apps reuse connections instead of opening and closing many short-lived ones.
Exam Hint on Connections
If you see 'too many connections' errors, think connection pooling and tuning max connections, not just making the instance bigger.
Query and Index Basics
Slow queries and missing indexes can cause CPU spikes. Use EXPLAIN and add indexes on common filter columns, while avoiding over-indexing.
Cloud SQL Maintenance, Backups, PITR, and Failover
Three Resilience Levers
For Cloud SQL protection and availability you use automated and on-demand backups, point-in-time recovery, and high availability plus replicas.
Automated Backups
Enable daily automated backups with a retention window. Take on-demand backups before risky changes, like major schema migrations.
Point-in-Time Recovery
With binary logging and PITR enabled, you can restore a new instance to a specific timestamp, such as just before an accidental DROP TABLE.
High Availability
HA uses a primary and synchronous standby in different zones of the same region. If the primary zone fails, Cloud SQL promotes the standby.
Read Replicas
Read replicas offload read traffic and can be cross-region for DR and global reads. They can be promoted to standalone instances if needed.
Exam Mapping
HA is for low RPO/RTO in-region, PITR is for recovering from logical errors, cross-region replicas are for global reads and regional disaster recovery.
BigQuery Operations: Partitioning, Clustering, and Cost Control
BigQuery Ops Focus
In BigQuery you mainly tune performance and cost. Partitioning and clustering are your main design tools to control how much data queries scan.
Partitioning Tables
Partition tables by ingestion time or by a date-like column such as event_date. Queries that filter on that column only scan relevant partitions.
Using Partition Filters
Always include filters on the partition column in WHERE clauses. If queries filter on another column, consider repartitioning on that column.
Clustering Tables
Clustering organizes data within partitions based on up to four columns. It is best for columns frequently used in filters and joins.
Cost Control Tactics
Avoid SELECT *. Set table expiration on staging tables, use materialized views or scheduled queries, and monitor scanned bytes and costs.
Common Misconceptions
Partitioning by user ID is usually wrong. Clustering improves query cost, not storage cost, by letting BigQuery skip unnecessary data blocks.
Thought Exercise: Choosing the Right Tool
Work through these scenarios and decide which feature or service you would use. Think it through before checking the hints.
- Scenario A: Customer invoices downloads
- You store invoices as PDFs in Cloud Storage.
- Customers access them via your web app after logging in.
- Requirements: files must not be public, URLs should expire after 10 minutes, and downloads should not overload your backend.
- Question: How do you design access to these PDFs?
- Hint: Think about offloading bandwidth and granting temporary access per object.
- Scenario B: Recover from accidental DELETE in Cloud SQL
- An engineer accidentally ran `DELETE FROM orders;` at 09:17 this morning.
- You need to restore the database to 09:16 with minimal data loss.
- Question: Which Cloud SQL feature do you rely on, and what high-level steps do you take?
- Hint: Consider transaction logs and time-based restore.
- Scenario C: BigQuery monthly reports cost spike
- A 5 TB table `events` is time-partitioned on `event_date`.
- Analysts run monthly reports but use `WHERE userregion = 'EU'` without filtering on `eventdate`.
- Costs are high and queries are slow.
- Question: What two changes would you suggest to improve performance and lower cost?
- Hint: Think about filters and table design.
After you answer, compare with these model answers:
- A: Use signed URLs generated by a backend service after user authentication, so clients download directly from Cloud Storage.
- B: Use point-in-time recovery to restore a new instance to 09:16, then switch traffic to that instance.
- C: Update queries to filter on `eventdate` (e.g., last month) and consider **clustering** on `userregion` if it is a common filter.
Quick Check: Cloud Storage and Cloud SQL
Answer this question to test your understanding of lifecycle, versioning, and Cloud SQL resilience.
Your company stores user-uploaded images in a Cloud Storage bucket. Requirements: - Images are frequently accessed for the first 7 days, then rarely. - Must keep the ability to restore accidentally overwritten images for 60 days. - Storage cost should be minimized after the first week. What is the BEST configuration?
- Use Standard storage, disable versioning, and create a lifecycle rule to delete objects after 60 days.
- Use Standard storage with object versioning, and a lifecycle rule to delete noncurrent versions older than 60 days.
- Use Standard storage for 7 days, then lifecycle to move objects to Nearline; enable object versioning and delete noncurrent versions older than 60 days.
- Use Archive storage from day 0 with object versioning and no lifecycle rules.
Show Answer
Answer: C) Use Standard storage for 7 days, then lifecycle to move objects to Nearline; enable object versioning and delete noncurrent versions older than 60 days.
You want hot access for 7 days (Standard), then cheaper storage (Nearline) after that, plus versioning to recover overwrites for 60 days. A lifecycle rule moves objects to Nearline after 7 days, and another rule deletes noncurrent versions older than 60 days. Option 2 ignores the cheaper storage class; option 1 has no overwrite protection; option 4 makes initial access too slow and costly in retrieval.
Quick Check: BigQuery and Cloud SQL Scenarios
Test your ability to pick the right operational feature for a scenario.
A reporting team complains that their BigQuery queries on a 20 TB `transactions` table are slow and expensive. The table is currently unpartitioned. Most queries filter on a `transaction_date` column and `country_code`. What is the MOST effective first step?
- Enable clustering on `country_code` without changing the table schema.
- Partition the table by `transaction_date` and update reports to filter on that column.
- Increase the BigQuery slot commitments for the project to speed up queries.
- Rewrite all queries to use SELECT * to simplify optimization.
Show Answer
Answer: B) Partition the table by `transaction_date` and update reports to filter on that column.
Partitioning by `transaction_date` matches the main filter and allows queries to scan only relevant partitions, dramatically reducing scanned bytes and cost. Updating reports to filter on the partition column is essential. Clustering alone on `country_code` (option 1) helps less than partitioning by date. Increasing slots (option 3) may speed queries but does not address cost from scanning too much data. SELECT * (option 4) usually increases cost.
Key Concepts Review
Use these flashcards to reinforce core terms and behaviors you need for the exam.
- Cloud Storage lifecycle rule
- A bucket-level configuration that automatically performs actions (such as changing storage class or deleting objects) when specified conditions like object age, creation date, live status, or prefix/suffix filters are met.
- Object versioning in Cloud Storage
- A feature that keeps multiple versions of an object in the same bucket. When enabled, overwrites and deletions create noncurrent versions instead of permanently deleting data, allowing recovery of older copies.
- Signed URL (Cloud Storage)
- A time-limited URL that grants access to a specific Cloud Storage object without requiring a Google identity or IAM role. Commonly used to let clients download or upload objects directly after a backend authorizes them.
- Cloud SQL automated backup
- A scheduled backup of a Cloud SQL instance taken automatically within a defined window and retained for a configured period, used for disaster recovery and point-in-time recovery.
- Point-in-time recovery (PITR) in Cloud SQL
- A capability that uses transaction logs (such as binary logs) to restore a database to an exact time within a retention window, typically by creating a new instance representing the state at that time.
- Cloud SQL high availability (HA)
- A regional configuration where a Cloud SQL primary instance has a synchronous standby in another zone of the same region, enabling automatic failover if the primary zone or instance becomes unavailable.
- BigQuery partitioned table
- A table that is divided into segments (partitions) based on ingestion time or a specific column, typically a DATE or TIMESTAMP. Queries that filter on the partition key scan only relevant partitions, reducing cost and improving performance.
- BigQuery clustering
- An optimization that organizes data within a table (or partition) based on one or more clustering columns, allowing BigQuery to skip reading large portions of data when queries filter or aggregate on those columns.
- Table expiration (BigQuery)
- A setting that automatically deletes a table after a specified period, useful for temporary or staging data to control storage costs without manual cleanup.
- Connection pooling (Cloud SQL)
- An application or proxy pattern where a limited set of database connections is reused for many requests, reducing overhead and preventing exhaustion of the Cloud SQL instance's maximum connections.
Key Terms
- BigQuery
- Google Cloud's serverless, highly scalable data warehouse for analytics, with SQL interface and built-in features for partitioning, clustering, and cost-based optimization.
- Cloud SQL
- Google Cloud's managed relational database service for MySQL, PostgreSQL, and SQL Server, providing automated backups, maintenance, HA, and integration with other Google Cloud services.
- Signed URL
- A time-bound URL that grants temporary access to a specific Cloud Storage object without requiring the caller to authenticate with Google Cloud.
- Read replica
- A read-only copy of a primary Cloud SQL instance that receives replicated data and is used to offload read traffic or provide disaster recovery.
- Cloud Storage
- Google Cloud's object storage service for unstructured data such as images, backups, and logs. It offers multiple storage classes, lifecycle rules, and features like object versioning and signed URLs.
- Lifecycle rule
- A configuration on a Cloud Storage bucket that automatically transitions or deletes objects based on conditions like age, creation date, live status, or name prefix/suffix.
- Clustered table
- A BigQuery table where data is organized by clustering columns, improving performance and reducing scanned data for queries that filter or aggregate on those columns.
- Automated backup
- A scheduled, system-managed backup of a database instance, used for recovery from failures or data loss.
- On-demand backup
- A backup of a database instance triggered manually, often taken before risky changes such as schema migrations.
- Table expiration
- A BigQuery setting that defines when a table will be automatically deleted, helping control storage costs for temporary data.
- Object versioning
- A Cloud Storage feature that keeps multiple versions of an object when it is overwritten or deleted, allowing recovery of previous versions.
- Partitioned table
- A BigQuery table divided into partitions based on ingestion time or a specific column, enabling queries to scan only relevant partitions.
- Connection pooling
- A technique where a limited set of database connections is maintained and reused by an application, improving performance and preventing connection exhaustion.
- Point-in-time recovery (PITR)
- A recovery method that restores a database to a specific time using transaction logs, typically by creating a new instance representing the state at that time.
- High availability (HA) (Cloud SQL)
- A deployment mode where a Cloud SQL primary instance has a synchronous standby in another zone of the same region, enabling automatic failover.