
Chapter 18 of 20

Service Health and Resiliency: Azure Service Health and Status

Differentiate between problems in your own deployment and platform-wide issues by using Azure Service Health and status information.

27 min read

Big Picture: Why Service Health Matters

Why This Matters

When something goes wrong, you need to tell whether the cause is your own deployment or a platform-wide Azure issue. This is a common AZ-900 scenario.

Monitor vs Service Health

  • Azure Monitor: health and performance of your own resources.
  • Azure Service Health: health of the Azure services and regions your subscriptions use.
  • Azure status page: broad, public view of widespread incidents.

What You Will Be Able To Do

You will describe Azure Service Health, compare it with the public status page, and explain how health alerts and health history help with troubleshooting and resiliency planning.

Azure Status Page vs Azure Service Health

Azure Status Page

A public website that requires no sign-in. It shows the high-level health of Azure services by region and lists only broad, ongoing incidents. It has no knowledge of your specific subscriptions or resources.

Azure Service Health

An Azure portal experience that is aware of your subscriptions. It shows service issues, planned maintenance, health advisories, and security advisories that affect the services and regions you use.

Common Exam Trap

Questions about "publicly visible" or "overall Azure health" point to the status page. "Personalized impact" or "alerts" point to Azure Service Health.
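To make the contrast concrete, here is a minimal Python sketch of the two views applied to the same incident list. The `Incident` record, its fields, and both view functions are hypothetical illustrations, not an Azure API. The point is only that the public page filters on how broad an incident is, while Service Health filters on whether the incident touches your subscriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical, simplified incident record (not an Azure API type)."""
    service: str
    region: str
    widespread: bool                   # broad impact across many customers?
    affected_subscriptions: frozenset  # subscription IDs this incident touches


def status_page_view(incidents):
    """Public view: lists only broad incidents and ignores subscriptions."""
    return [i for i in incidents if i.widespread]


def service_health_view(incidents, my_subscriptions):
    """Personalized view: every incident that touches one of your subscriptions."""
    mine = set(my_subscriptions)
    return [i for i in incidents if i.affected_subscriptions & mine]


incidents = [
    Incident("App Service", "West Europe", False, frozenset({"sub-a"})),
    Incident("Storage", "East US", True, frozenset({"sub-b", "sub-c"})),
]
print([i.service for i in status_page_view(incidents)])                # ['Storage']
print([i.service for i in service_health_view(incidents, ["sub-a"])])  # ['App Service']
```

The narrow App Service incident never reaches the public page, but it is exactly what the owner of `sub-a` needs to see.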

Inside Azure Service Health: Main Components

Four Key Areas

Azure Service Health shows: 1) Service issues, 2) Planned maintenance, 3) Health advisories, and 4) Security advisories, all scoped to your subscriptions.

Service Issues & Maintenance

Service issues are active incidents affecting Azure services. Planned maintenance is scheduled work that may affect you, and each event lists its maintenance window and expected impact.

Advisories & Filters

Health and security advisories warn about changes and risks. You can filter by subscription, region, service, and time to see what matters to you.
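The following minimal sketch shows how the four event types and the filters relate. It assumes a simplified, hypothetical `HealthEvent` record with made-up sample data. The portal performs this filtering for you, and none of these names belong to an Azure SDK.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class EventType(Enum):
    SERVICE_ISSUE = "Service issue"
    PLANNED_MAINTENANCE = "Planned maintenance"
    HEALTH_ADVISORY = "Health advisory"
    SECURITY_ADVISORY = "Security advisory"


@dataclass(frozen=True)
class HealthEvent:
    """Hypothetical Service Health event, reduced to the fields we filter on."""
    event_type: EventType
    subscription: str
    region: str
    service: str
    start: datetime


def filter_events(events, subscriptions=None, regions=None, services=None,
                  since=None, until=None):
    """Keep events that match every filter supplied; None means 'no filter'."""
    result = []
    for e in events:
        if subscriptions is not None and e.subscription not in subscriptions:
            continue
        if regions is not None and e.region not in regions:
            continue
        if services is not None and e.service not in services:
            continue
        if since is not None and e.start < since:
            continue
        if until is not None and e.start > until:
            continue
        result.append(e)
    return result


events = [
    HealthEvent(EventType.SERVICE_ISSUE, "prod", "West Europe", "App Service",
                datetime(2024, 5, 2, 9, 0)),
    HealthEvent(EventType.PLANNED_MAINTENANCE, "prod", "East US", "Virtual Machines",
                datetime(2024, 5, 10, 1, 0)),
    HealthEvent(EventType.HEALTH_ADVISORY, "dev", "West Europe", "App Service",
                datetime(2024, 5, 3, 12, 0)),
]
we_prod = filter_events(events, subscriptions={"prod"}, regions={"West Europe"})
print([e.event_type.value for e in we_prod])  # ['Service issue']
```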

Scenario Walkthrough: Is It My App or Azure?

The Problem

Your web app in West Europe is slow, but Azure Monitor metrics and logs look normal. You suspect Azure itself might be having issues.

Two Checks

1) Check the public Azure status page for a broad incident in West Europe. 2) Open Azure Service Health in the portal and filter by your subscription, the West Europe region, and App Service.

Drawing a Conclusion

If Service Health shows an active App Service issue in West Europe, the slowdown is most likely platform-level. If it shows nothing, the cause is more likely in your app, its configuration, or its network.
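The reasoning in this walkthrough can be sketched as a small Python function. It assumes a hypothetical input: the (service, region) pairs of active service issues that Service Health lists for your subscription. The status page check is left out because it is not subscription-aware and so cannot confirm impact on your subscription.

```python
def diagnose(app_service: str, app_region: str,
             active_issues: list[tuple[str, str]]) -> str:
    """Return a first guess about where the problem lies.

    active_issues: (service, region) pairs for active service issues that
    Service Health shows for your subscription (hypothetical input format).
    """
    if (app_service, app_region) in active_issues:
        return "platform-level"
    # No matching platform issue: investigate your own app, config, and network.
    return "probably resource-level"


issues = [("Virtual Machines", "East US")]
print(diagnose("App Service", "West Europe", issues))  # probably resource-level
issues.append(("App Service", "West Europe"))
print(diagnose("App Service", "West Europe", issues))  # platform-level
```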

Health Alerts and Notifications: Conceptual Setup

Why Configure Alerts?

With health alerts, you are notified automatically when Azure has issues or maintenance that affect you, instead of having to watch the portal manually.

Scope and Event Types

You pick subscriptions, regions, and services, then choose event types: service issues, planned maintenance, health advisories, and security advisories.
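Behind the portal, a Service Health alert is stored as an activity log alert rule whose condition matches the ServiceHealth category. The sketch below builds a condition in that shape for a chosen set of event types, services, and regions. The subscriptions come from the rule's scope, which is not shown. The field names and incidentType values reflect our understanding of the activity log alert schema and should be checked against the current Azure Resource Manager reference before use. The helper function and the event-type keys are hypothetical.

```python
import json

# Assumed mapping from the four Service Health event types to activity log
# incidentType values (verify against current Azure documentation).
INCIDENT_TYPES = {
    "service_issue": ["Incident"],
    "planned_maintenance": ["Maintenance"],
    "health_advisory": ["Informational", "ActionRequired"],
    "security_advisory": ["Security"],
}


def service_health_condition(event_types, services, regions):
    """Build an allOf condition dict shaped like a Service Health alert rule."""
    incident_types = [t for et in event_types for t in INCIDENT_TYPES[et]]
    return {
        "allOf": [
            {"field": "category", "equals": "ServiceHealth"},
            {"anyOf": [{"field": "properties.incidentType", "equals": t}
                       for t in incident_types]},
            {"field": "properties.impactedServices[*].ServiceName",
             "containsAny": list(services)},
            {"field": "properties.impactedServices[*].ImpactedRegions[*].RegionName",
             "containsAny": list(regions)},
        ]
    }


cond = service_health_condition(["service_issue", "planned_maintenance"],
                                ["App Service"], ["West Europe", "East US"])
print(json.dumps(cond, indent=2))
```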

Action Groups

Alerts send notifications through action groups, which can use email, SMS, mobile push, or webhooks, so the right people and tools are notified as soon as something happens.
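An action group is essentially a named list of receivers that many alert rules can reuse. The hypothetical `ActionGroup` class below is an in-memory model of that idea, not the Azure SDK. The addresses are placeholders, and mobile push is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ActionGroup:
    """Hypothetical in-memory model of an action group (not the Azure SDK)."""
    name: str
    emails: list[str] = field(default_factory=list)
    sms_numbers: list[str] = field(default_factory=list)
    webhooks: list[str] = field(default_factory=list)

    def notifications(self, message: str) -> list[tuple[str, str, str]]:
        """Fan one alert out to every receiver as (channel, target, message)."""
        out = [("email", e, message) for e in self.emails]
        out += [("sms", n, message) for n in self.sms_numbers]
        out += [("webhook", w, message) for w in self.webhooks]
        return out


ops = ActionGroup("ops-oncall",
                  emails=["oncall@example.com"],
                  sms_numbers=["+15550100"],
                  webhooks=["https://example.com/incident-hook"])
for channel, target, _ in ops.notifications("Service issue: App Service, West Europe"):
    print(channel, target)
```

Keeping receivers in one reusable object lets several alert rules share the same on-call contacts without repeating them.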

Thought Exercise: Designing Notifications for Stakeholders

Imagine your company runs a customer-facing application in two regions: East US and West Europe. You have three stakeholder groups:

  • Operations team (24/7): needs detailed, immediate alerts.
  • Product managers: need to know about customer-visible outages and planned maintenance that might require communication.
  • Executives: only want critical, high-level notifications.

Using Azure Service Health concepts, think through these questions:

  1. Which event types for each group?
     • Operations: Would you include service issues, planned maintenance, health advisories, and security advisories? Why?
     • Product managers: Would you include service issues and planned maintenance? What about health advisories that change behavior?
     • Executives: Would you limit this to major service issues in production subscriptions only?
  2. How would you use action groups?
     • Would you create separate action groups per stakeholder group (Ops, PM, Exec)?
     • Which channels would you use (email, SMS, mobile push, integration with incident management tools)?
  3. Region and service scope
     • Would you scope alerts to only production subscriptions and the two production regions?
     • How would this differ from alerts in a test or dev subscription?

Pause for a moment and sketch out, on paper or in your notes, three action groups and three health alert rules you would create. This exercise mirrors the kind of reasoning you need in AZ-900 scenario questions about communication and resiliency.
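Here is one possible answer to the exercise, sketched in Python. It defines three hypothetical alert rules, one per action group, all scoped to a production subscription named `prod` (an assumed name) and the two production regions. Event severity is not modeled, so the executive rule approximates "critical only" by subscribing to service issues alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical health alert rule: which events go to which action group."""
    action_group: str
    event_types: frozenset
    subscriptions: frozenset
    regions: frozenset


PROD = frozenset({"prod"})
PROD_REGIONS = frozenset({"East US", "West Europe"})

RULES = [
    AlertRule("ops", frozenset({"service_issue", "planned_maintenance",
                                "health_advisory", "security_advisory"}),
              PROD, PROD_REGIONS),
    AlertRule("product", frozenset({"service_issue", "planned_maintenance"}),
              PROD, PROD_REGIONS),
    AlertRule("exec", frozenset({"service_issue"}), PROD, PROD_REGIONS),
]


def recipients(event_type: str, subscription: str, region: str) -> list[str]:
    """Action groups whose rule matches this event's type, subscription, and region."""
    return [r.action_group for r in RULES
            if event_type in r.event_types
            and subscription in r.subscriptions
            and region in r.regions]


print(recipients("service_issue", "prod", "West Europe"))  # ['ops', 'product', 'exec']
print(recipients("health_advisory", "prod", "East US"))    # ['ops']
print(recipients("service_issue", "dev", "West Europe"))   # []
```

A dev or test subscription is absent from every rule here. A separate, lower-urgency rule that only emails the team is one way to cover it.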

Using Health History for Root Cause and Resiliency Planning

Health History Overview

Service Health keeps a record of past service issues, planned maintenance events, and advisories that affected your subscriptions. You can filter this history by date and export it.
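The following sketch illustrates filtering history by date and exporting it, using Python's standard `csv` module to write the matching records. The record layout and sample entries are hypothetical. In practice you would use the filter and export options in the portal.

```python
import csv
import io
from datetime import date

# Hypothetical history records: (tracking_id, event_type, service, region, start_date)
history = [
    ("EVT-1", "Service issue", "App Service", "West Europe", date(2024, 3, 4)),
    ("EVT-2", "Planned maintenance", "Virtual Machines", "East US", date(2024, 4, 18)),
    ("EVT-3", "Health advisory", "Storage", "West Europe", date(2024, 5, 9)),
]


def export_history(records, start: date, end: date) -> str:
    """Return CSV text for records whose start date falls within [start, end]."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["tracking_id", "event_type", "service", "region", "start_date"])
    for rec in records:
        if start <= rec[4] <= end:
            writer.writerow([*rec[:4], rec[4].isoformat()])
    return buf.getvalue()


print(export_history(history, date(2024, 4, 1), date(2024, 5, 31)))
```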

Root Cause and Planning

You can confirm whether an outage was platform-related, then use post-incident details and recurring patterns to guide resiliency improvements, such as adding geo-redundancy.

Exam Clue

If a question says "review past incidents" that affected your app, think Azure Service Health history, not just Azure Monitor metrics or logs.

Quick Check: Status Page vs Service Health

Test your understanding of the difference between the public Azure status page and Azure Service Health.

Your team reports problems with an app in North Europe. You want to see a personalized list of Azure platform issues that specifically affect your subscriptions and configure alerts for future incidents. Which option should you use?

  A) The public Azure status page
  B) Azure Service Health in the Azure portal
  C) Azure Monitor metrics for the app
  D) Azure Advisor recommendations

Answer: B) Azure Service Health in the Azure portal

Azure Service Health in the Azure portal provides a personalized view of service issues, planned maintenance, and advisories that affect your subscriptions and lets you configure health alerts. The public status page is high-level and not subscription-aware. Azure Monitor focuses on resource-level metrics/logs, and Azure Advisor provides optimization recommendations, not platform incident tracking.

Quick Check: Resource-Level vs Platform-Level Issues

Decide whether the described situation points to a resource-level issue or a platform-level issue.

You deployed a VM in East US. It is not responding to RDP, but Azure Service Health shows no service issues for Virtual Machines in East US, and the public status page is green for that region. What should you suspect first?

  A) A platform-wide outage in the East US region
  B) A configuration or networking problem with your specific VM
  C) A global Azure outage affecting all regions
  D) A planned maintenance event that always stops all VMs

Answer: B) A configuration or networking problem with your specific VM

If both Azure Service Health and the public status page show no issues, the problem is likely resource-level: for example, NSG rules, firewall settings, or OS configuration on your specific VM. Platform-wide or global outages would appear in Service Health and on the status page. Planned maintenance does not always stop all VMs and would also be visible in Service Health.

Key Term Review: Service Health and Status

Review these terms to reinforce the main concepts from this chapter.

Azure status page (public)
A public website that shows high-level health of major Azure services by region, focusing on broad, ongoing incidents. It is not personalized to your subscriptions or resources.
Azure Service Health
An Azure portal experience that provides a personalized view of service issues, planned maintenance, health advisories, and security advisories that affect your Azure subscriptions, services, and regions.
Service issues (in Service Health)
Ongoing Azure platform problems, such as outages or degraded performance, that affect one or more services or regions and may impact your resources.
Planned maintenance (in Service Health)
Scheduled work performed by Microsoft on Azure services or infrastructure that may affect availability or performance during defined maintenance windows.
Health advisories (in Service Health)
Notifications about important but not necessarily outage-related events, such as behavior changes, required configuration updates, or upcoming feature deprecations.
Security advisories (in Service Health)
Notifications about security-related issues or required actions that may be critical to protecting your Azure resources.
Health alerts (Azure Service Health)
Alert rules that notify you when new service issues, planned maintenance events, or advisories affect your selected subscriptions, services, and regions.
Action group
A reusable Azure resource that defines a set of notification and action preferences (such as email, SMS, push, or webhooks) used by alerts from Azure Monitor and Azure Service Health.
Resource-level issue vs platform-level issue
A resource-level issue affects a specific resource (such as a misconfigured VM or app). A platform-level issue is an Azure service or region problem visible in Azure Service Health and often on the public status page.
Health history (Azure Service Health)
A record of past service issues, planned maintenance, and advisories that impacted your subscriptions, useful for root cause analysis, audits, and resiliency planning.

Putting It Together: Exam-Style Mini Case

The Scenario

A Central US app was down for 30 minutes yesterday. You must confirm whether Azure caused it, set up notifications, and plan to reduce future impact.

Confirming the Cause

Use Azure Service Health history, filtered by subscription and the Central US region, to check whether a service issue overlaps the outage window, then review that issue's details.
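Matching history to an outage comes down to an interval-overlap check. Here is a minimal sketch, assuming hypothetical history entries with start and end times in UTC:

```python
from datetime import datetime, timedelta


def overlaps(a_start, a_end, b_start, b_end) -> bool:
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) share time."""
    return a_start < b_end and b_start < a_end


# Hypothetical outage window and history entries (UTC).
outage_start = datetime(2024, 6, 11, 14, 0)
outage_end = outage_start + timedelta(minutes=30)

history = [
    {"id": "EVT-7", "type": "Service issue", "region": "Central US",
     "start": datetime(2024, 6, 11, 13, 50), "end": datetime(2024, 6, 11, 14, 40)},
    {"id": "EVT-8", "type": "Service issue", "region": "East US",
     "start": datetime(2024, 6, 11, 14, 5), "end": datetime(2024, 6, 11, 14, 20)},
]

matches = [e["id"] for e in history
           if e["type"] == "Service issue" and e["region"] == "Central US"
           and overlaps(outage_start, outage_end, e["start"], e["end"])]
print(matches)  # ['EVT-7']
```

An empty result would point back toward a resource-level cause.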

Alerts and Resiliency

Create health alerts with action groups for on-call and product owners, then consider multi-region deployment and failover to improve resiliency.
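Why does a second region help? Here is a rough calculation. It assumes each region is independently available 99.9% of the time (an illustrative figure, not a published SLA) and that failover is instant. Under those assumptions, the app is down only when every region is down at once.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Availability if the app stays up while at least one region is up.

    Assumes region failures are independent and failover is instant.
    """
    return 1 - (1 - per_region) ** regions


for n in (1, 2):
    a = combined_availability(0.999, n)
    downtime_min = (1 - a) * 30 * 24 * 60  # expected downtime in a 30-day month
    print(f"{n} region(s): {a:.6f} availability, ~{downtime_min:.2f} min/month down")
```

Real regions do not fail fully independently, and failover takes time. Treat the result as an upper bound on the benefit rather than a prediction.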

Key Terms

Action group
A reusable Azure resource that defines a set of notification and action preferences (such as email, SMS, push, or webhooks) used by alerts from Azure Monitor and Azure Service Health.
Health alerts
Alert rules in Azure Service Health that notify you when new service issues, planned maintenance events, or advisories affect your selected subscriptions, services, and regions.
Health history
The record of past service issues, planned maintenance, and advisories in Azure Service Health that impacted your subscriptions.
Service issues
Ongoing Azure platform problems, such as outages or degraded performance, that affect one or more services or regions.
Azure status page
A public website that shows high-level health of major Azure services by region, focusing on broad, ongoing incidents, and not personalized to individual subscriptions.
Health advisories
Notifications about important but not necessarily outage-related events, such as behavior changes, required configuration updates, or upcoming feature deprecations.
Planned maintenance
Scheduled work performed by Microsoft on Azure services or infrastructure that may affect availability or performance during defined maintenance windows.
Security advisories
Notifications about security-related issues or required actions that may be critical to protecting Azure resources.
Azure Service Health
An Azure portal experience that gives a personalized view of service issues, planned maintenance, health advisories, and security advisories that affect your Azure subscriptions, services, and regions.
Platform-level issue
A problem with an Azure service or region that affects multiple customers and is visible in Azure Service Health and often on the public Azure status page.
Resource-level issue
A problem that affects a specific Azure resource, such as misconfiguration, code bugs, or local networking issues, typically investigated with Azure Monitor.
