The Problem — Single Points of Failure
Imagine you have a web application running on a single Azure VM. Two things can take it down:
- Unplanned downtime — The physical host server fails (power supply, hardware fault, network issue)
- Planned maintenance — Azure needs to update the host server's firmware or hypervisor and reboots it
In both cases, your application is unavailable until the VM restarts. For production workloads, this is unacceptable. The solution is to run multiple VMs and ensure they don't all fail at the same time — which is what Availability Sets and Zones enable.
Availability Sets
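An Availability Set is a logical grouping that tells Azure to spread VMs across multiple fault domains and update domains within a single data centre. When you create VMs in an Availability Set, Azure distributes them across these domains, so a single hardware failure or maintenance event affects only a subset of the VMs. VMs that land in the same fault domain can still share hardware, which is why a set needs at least two VMs to be useful.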
How to Create an Availability Set
An Availability Set must be created BEFORE the VMs that go into it. A VM can only join an Availability Set at the moment the VM is created; to move an existing VM into a set, you must delete it and recreate it with the Availability Set selected.
```bash
# Create the Availability Set first
az vm availability-set create \
  --resource-group myResourceGroup \
  --name myAvailabilitySet \
  --location centralindia \
  --platform-fault-domain-count 2 \
  --platform-update-domain-count 5

# Create VMs inside the Availability Set
az vm create \
  --resource-group myResourceGroup \
  --name myVM1 \
  --availability-set myAvailabilitySet \
  --image Ubuntu2204 \
  --size Standard_B2s \
  --admin-username azureuser \
  --generate-ssh-keys
```
Fault Domains
A fault domain is a group of VMs that share a common power source and network switch. If that power supply or switch fails, all VMs in that fault domain go down together.
Azure Availability Sets spread VMs across 2 or 3 fault domains (depending on the region). This means if one power supply fails, only some of your VMs are affected — others remain running.
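To make the effect of fault domains concrete, here is a minimal shell sketch that simulates placement and counts how many VMs a single fault-domain failure would take down. It assumes simple round-robin placement (VM index modulo the fault domain count); the platform decides real placement, and the helper names `fault_domain_of` and `vms_lost_if_fd_fails` are hypothetical, not part of any Azure tool.

```bash
# Illustrative only: simulates round-robin placement across fault domains.
# Assumption: VM number i (0-based) lands in fault domain i % fd_count.

# Print the fault domain a given VM index would land in.
fault_domain_of() {
  vm_index=$1; fd_count=$2
  echo $(( vm_index % fd_count ))
}

# Count how many of vm_count VMs sit in the failed fault domain.
vms_lost_if_fd_fails() {
  vm_count=$1; fd_count=$2; failed_fd=$3
  lost=0; i=0
  while [ "$i" -lt "$vm_count" ]; do
    if [ "$(fault_domain_of "$i" "$fd_count")" -eq "$failed_fd" ]; then
      lost=$(( lost + 1 ))
    fi
    i=$(( i + 1 ))
  done
  echo "$lost"
}

# Example: 5 VMs across 2 fault domains; fault domain 0 loses power.
echo "VMs lost: $(vms_lost_if_fd_fails 5 2 0) of 5"
```

With 5 VMs and 2 fault domains, the larger domain holds 3 VMs, so the application has to run acceptably on the remaining 2.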
Update Domains
An update domain is a group of VMs that Azure updates (reboots for host maintenance) at the same time. Azure spreads VMs across 5 update domains by default (up to 20).
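When Azure performs planned maintenance, it reboots only one update domain at a time and gives a rebooted update domain 30 minutes to recover before it starts on the next one. Update domains are not necessarily rebooted in sequential order. As long as your VMs span several update domains, most of them keep running throughout, so your application stays available while Azure maintains the underlying hardware.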
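The rollout can be sketched in a few lines of shell. This is a rough model, not Azure's scheduler: it assumes round-robin placement of VMs across update domains, uses the 30-minute pause described above as the only delay, ignores how long each reboot takes, and walks the domains in numeric order for simplicity. The function name `rollout_plan` is hypothetical.

```bash
# Illustrative only: earliest start time and VM impact for each update domain.
# Usage: rollout_plan <vm_count> <update_domain_count>
rollout_plan() {
  vm_count=$1; ud_count=$2; pause_min=30
  ud=0
  while [ "$ud" -lt "$ud_count" ]; do
    # Under round-robin placement this domain holds VM indices
    # ud, ud + ud_count, ud + 2*ud_count, ... below vm_count.
    if [ "$ud" -lt "$vm_count" ]; then
      in_ud=$(( (vm_count - ud + ud_count - 1) / ud_count ))
    else
      in_ud=0
    fi
    echo "t+$(( ud * pause_min ))min: update domain $ud reboots, $in_ud of $vm_count VMs offline"
    ud=$(( ud + 1 ))
  done
}

# Example: 7 VMs spread over the default 5 update domains.
rollout_plan 7 5
```

The point of the model is the ratio: with 7 VMs in 5 update domains, at most 2 VMs are offline at any step, so capacity planning should assume that share is unavailable during maintenance.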
| | Fault Domain | Update Domain |
|---|---|---|
| Protects against | Hardware failures (power, network) | Planned maintenance reboots |
| Number in Availability Set | 2–3 | Up to 20 (default 5) |
| When it matters | Unexpected hardware failure | Azure maintenance window |
Availability Zones for VMs
While Availability Sets protect within a single data centre, Availability Zones spread VMs across physically separate data centres (zones) within a region.
Each zone has its own independent power, cooling, and networking. A failure in one zone (even a complete data centre fire or flood) does not affect the other zones.
Deploying VMs Across Zones
```bash
# VM in Zone 1
az vm create \
  --resource-group myResourceGroup \
  --name myVM-zone1 \
  --zone 1 \
  --image Ubuntu2204 \
  --size Standard_B2s \
  --admin-username azureuser \
  --generate-ssh-keys

# VM in Zone 2
az vm create \
  --resource-group myResourceGroup \
  --name myVM-zone2 \
  --zone 2 \
  --image Ubuntu2204 \
  --size Standard_B2s \
  --admin-username azureuser \
  --generate-ssh-keys

# VM in Zone 3
az vm create \
  --resource-group myResourceGroup \
  --name myVM-zone3 \
  --zone 3 \
  --image Ubuntu2204 \
  --size Standard_B2s \
  --admin-username azureuser \
  --generate-ssh-keys
```
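Since the three commands differ only in the zone number, they can be generated in a loop. The sketch below is a dry run: it prints each `az vm create` command instead of executing it, so you can review the output first. The function name `zone_vm_commands` is hypothetical; the resource names and flags are the ones used above.

```bash
# Dry run: print one `az vm create` command per zone instead of executing it.
# To create the VMs for real, pipe the output to `sh` or remove the echo.
zone_vm_commands() {
  for zone in 1 2 3; do
    echo "az vm create" \
      "--resource-group myResourceGroup" \
      "--name myVM-zone${zone}" \
      "--zone ${zone}" \
      "--image Ubuntu2204" \
      "--size Standard_B2s" \
      "--admin-username azureuser" \
      "--generate-ssh-keys"
  done
}

zone_vm_commands
```

Printing first is a deliberate choice here: VM creation is billable, so a reviewable command list is safer than a loop that provisions immediately.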
Availability Sets vs Availability Zones
| Feature | Availability Set | Availability Zone |
|---|---|---|
| Protection scope | Within one data centre | Across separate data centres |
| Protects against | Hardware failures, planned maintenance | Data centre failures, as well as hardware failures and planned maintenance |
| SLA | 99.95% | 99.99% |
| Cost impact | Free to create (pay for VMs) | Pay for VMs in each zone |
| Network latency between VMs | Very low (same DC) | Low (<2ms between zones) |
| Available everywhere? | Yes — all regions | Only in AZ-supported regions |
| Recommended for new workloads? | Legacy — use zones instead | Yes — modern approach |
SLA Comparison
| Deployment | SLA | Allowed downtime per year |
|---|---|---|
| Single VM (Standard HDD) | 95% | ~18 days |
| Single VM (Premium SSD) | 99.9% | ~8.8 hours |
| VMs in Availability Set | 99.95% | ~4.4 hours |
| VMs across 2+ Availability Zones | 99.99% | ~53 minutes |
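The downtime figures follow directly from the SLA percentage: the allowed downtime is the fraction of the year not covered by the guarantee. A minimal sketch, assuming a 365-day year (8,760 hours) and using a hypothetical helper named `downtime_hours`:

```bash
# Convert an SLA percentage into allowed downtime per year, in hours.
# Assumes a 365-day year (8760 hours).
downtime_hours() {
  LC_ALL=C awk -v sla="$1" 'BEGIN { printf "%.2f\n", (100 - sla) / 100 * 8760 }'
}

for sla in 99.9 99.95 99.99; do
  echo "${sla}% -> $(downtime_hours "$sla") hours/year"
done
```

For 99.99% this gives 0.88 hours, or roughly 53 minutes a year. Note that Azure measures SLA compliance per billing month, so the annual figure is a convenient way to compare tiers rather than the contractual measure.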
Designing a Highly Available Application
Here's a practical high availability design for a web application using Availability Zones:
| Layer | Zone 1 | Zone 2 | Zone 3 | Traffic distribution / replication |
|---|---|---|---|---|
| Web tier | VM-web-1 | VM-web-2 | VM-web-3 | Zone-redundant Load Balancer |
| App tier | VM-app-1 | VM-app-2 | VM-app-3 | Internal Load Balancer |
| Database | SQL Primary | SQL Secondary | SQL Secondary | SQL Always On |
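With this design, if Zone 2 fails completely, Zone 1 and Zone 3 continue serving traffic, provided they have enough capacity to absorb the extra load. The Load Balancer's health probes detect the failed instances and route around them automatically. Because Zone 2 hosts only a SQL secondary, no database failover is needed, so users should see little or no disruption.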
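For the surviving zones to carry all traffic, each one needs headroom. The sketch below estimates the share of total peak load each healthy zone must handle, assuming the load balancer spreads traffic evenly across healthy zones (real distribution depends on the load-balancing rules and instance counts). The function name `load_per_surviving_zone` is hypothetical.

```bash
# Share of total peak load (in percent) each surviving zone must carry,
# assuming traffic is spread evenly across healthy zones.
# Usage: load_per_surviving_zone <total_zones> <failed_zones>
load_per_surviving_zone() {
  zones=$1; failed=$2
  surviving=$(( zones - failed ))
  if [ "$surviving" -le 0 ]; then
    echo "no surviving zones" >&2
    return 1
  fi
  LC_ALL=C awk -v n="$surviving" 'BEGIN { printf "%.1f\n", 100 / n }'
}

echo "Normal:        $(load_per_surviving_zone 3 0)% per zone"
echo "One zone down: $(load_per_surviving_zone 3 1)% per zone"
```

In a three-zone design each zone normally carries about a third of the load but must be sized for half of it, otherwise losing one zone overloads the other two.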