Azure Service Healing

I often get asked what happens if an Azure service or resource crashes.
I’m also sometimes asked how Azure keeps Virtual Machines running 100% of the time.

Well, let’s start with the second question. It doesn’t! Azure is an extremely reliable platform, but it is still based on industry-standard physical servers, power, and networking. Sometimes a failure occurs that causes a VM to reboot or go offline. Having said that, uptime is of course extremely high, with some services higher than others. You can find official SLA listings here.

Now, regarding what happens if a service does fail: Azure has an Auto-Recovery feature called service healing. Auto-Recovery is available across all Virtual Machine sizes in all regions.
Azure has multiple ways to perform health checks on your resources. Every VM deployed in the form of a Web or Worker role has an agent injected into it that runs a health check every 15 seconds. A web farm behind a load balancer also has health checks performed by the load balancer itself. If a predefined number of consecutive health checks fail, or a signal from the load balancer causes a role to become unhealthy, a recovery action is initiated, which is to restart the role instance.
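To make the "consecutive failed health checks" rule concrete, here is a minimal Python sketch. It is only an illustration and not Azure's actual agent code: the class name `RoleHealthTracker`, the `record` method, and the threshold of 3 are all assumptions, since the real predefined number is not stated here.

```python
from dataclasses import dataclass


@dataclass
class RoleHealthTracker:
    """Illustrative tracker for one role instance (hypothetical, not an Azure API)."""

    failure_threshold: int = 3  # assumed value; the real threshold is not stated
    consecutive_failures: int = 0
    restarts: int = 0

    def record(self, healthy: bool, lb_unhealthy_signal: bool = False) -> bool:
        """Record one health check result; return True if a restart is initiated."""
        if lb_unhealthy_signal:
            # A signal from the load balancer marks the role unhealthy right away.
            self._restart()
            return True
        if healthy:
            # Any successful check breaks the run of consecutive failures.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self._restart()
            return True
        return False

    def _restart(self) -> None:
        # Recovery action: restart the role instance and start counting afresh.
        self.restarts += 1
        self.consecutive_failures = 0


tracker = RoleHealthTracker(failure_threshold=3)
checks = [False, False, True, False, False, False]
results = [tracker.record(ok) for ok in checks]
print(results, tracker.restarts)
```

Note that the single healthy check in the middle resets the counter, so only the final run of three failures triggers a restart.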

Another test performed is on the health of the virtual machine itself, within which the role instance is running. The virtual machine is hosted on a physical server running inside an Azure datacenter. The physical server runs another agent called the Host Agent. The Host Agent monitors the health of the virtual machine by pinging the guest agent every 15 seconds. It is quite plausible that a virtual machine is simply under stress from its workload, for example with its CPU at 100% utilization. Because a machine may be under heavy load rather than failed, Azure will wait 10 minutes before performing a recovery action. The recovery action in this case depends on the type of instance: for a Web or Worker Role, the virtual machine is recycled with a clean OS disk; for an Azure Virtual Machine, we perform a reboot that preserves the disk state.
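The timing above can be sketched in a few lines of Python. This is a rough model under stated assumptions, not the Host Agent's real logic: it assumes the 10-minute wait means 10 minutes of continuously missed pings, and the function names and the `"role"` / `"vm"` labels are made up for illustration.

```python
from typing import Iterable, Optional, Tuple

PING_INTERVAL_SECONDS = 15       # Host Agent pings the guest agent every 15 seconds
GRACE_PERIOD_SECONDS = 10 * 60   # wait 10 minutes before any recovery action


def recovery_action(instance_kind: str) -> str:
    """Map the instance type to the recovery action described in the text."""
    if instance_kind == "role":   # Web or Worker Role (hypothetical label)
        return "recycle with clean OS disk"
    if instance_kind == "vm":     # Azure Virtual Machine (hypothetical label)
        return "reboot preserving disk state"
    raise ValueError(f"unknown instance kind: {instance_kind!r}")


def first_recovery(
    ping_results: Iterable[bool], instance_kind: str
) -> Optional[Tuple[int, str]]:
    """Return (ping index, action) for the first recovery, or None if none is due.

    Assumption: a single answered ping resets the grace period.
    """
    unresponsive_seconds = 0
    for index, responded in enumerate(ping_results):
        if responded:
            unresponsive_seconds = 0
            continue
        unresponsive_seconds += PING_INTERVAL_SECONDS
        if unresponsive_seconds >= GRACE_PERIOD_SECONDS:
            return index, recovery_action(instance_kind)
    return None


# 39 missed pings is 9 minutes 45 seconds: still inside the grace period.
print(first_recovery([False] * 39, "vm"))
# The 40th consecutive missed ping reaches 10 minutes and triggers recovery.
print(first_recovery([False] * 40, "vm"))
```

With a 15-second interval, the grace period works out to 40 consecutive missed pings, which is why a busy but still responsive machine is left alone.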

Apart from this, Azure takes as many measures as possible to predict failures in advance. This includes extensive monitoring of all hardware in the datacenter, including CPU, disk I/O, etc.