Performing Failure Testing on Nutanix
The Nutanix Enterprise Cloud platform is very resilient, as its clusters and storage containers can be configured with Replication Factor (RF) 2 or RF 3. This ensures that two or three copies of the data are kept, each on a different node in the same cluster. That means a single cluster can tolerate a single node failure (RF 2) or two node failures (RF 3) without loss of cluster operations or production data.
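The relationship between RF and tolerated node failures can be written down as a one-line rule. The snippet below is only an illustrative sketch of that arithmetic; the function name `tolerated_node_failures` is made up for this post and is not part of any Nutanix tool or API.

```python
def tolerated_node_failures(rf: int) -> int:
    """Return how many node failures still leave at least one copy of the data.

    Hypothetical helper for illustration: with `rf` copies spread over
    different nodes, losing rf - 1 of those nodes leaves one copy intact.
    """
    if rf not in (2, 3):
        raise ValueError("this sketch only models RF 2 and RF 3")
    return rf - 1


if __name__ == "__main__":
    for rf in (2, 3):
        print(f"RF {rf}: {rf} copies, tolerates {tolerated_node_failures(rf)} node failure(s)")
```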
Besides the built-in data resiliency within a single Nutanix cluster, there is the option to configure a Protection Domain to a Remote Site (another Nutanix cluster) using a Metro Availability (MA) setup. For every VM write on the Primary Site, MA ensures that another two copies of the data (RF 2) are written to the Secondary Site. In Disaster Recovery Plan terminology, this translates to a Recovery Point Objective (RPO) of 0.
Nutanix makes it possible to set up MA with one of three types of Failure Handling:
- Manual: VM writes do not resume until a Nutanix Administrator disables MA manually or the problem is resolved
- Automatic Resume: VM writes resume automatically after 10 seconds by default
- Witness: a Nutanix Witness VM automatically distinguishes a Site Failure from a network interruption between the MA sites and decides whether or not to fail over
Option #3 is the recommended method, both to avoid a split-brain scenario and to ensure fast recovery on the Secondary Site.
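To make the difference between the Failure Handling types more tangible, here is a rough sketch of the decision each one makes once replication between the sites is interrupted. This is a simplified model for illustration only, not Nutanix code: the enum labels, function names and return strings are all made up here, and the Witness logic assumes the simplest reading of the description above (fail over on a Site Failure, stay on the Primary Site on a mere network interruption).

```python
from enum import Enum


class FailureHandling(Enum):
    # Labels chosen for this sketch, not identifiers from a Nutanix API.
    MANUAL = "manual"
    AUTOMATIC_RESUME = "automatic_resume"
    WITNESS = "witness"


DEFAULT_RESUME_TIMEOUT_S = 10  # default mentioned in the text above


def writes_on_primary(mode: FailureHandling,
                      seconds_since_interruption: float,
                      timeout_s: float = DEFAULT_RESUME_TIMEOUT_S) -> str:
    """Model what happens to VM writes on the Primary Site after replication breaks."""
    if mode is FailureHandling.MANUAL:
        # Nothing happens until an administrator steps in or the problem is resolved.
        return "paused until admin action"
    if mode is FailureHandling.AUTOMATIC_RESUME:
        # Writes resume on their own once the timeout has passed.
        if seconds_since_interruption >= timeout_s:
            return "resumed automatically"
        return "paused, waiting for timeout"
    # Witness mode: the decision is delegated to the Witness VM.
    return "decided by witness"


def witness_decision(primary_site_alive: bool) -> str:
    """Assumed Witness behaviour: fail over only when the Primary Site itself is down."""
    if primary_site_alive:
        return "network interruption: keep running on primary"
    return "site failure: fail over to secondary"


if __name__ == "__main__":
    print(writes_on_primary(FailureHandling.AUTOMATIC_RESUME, 12))
    print(witness_decision(primary_site_alive=False))
```

The sketch deliberately leaves out everything a real Witness has to deal with (for example, losing contact with the Witness itself); it only mirrors the three behaviours listed above.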
Metro Availability works in conjunction with a VMware High Availability (HA) cluster that stretches across both Sites, ensuring that, on the compute level, all VMs can be powered on in the event of a complete Site Failure.
Of course, all of the above is very good to have, as it keeps your Business Critical Applications and Data safe. However, it is also good to ‘witness’ how these functionalities work in real time. Therefore, a customer asked us to perform a number of Unplanned Failure Tests, including a complete Site Failover. These tests, as shown in the attached video, were performed on a Nutanix Enterprise Cloud environment that had not yet been released for Production usage. This made it the ideal candidate for such test activities.
Guess what? All Failover Tests were successful (as expected) making this yet another Nutanix success story! 🙂