Data Resiliency Not Possible & Adding New Disks to Nutanix Nodes

One of the great benefits of the Nutanix Hyperconverged Infrastructure (‘HCI’) platform is that you can easily expand your Nutanix cluster with new nodes when you need more CPU, RAM, and storage resources. But what if you find yourself quickly running out of storage space on your Nutanix cluster without needing more CPU or RAM? Even worse, what if you have let this storage problem linger for too long? Now, Nutanix Prism Element (the cluster GUI) is showing a Critical warning message stating that your Nutanix cluster is no longer Data Resilient and cannot tolerate a node failure without data loss…! Well, this happened to one of my customers, and we remedied it by adding more disks instead of nodes. Read on to learn how we got this Nutanix cluster back to being Data Resilient.

No Rebuild Capacity Available

You are confronted with the Critical alert “Data Resiliency Not Possible” & “Cluster can not tolerate 1 node failure(s) and guarantee available rebuild capacity” when you have used up too much storage space on your Nutanix cluster. Do note that your cluster is still fully operational, without any Production workload issues, but it is warning you that it cannot tolerate a single node failure and still remain data resilient after that event. After such a node failure, you would be left with only 1 copy of a data block instead of 2 with RF2, or only 2 copies instead of 3 with RF3. There simply is not enough space left within the Nutanix cluster to store sufficient data block copies.

It goes without saying that this situation requires immediate attention.

Maximum Storage Usage on a Nutanix Cluster

The Nutanix Prism Element dashboard shows the total storage space (logical/physical) and how much is currently in use versus free. In my experience, Nutanix customers are often under the impression that they can use up almost all of the storage space available in the cluster. This is definitely NOT the case. There is a limit to how much storage you can use, determined by the specifics of each Nutanix cluster, and that limit is well below the maximum shown by that widget on the Prism Element dashboard.

The formula for calculating the recommended maximum usage of a cluster is one of the following.

    • Recommended maximum utilization of a cluster with containers using replication factor (RF)=2
      • M = 0.9 x (T – N1)
    • Recommended maximum utilization of a cluster with containers using replication factor (RF)=3
      • M = 0.9 x (T – [N1 + N2])

Where:

    • M = Recommended maximum usage of the cluster
    • T = Total available physical storage capacity in the cluster
    • N1 = Storage capacity of the node with the largest amount of storage
    • N2 = Storage capacity of the node with the next largest amount of storage

Please note that the formula takes the following into consideration:

    1. Space is reserved in the cluster to tolerate one full node failure.
    2. 10% space is reserved as a buffer for various actions such as adding or removing nodes or VMs as needed.
    3. The physical storage capacity of a node can be found under Hardware > Diagram in the Prism web console. Click a specific node in the diagram and check the value of Storage Capacity.

Apart from the above formula, which determines when that Critical Data Resiliency alert is triggered, there is another limit you do not want to reach: 90% physical storage space usage on your Nutanix cluster. If you cross this line, one or more Nutanix (CVM) cluster services will start to shut down due to a lack of required storage space. This is the point where your Production workloads will suffer from performance issues or even outages.
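To make this concrete, here is a minimal Python sketch of the calculation above, applied to a hypothetical 4-node all-flash cluster with 20 TiB of physical storage per node. The capacity and usage numbers are made-up example values, not from my customer’s cluster.

    def recommended_max_usage(node_capacities_tib, rf=2):
        """Recommended maximum usage (M): 0.9 x (T - N1) for RF2,
        or 0.9 x (T - [N1 + N2]) for RF3."""
        caps = sorted(node_capacities_tib, reverse=True)
        total = sum(caps)  # T: total physical capacity in the cluster
        reserve = caps[0] if rf == 2 else caps[0] + caps[1]  # N1 (+ N2)
        return 0.9 * (total - reserve)

    # Hypothetical example: four nodes of 20 TiB each, 56 TiB currently in use
    nodes = [20.0, 20.0, 20.0, 20.0]
    used = 56.0

    m = recommended_max_usage(nodes, rf=2)
    print(f"Recommended maximum usage (RF2): {m:.1f} TiB")  # 54.0 TiB
    if used > m:
        print("WARNING: above recommended maximum; expect the Data Resiliency alert")
    if used > 0.9 * sum(nodes):
        print("CRITICAL: above 90% physical usage; CVM services may shut down")

Note how, in this example, the cluster should not be filled beyond 54 TiB even though the dashboard shows 80 TiB of physical capacity.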

Adding New Disks

Back to my customer case mentioned above. Luckily, each node in this 4-node Nutanix cluster still had 4 unused disk slots, allowing the storage in the Nutanix cluster to be expanded by simply adding more disks (SSDs in this case, as these were all-flash nodes).

In case you do not have any unused disk slots in your Nutanix nodes, you are left with the following options, listed with the “low hanging fruit” first:

    1. Delete unused or no longer required virtual machines or vdisks
    2. Offload the Nutanix cluster by transferring data to another storage platform
    3. Order and add new Nutanix nodes to your Nutanix cluster

Checking the Nutanix Hardware Compatibility List (‘HCL’)

When ordering new SSDs, we always check the Nutanix Hardware Compatibility List (‘HCL’) available via https://portal.nutanix.com/page/documents/list?type=compatibilityList. You need to ensure that the new SSDs are of a make and model supported by the Nutanix platform. If not, you will end up in a situation in which the new SSDs are not recognized by Nutanix and thus cannot be used.

In my customer’s case, we used HPE P05986-B21 SSDs, which are listed on the HCL and were already present in their existing HPE ProLiant DL380 Gen10 nodes. Either way, I always check the Nutanix HCL when dealing with hardware replacements or additions.

Ensure up-to-date Nutanix Foundation

According to the Nutanix HCL, we needed to ensure that Nutanix Foundation was up to date. Foundation plays a key part in hardware-level changes like this within a Nutanix cluster, and it plays the same key role when you are upgrading the BIOS or BMC on your Nutanix nodes. Always keep Nutanix Foundation up to date, alongside Nutanix NCC. These are critical services that you can always update (using Upgrade Software within Nutanix Prism) without disruption to your Production workloads. The Nutanix CVMs do not even need a reboot when updating these services, and the updates take only minutes to complete. 🙂

Adding the Disks

The actual SSDs were hot-added to the nodes, so there was no need to power off a node beforehand. This was perfect in my customer’s case because we “could not tolerate a node failure”. Adding the disks was quite easy, as these already included the disk enclosures.

Once each disk was physically added to a node, the Nutanix node automatically detected the SSD and started the “repartition and add” process. This process creates the required partitions on the SSD, with partitions reserved for Nutanix services and for the cluster storage pool. You can track this activity by logging in to Nutanix Prism Element and looking at the Hardware > Diagram tab. Each new disk will first be highlighted in red and, when the process is completed, it will turn white with a disk number assigned. When selecting that new disk, you can view its details: ID, serial number, type, size, etc.
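If you want to verify the new disks outside of the Prism GUI, you can also query the Prism Element v2 REST API, which exposes the cluster’s physical disks. Below is a minimal sketch; the hostname and credentials are placeholders, and the exact response fields can vary per AOS version, so treat this as an illustration rather than a definitive script.

    import requests

    # Placeholders: replace with your own Prism Element address and credentials
    PRISM = "https://prism-element.example.local:9440"
    AUTH = ("admin", "secret")

    # Prism Element v2 API endpoint that lists the physical disks in the cluster
    resp = requests.get(
        f"{PRISM}/PrismGateway/services/rest/v2.0/disks",
        auth=AUTH,
        verify=False,  # only acceptable in labs with self-signed certificates
    )
    resp.raise_for_status()

    # Print a short summary per disk; .get() is used defensively because
    # field names can differ slightly between AOS versions
    for disk in resp.json().get("entities", []):
        hw = disk.get("disk_hardware_config") or {}
        print(hw.get("serial_number"), disk.get("storage_tier_name"), disk.get("disk_size"))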

So, this process is really easy as it only involves physically adding the disks to the nodes; Nutanix does the rest. 🙂

Data Resilience Back to OK!

After having added all 16 disks, the Nutanix cluster for this customer went back to a healthy Data Resilience status! However, I did advise them to prepare for adding more nodes to the cluster, based on their storage usage. If that usage does not change, and they remain unable to relocate or delete data, there is no other choice.

Storage Runway Chart in Nutanix Prism Central

Always keep a close eye on your Nutanix cluster storage usage and take appropriate action to avoid ever seeing that Critical “Data Resiliency Not Possible” alert on your Nutanix Prism Element dashboard. To help you in this area, Nutanix Prism Central offers a feature called “Storage Runway”, available in the “Planning” section. This runway chart shows your past and current cluster storage usage and, more importantly, a predictive graph line showing when you could run out of storage in the future. There are also “CPU Runway” and “Memory Runway” charts available. Cool stuff to make your system administration life a lot easier. 😉

Thank you

Thank you for reading this blog post and feel free to post your comments below.

 
