Nutanix Medusa Error: Cassandra Gossip Fails
I want to share an interesting issue that I came across when expanding an existing Nutanix cluster with a new node. The existing cluster consisted of several HPE ProLiant DL380 servers, whereas the new node was an HPE ProLiant DX380 server. Why a DX model was chosen for the new node was not disclosed to me. Regardless, a Nutanix cluster consisting of both DL and DX servers is possible. DX models are fully integrated with Nutanix and are the result of a recent partnership between HPE and Nutanix. More information on this partnership and DX models can be found here.
Medusa Error: Cassandra Gossip Fails
Most of the installation process was a breeze:
- Performing the Nutanix Foundation on the new HPE DX380 server
- Adding the node (actually VMware ESXi host) to VMware vCenter Server as a standalone host
- Adding the ESXi host to the respective VMware Distributed Switches
- Performing the Expand Cluster operation from within Nutanix Prism Element
- Moving the ESXi host into the required VMware vSphere cluster
The problem occurred when Nutanix Prism started showing this node as an IP address instead of its FQDN on the Hardware page. No readings were available for CPU, RAM, etc., just as when a CVM is still in maintenance mode.
When checking via SSH and running “ncli host ls”, the respective CVM showed up as “Under Maintenance Mode : null”. After running the command to set maintenance mode to false, “ncli host edit id=<Id> enable-maintenance-mode="false"”, the node appeared to become operational again and services started to come up. It was short-lived, however, as Medusa reported the error “Cassandra gossip failed”:
nutanix@NTNX-xxxxxxx-A-CVM:10.10.234.151:~$ cs | grep -v UP
2020-11-17 14:22:07 INFO zookeeper_session.py:135 cluster is attempting to connect to Zookeeper
2020-11-17 14:22:07 INFO cluster:2642 Executing action status on SVMs 10.10.234.151,10.10.234.152,10.10.234.153,10.10.234.154,10.10.234.155,10.10.234.156
The state of the cluster: start
Lockdown mode: Disabled
CVM: 10.10.234.151 Up
CVM: 10.10.234.152 Up
CVM: 10.10.234.153 Up
CVM: 10.10.234.154 Up, ZeusLeader
CVM: 10.10.234.155 Up
2020-11-17 14:22:08 INFO cluster:2755 Success!
CVM: 10.10.234.156 Up
Medusa ERROR [6008, 23586, 23641, 23642] Cassandra gossip failed
DynamicRingChanger DOWN
Pithos DOWN
Mantle DOWN
Hera DOWN
Stargate DOWN
InsightsDB DOWN
InsightsDataTransfer DOWN
Ergon DOWN
Cerebro DOWN
Chronos DOWN
Curator DOWN
Athena DOWN
Prism DOWN
CIM DOWN
AlertManager DOWN
Arithmos DOWN
Catalog DOWN
Acropolis DOWN
Uhura DOWN
Snmp DOWN
SysStatCollector DOWN
Tunnel DOWN
Janus DOWN
NutanixGuestTools DOWN
MinervaCVM DOWN
ClusterConfig DOWN
Mercury DOWN
APLOSEngine DOWN
APLOS DOWN
Lazan DOWN
Delphi DOWN
XTrim DOWN
ClusterHealth DOWN
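If you capture this status output to a file, a few lines of Python can summarise which services are not UP. This is only a minimal sketch: the line format ("ServiceName STATE [details]") is assumed from the output above, the function name unhealthy_services is my own, and the "Zeus UP" sample line is hypothetical because healthy services were filtered out of my capture.

```python
import re

# Assumed line shape, taken from the output above:
#   "<ServiceName> <STATE> [optional details]"
SERVICE_LINE = re.compile(
    r"^\s*(?P<name>[A-Za-z]\w*)\s+(?P<state>UP|DOWN|ERROR)\b(?P<detail>.*)$"
)


def unhealthy_services(status_text):
    """Return (service, state, detail) for every service line that is not UP."""
    problems = []
    for line in status_text.splitlines():
        match = SERVICE_LINE.match(line)
        if match and match.group("state") != "UP":
            problems.append(
                (match.group("name"), match.group("state"), match.group("detail").strip())
            )
    return problems


# Sample input: the "Zeus UP" line is hypothetical, the rest is from the post.
sample = """\
CVM: 10.10.234.156 Up
Zeus UP [5000, 5031]
Medusa ERROR [6008, 23586, 23641, 23642] Cassandra gossip failed
DynamicRingChanger DOWN
Stargate DOWN
"""

for name, state, detail in unhealthy_services(sample):
    print(name, state, detail)
```

The "CVM: ... Up" and log lines are skipped on purpose, since they do not match the service pattern.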
What are Medusa and Cassandra?
Perhaps this is a good moment to explain what Medusa and Cassandra do within a Nutanix cluster. 🙂
Medusa. Key role: Access interface for Cassandra
Distributed systems that store data for other systems (for example, a hypervisor that hosts virtual machines) must have a way to keep track of where that data is. In the case of a Nutanix cluster, it is also important to track where the replicas of that data are stored.
Medusa is a Nutanix abstraction layer that sits in front of the database that holds this metadata. The database is distributed across all nodes in the cluster, using a modified form of Apache Cassandra.
Cassandra. Key role: Distributed metadata store
Cassandra stores and manages all of the cluster metadata in a distributed ring-like manner based upon a heavily modified Apache Cassandra. The Paxos algorithm is utilized to enforce strict consistency.
This service runs on every node in the cluster. Cassandra is accessed via an interface called Medusa.
Cassandra depends on Zeus to gather information about the cluster configuration.
Now, back to our issue. Of course, the next step was to call in help from Nutanix Support, as this is definitely not something I could fix myself.
Via a Zoom session, the assigned SRE (Systems Reliability Engineer) investigated the situation and ran the following command on the new CVM:
nutanix@NTNX-xxxxxxx-A-CVM:10.10.234.156:~$ cat /sys/block/sd*/device/path_info
[2:0:0:0] Direct-Access PORT: 1I BOX: NA BAY: 1 Active
[2:0:1:0] Direct-Access PORT: 1I BOX: NA BAY: 2 Active
[2:0:2:0] Direct-Access PORT: 1I BOX: NA BAY: 3 Active
[2:0:3:0] Direct-Access PORT: 1I BOX: NA BAY: 4 Active
[2:0:4:0] Direct-Access PORT: 2I BOX: NA BAY: 1 Active
[2:0:5:0] Direct-Access PORT: 2I BOX: NA BAY: 2 Active
[2:0:6:0] Direct-Access PORT: 2I BOX: NA BAY: 3 Active
[2:0:7:0] Direct-Access PORT: 2I BOX: NA BAY: 4 Active
[3:0:0:0] Direct-Access PORT: 1I BOX: NA BAY: 1 Active
[3:0:1:0] Direct-Access PORT: 1I BOX: NA BAY: 2 Active
[3:0:2:0] Direct-Access PORT: 1I BOX: NA BAY: 3 Active
[3:0:3:0] Direct-Access PORT: 1I BOX: NA BAY: 4 Active
Based on the above results, the SRE concluded that the issue was caused by incorrect internal SAS cabling of the BOX (disk array). Nutanix will not mount any disks that are not in a specific order in terms of controller / BOX / disk location. All HPE DX units are supposed to have the BOX in the same order. Anything out of order is a defect, even though all disks are visible at the OS layer (CVM). To clarify, the expected output above should be 1-2-3 for the PORTs.
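To make it easier to spot what the SRE was looking at, here is a rough Python sketch that parses the path_info output and lists the disks without a numeric BOX value, plus the PORT values seen per controller. The line format is assumed from the output above, the function names are my own, and the checks are only my reading of the SRE's explanation, not Nutanix's actual validation logic.

```python
import re

# Assumed line shape, copied from the path_info output above:
#   "[H:C:T:L] Direct-Access PORT: 1I BOX: NA BAY: 1 Active"
PATH_INFO = re.compile(
    r"\[(?P<hctl>\d+:\d+:\d+:\d+)\]\s+\S+\s+"
    r"PORT:\s*(?P<port>\S+)\s+BOX:\s*(?P<box>\S+)\s+BAY:\s*(?P<bay>\S+)"
)


def parse_path_info(text):
    """Return one dict per disk with its hctl, port, box and bay fields."""
    return [match.groupdict() for match in PATH_INFO.finditer(text)]


def disks_without_box(entries):
    """Disks whose BOX is not a number (for example 'NA')."""
    return [entry["hctl"] for entry in entries if not entry["box"].isdigit()]


def ports_by_controller(entries):
    """Map each controller (first hctl field) to the PORT values seen, in order."""
    ports = {}
    for entry in entries:
        controller = entry["hctl"].split(":")[0]
        seen = ports.setdefault(controller, [])
        if entry["port"] not in seen:
            seen.append(entry["port"])
    return ports


# Three lines taken from the output in this post.
sample = """\
[2:0:0:0] Direct-Access PORT: 1I BOX: NA BAY: 1 Active
[2:0:4:0] Direct-Access PORT: 2I BOX: NA BAY: 1 Active
[3:0:0:0] Direct-Access PORT: 1I BOX: NA BAY: 1 Active
"""

entries = parse_path_info(sample)
print(disks_without_box(entries))
print(ports_by_controller(entries))
```

In my case every disk reported BOX: NA, which matches what the SRE flagged.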
To be continued…
At the time of writing, the advised SAS recabling still has to be carried out. Once that is done, I can confirm whether the Cassandra gossip error is resolved. Thus to be continued…
Update February 10th, 2021
I have an important update to share on this issue: the root cause has been properly identified, with great assistance from the Nutanix SRE team.
It turned out to be quite simple and obvious: the new HPE ProLiant DX380 node had been imaged with Nutanix Foundation using an unsupported version of Nutanix AOS! The node was imaged with Nutanix AOS 5.10.3 to match the version of the Nutanix cluster it was to be added to using the “Expand Cluster” functionality. However, always check the Nutanix Compatibility Matrix to ensure that you are imaging a Nutanix node with a supported combination. It turned out that HPE ProLiant DX380 nodes are supported from Nutanix AOS 5.10.5 upwards. Whoops…! 😛
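The check I skipped boils down to a simple version comparison. Here is a minimal Python sketch of it. The minimum version is hard-coded from this post purely for illustration; the real source of truth is the Nutanix Compatibility Matrix, and the function names are my own.

```python
def parse_version(version):
    """Turn a dotted version string such as '5.10.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_supported(aos_version, minimum_version):
    """True if aos_version is at least minimum_version."""
    return parse_version(aos_version) >= parse_version(minimum_version)


# Assumption: value taken from this post, not queried from the Compatibility Matrix.
MIN_AOS_FOR_DX380 = "5.10.5"

print(is_supported("5.10.3", MIN_AOS_FOR_DX380))  # False: the version I imaged with
print(is_supported("5.10.5", MIN_AOS_FOR_DX380))  # True
```

Comparing tuples of integers avoids the classic string-comparison trap where "5.9" would sort after "5.10".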
The following was the resolution response from the assigned Nutanix SRE:
Initial issue was Prism throwing vlan ID error with cluster expand of the new node. Added the new node in Nutanix cluster after adding the new node in vcenter with vDS. After which the cassandra service was fataling with Gossip failed error – found that the CVM was not able to access the disks properly – failing list_disks command and the BOX numbers not available with HP DX380 hardware. Checked the SAS cabling with the DX380 and confirmed as per the KB. Cluster running in AOS is 5.10.3 – and in compatibility matrix, AOS 5.10.5 is the supported for DX380. Removed the node with the help of Engineering team as the node was in new node status. So to upgrade the AOS to version supported for DX380 and add the new node.
So, the course of action we took was to:
- Have the Nutanix SRE remove the new HPE ProLiant DX380 node from the Nutanix cluster. We were unable to do so ourselves because the node was in a “new node” status, which does not allow the node to be removed using the Prism GUI (Prism > Hardware > Select Host > Remove Host) or the ncli command (ncli host rm-start id=<node_id> skip-space-check=true)
- Then upgrade the Nutanix cluster to a newer, supported version: Nutanix AOS 5.10.5
- Re-image the new HPE ProLiant DX380 node with Nutanix AOS 5.10.5 using Nutanix Foundation
So, the lesson learned here is to always check the Nutanix Compatibility Matrix, no matter what! 😉