Nutanix Medusa Error: Cassandra Gossip Fails
I want to share an interesting issue that I came across when expanding an existing Nutanix cluster with a new node. The existing cluster consisted of several HPE ProLiant DL380 servers, whereas the new node was an HPE ProLiant DX380 server. Why a DX model was chosen for the new node was not disclosed to me. Regardless, a Nutanix cluster consisting of both DL and DX servers is possible. DX models are fully integrated with Nutanix and are the result of a recent partnership between HPE and Nutanix. More information on this partnership and the DX models can be found here.
Medusa Error: Cassandra Gossip Fails
Most of the installation process was a breeze:
- Performing the Nutanix Foundation on the new HPE DX380 server
- Adding the node (actually a VMware ESXi host) to VMware vCenter Server as a standalone host
- Adding the ESXi host to the respective VMware Distributed Switches
- Performing the Expand Cluster from within Nutanix Prism Element
- Moving the ESXi host into the required VMware vSphere cluster
The problem surfaced when Nutanix Prism started showing this node by its IP address instead of its FQDN on the Hardware page. There were no readings available for CPU, RAM, etc., just as when a CVM is still in maintenance mode.
When checking via SSH and running “ncli host ls”, the respective CVM showed up as “Under Maintenance Mode : null”. After running the command to set maintenance mode to false, “ncli host edit id=<Id> enable-maintenance-mode=false”, the node appeared to become operational again and services started to come up. That was short-lived, however, as Medusa threw an error stating “Cassandra gossip failed”:
nutanix@NTNX-xxxxxxx-A-CVM:10.10.234.151:~$ cs | grep -v UP
2020-11-17 14:22:07 INFO zookeeper_session.py:135 cluster is attempting to connect to Zookeeper
2020-11-17 14:22:07 INFO cluster:2642 Executing action status on SVMs 10.10.234.151,10.10.234.152,10.10.234.153,10.10.234.154,10.10.234.155,10.10.234.156
The state of the cluster: start
Lockdown mode: Disabled
CVM: 10.10.234.151 Up
CVM: 10.10.234.152 Up
CVM: 10.10.234.153 Up
CVM: 10.10.234.154 Up, ZeusLeader
CVM: 10.10.234.155 Up
2020-11-17 14:22:08 INFO cluster:2755 Success!
CVM: 10.10.234.156 Up
    Medusa ERROR [6008, 23586, 23641, 23642] Cassandra gossip failed
    DynamicRingChanger DOWN
    Pithos DOWN
    Mantle DOWN
    Hera DOWN
    Stargate DOWN
    InsightsDB DOWN
    InsightsDataTransfer DOWN
    Ergon DOWN
    Cerebro DOWN
    Chronos DOWN
    Curator DOWN
    Athena DOWN
    Prism DOWN
    CIM DOWN
    AlertManager DOWN
    Arithmos DOWN
    Catalog DOWN
    Acropolis DOWN
    Uhura DOWN
    Snmp DOWN
    SysStatCollector DOWN
    Tunnel DOWN
    Janus DOWN
    NutanixGuestTools DOWN
    MinervaCVM DOWN
    ClusterConfig DOWN
    Mercury DOWN
    APLOSEngine DOWN
    APLOS DOWN
    Lazan DOWN
    Delphi DOWN
    XTrim DOWN
    ClusterHealth DOWN
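As a side note, the list of services reported as DOWN can be pulled out of such a transcript programmatically. The short Python sketch below is my own helper (not a Nutanix tool) and runs against a trimmed excerpt of the output above:

```python
# Minimal sketch: extract the services reported as DOWN from a captured
# "cluster status" transcript. The sample is a trimmed excerpt of the
# output shown above, not live cluster output.
sample = """\
CVM: 10.10.234.156 Up
Medusa ERROR [6008, 23586, 23641, 23642] Cassandra gossip failed
DynamicRingChanger DOWN
Stargate DOWN
Curator DOWN
Prism DOWN
"""

down = [line.split()[0] for line in sample.splitlines()
        if line.rstrip().endswith("DOWN")]
print(down)  # ['DynamicRingChanger', 'Stargate', 'Curator', 'Prism']
```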
What are Medusa and Cassandra?
Perhaps this is a good moment to explain what Medusa and Cassandra do within a Nutanix cluster. 🙂
Key role of Medusa: Access interface for Cassandra
Distributed systems that store data for other systems (for example, a hypervisor that hosts virtual machines) must have a way to keep track of where that data is. In the case of a Nutanix cluster, it is also important to track where the replicas of that data are stored.
Medusa is a Nutanix abstraction layer that sits in front of the database that holds this metadata. The database is distributed across all nodes in the cluster, using a modified form of Apache Cassandra.
Key role of Cassandra: Distributed metadata store
Cassandra stores and manages all of the cluster metadata in a distributed ring-like manner based upon a heavily modified Apache Cassandra. The Paxos algorithm is utilized to enforce strict consistency.
This service runs on every node in the cluster. Cassandra is accessed via an interface called Medusa.
Cassandra depends on Zeus to gather information about the cluster configuration.
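To make the ring-and-replicas idea a bit more concrete, here is a minimal, hypothetical Python sketch of hash-ring placement: each node owns a token position on a ring, and a metadata key lands on the next node clockwise plus the following replicas. Real Cassandra token assignment and Nutanix's Paxos-backed consistency are far more involved; this only illustrates the placement concept, and the key format is made up.

```python
import hashlib
from bisect import bisect_right

RING = 2**32  # size of the token space

def token(value: str) -> int:
    # Deterministic position on the ring for a node address or metadata key.
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING

def replicas(key: str, nodes: list[str], rf: int = 3) -> list[str]:
    # Sort nodes by their ring token, then walk clockwise from the key's
    # token to pick the primary owner and the next rf-1 replica holders.
    ring = sorted(nodes, key=token)
    tokens = [token(n) for n in ring]
    start = bisect_right(tokens, token(key)) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(min(rf, len(ring)))]

cvms = [f"10.10.234.{i}" for i in range(151, 157)]
owners = replicas("vdisk:1234:block:42", cvms)
print(owners)  # three distinct CVMs out of the six
```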
Now, back to our issue. Of course, the next step was to call in help from Nutanix Support, as this is definitely not something I could fix myself.
Via a Zoom session, the assigned SRE (Service Reliability Engineer) investigated the situation and ran the following command on the new CVM:
nutanix@NTNX-xxxxxxx-A-CVM:10.10.234.156:~$ cat /sys/block/sd*/device/path_info
[2:0:0:0] Direct-Access PORT: 1I BOX: NA BAY: 1 Active
[2:0:1:0] Direct-Access PORT: 1I BOX: NA BAY: 2 Active
[2:0:2:0] Direct-Access PORT: 1I BOX: NA BAY: 3 Active
[2:0:3:0] Direct-Access PORT: 1I BOX: NA BAY: 4 Active
[2:0:4:0] Direct-Access PORT: 2I BOX: NA BAY: 1 Active
[2:0:5:0] Direct-Access PORT: 2I BOX: NA BAY: 2 Active
[2:0:6:0] Direct-Access PORT: 2I BOX: NA BAY: 3 Active
[2:0:7:0] Direct-Access PORT: 2I BOX: NA BAY: 4 Active
[3:0:0:0] Direct-Access PORT: 1I BOX: NA BAY: 1 Active
[3:0:1:0] Direct-Access PORT: 1I BOX: NA BAY: 2 Active
[3:0:2:0] Direct-Access PORT: 1I BOX: NA BAY: 3 Active
[3:0:3:0] Direct-Access PORT: 1I BOX: NA BAY: 4 Active
Based on the above results, the SRE concluded that the issue was caused by incorrect internal SAS cabling of the BOX (disk array). Nutanix will not mount any disks that are not in a specific order in terms of controller / BOX / disk location. All HPE DX units are supposed to have the BOXes cabled in the same order; anything out of order is considered a defect, even though all disks are visible at the OS layer (CVM). To clarify: in the output above, the PORTs were expected to read 1-2-3.
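A quick way to spot-check this would be to compare the PORT sequence in a path_info capture against the expected order. The Python sketch below is hypothetical (my own interpretation of the SRE's 1-2-3 remark, not an official Nutanix check) and runs against a trimmed sample of the output above:

```python
import re

# Hypothetical cabling sanity check: collect the distinct PORT values in
# first-seen order from a path_info capture and compare them with the
# expected 1I -> 2I -> 3I sequence mentioned by the SRE.
sample = """\
[2:0:0:0] Direct-Access PORT: 1I BOX: NA BAY: 1 Active
[2:0:4:0] Direct-Access PORT: 2I BOX: NA BAY: 1 Active
[3:0:0:0] Direct-Access PORT: 1I BOX: NA BAY: 1 Active
"""

ports = re.findall(r"PORT: (\S+)", sample)
seen = list(dict.fromkeys(ports))  # distinct PORTs, first-seen order
expected = ["1I", "2I", "3I"]
verdict = "OK" if seen == expected else "MISCABLED"
print(seen, verdict)  # ['1I', '2I'] MISCABLED
```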
To be continued…
At this point in time, the advised SAS recabling still has to be carried out, after which I can confirm whether the Cassandra gossip issue is resolved. Thus, to be continued…