Nutanix Medusa Error: Cassandra Gossip Fails

I want to share an interesting issue that I came across when expanding an existing Nutanix cluster with a new node. The existing cluster consisted of several HPE ProLiant DL380 servers, whereas the new node was an HPE ProLiant DX380 server. Why a DX model was chosen for the new node was not disclosed to me. Regardless, a Nutanix cluster consisting of both DL and DX servers is possible: DX models are fully integrated with Nutanix and are the result of a recent partnership between HPE and Nutanix.

Medusa Error: Cassandra Gossip Fails

Most of the installation process went like a breeze:

    1. Imaging the new HPE DX380 server with Nutanix Foundation
    2. Adding the node (actually a VMware ESXi host) to VMware vCenter Server as a standalone host
    3. Adding the ESXi host to the respective VMware Distributed Switches
    4. Performing the Expand Cluster operation from within Nutanix Prism Element
    5. Moving the ESXi host into the required VMware vSphere cluster

The problem appeared when Nutanix Prism started showing this node as an IP address instead of its FQDN on the Hardware page. There were no readings available for CPU, RAM, and so on, similar to when a CVM is still in maintenance mode.

When checking via SSH and running "ncli host ls", the respective CVM showed up as "Under Maintenance Mode : null". After running "ncli host edit id=<Id> enable-maintenance-mode=false" to set maintenance mode to false, the node appeared to become operational again and services started to come up. That was short-lived, however, as Medusa reported the error "Cassandra gossip failed":
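As a small aside: when several hosts are involved, it can help to scan the "ncli host ls" output for the inconsistent "null" state rather than eyeballing it. The sketch below is a minimal, hypothetical parser; the field layout ("Id" and "Under Maintenance Mode" lines) is assumed from the output quoted in this post, and real ncli output may differ.

```python
def hosts_with_null_maintenance(ncli_output: str) -> list[str]:
    """Return the ids of hosts reporting 'Under Maintenance Mode : null'.

    Assumes the 'Id' line for a host appears before its
    'Under Maintenance Mode' line, as in typical ncli output.
    """
    flagged, current_id = [], None
    for line in ncli_output.splitlines():
        # Split each "Key : value" line on the first colon.
        key, _, value = (part.strip() for part in line.partition(":"))
        if key == "Id":
            current_id = value
        elif key == "Under Maintenance Mode" and value == "null":
            flagged.append(current_id)
    return flagged
```

Any host id this returns is a candidate for the "ncli host edit … enable-maintenance-mode=false" command shown above.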

nutanix@NTNX-xxxxxxx-A-CVM:10.10.234.151:~$ cs | grep -v UP
2020-11-17 14:22:07 INFO zookeeper_session.py:135 cluster is attempting to connect to Zookeeper
2020-11-17 14:22:07 INFO cluster:2642 Executing action status on SVMs 10.10.234.151,10.10.234.152,10.10.234.153,10.10.234.154,10.10.234.155,10.10.234.156
The state of the cluster: start
Lockdown mode: Disabled
 
        CVM: 10.10.234.151 Up
 
        CVM: 10.10.234.152 Up
 
        CVM: 10.10.234.153 Up
 
        CVM: 10.10.234.154 Up, ZeusLeader
 
        CVM: 10.10.234.155 Up
2020-11-17 14:22:08 INFO cluster:2755 Success!
 
        CVM: 10.10.234.156 Up
                              Medusa ERROR      [6008, 23586, 23641, 23642]   Cassandra gossip failed
                  DynamicRingChanger DOWN       []
                             Pithos DOWN        []
                              Mantle DOWN       []
                                Hera DOWN       []
                            Stargate DOWN       []
                          InsightsDB DOWN       []
                InsightsDataTransfer DOWN       []
                               Ergon DOWN       []
                             Cerebro DOWN       []
                             Chronos DOWN       []
                             Curator DOWN       []
                              Athena DOWN       []
                               Prism DOWN       []
                                 CIM DOWN       []
                        AlertManager DOWN       []
                            Arithmos DOWN       []
                             Catalog DOWN       []
                           Acropolis DOWN       []
                               Uhura DOWN       []
                                Snmp DOWN       []
                    SysStatCollector DOWN       []
                              Tunnel DOWN       []
                               Janus DOWN       []
                   NutanixGuestTools DOWN       []
                          MinervaCVM DOWN       []
                       ClusterConfig DOWN       []
                             Mercury DOWN       []
                         APLOSEngine DOWN       []
                               APLOS DOWN       []
                               Lazan DOWN       []
                              Delphi DOWN       []
                               XTrim DOWN       []
                       ClusterHealth DOWN       []

What are Medusa and Cassandra?

Perhaps this is a good moment to explain what Medusa and Cassandra do within a Nutanix cluster. 🙂

Medusa

Key role: Access interface for Cassandra

Distributed systems that store data for other systems (for example, a hypervisor that hosts virtual machines) must have a way to keep track of where that data is. In the case of a Nutanix cluster, it is also important to track where the replicas of that data are stored.

Medusa is a Nutanix abstraction layer that sits in front of the database that holds this metadata. The database is distributed across all nodes in the cluster, using a modified form of Apache Cassandra.

Cassandra

Key Role: Distributed metadata store

Cassandra stores and manages all of the cluster metadata in a distributed ring-like manner, based on a heavily modified Apache Cassandra. The Paxos algorithm is utilized to enforce strict consistency.

This service runs on every node in the cluster. Cassandra is accessed via an interface called Medusa.

Cassandra depends on Zeus to gather information about the cluster configuration.
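To make the ring idea a bit more concrete, here is a conceptual sketch (not Nutanix code) of how a Cassandra-style ring maps a metadata key to a primary node plus a replica. The node names, hash choice, and replication factor are purely illustrative; the real token assignment and Paxos machinery in AOS are far more involved.

```python
import hashlib


def ring_position(name: str, ring_size: int = 2**32) -> int:
    """Deterministic position of a key or node on the ring."""
    digest = hashlib.md5(name.encode()).hexdigest()
    return int(digest, 16) % ring_size


def replicas_for(key: str, nodes: list[str], rf: int = 2) -> list[str]:
    """Walk the ring clockwise from the key's position, taking rf nodes."""
    ordered = sorted(nodes, key=ring_position)          # nodes by ring position
    positions = [ring_position(n) for n in ordered]
    pos = ring_position(key)
    # First node whose position is >= the key's position (wrap around to 0).
    start = next((i for i, p in enumerate(positions) if p >= pos), 0)
    return [ordered[(start + i) % len(ordered)] for i in range(rf)]
```

Because every node computes the same positions, any CVM can locate a piece of metadata and its replicas without asking a central coordinator, which is exactly why a node that fails gossip is such a problem for the ring.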

Nutanix Support

Now, back to our issue. Of course, the next step was to call in help from Nutanix Support, as this was definitely not something I could fix myself.

Via a Zoom session, the assigned SRE (Systems Reliability Engineer) investigated the situation and ran the following command on the new CVM:

nutanix@NTNX-xxxxxxx-A-CVM:10.10.234.156:~$ cat /sys/block/sd*/device/path_info
[2:0:0:0]    Direct-Access     PORT: 1I BOX: NA BAY: 1 Active
[2:0:1:0]    Direct-Access     PORT: 1I BOX: NA BAY: 2 Active
[2:0:2:0]    Direct-Access     PORT: 1I BOX: NA BAY: 3 Active
[2:0:3:0]    Direct-Access     PORT: 1I BOX: NA BAY: 4 Active
[2:0:4:0]    Direct-Access     PORT: 2I BOX: NA BAY: 1 Active
[2:0:5:0]    Direct-Access     PORT: 2I BOX: NA BAY: 2 Active
[2:0:6:0]    Direct-Access     PORT: 2I BOX: NA BAY: 3 Active
[2:0:7:0]    Direct-Access     PORT: 2I BOX: NA BAY: 4 Active
[3:0:0:0]    Direct-Access     PORT: 1I BOX: NA BAY: 1 Active
[3:0:1:0]    Direct-Access     PORT: 1I BOX: NA BAY: 2 Active
[3:0:2:0]    Direct-Access     PORT: 1I BOX: NA BAY: 3 Active
[3:0:3:0]    Direct-Access     PORT: 1I BOX: NA BAY: 4 Active

Based on the above results, the SRE concluded that the issue was caused by incorrect internal SAS cabling of the BOX (disk array). Nutanix will not mount disks that are not in a specific order in terms of controller / BOX / disk location. All HPE DX units are supposed to have the BOXes cabled in the same order; anything out of order is considered a defect, even though all disks are visible at the OS layer (CVM). To clarify, the PORT values in the output above should read 1-2-3.
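For reference, a check like the SRE's can be scripted. The sketch below parses path_info lines in the format quoted above and reports the BAY layout per controller/PORT pair, with a simple sequential-bay sanity check; the regular expression and the "bays should run 1..n per port" rule are assumptions drawn from this post's output, not an official Nutanix validation.

```python
import re
from collections import defaultdict

# Matches lines like: [2:0:0:0]    Direct-Access     PORT: 1I BOX: NA BAY: 1 Active
LINE = re.compile(r"\[(\d+):\d+:\d+:\d+\].*?PORT:\s*(\S+)\s+BOX:\s*(\S+)\s+BAY:\s*(\d+)")


def port_layout(path_info: str) -> dict:
    """Map (controller, port) -> list of bay numbers, in output order."""
    layout = defaultdict(list)
    for match in LINE.finditer(path_info):
        controller, port, _box, bay = match.groups()
        layout[(controller, port)].append(int(bay))
    return dict(layout)


def bays_sequential(layout: dict) -> bool:
    """True if every port's bays run 1, 2, ..., n with nothing skipped."""
    return all(bays == list(range(1, len(bays) + 1)) for bays in layout.values())
```

Feeding it the output of "cat /sys/block/sd*/device/path_info" makes it easy to see at a glance which controller has which PORTs attached, and therefore whether the cabling order matches the other nodes.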

To be continued…

At the time of writing, the SAS cables still have to be recabled as advised, after which I can confirm whether this resolves the Cassandra gossip failure. Thus, to be continued…
