WhiskeyTech

Home Lab Update #1

So I have been using my hyperconverged RHV environment and it has been working nicely for me. I upgraded the switch that hosts the Ceph cluster interfaces to a 10g backplane, which definitely helped with performance, but that isn't what I wanted to post about. I wanted to post about Christmas! You see, my "Home Lab" is hosted in a spare bedroom of my house, and this Christmas I was hosting the whole family. That meant either the family members sleeping in that room would have to like sweating a lot, or I needed to shut down my home lab while the room was occupied. I don't hate my family, so I went with option #2.

I couldn't really find much on the "best" way to shut down a Ceph/RHV Hosted Engine hyperconverged platform, so I went with this (a rough command-level sketch follows the list):

  • Stop all VMs other than the Hosted Engine
  • Put the cluster into Global Maintenance Mode
  • Shut down the Hosted Engine
  • Wait 10-15 minutes to ensure the dust had settled
  • Shut down all 3 Ceph nodes simultaneously
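
For reference, here is roughly what that list looks like at the command line. This is a sketch from memory, assuming the standard hosted-engine CLI on the RHV nodes; the guest VMs themselves I stopped from the RHV-M UI.

# On one of the RHV nodes, after stopping all guest VMs from RHV-M
hosted-engine --set-maintenance --mode=global
hosted-engine --vm-shutdown

# After waiting 10-15 minutes for things to settle, on each of the 3 nodes
shutdown -h now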

Additionally, I pulled the power plugs on all three servers as well as on the new 10g switch, which was louder than the 1g switch I had been running. The room was perfect for the family, and Christmas was a success!

It's now January 2020 and everyone has finally left my house, so it's time to bring the home lab back online. I plug in the power to the 3 servers and they start to boot. I plug in the 10g switch and its lights flicker as it comes up… Then I check Ceph, and it seems like the nodes can't communicate on the cluster network (which is supposed to be running on the 10g switch).
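
For anyone wondering what "check Ceph" means in practice here, it was nothing fancy; a minimal sketch, assuming the ceph CLI on the nodes and the cluster network living on the 10g switch (the ping target is a placeholder for another node's cluster-network IP):

# Overall cluster health and detail on anything unhappy
ceph -s
ceph health detail

# Sanity-check node-to-node connectivity on the cluster network
ping -c 3 <other-node-cluster-ip>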

I try to connect to the 10g switch and it's not pinging, but then I get a notification from my network that a NETGEAR device just performed a DHCP request. That's not supposed to happen?!?

I am not really sure why, but the 10g switch had factory reset itself. Maybe because it sat without power for a couple of weeks? I will still need to investigate that. I quickly re-configured the switch and once again had a healthy Ceph cluster.

I checked on the Hosted Engine and it kept reporting that ovirt-ha-agent wasn't running. The error message in agent.log was slightly misleading:

MainThread::ERROR::2020-01-02 12:08:57,996::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
    return action(he)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
    return he.start_monitoring()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 413, in start_monitoring
    self._initialize_broker()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 537, in _initialize_broker
    m.get('options', {}))
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 86, in start_monitor
    ).format(t=type, o=options, e=e)
RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'ping', options: {'addr': '172.16.0.1'}]

The IP address listed is my gateway, and I was able to ping it just fine from the RHV nodes. I then took a look at the current mount points and noticed that the localhost mounts for the various storage domains were not there. This made me look at the state of the Ganesha cluster:

[root@server50 temp]# ./usr/bin/ganesha-rados-grace -p cephfs_metadata -n ganesha
cur=24 rec=23
======================================================
server50.homelab.net
server60.homelab.net    NE
server70.homelab.net    NE
[root@server50 temp]#

So the output shows that server60 and server70 both (N)eed a grace period and are (E)nforcing one. I restarted the agent on both of those nodes, which did absolutely nothing, and then restarted the agent on server50. That put all three into the NE state, which quickly resolved itself after a minute or so.
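
In command form this was just a couple of systemd restarts. I am hedging on the exact unit name, since "the agent" here could be the NFS-Ganesha service or the hosted-engine HA agent; either way it is a plain restart followed by re-checking the grace database:

# On server60 and server70, then on server50
systemctl restart nfs-ganesha    # or: systemctl restart ovirt-ha-agent

# Re-check the grace database
ganesha-rados-grace -p cephfs_metadata -n ganesha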

I went back and took a look at the nodes and they now had the hosted-deploy storage domain mounted on loopback.
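
For completeness, "mounted on loopback" just means the storage domain shows back up in the mount table, served over NFS from localhost. Roughly what I was grepping for (the exact export path under /rhev/data-center/mnt/ will vary, so treat it as illustrative):

# RHV mounts its NFS storage domains under /rhev/data-center/mnt/
mount | grep localhost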

Next I took the RHV cluster out of global maintenance:

hosted-engine --set-maintenance --mode=none

After some time, the Hosted Engine started back up:

[root@server50 ~]# hosted-engine --vm-status


--== Host server50.homelab.net (id: 1) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : server50.homelab.net
Host ID                            : 1
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 178def03
local_conf_timestamp               : 21997
Host timestamp                     : 21997
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=21997 (Thu Jan  2 15:52:14 2020)
	host-id=1
	score=3400
	vm_conf_refresh_time=21997 (Thu Jan  2 15:52:14 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineUp
	stopped=False


--== Host server60.homelab.net (id: 2) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : server60.homelab.net
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : b4fec494
local_conf_timestamp               : 21450
Host timestamp                     : 21450
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=21450 (Thu Jan  2 15:52:06 2020)
	host-id=2
	score=3400
	vm_conf_refresh_time=21450 (Thu Jan  2 15:52:06 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineDown
	stopped=False


--== Host server70.homelab.net (id: 3) status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : server70.homelab.net
Host ID                            : 3
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 6518d6ed
local_conf_timestamp               : 13817
Host timestamp                     : 13817
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=13817 (Thu Jan  2 15:52:14 2020)
	host-id=3
	score=3400
	vm_conf_refresh_time=13817 (Thu Jan  2 15:52:14 2020)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineDown
	stopped=False
[root@server50 ~]# 

At this point I was able to log in to RHV-M and everything looked functional. I did some Googling as to why it said "failed liveliness check" and it looked like the issue was DNS-related. Ah! That makes sense, as my home DNS is a VM on this very infrastructure. After logging into RHV-M and starting my DNS VM, everything was looking good:

[root@server50 ~]# hosted-engine --vm-status | egrep '(^Hostname|^Engine status)'
Hostname                           : server50.homelab.net
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Hostname                           : server60.homelab.net
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Hostname                           : server70.homelab.net
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
[root@server50 ~]#
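
As a side note, the "failed liveliness check" is essentially the HA agent failing to poll the engine's health page, which is why a dead DNS VM can trip it. Once DNS was back, the page could be checked directly; a quick sketch, with the engine FQDN below being a placeholder rather than my real hostname:

# The liveliness check boils down to this health endpoint answering
curl -k https://<engine-fqdn>/ovirt-engine/services/health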

So far the cluster is meeting my needs at home and I haven’t really struggled too much with it. I hope that this type of deployment is somewhere in the future of our product.

Anyway, back to work!
