Fixing Ceph OSD “Down” and “Out” Glitch After a Power Outage

I have been running my home server on a mini Ceph cluster made up of three monitors and six OSDs. The monitors run on SBCs (single-board computers) and the OSDs run on Intel NUC mini PCs. I built a small acrylic rack to house all the devices. I have been really impressed by the SDS (software-defined storage) solution that Ceph provides. It kept my websites (including the one you’re visiting right now) online 24/7 for years, until a recent half-day power outage drained my UPS. After the power was restored, I found that 2 of the 6 OSDs were down.

osd: 6 osds: 4 up (since 5m), 4 in (since 5m)

Again, I want to compliment the Ceph platform without hesitation, because my websites came back online without trouble even with 1/3 of the OSDs offline. Here are the steps I took to look into the issue and finally get the 2 outliers back online.

Step 1. Check the hardware

I used the following command to figure out which OSDs were down.

ceph osd tree
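
On a longer tree it can help to filter the output down to the OSDs that matter. Here is a minimal sketch, assuming the usual `ceph osd tree` layout in which each OSD row shows its name (osd.N) immediately followed by its status. The function name list_down_osds is a hypothetical helper of my own, not part of Ceph.

```shell
# Hypothetical helper: print the names of OSDs whose status is "down".
# Reads `ceph osd tree` output on stdin; assumes each OSD row has the
# name (osd.N) immediately followed by the status column.
list_down_osds() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") {
                print $i
            }
        }
    }'
}
```

It would be used as `ceph osd tree | list_down_osds`, which prints one OSD name per line.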

It turned out that the 2 down OSDs were on the same computer (note: I have 3 Intel NUCs running as OSD hosts, each carrying two SSDs, and each SSD is configured as an OSD, so there are 6 OSDs in total on 3 computers). My first guess was a hardware issue, so I logged in to that computer but found nothing wrong: the computer was running OK, the network was connected, and the storage was readable, writable and not full.

Step 2. Restart the OSD service

Since nothing seemed broken, why not restart the OSD service? I ran the following commands to bring up the OSDs.

sudo systemctl start ceph-osd@4
sudo systemctl start ceph-osd@5

The first couple of times I ran the commands, the system didn’t return any error, and I did notice the Ceph status change to:

osd: 6 osds: 4 up (since 5s), 6 in (since 5s)
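
The numbers to watch in that line are the “up” and “in” counts. As a rough sketch, the hypothetical helper below (osd_counts is my own name, not a Ceph command) pulls the total, up and in counts out of a status line in the format shown above. I am assuming that `ceph osd stat` prints a line of the same shape.

```shell
# Hypothetical helper: print "total up in" for every stdin line that looks
# like "osd: 6 osds: 4 up (since 5s), 6 in (since 5s)".
osd_counts() {
    awk '{
        total = ""; up = ""; n_in = ""
        gsub(/[,;]/, "")    # also tolerate "4 up, 4 in" style output
        for (i = 2; i <= NF; i++) {
            if ($i == "osds:") total = $(i - 1)
            else if ($i == "up") up = $(i - 1)
            else if ($i == "in") n_in = $(i - 1)
        }
        if (total != "") print total, up, n_in
    }'
}
```

Running something like `ceph osd stat | osd_counts` every few seconds makes it easy to see whether the restarted OSDs stay “in” or drop out again.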

The status looked promising, since the system was trying to bring the OSDs up, but it didn’t take long before it dropped the ball and returned to “4 in” again. I retried starting the service, and after a couple more attempts the command began to fail immediately with an error, so I ran the following diagnostic command to check:

sudo systemctl status ceph-osd@4.service

This printed the messages logged during the service bring-up, and I did find some weird information:

received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
osd.4 1746 *** Got signal Interrupt ***
osd.4 1746 *** Immediate shutdown (osd_fast_shutdown=true) ***

So it looked like something kept going wrong no matter how hard the OSD host tried, and then the kernel sent a signal that killed the service.
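
`systemctl status` only shows the last few log lines. Assuming the OSD runs under systemd as above, `journalctl -u ceph-osd@4` should hold a longer history. The sketch below, with a hypothetical helper name, filters log text for the signal and shutdown messages quoted above.

```shell
# Hypothetical helper: keep only the signal/shutdown lines from OSD log
# text on stdin. The patterns come from the messages quoted above.
osd_shutdown_events() {
    grep -E 'received +signal|Got signal|Immediate shutdown'
}
```

For example: `sudo journalctl -u ceph-osd@4 --no-pager | osd_shutdown_events`.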

Step 3. Remove and re-add the OSDs

My next suspicion was that the computer hosting the problematic OSDs had some sort of configuration issue and was therefore being rejected by the kernel. I checked the clocks, and all the computers were well synced with the current time. Then I followed this article to completely remove the OSDs and re-introduce them to the system. Unfortunately, that was still a no-go.

Step “finally”. Find the clue in the log file and fix the issue caused by the IP address setting

While going through the previous step, I realized there was log information I could refer to for more details. Each OSD has its own dedicated log file, located at

/var/log/ceph/ceph-osd.4.log

Simply change the 4 in osd.4 to the number that matches your OSD ID. In the log I noticed another weird message:

-1 osd.4 1717 heartbeat_check: no reply from 100.0.0.100:6804 osd.0 ever on either front or back, first ping sent 2025-05-13 17:52:27.107270 (oldest deadline 2025-05-13 17:52:47.107270)

And I found the exact same error in the osd.5 log. This told me that traffic between the OSD hosts was not getting through. So I tried the ping command and finally nailed down the cause: I could ping from the OSD4/5 computer to the OSD0/1 computer, but not the other way around. It turned out that I had enabled both static IP and DHCP assignment on the OSD0/1 computer a few years ago when I was setting up the system (don’t ask me why …). Somehow the OSD0/1 computer had been carrying dual IP addresses without trouble until this power outage, after which a new IP address was assigned through DHCP. I still don’t understand how that could cause the ping failure, given that the original static IP address remained unchanged. Nevertheless, after I removed the DHCP assignment, the 2 missing OSDs were successfully brought up and stayed “in” and “up”.
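
The heartbeat_check lines name exactly which peers an OSD cannot hear from, so they make a good starting list for ping tests. Here is a minimal sketch that extracts them, assuming the log line format shown above. The helper name unreachable_peers is hypothetical.

```shell
# Hypothetical helper: list each peer OSD and address that appears in
# "heartbeat_check: no reply from ..." lines of an OSD log on stdin.
unreachable_peers() {
    sed -n -E 's/.*heartbeat_check: no reply from ([^ ]+) (osd\.[0-9]+).*/\2 \1/p' | sort -u
}
```

Run it as `unreachable_peers < /var/log/ceph/ceph-osd.4.log`, then ping each listed address from this host and ping this host back from the peer, since the failure may only show up in one direction.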

P.S. The location of the configuration files for static IP / DHCP assignment on a Linux system that uses netplan:

/etc/netplan

Edit the .yaml files accordingly and apply the change:

sudo netplan apply
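
After applying the change, it may be worth confirming that the host no longer carries two addresses. Here is a minimal sketch, assuming the one-line-per-address layout of `ip -o -4 addr show`, where the interface name is the second column. The helper name is hypothetical.

```shell
# Hypothetical helper: read `ip -o -4 addr show` output on stdin and print
# every interface that carries more than one IPv4 address, with the count.
ifaces_with_multiple_ipv4() {
    awk '$3 == "inet" { count[$2]++ }
         END { for (dev in count) if (count[dev] > 1) print dev, count[dev] }' | sort
}
```

Run it as `ip -o -4 addr show | ifaces_with_multiple_ipv4`. No output means every interface has a single IPv4 address.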
