Something goes wrong…
Our house lost power the other day while the family and I were on the way out the door to eat at a restaurant we hadn’t been to before. Power outages are generally not a problem for my basement home lab since most of the systems I consider important are on a UPS and will shut themselves down gently if power suddenly goes missing.
For reasons, my primary router running pfSense wasn’t considered important enough to warrant a backup battery. “It’ll just come up on its own…” I thought. It generally does and I haven’t had an issue with it in the past.
While sitting at the restaurant enjoying the company of my extended family (Yay living close again) and the warm ambiance of the food and setting, my neighbor texted and said the power was back online. However, Nagios (I monitor the home lab from a cloud instance, nerd alert!) reported that everything was still down.
After arriving home, I checked on the equipment and everything seemed to be in working order except the router, which was powered off. Yikes. After replacing the single board computer with a different one, I had apparently forgotten to flip on the BIOS setting that powers the system back up after a power loss. No big deal: I enabled the BIOS setting and booted the pfSense router. Now, if that was all there was to the story, you wouldn’t be reading it.
To test the change, I pulled the power from the router after giving it a few minutes to boot. I plugged the power back in and the router powered on without an issue. “Huzzah!” I thought, “the problem is fixed.”
Except that, while the router did indeed power on correctly, it was not showing up on the network. Ugh, now I would have to spend time troubleshooting or have an internet-less family becoming crankier and crankier as the evening wore on. Watching pfSense boot on the console, I immediately saw that it was panicking with the error “ffs_valloc: dup alloc” and rebooting. It was essentially caught in a reboot loop.
I’m a fairly seasoned system administrator (or site reliability engineer? who knows) with a lot of UNIXy experience, so this didn’t faze me much. The version of pfSense I am using for my router sits atop FreeBSD 12.4, which I’m very comfortable with. A very brief internet search revealed the issue:
Kirk McKusick explains it succinctly in this bug report:
This is a known problem with journaled soft updates. It only fixes things that are in its log. Most disks are run with write cache enabled which means that they lie about completing I/O operations. Specifically they report that an I/O has been made to stable store when in fact it is only in the RAM-cache on the disk. If power to the disk is lost before the cache is flushed, the write is lost, but journaled soft updates believes it to have been done so does not check for the error. Thus it marks the disk clean when in fact it is not clean. To resolve this problem, you must bring the machine up single user and run a full fsck on it using ``fsck -f -y /filesystem_in_question’’. This will find the hidden problem and correct it.
Am I using FFS with soft updates on an el-cheapo drive in my router with write cache enabled that’s likely lying to the OS about completing I/O operations? You betcha!
After running fsck -f -y in single user mode just as Kirk indicates, the root filesystem was repaired and the router did indeed boot back up just fine, albeit with a corrupted config.xml configuration file. Thankfully, I had a backup.
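For reference, the repair boils down to roughly the following at the single-user shell prompt. This is only a sketch, assuming the damaged filesystem is the UFS root (/). The `run` helper, the `DRY_RUN` variable, and the `repair_root` function are my own illustrative names, there so the commands can be previewed before anything touches a disk.

```shell
#!/bin/sh
# Sketch of the single-user repair. By default it only prints the commands;
# set DRY_RUN=0 to really run them (as root, from single-user mode).
# run, DRY_RUN and repair_root are illustrative names, not FreeBSD tools.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repair_root() {
    # -f forces a full check even though the journal claims the filesystem
    # is clean; -y answers yes to every repair prompt.
    run fsck -f -y /
    # Remount root read-write once the check has finished.
    run mount -u -o rw /
}

repair_root
```

With the default dry run this just prints the two commands. Typing `exit` at the single-user prompt afterwards lets the boot continue to multi-user mode.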
When I was first bouncing around the pfSense console attempting to figure out what was failing and why, I was struck with the notion that the little bit of friction wrapping my brain around the pfSense version of FreeBSD was more than a little bit annoying. I felt like I had fallen into an uncanny valley of the FreeBSD we all know and love.
The clean separation of the base system vs everything else that the BSDs are famous for doesn’t seem to carry over to pfSense. Which, duh, it’s an appliance, it’s going to be a bit weird. My long background with FreeBSD served as somewhat of a hindrance when trying to diagnose the router. Things that looked and felt similar didn’t match 100% to my expectations on how they ought to function.
So I created a list of what I use pfSense for and it boiled down to:
- DHCP server
- Local DNS resolver and cache
- Firewall & internet gateway
All of these services are generally easy to install and run if you have even a little background in UNIX administration. The software I chose to use for each service is:
- DHCP - isc-dhcp44-server (ports)
- Local DNS - unbound (ports)
- Firewall & internet gateway - pf (base)
- VPN - wireguard (ports)
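Getting the pieces in the list above onto a plain FreeBSD install could look something like the sketch below. Treat it as a starting point rather than my exact setup: the package names are taken straight from the list and may differ between FreeBSD releases (check with `pkg search`), and the `run` helper, `DRY_RUN` variable, and `setup_router_services` function are illustrative names of my own. Each service still needs its own configuration file (pf.conf, dhcpd.conf, unbound.conf, and a WireGuard config) before it will do anything useful.

```shell
#!/bin/sh
# Sketch: install and enable the router services on stock FreeBSD.
# Prints the commands by default; set DRY_RUN=0 to execute them as root.
# run, DRY_RUN and setup_router_services are illustrative names.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_router_services() {
    # Package names as given in the list above (assumption: verify locally).
    run pkg install -y isc-dhcp44-server unbound wireguard
    # Forward IPv4 packets between interfaces, i.e. act as a gateway.
    run sysrc gateway_enable=YES
    # pf lives in the base system; this loads /etc/pf.conf at boot.
    run sysrc pf_enable=YES
    # rc.conf switches for the DHCP server and the unbound port.
    run sysrc dhcpd_enable=YES
    run sysrc unbound_enable=YES
}

setup_router_services
```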
I decided to run the router out of a bhyve virtual machine instead of on the hardware itself to add some flexibility to my setup. I was initially concerned that running the router in a virtual environment might affect the speed of the internet connection, but that fear proved to be a non-issue. I have a 1 Gbps fiber connection to the internet, and the virtual machine handles the flood of packets without breaking a sweat.
The biggest benefit of running the router as a virtual machine is that I can move it around if need be. This is dead simple with a FreeBSD host using ZFS. Just take a snapshot of the virtual machine’s dataset and send that snapshot to the backup host on a regular basis. I really like using syncoid for this but there are an infinite number of ways to replicate.
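Done by hand, the snapshot-and-send step looks roughly like the sketch below. The dataset name (zroot/vm/router), the host name (backup-host), and the `run` and `replicate` helpers are all hypothetical placeholders, so substitute your own. Note that this sends a full replication stream every time and that `zfs receive -F` rolls the target back to match, which is why a tool like syncoid, which takes care of incremental sends, is nicer for regular use.

```shell
#!/bin/sh
# Sketch: snapshot the VM's dataset and copy it to a backup host.
# Prints the commands by default; set DRY_RUN=0 to execute them.
# run, replicate, and every dataset/host name here are placeholders.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $1"
    else
        sh -c "$1"
    fi
}

replicate() {
    dataset=$1
    target_host=$2
    snap=$3
    # Recursive snapshot of the VM dataset and any child datasets.
    run "zfs snapshot -r ${dataset}@${snap}"
    # Full replication stream over ssh to the same dataset name on the target.
    run "zfs send -R ${dataset}@${snap} | ssh ${target_host} zfs receive -F ${dataset}"
}

replicate zroot/vm/router backup-host "manual-$(date +%Y%m%d)"

# Rough syncoid equivalent (handles snapshots and incrementals itself):
#   syncoid -r zroot/vm/router root@backup-host:zroot/vm/router
```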
Next time the router hardware (which is now just a host for the virtual machine) fails, I can power it down and start the virtual machine on the backup host.