My brand-new SSD has bad sectors! Oh, wait.

Just recently I decided to repurpose my Raspberry Pi 4 which was running a private nextcloud and a Pi3B that served as a TVHeadend and Samba server. The reasons for that are complex and don’t matter in the given context here, so I’ll save you from that. In case you’re interested, I got you covered below.

Their successor was a refurbished tiny, energy-saving Fujitsu USFF PC sporting a 4th-gen Core-I5 quadcore Intel CPU, 16 gigs of RAM and a 200-something GB SSD. I didn’t want to get into the trouble of having to trade-off OS storage versus user data storage and so on, so I added an external 480 gigs SSD to host the cloud user data. As I usually opt in favor of RedHat-based distros, my server OS of choice was Fedora 37.

The symptom

Once everything was up and running (which takes some time with SELinux, firewall, apps and the basic shebang of nextcloud itself) I went for the extra mile and checked the servers logs via Cockpit. And there it was.

screenshot of logfile with a couple of kernel-level failure messages

And these lines came back with every reboot. Ouch.

Well, you might probably disagree, but for my part, reading about “interface fatal error”s, sector numbers and DRDY statuses doesn’t really give me that warm cozy feeling of “all is fine”. In fact, it’s the opposite. And even worse, I’ll also have to get myself a replacement drive because my all brand new SSD is broken!

Thinking it over

Ehh wait. Really? A brand-new SSD exposing bad sectors upon first use? That might not be impossibe to happen, but, come on, it’s highly unlikeable in my mind. I’d rather assume the refurbished machine to have a bad SATA cable, a “broken” SATA controller, some BIOS thing… you name it.

Okay, let’s see what the web knows about

SError: { UnrecovData 10B8B BadCRC }

4 out of 5 answers tell people to IMMEDIATELY do a backup and get rid of their dying disk drive. Doesn’t sound good for me. Wait, 1 in 5 answers refers to something called NCQ. Hm. Never heard of that. Okay we’ll keep that “NCQ” thing in mind and move on to the web’s take on

failed command: READ FPDMA QUEUED

Hmm. More referrals to that NCP thing. Plus, oh, a suggestion:

The solution

I finally found something promising:

https://askubuntu.com/questions/1232188/during-boot-process-message-failed-command-read-fpdma-queued

At that point I decided that giving a try on “software basis” means less work than finding my screwdriver and checking cables. Also, less work than rebooting the machine and hitting the magic keystroke required to get into BIOS and poking around possible optimization without having a clue whether that would yield useful results. And: …which key was that again?

So, let’s take the risk of messing up grub and making everything worse.
But you know what? Adding the kernel parameter

libata.force=noncq

to the end of

GRUB_CMDLINE_LINUX_DEFAULT=

indeed did the trick for me! I really didn’t expect to see it work that flawlessly.

After rebooting and waiting for the system to settle into operative mode with all services up and running, that whole block of error messages as described in the screenshot above has disappeared. And it didn’t come back since.

Looking at NCQ

Once my stress level decreased a considerable amount, I wanted to understand what that NCQ thing is about. I understand that it’s a technique supposed to increase performance of hard disks by letting the drive itself decide which operations to pick first from its command queue, based upon where the read-write head currently resides.

It makes sense: HD access (meaning: accessing a given sector on the disk) is a 2-dimensional approach, consisting of

the single-direction rotation of the disk platters and
the linear position (inwards, outwards) of the read-write head

Both dimensions need to work in conjunction to access a sector. And that might take its (mechanical) time. Usually, the drive would pick the next operation from its command queue and then position the head accordingly (which means: Wait for the sector to become positioned under the head by platter rotation and head movement). Once that operation is completed, pick the next command from the queue.

With NCQ, the drive basically doesn’t takes the next command but rather any other command from the queue if its involved sector is the one that is the fastest accessible at this moment based upon the current position of head and platter.

Looking back (in anger) at NCQ

My personal feeling is that NCQ is outdated and causing more problems than it solves.

It’s all fun and games if you’re still running on 5600 or 7200 RPM HDDs and have ATA bus rates of 133 MBit/s, which all boils down to a considerable amount of latency that allows NCQ to make a tangible impact.

But on SSDs there’s no rotation and no platter. SATA sports at least 1.5GBit/s and currently scales at 6 GBit/s or higher. And that’s for “consumer” devices. With that trunkload of data whopping in and off a drive, I doubt there’s enough time left for a drive to optimize the sequence fo its command queue entries at all.

But most important: I don’t care about NCQ’s theoretical marginal improvements if they come at the expense of stability and with “your disk is kaput” error messages. Thanks but no, thanks.

But that’s just my 2 cents – make your own judgement, e.g. from what Wikipedia has to say about Native Command Queuing (NCQ)

Why did I replace the Raspis?

I was in the need of a “better than 3B” Raspi to act as our bedroom Kodi. Raspi prices have literally exploded and so I wasn’t happy with having to buy a new RPi4.

The Raspi4 we already had was driving our private NextCloud. The initial intention here was to have a low-power, cloud-based backup of the pictures taken by my wife’s phone and mine but with everything under my control. Probably also to not have to pay a king’s ransom for a nitpicking thrifty amount of picture storage space in the web.

All was fine until the point where it turned out that the RPi4-based Nextcloud seemed to be a bit low on horsepower for coping with the WAF in the scenario where “someone” wanted to scroll though the gallery of the past years’ pictures backed-up.
Yes I have the preview generator plugin for Nextcloud but that didn’t help. 🙂

Also, there was a second RPi (a 3B) running TVHeadend serving Satellite TV to the bedroom Kodi and acting as a Samba server for serving Music and Movies across the flat. The idea was to aggregate all that into a single machine to replace 2 “underpowered” devices and get a new bedroom RPi4. Plus a spare 3B for tinkering.

About self-hosting NextCloud

Before you get the false impression of me being an impressive, kick-ass, rockstar Linux server administrator: I’m not. I remember to once have gone through the hell of setting up NextCloud on a LAMP stack on a SELinux Fedora server. Been there, done that – will never ever want to go back to that. What I want to say by this is:

Over the past years, I enjoyed NextCloud-AIO although I had no clue about Docker in the beginning (…and still don’t really have). That made me look for similar approaches and finally stumble across Snappy NextCloud. I like that “nobrainer-approach” they come up with. These guys do a pretty good job and their work saves you an incredible amount of hassle.

So if you ever thought about self-hosting your own private cloud but were afraid of the efforts and skills involved, I strongly recommend you give either “one-in-all” approaches above a try.

Notes to self