Introduction
My 42 TB of storage is running out, and Black Friday is around the corner. Deals are already out online. This sounds like a pretty good time for an upgrade. I need something that can last for a few more years. The plan? Same as last time: shuck external drives to get high-capacity drives for cheap. Can't be a data hoarder without the capacity to store data, right?
The setup from 2017
Let's take a trip back for a sec, so we can compare past to present. Those WD (Western Digital) easystore drives have consistently been a nice way to get high-capacity drives for cheap. Back in 2017, I needed more storage because some external 2 and 3 TB drives weren't enough. I wanted something overkill. Something that would last me for a few years. So I bought 4 of the 8 TB models for $160 each. That's just a bit above $20/TB once tax is considered. I have tweets with pictures covering these and what shucking them looks like. You can see those below:
[Embedded tweets: photos of the four 8 TB easystore drives and the shucking process]
In all 4 of those, there was a red WD 8 TB internal drive which can be plugged straight into a computer, where it's then treated as an internal hard drive as opposed to an external one. To my surprise, they were capable of handling me recording gameplay to them over USB 3.0: raw 1920x1080 60 fps footage via Fraps and Dxtory. So I kept one drive to use as an external drive, and popped the rest into my server. My gaming computer at the time was a laptop, so making that last drive an internal one was not an option.
"So, why do you want so much storage Clara?"
For context on why I want so much storage: I am a data hoarder. I am obsessed with data. When I play games, I record every match. In university, I salvage and back up all data from as many courses and semesters as possible. I am into photography as well, and always capture in RAW. Those files are many times larger than your average JPG. I don't just store them, though. Recording and shooting in RAW gives me full control of my content in post, which I prefer. Data hoarding is a hobby, and something I quite enjoy doing. But it's also a way of archiving my life. Organising it, watching it grow over time, and looking back on it is nice.
The Deal
Now then, let's jump back to 2020. 8 TB is kind of puny now; there are higher-capacity options out there. I've seen up to 18 TB offered on Best Buy's site. In fact, there's an 18 TB external drive. At first, I thought that was the best deal, so I bought two of them at $330 each ($18.33/TB). A few days later, I checked back and saw this magic:
Black Friday sales do this kind of thing a lot. It's kind of funny to see the 14 TB drive cheaper than the 12 TB one. Another funny example was on Amazon, where a 500 GB Samsung SSD was $4 more than the 250 GB variant. But I digress. When I saw this, I cancelled my order for the two 18 TB HDDs and went with 3 of the 14 TB ones. They were even available for pickup that very same day. Sweet.
Best Buy intentionally limited purchases to 1 per customer. The way around this is to make a Best Buy Business account, which lets you purchase up to 3. The price came out much better than those 18 TB drives would have: $190 × 3 is $570, while the original deal on the 18 TB drives was 2 × $330, or $660. Saving $90 and getting 42 TB rather than 36 TB was a steal. Plus, it works out to around $13.57/TB.
So, what now?
Drive Dumping
I'm still contemplating how to configure the drives. While I do that, though, I wanted to experiment with something. So I hopped onto Arch Linux and blasted dd on the drives. But rather than wiping them clean, I'm doing the opposite: reading them out in full. I wanted to see how well an almost-empty drive would compress if I backed it up (they come with installers for Windows and Mac). In my Christmas Deathmatch Production Procedure post, I mentioned a 4.63 GB file being compressed down to a mere 773 KB. So I wanted to beat that. As an experiment, it's pretty useless, I know. But it would be nice to have a raw image of each drive's original state upon purchase.
Let's assume those three drives are connected and are identified as /dev/sda, /dev/sdb, and /dev/sdc. Let's also assume you are the superuser (run su ...).
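Before pointing dd at anything, it's worth double-checking which /dev/sdX is which. Something like this should do it (assuming util-linux's lsblk; I name each image file after its drive's serial number, so this also saves a trip to the label on the enclosure):

# list the drives with model and serial so each image can be named after its serial
lsblk -o NAME,MODEL,SERIAL,SIZE

With the right device confirmed, the whole dump is a single pipeline: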
dd if=/dev/sda bs=1M | pv -s $(blockdev --getsize64 /dev/sda) | gzip -9 > "wd14tb_SERIAL_NUMBER.img.gz"
Obviously, I don't have a drive lying around that can hold three 14 TB disk images. So compression has to be applied on-the-fly while the bytes are read off the disk directly. The UNIX shell makes this easy: just pipe dd into gzip via the |. Why gzip? For the sake of speed. I'll extract and pipe it into another algorithm like bz2 or xz later.

Trying to run xz on the piped output of dd does work. But it's more CPU intensive, and it slowed down the transfer rate. Dumping these drives would've taken 4 days rather than 23 hours if I took that route.
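For reference, the xz route is just the same pipeline with the last stage swapped out; something along these lines (same placeholder file name as above):

dd if=/dev/sda bs=1M | pv -s $(blockdev --getsize64 /dev/sda) | xz -9 > "wd14tb_SERIAL_NUMBER.img.xz"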
Watching this work via tmux, with top and ls combined with watch running alongside, was pretty cool too:

After some time, it finished. These were running in parallel, so having my computer sit for a day was enough to get all three drive contents dumped. The speed could've been faster, I suppose. But it'll do.
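In case you're wondering how the parallel part was set up: one pipeline per drive is enough, and tmux keeps them all in one place. A rough sketch, not my exact session; the serial numbers in the file names are placeholders:

# rough sketch: one dump pipeline per drive, each in its own tmux window
tmux new-session -d -s dumps
tmux new-window -t dumps -n sda 'dd if=/dev/sda bs=1M | pv -s $(blockdev --getsize64 /dev/sda) | gzip -9 > wd14tb_SERIAL_NUMBER_1.img.gz'
tmux new-window -t dumps -n sdb 'dd if=/dev/sdb bs=1M | pv -s $(blockdev --getsize64 /dev/sdb) | gzip -9 > wd14tb_SERIAL_NUMBER_2.img.gz'
tmux new-window -t dumps -n sdc 'dd if=/dev/sdc bs=1M | pv -s $(blockdev --getsize64 /dev/sdc) | gzip -9 > wd14tb_SERIAL_NUMBER_3.img.gz'
tmux attach -t dumps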
Hilariously, trying to get the compression ratio and the original size from gzip directly fails, because the size causes an integer overflow:
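For reference, gzip's trailer only has a 32-bit field for the original size, so anything over 4 GiB gets reported modulo 4 GiB by gzip -l. If you really want the true figure out of the archive itself, streaming it through wc works (slowly):

# decompress to stdout and count the bytes; accurate, but reads the whole archive
gunzip -c "wd14tb_SERIAL_NUMBER.img.gz" | wc -c

Of course, here I already know the answer: it's just the drive's capacity, 14000519643136 bytes.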
Take note: the serial numbers are edited out. I've replaced them with SERIAL_NUMBER_1, SERIAL_NUMBER_2, and so on.
With that noted, here's a fa (fl -a) listing of the directory with all three drive image dumps:
Interesting. None of them were tampered with prior to being dumped, and they all contain the same exact files. So I'm guessing there are some other very minor differences between them causing the different file sizes.
Right. That was phase 1. Now for phase 2. Let's extract each of these and repack them into bz2. This gz→bz2 repacking is as simple as:
gunzip < "wd14tb_SERIAL_NUMBER.img.gz" | pv -s 14000519643136 | bzip2 -9c > "wd14tb_SERIAL_NUMBER.img.bz2"
Sure, it's possible to just bzip2 the gzip'd file. But compressing a file twice is almost never a good idea, so we have to extract it and then let bzip2 recompress it. Again, one of the beauties of piping is that we don't need a 14 TB drive to hold the decompressed file midway; it's just passed straight into bzip2.
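If you're paranoid (and with files this size, why not), the repack can be verified as lossless by hashing both decompressed streams and comparing. A quick sketch:

# both pipelines decompress to stdout; the two digests should be identical
gunzip -c "wd14tb_SERIAL_NUMBER.img.gz" | sha256sum
bunzip2 -c "wd14tb_SERIAL_NUMBER.img.bz2" | sha256sum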
So, why use bzip2 over xz? Simple. I'm under the assumption that most of this data is just 00 bytes repeating over and over. This is one of the very few scenarios where bz2 beats every other algorithm, and by a fairly large amount. If you want a ratio comparison of pure 00's being crammed into each of these algorithms, here you go:
File | Params | Size (bytes) | Ratio |
---|---|---|---|
8GiB.bin | | 8589934592 | 1:1 |
8GiB.bin.gz | -9 | 8336315 | 1030.42:1 |
8GiB.bin.bz2 | -9 | 6030 | 1424533.09:1 |
8GiB.bin.xz | -9e | 1249556 | 6874.39:1 |
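If you want to reproduce those numbers yourself, it's just an 8 GiB file of zero bytes fed to each tool at its highest setting. Something like this should do it (assuming GNU gzip 1.6 or newer for the -k flag):

# create 8 GiB of zero bytes, then compress it with each algorithm, keeping the input
dd if=/dev/zero of=8GiB.bin bs=1M count=8192
gzip -9 -k 8GiB.bin
bzip2 -9 -k 8GiB.bin
xz -9e -k 8GiB.bin
ls -l 8GiB.bin*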
So yeah. bz2 is objectively better in this scenario. And I'm pretty sure that xz is better in almost every other case.
With that out of the way, my assumption about the bytes was correct... so behold:
Beautiful. Around a 1190709:1 compression ratio for all three drives (original size: 14000519643136 bytes), and a 1155x smaller file compared to the gzip'd version. It feels redundant and useless, but now I have image files of the original states that these drives were in upon shipment. And, I have them in a very compact state. I'm curious what is different between them. But, I'm not going to go into that in this blog post.
Drive Shucking
Now we get to the fun part. Let's shuck those drives.
These were somewhat more difficult to shuck compared to the earlier models. Honestly, it could just be that my memory is bad. There weren't many pins though. I just slipped in some guitar picks, then a screwdriver, and it was over. Here's the inside of one of them:
It's a WD140EDFZ. Notice something? It's not explicitly stating that it's a WD Red drive this time. Instead, it's white-labelled. Supposedly, even back in 2017, some of the easystores contained a white label instead of a red one. There are some posts regarding these on r/DataHoarder. I'm not going to go into an analysis of it myself.
For the curious, here's a screenshot from CrystalDiskInfo:
5400 RPM. Hmm. I've read reports that these can hit up to 7200 RPM under load. But, again, I'm not going to analyse that to prove or disprove it. Since I'm just looking for raw storage capacity, this is good. Also, the Power On Count is "5", and this screenshot is from the first time I personally plugged this drive in and powered it on.
Anyways, I eventually got the other two shucked and their hard drives extracted. Here are some pictures of those.
There were reports online that you would have to block a specific pin for the HDD to work as an internal drive. For me, this was not the case. I plugged them in and saw my 3 × 14 TB of storage. I formatted them to btrfs and started organising data from my previous drives.
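For what it's worth, the btrfs setup per drive is nothing fancy. Roughly this, with the device name, label, and mount point as placeholders (single-device filesystems, no redundancy yet):

# hypothetical example: format one shucked drive and mount it
mkfs.btrfs -L archive01 /dev/sdX
mkdir -p /mnt/archive01
mount /dev/sdX /mnt/archive01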
As for what to do with all this storage: raw archival. My media drive has been running low due to a recent interest in having multiple friends record multi-perspective videos for my YouTube channel. This'll keep me going for a few more years, and I'll upgrade once again when more storage space is inevitably needed. Redundancy also has to be considered.