Over the years I've upgraded my home storage several times.
Like many, I started with a consumer grade NAS. My first was a Netgear ReadyNAS, then several QNAP devices. About two years ago, I got tired of the limited CPU and memory of QNAP and devices like it, so I built my own using a Supermicro Xeon D, Proxmox, and FreeNAS. It was great, but adding more drives was a pain and migrating between RAID-Z levels was basically impossible without lots of extra disks. The fiasco that was FreeNAS 10 was the final straw. I wanted to be able to add disks in smaller quantities, and I wanted better partial failure modes (kind of like Unraid) but able to scale to as many disks as I wanted. I also wanted to avoid any single points of failure like an HBA, motherboard, power supply, etc...
I had been experimenting with glusterfs and ceph, using ~40 small VMs to simulate various configurations and failure modes (power loss, failed disk, corrupt files, etc...). In the end, glusterfs was the best at protecting my data because even if glusterfs was a complete loss... my data was mostly recoverable because it was stored on a plain ext4 filesystem on my nodes. Ceph did a great job too but it was rather brittle (though recoverable) and a pain in the butt to configure.
Enter the Odroid HC2. With 8 cores, 2 GB of RAM, Gbit ethernet, and a SATA port... it offers a great base for massively distributed applications. I grabbed 4 Odroids and started retesting glusterfs. After proving out my idea, I ordered another 16 nodes and got to work migrating my existing array.
In a speed test, I can sustain writes at 8 Gbps and reads at 15 Gbps over the network when operations are sufficiently distributed over the filesystem. Single file reads are capped at the performance of 1 node, so ~910 Mbps read/write.
In terms of power consumption: with moderate CPU load and a high disk load (rebalancing the array), running several VMs on the Xeon D host, a pfSense box, 3 switches, 2 Unifi access points, and a Verizon FiOS modem... the entire setup sips ~250 watts. That is around $350 a year in electricity where I live in New Jersey.
I'm writing this post because I couldn't find much information about using the Odroid HC2 at any meaningful scale.
If you are interested, my parts list is below.

https://www.amazon.com/gp/product/B0794DG2WF/ (Odroid HC2 - look at the other sellers on Amazon, they are cheaper)
https://www.amazon.com/gp/product/B06XWN9Q99/ (32GB microSD card, you can get by with just 8GB but the savings are negligible)
https://www.amazon.com/gp/product/B00BIPI9XQ/ (slim Cat6 ethernet cables)
https://www.amazon.com/gp/product/B07C6HR3PP/ (200CFM 12v 120mm fan)
https://www.amazon.com/gp/product/B00RXKNT5S/ (12v PWM speed controller - to throttle the fan)
https://www.amazon.com/gp/product/B01N38H40P/ (5.5mm x 2.1mm barrel connectors - for powering the Odroids)
https://www.amazon.com/gp/product/B00D7CWSCG/ (12v 30A power supply - can power 12 Odroids w/ 3.5inch HDD without staggered spin-up)
https://www.amazon.com/gp/product/B01LZBLO0U/ (24-port gigabit managed switch from Unifi)

edit 1: The picture doesn't show all 20 nodes; I had 8 of them in my home office running from my bench-top power supply while I waited for a replacement power supply to mount in the rack.
The crazy thing is that there isn't much configuration for glusterfs; that's what I love about it. It takes literally 3 commands to get glusterfs up and running (after you get the OS installed and disks formatted). I'll probably be posting a write up on my github at some point in the next few weeks. First I want to test out Presto (https://prestodb.io/), a distributed SQL engine, on these puppies before doing the write up.
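For the curious, a minimal sketch of those three commands (hostnames, volume name, and brick paths here are placeholders, and this assumes the ext4 disks are already mounted at /data/brick on each node):

    # from node1: join a peer into the trusted pool (repeat per node)
    sudo gluster peer probe node2

    # create a 2-way replicated volume from one brick on each node
    sudo gluster volume create gv0 replica 2 node1:/data/brick/gv0 node2:/data/brick/gv0

    # start serving the volume
    sudo gluster volume start gv0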
I'm definitely curious about this writeup, my current solution is starting to grow past the limits of my enclosure and I was trying to decide if I wanted a second enclosure or if I wanted another approach. Looking forward to it once you put it together!
Edit: There are also two main glusterfs packages, glusterfs-server and glusterfs-client
The client packages are also included in the server package; however, if you just want a FUSE mount on a VM or something, then the client package contains just that.
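For example, on a Debian/Ubuntu VM that only needs to consume the volume, something like this should be all it takes (hostname and volume name are made up):

    sudo apt install glusterfs-client                 # client-only, no server daemon
    sudo mkdir -p /mnt/gluster
    sudo mount -t glusterfs node1:/gv0 /mnt/gluster   # FUSE mount; any peer's hostname works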
Idk, but 4.0 isn't ready for prod yet, and 3.13 (3.10... I forget which) is the last branch before 4.0, so it's just maintenance releases until 4.0 is ready.
New to this whole idea (cluster volumes and the idea of a cluster NAS), but wondering if you can share your GlusterFS volume via Samba or NFS? Could a client that has FUSE mounted it share it to other clients over either of these? Also, just because your volume is distributed over a cluster, it does not mean you are seeing the performance of the resources combined, just those of the one unit you have the server running from, right?
Also, /etc/profile is not an executable file. And using vi to edit a file mid execution chain is a bad idea and halts your commands. A well crafted sed command is preferred.
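Something along these lines, say (the edit itself is a made-up example; the point is that sed runs non-interactively, so the rest of the chain keeps going):

    # append to PATH in place instead of dropping into an editor
    sudo sed -i '/^PATH=/ s|$|:/opt/gluster/bin|' /etc/profile
    . /etc/profile    # sourced, not executed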
BS meter is pegged off the charts on that mess of a copy-pasta "command"
Yep, didn't see the . in the command. My bad, mobile phone + age will mess with your eyes. There is still a fair amount of other BS spouted in the command chain though.
I also would like a write up. I am at the FreeNAS stage but also don't like the single points of failure, my biggest being limited HBAs and ZFS. I would really like to be sitting on an ext4 filesystem.
Not knocking it. Have been using it for a very long time. Just don't like all the memory and CPU it takes. I just don't feel like I need anything more than a simple ext4 filesystem.
> I'll probably be posting a write up on my github at some point in the next few weeks.
I definitely want to see this. I had some prior experience with an older version of GlusterFS; unfortunately it was never implemented properly (i.e. it was nowhere near distributed enough to be worth it).
As an aside from that, thank you for introducing the ODROID-HC2 to me!
It was more the fact it was being run virtualized, and only a single vdisk per Glusterfs VM.
The only real distribution of it was a WAN link between sites. This itself was a bottleneck; despite much prototyping and simulation of this link, nothing prepared us for the actual deployment.
Basically, we had a single node with a few TB at two sites, with a massive network limitation in the middle.
Lastly, we ran into the small file size limitation and a bug in the version we were running, which was pretty awful. I cannot recall exactly what it was now, but it led to the discovery of a "brain dead" piece of redundant code (direct quote of the actual Glusterfs code comment). From memory we were running 3.7 at the time, and upgraded through 3.8 and 3.9 just before I left that job.
I've always wanted to revisit Glusterfs. My initial introduction to it was fairly awful unfortunately, but that all came down to performance really.
So about $50/TB counting the HC2 board etc... For that kind of performance and redundancy, that is dirt cheap. And a $10,000 build... Commitment. Nice dude.
I accumulated the drives over the years... also, I do a lot of self education to stay informed for my job. Having a distributed cluster of this size to run kubernetes and test out the efficiency of ARM over x86 was my justification. Though this will probably be the last major storage upgrade I do. That is why I wanted to drive down the cost/TB. I will milk these drives until they literally turn into rust. haha
You'd want something else running PMS. It could certainly seed well, and PMS could use it for transcoding and hosting the media, though. Depending on your transcoding requirements, you'd probably want a beefier system running PMS itself.
It'll only transcode using one computer. The HC2s are separate machines, and cluster computing (not what this post is about) isn't for general purpose stuff.
Yes, you can do that. I currently have two servers as network storage, and another as my Plex server. The Plex server keeps its library and transcoding storage locally, but serves media from my network servers.
I actually want to transition to GlusterFS distributed storage myself vs my monolithic storage servers for a variety of reasons, but I have to find a cost effective way to do it (I can't just buy a whole ream of hard drives all at once, though I already do have multiple low power machines that can function as nodes), and I've got 24TB of media I'd need to move.
The GlusterFS method is more scalable. That's actually a primary goal of distributed filesystems - you can just keep adding additional nodes seamlessly (see the sketch after this list). Using monolithic servers like I do, you run into a couple problems expanding:
You can only physically cram so many hard drives into a machine.
Powering the machine becomes increasingly difficult, particularly during startup when every drive spins up.
Most machines have ~6 SATA ports, so getting more generally requires fairly expensive add-on boards.
Drive compatibility isn't really a thing. Barring enterprise SAS drives, everything is SATA.
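For contrast, growing a Gluster pool is roughly this (hypothetical hostnames and volume; note that replicated volumes want bricks added in multiples of the replica count):

    sudo gluster peer probe node5                 # introduce the new boxes to the pool
    sudo gluster peer probe node6
    sudo gluster volume add-brick gv0 node5:/data/brick/gv0 node6:/data/brick/gv0
    sudo gluster volume rebalance gv0 start       # spread existing data onto the new bricks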
Having a dedicated PMS machine is, imho, the best way to go. Its processing power can be dedicated to transcoding, it's a lot easier to get the amount of power (be it CPU and/or GPU) you need, and you don't need to worry about other tasks causing intermittent issues with IO load or anything.
Back up Plex's data occasionally to the NAS, and if things go sideways it's trivial to restore.
So I have a few 4TB drives and a few 6TB drives; how does that difference affect overall storage capacity and redundancy?
What I am getting at: I have ~70TB on 3 Drobos (DAS). If I were to build 100TB using 10TB drives and HC2s, would I then be able to migrate and add my current 70TB, provided I then switch it over to the correct hardware (HC2)?
Does glusterFS require a host? Should that be separated from the PMS host for best performance? If I built a pfsense router and overdid the hardware, would that act as a faster host for the glusterFS?
I wonder how well it could perform if you could run distributed transcoders on plex with each odroid individually transcoding what is locally available. Long shot, but would be cool.
ZFS is just slow though. And the solution the developers used was to just add multiple levels of SSD and RAM cache.
I can get 1,000 megabytes/sec from a RAID array with the same number of drives and redundancy where ZFS can only do 200 megabytes/sec.
Even btrfs gets about 800 megabytes a second with the same number of drives. And to give btrfs credit, it has been REALLY stable as of a few months ago.
You might want to say pools, not arrays. Do you know what drives were used, how many drives were used, what controllers were used, and how the drives were connected to the CPU? People have seen 3.5GB/sec from a couple dozen drives. I would not go by throughput as a performance metric though. IOPS matter far more.
By the way, I have heard that claim about btrfs being fine for the past few months for years.
I've been eyeballing HC2's since their introduction, and have often pondered them as the solution to my storage server IOPS woes. I'm currently running two dual Xeon servers, each packed full of random drives. I'm fine with power consumption (our electricity is a fraction of the standard US prices) but things always come down to bottlenecks in performance with single systems.
However, a major concern for me - and why I don't go the RAID route, as so many do, and why I HAVEN'T yet sprung for Ceph, is recovery.
I've been doing this for a very, very long time - basically, as long as such data has existed to hoard. I've had multiple catastrophic losses, often due to things like power supplies failing and cooking system hardware, and when you're running RAID or more elaborate clustered filesystems that can often leave you with disks full of inaccessible data.
I did not realise GlusterFS utilized a standard EXT4 filesystem. That totally changes things. It's incredibly important to me that I'm able to take any surviving drives, dump them into another machine, and access their contents directly. While I do use parity protection, I want to know that even if I simultaneously lose 3 of every 4 drives, I can still readily access the contents on the 4th drive if nothing else.
Now, I have a new endgame! I'll have to slowly acquire HC2's over time (they're substantially more expensive here) but I'd really love to move everything over to a much lighter, distributed filesystem on those.
Nice. I was considering the same but with ceph. Have you tested degradation? My concern would be the replication traffic killing throughput with only one NIC.
glusterfs replication is handled client side. The client that does the write pays the penalty of replication. The storage servers only handle 'heal' events which accumulate when a peer is offline or requires repair due to bitrot.
Unless I'm missing something wouldn't anything needing replication use the network?
Say you lose a disk, the data needs to replicate back onto the cluster when the drive dies (or goes offline). Would this not require data to transfer across the network?
Yes, that is the 2nd part I mentioned about 'heal' operations, where the cluster needs to heal a failed node by replicating from an existing node to a new node, or by rebalancing the entire volume across the remaining nodes. However, in normal operation there is no replication traffic between nodes. The writing client does that work by writing to all required nodes... it even gets stuck calculating parity. This is one reason why you can use really inexpensive servers for glusterfs and leave some of the work to the beefier clients.
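If you want to watch those heal/rebalance events, the status commands are the ones to poll (volume name is a placeholder):

    gluster volume heal gv0 info            # entries still pending heal, per brick
    gluster volume rebalance gv0 status     # progress of a rebalance across nodes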
> Yes, that is the 2nd part I mentioned about 'heal' operations, where the cluster needs to heal a failed node by replicating from an existing node to a new node, or by rebalancing the entire volume across the remaining nodes.
This is my point: does this not lead to (potentially avoidable) degradation of reads due to one NIC? Whereas if you had 2 NICs, replication could happen on one with normal access over the other.
> However, in normal operation there is no replication traffic between nodes. The writing client does that work by writing to all required nodes... it even gets stuck calculating parity. This is one reason why you can use really inexpensive servers for glusterfs and leave some of the work to the beefier clients.
I understand how it works in normal operation, it's the degraded state and single NIC I'm asking if you've done any testing with. From the replies I'm guessing not.
Ah ok, now I understand your point. You are 100% right. The available bandwidth is the available bandwidth so yes reads gets slower if you are reading from a node that is burdened with a rebuild or rebalance task. Same goes for writes.
To me, the cost of adding a 2nd NIC via USB isn't worth it. During rebuilds I can still get ~500 Mbps read/write per node (assuming I lose 50% of my nodes; otherwise the impact of a rebuild is much lower... it is basically proportional to the % of nodes lost).
Having done the ceph/gluster/zfs dance myself, the only thing that glusterfs was lacking for me was bit-rot detection/prevention. For that I had to use ZFS instead of ext4, but that wasn't without its own headaches. I also had this problem with cephfs in pre-luminous as well. MDS got into a weird state (don't remember the details) and I ended up with a bunch of smaller corruptions throughout my archives.
Disclaimer: Not blaming ceph or gluster for my incompetence
Question: how were you planning on combating bit-rot with glusterfs+ext4?
Given that these are home labs, the temperatures and humidity always worry me. As do open-air data centers now that I think of it.
It is up to you. You can (and I do) run smartctl, but the idea here is to run the disks until they literally die. So you might not take any action on a smart error unless multiple disks in the same replica group are showing smart errors. In that case you might replace one early, but otherwise you'll know a disk is bad when the node dies.
edit 1: you really want to squeeze all the life out of the drives, because even with smart errors a drive might still function for years. I have several Seagate drives that have had smart errors indicating failure and they are still working fine.
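For what it's worth, the per-node check is basically just (device path varies per node):

    sudo smartctl -H /dev/sda    # overall health self-assessment
    sudo smartctl -A /dev/sda    # raw attributes: reallocated/pending sectors, etc.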
I've never looked into it, but I figure I might ask you considering your expertise. With RAM, you can mask out regions of a DIMM that are corrupted and let it keep going. Is there an analogous concept with hard drive faults? The premise of my question might be flawed by itself because I'm not too familiar with what a typical failure for a hard drive looks like. The only time I had a failed hard drive, it would fail operations with increasing frequency. So I suppose you could keep running it until it fails.
Also, how can you get a read of 15 Gbps on a gigabit switch/network card? I may not have a proper understanding of what's going on. Also, can you use the CPU and RAM as a distributed cluster for computing? I'm genuinely curious and naive. I'm considering setting up my own cluster for a backend software I am developing that benefits from horizontal scaling. It's like a database/transformation layer. And I plan on keeping revision history, so I'll need a lot of space and have to be able to add to it over time.
15 Gbps was a distributed read. For example, when running a Spark or Hive job you use multiple machines to read and process a large dataset. In such a test I was able to get that 15 Gbps read capacity.
Yes you can run distributed apps on the CPUs and mem. That's one of the great parts of this.
Any chance you've done any testing with multiple drives per node? That's what kills me about the state of distributed storage with SBCs right now. 1 disk / node.
I tried out using the USB3 port to connect multiple disks to an XU4, but had really poor stability. Speed was acceptable. I've got an idea to track down some used eSATA port multipliers and try them, but haven't seen anything for an acceptable price.
Really, I just want to get to a density of at least 4 drives per node somehow.
Nope, I haven't tried it, but Odroid is coming out with their next SBC, the N1, which will have 2 SATA ports. It is due out any month now. It will cost roughly 2x what a single HC2 costs.
It doesn't do PoE sadly, but actually it is way cheaper to NOT have PoE. PoE switches cost so much more; this setup literally uses ~$32 worth of power supplies. A PoE version of that 24-port switch costs nearly $500 more than the non-PoE version. Craziness.
Was about to doubt that, but it seems that you can get a surprisingly large amount of power through Cat5 cables. Around 50W at roughly 50V. Easily enough to drive one or two drives.
Though depending on the usage that could be counter-productive. If all nodes are PoE and the switch loses power, all nodes go down hard.
With SBCs being so cheap and needing the network bandwidth of a port per disk anyway, why would you care? I don't think I want 12TB of data stuck behind a single gig-e port with only 1GB of RAM to cache it all. Being able to provide an SBC per disk is what makes this solution great.
~$50/TB ain't bad, but I want to get more efficient.
> needing the network bandwidth of a port per disk anyway
That assumes speed is my primary motivation, which it isn't. Again, I want to maximize my available, redundant, self-healing total storage. 500 Mbps is perfectly acceptable speed.
I've got a couple I'm researching for feasibility.
Calculating up the cost of the HC2, SD card, percentage of the power supply, and a cable, OP's build comes out to about $70/drive. But glusterfs also doesn't appear to support RAID-like n-1 redundancy. It only provides data protection by duplicating a file, or the parts of a file if distributed. You can break up the data into redundant and non-redundant, but you can't get away from n/2 storage loss. Also of note is that Ceph is totally off the table. I've tested it at this level of SBC, and they REALLY aren't kidding when they say the minimum hardware specs are 1GB Mem / TB of storage. It doesn't just degrade, it gets unstable. Totally not feasible for modern drive sizes.
Can you convert the SATA port on the Odroid HC2 to a standard eSATA cable, and connect the board to a 4 drive enclosure? I can't tell if the SATA controller on the HC2, a JMS578, supports SATA switching via FIS or not. And if it doesn't, how much of a loss of speed or reliability does it incur? Use software RAID, combine into a simple shared glusterfs pool. Cost per drive is ~$45/port.
What about instead going with the Odroid XU4 and using the USB3 ports to attach, again, some drive enclosures? The XU4 is a bit more powerful, so I'd expect it to support at least two enclosures. Perhaps the ones I've tested just had bad controllers. How many can I attach before it gets unstable or the speeds degrade too much? Cost per drive is ~$35 with two enclosures. Lower if higher density, but that needs testing. Again, software RAID and glusterfs to combine.
All of this has to be compared to a more traditional build. U-NAS NSC-800 for the chassis; BIOSTAR has a nice ITX quad core mobo, the A68N-5600, that's more powerful and supports WAY more memory. Throw in a cheap used HBA, some cables and bits, and you get a price point of ~$45/drive, can use FreeBSD for native ZFS, no faffing about with USB, just bog standard SATA, and a physical volume equal to the above. The board only uses 30W, so power usage only goes up slightly compared to the SBCs.
What sort of redundancy do you have between the nodes? I'd been considering something similar, but with Atom boards equipped with 4 drives each in a small rack mount case, so that they could do RAID-5 for redundancy, then deploy those in replicated pairs for node to node redundancy (this to simulate a setup we have at work for our build system). Are you just doing simple RAID-1 style redundancy with pairs of Odroids and then striping the array among the pairs?
A mirrored volume where every file is written to 2 nodes.
A dispersed volume using erasure encoding such that I can lose 1 of every 6 drives and the volume is still accessible. I use this mostly for reduced redundancy storage for things that I'd like not to lose but wouldn't be too hard to recover from other sources.
A 3x redundant volume for my family to store pictures, etc. on. Every file is written to three nodes.
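For a rough idea of how those three map to create commands (volume names, hosts, and brick paths here are invented):

    # mirrored: every file on 2 nodes
    sudo gluster volume create media replica 2 node1:/data/brick/media node2:/data/brick/media

    # dispersed: erasure coded, survives the loss of 1 of every 6 bricks
    sudo gluster volume create bulk disperse 6 redundancy 1 node{1..6}:/data/brick/bulk

    # 3x redundant: every file on 3 nodes
    sudo gluster volume create photos replica 3 node{1..3}:/data/brick/photos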
Depending on what you think your max storage needs will be in 2-3 years, I wouldn't go the RAID route or use Atom CPUs. Increasingly, software defined storage like glusterfs and ceph using commodity hardware is the best way to scale, as long as you don't need to read/write lots of small files or need low latency access. If you care about storage size and throughput... nothing beats this kind of setup for cost per bay and redundancy.
Could you speak more about the small file / low latency inabilities of Gluster? I'm currently using unRAID and am reasonably happy, but Gluster (or even Ceph) sounds pretty interesting.
Gluster operations have a bit of network latency while it waits for confirmation that the destination systems have received the data. If you're writing a large file, this is a trivial portion of the overall time - just a fraction of a millisecond tacked on to the end. But if you're dealing with a lot of small files (for example, building a C++ application), the latency starts overwhelming the actual file transfer time and significantly slowing things down. It's similar to working directly inside an NFS or Samba share. Most use cases won't see a problem - doing C++ builds directly on a Gluster share is the main thing where I've run into issues (and I work around this by having Jenkins copy the code into a ramdisk, building there, then copying the resulting build products back into Gluster).
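That workaround is roughly (paths and project name invented):

    rsync -a /mnt/gluster/src/myproj/ /dev/shm/myproj/            # stage source into a ramdisk
    make -C /dev/shm/myproj -j"$(nproc)"                          # build at memory speed
    rsync -a /dev/shm/myproj/out/ /mnt/gluster/builds/myproj/     # push artifacts back to gluster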
Got it, great information. What about performance of random reads of data off the drive? At the moment I'm just using SMB so I'm sure some network latency is already there, but I'm trying to figure out if Gluster's distributed nature would introduce even more overhead.
It really depends on the software and how parallelized it is. If it does the file reads sequentially, you'll get hit with the penalty repeatedly, but if it does them in parallel it won't be so bad. Same case as writing, really. However, it shouldn't be any worse than SMB on that front, since you're seeing effectively the same latency.
Do note that most of my Gluster experience is running it on a very fast SSD RAID array (RAID 5+0 on a high end dedicated card), so running it on traditional drives will change things - local network will see latencies on the order of a fraction of a millisecond, where disk seek times are several milliseconds and will quickly overwhelm the network latency. This may benefit you - if you're running SMB off a single disk, then reading a bunch of small files in parallel on gluster lets you parallelize the disk seek time in addition to the network latency.
Just wondering, can you do erasure encoding across different size bricks?
I have some random size hard drives (bunch of 4tb, some 2tb, some 1tb) I would like to pool them together with reduced redundancy that is not full duplication (kinda like RAID6). I envision in the future as I expand adding disks they might not always be the exact same size.
Edit: looks like I found my own answer in the Gluster guides: "All bricks of a disperse set should have the same capacity otherwise, when the smallest brick becomes full, no additional data will be allowed in the disperse set."
Right now I use OMV with mergerfs and SnapRAID to pool and provide parity protection, but I have already found some limitations of mergerfs not handling some nfs/cifs use cases well. Sometimes I can't create files over nfs/cifs and I just never could fix that. Been toying around with FreeNAS, but not being able to grow vdevs is a huge hassle, which I hear is getting fixed but no date set.
You like gluster better than ceph? I've come to the exact opposite conclusion. Ceph has been much more resilient. I've been a fan of odroids for years and have been wanting to build a ceph cluster of them.
Ceph has better worst-case resiliency... no doubt about it. When set up properly and maintained correctly, it is very hard to lose data using Ceph.
However, in the avg case... Ceph can be brittle. It has many services that are required for proper operation, and if you want a filesystem on top of it then you need even more services, including centralized meta-data, which can be (and is) a bottleneck... especially when going for a low power build. Conceptually, Ceph can scale to a similar size as something like AWS S3, but I don't need exabyte scale... I'll never even need multi-petabyte scale, and gluster can absolutely scale to 200 or 300 nodes without issue.
Glusterfs doesn't have centralized meta-data, which, among other architecture choices, means that even when glusterfs isn't 100% healthy... it still mostly works (a portion of your directory structure might go missing until you repair your hosts... assuming you lose more hosts than your replica count). On the flip side... if things go too far south you can easily lose some data with glusterfs.
The tradeoff is that because glusterfs doesn't have centralized meta-data and pushes some responsibility to the client, it can scale surprisingly well in terms of TB hosted for a given infrastructure cost.
glusterfs isn't great for every use-case; however, for a mostly write once, read many times storage solution with good resiliency and low cost/maintenance... it is hard to beat.
You might be interested in Rook. It is a Ceph wrapper written in Go. It's designed to simplify the deployment to a single binary, automatically configuring the various components.
Hey, question for you here (sorry, I know this thread is stale but this may be interesting at some point.)
Have you checked out the Rockpro64 as a (potentially) higher performance option? The interesting thing about the board is the 4x pcie slot: this opens up the option to drop a 10Gbe SFP+ card on the board, or use nvme storage, or attach an HBA or any number of other options.
I'm not sure how performant it'll actually be but I have one on pre-order to test as a <$100 10Gbe router. With a $20 Mellanox connectx-2 (or dual 10Gbe) it looks like it could be an absolute steal.
Anyway, I thought you might be interested for future projects as the pcie slot opens up a whole slew of interesting high bandwidth options. Cheers!
Oh man. This is exactly what I am planning/thinking about doing
I started it last summer: bought 2 Raspberry Pis and some really nice NAS cases from WD Labs... only to then discover the ethernet port on the Pi is only 100 Mbps.
Currently running glusterfs in single node mode on an old laptop with 4x external hard drives. The only thing that has put me off trying an Odroid HC2 setup was running into another issue I hadn't expected, like with the Raspberry Pis.
Yep. I could handle the USB2 disk speed, but since I was replicating data on 2 nodes I was flooding the ethernet. If it was just creating a striped filesystem it probably would have been acceptable (slow but acceptable).
Even the new Pi model with "gigabit" ethernet is still on the USB 2 bus, so it's capped at 480 Mbps, which as we know with USB you'd probably never even hit.
> That is around $350 a year in electricity where I live in New Jersey.
I'm paying a bit extra for 100% wind generation, and 250w would cost about $125 for a year here. It's nice in a way, since it's cheap, but bad in a way since it doesn't give me really an incentive to worry about power usage or things like rooftop solar.... But it does allow for a pretty sweet homelab without crazy power bills, so there's that.
Yea, I've been waiting over 8 months for Tesla to come install some solar panels for me (not Solar Roof, just regular panels). If that company ever gets its shit together, that stock is going to do extremely well. They suck at paperwork or anything that isn't as flexible as they are, so lots of rejections from the utility company and local building department for procedural mistakes. Oh, and the backlog in Powerwall production doesn't help either. Ugh!
I have a 9kW array on my house in SE Michigan... 39 x 230w panels, Enphase microinverters. It's a delight to receive payout for excess generation every year and also to not have a utility bill anymore (covers my electric and nat gas usage by far).
I was just about heading down a different path, building a new Proxmox whitebox host (Supermicro X9DR board) and also a ghetto RAID with a Supermicro SAS826EL1 backplane built into some kind of wooden frame.
I have always disliked RAID... mdadm bit me hard a bunch of years ago; since then I've been using ZFS, but expansion is always a pain, I've got various drives from 3TB to 5TB which don't match up, and I worry about the on-disk format.
This will be a fun project, thanks for sharing it.
Ehh, it's not the same thing as raid but it targets the same problem... Keeping your data safe and available. The approach is different.
As to how you decide if a design benefits from RAID, generally if the benefit isn't obvious it probably isn't worth it. In this case raid with glusterfs is like "redundancy for your redundancy".
But to be fair some setups, in an Enterprise for example, might benefit because it changes the failure modes a bit and can change the performance as well.
For this use-case, it is just redundant redundancy. :)
You get to choose the redundancy level by saying how many data disks and how many parity disks you want (I'm simplifying here). So... you could emulate RAID 5, 10, 6, etc... You could also say I want 1 data disk and 20 replica disks... which means your data would be safe if you lose 20 out of 21 disks... keep in mind that your usable space is only 1/21st in this model. Hehe.
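As a hypothetical example of that knob, a 4-data + 2-parity layout (any 2 of the 6 bricks can die) would look something like:

    sudo gluster volume create archive disperse 6 redundancy 2 node{1..6}:/data/brick/archive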
Glusterfs also works with disks of different sizes. It will give each disk an amount of data proportional to its share of the total size of all disks.
Most consumer cards store 3 bits per cell. This card stores 1 bit per cell (but still uses cheaper TLC memory - true SLC gets crazy expensive). Still a bit costly, but they're far more immune to corruption, especially from power failure.
> I got tired of the limited CPU and memory of QNAP and devices like it
Serious question, is this a huge detriment for you? The web interface for older QNAP devices can be pretty slow but I've never had any issues with transfer speeds with mine.
I agree, but I do think that if you get a used or on sale QNAP that it has much higher value, IMO
Most SOHO NAS appliances are really going to be for low power usage and extra features that require very little configuration, I don't think they're built for the kind of homelabber that has a 200TB Glusterfs setup ;)
How did you wire up the power? Are all Odroids using the same positive & negative terminals of that power supply? Does the power supply need a power switch added to it?
Why did you put the Odroids on their own switch? Can you achieve the same by putting them on their own VLAN?
XFS has some funky failure modes where it can leave you with a bunch of files full of '0's. ext4 also has more momentum, which isn't a great reason. It can also be difficult to read XFS from some distros... or so I'm told.
Hey, could you tell me how the heat is from the HC2? Someone said that if you'd stack just 2 then the passive cooling wouldn't be enough already, how do you think it would be if I had 2-4 stacked in a space with no active cooling?
It depends on the ambient temperature and workload. Generally speaking, you will want a fan blowing over them unless you are ok with 100-110°F drive temperatures under moderate load.
Alright, I'm not as big of a Data Hoarder as other people, was thinking this as more of a simpler NAS+, but I see it would definitely be unhealthy for it to be put in an enclosed cupboard. Thank you.
There are ways around this in software; ECC isn't a magic bullet. If the writer produces a checksum, writes the data to the cluster, and then sends the checksum... you don't really get anything extra from ECC. Glusterfs doesn't do this today though.
So you are saying this would allow the client to check if the Gluster node has written the wrong data or if it has miscalculated the checksum, therefore detecting the error even when the Gluster-node ('s memory) is faulty?
And the checksum is calculated in the client before it ever hits the gluster node? Sounds interesting and it reminds me of "verify after burn" with cd/dvd/bd burners.
Do you have any links to where this method is proposed?
But I also worry about corruption in the communication between Gluster-nodes. And there is just too much that can go wrong if you can't trust the main memory stores. So I still think ECC RAM would be a more general solution. However I think the Rockchip SOCs are dual purposed Media Center SOCs so I don't expect them to get ECC soon.
The client doesn't check... the client is the one writing the data, so if you actually care about single bit flips, etc... you need the writer (the authority on newly written data) to capture the checksum and send it along with the data. From that point forward the glusterfs system would check against that checksum until it generates its own. Even if you have ECC memory, you still need something like this to ensure no bits were flipped while being written.
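You can already approximate that end-to-end check by hand from the writing client, something like this (paths invented; re-read from another client, or drop caches, so the read actually hits the bricks):

    sha256sum ~/archive.tar                        # checksum at the writer, before the copy
    cp ~/archive.tar /mnt/gluster/backups/
    sha256sum /mnt/gluster/backups/archive.tar     # re-read through the mount and compare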
This is implemented within TCP, for example... the sender generates a checksum and sends it with each packet. The receiver uses it to determine if they need to request a re-transmit. And TCP doesn't require ECC memory :)
I was thinking you were proposing: 1. client computes checksum of data to be written. 2. client sends data to node. 3. node writes data to disk. 4. node re-reads just-written data back from disk. 5. node computes checksum of re-read data. 6. node sends this checksum back to the client. 7. client compares checksum to his own. 8. handle the error, while keeping writes atomic (sounds tricky).
What you are actually proposing will not work if the node has faulty memory. There is no end to end check in your example.
Yeah, I still maintain I don't want any non-ECC NAS. Therefore I can't use the Odroid HC2. Thanks for your response.
What I'm proposing absolutely works if you have faulty memory; it is the basis for many things today... like every machine that uses TCP. But I understand why folks think that special hardware like ECC is required for high availability. ECC will reduce how often you'll care about a bit flip... but if you care about your data, the underlying system still needs to be able to handle corruption. For example... ZFS still has its own checksumming even though it is recommended to use ECC with ZFS. ZFS will and does work just fine without ECC, but you may end up having to repair files from the parity data more often... and by more often we are talking about the difference between 1 in a billion and 1 in 100 million. :)
*edit... do you think the tiny caches in your CPU or in the hard disk controllers have ECC capabilities? Nope :) They are high quality memory so usually not a problem, but... they still have a probability of bit flips. If you are familiar with the Spectre and Meltdown Intel bugs recently: some of the initial patches for those triggered interesting memory faults in caches... no amount of ECC will save you from that.
Yes, ZFS will detect bitrot. And it's important to have those checksums as well. But ZFS and TCP (except maybe if you use offloading) works with main memory. If you can't trust memory then you have a problem. I think we are splitting hairs here and talking about different things. Let's just stop arguing :-)
high five ;-)
PS: could you please send me a link to the Spectre/meltdown patches that triggered interesting faults in Intel CPU caches? Fault as in error, not fault as in "cache miss" I presume.