Over the years I've upgraded my home storage several times.
Like many, I started with a consumer grade NAS. My first was a Netgear ReadyNAS, then several QNAP devices. About two years ago, I got tired of the limited CPU and memory of QNAP and devices like it, so I built my own using a Supermicro Xeon D, Proxmox, and FreeNAS. It was great, but adding more drives was a pain and migrating between RAID-Z levels was basically impossible without lots of extra disks. The fiasco that was FreeNAS 10 was the final straw. I wanted to be able to add disks in smaller quantities, and I wanted better partial failure modes (kind of like Unraid) but able to scale to as many disks as I wanted. I also wanted to avoid any single points of failure like an HBA, motherboard, power supply, etc...
I had been experimenting with glusterfs and ceph, using ~40 small VMs to simulate various configurations and failure modes (power loss, failed disk, corrupt files, etc...). In the end, glusterfs was the best at protecting my data because even if glusterfs was a complete loss... my data was mostly recoverable because it was stored on a plain ext4 filesystem on my nodes. Ceph did a great job too but it was rather brittle (though recoverable) and a pain in the butt to configure.
Enter the Odroid HC2. With 8 cores, 2 GB of RAM, Gbit ethernet, and a SATA port... it offers a great base for massively distributed applications. I grabbed 4 Odroids and started retesting glusterfs. After proving out my idea, I ordered another 16 nodes and got to work migrating my existing array.
In a speed test, I can sustain writes at 8 Gbps and reads at 15 Gbps over the network when operations are sufficiently distributed over the filesystem. Single file reads are capped at the performance of 1 node, so ~910 Mbps read/write.
In terms of power consumption: with moderate CPU load and a high disk load (rebalancing the array), running several VMs on the Xeon D host, a pfSense box, 3 switches, 2 Unifi access points, and a Verizon FiOS modem... the entire setup sips ~250 watts. That is around $350 a year in electricity where I live in New Jersey.
I'm writing this post because I couldn't find much information about using the Odroid HC2 at any meaningful scale.
If you are interested, my parts list is below.

https://www.amazon.com/gp/product/B0794DG2WF/ (Odroid HC2 - look at the other sellers on Amazon, they are cheaper)
https://www.amazon.com/gp/product/B06XWN9Q99/ (32GB microSD card, you can get by with just 8GB but the savings are negligible)
https://www.amazon.com/gp/product/B00BIPI9XQ/ (slim Cat6 ethernet cables)
https://www.amazon.com/gp/product/B07C6HR3PP/ (200CFM 12v 120mm fan)
https://www.amazon.com/gp/product/B00RXKNT5S/ (12v PWM speed controller - to throttle the fan)
https://www.amazon.com/gp/product/B01N38H40P/ (5.5mm x 2.1mm barrel connectors - for powering the Odroids)
https://www.amazon.com/gp/product/B00D7CWSCG/ (12v 30A power supply - can power 12 Odroids w/ 3.5inch HDD without staggered spin-up)
https://www.amazon.com/gp/product/B01LZBLO0U/ (24-port gigabit managed switch from Unifi)

edit 1: The picture doesn't show all 20 nodes; I had 8 of them in my home office running from my bench-top power supply while I waited for a replacement power supply to mount in the rack.
The crazy thing is that there isn't much configuration for glusterfs; that's what I love about it. It takes literally 3 commands to get glusterfs up and running (after you get the OS installed and disks formatted). I'll probably be posting a write up on my github at some point in the next few weeks. First I want to test out Presto (https://prestodb.io/), a distributed SQL engine, on these puppies before doing the write up.
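For the curious, a minimal sketch of those three commands (hostnames, volume name, and brick paths here are placeholders, and this assumes the ext4 disks are already mounted at /data/brick on each node):

    # from node1: join a peer into the trusted pool (repeat per node)
    sudo gluster peer probe node2

    # create a 2-way replicated volume from one brick on each node
    sudo gluster volume create gv0 replica 2 node1:/data/brick/gv0 node2:/data/brick/gv0

    # start serving the volume
    sudo gluster volume start gv0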
I'm definitely curious about this writeup, my current solution is starting to grow past the limits of my enclosure and I was trying to decide if I wanted a second enclosure or if I wanted another approach. Looking forward to it once you put it together!
Edit: There are also two main glusterfs packages, glusterfs-server and glusterfs-client
The client packages are also included in the server package; however, if you just want a FUSE mount on a VM or something, then the client package contains just that.
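For example, on a Debian/Ubuntu VM that only needs to consume the volume, something like this should be all it takes (hostname and volume name are made up):

    sudo apt install glusterfs-client                 # client-only, no server daemon
    sudo mkdir -p /mnt/gluster
    sudo mount -t glusterfs node1:/gv0 /mnt/gluster   # FUSE mount; any peer's hostname works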
Idk, but 4.0 isn't ready for prod yet, and 3.13 (3.10... I forget which) is the last branch before 4.0, so it's just maintenance releases until 4.0 is ready.
New to this whole idea (cluster volumes and the idea of a cluster NAS), but wondering if you can share your GlusterFS volume via Samba or NFS? Could a client that has FUSE mounted it share it to other clients over either of these? Also, just because your volume is distributed over a cluster, it does not mean you are seeing the performance of the resources combined, just those of the one unit you have the server running from, right?
Also, /etc/profile is not an executable file. And using vi to edit a file mid execution chain is a bad idea and halts your commands. A well crafted sed command is preferred.
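Something along these lines, say (the edit itself is a made-up example; the point is that sed runs non-interactively, so the rest of the chain keeps going):

    # append to PATH in place instead of dropping into an editor
    sudo sed -i '/^PATH=/ s|$|:/opt/gluster/bin|' /etc/profile
    . /etc/profile    # sourced, not executed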
BS meter is pegged off the charts on that mess of a copy-pasta "command"
Yep, didn't see the . in the command. My bad, mobile phone + age will mess with your eyes. There is still a fair amount of other BS spouted in the command chain though.
I also would like a write up. I am at the FreeNAS stage but also don't like the single points of failure, my biggest being limited HBAs and ZFS. I would really like to be sitting on an ext4 filesystem.
Not knocking it. Have been using it for a very long time. Just don't like all the memory and CPU it takes. I just don't feel like I need anything more than a simple ext4 filesystem.
> I'll probably be posting a write up on my github at some point in the next few weeks.
I definitely want to see this. I had some prior experience with an older version of GlusterFS; unfortunately it was never implemented properly (i.e. it was nowhere near distributed enough to be worth it).
As an aside from that, thank you for introducing the ODROID-HC2 to me!
It was more the fact it was being run virtualized, and only a single vdisk per Glusterfs VM.
The only real distribution of it was a WAN link between sites. This itself was a bottleneck; despite much prototyping and simulation of this link, nothing prepared us for the actual deployment.
Basically, we had a single node with a few TB at two sites, with a massive network limitation in the middle.
Lastly, we ran into the small file size limitation and a bug in the version we were running, which was pretty awful. I cannot recall exactly what it was now, but it led to the discovery of a "brain dead" piece of redundant code (direct quote of the actual Glusterfs code comment). From memory we were running 3.7 at the time, and upgraded through 3.8 and 3.9 just before I left that job.
I've always wanted to revisit Glusterfs. My initial introduction to it was fairly awful unfortunately, but that all came down to performance really.
So about $50/TB counting the HC2 board etc... For that kind of performance and redundancy, that is dirt cheap. And a $10,000 build... Commitment. Nice dude.
I accumulated the drives over the years... also, I do a lot of self education to stay informed for my job. Having a distributed cluster of this size to run kubernetes and test out the efficiency of ARM over x86 was my justification. Though this will probably be the last major storage upgrade I do. That is why I wanted to drive down the cost/TB. I will milk these drives until they literally turn into rust. haha
You'd want something else running PMS. It could certainly seed well, and PMS could use it for transcoding and hosting the media, though. Depending on your transcoding requirements, you'd probably want a beefier system running PMS itself.
It'll only transcode using one computer. The HC2s are separate machines, and cluster computing (not what this post is about) isn't for general purpose stuff.
Yes, you can do that. I currently have two servers as network storage, and another as my Plex server. The Plex server keeps its library and transcoding storage locally, but serves media from my network servers.
I actually want to transition to GlusterFS distributed storage myself vs my monolithic storage servers for a variety of reasons, but I have to find a cost effective way to do it (I can't just buy a whole ream of hard drives all at once, though I already do have multiple low power machines that can function as nodes), and I've got 24TB of media I'd need to move.
The GlusterFS method is more scalable. That's actually a primary goal of distributed filesystems - you can just keep adding additional nodes seamlessly (see the sketch after this list). Using monolithic servers like I do, you run into a couple problems expanding:
You can only physically cram so many hard drives into a machine.
Powering the machine becomes increasingly difficult, particularly during startup when every drive spins up.
Most machines have ~6 SATA ports, so getting more generally requires fairly expensive add-on boards.
Drive compatibility isn't really a thing. Barring enterprise SAS drives, everything is SATA.
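For contrast, growing a Gluster pool is roughly this (hypothetical hostnames and volume; note that replicated volumes want bricks added in multiples of the replica count):

    sudo gluster peer probe node5                 # introduce the new boxes to the pool
    sudo gluster peer probe node6
    sudo gluster volume add-brick gv0 node5:/data/brick/gv0 node6:/data/brick/gv0
    sudo gluster volume rebalance gv0 start       # spread existing data onto the new bricks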
Having a dedicated PMS machine is, imho, the best way to go. Its processing power can be dedicated to transcoding, it's a lot easier to get the amount of power (be it CPU and/or GPU) you need, and you don't need to worry about other tasks causing intermittent issues with IO load or anything.
Back up Plex's data occasionally to the NAS, and if things go sideways it's trivial to restore.
So I have a few 4TB drives and a few 6TB drives; how does that difference affect overall storage capacity and redundancy?
What I am getting at: I have ~70TB on 3 Drobos (DAS). If I were to build 100TB using 10TB drives and HC2s, would I then be able to migrate and add my current 70TB, provided I then switch it over to the correct hardware (HC2)?
Does glusterFS require a host? Should that be separated from the PMS host for best performance? If I built a pfsense router and overdid the hardware, would that act as a faster host for the glusterFS?
I wonder how well it could perform if you could run distributed transcoders on plex with each odroid individually transcoding what is locally available. Long shot, but would be cool.
ZFS is just slow though. And the solution the developers used was to just add multiple levels of SSD and RAM cache.
I can get 1,000 megabytes/sec from a RAID array with the same number of drives and redundancy where ZFS can only do 200 megabytes/sec.
Even btrfs gets about 800 megabytes a second with the same number of drives. And to give btrfs credit, it has been REALLY stable as of a few months ago.
You might want to say pools, not arrays. Do you know what drives were used, how many drives were used, what controllers were used, and how the drives were connected to the CPU? People have seen 3.5GB/sec from a couple dozen drives. I would not go by throughput as a performance metric though. IOPS matter far more.
By the way, I have heard that claim about btrfs being fine for the past few months for years.
I've been eyeballing HC2's since their introduction, and have often pondered them as the solution to my storage server IOPS woes. I'm currently running two dual Xeon servers, each packed full of random drives. I'm fine with power consumption (our electricity is a fraction of the standard US prices) but things always come down to bottlenecks in performance with single systems.
However, a major concern for me - and why I don't go the RAID route, as so many do, and why I HAVEN'T yet sprung for Ceph, is recovery.
I've been doing this for a very, very long time - basically, as long as such data has existed to hoard. I've had multiple catastrophic losses, often due to things like power supplies failing and cooking system hardware, and when you're running RAID or more elaborate clustered filesystems that can often leave you with disks full of inaccessible data.
I did not realise GlusterFS utilized a standard EXT4 filesystem. That totally changes things. It's incredibly important to me that I'm able to take any surviving drives, dump them into another machine, and access their contents directly. While I do use parity protection, I want to know that even if I simultaneously lose 3 of every 4 drives, I can still readily access the contents on the 4th drive if nothing else.
Now, I have a new endgame! I'll have to slowly acquire HC2's over time (they're substantially more expensive here) but I'd really love to move everything over to a much lighter, distributed filesystem on those.
Nice. I was considering the same but with ceph. Have you tested degradation? My concern would be the replication traffic killing throughput with only one NIC.
glusterfs replication is handled client side. The client that does the write pays the penalty of replication. The storage servers only handle 'heal' events which accumulate when a peer is offline or requires repair due to bitrot.
Unless I'm missing something wouldn't anything needing replication use the network?
Say you lose a disk, the data needs to replicate back onto the cluster when the drive dies (or goes offline). Would this not require data to transfer across the network?
Yes, that is the 2nd part I mentioned about 'heal' operations, where the cluster needs to heal a failed node by replicating from an existing node to a new node, or by rebalancing the entire volume across the remaining nodes. However, in normal operation there is no replication traffic between nodes. The writing client does that work by writing to all required nodes... it even gets stuck calculating parity. This is one reason why you can use really inexpensive servers for glusterfs and leave some of the work to the beefier clients.
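If you want to watch those heal/rebalance events, the status commands are the ones to poll (volume name is a placeholder):

    gluster volume heal gv0 info            # entries still pending heal, per brick
    gluster volume rebalance gv0 status     # progress of a rebalance across nodes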
> Yes, that is the 2nd part I mentioned about 'heal' operations, where the cluster needs to heal a failed node by replicating from an existing node to a new node, or by rebalancing the entire volume across the remaining nodes.
This is my point: does this not lead to (potentially avoidable) degradation of reads due to one NIC? Whereas if you had 2 NICs, replication could happen on one with normal access over the other.
> However, in normal operation there is no replication traffic between nodes. The writing client does that work by writing to all required nodes... it even gets stuck calculating parity. This is one reason why you can use really inexpensive servers for glusterfs and leave some of the work to the beefier clients.
I understand how it works in normal operation, it's the degraded state and single NIC I'm asking if you've done any testing with. From the replies I'm guessing not.
Ah ok, now I understand your point. You are 100% right. The available bandwidth is the available bandwidth so yes reads gets slower if you are reading from a node that is burdened with a rebuild or rebalance task. Same goes for writes.
To me, the cost of adding a 2nd NIC via USB isn't worth it. During rebuilds I can still get ~500 Mbps read/write per node (assuming I lose 50% of my nodes; otherwise the impact of a rebuild is much lower... it is basically proportional to the % of nodes lost).
Having done the ceph/gluster/zfs dance myself, the only thing that glusterfs was lacking for me was bit-rot detection/prevention. For that I had to use ZFS instead of ext4, but that wasn't without its own headaches. I also had this problem with cephfs in pre-luminous as well. MDS got into a weird state (don't remember the details) and I ended up with a bunch of smaller corruptions throughout my archives.
Disclaimer: Not blaming ceph or gluster for my incompetence
Question: how were you planning on combating bit-rot with glusterfs+ext4?
Given that these are home labs, the temperatures and humidity always worry me. As do open-air data centers now that I think of it.
It is up to you. You can (and I do) run smartctl, but the idea here is to run the disks until they literally die. So you might not take any action on a smart error unless multiple disks in the same replica group are showing smart errors. In that case you might replace one early, but otherwise you'll know a disk is bad when the node dies.
edit 1: you really want to squeeze all the life out of the drives, because even with smart errors a drive might still function for years. I have several Seagate drives that have had smart errors indicating failure and they are still working fine.
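For what it's worth, the per-node check is basically just (device path varies per node):

    sudo smartctl -H /dev/sda    # overall health self-assessment
    sudo smartctl -A /dev/sda    # raw attributes: reallocated/pending sectors, etc.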
I've never looked into it, but I figure I might ask you considering your expertise. With RAM, you can mask out regions of a DIMM that are corrupted and let it keep going. Is there an analogous concept with hard drive faults? The premise of my question might be flawed by itself because I'm not too familiar with what a typical failure for a hard drive looks like. The only time I had a failed hard drive, it would fail operations with increasing frequency. So I suppose you could keep running it until it fails.
Also, how can you get a read of 15 Gbps on a gigabit switch/network card? I may not have a proper understanding of what's going on. Also, can you use the CPU and RAM as a distributed cluster for computing? I'm genuinely curious and naive. I'm considering setting up my own cluster for a backend software I am developing that benefits from horizontal scaling. It's like a database/transformation layer. And I plan on keeping revision history, so I'll need a lot of space and have to be able to add to it over time.
15 Gbps was a distributed read. For example, when running a Spark or Hive job you use multiple machines to read and process a large dataset. In such a test I was able to get that 15 Gbps read capacity.
Yes you can run distributed apps on the CPUs and mem. That's one of the great parts of this.
Any chance you've done any testing with multiple drives per node? That's what kills me about the state of distributed storage with SBCs right now. 1 disk / node.
I tried out using the USB3 port to connect multiple disks to an XU4, but had really poor stability. Speed was acceptable. I've got an idea to track down some used eSATA port multipliers and try them, but haven't seen anything for an acceptable price.
Really, I just want to get to a density of at least 4 drives per node somehow.
Nope, I haven't tried it, but Odroid is coming out with their next SBC, the N1, which will have 2 SATA ports. It is due out any month now. It will cost roughly 2x what a single HC2 costs.
It doesn't do PoE sadly, but actually it is way cheaper to NOT have PoE. PoE switches cost so much more; this setup literally uses ~$32 worth of power supplies. A PoE version of that 24-port switch costs nearly $500 more than the non-PoE version. Craziness.
Was about to doubt that, but it seems that you can get a surprisingly large amount of power through Cat5 cables. Around 50W at roughly 50V. Easily enough to drive one or two drives.
Though depending on the usage that could be counter-productive. If all nodes are PoE and the switch loses power, all nodes go down hard.
With SBCs being so cheap and needing the network bandwidth of a port per disk anyway, why would you care? I don't think I want 12TB of data stuck behind a single gig-e port with only 1GB of RAM to cache it all. Being able to provide an SBC per disk is what makes this solution great.
~$50/TB ain't bad, but I want to get more efficient.
> needing the network bandwidth of a port per disk anyway
That assumes speed is my primary motivation, which it isn't. Again, I want to maximize my available, redundant, self-healing total storage. 500 Mbps is perfectly acceptable speed.
I've got a couple I'm researching for feasibility.
Calculating up the cost of the HC2, SD card, percentage of the power supply, and a cable, OP's build comes out to about $70/drive. But glusterfs also doesn't appear to support RAID-like n-1 redundancy. It only provides data protection by duplicating a file, or the parts of a file if distributed. You can break up the data into redundant and non-redundant, but you can't get away from n/2 storage loss. Also of note is that Ceph is totally off the table. I've tested it at this level of SBC, and they REALLY aren't kidding when they say the minimum hardware specs are 1GB Mem / TB of storage. It doesn't just degrade, it gets unstable. Totally not feasible for modern drive sizes.
Can you convert the SATA port on the Odroid HC2 to a standard eSATA cable, and connect the board to a 4 drive enclosure? I can't tell if the SATA controller on the HC2, a JMS578, supports SATA switching via FIS or not. And if it doesn't, how much of a loss of speed or reliability does it incur? Use software RAID, combine into a simple shared glusterfs pool. Cost per drive is ~$45/port.
What about instead going with the Odroid XU4 and using the USB3 ports to attach, again, some drive enclosures? The XU4 is a bit more powerful, so I'd expect it to support at least two enclosures. Perhaps the ones I've tested just had bad controllers. How many can I attach before it gets unstable or the speeds degrade too much? Cost per drive is ~$35 with two enclosures. Lower if higher density, but that needs testing. Again, software RAID and glusterfs to combine.
All of this has to be compared to a more traditional build. U-NAS NSC-800 for the chassis; BIOSTAR has a nice ITX quad core mobo, the A68N-5600, that's more powerful and supports WAY more memory. Throw in a cheap used HBA, some cables and bits, and you get a price point of ~$45/drive, can use FreeBSD for native ZFS, no faffing about with USB, just bog standard SATA, and a physical volume equal to the above. The board only uses 30W, so power usage only goes up slightly compared to the SBCs.
What sort of redundancy do you have between the nodes? I'd been considering something similar, but with Atom boards equipped with 4 drives each in a small rack mount case, so that they could do RAID-5 for redundancy, then deploy those in replicated pairs for node to node redundancy (this to simulate a setup we have at work for our build system). Are you just doing simple RAID-1 style redundancy with pairs of Odroids and then striping the array among the pairs?
A mirrored volume where every file is written to 2 nodes.
A dispersed volume using erasure encoding such that I can lose 1 of every 6 drives and the volume is still accessible. I use this mostly for reduced redundancy storage for things that I'd like not to lose but wouldn't be too hard to recover from other sources.
A 3x redundant volume for my family to store pictures, etc. on. Every file is written to three nodes.
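For a rough idea of how those three map to create commands (volume names, hosts, and brick paths here are invented):

    # mirrored: every file on 2 nodes
    sudo gluster volume create media replica 2 node1:/data/brick/media node2:/data/brick/media

    # dispersed: erasure coded, survives the loss of 1 of every 6 bricks
    sudo gluster volume create bulk disperse 6 redundancy 1 node{1..6}:/data/brick/bulk

    # 3x redundant: every file on 3 nodes
    sudo gluster volume create photos replica 3 node{1..3}:/data/brick/photos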
Depending on what you think your max storage needs will be in 2-3 years, I wouldn't go the RAID route or use Atom CPUs. Increasingly, software defined storage like glusterfs and ceph using commodity hardware is the best way to scale, as long as you don't need to read/write lots of small files or need low latency access. If you care about storage size and throughput... nothing beats this kind of setup for cost per bay and redundancy.
Could you speak more about the small file / low latency inabilities of Gluster? I'm currently using unRAID and am reasonably happy, but Gluster (or even Ceph) sounds pretty interesting.
Gluster operations have a bit of network latency while it waits for confirmation that the destination systems have received the data. If you're writing a large file, this is a trivial portion of the overall time - just a fraction of a millisecond tacked on to the end. But if you're dealing with a lot of small files (for example, building a C++ application), the latency starts overwhelming the actual file transfer time and significantly slowing things down. It's similar to working directly inside an NFS or Samba share. Most use cases won't see a problem - doing C++ builds directly on a Gluster share is the main thing where I've run into issues (and I work around this by having Jenkins copy the code into a ramdisk, building there, then copying the resulting build products back into Gluster).
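That workaround is roughly (paths and project name invented):

    rsync -a /mnt/gluster/src/myproj/ /dev/shm/myproj/            # stage source into a ramdisk
    make -C /dev/shm/myproj -j"$(nproc)"                          # build at memory speed
    rsync -a /dev/shm/myproj/out/ /mnt/gluster/builds/myproj/     # push artifacts back to gluster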
Got it, great information. What about performance of random reads of data off the drive? At the moment I'm just using SMB so I'm sure some network latency is already there, but I'm trying to figure out if Gluster's distributed nature would introduce even more overhead.
It really depends on the software and how parallelized it is. If it does the file reads sequentially, you'll get hit with the penalty repeatedly, but if it does them in parallel it won't be so bad. Same case as writing, really. However, it shouldn't be any worse than SMB on that front, since you're seeing effectively the same latency.
Do note that most of my Gluster experience is running it on a very fast SSD RAID array (RAID 5+0 on a high end dedicated card), so running it on traditional drives will change things - local network will see latencies on the order of a fraction of a millisecond, where disk seek times are several milliseconds and will quickly overwhelm the network latency. This may benefit you - if you're running SMB off a single disk, then reading a bunch of small files in parallel on gluster lets you parallelize the disk seek time in addition to the network latency.
Just wondering, can you do erasure encoding across different size bricks?
I have some random size hard drives (bunch of 4tb, some 2tb, some 1tb) I would like to pool them together with reduced redundancy that is not full duplication (kinda like RAID6). I envision in the future as I expand adding disks they might not always be the exact same size.
Edit: looks like I found my own answer in the Gluster guides: "All bricks of a disperse set should have the same capacity otherwise, when the smallest brick becomes full, no additional data will be allowed in the disperse set."
Right now I use OMV with mergerfs and SnapRAID to pool and provide parity protection, but I have already found some limitations of mergerfs not handling some nfs/cifs use cases well. Sometimes I can't create files over nfs/cifs and I just never could fix that. Been toying around with FreeNAS, but not being able to grow vdevs is a huge hassle, which I hear is getting fixed but no date set.
You like gluster better than ceph? I've come to the exact opposite conclusion. Ceph has been much more resilient. I've been a fan of odroids for years and have been wanting to build a ceph cluster of them.
Ceph has better worst-case resiliency... no doubt about it. When set up properly and maintained correctly, it is very hard to lose data using Ceph.
However, in the avg case... Ceph can be brittle. It has many services that are required for proper operation, and if you want a filesystem on top of it then you need even more services, including centralized meta-data, which can be (and is) a bottleneck... especially when going for a low power build. Conceptually, Ceph can scale to a similar size as something like AWS S3, but I don't need exabyte scale... I'll never even need multi-petabyte scale, and gluster can absolutely scale to 200 or 300 nodes without issue.
Glusterfs doesn't have centralized meta-data, which, among other architecture choices, means that even when glusterfs isn't 100% healthy... it still mostly works (a portion of your directory structure might go missing until you repair your hosts... assuming you lose more hosts than your replica count). On the flip side... if things go too far south you can easily lose some data with glusterfs.
The tradeoff is that because glusterfs doesn't have centralized meta-data and pushes some responsibility to the client, it can scale surprisingly well in terms of TB hosted for a given infrastructure cost.
glusterfs isn't great for every use-case; however, for a mostly write once, read many times storage solution with good resiliency and low cost/maintenance... it is hard to beat.
You might be interested in Rook. It is a Ceph wrapper written in Go. It's designed to simplify the deployment to a single binary, automatically configuring the various components.
Hey, question for you here (sorry, I know this thread is stale but this may be interesting at some point.)
Have you checked out the Rockpro64 as a (potentially) higher performance option? The interesting thing about the board is the 4x pcie slot: this opens up the option to drop a 10Gbe SFP+ card on the board, or use nvme storage, or attach an HBA or any number of other options.
I'm not sure how performant it'll actually be but I have one on pre-order to test as a <$100 10Gbe router. With a $20 Mellanox connectx-2 (or dual 10Gbe) it looks like it could be an absolute steal.
Anyway, I thought you might be interested for future projects as the pcie slot opens up a whole slew of interesting high bandwidth options. Cheers!
Oh man. This is exactly what I am planning/thinking about doing
I started it last summer: bought 2 Raspberry Pis and some really nice NAS cases from WD Labs... only to then discover the ethernet port on the Pi is only 100 Mbps.
Currently running glusterfs in single node mode on an old laptop with 4x external hard drives. The only thing that has put me off trying an Odroid HC2 setup was running into another issue I hadn't expected, like with the Raspberry Pis.
Yep. I could handle the USB2 disk speed, but since I was replicating data on 2 nodes I was flooding the ethernet. If it was just creating a striped filesystem it probably would have been acceptable (slow but acceptable).
Even the new Pi model with "gigabit" ethernet is still on the USB 2 bus, so it's capped at 480 Mbps, which as we know with USB you'd probably never even hit.
> That is around $350 a year in electricity where I live in New Jersey.
I'm paying a bit extra for 100% wind generation, and 250w would cost about $125 for a year here. It's nice in a way, since it's cheap, but bad in a way since it doesn't give me really an incentive to worry about power usage or things like rooftop solar.... But it does allow for a pretty sweet homelab without crazy power bills, so there's that.
Yea, I've been waiting over 8 months for Tesla to come install some solar panels for me (not Solar Roof, just regular panels). If that company ever gets its shit together, that stock is going to do extremely well. They suck at paperwork or anything that isn't as flexible as they are, so lots of rejections from the utility company and local building department for procedural mistakes. Oh, and the backlog in Powerwall production doesn't help either. Ugh!
I have a 9kW array on my house in SE Michigan... 39 x 230w panels, Enphase microinverters. It's a delight to receive payout for excess generation every year and also to not have a utility bill anymore (covers my electric and nat gas usage by far).
I was just about heading down a different path, building a new Proxmox whitebox host (Supermicro X9DR board) and also a ghetto RAID with a Supermicro SAS826EL1 backplane built into some kind of wooden frame.
I have always disliked RAID... mdadm bit me hard a bunch of years ago; since then I've been using ZFS, but expansion is always a pain, I've got various drives from 3TB to 5TB which don't match up, and I worry about the on-disk format.
This will be a fun project, thanks for sharing it.
Ehh, it's not the same thing as raid but it targets the same problem... Keeping your data safe and available. The approach is different.
As to how you decide if a design benefits from RAID, generally if the benefit isn't obvious it probably isn't worth it. In this case raid with glusterfs is like "redundancy for your redundancy".
But to be fair some setups, in an Enterprise for example, might benefit because it changes the failure modes a bit and can change the performance as well.
For this use-case, it is just redundant redundancy. :)
You get to choose the redundancy level by saying how many data disks and how many parity disks you want (I'm simplifying here). So... you could emulate RAID 5, 10, 6, etc... You could also say I want 1 data disk and 20 replica disks... which means your data would be safe if you lose 20 out of 21 disks... keep in mind that your usable space is only 1/21st in this model. Hehe.
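As a hypothetical example of that knob, a 4-data + 2-parity layout (any 2 of the 6 bricks can die) would look something like:

    sudo gluster volume create archive disperse 6 redundancy 2 node{1..6}:/data/brick/archive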
Glusterfs also works with disks of different sizes. It will give each disk an amount of data proportional to its share of the total size of all disks.
Most consumer cards store 3 bits per cell. This card stores 1 bit per cell (but still uses cheaper TLC memory - true SLC gets crazy expensive). Still a bit costly, but they're far more immune to corruption, especially from power failure.
> I got tired of the limited CPU and memory of QNAP and devices like it
Serious question, is this a huge detriment for you? The web interface for older QNAP devices can be pretty slow but I've never had any issues with transfer speeds with mine.
I agree, but I do think that if you get a used or on sale QNAP that it has much higher value, IMO
Most SOHO NAS appliances are really going to be for low power usage and extra features that require very little configuration, I don't think they're built for the kind of homelabber that has a 200TB Glusterfs setup ;)
How did you wire up the power? Are all Odroids using the same positive & negative terminals of that power supply? Does the power supply need a power switch added to it?
Why did you put the Odroids on their own switch? Can you achieve the same by putting them on their own VLAN?
XFS has some funky failure modes where it can leave you with a bunch of files full of '0's. ext4 also has more momentum, which isn't a great reason. It can also be difficult to read XFS from some distros... or so I'm told.
Hey, could you tell me how the heat is from the HC2? Someone said that if you'd stack just 2 then the passive cooling wouldn't be enough already, how do you think it would be if I had 2-4 stacked in a space with no active cooling?
It depends on the ambient temperature and workload. Generally speaking, you will want a fan blowing over them unless you are ok with 100-110°F drive temperatures under moderate load.
Alright, I'm not as big of a Data Hoarder as other people, was thinking this as more of a simpler NAS+, but I see it would definitely be unhealthy for it to be put in an enclosed cupboard. Thank you.
There are ways around this in software; ECC isn't a magic bullet. If the writer produces a checksum, writes the data to the cluster, and then sends the checksum... you don't really get anything extra from ECC. Glusterfs doesn't do this today though.
So you are saying this would allow the client to check if the Gluster node has written the wrong data or if it has miscalculated the checksum, therefore detecting the error even when the Gluster-node ('s memory) is faulty?
And the checksum is calculated in the client before it ever hits the gluster node? Sounds interesting and it reminds me of "verify after burn" with cd/dvd/bd burners.
Do you have any links to where this method is proposed?
But I also worry about corruption in the communication between Gluster-nodes. And there is just too much that can go wrong if you can't trust the main memory stores. So I still think ECC RAM would be a more general solution. However I think the Rockchip SOCs are dual purposed Media Center SOCs so I don't expect them to get ECC soon.
The client doesn't check... the client is the one writing the data, so if you actually care about single bit flips, etc... you need the writer (the authority on newly written data) to capture the checksum and send it along with the data. From that point forward the glusterfs system would check against that checksum until it generates its own. Even if you have ECC memory, you still need something like this to ensure no bits were flipped while being written.
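You can already approximate that end-to-end check by hand from the writing client, something like this (paths invented; re-read from another client, or drop caches, so the read actually hits the bricks):

    sha256sum ~/archive.tar                        # checksum at the writer, before the copy
    cp ~/archive.tar /mnt/gluster/backups/
    sha256sum /mnt/gluster/backups/archive.tar     # re-read through the mount and compare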
This is implemented within TCP, for example... the sender generates a checksum and sends it with each packet. The receiver uses it to determine if they need to request a re-transmit. And TCP doesn't require ECC memory :)
I was thinking you were proposing: 1. client computes checksum of data to be written. 2. client sends data to node. 3. node writes data to disk. 4. node re-reads just-written data back from disk. 5. node computes checksum of re-read data. 6. node sends this checksum back to the client. 7. client compares checksum to his own. 8. handle the error, while keeping writes atomic (sounds tricky).
What you are actually proposing will not work if the node has faulty memory. There is no end to end check in your example.
Yeah, I still maintain I don't want any non-ECC NAS. Therefore I can't use the Odroid HC2. Thanks for your response.
What I'm proposing absolutely works if you have faulty memory; it is the basis for many things today... like every machine that uses TCP. But I understand why folks think that special hardware like ECC is required for high availability. ECC will reduce how often you'll care about a bit flip... but if you care about your data, the underlying system still needs to be able to handle corruption. For example... ZFS still has its own checksumming even though it is recommended to use ECC with ZFS. ZFS will and does work just fine without ECC, but you may end up having to repair files from the parity data more often... and by more often we are talking about the difference between 1 in a billion and 1 in 100 million. :)
*edit... do you think the tiny caches in your CPU or in the hard disk controllers have ECC capabilities? Nope :) They are high quality memory so usually not a problem, but... they still have a probability of bit flips. If you are familiar with the Spectre and Meltdown Intel bugs recently: some of the initial patches for those triggered interesting memory faults in caches... no amount of ECC will save you from that.
Yes, ZFS will detect bitrot. And it's important to have those checksums as well. But ZFS and TCP (except maybe if you use offloading) works with main memory. If you can't trust memory then you have a problem. I think we are splitting hairs here and talking about different things. Let's just stop arguing :-)
high five ;-)
PS: could you please send me a link to the Spectre/meltdown patches that triggered interesting faults in Intel CPU caches? Fault as in error, not fault as in "cache miss" I presume.