Hi all!

I will soon acquire a pretty beefy unit compared to my current setup: a 3-node cluster where each node has 16 cores, 512 GB of RAM and 32 TB of storage.

Currently I run TrueNAS and Proxmox on bare metal and most of my storage is made available to apps via SSHFS or NFS.

I recently started looking for “modern” distributed filesystems and found some interesting S3-like/compatible projects.

To name a few:

  • MinIO
  • SeaweedFS
  • Garage
  • GlusterFS

I like the idea of abstracting the filesystem to allow me to move data around, play with redundancy and balancing, etc.

My most important services are:

  • Plex (Media management/sharing)
  • Stash (Like Plex 🙃)
  • Nextcloud
  • Caddy with Adguard Home and Unbound DNS
  • Most of the Arr suite
  • Git, Wiki, File/Link sharing services

As you can see, there is a lot of downloading/streaming/torrenting of files across services. Smaller services run in a Docker VM on Proxmox.

Currently the setup is messy due to the organic evolution of my setup, but since I will upgrade on brand new metal, I was looking for suggestions on the pillars.

So far, I am considering installing a Proxmox cluster with the 3 nodes and host VMs for the heavy stuff and a Docker VM.

How do you see the file storage portion? Should I take a full/partial plunge into S3-compatible object storage? What architecture/tech would be interesting to experiment with?

Or should I stick with tried-and-true, boring solutions like NFS Shares?

Thank you for your suggestions!

  • Appoxo@lemmy.dbzer0.com · 1 day ago

    I have my NFS storage mounted via 2.5G and use qcow2 disks. It is slow to snapshot…

    Maybe I understand your question wrong?

    • tal@lemmy.today · 23 hours ago

      NFS doesn’t do snapshotting, which is what I assumed that you meant and I’d guess ShortN0te also assumed.

      If you’re talking about qcow2 snapshots, that happens at the qcow2 level. NFS doesn’t have any idea that qemu is doing a snapshot operation.

      On a related note: if you are invoking a VM using a filesystem image stored on an NFS mount, I would be careful, unless you are absolutely certain that this is safe for the version of NFS and the specific caching options, for both NFS and qemu, that you are using.

      I’ve tried to take a quick look. There’s a large stack involved, and I’m only looking at it quickly.

      To avoid data loss via power loss, filesystems – and thus the filesystem images backing VMs using filesystems – require write ordering to be maintained. That is, they need to have the ability to do a write and have it go to actual, nonvolatile storage prior to any subsequent writes.

      At a hard disk protocol level, like for SCSI, there are BARRIER operations. These don’t force something to disk immediately, but they do guarantee that all writes prior to the BARRIER are on nonvolatile storage prior to writes subsequent to it.

      I don’t believe that Linux has any userspace way for a process to request a write barrier – there is no fwritebarrier() call. This means that the only way to impose write ordering is to call fsync()/sync() or similar operations. These force data to nonvolatile storage and do not return until it is there. The downside is that this is slow: programs that frequently issue such synchronizations cannot issue writes very quickly, and are very sensitive to the latency of their nonvolatile storage.
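      To make that concrete, here’s a minimal Python sketch (my own illustration, not code from any of the software discussed) of the classic journal-style pattern filesystems rely on: write the data, fsync(), then write a commit record, then fsync() again. The fsync() between the two writes is the only portable way to guarantee the first one is on nonvolatile storage before the second is issued.

```python
import os

def ordered_writes(path: str) -> None:
    """Write a payload, then a commit marker, with a durability
    barrier between them so the payload is on stable storage
    before the marker can possibly be."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, b"payload", 0)    # write A
        os.fsync(fd)                    # A must reach stable storage...
        os.pwrite(fd, b"COMMIT", 4096)  # ...before write B is issued
        os.fsync(fd)                    # make B durable as well
    finally:
        os.close(fd)
```

      Drop either fsync() and the kernel is free to reorder the two writes on their way to disk, which is exactly the corruption window described above.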

      From the qemu(1) man page:

               By default, the cache.writeback=on mode is used. It will report
               data writes as completed as soon as the data is present in the
               host page cache. This is safe as long as your guest OS makes
               sure to correctly flush disk caches where needed. If your guest
               OS does not handle volatile disk write caches correctly and
               your host crashes or loses power, then the guest may experience
               data corruption.

               For such guests, you should consider using cache.writeback=off.
               This means that the host page cache will be used to read and
               write data, but write notification will be sent to the guest
               only after QEMU has made sure to flush each write to the disk.
               Be aware that this has a major impact on performance.


      I’m fairly sure that this is a rather larger red flag than it might appear, if one simply assumes that Linux must be doing things “correctly”.

      Linux doesn’t guarantee that a write to position A goes to disk prior to a write to position B. That means that if your machine crashes or loses power, then with the default settings you can potentially corrupt a filesystem image – even for drive images stored on a filesystem on the local host.

      https://docs.kernel.org/block/blk-mq.html

      Note

      Neither the block layer nor the device protocols guarantee the order of completion of requests. This must be handled by higher layers, like the filesystem.

      POSIX does not guarantee that write() operations to different locations in a file are ordered.

      https://stackoverflow.com/questions/7463925/guarantees-of-order-of-the-operations-on-file

      So by default – which is what you might be doing, wittingly or unwittingly – if you’re using a disk image on a filesystem, qemu simply doesn’t care about write ordering to nonvolatile storage. It issues writes and does not care about the order in which they hit the disk. It is not calling fsync() or using analogous functionality (like O_DIRECT).

      NFS entering the picture complicates this further.

      https://www.man7.org/linux/man-pages/man5/nfs.5.html

      The sync mount option

      The NFS client treats the sync mount option differently than some other file systems (refer to mount(8) for a description of the generic sync and async mount options). If neither sync nor async is specified (or if the async option is specified), the NFS client delays sending application writes to the server until any of these events occur:

               Memory pressure forces reclamation of system memory
               resources.
      
               An application flushes file data explicitly with sync(2),
               msync(2), or fsync(3).
      
               An application closes a file with close(2).
      
               The file is locked/unlocked via fcntl(2).
      
        In other words, under normal circumstances, data written by an
        application may not immediately appear on the server that hosts
        the file.
      
        If the sync option is specified on a mount point, any system call
        that writes data to files on that mount point causes that data to
        be flushed to the server before the system call returns control to
        user space.  This provides greater data cache coherence among
        clients, but at a significant performance cost.
      
        Applications can use the O_SYNC open flag to force application
        writes to individual files to go to the server immediately without
        the use of the sync mount option.
      

      So, strictly speaking, this doesn’t make any guarantees about what NFS does. It says that it’s fine for the NFS client to send nothing to the server at all on write(). With the default NFS mount options, the only times a write() to a file is guaranteed to make it to the server are the events listed above. And if data isn’t going to the server, it definitely cannot be flushed to nonvolatile storage there.
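      The O_SYNC escape hatch the man page mentions is per-file rather than per-mount. A hedged sketch of how an application would opt into it (illustrative only; on a local filesystem it forces writes to the disk, and on an NFS mount it forces them to the server before write() returns):

```python
import os

def write_through(path: str, data: bytes) -> None:
    # O_SYNC: each write() does not return until the data (and the
    # metadata needed to read it back) has reached stable storage --
    # on an NFS mount, until the server has acknowledged the write.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
```

      qemu can get similar behavior with its cache.direct/cache.no-flush knobs, but as the man page excerpt above notes, any of these options cost a lot of throughput.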

      Now, I don’t know this for a fact – I’d have to go digging around in the NFS client you’re using – but it would be compatible with the guarantees listed, and I’d guess that the NFS client probably isn’t keeping a log of all the write()s and then replaying them in order. (For replaying in order to meaningfully affect what’s on nonvolatile storage, the NFS server would also have to fsync() after each replayed write.) Instead, it’s probably just keeping a list of dirty data in the file, and then flushing it to the NFS server at close().

      That is, say you have a program that opens a file filled with all ‘0’ characters, and does:

      1. write ‘1’ to position 1.
      2. write ‘1’ to position 5000.
      3. write ‘2’ to position 1.
      4. write ‘2’ to position 5000.

      At close() time, the NFS client probably doesn’t flush “1” to position 1, then “1” to position 5000, then “2” to position 1, then “2” to position 5000. It’s probably just flushing “2” to position 1, and then “2” to position 5000, because when you close the file, that’s what’s in the list of dirty data in the file.

      The thing is that unless the NFS client retains a log of all those write operations, there’s no way to send the writes to the server that avoids putting the file into a corrupt state if power is lost. It doesn’t matter whether it writes the “2” at position 1 or the “2” at position 5000 first. In either case, it creates a moment where one of those two positions holds a “0” and the other a “2”. If there’s a failure at that point – the server loses power, the network connection is severed – that’s the state the file winds up in: an inconsistent state that should never have arisen. And if the file is a filesystem image, the filesystem in it might be corrupt.
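      A toy model of that “list of dirty data” behavior (purely hypothetical – this is not actual NFS client code, just a dict standing in for a cache that keeps only the latest bytes at each offset):

```python
# Toy model: track writes as a map from offset to latest data,
# the way a cache that holds only dirty bytes (not a write log) would.
dirty: dict[int, bytes] = {}

def cached_write(offset: int, data: bytes) -> None:
    dirty[offset] = data  # silently overwrites any earlier write here

# The four writes from the example above:
cached_write(1, b"1")
cached_write(5000, b"1")
cached_write(1, b"2")
cached_write(5000, b"2")

# At close() time only the final values remain. The intermediate "1"s
# were never sent, and the two remaining flushes can be interrupted
# between them -- leaving a 0/2 mix the program never wrote.
print(sorted(dirty.items()))
```

      The interesting part is what is missing: the order and the intermediate states are gone, so no flush strategy over this structure can reproduce a state the program actually passed through.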

      So I’d guess that both of those points in the stack – the NFS client writing data to the server, and the server’s block device scheduler – permit inconsistent state if no fsync()/sync()/etc is being issued, which appears to be the default behavior for qemu. And running on NFS probably creates a larger window for a failure to induce corruption.

      It’s possible that using qemu’s iSCSI backend avoids this issue, assuming that the iSCSI target avoids reordering. That’d avoid qemu going through the NFS layer.

      I’m not going to dig further into this at the moment. I might be incorrect. But I felt that I should at least mention it, since filesystem images on NFS sounded a bit worrying.

      • Appoxo@lemmy.dbzer0.com · edited · 19 hours ago

        Thanks for digging into this “shallow” topic (lol – what you dug up is about as deep as my senior technician explaining a client’s full tech stack).
        Anyway, I host the system disk on local storage, and the NFS share acts as mass storage for my VMs (like my media server for Jellyfin).
        I also do daily backups with Veeam Backup and Replication, of both the most important files of my media server and the important VMs.
        So in case of a data failure it should be more or less fine.

        Wouldn’t the sync option also confirm that every write also arrived on the disk?
        Because I did mount the NFS share (storage host: TrueNAS, hypervisor: Proxmox) in sync mode.

        • tal@lemmy.today · edited · 14 hours ago

          Wouldn’t the sync option also confirm that every write also arrived on the disk?

          If you’re mounting with the NFS sync option, that’ll avoid the “wait until close and probably reorder writes at the NFS layer” issue I mentioned, so that’d address one of the two issues, and the one that’s specific to NFS.

          That’ll force each write to go, in order, to the NFS server, which I’d expect to avoid problems with the network connection being lost while flushing deferred writes. I don’t think it actually forces the data to nonvolatile storage on the server at that time, so a server power loss could still be an issue – but that’s the same problem you’d have running a local filesystem image with qemu’s less-safe cache options when the client machine loses power.

    • ShortN0te@lemmy.ml · 19 hours ago

      If I understand you correctly, your server is accessing the VM disk images via an NFS share?

      That does not sound efficient at all.

      • Appoxo@lemmy.dbzer0.com · 19 hours ago

        No other easy option that I figured out.
        I didn’t manage to understand iSCSI in the time I had patience for it, and I was desperate to finish the project and use my stuff.
        Thus NFS.