I have a ZFS pool that I made on proxmox. I noticed an error today. I think the issue is the drives got renamed at some point and how its confused. I have 5 NVME drives in total. 4 are supposed to be on the ZFS array (CT1000s) and the 5th samsung drive is the system/proxmox install drive not part of ZFS. Looks like the numering got changed and now the drive that used to be in the array labeled nvme1n1p1 is actually the samsung drive and the drive that is supposed to be in the array is now called nvme0n1.

root@pve:~# zpool status
  pool: zfspool1
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:07:38 with 0 errors on Sun Oct 13 00:31:39 2024
config:

        NAME                     STATE     READ WRITE CKSUM
        zfspool1                 DEGRADED     0     0     0
          raidz1-0               DEGRADED     0     0     0
            7987823070380178441  UNAVAIL      0     0     0  was /dev/nvme1n1p1
            nvme2n1p1            ONLINE       0     0     0
            nvme3n1p1            ONLINE       0     0     0
            nvme4n1p1            ONLINE       0     0     0

errors: No known data errors

Looking at the devices:

 nvme list
Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme4n1          /dev/ng4n1            193xx6A         CT1000P1SSD8                             1           1.00  TB /   1.00  TB    512   B +  0 B   P3CR013
/dev/nvme3n1          /dev/ng3n1            1938xxFF         CT1000P1SSD8                             1           1.00  TB /   1.00  TB    512   B +  0 B   P3CR013
/dev/nvme2n1          /dev/ng2n1            192xx10         CT1000P1SSD8                             1           1.00  TB /   1.00  TB    512   B +  0 B   P3CR010
/dev/nvme1n1          /dev/ng1n1            S5xx3L      Samsung SSD 970 EVO Plus 1TB             1         289.03  GB /   1.00  TB    512   B +  0 B   2B2QEXM7
/dev/nvme0n1          /dev/ng0n1            19xxD6         CT1000P1SSD8                             1           1.00  TB /   1.00  TB    512   B +  0 B   P3CR013

Trying to use the zpool replace command gives this error:

root@pve:~# zpool replace zfspool1 7987823070380178441 nvme0n1p1
invalid vdev specification
use '-f' to override the following errors:
/dev/nvme0n1p1 is part of active pool 'zfspool1'

where it thinks 0n1 is still part of the array even though the zpool status command shows that its not.

Can anyone shed some light on what is going on here. I don’t want to mess with it too much since it does work right now and I’d rather not start again from scratch (backups).

I used smartctl -a /dev/nvme0n1 on all the drives and there don’t appear to be any smart errors, so all the drives seem to be working well.

Any idea on how I can fix the array?

  • qupada@fedia.io
    link
    fedilink
    arrow-up
    8
    ·
    17 days ago

    Generally, you just need to export the pool with zpool export zfspool1, then import again with zpool import -d /dev/disk/by-id zfspool1.

    I believe it should stick after that.

    Whether that will apply in its current degrated state I couldn’t say.

    • Lem453@lemmy.caOP
      link
      fedilink
      English
      arrow-up
      3
      ·
      edit-2
      17 days ago

      Thanks, this worked. I made the ZFS array in the proxmox GUI and it used the nvmeX names by default. Interestingly, when I did zfs export, nothing seemed to happen and it -> I tried zpool import and is said no pools available to import, but then when I did zpool status it showed the array up and working with all 4 drives showing healthy and it was now using device IDs. Odd but seems to be working correctly now.

      root@pve:~# zpool status
        pool: zfspool1
       state: ONLINE
        scan: resilvered 8.15G in 00:00:21 with 0 errors on Thu Nov  7 12:51:45 2024
      config:
      
      		NAME                                                                                 STATE     READ WRITE CKSUM
      		zfspool1                                                                             ONLINE       0     0     0
      		  raidz1-0                                                                           ONLINE       0     0     0
      			nvme-eui.000000000000000100a07519e22028d6-part1                                  ONLINE       0     0     0
      			nvme-nvme.c0a9-313932384532313335343130-435431303030503153534438-00000001-part1  ONLINE       0     0     0
      			nvme-eui.000000000000000100a07519e21fffff-part1                                  ONLINE       0     0     0
      			nvme-eui.000000000000000100a07519e21e4b6a-part1                                  ONLINE       0     0     0
      
      errors: No known data errors