Recently I’ve had to fix failed virtual drives (volumes) in a Storage Spaces Direct cluster.
The virtual disks went offline with the following errors:
Event ID: 1793 / Level: Error
Description:
Cluster physical disk resource online failed.
Physical Disk resource name: Cluster Virtual Disk (X500_ClustS2D_5)
Device Number: 4294967298
Device Guid: {e6dd0658-0bb1-401a-a938-7a0ff9d671d0}
Error Code: 15
Additional reason: SpaceStateInvalidFailure
Event ID: 1069 / Level: Error
Description:
Cluster resource ‘Cluster Virtual Disk (X500_ClustS2D_5)’ of type ‘Physical Disk’ in clustered role ‘7d10624b-262a-4ad0-9d11-5440543383d0’ failed. The error code was ‘0xf’ (‘The system cannot find the drive specified.’).
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
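The two events report the same failure: the 0xf in event 1069 is the hexadecimal form of the Error Code 15 in event 1793. As a quick check, here is a minimal PowerShell sketch (the variable name is illustrative, not from the events) that converts the value and asks .NET for the matching system message. On a Windows node this should be the same text quoted in event 1069.

```powershell
# 0xf is a hexadecimal literal; PowerShell parses it as the integer 15.
$errorCode = 0xf
$errorCode

# Look up the operating system's message for that error code.
# The text is platform dependent, so run this on a Windows node.
[System.ComponentModel.Win32Exception]::new($errorCode).Message
```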
Cause
This happens because ReFS can’t write metadata to the virtual disk when mounting it. The virtual disk then moves between cluster hosts until it has tried them all, and ultimately fails.
Get-VirtualDisk shows the virtual disk is in a detached state.
FriendlyName: X500_ClustS2D_5
ResiliencySettingName:
OperationalStatus: Detached
HealthStatus: Unknown
IsManualAttach: True
Size: 10TB
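To find every affected disk rather than checking one by name, the output of Get-VirtualDisk can be filtered on OperationalStatus. This is a minimal sketch; Select-DetachedVirtualDisk is a hypothetical helper name, not a built-in cmdlet.

```powershell
# Hypothetical helper: passes through only the disks reported as Detached.
function Select-DetachedVirtualDisk {
    param(
        [Parameter(ValueFromPipeline = $true)]
        $VirtualDisk
    )
    process {
        # -eq also works if OperationalStatus holds more than one value:
        # it returns the matching entries, and any match counts as true.
        if ($VirtualDisk.OperationalStatus -eq 'Detached') {
            $VirtualDisk
        }
    }
}
```

On a cluster node it would be used as: Get-VirtualDisk | Select-DetachedVirtualDisk | Format-List FriendlyName, OperationalStatus, HealthStatus, IsManualAttach, Size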
Fix
Step 1
Remove-ClusterSharedVolume -Name "Cluster Virtual Disk (X500_ClustS2D_5)"
This removes the volume from the Cluster Shared Volumes (CSV) in the failover cluster, and places it in Available Storage in the cluster.
Get-ClusterResource -Name "Cluster Virtual Disk (X500_ClustS2D_5)" | Set-ClusterParameter -Name DiskRunChkdsk -Value 7
DiskRunChkdsk = 7 attaches the volume with its partition in read-only mode.
Get-ClusterResource -Name "Cluster Virtual Disk (X500_ClustS2D_5)" | Set-ClusterParameter -Name DiskRecoveryAction -Value 1
DiskRecoveryAction = 1 enables attaching the volume in read-write mode without any checks.
Start-ClusterResource -Name "Cluster Virtual Disk (X500_ClustS2D_5)"
The following should be initiated from the server where the detached volume is online:
Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask
Monitor the repair progress with Get-StorageJob.
Sometimes the repair fails. If it does, re-run it:
Stop-ClusterResource -Name "Cluster Virtual Disk (X500_ClustS2D_5)"
Start-ClusterResource -Name "Cluster Virtual Disk (X500_ClustS2D_5)"
Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask
When it’s finished and the volume is repaired, move on to Step 2.
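Because the scan runs as a scheduled task, one way to tell when it has stopped is to poll the task's state. This is a minimal sketch, assuming the task reports Running while the scan is active. Wait-ScheduledTaskIdle and its parameters are hypothetical names, and -GetState exists only so the loop can be tried without a cluster.

```powershell
# Hypothetical helper: blocks until the named task is no longer Running.
function Wait-ScheduledTaskIdle {
    param(
        [string] $TaskName = 'Data Integrity Scan for Crash Recovery',
        [int] $PollSeconds = 30,
        # Default: read the State property that Get-ScheduledTask returns.
        [scriptblock] $GetState = {
            param($Name)
            (Get-ScheduledTask -TaskName $Name).State
        }
    )
    do {
        # Wait first so a freshly started task has time to report Running.
        Start-Sleep -Seconds $PollSeconds
        $state = "$(& $GetState $TaskName)"
    } while ($state -eq 'Running')

    # Return the final state (for example Ready) to the caller.
    $state
}
```

Start the task as shown in Step 1, call Wait-ScheduledTaskIdle, then check the disk with Get-VirtualDisk. A state other than Running only means the task is no longer active; it does not by itself confirm that the repair succeeded.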
Step 2
Stop-ClusterResource -Name "Cluster Virtual Disk (X500_ClustS2D_5)"
Get-ClusterResource -Name "Cluster Virtual Disk (X500_ClustS2D_5)" | Set-ClusterParameter -Name DiskRunChkdsk -Value 0
Get-ClusterResource -Name "Cluster Virtual Disk (X500_ClustS2D_5)" | Set-ClusterParameter -Name DiskRecoveryAction -Value 0
Add-ClusterSharedVolume -Name "Cluster Virtual Disk (X500_ClustS2D_5)"
Start-ClusterResource -Name "Cluster Virtual Disk (X500_ClustS2D_5)"
Get-VirtualDisk should now show the virtual disk in a healthy state.
FriendlyName: X500_ClustS2D_5
ResiliencySettingName:
OperationalStatus: OK
HealthStatus: Healthy
IsManualAttach: True
Size: 10TB
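Beyond the virtual disk status, it can be worth confirming that both overrides from Step 1 were reset and that the disk is a Cluster Shared Volume again. A short sketch using the same resource name as above; the $resourceName variable is only for illustration, and the commands need to run on a cluster node.

```powershell
$resourceName = 'Cluster Virtual Disk (X500_ClustS2D_5)'

# Both parameters should read 0 again after Step 2.
Get-ClusterResource -Name $resourceName |
    Get-ClusterParameter -Name DiskRunChkdsk, DiskRecoveryAction

# The disk should be listed as a Cluster Shared Volume and be Online.
Get-ClusterSharedVolume -Name $resourceName

# Storage Spaces should report OK and Healthy for the virtual disk.
Get-VirtualDisk -FriendlyName 'X500_ClustS2D_5' |
    Format-List FriendlyName, OperationalStatus, HealthStatus
```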
Is there a plan of action if the scheduled task (Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask) is stuck in a Suspended state?
The suspended state eventually went away for me. I was not able to see any transition to a different state or any repair progress, but the repair did complete after being in a suspended state for a few minutes.
You sir are a LIFE SAVER!!! This fixed the failed S2D drives for us!!