Storage Spaces Direct: Repair failed volumes

Recently I’ve had to fix failed virtual drives (volumes) in a Storage Spaces Direct cluster.

The virtual disks went offline with the following errors:

Event ID: 1793 / Level: Error
Description:
Cluster physical disk resource online failed.

Physical Disk resource name: Cluster Virtual Disk (X500_ClustS2D_5)
Device Number: 4294967298
Device Guid: {e6dd0658-0bb1-401a-a938-7a0ff9d671d0}
Error Code: 15
Additional reason: SpaceStateInvalidFailure

Event ID: 1069 / Level: Error
Description:
Cluster resource ‘Cluster Virtual Disk (X500_ClustS2D_5)’ of type ‘Physical Disk’ in clustered role ‘7d10624b-262a-4ad0-9d11-5440543383d0’ failed. The error code was ‘0xf’ (‘The system cannot find the drive specified.’).

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Cause

This happens because ReFS can’t write metadata to the virtual disk when mounting it, causing the virtual disk to move between cluster hosts until it’s tried them all, and ultimately fails.

Get-VirtualDisk shows the virtual disk is in a detached state.

FriendlyName: X500_ClustS2D_5
ResiliencySettingName:
OperationalStatus: Detached
HealthStatus: Unknown
IsManualAttach: True
Size: 10TB

Fix

Step 1

Remove-ClusterSharedVolume -Name “Cluster Virtual Disk (X500_ClustS2D_5)”

This removes the volume from the Cluster Shared Volumes (CSV) in the failover cluster, and places it in Available Storage in the cluster.

Get-ClusterResource -Name “Cluster Virtual Disk (X500_ClustS2D_5)” | Set-ClusterParameter -Name DiskRunChkdsk -Value 7

DiskRunChkdsk = 7 read-only mode.

Set-ClusterResource -Name “Cluster Virtual Disk (X500_ClustS2D_5)” | Set-ClusterParameter -Name DiskRecoveryAction -Value 1

DiskRecoveryAction = 1 enables attaching the volume in read-write mode without any checks.

Start-ClusterResource -Name “Cluster Virtual Disk (X500_ClustS2D_5)”

The following should be initiated from the server where the detached volume is online:

Get-ScheduledTask -TaskName “Data Integrity Scan for Crash Recovery” | Start-ScheduledTask

Monitor the repair progress with Get-StorageJob.

Sometimes the repair fails, if it does re-run it:

Stop-ClusterResource -Name “Cluster Virtual Disk (X500_ClustS2D_5)”

Start-ClusterResource -Name “Cluster Virtual Disk (X500_ClustS2D_5)”

Get-ScheduledTask -TaskName “Data Integrity Scan for Crash Recovery” | Start-ScheduledTask

When it’s finished, and the volume is repaired, move onto Step 2.

Step 2

Stop-ClusterResource -Name “Cluster Virtual Disk (X500_ClustS2D_5)”
Get-ClusterResource -Name “Cluster Virtual Disk (X500_ClustS2D_5)” | Set-ClusterParameter -Name DiskRunChkdsk -Value 0
Get-ClusterResource -Name “Cluster Virtual Disk (X500_ClustS2D_5)” | Set-ClusterParameter -Name DiskRecoveryAction -Value 0
Add-ClusterSharedVolume -Name “Cluster Virtual Disk (X500_ClustS2D_5)”
Start-ClusterResource -Name “Cluster Virtual Disk (X500_ClustS2D_5)”

Get-VirtualDisk should now show the virtual disk in a healthy state.

FriendlyName: X500_ClustS2D_5
ResiliencySettingName:
OperationalStatus: OK
HealthStatus: Healthy
IsManualAttach: True
Size: 10TB

Advertisements

One comment

  1. Do we have any plan of action if (Get-ScheduledTask -TaskName “Data Integrity Scan for Crash Recovery” | Start-ScheduledTask) is stuck at Suspended state ?

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s