I'm having an issue with a Win2k3 guest with 2 RDMs. Periodically (but not on a regular schedule) the VM loses access to one of the disks, logging Event ID 11 ("The driver detected a controller error on \Device\Harddisk1.") and Event ID 15 ("The device, \Device\Scsi\symmpi1, is not ready for access yet."). The guest still responds to pings, but locks up when attempting to log on, and all shares become unavailable. A hard reset of the guest brings everything back up fine. We've also seen it come back on its own after being unavailable for an hour or two.

On the storage side, the array and LUN are seeing hardly any I/O, and the fiber isn't registering any issues.

When I first ran into the issue I noticed that the problematic RDM had an inconsistently named vmdk file. Without knowing all the history behind the VM, I shut it down and removed and re-added the RDM connections. The naming is now consistent, and both RDMs have been recreated under vSphere 4, but the problem remains. I have another VM with two RDMs that has had none of these issues. I also created a test VM with a couple of RDMs and couldn't recreate the issue.
Details of the situation:
Host: HP Blade
ESX: 4 U1 (same issue on each 4 U1 host; the error does not occur on my one 3.5 U4 host)
Storage: IBM DS4700 (fiber attached)
The issue first appeared when I rebuilt the hosts with ESX 4 and migrated this VM from 3.5 U1 to 4 U1.
The vmkernel log shows the errors below over and over when the lockup happens:
Jun 28 15:22:26 server2 vmkernel: 10:14:37:20.932 cpu2:4394)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x410006085c00) to NMP device "naa.600a0b80002999a80000300946c42cb3" failed on physical path "vmhba0:C0:T2:L21" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Jun 28 15:22:26 server2 vmkernel: 10:14:37:20.932 cpu2:4394)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.600a0b80002999a80000300946c42cb3" state in doubt; requested fast path state update.
..
Jun 28 15:22:26 server2 vmkernel: 10:14:37:20.932 cpu2:4394)ScsiDeviceIO: 770: Command 0x28 to device "naa.600a0b80002999a80000300946c42cb3" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
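In case it helps anyone looking at similar logs, here is a rough Python sketch for pulling the fields out of the nmp_CompleteCommandForPath entries and counting failures per physical path, to see whether one path is responsible. It assumes each log entry sits on a single line. The function names and lookup tables are my own, not anything from VMware. Command 0x28 is the SCSI READ(10) opcode; the label for host status 0x2 is my reading of VMware's host status codes and should be treated as an assumption.

```python
import re
from collections import Counter

# Matches the nmp_CompleteCommandForPath failure lines in the vmkernel log.
# Assumes the whole entry is on one line.
NMP_RE = re.compile(
    r'Command (0x[0-9a-fA-F]+) .*?to NMP device "([^"]+)" failed on '
    r'physical path "([^"]+)" '
    r'H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)'
)

# Only the values that show up in this log are listed.
SCSI_OPCODES = {0x28: "READ(10)"}
# Assumption: 0x2 corresponds to a "bus busy" host status in VMware's codes.
HOST_STATUS = {0x0: "OK", 0x2: "BUS_BUSY (assumed)"}


def parse_line(line):
    """Return a dict of fields for a matching log line, or None."""
    m = NMP_RE.search(line)
    if m is None:
        return None
    opcode, device, path, host, dev, plugin = m.groups()
    return {
        "opcode": int(opcode, 16),
        "device": device,
        "path": path,
        "host": int(host, 16),
        "dev": int(dev, 16),
        "plugin": int(plugin, 16),
    }


def tally_by_path(lines):
    """Count failures per (physical path, host status) pair."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            counts[(rec["path"], rec["host"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < vmkernel.log  (script name is hypothetical)
    for (path, host), n in tally_by_path(sys.stdin).most_common():
        label = HOST_STATUS.get(host, "unknown")
        print(f"{path}  H:{host:#x} ({label})  {n} failures")
```

If all the failures land on one path (for example only vmhba0:C0:T2:L21), that would point toward a path or controller problem rather than something specific to the RDM itself.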
Ideas, suggestions? I'm inclined to simply plan to migrate the data to VMDKs and get rid of the RDMs altogether, but if I can track down a cause/cure that would be better still.
Thanks