I've filed an SR about this but I'm still working through the 'maybe it's your hardware' stuff with the support reps.
We upgraded one of our ESX 3.5 hosts to ESX 4 and ran for a couple of days with no issues. Then we upgraded the rest of our cluster (6 machines) this past weekend. I have since experienced 7 PSOD lockups on 3 of the machines, all identical. I have attached a sample PSOD.
My understanding is that I'm supposed to be able to retrieve a core dump image from my VMKCORE partition using esxcfg-dumppart. However, when I try to do this I get the following:
Single slot coredump
Error running command. Unable to copy the dump partition: Couldn't find a valid VMKernel dump file. Dump partition might be uninitialized.
I am not sure how to initialize the dump partition. The dump partitions were set up automatically by the ESX installation software. I have gone to each VM host and run 'esxcfg-dumppart -a'. I figure either the partition is still not initialized, or ESX is not actually writing to the partition like it says.
We used to have similar issues (random machine check exceptions) with 3.5, but these were fixed by a BIOS update.
Has anyone else experienced this issue with ESX 4?
All system components are on the HCL except for our NICs - these are integrated Intel Pro/1000 EB controllers which were on the HCL for 3.5U4. We don't have any other cards to use, so if this is the culprit we'll not be able to upgrade to 4.0.