Dear fellow admins.
I am having problems with my vSphere cluster.
First, a brief explanation of my setup:
4 Fujitsu Siemens RX servers, each with 3 NICs.
IP net for the ESX servers and virtual servers: 192.168.10.x
IP nets for iSCSI: 192.168.11.x and 192.168.12.x
Each server has:
2 NICs for iSCSI (nets 11.x and 12.x)
1 NIC for everything else (net 10.x)
1 48-port 1000 Mbit switch (ZyXEL layer 3 switch). I am not using VLANs at all; all traffic goes through this switch.
1 IBM DS3300 iSCSI storage system with dual controllers.
Controller setup:
Controller A:
Port 1: 192.168.11.6
Port 2: 192.168.12.6
Controller B:
Port 1: 192.168.11.7
Port 2: 192.168.12.7
iSCSI is configured with 2 active/active (I/O) paths to each LUN and 2 inactive paths.
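A minimal Python sketch of how the addressing above yields the paths per LUN. It assumes /24 netmasks and uses hypothetical host NIC addresses (192.168.11.20 and 192.168.12.20), since the real ones are not listed. Each iSCSI NIC can only reach the controller ports on its own subnet, so with this addressing the sketch finds 4 paths per LUN, which would fit 2 active paths plus 2 inactive ones.

```python
import ipaddress
from itertools import product

# Controller ports, as listed in the setup above.
CONTROLLER_PORTS = {
    "A-port1": "192.168.11.6",
    "A-port2": "192.168.12.6",
    "B-port1": "192.168.11.7",
    "B-port2": "192.168.12.7",
}

# HYPOTHETICAL host iSCSI NIC addresses; the real ones are not given.
# ASSUMPTION: every iSCSI net is a /24.
HOST_ISCSI_NICS = {
    "iscsi-nic-1": ipaddress.ip_interface("192.168.11.20/24"),
    "iscsi-nic-2": ipaddress.ip_interface("192.168.12.20/24"),
}


def reachable_paths(nics, ports):
    """Return (nic, port) pairs where the NIC and the controller port share a subnet."""
    paths = []
    for (nic_name, nic), (port_name, addr) in product(nics.items(), ports.items()):
        if ipaddress.ip_address(addr) in nic.network:
            paths.append((nic_name, port_name))
    return paths


if __name__ == "__main__":
    paths = reachable_paths(HOST_ISCSI_NICS, CONTROLLER_PORTS)
    for nic_name, port_name in paths:
        print(f"{nic_name} -> {port_name}")
    print(f"{len(paths)} paths per LUN")
```

Each NIC ends up with one path to controller A and one to controller B, so losing a single NIC or switch port should still leave one path to each controller.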
There is no internal DNS.
There is 1 VMware Infrastructure server, which controls the cluster. IP: 192.168.10.37
Snippet from the hosts file on the ESX servers:
192.168.10.20 ESX1.Hosting ESX1
192.168.10.21 ESX2.Hosting ESX2
192.168.10.22 ESX3.Hosting ESX3
192.168.10.23 ESX4.Hosting ESX4
192.168.10.37 VSS.Hosting VSS
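Since there is no internal DNS, every host depends on its own hosts file, so the copies should agree. A minimal Python sketch that compares hosts-file texts and reports names that are missing or map to different IPs. The snippet above is assumed to be ESX1's copy; the ESX2 copy missing the VSS entry is hypothetical, only to show a mismatch.

```python
def parse_hosts(text):
    """Map every hostname and alias in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments; skip blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name.lower()] = ip  # hostnames are case-insensitive
    return mapping


def diff_hosts(files):
    """Report names that are missing or map to different IPs across hosts files.

    `files` maps a label (e.g. "ESX1") to the text of that host's hosts file.
    """
    parsed = {label: parse_hosts(text) for label, text in files.items()}
    all_names = set().union(*(m.keys() for m in parsed.values()))
    problems = {}
    for name in sorted(all_names):
        seen = {label: m.get(name) for label, m in parsed.items()}
        if len(set(seen.values())) > 1:  # None means the name is missing there
            problems[name] = seen
    return problems


# Snippet quoted above (assumed to be ESX1's copy).
ESX1_HOSTS = """\
192.168.10.20 ESX1.Hosting ESX1
192.168.10.21 ESX2.Hosting ESX2
192.168.10.22 ESX3.Hosting ESX3
192.168.10.23 ESX4.Hosting ESX4
192.168.10.37 VSS.Hosting VSS
"""

# HYPOTHETICAL copy from another host with the VSS entry missing.
ESX2_HOSTS = "\n".join(ESX1_HOSTS.splitlines()[:4])

if __name__ == "__main__":
    for name, seen in diff_hosts({"ESX1": ESX1_HOSTS, "ESX2": ESX2_HOSTS}).items():
        print(name, seen)
```

Feeding it the real hosts file from each ESX server and from the VMware Infrastructure server should print nothing if all copies match.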
Alright, I have attached a picture of the network setup on one server (it is the same on all of them).
Now for my problems.
Problem 1:
Some of the ESX hosts sometimes get disconnected from the cluster and then reconnect afterwards.
From the event log:
Host 192.168.10.21 in datacenter Datacenter is not responding.
Then the server gets disconnected, and then:
Alarm 'Virtual machine cpu usage' on SERVER changed from Green to Gray (info, 12-05-2010 10:19:44)
Alarm 'Host connection failure' on 192.168.10.21 triggered an action (info, 12-05-2010 10:19:44)
Alarm 'Host connection failure' on entity 192.168.10.21 send SNMP trap (info, 12-05-2010 10:19:44)
Then it normally reconnects by itself.
Problem 2:
Sometimes the virtual servers lose their connection to the VMware network.
Today one virtual server was even shut down and started again. Before that happened, I noticed this in the Tasks and Events for "backup server":
192.168.10.21 is disconnected (.21 is the ESX server where the backup server is located)
And then:
Host is connected (info, 12-05-2010 06:40:27)
(info, 12-05-2010 05:58:19)
This occurred 8 times within 1½ hours, and then the backup server shut down.
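To quantify flapping like "8 times within 1½ hours" across the event logs, a minimal Python sliding-window sketch can count the most events falling in any 90-minute span. It assumes the timestamps are day-month-year, as in "12-05-2010 06:40:27"; the two timestamps quoted under Problem 2 serve only as sample input.

```python
from datetime import datetime, timedelta

# ASSUMPTION: timestamps are day-month-year, as in "12-05-2010 06:40:27".
TIMESTAMP_FORMAT = "%d-%m-%Y %H:%M:%S"


def max_events_in_window(timestamps, window=timedelta(minutes=90)):
    """Return the largest number of events that fall within any span of `window`."""
    times = sorted(datetime.strptime(t, TIMESTAMP_FORMAT) for t in timestamps)
    best = 0
    start = 0
    for end, current in enumerate(times):
        # Advance the window start until it is no more than `window` before `current`.
        while current - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


if __name__ == "__main__":
    # The two timestamps quoted under Problem 2, used only as sample input.
    sample = ["12-05-2010 06:40:27", "12-05-2010 05:58:19"]
    print(max_events_in_window(sample))
```

Running the same count per host over a whole exported log would show whether the disconnects cluster on .21 or hit all hosts.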
Problem 3:
One of the ESX servers shows this in its event log:
Alarm 'Cannot connect to storage' on entity 192.168.10.20 send SNMP trap (info, 12-05-2010 01:43:13)
Alarm 'Cannot connect to storage' on 192.168.10.20 changed from Gray to Gray (info, 12-05-2010 01:43:13)
Alarm 'Cannot connect to storage' on 192.168.10.20 changed from Gray to Gray (info, 12-05-2010 01:43:13)
Lost connectivity to storage device naa.600a0b80005aedbb00000ac54b17dfff. Path vmhba33:C7:T0:L3 is down. Affected datastores: "IBM LUN3". (error, 12-05-2010 01:41:03)
Lost connectivity to storage device naa.600a0b80005aedbb00000f7e4b656403. Path vmhba33:C7:T0:L5 is down. Affected datastores: "IBM LUN5". (error, 12-05-2010 01:41:03)
Lost access to volume 4b2a36fb-986d63ce-b6ea-000ae48a8ba7 (IBM LUN2) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. (info, 12-05-2010 01:41:04)
Successfully restored access to volume 4b2a36fb-986d63ce-b6ea-000ae48a8ba7 (IBM LUN2) following connectivity issues. (info, 12-05-2010 01:42:03)
and so on
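A minimal Python sketch for pulling the "path is down" events out of a log and grouping them by adapter, channel and target (the vmhbaN:C:T:L runtime name, ignoring the LUN). If the failures keep landing on the same adapter/channel/target, that might point at one iSCSI link rather than at individual LUNs; treat it as a hint to check, not a diagnosis. The regex is written only against the messages quoted above.

```python
import re
from collections import Counter

# Pattern for the "Lost connectivity to storage device" events quoted above.
# A runtime path name has the form vmhbaN:C<channel>:T<target>:L<lun>.
LOST_PATH = re.compile(
    r"Lost connectivity to storage device (?P<device>naa\.[0-9a-f]+)\. "
    r"Path (?P<adapter>vmhba\d+):C(?P<channel>\d+):T(?P<target>\d+):L(?P<lun>\d+) is down\. "
    r'Affected datastores: "(?P<datastore>[^"]+)"'
)


def down_paths(lines):
    """Return one dict per 'path is down' event found in the given log lines."""
    return [m.groupdict() for m in map(LOST_PATH.search, lines) if m]


def failures_by_link(rows):
    """Count down events per (adapter, channel, target), ignoring the LUN."""
    return Counter((r["adapter"], r["channel"], r["target"]) for r in rows)


EVENTS = [
    'Lost connectivity to storage device naa.600a0b80005aedbb00000ac54b17dfff. '
    'Path vmhba33:C7:T0:L3 is down. Affected datastores: "IBM LUN3".',
    'Lost connectivity to storage device naa.600a0b80005aedbb00000f7e4b656403. '
    'Path vmhba33:C7:T0:L5 is down. Affected datastores: "IBM LUN5".',
    "Lost access to volume 4b2a36fb-986d63ce-b6ea-000ae48a8ba7 (IBM LUN2) due to "
    "connectivity issues. Recovery attempt is in progress and outcome will be "
    "reported shortly.",
]

if __name__ == "__main__":
    rows = down_paths(EVENTS)
    for row in rows:
        print(row["datastore"], row["adapter"], row["channel"], row["target"], row["lun"])
    print(failures_by_link(rows))
```

In the quoted sample, both down paths are on vmhba33:C7:T0, i.e. the same channel and target for two different LUNs.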
Please help me, guys. I have read a lot of documentation on this before posting here, but frankly I no longer know what to do. These are serious problems.
Would it be a good idea to install a second NIC in each server and dedicate it to VM traffic, so that the service console traffic is separated?