Ugh. My brain hurts. I’ve spent the past few hours reviewing scripted ESX installations and working on a PowerShell script for a customer that will reorder vmnics after a scripted installation is complete (because I can’t find any other way to force their order during the install). It’s been a few months since I’ve done a scripted installation, so I definitely needed a refresher. Plus, according to the VMware Enterprise Administration Exam Blueprint v3.5, section 8.1 is all about automating ESX deployments. The good news is that section 8.1 is the last section of the blueprint, so I believe I’m almost done preparing for the VCDX Admin Exam, which I’m scheduled to take in a few days.
Anyway, going back to the beginning of the Blueprint, and continuing from where I left off, here is the next section of my study notes.
Objective 1.3 – Troubleshoot Virtual Infrastructure storage components.
Knowledge
Identify storage related events and log entries. Analyze storage events to determine related issues.
All storage related events will be recorded in the /var/log/vmkernel log file. Most of the messages in this log file are fairly cryptic and can be difficult to interpret. Furthermore, this log file contains all messages from the vmkernel, not just storage related messages, so you’ll have to filter through it. An easy way to do this is simply to search for SCSI. For example, the command cat /var/log/vmkernel | grep SCSI on one of my servers produces the following output (only showing the last 10 lines) …
[root@cincylab-esx3 root]# cat /var/log/vmkernel | grep SCSI | tail -10
May 18 12:57:47 cincylab-esx3 vmkernel: 2:16:01:55.748 cpu2:1069)iSCSI: login phase for session 0x8603f90 (rx 1071, tx 1070) timed out at 23051576, timeout was set for 23051576
May 18 12:57:47 cincylab-esx3 vmkernel: 2:16:01:55.748 cpu2:1071)iSCSI: session 0x8603f90 connect timed out at 23051576
May 18 12:57:47 cincylab-esx3 vmkernel: 2:16:01:55.748 cpu2:1071)<5>iSCSI: session 0x8603f90 iSCSI: session 0x8603f90 retrying all the portals again, since the portal list got exhausted
May 18 12:57:47 cincylab-esx3 vmkernel: 2:16:01:55.748 cpu2:1071)iSCSI: session 0x8603f90 to iqn.2004-08.jp.buffalo:TS-IGLA68-001D7315AA68:vol1 waiting 1 seconds before next login attempt
May 18 12:57:48 cincylab-esx3 vmkernel: 2:16:01:56.748 cpu2:1071)iSCSI: bus 0 target 0 trying to establish session 0x8603f90 to portal 0, address 10.10.8.200 port 3260 group 1
May 18 12:58:00 cincylab-esx3 vmkernel: 2:16:02:09.355 cpu2:1071)iSCSI: bus 0 target 0 established session 0x8603f90 #3, portal 0, address 10.10.8.200 port 3260 group 1
May 18 12:58:01 cincylab-esx3 vmkernel: VMWARE SCSI Id: Supported VPD pages for vmhba35:C0:T0:L0 : 0×0 0×80 0×83
May 18 12:58:01 cincylab-esx3 vmkernel: VMWARE SCSI Id: Device id info for vmhba35:C0:T0:L0: 0×1 0×1 0×0 0×18 0×42 0×55 0×46 0×46 0×41 0x4c 0x4f 0×0 0×0 0×0 0×0 0×0 0×1 0×0 0×0 0×0 0×0 0×0 0×0 0×0 0×2 0×0 0×0 0×0
May 18 12:58:01 cincylab-esx3 vmkernel: VMWARE SCSI Id: Id for vmhba35:C0:T0:L0 0×20 0×20 0×20 0×20 0×56 0×49 0×52 0×54 0×55 0×41
[root@cincylab-esx3 root]#
If you look closely, I clearly had some issues with my iSCSI appliance a few hours ago. I decided make some configuration changes to the switch and then, all of a sudden, the ESX server lost connectivity to its storage. Weird!
Anyway, what does all this mean? There’s an really good VMworld Europe 2008 presentation (which you can get from www.vmworld.com) titled VI3 Advanced Log Analysis, which goes into detail about how to interpret VMware log files. From that presentation, I found this diagram which describes the components of a message in the vmkernel log file.
Skills and Abilities
Verify storage configuration and troubleshoot storage connection issues using CLI , VI Client and logs
- Rescan events
A rescan event can be initiated with the esxcfg-rescan at the command line. The output should look like the following …
[root@cincylab-esx3 root]# esxcfg-rescan vmhba32
Rescanning vmhba32 …
On scsi1, removing: 0:0.
On scsi1, adding: 0:0.
Done.
[root@cincylab-esx3 root]# cat /var/log/vmkernel | grep SCSI | tail -3
May 18 19:13:09 cincylab-esx3 vmkernel: VMWARE SCSI Id: Supported VPD pages for vmhba32:C0:T0:L0 : 0×0 0×80 0×83
May 18 19:13:10 cincylab-esx3 vmkernel: VMWARE SCSI Id: Device id info for vmhba32:C0:T0:L0: 0×2 0×0 0×0 0×18 0x4c 0×69 0x6e 0×75 0×78 0×20 0×41 0×54 0×41 0x2d 0×53 0×43 0×53 0×49 0×20 0×73 0×69 0x6d 0×75 0x
May 18 19:13:10 cincylab-esx3 vmkernel: VMWARE SCSI Id: Id for vmhba32:C0:T0:L0 0×36 0×52 0×58 0×36 0x4a 0×39 0×39 0×58 0×20 0×20 0×20 0×20 0×20 0×20 0×20 0×20 0×20 0×20 0×20 0×20 0×47 0×42 0×30 0×31 0×36 0x
[root@cincylab-esx3 root]#
- Failover events
I don’t have redundant paths in my lab to simulate this. So, again from the VMworld presentation, VI3 Advanced Log Analysis, here is a screen shot from the slide that covers this topic.
There will obviously be a lot of different types of error and event messages in /var/log/vmkernel. And I’m certainly not going to try and list every possible combination here. I highly suggest you download the VMworld preso because it does a great job of explaining how to further decipher the log files (like defining SCSI error codes).
Well, that’s about it for this section. Back to PowerShell scripting for another hour or so.