VMware: datastores entering APD randomly

For a long time I was using 1 Gbps uplinks configured in LACP with the IP Hash policy to access my datastores residing on a NetApp storage array, and it worked just fine (at least, as well as can be expected from a 1 Gbps connection). But the time finally came when I was able to switch my infrastructure to a 10 Gbps connection.

The topology is simple, since we use the NFS protocol for our datastores: each VMware host connects over 2 x 10 Gbps LACP (IP Hash policy on standard vSwitches) through 10 Gbps Enterasys switches to a 4-port dynamic LACP aggregate on the NetApp side. These switches carry storage traffic exclusively, so there are no VLANs configured on them.

[Image: network topology diagram]

Current configuration:

NetApp – vif with 4 interfaces in trunk mode “LACP”, MTU 9000
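On a 7-Mode NetApp system that vif is created roughly as sketched below (a minimal example – the interface names e0a–e0d, the vif name and the IP address are placeholders, and the exact syntax depends on your ONTAP version):

    # Create a dynamic (LACP) multimode vif over four ports, balancing on IP addresses
    vif create lacp vif0 -b ip e0a e0b e0c e0d

    # Assign an address to the trunk and enable jumbo frames
    ifconfig vif0 192.168.10.10 netmask 255.255.255.0 mtusize 9000 up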

VMware – a separate vSwitch with 2 VMkernel ports (storage traffic and vMotion). Two NICs, configured with the IP Hash policy and MTU 9000 (on both the port group and the vSwitch):

[Image: vSwitch configuration]

Both adapters are active, no standby adapters.
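The same settings can be applied (or verified) from the ESXi command line, roughly as follows – a minimal sketch in which the vSwitch, port group and VMkernel names (vSwitch1, Storage, vmk1) are placeholders for your own:

    # MTU 9000 on the vSwitch and on the storage VMkernel interface
    esxcli network vswitch standard set -v vSwitch1 -m 9000
    esxcli network ip interface set -i vmk1 -m 9000

    # Route based on IP hash, on both the vSwitch and the port group
    esxcli network vswitch standard policy failover set -v vSwitch1 -l iphash
    esxcli network vswitch standard portgroup policy failover set -p Storage -l iphash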

Enterasys switches – dynamic LACP for the NetApp ports, static LACP for VMware’s IP Hash, jumbo frames on all ports and link aggregations.

So far so good. But here comes the problem! I was able to mount all of my datastores without any issues and I was pretty sure everything worked, but whenever I tried to upload a file or perform a Storage vMotion, it failed with error messages (All Paths Down state) and my datastores were disconnected at the same time:

[Image: All Paths Down error messages]

I double-checked all the VMware settings: proper MTU, proper load balancing policy (IP Hash) on both the port groups and the vSwitch, no standby adapters. The NetApp config was also fine – a single vif in LACP trunk mode, MTU 9000 enabled.
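One quick way to rule out an MTU mismatch along the path is a non-fragmented vmkping from the ESXi shell (the address below is a placeholder for the NetApp NFS interface):

    # 8972 = 9000 bytes minus 28 bytes of ICMP/IP headers; -d sets the don't-fragment bit
    vmkping -d -s 8972 192.168.10.10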

Guess what was left? Yeah… reading all the documentation, entirely.

There are at least a few things that need to be configured on the switches to make everything work properly:

  • VMware’s IP Hash policy has to be paired with a static link aggregation (“static LACP”) on the switch ports facing the ESXi hosts, not dynamic LACP.
  • Using an LACP trunk on the NetApp means you should use dynamic LACP on the switch ports facing the NetApp.
  • Make sure that for the VMware ports the aggregation hashes on Layer 3 (IP hash algorithm), not Layer 2!
  • If using VLANs – make sure they are configured both on the physical switch ports and on the logical aggregation links (plus the switch uplinks…).
  • If using jumbo frames – make sure the value 9216 (or a dynamic size) is set on the switch ports and aggregation links, as the NetApp documentation states.
  • Additionally, it is recommended to disable flow control on both the storage and VMware sides. In my case it produced a giant performance boost! I was really surprised and impressed by that. To disable flow control on the VMware side (KB 1013413), open the CLI and run the commands shown in the sketch after this list.

    This has to be done on each adapter in the storage VMkernel group. Note that these changes are not persistent across reboots and should be added to the /etc/rc.local.d/local.sh file, as described in VMware’s KB 2043564.
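    A rough sketch of those commands, based on KB 1013413 – the vmnic names are placeholders for the uplinks of your storage vSwitch, and on newer ESXi releases the esxcli network nic pauseParams namespace offers an equivalent:

        # Show the current pause (flow control) settings for an uplink
        ethtool -a vmnic2

        # Disable RX and TX flow control on each uplink in the storage vSwitch
        ethtool --pause vmnic2 rx off tx off
        ethtool --pause vmnic3 rx off tx off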

With all these things configured on the storage, network, and VMware sides, everything worked for me – and it should work for you as well. It also taught me a lesson: read the documentation for all parts of the infrastructure, not just VMware.
