Monday, 26 November 2012

VMware vCD with Nexus 1000V and Cisco Virtual Security Gateway

I have just had a request on a pre-sales call for vCD to use Cisco VSG.  Now, as I posted before, I have only recently taken the vCD course, so I was very much "I'll have to check that out".  Upon looking into it I could see that this is possible with 5.1, utilising VXLAN from the vShield devices.

The Cisco website has some more information on the setup of this.  I am very excited to get stuck in with doing this and see how things go once I have the full requirements.


Monday, 19 November 2012

EMC VNX Round Robin supported without the need for PowerPath



I was recently asked to do an upgrade of a VI 3.5 environment to vSphere 5.1.  After doing the normal day on-site checking the compatibility of all the components etc., I quickly found that they were going to need a completely new server estate.

As luck would have it the customer had just purchased a VNX-based EMC SAN and decided to use this for the vSphere environment.  So I started my design process, and when checking the firmware versions etc. on the VMware HCL I found a nice article detailing that you can now use the Round Robin PSP instead of Fixed.  Previously this was a licensed feature that needed PowerPath to be installed on the hosts.

Have a read here VMware KB2034799 to confirm support and BIOS/firmware versions.
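As a rough sketch of what this looks like from the ESXi shell (the naa device ID below is only a placeholder, and the SATP name assumes the VNX is presented in ALUA failover mode), the PSP can be changed per device, or set as the default for the claiming SATP so that new LUNs pick it up automatically:

  # List devices and the path selection policy currently in use
  esxcli storage nmp device list

  # Switch a single device to Round Robin (replace the naa ID with your own)
  esxcli storage nmp device set --device naa.60060160xxxxxxxx --psp VMW_PSP_RR

  # Or make Round Robin the default PSP for the VNX ALUA SATP
  esxcli storage nmp satp set --satp VMW_SATP_ALUA_CX --default-psp VMW_PSP_RR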

vSphere 5.1 What's New - Platform


User access

VMware have significantly increased the usability of the web-based platform and now recommend that the web portal is how the vSphere environment is managed.  This will upset a large number of users, as moving away from the thick client is not something many users will want to do.

VMware have also introduced, for the first time, support for more than one account with root-level access on ESXi.  This will allow administrators to have their own logins to the console and is intended to assist with tracking configuration changes in much more detail.

Hardware Accelerated Graphics

This is a feature that will be utilized in the next release of VMware View.  This feature is not likely to be used in server installations and is intended to provide better graphics performance for VDI.  The VMware documentation describes the feature below.

With vSphere 5.1, VMware has partnered with NVIDIA to provide hardware-based vGPU support inside the virtual machine. vGPUs improve the graphics capabilities of a virtual machine by off-loading graphic-intensive workloads to a physical GPU installed on the vSphere host. In vSphere 5.1, the new vGPU support targets View environments that run graphic-intensive workloads such as graphic design and medical imaging. Hardware-based vGPU support in vSphere 5.1 is limited to View environments running on vSphere hosts with supported NVIDIA GPU cards (refer to the VMware Compatibility Guide for details on supported GPU adapters). In addition, the initial release of vGPU is supported only with desktop virtual machines running Microsoft Windows 7 or 8. Refer to the View documentation for more information on the vGPU capabilities of vSphere 5.1. NOTE: vGPU support is enabled in vSphere 5.1, but the ability to leverage this feature is dependent on a future release of View. Refer to the View documentation for information on when this feature will be available.

 

Improved CPU virtualization

Improvements have been made to CPU virtualization, allowing more information about the CPU architecture to be exposed to the VM.  This gives VMs near-native access to the CPU and allows advanced OS features to run more successfully.



Auto Deploy

There have been two new modes added to Auto Deploy.  However, I still anticipate that uptake of this feature will be slow until the next release of vSphere.

Stateless caching mode is a new mode where, if the host cannot PXE boot from the network due to a failure of some sort, it will boot from a backup image located on local media.  In normal operation, caching mode works the same as stateless mode, deploying an image over the network for the host to run from a RAM drive.  However, an extra step is performed in caching mode: the image deployed to the host's memory is also written to the internal boot device (hard disk, SD card or USB stick).  This makes sure the host can still boot in the case of a PXE boot failure.

The second mode that was added to Auto Deploy is stateful install mode.  This mode is used to actually install the ESXi image onto the local media of the host.  It is very similar to stateless caching; the difference is that the boot order in the host's BIOS should be changed.  The host only needs to PXE boot once from the Auto Deploy server to install the ESXi image and should then boot from the local media afterwards.

Monday, 12 November 2012

VCP5-IaaS Training Course

This week my company have put me on a training course to brush up on some of my vCloud Director skills.  With the release of the vCloud Director 5.1 suite and the end of support for Lab Manager, we have had a massive uptake in the vCloud Director 5.1 products.  The multi-tenancy, fast provisioning, advanced networking features and the ability to quickly create test environments and then delete them again make vCloud Director better in many ways than Lab Manager.

I already have the VCP5-IaaS exam booked for a week after the end of the course.  The reason for this is that the waiting time at my local test center is horrendous and I want to get the exam under my belt by the end of the year.

As well as using the VMware course material I am using the trusted Brown Bags.  Some of the podcast videos on there are very good and I would like to thank the guys that do them.  I also like to read books and have found this one very good: Cloud Computing with VMware.


Anyway, I will update here with my thoughts on the design and the course when I have more information.

vSphere 5.1 What's New - Networking


Networking

 

Networking health check

This feature is aimed at bridging the divide that is often seen between vSphere administrators and network administrators.  Configuration errors can easily occur when there are a large number of uplinks to be configured for the vSphere infrastructure.

The process checks that the following items are configured correctly on the VDS:

  1. VLAN
  2. MTU
  3. Network adapter teaming

The VMware documentation states that this feature works by sending probe packets over the layer 2 network every minute to the network equipment connected directly to the VDS uplinks.  Request and acknowledgement packets are sent to probe the network.  If these packets are dropped or rejected, a configuration problem is highlighted on the VDS.

VDS management rollback and recovery

One of the major problems with the VDS in the past was that if there was a complete DC failure and vCenter was virtualized (a VMware recommendation), then when hosts were recovered their networking would not be restored until vCenter was online to provide the VDS configuration to the hosts.  However, the host containing vCenter had no networking, because vCenter was not available to lay down the VDS configuration.  This often resulted in the management network being placed on a separate VSS.

vSphere 5.1 avoids this by introducing management rollback.  If, when the hosts are up and running, they cannot communicate with vCenter, an automatic rollback to the last working configuration (VSS) is performed.  This allows the hosts to communicate with vCenter, and once vCenter is fully operational the VDS is recreated.  vSphere 5.1 also allows the VDS configuration to be worked with from the DCUI to provide better troubleshooting.
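As a small troubleshooting aside, each host keeps a local copy of the distributed switch configuration that can be inspected from the shell.  This is only a read-only sketch; the actual restore and rollback options live in the DCUI network restore menu:

  # Show the host's local view of any distributed switches it is a member of
  esxcli network vswitch dvs vmware list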

Link Aggregation Control Protocol

This has always been a massive point of confusion for administrators.  The vSphere documentation has often misused LACP terms when stating what is and what isn't supported.  However, this has now been clarified in vSphere 5.1.

Previously only static link aggregation was supported; now full dynamic LACP is supported, but only on the VDS.

Bridge Protocol Data Unit Filter

This is a new feature and builds on top of the recommendations to disable STP and enable PortFast on upstream switch ports for both VSS and VDS.  It is now recommended to enable BPDU filtering to stop BPDU packets from virtual machines reaching the physical switches.  The VMware documentation details the behavior below.

VMware virtual switches do not generate BPDU packets. But if a virtual machine sends them, they will be forwarded to the physical switch port over the uplink. When the physical switch port, configured with the BPDU guard setting, detects the packet, that port will be put in err-disabled state. In this err-disabled state, the switch port is completely shut down, which prevents effecting STP. However, the vSphere environment will detect this port failure and will move the virtual machine traffic over another uplink that is connected to another physical switch port. The BPDU packets will be seen on this new physical switch port, and the switch will block that port as well. This ultimately will cause a denial-of-service (DoS) attack situation across the virtual infrastructure cluster.

This configuration is recommended by VMware and is enabled on the vSphere side, not on the physical upstream switch.  This will be tested and added to the standard building blocks documentation.
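My understanding is that the filter is exposed as a host-level advanced setting (Net.BlockGuestBPDU) rather than a per-switch option, so a minimal sketch of enabling it from the ESXi shell would be:

  # Enable the guest BPDU filter on the host (0 = disabled, 1 = enabled)
  esxcli system settings advanced set -o /Net/BlockGuestBPDU -i 1

  # Confirm the value
  esxcli system settings advanced list -o /Net/BlockGuestBPDU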

Increases in scalability



Wednesday, 7 November 2012

vSphere 5.1 What's New - Storage


Storage


The number of hosts able to share a read-only file has increased from 8 to 32

A read-only file in VMware is often a source VMDK used for linked clones in either VMware View or VMware vCloud Director.  Linked clones are used for quick deployment in both View and vCD.

The previous limit was 8 hosts.  This resulted in View and vCD clusters being limited to 8 hosts, because if one of the linked clones was placed on a 9th host in the cluster, that host would be denied access to the source VMDK file.

This limit has been increased to allow up to 32 hosts to access the read-only file.  This removes the 8-host limitation for View and vCD; the limit is now defined by the cluster maximum of 32 hosts.

Introduction of a new VMDK type, the SE virtual disk (space-efficient virtual disk)

A well-known problem with thin-provisioned VMDK files was that if space was freed in the OS inside the VMDK by deleting files etc., the VMDK file did not shrink.

With SE virtual disks the space can be reclaimed. 
  1. VMware Tools scans the OS disks for allocated but unused blocks.  These blocks are marked as free.
  2. The SCSI UNMAP command is run in the guest to instruct the virtual SCSI layer in the VMkernel to mark the blocks as free in the SE VMDK.
  3. Once the VMkernel knows which blocks are free, it reorganizes the SE VMDK so that the data is contiguous, with all the free blocks at the end of the VMDK.
  4. The VMkernel then sends either a SCSI UNMAP command to the SCSI array or an RPC TRUNCATE command to NFS-based storage.
This then frees the unused blocks in the SE VMDK and frees space on the datastore for other VMs.
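Out of interest, the new format can also be created by hand with vmkfstools.  This is only a sketch and assumes the sesparse disk type string is accepted on your ESXi build; officially the format is driven by View (and vCD) rather than created manually:

  # Create a 20GB SE sparse (space-efficient) disk on a VMFS-5 datastore
  vmkfstools -c 20G -d sesparse /vmfs/volumes/datastore1/test/test-sesparse.vmdk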

Please see the below image taken from the VMware white paper for storage in vSphere 5.1


Improvements in detecting APD (all paths down) and PDL (permanent device loss)

It is common, when an APD condition is seen by an ESXi host, for the host to become unresponsive and eventually become disconnected from vCenter.  This is because the hostd process does not know whether the removal of the storage devices is a permanent or a transient state for the lost paths, so hostd does not time out the rescan operation for rediscovering the paths, or any other threads it is processing.  Everything just waits for I/O to be received from the storage array, and because hostd has a finite number of worker threads, the hostd process often becomes unresponsive and crashes.

Detection has been improved over the last few releases of vSphere by introducing PDL (permanent device loss) detection, which relies on specific SCSI sense codes from the target array.  In vSphere 5.1, improvements have been made to this function, along with alterations to how APD conditions are handled.
The following is an extract from the VMware documentation.

In vSphere 5.1, a new time-out value for APD is being introduced. There is a new global setting for this feature called Misc.APDHandlingEnable. If this value is set to 0, the current (vSphere 5.0) condition is used, i.e., permanently retrying failing I/Os. If Misc.APDHandlingEnable is set to 1, APD handling is enabled to follow the new model, using the time-out value Misc.APDTimeout. This is set to a 140-second time-out by default, but it is tunable. These settings are exposed in the UI. When APD is detected, the timer starts. After 140 seconds, the device is marked as APD Timeout. Any further I/Os are fast-failed with a status of No_Connect, preventing hostd and others from getting hung. If any of the paths to the device recover, subsequent I/Os to the device are issued normally, and special APD treatment concludes.

The above text indicates that the advanced setting Misc.APDHandlingEnable should be set to 1 to allow for APD time-outs and to prevent the hostd process from crashing when APD occurs.

Another setting that should be configured is disk.terminateVMOnPDLDefault.  This allows HA to restart VMs that were impacted by the device loss on another host that is unaffected by the storage issue.  There is a known problem with this setting restarting machines that were gracefully shut down during the outage; specifying the advanced HA option das.maskCleanShutdownEnabled removes this problem.  Both advanced settings should be used together for the best results from an APD/PDL condition.
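A minimal sketch of the host-side values from the ESXi shell is below.  Note that das.maskCleanShutdownEnabled is an HA cluster advanced option set in vCenter, and disk.terminateVMOnPDLDefault is applied in the host configuration rather than through esxcli, so only the two Misc settings are shown:

  # Enable the new vSphere 5.1 APD handling model
  esxcli system settings advanced set -o /Misc/APDHandlingEnable -i 1

  # The APD time-out defaults to 140 seconds but can be tuned if required
  esxcli system settings advanced set -o /Misc/APDTimeout -i 140

  # Check the current values
  esxcli system settings advanced list -o /Misc/APDHandlingEnable
  esxcli system settings advanced list -o /Misc/APDTimeout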

Storage DRS V2.0

Improvements to Storage DRS include additional detection of latency and of datastore placement on the storage device.  Storage DRS introduces storage correlation, a new feature used to detect whether datastores reside on the same SAN spindles.  There would be little benefit in moving a VM from one datastore to another if they reside on the same set of physical spindles.  Previously SDRS would analyse the constraints on the datastores and move a VM to a less populated datastore, with the assumption that the VM would receive a performance benefit because the datastore was less populated.  Now SDRS will investigate whether the datastores sit on the same spindles and, if so, conclude that there will be little to no benefit and, depending on the aggressiveness of the SDRS settings, will not move the VM's VMDK files.

VmObservedLatency is another new metric in SDRS and is used to measure the latency from the time the VMkernel receives the storage command to the time the VMkernel receives a response from the storage array.  This is an improvement over the previous level of monitoring, which only measured the latency after the storage request had left the ESXi host.  The new metric allows latency inside the host to be monitored as well.  This is useful because the latency between the array and the host may be 1 or 2 milliseconds, but the latency inside the host at the VMkernel could be 20-30 milliseconds due to the number of commands being issued and queued on the HBA that is being used for a specific datastore.

Tuesday, 6 November 2012

What's New in vSphere 5.1 - High-level overview


Compute Resource

There have been several increases in compute resources for vSphere 5.1:

  • VMs now support 64 vCPUs
  • VMs now support 1TB of RAM
  • Virtual machine hardware version 9 supports these larger virtual machine configurations, as well as providing enhanced vCPU counters and improved graphics with a new hardware accelerator.
  • Hosts can support up to 256 physical CPUs.

Storage Resource

There have been several improvements to the way storage is handled by vSphere 5.1.  Some are focused on providing better performance within vSphere for virtual machines, while others are focused on removing some limitations of the VMFS-5 file system.

  • The number of hosts able to share a read-only file has increased from 8 to 32.
  • Introduction of a new VMDK type, the SE virtual disk (space-efficient virtual disk)
  • Improvements in detecting APD (all paths down) and PDL (permanent device loss)
  • Software FCoE boot support has been added along with improvements to the software FCoE HBA
  • VAAI (vStorage APIs for Array Integration) has been improved for NFS, targeted specifically at vCD and VDI to offload the creation of linked clones.
  • Additional troubleshooting tools added to esxcli for storage-related troubleshooting.
  • SSD monitoring using a SMART-based plugin to identify media wear (see the example after this list).
  • SIOC (Storage I/O Control) now detects the correct latency level for each datastore, instead of using the default of 30ms
  • Storage DRS V2.0 now interacts with vCD.  Storage DRS can detect linked clones, and vCD can see the datastore clusters.  Improvements in both product ranges.
  • Storage vMotion parallel migrations.  Storage vMotion can now migrate up to 4 virtual disk files at the same time.  This is an improvement over the serial (single file at a time) migration used in 5.0.
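On the SSD SMART point above, a quick sketch of pulling the wear and health data from the ESXi shell (the naa ID is just a placeholder for a local SSD device):

  # Show SMART data (media wearout indicator, reallocated sectors, temperature etc.) for a device
  esxcli storage core device smart get -d naa.xxxxxxxxxxxxxxxx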

Networking

There have been several improvements made to VMware's Distributed Switch.  All items listed here are in reference to the VDS; no changes have been made to VMware Standard Switches (VSS).

  • A network health check feature has been added to check VLAN, MTU and network teaming settings on a regular basis for configuration errors
  • Backup and restore of the VDS configuration
  • Management network rollback and recovery has been added, as well as the ability to manage the VDS from each host's DCUI
  • The static port binding option of the VDS portgroup has the Auto Expand option enabled by default.  This needed to be enabled manually on previous versions of vSphere.
  • The dynamic port binding option of the VDS portgroup will be removed after this release of vSphere.
  • MAC addresses are now more flexible; the 64K MAC address limitation has been removed to allow vCD environments to scale larger.
  • Dynamic LACP (Link Aggregation Control Protocol) is now fully supported.  Previous versions only supported static link aggregation.
  • A Bridge Protocol Data Unit (BPDU) filter has been added to the VDS.  It stops VMs forwarding BPDU packets to the physical switches and prevents the switch from disabling ports because of these packets.
  • NetFlow version 10 (IPFIX) is now supported on the VDS
  • Port mirroring with RSPAN (Remote Switch Port Analyzer) and ERSPAN (Encapsulated Remote Switch Port Analyzer) is now supported on the VDS
  • Enhanced SNMP support for v1, v2 and v3
  • Single Root I/O Virtualization (SR-IOV) support has been increased in vSphere 5.1

 

vSphere Platform

Several changes were made to the platform with the sole intention of improving support, management and monitoring of the vSphere environment.  This included increasing the support for SNMP; as detailed above, the VDS now supports SNMP v1, v2 and v3, and this also applies to the hosts and to vCenter.
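As a rough example of the host side of this (the community string is a placeholder, and the SNMPv3 users and authentication settings are configured with additional options covered in the vSphere documentation):

  # Enable the embedded SNMP agent on the host and set a read community
  esxcli system snmp set --communities public --enable true

  # Review the agent configuration
  esxcli system snmp get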

  • Local ESXi accounts are automatically given shell and root access.  This removes the need for a shared root account and makes monitoring shell usage simpler.
  • All host activity from both the DCUI and the shell is now logged.  DCUI access was never logged in previous versions.
  • vMotion allows the virtual machine state (memory and CPU) to be migrated simultaneously with a Storage vMotion operation of the machine's VMDK and VMX files.  This is a massive advantage for environments with no shared storage; machines can be migrated between hosts with local storage.
  • Windows 8 and Windows Server 2012 support added.
  • vShield Endpoint is now included in VMware Tools and allows vShield to be used for agentless AV and malware protection from selected AV vendors.
  • VMware View environments can now gain better graphics support if the ESXi hosts are using a supported NVIDIA graphics adapter.  This provides hardware-based vGPU support to the View-hosted VMs.  This feature will be available in the next release of View.
  • Improved CPU virtualization, referred to as virtualized hardware virtualization, allows near-native access to CPU features.
  • More low-level CPU counters are available to the VM's OS to allow for better troubleshooting and debugging from inside the VM.
  • Auto Deploy has two new deployment modes, stateless caching mode and stateful install mode
  • VMware Tools can be upgraded with no downtime.  Once VMware Tools is upgraded to the version that ships with vSphere 5.1, future upgrades do not require a reboot.
  • vSphere Replication is now an optional component available outside of SRM.
  • vSphere Data Protection is another added feature to provide instant recovery of virtual machines

Monday, 10 September 2012

The most important part of my installations!!!

OK, I have done some nice VMware projects recently with both blade systems and rack servers; some have been SRM installs across sites, some have been Metro Clusters (NetApp) and some have been bog-standard installs with a handful of hosts at each site.

I have done the conceptual, low-level and then high-level design elements before creating an installation guide for the consultants installing the systems (when it has not been myself), and also created a validation plan to run after the install to make sure it has fulfilled the fundamental requirements identified in the design phase.

However, one of the most important elements of the design recently has been assisting with migrating workloads.  All of my recent customers have been migrating from an ageing VMware platform to the new system I have been installing.

During the design phase and the gathering of the fundamental requirements, one of the key requirements is to improve performance.  Either the performance of the ageing hardware is poor or a handful of VMs are performing badly.

Often customers will choose to do the migration themselves.  I have seen customers use a number of ways to do this, from block-level storage copies, to unmounting LUNs from the live system and mounting them on the new system, copying VM folders, or V2V with VMware Converter or other third-party tools.  I have even seen some customers use the old sVMotion script for 3.5 migrations.

However the migration is performed, the outcome for the customer is always the same: initially the system will perform OK, but as more VMs are created the performance decreases.  The system is nowhere near the consolidation ratio that was quoted in my capacity planning and the customer becomes unhappy.

So my company will receive a call and we will go and investigate.  What will we find!!!!

  1. The migrated workload is badly sized - VMs with 4 vCPUs and 8GB of RAM
  2. The migrated workload VMs have version 4 or version 7 hardware
  3. The migrated workload has out-of-date VMware Tools
  4. The migrated workload is all sitting on the same datastore
  5. The migrated workload VMs have reservations and limits configured on them
All of the above may seem simple, but they are always the reasons for badly performing V2V virtual machines.  Physical migrations often have additional problems with pre- and post-checks, driver removal etc., but physical migrations are becoming rarer these days.

So one of the most important parts of a vSphere migration, in my view, is sizing the VMs correctly.  Customers often don't appreciate that giving VMs more vCPUs can have a negative effect on their performance.  I may cover the technical detail of why this is the case in a later blog.


Wednesday, 29 August 2012

HP Flex-10 and VMware


Hi All
         I have recently been doing a full HP Flex-10 installation with VMware and NetApp storage.  Now, I have done a few of these in the past and I have always struggled to calculate how the bandwidth should be split between the vNICs presented to the blades.

There are a few posts out there and they all point to gathering customer requirements.  Most customers, when asked this, will just say "What do you recommend?", and this is what I would expect from most of my customers.  I am a consultant and I should advise them correctly on this, as they're completely new to the concept and technology.

During a full project life cycle this is something I am more than comfortable with.  We would perform a complete capacity planning exercise, and from this we could see the existing bandwidth needed for storage and network connectivity, as long as the system was not bottlenecking.  However, most of these HP Flex-10 projects I seem to get handed are half completed: the sale and the design have been done and we are asked to complete the implementation.  This is something that I am not a fan of, but unfortunately it is common in the channel consultancy sector.

So I am sitting in front of the customer deciding how I should carve up the bandwidth.  To be 100% sure is impossible, so I decided to use the reference architecture produced by HP for vSphere 4.0.    Located Here

This details the following breakdown.




This makes a lot of sense in what it suggests:

ESXi management is given a 500Mb share - VMware best practice is to dedicate 2x physical NICs for failover, but VMware do not define a best practice for the bandwidth of the management traffic.  Customers using physical separation of networks will often use a 100Mb switch, which will more than cope with vCenter agent traffic and heartbeat traffic.

vMotion and Fault Tolerance are given a 2.5Gb share - Most of my designs now have these two sitting on the same vSwitch with separate portgroups.  To guarantee that the 1Gb of bandwidth is provided and uninterrupted to both portgroups, each portgroup has an active NIC which is presented to the other portgroup as a standby adapter to provide redundancy.

iSCSI is given a 4Gb share - This largely depends on the back-end storage; what I am trying to say is that providing 4Gb of bandwidth to the blade is pointless if the back-end storage only has 2x 1Gb connections.  The sizing in the reference architecture just so happened to tie in with what was configured on the back end of the storage I was using: each storage processor had 2x 10Gb connections, and the storage array had 2 storage processors.

Virtual machine traffic (multiple networks) is given a 3Gb share - This was the remaining bandwidth, and after reviewing some low-level Perfmon and switch statistics from the existing physical infrastructure, 3Gb was more than enough.
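As a sanity check, the four shares above add up to the full 10Gb available on each Flex-10 port:

  0.5Gb (Management) + 2.5Gb (vMotion/FT) + 4Gb (iSCSI) + 3Gb (VM traffic) = 10Gb per Flex-10 uplink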

Each module has its own defined Ethernet network for redundancy; N1 is the first Flex-10 interconnect and N2 is the second Flex-10 interconnect.

Another important thing is to make sure you utilise the full bandwidth from each Flex-10.  Configuring the above but only using 1x 10Gb uplink will cause a bottleneck on the uplinks.

I will go into the configuration with VMware in another post.  I would also be interested to know if anyone else has any different ways of calculating this?






Friday, 24 August 2012

Back for More......

OK, I have not been on here for a VERY long time, mainly because I was promoted at work last September :-)  I have been promoted to Senior Virtualisation Consultant and an SME (subject matter expert) in VMware vSphere and vCloud Director.

This resulted in me learning a massive amount to make some of the company's processes better and smoother.  One of the responsibilities of my role is to complete all the vSphere designs and oversee all the implementations by engineers and consultants.  This has resulted in me doing a massive amount of work, not just to update the company processes but to actually create some!!!

I am responsible for all the design-based documentation, all the white papers to keep consultants up to date on how to do things, and all the company's standards and best practices, as well as blogging on the company blog about things we see in the field.

On top of the above I am still the lead implementation consultant undertaking all the enterprise and high level implementations.

So I have decided to start back on my blog again, as there are many things I have been implementing over the last few months.  I am also hoping to obtain my VCDX some time in the future; I am currently studying for my VCAP5-DCD and I will upgrade my VCAP4-DCA soon as well.

So my next post will be something more exciting.......