Understanding why “CPU Ready” matters…

One of the biggest advantages of virtualization is the ability to overcommit your host's resources. The chance of a VM using all of its assigned CPU and RAM 24/7 is really small, which is why we place a higher number of VMs on a single host to make better use of its physical resources. But that doesn't mean we can run an unlimited number of virtual machines on a single host and still expect good performance. There are some *invisible* boundaries we should be aware of, and one of them is the ratio of the host's physical CPUs to virtual CPUs, which has a big impact on your running VMs.

VMware tells us that CPU resources are best utilized when usage sits at about the 70-75% level. And that's true. If a VM is running with, for example, 4 virtual CPUs and is using only 10% of its CPU power, you should consider reducing it to a single vCPU and expect a load of roughly 60-70%. At that point your VM's CPU is properly scaled. Yet there's another important thing related to the number of your virtual CPUs…

As you know, every physical VMware host has its own physical CPUs with a number of physical / logical cores. On top of that you deploy a number of virtual machines with virtual CPUs. Now, tell me: did you ever check how many logical cores your physical host has? I bet you can tell me that right away. Another question then: did you ever check how many virtual CPUs are running on that very same physical host? I bet you can't tell me that. But there's one thing I can tell you: it shouldn't be more than 5 times the number of physical cores in your host. So if your physical host has 20 cores, the VMs running on it shouldn't be configured with more than 100 vCPUs in total.

Keeping this ratio no higher than 5 vCPUs per physical core is VMware's recommendation to make sure your VMs are not impacted by CPU co-scheduling or high CPU Ready times. To be honest, I try to keep this ratio at 2 or 3 at most.

First of all, here’s a little bit of help so you can actually measure your cluster’s pCPU / vCPU ratio:

As output, you should see something like this:

The result above was measured on my test cluster, and as you can see at least two hosts are really overcommitted with virtual CPUs while the other four are in good shape. How does this affect my host and its virtual machines? Please take a closer look at the performance chart taken from one of the VMs:

[Figure: cpu-ready-chart, CPU Ready summation for the VM]

As we can see, the summation of Ready time for this VM is ~200 ms. On its own that number doesn't tell us much, so let's convert it to a percentage for better understanding. To calculate it, use the following formula:
(CPU summation / (<chart default update interval in seconds> * 1000)) * 100 = CPU ready %

In this case, it would be as follows:

(201ms / (20s * 1000)) * 100 = 1% CPU Ready

That's a really good result. It means that for 1% of the time a vCPU was ready to be scheduled on a physical processor but couldn't be due to contention (in human words: the VM was doing nothing but waiting for a physical CPU for 1% of the time). This value definitely shouldn't be higher than 10%, and you should monitor anything above 5%.
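If you prefer to pull this straight from PowerCLI instead of reading the chart, here is a hedged sketch (the VM name "MyVM" is a placeholder; realtime samples use a 20-second interval, which is where the 20 * 1000 in the formula comes from):

    # Realtime CPU Ready summation for one VM (placeholder name "MyVM"),
    # converted to a percentage using the 20-second realtime interval
    $vm = Get-VM "MyVM"
    Get-Stat -Entity $vm -Stat "cpu.ready.summation" -Realtime -MaxSamples 15 |
        Where-Object { $_.Instance -eq "" } |   # "" = aggregate across all vCPUs
        Select-Object Timestamp,
            @{ N = "ReadyMs";  E = { $_.Value } },
            @{ N = "ReadyPct"; E = { [math]::Round(($_.Value / (20 * 1000)) * 100, 2) } }
    # Note: the aggregate instance sums all vCPUs, so for multi-vCPU VMs you may want
    # to divide the result by the vCPU count to get a per-vCPU figure.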

Another method of checking CPU Ready values is esxtop, available from the ESXi CLI. Connect to your ESXi host using PuTTY and run esxtop, then press the "c" key to switch to the CPU view; pressing "V" (capital) limits the display to virtual machine worlds only.

[Figure: esxtop-cpu-ready, esxtop CPU view with the %RDY and %CSTP columns highlighted]

You should be really interested in the highlighted counters:

  • %RDY (Ready): percentage of time a vCPU was ready to be scheduled on a physical processor but couldn't be due to contention. This value shouldn't be higher than 10%, and you should monitor anything above 5%.
  • %CSTP (Co-Stop): percentage of time a vCPU was stopped, waiting for access to a physical CPU. High numbers here represent problems; you don't want this above 5%.

OK, so you've found out that your VMs are slow on CPU access and CPU Ready shows high values. What's next? I'd recommend checking the pCPU / vCPU ratio (you can use the PowerCLI script from the beginning of this post) to see which host is the most over-subscribed, and then making a list of your VMs with their vCPUs using the following one-liner:
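A minimal PowerCLI sketch of such a one-liner (the cluster name "Cluster" is a placeholder):

    # Powered-on VMs in cluster "Cluster", sorted by vCPU count (highest first)
    Get-Cluster "Cluster" | Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" } |
        Sort-Object NumCpu -Descending | Select-Object Name, NumCpu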

Of course, you can point it at a single ESXi host instead of the whole cluster to get VMs from a specific server only. Having that list, check your VMs with the highest number of vCPUs. Generate CPU usage charts for each one to evaluate whether they really need all of their currently assigned vCPUs, or whether they would do just as well with half of those resources. Lowering the number of vCPUs in your VMs results in fewer scheduling jobs, lower CPU Ready times and higher CPU utilization (which is also good, as long as it stays below ~80%).
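If you'd rather pull the numbers than eyeball charts, here is a hedged PowerCLI sketch (the 30-day window, the top-10 selection and the cpu.usage.average counter are my choices, not a hard rule, and historical data depends on your vCenter statistics level):

    # Average and peak CPU usage (%) over the last 30 days for the biggest VMs
    Get-Cluster "Cluster" | Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" } |
        Sort-Object NumCpu -Descending | Select-Object -First 10 | ForEach-Object {
            $stats = Get-Stat -Entity $_ -Stat "cpu.usage.average" -Start (Get-Date).AddDays(-30)
            [PSCustomObject]@{
                VM     = $_.Name
                vCPU   = $_.NumCpu
                AvgPct = [math]::Round(($stats | Measure-Object -Property Value -Average).Average, 1)
                MaxPct = [math]::Round(($stats | Measure-Object -Property Value -Maximum).Maximum, 1)
            }
        }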

Paradoxically, if you're looking for better performance you should minimize the number of vCPUs in your VMs. That doesn't mean all of them should have only 1 vCPU; just make sure all of your VMs are scaled as needed. If a VM with 4 vCPUs has a maximum load of 25% CPU, it is effectively using one full vCPU (and you might want to give it 2 vCPUs at most). Remember that a smaller vCPU to pCPU ratio on the VMware host results in less CPU co-scheduling (higher performance and shorter response times for your VMs).

I hope this article has been informative for you. If you have any questions, don't hesitate to leave a comment!

2 thoughts on “Understanding why “CPU Ready” matters…”

  1. Thanks Lukasz for making this CPU Ready thing easy to understand. I have one hypothetical question that was asked of me a couple of days ago and I'm still looking for the answer. Here is the scenario: a database team's VM is underperforming with 4 vCPU and 8 GB RAM. They asked the VMware admins to raise it to 8 vCPU and 12 GB RAM, but the performance issue is still there. What are the things I should be checking in this kind of performance issue?

    1. Thank you for your comment Shashi! As for your question, I'd look for a bottleneck in your infrastructure. An underperforming SQL VM may be a side effect of high load on almost any virtualization component, as SQL is quite sensitive to any delays / latencies. Start by gathering performance counters from SQL, the guest OS and the ESXi host, and narrow down the problem source. Analysis of those counters may point you to a problem on the storage or network layer, on the SQL server itself, or maybe an ESXi host misconfiguration 🙂 The key thing here is a proper analysis of all the counters.
