NUMA in AHV
NUMA on AHV/QEMU/KVM
The examples here are based on a two-socket physical server.
- What does the guest see?
  - The guest sees whatever is specified by `num_vnuma_nodes=X` in acli, which manipulates the `<numa> ... </numa>` element in the VM XML.
- How does the guest map to the physical host?
  - If `num_vnuma_nodes=0`, the UVM can run on any core on any NUMA node. Memory is allocated statistically on whichever NUMA node has the most free cores at startup.
  - If `num_vnuma_nodes=1`, the UVM can run on any core on a single NUMA node that is chosen at power-on. The chosen NUMA node is random and not affected by load.
  - If `num_vnuma_nodes=2`, the UVM can run on any core on either NUMA node. The vCPUs are spread equally across the NUMA nodes: the first half of the N vCPUs on Node 0, the second half on Node 1.
  - If `num_vnuma_nodes>2`, the UVM will run on any core on one particular NUMA node (maybe always 0?), but the UVM will see the configured topology.
- BEST way
  - For keeping a UVM on a single NUMA node: just use `extra_flags=numa_pinning=<X>` to keep the UVM away from the CVM. Do NOT use `num_vnuma_nodes=1` (which should work), because the placement becomes a random choice of a single physical NUMA node, which is not what you wanted.
  - For spreading a VM equally across NUMA nodes: do NOT set `extra_flags`; instead use `num_vnuma_nodes=2`.
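The rules above can be written down as a small lookup, which is handy when reasoning about what a given setting should do. This is only a sketch of the behaviour observed on this two-socket host; `expected_placement` and its return fields are hypothetical names for illustration, not part of acli or any AHV API.

```python
def expected_placement(num_vnuma_nodes=0, numa_pinning=None):
    """Summarise the observed behaviour on a two-socket host.

    Hypothetical helper: it encodes the observations in this post only.
    Returns how many NUMA nodes the guest sees and where the VM runs.
    """
    if numa_pinning is not None:
        # extra_flags=numa_pinning=<X>: fixed physical node, guest sees one node
        return {"guest_nodes": 1, "physical": f"node {numa_pinning} (fixed)"}
    if num_vnuma_nodes == 0:
        return {"guest_nodes": 1, "physical": "any core on any node"}
    if num_vnuma_nodes == 1:
        return {"guest_nodes": 1, "physical": "one node, chosen at random at power-on"}
    if num_vnuma_nodes == 2:
        return {"guest_nodes": 2, "physical": "vCPUs split equally across both nodes"}
    # More vNUMA nodes than physical nodes: guest sees the configured
    # topology, but everything was observed to land on one physical node.
    return {"guest_nodes": num_vnuma_nodes, "physical": "one node (node 0 observed)"}


for setting in (0, 1, 2, 4):
    print(setting, expected_placement(num_vnuma_nodes=setting))
print("pinned", expected_placement(numa_pinning=0))
```

The pinning case is checked first because, per the recommendation above, it is the only option that gives a predictable physical node.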
First a look at the physical topology
This host has two sockets with 20 physical cores on each socket. Hyperthreading is enabled, so the physical host shows 40 pCPUs to the OS on each socket. We have two sockets, for a total of 80 “CPUs” that the OS can schedule onto. The output of lscpu confirms the total, and numactl --hardware shows the layout in a simple table.
[root@N-0068-1-A ~]# lscpu | head
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
For this host,
- NUMA Node 0 is pCPUs 0-19 and 40-59
- NUMA Node 1 is pCPUs 20-39 and 60-79
[root@N-0068-1-A ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 0 size: 191595 MB
node 0 free: 713 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 1 size: 193489 MB
node 1 free: 335 MB
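The CPU lists in this output are what later show up as `cpuset` ranges in the VM XML. As a rough sketch, the following Python turns the `node N cpus:` lines into compact ranges. The function names `parse_numactl_cpus` and `to_ranges` are made up for this example, and the sample text is copied from the output above.

```python
import re

SAMPLE = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 0 size: 191595 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 1 size: 193489 MB
"""


def parse_numactl_cpus(text):
    """Return {node_id: [cpu, ...]} from `numactl --hardware` output."""
    nodes = {}
    for line in text.splitlines():
        match = re.match(r"node (\d+) cpus:\s*(.*)", line.strip())
        if match:
            nodes[int(match.group(1))] = [int(c) for c in match.group(2).split()]
    return nodes


def to_ranges(cpus):
    """Compress a list of CPU ids into 'a-b,c-d' form."""
    cpus = sorted(cpus)
    if not cpus:
        return ""
    spans = []
    start = prev = cpus[0]
    for cpu in cpus[1:]:
        if cpu != prev + 1:  # gap found: close the current span
            spans.append((start, prev))
            start = cpu
        prev = cpu
    spans.append((start, prev))
    return ",".join(f"{a}-{b}" if a != b else str(a) for a, b in spans)


nodes = parse_numactl_cpus(SAMPLE)
for node_id, cpus in nodes.items():
    print(f"NUMA node {node_id}: {to_ranges(cpus)}")
```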
What is the current NUMA disposition of a given VM.
The meaningful way to determine the NUMA disposition of a given VM is to check the VM's XML. Regardless of what is configured in any other tool, it is the XML that QEMU will use to determine how the VM gets scheduled on the physical cores of the host. For a VM with 32 vCPU…
a VM with NO NUMA specification
The VM XML
No cpuset key means don't do any NUMA stuff.
<vcpu placement='static' current='32'>240</vcpu>
The acli view
nutanix@NTNX-18SM3E330258-A-CVM:10.56.4.216:~$ acli vm.get ubuntu_no_pin_32G | grep -A 5 numa
num_vnuma_nodes: 0 <---- This just means disabled
The UVMs view of itself
The vm itself sees a single array of cores
ubuntu@ubuntu:~$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
a VM with NUMA specification (NUMA Node 0)
vm.update 5e2dc5bb-77b9-4c4b-878f-7b48040facc4 name=ubuntu extra_flags=numa_pinning=0
Note the cpuset key here: it says create a 32 vCPU virtual machine, but only schedule the threads on physical CPUs 0-19 and 40-59. On the physical box, that equates to the 20 cores on socket 0 (both siblings).
The VM XML (NUMA Node 0)
<vcpu placement='static' cpuset='0-19,40-59' current='32'>240</vcpu>
a VM with NUMA specification (NUMA Node 1)
vm.update 5e2dc5bb-77b9-4c4b-878f-7b48040facc4 name=ubuntu extra_flags=numa_pinning=1
Note the cpuset key here: it says create a 32 vCPU virtual machine, but only schedule the threads on physical CPUs 20-39 and 60-79. On the physical box, that equates to the 20 cores on socket 1 (both siblings).
The VM XML (NUMA Node 1)
<vcpu placement='static' cpuset='20-39,60-79' current='32'>240</vcpu>
As we saw on the physical host pCPUs 0-19 and 40-59 are both siblings on the first socket (NUMA Node 0)
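To check which physical NUMA node a `cpuset` string lands on, the ranges can be expanded and compared against the host topology. The sketch below hard-codes this host's two nodes. `parse_cpuset` and `nodes_for_cpuset` are illustrative names, and the parser only handles the plain `a-b,c` form seen here (libvirt also allows `^` exclusions, which are not handled).

```python
def parse_cpuset(spec):
    """Expand a cpuset string such as '0-19,40-59' into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


# Topology of the example host, taken from `numactl --hardware` above.
HOST_NODES = {
    0: parse_cpuset("0-19,40-59"),
    1: parse_cpuset("20-39,60-79"),
}


def nodes_for_cpuset(spec, host_nodes=HOST_NODES):
    """Return the sorted list of physical NUMA nodes the cpuset touches."""
    cpus = parse_cpuset(spec)
    return sorted(node for node, node_cpus in host_nodes.items() if cpus & node_cpus)


print(nodes_for_cpuset("0-19,40-59"))
print(nodes_for_cpuset("20-39,60-79"))
```

A cpuset that returns more than one node here would mean the VM's threads can cross sockets.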
The acli view
nutanix@NTNX-18SM3E330258-A-CVM:10.56.4.216:~$ acli vm.get ubuntu-busy-node0-32vcpu-4g_ram | grep -A 5 numa
key: "numa_pinning"
value: "0" <----- Shows we are pinned to NUMA node 0
}
flash_mode: False
generation_uuid: "bb7450c4-e6f2-485b-94b0-40ecb5e818d1"
gpu_console: False
--
num_vnuma_nodes: 0 <----- just means disabled
The UVMs view of itself
The vm itself sees a single array of cores
ubuntu@cpu-burner-numa-node-0:~$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
What happens in the case of a VM with no NUMA topology?
A VM that does not have any topology set is free to run on whatever core is available. However, a big problem is that when the VM is instantiated, its memory is allocated on whichever NUMA node belongs to the socket its threads are running on at that moment.
This can be a problem. For instance, if a VM is created when the machine is idle, let's say its memory ends up equally spread across NUMA Node 0 and NUMA Node 1. If Socket 1 subsequently becomes very busy with a different high-priority VM, the VM we are interested in will run on Node 0, which is correct from a CPU scheduling perspective but not from a memory access perspective (because now half of its memory accesses are remote).
It is possible to move pages around - but that is an admin task.
Can we prove this?
Yes. If we make NUMA Node 0 very busy, then the memory for our VM will be allocated primarily on NUMA Node 1. If we make NUMA Node 1 busy the VM will be allocated primarily on NUMA Node 0.
Start a VM with no pinning when Node 0 is busy
We expect the threads to be scheduled on Node1 so memory allocation is on Node1. This is exactly what we see.
run cpuburner on Node-0
We see node 0 is busy (from our CPU burner)
08:48:38 PM NODE %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
08:48:39 PM all 0.11 0.00 0.86 0.00 0.22 0.02 0.00 42.04 0.00 56.60
08:48:39 PM 0 0.07 0.00 0.20 0.00 0.35 0.05 0.00 80.30 0.00 18.88
08:48:39 PM 1 0.15 0.00 1.52 0.00 0.10 0.00 0.00 3.72 0.00 94.38
So our VM of interest runs on Node1 (which is where the free cycles are) and so memory is also allocated on Memory Node1 (32810 is our 32G UVM)
[root@N-0068-1-A ~]# numastat -p qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
----------------- --------------- --------------- ---------------
14299 (qemu-kvm) 1.89 49200.46 49202.35
102354 (qemu-kvm) 4119.77 19.38 4139.14
222919 (qemu-kvm) 2.93 32810.68 32813.60
----------------- --------------- --------------- ---------------
Total 4124.58 82030.51 86155.09
PID 222919 is ubuntu_no_pin_32G
If we run the CPU burner at a lower rate, we expect to see some threads run on node-1 and some threads run on node-0 - and so we would also expect to see some memory allocated on Node 1 and some memory allocated on Node 0
Some cycles free on each NUMA node
10:11:47 PM NODE %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:11:48 PM all 0.01 0.00 0.11 0.00 0.05 0.05 0.00 10.43 0.00 89.29
10:11:48 PM 0 0.02 0.00 0.00 0.00 0.07 0.05 0.00 19.96 0.00 79.89
10:11:48 PM 1 0.00 0.00 0.22 0.00 0.03 0.05 0.00 0.90 0.00 98.70
We see that memory is split across nodes
[root@N-0068-1-A ~]# numastat -p qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
----------------- --------------- --------------- ---------------
14299 (qemu-kvm) 1.89 49200.46 49202.35
102354 (qemu-kvm) 4119.78 19.38 4139.16
428260 (qemu-kvm) 28948.17 3865.73 32813.90
----------------- --------------- --------------- ---------------
Total 33069.84 53085.57 86155.41
PID 428260 is ubuntu_no_pin_32G
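To compare the two runs without eyeballing the tables, the per-node columns can be turned into percentages. This is a minimal sketch that parses the `numastat -p` rows shown above. `parse_numastat` and `node_share` are names invented for the example, and the sample holds only the two rows for our VM of interest.

```python
import re

SAMPLE = """\
PID Node 0 Node 1 Total
----------------- --------------- --------------- ---------------
222919 (qemu-kvm) 2.93 32810.68 32813.60
428260 (qemu-kvm) 28948.17 3865.73 32813.90
"""


def parse_numastat(text):
    """Return {pid: [MB on node 0, MB on node 1, ...]} from `numastat -p` rows."""
    rows = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+) \(\S+\)\s+(.*)", line)
        if match:
            values = [float(v) for v in match.group(2).split()]
            rows[int(match.group(1))] = values[:-1]  # drop the Total column
    return rows


def node_share(per_node_mb):
    """Percentage of the process memory that lives on each node."""
    total = sum(per_node_mb)
    return [round(100 * mb / total, 1) for mb in per_node_mb]


rows = parse_numastat(SAMPLE)
for pid, per_node in rows.items():
    print(pid, node_share(per_node))
```

With Node 0 saturated, almost all of the VM's memory sits on Node 1; with Node 0 only lightly loaded, the bulk of it lands on Node 0 and the remainder on Node 1.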
What happens in the case of a VM which is pinned
In AHV we achieve this with something like
<acropolis> vm.update ubuntu-busy-node0-32vcpu-4g_ram extra_flags=numa_pinning=0
Change num_vnuma_nodes from 0 to 1
We are not “pinning” the VM - but we are going to tell it to run on a single NUMA node.
<acropolis> vm.update 6fedab0e-0be3-4dcc-b22d-9e346e06d955 name=ubuntu_no_pin_32G-1_vnuma_node num_vnuma_nodes=1
XML
In AHV we directly manipulate the vcpupin and not the placement. In this case we are saying “run this VM on a single NUMA node, but it can be either Node 0 or Node 1”. vcpupin is in the <cputune> section, which maps virtual to physical CPUs; see also <numatune>.
Guest
Guest knows only a single array of CPU
ubuntu@ubuntu:~$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
In this case our VM will only run on pCPUs 20-39 and 60-79, which we know from the host is NUMA node 1
[root@N-0068-1-A ~]# egrep 'placement|vcpupin' /var/run/libvirt/qemu/6fedab0e-0be3-4dcc-b22d-9e346e06d955.xml
<vcpu placement='static' current='32'>240</vcpu>
<vcpupin vcpu='0' cpuset='20-39,60-79'/>
<vcpupin vcpu='1' cpuset='20-39,60-79'/>
...
<vcpupin vcpu='31' cpuset='20-39,60-79'/>
If we shut down and restart the UVM
[root@N-0068-1-A ~]# egrep 'placement|vcpupin' /var/run/libvirt/qemu/6fedab0e-0be3-4dcc-b22d-9e346e06d955.xml
<vcpu placement='static' current='32'>240</vcpu>
<vcpupin vcpu='0' cpuset='0-19,40-59'/>
<vcpupin vcpu='1' cpuset='0-19,40-59'/>
...
<vcpupin vcpu='31' cpuset='0-19,40-59'/>
The VM is now pinned to a single range of CPUs on a single NUMA node - but this time on NUMA node 0
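Rather than grepping, the `vcpupin` entries can be grouped by `cpuset` to see at a glance how many distinct pin sets a VM has. The following is a sketch using Python's standard `xml.etree.ElementTree`. The XML fragment is abbreviated from the first output above, and the `<domain>` wrapper and the `vcpus_by_cpuset` name are assumptions for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated fragment; the real file has one <vcpupin> per vCPU.
SAMPLE = """
<domain>
  <vcpu placement='static' current='32'>240</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='20-39,60-79'/>
    <vcpupin vcpu='1' cpuset='20-39,60-79'/>
    <vcpupin vcpu='31' cpuset='20-39,60-79'/>
  </cputune>
</domain>
"""


def vcpus_by_cpuset(xml_text):
    """Group vCPU ids by the physical cpuset they are pinned to."""
    root = ET.fromstring(xml_text.strip())
    groups = defaultdict(list)
    for pin in root.iter("vcpupin"):
        groups[pin.get("cpuset")].append(int(pin.get("vcpu")))
    return dict(groups)


for cpuset, vcpus in vcpus_by_cpuset(SAMPLE).items():
    print(f"{len(vcpus)} vCPUs pinned to {cpuset}")
```

A single key in the result means every vCPU is confined to one NUMA node, which is the `num_vnuma_nodes=1` case.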
In terms of memory, all of it is placed on whichever NUMA node the VM ends up pinned to. Load does not seem to matter.
[root@N-0068-1-A ~]# numastat -p qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
----------------- --------------- --------------- ---------------
14299 (qemu-kvm) 1.89 49200.46 49202.35
102354 (qemu-kvm) 4119.79 19.38 4139.17
483962 (qemu-kvm) 20.84 32792.71 32813.56
----------------- --------------- --------------- ---------------
Total 4142.53 82012.55 86155.07
Change num_vnuma_nodes from 1-2
<acropolis> vm.update 6fedab0e-0be3-4dcc-b22d-9e346e06d955 name=ubuntu_no_pin_32G-2_vnuma_nodes num_vnuma_nodes=2
We see that memory is balanced across NUMA nodes
```
[root@N-0068-1-A ~]# numastat -p qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
----------------- --------------- --------------- ---------------
14299 (qemu-kvm) 1.89 49200.46 49202.35
102354 (qemu-kvm) 4119.79 19.38 4139.17
462472 (qemu-kvm) 16404.53 16409.05 32813.58
----------------- --------------- --------------- ---------------
Total 20526.21 65628.89 86155.10
```
The UVM itself sees two NUMA nodes
```
ubuntu@ubuntu:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 15986 MB
node 0 free: 15497 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 16087 MB
node 1 free: 15765 MB
```
The XML looks like this
```
<vcpu placement='static' current='32'>240</vcpu>
<vcpupin vcpu='0' cpuset='0-19,40-59'/>
...
<vcpupin vcpu='15' cpuset='0-19,40-59'/>
<vcpupin vcpu='16' cpuset='20-39,60-79'/>
...
<vcpupin vcpu='31' cpuset='20-39,60-79'/>
```
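The split seen in this XML (vCPUs 0-15 on one node, 16-31 on the other) is a plain equal division of the vCPU ids. As a small sketch, assuming the vCPU count divides evenly by the number of nodes (`split_vcpus` is a hypothetical helper, not something AHV exposes):

```python
def split_vcpus(n_vcpus, n_nodes):
    """Split vCPU ids 0..n_vcpus-1 into n_nodes equal, contiguous ranges.

    Assumes an even split, as in the 32 vCPU / 2 node example.
    """
    per_node, remainder = divmod(n_vcpus, n_nodes)
    if remainder:
        raise ValueError("vCPU count does not divide evenly across nodes")
    return [range(i * per_node, (i + 1) * per_node) for i in range(n_nodes)]


for node, vcpus in enumerate(split_vcpus(32, 2)):
    print(f"vNUMA node {node}: vCPUs {vcpus[0]}-{vcpus[-1]}")
```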
The NUMA xml looks like this
```
<numatune>
<memnode cellid='0' mode='strict' nodeset='0'/> <---- How virtual NUMA nodes map to physical NUMA nodes
<memnode cellid='1' mode='strict' nodeset='1'/>
</numatune>
```
and
```
<numa>
<cell id='0' cpus='0-15,32-239' memory='16777216' unit='KiB' memAccess='shared'/> <---- What we tell the guest.
<cell id='1' cpus='16-31' memory='16777216' unit='KiB' memAccess='shared'/>
</numa>
```
### Change num_vnuma_nodes from 2-4 (we only have 2 NUMA nodes on the host).
The memory is all on NODE0!
```
[root@N-0068-1-A ~]# numastat -p qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
----------------- --------------- --------------- ---------------
14299 (qemu-kvm) 1.89 49200.46 49202.35
102354 (qemu-kvm) 4119.79 19.38 4139.17
471782 (qemu-kvm) 32794.32 19.40 32813.72
----------------- --------------- --------------- ---------------
Total 36916.00 49239.23 86155.23
```
The XML shows all vCPU bound to a single NUMA domain...
```
[root@N-0068-1-A ~]# egrep 'placement|vcpupin' /var/run/libvirt/qemu/6fedab0e-0be3-4dcc-b22d-9e346e06d955.xml
<vcpu placement='static' current='32'>240</vcpu>
<vcpupin vcpu='0' cpuset='0-19,40-59'/>
...
<vcpupin vcpu='31' cpuset='0-19,40-59'/>
```
But the UVM thinks it has 4 NUMA nodes
```
ubuntu@ubuntu:~$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 7984 MB
node 0 free: 7719 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 8049 MB
node 1 free: 7809 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 8001 MB
node 2 free: 7796 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 8037 MB
node 3 free: 7830 MB
```
### XML for 4 NUMA node UVM
```
<numatune>
<memnode cellid='0' mode='strict' nodeset='0'/> <---- How virtual NUMA maps to physical NUMA - NOTE nodeset=0 which is why everything is actually on physical Node0
<memnode cellid='1' mode='strict' nodeset='0'/>
<memnode cellid='2' mode='strict' nodeset='0'/>
<memnode cellid='3' mode='strict' nodeset='0'/>
</numatune>
```
and
```
<numa>
<cell id='0' cpus='0-7,32-239' memory='8388608' unit='KiB' memAccess='shared'/> <---- What we tell the guest
<cell id='1' cpus='8-15' memory='8388608' unit='KiB' memAccess='shared'/>
<cell id='2' cpus='16-23' memory='8388608' unit='KiB' memAccess='shared'/>
<cell id='3' cpus='24-31' memory='8388608' unit='KiB' memAccess='shared'/>
</numa>
```
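Putting the two fragments side by side makes the mismatch obvious: the guest is told about four cells, but every cell's memory is bound to physical node 0. Here is a sketch that joins `<numatune>` and `<numa>` by cell id using the standard library. The `<domain>` and `<cpu>` wrappers are assumed for the example, and `vnuma_map` is an illustrative name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<domain>
  <numatune>
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memnode cellid='1' mode='strict' nodeset='0'/>
    <memnode cellid='2' mode='strict' nodeset='0'/>
    <memnode cellid='3' mode='strict' nodeset='0'/>
  </numatune>
  <cpu>
    <numa>
      <cell id='0' cpus='0-7,32-239' memory='8388608' unit='KiB' memAccess='shared'/>
      <cell id='1' cpus='8-15' memory='8388608' unit='KiB' memAccess='shared'/>
      <cell id='2' cpus='16-23' memory='8388608' unit='KiB' memAccess='shared'/>
      <cell id='3' cpus='24-31' memory='8388608' unit='KiB' memAccess='shared'/>
    </numa>
  </cpu>
</domain>
"""


def vnuma_map(xml_text):
    """Join guest NUMA cells with the physical nodeset backing their memory."""
    root = ET.fromstring(xml_text.strip())
    host_nodeset = {m.get("cellid"): m.get("nodeset") for m in root.iter("memnode")}
    return [
        {
            "guest_cell": int(cell.get("id")),
            "guest_cpus": cell.get("cpus"),
            "memory_gib": int(cell.get("memory")) / 1024**2,  # KiB -> GiB
            "host_nodeset": host_nodeset.get(cell.get("id")),
        }
        for cell in root.iter("cell")
    ]


mapping = vnuma_map(SAMPLE)
for row in mapping:
    print(row)
print("physical nodes in use:", sorted({row["host_nodeset"] for row in mapping}))
```

If the set of physical nodes in use is smaller than the number of guest cells, the guest's NUMA view does not reflect where its memory really lives.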