Zones start looking like Containers: CPU and Memory Caps

Posted on March 26, 2007

When Solaris Containers debuted with Solaris 10, many of us were blown away. At the time Zones became my primary passion because I was suffering from a massive shortage of test systems for everything from deployment testing to Enlightenment build-and-test install systems. I needed lots and lots of systems, but they didn't need lots of CPU or memory; they just needed isolated Solaris installations, and Zones filled that hole perfectly. But "Containers", the term signifying the union of Solaris Zones with Solaris Resource Controls, wasn't quite what I was hoping for. In fact, from my point of view it was a complete joke. Solaris resource controls are applied to workloads: a process, a task, or a project (where that project is typically a user or a group). With the introduction of Zones we finally had the kind of workgroup granularity that I really wanted, but I didn't have the ability to slap the controls on.

Sun has worked very hard since that time to make things right, and it's been a slow but constant effort that's seriously paying off now. With Nevada Build 56 we got Duckhorn, which integrated memory and swap capping into the zone configuration. Now, with Nevada Build 61, we get its big brother: CPU capping. We can pair those up with the rest of the rctls we can apply to a zone, such as zone.max-lwps, zone.cpu-shares (FSS shares), and the shared memory limits. Finally we've got something seriously slick.

Configuring these limits is smooth as silk. No files to muck with, no /etc/project, no BS… just create or modify your zone and go:

root@aeon ~$ zonecfg -z testing1
testing1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:testing1> create
zonecfg:testing1> set zonepath=/zones/testing1
zonecfg:testing1> set autoboot=true
zonecfg:testing1> set scheduling-class=FSS

zonecfg:testing1> add capped-cpu
zonecfg:testing1:capped-cpu> set ncpus=1.0
zonecfg:testing1:capped-cpu> end

zonecfg:testing1> add capped-memory
zonecfg:testing1:capped-memory> set physical=512m
zonecfg:testing1:capped-memory> set swap=512m
zonecfg:testing1:capped-memory> end

zonecfg:testing1> add rctl
zonecfg:testing1:rctl> set name=zone.max-lwps
zonecfg:testing1:rctl> add value (priv=privileged,limit=500,action=deny)
zonecfg:testing1:rctl> end

zonecfg:testing1> add rctl
zonecfg:testing1:rctl> set name=zone.cpu-shares
zonecfg:testing1:rctl> add value (priv=privileged,limit=10,action=none)
zonecfg:testing1:rctl> end

Once your zone is configured, do the usual dance to install and boot (sketched just below), then throw down the hurt and enjoy. Notice above that I set the CPU cap to ncpus=1.0. I'm testing on my home workstation, which is a dual-core Athlon 64, so setting ncpus=1.0 says that I'm allocating one full CPU to the zone. If you had a 4-CPU system and wanted to give 2.5 of those CPUs to a zone, you'd set ncpus=2.5, so you can get really creative in how you carve up resources. And what's really slick is that the cap is spread across all the processors in a given pset, which by default is all of your processors. So on my two cores a cap of 1.0 can still use both cores for threaded applications, which is nice.
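For the record, that dance goes something like this; a sketch assuming the testing1 zone from above (zlogin -C attaches you to the console so you can answer the sysid questions on first boot):

zonecfg:testing1> verify
zonecfg:testing1> commit
zonecfg:testing1> exit
root@aeon ~$ zoneadm -z testing1 install
root@aeon ~$ zoneadm -z testing1 boot
root@aeon ~$ zlogin -C testing1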

Here is an example of some pain. The hurt comes from a few hundred copies of a little cpuhog.pl script; the original isn't shown in this post, but a ksh stand-in with the same effect might look like this:
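#!/bin/ksh
# A trivial CPU burner: do a little busywork, finish, and get
# right back in the run queue. (Hypothetical stand-in -- the
# actual cpuhog.pl isn't shown in this post.)
integer i=0
while true; do
        (( i = (i + 1) % 1000000 ))
done

Kick off a few hundred copies inside the zone, along these lines:

root@aeon ~$ zlogin testing1
# integer n=0
# while (( n < 400 )); do /var/tmp/cpuhog.ksh & (( n += 1 )); done

With those running, prstat -Z shows the damage: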

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
115910 benr     5936K 4040K cpu0    59    0   0:01:13 0.7% prstat/1
101179 benr      162M  107M sleep   49    0   0:03:55 0.5% firefox-bin/9
100474 benr      210M  164M sleep   59    0   0:02:58 0.4% Xorg/1
116150 root     3304K 1184K wait    28    0   0:00:05 0.3% cpuhog.pl/1
109850 daemon   7644K 2756K sleep   59    0   0:00:02 0.2% rcapd/1
116211 root     3304K 1184K wait     1    0   0:00:05 0.2% cpuhog.pl/1
116138 root     3304K 1184K wait     2    0   0:00:05 0.2% cpuhog.pl/1
116335 root     3304K 1184K wait    12    0   0:00:05 0.2% cpuhog.pl/1
116028 root     3304K 1184K wait     2    0   0:00:05 0.2% cpuhog.pl/1
ZONEID    NPROC  SWAP   RSS MEMORY      TIME  CPU ZONE                        
    10      410  217M  134M   6.6%   0:31:53  46% testing1                    
     0      125  638M  589M    29%   0:10:42 2.2% global                      

Total: 535 processes, 738 lwps, load averages: 378.18, 377.42, 343.45

The load average is high because I've got 400 of those cpuhog scripts going, each of which runs and then gets right back in the run queue to get back on CPU. Even with a load that high I'm not feeling the effects at all in the global zone where, incidentally, I'm currently typing this blog entry (see firefox-bin; that's me writing this entry). Now check out mpstat for a closer look:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0   17   384  275  132   32   34    3    0   135   51   1   0  48
  1    0   0   11   190   63  203   54   35    2    0    83   48   0   0  52
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0   20   394  291  152   46   23    1    0    81   56   1   0  43
  1    0   0    2   185   67  186   46   28    5    0   116   45   1   0  54
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0   15   406  290  146   50   34    0    0   730   48  15   0  38
  1    0   0   11   197   64  234   68   31    9    0   628   60   1   0  39

Of course, we can alter the zone's CPU allocation on the fly, without a zone reboot. Note that the underlying zone.cpu-cap rctl is expressed as a percentage of a single CPU, so the ncpus=1.0 we configured shows up as a value of 100, and bumping it to 150 grants the zone one and a half CPUs:

root@aeon ~$ prctl -n zone.cpu-cap -i zone testing1
zone: 10: testing1
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
zone.cpu-cap
        privileged        100       -   deny                                 -
        system          4.29G     inf   deny                                 -
root@aeon ~$ prctl -r -t privileged -n zone.cpu-cap -v 150 -i zone testing1
root@aeon ~$ prctl -n zone.cpu-cap -i zone testing1
zone: 10: testing1
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
zone.cpu-cap
        privileged        150       -   deny                                 -
        system          4.29G     inf   deny                                 -

Look at the zone's CPU usage now; with the cap raised to 150, testing1 climbs from 46% to 68% of the box:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
115910 benr     5984K 4092K cpu1    59    0   0:01:57 0.8% prstat/1
101179 benr      162M  107M sleep   49    0   0:04:49 0.7% firefox-bin/9
116067 root     3304K 1184K wait    16    0   0:00:10 0.5% cpuhog.pl/1
116323 root     3304K 1184K wait    37    0   0:00:09 0.4% cpuhog.pl/1
116073 root     3304K 1184K wait     1    0   0:00:10 0.4% cpuhog.pl/1
100474 benr      217M  165M sleep   59    0   0:03:20 0.4% Xorg/1
116084 root     3304K 1184K wait    43    0   0:00:10 0.4% cpuhog.pl/1
116118 root     3304K 1184K wait    35    0   0:00:10 0.3% cpuhog.pl/1
116050 root     3304K 1184K wait    56    0   0:00:10 0.3% cpuhog.pl/1
ZONEID    NPROC  SWAP   RSS MEMORY      TIME  CPU ZONE                        
    10      410  217M  136M   6.6%   1:01:26  68% testing1                    
     0      125  600M  553M    27%   0:12:44 2.1% global                      

Total: 535 processes, 736 lwps, load averages: 380.23, 379.94, 372.62

Solaris Containers are really looking like what they promised to be: a fine pairing of Solaris Resource Controls and Solaris Zones. It's awesome to watch that little zone thrash its brains out while I'm watching movies on the same system.

While CPU and memory are top of mind, network inevitably becomes a concern. With Nevada 61 we can dedicate full physical NICs to a zone using IP Instances (PDF), part of Project Crossbow. VNICs should be upon us soon, which will close the loop and put zones in the place they should be.
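To hand a zone its own NIC with IP Instances, you mark the zone's IP type as exclusive and give it a physical interface. A sketch, where e1000g1 stands in for whatever spare NIC you happen to have:

zonecfg:testing1> set ip-type=exclusive
zonecfg:testing1> add net
zonecfg:testing1:net> set physical=e1000g1
zonecfg:testing1:net> end

With an exclusive IP instance the zone plumbs and configures the interface itself, with its own IP stack, independent of the global zone.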

A big round of applause for Alexander Kolbasov, Erik Nordmark, Yukun Zhang, Dong-Hai Han, Stephen Lawrence, Andrei Dorofeev, Jerry Jelinek, and everyone else involved with these great advancements in Solaris.

PS: If you hadn't noticed, Solaris Resource Controls are starting to look very sexy indeed.