Xen 3.3 Press Release

August 28th, 2008 by Stephen Spector

The official Xen.org Press Release announcing Xen 3.3 has been posted here. The release includes many partner quotes from Oracle, Novell, Intel, AMD, Sun, IBM, Fujitsu, Samsung, Neocleus, Citrix, SignaCert, etc., and I encourage everyone in the community to take a look. I just got a Google News email for the search term “Xen”, and a lot of public news outlets are promoting the release (e.g. MarketWatch, Sys-Con, OStatic, Redmond Developer News, etc.).

Congrats again on the great community effort in getting this release out to the world…

Xen 3.3 Feature: Optimized HVM Video Memory Tracking

August 28th, 2008 by Stephen Spector

From Samuel Thibault:

When looking at how much CPU time is used while an HVM guest is idle, one can notice that the ioemu process used to permanently take something like 7%.  This is because ioemu used to keep checking the content of the HVM video RAM for modifications, since setting up a trap on each guest video write would slow guest video operations down terribly.  In Xen 3.3, ioemu asks the hypervisor to track video memory modifications.  The hypervisor can do this more efficiently since it has access to the dirty bit that the processor automatically sets in the page table flags on write accesses to pages.  As a result, instead of regularly comparing 8MB of video memory, ioemu just makes a hypercall to read the list of dirty pages.  As an additional optimization, if no modification has occurred for two seconds, write access to the entire video memory is dropped until the guest writes to video memory again, which saves even the page table walk.

The result is that the CPU time goes down to around 0.3%!
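The difference between the two polling strategies can be sketched with a toy model. This is not ioemu or hypervisor code: the 4 KiB page size, the `dirty_pages` set standing in for the hypervisor's dirty-page list, and all function names are assumptions made for illustration.

```python
PAGE_SIZE = 4096                      # assumed page size
VRAM_SIZE = 8 * 1024 * 1024           # 8MB of video memory, as in the text
NUM_PAGES = VRAM_SIZE // PAGE_SIZE


def changed_pages_by_compare(vram, snapshot):
    """Old approach: compare every page of video RAM against a saved copy."""
    changed = []
    for page in range(NUM_PAGES):
        start = page * PAGE_SIZE
        if vram[start:start + PAGE_SIZE] != snapshot[start:start + PAGE_SIZE]:
            changed.append(page)
    return changed


def changed_pages_by_dirty_list(dirty_pages):
    """New approach: read (and reset) a dirty-page list maintained elsewhere.

    `dirty_pages` is a hypothetical stand-in for what the hypervisor reports.
    """
    changed = sorted(dirty_pages)
    dirty_pages.clear()
    return changed


vram = bytearray(VRAM_SIZE)
snapshot = bytes(vram)
dirty_pages = set()

# Simulate one guest write; the simulated "hardware" records the dirty page.
offset = 5 * PAGE_SIZE + 123
vram[offset] = 0xFF
dirty_pages.add(offset // PAGE_SIZE)

by_compare = changed_pages_by_compare(vram, snapshot)
by_dirty_list = changed_pages_by_dirty_list(dirty_pages)
```

Both functions report the same changed page, but the first one has to touch all 8MB on every poll, while the second does work proportional to the number of pages actually written.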

Xen 3.3 Feature: PV-GRUB

August 28th, 2008 by Stephen Spector

From Samuel Thibault:

The traditional way to configure a PV guest is to write the path to the kernel/initrd to be loaded in the configuration file.  However, logically enough, these should be on the PV guest disk image, so that they can be managed by the distribution installed inside the PV guest.  PyGRUB used to act as a “PV bootloader”: it runs in dom0 as root, opens the PV disk image, reads its GRUB menu.lst, and presents a GRUB-like menu to let the user choose a kernel, which it copies to the dom0 filesystem.  It then closes the disk image and finally tells the domain builder to use that copy.  Such a dom0 root process that parses user-provided data is a potential security hole.

PV-GRUB, on the other hand, is the real GRUB source code recompiled against Mini-OS, and works much more like a usual bootloader: it runs inside the very PV domain that will host the PV guest.  In the PV domain configuration file, one just gives the path to the PV-GRUB kernel.  PV-GRUB boots inside the PV domain, detects the PV disks and network interfaces of the domain, uses them to access the PV guest’s menu.lst, uses the regular PV console to show the GRUB menu, and again uses the PV interface to load the kernel image from the guest disk image.  Some black magic is then used to boot the PV guest kernel from inside the PV domain (see the summit slides for the details).  The limitation, however, is that it cannot perform a 32/64-bit switch: to boot a 32-bit (resp. 64-bit) PV kernel, a 32-bit (resp. 64-bit) PV-GRUB is needed.  A bonus feature of PV-GRUB is that network boot is also possible, both for providing the menu.lst and the kernel/initrd, and it works exactly like in the regular GRUB.

As a result, PV-GRUB is far more secure than PyGRUB, as it uses only the very resources that the PV guest will use.
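To illustrate the configuration described above, a domain configuration file that boots through PV-GRUB might look like the sketch below (domain configuration files use Python syntax). Every path, the domain name, the memory size and the disk layout are assumptions for the example; in particular, the location of the PV-GRUB kernel depends on how Xen was installed, and passing the menu.lst location through `extra` reflects my understanding of how PV-GRUB is usually told where to look.

```python
# Hypothetical PV domain configuration using PV-GRUB (all values are assumptions).
name = "pv-guest"
memory = 512

# Path in dom0 to the PV-GRUB kernel; its bitness must match the guest kernel.
kernel = "/usr/lib/xen/boot/pv-grub-x86_64.gz"

# Where PV-GRUB should look for menu.lst on the guest's own disk.
extra = "(hd0,0)/boot/grub/menu.lst"

# The guest disk and network interface that PV-GRUB will access through the PV interface.
disk = ["file:/var/lib/xen/images/pv-guest.img,xvda,w"]
vif = [""]
```

Note that no guest kernel or initrd path appears in the file: these stay on the guest disk image, under the control of the guest distribution.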

See Summit slides:
http://www.xen.org/files/xensummitboston08/SamThibault_XenSummit.pdf

Xen 3.3 Feature: HVM Device Model Domain

August 28th, 2008 by Stephen Spector

From Samuel Thibault:

To provide HVM domains with virtual hardware, Xen uses a modified version of qemu, ioemu.  It used to run in dom0 as a root process, since it needs direct access to disks and to the tap network devices.  That poses both a security problem, as the qemu code base was not particularly designed to be safe, and an efficiency problem: when an HVM guest performs an I/O operation, the hypervisor hands control to dom0, which may not schedule the ioemu process immediately, leading to uneven performance.

In Xen 3.3, ioemu can be run in a Stub Domain (see previous article on Stub Domains).  That means that for each HVM domain there is a dedicated Device Model Domain that processes the I/O requests of the HVM guest.  The Device Model Domain then uses the regular PV interface to actually perform disk and network I/O.  This restricts any harm that ioemu could do to what the regular PV interface allows.  From a performance point of view, the benefit is twofold.  First, since ioemu runs directly in the same address space as Mini-OS, it runs more efficiently: the cost of e.g. select(), clock_gettime(), etc. is greatly reduced.  Second, since it runs as a domain, the hypervisor can schedule it directly, which keeps the latency of I/O operations to a minimum.  The result is that disk performance gets even closer to native, while network bandwidth is doubled!

See Summit slides:
http://www.xen.org/files/xensummitboston08/SamThibault_XenSummit.pdf

Xen 3.3 Feature: Stub Domains

August 28th, 2008 by Stephen Spector

From Samuel Thibault:

Having domain 0 run a lot of components (physical device drivers, the domain builder, ioemu device models, PyGRUB, etc.) has been worrisome from a security point of view, particularly since most of them run as root, so a breach there could be disastrous.  It also poses scalability issues since the hypervisor cannot itself schedule them appropriately.  The goal of domain 0 disaggregation is thus to move these components to separate domains: driver domain, builder domain, device model domains, etc.

Mini-OS used to be just a small PV kernel serving as a sample of how a PV guest works.  In Xen 3.3, it has been extended to the point of being able to run the newlib C library and the lwIP stack, thus providing a basic POSIX environment, including TCP/IP networking.  This makes it quite easy to embed an application in a dedicated Xen domain by just recompiling it against that environment.

Everything gets linked together as a kernel which can then just be started like any PV guest kernel.  In Xen 3.3, it is thus now possible to have the device model and grub running in their own domains, as described in further blog posts.

On the technical side, the additional features of Mini-OS include:

- Disk frontend
- FrameBuffer frontend
- FileSystem frontend (to access part of the dom0 filesystem)
- Improved Memory management: read-only memory and Copy on Write for zeroed pages
- Bug fixes!

But the simplicity (and thus the efficiency) of Mini-OS is still kept:

- Single address space (in particular, no kernel/user separation, completely eliminating system call costs)
- Single CPU
- Threads without preemption for Mini-OS internal use, not exposed at the POSIX layer.
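
For readers unfamiliar with threads without preemption, the idea can be sketched with Python generators. This is only an analogy and not the Mini-OS thread API: `worker` and `run_round_robin` are hypothetical names, and each `yield` marks the only place where a thread can lose the CPU.

```python
from collections import deque


def worker(name, steps, log):
    """A cooperative thread: it runs until it voluntarily yields."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # explicit yield point; nothing interrupts the thread before this


def run_round_robin(threads):
    """Run threads in turn; a thread only gives up the CPU when it yields."""
    queue = deque(threads)
    while queue:
        thread = queue.popleft()
        try:
            next(thread)          # run until the next yield point
            queue.append(thread)  # still alive: put it back in the queue
        except StopIteration:
            pass                  # thread finished


log = []
run_round_robin([worker("a", 2, log), worker("b", 3, log)])
```

Because a switch can only happen at a yield point, code between two yields needs no locking against other threads, which is part of what keeps such a scheduler small.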

Both C and Caml “hello world” samples are provided to get started with developing a stub domain.

See Summit slides:
http://www.xen.org/files/xensummitboston08/SamThibault_XenSummit.pdf

Xen 3.3 Feature: Shadow 3

August 27th, 2008 by dunlapg

Shadow 3 is the next step in the evolution of the shadow pagetable code.  By making the shadow pagetables behave more like a TLB, we take advantage of guest operating system TLB behavior to reduce and coalesce the number of guest pagetable changes that the hypervisor has to translate to the shadow pagetables.  This can dramatically reduce the virtualization overhead for HVM guests.

Shadow paging overhead is one of the largest sources of CPU virtualization overhead for HVM guests.  Because HVM guest operating systems don’t know the physical frame numbers of the pages assigned to them, they use guest frame numbers instead.  This requires the hypervisor to translate each guest frame number into a machine frame number in the shadow pagetables before it can be used by the guest.

Those who have been around a while may remember the Shadow-1 code.  Its method of propagating changes from guest pagetables to the shadow pagetables was as follows:

  • Remove write access to any guest pagetable.
  • When a guest attempts to write to the guest pagetable, mark it out-of-sync, add the page to the out-of-sync list and give write permission.
  • On the next page fault or cr3 write, take each page from the out-of-sync list and:
    • resync the page: look for changes to the guest pagetable, propagate those entries into the shadow pagetable
    • remove write permission, and clear the out-of-sync bit.
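
The steps above can be sketched as a toy model. This is not the Shadow-1 code itself: the `Shadow1` class, its dictionaries and the frame numbers are all hypothetical, and a real resync compares entries rather than rebuilding the whole shadow page.

```python
class Shadow1:
    """Toy model of the Shadow-1 out-of-sync scheme (all names are hypothetical)."""

    def __init__(self, guest_pagetables, gfn_to_mfn):
        self.guest = guest_pagetables      # page id -> {index: guest frame number}
        self.gfn_to_mfn = gfn_to_mfn       # guest frame number -> machine frame number
        self.shadow = {}
        self.writable = {p: False for p in guest_pagetables}  # write access removed
        self.out_of_sync = []
        self.resyncs = 0
        for page in guest_pagetables:
            self._resync(page)
        self.resyncs = 0                   # do not count the initial shadow build

    def _resync(self, page):
        """Walk the whole guest page and translate its entries into the shadow."""
        self.shadow[page] = {
            index: self.gfn_to_mfn[gfn] for index, gfn in self.guest[page].items()
        }
        self.resyncs += 1

    def guest_write(self, page, index, gfn):
        if not self.writable[page]:
            # Write fault: mark the page out-of-sync and give write permission.
            self.out_of_sync.append(page)
            self.writable[page] = True
        self.guest[page][index] = gfn

    def page_fault_or_cr3_write(self):
        """Resync every out-of-sync page, then write-protect it again."""
        while self.out_of_sync:
            page = self.out_of_sync.pop()
            self._resync(page)
            self.writable[page] = False


mmu = Shadow1({"pt0": {}}, {10: 100, 11: 101})
mmu.guest_write("pt0", 0, 10)   # first write: page goes out-of-sync
mmu.guest_write("pt0", 1, 11)   # page is already writable: no extra fault
mmu.page_fault_or_cr3_write()   # both changes reach the shadow in one resync
```

The model shows why batches of writes are cheap under this scheme (one resync covers both writes) and why a single write followed immediately by a page fault is expensive: it pays for a full resync of the page.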

While this method worked so-so for Linux, it was disastrous for Windows.  Windows heavily uses a technique called demand-paging.  Resyncing a guest page is an expensive operation, and under Shadow-1, every page that was faulted in caused an out-of-sync, a write, and a resync.

The next step, Shadow-2, (among many other things) did away with the out-of-sync mechanism and instead emulated every write to guest pagetables.  Emulation avoids the expensive unsync-resync cycle for demand paging.  However, it removes any “batching” effects: every write is immediately reflected in the shadow pagetables, even though the guest operating system may not have been expecting the address change to be available until later.

Furthermore, Windows will frequently write “transition values” into pagetable entries when a page is being mapped in or mapped out.  The cycle for demand-faulting zero pages in 32-bit Windows looks like:

  • Guest process page faults
  • Write transition PTE
  • Write real PTE
  • Guest process accesses page

On bare hardware, this looks like “Page fault / memory write / memory write”.  Memory writes are relatively inexpensive.  But in Shadow-2, this looks like:

  • Page fault
  • Emulated write
  • Emulated write

Each emulated write involves a VMEXIT/VMENTER as well as about 8000 cycles of emulation inside the hypervisor, much more expensive than a mere memory write.

Shadow-3 brings back the out-of-sync mechanism, but with some key changes.  First, only L1 pagetables are allowed to go out-of-sync.  All L2+ pagetables are emulated.  Second, we don’t necessarily resync on the next page fault.  One of the things this enables is a “lazy pull-through”: if we get a page fault where the shadow is not-present but the guest is present, we can simply propagate that entry to the shadows and return to the guest, leaving the rest of the page out-of-sync.  This means that once a page is out-of-sync, demand-faulting looks like this:

  • Page fault
  • Memory write
  • Memory write
  • Propagate guest entry to shadows

Pulling through a single guest value is actually cheaper than emulation.  So for demand-paging under Windows, we have 1/3 fewer trips into the hypervisor.  Furthermore, batch updates, like process destruction or mapping large address spaces, are propagated to the shadows in a batch at the next CR3 switch, rather than going into and out of the hypervisor on each individual write.
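
The "1/3 fewer trips" figure can be checked by counting hypervisor entries in the two demand-faulting cycles listed earlier. The sketch below is only bookkeeping; it assumes that the page fault and the pull-through each count as one trip, and that plain memory writes to an out-of-sync L1 page do not enter the hypervisor at all. The 8000-cycle figure is the approximate emulation cost quoted in the text.

```python
from fractions import Fraction

EMULATION_CYCLES = 8000  # approximate cost of one emulated write, from the text

# One demand-fault cycle: (event, does it enter the hypervisor?)
SHADOW2_CYCLE = [
    ("page fault", True),
    ("emulated write (transition PTE)", True),
    ("emulated write (real PTE)", True),
]
SHADOW3_CYCLE = [
    ("page fault", True),
    ("memory write (transition PTE)", False),
    ("memory write (real PTE)", False),
    ("propagate guest entry to shadows", True),
]


def hypervisor_trips(cycle):
    """Count the events in a cycle that require a trip into the hypervisor."""
    return sum(1 for _, enters_hypervisor in cycle if enters_hypervisor)


shadow2_trips = hypervisor_trips(SHADOW2_CYCLE)
shadow3_trips = hypervisor_trips(SHADOW3_CYCLE)
reduction = Fraction(shadow2_trips - shadow3_trips, shadow2_trips)

# Emulation work per cycle in Shadow-2 (two emulated writes).
shadow2_emulation_cycles = 2 * EMULATION_CYCLES
```

Under these assumptions Shadow-2 makes three trips per cycle and Shadow-3 makes two, which is where the one-third reduction comes from; Shadow-3 also avoids the roughly 16000 cycles of emulation, paying instead for a single pull-through.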

All of this adds up to greatly improved performance for workloads like compilation, compression, databases, and any workload which does a lot of memory management in an HVM guest.

Xen Summit 2009 Proposal

August 27th, 2008 by Stephen Spector

I am currently working on the Xen.org Community Plans for 2009 Xen Summits and I wanted to share my thoughts with the community to get feedback on my ideas. In the past, Xen Summits have been held every 9 months with the majority of them being in North America. It is my intention, as you can see with the upcoming Xen Summit Tokyo/Asia in November, to ensure that we provide an opportunity for all community members to attend a Xen Summit without having to travel a great distance. To support this concept, I am proposing the following plan for 2009:

  • Xen Summit North America
    • Location: Oracle is scheduled to host this event on February 24 – 25, 2009 in Redwood Shores, CA at Oracle’s Conference Center
    • Focus: The Xen.org development community with an agenda highlighting the latest features being developed, status updates on research projects leveraging Xen, and customer demonstrations of Xen solutions
    • Length: 2 Days
  • Xen Summit Europe
    • Location: I am speaking to the LinuxTAG and Linux Kongress organizations about co-locating with one of their events in Germany
    • Focus: Research and customer demonstrations of how they are using Xen; As this event is co-located with a Linux event this is a good opportunity to promote the Xen solution to a wider audience so the agenda needs to be more “how to use Xen”.
    • Length: 1 Day
  • Xen Summit Asia
    • Location: OPEN (Xen Summit Tokyo/Asia 2008 is being hosted by Fujitsu in Tokyo)
    • Focus: Specific developer topics related to areas critical in Asia (e.g. IA64), how Xen is being used in Asia and new research occurring in Asia [Cross b/w Xen Summit North America and Xen Summit Europe for overall agenda focus]
    • Length: 2 Days
  • Xen Summit North America II
    • Location: If we follow previous schedules, this event will be held nine months after Xen Summit North America; therefore this would be in the Fall 09
    • Focus: Same as Xen Summit North America
    • Length: 2 Days

Note, I have tried to create different focuses for the events to ensure that community members are not required to attend all the events to stay in touch with the community. The Xen Summit North America event will be the developer focused meeting while the other Xen Summits will take a more customer/researcher focus.

As for the future, I have received requests to hold a Xen Summit in India and possibly South America. As the community grows, I expect to see us offer more events globally to better serve the global community and we will revisit the plan when scheduling for 2010.

Community Questions for Discussion

  1. Do we want to host 2 Xen Summits North America next year to continue the 9 month separation of events?
  2.  Is there demand for a 1 day Xen Summit event in Germany? Is there another location in Europe that would be better? Is there another event to consider for co-location?

Xen 3.3 Feature: Memory Overcommit

August 27th, 2008 by Stephen Spector

From Dan Magenheimer at Oracle:

Memory overcommit allows the sum of the physical memory allocated to all active domains to exceed the total physical memory on the system.  For example, if your machine has 4GB of RAM and you want to run as many 1GB domains as possible, you can run at most three, because Xen and domain 0 also require some physical memory.  With the new memory overcommit feature in Xen 3.3, in some environments, you can run six or ten or even more.
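
The arithmetic in this example can be written out as a short sketch. The 512MB reserved for Xen and domain 0 and the 350MB that each mostly idle domain actually uses are made-up numbers chosen for illustration; real values depend on the system and the workload.

```python
def max_domains(total_mb, reserved_mb, per_domain_mb):
    """Number of domains that fit in the RAM left over after Xen and domain 0."""
    return (total_mb - reserved_mb) // per_domain_mb


total_mb = 4 * 1024      # 4GB machine, as in the text
reserved_mb = 512        # assumed footprint of Xen plus domain 0 (not from the text)

# Without overcommit, every domain holds its full 1GB allocation.
without_overcommit = max_domains(total_mb, reserved_mb, per_domain_mb=1024)

# With overcommit, only the memory each domain really uses has to be backed.
# 350MB per mostly idle domain is an assumption for the sake of the example.
with_overcommit = max_domains(total_mb, reserved_mb, per_domain_mb=350)
```

With these numbers, three domains fit without overcommit and ten fit with it, which matches the order of magnitude quoted above.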

To be clear, there is no magic:  memory overcommit may have some performance impact and may be unusable in some environments.  Memory for new domains is obtained by taking it away from currently running domains, so environments where all domains heavily utilize memory are not candidates for memory overcommit.  And to maximize the benefit, all domains must be properly configured.  But for environments that require a high ratio of virtual domains to physical machines and that are willing to make some tradeoffs, memory overcommit can substantially increase “VM density” and save cost.

Memory is taken from one domain and given to another using the existing Xen “ballooning” mechanism, which has recently been improved to be more robust.  For example, a domain that is idle (or nearly so) is probably not using much memory; this memory can be made available to use in another domain, or for a newly created domain.  The tricky part is to determine how MUCH memory can be taken away from domains without causing problems for them; and, even more importantly, how to give the memory back if a domain suddenly needs it again.

This careful memory balancing ideally should be done in a management tool that can monitor memory needs of all domains and add or subtract memory from each domain as needed.  A very simple management tool supplied with Xen 3.3 provides “self-ballooning” and, while more sophisticated tools may be needed in the future, self-ballooning is sufficient for many environments.

To best implement memory overcommit, all domains should be configured with a properly sized and configured virtual swap disk, and all HVM domains must have a working balloon driver and runnable Xenstore tools.  Next, the self-ballooning scripts are installed in each domain and enabled as a service.

The scripts, along with a comprehensive README, are found in xen.hg/tools/xenballoond in the open source Xen distribution. Once all domains are rebooted, automatic memory balancing will occur and idle memory is freed up to run additional domains, thus resulting in memory overcommit!
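
To give a feel for what a self-ballooning policy might look like, here is a minimal sketch. It is not the xenballoond script: the policy of following the `Committed_AS` value from /proc/meminfo, the function names and the bounds are all assumptions for illustration, and the sketch only computes a target without talking to the balloon driver.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in kB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def selfballoon_target_kb(meminfo, min_kb, max_kb):
    """Pick a memory target close to what the guest has committed, within bounds."""
    wanted = meminfo["Committed_AS"]
    return max(min_kb, min(max_kb, wanted))


# Sample input for a mostly idle 1GB guest (made-up numbers).
sample = """MemTotal:        1048576 kB
MemFree:          786432 kB
Committed_AS:     204800 kB
"""

info = parse_meminfo(sample)
target = selfballoon_target_kb(info, min_kb=131072, max_kb=info["MemTotal"])
```

In this sample the guest would shrink to about 200MB, freeing the rest for other domains. If the estimate turns out to be too low, the guest falls back on its swap disk until the balloon is deflated again, which is why a properly sized virtual swap disk matters.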

For more information, see:
http://www.xen.org/files/xensummitboston08/MemoryOvercommit XenSummit2008.pdf
http://wiki.xensource.com/xenwiki/Open_Topics_For_Discussion?action=AttachFile&do=get&target=Memory+Overcommit.pdf

Xen 3.3 Feature Details

August 27th, 2008 by Stephen Spector

Xen.org Community:

As part of the Xen 3.3 release, I have asked the various development authors to supply me with information on their new features. Over the next few weeks, I will be posting their overviews to this blog to give everyone further information on the features in the new release.

Preview the new Xen.org Website

August 26th, 2008 by Stephen Spector

Xen.org Community:

As many of you are aware, I have been working the past few months to update the current Xen.org website to better target various users of the site as well as simplify the organization of the information. I have completed the web development and am now making the site available for feedback and comment. You can reach the site at http://staging.xen.org and I encourage all feedback from broken links to “what were you thinking?”. I plan to allow at least 2 weeks for comments before I make the final changes and transition the new site to www.xen.org.

If you are interested in seeing the design document that I based the new site on or the target profile descriptions, please search in the blog for “web development” and you will get those documents.

Please note that links to the Wiki, Bug Tracker, Source Browser, and Mercurial Repository will take you to the existing header structure as I am in the process of preparing those services to link to the new site.