

Virtualization on ARM with Xen

This is a repost of a tutorial published initially on community.arm.com – Thank you to Andrew Wafaa for allowing us to repost.

With ARM entering the server space, a key technology in play in this segment is Virtualization. Virtualization is not a tool solely for servers and the data center, it is also used in the embedded space in segments like automotive and it is also starting to be used in mobile.

This is not a new technology; IBM pioneered it in the 1960s, and there are many different hypervisors implementing different methods of virtualization. In the Open Source realm there are two major hypervisors: KVM and Xen. Both interact directly with the Linux kernel; however, KVM is solely in the Linux domain, whereas Xen works with Linux, *BSD and other UNIX variants.

In the past it was generally accepted that there are two types of hypervisor: Type 1 (also known as bare metal or native), where the hypervisor runs directly on the host server, controls all aspects of the hardware and manages the guest operating systems; and Type 2 (also known as hosted), where the hypervisor runs within a normal operating system. Under this classification Xen falls into the Type 1 camp and KVM into the Type 2 camp. However, modern implementations of both hypervisors have blurred the lines of distinction.

This time round I’ll be taking a look at the Xen Hypervisor, which is now one of the Linux Foundation’s collaborative projects. Here is a brief overview of some of Xen’s features:

  • Small footprint. Based on a microkernel design, it has a limited interface to the guest virtual machine and takes up around 1MB of memory.
  • Operating system agnostic. Xen works well with BSD variants and other UNIX systems, although most deployments use Linux.
  • Driver Isolation. In the Xen model the majority of device drivers run inside virtual machines rather than in the hypervisor. As well as allowing existing OS driver stacks to be reused, this means the VM containing a driver can be rebooted in the event of a crash or compromise without affecting the host or other guests. Individual drivers can even be run in separate VMs in order to improve isolation and fault tolerance, or just to take advantage of differing OS functionality.
  • Paravirtualization (PV). This style of port enables Xen to run on hardware that doesn’t have virtualization extensions, such as Cortex-A5/A8/A9 in ARM’s case.  There can also be some performance gains for some PV guests, but this requires the guests to be modified and prevents “out of the box” implementations of operating systems.
  • No emulation, no QEMU. Emulated interfaces are slow and insecure. By using hardware virtualization extensions and I/O paravirtualization, Xen removes any need for emulation. As a result you have a smaller code base and better performance.

Continued…

Posted in Community.



Rackspace hosts Xen Project Hackathon, May 29-30 in London

I am pleased to announce the next Xen Project Hackathon. The Hackathon will be hosted by Rackspace in their London offices, May 29 and 30. I want to thank Paul Voccio and Gus Maskowitz from Rackspace for hosting the Hackathon, as well as for hosting the Xen Project wiki, mailing lists, blog and other services. This is in line with Rackspace’s vision of openness and helping people:

At Rackspace we build and live openness as much as possible and go out of our way to support and nurture open source where we can. We help people achieve success by solving their hybrid hosting and cloud problems. In that same line we are proud to help the Xen Project run a successful Hackathon in London 2014. – Rackspace

What to expect at a Xen Project Hackathon?

The aim of the Hackathon is to give developers the opportunity to meet face to face, to discuss development, coordinate, write code and collaborate with other developers. Of course the event will also allow everyone to meet in person and build relationships: to facilitate this, we will have a social event on the evening of the 29th. We will cover many hot topics such as the latest Xen Project Hypervisor 4.4 features, planning for the next Xen Project Hypervisor release, cloud integration, cloud operating systems, Mirage OS, and Xen Project in emerging segments such as embedded, mobile, automotive and NFV. But at the end of the day, the community will choose the topics that are covered.

To ensure that the event runs efficiently, we follow a simple process: each day is divided into several segments, and we will have a number of work areas labelled with numbers (or other unique identifiers). Each morning starts with a plenary and scheduling session, where any attendee can announce a topic they care about; we then map each topic against a work area and time-slot. This makes it easy for other attendees to participate in projects and discussions they care about. Of course we also encourage attendees to highlight projects they plan to share before the event by adding them to our wiki.

We will wrap up each day with another short plenary session: the aim of this session is to summarize what was done, show brief demos and make improvements to the process.

To give you a sense of the venue, we attached a few pictures of the venue and past events:

[Photos of the Rackspace London venue and past Rackspace events]

How to Register?

As spaces at the Xen Project Hackathon are limited, we are asking attendees to request an invitation. Once you have requested an invitation, you will be notified by email within 5 business days with instructions on how to confirm it.

Like last year, we will be asking for a small registration fee of $15. This fee will be given to a charity or open source organisation. You will need to cover your own travel, accommodation and other costs such as evening meals, etc. We do have limited travel stipends available for individuals who cannot afford to travel. Please contact community dot manager at xenproject dot org if you need to make use of it.

More Information

Posted in Community, Events.



Xen Project Team Hits the Road

You’ll find many of our members and contributors taking on more than coding this spring. We’re excited to attend several upcoming industry events and share Xen Project milestones, news, use cases and roadmap updates in-person with many in our community.

We encourage you to attend any of these upcoming Xen Project talks. And, if you do, make sure to introduce yourself to the speaker.  It’s always good to meet new people from the Xen Project community!

First Stop: Linux Foundation Collaboration Summit

This week you’ll find the Xen Project team in wine country at the Linux Foundation Collaboration Summit in Napa Valley, California.  We have a terrific set of Xen Project sessions at this year’s conference. In fact, the schedule for March 27 reads like a mini Xen Project Summit.

At 11:30 AM, GlobalLogic CTO Alex Azizam discusses “Xen versus Xen Automotive,” an overview of the technologies required to fully use Xen Project software in automotive applications.

At 2:00 PM, Intel Software Engineer Zhiyuan Lv presents “XenGT: A Full GPU Virtualization Solution with Mediated Pass-Through.”  GPU virtualization is especially hot right now, and this project to provide high-performance virtual GPUs for use within the Xen Project environment is especially interesting.

At 3:00 PM, Oracle Software Engineer Mukesh Rathor talks about “PVH: A PV Guest in HVM Container.”  The combination of PV and HVM promises to yield the highest performance of any Xen Project hypervisor mode for most workloads.

And, finally, at 4:00 PM, I deliver an overview of our new release with “Xen Project 4.4: Features and Futures.”  Attendees will hear about the newest capabilities, as well as hear a quick summary of some upcoming enhancements on the project roadmap.

Check out the Collaboration Summit Q&A on Linux.com for more event highlights and Xen Project reflections.

Up Next: ApacheCon / CloudStack Collaboration Conference / CentOS Dojo

With barely any time to rest, you’ll next find us at ApacheCon and the CloudStack Collaboration Conference April 7-9 in Denver, Colo.

At 2:20 PM on Thursday, April 10 I’ll again present “Xen Project 4.4: Features and Futures” for the Apache CloudStack enthusiasts visiting the Mile High City.

Later that day, at 4:20 PM I’ll deliver “Using and Understanding Xen CentOS” at CentOS Dojo (also co-located with ApacheCon), discussing how to use the Xen Project hypervisor on top of CentOS.

On Friday, April 11 at 11:45 AM, John Mark Walker, Gluster Community Manager at Red Hat, presents “The New Cloud Stack — CloudStack, Xen and GlusterFS” and the latest work from Gluster engineers to improve integration and scalability across these technologies.

More to Come…

If we miss you in California or Colorado, be sure to visit XenProject.org and check out the upcoming events to find out when a Xen Project talk is coming to a conference near you.  And don’t forget that we normally link to our talk slides and videos as they become available.

Hope to see you at an event soon!

Posted in Announcements, Events.


XenGT – a Full Graphics Virtualization Solution on Intel® Processor Graphics

Background

The Graphics Processing Unit (GPU) has become a fundamental building block in today’s computing environment, accelerating tasks from entertainment applications (gaming, video playback, etc.) to general purpose windowing (Windows* Aero*, Compiz Fusion, etc.) and high performance computing (medical image processing, weather forecast, computer aided designs, etc.).

Today, we see a trend toward moving GPU-accelerated tasks to virtual machines (VMs). Desktop virtualization simplifies the IT management infrastructure by moving a worker’s desktop to the VM. In the meantime, there is also demand for buying GPU computing resources from the cloud. Efficient GPU virtualization is required to address the increasing demands.

Enterprise applications (mail, browser, office, etc.) usually demand a moderate level of GPU acceleration capability. When they are moved to a virtual desktop, our integrated GPU can easily accommodate the acceleration requirements of multiple instances.

GPU Background

Let’s first look at the architecture of Intel Processor Graphics:

arch_of_intel_graphics

The render engine provides the GPU acceleration capabilities, with fixed-function pipelines and execution units that are driven by GPU commands queued in command buffers. The display engine routes data from graphics memory to external monitors, and contains the state of display attributes (resolution, color depth, etc.). The global state covers all the remaining functionality, including initialization, power control, etc. Graphics memory holds the data used by the render engine and display engine.

The Intel Processor Graphics uses system memory as graphics memory, accessed through the graphics translation table (GTT). A single 2GB global virtual memory (GVM) space is available to all GPU components through the global GTT (GGTT). In addition, multiple per-process virtual memory (PPVM) spaces can be created through the per-process GTTs (PPGTTs), extending the limited GVM resource and enforcing process isolation.
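To make this addressing scheme concrete, here is a minimal Python sketch of the idea. The flat dictionary layout, sizes and names are simplifications for illustration only; the real hardware uses multi-level page tables managed by the graphics driver.

```python
# Illustrative model of graphics address translation through a GTT.
# Names and sizes are simplified; not Intel's actual data structures.

PAGE_SIZE = 4096
GLOBAL_APERTURE = 2 * 1024**3          # single 2 GiB global virtual memory space

class GTT:
    """Maps graphics page numbers to system memory page frames."""
    def __init__(self, size_bytes):
        self.entries = {}               # graphics page number -> system page frame
        self.num_pages = size_bytes // PAGE_SIZE

    def map(self, gfx_page, sys_frame):
        if gfx_page >= self.num_pages:
            raise ValueError("graphics address outside this GTT's space")
        self.entries[gfx_page] = sys_frame

    def translate(self, gfx_addr):
        frame = self.entries.get(gfx_addr // PAGE_SIZE)
        if frame is None:
            raise KeyError("unmapped graphics address")
        return frame * PAGE_SIZE + gfx_addr % PAGE_SIZE

# One global GTT shared by all engines, plus per-process GTTs that give each
# rendering process its own, isolated graphics address space.
ggtt = GTT(GLOBAL_APERTURE)
ppgtt_for_process = {pid: GTT(GLOBAL_APERTURE) for pid in (101, 102)}

ggtt.map(0, sys_frame=500)
ppgtt_for_process[101].map(0, sys_frame=800)   # same graphics page, different backing
print(hex(ggtt.translate(0x10)), hex(ppgtt_for_process[101].translate(0x10)))
```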

Graphics Virtualization Technologies

Several technologies achieve graphics virtualization, as illustrated in the image below, with more hardware acceleration toward the right.

virtual_gfx_techs

Device emulation is mainly used in server virtualization, with emulation of an old VGA display card. Qemu is the most widely used vehicle. Full emulation of a GPU is almost impossible, because of complexity and extremely poor performance.

API forwarding implements a frontend/backend driver pair. The frontend driver forwards high-level DirectX/OpenGL API calls from the VM to the backend driver in the host through an optimized inter-VM channel. Multiple backend drivers behave like normal 3D applications in the host, so a single GPU can be multiplexed to accelerate multiple VMs. However, the difference between the VM and host graphics stacks easily leads to reduced performance or compatibility issues. Because it is hardware-agnostic, this is the most widely used technology, so far. Actual implementations vary, depending on the level where forwarding happens. For example, VMGL directly forwards GL commands, while VMware vGPU presents itself as a virtual device, with high-level DirectX calls translated to its private SVGA3D protocol. Another recent example is Virgil, with its experimental virtual 3D support for QEMU.

Direct pass-through, based on VT-d, assigns the whole GPU exclusively to a single VM. When achieving the best performance, it sacrifices the sharing capability.

Mediated pass-through extends direct pass-through using a software approach. Every VM is allowed to access partial device resources without hypervisor intervention, while privileged operations are mediated through a software layer. It sustains the performance of direct pass-through while still providing the sharing capability. XenGT adopts this technology.

XenGT

XenGT is a full GPU virtualization solution with mediated pass-through, on Intel Processor Graphics. A virtual GPU instance is maintained for each VM, with some performance-critical resources directly assigned. Running a native graphics driver inside a VM, without hypervisor intervention on performance-critical paths, achieves a good balance among performance, features and sharing capability.

arch_of_xengt

The figure above shows the overall XenGT architecture. Each VM is allowed to access some performance-critical resources without hypervisor intervention. Privileged operations are trapped by Xen and forwarded to the mediator for emulation. The mediator emulates a virtual GPU instance for each VM, and performs context switches when switching the GPU between VMs. XenGT implements the mediator in dom0. This avoids adding complex device knowledge to Xen, and also permits a more flexible release model. At the same time, we want a unified architecture that mediates all VMs, including dom0 itself, so the mediator is implemented as a module separate from dom0’s graphics driver. This brings a new challenge: Xen must selectively trap accesses from dom0’s own driver while granting permission to the mediator. We call this a “de-privileged” dom0 mode.

Performance critical resources are passed through to a VM:

  • Part of the global virtual memory space
  • VM’s own per-process virtual memory spaces
  • VM’s own allocated command buffers (actually in graphics memory)

This minimizes hypervisor intervention in the critical rendering path. Even when a VM is not scheduled to use the render engine, that VM can continuously queue commands in parallel.

Other operations are privileged, and must be trapped and emulated by the mediator, including:

  • MMIO/PIO
  • PCI configuration registers
  • GTT tables
  • Submission of queued GPU commands

The mediator maintains the virtual GPU instance based on the traps mentioned above, and schedules use of the render engine among VMs to ensure secure sharing of the single physical GPU.
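To illustrate the trap-and-emulate flow and the render-engine scheduling described above, here is a rough, self-contained Python sketch. All class, method and field names are invented for illustration; this is not the actual XenGT implementation.

```python
# Rough sketch of the mediated pass-through idea behind XenGT.
# The real mediator lives in dom0, receives trapped accesses forwarded by Xen,
# and drives the physical GPU; everything below is a simplified model.

class VirtualGPU:
    """Per-VM virtual GPU instance maintained by the mediator."""
    def __init__(self, vm_id):
        self.vm_id = vm_id
        self.mmio = {}        # emulated MMIO register file
        self.gtt = {}         # shadow of the VM's GTT entries
        self.pending = []     # GPU commands queued but not yet submitted

class Mediator:
    def __init__(self):
        self.vgpus = {}

    def attach(self, vm_id):
        self.vgpus[vm_id] = VirtualGPU(vm_id)

    # Privileged accesses (MMIO/PIO, PCI config, GTT updates, command
    # submission) are trapped by the hypervisor and forwarded here.
    def on_trap(self, vm_id, kind, offset, value=None):
        vgpu = self.vgpus[vm_id]
        if kind == "mmio_write":
            vgpu.mmio[offset] = value
        elif kind == "mmio_read":
            return vgpu.mmio.get(offset, 0)
        elif kind == "gtt_write":
            vgpu.gtt[offset] = value          # audit/translate before shadowing
        elif kind == "submit":
            vgpu.pending.append(value)        # queued until this VM is scheduled

    # The mediator time-slices the render engine between VMs, switching
    # per-VM GPU context around each slice.
    def schedule(self):
        for vgpu in self.vgpus.values():
            while vgpu.pending:
                cmd = vgpu.pending.pop(0)
                print(f"VM{vgpu.vm_id}: executing command {cmd!r} on physical GPU")

m = Mediator()
m.attach(1); m.attach(2)
m.on_trap(1, "submit", None, "draw triangle")
m.on_trap(2, "submit", None, "clear screen")
m.schedule()
```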

Current Status

The latest source code and the setup guide are available in the GitHub repositories:

(The first repository has a XenGT_Setup_Guide.pdf, which supplies step-by-step instructions for getting a system set up.)

Linux: https://github.com/01org/XenGT-Preview-kernel.git

Xen: https://github.com/01org/XenGT-Preview-xen.git

Qemu: https://github.com/01org/XenGT-Preview-qemu.git

Patches are welcome!

We plan to upstream this work, and are now preparing some cleanup.

 

Changelog

XenGT was first announced in Sep 2013:

http://lists.xen.org/archives/html/xen-devel/2013-09/msg00681.html

It was presented at the 2013 Xen Project Developer Summit, Edinburgh:

http://events.linuxfoundation.org/sites/events/files/slides/XenGT-Xen%20Summit-v7_0.pdf

An update was announced recently in Feb 2014:
http://lists.xen.org/archives/html/xen-devel/2014-02/msg01848.html

Posted in Xen Development.


Xen 4.4 Released

Xenproject.org is pleased to announce the release of Xen 4.4.0. The release is available from the download page.

Xen 4.4 is the work of 8 months of development, with 1193 changesets. It’s our first release made with an attempt at a 6-month development cycle. Between Christmas and a few important blockers, we missed that by about 6 weeks, but that’s still not too bad overall.

Additionally, this cycle we’ve had a massive increase in the amount of testing. The Xen Project’s regression testing system, osstest, has received a number of additional tests, and the XenServer team at Citrix have put Xen through their massive testing suite (XenRT). Additionally, early in this development cycle we got the go-ahead to use the Coverity static analysis engine to comb through the source code for hard-to-spot bugs. The result should be that Xen 4.4 is one of the most secure, reliable releases yet.

Highlights

Although the development part of the release cycle was shorter than the previous one, we still have far more exciting improvements than we can mention in this blog post; I’ll call out just a few.

Probably one of the most important is solid libvirt support for libxl. Jim Fehlig from SuSE and Ian Jackson from Citrix worked together to test and improve the interface between libvirt and libxl, making it fast and reliable. This lays the foundation for solid integration into any tools that can use libvirt, from GUI VM managers to cloud orchestration layers like CloudStack or OpenStack.

Another big one is a new scalable event channel interface, designed and implemented by David Vrabel from Citrix. The original Xen event channel interface was limited to the number of bits on the platform squared — 1024 for 32-bit guests and 4096 for 64-bit guests. With many VMs requiring 4 event channels each, that means a theoretical maximum of 256 guests on a 32-bit dom0 — more than enough back when a large machine had 8 cores, and every VM was a full OS; but a major limitation on systems with 128 cores, or those using cloud OSes like Mirage or OSv. The new “FIFO” event channel interface by default scales up to over 200,000 event channels, and in the future can be extended even further if necessary in a backwards-compatible manner. This should be enough for many years to come.
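For the curious, here is the back-of-the-envelope arithmetic behind those numbers as a small Python snippet. The 4-channels-per-VM figure and the FIFO default are simply the values quoted above.

```python
# Arithmetic behind the event channel limits described in the post.
# The 2-level ABI tracks events in a bitfield of machine words, giving
# word_size * word_size channels in total.

def two_level_limit(word_size_bits):
    return word_size_bits * word_size_bits

for bits in (32, 64):
    limit = two_level_limit(bits)
    channels_per_vm = 4                       # a typical minimum per guest
    print(f"{bits}-bit dom0: {limit} channels -> "
          f"about {limit // channels_per_vm} guests at {channels_per_vm} channels each")

fifo_default_limit = 200_000                  # approximate default quoted above
print(f"FIFO ABI default: over {fifo_default_limit} channels")
```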

The ARM port is maturing quickly. As of 4.4, the hypervisor ABI for ARM has been declared stable, meaning that any guest which uses the 4.4 ARM ABI can rely on being able to boot on all future versions of Xen. There are a number of improvements making Xen on ARM more flexible, easier to set up and use, and easier to extend to new platforms. More details can be found in the Xen 4.4 feature list.

One other feature worth a note is Nested Virtualization on Intel hardware. It’s not ready for production use yet, but it has improved to the point where we feel comfortable moving it from “experimental” to “tech preview”. Please feel free to take it for a spin and report any issues you find.

There are many more improvements and changes under the hood. For a more complete list, see the Xen 4.4 feature list.

Features in related projects

The Xen Project is part of a much larger ecosystem of projects. We are typically very closely tied to Linux and qemu, but a number of other projects have had important developments that are worth a mention.

The first is the pv port of grub2. Rather than having a re-implementation of grub in the Xen tree, grub2 now has native support for running in Xen and using the Xen pv block protocol. This guarantees 100% compatibility with grub2 going forward.

Another project worth a mention is the 3.3 release of Xen Orchestra. Xen Orchestra is a web interface that interfaces with the xapi protocol (and thus can be used with XCP, XenServer, or other xapi-based systems). New features include creating, reverting and deleting snapshots, removing a host from a pool, restarting the toolstack and rebooting or shutting down a host, as well as a more stable upgrade process for the appliance.

Finally, GlusterFS 3.5 now supports creating iSCSI nodes. One of the benefits of this is that now, by creating iSCSI devices in dom0, Xen guest disks can be stored in GlusterFS.


Posted in Announcements, Community, Xen Hypervisor.


Xen Project Developer Summit Call for Participation is Open

Join us in Chicago August 18-19, 2014

The Xen Project Developer Summit will feature content for developers, integrators and power users of the Xen Project. We are looking for presentations related to development, such as development proposals, updates on feature development, project updates, etc. We are also looking for insight into best practices in deploying Xen Project at scale, case studies by Xen users and other topics that large scale users of the Xen Project hypervisor care about.

The program committee will be looking for presentations and workshops related to working with the Xen Project. Topics related to Xen Project development include:

  • development proposals
  • updates on feature development
  • project updates
  • discussions and proposals on the architectural evolution of Xen Project
  • development best practices
  • studies and benchmarks of system characteristics such as performance/scalability/security/ease of use/power consumption
  • lessons learned
  • interfacing with other open source projects
  • making Xen Project software easier to consume by distributions and integrators

We are also interested in proposals that provide insight into best practices in deploying Xen Project, case studies by Xen Project users and other topics that large users and integrators of Xen Project care about. This includes:

  • deploying Xen Project software or its sister projects at scale
  • best practices for working with Xen Project
  • case studies by users
  • Xen Project benchmarks
  • tips and tricks in securing Xen-Project-based clouds
  • managing Xen-Project-based environments
  • open source projects that are related to Xen Project and deliver benefits to our users
  • 3rd party integrations
  • and more!

In short, if it’s relevant to Xen Project development, integration or usage, we are interested in what you might have to say.

Submit Your Talk Proposal by May 2

Ready to submit your talk, or just interested in learning more?  Click here.

Posted in Announcements, Community, Events.


Xen Project participates in GSoC 2014 and OPW Round 8

The Xen Project is pleased to announce that we have been accepted to participate in this year’s Google Summer of Code, and that the Xen Project will also participate in Round 8 of the GNOME Outreach Program for Women.

Google Summer of Code

You can find our project list and more information on how to choose a project, apply as a student and participate as a mentor on our Xen Project Google Summer of Code Portal.

OPW Round 8

You can find our project list and more information on how to choose a project, apply as an intern and participate as a mentor on our OPW Round 8 Portal.

Posted in Community.


Xen Project automatic testing on community infrastructure

Currently the Xen Project’s automatic testing setup runs on a small set of hardware in space borrowed from Citrix. Because it’s on the Citrix network, it’s not possible to give access to other community members. The underlying systems are creaking, and the setup is too small – we already find that testing is rather too slow.

The Xen Project Test Framework Working Group has agreed to press forward with a plan to provide a new setup (in a public colo, probably). We have a budget for this from the Advisory Board, which we think will be sufficient to provide a bigger and better setup than we have now.

We have decided to separate this immediately pressing concern – the inadequate and inaccessible hosting – from the longer-term questions of how to make more use of Xen community members’ existing test software. In particular, we have deferred the question of whether to stick with the existing osstest system long-term, or move to another system such as Citrix’s XenRT.

We’ll consider whether, when and how to make such a transition after we have sorted out our underlying infrastructure. We will make sure that the hardware and facilities we are organising now will be suitable for whatever software system we might want to run.

So, our immediate task now is to set out a more detailed plan for the amount and kind of hardware to acquire, and to identify a suitable hosting facility.

Posted in Community, Xen Development.


Announcing Xen 4.3.2 and 4.2.4 Releases

The Xen Project is pleased to announce the availability of  two maintenance releases: Xen 4.3.2 and Xen 4.2.4.

Xen 4.3.2 Release

This release is available immediately from the git repository:

http://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/stable-4.3 (tag RELEASE-4.3.2)

or from the XenProject download page:

http://www.xenproject.org/downloads/xen-archives/supported-xen-43-series/xen-432.html

This fixes the following critical vulnerabilities:
Continued…

Posted in Announcements.


Ballooning, rebooting, and the feature you’ve never heard of

Today I’d like to talk about a functionality of Xen you may not have heard of, but might have actually used without even knowing it. If you use memory ballooning to resize your guests, you’ve likely used “populate-on-demand” at some point. 

As you may know, ballooning is a technique used to dynamically adjust the physical memory in use by a guest. It involves having a driver in the guest OS, called a balloon driver, allocate pages from the guest OS and then hand those pages back to Xen. From the guest OS perspective, it still has all the memory that it started with; it just has a device driver that’s a real memory hog. But from Xen’s perspective, the memory which the device driver asked for is no longer real memory — it’s just empty space (hence “balloon”). When the administrator wants to give memory back to the VM, the balloon driver will ask Xen to fill the empty space with memory again (shrinking or “deflating” the balloon), and then “free” the resulting pages back to the guest OS (making the memory available for use again).

While this can be used to shrink guest memory and then expand it again, this technique has an important limitation: It can never grow the memory above the starting size of the VM. This is because the only way to grow guest memory is to “deflate” the balloon. Once it gets back to the starting size of the VM, the balloon is entirely deflated and no additional memory can be added by the balloon driver.

To see why this is important, consider the following scenario.

Hosts A and B both have 4GiB of RAM and two VMs with 2GiB of RAM each. Suppose you want to reboot host B to do some hardware maintenance. You could do the following:

  • Balloon all 4 VMs down to 1GiB
  • Migrate the 2 VMs from host B onto host A
  • Shut down host B to do your maintenance
  • Bring up host B
  • Migrate the 2 VMs originally on host B back
  • Balloon all 4 VMs back up to 2GiB

All well and good. But suppose that while you had one of those VMs ballooned down to 1GiB, you needed to reboot it. Now you have a problem: most operating systems will only check how much memory is available at boot time, and you only have 1GiB of free memory. If you boot with 1GiB of memory, you will be able to balloon *smaller* than 1GiB, but you will not be able to balloon back up to 2GiB when the maintenance of host B is done.
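A tiny Python model of that constraint, purely for illustration, might look like this:

```python
# Toy model of the constraint described above: a guest's ballooned memory can
# shrink below its boot-time size but can never grow past it.

class Guest:
    def __init__(self, boot_mem_gib):
        self.boot_mem = boot_mem_gib      # memory seen by the OS at boot
        self.current = boot_mem_gib       # memory actually backed by Xen

    def set_target(self, target_gib):
        # Ballooning can only reach targets up to the boot-time size.
        self.current = min(target_gib, self.boot_mem)
        return self.current

# Only 1 GiB is free during the maintenance window, so a rebooted guest
# comes up seeing 1 GiB...
vm = Guest(boot_mem_gib=1)
print(vm.set_target(0.5))   # shrinking works: 0.5
print(vm.set_target(2))     # ...but it can never balloon back up to 2 GiB: stays at 1
```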

This is where populate-on-demand comes in. It allows a VM to boot with a maximum memory larger than its current target memory. It enables a guest that thinks it has 2GiB of RAM to boot while only actually using 1GiB of RAM. It can do this because it only needs to allow the guest to run until the balloon driver can start. Once the balloon driver starts, it will “inflate” the balloon to the proper size. At that point, there is nothing special to do; the VM looks like it did when we shut it down (guest thinks it has 2GiB of RAM, but 1GiB is allocated to the balloon driver and not accessed). When host B comes back up and more memory is free, the balloon driver can deflate the balloon, bringing the total memory back up to 2GiB.

Populate-on-demand comes into play in Xen whenever you start an HVM guest with maxmem and memory set to different values. In that case, the guest will be told it has maxmem RAM, but will only have memory allocated to it; the populate-on-demand code will allow the guest to run in this mode until the balloon driver comes up and “frees” maxmem-memory back to Xen.

Virtualizing memory: A primer

In order to describe how populate-on-demand works, I’ll need to explain a bit more about how Xen virtualizes memory. On real hardware, the actual hardware memory is referred to as physical memory, and it is typically divided into 4k chunks called physical frames. These frames are addressed by their physical frame number, or pfn. In the x86 world, pfns typically start at 0 and are mostly contiguous (with the occasional “hole” for IO devices). Historically, on x86 platforms, a description of which pfns are available for use as memory is in something called the E820 map, provided by the BIOS to operating systems at boot.

When we virtualize, we need to provide the guest with a virtual “physical address space,” described in the virtual E820 map provided to the guest. These are called guest physical frame numbers, or gpfns. But of course there is still real hardware backing this memory; in the virtualization world, it is common to refer to these as machine frames, or mfns. Every usable gpfn must have an mfn behind it.

But the gpfns have to start at 0 and be contiguous, while the mfns which back them may come from anywhere in Xen’s memory. So every VM has a physical to machine translation table, or p2m table, which maps the gpfn space onto the mfn space. Each gpfn will have an entry in the table, and every useable bit of RAM has an mfn behind it. Normally this is done by the domain builder in domain 0, which will ask Xen to fill the p2m table appropriately (including any holes for IO devices if necessary).

Ballooning then works like this. To inflate the balloon, the balloon driver will ask the guest OS for a free page. After allocating the page, it puts it on its list of pages and finds the gpfn for that page. It then tells Xen it can take the memory behind the gpfn back. Xen will replace the mfn in that gpfn space with “invalid entry,” and put the mfn on its own free list (available to be given to another VM). If the guest were to attempt to read or write this memory now, it would crash; but it won’t, because the guest OS thinks the page is in use by the balloon driver. The balloon driver won’t touch it, and the OS won’t use it for anything else.

To deflate the balloon, the balloon driver chooses one of the pages on its list that it has allocated, and then asks Xen to put some memory behind that gpfn. If Xen determines that the guest is allowed to increase its memory, and there is free memory available, then it will allocate an mfn and put it in the p2m table behind that gpfn. Now the gpfn is usable again; the balloon driver then frees the page back to the guest OS, which will put it on its own free list to use for whatever needs memory.
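As a rough illustration of the mechanism just described, here is a toy model of the p2m table and the two balloon operations. These are not Xen’s actual data structures; the names are invented for clarity.

```python
# Simplified model of the p2m table and ballooning, for illustration only.

INVALID = None

class Xen:
    def __init__(self, total_mfns):
        self.free_list = list(range(total_mfns))   # machine frames not owned by any VM

    def alloc_mfn(self):
        return self.free_list.pop()

    def free_mfn(self, mfn):
        self.free_list.append(mfn)

class Domain:
    def __init__(self, xen, num_gpfns):
        self.xen = xen
        # gpfn -> mfn; the domain builder normally fills this at start of day.
        self.p2m = [xen.alloc_mfn() for _ in range(num_gpfns)]

    def balloon_inflate(self, gpfn):
        # The guest's balloon driver allocated this page and hands it back to Xen.
        mfn = self.p2m[gpfn]
        self.p2m[gpfn] = INVALID
        self.xen.free_mfn(mfn)

    def balloon_deflate(self, gpfn):
        # The guest asks Xen to put memory back behind this gpfn.
        self.p2m[gpfn] = self.xen.alloc_mfn()

xen = Xen(total_mfns=16)
dom = Domain(xen, num_gpfns=4)
dom.balloon_inflate(2)          # gpfn 2 now has no machine memory behind it
print(dom.p2m, len(xen.free_list))
dom.balloon_deflate(2)          # memory returned; gpfn 2 usable again
print(dom.p2m, len(xen.free_list))
```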

Populate on Demand: The Basics

The idea behind populate-on-demand was that the guest didn’t actually need all of its memory to boot up until the balloon driver was active — it only needed a small portion of it. But there was no way for the domain builder to know ahead of time which gpfns the guest OS will actually need to use in order to do that; nor which memory will be given to the balloon driver by the guest OS once it starts up.

So when building a domain in populate-on-demand mode the domain builder tells Xen to allocate the mfns into a special pool, which I will call here the PoD pool, according to how much memory is specified in the memory parameter. (In the Xen code it’s actually called the PoD cache, but it’s not a good name, because in computer science “cache” has a very specific meaning that doesn’t match what the PoD pool does. This will probably be renamed at some point for clarity.)

It then creates the guest’s p2m table as before, but instead of filling it with mfns, it fills it with a special PoD entry. The PoD entry is an invalid entry, so as the guest boots, whenever it touches a gpfn backed by a PoD entry, it will trap up into Xen. When Xen sees the PoD entry, it will take an mfn from the PoD pool and put it in the p2m for that gpfn. It will then return to the guest, at which point the memory access will succeed and the guest can continue.

Thus, rather than populating the p2m table when building the domain, the p2m table is populated on demand; hence the name.
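Here is a toy Python model of that behaviour, purely illustrative and with invented names: a p2m full of PoD entries being populated, on demand, from a smaller pool of pre-allocated frames as the guest touches pages.

```python
# Sketch of populate-on-demand, following the description above.
# "POD" marks a not-yet-populated p2m entry; the pool holds the mfns that
# were allocated up front for the memory= amount.

POD = "pod-entry"

class PoDDomain:
    def __init__(self, maxmem_pages, memory_pages, mfn_source):
        # p2m sized for maxmem, but only memory_pages worth of real frames.
        self.p2m = [POD] * maxmem_pages
        self.pod_pool = [mfn_source() for _ in range(memory_pages)]

    def guest_touches(self, gpfn):
        if self.p2m[gpfn] is not POD:
            return self.p2m[gpfn]            # already populated, no trap needed
        if not self.pod_pool:
            raise RuntimeError("PoD pool empty: Xen would have to crash the guest")
        self.p2m[gpfn] = self.pod_pool.pop() # populate on demand and resume the guest
        return self.p2m[gpfn]

next_mfn = iter(range(1000, 2000))
dom = PoDDomain(maxmem_pages=8, memory_pages=4, mfn_source=lambda: next(next_mfn))
for gpfn in range(4):
    dom.guest_touches(gpfn)                  # first touches fill in real frames
print(dom.p2m)
```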

The key reason for having the PoD pool is that the memory is already allocated to the domain. If you do a domain list, it shows up as owned by the domain, and it cannot be allocated to a different domain. If this were instead allocate-on-demand, where you actually allocated the memory from Xen when you hit an invalid entry, there would be a risk that the memory you needed to boot until the balloon driver could run would already have been allocated to a different domain.

However, the guest can’t run like this for long. There are far more PoD entries in the p2m table than there are mfns in the PoD pool — that was the point. But the guest OS doesn’t know that; as far as it’s concerned, it has maxmem to work with. If the balloon driver doesn’t start, nothing will keep it from trying to use all of its memory. If it uses up all the memory in the PoD pool, the next time Xen hits a PoD entry, there won’t be any mfns in the PoD pool to populate the entry with. At that point, Xen would have no choice but to kill the guest.

Getting back to normal: the balloon driver

The balloon driver, like the guest operating system, knows nothing about populate-on-demand. It just knows that it has maxmem gpfn space, and it needs to hand maxmem-memory back to Xen. So it begins allocating pages from the guest operating system, and freeing the gpfns back to Xen.

What Xen does next depends on a few things. Xen keeps track of both the number of PoD entries in the p2m table, and the number of mfns in the PoD pool.

  • If the gpfn is a PoD entry, Xen will simply replace the PoD entry with a normal invalid entry and return. This reduces the number of outstanding PoD entries in the p2m table.
  • If the gpfn has a real mfn behind it, and the number of PoD entries left in the p2m table is more than the number of mfns in the PoD pool, Xen will replace the entry with an invalid entry, and put the mfn back into the PoD pool. This increases the size of the pool.
  • If the gpfn has a real mfn behind it, but the number of PoD entries left in the p2m table is equal to the number of mfns in the pool, it will put the mfn back on the free list, ready to be used by another domain.

Eventually, the number of outstanding PoD entries is equal to the number of entries in the PoD pool, and the system is now in a stable state. There is no more risk that the guest will touch a PoD entry and not find memory in the pool; and for an active OS, eventually all pages will be touched, and the VM will be the same as one booted not in PoD mode.
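A small, purely illustrative sketch of those three cases (the counters and names are not Xen’s actual ones) might look like this:

```python
# Sketch of what happens when the balloon driver frees a gpfn back to Xen in a
# populate-on-demand domain, following the three cases listed above.

POD, INVALID = "pod-entry", None

def balloon_handback(p2m, pod_pool, free_list, gpfn):
    outstanding_pod = sum(1 for e in p2m if e is POD)
    entry = p2m[gpfn]
    if entry is POD:
        # Case 1: nothing was ever populated here; just mark it invalid.
        p2m[gpfn] = INVALID
    elif outstanding_pod > len(pod_pool):
        # Case 2: the pool does not yet cover every outstanding PoD entry,
        # so keep this frame in the pool.
        p2m[gpfn] = INVALID
        pod_pool.append(entry)
    else:
        # Case 3: the pool already covers all outstanding PoD entries,
        # so the frame can go back to Xen's global free list.
        p2m[gpfn] = INVALID
        free_list.append(entry)

p2m = [101, 102, POD, POD, POD]     # two populated frames, three PoD entries
pod_pool, free_list = [103], []
balloon_handback(p2m, pod_pool, free_list, 2)   # case 1
balloon_handback(p2m, pod_pool, free_list, 0)   # case 2 (2 PoD entries > 1 in pool)
balloon_handback(p2m, pod_pool, free_list, 1)   # case 3 (pool now covers them)
print(p2m, pod_pool, free_list)
```

After the three calls, the number of outstanding PoD entries equals the number of frames in the pool, which is the stable state described above.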

It’s never that simple: Page scrubbing

At a high level, that’s the idea behind populate-on-demand. Unfortunately, the real world is often a bit more messy than we would like.

On real hardware, if you do a soft reboot (or if you do some special trick, like spraying the RAM with liquid nitrogen), the memory when the operating system starts may still contain information from a previous boot. The freshly booting operating system has no idea what may be in there: it may be security-sensitive information like someone’s tax records or private keys.

To avoid any risk that information from the previous boot might leak into untrusted programs which might run this time, most operating systems will scrub the memory at boot — that is, fill all the memory with zeros. This also means that drivers can assume that freshly allocated memory will already be zeroed, and not bother doing it themselves. Doing this all at once, at the beginning, allows the operating system to use more efficient algorithms, and also localizes the processor cache pollution.

For an operating system running under Xen this is unnecessary, because Xen will scrub any memory before giving it to the guest (for pretty much the same potential security issue). However, many operating systems which run on Xen — in particular, proprietary operating systems like Windows — don’t know this, and will do their own scrub of memory anyway. Typically this happens very early in boot, long before it is possible to load the balloon driver. This pretty much guarantees that every gpfn will be written to before the balloon driver loads. How does populate on demand deal with that?

The key is that the state of a gpfn after it has been scrubbed by the operating system is the same as the default initial state of a gpfn just populated by the PoD code. This means that after a gpfn has been scrubbed by the operating system, Xen can reclaim the page: it can replace the mfn in the p2m table with a PoD entry, and put the mfn in the PoD pool. The next time the VM touches the page, it will be replaced with a different zero page from the PoD pool; but to the VM it will look the same.

So the populate-on-demand system has a number of zero-page reclaim techniques. The primary one is that when populating a new PoD entry, we look at recently populated entries, and if they are zero, we reclaim them. The effect of this is that each scrubbing thread only has one outstanding PoD page at a time.

If that fails, there is another technique we call the “emergency sweep.” When Xen hits a PoD entry, but the PoD pool is empty, before crashing the guest, it will search through all of guest memory, looking for zeroed pages to reclaim. Because this method is very slow, it is only used as a last resort.
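Putting the two reclaim techniques together, a toy model might look like the sketch below. The names are invented, and a crude all-zero comparison stands in for Xen’s page scan; the point is only to show why a guest that scrubs far more memory than it has frames for can still boot.

```python
# Toy model of zero-page reclaim and the emergency sweep described above.

POD = "pod-entry"
PAGE = 4096

class PoDDomainWithReclaim:
    def __init__(self, maxmem_pages, memory_pages):
        self.p2m = [POD] * maxmem_pages
        self.pod_pool = list(range(memory_pages))     # pre-allocated frames
        self.page_data = {}                           # mfn -> bytes (model of RAM)
        self.recently_populated = []

    def _reclaim(self, gpfn):
        # A scrubbed (all-zero) page looks identical to a fresh PoD page,
        # so its frame can safely go back into the pool.
        mfn = self.p2m[gpfn]
        if mfn is not POD and self.page_data.get(mfn, b"") == bytes(PAGE):
            self.p2m[gpfn] = POD
            self.pod_pool.append(mfn)

    def guest_touches(self, gpfn, data=bytes(PAGE)):
        if self.p2m[gpfn] is POD:
            if not self.pod_pool:
                # Emergency sweep: scan all of guest memory for zero pages.
                for g in range(len(self.p2m)):
                    self._reclaim(g)
                if not self.pod_pool:
                    raise RuntimeError("still empty: the guest would be killed")
            # Primary technique: reclaim zeroed pages among recent populations.
            for g in self.recently_populated[-4:]:
                self._reclaim(g)
            self.p2m[gpfn] = self.pod_pool.pop()
            self.recently_populated.append(gpfn)
        self.page_data[self.p2m[gpfn]] = data         # the guest's write/scrub

dom = PoDDomainWithReclaim(maxmem_pages=8, memory_pages=2)
for gpfn in range(8):                                 # boot-time scrub of all 8 pages
    dom.guest_touches(gpfn)                           # succeeds despite only 2 frames
print(sum(1 for e in dom.p2m if e is not POD),
      "page(s) still populated after scrubbing 8 pages with only 2 frames")
```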

Conclusion

So that’s populate-on-demand in a nutshell. There are more complexities under the hood (like trying to keep superpages together), but I’ll leave those for another day.

Posted in Xen Development, Xen Hypervisor.
