QEMU for and on ARM cores

Draining the CP15 Swamp

A surprisingly large amount of the work we’ve been doing with QEMU and with KVM on ARM has been trying to get handling of CP15 correct.

CP15 is the System Control coprocessor; the architecture manual says it is for “control and configuration of the ARM processor system, including architecture and feature identification”. So this is the place where the control knobs for all the interestingly complicated processor features live: MMU, TLBs, caches, TrustZone access controls, performance monitors, virtualization… and complicated features need a lot of control knobs. Although early system control coprocessors were very simple (ARMv3 system coprocessors had just 8 registers), a modern ARMv7A processor like the Cortex-A15 has about 150 different CP15 registers.

The difficulty for QEMU is twofold. Firstly, the CP15 emulation code has grown organically along with the architecture. When we were dealing with 8 to 16 registers a simple set of switch statements was workable. As registers have been added the switch statements have got more and more cumbersome. Secondly, unlike hardware we want to support multiple CPUs in the same codebase, so we need to deal with all these variations simultaneously. As we added more conditionals things rapidly became unreadable. Registers were being defined for more CPUs than they should be, and it was hard to add new registers without accidentally breaking other CPUs, especially where some older CPUs defined registers that were reused for different purposes in newer architecture versions, or where the older CPU didn’t completely decode the CP15 instructions and so provided the same register in several different locations.

I spent a fair amount of time earlier this year rewriting QEMU’s CP15 code to use a more data-driven approach. Each register is described by a structure like this:

    { .name = "FCSEIDR", .cp = 15, .crn = 13, .crm = 0, .opc1 = 0, .opc2 = 0,
      .access = PL1_RW, .fieldoffset = offsetof(CPUARMState, cp15.c13_fcse),
      .resetvalue = 0, .writefn = fcse_write },

which concisely describes where it sits in the coprocessor, its read/write access permissions, what fields of QEMU’s CPUARMState structure hold the information, and any special-purpose read or write accessor functions that might be needed. At startup we simply define the right registers based on the CPU feature bits. The rewrite also throws in some useful new features like support for 64 bit coprocessor registers and much better support for UNDEFfing on bad register accesses.
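To make the data-driven idea concrete, here is a minimal sketch of table-based lookup. It is not QEMU's actual code: the `DemoCPUState` and `DemoRegInfo` types, the `demo_*` functions and the two-entry table are all hypothetical, cut-down stand-ins invented for illustration, and the real structure carries more fields (access levels, reset values, accessor callbacks).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down CPU state: just two backing fields. */
typedef struct DemoCPUState {
    uint32_t c13_fcse;
    uint32_t c13_context;
} DemoCPUState;

/* Hypothetical register description, loosely modelled on the entry above. */
typedef struct DemoRegInfo {
    const char *name;
    uint8_t crn, crm, opc1, opc2;   /* where the register sits in CP15 */
    int writable;                   /* 0 = read-only, 1 = read/write */
    size_t fieldoffset;             /* offset of the backing field */
} DemoRegInfo;

static const DemoRegInfo demo_regs[] = {
    { "FCSEIDR",    13, 0, 0, 0, 1, offsetof(DemoCPUState, c13_fcse) },
    { "CONTEXTIDR", 13, 0, 0, 1, 1, offsetof(DemoCPUState, c13_context) },
};

/* Linear search of the table; returns NULL if no register is defined there. */
static const DemoRegInfo *demo_find_reg(unsigned crn, unsigned crm,
                                        unsigned opc1, unsigned opc2)
{
    size_t i;
    for (i = 0; i < sizeof(demo_regs) / sizeof(demo_regs[0]); i++) {
        const DemoRegInfo *ri = &demo_regs[i];
        if (ri->crn == crn && ri->crm == crm &&
            ri->opc1 == opc1 && ri->opc2 == opc2) {
            return ri;
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 if the access should UNDEF. */
static int demo_cp15_read(const DemoCPUState *env, unsigned crn, unsigned crm,
                          unsigned opc1, unsigned opc2, uint32_t *value)
{
    const DemoRegInfo *ri = demo_find_reg(crn, crm, opc1, opc2);
    if (!ri) {
        return -1;
    }
    memcpy(value, (const char *)env + ri->fieldoffset, sizeof(*value));
    return 0;
}

/* Returns 0 on success, -1 if the access should UNDEF. */
static int demo_cp15_write(DemoCPUState *env, unsigned crn, unsigned crm,
                           unsigned opc1, unsigned opc2, uint32_t value)
{
    const DemoRegInfo *ri = demo_find_reg(crn, crm, opc1, opc2);
    if (!ri || !ri->writable) {
        return -1;
    }
    memcpy((char *)env + ri->fieldoffset, &value, sizeof(value));
    return 0;
}
```

The point of the design is that adding a register means adding a table row, and an access to anything not in the table falls out naturally as an UNDEF rather than hitting a forgotten switch case.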

This is much easier to work with and we’re starting to see the benefits. When I wrote the LPAE support patches (which have just landed upstream) it was really easy to add the necessary new registers and modify the behaviour of some of the existing ones.

On the kernel side, we currently only support the Cortex-A15, but we’re anxious to keep things clean from the start (and we have the added incentive that if we fail to handle a CP15 register it could potentially let the guest mess with the host’s CPU state, which would be a security hole). Rusty Russell has just posted a patchset to the KVM ARM mailing list which also drives the CP15 emulation from a data table. These patches create a flexible userspace-to-kernel ABI (borrowed from the x86 handling of MSRs) which lets QEMU query the kernel for which registers it supports and read and write only the registers that both QEMU and the kernel know about. This should help avoid nasty binary compatibility breaks in the future when we add code to deal with new CP15 registers.

We’re not completely done yet; for instance we still need to think about how we handle possible compatibility issues with migration of a VM between QEMU instances which are different versions of QEMU and might have different CP15 support. But we’ve definitely drained a fair amount of the muddy water from this swamp and dispatched a few of the alligators…

Written by pm215

July 22, 2012 at 6:37 pm

Posted in linaro, qemu

This End Up…

I’ve just been reading the ARM ARM on the subject of big-endian support. It’s quite complicated now (as with many bits of the architecture), especially if like QEMU you need to support both old obsolete features and their new replacements. First, a quick summary:

ARM v4 and v5 supported a big-endian model now known as BE32 (although at the time it was just big-endian mode). The key features of BE32 are:

  • word invariant: this means that if you store a 32 bit word in little-endian mode, then flip to big-endian and reload it, you’ll get the same value back. However, if you do a byte load in big-endian mode you’re reading a different byte of RAM than you would for a byte load of the same address in little-endian mode. (Under the hood, the hardware adjusts the addresses for loads and stores of bytes and halfwords.)
  • operates on all memory accesses: data loads and stores, instruction fetches and translation table walks.
  • system wide: it is controlled by bit 7 in the System Control Register (SCTLR.B), and only the operating system can set or clear this. (Implementations might make the bit read-only if they don’t support big-endian mode or if they only allow it to be set via an external signal on reset.)

ARM v6 deprecated BE32 and introduced BE8 as its replacement. Key features:

  • byte invariant: a byte load from address X in little-endian mode accesses the same data as a byte load from X in big-endian mode. However, a word access in big-endian mode will return a word whose bytes are in the opposite order to the same word access in little-endian mode. (Instead of fiddling with addresses like BE32 hardware, BE8 hardware simply flips the four bytes of data for 32 bit accesses, and flips two bytes of data for 16 bit accesses.)
  • only operates on data accesses. Loads and stores done by the program will be in big-endian order, but when the CPU fetches instructions it does so little-endian. This means that self-modifying code needs to know it’s in BE8 mode, because the instruction words it reads from memory will appear to it to be the “wrong way” round, because the CPU reads instructions in little-endian mode and so they must always be in RAM that way round. Since executables are loaded into memory without distinguishing code from data, this also means that when the toolchain writes out a BE8 executable it effectively needs to flip the instructions. This is usually done in the linker.
  • potentially per-user-process: the main control bit is the CPSR.E bit, which can be changed with the unprivileged SETEND instruction. So that the OS gets a predictable data endianness there is a new bit SCTLR.EE in the System Control Register (“exception endianness”) which controls the value of CPSR.E on exception entry; it also determines endianness used for translation table walks.

Notice that both “byte invariant” and “word invariant” approaches meet the key big-endian requirement that if the CPU stores a word 0x12345678 to an address and then reads back a byte from that address it will read 0x12. You can only tell the difference if you have some other way to look at the actual bytes in memory (for instance if you have a second little-endian processor in the system that can read the RAM, or if you can switch the CPU back into little-endian mode).
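The difference between the two models can be sketched in a few lines of C. This is only an illustration: the `ram` array stands for memory as a little-endian observer would see it, and all the function names are made up for the example. BE32 leaves word data alone and adjusts byte addresses (XOR with 3); BE8 leaves byte addresses alone and reverses the bytes of word data.

```c
#include <stdint.h>

/* Store a word the way a little-endian CPU would. */
static void le_store32(uint8_t *ram, uint32_t addr, uint32_t v)
{
    ram[addr]     = (uint8_t)(v & 0xff);
    ram[addr + 1] = (uint8_t)((v >> 8) & 0xff);
    ram[addr + 2] = (uint8_t)((v >> 16) & 0xff);
    ram[addr + 3] = (uint8_t)((v >> 24) & 0xff);
}

static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* BE32 (word invariant): word accesses are untouched... */
static void be32_store32(uint8_t *ram, uint32_t addr, uint32_t v)
{
    le_store32(ram, addr, v);
}

/* ...but byte accesses have their address adjusted. */
static uint8_t be32_load8(const uint8_t *ram, uint32_t addr)
{
    return ram[addr ^ 3];
}

/* BE8 (byte invariant): word data is byte-reversed... */
static void be8_store32(uint8_t *ram, uint32_t addr, uint32_t v)
{
    le_store32(ram, addr, bswap32(v));
}

/* ...but byte accesses go to the same address as in little-endian mode. */
static uint8_t be8_load8(const uint8_t *ram, uint32_t addr)
{
    return ram[addr];
}
```

Storing 0x12345678 at address 0 and reading back the byte at address 0 gives 0x12 under both models, but the raw contents of `ram[0]` differ: 0x78 for BE32 and 0x12 for BE8.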

A v6 core can support both BE32 and BE8, so it still has the SCTLR.B bit. Attempting to turn them both on at once is (fortunately!) UNPREDICTABLE…

In ARMv7 BE32 was dropped completely, so SCTLR.B will always read as zero. However, for R profile only, implementations may support reversing byte order for instruction accesses as well as data. If this is provided then it’s only changeable by asserting an input signal to the CPU on reset. A new System Control Register bit SCTLR.IE tells you whether this instruction endianness flipping is in effect. A system with SCTLR.EE, SCTLR.IE and CPSR.E all set looks pretty similar to a BE32 system from the point of view of the code running on the CPU.

So how does this fit in to QEMU? QEMU’s basic model of endianness is that it is a fixed thing; targets are at compile time specified to be big- or little-endian, and the QEMU core then swaps data if the host and guest are of differing endianness; all memory and device accesses are assumed to be of the same endianness. This is really a kind of byte-invariant big-endianness, but we can use it to implement support for BE32 systems provided that you can never switch back into little-endian mode. In fact, QEMU’s current armeb targets provide exactly this fixed always-BE32 system. [Update: we don’t have any BE32 system targets currently, only the linux-user one, but in theory it should work.]

We don’t currently support BE8, and to do so we need to support separate control of data and code access byteswapping. Paul Brook has posted some patches to add BE8 support to linux-user mode, again as a fixed always-on setting (automatically enabled if the ELF file we’re running specifies that it is BE8). This works by telling QEMU’s core that the guest CPU is big-endian (which means data accesses are correct); we then have manual code to swap back the values when we’re doing a read which is an instruction access. This is much simpler than trying to swap only the data accesses, because there are far fewer places where we read words as instructions. The inefficiency of swapping twice is not as bad as it might seem, because we will only do it when we first read code to translate it; subsequent reexecution of the instruction will just reexecute the translated code. I expect this user-mode-only BE8 support to get into upstream QEMU and qemu-linaro within a month or so.

BE8 in system mode would be trickier, and ideally we’d support dynamic endianness switching. The simplest approach would be to have QEMU treat the system as “little-endian”, and then do the byteswapping for data accesses by translating a LDR instruction as “load 32 bits; byteswap 32 bit word”, and so on. Of course if you were running in BE8 mode on a big-endian host system you’d end up swapping everything twice; it would be more efficient to add some support to QEMU’s core for this. However there isn’t really much demand for BE8 system mode support at the moment, so we don’t have any plans to work on it.
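As a rough sketch of that "simplest approach" (and not anything that exists in QEMU), the core would always do little-endian loads, with the byteswap applied to data accesses only when the E bit is set. The function names and the `cpsr_e` parameter here are assumptions made up for the example.

```c
#include <stdint.h>

/* The core's view: every load is little-endian. */
static uint32_t demo_le_load32(const uint8_t *ram, uint32_t addr)
{
    return (uint32_t)ram[addr] |
           ((uint32_t)ram[addr + 1] << 8) |
           ((uint32_t)ram[addr + 2] << 16) |
           ((uint32_t)ram[addr + 3] << 24);
}

static uint32_t demo_bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Instruction fetch: always little-endian in BE8, so never swapped. */
static uint32_t demo_fetch_insn(const uint8_t *ram, uint32_t addr)
{
    return demo_le_load32(ram, addr);
}

/* LDR: "load 32 bits; byteswap 32 bit word" when CPSR.E is set. */
static uint32_t demo_ldr(const uint8_t *ram, uint32_t addr, int cpsr_e)
{
    uint32_t v = demo_le_load32(ram, addr);
    return cpsr_e ? demo_bswap32(v) : v;
}
```

Because the swap is decided per access rather than per target, a guest SETEND would only need to change the flag that the translated code consults.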

Written by pm215

April 2, 2012 at 6:53 pm

Posted in linaro, qemu

ARM ARM update

The latest revision of the ARM ARM (or to give it its full title, the ARMv7-AR Architecture Reference Manual) was released this week. (It’s available from the ARM Infocenter website; you need to register as a user on the website to be able to download it, but this is a quick and painless process.) If you’ve got a copy of revision B you should grab rev C now. It folds the previously separate documentation of the virtualization and LPAE extensions into the main architecture specification, and sweeps up a few loose ends like documentation of fused multiply-accumulate.

Working on CPU and device models means spending quite a lot of time looking at hardware reference manuals; you quickly develop an appreciation for the good ones.

Writing a model is creating a from-scratch reimplementation of the hardware. Unfortunately hardware documentation is often written for the device driver writer, not the implementor. You can see this difference of focus most clearly in documents that use phrasing like “you must do X” but which don’t say what happens when you do something else. That’s fine for a device driver writer, who can just stay safely in the area the documentation describes, but to write a good model you also need to know how to behave when the guest OS does do something non-standard. The ARM ARM scores well here, describing both sides of the hardware/software contract rather than merely making rules for software; it also carefully marks out the areas which are implementation defined or unpredictable.

I also like documentation that doesn’t skimp on the details. If I’m halfway through writing some CPU emulation code and I reach a corner case, I want to be able to grab the manual and look up exactly how that corner case needs to be handled. The ARM ARM’s extensive use of pseudocode is a fantastic help here — it acts as a guide for the authors to ensure they really did write down all the corner case behaviours, and it’s a concise and unambiguous way to communicate them. (There’s a price, of course — the rev C is over 2600 pages — but I’ll willingly pay that.)

So it’s cool to see a new revision of an old friend; I wish everybody else’s docs were this good!

Written by pm215

December 2, 2011 at 10:04 pm

Posted in linaro, qemu

Computing a*b+c : how hard can it be?

Recently I’ve been working on adding “fused multiply-accumulate” support to QEMU (‘FMAC’ for short); the patches were accepted upstream earlier this week and will be in upstream QEMU 1.0, and have already appeared in qemu-linaro 2011.10.

FMAC is easy enough to describe: it’s just a floating point operation that computes (a * b) + c without doing the intermediate rounding that would happen if you used separate multiplication and addition instructions. It was added to the IEEE floating point arithmetic standard in IEEE 754-2008, and (as Wikipedia shows) has been implemented in various CPU architectures. It appears in the ARM architecture starting with VFPv4, which is implemented in the Cortex-A5 and Cortex-A15 cores, as the instructions VFMA, VFMS, VFNMA, and VFNMS. (Don’t confuse these with the older VMLA, VNMLA and friends, which do similar operations but do perform the rounding between the multiply and the addition.) Implementation is slightly more complicated than result = (a * b) + c, however, and it takes over 250 lines of code just to implement this for single precision…
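A small illustration of why the intermediate rounding matters. This is a sketch for specific inputs, not a general FMAC implementation: the "fused" version relies on the fact that the exact product of two single-precision values fits in a double, but the double addition and the final conversion can each round, so it is not correctly rounded for all inputs. Both function names are invented for the example.

```c
/* Separate multiply then add: the product is rounded to single precision
 * before the addition. The volatile forces that rounding to really happen. */
static float demo_mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Fused-style: keep the exact product (24 x 24 bits fits in a double's
 * 53-bit mantissa) and only round when converting the final sum. */
static float demo_fused(float a, float b, float c)
{
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. The separate version rounds the product to 1 + 2^-11 and returns 0; the fused-style version keeps the low bit and returns 2^-24.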

When QEMU generates code for target floating point operations, it doesn’t turn them into floating point instructions for the host CPU; instead they are emulated using integer operations only. This surprises some people, since almost all CPUs support IEEE754 floating point, which specifies bit-exact results. Unfortunately, there is a swathe of special cases where IEEE permits implementation-defined results, or where CPUs have modes which deviate from IEEE for performance reasons; in order to get these right QEMU is forced to do floating point “the hard way”.

FMAC itself provides some good examples of these special cases:

  • NaN propagation

    IEEE says that if you have an operation which has several inputs which are NaN (“Not a Number”, a special ‘error’ representation generated for things like square roots of negative numbers) then the result will be one of the input NaNs. However it’s up to the implementation which one it picks, and different CPU architectures make different choices. FMAC is interesting here because it is the only three-input operation in IEEE, and so the “pick a NaN” code for it is completely FMAC specific.

  • Denormal handling

    IEEE defines arithmetic on “denormal” numbers (which are so close to zero that they can’t be represented except with reduced precision). Handling these is slower than dealing with normal numbers, so some CPU architectures allow the user to select a “fast” mode where denormals are “flushed” to zero, either on input or on output or both. Implementations vary a lot on how this is controlled, which kind of flushing is done and whether status flags are raised when flushing occurs. On ARM, flushing of denormals can be enabled via an FPSCR bit. Neon instructions also always work in “flush denormals” mode; this means that C compilers typically won’t use the Neon variants for floating point arithmetic unless the user enables a “fast math” mode. FMAC is no exception here — there are VFP instructions which are (by default) fully IEEE compliant and Neon versions which flush.

  • Architecture-specific choices of negation

    IEEE specifies only the basic (a * b) + c operation; most CPU architectures have extended this to provide some negated variants, but not always in the same way. So for instance x86 provides -(a * b) + c, but PPC has -((a * b) + c), and ARM has (-a * b) + c. It might look at first as if you can implement some of these by providing a small set of core functions and having the architecture-specific code negate inputs and outputs. This unfortunately doesn’t always work for the special case where one of the inputs is a NaN. In some cases (like ARM) the instruction is defined to be implemented as a simple negation followed by a multiply-add — in this case if the negated input was a NaN it will emerge from the other side with its sign bit flipped. But some CPU architectures, like PPC, specify that the whole instruction including negation is a single operation for NaN handling purposes: if the negated input is a NaN it must not have its sign bit flipped. This means the negation needs to be handled in the “core” floating point emulation so that it can be done after the special-case processing of NaN inputs.

  • (0 * Inf) + QNaN

    IEEE allows an implementation choice in how the special case of (0 * Inf) + QNaN is handled — it will always result in a QNaN, but whether the InvalidOp exception flag should be raised is implementation defined. (If your conceptual model of this operation is that you first do the multiply before you even look at the addend, then you’ll raise InvalidOp because multiplying zero by infinity isn’t a valid operation. If your model is that handling of NaN inputs is the first step before you start computing anything, then you won’t raise InvalidOp because you won’t get to the point of trying to multiply anything.)
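A few of these special cases are easy to show on the raw 32-bit encoding of a single-precision value. The helpers below are a simplified sketch with invented names, not QEMU's softfloat functions. In particular `demo_propagated_nan` only models the path where the negated operand is already known to be a NaN, and it ignores which of several NaNs is picked and the quieting of signalling NaNs.

```c
#include <stdint.h>

#define F32_SIGN_MASK 0x80000000u
#define F32_EXP_MASK  0x7f800000u
#define F32_FRAC_MASK 0x007fffffu

/* NaN: exponent all ones, fraction nonzero. */
static int f32_is_nan(uint32_t bits)
{
    return (bits & F32_EXP_MASK) == F32_EXP_MASK &&
           (bits & F32_FRAC_MASK) != 0;
}

/* Denormal: exponent zero, fraction nonzero. */
static int f32_is_denormal(uint32_t bits)
{
    return (bits & F32_EXP_MASK) == 0 && (bits & F32_FRAC_MASK) != 0;
}

/* Flush-to-zero on input: a denormal becomes a zero of the same sign. */
static uint32_t f32_flush_input(uint32_t bits)
{
    return f32_is_denormal(bits) ? (bits & F32_SIGN_MASK) : bits;
}

/*
 * Which bit pattern is propagated when the negated operand 'a' is a NaN?
 * negate_before_nan_check != 0: the negation is a separate step in front of
 *   the multiply-add, so the NaN arrives with its sign bit already flipped.
 * negate_before_nan_check == 0: the negation is part of the operation and
 *   NaN handling comes first, so the NaN is propagated unchanged.
 */
static uint32_t demo_propagated_nan(uint32_t a, int negate_before_nan_check)
{
    if (negate_before_nan_check) {
        a ^= F32_SIGN_MASK;
    }
    return a;
}
```

The two orderings give 0xffc00000 and 0x7fc00000 respectively for an input of 0x7fc00000, which is exactly the kind of difference a guest can observe.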

Handling of special cases takes up about half the FMAC routine, but even when we’re down to the common case it’s still about a hundred lines of code, because we have to do a multiply, and then either an addition or a subtraction, all with only integer operations. To understand how this works you have to know that a floating point number is represented as (-1)^s * 2^e * m, where s, e and m are the sign bit, exponent and mantissa (fractional part). If you break out the separate fields for the input operands then you can calculate the required result sign, exponent and mantissa, and then pack them back into the floating point binary format. I’m just going to sketch the general idea here (and in particular I have omitted some details such as calculation of the result’s sign bit).
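Breaking out the fields looks roughly like this for single precision. It is a sketch for normal numbers only (no zeroes, denormals, infinities or NaNs), and the `DemoF32Parts` type and function names are invented for the example. The mantissa is returned with the implicit leading 1 made explicit, so the value is (-1)^sign * 2^exp * (mant / 2^23).

```c
#include <stdint.h>
#include <string.h>

typedef struct DemoF32Parts {
    uint32_t sign;   /* 0 or 1 */
    int32_t exp;     /* unbiased exponent */
    uint32_t mant;   /* 24 bits, implicit leading 1 included (bit 23) */
} DemoF32Parts;

/* Split a normal single-precision value into its three fields. */
static DemoF32Parts demo_unpack(float f)
{
    uint32_t bits;
    DemoF32Parts p;
    memcpy(&bits, &f, sizeof(bits));
    p.sign = bits >> 31;
    p.exp = (int32_t)((bits >> 23) & 0xff) - 127;   /* remove the bias */
    p.mant = (bits & 0x007fffffu) | 0x00800000u;    /* add the hidden bit */
    return p;
}

/* Reassemble the fields into the binary format. */
static float demo_pack(DemoF32Parts p)
{
    uint32_t bits = (p.sign << 31) |
                    ((uint32_t)(p.exp + 127) << 23) |
                    (p.mant & 0x007fffffu);         /* drop the hidden bit */
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}
```

For example, -6.5 is -1.625 * 2^2, so it unpacks to sign 1, exponent 2 and mantissa 0xD00000.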

  • Multiplication:

    (2^a * b) * (2^c * d) == 2^(a+c) * b * d so all we need to do is add the exponents and multiply the mantissae. Well, nearly all: then we need to fix up the result if the most significant bit of the mantissa isn’t in the right place, by shifting the mantissa and decrementing the exponent. This doesn’t affect the answer because it’s doing both a multiplication by two (the shift) and a division by two (the exponent adjustment), and they cancel out. Of course multiplying the two 24-bit mantissae (23 stored bits plus the implicit leading 1) gives 48 bits of product, so the following addition or subtraction has to be done at the widened precision, to satisfy the “no intermediate rounding” requirement.

  • Addition:

    We can adjust one of the inputs (by the same trick of shifting the mantissa and adjusting the exponent) so that both inputs have the same exponent. Then we can calculate the result mantissa by adding the two input mantissae, because (2^x * y) + (2^x * z) == (2^x * (y + z)). As with multiplication, we may then need to fix up the result to put the most significant bit of the mantissa in the right place.

  • Subtraction:

    This is like addition, although the details are sufficiently different that it has to be done as a separate code path.
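The multiplication step can be sketched with integer operations as follows. This is a simplified illustration and not the code in QEMU: it handles only normal, nonzero, finite inputs, and the `DemoProduct` type and `demo_mul_exact` name are invented. It keeps the full 48-bit product rather than rounding, which is what the following addition needs.

```c
#include <stdint.h>
#include <string.h>

/* value = (-1)^sign * 2^exp * (mant48 / 2^47), with bit 47 of mant48 set */
typedef struct DemoProduct {
    uint32_t sign;
    int32_t exp;
    uint64_t mant48;
} DemoProduct;

static DemoProduct demo_mul_exact(float a, float b)
{
    uint32_t abits, bbits;
    uint64_t ma, mb;
    DemoProduct p;

    memcpy(&abits, &a, sizeof(abits));
    memcpy(&bbits, &b, sizeof(bbits));

    /* 24-bit mantissae with the implicit leading 1 made explicit */
    ma = (abits & 0x007fffffu) | 0x00800000u;
    mb = (bbits & 0x007fffffu) | 0x00800000u;

    p.sign = (abits ^ bbits) >> 31;

    /* Add the unbiased exponents, assuming for now that the product's
     * top bit lands in bit 47 (hence the +1). */
    p.exp = ((int32_t)((abits >> 23) & 0xff) - 127) +
            ((int32_t)((bbits >> 23) & 0xff) - 127) + 1;

    /* 24 x 24 bits: the product lies in [2^46, 2^48) */
    p.mant48 = ma * mb;

    /* Fix up if the top bit is in bit 46 instead: shift the mantissa
     * and decrement the exponent, which cancel out. */
    if (!(p.mant48 & (1ull << 47))) {
        p.mant48 <<= 1;
        p.exp -= 1;
    }
    return p;
}
```

For 1.5 * 1.5 this gives exponent 1 and mantissa 0x900000000000, that is 1.125 * 2^1 = 2.25. Only bits 47 down to 24 would survive rounding to single precision; anything below that is what the separate multiply instruction throws away and the fused one keeps.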

Working through this really solidified my understanding of what QEMU’s floating point emulation code was doing, and if you’re curious about what’s actually lurking behind the ‘single’ and ‘double’ data types I’d encourage you to work through the function with a copy of the IEEE specification and try to figure out why it gives you the right answers…

Written by pm215

November 6, 2011 at 2:42 am

Posted in linaro, qemu

Some QEMU patch statistics

I’m Peter Maydell, and I work in the Linaro Toolchain Working Group on QEMU, the open source emulator, improving its support for ARM processors. This blog is a place for me to talk about what Linaro is doing with QEMU, interesting corners of the ARM architecture from a CPU modelling perspective, neat QEMU tricks, and generally anything in the intersection of “ARM” and “QEMU”. To kick things off I thought I’d start with a kind of high-level summary of what we’ve been doing so far…

At this year’s KVM Forum Anthony Liguori presented some statistics about contributions to QEMU over the last year. Linaro came in third in this league table of ‘corporate’ contributors, which I think is pretty impressive given we only really started seriously upstreaming patches about halfway through the year:

Company                   | Commits | Percent
Red Hat                   |     962 |     31%
IBM                       |     343 |     11%
Linaro                    |     222 |      7%
Siemens                   |     159 |      5%
SuSE                      |     124 |      4%

So what exactly were we doing with those 200+ patches? I did a little analysis of our committed patches, dividing them into four broad categories:

  • arm: patches to the core ARM CPU emulation code
  • board: patches to ARM boards and devices; this includes fixing bugs in existing board models like versatilepb and also adding new boards (vexpress-a9)
  • linux-user: patches to the code that implements QEMU’s “linux user” mode; mostly this is adding support for passing through new system calls
  • other: everything else

and generated this graph of committed patches every month by category:

[Graph of patches submitted by date]

You can see that initially we were doing a lot of ARM emulation code fixes, correcting a lot of bugs in the Neon and VFP emulation code. That activity is now complete, and for the last few months patches in this category have been typically code cleanup or adding new instructions rather than fixing bugs in the existing code. In its place, we’ve been doing more patches to ARM boards and devices. As well as getting an A9 Versatile Express board model upstream in time for QEMU 0.15, we’ve been working recently on getting the OMAP3 support patches from the Meego tree upstream, which you can see pretty clearly in the sudden increase in board patches in August.

And future patches? Well, there’s work to do to support the Cortex-A15, starting with some new user-visible instructions for integer divide and floating-point fused multiply-accumulate. We also need an A15 Versatile Express board model, which will tie in with the work being done for KVM virtualization on A15. And we’re planning to push on with the OMAP3 upstreaming as a “background” task. So I reckon Linaro will still be a strong contender in next year’s “patches league table”…

Written by pm215

October 14, 2011 at 11:57 am

Posted in linaro, qemu