## Archive for **November 2011**

## Computing a*b+c : how hard can it be?

Recently I’ve been working on adding “fused multiply-accumulate” support to QEMU (‘FMAC’ for short); the patches were accepted upstream earlier this week and will be in upstream QEMU 1.0, and have already appeared in qemu-linaro 2011.10.

FMAC is easy enough to describe: it’s just a floating point operation that computes `(a * b) + c`

without doing the intermediate rounding that would happen if you used separate multiplication and addition instructions. It was added to the IEEE floating point arithmetic standard in IEEE 754-2008, and (as Wikipedia shows) has been implemented in various CPU architectures. It appears in the ARM architecture starting with VFPv4, which is implemented in the Cortex-A5 and Cortex-A15 cores, as the instructions `VFMA`

, `VFMS`

, `VFNMA`

, and `VFNMS`

. (Don’t confuse these with the older `VMLA`

, `VNMLA`

and friends, which do similar operations but *do* perform the rounding between the multiply and the addition.) Implementation is slightly more complicated than `result = (a * b) + c`

, however, and it takes over 250 lines of code just to implement this for single precision…

When QEMU generates code for target floating point operations, it doesn’t turn them into floating point instructions for the host CPU; instead they are emulated using integer operations only. This surprises some people, since almost all CPUs support IEEE754 floating point, which specifies bit-exact results. Unfortunately, there are a swathe of special cases where IEEE permits implementation-defined results, or where CPUs have modes which deviate from IEEE for performance reasons; in order to get these right QEMU is forced to do floating point “the hard way”.

FMAC itself provides some good examples of these special cases:

- NaN propagation
IEEE says that if you have an operation which has several inputs which are NaN (“Not a Number”, a special ‘error’ representation generated for things like square roots of negative numbers) then the result will be one of the input NaNs. However it’s up to the implementation which one it picks, and different CPU architectures make different choices. FMAC is interesting here because it is the only three-input operation in IEEE, and so the “pick a NaN” code for it is completely FMAC specific.

- Denormal handling
IEEE defines arithmetic on “denormal” numbers (which are so close to zero that they can’t be represented except with reduced precision). Handling these is slower than dealing with normal numbers, so some CPU architectures allow the user to select a “fast” mode where denormals are “flushed” to zero, either on input or on output or both. Implementations vary a lot on how this is controlled, which kind of flushing is done and whether status flags are raised when flushing occurs. On ARM, flushing of denormals can be enabled via an FPSCR bit. Neon instructions also always work in “flush denormals” mode; this means that C compilers typically won’t use the Neon variants for floating point arithmetic unless the user enables a “fast math” mode. FMAC is no exception here — there are VFP instructions which are (by default) fully IEEE compliant and Neon versions which flush.

- Architecture-specific choices of negation
IEEE specifies only the basic

`(a * b) + c`

operation; most CPU architectures have extended this to provide some negated variants, but not always in the same way. So for instance x86 provides`-(a * b) + c`

, but PPC has`-((a * b) + c)`

, and ARM has`(-a * b) + c`

. It might look at first as if you can implement some of these by providing a small set of core functions and having the architecture-specific code negate inputs and outputs. This unfortunately doesn’t always work for the special case where one of the inputs is a NaN. In some cases (like ARM) the instruction is defined to be implemented as a simple negation followed by a multiply-add — in this case if the negated input was a NaN it will emerge from the other side with its sign bit flipped. But some CPU architectures, like PPC, specify that the whole instruction including negation is a single operation for NaN handling purposes: if the negated input is a NaN it must not have its sign bit flipped. This means the negation needs to be handled in the “core” floating point emulation so that it can be done after the special-case processing of NaN inputs. `(0 * Inf) + QNaN`

IEEE allows an implementation choice in how the special case of

`(0 * Inf) + QNaN`

is handled — it will always result in a QNaN, but whether the InvalidOp exception flag should be raised is implementation defined. (If your conceptual model of this operation is that you first do the multiply before you even look at the addend, then you’ll raise InvalidOp because multiplying zero by infinity isn’t a valid operation. If your model is that handling of NaN inputs is the first step before you start computing anything, then you won’t raise InvalidOp because you won’t get to the point of trying to multiply anything.)

Handling of special cases takes up about half the FMAC routine, but even when we’re down to the common case it’s still about a hundred lines of code, because we have to do a multiply, and then either an addition or a subtraction, all with only integer operations. To understand how this works you have to know that a floating point number is represented as `(s * -1) * 2^e * m`

, where `s`

, `e`

and `m`

are the sign bit, exponent and mantissa (fractional part). If you break out the separate fields for the input operands then you can calculate the required result sign, exponent and mantissa, and then pack them back into the floating point binary format. I’m just going to sketch the general idea here (and in particular I have omitted some details such as calculation of the result’s sign bit).

- Multiplication:
`(2^a * b) * (2^c * d) == 2^(a+c) * b * d`

so all we need to do is add the exponents and multiply the mantissae. Well, nearly all: then we need to fix up the result if the most significant bit of the mantissa isn’t in the right place, by shifting the mantissa and decrementing the exponent. This doesn’t affect the answer because it’s doing both a multiplication by two (the shift) and a division by two (the exponent adjustment), and they cancel out. Of course multiplying the two 23-bit mantissae gives 56 bits of product, so the following addition or subtraction has to be done at the widened precision, to satisfy the “no intermediate rounding” requirement. - Addition:
We can adjust one of the inputs (by the same trick of shifting the mantissa and adjusting the exponent) so that both inputs have the same exponent. Then we can calculate the result mantissa by adding the two input mantissae, because

`(2^x * y) + (2^x * z) == (2^x * (y + z))`

. As with multiplication, we may then need to fix up the result to put the most significant bit of the mantissa in the right place. - Subtraction:
This is like addition, although the details are sufficiently different that it has to be done as a separate code path.

Working through this really solidified my understanding of what QEMU’s floating point emulation code was doing, and if you’re curious about what’s actually lurking behind the ‘single’ and ‘double’ data types I’d encourage you to work through the function with a copy of the IEEE specification and try to figure out why it gives you the right answers…