Formally verifying a floating-point division routine with Gappa – part 2

Simon Tatham
September 4, 2025

In part 1 of this series, I described using Gappa and the Rocq prover to analyze the maximum possible numerical error in ‘ddiv’. This function is a highly optimized floating-point division function written in Arm assembler. It was recently open-sourced as part of Arm Optimized Routines.

In part 2, I finish the story. Once Gappa produced plausible answers, what remained to be done?

Quite a lot, as it turned out.

Making sure the Gappa code matches the original machine code

At the end of part 1, I had a source file written in Gappa’s input language. This file was intended to describe the same calculation as the original division function. Gappa is happy to consume that file and report a sensible-looking value for the maximum possible approximation error in that calculation.

What is the biggest remaining risk of something going wrong in this exercise?

It seemed to me that the biggest risk was that the two pieces of code did not match. The original division function was written in Arm machine code, so it looks very different from the high-level language spoken by Gappa. There was a good chance that the Gappa input did not really match the machine code, because of a mistake in the translation.

Everything after the translation into Gappa is rigorous. Gappa generates a detailed proof and Rocq double-checks it. However, the translation from machine code to Gappa language is not included in that. So how can we gain confidence that the translation is accurate?

One obvious approach is to try to prove the two representations of the same calculation match. This requires writing a second machine-checkable proof of their equivalence. Another approach is to autogenerate one from the other, by understanding enough about the semantics of Arm machine instructions to translate each one directly into a piece of Gappa input.

I dismissed both ideas as too hard. The semantics of individual Arm instructions are often quite complicated. This is because of their effects on the condition flags, and because 64-bit inputs and outputs to multiplication instructions are split across different registers. For example, when ddiv’s calculation needs a 64 × 64-bit integer multiplication, my Gappa input file represents that multiplication as a single operation. Gappa can then understand it instantly. If you wrote it the way Arm machine code does, it would require four separate 32 × 32 multiplications. These must then be recombined through additions using the carry flag. I believe Gappa would struggle to make sense of even a single 64-bit multiplication, let alone ddiv’s whole calculation containing many of those.
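
To make the difference concrete, here is a rough Python sketch (not the actual ddiv code) of how one 64 × 64-bit multiplication has to be assembled from four 32 × 32-bit partial products, the way the machine code does it:

def mul64x64(a, b):
    # Split each 64-bit operand into 32-bit halves, the way the values are
    # held in registers in the machine code.
    MASK32 = (1 << 32) - 1
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32

    # Four 32 x 32 -> 64-bit partial products.
    p0 = a_lo * b_lo            # contributes at bit 0
    p1 = a_lo * b_hi            # contributes at bit 32
    p2 = a_hi * b_lo            # contributes at bit 32
    p3 = a_hi * b_hi            # contributes at bit 64

    # Recombine with additions. In machine code, the carries out of each
    # 32-bit addition have to be propagated explicitly via the carry flag.
    return p0 + ((p1 + p2) << 32) + (p3 << 64)

a, b = 0x123456789ABCDEF0, 0xFEDCBA9876543210
assert mul64x64(a, b) == a * b

In the Gappa input, by contrast, the whole thing is written as a single multiplication of two exact integers.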

Instead, I used conventional software testing techniques. To check two pieces of code behave the same way, run them both with the same series of input test cases, and verify the outputs match exactly. Better still, check all the intermediate results as well as the final output, just in case two errors cancelled each other out on one test case but would not have done so for some other input.
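
In outline, the cross-check looks something like this sketch; the two trace functions are hypothetical stand-ins (not my actual harness) for “run the instrumented machine code” and “run the reference model”, each returning the intermediate values in order:

def cross_check(test_inputs, trace_machine_code, trace_reference):
    for n, d in test_inputs:
        values_a = trace_machine_code(n, d)
        values_b = trace_reference(n, d)
        # Compare every intermediate result, not just the final output, so
        # that two compensating mistakes cannot slip through unnoticed.
        for step, (a, b) in enumerate(zip(values_a, values_b)):
            assert a == b, f"mismatch at step {step} for inputs {n}, {d}: {a} != {b}"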

This means we are not doing formal verification directly on the ddiv source code. That was never my intention. I only wanted to formally verify the error bound. This part was so difficult that it needed a handwritten proof in the original code. Verifying the total correctness of an entire assembler function from end to end is a much bigger project.

This testing strategy seemed adequate to defend against the mistakes that are likely when you translate code like this, such as leaving out a step completely, or swapping round the order of two operations that almost commute, like a rounding and an addition.

OK, so the plan is to feed the same pair of test input values to both the machine code and the Gappa. How do you do that?

It turned out to be surprisingly hard. I hoped that Gappa would have something like an ‘evaluate’ mode. Instead of calculating a bound on an output of a calculation given bounds on the inputs, I wanted to provide specific inputs on the command line and have it print the exact value of the output. However, I did not find any mode of that kind in Gappa. I also could not fake it by providing ‘bounds’ on the inputs that constrained each one to be a single value. Strangely, if I did that, Gappa still output a range of more than one possible output value.

If in doubt, add another layer of abstraction

The answer I discovered was to replace my handwritten Gappa input file with a Python script that generates the Gappa input file instead.

Having translated ddiv’s core calculation from machine code into Gappa, I re-translated it into a Python function. The function takes a class type as one parameter, and all the numbers are represented as instances of that class.

Calling the same calculations() function with two different class types makes it do two different things. One parameter class wraps a Fraction value. In this mode, the code performs the calculations directly and generates diagnostics printing out all the intermediate results and the final output.

Another parameter class wraps a string containing an arithmetic expression in Gappa’s input syntax. In that mode, the same calculation function does not do any calculation itself. Instead it builds up expressions to write into a Gappa source file, which then describes the whole calculation to Gappa.
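
Here is a much-simplified sketch of the idea, with a toy calculation standing in for ddiv’s real one. The same function body either computes exact values or assembles Gappa expression text, depending on which wrapper class it is handed:

from fractions import Fraction

class Exact:
    # Wraps a Fraction; arithmetic is performed exactly.
    def __init__(self, value): self.value = Fraction(value)
    def __mul__(self, other):  return Exact(self.value * other.value)
    def __sub__(self, other):  return Exact(self.value - other.value)
    def __repr__(self):        return str(self.value)

class GappaExpr:
    # Wraps a string; arithmetic builds up Gappa expression text instead.
    def __init__(self, text):  self.text = text
    def __mul__(self, other):  return GappaExpr(f"({self.text} * {other.text})")
    def __sub__(self, other):  return GappaExpr(f"({self.text} - {other.text})")
    def __repr__(self):        return self.text

def calculations(n, d):
    # A toy stand-in for ddiv's real sequence of operations.
    r = n * d
    return r - n

print(calculations(Exact(3), Exact(Fraction(1, 7))))    # exact value: -18/7
print(calculations(GappaExpr("n"), GappaExpr("d")))      # Gappa text: ((n * d) - n)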

This was a lot of effort! But now we have an end-to-end story of how we can be confident that Gappa’s error bound applies to the real machine code:

  1. The machine code matches the Python code. We can run both with the same input values and print out all the values they compute. Both sets of values match exactly – bit for bit.
  2. The Python code matches the Gappa code, because the Python code generated the Gappa input file.
  3. The Gappa code matches the error bound, because Gappa generated Rocq input that verified Gappa’s analysis.

And I did make a mistake!

I wrote this Python script and put the corresponding diagnostics into the ddiv machine code, under an #ifdef that you could enable for just this test. When I ran them, the two calculations did not match. It is fortunate I checked.

The core division code takes two mantissas as input, the numerator n and denominator d, with their leading bit in a fixed place. When dividing two numbers of that form, the leading bit of the result is not always in the same place. It might be one bit higher or lower, depending on which of n and d was bigger.

The final step in the division code is to check which of those two places the leading bit ended up in. This happens just before checking whether the result is dangerously close to a rounding boundary and needs double-checking. If it is in the lower position, we shift the whole quotient left by one bit, and adjust the output exponent to compensate. This ensures the leading bit is reliably in the same place.

However, the quotient that we shifted left by one bit, multiplying it by 2, includes the approximation error: we have just multiplied the error by 2 as well as the result.
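
In symbols: if the true quotient is q and the approximate quotient is q + e, then after the shift we are holding 2q + 2e, so the error e has been doubled along with the result.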

I had left this out of my Gappa translation of the calculation. As a result, Gappa’s idea of the worst possible error was half what it should have been. This is because it did not know that the error might be doubled at the last moment.

This highlights the importance of double-checking.

In the new ddiv, with its new formal error analysis, it was not too late to fix this. Gappa’s input language has no if statement, so I could not perform the left shift conditionally. Instead, I ran Gappa multiple times, once with the left shift and once without, adding an extra constraint on the input telling it that either n < d or n > d, as appropriate.

I was already running Gappa 128 times, once for each lookup table entry. Dividing each of those into two cases meant running Gappa 256 times. Even so, the process did not take too long.
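
The outer loop ends up looking something like this sketch, where the three callables are hypothetical stand-ins (not part of my actual script) for “write the Gappa query for this case”, “invoke Gappa on it” and “read back the error bound it reports”:

def worst_case_error(generate_gappa_source, run_gappa, parse_error_bound):
    worst = 0
    for entry in range(128):                 # one case per reciprocal-table entry
        for case in ("n < d", "n > d"):      # the two normalization cases
            source = generate_gappa_source(entry, case)
            worst = max(worst, parse_error_bound(run_gappa(source)))   # 256 runs in total
    return worst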

It is worse than that. This was not just a mistake I had made while writing my Gappa code. I had also made the same mistake in the original handwritten proof, which the previous version of ddiv had been relying on since 2009.

Bug-hunting via number theory

That alone does not prove that the old ddiv had a bug. It is possible that my original hand-calculated error analysis had overestimated the error in other places. Perhaps enough to make up for underestimating it by a factor of 2 here.

This is not unlikely at all. There is an example earlier in this article where Gappa greatly overestimated the range of possible values of a function. What you can easily prove is often a much weaker statement than what is true.

To prove there was really a problem, you would want to find a pair of actual input values which generate the wrong quotient.

In single precision, this probably would not be difficult. As I mentioned earlier, there are few enough single-precision denominators that you could just loop over them all and find the one with the biggest error. Then loop over all the possible numerators to go with it and see if any fail. However, for the same reason, in single precision there was not a mathematical proof that needed checking in the first place!

As before, in double precision there are too many values to iterate over. How can we narrow down the search?

I was already running Gappa many times, for different subintervals of the space of possible denominators. It produced a different error bound for each subinterval, and we were finding the worst out of them all. The first step is to check the information we already have. Find the existing subinterval that is reporting the largest error bound. Whatever input value is the limiting factor on Gappa’s proof, it will be somewhere in there.

That Gappa input file contained a query section looking like this:

{ d in [0xF500000000000000, 0xF5FFFFFFFFFFFFFF] /\ recip08 = 0x85 /\
  n in [0x8000000000000000, 0xFFFFFFFFFFFFFFFF] /\ n/d <= 1 ->
  nquot - NQuot in ? }

The next step was to narrow down the interval of possible denominators, by editing the query section. If you replace the denominator interval with [0xF500000000000000, 0xF57FFFFFFFFFFFFF], covering just the first half of the original range, the error bound in Gappa’s output remains large. However, if you use [0xF580000000000000, 0xF5FFFFFFFFFFFFFF], the second half of the range, the error bound drops significantly. This shows the problem value is somewhere in the first half of the interval, and you can halve that again to continue the search.
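
Automated, the search is an ordinary bisection. In this sketch, error_bound(lo, hi) is a hypothetical helper meaning “rewrite the denominator interval in the query section to [lo, hi], re-run Gappa, and return the bound it reports”, and it assumes (as happened here) that only one half of each interval keeps the suspiciously large bound:

def narrow_down(lo, hi, error_bound, threshold, stop_width=1 << 40):
    # Keep whichever half of the interval still reports an error bound of at
    # least `threshold`, until the interval is small enough to examine by hand.
    # The stopping width here is an arbitrary placeholder.
    while hi - lo > stop_width:
        mid = (lo + hi) // 2
        if error_bound(lo, mid) >= threshold:
            hi = mid            # the problem value is in the lower half
        else:
            lo = mid + 1        # otherwise it must be in the upper half
    return lo, hi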

Using this technique, I identified a much smaller range of denominators for which Gappa still reported a large potential error. When I ran one of those denominators through the ‘evaluate’ mode of my Python wrapper script, the result confirmed that the denominator did have a badly approximated reciprocal. Combined with any numerator, you would get a similarly badly approximated quotient.

However, to show an actual bug in the division function, a badly approximated quotient is not enough. We also need the bad approximation to be positioned just right, or perhaps “just wrong”, with respect to one of those rounding boundaries shown earlier:

[Diagram: rounding of some sample division results, based on an incorrect error bound]

In this diagram, each circular blob shows an approximate quotient returned by the code. The error bar to its right represents the interval that the code thinks the true quotient falls in. For the problem denominator, that error bar is too short. The diamond beyond the right-hand end of each error bar represents the true quotient.

In many cases, despite the approximation error, ddiv still gets the right answer. In the green case on the left, the error interval does not cross a rounding boundary. So the code is confident that wherever in that region the true quotient is, it will round to x − 2ε. The true quotient is outside the interval but still in the region of the number line that rounds to x − 2ε. Therefore, ddiv’s answer is still right, even if for the wrong reason.

The amber case in the middle also gives the right answer, for a different reason. The error interval itself crosses a rounding boundary, so ddiv knows this is a difficult case, and it takes the slow code path which double-checks the answer.

However, the red case on the right really does fail. The error bar does not cross a rounding boundary, so the code believes it is an easy case which can be safely rounded to x + ε. However, it is wrong. The true answer is on the other side of the rounding boundary. The right answer should have been x + 2ε.

Finding a numerator that causes a case like the red one is a problem of modular arithmetic. Given the ‘bad’ approximation r ≈ 1 / d, the code will multiply it by the numerator n. It looks at the bottom 8 bits of the product to decide whether it is near a rounding boundary. We want those bottom 8 bits to fall in the critical range of values that makes the code choose incorrectly. In other words, we want nr mod 2⁸ to be in a particular range.

Solving modular congruences like this is straightforward, so finding a suitable numerator was easy.
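
As a sketch of that last step (the concrete values of r and the target bits are placeholders here, not the ones from ddiv): if r is odd, it is invertible mod 2⁸, so the congruence can be solved directly for the bottom bits of the numerator; if it is even, a short brute-force search over numerators works instead.

def find_numerator(r, target, n_lo):
    # Smallest n >= n_lo such that (n * r) mod 2**8 == target, assuming r is odd.
    r_inv = pow(r, -1, 256)               # r is invertible mod 2**8 when it is odd
    low_bits = (target * r_inv) % 256     # required bottom 8 bits of n
    n = (n_lo & ~0xFF) | low_bits
    return n if n >= n_lo else n + 256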

Indeed, there really was a bug in the old ddiv! Computing the quotient 6498463766869566.0 / 8620720919674879.0 delivered an answer that is wrong by one unit in the last place, for exactly the reason shown in that diagram.
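
For the record, here is one way to see the size of the error from Python. Python’s own division is correctly rounded, so the error printed below is at most half a unit in the last place; the old ddiv returned the neighbouring double for these inputs, a full unit away from the correctly rounded answer.

from fractions import Fraction
import math

n, d = 6498463766869566.0, 8620720919674879.0
computed = n / d                                  # correctly rounded on any IEEE 754 host
exact = Fraction(int(n), int(d))                  # both inputs are exactly representable
error_ulps = abs(Fraction(computed) - exact) / Fraction(math.ulp(computed))
print(float(error_ulps))                          # at most 0.5 ulp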

Fixing the bug

Having found this bug, what is the fix?

The most obvious and easiest fix is to edit the code that checks whether the approximate quotient is too close to a rounding boundary, increasing the distance that counts as ‘too close’: replace the old, wrong value I calculated by hand with the new, reliable value proved by Gappa.

However, this is not a good fix because it makes ddiv run more slowly. Increasing that error bound means more cases go to the slow code path that double-checks the quotient is exactly right. Fewer cases take the fast path that returns early without needing extra checks. On average, ddiv is slower, which is the price for getting the right answers.

In any unavoidable tradeoff of performance versus correctness, correctness must win. If it takes more time to get the right answers, then we must take more time. The IEEE standard leaves no room for disagreement over what the right answers are. It is easy to get the wrong answers quickly.

But before doing that, it is always worth asking: is the tradeoff really unavoidable? Can we have speed and correctness at the same time?

In the past, it would have been very difficult to try to improve this algorithm. Doing the original error analysis by hand took a lot of time, and any modification to the algorithm would need it to be redone from scratch. That made the code expensive to change and trying lots of different things to see which one worked best would have been even harder.

Now it is much easier. To adapt my Python + Gappa error-analysis system to a new implementation of the function, it is only necessary to modify the Python calculations() function in the same way as the machine code. Then re-run the cross-check to make sure the two implementations still give the same results on some test inputs. All the mathematics that follows is done completely automatically. It is just a matter of re-running the script.

This seemed counterintuitive, because “rapid development” and “formal verification” are often viewed as opposite ends of a scale of software development techniques. At one end, you move fast and do not worry too much about breaking things. At the other, you take elaborate precautions to avoid breaking anything, and those precautions take time, so development is slow. Here however, formal verification enabled rapid development. I was able to move fast because I knew how to be confident of not breaking things.

I was able to quickly try out half a dozen different changes to the algorithm. Within an hour or two, I found a combination of changes that improved things by turning the 61-bit approximate quotient into a 63-bit one. As a result, 10 bits are discarded during rounding instead of 8. The error in the new algorithm was about 70 units in the last place of that 63-bit number, compared with 50 units in the previous 61-bit number. But each of those new ‘units’ is a quarter the size of the old ones, so that is a significant win. In fact, the new code is slightly faster than the original code and now has a proof of correctness that I trust!
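
To spell out that comparison on a common scale: 70 of the new units is 70 × ¼ = 17.5 of the old units, comfortably less than the previous bound of 50 old units, so the new bound is roughly a third of the old one in absolute terms.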

How much performance did all this lose and gain? My original error analysis, with the bug in it, sent only about 1 case in 15 to the slow double-checking code path. Gappa’s revised analysis increased that to 1 in 5. After modifying the algorithm, it is back down to 1 in 16, a slightly smaller fraction than it started at.

Initially that came at the cost of one extra instruction in calculating the approximation. That is still a net win, because the slow code path takes 20–30 cycles depending on the CPU: sending more cases that way had been losing at least 2 cycles on average, so one guaranteed extra cycle in the fast path was a price worth paying to get them back. However, with some more tweaking of the code I was able to save another instruction to compensate for the one I had to introduce. Now the fast path is just as fast as it always has been.

Finishing up

While finishing the work, I found some last-minute oddities.

In all the example Gappa inputs I have shown, the question section ends with a clause like ‘expression in ?’. The question mark represents an interval and means “please tell me what interval you decide this value is in”. However, you can replace the question mark with an actual interval, meaning “I claim the value must be in this interval, please try to prove me right”.

I was surprised to find that Gappa works harder in “please prove my claim” mode. In the final version of my ddiv code, the “please tell me what you think the biggest error is” mode reports just over 70 units in the last place of the 63-bit quotient. However, if I ask Gappa to prove that the error is really at most 64 units, it can do it! It has to think harder, and in one case produces a large proof for Rocq to check, but it gets there in the end. I only found this out through a typo, but having found it out, it is worth knowing. A tighter error bound on the same code means improved performance with no extra effort, even if it is only a fraction of a cycle on average.

What does Gappa do differently between the two modes? When I looked at the huge proof file it emitted for Rocq to check, it seemed to be subdividing the interval of input values into lots of smaller subintervals, in the same way I did by hand when I was looking for the worst-case denominator earlier, and then proving each case separately, perhaps using different methods. I think that in “find the maximum error” mode, it will not split up an interval unless the input file specifically tells it to, so it returns the best error bound it can prove easily. However, if you say “please prove this error bound”, you are authorizing it to try harder.

Finally, there was an amusing disaster that nearly happened. My Python script runs Gappa 256 times. Each time, it gets Gappa to output a Rocq proof file, which it feeds to the Rocq command-line checker, still called coqc in current Ubuntu. Initially, I had piped the proof to coqc’s standard input. It seemed happy and returned a success code.

When I was almost finished, I became suspicious about how little time coqc seemed to be taking. Especially on that very long proof file that Gappa produced in one case when I ran it in the “prove this slightly tighter error bound” mode. It turned out that coqc does not read from its standard input at all!

It will only verify a proof in a file, and you must give the file name on its command line. When I ran it without any file name arguments and piped a proof to its standard input, it was completely ignoring the proof but still returning success. With no error message about a missing command-line file name, I very nearly called this whole project done without any proof ever having been checked at all!

Fortunately, once I spotted this and changed how I was running coqc, everything was fine. When it started checking the proofs, it found no problems in them.
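
The corrected invocation amounts to something like this sketch. The file handling here is a placeholder rather than my actual script, and a real run may also need extra options telling coqc where to find the Gappa support library for Rocq:

import os
import subprocess
import tempfile

def check_proof(rocq_proof_text):
    # Write the proof to a real file and pass the file name on coqc's command
    # line; piping it to standard input is silently ignored.
    with tempfile.NamedTemporaryFile("w", suffix=".v", delete=False) as f:
        f.write(rocq_proof_text)
        path = f.name
    try:
        result = subprocess.run(["coqc", path], capture_output=True, text=True)
        return result.returncode == 0
    finally:
        os.remove(path)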

Conclusions

Overall, this was a positive experience!

It is never nice to find out that your previous code had a bug in it. However, it does show the value of checking this kind of thing in a more formal way than a handwritten, error-prone proof. Perhaps if it had turned out that my original error analysis was accurate after all, doing any of this would have felt like a waste of time!

The version of ddiv that we have now published in Arm Optimized Routines contains the new more accurate algorithm. It only needs to run the double-checking code one time in 16 and has a formal proof that that amount of care is enough.

It is very pleasing that in this case some formal verification enabled rapid development of an improved algorithm, instead of making it more difficult and cumbersome.

On the other hand, I think that only went so well because I applied formal verification only where it was most needed, the error-analysis part of the function. That is, the part where I had already seen the need for a proof and written one by hand. Every other part of this function relies on conventional software testing. Even the part where I check that the machine code is implementing the same algorithm that the error-analysis proof is proving things about.

Writing a fully formally verified implementation of ddiv in Arm machine code would have been a much bigger project. An end-to-end proof starting from the individual machine instructions, showing that the entire function met the IEEE 754 specification in all the edge cases as well as the common case. It probably would have lived up to the expectation that formal verification is a slow way to write software.

This was my first time using Gappa, so it was interesting to find out what things it is good at, and where the difficult parts are. Challenges included working out what hint to give it when it cannot do the algebra by itself, the difficulty of simulating a lookup table, and most of all the way I ended up having to wrap Gappa in a Python script to be able to check that the algorithm described in Gappa’s input is the same one in the machine code, and not mistranscribed.

But I think Gappa is more usually used for the next layer up in the mathematical function stack. Functions like sin and cos, or exp and log, implemented on top of basic floating-point arithmetic. Those functions tend to be written in higher-level languages, so you can make the source code and the Gappa code much easier to check against each other by eye!
