Abacus

Abacus is Codeplay’s kernel library for compute languages.

Key features include;

OpenCL 1.2 floating point math functions.
Heavily optimized for GPU, DSP and vector architectures.
Satisfies the high precision requirements for OpenCL conformance.
High code quality and documentation for easy maintainability.

Release Notes

Jam Roly Poly - 12th July 2016

The Jam Roly Poly release marks a broad stabilization of the Abacus codebase - the code is shipping in many more platforms now, we have wide testing support from a plethora of architectures, and the code is working perfectly well for our customers.

Huge tidy up of the code, unifying a number of built up conventions.
Added flush-to-zero support throughout Abacus.
Fixed some CTS-untested corner cases in our algorithms.

Iced Bun - 9th October 2015

link

The highlight of the Iced Bun release finalized our double support by having explicitly vectorized variants of all the double functions on offer.

full double support enabled (all vectorizations return correctly).
added native_* builtins for float.
super fast new algorithm for acos/asin double bringing 1.5-10x speed up.
spliced out some of the non-IP functionality into abacus_detail_* headers. A customer of Abacus really needed, for ease of use, these to be in a header to allow for function inlining, code size reduction, and code shipping ease.

Hairst Bree - 17th July 2015

link

The Hairst Bree release was a broadening of the scope of the math library, we’ve brought in nearly all of the functionality required to fully implement the OpenCL runtime, the only things missing are atomics (which required specific software solutions on each platform), and vload/vstore for half. These implementations are platform agnostic, they’ll work out of the box on whatever platform required.

5.6x performance improvement for hypot(float, float).
We now generate our abacus headers via some CMake wizardry.
Abacus now includes most of the OpenCL runtime, everything except atomic builtins and vload/vstore for half.
Initial support for double floating point is in the process of being added.
Abacus builds to a C++ static and shared library.

Granny Sookers - 13th November 2014

link

The Granny Sookers release has been a spring clean release - we’ve spent a good deal of time simplifying the maths library, cleaning up the headers, and trying to make the product much more usable for our customers. But of course - we’ve snuck in some really cool little features too;

C++ compilation mode
Monolithic headers separation begun
Performance improvements for pow, powr & pown functions

Previously, we referred to this product at Codeplay’s Math library, but the library has become so much more than that. To match the evolving focus of the library and to not constrain us, we’ve decided on a rename. Welcome to the wonderful world our new product, Abacus.

Hardware Requirements

link

Abacus has been designed as carefully as possible to allow for low level computation on our customers hardware, while also as much as possible trying to abstract what we can away from the underlying hardware.

We do make certain assumptions on what the underlying hardware is capable of though.

Basic Type Requirements

link

We use the following types, though they may not be natively supported:

signed 8-bit integer
signed 16-bit integer
signed 32-bit integer
signed 64-bit integer
unsigned 8-bit integer
unsigned 16-bit integer
unsigned 32-bit integer
unsigned 64-bit integer
16-bit floating point
32-bit floating point
64-bit floating point
1/2/3/4/8/16 element vectors of the above types

Denorm Support

link

Abacus fully supports denormal numbers for our floating point functions.

Rounding Modes

link

Our default rounding mode is Round-to-Even.

ULP Requirements for Basic Operations

link

Since Abacus was incepted as a kernel library to provide OpenCL builtins on GPU hardware, we follow the requirements OpenCL places on target hardware for basic operations.

What is ULP?

link

ULP stands for Unit in the Last Place, or Unit of Least Precision, is the spacing between floating-point numbers, used as a measurement of accuracy of floating point numbers. More information can be found on Wikipedia.

For example, say we would like to compute the ULP of 3.14159, we first extract the exponent using ilogb, then plug that into the equation ULP(x) = 2^(e-p+1) (where p is 24 if we’re dealing with FP32):

float number = 3.14159f;
int exponent = std::ilogb(number); // 1
float ulp = std::ldexp(1.0f, exponent - 23); // 2.38419e-07

We can also use the alternative definition where ULP(x) is the distance between the two closest straddling floating-point numbers a and b (i.e. a <= x <= b and a != b) to verify this:

float number = 3.14159f;
float a = number; // `number` is the same as `a`, due to FP rounding.
float b = std::nextafter(number, 4.0f); // next floating point value towards 4.
float ulp = b - a; // 2.38419e-07

Calculating the ULP error of an FP16 function

link

Lets say we have implemented an FP16 function sinpi(x), which calculates sin(pi * x), and for a specific input, it calculates the following result:

sinpi(1.234) = -0.67041 [0xb95d]

We know that the true FP16 value is -0.671387 [0xb95f]. This is the infinitely precise result of sinpi(1.234) rounded to FP16. We can calculate the ULP error by working out the difference between the actual and expected value, then divide by the ULP of the expected result (where p is 11 for FP16 in the ULP equation):

result = -0.67041
expected = -0.671387
expected_exp = ilogb(expected) = -1
ulp_diff = (result - expected) / 2^(expected_exp - 11 + 1) = 2.000896

If the specification says that sinpi(x) must be within 2 ULP of the exact answer, then this result would not be conformant, as the ULP difference is greater than 2.

16-bit Floating Point

link

We assume that IEEE 754 compliant floating point numbers are available on the hardware, with the following ULP guarantees.

x + y, x and y are 16 bit floating point numbers, correctly rounded.
x - y, x and y are 16 bit floating point numbers, correctly rounded.
x * y, x and y are 16 bit floating point numbers, correctly rounded.
x / y, x and y are 16 bit floating point numbers, <= 2.5 ULP.

32-bit Floating Point

link

We assume that IEEE 754 compliant floating point numbers are available on the hardware, with the following ULP guarantees.

x + y, x and y are 32 bit floating point numbers, correctly rounded.
x - y, x and y are 32 bit floating point numbers, correctly rounded.
x * y, x and y are 32 bit floating point numbers, correctly rounded.
x / y, x and y are 32 bit floating point numbers, <= 2.5 ULP.

64-bit Floating Point

link

We assume that IEEE 754 compliant floating point numbers are available on the hardware, with the following ULP guarantees.

x + y, x and y are 64 bit floating point numbers, correctly rounded.
x - y, x and y are 64 bit floating point numbers, correctly rounded.
x * y, x and y are 64 bit floating point numbers, correctly rounded.
x / y, x and y are 64 bit floating point numbers, <= 2.5 ULP.

Notes on Our Implementation

link

While we require support for 64-bit signed and unsigned integer types - we try as much as possible to avoid using them. Since Abacus is primarily written with customer GPUs in mind, we have only used 64-bit types where we would otherwise have to emulate 64-bit operations using 32-bit hardware. In practice, the most performance intensive section of Abacus (abacus_math) uses 64-bit types only when calling modf, remainder and remquo.

Implementation

link

Notes on algorithms used to implement the maths library, as well as ideas for future optimizations.

Maple Worksheets

link

The maple directory of abacus contains files with the .mw extension, these are worksheets which can be loaded into the Maple software environment. Worksheets provide mathematical expressions, computations, and plots visualizing how the algorithm implementing each builtin function was derived.

The Maple software environment includes a free viewer called Maple Player which can be used to view the worksheets. A copy of Maple Player can be downloaded from here

Sollya

link

The Sollya linux command line tool is used for generating polynomial coefficients. Scripts for recreating how these values were calculated can be found in the sollya folder.

When writing a new script we want to call the fpminmax sollya function, which uses a practically efficient heuristic method to find a good polynomial approximation of a function f on an interval range. Our function f will be the original function we want to find a polynomial for, but the range should correspond to narrowed ranged obtained using a range reduction algorithm.

We also need to pass fpminmax the order of the polynomial to generate, higher orders are more precise but also more computationally expensive. To achieve a good trade off we work out the error of our polynomial using dirtyinfnorm. Then pick an order which results in an error just smaller than can be represented by the floating point type we’re interested in, one of half/float/double precisions.

Algorithms

link

Outline of common algorithms used throughout Abacus.

Newton Raphson Method

link

The NR method is a technique for finding approximations for the roots(zeros) of a function. By starting with an initial guess for the root of the function, we can perform repeated iterations of the algorithm to get increasingly precise values.

In particular we use NR as part of the fast inverse square root algorithm found in Quake, which is used as part of the implementation of many of the builtins.

Halley’s Method

link

Like the NR method Halley’s method is a root finding algorithm for functions, however it improves the rate of convergence to the root for cubic functions so we use it as part of our cbrt implementation.

Horner’s Method

link

Horner’s method is an algorithm for evaluating polynomials for a specific value of x by transforming the polynomial into a computationally efficient form. It’s particularly useful for implementing fast multiplication and division of floating point numbers on target architectures without bespoke hardware, since the binary number can be represented as a polynomial of base 2.

Cody Waite

link

Cody-Waite is a range reduction algorithm, used to bound the interval of inputs to an elementary function. Cody-Waite is used when you are doing range reduction by subtracting a known constant value. For example, when calculating an exponential function, for instance exp(x), you will first reduce the range by subtracting the constant value of log(2). The problem occurs when you have a value of x that is close to log(2), you will get a big error in the subtraction when creating the very small result. To combat this, the Cody-Waite breaks the single log(2) constant into multiple sub-constants (normally 2 values, but sometimes more!). Then we subtract these parts of the same constant multiple times from the original value, starting with the largest part of the constant, then the next smallest, and again and again. The result is we get a much more precise resulting value after all these subtractions have occurred.

To calculate a Cody-Waite constant, you first get a very high precision value of the constant you want (for instance using Wolfram Alpha). Then you round this value into the precision of the value you are calculating, so for instance into 32-bit floating point. You want to round down here on the last digit of the number. Then subtract the rounded value from the high precision value (again using something like Wolfram Alpha), and the resulting value is your second Cody-Waite value. Generally you want the values being subtracted to always be positive, so if you get a negative result it means you didn’t round the original value down enough, and need to cut it lower.

Payne Hanek

link

Payne Hanek is a range reduction algorithm specifically for trigonometric functions. Trigonometric functions are unique in that f(x) = sin(x) results in a repeated pattern across the entire range (sin(x) == sin(x + 2 * π)). Large floating point values need to be precisely modulo’d down into this repeated range, but using any normal floating point tricks would result in wildly wrong guesses. Payne Hanek solves this by using a large lookup table representing many digits of π. Then, based on how large the floating point value is, the relevant digits of π are used to modulo x down into the range precisely.

Development Techniques

link

There are a number of common design patterns used throughout Abacus to achieve the required ULP precision targets.

Reading Materials

link

“Handbook of Floating-Point Arithmetic” by Jean-Michel Muller et al.
“Software Manual For The Elementary Functions” by William J Cody Jr and William Waite.

Coding Conventions

link

Pass OpenCL built-in scalar and vector types (float, float2, etc.) by value rather than by reference to helper functions. This elminates unnecessary alloca instructions which can hinder optimizations like the vectorizer.

Exact Operations

link

Abacus includes a number of floating point helper functions to enable more precise addition and multiplication operations, without needing to cast to a higher precision type. The idea is that the result of an operation a OP b is decomposed into two numbers hi and lo which fit into the required floating point data type. hi contains the larger magnitude number and lo contains the smaller magnitude number.

hi and lo should then be operated on independently, then combined together later. The idea is that enough precision should be retained in intermediate operations, so the final recombined result is more accurate than if the decomposition wasn’t used.

For example, (a + b) * c could be implemented in the following way:

abacus_half result_lo;
abacus_half result_hi = add_exact(a, b, &result_lo);
result_lo *= c;
result_hi *= c;
abacus_half final = result_hi + result_lo;

Chaining Exact Operations

link

It is possible to chain exact operations to further increase the final precision depending on the magnitudes of the intermediate numbers. For example, computing the same example as above:

abacus_half int_lo;
abacus_half int_hi = add_exact(a, b, &int_lo);

abacus_half result_lo
abacus_half result_hi = multiply_exact(int_hi, c, &result_lo);

// We've multiplied int_hi and c, but we also need to multiply int_lo and c. add
// this to result_lo, as it will more than likely have a low enough magnitude to
// hold the result.
result_lo += int_lo * c;

// result_hi and result_lo represents the decomposition of the final value.

or instead, computing a * b + c:

abacus_half int_lo;
abacus_half int_hi = multiply_exact(a, b, &int_lo);

abacus_half result_lo
abacus_half result_hi = add_exact(int_hi, c, &result_lo);
result_lo += int_lo;

// result_hi and result_lo represents the decomposition of the final value.

Utilising Exact Operations In Horner Polynomials

link

Horner Polynomials are used to implement a number of functions. As this calculation involves chaining many multiplications and additions, it can be made more accurate by replacing parts of the Horner Polynomial calculation with calls to multiply_exact and add_exact. An example of this can be found in the half implementation of acos.

Optimization Ideas

link

Notes from one of the significant contributors to Abacus about how to optimize our implementation. These may or may not still be valid.

pown

link

Somehow abuse the fact that the 32 bit int gets cast perfectly to a double, less accuracy is probably needed.

tan

link

Value of pi/4 in 32 bit p-h may not be the best one, it seems to be off by a (base 10) digit. (so maybe it’d be the same) Might not need extra rtz code if that’s the case.

Pad the start of payload[] with 0’s, change it so that works, and you won’t need to do cody waithe at all. Instead of payloads entirely use an array of floats/double and use exact multiplies and adds. Could be big, but would take lots of time

Don’t multiply by fractPiOver4 at the end of p-h. Incorporate it into the final polynomial estimate. Would take out all that big RTZ code :D (done in double).

double tan

link

Polynomials could be better. Need extended precision for proper testing so I’ll have to leave it for now.

sin cos

link

Polynomials can probably be made a bit better.

sinpi & cospi

link

You might be able to use a lower degree polynomial, and it just didn’t pass on my computer because of the +0.5 ULP. Might pass on a higher precision machine. cospi uses the same polynomial so be sure to change both. Maple didn’t like the cospi one so I had to work around it

sqrt float

link

I seem to remember you don’t need to do the checks for xBig if you change the NR iteration to be:

result = 0.5f * result * (3.0f - result * (result * x));

Instead of:

result = 0.5f * result * (3.0f - (result * result) * x);

Or for some combination of them. Free opt if you can get it. Thought I’d put it into BC’s code but maybe not. There might have had to be an odd number of NR iterations too or something to that effect.

atan2

link

This 19 degree polynomial was the best that can be found under the current algorithm. I imagine you could maybe split the interval into 2 or 4 or w/e and have different polynomials over it, but I’m not sure how much of a speedup you could expect really.

We should ignore above advice, split interval is bad for vector code.

exp double

link

Use a lesser degree polynomial. Don’t think much else goes here.

tanh

link

Prime candidate for polynomial estimate. You need to do polynomials in the range (0,18) to be done. They’ll probably be long, but better than the divide that’s there at the moment.

erfc

link

Not really much. The problem here is what you’re testing against, as there’s reference test. Intels is lacking in the [26.5,27.0] department, where the answers are denormal, but we seem to match them everywhere but there. More than likely all the polynomials can be shortened, but having to calculate exp(x^2) accurately seems to have to happen.

erf

link

Could separate it from erfc, but again what do you test against. Used erfc to make sure it’s overly accurate

tgamma

link

Keep a separate track of above and below fractions so you only need to do one divide. Just try find a better algorithm in general! It’s quite a hard function.

lgamma

link

Problem once again being an incomplete reference function. Our one is better than the test one anyway. It’s very branchy, but I think it needs to be. For x > 10 stirlings approximation is the way to go unless you want to split the whole double range up into polynomials. For x > 0 we have a max ulp < 16, and for x > 15 a max ulp of something very small, maybe 1.

rsqrt

link

Take out the last more accurate Newton Rhapson iteration if you can. It’s very slow. Take out the frexp and ldexp also, that’d be a big saving. I’ll do this if there’s time.

cbrt

link

The polynomial could probably be shorter. Maybe a better general algorithm, one without a divide. It seems to be quite hard to find one. A more streamlined frexp/ldexp is probably in order too.

exp10

link

Instead of the normal cody waithe, which leads to a long enough polynomial, we’ve reduced mod ln(10)/2ln(2) instead of ln(10)/ln(2), aka half the interval. This reduces the polynomial but you have to do one additional small branch. Probably worth it on CPU anyway. Probably don’t need the 4th cody waithe constant either, I suspect not.

Edit: it was only 1 (2?) extra polynomial term saved so scrapping this idea. Maybe an early return for Nans would be better, though it’s not needed.

expm1

link

I couldn’t get it to pass, but I think it’s because of the extra 0.5 ulp that gets added onto on my machine. Is probably fine on cpu but could maybe be streamlined a bit.

rootn

link

Is a little bit too accurate at the moment, reducing the second polynomial is the best way to make it less accurate.

log1p

link

Is probably fine for a branchy algorithm, not sure how to best speed it up/refactor it.

remquo & remainder

link

Find a way of doing bigger reductions to the input variables in the cases where they’re far apart. This is going to involve guessing an integer as close as possible to q = (x/m), the closer the better, (not x/m will overflow so we need to be smarter, then calculating xNew = x - q*m via exact multiplication and addition. If q is close enough to (x/m) this will be exact, and we have remquo(x,m) = remquo(xNew, m). Should be possible without too much work, at the moment in the worst case you can have a loop that goes on for 2*1024 / 11 = 187 loops. Argh.

menu_bookGuides

Release Notes

Jam Roly Poly - 12th July 2016

Iced Bun - 9th October 2015

Hairst Bree - 17th July 2015

Granny Sookers - 13th November 2014

Hardware Requirements

Basic Type Requirements

Denorm Support

Rounding Modes

ULP Requirements for Basic Operations

What is ULP?

Calculating the ULP error of an FP16 function

16-bit Floating Point

32-bit Floating Point

64-bit Floating Point

Notes on Our Implementation

Implementation

Maple Worksheets

Sollya

Algorithms

Newton Raphson Method

Halley’s Method

Horner’s Method

Cody Waite

Payne Hanek

Development Techniques

Reading Materials

Coding Conventions

Exact Operations

Chaining Exact Operations

Utilising Exact Operations In Horner Polynomials

Optimization Ideas

pown

tan

double tan

sin cos

sinpi & cospi

sqrt float

atan2

exp double

tanh

erfc

erf

tgamma

lgamma

rsqrt

cbrt

exp10

expm1

rootn

log1p

remquo & remainder

Builtins

Image Library

assignmentJump to Section

Select a Product

oneAPI

Dark Mode

Light Mode

Codeplay.com

SYCL.tech

Codeplay Developer

Codeplay Open Source