Sorting floating-point numbers - C++

I have a list of normal vectors and I am calculating the scalar triple product and sorting them. I have compared the sorting in three different cases:
Using Matlab's sort to find the largest absolute triple product values
Using the std::sort function in C++ on a std::vector of the product values. Using doubles for the triple products.
Using radix sort in OpenCL C, converting the absolute floating-point values to unsigned integers and converting them back. I am using cl_float for the triple products.
All three give slightly different values, and correspondingly different sorted indices, which causes problems in my algorithm. What is the problem here, and how can I keep the results consistent?

The problem at hand:
Calculate the scalar triple product of three 3-dimensional vectors, with each component of each vector represented as a binary32 float.
Be able to tell whether one result of that calculation is greater than another.
So far so good, but if we apply the formula to the vectors directly, some bits may be lost in the operations and we will be unable to discern two results. As #rcgldr pointed out, sorting is not the problem; the precision is.
One solution to floating-point roundoff problems is to increase the number of bits, that is, to use double. You said you have no double, so let's do it ourselves: perform the whole calculation in an array of unsigned char, as long as we need.
Okay, okay, practical considerations:
The input is made of normalized vectors, so each length is no greater than one, which implies no component is greater than one.
The exponent of a binary32 float ranges from -127 (zero and denormals) to 127 (with 128 reserved for infinity and NaN), but all components will have an exponent from -127 to 0 (or else they would be greater than one).
The maximum precision of the input is 24 bits.
The scalar triple product involves a vector product and a scalar product. In the vector product (which happens first) there are subtractions of multiplication results, and in the scalar product there is a sum of multiplication results.
Considerations 2 and 3 tell us that the whole family of input components fits in a fixed-point format of 127 bits for offsetting plus 24 bits for the mantissa; that's 19 bytes. Let's make it 24.
To fully represent all possible sums and subtractions, one extra bit suffices (in the event of a carry), but to fully represent all possible multiplication results we need double the number of bits. So it follows that doubling the size is enough to represent the vector multiplication, and tripling it makes it enough for the subsequent multiplication in the scalar product.
Here is a draft of a class that loads a float into that very large fixed-point format, keeping the sign as a bool flag (there is a helper function rollArrayRight() that I'll post separately, but hopefully its name explains it):
const size_t initialSize = 24;
const size_t sizeForMult1 = initialSize + initialSize;
const size_t finalSize = sizeForMult1 + initialSize;

class imSoHuge{
public:
    bool isNegative;
    uint8_t r[finalSize];
    void load(float v){
        isNegative = false;
        for(size_t p = 0; p < finalSize; p++) r[p] = 0;
        union{
            uint8_t b[4];
            uint32_t u;
            float f;
        } reunion;
        reunion.f = v;
        if((reunion.b[3] & 0x80) != 0x00) isNegative = true;
        uint32_t m, eu;
        eu = reunion.u << 1; //get rid of the sign
        eu >>= 24;
        m = reunion.u & (0x007fffff);
        if(eu == 0){ //zero or denormal
            if(m == 0) return; //zero
        }else{
            m |= (0x00800000); //implicit leading one if it's not denormal
        }
        int32_t e = (int32_t)eu - 127; //unbiased exponent is now in [e]
        reunion.u = m;
        r[finalSize-1] = reunion.b[3];
        r[finalSize-2] = reunion.b[2];
        r[finalSize-3] = reunion.b[1];
        r[finalSize-4] = reunion.b[0];
        rollArrayRight(r, finalSize, e - (sizeForMult1*8)); //correct position for fixed-point
    }
    explicit imSoHuge(float v){
        load(v);
    }
};
When the class is constructed with the number 1.0f, for example, the array r holds something like 00 00 00 00 80 00; notice that the value is loaded into the lower part of it. The multiplications will ~roll~ the number up into the higher bytes accordingly, and we can then recover our float.
To make it useful, we need to implement the equivalents of sum and multiplication. That is very straightforward, as long as we remember we can only sum arrays that have been multiplied the same number of times (as in the triple product), or else their magnitudes would not match.
One example where such class would make a difference:
Consider the following 3 vectors:
float a[]={0.0097905760, 0.0223784577, 0.9997016787};
float b[]={0.8248013854, 0.4413521587, 0.3534274995};
float c[]={0.4152690768, 0.3959976136, 0.8189856410};
And the following function that calculates the triple product (I hope I've got it right, haha):
float fTripleProduct(float* a, float* b, float* c){
    float crossAB[3];
    crossAB[0] = (a[1]*b[2]) - (a[2]*b[1]);
    crossAB[1] = (a[2]*b[0]) - (a[0]*b[2]);
    crossAB[2] = (a[0]*b[1]) - (a[1]*b[0]);
    float tripleP = (crossAB[0]*c[0]) + (crossAB[1]*c[1]) + (crossAB[2]*c[2]);
    return tripleP;
}
The result for fTripleProduct(a,b,c); is 0.1336331
If we change the last digit of the first component of a from 0 to 6, making it 0.0097905766 (which has a different hexadecimal representation) and call the function again, the result is the same, but we know it should be greater.
Now consider we have implemented the multiplication, sum, and subtraction for the imSoHuge class and have a function to calculate the triple product using it
imSoHuge tripleProduct(float* a, float* b, float* c){
    imSoHuge crossAB[3];
    crossAB[0] = (imSoHuge(a[1])*imSoHuge(b[2])) - (imSoHuge(a[2])*imSoHuge(b[1]));
    crossAB[1] = (imSoHuge(a[2])*imSoHuge(b[0])) - (imSoHuge(a[0])*imSoHuge(b[2]));
    crossAB[2] = (imSoHuge(a[0])*imSoHuge(b[1])) - (imSoHuge(a[1])*imSoHuge(b[0]));
    imSoHuge tripleP = (crossAB[0]*imSoHuge(c[0])) + (crossAB[1]*imSoHuge(c[1])) + (crossAB[2]*imSoHuge(c[2]));
    return tripleP;
}
If we call that function for the two above versions of the vectors, the results in the array differ:
0 0 0 4 46 b9 4 69 39 3f 53 b8 19 e0 ...
0 0 0 4 46 b9 4 85 93 82 df ba 7d 80 ...
And they differ beyond the precision of a binary32 float indeed, meaning that if we cast those arrays back to float we get the same float, but if we compare the arrays, we can tell which one is greater.
To put that reasoning to the test, I've made a full working example that you can compile and run right away with -O3 -Wall -std=c++11 in GCC, or the equivalent on another compiler, and it will output:
Using class: second result is greater
casting to float:
first result: 1.336331e-001
second result: 1.336331e-001
as floats, the results are the same: 1.336331e-001
The source code is here (working fine on Ideone):
Source Code on IDEONE C++11 code
If you have not migrated to C++11, the code compiles and runs in C++98 if you define the exact-width types uint8_t, uint16_t, uint32_t, int32_t yourself.
How to use it?
Simply call the function tripleProduct with your inputs and compare the results using the provided overloaded comparison operators; you can also cast the class imSoHuge to float (after the triple product has been calculated) using the provided overloaded cast operator.
You can provide an array of that class and comparators to any sorting algorithm.
Conclusions and considerations:
Notice that a float multiplication is now performed as a multiplication of two 70+ byte long arrays; that means hundreds of times more clock cycles, plus the sums, comparisons, etc. This will be thousands of times slower, but hey, it's exact.
The whole design of the algorithm assumes normalized vectors (there is some room here, as I don't know the precision of your normalization procedure), but it will overflow and become meaningless with most greater-than-one vectors.
You can easily cap the result array to as many bytes as you wish, if keeping the whole array in memory is too much. Very few cases will produce results diverging after ~12 bytes.
I haven't stress-tested everything, like denormals and corner cases; there are comments in the code at the critical points.
and of course:
You can easily improve everything, I was just willing to share the reasoning =)
Source code again
Main reference:
Single-precision floating-point format (Wikipedia)

Related

What's the meaning of shifting in fixed-point arithmetic when implementing it in C++?

I have a problem with understanding fixed-point arithmetic and its implementation in C++. I was trying to understand this code:
#define scale 16
int DoubleToFixed(double num){
    return num * ((double)(1 << scale));
}
double FixedToDouble(int num){
    return (double)num / (double)(1 << scale);
}
int IntToFixed(int num){
    return num << scale;
}
I am trying to understand exactly why we shift. I know that shifting to the left is basically multiplying the number by 2^x, where x is by how many positions we want to shift or scale, and shifting to the right is basically dividing by 2^x.
But why do we need to shift when we convert from int to fixed point?
A fixed-point format represents a number as an integer multiplied by a fixed scale. Commonly the scale is some base b raised to some power e, so the integer f would represent the number f·b^e.
In the code shown, the scale is 2^-16, or 1/65,536. (Calling the shift amount scale is a misnomer; 16, or rather -16, is the exponent.) So if the integer representing the number is 81,920, the value represented is 81,920·2^-16 = 1.25.
The routine DoubleToFixed converts a floating-point number to this fixed-point format by multiplying by the reciprocal of the scale; it multiplies by 65,536.
The routine FixedToDouble converts a number from this fixed-format to floating-point by multiplying by the scale or, equivalently, by dividing by its reciprocal; it divides by 65,536.
IntToFixed does the same thing as DoubleToFixed except for an int input.
Fixed-point arithmetic works on the concept of representing numbers as an integer multiple of a very small "base". Your case uses a base of 1/(1<<scale), i.e. 1/65536, which is approximately 0.00001525878.
So the number 3.141592653589793 could be represented as 205887.416146 units of 1/65536, and so would be stored in memory as the integer value 205887 (which really represents 3.14158630371, due to the rounding during conversion).
The way to calculate this conversion of a fractional value to fixed point is simply to divide the value by the base: 3.141592653589793 / (1/65536) = 205887.416146 (which notably reduces to 3.141592653589793 * 65536 = 205887.416146). Since the base is a power of two, and multiplication by a power of two is the same as left-shifting by that many bits, multiplication by 2^16, aka 65536, can be computed by simply shifting left 16 bits. This is really fast, which is why most fixed-point calculations use an inverse power of two as their base.
Because float values cannot be shifted, your methods convert the base to a double and do floating-point multiplication, but other operations, such as fixed-point multiplication and division themselves, are able to take advantage of this shortcut.
Theoretically, one can use shifting bits with floats to do the conversion functions faster than simply floating point multiplication, but most likely, the compiler is actually already doing that under the covers.
It is also common for some code to use an inverse power of ten as the base, primarily for money, which usually uses a base of 0.01, but these cannot use a single shift as a shortcut and have to do slower math. One shortcut for multiplying by 100 is (value<<6) + (value<<5) + (value<<2) (this is effectively value*64 + value*32 + value*4, which is value*(64+32+4), which is value*100), and three shifts and three adds are sometimes faster than one multiplication. Compilers already apply this shortcut under the covers if 100 is a compile-time constant, so in general nobody writes code like this anymore.

How to design INT of 16,32, 64 bytes or even bigger in C++

As a beginner, I know we can use an array to store larger numbers if required, but I want a 16-byte int data type in C++ on which I can perform all the arithmetic operations available for basic data types like int or float.
So can we, in effect, increase the size of the default data types as desired, like an int of 64 bytes or a double of 120 bytes? Not by changing the basic data type directly, but something whose effect is the same as increasing the capacity of the data type.
Is this even possible, if yes then how and if not then what are completely different ways to achieve the same?
Yes, it's possible, but no, it's not trivial.
First, I feel obliged to point out that this is one area where C and C++ really don't provide as much access to the hardware at the lowest level as you'd really like. In assembly language, you normally get a couple of features that make multiple-precision arithmetic quite a bit easier to implement. One is a carry flag. This tracks whether a previous addition generated a carry (or a previous subtraction a borrow). So to add two 128-bit numbers on a machine with 64-bit registers you'd typically write code on this general order:
; r0 contains the bottom 64-bits of the first operand
; r1 contains the upper 64 bits of the first operand
; r2 contains the lower 64 bits of the second operand
; r3 contains the upper 64 bits of the second operand
add r0, r2
adc r1, r3
Likewise, when you multiply two numbers, most processors generate the full answer in two separate registers, so when (for example) you multiply two 64-bit numbers, you get a 128-bit result.
In C and C++, however, we don't get that. One easy way to get around it is to work in smaller chunks. For example, if we want a 128-bit type on an implementation that provides 64-bit long long as its largest integer type, we can work in 32-bit chunks. When we're going to do an operation, we widen those to a long long, and do the operation on the long long. This way, when we add or multiply two 32-bit chunks, if the result is larger than 32 bits, we can still store it all in our 64-bit long long.
So, for addition life is pretty easy. We add the two lowest order words. We use a bitmask to get the bottom 32 bits and store them into the bottom 32 bits of the result. Then we take the upper 32 bits, and use them as a "carry" when we add the next 32 bits of the operands. Continue until we've added all 128 (or whatever) bits of operands and gotten our overall result.
Subtraction is pretty similar. In fact, we can do 2's complement on the second operand, then add to get our result.
Multiplication gets a little trickier. It's not always immediately obvious how we can carry out multiplication in smaller pieces. The usual approach is based on the distributive property. That is, we can take some large numbers A and B and break each into chunks: A = a1·K + a0 and B = b1·K + b0, where K is the chunk scale (2^32 if we work in 32-bit chunks) and each a_n and b_n is a 32-bit piece of the operand. Then we use the distributive property to turn the product into:
a1*b1*K^2 + (a1*b0 + a0*b1)*K + a0*b0
This can be extended to an arbitrary number of "chunks", though if you're dealing with really large numbers there are much better ways (e.g., Karatsuba).
If you want to define non-atomic big integers, you can use plain structs.
#include <array>    // std::array
#include <cstdint>  // std::int8_t
#include <cstddef>  // std::size_t

template <std::size_t size>
struct big_int {
    std::array<std::int8_t, size> bytes;
};

using int128_t = big_int<16>;
using int256_t = big_int<32>;
using int512_t = big_int<64>;

int main() {
    int128_t i128 = { 0 };
}

How do multiply an array of ints to result in a single number?

So I have a single int broken up into an array of smaller ints. For example, int num = 136928 becomes int num[3] = {13,69,28}. I need to multiply the array by a certain number. The normal operation would be 136928 * 2 == 273856. But I need to do [13,69,28] * 2 to give the same answer as 136928 * 2 would in the form of an array again - the result should be
for (int& i : arr) {  // note the reference, so the writes stick
    i *= 2;
    //Should multiply everything in the array
    //so that arr now equals {27,38,56}
}
Any help would be appreciated on how to do this (it also needs to work with multiplying by floating-point numbers, e.g. arr * 0.5 should halve everything in the array).
For those wondering, the number has to be split up into an array because it is too large to store in any standard type (64 bytes). Specifically I am trying to perform a mathematical operation on the result of a sha256 hash. The hash returns an array of the hash as uint8_t[64].
Consider using Boost.Multiprecision instead. Specifically, the cpp_int type, which is a representation of an arbitrary-sized integer value.
//In your includes...
#include <boost/multiprecision/cpp_int.hpp>

//In your relevant code:
bool is_little_endian = /*...*/; //Might need to flip this

uint8_t values[64];
boost::multiprecision::cpp_int value;
boost::multiprecision::import_bits( //free function, not a member of cpp_int
    value,
    std::begin(values),
    std::end(values),
    8,                 //chunk size in bits: one byte per element
    !is_little_endian  //the parameter is "most significant value first"
);

//easy arithmetic to perform
value *= 2;

boost::multiprecision::export_bits(
    value,
    std::begin(values),
    8,
    !is_little_endian
);
//values now contains the properly multiplied result
Theoretically this should work with the properly sized type uint512_t, found in the same namespace as cpp_int, but I don't have a C++ compiler to test with right now, so I can't verify. If it does work, you should prefer uint512_t, since it'll probably be faster than an arbitrarily-sized integer.
If you just need multiplying with / dividing by two (2) then you can simply shift the bits in each byte that makes up the value.
So for multiplication you start at the least significant byte, which is the rightmost one (I'm assuming big endian here). Take the most significant bit of the byte and store it in a temp var (a possible carry bit), then shift the other bits to the left. The stored bit becomes the least significant bit of the next, more significant byte once that one has been shifted. Repeat this until you have processed all bytes. You may be left with a single carry bit, which you can toss away if you're performing operations modulo 2^512 (64 bytes).
Division is similar, but you start at the most significant byte and carry the least significant bit of each byte down to the next one. If you drop the final rightmost bit then you calculate the "floor" of the division (i.e. three divided by two will be one, not one-and-a-half or two).
This is useful if
you don't want to copy the bytes or
if you just need bit operations otherwise and you don't want to include a multi-precision / big integer library.
Using a big integer library would be recommended for maintainability.

How to multiply an integer by a fraction

So my goal here is to have a function whose signature looks like this:
template<typename int_t, int_t numerator, int_t denominator>
int_t Multiply(int_t x);
The type is an integral type, which is both the type of the one parameter and the return type.
The other two template parameters are the numerator and denominator of a fraction.
The goal of this function is to multiply a number "x" by an arbitrary fraction, which is in the two template values. In general the answer should be:
floor(x*n/d) mod (int_t_max+1)
The naive way to do this is to first multiply "x" by the numerator and then divide.
Looking at a specific case lets say that int_t=uint8_t, "x" is 30 and the numerator and denominator are 119 and 255 respectively.
Taking this naive route fails because (30*119)mod 256 = 242, which divided by 255 and then floored is 0. The real answer should be 14.
The next step would be to just use a bigger integer size for the intermediate values. So instead of doing the 30*119 calculation in mod 256 we would do it in mod 65536. This does work to a certain extent, but it fails when we try to use the maximum integer size in the Multiply function.
The next step would be to just use some BigInt type to hold the values so that it can't overflow. This also would work, but the whole reason for having the template arguments, is so that this can be extremely fast, and using a BigInt would probably defeat that purpose.
So here is the question:
Is there an algorithm, using only shift, multiplication, division, addition, subtraction, and remainder operators, that can perform this mathematical function without causing overflow issues?
For the Windows platform I urge you to look into this article on large integers, which currently includes support for up to 128-bit integer values. You can specialize your template based on the bit-width of your int_t to serve as a proxy to those OS functions.
Implementing the "shift-and-add" for multiplication may provide a good enough alternative but a division will certainly negate any performance gains you could hope for.
Then there are "shortcuts" like trying to see if the numerator and denominator can be simplified by fraction reduction, e.g. multiplying by 35/49 is the same as multiplying by 5/7.
Another alternative that comes to mind is to "gradually" multiply by "fractions". This one will need some explanation though:
Suppose you are multiplying by 1234567/89012. I'll use decimal notation for readability, but the same is applicable (naturally) to binary math.
So what we have is a value x that needs to be multiplied by that fraction. Since we are dealing with integer arithmetic let's repackage that fraction a bit:
1234567/89012 = A + B/10 + C/100 + D/1000...
= 1157156/89012 + ((77411*10)/89012)/10 + ((62014*10)/89012)/100 + ((86068*10)/89012)/1000...
= 13 + 8/10 + 6/100 + 9/1000...
In fact, at this point your main question becomes "how precise do I want to be in my calculations?". Based on the answer to that question, you keep the appropriate number of terms of that long sequence.
That will give you the desired precision and provide a generic "no overflow" method for computing the product, but at what computational cost?

Karatsuba multiplication improvement

I have implemented the Karatsuba multiplication algorithm for my educational goals. Now I am looking for further improvements. I have implemented a kind of long arithmetic, and it works well as long as I do not use a base of integer representation greater than 100.
With base 10 and compiling with clang++ -O3 multiplication of two random integers in range [10^50000, 10^50001] takes:
Naive algorithm took me 1967 cycles (1.967 seconds)
Karatsuba algorithm took me 400 cycles (0.4 seconds)
And the same numbers with base 100:
Naive algorithm took me 409 cycles (0.409 seconds)
Karatsuba algorithm took me 140 cycles (0.14 seconds)
Is there a way to improve these results?
Now I use such function to finalize my result:
void finalize(vector<int>& res) {
    for (size_t i = 0; i < res.size(); ++i) {
        res[i + 1] += res[i] / base; //note: needs a spare slot after the last
        res[i] %= base;              //digit, or the final carry writes out of bounds
    }
}
As you can see, at each step it calculates the carry and pushes it to the next digit. And if I take base >= 1000, the result will overflow.
If you look at my code, I use vectors of int to represent a long integer. According to my base, a number is divided into separate parts of the vector.
Now I see several options:
to use the long long type for the vector, but it might also overflow for integers of vast length
implement a representation of the carry in the long arithmetic
After I saw some comments I decided to expand the issue. Assume that we want to represent our long integer as a vector of ints. For instance:
ULLONG_MAX = 18446744073709551615
And for input we pass the 210th Fibonacci number, 34507973060837282187130139035400899082304280, which does not fit in any standard type. If we represent it in a vector of int with base 10000000, it will be like:
v[0]: 2304280
v[1]: 89908
v[2]: 1390354
v[3]: 2187130
v[4]: 6083728
v[5]: 5079730
v[6]: 34
And when we do multiplication we may get (for simplicity let it be two identical numbers)
(34507973060837282187130139035400899082304280)^2:
v[0] * v[0] = 5309706318400
...
v[0] * v[4] = 14018612755840
...
That was only the first row, and we have to do six more rows like that. Certainly, some steps will cause an overflow during the multiplication or after the carry calculation.
If I missed something, please, let me know and I will change it.
If you want to see full version, it is on my github
Base 2^64 and base 2^32 are the most popular bases for doing high precision arithmetic. Usually, the digits are stored in an unsigned integral type, because they have well-behaved semantics with regard to overflow.
For example, one can detect the carry from an addition as follows:
uint64_t x, y; // initialize somehow
uint64_t sum = x + y;
uint64_t carry = sum < x; // 1 if true, 0 if false
Also, assembly languages usually have a few "add with carry" instructions; if you can write inline assembly (or have access to intrinsics) you can take advantage of these.
For multiplication, most computers have machine instructions that can compute a one-machine-word -> two-machine-word product; sometimes the instructions to get the two halves are called "multiply hi" and "multiply low". You need to write assembly to get at them directly, although many compilers offer larger integer types whose use will let you access these instructions: e.g. in gcc you can implement multiply hi as
uint64_t mulhi(uint64_t x, uint64_t y)
{
    return ((__uint128_t) x * y) >> 64;
}
When people can't use this, they do the multiplication in base 2^32 instead, so that they can use the same approach to implement a portable mulhi instruction, using uint64_t as the double-digit type.
If you want to write efficient code, you really need to take advantage of these bigger multiply instructions. Multiplying digits in base 2^32 is more than ninety times more powerful than multiplying digits in base 10, and multiplying digits in base 2^64 is four times more powerful than that. Your computer can probably do these more quickly than whatever you implement for base-10 multiplication.