I know that in C or C++, you can see how much a multiplication overflowed by using a long
E.g
int[] multiply(int a, int b){
long long r = a * b;
int result = r;
int overflow = r >> 32;
return {result, overflow};
}
However, in GLSL, there are no 64 bit integers. Is there a way to achieve the same result in GLSL without longs?
Context: GLSL 3.0, running in my browser via WebGL 2
You would have to break up the multiplication into pieces yourself.
Here is an algorithm that tries to do this for 64-bit multiplication: https://stackoverflow.com/a/51587262/736162
The same principles apply to 32-bit. You would need to change the code in a few ways:
Halve the sizes of things: instead of 64-bit types, use 32-bit types; halve the shift constants; instead of casting from 64-bit to 32-bit ((uint32_t)), convert from 32-bit to 16-bit using & 0xffff.
Make it valid GLSL 3.0 (e.g. use out instead of &, and uint instead of uint32_t)
I haven't tested this, but I think it should work.
By the way, you'll want to be sure you have highp precision on everything. precision highp int; at the top of your shader is enough.
Related
I start with three values A,B,C (unsigned 32bit integer). And i have to obtain two values D,E (unsigned 32 bit integer also). Where
D = high(A*C);
E = low(A*C) + high(B*C);
I expect that multiply of two 32bit uint produce 64bit result. "high" and "low" is just my covnention for mark the first 32 bits and the last 32 bits in 64bit result of multiply.
I try to obtain optimized code of some allready functional one. I have a short part of the code in huge loop which is just few command lines, however it consumes almost all of computational time (physical simulation for couple of hours computing). That's the reason why i try to optimized this little part and rest of the code could remain more "user-well-arranged".
There is some SSE instructions that are fit for compute mentioned routine. The gcc compiler probably do optimized work. However i do not reject an option to write some piece of code in SSE intructions directly, if it will be necessary.
Be patient with my low experience with SSE please. I will try to write an algorithm for SSE just symbolically. There will be probably some mistakes with ordering masks or understanding the structure.
Store four 32-bit integers into one 128-bit register in order: A,B,C,C.
Apply instruction (probably pmuludq) into mentioned 128-bit register which multiply pairs of 32-bit integeres and return pairs of 64-bit integers as result. So it shoudld calculate multiply of A*C and multiply of B*C simultaneously and return two 64-bit values.
I expect that i have new 128bit register values P,Q,R,S (four 32-bit blocs) where P,Q is 64-bit result of A*C and R,S is 64-bit result of B*C. Then i continue with rearrange values at register into order P,Q,0,R
Take first 64 bits P,Q and add second 64 bits 0,R. The result is a new 64 bits value.
Read first 32 bits of the result as D and last 32 bits of the result as E.
This algorithm should return correct values for E and D.
My question:
Is there a static code in c++ which generate similar SSE routine as mentioned 1-5 SSE algorithm? I preffer solutions with higher performance. If the algorithm is problematic for standart c++ commands, is there a way how to write an algorithm in SSE?
I use TDM-GCC 4.9.2 64-bit compiler.
(note: Question was modified after advice)
(note2: I have an inspiration in this http://sci.tuomastonteri.fi/programming/sse for using SSE for obtain better performance)
You don't need vectors for this unless you have multiple inputs to process in parallel. clang and gcc already do a good job of optimizing the "normal" way to write your code: cast to twice the size, multiply, then shift to get the high half. Compilers recognize this pattern.
They notice that the operands started out as 32bit, so the upper halves are all zero after casting to 64b. Thus, they can use x86's mul insn to do a 32b*32b->64b multiply, instead of doing a full extended-precision 64b multiply. In 64bit mode, they do the same thing with a __uint128_t version of your code.
Both of these functions compile to fairly good code (one mul or imul per multiply).. gcc -m32 doesn't support 128b types, but I won't get into that because 1. you only asked about full multiplies of 32bit values, and 2. you should always use 64bit code when you want something to run fast. If you are doing full-multiplies where the result doesn't fit in a register, clang will avoid a lot of extra mov instructions, because gcc is silly about this. This little test function made a good test-case for that gcc bug report.
That godbolt link includes a function that calls this in a loop, storing the result in an array. It auto-vectorizes with a bunch of shuffling, but still looks like a speedup if you have multiple inputs to process in parallel. A different output format might take less shuffling after the multiply, like maybe storing separate arrays for D and E.
I'm including the 128b version to show that compilers can handle this even when it's not trivial (e.g. just do a 64bit imul instruction to do a 64*64->64b multiply on the 32bit inputs, after zeroing any upper bits that might be sitting in the input registers on function entry.)
When targeting Haswell CPUs and newer, gcc and clang can use the mulx BMI2 instruction. (I used -mno-bmi2 -mno-avx2 in the godbolt link to keep the asm simpler. If you do have a Haswell CPU, just use -O3 -march=haswell.) mulx dest1, dest2, src1 does dest1:dest2 = rdx * src1 while mul src1 does rdx:rax = rax * src1. So mulx has two read-only inputs (one implicit: edx/rdx), and two write-only outputs. This lets compilers do full-multiplies with fewer mov instructions to get data into and out of the implicit registers for mul. This is only a small speedup, esp. since 64bit mulx has 4 cycle latency instead of 3, on Haswell. (Strangely, 64bit mul and mulx are slightly cheaper than 32bit mul and mulx.)
// compiles to good code: you can and should do this sort of thing:
#include <stdint.h>
struct DE { uint32_t D,E; };
struct DE f_structret(uint32_t A, uint32_t B, uint32_t C) {
uint64_t AC = A * (uint64_t)C;
uint64_t BC = B * (uint64_t)C;
uint32_t D = AC >> 32; // high half
uint32_t E = AC + (BC >> 32); // We could cast to uint32_t before adding, but don't need to
struct DE retval = { D, E };
return retval;
}
#ifdef __SIZEOF_INT128__ // IDK the "correct" way to detect __int128_t support
struct DE64 { uint64_t D,E; };
struct DE64 f64_structret(uint64_t A, uint64_t B, uint64_t C) {
__uint128_t AC = A * (__uint128_t)C;
__uint128_t BC = B * (__uint128_t)C;
uint64_t D = AC >> 64; // high half
uint64_t E = AC + (BC >> 64);
struct DE64 retval = { D, E };
return retval;
}
#endif
If I understand it correctly, you want to compute number of potential overflows in A*B. If yes then you have 2 good options - the "use twice as big variable" (write 128bit math function for uint64 - it's not that hard (or wait for me to post it tomorrow)), and the "use floating point type":
(float(A)*float(B))/float(C)
as the loss of precision is minimal (assuming float is 4 bytes, double 8 bytes, and long double 16 bytes long) , and both float and uint32 require 4 bytes of memory (use double for uint64_t as it should be 8 bytes long):
#include <iostream>
#include <conio.h>
#include <stdint.h>
using namespace std;
int main()
{
uint32_t a(-1), b(-1);
uint64_t result1;
float result2;
result1 = uint64_t(a)*uint64_t(b)/4294967296ull; // >>32 would be faster and less memory consuming
result2 = float(a)*float(b)/4294967296.0f;
cout.precision(20);
cout<<result1<<'\n'<<result2;
getch();
return 0;
}
Produces:
4294967294
4294967296
But if you want really precise and correct answer I'd suggest using twice as big type for computing
Now that I think of it - you could use long double for uint64 and double for uint32 instead of writing function for uint64, but I don't think it's guaranteed that long double will be 128bit, and you'll have to check it. I'd go for more universal option.
EDIT:
You can write function to calculate that without using anything more
than A, B and result variable which would be of the same type as A.
Just add rightmost bit of (where Z equals B*(A>>pass_number&1)) Z<<0,
Z<<1, Z<<2 (...) Z<<X in first pass, Z<<-1, Z<<0, Z<<1 (...) Z<<(X-1)
for second (there should be X passes), while right shifting the result
by 1 (the just computed bit becomes irrelevant to us after it's
computed as it won't participate in calculation anymore, and it would
be erased anyway after dividing by 2^X (doing >>X)
(had to place in the "code" as I'm new here and couldn't find another way to prevent formatting script from eating half of it)
It's just a quick idea. You'll have to check it's correctness (sorry, but I'm really tired right now - but the result shouldn't overflow at any point of calculation, as the maximum carry would have value of 2X if I'm correct, and the algorithm itself seems to be good).
I will write code for that tomorrow if you'll still be in need of help.
I am trying to write a simple log base 2 method. I understand that representing something like std::log(8.0) and std::log(2.0) on a computer is difficult. I also understand std::log(8.0) / std::log(2.0) may result in a value very slightly lower than 3.0. What I do not understand is why putting the result of a the calculation below into a double and making it an lvalue then casting it to an unsigned int would change the result compared to casting the the formula directly. The following code shows my test case which repeatedly fails on my 32 bit debian wheezy machine, but passes repeatedly on my 64 bit debian wheezy machine.
#include <cmath>
#include "assert.h"
int main () {
int n = 8;
unsigned int i =
static_cast<unsigned int>(std::log(static_cast<double>(n)) /
std::log(static_cast<double>(2)));
double d =
std::log(static_cast<double>(n)) / std::log(static_cast<double>(2));
unsigned int j = static_cast<unsigned int> (d);
assert (i == j);
}
I also know I can use bit shifting to come up with my result in a more predictable way. I am mostly curious why casting the double that results int he operation is any different than sticking that value into a double on the stack and casting the double on the stack.
In C++, floating point is allowed to do this sort of thing.
One possible explanation would be that the result of the division is calculated internally in a higher precision than double, and stored in a register with higher precision than double.
Converting this directly to unsigned int gives a different result to first converting this to double and then to unsigned int.
To see exactly what is going on , it might be helpful to look at the assembly output generated by your compiler for the 32-bit case.
Needless to say, you shouldn't write code that relies on exactness of floating point operations.
I need to accurately convert a long representing bits to a double and my soluton shall be portable to different architectures (being able to be standard across compilers as g++ and clang++ woulf be great too).
I'm writing a fast approximation for computing the exp function as suggested in this question answers.
double fast_exp(double val)
{
double result = 0;
unsigned long temp = (unsigned long)(1512775 * val + 1072632447);
/* to convert from long bits to double,
but must check if they have the same size... */
temp = temp << 32;
memcpy(&result, &temp, sizeof(temp));
return result;
}
and I'm using the suggestion found here to convert the long into a double. The issue I'm facing is that whereas I got the following results for int values in [-5, 5] under OS X with clang++ and libc++:
0.00675211846828461
0.0183005779981613
0.0504353642463684
0.132078289985657
0.37483024597168
0.971007823944092
2.7694206237793
7.30961990356445
20.3215942382812
54.8094177246094
147.902587890625
I always get 0 under Ubuntu with clang++ (3.4, same version) and libstd++. The compiler there even tells me (through a warning) that the shifting operation can be problematic since the long has size equal or less that the shifting parameter (indicating that longs and doubles have not the same size probably)
Am I doing something wrong and/or is there a better way to solve the problem being as more compatible as possible?
First off, using "long" isn't portable. Use the fixed length integer types found in stdint.h. This will alleviate the need to check for the same size, since you'll know what size the integer will be.
The reason you are getting a warning is that left shifting 32 bits on the 32 bit intger is undefined behavior. What's bad about shifting a 32-bit variable 32 bits?
Also see this answer: Is it safe to assume sizeof(double) >= sizeof(void*)? It should be safe to assume that a double is 64bits, and then you can use a uint64_t to store the raw hex. No need to check for sizes, and everything is portable.
I am trying to convert a char* to double and back to char* again. the following code works fine if the application you created is 32-bit but doesn't work for 64-bit application. The problem occurs when you try to convert back to char* from int. for example if the hello = 0x000000013fcf7888 then converted is 0x000000003fcf7888 only the last 32 bits are right.
#include <iostream>
#include <stdlib.h>
#include <tchar.h>
using namespace std;
int _tmain(int argc, _TCHAR* argv[]){
char* hello = "hello";
unsigned int hello_to_int = (unsigned int)hello;
double hello_to_double = (double)hello_to_int;
cout<<hello<<endl;
cout<<hello_to_int<<"\n"<<hello_to_double<<endl;
unsigned int converted_int = (unsigned int)hello_to_double;
char* converted = reinterpret_cast<char*>(converted_int);
cout<<converted_int<<"\n"<<converted<<endl;
getchar();
return 0;
}
On 64-bit Windows pointers are 64-bit while int is 32-bit. This is why you're losing data in the upper 32-bits while casting. Instead of int use long long to hold the intermediate result.
char* hello = "hello";
unsigned long long hello_to_int = (unsigned long long)hello;
Make similar changes for the reverse conversion. But this is not guaranteed to make the conversions function correctly because a double can easily represent the entire 32-bit integer range without loss of precision but the same is not true for a 64-bit integer.
Also, this isn't going to work
unsigned int converted_int = (unsigned int)hello_to_double;
That conversion will simply truncate anything digits after the decimal point in the floating point representation. The problem exists even if you change the data type to unsigned long long. You'll need to reinterpret_cast<unsigned long long> to make it work.
Even after all that you may still run into trouble depending on the value of the pointer. The conversion to double may cause the value to be a signalling NaN for instance, in which cause your code might throw an exception.
Simple answer is, unless you're trying this out for fun, don't do conversions like these.
You can't cast a char* to int on 64-bit Windows because an int is 32 bits, while a char* is 64 bits because it's a pointer. Since a double is always 64 bits, you might be able to get away with casting between a double and char*.
A couple of issues with encoding any integer (specifically, a collection of bits) into a floating point value:
Conversions from 64-bit integers to doubles can be lossy. A double has 53-bits of actual precision, so integers above 2^52 (give or take an extra 2) will not necessarily be represented precisely.
If you decide to reinterpret the bits of a pointer as a double instead (via union or reinterpret_cast) you will still have issues if you happen to encode a pointer as set of bits that are not a valid double representation. Unless you can guarantee that the double value never gets written back by the FPU, the FPU can silently transform an invalid double into another invalid double (see NaN), i.e., a double value that represents the same value but has different bits. (See this for issues related to using floating point formats as bits.)
You can probably safely get away with encoding a 32-bit pointer in a double, as that will definitely fit within the 53-bit precision range.
only the last 32 bits are right.
That's because an int in your platform is only 32 bits long. Note that reinterpret_cast only guarantees that you can convert a pointer to an int of sufficient size (not your case), and back.
If it works in any system, anywhere, just all yourself lucky and move on. Converting a pointer to an integer is one thing (as long as the integer is large enough, you can get away with it), but a double is a floating point number - what you are doing simply doesn't make any sense, because a double is NOT necessarily capable of representing any random number. A double has range and precision limitations, and limits on how it represents things. It can represent numbers across a wide range of values, but it can't represent EVERY number in that range.
Remember that a double has two components: the mantissa and the exponent. Together, these allow you to represent either very big or very small numbers, but the mantissa has limited number of bits. If you run out of bits in the mantissa, you're going to lose some bits in the number you are trying to represent.
Apparently you got away with it under certain circumstances, but you're asking it to do something it wasn't made for, and for which it is manifestly inappropriate.
Just don't do that - it's not supposed to work.
This is as expected.
Typically a char* is going to be 32 bits on a 32-bit system, 64 bits on a 64-bit system; double is typically 64 bits on both systems. (These sizes are typical, and probably correct for Windows; the language permits a lot more variations.)
Conversion from a pointer to a floating-point type is, as far as I know, undefined. That doesn't just mean that the result of the conversion is undefined; the behavior of a program that attempts to perform such a conversion is undefined. If you're lucky, the program will crash or fail to compile.
But you're converting from a pointer to an integer (which is permitted, but implementation-defined) and then from an integer to a double (which is permitted and meaningful for meaningful numeric values -- but converted pointer values are not numerically meaningful). You're losing information because not all of the 64 bits of a double are used to represent the magnitude of the number; typically 11 or so bits are used to represent the exponent.
What you're doing quite simply makes no sense.
What exactly are you trying to accomplish? Whatever it is, there's surely a better way to do it.
We're trying to compare two equally sized native arrays of signed int values using inequality operations, <, <=, > and >=, in a high performance way. As many values are compared, the true/false results would be sotred in a char array of the same size of the input, where 0x00 means false and 0xff means true.
To accomplish this, we're using the Intel IPP library. The problem is that the function we found that does this operation, named ippiCompare_*, from the images and video processing lib, supports only the types unsigned char (Ipp8u), signed/unsigned short (Ipp16s/Ipp16u) and float (Ipp32f). It does not directly support signed int (Ipp32s)
I (only) envision two possible ways of solving this:
Casting the array to one of the directly supported types and executing the comparison in more steps (it would became a short array of twice the size or a char array of four times the size) and merging intermediate results.
Using another function directly supporting signed int arrays from IPP or from another library that could do something equivalent in terms of performance.
But there may be other creative ways... So I ask you're help with that! :)
PS: The advantage of using Intel IPP is the performance gain for large arrays: it uses multi-value processor functions and many cores simultaneously (and maybe more tricks). So simple looped solutions wouldn't do it as fast AFAIK.
PS2: link for the ippiCompare_* doc
You could do the comparison with PCMPEQD followed by a PACKUSDW and PACKUSWB. This would be something along
#include <emmintrin.h>
void cmp(__m128d* a, __m128d* b, v16qi* result, unsigned count) {
for (unsigned i=0; i < count/16; ++i) {
__m128d result0 = _mm_cmpeq_pd(a[0], b[0]); // each line compares 4 integers
__m128d result1 = _mm_cmpeq_pd(a[1], b[1]);
__m128d result2 = _mm_cmpeq_pd(a[2], b[2]);
__m128d result3 = _mm_cmpeq_pd(a[3], b[3]);
a += 4; b+= 4;
v8hi wresult0 = __builtin_ia32_packssdw(result0, result1); //pack 2*4 integer results into 8 words
v8hi wresult1 = __builtin_ia32_packssdw(result0, result1);
*result = __builtin_ia32_packsswb(wresult0, wresult1); //pack 2*8 word results into 16 bytes
result++;
}
}
Needs aligned pointers, a count divisible by 16, some typecasts I have omitted because of lazyness/stupidity and probably a lot of debugging, of course. And I didn't find the intrinsics for packssdw/wb, so I just used the builtins from my compiler.
I thought there is an SSE instruction that would compare integers. Have you look into the intrinsics that can do that?
Backing out of the box for a bit: are you sure this is a performance problem? Unless your data set fits in L1 cache, you will be cache-fill limited and the actual cycles you're spending on your comparison operations (which are hardly slow even when done in the most naive way possible) can't possibly be limiting.