Implementing custom atomic_add() which works with floats - c++

I'm trying to follow section B.12 of https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html to implement an atomic add that works with floats. Simply copying and pasting the code from there and changing the types to float does not work, because I can't perform the pointer cast from GLOBAL to PRIVATE required for the atomicCAS operation. To get around this I decided to use atomic_xchg(), because it works with floats, with an additional if statement to achieve the same functionality as atomicCAS. However, this returns a different answer every time I run the program when I perform the addition over a large float vector.
I've tried to figure out how to work around the explicit GLOBAL-to-PRIVATE conversion, but I honestly don't know how to do it so that the addition modifies the value at the address argument instead of some temporary variable.
kernel void atomicAdd_2(volatile global float* address, float value)
{
    float old = *address, assumed;
    do {
        assumed = old;
        if (*address == assumed) {
            old = atomic_xchg(address, value + assumed);
        }
        else {
            old = *address;
        }
        // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
}
This is my implementation of atomicAdd for floats.
kernel void reduce_add(global const float* input, global float* output) {
    float temp = 242.23f;
    atomicAdd_floats(&output[0], temp);
    printf(" %f ", output[0]);
}
This is the function where I supply the arguments to atomicAdd_floats. Note that my input argument contains a vector of floats, and the output argument is simply where I want to store the result, specifically in the first element of the output vector, output[0]. But instead, when I printf(" %f ", output[0]), it shows my default initialisation value 0.

First of all, I'd suggest removing the "kernel" keyword from the atomicAdd_2 function. "kernel" should be used only on functions you intend to enqueue from the host. Second, there are existing OpenCL implementations of atomic-add-float on the net.
Then, I'm a bit confused as to what you're trying to do. Are you trying to sum a vector while keeping the sum in a private variable? If so, it makes no sense to use atomicAdd: private memory is always "atomic", because it's private. Atomicity is only required for global and local memory, because they're shared.
Otherwise I'm not sure why you mention changing the address, or the GLOBAL-to-PRIVATE conversion.
Anyway, the code from the link should work, though it's relatively slow. If the vector to sum is large, you might be better off with a different algorithm (with partial sums). Try searching for "opencl parallel sum" or similar.

Related

What uses more memory in C++? 2 ints or 2 functions?

I am writing C++ for the Nintendo DS (with 4 MB of RAM). I have a button class that stores data like the x, y location and length. Which of the following would take less memory?
Method 1, class variables length, x, y, and halfPoint
Button::Button(int setX, int setY, int setLength)
{
    x = setX;
    y = setY;
    length = setLength;
    halfPoint = length/2;
}
//access variable with buttonName.halfPoint
Method 2, class variables length, x and y
Button::Button(int setX, int setY, int setLength)
{
    x = setX;
    y = setY;
    length = setLength;
}
int Button::getHalfPoint()
{
    return length/2;
}
//access variable with buttonName.getHalfPoint()
Any help is appreciated. (And in the real code I calculate a location much more complex than the half point)
The getHalfPoint() method will take up less room if there are a lot of Buttons. Why? Because member functions are actually just implemented by the compiler as regular functions with an implied first argument of a pointer to the object. So your function is rewritten by the compiler as:
int getHalfPoint(Button* this)
{
    return this->length/2;
}
(It is a bit more complicated, because of name mangling, but this will do for an explanation.)
You should carefully consider the extra amount of computation that will have to be done to avoid storing 4 extra bytes, however. And as Cameron mentions, the compiler might add extra space to the object anyway, depending upon the architecture (I think that is likely to happen with RISC architectures).
Well, that depends!
The method code exists exactly once in memory, but a member variable exists once for each object instance.
So you'll have to count the number of instances you create (multiplied by the sizeof the variable), and compare that to the size of the compiled method (using a tool like e.g. objdump).
You'll also want to compare the size of your Button with and without the extra variable, because it's entirely possible that the compiler pads it to the same length anyway.
I suggest you define the getHalfPoint method inside your class definition. This allows the compiler to inline the code.
There is a possibility that the code in the function is one assembly instruction, and depending on your platform, takes 4 bytes or less. In that case, there is probably no benefit to having a variable hold half of another variable. Research "right shifting". Also, to take full advantage, make the variable an unsigned int. (Right shifting a negative signed integer is implementation-defined.)
The inline capability means that the content of the function will be pasted wherever there is a call to the function. This reduces the overhead of a function call (such as the branch instruction, pushing and popping arguments). The reduction of a branch instruction may even speed up the program because there is no flushing of the instruction cache or pipeline.

Reversibly Combining Two uint32_t's Without Changing Datatype

Here's my issue: I need to pass back two uint32_t's via a single uint32_t (because of how the API is set up...). I can hard code whatever other values I need to reverse the operation, but the parameter passed between functions needs to stay a single uint32_t.
This would be trivial if I could just bit-shift the two 32-bit ints into a single 64-bit int (like what was explained here), but the compiler wouldn't like that. I've also seen mathematical pairing functions, but I'm not sure if that's what I need in this case.
I've thought of setting up a simple cipher: the uint32_t could be the ciphertext, and I could just hard code the key. But that seems like overkill.
Is this even possible?
It is not possible to store more than 32 bits of information using only 32 bits. This is a basic result of information theory.
If you know that you're only using the low-order 16 bits of each value, you could shift one left 16 bits and combine them that way. But there's absolutely no way to get 64 bits worth of information (or even 33 bits) into 32 bits, period.
Depending on how much trouble this is really worth, you could:
create a global array or vector of std::pair<uint32_t,uint32_t>
pass an index into the function, then your "reverse" function just looks up the result in the array.
write some code to decide which index to use when you have a pair to pass. The index needs to not be in use by anyone else, and since the array is global there may be thread-safety issues. Essentially what you are writing is a simple memory allocator.
As a special case, on a machine with 32 bit data pointers you could allocate the struct and reinterpret_cast the pointer to and from uint32_t. So you don't need any globals.
Beware that you need to know whether or not the function you pass the value into might store the value somewhere to be "decoded" later, in which case you have a more difficult resource-management problem than if the function is certain to have finished using it by the time it returns.
In the easy case, and if the code you're writing doesn't need to be re-entrant at all, then you only need to use one index at a time. That means you don't need an array, just one pair. You could pass 0 to the function regardless of the values, and have the decoder ignore its input and look in the global location.
If both special cases apply (32 bit and no retaining of the value), then you can put the pair on the stack, and use no globals and no dynamic allocation even if your code does need to be re-entrant.
None of this is really recommended, but it could solve the problem you have.
You can use an intermediate global data structure to store the pairs of uint32_t values, using your single uint32_t parameter as the key into the structure:
struct my_pair {
    uint32_t a, b;
};

std::map<uint32_t, my_pair> global_pair_map;
uint32_t next_key = 0;

uint32_t register_new_pair(uint32_t a, uint32_t b) {
    // Store the pair (a, b) in global_pair_map under a fresh key,
    // and return that key.
    uint32_t key = next_key++;
    global_pair_map[key] = {a, b};
    return key;
}

void release_pair(uint32_t key) {
    // Remove the key from the global_pair_map.
    global_pair_map.erase(key);
}

void callback(uint32_t user_data) {
    my_pair& p = global_pair_map[user_data];
    // Use your pair of uint32_t with p.a and p.b.
}

int main() {
    uint32_t key = register_new_pair(number1, number2);
    register_callback(callback, key);
}

Faster execution in C++ structure creation

I am working in C++. I have a structure with almost 100 float variables in it, and I am going to initialize them all to 0 in the no-argument constructor. Which way is faster?
Type 1:
struct info
{
    //no argument constructor
    info();
    float x1;
    float x2;
    .
    .
    .
    float x100;
}Info;
info::info()
{
    float x1 = 0;
    float x2 = 0;
    .
    .
    .
    float x100 = 0;
}
//creation
Info* info1 = new Info();
Type2 :
typedef struct info
{
    float x1;
    float x2;
    .
    .
    .
    float x100;
}Info;
Info* infoIns = new Info;
memset(infoIns, 0, sizeof(Info));
One hundred variables called x1 .. x100 just CALL out to be an array (or, if the number varies, perhaps a vector).
In which case std::fill(x, x+100, 0.0f) would probably beat all of the choices above.
A better solution is probably to just initialize the whole object:
Info* infoIns = new Info();
or
Info infoIns = {}; // C++11
or
Info infoIns = Info();
Whenever it's a question of performance, the ONLY answer that applies is "what you can measure". I can sit here and explain exactly why in my experience, on my machine (or my machines) method A is faster than method B or method C. But if you are using a different compiler, or have a different processor, that may not apply at all, because the compiler you use is doing something different.
Regardless of which is "faster", you should use a constructor to set the values to zero. If you want to use memset, then by all means do so, but inside the constructor. That way, you won't find some place in the code where you FORGOT to set one of your structures to zero before trying to use it. Bear in mind that setting a struct/class to zero using memset is very dangerous. If the class or struct has virtual member functions (or contains some object that does), this will most likely overwrite the vptr, which points at the virtual function table. That is a bad thing. So if you want to use memset, use it with &x1 and a size of 100 * sizeof(float) (but using an array is probably a better choice again).
In the absence of measurement, clearing will likely be faster but much, much nastier, and, if the code you've posted is what you actually implemented, the clearing will actually work a lot better: your constructor declares new local variables named x1 .. x100 and initializes those, leaving the members untouched. The second way is not particularly good style in any case, and you should use member initialisation.
I'd be more inclined to use a std::array (C++11) or a std::vector or even a plain array in any case.
The second version, apart from the already-mentioned flaws, doesn't actually guarantee that the float values will be 0.0 after memset:
3.9.1 Fundamental types [basic.fundamental]
8
...
The value representation of floating-point types is implementation-defined.
...
Thus you can end up with non-zero values in your floating-point variables on an implementation with an unusual float representation.

C++ streamsize prec = cout.precision(3) - How does it work?

I am kind of a newbie at using C++. I have a quick question, probably a dumb one.
streamsize prec = cout.precision(3);
As I understand it, this declaration works like this: set the cout precision to 3, but assign the previous precision value to prec.
Also, simply, we can assign a function result (say a math addition function) to a variable:
int z = addition(3,4);
In the second one, it does the calculation and assigns the result to the variable z, not the previous value or a default value. Is my understanding correct? What is the difference between them?
What value a function returns depends entirely on that particular function. Most functions simply return a result of their operation.
The state-setting functions in standard library streams (such as precision) are a bit unusual in their interface of "I set a new value and return the old one," but it's still perfectly valid, as long as the function's behaviour is documented (which it is in their case).

Memory Access Violations When Using SSE Operations

I've been trying to re-implement some existing vector and matrix classes to use SSE3 commands, and I seem to be running into these "memory access violation" errors whenever I perform a series of operations on an array of vectors. I'm relatively new to SSE, so I've been starting off simple. Here's the entirety of my vector class:
class SSEVector3D
{
public:
    SSEVector3D();
    SSEVector3D(float x, float y, float z);

    SSEVector3D& operator+=(const SSEVector3D& rhs); //< Elementwise Addition

    float x() const;
    float y() const;
    float z() const;

private:
    float m_coords[3] __attribute__ ((aligned (16))); //< The x, y and z coordinates
};
So, not a whole lot going on yet, just some constructors, accessors, and one operation. Using my (admittedly limited) knowledge of SSE, I implemented the addition operation as follows:
SSEVector3D& SSEVector3D::operator+=(const SSEVector3D& rhs)
{
    __m128 * pLhs = (__m128 *) m_coords;
    __m128 * pRhs = (__m128 *) rhs.m_coords;
    *pLhs = _mm_add_ps(*pLhs, *pRhs);
    return (*this);
}
To speed-test my new vector class against the old one (to see if it's worth re-implementing the whole thing), I created a simple program that generates a random array of SSEVector3D objects and adds them together. Nothing too complicated:
SSEVector3D sseSum(0, 0, 0);
for(i = 0; i < sseVectors.size(); i++)
{
    sseSum += sseVectors[i];
}
printf("Total: %f %f %f\n", sseSum.x(), sseSum.y(), sseSum.z());
The sseVectors variable is an std::vector containing elements of type SSEVector3D, whose components are all initialized to random numbers between -1 and 1.
Here's the issue I'm having. If the size of sseVectors is 8,191 or less (a number I arrived at through a lot of trial and error), this runs fine. If the size is 8,192 or more, I get this error when I try to run it:
signal: SIGSEGV, si_code: 0 (memory access violation at address: 0x00000080)
However, if I comment out that print statement at the end, I get no error even if sseVectors has a size of 8,192 or more.
Is there something wrong with the way I've written this vector class? I'm running Ubuntu 12.04.1 with GCC version 4.6.
First, and foremost, don't do this
__m128 * pLhs = (__m128 *) m_coords;
__m128 * pRhs = (__m128 *) rhs.m_coords;
*pLhs = _mm_add_ps(*pLhs, *pRhs);
With SSE, always do your loads and stores explicitly via the appropriate intrinsics, never by just dereferencing. Instead of storing an array of 3 floats in your class, store a value of type __m128. That should make the compiler align instances of your class correctly, without any need for alignment attributes.
Note, however, that this won't work very well with MSVC. MSVC seems to generally be unable to cope with alignment requirements stronger than 8-byte aligned for by-value arguments :-(. The last time I needed to port SSE code to windows, my solution was to use Intel's C++ compiler for the SSE parts instead of MSVC...
The trick is to notice that __m128 is 16-byte aligned. Use an aligned allocator such as _mm_malloc() (paired with _mm_free()) or posix_memalign() to ensure that your float array is correctly aligned; then you can go ahead and treat your floats as an array of __m128. Make sure also that the number of floats you allocate is divisible by four.