CUDA: reindexing arrays - C++

In this example, I have three float arrays: query_points[], initial_array[], and result_array[]. Values in query_points[] are rounded down to become index values, and I want to copy the data at those indexes in initial_array[] to result_array[].
The problem I'm having is that every few hundred values I get a result that differs from properly working C++ code. I am new to CUDA and not sure what is happening. Please let me know if you can point me towards a solution. Thanks!
CUDA Code:
int w = blockIdx.x * blockDim.x + threadIdx.x; // col (width)
int h = blockIdx.y * blockDim.y + threadIdx.y; // row (height)
int index = h * width + w;

if ((w < width) && (h < height)) {
    int piece = floor(query_points[index]) - 1;
    int piece_index = h * width + piece;
    result_array[index] = initial_array[piece_index];
}

You gave the answer in your own comment: "I also think it may have had to do with the fact that I was passing the same input and output array into the function, trying to do an in place operation."
Your description of the symptom (it only happens occasionally and it only repros on large arrays) also fits the explanation.
Note that it isn't always possible to guard against race conditions if you want full concurrency - you may have to use separate input and output arrays. Merge Sort and Radix Sort, for example, both ping-pong between intermediate arrays while processing; doing them strictly in place, without O(N) auxiliary space, is notoriously difficult.
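To illustrate in plain C++ (a minimal sketch with hypothetical names, not your actual kernel): out of place, every read comes from the input array and every write goes to a separate output array, so no read can observe a value that has already been overwritten. In place, another thread (or even an earlier iteration) may already have replaced the element you are about to read, which is why the mismatches only appear occasionally.
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal sketch (hypothetical names): out-of-place reindexing of one row.
// Reads only from `initial`, writes only to `result`. Passing the same
// buffer for both recreates the read-after-write hazard described above.
void reindex_row(const std::vector<float>& query_points,
                 const std::vector<float>& initial,
                 std::vector<float>& result)
{
    for (std::size_t i = 0; i < query_points.size(); ++i) {
        int piece = static_cast<int>(std::floor(query_points[i])) - 1;
        result[i] = initial[static_cast<std::size_t>(piece)];
    }
}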

I didn't write code to test it, but there are two problems that I can see:
If you are floor-ing a float, then use the floorf() function. I do not think this is the cause, but it is obviously the better way to do it.
The main issue that I can see is subtler, or maybe I am just speculating: floor() returns double and floorf() returns float. So, when you do:
floor(query_points[index]) - 1;
what you have is still a floating-point value, and due to precision loss it might be slightly smaller than the integral value you are supposed to get. When you implicitly convert it to an integer with
int piece = floor(query_points[index]) - 1;
you basically truncate the decimal part and get n-1 where you think you are getting n.
Even without this analysis, consider:
int piece = floor(query_points[index]) - 1;
In this line you are flooring and then truncating, which (for non-negative values) are essentially the same thing, so you do not even need to use floor() or floorf().
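A minimal sketch of both points (assuming non-negative inputs, as index values would be): the cast to int already truncates, so the floor call adds nothing, and if the stored float sits fractionally below the intended integer you silently get n-1 instead of n.
#include <cmath>
#include <cstdio>

int main() {
    float q = 5.0f;                           // the intended integral value
    float almost = std::nextafterf(q, 0.0f);  // 4.9999995..., plausible after earlier float math

    int a = static_cast<int>(q) - 1;                   // 4: plain truncation, no floor needed
    int b = static_cast<int>(std::floor(almost)) - 1;  // 3: precision loss yields n-1 instead of n

    std::printf("%d %d\n", a, b);  // prints "4 3"
    return 0;
}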


How to convert 3 addition and 1 multiply into vectorized SIMD using intrinsic functions C++

I'm working with a problem using a 2D prefix sum, also called a Summed-Area Table S. For a 2D array I (grayscale image/matrix/etc.), its definition is:
S[x][y] = S[x-1][y] + S[x][y-1] - S[x-1][y-1] + I[x][y]
Sqr[x][y] = Sqr[x-1][y] + Sqr[x][y-1] - Sqr[x-1][y-1] + I[x][y]^2
Calculating the sum of a sub-matrix with two corners (top,left) and (bot,right) can be done in O(1):
sum = S[bot][right] - S[bot][left-1] - S[top-1][right] + S[top-1][left-1]
One of my problems is to calculate all possible sub-matrix sums of a constant size (bot - top == right - left == R), which are then used to calculate their mean/variance. I've vectorized it into the form below.
lineSize is the number of elements to be processed at once. I chose lineSize = 16 because Intel AVX instructions can work on 4 doubles at the same time (8 with AVX-512). It can be 8/16/32/...
#define cell(i, j, w) ((i)*(w) + (j))

using int64 = long long; // not a standard type name; assuming a 64-bit integer

const int lineSize = 16;
const int R = 3; // any integer
const int submatArea = (R + 1) * (R + 1);
const double submatAreaInv = double(1) / submatArea;

void subMatrixVarMulti(int64* S, int64* Sqr, int top, int left, int bot, int right,
                       int w, int h, int diff, double submatAreaInv,
                       double mean[lineSize], double var[lineSize])
{
    const int indexCache    = cell(top, left, w),
              indexTopLeft  = cell(top - 1, left - 1, w),
              indexTopRight = cell(top - 1, right, w),
              indexBotLeft  = cell(bot, left - 1, w),
              indexBotRight = cell(bot, right, w);

    for (int i = 0; i < lineSize; i++) {
        mean[i] = (S[indexBotRight + i] - S[indexBotLeft + i] - S[indexTopRight + i] + S[indexTopLeft + i]) * submatAreaInv;
        var[i]  = (Sqr[indexBotRight + i] - Sqr[indexBotLeft + i] - Sqr[indexTopRight + i] + Sqr[indexTopLeft + i]) * submatAreaInv
                  - mean[i] * mean[i];
    }
}
How can I optimize the above loop to have the highest possible speed? Readability doesn't matter. I heard it can be done using AVX2 and intrinsic functions, but I don't know how.
Edit: the CPU is an i7-7700HQ (Kaby Lake, same microarchitecture family as Skylake)
Edit 2: forgot to mention that lineSize, R, ... are already const
Your compiler can generate AVX/AVX2/AVX-512 instructions for you, but you need to:
Select the latest available architecture when compiling. For example with GCC you might say -march=skylake if you know your code will run on Skylake and later, but does not need to support older CPUs. Without such a flag, AVX instructions will not be generated.
Add restrict or __restrict to your pointer inputs to tell the compiler they do not overlap. This applies to S and Sqr, as well as mean and var (both pairs have the same type, so the compiler assumes they might overlap, but you know they do not).
Make sure your data is "over-aligned." For example, if you want the compiler to use 256-bit AVX2 instructions, you should align your arrays to 32 bytes. There are a few ways to do this, such as making a typedef with the alignment, using alignas(), or using std::assume_aligned() (before C++20, GCC offers the __builtin_assume_aligned builtin). The point is that the compiler needs to know that S, Sqr, mean, and var are aligned to the largest SIMD vector size available on your target architecture, so that it does not have to generate as much fixup code.
Use constexpr where possible, such as lineSize.
Most importantly, profile to compare performance as you make changes, and look at the generated code (e.g. g++ -S) to see if it looks the way you want it to.
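As a rough, untested sketch of what those hints look like in code (simplified, hypothetical signature; the precomputed indices stand in for the cell() arithmetic):
#include <cstdint>

constexpr int lineSize = 16;

// __restrict promises the compiler that the output arrays do not alias the inputs,
// so it can keep values in vector registers instead of reloading them.
void subMatrixVarMulti(const std::int64_t* __restrict S,
                       const std::int64_t* __restrict Sqr,
                       int indexTopLeft, int indexTopRight,
                       int indexBotLeft, int indexBotRight,
                       double submatAreaInv,
                       double* __restrict mean,
                       double* __restrict var)
{
    for (int i = 0; i < lineSize; ++i) {
        mean[i] = (S[indexBotRight + i] - S[indexBotLeft + i]
                 - S[indexTopRight + i] + S[indexTopLeft + i]) * submatAreaInv;
        var[i]  = (Sqr[indexBotRight + i] - Sqr[indexBotLeft + i]
                 - Sqr[indexTopRight + i] + Sqr[indexTopLeft + i]) * submatAreaInv
                 - mean[i] * mean[i];
    }
}

// The caller's buffers would be over-aligned, e.g.:
// alignas(32) double mean[lineSize];   // 32 bytes = one 256-bit AVX2 vector
// alignas(32) double var[lineSize];
Compiled with something like g++ -O3 -march=skylake, a loop of this shape should vectorise without hand-written intrinsics.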
I don't think you can perform this type of sum efficiently using SIMD, due to the dependencies in the summation.
Instead you can do the computation differently which can be trivially optimized with SIMD:
Compute row-only partial sums. You parallelize this with SIMD by processing multiple rows simultaneously.
Now, with the rows summed up, compute column-only partial sums into the output using the same SIMD optimization, and you obtain your desired Summed-Area Table.
You can do the same for both summation and summation of squares.
The only issue is that you need extra memory, and this type of computation requires more memory accesses. The extra memory is probably a minor thing, but the extra memory traffic can perhaps be reduced by storing the temporary data (the row sums) in a cache-friendly manner. You'll probably need to experiment with this.
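A scalar sketch of that two-pass construction (hypothetical names): each pass has independent work across the other dimension, which is what makes it SIMD-friendly.
#include <cstdint>
#include <vector>

// Hypothetical sketch: build a summed-area table in two separable passes.
// I is the input image (h rows, w cols, row-major); S is the output table.
void buildSAT(const std::vector<std::uint8_t>& I,
              std::vector<std::uint64_t>& S, int w, int h)
{
    // Pass 1: prefix sums along each row.
    for (int y = 0; y < h; ++y) {
        std::uint64_t run = 0;
        for (int x = 0; x < w; ++x) {
            run += I[y * w + x];
            S[y * w + x] = run;
        }
    }
    // Pass 2: prefix sums down each column. The inner loop over x touches
    // consecutive memory and has no dependency between columns, so it
    // vectorises well; the only dependency is between consecutive rows.
    for (int y = 1; y < h; ++y)
        for (int x = 0; x < w; ++x)
            S[y * w + x] += S[(y - 1) * w + x];
}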

Finding the remainder of a large multiplication in C++

I would like to ask some help concerning the following problem. I need to create a function that will multiply two integers and extract the remainder of this multiplication divided by a certain number (in short, (x*y)%A).
I am using unsigned long long int for this problem, but A = 15! in this case, and both x and y have been calculated modulo A previously. Thus, x*y can be greater than 2^64 - 1, therefore overflowing.
I did not want to use external libraries. Could anyone help me designing a short algorithm to solve this problem?
Thanks in advance.
If you already have x and y modulo A, why not use them? Something like:
if,
x = int_x*A + mod_x
y = int_y*A + mod_y
then
(x*y)%A = ((int_x*A + mod_x)(int_y*A + mod_y))%A = (mod_x*mod_y)%A
mod_x*mod_y should be much smaller, right?
EDIT:
If you are trying to find the modulus wrt a large number like 10e11, I guess you would have to use another method. But while not really efficient, something like this would work
MAX = maximum representable value       // e.g. 2^64 - 1 for unsigned long long
larger  = max(mod_x, mod_y)             // the larger factor
smaller = min(mod_x, mod_y)             // the smaller factor
chunk   = floor(MAX / smaller)          // largest piece of `larger` whose product with `smaller` cannot overflow

if (chunk >= larger):
    // no risk of overflow, use the normal (smaller * larger) % A
else:
    parts = []
    while (larger > chunk):
        parts.append(chunk)
        larger -= chunk
    parts.append(larger)
    // now compute (part * smaller) % A for each part and add the results together mod A
If you understand this code and the setup, you should be able to figure out the rest
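As an alternative to the chunking above, a compact overflow-safe (x*y) % A can also be written with binary "double-and-add" multiplication. This is only a sketch, and it assumes x, y < A and A < 2^63 (true for A = 15!):
#include <cstdint>

// Computes (x * y) % m without overflowing 64 bits, assuming x, y < m
// and m < 2^63. Classic double-and-add.
std::uint64_t mulmod(std::uint64_t x, std::uint64_t y, std::uint64_t m)
{
    std::uint64_t result = 0;
    x %= m;
    while (y > 0) {
        if (y & 1)
            result = (result + x) % m;   // add one copy of x
        x = (x + x) % m;                 // double x (stays < m, so no overflow)
        y >>= 1;
    }
    return result;
}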

Efficiently Building Summed Area Table

I am trying to construct a summed area table for later use in an adaptive thresholding routine. Since this code is going to be used in time critical software, I am trying to squeeze as many cycles as possible out of it.
For performance, the table stores an unsigned integer for every pixel.
When I attach my profiler, I am showing that my largest performance bottleneck occurs when performing the x-pass.
The simple math expression for the computation is:
sat_[y * width + x] = sat_[y * width + x - 1] + buff_[y * width + x]
where the running sum resets at every new y position.
In this case, sat_ is a 1-D pointer of unsigned integers representing the SAT, and buff_ is an 8-bit unsigned monochrome buffer.
My implementation looks like the following:
uint *pSat = sat_;
char *pBuff = buff_;

for (size_t y = 0; y < height; ++y, pSat += width, pBuff += width)
{
    uint curr = 0;
    for (uint x = 0; x < width; x += 4)
    {
        pSat[x + 0] = curr += pBuff[x + 0];
        pSat[x + 1] = curr += pBuff[x + 1];
        pSat[x + 2] = curr += pBuff[x + 2];
        pSat[x + 3] = curr += pBuff[x + 3];
    }
}
The loop is unrolled manually because my compiler (VC11) didn't do it for me. The problem I have is that the entire segmentation routine spends an extraordinary amount of time just running through that loop, and I am wondering if anyone has any thoughts on what might speed it up. I have access to all of the SSE instruction sets, and AVX, on any machine this routine will run on, so if there is something there, that would be extremely useful.
Also, once I squeeze out the last cycles, I then plan on extending this to multi-core, but I want to get the single thread computation as tight as possible before I make the model more complex.
You have a dependency chain running along each row; each result depends on the previous one. So you cannot vectorise/parallelise in that direction.
But, it sounds like each row is independent of all the others, so you can vectorise/paralellise by computing multiple rows simultaneously. You'd need to transpose your arrays, in order to allow the vector instructions to access neighbouring elements in memory.*
However, that creates a problem. Walking along rows would now be absolutely terrible from a cache point of view (every iteration would be a cache miss). The way to solve this is to interchange the loop order.
Note, though, that each element is read precisely once. And you're doing very little computation per element. So you'll basically be limited by main-memory bandwidth well before you hit 100% CPU usage.
* This restriction may be lifted in AVX2, I'm not sure...
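Here is the multiple-rows idea as a plain scalar sketch (hypothetical names, assuming height is a multiple of 4). The four running sums are independent, so each inner iteration maps naturally onto one SIMD lane per row; note that it implies strided accesses unless the buffers are transposed as suggested above.
#include <cstddef>

// Hypothetical sketch: the x-pass of the SAT computed over four rows at once.
// buff is the 8-bit unsigned input, sat the output table (row-major).
void satXPass4(const unsigned char* buff, unsigned* sat,
               std::size_t width, std::size_t height)
{
    for (std::size_t y = 0; y < height; y += 4) {
        unsigned run[4] = {0, 0, 0, 0};            // one running sum per row
        for (std::size_t x = 0; x < width; ++x) {
            for (int r = 0; r < 4; ++r) {          // the four SIMD "lanes"
                run[r] += buff[(y + r) * width + x];
                sat[(y + r) * width + x] = run[r];
            }
        }
    }
}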
Algorithmically, I don't think there is anything you can do to optimize this further. Even though you didn't use the term OLAP cube in your description, you are basically just building an OLAP cube. The code you have is the standard approach to building an OLAP cube.
If you give details about the hardware you're working with, there might be some optimizations available. For example, there is a GPU programming approach that may or may not be faster. Note: another answer on this thread mentioned that parallelization is not possible. This isn't necessarily true... the algorithm as written can't be parallelized directly, but there are equivalent algorithms that preserve data-level parallelism, which could be exploited with a GPU approach.

How to approximate the count of distinct values in an array in a single pass through it

I have several huge arrays (millions++ of members). All of these are arrays of numbers and they are not sorted (and I cannot sort them). Some are uint8_t, some uint16_t/32/64. I would like to approximate the count of distinct values in these arrays. The conditions are the following:
speed is VERY important, I need to do this in one pass through the array and I must go through it sequentially (cannot jump back and forth) (I am doing this in C++, if that's important)
I don't need EXACT counts. What I want to know is, for example for a uint32_t array, whether there are something like 10 or 20 distinct numbers, or whether there are thousands or millions.
I have quite a bit of memory that I can use, but the less is used the better
the smaller the array data type, the more accurate I need to be
I don't mind the STL, but if I can do it without it, that would be great (no Boost though, sorry)
if the approach can be easily parallelized, that would be cool (but it's not a mandatory condition)
Examples of perfect output:
ArrayA [uint32_t, 3M members]: ~128 distinct values
ArrayB [uint32_t, 9M members]: 100000+ distinct values
ArrayC [uint8_t, 50K members]: 2-5 distinct values
ArrayD [uint8_t, 700K members]: 64+ distinct values
I understand that some of the constraints may seem illogical, but that's the way it is.
As a side note, I also want the top X (3 or 10) most used and least used values, but that is far easier to do and I can do it on my own. However, if someone has thoughts on that too, feel free to share them!
EDIT: a bit of clarification regarding the STL. If you have a solution using it, please post it. Not using the STL would just be a bonus for us; we don't fancy it too much. However, if it is a good solution, it will be used!
For 8- and 16-bit values, you can just make a table of the count of each value; every time you write to a table entry that was previously zero, that's a different value found.
For larger values, if you are not interested in counts above 100000, std::map is suitable, if it's fast enough. If that's too slow for you, you could program your own B-tree.
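For instance, a sketch of the 8-bit case (the 16-bit case is identical with a 65536-entry table):
#include <cstddef>
#include <cstdint>
#include <vector>

// Exact distinct count for 8-bit data: one counter per possible value.
std::size_t countDistinct8(const std::uint8_t* data, std::size_t n)
{
    std::vector<std::uint32_t> counts(256, 0);
    std::size_t distinct = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (counts[data[i]]++ == 0)   // first time this value is seen
            ++distinct;
    }
    return distinct;
}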
I'm pretty sure you can do it by:
Create a Bloom filter
Run through the array inserting each element into the filter (this is a "slow" O(n), since it requires computing several independent decent hashes of each value)
Count how many bits are set in the Bloom Filter
Compute back from the density of the filter to an estimate of the number of distinct values. I don't know the calculation off the top of my head, but any treatment of the theory of Bloom filters goes into this, because it's vital to the probability of the filter giving a false positive on a lookup.
Presumably if you're simultaneously computing the top 10 most frequent values, then if there are less than 10 distinct values you'll know exactly what they are and you don't need an estimate.
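For reference, the standard estimate from Bloom filter theory is n ≈ -(m/k) · ln(1 - t/m), where m is the filter size in bits, k the number of hash functions, and t the number of set bits. A trivial sketch:
#include <cmath>

// Estimate the number of distinct inserted values from a Bloom filter's
// fill ratio: m = filter size in bits, k = hash functions, t = bits set.
double estimateDistinct(double m, double k, double t)
{
    return -(m / k) * std::log(1.0 - t / m);
}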
I believe the "most frequently used" problem is difficult (well, memory-consuming). Suppose for a moment that you only want the top 1 most frequently used value. Suppose further that you have 10 million entries in the array, and that after the first 9.9 million of them, none of the numbers you've seen so far has appeared more than 100k times. Then any of the values you've seen so far might be the most-frequently used value, since any of them could have a run of 100k values at the end. Even worse, any two of them could have a run of 50k each at the end, in which case the count from the first 9.9 million entries is the tie-breaker between them. So in order to work out in a single pass which is the most frequently used, I think you need to know the exact count of each value that appears in the 9.9 million. You have to prepare for that freak case of a near-tie between two values in the last 0.1 million, because if it happens you aren't allowed to rewind and check the two relevant values again. Eventually you can start culling values -- if there's a value with a count of 5000 and only 4000 entries left to check, then you can cull anything with a count of 1000 or less. But that doesn't help very much.
So I might have missed something, but I think that in the worst case, the "most frequently used" problem requires you to maintain a count for every value you have seen, right up until nearly the end of the array. So you might as well use that collection of counts to work out how many distinct values there are.
One approach that can work, even for big values, is to spread them into lazily allocated buckets.
Suppose that you are working with 32-bit integers: creating an array of 2**32 bits is relatively impractical (2**29 bytes, hum). However, we can probably assume that 2**16 pointers is still reasonable (2**19 bytes: 512 kB), so we create 2**16 buckets (null pointers).
The big idea, therefore, is to take a "sparse" approach to counting, and hope that the integers won't be too dispersed, so that many of the bucket pointers will remain null.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<int32_t, int32_t> Pair;
typedef std::vector<Pair> Bucket;
typedef std::vector<Bucket*> Vector;

struct Comparator {
    bool operator()(Pair const& left, Pair const& right) const {
        return left.first < right.first;
    }
};

void add(Bucket& v, int32_t value) {
    Pair const pair(value, 1);
    Bucket::iterator it = std::lower_bound(v.begin(), v.end(), pair, Comparator());
    if (it == v.end() or it->first > value) {
        v.insert(it, pair);
        return;
    }
    it->second += 1;
}

void gather(Vector& v, int32_t const* begin, int32_t const* end) {
    for (; begin != end; ++begin) {
        uint16_t const index = *begin >> 16;
        Bucket*& bucket = v[index];
        if (bucket == 0) { bucket = new Bucket(); }
        add(*bucket, *begin);
    }
}
Once you have gathered your data, then you can count the number of different values or find the top or bottom pretty easily.
A few notes:
the number of buckets is completely customizable (thus letting you control the amount of original memory)
the strategy of repartition is customizable as well (this is just a cheap hash table I have made here)
it is possible to monitor the number of allocated buckets and abandon, or switch gears, if it starts blowing up.
if each value is different, then it just won't work, but when you realize it, you will already have collected many counts, so you'll at least be able to give a lower bound on the number of different values, and you'll also have a starting point for the top/bottom.
If you manage to gather those statistics, then most of the work is already done for you, as in the small usage sketch below.
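A possible usage sketch (hypothetical helper, relying on the types and functions above and assuming the Vector holds 2**16 initially-null bucket pointers):
#include <cstddef>

// Hypothetical driver for the gather()/add() sketch above: the distinct count
// is simply the total number of (value, count) pairs across all buckets.
std::size_t countDistinct(int32_t const* begin, int32_t const* end)
{
    Vector buckets(1 << 16, nullptr);   // 2**16 lazily allocated buckets
    gather(buckets, begin, end);

    std::size_t distinct = 0;
    for (Bucket* b : buckets) {
        if (b) distinct += b->size();
        delete b;
    }
    return distinct;
}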
For 8- and 16-bit values it's pretty obvious: you have the memory to track every possible value with its own counter.
When you get to 32 and 64 bit integers, you don't really have the memory to track every possibility.
Here are a few natural suggestions that are likely outside the bounds of your constraints.
I don't really understand why you can't sort the array. Radix sort is O(n), and once the data is sorted it takes only one more pass to get an exact distinct count and the top X information. In reality it would be 6 passes altogether for 32-bit values with a 1-byte radix (1 counting pass + 4 passes, one per byte, + 1 pass to collect the values); a rough sketch follows below.
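A sketch of that route for uint32_t data (LSD radix sort with a 1-byte radix, then one pass over the sorted result; hypothetical names, O(N) scratch buffer):
#include <cstddef>
#include <cstdint>
#include <vector>

// Sort 32-bit keys with 4 counting passes (one per byte), then count runs.
std::size_t sortAndCountDistinct(std::vector<std::uint32_t>& a)
{
    std::vector<std::uint32_t> tmp(a.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t count[256] = {};
        for (std::uint32_t v : a) ++count[(v >> shift) & 0xFF];
        std::size_t offset[256];
        std::size_t sum = 0;
        for (int i = 0; i < 256; ++i) { offset[i] = sum; sum += count[i]; }
        for (std::uint32_t v : a) tmp[offset[(v >> shift) & 0xFF]++] = v;
        a.swap(tmp);
    }
    // One final pass: the distinct count is the number of runs in sorted order.
    std::size_t distinct = a.empty() ? 0 : 1;
    for (std::size_t i = 1; i < a.size(); ++i)
        if (a[i] != a[i - 1]) ++distinct;
    return distinct;
}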
In the same vein as above, why not just use SQL? You could create a stored procedure that takes the array in as a table-valued parameter and returns the number of distinct values and the top X values in one go. This stored procedure could also be called in parallel.
-- number of distinct
SELECT COUNT(DISTINCT(n)) FROM #tmp
-- top x
SELECT TOP 10 n, COUNT(n) FROM #tmp GROUP BY n ORDER BY COUNT(n) DESC
I've just thought of an interesting solution. It's based on a law of Boolean algebra called idempotence of multiplication, which states that:
X * X = X
From it, and using the commutative property of boolean multiplication, we can deduce that:
X * Y * X = X * X * Y = X * Y
Now, do you see where I'm going with this? This is how the algorithm would work (I'm terrible with pseudo-code):
make c = element1 & element2   // binary AND between the binary representations of the integers
for i = 3 until i == size_of_array
    make b = c & element[i]
    if b != c then different_values++
    c = b
In the first iteration, we make (element1 * element2) * element3. We could represent it as:
(X * Y) * Z
If Z (element3) is equal to X (element1), then:
(X * Y) * Z = X * Y * X = X * Y
And if Z is equal to Y (element2), then:
(X * Y) * Z = X * Y * Y = X * Y
So, if Z isn't different from X or Y, then X * Y won't change when we multiply it by Z.
This remains valid for big expressions, like:
(X * A * Z * G * T * P * S) * S = X * A * Z * G * T * P * S
If we receive a value which is a factor of our big multiplicand (meaning it has already been computed), then the big multiplicand won't change when we multiply it by the received input, so there's no new distinct value.
So that's how it will go. Each time a genuinely different value arrives, the product of our big multiplicand and that distinct value will differ from the big operand. So, with b = c & element[i], if b != c we just increment our distinct-values counter.
I guess I'm not being clear enough. If that's the case, please let me know.

Randomly Generating a 2d/3d Array

I've thought about this for a while: what might be a good way to go about randomly generating a 2-D or possibly 3-D array? I'm not looking for specific code per se, but more just a general idea of how one might think about doing it.
Edit: Sorry, I mistyped. What I meant is: say you have an array (2D or 3D) filled with empty space, and I want to randomly generate char *'s in it (everything else would be blank).
create a 2D/3D array
Fill it with random data
?????
Profit!
If you were trying to sparsely fill a large 2D array with uppercase ASCII characters it would be something like the following in C:
int array[A][B];

for (int i = 0; i < SOMETHING; ++i)
{
    // Note: Modulus will not give perfect random distributions depending on
    // the values of A/B but will be good enough for some purposes.
    int x = rand() % A;
    int y = rand() % B;
    char ch = rand() % 26 + 'A';
    array[x][y] = ch;
}
Just generate a bunch of random values using rand() and arrange them into an array.
If you want the randomness to have continuity in 2 or 3 dimensions, the concept you're looking for is "noise".
I know you didn't want the whole code, but it's really not that much code.
int array[A][B][C];
std::generate_n(&array[0][0][0], A * B * C, std::rand); // needs <algorithm> and <cstdlib>
This is not the exact answer you asked for, but could prove useful anyway. Sometimes when I want to generate a large random string, I do the following:
1. generate a small random string, say 3 or 4 characters in length
2. get its hash with your algorithm of choice (MD5, SHA1, etc)
In this way you can generate quite long 'random' strings. Once you have the very long random string, you can split it up into smaller ones or use it whole. The quality of the randomness depends on how random the short initial string is.
This should work:
int *ary;       /* Array */
int x_size;     /* X size of the array */
int y_size;     /* Y size of the array */

x_size = rand() % MAX_SIZE;
y_size = rand() % MAX_SIZE;
ary = malloc(sizeof(int) * x_size * y_size);
ary[x_size * 1 + 1] = 1;   /* element [1][1] */
Since ary is a plain int*, the ary[X][Y] syntax won't work; you access element [X_COORD][Y_COORD] as
*(ary + (x_size * X_COORD) + Y_COORD)
which is what the last line above does for [1][1]. A C99 variable-length array would let you keep the [][] syntax.
Sorry, I couldn't think of a way to say it without code.
EDIT: Sorry for the confusion - thought you needed a random size array.