Why is this code so slow? - c++

So I have this function used to calculate statistics (min/max/std/mean). Now the thing is this runs generally on a 10,000 by 15,000 matrix. The matrix is stored as a vector<vector<int> > inside the class. Now creating and populating said matrix goes very fast, but when it comes down to the statistics part it becomes so incredibly slow.
E.g. to read all the pixel values of the geotiff one pixel at a time takes around 30 seconds. (which involves a lot of complex math to properly georeference the pixel values to a corresponding point), to calculate the statistics of the entire matrix it takes around 6 minutes.
void CalculateStats()
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
for(size_t row = 0; row < vals.size(); row++)
{
for(size_t col = 0; col < vals.at(row).size(); col++)
{
double mean_prev = new_mean;
T value = get(row, col);
new_mean += (value - new_mean) / (cnt + 1);
new_standard_dev += (value - new_mean) * (value - mean_prev);
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
stats_standard_dev = sqrt(new_standard_dev / (vals.size() * vals.at(0).size()) + 1);
std::cout << stats_standard_dev << std::endl;
}
Am I doing something horrible here?
EDIT
To respond to the comments, T would be an int.
EDIT 2
I fixed my std algorithm, and here is the final product:
void CalculateStats(const std::vector<double>& ignore_values)
{
//OHGOD
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
size_t cnt = 0;
int n = 0;
double delta = 0.0;
double mean2 = 0.0;
std::vector<double>::const_iterator ignore_begin = ignore_values.begin();
std::vector<double>::const_iterator ignore_end = ignore_values.end();
for(std::vector<std::vector<T> >::const_iterator row = vals.begin(), row_end = vals.end(); row != row_end; ++row)
{
for(std::vector<T>::const_iterator col = row->begin(), col_end = row->end(); col != col_end; ++col)
{
// This method of calculation is based on Knuth's algorithm.
T value = *col;
if(std::find(ignore_begin, ignore_end, value) != ignore_end)
continue;
n++;
delta = value - new_mean;
new_mean = new_mean + (delta / n);
mean2 = mean2 + (delta * (value - new_mean));
// Find new max/min's.
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
}
}
stats_standard_dev = mean2 / (n - 1);
stats_min = new_min;
stats_max = new_max;
stats_mean = new_mean;
This still takes ~120-130 seconds to do this, but it's a huge improvement :)!

Have you tried to profile your code?
You don't even need a fancy profiler. Just stick some debug timing statements in there.
Anything I tell you would just be an educated guess (and probably wrong)
You could be getting lots of cache misses due to the way you're accessing the contents of the vector. You might want to cache some of the results to size() but I don't know if that's the issue.

I just profiled it. 90% of the execution time was in this line:
new_mean += (value - new_mean) / (cnt + 1);

You should calculate the sum of values, min, max and count in the first loop,
then calculate the mean in one operation by dividing sum/count,
then in a second loop calculate std_dev's sum
That would probably be a bit faster.

First thing I spotted is that you evaluate vals.at(row).size() in the loop, which, obviously, isn't supposed to improve performance. It also applies to vals.size(), but of course inner loop is worse. If vals is a vector of vector, you better use iterators or at least keep reference for the outer vector (because get() with indices parameters surely eats up quite some time as well).
This code sample is supposed to illustrate my intentions ;-)
for(TVO::const_iterator i=vals.begin(),ie=vals.end();i!=ie;++i) {
for(TVI::const_iterator ii=i->begin(),iie=i->end();ii!=iie;++ii) {
T value = *ii;
// the rest
}
}

First, change your row++ to ++row. A minor thing, but you want speed, so that will help
Second, make your row < vals.size into some const comparison instead. The compiler doesn't know that vals won't change, so it has to play nice and always call size.
what is the 'get' method in the middle there? What does that do? That might be your real problem.
I'm not too sure about your std dev calculation. Take a look at the wikipedia page on calculating variance in a single pass (they have a quick explanation of Knuth's algorithm, which is an expansion of a recursion relation).

It's slow because you're benchmarking debug code.
Building and running the code on Windows XP using VS2008:
a Release build with the default optimisation level, the code in the OP runs in 2734 ms.
a Debug build with the default of no optimisation, the code in the OP runs in a massive 398,531 ms.
In comments below you say you're not using optimisation, and this appears to make a big difference in this case - normally it's less that a factor of ten, but in this case it's over a hundred times slower.
I'm using VS2008 rather than 2005, but it's probably similar:
In the Debug build, there are two range checks on each access, each of which calls std::vector::size() using a non-inlined function call and requires a branch predicition. There is overhead involved both with function calls and with branches.
In the Release build, the compiler optimizes away the range checks ( I don't know whether it just drops them, or does flow analysis based on the limits of the loop ), and the vector access becomes a small amount of inline pointer arithmetic with no branches.
No-one cares how fast the debug build is. You should be unit testing the release build anyway, as that's the build which has to work correctly. Only use the Debug build if you don't all the information you want if you try and step through the code.
The code as posted runs in < 1.5 seconds on my PC with test data of 15000 x 10000 integers all equal to 42. You report that it's running in 230 times slower that that. Are you on a 10 MHz processor?
Though there are other suggestions for making it faster ( such as moving it to use SSE, if all the values are representable using 8bit types ), but there's clearly something else which is making it slow.
On my machine, neither a version which hoisted a reference to the vector for the row and hoisting the size of the row, nor a version which used iterator had any measurable benefit ( with g++ -O3 using iterators takes 1511ms repeatably; the hoisted and original version both take 1485ms ). Not optimising means it runs in 7487ms ( original ), 3496ms ( hoisted ) or 5331ms ( iterators ).
But unless you're running on a very low power device, or are paging, or a running non-optimised code with a debugger attached, it shouldn't be this slow, and whatever is making it slow is not likely to be the code you've posted.
( as a side note, if you test it with values with a deviation of zero your SD comes out as 1 )

There are far too many calculations in the inner loop:
For the descriptive statistics (mean, standard
deviation) the only thing required is to compute the sum
of value and the sum of squared value. From these
two sums the mean and standard deviation can be computed
after the outer loop (together with a third value, the
number of samples - n is your new/updated code). The
equations can be derived from the definitions or found
on the web, e.g. Wikipedia. For instance the mean is
just sum of value divided by n. For the n version (in
contrast to the n-1 version - however n is large in
this case so it doesn't matter which one is used) the
standard deviation is: sqrt( n * sumOfSquaredValue -
sumOfValue * sumOfValue). Thus only two floating point
additions and one multiplication are needed in the
inner loop. Overflow is not a problem with these sums as
the range for doubles is 10^318. In particular you will
get rid of the expensive floating point division that
the profiling reported in another answer has revealed.
A lesser problem is that the minimum and maximum are
rewritten every time (the compiler may or may not
prevent this). As the minimum quickly becomes small and
the maximum quickly becomes large, only the two comparisons
should happen for the majority of loop iterations: use
if statements instead to be sure. It can be argued, but
on the other hand it is trivial to do.

I would change how I access the data. Assuming you are using std::vector for your container you could do something like this:
vector<vector<T> >::const_iterator row;
vector<vector<T> >::const_iterator row_end = vals.end();
for(row = vals.begin(); row < row_end; ++row)
{
vector<T>::const_iterator value;
vector<T>::const_iterator value_end = row->end();
for(value = row->begin(); value < value_end; ++value)
{
double mean_prev = new_mean;
new_mean += (*value - new_mean) / (cnt + 1);
new_standard_dev += (*value - new_mean) * (*value - mean_prev);
// find new max/min's
new_min = min(*value, new_min);
new_max = max(*value, new_max);
cnt++;
}
}
The advantage of this is that in your inner loop you aren't consulting the outter vector, just the inner one.
If you container type is a list, this will be significantly faster. Because the look up time of get/operator[] is linear for a list and constant for a vector.
Edit, I moved the call to end() out of the loop.

Move the .size() calls to before each loop, and make sure you are compiling with optimizations turned on.

If your matrix is stored as a vector of vectors, then in the outer for loop you should directly retrieve the i-th vector, and then operate on that in the inner loop. Try that and see if it improves performance.

I'm nor sure of what type vals is but vals.at(row).size() could take a long time if itself iterates through the collection. Store that value in a variable. Otherwise it could make the algorithm more like O(n³) than O(n²)

I think that I would rewrite it to use const iterators instead of row and col indexes. I would set up a const const_iterator for row_end and col_end to compare against, just to make certain it wasn't making function calls at every loop end.

As people have mentioned, it might be get(). If it accesses neighbors, for instance, you will totally smash the cache which will greatly reduce the performance. You should profile, or just think about access patterns.

Coming a bit late to the party here, but a couple of points:
You're effectively doing numerical work here. I don't know much about numerical algorithms, but I know enough to know that references and expert support are often useful. This discussion thread offers some references; and Numerical Recipes is a standard (if dated) work.
If you have the opportunity to redesign your matrix, you want to try using a valarray and slices instead of vectors of vectors; one advantage that immediately comes to mind is that you're guaranteed a flat linear layout, which makes cache pre-fetching and SIMD instructions (if your compiler can use them) more effective.

In the inner loop, you shouldn't be testing size, you shouldn't be doing any divisions, and iterators can also be costly. In fact, some unrolling would be good in there.
And, of course, you should pay attention to cache locality.
If you get the loop overhead low enough, it might make sense to do it in separate passes: one to get the sum (which you divide to get the mean), one to get the sum of squares (which you combine with the sum to get the variance), and one to get the min and/or max. The reason is to simplify what is in the inner unrolled loop so the compiler can keep stuff in registers.
I couldn't get the code to compile, so I couldn't pinpoint issues for sure.

I have modified the algorithm to get rid of almost all of the floating-point division.
WARNING: UNTESTED CODE!!!
void CalculateStats()
{
//OHGOD
double accum_f;
double accum_sq_f;
double new_mean = 0;
double new_standard_dev = 0;
int new_min = 256;
int new_max = 0;
const int oku = 100000000;
int accum_ichi = 0;
int accum_oku = 0;
int accum_sq_ichi = 0;
int accum_sq_oku = 0;
size_t cnt = 0;
int v1 = 0;
int v2 = 0;
v1 = vals.size();
for(size_t row = 0; row < v1; row++)
{
v2 = vals.at(row).size();
for(size_t col = 0; col < v2; col++)
{
T value = get(row, col);
int accum_ichi += value;
int accum_sq_ichi += (value * value);
// perform carries
accum_oku += (accum_ichi / oku);
accum_ichi %= oku;
accum_sq_oku += (accum_sq_ichi / oku);
accum_sq_ichi %= oku;
// find new max/min's
new_min = value < new_min ? value : new_min;
new_max = value > new_max ? value : new_max;
cnt++;
}
}
// now, and only now, do we use floating-point arithmetic
accum_f = (double)(oku) * (double)(accum_oku) + (double)(accum_ichi);
accum_sq_f = (double)(oku) * (double)(accum_sq_oku) + (double)(accum_sq_ichi);
new_mean = accum_f / (double)(cnt);
// standard deviation formula from Wikipedia
stats_standard_dev = sqrt((double)(cnt)*accum_sq_f - accum_f*accum_f)/(double)(cnt);
std::cout << stats_standard_dev << std::endl;
}

Related

Why do I get a larger relative speedup vs. scalar from SIMD intrinsics with larger arrays?

I want to learn SIMD programming. Now I have some interesting moment in my code.
I just want to measure the time of work of my code. I try to apply some base function for my array with a particular size.
Firstly I try to use function that was written with SIMD instructions and after that I try to use usual aproach. And I compare time of this two realizations the same function.
I defined performance like (time without sse) / (time using sse).
But when my size is 8 , I have performance is 1.3, and when my size = 512 - I have Performance = 3, if I have size = 1000 performance = 4, if size = 4000 -> performance = 5.
I don't understand why my performance is increasing when size of array is increasing.
My code
void init(double* v, size_t size) {
for (int i = 0; i < size; ++i) {
v[i] = i / 10.0;
}
}
void sub_func_sse(double* v, int start_idx) {
__m256d vector = _mm256_loadu_pd(v + start_idx);
__m256d base = _mm256_set_pd(2.0, 2.0, 2.0, 2.0);
for (int i = 0; i < 128; ++i) {
vector = _mm256_mul_pd(vector, base);
}
_mm256_storeu_pd(v + start_idx, vector);
}
void sub_func(double& item) {
for (int k = 0; k < 128; ++k) {
item *= 2.0;
}
}
int main() {
const size_t size = 8;
double* v = new double[size];
init(v, size);
const int num_repeat = 2000;//I should repeat my measuraments
//because I want to get average time - it is more clear information
double total_time_sse = 0;
for (int p = 0; p < num_repeat; ++p) {
init(v, size);
TimerHc t;
t.restart();
for (int i = 0; i < size; i += 8) {
sub_func_sse(v, i);
}
total_time_sse += t.toc();
}
double total_time = 0;
for (int p = 0; p < num_repeat; ++p) {
init(v, size);
TimerHc t;
t.restart();
for (int i = 0; i < size; ++i) {
sub_func(v[i]);
}
total_time += t.toc();
}
std::cout << "time using sse = " << total_time_sse / num_repeat << std::endl <<
"time without sse = " << total_time / num_repeat << std::endl;
system("pause");
}
I defined performance like (time without sse) / (time using sse).
What you measure is speedup.
The speedup you can expect from applying parallelizations is modelled by Amdahl's law. It relates the savings in those parts that can be made faster (by parallelization or other means) to the total speedup. Amdahl's law can be rather intimidating, because it basically says that making parts faster will not always gain you a total speedup. The limit in achievable speedup is determined by the relative fraction of the workload that can be parallelized.
Gustavon's law takes a different point of view. In a nutshell, it states that you just have to increase the workload to make efficient use of parallelization. More workload in total has typically less impact on overhead from parallelization and the non-parallel part of computations, hence (according to Amdahl's law) results in more efficient use of parallelism.
...and in some sense, that's what you are observing here. The bigger your array, the more impact parallelization has.
PS: This is just some handwaving to explain why the effect you see is not too surprising. Luckily there is another answer which addresses your specific benchmark in more detail.
You're probably a victim of CPU frequency scaling; for stable results you should disable dynamic frequency scaling and turbo boost, or at least warm up the CPU before starting the measurement.
Since you start by measuring SSE performance and then proceed to regular performance, the CPU frequency is low in the beginning, so SSE performance appears worse.
Having said that, there are a few other issues with your approach:
The overhead of high_frequency_clock::now() calls compared to the work being measured is high; move the time measurement to outside the for (..num_repeat loop, i.e. time the entire loop, not individual iterations (then optionally divide the measured time by the number of iterations).
The results of the computation are never used; the compiler is free to optimize the work out entirely. Make sure to "use" the result, e.g. by printing it.
It is quite inefficient to multiply a double by 2.0. Indeed, the non-SSE version is optimized to an ADD instead (item *= 2.0 ==> vaddsd xmm0, xmm0, xmm0). So your hand-made SSE version is losing out.
An optimizing compiler will probably auto-vectorize your non-SSE code. To be sure, always check the generated assembly. Link to godbolt
Use a benchmarking framework like Google Benchmark; it will help you avoid many pitfalls associated with code benchmarking.

How can I optimize this function which handles large c++ vectors?

According to Visual Studio's performance analyzer, the following function is consuming what seems to me to be an abnormally large amount of processor power, seeing as all it does is add between 1 and 3 numbers from several vectors and store the result in one of those vectors.
//Relevant class members:
//vector<double> cache (~80,000);
//int inputSize;
//Notes:
//RealFFT::real is a typedef for POD double.
//RealFFT::RealSet is a wrapper class for a c-style array of RealFFT::real.
//This is because of the FFT library I'm using (FFTW).
//It's bracket operator is overloaded to return a const reference to the appropriate array element
vector<RealFFT::real> Convolver::store(vector<RealFFT::RealSet>& data)
{
int cr = inputSize; //'cache' read position
int cw = 0; //'cache' write position
int di = 0; //index within 'data' vector (ex. data[di])
int bi = 0; //index within 'data' element (ex. data[di][bi])
int blockSize = irBlockSize();
int dataSize = data.size();
int cacheSize = cache.size();
//Basically, this takes the existing values in 'cache', sums them with the
//values in 'data' at the appropriate positions, and stores them back in
//the cache at a new position.
while (cw < cacheSize)
{
int n = 0;
if (di < dataSize)
n = data[di][bi];
if (di > 0 && bi < inputSize)
n += data[di - 1][blockSize + bi];
if (++bi == blockSize)
{
di++;
bi = 0;
}
if (cr < cacheSize)
n += cache[cr++];
cache[cw++] = n;
}
//Take the first 'inputSize' number of values and return them to a new vector.
return Common::vecTake<RealFFT::real>(inputSize, cache, 0);
}
Granted, the vectors in question have sizes of around 80,000 items, but by comparison, a function which multiplies similar vectors of complex numbers (complex multiplication requires 4 real multiplications and 2 additions each) consumes about 1/3 the processor power.
Perhaps it has something to with the fact it has to jump around within the vectors rather then just accessing them linearly? I really have no idea though. Any thoughts on how this could be optimized?
Edit: I should mention I also tried writing the function to access each vector linearly, but this requires more total iterations and actually the performance was worse that way.
Turn on compiler optimization as appropriate. A guide for MSVC is here:
http://msdn.microsoft.com/en-us/library/k1ack8f1.aspx

Can/Should I run this code of a statistical application on a GPU?

I'm working on a statistical application containing approximately 10 - 30 million floating point values in an array.
Several methods performing different, but independent, calculations on the array in nested loops, for example:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
for (float x = 0f; x < 100f; x += 0.0001f) {
int noOfOccurrences = 0;
foreach (float y in largeFloatingPointArray) {
if (x == y) {
noOfOccurrences++;
}
}
noOfNumbers.Add(x, noOfOccurrences);
}
The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:
Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?
UPDATE GPU Version
__global__ void hash (float *largeFloatingPointArray,int largeFloatingPointArraySize, int *dictionary, int size, int num_blocks)
{
int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
float y; // compute one (or more) floats
int noOfOccurrences = 0;
int a;
while( x < size ) // While there is work to do each thread will:
{
dictionary[x] = 0; // Initialize the position in each it will work
noOfOccurrences = 0;
for(int j = 0 ;j < largeFloatingPointArraySize; j ++) // Search for floats
{ // that are equal
// to it assign float
y = largeFloatingPointArray[j]; // Take a candidate from the floats array
y *= 10000; // e.g if y = 0.0001f;
a = y + 0.5; // a = 1 + 0.5 = 1;
if (a == x) noOfOccurrences++;
}
dictionary[x] += noOfOccurrences; // Update in the dictionary
// the number of times that the float appears
x += blockDim.x * gridDim.x; // Update the position here the thread will work
}
}
This one I just tested for smaller inputs, because I am testing in my laptop. Nevertheless, it is working, but more tests are needed.
UPDATE Sequential Version
I just did this naive version that executes your algorithm for an array with 30,000,000 element in less than 20 seconds (including the time taken by function that generates the data).
This naive version first sorts your array of floats. Afterward, will go through the sorted array and check the number of times a given value appears in the array and then puts this value in a dictionary along with the number of times it has appeared.
You can use sorted map, instead of the unordered_map that I used.
Heres the code:
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>
typedef std::tr1::unordered_map<float, int> Mymap;
void generator(float *data, long int size)
{
float LO = 0.0;
float HI = 100.0;
for(long int i = 0; i < size; i++)
data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}
void print_array(float *data, long int size)
{
for(long int i = 2; i < size; i++)
printf("%f\n",data[i]);
}
std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
float previous = data[0];
int count = 1;
std::tr1::unordered_map<float, int> dict;
for(long int i = 1; i < size; i++)
{
if(previous == data[i])
count++;
else
{
dict.insert(Mymap::value_type(previous,count));
previous = data[i];
count = 1;
}
}
dict.insert(Mymap::value_type(previous,count)); // add the last member
return dict;
}
void printMAP(std::tr1::unordered_map<float, int> dict)
{
for(std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
{
std::cout << "key(string): " << i->first << ", value(int): " << i->second << std::endl;
}
}
int main(int argc, char** argv)
{
int size = 1000000;
if(argc > 1) size = atoi(argv[1]);
printf("Size = %d",size);
float data[size];
using namespace __gnu_cxx;
std::tr1::unordered_map<float, int> dict;
generator(data,size);
sort(data, data + size);
dict = fill_dict(data,size);
return 0;
}
If you have the library thrust installed in you machine your should use this:
#include <thrust/sort.h>
thrust::sort(data, data + size);
instead of this
sort(data, data + size);
For sure it will be faster.
Original Post
I'm working on a statistical application which has a large array
containing 10 - 30 millions of floating point values.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Yes, it is. A month ago, I ran an entirely Molecular Dynamic simulation on a GPU. One of the kernels, which calculated the force between pairs of particles, received as parameter 6 array each one with 500,000 doubles, for a total of 3 Millions doubles (22 MB).
So if you are planning to put 30 Million floating points, which is about 114 MB of global Memory, it will not be a problem.
In your case, can the number of calculations be an issue? Based on my experience with the Molecular Dynamic (MD), I would say no. The sequential MD version takes about 25 hours to complete while the GPU version took 45 Minutes. You said your application took a couple hours, also based in your code example it looks softer than the MD.
Here's the force calculation example:
__global__ void add(double *fx, double *fy, double *fz,
double *x, double *y, double *z,...){
int pos = (threadIdx.x + blockIdx.x * blockDim.x);
...
while(pos < particles)
{
for (i = 0; i < particles; i++)
{
if(//inside of the same radius)
{
// calculate force
}
}
pos += blockDim.x * gridDim.x;
}
}
A simple example of a code in CUDA could be the sum of two 2D arrays:
In C:
for(int i = 0; i < N; i++)
c[i] = a[i] + b[i];
In CUDA:
__global__ add(int *c, int *a, int*b, int N)
{
int pos = (threadIdx.x + blockIdx.x)
for(; i < N; pos +=blockDim.x)
c[pos] = a[pos] + b[pos];
}
In CUDA you basically took each for iteration and assigned to each thread,
1) threadIdx.x + blockIdx.x*blockDim.x;
Each block has an ID from 0 to N-1 (N the number maximum of blocks) and each block has a 'X' number of threads with an ID from 0 to X-1.
Gives you the for loop iteration that each thread will compute based on its ID and the block ID which the thread is in; the blockDim.x is the number of threads that a block has.
So if you have 2 blocks each one with 10 threads and N=40, the:
Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39
Looking at your current code, I have made this draft of what your code could look like in CUDA:
__global__ hash (float *largeFloatingPointArray, int *dictionary)
// You can turn the dictionary in one array of int
// here each position will represent the float
// Since x = 0f; x < 100f; x += 0.0001f
// you can associate each x to different position
// in the dictionary:
// pos 0 have the same meaning as 0f;
// pos 1 means float 0.0001f
// pos 2 means float 0.0002f ect.
// Then you use the int of each position
// to count how many times that "float" had appeared
int x = blockIdx.x; // Each block will take a different x to work
float y;
while( x < 1000000) // x < 100f (for incremental step of 0.0001f)
{
int noOfOccurrences = 0;
float z = converting_int_to_float(x); // This function will convert the x to the
// float like you use (x / 0.0001)
// each thread of each block
// will takes the y from the array of largeFloatingPointArray
for(j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
{
y = largeFloatingPointArray[j];
if (z == y)
{
noOfOccurrences++;
}
}
if(threadIdx.x == 0) // Thread master will update the values
atomicAdd(&dictionary[x], noOfOccurrences);
__syncthreads();
}
You have to use atomicAdd because different threads from different blocks may write/read noOfOccurrences concurrently, so you have to ensure mutual exclusion.
This is just one approach; you can even assign the iterations of the outer loop to the threads instead of the blocks.
Tutorials
The Dr Dobbs Journal series CUDA: Supercomputing for the masses by Rob Farmer is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.
and anothers:
Volume I: Introduction to CUDA Programming
Getting started with CUDA
CUDA Resources List
Take a look on the last item, you will find many link to learn CUDA.
OpenCL: OpenCL Tutorials | MacResearch
I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.
In addition, you really don't want to add a small amount to a floating point number repetitively like that, the rounding error will add up and you won't get what you intended. I've added an if statement to my below sample to check if inputs match your pattern of iteration, but omit it if you don't actually need that.
I don't know any C#, but a single pass implementation of your sample would look something like this:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
foreach (float x in largeFloatingPointArray)
{
if (math.Truncate(x/0.0001f)*0.0001f == x)
{
if (noOfNumbers.ContainsKey(x))
noOfNumbers.Add(x, noOfNumbers[x]+1);
else
noOfNumbers.Add(x, 1);
}
}
Hope this helps.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Definitely YES, this kind of algorithm is typically the ideal candidate for massive data-parallelism processing, the thing GPUs are so good at.
If yes: Does anyone know any tutorial or got any sample code
(programming language doesn't matter)?
When you want to go the GPGPU way you have two alternatives : CUDA or OpenCL.
CUDA is mature with a lot of tools but is NVidia GPUs centric.
OpenCL is a standard running on NVidia and AMD GPUs, and CPUs too. So you should really favour it.
For tutorial you have an excellent series on CodeProject by Rob Farber : http://www.codeproject.com/Articles/Rob-Farber#Articles
For your specific use-case there is a lot of samples for histograms buiding with OpenCL (note that many are image histograms but the principles are the same).
As you use C# you can use bindings like OpenCL.Net or Cloo.
If your array is too big to be stored in the GPU memory, you can block-partition it and rerun your OpenCL kernel for each part easily.
In addition to the suggestion by the above poster use the TPL (task parallel library) when appropriate to run in parallel on multiple cores.
The example above could use Parallel.Foreach and ConcurrentDictionary, but a more complex map-reduce setup where the array is split into chunks each generating an dictionary which would then be reduced to a single dictionary would give you better results.
I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.
I am not sure whether using GPUs would be a good match given that
'largerFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self contained calculations.
I think turning this single process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending how many systems are available.
You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.
Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use floating point value as the key, and the number of occurrences in the array as the value. Worst case scenario is that each value only occurs once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run then in-memory hash table makes sense. If it is static, then the table could be saved in a key-value database such as Berkeley DB. Let's call this a 'lookup' system.
On another system, let's call it 'main', create chunks of work and 'scatter' the work items across N systems, and 'gather' the results as they become available. E.g a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back array of occurrences and it's ready to work on another chunk of work.
The performance is improved because we do not keep iterating over largeFloatingPointArray. If lookup system becomes a bottleneck, then it could be replicated on as many systems as needed.
With large enough number of systems working in parallel, it should be possible to reduce the processing time down to minutes.
I am working on a compiler for parallel programming in C targeted for many-core based systems, often referred to as microservers, that are/or will be built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.
I have a version of the compiler working, which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication code (IPC) across systems. One of the IPC mechanism available is socket/tcp/ip.
If you need help in implementing a distributed solution, I'd be happy to discuss it with you.
Added Nov 16, 2012.
I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.
/*
* Convert the X range from 0f to 100f in steps of 0.0001f
* into a range of integers 0 to 1 + (100 * 10000) to use as an
* index into an array.
*/
#define X_MAX (1 + (100 * 10000))
/*
* Number of floats in largeFloatingPointArray needs to be defined
* below to be whatever your value is.
*/
#define LARGE_ARRAY_MAX (1000)
main()
{
int j, y, *noOfOccurances;
float *largeFloatingPointArray;
/*
* Allocate memory for largeFloatingPointArray and populate it.
*/
largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));
if (largeFloatingPointArray == 0) {
printf("out of memory\n");
exit(1);
}
/*
* Allocate memory to hold noOfOccurances. The index/10000 is the
* the floating point number. The contents is the count.
*
* E.g. noOfOccurances[12345] = 20, means 1.2345f occurs 20 times
* in largeFloatingPointArray.
*/
noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
if (noOfOccurances == 0) {
printf("out of memory\n");
exit(1);
}
for (j = 0; j < LARGE_ARRAY_MAX; j++) {
y = (int)(largeFloatingPointArray[j] * 10000);
if (y >= 0 && y <= X_MAX) {
noOfOccurances[y]++;
}
}
}

triangular matrix conversion and auto parallelization

I'm playing a bit with auto parallelization in ICC (11.1; old, but can't do anything about it) and I'm wondering why the compiler can't parallelize the inner loop for a simple gaussian elimination:
void makeTriangular(float **matrix, float *vector, int n) {
for (int pivot = 0; pivot < n - 1; pivot++) {
// swap row so that the row with the largest value is
// at pivot position for numerical stability
int swapPos = findPivot(matrix, pivot, n);
std::swap(matrix[pivot], matrix[swapPos]);
std::swap(vector[pivot], vector[swapPos]);
float pivotVal = matrix[pivot][pivot];
for (int row = pivot + 1; row < n; row++) { // line 72; should be parallelized
float tmp = matrix[row][pivot] / pivotVal;
for (int col = pivot + 1; col < n; col++) { // line 74
matrix[row][col] -= matrix[pivot][col] * tmp;
}
vector[row] -= vector[pivot] * tmp;
}
}
}
We're only writing to the arrays dependent on the private row (and col) variable and row is guaranteed to be larger than pivot, so it should be obvious to the compiler that we aren't overwriting anything.
I'm compiling with -O3 -fno-alias -parallel -par-report3 and get lots of dependencies ala: assumed FLOW dependence between matrix line 75 and matrix line 73. or assumed ANTI dependence between matrix line 73 and matrix line 75. and the same for line 75 alone. What problem does the compiler have? Obviously I could tell it exactly what to do with some pragmas, but I want to understand what the compiler can get alone.
Basically the compiler can't figure out that there's no dependency due to the name matrix and the name vector being both read from and written too (even though with different regions). You might be able to get around this in the following fashion (though slightly dirty):
void makeTriangular(float **matrix, float *vector, int n)
{
for (int pivot = 0; pivot < n - 1; pivot++)
{
// swap row so that the row with the largest value is
// at pivot position for numerical stability
int swapPos = findPivot(matrix, pivot, n);
std::swap(matrix[pivot], matrix[swapPos]);
std::swap(vector[pivot], vector[swapPos]);
float pivotVal = matrix[pivot][pivot];
float **matrixForWriting = matrix; // COPY THE POINTER
float *vectorForWriting = vector; // COPY THE POINTER
// (then parallelize this next for loop as you were)
for (int row = pivot + 1; row < n; row++) {
float tmp = matrix[row][pivot] / pivotVal;
for (int col = pivot + 1; col < n; col++) {
// WRITE TO THE matrixForWriting VERSION
matrixForWriting[row][col] = matrix[row][col] - matrix[pivot][col] * tmp;
}
// WRITE TO THE vectorForWriting VERSION
vectorForWriting[row] = vector[row] - vector[pivot] * tmp;
}
}
}
Bottom line is just give the ones you're writing to a temporarily different name to trick the compiler. I know that it's a little dirty and I wouldn't recommend this kind of programming in general. But if you're sure that you have no data dependency, it's perfectly fine.
In fact, I'd put some comments around it that are very clear to future people who see this code that this was a workaround, and why you did it.
Edit: I think the answer was basically touched on by #FPK and an answer was posted by #Evgeny Kluev. However, in #Evgeny Kluev's answer he suggests making this an input parameter and that might parallelize but won't give the correct value since the entries in matrix won't be updated. I think the code I posted above will give the correct answer too.
The same auto-parallelization problem is on icc 12.1. So I used this newer version for experiments.
Adding an output matrix to your function's parameter list and changing body of the third loop to this
out[row][col] = matrix[row][col] - matrix[pivot][col] * tmp;
fixed the "FLOW dependence" problem. Which means, "-fno-alias" affects only function parameters, while contents of the single parameter remain under suspicion of being aliased. I don't know why this option does not affect everything. Since different parts of your matrix do not really alias each other, you can just leave this additional parameter to the function and pass the same matrix through this parameter.
Interestingly, while complaining about 'matrix', compiler say nothing about 'vector', which really has aliasing problems: this line vector[row] -= vector[pivot] * tmp; may lead to false aliasing (writing to vector[row] in one thread may touch the cache line, storing vector[pivot], used by every thread).
"FLOW dependence" is not the only problem in this code. After it was fixed, compiler still refuses to parallelize second and third loops because of "insufficient computational work". So I tried to give it some extra work:
float tmp = matrix[row][pivot] * pivotVal;
...
out[row][col] = matrix[row][col] - matrix[pivot][col] *tmp /pivotVal /pivotVal;
And after all this, the second loop was at last parallelized, though I'm not sure if it gained any speed improvement.
Update: I found a better alternative to giving computer "some extra work". Option -par-threshold50 does the trick.
I have no access to an icc to test my idea but I suspect the compiler fears aliasing: matrix is defined as float**: an array of pointers pointing to arrays of floats. All those pointers could point to the same float array so parallizing this would be very dangerous. This would make no sense, but the compiler cannot know.

C++ performance: checking a block of memory for having specific values in specific cells

I'm doing a research on 2D Bin Packing algorithms. I've asked similar question regarding PHP's performance - it was too slow to pack - and now the code is converted to C++.
It's still pretty slow. What my program does is consequently allocating blocks of dynamic memory and populating them with a character 'o'
char* bin;
bin = new (nothrow) char[area];
if (bin == 0) {
cout << "Error: " << area << " bytes could not be allocated";
return false;
}
for (int i=0; i<area; i++) {
bin[i]='o';
}
(their size is between 1kb and 30kb for my datasets)
Then the program checks different combinations of 'x' characters inside of current memory block.
void place(char* bin, int* best, int width)
{
for (int i=best[0]; i<best[0]+best[1]; i++)
for (int j=best[2]; j<best[2]+best[3]; j++)
bin[i*width+j] = 'x';
}
One of the functions that checks the non-overlapping gets called millions of times during a runtime.
bool fits(char* bin, int* pos, int width)
{
for (int i=pos[0]; i<pos[0]+pos[1]; i++)
for (int j=pos[2]; j<pos[2]+pos[3]; j++)
if (bin[i*width+j] == 'x')
return false;
return true;
}
All other stuff takes only a percent of the runtime, so I need to make these two guys (fits and place) faster. Who's the culprit?
Since I only have two options 'x' and 'o', I could try to use just one bit instead of the whole byte the char takes. But I'm more concerned with the speed, you think it would make the things faster?
Thanks!
Update: I replaced int* pos with rect pos (the same for best), as MSalters suggested. At first I saw improvement, but I tested more with bigger datasets and it seems to be back to normal runtimes. I'll try other techniques suggested and will keep you posted.
Update: using memset and memchr sped up things about twice. Replacing 'x' and 'o' with '\1' and '\0' didn't show any improvement. __restrict wasn't helpful either. Overall, I'm satisfied with the performance of the program now since I also made some improvements to the algorithm itself. I'm yet to try using a bitmap and compiling with -02 (-03)... Thanks again everybody.
Best possibility would be to use an algorithm with better complexity.
But even your current algorithm could be sped up. Try using SSE instructions to test ~16 bytes at once, also you can make a single large allocation and split it yourself, this will be faster than using the library allocator (the library allocator has the advantage of letting you free blocks individually, but I don't think you need that feature).
[ Of course: profile it!]
Using a bit rather than a byte will not be faster in the first instance.
However, consider that with characters, you can cast blocks of 4 or 8 bytes to unsigned 32 bit or 64 bit integers (making sure you handle alignment), and compare that to the value for 'oooo' or 'oooooooo' in the block. That allows a very fast compare.
Now having gone down the integer approach, you can see that you could do that same with the bit approach and handle say 64 bits in a single compare. That should surely give a real speed up.
Bitmaps will increase the speed as well, since they involve touching less memory and thus will cause more memory references to come from the cache. Also, in place, you might want to copy the elements of best into local variables so that the compiler knows that your writes to bin will not change best. If your compiler supports some spelling of restrict, you might want to use that as well. You can also replace the inner loop in place with the memset library function, and the inner loop in fits with memchr; those may not be large performance improvements, though.
First of all, have you remembered to tell your compiler to optimize?
And turn off slow array index bounds checking and such?
That done, you will get substantial speed-up by representing your binary values as individual bits, since you can then set or clear say 32 or 64 bits at a time.
Also I would tend to assume that the dynamic allocations would give a fair bit of overhead, but apparently you have measured and found that it isn't so. If however the memory management actually contributes significantly to the time, then a solution depends a bit on the usage pattern. But possibly your code generates stack-like alloc/free behavior, in which case you can optimize the allocations down to almost nothing; just allocate a big chunk of memory at the start and then sub-allocate stack-like from that.
Considering your current code:
void place(char* bin, int* best, int width)
{
for (int i=best[0]; i<best[0]+best[1]; i++)
for (int j=best[2]; j<best[2]+best[3]; j++)
bin[i*width+j] = 'x';
}
Due to possible aliasing the compiler may not realize that e.g. best[0] will be constant during the loop.
So, tell it:
void place(char* bin, int const* best, int const width)
{
int const maxY = best[0] + best[1];
int const maxX = best[2] + best[3];
for( int y = best[0]; y < maxY; ++y )
{
for( int x = best[2]; x < maxX; ++x )
{
bin[y*width + x] = 'x';
}
}
}
Most probably your compiler will hoist the y*width computation out of the inner loop, but why not tell it do also that:
void place(char* bin, int* best, int const width)
{
int const maxY = best[0]+best[1];
int const maxX = best[2]+best[3];
for( int y = best[0]; y < maxY; ++y )
{
int const startOfRow = y*width;
for( int x = best[2]; x < maxX; ++x )
{
bin[startOfRow + x] = 'x';
}
}
}
This manual optimization (also applied to other routine) may or may not help, it depends on how smart your compiler is.
Next, if that doesn't help enough, consider replacing inner loop with std::fill (or memset), doing a whole row in one fell swoop.
And if that doesn't help or doesn't help enough, switch over to bit-level representation.
It is perhaps worth noting and trying out, that every PC has built-in hardware support for optimizing the bit-level operations, namely a graphics accelerator card (in old times called blitter chip). So, you might just use an image library and a black/white bitmap. But since your rectangles are small I'm not sure whether the setup overhead will outweight the speed of the actual operation – needs to be measured. ;-)
Cheers & hth.,
The biggest improvement I'd expect is from a non-trivial change:
// changed pos to class rect for cleaner syntax
bool fits(char* bin, rect pos, int width)
{
if (bin[pos.top()*width+pos.left()] == 'x')
return false;
if (bin[(pos.bottom()-1*width+pos.right()] == 'x')
return false;
if (bin[(pos.bottom()*width+pos.left()] == 'x')
return false;
if (bin[pos.top()*width+pos.right()] == 'x')
return false;
for (int i=pos.top(); i<=pos.bottom(); i++)
for (int j=pos.left(); j<=pos.right(); j++)
if (bin[i*width+j] == 'x')
return false;
return true;
}
Sure, you're testing bin[(pos.bottom()-1*width+pos.right()] twice. But the first time you do so is much earlier in the algorithm. You add boxes, which means that there is a strong correlation between adjacent bins. Therefore, by checking the corners first, you often return a lot earlier. You could even consider adding a 5th check in the middle.
Beyond the obligatory statement about using a profiler,
The advice above about replacing things with a bit map is a very good idea. If that does not appeal to you..
Consider replacing
for (int i=0; i<area; i++) {
bin[i]='o';
}
By
memset(bin, 'o', area);
Typically a memset will be faster, as it compiles into less machine code.
Also
void place(char* bin, int* best, int width)
{
for (int i=best[0]; i<best[0]+best[1]; i++)
for (int j=best[2]; j<best[2]+best[3]; j++)
bin[i*width+j] = 'x';
}
has a bit of room.for improvement
void place(char* bin, int* best, int width)
{
for (int i=best[0]; i<best[0]+best[1]; i++)
memset( (i * width) + best[2],
'x',
(best[2] + best[3]) - (((i * width)) + best[2]) + 1);
}
by eliminating one of the loops.
A last idea is to change your data representation.
Consider using the '\0' character as a replacement for your 'o' and '\1' as a replacement for your 'x' character. This is sort of like using a bit map.
This would enable you to test like this.
if (best[1])
{
// Is a 'x'
}
else
{
// Is a 'o'
}
Which might produce faster code. Again the profiler is your friend :)
This representation would also enable you to simply sum a set of character to determine how many 'x's and 'o's there are.
int sum = 0;
for (int i = 0; i < 12; i++)
{
sum += best[i];
}
cout << "There are " << sum << "'x's in the range" << endl;
Best of luck to you
Evil.
If you have 2 values for your basic type, I would first try to use bool. Then the compiler knows you have 2 values and might be able to optimize some things better.
Appart from that add const where possible (for example the parameter of fits( bool const*,...)).
I'd think about memory cache breaks. These functions run through sub-matrices inside a bigger matrix - I suppose many times much bigger on both width and height.
That means the small matrix lines are contiguous memory but between lines it might break memory cache pages.
Consider representing the big matrix cells in memory in an order that would keep sub-matrices elements close to each other as possible. That is instead of keeping a vector of contiguous full lines. First option comes to my mind, is to break your big matrix recursively to matrices of size [ 2^i, 2^i ] ordered { top-left, top-right, bottom-left, bottom-right }.
1)
i.e. if your matrix is size [X,Y], represented in an array of size X*Y, then element [x,y] is at position(x,y) in the array:
use instead of (y*X+x):
unsigned position( rx, ry )
{
unsigned x = rx;
unsigned y = rx;
unsigned part = 1;
unsigned pos = 0;
while( ( x != 0 ) && ( y != 0 ) ) {
unsigned const lowest_bit_x = ( x % 2 );
unsigned const lowest_bit_y = ( y % 2 );
pos += ( ((2*lowest_bit_y) + lowest_bit_x) * part );
x /= 2; //throw away lowest bit
y /= 2;
part *= 4; //size grows by sqare(2)
}
return pos;
}
I didn't check this code, just to explain what I mean.
If you need, also try to find a faster way to implement.
but note that the array you allocate will be bigger than X*Y, it has to be the smaller possible (2^(2*k)), and that would be wastefull unless X and Y are about same size scale. But it can be solved by further breaking the big matrix to sqaures first.
And then cache benfits might outwight the more complex position(x,y).
2) then try to find the best way to run through the elements of a sub-matrix in fits() and place(). Not sure yet what it is, not necessarily like you do now. Basically a sub-matrix of size [x,y] should break into no more than y*log(x)*log(y) blocks that are contiguous in the array representation, but they all fit inside no more than 4 blocks of size 4*x*y. So finally, for matrices that are smaller than a memory cache page, you'll get no more than 4 memory cache breaks, while your original code could break y times.