This is only a compact test case but I have a vector of doubles and I want to populate a square matrix (2D vector) of all pairwise differences. When compiled with -O3 optimization, this takes about 1.96 seconds on my computer (computed only from the nested double for-loop).
#include <vector>
using namespace std;

int main(){
    vector<double> a;
    vector<vector<double> > b;
    unsigned int i, j;
    unsigned int n;
    double d;

    n=10000; //In practice, this value is MUCH bigger
    a.resize(n);
    for (i=0; i<n; i++){
        a[i]=static_cast<double>(i);
    }
    b.resize(n);
    for (i=0; i<n; i++){
        b[i].resize(n);
        b[i][i]=0.0; //Zero diagonal
    }
    for (i=0; i<n; i++){
        for (j=i+1; j<n; j++){
            d=a[i]-a[j];
            //Commenting out the next two lines makes the code significantly faster
            b[i][j]=d;
            b[j][i]=d;
        }
    }
    return 0;
}
However, when I comment out the two lines:
b[i][j]=d;
b[j][i]=d;
The program completes in about 0.000003 seconds (again timing only the nested double for-loop)! I really didn't expect these two lines to be the rate-limiting step. I've been staring at this code for a while and I'm out of ideas. Can anybody please offer any suggestions as to how I could optimize this simple piece of code so that the time can be significantly reduced?
When you comment out those two lines, all that's left in the nested loop is to keep computing d and then throwing away the result. Since this can't have any effect on the behaviour of the program, the compiler will just optimize out the nested loop. That's why the program finishes almost instantaneously.
In fact, I confirmed this by compiling the code twice with g++ -O3, once with only the d=a[i]-a[j] statement in the nested loop, and once with the nested loop deleted entirely. The code emitted was identical.
Nevertheless, your code is currently slower than it has to be, because it's missing the cache. When you access a two-dimensional array in a nested loop like this, you should arrange for the iteration to be contiguous in memory if possible. This means the second index should be the one varying fastest. The access to b[j][i] violates this rule and misses the cache. So let's rewrite.
Before:
for (i=0; i<n; i++){
    for (j=i+1; j<n; j++){
        d=a[i]-a[j];
        b[i][j]=d;
        b[j][i]=d;
    }
}
Timing:
real 0m1.026s
user 0m0.824s
sys 0m0.196s
After:
for (i = 0; i < n; i++) {
    for (j = 0; j < i; j++) {
        b[i][j] = a[j] - a[i];
    }
    for (j = i+1; j < n; j++) {
        b[i][j] = a[i] - a[j];
    }
}
Timing:
real 0m0.335s
user 0m0.164s
sys 0m0.164s
Related
What should be the time complexity of the following code?
I tried to work it out and came up with O(n^2), but the answer says it is O(n). Can someone please explain through code?
for(int i = 0; i < n; i++){
    for(; i < n; i++){
        cout << i << endl;
    }
}
The complexity of your code is O(n).
Why?
Because, even though you have written two for loops, which probably made you think the complexity is O(n^2), your code is actually a single loop, equivalent to:
for (i = 0; i < n; i++){
    std::cout << i << std::endl;
}
Once the inner for loop finishes, i is equal to n and therefore the condition of outer for loop i < n is no longer satisfied.
One point to note with such for loops is that you're using a single variable. Irrespective of how many outer loops you add, your code behaves the same, with the condition i < n governing all of them. The innermost loop is the one that runs until i = n-1; the rest simply won't satisfy the condition any more.
for(int i=0; i<n; i++){
    for(; i<n; i++){
        for(; i<n; i++){
            for(; i<n; i++) // and so on.
                std::cout<<i<<"\n";
        }
    }
}
As a variant, for this pattern to genuinely be of O(n^2) complexity, the inner condition would have to be i < n*n:

for(int i=0; i<n; i++){
    for(; i<n*n; i++)
        std::cout<<i<<"\n";
}
The time complexity of your code is O(n), not O(n^2), because when the inner loop ends, the value of i has already reached n, so the outer loop cannot run any more.
for(int i = 0; i < 2; i++){
    for(; i < 2; i++){
        cout << i << endl;
    }
    // After the inner loop runs two times, i has the value 2,
    // so the outer loop cannot run any more.
}
I give the following example to illustrate my question:
bool bSign;
// bSign will be set depending on some criteria, which is omitted here.
// b[][][] is a float array, which is initialized in the program
for(int i=0; i<1000; i++)
    for(int j=0; j<10000; j++)
        for(int k=0; k<10000; k++)
            if(bSign)
                a[i][j][k] = (b[i][j][k]>500);
            else
                a[i][j][k] = (b[i][j][k]<500);
In the above code, I rely on bSign to decide which comparison operator (> or <) to use when setting the output array a. However, I found this costly, since the branch sits inside a long loop nest. Any ideas on how I can avoid that?
Your compiler should be able to optimise this out if it knows bSign won't change during the loop. Making it const would help.
If for some reason that's not happening, you could move the if (bSign) out to surround the entire set of nested loops. (Basically, doing manually what you're hoping the compiler will do for you.)
Beyond that, you're relying on branch prediction (whose success will depend on the nature of your data).
Operator overloading, though, has absolutely nothing to do with it.
Really, though, this kind of iteration is just not going to be efficient. Can't you find a better algorithm?
As many people have said, it is better to move the if statement out of the loops (if possible). To prevent code duplication, you can use function pointers, but that may be slower if real calls result: a call costs more than a simple if. If the compiler can inline them, it will be fine.
You can also use template metaprogramming (a simple function template in this case), in this manner:
template<typename ComparatorT>
void doTheWork(float ***a, const float ***b, ComparatorT comparator)
{
    for(int i=0; i<1000; i++)
        for(int j=0; j<10000; j++)
            for(int k=0; k<10000; k++)
                a[i][j][k] = comparator(b[i][j][k], 500);
}
...
// Usage.
bool bSign;
...
if (bSign) {
    doTheWork(a, b, std::greater<float>());
} else {
    doTheWork(a, b, std::less<float>());
}
The compiler can inline the comparator's calls (much more easily than with function pointers), and it will work almost as if you had written two different loops. A comparator object may actually be constructed, but usually that is insignificant. One disadvantage is that this style often produces hard-to-read code.
Can you just move the check out of the loop?
if (bSign)
    for(int i=0; i<1000; i++)
        for(int j=0; j<10000; j++)
            for(int k=0; k<10000; k++)
                a[i][j][k] = (b[i][j][k]>500);
else
    for(int i=0; i<1000; i++)
        for(int j=0; j<10000; j++)
            for(int k=0; k<10000; k++)
                a[i][j][k] = (b[i][j][k]<500);
I think you can't get rid of such ifs entirely in your program. But you can help the processor run your program faster. Consider sorting the b array (or copying and sorting, if you can't change b); this should help the processor predict branches. Of course, run some benchmarks to be sure it isn't worse than your current code.
I'm not sure if this is any faster (you'd have to do a performance check), but you could avoid all the if statements by selecting a callable up front. Note that each lambda has its own distinct type, so a pointer to one cannot point to the other; std::function can hold either:

#include <functional>

int comparisonValue = 500;

auto greater = [comparisonValue](float val) {
    return val > comparisonValue;
};
auto lesser = [comparisonValue](float val) {
    return val < comparisonValue;
};

bool bSign;
// bSign will be set depending on some criteria, which is omitted here.
// b[][][] is a float array, which is initialized in the program

std::function<bool(float)> func;
if(bSign) {
    func = greater;
} else {
    func = lesser;
}

for(int i=0; i<1000; i++)
    for(int j=0; j<10000; j++)
        for(int k=0; k<10000; k++)
            a[i][j][k] = func(b[i][j][k]);
Why don't you write it like this?
a[i][j][k] = (bSign == (b[i][j][k] > 500)) && (b[i][j][k] != 500);
I have an 3D array z, where every element has the value 1.
Now I do:
#pragma omp parallel for collapse(3) shared(z)
for (int i=0; i < SIZE; ++i) {
    for (int j=0; j < SIZE; ++j) {
        for (int k=0; k < SIZE; ++k) {
            for (int n=0; n < ITERATIONS-1; ++n) {
                z[i][j][k] += 1;
            }
        }
    }
}
This should add ITERATIONS to each element and it does. If I then change the collapse(3) to collapse(4) (because there are 4 for-loops) I don't get the right result.
Shouldn't I be able to collapse all four loops?
The issue is that the 4th loop isn't parallelisable the same way the first three are. To convince yourself, consider the last loop on its own. It would become:
int zz = z[i][j][k];
for (int n=0; n < ITERATIONS-1; ++n) {
    zz += 1;
}
z[i][j][k] = zz;
In order to parallelise it, you would need to add a reduction(+:zz) directive, right?
Well, same story for your collapse(4). But adding reduction(+:z), if at all possible (which I'm not sure of), would raise some issues:
The reduction clause for arrays in C or C++ is only supported from OpenMP 4.5 onwards, and I don't know of any compiler supporting it at the moment (although I'm sure some do).
It would probably make the code much slower anyway, due to the complex mechanism of managing the reduction aspect.
So the bottom line is: just stick to collapse(3) or less as you need, or parallelise your loop differently.
Here is code to find determinant of matrix n x n.
#include <iostream>
using namespace std;

int determinant(int *matrix[], int size);
void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column);

int main()
{
    int size;
    cout << "What is the size of the matrix for which you want to find the determinant?:\t";
    cin >> size;

    int **matrix;
    matrix = new int*[size];
    for (int i = 0 ; i < size ; i++)
        matrix[i] = new int[size];

    cout << "\nEnter the values of the matrix separated by spaces:\n\n";
    for(int i = 0; i < size; i++)
        for(int j = 0; j < size; j++)
            cin >> matrix[i][j];

    cout << "\nThe determinant of the matrix is:\t" << determinant(matrix, size) << endl;
    return 0;
}
int determinant(int *matrix[], int size){
    if(size==1) return matrix[0][0];
    else{
        int result=0, sign=-1;
        for(int j = 0; j < size; j++){
            int **minorMatrix;
            minorMatrix = new int*[size-1];
            for (int k = 0 ; k < size-1 ; k++)
                minorMatrix[k] = new int[size-1];

            ijMinor(matrix, minorMatrix, size, 0, j);
            sign*=-1;
            result+=sign*matrix[0][j]*determinant(minorMatrix, size-1);

            for(int i = 0; i < size-1; i++){
                delete [] minorMatrix[i];
            }
            delete [] minorMatrix;
        }
        return result;
    }
}
void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column){
    for(int i = 0; i < size; i++){
        for(int j = 0; j < size; j++){
            if(i < row){
                if(j < column) minorMatrix[i][j] = matrix[i][j];
                else if(j == column) continue;
                else minorMatrix[i][j-1] = matrix[i][j];
            }
            else if(i == row) continue;
            else{
                if(j < column) minorMatrix[i-1][j] = matrix[i][j];
                else if(j == column) continue;
                else minorMatrix[i-1][j-1] = matrix[i][j];
            }
        }
    }
}
After adding OpenMP pragmas, I've changed the determinant function and now it looks like this:
int determinant(int *matrix[], int size){
    if(size==1) return matrix[0][0];
    else{
        int result=0, sign=-1;
        #pragma omp parallel for default(none) shared(size,matrix,sign) private(j,k) reduction(+ : result)
        for(int j = 0; j < size; j++){
            int **minorMatrix;
            minorMatrix = new int*[size-1];
            for (int k = 0 ; k < size-1 ; k++)
                minorMatrix[k] = new int[size-1];

            ijMinor(matrix, minorMatrix, size, 0, j);
            sign*=-1;
            result+=sign*matrix[0][j]*determinant(minorMatrix, size-1);

            for(int i = 0; i < size-1; i++){
                delete [] minorMatrix[i];
            }
        }
        return result;
        delete [] matrix;
    }
}
My problem is that the result is different every time. Sometimes it gives the correct value, but most often it is wrong. I think it's because of the sign variable. I am following the usual cofactor expansion formula, in which the sign alternates from term to term.
As you can see, in every iteration of my for loop there should be a different sign, but when I use OpenMP, something goes wrong. How can I make this program run correctly with OpenMP?
My second issue is that using OpenMP does not make the program run any quicker than without it. I also tried a 100,000 x 100,000 matrix, but my program reports an error about allocating memory. How can I run this program with very large matrices?
Your issues as I see it are as follows:
1) As noted by Hristo, your threads are stomping over each other's data with respect to the sign variable. It should be private to each thread so that they have full read/write access to it without having to worry about race conditions. Then, you simply need an algorithm to compute whether sign is plus or minus 1 depending on the iteration j independently from the other iterations. With a little thinking, you'll see that Hristo's suggestion is correct: sign = (j % 2) ? -1 : 1; should do the trick.
2) Your determinant() function is recursive. As is, that means that every iteration of the loop, after forming your minors, you then call your function again on that minor. Therefore, a single thread is going to be performing its iteration, enter the recursive function, and then try to split itself up into nthreads more threads. You can see now how you are oversubscribing your system by launching many more threads than you physically have cores. Two easy solutions:
Call your original serial function from within the omp parallel code. This is the fastest way to do it because this would avoid any OpenMP-startup overhead.
Turn off nested parallelism by calling omp_set_nested(0); before your first call to determinant().
Add an if clause to your parallel for directive: if(!omp_in_parallel())
3) Your memory issues are because every iteration of your recursion, you are allocating more memory. If you fix problem #2, then you should be using comparable amounts of memory in the serial case as the parallel case. That being said, it would be much better to allocate all the memory you want before entering your algorithm. Allocating large chunks of memory (and then freeing it!), especially in parallel, is a terrible bottleneck in your code.
Compute the amount of memory you would need (on paper) before entering the first loop and allocate it all at once. I would also strongly suggest you consider allocating your memory contiguously (aka in 1D) to take better advantage of caching as well. Remember that each thread should have its own separate area to work with. Then, change your function to:
int determinant(int *matrix, int *startOfMyWorkspace, int size).
Instead of allocating a new (size-1)x(size-1) matrix inside of your loop, you would simply utilize the next (size-1)*(size-1) integers of your workspace, update what startOfMyWorkspace would be for the next recursive call, and continue along.
const int n=50;
double a[n][n];
double b[n][n];
double c[n][n];

for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
        cout << c[i][j] << " ";
    }
    cout << "\n";
}
I currently have a working code that multiplies two nxn matrices. I am trying to reorder the indices (ie i,k,j ... k,i,j) without touching the equation that does the multiplication. I am doing this to see how the order of the indices affects performance time, but if I just change the 'j's to 'k's and vice versa in my loops, my multiplication equation will not be correct.
I am wondering if what I am attempting to do is possible and if anyone can shed some light on what steps I can take to achieve this.
First of all, you shouldn't be printing out the c matrix at the point you are doing so, especially if you are trying to time an algorithm. What you should be doing is more similar to this:
const int n=50;
double a[n][n];
double b[n][n];
double c[n][n];

/* First multiply the matrices a,b into c. */
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

/* Now print out the result for a visual correctness check. */
for (int i = 0; i < n; i++){
    for (int j = 0; j < n; j++){
        std::cout << c[i][j] << ' '; // this leaves a space after the last character, but for this use case nobody cares
    }
    std::cout << std::endl;
}
Then you can just switch the lines containing the for loops (i.e. for (int i = 0; i < n; i++)) around, and see whether changing the access pattern changes execution time and results.
Spoiler: It shouldn't affect results except in some border cases of weird values inside the matrices, that are caused by inexactness of floating point math. It should however affect execution time, but it will be brutally dominated by time taken by printing the matrix, unless measured properly.
If you are talking about performance, it always comes down to complexity: no matter how you reorder the loops, the cost is determined by the part of the code doing most of the work.
Here all your loops run up to n, so whatever the order, you have complexity of order O(n^3) unless you change the logic. The asymptotically fastest matrix multiplication algorithm mentioned in this context, the Coppersmith-Winograd algorithm, has roughly O(n^2.376) complexity, but it is not used for practical purposes.
You can, however, use Strassen's algorithm, which has O(n^2.8074) complexity.