const int n = 50;
double a[n][n];
double b[n][n];
double c[n][n] = {}; // zero-initialise the accumulator before summing
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
        cout << c[i][j] << " ";
    }
    cout << "\n";
}
I currently have working code that multiplies two n x n matrices. I am trying to reorder the loop indices (i.e. i,k,j ... k,i,j) without touching the equation that does the multiplication. I am doing this to see how the order of the indices affects performance time, but if I just change the 'j's to 'k's and vice versa in my loops, my multiplication equation will no longer be correct.
I am wondering whether what I am attempting is possible, and whether anyone can shed some light on the steps I can take to achieve it.
First of all, you shouldn't be printing the c matrix where you currently do, especially if you are trying to time an algorithm. What you should be doing is closer to this:
const int n=50;
double a[n][n];
double b[n][n];
double c[n][n] = {}; /* zero-initialised accumulator */
/* First multiply the matrices a,b into c. */
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
for (int k = 0; k < n; k++) {
c[i][j] += a[i][k] * b[k][j];
}
}
}
/* now print out the result for visual correctness check */
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++){
std::cout << c[i][j] << ' '; //this will leave a space after last character, but for this use case, nobody cares.
}
std::cout << std::endl;
}
Then you can simply swap the lines containing the for loops (i.e. for (int i = 0; i < n; i++)) and see whether changing the access pattern changes the execution time or the results.
Spoiler: it shouldn't affect the results, except in borderline cases with unusual values in the matrices, caused by the inexactness of floating-point math. It should affect execution time, but that effect will be brutally dominated by the time taken to print the matrix unless you measure properly.
If you are talking about performance, it always comes down to complexity. No matter how you change the order, the complexity is defined by the part of the code doing most of the work.
Here all your loops run up to n, so whatever order you choose, the complexity stays O(n^3) unless you change the algorithm itself. One of the asymptotically fastest matrix multiplication algorithms known is the Coppersmith-Winograd algorithm, with O(n^2.3736) complexity, but it is not used for practical purposes.
You can, however, use Strassen's algorithm, which has O(n^2.8074) complexity.
Related
I am new to C++ and programming, so I think I am writing inefficient code.
I was wondering whether there is any way I can speed up the matrix calculations.
For example, this is sample code I wrote which finds the maximum difference (in absolute value) between the 3-D arrays 'V' and 'Vnew'.
First, I take the subtraction.
Then, I put the value of tempdiff[0][0][0] into 'dif'.
Then, I compare 'dif' with tempdiff[i][j][k] and replace it if the latter is larger than the former.
This is just part of my code; there are lots of matrix calculations in it, so I have too many 'for' statements.
So I was wondering whether there is any way I could avoid using 'for' in the matrix calculations.
Thanks in advance.
for (int i = 0; i < Na; i++) {
for (int j = 0; j < Nd; j++) {
for (int k = 0; k < Ny; k++) {
tempdiff[i][j][k] = std::fabs(V[i][j][k] - Vnew[i][j][k]); // plain abs() is the integer overload; use std::fabs from <cmath>
}
}
}
dif = tempdiff[0][0][0];
for (int i = 0; i < Na; i++) {
for (int j = 0; j < Nd; j++) {
for (int k = 0; k < Ny; k++) {
if (tempdiff[i][j][k] > dif) {
dif = tempdiff[i][j][k];
}
}
}
}
There's not much you can do about the for loops, as the maximum difference can be located anywhere. You have already succeeded in iterating the array in the correct, linear order.
Compilers are generally quite good at optimising, but they apparently fail to flatten a contiguous multidimensional array such as float V[Na][Nd][Ny];. After you flatten it manually to float V[Na*Nd*Ny], at least clang can auto-vectorise it and produce SIMD code for x64 and arm.
A further optimisation is to avoid doing this in two passes: the temporary array exactly doubles the total memory throughput compared to a one-pass solution.
I was assuming your matrices are of type float; if you can use int instead, gcc can auto-vectorise this as well (this relates to NaN handling). Furthermore, int16_t or int8_t types are even quicker to evaluate, as more operations can be packed into a single SIMD instruction.
I have a function that finds the maximum value for a range:
A, B, and C are 2-D matrices.
void solve()
{
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
C[i][j] = 0;
for (int k = 0; k < n; k++)
{
C[i][j] = max(C[i][j],A[i][k]*B[k][j]);
//C[i][j] can become very large if solve() is called multiple times
}
}
}
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
A[i][j] = C[i][j];
}
}
}
The solve() method can be called a large number of times (10^7).
A[i][j], B[i][j], and C[i][j] can be up to 10^9.
n will be small (about 20).
I need to print the final matrix C modulo m (i.e. C[i][j] % m).
We cannot apply the modulus to intermediate results, since that can produce wrong answers in the max comparisons.
The problem is integer overflow, since the values can exceed the range of int and long.
Any suggestions to solve this problem (any solution other than big integers)?
Since this is competitive programming, you should think outside the box. You need the actual maximum, not the maximum of the moduli, so you can't reduce modulo m while processing. However:
You are only doing multiplications, and you don't care about the actual values, only about which of two results is bigger. So you can use an isomorphism: work with the log() of the numbers and intermediate results, and keep the modulus as auxiliary information. This works because ab < cd <=> log(ab) < log(cd) <=> log(a) + log(b) < log(c) + log(d). Now you only ever add logarithms, so the values stay small. You lose some precision, but that should be fine in this context. The catch is that you cannot reconstruct the modulus from the log, so you should carry the mod-m value alongside the log in a struct or something.
I am trying to follow the Gaussian elimination algorithm in https://courses.engr.illinois.edu/cs554/fa2015/notes/06_lu_8up.pdf in order to implement LU factorization and eventually parallelize it with OpenMP. Does the following algorithm look correct, where l is the matrix of multipliers and m is the matrix being factored?
void decompose2(double **m) {
begin =clock();
int i=0, j=0, k=0;
for(k = 1; k < size - 1; k++)
{
for(i = k + 1; i < size; i++)
{
l[i][k] = m[i][k]/m[k][k];
}
for(j = k + 1; j < size; j++)
{
for(i = k + 1; k < size; k++)
{
m[i][j] = m[i][j] - (l[i][k]*m[k][j]);
}
}
}
end = clock();
}
I don't think it is correct because according to a different paper the times I am getting after parallelization on the same number of processors are completely different.
"Does the following algorithm look correct, …" -- No, because
arrays are 0-index in C++,
double[size][size] (which you are likely using) is not convertible to double**,
int is not a good type for iterators (use size_t instead),
you don't check if m[k][k] might be (close to) zero, when you might have to swap rows.
Please note that I only looked at the obvious implementation errors, not at possible improvements, e.g. increasing the numerical stability of the calculation.
Here is code to find the determinant of an n x n matrix.
#include <iostream>
using namespace std;
int determinant(int *matrix[], int size);
void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column);
int main()
{
int size;
cout << "What is the size of the matrix for which you want to find the determinant?:\t";
cin >> size;
int **matrix;
matrix = new int*[size];
for (int i = 0 ; i < size ; i++)
matrix[i] = new int[size];
cout << "\nEnter the values of the matrix separated by spaces:\n\n";
for(int i = 0; i < size; i++)
for(int j = 0; j < size; j++)
cin >> matrix[i][j];
cout << "\nThe determinant of the matrix is:\t" << determinant(matrix, size) << endl;
return 0;
}
int determinant(int *matrix[], int size){
if(size==1)return matrix[0][0];
else{
int result=0, sign=-1;
for(int j = 0; j < size; j++){
int **minorMatrix;
minorMatrix = new int*[size-1];
for (int k = 0 ; k < size-1 ; k++)
minorMatrix[k] = new int[size-1];
ijMinor(matrix, minorMatrix, size, 0, j);
sign*=-1;
result+=sign*matrix[0][j]*determinant(minorMatrix, size-1);
for(int i = 0; i < size-1; i++){
delete [] minorMatrix[i];
}
delete [] minorMatrix;
}
return result;
}
}
void ijMinor(int *matrix[], int *minorMatrix[], int size, int row, int column){
for(int i = 0; i < size; i++){
for(int j = 0; j < size; j++){
if(i < row){
if(j < column)minorMatrix[i][j] = matrix[i][j];
else if(j == column)continue;
else minorMatrix[i][j-1] = matrix[i][j];
}
else if(i == row)continue;
else{
if(j < column)minorMatrix[i-1][j] = matrix[i][j];
else if(j == column)continue;
else minorMatrix[i-1][j-1] = matrix[i][j];
}
}
}
}
After adding OpenMP pragmas, I've changed the determinant function and now it looks like this:
int determinant(int *matrix[], int size){
if(size==1)return matrix[0][0];
else{
int result=0, sign=-1;
#pragma omp parallel for default(none) shared(size,matrix,sign) private(j,k) reduction(+ : result)
for(int j = 0; j < size; j++){
int **minorMatrix;
minorMatrix = new int*[size-1];
for (int k = 0 ; k < size-1 ; k++)
minorMatrix[k] = new int[size-1];
ijMinor(matrix, minorMatrix, size, 0, j);
sign*=-1;
result+=sign*matrix[0][j]*determinant(minorMatrix, size-1);
for(int i = 0; i < size-1; i++){
delete [] minorMatrix[i];
}
delete [] minorMatrix;
}
return result;
}
}
My problem is that the result is different every time: sometimes it gives the correct value, but most often it is wrong. I think it's because of the sign variable. I am following the cofactor expansion along the first row, det(A) = sum_{j=0}^{n-1} (-1)^j * a[0][j] * M[0][j], where M[0][j] is the minor obtained by deleting row 0 and column j.
As you can see, the sign should alternate in every iteration of my for loop, but when I use OpenMP, something goes wrong. How can I make this program run correctly with OpenMP?
Finally, my second issue is that using OpenMP does not make the program run any quicker than without it. I also tried a 100,000 x 100,000 matrix, but my program reports an error about allocating memory. How can I run this program with very large matrices?
Your issues as I see it are as follows:
1) As noted by Hristo, your threads are stomping over each other's data with respect to the sign variable. It should be private to each thread so that they have full read/write access to it without having to worry about race conditions. Then, you simply need an algorithm to compute whether sign is plus or minus 1 depending on the iteration j independently from the other iterations. With a little thinking, you'll see that Hristo's suggestion is correct: sign = (j % 2) ? -1 : 1; should do the trick.
2) Your determinant() function is recursive. As written, every iteration of the loop, after forming your minor, calls the function again on that minor. So a single thread performing its iteration enters the recursive call and then tries to split itself into nthreads more threads. You can see how this oversubscribes your system by launching many more threads than you physically have cores. A few easy solutions:
Call your original serial function from within the omp parallel code. This is the fastest option, because it avoids any OpenMP start-up overhead.
Turn off nested parallelism by calling omp_set_nested(0); before your first call to determinant().
Add an if clause to your parallel for directive: if(!omp_in_parallel()), so that nested calls run serially.
3) Your memory issues arise because every level of your recursion allocates more memory. If you fix problem #2, you should use comparable amounts of memory in the serial and parallel cases. That said, it would be much better to allocate all the memory you need before entering your algorithm. Allocating large chunks of memory (and then freeing them!), especially in parallel, is a terrible bottleneck in your code.
Compute the amount of memory you would need (on paper) before entering the first loop and allocate it all at once. I would also strongly suggest you consider allocating your memory contiguously (aka in 1D) to take better advantage of caching as well. Remember that each thread should have its own separate area to work with. Then, change your function to:
int determinant(int *matrix, int *startOfMyWorkspace, int size).
Instead of allocating a new (size-1)x(size-1) matrix inside of your loop, you would simply utilize the next (size-1)*(size-1) integers of your workspace, update what startOfMyWorkspace would be for the next recursive call, and continue along.
This is only a compact test case but I have a vector of doubles and I want to populate a square matrix (2D vector) of all pairwise differences. When compiled with -O3 optimization, this takes about 1.96 seconds on my computer (computed only from the nested double for-loop).
#include <vector>
using namespace std;
int main(){
vector<double> a;
vector<vector<double> > b;
unsigned int i, j;
unsigned int n;
double d;
n=10000; //In practice, this value is MUCH bigger
a.resize(n);
for (i=0; i< n; i++){
a[i]=static_cast<double>(i);
}
b.resize(n);
for (i=0; i< n; i++){
b[i].resize(n);
b[i][i]=0.0; //Zero diagonal
}
for (i=0; i< n; i++){
for (j=i+1; j< n; j++){
d=a[i]-a[j];
//Commenting out the next two lines makes the code significantly faster
b[i][j]=d;
b[j][i]=d;
}
}
return 0;
}
However, when I comment out the two lines:
b[i][j]=d;
b[j][i]=d;
The program completes in about 0.000003 seconds (computed only from the nested double for-loop)! I really didn't expect these two lines to be the rate-limiting step. I've been staring at this code for a while and I'm out of ideas. Can anybody offer any suggestions on how I could optimize this simple piece of code so that the time can be significantly reduced?
When you comment out those two lines, all that's left in the nested loop is to keep computing d and then throwing away the result. Since this can't have any effect on the behaviour of the program, the compiler will just optimize out the nested loop. That's why the program finishes almost instantaneously.
In fact, I confirmed this by compiling the code twice with g++ -O3, once with only the d=a[i]-a[j] statement in the nested loop, and once with the nested loop deleted entirely. The code emitted was identical.
Nevertheless, your code is currently slower than it needs to be, because it is missing the cache. When you access a two-dimensional array in a nested loop like this, you should always arrange for the iteration to be contiguous in memory where possible. This means the second index should be the one varying fastest. The access to b[j][i] violates this rule and misses the cache. So let's rewrite.
Before:
for (i=0; i< n; i++){
for (j=i+1; j< n; j++){
d=a[i]-a[j];
b[i][j]=d;
b[j][i]=d;
}
}
Timing:
real 0m1.026s
user 0m0.824s
sys 0m0.196s
After:
for (i = 0; i < n; i++) {
for (j = 0; j < i; j++) {
b[i][j] = a[j] - a[i];
}
for (j = i+1; j < n; j++) {
b[i][j] = a[i] - a[j];
}
}
Timing:
real 0m0.335s
user 0m0.164s
sys 0m0.164s