I am trying to follow the Gaussian elimination algorithm in https://courses.engr.illinois.edu/cs554/fa2015/notes/06_lu_8up.pdf in order to implement LU factorization and eventually parallelize it with OpenMP. Does the following algorithm look correct, where l is the multiplier and m is the matrix?
void decompose2(double **m) {
begin =clock();
int i=0, j=0, k=0;
for(k = 1; k < size - 1; k++)
{
for(i = k + 1; i < size; i++)
{
l[i][k] = m[i][k]/m[k][k];
}
for(j = k + 1; j < size; j++)
{
for(i = k + 1; k < size; k++)
{
m[i][j] = m[i][j] - (l[i][k]*m[k][j]);
}
}
}
end = clock();
}
I don't think it is correct, because the times I get after parallelization are completely different from those reported in another paper for the same number of processors.
"Does the following algorithm look correct, …" -- No, because
arrays are 0-indexed in C++, so k should start at 0,
the innermost loop header for(i = k + 1; k < size; k++) tests and increments k where it should use i,
double[size][size] (which you are likely using) is not convertible to double**,
int is not a good type for array indices (use size_t instead),
you don't check whether m[k][k] is (close to) zero, in which case you might have to swap rows.
Please note that I only looked at the obvious implementation errors, not at possible ways to improve the code, e.g. increasing the numerical stability of the calculation.
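The points above can be put together in a minimal sketch. It assumes the matrix is stored as a std::vector of rows rather than the question's double**, that the multipliers go into a separate matrix l as in the question, and that a fixed tolerance of 1e-12 is an acceptable stand-in for a proper pivoting strategy. The names Matrix and decompose are illustrative, not taken from the lecture notes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// LU factorization by Gaussian elimination without pivoting.
// On return, l holds the multipliers below its diagonal and the
// upper triangle of m (diagonal included) holds U. Entries of m
// below the diagonal are left untouched, as in the question's code.
void decompose(Matrix& m, Matrix& l) {
    const std::size_t n = m.size();
    assert(l.size() == n);
    for (std::size_t k = 0; k + 1 < n; ++k) {
        // Assumed tolerance; a real solver would swap rows here instead.
        if (std::fabs(m[k][k]) < 1e-12) {
            throw std::runtime_error("pivot too small, row swap needed");
        }
        for (std::size_t i = k + 1; i < n; ++i) {
            l[i][k] = m[i][k] / m[k][k];
        }
        for (std::size_t j = k + 1; j < n; ++j) {
            for (std::size_t i = k + 1; i < n; ++i) {
                m[i][j] -= l[i][k] * m[k][j];
            }
        }
    }
}
```

For a fixed k, the iterations of the two inner loop nests are independent of each other, so those are the natural candidates for an OpenMP work-sharing directive; the outer k loop carries a dependency and has to stay sequential.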
I am new to C++ and programming, so I think I am writing inefficient code.
I was wondering whether there is any way I can speed up the matrix calculations.
For example, this is sample code I wrote that finds the maximum difference (in absolute value) between the 3D arrays 'V' and 'Vnew'.
First, I take the difference.
Then I assign the value of tempdiff[0][0][0] to 'dif'.
Then I compare 'dif' with tempdiff[i][j][k] and replace it if the latter is larger.
This is just a part of my code; there are lots of matrix calculations inside, so I have too many 'for' statements.
So I was wondering whether there is any way I could avoid using 'for' in the matrix calculations.
Thanks in advance.
for (int i = 0; i < Na; i++) {
for (int j = 0; j < Nd; j++) {
for (int k = 0; k < Ny; k++) {
tempdiff[i][j][k] = abs(V[i][j][k] - Vnew[i][j][k]);
}
}
}
dif = tempdiff[0][0][0];
for (int i = 0; i < Na; i++) {
for (int j = 0; j < Nd; j++) {
for (int k = 0; k < Ny; k++) {
if (tempdiff[i][j][k] > dif) {
dif = tempdiff[i][j][k];
}
else {
dif = dif;
}
}
}
}
There's not much you can do about the for loops, as the maximum difference can be located anywhere in the array. You are already iterating over the array in the correct, linear order.
Compilers are generally quite good at optimising, but they apparently fail to flatten a contiguous array such as float V[Na][Nd][Ny];. After you flatten it manually to float V[Na*Nd*Ny], at least clang can auto-vectorise the loop and produce SIMD code for x64 and ARM.
A further optimisation is to avoid doing this in two steps: with the temporary array, the total memory traffic is doubled compared to a one-pass solution.
I was assuming your matrices are of type float. If you can use int, gcc can auto-vectorise this as well (the difference relates to NaN handling); furthermore, int16_t or int8_t types are even quicker to evaluate, as more elements can be packed into a single SIMD instruction.
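A one-pass version over flattened arrays might look like the following sketch. It assumes both arrays are stored as std::vector<float> of length Na*Nd*Ny, with element (i, j, k) at index (i*Nd + j)*Ny + k; the function name max_abs_diff is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest |v[i] - vnew[i]| over two flattened arrays of equal length.
// No temporary array is written, so each input is read exactly once.
float max_abs_diff(const std::vector<float>& v,
                   const std::vector<float>& vnew) {
    assert(v.size() == vnew.size());
    float dif = 0.0f;  // absolute values are never negative
    for (std::size_t i = 0; i < v.size(); ++i) {
        dif = std::max(dif, std::fabs(v[i] - vnew[i]));
    }
    return dif;
}
```

Whether a given compiler vectorises this loop depends on its version and flags, so it is worth checking the generated assembly or a benchmark rather than assuming it.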
I have the following piece of C++ code. The scale of the problem is given by N and M. Running the code takes about two minutes on my machine (after compiling with g++ -O3). Is there any way to accelerate it further on the same machine? Any kind of option is on the table: a better data structure, a library, GPU, parallelism, etc.
void demo() {
int N = 1000000;
int M=3000;
vector<vector<int> > res(M);
for (int i =0; i <N;i++) {
for (int j=1; j < M; j++){
res[j].push_back(i);
}
}
}
int main() {
demo();
return 0;
}
Additional info: the second loop above, for (int j=1; j < M; j++), is a simplified version of the real problem. In fact, j could be in a different range for each i (of the outer loop), but the number of iterations is about 3000.
With the exact code as shown when writing this answer, you could create the inner vector once, with the specific size, and call iota to initialize it. Then just pass this vector along to the outer vector constructor to use it for each element.
Then you don't need any explicit loops at all, and instead use the (highly optimized, hopefully) standard library to do all the work for you.
Perhaps something like this:
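#include <numeric> // std::iota
#include <vector>

void demo()
{
static int const N = 1000000;
static int const M = 3000;
std::vector<int> data(N);
// Fill data with 0, 1, ..., N-1.
std::iota(begin(data), end(data), 0);
// Every one of the M inner vectors is a copy of data.
std::vector<std::vector<int>> res(M, data);
}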
Alternatively, you could initialize just one vector with those elements and then create the other vectors by copying that block of memory using std::memcpy or std::copy.
Another optimization would be to allocate the memory in advance (e.g. array.reserve(3000)).
Also, if you're sure that all the inner vectors are identical, you could use a hack: create a single vector and have every entry of res refer to that same vector instead of holding its own copy.
On my machine, which has enough memory to avoid swapping, your original code took 86 seconds.
Adding reserve:
for (auto& v : res)
{
v.reserve(N);
}
made basically no difference (85 seconds, but I only ran each version once).
Swapping the loop order:
for (int j = 1; j < M; j++) {
for (int i = 0; i < N; i++) {
res[j].push_back(i);
}
}
reduced the time to 10 seconds. This is likely due to a combination of allowing the compiler to use SIMD optimisations and improving cache locality by accessing memory in sequential order.
Creating one vector and copying it into the others:
for (int i = 0; i < N; i++) {
res[1].push_back(i);
}
for (int j = 2; j < M; j++) {
res[j] = res[1];
}
reduced the time to 4 seconds.
Using a single vector:
void demo() {
size_t N = 1000000;
size_t M = 3000;
vector<int> res(M*N);
size_t offset = N;
for (size_t i = 0; i < N; i++) {
res[offset++] = i;
}
for (size_t j = 2; j < M; j++) {
std::copy(res.begin() + N, res.begin() + N * 2, res.begin() + offset);
offset += N;
}
}
also took 4 seconds. There probably isn't much improvement because you have 3,000 vectors of 4 MB each; there would likely be more of a difference if N were smaller or M were larger.
I need to multiply two 10x10 matrices using OpenMP. I decided to split the rows of one matrix into groups of 3 rows, 3 rows and 4 rows. How do I fix this code for the first three rows?
#pragma omp parallel for reduction(+:m[p][q])
{
for (p = 0; p < 3; p++)
for (q = 0; q < 10; q++)
for (k = 0; k < 10; ++k)
{
m[p][q] += l[p][k] * o[k][q];
}
}
For a start - don't split the matrix yourself, but let OpenMP take care of sharing the work in the loops, e.g.
#pragma omp parallel for
for (int p = 0; p < 10; p++)
for (int q = 0; q < 10; q++)
for (int k = 0; k < 10; ++k)
{
m[p][q] += l[p][k] * o[k][q];
}
In this code there is no need for a reduction because all concurrent write operations happen to different elements of m. Even if you collapse(2) the first two loops, you are still fine in that regard.
That said, optimizing matrix multiplication is an immensely complex topic on modern hardware, and parallelizing it even more so. If you want performance, use a BLAS implementation that is optimized for your architecture. If you want to learn, I suggest you start with the serial implementation and then go on to parallelizing it. There is plenty of educational material available for either.
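As a self-contained sketch of the approach described above: the directive has to be followed directly by the for loop, and the loop variables are declared inside the loops so that each thread gets its own copy. The Mat alias, the constant kN and the function name multiply are assumptions for illustration; the local accumulator sum is a small variation that also removes the need to zero m beforehand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kN = 10;
using Mat = std::array<std::array<double, kN>, kN>;

// Computes m = l * o. Each iteration of the p loop writes only row p
// of m, so the threads never write to the same element and no
// reduction clause is needed.
void multiply(const Mat& l, const Mat& o, Mat& m) {
    #pragma omp parallel for
    for (int p = 0; p < static_cast<int>(kN); ++p) {
        for (std::size_t q = 0; q < kN; ++q) {
            double sum = 0.0;
            for (std::size_t k = 0; k < kN; ++k) {
                sum += l[p][k] * o[k][q];
            }
            m[p][q] = sum;
        }
    }
}
```

With GCC or Clang this needs the -fopenmp flag to actually run in parallel; without it the pragma is ignored and the loop runs serially with the same result. For a 10x10 matrix, the cost of starting threads will most likely outweigh any gain.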
I have a struct:
struct xyz{
int x,y,z;
};
and I initialize a struct xyz type vector:
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
for (int k = 0; k < N; k++)
{
v.x=i;
v.y=j;
v.z=k;
vect.push_back(v);
}
}
}
Then I want to convert that vector to an array, because an array is two times faster to manipulate than a vector, so I do
xyz arr[vect.size()];
std::copy(vect.begin(), vect.end(), arr);
When I run this program I get a segmentation fault, which I think is because vect.size() is too large.
So I am wondering whether there is any way to convert such a large vector to an array without that problem.
I appreciate any help.
My overly pedantic comment got too big, so instead I'll try to make this a somewhat roundabout answer. The short answer is probably just to stick with vector but make sure to use reserve; oh, and benchmark.
You didn't say what compiler or C++ version you're using, so I'll just go with my current gcc.godbolt.org default of gcc 4.9.2, C++14. I'm also assuming that you really want this as a 1-dimension array, rather than the more natural (for your example) 3.
If you know N at compile time, you could do something like this (assuming I got the array offset calculation correct):
#include <array>
...
std::array<xyz, N*N*N> xyzs;
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
for (int k = 0; k < N; k++) {
xyzs[i*N*N+j*N+k] = {i, j, k};
}
}
}
The biggest downsides, IMO:
error-prone offset calculation
depending on N, where the code is run, etc, this can blow the stack
On the compilers I tried this on, the optimizers seem to understand that we're moving through the array in contiguous order, and the generated machine code is more sensible, but it could also be written like so, if you prefer:
#include <array>
...
std::array<xyz, N*N*N> xyzs;
auto p = xyzs.data();
for (int i = 0; i < N; ++i) {
for (int j = 0; j < N; ++j) {
for (int k = 0; k < N; ++k) {
(*p++) = {i, j, k};
}
}
}
Of course, if you actually know N at compile time, and it won't blow the stack, you might consider a 3-dimensional array xyz xyzs[N][N][N]; since this might be more natural for the way these things are ultimately being used.
As pointed out in the comments, variable length arrays aren't legal C++, but they are legal in C99; if you don't know N at compile time, you should be allocating from the heap.
A vector and an array will wind up being identical in terms of memory layout; they differ in that the vector allocates memory from the heap, and the array (as you are writing it) would be on the stack. The only recommendation I'd make is to call reserve before entering your loop:
vect.reserve(N*N*N);
This means you'll only be doing a single memory allocation up front, rather than the grow-and-copy mechanism that you'll get from a default-constructed vector.
Assuming xyz is as simple as you declare here, you could also do something like the second example above:
std::vector<xyz> xyzs(N*N*N); // parentheses select the size constructor; braces reject a non-constant int as narrowing
auto p = xyzs.data();
for (int i = 0; i < N; ++i) {
for (int j = 0; j < N; ++j) {
for (int k = 0; k < N; ++k) {
(*p++) = {i, j, k};
}
}
}
You lose the safety of push_back, and it is less efficient if xyz's default constructor needs to do anything (for example, if the xyz members were changed to have default values).
Having said all that, you really should benchmark. But then, you should probably be benchmarking the code that ultimately uses this array, rather than the code to construct it; I'd have other concerns if construction was dominating usage.
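For the case where N is only known at run time, the reserve-then-push_back variant can be sketched as below. The helper names make_grid and flat_index are hypothetical; flat_index simply spells out the i*N*N + j*N + k offset used earlier so that it can be checked in one place.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct xyz {
    int x, y, z;
};

// Builds all n*n*n points on the heap with a single allocation.
std::vector<xyz> make_grid(int n) {
    assert(n >= 0);
    const std::size_t count = static_cast<std::size_t>(n) * n * n;
    std::vector<xyz> grid;
    grid.reserve(count);  // one allocation, no grow-and-copy
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                grid.push_back({i, j, k});
            }
        }
    }
    return grid;
}

// Offset of point (i, j, k) in the flat vector: i*n*n + j*n + k.
std::size_t flat_index(int i, int j, int k, int n) {
    return (static_cast<std::size_t>(i) * n + j) * n + k;
}
```

Because the storage is on the heap, the size is limited by available memory rather than by the stack, which is what the segmentation fault in the question points to.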
const int n=50;
double a[n][n];
double b[n][n];
double c[n][n];
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
for (int k = 0; k < n; k++) {
c[i][j] += a[i][k] * b[k][j];
}
cout << c[i][j] << " ";
}
cout << "\n";
}
I currently have a working code that multiplies two nxn matrices. I am trying to reorder the indices (ie i,k,j ... k,i,j) without touching the equation that does the multiplication. I am doing this to see how the order of the indices affects performance time, but if I just change the 'j's to 'k's and vice versa in my loops, my multiplication equation will not be correct.
I am wondering if what I am attempting to do is possible and if anyone can shed some light on what steps I can take to achieve this.
First of all, you shouldn't be printing out the c matrix at the point you are doing so, especially if you are trying to time an algorithm. What you should be doing is more similar to this:
const int n=50;
double a[n][n];
double b[n][n];
double c[n][n] = {}; // zero-initialised, since the loop below accumulates into c
/* First multiply the matrices a,b into c. */
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
for (int k = 0; k < n; k++) {
c[i][j] += a[i][k] * b[k][j];
}
}
}
/* now print out the result for visual correctness check */
for (int i = 0; i < n; i++){
for (int j = 0; j < n; j++){
std::cout << c[i][j] << ' '; //this will leave a space after last character, but for this use case, nobody cares.
}
std::cout << std::endl;
}
Then you can just reorder the lines containing the for loops (i.e. for (int i = 0; i < n; i++)) and see whether changing the access pattern changes the execution time or the results.
Spoiler: It shouldn't affect results except in some border cases of weird values inside the matrices, that are caused by inexactness of floating point math. It should however affect execution time, but it will be brutally dominated by time taken by printing the matrix, unless measured properly.
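A small sketch of such an experiment, with the multiplication statement left untouched and only the loop headers reordered. It assumes square matrices stored as std::vector of rows; the names multiply_ijk, multiply_ikj and seconds are made up for illustration, and the timing helper uses std::chrono::steady_clock.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// i, j, k order: the innermost loop walks down a column of b.
Matrix multiply_ijk(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    assert(b.size() == n);
    Matrix c(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// i, k, j order: same statement, but the innermost loop now walks
// along a row of b and a row of c.
Matrix multiply_ikj(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    assert(b.size() == n);
    Matrix c(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Runs f once and returns the elapsed wall-clock time in seconds.
template <typename F>
double seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

A call such as seconds([&] { multiply_ikj(a, b); }) then times one variant without any printing inside the measured region. With n = 50 a single run is very short, so repeating the call many times or using larger matrices gives more meaningful numbers.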
If you are talking about performance, it always comes down to complexity. No matter how you change the order, your complexity is determined by the part of the code that does most of the work.
Here all your loops run up to n, so no matter how you order them, the complexity is O(n^3) unless you change the algorithm. The asymptotically fastest known matrix multiplication algorithms are Coppersmith-Winograd and its refinements, with roughly O(n^2.37) complexity, but they are not used for practical purposes.
But you can use Strassen's algorithm, which has O(n^2.8074) complexity.
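For reference, the Strassen exponent comes from its recurrence: the algorithm splits each matrix into four half-size blocks and needs 7 block multiplications instead of the 8 used by the naive block scheme, plus a constant number of block additions costing \Theta(n^2). Writing T(n) for the running time on n x n matrices:

```latex
\begin{align*}
T(n) &= 7\,T\!\left(\tfrac{n}{2}\right) + \Theta(n^2) \\
\Rightarrow\quad T(n) &= \Theta\!\left(n^{\log_2 7}\right) \approx \Theta\!\left(n^{2.8074}\right)
\end{align*}
```

With 8 multiplications, the same recurrence gives \Theta(n^{\log_2 8}) = \Theta(n^3), which is the complexity of the triple loop.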