Will matrix multiplication using for loops decrease performance? - c++

Currently I'm working on a program that uses matrices. I came up with this nested loop to multiply two matrices:
// The matrices are 1-dimensional arrays
for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
        for (int k = 0; k < 4; k++)
            result[i * 4 + j] += M1[i * 4 + k] * M2[k * 4 + j];
The loop works. My question is: will this loop be slower compared to writing it all out manually like this:
result[0] = M1[0]*M2[0] + M1[1]*M2[4] + M1[2]*M2[8] + M1[3]*M2[12];
result[1] = M1[0]*M2[1] + M1[1]*M2[5] + M1[2]*M2[9] + M1[3]*M2[13];
result[2] = ... etc.
Because in the nested loop the array positions are calculated at runtime, whereas in the second method they are not.
Thanks.

As with so many things, "it depends", but in this instance I would expect the second, expanded form to perform just about the same. Any modern compiler will unroll appropriate loops for you and take care of it.
Two points perhaps worth making:
The second approach is uglier, more prone to errors, and tedious to write and maintain.
This is a nice example of 'premature optimization' (AKA the root of all evil). Do you know this section is a bottleneck? Is it really the most intensive part of the code? By optimizing so early we incur everything in point #1 for what amounts to a hunch if we haven't benchmarked our code.

Your compiler might already do this; take a look at loop unrolling.
Let the compiler do the guessing and the heavy work, stick to the clean code, and as always, measure your performance.
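For example, a quick harness like the following will settle it. This is a minimal sketch: multiply() is a hypothetical wrapper around whichever version you want to time (the nested loop or the expanded form), and the float element type is an assumption, not from the question.

#include <chrono>
#include <cstdio>

// Hypothetical wrapper around the version under test.
void multiply(float* result, const float* M1, const float* M2);

void benchmark(float* result, const float* M1, const float* M2)
{
    const int reps = 1000000;
    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < reps; rep++)
        multiply(result, M1, M2);          // run enough reps to swamp timer overhead
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("%.2f ns per multiply\n", ns / reps);
}

Compare both versions under your real compiler flags; the optimizer may well generate identical code for the two.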

I don't think the loop will be slower: you are accessing the memory of the M1 and M2 arrays in the same order in both instances. If you want to make the "manual" version faster then use scalar replacement and do the computation in registers, e.g.
double M1_0 = M1[0];
double M2_0 = M2[0];
result[0] = M1_0*M2_0 + ...
but you can use scalar replacement within the loop as well. You can do it if you do blocking and loop unrolling (in fact your triple loop looks like a blocked version of matrix-matrix multiplication).
What you are trying to do is to speed up the program by improving locality, i.e. making better use of the memory hierarchy.
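For instance, a sketch of scalar replacement inside the loop (the float element type is an assumption) could look like this: each row of M1 is loaded into locals once, so the compiler can keep it in registers across the inner loop.

// The row of M1 is read into scalars once per value of i.
for (int i = 0; i < 4; i++)
{
    const float a0 = M1[i * 4 + 0], a1 = M1[i * 4 + 1],
                a2 = M1[i * 4 + 2], a3 = M1[i * 4 + 3];
    for (int j = 0; j < 4; j++)
    {
        result[i * 4 + j] = a0 * M2[0 * 4 + j] + a1 * M2[1 * 4 + j]
                          + a2 * M2[2 * 4 + j] + a3 * M2[3 * 4 + j];
    }
}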

Assuming that you are running code on Intel processors or compatible (AMD), you may actually want to switch to assembly language to do heavy matrix computations. Luckily, you have the Intel IPP library, which does the actual work for you using advanced processor technology and selects what is expected to be the fastest algorithm for your processor.
The IPP includes all the necessary matrix computations that you'd possibly need. The only problem you may encounter is the order in which you created your matrices. You may have to re-organize the order to make it easier to use the IPP functions you'd like to use.
Note that in regard to your two code examples, the second one will be faster because you avoid the += operator, which is a read/modify/write cycle and generally slow (not only that, it requires the result matrix to be all zeroes to start with, whereas the second example does not require clearing the output first). Your matrices are likely to fit in the cache, but processors are optimized to read input data in sequence (a[0], a[1], a[2], a[3], ...) and also to write that data back in sequence. If you can write your algorithm to be as close as possible to such a sequence, all the better. Don't get me wrong, I know that matrix multiplications cannot be done entirely in sequence. But if you keep that in mind when optimizing, you'll achieve better results (changing the order in which your matrices are stored in memory could be one such change).
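As a hedged illustration of that sequential-access idea (element type float is an assumption), reordering the original triple loop to i-k-j makes the inner loop walk result and M2 contiguously:

// result must be zero-initialised first, as with the original loop.
for (int i = 0; i < 4; i++)
    for (int k = 0; k < 4; k++)
    {
        const float a = M1[i * 4 + k];   // loop-invariant scalar
        for (int j = 0; j < 4; j++)
            result[i * 4 + j] += a * M2[k * 4 + j];
    }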

Related

Why is vectorization not beneficial in this for loop?

I am trying to vectorize this for loop. After using the -Rpass flag, I am getting the following remark for it:
int someOuterVariable = 0;
for (unsigned int i = 7; i != -1; i--)
{
    array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}
Remark:
The cost-model indicates that vectorization is not beneficial
the cost-model indicates that interleaving is not beneficial
I want to understand what this means. Does "interleaving is not beneficial" mean the array indexing is not proper?
It's hard to answer without more details about your types. But in general, starting a loop incurs some costs and vectorising also implies some costs (such as moving data to/from SIMD registers and ensuring proper alignment of data).
I'm guessing here that the compiler is telling you that the vectorisation cost is bigger than simply running the 8 iterations without it, so it's not doing it.
Try to increase the number of iterations, or help the compiler, for example by making the alignment computable.
Typically, unless array's items are exactly of the proper alignment for a SIMD vector, accessing the array at an "unknown" offset (what you've called someOuterVariable) prevents the compiler from writing efficient vectorised code.
EDIT: About the "interleaving" question, it's hard to guess without knowing your tool. But in general, interleaving usually means mixing 2 streams of computations so that the compute units of the CPU are all busy. For example, if you have 2 ALUs in your CPU, and the program is doing:
c = a + b;
d = e * f;
The compiler can interleave the computations so that both the addition and multiplication happen at the same time (provided you have 2 ALUs available). Typically, this means that the multiplication, which takes a bit longer to compute (for example 6 cycles), will be started before the addition (for example 3 cycles). You'll then get the result of both operations after only 6 cycles instead of 9 if the compiler serialized the computations. This is only possible if there are no dependencies between the computations (if d required c, it cannot work). A compiler is very cautious about this and, in your example, will not apply this optimization if it can't prove that array and anotherArray don't alias.
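A sketch of one way to give that hint (the float element type and the function wrapper are assumptions): restrict-qualified parameters promise the compiler that the two arrays never overlap, and a forward counted loop is friendlier to the cost model than the original decrement-to-wraparound form.

void scale_sub(float* __restrict array,
               const float* __restrict anotherArray,
               unsigned someOuterVariable)
{
    // __restrict: no aliasing between array and anotherArray.
    for (unsigned i = 0; i < 8; i++)
        array[someOuterVariable + i] -= 0.3f * anotherArray[i];
}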

How can I improve the perfomance of my OpenMP code?

I am currently trying to improve parallel performance on my code and I am still new to OpenMP. I have to iterate over a large container, in each iteration reading from multiple entries and writing a result to a single entry. Below is a very minimal code example of what I am trying to do.
data is a pointer to an array where a lot of datapoints are stored. Before the parallel region I create an array newData, so I can use data as read-only and newData as write-only; afterwards I throw the old data away and use newData for further calculations.
To my understanding, data and newData are shared between threads and everything declared inside the parallel region is private.
Can reading from data by multiple threads cause performance issues?
I am using #pragma omp critical for assigning a new value to an element of newData to avoid race conditions. Is this necessary, since I access every element of newData only once and never by multiple threads?
Also I am not sure about scheduling. Do I have to specify whether I want a static or dynamic schedule? Can I use nowait, since all threads are independent of each other?
array *newData = new array;
omp_set_num_threads(threads);
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < range; i++)
    {
        double middle = (*data)[i];
        double previous = (*data)[i-1];
        double next = (*data)[i+1];
        double new_value = (previous + middle + next) / 3.0;
        #pragma omp critical(assignment)
        (*newData)[i] = new_value;
    }
}
delete data;
data = newData;
I am aware that in the first and last iteration previous and next cannot be read from data; in the real code this is taken care of, but for this minimal example you get the idea of reading multiple times from data.
First of all, get rid of all unnecessary dependencies. #pragma omp critical(assignment) is not necessary because each index of (*newData) is only written to once per loop, so there's no race condition.
Your code could now look like this:
#pragma omp parallel for
for (int i = 0; i < range; i++)
    (*newData)[i] = ((*data)[i-1] + (*data)[i] + (*data)[i+1]) / 3.0;
Now we're looking for bottlenecks. The list of potential candidates I came up with is this:
Slow division
Cache thrashing
ILP (Instruction level parallelism)
Memory bandwidth limitations
Hidden dependencies
So let's analyze them further.
Slow division:
It takes some CPUs forever to calculate double/double. To know how long and what throughput your CPU has, you have to look at its specs. Maybe replacing /3.0 with *0.3333.. might help, but maybe your compiler does this already. Using extended instruction sets (like SSE/AVX) you might schedule several divisions/multiplications at once.
Cache thrashing:
Because your CPU has to load/store one cache line at a time, there can be conflicts (this is known as false sharing). Imagine thread 1 tries to write to (*newData)[1] and thread 2 to (*newData)[2], and they are on the same cache line. Now one of them has to wait for the other. You could resolve this with #pragma omp parallel for schedule(static, 64).
ILP:
CPUs can schedule multiple operations into a pipeline if the operations are independent. For this to happen you have to unroll your loop. This could look like this:
assert(range % 4 == 0);
#pragma omp parallel for
for (int i = 0; i < range/4; i++) {
    (*newData)[i*4+0] = ((*data)[i*4-1] + (*data)[i*4+0] + (*data)[i*4+1]) / 3.0;
    (*newData)[i*4+1] = ((*data)[i*4+0] + (*data)[i*4+1] + (*data)[i*4+2]) / 3.0;
    (*newData)[i*4+2] = ((*data)[i*4+1] + (*data)[i*4+2] + (*data)[i*4+3]) / 3.0;
    (*newData)[i*4+3] = ((*data)[i*4+2] + (*data)[i*4+3] + (*data)[i*4+4]) / 3.0;
}
Memory bandwith limitations:
For your very simple loop, think about this: how much memory do you have to load, and how long will your CPU be busy processing it? You're loading about 1 cache line and computing some dereferences, some pointer additions, two additions and one division. Which limit you hit depends on your CPU specs.
Now consider cache locality. Can you modify your code to make better use of the cache? If one thread gets i=3 in one loop iteration and i=7 in the next, you have to reload 3 (*data) values. But if you go from i=3 to i=4, you might not have to load anything, because (*data)[i+1] was in the cache line loaded previously. You save some RAM bandwidth. To make use of this, unroll the loop. Also, using float instead of double increases this chance.
Hidden dependencies:
Now this part I personally find very tricky. Sometimes your compiler isn't sure it can reuse some data, because it doesn't know it hasn't changed. Using const helps the compiler. But sometimes you need restrict to give the compiler the right hint. But I don't understand this well enough to explain it.
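A sketch of what such a hint could look like here (the function wrapper is an assumption, and boundary handling is simplified by skipping the first and last element):

void smooth(double* __restrict out, const double* __restrict in, int range)
{
    // __restrict promises out and in never alias, so loads of in
    // can be reused freely across iterations.
    #pragma omp parallel for
    for (int i = 1; i < range - 1; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) * (1.0 / 3.0);
}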
So here is what I would try:
const double ONETHIRD = 1.0 / 3.0;
assert(range % 4 == 0);
#pragma omp parallel for schedule(static, 1024)
for (int i = 0; i < range/4; i++) {
    (*newData)[i*4+0] = ((*data)[i*4-1] + (*data)[i*4+0] + (*data)[i*4+1]) * ONETHIRD;
    (*newData)[i*4+1] = ((*data)[i*4+0] + (*data)[i*4+1] + (*data)[i*4+2]) * ONETHIRD;
    (*newData)[i*4+2] = ((*data)[i*4+1] + (*data)[i*4+2] + (*data)[i*4+3]) * ONETHIRD;
    (*newData)[i*4+3] = ((*data)[i*4+2] + (*data)[i*4+3] + (*data)[i*4+4]) * ONETHIRD;
}
And then benchmark. Benchmark some more, and benchmark some more. Only benchmarks will show you which tricks help.
PS: One more thing to consider. If you see your program hitting the memory bandwidth hard, you could consider changing the algorithm. Maybe fuse two steps into one, like going from
b[i] := (a[i-1] + a[i] + a[i+1]) / 3.0
to
d[i] := (b[i-1] + b[i] + b[i+1]) / 3.0 = (a[i-2] + 2.0 * a[i-1] + 3.0 * a[i] + 2.0 * a[i+1] + a[i+2]) / 9.0. I think you can work out the reason for this yourself.
Have fun optimizing ;-)
Reading an array by multiple threads usually does no harm.
You only need a critical section if multiple threads work on the exact same piece of data; here each thread accesses a different part of the array, so you don't need it. Critical sections are very bad for performance, so only use them if absolutely necessary. Often they can be replaced by atomic actions:
openMP, atomic vs critical?
Like a critical section, they don't make sense if each thread accesses different data.
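For contrast, a minimal sketch of a case where an atomic is warranted, because every thread really does update the same location (accumulating one shared sum):

double sum = 0.0;
#pragma omp parallel for
for (int i = 0; i < range; i++)
{
    #pragma omp atomic
    sum += (*data)[i];  // one shared variable; a reduction(+:sum) clause would be faster still
}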
For the scheduling it's best to test each option and measure the performance, as predictions about performance are often wrong. Also try different chunk sizes.
Some other things that might help:
Measuring performance is often interfered with by other tasks on your PC, so take multiple measurements and take their minimum (except if the input is different each time; then take the average and do more measurements).
Do you really need double precision? Floats are a lot faster.
edit: nowait is for multiple independent for loops: https://msdn.microsoft.com/en-us/library/ek5st0e3.aspx
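A minimal sketch of that situation (a, b, n and m are assumed names): the implicit barrier after the first loop can be dropped because the second loop never reads what the first one wrote.

#pragma omp parallel
{
    #pragma omp for nowait   // skip the implicit barrier after this loop
    for (int i = 0; i < n; i++)
        a[i] *= 2.0;

    #pragma omp for          // safe: this loop does not touch a[]
    for (int i = 0; i < m; i++)
        b[i] += 1.0;
}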
I assume you are trying to do some kind of convolution or median blur with a 1D array. The short answer is: stick to the default schedule strategy, and get rid of critical altogether.
As far as I can tell, you are quite new to parallelism, and it's a little bit confusing to deal with OpenMP directives like nowait/private/reduction/critical/atomic/single, etc. I think what you need is a well-written textbook to clarify the various concepts. Once you have sound knowledge, an hour of learning OpenMP can be enough to deal with most daily programming.

Algorithm: taking out every 4th item of an array

I have two huge arrays (int source[1000], dest[1000] in the code below, but having millions of elements in reality). The source array contains a series of ints of which I want to copy 3 out of every 4.
For example, if the source array is:
int source[1000] = {1,2,3,4,5,6,7,8....};
int dest[1000];
Here is my code:
for (int count_small = 0, count_large = 0; count_large < 1000; count_small += 3, count_large += 4)
{
    dest[count_small]   = source[count_large];
    dest[count_small+1] = source[count_large+1];
    dest[count_small+2] = source[count_large+2];
}
In the end, dest console output would be:
1 2 3 5 6 7 9 10 11...
But this algorithm is so slow! Is there an algorithm or an open source function that I can use / include?
Thank you :)
Edit: The actual length of my array would be about 1 million (640*480*3)
Edit 2: Processing this for loop takes about 0.98 to 2.28 seconds, while the other code only takes 0.08 to 0.14 seconds, so the device spends at least 90% of its CPU time on the loop alone.
Well, the asymptotic complexity there is as good as it's going to get. You might be able to achieve slightly better performance by loading in the values as four 4-way SIMD integers, shuffling them into three 4-way SIMD integers, and writing them back out, but even that's not likely to be hugely faster.
With that said, though, the time to process 1000 elements (Edit: or one million elements) is going to be utterly trivial. If you think this is the bottleneck in your program, you are incorrect.
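For the curious, here is one hedged sketch of that shuffle idea. It needs only SSSE3's byte shuffle (pshufb), and it assumes the element counts are multiples of 4 and 3 respectively and that dest has at least 4 bytes of slack after its last logical element, since each store writes a full 16 bytes.

#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

void copy3of4(const int* source, int* dest, int n)  // n is a multiple of 4
{
    // Keep ints 0,1,2 of each 16-byte block; the -1 lanes become zero.
    const __m128i mask = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                       8, 9, 10, 11, -1, -1, -1, -1);
    char* out = reinterpret_cast<char*>(dest);
    for (int i = 0; i < n; i += 4)
    {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(source + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_shuffle_epi8(v, mask));
        out += 12;  // advance 3 ints; the next store overwrites the zero lane
    }
}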
Before you do much more, try profiling your application and determine whether this is the best place to spend your time. Then, if this is a hot spot, determine how fast it is, how fast you need it to be, and what you might realistically achieve. Then test the alternatives; the overhead of threading or OpenMP might even slow it down (especially, as you now have noted, if you are on a single core processor, in which case it won't help at all). For single threading, I would look to memcpy as per Sean's answer.
@Sneftel's answer above also covers other options involving SIMD integers.
One option would be to try parallel processing the loop, and see if that helps. You could try using the OpenMP standard (see the Wikipedia article), but you would have to try it for your specific situation and see if it helps. I used this recently on an AI implementation and it helped us a lot.
#pragma omp parallel for
for (...)
{
    ... do work
}
Other than that, you are limited to the compiler's own optimisations.
You could also look at the threading support in C++11, though you might be better off using pre-implemented framework tools like parallel_for (available in the new Windows Concurrency Runtime through the PPL in Visual Studio, if that's what you're using) than rolling your own.
parallel_for(0, max_iterations,
    [...] (int i)
    {
        ... do stuff
    }
);
Inside the for loop, you still have other options. You could try a for loop that visits every element and skips every fourth one (just skip when (i+1) % 4 == 0) instead of doing 3 copies per iteration, or do block memcpy operations for groups of 3 integers as per Sean's answer. You might achieve slightly different compiler optimisations for some of these, but it is unlikely (memcpy is probably as fast as you'll get).
for (int i = 0, j = 0; i < 1000; i++)
{
    if ((i+1) % 4 != 0)
    {
        dest[j] = source[i];
        j++;
    }
}
You should then develop a test rig so you can quickly performance test and decide on the best one for you. Above all, decide how much time is worth spending on this before optimising elsewhere.
You could try memcpy instead of the individual assignments:
memcpy(&dest[count_small], &source[count_large], sizeof(int) * 3);
Is your array size only 1000? If so, how is it slow? It should be done in no time!
As long as you are creating a new array, and for a single-threaded application, this is the only way AFAIK.
However, if the datasets are huge, you could try a multi threaded application.
Also you could explore using a bigger data type to hold the values, so that the array size decreases... that is, if this is viable for your real-life application.
If you have an Nvidia card you can consider using CUDA. If that's not the case you can try other parallel programming methods/environments as well.

C++ Adding 2 arrays together quickly

Given the arrays:
int canvas[10][10];
int addon[10][10];
Where all the values range from 0 - 100, what is the fastest way in C++ to add those two arrays so each cell in canvas equals itself plus the corresponding cell value in addon?
IE, I want to achieve something like:
canvas += another;
So if canvas[0][0] = 3 and addon[0][0] = 2 then canvas[0][0] = 5
Speed is essential here as I am writing a very simple program to brute force a knapsack type problem and there will be tens of millions of combinations.
And as a small extra question (thanks if you can help!) what would be the fastest way of checking if any of the values in canvas exceed 100? Loops are slow!
Here is an SSE4 implementation that should perform pretty well on Nehalem (Core i7):
#include <limits.h>
#include <emmintrin.h>
#include <smmintrin.h>
static inline int canvas_add(int canvas[10][10], int addon[10][10])
{
    __m128i * cp = (__m128i *)&canvas[0][0];
    const __m128i * ap = (__m128i *)&addon[0][0];
    const __m128i vlimit = _mm_set1_epi32(100);
    __m128i vmax = _mm_set1_epi32(INT_MIN);
    __m128i vcmp;
    int cmp;
    int i;

    for (i = 0; i < 10 * 10; i += 4)
    {
        __m128i vc = _mm_loadu_si128(cp);
        __m128i va = _mm_loadu_si128(ap);
        vc = _mm_add_epi32(vc, va);          // add four cells at a time
        vmax = _mm_max_epi32(vmax, vc);      // track running max; SSE4 *
        _mm_storeu_si128(cp, vc);
        cp++;
        ap++;
    }
    vcmp = _mm_cmpgt_epi32(vmax, vlimit);    // SSE2
    cmp = _mm_testz_si128(vcmp, vcmp);       // SSE4 *
    return cmp == 0;                         // 1 if any element exceeds 100
}
Compile with gcc -msse4.1 ... or equivalent for your particular development environment.
For older CPUs without SSE4 (and with much more expensive misaligned loads/stores) you'll need to (a) use a suitable combination of SSE2/SSE3 intrinsics to replace the SSE4.1 operations (marked with an * above) and ideally (b) make sure your data is 16-byte aligned and use aligned loads/stores (_mm_load_si128/_mm_store_si128) in place of _mm_loadu_si128/_mm_storeu_si128.
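As a hedged sketch, those SSE2 replacements might look like this (one possible emulation, not the only one):

#include <emmintrin.h>

// SSE2 substitute for _mm_max_epi32 (SSE4.1): select a where a > b, else b.
static inline __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i gt = _mm_cmpgt_epi32(a, b);            // 0xFFFFFFFF where a > b
    return _mm_or_si128(_mm_and_si128(gt, a),
                        _mm_andnot_si128(gt, b));
}

// SSE2 substitute for _mm_testz_si128 (SSE4.1): returns 1 if v is all zero.
static inline int testz_si128_sse2(__m128i v)
{
    __m128i zero = _mm_setzero_si128();
    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF;
}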
You can't do anything faster than loops in just C++. You would need to use some platform-specific vector instructions. That is, you would need to go down to the assembly language level. However, there are some C++ libraries that try to do this for you, so you can write at a high level and have the library take care of doing the low-level SIMD work that is appropriate for whatever architecture you are targeting with your compiler.
MacSTL is a library that you might want to look at. It was originally a Macintosh specific library, but it is cross platform now. See their home page for more info.
The best you're going to do in standard C or C++ is to recast that as a one-dimensional array of 100 numbers and add them in a loop. (Single subscripts will use a bit less processing than double ones, unless the compiler can optimize it out. The only way you're going to know how much of an effect there is, if there is one, is to test.)
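A minimal sketch of that flattened approach, which also answers the side question by folding the over-100 check into the same pass (the exceeded flag is an assumed convention):

int* c = &canvas[0][0];
const int* a = &addon[0][0];
bool exceeded = false;
for (int i = 0; i < 100; i++)   // the 10x10 arrays viewed as one flat array
{
    c[i] += a[i];
    exceeded = exceeded || (c[i] > 100);
}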
You could certainly create a class where the addition would be one simple C++ instruction (canvas += addon;), but that wouldn't speed anything up. All that would happen is that the simple C++ instruction would expand into the loop above.
You would need to get into lower-level processing in order to speed that up. There are additional instructions on many modern CPUs to do such processing that you might be able to use. You might be able to run something like this on a GPU using something like Cuda. You could try making the operation parallel and running on several cores, but on such a small instance you'll have to know how caching works on your CPU.
The alternatives are to improve your algorithm (on a knapsack-type problem, you might be able to use dynamic programming in some way - without more information from you, we can't tell you), or to accept the performance. Tens of millions of operations on a 10 by 10 array turn into hundreds of billions of operations on numbers, and that's not as intimidating as it used to be. Of course, I don't know your usage scenario or performance requirements.
Two parts: first, consider your two-dimensional array [10][10] as a single array [100]. The layout rules of C++ should allow this. Second, check your compiler for intrinsic functions implementing some form of SIMD instructions, such as Intel's SSE. For example Microsoft supplies a set. I believe SSE has some instructions for checking against a maximum value, and even clamping to the maximum if you want.
Here is an alternative.
If you are 100% certain that all your values are between 0 and 100, you could change your type from int to uint8_t. Then you could add 4 of them together at once using a single uint32_t without worrying about overflow, since each per-byte sum is at most 200 and no carry can spill into the neighbouring byte.
That is ...
uint8_t array1[10][10];
uint8_t array2[10][10];
uint8_t dest[10][10];

// Treat each group of 4 bytes as one 32-bit word; safe only because
// every per-byte sum stays below 256 (no carry between lanes).
uint32_t *pArr1 = (uint32_t *) &array1[0][0];
uint32_t *pArr2 = (uint32_t *) &array2[0][0];
uint32_t *pDest = (uint32_t *) &dest[0][0];
int i;

for (i = 0; i < sizeof (dest) / sizeof (uint32_t); i++) {
    pDest[i] = pArr1[i] + pArr2[i];   // adds 4 packed bytes per iteration
}
It may not be the most elegant, but it could help keep you from going to architecture specific code. Additionally, if you were to do this, I would strongly recommend you comment what you are doing and why.
You should check out CUDA. This kind of problem is right up CUDA's street. Recommend the Programming Massively Parallel Processors book.
However, this does require CUDA capable hardware, and CUDA takes a bit of effort to get setup in your development environment, so it would depend how important this really is!
Good luck!

Efficiency of manually written loops vs operator overloads

In the program I'm working on I have 3-element arrays, which I use as mathematical vectors for all intents and purposes.
Through the course of writing my code, I was tempted to just roll my own Vector class with simple arithmetic overloads (+, -, *, /) so I can simplify statements like:
// old:
for (int i = 0; i < 3; i++)
    r[i] = r1[i] - r2[i];
// new:
r = r1 - r2;
Which should be more or less identical in generated code. But when it comes to more complicated things, could this really impact my performance heavily? One example that I have in my code is this:
Manually written version:
for (int j = 0; j < 3; j++)
{
    p.vel[j] = p.oldVel[j] + (p.oldAcc[j] + p.acc[j]) * dt2 + (p.oldJerk[j] - p.jerk[j]) * dt12;
    p.pos[j] = p.oldPos[j] + (p.oldVel[j] + p.vel[j]) * dt2 + (p.oldAcc[j] - p.acc[j]) * dt12;
}
Using the Vector class with operator overloads:
p.vel = p.oldVel + (p.oldAcc + p.acc) * dt2 + (p.oldJerk - p.jerk) * dt12;
p.pos = p.oldPos + (p.oldVel + p.vel) * dt2 + (p.oldAcc - p.acc) * dt12;
I am attempting to optimize my code for speed, since this sort of code runs inside of inner loops. Will using the overloaded operators for these things affect performance? I'm doing some numerical integration of a system of n mutually gravitating bodies. These vector operations are extremely common so having this run fast is important.
Any insight would be appreciated, as would any idioms or tricks I'm unaware of.
If the operations are inlined and optimised well by your compiler you shouldn't usually see any difference between writing the code well (using operators to make it readable and maintainable) and manually inlining everything.
Manual inlining also considerably increases the risk of bugs because you won't be re-using a single piece of well-tested code, you'll be writing the same code over and over. I would recommend writing the code with operators, and then if you can prove you can speed it up by manually inlining, duplicate the code and manually inline the second version. Then you can run the two variants of the code off against each other to prove (a) that the manual inlining is effective, and (b) that the readable and manually-inlined code both produce the same result.
Before you start manually inlining, though, there's an easy way for you to answer your question for yourself: Write a few simple test cases both ways, then execute a few million iterations and see which approach executes faster. This will teach you a lot about what's going on and give you a definite answer for your particular implementation and compiler that you will never get from the theoretical answers you'll receive here.
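As a hedged illustration, a minimal vector class of the kind discussed (the Vec3 name and member layout are assumptions) is small enough that any decent optimiser will inline the operators down to the same code as the hand-written loop:

struct Vec3 { double x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
inline Vec3 operator*(Vec3 a, double s) { return { a.x * s, a.y * s, a.z * s }; }

// Usage then mirrors the expression from the question:
//   p.vel = p.oldVel + (p.oldAcc + p.acc) * dt2 + (p.oldJerk - p.jerk) * dt12;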
I would look at it the other way around: start with the Vector class, and if you get performance problems with it you can see whether manually inlining the calculations is faster.
Aside from performance, you also mention that the calculations have to be accurate. Having the vector-specific calculations in a class means that it's easier to test them individually, and also that the code using the class gets shorter and easier to maintain.
Check out the ConCRT code samples
http://code.msdn.microsoft.com/concrtextras/Release/ProjectReleases.aspx?ReleaseId=4270
There are a couple (including an N-body sample) which do a number of tricks like this with vector types, templates, etc.