I am learning to program with AVX, so I wrote a simple program that multiplies 4x4 matrices. With no compiler optimizations the AVX version is slightly faster than the non-AVX version, but with -O3 the non-AVX version becomes almost twice as fast as the AVX version. Any tips on how I can improve the performance of the AVX version? The full code follows.
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>
#define MAT_SIZE 4
#define USE_AVX
double A[MAT_SIZE][MAT_SIZE];
double B[MAT_SIZE][MAT_SIZE];
double C[MAT_SIZE][MAT_SIZE];
union {
    double m[4][4];
    __m256d row[4];
} matB;
void init_matrices()
{
    for(int i = 0; i < MAT_SIZE; i++)
        for(int j = 0; j < MAT_SIZE; j++)
        {
            A[i][j] = (float)(i+j);
            B[i][j] = (float)(i+j+1);
            matB.m[i][j] = B[i][j];
        }
}
void print_result()
{
    for(int i = 0; i < MAT_SIZE; i++)
    {
        for(int j = 0; j < MAT_SIZE; j++)
        {
            printf("%.1f\t", C[i][j]);
        }
        printf("\n");
    }
}
void withoutAVX()
{
    for(int row = 0; row < MAT_SIZE; row++)
        for(int col = 0; col < MAT_SIZE; col++)
        {
            float sum = 0;
            for(int e = 0; e < MAT_SIZE; e++)
                sum += A[row][e] * B[e][col];
            C[row][col] = sum;
        }
}
void withAVX()
{
    for(int row = 0; row < 4; row++)
    {
        //calculate_resultant_row(row);
        const double* rowA = (const double*)&A[row];
        __m256d* pr = (__m256d*)(&C[row]);
        *pr = _mm256_mul_pd(_mm256_broadcast_sd(&rowA[0]), matB.row[0]);
        for(int i = 1; i < 4; i++)
            *pr = _mm256_add_pd(*pr, _mm256_mul_pd(_mm256_broadcast_sd(&rowA[i]),
                                                   matB.row[i]));
    }
}
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
int main()
{
    init_matrices();
    // start timer
    unsigned long long cycles = rdtsc();
#ifdef USE_AVX
    withAVX();
#else
    withoutAVX();
#endif
    // stop timer
    cycles = rdtsc() - cycles;
    printf("\nTotal time elapsed : %llu\n\n", cycles);
    print_result();
    return 0;
}
It's hard to be sure without knowing exactly which compiler and system you are using; you would need to check the generated assembly to be certain. Below are some possible reasons.
The compiler probably generated extra loads/stores. These will cost.
The innermost loop broadcasts elements from A, so you have extra loads. Optimal code would need only 8 loads (4 each for A and B) and 4 stores back into C, but your code performs at least 16 extra loads because of the broadcastsd instructions. These will cost you as much as the computation itself, and probably more.
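For illustration only, here is a minimal sketch (not your code, and it assumes AVX2 is available for _mm256_permute4x64_pd) of how the memory broadcasts can be replaced by one load per row of A, with the rows of B already held in registers; whether it actually beats the scalar version still has to be measured on your machine:
void withAVX2(const double A[4][4], const __m256d Brow[4], double C[4][4])
{
    for (int row = 0; row < 4; row++) {
        __m256d a = _mm256_loadu_pd(A[row]);  // one load per row of A
        // splat each lane of the register instead of broadcasting from memory
        __m256d acc = _mm256_mul_pd(_mm256_permute4x64_pd(a, 0x00), Brow[0]);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_permute4x64_pd(a, 0x55), Brow[1]));
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_permute4x64_pd(a, 0xAA), Brow[2]));
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_permute4x64_pd(a, 0xFF), Brow[3]));
        _mm256_storeu_pd(C[row], acc);        // one store per row of C
    }
}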
Edit (too long for comments)
There are situations where the compiler won't be able to do smart optimization, or is sometimes "too clever" for its own good. Recently I even needed to use assembly to avoid a compiler optimization that actually led to bad code! That said, if what you need is performance and you don't really care how you get there, I would suggest you first look for good libraries. For example, Eigen would fit the linear algebra in this example perfectly. If you do want to learn SIMD programming, I suggest you start with simpler cases, such as adding two vectors. Most likely you will find that the compiler generates better vectorized binaries than your first few attempts, but simple cases are more straightforward, so you will see where you need improvement more easily. In the process of trying to produce code as good as or better than what a compiler can generate, you will learn everything you need to write optimal code, and eventually you will be able to provide optimal implementations for code the compiler cannot optimize. One thing to keep in mind is that the lower level you go, the less the compiler can do for you: you have more control over which binaries are generated, but it is also your responsibility to make them optimal. This advice is pretty vague; sorry I cannot be of more help.
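To make the "adding two vectors" suggestion concrete, here is a minimal sketch of my own (assuming n is a multiple of 4 and using plain unaligned loads/stores):
void add_vectors(const double *a, const double *b, double *out, int n)
{
    // process four doubles per iteration with AVX
    for (int i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(out + i, _mm256_add_pd(va, vb));
    }
}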
Related
I am new to C++ and programming, so I think I am writing inefficient code.
I was wondering whether there is any way I can speed up the matrix calculations.
For example, this is sample code I wrote that finds the maximum difference (in absolute value) between the 3D arrays 'V' and 'Vnew'.
First, I take the subtraction.
Then I put the value of tempdiff[0][0][0] into 'dif'.
Then I compare 'dif' with tempdiff[i][j][k] and replace it if the latter is larger.
This is just part of my code; there are lots of matrix calculations, so I end up with too many 'for' statements.
So I was wondering whether there is any way I could avoid using 'for' in the matrix calculations.
Thanks in advance.
for (int i = 0; i < Na; i++) {
    for (int j = 0; j < Nd; j++) {
        for (int k = 0; k < Ny; k++) {
            tempdiff[i][j][k] = abs(V[i][j][k] - Vnew[i][j][k]);
        }
    }
}
dif = tempdiff[0][0][0];
for (int i = 0; i < Na; i++) {
    for (int j = 0; j < Nd; j++) {
        for (int k = 0; k < Ny; k++) {
            if (tempdiff[i][j][k] > dif) {
                dif = tempdiff[i][j][k];
            }
            else {
                dif = dif;
            }
        }
    }
}
There's not much you can do with the for loops, as the maximum difference can be located anywhere; you have already succeeded in iterating the array in the correct, linear order.
Compilers are generally quite efficient at optimising, but they apparently fail to flatten a contiguous array such as float V[Na][Nd][Ny];. After you flatten it manually to float V[Na*Nd*Ny], at least clang can auto-vectorise and produce SIMD code for x64 and arm.
A further optimisation is to avoid doing this in two steps, as the total memory throughput is exactly doubled with the temporary array compared to a one-pass solution.
I was assuming your matrices are of type float; if you can use int instead, gcc can auto-vectorise this as well (the difference relates to NaN handling). Furthermore, int16_t or int8_t types are even quicker to evaluate, as more operations can be packed into a single SIMD instruction.
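For illustration, a minimal sketch of the flattened, one-pass version (the names V, Vnew, Na, Nd, Ny come from the question; the function itself is my own):
#include <algorithm>
#include <cmath>

float max_abs_diff(const float *V, const float *Vnew, int n /* = Na*Nd*Ny */)
{
    float dif = 0.0f;
    for (int i = 0; i < n; i++)
        dif = std::max(dif, std::fabs(V[i] - Vnew[i]));  // no temporary array needed
    return dif;
}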
I have the following code. The bitCount function simply counts the number of set bits in a 64-bit integer. The test function mimics something similar I am doing in a more complicated piece of code; it replicates how writing to a matrix significantly slows down the performance of the for loop, and I am trying to figure out why that is and whether there are any solutions.
#include <cstdint>
#include <vector>
#include <cmath>
#include <omp.h>
// Count the number of bits
inline int bitCount(uint64_t n){
    int count = 0;
    while(n){
        n &= (n-1);
        count++;
    }
    return count;
}
void test(){
    int nthreads = omp_get_max_threads();
    omp_set_dynamic(0);
    omp_set_num_threads(nthreads);
    // I need a priority queue per thread
    std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
    std::vector<uint64_t> vals(100,1);
    #pragma omp parallel for shared(mat,vals)
    for(int i = 0; i < 100000000; i++){
        std::vector<double> &tid_vec = mat[omp_get_thread_num()];
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            tid_vec[j] = total_count; // if I comment out this line, performance increases drastically
        }
    }
}
This code runs in about 11 seconds. If I comment out the following line:
tid_vec[j] = total_count;
the code runs in about 2 seconds. Is there a reason why writing to a matrix in my case costs so much in performance?
Since you said nothing about your compiler/system specs, I'm assuming you are compiling with GCC and the flags -O2 -fopenmp.
If you comment out the line:
tid_vec[j] = total_count;
the compiler will optimize away all the computations whose result is not used. Therefore:
total_count += bitCount(vals[j]);
is optimized away too. If your application's main kernel is removed, it makes sense that the program runs much faster.
On the other hand, I would not implement a bit-count function myself but rather rely on functionality that is already provided. For example, GCC builtin functions include __builtin_popcount (and __builtin_popcountll for 64-bit operands), which does exactly what you are trying to do.
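For reference, a minimal sketch of the builtin-based bit count (GCC/Clang only; the "ll" variant matters because vals holds 64-bit integers):
#include <cstdint>

inline int bitCount(uint64_t n) {
    // compiles to a single POPCNT instruction with -mpopcnt or a suitable -march
    return __builtin_popcountll(n);
}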
As a bonus: it is much better to work on private data than to work on a shared array through different array elements. It improves locality (especially important when memory access is not uniform, a.k.a. NUMA) and may reduce access contention.
#pragma omp parallel shared(mat,vals)
{
    std::vector<double> local_vec(1000,-INFINITY);
    #pragma omp for
    for(int i = 0; i < 100000000; i++) {
        int total_count = 0;
        for(unsigned int j = 0; j < vals.size(); j++){
            total_count += bitCount(vals[j]);
            local_vec[j] = total_count;
        }
    }
    // Copy local_vec to mat[omp_get_thread_num()]
}
I have a collection of, say, 100 double values that have to be divided by a fixed int value MANY times:
unsigned int current_interval = double_value / int_value;
I need to know whether the following simple cache would be cheaper than the calculation or not (and why):
std::map<double, unsigned int> cache; // key: double_value, value: current_interval
//...
unsigned int get_interval(double double_value, int int_value){
    std::map<double, unsigned int>::iterator it;
    if((it = cache.find(double_value)) != cache.end())
    {
        return it->second;
    }
    unsigned int current_interval = double_value / int_value;
    cache[double_value] = current_interval;
    return current_interval;
}
thank you
Summary: Doing the repeated division is probably faster than looking up the value in a map.
Details:
I thought this was an interesting question, at least from the perspective of the execution time of a map lookup compared to floating-point division. Before addressing that, though, I want to reiterate two of the comments:
1) If you really only have 100 doubles that you are dividing by a fixed value and then using many times, I expect you should be able to transform your algorithm to use the result directly (see the sketch after these points). I expect that would be more efficient than the proposed caching algorithm.
2) Don't use a double as a key in a map.
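A minimal sketch of suggestion 1), with assumed names (values, divisor) that are not from the question: do each division once up front and reuse the results.
#include <vector>

std::vector<unsigned int> precompute_intervals(const std::vector<double> &values, int divisor)
{
    std::vector<unsigned int> intervals;
    intervals.reserve(values.size());
    for (double v : values)
        intervals.push_back(static_cast<unsigned int>(v / divisor));
    return intervals; // index into this instead of dividing (or searching a map) again
}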
Now on to the main question. To answer it I wrote two small programs. The first simply looks up values in a map of size 100. The second performs floating-point division. I've included the full source code in case anyone wants to replicate my results. For the second program I've also included some extra code simply to keep the same structure, but what matters is the second loop.
map.cpp
#include <map>
#include <stdlib.h>
std::map<int,int> testmap;
int main(int argc, char **argv) {
    int count = atoi(argv[1]);
    int val = atoi(argv[2]);
    int total = 0;
    for(int i = 0; i < 100; i++) {
        testmap[i] = val + i;
    }
    for(int i = 0; i < count; i++) {
        for(int j = 0; j < 100; j++) {
            total += testmap[j];
        }
    }
    return total;
}
double.cpp
#include <map>
#include <stdlib.h>
std::map<int,int> testmap;
int main(int argc, char **argv) {
    int count = atoi(argv[1]);
    double val = atof(argv[2]);
    double total = 0;
    for(int i = 0; i < 100; i++) {
        testmap[i] = val + i;
    }
    for(int i = 0; i < count; i++) {
        for(int j = 0; j < 100; j++) {
            total += val / j;
        }
    }
    return total;
}
I compiled with both -O1 and -O3 to make sure the compiler wasn't optimizing away the loop. I also tested with a few different iteration counts to make sure the execution time scaled with the number of iterations.
I ran my tests on an Intel(R) Xeon(R) CPU E3-1275 v3 @ 3.50GHz system, compiling with g++ (GCC) 4.9.2.
For -O1, my results with 10000000 iterations are:
map: 7.1 seconds
double: 3.6 seconds
For -O3, my results are:
map: 5.3 seconds
double: 3.5 seconds
So the difference is not that large, but the division implementation is clearly faster in the small micro-benchmark I wrote here, and it is also the simpler one to implement. I think it is highly unlikely that memoizing the results of the division and looking them up in a map will be faster than simply calculating the values when you need them. For memoization to be worthwhile, the basic operation would need to be more expensive than floating-point division, which is heavily optimized in modern CPUs.
Is there any way to make C++ code run faster? I'm trying to optimize the slowest parts of my code, such as this:
void removeTrail(char floor[][SIZEX], int trail[][SIZEX])
{
    for (int y = 1; y < SIZEY-1; y++)
        for (int x = 1; x < SIZEX-1; x++)
        {
            if (trail[y][x] <= SLIMELIFE && trail[y][x] > 0)
            {
                trail[y][x]--;
                if (trail[y][x] == 0)
                    floor[y][x] = NONE;
            }
        }
}
Most of the guides I have found online are for more complex C++.
It really depends on what kind of optimization you are seeking. It seems to me that you are talking about a more "low-level" optimization, which can be achieved, in combination with compiler flags, by techniques such as changing the order of the nested loops, changing where you place your if statements, deciding between recursive vs. iterative approaches, etc.
However, the most effective optimizations are those targeted at the algorithms, which means you are changing the complexity of your routines and thus often cutting execution time by orders of magnitude. This would be the case, for example, when you decide to implement Quicksort instead of Selection Sort. An improvement from an O(n^2) to an O(n lg n) algorithm will hardly be beaten by any micro-optimization.
In this particular case, I see that you are trying to "remove" elements from the matrix when they reach a certain value. Depending on how those values change, simply tracking when they reach that value and adding them to a removal queue right there, instead of always scanning the whole matrix, might do it:
trail[y][x]--; // In some part of your code, this happens
if (trail[y][x] == 0) { // add for removal
    removalQueueY[yQueueTail++] = y;
    removalQueueX[xQueueTail++] = x;
}
// Then, instead of checking for removal as you currently do:
while (yQueueHead < yQueueTail) {
    // Remove the current element and advance the heads
    floor[removalQueueY[yQueueHead]][removalQueueX[xQueueHead]] = NONE;
    yQueueHead++, xQueueHead++;
}
Depending on how those values change (if it is not a simple trail[y][x]--), another data structure might prove more useful. You could try a heap, for example, or an std::set or std::priority_queue, among other possibilities. It all comes down to which operations your algorithm must support, and which data structures let you execute those operations as efficiently as possible (considering both memory and execution time, depending on your priorities and needs).
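As a self-contained variant of the queue idea above, here is a minimal sketch using a standard container (SIZEX, NONE, trail and floor are the question's names; everything else is assumed):
#include <utility>
#include <vector>

std::vector<std::pair<int, int>> removalQueue; // (y, x) cells that just reached 0

// Wherever the trail value is decremented in your code:
//   if (--trail[y][x] == 0) removalQueue.push_back({y, x});

void flushRemovals(char floor[][SIZEX], std::vector<std::pair<int, int>> &queue)
{
    for (const auto &cell : queue)
        floor[cell.first][cell.second] = NONE; // only touch cells that need it
    queue.clear();
}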
The first thing to do is to turn on compiler optimization. The most powerful optimization I know of is profile-guided optimization. For gcc:
1) g++ -fprofile-generate .... -o my_program
2) run my_program (typical load)
3) g++ -fprofile-use -O3 ... -o optimized_program
With a profile, -O3 does make sense.
The next thing is to perform algorithmic optimization, like in Renato_Ferreira's answer. If that doesn't work for your situation, you can improve your performance by a factor of 2-8 using vectorization. Your code looks vectorizable:
#include <cassert>
#include <cstdint>
#include <emmintrin.h>
#include <iostream>
#define SIZEX 100 // SIZEX % 4 == 0
#define SIZEY 100
#define SLIMELIFE 100
#define NONE 0xFF
void removeTrail(char floor[][SIZEX], int trail[][SIZEX]) {
    // check that trail is 16-byte aligned
    assert((((size_t)(&trail[0][0])) & (size_t)0xF) == 0);
    static const int lower_a[] = {0, 0, 0, 0};
    static const int sub_a[]   = {1, 1, 1, 1};
    static const int floor_a[] = {1, 1, 1, 1}; // will underflow after decrement
    static const int upper_a[] = {SLIMELIFE, SLIMELIFE, SLIMELIFE, SLIMELIFE};
    __m128i lower_v = *(__m128i*) lower_a;
    __m128i upper_v = *(__m128i*) upper_a;
    __m128i sub_v   = *(__m128i*) sub_a;
    __m128i floor_v = *(__m128i*) floor_a;
    for (int i = 0; i < SIZEY; i++) {
        for (int j = 0; j < SIZEX; j += 4) { // only for SIZEX % 4 == 0
            __m128i x = *(__m128i*)(&trail[i][j]);
            __m128i floor_mask = _mm_cmpeq_epi32(x, floor_v);     // 32-bit
            floor_mask = _mm_packs_epi32(floor_mask, floor_mask); // now 16-bit
            floor_mask = _mm_packs_epi16(floor_mask, floor_mask); // now 8-bit
            int32_t fl_mask[4];
            *(__m128i*)fl_mask = floor_mask;
            *(int32_t*)(&floor[i][j]) |= fl_mask[0];
            __m128i less_mask  = _mm_cmplt_epi32(lower_v, x);
            __m128i upper_mask = _mm_cmplt_epi32(x, upper_v);
            __m128i mask = less_mask & upper_mask;
            *(__m128i*)(&trail[i][j]) = _mm_sub_epi32(x, mask & sub_v);
        }
    }
}
int main()
{
    int T[SIZEY][SIZEX];
    char F[SIZEY][SIZEX];
    for (int i = 0; i < SIZEY; i++) {
        for (int j = 0; j < SIZEX; j++) {
            F[i][j] = 0x0;
            T[i][j] = j - 10;
        }
    }
    removeTrail(F, T);
    for (int j = 0; j < SIZEX; j++) {
        std::cout << (int) F[2][j] << " " << T[2][j] << '\n';
    }
    return 0;
}
It looks like it does what it is supposed to do: no ifs, and four values processed per iteration. It only works for NONE = 0xFF; it could be done for another value, but that is more difficult.
This question already has answers here: Why are elementwise additions much faster in separate loops than in a combined loop? (10 answers) and Performance of breaking apart one loop into two loops (6 answers). Closed 9 years ago.
What is the overhead in splitting a for-loop like this,
int i;
for (i = 0; i < exchanges; i++)
{
    // some code
    // some more code
    // even more code
}
into multiple for-loops like this?
int i;
for (i = 0; i < exchanges; i++)
{
    // some code
}
for (i = 0; i < exchanges; i++)
{
    // some more code
}
for (i = 0; i < exchanges; i++)
{
    // even more code
}
The code is performance-sensitive, but doing the latter would improve readability significantly. (In case it matters, there are no other loops, variable declarations, or function calls, save for a few accessors, within each loop.)
I'm not exactly a low-level programming guru, so it'd be even better if someone could measure the performance hit in terms of basic operations, e.g. "Each additional for-loop would cost the equivalent of two int allocations." But I understand (and wouldn't be surprised) if it's not that simple.
Many thanks in advance.
There are often way too many factors at play... And it's easy to demonstrate both ways:
For example, splitting the following loop results in almost a 2x slow-down (full test code at the bottom):
for (int c = 0; c < size; c++){
    data[c] *= 10;
    data[c] += 7;
    data[c] &= 15;
}
And this is almost stating the obvious since you need to loop through 3 times instead of once and you make 3 passes over the entire array instead of 1.
On the other hand, if you take a look at this question: Why are elementwise additions much faster in separate loops than in a combined loop?
for(int j=0;j<n;j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}
The opposite is sometimes true due to memory alignment.
What to take from this?
Pretty much anything can happen. Neither way is always faster and it depends heavily on what's inside the loops.
And as such, determining whether such an optimization will increase performance is usually trial-and-error. With enough experience you can make fairly confident (educated) guesses. But in general, expect anything.
"Each additional for-loop would cost the equivalent of two int allocations."
You are correct that it's not that simple. In fact it's so complicated that the numbers don't mean much. A loop iteration may take X cycles in one context, but Y cycles in another due to a multitude of factors such as Out-of-order Execution and data dependencies.
Not only is the performance context-dependent, but it also varies across different processors.
Here's the test code:
#include <time.h>
#include <iostream>
using namespace std;
int main(){
    int size = 10000;
    int *data = new int[size];
    clock_t start = clock();
    for (int i = 0; i < 1000000; i++){
#ifdef TOGETHER
        for (int c = 0; c < size; c++){
            data[c] *= 10;
            data[c] += 7;
            data[c] &= 15;
        }
#else
        for (int c = 0; c < size; c++){
            data[c] *= 10;
        }
        for (int c = 0; c < size; c++){
            data[c] += 7;
        }
        for (int c = 0; c < size; c++){
            data[c] &= 15;
        }
#endif
    }
    clock_t end = clock();
    cout << (double)(end - start) / CLOCKS_PER_SEC << endl;
    system("pause");
}
Output (one loop): 4.08 seconds
Output (3 loops): 7.17 seconds
Processors prefer to have a higher ratio of data instructions to jump instructions.
Branch instructions may force your processor to clear the instruction pipeline and reload.
Based on the reloading of the instruction pipeline, the first method would be faster, but not significantly. You would add at least 2 new branch instructions by splitting.
A faster optimization is to unroll the loop. Unrolling the loop tries to improve the ratio of data instructions to branch instructions by performing more instructions inside the loop before branching to the top of the loop.
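For example, a minimal sketch of unrolling the question's loop by four (assuming exchanges is a multiple of 4; otherwise a remainder loop is needed):
for (i = 0; i < exchanges; i += 4)
{
    // some code for iteration i
    // some code for iteration i + 1
    // some code for iteration i + 2
    // some code for iteration i + 3
}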
Another significant performance optimization is to organize the data so it fits into the processor's cache. For example, you could have inner loops that each process a single cache-sized chunk of data, while the outer loop loads new items into the cache.
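For example, a minimal sketch of such blocking, where BLOCK is an assumed tile size tuned to the cache and the loop bodies are the question's placeholders:
const int BLOCK = 4096; // elements per tile; tune to your cache size
for (int base = 0; base < exchanges; base += BLOCK)
{
    int end = (base + BLOCK < exchanges) ? base + BLOCK : exchanges;
    for (int i = base; i < end; i++) { /* some code */ }
    for (int i = base; i < end; i++) { /* some more code */ }
    for (int i = base; i < end; i++) { /* even more code */ }
}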
These optimizations should only be applied once the program runs correctly and robustly and the environment demands more performance. Here "environment" means observers (animation/movies), users (waiting for a response), or hardware (performing operations before a critical time event). Optimizing for any other reason is a waste of your time, as the OS (running concurrent programs) and storage access will contribute more to your program's performance issues.
This will give you a good indication of whether or not one version is faster than another.
#include <array>
#include <chrono>
#include <iostream>
#include <numeric>
#include <string>
const int iterations = 100;
namespace
{
    const int exchanges = 200;
    template<typename TTest>
    void Test(const std::string &name, TTest &&test)
    {
        typedef std::chrono::high_resolution_clock Clock;
        typedef std::chrono::duration<float, std::milli> ms;
        std::array<float, iterations> timings;
        for (auto i = 0; i != iterations; ++i)
        {
            auto t0 = Clock::now();
            test();
            timings[i] = ms(Clock::now() - t0).count();
        }
        // use a float initial value so the average isn't truncated to an int
        auto avg = std::accumulate(timings.begin(), timings.end(), 0.0f) / iterations;
        std::cout << "Average time, " << name << ": " << avg << std::endl;
    }
}
int main()
{
    Test("single loop",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
                // some more code
                // even more code
            }
        });
    Test("separated loops",
        []()
        {
            for (auto i = 0; i < exchanges; ++i)
            {
                // some code
            }
            for (auto i = 0; i < exchanges; ++i)
            {
                // some more code
            }
            for (auto i = 0; i < exchanges; ++i)
            {
                // even more code
            }
        });
}
The idea is quite simple: the first version is like running a single lap on a race track, while the other is like running a full 3-lap race, so three laps take more time than one. However, if the loops are doing work that must happen in sequence and they depend on each other, then the second version is what you need. For example, if the first loop performs some calculations and the second loop uses those results, then the loops have to run in sequence; otherwise they don't.