SSE slower than standard logic [duplicate]

This question already has answers here:
SSE intrinsics without compiler optimization
(2 answers)
Idiomatic way of performance evaluation?
(1 answer)
Adding two arrays element by element, with standard scalar logic and with SSE:
int count = std::pow(2, 20);
alignas(16) float* fm1 = new float[count];
alignas(16) float* fm2 = new float[count];
alignas(16) float* res = new float[count];
for (int i = 0; i < count; ++i) {
    fm1[i] = static_cast<float>(i);
    fm2[i] = static_cast<float>(i);
}
{
    auto start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < 1000; ++j) {
        for (int i = 0; i < count; ++i) {
            res[i] = fm1[i] + fm2[i];
        }
    }
    auto diff = std::chrono::high_resolution_clock::now() - start;
    std::cout << "execute time duration = " << std::chrono::duration<double, std::milli>(diff).count() << " milliseconds" << std::endl;
}
{
    assert(count % 4 == 0);
    auto start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < 1000; ++j) {
        for (int i = 0; i < count; i += 4) {
            __m128 a = _mm_load_ps(&fm1[i]);
            __m128 b = _mm_load_ps(&fm2[i]);
            __m128 r = _mm_add_ps(a, b);
            _mm_store_ps(&res[i], r);
        }
    }
    auto diff = std::chrono::high_resolution_clock::now() - start;
    std::cout << "execute time duration = " << std::chrono::duration<double, std::milli>(diff).count() << " milliseconds" << std::endl;
}
Result:
execute time duration = 1692.19 milliseconds
execute time duration = 2339.49 milliseconds
Laptop configuration:
11th Gen Intel(R) Core(TM) i7-11370H @ 3.30 GHz
16.0 GB RAM
I expected the SSE version to be at least 3x faster, but it is slower.
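Since the question was closed as a duplicate of "SSE intrinsics without compiler optimization", the likely cause is benchmarking an unoptimized build: at -O0 every intrinsic result is spilled to memory, so the intrinsic loop can easily lose to the scalar one. Note also that alignas(16) on a pointer variable aligns the pointer object itself, not the buffer that new[] returns. A minimal sketch of a safer setup, assuming C++17 and a POSIX toolchain (std::aligned_alloc; on MSVC, _aligned_malloc would be the equivalent), compiled with optimization, e.g. g++ -O2:
#include <cstddef>
#include <cstdlib>   // std::aligned_alloc, std::free

int main() {
    const std::size_t count = std::size_t{1} << 20;   // 2^20, no floating-point pow needed
    // aligned_alloc aligns the allocation itself; its size argument must be
    // a multiple of the alignment (count * sizeof(float) is here).
    float* fm1 = static_cast<float*>(std::aligned_alloc(16, count * sizeof(float)));
    float* fm2 = static_cast<float*>(std::aligned_alloc(16, count * sizeof(float)));
    float* res = static_cast<float*>(std::aligned_alloc(16, count * sizeof(float)));
    // ... fill the arrays and run the timed loops from the question ...
    std::free(res);
    std::free(fm2);
    std::free(fm1);
}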

Related

Parallel execution taking more time than serial

I am basically writing code to count how many pair sums are even (among all pairs from 1 to 100000). I wrote the code both with and without pthreads, but the version with pthreads takes more time than the serial one. Here is my serial code:
#include<bits/stdc++.h>
using namespace std;
int main()
{
    long long sum = 0, count = 0, n = 100000;
    auto start = chrono::high_resolution_clock::now();
    for(int i = 1; i <= n; i++)
        for(int j = i-1; j >= 0; j--)
        {
            sum = i + j;
            if(sum%2 == 0)
                count++;
        }
    cout<<"count is "<<count<<endl;
    auto end = chrono::high_resolution_clock::now();
    double time_taken = chrono::duration_cast<chrono::nanoseconds>(end - start).count();
    time_taken *= 1e-9;
    cout << "Time taken by program is : " << fixed << time_taken << setprecision(9) << " secs" << endl;
    return 0;
}
and here is my parallel code
#include<bits/stdc++.h>
#include<pthread.h>
using namespace std;
#define MAX_THREAD 3
long long cnt[5] = {0};
long long n = 100000;
int work_per_thread;
int start[] = {1, 60001, 83001, 100001};
void *count_array(void* arg)
{
    int t = *((int*)arg);
    long long sum = 0;
    for(int i = start[t]; i < start[t+1]; i++)
        for(int j = i-1; j >= 0; j--)
        {
            sum = i + j;
            if(sum%2 == 0)
                cnt[t]++;
        }
    cout<<"thread"<<t<<" finished work "<<cnt[t]<<endl;
    return NULL;
}
int main()
{
    pthread_t threads[MAX_THREAD];
    int arr[] = {0,1,2};
    long long total_count = 0;
    work_per_thread = n/MAX_THREAD;
    auto start = chrono::high_resolution_clock::now();
    for(int i = 0; i < MAX_THREAD; i++)
        pthread_create(&threads[i], NULL, count_array, &arr[i]);
    for(int i = 0; i < MAX_THREAD; i++)
        pthread_join(threads[i], NULL);
    for(int i = 0; i < MAX_THREAD; i++)
        total_count += cnt[i];
    cout << "count is " << total_count << endl;
    auto end = chrono::high_resolution_clock::now();
    double time_taken = chrono::duration_cast<chrono::nanoseconds>(end - start).count();
    time_taken *= 1e-9;
    cout << "Time taken by program is : " << fixed << time_taken << setprecision(9) << " secs" << endl;
    return 0;
}
In the parallel code I create three threads: the 1st thread does its computation from 1 to 60000, the 2nd from 60001 to 83000, and so on. I chose these numbers so that each thread does approximately the same number of computations. The parallel execution takes 10.3 secs whereas the serial one takes 7.7 secs. I have 6 cores with 2 threads per core. I also used the htop command to check whether the required number of threads are running, and that seems fine. I don't understand where the problem is.
All the threads in the parallel version compete for cnt[]: the counters sit next to each other in memory, so the cores keep invalidating each other's cache line (false sharing). Use a local counter inside the loop and copy the result into cnt[t] after the loop finishes, as sketched below.
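A minimal sketch of that fix, changing only the worker function from the question:
void *count_array(void* arg)
{
    int t = *((int*)arg);
    long long local = 0;                      // thread-private: no shared writes in the hot loop
    for(int i = start[t]; i < start[t+1]; i++)
        for(int j = i-1; j >= 0; j--)
        {
            if((i + j) % 2 == 0)
                local++;
        }
    cnt[t] = local;                           // publish once, after the loop finishes
    cout<<"thread"<<t<<" finished work "<<cnt[t]<<endl;
    return NULL;
}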

Why is C++ forEach slower than a naive single-thread loop?

My code is like this:
auto t1 = std::chrono::steady_clock::now();
for (int t{0}; t < 100; ++t) {
    vector<int> table(256, 0);
    Mat im2 = cv::imread(impth, cv::ImreadModes::IMREAD_COLOR);
    im2.forEach<cv::Vec3b>([&table](cv::Vec3b &pix, const int* pos) {
        for (int i{0}; i < 3; ++i) ++table[pix[i]];
    });
}
auto t2 = std::chrono::steady_clock::now();
cout << "time is: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << endl;
auto t3 = std::chrono::steady_clock::now();
for (int t{0}; t < 100; ++t) {
    vector<int> table(256, 0);
    Mat im2 = cv::imread(impth, cv::ImreadModes::IMREAD_COLOR);
    for (int r{0}; r < im2.rows; ++r) {
        auto ptr = im2.ptr<uint8_t>(r);
        for (int c{0}; c < im2.cols; ++c) {
            for (int i{0}; i < 3; ++i) ++table[ptr[i]];
            ptr += 3;
        }
    }
}
auto t4 = std::chrono::steady_clock::now();
cout << "time is: " << std::chrono::duration_cast<std::chrono::milliseconds>(t4 - t3).count() << endl;
Intuitively, I felt that forEach should be faster since it uses a multi-threading mechanism to do the work, but it turns out that the forEach method took 14759 ms while the naive loop method took only 6791 ms. What is the cause of the slower forEach method, and how could I make it faster?
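No answer is quoted here, but a plausible cause (an assumption, not confirmed in this thread) is that forEach runs the lambda on several threads while all of them increment the same shared table: that is a data race, and the contention on those cache lines makes it slow. A sketch of a per-thread histogram that is merged after the parallel pass, using plain std::thread over row bands instead of forEach:
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>
#include <opencv2/opencv.hpp>

// Each thread fills its own 256-bin table; the tables are summed afterwards,
// so no two threads ever write to the same counter.
std::vector<int> histogram(const cv::Mat& im, int nthreads) {
    std::vector<std::vector<int>> local(nthreads, std::vector<int>(256, 0));
    std::vector<std::thread> workers;
    const int band = (im.rows + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            auto& tab = local[t];
            const int rend = std::min(im.rows, (t + 1) * band);
            for (int r = t * band; r < rend; ++r) {
                const uint8_t* ptr = im.ptr<uint8_t>(r);
                for (int c = 0; c < im.cols; ++c, ptr += 3)
                    for (int i = 0; i < 3; ++i) ++tab[ptr[i]];
            }
        });
    }
    for (auto& w : workers) w.join();
    std::vector<int> table(256, 0);
    for (const auto& tab : local)
        for (int v = 0; v < 256; ++v) table[v] += tab[v];
    return table;
}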

Measuring time with chrono changes after printing

I want to measure the execution time of a program in ns in C++. For that purpose I am using the chrono library.
#include <chrono>
#include <iostream>

int main() {
    const int ROWS = 200;
    const int COLS = 200;
    double input[ROWS][COLS];
    int i, j;
    auto start = std::chrono::steady_clock::now();
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++)
            input[i][j] = i + j;
    }
    auto end = std::chrono::steady_clock::now();
    auto res = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Elapsed time in nanoseconds : "
              << res
              << " ns" << std::endl;
    return 0;
}
I measured the time and it executed in 90 ns. However, when I add printing afterwards, the time changes.
#include <chrono>
#include <iostream>

int main() {
    const int ROWS = 200;
    const int COLS = 200;
    double input[ROWS][COLS];
    int i, j;
    auto start = std::chrono::steady_clock::now();
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++)
            input[i][j] = i + j;
    }
    auto end = std::chrono::steady_clock::now();
    auto res = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Elapsed time in nanoseconds : "
              << res
              << " ns" << std::endl;
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++)
            std::cout << input[i][j];
    }
    return 0;
}
The time changes to 89700 ns. What could be the problem? I only want to measure the execution time of the for loops.
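A likely explanation (an assumption; no answer is quoted here): without any later use of input, the optimizer may delete the fill loop as dead code, so the 90 ns is just the cost of the two clock reads; the printing loop makes the array observable and forces the work to actually happen. To keep the measurement honest without timing the printing, consume the array after the timed region, for example:
auto start = std::chrono::steady_clock::now();
for (i = 0; i < ROWS; i++) {
    for (j = 0; j < COLS; j++)
        input[i][j] = i + j;
}
auto end = std::chrono::steady_clock::now();

// Reading the array after timing keeps the fill loop alive for the
// optimizer, but its cost stays out of the measured interval.
double checksum = 0;
for (i = 0; i < ROWS; i++)
    for (j = 0; j < COLS; j++)
        checksum += input[i][j];
std::cout << "checksum = " << checksum << "\n";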

Determining CPU time required to execute loop

I've done some SO searching and found this and that outlining timing methods.
My problem is that I need to determine the CPU time (in milliseconds) required to execute the following loop:
for (int i = 0, temp = 0; i < 10000; i++)
{
    if (i % 2 == 0)
    {
        temp = (i / 2) + 1;
    }
    else
    {
        temp = 2 * i;
    }
}
I've looked at two methods: clock() and steady_clock::now(). Per the docs, I know that clock() returns "ticks", so I can get the time in seconds by dividing the difference by CLOCKS_PER_SEC. The docs also mention that steady_clock is designed for interval timing, but you have to call duration_cast<milliseconds> to convert its unit.
What I've done to time the two (since running both in the same program might let one take longer because the other ran first) is run each of them by itself:
clock_t t = clock();
for (int i = 0, temp = 0; i < 10000; i++)
{
    if (i % 2 == 0)
    {
        temp = (i / 2) + 1;
    }
    else
    {
        temp = 2 * i;
    }
}
t = clock() - t;
cout << (float(t) / CLOCKS_PER_SEC) * 1000 << "ms taken" << endl;
chrono::steady_clock::time_point p1 = chrono::steady_clock::now();
for (int i = 0, temp = 0; i < 10000; i++)
{
    if (i % 2 == 0)
    {
        temp = (i / 2) + 1;
    }
    else
    {
        temp = 2 * i;
    }
}
chrono::steady_clock::time_point p2 = chrono::steady_clock::now();
cout << chrono::duration_cast<chrono::milliseconds>(p2 - p1).count() << "ms taken" << endl;
Output:
0ms taken
0ms taken
Do both these methods floor the result? Surely some fraction of a millisecond elapsed?
So which is ideal (or rather, more appropriate) for determining the CPU time required to execute the loop? At first glance I would argue for clock(), since the docs specifically tell me that it's for determining CPU time.
For context, my CLOCKS_PER_SEC holds a value of 1000.
Edit/Update:
Tried the following:
clock_t t = clock();
for (int j = 0; j < 1000000; j++) {
    volatile int temp = 0;
    for (int i = 0; i < 10000; i++)
    {
        if (i % 2 == 0)
        {
            temp = (i / 2) + 1;
        }
        else
        {
            temp = 2 * i;
        }
    }
}
t = clock() - t;
cout << (float(t) * 1000.0f / CLOCKS_PER_SEC / 1000000.0f) << "ms taken" << endl;
Outputs: 0.019953ms taken
clock_t start = clock();
volatile int temp = 0;
for (int i = 0; i < 10000; i++)
{
    if (i % 2 == 0)
    {
        temp = (i / 2) + 1;
    }
    else
    {
        temp = 2 * i;
    }
}
clock_t end = clock();
cout << fixed << setprecision(2) << 1000.0 * (end - start) / CLOCKS_PER_SEC << "ms taken" << endl;
Outputs: 0.00ms taken
chrono::high_resolution_clock::time_point p1 = chrono::high_resolution_clock::now();
volatile int temp = 0;
for (int i = 0; i < 10000; i++)
{
    if (i % 2 == 0)
    {
        temp = (i / 2) + 1;
    }
    else
    {
        temp = 2 * i;
    }
}
chrono::high_resolution_clock::time_point p2 = chrono::high_resolution_clock::now();
cout << (chrono::duration_cast<chrono::microseconds>(p2 - p1).count()) / 1000.0 << "ms taken" << endl;
Outputs: 0.072ms taken
chrono::steady_clock::time_point p1 = chrono::steady_clock::now();
volatile int temp = 0;
for (int i = 0; i < 10000; i++)
{
    if (i % 2 == 0)
    {
        temp = (i / 2) + 1;
    }
    else
    {
        temp = 2 * i;
    }
}
chrono::steady_clock::time_point p2 = chrono::steady_clock::now();
cout << (chrono::duration_cast<chrono::microseconds>(p2 - p1).count()) / 1000.0f << "ms taken" << endl;
Outputs: 0.044ms
So the question becomes: which is valid? The second method seems invalid to me, because I think the loop completes in less than a millisecond.
I understand the first method (simply make it execute longer), but the last two methods produce drastically different results.
One thing I've noticed is that after compiling the program, the first time I run it I may get 0.073ms (for the high_resolution_clock) and 0.044ms (for the steady_clock), but all subsequent runs are within the range of 0.019 - 0.025ms.
You can run the loop a million times and divide. You can also add the volatile keyword to avoid some compiler optimizations:
clock_t t = clock();
for (int j = 0; j < 1000000; j++) {
    volatile int temp = 0;
    for (int i = 0; i < 10000; i++)
    {
        if (i % 2 == 0)
        {
            temp = (i / 2) + 1;
        }
        else
        {
            temp = 2 * i;
        }
    }
}
t = clock() - t;
cout << (float(t) * 1000.0f / CLOCKS_PER_SEC / 1000000.0f) << "ms taken" << endl;
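As to the flooring question: yes, duration_cast truncates toward zero, so any interval shorter than 1 ms prints as 0 ms. Casting to a finer unit, or to a floating-point duration, keeps the sub-millisecond part; a sketch in the question's own style:
chrono::steady_clock::time_point p1 = chrono::steady_clock::now();
// ... loop under test ...
chrono::steady_clock::time_point p2 = chrono::steady_clock::now();
// integer nanoseconds instead of truncated milliseconds
cout << chrono::duration_cast<chrono::nanoseconds>(p2 - p1).count() << "ns taken" << endl;
// or a double-valued millisecond count, which does not truncate
cout << chrono::duration<double, milli>(p2 - p1).count() << "ms taken" << endl;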
Well, using GetTickCount() seems to be a solution, I hope:
double start_s = GetTickCount();
for (int i = 0, temp = 0; i < 10000000; i++)
{
    if (i % 2 == 0)
    {
        temp = (i / 2) + 1;
    }
    else
    {
        temp = 2 * i;
    }
}
double stop_s = GetTickCount();
// Note: GetTickCount() already returns milliseconds, so dividing by
// CLOCKS_PER_SEC and multiplying by 1000 only works out because
// CLOCKS_PER_SEC happens to be 1000 here.
cout << (stop_s - start_s) / double(CLOCKS_PER_SEC) * 1000 << "ms taken" << endl;
For me it returns between 16 and 31 ms, which is about GetTickCount()'s typical 10-16 ms resolution.

AVX2 slower than SSE on Haswell

I have the following code (normal, SSE and AVX):
int testSSE(const aligned_vector & ghs, const aligned_vector & lhs) {
    int result[4] __attribute__((aligned(16))) = {0};
    __m128i vresult = _mm_set1_epi32(0);
    __m128i v1, v2, vmax;
    for (int k = 0; k < ghs.size(); k += 4) {
        v1 = _mm_load_si128((__m128i *) &lhs[k]);
        v2 = _mm_load_si128((__m128i *) &ghs[k]);
        vmax = _mm_add_epi32(v1, v2);
        vresult = _mm_max_epi32(vresult, vmax);
    }
    _mm_store_si128((__m128i *) result, vresult);
    int mymax = result[0];
    for (int k = 1; k < 4; k++) {
        if (result[k] > mymax) {
            mymax = result[k];
        }
    }
    return mymax;
}
int testAVX(const aligned_vector & ghs, const aligned_vector & lhs) {
    int result[8] __attribute__((aligned(32))) = {0};
    __m256i vresult = _mm256_set1_epi32(0);
    __m256i v1, v2, vmax;
    for (int k = 0; k < ghs.size(); k += 8) {
        v1 = _mm256_load_si256((__m256i *) &ghs[k]);
        v2 = _mm256_load_si256((__m256i *) &lhs[k]);
        vmax = _mm256_add_epi32(v1, v2);
        vresult = _mm256_max_epi32(vresult, vmax);
    }
    _mm256_store_si256((__m256i *) result, vresult);
    int mymax = result[0];
    for (int k = 1; k < 8; k++) {
        if (result[k] > mymax) {
            mymax = result[k];
        }
    }
    return mymax;
}
int testNormal(const aligned_vector & ghs, const aligned_vector & lhs) {
    int max = 0;
    int tempMax;
    for (int k = 0; k < ghs.size(); k++) {
        tempMax = lhs[k] + ghs[k];
        if (max < tempMax) {
            max = tempMax;
        }
    }
    return max;
}
All these functions are tested with the following code:
void alignTestSSE() {
    aligned_vector lhs;
    aligned_vector ghs;
    int mySize = 4096;
    int FinalResult;
    int nofTestCases = 1000;
    double time, time1, time2, time3;
    vector<int> lhs2;
    vector<int> ghs2;
    lhs.resize(mySize);
    ghs.resize(mySize);
    lhs2.resize(mySize);
    ghs2.resize(mySize);
    srand(1);
    for (int k = 0; k < mySize; k++) {
        lhs[k] = randomNodeID(1000000);
        lhs2[k] = lhs[k];
        ghs[k] = randomNodeID(1000000);
        ghs2[k] = ghs[k];
    }
    /* Warming UP */
    for (int k = 0; k < nofTestCases; k++) {
        FinalResult = testNormal(lhs, ghs);
    }
    for (int k = 0; k < nofTestCases; k++) {
        FinalResult = testSSE(lhs, ghs);
    }
    for (int k = 0; k < nofTestCases; k++) {
        FinalResult = testAVX(lhs, ghs);
    }
    cout << "===========================" << endl;
    time = timestamp();
    for (int k = 0; k < nofTestCases; k++) {
        FinalResult = testSSE(lhs, ghs);
    }
    time = timestamp() - time;
    time1 = time;
    cout << "SSE took " << time << " s" << endl;
    cout << "SSE Result: " << FinalResult << endl;
    time = timestamp();
    for (int k = 0; k < nofTestCases; k++) {
        FinalResult = testAVX(lhs, ghs);
    }
    time = timestamp() - time;
    time3 = time;
    cout << "AVX took " << time << " s" << endl;
    cout << "AVX Result: " << FinalResult << endl;
    time = timestamp();
    for (int k = 0; k < nofTestCases; k++) {
        FinalResult = testNormal(lhs, ghs);
    }
    time = timestamp() - time;
    cout << "Normal took " << time << " s" << endl;
    cout << "Normal Result: " << FinalResult << endl;
    cout << "SpeedUP SSE= " << time / time1 << " s" << endl;
    cout << "SpeedUP AVX= " << time / time3 << " s" << endl;
    cout << "===========================" << endl;
    ghs.clear();
    lhs.clear();
}
Where
inline double timestamp() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
And
typedef vector<int, aligned_allocator<int, sizeof (int)> > aligned_vector;
is an aligned vector using the AlignedAllocator of https://gist.github.com/donny-dont/1471329
I have an Intel i7-4771 (Haswell), with the latest Ubuntu 14.04 64-bit and gcc 4.8.2; everything is up-to-date. I compiled with -march=native -mtune=native -O3 -m64.
Results are:
SSE took 0.000375986 s
SSE Result: 1982689
AVX took 0.000459909 s
AVX Result: 1982689
Normal took 0.00315714 s
Normal Result: 1982689
SpeedUP SSE= 8.39696 s
SpeedUP AVX= 6.8647 s
This shows that essentially the same code is 22% slower with AVX2 than with SSE. Am I doing something wrong, or is this normal behavior?
I converted your code to more vanilla C++ (plain arrays, no vectors, etc), cleaned it up and tested it with auto-vectorization disabled and got reasonable results:
#include <iostream>
using namespace std;
#include <sys/time.h>
#include <cstdlib>
#include <cstdint>
#include <immintrin.h>

inline double timestamp() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}

int testSSE(const int32_t * ghs, const int32_t * lhs, size_t n) {
    int result[4] __attribute__((aligned(16))) = {0};
    __m128i vresult = _mm_set1_epi32(0);
    __m128i v1, v2, vmax;
    for (int k = 0; k < n; k += 4) {
        v1 = _mm_load_si128((__m128i *) &lhs[k]);
        v2 = _mm_load_si128((__m128i *) &ghs[k]);
        vmax = _mm_add_epi32(v1, v2);
        vresult = _mm_max_epi32(vresult, vmax);
    }
    _mm_store_si128((__m128i *) result, vresult);
    int mymax = result[0];
    for (int k = 1; k < 4; k++) {
        if (result[k] > mymax) {
            mymax = result[k];
        }
    }
    return mymax;
}

int testAVX(const int32_t * ghs, const int32_t * lhs, size_t n) {
    int result[8] __attribute__((aligned(32))) = {0};
    __m256i vresult = _mm256_set1_epi32(0);
    __m256i v1, v2, vmax;
    for (int k = 0; k < n; k += 8) {
        v1 = _mm256_load_si256((__m256i *) &ghs[k]);
        v2 = _mm256_load_si256((__m256i *) &lhs[k]);
        vmax = _mm256_add_epi32(v1, v2);
        vresult = _mm256_max_epi32(vresult, vmax);
    }
    _mm256_store_si256((__m256i *) result, vresult);
    int mymax = result[0];
    for (int k = 1; k < 8; k++) {
        if (result[k] > mymax) {
            mymax = result[k];
        }
    }
    return mymax;
}

int testNormal(const int32_t * ghs, const int32_t * lhs, size_t n) {
    int max = 0;
    int tempMax;
    for (int k = 0; k < n; k++) {
        tempMax = lhs[k] + ghs[k];
        if (max < tempMax) {
            max = tempMax;
        }
    }
    return max;
}

void alignTestSSE() {
    int n = 4096;
    int normalResult, sseResult, avxResult;
    int nofTestCases = 1000;
    double time, normalTime, sseTime, avxTime;
    int lhs[n] __attribute__ ((aligned(32)));
    int ghs[n] __attribute__ ((aligned(32)));
    for (int k = 0; k < n; k++) {
        lhs[k] = arc4random();
        ghs[k] = arc4random();
    }
    /* Warming UP */
    for (int k = 0; k < nofTestCases; k++) {
        normalResult = testNormal(lhs, ghs, n);
    }
    for (int k = 0; k < nofTestCases; k++) {
        sseResult = testSSE(lhs, ghs, n);
    }
    for (int k = 0; k < nofTestCases; k++) {
        avxResult = testAVX(lhs, ghs, n);
    }
    time = timestamp();
    for (int k = 0; k < nofTestCases; k++) {
        normalResult = testNormal(lhs, ghs, n);
    }
    normalTime = timestamp() - time;
    time = timestamp();
    for (int k = 0; k < nofTestCases; k++) {
        sseResult = testSSE(lhs, ghs, n);
    }
    sseTime = timestamp() - time;
    time = timestamp();
    for (int k = 0; k < nofTestCases; k++) {
        avxResult = testAVX(lhs, ghs, n);
    }
    avxTime = timestamp() - time;
    cout << "===========================" << endl;
    cout << "Normal took " << normalTime << " s" << endl;
    cout << "Normal Result: " << normalResult << endl;
    cout << "SSE took " << sseTime << " s" << endl;
    cout << "SSE Result: " << sseResult << endl;
    cout << "AVX took " << avxTime << " s" << endl;
    cout << "AVX Result: " << avxResult << endl;
    cout << "SpeedUP SSE= " << normalTime / sseTime << endl;
    cout << "SpeedUP AVX= " << normalTime / avxTime << endl;
    cout << "===========================" << endl;
}

int main()
{
    alignTestSSE();
    return 0;
}
Test:
$ clang++ -Wall -mavx2 -O3 -fno-vectorize SO_avx.cpp && ./a.out
===========================
Normal took 0.00324106 s
Normal Result: 2143749391
SSE took 0.000527859 s
SSE Result: 2143749391
AVX took 0.000221968 s
AVX Result: 2143749391
SpeedUP SSE= 6.14002
SpeedUP AVX= 14.6015
===========================
I suggest you try the above code, with -fno-vectorize (or -fno-tree-vectorize if using g++), and see if you get similar results. If you do then you can work backwards towards your original code to see where the inconsistency might be coming from.
On my machine (core i7-4900M), based on the updated code from Paul R, with g++ 4.8.2 and 100,000 iterations instead of 1000, I have the following results:
g++ -Wall -mavx2 -O3 -std=c++11 test_avx.cpp && ./a.exe
SSE took 508,029 us
AVX took 1,308,075 us
Normal took 297,017 us
g++ -Wall -mavx2 -O3 -std=c++11 -fno-tree-vectorize test_avx.cpp && ./a.exe
SSE took 509,029 us
AVX took 1,307,075 us
Normal took 3,436,197 us
GCC is doing an amazing job optimizing the "Normal" code. Yet the slow performance of the "AVX" code can be explained by the lines below, which require a full 256-bit store (ouch!) followed by a scalar max search over 8 integers:
_mm256_store_si256((__m256i *) result, vresult);
int mymax = result[0];
for (int k = 1; k < 8; k++) {
    if (result[k] > mymax) {
        mymax = result[k];
    }
}
return mymax;
It is best to keep using AVX intrinsics for the max of the 8 lanes. I can propose the following changes:
v1 = _mm256_permute2x128_si256(vresult, vresult, 1); // from ABCD-EFGH to ????-ABCD
vresult = _mm256_max_epi32(vresult, v1);
v1 = _mm256_permute4x64_epi64(vresult, 1);           // from ????-ABCD to ????-??AB
vresult = _mm256_max_epi32(vresult, v1);
v1 = _mm256_shuffle_epi32(vresult, 1);               // from ????-??AB to ????-???A
vresult = _mm256_max_epi32(vresult, v1);
// no _mm256_extract_epi32 => need an extra step
__m128i vres128 = _mm256_extracti128_si256(vresult, 0);
return _mm_extract_epi32(vres128, 0);
For a fair comparison, I have also updated the SSE code. I then have:
SSE took 483,028 us
AVX took 258,015 us
Normal took 307,017 us
AVX time has decreased by a factor of 5!
Manually unrolling the loop can speed up the above SSE/AVX code; a sketch follows after the results below.
Original version on my i5-5300U:
Normal took 0.347 s
Normal Result: 2146591543
AVX took 0.409 s
AVX Result: 2146591543
SpeedUP AVX= 0.848411
After manual loop unrolling:
Normal took 0.375 s
Normal Result: 2146591543
AVX took 0.297 s
AVX Result: 2146591543
SpeedUP AVX= 1.26263
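A sketch of what that manual unrolling can look like for the AVX loop (an illustration, not the poster's exact code): two independent accumulators let the add/max chains of consecutive iterations overlap, and it assumes n is a multiple of 16:
__m256i acc0 = _mm256_setzero_si256();
__m256i acc1 = _mm256_setzero_si256();
for (int k = 0; k < n; k += 16) {
    __m256i s0 = _mm256_add_epi32(_mm256_load_si256((__m256i *) &ghs[k]),
                                  _mm256_load_si256((__m256i *) &lhs[k]));
    __m256i s1 = _mm256_add_epi32(_mm256_load_si256((__m256i *) &ghs[k + 8]),
                                  _mm256_load_si256((__m256i *) &lhs[k + 8]));
    acc0 = _mm256_max_epi32(acc0, s0);   // two independent max chains
    acc1 = _mm256_max_epi32(acc1, s1);   // help hide instruction latency
}
__m256i vresult = _mm256_max_epi32(acc0, acc1);  // then reduce to a scalar as shown above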