Visual C++ SSE function slow when starting without debugger


I have a strange problem with SSE.
I wrote the following function, which uses SSE to calculate the maximum of the differences between two float arrays, each containing 64 floats.
The dists array is a 2D array allocated via _aligned_malloc.
#include <iostream>
#include <xmmintrin.h>
#include <time.h>
#include <stdio.h>
#include <algorithm>
#include <fstream>
#include "hr_time.h"
using namespace std;
float** dists;
float** dists2;
__m128* a;
__m128* b;
__m128* c;
__m128* d;
__m128 diff;
__m128 diff2;
__m128 mymax;
float* myfmax;
float test(int s, int t)
{
a = (__m128*) dists[s];
b = (__m128*) dists[t];
c = (__m128*) dists2[s];
d = (__m128*) dists2[t];
diff;
mymax = _mm_set_ps(0.0, 0.0, 0.0, 0.0);
for (int i = 0; i <= 16; i++)
{
diff = _mm_sub_ps(*a, *b);
mymax = _mm_max_ps(diff, mymax);
diff2 = _mm_sub_ps(*d, *c);
mymax = _mm_max_ps(diff2, mymax);
a++;
b++;
c++;
d++;
}
_mm_store_ps(myfmax, mymax);
float res = max(max(max(myfmax[0], myfmax[1]), myfmax[2]), myfmax[3]);
return res;
}
int Deserialize(std::istream* stream)
{
int numOfElements, arraySize;
stream->read((char*)&numOfElements, sizeof(int)); // numOfElements = 64
stream->read((char*)&arraySize, sizeof(int)); // arraySize = 8000000
dists = (float**)_aligned_malloc(arraySize * sizeof(float*), 16);
dists2 = (float**)_aligned_malloc(arraySize * sizeof(float*), 16);
for (int j = 0; j < arraySize; j++)
{
dists[j] = (float*)_aligned_malloc(numOfElements * sizeof(float), 16);
dists2[j] = (float*)_aligned_malloc(numOfElements * sizeof(float), 16);
}
for (int i = 0; i < arraySize; i++)
{
stream->read((char*)dists[i], (numOfElements*sizeof(float)));
}
for (int i = 0; i < arraySize; i++)
{
stream->read((char*)dists2[i], (numOfElements*sizeof(float)));
}
return 0;
}
int main(int argc, char** argv)
{
int entries = 8000000;
myfmax = (float*)_aligned_malloc(4 * sizeof(float), 16);
ifstream fs("binary_file", std::ios::binary);
Deserialize(&fs);
CStopWatch* watch = new CStopWatch();
watch->StartTimer();
int i;
for (i = 0; i < entries; i++)
{
int s = rand() % entries;
int t = rand() % entries;
test(s, t);
}
watch->StopTimer();
cout << i << " iterations took " << watch->GetElapsedTimeMs() << "ms" << endl;
cin.get();
}
My problem is that this code runs very fast when I run it in Visual Studio with the debugger attached, but as soon as I execute it without the debugger it becomes very slow.
So I did a little research and found that one difference between the two ways of starting the program is the "Debug Heap". I disabled it by defining "_NO_DEBUG_HEAP=1". With that option I get very poor performance with the debugger attached too.
But I don't understand how using the Debug Heap can give better performance, and I don't know how to solve this problem, so I hope one of you can help me.
Thanks in advance.
Regards,
Karsten

Your code has a bug. _mm_store_ps stores an array of four floats, but you only declare one. The compiler should not even allow you to do that.
Change
float fmax;
_mm_store_ps(fmax, max);
pi = std::max(std::max(std::max(fmax[0], fmax[1]), fmax[2]), fmax[3]);
to
float __declspec(align(16)) fmax[4];
_mm_store_ps(fmax, max);
return std::max(std::max(std::max(fmax[0], fmax[1]), fmax[2]), fmax[3]);
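For illustration, the same idea can be sketched as a small self-contained function. This is only a sketch under a few assumptions: the name max_difference and its parameters are hypothetical, the standard alignas specifier replaces the MSVC-specific __declspec(align(16)), the element count is a multiple of 4, and unaligned loads are used so the inputs need no particular alignment.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <algorithm>    // std::max

// Hypothetical helper: largest a[i] - b[i] over `count` floats, starting
// from 0 as the question does. Assumes count is a multiple of 4.
float max_difference(const float* a, const float* b, int count)
{
    __m128 vmax = _mm_setzero_ps();
    for (int i = 0; i < count; i += 4)
    {
        __m128 diff = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        vmax = _mm_max_ps(diff, vmax);
    }
    // _mm_store_ps writes four floats and needs a 16-byte aligned target.
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, vmax);
    return std::max(std::max(lanes[0], lanes[1]), std::max(lanes[2], lanes[3]));
}
```

Because the four-float buffer is a local variable, the function does not depend on any global state such as myfmax.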

Related

Need to add elements of two array together

Adding two integer array elements
if array1 = {0,0,0,0,9,9,9,9}—————> 00009999
and
array2 = {0,0,0,0,0,0,0,1}————————> 00000001
adding the two arrays together should result in 10000 being in array1, since 9999 + 1 = 10000
therefore, the result should be
array1 = {0,0,0,1,0,0,0,0}
Does anyone know how to write a code for this? I was trying to use a while loop which didn't work effectively within another for loop. Fairly new to coding and stack overflow, any help will be appreciated!
CODE THAT I TRIED
Note: both arrays will have the same number of elements, they were initialized with the same size
int length = sizeof(array1)/sizeof(array1[0]);
for(int i = length; i>0; i--){
if(array1[i-1] + array2[i-1] < 10){
array1[i-1] += array2[i-1];
}else{
array1[i-1] = array1[i-1] + array2[i-1] - 10;
if(array1[i-2]!=9){
array1[i-2]++;
} else{
int j = i-2;
while(j == 9){
array1[j] = 0;
j--;
}
array1[j-1]++;
}
}
}

The code below performs base-10 arithmetic: iterate the arrays in reverse, add the i-th digits modulo 10, then carry any excess over to the next array element:
#include <iostream>
#include <iterator>
using namespace std;
int main()
{
int a1[] = { 0,0,0,0,9,9,9,9 };
int a2[] = { 0,0,0,0,0,0,0,1 };
const int b = 10;
int s = 0;
for (int i = size(a1) - 1; i >= 0; --i)
{
a1[i] += a2[i] + s;
s = a1[i] / b;
a1[i] %= b;
}
std::copy(a1, a1 + size(a1), ostream_iterator<int>(cout, " "));
cout << endl;
}
An alternative with C arrays, the transform algorithm and make_reverse_iterator looks too heavy. A variant with std::array looks better:
#include <algorithm>
#include <array>
#include <iterator>
#include <iostream>
using namespace std;
int main()
{
std::array<int, 8> a1 = { 0,0,0,0,9,9,9,9 };
std::array<int, 8> a2 = { 0,0,0,0,0,0,0,1 };
const int b = 10;
int s = 0;
transform(a1.rbegin(), a1.rend(), a2.rbegin(), a1.rbegin(), [b, &s](int &i, int &j)
{
i += j + s;
s = i / b;
return i % b;
});
copy(a1.begin(), a1.end(), ostream_iterator<int>(cout, " "));
cout << endl;
}

It looks like you've overcomplicated the problem a bit. Your task is to perform base-10 addition on two arrays, carrying excess digits. You can simply iterate both arrays in reverse, perform addition on the individual elements, and carry a digit over whenever a sum reaches 10 or more.
Here is an example based on the requirements you've described. You can further abstract this as needed.
I updated this code such that the result is now in array1.
#include <iostream>
int main(int argc, char *argv[], char *argp[]){
/* Initialize arrays and calculate array size */
int array1[] = {0,0,0,0,9,9,9,9};
int array2[] = {0,0,0,0,0,0,0,1};
int array_size = sizeof(array1) / sizeof(int);
/* Iterate the arrays in reverse */
int carry = 0;
for (int i = array_size - 1; i >= 0; i--) {
/* Perform addition on the current elements, remembering to include the carry digit */
int result = array1[i] + array2[i] + carry;
/* Determine if the addition should result in another carry */
if (result >= 10) {
carry = 1;
} else {
carry = 0;
}
/* Store the non-carried addition result */
array1[i] = result % 10;
}
/* View result */
for (int i = 0; i < array_size; i++) {
std::cout << array1[i];
}
std::cout << std::endl;
return 0;
}
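The carry loop can also be wrapped in a reusable function. The following is a minimal sketch, not taken from any of the answers: the name add_digits and the use of std::vector are assumptions, and both inputs are assumed to have the same length. It returns the final carry so the caller can detect an overflow out of the most significant digit.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: adds the digits of b into a (most significant digit
// first) and returns the carry out of the leftmost digit.
// Assumes a.size() == b.size() and all digits are in [0, base).
int add_digits(std::vector<int>& a, const std::vector<int>& b, int base = 10)
{
    int carry = 0;
    for (std::size_t i = a.size(); i-- > 0; )
    {
        int sum = a[i] + b[i] + carry;
        a[i] = sum % base;   // digit that stays in this position
        carry = sum / base;  // excess moves to the next position on the left
    }
    return carry;
}
```

A nonzero return value means the result did not fit into the given number of digits.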

#include <cstdint>
// Convert an 8-digit array (most significant digit first) to a number.
uint32_t decode(const int array[8]){
uint32_t multiple_of_ten=10000000;
uint32_t result = 0;
for(int i=0;i<8;i++){
result += (array[i] * multiple_of_ten);
multiple_of_ten = multiple_of_ten / 10;
}
return result;
}
// Write the 8 decimal digits of number into result (most significant first).
// The caller supplies the output array, because returning a pointer to a
// local array would leave it dangling.
void encode(uint32_t number, int result[8]){
uint32_t multiple_of_ten=10000000;
uint32_t recent = number;
for(int i=0;i<8;i++){
result[i] = recent / multiple_of_ten;
recent %= multiple_of_ten;
multiple_of_ten /= 10;
}
}
In your case, use encode(decode(array1) + decode(array2), array1).

Program crashes after checking index at array

Please note: I am not sure if this fits here; if not, please move it to the proper forum.
So I have a program that tries to solve the Traveling Salesman Problem, TSP for short.
My code seems to run fine until I try to use 33810 cities, at which point the program crashes after trying to access the position costs[69378120]: it simply stops responding and ends soon after.
I am trying the following code:
#include <iostream>
#include <stdlib.h>
#include <malloc.h>
#include <fstream>
#include <math.h>
#include <vector>
#include <limits>
using namespace std;
typedef long long int itype;
int main(int argc, char *argv[]) {
itype n;
ifstream fenter;
fenter.open(argv[1]);
ofstream fexit;
fexit.open(argv[2]);
fenter>> n;
double *x;
double *y;
x = (double*) malloc(sizeof(double)*n);
y = (double*) malloc(sizeof(double)*n);
cout<<"N : "<<n<<endl;
for (int p = 0; p < n; p++) {
fenter>> x[p] >> y[p];
}
fenter.close();
int *costs;
costs = (int*) malloc(sizeof(int)*(n*n));
for (int u = 0; u < n; u++) {
for (int v = u+1; v < n; v++) {
itype cost = floor(sqrt(pow(x[u] - x[v], 2) + pow(y[u] - y[v], 2)));
cout<<"U: "<<u<<" V: "<<v<<" COST: "<<cost<<endl;
costs[u*n + v] = cost;
cout<<"POS (u*n + v): "<<(u*n + v)<<endl;
cout<<"POS (v*n + u): "<<(v*n + u)<<endl;
costs[v*n + u] = cost;
}
}
return 0;
}
According to some checks, the cost array should use 9.14493GB, but Windows only gives it 0.277497GB. Then, after trying to read costs[69378120], the program closes.
For now, I am not worried about efficiency or about the solution to the TSP; I just need to fix this issue. Any clues?
---UPDATE---
Following the suggestions, I tried changing a few things. The result is the code below:
int main(int argc, char *argv[]) {
int n;
ifstream entrada;
entrada.open(argv[1]);
ofstream saida;
saida.open(argv[2]);
entrada >> n;
vector<double> x(n);
vector<double> y(n);
for (int p = 0; p < n; p++) {
entrada >> x[p] >> y[p];
}
entrada.close();
vector<itype> costs(n*n);
if(costs == NULL){ cout << "Out of memory!" << endl; return -1;}
for (int u = 0; u < n; u++) {
for (int v = u+1; v < n; v++) {
itype cost = floor(sqrt(pow(x[u] - x[v], 2) + pow(y[u] - y[v], 2)));
costs[u*n + v] = cost;
costs[v*n + u] = cost;
}
}
return 0;
}
The problem still persists

If you compile as 32-bit, size_t is 32 bits wide. The array has 1143116100 elements (33810 squared), and
1143116100*4
is larger than the largest 32-bit number, so the size computation overflows.
Compiled as 64-bit, this works:
size_t siz = 1143116100;
std::vector<long long> big(siz);
std::cout << big.size() << ", " << big.max_size() << std::endl;
It prints
1143116100, 2305843009213693951
If I change it to
size_t siz = 1024ull*1143116100;
I get a bad_alloc, as my swap disk is not big enough for that.
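The size arithmetic can be checked before allocating. The sketch below is only an illustration: the function name matrix_bytes_fit and its parameters are hypothetical, and max_size stands for the largest value size_t can hold on the target (0xFFFFFFFF for a 32-bit build).

```cpp
#include <cstdint>

// Hypothetical helper: returns true if an n x n matrix of elem_size-byte
// elements has a byte count that does not exceed max_size.
// All arithmetic is done in 64 bits, with division used to avoid overflow.
bool matrix_bytes_fit(std::uint64_t n, std::uint64_t elem_size, std::uint64_t max_size)
{
    if (n != 0 && n > max_size / n)
        return false;                      // n * n alone exceeds max_size
    std::uint64_t cells = n * n;
    if (elem_size != 0 && cells > max_size / elem_size)
        return false;                      // cells * elem_size exceeds max_size
    return true;
}
```

With n = 33810 and 4-byte elements, the check fails for a 32-bit limit and passes for a 64-bit one. Passing only means the size is representable, not that the system can actually provide that much memory.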

I don't understand where I have a problem in code using sse

I am new to SSE programming. I want to write code that sums up every 4 consecutive numbers from vector v and writes the result of each sum to the ans vector. I want to write optimized code using SSE. When I set size to 4, my program works. But when I set size to 8, my program doesn't work and I get this error message:
"Exception thrown: read access violation.
ans was 0x1110112.
If there is a handler for this exception, the program may be safely continued."
I don't understand where the problem is. I believe I allocate the memory correctly, so where does it go wrong? Could somebody help me? I would be really grateful.
#include <iostream>
#include <immintrin.h>
#include <pmmintrin.h>
#include <vector>
#include <math.h>
using namespace std;
using arith_t = double;
void init(arith_t *&v, size_t size) {
for (int i = 0; i < size; ++i) {
v[i] = i / 10.0;
}
}
//accumulate with sse
void sub_func_sse(arith_t *v, size_t size, int start_idx, arith_t *ans, size_t start_idx_ans) {
__m128d first_part = _mm_loadu_pd(v + start_idx);
__m128d second_part = _mm_loadu_pd(v + start_idx + 2);
__m128d sum = _mm_add_pd(first_part, second_part);
sum = _mm_hadd_pd(sum, sum);
_mm_store_pd(ans + start_idx_ans, sum);
}
int main() {
const size_t size = 8;
arith_t *v = new arith_t[size];
arith_t *ans_sse = new arith_t[size / 4];
init(v, size);
init(ans_sse, size / 4);
int num_repeat = 1;
arith_t total_time_sse = 0;
for (int p = 0; p < num_repeat; ++p) {
for (int idx = 0, ans_idx = 0; idx < size; idx += 4, ans_idx++) {
sub_func_sse(v, size, idx, ans_sse, ans_idx);
}
}
for (size_t i = 0; i < size / 4; ++i) {
cout << *(ans_sse + i) << endl;
}
delete[] ans_sse;
delete[] v;
}

You're using unaligned memory, which requires special versions of the load and store functions. You correctly used _mm_loadu_pd, but _mm_store_pd isn't appropriate for unaligned memory, so you should change it to _mm_storeu_pd. Also consider using aligned memory, which will result in better performance.
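One further point may be worth checking in the question's code: each store writes two doubles, while ans holds only one result per group of four inputs, so the last store appears to write one element past the end of the array. A possible variant, sketched here under assumptions (the function name sum_groups_of_four is hypothetical, and SSE2 shuffles replace _mm_hadd_pd so that only emmintrin.h is needed), stores just the low lane with _mm_store_sd, which writes a single double and has no alignment requirement.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Hypothetical helper: ans[k] = v[4k] + v[4k+1] + v[4k+2] + v[4k+3].
// ans must hold size / 4 doubles; a trailing partial group is ignored.
void sum_groups_of_four(const double* v, std::size_t size, double* ans)
{
    for (std::size_t i = 0, k = 0; i + 4 <= size; i += 4, ++k)
    {
        __m128d first = _mm_loadu_pd(v + i);       // {v0, v1}
        __m128d second = _mm_loadu_pd(v + i + 2);  // {v2, v3}
        __m128d sum = _mm_add_pd(first, second);   // {v0+v2, v1+v3}
        __m128d high = _mm_unpackhi_pd(sum, sum);  // {v1+v3, v1+v3}
        sum = _mm_add_sd(sum, high);               // low lane holds the total
        _mm_store_sd(ans + k, sum);                // writes exactly one double
    }
}
```

Using _mm_storeu_pd as suggested above fixes the alignment fault; storing a single lane additionally keeps every write inside the result array.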

POSIX pthread_create scrambles the values of variables in a struct, how to avoid that?

So I have my program here:
#include <iostream>
#include <string>
#include <pthread.h>
#include <unistd.h>
#include <math.h>
#include <stdlib.h>
using namespace std;
int const size = 3;
struct Arguments{
int array[];
float result1[];
float result2[];
};
//void calc(int arr[], float rarr1[], float rarr2[], int size);
void* calc(void *param);
int main(int argc, char *argv[]){
time_t t;
srand((unsigned) time(&t));
int arr[size][size] = {};
float rarr1[size][size-1] = {};
float rarr2[size][size-1] = {};
for(int x = 0; x < size; x++){
for(int y = 0; y < size; y++){
int number = rand()%10;
arr[x][y] = number;
}
}
for(int x = 0; x < size; x++){
for(int y = 0; y < size; y++){
cout << arr[x][y] << " ";
}
cout << endl;
}
cout << endl;
/////////////////////////////////////////
pthread_t child;
struct Arguments input;
for(int i = 0; i < size; i++){
input.array[i] = arr[0][i];
}
pthread_create(&child, NULL, calc, (void*)&input);
pthread_join(child, NULL);
//calc(&input);
for(int i = 0; i < size-1; i++){
rarr1[0][i] = input.result1[i];
cout << "Test: " << rarr1[0][i] << endl;
}
//////////////////////////////////
return 0;
}
//void calc(int arr[], float rarr1[], float rarr2[], int size){
void* calc(void *param){
struct Arguments *input = (struct Arguments*)param;
int arr1[] = {};
float rarr1[] = {};
float rarr2[] = {};
for(int i = 0; i < size; i++){
arr1[i] = input->array[i];
}
for(int i = 0; i < size; i++){
int a = arr1[i];
int b = arr1[i+1];
int difference = a-b;
if(difference < 0){
difference = difference * -1;
}
float euc = 1 + pow(difference, 2);
euc = sqrt(euc);
rarr1[i] = euc;
}
for(int i = 0; i <size-1; i++){
input->result1[i] = rarr1[i];
}
for(int i = 0; i <size-1; i++){
int a = arr1[i];
int b = arr1[i+1];
int difference = a-b;
if(difference < 0){
difference = difference * -1;
}
float apar = (difference/rarr1[i]);
float result = asin(apar);
result = result*(180/3.14);
rarr2[i] = result;
}
return NULL;
}
The important part that causes the trouble is between ////// lines but I left the rest of the code for the context, since it might be useful.
So I have the function calc(param); that does the important calculation in the program.
It is working just fine as long as I call it myself (by actually including the function call in the code) and the test loop right after it gives the correct results.
However, when I try to use pthread_create(); to create a new thread that will take care of executing that function, the test loop spits out nonsense and some random huge numbers different each time.
It's kinda weird because the code compiles either way, and literally the only thing that I change is these 2 lines.
What am I doing wrong, and why does the function spit out garbage when started through pthread_create? Is there a way to fix it?

Ok so if anyone's having a similar problem:
Declare the size of arrays no matter what. It turns out that my program didn't work properly because I initialized my result arrays as float result1[]; instead of float result1[size];
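A minimal sketch of that fix follows. It is an illustration rather than the original program: the constant kSize is a hypothetical stand-in for size, and the loop stops at kSize - 1 so that array[i + 1] stays in bounds. With unsized members the struct most likely contained no storage for the arrays at all, so every write went into unrelated memory.

```cpp
#include <cmath>
#include <cstdlib>

const int kSize = 3;  // hypothetical name, plays the role of `size` in the question

// Every array member has an explicit size, so the struct owns real storage.
struct Arguments {
    int array[kSize];
    float result1[kSize - 1];
    float result2[kSize - 1];
};

// Thread entry point with the signature pthread_create expects.
void* calc(void* param)
{
    Arguments* input = static_cast<Arguments*>(param);
    for (int i = 0; i < kSize - 1; ++i)
    {
        int difference = std::abs(input->array[i] - input->array[i + 1]);
        float euc = std::sqrt(1.0f + static_cast<float>(difference * difference));
        input->result1[i] = euc;
        // Angle in degrees, using the same 3.14 approximation as the question.
        input->result2[i] = std::asin(difference / euc) * (180.0f / 3.14f);
    }
    return nullptr;
}
```

The function can be started as in the question, with pthread_create(&child, NULL, calc, (void*)&input) followed by pthread_join(child, NULL), and the results can then be read from input.result1 and input.result2.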

Avoid blas when involving temporary memory allocation?

I have a program that computes the matrix product x'Ay repeatedly. Is it better practice to compute this by making calls to MKL's BLAS, i.e. cblas_dgemv and cblas_ddot, which requires allocating memory for a temporary vector, or is it better to simply take the sum of x_i * a_ij * y_j? In other words, does MKL's BLAS theoretically add any value?
I benchmarked this for my laptop. There was virtually no difference in each of the tests, other than g++_no_blas performed twice as poorly as the other tests (why?). There was also no difference between O2, O3 and Ofast.
g++_blas_static 57ms
g++_blas_dynamic 58ms
g++_no_blas 100ms
icpc_blas_static 57ms
icpc_blas_dynamic 58ms
icpc_no_blas 58ms
util.h
#ifndef UTIL_H
#define UTIL_H
#include <random>
#include <memory>
#include <iostream>
struct rng
{
rng() : unif(0.0, 1.0)
{
}
std::default_random_engine re;
std::uniform_real_distribution<double> unif;
double rand_double()
{
return unif(re);
}
std::unique_ptr<double[]> generate_square_matrix(const unsigned N)
{
std::unique_ptr<double[]> p (new double[N * N]);
for (unsigned i = 0; i < N; ++i)
{
for (unsigned j = 0; j < N; ++j)
{
p.get()[i*N + j] = rand_double();
}
}
return p;
}
std::unique_ptr<double[]> generate_vector(const unsigned N)
{
std::unique_ptr<double[]> p (new double[N]);
for (unsigned i = 0; i < N; ++i)
{
p.get()[i] = rand_double();
}
return p;
}
};
#endif // UTIL_H
main.cpp
#include <iostream>
#include <iomanip>
#include <memory>
#include <chrono>
#include "util.h"
#include "mkl.h"
double vtmv_blas(double* x, double* A, double* y, const unsigned n)
{
double temp[n];
cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, A, n, y, 1, 0.0, temp, 1);
return cblas_ddot(n, temp, 1, x, 1);
}
double vtmv_non_blas(double* x, double* A, double* y, const unsigned n)
{
double r = 0;
for (unsigned i = 0; i < n; ++i)
{
for (unsigned j = 0; j < n; ++j)
{
r += x[i] * A[i*n + j] * y[j];
}
}
return r;
}
int main()
{
std::cout << std::fixed;
std::cout << std::setprecision(2);
constexpr unsigned N = 10000;
rng r;
std::unique_ptr<double[]> A = r.generate_square_matrix(N);
std::unique_ptr<double[]> x = r.generate_vector(N);
std::unique_ptr<double[]> y = r.generate_vector(N);
auto start = std::chrono::system_clock::now();
const double prod = vtmv_blas(x.get(), A.get(), y.get(), N);
auto end = std::chrono::system_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
end - start);
std::cout << "Result: " << prod << std::endl;
std::cout << "Time (ms): " << duration.count() << std::endl;
return 0;
}

The GCC build without BLAS is slow because it does not use vectorized SIMD instructions, while all the others do. icpc will auto-vectorize your loop.
You don't show your matrix size, but gemv is generally memory-bound. Since the matrix is much larger than the temporary vector, eliminating the temporary is unlikely to improve performance much.
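If the temporary vector is still a concern, the hand-written loop can be restructured so that the inner loop is a plain dot product of one matrix row with y. This is only a sketch: the function name vtmv_rowwise is hypothetical, row-major storage is assumed as in the question, and whether a given compiler vectorizes the inner loop depends on its version and flags, so it would need to be benchmarked.

```cpp
#include <cstddef>

// Hypothetical variant of vtmv_non_blas: computes x'Ay without a temporary
// vector by accumulating one row dot product at a time (A is row-major, n x n).
double vtmv_rowwise(const double* x, const double* A, const double* y, const unsigned n)
{
    double r = 0.0;
    for (unsigned i = 0; i < n; ++i)
    {
        const double* row = A + static_cast<std::size_t>(i) * n;
        double dot = 0.0;
        for (unsigned j = 0; j < n; ++j)
        {
            dot += row[j] * y[j];  // inner loop touches only row and y
        }
        r += x[i] * dot;           // x[i] is applied once per row
    }
    return r;
}
```

Compared with the original loop, this does one multiplication by x[i] per row instead of one per element, and it needs only a single scalar accumulator per row rather than a vector of length n.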
