MSVC-compiled program is 3x slower than the same program compiled with MinGW - c++

I need to write an addon for Node.js. Since the addon has to work on Windows, I need to compile it with MSVC. After compiling, I discovered that the addon was slower than the original program, and after some checking I concluded that the problem is MSVC (see the results below). I don't understand why MSVC produces such a slow program. Can I tune the MSVC build in some way to reach the speed of the MinGW-compiled program? Or does MSVC simply produce less optimized code, so that I have already reached its maximum optimization?
Compiler: g++ (x86_64-win32-sjlj-rev3, Built by MinGW-W64 project) 12.1.0
Compilation: g++ -O3 main.cpp
Execution time: 18517 ms
Compiler: Microsoft (R) C/C++ Optimizing Compiler Version 19.33.31629 for x64 (Toolset v143, SDK 10.0.20348.0)
Compilation: cl /O2 main.cpp
Execution time: 58144 ms
Notes:
I added swap(word1, word2) only to stop MinGW from optimizing the loop away, which results in a 0 s execution time.
If I switch from string to char[], the execution time with MinGW barely changes (1-2 s faster), but with MSVC it drops significantly, to 36 s. If I switch to a vector of int, it drops again, to 25 s.
Minimal reproducible example:
#include <cmath>
#include <string>
#include <chrono>
#include <iostream>
using namespace std;
int cached[6];
int SIZE;
void setSize(int size) {
SIZE = size;
for (int i = 0; i < 6; i++)
cached[i] = pow(3, i);
}
int getMask(const string& guess, const string& answer) {
int results[6];
bool visited[6];
for (int i = 0; i < SIZE; i++) {
if (guess[i] == answer[i]) {
results[i] = 2;
visited[i] = true;
}
else {
results[i] = 0;
visited[i] = false;
}
}
for (int i = 0; i < SIZE; i++) {
if (results[i] != 2) {
for (int j = 0; j < SIZE; j++) {
if (answer[j] == guess[i] && !visited[j]) {
results[i] = 1;
visited[j] = true;
break;
}
}
}
}
int result = results[0];
for (int i = 1; i < SIZE; i++) {
result += results[i] * cached[i];
}
return result;
}
int main() {
setSize(6);
int sum = 0;
auto t0 = chrono::steady_clock::now();
string word1 = "abcdef";
string word2 = "fedcba";
for (int i = 0; i < 30000; i++) {
for (int j = 0; j < 30000; j++) {
sum += getMask(word1, word2);
swap(word1, word2);
}
}
auto t1 = chrono::steady_clock::now();
cout << chrono::duration_cast<chrono::milliseconds>(t1 - t0).count() << "[ms]" << endl;
cout << sum << endl;
return 0;
}
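The char[] variant mentioned in the notes is not shown in the question. As a rough sketch of what it might look like, assuming words of at most 6 characters: getMaskRaw, POW3 and the len parameter are hypothetical names that do not appear in the original, and the integer table stands in for the pow call only to avoid any floating-point rounding. It says nothing about how either compiler will optimize it.

```cpp
#include <cassert>

// Hypothetical integer table replacing cached[]: POW3[i] == 3^i.
static const int POW3[6] = {1, 3, 9, 27, 81, 243};

// Same algorithm as getMask in the question, but over raw character
// arrays, so no std::string member functions are called in the loops.
int getMaskRaw(const char* guess, const char* answer, int len) {
    assert(len >= 1 && len <= 6);
    int results[6];
    bool visited[6];
    // Pass 1: a letter in the right position scores 2.
    for (int i = 0; i < len; i++) {
        const bool hit = guess[i] == answer[i];
        results[i] = hit ? 2 : 0;
        visited[i] = hit;
    }
    // Pass 2: a letter found at another, unused position scores 1.
    for (int i = 0; i < len; i++) {
        if (results[i] == 2) continue;
        for (int j = 0; j < len; j++) {
            if (answer[j] == guess[i] && !visited[j]) {
                results[i] = 1;
                visited[j] = true;
                break;
            }
        }
    }
    // Encode the per-letter scores as a base-3 number.
    int result = 0;
    for (int i = 0; i < len; i++)
        result += results[i] * POW3[i];
    return result;
}
```

With the two words from the question, getMaskRaw("abcdef", "fedcba", 6) scores every letter as 1, which gives 1 + 3 + 9 + 27 + 81 + 243 = 364.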

Related

Why do I get an out-of-range exception when working with C++ strings

The code throws an exception of type out_of_range and I don't know why. It seems to work when I debug it: the string is converted to what I want it to be.
First time on stack overflow btw:)
#include <iostream>
using namespace std;
string s;
string alpha = "abcdefghijklmnopqrstuvwxyz";
string crypto(string& s);
int main()
{
cin >> s;
cout << crypto(s);
return 0;
}
string crypto(string& s)
{
size_t i = 0;
while (i < s.length()) {
for (size_t j = 0; j < alpha.length(); j++) {
if (s.at(i) == alpha.at(j)) {
s.at(i) = alpha.at(alpha.length() - 1 - j);
++i;
}
}
}
return s;
}
Think about the case where s.length() < alpha.length().
The problem was the i++ at the wrong place. You should format your code better, then this is much easier to spot.
Also avoid non-constant global variables and using namespace std:
#include <iostream>
#include <string> //Include the headers you use
const std::string alpha="abcdefghijklmnopqrstuvwxyz";
void crypto(std::string &s);
int main(){
std::string s;
std::cin>>s; //Read the input before transforming it
crypto(s);
std::cout<<s;
return 0;
}
void crypto(std::string &s) //If you take the string by reference, it is changed, so you do not have to return it
{
for(std::size_t i = 0; i < s.length(); ++i) { //The for loop avoids the mistake completely
for(size_t j=0; j < alpha.length(); ++j){
if(s.at(i)==alpha.at(j)){
s.at(i)=alpha.at(alpha.length()-1-j);
}
}
}
}
I have deliberately not solved your problem completely: there is still a bug in this code that was also in yours. Try to find it yourself.
The i++ was not put at the right place.
Moreover, there is another issue in the code: once an element has been replaced, you must leave the inner loop immediately (break). If not, you can have a -> z -> a in the same loop.
Input
abcyz
Output
zyxba
abcyz
#include <iostream>
#include <string>
std::string crypto(std::string& s) {
const std::string alpha = "abcdefghijklmnopqrstuvwxyz";
for (size_t i = 0; i < s.length(); ++i) {
for (size_t j = 0; j < alpha.length(); j++) {
if (s.at(i) == alpha.at(j)) {
s.at(i) = alpha.at(alpha.length() - 1 - j);
break;
}
}
}
return s;
}
int main() {
std::string s;
std::cin >> s;
std::cout << crypto(s) << std::endl;
std::cout << crypto(s) << std::endl;
return 0;
}
The inner loop of the crypto function can increment i past the end of the string. There's no test to stop it.
You could avoid this by breaking out of the loop when a letter is changed (that's more correct and potentially faster). That then means you should increment i outside the inner loop which means the outer loop can be a for loop as well.
string crypto(string &s) {
for (size_t i = 0; i < s.length(); ++i) {
for (size_t j = 0; j < alpha.length(); ++j) {
if (s.at(i) == alpha.at(j)) {
s.at(i) = alpha.at(alpha.length() - 1 - j);
break;
}
}
}
return s;
}
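As a side note, the inner loop can be dropped altogether if you compute the mirrored letter arithmetically. This is a different technique from the answers above, shown only as a sketch: mirrorLetters is a hypothetical name, and the code assumes the lowercase letters are contiguous in the character set (true for ASCII).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replace each lowercase letter with its mirror in the alphabet
// (a <-> z, b <-> y, ...). Other characters are left untouched.
void mirrorLetters(std::string& s) {
    // Documents the assumption: 'a'..'z' must be 26 contiguous codes.
    assert('z' - 'a' == 25);
    for (std::size_t i = 0; i < s.length(); ++i) {
        const char c = s[i];
        if (c >= 'a' && c <= 'z') {
            s[i] = static_cast<char>('z' - (c - 'a'));
        }
    }
}
```

Since there is no inner loop, there is no index to run past the end and no break to forget. Applying the function twice gives back the original string, as in the abcyz example above.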

read/write to large array using large loop - execution time concerns

So recently I ran into a problem that I thought was interesting and I couldn't fully explain. I've highlighted the nature of the problem in the following code:
#include <cstring>
#include <chrono>
#include <iostream>
#define NLOOPS 10
void doWorkFast(int total, int *write, int *read)
{
for (int j = 0; j < NLOOPS; j++) {
for (int i = 0; i < total; i++) {
write[i] = read[i] + i;
}
}
}
void doWorkSlow(int total, int *write, int *read, int innerLoopSize)
{
for (int i = 0; i < NLOOPS; i++) {
for (int j = 0; j < total/innerLoopSize; j++) {
for (int k = 0; k < innerLoopSize; k++) {
write[j*k + k] = read[j*k + k] + j*k + k;
}
}
}
}
int main(int argc, char *argv[])
{
int n = 1000000000;
int *heapMemoryWrite = new int[n];
int *heapMemoryRead = new int[n];
for (int i = 0; i < n; i++)
{
heapMemoryRead[i] = 1;
}
std::memset(heapMemoryWrite, 0, n * sizeof(int));
auto start1 = std::chrono::high_resolution_clock::now();
doWorkFast(n,heapMemoryWrite, heapMemoryRead);
auto finish1 = std::chrono::high_resolution_clock::now();
auto duration1 = std::chrono::duration_cast<std::chrono::microseconds>(finish1 - start1);
for (int i = 0; i < n; i++)
{
heapMemoryRead[i] = 1;
}
std::memset(heapMemoryWrite, 0, n * sizeof(int));
auto start2 = std::chrono::high_resolution_clock::now();
doWorkSlow(n,heapMemoryWrite, heapMemoryRead, 10);
auto finish2 = std::chrono::high_resolution_clock::now();
auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(finish2 - start2);
std::cout << "Small inner loop:" << duration1.count() << " microseconds.\n" <<
"Large inner loop:" << duration2.count() << " microseconds." << std::endl;
delete[] heapMemoryWrite;
delete[] heapMemoryRead;
}
Looking at the two doWork* functions, for every iteration, we are reading the same addresses adding the same value and writing to the same addresses. I understand that in the doWorkSlow implementation, we are doing one or two more operations to resolve j*k + k, however, I think it's reasonably safe to assume that relative to the time it takes to do the load/stores for memory read and write, the time contribution of these operations is negligible.
Nevertheless, doWorkSlow takes about twice as long (46.8 s) as doWorkFast (25.5 s) on my i7-3700 with g++ 7.5.0. While things like cache prefetching and branch prediction come to mind, I don't have a good explanation for why doWorkFast is so much faster than doWorkSlow. Does anyone have any insight?
Thanks
"Looking at the two doWork* functions, for every iteration, we are reading the same addresses adding the same value and writing to the same addresses."
This is not true!
In doWorkFast, you index each integer incrementally, as array[i].
array[0]
array[1]
array[2]
array[3]
In doWorkSlow, you index each integer as array[j*k + k], which jumps around and repeats.
When j is 10, for example, and you iterate k from 0 onwards, you are accessing
array[0] // 10*0+0
array[11] // 10*1+1
array[22] // 10*2+2
array[33] // 10*3+3
This strided access pattern prevents your optimizer from using instructions that operate on many adjacent integers at once, and it uses only a small part of every cache line that gets loaded.
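If the intent was for the nested loops to walk the array in consecutive blocks, the index would presumably need to be j*innerLoopSize + k rather than j*k + k. A small sketch under that assumption (addIndexFlat and addIndexBlocked are hypothetical names, and the array size is assumed to be a multiple of the block size):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat loop: visits indices 0, 1, 2, ..., size-1 in order.
void addIndexFlat(std::vector<int>& write, const std::vector<int>& read) {
    assert(write.size() == read.size());
    for (std::size_t i = 0; i < read.size(); ++i)
        write[i] = read[i] + static_cast<int>(i);
}

// Blocked loop: block j covers [j*blockSize, (j+1)*blockSize), so the
// blocks together visit exactly the same indices as the flat loop.
void addIndexBlocked(std::vector<int>& write, const std::vector<int>& read,
                     std::size_t blockSize) {
    assert(write.size() == read.size());
    assert(blockSize > 0 && read.size() % blockSize == 0);
    const std::size_t blocks = read.size() / blockSize;
    for (std::size_t j = 0; j < blocks; ++j) {
        for (std::size_t k = 0; k < blockSize; ++k) {
            const std::size_t idx = j * blockSize + k;
            write[idx] = read[idx] + static_cast<int>(idx);
        }
    }
}
```

With this index both functions produce identical output, so a timing comparison between them would at least be measuring the same memory traffic.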

Why is this vectorized code subject to vector size?

I compile the following code without vectorization (-O2) and compare the time with vectorization (-O3 -march=native) for three different vector lengths (selected by uncommenting the respective #define SIZE). The times in ms (without::with vectorization) are 29::9, 247::145 and 4866::4884 for vector sizes 10000, 100000 and 1000000, respectively.
#include <iostream>
#include <random>
#include<chrono>
#include<cmath>
using namespace std;
using namespace std::chrono;
//#define SIZE (10000) // 29::9
//#define SIZE (100000) // 247::145
#define SIZE (1000000) // 4866::4884
void vector_op_2(int * __restrict__ v1, int * __restrict__ v2) {
for (unsigned i = 0; i < SIZE; i++)
v1[i] = 2 * v2[i];
}
int main() {
using namespace std;
int* v = new int[SIZE];
int* w = new int[SIZE];
for (int i = 0; i < SIZE; i++) {
v[i] = i;
}
auto start = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
for (int k = 0; k < 5000; k++) {
vector_op_2(w, v);
}
auto end = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
std::cout << "Time " << end.count() - start.count() << std::endl;
for (int i = 0; i < SIZE; i++) {
if (abs(w[i]-2*v[i])>0.01) {
throw 1;
}
}
delete[] v;
delete[] w;
return 0;
}
Why does no speedup occur in the case of vector size 1000000?
What is the optimal length?
Why does this vector length issue not occur with the following example?
[shortened]
long vector_op_1(int v[SIZE]) throw()
{
long s = 0;
for (unsigned i=0; i<SIZE; i++) s += v[i];
return s;
}
[... I am using g++ 7 on Ubuntu 16.04 ...]
[... For short vector size 1000 I am achieving a 6:1 ratio! ...]

Program crashes when using std::bitset but only when compiled with VC 2015

I'm working on a simple program which converts strings into binary code and back. I'm working on Ubuntu and wanted to compile the program on Windows with Visual Studio 2015. While the Linux build runs fine, on Windows it compiles, but crashes with something like
bitset<N> char
when calling following function:
bool bin2string(std::string *psBinaryString, std::string *psCharakterString)
{
psCharakterString->clear();
char cTempArray[8];
char cTemp;
for(unsigned int i = 0; i < psBinaryString->length(); i += 8)
{
for(unsigned int j = 0; j < 8; ++j)
{
cTempArray[j] = psBinaryString->c_str()[j + i];
}
std::bitset<8> Bitset(cTempArray);
cTemp = static_cast<char>(Bitset.to_ulong());
psCharakterString->push_back(cTemp);
}
return true;
}
Now, my questions are, what is wrong with this code? Why does it work on Linux (gcc) and Windows (MinGW), but not on Windows with Visual Studio 2015?
My current workaround for this is:
#ifdef _WIN32
bool bin2string(std::string *psBinaryString, std::string *psCharakterString)
{
psCharakterString->clear();
char cTempArray[8];
char cTemp;
for(size_t i = 0; i < psBinaryString->length(); i += 8)
{
for(size_t j = 0; j < 8; ++j)
{
cTempArray[j] = psBinaryString->c_str()[j + i];
}
cTemp = static_cast<char>(strtol(cTempArray, 0, 2));
psCharakterString->push_back(cTemp);
}
return true;
}
#else
bool bin2string(std::string *psBinaryString, std::string *psCharakterString)
{
psCharakterString->clear();
char cTempArray[8];
char cTemp;
for(unsigned int i = 0; i < psBinaryString->length(); i += 8)
{
for(unsigned int j = 0; j < 8; ++j)
{
cTempArray[j] = psBinaryString->c_str()[j + i];
}
std::bitset<8> Bitset(cTempArray);
cTemp = static_cast<char>(Bitset.to_ulong());
psCharakterString->push_back(cTemp);
}
return true;
}
#endif
Which one of these two solutions is better?
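The crash happens because the std::bitset constructor that takes a char pointer expects either an explicit length or a null-terminated string. cTempArray holds exactly 8 characters with no terminator, so the constructor reads past the end of the array.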
(see the documentation at http://en.cppreference.com/w/cpp/utility/bitset/bitset)
So you should be using:
std::bitset<8> Bitset(cTempArray, 8);
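Putting that together, a version of the function that needs no temporary array and no #ifdef could look roughly like this. It is only a sketch: it keeps the question's names, adds const to the input parameter, and the length check is my own assumption about how incomplete input should be handled.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

bool bin2string(const std::string* psBinaryString, std::string* psCharakterString)
{
    assert(psBinaryString != 0 && psCharakterString != 0);
    psCharakterString->clear();
    // Assumption: input that is not a whole number of bytes is rejected.
    if (psBinaryString->length() % 8 != 0) {
        return false;
    }
    for (std::size_t i = 0; i < psBinaryString->length(); i += 8) {
        // The (pointer, count) constructor reads exactly 8 characters,
        // so no null terminator is required. It throws
        // std::invalid_argument for characters other than '0' and '1'.
        std::bitset<8> Bitset(psBinaryString->data() + i, 8);
        psCharakterString->push_back(static_cast<char>(Bitset.to_ulong()));
    }
    return true;
}
```

For example, the input 0100100001101001 decodes to the two bytes 0x48 and 0x69, which is "Hi" in ASCII.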

Segmentation fault / glibc detected when creating shared library

EDIT: I tried with gcc-4.8.1 and still get the same error.
I am trying to implement a simple matrix multiplication example using pthreads via a shared library, but I get this error when I try to create the shared library:
g++ -shared -o libMatmul.so matmul.o
collect2: ld terminated with signal 11 [Segmentation fault], core dumped
Here is the code I am using:
matmul.h:
#ifndef matmul_h__
#define matmul_h__
#define SIZE 10
typedef struct {
int dim;
int slice;
} matThread;
int num_thrd;
int A[SIZE][SIZE], B[SIZE][SIZE], C[SIZE][SIZE];
int m[SIZE][SIZE];
extern void init_matrix(int m[SIZE][SIZE]);
extern void print_matrix(int m[SIZE][SIZE]);
extern void* multiply(void* matThread);
#endif
matmul.c:
extern "C"
{
#include <pthread.h>
#include <unistd.h>
}
#include <iostream>
#include "matmul.h"
using namespace std ;
matThread* s=NULL;
// initialize a matrix
void init_matrix(int m[SIZE][SIZE])
{
int i, j, val = 0;
for (i = 0; i < SIZE; i++)
for (j = 0; j < SIZE; j++)
m[i][j] = val++;
}
void print_matrix(int m[SIZE][SIZE])
{
int i, j;
for (i = 0; i < SIZE; i++) {
cout<<"\n\t|" ;
for (j = 0; j < SIZE; j++)
cout<<m[i][j] ;
cout<<"|";
}
}
// thread function: taking "slice" as its argument
void* multiply(void* param)
{
matThread* s = (matThread*)param; // retrive the slice info
int slice1=s->slice;
int D= s->dim=10;
int from = (slice1 * D)/num_thrd; // note that this 'slicing' works fine
int to = ((slice1+1) * D)/num_thrd; // even if SIZE is not divisible by num_thrd
int i,j,k;
cout<<"computing slice " << slice1<<" from row "<< from<< " to " <<to-1<<endl;
for (i = from; i < to; i++)
{
for (j = 0; j < D; j++)
{
C[i][j] = 0;
for ( k = 0; k < D; k++)
C[i][j] += A[i][k]*B[k][j];
}
}
cout<<" finished slice "<<slice1<<endl;
return NULL;
}
main.c:
extern "C"
{
#include <pthread.h>
#include <unistd.h>
}
#include <iostream>
#include "matmul.h"
using namespace std;
// Size by SIZE matrices
// number of threads
matThread* parm=NULL;
int main(int argc, char* argv[])
{
pthread_t* thread; // pointer to a group of threads
int i;
if (argc!=2)
{
cout<<"Usage:"<< argv[0]<<" number_of_threads"<<endl;
exit(-1);
}
num_thrd = atoi(argv[1]);
init_matrix(A);
init_matrix(B);
thread = (pthread_t*) malloc(num_thrd*sizeof(pthread_t));
matThread *parm = new matThread();
for (i = 0; i < num_thrd; i++)
{
parm->slice=i;
// creates each thread working on its own slice of i
if (pthread_create (&thread[i], NULL, multiply, (void*)parm) != 0)
{
cerr<<"Can't create thread"<<endl;
free(thread);
exit(-1);
}
}
for (i = 1; i < num_thrd; i++)
pthread_join (thread[i], NULL);
cout<<"\n\n";
print_matrix(A);
cout<<"\n\n\t *"<<endl;
print_matrix(B);
cout<<"\n\n\t="<<endl;
print_matrix(C);
cout<<"\n\n";
free(thread);
return 0;
}
The commands that I use are:
g++ -c -Wall -fPIC matmul.cpp -o matmul.o and
g++ -shared -o libMatmul.so matmul.o
The code might look a little off because I am passing SIZE (dim) in a struct when it's already a #define, but this is how I want it implemented. It's a test program for a bigger project that I am working on.
Any help is greatly appreciated! Thanks in advance.
First, you're mixing C and C++ idioms (calling both free and new, for instance) without using any C++ standard library features (such as a std::vector or std::list instead of a C array). Your code is 'technically' valid (minus some bugs), but mixing C and C++ like that is not good practice: the many small idiosyncratic differences between the two (in syntax, compilation and linkage, for example) add confusion when the intent isn't explicit.
That being said, I've made some changes to your code to make it C++98 compatible (and fix the bugs):
start matmul.h:
#ifndef matmul_h__
#define matmul_h__
#define SIZE 10
#include <pthread.h>
typedef struct matThread {
int slice;
int dim;
pthread_t handle;
matThread() : slice(0), dim(0), handle(0) {}
matThread(int s) : slice(s), dim(0), handle(0) {}
matThread(int s, int d) : slice(s), dim(d), handle(0) {}
} matThread;
// explicitly define as extern (for clarity)
extern int num_thrd;
extern int A[SIZE][SIZE];
extern int B[SIZE][SIZE];
extern int C[SIZE][SIZE];
extern void init_matrix(int m[][SIZE]);
extern void print_matrix(int m[][SIZE]);
extern void* multiply(void* matThread);
#endif
start matmul.cpp:
#include <iostream> // <stdio.h>
#include "matmul.h"
int num_thrd = 1;
int A[SIZE][SIZE];
int B[SIZE][SIZE];
int C[SIZE][SIZE];
// initialize a matrix
void init_matrix(int m[][SIZE])
{
int i, j, val;
for (i = 0, val = -1; i < SIZE; i++) {
for (j = 0; j < SIZE; j++) {
m[i][j] = ++val;
}
}
}
void print_matrix(int m[][SIZE])
{
int i, j;
for (i = 0; i < SIZE; i++) {
std::cout << "\n\t|"; // printf
for (j = 0; j < SIZE; j++) {
std::cout << m[i][j];
}
std::cout << "|"; // printf
}
}
// thread function: taking "slice" as its argument
void* multiply(void* param)
{
matThread* s = (matThread*)param; // retrive the slice info
int slice1 = s->slice;
int D = s->dim = 10;
int from = (slice1 * D) / num_thrd; // note that this 'slicing' works fine
int to = ((slice1+1) * D) / num_thrd; // even if SIZE is not divisible by num_thrd
int i, j, k;
std::cout << "computing slice " << slice1 << " from row " << from << " to " << (to-1) << std::endl; // printf
for (i = from; i < to; i++) {
for (j = 0; j < D; j++) {
C[i][j] = 0;
for ( k = 0; k < D; k++) {
C[i][j] += A[i][k]*B[k][j];
}
}
}
std::cout << " finished slice " << slice1 << std::endl; // printf
return NULL;
}
start main.cpp:
#include <iostream>
#include <cstdlib> // atoi .. if C++11, you could use std::stoi in <string>
#include <vector> // std::vector
#include "matmul.h"
int main(int argc, char** argv)
{
if (argc != 2) {
std::cout << "Usage: " << argv[0] << " number_of_threads" << std::endl;
return -1;
} else {
num_thrd = std::atoi(argv[1]);
}
std::vector<matThread> mt(num_thrd); // a variable-length array is not standard C++
int i = 0;
init_matrix(A);
init_matrix(B);
for (i = 0; i < num_thrd; i++) {
mt[i].slice = i;
// creates each thread working on its own slice of i
if (pthread_create(&mt[i].handle, NULL, &multiply, static_cast<void*>(&mt[i])) != 0) {
std::cerr << "Can't create thread\n";
return -1;
}
}
for (i = 0; i < num_thrd; i++) {
pthread_join(mt[i].handle, NULL);
}
std::cout << "\n\n";
print_matrix(A);
std::cout << "\n\n\t *\n";
print_matrix(B);
std::cout << "\n\n\t=\n";
print_matrix(C);
std::cout << "\n\n";
return 0;
}
To compile and use it you'll need to do the following commands:
g++ -c -Wall -fPIC matmul.cpp -o matmul.o
g++ -shared -Wl,-soname,libMatmul.so -o libMatmul.so.1 matmul.o
ln /full/path/to/libMatmul.so.1 /usr/lib/libMatmul.so
g++ main.cpp -o matmul -Wall -L. -lMatmul -pthread
Note that for your system to be able to find and link against the shared library you've just created, you'll need to ensure it's in your distro's lib folder (like /usr/lib/). You can copy/move it over, create a link to it (or a sym link via ln -s if you can't do hard links), and if you don't want to copy/move/link it, you can also ensure your LD_LIBRARY_PATH is properly set to include the build directory.
As I said, your code is NOT inherently C++ aside from the few print statements (std::cout, etc.); by changing those (std::cout to printf, for example) and a few other minor things, you could compile it as standard C99 code. I'm not 100% sure how the rest of your shared library will be designed, so I didn't change the structure of the lib code (i.e. the functions you have). If you wanted this code to be 'more C++' (i.e. with classes/namespaces, STL, etc.), you'd basically need to redesign it, but given the context I don't think that's necessary unless you have a specific need for it.
I hope that can help.
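As for the comment in multiply that the slicing works even when SIZE is not divisible by num_thrd: you can check the arithmetic on its own, without any threads. A small sketch, where sliceBounds is a hypothetical helper that only repeats the from/to computation:

```cpp
#include <cassert>
#include <utility>

// Row range [from, to) for one slice, using the same integer
// arithmetic as multiply(): from = slice*dim/n, to = (slice+1)*dim/n.
std::pair<int, int> sliceBounds(int slice, int dim, int numThreads)
{
    assert(numThreads > 0 && slice >= 0 && slice < numThreads);
    const int from = (slice * dim) / numThreads;
    const int to = ((slice + 1) * dim) / numThreads;
    return std::make_pair(from, to);
}
```

With 10 rows and 3 threads this gives the ranges [0, 3), [3, 6) and [6, 10): each slice starts where the previous one ends and the last one ends at dim, so every row is computed exactly once. With more threads than rows, some slices simply come out empty.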
Should
for (i = 1; i < num_thrd; i++)
not be
for (i = 0; i < num_thrd; i++)
You created num_thrd threads but did not join all of them (thread[0] is never joined), so there is a race condition: you may be reading the results before that thread has finished.