What is wrong with vectors of Armadillo library - c++

I previously posted a question that was somewhat complicated. Here I have reduced it to a simple example that reproduces the same problem with the Armadillo library, which is supposed to be very fast.
Consider the following code with two main functions, only one of which is enabled at a time:
#include <armadillo>
const int runmax=50000000;
arma::vec::fixed<3u> x,y;
double a,b;
void init()
{
x<<0.2<<arma::endr<<-0.3<<arma::endr<<0.1;
y<<1<<arma::endr<<1<<arma::endr<<1;
a=0.3;
b=a*a;
}
template<unsigned N>
void scaledsum(double a,arma::vec::fixed<N> &x,double b,arma::vec::fixed<N> &y)
{
for(int i=0;i<N;i++)
{
y(i)=a*x(i)+b*y(i);
}
}
void main1()
{
for(int i=0;i<runmax;i++)
{
y=a*x+b*y;
}
}
void main2()
{
for(int i=0;i<runmax;i++)
{
scaledsum(a,x,b,y);
}
}
int main()
{
init();
//main1();
main2();
y.print();
return 0;
}
perf stat ./main1
perf stat ./main2
I expected main1 to run faster than main2, or at least very close to it, but main1 runs slower.
I do not understand these profiling results:
main1
0.209682235 seconds time elapsed
main2
0.121644777 seconds time elapsed
PS
Compilation command: g++ -std=c++11 -O3 -s -DNDEBUG test.cpp

Related

Program crashing with C++ DLL using OpenMP

I have a C++ program that uses OpenMP, and I need to port it into a DLL so I can call it from Python. It returns an array of double values, which are calculated using many for loops with OpenMP pragmas. I was doubtful that this would work, so I started with a small test program that calculates the value of Pi in a loop with different precision values; that way I could measure performance and confirm that OpenMP works properly. The plain (without OpenMP) implementation works fine from both Python and C++, but the OpenMP variant gives a runtime error in Python (exception: access violation writing 0x000000000000A6C8) and crashes without an error in C++. The OpenMP variant also works fine when it is built as a regular executable rather than a DLL. The DLL is built with a makefile. The app that uses the DLL is built into an executable with g++ with no flags (its source code is in UnitMain.cpp). All the relevant code and the Makefile are below (I omitted some files and functions for brevity).
UPD: I tried the Microsoft compiler and it works; I also tested a Linux dynamic library on WSL/g++ and it works too. It looks like the problem is specific to gcc on Windows. I'll try another version of gcc (my current version is this):
Thread model: posix gcc version 8.1.0 (x86_64-posix-seh-rev0, Built by MinGW-W64 project)
UnitFunctions.cpp
#include "UnitFunctions.h"
#include <omp.h>
#include <stdio.h>
#include <string.h>
typedef long long int64_t;
double pi(int64_t n) {
double sum = 0.0;
int64_t sign = 1;
for (int64_t i = 0; i < n; ++i) {
sum += sign/(2.0*i+1.0);
sign *= -1;
}
return 4.0*sum;
}
void calcPiOmp(double* arr, int N) {
int64_t base = 10e5;
#pragma omp parallel for
for(int i = 0; i < N; ++i) {
arr[i] = pi(base+i);
}
}
UnitMain.cpp
#include <windows.h>
#include <iostream>
using namespace std;
struct DllHandle
{
DllHandle(const char * const filename)
: h(LoadLibrary(filename)) {}
~DllHandle() { if (h) FreeLibrary(h); }
const HINSTANCE Get() const { return h; }
private:
HINSTANCE h;
};
int main()
{
const DllHandle h("Functions.DLL");
if (!h.Get())
{
MessageBox(0,"Could not load DLL","UnitCallDll",MB_OK);
return 1;
}
typedef const void (*calcPiOmp_t) (double*, int);
const auto calcPiOmp = reinterpret_cast<calcPiOmp_t>(GetProcAddress(h.Get(), "calcPiOmp"));
double arr[80];
calcPiOmp(arr, 80);
cout << arr[0] << endl;
return 0;
}
Makefile
all: UnitEntryPoint.o UnitFunctions.o
g++ -m64 -fopenmp -s -o Functions.dll UnitEntryPoint.o UnitFunctions.o
UnitEntryPoint.o: UnitEntryPoint.cpp
g++ -m64 -fopenmp -c UnitEntryPoint.cpp
UnitFunctions.o: UnitFunctions.cpp
g++ -m64 -fopenmp -c UnitFunctions.cpp
A Python script
import numpy as np
import ctypes as ct
cpp_fun = ct.CDLL('./Functions.dll')
cpp_fun.calcPiNaive.argtypes = [np.ctypeslib.ndpointer(), ct.c_int]
cpp_fun.calcPiOmp.argtypes = [np.ctypeslib.ndpointer(), ct.c_int]
arrOmp = np.zeros(N).astype('float64')
cpp_fun.calcPiOmp(arrOmp, N)

Exclude C++ standard library function calls from gprof output

I am using the C++ standard library in some C++ code and this makefile:
CC=g++
CXXFLAGS=-Wall -Werror -ggdb3 -std=c++11 -pedantic $(OTHERFLAGS)
cpp_sort: cpp_sort.o
g++ -o $@ $(CXXFLAGS) $^
clean:
rm -rf *.o cpp_sort *~
The source code:
#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;
void get_input(vector<int>& items, int size) {
for (int i = 0; i < size; ++i) {
int element;
cin >> element;
items.push_back(element);
}
}
void cpp_sort(vector<int>& items) {
sort(items.begin(), items.end());
}
void print_array(vector<int>& items) {
for (auto& item : items) {
cout << item << ' ';
}
cout << endl;
}
int main() {
int size;
cin >> size;
vector<int> items;
items.reserve(size);
get_input(items, size);
cpp_sort(items);
print_array(items);
}
I call make like this:
make OTHERFLAGS=-pg
run the program (where large.txt is a long list of integers):
./cpp_sort <large.txt
and view the profiling information:
gprof ./cpp_sort
Which is fine and it does work, but the calling of my functions is obscured by all the C++ standard library function calls. Is there a way to exclude the standard library internal function calls?

The Cost of C++ Exceptions and setjmp/longjmp

I wrote a test to measure the cost of C++ exceptions with threads.
#include <cstdlib>
#include <iostream>
#include <vector>
#include <thread>
static const int N = 100000;
static void doSomething(int& n)
{
--n;
throw 1;
}
static void throwManyManyTimes()
{
int n = N;
while (n)
{
try
{
doSomething(n);
}
catch (int n)
{
switch (n)
{
case 1:
continue;
default:
std::cout << "error" << std::endl;
std::exit(EXIT_FAILURE);
}
}
}
}
int main(void)
{
int nCPUs = std::thread::hardware_concurrency();
std::vector<std::thread> threads(nCPUs);
for (int i = 0; i < nCPUs; ++i)
{
threads[i] = std::thread(throwManyManyTimes);
}
for (int i = 0; i < nCPUs; ++i)
{
threads[i].join();
}
return EXIT_SUCCESS;
}
Here's the C version that I initially wrote for fun.
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>
#include <glib.h>
#define N 100000
static GPrivate jumpBuffer;
static void doSomething(volatile int *pn)
{
jmp_buf *pjb = g_private_get(&jumpBuffer);
--*pn;
longjmp(*pjb, 1);
}
static void *throwManyManyTimes(void *p)
{
jmp_buf jb;
volatile int n = N;
(void)p;
g_private_set(&jumpBuffer, &jb);
while (n)
{
switch (setjmp(jb))
{
case 0:
doSomething(&n);
case 1:
continue;
default:
printf("error\n");
exit(EXIT_FAILURE);
}
}
return NULL;
}
int main(void)
{
int nCPUs = g_get_num_processors();
GThread *threads[nCPUs];
int i;
for (i = 0; i < nCPUs; ++i)
{
threads[i] = g_thread_new(NULL, throwManyManyTimes, NULL);
}
for (i = 0; i < nCPUs; ++i)
{
g_thread_join(threads[i]);
}
return EXIT_SUCCESS;
}
The C++ version runs very slow compared to the C version.
$ g++ -O3 -g -std=c++11 test.cpp -o cpp-test -pthread
$ gcc -O3 -g -std=c89 test.c -o c-test `pkg-config glib-2.0 --cflags --libs`
$ time ./cpp-test
real 0m1.089s
user 0m2.345s
sys 0m1.637s
$ time ./c-test
real 0m0.024s
user 0m0.067s
sys 0m0.000s
So I ran the callgrind profiler.
For cpp-test, __cxa_throw was called exactly 400,000 times with self-cost of 8,000,032.
For c-test, __longjmp_chk was called exactly 400,000 times with self-cost of 5,600,000.
The whole cost of cpp-test is 4,048,441,756.
The whole cost of c-test is 60,417,722.
I guess something much more than simply saving the state of the jump-point and later resuming is done with C++ exceptions. I couldn't test with larger N because the callgrind profiler will run forever for the C++ test.
What is the extra cost involved in C++ exceptions making it many times slower than the setjmp/longjmp pair at least in this example?
This is by design.
C++ exceptions are expected to be exceptional in nature and are optimized accordingly. The program is compiled to be most efficient when an exception does not happen.
You can verify this by commenting out the exception from your tests.
In C++:
//throw 1;
$ g++ -O3 -g -std=c++11 test.cpp -o cpp-test -pthread
$ time ./cpp-test
real 0m0.003s
user 0m0.004s
sys 0m0.000s
In C:
/*longjmp(*pjb, 1);*/
$ gcc -O3 -g -std=c89 test.c -o c-test `pkg-config glib-2.0 --cflags --libs`
$ time ./c-test
real 0m0.008s
user 0m0.012s
sys 0m0.004s
What is the extra cost involved in C++ exceptions making it many times slower than the setjmp/longjmp pair at least in this example?
g++ implements zero-cost model exceptions, which have no effective overhead* when an exception is not thrown. Machine code is produced as if there were no try/catch block.
The cost of this zero-overhead is that a table lookup must be performed on the program counter when an exception is thrown, to determine a jump to the appropriate code for performing stack unwinding. This puts the entire try/catch block implementation within the code performing a throw.
Your extra cost is a table lookup.
*Some minor timing voodoo may occur, as the presence of a PC lookup table may affect memory layout, which may affect CPU cache misses.
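The difference between the two paths described above can be observed with a small harness. This is only a sketch: maybe_throw, run_loop, and time_loop_us are hypothetical names introduced here, not part of the original tests, and the timings it reports will vary with compiler, flags, and hardware.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: throws an int only when asked to.
void maybe_throw(bool do_throw)
{
    if (do_throw) {
        throw 1;
    }
}

// Enters a try block `iterations` times and returns how many
// exceptions were actually caught.
int run_loop(int iterations, bool do_throw)
{
    assert(iterations >= 0);
    int caught = 0;
    for (int i = 0; i < iterations; ++i) {
        try {
            maybe_throw(do_throw);   // slow path only when do_throw is true
        } catch (int) {
            ++caught;
        }
    }
    return caught;
}

// Wall-clock time of run_loop in microseconds.
long long time_loop_us(int iterations, bool do_throw)
{
    typedef std::chrono::steady_clock clock;
    const clock::time_point start = clock::now();
    volatile int sink = run_loop(iterations, do_throw);  // keep the result alive
    (void)sink;
    const clock::time_point stop = clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling time_loop_us(100000, true) and time_loop_us(100000, false) on the same machine should show the throwing run taking far longer. With optimization enabled the non-throwing loop may be removed entirely, so treat the numbers as a rough indication only.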

execution time serial and parallel openmp program mac os

I have a problem in understanding the execution time of my program (serial and parallel version).
Here you can find the part of the main function I am talking about:
stopwatch temp2;
temp2.start();
#pragma omp parallel
{
#pragma omp for
for (int i=0; i<100; i++){
int a=itemVectorTraining->at(mapScaledItem->at(5755)).timesIWatchedAfterC(itemVectorTraining->at(mapScaledItem->at(236611)), deviceVectorTraining, 180);
}
}
temp2.stop();
cout<<"Parallel: "<<temp2.elapsed_ms()<<endl;
stopwatch temp1;
temp1.start();
for (int i=0; i<100; i++){
int a=itemVectorTraining->at(mapScaledItem->at(5755)).timesIWatchedAfterC(itemVectorTraining->at(mapScaledItem->at(236611)), deviceVectorTraining, 180);
}
temp1.stop();
cout<<"Serial: "<<temp1.elapsed_ms()<<endl;
where "stopwatch" is a well-defined object (I hope so, since my professor created it :) ) that gives a correct time measurement in milliseconds.
The problem is that when I execute the main with this command line:
g++-4.9 -std=c++11 -o test -Iinclude main.cpp
I obtain this output
Parallel: 140821125
Serial: 89847
while adding "-fopenmp", i.e. with this command line:
g++-4.9 -fopenmp -std=c++11 -o testVale main.cpp
I get:
Parallel: 39413
Serial: 2089786185294
And it doesn't make any sense! Moreover, although the program reports such large values for Parallel in the first case and for Serial in the second case, the code doesn't actually take that long to run.
I am compiling from the terminal on Mac OS X, and normally I should obtain something like:
Parallel:38548
Serial 68007
Does anyone have an idea of what's going on with the compilation of the program?
Thank you very much!
Code of stopwatch:
#ifndef CGLIFE_STOPWATCH_HPP
#define CGLIFE_STOPWATCH_HPP
#include <chrono>
class stopwatch {
private:
typedef std::chrono::high_resolution_clock clock;
bool running;
clock::time_point start_time;
clock::duration elapsed;
public:
stopwatch() {
running = false;
}
// Starts the stopwatch.
void start() {
if (!running) {
running = true;
start_time = clock::now();
}
}
// Stops the stopwatch.
void stop() {
if (running) {
running = false;
elapsed += clock::now() - start_time;
}
}
// Resets the elapsed time to 0.
void reset() {
elapsed = clock::duration();
}
// Returns the total elapsed time in milliseconds.
// If the stopwatch is running, the elapsed time
// includes the time elapsed in the current interval.
long long elapsed_ms() const {
clock::duration total;
if (!running) {
total = elapsed;
} else {
total = elapsed + (clock::now() - start_time);
}
return std::chrono::duration_cast<std::chrono::milliseconds>(total).count();
}
};
#endif
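The stopwatch::elapsed member appears to be uninitialized. Although clock::duration is a class type, std::chrono::duration has a defaulted default constructor, so a default-initialized duration holds an indeterminate tick count.
Either initialize it in the stopwatch constructor: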
stopwatch() {
running = false;
elapsed = clock::duration();
}
or always call reset before starting one:
stopwatch temp2;
temp2.reset();
temp2.start();
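A trimmed-down sketch of the first option, using a constructor initializer list so that no member is ever left with an indeterminate value. It keeps the names from the question but omits reset() and the running-interval branch of elapsed_ms() for brevity; that trimming is a choice made here, not part of the original class.

```cpp
#include <chrono>

// Sketch of the stopwatch with every member initialized on construction.
class stopwatch {
private:
    typedef std::chrono::high_resolution_clock clock;
    bool running;
    clock::time_point start_time;
    clock::duration elapsed;
public:
    // duration::zero() makes the starting value explicit.
    stopwatch()
        : running(false), start_time(), elapsed(clock::duration::zero()) {}

    // Starts the stopwatch.
    void start() {
        if (!running) {
            running = true;
            start_time = clock::now();
        }
    }

    // Stops the stopwatch and accumulates the interval.
    void stop() {
        if (running) {
            running = false;
            elapsed += clock::now() - start_time;
        }
    }

    // Returns the accumulated time in milliseconds (stopped intervals only).
    long long elapsed_ms() const {
        return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
    }
};
```

With this constructor, a freshly created stopwatch reports 0 ms before start() is called, regardless of what happened to be in memory or which compiler flags were used.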

Why is my program giving a totally different output when I compile with mingw as compared to g++

So when I compile this code (using the Mersenne Twister found here: http://www-personal.umich.edu/~wagnerr/MersenneTwister.html ):
#include <iostream>
#include <cmath>
#include "mtrand.h"
using namespace std;
double pythag(double x, double y) {
double derp=0;
derp=(x*x)+(y*y);
derp=sqrt(derp);
}
int main() {
double x=0;
double y=0;
double pi=0;
double hold1=0;
double hold2=0;
double hits=0;
MTRand mt;
mt.seed();
// cout.precision(10);
for(long i=1; i<=100000000000l; i++) {
x=abs(mt.rand());
y=abs(mt.rand());
if(pythag(x,y)<=1) {
hits++;
}
if(i%100000l==0) {
pi=(4*hits)/i;
cout << "\r" << i << " " << pi ;
}
}
cout <<"\n";
return 42;
}
Using g++ ("g++ pi.cc -o pi")
and running the resulting application, I get the output I wanted: a running tally of pi calculated using the Monte Carlo method.
But when I compile with mingw g++ ("i686-pc-mingw32-g++ -static-libstdc++ -static-libgcc pi.cc -o pi.exe"),
I always get a running tally of 0.
Any help is greatly appreciated.
Perhaps it's because you omitted the return statement:
double pythag(double x, double y) {
double derp=0;
derp=(x*x)+(y*y);
derp=sqrt(derp);
// You're missing this!!!
return derp;
}
I'd be surprised if you didn't get any warnings or errors about this.
pythag() does not return anything, as Loki is trying to say without telling you the exact answer. That means the return value is not specified.
Why do you return 42 in main()?! 8-)
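To make the point about the missing return concrete: flowing off the end of a non-void function is undefined behavior in C++, which is why two compilers can legitimately produce different results from the same source. A minimal corrected sketch follows; inside_unit_circle is a hypothetical helper added here for illustration and is not in the original program.

```cpp
#include <cmath>

// Distance of the point (x, y) from the origin.
double pythag(double x, double y)
{
    return std::sqrt(x * x + y * y);   // the value is now actually returned
}

// Hypothetical helper: the hit test used by the Monte Carlo loop.
bool inside_unit_circle(double x, double y)
{
    return pythag(x, y) <= 1.0;
}
```

Compiling the original file with g++ -Wall should report the problem, since -Wall enables GCC's -Wreturn-type warning.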