I have a small function written in Haskell with the following type:
foreign export ccall sget :: Ptr CInt -> CSize -> Ptr CSize -> IO (Ptr CInt)
I am calling this from multiple C++ threads running concurrently (via
TBB). During this part of the execution of my program I can barely get a
load average above 1.4 even though I'm running on a six-core CPU (12
logical cores). I therefore suspect that either the calls into Haskell all
get funnelled through a single thread, or there is some significant
synchronization going on.
I am not doing any such thing explicitly; all the function does is operate on the incoming data (after storing it in a Data.Vector.Storable) and return the result as a newly allocated array (via Foreign.Marshal.Array).
Is there anything I need to do to fully enable concurrent calls like this?
I am using GHC 8.6.5 on Debian Linux (bullseye/testing), and I am compiling with -threaded -O2.
Looking forward to reading some advice,
Sebastian
Using the simple example at the end of this answer, if I compile with:
$ ghc -O2 Worker.hs
$ ghc -O2 -threaded Worker.o caller.c -lpthread -no-hs-main -o test
then running it with ./test occupies only one core at 100%. I need to run it with ./test +RTS -N, and then on my 4-core desktop, it runs at 400% with a load average of around 4.0.
So, the RTS -N flag affects the number of parallel threads that can simultaneously run an exported Haskell function, and there is no special action required (other than compiling with -threaded and running with +RTS -N) to fully utilize all available cores.
So, there must be something about your example that's causing the problem. It could be contention between threads over some shared data structure. Or, maybe parallel garbage collection is causing problems; I've observed parallel GC causing worse performance with increasing -N in a simple test case (details forgotten, sadly), so you could try turning off parallel GC with -qg or limiting the number of cores involved with -qn2 or something. To enable these options, you need to call hs_init_with_rtsopts() in place of the usual hs_init() as in my example.
If that doesn't work, I think you'll have to try to narrow down the problem and post a minimal example that illustrates the performance issue to get more help.
My example:
caller.c
#include "HsFFI.h"
#include "Rts.h"
#include "Worker_stub.h"
#include <pthread.h>
#define NUM_THREAD 4
void*
work(void* arg)
{
    for (;;) {
        fibIO(30);
    }
}

int
main(int argc, char **argv)
{
    hs_init_with_rtsopts(&argc, &argv);

    pthread_t threads[NUM_THREAD];
    for (int i = 0; i < NUM_THREAD; ++i) {
        int rc = pthread_create(&threads[i], NULL, work, NULL);
    }
    for (int i = 0; i < NUM_THREAD; ++i) {
        pthread_join(threads[i], NULL);
    }

    hs_exit();
    return 0;
}
Worker.hs
module Worker where
import Foreign
fibIO :: Int -> IO Int
fibIO = return . fib

fib :: Int -> Int
fib n | n > 1     = fib (n-1) + fib (n-2)
      | otherwise = 1
foreign export ccall fibIO :: Int -> IO Int
Related
Consider the following code:
#include <iostream>
#include <chrono>

using Time = std::chrono::high_resolution_clock;
using us = std::chrono::microseconds;

int main()
{
    volatile int i, k;
    const int n = 1000000;
    for (k = 0; k < 200; ++k) {
        auto begin = Time::now();
        for (i = 0; i < n; ++i); // <--
        auto end = Time::now();
        auto dur = std::chrono::duration_cast<us>(end - begin).count();
        std::cout << dur << std::endl;
    }
    return 0;
}
I am repeatedly measuring the execution time of the inner for loop.
The results are shown in the following plot (y: duration, x: repetition):
What is causing the decreasing of the loop execution time?
Environment: Linux (kernel 4.2) on an Intel i7-2600; compiled with: g++ -std=c++11 main.cpp -O0 -o main
Edit 1
The question is not about compiler optimization or performance benchmarks.
The question is why the performance gets better over time.
I am trying to understand what is happening at run-time.
Edit 2
As proposed by Vaughn Cato, I have changed the CPU frequency scaling policy to "Performance". Now I am getting the following results:
It confirms Vaughn Cato's conjecture. Sorry for the silly question.
What you are probably seeing is CPU frequency scaling (throttling). The CPU goes into a low-frequency state to save power when it isn't being heavily used.
Just before running your program, the CPU clock speed is probably fairly low, since there is no big load. When you run your program, the busy loop increases the load, and the CPU clock speed goes up until you hit the maximum clock speed, decreasing your times.
If you run your program several times in a row, you'll probably see the times stay at a lower value after the first run.
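If changing the frequency-scaling governor is not an option, the same effect should be visible with a warm-up loop before the timed section. Here is a minimal sketch based on the code above (the 500 ms warm-up duration is an arbitrary choice of mine, not something measured):

#include <chrono>
#include <iostream>

using Time = std::chrono::high_resolution_clock;
using us = std::chrono::microseconds;

int main()
{
    volatile int i;
    const int n = 1000000;

    // Busy-wait for roughly half a second so the governor has time to
    // raise the clock frequency before the measurements start.
    auto warmupStart = Time::now();
    while (std::chrono::duration_cast<us>(Time::now() - warmupStart).count() < 500000)
        for (i = 0; i < n; ++i);

    // The measured iterations should now all run at the boosted frequency.
    for (int k = 0; k < 200; ++k) {
        auto begin = Time::now();
        for (i = 0; i < n; ++i);
        auto end = Time::now();
        std::cout << std::chrono::duration_cast<us>(end - begin).count() << std::endl;
    }
    return 0;
}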
In your original experiment, there are too many variables that can affect the measurements:
the use of your processor by other active processes (i.e. scheduling by your OS)
the question of whether your loop is optimized away or not
the access to and buffering of the console output
the initial frequency state of your CPU (see the answer about throttling)
I must admit that I was very skeptical about your observations. I therefore wrote a small variant using a preallocated vector, to avoid I/O synchronisation effects:
volatile int i, k;
const int n = 1000000, kmax = 200, n_avg = 30;
std::vector<long> v(kmax, 0);
for (k = 0; k < kmax; ++k) {
    auto begin = Time::now();
    for (i = 0; i < n; ++i); // <-- remains thanks to volatile
    auto end = Time::now();
    auto dur = std::chrono::duration_cast<us>(end - begin).count();
    v[k] = dur;
}
I then ran it several times on ideone (where, given the scale of its use, we can assume that on average the processor is under constant load). Indeed, your observations seemed to be confirmed.
I guess that this could be related to branch prediction, which should improve as the repetitive pattern is learned.
However, I went on, updated the code slightly, and added a loop to repeat the experiment several times. Then I also started to get runs where your observation was not confirmed (i.e. at the end, the time was higher). But it may also be that the many other processes running on ideone influence the branch prediction in a different manner.
So in the end, concluding anything would require a more careful experiment, on a machine running this benchmark (and only it) for a couple of hours.
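For reference, such a repeated experiment could be structured roughly like this (a sketch; the choice to report the median per position is mine):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

using Time = std::chrono::high_resolution_clock;
using us = std::chrono::microseconds;

int main()
{
    volatile int i;
    const int n = 1000000, kmax = 200, runs = 20;
    // durations[k] collects the k-th measurement of every run
    std::vector<std::vector<long>> durations(kmax);

    for (int r = 0; r < runs; ++r) {
        for (int k = 0; k < kmax; ++k) {
            auto begin = Time::now();
            for (i = 0; i < n; ++i); // kept alive by volatile
            auto end = Time::now();
            durations[k].push_back(
                std::chrono::duration_cast<us>(end - begin).count());
        }
    }

    // report the median duration for each position k across all runs
    for (int k = 0; k < kmax; ++k) {
        auto& d = durations[k];
        std::nth_element(d.begin(), d.begin() + d.size() / 2, d.end());
        std::cout << k << " " << d[d.size() / 2] << "\n";
    }
    return 0;
}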
(I have tried to simplify this as much as I could to find out where I'm doing something wrong.)
The idea of the code is that I have a global array *v (I hope using this array isn't slowing things down; the threads should never access the same value because they all work on different ranges), and I try to create 2 threads, each sorting the first half and the second half respectively by calling the function merge_sort() with the appropriate parameters.
On the threaded run, I see the process going to 80-100% CPU usage (on a dual-core CPU), while on the non-threaded run it stays at only 50%, yet the run times are very close.
This is the (relevant) code:
// These are the 2 sorting functions; each thread will call merge_sort(..).
// Is this a problem - both threads calling the same (ordinary) function?
void merge (int *v, int start, int middle, int end) {
    // dynamically creates 2 new arrays for v[start..middle] and v[middle+1..end]
    // copies the original values into the 2 halves
    // then merges them back into the v array in sorted order
}

void merge_sort (int *v, int start, int end) {
    // recursively calls merge_sort(v, start, (start+end)/2) and
    // merge_sort(v, (start+end)/2+1, end) to sort the halves
    // then calls merge(v, start, middle, end)
}

// Here I'm expecting each thread to be created and to call merge_sort on its
// specific range (this is a simplified version of the original code, to make
// the bug easier to find)
void* mergesort_t2(void * arg) {
    t_data* th_info = (t_data*)arg;
    merge_sort(v, th_info->a, th_info->b);
    return (void*)0;
}

// in main I simply create 2 threads calling the above function
int main (int argc, char* argv[])
{
    // some stuff

    // getting the clock to calculate run time
    clock_t t_inceput, t_sfarsit;
    t_inceput = clock();

    // ignore crt_depth for this example (in the full code I'm recursively
    // creating new threads and I need this to know when to stop)
    // a and b are the range of values the created thread will have to sort
    pthread_t thread[2];
    t_data next_info[2];
    next_info[0].crt_depth = 1;
    next_info[0].a = 0;
    next_info[0].b = n/2;
    next_info[1].crt_depth = 1;
    next_info[1].a = n/2+1;
    next_info[1].b = n-1;

    for (int i=0; i<2; i++) {
        if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
            cerr << "error\n";
            return err;
        }
    }
    for (int i=0; i<2; i++) {
        if (pthread_join(thread[i], &status) != 0) {
            cerr << "error\n";
            return err;
        }
    }

    // now I merge the 2 sorted halves
    merge(v, 0, n/2, n-1);

    // calculate end time
    t_sfarsit = clock();

    cout << "Sort time (s): " << double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC << endl;
    delete [] v;
}
Output (on 1 million values):
Sort time (s): 1.294
Output with direct calling of merge_sort, no threads:
Sort time (s): 1.388
Output (on 10 million values):
Sort time (s): 12.75
Output with direct calling of merge_sort, no threads:
Sort time (s): 13.838
Solution:
I'd like to thank WhozCraig and Adam too, as they hinted at this from the beginning.
I've used the inplace_merge(..) function instead of my own, and the program run times are now as they should be.
Here's my initial merge function (not really sure if it's the initial version; I've probably modified it a few times since. Also, the array indices might be wrong right now - I went back and forth between [a,b] and [a,b); this was just the last commented-out version):
void merge (int *v, int a, int m, int c) { // merges v[a..m] and v[m+1..c] into v[a..c]
    // create the 2 new arrays
    int *st = new int[m-a+1];
    int *dr = new int[c-m+1];

    // copy the values
    for (int i1 = 0; i1 <= m-a; i1++)
        st[i1] = v[a+i1];
    for (int i2 = 0; i2 <= c-(m+1); i2++)
        dr[i2] = v[m+1+i2];

    // merge them back together in sorted order
    int is = 0, id = 0;
    for (int i = 0; i <= c-a; i++) {
        if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
            v[a+i] = st[is];
            is++;
        }
        else {
            v[a+i] = dr[id];
            id++;
        }
    }

    // note: the original "delete st, dr;" only freed st (comma operator);
    // each array needs its own delete[]
    delete[] st;
    delete[] dr;
}
all this was replaced with (note that std::inplace_merge takes half-open ranges [first, middle) and [middle, last), so with the inclusive [a,m], [m+1,c] convention used above the call is):
inplace_merge(v+a, v+m+1, v+c+1);
Edit: some timings on my 3 GHz dual-core CPU:
1 million values:
1 thread : 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s
10 million values:
1 thread : 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s
There's one thing that struck me: "dynamically creates 2 new arrays[...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
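One way to avoid the per-call new/delete is to give each thread a single scratch buffer that it reuses for every merge in its range. A sketch (the extra scratch parameter is my addition, not part of the original code):

#include <vector>

// Same merge as before, but writing through a caller-provided scratch
// buffer instead of allocating two fresh arrays on every call.
// scratch must hold at least end-start+1 ints.
void merge(int* v, int start, int middle, int end, int* scratch)
{
    int left = start, right = middle + 1, out = 0;
    while (left <= middle && right <= end)
        scratch[out++] = (v[left] <= v[right]) ? v[left++] : v[right++];
    while (left <= middle)  scratch[out++] = v[left++];
    while (right <= end)    scratch[out++] = v[right++];
    for (int i = 0; i < out; ++i)
        v[start + i] = scratch[i];
}

void merge_sort(int* v, int start, int end, int* scratch)
{
    if (start >= end) return;
    int middle = (start + end) / 2;
    merge_sort(v, start, middle, scratch);
    merge_sort(v, middle + 1, end, scratch);
    merge(v, start, middle, end, scratch);
}

// Each thread owns one buffer for its whole range, e.g.:
//   std::vector<int> scratch(b - a + 1);
//   merge_sort(v, a, b, scratch.data());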
Another thing is the often-forgotten starting half-sentence for any big-O complexity measurements: "There is an n0 so that for all n>n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.
Note: since OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for sake of those who might find the information useful.
clock() is a wrong interface for measuring time on Linux: it measures CPU time used by the program (see http://linux.die.net/man/3/clock), which in case of multiple threads is the sum of CPU time for all threads. You need to measure elapsed, or wallclock, time. See more details in this SO question: C: using clock() to measure time in multi-threaded programs, which also tells what API can be used instead of clock().
In the MPI-based implementation that you try to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included - so the CPU time is close to wallclock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs; for one reason, if a program waits for e.g. a network event or a message from another MPI process, it still spends time - but not CPU time.
Update: In Microsoft's implementation of C run-time library, clock() returns wall-clock time, so is OK to use for your purpose. It's unclear though if you use Microsoft's toolchain or something else, like Cygwin or MinGW.
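If C++11 is available, a simple way to get elapsed (wall-clock) time without worrying about clock() semantics is std::chrono::steady_clock; a minimal sketch:

#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::steady_clock::now();

    // ... run the threaded sort here ...

    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    std::cout << "Sort time (s): " << seconds << std::endl;
}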
Consider the following code. It runs nThreads threads to copy floats from data1 to data2. It shows no speedup as nThreads increases, and even runs slower. I thought that it might be related to thread-creation overhead, so I increased the sizes of the arrays to insane values, but it still doesn't speed up. Then I read about false sharing, but it appeared to only matter when the falsely shared data are close enough to each other to fit in a cache line, definitely not hundreds of megabytes away.
#include <iostream>
#include <thread>
#include <cstring>
#include <time.h>
#include <sys/time.h>

static inline long double currentTime()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + (long double)(ts.tv_nsec) * 1e-9;
}

void mythread(float* timePrev, float* timeNext, int kMin, int kMax)
{
    for (int q = 0; q < 16; ++q) // take more time
        for (int k = kMin; k < kMax; ++k)
            timeNext[k] = timePrev[k];
}

static inline void runParallelJob(float* timePrev, float* timeNext, int W, int H, int nThreads)
{
    std::thread* threads[nThreads];
    int total = W * H;
    int perThread = total / nThreads;
    for (int t = 0; t < nThreads; ++t)
    {
        int k0 = t * perThread;
        int k1 = (t + 1) * perThread;
        threads[t] = new std::thread(mythread, timePrev, timeNext, k0, k1);
    }
    for (int t = 0; t < nThreads; ++t)
    {
        threads[t]->join();
        delete threads[t];
    }
}

int main()
{
    size_t W = 20000, H = 10000;
    float* data1 = new float[W * H];
    float* data2 = new float[W * H];
    memset(data1, 0, W * H * sizeof(float));
    memset(data2, 0, W * H * sizeof(float));
    for (float nThreads = 1; nThreads <= 8; ++nThreads)
    {
        long double time1 = currentTime();
        runParallelJob(data1, data2, W, H, nThreads);
        long double time2 = currentTime();
        std::cerr << nThreads << " threads: " << (time2 - time1) * 1e+3 << " ms\n";
    }
}
I compile this program with g++ 4.5.1, using the command g++ main.cpp -o threads -std=c++0x -O3 -lrt && ./threads. The output of this program on my Core i7 930 (quad core with HyperThreading) reads:
1 threads: 5426.82 ms
2 threads: 5298.8 ms
3 threads: 5865.99 ms
4 threads: 5845.62 ms
5 threads: 5827.3 ms
6 threads: 5843.36 ms
7 threads: 5919.97 ms
8 threads: 5862.17 ms
Originally the program, which was reduced to this test case, did a few multiplications, divisions and additions in the thread loop instead of plain copying, with the same performance.
Interestingly, if I omit -O3 from the compiler command line, 1 thread appears to execute for 11303 ms while 2 threads for 6398 ms (~2x speedup), but more threads still execute for about 5700 ms (no more speedup).
So, my question is: what am I missing? Why doesn't the performance scale with number of threads in my case?
I think that the factor that limits copy speed here is memory bandwidth. Therefore, having multiple cores copy the data makes no difference, since all the threads have to share the same memory bandwidth.
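A quick way to sanity-check this (a sketch of mine, not part of the original test) is to measure how fast a single plain memcpy of the same amount of data runs; if one thread already moves data at close to the machine's memory bandwidth, adding threads cannot help:

#include <chrono>
#include <cstring>
#include <iostream>

int main()
{
    const size_t N = size_t(20000) * 10000;      // same element count as above
    float* src = new float[N];
    float* dst = new float[N];
    std::memset(src, 0, N * sizeof(float));

    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst, src, N * sizeof(float));
    auto t1 = std::chrono::steady_clock::now();

    double s = std::chrono::duration<double>(t1 - t0).count();
    double gb = 2.0 * N * sizeof(float) / 1e9;   // read + write traffic
    std::cout << gb / s << " GB/s\n";

    delete[] src;
    delete[] dst;
}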
In general, throwing more threads at a given task won't necessarily make it faster. The overhead of managing threads and context switches is expensive, so there needs to be a specific reason to go this route. Waiting for some I/O (database calls, service calls, disk access) would be one common reason to use additional threads, high-CPU tasks could benefit from threading on a multicore machine, and ensuring the user maintains control of an app with a dedicated UI thread is another case.
I have a real puzzler for you folks.
Below is a small, self-contained, simple 40-line program that calculates partial sums of a bunch of numbers and routinely (but stochastically) crashes nodes on a distributed-memory cluster that I'm using. If I spawn 50 PBS jobs that run this code, between 0 and 4 of them will crash their nodes. It happens on a different repeat of the main loop each time and on different nodes each time; there is no discernible pattern. The nodes just go "down" on the ganglia report and I can't ssh to them ("no route to host"). If, instead of submitting jobs, I ssh onto one of the nodes and run my program there and I'm unlucky enough that it crashes, I just stop seeing text and then see on ganglia that the node is dead.
The program is threaded with openmp and the crashes only happen when a large number of threads are spawned (like 12).
The cluster it's killing is a RHEL 5 cluster whose nodes have two 6-core Xeon X5650 processors:
[jamelang@hooke ~]$ tail /etc/redhat-release
Red Hat Enterprise Linux Server release 5.7 (Tikanga)
I have tried enabling core dumps with ulimit -c unlimited, but no files show up. This is the code, with comments:
#include <cstdlib>
#include <cstdio>
#include <omp.h>
int main() {
  const unsigned int numberOfThreads = 12;
  const unsigned int numberOfPartialSums = 30000;
  const unsigned int numbersPerPartialSum = 40;

  // make some numbers
  srand(0); // every instance of program should get same results
  const unsigned int totalNumbersToSum = numbersPerPartialSum * numberOfPartialSums;
  double * inputData = new double[totalNumbersToSum];
  for (unsigned int index = 0; index < totalNumbersToSum; ++index) {
    inputData[index] = rand()/double(RAND_MAX);
  }

  omp_set_num_threads(numberOfThreads);

  // prepare a place to dump output
  double * partialSums = new double[numberOfPartialSums];

  // do the following algorithm many times to induce a problem
  for (unsigned int repeatIndex = 0; repeatIndex < 100000; ++repeatIndex) {
    if (repeatIndex % 1000 == 0) {
      printf("Absurd testing is on repeat %06u\n", repeatIndex);
    }

    #pragma omp parallel for
    for (unsigned int partialSumIndex = 0; partialSumIndex < numberOfPartialSums;
         ++partialSumIndex) {
      // get this partial sum's limits
      const unsigned int beginIndex = numbersPerPartialSum * partialSumIndex;
      const unsigned int endIndex = numbersPerPartialSum * (partialSumIndex + 1);

      // we just sum the 40 numbers, can't get much simpler
      double sumOfNumbers = 0;
      for (unsigned int index = beginIndex; index < endIndex; ++index) {
        // only reading, thread-safe
        sumOfNumbers += inputData[index];
      }

      // writing to non-overlapping indices (guaranteed by omp),
      // should be thread-safe.
      // at worst we would have false sharing, but that would just affect
      // performance, not throw sigabrts.
      partialSums[partialSumIndex] = sumOfNumbers;
    }
  }

  delete[] inputData;
  delete[] partialSums;
  return 0;
}
I compile it with the following:
/home/jamelang/gcc-4.8.1/bin/g++ -O3 -Wall -fopenmp Killer.cc -o Killer
It seems to be linking against the right shared objects:
[jamelang@hooke Killer]$ ldd Killer
linux-vdso.so.1 => (0x00007fffc0599000)
libstdc++.so.6 => /home/jamelang/gcc-4.8.1/lib64/libstdc++.so.6 (0x00002b155b636000)
libm.so.6 => /lib64/libm.so.6 (0x0000003293600000)
libgomp.so.1 => /home/jamelang/gcc-4.8.1/lib64/libgomp.so.1 (0x00002b155b983000)
libgcc_s.so.1 => /home/jamelang/gcc-4.8.1/lib64/libgcc_s.so.1 (0x00002b155bb92000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003293a00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003292e00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003292a00000)
librt.so.1 => /lib64/librt.so.1 (0x0000003298600000)
Some Notes:
1. On OS X Lion with gcc 4.7, this code will throw a SIGABRT, similar to this question: Why is this code giving SIGABRT with openMP?. Using gcc 4.8 seems to fix the issue on OS X. However, using gcc 4.8 on the RHEL5 machine does not fix it. The RHEL5 machine has GLIBC version 2.5, and it seems that yum doesn't provide a later one, so the admins are sticking with 2.5.
2. If I define a SIGABRT signal handler, it doesn't catch the problem on the RHEL5 machine, but it does catch it on OSX with gcc47.
3. I believe that no variables should need to be shared in the omp clause because they can all have private copies, but adding them as shared does not change the behavior.
4. The killing of nodes occurs regardless of the level of optimization used.
5. The killing of nodes occurs even if I run the program from within gdb (i.e. put "gdb -batch -x gdbCommands Killer" in the pbs file) where "gdbCommands" is a file with one line: "run"
6. This example spawns threads on every repeat. One strategy would be to make a parallel block that contains the repeats loop in order to prevent this. However, that does not help me - this example is only representative of a much larger research code in which I cannot use that strategy.
I'm all out of ideas, at my last straw, at my wit's end, ready to pull my hair out, etc with this. Does anyone have suggestions or ideas?
You are trying to parallelize nested for loops; in this case you need to make the variables used in the inner loop private, so that each thread has its own copy. This can be done with the private clause, as in the example below.
#pragma omp parallel for private(j)
for (i = 0; i < height; i++)
    for (j = 0; j < width; j++)
        c[i][j] = 2;
In your case, index and sumOfNumbers need to be private.
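Applied to the loop from the question, that would look roughly like the sketch below, with the work variables hoisted out of the loop so the clause has something to refer to (note that in the posted code index and sumOfNumbers are declared inside the parallel region, which already makes them private to each thread):

#include <omp.h>

// Sketch: the partial-sum loop with the work variables declared outside
// the parallel region and explicitly marked private, so every thread
// gets its own copy.
void computePartialSums(const double* inputData, double* partialSums,
                        unsigned int numberOfPartialSums,
                        unsigned int numbersPerPartialSum)
{
    unsigned int index;
    double sumOfNumbers;
    #pragma omp parallel for private(index, sumOfNumbers)
    for (unsigned int partialSumIndex = 0; partialSumIndex < numberOfPartialSums;
         ++partialSumIndex) {
        const unsigned int beginIndex = numbersPerPartialSum * partialSumIndex;
        const unsigned int endIndex = numbersPerPartialSum * (partialSumIndex + 1);
        sumOfNumbers = 0;
        for (index = beginIndex; index < endIndex; ++index)
            sumOfNumbers += inputData[index];
        partialSums[partialSumIndex] = sumOfNumbers;
    }
}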
I have a C++ application in which I need to compare two values and decide which is greater. The only complication is that one number is represented in log-space, the other is not. For example:
double log_num_1 = log(1.23);
double num_2 = 1.24;
If I want to compare num_1 and num_2, I have to use either log() or exp(), and I'm wondering if one is easier to compute than the other (i.e. runs in less time, in general). You can assume I'm using the standard cmath library.
In other words, the following are semantically equivalent, so which is faster:
if (exp(log_num_1) > num_2) cout << "num_1 is greater";
or
if (log_num_1 > log(num_2)) cout << "num_1 is greater";
AFAIK, the complexity of the two algorithms is the same; the difference should only be a (hopefully negligible) constant factor.
Because of this, I'd use exp(a) > b, simply because it doesn't break on invalid input.
Do you really need to know? Is this going to occupy a large fraction of your running time? How do you know?
Worse, it may be platform dependent. Then what?
So sure, test it if you care, but spending much time agonizing over micro-optimization is usually a bad idea.
Edit: Modified the code to avoid exp() overflow. This caused the margin between the two functions to shrink considerably. Thanks, fredrikj.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        return 0;
    }

    int type = atoi(argv[1]);
    int count = atoi(argv[2]);
    double (*func)(double) = type == 1 ? exp : log;

    int i;
    for (i = 0; i < count; i++) {
        func(i % 100);
    }
    return 0;
}
(Compile using:)
emil@lanfear /home/emil/dev $ gcc -o test_log test_log.c -lm
The results seems rather conclusive:
emil@lanfear /home/emil/dev $ time ./test_log 0 10000000
real 0m2.307s
user 0m2.040s
sys 0m0.000s
emil@lanfear /home/emil/dev $ time ./test_log 1 10000000
real 0m2.639s
user 0m2.632s
sys 0m0.004s
A bit surprisingly, log seems to be the faster one.
Pure speculation:
Perhaps the underlying Taylor series converges faster for log, or something? It actually seems to me that the natural logarithm is easier to calculate than the exponential function:
ln(1+x) = x - x^2/2 + x^3/3 ...
e^x = 1 + x + x^2/2! + x^3/3! + x^4/4! ...
Not sure that the C library functions even do it like this, however. But it doesn't seem totally unlikely.
Since you're working with values << 1, note that x-1 > log(x) for x<1,
which means that x-1 < log(y) implies log(x) < log(y), which already
takes care of 1/e ~ 37% of the cases without having to use log or exp.
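Expressed as code, that shortcut could look like the following sketch (firstIsGreater is a name of my own; it assumes num_2 > 0 so the fallback log call is valid):

#include <cmath>

// Returns true if num_1 > num_2, given log_num_1 = log(num_1).
// Since log(x) <= x - 1 for all x > 0, the cheap test
// num_2 - 1 < log_num_1 already implies log(num_2) < log_num_1.
bool firstIsGreater(double log_num_1, double num_2)
{
    if (num_2 - 1.0 < log_num_1)
        return true;                        // decided without log or exp
    return log_num_1 > std::log(num_2);     // fall back to the full comparison
}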
Some quick tests in Python (which uses C for its math functions):
$ time python -c "from math import log, exp;[exp(100) for i in xrange(1000000)]"
real 0m0.590s
user 0m0.520s
sys 0m0.042s
$ time python -c "from math import log, exp;[log(100) for i in xrange(1000000)]"
real 0m0.685s
user 0m0.558s
sys 0m0.044s
would indicate that log is slightly slower
Edit: the C functions are being optimized out by the compiler, it seems, so the loop is what is taking up the time.
Interestingly, in C they seem to be the same speed (possibly for the reasons Mark mentioned in a comment):
#include <math.h>

void runExp(int n) {
    int i;
    for (i = 0; i < n; i++) {
        exp(100);
    }
}

void runLog(int n) {
    int i;
    for (i = 0; i < n; i++) {
        log(100);
    }
}

int main(int argc, char **argv) {
    if (argc <= 1) {
        return 0;
    }
    if (argv[1][0] == 'e') {
        runExp(1000000000);
    } else if (argv[1][0] == 'l') {
        runLog(1000000000);
    }
    return 0;
}
giving times:
$ time ./exp l
real 0m2.987s
user 0m2.940s
sys 0m0.015s
$ time ./exp e
real 0m2.988s
user 0m2.942s
sys 0m0.012s
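If the goal is to keep the calls from being optimized away, one option (a sketch, not the code that produced the timings above) is to vary the argument and accumulate the results so the compiler cannot drop or hoist the call:

#include <math.h>
#include <stdio.h>

/* Accumulate into a sink and vary the argument so the exp() calls
   cannot be removed or hoisted out of the loop. */
double runExp(int n) {
    double sink = 0.0;
    for (int i = 0; i < n; i++) {
        sink += exp(100.0 + i % 3);
    }
    return sink;
}

int main(void) {
    printf("%f\n", runExp(100000000));
    return 0;
}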
It can depend on your libm, platform and processor. You're best off writing some code that calls exp/log a large number of times, and running it under time a few times to see if there's a noticeable difference.
Both take basically the same time on my computer (Windows), so I'd use exp, since it's defined for all values (assuming you check for ERANGE). But if it's more natural to use log, you should use that instead of trying to optimise without good reason.
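For completeness, checking for ERANGE around exp() could look like this sketch (it assumes the implementation reports math errors via errno, which is common but not guaranteed by the standard):

#include <cerrno>
#include <cmath>
#include <iostream>

int main()
{
    double log_num_1 = 710.0;   // large enough that exp() overflows a double
    double num_2 = 1.24;

    errno = 0;
    double n1 = std::exp(log_num_1);
    if (errno == ERANGE) {
        // exp overflowed: num_1 is certainly larger than any finite num_2
        std::cout << "num_1 is greater (exp overflow)\n";
    } else if (n1 > num_2) {
        std::cout << "num_1 is greater\n";
    }
}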
It would make sense for log to be faster... exp has to perform several multiplications to arrive at its answer, whereas log only has to convert the mantissa and exponent from base-2 to base-e.
Just be sure to boundary check (as many others have said) if you're using log.
If you are sure that this is a hotspot, compiler intrinsics are your friends, although this is platform-dependent (if you go for performance in places like this, you cannot be platform-agnostic).
So the question really is: which of the two maps to an asm instruction on your target architecture, and what are its latency and cycle counts? Without that, it is pure speculation.