I've developed a multi-platform program (using the FLTK toolkit) and implemented multithreading to run intensive background tasks.
I have followed the FLTK tutorials/examples on multithreading, which use pthreads on the Mac (i.e. pthread_create) and Windows threading on Windows (i.e. _beginthread).
What I have noticed is that performance is much higher on Windows, i.e. these background threads execute 3 to 4 times faster.
Why might this be? Is it the threading libraries I'm using? Surely there shouldn't be such a difference? Or could it be the runtime libraries underneath it all?
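For reference, a minimal sketch of the two launch paths I'm describing; the worker function and its argument are placeholders, not my actual code:
#include <cstdio>

void do_background_work(void *arg)      // placeholder for the real background task
{
    // ... heavy computation ...
}

#ifdef _WIN32
#include <process.h>

void start_worker(void *arg)
{
    // _beginthread takes a void (__cdecl *)(void *) start routine
    _beginthread(do_background_work, 0, arg);
}
#else
#include <pthread.h>

void *worker_entry(void *arg)           // pthread start routines return void *
{
    do_background_work(arg);
    return nullptr;
}

void start_worker(void *arg)
{
    pthread_t tid;
    pthread_create(&tid, nullptr, worker_entry, arg);
    pthread_detach(tid);                // fire-and-forget; join instead if you need the result
}
#endif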
Here are my machine details
Mac:
Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
16 GB DDR3 1600 MHz
Model MacBookPro9,1
OS: Mac OSX 10.8.5
Windows:
Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
16 GB DDR3 1600 MHz
Model: Dell Latitude E5530
OS: Windows 7 Service Pack 1
EDIT
To do a basic speed comparison, I compiled this on both machines and ran it from the command line:
#include <iostream>
#include <sstream>
#include <iomanip>
#include <ctime>
#include <cmath>

int main(int argc, char **argv)
{
    time_t t = time(NULL);
    tm* tt = localtime(&t);
    std::stringstream s;
    s<< std::setfill('0')<<std::setw(2)<<tt->tm_mday<<"/"<<std::setw(2)<<tt->tm_mon+1<<"/"<< std::setw(4)<<tt->tm_year+1900<<" "<< std::setw(2)<<tt->tm_hour<<":"<<std::setw(2)<<tt->tm_min<<":"<<std::setw(2)<<tt->tm_sec;
    std::cout<<"1: "<<s.str()<<std::endl;
    double sum=0;
    for (int i=0;i<100000000;i++){
        double ii=i*0.123456789;
        sum=sum+sin(ii)*cos(ii);
    }
    t = time(NULL);
    tt = localtime(&t);
    s.str("");
    s<< std::setfill('0')<<std::setw(2)<<tt->tm_mday<<"/"<<std::setw(2)<<tt->tm_mon+1<<"/"<< std::setw(4)<<tt->tm_year+1900<<" "<< std::setw(2)<<tt->tm_hour<<":"<<std::setw(2)<<tt->tm_min<<":"<<std::setw(2)<<tt->tm_sec;
    std::cout<<"2: "<<s.str()<<std::endl;
    return 0;
}
Windows takes less than a second. The Mac takes 4 to 5 seconds. Any ideas?
On the Mac I'm compiling with g++; on Windows with Visual Studio 2013.
SECOND EDIT
If I change the line
std::cout<<"2: "<<s.str()<<std::endl;
to
std::cout<<"2: "<<s.str()<<" "<<sum<<std::endl;
Then all of a sudden Windows takes a little bit longer...
This makes me think the whole thing might be down to compiler optimisation: without printing sum, the compiler can see the result is never used and may drop the loop entirely. So the question would be: is g++ (4.2 is the version I have) worse at optimisation, or do I need to provide additional flags?
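For concreteness, the kind of invocations I mean (bench.cpp is just a placeholder file name and -O2 an illustrative optimisation level):
Mac:
g++ -O2 bench.cpp -o bench
Windows (from a Visual Studio command prompt):
cl /O2 /EHsc bench.cpp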
THIRD(!) AND FINAL EDIT
I can report that I achieve comparable performance by ensuring the g++ optimisation flag -O was provided at compile time. One of those annoying things that happens so often:
A: Im tearing my hair out on problem x
B: Are you sure you're not doing y?
A: That works, why is this information not plastered all over the place and in every tutorial on problem x on the web?
B: Did you read the manual?
A: No, if I completely read the manual for every single bit of code/program I used I would never actually get round to doing anything...
Meh.
Related
I'm building my own embedded Linux OS for the Raspberry Pi 3 using Buildroot. This OS will be used to run several applications, one of which performs object detection based on OpenCV (v3.3.0).
I started with Raspbian Jessie + Python, but it turned out that even a simple example takes a lot of time to execute, so I decided to design my own OS with optimised features and to use C++ instead of Python.
I thought that with these optimisations the 4 cores of the RPi plus the 1 GB of RAM would handle such applications. The problem is that even with these things, the simplest computer-vision programs take a lot of time.
PC vs. Raspberry Pi 3 Comparison
This is a simple program I wrote to get an idea of the order of magnitude of the execution time of each part of the program.
#include <stdio.h>
#include "opencv2/core.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"
#include <time.h> /* clock_t, clock, CLOCKS_PER_SEC */

using namespace cv;
using namespace std;

int main()
{
    setUseOptimized(true);
    clock_t t_access, t_proc, t_save, t_total;

    // Access time.
    t_access = clock();
    Mat img0 = imread("img0.jpg", IMREAD_COLOR);  // takes ~90ms
    t_access = clock() - t_access;

    // Processing time.
    t_proc = clock();
    cvtColor(img0, img0, COLOR_BGR2GRAY);
    blur(img0, img0, Size(9,9));                  // takes ~18ms
    t_proc = clock() - t_proc;

    // Saving time.
    t_save = clock();
    imwrite("img1.jpg", img0);
    t_save = clock() - t_save;

    t_total = t_access + t_proc + t_save;
    //printf("CLOCKS_PER_SEC = %d\n\n", CLOCKS_PER_SEC);
    printf("(TEST 0) Total execution time\t %ld cycles \t= %f ms!\n", (long)t_total, ((float)t_total)*1000./CLOCKS_PER_SEC);
    printf("---->> Accessing in\t %ld cycles \t= %f ms.\n", (long)t_access, ((float)t_access)*1000./CLOCKS_PER_SEC);
    printf("---->> Processing in\t %ld cycles \t= %f ms.\n", (long)t_proc, ((float)t_proc)*1000./CLOCKS_PER_SEC);
    printf("---->> Saving in\t %ld cycles \t= %f ms.\n", (long)t_save, ((float)t_save)*1000./CLOCKS_PER_SEC);
    return 0;
}
Results of Execution on an i7 PC
Results of Execution on Raspberry PI (Generated OS from Buildroot)
As you can see, there is a huge difference. What I need is to optimise every single detail so that this example's processing step runs in "near" real time, with a maximum of 15 ms of processing time instead of the current 44 ms. So these are my questions:
How can I optimise my OS so that it can handle computationally intensive applications, and how can I control the priorities of each part?
How can I fully use the 4 cores of the RPi 3 to fulfil the requirements?
Are there any other possibilities instead of OpenCV?
Should I use C instead of C++?
Are there any hardware improvements you would recommend?
Well, as I understand it, you want to get about 30-40 fps. In the case of your i7: it is fast and has a ton of acceleration techniques enabled by default by Intel. In the case of the Raspberry Pi: well, we love it, but it is slow, especially for image-processing programs.
How can I optimise my OS so that it can handle computationally intensive applications, and how can I control the priorities of each part?
You should include an acceleration library for ARM (e.g. NEON) and recompile OpenCV with those features enabled.
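For instance, when configuring the OpenCV build, something along these lines; the exact flag set depends on your toolchain, so treat this as a sketch using OpenCV's standard ARM-related CMake options:
cmake -D CMAKE_BUILD_TYPE=Release \
      -D ENABLE_NEON=ON \
      -D ENABLE_VFPV3=ON \
      -D WITH_TBB=ON \
      <path-to-opencv-source>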
How can I fully use the 4 cores of the RPi 3 to fulfil the requirements?
Parallelise your code so it runs on all 4 cores, as in the sketch below.
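A minimal sketch of one way to do that with OpenCV's own parallel_for_, assuming a single-channel 8-bit image; the per-pixel work is just an illustrative placeholder:
#include "opencv2/core.hpp"

// Runs a per-row operation across all cores via OpenCV's internal thread pool.
class RowWorker : public cv::ParallelLoopBody
{
public:
    explicit RowWorker(cv::Mat &img) : img_(img) {}

    void operator()(const cv::Range &range) const override
    {
        for (int r = range.start; r < range.end; ++r) {
            uchar *row = img_.ptr<uchar>(r);
            for (int c = 0; c < img_.cols; ++c)
                row[c] = cv::saturate_cast<uchar>(row[c] * 2);  // placeholder per-pixel work
        }
    }

private:
    cv::Mat &img_;
};

void process_parallel(cv::Mat &img)
{
    cv::parallel_for_(cv::Range(0, img.rows), RowWorker(img));
}
Note that parallel_for_ only fans out if OpenCV was built with a parallel backend (pthreads, TBB or OpenMP); otherwise it runs serially.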
Are there any other possibilities instead of OpenCV?
Ask yourself first what features you actually need from OpenCV.
Should I use C instead of C++?
Changing the language will not help you at all; stay with C++ and love it. It is a beautiful language and very fast.
Are there any hardware improvements you would recommend?
How about another board with a supported Mali GPU? Then you could run OpenCV code directly on the GPU, which would boost your speed a lot.
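If you go that route, OpenCV's transparent API (UMat) is the usual way to offload work to an OpenCL-capable GPU; whether it actually helps depends on the board's OpenCL driver, so treat this as a sketch of the question's pipeline rather than a guaranteed speedup:
#include "opencv2/core.hpp"
#include "opencv2/core/ocl.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"

int main()
{
    cv::ocl::setUseOpenCL(true);                            // enable the OpenCL path if available

    cv::UMat img, gray, blurred;
    cv::imread("img0.jpg", cv::IMREAD_COLOR).copyTo(img);   // UMat keeps the data device-friendly

    cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);            // these calls dispatch to OpenCL kernels
    cv::blur(gray, blurred, cv::Size(9, 9));                // when a suitable device is present

    cv::imwrite("img1.jpg", blurred);
    return 0;
}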
I run a time-critical application on Windows 10 using Python 2.7.x, and it seems Windows sometimes interrupts my program for a fraction of a second. This happens roughly every 5-10 seconds.
How can I "tell" Windows that my program is the only thing that is important as long as it is running?
Sorry, I wasn't clear about my link; I meant the psutil usage as proposed in the second comment on http://code.activestate.com/recipes/496767-set-process-priority-in-windows/:
import psutil, os
p = psutil.Process(os.getpid())
print(p.nice())
p.nice(psutil.HIGH_PRIORITY_CLASS)
print(p.nice())
... yields:
32
128
I'm using Windows 7, 64-bit, with 8 GB of RAM.
I need to allocate more than 2 GB, but I'm getting a runtime error.
Look at this piece of code:
#define MAX_PESSOAS 30000000
int i;
double ** totalPessoas = new double *[MAX_PESSOAS];
for(i = 0; i < MAX_PESSOAS; i++)
totalPessoas[i] = new double [5];
MAX_PESSOAS is set to 30 million, but I'll need at least 1 billion (OK, I know I'll need more than 8 GB, but never mind, I can get that; I only need to know how to do it).
I'm using Visual Studio 2012.
If your application is built as a 64-bit binary, it can address more than 8 GB without any special steps.
If your application is built as a 32-bit binary, you can address up to 3 GB (or 4 GB if you're running 64-bit Windows) by enabling 4-gigabyte tuning, as long as the system supports it.
Your best bet is probably to compile your application as a 64-bit binary, if you know that the operating system it will be running on is 64-bit.
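Independent of the 32/64-bit question, one flat allocation is also far cheaper than 30 million separate new double[5] blocks, each of which carries its own heap overhead. A sketch of that layout, assuming the 5-column structure from the question:
#include <vector>
#include <cstddef>

int main()
{
    const std::size_t MAX_PESSOAS = 30000000;   // std::size_t avoids overflow once you grow toward 1 billion
    const std::size_t COLS = 5;

    // One contiguous block: row i, column j lives at index i * COLS + j.
    std::vector<double> totalPessoas(MAX_PESSOAS * COLS);

    totalPessoas[7 * COLS + 2] = 1.0;           // example access: person 7, field 2
    return 0;
}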
When I call the function directly, the execution time is 6.8 sec.
Calling it from a single thread takes 3.4 sec,
and when using 2 threads, 1.8 sec. No matter what optimization I use, the ratios stay the same.
In Visual Studio the times are as expected: 3.1, 3.0 and 1.7 sec.
#include <math.h>
#include <stdio.h>
#include <windows.h>
#include <time.h>

#define N 400

float a[N][N];

struct b {
    int begin;
    int end;
};

DWORD WINAPI thread(LPVOID p)
{
    b b_t = *(b*)p;
    for (int i = 0; i < N; i++)
        for (int j = b_t.begin; j < b_t.end; j++)
        {
            a[i][j] = 0;
            for (int k = 0; k < i; k++)
                a[i][j]+=k*sin(j)-j*cos(k);
        }
    return 0;
}

int main()
{
    clock_t t;
    HANDLE hn[2];
    b b_t[3];
    b_t[0].begin = 0;
    b_t[0].end = N;
    b_t[1].begin = 0;
    b_t[1].end = N/2;
    b_t[2].begin = N/2;
    b_t[2].end = N;

    // Call the function directly on the main thread.
    t = clock();
    thread(&b_t[0]);
    printf("0 - %ld\n", (long)(clock() - t));

    // Same work in a single created thread.
    t = clock();
    hn[0] = CreateThread(NULL, 0, thread, &b_t[0], 0, NULL);
    WaitForSingleObject(hn[0], INFINITE);
    printf("1 - %ld\n", (long)(clock() - t));

    // Split the work across two threads.
    t = clock();
    hn[0] = CreateThread(NULL, 0, thread, &b_t[1], 0, NULL);
    hn[1] = CreateThread(NULL, 0, thread, &b_t[2], 0, NULL);
    WaitForMultipleObjects(2, hn, TRUE, INFINITE);
    printf("2 - %ld\n", (long)(clock() - t));
    return 0;
}
Times:
0 - 6868
1 - 3362
2 - 1827
CPU - Core 2 Duo T9300
OS - Windows 8, 64 - bit
compiler: mingw32-g++.exe, gcc version 4.6.2
edit:
Tried a different order, same result; I even tried separate applications.
Task Manager shows CPU utilization around 50% for the plain function call and for 1 thread, and 100% for 2 threads.
The sum of all elements after each call is the same: 3189909.237955
Cygwin result: 2.5, 2.5 and 2.5 sec
Linux result(pthread): 3.7, 3.7 and 2.1 sec
@borisbn's results: 0 - 1446, 1 - 1439, 2 - 721.
The difference is a result of something in the math library implementation of sin() and cos() - if you replace the calls to those functions with something else that takes time, the significant difference between step 0 and step 1 goes away.
Note that I see the difference with gcc (tdm-1) 4.6.1, which is a 32-bit toolchain targeting 32-bit binaries. Optimization makes no difference (not surprising, since it seems to be something in the math library).
However, if I build using gcc (tdm64-1) 4.6.1, which is a 64-bit toolchain, the difference does not appear - regardless of whether the build creates a 32-bit program (using the -m32 option) or a 64-bit program (-m64).
Here are some example test runs (I made minor modifications to the source to make it C99 compatible):
Using the 32-bit TDM MinGW 4.6.1 compiler:
C:\temp>gcc --version
gcc (tdm-1) 4.6.1
C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c
C:\temp>test
0 - 4082
1 - 2439
2 - 1238
Using the 64-bit TDM 4.6.1 compiler:
C:\temp>gcc --version
gcc (tdm64-1) 4.6.1
C:\temp>gcc -m32 -std=gnu99 -o test.exe test.c
C:\temp>test
0 - 2506
1 - 2476
2 - 1254
C:\temp>gcc -m64 -std=gnu99 -o test.exe test.c
C:\temp>test
0 - 3031
1 - 3031
2 - 1539
A little more information:
The 32-bit TDM distribution (gcc (tdm-1) 4.6.1) links to the sin()/cos() implementations in the msvcrt.dll system DLL via a provided import library:
c:/mingw32/bin/../lib/gcc/mingw32/4.6.1/../../../libmsvcrt.a(dcfls00599.o)
0x004a113c _imp__cos
While the 64-bit distribution (gcc (tdm64-1) 4.6.1) doesn't appear to do that, instead linking to some static library implementation provided with the distribution:
c:/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/4.6.1/../../../../x86_64-w64-mingw32/lib/../lib32/libmingwex.a(lib32_libmingwex_a-cos.o)
C:\Users\mikeb\AppData\Local\Temp\cc3pk20i.o (cos)
Update/Conclusion:
After a bit of spelunking in a debugger, stepping through the assembly of msvcrt.dll's implementation of cos(), I've found that the difference in timing between the main thread and an explicitly created thread is due to the FPU's precision being set to a non-default value (presumably the MinGW runtime in question does this at start-up). In the situation where the thread() function takes twice as long, the FPU is set to 64-bit precision (REAL10, or in MSVC-speak _PC_64). When the FPU control word is something other than 0x27f (the default state?), the msvcrt.dll runtime will perform the following steps in the sin() and cos() functions (and probably other floating-point functions):
save the current FPU control word
set the FPU control word to 0x27f (I believe it's possible for this value to be modified)
perform the fsin/fcos operation
restore the saved FPU control word
The save/restore of the FPU control word is skipped if it's already set to the expected/desired 0x27f value. Apparently saving/restoring the FPU control word is expensive, since it appears to double the amount of time the function takes.
You can solve the problem by adding the following line to main() before calling thread():
_control87( _PC_53, _MCW_PC); // requires <float.h>
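In context that looks something like this (just the relevant fragment of the question's main(), not a full listing):
#include <float.h>   // _control87, _PC_53, _MCW_PC

int main()
{
    // Force 53-bit FPU precision so the main thread matches what
    // threads created with CreateThread() get by default.
    _control87(_PC_53, _MCW_PC);

    /* ... the timing code from the question ... */
    return 0;
}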
Not a cache matter here.
Likely different runtime libraries for user-created threads and the main thread.
You may compare the calculations a[i][j]+=k*sin(j)-j*cos(k); in detail (the actual numbers) for specific values of i, j, and k to confirm the differences.
The reason is that the main thread is doing 64-bit float math and the threads are doing 53-bit math.
You can verify this / fix it by changing the code to:
...
extern "C" unsigned int _control87( unsigned int newv, unsigned int mask );
DWORD WINAPI thread(LPVOID p)
{
printf( "_control87(): 0x%.4x\n", _control87( 0, 0 ) );
_control87(0x00010000,0x00010000);
...
The output will be:
c:\temp>test
_control87(): 0x8001f
0 - 2667
_control87(): 0x9001f
1 - 2683
_control87(): 0x9001f
_control87(): 0x9001f
2 - 1373
c:\temp>mingw32-c++ --version
mingw32-c++ (GCC) 4.6.2
You can see that 0 was going to run without the 0x10000 flag, but once it is set, it runs at the same speed as 1 & 2. If you look up the _control87() function, you'll see that this value is the _PC_53 flag, which sets the precision to 53 bits instead of the 64 bits it would have had if left as zero.
For some reason, MinGW isn't setting it to the same value at process-init time that CreateThread() does at thread-creation time.
Another workaround is to turn on SSE2 with _set_SSE2_enable(1), which will run even faster but may give different results.
c:\temp>test
0 - 1341
1 - 1326
2 - 702
I believe this is on by default for the 64-bit build because all 64-bit x86 processors support SSE2.
As others suggested, change the order of your three tests to get some more insight. Also, the fact that you have a multi-core machine pretty well explains why using two threads doing half the work each takes half the time. Take a look at your CPU usage monitor (Control-Shift-Escape) to find out how many cores are maxed out during the running time.
We have migrated a C++ app from RHEL 6.2 to RHEL 5.4 and found that the application has broken. One of our investigations found that the code on the 5.4 box refers to 'futex'. Note that our app is compiled using 32-bit compiler options -
grep futex tool_strace.txt
futex(0xff8ea454, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xf6d1f4fc, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0xf6c10a4c, FUTEX_WAKE_PRIVATE, 2147483647) = 0
As per http://www.akkadia.org/drepper/assumekernel.html I added this code to the 5.4 build -
setenv("LD_ASSUME_KERNEL" , "2.4.1" , 1); // to use Linux Threads
But the strace dump still shows 'futex' being called.
All the addresses ff8ea454, f6d1f4fc and f6c10a4c are 32-bit addresses. So, if my assumption is right, how can I arrange for the 'futex' calls to be suppressed or not made at all?
Is there any known issue with futex calls?
I believe the following to be true:
LD_ASSUME_KERNEL has to be set before your program starts to have any effect (see the example after this list).
futex is used to implement every type of lock, so you can't avoid it.
You shouldn't need LD_ASSUME_KERNEL when you are compiling your own code, as it should use newer interfaces as appropriate.
2.4.1 is an ancient kernel version to be trying to emulate. Given your mention of 32-bit compiles, you are probably on an AMD64-architecture machine, and it may not even have support libraries going back that far.
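To illustrate the first point: calling setenv() inside the process has no effect, because the dynamic linker has already picked its libraries by the time your code runs; the variable has to be in the environment before the program is exec'ed, e.g. (myapp is a placeholder name):
$ LD_ASSUME_KERNEL=2.4.1 ./myapp
or, for the whole shell session:
$ export LD_ASSUME_KERNEL=2.4.1
$ ./myapp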