I would like to encrypt a file with multiple threads in order to reduce the time taken. I'm running on an Intel i5 processor, 4 GB memory, Visual C++ 2008. The problem is that when I run the code below in debug mode (Visual C++ 2008), it takes longer: for example, if I use one thread to encrypt a 3 MB file, it takes 5 seconds, but with two threads it takes 10 seconds. The time should be shorter when using 2 threads in debug mode. In release mode there is no problem; the time taken is short when using multiple threads.
Is it possible to run the code in debug mode with a shorter time taken? Is there a setting to change in Visual C++ 2008?
void load()
{
ifstream readF ("3mb.txt");
string output; string out;
if(readF.is_open())
{
while(getline(readF,out))
{
output=output+'\n'+out;
}
readF.close();
//cout<<output<<endl;
//cout<<output.size()<<endl;
text[0]=output;
}
else
cout<<"couldnt open file!"<<endl;
}
unsigned Counter;
unsigned __stdcall SecondThreadFunc( void* pArguments )
{
cout<<"encrypting..."<<endl;
Enc(text[0]);
_endthreadex( 0 );
return 0;
}
unsigned __stdcall SecondThreadFunc2( void* pArguments )
{
cout<<"encrypting..."<<endl;
//Enc(text[0]);
_endthreadex( 0 );
return 0;
}
int main()
{
load();
HANDLE hThread[10];
unsigned threadID;
time_t start, end;
start =time(0);
hThread[0] = (HANDLE)_beginthreadex( NULL, 0, &SecondThreadFunc, NULL, 0, &threadID);
hThread[1] = (HANDLE)_beginthreadex( NULL, 0, &SecondThreadFunc2, NULL, 0, &threadID );
WaitForSingleObject( hThread[0], INFINITE );
WaitForSingleObject( hThread[1], INFINITE );
CloseHandle( hThread[0] );
CloseHandle( hThread[1] );
end=time(0);
cout<<"Time taken : "<<difftime(end, start) << " second(s)" << endl;
system("pause");
}
A potential reason it may be slower is that multiple threads need to load data from memory into the CPU cache. In debug mode there may be extra padding around data structures, etc., which is intended to catch buffer overflows. That might mean that when the CPU switches from one thread to the other, it needs to flush the cache and reload all the data from RAM. But in release mode, where there is no padding, enough data for both threads fits into the cache, so it runs quicker.
You will find that even in release mode, as you add more threads you will reach a point of diminishing returns, after which adding more threads actually makes it slower than using fewer threads.
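Note also that in the code shown, the second thread's Enc call is commented out, so adding that thread does not split the work. For multiple threads to genuinely reduce the encryption time, each thread has to encrypt a different part of the data. A minimal sketch, assuming your Enc routine can process each chunk independently (true for a simple per-character cipher, not for chained block modes):

struct Part { string data; };            // each thread owns its own chunk

unsigned __stdcall EncPart( void* pArguments )
{
    Part* p = static_cast<Part*>(pArguments);
    Enc(p->data);                        // encrypt this chunk only
    return 0;
}

// In main(), after load():
size_t half = text[0].size() / 2;
Part parts[2];
parts[0].data = text[0].substr(0, half);
parts[1].data = text[0].substr(half);

HANDLE h[2];
unsigned id;
h[0] = (HANDLE)_beginthreadex(NULL, 0, &EncPart, &parts[0], 0, &id);
h[1] = (HANDLE)_beginthreadex(NULL, 0, &EncPart, &parts[1], 0, &id);
WaitForMultipleObjects(2, h, TRUE, INFINITE);
CloseHandle(h[0]); CloseHandle(h[1]);

text[0] = parts[0].data + parts[1].data; // reassemble the encrypted halves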
Related
This question seems to be asked a lot. I had some legacy production code that was seemingly fine, until it started getting many more connections per day. Each connection kicked off a new thread. Eventually, it would exhaust memory and crash.
I'm going back over pthread (and C sockets) which I've not dealt with in years. The tutorial I had was informative, but I'm seeing the same thing when I use top. All the threads exit, but there's still some virtual memory taken up. Valgrind tells me there is a possible memory loss when calling pthread_create(). The very basic sample code is below.
The scariest part is that pthread_exit( NULL ) seems to leave about 100m in VIRT unaccounted for when all the threads exit. If I comment out this line, it's much more liveable, but there is still some there. On my system it starts with about 14k and ends with 47k.
If I up the thread count to 10,000, VIRT goes up to 70+ gigs, but finishes somewhere around 50k, assuming I comment out pthread_exit( NULL ). If I use pthread_exit( NULL ) it finishes with about 113m still in VIRT. Are these acceptable? Is top not telling me everything?
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
void* run_thread( void* id )
{
int thread_id = *(int*)id;
int count = 0;
while ( count < 10 ) {
sleep( 1 );
printf( "Thread %d at count %d\n", thread_id, count++ );
}
pthread_exit( NULL );
return 0;
}
int main( int argc, char* argv[] )
{
sleep( 5 );
int thread_count = 0;
while( thread_count < 10 ) {
pthread_t my_thread;
if ( pthread_create( &my_thread, NULL, run_thread, (void*)&thread_count ) != 0 ) {
perror( "Error making thread...\n" );
return 1;
}
pthread_detach( my_thread );
thread_count++;
sleep( 1 );
}
pthread_exit( 0 ); // added as per request
return 0;
}
I know this is a rather old question, but I hope others will benefit.
This is indeed a memory leak. The thread is created with default attributes, and by default the thread is joinable. A joinable thread keeps its underlying bookkeeping until it is finished... and joined.
If a thread is never joined, set the detached attribute; all (thread) resources will then be freed once the thread terminates.
Here's an example:
pthread_attr_t attr;
pthread_t thread;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
pthread_create(&thread, &attr, &threadfunction, NULL);
pthread_attr_destroy(&attr);
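Alternatively, if you keep the thread joinable, make sure something actually joins it; the join is what releases the bookkeeping:

pthread_t thread;
pthread_create(&thread, NULL, threadfunction, NULL);
/* ... */
pthread_join(thread, NULL);   /* frees the thread's resources once it has finished */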
Prior to your edit adding pthread_exit(0) to the end of main(), your program would finish executing before all the threads had finished running. valgrind thus reported the resources that were still being held by the threads active at the time the program terminated, making it look like your program had a memory leak.
The call to pthread_exit(0) in main() terminates the main thread without terminating the process, so the process keeps running until all the other spawned threads have exited. This lets valgrind observe a clean run in terms of memory utilization.
(I am assuming linux is your operating system below, but it seems you are running some variety of UNIX from your comments.)
The extra virtual memory you see is just linux assigning some pages to your program since it was a big memory user. As long as your resident memory utilization is low and constant when you reach the idle state, and the virtual utilization is relatively constant, you can assume your system is well behaved.
By default, each thread gets 2MB of stack space on linux. If each thread stack does not need that much space, you can adjust it by initializing a pthread_attr_t and setting it with a smaller stack size using pthread_attr_setstacksize(). What stack size is appropriate depends on how deep your function call stack grows and how much space the local variables for those functions take.
#define SMALLEST_STACKSZ PTHREAD_STACK_MIN
#define SMALL_STACK (24*1024)
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, SMALL_STACK);
/* ... */
pthread_create(&my_thread, &attr, run_thread, (void *)&thread_count);
/* ... */
pthread_attr_destroy(&attr);
I want to run some benchmarks on a C++ algorithm and want to get the CPU time it takes, depending on inputs. I use Visual Studio 2012 on Windows 7. I already discovered one way to calculate the CPU time in Windows: How can I measure CPU time and wall clock time on both Linux/Windows?
However, I use the system() command in my algorithm, which is not measured that way. So, how can I measure CPU time and include the times of my script calls via system()?
I should add a small example. This is my get_cpu_time function (from the link described above):
double get_cpu_time(){
FILETIME a,b,c,d;
if (GetProcessTimes(GetCurrentProcess(),&a,&b,&c,&d) != 0){
// Returns total user time.
// Can be tweaked to include kernel times as well.
return
(double)(d.dwLowDateTime |
((unsigned long long)d.dwHighDateTime << 32)) * 0.0000001;
}else{
// Handle error
return 0;
}
}
That works fine so far, and when I made a program that sorts some array (or does some other stuff that takes some time), it works fine. However, when I use the system() command as in the following case, it doesn't:
int main( int argc, const char* argv[] )
{
double start = get_cpu_time();
double end;
system("Bla.exe");
end = get_cpu_time();
printf("Everything took %f seconds of CPU time", end - start);
std::cin.get();
}
The execution of the given exe file is measured in the same way and takes about 5 seconds. When I run it via system(), the whole thing takes a CPU time of 0 seconds, which obviously does not include the execution of the exe file.
One possibility would be to get a HANDLE on the system call; is that possible somehow?
Linux:
For the wall clock time, use gettimeofday() or clock_gettime()
For the CPU time, use getrusage() or times()
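For the question at hand, getrusage() with RUSAGE_CHILDREN is particularly handy, because it also accounts for the CPU time of child processes you have waited for, which includes anything launched through system(). A rough sketch ("./Bla" is just a placeholder command):

#include <sys/time.h>
#include <sys/resource.h>
#include <time.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    timespec ws, we;
    clock_gettime(CLOCK_MONOTONIC, &ws);           // wall clock start

    std::system("./Bla");                          // placeholder child command

    clock_gettime(CLOCK_MONOTONIC, &we);

    rusage ru;
    getrusage(RUSAGE_CHILDREN, &ru);               // CPU time of waited-for children
    double child_cpu = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
                     + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    double wall = (we.tv_sec - ws.tv_sec) + (we.tv_nsec - ws.tv_nsec) / 1e9;

    std::printf("wall: %.3f s, child CPU: %.3f s\n", wall, child_cpu);
    return 0;
}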
Your approach will actually report the CPU time that your program takes, but if you use threads in your program it will not work properly: you should wait for each thread to finish its job before taking the final CPU time. So basically you should write this:
WaitForSingleObject(threadhandle, INFINITE);
If you don't know exactly what your program uses (whether it's multithreaded or not), you can create a thread to do the job, wait for that thread to terminate, and measure the time.
DWORD WINAPI MyThreadFunction( LPVOID lpParam );
int main()
{
DWORD dwThreadId;
HANDLE hThread;
int startcputime, endcputime, wcts, wcte;
startcputime = cputime();
hThread = CreateThread(
NULL, // default security attributes
0, // use default stack size
MyThreadFunction, // thread function name
NULL, // argument to thread function
0, // use default creation flags
&dwThreadId);
WaitForSingleObject(hThread, INFINITE);
endcputime = cputime();
std::cout << "it took " << endcputime - startcputime << " s of CPU to execute this\n";
return 0;
}
DWORD WINAPI MyThreadFunction( LPVOID lpParam )
{
//do your job here
return 0;
}
If you're using C++11 (or have access to it), std::chrono has all of the functions you need to calculate how long a program has run.
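For example (a minimal sketch; note that std::chrono gives you wall-clock time, not CPU time):

#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::steady_clock::now();

    // ... work to be measured ...

    auto end = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "took " << ms.count() << " ms\n";
    return 0;
}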
You'll need to add your process to a Job object before creating any child processes. Child processes will then automatically run in the same job, and the information you want can be found in the TotalUserTime and TotalKernelTime members of the JOBOBJECT_BASIC_ACCOUNTING_INFORMATION structure, available through the QueryInformationJobObject function.
Further information:
Resource Accounting for Jobs
JOBOBJECT_BASIC_ACCOUNTING_INFORMATION structure
Beginning with Windows 8, nested jobs are supported, so you can use this method even if some of the programs already rely on job objects.
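A rough sketch of that approach (error handling omitted; "Bla.exe" is just a placeholder for the child you want to time):

#include <windows.h>
#include <stdio.h>

int main()
{
    HANDLE hJob = CreateJobObject(NULL, NULL);
    AssignProcessToJobObject(hJob, GetCurrentProcess());   // children will inherit the job

    STARTUPINFO si = { sizeof(si) };
    PROCESS_INFORMATION pi = { 0 };
    TCHAR cmd[] = TEXT("Bla.exe");                         // placeholder child process
    CreateProcess(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi);
    WaitForSingleObject(pi.hProcess, INFINITE);

    JOBOBJECT_BASIC_ACCOUNTING_INFORMATION info;
    QueryInformationJobObject(hJob, JobObjectBasicAccountingInformation,
                              &info, sizeof(info), NULL);
    // The accounting times are in 100-nanosecond units.
    printf("user: %.3f s, kernel: %.3f s\n",
           info.TotalUserTime.QuadPart / 1e7,
           info.TotalKernelTime.QuadPart / 1e7);

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    CloseHandle(hJob);
    return 0;
}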
I don't think there is a cross-platform mechanism. Using CreateProcess to launch the application, with a WaitForSingleObject for the application to finish, would allow you to get direct descendants' times. After that you would need job objects for complete accounting (if you needed to time grandchildren).
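For the direct-descendant case, GetProcessTimes on the handle returned by CreateProcess is enough once the child has exited; a sketch (no error handling):

#include <windows.h>

// Returns the user-mode CPU seconds used by one child process.
double run_and_time_child(TCHAR* cmdLine)
{
    STARTUPINFO si = { sizeof(si) };
    PROCESS_INFORMATION pi = { 0 };
    if (!CreateProcess(NULL, cmdLine, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
        return 0.0;
    WaitForSingleObject(pi.hProcess, INFINITE);

    FILETIME creation, exitTime, kernel, user;
    GetProcessTimes(pi.hProcess, &creation, &exitTime, &kernel, &user);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);

    ULONGLONG ticks = ((ULONGLONG)user.dwHighDateTime << 32) | user.dwLowDateTime;
    return ticks * 1e-7;   // FILETIME is in 100-nanosecond units
}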
You might also give external sampling profilers a shot. I've used the freebie "Sleepy" [http://sleepy.sourceforge.net/] and the even better "Very Sleepy" [http://www.codersnotes.com/sleepy/] profilers under Windows and been very happy with the results -- nicely formatted info in a few minutes with virtually no effort.
There is a similar project called "Shiny" [http://sourceforge.net/projects/shinyprofiler/] that is supposed to work on both Windows and *nix.
You can try using Boost.Timer. It is cross-platform. Sample code from the Boost website:
#include <boost/timer/timer.hpp>
#include <cmath>
int main() {
boost::timer::auto_cpu_timer t;
for (long i = 0; i < 100000000; ++i)
std::sqrt(123.456L); // burn some time
return 0;
}
First off, I found a lot of information on this topic, but no solutions that solved the issue unfortunately.
I'm simply trying to regulate my C++ program to run at 60 iterations per second. I've tried everything from GetClockTicks() to GetLocalTime() to help in the regulation but every single time I run the program on my Windows Server 2008 machine, it runs slower than on my local machine and I have no clue why!
I understand that "clock" based function calls return CPU time spend on the execution so I went to GetLocalTime and then tried to differentiate between the start time and the stop time then call Sleep((FPS / 1000) - millisecondExecutionTime)
My local machine is quite faster than the servers CPU so obviously the thought was that it was going off of CPU ticks, but that doesn't explain why the GetLocalTime doesn't work. I've been basing this method off of http://www.lazyfoo.net/SDL_tutorials/lesson14/index.php changing the get_ticks() with all of the time returning functions I could find on the web.
For example take this code:
#include <Windows.h>
#include <time.h>
#include <string>
#include <iostream>
using namespace std;
int main() {
int tFps = 60;
int counter = 0;
SYSTEMTIME gStart, gEnd, start_time, end_time;
GetLocalTime( &gStart );
bool done = false;
while(!done) {
GetLocalTime( &start_time );
Sleep(10);
counter++;
GetLocalTime( &end_time );
int startTimeMilli = (start_time.wSecond * 1000 + start_time.wMilliseconds);
int endTimeMilli = (end_time.wSecond * 1000 + end_time.wMilliseconds);
int time_to_sleep = (1000 / tFps) - (endTimeMilli - startTimeMilli);
if (counter > 240)
done = true;
if (time_to_sleep > 0)
Sleep(time_to_sleep);
}
GetLocalTime( &gEnd );
cout << "Total Time: " << (gEnd.wSecond*1000 + gEnd.wMilliseconds) - (gStart.wSecond*1000 + gStart.wMilliseconds) << endl;
cin.get();
}
For this code snippet, run on my computer (3.06 GHz), I get a total time (ms) of 3856, whereas on my server (2.53 GHz) I get 6256. So it could potentially be the speed of the processor, though the ratio of 2.53/3.06 is only .826797386 versus 3856/6271 = .614893956.
I can't tell whether the Sleep function is doing something drastically different than expected (though I don't see why it would), or whether it is my method for getting the time (even though it should be in wall-clock time (ms), not clock-cycle time). Any help would be greatly appreciated, thanks.
For one thing, Sleep's default resolution is the computer's quota length - usually either 10ms or 15ms, depending on the Windows edition. To get a resolution of, say, 1ms, you have to issue a timeBeginPeriod(1), which reprograms the timer hardware to fire (roughly) once every millisecond.
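A sketch of how that usually looks (timeBeginPeriod/timeGetTime live in winmm.lib):

#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

int main()
{
    timeBeginPeriod(1);                    // request ~1 ms timer resolution

    for (int frame = 0; frame < 240; ++frame) {
        DWORD frameStart = timeGetTime();
        // ... one frame's worth of work ...
        DWORD elapsed = timeGetTime() - frameStart;
        if (elapsed < 1000 / 60)
            Sleep(1000 / 60 - elapsed);    // now accurate to roughly a millisecond
    }

    timeEndPeriod(1);                      // always undo the resolution change
    return 0;
}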
In your main loop you can do something like this:
int main()
{
// Timers
LONGLONG curTime = 0;
LONGLONG nextTime = 0;
int loops = 0; // frame-skip counter (not fully implemented in this example)
Timers::SWGameClock::GetInstance()->GetTime(&nextTime);
while (true) {
Timers::SWGameClock::GetInstance()->GetTime(&curTime);
if ( curTime > nextTime && loops <= MAX_FRAMESKIP ) {
nextTime += Timers::SWGameClock::GetInstance()->timeCount;
// Business logic goes here and occurs based on the specified framerate
}
}
}
using this time library
include "stdafx.h"
LONGLONG cacheTime;
Timers::SWGameClock* Timers::SWGameClock::pInstance = NULL;
Timers::SWGameClock* Timers::SWGameClock::GetInstance ( ) {
if (pInstance == NULL) {
pInstance = new SWGameClock();
}
return pInstance;
}
Timers::SWGameClock::SWGameClock(void) {
this->Initialize ( );
}
void Timers::SWGameClock::GetTime ( LONGLONG * t ) {
// Use timeGetTime() if queryperformancecounter is not supported
if (!QueryPerformanceCounter( (LARGE_INTEGER *) t)) {
*t = timeGetTime();
}
cacheTime = *t;
}
LONGLONG Timers::SWGameClock::GetTimeElapsed ( void ) {
LONGLONG t;
// Use timeGetTime() if queryperformancecounter is not supported
if (!QueryPerformanceCounter( (LARGE_INTEGER *) &t )) {
t = timeGetTime();
}
return (t - cacheTime);
}
void Timers::SWGameClock::Initialize ( void ) {
if ( !QueryPerformanceFrequency((LARGE_INTEGER *) &this->frequency) ) {
this->frequency = 1000; // 1000ms to one second
}
this->timeCount = DWORD(this->frequency / TICKS_PER_SECOND);
}
Timers::SWGameClock::~SWGameClock(void)
{
}
with a header file that contains the following:
// Required for rendering stuff on time
#pragma once
#define TICKS_PER_SECOND 60
#define MAX_FRAMESKIP 5
namespace Timers {
class SWGameClock
{
public:
static SWGameClock* GetInstance();
void Initialize ( void );
DWORD timeCount;
void GetTime ( LONGLONG* t );
LONGLONG GetTimeElapsed ( void );
LONGLONG frequency;
~SWGameClock(void);
protected:
SWGameClock(void);
private:
static SWGameClock* pInstance;
}; // SWGameClock
} // Timers
This will ensure that your code runs at 60 FPS (or whatever you put in), though you can probably drop the MAX_FRAMESKIP as it's not truly implemented in this example!
You could try a WinMain function and use the SetTimer function with a regular message loop (you can also take advantage of the filter mechanism of GetMessage( ... )): test for the WM_TIMER message at the requested interval, and when your counter reaches the limit, call PostQuitMessage(0) to terminate the message loop.
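A minimal sketch of that idea using a thread timer (no window needed; a console main works the same way):

#include <windows.h>

int main()
{
    UINT_PTR timerId = SetTimer(NULL, 0, 1000 / 60, NULL);   // ~60 WM_TIMER messages per second
    int counter = 0;

    MSG msg;
    // Filter so only WM_TIMER is pulled from the queue (WM_QUIT is always retrieved).
    while (GetMessage(&msg, NULL, WM_TIMER, WM_TIMER) > 0) {
        if (msg.message == WM_TIMER) {
            // ... one iteration of work here ...
            if (++counter >= 240)
                PostQuitMessage(0);        // posts WM_QUIT, which ends the loop
        }
        DispatchMessage(&msg);
    }
    KillTimer(NULL, timerId);
    return 0;
}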
For a duty cycle that fast, you can use a high-accuracy timer (like QueryPerformanceCounter) and a busy-wait loop.
If you had a much lower duty cycle, but still wanted precision, then you could Sleep for part of the time and then eat up the leftover time with a busy-wait loop.
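A sketch of that hybrid approach against QueryPerformanceCounter (sleep away most of the interval, then spin for the precise remainder):

#include <windows.h>

// Wait until the QueryPerformanceCounter value reaches 'target' (a sketch).
void wait_until(LONGLONG target, LONGLONG freq)
{
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    LONGLONG remainingMs = (target - now.QuadPart) * 1000 / freq;
    if (remainingMs > 2)
        Sleep((DWORD)(remainingMs - 2));   // coarse wait, leave ~2 ms of slack
    do {
        QueryPerformanceCounter(&now);     // busy-wait the rest
    } while (now.QuadPart < target);
}

// Usage: a 60 Hz loop
// LARGE_INTEGER freq, next;
// QueryPerformanceFrequency(&freq);
// QueryPerformanceCounter(&next);
// for (;;) {
//     next.QuadPart += freq.QuadPart / 60;
//     /* ... one frame of work ... */
//     wait_until(next.QuadPart, freq.QuadPart);
// }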
Another option is to use something like DirectX to sync yourself to the VSync interrupt (which is almost always 60 Hz). This can make a lot of sense if you're coding a game or a/v presentation.
Windows is not a real-time OS, so there will never be a perfect way to do something like this, as there's no guarantee your thread will be scheduled to run exactly when you need it to.
Note that in the remarks for Sleep, the actual amount of time will be at least one "tick" and possibly one whole "tick" longer than the delay you requested before the thread is scheduled to run again (and then we have to assume the thread is scheduled). The "tick" can vary a lot depending on hardware and the version of Windows. It is commonly in the 10-15 ms range, and I've seen it as bad as 19 ms. For 60 Hz, you need 16.666 ms per iteration, so this is obviously not nearly precise enough to give you what you need.
What about rendering (iterating) based on the time elapsed between the rendering of each frame? Consider creating a void render(double timePassed) function and rendering depending on the timePassed parameter instead of putting the program to sleep.
Imagine, for example, you want to render a ball falling or bouncing. You know its speed, acceleration and all the other physics you need. Calculate the position of the ball based on timePassed and all the other physics parameters (speed, acceleration, etc.).
Or if you prefer, you could just skip the render() function execution if the time passed is too small, instead of putting the program to sleep.
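A sketch of that delta-time idea with a hypothetical falling ball (all names here are placeholders):

// Advance the simulation by however much wall-clock time actually passed.
void render(double timePassed)
{
    static double y = 0.0;            // ball height
    static double velocity = 0.0;     // downward speed
    const double gravity = 9.81;      // m/s^2

    velocity += gravity * timePassed; // integrate acceleration
    y += velocity * timePassed;       // integrate velocity

    // ... draw the ball at position y ...
}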
I am trying to fix a problem with a legacy Visual Studio Win32 unmanaged C++ app which is not keeping up with input. As part of my solution, I am exploring bumping up the class and thread priorities.
My PC has four Xeon processors, running 64-bit XP. I wrote a short Win32 test app which creates 4 background looping threads, each one running on its own processor. Some code samples are shown below. The problem is that even when I bump the priorities to the extreme, the CPU utilization is still less than 1%.
My test app is 32-bit, running on WOW64. The same test app also shows less than 1% CPU utilization on a 32-bit XP machine. I am an administrator on both machines. What else do I need to do to get this to work?
DWORD __stdcall ThreadProc4 (LPVOID)
{
SetThreadPriority(GetCurrentThread(),THREAD_PRIORITY_TIME_CRITICAL);
while (true)
{
for (int i = 0; i < 1000; i++)
{
int p = i;
int red = p *5;
theClassPrior4 = GetPriorityClass(theProcessHandle);
}
Sleep(1);
}
}
int APIENTRY _tWinMain(...)
{
...
theProcessHandle = GetCurrentProcess();
BOOL theAffinity = GetProcessAffinityMask(
theProcessHandle,&theProcessMask,&theSystemMask);
SetPriorityClass(theProcessHandle,REALTIME_PRIORITY_CLASS);
DWORD threadid4 = 0;
HANDLE thread4 = CreateThread((LPSECURITY_ATTRIBUTES)NULL,
0,
(LPTHREAD_START_ROUTINE)ThreadProc4,
NULL,
0,
&threadid4);
DWORD_PTR theAff4 = 8;
DWORD_PTR theAf4 = SetThreadAffinityMask(thread4,theAff4);
SetThreadPriority(thread4,THREAD_PRIORITY_TIME_CRITICAL);
ResumeThread(thread4);
Well, if you want it to actually eat CPU time, you'll want to remove that Sleep call - your 'processing' takes no significant amount of time, so the thread spends most of its time sleeping.
You'll also want to look at what the optimizer is doing to your code. I wouldn't be totally surprised if it completely removed 'p' and 'red' (and the multiply) in your loop (because the results are never used). You could try marking 'red' as volatile; that should force it to keep the calculation.
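For example, something along these lines will actually keep a core busy (a sketch based on the code above):

DWORD __stdcall ThreadProc4(LPVOID)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    volatile int red = 0;              // volatile keeps the optimizer from removing the stores
    while (true)
    {
        for (int i = 0; i < 1000; i++)
            red = i * 5;
        // no Sleep() here, so the thread never voluntarily yields the CPU
    }
}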
I'm writing a program which splits off into threads, each thread is timed, and then the times are added together in total_time.
I protect total_time using a mutex.
The program was working fine, until I added 'OutputDebugStringW', which is when I started getting these Unhandled exception / Access violation errors.
for (int loop = 0; loop < THREADS; loop++)
{
threads[loop] = (HANDLE) _beginthreadex(NULL, 0, MandelbrotThread, &m_args[loop], 0, NULL);
}
WaitForMultipleObjects(THREADS, threads, TRUE, INFINITE);
OutputDebugStringW(LPCWSTR(total_time));
Within each of these threads, it does some calculation which it times, calls EnterCriticalSection, adds the time taken to total_time, calls LeaveCriticalSection, then ends.
I tried adding EnterCriticalSection and LeaveCriticalSection around OutputDebugStringW(), but it didn't help fix the error.
Any thoughts?
Update 1:
Here is the MandelbrotThread function -
unsigned int __stdcall MandelbrotThread(void *data)
{
long long int time = get_time();
MandelbrotArgs *m_args = (MandelbrotArgs *) data;
compute_mandelbrot(m_args->left, m_args->right, m_args->top, m_args->bottom, m_args->y_start, m_args->lines_to_render);
time = get_time() - time;
EnterCriticalSection(&time_mutex);
total_time = total_time + time;
LeaveCriticalSection(&time_mutex);
return 0;
}
m_args are the sides of the set to be rendered (so the same for every thread), and the line to start on (y_start) and the number of lines to render.
reinterpret_cast'ing a number to a string will certainly shut up the compiler, but will not make your program magically work. You need to convert it, using sprintf, or preferably boost::lexical_cast (although I'm guessing the latter's not an option for you).
WCHAR buf[32];
swprintf_s(buf, 32, L"%I64d\n", total_time);
OutputDebugStringW(buf);