The relationship between thread running time, CPU context switching and performance - C++

I ran an experiment to simulate what happens in our server code: I started 1024 threads, and every thread executes a system call. This takes about 2.8s to finish on my machine. Then I added usleep(1000000) to each thread's function, and the execution time increased to 16s; it drops to about 8s when I run the same program a second time. I guess this may be caused by the CPU cache and CPU context switching, but I'm not quite sure how to explain it.
Besides, what is the best practice for avoiding this (where increasing the running time of every thread a little leads to a large decrease in whole-program performance)?
I have attached the test code below; thanks a lot for your help.
//largetest.cc
#include "local.h"
#include <time.h>
#include <thread>
#include <string>
#include <unistd.h>

using namespace std;

#define BILLION 1000000000L

int main()
{
    struct timespec start, end;
    double diff;
    clock_gettime(CLOCK_REALTIME, &start);
    int i = 0;
    int reqNum = 1024;
    for (i = 0; i < reqNum; i++)
    {
        string command = string("echo abc");
        thread{localTaskStart, command}.detach();
    }
    while (1)
    {
        if (localFinishNum == reqNum)
        {
            break;
        }
        else
        {
            usleep(1000000);
        }
        printf("curr num %d\n", localFinishNum);
    }
    clock_gettime(CLOCK_REALTIME, &end); /* mark the end time */
    diff = (end.tv_sec - start.tv_sec) * 1.0 + (end.tv_nsec - start.tv_nsec) * 1.0 / BILLION;
    printf("debug for running time = (%lf) second\n", diff);
    return 0;
}
//local.cc
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include "local.h"
#include <unistd.h>
#include <string>
#include <mutex>

using namespace std;

mutex testNotifiedNumMtx;
int localFinishNum = 0;

int localTaskStart(string batchPath)
{
    char command[200];
    sprintf(command, "%s", batchPath.data());
    usleep(1000000);
    system(command);
    testNotifiedNumMtx.lock();
    localFinishNum++;
    testNotifiedNumMtx.unlock();
    return 0;
}
//local.h
#ifndef local_h
#define local_h

#include <string>
using namespace std;

int localTaskStart(string batchPath);
extern int localFinishNum;

#endif

The read of localFinishNum should also be protected by the mutex; otherwise the results are unpredictable, depending on where (i.e. on which cores) the threads get scheduled, when and how the cache gets invalidated, and so on.
In fact, the program might not even terminate if you compile it in optimized mode, if the compiler decides to keep localFinishNum in a register (instead of always loading it from memory).
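For illustration, here is a minimal sketch of the same counting pattern using std::atomic<int>, which makes both the increment and the polling read well-defined. The task body and thread count are simplified stand-ins for the question's code:

#include <atomic>
#include <thread>
#include <unistd.h>

std::atomic<int> localFinishNum{0};   // replaces the plain int plus mutex

void localTask()
{
    usleep(1000);                     // stand-in for the real work
    localFinishNum.fetch_add(1);      // atomic increment, no mutex needed
}

int main()
{
    const int reqNum = 16;            // small count, purely for the sketch
    for (int i = 0; i < reqNum; i++)
        std::thread(localTask).detach();
    while (localFinishNum.load() != reqNum)  // atomic read: no data race, and
        usleep(1000);                        // never cached away in a register
    return 0;
}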

Related

How to Run Code in a Loop Asynchronously without Stopping other Code C++

I was wondering how to run a loop in the background of a C++ program without stopping the main function, something like setInterval in JavaScript.
I don't really want to use any libraries for this, as I don't want to complicate the installation on the embedded machine.
This should give enough of an example to build from.
#include <thread>
#include <chrono>

void background(std::chrono::milliseconds interval) {
    while (1) {
        // do your task
        std::this_thread::sleep_for(interval);
    }
}

int main() {
    auto interval = std::chrono::milliseconds(500);
    std::thread background_worker(&background, interval);
    // main work
    background_worker.join();
}
EDIT: For those without std::thread on POSIX systems:
#include <pthread.h>
#include <unistd.h>

void *background(void *interval) {
    unsigned int interval_ms = (*(unsigned int*)interval) * 1000;
    while (1) {
        // do your task
        usleep(interval_ms);
    }
}

int main() {
    unsigned int interval = 500;
    pthread_t background_worker;
    pthread_create(&background_worker, NULL, background, (void*)&interval);
    // main work
    pthread_join(background_worker, NULL);
}

Measuring execution time when using threads

I would like to measure the execution time of some code. The code starts in the main() function and finishes in an event handler.
I have C++11 code that looks like this:
#include <iostream>
#include <time.h>
...

volatile clock_t t;

void EventHandler()
{
    // this function being called marks the end of the part I want to measure
    t = clock() - t;
    std::cout << "time in seconds: " << ((float)t)/CLOCKS_PER_SEC;
}

int main()
{
    MyClass* instance = new MyClass(EventHandler); // this constructor starts a new std::thread
    instance->start(...); // this only passes some data to the thread; later the thread will call EventHandler()
    t = clock();
    return 0;
}
So it is guaranteed that EventHandler() will be called exactly once, and only after the instance->start() call.
It works and gives me some output, but it is horrible code: it uses a global variable that different threads access. However, I can't change the API I am using (the constructor, or the way the thread calls EventHandler).
I would like to ask if a better solution exists.
Thank you.
A global variable is unavoidable as long as MyClass expects a plain function and there is no way to pass a context pointer along with it...
You could write the code in a slightly tidier way, though:
#include <future>
#include <thread>
#include <chrono>
#include <iostream>

struct MyClass
{
    typedef void (CallbackFunc)();

    constexpr explicit MyClass(CallbackFunc* handler)
        : m_handler(handler)
    {
    }

    void Start()
    {
        std::thread(&MyClass::ThreadFunc, this).detach();
    }

private:
    void ThreadFunc()
    {
        std::this_thread::sleep_for(std::chrono::seconds(5));
        m_handler();
    }

    CallbackFunc* m_handler;
};

std::promise<std::chrono::time_point<std::chrono::high_resolution_clock>> gEndTime;

void EventHandler()
{
    gEndTime.set_value(std::chrono::high_resolution_clock::now());
}

int main()
{
    MyClass task(EventHandler);
    auto trigger = gEndTime.get_future();
    auto startTime = std::chrono::high_resolution_clock::now();
    task.Start();
    trigger.wait();
    std::chrono::duration<double> diff = trigger.get() - startTime;
    std::cout << "Duration = " << diff.count() << " secs." << std::endl;
    return 0;
}
A clock() call will not filter out the execution of other processes and threads that the scheduler runs in parallel with the event-handler thread. There are alternatives like times() and getrusage(), which report the CPU time of the process. Their per-thread behaviour is not clearly documented, but on Linux threads are treated much like processes, so it is worth investigating.
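For instance, on Linux getrusage() accepts the non-portable RUSAGE_THREAD flag (since kernel 2.6.26), which reports the CPU time of just the calling thread. A minimal sketch, with the busy loop purely illustrative:

#define _GNU_SOURCE   /* needed for RUSAGE_THREAD on glibc */
#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
    volatile long x = 0;
    for (long i = 0; i < 100000000; i++)   /* burn some CPU on this thread */
        x += i;

    struct rusage ru;
    if (getrusage(RUSAGE_THREAD, &ru) == 0) /* CPU time of the calling thread only */
        printf("thread CPU time: %ld.%06ld s\n",
               (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    return 0;
}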
clock() is the wrong tool here, because it does not count the time actually required by the CPU to run your operation; for example, if the thread is not running at all, the time is still counted.
Instead you have to use platform-specific APIs, such as pthread_getcpuclockid for POSIX-compliant systems (check whether _POSIX_THREAD_CPUTIME is defined), which counts the actual time spent by a specific thread.
You can take a look at a benchmarking library I wrote for C++ that supports thread-aware measuring (see the struct thread_clock implementation).
Or, you can use the code snippet from the man page:
/* Link with "-lrt" */
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <string.h>
#include <errno.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

#define handle_error_en(en, msg) \
    do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

static void *
thread_start(void *arg)
{
    printf("Subthread starting infinite loop\n");
    for (;;)
        continue;
}

static void
pclock(char *msg, clockid_t cid)
{
    struct timespec ts;

    printf("%s", msg);
    if (clock_gettime(cid, &ts) == -1)
        handle_error("clock_gettime");
    printf("%4ld.%03ld\n", ts.tv_sec, ts.tv_nsec / 1000000);
}

int
main(int argc, char *argv[])
{
    pthread_t thread;
    clockid_t cid;
    int j, s;

    s = pthread_create(&thread, NULL, thread_start, NULL);
    if (s != 0)
        handle_error_en(s, "pthread_create");

    printf("Main thread sleeping\n");
    sleep(1);

    printf("Main thread consuming some CPU time...\n");
    for (j = 0; j < 2000000; j++)
        getppid();

    pclock("Process total CPU time: ", CLOCK_PROCESS_CPUTIME_ID);

    s = pthread_getcpuclockid(pthread_self(), &cid);
    if (s != 0)
        handle_error_en(s, "pthread_getcpuclockid");
    pclock("Main thread CPU time: ", cid);

    /* The preceding 4 lines of code could have been replaced by:
       pclock("Main thread CPU time: ", CLOCK_THREAD_CPUTIME_ID); */

    s = pthread_getcpuclockid(thread, &cid);
    if (s != 0)
        handle_error_en(s, "pthread_getcpuclockid");
    pclock("Subthread CPU time: 1 ", cid);

    exit(EXIT_SUCCESS);         /* Terminates both threads */
}
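(If you try the man-page snippet, note the comment at the top: older glibc versions need -lrt at link time, and you will also want -pthread, so something like gcc prog.c -pthread -lrt should work.)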

Using std::thread with std::mutex

I am trying out mutex locking with independent threads. The requirement is that I have many threads which run independently and access/update a common resource. To ensure that the resource is updated by only one task at a time, I used a mutex. However, this is not working.
I have pasted a representation of what I am trying to do below:
#include <iostream>
#include <map>
#include <string>
#include <chrono>
#include <thread>
#include <mutex>
#include <unistd.h>

std::mutex mt;
static int iMem = 0;
int maxITr = 1000;

void renum()
{
    // Ensure that only 1 task will update the variable
    mt.lock();
    int tmpMem = iMem;
    usleep(100); // Make the system sleep/induce delay
    iMem = tmpMem + 1;
    mt.unlock();
    printf("iMem = %d\n", iMem);
}

int main()
{
    for (int i = 0; i < maxITr; i++) {
        std::thread mth(renum);
        mth.detach(); // Run each task in an independent thread
    }
    return 0;
}
But it terminates with the error below:
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
I want to know whether the usage of thread::detach() above is correct. If I use .join() it works, but I want each thread to run independently, not to wait for a thread to finish.
I also want to know the best way to achieve this logic.
Try this:
int main()
{
    std::vector<std::thread> mths;
    mths.reserve(maxITr);
    for (int i = 0; i < maxITr; i++) {
        mths.emplace_back(renum);
    }
    for (auto& mth : mths) {
        mth.join();
    }
}
This way, you retain control of the threads (by not calling detach()), and you can join them all at the end, so you know they have completed their tasks.
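If creating all the threads up front is not an option ("Resource temporarily unavailable" usually means pthread_create hit a resource limit), one possible compromise is to create and join the threads in bounded batches. A sketch, where the batch size of 100 is purely illustrative:

#include <thread>
#include <vector>
#include <mutex>
#include <cstdio>
#include <cstddef>

std::mutex mt;
static int iMem = 0;

void renum() // same idea as in the question, minus the sleep
{
    std::lock_guard<std::mutex> lock(mt);
    ++iMem;
}

int main()
{
    const int maxITr = 1000;
    const std::size_t batchSize = 100;   // illustrative cap on live threads
    std::vector<std::thread> batch;
    for (int i = 0; i < maxITr; i++) {
        batch.emplace_back(renum);
        if (batch.size() == batchSize || i == maxITr - 1) {
            for (auto& t : batch)
                t.join();                // drain this batch before creating more
            batch.clear();
        }
    }
    std::printf("iMem = %d\n", iMem);
}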

Can properly written code using mutex be still volatile?

I've been doing pretty basic stuff with std::thread, without any particular reason, simply in order to learn it. I thought the simple example I created, where a few threads operate on the same data and lock it before doing so, worked just fine, until I realized that every time I run it the returned value is different. The values are very close to each other, but I am pretty sure they should be equal. Some of the values I have received:
21.692524
21.699258
21.678871
21.705947
21.685744
Am I doing something wrong, or is there an underlying reason for this behaviour?
#include <string>
#include <iostream>
#include <thread>
#include <math.h>
#include <time.h>
#include <windows.h>
#include <mutex>

using namespace std;

mutex mtx;
mutex mtx2;
int currentValue = 1;
double suma = 0;

int assignPart() {
    mtx.lock();
    int localValue = currentValue;
    currentValue += 10000000;
    mtx.unlock();
    return localValue;
}

void calculatePart()
{
    int value;
    double sumaLokalna = 0;
    while (currentValue < 1500000000) {
        value = assignPart();
        for (double i = value; i < (value + 10000000); i++) {
            sumaLokalna = sumaLokalna + (1 / (i));
        }
        mtx2.lock();
        suma += sumaLokalna;
        mtx2.unlock();
        sumaLokalna = 0;
    }
}

int main()
{
    clock_t startTime = clock();

    // Constructs the new threads and runs them. Does not block execution.
    thread watek(calculatePart);
    thread watek2(calculatePart);
    thread watek3(calculatePart);
    thread watek4(calculatePart);

    while (currentValue < 1500000000) {
        Sleep(100);
        printf("%-12d %-12lf \n", currentValue, suma);
    }

    // Makes the main thread wait for the new threads to finish execution,
    // therefore blocks its own execution.
    watek.join();
    watek2.join();
    watek3.join();
    watek4.join();

    cout << double(clock() - startTime) / (double)CLOCKS_PER_SEC << " seconds." << endl;
}
Your loop
while (currentValue < 1500000000) {
    Sleep(100);
    printf("%-12d %-12lf \n", currentValue, suma);
}
is printing intermediate results, but you're not printing the final result.
To print the final result, add the line
printf("%-12d %-12lf \n",currentValue, suma);
after joining the threads.
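In other words, the tail of main() becomes:

watek.join();
watek2.join();
watek3.join();
watek4.join();
// all workers are done here, so suma holds the final, stable result
printf("%-12d %-12lf \n", currentValue, suma);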

The fastest way to lock access to a data within single process on Linux

I'm experimenting with locking data on Windows vs Linux.
The code I'm using for testing looks something like this:
#include <mutex>
#include <time.h>
#include <iostream>
#include <vector>
#include <thread>
#include <memory>

using namespace std;

mutex m;
unsigned long long dd = 0;

void RunTest()
{
    for (int i = 0; i < 100000000; i++)
    {
        unique_lock<mutex> lck{m};
        //boost::mutex::scoped_lock guard(m1);
        dd++;
    }
}

int main(int argc, char *argv[])
{
    clock_t tStart = clock();
    int tCount = 0;
    vector<shared_ptr<thread>> threads;
    for (int i = 0; i < 10; i++)
    {
        threads.push_back(shared_ptr<thread>{new thread(RunTest)});
    }
    RunTest();
    for (auto t : threads)
    {
        t->join();
    }
    cout << ((double)(clock() - tStart) / CLOCKS_PER_SEC) << endl;
    return 0; //v.size();
}
I'm comparing g++ -O3 against Visual Studio 2013 in release mode.
When I use unique_lock<mutex> for synchronization, Linux beats Windows in most scenarios, sometimes significantly.
But when I use Windows' CRITICAL_SECTION, the situation reverses, and the Windows code becomes much faster than the Linux code, especially as the thread count increases.
Here's the code I'm using for the Windows critical-section test:
#include <stdafx.h>
#include <mutex>
#include <time.h>
#include <iostream>
//#include <boost/mutex>
#include <vector>
#include <thread>
#include <memory>
#include <Windows.h>

using namespace std;

mutex m;
unsigned long long dd = 0;

CRITICAL_SECTION critSec;

void RunTest()
{
    for (int i = 0; i < 100000000; i++)
    {
        //unique_lock<mutex> lck{ m };
        EnterCriticalSection(&critSec);
        dd++;
        LeaveCriticalSection(&critSec);
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    InitializeCriticalSection(&critSec);
    clock_t tStart = clock();
    int tCount = 0;
    vector<shared_ptr<thread>> threads;
    for (int i = 0; i < 10; i++)
    {
        threads.push_back(shared_ptr<thread>{new thread(RunTest)});
    }
    RunTest();
    for (auto t : threads)
    {
        t->join();
    }
    cout << ((double)(clock() - tStart) / CLOCKS_PER_SEC) << endl;
    DeleteCriticalSection(&critSec);
    return 0;
}
The way I understand it, this happens because critical sections are process-specific.
Most of the synchronization I'll be doing will be within a single process.
Is there anything on Linux that is faster than a mutex or Windows' critical section?
First, your code has a huge contention problem and thus does not reflect any sane situation, which makes it poorly suited as a benchmark. Most mutex implementations are optimized for the case where a lock can be acquired without waiting. In the other case, i.e. high contention, which involves blocking a thread, the mutex overhead becomes insignificant anyway, and you should redesign the system to get a decent improvement: split the data across multiple locks, use a lock-free algorithm, or use transactional memory (available as the TSX extension on some Haswell processors, or as a software implementation).
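For this particular benchmark (all threads bump one shared counter), the lock-free route mentioned above is a one-line change in spirit; a sketch, not a tuned implementation:

#include <atomic>
#include <thread>
#include <vector>

std::atomic<unsigned long long> dd{0};

void RunTest()
{
    for (int i = 0; i < 100000000; i++)
        dd.fetch_add(1, std::memory_order_relaxed); // one atomic RMW, no lock at all
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 10; i++)
        threads.emplace_back(RunTest);
    RunTest();
    for (auto& t : threads)
        t.join();
    return 0;
}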
Now, to explain the difference: a CRITICAL_SECTION on Windows actually spins for a short while before falling back to a thread-blocking wait. Since blocking a thread involves orders of magnitude more overhead, in a low-contention situation a spinlock may greatly reduce the chance of incurring that overhead. (Note that in a high-contention situation a spinlock actually makes things worse.)
On Linux, you may want to look into the fast userspace mutex, or futex, which adopts a similar idea.
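To make the idea concrete, here is a minimal sketch of a spin-then-yield lock in the spirit of what CRITICAL_SECTION does; the spin limit of 4000 is purely illustrative, and a real futex-based lock would block in the kernel instead of merely yielding:

#include <atomic>
#include <thread>

class SpinThenYieldLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        int spins = 0;
        while (flag_.test_and_set(std::memory_order_acquire)) {
            if (++spins > 4000)            // bounded spin first...
                std::this_thread::yield(); // ...then give up the CPU slice
        }
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
};

A SpinThenYieldLock could be dropped into the RunTest() loop in place of the mutex to compare behaviour, though under this benchmark's extreme contention it will likely do worse, exactly as noted above.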