std::thread does not start immediately as expected (c++11) - c++

I have the following code in my main.cpp
std::thread t1(&AgentsSourcesManager::Run, &sim.GetAgentSrcManager());
doSomething(); // in the main Thread
t1.join();
I was expecting t1 to start immediately and start along the main thread.
However, this is not the case. I measure the execution time of my program, repeat this 100 times and make some plots.
See the peak in the following picture.
Now if I wait a bit after the creation of t1
std::this_thread::sleep_for(std::chrono::milliseconds(100));
I get better results. See the following picture.
(Still with a peak there, but well ..)
Obviously my questions are:
Why a peak?
Why I don't have a straight line?
EDIT
Ok, from the comments I understand by now, that there might be some scheduler magic going on.
Here is a working example
#include <thread>
#include <chrono>
#include <iostream>
#include <pthread.h>
#include <functional>
int main() {
float x = 0; float y = 0;
std::chrono::time_point<std::chrono::system_clock> start, stop;
start= std::chrono::system_clock::now();
auto Thread = std::thread([](){std::cout<<"Excuting thread"<<std::endl;});
stop = std::chrono::system_clock::now();
for(int i = 0 ; i<10000 ; i++)
y += x*x*x*x*x;
std::this_thread::sleep_for(std::chrono::milliseconds(100));
Thread.join();
std::chrono::duration<double> elapsed_time = stop - start;
std::cout << "Taken time: " << std::to_string(elapsed_time.count()) << " "<< std::endl;
return 0;
}
Compiling:
g++-7 -lpthread threads.cpp -o out2.out
For Analysis I use this code
import subprocess
import matplotlib.pyplot as plt
import numpy as np
RUNS = 1000
factor = 1000
times = []
for i in range(RUNS):
p = subprocess.run(["./out2.out"], stdout=subprocess.PIPE)
line = p.stdout
times.append(float(line.split()[-1]))
print(i, RUNS)
times = np.array(times) * factor
plt.plot(times, "-")
plt.ylabel("time * %d" % factor)
plt.xlabel("#runs")
plt.title("mean %.3f (+- %.3f), min = %.3f, max = %.3f" %
(np.mean(times), np.std(times), np.min(times), np.max(times)))
plt.savefig("log2.png")
Result
I think I should better ask: How can I reduce this latency and tell my OS, that this thread is really important to me and should have a higher priority?

You are not measuring what you think you are measuring here:
start= std::chrono::system_clock::now();
auto Thread = std::thread([](){std::cout<<"Excuting thread"<<std::endl;});
stop = std::chrono::system_clock::now();
The stop timestamp only gives you an upper bound on how long it takes main to spawn that thread and it actually tells you nothing about when that thread will start doing any actual work (for that you would need to take a timestamp inside the thread).
Also, system_clock is not the best clock for such measurements on most platforms, you should use steady_clock by default and resort to high_resolution_clock if that one doesn't give you enough precision (but note that you will have to deal with the non-monotonic nature of that clock by yourself then, which can easily mess up the gained precision for you).
As was mentioned already in the comments, spawning a new thread (and thus also constructing a new std::thread) is a very complex and time-consuming operation. If you need high responsiveness, what you want to do is spawn a couple of threads during startup of your program and then have them wait on a std::condition_variable that will get signalled as soon as work for them becomes available. That way you can be sure that on an otherwise idle system a thread will start processing the work that was assigned to him very quickly (immediately is not possible on most systems due to how the operating system schedules threads, but the delay should be well under a millisecond).

Related

how to write a measurement function for multithreaded function [duplicate]

I am running a .cpp code (i) in sequential style and (ii) using OpenMP statements. I am trying to see the time difference. For calculating time, I use this:
#include <time.h>
.....
main()
{
clock_t start, finish;
start = clock();
.
.
.
finish = clock();
processing time = (double(finish-start)/CLOCKS_PER_SEC);
}
The time is pretty accurate in sequential (above) run of the code. It takes about 8 seconds to run this. When I insert OpenMP statements in the code and thereafter calculate the time I get a reduction in time, but the time displayed is about 8-9 seconds on the console, when actually its just 3-4 seconds in real time!
Here is how my code looks abstractly:
#include <time.h>
.....
main()
{
clock_t start, finish;
start = clock();
.
.
#pragma omp parallel for
for( ... )
for( ... )
for (...)
{
...;
}
.
.
finish = clock();
processing time = (double(finish-start)/CLOCKS_PER_SEC);
}
When I run the above code, I get the reduction in time but the time displayed is not accurate in terms of real time. It seems to me as though the clock () function is calculating each thread's individual time and adding up them up and displaying them.
Can someone tell the reason for this or suggest me any other timing function to use to measure the time in OpenMP programs?
Thanks.
It seems to me as though the clock () function is calculating each thread's individual time and adding up them up and displaying them.
This is exactly what clock() does - it measures the CPU time used by the process, which at least on Linux and Mac OS X means the cumulative CPU time of all threads that have ever existed in the process since it was started.
Real-clock (a.k.a. wall-clock) timing of OpenMP applications should be done using the high resolution OpenMP timer call omp_get_wtime() which returns a double value of the number of seconds since an arbitrary point in the past. It is a portable function, e.g. exists in both Unix and Windows OpenMP run-times, unlike gettimeofday() which is Unix-only.
I've seen clock() reporting CPU time, instead of real time.
You could use
struct timeval start, end;
gettimeofday(&start, NULL);
// benchmark code
gettimeofday(&end, NULL);
delta = ((end.tv_sec - start.tv_sec) * 1000000u +
end.tv_usec - start.tv_usec) / 1.e6;
To time things instead
You could use the built in omp_get_wtime function in omp library itself. Following is an example code snippet to find out execution time.
#include <stdio.h>
#include <omp.h>
int main(){
double itime, ftime, exec_time;
itime = omp_get_wtime();
// Required code for which execution time needs to be computed
ftime = omp_get_wtime();
exec_time = ftime - itime;
printf("\n\nTime taken is %f", exec_time);
}
Well yes, that's what clock() is supposed to do, tell you how much processor time the program used.
If you want to find elapsed real time, instead of CPU time, use a function that returns wall clock time, such as gettimeofday().
#include "ctime"
std::time_t start, end;
long delta = 0;
start = std::time(NULL);
// do your code here
end = std::time(NULL);
delta = end - start;
// output delta

Simple division of labour over threads is not reducing the time taken

I have been trying to improve computation times on a project by splitting the work into tasks/threads and it has not been working out very well. So I decided to make a simple test project to see if I can get it working in a very simple case and this also is not working out as I expected it to.
What I have attempted to do is:
do a task X times in one thread - check the time taken.
do a task X / Y times in Y threads - check the time taken.
So if 1 thread takes T seconds to do 100'000'000 iterations of "work" then I would expect:
2 threads doing 50'000'000 iterations each would take ~ T / 2 seconds
3 threads doing 33'333'333 iterations each would take ~ T / 3 seconds
and so on until I reach some threading limit (number of cores or whatever).
So I wrote the code and tested it on my 8 core system (AMD Ryzen) plenty of RAM >16GB doing nothing else at the time.
1 Threads took: ~6.5s
2 Threads took: ~6.7s
3 Threads took: ~13.3s
8 Threads took: ~16.2s
So clearly something is not right here!
I ported the code into Godbolt and I see similar results. Godbolt only allows 3 threads, and for 1, 2 or 3 threads it takes ~8s (this varies by about 1s) to run. Here is the godbolt live code: https://godbolt.org/z/6eWKWr
Finally here is the code for reference:
#include <iostream>
#include <math.h>
#include <vector>
#include <thread>
#define randf() ((double) rand()) / ((double) (RAND_MAX))
void thread_func(uint32_t interations, uint32_t thread_id)
{
// Print the thread id / workload
std::cout << "starting thread: " << thread_id << " workload: " << interations << std::endl;
// Get the start time
auto start = std::chrono::high_resolution_clock::now();
// do some work for the required number of interations
for (auto i = 0u; i < interations; i++)
{
double value = randf();
double calc = std::atan(value);
(void) calc;
}
// Get the time taken
auto total_time = std::chrono::high_resolution_clock::now() - start;
// Print it out
std::cout << "thread: " << thread_id << " finished after: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
<< "ms" << std::endl;
}
int main()
{
// Note these numbers vary by about probably due to godbolt servers load (?)
// 1 Threads takes: ~8s
// 2 Threads takes: ~8s
// 3 Threads takes: ~8s
uint32_t num_threads = 3; // Max 3 in godbolt
uint32_t total_work = 100'000'000;
// Seed rand
std::srand(static_cast<unsigned long>(std::chrono::steady_clock::now().time_since_epoch().count()));
// Store the start time
auto overall_start = std::chrono::high_resolution_clock::now();
// Start all the threads doing work
std::vector<std::thread> task_list;
for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
{
task_list.emplace_back(std::thread([=](){ thread_func(total_work / num_threads, thread_id); }));
}
// Wait for the threads to finish
for (auto &task : task_list)
{
task.join();
}
// Get the end time and print it
auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
std::cout << "\n==========================\n"
<< "thread overall_total_time time: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
<< "ms" << std::endl;
return 0;
}
Note: I have tried using std::async also with no difference (not that I was expecting any). I also tried compiling for release - no difference.
I have read such questions as: why-using-more-threads-makes-it-slower-than-using-less-threads and I can't see an obvious (to me) bottle neck:
CPU bound (needs lots of CPU resources): I have 8 cores
Memory bound (needs lots of RAM resources): I have assigned my VM 10GB ram, running nothing else
I/O bound (Network and/or hard drive resources): No network trafic involved
There is no sleeping/mutexing going on here (like there is in my real project)
Questions are:
Why might this be happening?
What am I doing wrong?
How can I improve this?
The rand function is not guaranteed to be thread safe. It appears that, in your implementation, it is by using a lock or mutex, so if multiple threads are trying to generate a random number that take turns. As your loop is mostly just the call to rand, the performance suffers with multiple threads.
You can use the facilities of the <random> header and have each thread use it's own engine to generate the random numbers.
Never mind that rand() is or isn't thread safe. That might be the explanation if a statistician told you that the "random" numbers you were getting were defective in some way, but it doesn't explain the timing.
What explains the timing is that there is only one random state object, it's out in memory somewhere, and all of your threads are competing with each other to access it.
No matter how many CPUs your system has, only one thread at a time can access the same location in main memory.
It would be different if each of the threads had its own independent random state object. Then, most of the accesses from any given CPU to its own private random state would only have to go as far as the CPU's local cache, and they would not conflict with what the other threads, running on other CPUs, each with their own local cache were doing.

Best way to implement a high resolution timer

What is the best way in C++11 to implement a high-resolution timer that continuously checks for time in a loop, and executes some code after it passes a certain point in time? e.g. check what time it is in a loop from 9am onwards and execute some code exactly at 11am. I require the timing to be precise (i.e. no more than 1 microsecond after 9am).
I will be implementing this program on Linux CentOS 7.3, and have no issues with dedicating CPU resources to execute this task.
Instead of implementing this manually, you could use e.g. a systemd.timer. Make sure to specify the desired accuracy which can apparently be as precise as 1us.
a high-resolution timer that continuously checks for time in a loop,
First of all, you do not want to continuously check the time in a loop; that's extremely inefficient and simply unnecessary.
...executes some code after it passes a certain point in time?
Ok so you want to run some code at a given time in the future, as accurately as possible.
The simplest way is to simply start a background thread, compute how long until the target time (in the desired resolution) and then put the thread to sleep for that time period. When your thread wakes up, it executes the actual task. This should be accurate enough for the vast majority of needs.
The std::chrono library provides calls which make this easy:
System clock in std::chrono
High resolution clock in std::chrono
Here's a snippet of code which does what you want using the system clock (which makes it easier to set a wall clock time):
// c++ --std=c++11 ans.cpp -o ans
#include <thread>
#include <iostream>
#include <iomanip>
// do some busy work
int work(int count)
{
int sum = 0;
for (unsigned i = 0; i < count; i++)
{
sum += i;
}
return sum;
}
std::chrono::system_clock::time_point make_scheduled_time (int yyyy, int mm, int dd, int HH, int MM, int SS)
{
tm datetime = tm{};
datetime.tm_year = yyyy - 1900; // Year since 1900
datetime.tm_mon = mm - 1; // Month since January
datetime.tm_mday = dd; // Day of the month [1-31]
datetime.tm_hour = HH; // Hour of the day [00-23]
datetime.tm_min = MM;
datetime.tm_sec = SS;
time_t ttime_t = mktime(&datetime);
std::chrono::system_clock::time_point scheduled = std::chrono::system_clock::from_time_t(ttime_t);
return scheduled;
}
void do_work_at_scheduled_time()
{
using period = std::chrono::system_clock::period;
auto sched_start = make_scheduled_time(2019, 9, 17, // date
00, 14, 00); // time
// Wait until the scheduled time to actually do the work
std::this_thread::sleep_until(sched_start);
// Figoure out how close to scheduled time we actually awoke
auto actual_start = std::chrono::system_clock::now();
auto start_delta = actual_start - sched_start;
float delta_ms = float(start_delta.count())*period::num/period::den * 1e3f;
std::cout << "worker: awoken within " << delta_ms << " ms" << std::endl;
// Now do some actual work!
int sum = work(12345);
}
int main()
{
std::thread worker(do_work_at_scheduled_time);
worker.join();
return 0;
}
On my laptop, the typical latency is about 2-3ms. If you use the high_resolution_clock you should be able to get even better results.
There are other APIs you could use too, such as Boost where you could use ASIO to implement high res timeout.
I require the timing to be precise (i.e. no more than 1 microsecond after 9am).
Do you really need it to be accurate to the microsecond? Consider that at this resolution, you will also need to take into account all sorts of other factors, including system load, latency, clock jitter, and so on. Your code can start to execute at close to that time, but that's only part of the problem.
My suggestion would be to use timer_create(). This allows you to get notified by a signal at a given time. You can then implement your action in the signal handler.
In any case you should be aware that the accuracy of course depends on the system clock accuracy.

What is the most efficient way to calling a function every n seconds in c++?

So I'm trying to call a function every n seconds. The below is a simple representation of what I'm trying to achieve. I wanted to know if the below method is the only way to achieve this. I would love if the "if" condition can be avoided.
#include <stdio.h>
#include <time.h>
void print_hello(int i) {
printf("hello\n");
printf("%d\n", i);
}
int main () {
time_t start_t, end_t;
double diff_t;
time(&start_t);
int i = 0;
while(1) {
time(&end_t);
// printf("here in main");
i = i + 1;
diff_t = difftime(end_t, start_t);
if(diff_t==5) {
// printf("Execution time = %f\n", diff_t);
print_hello(i);
time(&start_t);
}
}
return(0);
}
The usage of time in OPs program can be reduced to something like
// get tStart;
// set tEnd = tStart + x;
do {
// get t;
} while (t < tEnd);
This is what is called busy-wait.
It might be used to write code with most precise timing as well as in other special cases. The draw-back is that the waiting consumes ful CPU load. (You might be even able to hear this – by raising ventilation noise.)
In general, however, spinning is considered an anti-pattern and should be avoided, as processor time that could be used to execute a different task is instead wasted on useless activity.
Another option is to delegate the wake-up to the system, which reduces the load of process/thread to minimum while waiting:
#include <chrono>
#include <iostream>
#include <thread>
void print_hello(int i)
{
std::cout << "hello\n"
<< i << '\n';
}
int main ()
{
using namespace std::chrono_literals; // to support e.g. 5s for 5 sceconds
auto tStart = std::chrono::system_clock::now();
for (int i = 1; i <= 3; ++i) {
auto tEnd = tStart + 2s;
std::this_thread::sleep_until(tEnd);
print_hello(i);
tStart = tEnd;
}
}
Output:
hello
1
hello
2
hello
3
Live Demo on coliru
(I had to reduce number of iterations and the waiting times to prevent the TLE in online compiler.)
std::this_thread::sleep_until
Blocks the execution of the current thread until specified sleep_time has been reached.
The clock tied to sleep_time is used, which means that adjustments of the clock are taken into account. Thus, the duration of the block might, but might not, be less or more than sleep_time - Clock::now() at the time of the call, depending on the direction of the adjustment. The function also may block for longer than until after sleep_time has been reached due to scheduling or resource contention delays.
The last sentence mentions the draw-back of this solution: The OS may decide to wake-up the thread/process later than requested. That may happen e.g. is OS is under high load. In the “normal” case, the latency shouldn't be more than a few milli-seconds. So, the latency might be tolerable.
Please, note how tEnd and tStart are updated in loop. The current wake-up time is not considered to prevent accumulation of latencies.

Simplest way to have the following timer

I am currently trying to re-write some software in C++ from an old python code.
In the python version I used to have timers like these:
from time import time, sleep
t_start = time()
while (time()-t_start < 5) # 5 second timer
# do some stuff
sleep(0.001) #Don't slam the CPU
sleep(1)
print time()-t_start # would print something like 6.123145 notice the precision!
However, in C++ when I try to use time(0) from < time.h > I can only get precision in seconds as an integer, not a float.
#include <time.h>
#include <iostream>
time_t t_start = time(0)
while (time(0) - t_start < 5) // again a 5 second timer.
{
// Do some stuff
sleep(0.001) // long boost sleep method.
}
sleep(1);
std::cout << time(0)-t_start; // Prints 6 instead of 6.123145
I have also tried gettimeofday(struct, NULL) from < sys/time.h > however whenever I sleep the program with boost::this_thread::sleep it doesn't count that time...
I hope somebody here has come across a similar problem and found a solution.
Also, I really do need the dt in at least millisecond precision, because in the
// Do some stuff
part of the code I may break out of the while loop early and I need to know how long I was inside, etc.
Thank you for reading!
gettimeofday() is known to have issues when there are discontinuous jumps in the system time.
For portable milisecond precision have a look at chrono::high_resolution_clock()
Here a little snippet:
chrono::high_resolution_clock::time_point ts = chrono::high_resolution_clock::now();
chrono::high_resolution_clock::time_point te = chrono::high_resolution_clock::now();
// ... do something ...
cout << "took " << chrono::duration_cast<chrono::milliseconds>(te - ts).count() << " millisecs\n";
Please note the real clock resolution is bound to the operating system. For instance, on windows you usually have a 15ms precision, and you can go beyond this constraint only by using platform dependent clocking systems.