Does clock_t measure the time of all threads? (C++, pthreads)

Let's say my code consists of main(), and in main() I start 2 threads that run in parallel.
Let's say that main() takes 5 seconds to finish and each thread takes 10 seconds to finish.
Assuming the 2 threads run in parallel, the real (wall-clock) time the program takes is 15 seconds.
Now, if I time it using clock(), will that give me 15 seconds or 25 seconds?
Although thread 1 and thread 2 ran in parallel, will clock() count every cycle used by thread 1 and thread 2 and return the total number of cycles used?
I use Windows with MinGW32 and pthreads.
example code:
#include <ctime>
#include <pthread.h>

pthread_t threads[2];
int data[2];

void *thread_func(void *arg);   // each call runs for about 10 seconds

int main() {
    clock_t begin_time = clock();
    for (unsigned int id = 0; id < 2; ++id) {
        pthread_create(&threads[id], NULL, thread_func, (void *) &data[id]);
    }
    for (unsigned int id = 0; id < 2; ++id) {
        pthread_join(threads[id], NULL);
    }
    double time = double(clock() - begin_time) / CLOCKS_PER_SEC;
}

The clock function does different things in different implementations (in particular, in different OSes). On Windows, clock gives the number of clock-ticks since your program started, regardless of the number of threads and regardless of whether the machine is busy or not [I believe this design decision stems from the ancient days when DOS and Windows 2.x were the fashionable things to use, and the OS didn't have a way of "not running" something].
In Linux, it gives the CPU-time used, as is the case in all Unix-like operating systems, as far as I'm aware.
Edit to clarify: My Linux system says this:
In glibc 2.17 and earlier, clock() was implemented on top of times(2).
For improved precision, since glibc 2.18, it is implemented on top of
clock_gettime(2) (using the CLOCK_PROCESS_CPUTIME_ID clock).
In other words, the time is for the process, not for the current thread.
To get the actual CPU time used by your process on Windows, you can (and should) use GetProcessTimes.
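A minimal sketch of that approach (the helper name and the unit conversion are mine, not from the original code):

#include <windows.h>

// Returns the CPU time (user + kernel) consumed so far by the whole process,
// in seconds. FILETIME values count 100-nanosecond intervals.
double process_cpu_seconds()
{
    FILETIME creation, exit, kernel, user;
    if (!GetProcessTimes(GetCurrentProcess(), &creation, &exit, &kernel, &user))
        return -1.0;  // call failed

    ULARGE_INTEGER k, u;
    k.LowPart = kernel.dwLowDateTime;  k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;    u.HighPart = user.dwHighDateTime;

    return (k.QuadPart + u.QuadPart) * 100e-9;  // 100 ns units -> seconds
}

Calling this before and after the pthread_join loop and subtracting would give the CPU time of main plus both threads (roughly 25 seconds in the example above).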

Related

Best way to implement a periodic Linux task in C++20

I have a periodic task in C++, running on an embedded Linux platform, that has to run at 5 ms intervals. It seems to be working as expected, but is my current solution good enough?
I have implemented the scheduler using sleep_until(), but some comments I have received say that setitimer() is better. As I would like the application to be at least somewhat portable, I would prefer the C++ standard library... of course unless there are other problems.
I have found plenty of sites that show an implementation with each, but I have not found any arguments for why one solution is better than the other. As I see it, sleep_until() will give an "optimal" wait on any (supported) platform, and I get the feeling the comments I have received are focused more on usleep() (which I do not use).
My implementation looks a little like this:
bool is_submilli_capable() {
    return std::ratio_greater<std::milli,
                              std::chrono::system_clock::period>::value;
}

int main() {
    if (not is_submilli_capable())
        exit(1);

    while (true) {
        auto next_time = next_period_start();
        do_the_magic();
        std::this_thread::sleep_until(next_time);
    }
}
A short summary of the issue:
I have an embedded Linux platform, built with Yocto and with RT capabilities
The application needs to read and process incoming data every 5 ms
Building with gcc 11.2.0
Using C++20
All the "hard work" is done in separate threads, so this question only concerns triggering the task periodically and with minimal jitter
Since the application is supposed to read and process the data every 5 ms, it is possible that it occasionally misses the required operations. What I mean is that in a time interval of 20 ms, do_the_magic() is supposed to be invoked 4 times... but if the time taken to execute do_the_magic() is 10 ms, it will only be invoked 2 times. If that is an acceptable outcome, the current implementation is good enough.
Since the application is reading data, it probably receives it from the network or disk. Adding the overhead of processing it, it may well take more than 5 ms (depending on the size of the data). If it is not acceptable to miss any invocation of do_the_magic, the current implementation is not good enough.
What you could do is create a few threads. Each thread executes the do_the_magic function and then goes back to sleep. Every 5 ms you wake one sleeping thread, which should take far less than 5 ms. This way no invocation of do_the_magic is missed. The number of threads you need depends on how long do_the_magic takes to execute.
bool is_submilli_capable() {
    return std::ratio_greater<std::milli,
                              std::chrono::system_clock::period>::value;
}

void wake_some_thread() {
    static int i = 0;
    release_semaphore(i);       // Release the semaphore associated with thread i
    i = (i + 1) % NUM_THREADS;
}

void *thread_func(void *args) {
    while (true) {
        // Wait on the semaphore associated with this thread
        do_the_magic();
    }
}

int main() {
    if (not is_submilli_capable())
        exit(1);

    while (true) {
        auto next_time = next_period_start();
        wake_some_thread();     // Releases a semaphore to wake one thread
        std::this_thread::sleep_until(next_time);
    }
}
Create as many semaphores as there are threads, with thread i waiting on semaphore i. wake_some_thread then releases semaphores starting from index 0 up to NUM_THREADS - 1 and wraps around.
5ms is a pretty tight timing.
You can get a jitter-free 5ms tick only if you do the following:
Isolate a CPU for this thread. Configure it with nohz_full and rcu_nocbs
Pin your thread to this CPU, assign it a real-time scheduling policy (e.g., SCHED_FIFO)
Do not let any other threads run on this CPU core.
Do not allow any context switches in this thread. This includes avoiding system calls altogether. I.e., you cannot use std::this_thread::sleep_until(...) or anything else.
Do a busy wait in between processing (ensure 100% CPU utilisation)
Use lock-free communication to transfer data from this thread to other, non-real-time threads, e.g., for storing the data to files, accessing network, logging to console, etc.
Now, the question is how you're going to "read and process data" without system calls. It depends on your system. If you can do user-space I/O (map the physical register addresses into your process address space, use DMA without interrupts, etc.), you'll get perfectly real-time processing. Otherwise, any system call will trigger a context switch, and the latency of that context switch will be unpredictable.
For example, you can do this with certain Ethernet devices (SolarFlare, etc.), with 100% user-space drivers. For anything else you're likely to have to write your own user-space driver, or even implement your own interrupt-free device (e.g., if you're running on an FPGA SoC).
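A rough sketch of the pinning and busy-wait part. The core number, priority and helper names are assumptions on my part; setting SCHED_FIFO typically needs appropriate privileges (root or an rtprio limit), and reading CLOCK_MONOTONIC goes through the vDSO on most x86-64 Linux systems, so it is not a real system call there:

#include <pthread.h>
#include <sched.h>
#include <time.h>

void do_the_magic();                        // the user's processing function

void realtime_loop() {
    // Pin this thread to the isolated core (CPU 3 here, purely an example)
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    // Give it a real-time scheduling policy and priority (example value)
    sched_param sp{};
    sp.sched_priority = 90;
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);

    timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);
    for (;;) {
        next.tv_nsec += 5'000'000;          // next 5 ms deadline
        if (next.tv_nsec >= 1'000'000'000) {
            next.tv_nsec -= 1'000'000'000;
            ++next.tv_sec;
        }

        do_the_magic();

        timespec now;                       // busy wait: no sleep, no context switch
        do {
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while (now.tv_sec < next.tv_sec ||
                 (now.tv_sec == next.tv_sec && now.tv_nsec < next.tv_nsec));
    }
}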

Execution time inconsistency in a program with high priority in the scheduler using RT Kernel

Problem
We are trying to implement a program that sends commands to a robot at a given cycle time, so this program should be a real-time application. We set up a PC with a preempt-RT Linux kernel and are launching our programs with chrt -f 98 or chrt -rr 99 to define the scheduling policy and priority. Loading the kernel and launching the program seem to be fine and to work (see details below).
Now we were measuring the time (CPU ticks) it takes our program to be computed. We expected this time to be constant with very little variation. What we measured though, were quite significant differences in computation time. Of course, we thought this could be undefined behavior in our rather complex program, so we created a very basic program and measured the time as well. The behavior was similarly bad.
Question
Why are we not measuring a (close to) constant computation time even for our basic program?
How can we solve this problem?
Environment Description
First of all, we installed an RT Linux Kernel on the PC using this tutorial. The main characteristics of the PC are:
CPU: Intel(R) Atom(TM) Processor E3950 @ 1.60GHz with 4 cores
Memory RAM: 8 GB
Operating System: Ubuntu 20.04.1 LTS
Kernel: Linux 5.9.1-rt20 SMP PREEMPT_RT
Architecture: x86-64
Tests
The first time we detected this problem was when we were measuring the time it takes to execute this "complex" program with a single thread. We did a few tests with this program, but also with a simpler one, measuring:
The CPU execution times
The wall time (real-world time)
The difference (wall time - CPU time) between them and the ratio (CPU time / wall time)
We also did a latency test on the PC.
Latency Test
For this one, we followed this tutorial, and these are the results:
Latency Test Generic Kernel
Latency Test RT Kernel
The processes are shown in htop with a priority of RT
Test Program - Complex
We called the function multiple times in the program and measured the time each takes. The results of the 2 tests are:
From this we observed that:
The first execution (around 0.28 ms) always takes longer than the second one (around 0.18 ms), but most of the time it is not the longest iteration.
The mode is around 0.17 ms.
For the iterations that take around 0.17 ms the difference is usually 0 and the ratio 1, although this is not exclusive to that duration. For these, it seems like only 1 CPU is being used and it is saturated (there is no waiting time).
When the difference is not 0, it is usually negative. This, from what we have read here and here, is because more than 1 CPU is being used.
Test Program - Simple
We did the same test but this time with a simpler program:
#include <vector>
#include <iostream>
#include <time.h>

int main(int argc, char** argv) {
    int iterations = 5000;
    double a = 5.5;
    double b = 5.5;
    double c = 4.5;
    std::vector<double> wallTime(iterations, 0);
    std::vector<double> cpuTime(iterations, 0);
    struct timespec beginWallTime, endWallTime, beginCPUTime, endCPUTime;

    std::cout << "Iteration | WallTime | cpuTime" << std::endl;

    for (unsigned int i = 0; i < iterations; i++) {
        // Start measuring time
        clock_gettime(CLOCK_REALTIME, &beginWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &beginCPUTime);

        // Function
        a = b + c + i;

        // Stop measuring time and calculate the elapsed time
        clock_gettime(CLOCK_REALTIME, &endWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endCPUTime);
        wallTime[i] = (endWallTime.tv_sec - beginWallTime.tv_sec) + (endWallTime.tv_nsec - beginWallTime.tv_nsec)*1e-9;
        cpuTime[i]  = (endCPUTime.tv_sec  - beginCPUTime.tv_sec)  + (endCPUTime.tv_nsec  - beginCPUTime.tv_nsec)*1e-9;

        std::cout << i << " | " << wallTime[i] << " | " << cpuTime[i] << std::endl;
    }

    return 0;
}
Final Thoughts
We understand that:
If the ratio == number of CPUs used, they are saturated and there is no waiting time.
If the ratio < number of CPUs used, it means that there is some waiting time (theoretically we should only be using 1 CPU, although in practice we use more).
Of course, we can give more details.
Thanks a lot for your help!
Your function will almost certainly be optimized away, so you are mostly measuring how long it takes to read the clocks. And as you can see, that doesn't take very long, with some exceptions:
The very first time you run the code (unless you just compiled it), the pages need to be loaded from disk. If you are unlucky, the code spans pages and you include the loading of the next page in the measured time. Quite unlikely given the code size.
On the first loop iteration, the code and any data need to be loaded into cache, so it takes longer to execute. The branch predictor might also need a few iterations to predict the loop correctly, so the second and third iterations might be slightly slower too.
For everything else I think you can blame scheduling:
an IRQ happens but nothing gets rescheduled
the process gets paused while another process runs
the process gets moved to another CPU thread leaving the caches hot
the process gets moved to another CPU core making L1 cache cold but leaving L2/L3 caches hot (if your L2 is shared)
the process gets moved to a CPU on another socket making L1/L2 caches cold but L3 cache hot (if L3 is shared)
You can do little about IRQs. Some you can pin to specific cores, but others are simply essential (like the timer interrupt for the scheduler itself). You kind of just have to live with that.
But you can pin your program to a specific CPU and pin everything else to all the other cores, basically reserving that core for the real-time code. I guess you would have to use cgroups for this, to keep everything else off the chosen core. You might still get some kernel threads running on the reserved core; nothing you can do about that. But it should eliminate most of the large execution times.
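A minimal sketch of the pinning part, assuming core 2 is the one you reserve (keeping everything else off that core still has to be done separately, e.g. with cpuset cgroups or the isolcpus=/nohz_full= kernel parameters):

#include <sched.h>

// Restrict the calling process to a single core so the scheduler cannot
// migrate it and cool its caches. Returns true on success.
bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // pid 0 = this process
}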

In C++, how would I make a random number (either 1 or 2) that changes every 5 minutes?

I'm trying to make a simple game and I have a shop in the game. I want the item in the shop to either switch or stay the same every 5 minutes (whenever the function changeItem() is called). I have no problem generating the random number, but I have yet to find a thread that shows how to make it generate a different one every 5 minutes. Thank you.
In short, keep track of the last time the changeItem() function was called. If it is more than 5 minutes since the last time it was called, then use your random number generator to generate a new number. Otherwise, use the saved number from the last time it was generated.
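A minimal sketch of that idea using <chrono> and <random> (changeItem matches the question's function name; currentItem and the other details are illustrative):

#include <chrono>
#include <random>

int changeItem() {
    using namespace std::chrono;
    static steady_clock::time_point lastChange;             // clock epoch on first call
    static int currentItem = 1;
    static std::mt19937 rng{std::random_device{}()};

    auto now = steady_clock::now();
    if (now - lastChange >= minutes(5)) {
        currentItem = std::uniform_int_distribution<int>(1, 2)(rng);  // new roll: 1 or 2
        lastChange = now;
    }
    return currentItem;                                      // otherwise keep the saved value
}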
You've already accepted an answer, but I would like to say that for apps that need simple timing like this and don't need great accuracy, a simple calculation in the main loop is all you need.
Kicking off a thread for a single timer is a lot of unnecessary overhead.
So, here's the code showing how you'd go about doing it.
#define FIVE_MINUTES (60*5)

int main(int argc, char** argv) {
    time_t lastChange = 0, tick;
    bool run_game_loop = true;

    while (run_game_loop) {
        // ... game loop
        tick = time(NULL);
        if ((tick - lastChange) >= FIVE_MINUTES) {
            changeItem();
            lastChange = tick;
        }
    }
    return 0;
}
It does assume the loop runs reasonably regularly, though. If on the other hand you need accuracy, a thread would be better. And depending on the platform there exist APIs for timers that get called by the system.
Standard and portable approach:
You could consider C++11 threads. The general idea would be :
#include <thread>
#include <chrono>
void myrandomgen()   // function that refreshes your random number;
                     // will be executed as a thread
{
    while (!gameover) {
        std::this_thread::sleep_for(std::chrono::minutes(5)); // wait 5 minutes
        ...  // generate your random number and update your game data structure
    }
}
In the main function, you would then instantiate a thread with your function:
std::thread t1(myrandomgen);   // create and launch the thread
...                            // do your stuff until game over
t1.join();                     // wait until the thread returns
Of course you could also pass parameters (references to shared variables, etc...) when you create the thread, like this:
std::thread t1(myrandomgen, param1, param2, ...);
The advantage of this approach is that it's standard and portable.
Non-portable alternatives:
I'm less familiar with these, but:
In an MS Windows environment, you could use SetTimer(...) to define a function to be called at a regular interval (and KillTimer(...) to delete it). But this requires a program structure built around the Windows event processing loop.
In a Linux environment, you could similarly define a callback function with signal(SIGALRM, ...) and activate periodic calls with alarm().
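A rough sketch of that SIGALRM/alarm() variant (the flag name is mine; only a flag is set in the handler because very little else is async-signal-safe there):

#include <csignal>
#include <unistd.h>

volatile std::sig_atomic_t refresh_due = 0;

void on_alarm(int) {
    refresh_due = 1;
    alarm(5 * 60);              // re-arm for the next 5 minutes
}

int main() {
    std::signal(SIGALRM, on_alarm);
    alarm(5 * 60);              // first alarm in 5 minutes
    while (true) {              // game loop
        if (refresh_due) {
            refresh_due = 0;
            // generate the new random number here
        }
        // ... rest of the game loop
    }
}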
Small update on performance considerations:
Following several remarks about the overkill of threads and about performance, I've done a benchmark, executing 1 billion loop iterations and waiting 1 microsecond every 100K iterations. The whole thing was run on an i7 multicore CPU:
Non-threaded execution yielded 213K iterations per millisecond.
Two-thread execution yielded 209K iterations per millisecond per thread. So slightly slower for each thread. The total execution time was however only 70 to 90 ms longer, so the overall throughput is 418K iterations per millisecond.
How come? Because the second thread is using an otherwise unused core on the processor. This means that with an adequate architecture, a game could process many more calculations when using multithreading...

clock() vs GetSystemTime()

I developed a class for calculations on multiple threads, and only one instance of this class is used per thread. I also want to measure the duration of the calculations by iterating over a container of this class from another thread. The application is Win32. The thing is, I have read that QueryPerformanceCounter is only useful when comparing measurements taken on a single thread. Because I cannot use it for my problem, I'm thinking of clock() or GetSystemTime(). It is sad that both methods have a 'resolution' of milliseconds (since CLOCKS_PER_SEC is 1000 on Win32). Which method should I use, or to generalize, is there a better option for me?
As a rule I have to take the measurements outside the working thread.
Here is some code as an example.
unsigned long GetCounter()
{
    SYSTEMTIME ww;
    GetSystemTime(&ww);
    return ww.wMilliseconds + 1000 * ww.wSeconds;
    // or
    return clock();
}
class WorkClass
{
    bool is_working;
    unsigned long counter;
    HANDLE threadHandle;
public:
    void DoWork()
    {
        threadHandle = GetCurrentThread();
        is_working = true;
        counter = GetCounter();
        // Do some work
        is_working = false;
    }
};

void CheckDurations() // will work on another thread
{
    for (size_t i = 0; i < vector_of_workClass.size(); ++i)
    {
        WorkClass & wc = vector_of_workClass[i];
        if (wc.is_working)
        {
            unsigned long dur = GetCounter() - wc.counter;
            ReportDuration(wc, dur);
            if (dur > someLimitValue)
                TerminateThread(wc.threadHandle, 0);
        }
    }
}
QueryPerformanceCounter is fine for multithreaded applications. The processor instruction that may be used (rdtsc) can potentially provide invalid results when called on different processors.
I recommend reading "Game Timing and Multicore Processors".
For your specific application, the problem you appear to be trying to solve is applying a timeout to some potentially long-running threads. The proper solution would be to use the WaitForMultipleObjects function with a timeout value. If the time expires, you can then terminate any threads that are still running - ideally by setting a flag that each thread checks, but TerminateThread may be suitable.
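A small sketch of that approach (the handle array and timeout names are illustrative; the handles would be the ones returned by CreateThread or _beginthreadex):

#include <windows.h>

void wait_for_workers(HANDLE *handles, DWORD count, DWORD timeoutMs)
{
    // Wait until all workers finish, or until the timeout expires
    // (at most MAXIMUM_WAIT_OBJECTS, i.e. 64, handles per call)
    DWORD result = WaitForMultipleObjects(count, handles, TRUE, timeoutMs);
    if (result == WAIT_TIMEOUT) {
        // Some threads are still running: preferably signal a stop flag they
        // check; TerminateThread is a last resort because it skips cleanup.
    }
}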
both methods have a precision of milliseconds
They don't. They have a resolution of a millisecond, the precision is far worse. Most machines increment the value only at intervals of 15.625 msec. That's a heckofalot of CPU cycles, usually not good enough to get any reliable indicator of code efficiency.
QPC does much better; no idea why you couldn't use it. A profiler is the standard tool to measure code efficiency. Beats taking dependencies you don't want.
QueryPerformanceCounter should give you the best precision, but there are issues when the function gets run on different processors (you get a different result for each processor). So when running in a thread you will experience shifts when the thread switches processors. To solve this you can set processor affinity for the thread that measures time.
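A small sketch of that combination (the mask value pins to core 0, an arbitrary choice, and the helper name is mine):

#include <windows.h>

double measure_seconds()
{
    // Keep this thread on one core so consecutive QPC reads use the same processor
    SetThreadAffinityMask(GetCurrentThread(), 1);

    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    // ... code being timed ...
    QueryPerformanceCounter(&stop);

    return double(stop.QuadPart - start.QuadPart) / double(freq.QuadPart);
}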
GetSystemTime gets an absolute time, clock is a relative time but both measure elapsed time, not CPU time related to the actual thread/process.
Of course clock() is more portable. Having said that I use clock_gettime on Linux because I can get both elapsed and thread CPU time with that call.
Boost has some time functions that you could use that will run on multiple platforms, if you want platform-independent code.
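As a sketch of the clock_gettime approach mentioned above (Linux-specific; the helper names are mine):

#include <time.h>

double elapsed_seconds()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);          // elapsed (wall) time
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

double thread_cpu_seconds()
{
    timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);  // CPU time of the calling thread
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}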

Need a better wait solution

Recently I have been writing a program in C++ that pings three different websites and then depending on pass or fail it will wait 5 minutes or 30 seconds before it tries again.
Currently I have been using the ctime library and the following function to process my waiting. However, according to my CPU meter this is an unacceptable solution.
void wait(int seconds)
{
    clock_t endwait;
    endwait = clock() + seconds * CLOCKS_PER_SEC;
    while (clock() < endwait) {}
}
The reason this solution is unacceptable is that, according to my CPU meter, the program uses 48% to 50% of my CPU while waiting. I have an Athlon 64 X2 1.2 GHz processor. There is no way my modest 130-line program should even get near 50%.
How can I write my wait function better so that it is only using minimal resources?
To stay portable you could use Boost::Thread for sleeping:
#include <boost/thread/thread.hpp>

int main()
{
    // waits 2 seconds in total: the two lines are equivalent 1-second sleeps
    boost::this_thread::sleep(boost::posix_time::seconds(1));
    boost::this_thread::sleep(boost::posix_time::milliseconds(1000));
    return 0;
}
With the C++11 standard the following approach can be used:
std::this_thread::sleep_for(std::chrono::milliseconds(100));
std::this_thread::sleep_for(std::chrono::seconds(100));
Alternatively sleep_until could be used.
Use sleep rather than an empty while loop.
Just to explain what's happening: every time you call clock() your program retrieves the time again, and you're asking it to do that as fast as it can until it reaches the end time. That leaves the CPU core running the program "spinning" as fast as it can through your loop, reading the time millions of times a second in the hope that it will have rolled over to the end time. You need to instead tell the operating system that you want to be woken up after an interval; then it can suspend your program and let other programs run (or the system idle). That's what the various sleep functions mentioned in other answers are for.
There's Sleep in windows.h, on *nix there's sleep in unistd.h.
There's a more elegant solution at http://www.faqs.org/faqs/unix-faq/faq/part4/section-6.html