pthread_join is being a bottleneck


I have an application where pthread_join is the bottleneck. I need help resolving this problem.
void *calc_corr(void *t) {
begin = clock();
// do work
end = clock();
duration = (double) (1000*((double)end - (double)begin)/CLOCKS_PER_SEC);
cout << "Time is "<<duration<<"\t"<<h<<endl;
pthread_exit(NULL);
}
int main() {
start_t = clock();
for (ii=0; ii<16; ii++)
pthread_create(&threads.p[ii], NULL, &calc_corr, (void *)ii);
for (i=0; i<16; i++)
pthread_join(threads.p[15-i], NULL);
stop_t = clock();
duration2 = (double) (1000*((double)stop_t - (double)start_t)/CLOCKS_PER_SEC);
cout << "\n Time is "<<duration2<<"\t"<<endl;
return 0;
}
The time printed in the thread function is in the range of 40 ms to 60 ms, whereas the time printed in the main function is in the range of 650 ms to 670 ms. The irony is that my serial code also runs in 650 ms to 670 ms. What can I do to reduce the time taken by pthread_join?
Thanks in advance!

On Linux, clock() measures the combined CPU time of all threads in the process. It does not measure wall time.
This explains why you get ~640 ms = 16 * 40 ms (as pointed out in the comments).
To measure wall time, you should be using something like:
gettimeofday()
clock_gettime()

By creating threads you add overhead to your system: creation time and scheduling time. Creating a thread requires allocating a stack, etc.; scheduling means more context switching. Also, pthread_join suspends execution of the calling thread until the target thread terminates. This means you wait for thread 1 to finish; when it does, you are rescheduled as quickly as possible, but not instantly; then you wait for thread 2, and so on.
Now, suppose your computer has only a few cores, say one or two, and you are creating 16 threads. At best, 2 threads of your program will run at the same time, and just by adding their clock measurements you get something around 400 ms.
Again, it depends on a lot of things, so I have only skimmed over what is happening.
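To check whether pthread_join itself is costing time, one option is to time the create/join sequence with a wall clock. The sketch below assumes POSIX threads and uses a hypothetical sleepy_worker that sleeps for 50 ms in place of the real work; joining in reverse order, as in the question, should still take roughly as long as one worker, not the sum of all of them.

```cpp
#include <pthread.h>
#include <time.h>
#include <unistd.h>

// Hypothetical worker: stands in for "do work" by sleeping 50 ms.
static void* sleepy_worker(void*) {
    usleep(50 * 1000);
    return nullptr;
}

// Starts up to n threads (capped at 64), joins them in reverse order,
// and returns the elapsed wall time in milliseconds.
double run_and_join_ms(int n) {
    pthread_t tid[64];
    if (n > 64) n = 64;
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int started = 0;
    // Stop creating threads at the first failure so we only join valid ids.
    while (started < n &&
           pthread_create(&tid[started], nullptr, sleepy_worker, nullptr) == 0)
        ++started;
    for (int i = started - 1; i >= 0; --i)
        pthread_join(tid[i], nullptr);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Because the workers here only sleep, they overlap almost perfectly; CPU-bound workers would overlap only as far as the number of cores allows.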

Related

C++ time measurement looks too slow

I am programming a game using OpenGL GLUT code, and I am applying a game development technique that consists of measuring the time consumed by each iteration of the game's main loop, so you can use it to update the game scene in proportion to the time since it was last updated. To achieve this, I have this at the start of the loop:
void logicLoop () {
float finalTime = (float) clock() / CLOCKS_PER_SEC;
float deltaTime = finalTime - initialTime;
initialTime = finalTime;
...
// Here I move things using deltaTime value
...
}
The problem came when I added a bullet to the game. If the bullet does not hit any target in two seconds, it must be destroyed. Then, what I did was to keep a reference to the moment the bullet was created like this:
class Bullet: public GameObject {
float birthday;
public:
Bullet () {
...
// Some initialization stuff
...
birthday = (float) clock() / CLOCKS_PER_SEC;
}
float getBirthday () { return birthday; }
};
And then I added this to the logic just beyond the finalTime and deltaTime measurement:
if (bullet != NULL) {
if (finalTime - bullet->getBirthday() > 2) {
world.remove(bullet);
bullet = NULL;
}
}
It looked nice, but when I ran the code, the bullet stayed alive for far too long. Looking for the problem, I printed the value of (finalTime - bullet->getBirthday()) and saw that it increases really, really slowly, as if it were not a time measured in seconds.
Where is the problem? I thought that the result would be in seconds, so the bullet would be removed after two seconds.

This is a common mistake. clock() does not measure the passage of actual time; it measures how much time has elapsed while the CPU was running this particular process.
Other processes also take CPU time, so the two clocks are not the same. Any time your operating system spends executing some other process's code, including while this one is "sleeping", does not count towards clock(). And if your program is multithreaded on a system with more than one CPU, clock() may "double count" time!
Humans have no knowledge or perception of OS time slices: we just perceive the actual passage of actual time (known as "wall time"). Ultimately, then, you will see clock()'s timebase being different to wall time.
Do not use clock() to measure wall time!
You want something like gettimeofday() or clock_gettime() instead. In order to allay the effects of people changing the system time, on Linux I personally recommend clock_gettime() with the system's "monotonic clock", a clock that steps in sync with wall time but has an arbitrary epoch unaffected by people playing around with the computer's time settings. (Obviously switch to a portable alternative if needs be.)
This is actually discussed on the cppreference.com page for clock():
std::clock time may advance faster or slower than the wall clock, depending on the execution resources given to the program by the operating system. For example, if the CPU is shared by other processes, std::clock time may advance slower than wall clock. On the other hand, if the current process is multithreaded and more than one execution core is available, std::clock time may advance faster than wall clock.
Please get into the habit of reading documentation for all the functions you use, when you are not sure what is going on.
Edit: Turns out GLUT itself has a function you can use for this, which is mighty convenient. glutGet(GLUT_ELAPSED_TIME) gives you the number of wall milliseconds elapsed since your call to glutInit(). So I guess that's what you need here. It may be slightly more performant, particularly if GLUT (or some other part of OpenGL) is already requesting wall time periodically, and if this function merely queries that already-obtained time… thus saving you from an unnecessary second system call (which costs).

If you are on Windows you can use QueryPerformanceFrequency / QueryPerformanceCounter, which give pretty accurate time measurements.
Here's an example.
#include <Windows.h>
#include <iostream>
using namespace std;
int main()
{
LARGE_INTEGER freq = {0, 0};
QueryPerformanceFrequency(&freq);
LARGE_INTEGER startTime = {0, 0};
QueryPerformanceCounter(&startTime);
// STUFF.
for(size_t i = 0; i < 100; ++i) {
cout << i << endl;
}
LARGE_INTEGER stopTime = {0, 0};
QueryPerformanceCounter(&stopTime);
const double elapsed = ((double)stopTime.QuadPart - (double)startTime.QuadPart) / freq.QuadPart;
cout << "Elapsed: " << elapsed << endl;
return 0;
}

C++ multithreads run time issue

I have been studying C++ multithreading and have a question about it.
Here is my understanding of multithreading.
One of the reasons we use multiple threads is to reduce the run time, right?
For example, I think if we use two threads we can expect half of the execution time.
So I wrote some code to prove it.
Here is the code.
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <functional>
#include <vector>
#include <iostream>
#include <thread>
#include <future>
using namespace std;
#define iterationNumber 1000000
void myFunction(const int index, const int numberInThread, promise<unsigned long>&& p, const vector<int>& numberList) {
clock_t begin,end;
int firstIndex = index * numberInThread;
int lastIndex = firstIndex + numberInThread;
vector<int>::const_iterator first = numberList.cbegin() + firstIndex;
vector<int>::const_iterator last = numberList.cbegin() + lastIndex;
vector<int> numbers(first,last);
unsigned long result = 0;
begin = clock();
for(int i = 0 ; i < numbers.size(); i++) {
result += numbers.at(i);
}
end = clock();
cout << "thread" << index << " took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;
p.set_value(result);
}
int main(void)
{
vector<int> numberList;
vector<thread> t;
vector<future<unsigned long>> futures;
vector<unsigned long> result;
const int NumberOfThreads = thread::hardware_concurrency() ?: 2;
int numberInThread = iterationNumber / NumberOfThreads;
clock_t begin,end;
for(int i = 0 ; i < iterationNumber ; i++) {
int randomN = rand() % 10000 + 1;
numberList.push_back(randomN);
}
for(int j = 0 ; j < NumberOfThreads; j++){
promise<unsigned long> promises;
futures.push_back(promises.get_future());
t.push_back(thread(myFunction, j, numberInThread, std::move(promises), numberList));
}
for_each(t.begin(), t.end(), std::mem_fn(&std::thread::join));
for (int i = 0; i < futures.size(); i++) {
result.push_back(futures.at(i).get());
}
unsigned long RRR = 0;
begin = clock();
for(int i = 0 ; i < numberList.size(); i++) {
RRR += numberList.at(i);
}
end = clock();
cout << "not by thread took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;
}
Because the hardware concurrency of my laptop is 4, it will create 4 threads and each takes a quarter of numberList and sum up the numbers.
However, the result was different than I expected.
thread0 took 0.007232
thread1 took 0.007402
thread2 took 0.010035
thread3 took 0.011759
not by thread took 0.009654
Why? Why did it take more time than the serial version (not by thread)?

For example, I think if we use two threads we can expect half of the
execution time.
You'd think so, but sadly, that is often not the case in practice. The ideal "N cores means 1/Nth the execution time" scenario occurs only when the N cores can execute completely in parallel, without any core's actions interfering with the performance of the other cores.
But what your threads are doing is just summing up different sub-sections of an array... surely that can benefit from being executed in parallel? The answer is that in principle it can, but on a modern CPU, simple addition is so blindingly fast that it isn't really a factor in how long it takes a loop to complete. What really does limit the execution speed of a loop is access to RAM. Compared to the speed of the CPU, RAM access is very slow -- and on most desktop computers, each CPU has only one connection to RAM, regardless of how many cores it has. That means that what you are really measuring in your program is the speed at which a big array of integers can be read in from RAM to the CPU, and that speed is roughly the same -- equal to the CPU's memory-bus bandwidth -- regardless of whether it's one core doing the reading-in of the memory, or four.
To demonstrate how much RAM access is a factor, below is a modified/simplified version of your test program. In this version of the program, I've removed the big vectors, and instead the computation is just a series of calls to the (relatively expensive) sin() function. Note that in this version, the loop is only accessing a few memory locations, rather than thousands, and thus a core that is running the computation loop will not have to periodically wait for more data to be copied in from RAM to its local cache:
#include <vector>
#include <iostream>
#include <thread>
#include <chrono>
#include <math.h>
using namespace std;
static int iterationNumber = 1000000;
unsigned long long threadElapsedTimeMicros[10];
unsigned long threadResults[10];
void myFunction(const int index, const int numberInThread)
{
unsigned long result = 666;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
for(int i=0; i<numberInThread; i++) result += 100*sin(result);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
threadResults[index] = result;
threadElapsedTimeMicros[index] = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
// We'll print out the value of threadElapsedTimeMicros[index] later on,
// after all the threads have been join()'d.
// If we printed it out now it might affect the timing of the other threads
// that may still be executing
}
int main(void)
{
vector<thread> t;
const int NumberOfThreads = thread::hardware_concurrency();
const int numberInThread = iterationNumber / NumberOfThreads;
// Multithreaded approach
std::chrono::steady_clock::time_point allBegin = std::chrono::steady_clock::now();
for(int j = 0 ; j < NumberOfThreads; j++) t.push_back(thread(myFunction, j, numberInThread));
for(int j = 0 ; j < NumberOfThreads; j++) t[j].join();
std::chrono::steady_clock::time_point allEnd = std::chrono::steady_clock::now();
for(int j = 0 ; j < NumberOfThreads; j++) cout << " The computations in thread #" << j << ": result=" << threadResults[j] << ", took " << threadElapsedTimeMicros[j] << " microseconds" << std::endl;
cout << " Total time spent doing multithreaded computations was " << std::chrono::duration_cast<std::chrono::microseconds>(allEnd - allBegin).count() << " microseconds in total" << std::endl;
// And now, the single-threaded approach, for comparison
unsigned long result = 666;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
for(int i = 0 ; i < iterationNumber; i++) result += 100*sin(result);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
cout << "result=" << result << ", single-threaded computation took " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds" << std::endl;
return 0;
}
When I run the above program on my dual-core Mac mini (i7 with hyperthreading), here are the results I get:
Jeremys-Mac-mini:~ lcsuser1$ g++ -std=c++11 -O3 ./temp.cpp
Jeremys-Mac-mini:~ lcsuser1$ ./a.out
The computations in thread #0: result=1062, took 11718 microseconds
The computations in thread #1: result=1062, took 11481 microseconds
The computations in thread #2: result=1062, took 11525 microseconds
The computations in thread #3: result=1062, took 11230 microseconds
Total time spent doing multithreaded computations was 16492 microseconds in total
result=1181, single-threaded computation took 49846 microseconds
So in this case the results are more like what you'd expect -- because memory access was not a bottleneck, each core was able to run at full speed, and complete its 25% portion of the total calculations in about 25% of the time that it took a single thread to complete 100% of the calculations... and since the four cores were running truly in parallel, the total time spent doing the calculations was about 33% of the time it took for the single-threaded routine to complete (ideally it would be 25% but there's some overhead involved in starting up and shutting down the threads, etc).
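For the original array-summing workload, a sketch along the following lines avoids the per-thread vector copies by letting each worker read its slice in place. The function name parallel_sum and its slicing scheme are illustrative assumptions; since the loop stays memory-bound, this is not expected to reach an N-fold speedup either.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` using `nthreads` worker threads. Each worker reads its own
// slice of the vector in place (no copy) and writes one partial result.
unsigned long parallel_sum(const std::vector<int>& data, unsigned nthreads) {
    if (nthreads == 0) nthreads = 1;  // hardware_concurrency() may return 0
    std::vector<unsigned long> partial(nthreads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t first = t * chunk;
        // The last worker also takes the remainder of the division.
        const std::size_t last =
            (t + 1 == nthreads) ? data.size() : first + chunk;
        workers.emplace_back([&data, &partial, t, first, last] {
            partial[t] = std::accumulate(data.begin() + first,
                                         data.begin() + last, 0UL);
        });
    }
    for (std::thread& w : workers) w.join();
    // Combine the partial sums on the calling thread.
    return std::accumulate(partial.begin(), partial.end(), 0UL);
}
```

Timing the call with std::chrono::steady_clock, as in the program above, keeps the comparison with the single-threaded loop in wall time.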

This is an explanation for the beginner.
It's not technically accurate, but IMHO it is not so far off that anyone is harmed by reading it.
It provides an entry point into understanding parallel processing terms.
Threads, Tasks, and Processes
It is important to know the difference between threads and processes.
By default, starting a new process allocates dedicated memory for that process. So it shares memory with no other process, and could (in theory) be run on a separate computer.
(You can share memory with other processes, via the operating system or "shared memory", but you have to add these features; they are not available to your process by default.)
Having multiple cores means that each running process can be executed on any idle core.
So basically one program runs on one core, another program runs on a second core, and the background service doing something for you runs on a third (and so on and so forth).
Threads are something different.
For instance, every process runs in a main thread.
The operating system implements a scheduler, that is supposed to allocate cpu time for programs. In principle it will say:
Program A, get 0.01 seconds, then pause!
Program B, get 0.01 seconds, then pause!
Program A, get 0.01 seconds, then pause!
Program B, get 0.01 seconds, then pause!
you get the idea..
The scheduler can typically prioritize between threads, so some programs get more CPU time than others.
The scheduler can of course schedule threads on all cores, but if it does this within a process (splits a process's threads over multiple cores) there can be a performance penalty, as each core holds its own very fast memory cache.
Since threads from the same process can access the same cache, sharing memory between threads is quite fast.
Accessing another core's cache is not as fast (if even possible without going via RAM), so, in this simplified picture, schedulers will in general not split a process over multiple cores.
The result is that all the threads belonging to a process run on the same core.
| Core 1 | Core 2 | Core 3 |
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
A process can spawn multiple threads; they all share the parent thread's memory area, and will normally all run on the core that the parent was running on.
It makes sense to spawn threads within a process if you have an application that needs to respond to something whose timing it cannot control.
E.g. the user presses a cancel button, or attempts to move a window, while the application is running calculations that take a long time to complete.
Responsiveness of the UI requires the application to spend time reading and handling what the user is attempting to do. This could be achieved in a main loop, if the program does part of the calculation in each iteration.
However, that gets complicated real fast, so instead of having the calculation code exit in the middle of a calculation to check the UI, update the UI, and then continue, you run the calculation code in another thread.
The scheduler then makes sure that the UI thread and the calculation thread both get CPU time, so the UI responds to user input while the calculation continues.
And your code stays fairly simple.
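The cancel-button case can be sketched in C++11 with a worker thread and a shared std::atomic<bool>. The function run_calculation and its 1 ms "unit of work" are placeholders invented for this illustration; a real calculation would do actual work between checks of the flag.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical long calculation: performs up to max_steps steps and checks
// the cancel flag between steps. Returns the number of steps completed.
long run_calculation(const std::atomic<bool>& cancel, long max_steps) {
    long done = 0;
    while (done < max_steps && !cancel.load()) {
        // Stand-in for one unit of real work.
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        ++done;
    }
    return done;
}
```

The UI thread would start this function in a std::thread, keep handling input, set the flag to true when the user presses cancel, and then join the worker.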
But I want to run my calculations on another core to gain speed
To distribute calculations over multiple cores, you could spawn a new process for each calculation job. That way the scheduler knows that each process gets its own memory, and it can easily be launched on an idle core.
However, you have a problem: you need to share memory with the other process, so it knows what to do.
A simple way of doing this is sharing memory via the filesystem.
You could create a file with the data for the calculation, and then spawn a thread governing the execution of (and communication with) another program, so your UI is responsive while you wait for the results.
The governing thread runs the other program via system commands, which starts it as another process.
The other program is written so that it takes the input file as an input argument, so we can run multiple instances of it on different files.
If the program terminates by itself when it's done and creates an output file, it can run on any core (or several), and your process can read the output file.
This actually works, and if the calculation takes a long time (like many minutes) this is perhaps OK, even though we use files to communicate between our processes.
For calculations that only take seconds, however, the file system is slow, and waiting for it will almost cancel out the performance gained by using processes instead of just threads. So other, more efficient memory sharing is used in real life, for instance creating a shared memory area in RAM.
The "create governing thread, spawn subprocess, allow communication with the process via the governing thread, collect data when the process is complete, and expose it via the governing thread" pattern can be implemented in multiple ways.
Tasks
Well, "tasks" is ambiguous.
In general it means "process or thread that solves a task".
However, in certain languages like C#, it is something that implements a thread-like thing that the scheduler can treat as a process. Other languages that provide a similar feature typically dub it either tasks or workers.
So with workers/tasks, it appears to the programmer as if it were merely a thread that you can share memory with easily, via references, and control like any other thread, by invoking methods on it.
But it appears to the scheduler as if it's a process that can be run on any core.
It solves the shared memory problem in a fairly efficient way, as part of the language, so the programmer won't have to re-invent this wheel for every task.
This is often referred to as "hybrid threading" or simply "parallel threads".
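In C++, one standard way to get a task-style interface is std::async, which hands back a std::future for the result. The sketch below is an illustration only: sum_of_squares and run_tasks are made-up names, and the assumption to keep in mind is that std::async tasks run as threads inside the same process, not as separate processes.

```cpp
#include <future>
#include <vector>

// Hypothetical job: sum of the squares 1*1 + 2*2 + ... + n*n.
unsigned long long sum_of_squares(unsigned n) {
    unsigned long long s = 0;
    for (unsigned long long i = 1; i <= n; ++i) s += i * i;
    return s;
}

// Launches one asynchronous task per input and collects the results
// in the same order as the inputs.
std::vector<unsigned long long> run_tasks(const std::vector<unsigned>& inputs) {
    std::vector<std::future<unsigned long long>> pending;
    for (unsigned n : inputs)
        pending.push_back(std::async(std::launch::async, sum_of_squares, n));
    std::vector<unsigned long long> results;
    for (auto& f : pending)
        results.push_back(f.get());  // blocks until that task has finished
    return results;
}
```

The caller never touches a thread object directly; results come back through the futures, which is the part that makes it feel like a task.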

It seems that you have some misconceptions about multi-threading. Simply using two threads cannot halve the processing time.
Multi-threading is a fairly complicated concept, but you can easily find related material on the web. You should read some of it first. But I will try to give a simple explanation with an example.
No matter how many CPUs (or cores) you have, the total processing capacity of the CPU is always the same whether you use multiple threads or not, right? Then where does the performance difference come from?
When a program runs on a device (computer) it uses not only the CPU but also other system resources such as the network, RAM, hard drives, etc. If the flow of the program is serialized, there will be points in time when the CPU is idle, waiting for other system resources to finish. But if the program runs with multiple threads (multiple flows), then when one thread goes idle (waiting for work done by other system resources) the other threads can use the CPU. Therefore you can minimize the idle time of the CPU and improve performance. This is one of the simplest examples of multi-threading.
Since your sample code is almost purely CPU-bound, using multiple threads may bring little performance improvement. Sometimes it can even be worse, because multi-threading also comes with the time cost of context switching.
FYI, parallel processing is not the same as multi-threading.

This is very good to point out the problems with macs.
Provided you use an OS that can schedule threads in a useful manner, you have to consider whether a problem is basically one problem repeated many times. An example is matrix multiplication. When you multiply 2 matrices, certain parts of the computation are independent of the others. A 3x3 matrix times another 3x3 matrix requires 9 dot products, which can be computed independently of each other; each dot product requires 3 multiplications and 2 additions, but the multiplications must be done first. So if we wanted to use a multithreaded processor for this task, we could use 9 cores or threads, and given that they get equal compute time or have the same priority level (which is adjustable on Windows), we would reduce the time to multiply two 3x3 matrices by a factor of 9. This is because we are essentially doing something 9 times which can be done at the same time by 9 people.
Now, for each of the 9 threads we could have 3 cores perform the multiplications, totaling 3x9=27 cores altogether, reducing the time to t/27. But we have 18 additions, and here we get no gain from more cores: one addition must be piped into another. So the problem takes time t with one core, or ideally time t/27 with 27 cores working together. Now you can see why problems are often sought out if they are 'linear', because they can be done in parallel pretty well, like graphics for example (some things like backside culling are sorting problems and inherently not linear, so parallel processing gives diminished performance boosts).
Then there is the added overhead of starting threads and of how they are scheduled by the OS and processor. Hope this helps.
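The nine independent dot products can be sketched as one thread per output element. This is a toy illustration with made-up names (Mat3, multiply_parallel); for a matrix this small, the cost of starting nine threads is far larger than the arithmetic, so the sketch shows the decomposition, not a speedup.

```cpp
#include <array>
#include <thread>
#include <vector>

using Mat3 = std::array<std::array<double, 3>, 3>;

// Computes c = a * b with one thread per output element. Each of the nine
// dot products reads a and b only and writes its own c[i][j], so the
// threads need no synchronization beyond the final join.
Mat3 multiply_parallel(const Mat3& a, const Mat3& b) {
    Mat3 c{};
    std::vector<std::thread> workers;
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            workers.emplace_back([&a, &b, &c, i, j] {
                double dot = 0.0;
                // Within one dot product the additions form a dependent chain.
                for (int k = 0; k < 3; ++k) dot += a[i][k] * b[k][j];
                c[i][j] = dot;
            });
        }
    }
    for (std::thread& w : workers) w.join();
    return c;
}
```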

c++ implementing clock to measure execution time

I have written a program in C++ and am trying to measure the time it takes to execute completely
int main (int argc, char**argv){
clock_t tStart = clock();
//doing my program's work here
printf("Time taken: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
return 0;
}
My issue is that it will always print out 0.00s for the execution time. Could this be due to using multiple pthreads in my program (my program uses pthread_join to make sure that all threads have completed executing so I don't think this should be an issue)?
edit: //doing program's work =...
for(i = 0;i<4;i++){
err = pthread_create(&threads[i], NULL, print, NULL);
pthread_join(threads[i], NULL);
}
void *print(void *data){
printf("hello world");
return NULL; // a void* function must return a value
}

printf("Time taken: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
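If all three arithmetic operands were integers, you would perform integer division and get 0. In the line as quoted, though, the (double) cast is applied to (clock() - tStart) before the division, so the division is already floating-point; a printed 0.00 then simply means the measured CPU time rounds to zero at two decimal places.
Either way, make sure the LHS or the RHS of the / symbol has a floating-point type. And run your code more times! Your benchmark is useless if it measures just a single run (which is pretty evident since you got 0, not 1 or like 300 or something).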

It really depends on what is in //doing my program's work here. If it is kicking off other threads, then you will definitely need to wait or poll to get a time. Show the code to get more help! In a similar situation I was in recently, however, it turned out that my code was actually running in less than 0.01 seconds.

Why main thread is slower than worker thread in pthread-win32?

void* worker(void*)
{
int clk = clock();
float val = 0;
for(int i = 0; i != 100000000; ++i)
{
val += sin(i);
}
printf("val: %f\n", val);
printf("worker: %d ms\n", clock() - clk);
return 0;
}
int main()
{
pthread_t tid;
pthread_create(&tid, NULL, worker, NULL);
int clk = clock();
float val = 0;
for(int i = 0; i != 100000000; ++i)
{
val += sin(i);
}
printf("val: %f\n", val);
printf("main: %d ms\n", clock() - clk);
pthread_join(tid, 0);
return 0;
}
Main thread and the worker thread are supposed to run equally fast, but the result is:
val: 0.782206
worker: 5017 ms
val: 0.782206
main: 8252 ms
Main thread is much slower, I don't know why....
Problem solved. It's a compiler issue: GCC (MinGW) behaves weirdly on Windows.
I compiled the code in Visual Studio 2012, and there's no speed difference.

Main thread and the worker thread are supposed to run equally fast, but the result is:
I have never seen a threading system outside a realtime OS that provided such guarantees. With Windows threads and all other threading systems on desktop systems (I have also used POSIX threads, whatever the lightweight threading on Mac OS X is, and threads in C#), it is my understanding that there are no performance guarantees in terms of how fast one thread will be in relation to another.
A possible explanation (speculation) could be that since you are using a modern quadcore it could be raising the clock rate on the main core. When there are mostly single threaded workloads modern i5/i7/AMD-FX systems raise the clock rate on one core to a pre-rated level that stock cooling can dissipate the heat for. On more parallel workloads all the cores get a smaller bump in clock speed, again pre-rated based on heat dissipation and when idle all of the cores are throttled down to minimize power usage. It is possible that the amount of background work is mostly performed on a single core and the amount of time the second thread spends on the second core is not enough to justify switching to the mode where all the cores speed is boosted.
I would try again with 4 threads and 10x the workload. If you have a tool that monitors CPU load and clock-speeds I would check that. Using that information you can infer if I am right or wrong.
Another option might be profiling to see which part of the work is taking the time. It could be that the OS calls are taking more time than your workload.
You could also test your software on another machine with different performance characteristics such as steady clock-speed or single core. This would provide more information.

What could be happening is that the worker thread execution is being interleaved with main's execution, so that some of the worker thread's execution time is being counted against main's time. You could try putting a sleep(10) (some time larger than the run-time of the worker and of main) at the very beginning of the worker and run again.

pthread sleep function, cpu consumption

In advance, sorry for my far from perfect English.
I've recently written myself a daemon for Linux (to be exact, for an OpenWRT router) in C++ and I ran into a problem.
There are a few threads in it: one for each open TCP connection, a main thread waiting for new TCP connections and, as I call it, a commander thread that checks status.
Everything works fine, but my CPU is always at 100%. I know that it's because of the commander code:
void *CommanderThread(void* arg)
{
Commander* commander = (Commander*)arg;
pthread_detach(pthread_self());
clock_t endwait;
while(true)
{
uint8_t temp;
endwait = clock () + (int)(1 * CLOCKS_PER_SEC);
for(int i=0;i<commander->GetCount();i++)
{
ptrRelayBoard rb = commander->GetBoard(i);
if (rb!= NULL)
rb->Get(0x01,&temp);
}
while (clock() < endwait);
}
return NULL;
}
As you can see, the program does stuff every 1 s. Time is not critical here. I know that the CPU is constantly checking whether the time has passed. I've tried to do something like this:
while (clock() < endwait)
usleep(200);
But the function usleep (and sleep as well) seems to freeze the clock increment (it's always a constant value after the usleep).
Is there any solution, a ready-made function (like phread_sleep(20ms)), or a workaround for my problem? Maybe I should access the main clock somehow?
Here it's not so critical: I can pretty much measure how long the status check took (latch clock() before, compare with after) and compute the value to pass as the argument to usleep. But in another thread, I would like to use this form.
Does usleep freeze the whole process?
I'm currently debugging it on Cygwin, but I don't think the problem lies there.
Thanks for any answers and suggestions; they are much appreciated.
J.L.

If it doesn't need to be exactly 1s, then just usleep a second. usleep and sleep put the current thread into an efficient wait state that is at least the amount of time you requested (and then it becomes eligible for being scheduled again).
If you aren't trying to get near exact time there's no need to check clock().

I have resolved it another way.
#include <pthread.h>
#include <stdint.h>
#include <sys/time.h>
#include <unistd.h>
#define CLOCK_US_IN_SECOND 1000000LL
// Current time of day in microseconds.
// long long avoids overflow on targets where long is 32 bits wide.
static long long myclock()
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (tv.tv_sec * CLOCK_US_IN_SECOND) + tv.tv_usec;
}
void *MainThread(void* arg)
{
Commander* commander = (Commander*)arg;
pthread_detach(pthread_self());
long long endwait;
while(true)
{
uint8_t temp;
endwait = myclock() + 1 * CLOCK_US_IN_SECOND;
for(int i=0;i<commander->GetCount();i++)
{
ptrRelayBoard rb = commander->GetBoard(i);
if (rb!= NULL)
rb->Get(0x01,&temp);
}
// Sleep in 50 ms slices until the second is over. The earlier
// (int)0.05*CLOCK_US_IN_SECOND cast 0.05 to 0 first, i.e. usleep(0).
while (myclock() < endwait)
usleep(50000);
}
return NULL;
}
Bear in mind that this code is vulnerable to changes of the system time during execution. I have no idea how to avoid that, but in my case it's not really important.
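One possible way around the time-change problem is to schedule against a monotonic clock. The sketch below assumes C++11 and uses std::chrono::steady_clock with std::this_thread::sleep_until; run_periodic and its parameters are hypothetical, and a real daemon would loop until a shutdown flag is set instead of counting ticks.

```cpp
#include <chrono>
#include <thread>

// Calls `poll` once per `period`, `ticks` times in total. Deadlines are
// computed on steady_clock, so changing the system time does not affect
// them, and the thread sleeps between polls instead of spinning.
template <typename F>
void run_periodic(F poll, std::chrono::milliseconds period, int ticks) {
    using Clock = std::chrono::steady_clock;
    Clock::time_point next = Clock::now();
    for (int i = 0; i < ticks; ++i) {
        poll();
        next += period;  // fixed schedule: poll duration does not add drift
        std::this_thread::sleep_until(next);
    }
}
```

In the commander thread, poll would be the loop over the relay boards and period would be one second.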
