Unexpected result in multithreading scenario in C/C++ under Linux CFS scheduler - c++

I have created multiple threads (4 threads) inside the main thread. While every thread executes the same function,
the scheduling of the threads is not what I expected. As per my understanding of the OS, the Linux CFS scheduler
assigns a virtual run time quantum "t" to each task, and on expiry of that quantum the CPU is preempted from the current thread and
allocated to the next thread. In this manner every thread gets a fair share of the CPU. What I am getting is not as per this expectation.
I am expecting that all threads (threads 1-4 and the main thread) will get the CPU before any thread gets the CPU a second time.
Expected output is
foo3-->1--->Time Now : 00:17:45.346225000
foo3-->1--->Time Now : 00:17:45.348818000
foo4-->1--->Time Now : 00:17:45.350216000
foo4-->1--->Time Now : 00:17:45.352800000
main is running ---> 1--->Time Now : 00:17:45.355803000
main is running ---> 1--->Time Now : 00:17:45.360606000
foo2-->1--->Time Now : 00:17:45.345305000
foo2-->1--->Time Now : 00:17:45.361666000
foo1-->1--->Time Now : 00:17:45.354203000
foo1-->1--->Time Now : 00:17:45.362696000
foo1-->2--->Time Now : 00:17:45.362716000 // foo1 thread got CPU 2nd time as expected
foo1-->2--->Time Now : 00:17:45.365306000
but I am getting
foo3-->1--->Time Now : 00:17:45.346225000
foo3-->1--->Time Now : 00:17:45.348818000
foo4-->1--->Time Now : 00:17:45.350216000
foo4-->1--->Time Now : 00:17:45.352800000
main is running ---> 1--->Time Now : 00:17:45.355803000
main is running ---> 1--->Time Now : 00:17:45.360606000
foo3-->2--->Time Now : 00:17:45.345305000 // foo3 thread got the CPU a 2nd time UNEXPECTEDLY, before the other threads were scheduled as per CFS
foo3-->2--->Time Now : 00:17:45.361666000
foo1-->1--->Time Now : 00:17:45.354203000
foo1-->1--->Time Now : 00:17:45.362696000
foo1-->2--->Time Now : 00:17:45.362716000
foo1-->2--->Time Now : 00:17:45.365306000
Here is my program (thread_multi.cpp)
#include <pthread.h>
#include <stdio.h>
#include "boost/date_time/posix_time/posix_time.hpp"
#include <iostream>
#include <cstdlib>
#include <fstream>
#define NUM_THREADS 4
using namespace std;
std::string now_str()
{
// Get current time from the clock, using microseconds resolution
const boost::posix_time::ptime now =
boost::posix_time::microsec_clock::local_time();
// Get the time offset in current day
const boost::posix_time::time_duration td = now.time_of_day();
const long hours = td.hours();
const long minutes = td.minutes();
const long seconds = td.seconds();
const long nanoseconds = td.total_nanoseconds() - ((hours * 3600 + minutes * 60 + seconds) * 1000000000);
char buf[40];
sprintf(buf, "Time Now : %02ld:%02ld:%02ld.%03ld", hours, minutes, seconds, nanoseconds);
return buf;
}
/* This is our thread function. It is like main(), but for a thread*/
void *threadFunc(void *arg)
{
char *str;
int i = 0;
str=(char*)arg;
while(i < 100 )
{
++i;
ofstream myfile ("example.txt", ios::out | ios::app | ios::binary);
if (myfile.is_open())
{
myfile << str <<"-->"<<i<<"--->" <<now_str() <<" \n";
}
else cout << "Unable to open file";
// generate delay
for(volatile int k=0;k<1000000;k++);
if (myfile.is_open())
{
myfile << str <<"-->"<<i<<"--->" <<now_str() <<"\n\n";
myfile.close();
}
else cout << "Unable to open file";
}
return NULL;
}
int main(void)
{
pthread_t pth[NUM_THREADS]; // this is our thread identifier
int i = 0;
pthread_create(&pth[0],NULL, threadFunc, (void *) "foo1");
pthread_create(&pth[1],NULL, threadFunc, (void *) "foo2");
pthread_create(&pth[2],NULL, threadFunc, (void *) "foo3");
pthread_create(&pth[3],NULL, threadFunc, (void *) "foo4");
std::cout <<".............\n" <<now_str() << '\n';
while(i < 100)
{
for(int k=0;k<1000000;k++);
ofstream myfile ("example.txt", ios::out | ios::app | ios::binary);
if (myfile.is_open())
{
myfile << "main is running ---> "<< i <<"--->"<<now_str() <<'\n';
myfile.close();
}
else cout << "Unable to open file";
++i;
}
// printf("main waiting for thread to terminate...\n");
for(int k=0;k<4;k++)
pthread_join(pth[k],NULL);
std::cout <<".............\n" <<now_str() << '\n';
return 0;
}
Here are the Completely Fair Scheduler settings:
kernel.sched_min_granularity_ns = 100000
kernel.sched_wakeup_granularity_ns = 25000
kernel.sched_latency_ns = 1000000
As per the sched_min_granularity_ns value, any task will execute for at least that minimum amount of time; if a task needs more than that minimum, a time slice is calculated and each task is executed for its time slice.
The time slice is calculated using the formula:
time slice = (weight of the task / total weight of all tasks on that CFS run-queue) x sched_latency_ns
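For instance, with the five runnable tasks in this program (the four worker threads plus the main thread) all at the default weight, each slice would work out to roughly (1/5) x sched_latency_ns = (1/5) x 1,000,000 ns = 200,000 ns, i.e. about 0.2 ms per task per scheduling period (assuming no other runnable tasks on that CPU).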
Can anyone explain why I am getting these scheduling results?
Any help in understanding the output will be highly appreciated.
Thank you in advance.
I am using gcc under Linux.
EDIT 1:
If I change this loop
for(int k=0;k<100000;k++);
into
for(int k=0;k<10000;k++);
then sometimes thread 1 gets the CPU 10 times consecutively, thread 2 gets it 5 times consecutively, thread 3 gets it 5 times consecutively, the main thread gets it 2 times consecutively, and thread 4 gets it 7 times consecutively. It looks like the different threads are preempted at random times.
Any clue why the threads get these seemingly random numbers of consecutive CPU allocations?

The CPU allocates some time to execute each thread. Why doesn't each thread make the same number of prints?
I'll explain this with an example:
Assume that your computer can execute 100 instructions per ns.
Assume that making 1 print is equivalent to using 25 instructions.
Assume that each thread has 1 ns to work.
Now you have to understand that every program on the computer is consuming those 100 available instructions.
If, when your thread wants to print something, there are 100 instructions available, it can print 4 sentences.
If, when your thread wants to print something, there are only 40 instructions available, it can print 1 sentence. There are only 40 instructions because some other program is using the rest.
Do you get it?
If you have any questions, you are welcome. :)

Related

Simple division of labour over threads is not reducing the time taken

I have been trying to improve computation times on a project by splitting the work into tasks/threads and it has not been working out very well. So I decided to make a simple test project to see if I can get it working in a very simple case and this also is not working out as I expected it to.
What I have attempted to do is:
do a task X times in one thread - check the time taken.
do a task X / Y times in Y threads - check the time taken.
So if 1 thread takes T seconds to do 100'000'000 iterations of "work" then I would expect:
2 threads doing 50'000'000 iterations each would take ~ T / 2 seconds
3 threads doing 33'333'333 iterations each would take ~ T / 3 seconds
and so on until I reach some threading limit (number of cores or whatever).
So I wrote the code and tested it on my 8-core system (AMD Ryzen) with plenty of RAM (>16 GB), doing nothing else at the time.
1 Threads took: ~6.5s
2 Threads took: ~6.7s
3 Threads took: ~13.3s
8 Threads took: ~16.2s
So clearly something is not right here!
I ported the code into Godbolt and I see similar results. Godbolt only allows 3 threads, and for 1, 2 or 3 threads it takes ~8s (this varies by about 1s) to run. Here is the godbolt live code: https://godbolt.org/z/6eWKWr
Finally here is the code for reference:
#include <iostream>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <chrono>
#include <vector>
#include <thread>
#define randf() ((double) rand()) / ((double) (RAND_MAX))
void thread_func(uint32_t iterations, uint32_t thread_id)
{
// Print the thread id / workload
std::cout << "starting thread: " << thread_id << " workload: " << iterations << std::endl;
// Get the start time
auto start = std::chrono::high_resolution_clock::now();
// do some work for the required number of iterations
for (auto i = 0u; i < iterations; i++)
{
double value = randf();
double calc = std::atan(value);
(void) calc;
}
// Get the time taken
auto total_time = std::chrono::high_resolution_clock::now() - start;
// Print it out
std::cout << "thread: " << thread_id << " finished after: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
<< "ms" << std::endl;
}
int main()
{
// Note: these numbers vary by about 1s, probably due to godbolt server load (?)
// 1 Threads takes: ~8s
// 2 Threads takes: ~8s
// 3 Threads takes: ~8s
uint32_t num_threads = 3; // Max 3 in godbolt
uint32_t total_work = 100'000'000;
// Seed rand
std::srand(static_cast<unsigned long>(std::chrono::steady_clock::now().time_since_epoch().count()));
// Store the start time
auto overall_start = std::chrono::high_resolution_clock::now();
// Start all the threads doing work
std::vector<std::thread> task_list;
for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
{
task_list.emplace_back(std::thread([=](){ thread_func(total_work / num_threads, thread_id); }));
}
// Wait for the threads to finish
for (auto &task : task_list)
{
task.join();
}
// Get the end time and print it
auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
std::cout << "\n==========================\n"
<< "thread overall_total_time time: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
<< "ms" << std::endl;
return 0;
}
Note: I have tried using std::async also with no difference (not that I was expecting any). I also tried compiling for release - no difference.
I have read such questions as: why-using-more-threads-makes-it-slower-than-using-less-threads and I can't see an obvious (to me) bottleneck:
CPU bound (needs lots of CPU resources): I have 8 cores
Memory bound (needs lots of RAM resources): I have assigned my VM 10GB ram, running nothing else
I/O bound (Network and/or hard drive resources): No network traffic involved
There is no sleeping/mutexing going on here (like there is in my real project)
Questions are:
Why might this be happening?
What am I doing wrong?
How can I improve this?
The rand function is not guaranteed to be thread safe. It appears that, in your implementation, it is made thread safe by using a lock or mutex, so when multiple threads try to generate a random number they take turns. As your loop is mostly just the call to rand, the performance suffers with multiple threads.
You can use the facilities of the <random> header and have each thread use its own engine to generate the random numbers.
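A minimal sketch of that idea (not the exact code from the question, just the same shape with rand() swapped for a per-thread std::mt19937; the seeding scheme here is only illustrative):

#include <chrono>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Each thread owns its own engine, so there is no shared rand() state
    // and therefore no lock to fight over.
    std::mt19937 engine(std::random_device{}() + thread_id);
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    auto start = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < iterations; ++i)
    {
        double value = dist(engine);
        double calc  = std::atan(value);
        (void)calc;
    }
    auto total = std::chrono::high_resolution_clock::now() - start;
    std::cout << "thread " << thread_id << " finished after "
              << std::chrono::duration_cast<std::chrono::milliseconds>(total).count()
              << "ms\n";
}

int main()
{
    const uint32_t total_work  = 100'000'000;
    const uint32_t num_threads = 4;
    std::vector<std::thread> threads;
    for (uint32_t id = 1; id <= num_threads; ++id)
        threads.emplace_back(thread_func, total_work / num_threads, id);
    for (auto& t : threads)
        t.join();
}

With independent engines the per-thread loops should scale with the core count, unlike the shared-state rand() version.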
Never mind that rand() is or isn't thread safe. That might be the explanation if a statistician told you that the "random" numbers you were getting were defective in some way, but it doesn't explain the timing.
What explains the timing is that there is only one random state object, it's out in memory somewhere, and all of your threads are competing with each other to access it.
No matter how many CPUs your system has, only one thread at a time can access the same location in main memory.
It would be different if each of the threads had its own independent random state object. Then, most of the accesses from any given CPU to its own private random state would only have to go as far as that CPU's local cache, and they would not conflict with what the other threads, running on other CPUs, each with their own local cache, were doing.

opencv much slower in multithreading

I'm writing a console application that uses multithreading. Each thread processes a set of images using OpenCV functions.
If the function that uses the OpenCV functions is executed in a single thread, I get a reference computation time. If I execute this function from multiple threads, the function (individually in each thread) is much slower (nearly double), when it should be nearly the same.
Does OpenCV parallelize, serialize or block the execution by itself?
I have tested the application using OpenCV libraries compiled WITH_TBB and without TBB and the result is almost the same. I don't know if it has any influence, but I have also seen that some functions like cv::threshold or cv::findContours create 12 additional sub-processes when being executed. If the OpenCV calls are commented out, the time is the same for all threads and the same as that obtained in a single-threaded execution, so in this case the multithreading is working well. The question is whether there is an OpenCV compilation option or a function call that allows obtaining the same time in multithreaded and single-threaded execution?
EDIT
This is the result of increasing the number of threads (cores) in a 4-core CPU, executing the same function with 1, 2, 3 and 4 cores. Each core processes 768 images with 1600x1200 resolution in a for loop. Inside the loop the function causing the increasing delay is called. I should expect that, independently of the number of cores, the time is approximately the same as that obtained for a single thread (35000 ms), or 10% more, but, as can be seen, the time rises when the number of threads is increased, and I cannot find why...
TIMES: (Sorry, the system does not allow me to upload images to posts)
time in File No. 3 --> 35463
Mean time using 1 cores is: 47ms
time in File No. 3 --> 42747
time in File No. 3 --> 42709
Mean time using 2 cores is: 28ms
time in File No. 3 --> 54587
time in File No. 3 --> 54595
time in File No. 3 --> 54437
Mean time using 3 cores is: 24ms
time in File No. 3 --> 68751
time in File No. 3 --> 68865
time in File No. 3 --> 68878
time in File No. 3 --> 68622
Mean time using 4 cores is: 22ms
If no OpenCV code is used inside the function, the time, as expected, is similar for all the cases (1, 2, 3 or 4 threads), but when an OpenCV function is used, for example only with a simple call to:
img.convertTo(img,CV_32F);
with img being a cv::Mat, the time increases when the number of threads is increased. I have also made a test with the hyper-threading option disabled in the CPU BIOS. In that case all the times decrease, the time with 1 thread being 25,000 ms, but the problem of increasing time is still present (33 s with 2 threads, 43 with 3, 57 with 4)... I don't know if this tells you something.
Edit 2
A mcve:
#include "stdafx.h"
#include <future>
#include <chrono>
#include "Filter.h"
#include <iostream>
#include <future>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
long long Ticks();
int WithOpencv(cv::Mat img);
int With_OUT_Opencv(cv::Mat img);
int TestThreads (char *buffer,std::string file);
#define Blur3x3(matrix,f,c) ((matrix[(f-1)*1600+(c-1)] + matrix[(f-1)*1600+c] + matrix[(f-1)*1600+(c+1)] + matrix[f*1600+(c-1)] + matrix[f*1600+c] + matrix[f*1600+(c+1)] + matrix[(f+1)*1600+(c-1)] + matrix[(f+1)*1600+c] + matrix[(f+1)*1600+(c+1)])/9)
int _tmain(int argc, _TCHAR* argv[])
{
std::string file="Test.bmp";
auto function = [&](char *buffer){return TestThreads(buffer,file);};
char *buffers[12];
std::future<int> frames[12];
DWORD tid;
int i,j;
int nframes = 0;
int ncores;
cv::setNumThreads(8);
for (i=0;i<8;i++) buffers[i] = new char[1000*1024*1024];
for (j=1;j<9;j++)
{
ncores = j;
long long t = Ticks();
for (i=0;i<ncores;i++) frames[i] = std::async(std::launch::async,function,buffers[i]);
for (i=0;i<ncores;i++) nframes += frames[i].get();
t = Ticks() - t;
std::cout << "Mean time using " << ncores << " cores is: " << t/nframes << "ms" << std::endl << std::endl;
nframes = 0;
Sleep(2000);
}
for (int i=0;i<8;i++) delete [] buffers[i];
return 0;
}
int TestThreads (char *buffer,std::string file)
{
long long ta;
int res;
char *ruta=new char[file.length() + 1];
strcpy(ruta,file.c_str());
cv::Mat img (1200, 1600, CV_8UC1);
img=cv::imread(file);
ta = Ticks();
for (int i=0;i<15;i++) {
//Uncomment this and comment next line to test without opencv calls. With_OUT_Opencv implements simple filters with direct operations over mat data
//res = With_OUT_Opencv(img);
res = WithOpencv(img);
}
ta = Ticks() - ta;
std::cout << "Time in file No. 3--> " << ta << std::endl;
return 15;
}
int WithOpencv(cv::Mat img){
cv::Mat img_bin;
cv::Mat img_filtered;
cv::Mat img_filtered2;
cv::Mat img_res;
int Crad_morf=2;
double Tthreshold=20;
cv::Mat element = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(2*Crad_morf + 1, 2*Crad_morf+1));
img.convertTo(img,CV_32F);
cv::blur(img, img_filtered, cv::Size(3, 3));
cv::blur(img.mul(img), img_filtered2, cv::Size(3, 3));
cv::sqrt(img_filtered2 - img_filtered.mul(img_filtered), img_res);
cv::normalize(img_res, img_res, 0.0, 1.0, cv::NORM_MINMAX);
img_res.convertTo(img_res,CV_8UC1,255.0);
cv::threshold(img_res, img_bin, Tthreshold, 255, cv::THRESH_BINARY);
if (Crad_morf!=0){
cv::dilate(img_bin, img_bin, element);
}
return 0;
}
int With_OUT_Opencv(cv::Mat img){
unsigned char *baux1 = new unsigned char[1600*1200];
unsigned short *baux2 = new unsigned short[1600*1200];
unsigned char max=0;
int f,c,i;
unsigned char threshold = 177;
for (f=1;f<1199;f++) // Bad Blur filters
{
for (c=1; c<1599; c++)
{
baux1[f*1600+c] = Blur3x3(img.data,f,c);
baux1[f*1600+c] = baux1[f*1600+c] * baux1[f*1600+c];
baux2[f*1600+c] = img.data[f*1600+c] * img.data[f*1600+c];
}
}
for (f=1;f<1199;f++)
{
for (c=1; c<1599; c++)
{
baux1[f*1600+c] = sqrt(Blur3x3(baux2,f,c) - baux1[f*1600+c]);
if (baux1[f*1600+c] > max) max = baux1[f*1600+c];
}
}
threshold = threshold * ((float)max/255.0); // Bad Norm/Bin
for (i=0;i<1600*1200;i++)
{
if (baux1[i]>threshold) baux1[i] = 1;
else baux1[i] = 0;
}
delete []baux1;
delete []baux2;
return 0;
}
long long Ticks()
{
static long long last = 0;
static unsigned ticksPerMS = 0;
LARGE_INTEGER largo;
if (last==0)
{
QueryPerformanceFrequency(&largo);
ticksPerMS = (unsigned)(largo.QuadPart/1000);
QueryPerformanceCounter(&largo);
last = largo.QuadPart;
return 0;
}
QueryPerformanceCounter(&largo);
return (largo.QuadPart-last)/ticksPerMS;
}
I'm confused as to what your question is.
Your initial question suggested that running x iterations in serial is considerably faster than running them in parallel (with the same target function), and you're wondering why running the same target function is considerably slower in a multithreaded scenario.
However, I now see that your example is comparing the performance of OpenCV with some other custom code. Is that what your question is about?
Regarding the question as I initially understood it, the answer is: no, running the target function in serial is not considerably faster than running it in parallel. See results and code below.
Results
eight threads took 4104.38 ms
single thread took 7272.68 ms
four threads took 3687 ms
two threads took 4500.15 ms
(on an Apple MBA 2012 i5 & OpenCV 3)
Test code
#include <iostream>
#include <vector>
#include <chrono>
#include <thread>
#include <opencv2/opencv.hpp>
using namespace std;
using namespace std::chrono;
using namespace cv;
class benchmark {
time_point<steady_clock> start = steady_clock::now();
string title;
public:
benchmark(const string& title) : title(title) {}
~benchmark() {
auto diff = steady_clock::now() - start;
cout << title << " took " << duration <double, milli> (diff).count() << " ms" << endl;
}
};
template <typename F>
void repeat(unsigned n, F f) {
while (n--) f();
};
int targetFunction(Mat img){
cv::Mat img_bin;
cv::Mat img_filtered;
cv::Mat img_filtered2;
cv::Mat img_res;
int Crad_morf=2;
double Tthreshold=20;
cv::Mat element = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(2*Crad_morf + 1, 2*Crad_morf+1));
img.convertTo(img,CV_32F);
cv::blur(img, img_filtered, cv::Size(3, 3));
cv::blur(img.mul(img), img_filtered2, cv::Size(3, 3));
cv::sqrt(img_filtered2 - img_filtered.mul(img_filtered), img_res);
cv::normalize(img_res, img_res, 0.0, 1.0, cv::NORM_MINMAX);
img_res.convertTo(img_res,CV_8UC1,255.0);
cv::threshold(img_res, img_bin, Tthreshold, 255, cv::THRESH_BINARY);
if (Crad_morf!=0){
cv::dilate(img_bin, img_bin, element);
}
//imshow("WithOpencv", img_bin);
return 0;
}
void runTargetFunction(int nIterations, int nThreads, const Mat& img) {
int nIterationsPerThread = nIterations / nThreads;
vector<thread> threads;
auto targetFunctionFn = [&img]() {
targetFunction(img);
};
setNumThreads(nThreads);
repeat(nThreads, [&] {
threads.push_back(thread([=]() {
repeat(nIterationsPerThread, targetFunctionFn);
}));
});
for(auto& thread : threads)
thread.join();
}
int main(int argc, const char * argv[]) {
string file = "../../opencv-test/Test.bmp";
auto img = imread(file);
const int nIterations = 64;
// let's run using eight threads
{
benchmark b("eight threads");
runTargetFunction(nIterations, 8, img);
}
// let's run using a single thread
{
benchmark b("single thread");
runTargetFunction(nIterations, 1, img);
}
// let's run using four threads
{
benchmark b("four threads");
runTargetFunction(nIterations, 4, img);
}
// let's run using a two threads
{
benchmark b("two threads");
runTargetFunction(nIterations, 2, img);
}
return 0;
}
You are measuring three things:
The time that all threads need to complete the whole task divided by the size of the whole task.
The time required by each individual thread to complete its part of the task.
The time required to complete the whole task.
You are observing that the first time goes down from 47 ms to 22 ms when increasing the number of threads. That is good! At the same time you are noticing that the time required by an individual thread increases from 35463 to about 68751 (whatever the units). Finally, you are seeing that the overall execution time goes up.
Regarding the second measurement: when increasing the number of threads, the individual threads need longer to perform their respective operations. Two possible explanations:
Your threads are competing for memory bus bandwidth.
Your threads are triggering computations that are multi-threaded by themselves, so effectively they are competing with each other for CPU time.
Now for the question of why the overall working time increases. The reason is simple: you are not only increasing the number of threads, you are increasing the workload at the same rate. If your threads were not competing with each other at all and there were no overhead involved, N threads would require the same time to do N times the work. They do not, so you are noticing a slowdown.
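If the second explanation applies, one thing worth trying is to turn off OpenCV's internal parallelism so that only the worker threads you create yourself compete for the cores. A minimal sketch, assuming a recent OpenCV where cv::setNumThreads(0) makes OpenCV run its functions sequentially (the setting is process-wide):

#include <opencv2/core/core.hpp>

int main()
{
    // Process-wide: disable OpenCV's own parallel_for_ pool so the
    // std::async workers from the MCVE are the only source of parallelism.
    cv::setNumThreads(0);

    // ... launch the workers exactly as in the MCVE above ...
    return 0;
}

If the per-thread times stop growing with the thread count after this change, the slowdown was oversubscription; if they still grow, memory bandwidth contention is the more likely culprit.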

Avoiding CPU Contention

I have a program whose execution time I want to measure:
#include <iostream>
#include <boost/chrono.hpp>
using namespace std;
int main(int argc, char* const argv[])
{
boost::chrono::system_clock::time_point start = boost::chrono::system_clock::now();
// Intructions to burn time
boost::chrono::duration<double> sec = boost::chrono::system_clock::now() - start;
cout <<"---- time execution is " << sec.count() << ";";
return 0;
}
For example the result after one run:
---- time execution is 0.0223588
This result isn't very reliable, because contention for the CPU is included in the measurement.
I had an idea to reduce the effect of CPU contention by doing many runs and taking their average.
The problem is:
How can I store the time value of the previous run?
Can we do that via a file?
How can I incrementally calculate the average after each run?
Your suggestions / pseudocode are welcome.
You may pass the running average via the command line using 2 arguments: the current average value and the number of runs performed so far.
Then:
NewAverage = ((CurrentAverage * N) + CurrentValue) / (N + 1);
where N is the number of runs performed so far.
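A rough sketch of that approach, building on the boost::chrono snippet from the question (the argument layout and output format here are just one possible choice):

#include <boost/chrono.hpp>
#include <cstdlib>
#include <iostream>

int main(int argc, char* argv[])
{
    // Previous state passed on the command line: ./bench <CurrentAverage> <N>
    double currentAverage = (argc > 1) ? std::atof(argv[1]) : 0.0;
    long   n              = (argc > 2) ? std::atol(argv[2]) : 0;

    boost::chrono::system_clock::time_point start = boost::chrono::system_clock::now();
    // Instructions to burn time go here
    boost::chrono::duration<double> sec = boost::chrono::system_clock::now() - start;

    // Incremental average: NewAverage = ((CurrentAverage * N) + CurrentValue) / (N + 1)
    double newAverage = (currentAverage * n + sec.count()) / (n + 1);
    std::cout << "run " << (n + 1) << ": time = " << sec.count()
              << " s, running average = " << newAverage << " s\n";
    // Feed newAverage and n + 1 back in as the arguments of the next run.
    return 0;
}

A small shell loop (or a file, if you prefer persistence across sessions) can carry the two values from one invocation to the next.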

Odd results when adding artificial delays to C++ code. Embedded Linux

I have been looking at the performance of our C++ server application running on embedded Linux (ARM). The pseudo code for the main processing loop of the server is this -
for i = 1 to 1000
Process item i
Sleep for 20 ms
The processing for one item takes about 2 ms. The "Sleep" here is really a call to the Poco library to do a "tryWait" on an event. If the event is fired (which it never is in my tests) or the time expires, it returns. I don't know what system call this equates to. Although we ask for a 2 ms block, it turns out to be roughly 20 ms. I can live with that - that's not the problem. The sleep is just an artificial delay so that other threads in the process are not starved.
The loop takes about 24 seconds to go through 1000 items.
The problem is, we changed the way the sleep is used so that we had a bit more control. I mean - 20ms delay for 2ms processing doesn't allow us to do much processing. With this new parameter set to a certain value it does something like this -
For i = 1 to 1000
Process item i
if i % 50 == 0 then sleep for 1000ms
That's the rough code, in reality the number of sleeps is slightly different and it happens to work out at a 24s cycle to get through all the items - just as before.
So we are doing exactly the same amount of processing in the same amount of time.
Problem 1 - the CPU usage for the original code is reported at around 1% (it varies a little but that's about average) and the CPU usage reported for the new code is about 5%. I think they should be the same.
Well, perhaps this CPU reporting isn't accurate, so I thought I'd sort a large text file at the same time and see how much it's slowed down by our server. This is a CPU-bound process (98% CPU usage according to top). The results are very odd. With the old code, the time taken to sort the file goes up by 21% when our server is running.
Problem 2 - If the server is only using 1% of the CPU then wouldn't the time taken to do the sort be pretty much the same?
Also, the time taken to go through all the items doesn't change - it's still 24 seconds with or without the sort running.
Then I tried the new code; it only slows the sort down by about 12%, but it now takes about 40% longer to get through all the items it has to process.
Problem 3 - Why do the two ways of introducing an artificial delay cause such different results? It seems that the server which sleeps more frequently but for a minimum time is getting more priority.
I have a half-baked theory on the last one - whatever system call is used to do the "sleep" switches back to the server process when the time has elapsed. This gives the process another bite at the time slice on a regular basis.
Any help appreciated. I suspect I'm just not understanding it correctly and that things are more complicated than I thought. I can provide more details if required.
Thanks.
Update: replaced tryWait(2) with usleep(2000) - no change. In fact, sched_yield() does the same.
Well I can at least answer problem 1 and problem 2 (as they are the same issue).
After trying out various options in the actual server code, we came to the conclusion that the CPU reporting from the OS is incorrect. It's quite a surprising result, so to make sure, I wrote a stand-alone program that doesn't use Poco or any of our code. Just plain Linux system calls and standard C++ features. It implements the pseudo code above. The processing is replaced with a tight loop just checking the elapsed time to see if 2 ms is up. The sleeps are proper sleeps.
The small test program shows exactly the same problem, i.e. doing the same amount of processing but changing the way the sleep function is called produces very different results for CPU usage. In the case of the test program, the reported CPU usage was 0.0078 seconds using 1000 20 ms sleeps, but 1.96875 seconds when the less frequent 1000 ms sleep was used. The amount of processing done is the same.
Running the test on a Linux PC did not show the problem. Both ways of sleeping produced exactly the same CPU usage.
So it is clearly a problem with our embedded system and the way it measures CPU time when a process is yielding so often (you get the same problem with sched_yield instead of a sleep).
Update: Here's the code. RunLoop is where the main bit is done -
int sleepCount;
double getCPUTime( )
{
clockid_t id = CLOCK_PROCESS_CPUTIME_ID;
struct timespec ts;
if ( id != (clockid_t)-1 && clock_gettime( id, &ts ) != -1 )
return (double)ts.tv_sec +
(double)ts.tv_nsec / 1000000000.0;
return -1;
}
double GetElapsedMilliseconds(const timeval& startTime)
{
timeval endTime;
gettimeofday(&endTime, NULL);
double elapsedTime = (endTime.tv_sec - startTime.tv_sec) * 1000.0; // sec to ms
elapsedTime += (endTime.tv_usec - startTime.tv_usec) / 1000.0; // us to ms
return elapsedTime;
}
void SleepMilliseconds(int milliseconds)
{
timeval startTime;
gettimeofday(&startTime, NULL);
usleep(milliseconds * 1000);
double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
if (elapsedMilliseconds > milliseconds + 0.3)
std::cout << "Sleep took longer than it should " << elapsedMilliseconds;
sleepCount++;
}
void DoSomeProcessingForAnItem()
{
timeval startTime;
gettimeofday(&startTime, NULL);
double processingTimeMilliseconds = 2.0;
double elapsedMilliseconds;
do
{
elapsedMilliseconds = GetElapsedMilliseconds(startTime);
} while (elapsedMilliseconds <= processingTimeMilliseconds);
if (elapsedMilliseconds > processingTimeMilliseconds + 0.1)
std::cout << "Processing took longer than it should " << elapsedMilliseconds;
}
void RunLoop(bool longSleep)
{
int numberOfItems = 1000;
timeval startTime;
gettimeofday(&startTime, NULL);
timeval startMainLoopTime;
gettimeofday(&startMainLoopTime, NULL);
for (int i = 0; i < numberOfItems; i++)
{
DoSomeProcessingForAnItem();
double elapsedMilliseconds = GetElapsedMilliseconds(startTime);
if (elapsedMilliseconds > 100)
{
std::cout << "Item count = " << i << "\n";
if (longSleep)
{
SleepMilliseconds(1000);
}
gettimeofday(&startTime, NULL);
}
if (longSleep == false)
{
// Does 1000 * 20 ms sleeps.
SleepMilliseconds(20);
}
}
double elapsedMilliseconds = GetElapsedMilliseconds(startMainLoopTime);
std::cout << "Main loop took " << elapsedMilliseconds / 1000 <<" seconds\n";
}
void DoTest(bool longSleep)
{
timeval startTime;
gettimeofday(&startTime, NULL);
double startCPUtime = getCPUTime();
sleepCount = 0;
int runLoopCount = 1;
for (int i = 0; i < runLoopCount; i++)
{
RunLoop(longSleep);
std::cout << "**** Done one loop of processing ****\n";
}
double endCPUtime = getCPUTime();
std::cout << "Elapsed time is " <<GetElapsedMilliseconds(startTime) / 1000 << " seconds\n";
std::cout << "CPU time used is " << endCPUtime - startCPUtime << " seconds\n";
std::cout << "Sleep count " << sleepCount << "\n";
}
void testLong()
{
std::cout << "Running testLong\n";
DoTest(true);
}
void testShort()
{
std::cout << "Running testShort\n";
DoTest(false);
}

Delay execution 1 second

So I am trying to program a simple tick-based game. I write in C++ on a linux machine. The code below illustrates what I'm trying to accomplish.
for (unsigned int i = 0; i < 40; ++i)
{
functioncall();
sleep(1000); // wait 1 second for the next function call
}
Well, this doesn't work. It seems that it sleeps for 40 seconds, then prints out whatever the result is from the function call.
I also tried creating a new function called delay, and it looked like this:
void delay(int seconds)
{
time_t start, current;
time(&start);
do
{
time(&current);
}
while ((current - start) < seconds);
}
Same result here. Anybody?
To reiterate what has already been stated by others, with a concrete example:
Assuming you're using std::cout for output, you should call std::cout.flush(); right before the sleep command. See this MS knowledgebase article.
sleep(n) waits for n seconds, not n milliseconds.
Also, as mentioned by Bart, if you're writing to stdout, you should flush the stream after each write - otherwise, you won't see anything until the buffer is flushed.
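Putting those two points together, a version of the loop from the question might look like this (sleep(1) waits one second; the explicit flush makes the output appear before the program blocks):

#include <iostream>
#include <unistd.h>   // sleep()

void functioncall()
{
    std::cout << "tick";
}

int main()
{
    for (unsigned int i = 0; i < 40; ++i)
    {
        functioncall();
        std::cout.flush();  // push buffered output out before sleeping
        sleep(1);           // 1 second, not 1000
    }
}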
So I am trying to program a simple tick-based game. I write in C++ on a linux machine.
If functioncall() may take a considerable time, then your ticks won't be evenly spaced if you always sleep for the same amount of time.
You might be trying to do this:
while 1: // mainloop
functioncall()
tick() # wait for the next tick
Here tick() sleeps approximately delay - time_it_takes_for(functioncall), i.e. the longer functioncall() takes, the less time tick() sleeps.
sleep() sleeps an integer number of seconds. You might need a finer time resolution. You could use clock_nanosleep() for that.
Example Clock::tick() implementation
// $ g++ *.cpp -lrt && time ./a.out
#include <iostream>
#include <stdio.h> // perror()
#include <stdlib.h> // ldiv()
#include <time.h> // clock_nanosleep()
namespace {
class Clock {
const long delay_nanoseconds;
bool running;
struct timespec time;
const clockid_t clock_id;
public:
explicit Clock(unsigned fps) : // specify frames per second
delay_nanoseconds(1e9/fps), running(false), time(),
clock_id(CLOCK_MONOTONIC) {}
void tick() {
if (clock_nanosleep(clock_id, TIMER_ABSTIME, nexttick(), 0)) {
// interrupted by a signal handler or an error
perror("clock_nanosleep");
exit(EXIT_FAILURE);
}
}
private:
struct timespec* nexttick() {
if (not running) { // initialize `time`
running = true;
if (clock_gettime(clock_id, &time)) {
//process errors
perror("clock_gettime");
exit(EXIT_FAILURE);
}
}
// increment `time`
// time += delay_nanoseconds
ldiv_t q = ldiv(time.tv_nsec + delay_nanoseconds, 1000000000);
time.tv_sec += q.quot;
time.tv_nsec = q.rem;
return &time;
}
};
}
int main() {
Clock clock(20);
char arrows[] = "\\|/-";
for (int nframe = 0; nframe < 100; ++nframe) { // mainloop
// process a single frame
std::cout << arrows[nframe % (sizeof(arrows)-1)] << '\r' << std::flush;
clock.tick(); // wait for the next tick
}
}
Note: I've used the std::flush manipulator to update the output immediately.
If you run the program it should take about 5 seconds (100 frames, 20 frames per second).
I guess on Linux you have to use usleep(), which is declared in <unistd.h>.
On Windows you can use Sleep() from <windows.h>.
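For completeness, a usleep()-based version of the same wait (usleep() takes microseconds and is declared in <unistd.h> on Linux):

#include <unistd.h>   // usleep(): microsecond-resolution sleep

int main()
{
    for (unsigned int i = 0; i < 40; ++i)
    {
        // functioncall();
        usleep(1000000);  // 1,000,000 microseconds = 1 second
    }
}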