Make select-based loop as responsive as possible - C++

This thread will be very responsive to network activity, but the message queue is only guaranteed to be processed 100 times a second. I can keep reducing the timeout, but after a certain point I will be busy-waiting and chewing up CPU. Is it true that this solution is about as good as I'll get without switching to another method?
// semi pseudocode
while (1) {
    process_thread_message_queue(); // function returns near-instantly

    // select() may modify both the fd set and the timeout,
    // so both must be re-initialized on every iteration
    struct timeval t;
    t.tv_sec = 0;
    t.tv_usec = 10 * 1000; // 10 ms = 0.01 s

    if (select(n, &fdset, 0, 0, &t)) // see if there are incoming packets for the next 1/100 sec
    {
        ... // respond with more packets or processing
    }
}

It depends on what your OS provides for you. On Windows you can wait for a thread message and a number of handles simultaneously using MsgWaitForMultipleObjectsEx, which solves your problem. Other OSes should offer something similar.
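For illustration, here is a minimal sketch of that approach on Windows, assuming a single socket. WSAEventSelect associates the socket with an event handle, so one wait covers both network readiness and queued thread messages; the packet and message handling bodies are placeholders.

#include <winsock2.h>
#include <windows.h>

void event_loop(SOCKET sock) {
    // Signal netEvent whenever the socket becomes readable or is closed.
    WSAEVENT netEvent = WSACreateEvent();
    WSAEventSelect(sock, netEvent, FD_READ | FD_CLOSE);

    HANDLE handles[1] = { netEvent };
    for (;;) {
        // Wake on either the socket event or any queued message; no polling timeout needed.
        DWORD rc = MsgWaitForMultipleObjectsEx(1, handles, INFINITE, QS_ALLINPUT, 0);
        if (rc == WAIT_OBJECT_0) {
            WSAResetEvent(netEvent);
            // ... respond with more packets or processing
        } else if (rc == WAIT_OBJECT_0 + 1) {
            // Drain the thread message queue.
            MSG msg;
            while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)) {
                // ... process one queued message
            }
        }
    }
}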

Related

Posix Timer on SCHED_RR Thread is using 100% CPU

I have the following code snippet:
#include <cstdint>
#include <iostream>
#include <thread>
#include <pthread.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>

int main() {
    std::thread rr_thread([](){
        struct sched_param params = {5};
        pthread_setschedparam(pthread_self(), SCHED_RR, &params);

        struct itimerspec ts;
        struct epoll_event ev;
        int tfd, epfd;
        uint64_t missed;

        ts.it_interval.tv_sec = 0;
        ts.it_interval.tv_nsec = 0;
        ts.it_value.tv_sec = 0;
        ts.it_value.tv_nsec = 20000; // 50 kHz timer

        tfd = timerfd_create(CLOCK_MONOTONIC, 0);
        timerfd_settime(tfd, 0, &ts, NULL);

        epfd = epoll_create(1);
        ev.events = EPOLLIN;
        epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);

        while (true) {
            epoll_wait(epfd, &ev, 1, -1); // wait forever for the timer
            read(tfd, &missed, sizeof(missed));
            // Here I have a blocking function (dummy in this example) which
            // takes on average 15 ns to execute, less than the timer period anyway
            func15ns();
        }
    });
    rr_thread.join();
}
I have a POSIX thread using the SCHED_RR policy, and on this thread a POSIX timer is running with a period of 20000 ns = 50 kHz = 50000 ticks/sec.
After the timer fires I execute a function that takes roughly 15 ns, so less than the timer period, but this doesn't really matter.
When I run this I get 100% CPU usage and the whole system becomes slow, but I don't understand why this is happening, and some things are confusing:
Why 100% CPU usage, since the thread is supposed to be sleeping while waiting for the timer to fire, so other tasks can be scheduled, in theory, right? Even if this is a high-priority thread.
I checked the number of context switches using pidstat and it seems to be very small, close to 0, both voluntary and involuntary ones. Is this normal? While waiting for the timer to fire the scheduler should schedule other tasks, right? I should see at least 20000 * 2 context switches / sec.
As presented, your program does not behave as you describe. This is because you program the timer as a one-shot, not a repeating timer. For a timer that fires every 20000 ns, you want to set a 20000-ns interval:
ts.it_interval.tv_nsec = 20000;
Having modified that, I get a program that produces a heavy load on one core.
Why 100% CPU usage, since the thread is supposed to be sleeping while waiting for the timer to fire, so other tasks can be scheduled, in theory, right? Even if this is a high-priority thread.
Sure, your thread blocks in epoll_wait() to await timer ticks, if in fact it manages to loop back there before the timer ticks again. On my machine, your program consumes only about 30% of one core, which seems to confirm that such blocking will indeed happen. That you see 100% CPU use suggests that my computer runs the program more efficiently than yours does, for whatever reason.
But you have to appreciate that the load is very heavy. You are asking to perform all the processing of the timer itself, the epoll call, the read, and func15ns() once every 20000 ns. Yes, whatever time may be left, if any, is available to be scheduled for another task, but the task swap takes a bit more time again. 20000 ns is not very much time. Consider that just fetching a word from main memory costs about 100 ns (though reading one from cache is of course faster).
In particular, do not neglect the work other than func15ns(). If the latter indeed takes only 15 ns to run then it's the least of your worries. You're performing two system calls, and these are expensive. Just how expensive depends on a lot of factors, but consider that removing the epoll_wait() call reduces the load for me from 30% to 25% of a core (and note that the whole epoll setup is superfluous here because simply allowing the read() to block serves the purpose).
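For reference, here is a minimal sketch of that simplification, with the repeating interval set and the epoll layer removed; error handling is omitted and func15ns() stands in for the per-tick work from the question.

#include <cstdint>
#include <unistd.h>
#include <sys/timerfd.h>

void timer_loop() {
    struct itimerspec ts = {};
    ts.it_value.tv_nsec    = 20000;  // first expiration after 20000 ns
    ts.it_interval.tv_nsec = 20000;  // then repeat every 20000 ns

    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    timerfd_settime(tfd, 0, &ts, NULL);

    uint64_t missed;
    while (true) {
        read(tfd, &missed, sizeof(missed));  // blocks until the next tick
        // func15ns();                       // per-tick work goes here
    }
}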
I checked using pidstat the number of context switches and it seems that it's very small, close to 0, both voluntary and involuntary ones. Is this normal? While waiting for the timer to fire the scheduler should schedule other tasks right? I should see at least 20000 * 2 context switches / sec
You're occupying a full CPU with a high priority task, so why do you expect switching?
On the other hand, I'm also observing a low number of context switches for the process running your (modified) program, even though it's occupying only 25% of a core. I'm not prepared at the moment to reason about why that is.

How to simulate time delay in a network

Let's say we need to send the message "Hello World" over UDP between two PCs, A and B. Computer A will send the message to B with some time delay (constant or time-varying). To simulate this scenario, my first attempt was to use a sleep function, but that freezes the entire application. Another solution is to use multiple threads: sleep() in the thread responsible for getting the data, store the data in a global variable, and access that variable from another thread. In this solution there might be difficulties in the synchronization between the threads. To overcome this problem, I would write the received data to a txt file and read it from another thread. My question is: what is the proper way to carry out this trivial experiment? I would appreciate it if the answer included some C++ pseudocode.
Edit:
My attempt to solve it is as follows. For the Master side (client):
Master masterObj

int main()
{
    masterObj.initialize();
    masterObj.connect();
    while( masterObj.isConnected() == true ){
        get currentTime and data; // currentTime here is sendTime
        datagram = currentTime + data;
        masterObj.send( datagram );
    }
}
For the Slave side (server), the pseudo code is
Slave slaveObj

int main()
{
    slaveObj.initialize();
    slaveObj.connect();
    slaveObj.slaveThreadInit();
    while( slaveObj.isConnected() == true ){
        slaveObj.getData();
    }
}
Slave::receive()
{
    get currentTime and call it receivedTime;
    get datagram from Master;
    this->slaveThread( receivedTime + datagram );
}
Slave::slaveThread( info )
{
    sleep( 1 msec );
    info = receivedTime + datagram;
    get time delay;
    time delay = receivedTime - sendTime;
    extract data from datagram;
    insert data and time delay in txt file ( call it txtSlaveData );
}
Slave::getData()
{
    read from txtSlaveData;
}
As you can see, I'm using an independent thread which inside it, I'm using sleep(). I'm not sure if this approach is applicable.
A simple way to simulate sending UDP datagrams from one computer to another is to send the datagrams through the loopback interface to another - or the same - process on the same computer. That will function exactly like the real thing except for the delay.
You can simulate the delay either when sending or receiving. Once you've implemented it one way, the other should be trivial. I think delaying on the sending side is the more natural option. Here is an approach to the more general problem of simulating network delay; see the last paragraph for the trivial experiment of sending only one datagram.
In case you choose delaying on send, what you could do is, instead of sending, store the datagram in a queue, along with the time it should be sent (target = now + delay).
Then, in another thread, wait for a datagram to become available, then sleep for max(target - now, 0). After sleeping, send the datagram and move on to the next one. Wait if queue is empty.
To simulate jitter, randomize the delay. To allow the jitter simulation to deliver datagrams out of order (as a real network can), use a priority queue sorted by the target send time.
Remember to synchronize the access to the queue.
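Here is a minimal sketch of that scheme, assuming C++11 and a hypothetical send_udp() function doing the real transmission; the priority queue is ordered by target send time and protected by a mutex and condition variable.

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Pending {
    Clock::time_point target;   // when the datagram should really be sent
    std::string payload;
    bool operator>(const Pending& other) const { return target > other.target; }
};

std::priority_queue<Pending, std::vector<Pending>, std::greater<Pending>> pending;
std::mutex mtx;
std::condition_variable cv;

// Producer: instead of sending immediately, enqueue with target = now + delay.
void delayed_send(std::string payload, Clock::duration delay) {
    {
        std::lock_guard<std::mutex> lock(mtx);
        pending.push({Clock::now() + delay, std::move(payload)});
    }
    cv.notify_one();
}

// Consumer thread: sleep until the earliest target time, then really send.
void sender_thread() {
    std::unique_lock<std::mutex> lock(mtx);
    for (;;) {
        cv.wait(lock, [] { return !pending.empty(); });          // wait if queue is empty
        Pending next = pending.top();
        if (cv.wait_until(lock, next.target) == std::cv_status::timeout) {
            pending.pop();
            lock.unlock();
            // send_udp(next.payload);    // hypothetical: the actual UDP send
            lock.lock();
        }
        // otherwise a new datagram arrived; loop and re-check the earliest target
    }
}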
For a single datagram, you can do something much simpler: start a new thread, sleep for the delay, send, and let the thread end. No synchronization is needed. Here's C++ code for that (delay must be captured by the lambda):
std::thread([delay]{
    std::this_thread::sleep_for(delay);
    send("foo");
}).detach();

High CPU usage while using poll system call to wait on fds

I have this unique problem where the Linux poll system call used in my code reports the fds it waits on as readable (POLLIN) every millisecond. This is causing high CPU usage. I have supplied a timeout of 100 milliseconds and it seems to be of no use. Can anyone suggest an alternative?
for (;;) {
    ACE_Time_Value doWork(0, 20000);
    ACE_OS::sleep(doWork); // causes low throughput; added to decrease CPU usage.
                           // Removing it restores throughput but gives high CPU.
    ..
    .
    ..
    if ((exitCode = fxDoWork()) < 0) {
        break;
    }
}

fxDoWork()
{
    ACE_Time_Value selectTime;
    selectTime.set(0, 100000);
    ..
    ..
    ..
    // POLLIN happens every millisecond, so the timeout is never actually used
    ACE_INT32 waitResult = ACE_OS::poll(myPollfds, eventCount, &selectTime);
    ..
    ..
    ..
}
It sounds like you want to wait until either enough data has accumulated or a specific timeout expires, in order to reduce CPU usage, right? If that's the case, you can use recvmmsg(): http://man7.org/linux/man-pages/man2/recvmmsg.2.html
The recvmmsg() system call is an extension of recvmsg(2) that allows
the caller to receive multiple messages from a socket using a single
system call. (This has performance benefits for some applications.)
A further extension over recvmsg(2) is support for a timeout on the
receive operation.
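A rough sketch of the batched receive, assuming a UDP socket set up elsewhere: it blocks until up to VLEN datagrams arrive or the timeout is reached (see the man page for caveats on how the timeout is checked), with error handling omitted.

#include <cstring>
#include <ctime>
#include <sys/socket.h>
#include <sys/uio.h>

enum { VLEN = 32, BUFSIZE = 1500 };

int receive_batch(int sockfd) {
    static char bufs[VLEN][BUFSIZE];
    struct mmsghdr msgs[VLEN];
    struct iovec iovecs[VLEN];
    std::memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < VLEN; i++) {
        iovecs[i].iov_base = bufs[i];          // one buffer per message
        iovecs[i].iov_len  = BUFSIZE;
        msgs[i].msg_hdr.msg_iov    = &iovecs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    struct timespec timeout = { 0, 100 * 1000 * 1000 };   // 100 ms
    // Returns how many messages were received (lengths are in msgs[i].msg_len), or -1.
    return recvmmsg(sockfd, msgs, VLEN, MSG_WAITFORONE, &timeout);
}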

Does a while loop always take full CPU usage?

I need to create a server-side game loop; the problem is how to limit the loop's CPU usage.
In my experience, a busy loop always takes as much CPU as it can get. But I am reading the code of SDL (Simple DirectMedia Layer), which has a function SDL_Delay(UINT32 ms) containing a while loop. Does that loop take maximal CPU usage, and if not, why?
https://github.com/eddieringle/SDL/blob/master/src/timer/unix/SDL_systimer.c#L137-158
do {
    errno = 0;
#if HAVE_NANOSLEEP
    tv.tv_sec = elapsed.tv_sec;
    tv.tv_nsec = elapsed.tv_nsec;
    was_error = nanosleep(&tv, &elapsed);
#else
    /* Calculate the time interval left (in case of interrupt) */
    now = SDL_GetTicks();
    elapsed = (now - then);
    then = now;
    if (elapsed >= ms) {
        break;
    }
    ms -= elapsed;
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    was_error = select(0, NULL, NULL, NULL, &tv);
#endif /* HAVE_NANOSLEEP */
} while (was_error && (errno == EINTR));
This code uses select for a timeout. select usually takes file descriptors and makes the caller wait until an IO event occurs on one of them. It also takes a timeout argument for the maximum time to wait. Here nfds is 0 and all the fd sets are NULL, so no descriptor is watched and the call returns only when the timeout is reached.
The select(3) that you get from the C library is a wrapper around the select(2) system call, which means calling select(3) eventually gets you into the kernel. The kernel then doesn't schedule the process unless an IO event occurs or the timeout is reached. So the process is not using the CPU while waiting.
Obviously, the jump into the kernel and process scheduling introduce delays. So if you must have very low latency (nanoseconds) you should use busy waiting.
That loop won't take up all the CPU. It uses one of two different functions to tell the operating system to pause the thread for a given amount of time and let another thread use the CPU:
// First function call - if HAVE_NANOSLEEP is defined.
was_error = nanosleep(&tv, &elapsed);
// Second function call - fallback without nanosleep.
was_error = select(0, NULL, NULL, NULL, &tv);
While the thread is blocked in SDL_Delay, it yields the CPU to other tasks. If the delay is long enough, the operating system will even put the CPU in an idle or halt mode if there is no other work to do. Note that this won't work well if the delay time isn't at least 20 milliseconds or so.
However, this is usually not the right way to do whatever it is you are trying to do. What is your outer problem? Why doesn't your game loop ever finish doing whatever needs to be done at this time and so then need to wait for something to happen so that it has more work to do? How can it always have an infinite amount of work to do immediately?

Limit iterations per time unit

Is there a way to limit iterations per time unit? For example, I have a loop like this:
for (int i = 0; i < 100000; i++)
{
// do stuff
}
I want to limit the loop above so there will be maximum of 30 iterations per second.
I would also like the iterations to be evenly spaced on the timeline, so not something like 30 iterations in the first 0.4 s followed by a 0.6 s wait.
Is that possible? It does not have to be completely precise (though the more precise it will be the better).
@FredOverflow My program is running very fast. It is sending data over wifi to another program which is not fast enough to handle them at the current rate. – Richard Knop
Then you should probably have the program you're sending data to send an acknowledgment when it has finished receiving the last chunk of data, and only then send the next chunk. Anything else will just cause you frustration down the line as circumstances change.
Suppose you have a good Now() function (GetTickCount() is a bad example: it's OS-specific and has poor precision):
for (int i = 0; i < 1000; i++) {
    DWORD have_to_sleep_until = GetTickCount() + EXPECTED_ITERATION_TIME_MS;
    // do stuff
    DWORD now = GetTickCount();
    if (now < have_to_sleep_until)
        Sleep(have_to_sleep_until - now);   // skip the sleep entirely if we are already late
}
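A portable sketch of the same idea using std::chrono, where steady_clock plays the role of the "good Now() function" and the fixed schedule keeps the iterations evenly spaced:

#include <chrono>
#include <thread>

void paced_loop() {
    using namespace std::chrono;
    const auto period = duration_cast<steady_clock::duration>(duration<double>(1.0 / 30));
    auto next = steady_clock::now() + period;
    for (int i = 0; i < 100000; i++) {
        // do stuff
        std::this_thread::sleep_until(next);   // returns immediately if we are already late
        next += period;                        // fixed schedule, so no drift accumulates
    }
}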
You can check the elapsed time inside the loop, but that may not be the usual solution, because computation time depends entirely on the performance of the machine and the algorithm, and people tune it during development (e.g. many game programmers require at least 25-30 frames per second for properly smooth animation).
The easiest way (for Windows) is to use QueryPerformanceCounter(). Some pseudo-code below.
QueryPerformanceFrequency(&freq)
timeWanted = 1000.0/30.0  // time per iteration in milliseconds, for 30 iterations / sec
for i
    QueryPerf(count1)
    do stuff
    QueryPerf(count2)
    timeElapsed = (double)(count2 - count1) * (double)(1e3) / double(freq)  // time in milliseconds
    timeDiff = timeWanted - timeElapsed
    if (timeDiff > 0)
        QueryPerf(count3)
        QueryPerf(count4)
        while ((double)(count4 - count3) * (double)(1e3) / double(freq) < timeDiff)
            QueryPerf(count4)
end for
EDIT: You must make sure that the 'do stuff' part takes less time than your frame period, or else it doesn't matter. Also, instead of 1e3 for milliseconds you can go all the way to nanoseconds with 1e9 (if you want that much accuracy).
WARNING... this will eat your CPU, but it gives you good 'software' timing. Do it in a separate thread (and only if you have more than one processor) so that any GUIs won't lock up. You can also put a conditional in there to stop the loop if this is a multi-threaded app.
@FredOverflow My program is running very fast. It is sending data over wifi to another program which is not fast enough to handle them at the current rate. – Richard Knop
What you might need is a buffer or queue at the receiver side. The thread that receives the messages from the client (e.g. through a socket) gets each message and puts it in the queue. The actual consumer of the messages reads/pops from the queue. Of course you need concurrency control for your queue.
Besides the flow control methods mentioned, you may also need to maintain an accurate, specific data sending rate on the sender side. Usually this can be done as follows.
E.g. if you want to send at 10 Mbps, create a timer with a 1 ms interval so it calls a predefined function every 1 ms. In the timer handler, keep track of two variables: (1) the time elapsed since the beginning of sending data and (2) how much data, in bytes, has been sent up to the last call. From these you can easily calculate how much data needs to be sent in the current call (or just do nothing and wait for the next call).
This way you can stream data very steadily with very little jitter, and this is the approach usually adopted for streaming video. Of course it also depends on how accurate the timer is.
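For illustration, a sketch of that bookkeeping, assuming some periodic 1 ms tick calls on_tick() and a hypothetical send_bytes() helper does the actual transmission.

#include <cstdint>

class RatePacer {
public:
    explicit RatePacer(double bits_per_sec)
        : bytes_per_ms_(bits_per_sec / 8.0 / 1000.0) {}

    // Called from the 1 ms timer handler.
    void on_tick() {
        ++ticks_;                                         // elapsed time in milliseconds
        double target = ticks_ * bytes_per_ms_;           // bytes that should be out by now
        std::int64_t to_send = static_cast<std::int64_t>(target) - sent_;
        if (to_send > 0) {
            // send_bytes(to_send);                       // hypothetical actual send
            sent_ += to_send;
        }
        // if to_send <= 0 we are ahead of schedule: do nothing this tick
    }

private:
    double        bytes_per_ms_;
    std::uint64_t ticks_ = 0;
    std::int64_t  sent_  = 0;
};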