High CPU usage while using poll system call to wait on fds - c++

I have this unique problem where the poll system call of Linux used in my code sees the fds it waits on become readable (POLLIN) every millisecond. This is causing high CPU usage. I have supplied a timeout of 100 milliseconds and it seems to be of no use. Can anyone suggest an alternative?
for (;;) {
    ACE_Time_Value doWork(0, 20000);
    // Sleeping here lowers CPU usage but hurts throughput; removing the sleep
    // restores throughput but drives CPU usage up.
    ACE_OS::sleep(doWork);
    ...
    if ((exitCode = fxDoWork()) < 0) {
        break;
    }
}

fxDoWork()
{
    ACE_Time_Value selectTime;
    selectTime.set(0, 100000);   // 100 ms timeout
    ...
    // POLLIN is reported every millisecond, so the timeout never takes effect.
    ACE_INT32 waitResult = ACE_OS::poll(myPollfds, eventCount, &selectTime);
    ...
}
===============================================================

It sounds like you want to wait until either enough data has accumulated or a specific timeout expires, in order to reduce CPU usage, right? If that's the case, you can use recvmmsg(): http://man7.org/linux/man-pages/man2/recvmmsg.2.html
The recvmmsg() system call is an extension of recvmsg(2) that allows
the caller to receive multiple messages from a socket using a single
system call. (This has performance benefits for some applications.)
A further extension over recvmsg(2) is support for a timeout on the
receive operation.
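A rough sketch of the recvmmsg() approach, assuming a UDP socket descriptor sock created elsewhere; the batch size of 32 and the 100 ms figure are only illustrative:

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <cstring>
#include <ctime>

int drain_socket(int sock)
{
    constexpr unsigned kBatch = 32;
    char bufs[kBatch][2048];
    struct iovec   iovs[kBatch];
    struct mmsghdr msgs[kBatch];
    std::memset(msgs, 0, sizeof(msgs));

    for (unsigned i = 0; i < kBatch; ++i) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len  = sizeof(bufs[i]);
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    struct timespec timeout = {0, 100 * 1000 * 1000};   // 100 ms
    // recvmmsg() returns the number of datagrams received, or -1 on error.
    // MSG_WAITFORONE makes the call return once at least one datagram has
    // arrived instead of waiting for the whole batch; note the man page's
    // caveat that the timeout is only checked after each received datagram.
    return recvmmsg(sock, msgs, kBatch, MSG_WAITFORONE, &timeout);
}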

Related

Monitor task CPU utilization in VxWorks while program is running

I'm running a VxWorks 6.9 OS embedded system and I need to see when I'm starving low priority tasks. Ideally I'd like to have CPU utilization by task so I know what is eating up all my CPU time.
I know this is a built-in feature in many operating systems, but I have so far been unable to find it for VxWorks 6.9.
If I can't measure by task, I'd like at least to see what percentage of time the CPU is idle.
To that end I've been trying to make a lowest priority task that will run the function below that would try to measure it indirectly.
float Foo::IdleTime(Foo* f)
{
    bool inIdleTask;
    float timeIdle;
    float totalTime;
    float percentIdle;
    float startTime, returnTime;

    while(true)
    {
        startTime = _time();   // get time before measurement starts
        inIdleTask = true;
        timeIdle = 0;
        while(inIdleTask)      // I have no clue how to detect when the task left and set this to false
        {
            timeIdle += (amount_of_time_for_inner_loop);   // measure idle time
        }
        returnTime = _time();  // get time after you return to the IdleTime task
        totalTime = ( returnTime - startTime );
        percentIdle = ( timeIdle / totalTime ) * 100;      // calculate percentage of idle time
        // logic to report percentIdle
    }
}
The big problem with this concept is I don't know how I would detect when this task is left for a higher priority task.
If you are looking for a one-time measurement done during development, then spyLib is what you are looking for. Simply call spy from the command line to get a per-task CPU usage report at 10 s intervals. Call spyHelp to learn how to configure the spy. (You might need to include spyLib in the kernel if it is not already there.)
If you want to go the extra mile, taskHookLib is what you need. Simply put, you hook a function to be called on every task switch. The call gives you the IDs of the tasks going in and out of the CPU. You can either simply monitor the starvation of low-priority tasks, or take action and temporarily increase their priority.
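A rough sketch of the taskHookLib idea, assuming the classic two-TCB switch-hook signature documented for taskSwitchHookAdd(); monitorInit() and the watched-task bookkeeping are made up for illustration:

#include <vxWorks.h>
#include <taskLib.h>
#include <taskHookLib.h>
#include <tickLib.h>

static WIND_TCB *watchedTcb;     /* TCB of the low-priority task we watch */
static ULONG     lastSeenTick;   /* last tick at which it was switched in */

/* Called by the kernel on every context switch; keep it short and non-blocking. */
static void switchHook(WIND_TCB *pOldTcb, WIND_TCB *pNewTcb)
{
    if (pNewTcb == watchedTcb)
        lastSeenTick = tickGet();
}

/* Install the hook for one task; compare tickGet() - lastSeenTick elsewhere
 * (e.g. in a watchdog task) to decide whether the watched task is starving. */
STATUS monitorInit(int watchedTaskId)
{
    watchedTcb   = taskTcb(watchedTaskId);
    lastSeenTick = tickGet();
    return taskSwitchHookAdd((FUNCPTR)switchHook);
}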
From experience, spy adds a little performance overhead, especially if stdout goes to a slow I/O device (e.g. a 9600 baud serial line), but it is fairly easy to use. taskHook'ing adds little to no overhead if you are not immediately printing the results to the terminal, but it takes a bit of programming to get running.
Another thing that might be of interest is Wind River's remote debugger. I haven't used that one personally; I imagine it would require setting up the Workbench and the target properly.

ZeroMQ: how to reduce multithread-communication latency with inproc?

I'm using inproc and PAIR to achieve inter-thread communication and trying to solve a latency problem due to polling. Correct me if I'm wrong: Polling is inevitable, because a plain recv() call will usually block and cannot take a specific timeout.
In my current case, among N threads, each of the N-1 worker threads has a main while-loop. The N-th thread is a controller thread which will notify all the worker threads to quit at any time. However, the worker threads have to use polling with a timeout to get that quit message. This introduces latency; the timeout parameter is usually 1000 ms.
Here is an example
while (true) {
    const std::chrono::milliseconds nTimeoutMs(1000);
    std::vector<zmq::poller_event<std::size_t>> events(n);
    size_t nEvents = m_poller.wait_all(events, nTimeoutMs);

    bool isToQuit = false;
    for (auto& evt : events) {
        zmq::message_t out_recved;
        try {
            evt.socket.recv(out_recved, zmq::recv_flags::dontwait);
        }
        catch (std::exception& e) {
            trace("{}: Caught exception while polling: {}. Skipped.", GetLogTitle(), e.what());
            continue;
        }
        if (!out_recved.empty()) {
            if (IsToQuit(out_recved))
                isToQuit = true;
            break;
        }
    }
    if (isToQuit)
        break;

    //
    // main business
    //
    ...
}
To make things worse, when the main loop has nested loops, the worker threads then need to include more polling code in each layer of the nested loops. Very ugly.
The reason I chose ZMQ for multithread communication is its elegance and the potential to get rid of thread locking. But I never anticipated the polling overhead.
Am I able to achieve the typical latency when using a regular mutex or an std::atomic data operation? Should I understand that the inproc is in fact a network communication pattern in disguise so that some latency is inevitable?
A statement posted above (a hypothesis):
"...a plain recv() call will usually block and cannot take a specific timeout."
is not correct:
- a plain .recv( ZMQ_NOBLOCK ) call will never "block",
- a plain .recv( ZMQ_NOBLOCK ) call can be decorated so as to mimic "a specific timeout".
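For illustration, such a decoration might look like the sketch below; it assumes the cppzmq API already used in the question, and both the name recv_within() and the 100 µs budget are arbitrary:

#include <chrono>
#include <thread>
#include <zmq.hpp>

// Try to receive for at most 'budget' without ever parking the thread in a long poll.
bool recv_within(zmq::socket_t& s, zmq::message_t& msg,
                 std::chrono::microseconds budget = std::chrono::microseconds(100))
{
    const auto deadline = std::chrono::steady_clock::now() + budget;
    do {
        if (s.recv(msg, zmq::recv_flags::dontwait))   // returns immediately, message or not
            return true;                              // got one
        std::this_thread::yield();                    // give other threads a chance
    } while (std::chrono::steady_clock::now() < deadline);
    return false;                                     // "timeout" mimicked, no message
}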
A statement posted above (a hypothesis):
"...have to use polling with a timeout ... introduces a latency, the latency parameter is usually 1000ms."
is not correct:
- one need not use polling with a timeout,
- still less does one need the 1000 ms of code-"injected" latency, which is obviously spent only in the no-new-message state.
Q : "Am I able to achieve the typical latency when using a regular mutex or an std::atomic data operation?"
Yes.
Q : "Should I understand that the inproc is in fact a network communication pattern in disguise so that some latency is inevitable?"
No. The inproc transport class is the fastest of them all, as it is essentially protocol-less / stack-less and amounts to little more than very fast pointer mechanics, as in a dual-ended ring-buffer pointer management.
The Best Next Step:
1) Re-factor your code so that it only ever uses the zero-wait { .poll() | .recv() } methods, properly decorated for both the { event | no-event } branches of the loop.
2) If you are then willing to shave the last few [us] off the smart-loop-detection turn-around time, you may focus on an improved Context() instance, setting it to work with a larger number of nIOthreads > N "under the hood".
optionally 3) For almost hard-real-time system designs, one may finally harness a deterministically driven mapping of the Context() threads and socket-specific execution vehicles onto specific, non-overlapping CPU cores (using a carefully crafted affinity map).
Having set 1000 [ms] in the code, no one can fairly complain about spending those very 1000 [ms] waiting in a timeout that she or he coded herself / himself. There is no excuse for doing this.
Do not blame ZeroMQ for behaviour that was coded on the application side of the API.
Never.
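To illustrate the "Yes" above (the std::atomic route), a minimal sketch in which the controller flips an atomic flag and the workers test it once per loop iteration, with no poll timeout and hence no timeout-induced latency; DoBusinessChunk() is a made-up placeholder for the "main business":

#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

static std::atomic<bool> g_quit{false};

static void DoBusinessChunk() { /* ... main business ... */ }

static void Worker()
{
    while (!g_quit.load(std::memory_order_relaxed)) {
        DoBusinessChunk();   // do a bounded chunk of work, then re-check the flag
    }
}

int main()
{
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back(Worker);

    std::this_thread::sleep_for(std::chrono::seconds(1));
    g_quit.store(true, std::memory_order_relaxed);   // controller says "quit"

    for (auto& w : workers)
        w.join();
}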

Posix Timer on SCHED_RR Thread is using 100% CPU

I have the following code snippet:
#include <iostream>
#include <thread>
#include <cstdint>
#include <pthread.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>

void func15ns();   // dummy blocking function, takes ~15 ns

int main() {
    std::thread rr_thread([](){
        struct sched_param params = {5};
        pthread_setschedparam(pthread_self(), SCHED_RR, &params);

        struct itimerspec ts;
        struct epoll_event ev;
        int tfd, epfd;
        uint64_t missed;

        ts.it_interval.tv_sec = 0;
        ts.it_interval.tv_nsec = 0;
        ts.it_value.tv_sec = 0;
        ts.it_value.tv_nsec = 20000; // 50 kHz timer

        tfd = timerfd_create(CLOCK_MONOTONIC, 0);
        timerfd_settime(tfd, 0, &ts, NULL);

        epfd = epoll_create(1);
        ev.events = EPOLLIN;
        epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);

        while (true) {
            epoll_wait(epfd, &ev, 1, -1); // wait forever for the timer
            read(tfd, &missed, sizeof(missed));
            // Here I have a blocking function (dummy in this example) which
            // takes on average 15 ns to execute, less than the timer period anyway.
            func15ns();
        }
    });
    rr_thread.join();
}
I have a POSIX thread using the SCHED_RR policy, and on this thread there is a POSIX timer running with a timeout of 20000 ns = 50 kHz = 50000 ticks/sec.
After the timer fires I execute a function that takes roughly 15 ns, so less than the timer period, but this doesn't really matter.
When I run this I get 100% CPU usage and the whole system becomes slow, but I don't understand why this is happening, and some things are confusing:
Why 100% CPU usage, since the thread is supposed to be sleeping while waiting for the timer to fire, so other tasks can be scheduled in theory, right? Even if this is a high-priority thread.
I checked the number of context switches using pidstat and it is very small, close to 0, both voluntary and involuntary. Is this normal? While waiting for the timer to fire the scheduler should schedule other tasks, right? I should see at least 20000 * 2 context switches / sec.
As presented, your program does not behave as you describe. This is because you program the timer as a one-shot, not a repeating timer. For a timer that fires every 20000 ns, you want to set a 20000-ns interval:
ts.it_interval.tv_nsec = 20000;
Having modified that, I get a program that produces a heavy load on one core.
Why 100% CPU usage, since the thread is supposed to be sleeping while waiting for the timer to fire, so other tasks can be scheduled in theory, right? Even if this is a high-priority thread.
Sure, your thread blocks in epoll_wait() to await timer ticks, if in fact it manages to loop back there before the timer ticks again. On my machine, your program consumes only about 30% of one core, which seems to confirm that such blocking will indeed happen. That you see 100% CPU use suggests that my computer runs the program more efficiently than yours does, for whatever reason.
But you have to appreciate that the load is very heavy. You are asking to perform all the processing of the timer itself, the epoll call, the read, and func15ns() once every 20000 ns. Yes, whatever time may be left, if any, is available to be scheduled for another task, but the task swap takes a bit more time again. 20000 ns is not very much time. Consider that just fetching a word from main memory costs about 100 ns (though reading one from cache is of course faster).
In particular, do not neglect the work other than func15ns(). If the latter indeed takes only 15 ns to run then it's the least of your worries. You're performing two system calls, and these are expensive. Just how expensive depends on a lot of factors, but consider that removing the epoll_wait() call reduces the load for me from 30% to 25% of a core (and note that the whole epoll setup is superfluous here because simply allowing the read() to block serves the purpose).
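For illustration, the simplified loop that point suggests might look like this (a sketch only; func15ns() stands in for the question's dummy workload, and error handling is omitted):

#include <cstdint>
#include <unistd.h>
#include <sys/timerfd.h>

static void func15ns() { /* ~15 ns of dummy work */ }

void timer_loop()
{
    struct itimerspec ts{};
    ts.it_interval.tv_nsec = 20000;   // repeat every 20 us (50 kHz)
    ts.it_value.tv_nsec    = 20000;   // first expiration

    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    timerfd_settime(tfd, 0, &ts, nullptr);

    uint64_t missed = 0;
    while (true) {
        // Blocks until the next tick; 'missed' counts expirations since the last read.
        read(tfd, &missed, sizeof(missed));
        func15ns();
    }
}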
I checked the number of context switches using pidstat and it is very small, close to 0, both voluntary and involuntary. Is this normal? While waiting for the timer to fire the scheduler should schedule other tasks, right? I should see at least 20000 * 2 context switches / sec.
You're occupying a full CPU with a high priority task, so why do you expect switching?
On the other hand, I'm also observing a low number of context switches for the process running your (modified) program, even though it's occupying only 25% of a core. I'm not prepared at the moment to reason about why that is.

Make select based loop as responsive as possible

This thread will be very responsive to network activity but can be guaranteed to process the message queue only as often as 100 times a second. I can keep reducing the timeout but after a certain point I will be busy-waiting and chewing up CPU. Is it true that this solution is about as good as I'll get without switching to another method?
// semi pseudocode
while (1) {
    process_thread_message_queue(); // function returns near-instantly

    struct timeval t;
    t.tv_sec = 0;
    t.tv_usec = 10 * 1000; // 10 ms = 0.01 s

    if (select(n, &fdset, 0, 0, &t)) // see if there are incoming packets for the next 1/100 sec
    {
        ... // respond with more packets or processing
    }
}
It depends on what your OS provides for you. On Windows you can wait for a thread message and a bunch of handles simultaneously using MsgWaitForMultipleObjectsEx. This solves your problem. Other OSes should have something similar.
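For example, a rough sketch of that Windows approach; hSocketEvent is assumed to have been associated with the socket beforehand (e.g. via WSAEventSelect, not shown):

#include <windows.h>

void message_and_socket_loop(HANDLE hSocketEvent)
{
    for (;;) {
        DWORD r = MsgWaitForMultipleObjectsEx(
            1, &hSocketEvent, INFINITE, QS_ALLINPUT, 0);

        if (r == WAIT_OBJECT_0) {
            // network event: service the socket(s)
        } else if (r == WAIT_OBJECT_0 + 1) {
            // thread / window messages are pending: drain them
            MSG msg;
            while (PeekMessage(&msg, nullptr, 0, 0, PM_REMOVE)) {
                TranslateMessage(&msg);
                DispatchMessage(&msg);
            }
        }
    }
}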

What is the cleanest way to create a timeout for a while loop?

Windows API/C/C++
1. ....
2. ....
3. ....
4. while (flag1 != flag2)
5. {
6. SleepEx(100,FALSE);
//waiting for flags to be equal (flags are set from another thread).
7. }
8. .....
9. .....
If the flags don't equal each other after 7 seconds, I would like to continue to line 8.
Any help is appreciated. Thanks.
If you are waiting for a particular flag to be set or a time to be reached, a much cleaner solution may be to use an auto / manual reset event. These are designed for signalling conditions between threads and have very rich APIs designed on top of them. For instance you could use the WaitForMultipleObjects API which takes an explicit timeout value.
Do not poll for the flags to change. Even with a sleep or yield during the loop, this just wastes CPU cycles.
Instead, get the thread which sets the flags to signal you that they've been changed, probably using an event. Your wait on the event takes a timeout, which you can tweak to allow waiting of 7 seconds total.
For example:
Thread1:
flag1 = foo;
SetEvent(hEvent);
Thread2:
DWORD timeOutTotal = 7000; // 7 second timeout to start.
while (flag1 != flag2 && timeOutTotal > 0)
{
    // Wait for flags to change
    DWORD start = GetTickCount();
    WaitForSingleObject(hEvent, timeOutTotal);
    DWORD end = GetTickCount();

    // Don't let timeOutTotal accidentally dip below 0.
    if ((end - start) > timeOutTotal)
    {
        timeOutTotal = 0;
    }
    else
    {
        timeOutTotal -= (end - start);
    }
}
You can use QueryPerformanceCounter from the WinAPI. Read it before the while loop starts, then check inside the loop whether the allotted time has passed. However, this is a high-resolution timer. For lower resolution, use GetTickCount (milliseconds).
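For illustration, a rough sketch of that approach with the question's 7-second budget; the flags' LONG type is an assumption:

#include <windows.h>

extern volatile LONG flag1, flag2;   // the question's flags (type assumed)

void WaitLoopQpc()
{
    LARGE_INTEGER freq, start, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);

    while (flag1 != flag2)
    {
        QueryPerformanceCounter(&now);
        if ((now.QuadPart - start.QuadPart) / freq.QuadPart >= 7)   // 7 seconds elapsed
            break;
        SleepEx(100, FALSE);   // same sleep as in the question, so we do not spin
    }
}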
It all depends on whether you are actively waiting (doing something) or passively waiting for an external event. If the latter, the following code using Sleep will be a lot easier:
int count = 0;
while ( flag1 != flag2 && count < 700 )
{
    Sleep( 10 ); // wait 10ms
    ++count;
}
If you don't use Sleep (or Yield) and your app is constantly checking the condition, then you'll hog the CPU the app is running on.
If you use the WinAPI extensively, you should try a more native solution; read about the WinAPI's Synchronization Functions.
You failed to mention what will happen if the flags are equal.
Also, if you just test them with no memory barriers then you cannot guarantee to see any writes made by the other thread.
Your best bet is to use an Event, and use the WaitForSingleObject function with a 7000 millisecond time out.
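A minimal sketch of that suggestion; hEvent and the flags' LONG type are assumptions, and the writer thread is expected to call SetEvent(hEvent) after updating its flag:

#include <windows.h>

extern HANDLE hEvent;                  // created with CreateEvent elsewhere
extern volatile LONG flag1, flag2;     // the question's flags (type assumed)

bool WaitForFlagsToMatch()
{
    if (WaitForSingleObject(hEvent, 7000) != WAIT_OBJECT_0)   // 7 second timeout
        return false;                                         // timed out
    // Interlocked reads give the memory barrier the point above warns about.
    return InterlockedCompareExchange(&flag1, 0, 0) ==
           InterlockedCompareExchange(&flag2, 0, 0);
}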
Make sure you do a sleep() or yield() in there, or you will eat up the entire CPU (or core) waiting.
If your application does some networking stuff, have a look at the POSIX select() call, especially the timeout functionality!
I would say "check the time, and if nothing has happened after seven seconds, break the loop."