Our project is written in C++. We run processor cores in poll mode to poll the driver (DPDK), so in poll mode the CPU utilization shows up as 100% in top/htop. When we started seeing occasional glitches of packet drops, we calculated the number of loops (polls) executed per second on a core (this varies with the processor speed and type).
Sample code used to calculate the polls per second, with and without the overhead of the driver poll function, is below.
#include <iostream>
#include <sys/time.h>

int main() {
    unsigned long long counter = 0;   // must be initialized before counting
    struct timeval tv1, tv2;
    gettimeofday(&tv1, NULL);

    while (1) {
        gettimeofday(&tv2, NULL);
        // Some function here to measure the overhead
        // Poll the driver
        if ((double)(tv2.tv_usec - tv1.tv_usec) / 1000000 +
            (double)(tv2.tv_sec - tv1.tv_sec) > 1.0) {
            std::cout << "Executions per second = " << counter << std::endl;
            counter = 0;
            gettimeofday(&tv1, NULL);
        }
        counter++;
    }
}
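For reference, the same measurement can be done with clock_gettime(CLOCK_MONOTONIC), which is not affected by NTP/wall-clock adjustments the way gettimeofday is. This is only a sketch of the counting logic; the driver poll (e.g. rte_eth_rx_burst in DPDK) is assumed to go where the comment is:

#include <iostream>
#include <time.h>

int main() {
    unsigned long long counter = 0;
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);

    while (true) {
        // Poll the driver here (e.g. rte_eth_rx_burst)
        clock_gettime(CLOCK_MONOTONIC, &now);
        double elapsed = (now.tv_sec - start.tv_sec) +
                         (now.tv_nsec - start.tv_nsec) / 1e9;
        if (elapsed > 1.0) {
            std::cout << "Executions per second = " << counter << std::endl;
            counter = 0;
            start = now;
        }
        counter++;
    }
}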
The poll count results vary; sometimes we see a glitch and the number drops to 50% or less of the regular count. We thought this could be a problem with Linux scheduling the task, so we
isolated the cores on the kernel command line (isolcpus=...), set CPU affinity, and raised the process/thread priority to the highest nice value and the scheduling type to realtime (RT).
But it made no difference.
So the questions are:
Can we rely on the number of loops/polls per sec executed on a processor core in poll mode?
Is there a way to calculate the CPU occupancy in poll mode, given that the core's CPU utilization shows up as 100% in top?
Is this the right approach for this problem?
Environment:
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
8 GB RAM
Ubuntu virtual machine on a VMware hypervisor.
I'm not sure if this was previously answered; any references will be helpful.
No, you cannot rely on "the number of loops/polls per sec executed on a processor core in poll mode".
This is a fundamental aspect of the execution environment in a traditional operating system, such as the one you are using: mainstream Linux.
At any time, a heavy-weight cron job can get kicked off that makes immediate demands on some resources, and the kernel's scheduler decides to preempt your application and do something else. That would be just one of hundreds of possible reasons why your process gets preempted.
Even if you're running as root, you won't be in full control of your process's resources.
The fact that you're seeing such a wide, occasional, disparity in your polling metrics should be a big, honking clue: multi-tasking operating systems don't work like this.
There are other "realtime" operating systems where userspace apps can have specific "service level" guarantees, i.e. minimum CPU or I/O resources available, which you can rely on for guaranteeing a floor on the number of times a particular code sequence can be executed, per second or some other metric.
On Linux, there are a few things that can be fiddled with, such as the process's nice level, and a few other things. But that still will not give you any absolute guarantees, whatsoever.
Especially since you're not even running on bare metal: you're running inside a virtual hypervisor. So your actual execution profile is affected not just by the guest operating system you run on, but by the host hypervisor as well!
The only way to guarantee the kind of metric you're looking for is to use a realtime operating system instead of Linux. Some years ago I heard about realtime extensions to the Linux kernel (Google food: "linux rtos"), but haven't heard much about that recently. I don't believe that mainstream Linux distributions include that kernel extension, so if you want to go that way, you'll be on your own.
Modern Linux on Intel CPUs does provide ways to make a poll loop fully occupy a CPU core at close to 100%. Things you may not have considered: remove any system calls that cause context switches, turn off hyper-threading (or leave the sibling hardware thread on the same physical core unused), disable dynamic CPU frequency boost in the BIOS, and move interrupt handling off the polling core.
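For reference, the affinity and real-time scheduling setup mentioned above (and in the question) looks roughly like this in code; a minimal sketch with error handling omitted, assuming the target core was already isolated with isolcpus:

#include <pthread.h>
#include <sched.h>

void pin_and_prioritize(pthread_t thread, int core_id) {
    // Bind the polling thread to a single (ideally isolcpus-isolated) core
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);

    // Give it the SCHED_FIFO real-time class so ordinary SCHED_OTHER tasks
    // cannot preempt it (priority range 1..99, higher wins)
    struct sched_param param;
    param.sched_priority = 80;
    pthread_setschedparam(thread, SCHED_FIFO, &param);
}

Even with this in place, the points above (system calls, hyper-threading siblings, frequency scaling, interrupt routing) still apply; the scheduling class alone does not remove those perturbations.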
Related
I recently did a latency comparison on these two setups:
a) Ubuntu 16.04 running on a 12-core host;
b) a QNX guest running in VMware on a laptop host (4 cores assigned to the QNX VM) - I do not currently have a better setup for QNX.
The test scenario:
10 threads are running; each thread sends a message to a randomly chosen receiving thread roughly every 30 ms, which is a very low message rate. The messaging mechanism is implemented with condition variables: each thread has its own dedicated producer-consumer rx queue with its own condition variable and mutex, so there is no interference between queues. I measure the time between when a message is constructed/sent and when the receiving thread gets it. The mean, std_dev, min, and max are captured for each thread.
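A simplified sketch of the per-thread mailbox used in the test (not the exact test code; timestamps here use std::chrono::steady_clock):

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>

struct Message {
    std::chrono::steady_clock::time_point sent;  // stamped by the sender
    int payload;
};

class Mailbox {  // one instance per receiving thread, so no cross-queue interference
public:
    void push(const Message& m) {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mQueue.push(m);
        }
        mCond.notify_one();
    }
    Message pop() {  // receiver blocks here; latency = now - msg.sent
        std::unique_lock<std::mutex> lock(mMutex);
        mCond.wait(lock, [this] { return !mQueue.empty(); });
        Message m = mQueue.front();
        mQueue.pop();
        return m;
    }
private:
    std::mutex mMutex;
    std::condition_variable mCond;
    std::queue<Message> mQueue;
};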
The result surprisingly favors QNX (even though it is running in a VM): roughly 10us vs 40us.
for a thread on Linux (seconds): mean=0.000038076 std_dev=2.7523e-05 min=0.000000254 max=0.000177410 sampleSize=1023
for a thread QNX (seconds): mean=0.000011351 std_dev=0.000105937 min=0.000000000 max=0.000999987 sampleSize=969
I noticed that on the QNX side the clock is not as precise (resolution-wise) as on the Linux side, since the minimum measurable latency shows up as 0.
I just wonder whether this matches other people's experience: does a Linux condition-variable wake-up take 40us on average?
Btw, if the QNX time precision is around 100us while Linux is in nanoseconds, does this difference affect the statistics?
Thanks.
The latency test turns out to be more sensitive (in a different way) to the hardware than I previously thought. The newer 12-core host has hyper-threading enabled and produces the 40us latency numbers. When I moved to an older 4-core machine with hyper-threading disabled, the latency dropped to 15-16us, which is in the same ballpark as the QNX VM number. I wish I could align the platforms better in future tests to get a more conclusive answer.
new Linux result: mean=0.000015819 std_dev=1.39205e-05 min=0.000000503 max=0.000117652 sampleSize=1528
I am working with Kinect2Grabber and want to do some real-time processing, but the performance I get is around 1 fps. Ultimately I want to estimate the trajectory of a ball thrown into the air, and I cannot do that with such slow processing :(
I am using Windows 10 64-bit, Visual Studio 2017, and the PCL 1.9.1 All-in-one Installer MSVC2017 x64 on an AMD Ryzen Threadripper 1900X 8-core processor. OpenMP is enabled in my project and so are optimizations. However, when I run my program, its CPU usage is around 12-13%. What am I doing wrong?
int main(int argc, char* argv[])
{
    boost::shared_ptr<visualization::PCLVisualizer> viewer(new visualization::PCLVisualizer("Point Cloud Viewer"));
    viewer->setCameraPosition(0.0, 0.0, -1.0, 0.0, 0.0, 0.0);

    PointCloud<PointType>::Ptr cloud(new PointCloud<PointType>);
    PointCloud<Normal>::Ptr cloud_normals(new PointCloud<Normal>);  // was missing from the snippet; needed by normals()/segmentation()

    // Retrieved Point Cloud Callback Function
    boost::mutex mutex;
    boost::function<void(const PointCloud<PointType>::ConstPtr&)> function =
        [&cloud, &cloud_normals, &mutex](const PointCloud<PointType>::ConstPtr& ptr) {
            boost::mutex::scoped_lock lock(mutex);

            // Point Cloud Processing
            cloud = ptr->makeShared();
            std::vector<int> indices;
            removeNaNFromPointCloud<PointType>(*cloud, *cloud, indices);
            pass_filter(0.5, 0.90, cloud);
            outlier_removal(50, 1.0, cloud);
            downsampling_vox_grid(0.005f, cloud);
            normals(0.04, cloud, cloud_normals);
            segmentation(cloud, cloud_normals);
        };

    boost::shared_ptr<Grabber> grabber = boost::make_shared<Kinect2Grabber>();
    boost::signals2::connection connection = grabber->registerCallback(function);
    grabber->start();

    while (!viewer->wasStopped()) {
        // Update Viewer
        viewer->spinOnce();

        boost::mutex::scoped_try_lock lock(mutex);
        if (lock.owns_lock() && cloud) {
            // Update Point Cloud
            if (!viewer->updatePointCloud(cloud, "chmura")) {
                viewer->addPointCloud(cloud, "chmura");
            }
        }
    }

    grabber->stop();

    // Disconnect Callback Function
    if (connection.connected()) {
        connection.disconnect();
    }
    return 0;
}
The omitted code for pass_filter, outlier_removal, etc. is taken directly from the tutorials, and it works, but it is very slow starting from outlier_removal (inclusive).
Your help will be greatly appreciated.
I do not have to use Kinect2Grabber; anything that can grab and process frames from the Kinect2 on Windows will do.
I see a few mitigations for your issues. 12-13% usage sounds about right for a Threadripper 1900X (100/16 is ~6.25% per hardware thread), as it implies full use of one physical core, i.e. two hardware threads (one for IO and one for computation? Just my guess).
To get better performance, you need to profile in order to understand what's causing the bottleneck. Perf is a great tool for that. There's an awesome video about how to profile code by Chandler. This is not the best place for a perf tutorial, but TL;DW:
compile using -fno-omit-frame-pointer flag or equivalent
perf record -g <executable>
perf report -g to know which functions take most CPU cycles
perf report -g 'graph,0.5,caller' to know which call-paths take most CPU cycles
Most likely, the issues identified will be:
Repeated creation of single-use objects such as pcl::VoxelGrid: instantiating and configuring them once is a better use of your CPU cycles (see the sketch after this list)
Locking and IO for the grabber
Fixing these will get you a slightly higher frame rate, but your CPU utilization will still be stuck around 12-13%, i.e. the single-core limit.
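For example, here is a sketch of hoisting the filter objects out of the per-frame callback so they are configured once and only the per-frame calls remain (parameter values copied from the question's calls; PointType is assumed to be the same typedef used there):

#include <pcl/filters/statistical_outlier_removal.h>
#include <pcl/filters/voxel_grid.h>
#include <pcl/point_cloud.h>

template <typename PointT>
class FrameFilters {
public:
    FrameFilters() : mTmp(new pcl::PointCloud<PointT>) {
        // Configured once, not on every frame
        mSor.setMeanK(50);
        mSor.setStddevMulThresh(1.0);
        mVox.setLeafSize(0.005f, 0.005f, 0.005f);
    }
    // Per-frame work only: set the input and run the reused filter objects
    void process(const typename pcl::PointCloud<PointT>::ConstPtr& in,
                 pcl::PointCloud<PointT>& out) {
        mSor.setInputCloud(in);
        mSor.filter(*mTmp);
        mVox.setInputCloud(mTmp);
        mVox.filter(out);
    }
private:
    pcl::StatisticalOutlierRemoval<PointT> mSor;
    pcl::VoxelGrid<PointT> mVox;
    typename pcl::PointCloud<PointT>::Ptr mTmp;
};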
Since you are using a Threadripper, you can use threads to put the other cores to work and decouple IO, computation, and visualization (a minimal queue sketch follows after this list):
One thread for the grabber which grabs frames and pushes them into a Queue
Another thread to consume frames from the Queue based on CPU availability. This saves the computed data into yet another Queue
Visualization thread which takes data from the output Queue
This allows you to tune the Queue sizes to drop frames to improve latency. This might not be required based on the design of your custom Kinect2Grabber. This will be evident after you profile your code.
This has the potential to dramatically reduce latency as well as improve frame rates, perhaps increasing CPU utilization to around 20% (because the grabber and visualization threads will be working at full throttle).
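A minimal sketch of the bounded queue that would sit between the grabber, processing, and visualization threads (names are illustrative; a real implementation also needs shutdown handling):

#include <condition_variable>
#include <deque>
#include <mutex>

// Bounded queue: when full, the oldest frame is dropped so latency stays bounded
template <typename Frame>
class FrameQueue {
public:
    explicit FrameQueue(std::size_t capacity) : mCapacity(capacity) {}

    void push(Frame f) {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            if (mFrames.size() == mCapacity)
                mFrames.pop_front();              // drop the oldest frame
            mFrames.push_back(std::move(f));
        }
        mCond.notify_one();
    }

    Frame pop() {                                 // blocks until a frame is available
        std::unique_lock<std::mutex> lock(mMutex);
        mCond.wait(lock, [this] { return !mFrames.empty(); });
        Frame f = std::move(mFrames.front());
        mFrames.pop_front();
        return f;
    }

private:
    std::size_t mCapacity;
    std::mutex mMutex;
    std::condition_variable mCond;
    std::deque<Frame> mFrames;
};

The grabber callback would only do rawFrames.push(cloud), the processing thread would loop on rawFrames.pop() and push results to a second queue, and the visualization loop would pop from that one.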
In order to fully utilize all the threads, the consumer thread can offload frames to other threads so the CPU can process multiple frames at once. For this you can adopt the following options (they aren't exclusive; you can use option 2 as the workhorse for option 1 for full benefit):
For executing multiple independent functions in parallel (e.g. your workhorse lambda), look at boost::thread_pool or boost::thread_group
For a pipeline-based model (where every stage runs in a different thread), you can use frameworks like TaskFlow
Option 1 is suitable when you're not going for a specific metric like latency. It also requires the least change to your code, but trades off between high latency (the design issue pointed out above) and keeping one copy of each object per thread to avoid setting it up every time.
Option 2 is suitable when the required latency is low and the frames need to stay ordered. However, to keep latency low in TaskFlow, you need to feed frames at the rate your CPU cores can process them; otherwise you can overload the CPU, run out of free RAM, cause page thrashing, and actually reduce performance.
Both of these options require you to ensure that the output arrives in the correct order. This can be done using promises, or a queue which drops out-of-order frames (a sketch follows below). By implementing some or all of these, you can keep your CPU utilization high, maybe even at 100%.
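As a sketch of the promise/future idea for keeping output ordered while frames are processed in parallel (Frame, Result and processFrame are placeholders; a real thread pool would bound the number of threads instead of std::async):

#include <future>
#include <queue>

struct Frame  { int id; };                          // placeholder for a point-cloud frame
struct Result { int id; };                          // placeholder for the processed output

Result processFrame(const Frame& f) { return Result{f.id}; }  // stand-in worker

int main() {
    std::queue<std::future<Result>> pending;

    // Submit frames in arrival order; each one runs as its own task
    for (int i = 0; i < 8; ++i)
        pending.push(std::async(std::launch::async,
                                [f = Frame{i}] { return processFrame(f); }));

    // Collect in submission order even if the workers finish out of order
    while (!pending.empty()) {
        Result r = pending.front().get();           // blocks on the oldest frame only
        pending.pop();
        // hand r to the visualization thread / output queue here
        (void)r;
    }
}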
In order to know what fits your situation the best, know what your target performance metrics are, profile the code intelligently and test without bias.
Without metrics, you can fall into a deep hole of improving performance even when it's not required (e.g. 5 fps might be sufficient; 60 fps might not be needed).
Without measuring, you can't know if you're getting the correct result or if you're indeed attacking the bottleneck.
Without unbiased testing, you can arrive at wrong conclusions.
I'm writing a timer for a complex communication application on Windows 10 with Qt5 and C++. I want to use at most 3 percent of the CPU while getting microsecond resolution.
Initially I used QTimer (Qt5) in this app. It was fine in terms of low CPU usage and a developer-friendly interface, but it was not as precise as I need: it only takes milliseconds as a parameter while I need microseconds, and its accuracy did not even match that resolution in many real-world situations such as heavy CPU load. Sometimes the timer fires after 1 millisecond, sometimes after 15 milliseconds.
I searched for a solution for days, but in the end I found that Windows is not a real-time operating system (RTOS) and does not provide a high-resolution, precise timer.
So I wrote my own high-resolution precise timer that polls the CPU. It is a singleton class running in a separate thread, and it works at 10 microsecond resolution.
But it consumes one full logical core, equivalent to 6.25 percent on a Ryzen 2700.
For my application this CPU usage is unacceptable. How can I reduce it without giving up the high resolution?
This is the code that does the job:
void CsPreciseTimerThread::run()
{
    while (true)
    {
        QMutexLocker locker(&mMutex);
        for (int i = 0; i < mTimerList.size(); i++)
        {
            CsPreciseTimerMiddleLayer* timer = mTimerList[i];
            int interval = timer->getInterval();
            if (timer->isActive() && timer->remainingTime() < 0)
            {
                timer->emitTimeout();
                timer->resetTime();
            }
        }
    }
}
I tried to lower the priority of the timer thread. I used these lines:
QThread::start(QThread::Priority::LowestPriority);
And this:
QThread::start(QThread::Priority::IdlePriority);
Those changes made the timer less precise, but the CPU usage didn't decrease.
After that I tried forcing the current thread to sleep for a few microseconds in the loop:
QThread::usleep(15);
As you might guess, the sleep call screwed up the accuracy: sometimes the thread sleeps for much longer than expected, like 10 ms or 15 ms.
I'm going to reference Windows APIs directly instead of the Qt abstractions.
I don't think you want to lower your thread priority; I think you want to raise it, and use the smallest amount of Sleep between polls that balances latency against CPU overhead.
Two ideas:
In Windows Vista, Microsoft introduced the Multimedia Class Scheduler Service (MMCSS) specifically so that they could move the Windows audio components out of kernel mode and run them in user mode without impacting pro-audio tools. That's probably going to be helpful to you: it's not precisely a "real time" guarantee, but it's meant for low-latency operations.
Going the classic way: raise your process and thread priority to high or critical while using a reasonable sleep of a few milliseconds. That is, raise your thread priority to THREAD_PRIORITY_TIME_CRITICAL, then do a very small Sleep after completing the for loop. The sleep amount should be between 0 and 10 milliseconds; some experimentation is required, but I would sleep no more than half the time to the next expected timeout, with a maximum of 10 ms. And when you are within N microseconds of your timer, you might need to just spin instead of yielding. You can also experiment with raising your process priority to REALTIME_PRIORITY_CLASS.
Be careful: a handful of runaway processes or threads running at these higher priority levels without sleeping can lock up the system.
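A rough sketch of the sleep-then-spin idea from the second point (Win32 APIs; the thresholds are starting points to tune, not recommendations):

#include <windows.h>

// Wait until 'deadline' (in QueryPerformanceCounter ticks): coarse Sleep() while
// the deadline is far away, then spin for the last stretch for microsecond accuracy.
void waitUntil(LONGLONG deadline, LONGLONG ticksPerMs)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    while (deadline - now.QuadPart > 2 * ticksPerMs) {  // far away: yield the CPU
        Sleep(1);        // ~1 ms granularity at best; timeBeginPeriod(1) from winmm
                         // can tighten this somewhat
        QueryPerformanceCounter(&now);
    }
    while (now.QuadPart < deadline) {                   // last couple of ms: busy spin
        YieldProcessor();                               // pause hint, stays on this core
        QueryPerformanceCounter(&now);
    }
}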
I am considering a programming project. It will run under Ubuntu or another Linux OS on a small board with a quad-core x86 N-series Pentium. The software generates 8 fast signals: square-wave pulse trains for stepper-motor motion control of 4 axes. The step signals are 50-100 kHz maximum, but usually slower. I want to avoid jitter in these stepping signals (call it good fidelity), so around 1-2us per thread loop cycle would be a nice target. The program also does other kinds of tasks: hard-drive reads/writes, Ethernet, continuous updates of the graphics display, keyboard input. The existing single-core programs simply cannot process motion signals with this kind of timing and require external hardware/techniques to achieve it.
I have been reading other posts here, for example about running a thread continuously on a selected core. The exact meaning in those posts is loose, and I am not really sure what is meant; "continuously" might mean testing every minute, or something else entirely.
So I might be wordy, but I hope it will be clear. The program under consideration contains all of its threads, routines, memory, and shared memory; I am not considering it launching another program or service. The other threads are written in this program and launched when the program starts up. I will call the signal-generating thread the FAST THREAD.
The FAST THREAD is to be launched onto an otherwise "free" core. It needs to be the only thread that runs on that core. Hopefully the OS scheduler component on that core can be "turned off", so that it does not even interrupt on that core to decide which thread runs next. Looking at the processor manual, each core has a counter/timer chip (CTC). Is it possible for me to use it to deliver a continuous train of interrupts into my "locked-in" FAST THREAD for timing purposes, in the range of about 1-2us? If not, then simply reading one channel of that CTC could provide software synchronization. This fast thread would therefore see (experience) no delays from the interrupts issued on the other cores and the associated multicore fabric. The FAST THREAD, once running, will continue to run until the program closes; this could be hours.
Input data to drive this FAST THREAD will be common shared memory defined in the program. There are also hardware signals for motion limits (from GPIOs or an SDI port). If any of them go TRUE, that forces a programmed halt of all motion; it does not need a 1-2us response and could be handled by a slower motion loop.
Ah, the output:
Some motion data is written back to the shared memory (assigned for this purpose), like the current location and the current loop number.
Some signals need to be output (the 8 outputs). There are numerous free GPIOs. I am not sure of the path taken to get a signaled GPIO pin to change its output; a system call to Linux initiates the pin-change event. There is also an SDI port available, running at up to a 25 MHz clock. It seems these ports (GPIO, UART, USB, SDI) exist in fabric that is not on any specific core. I am not sure of the latency from the issuance of these signals in the program until the associated external pin actually presents the signal. In the fast thread, even 10us would be OK if it were always the same latency! I know that will not be so; there will be jitter. I need to think about this spec.
There will possibly be a second dedicated core (similar to the above) for slower motion planning. That leaves two cores for everything else. Since everything else (SATA, video screen, keyboard ...) already manages to work on a single core, the remaining two cores should be plenty.
At program close, the FAST THREAD returns the CTC and any other devices on its core back to "as they were" and re-enables the OS components on that core to their normal operation. End of thread.
Concluding: I have described the overall program so that you understand what I want to do with this FAST THREAD running, how responsive it needs to be, and that it needs to be undisturbed!! This processor runs in the 1.5-2.0 GHz range; it can certainly do the repeated calculations in the required time frame.
DESIRED: I do not know the system calls that would allow me to use a selected x86 core in this way. Any pointers would be helpful, as would any manual or document that describes these calls/procedures.
Can this use of a core also be done in Windows (7, 10)?
Thanks for reading and any pointers you have.
Stan
I have a process that does something that needs to be repeated with a period of 1 ms. How can I set the period of a process on Linux?
I am using Linux 3.2.0-4-rt-amd64 (with the RT-Preempt patch) on an Intel i7-2600 CPU (8 logical cores) @ 3.40 GHz.
Basically I have about 6 threads in the while loop shown below, and I want the threads to execute every 1 ms. At the end I want to measure the latency of each thread.
So how do I set the period to 1 ms?
For example, in the following code, how can I repeat Task1 every 1 ms?
while (1) {
    // Task1 (having threads)
}
Thank you.
A call to usleep(1000) inside the while loop will do the job, i.e.:
while (1) {
    // Task1
    usleep(1000); // 1000 microseconds = 1 millisecond
}
EDIT
Since usleep() is already deprecated in favor of nanosleep(), let's use the latter instead:
struct timespec timer;
timer.tv_sec = 0;
timer.tv_nsec = 1000000L; // 1 ms

while (1) {
    // Task1
    nanosleep(&timer, NULL);
}
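One caveat with a relative sleep is that the period becomes 1 ms plus however long Task1 takes, so the timing drifts. A sketch of a drift-free variant using clock_nanosleep with an absolute deadline (Linux; link with -lrt on older glibc):

#include <time.h>

int main(void) {
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);
    while (1) {
        // Task1
        next.tv_nsec += 1000000L;            // advance the deadline by exactly 1 ms
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}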
Read time(7).
One millisecond is really a small period of time. (Can't you live with, e.g., a ten-millisecond delay?) I'm not sure regular processes on regular Linux on common laptop hardware are able to deal reliably with such a small period. Maybe you need RTLinux, or at least real-time scheduling (see sched_setscheduler(2) and this question), and perhaps a specially configured recent 3.x kernel.
You can't be sure that your processing (inside your loop) is smaller than a millisecond.
You should explain what your application is doing and what happens inside the loop.
You might have some event loop, consider using ppoll(2), timer_create(2) (see also timer_getoverrun(2)...) and/or timerfd_create(2) and clock_nanosleep(2)
(I would try something using ppoll and timerfd_create, but I would accept that some millisecond ticks might be skipped.)
You should tell us more about your hardware and your kernel. I'm not even sure my desktop machine (i7 3770K processor, Asus P8Z77-V motherboard, 3.13.3 PREEMPT Linux kernel) can reliably deal with a single-millisecond delay.
(Of course, a plain loop simply calling clock_nanosleep, or better yet, using timerfd_create with ppoll, will usually do the job. But that is not reliable...)
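A sketch of the timerfd approach (Linux-specific; overruns show up in the expiration count returned by read, so skipped ticks are at least visible):

#include <stdint.h>
#include <stdio.h>
#include <sys/timerfd.h>
#include <unistd.h>

int main(void) {
    int fd = timerfd_create(CLOCK_MONOTONIC, 0);

    struct itimerspec spec = {0};
    spec.it_value.tv_nsec    = 1000000L;     // first expiration after 1 ms
    spec.it_interval.tv_nsec = 1000000L;     // then every 1 ms
    timerfd_settime(fd, 0, &spec, NULL);

    while (1) {
        uint64_t expirations = 0;
        read(fd, &expirations, sizeof expirations);   // blocks until the next tick
        if (expirations > 1)
            fprintf(stderr, "missed %llu ticks\n",
                    (unsigned long long)(expirations - 1));
        // Task1
    }
}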