My program is consuming far more CPU time than I'd like (two displays push it up to 80-90%). I'm using QTimers, and some of them are as short as 2 ms. At any given time, I can have 12+ timers running per display -- 2 ms, 2 ms, 2 ms, 250 ms, and the rest ranging between 200 ms and 500 ms. Would it be better if I used threads for some or all of these (especially the short ones)? Would it make much of a difference?
The main cost is going to come from the high-frequency timers. First, make sure you really need these every 2 ms. Second, to cut some of the per-timer overhead in the QTimer class, you could group your three 2 ms timeouts into one timer and, every time it fires, execute the three sections of code sequentially, as sketched below. I don't think threading will solve the issue, though.
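A minimal sketch of that grouping (Qt 5 style connect); taskA(), taskB(), and taskC() are placeholders for whatever your three existing 2 ms handlers do:

#include <QTimer>

void taskA();
void taskB();
void taskC();

void setupFastTimer(QObject *parent)
{
    auto *fastTimer = new QTimer(parent);
    fastTimer->setTimerType(Qt::PreciseTimer);   // coarse timers allow ~5% slack, a lot at 2 ms
    fastTimer->setInterval(2);

    QObject::connect(fastTimer, &QTimer::timeout, [] {
        // One timer firing instead of three; run the handlers back to back.
        taskA();
        taskB();
        taskC();
    });

    fastTimer->start();
}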
The 2 ms seems suspect to me. People have been reading from and writing to serial ports at 19200 baud for years (for example on 486 hardware) without overloading the CPU. Maybe your approach is wrong.
What API are you using to access the port? It sounds like you are polling the ports; if the API supports blocking reads and writes, that would be a much better approach.
The simplest way would then be to put the reads and writes in their own threads and use blocking reads in a loop; the read thread will only be busy when there is data to read and you are processing it. Your application should know when it needs to write, so the write thread should wait on a condition variable; when data is available to send, the condition is signalled, waking up the write thread.
There are probably easier single-threaded approaches to this, since I am sure the first applications to read and write on serial ports (e.g. XMODEM) were not multi-threaded, but I do not know them offhand; they should be documented in the API you are using.
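For what it's worth, a sketch of the blocking-read thread idea on POSIX. The device path and baud rate are placeholders, handlePacket() is commented out, and real code would also configure raw mode and check every return value:

#include <fcntl.h>
#include <termios.h>
#include <unistd.h>
#include <thread>
#include <vector>

void readerLoop(int fd)
{
    std::vector<unsigned char> buf(256);
    for (;;) {
        // Blocks until at least one byte arrives: no polling, no 2 ms timer.
        ssize_t n = read(fd, buf.data(), buf.size());
        if (n <= 0)
            break;                       // error or port closed
        // handlePacket(buf.data(), n);
    }
}

int main()
{
    int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY);   // note: no O_NONBLOCK
    if (fd < 0)
        return 1;

    termios tio{};
    tcgetattr(fd, &tio);
    cfsetispeed(&tio, B19200);
    cfsetospeed(&tio, B19200);
    tio.c_cc[VMIN]  = 1;                 // read() returns once at least 1 byte is there
    tio.c_cc[VTIME] = 0;
    tcsetattr(fd, TCSANOW, &tio);

    std::thread reader(readerLoop, fd);
    reader.join();
    close(fd);
}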
I have an async API which wraps some IO library. The library uses C-style callbacks, and the API is C++, so the natural choice (IMHO) was to use std::future/std::promise to build the API: something like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I was testing the implementation, I saw that the bottleneck is the future/promise, or more precisely, the futex used to implement promise/future. Since the futex, AFAIK, lives in user space and is the fastest mechanism I know of for syncing two threads, I switched to raw futexes, which improved the situation somewhat, but nothing drastic; performance floats somewhere around 200k futex WAKEs per second. Then I stumbled upon this article - Futex Scaling for Multi-core Systems - which quite matches the effect I observe with futexes. My question is: since the futex is too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side? I don't need anything more sophisticated than a binary semaphore, just something to signal IO operation completion. Since IO operations are very fast (tens of microseconds), switching to kernel mode is not an option. Busy waiting is not an option either, since CPU time is precious in my case.
Bottom line: a simple, user-space synchronization primitive, shared between two threads only, where only one thread sets the completion and only one thread waits for it.
EDIT001:
What if... Previously I said no spinning in a busy wait. But a futex already spins in a busy wait, right? Its implementation, however, covers the more general case, which requires a global hash table to hold the futexes, queues for all waiters, etc. Is it a good idea to mimic the same behaviour on some simple entity (like an int), with no locks, no atomics, no global data structures, and just busy-wait on it the way the futex path already does?
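A hedged sketch of that spin-then-sleep idea for the two-thread case described here: a single shared word is spun on briefly in user space and only falls back to FUTEX_WAIT if the completion hasn't arrived. The spin count is an arbitrary placeholder, std::atomic is used because plain int accesses from two threads are a data race in C++, and note that signal() still issues a FUTEX_WAKE syscall; avoiding that would require tracking whether the waiter has actually gone to sleep.

#include <atomic>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

struct Completion
{
    std::atomic<uint32_t> state{0};            // 0 = pending, 1 = done

    void wait()
    {
        for (int i = 0; i < 4000; ++i)         // fast path: pure user-space spin
            if (state.exchange(0, std::memory_order_acquire) == 1)
                return;

        while (state.exchange(0, std::memory_order_acquire) != 1)
            syscall(SYS_futex, &state, FUTEX_WAIT_PRIVATE, 0,
                    nullptr, nullptr, 0);      // slow path: sleep until woken
    }

    void signal()
    {
        state.store(1, std::memory_order_release);
        // Still one syscall per completion; skipping it needs a "waiter asleep" flag.
        syscall(SYS_futex, &state, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0);
    }
};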
In my experience, the bottleneck is due to Linux's poor support for IPC. This probably isn't a multicore scaling issue, unless you have a large number of threads.
When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.
Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.
Then a similar thing happens again when the response comes. The continuous CPU wakes and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance as the CPUs no longer have to wake up, and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core - it will do more than 1 million 'pings'/sec in this case.
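If you want to try the same-core experiment, here is a small sketch using pthread_setaffinity_np (Linux-specific); pinning both threads to core 0 is an arbitrary choice:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

bool pinToCore(pthread_t thread, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set) == 0;
}

// Usage, once both threads exist:
//   pinToCore(wakerThread.native_handle(), 0);
//   pinToCore(wakeeThread.native_handle(), 0);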
So isn't there a way of saying 'put this thread to sleep and then wake that one'? Then the OS could run the wakee on the same core as the waiter? Well, Google proposed a solution to this with a FUTEX_SWAP API that does exactly this, but it has yet to be accepted into the Linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups, which will hopefully be able to do something similar. However, at the time of writing this has yet to be merged into the kernel.
Without these changes to the kernel, as far as I can tell there is no way around this problem. 'You are on the fastest route'! UNIX sockets, TCP loopback, pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others. (with TCP you get about 100k pings per sec, about half the speed of a futex impl). Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.
(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to pin the waker, then every time you post the futex you need to pin the wakee to the current core, and the system call to do this has too much overhead.)
I know there are many questions about this, yet I still couldn't find the answer that helps me.
Let's take a small TCP server with epoll that we want to utilize as many CPU cores as possible. I've thought of two ways it could be done, but neither of them worked really well.
1 - Each thread has its own epoll fd and, in a while(1) loop, calls epoll_wait() and processes the requests.
2 - Only one epoll fd, creating a new thread for each request as it is processed.
With a single thread I could do around 25k req/s, so I assumed the first method would help a lot, but in reality, when I used two epoll fds, the app could only process ~10k req/s. Obviously I didn't really consider the second method a serious option; it was bound to fail.
So basically my question is: how should I implement multithreading so it can really utilize more cpu cores?
The sockets are non-blocking, with TCP_NODELAY and TCP_FASTOPEN set, and I'm using EPOLLET mode as well.
To use multiple cores, you would want to split the process into different threads and have each thread waiting on its own file descriptor. However, if you are only waiting on a single file descriptor, simply multi-threading it and using blocking reads on each file descriptor may be more efficient. You can also affine different threads to different cores, as the scheduler will often try to place different threads on the same core (because their TLBs are the same), so using:
int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
would help you affine things separately. Obviously, if you have more FDs than CPUs, you are going to have to make tradeoffs.
Here's what I am thinking: if you have two threads (like you tried) that aren't CPU bound but are largely waiting on I/O, the scheduler will think "they both have the same TLB footprint, and are both just waiting on I/O - so I'll just leave them both on the same CPU". It's a logical thing to do and will give good CPU and cache performance, but you need less latency than this (because, roughly speaking, OPS/sec = 1/Latency) - so lock those two threads to different cores with the command above - at least see what it does.
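A rough sketch of option 1 with that affinity suggestion applied: each worker owns its epoll fd and pins itself to a distinct core before entering epoll_wait. The core numbering and the event-array size are assumptions:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <sys/epoll.h>

void worker(int epfd, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);   // pid 0 = the calling thread

    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            // handle events[i] ...
        }
    }
}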
Please be more specific about how the data is processed with your first option: is there any data synchronization between the fds? Maybe that's what is lowering overall performance.
As for the other option, the more reasonable way to go is to use one epoll fd and call epoll_wait on it from multiple threads, as sketched below. It's somewhat more complicated but may give better performance for purely I/O-bound apps, provided there is little or no data dependence between the fds.
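A rough sketch of that layout: one shared epoll fd with several threads blocked in epoll_wait on it. It assumes a kernel/glibc with EPOLLEXCLUSIVE (Linux 4.5+), which avoids waking every thread for the same listening-socket event; the worker count and per-event handling are placeholders.

#include <sys/epoll.h>
#include <thread>
#include <vector>

void serveLoop(int epfd)
{
    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            // accept / read / write on events[i].data.fd ...
        }
    }
}

void startWorkers(int epfd, int listenFd, unsigned nThreads)
{
    epoll_event ev{};
    ev.events  = EPOLLIN | EPOLLEXCLUSIVE;
    ev.data.fd = listenFd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenFd, &ev);

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < nThreads; ++i)
        workers.emplace_back(serveLoop, epfd);
    for (auto &t : workers)
        t.join();
}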
My Linux C++ application periodically reads sensor data. Readout is done by a simple file I/O operation (the OS writes to a file, and the application reads from this file).
Some information about my platform:
I have a single-core processor with hyper-threading
the sensor data update frequency is 1 second
the application GUI runs in the main thread and shouldn't be blocked
I considered two approaches for sensor data read out:
a timer running in the main application thread
a separate thread with an infinite loop that reads the sensor data and then sleeps
Which approach makes more sense, and are there any other alternatives? What are the costs of both solutions (e.g. blocking the main thread in the first, or context switching in the second)?
I don't know anything about your application or the hardware, but here are a few things to consider:
If you use a thread, you will have to create a communication channel of some sort to tell the main thread that data has been updated. Usually this would be a pipe(), as signals are inherently unreliable and condition variables don't work with I/O multiplexing (i.e. select()/poll()).
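As a sketch of that pipe()-based channel (names are placeholders, and notifyPipe would be created once at startup with pipe(notifyPipe)): the sensor thread writes one byte after it refreshes the shared data, and the main thread simply adds the read end to its existing select() set.

#include <sys/select.h>
#include <unistd.h>

int notifyPipe[2];                       // [0] = read end, [1] = write end

void notifyMainThread()                  // sensor thread, after updating data
{
    char token = 1;
    write(notifyPipe[1], &token, 1);
}

void mainLoopIteration(int maxOtherFd, fd_set *otherFds)   // main thread
{
    fd_set rfds = *otherFds;
    FD_SET(notifyPipe[0], &rfds);
    int maxFd = notifyPipe[0] > maxOtherFd ? notifyPipe[0] : maxOtherFd;

    if (select(maxFd + 1, &rfds, nullptr, nullptr, nullptr) > 0 &&
        FD_ISSET(notifyPipe[0], &rfds)) {
        char token;
        read(notifyPipe[0], &token, 1);  // drain the notification
        // ...pick up the freshly updated sensor data from the shared object...
    }
}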
Can you get the entire set of data without blocking? If so, then just reading it in the main thread is probably easier. However, if your read can block, you'll need more of a "keep track of my read state so I can fold it into my central select()" approach, whereas a thread can simply block until more data is available.
Thus, neither solution is automatically "easier" to do.
I wouldn't worry about "context switching" for a read that only occurs once per second; that's irrelevant.
What else does the main thread have to do? Is it OK if it blocks? If so, then you don't need to run the timer, etc. in a separate thread.
If the main thread can't block waiting for the periodic timer, then a separate thread must be created. The data can be communicated between the threads via an object that is accessible to both and protected by a mutex (look up pthread_mutex_t), which is quite simple to do.
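A minimal sketch of such a mutex-protected object; the SensorSample fields are placeholders:

#include <pthread.h>

struct SensorSample {
    double value;
    long   timestamp;
};

static SensorSample    g_sample;
static pthread_mutex_t g_sampleLock = PTHREAD_MUTEX_INITIALIZER;

void publishSample(const SensorSample &s)   // reader thread, once per second
{
    pthread_mutex_lock(&g_sampleLock);
    g_sample = s;
    pthread_mutex_unlock(&g_sampleLock);
}

SensorSample latestSample()                 // GUI/main thread, on demand
{
    pthread_mutex_lock(&g_sampleLock);
    SensorSample copy = g_sample;
    pthread_mutex_unlock(&g_sampleLock);
    return copy;
}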
As for which solution would be better and what the costs are, it depends on what else the main thread is doing. But for something this simple, either way should be about the same, and the context switching shouldn't affect anything. What should affect performance the most is how expensive the reads themselves are.
I believe that the cost of a context switch once a second is not an issue, even for a single-core CPU without hyper-threading, especially taking into account that the application runs in user space and thus is not really time-critical. Polling your sensor in the main thread complicates the logic of the application, so I would recommend starting a thread for that purpose.
A sleep loop will skew the timing because each iteration is going to take longer than 1 second. Timers don't have that problem, and they are made for this scenario. So choose a timer.
Performance-wise there is no difference because you are only triggering once a second.
If the Linux driver reads the sensor data and writes it to a device file every second, you shouldn't duplicate the timer logic in your application. It may happen that after a 1-second sleep your application still reads the same data as 1 second ago. A better approach would be to have a thread that calls a blocking read on the device file. When new sensor data is available, the blocking read returns, the thread processes the data, and then it calls read again.
I have a UDP network application that reads packets sent to it and then processes them (same thread). The reads are non-blocking so I'm not using poll or select.
Packets received are grouped by sessions.
Work is governed by whether there is a session in progress. If there is no work to be done, i.e. there are no sessions or there are no packets to process, then I need to spin.
I've been looking at the hybrid algorithm found here:
http://www.1024cores.net/home/lock-free-algorithms/tricks/spinning
I've been playing with it, but I'm told it's more for busy waits. What methods do you use to prevent unnecessary processing and needlessly high CPU usage?
EDIT:
Thanks for all the answers and comments.
I'm now doing the following: when it comes to reading from the network, I check whether there is other work to be done. If there is, I call poll with a timeout of zero, read as many packets as I can, and place them into an in-memory queue for processing. If there is no other work, I poll indefinitely (i.e. with a timeout of -1). It seems to work well; CPU usage is only high when things are busy, and otherwise it drops to zero.
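For reference, a sketch of that loop; hasOtherWork(), drainSocket(), and doOtherWork() are placeholders for the session/queue logic described above:

#include <poll.h>

bool hasOtherWork();
void drainSocket(int fd);    // non-blocking reads into the in-memory queue
void doOtherWork();          // process queued packets / sessions

void runLoop(int sockFd)
{
    pollfd pfd{};
    pfd.fd     = sockFd;
    pfd.events = POLLIN;

    for (;;) {
        int timeoutMs = hasOtherWork() ? 0 : -1;   // -1 = block until a packet arrives
        int ready = poll(&pfd, 1, timeoutMs);

        if (ready > 0 && (pfd.revents & POLLIN))
            drainSocket(sockFd);

        doOtherWork();
    }
}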
If you have nothing to do, you should be blocking - if not on the socket itself (i.e. if it's an event loop that processes more than one network socket or event type), then on a gate that gets signaled when something happens (the design depends on how your OS does async I/O).
Spinning is something you should only be doing when you're waiting for a very short period of time (usually only in kernel mode).
How many packets per second are you processing? How long does it take to process those packets? If you use blocking threads, what is the average CPU usage you get?
Unless a blocking wait puts you close to 100% usage, where shaving a little overhead off the blocking itself can help, spinning will not improve performance but rather worsen it. By spinning, you tie up one core that is no longer available to run other code (possibly including the code that feeds you work, i.e. the kernel code that reads the network and passes the packets up to your app); you burn resources without performing any work at all...
Note that when the article says it is harder to write blocking code than non-blocking spin waits, the author is not talking about operations for which the blocking version is implemented by the system, but rather about situations where one thread must wait on a condition triggered by other threads (a shared variable going above/below a limit, a flag being changed...).
Also, if the cost of checking the condition is high, then spinning will incur that cost on each and every iteration of the loop, and that might well exceed the cost of checking once and performing an expensive wait.
Remember that spinning is an active wait; it does not make sense to ask how to actively wait without consuming processor time, as the active-wait approach implies consuming it. What can you do to avoid needless CPU usage? Use a blocking call to get the next packet. In the particular case of reading a UDP packet, I suspect that two calls to a non-blocking read are more expensive in processing time than a single call to a blocking read.
Again, think about the questions at the beginning, which can be summed up as: is blocking proven to be the bottleneck? *Is this a scenario where active waits can actually help?*
Since you have to read from a socket, you can just do a blocking read. Without a packet, you have no reason to be running, right?
If there is more than one socket, then the blocking read won't work, so you need pselect() to monitor multiple descriptors.
Am I missing something obvious?
It occurs to me that you may have some long-term processing after you do receive a datagram. If the reason you are going with non-blocking I/O is to avoid ignoring incoming traffic while working on a session, then in that case the obvious thing to do is to fork() the sessions. (Hmm, so I still think I must be missing something...)
I've written a C++ library that does some seriously heavy CPU work (all of it math and calculations) and if left to its own devices, will easily consume 100% of all available CPU resources (it's also multithreaded to the number of available logical cores on the machine).
As such, I have a callback inside the main calculation loop that software using the library is supposed to call:
while (true)
{
    // do math here
    callback(percent_complete);
}
In the callback, the client calls Sleep(x) to slow down the thread.
Originally, the client-side code was a fixed Sleep(100) call, but this led to unreliable performance because some machines finish the math faster than others while the sleep is the same on all machines. So now the client checks the system time, and if more than 1 second has passed (which equals several iterations), it sleeps for half a second.
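A sketch of what that client-side callback might look like with the Win32 calls already in use; the 1 s / 500 ms thresholds are the ones from the description above:

#include <windows.h>

void progressCallback(double /*percentComplete*/)
{
    static ULONGLONG lastPause = GetTickCount64();

    if (GetTickCount64() - lastPause > 1000) {  // ~1 s of work since the last pause
        Sleep(500);                             // yield half a second to the system
        lastPause = GetTickCount64();
    }
}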
Is this an acceptable way of slowing down a thread? Should I be using a semaphore/mutex instead of Sleep() in order to maximize performance? Is sleeping x milliseconds for each 1 second of processing work fine or is there something wrong that I'm not noticing?
The reason I ask is that the machine still gets heavily bogged down even though taskman shows the process taking up ~10% of the CPU. I've already explored hard disk and memory contention to no avail, so now I'm wondering if the way I'm slowing down the thread is causing this problem.
Thanks!
Why don't you use a lower priority for the calculation threads? That will ensure other threads are scheduled while allowing your calculation threads to run as fast as possible if no other threads need to run.
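For example, a minimal Win32 sketch (matching the question's use of Sleep()) that each calculation thread could run once at startup:

#include <windows.h>

void demoteCurrentThread()
{
    // THREAD_PRIORITY_LOWEST or THREAD_PRIORITY_IDLE are stronger options.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
}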
What is wrong with the CPU at 100%? That's what you should strive for, not try to avoid. These math calculations are important, no? Unless you're trying to avoid hogging some other resource not explicitly managed by the OS (a mutex, the disk, etc) and used by the main thread, generally trying to slow your thread down is a bad idea. What about on multicore systems (which almost all systems will be, going forward)? You'd be slowing down a thread for absolutely no reason.
The OS has a concept of a thread quantum. It will take care of ensuring that no important thread on your system is starved. And, as I mentioned, on multicore systems maxing out one core with one thread does not hurt the performance of threads on other cores at all.
I also see in another comment that this thread is also doing a lot of disk I/O - these operations will already cause your thread to yield while it's waiting for the results, so the sleeps will do nothing.
In general, if you're calling Sleep(x), there is something wrong/lazy with your design, and if x==0, you're opening yourself up to live locks (the thread calling Sleep(0) can actually be rescheduled immediately, making it a noop).
Sleep should be fine for throttling an app, which from your comments is what you're after. Perhaps you just need to be more precise how long you sleep for.
The only software in which I use a feature like this is the BOINC client. I don't know what mechanism it uses, but it's open-source and multi-platform, so help yourself.
It has a configuration option ("limit CPU use to X%"). The way I'd expect to implement that is to use platform-dependent APIs like clock() or GetSystemTimes(), and compare processor time against elapsed wall clock time. Do a bit of real work, check whether you're over or under par, and if you're over par sleep for a while to get back under.
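A rough sketch of that over/under-par check: compare process CPU time against elapsed wall-clock time and sleep when the ratio exceeds the target. This assumes POSIX clock() semantics (process CPU time); on Windows you would query GetProcessTimes()/GetSystemTimes() instead, and the target fraction here is only an example value.

#include <chrono>
#include <ctime>
#include <thread>

constexpr double TARGET_FRACTION = 0.5;        // "limit CPU use to 50%"

void throttleCheck()                           // call this from the work loop
{
    using wall_clock = std::chrono::steady_clock;
    static const wall_clock::time_point start = wall_clock::now();

    double cpuSec  = double(std::clock()) / CLOCKS_PER_SEC;
    double wallSec = std::chrono::duration<double>(wall_clock::now() - start).count();

    if (wallSec > 0.0 && cpuSec / wallSec > TARGET_FRACTION) {
        // Sleep just long enough to bring cpu/wall back down to the target.
        double sleepSec = cpuSec / TARGET_FRACTION - wallSec;
        std::this_thread::sleep_for(std::chrono::duration<double>(sleepSec));
    }
}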
The BOINC client plays nicely with priorities and doesn't cause any performance issues for other apps even at 100% max CPU. The reason I use the throttle is that otherwise the client runs the CPU flat-out all the time and drives up the fan speed and noise. So I run it at the level where the fan stays quiet. With better cooling maybe I wouldn't need it :-)
Another, less elaborate, method could be to time one iteration and let the thread sleep for (x * t) milliseconds before the next iteration, where t is the millisecond time for one iteration and x is the chosen sleep-time fraction (between 0 and 1).
Have a look at cpulimit. It sends SIGSTOP and SIGCONT as required to keep a process below a given CPU usage percentage.
Even still, WTF at "crazy complaints and outlandish reviews about your software killing PC performance". I'd be more likely to complain that your software was slow and not making the best use of my hardware, but I'm not your customer.
Edit: on Windows, SuspendThread() and ResumeThread() can probably produce similar behaviour.