Threads ordering in C++/Linux


I'm currently doing a simulation of a hard disk drive IOs in C++, and I'm using pthread threads and a mutex to do the reading on the disk.
However, I'm trying to optimize the reading time by ordering my threads. The problem is that if my disk is currently reading a sector and a bunch of read requests arrive, any of them may be executed next. What I want is to order them so that the request with the closest sector is executed next.
This way, the head of the virtual hard disk drive won't move excessively.
My question is: is using the Linux process priority system a good way to make sure that the closest read request is executed before the others? If not, what could I rely on to do this?
PS: Sorry for my english.
Thanks for your help.

It is very rarely a good idea to rely on the exact behaviour of process priority schemes, especially on a general purpose operating system like Linux, because they don't really guarantee you any particular behaviour. Making something the very highest priority won't help if it touches a memory address or makes an I/O call that causes it to be held up for an instant: the operating system will then run some lower priority process instead, and you will be unpleasantly surprised.
If you want to be sure of the order in which disk I/O requests are completed, or to simulate this, you could create a thread that keeps a list of pending I/O and asks for the requests to be executed one at a time, in an order it controls.
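A minimal sketch of that idea, assuming sectors are plain non-negative integers and that a single disk thread serves all requests. It always picks the pending request closest to the current head position (often called shortest-seek-time-first). The names SectorQueue, takeNearest and serveAll are hypothetical, not from the question.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

// Hypothetical queue of pending sector reads for the simulated disk.
class SectorQueue {
public:
    // Called by any requesting thread.
    void submit(int sector) {
        assert(sector >= 0);  // assumption: sectors are non-negative
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back(sector);
        }
        ready_.notify_one();
    }

    // Signals that no further requests will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Called by the single disk thread. Blocks until a request is pending,
    // then removes the one closest to `head` and stores it in `sector`.
    // Returns false once the queue is closed and empty.
    bool takeNearest(int head, int& sector) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !pending_.empty(); });
        if (pending_.empty()) return false;
        std::size_t best = 0;
        for (std::size_t i = 1; i < pending_.size(); ++i) {
            if (std::abs(pending_[i] - head) < std::abs(pending_[best] - head)) best = i;
        }
        sector = pending_[best];
        pending_.erase(pending_.begin() + static_cast<std::ptrdiff_t>(best));
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::vector<int> pending_;
    bool closed_ = false;
};

// Body of the disk thread: serves requests one at a time, moving the head
// to each sector it reads, and returns the order in which they were served.
std::vector<int> serveAll(SectorQueue& queue, int head) {
    std::vector<int> order;
    int sector = 0;
    while (queue.takeNearest(head, sector)) {
        order.push_back(sector);  // the simulated read would happen here
        head = sector;
    }
    return order;
}
```

Requesting threads only call submit(); a real simulation would also need a way to tell each requester that its read has finished. Note that always choosing the nearest sector can starve requests that are far from the head while nearby ones keep arriving.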

The I/O schedulers in the Linux kernel can re-order and coalesce reads (and to some extent writes) so that their ordering is more favorable for the disk, just like you are describing. This affects the process scheduler (which takes care of threads too) in that the threads waiting for I/O also get "re-ordered" - their read or write requests complete in the order in which the disk served them, not in the order in which they made their request. (This is a very simplified view of what really happens.)
But if you're simulating disk I/O, i.e. if you're not actually doing real I/O, the I/O scheduler isn't involved at all. Only the process scheduler. And the process scheduler has no idea that you're "simulating" a hard disk - it has no information about what the processes are doing, just information about whether or not they're in need of CPU resources. (Again this is a simplified view of how things work).
So the process scheduler will not help you in re-ordering or coalescing your simulation of read requests. You need to implement that logic in your code. (Reading about I/O schedulers is a great idea.)
If you do submit real I/O, then doing the re-ordering yourself could improve performance in some situations, and indeed the I/O scheduler's algorithms for optimizing throughput or latency will affect the way your threads are scheduled (for blocking I/O anyway - asynchronous I/O makes it a bit more complicated still).

Related

What actually happens in asynchronous IO

I keep reading about why asynchronous IO is better than synchronous IO, which is because in a-sync IO, your program can keep running, while in sync IO you're blocked until operation is finished.
I do not understand this claim, because with sync IO (such as write()) the kernel writes the data to the disk; it doesn't happen by itself. The kernel does need CPU time in order to do it.
So in a-sync IO it needs CPU time as well, which might result in a context switch from my application to the kernel. So it's not really blocking, but CPU cycles are still needed to run this operation.
Is that correct?
Is the difference between those two that we assume disk access is slow, so compared to sync IO where you wait for the data to be written to disk, in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Examples of sync IO:
write()
Examples of async IO:
io_uring (as I understand has zero copy as well, so it's a benefit)
spdk (should be best, though I don't understand how to use it)
aio

Your understanding is partly right, but which tools you use are a matter of what programming model you prefer, and don't determine whether your program will freeze waiting for I/O operations to finish. For certain, specialized, very-high-load applications, some models are marginally to moderately more efficient, but unless you're in such a situation, you should pick the model that makes it easy to write and maintain your program and have it be portable to systems you and your users care about, not the one someone is marketing as high-performance.
Traditionally, there were two ways to do I/O without blocking:
Structure your program as an event loop performing select (nowadays poll; select is outdated and has critical flaws) on a set of file descriptors that might be ready for reading input or accepting output. This requires keeping some sort of explicit state for partial input that you're not ready to process yet and for pending output that you haven't been able to write out yet.
Separate I/O into separate execution contexts. Historically the unixy approach to this was separate processes, and that can still make sense when you have other reasons to want separate processes anyway (privilege isolation, etc.) but the more modern way to do this is with threads. With a separate execution context for each I/O channel you can just use normal blocking read/write (or even buffered stdio functions) and any partial input or unfinished output state is kept for you implicitly in the call frame stack/local variables of its execution context.
Note that, of the above two options, only the latter helps with stalls from disk access being slow, as regular files are always "ready" for input and output according to select/poll.
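As a rough illustration of the first option, here is one iteration of a poll-based loop that keeps explicit partial-input state per descriptor. It is only a sketch: pumpOnce and the `partial` buffers are made-up names, and it is meant for pipes, sockets or terminals, since regular files always report ready.

```cpp
#include <cassert>
#include <cstddef>
#include <poll.h>
#include <string>
#include <unistd.h>
#include <vector>

// One iteration of a poll-based event loop (hypothetical helper).
// Waits up to timeoutMs for any descriptor in `fds` to become readable and
// appends whatever is available to the matching entry of `partial`, which is
// the explicit "input seen so far" state the loop has to carry around.
// Returns poll()'s result: number of ready descriptors, 0 on timeout, -1 on error.
int pumpOnce(const std::vector<int>& fds, std::vector<std::string>& partial, int timeoutMs) {
    assert(partial.size() == fds.size());
    std::vector<pollfd> watch(fds.size());
    for (std::size_t i = 0; i < fds.size(); ++i) {
        watch[i].fd = fds[i];
        watch[i].events = POLLIN;
        watch[i].revents = 0;
    }
    int ready = poll(watch.data(), static_cast<nfds_t>(watch.size()), timeoutMs);
    if (ready <= 0) return ready;
    for (std::size_t i = 0; i < watch.size(); ++i) {
        if (watch[i].revents & POLLIN) {
            char chunk[4096];
            ssize_t n = read(watch[i].fd, chunk, sizeof chunk);
            if (n > 0) partial[i].append(chunk, static_cast<std::size_t>(n));
        }
    }
    return ready;
}
```

A full program would call this in a loop and hand each buffer to the application once it holds a complete message; pending output would need a similar per-descriptor buffer watched with POLLOUT.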
Nowadays there's a trend, probably owing largely to languages like JavaScript, towards a third approach, the "async model", with event handler callbacks. I find it harder to work with, requiring more boilerplate code, and harder to reason about, than either of the above methods, but plenty of people like it. If you want to use it, it's probably preferable to do so with a library that abstracts the Linuxisms you mentioned (io_uring, etc.) so your program can run on other systems and doesn't depend on the latest Linux fads.
Now to your particular question:
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If your application has a single input source (no interactivity) and a single output, like most unix commands, there is absolutely no benefit to any kind of async I/O, regardless of the programming model (event loop, threads, async callbacks, whatever). The simplest and most efficient thing to do is just read and write.

The kernel does need CPU time in order to do it.
Is that correct?
Pretty much, yes.
Is the difference between those two that we assume disk access is slow ... in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Exactly.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Depends on many factors. How does the application "get info"? Is it CPU intensive? Does it use the same IO as the writing? Is it a service that processes multiple requests concurrently? How many simultaneous connections? Is the performance important in the first place? In some cases: Yes, there may be significant benefit in using async IO. In some other cases, you may get most of the benefits by using sync IO in a separate thread. And in other cases single threaded sync IO can be sufficient.

I do not understand this saying because using sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.
No. Most modern devices are able to transfer data to/from RAM by themselves (using DMA or bus mastering).
For an example; the CPU might tell a disk controller "read 4 sectors into RAM at address 0x12345000" and then the CPU can do anything else it likes while the disk controller does the transfer (and will be interrupted by an IRQ from the disk controller when the disk controller has finished transferring the data).
However; for modern systems (where you can have any number of processes all wanting to use the same device at the same time) the device driver has to maintain a list of pending operations. In this case (under load); when the device generates an IRQ to say that it finished an operation the device driver responds by telling the device to start the next "pending operation". That way the device spends almost no time idle waiting to be asked to start the next operation (much better device utilization) and the CPU spends almost all of its time doing something else (between IRQs).
Of course often hardware is more advanced (e.g. having an internal queue of operations itself, so driver can tell it to do multiple things and it can start the next operation as soon as it finished the previous operation); and often drivers are more advanced (e.g. having "IO priorities" to ensure that more important stuff is done first rather than just having a simple FIFO queue of pending operations).
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Let's say that you get info from deviceA (while CPU and deviceB are idle); then process that info a little (while deviceA and deviceB are idle); then write the result to deviceB (while deviceA and CPU are idle). You can see that most hardware is doing nothing most of the time (poor utilization).
With asynchronous IO; while deviceA is fetching the next piece of info the CPU can be processing the current piece of info while deviceB is writing the previous piece of info. Under ideal conditions (no speed mismatches) you can achieve 100% utilization (deviceA, CPU and deviceB are never idle); and even if there are speed mismatches (e.g. deviceB needs to wait for CPU to finish processing the current piece) the time anything spends idle will be minimized (and utilization maximized as much as possible).
The other alternative is to use multiple tasks - e.g. one task that fetches data from deviceA synchronously and notifies another task when the data was read; a second task that waits until data arrives and processes it and notifies another task when the data was processed; then a third task that waits until data was processed and writes it to deviceB synchronously. For utilization; this is effectively identical to using asynchronous IO (in fact it can be considered "emulation of asynchronous IO"). The problem is that you've added a bunch of extra overhead managing and synchronizing multiple tasks (more RAM spent on state and stacks, task switches, lock contention, ...); and made the code more complex and harder to maintain.
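The three-task arrangement can be sketched roughly as follows. This is an illustration under simplifying assumptions: the "devices" are just in-memory vectors, the processing step is upper-casing, and Channel and runPipeline are invented names.

```cpp
#include <cassert>
#include <cctype>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Minimal blocking FIFO used to hand items from one task to the next.
template <typename T>
class Channel {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_);  // pushing after close() is a caller bug
            items_.push(std::move(value));
        }
        ready_.notify_one();
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Blocks until an item is available; returns false once closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Runs fetch, process and write as three tasks that overlap in time.
std::vector<std::string> runPipeline(const std::vector<std::string>& input) {
    Channel<std::string> fetched;
    Channel<std::string> processed;
    std::vector<std::string> written;

    std::thread reader([&] {  // stands in for synchronous reads from deviceA
        for (const std::string& item : input) fetched.push(item);
        fetched.close();
    });
    std::thread worker([&] {  // CPU stage: upper-cases each item
        std::string item;
        while (fetched.pop(item)) {
            for (char& c : item) {
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            }
            processed.push(item);
        }
        processed.close();
    });
    std::thread writer([&] {  // stands in for synchronous writes to deviceB
        std::string item;
        while (processed.pop(item)) written.push_back(item);
    });

    reader.join();
    worker.join();
    writer.join();
    return written;
}
```

The mutexes, condition variables and three stacks are exactly the extra overhead mentioned above; the queues here are unbounded, so a real pipeline would probably cap them to keep a fast reader from running far ahead of a slow writer.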

Context switching is necessary in any case. Kernel always works in its own context. So, the synchronous access doesn't save the processor time.
Usually, writing doesn't require a lot of processor work. The limiting factor is the disk response. The question is whether we wait for this response or carry on with our work in the meantime.
Let's say I have an application that all it does is get info and write
it into files. Is there any benefit for using a-sync IO instead of
sync IO?
If you implement synchronous access, your sequence is the following:
1. get information
2. write information
3. goto 1.
So, you can't get information until write() completes. Suppose the information supplier is as slow as the disk you write to. In this case the synchronous program will take twice as long as the asynchronous one.
If the information supplier can't wait and buffer the information while you are writing, you will lose portions of it during each write. Examples of such sources are sensors monitoring fast processes. In this case, you should read the sensors synchronously and save the obtained values asynchronously.

Asynchronous IO is not better than synchronous IO. Nor vice versa.
The question is which one is better for your use case.
Synchronous IO is generally simpler to code, but asynchronous IO can lead to better throughput and responsiveness at the expense of more complicated code.
I never had any benefit from asynchronous IO just for file access, but some applications may benefit from it.
Applications accessing "slow" IO like the network or a terminal have the most benefit. Using asynchronous IO allows them to do useful work while waiting for IO to complete. This can mean the ability to serve more clients or to keep the application responsive for the user.
(and "slow" just means that the time for an IO operation to finish is unbounded; it may even never finish, e.g. when waiting for a user to press enter or a network client to send a command)
In the end, asynchronous IO doesn't do less work, it's just distributed differently in time to reduce idle waiting.

Benefits of a multi thread program in a unicore system [duplicate]

This question already has answers here:
How can multithreading speed up an application (when threads can't run concurrently)?
(9 answers)
Closed 9 years ago.
My professor casually mentioned that we should write multi-threaded programs even if we are using a unicore processor; however, because of the lack of time, he did not elaborate on it.
I would like to know: what are the benefits of a multi-threaded program on a unicore processor?

It won't be as significant as a multi-core system but it can still provide some benefits.
Mainly, the benefits come from the context switch that happens when the currently executing thread has to wait, for example for a hardware resource or for an I/O operation to complete. (Stalls as short as a branch misprediction or a cache miss are too brief for an OS context switch; only hardware multithreading can hide those.)
At this point another thread that is ready to run can be executed to make use of this "waiting time". But of course the context switch itself takes some time. Also, managing threads inside the code rather than computing sequentially adds some complexity to your program. And as has been said, some applications need to be multi-threaded, so there is no escape from the context switch in some cases.

Some applications need to be multi-threaded. Multi-threading isn't just about improving performance by using more cores, it's also about performing multiple tasks at once.
Take Skype for example - The GUI needs to be able to accept the text you're entering, display it on the screen, listen for new messages coming from the user you're talking to, and display them. This wouldn't be a trivial task in a single threaded application.
Even if there's only one core available, the OS thread scheduler will give you the illusion of parallelism.

Usually it is about not blocking. Running many threads on a single core still gives the illusion of concurrency. So you can have, say, a thread doing IO while another one does user interactions. The user interaction thread is not blocked while the other does IO, so the user is free to carry on interacting.

Benefits could be different.
One widely used example is a GUI application that has to perform some kind of computation. With a single thread, the user has to wait for the result before doing anything else with the application; if you run the computation in a separate thread, the user interface stays available during the computation. So a multi-threaded program can emulate a multi-tasking environment even on a unicore system. That's one of the points.

As others have already mentioned, not blocking is one application. Another one is separation of logic for unrelated tasks that are to be executed simultaneously. Using threads for that leaves handling of scheduling these tasks to the OS.
However, note that it may also be possible to implement similar behavior using asynchronous operations in a single thread. "Future" and boost::asio provide ways of doing non-blocking stuff without necessarily resorting to multiple threads.

I think it depends a bit on how exactly you design your threads and which logic is actually in the thread. Some benefits you can even get on a single core:
A thread can wrap a blocking or long-running call you can't circumvent otherwise. For some operations there are polling mechanisms, but not for all.
A thread can wrap an almost standalone part of your application that has virtually no interaction with other code. For example background polling for updates, monitoring some resource (e.g. free storage), checking internet connectivity. If you keep them in a separate thread you can keep the code relatively simple in its own 'runtime' without caring too much about the impact on the main program, the sole communication with the main logic is usually a single 'event'.
In some environments you might get more processing time. This mainly depends on how your OS scheduling system works, but if this allocates time per thread, the more threads you have the more your app will be scheduled.
Some benefits long-term:
Where it's not hard to do you benefit if your hardware evolves. You never know what's going to happen, today your app runs on a single-core embedded device, tomorrow that embedded device gets a quad core. Programming threaded from the beginning improves your future scalability.
One example is an environment where you can deterministically assign work to a thread, e.g. based on some hash all related operations end up in the same thread. The advantage for single cores is 'small' but it's not hard to do as you need little synchronization primitives so the overhead stays small.
That said, I think there are situations where it's very ill-advised:
As soon as your required synchronization mechanism with other threads becomes complex (e.g. multiple locks, lots of critical sections, ...). It might still be then that multi-threading gives you a benefit when effectively moving to multiple CPUs, but the overhead is huge both for your single core and your programming time.

For instance, think about operations that block because of slow peripheral devices (hard disk access etc.). While these are waiting, even a single core can do other things asynchronously.

In a lot of applications the bottleneck is not CPU processing power. So when the program flow is waiting for completion of IO requests (user input, network/disk IO), for critical resources to become available, or for any sort of asynchronously triggered event, the CPU can be scheduled to do other work instead of just blocking.
In this case you don't necessarily need multiple threads that can actually run in parallel. Cooperative multi-tasking concepts like asynchronous IO, coroutines, or fibers come to mind.
If however the application's bottleneck is CPU processing power (constantly 100% CPU usage), then it makes sense to increase the number of CPUs available to the application. At that point it is easier to scale the application up to use more CPUs if it was designed to run in parallel upfront.

As far as I can see, one answer was not yet given:
You will have to write multithreaded applications in the future!
The average number of cores will double every 18 months in the future. People have learned single-threaded programming for 50 years now, and now they are confronted with devices that have multiple cores. The programming style in a multi-threaded environment differs significantly from single-threaded programming. This refers to low-level aspects like avoiding race conditions and proper synchronization, as well as the high-level aspects like the general algorithm design.
So in addition to the points already mentioned, it's also about writing future-proof software, scalability and the development of the skills that are required to achieve these goals.

Do Asynchronous Loggers really help in performance?

We know that synchronous logging writes the log message to the file and then continues with program execution. Asynchronous loggers queue the log messages and write them in a separate thread. I'm starting to use Log4CPlus in my project and a couple of things came to my mind.
I can't initialize more log objects, because that will open more file handles and we don't need that. (I know we should use feature-based logging objects, for example UploadLogObj, DownloadLogObj, WebReqLogObj, AuthLogObj, etc.) I suppose each additional log object may add a logging thread too.
Still, for argument's sake, if I use a single log object and push log messages from multiple threads, I suppose there must be some mutex lock to protect writes to the message queue. My question: won't this mutex lock slow down the process and create a performance issue?
I'm just wondering how asynchronous loggers work. I can look into the code, that's one way, but I hope the answers will be enlightening to a lot of people.

Yes, the mutex will slow down the process a bit, but if you are logging from multiple threads to the same destination you will need some form of synchronization anyway, since you don't want lines from different threads to be mixed up.
In the end it's a matter of deciding where to synchronize, not if. With asynchronous logging this happens when the object to be logged is pushed to the queue of the logging thread. In the synchronous case probably at the time the line is written (though it depends on the implementation).
In the first case the time spent inside the mutex will be much shorter and more predictable, since no disk flushes happen while the mutex is held. This means that you may have less performance degradation and better scaling than in the second case (plus the time that you didn't spend writing the actual data, because the other thread is taking care of it).
If you don't have a lot of threads competing for the mutex anyway, it won't be a problem. I had the chance to write and use an asynchronous logger for a real-time system some time ago, and we reached disk-bandwidth related issues long before synchronization issues.
One downside of asynchronous logging is more memory related: since you need to pass the data to be logged around you need to be careful and avoid unneeded allocations/deallocations.
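To make the "where to synchronize" point concrete, here is a small sketch of a queue-based asynchronous logger. It is not how Log4CPlus is implemented; AsyncLogger and its members are invented for illustration, and it writes to any std::ostream rather than opening a file itself.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <ostream>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>

// Illustrative asynchronous logger: callers only pay for a short push under
// the mutex, while a single background thread does the slow writing.
class AsyncLogger {
public:
    explicit AsyncLogger(std::ostream& sink) : sink_(sink), worker_([this] { run(); }) {}

    ~AsyncLogger() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        wake_.notify_one();
        worker_.join();
        assert(queue_.empty());  // the worker drains everything before exiting
    }

    AsyncLogger(const AsyncLogger&) = delete;
    AsyncLogger& operator=(const AsyncLogger&) = delete;

    // May be called from any thread; the lock is held only for the push.
    void log(std::string line) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(line));
        }
        wake_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return done_ || !queue_.empty(); });
            if (queue_.empty()) return;  // shutting down and nothing left
            std::queue<std::string> batch;
            batch.swap(queue_);          // take the whole backlog at once
            lock.unlock();               // write and flush without the lock
            while (!batch.empty()) {
                sink_ << batch.front() << '\n';
                batch.pop();
            }
            sink_.flush();
            lock.lock();
        }
    }

    std::ostream& sink_;
    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::string> queue_;
    bool done_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};
```

Swapping the whole queue out under the lock keeps the critical section short no matter how slow the sink is. Each std::string pushed is still an allocation, which is the memory-related cost mentioned above.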

A mutex lock takes something like 40-60 nanoseconds (if the mutex is not held by another thread) on modern hardware. This is nothing compared to an IO operation, which can in theory take a few seconds when writing a file to a slow HDD or network drive.
Lock-free is a different thing: in that case you don't even have mutexes. However, there is a price for it: you'll have to write more complicated code.

Low-latency read of UDP port

I am reading a single data item from a UDP port. It's essential that this read be the lowest latency possible. At present I'm reading via the boost::asio library's async_receive_from method. Does anyone know the kind of latency I will experience between the packet arriving at the network card, and the callback method being invoked in my user code?
Boost is a very good library, but quite generic, is there a lower latency alternative?
All opinions on writing low-latency UDP network programs are very welcome.
EDIT: Another question, is there a relatively feasible way to estimate the latency that I'm experiencing between NIC and user mode?

Your latency will vary, but it will be far from the best you can get. Here are few things that will stand in your way to the better latency:
Boost.ASIO
It constantly allocates/deallocates memory to store "state" in order to invoke a callback function associated with your read operation.
It does unnecessary mutex locking/unlocking in order to support a broken mix of async and sync approaches.
The worst, it constantly adds and removes event descriptors from the underlying notification mechanism.
All in all, asio is a good library for high-level application developers, but it comes with a big price tag and a lot of CPU cycle eating gremlins. Another alternative is libevent, it is a lot better, but still aims to support many notification mechanisms and be platform-independent. Nothing can beat native mechanisms, i.e. epoll.
Other things
UDP stack. It doesn't do a very good job for latency sensitive applications. One of the most popular solutions is OpenOnload. It by-passes the stack and works directly with your NIC.
Scheduler. By default, scheduler is optimized for throughput and not latency. You will have to tweak and tune your OS in order to make it latency oriented. Linux, for example, has a lot of "rt" patches for that purpose.
Watch out not to sleep. Once your process is sleeping, you will never get a good wakeup latency compared to constantly burning CPU and waiting for a packet to arrive.
Interference with other IRQs, processes etc.
I cannot tell you exact numbers, but assuming that you won't be getting a lot of traffic, using Boost and a regular Linux kernel on regular hardware, your latency will range somewhere between ~50 microseconds and ~100 milliseconds. It will improve a bit as you get more data, after some point it will start getting worse, and it will always vary. I'd say that if you are OK with those numbers, don't bother optimizing.

I think that using recv() in a "spin" loop thread, with the thread pinned to a single CPU core (processor affinity), should give lower latency than using select(). In my test the precision of select() varied from 1 to 10 microseconds, while the spin loop stayed at 1 microsecond.

Impact of hundreds of idle threads

I am considering the use of potentially hundreds of threads to implement tasks that manage devices over a network.
This is a C++ application running on a powerpc processor with a linux kernel.
After an initial phase when each task does synchronization to copy data from the device into the task, the task becomes idle, and only wakes up when it receives an alarm, or needs to change some data (configuration), which is rare after the start phase. Once all tasks reach the "idle" phase, I expect that only a few per second will need to wake.
So, my main concern is, if I have hundreds of threads will they have a negative impact on the system once they become idle?
Thanks.
amso
edit:
I'm updating the question based on the answers that I got. Thanks guys.
So it seems that having a ton of threads idling (IO blocked, waiting, sleeping, etc), per se , will not have an impact on the system in terms of responsiveness.
Of course, they will spend extra money for each thread's stack and TLS data but that's okay as long as we throw more memory at the thing (making it more €€€)
But then, other issues have to be accounted for. Having 100s of threads waiting will likely increase memory usage on the kernel, due to the need of wait queues or other similar resources. There's also a latency issue, which looks non-deterministic. To check the responsiveness and memory usage of each solution one should measure it and compare.
Finally, the whole idea of hundreds of threads that will be mostly idling may be modeled as a thread pool. This makes the code a bit less linear but dramatically increases the scalability of the solution, and with proper care it can easily be tuned to adjust the compromise between performance and resource usage.
I think that's all. Thanks everyone for their input.
--
amso

Each thread has overhead - most importantly each one has its own stack and TLS. Performance is not that much of a problem since they will not get any time slices unless they actually do anything. You may still want to consider using thread pools.
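For reference, a fixed-size thread pool can be sketched in a few dozen lines. This is a bare-bones illustration with invented names (ThreadPool, submit); a production pool would also need error handling and some way to return results.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Illustrative fixed-size pool: a few worker threads serve many logical tasks,
// so an idle device costs a queue entry at most, not a whole stack.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t workers) {
        for (std::size_t i = 0; i < workers; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }

    // Finishes all queued tasks, then stops the workers.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& t : threads_) t.join();
    }

    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    void submit(std::function<void()> task) {
        assert(static_cast<bool>(task));  // an empty task is a caller bug
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};
```

With this shape, an alarm or configuration change for a device becomes a task submitted to the pool, and the number of workers can be tuned independently of the number of devices.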

Chiefly they will use up address space and memory for stacks; once you get, say, 1000 threads, this gets quite significant, as I've seen that 10M per thread is typical for stacks (on x86_64). It is changeable, but only with care.
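With pthreads, the per-thread stack size is changed through a thread attribute. The helper below is a sketch (startWithStack is a made-up name); the right size depends entirely on how deep the thread's call chain and local buffers go, which is where the care comes in.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

// Hypothetical helper: starts `entry(arg)` on a new thread whose stack is
// `stackBytes` large instead of the process default.
// Returns 0 on success, otherwise the error number reported by pthreads
// (for example EINVAL when stackBytes is below PTHREAD_STACK_MIN).
int startWithStack(pthread_t* thread, std::size_t stackBytes,
                   void* (*entry)(void*), void* arg) {
    assert(thread != nullptr && entry != nullptr);
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_attr_setstacksize(&attr, stackBytes);
    if (rc == 0) rc = pthread_create(thread, &attr, entry, arg);
    pthread_attr_destroy(&attr);  // the attribute object is no longer needed
    return rc;
}
```

A thread that overflows a too-small stack will crash, so the value should be chosen from measurements rather than guessed.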
If you have a 32-bit processor, address space will be the main limitation once you hit 1000s of threads, you can easily exhaust the AS.
They use up some kernel memory, but probably not as much as userspace.
Edit: of course threads share address space with each other only if they are in the same process; I am assuming that they are.

I'm not a Linux hacker, but assuming that Linux's thread scheduling is similar to Windows'...
Yes, of course there will be some impact. Every bit of memory you consume will potentially have some impact.
However, in a time-sliced environment, threads that are in a Wait/Sleep/Join state will not consume CPU cycles until they are awoken.

I would be worried about offering 1:1 thread-connections mappings, if nothing else because it leaves you rather exposed to denial of service attacks. (pthread_create() is a fairly expensive operation compared to just a call to accept())
EboMike has already answered the question directly - provided threads are blocked and not busy-waiting then they won't consume much in the way of resources although they will occupy memory and swap for all the per-thread state.

I'm learning the basics of the kernel now. I can't give you a specific answer yet; I'm still a noob... but here are some things for you to chew on.
Linux implements each POSIX thread as a unique process. This will create overhead as others have mentioned. In addition to this, your waiting model appears flawed any way you do it. If you create one condition variable for each thread, then I think (based off of my interpretation of the website below) that you'll actually be expending a lot of kernel memory, as each thread would be placed into its own wait queue. If instead you break your threads up so that each group of X threads shares a condition variable, then you've got problems as well because every time the variable signals, you must wake up _EVERY_DARN_PROCESS_ in that variable's wait queue.
I also assume that you will need to do some object sharing and synchronization. In this case, your code may get slower because of the need to wake up all processes waiting on a resource, as I mentioned earlier.
I know this wasn't much help, but as I said, I'm a kernel noob. Hope it helped a little.
http://book.chinaunix.net/special/ebook/PrenticeHall/PrenticeHallPTRTheLinuxKernelPrimer/0131181637/ch03lev1sec7.html

I'm not sure what "device" you are talking about, but if it's a file descriptor, I'd suggest that you look at starting to migrate to using either poll or epoll (I'd suggest the latter given the description of how active you expect each file descriptor to be). That way, you could use one process which would be responsible for all the fds.
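A rough sketch of the epoll variant, assuming each device is reachable through a file descriptor such as a socket. ReadinessSet is an invented wrapper name; the epoll calls themselves are the standard Linux ones.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/epoll.h>
#include <unistd.h>
#include <vector>

// Illustrative wrapper: descriptors are registered once, then a single thread
// sleeps in wait() until any of them has data to read.
class ReadinessSet {
public:
    ReadinessSet() : epollFd_(epoll_create1(0)) { assert(epollFd_ >= 0); }
    ~ReadinessSet() {
        if (epollFd_ >= 0) close(epollFd_);
    }

    ReadinessSet(const ReadinessSet&) = delete;
    ReadinessSet& operator=(const ReadinessSet&) = delete;

    // Registers `fd` for read readiness; returns false if the kernel refuses it.
    bool add(int fd) {
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = fd;
        if (epoll_ctl(epollFd_, EPOLL_CTL_ADD, fd, &ev) != 0) return false;
        ++count_;
        return true;
    }

    // Blocks for up to timeoutMs (-1 means forever) and returns the readable fds.
    std::vector<int> wait(int timeoutMs) {
        std::vector<epoll_event> events(count_ > 0 ? count_ : 1);
        int n = epoll_wait(epollFd_, events.data(),
                           static_cast<int>(events.size()), timeoutMs);
        std::vector<int> ready;
        for (int i = 0; i < n; ++i) ready.push_back(events[i].data.fd);
        return ready;
    }

private:
    int epollFd_;
    std::size_t count_ = 0;
};
```

With hundreds of mostly idle descriptors, the thread blocked in wait() costs one stack in total, and the few descriptors that wake per second can be handled inline or passed to a small pool.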
