Increase io priority on Windows? - c++

Originally my producer function would just write the data; now I have a second thread that is responsible for writing the data. The producer function does a memcpy into a circular buffer and triggers the consumer thread to start writing.
When I use the two-threaded scheme I get the desired thread isolation, program stability and the ability to do a variable amount of computation before writing - but the I/O performance is 50% worse.
My theory is that there is some kind of priority that can be set per thread that I want to adjust. Is this possible?
I am using 2 SSDs in a RAID0 data striping configuration.

What do you mean by "io performance is 50% worse"? According to your resource monitor it is as high as it can be: the disk queue is full and disk active time is 100%. If you mean the jumps in write speed - they have nothing to do with any possible thread priority. They are caused by disk head positioning due to file fragmentation, fs table modifications and so on.

Related

Chronicle Queue: What is the recommended way to multiplex producers to write into a single queue?

Let there be 5 producer threads, and 1 queue. It seems like I have 2 options:
create an appender for each producer thread, append concurrently and let chronicle queue deal with synchronization (enable double-buffering?)
synchronize the 5 producer threads first (lockfree mechanism e.g. disruptor), create 1 extra thread with 1 appender that writes into chronicle queue
Why this question?
I had the initial impression that writing to a chronicle queue is lock-free and should therefore be really fast. But github documentation mentioned multiple times there is a write lock that serializes concurrent writes. So I wonder if a lockfree disruptor placed in front of the chronicle queue would increase performance?
What you suggest can improve the writer's performance, especially if you have an expensive serialization/marshalling strategy. However, if you are writing to a real disk (even a fast NVMe drive), you will find the performance of the drive is possibly your biggest issue. You might also find the time to read the data is worse.
Let's say you spend 1 microsecond writing a 512-byte message, and you are writing at 200K/s messages. This means that your 80%ile will be an extra 1 us waiting for the queue due to contention. However, you will be writing about 360 GB/h, which will very quickly fill a fast NVMe drive. If instead, you have a relatively low volume of 20K/s messages, you are increasing your 98%ile latency by 1 us.
In short, if write contention is a problem, your drive is probably a much bigger problem. Adding a disruptor could help the writer, but it will delay the read time of every message.
I recommend building a latency benchmark for a realistic throughput first. You can double buffer the data yourself by writing first to a Wire and only copying the bytes while holding the lock.
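To illustrate the "copy only the bytes while holding the lock" idea, here is a minimal C++ sketch. It is not Chronicle Queue code: SharedLog, serialize and run_producers are hypothetical names, and a std::mutex stands in for the queue's write lock. Each producer does its (possibly expensive) serialization into a private buffer first, so the critical section is just a byte copy.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a shared append-only log (not the Chronicle Queue API).
class SharedLog {
public:
    // Only the byte copy happens while the lock is held.
    void append(const std::string& bytes) {
        std::lock_guard<std::mutex> guard(lock_);
        data_.append(bytes);
        ++messages_;
    }
    std::size_t messages() {
        std::lock_guard<std::mutex> guard(lock_);
        return messages_;
    }
private:
    std::mutex lock_;
    std::string data_;
    std::size_t messages_ = 0;
};

// Hypothetical serialization step; runs without the lock, in a private buffer.
std::string serialize(int producer, int seq) {
    return "producer=" + std::to_string(producer) +
           " seq=" + std::to_string(seq) + "\n";
}

// Starts `producers` threads that each append `per_producer` messages.
std::size_t run_producers(int producers, int per_producer) {
    SharedLog log;
    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
        threads.emplace_back([&log, p, per_producer] {
            for (int i = 0; i < per_producer; ++i) {
                std::string bytes = serialize(p, i);  // outside the lock
                log.append(bytes);                    // short critical section
            }
        });
    }
    for (auto& t : threads) t.join();
    return log.messages();
}
```

Whether this beats a separate hand-off thread is something only a latency benchmark at realistic throughput can tell you.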

What actually happens in asynchronous IO

I keep reading about why asynchronous IO is better than synchronous IO, which is because in a-sync IO, your program can keep running, while in sync IO you're blocked until operation is finished.
I do not understand this claim, because with sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.
So in a-sync IO, it needs it as well, which might result in a context switch from my application to the kernel. So it's not really blocking, but CPU cycles are still needed to run this operation.
Is that correct?
Is the difference between those two that we assume disk access is slow, so compared to sync IO where you wait for the data to be written to disk, in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Examples of sync IO:
write()
Examples of async IO:
io_uring (as I understand has zero copy as well, so it's a benefit)
spdk (should be best, though I don't understand how to use it)
aio
Your understanding is partly right, but which tools you use are a matter of what programming model you prefer, and don't determine whether your program will freeze waiting for I/O operations to finish. For certain, specialized, very-high-load applications, some models are marginally to moderately more efficient, but unless you're in such a situation, you should pick the model that makes it easy to write and maintain your program and have it be portable to systems you and your users care about, not the one someone is marketing as high-performance.
Traditionally, there were two ways to do I/O without blocking:
Structure your program as an event loop performing select (nowadays poll; select is outdated and has critical flaws) on a set of file descriptors that might be ready for reading input or accepting output. This requires keeping some sort of explicit state for partial input that you're not ready to process yet and for pending output that you haven't been able to write out yet.
Separate I/O into separate execution contexts. Historically the unixy approach to this was separate processes, and that can still make sense when you have other reasons to want separate processes anyway (privilege isolation, etc.) but the more modern way to do this is with threads. With a separate execution context for each I/O channel you can just use normal blocking read/write (or even buffered stdio functions) and any partial input or unfinished output state is kept for you implicitly in the call frame stack/local variables of its execution context.
Note that, of the above two options, only the latter helps with stalls from disk access being slow, as regular files are always "ready" for input and output according to select/poll.
Nowadays there's a trend, probably owing largely to languages like JavaScript, towards a third approach, the "async model", with event handler callbacks. I find it harder to work with, requiring more boilerplate code, and harder to reason about, than either of the above methods, but plenty of people like it. If you want to use it, it's probably preferable to do so with a library that abstracts the Linuxisms you mentioned (io_uring, etc.) so your program can run on other systems and doesn't depend on the latest Linux fads.
Now to your particular question:
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If your application has a single input source (no interactivity) and a single output, e.g. like most unix commands, there is absolutely no benefit to any kind of async I/O regardless of which programming model you use (event loop, threads, async callbacks, whatever). The simplest and most efficient thing to do is just read and write.
The kernel does need CPU time in order to do it.
Is that correct?
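For the single-input, single-output case, "just read and write" can be as small as the loop below. This is only a sketch using standard C++ streams; copy_stream and copy_text are names made up for the example, and the 4096-byte buffer size is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copies everything from `in` to `out` with ordinary blocking reads and writes.
// Returns the number of bytes copied.
std::size_t copy_stream(std::istream& in, std::ostream& out) {
    char buffer[4096];
    std::size_t total = 0;
    while (in) {
        in.read(buffer, sizeof buffer);       // blocks until data or EOF
        std::streamsize got = in.gcount();    // bytes actually read
        if (got <= 0) break;
        out.write(buffer, got);               // blocks until accepted
        total += static_cast<std::size_t>(got);
    }
    return total;
}

// Helper for trying the loop on in-memory streams.
std::string copy_text(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream out;
    copy_stream(in, out);
    return out.str();
}
```

The same function works unchanged with std::cin and std::cout or with file streams.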
Pretty much, yes.
Is the difference between those two that we assume disk access is slow ... in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Exactly.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Depends on many factors. How does the application "get info"? Is it CPU intensive? Does it use the same IO as the writing? Is it a service that processes multiple requests concurrently? How many simultaneous connections? Is the performance important in the first place? In some cases: Yes, there may be significant benefit in using async IO. In some other cases, you may get most of the benefits by using sync IO in a separate thread. And in other cases single threaded sync IO can be sufficient.
I do not understand this claim, because with sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.
No. Most modern devices are able to transfer data to/from RAM by themselves (using DMA or bus mastering).
For example, the CPU might tell a disk controller "read 4 sectors into RAM at address 0x12345000" and then the CPU can do anything else it likes while the disk controller does the transfer (and will be interrupted by an IRQ from the disk controller when the disk controller has finished transferring the data).
However; for modern systems (where you can have any number of processes all wanting to use the same device at the same time) the device driver has to maintain a list of pending operations. In this case (under load); when the device generates an IRQ to say that it finished an operation the device driver responds by telling the device to start the next "pending operation". That way the device spends almost no time idle waiting to be asked to start the next operation (much better device utilization) and the CPU spends almost all of its time doing something else (between IRQs).
Of course often hardware is more advanced (e.g. having an internal queue of operations itself, so driver can tell it to do multiple things and it can start the next operation as soon as it finished the previous operation); and often drivers are more advanced (e.g. having "IO priorities" to ensure that more important stuff is done first rather than just having a simple FIFO queue of pending operations).
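A toy model of the "list of pending operations" with IO priorities might look like this. It is purely illustrative and not how any real driver is written: ToyDriver, PendingOp and on_irq are hypothetical names, and "higher number means more important" is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct PendingOp {
    int priority;           // assumption: higher value = more important
    std::uint64_t arrival;  // sequence number, keeps FIFO order within a priority
    std::string name;
};

// Orders the heap so the highest priority, earliest arrival is on top.
struct LowerUrgency {
    bool operator()(const PendingOp& a, const PendingOp& b) const {
        if (a.priority != b.priority) return a.priority < b.priority;
        return a.arrival > b.arrival;
    }
};

class ToyDriver {
public:
    void submit(int priority, std::string name) {
        pending_.push(PendingOp{priority, next_arrival_++, std::move(name)});
    }
    // Simulated completion interrupt: choose the next operation to start.
    // Returns false when nothing is pending (the device would go idle).
    bool on_irq(std::string& started) {
        if (pending_.empty()) return false;
        started = pending_.top().name;
        pending_.pop();
        return true;
    }
private:
    std::priority_queue<PendingOp, std::vector<PendingOp>, LowerUrgency> pending_;
    std::uint64_t next_arrival_ = 0;
};

// Submits four operations and records the order in which they would be started.
std::vector<std::string> demo_order() {
    ToyDriver driver;
    driver.submit(0, "log flush");
    driver.submit(5, "page-in");
    driver.submit(0, "backup read");
    driver.submit(5, "swap write");
    std::vector<std::string> order;
    std::string op;
    while (driver.on_irq(op)) order.push_back(op);
    return order;
}
```

With a plain FIFO the four operations would start in submission order; here the two priority-5 operations go first, each group still in arrival order.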
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Let's say that you get info from deviceA (while CPU and deviceB are idle); then process that info a little (while deviceA and deviceB are idle); then write the result to deviceB (while deviceA and CPU are idle). You can see that most hardware is doing nothing most of the time (poor utilization).
With asynchronous IO; while deviceA is fetching the next piece of info the CPU can be processing the current piece of info while deviceB is writing the previous piece of info. Under ideal conditions (no speed mismatches) you can achieve 100% utilization (deviceA, CPU and deviceB are never idle); and even if there are speed mismatches (e.g. deviceB needs to wait for CPU to finish processing the current piece) the time anything spends idle will be minimized (and utilization maximized as much as possible).
The other alternative is to use multiple tasks - e.g. one task that fetches data from deviceA synchronously and notifies another task when the data was read; a second task that waits until data arrives and processes it and notifies another task when the data was processed; then a third task that waits until data was processed and writes it to deviceB synchronously. For utilization; this is effectively identical to using asynchronous IO (in fact it can be considered "emulation of asynchronous IO"). The problem is that you've added a bunch of extra overhead managing and synchronizing multiple tasks (more RAM spent on state and stacks, task switches, lock contention, ...); and made the code more complex and harder to maintain.
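The three-task arrangement can be sketched roughly as follows. Channel is a hypothetical hand-off helper built on a mutex and condition variable, the vectors stand in for deviceA and deviceB, and "processing" is just doubling the value; all of these are assumptions made for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical hand-off channel between two tasks.
template <typename T>
class Channel {
public:
    void send(T value) {
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(value)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> g(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until a value arrives; returns false once closed and drained.
    bool receive(T& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

// device_a stands in for the input device; the returned vector for deviceB.
std::vector<int> run_pipeline(const std::vector<int>& device_a) {
    Channel<int> raw, processed;
    std::vector<int> device_b;

    std::thread reader([&] {   // task 1: synchronous reads from "deviceA"
        for (int v : device_a) raw.send(v);
        raw.close();
    });
    std::thread worker([&] {   // task 2: CPU work on each piece
        int v = 0;
        while (raw.receive(v)) processed.send(v * 2);
        processed.close();
    });
    std::thread writer([&] {   // task 3: synchronous writes to "deviceB"
        int v = 0;
        while (processed.receive(v)) device_b.push_back(v);
    });

    reader.join();
    worker.join();
    writer.join();
    return device_b;
}
```

The extra mutexes, condition variables and thread stacks are exactly the overhead the answer is pointing at.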
Context switching is necessary in any case. The kernel always works in its own context. So synchronous access doesn't save processor time.
Usually, writing doesn't require a lot of processor work. The limiting factor is the disk response. The question is whether we wait for this response or do our work in the meantime.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If you implement synchronous access, your sequence is the following:
1. get information
2. write information
3. goto 1.
So, you can't get information until write() completes. Suppose the information supplier is as slow as the disk you write to. In this case the program will be twice as slow as the asynchronous one.
If the information supplier can't wait and save the information while you are writing, you will lose portions of information while writing. Examples of such information sources could be sensors for fast processes. In this case, you should synchronously read the sensors and asynchronously save the obtained values.
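One way to get the "read synchronously, save asynchronously" shape in standard C++ is std::async, as in this sketch. Sink and read_sensor are hypothetical stand-ins for the disk and the sensor, and the sketch keeps at most one write in flight, which is a simplification rather than a recommendation.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sink standing in for a slow disk.
struct Sink {
    std::vector<std::string> stored;
    void write(std::string record) { stored.push_back(std::move(record)); }
};

// Hypothetical source standing in for a sensor; read synchronously.
std::string read_sensor(int i) { return "sample " + std::to_string(i); }

std::vector<std::string> acquire(int samples) {
    Sink sink;
    std::future<void> pending;  // the write still in flight, if any
    for (int i = 0; i < samples; ++i) {
        std::string record = read_sensor(i);  // overlaps with the previous write
        if (pending.valid()) pending.get();   // wait so writes stay ordered
        pending = std::async(std::launch::async,
                             [&sink, record] { sink.write(record); });
    }
    if (pending.valid()) pending.get();       // drain the last write
    return sink.stored;
}
```

If a write takes longer than a sensor period, the pending.get() call will still stall the reader; a queue between the two threads would absorb longer hiccups.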
Asynchronous IO is not better than synchronous IO. Nor vice versa.
The question is which one is better for your use case.
Synchronous IO is generally simpler to code, but asynchronous IO can lead to better throughput and responsiveness at the expense of more complicated code.
I never had any benefit from asynchronous IO just for file access, but some applications may benefit from it.
Applications accessing "slow" IO like the network or a terminal have the most benefit. Using asynchronous IO allows them to do useful work while waiting for IO to complete. This can mean the ability to serve more clients or to keep the application responsive for the user.
(and "slow" just means that the time for an IO operation to finish is unbounded; it may even never finish, e.g. when waiting for a user to press enter or a network client to send a command)
In the end, asynchronous IO doesn't do less work, it's just distributed differently in time to reduce idle waiting.

High throughput non-blocking server design: Alternatives to busy wait

I have been building a high-throughput server application for multimedia messaging, language of implementation is C++. Each server can be used in stand-alone mode or many servers can be joined together to create a DHT-based overlay network; the servers act like super-peers like in case of Skype.
The work is in progress. Currently the server can handle around 200,000 messages per second (256 byte messages) and has a max throughput of around 256 MB/s on my machine (Intel i3 Mobile 2 GHz, Fedora Core 18 (64-bit), 4 GB RAM) for messages of length 4096 bytes. The server has got two threads, one thread for handling all IOs (epoll-based, edge triggered) another one for processing the incoming messages. There is another thread for overlay management, but it doesn't matter in the current discussion.
The two threads in discussion share data using two circular buffers. Thread number 1 enqueues fresh messages for the thread number 2 using one circular buffer, while thread number 2 returns back the processed messages through the other circular Buffer. The server is completely lock-free. I am not using any synchronization primitive what-so-ever, not even atomic operations.
The circular buffers never overflow, because the messages are pooled (pre-allocated on start). In fact all vital/frequently-used data-structures are pooled to reduce memory fragmentation and to increase cache efficiency, hence we know the maximum number of messages we are ever going to create when the server starts, hence we can pre-calculate the maximum size of the buffers and then initialize the circular buffers accordingly.
Now my question: Thread #1 enqueues the serialized messages one message at a time (actually the pointers to message objects), while thread #2 takes out messages from the queue in chunks (chunks of 32/64/128), and returns the processed messages in chunks through the second circular buffer. In case there are no new messages thread #2 keeps busy waiting, hence keeping one of the CPU cores busy all the time. How can I improve upon the design further? What are the alternatives to the busy wait strategy? I want to do this elegantly and efficiently. I have considered using semaphores, but I fear that is not the best solution for the simple reason that I have to call "sem_post" every time I enqueue a message in thread #1, which might considerably decrease the throughput, and the second thread has to call "sem_wait" an equal number of times to keep the semaphore from overflowing. Also I fear that a semaphore implementation might be using a mutex internally.
The second good option might be use of signal if I can discover an algorithm for raising signal only if the second thread has either "emptied the queue and is in process of calling sigwait" or is "already waiting on sigwait", in short the signal must be raised minimum number of times, although it won't hurt if signals are raised a few more times than needed. Yes, I did use Google Search, but none of the solutions I found on Internet were satisfactory. Here are a few considerations:
A. The server must waste minimum CPU cycles while making system calls, and system calls must be used a minimum number of times.
B. There must be very low overhead and the algorithm must be efficient.
C. No locking what-so-ever.
I want all options to be put on table.
Here is the link to the site where I have shared info about my server, to better understand the functionality and the purpose:
www.wanhive.com
Busy waiting is good if you need to wake up thread #2 as fast as possible. In fact this is the fastest way to notify one processor about changes made by another processor. You need to generate memory fences on both ends (write fence on one side and read fence on the other). But this statement holds true only if your both threads are executed on dedicated processors. In this case no context switching is needed, just cache coherency traffic.
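As a rough illustration of the write fence / read fence pairing, here is a single-producer, single-consumer ring in standard C++ where the fences are expressed as release stores and acquire loads on the indices. SpscRing and spin_transfer are hypothetical names, the capacity of 1024 is arbitrary, and this is a sketch of the idea rather than a tuned implementation (no cache-line padding, one slot is left unused to tell full from empty).

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

template <typename T, std::size_t Capacity>
class SpscRing {
public:
    // Called only by the producer thread.
    bool try_push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        slots_[head] = value;
        head_.store(next, std::memory_order_release);  // "write fence": publish slot
        return true;
    }
    // Called only by the consumer thread.
    bool try_pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = slots_[tail];                             // safe after the acquire load
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }
private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};

// Sends 1..count through the ring with both sides busy-waiting; returns the sum.
long long spin_transfer(int count) {
    SpscRing<int, 1024> ring;
    long long sum = 0;
    std::thread consumer([&] {
        int v = 0;
        for (int received = 0; received < count;) {
            if (ring.try_pop(v)) { sum += v; ++received; }  // otherwise keep spinning
        }
    });
    for (int i = 1; i <= count; ++i) {
        while (!ring.try_push(i)) { /* ring full: spin */ }
    }
    consumer.join();
    return sum;
}
```

Without the acquire/release pairs (or equivalent fences) the consumer has no guarantee of seeing the slot contents that go with a new index value.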
Some improvements can be made.
If thread #2 is mostly CPU bound and busy waits, it can be penalized by the scheduler (at least on Windows and Linux). The OS scheduler dynamically adjusts thread priorities to improve overall system performance. It reduces the priority of CPU-bound threads that consume large amounts of CPU time, to prevent starvation of other threads. You need to manually increase the priority of thread #2 to prevent this.
If you have a multicore or multiprocessor machine, you will end up with undersubscription of processors and your application won't be able to exploit hardware concurrency. You can mitigate this by using several processing threads (several copies of thread #2).
Parallelization of processing step.
There are two options.
Your messages are totally ordered and need to be processed in the same order as they arrived.
Messages can be reordered. Processing can be done in any order.
In the first case you need N cyclic buffers, N processing threads, N output buffers and one consumer thread. Thread #1 enqueues messages into those cyclic buffers in round-robin order.
// Thread #1 pseudocode
auto message = recv();
auto buffer_index = atomic_increment(&message_counter);
buffer_index = buffer_index % N; // N is the number of processing threads
// buffers is an array of cyclic buffers - Buffer* buffers[N];
Buffer* current_buffer = buffers[buffer_index];
current_buffer->enqueue(message);
Each thread consumes messages from one of the buffers and enqueues the result to its dedicated output buffer.
// Thread #i pseudocode
auto message = my_buffer->dequeue();
auto result = process(message);
my_output_buffer->enqueue(result);
Now you need to process all these messages in arrival order. You can do this with a dedicated consumer thread that dequeues processed messages from the output cyclic buffers in round-robin manner.
// Consumer thread pseudocode
// out_message_counter is equal to message_counter at start
auto out_buffer_index = atomic_increment(&out_message_counter);
out_buffer_index = out_buffer_index % N;
// out_buffers is array of output buffers that is used by processing
// threads
auto out_buffer = out_buffers[out_buffer_index];
auto result = out_buffer->dequeue();
send(result); // or whatever you need to do with result
In the second case, when you don't need to preserve message order, you don't need the consumer thread and the output cyclic buffers. You just do whatever you need to do with the result in the processing thread.
N should equal the number of CPUs minus 3 in the first case ("- 3" is one I/O thread + one consumer thread + one DHT thread) and the number of CPUs minus 2 in the second case ("- 2" is one I/O thread + one DHT thread). This is because busy waiting can't be effective if the processors are oversubscribed.
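Putting the ordered (first) case together, a compact runnable version of the pseudocode could look like the following. To keep it short, Buffer is a mutex-protected queue standing in for the lock-free cyclic buffers, the spin loop yields, "processing" is squaring, and -1 is used as a stop marker; all of these are assumptions of the sketch, not part of the design above. Since only one thread distributes messages here, a plain loop index replaces atomic_increment.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for a cyclic buffer; a mutex keeps the sketch short.
class Buffer {
public:
    void enqueue(int v) {
        std::lock_guard<std::mutex> g(m_);
        q_.push_back(v);
    }
    int dequeue() {  // spins (with yield) until a value is available
        for (;;) {
            {
                std::lock_guard<std::mutex> g(m_);
                if (!q_.empty()) {
                    int v = q_.front();
                    q_.pop_front();
                    return v;
                }
            }
            std::this_thread::yield();
        }
    }
private:
    std::mutex m_;
    std::deque<int> q_;
};

// Processes messages on n worker threads and returns results in arrival order.
std::vector<int> process_in_order(const std::vector<int>& messages, std::size_t n) {
    const int kStop = -1;  // assumption: real messages are non-negative
    std::vector<Buffer> in(n), out(n);

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < n; ++i) {
        workers.emplace_back([&in, &out, i, kStop] {
            for (;;) {
                int m = in[i].dequeue();
                if (m == kStop) break;
                out[i].enqueue(m * m);  // process(message)
            }
        });
    }

    std::vector<int> results;
    std::thread consumer([&] {  // collects round-robin, so order is preserved
        for (std::size_t k = 0; k < messages.size(); ++k)
            results.push_back(out[k % n].dequeue());
    });

    // "Thread #1": distribute round-robin, then tell every worker to stop.
    for (std::size_t k = 0; k < messages.size(); ++k) in[k % n].enqueue(messages[k]);
    for (std::size_t i = 0; i < n; ++i) in[i].enqueue(kStop);

    for (auto& w : workers) w.join();
    consumer.join();
    return results;
}
```

Order is preserved because message k always goes through buffer pair k % n and the consumer visits the output buffers in the same rotation.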
Sounds like you want to coordinate a producer and consumer connected by some shared state. At least in Java for such patterns, one way to avoid busy wait is to use wait and notify. With this approach, thread #2 can go into a blocked state if it finds that the queue is empty by calling wait and avoid spinning the CPU. Once thread #1 puts some stuff in the queue, it can do a notify. A quick search of such mechanisms in C++ yields this:
wait and notify in C/C++ shared memory
You can have thread #2 go to sleep for X milliseconds when the queue is empty.
X can be determined by the length of the queues you want + some guard band.
BTW, in user mode (ring3) you can't use MONITOR/MWAIT instructions which would be ideal for your question.
Edit
You should definitely give TBB's RWlock a try (there's a free version). Sounds like what you're looking for.
Edit2
Another option is to use condition variables. They involve a mutex and a condition. Basically you wait on the condition to become "true". The low level pthread stuff can be found here.
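A condition-variable version that also tries to keep the number of notifications low might look like this sketch. BatchQueue and consume_all are hypothetical names; the producer only notifies when the queue goes from empty to non-empty, and the consumer takes messages in chunks of up to 64, mirroring the chunked dequeue in the question. It does use a mutex, so it does not meet the "no locking" requirement.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

class BatchQueue {
public:
    void push(int v) {
        bool was_empty;
        {
            std::lock_guard<std::mutex> g(m_);
            was_empty = q_.empty();
            q_.push_back(v);
        }
        // The consumer can only be asleep if the queue was empty, so one
        // notification per empty -> non-empty transition is enough.
        if (was_empty) cv_.notify_one();
    }
    // Sleeps while the queue is empty, then moves up to max_items into batch.
    void pop_batch(std::vector<int>& batch, std::size_t max_items) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        batch.clear();
        while (!q_.empty() && batch.size() < max_items) {
            batch.push_back(q_.front());
            q_.pop_front();
        }
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<int> q_;
};

// Pushes 1..count from the calling thread; a consumer thread sums them in chunks.
long long consume_all(int count) {
    BatchQueue q;
    long long sum = 0;
    std::thread consumer([&] {
        std::vector<int> batch;
        for (int received = 0; received < count;) {
            q.pop_batch(batch, 64);
            for (int v : batch) sum += v;
            received += static_cast<int>(batch.size());
        }
    });
    for (int i = 1; i <= count; ++i) q.push(i);
    consumer.join();
    return sum;
}
```

Under load the queue is rarely empty, so most pushes skip the notify call entirely; when idle, the consumer sleeps instead of burning a core.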

boost threadpool with a thread that handles an IO queue

I recently began experimenting with the pseudo-boost threadpool (pseudo because it hasn't been officially accepted yet).
As a simple exercise, I initialized the threadpool with a maximum of two threads.
Each task does two things:
a CPU-intensive calculation
writes out the result to disk
Question
How do I modify the model into a threadpool that does:
a CPU-intensive calculation
and a single I/O thread which listens for completion from the threadpool - takes the resultant memory and simply:
writes out the result to disk
Should I simply have the task communicate to the I/O thread (spawned as std::thread) through a std::condition_variable (essentially a mutexed queue of calculation results) or is there a way to do it all within the threadpool library?
Or is the gcc 4.6.1 implementation of future and promise mature enough for me to pull this off?
Answer
It looks like a simple mutex queue with a condition variable works fine.
By grouping read access and writes, in addition to using the threadpool, I got the following improvements:
2 core machine: 1h14m down to 33m (46% reduction in runtime)
4 core vm: 40m down to 18m (55% reduction in runtime)
Thanks to Martin James for his thoughtful answer. Before this exercise, I thought that my next computational server should have dual processors and a ton of memory. But now, with so much processing power inherent in the multiple cores and hyperthreading, I realize that money will probably be better spent dealing with the I/O bottleneck.
As Martin mentioned, having multiple drives or RAID configurations would probably help. I will also look into adjusting I/O buffer settings at the kernel level.
If there is only one local disk, one writer thread on the end of a producer-consumer queue would be my favourite. Seeks, networked-disk delays and other hiccups will not leave any pooled threads that have finished their calculation stuck trying to write to the disk. Other disk operations (e.g. selecting another location/file/folder) are also easier/quicker if only one thread is accessing the disk - the queue will take up the slack and allow seamless calculation during the latency.
Writing directly from the calculation task or submitting the result-write as a separate task would work OK but you would need more threads in the pool to achieve pause-free operation.
Everything changes if there is more than one disk. More than one writer thread would then become a worthwhile proposition because of the increased overall performance. I would then probably go with an array/list of queues/write-threads, one for each disk.

Threads ordering in C++/Linux

I'm currently doing a simulation of a hard disk drive IOs in C++, and I'm using pthread threads and a mutex to do the reading on the disk.
However, I'm trying to optimize the reading time by ordering my threads. The problem is that if my disk is currently reading a sector, and a bunch of read requests arrive, any of them may be executed next. What I want is to order them so that the request with the closest sector is executed next.
This way, the head of the virtual hard disk drive won't move excessively.
My question is: Is using the Linux process priority system a good way to make sure that the closest read request will be executed before the others? If not, what could I rely on to do this?
PS: Sorry for my english.
Thanks for your help.
It is very rarely a good idea to rely on the exact behaviour of process priority schemes, especially on a general purpose operating system like Linux, because they don't really guarantee you any particular behaviour. Making something the very highest priority won't help if it references some address in memory or makes some I/O call that causes it to be held up for an instant - the operating system will then run some lower priority process instead, and you will be unpleasantly surprised.
If you want to be sure of the order in which disk I/O requests are completed, or to simulate this, you could create a thread that keeps a list of pending I/O and asks for the requests to be executed one at a time, in an order it controls.
The I/O schedulers in the Linux kernel can re-order and coalesce reads (and to some extent writes) so that their ordering is more favorable for the disk, just like you are describing. This affects the process scheduler (which takes care of threads too) in that the threads waiting for I/O also get "re-ordered" - their read or write requests complete in the order in which the disk served them, not in the order in which they made their request. (This is a very simplified view of what really happens.)
But if you're simulating disk I/O, i.e. if you're not actually doing real I/O, the I/O scheduler isn't involved at all. Only the process scheduler. And the process scheduler has no idea that you're "simulating" a hard disk - it has no information about what the processes are doing, just information about whether or not they're in need of CPU resources. (Again this is a simplified view of how things work).
So the process scheduler will not help you in re-ordering or coalescing your simulation of read requests. You need to implement that logic in your code. (Reading about I/O schedulers is a great idea.)
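The re-ordering logic itself can be quite small. Below is a sketch of a "closest sector next" policy (often called shortest seek time first); take_closest and service_order are hypothetical names, sectors are plain ints, and in the actual simulation the pending list would be shared with the requesting threads under the mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Removes and returns the pending sector closest to the current head position.
// Precondition: pending is not empty.
int take_closest(std::vector<int>& pending, int head) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < pending.size(); ++i) {
        if (std::abs(pending[i] - head) < std::abs(pending[best] - head)) best = i;
    }
    int sector = pending[best];
    pending.erase(pending.begin() + static_cast<std::ptrdiff_t>(best));
    return sector;
}

// Serves every pending request, always moving to the nearest one next.
std::vector<int> service_order(std::vector<int> pending, int head) {
    std::vector<int> order;
    while (!pending.empty()) {
        head = take_closest(pending, head);  // the head moves to the served sector
        order.push_back(head);
    }
    return order;
}
```

For example, with the head at sector 53 and requests for 98, 183, 37, 122, 14, 124, 65 and 67, this serves 65, 67, 37, 14, 98, 122, 124, 183. Note that a pure closest-first policy can starve far-away requests if nearby ones keep arriving.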
If you do submit real I/O, then doing the re-ordering yourself could improve performance in some situations, and indeed the I/O scheduler's algorithms for optimizing throughput or latency will affect the way your threads are scheduled (for blocking I/O anyway - asynchronous I/O makes it a bit more complicated still).