I need to make a number of GET requests to a server to download a bunch of JSON files, writing each download to disk, and I want to launch some threads to speed that up.
Downloading and writing each file takes approximately 0.35 seconds.
I would like to know whether, under Linux at least (and under Windows while we're at it), it is safe to write to disk in parallel, and how many threads I can launch given the waiting time of each thread.
If it makes a difference (I suspect it does): the program doesn't write to disk directly. It just calls std::system to run wget, because that is currently easier than pulling in a library. So the waiting time is the time the system call takes to return.
So each write to disk is performed by a different process. I only wait for that program to finish, so I'm not actually bound by I/O but by the running time of an external process (each wget call creates and writes to a different file, so the processes are completely independent). Each thread just waits for one call to complete.
My machine has 4 CPUs.
Some kind of formula to get an ideal number of threads according to CPU concurrency and "waiting time" per thread would be welcome.
NOTE: The ideal solution would of course be to do some performance testing, but I could get banned from the server if I abuse it with too many requests.
It is safe to do concurrent file I/O from multiple threads, but if you are concurrently writing to the same file, some form of synchronization is necessary to ensure that the writes to the file don't become interleaved.
For what you describe as your problem, it is perfectly safe to fetch each JSON blob in a separate thread and write them to different, unique files (in fact, this is probably the sanest, simplest design). Given that you mention running on a 4-core machine, I would expect to see a speed-up well past the four concurrent thread mark; network and file I/O tends to do quite a bit of blocking, so you'll probably run into a bottleneck with network I/O (or on the server's ability to send) before you hit a processing bottleneck.
Write your code so that you can control the number of threads that are spawned, and benchmark different numbers of threads. I'll guess that your sweet spot will be somewhere between 8 and 16 threads.
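As for a formula: a common rule of thumb for sizing a thread pool is threads ≈ cores × (1 + wait time / compute time). With ~0.35 s of waiting and almost no CPU work per download, that heuristic suggests "a lot", so in practice the cap is set by the network and by how many simultaneous requests the server tolerates, which is why measuring is the real answer. Below is a minimal sketch of a downloader with a tunable thread count, matching the std::system/wget approach from the question; the URL list, output file naming, and wget flags are illustrative.

```cpp
// Minimal sketch: N worker threads pull URLs from a shared counter and
// invoke wget via std::system, each writing to its own output file
// (so no synchronization of the file writes is needed).
#include <cstdlib>
#include <string>
#include <vector>
#include <thread>
#include <atomic>

static std::vector<std::string> urls;           // fill with the GET URLs
static std::atomic<std::size_t> next_index(0);  // shared work counter

static void worker() {
    for (;;) {
        std::size_t i = next_index.fetch_add(1);
        if (i >= urls.size()) return;
        // Single quotes assume a POSIX shell; each call writes its own file.
        std::string cmd = "wget -q -O out_" + std::to_string(i) + ".json '"
                        + urls[i] + "'";
        std::system(cmd.c_str());
    }
}

int main() {
    const unsigned n_threads = 8;   // benchmark 4, 8, 16, ... and compare
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t)
        pool.emplace_back(worker);
    for (std::size_t t = 0; t < pool.size(); ++t)
        pool[t].join();
}
```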
Related
I keep reading about why asynchronous IO is better than synchronous IO: with async IO your program can keep running, while with sync IO you're blocked until the operation is finished.
I don't understand this claim, because with sync IO (such as write()) it is the kernel that writes the data to the disk; it doesn't happen by itself. The kernel does need CPU time to do it.
With async IO the kernel needs that CPU time as well, which might result in a context switch from my application to the kernel. So it's not that nothing blocks: CPU cycles still have to be spent to run the operation.
Is that correct?
Is the difference between the two that we assume disk access is slow, so compared to sync IO, where you wait for the data to be written to disk, with async IO the time you would spend waiting can be used to continue application processing, while the kernel's part of writing to disk is small?
Let's say I have an application whose only job is to get info and write it into files. Is there any benefit to using async IO instead of sync IO?
Examples of sync IO:
write()
Examples of async IO:
io_uring (as I understand it, this also supports zero copy, which is a benefit)
spdk (should be best, though I don't understand how to use it)
aio
Your understanding is partly right, but which tools you use is a matter of which programming model you prefer, and doesn't determine whether your program will freeze waiting for I/O operations to finish. For certain specialized, very-high-load applications, some models are marginally to moderately more efficient, but unless you're in such a situation, you should pick the model that makes it easy to write and maintain your program and have it be portable to the systems you and your users care about, not the one someone is marketing as high-performance.
Traditionally, there were two ways to do I/O without blocking:
Structure your program as an event loop performing select (nowadays poll; select is outdated and has critical flaws) on a set of file descriptors that might be ready for reading input or accepting output. This requires keeping some sort of explicit state for partial input that you're not ready to process yet and for pending output that you haven't been able to write out yet.
Separate I/O into separate execution contexts. Historically the unixy approach to this was separate processes, and that can still make sense when you have other reasons to want separate processes anyway (privilege isolation, etc.) but the more modern way to do this is with threads. With a separate execution context for each I/O channel you can just use normal blocking read/write (or even buffered stdio functions) and any partial input or unfinished output state is kept for you implicitly in the call frame stack/local variables of its execution context.
Note that, of the above two options, only the latter helps with stalls from disk access being slow, as regular files are always "ready" for input and output according to select/poll.
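A minimal sketch of the first (event-loop) approach, polling a single descriptor for readability and copying what arrives; a real loop would track many descriptors and buffer pending output per channel (the function name is illustrative):

```cpp
#include <poll.h>
#include <unistd.h>

// Copy data from in_fd to out_fd, sleeping in poll() until input is ready.
int echo_loop(int in_fd, int out_fd) {
    struct pollfd fds[1];
    fds[0].fd = in_fd;
    fds[0].events = POLLIN;
    char buf[4096];
    for (;;) {
        if (poll(fds, 1, -1) < 0)
            return -1;                     // interrupted or failed
        if (fds[0].revents & (POLLIN | POLLHUP)) {
            ssize_t n = read(in_fd, buf, sizeof buf);
            if (n <= 0)
                return (int)n;             // EOF (0) or error (-1)
            // NOTE: write() can be short or block; a full event loop would
            // save the remainder as pending output and poll for POLLOUT.
            write(out_fd, buf, (size_t)n);
        }
    }
}
```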
Nowadays there's a trend, probably owing largely to languages like JavaScript, towards a third approach, the "async model", with event handler callbacks. I find it harder to work with, requiring more boilerplate code, and harder to reason about than either of the above methods, but plenty of people like it. If you want to use it, it's probably preferable to do so with a library that abstracts the Linuxisms you mentioned (io_uring, etc.) so your program can run on other systems and doesn't depend on the latest Linux fads.
Now to your particular question:
Let's say I have an application whose only job is to get info and write it into files. Is there any benefit to using async IO instead of sync IO?
If your application has a single input source (no interactivity) and a single output, e.g. like most unix commands, there is absolutely no benefit to any kind of async I/O, regardless of the programming model (event loop, threads, async callbacks, whatever). The simplest and most efficient thing to do is just read and write.
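In that case the whole program can be a plain blocking copy loop; a minimal sketch (the function name is illustrative):

```cpp
#include <unistd.h>

// Synchronously copy everything from in_fd to out_fd, handling short writes.
int copy_stream(int in_fd, int out_fd) {
    char buf[65536];
    ssize_t n;
    while ((n = read(in_fd, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {                  // write() may not take it all
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) return -1;
            off += w;
        }
    }
    return n < 0 ? -1 : 0;                 // -1 on read error, 0 on EOF
}
```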
The kernel does need CPU time to do it.
Is that correct?
Pretty much, yes.
Is the difference between the two that we assume disk access is slow ... with async IO the time you would spend waiting can be used to continue application processing, while the kernel's part of writing to disk is small?
Exactly.
Let's say I have an application whose only job is to get info and write it into files. Is there any benefit to using async IO instead of sync IO?
That depends on many factors. How does the application "get info"? Is it CPU-intensive? Does it use the same IO as the writing? Is it a service that processes multiple requests concurrently? How many simultaneous connections are there? Is performance important in the first place? In some cases, yes, there may be a significant benefit to using async IO. In other cases you may get most of that benefit by doing sync IO in a separate thread. And in still other cases, single-threaded sync IO is sufficient.
I don't understand this claim, because with sync IO (such as write()) it is the kernel that writes the data to the disk; it doesn't happen by itself. The kernel does need CPU time to do it.
No. Most modern devices are able to transfer data to/from RAM by themselves (using DMA or bus mastering).
For example, the CPU might tell a disk controller "read 4 sectors into RAM at address 0x12345000", and then the CPU can do anything else it likes while the disk controller does the transfer (and will be interrupted by an IRQ from the disk controller when it has finished transferring the data).
However, on modern systems (where any number of processes may want to use the same device at the same time) the device driver has to maintain a list of pending operations. In this case (under load), when the device generates an IRQ to say that it finished an operation, the device driver responds by telling the device to start the next pending operation. That way the device spends almost no time idle waiting to be asked to start the next operation (much better device utilization), and the CPU spends almost all of its time doing something else (between IRQs).
Of course the hardware is often more advanced (e.g. having an internal queue of operations itself, so the driver can tell it to do multiple things and it can start the next operation as soon as it finishes the previous one); and drivers are often more advanced too (e.g. having "IO priorities" to ensure that more important stuff is done first, rather than just a simple FIFO queue of pending operations).
Let's say I have an application whose only job is to get info and write it into files. Is there any benefit to using async IO instead of sync IO?
Let's say that you get info from deviceA (while the CPU and deviceB are idle); then process that info a little (while deviceA and deviceB are idle); then write the result to deviceB (while deviceA and the CPU are idle). You can see that most of the hardware is doing nothing most of the time (poor utilization).
With asynchronous IO; while deviceA is fetching the next piece of info the CPU can be processing the current piece of info while deviceB is writing the previous piece of info. Under ideal conditions (no speed mismatches) you can achieve 100% utilization (deviceA, CPU and deviceB are never idle); and even if there are speed mismatches (e.g. deviceB needs to wait for CPU to finish processing the current piece) the time anything spends idle will be minimized (and utilization maximized as much as possible).
The other alternative is to use multiple tasks - e.g. one task that fetches data from deviceA synchronously and notifies another task when the data has been read; a second task that waits for the data, processes it, and notifies another task when it has been processed; and a third task that waits for the processed data and writes it to deviceB synchronously. In terms of utilization, this is effectively identical to using asynchronous IO (in fact it can be considered "emulation of asynchronous IO"). The problem is that you've added a bunch of extra overhead managing and synchronizing the tasks (more RAM spent on state and stacks, task switches, lock contention, ...), and made the code more complex and harder to maintain.
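To make that "multiple tasks" alternative concrete, here is a minimal C++11 sketch of such a pipeline; deviceA and deviceB are simulated with a counter and stdout, and Channel is a hand-rolled blocking queue (all names illustrative):

```cpp
#include <cstdio>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <queue>

template <typename T>
class Channel {                            // minimal blocking FIFO queue
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void put(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(v); }
        cv_.notify_one();
    }
    T get() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T v = q_.front(); q_.pop();
        return v;
    }
};

int main() {
    Channel<int> raw, cooked;
    const int kDone = -1;                  // sentinel to shut the pipeline down

    std::thread fetcher([&] {              // task 1: "read from deviceA"
        for (int i = 0; i < 10; ++i) raw.put(i);
        raw.put(kDone);
    });
    std::thread processor([&] {            // task 2: CPU work
        for (;;) {
            int v = raw.get();
            if (v == kDone) { cooked.put(kDone); break; }
            cooked.put(v * 2);
        }
    });
    std::thread writer([&] {               // task 3: "write to deviceB"
        for (;;) {
            int v = cooked.get();
            if (v == kDone) break;
            std::printf("%d\n", v);
        }
    });

    fetcher.join(); processor.join(); writer.join();
}
```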
Context switching is necessary in any case; the kernel always works in its own context. So synchronous access doesn't save processor time.
Usually, writing doesn't require much processor work; the limiting factor is the disk's response. The question is whether we wait for that response or do our own work in the meantime.
Let's say I have an application whose only job is to get info and write it into files. Is there any benefit to using async IO instead of sync IO?
If you implement synchronous access, your sequence is the following:
1. get information
2. write information
3. goto 1.
So you can't get new information until write() completes. Suppose the information supplier is as slow as the disk you write to; then the program will be twice as slow as an asynchronous one.
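For example, if each read takes 10 ms and each write takes 10 ms, the synchronous loop costs 20 ms per record, while overlapping the read of the next record with the write of the current one brings the steady-state cost down to about 10 ms per record, which is exactly the factor of two above.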
If the information supplier can't wait and buffer the information while you are writing, you will lose portions of it during writes. Examples of such information sources are sensors for fast processes. In that case, you should read the sensors synchronously and save the obtained values asynchronously.
Asynchronous IO is not better than synchronous IO. Nor vice versa.
The question is which one is better for your use case.
Synchronous IO is generally simpler to code, but asynchronous IO can lead to better throughput and responsiveness at the expense of more complicated code.
I never had any benefit from asynchronous IO just for file access, but some applications may benefit from it.
Applications accessing "slow" IO, like the network or a terminal, benefit the most. Using asynchronous IO allows them to do useful work while waiting for IO to complete. This can mean the ability to serve more clients or to keep the application responsive for the user.
(And "slow" just means that the time for an IO operation to finish is unbounded; it may even never finish, e.g. when waiting for a user to press enter or for a network client to send a command.)
In the end, asynchronous IO doesn't do less work, it's just distributed differently in time to reduce idle waiting.
I am wondering whether there is any difference in performance between running a single program (exe) with 10 threads and running that single-threaded program 10 times in parallel (starting it from a .bat file), assuming the work done is the same and only the number of threads spawned by each program changes.
I am developing a client/server communication program and want to test it for throughput. I'm currently learning about parallel programming and threading, and wasn't sure how Windows would handle the above scenario. Will the scheduler schedule work the same way in both cases? Will there be a performance difference?
The machine the program runs on has 4 hardware threads.
Threads are slightly lighter weight than processes, as there are many things a process gets its own copy of. This is especially true when you compare the time it takes to start a new thread with the time it takes to start a new process from scratch (fork, where available, also avoids a lot of that cost). In either case, though, you can generally get even better performance by using a worker pool where possible, rather than repeatedly starting and stopping fresh processes/threads.
The other major difference is that threads by default all share the same memory, while processes each get their own and need to communicate through more explicit means (which may include blocks of shared memory). This can make it easier for a threaded solution to avoid copying data, but it is also one of the dangers of multithreaded programming when care is not taken with how shared memory/objects are used.
There may also be APIs that are more oriented around a single process. For example, on Windows there is I/O Completion Ports, which basically works on the idea of having many in-progress IO operations for different files, sockets, etc., with multiple threads (generally far fewer than the number of files/sockets) handling the results as they become available through a GetQueuedCompletionStatus loop.
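For reference, a minimal sketch of that GetQueuedCompletionStatus loop; the setup, the per-operation dispatch, and the worker name are illustrative, and error handling is trimmed:

```cpp
#include <windows.h>

// Each worker thread blocks on the completion port and handles whichever
// I/O operation finishes next, regardless of which handle it belongs to.
DWORD WINAPI worker(LPVOID param) {
    HANDLE iocp = (HANDLE)param;
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;         // per-handle context given at association
        OVERLAPPED* ov = NULL;     // identifies the individual operation
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
        if (!ok && ov == NULL) break;  // the wait itself failed / port closed
        // (!ok with a non-NULL ov means a failed I/O operation completed)
        // ... dispatch on key/ov, then issue the next overlapped operation ...
    }
    return 0;
}

// Setup sketch:
//   HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
//   CreateIoCompletionPort(handle, iocp, (ULONG_PTR)context, 0); // per handle
//   ...spawn a few workers with CreateThread(NULL, 0, worker, iocp, 0, NULL);
```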
I am working on some code which is extremely demanding performance-wise (I am using microsecond timers!). It has a server<->client architecture where a lot of data is shared at high speed. To keep the client and the server in sync, a simple "sequence number" based approach is followed, such that if the client's program crashes, the client can "resume" communication by sending the server the last sequence number, and they can "resume operations" without missing anything.
The issue with this is that I am forced to write sequence numbers to disk. Sadly this has to be done on every "transaction". This file write causes huge time costs (as we would expect).
So I thought I would use threads to get around this problem. However, with a regular thread I would have to wait for the file write to finish anyway, and with a detached thread I am doing something risky: the thread might not have finished when my process is killed (say), and then the sequence number gets messed up.
What are my options here? Kindly note that, sadly, I do not have access to C++11. I am using pthreads (-lpthread) on Linux.
You can just push the data onto a queue and have secondary threads dequeue it, write it, and signal when they're done.
You can also take some inspiration from log-based file systems. They get around this problem by having the main thread first write a small record to a log file and return control immediately to the rest of the program. Meanwhile, secondary threads carry out the actual data write and signal completion by also writing to the log file. This helps you maintain throughput by deferring writes to when more system resources are available, and it doesn't block the main thread. Read more about it here
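Here is a minimal pthread sketch of the queue idea (no C++11 needed), under the assumption that only the newest sequence number matters, so the "queue" collapses to a single slot; the file names and function names are illustrative:

```cpp
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  g_cond = PTHREAD_COND_INITIALIZER;
static unsigned long g_seq = 0;     // latest sequence number
static int g_dirty = 0;             // set when g_seq has changed
static int g_stop = 0;              // set at shutdown

// Called on every transaction: cheap, never touches the disk.
void publish_seq(unsigned long seq) {
    pthread_mutex_lock(&g_lock);
    g_seq = seq;
    g_dirty = 1;
    pthread_cond_signal(&g_cond);
    pthread_mutex_unlock(&g_lock);
}

// Dedicated writer thread: persists the newest number, skipping stale ones.
void* writer_main(void* unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&g_lock);
        while (!g_dirty && !g_stop)
            pthread_cond_wait(&g_cond, &g_lock);
        if (!g_dirty && g_stop) { pthread_mutex_unlock(&g_lock); break; }
        unsigned long seq = g_seq;  // snapshot the newest value
        g_dirty = 0;
        pthread_mutex_unlock(&g_lock);

        FILE* f = fopen("seqno.tmp", "w");
        if (f) {
            fprintf(f, "%lu\n", seq);
            fclose(f);
            rename("seqno.tmp", "seqno.dat");  // atomic replace on POSIX
        }
    }
    return NULL;
}
```

At shutdown, set g_stop under the lock, signal the condition, and pthread_join the writer so the last sequence number is flushed before the process exits; that avoids the detached-thread risk you describe.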
I have created a module that transfers data over multiple sockets using TCP client/server communication. It transfers a 20 MB file in 10 seconds.
Each socket sends/receives its data in a separate thread.
When I launch the module from another worker thread, the time taken to send the same file increases to 40 seconds.
Please let me know of any solutions to avoid this time lag.
Are you synchronizing the threads that read the content from the file on the client side and write it back to a file on the server side? That adds time.
On top of that, by default you will have context-switching time between the multiple threads on both the client and the server.
One problem may be disk caching and seeking. If you are not already doing this, try interleaving the blocks transferred by different threads more finely (say, at 4 KB granularity, so bytes 0...4095 are transferred by the 1st thread, 4096...8191 by the 2nd thread, etc.; see the sketch below).
Also avoid mutexes, for example by having each thread know what it's supposed to read and send, or write and receive, when the thread starts, so no inter-thread communication is needed. Aborting the whole transfer can be done with an atomic flag variable (checked by each thread after transferring a block) instead of mutexes.
Also, on the receiving end, make sure to buffer in memory so that you write to the destination file sequentially. That is, if one thread transfers blocks faster than another, those "early" blocks are simply kept in memory until all the preceding blocks have been received and written.
If buffer size becomes an issue, you may need some inter-thread synchronization at one end (it doesn't matter much whether you slow down receiving or sending) to prevent the fastest thread from getting too far ahead of the slowest, but for file sizes on the order of tens of megabytes, on PCs, this should not become an issue.
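A sketch of the interleaving arithmetic: thread tid of nthreads owns every nthreads-th 4 KB block, so each thread can compute its offsets up front and no inter-thread communication is needed (BLOCK_SIZE and the helper name are illustrative):

```cpp
#include <stddef.h>

enum { BLOCK_SIZE = 4096 };

// File offset of the k-th block owned by thread `tid` out of `nthreads`.
// Thread 0 gets bytes 0...4095, thread 1 gets 4096...8191, and so on.
static size_t block_offset(size_t k, unsigned tid, unsigned nthreads) {
    return (k * nthreads + tid) * (size_t)BLOCK_SIZE;
}
// Each thread loops k = 0, 1, 2, ... transferring the block at
// block_offset(k, tid, nthreads) until it runs past the end of the file.
```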
I'm writing a TCP server on Windows Server 2k8. The server receives data, parses it, and then sinks it into a database. I'm now doing some tests, but the results surprise me.
The application is written in C++ and uses, directly, Winsock and the Windows API. It creates a thread for each client connection. For each client it reads and parses the data, then inserts it into the database.
Clients always connect in simultaneous chunks - i.e. every once in a while about 5 clients (I control this parameter) will connect simultaneously and feed the server with data.
I've been timing both the reading-and-parsing and the database-related stages of each thread.
The first stage (reading-and-parsing) shows curious behavior. The amount of time each thread takes is roughly the same across threads, but it is also proportional to the number of threads connecting. The server is not CPU-starved: it has 8 cores and there are always fewer than 8 connected threads.
For example, with 3 simultaneous threads, 100k rows (per thread) are read and parsed in about 4.5 s. But with 5 threads it takes 9.1 s on average!
A friend of mine suggested this scaling behavior might be related to the fact that I'm using blocking sockets. Is that right? If not, what might be the reason for this behavior?
If it is, I'd be glad if someone could point me to good resources for understanding non-blocking sockets on Windows.
Edit:
Each client thread reads a line (i.e., all chars until a '\n') from the socket, then parses it, then reads again, until the parse fails or a terminator character is found. My readline routine is based on this:
http://www.cis.temple.edu/~ingargio/cis307/readings/snaderlib/readline.c
with the static variables declared as __declspec(thread).
The parsing, judging from the non-networking version, is efficient (about 2 s for 100k rows). I therefore assume the problem is in the multithreaded/network version.
If your lines are ~120–150 characters long, you are actually saturating the network!
There's no issue with the sockets. Simply transferring 3 × 100k lines of 150 bytes each over a 100 Mbps line (allowing roughly 10 bits per byte on the wire to account for headers) will take... 4.5 s! There is no problem with the sockets, blocking or otherwise. You've simply hit the limit of how much data you can feed it.
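Spelled out: 3 × 100,000 lines × 150 bytes ≈ 45 MB of payload; at roughly 10 bits per payload byte on the wire that is about 450 Mbit, and 450 Mbit / 100 Mbps = 4.5 s, matching the timing you measured.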
Non-blocking sockets are only useful if you want one thread to service multiple connections. Using non-blocking sockets and a poll/select loop means that your thread does not sit idle while waiting for new connections.
In your case this is not an issue since there is only one connection per thread so there is no issue if your thread is waiting for input.
That leads to your original question of why things slow down when you have more connections. Without further information, the most likely culprit is that you are network-limited: your network cannot feed your server data fast enough.
If you are interested in non-blocking sockets on Windows, do a search on MSDN for the OVERLAPPED APIs.
You could be running into other threading related issues, like deadlocks/race conditions/false sharing, which could be destroying the performance.
One thing to keep in mind is that although you have one thread per client, Windows will not automatically ensure they all run on different cores. If some of them are running on the same core, it is possible (although unlikely) to have your server in a sort of CPU-starved state, with some cores at 100% load and others idle. There are simply no guarantees as to how the OS spreads the load (in the default case).
To explicitly assign threads to particular cores, you can use SetThreadAffinityMask. It may or may not be worth having a bit of a play around with this to see if it helps.
On the other hand, this may not have anything to do with it at all. YMMV.
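If you do want to experiment with affinity, a minimal sketch (the core index and function name are illustrative; measure before and after, since pinning can hurt as easily as help):

```cpp
#include <windows.h>

// Pin the calling thread to one logical CPU; returns FALSE on failure.
BOOL pin_current_thread_to_core(DWORD core) {
    DWORD_PTR mask = (DWORD_PTR)1 << core;  // one bit per logical CPU
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}
```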