I have a multi-threaded C++11 program in which each thread produces a large amount of data that needs to be written to disk. All the data needs to be written into a single file. At the moment, I use a mutex that protects access to the file from multiple threads. My friend suggested that I use one file per thread, then at the end merge the files into one with the cat command, invoked from C++ via system().
My concern is that if the cat command reads all the data back from the disk and then writes it again, this time into a single file, the result won't be any better. I googled but couldn't find implementation details for cat. How does it work, and will it accelerate the whole procedure?
Edit:
Chronology of events is not important, and there's no ordering constraint on the contents of the files. Both methods accomplish what I want.
You don't specify if you have some ordering or structuring constraints on the content of the file. Generally that is the case, so I'll treat it as such, but my solution should hopefully work either way.
The classical programmatic approach
The idea is to offload the work of writing to disk to a dedicated IO thread, and have a multiple-producers/one-consumer queue to queue up all the write commands. Each worker thread simply formats its output as a string and pushes it onto the queue. The IO thread pops batches of messages from the queue into a buffer and issues the write commands.
Alternatively, you could add a field in your messages to indicate which worker emitted the write command, and have the IO thread write to different files, if needed.
For better performance, it is also interesting to look into asynchronous versions of the IO system primitives (read/write), if your host OS supports them. The IO thread would then be able to monitor several concurrent IOs, and feed the OS with new ones as soon as one terminates.
As has been advised in comments, you will have to monitor the IO thread for congestion, and tune the number of workers accordingly. The "natural" feedback-based mechanism is to simply make the queue bounded: workers will then block on the lock until space frees up. This lets you control the amount of produced data at any point during the process's life, which is an important point in memory-constrained scenarios.
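To make this concrete, here is a minimal C++11 sketch of that pattern (the file name "output.dat", the capacity of 1024, and the record format are arbitrary placeholders): a bounded queue guarded by a mutex and two condition variables, drained by a single IO thread writing to one std::ofstream.

#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

class BoundedQueue {                       // multiple producers, one consumer
    std::queue<std::string> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    const std::size_t cap_ = 1024;         // the bound provides back-pressure
    bool closed_ = false;
public:
    void push(std::string s) {             // blocks while the queue is full
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&]{ return q_.size() < cap_; });
        q_.push(std::move(s));
        not_empty_.notify_one();
    }
    bool pop(std::string& out) {           // returns false once closed and drained
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&]{ return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return true;
    }
    void close() {                         // no more data will be produced
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        not_empty_.notify_all();
    }
};

int main() {
    BoundedQueue queue;
    std::thread io([&] {                   // the single consumer / IO thread
        std::ofstream out("output.dat", std::ios::binary);
        std::string msg;
        while (queue.pop(msg)) out.write(msg.data(), msg.size());
    });
    std::vector<std::thread> workers;
    for (int w = 0; w < 4; ++w)
        workers.emplace_back([&, w] {      // each worker formats, then pushes
            for (int i = 0; i < 1000; ++i)
                queue.push("worker " + std::to_string(w) + " record "
                           + std::to_string(i) + "\n");
        });
    for (auto& t : workers) t.join();
    queue.close();                         // unblocks the IO thread once drained
    io.join();
}

When the queue is full, push() blocks the producers until the IO thread catches up, which is exactly the feedback mechanism described above.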
Your cat concerns
As for cat, this command-line tool simply reads whatever is written to its input (the files named on its command line, or stdin) and duplicates it to its output (stdout). It's as simple as that, and you can clearly see the similarity with the solution advocated above. The difference is that cat doesn't understand the file's internal structure (if any); it only deals with byte streams, which means that if several processes wrote concurrently to a cat input without synchronization, the resulting output would probably be completely mixed up. Another issue is the atomicity (or lack thereof) of IO primitives.
NB: On some systems, there's a neat little feature called a fork (e.g. HFS resource forks, or alternate data streams on NTFS), which lets you have several "independent" streams of data multiplexed in a single file. If you happen to work on a platform supporting that feature, you could have all your data streams bundled in a single file, but separately reachable.
Related
I keep reading about why asynchronous IO is better than synchronous IO, which is because in a-sync IO your program can keep running, while in sync IO you're blocked until the operation is finished.
I do not understand this saying, because using sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.
So in a-sync IO, the kernel needs CPU time as well, which might result in a context switch from my application to the kernel. So it's not really blocking, but CPU cycles still need to be spent to run this operation.
Is that correct?
Is the difference between those two that we assume disk access is slow, so compared to sync IO where you wait for the data to be written to disk, in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Examples of sync IO:
write()
Examples of async IO:
io_uring (which, as I understand it, supports zero-copy as well, so that's a benefit)
spdk (should be best, though I don't understand how to use it)
aio
Your understanding is partly right, but which tools you use is a matter of what programming model you prefer, and doesn't determine whether your program will freeze waiting for I/O operations to finish. For certain specialized, very-high-load applications, some models are marginally to moderately more efficient, but unless you're in such a situation, you should pick the model that makes it easy to write and maintain your program and have it be portable to systems you and your users care about, not the one someone is marketing as high-performance.
Traditionally, there were two ways to do I/O without blocking:
Structure your program as an event loop performing select (nowadays poll; select is outdated and has critical flaws) on a set of file descriptors that might be ready for reading input or accepting output. This requires keeping some sort of explicit state for partial input that you're not ready to process yet and for pending output that you haven't been able to write out yet.
Separate I/O into separate execution contexts. Historically the unixy approach to this was separate processes, and that can still make sense when you have other reasons to want separate processes anyway (privilege isolation, etc.) but the more modern way to do this is with threads. With a separate execution context for each I/O channel you can just use normal blocking read/write (or even buffered stdio functions) and any partial input or unfinished output state is kept for you implicitly in the call frame stack/local variables of its execution context.
Note that, of the above two options, only the latter helps with stalls from disk access being slow, as regular files are always "ready" for input and output according to select/poll.
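As a rough illustration of the first option (a minimal, POSIX-only sketch with error handling trimmed), here is a poll()-based loop that watches stdin and echoes complete lines, keeping partial input in an explicit buffer exactly as described above:

#include <poll.h>
#include <unistd.h>
#include <string>

int main() {
    std::string pending;                        // explicit partial-input state
    struct pollfd fds[1];
    fds[0].fd = STDIN_FILENO;
    fds[0].events = POLLIN;
    for (;;) {
        if (poll(fds, 1, -1) < 0) break;        // block until a fd is ready
        if (fds[0].revents & (POLLIN | POLLHUP)) {
            char buf[4096];
            ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
            if (n <= 0) break;                  // EOF or error
            pending.append(buf, n);
            std::string::size_type nl;
            while ((nl = pending.find('\n')) != std::string::npos) {
                if (write(STDOUT_FILENO, pending.data(), nl + 1) < 0) return 1;
                pending.erase(0, nl + 1);       // keep the incomplete tail
            }
        }
    }
}

A real server would have one pollfd per connection and would only set POLLOUT on a descriptor while it has pending output to flush.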
Nowadays there's a trend, probably owing largely to languages like JavaScript, towards a third approach, the "async model", with event handler callbacks. I find it harder to work with, requiring more boilerplate code, and harder to reason about, than either of the above methods, but plenty of people like it. If you want to use it, it's probably preferable to do so with a library that abstracts the Linuxisms you mentioned (io_uring, etc.) so your program can run on other systems and doesn't depend on the latest Linux fads.
Now to your particular question:
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If your application has a single input source (no interactivity) and a single output, e.g. like most unix commands, there is absolutely no benefit to any kind of async I/O, regardless of programming model (event loop, threads, async callbacks, whatever). The simplest and most efficient thing to do is just read and write.
The kernel does need CPU time in order to do it.
Is that correct?
Pretty much, yes.
Is the difference between those two that we assume disk access is slow ... in a-sync IO the time you wait for it to be written to disk can be used to continue doing application processing, and the kernel part of writing it to disk is small?
Exactly.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Depends on many factors. How does the application "get info"? Is it CPU intensive? Does it use the same IO as the writing? Is it a service that processes multiple requests concurrently? How many simultaneous connections? Is the performance important in the first place? In some cases: Yes, there may be significant benefit in using async IO. In some other cases, you may get most of the benefits by using sync IO in a separate thread. And in other cases single threaded sync IO can be sufficient.
I do not understand this saying because using sync IO (such as write()) the kernel writes the data to the disk - it doesn't happen by itself. The kernel does need CPU time in order to do it.
No. Most modern devices are able to transfer data to/from RAM by themselves (using DMA or bus mastering).
For example, the CPU might tell a disk controller "read 4 sectors into RAM at address 0x12345000", and then the CPU can do anything else it likes while the disk controller does the transfer (and will be interrupted by an IRQ from the disk controller when the transfer has finished).
However, for modern systems (where you can have any number of processes all wanting to use the same device at the same time) the device driver has to maintain a list of pending operations. In this case (under load), when the device generates an IRQ to say that it finished an operation, the device driver responds by telling the device to start the next pending operation. That way the device spends almost no time idle waiting to be asked to start the next operation (much better device utilization), and the CPU spends almost all of its time doing something else (between IRQs).
Of course, hardware is often more advanced (e.g. having an internal queue of operations itself, so the driver can tell it to do multiple things and it can start the next operation as soon as it finishes the previous one); and drivers are often more advanced too (e.g. having "IO priorities" to ensure that more important stuff is done first, rather than just a simple FIFO queue of pending operations).
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
Let's say that you get info from deviceA (while the CPU and deviceB are idle); then process that info a little (while deviceA and deviceB are idle); then write the result to deviceB (while deviceA and the CPU are idle). You can see that most hardware is doing nothing most of the time (poor utilization).
With asynchronous IO; while deviceA is fetching the next piece of info the CPU can be processing the current piece of info while deviceB is writing the previous piece of info. Under ideal conditions (no speed mismatches) you can achieve 100% utilization (deviceA, CPU and deviceB are never idle); and even if there are speed mismatches (e.g. deviceB needs to wait for CPU to finish processing the current piece) the time anything spends idle will be minimized (and utilization maximized as much as possible).
The other alternative is to use multiple tasks - e.g. one task that fetches data from deviceA synchronously and notifies another task when the data has been read; a second task that waits until data arrives, processes it, and notifies another task when it has been processed; then a third task that waits until data has been processed and writes it to deviceB synchronously. For utilization, this is effectively identical to using asynchronous IO (in fact it can be considered "emulation of asynchronous IO"). The problem is that you've added a bunch of extra overhead managing and synchronizing multiple tasks (more RAM spent on state and stacks, task switches, lock contention, ...); and made the code more complex and harder to maintain.
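Here's a small C++11 sketch of that overlap, using std::async as the task mechanism (read_chunk/process/write_chunk are hypothetical stand-ins for the deviceA/CPU/deviceB stages, stubbed out so the example runs): while deviceB is still writing the previous chunk, the CPU is already reading and processing the next one.

#include <future>
#include <vector>

// Hypothetical stand-ins for the deviceA / CPU / deviceB stages.
std::vector<char> read_chunk() { return std::vector<char>(4096, 'x'); } // deviceA
std::vector<char> process(std::vector<char> in) { return in; }          // CPU work
void write_chunk(const std::vector<char>&) {}                           // deviceB

int main() {
    std::future<void> pending_write;                     // previous chunk's write
    for (int i = 0; i < 100; ++i) {
        std::vector<char> raw = read_chunk();            // overlaps the write
        std::vector<char> out = process(std::move(raw)); // overlaps the write
        if (pending_write.valid())
            pending_write.wait();                        // don't outrun deviceB
        pending_write = std::async(std::launch::async,
                                   write_chunk, std::move(out));
    }
    if (pending_write.valid())
        pending_write.wait();                            // drain the last write
}

As the answer notes, this buys roughly the same utilization as true asynchronous IO, at the cost of the extra task-management overhead.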
Context switching is necessary in any case: the kernel always works in its own context. So synchronous access doesn't save processor time.
Usually, writing doesn't require a lot of processor work; the limiting factor is the disk response. The question is whether we will wait for this response or do our own work in the meantime.
Let's say I have an application that all it does is get info and write it into files. Is there any benefit for using a-sync IO instead of sync IO?
If you implement synchronous access, your sequence is the following:
get information
write information
goto 1.
So you can't get information until write() completes. Suppose the information supplier is as slow as the disk you write to; in this case the program will be twice as slow as the asynchronous one.
If the information supplier can't wait and buffer the information while you are writing, you will lose portions of it during each write. Examples of such information sources are sensors for fast processes. In this case, you should read the sensors synchronously and save the obtained values asynchronously.
Asynchronous IO is not better than synchronous IO. Nor vice versa.
The question is which one is better for your use case.
Synchronous IO is generally simpler to code, but asynchronous IO can lead to better throughput and responsiveness at the expense of more complicated code.
I never had any benefit from asynchronous IO just for file access, but some applications may benefit from it.
Applications accessing "slow" IO like the network or a terminal benefit the most. Using asynchronous IO allows them to do useful work while waiting for IO to complete. This can mean the ability to serve more clients, or to keep the application responsive for the user.
(And "slow" just means that the time for an IO operation to finish is unbounded; it may even never finish, e.g. when waiting for a user to press enter or a network client to send a command.)
In the end, asynchronous IO doesn't do less work, it's just distributed differently in time to reduce idle waiting.
I am working on some code which is performance-wise extremely demanding (I am using microsecond timers!). The thing is, it has a server<->client architecture where a lot of data is shared at high speed. To keep the client and the server in sync, a simple "sequence number" based approach is followed, such that if the client's program crashes, the client can "resume" communication by sending the server the last sequence number, and they can resume operations without missing anything.
The issue with this is that I am forced to write sequence numbers to disk. Sadly this has to be done on every "transaction". This file write causes huge time costs (as we would expect).
So I thought I would use threads to get around this problem. However, if I create a regular (joinable) thread, I would have to wait until the file write finishes anyway; and if I use a detached thread, I am doing something risky, as the thread might not finish when my actual process is killed (say), and thus the sequence number gets messed up.
What are my options here? Kindly note that, sadly, I do not have access to C++11. I am using pthreads (-lpthread) on Linux.
You can just add the data to a queue, and have the secondary threads dequeue, write, and signal when they're done.
You can also take inspiration from log-structured file systems. They get around this problem by having the main thread first write a small record to a log file and return control immediately to the rest of the program. Meanwhile, secondary threads can carry out the actual data write and signal completion by also writing to the log file. This helps you maintain throughput by deferring writes to when more system resources are available, and doesn't block the main thread.
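Since C++11 isn't available, here's a rough pthread sketch of the queue approach above (the file name "seqnum.log" and the unsigned long payload are placeholders): the per-transaction path only takes a lock and signals, while a single writer thread pays the file-write cost.

#include <pthread.h>
#include <cstdio>
#include <queue>

struct SeqQueue {
    std::queue<unsigned long> q;
    pthread_mutex_t m;
    pthread_cond_t cv;
    bool done;
};

void* writer(void* arg) {                     // the dedicated IO thread
    SeqQueue* sq = static_cast<SeqQueue*>(arg);
    std::FILE* f = std::fopen("seqnum.log", "a");
    for (;;) {
        pthread_mutex_lock(&sq->m);
        while (sq->q.empty() && !sq->done)
            pthread_cond_wait(&sq->cv, &sq->m);
        if (sq->q.empty()) { pthread_mutex_unlock(&sq->m); break; }
        unsigned long seq = sq->q.front();
        sq->q.pop();
        pthread_mutex_unlock(&sq->m);
        std::fprintf(f, "%lu\n", seq);        // slow write, off the hot path
    }
    std::fclose(f);
    return 0;
}

void enqueue(SeqQueue* sq, unsigned long seq) {   // called per transaction
    pthread_mutex_lock(&sq->m);
    sq->q.push(seq);
    pthread_cond_signal(&sq->cv);
    pthread_mutex_unlock(&sq->m);
}

int main() {
    SeqQueue sq;
    sq.done = false;
    pthread_mutex_init(&sq.m, 0);
    pthread_cond_init(&sq.cv, 0);
    pthread_t t;
    pthread_create(&t, 0, writer, &sq);
    for (unsigned long i = 0; i < 100; ++i) enqueue(&sq, i);
    pthread_mutex_lock(&sq.m);                // clean shutdown: drain, then stop
    sq.done = true;
    pthread_cond_signal(&sq.cv);
    pthread_mutex_unlock(&sq.m);
    pthread_join(t, 0);
    pthread_mutex_destroy(&sq.m);
    pthread_cond_destroy(&sq.cv);
}

The trade-off the question already identified remains: sequence numbers still sitting in the queue when the process is killed are lost, so the queue should stay short, and truly critical records may still need a synchronous write plus fdatasync.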
I am trying to automate some data entry, so I implemented a TCP client and server: the client sends file names, and the server goes into a shared folder and imports each file into the database.
My problem is that file names can be sent at a faster rate than the database import runs. So I made a queue (which I am not sure how to size), push file names onto it, and then execute
PushToDatabase(filename);
what I am trying to do is:
queue<string> q;
char *data = new char[1024];
ReadFromClient(data);
// now 'data' has a filename (NUL-terminated)
q.push(data);              // copies the chars into a std::string
delete[] data;             // the queue owns its own copy now
PushToDatabase(q.front()); // I want to execute this in the background
q.pop();
I am not sure whether I need to implement threading to make this work, and I also have no clue how this could be done in C++.
Any other ideas??
Depending on the number of files you expect to process, you should take a look at ring buffers. You can allocate one at a fixed size and, if it's implemented properly, continuously read and write to it without buffer overrun/underrun concerns. Boost has a circular buffer container (boost::circular_buffer) that you could use, but you'll need mutexes to make sure it is synchronized and thread-safe. This ensures you don't read and write the same memory locations at the same time, or change variables from other threads (since the threads share an address space). You can wait on the ring buffer with semaphores (or condition variables) to see whether there is new data to process, which also eliminates the overrun/underrun concern outlined above. You should also take a look at std::atomic for variables that will be shared between threads; that way you eliminate race conditions when two threads try to write to the same variable.
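For illustration, a minimal sketch of that ring buffer, assuming Boost is available (the capacity of 128 and std::string file names are arbitrary choices); a mutex plus two condition variables stands in for the semaphores and plays the same role:

#include <boost/circular_buffer.hpp>
#include <condition_variable>
#include <mutex>
#include <string>

class FileNameRing {
    boost::circular_buffer<std::string> buf_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
public:
    FileNameRing() : buf_(128) {}                 // fixed capacity
    void put(std::string name) {                  // called by the network thread
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&]{ return !buf_.full(); });   // no overrun
        buf_.push_back(std::move(name));
        not_empty_.notify_one();
    }
    std::string take() {                          // called by the database thread
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&]{ return !buf_.empty(); }); // no underrun
        std::string name = std::move(buf_.front());
        buf_.pop_front();
        not_full_.notify_one();
        return name;
    }
};

The database thread then just loops on take() and calls PushToDatabase(name), while the thread reading from the client calls put() for each name it receives.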
You could use several threads. Be very careful about synchronization issues.
Alternatively you could have a (single-threaded) event loop. You could (painfully) write it yourself above some multiplexing syscall like poll(2).
Read also Advanced Linux Programming.
You could also use some event loop library like libevent or libev or Glib (from GTK) or QtCore (from Qt) or libsigc++
Reading about C10K, closures, callbacks, and continuation-passing style might somehow be relevant and could open your mind to potential issues and terminology. Notice that C++11 has anonymous functions (lambdas, i.e. closures).
How can I get callbacks once data has been successfully written to disk in Linux?
I would like to have my program's db file mapped into memory for read/write operations and receive callbacks once a write has successfully hit disk. Kind of like what old VMS used to do.
You need to call fdatasync (or fsync if you really need the metadata to be synchronised as well) and wait for it to return.
You could do this from another thread, but if one thread writes to the file while another thread is doing a fdatasync(), it's not going to be clear which of the writes are guaranteed to be persistent or not.
Databases which want to store transaction logs in a guaranteed-durable way, need to call fdatasync.
Databases (such as innodb) typically use direct IO (as well as their own data-caching, rather than rely on the OS) on their main data files, so that they know that it will be written in a predictable manner.
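For instance, a durable append on Linux looks roughly like this (the path and record are placeholders; error handling abbreviated):

#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// Append a record and don't return until it has reached the disk.
bool durable_append(const char* path, const char* rec) {
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return false;
    ssize_t len = (ssize_t)std::strlen(rec);
    bool ok = write(fd, rec, len) == len          // data lands in the page cache
           && fdatasync(fd) == 0;                 // block until it is persistent
    close(fd);
    return ok;
}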
As far as I know you cannot get any notification when the actual synchronization of a file (or an mmapped region) happens; not even the timestamps of the file are going to change. You can, however, force the synchronization of the file by using fsync (or of a mapped region with msync).
It is also hard to see a reason why you would want that. File IO is supposed to be opaque.
I have a program that downloads a binary file from another PC.
I also have another standalone program that can convert this binary file to a human-readable CSV.
I would like to bring the conversion tool "into" the download tool, creating a thread in the download tool that kicks off the conversion code (so it can start converting while downloading, reducing the total time compared to downloading and converting independently).
I believe I can successfully kick off another thread but how do I synchronize the conversion thread with the main download?
i.e. The conversion catches up with the download, needs to wait for more to download, then starts converting again, etc.
Is this similar to the Synchronizing Execution of Multiple Threads ? If so does this mean the downloaded binary needs to be a resource accessed by semaphores?
Am I on the right path or should i be pointed in another direction before I start?
Any advice is appreciated.
Thank You.
This is a classic case of the producer-consumer problem with the download thread as the producer and the conversion thread as the consumer.
Google around and you'll find an implementation for your language of choice. Here are some from MSDN: How to: Implement Various Producer-Consumer Patterns.
Instead of downloading to a file, you should write the downloaded data to a pipe. The convert thread can read from the pipe and write the converted output to a file. That will automatically synchronize them.
If you need the original file as well as the converted one, just have the download thread write the data to the file then write the same data to the pipe.
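A rough sketch of that pipe approach (download_chunk and convert_stream are hypothetical stand-ins for the existing download and conversion code, stubbed out so the example runs): the kernel's pipe buffer does the synchronization, blocking the converter when it catches up with the download, and blocking the downloader if the converter falls far behind.

#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <thread>

// Hypothetical stand-ins for the existing download and conversion code.
ssize_t download_chunk(char* buf, size_t cap) {   // stub: three fake chunks
    static int calls = 0;
    if (calls >= 3) return 0;
    std::memset(buf, 'A' + calls++, cap);
    return (ssize_t)cap;
}
void convert_stream(std::FILE* in, std::FILE* csv_out) {  // stub "conversion"
    char b[4096];
    size_t n;
    while ((n = std::fread(b, 1, sizeof b, in)) > 0)
        std::fprintf(csv_out, "converted %zu bytes\n", n);
}

int main() {
    int fds[2];
    if (pipe(fds) != 0) return 1;
    std::thread converter([&] {
        std::FILE* in = fdopen(fds[0], "rb");     // read end of the pipe
        std::FILE* out = std::fopen("out.csv", "w");
        convert_stream(in, out);                  // blocks while the pipe is empty
        std::fclose(in);                          // also closes fds[0]
        std::fclose(out);
    });
    char buf[8192];
    ssize_t n;
    while ((n = download_chunk(buf, sizeof buf)) > 0)
        if (write(fds[1], buf, (size_t)n) < 0) break;  // blocks if converter lags
    close(fds[1]);                                // EOF tells the converter we're done
    converter.join();
}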
Yes, you undoubtedly need semaphores (or something similar such as an event or critical section) to protect access to the data.
My immediate reaction would be to think primarily in terms of a sequence of blocks though, not an entire file. Second, I almost never use a semaphore (or anything similar) directly. Instead, I would normally use a thread-safe queue, so when the network thread has read a block, it puts a structure into the queue saying where the data is and such. The processing thread waits for an item in the queue, and when one arrives it pops and processes that block.
When it finishes processing a block, it'll typically push the result onto another queue for the next stage of processing (e.g., writing to a file), and (quite possibly) put a descriptor for the processed block onto another queue, so the memory can be re-used for reading another block of input.
At least in my experience, this type of design eliminates a large percentage of thread synchronization issues.
Edit: I'm not sure about guidelines about how to design a thread-safe queue, but I've posted code for a simple one in a previous answer.
As far as design patterns go, I've seen this called at least "pipeline" and "production line" (though I'm not sure I've seen the latter in much literature).