I have a program that downloads a binary file from another PC.
I also have another standalone program that can convert this binary file to a human-readable CSV.
I would like to bring the conversion tool "into" the download tool by creating a thread in the download tool that kicks off the conversion code, so the conversion can start while the download is still in progress, reducing the total time compared to downloading and converting sequentially.
I believe I can successfully kick off another thread, but how do I synchronize the conversion thread with the main download?
i.e. the conversion catches up with the download, waits for more data to be downloaded, then starts converting again, and so on.
Is this similar to Synchronizing Execution of Multiple Threads? If so, does this mean the downloaded binary needs to be a resource accessed via semaphores?
Am I on the right path, or should I be pointed in another direction before I start?
Any advice is appreciated.
Thank You.
This is a classic case of the producer-consumer problem with the download thread as the producer and the conversion thread as the consumer.
Google around and you'll find an implementation for your language of choice. Here are some from MSDN: How to: Implement Various Producer-Consumer Patterns.
Instead of downloading to a file, you could write the downloaded data to a pipe. The conversion thread can then read from the pipe and write the converted output to a file. That will synchronize them automatically.
If you need the original file as well as the converted one, just have the download thread write the data to the file then write the same data to the pipe.
Yes, you undoubtedly need semaphores (or something similar such as an event or critical section) to protect access to the data.
My immediate reaction would be to think primarily in terms of a sequence of blocks though, not an entire file. Second, I almost never use a semaphore (or anything similar) directly. Instead, I would normally use a thread-safe queue, so when the network thread has read a block, it puts a structure into the queue saying where the data is and such. The processing thread waits for an item in the queue, and when one arrives it pops and processes that block.
When it finishes processing a block, it'll typically push the result onto another queue for the next stage of processing (e.g., writing to a file), and (quite possibly) put a descriptor for the processed block onto another queue, so the memory can be re-used for reading another block of input.
At least in my experience, this type of design eliminates a large percentage of thread synchronization issues.
Edit: I'm not sure about guidelines about how to design a thread-safe queue, but I've posted code for a simple one in a previous answer.
As far as design patterns go, I've seen this called at least "pipeline" and "production line" (though I'm not sure I've seen the latter in much literature).
Related
I am working on some code that is extremely demanding performance-wise (I am using microsecond timers!). It has a server<->client architecture where a lot of data is shared at high speed. To keep the client and the server in sync, a simple sequence-number-based approach is followed: if the client's program crashes, the client can resume communication by sending the server the last sequence number, and they can resume operations without missing anything.
The issue with this is that I am forced to write sequence numbers to disk. Sadly this has to be done on every transaction. This file write incurs a huge time cost, as we would expect.
So I thought I would use threads to get around this problem. However, if I create a regular thread, I would have to wait until the file write finishes anyway and if I used a detached thread, I am doing something risky, as the thread might not finish when my actual process is killed (let's say) and thus the sequence number gets messed up.
What are my options here? Kindly note that sadly I do not have access to C++11. I am using pthreads (linked with -lpthread) on Linux.
You can just add the data to a queue, and have the secondary threads dequeue, write, and signal when they're done.
You can also take some inspiration from log-structured file systems. They get around this problem by having the main thread first write a small record to a log file and return control immediately to the rest of the program. Meanwhile, secondary threads carry out the actual data write and signal completion by also writing to the log file. This helps you maintain throughput by deferring writes until more system resources are available, and it doesn't block the main thread. Read more about log-structured file systems for the details.
There is a component-A which downloads the content from some source and writes it to a file.
Another component-B needs to wait for the download+write operation by component-A to finish.
The file size is known beforehand to component-B.
Condition:
Component-A cannot signal that it has finished the write operation.
Component-B has to somehow identify that the file has grown to the expected size and then start reading it.
What is the best way to do this? The trivial way is to check the size at some time interval.
Is there a way to wait on file handle until it has grown to expected size?
No, you cannot wait on a file handle that way. Waiting on file handles is meaningful only if you use async operations (the handle becomes signaled whenever any such operation on it completes), and it's not recommended anyway.
A usable option would be to call ReadDirectoryChangesW, but that comes with its own set of pitfalls (it works in terms of file names, not handles; file names might be long or short, no guarantee is given; you have to use the more complicated async workflow because the sync one offers nothing better than what you already have).
All in all, if your requirements are inviolate then using a timer doesn't sound bad, and it will certainly make for much simpler code.
I have a multi-threaded C++11 program in which each thread produces a large amount of data that needs to be written to disk. All the data need to be written to one file. At the moment, I use a mutex that protects access to the file from multiple threads. My friend suggested that I use one file per thread and then, at the end, merge the files into one with the cat command, invoked from C++ code using system().
I'm thinking that if the cat command is going to read all the data back from the disk and then write it again, this time into a single file, it's not going to be any better. I have googled but couldn't find implementation details for the cat command. May I know how it works and whether it's going to accelerate the whole procedure?
Edit:
Chronology of events is not important, and there's no ordering constraint on the contents of the files. Both methods will perform what I want.
You don't specify if you have some ordering or structuring constraints on the content of the file. Generally it is the case, so I'll treat it as such, but hopefully my solution should work either way.
The classical programmatic approach
The idea is to offload the work of writing to disk to a dedicated IO thread, and to use a multiple-producer/one-consumer queue for all the write commands. Each worker thread simply formats its output as a string and pushes it onto the queue. The IO thread pops batches of messages from the queue into a buffer and issues the write commands.
Alternatively, you could add a field in your messages to indicate which worker emitted the write command, and have the IO thread push to different files, if needed.
For better performance, it is also interesting to look into async versions of the IO system primitives (read/write), if your host OS supports them. The IO thread would then be able to monitor several concurrent IOs and feed the OS with new ones as soon as one terminates.
As has been advised in the comments, you will have to monitor the IO thread for congestion situations and tune the number of workers accordingly. The "natural" feedback-based mechanism is simply to make the queue bounded, so workers wait on the lock until space frees up in it. This lets you control the amount of produced data at any point in the process's lifetime, which is an important point in memory-constrained scenarios.
Your cat concerns
As for cat, this command-line tool simply reads whatever is written to its input channel (usually stdin) and duplicates it to its output (stdout). It's as simple as that, and you can clearly see the similarity with the solution advocated above. The difference is that cat doesn't understand the file's internal structure (if any); it only deals with byte streams, which means that if several processes write concurrently to cat's input without synchronization, the resulting output would probably be completely mixed up. Another issue is the atomicity (or lack thereof) of IO primitives.
NB: On some systems, there's a neat little feature called a fork, which let you have several "independent" streams of data multiplexed in a single file. If you happen to work on a platform supporting that feature, you could have all your data streams bundled in a single file, but separately reachable.
If I have
1. mainThread: write data A,
2. Thread_1: read A and write it to into a Buffer;
3. Thread_2: read from the Buffer.
how do I synchronize these three threads safely, without much performance loss? Is there an existing solution to use? I use C/C++ on Linux.
IMPORTANT: the goal is to know the synchronization mechanism or algorithms for this particular case, not how mutex or semaphore works.
First, I'd consider the possibility of building this as three separate processes, using pipes to connect them. A pipe is (in essence) a small buffer with locking handled automatically by the kernel. If you do end up using threads for this, most of your time/effort will be spent on creating nearly an exact duplicate of the pipes that are already built into the kernel.
Second, if you decide to build this all on your own anyway, I'd give serious consideration to following a similar model. You don't need to be slavish about it, but I'd still think primarily in terms of a data structure to which one thread writes data and from which another reads it. By strong preference, all the necessary thread locking would be built into that data structure, so the code in each thread stays quite simple: read, process, write. The main difference from using normal Unix pipes is that you can keep the data in a more convenient format, instead of doing all the reading and writing as text.
As such, what I think you're looking for is basically a thread-safe queue. With that, nearly everything else involved borders on trivial (at least the threading part does -- the processing involved may not be, but at least building it with multiple threads isn't adding much to the complexity).
It's hard to say how much experience with C/C++ threads you have. I hate to just point to a link but have you read up on pthreads?
https://computing.llnl.gov/tutorials/pthreads/
And for a shorter example with code and simple mutexes (the lock object you need to synchronize data):
http://students.cs.byu.edu/~cs460ta/cs460/labs/pthreads.html
I would suggest Boost.Thread for this purpose. It is quite a good framework with mutexes and semaphores, and it is multiplatform. Here you can find a very good tutorial about it.
Exactly how to synchronize these threads is a separate problem and requires more information about your use case.
Edit: The simplest solution would be to use two mutexes -- one on A and one on the Buffer. You don't have to worry about deadlocks in this particular case. Just:
MainThread enters mutex_A; Thread_1 waits for the mutex to be released.
MainThread leaves mutex_A; Thread_1 enters mutex_A and mutex_Buffer, reads from A, and writes it to the Buffer.
Thread_1 releases both mutexes. MainThread can then enter mutex_A and write data, and Thread_2 can enter mutex_Buffer and safely read data from the Buffer.
This is obviously the simplest solution, and probably can be improved, but without more knowledge about the problem, this is the best I can come up with.
I'm setting up a log system for my (2d) game engine, and it should be able to write lines to a file.
The point is, writing to disk is not instantaneous. If the file writing (basically, file.flush()) is done in the thread that is calling Trace.Write(), will it hang while the file is being written?
If it is the case, then it would be interesting to create a thread used only to write the log lines to the log file, while the processing thread would continue what it is doing.
Same question with the console (while I'm here...).
The question is :
"Is it interesting in a calculation intensive program, to thread the console and/or file writing ?"
Thank you.
Yes, your thread may be suspended while it is in an IOWAIT state. This is a classic suspend situation.
If it is a good idea to create a thread only responsible for writing logfile entries depends on your code. Is it I/O bound? Then it might be a good idea. Is your code CPU bound? Then it won't help much. Is it neither? Then it doesn't matter.
The best way to figure this out is to analyze your code and benchmark the two versions.
If you queue the log writes off to a dedicated logging thread, there are many advantages. The big disadvantage is that the logging will almost certainly not have happened when your log call returns. If the problem you are trying to catch is a disastrous crash, the log entry that identifies the bug may not get written at all.
Is it worthwhile, in a calculation-intensive program, to thread the console and/or file writing?
In general, given the caveat above, probably yes.
If the file writing (basically, file.flush()) is done in the thread that is calling Trace.Write(), will it hang while the file is being written?
Yes. This is because the flush() call is designed to ensure the data hits the disk.
If it is the case, then it would be interesting to create a thread used only to write the log lines to the log file, while the processing thread would continue what it is doing.
Why not just stop calling flush()? If you're not interested in making absolutely sure that, by a certain part of the program, all the data written so far is on the disk, just stop calling flush() manually, and it'll get buffered and written out in the usual efficient manner.
Ultimately there might be some small benefit to having the log writes in another thread, if the disk-writing system requires periodic syncs that hang the thread (which I'm not confident is the case), but I would expect that you lose far more than you gain by having to implement synchronisation for however you pass your loggable strings to the background thread. Then you start wondering whether you can use a lock-free queue or some other complex system, when really you probably just needed to do it the simple way in the first place: write whenever you like, and only flush when absolutely necessary.