Reading many files in parallel - c++

I have a cross platform project in C++ where I am mixing audio in real-time. I have several independent tracks as input, that I read from separate files on disk. I then mix these, apply some processing, and spit out a buffer with the resulting audio. The problem I am having is that of disk IO speed. For the current test I am performing, I have about 10 tracks that are read simultaneously from disk. Each track is in raw PCM, 48000 HZ 16 bit stereo. This means that there is a significant amount of data that needs to be read as quickly as possible. I have tried both simple fread calls as well as memory mapped files through Boost, but the issue is the same. When a file is first opened, it usually causes the audio to break up (presumably while the file is read into cache by the OS). After that, everything runs smoothly without a glitch. For the time being I use one thread per file in the common case, sometimes two files per thread. It is usually when I have two files per thread that the stalling/breakup of the stream occurs. Note that I do not know in advance which input files that need to be played, as this is controlled by the user. So my problem is how to read these initial blocks in such a way so that I don't get stalling/breakup. Also, when a new file is loaded it is not necessarily in the beginning that the reading must start.
I have a few thoughts:
Can we prefetch the files into cache by reading all of them once at startup but disregarding the data? I cannot store all of it in memory. But it seems bad to rely on internal behavior of the OS's read cashe, especially since this is cross platform.
Can we use a format such as Ogg Vorbis for compression, load the compressed data fully into memory and then decode on the fly? I am thinking that decoding 10 or more Vorbis streams might be too CPU intensive, but I have no benchmarks yet. At least in this way we turn it from an I/O bound task to a CPU bound one.
Can we do any other kind of clever buffering approach to make it so that the large reads are more equally distributed? I know very little about how I might accomplish that.
I am stuck at this point, and would appreciate any suggestions that might improve throughput.

Try doing the file loading using event processing.
This is where you open a bunch of file descriptors and let the operating system notify your programs when data is available.
The most broadly available api to do this with "select" (http://linux.die.net/man/2/select), wbut there are better methods ( poll, epoll, kqueue ). These are not available everywhere.
There are libraries that abstract this for you ( libev and libevent ).
So the way you do it is, one thread opens all the files you need and sets a 'watcher' on them. When data is available the watcher triggers, and call a callback.
The advantage is that you don't have a ton of threads waiting and sleeping checking all your open file descriptors. If that doesn't work then likely you are over saturating the hardware's io bandwidth - in which case you just have to wait. If that is the case then you need to do some buffering to avoid stutters.

As a rule of thumb, you need to perform file IO operations in a separate thread for real time operations. When user wants to mix a second audio file, you can just open a new thread and read the first N bytes of this second audio file and return the data read into the main thread. This will also cause a lag but it will not break the audio flow.

Related

Multithreaded Files Reading

I need to read / parse a large binary file (4 ~ 6 GB) that comes in fixed chunks of 8192 bytes. My current solution involves streaming the file chunks using the Single Producer Multiple Consumer (SPMC) pattern.
EDIT
File size = N * 8192 Bytes
All I am required to do is to do something to each of these 8192 bytes. The file is only required to be read once top down.
Having thought that this should be an embarrassingly parallel problem, I would like to have X threads to read at equal ranges of (File Size / X) sizes independently. The threads do not need to communicate with each other at all.
I've tried spawning X threads to open the same file and seek to their respective sections to process, however, this solution seems to have a problem with the due to HDD mechanical seeks and apparently performs worse than the SPMC solution.
Would there be any difference if this method is used on the SSD instead?
Or would it be more straight forward to just memory map the whole file and use #pragma omp parallel for to process the chunks? I suppose I would need sufficient enough RAM to do this?
What would you suggest?
What would you suggest?
Don't use mmap()
Per Linux Torvalds himself:
People love mmap() and other ways to play with the page tables to
optimize away a copy operation, and sometimes it is worth it.
HOWEVER, playing games with the virtual memory mapping is very
expensive in itself. It has a number of quite real disadvantages that
people tend to ignore because memory copying is seen as something very
slow, and sometimes optimizing that copy away is seen as an obvious
improvment.
Downsides to mmap:
quite noticeable setup and teardown costs. And I mean noticeable.
It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the
mappings. It's The TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Upsides of mmap:
if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread.
This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it's just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.
if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again.
And the automatic sharing is obviously a case of this.
But your test-suite (just copying the data once) is probably pessimal
for mmap().
Note the last - just using the data once is a bad use-case for mmap().
For a file on an SSD, since there are no physical head seek movements:
Open the file once, using open() to get a single int file descriptor.
Use pread() per thread to read appropriate 8kB chunks. pread() reads from a specified offset without using lseek(), and does not effect the current offset of the file being read from.
You'll probably need somewhat more threads than CPU cores, since there's going to be significant IO waiting on each thread.
For a file on mechanical disk(s):
You want to minimize head seek(s) on the mechanical disk.
Open the file once, using open() with direct IO (assuming Linux, open( filename, O_RDONLY | O_DIRECT );) to bypass the page cache (since you're going to stream the file and never re-read any portion of it, the page cache does you no good here)
Using a single producer thread, read large chunks (say 64k to 1MB+)
into one of N page-aligned buffers.
When a buffer is read, pass it to the worker threads, then read to fill the next buffer
When all workers are done using their part of the buffer, pass the
buffer back to the reading thread.
You'll need to experiment with the proper read() size, the number of worker threads, and the number of buffers passed around. Larger read()s will be more efficient, but the larger buffer size makes the memory requirements larger and makes the latency of getting that buffer back from the worker threads much more unpredictable. You want to make as few copies of the data as possible, so you'd want the worker threads to work directly on the buffer read from the file.
Even if the processing of each 8K block is significant (short of OCR processing), the i/o is the bottleneck. Unless it can be arranged for parts of the file to be already cached by previous operations....
If the system this is to run on can be dedicated to the problem:
Obtain the file size (fstat)
Allocate a buffer that size.
Open and read the whole file into the buffer.
Figure out how to partition the data per thread and spin off the threads.
Time that algorithm.
Then, revise it using asynchronous reading. See man aio_read and man 7 aio to learn what needs to be done.

Using pthreads with CUDA - design questions

I am writing some code that requires some disk I/O, and invoking a library that I wrote that does some computation and GPU work, and then more disk I/O to write the results back to a file.
I would like to create this as multi-threaded code, because the files are quite large. I want to be able to read in portion of the file, send it to the GPU library, and write a portion back to a file. The disk I/O involved is quite large (like 10GB), and the computation is fairly quick on the GPU.
My question is more of a design question. Should I use separate threads to pre-load data that goes to the GPU library, and only have the main thread actually execute the calls to the GPU library, and then send the resulting data to other threads to be written back out to disk, or should I go ahead and have all of the separate threads each do their own part - grab a chucnk of data, execute on the GPU, and write to disk, and then go get the next chunk of data?
I am using CUDA for my GPU library. Is cuda smart enough to not try to run two kernels on the GPU at once? I guess I will have to do the management manually to ensure that two threads dont try to add more data to the GPU than it has space?
Any good resources on the subject of multithreading and CUDA being used in combination is appreciated.
Threads will not help with disk I/O. Generally people tend to solve blocking problems by creating tons of threads. In fact, that only makes things worse. What you have to do is to use asynchronous I/O and not block on write (and read). You can use some generic solutions like libevent or Asio for this or work with lower-level API available on your platform. On Linux, AIO seems to be the best for files, but I haven't tried that yet. Hope it helps.
I encountered this situation with large files in my research work.
As far as I remember there is no much gain in threading the disk I/O work because is
very slow compared to the GPU I/O.
The strategy I used was to read synchronously from disk and to load data and execute asynchronously on the GPU.
Something like:
read from disk
loop:
async_load_to_gpu
async_execute
push_event
read from disk
check event complete or read more data from disk

Need some help writing my results to a file

My application continuously calculates strings and outputs them into a file. This is being run for almost an entire day. But writing to the file is slowing my application. Is there a way I can improve the speed ? Also I want to extend the application so that I can send the results to an another system after some particular amount of time.
Thanks & Regards,
Mousey
There are several things that may or may not help you, depending on your scenario:
Consider using asynchronous I/O, for instance by using Boost.Asio. This way your application does not have to wait for expensive I/O-operations to finish. However, you will have to buffer your generated data in memory, so make sure there is enough available.
Consider buffering your strings to a certain size, and then write them to disk (or the network) in big batches. Few big writes are usually faster than many small ones.
If you want to make it really good C++, meaning STL-comliant, make your algorithm a template-function that takes and output-iterator as argument. This way you can easily have it write to files, the network, memory or the console by providing appropriate iterators.
How if you write the results to a socket, instead of file. Another program Y, will read the socket, open a file, write on it and close it, and after the specified time will transfer the results to another system.
I mean the process of file handling is handled by other program. Original program X just sends the output to the socket. It does not concern it self with flushing the file stream.
Also I want to extend the application
so that I can send the results to an
another system after some particular
amount of time.
If you just want to transfer the file to other system, then I think a simple script will be enough for that.
Use more than one file for the logging. Say, after your file reaches size of 1 MB, change its name to something contains the date and the time and start to write to a new one, named as the original file name.
then you have:
results.txt
results2010-1-2-1-12-30.txt (January 2 2010, 1:12:30)
and so on.
You can buffer the result of different computations in memory and only write to the file when buffer is full. For example, your can design your application in such a way that, it computes result for 100 calculations and writes all those 100 results at once in a file. Then computes another 100 and so on.
Writing file is obviously slow, but you can buffered data and initiate the separate thread for writhing on file. This can improve speed of your application.
Secondly you can use ftp for transfer files to other system.
I think there are some red herrings here.
On an older computer system, I would recommend caching the strings and doing a small number of large writes instead of a large number of small writes. On modern systems, the default disk-caching is more than adequate and doing additional buffering is unlikely to help.
I presume that you aren't disabling caching or opening the file for every write.
It is possible that there is some issue with writing very large files, but that would not be my first guess.
How big is the output file when you finish?
What causes you to think that the file is the bottleneck? Do you have profiling data?
Is it possible that there is a memory leak?
Any code or statistics you can post would help in the diagnosis.

Many small files or one big file? (Or, Overhead of opening and closing file handles) (C++)

I have created an application that does the following:
Make some calculations, write calculated data to a file - repeat for 500,000 times (over all, write 500,000 files one after the other) - repeat 2 more times (over all, 1.5 mil files were written).
Read data from a file, make some intense calculations with the data from the file - repeat for 1,500,000 iterations (iterate over all the files written in step 1.)
Repeat step 2 for 200 iterations.
Each file is ~212k, so over all i have ~300Gb of data. It looks like the entire process takes ~40 days on a Core 2 Duo CPU with 2.8 Ghz.
My problem is (as you can probably guess) is the time it takes to complete the entire process. All the calculations are serial (each calculation is dependent on the one before), so i can't parallel this process to different CPUs or PCs. I'm trying to think how to make the process more efficient and I'm pretty sure the most of the overhead goes to file system access (duh...). Every time i access a file i open a handle to it and then close it once i finish reading the data.
One of my ideas to improve the run time was to use one big file of 300Gb (or several big files of 50Gb each), and then I would only use one open file handle and simply seek to each relevant data and read it, but I'm not what is the overhead of opening and closing file handles. can someone shed some light on this?
Another idea i had was to try and group the files to bigger ~100Mb files and then i would read 100Mb each time instead of many 212k reads, but this is much more complicated to implement than the idea above.
Anyway, if anyone can give me some advice on this or have any idea how to improve the run time i would appreciate it!
Thanks.
Profiler update:
I ran a profiler on the process, it looks like the calculations take 62% of runtime and the file read takes 34%. Meaning that even if i miraculously cut file i/o costs by a factor of 34, I'm still left with 24 days, which is quite an improvement, but still a long time :)
Opening a file handle isn't probable to be the bottleneck; actual disk IO is. If you can parallelize disk access (by e.g. using multiple disks, faster disks, a RAM disk, ...) you may benefit way more. Also, be sure to have IO not block the application: read from disk, and process while waiting for IO. E.g. with a reader and a processor thread.
Another thing: if the next step depends on the current calculation, why go through the effort of saving it to disk? Maybe with another view on the process' dependencies you can rework the data flow and get rid of a lot of IO.
Oh yes, and measure it :)
Each file is ~212k, so over all i have
~300Gb of data. It looks like the
entire process takes ~40 days ...a ll the
calculations are serial (each
calculation is dependent on the one
before), so i can't parallel this
process to different CPUs or PCs. ... pretty
sure the most of the overhead goes to
file system access ... Every
time i access a file i open a handle
to it and then close it once i finish
reading the data.
Writing data 300GB of data serially might take 40 minutes, only a tiny fraction of 40 days. Disk write performance shouldn't be an issue here.
Your idea of opening the file only once is spot-on. Probably closing the file after every operation is causing your processing to block until the disk has completely written out all the data, negating the benefits of disk caching.
My bet is the fastest implementation of this application will use a memory-mapped file, all modern operating systems have this capability. It can end up being the simplest code, too. You'll need a 64-bit processor and operating system, you should not need 300GB of RAM. Map the whole file into address space at one time and just read and write your data with pointers.
From your brief explaination it sounds like xtofl suggestion of threads is the correct way to go. I would recommend you profile your application first though to ensure that the time is divided between IO an cpu.
Then I would consider three threads joined by two queues.
Thread 1 reads files and loads them into ram, then places data/pointers in the queue. If the queue goes over a certain size the thread sleeps, if it goes below a certain size if starts again.
Thread 2 reads the data off the queue and does the calculations then writes the data to the second queue
Thread 3 reads the second queue and writes the data to disk
You could consider merging thread 1 and 3, this might reduce contention on the disk as your app would only do one disk op at a time.
Also how does the operating system handle all the files? Are they all in one directory? What is performance like when you browse the directory (gui filemanager/dir/ls)? If this performance is bad you might be working outside your file systems comfort zone. Although you could only change this on unix, some file systems are optimised for different types of file usage, eg large files, lots of small files etc. You could also consider splitting the files across different directories.
Before making any changes it might be useful to run a profiler trace to figure out where most of the time is spent to make sure you actually optimize the real problem.
What about using SQLite? I think you can get away with a single table.
Using memory mapped files should be investigated as it will reduce the number of system calls.

How best to manage Linux's buffering behavior when writing a high-bandwidth data stream?

My problem is this: I have a C/C++ app that runs under Linux, and this app receives a constant-rate high-bandwith (~27MB/sec) stream of data that it needs to stream to a file (or files). The computer it runs on is a quad-core 2GHz Xeon running Linux. The filesystem is ext4, and the disk is a solid state E-SATA drive which should be plenty fast for this purpose.
The problem is Linux's too-clever buffering behavior. Specifically, instead of writing the data to disk immediately, or soon after I call write(), Linux will store the "written" data in RAM, and then at some later time (I suspect when the 2GB of RAM starts to get full) it will suddenly try to write out several hundred megabytes of cached data to the disk, all at once. The problem is that this cache-flush is large, and holds off the data-acquisition code for a significant period of time, causing some of the current incoming data to be lost.
My question is: is there any reasonable way to "tune" Linux's caching behavior, so that either it doesn't cache the outgoing data at all, or if it must cache, it caches only a smaller amount at a time, thus smoothing out the bandwidth usage of the drive and improving the performance of the code?
I'm aware of O_DIRECT, and will use that I have to, but it does place some behavioral restrictions on the program (e.g. buffers must be aligned and a multiple of the disk sector size, etc) that I'd rather avoid if I can.
You can use the posix_fadvise() with the POSIX_FADV_DONTNEED advice (possibly combined with calls to fdatasync()) to make the system flush the data and evict it from the cache.
See this article for a practical example.
If you have latency requirements that the OS cache can't meet on its own (the default IO scheduler is usually optimized for bandwidth, not latency), you are probably going to have to manage your own memory buffering. Are you writing out the incoming data immediately? If you are, I'd suggest dropping that architecture and going with something like a ring buffer, where one thread (or multiplexed I/O handler) is writing from one side of the buffer while the reads are being copied into the other side.
At some size, this will be large enough to handle the latency required by a pessimal OS cache flush. Or not, in which case you're actually bandwidth limited and no amount of software tuning will help you until you get faster storage.
You can adjust the page cache settings in /proc/sys/vm, (see /proc/sys/vm/dirty_ratio, /proc/sys/vm/swappiness specifically) to tune the page cache to your liking.
If we are talking about std::fstream (or any C++ stream object)
You can specify your own buffer using:
streambuf* ios::rdbuf ( streambuf* streambuffer);
By defining your own buffer you can customize the behavior of the stream.
Alternatively you can always flush the buffer manually at pre-set intervals.
Note: there is a reson for having a buffer. It is quicker than writting to a disk directly (every 10 bytes). There is very little reason to write to a disk in chunks smaller than the disk block size. If you write too frquently the disk controler will become your bottle neck.
But I have an issue with you using the same thread in the write proccess needing to block the read processes.
While the data is being written there is no reason why another thread can not continue to read data from your stream (you may need to some fancy footwork to make sure they are reading/writting to different areas of the buffer). But I don't see any real potential issue with this as the IO system will go off and do its work asyncroniously (potentially stalling your write thread (depending on your use of the IO system) but not nesacerily your application).
I know this question is old, but we know a few things now we didn't know when this question was first asked.
Part of the problem is that the default values for /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio are not appropriate for newer machines with lots of memory. Linux begins the flush when dirty_background_ratio is reached, and blocks all I/O when dirty_ratio is reached. Lower dirty_background_ratio to start flushing sooner, and raise dirty_ratio to start blocking I/O later. On very large memory systems, (32GB or more) you may even want to use dirty_bytes and dirty_background_bytes, since the minimum increment of 1% for the _ratio settings is too coarse. Read https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ for a more detailed explanation.
Also, if you know you won't need to read the data again, call posix_fadvise with FADV_DONTNEED to ensure cache pages can be reused sooner. This has to be done after linux has flushed the page to disk, otherwise the flush will move the page back to the active list (effectively negating the effect of fadvise).
To ensure you can still read incoming data in the cases where Linux does block on the call to write(), do file writing in a different thread than the one where you are reading.
Well, try this ten pound hammer solution that might prove useful to see if i/o system caching contributes to the problem: every 100 MB or so, call sync().
You could use a multithreaded approach—have one thread simply read data packets and added them to a fifo, and the other thread remove packets from the fifo and write them to disk. This way, even if the write to disk stalls, the program can continue to read incoming data and buffer it in RAM.