I need to read / parse a large binary file (4 ~ 6 GB) that comes in fixed chunks of 8192 bytes. My current solution involves streaming the file chunks using the Single Producer Multiple Consumer (SPMC) pattern.
EDIT
File size = N * 8192 Bytes
All I am required to do is something with each of these 8192-byte chunks. The file only needs to be read once, top to bottom.
Since this looked like an embarrassingly parallel problem, I would like to have X threads each read an equal range of (file size / X) bytes independently. The threads do not need to communicate with each other at all.
I've tried spawning X threads that open the same file and seek to their respective sections to process; however, this solution seems to have a problem due to HDD mechanical seeks and apparently performs worse than the SPMC solution.
Would there be any difference if this method is used on the SSD instead?
Or would it be more straightforward to just memory map the whole file and use #pragma omp parallel for to process the chunks? I suppose I would need sufficient RAM to do this?
What would you suggest?
Don't use mmap()
Per Linus Torvalds himself:
People love mmap() and other ways to play with the page tables to
optimize away a copy operation, and sometimes it is worth it.
HOWEVER, playing games with the virtual memory mapping is very
expensive in itself. It has a number of quite real disadvantages that
people tend to ignore because memory copying is seen as something very
slow, and sometimes optimizing that copy away is seen as an obvious
improvement.
Downsides to mmap:
quite noticeable setup and teardown costs. And I mean noticeable.
It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Upsides of mmap:
if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread.
This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it's just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.
if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again.
And the automatic sharing is obviously a case of this.
But your test-suite (just copying the data once) is probably pessimal for mmap().
Note the last - just using the data once is a bad use-case for mmap().
For a file on an SSD, since there are no physical head seek movements:
Open the file once, using open() to get a single int file descriptor.
Use pread() per thread to read the appropriate 8 KB chunks. pread() reads from a specified offset without using lseek(), and does not affect the current offset of the file being read from.
You'll probably need somewhat more threads than CPU cores, since there's going to be significant IO waiting on each thread.
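A minimal sketch of that approach, assuming the file size is an exact multiple of 8192 and a hypothetical process_chunk() does the per-chunk work (the thread count is illustrative):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <thread>
#include <vector>

// Hypothetical per-chunk work; supply your own.
void process_chunk(const char* p, size_t n) {}

int main(int argc, char** argv) {
    const int nthreads = 8;            // more than cores: threads will block on IO
    int fd = open(argv[1], O_RDONLY);  // one fd, shared by all threads
    struct stat st;
    fstat(fd, &st);
    const off_t chunks = st.st_size / 8192;
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) {
        pool.emplace_back([=] {
            char buf[8192];
            // Thread t handles chunks t, t + nthreads, t + 2*nthreads, ...
            for (off_t c = t; c < chunks; c += nthreads)
                if (pread(fd, buf, sizeof buf, c * 8192) == (ssize_t)sizeof buf)
                    process_chunk(buf, sizeof buf);
        });
    }
    for (auto& th : pool) th.join();
    close(fd);
}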
For a file on mechanical disk(s):
You want to minimize head seek(s) on the mechanical disk.
Open the file once, using open() with direct IO (assuming Linux, open( filename, O_RDONLY | O_DIRECT );) to bypass the page cache (since you're going to stream the file and never re-read any portion of it, the page cache does you no good here)
Using a single producer thread, read large chunks (say 64 KB to 1 MB+) into one of N page-aligned buffers.
When a buffer has been read, pass it to the worker threads, then read into the next buffer.
When all workers are done using their part of the buffer, pass the buffer back to the reading thread.
You'll need to experiment with the proper read() size, the number of worker threads, and the number of buffers passed around. Larger read()s will be more efficient, but the larger buffer size makes the memory requirements larger and makes the latency of getting that buffer back from the worker threads much more unpredictable. You want to make as few copies of the data as possible, so you'd want the worker threads to work directly on the buffer read from the file.
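A minimal sketch of that producer/consumer arrangement, assuming Linux and a hypothetical process_chunk(); it simplifies the buffer recycling described above by allocating a fresh aligned buffer per read:

#include <fcntl.h>   // O_DIRECT may require _GNU_SOURCE on some toolchains
#include <unistd.h>
#include <condition_variable>
#include <cstdlib>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

constexpr size_t kBufSize = 1 << 20;  // 1 MB reads (128 chunks); tune experimentally
constexpr int kWorkers = 4;

struct Block { char* data; ssize_t len; };

std::queue<Block> filled;             // producer -> workers
std::mutex m;
std::condition_variable cv;
bool done = false;

// Hypothetical per-chunk work; supply your own.
void process_chunk(const char* p, size_t n) {}

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !filled.empty() || done; });
        if (filled.empty()) return;   // done and drained
        Block b = filled.front();
        filled.pop();
        lk.unlock();
        // File size is N * 8192, so every block length is a chunk multiple.
        for (ssize_t off = 0; off < b.len; off += 8192)
            process_chunk(b.data + off, 8192);
        free(b.data);
    }
}

int main(int argc, char** argv) {
    int fd = open(argv[1], O_RDONLY | O_DIRECT);  // bypass the page cache
    if (fd < 0) return 1;
    std::vector<std::thread> pool;
    for (int i = 0; i < kWorkers; ++i) pool.emplace_back(worker);
    for (;;) {
        void* buf = nullptr;
        if (posix_memalign(&buf, 4096, kBufSize)) break;  // O_DIRECT needs alignment
        ssize_t n = read(fd, buf, kBufSize);
        if (n <= 0) { free(buf); break; }
        { std::lock_guard<std::mutex> lk(m); filled.push({static_cast<char*>(buf), n}); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
    for (auto& t : pool) t.join();
    close(fd);
}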
Even if the processing of each 8 KB block is significant (short of OCR processing), the I/O is the bottleneck, unless it can be arranged for parts of the file to be already cached by previous operations....
If the system this is to run on can be dedicated to the problem:
Obtain the file size (fstat)
Allocate a buffer that size.
Open and read the whole file into the buffer.
Figure out how to partition the data per thread and spin off the threads.
Time that algorithm.
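As a sketch of those steps, using OpenMP for the partitioning (process_chunk() is a hypothetical stand-in for the per-chunk work):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// Hypothetical per-chunk work; supply your own.
void process_chunk(const char* p, size_t n) {}

int main(int argc, char** argv) {
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);                     // 1. obtain the file size
    std::vector<char> buf(st.st_size);  // 2. allocate a buffer that size
    off_t off = 0;
    while (off < st.st_size) {          // 3. read the whole file
        ssize_t n = read(fd, buf.data() + off, st.st_size - off);
        if (n <= 0) return 1;
        off += n;
    }
    close(fd);
    const off_t chunks = st.st_size / 8192;
    #pragma omp parallel for            // 4. partition the data per thread
    for (off_t c = 0; c < chunks; ++c)
        process_chunk(buf.data() + c * 8192, 8192);
}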
Then, revise it using asynchronous reading. See man aio_read and man 7 aio to learn what needs to be done.
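For instance, a minimal sketch with one aio_read in flight while the previously read buffer is processed (the buffer size is illustrative; link with -lrt on glibc):

#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

int main(int argc, char** argv) {
    int fd = open(argv[1], O_RDONLY);
    static char buf[1 << 20];
    struct aiocb cb;
    std::memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;
    aio_read(&cb);                      // kick off the read; returns immediately
    // ... overlap computation on the previously read buffer here ...
    const struct aiocb* list[1] = { &cb };
    aio_suspend(list, 1, nullptr);      // wait for completion
    ssize_t n = aio_return(&cb);        // bytes actually read
    (void)n;
    close(fd);
}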
I want to read and store a large CSV file into a map. I started by just reading the file and seeing how long it takes to process. This is my loop:
while (!gFile.eof()) {
    gFile >> data;
}
It is taking me ~35 mins to process the CSV file, which contains 35 million lines and six columns. Is there any way to speed this up? Pretty new to SO, so apologies if I'm not asking correctly.
Background
Files are stream devices or concepts. The most efficient usage of reading a file is to keep the data streaming (flowing). For every transaction there is an overhead. The larger the data transfer, the less impact the overhead has. So, the goal is to keep the data flowing.
Memory faster than file access
Searching memory is many times faster than searching a file. So, searching for a "word" or delimiter is going to be faster than reading a file character by character to find the delimiter.
Method 1: Line by line
Using std::getline is much faster than using operator>>. Although the input code may read a block of data, you are only performing one transaction to read a record, versus one transaction per column. Remember: keep the data flowing, and searching memory for the columns is faster.
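A sketch of Method 1, assuming comma-separated columns as in the question:

#include <fstream>
#include <sstream>
#include <string>

int main(int argc, char** argv) {
    std::ifstream gFile(argv[1]);
    std::string line;
    while (std::getline(gFile, line)) {        // one transaction per record
        std::istringstream row(line);
        std::string field;
        while (std::getline(row, field, ',')) {  // columns are split in memory
            // convert/store the field here
        }
    }
}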
Method 2: Block reading
In the spirit of keeping the stream flowing, read a block of memory into a buffer (large buffer). Process the data from the buffer. This is more efficient than reading line by line because you can read in multiple lines of data with one transaction, reducing the overhead of a transaction.
One caveat is that you may have a record cross buffer boundaries, so you'll need to come up with an algorithm to handle that. The execution penalty is small and only happens once per transaction (consider this part of the overhead of a transaction).
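A sketch of Method 2, with the boundary-crossing caveat handled by carrying the partial record over to the next block (process_record() is a hypothetical stand-in):

#include <fstream>
#include <string>
#include <vector>

// Hypothetical per-record work; supply your own.
void process_record(const char* rec, size_t len) {}

int main(int argc, char** argv) {
    std::ifstream gFile(argv[1], std::ios::binary);
    std::vector<char> buf(1 << 20);  // 1 MB block per transaction
    std::string carry;               // partial record from the previous block
    while (gFile.read(buf.data(), buf.size()) || gFile.gcount() > 0) {
        size_t n = static_cast<size_t>(gFile.gcount());
        size_t start = 0;
        for (size_t i = 0; i < n; ++i) {
            if (buf[i] == '\n') {    // a complete record ends here
                carry.append(buf.data() + start, i - start);
                process_record(carry.data(), carry.size());
                carry.clear();
                start = i + 1;
            }
        }
        carry.append(buf.data() + start, n - start);  // save the tail
    }
    if (!carry.empty()) process_record(carry.data(), carry.size());
}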
Method 3: Multiple threads
In the spirit of keeping the data streaming, you could create multiple threads. One thread is in charge of reading the data into a buffer while another thread processes the data from the buffer. This technique will have better luck keeping the data flowing.
Method 4: Double buffering & multiple threads
This takes Method 3 above and adds multiple buffers. The reading thread can fill up one buffer then start filling a second buffer. The data processing thread will wait until the first buffer is filled before processing the data. This technique is used to better match the speed of reading data to the speed of processing the data.
Method 5: Memory mapped files
With a memory mapped file, the operating system handles the reading of the file into memory on demand. Less code that you have to write, but you don't get as much control over when the file is read into memory. This is still faster than reading field by field.
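A sketch of Method 5 on a POSIX system:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char** argv) {
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    char* p = static_cast<char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    // The OS pages the file in on demand as p[0 .. st.st_size) is touched.
    for (off_t i = 0; i < st.st_size; ++i)
        if (p[i] == '\n') { /* record boundary */ }
    munmap(p, st.st_size);
    close(fd);
}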
Let's start with the bottlenecks:
Reading from disk
Decoding the data
Store in map
Memory speed
Amount of memory
Read from disk
Read till you drop; if you aren't yet using all the bandwidth of the disk, you can still go faster. Ignore all other steps and only read.
Start by adding buffers to your input stream
Set hints for reading
use mmap
4 GB is a trivial size; if you don't already have 32 GB, upgrade.
Too slow? Buy an M.2 disk.
Still too slow? Then go more exotic: change the disk driver, dump the OS, mirror disks; only your $£€ is the limit.
Decode the data
If your data is in lines that are all the same length, then all decodes can be done in parallel, limited only by memory bandwidth.
If the line lengths vary only a little, finding the ends of lines can be done in parallel, followed by a parallel decode.
If the order of the lines doesn't matter for the final map, just split the file into #hardwarethreads parts and let each thread process its part, up to the first newline in the next thread's part.
Memory bandwidth will most likely be reached long before the CPU is anywhere near used up.
Store in map
Hopefully you have thought about this map in advance, as none of the std maps are thread safe.
If you don't care about order, a std::array can be used and you can run at full memory bandwidth.
Let's say you want to use std::unordered_map; there is a problem in that it needs to update its size after each write, so effectively you are limited to one thread writing to it.
You could use one thread at a time to update it while the others precompute the hashes of their records.
Having one thread write has the problem that nearly every write will be a cache miss, severely limiting speed.
So if that is not fast enough, roll your own hash_map without a size that must be updated on every write.
To ensure thread safety you also need to protect the writes; having one mutex makes you as slow as, or slower than, the single writer.
You could try to make it lock- and wait-free... if you're not an expert you will get a severe headache instead.
If you have selected a bucket design for your hash, then you could make X times the number of writer threads mutexes and use the hash value to select the mutex. The extra mutexes increase the likelihood that two threads won't collide, as in the sketch below.
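A sketch of that sharded-mutex idea (the shard count and value type are illustrative):

#include <array>
#include <mutex>
#include <string>
#include <unordered_map>

constexpr size_t kShards = 64;  // several times the number of writer threads

struct ShardedMap {
    std::array<std::mutex, kShards> locks;
    std::array<std::unordered_map<std::string, double>, kShards> shards;

    void insert(const std::string& key, double value) {
        size_t s = std::hash<std::string>{}(key) % kShards;  // hash picks the shard
        std::lock_guard<std::mutex> g(locks[s]);             // only this shard is locked
        shards[s].emplace(key, value);
    }
};

int main() {
    ShardedMap map;
    map.insert("key", 1.0);  // safe to call concurrently from many threads
}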
Memory speed
Each line will be transferred at least four times over the memory bus: once from the disk to RAM (at least once more if the driver is not good), once when the data is decoded, once when the map makes a read request, and once more when the map writes.
A good setup can save one more memory access if the driver writes to cache, so that the decode will not result in an LLC miss.
Amount of memory
You should have enough memory to hold the total file, the data structure, and some intermediate data.
Check if RAM is cheaper than your programming time.
I am currently developing scientific C++ simulation programs which read data, compute lots of values from it, and finally store the results in a file. I wanted to know whether reading all the data at once at the beginning of the program is faster than repeatedly accessing the file via std::ifstream during the program.
The data I am using are not very big (several MB), but I do not even know what "big" is for a heap allocation...
I guess it depends on the data and so on (and after some tests, it effectively does), but I was wondering what it depends on, and whether there is a kind of general principle we should be following.
Long story short, the question is: is keeping a file open and using file manipulators faster than a potentially big heap allocation and using string manipulators?
Take a look at mmap. This API allows you to map a file descriptor into your address space using the same paging mechanism as is used for RAM. This way you should get the benefit of random access to the data without unnecessarily copying unneeded data into RAM.
Is reading all the data at once at the beginning of the program faster than keep accessing the file via std::ifstream during the program? Yes, probably it is. Keep in mind that working memory is fast and expensive, while storage memory (a hard drive) exists precisely to be cheap at the cost of being slow.
What is "big" for a heap allocation? The operating system is going to try to fool your process into thinking that all existing working memory is free. This is not actually true, and the OS will "swap" one type of memory for the other if some process requests too much memory. But in principle, you should think that a heap allocation is big if it is comparable to the total size of working memory.
Is keeping a file open and using file manipulators faster than a potentially big heap allocation and using string manipulators? No, it is not faster, but it has another advantage: it is memory-efficient. If you only put the needed data into memory in order to work with it, you are saving memory for all the other processes on the machine (which could be other threads of your program, for instance). This is a very interesting property for building scalable software.
(Expect this to be closed because it is an "opinion based" question.)
My thoughts:
This sounds like premature optimization. Write it the easy way, then optimize if it is too slow.
Working in memory is generally thousands of times faster. Heap allocations slow down based on the number of allocations, not the size of the allocations. It does not sound like you are working with a whole lot of data though.
If your files are "several MB" then the OS will probably cache it anyway.
Reading data from a file in large chunks is a lot faster than many read requests of small size. For example, 1 read of 10MB is faster than 10 reads of 1MB.
When I optimize file I/O, I read the data into a uint8_t buffer and then parse the buffer. One thorn with this method is reading text files. There is a possibility that the text-encoded data may span a buffer boundary. For example, you have 4 numbers per text line and only 2 are in the buffer (or only 2 digits of a number are in the buffer). You will have to write code to handle these cases.
If you consider your program as a pipeline, you may be able to optimize futher. You can implement threads: a reading thread, a processing thread, and a writing (output) thread. The reading thread reads into buffers. When there is enough data for processing, the reading thread wakes up the processing thread. The processing thread processes the data that was read, and when there is some output, it stores it into a shared buffer and wakes up the output thread. So with the pipeline model, the data enters the pipe via the reading thread. At some point in the pipe, the processing thread processes the data. The writing thread takes the data from the processing thread and outputs it (exiting the pipeline).
Also, organizing your data so it fits into a processor cache line will also speed up your program.
In my current project I'm dealing with a big amount of data which is being generated on-the-run by means of a "while" loop. I want to write the data onto a CSV file, and I don't know what's better - should I store all the values in a vector array and write to the file at the end, or write in every iteration?
I guess the first choice is better, but I'd like an elaborated answer if that's possible. Thank you.
Make sure that you're using an I/O library with buffering enabled, and then write every iteration.
This way your computer can start doing disk access in parallel with the remaining computations.
PS. Don't do anything crazy like flushing after each write, or opening and closing the file each iteration. That would kill efficiency.
The most efficient method to write to a file is to reduce the number of write operations and increase the data written per operation.
Given a byte buffer of 512 bytes, the most inefficient method is to write 512 bytes, one write operation at a time. A more efficient method is to make one operation to write 512 bytes.
There is overhead associated with each call to write to a file. That overhead consists of locating the file in the drive's catalog, seeking to a new location on the drive, and writing. The actual operation of writing is quite fast; it's the seeking and waiting for the hard drive to spin up and get ready that wastes your time. So spin it up once, keep it spinning by writing a lot of stuff, then let it spin down. The more data written while the platters are spinning, the more efficient the write will be.
Yes, there are caches everywhere along the data path, but all that will be more efficient with large data sizes.
I would recommend writing the formatted data to a text buffer (sized as a multiple of 512 bytes), and at certain points, flushing the buffer to the hard drive. (512 bytes is a common sector size multiple on hard drives.) A sketch follows below.
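A minimal sketch of that idea (the flush threshold is an assumption to tune):

#include <cstdio>
#include <string>

class BufferedWriter {
    std::FILE* f;
    std::string buf;
    static constexpr size_t kFlushAt = 64 * 512;  // write in 32 KB pieces

public:
    explicit BufferedWriter(const char* path) : f(std::fopen(path, "w")) {}

    void append(const std::string& s) {
        buf += s;
        if (buf.size() >= kFlushAt) flush();
    }

    void flush() {
        if (f && !buf.empty()) {
            std::fwrite(buf.data(), 1, buf.size(), f);
            buf.clear();
        }
    }

    ~BufferedWriter() {  // flush the tail and close the file
        flush();
        if (f) std::fclose(f);
    }
};

int main() {
    BufferedWriter w("out.csv");
    for (int i = 0; i < 100000; ++i)
        w.append(std::to_string(i) + ",value\n");
}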
If you like threads, you can create a thread that monitors the output buffer. When the output buffer reaches a threshold, the thread writes the contents to drive. Multiple buffers can help by having the fast processor fill up buffers while other buffers are written to the slow drive.
If your platform has DMA you might be able to speed things up by having the DMA write the data for you. Although I would expect a good driver to do this automatically.
I do use this technique on an embedded system, using a UART (RS-232 port) instead of a hard drive. By using the buffering, I'm able to get about 80% efficiency.
(Loop unrolling may also help.)
The easiest way is in the console with the > operator. On Linux:
./miProgram > myData.txt
That takes the output of the program and puts it in a file.
Sorry for the English :)
I have a Linux application that reads 150-200 files (4-10GB) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2K each.
I currently need to maintain over 200 MB/s read rate combined from the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disk's reach at the moment).
We have implemented two different read systems, both of which make heavy use of posix_fadvise: the first is an mmap-ed read in which we map the entirety of the data set and read on demand.
The second is a read()/seek() based system.
Both work well, but only for the moderate cases; the read() method manages our overall file cache much better and can deal well with hundreds of GB of files, but is badly rate limited; mmap is able to pre-cache data, making a sustained data rate of over 200 MB/s easy to maintain, but cannot deal with large total data set sizes.
So my question comes to these:
A: Can read()-type file I/O be further optimized beyond the posix_fadvise calls on Linux, or, having tuned the disk scheduler, VMM, and posix_fadvise calls, is that as good as we can expect?
B: Are there systematic ways for mmap to better deal with very large mapped data?
Mmap-vs-reading-blocks is a similar problem to what I am working on, and it provided a good starting point on this problem, along with the discussions in mmap-vs-read.
Reads back to what? What is the final destination of this data?
Since it sounds like you are completely IO bound, mmap and read should make no difference. The interesting part is in how you get the data to your receiver.
Assuming you're putting this data to a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this using zero-copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily with the current file as stdin, and the pipe as stdout.
pid_t pid;
if ((pid = fork())) {                  // parent: wait for the child to finish
    waitpid(pid, NULL, 0);
} else {                               // child: wire the pipe up and exec cat
    dup2(dest, 1);                     // stdout -> destination pipe
    dup2(source, 0);                   // stdin  <- current file
    execlp("cat", "cat", (char*)NULL);
}
Update0
If your processing is file-agnostic, and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin, or a pipe.
To answer your more specific questions:
A: Can read()-type file I/O be further optimized beyond the posix_fadvise calls on Linux, or, having tuned the disk scheduler, VMM, and posix_fadvise calls, is that as good as we can expect?
That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading, etc., but it's dangerous and probably unproductive guesswork. I'd just go with splicing the files into a pipe.
B: Are there systematic ways for mmap to better deal with very large mapped data?
Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing):
MAP_HUGETLB
Allocate the mapping using "huge pages."
This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.
MAP_NORESERVE
Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.
This will prevent you running out of memory while keeping your implementation simple if you don't actually have enough physical memory + swap for the entire mapping.
MAP_POPULATE
Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered and lazy. I suspect this flag is redundant; the VFS likely does this better by default.
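A sketch combining these flags for a read-only file mapping (MAP_HUGETLB is omitted because, to my understanding, it generally requires hugetlbfs-backed mappings rather than a plain file):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void* map_big_file(const char* path, size_t* len) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    *len = st.st_size;
    // MAP_POPULATE prefaults the pages; MAP_NORESERVE skips swap reservation.
    void* p = mmap(nullptr, *len, PROT_READ,
                   MAP_PRIVATE | MAP_NORESERVE | MAP_POPULATE, fd, 0);
    close(fd);  // the mapping stays valid after close
    return p == MAP_FAILED ? nullptr : p;
}

int main(int argc, char** argv) {
    size_t len = 0;
    void* p = map_big_file(argv[1], &len);
    if (p) munmap(p, len);
}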
Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess, I could be wrong).
And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Couldn't that be half a megabyte instead?
The problem here doesn't seem to be which API is used. It doesn't matter whether you use mmap() or read(); the disc still has to seek to the specified point and read the data (although the OS does help to optimize the access).
mmap() has advantages over read() if you read very small chunks (a couple of bytes), because you don't have to call the OS for every chunk, which becomes very slow.
I would also advise, as Basile did, reading more than 2 KB consecutively so the disc doesn't have to seek that often.
My problem is this: I have a C/C++ app that runs under Linux, and this app receives a constant-rate high-bandwith (~27MB/sec) stream of data that it needs to stream to a file (or files). The computer it runs on is a quad-core 2GHz Xeon running Linux. The filesystem is ext4, and the disk is a solid state E-SATA drive which should be plenty fast for this purpose.
The problem is Linux's too-clever buffering behavior. Specifically, instead of writing the data to disk immediately, or soon after I call write(), Linux will store the "written" data in RAM, and then at some later time (I suspect when the 2GB of RAM starts to get full) it will suddenly try to write out several hundred megabytes of cached data to the disk, all at once. The problem is that this cache-flush is large, and holds off the data-acquisition code for a significant period of time, causing some of the current incoming data to be lost.
My question is: is there any reasonable way to "tune" Linux's caching behavior, so that either it doesn't cache the outgoing data at all, or if it must cache, it caches only a smaller amount at a time, thus smoothing out the bandwidth usage of the drive and improving the performance of the code?
I'm aware of O_DIRECT, and will use it if I have to, but it places some behavioral restrictions on the program (e.g. buffers must be aligned and a multiple of the disk sector size, etc.) that I'd rather avoid if I can.
You can use the posix_fadvise() with the POSIX_FADV_DONTNEED advice (possibly combined with calls to fdatasync()) to make the system flush the data and evict it from the cache.
See this article for a practical example.
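Along those lines, here is a minimal sketch of the flush-and-evict pattern, assuming data is appended in fixed-size slices (the file name and slice size are illustrative):

#include <fcntl.h>
#include <unistd.h>

void write_and_evict(int fd, const char* data, size_t len, off_t offset) {
    pwrite(fd, data, len, offset);
    fdatasync(fd);                                         // force the slice out to disk
    posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);   // drop its cached pages
}

int main() {
    int fd = open("stream.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    static char slice[1 << 16] = {};                       // one 64 KB slice of incoming data
    write_and_evict(fd, slice, sizeof slice, 0);
    close(fd);
}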
If you have latency requirements that the OS cache can't meet on its own (the default IO scheduler is usually optimized for bandwidth, not latency), you are probably going to have to manage your own memory buffering. Are you writing out the incoming data immediately? If you are, I'd suggest dropping that architecture and going with something like a ring buffer, where one thread (or multiplexed I/O handler) is writing from one side of the buffer while the reads are being copied into the other side.
At some size, this will be large enough to handle the latency required by a pessimal OS cache flush. Or not, in which case you're actually bandwidth limited and no amount of software tuning will help you until you get faster storage.
You can adjust the page cache settings in /proc/sys/vm, (see /proc/sys/vm/dirty_ratio, /proc/sys/vm/swappiness specifically) to tune the page cache to your liking.
If we are talking about std::fstream (or any C++ stream object)
You can specify your own buffer using:
streambuf* ios::rdbuf ( streambuf* streambuffer);
By defining your own buffer you can customize the behavior of the stream.
Alternatively you can always flush the buffer manually at pre-set intervals.
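For example, a sketch using pubsetbuf(), a close relative of the rdbuf() approach above (note it must be called before the file is opened, and the exact effect is implementation-defined):

#include <fstream>

int main() {
    static char big[1 << 20];                 // 1 MB stream buffer
    std::ofstream out;
    out.rdbuf()->pubsetbuf(big, sizeof big);  // install our own buffer
    out.open("data.csv");
    out << "hello\n";                         // buffered; flushed in large pieces
}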
Note: there is a reason for having a buffer. It is quicker than writing to the disk directly (every 10 bytes). There is very little reason to write to a disk in chunks smaller than the disk block size. If you write too frequently, the disk controller will become your bottleneck.
But I take issue with your write thread needing to block the read threads.
While the data is being written there is no reason why another thread cannot continue to read data from your stream (you may need some fancy footwork to make sure they are reading/writing to different areas of the buffer). But I don't see any real potential issue with this, as the IO system will go off and do its work asynchronously (potentially stalling your write thread (depending on your use of the IO system), but not necessarily your application).
I know this question is old, but we know a few things now we didn't know when this question was first asked.
Part of the problem is that the default values for /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio are not appropriate for newer machines with lots of memory. Linux begins the flush when dirty_background_ratio is reached, and blocks all I/O when dirty_ratio is reached. Lower dirty_background_ratio to start flushing sooner, and raise dirty_ratio to start blocking I/O later. On very large memory systems, (32GB or more) you may even want to use dirty_bytes and dirty_background_bytes, since the minimum increment of 1% for the _ratio settings is too coarse. Read https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ for a more detailed explanation.
Also, if you know you won't need to read the data again, call posix_fadvise with FADV_DONTNEED to ensure cache pages can be reused sooner. This has to be done after Linux has flushed the page to disk; otherwise the flush will move the page back to the active list (effectively negating the effect of fadvise).
To ensure you can still read incoming data in the cases where Linux does block on the call to write(), do file writing in a different thread than the one where you are reading.
Well, try this ten-pound-hammer solution that might prove useful to see if I/O system caching contributes to the problem: every 100 MB or so, call sync().
You could use a multithreaded approach: have one thread simply read data packets and add them to a FIFO, and another thread remove packets from the FIFO and write them to disk. This way, even if the write to disk stalls, the program can continue to read incoming data and buffer it in RAM. A sketch follows below.
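A minimal sketch of that split (the packet sizes and shutdown handshake are illustrative):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<std::vector<char>> fifo;  // reader -> writer
std::mutex m;
std::condition_variable cv;
bool done = false;

void writer(const char* path) {
    std::ofstream out(path, std::ios::binary);
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !fifo.empty() || done; });
        if (fifo.empty()) return;    // done and drained
        std::vector<char> pkt = std::move(fifo.front());
        fifo.pop();
        lk.unlock();
        out.write(pkt.data(), pkt.size());  // may stall; the reader keeps going
    }
}

void on_packet(std::vector<char> pkt) {     // called from the acquisition loop
    { std::lock_guard<std::mutex> lk(m); fifo.push(std::move(pkt)); }
    cv.notify_one();
}

int main() {
    std::thread w(writer, "out.bin");
    for (int i = 0; i < 3; ++i)
        on_packet(std::vector<char>(9 << 20, 'x'));  // fake 9 MB packets
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
    w.join();
}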