My application uses a text file to store its data.
I was testing for the fastest way of reading it by multithreading the operation.
I used the following 2 techniques:
Use as many streams as the NUMBER_OF_PROCESSORS environment variable indicates. Each stream runs on a different thread. Divide the total number of lines in the file equally among the streams. Parse the text.
Only one stream parses the entire file and loads the data into memory. Create threads (= NUMBER_OF_PROCESSORS - 1) to parse the data from memory.
The test was run on various file sizes from 100 kB to 800 MB.
Data in file:
100.23123 -42343.342555 ...(and so on)
4928340 -93240.2 349 ...
...
The data is stored in a 2D array of double.
Result: Both methods take approximately the same time for parsing the file.
Question: Which method should I choose?
Method 1 is bad for the hard disk, as multiple read accesses are performed at random locations simultaneously.
Method 2 is bad because the memory required is proportional to the file size. This can be partially overcome by limiting the container to a fixed size, deleting the parsed content, and refilling it from the reader, but that increases the processing time.
Method 2 has a sequential bottleneck (the single-threaded reading and handing out of the work items). It will not scale indefinitely, according to Amdahl's Law. It is a very fair and reliable method, though.
Method 1 has no bottleneck and will scale. Be sure not to cause random I/O on the disk: I'd use a mutex so that only one thread reads at a time, and read big sequential blocks of maybe 4-16 MB. In the time the disk needs for a single head seek it could have read about 1 MB of data.
If parsing the lines takes a considerable amount of time, you can't use method 2 because of the big sequential part. It would not scale. If parsing is fast, though, use method 2 because it is easier to get right.
To illustrate the concept of a bottleneck: imagine 1,000,000 computation threads asking one reader thread to give them lines. That one reader thread would not be able to hand out lines as quickly as they are demanded. You would not get 1e6 times the throughput; this would not scale. But if 1e6 threads read independently from a very fast I/O device, you would get 1e6 times the throughput, because there is no bottleneck. (I have used extreme numbers to make the point; the same idea applies in the small.)
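To make the mutex-serialized variant of Method 1 concrete, here is a minimal sketch, assuming each worker is handed a byte range of the file; the 8 MB block size and the parseChunk placeholder are my own choices, and splitting the ranges on line boundaries is left to the caller.

#include <algorithm>
#include <fstream>
#include <mutex>
#include <string>
#include <vector>

// Only one thread touches the disk at a time, and each read is a big
// sequential block. parseChunk is a placeholder for the double-parsing step.
std::mutex ioMutex;

void readRange(const std::string& path, std::streamoff begin, std::streamoff end)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(begin);
    constexpr std::streamoff kBlock = 8 * 1024 * 1024;   // 8 MB sequential reads
    std::string buffer;

    while (begin < end && in) {
        std::streamoff want = std::min(kBlock, end - begin);
        buffer.resize(static_cast<std::size_t>(want));
        std::streamsize got;
        {
            std::lock_guard<std::mutex> lock(ioMutex);    // serialize disk access
            in.read(&buffer[0], want);
            got = in.gcount();
        }
        if (got <= 0) break;
        // parseChunk(buffer.data(), got);                // parse the doubles here
        begin += got;
    }
}

Each thread runs readRange on its own slice of the file; because the reads are serialized and sequential within a slice, the disk does at most one seek per block.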
I'd prefer a slightly modified method 2: read the data sequentially in a single thread in big chunks. Pass each ready chunk to a thread pool where it is processed, so you get concurrent reading and processing.
With enough RAM you can do it without a single-thread bottleneck. For Linux:
1) mmap your whole file into RAM with MAP_LOCKED; this requires root or a system-wide permissions tweak. Or skip MAP_LOCKED for an SSD, since SSDs handle random access well.
2) Give each thread a start position. Each thread processes the data from the first newline after its own start position to the first newline after the next thread's start position.
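A minimal sketch of this layout for Linux; the file name and the empty processRange body are placeholders, error checking is omitted, and MAP_LOCKED is left out as noted above.

#include <algorithm>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <thread>
#include <vector>

// Parsing of the lines in [begin, end) would go here.
void processRange(const char* data, std::size_t begin, std::size_t end) {}

int main()
{
    const char* path = "data.txt";                        // placeholder file name
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    std::size_t size = static_cast<std::size_t>(st.st_size);

    char* data = static_cast<char*>(mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0));

    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i) {
        std::size_t begin = size * i / n;
        std::size_t end   = size * (i + 1) / n;
        // Move each bound forward to just past a newline so no line is split.
        while (begin > 0 && begin < size && data[begin - 1] != '\n') ++begin;
        while (end < size && data[end - 1] != '\n') ++end;
        threads.emplace_back(processRange, data, begin, end);
    }
    for (auto& t : threads) t.join();

    munmap(data, size);
    close(fd);
}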
PS: What is your program's CPU load? Probably the HDD is the bottleneck.
Related
I have a C++ program whose task is to analyse a stream of binary data (typically it's a file on disk) and extract some information. This task is "memory-less", meaning that the outcome of each step is independent of the previous one. Because of this, I thought to speed it up by giving the data to separate threads in order to improve performance.
For now, the data is read in blocks of 1GB at a time and saved in an array to avoid I/O bottlenecks. Should I separate the data in n chunks/arrays (where n is the number of threads) or is a single array accessed by multiple threads not an issue?
EDIT 1: data and analysis specification
I realize that the wording of the problem might be too broad, as pointed out by one of the comments. I will try to go into a bit more detail.
The data being analysed is a series of unsigned 64-bit integers generated by a so-called "time-to-digital" converter (TDC), each storing timestamp information about some event it registers. My TDC has multiple channels, so each timestamp records which channel triggered (first 3 bits), whether that was a rising or falling edge trigger (4th bit), and the actual time (in clock ticks since powering up the TDC, last 60 bits).
The timestamps, of course, are saved in the file chronologically. The task is finding coincidence events between channels within a certain time window which the user sets. So you keep reading the timestamps, and when you find two in the channels of interest whose separation in time is less than the set window, you increase the number of coincidence events.
These files can be quite big (tens of GB) and the number of timestamps enormous (one clock tick is 80 picoseconds).
For now I only go through the whole file once, and the idea was to "cut it" into smaller pieces that would then be analysed by different threads. The possible loss of events at the cuts is acceptable for me, since it will be at most 2 out of hundreds of thousands.
Of course, the threads would only read data from the file/memory. I can write the coincidence counts into three separate variables and then sum them when all the threads finish, if this helps avoid sync problems.
I hope now things are clearer.
Yes, the same array can be accessed by multiple threads: if threads only read the array (which seems to be the case here), you won't have false sharing effects.
And to optimize cache use, you could make each thread read consecutive elements of the array (i.e not interleave reads between threads).
As a side-note, you may want to reconsider the 1GB block : that's a lot! Have you measured that it's better than, say, 1MB or 10KB ?
You also might want to parallelize "file reading" (one small chunk at a time) and "processing the content that was read" (using many threads as you do), using (at least) two arrays (one being processed while the other receives the next read).
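As a sketch of the per-thread slicing with thread-private counters suggested above; the bit layout in decode() and the neighbouring-events-only coincidence test are simplifying assumptions, to be adapted to the real TDC word format and window logic.

#include <cstdint>
#include <thread>
#include <vector>

struct Hit { unsigned channel; std::uint64_t time; };

// Assumed layout: channel in the top 3 bits, time in the low 60 bits
// (adjust the shifts/masks to the real TDC format).
Hit decode(std::uint64_t word)
{
    return { static_cast<unsigned>(word >> 61),
             word & ((std::uint64_t(1) << 60) - 1) };
}

std::uint64_t countCoincidences(const std::vector<std::uint64_t>& data,
                                unsigned chA, unsigned chB,
                                std::uint64_t window, unsigned nThreads)
{
    std::vector<std::uint64_t> counts(nThreads, 0);      // one counter per thread
    std::vector<std::thread> threads;
    const std::size_t n = data.size();

    for (unsigned t = 0; t < nThreads; ++t) {
        threads.emplace_back([&, t] {
            std::uint64_t local = 0;
            std::size_t begin = n * t / nThreads;         // consecutive slice per thread
            std::size_t end   = n * (t + 1) / nThreads;
            for (std::size_t i = begin; i + 1 < end; ++i) {
                Hit a = decode(data[i]);
                Hit b = decode(data[i + 1]);              // only neighbouring events here
                if (((a.channel == chA && b.channel == chB) ||
                     (a.channel == chB && b.channel == chA)) &&
                    b.time - a.time < window)
                    ++local;
            }
            counts[t] = local;                            // single write at the end
        });
    }
    for (auto& th : threads) th.join();

    std::uint64_t total = 0;
    for (auto c : counts) total += c;
    return total;
}

Each thread writes its count once, at the end, so the threads never touch shared state while scanning.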
I want to read and store a large CSV file into a map. I started by just reading the file and seeing how long it takes to process. This is my loop:
while (!gFile.eof()) {
    gFile >> data;
}
It is taking ~35 minutes to process the CSV file, which contains 35 million lines and six columns. Is there any way to speed this up? I'm pretty new to SO, so apologies if I'm not asking correctly.
Background
Files are stream devices or concepts. The most efficient usage of reading a file is to keep the data streaming (flowing). For every transaction there is an overhead. The larger the data transfer, the less impact the overhead has. So, the goal is to keep the data flowing.
Memory faster than file access
Searching memory is many times faster than searching a file. So searching for a "word" or delimiter in memory is going to be faster than reading a file character by character to find the delimiter.
Method 1: Line by line
Using std::getline is much faster than using operator>>. Although the input code may read a block of data, you are only performing one transaction to read a record versus one transaction per column. Remember: keep the data flowing, and searching memory for the columns is faster.
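A minimal sketch of Method 1, assuming a comma delimiter and leaving the per-record work as a comment:

#include <fstream>
#include <string>
#include <vector>

// One getline() per record, then split the line in memory.
void readByLine(const char* path)
{
    std::ifstream in(path);
    std::string line;
    std::vector<std::string> columns;

    while (std::getline(in, line)) {          // one transaction per record
        columns.clear();
        std::size_t start = 0;
        while (true) {                        // search memory for the delimiter
            std::size_t comma = line.find(',', start);
            columns.emplace_back(line, start, comma - start);
            if (comma == std::string::npos) break;
            start = comma + 1;
        }
        // use columns[0..5] here
    }
}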
Method 2: Block reading
In the spirit of keeping the stream flowing, read a block of memory into a buffer (large buffer). Process the data from the buffer. This is more efficient than reading line by line because you can read in multiple lines of data with one transaction, reducing the overhead of a transaction.
One caveat is that you may have a record cross a buffer boundary, so you'll need an algorithm to handle that. The execution penalty is small and only happens once per transaction (consider it part of the overhead of a transaction).
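A sketch of Method 2 under the assumption of newline-terminated records; the 4 MB block size and the parseLines placeholder are arbitrary, and the carried-over partial record handles the boundary caveat:

#include <fstream>
#include <string>
#include <vector>

void readByBlock(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    constexpr std::size_t kBlock = 4 * 1024 * 1024;    // 4 MB per read
    std::string carry;                                  // partial record from previous block
    std::vector<char> block(kBlock);

    while (in) {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;

        std::string data = carry;
        data.append(block.data(), got);

        std::size_t lastNewline = data.rfind('\n');
        if (lastNewline == std::string::npos) {         // record spans the whole block
            carry = data;
            continue;
        }
        // parseLines(data.data(), lastNewline + 1);    // process the complete lines here
        carry.assign(data, lastNewline + 1, std::string::npos);
    }
    // if (!carry.empty()) parse the final, unterminated record
}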
Method 3: Multiple threads
In the spirit of keeping the data streaming, you could create multiple threads. One thread is in charge of reading the data into a buffer while another thread processes the data from the buffer. This technique will have better luck keeping the data flowing.
Method 4: Double buffering & multiple threads
This takes Method 3 above and adds multiple buffers. The reading thread can fill up one buffer then start filling a second buffer. The data processing thread will wait until the first buffer is filled before processing the data. This technique is used to better match the speed of reading data to the speed of processing the data.
Method 5: Memory mapped files
With a memory mapped file, the operating system handles the reading of the file into memory on demand. Less code that you have to write, but you don't get as much control over when the file is read into memory. This is still faster than reading field by field.
Let's start with the bottlenecks.
Reading from disk
Decoding the data
Store in map
Memory speed
Amount of memory
Read from disk
Read till you drop: if you aren't reading fast enough to use all the bandwidth of the disk, you can go faster. Ignore all other steps and only read.
Start by adding buffers to your instream (see the sketch after this list).
Set hints for reading
use mmap
4 GB is a trivial size; if you don't already have 32 GB, upgrade.
Too slow? Buy an M.2 disk.
Still too slow? Then go more exotic: change the disk driver, dump the OS, mirror disks; only your $£€ is the limit.
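For the "add buffers to your instream" item above, a minimal sketch; whether pubsetbuf takes effect is implementation-defined, so call it before open() and measure (the 1 MB size and file name are placeholders):

#include <fstream>
#include <vector>

int main()
{
    std::vector<char> buf(1 << 20);                    // 1 MB stream buffer
    std::ifstream in;
    in.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
    in.open("data.csv");                               // placeholder file name
    // ... read as usual; each underlying read now fetches a large block
}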
Decode the data
if your data is in lines that are all the same length, then all decodes can be done in parallel, limited only by memory bandwidth.
if the line lengths vary only a little, finding the ends of lines can be done in parallel, followed by parallel decode.
if the order of the lines doesn't matter for the final map, just split the file into #hardwarethreads parts and let each thread process its part until the first newline in the next thread's part.
memory bandwidth will most likely be reached far before the CPU is anywhere near used up.
Store in map
hopefully you have thought about this map in advance, as none of the std maps are thread safe.
if you don't care about order, a std::array can be used and you can run at full memory bandwidth.
let's say you want to use std::unordered_map; there is a problem that it needs to update its size after each write, so effectively you are limited to 1 thread writing to it.
You could use 1 thread at a time to update it while the others precompute the hashes of the records.
having one thread write has the problem that nearly every write will be a cache miss, severely limiting speed.
so if that is not fast enough, roll your own hash_map without a size that must be updated on every write.
to ensure thread safety you also need to protect the writes; having one mutex makes you as slow as or slower than the single writer.
you could try to make it lock- and wait-free ... if you're not an expert you will get a severe headache instead.
if you have selected a bucket design for your hash, you could make X times the number of writer threads mutexes and use the hash value to select the mutex. The extra mutexes increase the likelihood that two threads won't collide.
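The answer above suggests rolling your own bucketed hash map; as a simpler stand-in, here is a sketch of the same locking idea applied to sharded std::unordered_maps, with the hash value selecting both the shard and its mutex (the key/value types and the stripe count are assumptions):

#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Writers rarely collide because the hash spreads them over many stripes.
class StripedMap {
public:
    explicit StripedMap(std::size_t stripes)
        : maps_(stripes), locks_(stripes) {}

    void insert(const std::string& key, double value)
    {
        std::size_t h = std::hash<std::string>{}(key);
        std::size_t s = h % maps_.size();              // hash selects the stripe
        std::lock_guard<std::mutex> lock(locks_[s]);
        maps_[s][key] = value;
    }

private:
    std::vector<std::unordered_map<std::string, double>> maps_;
    std::vector<std::mutex> locks_;
};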
Memory speed
Each line will be transferred at least four times over the memory bus: once from the disk to RAM (at least once more if the driver is not good), once when the data is decoded, once when the map makes a read request, and once more when the map writes.
A good setup can save one memory access if the driver writes to cache, so that the decode will not result in an LLC miss.
Amount of memory
you should have enough memory to hold the total file, the data structure and some intermediate data.
Check if RAM is cheaper than your programming time.
I need to read / parse a large binary file (4 ~ 6 GB) that comes in fixed chunks of 8192 bytes. My current solution involves streaming the file chunks using the Single Producer Multiple Consumer (SPMC) pattern.
EDIT
File size = N * 8192 Bytes
All I am required to do is to do something to each of these 8192 bytes. The file is only required to be read once top down.
Having thought that this should be an embarrassingly parallel problem, I would like to have X threads to read at equal ranges of (File Size / X) sizes independently. The threads do not need to communicate with each other at all.
I've tried spawning X threads that open the same file and seek to their respective sections to process; however, this solution seems to have a problem due to HDD mechanical seeks and apparently performs worse than the SPMC solution.
Would there be any difference if this method is used on the SSD instead?
Or would it be more straightforward to just memory map the whole file and use #pragma omp parallel for to process the chunks? I suppose I would need sufficient RAM to do this?
What would you suggest?
What would you suggest?
Don't use mmap()
Per Linus Torvalds himself:
People love mmap() and other ways to play with the page tables to optimize away a copy operation, and sometimes it is worth it.
HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.
Downsides to mmap:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Upsides of mmap:
if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread.
This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it's just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.
if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again.
And the automatic sharing is obviously a case of this.
But your test-suite (just copying the data once) is probably pessimal for mmap().
Note the last - just using the data once is a bad use-case for mmap().
For a file on an SSD, since there are no physical head seek movements:
Open the file once, using open() to get a single int file descriptor.
Use pread() per thread to read the appropriate 8 kB chunks. pread() reads from a specified offset without using lseek(), and does not affect the current offset of the file being read from.
You'll probably need somewhat more threads than CPU cores, since there's going to be significant IO waiting on each thread.
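A minimal sketch of the SSD recipe; the file name is a placeholder, error handling is omitted, and "2 x hardware_concurrency" stands in for "somewhat more threads than cores":

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kChunk = 8192;

// One shared fd; each thread pread()s its own chunks at its own offsets.
void worker(int fd, std::uint64_t firstChunk, std::uint64_t lastChunk)
{
    std::vector<char> buf(kChunk);
    for (std::uint64_t c = firstChunk; c < lastChunk; ++c) {
        ssize_t got = pread(fd, buf.data(), kChunk,
                            static_cast<off_t>(c * kChunk));
        if (got <= 0) break;
        // process(buf.data(), got);                    // placeholder per-chunk work
    }
}

int main()
{
    int fd = open("input.bin", O_RDONLY);               // placeholder file name
    off_t size = lseek(fd, 0, SEEK_END);
    std::uint64_t chunks = static_cast<std::uint64_t>(size) / kChunk;

    unsigned n = 2 * std::thread::hardware_concurrency();   // more threads than cores
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
        threads.emplace_back(worker, fd,
                             chunks * i / n, chunks * (i + 1) / n);
    for (auto& t : threads) t.join();
    close(fd);
}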
For a file on mechanical disk(s):
You want to minimize head seek(s) on the mechanical disk.
Open the file once, using open() with direct IO (assuming Linux, open( filename, O_RDONLY | O_DIRECT );) to bypass the page cache (since you're going to stream the file and never re-read any portion of it, the page cache does you no good here)
Using a single producer thread, read large chunks (say 64 kB to 1 MB+) into one of N page-aligned buffers.
When a buffer is read, pass it to the worker threads, then read to fill the next buffer
When all workers are done using their part of the buffer, pass the buffer back to the reading thread.
You'll need to experiment with the proper read() size, the number of worker threads, and the number of buffers passed around. Larger read()s will be more efficient, but the larger buffer size makes the memory requirements larger and makes the latency of getting that buffer back from the worker threads much more unpredictable. You want to make as few copies of the data as possible, so you'd want the worker threads to work directly on the buffer read from the file.
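A condensed sketch of the mechanical-disk recipe; to keep it short it allocates a fresh page-aligned buffer per chunk instead of recycling N buffers back to the reader, and the 1 MB chunk size, file name and process() placeholder are assumptions (O_DIRECT is Linux-specific):

#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Chunk { char* data; std::size_t size; };

std::queue<Chunk> queue_;
std::mutex m_;
std::condition_variable cv_;
bool done_ = false;

// Workers drain the queue and free each buffer when done with it.
void workerLoop()
{
    for (;;) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [] { return !queue_.empty() || done_; });
        if (queue_.empty()) return;
        Chunk c = queue_.front();
        queue_.pop();
        lock.unlock();
        // process(c.data, c.size);                     // placeholder per-chunk work
        std::free(c.data);
    }
}

int main()
{
    int fd = open("input.bin", O_RDONLY | O_DIRECT);    // bypass the page cache
    constexpr std::size_t kChunk = 1 << 20;             // 1 MB reads

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        workers.emplace_back(workerLoop);

    for (;;) {                                          // single producer thread
        void* buf = nullptr;
        posix_memalign(&buf, 4096, kChunk);             // O_DIRECT needs aligned buffers
        ssize_t got = read(fd, buf, kChunk);
        if (got <= 0) { std::free(buf); break; }
        {
            std::lock_guard<std::mutex> lock(m_);
            queue_.push({static_cast<char*>(buf), static_cast<std::size_t>(got)});
        }
        cv_.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m_); done_ = true; }
    cv_.notify_all();
    for (auto& w : workers) w.join();
    close(fd);
}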
Even if the processing of each 8K block is significant (short of OCR processing), the I/O is the bottleneck, unless it can be arranged for parts of the file to already be cached by previous operations...
If the system this is to run on can be dedicated to the problem:
Obtain the file size (fstat)
Allocate a buffer that size.
Open and read the whole file into the buffer.
Figure out how to partition the data per thread and spin off the threads.
Time that algorithm.
Then, revise it using asynchronous reading. See man aio_read and man 7 aio to learn what needs to be done.
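A sketch of the asynchronous variant using POSIX AIO (link with -lrt on Linux); the file name is a placeholder, error checks are omitted, and a single multi-gigabyte aio_read may be clamped by the implementation, in which case several requests would be issued instead:

#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <vector>

int main()
{
    int fd = open("input.bin", O_RDONLY);               // placeholder file name
    off_t size = lseek(fd, 0, SEEK_END);
    std::vector<char> buffer(static_cast<std::size_t>(size));   // assumes enough RAM

    struct aiocb cb;
    std::memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buffer.data();
    cb.aio_nbytes = buffer.size();
    cb.aio_offset = 0;
    aio_read(&cb);                                       // start the read in the background

    // ... partition the work and spin off threads while the read is in flight ...

    const struct aiocb* list[1] = { &cb };
    aio_suspend(list, 1, nullptr);                       // block until the read completes
    ssize_t got = aio_return(&cb);                       // bytes actually read
    (void)got;
    close(fd);
}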
I am getting data from a sensor (camera) and writing the data into a binary file. The problem is that it takes a lot of space on the disk.
So I used the compression from boost (zlib), and the space required dropped a lot! The problem is that the compression process is slow and lots of data is lost.
So, I want to implement two threads, with one getting the data from the camera and writing the data into a buffer. The second thread will take the front data of the buffer and write it into the binary file. And in this case, all the data will be present.
How do I implement this buffer? It needs to grow dynamically and support pop_front. Should I use std::deque, or does something better already exist?
First, you have to consider these four rates (or speeds):
Speed of Production (SP): The average number of bytes your sensor produces per second.
Speed of Compression (SC): The average number of bytes per second you can compress. This is the number of input bytes to the compression algorithm.
Rate of Compression (RC): The average ratio of compressed data to uncompressed data your compress algorithm produces (ratio of size of output to the input of compression.) (This is obviously somewhere between 0 and 1.)
Speed of Writing (SW): The average number of bytes you can write to disk, per second.
If SC is less than SP, you are in trouble. It means you can't compress all the data you gather from your sensor, in real time. Which means you'll eventually run out of buffer memory. You'll have to find a faster compression algorithm, or dedicate more CPU cores to compression.
If SW is less than SP times RC (which is the size of sensor data after compression,) you are again in trouble. It means you can't write out your output data as fast as you are producing and compressing them, and again, you will eventually run out of buffer memory, no matter how much you have. You might be able to gain some speed by adopting a better write strategy or file system, but a real gain in SW comes from a better disk system (RAID, SSD, better hardware, etc.)
Now, if everything is OK speed-wise, you can probably employ something like the following architecture to read, compress and write the data out:
You'll have three threads (or two, described later) that do one part of the pipeline each. You'll also have two thread-safe queues, one for communication from each stage of the pipeline to the next.
Assuming the two queues are named Q1 and Q2, the high-level operation of the threads will look like this:
Input Thread:
Read K bytes of sensor data
Put the whole K bytes as a unit on Q1.
Go to 1.
Compression Thread:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer and put it on Q2.
Go to 1.
Output Thread:
Wait till there is something on Q2.
Pop one buffer of data from Q2.
Write the buffer to the output file.
Go to 1.
The most CPU-intensive part of the work is in the second thread, and the other two probably don't consume much CPU time and therefore probably can share a CPU core. This means that the above strategy may be runnable on two cores. But it can also run on a single core if the workload is light, or require many many cores. That all depends on the four rates I described up top.
Using asynchronous writes (e.g. IOCP on Windows or epoll on Linux,) you can drop the third thread and the second queue altogether. Then your second thread needs to execute something like this:
Wait till there is something on Q1.
Pop one buffer of data (probably K bytes) from Q1.
Compress the buffer into a hopefully smaller buffer.
Issue an asynchronous write request to the OS to write out the compressed buffer to disk.
Go to 1.
There are four more issues worth mentioning:
K should be selected so that the time required for various (usually constant time) activities associated with allocating a buffer, pushing it into and popping it from a thread-safe queue, starting a compression run and issuing a write request into a file become negligible relative to doing the actual work (reading sensor data, compressing bytes and writing to disk.) This usually means that K needs to be as large as possible. But if K is very large (many megabytes or hundreds of megabytes) then if your application crashes, you'll lose a lot of data. You need to find a balance between performance and risk of data loss. I suggest (without any knowledge of your specific needs and constraints) a value between 10KiB to 1MiB for K.
Implementing a thread-safe queue is easy if you have some knowledge of and experience with concurrent/parallel programming, but rather hard and error-prone if you do not. Finding good examples and implementations should not be hard. A plain std::deque or std::list or std::anything won't be usable by itself, but it can be used as a good basis for writing a thread-safe queue (a minimal sketch follows this list).
Note that you are queuing buffers of data, not individual numbers or bytes. If you pass your data one number at a time through this pipeline, it will be painfully slow and wasteful.
Some compression algorithms are limited in how much data they can consume in each invocation, or require that you sync the output of each call to the compression routine with one call to the decompression routine later on. These might affect the choice of K, and also how you write your output file. You might have to add some metadata so that you can actually decompress and read the data later.
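Here is the minimal blocking-queue sketch referenced in the second item above, built on std::deque and queuing whole buffers; shutdown signalling and capacity limits are left out:

#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

// Queues whole buffers (std::vector<char>), never individual bytes.
class BufferQueue {
public:
    void push(std::vector<char> buf)
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push_back(std::move(buf));
        }
        cv_.notify_one();
    }

    std::vector<char> pop()                              // blocks until data is available
    {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        std::vector<char> buf = std::move(q_.front());
        q_.pop_front();
        return buf;
    }

private:
    std::deque<std::vector<char>> q_;
    std::mutex m_;
    std::condition_variable cv_;
};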
I was working on a C++ tutorial exercise that asked to count the number of words in a file. It got me thinking about the most efficient way to read the inputs. How much more efficient is it really to read the entire file at once than it is to read small chunks (line by line or character by character)?
The answer changes depending on how you're doing the I/O.
If you're using the POSIX open/read/close family, reading one byte at a time will be excruciating since each byte will cost one system call.
If you're using the C fopen/fread/fclose family or the C++ iostream library, reading one byte at a time still isn't great, but it's much better. These libraries keep an internal buffer and only call read when it runs dry. However, since you're doing something very trivial for each byte, the per-call overhead will still likely dwarf the per-byte processing you actually have to do. But measure it and see for yourself.
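For a sense of scale, a sketch that counts words with block-sized fread() calls instead of one call per byte; the 64 KB block and the file name are arbitrary:

#include <cctype>
#include <cstdio>

int main()
{
    std::FILE* f = std::fopen("input.txt", "rb");        // placeholder file name
    if (!f) return 1;
    char buf[64 * 1024];
    long words = 0;
    bool inWord = false;

    std::size_t got;
    while ((got = std::fread(buf, 1, sizeof buf, f)) > 0) {
        for (std::size_t i = 0; i < got; ++i) {
            bool space = std::isspace(static_cast<unsigned char>(buf[i])) != 0;
            if (!space && !inWord) ++words;               // a new word starts here
            inWord = !space;
        }
    }
    std::printf("%ld words\n", words);
    std::fclose(f);
}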
Another option is to simply mmap the entire file and just do your logic on that. You might, or might not, notice a performance difference between mmap with and without the MAP_POPULATE flag. Again, you'll have to measure it and see.
The most efficient method for I/O is to keep the data flowing.
That said, reading one block of 512 characters is faster than 512 reads of 1 character each. Your system may have optimizations, such as caches, that make reading faster, but you still have the overhead of all those function calls.
There are different methods to keep the I/O flowing:
Memory mapped file I/O
Double buffering
Platform Specific API
Some simple experiments should suffice for demonstration.
Create a vector or array of 1 megabyte.
Start a timer.
Repeat 1000 times:
Read data into container using 1 read instruction.
End the timer.
Repeat, using a for loop, reading 1,000,000 characters, with 1 read instruction each.
Compare your data.
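A single-iteration sketch of that experiment using std::chrono; wrap each timed part in the repeat loops described above for steadier numbers, and note that the file name is a placeholder and the file is assumed to be at least 1 MB:

#include <chrono>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    using clock = std::chrono::steady_clock;
    constexpr std::size_t kSize = 1'000'000;
    std::vector<char> buffer(kSize);

    std::ifstream a("data.bin", std::ios::binary);
    auto t0 = clock::now();
    a.read(buffer.data(), kSize);                        // one big read
    auto t1 = clock::now();

    std::ifstream b("data.bin", std::ios::binary);
    auto t2 = clock::now();
    for (std::size_t i = 0; i < kSize; ++i)
        b.read(&buffer[i], 1);                           // one million tiny reads
    auto t3 = clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "block read: " << ms(t1 - t0).count() << " ms\n"
              << "char reads: " << ms(t3 - t2).count() << " ms\n";
}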
Details
For each request from the hard drive, the following steps are performed (depending on platform optimizations):
Start hard drive spinning.
Read filesystem directory.
Search directory for the filename.
Get logical position of the byte requested.
Seek to the given track & sector.
Read 1 or more sectors of data into hard drive memory.
Return the requested portion of hard drive memory to the platform.
Spin down the hard drive.
This is called overhead (except for the step that reads the sectors into hard drive memory).
The objective is to get as much data transferred as possible while the hard drive is spinning. Starting a hard drive takes more time than keeping it spinning.