If I have multiple threads generating blocks of a file, what is the best way to write out the blocks?
ex) 5 threads working on a file of 500 blocks; block 0 is not necessarily completed before block 1, but the output file on disk needs to be in order (block 0, block 1, block 2, ... block 499).
The program is in C++. Can fwrite() somehow "random access" the file? The file is created from scratch, meaning that when block 5 is completed, the file may still be of size 0 because blocks 1-4 are not completed yet. Can I directly write out block 5 (with a proper fseek)?
This piece of code is performance-critical, so I'm really curious about anything that can improve the perf. This looks like a multiple-producer (block generators), single-consumer (output writer) scenario. The ideal case is that thread A can continue generating the next block as soon as it completes the previous one.
If fwrite can be "random", then the output writer can simply take outputs, seek, and then write. However, I'm not sure this design will perform well at a large scale.
Some limitations:
Each block is the same size and is generated in memory.
The block size is known in advance, but not the total number of blocks.
The total size is a few GB. Big.
There could be multiple jobs running on one server; each job is as described above. They have their own independent generators/writers, in different processes.
The server is a Linux/CentOS machine.
Assuming each block is the same size, and that the blocks are generated in memory before they are required to be written to disk, then a combination of lseek and write would be perfectly fine.
If you are able to write the entire block in one write you would not gain any advantage from using fwrite -- so just use write directly -- however you would need some sort of locking access control (a mutex) if all the threads share the same fd, since seek+write cannot be done atomically and you would not want one thread to seek away just before a second thread is about to write.
This further assumes that your file system is a standard file system and not of some exotic nature, since not every input/output device supports lseek (a pipe, for example).
Update: lseek can seek beyond the end of the file; just set the whence parameter to SEEK_SET and the offset to the absolute position in the file (fseek has the same option, but I have never used it).
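A minimal sketch of the writer side under that scheme (the function and variable names here are placeholders, not from the question):

#include <cstddef>
#include <mutex>
#include <unistd.h>     // lseek, write

std::mutex fd_mutex;    // protects the shared file descriptor

// Write one finished block at its final position. The offset may be past the
// current end of file; lseek allows that, and the gap is filled in later when
// the earlier blocks arrive.
bool write_block(int fd, const char* data, size_t block_size, size_t block_index)
{
    std::lock_guard<std::mutex> lock(fd_mutex);   // seek and write must not interleave
    off_t offset = (off_t)block_index * (off_t)block_size;
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return false;
    return write(fd, data, block_size) == (ssize_t)block_size;  // real code should also handle short writes
}

On Linux you could also use pwrite(fd, data, block_size, offset), which performs the seek and the write as a single call, so the mutex around the descriptor becomes unnecessary as long as every block goes to a distinct offset.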
Related
I'm designing a musical looper, which is something that, once a first recording of x seconds is made, keeps replaying those x seconds and, on each iteration, adds new content to the loop.
Since the size of the first recording can vary, I cannot do this with RAM-allocated memory; I must place it on disk.
Long story short, I cannot afford to close the file and open it again on every loop iteration, so I need to write to and read from the same file.
If I protect this file with a mutex, can I do that without undefined behaviour?
Since the size of the first recording can vary, I cannot do this with RAM-allocated memory; I must place it on disk.
Your assumption is simply wrong. Just because the size of the recording can vary does not mean you have to put it on disk. For example, you could store your recording in a std::vector<unsigned char>, i.e. a vector holding bytes; you can add or remove any number of bytes you want. Even this is rather low level: you would be better off defining your own application-specific data structure so you can fluently modify your recording without worrying about files, bytes, and memory.
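A minimal sketch of that idea, with hypothetical names (the byte-mixing in overdub is only illustrative, not a claim about how your looper should mix audio):

#include <cstddef>
#include <vector>

struct Loop {
    std::vector<unsigned char> samples;   // the recording, however long it turns out to be

    // Append freshly captured audio; the vector grows as needed, so the
    // variable length of the first recording is not a problem.
    void append(const unsigned char* data, std::size_t n) {
        samples.insert(samples.end(), data, data + n);
    }

    // Mix new content into an existing position on a later iteration.
    void overdub(std::size_t offset, const unsigned char* data, std::size_t n) {
        for (std::size_t i = 0; i < n && offset + i < samples.size(); ++i)
            samples[offset + i] =
                (unsigned char)((samples[offset + i] + data[i]) / 2);  // naive 50/50 mix
    }
};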
If you share a few pieces of your code, people can make suggestions based on that.
I have this function in my MMF class
void Clear() {
    int size = SizeB();
    int iter = size / sysInfo.granB;
    for (int i = 0; i < iter; i++) {
        auto v = (char*)MapViewOfFile(hMmf, FILE_MAP_READ | (write ? FILE_MAP_WRITE : 0),
                                      0, i * sysInfo.granB, sysInfo.granB);
        std::memset(v, 0, sysInfo.granB);
        UnmapViewOfFile(v);
    }
}
So what it does is go through the whole file in the smallest addressable chunks (64 KB in this case): map the view, write 0's, unmap, repeat. It works all right and is very quick, but when I use it, there is some phantom memory usage going on.
According to windows task manager, the process itself is using just a few megabytes but the "physical memory usage" leaps up when I use it on larger files. For instance, using this on a 2GB file is enough to put my laptop in a coma for a few minutes, physical memory usage goes to 99%, everything in task manager is frantically reducing memory and everything freezes for a while.
The whole reason I'm trying to do this in 64 KB chunks is to keep memory usage down, but the chunk size doesn't really matter in this case; any chunk size times n to cover the file does the same thing.
A couple of things I've tried:
Flushing the view before unmapping - this makes things terribly slow; doing the 2 GB file in any size chunks takes around 10 minutes.
Adding a hardcoded delay in the loop - it actually works really well; it still gets done in seconds and memory usage stays down, but I just really don't like the concept of a hardcoded delay in any loop.
Writing 0's to just the end of the file - I don't actually need to clear the file, only to force it to be ready for use. What I mean is: when I create a new file and just start with my random IO, I get ~1 MB/s at best. If I open an existing file, or force-write 0's into the new file first, I get much better speeds. I'm not exactly sure why that is, but a user in another thread suggested that writing something to the very end of the file after setting the file pointer would have the same effect as clearing; from testing, this is not true.
So currently I'm trying to solve this from the angle of clearing the file without destroying the computer's memory. Does anybody know how to appropriately limit that loop?
So here's the thing. When you MapViewOfFile, it allocates the associated memory range but may mark it as swapped out (e.g., if it hasn't already been read into memory). If that's the case, you then get a page fault when you first access it (which will then cause the OS to read it in).
Then when you UnmapViewOfFile, the OS takes ownership of the associated memory range and writes the now-not-accessible-by-userspace data back to disk (assuming, of course, that you've written to it, which marks the page as "dirty", otherwise it's straight up deallocated). To quote the documentation (that I asked you to read in comments): modified pages are written "lazily" to disk; that is, modifications may be cached in memory and written to disk at a later time.
Unmapping the view of the file is not guaranteed to "un-commit" and write the data to disk. Moreover, even CloseHandle does not provide that guarantee either. It merely closes the handle to it. Because of caching mechanisms, the operating system is entirely allowed to write data back to disk on its own time if you do not call FlushViewOfFile. Even re-opening the same file may simply pull data back from the cache instead of from disk.
Ultimately the problem is
you memory map a file
you write to the memory map
writing to the memory map's address range causes the file's mapping to be read in from disk
you unmap the file
unmapping the file "lazily" writes the data back to disk
the OS may reach memory stress, see that there's some unwritten data it can now write to disk, and force that to happen to recover physical memory for new allocations; and because of the lazy flushing, your IO is no longer sequential, which drastically increases spindle-disk latency
You see better performance when you're sleeping because you're giving the OS the opportunity to say "hey I'm not doing anything... let's go ahead and flush cache" which coerces disk IO to be roughly sequential.
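One way to act on that explanation, as a sketch rather than a tested fix (the 16 MB view size is an arbitrary assumption, and it assumes the mapping was created writable; hMmf and SizeB() are the members from the question): clear the file in larger views and flush each one before unmapping, so the dirty data is bounded to one view at a time and the writes stay roughly sequential.

void ClearBounded() {
    const ULONGLONG viewBytes = 16ull * 1024 * 1024;   // a multiple of the allocation granularity
    ULONGLONG total = (ULONGLONG)SizeB();

    for (ULONGLONG pos = 0; pos < total; pos += viewBytes) {
        SIZE_T thisView = (SIZE_T)(total - pos < viewBytes ? total - pos : viewBytes);
        char* v = (char*)MapViewOfFile(hMmf, FILE_MAP_WRITE,
                                       (DWORD)(pos >> 32), (DWORD)(pos & 0xFFFFFFFF),
                                       thisView);
        if (!v) break;
        std::memset(v, 0, thisView);
        FlushViewOfFile(v, thisView);   // push this view's dirty pages toward disk before moving on
        UnmapViewOfFile(v);
    }
}

Whether the per-view flush here ends up faster than your per-64 KB flush is something only measurement can tell; the point is that flushing before unmapping caps how much dirty cache the loop can accumulate.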
I have noticed that reading a file byte-by-byte takes more time than reading the whole file using fread.
According to cplusplus:
size_t fread ( void * ptr, size_t size, size_t count, FILE * stream );
Reads an array of count elements, each one with a size of size bytes, from the stream and stores them in the block of memory specified by ptr.
Q1) So, again, fread reads the file 1 byte at a time, so isn't it the same as the read-by-1-byte method?
Q2) Results show that fread still takes less time.
From here:
I ran this with a file of approximately 44 megabytes as input. When compiled with VC++2012, I got the following results:
using getc Count: 400000 Time: 2.034
using fread Count: 400000 Time: 0.257
Also, a few posts on SO say that it depends on the OS.
Q3) What is the role of the OS?
Why is this so, and what exactly goes on behind the scenes?
fread does not read a file one byte at a time. The interface, which lets you specify size and count separately, is purely for your convenience. Behind the scenes, fread will simply read size * count bytes.
The amount of bytes that fread will try to read at once is highly dependent on your C implementation and the underlying filesystem. Unless you're intimately familiar with both, it's often safe to assume that fread will be closer to optimal than anything you invent yourself.
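For instance, assuming fp is an already-open FILE*, both calls below ask the library for the same 4096 bytes; only the return value is counted differently:

#include <stdio.h>

/* Both calls request the same 4096 bytes from the stream; only the return
   value differs: n1 counts bytes, n2 counts complete 4096-byte elements. */
void demo(FILE* fp) {
    char buf[4096];
    size_t n1 = fread(buf, 1, sizeof buf, fp);        /* bytes read, up to 4096 */
    size_t n2 = fread(buf, sizeof buf, 1, fp);        /* 0 or 1 complete element */
    (void)n1; (void)n2;
}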
EDIT: physical disks tend to have a relatively high seek time compared to their throughput. In other words, they take relatively long to start reading. But once started, they can read consecutive bytes relatively fast. So without any OS/filesystem support, any call to fread would result in a severe overhead to start each read. So to utilize your disk efficiently, you'll want to read as many bytes at once as possible. But disks are slow compared to CPU, RAM and physical caches. Reading too much at once means your program spends a lot of time waiting for the disk to finish reading, when it could have been doing something useful (like processing already read bytes).
This is where the OS/filesystem comes in. The smart people who work on those have spent a lot of time figuring out the right amount of bytes to request from a disk. So when you call fread and request X bytes, the OS/filesystem will translate that to N requests for Y bytes each. Where Y is some generally optimal value that depends on more variables than can be mentioned here.
Another role of the OS/filesystem is what's called 'readahead'. The basic idea is that most IO occurs inside loops. So if a program requests some bytes from disk, there's a very good chance it'll request the next bytes shortly afterwards. Because of this, the OS/filesystem will typically read slightly more than you actually requested at first. Again, the exact amount depends on too many variables to mention. But basically, this is the reason that reading a single byte at a time is still somewhat efficient (it would be another ~10x slower without readahead).
In the end, it's best to think of fread as giving some hints to the OS/filesystem about how many bytes you'll want to read. The more accurate those hints are (closer to the total amount of bytes you'll want to read), the better the OS/filesystem will optimize the disk IO.
Protip: Use your profiler to identify the most significant bottlenecks in an actual, real-life problem...
Q1) So, again, fread reads the file 1 byte at a time, so isn't it the same as the read-by-1-byte method?
Is there anything from the manual to suggest that bytes can only be read one at a time? Flash memory, which is becoming more and more common, typically requires that your OS read chunks as large as 512KB at a time. Perhaps your OS performs buffering for your benefit, so you don't have to inspect the entire amount...
Q2) Results show that fread still takes less time.
Logically speaking, that's a fallacy. There is no requirement that fgetc be any slower at retrieving a block of bytes than fread. In fact, an optimal compiler may very well produce the same machine code following its optimisation passes.
In reality, it also turns out to be invalid. Most proofs (for example, the ones you're citing) neglect to consider the influence that setvbuf (or stream.rdbuf()->pubsetbuf, in C++) has.
The empirical evidence below, however, integrates setvbuf and, at least on every implementation I've tested it on, has shown fgetc to be roughly as fast as fread at reading a large block of data, within some meaningless margin of error that swings either way... Please, run these tests multiple times and let me know if you find a system where one of these is significantly faster than the other. I suspect you won't. There are two programs to build from this code:
gcc -o fread_version -std=c99 file.c
gcc -o fgetc_version -std=c99 -DUSE_FGETC file.c
Once both programs are compiled, generate a test_file containing a large number of bytes and you can test like so:
time cat test_file | fread_version
time cat test_file | fgetc_version
Without further ado, here's the code:
#include <assert.h>
#include <stdio.h>

int main(void) {
    unsigned int criteria[2] = { 0 };
#ifdef USE_FGETC
    int n = setvbuf(stdin, NULL, _IOFBF, 65536);
    assert(n == 0);
    for (;;) {
        int c = fgetc(stdin);
        if (c < 0) {
            break;
        }
        criteria[c == 'a']++;
    }
#else
    char buffer[65536];
    for (;;) {
        size_t size = fread(buffer, 1, sizeof buffer, stdin);
        if (size == 0) {
            break;
        }
        for (size_t x = 0; x < size; x++) {
            criteria[buffer[x] == 'a']++;
        }
    }
#endif
    printf("%u %u\n", criteria[0], criteria[1]);
    return 0;
}
P.S. You might have even noticed the fgetc version is simpler than the fread version; it doesn't require a nested loop to traverse the characters. That should be the lesson to take away, here: Write code with maintenance in mind, rather than performance. If necessary, you can usually provide hints (such as setvbuf) to optimise bottlenecks that you've used your profiler to identify.
P.P.S. You did use your profiler to identify this as a bottleneck in an actual, real-life problem, right?
It depends on how you are reading byte-by-byte. But there is significant overhead to each call to fread (it may need to make an OS/kernel call).
If you call fread 1000 times to read 1000 bytes one by one then you pay that cost 1000 times; if you call fread once to read 1000 bytes then you only pay that cost once.
Consider what's physically happening with the disk. Every time you ask it to perform a read, its head must seek to the correct position and then wait for the right part of the platter to spin under it. If you do 100 separate 1-byte reads, you have to do that 100 times (as a first approximation; in reality the OS probably has a caching policy that's smart enough to figure out what you're trying to do and read ahead). But if you read 100 bytes in one operation, and those bytes are roughly contiguous on the disk, you only have to do all this once.
Hans Passant's comment about caching is right on the money too, but even in the absence of that effect, I'd expect 1 bulk read operation to be faster than many small ones.
Other contributors to the speed reduction are instruction pipeline reloads and databus contentions. Data cache misses are similar to the instruction pipeline reloads, so I am not presenting them here.
Function calls and Instruction Pipeline
Internally, the processor has an instruction pipeline in cache (fast memory physically near the processor). The processor will fill up the pipeline with instructions, then execute the instructions and fill up the pipeline again. (Note, some processors may fetch instructions as slots open up in the pipeline).
When a function call is executed, the processor encounters a branch instruction. The processor can't fetch any new instructions into the pipeline until the branch is resolved. If the branch is taken, the pipeline may have to be reloaded, wasting time. (Note: some processors can read enough instructions into the cache that no further instruction fetches are necessary. An example is a small loop.)
Worst case, when you call the read function 1000 times, you cause 1000 reloads of the instruction pipeline. If you call the read function once, the pipeline is only reloaded once.
Databus Collisions
Data flows through a databus from the hard drive to the processor, then from the processor to the memory. Some platforms allow for Direct Memory Access (DMA) from the hard drive to the memory. In either case, there is contention of multiple users with the data bus.
The most efficient use of the databus is to send large blocks of data. When a user (a component such as the processor or the DMA controller) wants to use the databus, it must wait for the bus to become available. Worst case, another user is sending large blocks, so there is a long delay. When sending 1000 bytes one at a time, the user has to wait 1000 times for other users to give up time on the databus.
Picture waiting in a queue (line) at a market or restaurant. You need to purchase many items, but you purchase one, then have to go back and wait in line again. Or you could be like other shoppers and purchase many items. Which consumes more time?
Summary
There are many reasons to use large blocks for I/O transfers. Some of the reasons involve the physical drive, others involve instruction pipelines, data caches, and databus contention. By reducing the quantity of data requests and increasing the data size, the cumulative time is also reduced. One request has a lot less overhead than 1000 requests. If the overhead is 1 millisecond, one request takes 1 millisecond, while 1000 requests take 1 second.
I have written a program (using FFTW) to perform Fourier transforms of some data files written in OpenFOAM.
The program first finds the paths to each data file (501 files in my current example), then splits the paths between threads, such that thread0 gets paths 0->61, thread1 gets 62-> 123 or so, etc, and then runs the remaining files in serial at the end.
I have implemented timers throughout the code to try and see where it bottlenecks: run in serial, each file takes around 3.5 s, and for 8 files in parallel the time taken is around 21 s (a reduction from the 28 s for 8 x 3.5 s of serial time, but not by much).
The problematic section of my code is below
if (DIAG_timers) {readTimer = timerNow();}
for (yindex = 0; yindex < ycells; yindex++)
{
    for (xindex = 0; xindex < xcells; xindex++)
    {
        getline(alphaFile, alphaStringValue);
        convertToNumber(alphaStringValue, alphaValue[xindex][yindex]);
    }
}
if (DIAG_timers) {endTimerP(readTimer, tid, "reading value and converting", false);}
Here, timerNow() returns the clock value, and endTimerP calculates the time that has passed in ms. (The remaining arguments relate to it running in a parallel thread, to avoid outputting 8 lines for each loop etc, and a description of what the timer measures).
convertToNumber takes the value on alphaStringValue, and converts it to a double, which is then stored in the alphaValue array.
alphaFile is a std::ifstream object, and alphaStringValue is a std::string which stores the text on each line.
The files to be read are approximately 40 MB each (just over 5,120,000 lines, each containing only one value between 0 and 1, in most cases exactly 0 or 1), and I have 16 GB of RAM, so copying the files into memory would certainly be possible, since only 8 (1 per thread) should be open at once. I am unsure if mmap would do this better; several threads on Stack Overflow argue about the merits of mmap vs. more straightforward read operations, in particular for sequential access, so I don't know whether that would be beneficial.
I tried surrounding the code block with a mutex so that only one thread could run the block at once, in case reading multiple files was leading to slow IO via vaguely random access, but that just reduced the process to roughly serial-speed times.
Any suggestions allowing me to run this section more quickly, possibly via copying the file, or indeed anything else, would be appreciated.
Edit:
template<class T> inline void convertToNumber(std::string const& s, T& result)
{
    std::istringstream i(s);
    T x;
    if (!(i >> x))
        throw BadConversion("convertToNumber(\"" + s + "\")");
    result = x;
}
turns out to have been the slow section. I assume this is due to the creation of 5 million stringstreams per file, followed by the testing of 5 million if conditions? Replacing it with TonyD's suggestion presumably removes the possibility of catching an error, but saves a vast number of (at least in this controlled case) unnecessary operations.
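TonyD's actual suggestion isn't quoted here, but as an illustration of the kind of replacement that avoids constructing a stringstream per line, a strtod-based version (an assumption on my part, not the comment's exact code) might look like this:

#include <cstdlib>
#include <string>

// Convert one line to a double without building an istringstream per call.
// Error handling is reduced to "did anything parse at all".
inline bool convertToNumber(const std::string& s, double& result)
{
    char* end = nullptr;
    result = std::strtod(s.c_str(), &end);
    return end != s.c_str();
}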
The files to be read are approximately 40 MB each (just over 5,120,000 lines, each containing only one value between 0 and 1, in most cases exactly 0 or 1), and I have 16 GB of RAM, so copying the files into memory would certainly be possible,
Yes. But loading them there will still count towards your process's wall-clock time, unless they were already read by another process shortly before.
since only 8 (1 per thread) should be open at once.
Since any files that were not loaded into memory before the process started will have to be loaded, and that loading will count towards the process's wall-clock time, it does not matter how many are open at once. Any that are not cached will slow down the process.
I am unsure if mmap would do this better?
No, it wouldn't. mmap is faster, but only because it saves the copy from the kernel buffer to the application buffer and some system-call overhead (with read you enter the kernel for every call, while with mmap, pages brought in by read-ahead won't cause further page faults). But it will not save you the time needed to read the files from disk if they are not already cached.
mmap does not load anything in memory. The kernel loads data from disk to internal buffers, the page cache. read copies the data from there to your application buffer while mmap exposes parts of the page cache directly in your address space. But in either case the data are fetched on first access and remain there until the memory manager drops them to reuse the memory. The page cache is global, so if one process causes some data to be cached, next process will get them faster. But if it's first access after longer time, the data will have to be read and this will affect read and mmap exactly the same way.
Since parallelizing the process didn't improve the time much, it seems the majority of the time is the actual I/O. So you can optimize a bit more and mmap can help, but don't expect much. The only way to improve the I/O time is to get a faster disk.
You should be able to ask the system to tell you how much time was spent on the CPU and how much was spent waiting for data (I/O) using getrusage(2) (call it at the end of each thread to get data for that thread). So you can confirm how much time was spent on I/O.
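A sketch of what that call might look like at the end of each worker thread (RUSAGE_THREAD is Linux-specific; getrusage reports CPU time, so the I/O wait has to be inferred from the gap to wall-clock time):

#include <cstdio>
#include <sys/resource.h>

// Report CPU time consumed by the calling thread. Wall-clock time minus
// (user + system) is a rough estimate of time spent blocked, mostly on I/O.
void reportThreadUsage(int tid)
{
    struct rusage ru;
    if (getrusage(RUSAGE_THREAD, &ru) == 0) {
        double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
        double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
        std::printf("thread %d: user %.3f s, system %.3f s, block input ops %ld\n",
                    tid, user, sys, ru.ru_inblock);
    }
}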
mmap is certainly the most efficient way to get large amounts of data into memory. The main benefit here is that there is no extra copying involved.
It does, however, make the code slightly more complex, since you can't directly use the file I/O functions on an mmap-ed region (and the main benefit is sort of lost if you use the "m" mode of the stdio functions, as you are then getting at least one copy). From past experiments I've made, mmap beats all other file-reading variants by some amount; how much depends on what proportion of the overall time is spent waiting for the disk, and how much is spent actually processing the file content.
Suppose I have a file which has x records. One 'block' holds m records, so the total number of blocks in the file is n = x/m. If I know the size of one record, say b bytes (so the size of one block is b*m), I can read a complete block at once using the system call read() (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector?
The reason I want to do this in the first place is to reduce disk I/O operations, as disk I/O operations are much more expensive, according to what I have learned.
Or will it take the same amount of time as when I read record by record from the file and put each directly into the vector, instead of reading block by block? Reading block by block, I will have only n disk I/Os, whereas I will have x I/Os if I read record by record.
Thanks.
You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that the file contents are simply mapped into your process space, as if you already had a pointer into them. By simply inspecting memory and treating it as an array, or by copying data using memcpy(), you will implicitly perform read operations, but only as necessary - the operating system's virtual memory subsystem is smart enough to do this very efficiently.
The only possible reason to avoid mmap may be if you are running on a 32-bit OS and the file size exceeds 2 gigabytes (or slightly less than that). In this case the OS may have trouble allocating address space for your mmap-ed memory. But on a 64-bit OS, using mmap should never be a problem.
Also, mmap can be cumbersome if you are writing a lot of data and the size of the data is not known up front. Other than that, it is always better and faster to use it than read.
Actually, most modern operating systems rely on mmap extensively. For example, in Linux, to execute some binary, your executable is simply mmap-ed and executed from memory as if it was copied there by read, without actually reading it.
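A sketch of that approach for the fixed-size-record case described in the question (the 64-byte Record layout is a placeholder; substitute whatever b bytes a real record holds):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

struct Record { char bytes[64]; };          // placeholder for a b-byte record

std::vector<Record> loadRecords(const char* path)
{
    std::vector<Record> records;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return records;

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            const Record* first = static_cast<const Record*>(p);
            records.assign(first, first + st.st_size / sizeof(Record));  // pages fault in on demand
            munmap(p, st.st_size);
        }
    }
    close(fd);
    return records;
}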
Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible that reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same thing by simply adjusting the size of the buffer used by the stream (which is probably a lot easier).
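A sketch of that buffer adjustment for a C++ stream (the 1 MB size is an arbitrary assumption; pick something close to b*m for your block size). Note that pubsetbuf must be called before the stream is opened, and the buffer has to outlive the stream:

#include <fstream>
#include <vector>

std::ifstream openWithBigBuffer(const char* path, std::vector<char>& buf)
{
    buf.resize(1 << 20);                               // 1 MB, owned by the caller so it outlives the stream
    std::ifstream in;
    in.rdbuf()->pubsetbuf(buf.data(), buf.size());     // enlarge the stream buffer before open()
    in.open(path, std::ios::binary);
    return in;
}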