I have written a program (using FFTW) to perform Fourier transforms of some data files written in OpenFOAM.
The program first finds the paths to each data file (501 files in my current example), then splits the paths between threads, such that thread0 gets paths 0->61, thread1 gets 62->123, and so on, and then runs the remaining files in serial at the end.
I have implemented timers throughout the code to try to see where it bottlenecks: run in serial, each file takes around 3.5s, while 8 files in parallel take around 21s (a reduction from the 28s that 8 x 3.5s would take in serial, but not by much).
The problematic section of my code is below:
if (DIAG_timers) {readTimer = timerNow();}
for (yindex = 0; yindex < ycells; yindex++)
{
    for (xindex = 0; xindex < xcells; xindex++)
    {
        getline(alphaFile, alphaStringValue);
        convertToNumber(alphaStringValue, alphaValue[xindex][yindex]);
    }
}
if (DIAG_timers) {endTimerP(readTimer, tid, "reading value and converting", false);}
Here, timerNow() returns the clock value, and endTimerP calculates the time that has passed, in ms. (The remaining arguments relate to running in a parallel thread, to avoid outputting 8 lines for each loop, etc., plus a description of what the timer measures.)
convertToNumber takes the value in alphaStringValue and converts it to a double, which is then stored in the alphaValue array.
alphaFile is a std::ifstream object, and alphaStringValue is a std::string which stores the text on each line.
The files to be read are approximately 40MB each (just over 5,120,000 lines, each containing a single value between 0 and 1; in most cases the value is exactly 0 or 1), and I have 16GB of RAM, so copying all the files to memory would certainly be possible, since only 8 (one per thread) should be open at once. I am unsure whether mmap would do this better. Several threads on Stack Overflow argue about the merits of mmap versus more straightforward read operations, in particular for sequential access, so I don't know whether it would be beneficial.
I tried surrounding the code block with a mutex so that only one thread could run it at once, in case reading multiple files concurrently was leading to slow I/O via vaguely random access patterns, but that just reduced the process to roughly serial speed.
Any suggestions allowing me to run this section more quickly, possibly via copying the file, or indeed anything else, would be appreciated.
Edit:
// Requires <sstream> and <string>; BadConversion is a user-defined exception type.
template<class T> inline void convertToNumber(std::string const& s, T &result)
{
    std::istringstream i(s);  // one stringstream constructed per call
    T x;
    if (!(i >> x))
        throw BadConversion("convertToNumber(\"" + s + "\")");
    result = x;
}
turns out to have been the slow section. I assume this is due to the creation of 5 million stringstreams per file, followed by the testing of 5 million if conditions? Replacing it with TonyD's suggestion presumably removes the possibility of catching an error, but saves a vast number of (at least in this controlled case) unnecessary operations.
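For reference, a minimal sketch of the kind of replacement meant here (I don't have TonyD's exact code, so treat this as an assumption): parse each line with strtod instead of constructing a stringstream per value.

#include <cstdlib>  // std::strtod
#include <string>

// Hypothetical fast path: no stringstream and no exception on malformed input.
// Assumes every line holds exactly one well-formed floating-point value.
inline void convertToNumberFast(std::string const& s, double& result)
{
    result = std::strtod(s.c_str(), nullptr);  // silently yields 0.0 on failure
}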
The files to be read are approximately 40MB each (just over 5,120,000 lines, each containing a single value between 0 and 1; in most cases the value is exactly 0 or 1), and I have 16GB of RAM, so copying all the files to memory would certainly be possible,
Yes. But loading them there will still count towards your process's wall clock time unless they were already read by another process shortly before.
since only 8 (one per thread) should be open at once.
Since any files that were not loaded into memory before the process started will have to be loaded, and that loading will count towards the process's wall clock time, it does not matter how many are open at once. Any that are not in cache will slow down the process.
I am unsure if mmap would do this better?
No, it wouldn't. mmap is faster, but only because it saves the copy from kernel buffer to application buffer and some system call overhead (with read you do a kernel entry for each page, while with mmap pages that are brought in by read-ahead won't cause further page faults). But it will not save you the time to read the files from disk if they are not already cached.
mmap does not load anything into memory. The kernel loads data from disk into internal buffers, the page cache. read copies the data from there to your application buffer, while mmap exposes parts of the page cache directly in your address space. But in either case the data are fetched on first access and remain there until the memory manager drops them to reuse the memory. The page cache is global, so if one process causes some data to be cached, the next process will get them faster. But if it's the first access after a long time, the data will have to be read, and this will affect read and mmap in exactly the same way.
Since parallelizing the process didn't improve the time much, it seems the majority of the time is spent on actual I/O. So you can optimize a bit more, and mmap can help, but don't expect much. The only way to improve I/O time is to get a faster disk.
You should be able to ask the system how much time was spent on the CPU and how much was spent waiting for data (I/O) using getrusage(2) (call it at the end of each thread to get data for that thread). So you can confirm how much time was spent on I/O.
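A minimal sketch of that measurement (assuming Linux, where the non-portable RUSAGE_THREAD flag is available; reportThreadCpuTime is a made-up helper name):

#include <sys/resource.h>
#include <cstdio>

// Call at the end of each worker thread. CPU time far below the wall clock
// time the thread ran for indicates it was mostly waiting on I/O.
void reportThreadCpuTime(int tid)
{
    struct rusage ru;
    if (getrusage(RUSAGE_THREAD, &ru) == 0)  // RUSAGE_THREAD is Linux-specific
    {
        std::printf("thread %d: user %ld.%06lds, system %ld.%06lds\n", tid,
                    (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
                    (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    }
}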
mmap is certainly the most efficient way to get large amounts of data into memory. The main benefit here is that there is no extra copying involved.
It does, however, make the code slightly more complex, since you can't use the ordinary file I/O functions on an mmap'd region (and the main benefit is mostly lost if you use the "m" mode of the stdio functions, as you then still get at least one copy). From past experiments that I've made, mmap beats all other file reading variants by some amount. How much depends on what proportion of the overall time is spent waiting for the disk versus actually processing the file content.
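For illustration, a minimal POSIX sketch of reading a whole file through mmap (error handling abbreviated; the parse callback is a stand-in for whatever processing you do on the bytes):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map an entire file read-only and hand the raw bytes to a parser.
void processFileViaMmap(const char* path, void (*parse)(const char*, size_t))
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED)
    {
        parse(static_cast<const char*>(p), st.st_size);  // no per-line string copies
        munmap(p, st.st_size);
    }
    close(fd);
}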
Related
I'm designing a musical looper: once a first recording of x seconds is made, it keeps replaying those x seconds and, on each iteration, layers new content onto the loop.
Since the size of the first recording can vary, I cannot do this with RAM-allocated memory; I must place it on disk.
Long story short, I cannot afford the time to close the file and reopen it on every loop iteration, so I need to write to and read from the same file.
If I protect this file with a mutex, can I do that without undefined behaviour?
Since the size of the first recording can vary, I cannot do this with RAM-allocated memory; I must place it on disk.
Your assumption is simply wrong. Just because the size of the recording can vary does not mean you have to put it on disk. For example, you could store your recording in a std::vector<unsigned char>, a vector holding bytes; you can add or remove any number of bytes you want. Even this is quite low level: better, define your own application-specific data structure so you can modify your recording fluently without worrying about files, bytes, and memory.
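As a minimal sketch of that idea (the class and its method names are made up for illustration):

#include <cstddef>
#include <vector>

// Hypothetical in-memory recording: grows as needed, no file handling at all.
class Recording
{
    std::vector<unsigned char> samples_;  // raw audio bytes, any length
public:
    void append(const unsigned char* data, std::size_t n)
    {
        samples_.insert(samples_.end(), data, data + n);
    }
    const unsigned char* data() const { return samples_.data(); }
    std::size_t size() const { return samples_.size(); }
};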
If you share a few pieces of your code, people can make more concrete suggestions.
I have this function in my MMF class
void Clear() {
    long long size = SizeB();                  // total file size in bytes
    long long iter = size / sysInfo.granB;     // number of granularity-sized chunks
    for (long long i = 0; i < iter; i++) {
        long long offset = i * sysInfo.granB;  // 64-bit, so files >= 2GB don't overflow
        auto v = (char*)MapViewOfFile(hMmf, FILE_MAP_READ | (write ? FILE_MAP_WRITE : 0),
                                      (DWORD)(offset >> 32), (DWORD)(offset & 0xFFFFFFFF),
                                      sysInfo.granB);
        std::memset(v, 0, sysInfo.granB);
        UnmapViewOfFile(v);
    }
}
So what it does is go through the whole file in the smallest addressable chunks (64K in this case): map the view, write 0's, unmap, repeat. It works all right and is very quick, but when I use it, there is some phantom memory usage going on.
According to windows task manager, the process itself is using just a few megabytes but the "physical memory usage" leaps up when I use it on larger files. For instance, using this on a 2GB file is enough to put my laptop in a coma for a few minutes, physical memory usage goes to 99%, everything in task manager is frantically reducing memory and everything freezes for a while.
The whole reason I'm trying to do this in 64K chunks is to keep memory usage down, but the chunk size doesn't really matter in this case; any chunk size * n chunks covering the file does the same thing.
A couple of things I've tried:
flushing the view before unmapping - this makes things terribly slow; doing that 2GB file in any size chunks takes about 10 minutes.
adding a hardcoded delay in the loop - it actually works really well; it still gets done in seconds and the memory usage stays down, but I just really don't like the concept of a hardcoded delay in any loop.
writing 0's to just the end of the file - I don't actually need to clear the file, only to force it to be ready for use. What I mean is: when I create a new file and just start with my random I/O, I get ~1MB/s at best. If I open an existing file, or force-write 0's in the new file first, I get much better speeds. I'm not exactly sure why that is, but a user in another thread suggested that writing something to the very end of the file after setting the file pointer would have the same effect as clearing; from testing, this is not true.
So currently I'm trying to solve this from the angle of clearing the file without destroying the computer's memory. Does anybody know how to appropriately limit that loop?
So here's the thing. When you MapViewOfFile, it allocates the associated memory range but may mark it as swapped out (e.g., if it hasn't already been read into memory). If that's the case, you then get a page fault when you first access it (which will then cause the OS to read it in).
Then when you UnmapViewOfFile, the OS takes ownership of the associated memory range and writes the now-not-accessible-by-userspace data back to disk (assuming, of course, that you've written to it, which marks the page as "dirty", otherwise it's straight up deallocated). To quote the documentation (that I asked you to read in comments): modified pages are written "lazily" to disk; that is, modifications may be cached in memory and written to disk at a later time.
Unmapping the view of the file is not guaranteed to "un-commit" and write the data to disk. Moreover, even CloseHandle does not provide that guarantee either. It merely closes the handle to it. Because of caching mechanisms, the operating system is entirely allowed to write data back to disk on its own time if you do not call FlushViewOfFile. Even re-opening the same file may simply pull data back from the cache instead of from disk.
Ultimately, the problem is this:
you memory map a file
you write to the memory map
writing to the memory map's address range causes the file's mapping to be read in from disk
you unmap the file
unmapping the file "lazily" writes the data back to disk
the OS may come under memory pressure, see that there's some unwritten data it can now write to disk, and force that to happen to recover physical memory for new allocations; by the way, because of the OS's lazy flushing, your I/O is no longer sequential, which drastically increases spindle-disk latency
You see better performance when you're sleeping because you're giving the OS the opportunity to say "hey I'm not doing anything... let's go ahead and flush cache" which coerces disk IO to be roughly sequential.
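One way to bound the dirty-page buildup without a hardcoded sleep, sketched below under the question's own names (hMmf, write, SizeB, sysInfo): map larger views and flush each one before unmapping. You pay the write cost up front, but the I/O stays sequential and the unwritten set never exceeds one view.

void ClearBounded() {
    const long long viewBytes = 16LL * 1024 * 1024;  // multiple of the 64K granularity
    long long size = SizeB();
    for (long long offset = 0; offset < size; offset += viewBytes) {
        long long n = (size - offset < viewBytes) ? (size - offset) : viewBytes;
        auto v = (char*)MapViewOfFile(hMmf, FILE_MAP_READ | (write ? FILE_MAP_WRITE : 0),
                                      (DWORD)(offset >> 32), (DWORD)(offset & 0xFFFFFFFF),
                                      (SIZE_T)n);
        std::memset(v, 0, (size_t)n);
        FlushViewOfFile(v, (SIZE_T)n);  // write now, so dirty pages never pile up
        UnmapViewOfFile(v);
    }
}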
I've been running into some issues with writing to a file - namely, not being able to write fast enough.
To explain, my goal is to capture a stream of data coming in over gigabit Ethernet and simply save it to a file.
The raw data is coming in at a rate of 10MS/s, and it's then saved to a buffer and subsequently written to a file.
Below is the relevant section of code:
std::string path = "Stream/raw.dat";
ofstream outFile(path, ios::out | ios::app | ios::binary);
if (outFile.is_open())
    cout << "Yes" << endl;
while (1)
{
    rxSamples = rxStream->recv(&rxBuffer[0], rxBuffer.size(), metaData);
    switch (metaData.error_code)
    {
        //Irrelevant error checking...

        //Write data to a file
        std::copy(begin(rxBuffer), end(rxBuffer), std::ostream_iterator<complex<float>>(outFile));
    }
}
The issue I'm encountering is that it's taking too long to write the samples to a file. After a second or so, the device sending the samples reports its buffer has overflowed. After some quick profiling of the code, nearly all of the execution time is spent on std::copy(...) (99.96% of the time to be exact). If I remove this line, I can run the program for hours without encountering any overflow.
That said, I'm rather stumped as to how I can improve the write speed. I've looked through several posts on this site, and it seems like the most common suggestion (in regard to speed) is to implement file writes as I've already done - through the use of std::copy.
If it's helpful, I'm running this program on Ubuntu x86_64. Any suggestions would be appreciated.
So the main problem here is that you try to write in the same thread as you receive, which means that recv() can only be called again after the copy is complete. A few observations:
Move the writing to a different thread (see the sketch after this list). This is about a USRP, so GNU Radio might really be the tool of your choice -- it's inherently multithreaded.
Your output iterator is probably not the most performant solution. Simply write()ing to a file descriptor might be better, but that's a performance measurement that's up to you.
If your hard drive/file system/OS/CPU aren't up to the rates coming in from the USRP, even after decoupling receiving from writing thread-wise, then there's nothing you can do -- get a faster system.
Try writing to a RAM disk instead
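A minimal sketch of that receive/write decoupling (buffer and stream names follow the question; the queue is deliberately simple and unbounded, which assumes the disk keeps up on average):

#include <complex>
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<std::vector<std::complex<float>>> pending;
std::mutex m;
std::condition_variable cv;
bool done = false;

// Writer thread: drains filled buffers to disk while the receiver keeps running.
void writerLoop(std::ofstream& out)
{
    for (;;)
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !pending.empty() || done; });
        if (pending.empty() && done) return;
        std::vector<std::complex<float>> buf = std::move(pending.front());
        pending.pop();
        lock.unlock();  // do the slow write without holding the lock
        out.write(reinterpret_cast<const char*>(buf.data()),
                  buf.size() * sizeof(buf[0]));
    }
}

// In the receive loop, hand off the buffer instead of writing inline:
//   { std::lock_guard<std::mutex> lock(m); pending.push(std::move(rxBuffer)); }
//   cv.notify_one();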
In fact, I don't know how you came up with the std::copy approach. The rx_samples_to_file example that comes with UHD does this with a simple write, and you should definitely favor that over copying; file I/O can, on good OSes, often be done with one copy less, and iterating over all elements is probably very slow.
Let's do a bit of math.
Your samples are (apparently) of type std::complex<float>. Given a (typical) 32-bit float, that means each sample is 64 bits. At 10 MS/s, that means the raw data is around 80 megabytes per second--that's within what you can expect to write to a desktop (7200 RPM) hard drive, but getting fairly close to the limit (which is typically around 100-150 megabytes per second or so).
Unfortunately, despite the std::ios::binary, you're actually writing the data in text format (because std::ostream_iterator basically does stream << data;).
This not only loses some precision, but increases the size of the data, at least as a rule. The exact amount of increase depends on the data--a small integer value can actually decrease the quantity of data, but for arbitrary input, a size increase close to 2:1 is fairly common. With a 2:1 increase, your outgoing data is now around 160 megabytes/second--which is faster than most hard drives can handle.
The obvious starting point for an improvement would be to write the data in binary format instead:
uint32_t nItems = std::end(rxBuffer) - std::begin(rxBuffer);  // element count
outFile.write((char *)&nItems, sizeof(nItems));               // length header
outFile.write((char *)&rxBuffer[0], sizeof(rxBuffer));        // raw sample bytes
For the moment I've used sizeof(rxBuffer) on the assumption that it's a real array. If it's actually a pointer or vector, you'll have to compute the correct size (what you want is the total number of bytes to be written).
I'd also note that as it stands right now, your code has an even more serious problem: since it hasn't specified a separator between elements when it writes the data, the data will be written without anything to separate one item from the next. That means if you wrote two values of (for example) 1 and 0.2, what you'd read back in would not be 1 and 0.2, but a single value of 10.2. Adding separators to your text output will add yet more overhead (figure around 15% more data) to a process that's already failing because it generates too much data.
Writing in binary format means each float will consume precisely 4 bytes, so delimiters are not necessary to read the data back in correctly.
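For completeness, a matching read-back sketch under the same assumptions (a uint32_t length header followed by raw std::complex<float> samples):

#include <complex>
#include <cstdint>
#include <fstream>
#include <vector>

std::vector<std::complex<float>> readBlock(std::ifstream& in)
{
    uint32_t nItems = 0;
    in.read(reinterpret_cast<char*>(&nItems), sizeof(nItems));  // length header
    std::vector<std::complex<float>> buf(nItems);
    in.read(reinterpret_cast<char*>(buf.data()),
            nItems * sizeof(buf[0]));                           // raw sample bytes
    return buf;
}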
The next step after that would be to descend to a lower-level file I/O routine. Depending on the situation, this might or might not make much difference. On Windows, you can specify FILE_FLAG_NO_BUFFERING when you open a file with CreateFile. This means that reads and writes to that file will basically bypass the cache and go directly to the disk.
In your case, that's probably a win--at 10 MS/s, you're going to use up the cache space quite a while before you reread the same data. In such a case, letting the data go into the cache gains you virtually nothing, but costs you some time to copy the data to the cache and then somewhat later copy it out to the disk. Worse, it's likely to pollute the cache with all this data, so it no longer stores other data that's a lot more likely to benefit from caching.
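As a sketch of that Windows route (unbuffered I/O has real constraints that the comments only gesture at: the buffer address, transfer size, and file offset all have to be sector-aligned):

#include <windows.h>

// Open for cache-bypassing, write-through output. With FILE_FLAG_NO_BUFFERING,
// each write's byte count and the buffer's address must be multiples of the
// sector size (see GetDiskFreeSpace); allocate buffers with _aligned_malloc.
HANDLE openUnbuffered(const char* path)
{
    return CreateFileA(path,
                       GENERIC_WRITE,
                       0,              // no sharing
                       nullptr,
                       CREATE_ALWAYS,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                       nullptr);
}
// Usage: WriteFile(h, alignedBuffer, alignedByteCount, &written, nullptr);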
I have noticed that reading a file byte-by-byte takes more time than reading the whole file using fread.
According to cplusplus.com:
size_t fread ( void * ptr, size_t size, size_t count, FILE * stream );
Reads an array of count elements, each one with a size of size bytes, from the stream and stores them in the block of memory specified by ptr.
Q1) So fread also reads the file 1 byte at a time; isn't that the same as the read-1-byte-at-a-time method?
Q2) Results show that fread still takes less time.
From here:
I ran this with a file of approximately 44 megabytes as input. When compiled with VC++2012, I got the following results:
using getc Count: 400000 Time: 2.034
using fread Count: 400000 Time: 0.257
Also, a few posts on SO say that it depends on the OS.
Q3) What is the role of the OS?
Why is it so, and what exactly goes on behind the scenes?
fread does not read a file one byte at a time. The interface, which lets you specify size and count separately, is purely for your convenience. Behind the scenes, fread will simply read size * count bytes.
The number of bytes that fread will try to read at once is highly dependent on your C implementation and the underlying filesystem. Unless you're intimately familiar with both, it's often safe to assume that fread will be closer to optimal than anything you invent yourself.
EDIT: physical disks tend to have a relatively high seek time compared to their throughput. In other words, they take relatively long to start reading. But once started, they can read consecutive bytes relatively fast. So without any OS/filesystem support, any call to fread would result in a severe overhead to start each read. So to utilize your disk efficiently, you'll want to read as many bytes at once as possible. But disks are slow compared to CPU, RAM and physical caches. Reading too much at once means your program spends a lot of time waiting for the disk to finish reading, when it could have been doing something useful (like processing already read bytes).
This is where the OS/filesystem comes in. The smart people who work on those have spent a lot of time figuring out the right amount of bytes to request from a disk. So when you call fread and request X bytes, the OS/filesystem will translate that into N requests for Y bytes each, where Y is some generally optimal value that depends on more variables than can be mentioned here.
Another role of the OS/filesystem is what's called 'readahead'. The basic idea is that most IO occurs inside loops. So if a program requests some bytes from disk, there's a very good chance it'll request the next bytes shortly afterwards. Because of this, the OS/filesystem will typically read slightly more than you actually requested at first. Again, the exact amount depends on too many variables to mention. But basically, this is the reason that reading a single byte at a time is still somewhat efficient (it would be another ~10x slower without readahead).
In the end, it's best to think of fread as giving some hints to the OS/filesystem about how many bytes you'll want to read. The more accurate those hints are (closer to the total amount of bytes you'll want to read), the better the OS/filesystem will optimize the disk IO.
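On POSIX systems you can also make the hint explicit rather than implied by your read sizes; a tiny sketch (posix_fadvise is advisory, so treat the effect as best-effort):

#include <fcntl.h>

// Tell the kernel this file will be read sequentially so it can read ahead
// more aggressively; offset 0 and length 0 mean "the whole file".
void adviseSequential(int fd)
{
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}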
Protip: Use your profiler to identify the most significant bottlenecks in an actual, real-life problem...
Q1) So fread also reads the file 1 byte at a time; isn't that the same as the read-1-byte-at-a-time method?
Is there anything from the manual to suggest that bytes can only be read one at a time? Flash memory, which is becoming more and more common, typically requires that your OS read chunks as large as 512KB at a time. Perhaps your OS performs buffering for your benefit, so you don't have to inspect the entire amount...
Q2) Results show that fread still takes less time.
Logically speaking, that's a fallacy. There is no requirement that fgetc be any slower at retrieving a block of bytes than fread. In fact, an optimal compiler may very well produce the same machine code after its optimisation passes.
In reality, it also turns out to be invalid. Most proofs (for example, the ones you're citing) neglect to consider the influence that setvbuf (or stream.rdbuf()->pubsetbuf, in C++) has.
The empirical evidence below, however, integrates setvbuf and, at least on every implementation I've tested it on, shows fgetc to be roughly as fast as fread at reading a large block of data, within some meaningless margin of error that swings either way... Please run these tests multiple times and let me know if you find a system where one of these is significantly faster than the other. I suspect you won't. There are two programs to build from this code:
gcc -o fread_version -std=c99 file.c
gcc -o fgetc_version -std=c99 -DUSE_FGETC file.c
Once both programs are compiled, generate a test_file containing a large number of bytes and you can test like so:
time cat test_file | fread_version
time cat test_file | fgetc_version
Without further ado, here's the code:
#include <assert.h>
#include <stdio.h>

int main(void) {
    unsigned int criteria[2] = { 0 };
#   ifdef USE_FGETC
    int n = setvbuf(stdin, NULL, _IOFBF, 65536);
    assert(n == 0);
    for (;;) {
        int c = fgetc(stdin);
        if (c < 0) {
            break;
        }
        criteria[c == 'a']++;
    }
#   else
    char buffer[65536];
    for (;;) {
        size_t size = fread(buffer, 1, sizeof buffer, stdin);
        if (size == 0) {
            break;
        }
        for (size_t x = 0; x < size; x++) {
            criteria[buffer[x] == 'a']++;
        }
    }
#   endif
    printf("%u %u\n", criteria[0], criteria[1]);
    return 0;
}
P.S. You might have even noticed the fgetc version is simpler than the fread version; it doesn't require a nested loop to traverse the characters. That should be the lesson to take away, here: Write code with maintenance in mind, rather than performance. If necessary, you can usually provide hints (such as setvbuf) to optimise bottlenecks that you've used your profiler to identify.
P.P.S. You did use your profiler to identify this as a bottleneck in an actual, real-life problem, right?
It depends on how you are reading byte-by-byte. But there is a significant overhead to each call to fread (it may need to make an OS/kernel call).
If you call fread 1000 times to read 1000 bytes one by one then you pay that cost 1000 times; if you call fread once to read 1000 bytes then you only pay that cost once.
Consider what's physically happening with the disk. Every time you ask it to perform a read, its head must seek to the correct position and then wait for the right part of the platter to spin under it. If you do 100 separate 1-byte reads, you have to do that 100 times (as a first approximation; in reality the OS probably has a caching policy that's smart enough to figure out what you're trying to do and read ahead). But if you read 100 bytes in one operation, and those bytes are roughly contiguous on the disk, you only have to do all this once.
Hans Passant's comment about caching is right on the money too, but even in the absence of that effect, I'd expect 1 bulk read operation to be faster than many small ones.
Other contributors to the speed reduction are instruction pipeline reloads and databus contention. Data cache misses are similar to instruction pipeline reloads, so I am not presenting them here.
Function calls and Instruction Pipeline
Internally, the processor has an instruction pipeline in cache (fast memory physically near the processor). The processor will fill up the pipeline with instructions, then execute the instructions and fill up the pipeline again. (Note, some processors may fetch instructions as slots open up in the pipeline).
When a function call is executed, the processor encounters a branch statement. The processor can't fetch any new instructions into the pipeline until the branch is resolved. If the branch is executed, the pipeline may be reloading, wasting time. (Note: some processors can read in enough instructions into the cache so that no reading of instructions is necessary. An example is a small loop.)
Worst case, when you call the read function 1000 times, you cause 1000 reloads of the instruction pipeline. If you call the read function once, the pipeline is only reloaded once.
Databus Collisions
Data flows through a databus from the hard drive to the processor, then from the processor to the memory. Some platforms allow for Direct Memory Access (DMA) from the hard drive to the memory. In either case, there is contention among multiple users of the databus.
The most efficient use of the databus is to send large blocks of data. When a user (a component such as the processor or the DMA controller) wants to use the databus, it must wait for the bus to become available. Worst case, another user is sending large blocks, so there is a long delay. When sending 1000 bytes one at a time, the user has to wait 1000 times for other users to give up the databus.
Picture waiting in a queue (line) at a market or restaurant. You need to purchase many items, but you purchase one, then have to go back and wait in line again. Or you could be like other shoppers and purchase many items. Which consumes more time?
Summary
There are many reasons to use large blocks for I/O transfers. Some of the reasons involve the physical drive, others involve instruction pipelines, data caches, and databus contention. By reducing the number of data requests and increasing the size of each, the cumulative time is also reduced. One request has a lot less overhead than 1000 requests. If the overhead is 1 millisecond, one request takes 1 millisecond, while 1000 requests take 1 second.
I have a kernel that is launched several times, until a solution is found. The solution will be found by at least one block.
Therefore, when a block finds the solution, it should inform the CPU, so the CPU can print the solution provided by this block.
So what I am currently doing is the following:
__global__ void kernel(int *sol)
{
    //do some computations
    if (the block found a solution)   // pseudocode condition
        atomicExch(sol, blockIdx.x);  // record the winning block id atomically
}
Now, on every call to the kernel, I copy sol back to host memory and check its value. If it's set to 3, for example, I know that block 3 found the solution, so I now know where the solution's index starts, and I copy the solution back to the host.
In this case, would using cudaHostAlloc be a better option? Moreover, would copying the value of a single integer on every kernel call slow down my program?
Issuing a copy from GPU to CPU and then waiting for its completion will slow your program a bit. Note that whether you choose to send 1 byte or 1KB won't make much of a difference; in this case bandwidth is not the problem, latency is.
But launching a kernel does consume some time as well. If the "meat" of your algorithm is in the kernel itself I wouldn't spend too much time on that single, small transfer.
Do note, if you choose to use mapped memory instead of cudaMemcpy, you will need to explicitly put a cudaDeviceSynchronize (or cudaThreadSynchronize with older CUDA) barrier (as opposed to the implicit barrier of cudaMemcpy) before reading the status. Otherwise, your host code may go ahead and read an old value stored in your pinned memory before the kernel overwrites it.
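A sketch of that mapped-memory variant (the kernel is the one from the question, modified to take a device pointer; launch dimensions are placeholders and error checking is omitted):

#include <cuda_runtime.h>

__global__ void kernel(int *sol);  // writes the winning block id, as above

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);  // must happen before context creation

    int *solHost = nullptr, *solDev = nullptr;
    cudaHostAlloc(&solHost, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&solDev, solHost, 0);

    *solHost = -1;                          // sentinel: no solution found yet
    while (*solHost < 0)
    {
        kernel<<<256, 128>>>(solDev);
        cudaDeviceSynchronize();            // no cudaMemcpy, so no implicit barrier
    }
    // *solHost now holds the id of the block that found the solution.

    cudaFreeHost(solHost);
    return 0;
}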