I have this function in my MMF class
void Clear() {
    int size = SizeB();
    int iter = size / sysInfo.granB;
    for (int i = 0; i < iter; i++) {
        auto v = (char*)MapViewOfFile(hMmf, FILE_MAP_READ | (write ? FILE_MAP_WRITE : 0), 0, i * sysInfo.granB, sysInfo.granB);
        std::memset(v, 0, sysInfo.granB);
        UnmapViewOfFile(v);
    }
}
So what it does is go through the whole file in the smallest addressable chunks (64K in this case): map the view, write zeros, unmap, repeat. It works all right and is very quick, but when I use it there is some phantom memory usage going on.
According to Windows Task Manager, the process itself is using just a few megabytes, but the "physical memory usage" leaps up when I use it on larger files. For instance, using this on a 2GB file is enough to put my laptop in a coma for a few minutes: physical memory usage goes to 99%, everything in Task Manager frantically sheds memory, and everything freezes for a while.
The whole reason I'm trying to do this in 64K chunks is to keep memory usage down, but the chunk size doesn't really matter in this case; any chunk size times however many chunks it takes to cover the file does the same thing.
A couple of things I've tried:
Flushing the view before unmapping - this makes things terribly slow; doing that 2GB file in any size chunks takes around 10 minutes.
Adding a hardcoded delay in the loop - it actually works really well; it still gets done in seconds and the memory usage stays down, but I just really don't like the concept of a hardcoded delay in any loop.
Writing zeros to just the end of the file - I don't actually need to clear the file, only to force it to be ready for use. What I mean is: when I create a new file and just start with my random IO, I get ~1MB/s at best. If I open an existing file, or force-write zeros into a new file first, I get much better speeds. I'm not exactly sure why that is, but a user in another thread suggested that writing something to the very end of the file (after setting the file pointer) would have the same effect as clearing it; from testing, this is not true.
So currently I'm trying to solve this from the angle of clearing the file without destroying the computer's memory. Does anybody know how to appropriately limit that loop?
So here's the thing. When you MapViewOfFile, it allocates the associated memory range but may mark it as swapped out (e.g., if it hasn't already been read into memory). If that's the case, you then get a page fault when you first access it, which causes the OS to read the page in.
Then when you UnmapViewOfFile, the OS takes ownership of the associated memory range and writes the now-not-accessible-by-userspace data back to disk (assuming, of course, that you've written to it, which marks the page as "dirty", otherwise it's straight up deallocated). To quote the documentation (that I asked you to read in comments): modified pages are written "lazily" to disk; that is, modifications may be cached in memory and written to disk at a later time.
Unmapping the view of the file is not guaranteed to "un-commit" and write the data to disk. Moreover, even CloseHandle does not provide that guarantee either. It merely closes the handle to it. Because of caching mechanisms, the operating system is entirely allowed to write data back to disk on its own time if you do not call FlushViewOfFile. Even re-opening the same file may simply pull data back from the cache instead of from disk.
Ultimately the problem is
you memory map a file
you write to the memory map
writing to the memory map's address range causes the file's mapping to be read in from disk
you unmap the file
unmapping the file "lazily" writes the data back to disk
The OS may come under memory pressure, see that there's unwritten data it can now write to disk, and force that to happen in order to recover physical memory for new allocations. Moreover, because the OS flushes lazily, your IO is no longer sequential, which drastically increases spindle-disk latency.
You see better performance when you're sleeping because you're giving the OS the opportunity to say "hey I'm not doing anything... let's go ahead and flush cache" which coerces disk IO to be roughly sequential.
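A middle ground between "flush every view" (which the asker found terribly slow) and "never flush" (which lets dirty pages balloon) is to flush synchronously so dirty data stays bounded. Below is a minimal sketch of that idea in POSIX terms, with mmap/msync/munmap standing in for MapViewOfFile/FlushViewOfFile/UnmapViewOfFile; all names are illustrative, and the chunk size must be a multiple of the page size:

```cpp
// Sketch (assumption: POSIX calls used as stand-ins for the Windows APIs in
// the question). Zero a file chunk by chunk, forcing each chunk to disk
// before moving on so dirty page-cache data never piles up.
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

bool clear_file(const char* path, size_t file_size, size_t chunk) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return false;
    for (size_t off = 0; off < file_size; off += chunk) {
        size_t len = (file_size - off < chunk) ? file_size - off : chunk;
        void* v = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);
        if (v == MAP_FAILED) { close(fd); return false; }
        std::memset(v, 0, len);
        // Synchronous flush: caps the amount of dirty cached data, trading
        // peak speed for bounded memory usage.
        msync(v, len, MS_SYNC);
        munmap(v, len);
    }
    close(fd);
    return true;
}
```

Flushing every Nth chunk instead of every chunk raises the dirty-data bound but reduces the number of synchronous stalls; that is the tunable middle ground between this sketch and the original loop.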
Related
I'm designing a musical Looper, which is something that, once a first recording is made with x seconds, repeats playing these x seconds and on each iteration, adds new content to the loop.
Since the size of the first recording can vary, I cannot do this with RAM allocated memory, I must place it to disk.
Long story short, I cannot spend the time to close the file and open again on every loop iteration, so I need to write and read from the same file.
If I protect this file by a mutex, can I do that without having undefined behaviour?
Since the size of the first recording can vary, I cannot do this with RAM allocated memory, I must place it to disk.
Your assumption is simply wrong. Just because the size of the recording can vary does not mean you have to put it on disk. For example, you could store your recording in a std::vector<unsigned char> - a vector holding bytes - and add or remove any number of bytes you want. Even this is quite low-level; you'd be better off defining your own application-specific data structure so you can fluently modify your recording without worrying about files, bytes, and memory.
If you share a few pieces of your code, people can suggest on that.
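To make the suggestion concrete, here is a minimal sketch of a RAM-backed loop buffer built on std::vector<unsigned char>. The class name is hypothetical and the byte-wise "mixing" is a placeholder; real audio code would mix actual samples with proper clipping:

```cpp
// Minimal in-RAM loop buffer sketch (all names illustrative, not from the
// asker's code). The first recording fixes the loop length; each later pass
// adds new content on top of the existing loop.
#include <cstddef>
#include <vector>

class LoopBuffer {
public:
    void record_first(const std::vector<unsigned char>& samples) {
        loop_ = samples;  // loop length is now fixed
    }
    // Placeholder "mix": adds the new bytes onto the loop. Real audio code
    // would decode samples and handle overflow/clipping properly.
    void overdub(const std::vector<unsigned char>& samples) {
        for (std::size_t i = 0; i < samples.size() && i < loop_.size(); ++i)
            loop_[i] = static_cast<unsigned char>(loop_[i] + samples[i]);
    }
    const std::vector<unsigned char>& data() const { return loop_; }
private:
    std::vector<unsigned char> loop_;
};
```

Because everything lives in one process's memory, there is no file to protect with a mutex, and growth is handled by the vector itself.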
In my modelling code I use boost memory mapped files, to allocate large-ish arrays on disk.
It works well, but I couldn't find a way to detect the situation in which I allocate an array larger than the free space on the disk drive. For example, the following code will execute happily (assuming I have less than 8E9 bytes of free space on my HDD):
boost::iostreams::mapped_file_params file_params;
file_params.path = path;
file_params.mode = std::ios::in | std::ios::out;
file_params.new_file_size = static_cast<size_t>(8E9); // about 8 GB
file_params.length = static_cast<size_t>(8E9);
boost::iostreams::mapped_file result;
result.open(file_params);
I can even work on result.data() until I write to a part of memory which is not allocated (because of the missing space on the HDD), and then I get the following error:
memory access violation at address: 0x7e9e2cd1e000: non-existent physical address
Is there any way to detect this error before I get cryptic memory access violation?
I actually tested this: if the file is bigger than the available free space on the partition, the code hits the memory access violation; if it is smaller, the code works (I tested it by changing the free space on the partition, not by editing the code).
Possible solutions
If I std::fill the file contents with zeroes, I still get the memory access violation, but the error then occurs near the allocation and is easier to debug. I'd rather have some way to raise an exception.
You can use fallocate or posix_fallocate to actually reserve space for the file up front. This way you know you won't ever "over-commit". It has a performance drawback upon initial creation, of course.
For security reasons the OS is likely to zero out the blocks on fallocate.
fallocate lets you do unwritten extents, but it still zeros upon first access.
On Windows this can be combated: SetFileValidData lets you bypass even that.
Note that Linux with O_DIRECT + fallocate(), still uses considerable CPU (as opposed to Windows' SetFileValidData), and although IO bandwidth is usually the bottleneck, this could still have noticeable performance effect if you are doing much CPU work at the same time.
Is there any way to detect this error before I get cryptic memory access violation?
When you just change the size of a file it will be sparse, meaning the areas without data do not consume disk space. Space is allocated during writes - and that is when out-of-disk-space errors can occur.
A way to solve the problem would be to write (dummy) data to the file instead of just changing the size. That takes more time, but you would get the out-of-disk-space error during this first write cycle only, as the file has its final size afterwards.
I have written a program (using FFTW) to perform Fourier transforms of some data files written in OpenFOAM.
The program first finds the paths to each data file (501 files in my current example), then splits the paths between threads, such that thread0 gets paths 0->61, thread1 gets paths 62->123, etc., and then runs the remaining files serially at the end.
I have implemented timers throughout the code to try to see where it bottlenecks. Run in serial, each file takes around 3.5s; for 8 files in parallel the time taken is around 21s (a reduction from the 28s of 8 x 3.5s serial time, but not by much).
The problematic section of my code is below
if (DIAG_timers) {readTimer = timerNow();}
for (yindex = 0; yindex < ycells; yindex++)
{
    for (xindex = 0; xindex < xcells; xindex++)
    {
        getline(alphaFile, alphaStringValue);
        convertToNumber(alphaStringValue, alphaValue[xindex][yindex]);
    }
}
if (DIAG_timers) {endTimerP(readTimer, tid, "reading value and converting", false);}
Here, timerNow() returns the clock value, and endTimerP calculates the time that has passed in ms. (The remaining arguments relate to it running in a parallel thread, to avoid outputting 8 lines for each loop etc, and a description of what the timer measures).
convertToNumber takes the value on alphaStringValue, and converts it to a double, which is then stored in the alphaValue array.
alphaFile is a std::ifstream object, and alphaStringValue is a std::string which stores the text on each line.
The files to be read are approximately 40MB each (just a few lines more than 5120000, each containing only one value, between 0 and 1 (in most cases == (0||1) ), and I have 16GB of RAM, so copying all the files to memory would certainly be possible, since only 8 (1 per thread) should be open at once. I am unsure whether mmap would do this better; several threads on Stack Overflow argue about the merits of mmap vs. more straightforward read operations, in particular for sequential access, so I don't know whether it would be beneficial.
I tried surrounding the code block with a mutex so that only one thread could run the block at once, in case reading multiple files was leading to slow IO via vaguely random access, but that just reduced the process to roughly serial speed.
Any suggestions allowing me to run this section more quickly, possibly via copying the file, or indeed anything else, would be appreciated.
Edit:
template<class T> inline void convertToNumber(std::string const& s, T& result)
{
    std::istringstream i(s);
    T x;
    if (!(i >> x))
        throw BadConversion("convertToNumber(\"" + s + "\")");
    result = x;
}
turns out to have been the slow section. I assume this is due to the creation of 5 million stringstreams per file, followed by the testing of 5 million if conditions? Replacing it with TonyD's suggestion presumably removes the possibility of catching an error, but saves a vast number of (at least in this controlled case) unnecessary operations.
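For reference, a stringstream-free conversion along those lines (a sketch using strtod, which may not be exactly TonyD's version): it avoids constructing a stream per line, trading away detailed error reporting.

```cpp
// Sketch: convert a line to double without building an istringstream per
// call. strtod parses in place; we only check that something was consumed.
#include <cstdlib>
#include <string>

inline bool convertToNumberFast(const std::string& s, double& result) {
    char* end = nullptr;
    result = std::strtod(s.c_str(), &end);
    return end != s.c_str();  // false if nothing could be parsed
}
```

For 5 million lines per file, skipping the stream construction and the exception machinery is where the bulk of the saving comes from.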
The files to be read are approximately 40MB each (just a few lines more than 5120000, each containing only one value, between 0 and 1 (in most cases == (0||1) ), and I have 16GB of RAM, so copying all the files to memory would certainly be possible,
Yes. But loading them there will still count towards your process's wall-clock time, unless they were already read by another process shortly before.
since only 8 (1 per thread) should be open at once.
Since any files that were not loaded into memory before the process started will have to be loaded - and that loading counts towards the process's wall-clock time - it does not matter how many are open at once. Any that are not cached will slow down the process.
I am unsure if mmap would do this better?
No, it wouldn't. mmap is faster, but only because it saves the copy from kernel buffer to application buffer and some system call overhead (with read you do a kernel entry for each read call, while with mmap pages brought in by read-ahead won't cause further page faults). But it will not save you the time to read the files from disk if they are not already cached.
mmap does not load anything in memory. The kernel loads data from disk to internal buffers, the page cache. read copies the data from there to your application buffer while mmap exposes parts of the page cache directly in your address space. But in either case the data are fetched on first access and remain there until the memory manager drops them to reuse the memory. The page cache is global, so if one process causes some data to be cached, next process will get them faster. But if it's first access after longer time, the data will have to be read and this will affect read and mmap exactly the same way.
Since parallelizing the process didn't improve the time much, it seems the majority of the time is the actual I/O. So you can optimize a bit more, and mmap can help, but don't expect much. The only way to improve I/O time is to get a faster disk.
You should be able to ask the system to tell you how much time was spent on the CPU and how much was spent waiting for data (I/O) using getrusage(2) (call it at end of each thread to get data for that thread). So you can confirm how much time was spent by I/O.
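A minimal sketch of that getrusage call (using the portable RUSAGE_SELF; on Linux, RUSAGE_THREAD would give per-thread figures). Subtracting this CPU time from the wall-clock time approximates the time spent blocked on I/O:

```cpp
// Sketch: report CPU seconds (user + system) consumed so far, per
// getrusage(2). RUSAGE_SELF covers the whole process; Linux also offers
// RUSAGE_THREAD for per-thread accounting.
#include <sys/resource.h>
#include <sys/time.h>

double process_cpu_seconds() {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return (double)(ru.ru_utime.tv_sec + ru.ru_stime.tv_sec)
         + (double)(ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) / 1e6;
}
```

Calling this at the end of each thread's work, alongside a wall-clock timer, shows directly what fraction of the run was I/O wait.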
mmap is certainly the most efficient way to get large amounts of data into memory. The main benefit here is that there is no extra copying involved.
It does, however, make the code slightly more complex, since you can't directly use the file I/O functions on a mapped file (and the main benefit is sort of lost if you use the "m" mode of the stdio functions, as you are then back to at least one copy). From past experiments I've made, mmap beats all other file-reading variants by some amount; how much depends on what proportion of the overall time is spent waiting for the disk versus actually processing the file content.
Suppose I have a file which has x records. One 'block' holds m records, so the total number of blocks in the file is n = x/m. If I know the size of one record, say b bytes (so the size of one block is b*m), I can read a complete block at once using the system call read() (is there any other method?). Now, how do I read each record from this block and put each record as a separate element into a vector?
The reason why I want to do this in the first place is to reduce the disk i/o operations. As the disk i/o operations are much more expensive according to what I have learned.
Or will it take the same amount of time as reading record by record from the file and putting each directly into the vector, instead of reading block by block? Reading block by block, I will have only n disk I/Os, whereas reading record by record gives x.
Thanks.
You should consider using mmap() instead of reading your files using read().
What's nice about mmap is that you can treat the file contents as simply mapped into your process space, as if you already had a pointer to them. By inspecting that memory and treating it as an array, or by copying data out with memcpy(), you implicitly perform read operations, but only as necessary - the operating system's virtual memory subsystem is smart enough to do this very efficiently.
The only real reason to avoid mmap may be if you are running on a 32-bit OS and the file size exceeds 2 gigabytes (or slightly less than that). In this case the OS may have trouble allocating address space for your mmap-ed memory. On a 64-bit OS, using mmap should never be a problem.
Also, mmap can be cumbersome if you are writing a lot of data, and size of the data is not known upfront. Other than that, it is always better and faster to use it over the read.
Actually, most modern operating systems rely on mmap extensively. For example, in Linux, to execute some binary, your executable is simply mmap-ed and executed from memory as if it was copied there by read, without actually reading it.
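Here is a minimal sketch of reading a file via mmap on POSIX (the helper name is hypothetical; error handling is abbreviated). The copy into a vector is only for illustration; the whole point of mmap is that you could use the mapped pointer directly as your record array:

```cpp
// Sketch: map a whole file read-only and pull its bytes out. Touching the
// mapping pages data in on demand; no explicit read() calls are made.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

std::vector<char> read_whole_file(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return {};
    struct stat st;
    fstat(fd, &st);
    std::vector<char> out;
    if (st.st_size > 0) {
        void* p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            out.assign(static_cast<char*>(p),
                       static_cast<char*>(p) + st.st_size);
            munmap(p, (size_t)st.st_size);
        }
    }
    close(fd);
    return out;
}
```

With fixed-size records you could instead cast the mapped pointer to a record type and index it directly, skipping the copy entirely.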
Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).
It's still possible reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same by simply adjusting the size of buffer used by the stream (which is probably a lot easier).
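Adjusting the stream's buffer can be sketched as below (a sketch; libstdc++ honors pubsetbuf only when it is called before the file is opened, and the caller's buffer must outlive the stream):

```cpp
// Sketch: give an ifstream a large caller-supplied buffer so each underlying
// read pulls in a bigger block. The buffer must be set before open() and
// must stay alive as long as the stream is in use.
#include <fstream>
#include <vector>

std::ifstream open_with_big_buffer(const char* path, std::vector<char>& buf) {
    std::ifstream in;
    in.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
    in.open(path);
    return in;
}
```

This gets the "fewer, larger I/O operations" effect without any manual block-parsing code.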
I need to write data into drive. I have two options:
write raw sectors.(_write(handle, pBuffer, size);)
write into a file (fwrite(pBuffer, size, count, pFile);)
Which way is faster?
I expected the raw-sector writing function, _write, to be more efficient. However, my test showed the opposite: fwrite is faster, and _write takes longer.
I've pasted my snippet; maybe my code is wrong. Can you help me out? Either way is okay by me, but I think raw write is better, because it seems the data in the drive is encrypted at least....
#define SSD_SECTOR_SIZE 512

char szMemZero[SSD_SECTOR_SIZE] = {0};  // declaration assumed; not shown in the original snippet
unsigned long long ulMovePointer = 0;   // declaration assumed

int g_pSddDevHandle = _open("\\\\.\\G:", _O_RDWR | _O_BINARY, _S_IREAD | _S_IWRITE);

TIMER_START();
while (ulMovePointer < 1024 * 1024 * 1024)
{
    _write(g_pSddDevHandle, szMemZero, SSD_SECTOR_SIZE);
    ulMovePointer += SSD_SECTOR_SIZE;
}
TIMER_END();
TIMER_PRINT();

FILE* file = fopen("f:\\test.tmp", "a+");
ulMovePointer = 0;  // reset, or the second loop never runs

TIMER_START();
while (ulMovePointer < 1024 * 1024 * 1024)
{
    fwrite(szMemZero, SSD_SECTOR_SIZE, 1, file);
    ulMovePointer += SSD_SECTOR_SIZE;
}
TIMER_END();
TIMER_PRINT();
Probably because a direct write isn't buffered. When you call fwrite, you are doing buffered writes, which tend to be faster in most situations. Essentially, each FILE* handle has an internal buffer which is flushed to disk when it becomes full, which means you end up making fewer system calls, as you only write to disk in larger chunks.
To put it another way, in your first loop, you are actually writing SSD_SECTOR_SIZE bytes to disk during each iteration. In your second loop you are not. You are only writing SSD_SECTOR_SIZE bytes to a memory buffer, which, depending on the size of the buffer, will only be flushed every Nth iteration.
In the _write() case, the value of SSD_SECTOR_SIZE matters. In the fwrite case, the size of each write will actually be BUFSIZ. To get a better comparison, make sure the underlying buffer sizes are the same.
However, this is probably only part of the difference.
In the fwrite case, you are measuring how fast you can get data into memory. You haven't flushed the stdio buffer to the operating system, and you haven't asked the operating system to flush its buffers to physical storage. To compare more accurately, you should call fflush() before stopping the timers.
If you actually care about getting data onto the disk rather than just getting the data into the operating systems buffers, you should ensure that you call fsync()/FlushFileBuffers() before stopping the timer.
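A sketch of such a fairer harness for the fwrite side (the helper name is hypothetical; fsync is the POSIX call, with FlushFileBuffers being the Windows equivalent). The clock stops only after both the stdio buffer and the OS cache have been drained:

```cpp
// Sketch: time buffered writes fairly by including the cost of draining
// stdio's buffer (fflush) and the OS cache (fsync) inside the timed region.
#include <chrono>
#include <cstdio>
#include <sys/stat.h>
#include <unistd.h>

double timed_buffered_write(const char* path, const char* block,
                            size_t block_size, size_t count) {
    FILE* f = fopen(path, "wb");
    if (!f) return -1.0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < count; ++i)
        fwrite(block, block_size, 1, f);
    fflush(f);          // stdio buffer -> operating system
    fsync(fileno(f));   // operating system cache -> physical device
    auto t1 = std::chrono::steady_clock::now();
    fclose(f);
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Measured this way, the fwrite loop is no longer credited for work the OS has merely queued, which makes the comparison against the raw _write loop meaningful.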
Other obvious differences:
The drives are different. I don't know how different.
The semantics of a write to a device are different to the semantics of writes to a filesystem; the file system is allowed to delay writes to improve performance until explicitly told not to (eg. with a standard handle, a call to FlushFileBuffers()); writes directly to a device aren't necessarily optimised in that way. On the other hand, the file system must do extra I/O to manage metadata (block allocation, directory entries, etc.)
I suspect that you're seeing a difference in policy about how fast things actually get onto the disk. Raw disk performance can be very fast, but you need big writes and preferably multiple concurrent outstanding operations. You can also avoid buffer copying by using the right options when you open the handle.