File reads slow on first read, but fast on consecutive reads - c++

(This isn't my program, but I'll try to provide all the relevant information to the best of my knowledge.)
There is a program which reads binary files that are roughly 300MB in size, processes them and outputs some information. The program uses ifstream for file input and streams are correctly initialized and closed for each read.
The program has to read each file multiple times. Reading a file for the first time takes about 3 seconds, and each consecutive read takes about 0.1 seconds. If several files are processed, going back to the first file will still yield fast read speeds, but after some time re-reading a file becomes slow.
Additionally, if a file is copied to another location, the speed of the first read of the new file is roughly 0.1 seconds.
If you do the math, the speed of consecutive reads is roughly the advertised read speed of the hard drive.
All this looks like file locations are cached by either the OS or the hard drive, so that on consecutive reads you don't have to seek out file locations.
Does anyone know what exactly is causing the slowdown on the initial read, and if it can be prevented? Three seconds may not seem like a lot, but they add about 5 hours to the total time needed to correctly process every file.
Also, the program runs on Fedora 14 and Scientific Linux, with both OSes using their default file systems.
Any ideas would be appreciated.

Linux will try to copy the file into RAM to make the next read faster; I am guessing this is what is happening. The initial read actually comes off the disk; subsequent reads come out of the file cache, because the entire file has been copied to RAM.

The OS (Linux) has a disk cache. After you read the file once, it's in the cache.

My guess would be that the first time it reads the file it takes longer because it loads the data into the cache.
After the first time, it just uses the information already in the cache.

Yes, the data becomes cached. You might force that caching with the readahead syscall (or simply by having another process read the file first). If you are using mmap, you could also use madvise.
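As a rough illustration, a prefetch pass along those lines might look like the sketch below. It assumes Linux: readahead() is Linux-specific (and needs _GNU_SOURCE), while posix_fadvise(POSIX_FADV_WILLNEED) is the portable alternative; the helper name is made up.
#define _GNU_SOURCE          // for readahead()
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: ask the kernel to pull the whole file into the page cache.
void prefetch_file(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return;
    struct stat st;
    if (fstat(fd, &st) == 0) {
        readahead(fd, 0, st.st_size);                      // Linux-specific
        // posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);   // portable alternative
    }
    close(fd);
}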

Related

Reading small separated chunks of a large file (C++)

I am reading a proprietary binary data file format. The format is basically header, data, size_of_previous_data, header, data, size_of_previous_data, header, data, size_of_previous_data, ...
The header includes the number of bytes in the next chunk of data, and that size is also listed immediately after the data. The header is 256 bytes, the data is typically ~2 MB, and the size_of_previous_data is a 32-bit int.
The files are generally large (~GB), and I often have to search through tens of them for the data I want. In order to do this, the first thing I do in my code is index each of the files, i.e. read in just the headers and record the location of the associated data (file and byte number). My code basically reads the header using fstream::read(), checks the data size, skips the data using fstream::seekg(), then reads in the size_of_previous_data, then repeats until I reach the end of the file.
My problem is that this indexing is painfully slow. The data is on an internal 7200 rpm hard drive on my Windows 10 laptop, and Task Manager shows that my hard drive usage is maxed out, but I am only getting read speeds of about 1.5 MB/s with response times typically >70 ms. I am reading the file with a std::fstream, using fstream::get() to read the headers and fstream::seekg() to move to the next header.
I have profiled my code and almost the entire time is spent in the fstream::read() call that reads the size_of_previous_data value. I presume that when I do this, the data immediately after it is buffered, so the fstream::read() that fetches the next header takes practically no time.
So I am wondering if there is a way to optimise this? Almost my entire buffer in any buffered read is likely to be wasted (97% of it, if it is an 8kB buffer). Is there a way to shrink this and is it likely to be worth it (perhaps underlying OS buffers too in a way I cannot change)?
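For reference, a minimal sketch of the indexing pass described above. The header layout is hypothetical: it assumes the 32-bit size of the following data block is the first field of the 256-byte header, and error handling is omitted.
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

struct IndexEntry {
    std::uint64_t headerOffset;   // where this record's header starts in the file
    std::uint32_t dataSize;       // size of the data block that follows the header
};

std::vector<IndexEntry> indexFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<IndexEntry> index;
    char header[256];
    while (in.read(header, sizeof header)) {
        std::uint32_t dataSize;
        std::memcpy(&dataSize, header, sizeof dataSize);    // assumed position of the size field
        std::uint64_t headerOffset =
            static_cast<std::uint64_t>(in.tellg()) - sizeof header;
        index.push_back({headerOffset, dataSize});
        // Skip the data block plus the trailing 32-bit size_of_previous_data field.
        in.seekg(static_cast<std::streamoff>(dataSize) +
                 static_cast<std::streamoff>(sizeof(std::uint32_t)), std::ios::cur);
    }
    return index;
}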
Assuming that a disk seek takes about 10 ms (from Latency Numbers Every Programmer Should Know) and that your file is 11 GB consisting of 2 MB chunks, the theoretical minimum running time is 5500 * 10 ms = 55 seconds.
If you're already in that order of magnitude, the most effective way of speeding this up might be to buy an SSD.

How to determine whether data has been retrieved from disk or from caches?

I have written a program in C/C++ which needs to fetch data from the disk. After some time it so happens that the operating system stores some of the data in its caches. Is there some way to figure out, from within a C/C++ program, whether the data has been retrieved from the caches or from the disk?
A simple solution would be to time the read operation. Disk reads are significantly slower. You can read a group of file blocks (4 KB) twice to get an estimate.
The problem is that if you run the program again or copy the file in a shell, the OS will cache it.
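A minimal sketch of that timing approach, assuming a 4 KB probe read and an arbitrary 1 ms threshold that you would need to calibrate for your own hardware:
#include <chrono>
#include <fstream>
#include <vector>

// Returns true if a small read completes fast enough that the data was
// most likely served from the page cache rather than the disk.
bool probablyCached(const char* path)
{
    std::vector<char> block(4096);
    std::ifstream in(path, std::ios::binary);
    auto start = std::chrono::steady_clock::now();
    in.read(block.data(), static_cast<std::streamsize>(block.size()));
    auto elapsed = std::chrono::steady_clock::now() - start;
    return in.good() && elapsed < std::chrono::milliseconds(1);
}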

fortran: wait to open a file until closed by another application

I have a Fortran code which needs to read a series of ASCII data files (which all together are about 25 GB). Basically the code opens a given ASCII file, reads the information, uses it to do some operations, and then closes it. Then it opens another file, reads the information, does some operations, and closes it again. And so on with the rest of the ASCII files.
Overall each complete run takes about 10 h. I usually need to run several independent calculations with different parameters, and the way I do it is to run each independent calculation sequentially, so that at the end, if I have 10 independent calculations, the total CPU time is 100 h.
A more rapid way would be to run the 10 independent calculations at the same time using different processors on a cluster machine, but the problem is that if a given calculation needs to open and read data from an ASCII file which has already been opened and is being used by another calculation, then the code obviously gives an error.
I wonder whether there is a way to verify whether a given ASCII file is already being used by another calculation, and if so, to make the code wait until that file is finally closed.
Any help would be greatly appreciated.
Many thanks in advance.
Obamakoak.
Two processes should be able to read the same file. Perhaps action="read" on the open statement might help. Must the files be human readable? The I/O would very likely be much faster with unformatted (sometimes called binary) files.
P.S. If your OS doesn't support multiple-read access, you might have to create your own lock system. Create a master file that a process opens to check which files are in use or not, and to update said list. Immediately closing after a check or update. To handle collisions on this read/write file, use iostat on the open statement and retry after a delay if there is an error.
I know this is an old thread but I've been struggling with the same issue for my own code.
My first attempt was creating a variable on a certain process (e.g. the master) and accessing this variable exclusively using one-sided passive MPI. This is fancy and works well, but only with newer versions of MPI.
Also, my code seemed happy to open (with READWRITE status) files that were also open in other processes.
Therefore, the easiest workaround, if your program has file access, is to make use of an external lock file, as described here. In your case, the code might look something like this:
A process checks whether the lock file exists by trying to open it with STATUS='NEW', which fails if the file already exists. It will look something like:
file_exists = .true.
do while (file_exists)
   ! Try to create the lock file; this open fails if it already exists.
   open(STATUS='NEW', unit=11, file=lock_file_name, iostat=open_stat)
   if (open_stat .eq. 0) then
      file_exists = .false.
      ! Lock acquired: open the data file.
      open(STATUS='OLD', ACTION='READWRITE', unit=12, file=data_file_name, iostat=ierr)
      if (ierr .ne. 0) stop
   else
      call sleep(1)   ! lock held by another process; wait and retry
   end if
end do
The file is now opened exclusively by the current process. Do the operations you need to do, such as reading, writing.
When you are done, close the data file and finally the lock file
close(12,iostat=ierr)
if (ierr.ne.0) stop
close(11,status='DELETE',iostat=ierr)
if (ierr.ne.0) stop
The data file is now again unlocked for the other processes.
I hope this may be useful for other people who have the same problem.

Reading file that changes over time C++

I am going to read a file in C++. The reading itself is happening in a while-loop, and is reading from one file.
When the function reads information from the file, it is going to push this information up some place in the system. The problem is that this file may change while the loop is ongoing.
How may I catch that new information in the file? I tried out std::ifstream reading and changing my file manually on my computer while the endless loop (with a sleep(2) between each iteration) was ongoing, but as expected, nothing happened.
EDIT: the file will overwrite itself at each new entry of data to the file.
Help?
Running on virtual box Ubuntu Linux 12.04, if that may be useful info. And this is not homework.
The usual solution is something along the lines of what MichaelH proposes: the writing process opens the file in append mode and always writes to the end, and the reading process does what MichaelH suggests.
This works fine for small amounts of data in each run. If the processes are supposed to run a long time and generate a lot of data, the file will eventually become too big, since it will contain all of the processed data as well. In this case, the solution is to use a directory, generating numbered files in it, one file per data record. The writing process will write each data set to a new file (incrementing the number), and the reading process will try to open the new file, and delete it when it has finished. This is considerably more complex than the first suggestion, but will work even for processes generating large amounts of data per second and running for years.
EDIT:
Later comments by the OP say that the device is actually a FIFO. In that case:
- you can't seek, so MichaelH's suggestion can't be used literally, but
- you don't need to seek, since data is automatically removed from the FIFO whenever it has been read, and
- depending on the size of the data and how it is written, the writes may be atomic, so you don't have to worry about partial records, even if you happen to read exactly in the middle of a write.
With regards to the latter: make sure that both the read and write buffers are large enough to contain a complete record, and that the writer flushes after each record. And make sure that the records are smaller than the size needed to guarantee atomicity. (Historically, on the early Unix I know, this was 4096, but I would be surprised if it hasn't increased since then. Although... under POSIX, this is defined by PIPE_BUF, which is only guaranteed to be at least 512, and is 4096 under modern Linux.)
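To illustrate the reading side, a minimal sketch under those assumptions (fixed-size records no larger than PIPE_BUF, each written with a single write(); the FIFO path and record size are made up):
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical fixed record size, assumed to be smaller than PIPE_BUF.
constexpr std::size_t kRecordSize = 512;

// Reads one complete record from the FIFO; returns false when the writer
// has closed its end (read() returns 0) or an error occurs.
bool readRecord(int fd, char* record)
{
    std::size_t got = 0;
    while (got < kRecordSize) {
        ssize_t n = read(fd, record + got, kRecordSize - got);
        if (n <= 0)
            return false;
        got += static_cast<std::size_t>(n);
    }
    return true;
}

// Usage: int fd = open("/tmp/data.fifo", O_RDONLY);  // blocks until a writer connects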
Just read the file: rename the file, open the renamed file, do the processing of the data into your system, and at the end of the loop close the file. After a sleep, rename and re-open the file again at the top of the while loop, and repeat.
That's the simplest way to approach the problem and saves having to write code to process dynamic changes to the file during the processing stage.
To be absolutely sure you don't get any corruption, it's best to rename the file. This guarantees that any changes from another process do not affect the processing. It may not be necessary to do this, as it depends on the processing and how the file is updated, but it's a safer approach. A move or rename operation is guaranteed to be atomic, so there should be no concurrency issues with this approach.
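A minimal sketch of that rename-then-process step (file names are placeholders; std::rename is atomic on POSIX systems when both names are on the same filesystem):
#include <cstdio>
#include <fstream>
#include <string>

void processSnapshot()
{
    // Rename the live file, then read the renamed snapshot.
    if (std::rename("data.txt", "data.processing") == 0) {
        std::ifstream in("data.processing");
        std::string line;
        while (std::getline(in, line)) {
            // ...push the line up into the rest of the system...
        }
        std::remove("data.processing");   // discard the processed snapshot
    }
}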
You can use inotify to watch for file changes.
If you need a simpler solution, read the file attributes (with stat()) and check the file's last modification time.
However, you may still miss some file modifications while you are opening and re-reading the file. So if you have control over the application which writes to the file, I'd recommend using something else to communicate between these processes, pipes for example.
To be more explicit, if you want tail-like behavior you'll want to (see the sketch below):
- Open the file, read in the data, save the length, close the file.
- Wait for a bit.
- Open the file again, attempt to seek to the last read position, read the remaining data, close.
- Rinse and repeat.
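A rough sketch of that polling loop; the file name, the 2-second interval, and the reset-on-truncation behavior are assumptions:
#include <chrono>
#include <fstream>
#include <iterator>
#include <string>
#include <thread>

int main()
{
    const std::string path = "data.log";    // hypothetical file being appended to
    std::streampos lastPos = 0;

    while (true) {
        std::ifstream in(path, std::ios::binary);
        if (in) {
            in.seekg(0, std::ios::end);
            std::streampos end = in.tellg();
            if (end < lastPos)
                lastPos = 0;                // file was truncated/overwritten: start over
            in.seekg(lastPos);
            std::string chunk((std::istreambuf_iterator<char>(in)),
                              std::istreambuf_iterator<char>());
            // ...push chunk up into the rest of the system...
            lastPos = end;
        }
        std::this_thread::sleep_for(std::chrono::seconds(2));
    }
}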

Multithreading a File Map into an Array of Buffers

I'm trying to work with nasty large XML and text documents: ~40 GB.
I'm using Visual Studio 2012 on Windows 7.
I'm going to use 'Xerces' to snag the header/'footer tag' from the XMLs.
I want to map an area of the file, say 60-120 MB.
Split the map into (3 * processors/cores) equal parts, setting each part as a buffer and loading the buffers into an array.
Then, using (#processors/cores) while statements in new threads, I will synchronously count characters/lines/xml cycles while chewing through the buffer array. When one buffer is completed, the process will jump to the next 'available' buffer and the completed buffer will be dropped out of memory. At the end I will add the total results into a project log.
Afterwards, I will reference the log, split the files by character count/size (or other option) to the nearest line or cycle, and drop in the header and 'footer tag' to all the splits.
I'm doing this so I can import massive data to a MySQL server over a network with multiple computers.
My Question is, how do I create the buffer array and the file map with new threads?
Can I use :
win CreateFile
win CreateFileMapping
win MapViewOfFile
with standard ifstream operations and char buffers, or should I opt for something else?
Further clarification:
My thinking is that if I can have the hard drive streaming the file into memory from one place and in one direction, then I can use the full processing power of the machine to chew through separate but equal buffers.
~Flavor: It's kind of like being a shepherd trying to scoop food out of one huge bin with 3-6 large buckets, with only two arms, for X sheep that need to stay inside the fenced area. But they all move at the speed of light.
A few ideas or pointers might help me along here.
Any thoughts are Most Welcome. Thanks.
while (getline(my_file, myStr))
{
    characterCount += myStr.length();
    lineCount++;
    if (my_file.eof()) {
        break;
    }
}
This was the only code at run time for the test: 2 hours 30+ min, at 45-50% total processor for the program, running on a dual-core 1.6 GHz laptop with 2 GB RAM. Most of the RAM loaded right now is 600+ MB from ~50 tabs open in Firefox, Visual Studio at 60 MB, etc.
IMPORTANT: During the test, the program running the code, which is only a window and a dialog box, seemed to dump its own working and private set of RAM down to about 300 KB, and didn't respond for the length of the test. I need to make another thread for the while statement, I'm sure. But this means that NONE of the file was read into a buffer. The CPU was struggling for the entire run to keep up with the tiniest effort from the hard drive.
P.S. Further proof of CPU bottlenecking: it might take me 20 min to transfer that entire file to another computer over my wireless network, which includes the read process and a socket catch to a write process on the other computer.
UPDATE
I used this adorable little thing to go from the previous test time to about 15-20min which is in line with what Mats Petersson was saying.
while (my_file.read(&bufferOne[0], bufferOne.size()))
{
    std::streamsize cc = my_file.gcount();   // bytes actually read this pass
    for (std::streamsize i = 0; i < cc; i++)
    {
        if (bufferOne[i] == '\n')
            lineCount++;
        characterCount++;
    }
    currentPercent = characterCount / onePercent;
    SendMessage(GetDlgItem(hDlg, IDC_GENPROGRESS), PBM_SETPOS, currentPercent, 0);
}
// Note: because read() reports failure on the final, partial chunk, the last
// (smaller-than-buffer) block of the file is not counted by this loop as written.
Granted this is a single loop, and it actually behaved much more appropriately than the previous test. This test was ~800% faster than the tight getline loop shown above. I set the buffer for this loop at 20 MB. I jacked this code from: SOF - Fastest Example
BUT...
I would like to point out that while polling the process in Resource Monitor and Task Manager, it clearly showed the first core at 75-90% usage, the second fluctuating at 25-50% (pretty standard for some minor background stuff that I have open), and the hard drive at... wait for it... 50%. There were some 100% disk-time spikes but also some lows at 25%. All of which basically means that splitting the buffer processing between two different threads could very well be a benefit. It will use all the system resources but... that's what I want. I'll update later today when I have the working prototype.
MAJOR UPDATE:
Finally finished my project after a bunch of learning. No file map needed; only a bunch of vector<char>s. I have successfully built a dynamically executing file-stream line and character counter.
The good news: it went from the previous 10-15 min marker to ~3-4 min on a 5.8 GB file. BOOYA!~
Very short answer: Yes, you can use those functions.
For reading data, mapping the file content into memory is likely the most efficient method, since it saves having to copy the data into a buffer in the application; it is read straight into the place it's supposed to go. So, no problem as long as you have enough address space available: 64-bit machines should certainly have plenty, while in a 32-bit system it may be more of a scarce resource, but for sections of a few hundred MB it shouldn't be a huge issue.
However, using multiple threads, I'm not at all convinced. I have a fair idea that reading more than one part of a very large file at a time will be counterproductive: it will increase the amount of head movement on the disk, and head movement eats up a large portion of the achievable transfer rate. You can count on some 50-100 MB/s transfer rates for "ordinary" systems. If the system has some sort of RAID controller or some such, maybe around double that; very exotic RAID controllers may achieve three times.
So reading 40GB will take somewhere in the order of 3-15 minutes.
The CPU is probably not going to be very busy, and running multiple threads is quite likely to worsen the overall performance of the system.
You may want to keep a thread for reading and one for writing, and only actually write out the data once you have a sufficient amount of it, again, to avoid unnecessary moves of the read/write head on the disk(s).
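For completeness, a hedged sketch of the CreateFile / CreateFileMapping / MapViewOfFile sequence asked about, mapping a single read-only window of a large file. The file name and window size are placeholders, error checks are trimmed, and it assumes the file is at least as large as the window:
#include <windows.h>

void countNewlinesInMappedWindow()
{
    HANDLE file = CreateFileA("huge.xml", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);

    // The view offset must be a multiple of the allocation granularity
    // (typically 64 KB); query it with GetSystemInfo if you map past offset 0.
    ULONGLONG offset = 0;
    SIZE_T window = 64 * 1024 * 1024;      // e.g. a 64 MB view

    const char* view = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ,
                      static_cast<DWORD>(offset >> 32),
                      static_cast<DWORD>(offset & 0xFFFFFFFFu),
                      window));

    long long lines = 0;
    for (SIZE_T i = 0; i < window; ++i)    // count newlines straight out of the mapping
        if (view[i] == '\n')
            ++lines;

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
}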