buffered std::ifstream to read from disk only once (C++)

Is there a way to add buffering to a std::ifstream so that seeking (seekg) and reading multiple times wouldn't cause any more disk reads than necessary?
I'd basically like to read a chunk of a file through the stream multiple times, but I'd want the chunk read from disk only once.
The question is probably a bit off because I want to mix buffered reads and streams ...
For example:
char filename[] = "C:\\test.txt";
ifstream inputfile;
char buffer[20] = {};               // zero-filled so the 3-byte reads stay null-terminated
inputfile.open(filename, ios::binary);
inputfile.seekg(2, ios::beg);
inputfile.read(buffer, 3);
cout << buffer << std::endl;
inputfile.seekg(2, ios::beg);       // seek back and read the same bytes again
inputfile.read(buffer, 3);
cout << buffer << std::endl;
I'd want the data to be read from disk only once.

Personally, I wouldn't worry about reading from the file multiple times: the system will keep the used buffers hot anyway. However, depending on the location of the file and swap space, different disks may be used.
The file stream itself does support a setbuf() function which could theoretically set the internally used buffer to a size chosen by the user. However, the only call that is required to be supported and to have an effect is setbuf(0, 0), which has quite the opposite effect: the stream becomes unbuffered.
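A minimal sketch of what supplying a buffer would look like, assuming your implementation honors it (whether and how a user-supplied buffer is used is implementation-defined, and it typically has to be installed before open()):
#include <fstream>
#include <vector>
void readWithLargeBuffer() {
    std::vector<char> buf(1 << 20);                   // user-supplied 1 MiB buffer; outlives the stream
    std::ifstream in;
    in.rdbuf()->pubsetbuf(buf.data(), buf.size());    // implementation-defined whether this is honored
    in.open("C:\\test.txt", std::ios::binary);
    // ... seekg()/read() as before; whether this avoids repeated disk reads depends on the implementation
}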
I guess the easiest way to guarantee that the data isn't read from the file again is to use a std::stringstream and use that instead of the file stream after the initial read, e.g.:
std::stringstream inputfile;
inputfile << std::ifstream(filename).rdbuf();
inputfile.seekg(0, std::ios_base::beg);
If it is undesirable to read the entire file up front, a filtering stream could be used which reads from the file whenever it reaches a section it hasn't read yet. However, creating a corresponding stream buffer isn't that trivial, and since I consider the original objective already questionable, I doubt that it would have much of a benefit. Of course, you could create a simple stream which just does the initialization in the constructor and use that instead.
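A minimal sketch of such a convenience stream, assuming it is acceptable to slurp the whole file in the constructor (the class name and interface are illustrative, not a standard facility):
#include <fstream>
#include <sstream>
class BufferedFileStream : public std::istringstream {
public:
    explicit BufferedFileStream(const char* filename) {
        std::ifstream file(filename, std::ios::binary);
        std::ostringstream contents;
        contents << file.rdbuf();   // single pass over the file
        str(contents.str());        // hand the bytes to the in-memory stream
    }
};
// Usage: all later seekg()/read() calls only touch memory.
// BufferedFileStream inputfile("C:\\test.txt");
// inputfile.seekg(2, std::ios::beg);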

Related

What is the secret to the speed of the filesystem::copy function?

I am trying to match the speed of filesystem::copy when reading the content of a file and writing that content to a new file (a copy operation), but I can't reach that speed.
The following is a simple example of my attempt:
void Copy(const wstring &fromPath, const wstring &toPath) {
    ifstream readFile(fromPath.c_str(), ios_base::binary | ios_base::ate);
    char* fileContent = NULL;
    if (!readFile) { cout << "Cannot open the file.\n"; return; }
    ofstream writeFile(toPath.c_str(), ios_base::binary);
    streampos size = readFile.tellg();
    readFile.seekg(0, ios_base::beg);
    fileContent = new char[size];
    readFile.read(fileContent, size);
    writeFile.write(fileContent, size);
    readFile.close();
    writeFile.close();
    delete[] fileContent;
}
The previous code is able to copy a 1.48 GB file.iso in 8 to 9 seconds, while filesystem::copy copies the same file in 1 to 2 seconds at most.
Note: I don't want to use C++17 at the moment.
What can I do to make my function as fast as filesystem::copy?
Your implementation needs to allocate a buffer the size of the whole file. That is wasteful; you could just read 64k, write 64k, and repeat for the next blocks.
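A minimal sketch of that approach, keeping the question's interface and its implied using namespace std (the 64 KiB block size, the function name, and the error handling are my own choices):
void ChunkedCopy(const wstring &fromPath, const wstring &toPath) {
    ifstream readFile(fromPath.c_str(), ios_base::binary);
    ofstream writeFile(toPath.c_str(), ios_base::binary);
    if (!readFile || !writeFile) { cout << "Cannot open the files.\n"; return; }
    char block[64 * 1024];                             // fixed-size buffer, reused for every chunk
    while (readFile) {
        readFile.read(block, sizeof block);            // reads up to one block
        writeFile.write(block, readFile.gcount());     // writes only the bytes actually read
    }
}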
There's cost to paging memory in and out. If you read the whole thing then write the whole thing, you end up paging in and out the whole file twice.
It could be that multiple threads might read/write separately (provided read stays ahead). That may speed things up.
With hardware support, there might not even be a need for the data to go all the way to the CPU. Yet, your implementation probably ends up doing it: it would be very hard for the compiler to reason about what you do or don't do with fileContent.
There are countless other tricks the implementation of filesystem::copy might be using. You could go and see how it is coded; there are plenty of open implementations.
There's a caveat though: The implementation of the standard library might rely on specific behaviours that the language doesn't guarantee. So you can't simply copy the code to a different compiler/architecture/platform.

Efficient way to read multiple lines from a file at one time?

I am now trying to handle a large file (several GB), so I am thinking of using multiple threads. The file consists of multiple lines of data like:
data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 attr2.2 attr2.3
data3 attr3.1 attr3.2 attr3.3
I am thinking of using one thread to read multiple lines into buffer1 first, and then another thread to handle the data in buffer1 line by line while the reading thread starts reading the file into buffer2. The handling thread then continues with buffer2 when it is ready, and the reading thread reads into buffer1 again.
I have finished the handler part using fread for a small file (several KB), but I am not sure how to make the buffer contain only complete lines instead of splitting a line at the end of the buffer, like this:
data1 attr1.1 attr1.2 attr1.3
data2 attr2.1 att
Also, I find that fgets or ifstream's getline can read a file line by line, but would that be very costly since it involves many I/O operations?
I am struggling to figure out the best way to do this. Is there an efficient way to read multiple lines at one time? Any advice is appreciated.
C stdio and C++ iostream functions use buffered I/O. Small reads only have function-call and locking overhead, not read(2) system call overhead.
Without knowing the line length ahead of time, fgets has to either use a buffer or read one byte at a time. Luckily, the C/C++ I/O semantics allow it to use buffering, so every mainstream implementation does. (According to the docs, mixing stdio and I/O on the underlying file descriptors gives undefined results. This is what allows buffering.)
You're right that it would be a problem if every fgets required a system call.
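If you want to reduce the number of read(2) calls even further, you can enlarge the stdio buffer yourself; a minimal sketch, where the file name and the 1 MiB size are arbitrary choices of mine:
#include <cstdio>
#include <vector>
int main() {
    std::FILE* f = std::fopen("data.txt", "r");            // hypothetical input file
    if (!f) return 1;
    std::vector<char> iobuf(1 << 20);                       // 1 MiB stdio buffer
    std::setvbuf(f, iobuf.data(), _IOFBF, iobuf.size());    // must precede the first read
    char line[4096];
    while (std::fgets(line, sizeof line, f)) {
        // each fgets copies from the large in-process buffer;
        // the underlying read(2) happens only about once per megabyte
    }
    std::fclose(f);
    return 0;
}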
You might find it useful for one thread to read lines and put the lines into some kind of data structure that's useful for the processing thread.
If you don't have to do much processing on each line, doing the I/O in the same thread as the processing will keep everything in the L1 cache of that CPU, though. Otherwise data will end up in L1 of the I/O thread, and then have to make it to L1 of the core running the processing thread.
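If you do split the work, a minimal sketch of that hand-off, assuming a plain std::queue<std::string> guarded by a mutex (the file name and the per-line "processing" are placeholders):
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
int main() {
    std::queue<std::string> lines;        // shared hand-off structure
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    // Reader thread: pulls lines from the file and queues them.
    std::thread reader([&] {
        std::ifstream in("data.txt");     // hypothetical input file
        std::string line;
        while (std::getline(in, line)) {
            std::lock_guard<std::mutex> lock(m);
            lines.push(std::move(line));
            cv.notify_one();
        }
        std::lock_guard<std::mutex> lock(m);
        done = true;
        cv.notify_one();
    });
    // Processing side (here, the main thread) consumes the queued lines.
    std::size_t count = 0;
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !lines.empty() || done; });
        if (lines.empty() && done) break;
        std::string line = std::move(lines.front());
        lines.pop();
        lock.unlock();
        ++count;                          // placeholder for real per-line processing
    }
    reader.join();
    std::cout << count << " lines processed\n";
    return 0;
}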
Depending on what you want to do with your data, you can minimize copying by memory-mapping the file in place. Or read with fread, or skip the stdio layer entirely and just use POSIX open / read, if you don't need your code to be as portable. Scanning a buffer for newlines yourself might have less overhead than what the stdio functions do.
You can handle the leftover line at the end of the buffer by copying it to the front of the buffer, and calling the next fread with a reduced buffer size. (Or, make your buffer ~1k bigger than the size of your fread calls, so you can always read multiples of the memory and filesystem page size (typically 4kiB), unless the trailing part of the line is > 1k.)
Or use a circular buffer, but reading from a circular buffer means checking for wraparound every time you touch it.
It all depends on what you want to do with the data afterwards: do you need to keep a copy of the lines? Do you intend to process the input as std::strings? etc.
Here are some general remarks that could help you further:
istream::getline() and fgets() are buffered operations. So I/O is already reduced and you can assume the performance is already reasonable.
std::getline() is also buffered. Nevertheless, if you don't need to process std::strings, the function would cost you a considerable number of memory allocations/deallocations, which might impact performance.
Block operations like read() or fread() can achieve economies of scale if you can afford large buffers. This can be especially efficient if you use the data in a throw-away fashion (because you can avoid copying the data and work directly in the buffer), but at the cost of extra complexity.
But all these considerations should not obscure the fact that performance is very much affected by the library implementation that you use.
I've done a little informal benchmark reading a million lines in the format you've shown:
* With MSVC2015 on my PC, read() is twice as fast as fgets(), and almost 4 times faster than reading into std::string.
* With GCC on CodingGround, compiling with -O3, fgets() and both getline() variants are approximately the same, and read() is slower.
Here is the full code if you want to play around.
Here is the code that shows you how to move the buffer around.
int nr = 0;          // number of valid bytes carried over / currently in the buffer
bool last = false;   // true once the last (partial) read has happened
while (!last)
{
    // here nr contains the number of bytes kept from the incomplete line
    last = !ifs.read(buffer + nr, szb - nr);
    nr = nr + ifs.gcount();
    char *s, *p = buffer, *pe = p + nr;
    do {   // process complete lines in the buffer
        for (s = p; p != pe && *p != '\n'; p++)
            ;
        if (p != pe || (p == pe && last)) {
            if (p != pe)
                *p++ = '\0';
            else
                *p = '\0';      // terminate the final line that has no trailing '\n'
            lines++;            // TO DO: here s is a null-terminated line to process
            sln += strlen(s);   // (dummy operation for the example)
        }
    } while (p != pe);          // until the end of the buffer is reached
    std::copy(s, pe, buffer);   // copy the last (incomplete) line to the beginning of the buffer
    nr = pe - s;                // and prepare the info for the next iteration
}
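For completeness, the declarations the snippet above assumes could look like this (my reconstruction, not part of the original answer; the file name and block size are illustrative):
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <vector>
std::ifstream ifs("data.txt", std::ios::binary);   // input file to scan
const std::streamsize szb = 64 * 1024;             // size of each read() block
std::vector<char> storage(szb);
char *buffer = storage.data();                     // work buffer of szb bytes
long lines = 0;                                    // number of complete lines found
std::size_t sln = 0;                               // accumulated line length (dummy statistic)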

Is it safe to use a text file that is modified by C++ and is not closed?

The title is not so clear but what I mean is this:
std::fstream filestream("abc.dat", std::ios::out);
double write_to_file;
while (some_condition) {
    write_to_file = 1.345;     // this number will be different in each loop iteration
    filestream.seekp( 345 );   // position the output (put) pointer
    filestream << std::setw(5) << write_to_file << std::flush;
    // write the number to replace the one written in the previous iteration
    system( "./Some_app ./abc.dat" );   // run an application on unix,
                                        // which uses "abc.dat" as its input file
}
filestream.close();
That's the rough idea: each iteration rewrites the number in the file and flushes. I'm hoping not to open and close the file in each iteration, in order to save computing time (I'm also not sure of the cost of open and close). Is it OK to do this?
On unix, std::flush does not necessarily write to the physical device. Typically, it does not. std::ofstream::flush calls rdbuf->pubsync(), which in turn calls rdbuf->sync(), which in turn "synchronizes the controlled sequences with the arrays." What are those "controlled sequences"? Typically they're not the underlying physical device. In a modern OS such as unix, there are lots of things in between high-level I/O constructs such as C++'s concept of an I/O buffer and the bits on the device.
Even the low-level POSIX function fsync() does not necessarily guarantee that bits are written to the device. Even closing and reopening the output file does not necessarily guarantee that bits are written to the device.
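For reference, a sketch of what that POSIX-level sequence looks like (std::fstream does not portably expose its file descriptor, so this bypasses iostreams entirely; error handling omitted):
#include <fcntl.h>
#include <unistd.h>
void writeAndSync(const char* path) {
    int fd = open(path, O_WRONLY);
    write(fd, "1.345", 5);   // lands in the kernel's page cache, not necessarily on the disk
    fsync(fd);               // asks the kernel to push it to the device; the drive may still cache it
    close(fd);
}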
You might want to rethink your design.
You need at least to flush the C++ stream buffer with filestream.flush() before calling system (but you already did that with << std::flush).
I am assuming that ./Someapp is not writing the file, and is opening it for reading only.
But in your case, better open and close the file at each iteration, since the system call is obviously a huge bottleneck.

Are C++ << and >> operators slow? What alternatives are there to these operators?

I'm doing a project for college and I'm using C++. I used std::cin and std::cout with the << and >> operators to read input and to display output. My professor has published an announcement saying that >> and << are not recommended because they are slow.
We only have to read integers and the input is always correct (we don't need to verify it, we know the format it is in and just need to read it). What alternatives should we use then, if << and >> are not recommended?
For cout you can use put or write
// single character
char character;
cout.put(character);
// c string
char * buffer = new char[size];
cout.write(buffer, size);
For cin you could use get, read, or getline
// Single character
char ch;
std::cin.get(ch);
// c string
char * buffer = new char[size];
std::cin.read(buffer, size);
std::cin.get(buffer, size);
std::cin.getline(buffer, size);
Worrying about the speed of the stream insertion and extraction operators (<< and >>) in C++ is something to do when you have lots of data to process (over 1E06 items). For smaller sets of data, the execution time is negligible compared to other factors within the computer and your program.
Before you worry about the speed of formatted I/O, get your program working correctly. Review your algorithms for efficiency. Review your implementation of the algorithms for efficiency. Review the data for efficiency.
The slowness of the stream extraction operators comes first from translating the textual representation into the internal representation, and then from the implementation itself. Heck, if you are typing in the data, forget about any optimizations. To speed up your file reading, organize the data for easy extraction and translation.
If you are still panicking about efficiency, use a binary file representation. The data in the file should be formatted so that it can be loaded directly into memory without any translation. Also, the data should be loaded in large chunks.
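A minimal sketch of that idea, assuming the data is a plain array of int (the file name and values are made up for illustration):
#include <cstddef>
#include <fstream>
#include <vector>
int main() {
    // Write the integers once in their in-memory representation.
    std::vector<int> values = {1, 2, 3, 4, 5};
    std::ofstream out("data.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(values.data()),
              values.size() * sizeof(int));
    out.close();
    // Load them back in one large chunk, with no text parsing involved.
    std::ifstream in("data.bin", std::ios::binary | std::ios::ate);
    std::vector<int> loaded(static_cast<std::size_t>(in.tellg()) / sizeof(int));
    in.seekg(0, std::ios::beg);
    in.read(reinterpret_cast<char*>(loaded.data()), loaded.size() * sizeof(int));
    return 0;
}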
From the Hitchhiker's Guide to the Galaxy, DON'T PANIC.

ifstream vs. fread for binary files

Which is faster? ifstream or fread.
Which should I use to read binary files?
fread() puts the whole file into the memory.
So after fread, accessing the buffer it creates is fast.
Does ifstream::open() put the whole file into memory?
or does it access the hard disk every time we run ifstream::read()?
So... does ifstream::open() == fread()?
or (ifstream::open(); ifstream::read(file_length);) == fread()?
Or shall I use ifstream::rdbuf()->read()?
edit:
My readFile() method now looks something like this:
void readFile()
{
    std::ifstream fin;
    fin.open("largefile.dat", std::ifstream::binary | std::ifstream::in);
    // in each of these small read methods, there is at least one fin.read()
    // call inside.
    readHeaderInfo(fin);
    readPreference(fin);
    readMainContent(fin);
    readVolumeData(fin);
    readTextureData(fin);
    fin.close();
}
Will the multiple fin.read() calls in the small methods slow down the program?
Shall I only use 1 fin.read() in the main method and pass the buffer into the small methods? I guess I am going to write a small program to test.
Thanks!
Are you really sure about fread putting the whole file into memory? File access can be buffered, but I doubt that you really get the whole file put into memory. I think ifstream::read just uses fread under the hood in a more C++ conformant way (and is therefore the standard way of reading binary information from a file in C++). I doubt that there is a significant performance difference.
To use fread, the file has to be open. It doesn't just take a file and put it into memory at once. So ifstream::open == fopen and ifstream::read == fread.
The C++ stream API is usually a little bit slower than the C file API if you use the high-level interface, but it provides a cleaner/safer API than C.
If you want speed, consider using memory-mapped files, though there is no portable way of doing this with the standard library.
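A POSIX sketch of that approach (mmap is not part of the standard library; the helper name is mine and error handling is minimal):
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
// Map the whole file read-only and return a pointer to its bytes.
const char* MapFile(const char* path, std::size_t& length) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    fstat(fd, &st);
    length = static_cast<std::size_t>(st.st_size);
    void* data = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   // the mapping stays valid after closing the descriptor
    return data == MAP_FAILED ? nullptr : static_cast<const char*>(data);
}
// Pages are faulted in on demand; release the mapping with munmap(ptr, length) when done.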
As to which is faster, see my comment. For the rest:
Neither of these methods automatically reads the whole file into memory. They both read as much as you specify.
At least for ifstream, I am sure that the I/O is buffered, so there will not necessarily be a disk access for every read you make.
See this question for the C++ way of reading binary files.
The idea with C++ file streams is that some or all of the file is buffered in memory (based on what it thinks is optimal) and that you don't have to worry about it.
I would use ifstream::read() and just tell it how much you need.
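As an illustration, one of the question's small read methods could look roughly like this, reading a fixed-size header in a single call (the Header layout is invented for the example and assumes the file stores the struct's exact in-memory representation):
#include <cstdint>
#include <fstream>
struct Header {
    std::uint32_t magic;
    std::uint32_t version;
    std::uint64_t payloadSize;
};
void readHeaderInfo(std::ifstream& fin) {
    Header h{};
    fin.read(reinterpret_cast<char*>(&h), sizeof h);   // one read() for exactly the bytes needed
    // ... use h.magic, h.version, h.payloadSize ...
}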
Use stream operator:
DWORD processPid = 0;
std::ifstream myfile ("C:/Temp/myprocess.pid", std::ios::binary);
if (myfile.is_open())
{
    myfile >> processPid;
    myfile.close();
    std::cout << "PID: " << processPid << std::endl;
}