Writing huge txt files without overloading RAM - c++

I need to write the results of a process to a txt file. The process is very long and the amount of data to be written is huge (~150 GB). The program works fine, but the problem is that the RAM gets overloaded and, at a certain point, the program just stops.
The program is simple:
igzstream f;
f.open(filePath);
for (int k = 0; k < nDataset; k++) {
    // treat element of dataset
    f >> result;
}
f.close();
Is there a way of writing this file without overloading the memory?

You should flush output periodically.
For example:
if (k%10000 == 0) f.flush();
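A minimal sketch of that suggestion, assuming the results really are written straight to an ofstream and that each element produces one string (writeResults and processElement are made-up names; filePath and nDataset come from the question):
#include <fstream>
#include <string>

// processElement stands in for whatever produces one result string.
std::string processElement(int k);

void writeResults(const std::string &filePath, int nDataset)
{
    std::ofstream f(filePath);
    for (int k = 0; k < nDataset; k++) {
        f << processElement(k);    // write each result immediately, keep nothing around
        if (k % 10000 == 0)
            f.flush();             // periodically push the stream buffer out to the OS
    }
}                                  // the destructor flushes and closes f
If memory still grows with this pattern, the growth is most likely coming from whatever accumulates the result values before they reach the stream, not from the stream itself.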

I'd like to suggest something like this:
ogzstream f;
f.open(filePath);
string s;
for (int k = 0; k < nDataset; k++) {
    // treat element of dataset
    s.append(result);
    if (s.length() >= OPTIMUM_BUFFER_SIZE) {
        f << s;
        f.flush();
        s.clear();
    }
}
f << s;
f.flush();
f.close();
Basically, you build the output up in an in-memory buffer rather than pushing every result straight to the stream, so you don't have to worry about when the stream decides to flush; and whenever you do hand the buffer over, you explicitly flush it to the actual file. Some ideas for the OPTIMUM_BUFFER_SIZE can be found here and here.
I'm not exactly sure whether string or vector is the best option for the buffer. I will do some research myself and update the answer, or you can refer to Effective STL by Scott Meyers.

If that truly is the code where your program gets stuck, then your explanation of the problem is wrong.
There's no text file. Your igzstream is not dealing with text, but a gzip archive.
There's no data being written. The code you show reads from the stream.
I don't know what your program does with result, because you didn't show that. But if it accumulates results into a collection in memory, that will grow. You'll need to find a way to process all your data without loading all of it into RAM at the same time.
Your memory usage could be from the decompressor. For some compression algorithms, an entire block has to be stored in memory. In such cases it's best to break the file into blocks and compress each separately (possibly pre-initializing a dictionary with the results of the previous block). I don't think that gzip is such an algorithm, however. You may need to find a library that supports streaming.

What is the secret to the speed of the filesystem::copy function?

I am trying to match the speed of filesystem::copy when reading the content of a file and writing that content to a new file (a copy operation), but I can't reach that speed.
The following is a simple example of my attempt:
void Copy(const wstring &fromPath, const wstring &toPath) {
    ifstream readFile(fromPath.c_str(), ios_base::binary | ios_base::ate);
    char* fileContent = NULL;
    if (!readFile) { cout << "Cannot open the file.\n"; return; }
    ofstream writeFile(toPath.c_str(), ios_base::binary);
    streampos size = readFile.tellg();
    readFile.seekg(0, ios_base::beg);
    fileContent = new char[size];
    readFile.read(fileContent, size);
    writeFile.write(fileContent, size);
    readFile.close();
    writeFile.close();
    delete[] fileContent;
}
The previous code is able to copy a file.iso of size 1.48 GB in 8 to 9 seconds, while filesystem::copy copies the same file in 1 to 2 seconds at most.
Note: I don't want to use C++17 at the moment.
What can I do to make my function as fast as filesystem::copy?
Your implementation needs to allocate a buffer the size of the whole file. That is wasteful; you could just read 64 KB, write 64 KB, and repeat for the next blocks (see the sketch at the end of this answer).
There's cost to paging memory in and out. If you read the whole thing then write the whole thing, you end up paging in and out the whole file twice.
It could be that multiple threads might read/write separately (provided read stays ahead). That may speed things up.
With hardware support, there might not even be a need for the data to go all the way to the CPU, yet your implementation probably ends up doing exactly that. It would be very hard for the compiler to reason about what you do or don't do with fileContent.
There are countless other tricks the implementation of filesystem::copy might be using. You could go see how it is coded; there are plenty of open implementations.
There's a caveat though: The implementation of the standard library might rely on specific behaviours that the language doesn't guarantee. So you can't simply copy the code to a different compiler/architecture/platform.
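A minimal sketch of the fixed-size-buffer approach from the first point, mirroring the wide-string paths of the question (the const wchar_t* stream constructors are a common MSVC extension; the 64 KB chunk size and the function name are only illustrative):
#include <fstream>
#include <string>
#include <vector>

void CopyChunked(const std::wstring &fromPath, const std::wstring &toPath) {
    std::ifstream in(fromPath.c_str(), std::ios_base::binary);
    std::ofstream out(toPath.c_str(), std::ios_base::binary);
    std::vector<char> buffer(64 * 1024);           // one reusable 64 KB chunk

    while (in) {
        in.read(buffer.data(), buffer.size());     // may read less on the final chunk
        out.write(buffer.data(), in.gcount());     // write exactly what was read
    }
}
The single allocation is reused for every chunk, so memory use stays flat regardless of the file size.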

Correct way of reading /proc/pid/status

I read /proc/<pid>/status this way:
std::ifstream file(filename);
std::string line;
int numberOfLinesToRead = 4;
int linesRead = 0;
while (std::getline(file, line)) {
    // do stuff
    if (numberOfLinesToRead == ++linesRead) {
        break;
    }
}
I noticed that in rare cases std::getline hangs.
Why does this happen? I was under the impression that the proc filesystem should be in a somewhat consistent state, and that there should not be cases where a newline is missing. My assumption was that getline returns false when EOF or an error occurs.
What is the recommended, safe way to read /proc/<pid>/status?
Perhaps a surer path is to use fread into a large buffer. The status file is small, so allocate a local buffer and read the whole file.
For an example, look at the second answer for the simplest solution.
This may still fail on the fopen or fread, but a sensible error should be returned.
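A sketch of that suggestion, assuming a local buffer comfortably larger than a typical status file (the function name and buffer size are illustrative):
#include <cstdio>
#include <string>

// Returns the whole contents of /proc/<pid>/status, or an empty string on error.
std::string readStatus(const std::string &path)
{
    std::FILE* f = std::fopen(path.c_str(), "r");
    if (!f)
        return {};                       // the process may already be gone

    char buffer[16384];                  // status files are only a few KB
    size_t n = std::fread(buffer, 1, sizeof(buffer), f);
    std::fclose(f);
    return std::string(buffer, n);       // parse lines out of this copy afterwards
}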
/proc is a virtual filesystem. That means reading from "files" in it is not the same as reading from a normal filesystem.
If a process exits, the information about it is removed from /proc much faster than it would be on a real filesystem (where dirty-cache flushing delays are involved).
Bearing that in mind, imagine that the process exits before you get to read the next line, which wasn't buffered yet.
The solution is either to account for the file disappearing, since you may not need information about a process that no longer exists, or to buffer the entire file and only then parse it.
EDIT: the hang should clearly be related to the fact that this is a virtual filesystem; it does not behave exactly the same way as a real filesystem. Since this is a specific fs type, the issue could be in the fs driver. The code you provide looks fine for normal file reading.
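To illustrate the "buffer the entire file, then parse" approach with iostreams (a sketch; the four-line limit mirrors the question, and the function name is made up):
#include <fstream>
#include <sstream>
#include <string>

void readStatusLines(const std::string &filename)
{
    std::ifstream file(filename);
    std::stringstream buffered;
    buffered << file.rdbuf();            // slurp the whole (small) file in one go

    std::string line;
    int linesRead = 0;
    while (std::getline(buffered, line) && linesRead++ < 4) {
        // do stuff -- parsing now only touches the in-memory copy
    }
}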

How to copy a file from one location to another in a fast way with C++ program? [duplicate]

This question already has answers here:
Copy a file in a sane, safe and efficient way
(9 answers)
Closed 7 years ago.
I am trying to understand the code behind the copy command, which copies a file from one place to another. I studied C++ file system basics and have written the following code for my task.
#include<iostream>
#include<fstream>
#include<string>
using namespace std;
int main()
{
    cout << "Copy file\n";
    string from, to;
    cout << "Enter file address: ";
    cin >> from;
    ifstream in(from, ios::in | ios::binary);
    if (!in)
    {
        cout << "could not find file " << from << endl;
        return 1;
    }
    cout << "Enter file destination: ";
    cin >> to;
    ofstream out(to, ios::out | ios::binary);
    char ch;
    while (in.get(ch))
    {
        out.put(ch);
    }
    cout << "file has been copied\n";
    in.close();
    out.close();
}
Though this code works, it is much slower than the copy command of my OS, which is Windows. I want to know how I can make my program faster and reduce the difference between my program's time and my OS's copy command time.
Reading one byte at a time is going to waste a lot of time in function calls... use a bigger buffer:
char ch[4096];
while (in) {
    in.read(ch, sizeof(ch));
    out.write(ch, in.gcount());
}
(you may want to add some more error handling, e.g. out may go into a bad state and the like)
(the most C++-ish way is reported here, but it takes advantage of streambuf functionality that a beginner typically has little reason to know, and to me it is also less instructive)
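For reference, using the in and out streams already opened in the question's code, the streambuf-based approach mentioned above is essentially a single statement (a sketch; whether it beats the explicit buffer loop depends on the library implementation):
out << in.rdbuf();      // the streambuf inserter copies the remaining contents of in to out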
You have correctly opened the file for binary read and binary write. However, instead of reading individual characters (which is not meaningful for binary data), use istream::read and ostream::write.
Like other answers say, use bigger buffers. I'd go for 1MB.
But there's a lot more to it.
Also, avoid the stream library and FILE* functions. They buffer the data, so you get two memcpy calls instead of one.
Disabling buffering on the streams can achieve a similar result, but I think you're better off using the system calls directly.
And one last thing, on the "do it yourself" front: you must check the return values of the read and write calls. They may read/write fewer bytes than you ask them to.
If you can manage a circular buffer, you should switch between reading and writing whenever a call returns short... the disk may be more ready for reading or for writing, so there is no point wasting time waiting instead of switching to the other thing you have to do.
And now the very last thing you might want to explore: look into the sendfile system call. It was built to speed up web servers by doing all the copying in the kernel and avoiding context switches and memcpys, but it may serve here if it works with two disk file descriptors.
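For the sendfile idea, a rough Linux-only sketch might look like this (the function name is made up; since kernel 2.6.33 the output descriptor may be a regular file, and sendfile can transfer fewer bytes than requested, hence the loop):
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

bool copy_sendfile(const char* from, const char* to)
{
    int in = open(from, O_RDONLY);
    if (in < 0) { perror("open(from)"); return false; }

    struct stat st;
    if (fstat(in, &st) < 0) { perror("fstat"); close(in); return false; }

    int out = open(to, O_WRONLY | O_CREAT | O_TRUNC, st.st_mode & 0777);
    if (out < 0) { perror("open(to)"); close(in); return false; }

    off_t offset = 0;
    while (offset < st.st_size) {
        // sendfile may transfer less than requested, so keep going until done.
        ssize_t n = sendfile(out, in, &offset, st.st_size - offset);
        if (n <= 0) { perror("sendfile"); close(in); close(out); return false; }
    }

    close(in);
    close(out);
    return true;
}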

How to optimize input/output in C++

I'm solving a problem which requires very fast input/output. More precisely, the input data file will be up to 15 MB. Is there a fast way to read/print integer values?
Note: I don't know if it helps, but the input file has the following form:
line 1: a number n
line 2..n+1: three numbers a,b,c;
line n+2: a number r
line n+3..n+4+r: four numbers a,b,c,d
Note 2: The input file will be stdin.
Edit: Something like the following isn't fast enough:
void fast_scan(int &n) {
    char buffer[10];
    fgets(buffer, sizeof(buffer), stdin);   // gets() is unsafe and has been removed from the standard
    n = atoi(buffer);
}
void fast_scan_three(int &a, int &b, int &c) {
    char buffval[3][20] = {}, buffer[60];   // zero-initialize so each field is null-terminated
    fgets(buffer, sizeof(buffer), stdin);
    int n = strlen(buffer);
    int buffindex = 0, curindex = 0;
    for (int i = 0; i < n; ++i) {
        if (!isdigit(buffer[i]) && !isspace(buffer[i])) break;
        if (isspace(buffer[i])) {
            buffindex++;
            curindex = 0;
        } else {
            buffval[buffindex][curindex++] = buffer[i];
        }
    }
    a = atoi(buffval[0]);
    b = atoi(buffval[1]);
    c = atoi(buffval[2]);
}
The general input/output optimization principle is to perform as few I/O operations as possible, reading/writing as much data as possible per operation.
So a performance-aware solution typically looks like this:
Read all data from device into some buffer (using the principle mentioned above)
Process the data generating resulting data to some buffer (on place or another one)
Output results from buffer to device (using the principle mentioned above)
E.g. you could use std::basic_istream::read to input data in big chunks instead of doing it line by line. The same idea applies to output: generate a single string as the result, adding line-feed characters manually, and output it all at once.
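A sketch of that approach when the input arrives on stdin: slurp everything into one string, parse from the in-memory copy, and build the entire output in a second string before printing it once (the parsing step is only hinted at):
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ios::sync_with_stdio(false);

    std::ostringstream slurp;
    slurp << std::cin.rdbuf();           // one big read of the whole input (~15 MB)
    std::string input = slurp.str();

    std::string output;
    output.reserve(1 << 20);
    // ... parse integers out of `input` and append the results to `output` ...

    std::cout << output;                 // one big write at the end
    return 0;
}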
If you want to minimize the physical I/O overhead, load the whole file into memory using a technique called memory-mapped files. I doubt you'll get a noticeable performance gain though. Parsing will most likely be a lot costlier.
Consider using threads. Threading is useful for lots of things, but this is exactly the kind of problem that motivated the invention of threads.
The underlying idea is to separate the input, processing, and output, so these different operations can run in parallel. Do it right and you will see a significant speedup.
Have one thread doing close to pure input. It reads lines into a buffer of lines. Have a second thread do a quick pre-parse and organize the raw input into blocks. You have two things that need to be parsed: the line that contains the number of lines containing triples, and the line that contains the number of lines containing quads. This thread forms the raw input into blocks that are still mostly text. A third thread parses the triples and quads, re-forming the input into fully parsed structures. Since the data are now organized into independent blocks, you can have multiple instances of this third operation so as to better take advantage of the multiple processors on your computer. Finally, other threads will operate on these fully parsed structures. Note: it might be better to combine some of these operations, for example combining the input and pre-parsing operations into one thread.
Put several input lines in a buffer, split them, and then parse them simultaneously in different threads.
It's only 15MB. I would just slurp the whole thing into a memory buffer, then parse it.
The parsing looks something like this, approximately:
#define DIGIT(c) ((c) >= '0' && (c) <= '9')

while (*p == ' ') p++;
if (DIGIT(*p)) {
    a = 0;
    while (DIGIT(*p)) {
        a *= 10; a += (*p++ - '0');
    }
}
// and so on...
You should be able to write this kind of code in your sleep.
I don't know if that's any faster than atoi, but it's not fussing around figuring out where numbers begin and end. I would stay away from scanf because it goes through the whole nine yards of figuring out its format string.
If you run this whole thing in a loop 1000 times and grab some stack samples, you should see that it is spending nearly 100% of its time reading in the file and generating output (which you didn't mention).
You just can't beat that.
If you do see that noticeable time is spent in the actual parsing, it might be possible to do overlapped I/O, but the machine would have to be really slow, or the I/O really fast (like from a solid state drive) before that would make sense.

ifstream vs. fread for binary files

Which is faster? ifstream or fread.
Which should I use to read binary files?
fread() puts the whole file into memory.
So after fread, accessing the buffer it creates is fast.
Does ifstream::open() put the whole file into memory,
or does it access the hard disk every time we run ifstream::read()?
So... does ifstream::open() == fread()?
or (ifstream::open(); ifstream::read(file_length);) == fread()?
Or shall I use ifstream::rdbuf()->read()?
edit:
My readFile() method now looks something like this:
void readFile()
{
    std::ifstream fin;
    fin.open("largefile.dat", std::ifstream::binary | std::ifstream::in);

    // each of these small read methods makes at least one fin.read()
    // call internally.
    readHeaderInfo(fin);
    readPreference(fin);
    readMainContent(fin);
    readVolumeData(fin);
    readTextureData(fin);

    fin.close();
}
Will the multiple fin.read() calls in the small methods slow down the program?
Shall I only use 1 fin.read() in the main method and pass the buffer into the small methods? I guess I am going to write a small program to test.
Thanks!
Are you really sure about fread putting the whole file into memory? File access can be buffered, but I doubt that you really get the whole file put into memory. I think ifstream::read just uses fread under the hood in a more C++ conformant way (and is therefore the standard way of reading binary information from a file in C++). I doubt that there is a significant performance difference.
To use fread, the file has to be open; it doesn't just take a file and put it into memory at once. So ifstream::open corresponds to fopen, and ifstream::read to fread.
The C++ stream API is usually a little slower than the C file API if you use the high-level interface, but it is cleaner and safer than the C API.
If you want speed, consider using memory-mapped files, though there is no portable way of doing this with the standard library.
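On POSIX systems, a memory-mapped read might look roughly like this (a non-portable sketch; the function name is made up and error handling is minimal):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

bool readMapped(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return false;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return false; }

    // Map the whole file read-only; pages are faulted in on demand.
    void* p = mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return false; }

    const char* data = static_cast<const char*>(p);
    // ... read the bytes in [data, data + st.st_size) ...

    munmap(p, static_cast<size_t>(st.st_size));
    close(fd);
    return true;
}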
As to which is faster, see my comment. For the rest:
Neither of these methods automatically reads the whole file into memory. They both read as much as you specify.
At least for ifstream, I am sure that the IO is buffered, so there will not necessarily be a disk access for every read you make.
See this question for the C++-way of reading binary files.
The idea with C++ file streams is that some or all of the file is buffered in memory (based on what it thinks is optimal) and that you don't have to worry about it.
I would use ifstream::read() and just tell it how much you need.
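For example, slurping the whole binary file with a single ifstream::read and then letting the smaller routines parse out of that buffer might look like this sketch (readWholeFile is a made-up helper; opening with ios::ate and calling tellg is just one common way to get the size):
#include <fstream>
#include <vector>

std::vector<char> readWholeFile(const char* path)
{
    std::ifstream fin(path, std::ios::binary | std::ios::ate);
    if (!fin)
        return {};

    std::streamsize size = fin.tellg();      // opened at the end, so tellg() gives the size
    fin.seekg(0, std::ios::beg);

    std::vector<char> buffer(static_cast<size_t>(size));
    fin.read(buffer.data(), size);           // one read call for the whole file
    return buffer;
}
The readHeaderInfo/readPreference/etc. methods would then take a pointer or offset into this buffer instead of the stream.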
Use stream operator:
DWORD processPid = 0;
std::ifstream myfile("C:/Temp/myprocess.pid", std::ios::binary);
if (myfile.is_open())
{
    myfile >> processPid;
    myfile.close();
    std::cout << "PID: " << processPid << std::endl;
}