Editing a 10gb file using limited main memory in C/C++ - c++

I need to sort a 10gb file containing a list of numbers as fast as possible using only 100mb of memory.
I'm breaking them into chunks and then merging them.
I am currently using C FILE pointers, as they are faster than C++ file I/O (at least on my system).
I tried a 1 GB file and my code works fine, but it throws a segmentation fault as soon as I call fscanf after opening the 10 GB file.
FILE *fin;
FILE *fout;
fin = fopen( filename, "r" );
while( 1 ) {
// throws the error here
for( i = 0; i < MAX && ( fscanf( fin, "%d", &temp ) != EOF ); i++ ) {
v[i] = temp;
}
What should I use instead?
And do you have any suggestions about how to go about this in the best way possible?

There is a special class of algorithms for this called external sorting. There is a variant of merge sort that is an external sorting algorithm (just google for merge sort tape).
But if you're on Unix, it's probably easier to run the sort command in a separate process.
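BTW, opening files that are bigger than 2 GB requires large file support. Depending on your operating system and your libraries, you need to define a macro (for example, _FILE_OFFSET_BITS=64 with glibc on 32-bit Linux) or call other file handling functions. Without it, fopen can fail and return NULL, and passing that NULL pointer to fscanf would be consistent with the segmentation fault you see.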
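The chunk-and-merge idea can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name externalSort, the ".runN" temporary file names and the chunkSize parameter are all assumptions made for the example, and the input is assumed to be whitespace-separated values that fit in an int.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: sorts the ints in inPath into outPath while holding
// at most chunkSize values in memory at a time.
bool externalSort(const std::string& inPath, const std::string& outPath,
                  std::size_t chunkSize) {
    assert(chunkSize > 0);
    std::FILE* in = std::fopen(inPath.c_str(), "r");
    if (!in) return false;  // e.g. missing file or no large-file support

    // Phase 1: read one chunk at a time, sort it, write it out as a sorted run.
    std::vector<std::string> runs;
    std::vector<int> buf;
    buf.reserve(chunkSize);
    bool more = true;
    while (more) {
        buf.clear();
        int value;
        while (buf.size() < chunkSize) {
            if (std::fscanf(in, "%d", &value) != 1) { more = false; break; }
            buf.push_back(value);
        }
        if (buf.empty()) break;
        std::sort(buf.begin(), buf.end());
        std::string runPath = outPath + ".run" + std::to_string(runs.size());
        std::FILE* run = std::fopen(runPath.c_str(), "w");
        if (!run) { std::fclose(in); return false; }
        for (int v : buf) std::fprintf(run, "%d\n", v);
        std::fclose(run);
        runs.push_back(runPath);
    }
    std::fclose(in);

    // Phase 2: k-way merge of the runs using a min-heap of (value, run index).
    using Item = std::pair<int, std::size_t>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::FILE*> files;
    bool ok = true;
    for (std::size_t i = 0; i < runs.size(); ++i) {
        std::FILE* f = std::fopen(runs[i].c_str(), "r");
        files.push_back(f);
        int v;
        if (!f) ok = false;
        else if (std::fscanf(f, "%d", &v) == 1) heap.push({v, i});
    }
    std::FILE* out = std::fopen(outPath.c_str(), "w");
    ok = ok && out != nullptr;
    while (ok && !heap.empty()) {
        Item top = heap.top();
        heap.pop();
        std::fprintf(out, "%d\n", top.first);
        int v;  // refill the heap from the run that just supplied a value
        if (std::fscanf(files[top.second], "%d", &v) == 1) heap.push({v, top.second});
    }
    if (out) std::fclose(out);
    for (std::size_t i = 0; i < files.size(); ++i) {
        if (files[i]) std::fclose(files[i]);
        std::remove(runs[i].c_str());  // clean up the temporary run
    }
    return ok;
}
```

With 100 MB of memory and 4-byte ints, a chunkSize of roughly 20 million values would leave some headroom; a 10 GB input then produces on the order of a hundred runs, which a single merge pass can handle.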

Related

Faster methods of reading a large amount of text/text files?

I'm currently in the process of making a program to read a large number of text files, and searching for regular expressions, then saving the line text and line number, as well as the file name and file folder path, and writing that data to a .csv file. The method I'm using is as follows:
string line;
ifstream stream1(filePath);
{
while (getline(stream1,line))
{
// Code here that compares regular search expression to the line
// If match, save data to a tuple for later writing to .csv file.
}
}
I'm wondering if there is a faster method to do this. I wrote the same type of program in Matlab (which I'm more experienced in) using the same logic as described above, going line by line. I had run time down to roughly 5.5 minutes for 300 MB of data (which I'm not even sure if that's fast or not, probably not), but in Visual Studio it's taking as much as 2 hours for the same data.
I had heard of how fast C++ can be for data reading/writing so I'm a little confused by these results. Is there a faster method? I tried looking around online but all I found was memory mapping which seemed to only be Linux/Unix?
You can use memory-mapped files.
Since you’re on Windows, the correct API is probably the CAtlFileMapping<char> template class. Here's an example.
#include <atlfile.h>
// Error-checking macro
#define CHECK( hr ) { const HRESULT __hr = ( hr ); if( FAILED( __hr ) ) return __hr; }
HRESULT testMapping( const wchar_t* path )
{
// Open the file
CAtlFile file;
CHECK( file.Create( path, GENERIC_READ, FILE_SHARE_READ, OPEN_EXISTING ) );
// Map the file
CAtlFileMapping<char> mapping;
CHECK( mapping.MapFile( file ) );
// Query file size
ULONGLONG ullSize;
CHECK( file.GetSize( ullSize ) );
const char* const ptrBegin = mapping;
const size_t length = (size_t)ullSize;
// Process the mapped data, e.g. call memchr() to find your new lines
return S_OK;
}
Don’t forget that address space is limited for 32-bit processes; compiling a 64-bit program makes a lot of sense for this application.
Also, if your files are very small, you have a huge number of them, and they are stored on a fast SSD, a better approach is to process multiple files in parallel. But that’s somewhat harder to implement.

Writing a single large data file, or multiple smaller files: Which is faster?

I am developing a c++ program that writes a large amount of data to disk. The following function gzips the data and writes it out to a file. The compressed data is on the order of 100GB. The function to compress and write out the data is as follows:
void constructSNVFastqData(string const& fname) {
ofstream fastq_gz(fname.c_str());
stringstream ss;
for (int64_t i = 0; i < snvId->size(); i++) {
consensus_pair &cns_pair = snvId->getPair(i);
string qual(cns_pair.non_mutated.size(), '!');
ss << "#" + cns_pair.mutated + "[" + to_string(cns_pair.left_ohang) +
";" + to_string(cns_pair.right_ohang) + "]\n"
+ cns_pair.non_mutated + "\n+\n" + qual + "\n";
}
boost::iostreams::filtering_streambuf<boost::iostreams::input> out;
out.push(boost::iostreams::gzip_compressor());
out.push(ss);
boost::iostreams::copy(out,fastq_gz);
fastq_gz.close();
}
The function writes data to a string stream, which I then
write out to a file (fastq_gz) using boost's filtering_streambuf.
The file is not a log file. After the file has been written
it will be read by a child process. The file does not need to be viewed
by humans.
Currently, I am writing the data out to a single, large file (fastq_gz). This is taking a while, and the file system - according to our system manager - is very busy. I wonder if, instead of writing out a single large file, I should instead write out a number of smaller files? Would this approach be faster, or reduce the load on the file system?
Please note that it is not the compression that is slow - I have benchmarked.
I am running on a linux system and do not need to consider generalising the implementation to a windows filesystem.
So what your code is probably doing is (a) generating your file into memory swap space, (b) loading it from swap space and compressing it on the fly, and (c) writing the compressed data to the output file as you get it.
(b) and (c) are great; (a) is going to kill you. It means two round trips of the uncompressed data, one of them while competing with your output file generation.
I cannot find one in boost iostreams, but you need an istream (source) or a device that gets data from you on demand. Someone must have written it (it seems so useful), but I didn't see it in 5 minutes of looking at the boost iostreams docs.
0.) Devise an algorithm to divide the data into multiple files so that it could be recombined later.
1.) Write data to multiple files on separate threads in the background. Maybe shared threads. (maybe start n = 10 threads at a time or so)
2.) Query through the future attribute of the shared objects to check if writing is done. (size > 1 GB)
3.) Once above is the case; then recombine data when it is queried by the child process
4.) I would recommend writing a new file after every 1 GB

Reading a Potentially incomplete File C++

I am writing a program to reformat a DNS log file for insertion to a database. There is a possibility that the line currently being written to in the log file is incomplete. If it is, I would like to discard it.
I started off believing that the eof function might be a good fit for my application, however I noticed a lot of programmers dissuading the use of the eof function. I have also noticed that the feof function seems to be quite similar.
Any suggestions/explanations that you guys could provide about the side effects of these functions would be most appreciated, as would any suggestions for more appropriate methods!
Edit: I currently am using the istream::peek function in order to skip over the last line, regardless of whether it is complete or not. While acceptable, a solution that determines whether the last line is complete would be preferred.
The specific comparison I'm using is: logFile.peek() != EOF
I would consider using
int fseek ( FILE * stream, long int offset, int origin );
with SEEK_END
and then
long int ftell ( FILE * stream );
to determine the number of bytes in the file, and therefore - where it ends. I have found this to be more reliable in detecting the end of the file (in bytes).
Could you detect an (End of Record/Line) EOR marker (CRLF perhaps) in the last two or three bytes of the file? (3 bytes might be used for CRLF^Z...depends on the file type). This would verify if you have a complete last row
fseek( stream, -2, SEEK_END );
fread( buf, 1, 2, stream ); // buf is a 2-byte char array; compare it with "\r\n"
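As a concrete version of that check, here is a minimal sketch that only looks at the final byte. It assumes the log uses '\n' or "\r\n" line endings (in both cases the last byte of a complete line is '\n'); the function name endsWithNewline is made up for the example.

```cpp
#include <cassert>
#include <cstdio>

// Returns true if the file is non-empty and its last byte is '\n',
// i.e. the last line has been completely written.
bool endsWithNewline(const char* path) {
    assert(path != nullptr);
    std::FILE* f = std::fopen(path, "rb");  // binary mode: no newline translation
    if (!f) return false;
    bool complete = false;
    // Seeking to one byte before the end fails on an empty file.
    if (std::fseek(f, -1, SEEK_END) == 0) {
        complete = (std::fgetc(f) == '\n');
    }
    std::fclose(f);
    return complete;
}
```

If this returns false, the reader can process every line except the last one and pick that line up on a later pass.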
If you try to open the file with exclusive locks, you can detect (by the failure to open) that the file is in use, and try again in a second...(or whenever)
If you need to capture the file contents as the file is being written, it's much easier if you eliminate as many layers of indirection and buffering between your logic and the actual bytes of data in the file.
Do not use C++ IO streams of any type - you have no real control over them. Don't use FILE *-based functions such as fopen() and fread() - those are buffered, and even if you disable buffering there are layers of code between your code and the data that once again you can't control and don't know what's happening.
In a POSIX environment, you can use low-level C-style open() and read()/pread() calls. And use fstat() to know when the file contents have changed - you'll see the st_size member of the struct stat argument change after a call to fstat().
You'd open the file like this:
int logFileFD = open( "/some/file/name.log", O_RDONLY );
Inside a loop, you could do something like this (error checking and actual data processing omitted):
off_t lastSize = 0; // off_t matches the type of st_size
while ( !done )
{
    struct stat statBuf;
    fstat( logFileFD, &statBuf );
    if ( statBuf.st_size == lastSize )
    {
        sleep( 1 ); // or however long you want
        continue; // go to next loop iteration
    }
    // process new data - might need to keep some of the old data
    // around to handle lines that cross boundaries
    processNewContents( logFileFD, lastSize, statBuf.st_size );
    lastSize = statBuf.st_size; // remember how far we have read
}
processNewContents() could look something like this:
void processNewContents( int fd, size_t start, size_t end )
{
    static char oldData[ BUFSIZE ];
    static char newData[ BUFSIZE ];
    // assumes amount of data will fit in newData...
    // pread() takes the byte count first, then the file offset
    ssize_t bytesRead = pread( fd, newData, end - start, start );
    // process the data that was read here
    return;
}
You may also find that you need to add some code to close() then re-open() the file in case your application doesn't seem to be "seeing" data written to the file. I've seen that happen on some systems - the application somehow sees a cached copy of the file size somewhere while an ls run in another context gets the more accurate, updated size. If, for example, you know your log file is written to every 10-15 seconds, if you go 30 seconds without seeing any change to the file you know to try reopening the file.
You can also track the inode number in the struct stat results to catch log file rotation.
In a non-POSIX environment, you can replace the open(), fstat() and pread() calls with the low-level OS equivalents; Windows provides most of what you'd need. On Windows, lseek() followed by read() would replace pread().

Is modifying a file without writing a new file possible in c++? [duplicate]

Let's say I have a text file that is 100 lines long. I want to only change what is in the 50th line.
One way to do it is open the file for input and open a new file for output. Use a for-loop to read in the first half of the file line by line and write to the second file line by line, then write what I want to change, and then write out the second half using a for-loop again. Finally, I rename the new file to overwrite the original file.
Is there another way besides this? A way to modify the contents in the middle of a file without touching the rest of the file and without writing everything out again?
If there is, then what's the code to do it?
(Copied from my comment on your previous question)
Well, it depends on which I/O library you're using, the C standard library, C++ standard library, or an OS-specific API. But generally there are three steps. When opening the file, specify write access without truncation, e.g. fopen(fname, "r+"). Seek to the desired location with e.g. fseek. Write the new string with e.g. fwrite or fprintf. In some APIs, like Windows OVERLAPPED or linux aio, you can say "write to this location" without a separate seek step.
Open the file, use fseek to jump to the place you need and write the data, then close the file.
from http://www.cplusplus.com/reference/clibrary/cstdio/fseek/...
#include <stdio.h>
int main ()
{
FILE * pFile;
pFile = fopen ( "example.txt" , "r+" );
if ( pFile == NULL ) return 1; /* "r+" requires the file to exist already */
fputs ( "This is an apple." , pFile );
fseek ( pFile , 9 , SEEK_SET );
fputs ( " sam" , pFile );
fclose ( pFile );
return 0;
}

Fastest way to find the number of lines in a text (C++)

I need to find the number of lines in a file before doing some operations on it. I tried reading the file and incrementing a line_count variable at each iteration until reaching EOF, but that was not fast enough in my case. I used both ifstream and fgets; both were slow. Is there a hacky way to do this, one that is also used by, for instance, BSD, the Linux kernel or Berkeley DB (maybe using bitwise operations)?
The number of lines is in the millions in that file and it keeps getting larger, each line is about 40 or 50 characters. I'm using Linux.
Note:
I'm sure there will be people who might say use a DB idiot. But briefly in my case I can't use a db.
The only way to find the line count is to read the whole file and count the number of line-end characters. The fastest way to do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.
As your current file size appears to be about 60 MB, this is not an attractive option. You can get some of the speed by not reading the whole file at once, but reading it in chunks, say of size 1 MB. You also say that a database is out of the question, but it really does look to be the best long-term solution.
Edit: I just ran a small benchmark on this and using the buffered approach (buffer size 1024K) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code - my tests were done with g++ using -O2 optimisation level:
#include <iostream>
#include <fstream>
#include <vector>
#include <ctime>
#include <string>
using namespace std;
unsigned int FileRead( istream & is, vector <char> & buff ) {
is.read( &buff[0], buff.size() );
return is.gcount();
}
unsigned int CountLines( const vector <char> & buff, int sz ) {
int newlines = 0;
const char * p = &buff[0];
for ( int i = 0; i < sz; i++ ) {
if ( p[i] == '\n' ) {
newlines++;
}
}
return newlines;
}
int main( int argc, char * argv[] ) {
time_t now = time(0);
if ( argc == 1 ) {
cout << "lines\n";
ifstream ifs( "lines.dat" );
int n = 0;
string s;
while( getline( ifs, s ) ) {
n++;
}
cout << n << endl;
}
else {
cout << "buffer\n";
const int SZ = 1024 * 1024;
std::vector <char> buff( SZ );
ifstream ifs( "lines.dat" );
int n = 0;
while( int cc = FileRead( ifs, buff ) ) {
n += CountLines( buff, cc );
}
cout << n << endl;
}
cout << time(0) - now << endl;
}
Don't use C++ stl strings and getline ( or C's fgets), just C style raw pointers and either block read in page-size chunks or mmap the file.
Then scan the block at the native word size of your system ( ie either uint32_t or uint64_t) using one of the magic algorithms 'SIMD Within A Register (SWAR) Operations' for testing the bytes within the word. An example is here; the loop with the 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. ( that code gets to around 5 cycles per input byte matching a regex on each line of a file )
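Since the linked example is not reproduced here, the following is a separate minimal sketch of the SWAR idea, not the code that answer refers to. It XORs each 64-bit word with 0x0a0a0a0a0a0a0a0a so that newline bytes become zero, then uses the exact zero-byte test (one that does not carry between bytes) and counts the marked bytes. The function name countNewlinesSwar is an assumption for the example.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Counts '\n' bytes in data[0..len) eight bytes at a time.
std::size_t countNewlinesSwar(const char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    const std::uint64_t lf   = 0x0a0a0a0a0a0a0a0aULL;
    const std::uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;
    std::size_t count = 0;
    std::size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t w;
        std::memcpy(&w, data + i, 8);          // safe unaligned load
        std::uint64_t x = w ^ lf;              // '\n' bytes become 0x00
        std::uint64_t t = (x & low7) + low7;   // high bit set if low 7 bits nonzero
        t = ~(t | x | low7);                   // 0x80 exactly where the byte was 0x00
        count += std::bitset<64>(t).count();   // one set bit per newline
    }
    for (; i < len; ++i) count += (data[i] == '\n');  // leftover tail bytes
    return count;
}
```

Whether this beats a plain byte loop depends on the compiler, which may already vectorise the simple loop at higher optimisation levels, so it is worth benchmarking both.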
If the file is only a few tens or a hundred or so megabytes, and it keeps growing (ie something keeps writing to it), then there's a good likelihood that linux has it cached in memory, so it won't be disk IO limited, but memory bandwidth limited.
If the file is only ever being appended to, you could also remember the number of lines
and previous length, and start from there.
It has been pointed out that you could use mmap with C++ STL algorithms, and create a functor to pass to std::for_each. I suggested not doing it, not because you can't do it that way, but because there is no gain in writing the extra code to do so. Or you can use boost's mmapped iterator, which handles it all for you; but for the problem that the code I linked to was written for, this was much, much slower, and the question was about speed, not style.
You wrote that it keeps getting larger.
This sounds like it is a log file or something similar where new lines are appended but existing lines are not changed. If this is the case you could try an incremental approach:
Parse to the end of file.
Remember the line count and the offset of EOF.
When the file grows fseek to the offset, parse to EOF and update the line count and the offset.
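The three steps above can be sketched like this, assuming the file is only ever appended to (never truncated or rewritten). The struct LineCountState and the function updateLineCount are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <vector>

// State carried between passes (hypothetical name).
struct LineCountState {
    std::streamoff offset = 0;  // number of bytes already scanned
    std::size_t lines = 0;      // '\n' characters seen so far
};

// Scans only the bytes added since the previous call and updates the state.
bool updateLineCount(const char* path, LineCountState& st) {
    assert(path != nullptr);
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    in.seekg(st.offset);                 // skip what was counted before
    std::vector<char> buf(1 << 20);      // 1 MB read buffer
    while (true) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = in.gcount();
        if (got <= 0) break;
        for (std::streamsize i = 0; i < got; ++i) {
            if (buf[i] == '\n') ++st.lines;
        }
        st.offset += got;
    }
    return true;
}
```

Call updateLineCount once to process the existing file, keep the state around (or save it to disk), and call it again whenever the file has grown.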
There's a difference between counting lines and counting line separators. Some common gotchas to watch out for if getting an exact line count is important:
What's the file encoding? The byte-by-byte solutions will work for ASCII and UTF-8, but watch out if you have UTF-16 or some multibyte encoding that doesn't guarantee that a byte with the value of a line feed necessarily encodes a line feed.
Many text files don't have a line separator at the end of the last line. So if your file says "Hello, World!", you could end up with a count of 0 instead of 1. Rather than just counting the line separators, you'll need a simple state machine to keep track.
Some very obscure files use Unicode U+2028 LINE SEPARATOR (or even U+2029 PARAGRAPH SEPARATOR) as line separators instead of the more common carriage return and/or line feed. You might also want to watch out for U+0085 NEXT LINE (NEL).
You'll have to consider whether you want to count some other control characters as line breakers. For example, should a U+000C FORM FEED or U+000B LINE TABULATION (a.k.a. vertical tab) be considered going to a new line?
Text files from older versions of Mac OS (before OS X) use carriage returns (U+000D) rather than line feeds (U+000A) to separate lines. If you're reading the raw bytes into a buffer (e.g., with your stream in binary mode) and scanning them, you'll come up with a count of 0 on these files. You can't count both carriage returns and line feeds, because PC files generally end a line with both. Again, you'll need a simple state machine. (Alternatively, you can read the file in text mode rather than binary mode. The text interfaces will normalize line separators to '\n' for files that conform to the convention used on your platform. If you're reading files from other platforms, you'll be back to binary mode with a state machine.)
If you ever have a super long line in the file, the getline() approach can throw an exception causing your simple line counter to fail on a small number of files. (This is particularly true if you're reading an old Mac file on a non-Mac platform, causing getline() to see the entire file as one gigantic line.) By reading chunks into a fixed-size buffer and using a state machine, you can make it bullet proof.
The code in the accepted answer suffers from most of these traps. Make it right before you make it fast.
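A minimal sketch of such a state machine follows. It covers only some of the cases listed above: it treats "\n", "\r" and "\r\n" each as one line break, counts a final line that has no terminator, and works across chunk boundaries. It assumes a single-byte or UTF-8 encoding and ignores the Unicode separators, form feed and vertical tab. The name LineCounter is invented for the example.

```cpp
#include <cassert>
#include <cstddef>

// Feed the file in chunks of any size, then ask for the total.
struct LineCounter {
    std::size_t breaks = 0;   // line breaks seen so far
    bool lastWasCR = false;   // previous byte was '\r'
    bool inLine = false;      // bytes seen since the last line break

    void feed(const char* p, std::size_t n) {
        assert(p != nullptr || n == 0);
        for (std::size_t i = 0; i < n; ++i) {
            const char c = p[i];
            if (c == '\r') {
                ++breaks;
                lastWasCR = true;
                inLine = false;
            } else if (c == '\n') {
                if (!lastWasCR) ++breaks;  // LF right after CR is the same break
                lastWasCR = false;
                inLine = false;
            } else {
                lastWasCR = false;
                inLine = true;
            }
        }
    }

    // A trailing line without a terminator still counts as a line.
    std::size_t total() const { return breaks + (inLine ? 1 : 0); }
};
```

Because the state lives in the struct, a "\r\n" pair split across two chunks is still counted once.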
Remember that all fstreams are buffered, so in effect they already read in chunks and you do not have to recreate this functionality. All you need to do is scan the buffer. Don't use getline() though, as this will force you to size a string. I would just use the STL std::count_if and stream buffer iterators.
#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
struct TestEOL
{
    bool operator()(char c)
    {
        last = c;
        return last == '\n';
    }
    char last = '\n'; // initial value makes an empty file count as zero lines
};
int main()
{
    std::ifstream file("Plop.txt");
    TestEOL test;
    // count_if takes its predicate by value; std::ref makes it update
    // `test` itself, so test.last is valid after the call.
    std::size_t count = std::count_if(std::istreambuf_iterator<char>(file),
                                      std::istreambuf_iterator<char>(),
                                      std::ref(test));
    if (test.last != '\n') // If the last character checked is not '\n'
    {                      // then the last line in the file has not been
        ++count;           // counted. So increment the count so we count
    }                      // the last line even if it is not '\n' terminated.
    std::cout << count << "\n";
}
It isn't slow because of your algorithm; it is slow because I/O operations are slow. I suppose you are using a simple O(n) algorithm that goes over the file sequentially. In that case, there is no faster algorithm that can optimize your program.
However, although there is no faster algorithm, there is a faster mechanism called a "memory-mapped file". Mapped files have some drawbacks and might not be appropriate for your case, so you'll have to read about them and decide for yourself.
Memory-mapped files won't let you implement an algorithm better than O(n), but they may reduce I/O access time.
You can only get a definitive answer by scanning the entire file looking for newline characters. There's no way around that.
However, there are a couple of possibilities which you may want to consider.
1/ If you're using a simplistic loop, reading one character at a time checking for newlines, don't. Even though the I/O may be buffered, function calls themselves are expensive, time-wise.
A better option is to read large chunks of the file (say 5M) into memory with a single I/O operation, then process that. You probably don't need to worry too much about special assembly instructions since the C runtime library will be optimized anyway - a simple memchr() over the buffer should do it (strchr() also works, but only if the buffer is NUL-terminated).
2/ If you're saying that the general line length is about 40-50 characters and you don't need an exact line count, just grab the file size and divide by 45 (or whatever average you deem to use).
3/ If this is something like a log file and you don't have to keep it in one file (may require rework on other parts of the system), consider splitting the file periodically.
For example, when it gets to 5M, move it (e.g., x.log) to a dated file name (e.g., x_20090101_1022.log) and work out how many lines there are at that point (storing it in x_20090101_1022.count), then start a new x.log log file. Characteristics of log files mean that this dated section that was created will never change, so you will never have to recalculate the number of lines.
To process the log "file", you'd just cat x_*.log through some process pipe rather than cat x.log. To get the line count of the "file", do a wc -l on the current x.log (relatively fast) and add it to the sum of all the values in the x_*.count files.
The thing that takes time is loading 40+ MB into memory. The fastest way to do that is to either memorymap it, or load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for \n characters is almost instantaneous, no matter how it is implemented.
So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.
Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.
Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.
Reading large amounts of data from the harddrive is slow. There's no way around that.
If your file only grows, then Ludwig Weinzierl's approach is the best solution if you do not have control of the writers. Otherwise, you can make it even faster: increment the counter by one each time a line is written to the file. If multiple writers may try to write to the file simultaneously, then make sure to use a lock. Locking your existing file is enough. The counter can be 4 or 8 bytes written in binary in a file written under /run/<your-prog-name>/counter (which is RAM so dead fast).
Ludwig Algorithm
Initialize offset to 0
Read file from offset to EOF counting '\n' (as mentioned by others, make sure to use buffered I/O and count the '\n' inside that buffer)
Update offset with position at EOF
Save counter & offset to a file or in a variable if you only need it in your software
Repeat from "Read file ..." on a change
This is actually how various pieces of software that process log files function (e.g. fail2ban comes to mind).
The first time, it has to process a huge file. Afterward, it is very small and thus goes very fast.
Proactive Algorithm
When creating the files, reset counter to 0.
Then each time you receive a new line to add to the file:
Lock file
Write one line
Load counter
Add one to counter
Save counter
Unlock file
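The locked write-and-count sequence could look roughly like the sketch below. It assumes a Linux or BSD system where flock() from <sys/file.h> is available and that every writer goes through the same function. The function names appendLine and readCount, and the choice of an 8-byte binary counter stored in its own file, are assumptions for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Appends one line to the log and bumps the counter, all under a lock on the log.
bool appendLine(const std::string& logPath, const std::string& counterPath,
                const std::string& line) {
    assert(!logPath.empty() && !counterPath.empty());
    int logFd = open(logPath.c_str(), O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (logFd < 0) return false;
    if (flock(logFd, LOCK_EX) != 0) { close(logFd); return false; }  // lock file

    bool ok = false;
    const std::string text = line + "\n";
    if (write(logFd, text.data(), text.size()) ==
        static_cast<ssize_t>(text.size())) {                         // write one line
        int cFd = open(counterPath.c_str(), O_RDWR | O_CREAT, 0644);
        if (cFd >= 0) {
            std::uint64_t count = 0;                                  // load counter
            if (pread(cFd, &count, sizeof count, 0) !=
                static_cast<ssize_t>(sizeof count)) {
                count = 0;  // counter file is new or incomplete
            }
            ++count;                                                  // add one
            ok = pwrite(cFd, &count, sizeof count, 0) ==
                 static_cast<ssize_t>(sizeof count);                  // save counter
            close(cFd);
        }
    }
    flock(logFd, LOCK_UN);                                            // unlock file
    close(logFd);
    return ok;
}

// Reads the current line count; returns 0 if the counter file is missing.
std::uint64_t readCount(const std::string& counterPath) {
    std::uint64_t count = 0;
    int cFd = open(counterPath.c_str(), O_RDONLY);
    if (cFd < 0) return 0;
    if (pread(cFd, &count, sizeof count, 0) !=
        static_cast<ssize_t>(sizeof count)) {
        count = 0;
    }
    close(cFd);
    return count;
}
```

A reader that needs the line count then just calls readCount, which costs one small read no matter how large the log has become.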
This is very close to what database systems do, so that a SELECT COUNT(*) FROM table on a table with millions of rows can return instantly on engines that maintain a row count. Databases also do that per index. So if you add a WHERE clause which matches a specific index, you also get the total instantly. Same principle as above.
Personal note: I see a huge amount of Internet software that works backward. A watchdog makes sense for various things in a software environment. However, in most cases, when something of importance happens, you should send a message at the time it happens, not use a backward concept of checking logs to detect that something bad just happened.
For example, you detect that a user tried to access a website and entered the wrong password 5 times in a row. You want to send an instant message to the admin to make sure there wasn't a 6th time which was successful, so that the hacker can now see all your users' data... If you use logs, the "instant message" is going to be late by seconds if not minutes.
Don't do processing backward.