Fastest and efficient way of parsing raw data from file - c++

I'm working on some project and I'm wondering which way is the most efficient to read a huge amount of data off a file(I'm speaking of file of 100 lines up to 3 billions lines approx., can be more thought). Once read, data will be stored in a structured data set (vector<entry> where "entry" defines a structured line).
A structured line of this file may look like :
string int int int string string
which also ends with the appropriate platform EOL and is TAB delimited
What I wish to accomplish is :
Read file into memory (string) or vector<char>
Read raw data from my buffer and format it into my data set.
I need to consider memory footprint and have a fast parsing rate.
I'm already avoiding usage of stringstream as they seems too slow.
I'm also avoiding multiple I/O call to my file by using :
// open the stream
std::ifstream is(filename);
// determine the file length
is.seekg(0, ios_base::end);
std::size_t size = is.tellg();
is.seekg(0, std::ios_base::beg);
// "out" can be a std::string or vector<char>
out.reserve(size / sizeof (char));
out.resize(size / sizeof (char), 0);
// load the data
is.read((char *) &out[0], size);
// close the file
is.close();
I've thought of taking this huge std::string and then looping line by line, I would extract line information (string and integer parts) into my data set row. Is there a better way of doing this?
EDIT : This application may run on a 32bit, 64bit computer, or on a super computer for bigger files.
Any suggestions are very welcome.
Thank you

Some random thoughts:
Use vector::resize() at the beginning (you did that)
Read large blocks of file data at a time, at least 4k, better still 256k. Read them into a memory buffer, parse that buffer into your vector.
Don't read the whole file at once, this might needlessly lead to swapping.
sizeof(char) is always 1 :)

while i cannot speak for supercomputers with 3 gig lines you will go nowhere in memory on a desktop machine.
i think you should first try to figure out all operations on that data. you should try to design all algorithms to operate sequentially. if you need random access you will do swapping all the time. this algorithm design will have a big impact on your data model.
so do not start with reading all data, just because that is an easy part, but design the whole system with a clear view an what data is in memory during the whole processing.
update
when you do all processing in a single run on the stream and separate the data processing in stages (read - preprocess - ... - write) you can utilize multithreading effectivly.
finally
whatever you want to do in a loop over the data, try to keep the number of loops a minimum. averaging for sure you can do in the read loop.
immediately make up a test file the size you expect to be the worst case in size and time two different approaches
.
time
loop
read line from disk
time
loop
process line (counting words per line)
time
loop
write data (word count) from line to disk
time
versus.
time
loop
read line from disk
process line (counting words per line)
write data (word count) from line to disk
time
if you have the algorithms already use yours. otherwise make up one (like counting words per line). if the write stage does not apply to your problem skip it. this test does take you less than an hour to write but can save you a lot.

Related

Faster reading and faster processing

I have a large file to read line by line and do some processing for each line. I came up with the simplest program in c/c++ like the following, but I am wondering if I can get some help in making it faster (using threading or fopenmp etc).
FILE *fp=fopen(argv[1], "r");
char line[500];
while(fgets(line, 500, fp) != NULL){
line[strlen(line)-1] = '\0';
/* do dome processing on each line */
for(int i=0; i<strlen(line)-k+1; i++){
/* do something for each k-length substring */
}
}
It takes huge amount of time as my file contains 500 million lines. I tried with a smaller file by first storing the lines and then doing processing the lines one by one and that was faster. Here I cannot store all 500 million lines as they will consume huge space.
I am new to programming, so any help to make it efficient will be appreciated.
This question is more suited for code-review. Anyways, some optimizations you can do are if you are on windows are.
Use CreateFile with the OVERLAPPED parameter for async IO.
ReadFile to read a chunk of the file into memory
Create multiple std::thread's at your ReadFile sub-routine, each at different parts of the file.
You could mmap the text file and let multiple worker threads process the data.

C++ / Fast random access skipping in a large file

I have large files, containing a small number of large datasets. Each dataset contains a name and the dataset size in bytes, allowing to skip it and go to the next dataset.
I want to build an index of dataset names very quickly. An example of file is about 21MB large, and contains 88 datasets. Reading the 88 names quickly by using a std::ifstream and seekg() to skip between datasets takes about 1300ms, which I would like to reduce.
So in fact, I'm reading 88 chunks of about 30 bytes, at given positions in a 21MB file, and it takes 1300ms.
Is there a way to improve this, or is it an OS and filesystem limitation? I'm running the test under Windows 7 64bit.
I know that having a complete index at the beginning of the file would be better, but the file format does not have this, and we can't change it.
You could use a memory mapped file interface (I recommend boost's implementation.)
This will open the file into the virtual page for your application for quicker lookup time, without going back to the disk.
You could scan the file and make your own header with the key and the index in a seperate file. Depending on your use case you can do it once at program start and everytime the file changes.
Before accessing the big data, a lookup in the smaller file gives you the needed index.
You may be able to do a buffer queuing process with multithreading. You could create a custom struct that would store various amounts of data.
You said:
Each dataset contains a name and the dataset size in bytes, allowing to skip it and go to the next dataset.
So as opening and closing the files over and over again is slow you could read the file all in one go and store it into a full buffer object and then parse it or store it into batches. This would also depend on if you are reading in text or binary mode on how easy it is to parse the file. I'll demonstrate the later with populating multiple batches while reading in a buffered size amount of data from file.
Pseudo Code
struct Batch {
std::string name; // Name of Dataset
unsigned size; // Size of Dataset
unsigned indexOffset; // Index to next read location
bool empty = true; // Flag to tell if this batch is full or empty
std::vector<DataType> dataset; // Container of Data
};
std::vector<Batch> finishedBatches;
// This doesn't matter on the size of the data set; this is just a buffer size on how much memory to digest in reading the file
const unsigned bufferSize = "Set to Your Preference" 1MB - 4MB etc.
void loadDataFromFile( const std::string& filename, unsigned bufferSize, std::vector<Batch>& batches ) {
// Set ifstream's buffer size
// OpenFile For Reading and read in and upto buffer size
// Spawn different thread to populate the Batches and while that batch is loading
// in data read in that much buffer data again. You will need to have a couple local
// stack batches to work with. So if a batch is complete and you reached the next index
// location from the file you can fill another batch.
// When a single batch is complete push it into the vector to store that batch.
// Change its flag and clear its vector and you can then use that empty batch again.
// Continue this until you reached end of file.
}
This would be a 2 threaded system here. Main thread for opening and reading from file and seeking from file with a worker thread filling the batches and pushing batches into container and swapping to use next batch.

C++ read text line-by-line, speed/efficiency savings needed

I have a series of large text files (10s - 100s of thousands of lines) that I want to parse line-by-line. The idea is to check if the line has a specific word/character/phrase and to, for now, record to a secondary file if it does.
The code I've used so far is:
ifstream infile1("c:/test/test.txt");
while (getline(infile1, line)) {
if (line.empty()) continue;
if (line.find("mystring") != std::string::npos) {
outfile1 << line << '\n';
}
}
The end goal is to be writing those lines to a database. My thinking was to write them to the file first and then to import the file.
The problem I'm facing is the time taken to complete the task. I'm looking to minimize the time as far as possible, so any suggestions as to time savings on the read/write scenario above would be most welcome. Apologies if anything is obvious, I've only just started moving into C++.
Thanks
EDIT
I should say that I'm using VS2015
EDIT 2
So this was my own dumb fault, when switching to Release and changing the architecture type I had noticeable speed increases. Thanks to everyone for pointing me in that direction. I'm also looking at the mmap stuff and that's proving useful too. Thanks guys!
When you use ifstream to read and process to/from really big files, you have to increase the default buffer size that is used (normally 512 bytes).
The best buffer size depends on your needs, but as a hint you can use the partition block size of the file(s) your reading/writing. To know that information you can use a lot of tools or even code.
Example in Windows:
fsutil fsinfo ntfsinfo c:
Now, you have to create a new buffer to ifstream like this:
size_t newBufferSize = 4 * 1024; // 4K
char * newBuffer = new char[newBufferSize];
ifstream infile1;
infile1.rdbuf()->pubsetbuf(newBuffer, newBufferSize);
infile1.open("c:/test/test.txt");
while (getline(infile1, line)) {
/* ... */
}
delete newBuffer;
Do the same with the output stream and don't forget set new buffer before open file or it may not work.
You can play with values to find the very best size for you.
You'll note the difference.
C-style I/O functions are much faster than fstream.
You may use fgets/fputs to read/write each text line.

Fast and efficient way of writing an array of structs to a text file

I have a binary file. I am reading a block of data from that file into an array of structs using fread method. My struct looks like below.
struct Num {
uint64_t key;
uint64_t val
};
My main goal is to write the array into a different text file with space separated key and value pairs in each line as shown below.
Key1 Val1
Key2 Val2
Key3 Val3
I have written a simple function to do this.
Num *buffer = new Num[buffer_size];
// Read a block of data from the binary file into the buffer array.
ofstream out_file(OUT_FILE, ios::out);
for(size_t i=0; i<buffer_size; i++)
out_file << buffer[i].key << ' ' << buffer[i].val << '\n';
The code works. But it's slow. One more approach would be to create the entire string first and write to file only once at the end.
But I want to know if there are any best ways to do this. I found some info about ostream_iterator. But I am not sure how it works.
The most efficient method to write structures to a file is to write as many as you can in the fewest transactions.
Usually this means using an array and writing entire array with one transaction.
The file is a stream device and is most efficient when data is continuously flowing in the stream. This can be as simple as writing the array in one call to more complicated using threads. You will save more time by performing block or burst I/O than worrying about which function call to use.
Also, in my own programs, I have observed that placing formatted text into a buffer (array) and then block writing the buffer is faster than using a function to write the formatted text to the file. There is a chance that the data stream may pause during the formatting. With writing formatted data from a buffer, the flow of data through the stream is continuous.
There are other factors involved in writing to a file, such as allocating space on the media, other tasks running on your system and any sharing of the file media.
By using the above techniques, I was able to write GBs of data in minutes instead of the previous duration of hours.

Fastest way to find the number of lines in a text (C++)

I need to read the number of lines in a file before doing some operations on that file. When I try to read the file and increment the line_count variable at each iteration until I reach EOF. It was not that fast in my case. I used both ifstream and fgets. They were both slow. Is there a hacky way to do this, which is also used by, for instance BSD, Linux kernel or berkeley db (may be by using bitwise operations).
The number of lines is in the millions in that file and it keeps getting larger, each line is about 40 or 50 characters. I'm using Linux.
Note:
I'm sure there will be people who might say use a DB idiot. But briefly in my case I can't use a db.
The only way to find the line count is to read the whole file and count the number of line-end characters. The fastest way to do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.
As your current file size appears to be about 60Mb, this is not an attractive option. You can get some of the speed by not reading the whole file, but reading it in chunks, say of size 1Mb. You also say that a database is out of the question, but it really does look to be the best long-term solution.
Edit: I just ran a small benchmark on this and using the buffered approach (buffer size 1024K) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code - my tests were done with g++ using -O2 optimisation level:
#include <iostream>
#include <fstream>
#include <vector>
#include <ctime>
using namespace std;
unsigned int FileRead( istream & is, vector <char> & buff ) {
is.read( &buff[0], buff.size() );
return is.gcount();
}
unsigned int CountLines( const vector <char> & buff, int sz ) {
int newlines = 0;
const char * p = &buff[0];
for ( int i = 0; i < sz; i++ ) {
if ( p[i] == '\n' ) {
newlines++;
}
}
return newlines;
}
int main( int argc, char * argv[] ) {
time_t now = time(0);
if ( argc == 1 ) {
cout << "lines\n";
ifstream ifs( "lines.dat" );
int n = 0;
string s;
while( getline( ifs, s ) ) {
n++;
}
cout << n << endl;
}
else {
cout << "buffer\n";
const int SZ = 1024 * 1024;
std::vector <char> buff( SZ );
ifstream ifs( "lines.dat" );
int n = 0;
while( int cc = FileRead( ifs, buff ) ) {
n += CountLines( buff, cc );
}
cout << n << endl;
}
cout << time(0) - now << endl;
}
Don't use C++ stl strings and getline ( or C's fgets), just C style raw pointers and either block read in page-size chunks or mmap the file.
Then scan the block at the native word size of your system ( ie either uint32_t or uint64_t) using one of the magic algorithms 'SIMD Within A Register (SWAR) Operations' for testing the bytes within the word. An example is here; the loop with the 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. ( that code gets to around 5 cycles per input byte matching a regex on each line of a file )
If the file is only a few tens or a hundred or so megabytes, and it keeps growing (ie something keeps writing to it), then there's a good likelihood that linux has it cached in memory, so it won't be disk IO limited, but memory bandwidth limited.
If the file is only ever being appended to, you could also remember the number of lines
and previous length, and start from there.
It has been pointed out that you could use mmap with C++ stl algorithms, and create a functor to pass to std::foreach. I suggested that you shouldn't do it not because you can't do it that way, but there is no gain in writing the extra code to do so. Or you can use boost's mmapped iterator, which handles it all for you; but for the problem the code I linked to was written for this was much, much slower, and the question was about speed not style.
You wrote that it keeps getting larger.
This sounds like it is a log file or something similar where new lines are appended but existing lines are not changed. If this is the case you could try an incremental approach:
Parse to the end of file.
Remember the line count and the offset of EOF.
When the file grows fseek to the offset, parse to EOF and update the line count and the offset.
There's a difference between counting lines and counting line separators. Some common gotchas to watch out for if getting an exact line count is important:
What's the file encoding? The byte-by-byte solutions will work for ASCII and UTF-8, but watch out if you have UTF-16 or some multibyte encoding that doesn't guarantee that a byte with the value of a line feed necessarily encodes a line feed.
Many text files don't have a line separator at the end of the last line. So if your file says "Hello, World!", you could end up with a count of 0 instead of 1. Rather than just counting the line separators, you'll need a simple state machine to keep track.
Some very obscure files use Unicode U+2028 LINE SEPARATOR (or even U+2029 PARAGRAPH SEPARATOR) as line separators instead of the more common carriage return and/or line feed. You might also want to watch out for U+0085 NEXT LINE (NEL).
You'll have to consider whether you want to count some other control characters as line breakers. For example, should a U+000C FORM FEED or U+000B LINE TABULATION (a.k.a. vertical tab) be considered going to a new line?
Text files from older versions of Mac OS (before OS X) use carriage returns (U+000D) rather than line feeds (U+000A) to separate lines. If you're reading the raw bytes into a buffer (e.g., with your stream in binary mode) and scanning them, you'll come up with a count of 0 on these files. You can't count both carriage returns and line feeds, because PC files generally end a line with both. Again, you'll need a simple state machine. (Alternatively, you can read the file in text mode rather than binary mode. The text interfaces will normalize line separators to '\n' for files that conform to the convention used on your platform. If you're reading files from other platforms, you'll be back to binary mode with a state machine.)
If you ever have a super long line in the file, the getline() approach can throw an exception causing your simple line counter to fail on a small number of files. (This is particularly true if you're reading an old Mac file on a non-Mac platform, causing getline() to see the entire file as one gigantic line.) By reading chunks into a fixed-size buffer and using a state machine, you can make it bullet proof.
The code in the accepted answer suffers from most of these traps. Make it right before you make it fast.
Remember that all fstreams are buffered. So they in-effect do actually reads in chunks so you do not have to recreate this functionality. So all you need to do is scan the buffer. Don't use getline() though as this will force you to size a string. So I would just use the STL std::count and stream iterators.
#include <iostream>
#include <fstream>
#include <iterator>
#include <algorithm>
struct TestEOL
{
bool operator()(char c)
{
last = c;
return last == '\n';
}
char last;
};
int main()
{
std::fstream file("Plop.txt");
TestEOL test;
std::size_t count = std::count_if(std::istreambuf_iterator<char>(file),
std::istreambuf_iterator<char>(),
test);
if (test.last != '\n') // If the last character checked is not '\n'
{ // then the last line in the file has not been
++count; // counted. So increement the count so we count
} // the last line even if it is not '\n' terminated.
}
It isn't slow because of your algorithm , It is slow because IO operations are slow. I suppose you are using a simple O(n) algorithm that is simply going over the file sequentially. In that case , there is no faster algorithm that can optimize your program.
However , I said there is no faster algorithm , but there is a faster mechanism which called "Memory Mapped file " , There are some drawback for mapped files and it might not be appropiate for you case , So you'll have to read about it and figure out by yourself.
Memory mapped files won't let you implement an algorithm better then O(n) but it may will reduce IO access time.
You can only get a definitive answer by scanning the entire file looking for newline characters. There's no way around that.
However, there are a couple of possibilities which you may want to consider.
1/ If you're using a simplistic loop, reading one character at a time checking for newlines, don't. Even though the I/O may be buffered, function calls themselves are expensive, time-wise.
A better option is to read large chunks of the file (say 5M) into memory with a single I/O operation, then process that. You probably don't need to worry too much about special assembly instruction since the C runtime library will be optimized anyway - a simple strchr() should do it.
2/ If you're saying that the general line length is about 40-50 characters and you don't need an exact line count, just grab the file size and divide by 45 (or whatever average you deem to use).
3/ If this is something like a log file and you don't have to keep it in one file (may require rework on other parts of the system), consider splitting the file periodically.
For example, when it gets to 5M, move it (e.g., x.log) to a dated file name (e.g., x_20090101_1022.log) and work out how many lines there are at that point (storing it in x_20090101_1022.count, then start a new x.log log file. Characteristics of log files mean that this dated section that was created will never change so you will never have to recalculate the number of lines.
To process the log "file", you'd just cat x_*.log through some process pipe rather than cat x.log. To get the line count of the "file", do a wc -l on the current x.log (relatively fast) and add it to the sum of all the values in the x_*.count files.
The thing that takes time is loading 40+ MB into memory. The fastest way to do that is to either memorymap it, or load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for \n characters is almost instantaneous, no matter how it is implemented.
So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.
Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.
Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.
Reading large amounts of data from the harddrive is slow. There's no way around that.
If your file only grows, then Ludwig Weinzierl is the best solution if you do not have control of the writers. Otherwise, you can make it even faster: increment the counter by one each time a line is written to the file. If multiple writers may try to write to the file simultaneously, then make sure to use a lock. Locking your existing file is enough. The counter can be 4 or 8 bytes written in binary in a file written under /run/<your-prog-name>/counter (which is RAM so dead fast).
Ludwig Algorithm
Initialize offset to 0
Read file from offset to EOF counting '\n' (as mentioned by others, make sure to use buffered I/O and count the '\n' inside that buffer)
Update offset with position at EOF
Save counter & offset to a file or in a variable if you only need it in your software
Repeat from "Read file ..." on a change
This is actually how various software processing log files function (i.e. fail2ban comes to mind).
The first time, it has to process a huge file. Afterward, it is very small and thus goes very fast.
Proactive Algorithm
When creating the files, reset counter to 0.
Then each time you receive a new line to add to the file:
Lock file
Write one line
Load counter
Add one to counter
Save counter
Unlock file
This is very close to what database systems do so a SELECT COUNT(*) FROM table on a table with millions of rows return instantly. Databases also do that per index. So if you add a WHERE clause which matches a specific index, you also get the total instantly. Same principle as above.
Personal note: I see a huge number of Internet software which are backward. A watchdog makes sense for various things in a software environment. However, in most cases, when something of importance happens, you should send a message at the time it happens. Not use a backward concept of checking logs to detect that something bad just happened.
For example, you detect that a user tried to access a website and entered the wrong password 5 times in a row. You want to send a instant message to the admin to make sure there wasn't a 6th time which was successful and the hacker can now see all your user's data... If you use logs, the "instant message" is going to be late by seconds if not minutes.
Don't do processing backward.