Most efficient way to extract certain lines from a text file - c++

I have a log file of variable length which may or may not contain the strings I'm looking for.
Lines have timestamps etc. followed by <parameter>#<value>. I want to check the parameter and extract the value.
The implementation below works but I'm sure there must be a more efficient way to parse the file.
Key points:
Most lines are going to be ignored
There are approx 1600 log files of between 1 - 20 Mb
Even a small gain per file will be an advantage
NB: each parse function calls substr and then converts the result to an int
Any ideas much appreciated
ifstream fileReader(logfile.c_str());
string lineIn;
if(fileReader.is_open())
{
    while(fileReader.good())
    {
        getline(fileReader, lineIn);
        if(lineIn.find("value1#") != string::npos)
        {
            parseValue1(lineIn);
        }
        else if(lineIn.find("value2#") != string::npos)
        {
            parseValue2(lineIn);
        }
        else if(lineIn.find("value3#") != string::npos)
        {
            parseValue3(lineIn);
        }
    }
}
fileReader.close();

First of all, your read loop is wrong. Your code should be:
while( getline( fileReader,lineIn ) ) {
}
Second, the lines:
if( fileReader.is_open() )
and
fileReader.close();
are redundant.
As for speed, I would recommend using a regular expression:
std::regex reg( "(value1#|value2#|value3#)(\\d+)" );
while( getline( fileReader, lineIn ) ) {
    std::smatch m;
    if( std::regex_search( lineIn, m, reg ) ) {
        std::cout << "found: " << m[2] << std::endl;
    }
}
Of course, you would need to modify the regular expression to match your actual format.
Unfortunately, iostreams are known to be pretty slow. If that does not give you enough performance, you may consider replacing fstream with FILE * or mmap.
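For reference, here is a minimal sketch of the FILE * variant, reading the whole file into one buffer in a single call (the helper name is mine, and error handling is kept to a minimum):
#include <cstdio>
#include <string>

// Read the entire file into one string with the C stdio API, so the
// searching can then run over an in-memory buffer instead of going
// through line-by-line iostream reads.
std::string readWholeFile( const char *path ) {
    std::string buffer;
    FILE *f = fopen( path, "rb" );
    if( !f ) return buffer;
    fseek( f, 0, SEEK_END );
    long size = ftell( f );
    fseek( f, 0, SEEK_SET );
    if( size <= 0 ) { fclose( f ); return buffer; }
    buffer.resize( size );
    size_t got = fread( &buffer[0], 1, size, f );
    buffer.resize( got ); // in case the read came up short
    fclose( f );
    return buffer;
}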

Looks like a lot of repeated searches in the same string, which will not be very efficient.
Parse the file/line in a proper way.
There are three libraries in Boost that might be of help.
Parse the line using a regular expression:
http://www.boost.org/doc/libs/1_53_0/libs/regex/doc/html/index.html
Use a tokenizer (a small sketch follows after this list):
http://www.boost.org/doc/libs/1_53_0/libs/tokenizer/index.html
For full customization you can always use Spirit.
http://www.boost.org/doc/libs/1_53_0/libs/spirit/doc/html/index.html
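As a taste of the tokenizer option, a minimal sketch (the sample line is made up, following the <parameter>#<value> format from the question):
#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>

int main() {
    std::string line = "2013-05-01 12:00:00 value1#42"; // made-up sample line
    // Split on spaces and on the '#' separating parameter from value.
    boost::char_separator<char> sep( " #" );
    boost::tokenizer<boost::char_separator<char> > tokens( line, sep );
    for( const std::string &t : tokens )
        std::cout << t << "\n"; // prints: 2013-05-01, 12:00:00, value1, 42
}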

The first step would be to figure out how much of the time is spent in the if(lineIn.find(...))... checks and how much is the actual reading of the input file.
Measure how long your application runs (you may want to take a selection of log files, rather than ALL of them), and run it a few times in a row to check that you get approximately the same value.
Then add:
#if 0
if (lineIn.find(...) ...)
    ...
#endif
and compare the time it takes. My guess is that it won't actually make that much of a difference. However, if the searching is a major component of the time, you may find that it's beneficial to use a more clever search method. There are some pretty clever methods (Boyer-Moore, for instance) for searching for strings in a larger string.
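For the timing itself, std::chrono gives an easy wall-clock measurement. A minimal sketch (the body comment marks where your parsing loop would go):
#include <chrono>
#include <iostream>

int main() {
    auto start = std::chrono::steady_clock::now();

    // ... run the log-parsing loop over your selection of files here ...

    auto stop = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>( stop - start );
    std::cout << "elapsed: " << ms.count() << " ms\n";
}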
I will post back with a couple of benchmarks of "read a file quicker" that I've posted elsewhere. But bear in mind that the hard disk you are reading from will account for the major part of the time.
References:
getline while reading a file vs reading whole file and then splitting based on newline character
slightly less relevant, but perhaps interesting:
What is the best efficient way to read millions of integers separated by lines from text file in c++

Your execution bottleneck will be in file I/O.
I suggest that you haul in as much data as possible in one fetch into a buffer. Next, search the buffer for your tokens.
You have to read in the text in order to search it, so you might as well read in as much of the file as you can.
There may be some drawbacks in reading too much data into memory. If the OS can't fit all the data, it may page it out to a hard drive, which makes the technique worthless (unless you want the OS to handle reading the file in chunks).
Once the file is in memory, the choice of searching technique will make only a negligible difference.
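A sketch of that read-then-search approach (the file name is a placeholder):
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    // Pull the entire file into one string in a single pass...
    std::ifstream in( "log.txt", std::ios::binary );
    std::string buffer( ( std::istreambuf_iterator<char>( in ) ),
                        std::istreambuf_iterator<char>() );

    // ...then search the in-memory buffer for each token.
    for( std::size_t pos = buffer.find( "value1#" );
         pos != std::string::npos;
         pos = buffer.find( "value1#", pos + 1 ) ) {
        std::cout << "hit at offset " << pos << "\n";
    }
}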

Related

C++ count functional words occurrence

I'm trying to count occurrences of specific words from a text file. The problem is that my code reads the file with whitespace delimiters, but some of the words I want to count are multi-word phrases, for example "out from".
In addition to this, there is a second problem with words like "aren't" and "don't": my code seems to ignore these words even when I put them in the map with a backslash. My guess is that they get mangled in the process of being read from the file for some reason.
The end outcome that I am looking for is the frequency of the words that I am searching for.
std::list<std::string> Fwords = {
    "a", "abroad", "as far as", "ahead of" };
std::map<std::string, int> wordsCount; // counts per word/phrase (declaration was missing)

// Begin reading from file:
std::ifstream fileStream(fileName);
// Check if we've opened the file (as we should have).
if (fileStream.is_open())
    while (fileStream.good())
    {
        // Store the next word in the file in a local variable.
        std::string word;
        fileStream >> word;
        std::cout << "This is the word: " << word << std::endl;
        if (std::find(std::begin(Fwords), std::end(Fwords), word) != std::end(Fwords))
            wordsCount[word]++;
    }
input:
"ahead of me as far as abroad me"
this would be the expected output:
abroad:1
ahead of:1
as far as:1
This approach won't work. Your problem is that you're reading one word at a time from the file. No amount of backslashing or manipulating the list / map of words will fix that.
But how are you supposed to know how many words to read? You don't—it'll have to be trial and error.
One way to "brute force" this, considering your level of programming, would be to add an else case to
if (std::find(std::begin(Fwords), std::end(Fwords), word) != std::end(Fwords))
{
// ...
}
in which you check for phrases in the list that begin with the word from the file, e.g. "as", but with a trailing space, so the search is for "as ". If one or more matches are found, it's time to read another word from the file, giving e.g. "as far". This should be put in a loop (or a function called in a loop) so that searching for "as far " and reading another word happen automatically. Upon successfully finding "as far as", you're done. You're also done upon failure to find "as ", "as far ", or "as far as", i.e. if you don't have these in your list; in that case, you want to run a loop through each word of the candidate, check whether it is a phrase by itself, and increase its count if so. In that endeavor, you'll realize that you need the same code as your original lookup, so it would be smart to factor it out into a function as well. A rough sketch follows below.
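A rough sketch of that brute-force extension, reusing the names from the question (the helper isPhrasePrefix is made up for illustration, and the fall-back counting of the individual words is left out):
// Returns true if some phrase in the list starts with "prefix" plus a
// space, i.e. reading one more word from the file could still give a match.
bool isPhrasePrefix( const std::list<std::string> &phrases, const std::string &prefix )
{
    std::string withSpace = prefix + " ";
    for( const std::string &p : phrases )
        if( p.compare( 0, withSpace.size(), withSpace ) == 0 )
            return true;
    return false;
}

// In the else case, inside the reading loop: keep extending the candidate
// while it is still a valid phrase prefix, e.g. "as" -> "as far" -> "as far as".
std::string candidate = word;
while( isPhrasePrefix( Fwords, candidate ) && fileStream >> word )
{
    candidate += " " + word;
    if( std::find( std::begin( Fwords ), std::end( Fwords ), candidate ) != std::end( Fwords ) )
    {
        wordsCount[candidate]++;
        break;
    }
}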

Reducing time complexity of string comparison

I have a dictionary .txt file with probably over a thousand words and their definitions. I've already written a program to take the first word of each line from this file and check it against a string input by the user:
std::string checkWord(string input) // returns a status message, so it can't be void
{
    std::ifstream inFile;
    inFile.open("Oxford.txt");
    if (inFile.is_open())
    {
        string line; // there is a "using std::string" in another file
        while (getline(inFile, line))
        {
            // read the first word from each line
            std::istringstream iss(line);
            string word;
            iss >> word;
            // make sure the strings being compared are the same case
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);
            std::transform(input.begin(), input.end(), input.begin(), ::tolower);
            if (word == input)
            {
                // Do a thing with word
            }
        }
        inFile.close();
        return "End of file";
    }
    else
    {
        return "Unable to open file";
    }
}
But if I'm checking more than a sentence, the time it takes to process becomes noticeable. I've thought about a few ways of making this time shorter:
Making a .txt file for each letter of the alphabet (Pretty easy to do, but not really a fix in the long-term)
Using unordered_set to compare the strings (like in this question); the only problem with this might be the initial creation of the set from the text file
Using some other data structure to compare strings? (Like std::map)
Given that the data is already "sorted", what kind of data structure or method should I employ in order to (if possible) reduce time complexity? Also, are there any issues with the function I am using to compare strings? (for example, would string::compare() be quicker than "=="?)
A tree (std::map)? Or a hashmap (std::unordered_map)? Your linear search is obviously a brute force solution! Both of the above will be substantially superior for multiple searches.
Of course, that only really helps if you are going to use this data multiple times per program run, which you didn't specify in your question. If not, there's not really much benefit in loading and parsing and storing all the data only to perform a single lookup then quit. Just put a break in on success, at least.
You imply that your input file is sorted. You could hack together a binary search solution with file seeking (which is really cheap) and snapping to the nearest newline on each iteration to determine roughly where all the words with the same leading (say) three characters are in your file. For a thousand entries, though, this is probably overkill.
So, there are "simple" fixes, and there are some more complex ones.
The first step is to move all unnecessary things out of the search-loop: Lowercase input once, before the loop, rather than every time - after all, it's not changing. If possible, lowercase the Oxford.txt too, so you don't have to lowercase word for every line.
If you are searching the file multiple times, reading a file multiple times is definitely not a great solution - even if it's cached in the filesystem the second time.
So read it once into some container; a really simple one would be std::vector [and lower-case the strings at the same time] and just iterate over it. The next improvement would be to sort the vector and use a binary search (std::lower_bound can do the searching for you, or you can write the binary search yourself - it's not terribly hard)
A slightly more complex solution [but faster to search] would be to use std::map<std::string, std::string> wordlist (but that also takes a bit more space), then use auto pos = wordlist.find(input); if (pos != wordlist.end()) { /* found word */ }.
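A sketch of that map variant, assuming you want each whole line as the definition, keyed by the lowercased first word (file name taken from the question):
#include <algorithm>
#include <cctype>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

// Build the word list once; every later lookup is O(log n).
std::map<std::string, std::string> loadWordlist()
{
    std::map<std::string, std::string> wordlist;
    std::ifstream inFile("Oxford.txt");
    std::string line;
    while (std::getline(inFile, line))
    {
        std::istringstream iss(line);
        std::string word;
        if (iss >> word)
        {
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);
            wordlist[word] = line; // key: headword, value: the whole entry
        }
    }
    return wordlist;
}
checkWord then reduces to lowercasing input once and doing a single wordlist.find(input).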
You can benefit from using a prefix tree, also known as a trie data structure, as it fits the use case of having a dictionary and frequently looking up words in it.
The simplest model of a trie is a tree where each node holds a letter and a flag to tell whether the current letter is the end of a word (and, additionally, pointers to other data about the word).
Example of a trie containing the dictionary aback, abate, bid, bird, birth, black, blast: [image omitted]
To search for a word, start from the root, and for each letter of your word, follow the node containing the current letter (halt if it isn't present as a child of the current node). The search time is proportional to the length of the word being looked up, instead of to the size of your dictionary.
A trie also allows you to easily get the alphabetic (lexicographical) order of words in a dictionary: just do a pre-order traversal of it.
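A minimal sketch of such a trie, assuming lowercase a-z input only:
#include <memory>
#include <string>

// Each node holds 26 child pointers and an end-of-word flag.
struct TrieNode
{
    std::unique_ptr<TrieNode> children[26];
    bool isWord = false;
};

void insert( TrieNode &root, const std::string &word )
{
    TrieNode *node = &root;
    for( char c : word )
    {
        int i = c - 'a';
        if( !node->children[i] )
            node->children[i].reset( new TrieNode );
        node = node->children[i].get();
    }
    node->isWord = true;
}

bool contains( const TrieNode &root, const std::string &word )
{
    const TrieNode *node = &root;
    for( char c : word )
    {
        int i = c - 'a';
        if( !node->children[i] )
            return false; // halt: letter not present as a child
        node = node->children[i].get();
    }
    return node->isWord;
}
Insert each dictionary headword once at startup; contains(root, input) then replaces the per-query file scan.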
Instead of storing everything in a .txt file, store it in a real database.
SQLite3 is a good choice for simple tasks, since it is in-process instead of requiring an external server.
For a task this simple, the C API and SQL statements should be very easy to learn.
Something like:
-- Only do this once, for setup, not each time you run your program.
sqlite> CREATE TABLE dictionary (word TEXT PRIMARY KEY);
sqlite> .import /usr/share/dict/words dictionary
-- Do this every time you run your program.
sqlite> select count(*) from dictionary where word = 'a';
1
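If you then want the lookup from inside the C++ program, a sketch against the SQLite C API (the database file name is an assumption; error checking omitted):
#include <sqlite3.h>
#include <iostream>

int main()
{
    sqlite3 *db = nullptr;
    sqlite3_open( "dictionary.db", &db ); // assumed file name

    // Prepared statement with a bound parameter, so the looked-up word is
    // never spliced into the SQL text.
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2( db, "SELECT COUNT(*) FROM dictionary WHERE word = ?;",
                        -1, &stmt, nullptr );
    sqlite3_bind_text( stmt, 1, "a", -1, SQLITE_TRANSIENT );

    if( sqlite3_step( stmt ) == SQLITE_ROW )
        std::cout << "matches: " << sqlite3_column_int( stmt, 0 ) << "\n";

    sqlite3_finalize( stmt );
    sqlite3_close( db );
}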

Retrieving file from .dat via getline() w/ c++

I posted this over at Code Review Beta but noticed that there is much less activity there.
I have the following code and it works just fine. Its function is to grab the input from a file and display it out (to confirm that it's been grabbed). My task is to write a program that counts how many times a certain word (string) "abc" is found in the input file.
Is it better to store the input as a string or in arrays/vectors and have each line be stored separately? a[1], a[2], etc.? Perhaps someone could also point me to a resource that I can use to learn how to filter through the input data.
Thanks.
input_file.open("in.dat");
while(!input_file.eof()) // Inputs all the lines until the end of file (eof).
{
    getline(input_file, STRING); // Saves the input_file in STRING.
    cout << STRING; // Prints our STRING.
}
input_file.close();
Reading as much of the file into memory as possible is always more efficient than reading one letter or text line at a time. Disk drives take a lot of time to spin up and relocate to a sector, so your program will run faster if you can minimize the number of reads from the file.
Memory is fast to search.
My recommendation is to read the entire file, or as much as you can, into memory, then search the memory for a "word". Remember that in English, words can have hyphens, '-', and single quotes, as in "don't". Word recognition may become more difficult if a word is split across a line or if you include abbreviations (with periods).
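As a starting point, a minimal sketch of that read-then-search approach for your task (note it counts raw substring hits of "abc", so the word-boundary concerns above are still up to you):
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    // Read the whole file into memory once...
    std::ifstream in("in.dat");
    std::string text( ( std::istreambuf_iterator<char>( in ) ),
                      std::istreambuf_iterator<char>() );

    // ...then count occurrences of the target string.
    int count = 0;
    for( std::size_t pos = text.find( "abc" );
         pos != std::string::npos;
         pos = text.find( "abc", pos + 1 ) )
        ++count;

    std::cout << "\"abc\" occurs " << count << " times\n";
}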
Good luck.

C++ reading random lines of txt?

I am running C++ code where I need to import data from txt file.
The text file contains 10,000 lines. Each line contains n columns of binary data.
The code has to loop 100,000 times, each time it has to randomly select a line out of the txt file and assign the binary values in the columns to some variables.
What is the most efficient way to write this code? Should I load the file into memory first, or should I seek to a random line number on each iteration?
How can I implement this in C++?
To randomly access a line in a text file, all lines need to have the same byte length. If you don't have that, you need to loop until you get to the correct line. Since this will be pretty slow for so many accesses, better to just load the file into a std::vector of std::strings, each entry being one line (this is easily done with std::getline). Or, since you want to assign values from the different columns, you can use a std::vector of your own struct, like
struct MyValues {
    double d;
    int i;
    // whatever you have / need
};

std::vector<MyValues> vec;
which might be better than parsing the line every time.
With the std::vector, you get your random access and only have to loop once through the whole file.
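A sketch of that load-once approach (the file name is a placeholder, and the per-line parsing into MyValues is left as a comment):
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    // One pass over the file: store every line.
    std::ifstream in("data.txt");
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    if (lines.empty())
        return 1;

    // 100,000 random accesses, with no further file I/O.
    std::srand(std::time(NULL));
    for (int i = 0; i < 100000; ++i)
    {
        const std::string &picked = lines[std::rand() % lines.size()];
        // parse the columns of `picked` into your struct here...
    }
}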
10K lines is a pretty small file.
If you have, say, 100 chars per line, it will use the HUGE amount of 1MB of your RAM.
Load it to a vector and access it the way you want.
maybe not THE most efficient, but you could try this:
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <string>
using namespace std;

int main() {
    // use ifstream to read
    ifstream in("yourfile.txt");
    // find the file size, so we can seek to a random byte offset within it
    in.seekg(0, ios::end);
    long size = in.tellg();
    // string to store the line
    string line = "";
    // random number generator
    srand(time(NULL));
    for(int i = 0; i < 100000; i++) {
        in.clear(); // clear any eof/fail flags from the previous pass
        in.seekg(rand() % size); // jump to a random byte offset
        getline(in, line); // skip the (probably partial) current line...
        getline(in, line); // ...and read the next full one
        // do what you want with the line here...
    }
}
I'm too lazy right now, but you need to make sure that you check your ifstream for errors like end-of-file, index-out-of-bounds, etc...
Since you're taking 100,000 samples from just 10,000 lines, the majority of lines will be sampled. Read the entire file into an array data structure, and then randomly sample the array. This avoids file seeking entirely.
The more common case is to sample only a small subset of the file's data. To do that, assuming the lines are of different lengths, seek to random points in the file, skip to the next newline (for example cin.ignore( numeric_limits< streamsize >::max(), '\n' )), and then parse the subsequent text.

Search HTML lines and remove lines that don't start with </form></td><td><a

I have an HTML file with very bad formatted code that I get from a website, I want to extract some very small pieces of information.
I am only interested in lines that start like this:
</form></td><td> <b>user897</b></td></tr><tr><td>HouseA</td><td>2</td><td class="entriesTableRow-gamename">HouseA Type12 <span class="entriesTableRow-moredetails"></span></td><td>1 of 2</td><td>user123</td><td>10</td><td>
and I want to extract 4 fields:
A:HouseA
B:HouseA Type12
C:user123
D:10
I know I've seen people recommend HTML Agility Pack and libxml2, but I really don't think I need all that. My app is in C/C++.
I am already using getline to start reading lines, I am just not sure what's the best way to proceed. Thanks!
std::ifstream data("Home.html");
std::string line;
int linenum = 0; // declaration was missing
while(std::getline(data, line))
{
    linenum++;
    std::stringstream lineStream(line);
    std::string user;
    if (strncmp(line.c_str(), "</form></td><td>", strlen("</form></td><td>")) == 0)
    {
        printf("found a wanted line in line:%d\n", linenum);
    }
}
In the general case, an XML/HTML parser is likely the best way here, as it will be robust against differing input. (Whatever you do, don't use regexps!)
Update
However, if you're targeting specific input, as it seems that you're doing, you can use sscanf (as you suggest), or cin.read(), or a regexp to scan manually.
Just beware that this code can break at any moment that the HTML changes (even just with whitespace).
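If you do go the manual route, here is a sketch of an ad-hoc cell splitter (fragile by construction, as just said; the index choices noted at the bottom match the sample line above and would need pinning down against your real data):
#include <string>
#include <vector>

// Collect the text between each "<td...>" and its matching "</td>".
std::vector<std::string> splitCells( const std::string &line )
{
    std::vector<std::string> cells;
    std::size_t pos = 0;
    while( ( pos = line.find( "<td", pos ) ) != std::string::npos )
    {
        std::size_t open  = line.find( '>', pos );
        std::size_t close = line.find( "</td>", open );
        if( open == std::string::npos || close == std::string::npos )
            break;
        cells.push_back( line.substr( open + 1, close - open - 1 ) );
        pos = close;
    }
    return cells;
}

// With the sample line, cells[1] is "HouseA", cells[3] holds the game name
// (still wrapped in its <span> markup, which you'd need to strip),
// cells[5] is "user123" and cells[6] is "10".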
Therefore, my/our recommendation is to use a proper tool for the job. XML/HTML is not raw text, and should not be treated as such.
How about writing a python script instead? :)