Reducing time complexity of string comparison - C++

I have a dictionary .txt file with probably over a thousand words and their definitions. I've already written a program to take the first word of each line from this file and check it against a string input by the user:
string checkWord(string input)
{
    std::ifstream inFile;
    inFile.open("Oxford.txt");
    if (inFile.is_open())
    {
        string line; //there is a "using std::string" in another file
        while (getline(inFile, line))
        {
            //read the first word from each line
            std::istringstream iss(line);
            string word;
            iss >> word;
            //make sure the strings being compared are the same case
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);
            std::transform(input.begin(), input.end(), input.begin(), ::tolower);
            if (word == input)
            {
                //Do a thing with word
            }
        }
        inFile.close();
        return "End of file";
    }
    else
    {
        return "Unable to open file";
    }
}
But if I'm checking more than a sentence, the time it takes to process becomes noticeable. I've thought about a few ways of making this time shorter:
Making a .txt file for each letter of the alphabet (pretty easy to do, but not really a long-term fix)
Using an unordered_set to compare the strings (like in this question); the only problem with this might be the initial creation of the set from the text file
Using some other data structure to compare strings? (Like std::map)
Given that the data is already "sorted", what kind of data structure or method should I employ in order to (if possible) reduce time complexity? Also, are there any issues with the function I am using to compare strings? (for example, would string::compare() be quicker than "=="?)

A tree (std::map)? Or a hashmap (std::unordered_map)? Your linear search is obviously a brute force solution! Both of the above will be substantially superior for multiple searches.
Of course, that only really helps if you are going to use this data multiple times per program run, which you didn't specify in your question. If not, there's not much benefit in loading, parsing, and storing all the data only to perform a single lookup and then quit. Just put a break in on success, at least.
You imply that your input file is sorted. You could hack together a binary search solution with file seeking (which is really cheap) and snapping to the nearest newline on each iteration to determine roughly where all the words with the same leading (say) three characters are in your file. For a thousand entries, though, this is probably overkill.

So, there are "simple" fixes, and there are some more complex ones.
The first step is to move all unnecessary work out of the search loop: lowercase input once, before the loop, rather than on every iteration - after all, it's not changing. If possible, lowercase the Oxford.txt too, so you don't have to lowercase word for every line.
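A minimal sketch of that hoisting, reusing the question's variable names (everything else stays as in the original function):
std::transform(input.begin(), input.end(), input.begin(), ::tolower); // once, before the loop
while (getline(inFile, line))
{
    std::istringstream iss(line);
    string word;
    iss >> word;
    std::transform(word.begin(), word.end(), word.begin(), ::tolower); // unavoidable unless the file is pre-lowercased
    if (word == input)
    {
        //Do a thing with word
    }
}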
If you are searching the file multiple times, reading a file multiple times is definitely not a great solution - even if it's cached in the filesystem the second time.
So read it once into some container; a really simple one would be a std::vector [lower-casing the strings at the same time], then just iterate over it. The next improvement would be to sort the vector and use a binary search (you don't even have to write the binary search yourself: std::binary_search and std::lower_bound in <algorithm> already do it).
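For illustration, a sketch of that approach; it assumes the Oxford.txt layout from the question (one entry per line, word first):
#include <algorithm>
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Read the first word of each line, lowercased, then sort once up front.
std::vector<std::string> loadWords(const std::string& path)
{
    std::vector<std::string> words;
    std::ifstream inFile(path);
    std::string line;
    while (std::getline(inFile, line))
    {
        std::istringstream iss(line);
        std::string word;
        if (iss >> word)
        {
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);
            words.push_back(word);
        }
    }
    std::sort(words.begin(), words.end());
    return words;
}

// Each lookup is then O(log n).
bool contains(const std::vector<std::string>& words, const std::string& input)
{
    return std::binary_search(words.begin(), words.end(), input);
}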
A slightly more complex solution [but faster to search] would be to use a std::map<std::string, std::string> wordlist (though that also takes a bit more space), then use auto pos = wordlist.find(input); if (pos != wordlist.end()) ... found word ....
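Fleshed out, a rough sketch of that might look like this (it assumes everything after the first word on a line is the definition):
#include <fstream>
#include <map>
#include <sstream>
#include <string>

std::map<std::string, std::string> loadWordlist(const std::string& path)
{
    std::map<std::string, std::string> wordlist;
    std::ifstream inFile(path);
    std::string line;
    while (std::getline(inFile, line))
    {
        std::istringstream iss(line);
        std::string word;
        if (iss >> word)
        {
            std::string definition;
            std::getline(iss, definition); // the rest of the line
            wordlist[word] = definition;
        }
    }
    return wordlist;
}

// auto pos = wordlist.find(input);
// if (pos != wordlist.end()) { /* found: pos->second is the definition */ }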

You can benefit from using a prefix tree, also known as a trie data structure, as it fits the use case of having a dictionary and frequently looking up words in it.
The simplest model of a trie is a tree where each node holds a letter and a flag to tell whether the current letter is the end of a word (and, additionally, pointers to other data about the word).
(Example picture: a trie containing the dictionary words aback, abate, bid, bird, birth, black, blast.)
To search for a word, start from the root, and for each letter of your word, follow the node containing the current letter (halt if it isn't present as a child of the current node). The search time is proportional to the length of the word you look up, instead of to the size of your dictionary.
A trie also allows you to easily get the alphabetic (lexicographical) order of words in a dictionary: just do a pre-order traversal of it.
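A minimal sketch of such a trie, under the (unstated) assumption that words contain only lowercase a-z:
#include <memory>
#include <string>

struct TrieNode
{
    std::unique_ptr<TrieNode> child[26];
    bool isWord = false; // flag: some dictionary word ends at this node
};

void insert(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char c : word)
    {
        auto& next = node->child[c - 'a'];
        if (!next)
            next = std::make_unique<TrieNode>();
        node = next.get();
    }
    node->isWord = true;
}

// Search cost is proportional to the word's length, not the dictionary's size.
bool search(const TrieNode& root, const std::string& word)
{
    const TrieNode* node = &root;
    for (char c : word)
    {
        node = node->child[c - 'a'].get();
        if (!node)
            return false; // halt: letter not present as a child
    }
    return node->isWord;
}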

Instead of storing everything in a .txt file, store it in a real database.
SQLite3 is a good choice for simple tasks, since it is in-process instead of requiring an external server.
For a task this simple, the C API and SQL statements should be very easy to learn.
Something like:
-- Only do this once, for setup, not each time you run your program.
sqlite> CREATE TABLE dictionary (word TEXT PRIMARY KEY);
sqlite> .import /usr/share/dict/words dictionary
-- Do this every time you run your program.
sqlite> select count(*) from dictionary where word = 'a';
1
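For completeness, a rough sketch of the same lookup through the SQLite C API (error handling omitted; the table and column names match the SQL above):
#include <sqlite3.h>
#include <string>

bool wordExists(sqlite3* db, const std::string& word)
{
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT COUNT(*) FROM dictionary WHERE word = ?;",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, word.c_str(), -1, SQLITE_TRANSIENT);
    bool found = false;
    if (sqlite3_step(stmt) == SQLITE_ROW)
        found = sqlite3_column_int(stmt, 0) > 0;
    sqlite3_finalize(stmt);
    return found;
}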

Related

C++ count functional words occurrence

I'm trying to count occurrences of specific words from a text file. The problem is that my code reads the file with whitespace delimiters, but some of the words I want to count are multi-word phrases, for example "out from".
In addition to this there is a second problem: words like "aren't" and "don't" seem to be ignored by my code even when I put them with a backslash in the map; my guess is that they get mangled in the process of reading them from the file for some reason.
The end outcome I am looking for is the frequency of the words I am searching for.
std::list<std::string> Fwords = {
    "a", "abroad", "as far as", "ahead of" };
std::map<std::string, int> wordsCount; // counts per matched word
// Begin reading from file:
std::ifstream fileStream(fileName);
// Check if we've opened the file (as we should have).
if (fileStream.is_open())
{
    // Store the next word in the file in a local variable.
    std::string word;
    while (fileStream >> word) // loop on the read itself, not on .good()
    {
        std::cout << "This is the word: " << word << std::endl;
        if (std::find(std::begin(Fwords), std::end(Fwords), word) != std::end(Fwords))
            wordsCount[word]++;
    }
}
input:
"ahead of me as far as abroad me"
this would be the expected output:
abroad:1
ahead of:1
as far as:1
This approach won't work. Your problem is that you're reading one word at a time from the file. No amount of backslashing or manipulating the list / map of words will fix that.
But how are you supposed to know how many words to read? You don't; it will have to be trial and error.
One way to "brute force" this, considering your level of programming, would be to add an else case to
if (std::find(std::begin(Fwords), std::end(Fwords), word) != std::end(Fwords))
{
// ...
}
in which you check for entries in the list that begin with the word from the file plus a space, e.g. "as " (note the trailing space) for the file word "as". If one or more matches are found, it's time to read another word from the file and append it, e.g. "as far", and search again. This should be put in a loop (or a function called in a loop) so that searching for "as far " and reading the next word happen automatically. Upon successfully finding "as far as", you're done. You're also done upon failing to find "as ", "as far ", or "as far as", i.e. if you don't have these in your list; in that case, you want to run a for loop over each word read so far to check whether it is an entry by itself, and increase its count if so. In this endeavor, you'll realize that you need the same code as your original code, so it would be smart to factor it out into a function as well.
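A rough sketch of that idea (the helper name isPrefixOfPhrase is hypothetical; Fwords, wordsCount, word, and fileStream are assumed to be as in the question):
#include <algorithm>
#include <list>
#include <string>

// Returns true if some entry in Fwords begins with prefix followed by a space.
bool isPrefixOfPhrase(const std::list<std::string>& Fwords, const std::string& prefix)
{
    const std::string withSpace = prefix + " ";
    for (const auto& f : Fwords)
        if (f.compare(0, withSpace.size(), withSpace) == 0)
            return true;
    return false;
}

// Inside the reading loop: grow a candidate phrase word by word.
std::string candidate = word;
while (isPrefixOfPhrase(Fwords, candidate))
{
    std::string next;
    if (!(fileStream >> next))
        break;
    candidate += " " + next;
}
if (std::find(Fwords.begin(), Fwords.end(), candidate) != Fwords.end())
    wordsCount[candidate]++;
// On failure, the extra words read here would still need to be checked
// individually, as described above.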

Most effective way to create a naive text summarizing algorithm

I'm building a simple naive text summary algorithm. The algorithm works like this:
The first step of my algorithm is to remove all stop words (English stop words).
After my text contains only words with actual meaning, I'm going to see how many times each word is used in the text to find its frequency. For example, if the word "supercomputer" is used 5 times, it will have frequency = 5.
Then I'm going to calculate each sentence's weight by dividing the sum of the frequencies of all words in the sentence by the number of words in the sentence.
On the last step I'm going to sort the sentences by their length.
I need to write this algorithm in C++ (as a V8 NodeJS module), but the problem is that in the past few years I've been working mostly with high-level scripting languages like Javascript, and I'm not that experienced in C++. In Javascript I could just use a regex to remove all stop words and then find the frequencies, but in C++ it seems to be much more complex.
I came up with the following idea:
struct words {
    string word;
    int freq;
};
std::vector<words> Words;
The stop words are going to be preloaded in a V8 Local Array or std::vector.
For each word in the text I'm going to loop through all the stop words; if the current word is not a stop word, then check whether it's already in the Words vector: if not, add a new entry to the vector, and if it exists, increase its freq by 1.
After I have found all the frequencies of all words, I'm going to loop through the text again to find the weight of each sentence.
And with this idea few problems came to my mind:
My texts will mostly be 1000+ words. And for each word, looping through 100+ stop words is going to take 100,000 iterations just to filter out the stop words. This seems really inefficient.
After I have the frequencies, I will need to loop through the 1000+ words of the text one more time, against the 300+ words in the frequency vector, to calculate each sentence's weight.
My idea seems inefficient, but I'm not very familiar with C++.
So my question is: are there better ways to do this, or ways to optimize my algorithm, especially regarding the problems I listed above?
I'm worried about the performance of my algorithm and any tips/suggestions will be greatly appreciated.
For the stopwords, have a look at std::unordered_set. You can store all of your stopword strings in a std::unordered_set<string>, then when you have a string you want to compare, call count(string) to see if it exists.
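For instance (the stopword entries here are just illustrative):
#include <string>
#include <unordered_set>

std::unordered_set<std::string> stopwords = { "the", "a", "of", "and", "to" };

bool isStopword(const std::string& w)
{
    return stopwords.count(w) != 0; // average O(1), instead of scanning a list
}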
For the word/frequency pairs, use a std::unordered_map as in some of the comments. It would be fastest if you perform both the find and insert in a single map lookup. Try something like this:
struct Frequency
{
    int val;
    Frequency() : val(0) {}
    void increment()
    {
        ++val;
    }
};

std::unordered_map<std::string, Frequency> words;

void processWord(const std::string& str)
{
    words[str].increment();
}
words[str] searches for a word in the map, adding it if it doesn't exist. New words will call Frequency's constructor, which initializes val to zero. So all you have to do is call processWord on every word.

Most efficient way to extract certain lines from a text file

I have a log file of variable length which may or may not contain the strings I'm looking for.
Lines have timestamps etc. followed by <parameter>#<value>. I want to check the parameter and extract the value.
The implementation below works but I'm sure there must be a more efficient way to parse the file.
Key points:
Most lines are going to be ignored
There are approx 1600 log files of between 1 - 20 Mb
Even a small gain per file will be an advantage
NB. the parse function calls substring then converts that to an int
Any ideas much appreciated
ifstream fileReader(logfile.c_str());
string lineIn;
if (fileReader.is_open())
{
    while (fileReader.good())
    {
        getline(fileReader, lineIn);
        if (lineIn.find("value1#") != string::npos)
        {
            parseValue1(lineIn);
        }
        else if (lineIn.find("value2#") != string::npos)
        {
            parseValue2(lineIn);
        }
        else if (lineIn.find("value3#") != string::npos)
        {
            parseValue3(lineIn);
        }
    }
}
fileReader.close();
First of all, you are doing the loop wrong. Your code should be:
while( getline( fileReader,lineIn ) ) {
}
Second, lines:
if( fileReader.is_open() )
and
fileReader.close();
are redundant.
As for speed, I would recommend using a regular expression:
std::regex reg( "(value1#|value2#|value3#)(\\d+)" );
while( getline( fileReader, lineIn ) ) {
    std::smatch m;
    if( std::regex_search( lineIn, m, reg ) ) {
        std::cout << "found: " << m[2] << std::endl;
    }
}
Of course, you would need to modify the regular expression accordingly.
Unfortunately, iostreams are known to be pretty slow. If you don't get enough performance you may consider replacing fstream with FILE * or mmap.
Looks like a lot of repeated searches in the same string, which will not be very efficient.
Parse the file/line in a proper way.
There are three libraries in Boost that might be of help.
Parse the line using a regular expression:
http://www.boost.org/doc/libs/1_53_0/libs/regex/doc/html/index.html
Use a tokenizer
http://www.boost.org/doc/libs/1_53_0/libs/tokenizer/index.html
For full customization you can always use Spirit.
http://www.boost.org/doc/libs/1_53_0/libs/spirit/doc/html/index.html
The first step would be to figure out how much of the time is spent in the if(lineIn.find(...)... and how much is the actual reading of input file.
Time how long your application runs (you may want to take a selection of log files, rather than ALL of them). You may want to run this a few times in a row to check that you get approximately the same value each time.
Then add:
#if 0
if (lineIn.find(...) ...)
    ...
#endif
and compare the time it takes. My guess is that it won't actually make that much of a difference. However, if the searching is a major component of the time, you may find that it's beneficial to use a more clever search method. There are some pretty clever methods for searching for strings in a larger string.
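One such method available out of the box since C++17 is Boyer-Moore via std::boyer_moore_searcher; a sketch (whether it actually beats std::string::find on short log lines is something you would have to measure):
#include <algorithm>
#include <functional>
#include <string>

bool containsToken(const std::string& line, const std::string& token)
{
    std::boyer_moore_searcher searcher(token.begin(), token.end());
    return std::search(line.begin(), line.end(), searcher) != line.end();
}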
I will post back with a couple of benchmarks of "read a file quicker" that I've posted elsewhere. But bear in mind that the hard disk you are reading from will account for the major part of the time.
References:
getline while reading a file vs reading whole file and then splitting based on newline character
slightly less relevant, but perhaps interesting:
What is the best efficient way to read millions of integers separated by lines from text file in c++
Your execution bottleneck will be in file I/O.
I suggest that you haul in as much data as possible in one fetch into a buffer. Next, search the buffer for your tokens.
You have to read in the text in order to search it, so you might as well read in as much of the file as you can.
There may be some drawbacks in reading too much data into memory. If the OS can't fit all the data, it may page it out to a harddrive, which makes the technique worthless (unless you want the OS to handle reading the file in chunks).
Once the file is in memory, the choice of searching technique may offer only negligible performance increases.
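As a sketch of that buffering approach (the chunking caveat above still applies to very large files):
#include <fstream>
#include <sstream>
#include <string>

// Read the whole file into one buffer, then search it in memory.
std::string slurp(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}

// Usage sketch:
// std::string data = slurp(logfile);
// for (std::size_t pos = data.find("value1#"); pos != std::string::npos;
//      pos = data.find("value1#", pos + 1))
//     { /* parse the digits that follow the '#' */ }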

Given a string, find all its permutations that are a word in dictionary

This is an interview question:
Given a string, find all its permutations that are a word in dictionary.
My solution:
Put all words of the dictionary into a suffix tree and then search each permutation of the string in the tree.
The search time is O(n), where n is the size of the string. But the string may have n! permutations.
How do I improve the efficiency?
Your general approach isn't bad.
However, you can avoid having to search for each permutation by rearranging your word so that all its characters are in alphabetical order, then searching a dictionary where each word is similarly rearranged into alphabetical order and mapped to the original word.
I realise that might be a little hard to grasp as is, so here's an example. Say your word is leap. Rearrange this to aelp.
Now in your dictionary you might have the words plea and pale. Having done as suggested, your dictionary will (among other things) contain the following mappings:
...
aelp -> pale
aelp -> plea
...
So now, to find your anagrams you need only find entries for aelp (using, for example, a suffix-tree approach as suggested), rather than for all 4! = 24 permutations of leap.
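A sketch of that mapping using a hash-based multimap (the names here are illustrative):
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Key: the word's letters sorted alphabetically ("leap" -> "aelp").
std::string sortedKey(std::string w)
{
    std::sort(w.begin(), w.end());
    return w;
}

std::unordered_multimap<std::string, std::string> dict;

void addWord(const std::string& w)
{
    dict.emplace(sortedKey(w), w);
}

std::vector<std::string> anagramsOf(const std::string& w)
{
    std::vector<std::string> out;
    auto range = dict.equal_range(sortedKey(w));
    for (auto it = range.first; it != range.second; ++it)
        out.push_back(it->second);
    return out;
}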
A quick alternative solution - it all depends on the sizes of the data structures in question.
If the dictionary is reasonably small and the string is reasonably long, you can go over each entry in the dictionary and figure out whether it is a permutation of the string. You can be smarter: you can sort the dictionary and skip certain entries.
You can build a map from a sorted list of characters to a list of words.
For example, given these:
Array (him, hip, his, hit, hob, hoc, hod, hoe, hog, hon, hop, hos, hot)
you would sort them internally:
Array (him, hip, his, hit, bho, cho, dho, eho, gho, hno, hop, hos, hot)
sort the result:
Array (bho, cho, dho, eho, gho, him, hip, his, hit, hno, hop, hos, hot)
In this small sample we don't have a match, but for a particular word you would sort it internally and, with that as the key, look into your map.
Why don't you use a hash map to store the dictionary words? That way you get O(1) lookup time. And if your input is in English, you can build another table recording all the letters that occur in your dictionary; using this table, you can reject some inputs right at the start. Following is an example:
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> findWords(std::string input,
                                   const std::unordered_set<char>& letter_table,
                                   const std::unordered_set<std::string>& dictionary_hash_table)
{
    std::vector<std::string> result_list;
    // Reject early: if any input letter never occurs in the dictionary,
    // no permutation can possibly match.
    for (char c : input)
        if (letter_table.count(c) == 0)
            return result_list;
    // Check every unique permutation of the input against the hash table.
    std::sort(input.begin(), input.end());
    do
    {
        if (dictionary_hash_table.count(input))
            result_list.push_back(input);
    } while (std::next_permutation(input.begin(), input.end()));
    return result_list;
}
You should put the words into a trie. Then you can look up the word as you generate the permutations. You can skip over whole blocks of permutations whose first part is not in the trie.
http://en.wikipedia.org/wiki/Trie
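A sketch of that pruning, reusing the TrieNode layout sketched earlier in this thread (lowercase a-z assumed; counts holds how many of each letter remain, remaining their total):
#include <array>
#include <string>
#include <vector>

void searchPermutations(const TrieNode& node, std::array<int, 26>& counts,
                        int remaining, std::string& current,
                        std::vector<std::string>& results)
{
    if (node.isWord && remaining == 0)
        results.push_back(current); // used every letter: a full permutation
    for (int i = 0; i < 26; ++i)
    {
        // Prune: never descend into a prefix absent from the trie,
        // skipping the whole block of permutations that starts with it.
        if (counts[i] > 0 && node.child[i])
        {
            --counts[i];
            current.push_back(static_cast<char>('a' + i));
            searchPermutations(*node.child[i], counts, remaining - 1, current, results);
            current.pop_back();
            ++counts[i];
        }
    }
}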
Another simple solution could be the algorithm below:
1) Sort the string, then use "next_permutation" to step through each unique permutation (it enumerates all of them only if you start from sorted order).
2) Use "find/find_if" to look each one up against a dictionary.

How do you read a word in from a file in C++?

So I was feeling bored and decided I wanted to make a hangman game. I did an assignment like this back in high school when I first took C++. But this was before I even took geometry, so unfortunately I didn't do well in it in any way, shape, or form, and after the semester I trashed everything in a fit of rage.
I'm looking to make a txt document and just throw in a whole bunch of words
(ie:
test
love
hungry
flummoxed
discombobulated
pie
awkward
you
get
the
idea
)
So here's my question:
How do I get C++ to read a random word from the document?
I have a feeling #include<ctime> will be needed, as well as srand(time(0)); to get some kind of pseudorandom choice...but I haven't the foggiest idea how to take a random word from a file...any suggestions?
Thanks ahead of time!
Here's a rough sketch, assuming that the words are separated by whitespace (space, tab, newline, etc.):
vector<string> words;
ifstream in("words.txt");
string word;
while (in >> word) { // loop on the read itself so the last word isn't duplicated
    words.push_back(word);
}
string r = words[rand() % words.size()];
The operator >> used on a string will read one whitespace-separated word from a stream.
So the question is: do you want to read the file each time you pick a word, or do you want to load the file into memory and then pick the word from a memory structure? Without more information I can only guess.
Pick a Word from a file:
// Note: an ifstream is also an istream.
std::string pickWordFromAStream(std::istream& s, std::size_t pos)
{
    std::istream_iterator<std::string> iter(s);
    for (; pos; --pos)
    {
        ++iter;
    }
    // This code assumes that pos is smaller than or equal to
    // the number of words in the file.
    return *iter;
}
Load a file into memory:
void loadStreamIntoVector(std::istream& s, std::vector<std::string>& words)
{
    std::copy(std::istream_iterator<std::string>(s),
              std::istream_iterator<std::string>(),
              std::back_inserter(words));
}
Generating a random number should be easy enough, assuming you only want pseudo-random.
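Putting it together with the srand(time(0)) seeding you mentioned (a sketch; "words.txt" is a hypothetical file name):
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::srand(static_cast<unsigned>(std::time(nullptr)));
    std::vector<std::string> words;
    std::ifstream in("words.txt");
    loadStreamIntoVector(in, words); // fills words (note: the vector is passed by reference)
    if (!words.empty())
        std::cout << words[std::rand() % words.size()] << '\n';
}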
I would recommend creating a plain text file (.txt) in Notepad and using the standard C file APIs (fopen() and fread()) to read from it. You can use fgets() to read each line one at a time.
Once you have your plain text file, just read each line into an array and then randomly choose an entry in the array using the method you've suggested above.