Most effective way to create a naive text summarizing algorithm - C++

I'm building a simple naive text summary algorithm. The algorithm works like this:
The first step of my algorithm is to remove all stop words (English stop words).
Once the text contains only words with actual meaning, I'm going to count how many times each word is used to find its frequency. For example, if the word "supercomputer" is used 5 times, it will have frequency = 5.
Then I'm going to calculate each sentence's weight by dividing the sum of the frequencies of all words in the sentence by the number of words in the sentence.
In the last step I'm going to sort the sentences by their length.
I need to write this algorithm in C++ (as a V8 NodeJS module), but the problem is that in the past few years I've been working mostly with high-level scripting languages like JavaScript, and I'm not that experienced in C++. In JavaScript I could just use a regex to remove all the stop words and then find the frequencies, but in C++ this seems to be much more complex.
I came up with the following idea:
struct words {
    std::string word;
    int freq;
};

std::vector<words> Words;
The stop words are going to be preloaded in a V8 Local Array or std::vector.
For each word in the text I'm going to loop through all stop words; if the current word is not a stop word, I then check whether it's already in the Words vector: if not, I add a new entry, and if it exists I increase its freq by 1.
After I have found all the frequencies of all words, I'm going to loop through the text again to find the weight of each sentence.
With this idea, a few problems came to mind:
My texts will mostly be 1000+ words, and looping through 100+ stop words for each word means 100,000+ iterations just to filter out the stop words. This seems really inefficient.
After I have the frequencies I will need to loop through the text one more time (1000+ words), checking against the 300+ entries in the frequency vector, to calculate each sentence's weight.
My idea seems inefficient, but I'm not very familiar with C++.
So my question is: are there better ways to do this, or ways to optimize my algorithm, especially regarding the problems I listed above?
I'm worried about the performance of my algorithm and any tips/suggestions will be greatly appreciated.

For the stopwords, have a look at std::unordered_set. You can store all of your stopword strings in a std::unordered_set<string>, then when you have a string you want to compare, call count(string) to see if it exists.
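As a minimal sketch (the stop-word list below is just a placeholder; you would load the full English list once at startup), the lookup could look like this:

#include <string>
#include <unordered_set>

// Build the set once, up front; each count() call is then O(1) on average.
const std::unordered_set<std::string> stopWords = {
    "a", "an", "and", "the", "of", "in", "to", "is"   // ...load the full list here
};

bool isStopWord(const std::string& word)
{
    return stopWords.count(word) > 0;
}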
For the word/frequency pairs, use a std::unordered_map as in some of the comments. It would be fastest if you perform both the find and insert in a single map lookup. Try something like this:
#include <string>
#include <unordered_map>

struct Frequency
{
    int val;
    Frequency() : val(0) {}
    void increment()
    {
        ++val;
    }
};

std::unordered_map<std::string, Frequency> words;

void processWord(const std::string& str)
{
    words[str].increment();  // operator[] inserts a default-constructed Frequency if str is new
}
words[str] searches for a word in the map, adding it if it doesn't exist. New words will call Frequency's constructor which initializes to zero. So all you have to do is call processWord on every word.
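Putting the two containers together, here is a rough end-to-end sketch. It uses a plain int map instead of the Frequency struct above for brevity, and tokenize, countFrequencies, and sentenceWeight are made-up helper names; a real tokenizer would also strip punctuation and lowercase the words.

#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Naive whitespace tokenizer, just to illustrate the flow.
std::vector<std::string> tokenize(const std::string& text)
{
    std::istringstream iss(text);
    std::vector<std::string> out;
    std::string w;
    while (iss >> w) out.push_back(w);
    return out;
}

// One pass over the whole text to count frequencies, skipping stop words.
std::unordered_map<std::string, int> countFrequencies(
    const std::string& text, const std::unordered_set<std::string>& stopWords)
{
    std::unordered_map<std::string, int> freq;
    for (const std::string& w : tokenize(text))
        if (stopWords.count(w) == 0)
            ++freq[w];   // inserts 0 first if the word is new
    return freq;
}

// Weight of one sentence: average frequency of its counted words.
double sentenceWeight(const std::string& sentence,
                      const std::unordered_map<std::string, int>& freq)
{
    long sum = 0;
    int n = 0;
    for (const std::string& w : tokenize(sentence))
    {
        auto it = freq.find(w);
        if (it != freq.end()) { sum += it->second; ++n; }
    }
    return n == 0 ? 0.0 : static_cast<double>(sum) / n;
}

With this layout, both the stop-word check and the frequency lookup are hash-table operations, so the total work stays roughly linear in the number of words.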

Related

Reducing time complexity of string comparison

I have a dictionary .txt file with probably over a thousand words and their definitions. I've already written a program to take the first word of each line from this file and check it against a string input by the user:
#include <algorithm>
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>

std::string checkWord(string input)
{
    std::ifstream inFile;
    inFile.open("Oxford.txt");
    if (inFile.is_open())
    {
        string line; //there is a "using std::string" in another file
        while (getline(inFile, line))
        {
            //read the first word from each line
            std::istringstream iss(line);
            string word;
            iss >> word;
            //make sure the strings being compared are the same case
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);
            std::transform(input.begin(), input.end(), input.begin(), ::tolower);
            if (word == input)
            {
                //Do a thing with word
            }
        }
        inFile.close();
        return "End of file";
    }
    else
    {
        return "Unable to open file";
    }
}
But if I'm checking more than a sentence, the time it takes to process becomes noticeable. I've thought about a few ways of making this time shorter:
Making a .txt file for each letter of the alphabet (Pretty easy to do, but not really a fix in the long-term)
Using an unordered_set to compare the strings (like in this question); the only problem with this might be the initial creation of these sets from the text file
Using some other data structure to compare strings? (Like std::map)
Given that the data is already "sorted", what kind of data structure or method should I employ in order to (if possible) reduce time complexity? Also, are there any issues with the function I am using to compare strings? (for example, would string::compare() be quicker than "=="?)
A tree (std::map)? Or a hashmap (std::unordered_map)? Your linear search is obviously a brute force solution! Both of the above will be substantially superior for multiple searches.
Of course, that only really helps if you are going to use this data multiple times per program run, which you didn't specify in your question. If not, there's not really much benefit in loading and parsing and storing all the data only to perform a single lookup then quit. Just put a break in on success, at least.
You imply that your input file is sorted. You could hack together a binary search solution with file seeking (which is really cheap) and snapping to the nearest newline on each iteration to determine roughly where all the words with the same leading (say) three characters are in your file. For a thousand entries, though, this is probably overkill.
So, there are "simple" fixes, and there are some more complex ones.
The first step is to move all unnecessary things out of the search-loop: Lowercase input once, before the loop, rather than every time - after all, it's not changing. If possible, lowercase the Oxford.txt too, so you don't have to lowercase word for every line.
If you are searching the file multiple times, reading a file multiple times is definitely not a great solution - even if it's cached in the filesystem the second time.
So read it once into some container; a really simple one would be a std::vector [lower-casing the strings at the same time] and just iterate over it. The next improvement would be to sort the vector and use a binary search (std::binary_search or std::lower_bound will do that for you, or you can write it yourself - it's not terribly hard).
A slightly more complex solution [but faster to search] would be to use std::map<std::string, std::string> wordlist (but that also takes a bit more space), then use auto pos = wordlist.find(input); if (pos != wordlist.end()) ... found word ....
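To make the "read once, look up many times" idea concrete, here is a minimal sketch; loadWords and isKnownWord are made-up names, and it assumes the Oxford.txt layout from the question (the word comes first on each line):

#include <algorithm>
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_set>

// Read the file once, keeping only the (lowercased) first word of each line.
std::unordered_set<std::string> loadWords(const std::string& path)
{
    std::unordered_set<std::string> words;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream iss(line);
        std::string word;
        if (iss >> word)
        {
            std::transform(word.begin(), word.end(), word.begin(), ::tolower);
            words.insert(word);
        }
    }
    return words;
}

bool isKnownWord(const std::unordered_set<std::string>& words, std::string input)
{
    std::transform(input.begin(), input.end(), input.begin(), ::tolower);
    return words.count(input) > 0;
}

After the one-time load, each lookup is a constant-time hash probe instead of a full file scan.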
You can benefit from using a prefix tree, also known as a trie data structure, as it fits the use case of having a dictionary and frequently looking up words in it.
The simplest model of a trie is a tree where each node holds a letter and a flag to tell whether the current letter is the end of a word (and, additionally, pointers to other data about the word).
Example: a trie containing the dictionary words aback, abate, bid, bird, birth, black, blast (picture omitted).
To search for a word, start from the root, and for each letter of your word, follow the node containing the current letter (halt if it isn't present as a child of the current node). The search time is proportional to the look up word length, instead of to the size of your dictionary.
A trie also allows you to easily get the alphabetic (lexicographical) order of words in a dictionary: just do a pre-order traversal of it.
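A minimal sketch of such a trie, assuming lowercase a-z words only (TrieNode, insert, and contains are made-up names):

#include <array>
#include <memory>
#include <string>

struct TrieNode
{
    std::array<std::unique_ptr<TrieNode>, 26> children;
    bool isWord = false;   // flag: the path from the root to here spells a word
};

void insert(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char c : word)
    {
        int idx = c - 'a';
        if (!node->children[idx])
            node->children[idx] = std::make_unique<TrieNode>();
        node = node->children[idx].get();
    }
    node->isWord = true;
}

bool contains(const TrieNode& root, const std::string& word)
{
    const TrieNode* node = &root;
    for (char c : word)
    {
        int idx = c - 'a';
        if (!node->children[idx])
            return false;          // halt: letter not present as a child
        node = node->children[idx].get();
    }
    return node->isWord;
}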
Instead of storing everything in a .txt file, store it in a real database.
SQLite3 is a good choice for simple tasks, since it is in-process instead of requiring an external server.
For a task this simple, the C API and SQL statements should be very easy to learn.
Something like:
-- Only do this once, for setup, not each time you run your program.
sqlite> CREATE TABLE dictionary (word TEXT PRIMARY KEY);
sqlite> .import /usr/share/dict/words dictionary
-- Do this every time you run your program.
sqlite> select count(*) from dictionary where word = 'a';
1
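On the C++ side, the lookup is then just a few SQLite C API calls. A rough sketch, assuming the database file is named dictionary.db and with error handling mostly omitted (isDictionaryWord is a made-up name):

#include <sqlite3.h>
#include <string>

// Returns true if `word` exists in the dictionary table created above.
bool isDictionaryWord(const std::string& word)
{
    sqlite3* db = nullptr;
    if (sqlite3_open("dictionary.db", &db) != SQLITE_OK)
        return false;

    sqlite3_stmt* stmt = nullptr;
    const char* sql = "SELECT 1 FROM dictionary WHERE word = ? LIMIT 1;";
    bool found = false;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) == SQLITE_OK)
    {
        sqlite3_bind_text(stmt, 1, word.c_str(), -1, SQLITE_TRANSIENT);
        found = (sqlite3_step(stmt) == SQLITE_ROW);
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return found;
}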

Clojure dictionary of words

I want a dictionary of English words available, to pick random English words. I have a dictionary text file that I downloaded from the internet which has almost 1 million words. What's the best way to go about using this list in Clojure, given that most of the time I'll only need 1 randomly selected word?
Edit:
To answer the comments, this is for some tests which I may turn into load tests, which is why I want a decent number of random words, and I guess access speed is the most important thing. I do not want to use a database for this. I originally thought of a dictionary just because that's the first thing that popped into my mind, but I think a random sequence of letters and numbers would be good enough; perhaps I will just use a UUID as a string.
Read all the words into a vector and then call rand-nth, e.g.
(rand-nth all-words)
rand-nth uses the nth function on the underlying data structure, and Clojure vectors have O(log32 N) performance for index-based retrieval.
Edit: This is assuming that it is for a test environment as you described in your question. A more memory efficient method would be to use RandomAccessFile and seek to a random location in the file of words, read until you find the first word delimiter (e.g. comma, EOL) and then read the following bytes until the next delimiter which will give you a random word.

Checking if a string contains an English sentence

As of right now, I decided to take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing everything from that newline to the next newline, then I do string.find() to see if that English word is somewhere in the string being checked. This takes a VERY long time, each word taking about 1/4-1/2 of a second to verify.
It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (Multithreading), but it still only checks like 10 a second. (I need thousands)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still not get the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contain complete garbage, just random letters.
I can't simply reject strings with impossible combinations of letters, because a valid string like the one above would be thrown out because of the 'tm' between 'that' and 'must'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word, and build a search table for it. You need to do it only once. Now your search for individual words will proceed at a faster pace, because the "false starts" will be eliminated.
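A minimal sketch of the per-word KMP machinery (buildKmpTable and kmpContains are made-up names; the table is built once per dictionary word and reused for every string you check):

#include <string>
#include <vector>

// Build the KMP failure table for one dictionary word (done once per word).
std::vector<std::size_t> buildKmpTable(const std::string& word)
{
    std::vector<std::size_t> fail(word.size(), 0);
    for (std::size_t i = 1, k = 0; i < word.size(); ++i)
    {
        while (k > 0 && word[i] != word[k])
            k = fail[k - 1];
        if (word[i] == word[k])
            ++k;
        fail[i] = k;
    }
    return fail;
}

// True if `word` occurs anywhere in `text`; mismatches reuse the failure
// table instead of restarting from scratch, eliminating the "false starts".
bool kmpContains(const std::string& text, const std::string& word,
                 const std::vector<std::size_t>& fail)
{
    std::size_t k = 0;
    for (char c : text)
    {
        while (k > 0 && c != word[k])
            k = fail[k - 1];
        if (c == word[k])
            ++k;
        if (k == word.size())
            return true;
    }
    return false;
}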
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice, store the longer substring.)
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus you can't ever find a word beginning with, say 'r', anywhere before the first 'r' in the string. And some words won't even do a search if the letter isn't in there at all.
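A small sketch of Idea 1, assuming lowercase a-z input. buildSuffixTable and containsWord are made-up names, and copying whole suffixes is wasteful; in practice you would store start positions into the original string instead.

#include <array>
#include <string>

// Per-letter suffix table: for each letter a-z, keep the suffix of `text`
// starting at that letter's first occurrence (the "longer substring").
std::array<std::string, 26> buildSuffixTable(const std::string& text)
{
    std::array<std::string, 26> table;   // entries stay empty for absent letters
    for (std::size_t i = text.size(); i-- > 0; )
    {
        char c = text[i];
        if (c >= 'a' && c <= 'z')
            table[c - 'a'] = text.substr(i);  // later (earlier-in-text) writes win
    }
    return table;
}

bool containsWord(const std::array<std::string, 26>& table, const std::string& word)
{
    if (word.empty() || word[0] < 'a' || word[0] > 'z')
        return false;
    const std::string& candidate = table[word[0] - 'a'];
    return !candidate.empty() && candidate.find(word) != std::string::npos;
}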
Idea 2
Expand upon that idea by noting the length of the longest word in the dictionary and getting rid of the characters in those stored strings that are farther away than that distance.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
Idea 3
This has nothing to do with 1 and 2. It's an idea that you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (It's a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
|    |              |
|    o -> b -> *    y -> *
|         |
|         d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like: /(arun)|(b((ill(y?))|(o(b|dy))))|(jose)/. This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking for the alternatives and, if one matches, move forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
Side note: I built one of these some time back by writing a program that wrote the code that ran the algorithm directly, instead of having code looking at the binary tree data structure.
Think of each set of vertical bar options being a switch statement against a particular character column and each arrow turning into a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
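A toy sketch of the idea; BloomFilter is a made-up class, and real code would use better, independent hash functions and pick the bit count and probe count from the desired false-positive rate.

#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter over strings: k probes derived from two std::hash values
// (double hashing). False positives are possible, false negatives are not.
class BloomFilter
{
public:
    BloomFilter(std::size_t numBits, std::size_t numHashes)
        : bits_(numBits, false), numHashes_(numHashes) {}

    void add(const std::string& s)
    {
        for (std::size_t i = 0; i < numHashes_; ++i)
            bits_[probe(s, i)] = true;
    }

    bool possiblyContains(const std::string& s) const
    {
        for (std::size_t i = 0; i < numHashes_; ++i)
            if (!bits_[probe(s, i)])
                return false;   // definitely not in the set
        return true;            // in the set, or a false positive
    }

private:
    std::size_t probe(const std::string& s, std::size_t i) const
    {
        std::size_t h1 = std::hash<std::string>{}(s);
        std::size_t h2 = std::hash<std::string>{}(s + "#salt");  // crude second hash
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t numHashes_;
};

You would add every dictionary word once, then call possiblyContains on each candidate substring and fall back to the exact check only on a hit.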
This code was modified from How to split text without spaces into list of words?:
from math import log

words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    costsum = 0
    i = len(s)
    while i > 0:
        c,k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k

    return costsum
Using the same dictionary as that answer and testing your string outputs:
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word, it returns inf, since it would split everything to single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary- even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
I see that you are concerned about time: Training may take a few minutes (assuming you have already compiled gold standard "sentences" and "random scrambled strings" texts). You only need to train once, you can easily save the "trained" model to disk and reuse it for subsequent runs by loading from disk, which may take a few seconds. Making a call on a string would take a trivially small number of floating point multiplications to get a probability, so after you finish training it, it should be very fast.
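A rough sketch of the character-level version of this idea, assuming lowercase a-z input with spaces already stripped. BigramModel, train, and score are made-up names, and you would pick the accept/reject threshold empirically from your gold-standard "sentences" and "garbage" texts.

#include <array>
#include <cmath>
#include <string>

// Toy character-bigram model: log-probabilities over a-z transitions,
// trained from space-stripped lowercase text.
struct BigramModel
{
    std::array<std::array<double, 26>, 26> logProb{};

    void train(const std::string& text)
    {
        std::array<std::array<double, 26>, 26> counts{};
        for (std::size_t i = 0; i + 1 < text.size(); ++i)
        {
            char a = text[i], b = text[i + 1];
            if (a >= 'a' && a <= 'z' && b >= 'a' && b <= 'z')
                counts[a - 'a'][b - 'a'] += 1.0;
        }
        for (int a = 0; a < 26; ++a)
        {
            double total = 26.0;                       // add-one smoothing
            for (int b = 0; b < 26; ++b) total += counts[a][b];
            for (int b = 0; b < 26; ++b)
                logProb[a][b] = std::log((counts[a][b] + 1.0) / total);
        }
    }

    // Average log-probability per transition; higher means "more English-like".
    double score(const std::string& text) const
    {
        double sum = 0.0;
        int n = 0;
        for (std::size_t i = 0; i + 1 < text.size(); ++i)
        {
            char a = text[i], b = text[i + 1];
            if (a >= 'a' && a <= 'z' && b >= 'a' && b <= 'z')
            {
                sum += logProb[a - 'a'][b - 'a'];
                ++n;
            }
        }
        return n == 0 ? 0.0 : sum / n;
    }
};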

SQL Hash table for words

I'm trying to solve the "find all possible words for a set of letters" problem. There are some good answers out there, but I still can't figure it out.
In my first test, I put the whole dictionary in an array and then looped through each letter. This is super fast, but it takes forever to load the dictionary in the array, and requires huge amount of memory.
So I need to store the dictionary (750,000 words) in a SQL database.
I guess there are two solutions to find all the possible words:
Make an advance query that returns all the possible words
Make a simple query that returns a fraction of the database with words that might be possible, and then quickly loop through that array and validate the words.
The problem:
It must be super fast. An iPhone 4 needs to be able to get all possible words in under 5-6 seconds so it doesn't hinder the game.
Here's a similar questions:
IOS: Sqlite. Find record fast
Sulthan's answer seems like a good idea. Create a hash table, and then:
Bitmask for ASCII letters (ignoring any non-ASCII alphabets). Bit at position 0 means the word contains "a", at position 1 contains "b" etc. If we create the same bitmask for our letters, we can select words such as (wordMask & ~lettersMask) == 0
How do you make the bitmask, hash table, and how do you construct the sql query?
Thanks
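For the bitmask part of the question, here is a minimal sketch of the check quoted above (makeMask and usesOnlyAvailableLetters are made-up names; on the SQL side you would store the word's mask as an integer column and apply the same test, either in the query or after fetching candidates):

#include <cstdint>
#include <string>

// 26-bit mask: bit 0 set if the word contains 'a', bit 1 for 'b', and so on.
uint32_t makeMask(const std::string& word)
{
    uint32_t mask = 0;
    for (char c : word)
        if (c >= 'a' && c <= 'z')
            mask |= 1u << (c - 'a');
    return mask;
}

// A word is a candidate if it uses no letter outside the available set.
bool usesOnlyAvailableLetters(uint32_t wordMask, uint32_t lettersMask)
{
    return (wordMask & ~lettersMask) == 0;
}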
SQL is probably not the best option. The traditional data structure for storing a collection of words is called a Trie. I'm sure there are implementations out there you can find. Someone else will have an answer to that.
The algorithm I envision is to permute the letters you are given, and check each permutation to see if it is in the Trie.

Given a string, find all its permutations that are a word in dictionary

This is an interview question:
Given a string, find all its permutations that are a word in dictionary.
My solution:
Put all words of the dictionary into a suffix tree and then search each permutation of the string in the tree.
The search time is O(n), where n is the size of the string. But the string may have n! permutations.
How do I improve the efficiency?
Your general approach isn't bad.
However, you can prevent having to search for each permutation by rearranging your word so that all its characters are in alphabetical order, then searching a dictionary where each word is similarly rearranged into alphabetical order and mapped to the original word.
I realise that might be a little hard to grasp as is, so here's an example. Say your word is leap. Rearrange this to aelp.
Now in your dictionary you might have the words plea and pale. Having done as suggested, your dictionary will (among other things) contain the following mappings:
...
aelp -> pale
aelp -> plea
...
So now, to find your anagrams you need only find entries for aelp (using, for example, a suffix-tree approach as suggested), rather than for all 4! = 24 permutations of leap.
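A compact sketch of that mapping, using a hash map keyed by the sorted letters instead of a suffix tree (buildAnagramIndex and findAnagrams are made-up names):

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Map from sorted letters ("aelp") to all dictionary words with those letters.
std::unordered_map<std::string, std::vector<std::string>>
buildAnagramIndex(const std::vector<std::string>& dictionary)
{
    std::unordered_map<std::string, std::vector<std::string>> index;
    for (const std::string& word : dictionary)
    {
        std::string key = word;
        std::sort(key.begin(), key.end());   // "plea" and "pale" both become "aelp"
        index[key].push_back(word);
    }
    return index;
}

std::vector<std::string> findAnagrams(
    const std::unordered_map<std::string, std::vector<std::string>>& index,
    std::string word)
{
    std::sort(word.begin(), word.end());
    auto it = index.find(word);
    return it == index.end() ? std::vector<std::string>{} : it->second;
}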
A quick alternative solution - all depends on the sizes of data structures in question.
If the dictionary is reasonably small and the string is reasonably long, you can go over each entry in the dictionary and figure out whether it is a permutation of the string. You can be smarter - you can sort the dictionary and skip certain entries.
You can build a map from a sorted list of characters to a list of words.
For example, given these:
Array (him, hip, his, hit, hob, hoc, hod, hoe, hog, hon, hop, hos, hot)
you would sort them internally:
Array (him, hip, his, hit, bho, cho, dho, eho, gho, hno, hop, hos, hot)
sort the result:
Array (bho, cho, dho, eho, gho, him, hip, his, hit, hno, hop, hos, hot)
In this small sample, we don't have a match, but for a particular word, you would sort it internally, and with this as key look into your map.
Why don't you use a hash map to store the dictionary words? That gives you O(1) lookup time. And if your input is in English, you can build another table listing all the letters that occur in your dictionary; using this table, you can filter out some inputs at the beginning. Here is an example:
result_list = empty;
for (char in input)
{
    if (char not in letter_table)
    {
        return result_list;
    }
}
for (entry in permutations of input)
{
    if (entry in dictionary_hash_table)
    {
        result_list->add_entry();
    }
}
return result_list;
You should put the words into a trie. Then you can look up the word as you generate the permutations. You can skip over whole blocks of permutations whose first part is not in the trie.
http://en.wikipedia.org/wiki/Trie
Another simple solution could be the algorithm below:
1) Use "next_permutation" to find a unique permutation.
2) Use "find/find_if" to find it against a dictionary.
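A minimal sketch of that approach, using std::next_permutation with an unordered_set as the dictionary instead of find/find_if over a sequence (permutationWords is a made-up name; keep in mind the factorial blow-up for longer strings):

#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Generate every distinct permutation of `letters` and keep the ones that are
// dictionary words.
std::vector<std::string> permutationWords(
    std::string letters, const std::unordered_set<std::string>& dictionary)
{
    std::vector<std::string> found;
    std::sort(letters.begin(), letters.end());   // next_permutation needs a sorted start
    do
    {
        if (dictionary.count(letters))
            found.push_back(letters);
    } while (std::next_permutation(letters.begin(), letters.end()));
    return found;
}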