How to count the occurrences of a word in multiple texts? - c++

I have a binary tree that stores all the words in a text along with their occurrence counts: the word is the key and the number of occurrences is the value.
If I have multiple texts, do I create multiple trees?
Also, I want to compute the idf (inverse document frequency - how many times that word appears in all the texts).
How can I achieve this?

If I understood your problem correctly, you will need one tree per file so that you know how many occurrences of a word each file contains.
For the second part, I can't tell whether you need the total number of occurrences of a word or the number of files that contain that word.
In either case, you just need to loop over all of your trees and look up that word in each one.
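A minimal sketch of that idea, assuming each text fits in memory and words are separated by whitespace. std::map stands in for the binary tree, and the names countWords, totalOccurrences, documentFrequency and inverseDocumentFrequency are made up for this illustration. The idf here uses the common log(N / df) form, which is an assumption, since the question does not pin down a formula.

```cpp
#include <cmath>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One ordered map per text: word -> number of occurrences in that text.
using WordCounts = std::map<std::string, int>;

// Count whitespace-separated words in a single text.
WordCounts countWords(const std::string& text) {
    WordCounts counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ++counts[word];
    }
    return counts;
}

// Total number of occurrences of `word` across all texts.
int totalOccurrences(const std::vector<WordCounts>& docs, const std::string& word) {
    int total = 0;
    for (const WordCounts& doc : docs) {
        auto it = doc.find(word);
        if (it != doc.end()) total += it->second;
    }
    return total;
}

// Number of texts that contain `word` at least once.
int documentFrequency(const std::vector<WordCounts>& docs, const std::string& word) {
    int df = 0;
    for (const WordCounts& doc : docs) {
        if (doc.count(word) > 0) ++df;
    }
    return df;
}

// idf = log(N / df). Returning 0 for a word that appears in no text is a
// convention chosen for this sketch, not part of any standard definition.
double inverseDocumentFrequency(const std::vector<WordCounts>& docs, const std::string& word) {
    int df = documentFrequency(docs, word);
    if (df == 0) return 0.0;
    return std::log(static_cast<double>(docs.size()) / df);
}
```

Keeping one map per text answers both readings of the second part: totalOccurrences gives the overall count, documentFrequency gives the number of texts containing the word.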

Related

Searching for a word in a txt file that contains 100,000,000 words in c++

I have a txt file with 100000000 words in every new line.
I want to write a function that takes a word as input and checks whether that word is in the txt file.
I have tried this with a map and with a trie, but I'm getting a std::bad_alloc error because of the large number of words. Can anyone suggest how to solve the issue?
Data structures are quite important when programming. If possible, I would recommend using something like a binary tree. This would require sorting the text file, though. If you cannot sort the text file, the simplest way is to iterate over the file until you reach the word you want. Also, your question should contain more information so that we can diagnose your problem more easily.
I assume you want to search this word list over and over, because for a small number of searches you can just search linearly through the file.
Parsing the word list into a suffix tree takes about 20 times the size of the file, more if not optimized. Since you ran out of memory constructing a trie of the word list, I assume it's really big. So let's not keep it in memory, but preprocess it a bit so you can search faster.
The solution I would propose is to do a dictionary search.
So first turn every whitespace character into a newline, so that you have one word per line instead of multiple lines with multiple words, then sort the file and store it. While you are at it, you can remove duplicates. That is our dictionary. Remember the length of the longest word (L) while you do that.
To access the dictionary you need a helper function that reads the word at offset X, which may fall in the middle of a word. The function should seek to offset X - L and read 2 * L bytes into a buffer. Then, from the middle of the buffer, search backward and forward to find the word at offset X.
Now to search, you open the dictionary and read the word at offset left = 0 and at offset right = size_of_file, i.e. the first and last word. If your search term is less than the first word or larger than the last word, you are done: word not found. If you found the search term, you are done too.
Next in a binary search you would take the std::midpoint of left and right, read the word at that offset and check if the search term is less or more and recurse into that interval. This would require O(log n) reads to find the word or determine it's not present.
A dictionary search can do better. Instead of using the midpoint, you can approximate where the word should be in the dictionary. Say your dictionary goes from "Aal" to "Zoo" and you are searching for "Zebra". Would you open the dictionary in the middle? No, you would open it near the end, because Zebra is much closer to Zoo than to Aal. So you need a function that gives you a value (M) between 0 and 1 describing where the search term is located relative to the left and right word. Your "midpoint" for the search is then left + (right - left) * M. Then, like with binary search, determine whether the search term is in the left or right interval and recurse.
A dictionary search takes only O(log log n) reads on average if the word list is reasonably uniformly distributed.
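The probing logic can be sketched as follows. This is only an illustration under simplifying assumptions: the sorted, de-duplicated dictionary is held in a std::vector instead of being read from the file by offset, and M is estimated from the first 8 bytes of each word. The names prefixKey and dictionaryContains are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Maps the first 8 bytes of a word to a number that preserves the byte-wise
// ordering of std::string (shorter words are padded with zero bytes).
std::uint64_t prefixKey(const std::string& word) {
    std::uint64_t key = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        key <<= 8;
        if (i < word.size()) key |= static_cast<unsigned char>(word[i]);
    }
    return key;
}

// Dictionary (interpolation) search over a sorted list without duplicates.
bool dictionaryContains(const std::vector<std::string>& sorted, const std::string& term) {
    if (sorted.empty()) return false;
    std::size_t left = 0;
    std::size_t right = sorted.size() - 1;
    while (left <= right) {
        // Outside the current interval: the word cannot be present.
        if (term < sorted[left] || term > sorted[right]) return false;

        std::uint64_t lo = prefixKey(sorted[left]);
        std::uint64_t hi = prefixKey(sorted[right]);
        std::uint64_t k = prefixKey(term);

        // M in [0, 1]; fall back to the midpoint when the prefixes are equal.
        double m = 0.5;
        if (hi > lo) m = static_cast<double>(k - lo) / static_cast<double>(hi - lo);

        std::size_t probe = left + static_cast<std::size_t>(m * static_cast<double>(right - left));
        if (probe > right) probe = right;  // guard against rounding

        if (sorted[probe] == term) return true;
        if (sorted[probe] < term) {
            left = probe + 1;
        } else {
            if (probe == 0) return false;
            right = probe - 1;
        }
    }
    return false;
}
```

For the on-disk version described above, sorted[probe] would be replaced by the helper that reads the word at a byte offset, and left and right would be byte offsets rather than indices.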

Find eligible words in a game of Scrabble based on first and last characters

I was looking at various problems associated with the game of Scrabble and came across this one: "Find eligible words in a game of Scrabble based on first and last characters in optimal time." If I use a trie, I can get all words starting with a specific character and compare the last character of each of those words with the given last character. But in that case I am not using the last character when building the data structure; I only use it in a brute-force way afterwards. Is there a better way to organize the words so that, instead of walking all words that start with the first character, I can also exploit the fact that they end with the given last character?

String Finding Alg w/ Lowest Freq Char

I have 3 text files. One with a set of text to be searched through
(ex. ABCDEAABBCCDDAABC)
One contains a number of patterns to search for in the text
(ex. AB, EA, CC)
And the last containing the frequency of each character
(ex.
A 4
B 4
C 4
D 3
E 1
)
I am trying to write an algorithm that finds the least frequently occurring character of each pattern, searches the text for occurrences of that character, and then checks the surrounding letters to see whether the pattern matches. Currently, I have the characters and the frequencies in their own vectors (so index i=0 of the two vectors holds A and 4, respectively).
Is there a better way to do this? Maybe a faster data structure? Also, what are some efficient ways to check the pattern string against the piece of the text string once the least frequent letter is found?
You can run the Aho-Corasick algorithm. Its complexity (once the preprocessing - whose complexity is unrelated to the text - is done), is Θ(n + p), where
n is the length of the text
p is the total number of matches found
This is essentially optimal. There is no point in trying to skip over letters that appear to be frequent:
If the letter is not part of a match, the algorithm takes unit time.
If the letter is part of a match, then the match includes all letters, irrespective of their frequency in the text.
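For reference, a compact sketch of Aho-Corasick, assuming the alphabet is limited to the uppercase letters 'A'..'Z' as in the sample data. The class name AhoCorasick and the method findAll are made up for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <queue>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class AhoCorasick {
public:
    explicit AhoCorasick(const std::vector<std::string>& patterns) : patterns_(patterns) {
        nodes_.emplace_back();  // root
        for (std::size_t id = 0; id < patterns_.size(); ++id) {
            if (patterns_[id].empty()) throw std::invalid_argument("empty pattern");
            int state = 0;
            for (char c : patterns_[id]) {
                int idx = index(c);
                if (idx < 0) throw std::invalid_argument("pattern outside 'A'..'Z'");
                if (nodes_[state].next[idx] == -1) {
                    nodes_[state].next[idx] = static_cast<int>(nodes_.size());
                    nodes_.emplace_back();
                }
                state = nodes_[state].next[idx];
            }
            nodes_[state].matches.push_back(id);
        }
        buildLinks();
    }

    // Returns (start offset in text, pattern index) for every occurrence.
    std::vector<std::pair<std::size_t, std::size_t>> findAll(const std::string& text) const {
        std::vector<std::pair<std::size_t, std::size_t>> found;
        int state = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            int idx = index(text[i]);
            if (idx < 0) { state = 0; continue; }  // unknown character: restart
            state = nodes_[state].next[idx];
            for (std::size_t id : nodes_[state].matches) {
                found.emplace_back(i + 1 - patterns_[id].size(), id);
            }
        }
        return found;
    }

private:
    static constexpr int kAlphabet = 26;
    struct Node {
        std::array<int, kAlphabet> next;
        int fail = 0;
        std::vector<std::size_t> matches;  // patterns ending at this state
        Node() { next.fill(-1); }
    };

    static int index(char c) { return (c >= 'A' && c <= 'Z') ? c - 'A' : -1; }

    // Breadth-first pass: sets failure links and fills in missing transitions,
    // so that the search does exactly one table lookup per text character.
    void buildLinks() {
        std::queue<int> pending;
        for (int c = 0; c < kAlphabet; ++c) {
            int child = nodes_[0].next[c];
            if (child == -1) nodes_[0].next[c] = 0;
            else pending.push(child);  // fail link of depth-1 states is the root
        }
        while (!pending.empty()) {
            int state = pending.front();
            pending.pop();
            const std::vector<std::size_t>& inherited = nodes_[nodes_[state].fail].matches;
            nodes_[state].matches.insert(nodes_[state].matches.end(), inherited.begin(), inherited.end());
            for (int c = 0; c < kAlphabet; ++c) {
                int child = nodes_[state].next[c];
                int viaFail = nodes_[nodes_[state].fail].next[c];
                if (child == -1) nodes_[state].next[c] = viaFail;
                else { nodes_[child].fail = viaFail; pending.push(child); }
            }
        }
    }

    std::vector<std::string> patterns_;
    std::vector<Node> nodes_;
};
```

With the sample text and the patterns AB, EA and CC, findAll reports each occurrence as a pair of start offset and pattern index, ordered by where the match ends in the text.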
You could run a loop that keeps a count of instances and checks whether a character has appeared more than a certain percentage of the time, based on the number of characters searched for and the total length of the string. For example, if you have 100 characters and 5 possibilities, any character that has appeared more than 20% of the time can be discounted, which increases efficiency by passing over any value matching that one.

How to get a count of the word sizes in a large amount of text?

I have a large amount of text - roughly 7000 words.
I would like to get a count of the word sizes, e.g. the count of 4-letter words and 6-letter words, using regex.
I am unsure how to go about this. My thought process so far would be to split the text into a String array, which would allow me to count the size of each individual element. Is there an easier way to go about this using a regex? I am using Groovy for this task.
EDIT: So I did get this working using a normal array, but it was slightly messy. The final solution simply used Groovy's countBy() method coupled with a small amount of logic, for anyone who might come across a similar problem.
Don't forget the word boundary token \b. If you don't put it at both ends of a \w{n} token, then all words longer than n characters are also matched. For a 4-character word use \b\w{4}\b; for a 6-character word use \b\w{6}\b. Here is a demo with 7000 words as input string.
Java implementation:
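import java.util.regex.Matcher;
import java.util.regex.Pattern;

String dummy = ".....";  // the text to scan
Pattern pattern = Pattern.compile("\\b\\w{6}\\b");  // whole words of exactly 6 characters
Matcher matcher = pattern.matcher(dummy);
int count = 0;
while (matcher.find()) {
    count++;
}
System.out.println(count);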
Read the file using any stream word by word and calculate their length. Store counters in an array and increment values after reading each word.
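The question is about Groovy, but the stream approach is language-independent. Here is a minimal C++ sketch of it, assuming words are whitespace-separated and that only letters and digits count towards a word's length (so a trailing period is ignored). The function name countWordLengths is hypothetical, and a std::map replaces the counter array so that no maximum length has to be fixed.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Returns a table: word length -> number of words with that length.
std::map<std::size_t, int> countWordLengths(std::istream& in) {
    std::map<std::size_t, int> counts;
    std::string token;
    while (in >> token) {
        std::size_t letters = 0;
        for (unsigned char c : token) {
            if (std::isalnum(c)) ++letters;  // skip punctuation such as '.' or ','
        }
        if (letters > 0) ++counts[letters];
    }
    return counts;
}
```

Any std::istream works as input, so the same function can read from a std::ifstream for a file or from a std::istringstream for an in-memory string.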
You could generate regexes for each size you want.
\b\w{6}\b would get each word with exactly 6 letters
\b\w{7}\b would get each word with exactly 7 letters
and so on... (without the \b word boundaries, longer words would match as well)
So you could run one of these regexes on the text with the global flag enabled (finding every instance in the whole string). This will give you an array of every match, which you can then take the length of.

Given a string without spaces, form a meaningful sentence by adding spaces between the words

You are given a string without any spaces, for example "Iamastudent". You will be provided with a predefined dictionary function which verifies whether a given word is present in the dictionary or not. Using this function, you have to insert spaces into the string and print it as "I am a student".
This was my interview question, and I was told to solve it in C++. I solved it using dynamic programming, but the interviewer was not satisfied.
The solution I gave is
the same as in the question below:
Given a phrase without spaces add spaces to make proper sentence
He asked me to do it using a trie or a suffix array, but I couldn't figure out the solution. Can anyone help me?
Find words and put spaces after them
The answer is to use a trie data structure. Build a trie from the possible words and keep traversing it; with the trie you can generate the different candidate words.
For "iamastudent", the trie lets you generate these words:
i, a, am, a, as, student
Now you have to make a proper sentence out of these words. One possible solution is a Markov chain: a data structure that holds, for each word, the probabilities of the words that can follow it. So the Markov chain will look like this:
"i" : [ "am", "did", "went" ...],
"a" : [ "tree", "dog" ..]
"am" : [ "a" ...]
Now you have this data in sequence:
[i], [a, am], [a, as], [student]
Note: I grouped all elements which start with the same character into one list.
Start with "i".
The next word is "a", but "a" is not listed as a follower of "i" in the Markov chain, so go on to the next word. You can continue like this.
From here onwards it is a DFS for a valid sentence. Well, it was a nice and tricky question.
If there is a unique way of splitting the sentence, then doing it with a trie is simple:
1. If there are characters left in the input string, start walking down from the root, consuming characters from the string. Otherwise terminate.
2. In a compressed trie you will find a mark whenever a prefix is a complete word; otherwise a word ends when you reach a leaf. Either way, that is when you output a space.
3. Go back to 1 (walking down from the root), starting from the current position in the string.
You are done when there are no more characters in the string (you may want to check that at this point you are not in the middle of traversing the tree).
If the solution is not unique, then whenever you reach the end of the string and you are not at a mark or a leaf in the tree, you need to backtrack to the previous space you emitted. You need a stack of positions in the input string.
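A small sketch of the trie walk with backtracking, assuming an uncompressed trie with an end-of-word mark on each node. Recursion plays the role of the stack of positions, and a table of positions already known to fail keeps the search from repeating work. TrieNode, insertWord, splitFrom and splitIntoWords are names invented for this example.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    bool isWord = false;  // the "mark": the path from the root spells a word
    std::map<char, std::unique_ptr<TrieNode>> children;
};

void insertWord(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child.reset(new TrieNode());
        node = child.get();
    }
    node->isWord = true;
}

// Tries to split text[start..] into dictionary words, appending them to `words`.
// dead[i] records that no valid split exists starting at position i.
bool splitFrom(const TrieNode& root, const std::string& text, std::size_t start,
               std::vector<bool>& dead, std::vector<std::string>& words) {
    if (start == text.size()) return true;
    if (dead[start]) return false;
    const TrieNode* node = &root;
    for (std::size_t i = start; i < text.size(); ++i) {
        auto it = node->children.find(text[i]);
        if (it == node->children.end()) break;  // no dictionary word continues this way
        node = it->second.get();
        if (node->isWord) {
            words.push_back(text.substr(start, i - start + 1));  // emit a space here
            if (splitFrom(root, text, i + 1, dead, words)) return true;
            words.pop_back();  // backtrack to the previous space
        }
    }
    dead[start] = true;
    return false;
}

// Returns true and fills `words` if the whole text can be split into words.
bool splitIntoWords(const TrieNode& root, const std::string& text,
                    std::vector<std::string>& words) {
    words.clear();
    std::vector<bool> dead(text.size() + 1, false);
    return splitFrom(root, text, 0, dead, words);
}
```

With a trie containing i, a, am, as and student, the input "iamastudent" first tries "i a", fails at "m", backtracks, and then succeeds with "i am a student". Input case is not normalized here, so "Iamastudent" would need lowercasing first.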