Find the frequency of words in a text file - C++

Can anybody please help me? I'm a beginner and I have a hard assignment.
I need to write a C++ program that does the following:
Ask the user to enter two text files: the first one contains the text, the second one contains a list of words in one column, regardless of their number, like this:
//output
Enter the text file: text_file.txt
Enter the keywords file: keywords_file.txt
Search for the keywords from the keywords file in the text file
Find the appearance of each word like this :
system: 55 times
analysis: 21 times
Then output in new text file (ofstream)
This is my code and it compiles properly, but it asks the user to enter the words; I want it to take the words from the keywords file and write the output to a text file (ofstream).
This is a part of the keywords file's contents
//they are as a list in the original file
model,
management,
e-commerce,
system,
cluster
infrastructure,
computer,
knowledge,
metadata,
process,
alter,
simulate,
stock,
inventory,
strategy,
plan,
historical,
deices,
exact,
Analyst,
break even point,
SWOT,
tactic,
develop,
prototype,
feasible,
Inferences,
busy,
cloud compute,
schema,
enterprise,
custom,
expert system,
structure,
data mine,
data warehouse,
organism,
data mart,
operate,
quality assure,
forecast,
report,
This is a part of the book's contents:
the circuit is characterised by long straights and chicanes. this means the cars’ engines
are at full throttle for over 75% of the lap, a higher percentage than most other circuits.
the track requires heavier-than-average braking over a given lap, as the cars repeatedly
decelerate at the end of some of the world's fastest straights for the slow chicanes.
the chicanes are lined by rugged kerbs. riding over these hard is crucial for fast laps.
the long straights require small wings for minimum drag. this means lower downforce,
resulting in lower grip on corners and under braking, and less stability over bumps.
the main high-speed corners lesmo 1, lesmo 2 and parabolica are all right turns.
parts of the circuit are surrounded by trees, which means leaves can be blown
onto the track.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

int main() {
    ofstream ofs("output.txt");
    ofs << "Keywords " << endl;

    string kw_file;
    cout << "enter the file name : ";
    getline(cin, kw_file);
    ifstream keywordFile(kw_file.c_str());

    string text_file;
    cout << "enter the file name : ";
    getline(cin, text_file);
    ifstream textFile(text_file.c_str());

    if (!keywordFile.is_open() || !textFile.is_open()) {
        std::cout << "Error in opening files\n";
        return 1;
    }

    { // vector method
        std::vector<std::string> keywordVector;
        std::string keyword;
        while (keywordFile >> std::ws >> keyword) {
            keywordVector.push_back(keyword);
        }

        std::vector<int> countVector(keywordVector.size());
        std::string textWord;
        while (textFile >> std::ws >> textWord) {
            for (size_t i = 0; i < keywordVector.size(); ++i) {
                if (keywordVector[i] == textWord) {
                    countVector[i]++;
                }
            }
        }

        for (size_t i = 0; i < keywordVector.size(); ++i) {
            ofs << "The number of times [" << keywordVector[i]
                << "] appears in textFile is " << countVector[i] << '\n';
        }
    }

    keywordFile.clear();
    textFile.clear();
    keywordFile.seekg(0);
    textFile.seekg(0);
}

Preliminary
Before even starting to solve this problem, let's think about what is actually required:
1. Read in the keywords from keywords_file.txt.
2. Store these keywords in a data structure for later use.
3. Read in words from text_file.txt.
4. Use the data structure from Step#2 to compare against the words read in Step#3.
A data structure simply refers to how your program stores the data that it works on. When you create an array such as
int arr[] = {1, 2, 3};
to later do some manipulations on, that too is a data structure.
Choosing the Data Structure for the Problem
From the above steps, reading in words from the file is a simple problem to solve. The most pressing problem is to figure out the data structure in Step#2.
Let's try to start with a basic one: vector. We are going to use vector instead of arrays as in the C language, primarily for two reasons:
Fixed arrays are simple to handle and are the preferred data structure when we have fixed data. But in this problem we have dynamic data, because the number of words in keywords_file.txt can vary. It's much easier to use vector as a dynamic array than a C-style dynamic array, because memory management is far simpler with vector than with a plain array.
We are using C++ and not C, so let's try to use the data structures available in the Standard Library of C++.
If we choose a vector as the data structure to store the words from keywords_file.txt, then we also need a data structure to store the count of each word, i.e. the number of times it's found in text_file.txt. We can again use a vector for this. We could have some code like this (pseudo-code only):
1. Read in a keyword (using a std::string) from keywords_file.txt.
2. Store it in a vector using the push_back() function (let's call this vector keywordVector).
3. Repeat Step#1 and Step#2 until all the words from keywords_file.txt are read in and added to keywordVector.
4. Create another vector (let's call it countVector) for storing the counts of the words; it has the same size as keywordVector.
5. Initialize all the values in countVector to 0.
6. Read in a word (again a std::string) from text_file.txt.
7. Search for this word in keywordVector.
8. If the word is found, increment that word's count in countVector; otherwise ignore the word.
9. Repeat Step#6 to Step#8 until all the words from text_file.txt are read in.
Note for Step#8: if the word from text_file.txt matches the word at keywordVector[i], then increment the value at the corresponding index in countVector, i.e. ++countVector[i].
This should work for small programs and should solve the problem. Here's how the code would look:
std::vector<std::string> keywordVector; // empty vector of strings
std::string keyword; // string to store each word read in from keywords_file.txt
while (keywordFile >> std::ws >> keyword) { // std::ws just ignores any whitespace
    keywordVector.push_back(keyword);
}

// create vector of same size as keywordVector to keep track of count
std::vector<int> countVector(keywordVector.size());

std::string textWord;
while (textFile >> std::ws >> textWord) {
    // Search through the keywordVector to check if any word matches
    for (size_t i = 0; i < keywordVector.size(); ++i) {
        if (keywordVector[i] == textWord) {
            countVector[i]++; // Found a match, increment the count
        }
    }
}
Is that all there is?
No! The above method should work fine for small inputs, but things become problematic when the number of keywords and/or the number of words in text_file.txt becomes large.
Imagine this: your keywords_file.txt becomes very large because it stores all the words in the Oxford English Dictionary - around a million words. The first part, reading in the keywords, would be fine since we need it anyway. But the next part is where the problems start. Imagine all the words in your text_file.txt were the word zebra. While searching for zebra in the keyword list, you would have to go through that inner loop for every single word. If there are a billion words in your text_file.txt, you would end up doing a million iterations for each word you read, and the total would be 1 billion x 1 million = 1 quadrillion iterations (1 followed by 15 zeroes). That is too big. Assuming each iteration takes 1 nanosecond, the total time would be 1 million seconds (10^15 * 10^-9), which is ((1000000/60)/60)/24 = around 11.5 days!! Surely that is not acceptable!
We realize that the basic problem is that we have to scan the entire vector every time we read in a word from text_file.txt. If only there were a way to just look up the word directly, without having to iterate over the entire vector each and every time.
This is exactly what the map data structures help with. A hash map uses a hash function to quickly look things up in a collection. Maps store what are called key-value pairs, and you use the key to look up its value.
e.g. if we have something like
{ 101: "Alice", 202: "Bob", 303: "Charlie"}
Then 101 is the key and Alice is the value of that key, 202 is the key, Bob is the value etc.
In the C++ STL there are two data structures built upon this concept: std::unordered_map and std::map. As the names suggest, the first one stores the keys in no particular order, while the second one stores them in sorted order (strictly speaking, std::map is implemented as a balanced tree rather than a hash table, but both offer fast lookup).
Given this very basic intro to maps, we can see how this might be helpful for our case. We don't need to store the words from keyword_file.txt in any particular order. So we can use std::unordered_map, and use the keyword as the key for this map. We'll store the number of times that word appears as the value of this key. Here's some code:
/* Create a map to store key-value pairs of
   string and int, where each string is a keyword
   from the keywords_file.txt
*/
std::unordered_map<std::string, int> keywordMap;
std::string keyword;
while (keywordFile >> std::ws >> keyword) {
    // Initialize each word's count to 0
    keywordMap[keyword] = 0;
}

std::string textWord;
while (textFile >> std::ws >> textWord) {
    /* We do a find for the textWord in the map.
       This find isn't a linear scan (unlike vector)
       but uses hashing to quickly look up whether
       textWord exists in the map or not.
    */
    if (keywordMap.find(textWord) != keywordMap.end()) {
        // If it exists, then we can just directly increment the count
        keywordMap[textWord]++;
    }
}
Using our time calculations, this time the lookup of each of a billion words from text_file.txt would take only on the order of a few nanoseconds, since unordered_map's find() has average-case constant time complexity, unlike the earlier approach's linear complexity.
So for a billion words it takes on the order of a billion nanoseconds, which is just 1 second! Imagine the drastic difference: the earlier method took days, and this takes seconds. Hashing is a very powerful concept and finds applications in a lot of problems.
Followup
Here's the full code, if you want to use it. This is a basic solution for finding the frequency of words in a file. Since you're starting out in C++, I'd suggest you take your time to read in depth about all the data structures used here and use this example to build on your understanding. Also, if complexity is new to you, please do acquaint yourself with the topic.

Related

String encoding for memory optimization

I have a stream of strings in a format something like this: a:b, d:a, t:w, i:r, etc. Since I keep appending these strings, in the end it becomes a very large string.
I am trying to encode, for example:
a:b -> 1
d:a -> 2
etc.
My intention is to keep the final string as small as possible to save memory. Hence I need to give a single-digit value to the string occurring the maximum number of times.
I have the following method in mind:
Create a map<string, int> - this will keep each string and its count. At the end I will replace the string with the maximum count with 1, the next with 2, and so on until the last element of the map.
Currently the size of the final string is ~100,000 characters.
I can't compromise on speed; please suggest if anyone has a better technique to achieve this.
If I understand correctly, your input strings are in the range "a:a"..."z:z" and you simply need to count the appearances of each in the stream, regardless of order. If your distribution is even enough, you can count them using a uint16_t.
A std::map is implemented using a tree, so an array is much more efficient than a map in both memory and time.
So you can define an array
array<array<uint16_t, 26>, 26> counters = {{}};
and assuming your input is, for example input = "c:d", you can fill up the array as follows
counters[input[0]-'a'][input[2]-'a']++;
Then finally you can print out the frequencies of the input like this
for (size_t i = 0; i < counters.size(); ++i) {
    for (size_t j = 0; j < counters[i].size(); ++j) {
        cout << char(i + 'a') << ":" << char(j + 'a') << " " << counters[i][j] << endl;
    }
}

Vector of objects search for future objects' use?

I have a data file which is being read into a vector. Example:
West Ham 38 12
Leicester City 38 13
In the actual file there are more doubles following the name. Anyway, previously I've used this kind of search:
vector<Team> newTeams; // vector of Team objects
string homeName;
cout << "Enter home team name: ";
cin >> homeName;
cout << endl;
Team ht;
for (Team team : newTeams)
{
    if (team.getName() == homeName)
    {
        ht = team;
    }
}
Basically I go through the vector and look for a specific team name. If I find the team name, I assign that team to ht. Then I use ht to get the needed data, i.e.:
ht.getHomeGamesPlayed();
ht.getPoints();
ht.getHomeGoalsScored();
So, my question is: is there a better way to search? (Use a map of names + a vector of doubles?):
map<name, vector<doubles>>;
Also, how do I make the search case insensitive? I.e. I type in leicester instead of Leicester City, and Leicester City still gets picked?
UPD:
Seems like I figured it out; here's the code if you're interested:
string homeName;
cout << "Name: " << endl;
cin >> homeName;
Team ht;
for (Team team : dataTable) {
    if (strstr(team.getName().c_str(), homeName.c_str()))
    {
        ht = team;
    }
}
So when I type Leic it picks Leicester City (when I type leic, it doesn't work, though).
Yes, std::map would be a better fit for your problem. So would std::unordered_map.
To get case insensitive matches, you can use a string that has been converted to all upper case or all lower case as the map key. Then store the original name separately.
std::map<std::string, std::pair<std::string, std::vector<double>>> mymap;
If you need partial matches, e.g. finding Leicester City when you type leicester, the vector approach you're using now might be best. If you need to match only on the first part of the name, you can still use std::map and use map::lower_bound to find a starting place for your search.
There are a few solutions to this problem. The one I'd suggest is a radix tree with multiple input positions.
First, create a map or set or whatever to hold your objects. Then create a radix tree, indexing each partial match of some given width, e.g.
abcde fge
creates an entry for "abcde fge", "bcde fge", "cde fge", ... "e", each pointing to your (multi)map value. You can use the property of radix trees that you can easily find all values with a given prefix to quickly find all matches for a given substring, in around O(1) (or O(N) in an N-size search term) provided you have a fixed input size. Note, however, that construction of the tree scales as O(n^2) in the size of the searchable material. To remedy this, you can limit the size of the search terms that are indexed (e.g. 50 characters), or index in increments followed by multiple searches (e.g. index "abcdefg", "cdefg" and "efg"; then, when searching for "def", you search for "def" and "ef", and "ef" results in a prefix match with "efg").
Note that the search string must be at least as long as the width you skip, otherwise you'd have to search entire trees.

Would this method be efficient at finding string permutations?

#include <iostream>
#include <string>
using namespace std;

int main()
{
    string word;
    cin >> word;
    int s = word.size();
    string original_word = word;
    do
    {
        for (decltype(s) i = 1; i != s; ++i) {
            auto temp = word[i - 1];
            word[i - 1] = word[i];
            word[i] = temp;
            cout << word << endl;
        }
    } while (word != original_word);
}
Is this solution efficient, and how does it compare to doing it recursively?
Edit: When I tested the program it displayed all the permutations,
i.e. cat produced:
cat
act
atc
tac
tca
cta
Let's imagine tracing this code on the input 12345. On the first pass through the do ... while loop, your code steps the array through these configurations:
21345
23145
23415
23451
Notice that after this iteration of the loop finishes, you've cyclically shifted the array one step. This means that at the end of the next do ... while loop, you'll have cyclically shifted the array twice, then three times, then four times, etc. After n iterations, this resets the array back to its original configuration. Since each pass of bubbling a character to the end goes through n intermediary steps, this means that your approach will generate at most n^2 different permutations of the input string. However, there are n! possible permutations of the input string, and n! greatly exceeds n^2 for all n ≥ 4. As a result, this approach can't generate all possible permutations, since it doesn't produce enough unique combinations before returning back to the start.
If you're interested in learning about a ton of different ways to enumerate permutations by individual swaps, you may want to pick up a copy of The Art of Computer Programming or search online for different methods. This is a really interesting topic and in the course of working through these algorithms I think you'll learn a ton of ways to analyze different algorithms and prove correctness.

Increase string overlap matrix building efficiency

I have a huge list (N = ~1 million) of strings 100 characters long, and I'm trying to find the overlaps between them. For instance, one string might be
XXXXXXXXXXXXXXXXXXAACTGCXAACTGGAAXA (and so on)
I need to build an N by N matrix that contains the longest overlap value for every string with every other string. My current method is (pseudocode)
read in all strings to array
create empty NxN matrix
compare each string to every string with a higher array index (to avoid redoing comparisons)
Write longest overlap to matrix
There's a lot of other stuff going on, but I really need a much more efficient way to build the matrix. Even with the most powerful computing clusters I can get my hands on this method takes days.
In case you didn't guess, these are DNA fragments. X indicates "wild card" (probe gave below a threshold quality score) and all other options are a base (A, C, T, or G). I tried to write a quaternary tree algorithm, but this method was far too memory intensive.
I'd love any suggestions you can give for a more efficient method; I'm working in C++ but pseudocode/ideas or other language code would also be very helpful.
Edit: some code excerpts that illustrate my current method. Anything not particularly relevant to the concept has been removed
// part that compares them all to each other
for (int j = 0; j < counter; j++) {        // counter holds # of DNA strings
    for (int k = j + 1; k < counter; k++) {
        int test = determineBestOverlap(DNArray[j], DNArray[k]);
        // boring stuff
    }
}

// part that compares strings. Definitely very inefficient,
// although I think the sheer number of comparisons is the main problem
int determineBestOverlap(string str1, string str2)
{
    int maxCounter = 0, bestOffset = 0;
    // basically just tries overlapping the strings every possible way
    for (int j = 0; j < str2.length(); j++)
    {
        int counter = 0, offset = 0;
        while (str1[offset] == str2[j + offset] && str1[offset] != 'X')
        {
            counter++;
            offset++;
        }
        if (counter > maxCounter)
        {
            maxCounter = counter;
            bestOffset = j;
        }
    }
    return maxCounter;
} // this simplified version doesn't account for flipped strings
Do you really need to know the match between ALL string pairs? If yes, then you will have to compare every string with every other string, which means you will need n^2/2 comparisons, and you will need half a terabyte of memory even if you just store one byte per string pair.
However, I assume what you are really interested in is long overlaps: those that have more than, say, 20 or 30 or even more than 80 characters in common. You probably don't really want to know if two strings have 3 characters in common while 50 others are X and the remaining 47 don't match.
What I'd try if I were you - still without knowing if it fits your application - is:
1) From each string, extract the largest substring(s) that make(s) sense. I guess you want to ignore 'X'es at the start and end entirely, and if some "readable" parts are broken up by a large number of 'X'es, it probably makes sense to treat the readable parts individually instead of using the longer string. A lot of this "which substrings are relevant?" depends on your data and application, which I don't really know.
2) Make a list of these longest substrings, together with the number of occurrences of each substring. Order this list by string length. You may, but don't have to, store the indexes of the original strings together with each substring. You'll get something like (example)
AGCGCTXATCG 1
GAGXTGACCTG 2
.....
CGCXTATC 1
......
3) Now, from the top to the bottom of the list:
a) Set the "current string" to the string topmost on the list.
b) If the occurrence count next to the current string is > 1, you found a match. Search your original strings for the substring if you haven't remembered the indexes, and mark the match.
c) Compare the current string with all strings of the same length, to find matches where some characters are X.
d) Remove the 1st character from the current string. If the resulting string is already in your table, increase its occurrence counter by one, else enter it into the table.
e) Repeat Step 3d with the last, instead of the first, character removed from the current string.
f) Remove the current string from the list.
g) Repeat from 3a) until you run out of computing time, or your remaining strings become too short to be interesting.
Whether this is a better algorithm depends very much on your data and which comparisons you're really interested in. If your data is very random / you have very few matches, it will probably take longer than your original idea. But it might allow you to find the interesting parts first and skip the less interesting parts.
I don't see many ways around the fact that you need to compare each string with every other one, including shifting them; that is by itself super long, so a computing cluster seems the best approach. The only thing I see how to improve is the string comparison itself: replace A, C, T, G and X by binary patterns:
A = 0x01
C = 0x02
T = 0x04
G = 0x08
X = 0x0F
This way you can store one item in 4 bits, i.e. two per byte (this might not be a good idea, though; it's still a possible option to investigate), and then compare items quickly with an AND operation: two positions match exactly when the AND of their patterns is non-zero, so you 'just' have to count how many consecutive non-zero values you get. That's just a way to handle the wildcard; sorry, I don't have a better idea for reducing the complexity of the overall comparison.

Topic mining algorithm in c/c++

I am working on a subject-extraction-from-articles algorithm in C++.
First I have written code to remove words like articles, prepositions, etc.
Then the rest of the words get stored in one char array: char *excluded_string[50] = { 0 };
while ((NULL != word) && (50 > i)) {
    ch[i] = strdup(word);
    excluded_string[j] = strdup(word);
    word = strtok(NULL, " ");
    skp = BoyerMoore_skip(ch[i], strlen(ch[i]));
    if (skp != NULL)
    {
        i++;
        continue;
    }
    j++;
}
skp is NULL when ch[i] is not an article or a word of a similar category.
This function checks whether a word belongs to the articles/prepositions/etc. category.
Now at the end excluded_string[] contains the set of required words. I want the occurrence count of each word in this array, and after that the word with the maximum occurrences - all of them, if there is more than one.
What logic should I use?
What I thought of is:
Take a two-dimensional array. The first column will hold the word, and the 2nd column I can use for storing the count value.
Then, for each word, search the array, and for each occurrence of that word increment the count value stored for that word in the 2nd column.
But this is costly and also complex.
Any other idea?
If you wish to count the occurrences of each word in an array then you can do no better than O(n) (i.e. one pass over the array). However, if you try to store the word counts in a two dimensional array then you must also do a lookup each time to see if the word is already there, and this can quickly become O(n^2).
The trick is to use a hash table to do your lookup. As you step through your word list you increment the right entry in the hash table. Each lookup should be O(1), so it ought to be efficient as long as there are sufficiently many words to offset the complexity of the hashing algorithm and memory usage (i.e. don't bother if you're dealing with less than 10 words, say).
Then, when you're done, you just iterate over the entries in the hash table to find the maximum. In fact, I would probably keep track of that while counting the words so there's no need to do it after ("if thisWordCount is greater than currentMaximumCount then currentMaximum = thisWord").
I believe the standard C++ unordered_map type should do what you need.