C++ count functional words occurrence - c++

I'm trying to count occurrences of specific words in a text file. The problem is that my code reads the file using whitespace as the delimiter, but some of the entries I want to count consist of more than one word, for example "out from".
There is also a second problem: words like "aren't" and "don't". My code seems to ignore these words even when I put them in the map with a backslash. My guess is that they get lost while being read from the file for some reason.
The end result I am looking for is the frequency of each word I am searching for.
std::list<std::string> Fwords = {
"a","abroad","as far as","ahead of"};
// Begin reading from file:
std::ifstream fileStream(fileName);
// Check if we've opened the file (as we should have).
if (fileStream.is_open())
while (fileStream.good())
{
// Store the next word in the file in a local variable.
std::string word;
fileStream >> word;
std::cout << "This is the word: " << word << endl;
if (std::find(std::begin(Fwords), std::end(Fwords), word) != std::end(Fwords))
wordsCount[word]++;
}
input:
"ahead of me as far as abroad me"
this would be the expected output:
abroad:1
ahead of:1
as far as:1

This approach won't work. Your problem is that you're reading one word at a time from the file. No amount of backslashing or manipulating the list / map of words will fix that.
But how are you supposed to know how many words to read? You don't; it'll have to be trial and error.
One way to "brute force" this, considering your level of programming, would be to add an else case to
if (std::find(std::begin(Fwords), std::end(Fwords), word) != std::end(Fwords))
{
// ...
}
in which you check for entries in your list of words that begin with the word from the file followed by a space, e.g. for "as" the search is for "as ". If one or more matches are found, read another word from the file and append it, giving e.g. "as far". Put this in a loop (or a function called in a loop) so that searching for "as far " and reading the next word, "as", happens automatically. Upon finding "as far as" in the list, you're done with that phrase. You're also done when the search fails, i.e. nothing in your list starts with "as " or "as far ", or matches "as far as". In that case, run a for loop over each word you have read so far to check whether it is an entry by itself, and increase its count if so. For that you'll need the same check as in your original code, so it would be smart to factor it out into a function as well.
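As a rough illustration, here is a minimal sketch of a closely related variant. It is not the incremental read-ahead described above: it reads all words first and then tries the longest possible phrase at each position. The function name countPhrases and the choice of std::set for the phrase list are assumptions made for the example, and punctuation handling (e.g. "aren't") is not addressed.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: counts single- and multi-word phrases in a stream.
// At each position the longest phrase found in `phrases` wins.
std::map<std::string, int> countPhrases(std::istream& in,
                                        const std::set<std::string>& phrases)
{
    // Read every whitespace-separated word up front.
    std::vector<std::string> words;
    for (std::string w; in >> w;)
        words.push_back(w);

    // The longest phrase (measured in words) bounds how far ahead to look.
    std::size_t maxLen = 1;
    for (const std::string& p : phrases) {
        std::size_t n = 1 + static_cast<std::size_t>(std::count(p.begin(), p.end(), ' '));
        maxLen = std::max(maxLen, n);
    }

    std::map<std::string, int> counts;
    std::size_t i = 0;
    while (i < words.size()) {
        std::size_t matched = 0;
        std::size_t limit = std::min(maxLen, words.size() - i);
        // Try the longest candidate first, then shorter ones.
        for (std::size_t len = limit; len >= 1 && matched == 0; --len) {
            std::string candidate = words[i];
            for (std::size_t k = 1; k < len; ++k)
                candidate += ' ' + words[i + k];
            if (phrases.count(candidate)) {
                ++counts[candidate];
                matched = len;
            }
        }
        // Skip past the matched phrase, or past one word if nothing matched.
        i += (matched ? matched : 1);
    }
    return counts;
}
```

Passing an std::ifstream opened on the file and a set built from Fwords should give the counts for the sample input in the question. Reading the whole file into memory is a simplification that may not suit very large files.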

Related

When reading from a file in C++, can I just copy the text itself?

Sorry, the wording for the actual question is probably wrong. I have a program that reads in a line from a .txt file and then puts the string into an object to compare it to a string entered by the user. I haven't been able to get it to match, and when I've tried to see what is entered, I don't see much. Maybe there's an invisible character denoting the end of the line? I've tried code like this:
std::cout << "...." << table[row][col]->get() << "...." <<std::endl;
And got
....a
as the result. When reading the file I used std::getline() if that makes a difference.
I didn't find a true fix, although I did see that the length of the read-in string was one character longer than the actual word. I was able to use a substring to cut the end off the string.
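One plausible cause, though it is only an assumption here, is a trailing carriage return: if the file has Windows (CRLF) line endings and is read where they are not translated, std::getline strips the '\n' but leaves the '\r'. That would explain both the extra character and the output, since printing '\r' moves the cursor back to the start of the line so the second "...." overwrites the first. A minimal sketch, with getlineNoCR as a made-up helper name:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical wrapper around std::getline that drops one trailing '\r',
// which is what remains of a CRLF line ending after '\n' is consumed.
bool getlineNoCR(std::istream& in, std::string& line)
{
    if (!std::getline(in, line))
        return false;
    if (!line.empty() && line.back() == '\r')
        line.pop_back();
    return true;
}
```

This removes the character only when it is actually present, which is a little safer than always cutting the last character with a substring.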

Reducing time complexity of string comparison

I have a dictionary .txt file with probably over a thousand words and their definitions. I've already written a program to take the first word of each line from this file and check it against a string input by the user:
string checkWord(string input) //returns a status message, so the return type cannot be void
{
std::ifstream inFile;
inFile.open("Oxford.txt");
if (inFile.is_open())
{
string line; //there is a "using std::string" in another file
while (getline(inFile, line))
{
//read the first word from each line
std::istringstream iss(line);
string word;
iss >> word;
//make sure the strings being compared are the same case
std::transform(word.begin(), word.end(), word.begin(), ::tolower);
std::transform(input.begin(), input.end(), input.begin(), ::tolower);
if (word == input)
{
//Do a thing with word
}
}
inFile.close();
return "End of file";
}
else
{
return "Unable to open file";
}
}
But if I'm checking more than a sentence, the time it takes to process becomes noticeable. I've thought about a few ways of making this time shorter:
Making a .txt file for each letter of the alphabet (pretty easy to do, but not really a fix in the long term)
Using unordered_set to compare the strings (like in this question); the only problem with this might be the initial creation of the set from the text file
Using some other data structure to compare strings? (Like std::map)
Given that the data is already "sorted", what kind of data structure or method should I employ in order to (if possible) reduce time complexity? Also, are there any issues with the function I am using to compare strings? (for example, would string::compare() be quicker than "=="?)
A tree (std::map)? Or a hash map (std::unordered_map)? Your linear search is obviously a brute force solution! Both of the above will be substantially superior for multiple searches.
Of course, that only really helps if you are going to use this data multiple times per program run, which you didn't specify in your question. If not, there's not really much benefit in loading and parsing and storing all the data only to perform a single lookup then quit. Just put a break in on success, at least.
You imply that your input file is sorted. You could hack together a binary search solution with file seeking (which is really cheap) and snapping to the nearest newline on each iteration to determine roughly where all the words with the same leading (say) three characters are in your file. For a thousand entries, though, this is probably overkill.
So, there are "simple" fixes, and there are some more complex ones.
The first step is to move all unnecessary things out of the search-loop: Lowercase input once, before the loop, rather than every time - after all, it's not changing. If possible, lowercase the Oxford.txt too, so you don't have to lowercase word for every line.
If you are searching the file multiple times, reading a file multiple times is definitely not a great solution - even if it's cached in the filesystem the second time.
So read it once into some container. A really simple one would be std::vector [lower-casing each string at the same time], which you then just iterate over. The next improvement would be to sort the vector and use a binary search (std::binary_search and std::lower_bound in <algorithm> already provide one, so you don't have to write it yourself).
A slightly more complex solution [but faster to search] would be to use std::map<std::string, std::string> wordlist (but that also takes a bit more space), then use auto pos = wordlist.find(input); if (pos != wordlist.end()) ... found word ....
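A minimal sketch of the load-once idea follows. It uses std::unordered_map rather than std::map (either should work for plain lookups), and it assumes each line has the form "word definition...", which is a guess at the layout of Oxford.txt. The names toLower and loadDictionary are invented for the example.

```cpp
#include <algorithm>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Lower-case a copy of the string; the cast avoids passing negative
// char values to std::tolower.
std::string toLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Parse the whole stream once into a map keyed by the lower-cased first word.
// If a word occurs twice, the first definition is kept.
std::unordered_map<std::string, std::string> loadDictionary(std::istream& in)
{
    std::unordered_map<std::string, std::string> dict;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        std::string word;
        if (!(iss >> word))
            continue; // skip blank lines
        std::string definition;
        std::getline(iss >> std::ws, definition); // rest of the line, may be empty
        dict.emplace(toLower(word), definition);
    }
    return dict;
}
```

After loading, each check is a single dict.find(toLower(input)) compared against dict.end(), so the file is read only once per program run instead of once per word.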
You can benefit from using a prefix tree, also known as a trie data structure, as it fits the use case of having a dictionary and frequently looking up words in it.
The simplest model of a trie is a tree where each node holds a letter and a flag to tell whether the current letter is the end of a word (and, additionally, pointers to other data about the word).
Example picture of a trie containing the dictionary aback abate bid bird birth black blast:
To search for a word, start from the root, and for each letter of your word, follow the node containing the current letter (halt if it isn't present as a child of the current node). The search time is proportional to the look up word length, instead of to the size of your dictionary.
A trie also allows you to easily get the alphabetic (lexicographical) order of words in a dictionary: just do a pre-order traversal of it.
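The description above can be sketched roughly as follows. This is a simple illustrative version, not a tuned implementation: the class name Trie and its members are made up for the example, children are kept in a std::map per node, and std::make_unique requires C++14.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

class Trie {
public:
    // Walk down from the root, creating missing nodes, then mark the end.
    void insert(const std::string& word)
    {
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child)
                child = std::make_unique<Node>();
            node = child.get();
        }
        node->isWord = true;
    }

    // Lookup cost depends on the length of `word`, not on the dictionary size.
    bool contains(const std::string& word) const
    {
        const Node* node = &root_;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end())
                return false; // halt: letter is not a child of this node
            node = it->second.get();
        }
        return node->isWord;
    }

    // Pre-order traversal; std::map iterates children in sorted key order,
    // so the words come out sorted by character value.
    std::vector<std::string> sortedWords() const
    {
        std::vector<std::string> out;
        std::string prefix;
        collect(root_, prefix, out);
        return out;
    }

private:
    struct Node {
        bool isWord = false;
        std::map<char, std::unique_ptr<Node>> children;
    };

    static void collect(const Node& node, std::string& prefix,
                        std::vector<std::string>& out)
    {
        if (node.isWord)
            out.push_back(prefix);
        for (const auto& kv : node.children) {
            prefix.push_back(kv.first);
            collect(*kv.second, prefix, out);
            prefix.pop_back();
        }
    }

    Node root_;
};
```

To attach definitions, the isWord flag could be replaced by a pointer or index to the entry's data, as the text suggests.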
Instead of storing everything in a .txt file, store it in a real database.
SQLite3 is a good choice for simple tasks, since it is in-process instead of requiring an external server.
For a very simple use case like this, the C API and the SQL statements should be very easy to learn.
Something like:
-- Only do this once, for setup, not each time you run your program.
sqlite> CREATE TABLE dictionary (word TEXT PRIMARY KEY);
sqlite> .import /usr/share/dict/words dictionary
-- Do this every time you run your program.
sqlite> select count(*) from dictionary where word = 'a';
1

Finding words in text file to extract data using C++

I am trying to find the total amount of time spent doing a certain activity with C++ and Mac Automator (you do not need to know Automator to help me). I am using Mac Automator to output a text file using the "Event Summary" and "New Text File" actions. It outputs a text file like this:
Viewable text file
I am currently struggling over something very trivial; I cannot accurately find the words "Time" and "Date" in the text file. If I cannot find the words "Time" and "Date" I cannot begin processing the total amount of time spent doing that activity or whether that activity went over midnight (I sometimes work into the ams). So far I think I have spent four hours with mixed results. Any feedback would be appreciated.
The code below is what I am using at the moment. I can find the words "Time" and "Date" at the very start of the file, or if a ':' is directly in front of the word, but when the word is on a different line the program fails:
cout << "Reading from the file...." << endl;
infile.open("calendar workflow text.txt");
while(infile.getline(buff, BUFFSIZE, ':')){ //reads everything
cout << buff << endl; // prints everything
if(strcmp("Time",buff)==0){
cout<<"Time found in text\n"<<endl;
}
else if (strcmp("Date",buff)==0){
cout<<"Date found in text\n"<<endl;
}
}
infile.close();
cout<<"Total Time in all events: "<<sumtime<<" hrs"<<endl;
return (0);
If you want the automator workflow I can give it to you.
There are a few assumptions that you need to check for:
1. Is "Time" always going to start at the very first character position? No space or tab before you see the word "Time"?
2. Can there be multiple "Time" words in the same line?
3. Can another word appear before "Time"?
strcmp("Time",buff) assumes that your entire string "buff" has just one word in it "Time".
That is not what you want. If assumption 1 is true, you can simply do
if (strncmp(buff, "Time", 4) == 0) {
// do something, as you found time
}
Otherwise, for a generic position, you can use strstr(buff, "Time"), for a substring match where "Time" could be anywhere in the string. Once you get the position, skip over exactly the number of characters to get to the value for time. Extract that and perform your calculations.
Typically, when parsing files, you will have to make some allowance for spaces, tabs, etc. Otherwise, the code becomes too brittle and can fail test cases that deviate ever so slightly.
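For illustration, here is a minimal sketch of the substring approach using std::string::find, the C++ counterpart of strstr, on whole lines read with std::getline. The actual layout of the Automator file is not shown in the question, so the "Key: value" format is an assumption, and Field and extractFields are names invented for the example.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Field {
    std::string key;
    std::string value;
};

// Scan the stream line by line. Wherever "<key>:" appears in a line
// (not only at the start), record the text that follows the colon,
// with leading spaces and tabs skipped.
std::vector<Field> extractFields(std::istream& in,
                                 const std::vector<std::string>& keys)
{
    std::vector<Field> found;
    std::string line;
    while (std::getline(in, line)) {
        for (const std::string& key : keys) {
            std::size_t pos = line.find(key + ":");
            if (pos == std::string::npos)
                continue;
            std::size_t start = line.find_first_not_of(" \t", pos + key.size() + 1);
            std::string value =
                (start == std::string::npos) ? std::string() : line.substr(start);
            found.push_back({key, value});
        }
    }
    return found;
}
```

Because the line is split at '\n' rather than at ':', a time value such as "10:00 PM" stays in one piece, and indentation before "Time" or "Date" no longer matters.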

How to read a message from a file, modifying only words?

Suppose I have the following text:
My name is myName. I love
stackoverflow .
Hi, Guys! There is more than one space after "Guys!" 123
And also after "123" there are 2 spaces and newline.
Now I need to read this text file as it is, and perform some actions only on the alphanumeric words. Afterwards I have to print the text with the changed words, but with spaces, newlines and punctuation unchanged and in the same positions. The length of each alphanumeric word stays the same when it is changed. I have tried this with library checking for alphanumeric values, but the code gets very messy. Is there any other way?
You can read your file line by line with the fgets() function. It fills a char array that you can then work with, e.g. iterate over it, split it into alphanumeric words, change the words, and then write the modified string to a new file with the fwrite() function.
If you prefer the C++ way of working with files (iostream), you can use istream::getline. It preserves spaces, but it consumes the "\n". If you need to preserve even the line endings (which can sometimes be '\r' or '\r\n'), you can use istream::get.
Maybe you should look at Boost Tokenizer. It can break up a string into a series of tokens and iterate through them. The following sample breaks up a phrase into words:
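#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>

int main()
{
std::string s = "Hi, Guys! There is more...";
// The default tokenizer splits on whitespace and punctuation.
boost::tokenizer<> tok(s);
for(boost::tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg)
{
std::cout << *beg << "\n";
}
return 0;
}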
But in your case you need to provide a TokenizerFunc that will break up a string at alphanumeric/non-alphanumeric boundaries.
For more information see Boost Tokenizer documentation and implementation of an already provided char_separator, offset_separator and escaped_list_separator.
Code usually gets messy because the problem was not broken down into clear functions and classes. If you do that, you will have a few functions that each do precisely one thing (not messy). Your main function will then just call these simple functions. If the function names are well chosen, the main function will become short and clear, too.
In this case, your main function needs to do:
Loop: Read every line of a file
On every line, check if and where a "special" word occurs.
If a special word occurs, replace it
Extra hints: a line of text can be stored as a std::string and can be read by std::getline(std::cin, line)
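As one possible way to keep this tidy, the sketch below works on the whole text as a single string and rewrites each run of alphanumeric characters in place, so spaces, newlines and punctuation never move. The function name transformWords and the callback parameter are assumptions made for the example, and reversing each word merely stands in for whatever change is actually required.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Apply `change` to every maximal run of alphanumeric characters in `text`.
// All other characters keep their exact positions.
std::string transformWords(std::string text,
                           const std::function<std::string(const std::string&)>& change)
{
    std::size_t i = 0;
    while (i < text.size()) {
        if (!std::isalnum(static_cast<unsigned char>(text[i]))) {
            ++i; // separator: leave untouched
            continue;
        }
        // Find the end of the current alphanumeric run.
        std::size_t end = i;
        while (end < text.size() &&
               std::isalnum(static_cast<unsigned char>(text[end])))
            ++end;
        std::string replacement = change(text.substr(i, end - i));
        // Write back only if the length is preserved, as the question requires.
        if (replacement.size() == end - i)
            text.replace(i, end - i, replacement);
        i = end;
    }
    return text;
}
```

For example, calling it with a callback that returns std::string(w.rbegin(), w.rend()) reverses every word while leaving the layout alone. The input string can be filled line by line or character by character with istream::get, as described in the answers above.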

How do you read a word in from a file in C++?

So I was feeling bored and decided I wanted to make a hangman game. I did an assignment like this back in high school when I first took C++. But this was before I even took geometry, so unfortunately I didn't do well in any way, shape or form in it, and after the semester I trashed everything in a fit of rage.
I'm looking to make a txt document and just throw in a whole bunch of words
(ie:
test
love
hungery
flummuxed
discombobulated
pie
awkward
you
get
the
idea
)
So here's my question:
How do I get C++ to read a random word from the document?
I have a feeling #include<ctime> will be needed, as well as srand(time(0)); to get some kind of pseudorandom choice...but I haven't the foggiest on how to have a random word taken from a file...any suggestions?
Thanks ahead of time!
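Here's a rough sketch, assuming that the words are separated by whitespace (space, tab, newline, etc.):
vector<string> words;
ifstream in("words.txt");
string word;
// Test the extraction itself, so a failed read at end of file is not stored.
while(in >> word) {
words.push_back(word);
}
// Assumes the file held at least one word; otherwise words.size() is 0.
string r=words[rand()%words.size()];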
The operator >> used on a string will read 1 (white) space separated word from a stream.
So the question is do you want to read the file each time you pick a word or do you want to load the file into memory and then pick up the word from a memory structure. Without more information I can only guess.
Pick a Word from a file:
// Note: an ifstream is also an istream.
std::string pickWordFromAStream(std::istream& s,std::size_t pos)
{
std::istream_iterator<std::string> iter(s);
for(;pos;--pos)
{ ++iter;
}
// This code assumes that pos is smaller or equal to
// the number of words in the file
return *iter;
}
Load a file into memory:
void loadStreamIntoVector(std::istream& s,std::vector<std::string>& words) // by reference, so the caller's vector is filled
{
std::copy(std::istream_iterator<std::string>(s),
std::istream_iterator<std::string>(),
std::back_inserter(words)
);
}
Generating a random number should be easy enough, assuming you only want pseudo-random numbers.
I would recommend creating a plain text file (.txt) in Notepad and using the standard C file APIs (fopen() and fread()) to read from it. You can use fgets() to read the file one line at a time.
Once you have your plain text file, just read each line into an array and then randomly choose an entry in the array using the method you've suggested above.