Count the number of unique words and occurrence of each word - c++

CSCI-15 Assignment #2, String processing. (60 points) Due 9/23/13
You MAY NOT use C++ string objects for anything in this program.
Write a C++ program that reads lines of text from a file using the ifstream getline() method, tokenizes the lines into words ("tokens") using strtok(), and keeps statistics on the data in the file. Your input and output file names will be supplied to your program on the command line, which you will access using argc and argv[].
You need to count the total number of words, the number of unique words, the count of each individual word, and the number of lines. Also, remember and print the longest and shortest words in the file. If there is a tie for longest or shortest word, you may resolve the tie in any consistent manner (e.g., use either the first one or the last one found, but use the same method for both longest and shortest). You may assume the lines comprise words (contiguous lower-case letters [a-z]) separated by spaces, terminated with a period. You may ignore the possibility of other punctuation marks, including possessives or contractions, like in "Jim's house". Lines before the last one in the file will have a newline ('\n') after the period. In your data files, omit the '\n' on the last line. You may assume that the lines will be no longer than 100 characters, the individual words will be no longer than 15 letters and there will be no more than 100 unique words in the file.
Read the lines from the input file, and echo-print them to the output file. After reaching end-of-file on the input file (or reading a line of length zero, which you should treat as the end of the input data), print the words with their occurrence counts, one word/count pair per line, and the collected statistics to the output file. You will also need to create other test files of your own. Also, your program must work correctly with an EMPTY input file – which has NO statistics.
Test file looks like this (exactly 4 lines, with NO NEWLINE on the last line):
the quick brown fox jumps over the lazy dog.
now is the time for all good men to come to the aid of their party.
all i want for christmas is my two front teeth.
the quick brown fox jumps over a lazy dog.
Copy and paste this into a small file for one of your tests.
Hints:
Use a 2-dimensional array of char, 100 rows by 16 columns (why not 15?), to hold the unique words, and a 1-dimensional array of ints with 100 elements to hold the associated counts. For each word, scan through the occupied lines in the array for a match (use strcmp()), and if you find a match, increment the associated count, otherwise (you got past the last word), add the word to the table and set its count to 1.
The separate longest word and the shortest word need to be saved off in their own C-strings. (Why can't you just keep a pointer to them in the tokenized data?)
Remember – put NO NEWLINE at the end of the last line, or your test for end-of-file might not work correctly. (This may cause the program to read a zero-length line before seeing end-of-file.)
This is not a long program – no more than about 2 pages of code
Here is what I have so far:
#include<iostream>
#include<iomanip>
#include<fstream>
#include<string>
#include<cstring>
using namespace std;
void totalwordCount(ifstream &inputFile)
{
char words[100][16]; // Holds the unique words.
char *token;
int totalCount = 0; // Counts the total number of words.
// Read every word in the file.
while(inputFile >> words[99])
{
totalCount++; // Increment the total number of words.
// Tokenize each word and remove spaces, periods, and newlines.
token = strtok(words[99], " .\n");
while(token != NULL)
{
token = strtok(NULL, " .\n");
}
}
cout << "Total number of words in file: " << totalCount << endl;
}
void uniquewordCount(ifstream &inputFile)
{
char words[100][16]; // Holds the unique words
int counter[100];
char *tok = "0";
int uniqueCount = 0; // Counts the total number of unique words
while(!inputFile.eof())
{
uniqueCount++;
tok = strtok(words[99], " .\n");
while(tok != NULL)
{
tok = strtok(NULL, " .\n");
inputFile >> words[99];
if(strcmp(tok, words[99]) == 0)
{
counter[99]++;
}
else
{
words[99][15] += 1;
}
uniqueCount++;
}
}
cout << counter[99] << endl;
}
int main(int argc, char *argv[])
{
ifstream inputFile;
char inFile[12] = "string1.txt";
char outFile[16] = "word result.txt";
// Get the name of the file from the user.
cout << "Enter the name of the file: ";
cin >> inFile;
// Open the input file.
inputFile.open(inFile);
// If successfully opened, process the data.
if(inputFile)
{
while(!inputFile.eof())
{
totalwordCount(inputFile);
uniquewordCount(inputFile);
}
}
return 0;
}
I already took care of how to count the total number of words in the file in the totalwordCount() function, but in the uniquewordCount() function, I am having trouble counting the total number of unique words and counting the number of occurrences of each word. Is there anything that I need to change in the uniquewordCount() function?

This program contains several issues which are to be considered harmful! To prevent bad software being created based on entirely nonsensical assignments like the above, here are a number of hints:
Always test the stream for success after reading from it. Using in.eof() to determine if the stream is in a good state does not work! One of the problems is that you will get an infinite loop if the stream goes bad for a different reason than end of file, e.g., failure to correctly parse a value (this will set std::ios_base::failbit but not std::ios_base::eofbit.
Reading to a fixed size char array a using in >> a without having set up limits for the number of characters to be read is the C++ way to spell gets()! If you really think that using in >> a is the right way to (see next item), you absolutely need to set up the array's width, e.g., using in >> std::setw(sizeof(a)) >> a. You still need to check that this extraction was successful, of course.
From the looks of it, your teacher wants you to actually use std::istream::getline() to read the array, e.g., using in.getline(a, sizeof(a)) (which, of course, needs to be checked for success).
Note that the formatted input, i.e., in >> a already tokenizes the stream being received by spaces! There is no need to faff about with strtok() after that.
Once you have consumed a stream, it is consumed. Assuming the characters don't come from a file but rather from something like standard input, you also can't rewind the stream to read it again. I'd think you want to tokenize the values once and use them for both purposes.
This is more of a sidenote: after you created a stream, its nature should be entirely immaterial for the processing of the stream's content (although, e.g., for string streams you might want to eventually collect the result using the str() member): implement your stream processing functions in terms of std::istream rather than std::ifstream!
Since you have a concrete question ("Is there anything that I need to change in the uniquewordCount() function?"): yes, everything! Throw away this function entirely and rethink what you need to do. Basically, the structure of the functionality should be along the lines of
char buffer[100];
while (in.getline(buffer, sizeof(buffer))) {
// tokenize buffer into words
// for each word check if it already exists
// if the word does not exist, append it to the array of known words and set count to 1
// if the word exists, increment the count
// determine if the word is shorter or longer than the shortest or longest word so far
// if it is the case, remember the word's index or a pointer to it
}

Related

String Management C/C++ & Writing and Reading From txt File

I am facing a problem with reading and writing a string from and to a file respectively.
Purpose:
To enter a string into a text file as a complete sentence, read the string from the text file and separate all words that start from a vowel using a function and display them as a sentence. (The sentence just needs to consist of the words from the string that start with a vowel.)
Problem:
The code is working as intended but as i have used the getline() function to obtain the string from the txt file when i withdraw a substring from it, it includes the entire file after the vowel instead of just the word. I cannot understand how to make the substring only include words.
Code:
#include <fstream>
#include <string>
#include <iostream>
#include <cstring>
using namespace std;
string vowels(string a)
{
int c=sizeof(a);
string b[c];
string d;
static int n;
for(int i=1;i<=c;i++)
{
if (a.find("a")!=-1)
{
b[i]=a.substr(a.find("a",n));
d+=b[i];
n=a.find("a")+1;
}
else if (a.find("e")!=-1)
{
b[i]=a.substr(a.find("e",n));
d+=b[i];
n=a.find("e")+1;
}
else if (a.find("i")!=-1)
{
b[i]=a.substr(a.find("i",n));
d+=b[i];
n=a.find("i")+1;
}
else if (a.find("o")!=-1)
{
b[i]=a.substr(a.find("o",n));
d+=b[i];
n=a.find("o")+1;
}
else if (a.find("u")!=-1)
{
b[i]=a.substr(a.find("u",n));
d+=b[i];
n=a.find("u")+1;
}
}
return d;
}
int main()
{
string input,lne,e;
ofstream file("output.txt", ios::app);
cout<<"Please input text for text file input: ";
getline(cin,input);
file << input;
file.close();
ifstream myfile("output.txt");
getline(myfile,lne);
e=vowels(lne);
cout<<endl<<"Text inside file reads: ";
cout<<lne;
cout<<endl;
cout<<e<<endl;
system("pause");
myfile.close();
return 0;
}
I haven't read your code VERY carefully, but several things stand out:
Look up find_first_of - it'll simplify your code A LOT.
sizeof(a) certainly doesn't do what you think it does [unless you think it gives you the size of the std::string class type - which makes it rather strange as a use-case, why not use either 12 or 24?]
find (and find_first_of), technically speaking, doesn't return -1 when the function isn't finding what you want. It returns std::string::npos [which may appear to be -1, but a) is not guaranteed to be, and b) is unsingned so can't be negative].
Your program only reads one line.
x.substr(n) will give you the string of x from position n - is that what you want?
Don't repeat find, use p = x.find("X"); and then do x.substr(p) [assuming that is what you want].
There are various problems with your code.
int c = sizeof( a );
This is the number of bytes that a string takes up in memory. And you certainly don't want to create an array of this many strings as it makes no sense for what you're trying to achieve. Don't do this to yourself. You're only copying one string inside the loop, all you need is one string and you already have string d.
To get the actual size of a string, you have to call
str.size()
The string.substr(..) has a couple overloads, one of them takes only one argument, an index. This will return sub string starting at that index in the original string. (The string starting at the vowel all the way through to the end of the string)
What you are maybe looking for is the overload that takes two arguments, the start index (beginning of the word and the end of the word).
The string input will not take the newline that you enter to flush cin. And then you add it to the file in append mode, so after running the program a few times your file is a huge one-liner. Did you really intend to do this?
Maybe you should explicitly add a new line to the file after entering the input. Something like file << std::endl;
Also, the conditions in the ifs
if (a.find("a")!=-1)
Don't match what you do next,
b[i]=a.substr(a.find("a",n));
Then you use a static int,
static int n;
This is bad, because this function will only work once. You're lucky that static initializes its values to zero, but you should always initialize explicitly. In your case, you don't need this to be static.
Finally: "so i was unsure of how many loops to run"
When you don't know how many loops you have to run, then a for loop is not adequate.
You should use a while loop or a do while.
You shouldn't try to learn C++ by guessing, because that's what it looks like you're doing. You're trying to do more than you know and making some very silly mistakes. Find a good book to learn from, or at the very least google the functions you're using to see what they do and how to use them properly. (ie: http://www.cplusplus.com/reference/string/string/substr/ )
Here's a list of books from stackoverflow's FAQ: The Definitive C++ Book Guide and List
The last thing is about finding vowels. When you find a vowel, you have to make sure it's at the beginning of a word. Then you want to read it until the word ends, that is when you find a character that is not part of a word. (a whitespace, certain punctuation, ... ) This should mark the beginning and end of the word.

seekg() not working as expected

I have a small program, that is meant to copy a small phrase from a file, but it appears that I am either misinformed as to how seekg() works, or there is a problem in my code preventing the function from working as expected.
The text file contains:
//Intro
previouslyNoted=false
The code is meant to copy the word "false" into a string
std::fstream stats("text.txt", std::ios::out | std::ios::in);
//String that will hold the contents of the file
std::string statsStr = "";
//Integer to hold the index of the phrase we want to extract
int index = 0;
//COPY CONTENTS OF FILE TO STRING
while (!stats.eof())
{
static std::string tempString;
stats >> tempString;
statsStr += tempString + " ";
}
//FIND AND COPY PHRASE
index = statsStr.find("previouslyNoted="); //index is equal to 8
//Place the get pointer where "false" is expected to be
stats.seekg(index + strlen("previouslyNoted=")); //get pointer is placed at 24th index
//Copy phrase
stats >> previouslyNotedStr;
//Output phrase
std::cout << previouslyNotedStr << std::endl;
But for whatever reason, the program outputs:
=false
What I expected to happen:
I believe that I placed the get pointer at the 24th index of the file, which is where the phrase "false" begins. Then the program would've inputted from that index onward until a space character would have been met, or the end of the file would have been met.
What actually happened:
For whatever reason, the get pointer started an index before expected. And I'm not sure as to why. An explanation as to what is going wrong/what I'm doing wrong would be much appreciated.
Also, I do understand that I could simply make previouslyNotedStr a substring of statsStr, starting from where I wish, and I've already tried that with success. I'm really just experimenting here.
The VisualC++ tag means you are on windows. On Windows the end of line takes two characters (\r\n). When you read the file in a string at a time, this end-of-line sequence is treated as a delimiter and you replace it with a single space character.
Therefore after you read the file you statsStr does not match the contents of the file. Every where there is a new line in the file you have replaced two characters with one. Hence when you use seekg to position yourself in the file based on numbers you got from the statsStr string, you end up in the wrong place.
Even if you get the new line handling correct, you will still encounter problems if the file contains two or more consecutive white space characters, because these will be collapsed into a single space character by your read loop.
You are reading the file word by word. There are better methods:
while (getline(stats, statsSTr)
{
// An entire line is read into statsStr.
std::string::size_type posn = statsStr.find("previouslyNoted=");
// ...
}
By reading entire text lines into a string, there is no need to reposition the file.
Also, there is a white-space issue when reading by word. This will affect where you think the text is in the file. For example, white space is skipped, and there is no telling how many spaces, newlines or tabs were skipped.
By the way, don't even think about replacing the text in the same file. Replacement of text only works if the replacement text has the same length as the original text in the file. Write to a new file instead.
Edit 1:
A better method is to declare your key strings as array. This helps with positioning pointers within a string:
static const char key_text[] = "previouslyNoted=";
while (getline(stats, statsStr))
{
std::string::size_type key_position = statsStr.find(key_text);
std::string::size_type value_position = key_position + sizeof(key_text) - 1; // for the nul terminator.
// value_position points to the character after the '='.
// ...
}
You may want to save programming type by making your data file conform to an existing format, such as INI or XML, and using appropriate libraries to parse them.

What is wrong with my UVa code

I tried to solve this problem in UVa but I am getting a wrong answer and I cant seem to find the error
http://uva.onlinejudge.org/index.php?option=com_onlinejudge&Itemid=8&page=show_problem&problem=2525
#include<cstdio>
#include<cstring>
using namespace std;
int main()
{
int t,j,k,i=1;
char a[1000];
while(scanf("%d",&t)!=EOF && t)
{
int sum=0;
getchar();
gets(a);
k=strlen(a);
for(j=0;j<k;j++)
{ if(a[j]=='a'||a[j]=='d'||a[j]=='g'||a[j]=='j'||a[j]=='m'||a[j]=='p'||a[j]=='t'||a[j]=='w'||a[j]==32)
sum=sum+1;
else if(a[j]=='b'||a[j]=='e'||a[j]=='h'||a[j]=='k'||a[j]=='n'||a[j]=='q'||a[j]=='u'||a[j]=='x')
sum=sum+2;
else if(a[j]=='c'||a[j]=='f'||a[j]=='i'||a[j]=='l'||a[j]=='o'||a[j]=='r'||a[j]=='v'||a[j]=='y')
sum=sum+3;
else if(a[j]=='s'||a[j]=='z')
sum=sum+4;
}
printf("Case #%d: %d\n",i,sum);
i++;
}
return 0;
}
In the problem description there is a single number that indicates the number of texts that will be in the input afterwards. Your original code was trying to read the number before every row of input.
The attempt to read the number in each one of the rows will fail since the input character set does not include any digits, so you could be inclined to think that there should be no difference. But there is, when you try to read a number it will start by consuming the leading whitespace. If the input is:
< space >< space >a
The output should be 3 (two '0' and one '2' keys), but the attempt to read the number out of the line will consume the two leading whitespace characters and the later gets will read the string "a", rather than " a". Your count will be off by the amount of leading whitespace.
separate your code into functions that do specific things: read the data from the file, calculate the number of key presses for each input, output the result
Benefit:
You can test each function independently. It is also easier to reason about the code.
The maximum size of an input is 100, this means you only need an array of 101 characters( including the final \0) for each input, not 1000.
Since this question is also tagged C++ try to use std::vector and std::string in your code.
The inner for seems right at a cursory glance. The befit of having a specialized function that computes the number of key presses is that you can easily verify it does the correct thing. Make sure you check it thoroughly.

Line Breaks when reading an input file by character in C++

Ok, just to be up front, this IS homework, but it isn't due for another week, and I'm not entirely sure the final details of the assignment. Long story short, without knowing what concepts he'll introduce in class, I decided to take a crack at the assignment, but I've run into a problem. Part of what I need to do for the homework is read individual characters from an input file, and then, given the character's position within its containing word, repeat the character across the screen. The problem I'm having is, the words in the text file are single words, each on a different line in the file. Since I'm not sure we'll get to use <string> for this assignment, I was wondering if there is any way to identify the end of the line without using <string>.
Right now, I'm using a simple ifstream fin; to pull the chars out. I just can't figure out how to get it to recognize the end of one word and the beginning of another. For the sake of including code, the following is all that I've got so far. I was hoping it would display some sort of endl character, but it just prints all the words out run together style.
ifstream fin;
char charIn;
fin.open("Animals.dat");
fin >> charIn;
while(!fin.eof()){
cout << charIn;
fin >> charIn;
}
A few things I forgot to include originally:
I must process each character as it is input (my loop to print it out needs to run before I read in the next char and increase my counter). Also, the length of the words in 'Animals.dat' vary which keeps me from being able to just set a number of iterations. We also haven't covered fin.get() or .getline() so those are off limits as well.
Honestly, I can't imagine this is impossible, but given the restraints, if it is, I'm not too upset. I mostly thought it was a fun problem to sit on for a while.
Why not use an array of chars? You can try it as follow:
#define MAX_WORD_NUM 20
#define MAX_STR_LEN 40 //I think 40 is big enough to hold one word.
char words[MAX_WROD_NUM][MAX_STR_LEN];
Then you can input a word to the words.
cin >> words[i];
The >> operator ignores whitespace, so you'll never get the newline character. You can use c-strings (arrays of characters) even if the <string> class is not allowed:
ifstream fin;
char animal[64];
fin.open("Animals.dat");
while(fin >> animal) {
cout << animal << endl;
}
When reading characters from a c-string (which is what animal is above), the last character is always 0, sometimes represented '\0' or NULL. This is what you check for when iterating over characters in a word. For example:
c = animal[0];
for(int i = 1; c != 0 && i < 64; i++)
{
// do something with c
c = animal[i];
}

Tokenization of a text file with frequency and line occurrence. Using C++

once again I ask for help. I haven't coded anything for sometime!
Now I have a text file filled with random gibberish. I already have a basic idea on how I will count the number of occurrences per word.
What really stumps me is how I will determine what line the word is in. Gut instinct tells me to look for the newline character at the end of each line. However I have to do this while going through the text file the first time right? Since if I do it afterwords it will do no good.
I already am getting the words via the following code:
vector<string> words;
string currentWord;
while(!inputFile.eof())
{
inputFile >> currentWord;
words.push_back(currentWord);
}
This is for a text file with no set structure. Using the above code gives me a nice little(big) vector of words, but it doesn't give me the line they occur in.
Would I have to get the entire line, then process it into words to make this possible?
Use a std::map<std::string, int> to count the word occurrences -- the int is the number of times it exists.
If you need like by line input, use std::getline(std::istream&, std::string&), like this:
std::vector<std::string> lines;
std::ifstream file(...) //Fill in accordingly.
std::string currentLine;
while(std::getline(file, currentLine))
lines.push_back(currentLine);
You can split a line apart by putting it into an std::istringstream first and then using operator>>. (Alternately, you could cobble up some sort of splitter using std::find and other algorithmic primitaves)
EDIT: This is the same thing as in #dash-tom-bang's answer, but modified to be correct with respect to error handing:
vector<string> words;
int currentLine = 1; // or 0, however you wish to count...
string line;
while (getline(inputFile, line))
{
istringstream inputString(line);
string word;
while (inputString >> word)
words.push_back(pair(word, currentLine));
}
Short and sweet.
vector< map< string, size_t > > line_word_counts;
string line, word;
while ( getline( cin, line ) ) {
line_word_counts.push_back();
map< string, size_t > &word_counts = line_word_counts.back();
istringstream line_is( line );
while ( is >> word ) ++ word_counts[ word ];
}
cout << "'Hello' appears on line 5 " << line_word_counts[5-1]["Hello"]
<< " times\n";
You're going to have to abandon reading into strings, because operator >>(istream&, string&) discards white space and the contents of the white space (== '\n' or != '\n', that is the question...) is what will give you line numbers.
This is where OOP can save the day. You need to write a class to act as a "front end" for reading from the file. Its job will be to buffer data from the file, and return words one at a time to the caller.
Internally, the class needs to read data from the file a block (say, 4096 bytes) at a time. Then a string GetWord() (yes, returning by value here is good) method will:
First, read any white space characters, taking care to increment the object's lineNumber member every time it hits a \n.
Then read non-whitespace characters, putting them into the string object you'll be returning.
If it runs out of stuff to read, read the next block and continue.
If the you hit the end of file, the string you have is the whole word (which may be empty) and should be returned.
If the function returns an empty string, that tells the caller that the end of file has been reached. (Files usually end with whitespace characters, so reading whitespace characters cannot imply that there will be a word later on.)
Then you can call this method at the same place in your code as your cin >> line and the rest of the code doesn't need to know the details of your block buffering.
An alternative approach is to read things a line at a time, but all the read functions that would work for you require you to create a fixed-size buffer to read into beforehand, and if the line is longer than that buffer, you have to deal with it somehow. It could get more complicated than the class I described.