Word Frequency of a string (i.e. File I/O)?

I wrote a C++ program that reads a text file. I want the program to count the number of times each word appears. For example, the output should look as follows:
Word Frequency Analysis
Word Frequency
I 1
don't 1
know 1
the 2
key 1
to 3
success 1
but 1
key 1
failure 1
is 1
trying 1
please 1
everybody 1
Notice how each word appears only once. What do I need to do to achieve this?
Here is the text file (i.e. named BillCosby.txt):
I don't know the key to success, but the key to failure is trying to please everybody.
Here is my code so far. I am having an extreme mental block and cannot figure out a way to get the program to read the number of times a word occurs.
#include <iostream>
#include <fstream>
#include <iomanip>
const int BUFFER_LENGTH = 256;
const int NUMBER_OF_STRINGS = 100;
int numberOfElements = 0;
char buffer[NUMBER_OF_STRINGS][BUFFER_LENGTH];
char * words = buffer[0];
int frequency[NUMBER_OF_STRINGS];
int StringLength(char * buffer);
int StringCompare(char * firstString, char * secondString);
int main(){
int isFound = 1;
int count = 1;
std::ifstream input("BillCosby.txt");
if(input.is_open())
{
//Priming read
input >> buffer[numberOfElements];
frequency[numberOfElements] = 1;
while(!input.eof())
{
numberOfElements++;
input >> buffer[numberOfElements];
for(int i = 0; i < numberOfElements; i++){
isFound = StringCompare(buffer[numberOfElements], buffer[i]);
if(isFound == 0)
++count;
}
frequency[numberOfElements] = count;
//frequency[numberOfElements] = 1;
count = 1;
isFound = 1;
}
numberOfElements++;
}
else
std::cout << "File is not open. " << std::endl;
std::cout << "\n\nWord Frequency Analysis " << std::endl;
std::cout << "\n" << std::endl;
std::cout << "Word " << std::setw(25) << "Frequency\n" << std::endl;
for(int i = 0; i < numberOfElements; i++){
int length = StringLength(buffer[i]);
std::cout << buffer[i] << std::setw(25 - length) << frequency[i] <<
std::endl;
}
return 0;
}
int StringLength(char * buffer){
char * characterPointer = buffer;
while(*characterPointer != '\0'){
characterPointer++;
}
return characterPointer - buffer;
}
int StringCompare(char * firstString, char * secondString)
{
while ((*firstString == *secondString || (*firstString == *secondString - 32) ||
(*firstString - 32 == *secondString)) && (*firstString != '\0'))
{
firstString++;
secondString++;
}
if (*firstString > *secondString)
return 1;
else if (*firstString < *secondString)
return -1;
return 0;
}

Your program is quite confusing to read. But this part stuck out to me:
frequency[numberOfElements] = 1;
(in the while loop). You realize that you are always setting the frequency to 1 no matter how many times the word appears right? Maybe you meant to increment the value and not set it to 1?

One approach is to tokenize the input (split the lines into words) and then use the C++ std::map container. The map would have the word as the key and the word count as the value.
For each token, add it to the map and increment its count. Map keys are unique, so you won't get duplicates.
You can use std::stringstream as your tokenizer; the std::map reference documentation includes examples.
And don't worry, a good programmer deals with mental blocks on a daily basis -- so get used to it :)

Flow of solution should be something like this:
- initialize storage (you know you have a pretty small file apparently?)
- set initial count to zero (not one)
- read words into array. When you get a new word, see if you already have it; if so, add one to the count at that location; if not, add it to the list of words ("hey - a new word!") and set its count to 1
- loop over all words in the file
Be careful with white space - make sure you are matching only non white space characters. Right now you have "key" twice. I suspect that is a mistake?
Good luck.

Here's a code example that I tested with codepad.org:
#include <iostream>
#include <map>
#include <string>
#include <sstream>
using namespace std;
int main()
{
string s = "I don't know the key to success, but the key to failure is trying to please everybody.";
string word;
map<string,int> freq;
for ( std::string::iterator it=s.begin(); it!=s.end(); ++it)
{
if(*it == ' ')
{
if(freq.find(word) == freq.end()) //First time the word is seen
{
freq[word] = 1;
}
else //The word has been seen before
{
freq[word]++;
}
word = "";
}
else
{
word.push_back(*it);
}
}
for (std::map<string,int>::iterator it=freq.begin(); it!=freq.end(); ++it)
std::cout << it->first << " => " << it->second << '\n';
}
It only splits on spaces, so punctuation stays attached to the words, and the last word ("everybody.") is never counted because no space follows it, but you get the point.
Output:
I => 1
but => 1
don't => 1
failure => 1
is => 1
key => 2
know => 1
please => 1
success, => 1
the => 2
to => 3
trying => 1
Note that "success," isn't perfect because of the comma. A quick change can fix this, though; I'll let you figure that out on your own.

I'm a bit hesitant to post a direct answer to something that looks a lot like homework. That said, I'm pretty sure that if somebody turns this in as homework, any halfway decent teacher or professor is going to demand some pretty serious explanation. So if you do, you'd better study it carefully and be ready for questions about what all the parts are and how they work.
#include <map>
#include <iostream>
#include <iterator>
#include <algorithm>
#include <string>
#include <fstream>
#include <iomanip>
#include <locale>
#include <vector>
#include <cctype> // for isalpha
struct alpha_only: std::ctype<char> {
alpha_only() : std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table() {
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::space);
for (int i=0; i<std::ctype<char>::table_size; i++)
if (isalpha(i)) rc[i] = std::ctype_base::alpha;
return &rc[0];
}
};
typedef std::pair<std::string, unsigned> count;
namespace std {
std::ostream &operator<<(std::ostream &os, ::count const &c) {
return os << std::left << std::setw(25) << c.first
<< std::setw(10) << c.second;
}
}
int main() {
std::ifstream input("BillCosby.txt"); // same name as in the question; matters on case-sensitive file systems
input.imbue(std::locale(std::locale(), new alpha_only()));
std::map<std::string, unsigned> words;
std::for_each(std::istream_iterator<std::string>(input),
std::istream_iterator<std::string>(),
[&words](std::string const &w) { ++words[w]; });
std::copy(words.begin(), words.end(),
std::ostream_iterator<count>(std::cout, "\n"));
return 0;
}

Related

Using isdigit() to determine if vector<string> index is letter or number

Trying to determine whether an index entry of a vector of strings is a letter or number. I am trying to use isdigit(), but it won't work because a suitable conversion can't be made using isdigit(stof(eq[i]))
Essentially, if I find that it is a letter, I want to change that value to 0.
#include <string>
#include <vector>
using namespace std;
vector <string> eq;
eq[0] = "a";
eq[1] = "3.5";
eq[2] = "7.5";
for (int i = 0; i < eq.size(); i++)
{
try {
isdigit(stof(eq[i]));
throw(eq[i]);
}
catch (exception e) {
cout << "eq[i] is not a number" << endl;
eq[i] == "0";
cout << "eq[i] = " << eq[i] << endl;
}
}
The question is, how could I assess if an index value is a letter, and then if it is, replace that letter with a zero?

How about just having a check as
if(eq[i].size() == 1 && std::isalpha(eq[i][0])) {
eq[i] = "0";
}
Would that work for your case?
EDIT: In case you have something like eq[3] = "abc"; i.e. entire strings rather just single letters, then something like this could be done:
if(
std::any_of(eq[i].cbegin(), eq[i].cend(), [](char c) {
return std::isalpha(c);
})
) {
eq[i] = "0";
}
Here's the documentation for std::any_of

Firstly, as @Peter noted, eq has size zero on creation, so assigning to eq[0], eq[1], and eq[2] causes undefined behavior. This can be solved by using std::vector::push_back().
Secondly, as @Jerry Jeremiah mentioned, isdigit() checks whether a single character is a decimal digit, so I don't see the point of using it with stof(), which converts a string to a float.
This is the code that I modified from yours:
#include <string>
#include <vector>
#include <iostream>
using namespace std;
int main()
{
vector <string> eq;
//sample data
eq.push_back("a");
eq.push_back("3.5");
eq.push_back("7.5");
eq.push_back("5");
eq.push_back("xyz");
for (int i = 0; i < eq.size(); i++)
{
try
{
stof(eq[i]);
}
catch (const exception&) // catch by const reference; stof() throws when no conversion is possible
{
eq[i] = "0";
}
cout << "eq[" << i << "]" << " = " << eq[i] << endl;
}
}
Result:
eq[0] = 0
eq[1] = 3.5
eq[2] = 7.5
eq[3] = 5
eq[4] = 0
Also, it might be more practical to use the second parameter of stof() (as @Jerry Jeremiah noted), and to use an if statement instead of try-catch (as @Peter noted).
*Note: Running on Code::Blocks 20.03, g++ 6.3.0, Windows 10, 64 bit.
*More info:
stof() : https://www.cplusplus.com/reference/string/stof/
std::vector::push_back() : https://www.cplusplus.com/reference/vector/vector/push_back/
isdigit() : https://www.cplusplus.com/reference/cctype/isdigit/

If a word is repeated many times in a string, how can I count the number of repetitions of the word and their positions?

If a word is repeated many times in a string, how can I count the number of repetitions of the word and their positions?
#include <cstring>
#include <iostream>
#include <string>
using namespace std;
int main()
{
string str;
getline(cin, str);
string str2;
getline(cin, str2);
const char* p = strstr(str.c_str(), str2.c_str());
if (p)
cout << "'" << str2 << "' found at position " << p - str.c_str();
else
cout << "'" << str2 << "' not found in \"" << str << "\"";
return 0;
}

Just off the top of my head, you could use find() within std::string. find() returns the first match of a substring within your string (or std::string::npos if there is no match), so you would need to loop until find() was not able to find any more matches of your string.
Something like:
#include <string>
#include <vector>
#include <cstdio>
int main(void) {
std::string largeString = "Some string with substrings";
std::string mySubString = "string";
int numSubStrings = 0;
std::vector<size_t> locations;
// Start at index 0 so that a match at the very beginning is not skipped.
size_t found = largeString.find(mySubString);
while (found != std::string::npos) {
numSubStrings += 1;
locations.push_back(found);
// Resume one character past this match; overlapping matches are counted too.
found = largeString.find(mySubString, found + 1);
}
printf("There are %d occurrences of: \n%s \nin \n%s\n", numSubStrings, mySubString.c_str(), largeString.c_str());
}
Which outputs:
There are 2 occurrences of:
string
in
Some string with substrings

The code below leans on the Standard Library to do the common work. I read the file into one large string, then use a std::stringstream to split it on whitespace and store the individual words in a std::vector (an array that manages its own size and grows when needed). To get an accurate count, punctuation and capitalization must also be removed; this is done in the sanitize_word() function. Next, I add the words to a map where the word is the key and the int is the count of how many times that word occurred. Finally, I print the map to get a complete word count.
The only place I parse a string directly is in the sanitize function, using the aptly named erase/remove idiom. Letting the Standard Library do the work where possible is much simpler.
Locating where a word occurs also becomes trivial once the words have been separated and sanitized.
Contents of input.txt:
I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past, I will turn the inner eye to see its path. Where the fear has gone, there will be nothing. Only I will remain.
#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>
// Removes punctuation marks and converts words to all lowercase
std::string sanitize_word(std::string word) {
word.erase(std::remove_if(word.begin(), word.end(),
[punc = std::string(".,?!")](auto c) {
return punc.find(c) != std::string::npos;
}),
word.end());
for (auto& c : word) {
c = std::tolower(c);
}
return word;
}
int main() {
// Set up
std::ifstream fin("input.txt");
if (!fin) {
std::cerr << "Error opening file...\n";
return 1;
}
std::string phrases;
for (std::string tmp; std::getline(fin, tmp);) {
phrases += tmp;
phrases += ' '; // keep words on adjacent lines from running together
}
fin.close();
// Words are collected, now the part we care about
std::stringstream strin(phrases);
std::vector<std::string> words;
for (std::string tmp; strin >> tmp;) {
words.push_back(tmp);
}
for (auto& i : words) {
i = sanitize_word(i);
}
// std::map's operator[]() function will create a new element in the map if it
// doesn't already exist
std::map<std::string, int> wordCounts;
for (auto i : words) {
++wordCounts[i];
}
for (auto i : wordCounts) {
std::cout << i.first << ": " << i.second << '\n';
}
// Now we'll do code to locate a certain word, "fear" for this example
std::string wordToFind("fear");
auto it = wordCounts.find(wordToFind);
std::cout << "\n\n" << it->first << ": " << it->second << '\n';
std::vector<int> locations;
for (std::size_t i = 0; i < words.size(); ++i) {
if (words[i] == wordToFind) {
locations.push_back(i);
}
}
std::cout << "Found at locations: ";
for (auto i : locations) {
std::cout << i << ' ';
}
std::cout << '\n';
}
Output:
and: 2
be: 1
brings: 1
eye: 1
face: 1
fear: 5
gone: 2
has: 2
i: 5
inner: 1
is: 2
it: 2
its: 1
little-death: 1
me: 2
mind-killer: 1
must: 1
my: 1
not: 1
nothing: 1
obliteration: 1
only: 1
over: 1
pass: 1
past: 1
path: 1
permit: 1
remain: 1
see: 1
that: 1
the: 4
there: 1
through: 1
to: 2
total: 1
turn: 1
when: 1
where: 1
will: 5
fear: 5
Found at locations: 3 4 8 20 50

How do I remove repeated words from a string and only show it once with their wordcount

Basically, I have to show each word with their count but repeated words show up again in my program.
How do I remove them by using loops or should I use 2d arrays to store both the word and count?
#include <iostream>
#include <stdio.h>
#include <iomanip>
#include <cstring>
#include <conio.h>
#include <time.h>
using namespace std;
char* getstring();
void xyz(char*);
void tokenizing(char*);
int main()
{
char* pa = getstring();
xyz(pa);
tokenizing(pa);
_getch();
}
char* getstring()
{
static char pa[1000]; // must be at least as large as the limit passed to getline below
cout << "Enter a paragraph: " << endl;
cin.getline(pa, 1000, '#');
return pa;
}
void xyz(char* pa)
{
cout << pa << endl;
}
void tokenizing(char* pa)
{
char sepa[] = " ,.\n\t";
char* token;
char* nexttoken;
int size = strlen(pa);
token = strtok_s(pa, sepa, &nexttoken);
while (token != NULL) {
int wordcount = 0;
if (token != NULL) {
int sizex = strlen(token);
//char** fin;
int j;
for (int i = 0; i <= size; i++) {
for (j = 0; j < sizex; j++) {
if (pa[i + j] != token[j]) {
break;
}
}
if (j == sizex) {
wordcount++;
}
}
//for (int w = 0; w < size; w++)
//fin[w] = token;
//cout << fin[w];
cout << token;
cout << " " << wordcount << "\n";
}
token = strtok_s(NULL, sepa, &nexttoken);
}
}
This is the output I get:
I want to show, for example, the word "i" once with its count of 5, and then not show it again.

First of all, since you are using C++, I would recommend splitting the text the C++ way and storing every word in a map or unordered_map.
But if you don't want to rewrite your code, you can simply add a variable that indicates whether a copy of the word was already found earlier in the text. If no copy was found before the current position, print the word with its count.

This answer saves each word returned by strtok_s into a vector of strings, then uses std::string::compare to compare every remaining word with word[0]. The indexes that match word[0] are recorded in an int array, used, and the number of recorded entries, nused, is the count for that word. The recorded words are then removed from the vector, and the process repeats with whatever remains. The program ends when no words are left.
If you prefer not to use std::vector and std::string, you can write your own word-comparison function to replace tok.compare(word[i]).
#include <iostream>
#include <string>
#include <vector>
#include<iomanip>
#include<cstring>
using namespace std;
char* getstring();
void xyz(char*);
void tokenizing(char*);
int main()
{
char* pa = getstring();
xyz(pa);
tokenizing(pa);
}
char* getstring()
{
static char pa[100] = "this is a test and is a test and is test.";
return pa;
}
void xyz(char* pa)
{
cout << pa << endl;
}
void tokenizing(char* pa)
{
char sepa[] = " ,.\n\t";
char* token;
char* nexttoken;
std::vector<std::string> word;
int used[64];
std::string tok;
int nword = 0, nsize, nused;
int size = strlen(pa);
token = strtok_s(pa, sepa, &nexttoken);
while (token)
{
word.push_back(token);
++nword;
token = strtok_s(NULL, sepa, &nexttoken);
}
for (int i = 0; i<nword; i++) std::cout << word[i] << std::endl;
std::cout << "total " << nword << " words.\n" << std::endl;
nsize = nword;
while (nsize > 0)
{
nused = 0;
tok = word[0] ;
used[nused++] = 0;
for (int i=1; i<nsize; i++)
{
if ( tok.compare(word[i]) == 0 )
{
used[nused++] = i;
}
}
std::cout << tok << " : " << nused << std::endl;
for (int i=nused-1; i>=0; --i)
{
for (int j=used[i]; j<(nsize+i-nused); j++) word[j] = word[j+1];
}
nsize -= nused;
}
}
Notice that the used words have to be removed in backward order. If you removed them in forward order, the indexes stored in the used array would have to be adjusted after every removal. A test run:
$ ./a.out
this is a test and is a test and is test.
this
is
a
test
and
is
a
test
and
is
test
total 11 words.
this : 1
is : 3
a : 2
test : 3
and : 2

I read your last comment, but I'm sorry, I do not know C, so I will answer with the standard C++ approach. That is usually only about 10 lines of code.
#include <iostream>
#include <algorithm>
#include <map>
#include <string>
#include <regex>
// Regex Helpers
// Regex to find a word
static const std::regex reWord{ R"(\w+)" };
// Result of search for one word in the string
static std::smatch smWord;
int main() {
std::cout << "\nPlease enter text: \n";
if (std::string line; std::getline(std::cin, line)) {
// Words and its appearance count
std::map<std::string, int> words{};
// Count the words
for (std::string s{ line }; std::regex_search(s, smWord, reWord); s = smWord.suffix())
words[smWord[0]]++;
// Show result
for (const auto& [word, count] : words) std::cout << word << "\t\t--> " << count << '\n';
}
return 0;
}

Removing all the characters (a-z, A-Z) from a string in C++

Here's my code:
#include <iostream>
using namespace std;
string moveString(string t, int index)
{
for (int i=index; t[i]!=NULL;i++)
{
t[i]=t[i+1];
}
return t;
}
string delChars(string t)
{
for (int i=0; t[i]!=NULL; i++)
{
if (t[i]>'a' && t[i]<'z')
{
moveString(t, i);
}
else if (t[i]>'A' && t[i]<'Z')
{
moveString(t, i);
}
}
return t;
}
int main()
{
int numberOfSpaces;
string t;
cout << "Text some word: "; cin>>t;
cout<<delChars(t);
return 0;
}
The first function, moveString, should (in theory) shift every character of the string down by one index, starting from the given index, in order to remove one character. The rest is pretty obvious. But:
Input: abc123def
Output: abc123def
What am I doing wrong?
And an additional mini-question: actually, what's the best way to "delete" an element from an array (an array of ints, chars, etc.)?

Logic Stuff is right, but his answer is not enough. You shouldn't increment i after a move, since the i-th character has been removed and i already points to the next character.
string delChars(string t)
{
for (int i=0; t[i]!='\0'; )
{
if (t[i]>='a' && t[i]<='z') // >= and <= so that 'a' and 'z' themselves are removed too
{
t = moveString(t, i);
}
else if (t[i]>='A' && t[i]<='Z')
{
t = moveString(t, i);
}
else
i++;
}
return t;
}

moveString takes t by value and you're not assigning its return value, so it doesn't change t in delChars. So make sure the next thing you learn is references.
Apart from that, I don't know what to say about t[i] != NULL (whether it is undefined behavior or not), but we have std::string::size to get the length of a std::string, e.g. i < t.size(). And if you have t[i + 1], the condition should then be i + 1 < t.size().
In any case, don't treat it like a char array and leave the string at its previous size. You can pop_back the last (duplicate) character after shifting the characters.
It's worth mentioning that it can be done in one line of idiomatic C++ algorithms, but you want to get your code working...

What am I doing wrong?
Not using standard algorithms
Actually, what's the best way to "delete" an element from array? (array of ints, chars, etc.)
By using the standard remove-erase idiom:
#include <iostream>
#include <string>
#include <algorithm>
#include <iomanip>
#include <cctype>
int main()
{
using namespace std;
auto s = "!the 54 quick brown foxes jump over the 21 dogs."s;
cout << "before: " << quoted(s) << endl;
s.erase(std::remove_if(s.begin(),
s.end(),
[](auto c) { return std::isalpha(c); }),
s.end());
cout << "after: " << quoted(s) << endl;
return 0;
}
expected output:
before: "!the 54 quick brown foxes jump over the 21 dogs."
after: "! 54       21 ."
I'm not allowed to use standard algorithms
Then keep it simple:
#include <iostream>
#include <string>
#include <algorithm>
#include <iomanip>
#include <cctype>
std::string remove_letters(const std::string& input)
{
std::string result;
result.reserve(input.size());
for (auto c : input) {
if (!std::isalpha(c)) {
result.push_back(c);
}
}
return result;
}
int main()
{
using namespace std;
auto s = "!the 54 quick brown foxes jump over the 21 dogs."s;
cout << "before: " << quoted(s) << endl;
auto s2 = remove_letters(s);
cout << "after: " << quoted(s2) << endl;
return 0;
}

string parsing for C++

I have a text file that has #'s in it...It looks something like this.
#Stuff
1
2
3
#MoreStuff
a
b
c
I am trying to use std::string::find() function to get the positions of the # and then go from there, but I'm not sure how to actually code this.
This is my attempt:
int pos1=0;
while(i<string.size()){
int next=string.find('#', pos1);
i++;}

Here's one i made a while ago... (in C)
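#include <string.h>
int char_pos(char c, char *str) {
char *pch = strchr(str, c);
if (pch == NULL) return -1; /* not found */
return (int)(pch - str) + 1; /* 1-based position of the first match */
}
Port it to C++ and there you go! ;)
If the character is not found, it returns a negative value.
Otherwise, it returns the (positive) 1-based position of the first match.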

It's hard to tell from your question what you mean by "position", but it looks like you are trying to do something like this:
#include <fstream>
#include <iostream>
#include <string>
int main()
{
std::ifstream incoming{"string-parsing-for-c.txt"};
std::string const hash{"#"};
std::string line;
for (auto line_number = 0U; std::getline(incoming, line); ++line_number)
{
auto const column = line.find(hash);
if (std::string::npos != column)
{
std::cout << hash << " found on line " << line_number
<< " in column " << column << ".\n";
}
}
}
...or possibly this:
#include <fstream>
#include <iostream>
int main()
{
std::ifstream incoming{"string-parsing-for-c.txt"};
char const hash{'#'};
char byte{};
for (auto offset = 0U; incoming.read(&byte, 1); ++offset)
{
if (hash == byte)
{
std::cout << hash << " found at offset " << offset << ".\n";
}
}
}
