Count word frequency using map - c++

This is my first time using std::map in C++. Given a character array containing text, I want to count how often each word occurs in it. I decided to use a map to store the words and to increment a counter each time a word appears again.
Here is the code I have written so far.
const char *kInputText = "\
So given a character array with text, I want to count the frequency of\n\
each word occurring in the text.\n\
I decided to implement map to store the\n\
words and compare following words and increment a counter.\n";
typedef struct WordCounts
{
int wordcount;
}WordCounts;
typedef map<string, int> StoreMap;
//countWord function is to count the total number of words in the text.
void countWord( const char * text, WordCounts & outWordCounts )
{
outWordCounts.wordcount = 0;
size_t i;
if(isalpha(text[0]))
outWordCounts.wordcount++;
// start at 1: text[0] is handled above, and text[i-1] would be out of bounds for i == 0
for(i=1;i<strlen(text);i++)
{
if((isalpha(text[i])) && (!isalpha(text[i-1])))
outWordCounts.wordcount++;
}
cout<<outWordCounts.wordcount;
}
//count_for_map() is to count the word frequency using map.
void count_for_map(const char *text, StoreMap & words)
{
string st;
while(text >> st)
words[st]++;
}
int main()
{
WordCounts wordCounts;
StoreMap w;
countWord( kInputText, wordCounts );
count_for_map(kInputText, w);
for(StoreMap::iterator p = w.begin();p != w.end();++p)
{
std::cout<<p->first<<" occurred "<<p->second<<" times.\n";
}
return 0;
}
Error: No match for 'operator >>' in 'text >> st'
I understand this is an operator overloading error, so I went ahead and
wrote the following lines of code.
//In the count_for_map()
/*istream & operator >> (istream & input,const char *text)
{
int i;
for(i=0;i<strlen(text);i++)
input >> text[i];
return input;
}*/
Am I implementing map in the wrong way?

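There is no overload of operator>> that takes a const char* on the left-hand side.
text is a const char*, not an istream, so the overload you wrote does not apply either. (That overload is also wrong, since it tries to write into a const char*, and an extraction operator for character buffers already exists in the standard library.)
Wrap the text in a std::istringstream (from <sstream>) and extract from that instead: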
std::istringstream textstream(text);
while(textstream >> st)
words[st]++;

Modern C++ makes this considerably easier.
First, using a std::map is the right approach.
It is the more or less standard way to count things.
We can use an associative container such as std::map or std::unordered_map, which associates a key (here, the word) with a value (here, that word's count).
Conveniently, both maps provide an index operator[]. It looks up the given key and, if found, returns a reference to its value. If the key is not found, it creates a new entry with that key and a zero-initialized value, and returns a reference to the new value. In both cases we get a reference to the counter, so we can simply write:
std::unordered_map<std::string, unsigned int> counter{};
counter[word]++;
But how do we get the words out of a string? A string is a container of characters, and C++ containers have iterators. For strings there is a dedicated iterator that iterates over the matches of a pattern in a std::string: std::sregex_token_iterator. The pattern is given as a std::regex, which gives you great flexibility.
Since such a dedicated iterator exists, we should use it.
Putting everything together gives a very compact solution:
#include <iostream>
#include <string>
#include <regex>
#include <map>
#include <iomanip>
const std::regex re{ "\\w+" };
const std::string text{ R"(So given a character array with text, I want to count the frequency of
each word occurring in the text.
I decided to implement map to store the
words and compare following words and increment a counter.")" };
int main() {
std::map<std::string, unsigned int> counter{};
for (auto word{ std::sregex_token_iterator(text.begin(),text.end(),re) }; word != std::sregex_token_iterator(); ++word)
counter[*word]++;
for (const auto& [word, count] : counter)
std::cout << std::setw(20) << word << "\toccurred\t" << count << " times\n";
}

Related

Inserting characters into a string in C++

I need to insert a character into a string of letters that are in alphabetical order, and this character has to be placed where it belongs alphabetically.
For example I have the string string myString("afgjz"); and the input code
cout << "Input your character" << endl;
char ch;
cin >> ch;
How can I make it so that, after the char is entered (say b), it is added to the string at the proper position, so that the string becomes "abfgjz"?
You can use std::lower_bound to find the position to insert.
myString.insert(std::lower_bound(myString.begin(), myString.end(), ch), ch);
A more generic solution would be having a function like
namespace sorted
{
template<class Container, class T>
void insert(Container & object, T const & value)
{
using std::begin;
using std::end;
object.insert(std::lower_bound(begin(object),
end(object), value), value);
}
}
And then use
sorted::insert(myString, ch);
Class std::string has the following insert method (among others):
iterator insert(const_iterator p, charT c);
So all you need is to find the position where the new character has to be inserted. If the string already contains the same character, there are two options: insert the new character before the existing ones, in which case you should use the standard algorithm std::lower_bound, or insert it after the existing ones, in which case you should use std::upper_bound.
Here is a demonstrative program that shows how this can be done using std::upper_bound. You may substitute std::lower_bound if you like, though in my opinion it is better to insert the new character after the existing ones, because that can reduce the number of characters that have to be moved to make room for it.
#include <iostream>
#include <algorithm>
#include <string>
int main()
{
std::string myString( "afgjz" );
char c = 'b';
myString.insert( std::upper_bound( myString.begin(), myString.end(), c ), c );
std::cout << myString << std::endl;
return 0;
}
The program output is
abfgjz

How to order strings case-insensitively (not lexicographically)?

I'm attempting to order a list input from a file alphabetically (not lexicographically). So, if the list were:
C
d
A
b
I need it to become:
A
b
C
d
Not the lexicographic ordering:
A
C
b
d
I'm using string variables to hold the input, so I'm looking for some way to modify the strings I'm comparing to all uppercase or lowercase, or if there's some easier way to force an alphabetic comparison, please impart that wisdom. Thanks!
I should also mention that we are limited to the following libraries for this assignment: iostream, iomanip, fstream, string, as well as C libraries, like cstring, cctype, etc.
It looks like I'm just going to have to solve this problem the tedious way, extracting the characters of each string and calling toupper on every one.
Converting the individual strings to upper case and comparing them is not made much harder by being barred from algorithm, iterator, etc. The comparison logic is about four lines of code. It would be nice not to have to write those four lines, but having to write a sorting algorithm is far more difficult and tedious. (This assumes that the usual C version of toupper is acceptable in the first place.)
Below I show a simple strcasecmp() implementation and then put it to use in a complete program which uses restricted libraries. The implementation of strcasecmp() itself doesn't use restricted libraries.
#include <string>
#include <cctype>
#include <iostream>
void toupper(std::string &s) {
for (char &c : s)
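c = static_cast<char>(std::toupper(static_cast<unsigned char>(c))); // cast avoids undefined behaviour for negative char values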
}
bool strcasecmp(std::string lhs, std::string rhs) {
toupper(lhs); toupper(rhs);
return lhs < rhs;
}
// restricted libraries used below
#include <algorithm>
#include <iterator>
#include <vector>
// Example usage:
// > ./a.out <<< "C d A b"
// A b C d
int main() {
std::vector<std::string> input;
std::string word;
while(std::cin >> word) {
input.push_back(word);
}
std::sort(std::begin(input), std::end(input), strcasecmp);
std::copy(std::begin(input), std::end(input),
std::ostream_iterator<std::string>(std::cout, " "));
std::cout << '\n';
}
You don't have to modify the strings before sorting. You can sort them in place with a case-insensitive single character comparator and std::sort:
bool case_insensitive_cmp(char lhs, char rhs) {
return ::toupper(static_cast<unsigned char>(lhs)) <
::toupper(static_cast<unsigned char>(rhs));
}
std::string input = ....;
std::sort(input.begin(), input.end(), case_insensitive_cmp);
std::vector<std::string> vec {"A", "a", "lorem", "Z"};
std::sort(vec.begin(),
vec.end(),
[](const std::string& s1, const std::string& s2) -> bool {
// strcasecmp here is the POSIX function declared in <strings.h>
return strcasecmp(s1.c_str(), s2.c_str()) < 0;
});
Use strcasecmp() as comparison function in qsort().
I am not completely sure how to write it, but what you want to do is convert the strings to lower or uppercase.
If the strings are in an array to begin with, you would run through the list, and save the indexes in order in an (int) array.
If you're just comparing letters, then a terrible hack which will work is to mask the upper two bits off each character. Then upper and lower case letters fall on top of each other.

map<string, vector<string>> reassignment of vector value

I am trying to write a program that takes lines from an input file, sorts the lines into 'signatures' for the purpose of combining all words that are anagrams of each other. I have to use a map, storing the 'signatures' as the keys and storing all words that match those signatures into a vector of strings. Afterwards I must print all words that are anagrams of each other on the same line. Here is what I have so far:
#include <iostream>
#include <string>
#include <algorithm>
#include <map>
#include <fstream>
using namespace std;
string signature(const string&);
void printMap(const map<string, vector<string>>&);
int main(){
string w1,sig1;
vector<string> data;
map<string, vector<string>> anagrams;
map<string, vector<string>>::iterator it;
ifstream myfile;
myfile.open("words.txt");
while(getline(myfile, w1))
{
sig1=signature(w1);
anagrams[sig1]=data.push_back(w1); //to my understanding this should always work,
} //either by inserting a new element/key or
//by pushing back the new word into the vector<string> data
//variable at index sig1, being told that the assignment operator
//cannot be used in this way with these data types
myfile.close();
printMap(anagrams);
return 0;
}
string signature(const string& w)
{
string sig;
sig=sort(w.begin(), w.end());
return sig;
}
void printMap(const map& m)
{
for(string s : m)
{
for(int i=0;i<m->second.size();i++)
cout << m->second.at();
cout << endl;
}
}
The first explanation is working, didn't know it was that simple! However now my print function is giving me:
prob2.cc: In function ‘void printMap(const std::map<std::basic_string<char>, std::vector<std::basic_string<char> > >&)’:
prob2.cc:43:36: error: cannot bind ‘std::basic_ostream<char>::__ostream_type {aka std::basic_ostream<char>}’ lvalue to ‘std::basic_ostream<char>&&’
In file included from /opt/centos/devtoolset-1.1/root/usr/lib/gcc/x86_64-redhat-linux/4.7.2/../../../../include/c++/4.7.2/iostream:40:0,
Tried many variations and they always complain about binding
void printMap(const map<string, vector<string>> &mymap)
{
for(auto &c : mymap)
cout << c.first << endl << c.second << endl;
}
anagrams[sig1] will return a reference to a vector<string>. Rather than assign to it, you just want to push_back onto it.
sig1 = signature(w1);
anagrams[sig1].push_back(w1);
As your code is written right now, it's trying to replace the vector instead of add to it. For example, let's assume your input contains both was and saw, and that your signature sorts the letters of the string.
What you want for this case is:
1. Read "was"
2. Sort to get "asw"
3. Insert "was" to get: anagrams["asw"] -> ["was"]
4. Read "saw"
5. Sort to get "asw" (again)
6. Insert "saw" to get: anagrams["asw"] -> ["was", "saw"]
With the code as you've tried to write it, however, in step 6, instead of adding to the existing vector, you'd overwrite the current vector with a new one containing only "saw", so the result would be just anagrams["asw"] -> ["saw"].
As far as printMap goes: the items in the map aren't std::strings, they're std::pair<const std::string, std::vector<std::string>>, so when you try to do:
void printMap(const map& m)
{
for(string s : m)
...that clearly can't work. I'd usually use:
for (auto s : m)
...which makes it easy to get at least that much to compile. To do anything useful with the s, however, you're going to need to realize that it's a pair, so you'll have to work with s.first and s.second (and s.first will be a string, and s.second will be a std::vector<std::string>). To print them out, you'll probably want to print s.first, then some separator, then walk through the items in s.second.

How to make an iterator to a read-only object writable (in C++)

I've created a unordered_set of my own type of struct. I have an iterator to this set and would like to increment a member (count) of the struct that the iterator points to. However, the compiler complains with the following message:
main.cpp:61:18: error: increment of member ‘SentimentWord::count’ in read-only object
How can I fix this?
Here's my code:
#include <fstream>
#include <iostream>
#include <cstdlib>
#include <string>
#include <unordered_set>
using namespace std;
struct SentimentWord {
string word;
int count;
};
//hash function and equality definition - needed to use unordered_set with type SentimentWord
struct SentimentWordHash {
size_t operator () (const SentimentWord &sw) const;
};
bool operator == (SentimentWord const &lhs, SentimentWord const &rhs);
int main(int argc, char **argv){
ifstream fin;
int totalWords = 0;
unordered_set<SentimentWord, SentimentWordHash> positiveWords;
unordered_set<SentimentWord, SentimentWordHash> negativeWords;
//needed for reading in sentiment words
string line;
SentimentWord temp;
temp.count = 0;
fin.open("positive_words.txt");
while(!fin.eof()){
getline(fin, line);
temp.word = line;
positiveWords.insert(temp);
}
fin.close();
//needed for reading in input file
unordered_set<SentimentWord, SentimentWordHash>::iterator iter;
fin.open("041.html");
while(!fin.eof()){
totalWords++;
fin >> line;
temp.word = line;
iter = positiveWords.find(temp);
if(iter != positiveWords.end()){
iter->count++;
}
}
for(iter = positiveWords.begin(); iter != positiveWords.end(); ++iter){
if(iter->count != 0){
cout << iter->word << endl;
}
}
return 0;
}
size_t SentimentWordHash::operator () (const SentimentWord &sw) const {
return hash<string>()(sw.word);
}
bool operator == (SentimentWord const &lhs, SentimentWord const &rhs){
if(lhs.word.compare(rhs.word) == 0){
return true;
}
return false;
}
Any help is greatly appreciated!
Elements in an unordered_set are, by definition, immutable:
In an unordered_set, the value of an element is at the same time its
key, that identifies it uniquely. Keys are immutable, therefore, the
elements in an unordered_set cannot be modified once in the container
- they can be inserted and removed, though.
I would vote that you use an unordered_map instead, using a string as the key and an int as the mapped value.
One solution (but a dirty hack) is to make your counter mutable, which means that it may be changed even on const objects.
struct SentimentWord {
string word;
mutable int count;
};
As I already said, this is a dirty hack, since it lets you soften rules that exist for a reason. It only works here because count takes no part in the hash or in the equality comparison; the elements of an unordered_set are const precisely so that nothing affecting their hash or equality can change after insertion.
A nicer solution is to use a map which uses the word as a key and the counter as a value. Your code then doesn't have to use find but simply access the element using the subscript operator ("array access" operator) which directly returns a reference (not an iterator). On this reference, use the increment operator, like this:
std::unordered_map<std::string,int> positiveWords;
//...
positiveWords[word]++;
Then you don't need your struct at all, and of course also not your custom comparison operator overload.
Trick (just in case you need it): If you want to order a map by its value (if you need a statistical map with the most frequent words coming first), use a second (but ordered) map with reversed key and value. This will sort it by the original value, which is now the key. Iterate it in reverse order to start with the most frequent words (or construct it with std::greater<int> as the comparison operator, provided as the third template parameter).
std::unordered_set is unhappy because it's worried you will change the object so that it becomes equal to another object, which would violate the set's uniqueness guarantee. It seems to me you really want a map from string to int (not a set at all); an iterator into a map lets you change the mapped value, though not the key.

sequentially reading a text file in C++

In C++, I want to read words sequentially from a text file and store each word in an array. After that, I will perform some operations on this array. But I do not know how to handle the first phase: sequentially reading words from a text file and storing each one in an array.
I should skip punctuation marks, which include ".", ",", and "?".
You need to use streams for this. Take a look at the examples here:
Input/Output with files
This sounds like homework. If it is, please be forthright.
First of all, it's almost always a bad idea in C++ to use a raw array -- using a vector is a much better idea. As for your question about punctuation -- that's up to your customer, but my inclination is to separate on whitespace.
Anyway, here's an easy way to do it that takes advantage of operator>>(istream&, string&) separating on whitespace by default.
ifstream infile("/path/to/file.txt");
vector<string> words;
copy(istream_iterator<string>(infile), istream_iterator<string>(), back_inserter(words));
Here's a complete program that reads words from a file named "filename", stores them in a std::vector and removes punctuation from the words.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
struct is_punct {
bool operator()(char c) const {
static const std::string punct(",.:;!?");
return punct.find(c) != std::string::npos;
}
};
int main(int argc, char* argv[])
{
std::ifstream in("filename");
std::vector<std::string> vec((std::istream_iterator<std::string>(in)),
std::istream_iterator<std::string>());
std::transform(vec.begin(), vec.end(),
vec.begin(),
[](std::string s) {
s.erase(std::remove_if(s.begin(), s.end(), is_punct()),
s.end());
return s;
});
// manipulate vec
}
Do you know how many words you'll be reading? If not, you'll need to grow the array as you read more and more words. The easiest way to do that is to use a standard container that does it for you: std::vector. Reading words separated by whitespace is easy as it's the default behavior of std::ifstream::operator>>. Removing punctuation marks requires some extra work, so we'll get to that later.
The basic workflow for reading words from a file goes like this:
#include <fstream>
#include <string>
#include <vector>
int main()
{
std::vector<std::string> words;
std::string w;
std::ifstream file("words.txt"); // opens the file for reading
while (file >> w) // read one word from the file, stops at end-of-file
{
// do some work here to remove punctuation marks
words.push_back(w);
}
return 0;
}
Assuming you're doing homework here, the real key is learning how to remove the punctuation marks from w before adding it to the vector. I would look into the following concepts to help you:
The erase-remove idiom. Note that a std::string behaves like a container of char.
std::remove_if
The ispunct function in the cctype library
Feel free to post more questions if you run into trouble.
Yet another possibility, using my usual approach, a special ctype facet:
class my_ctype : public std::ctype<char> {
public:
static mask const *get_table() { // static: it is called before the base class is constructed
// this copies the "classic" table used by <ctype.h>:
static std::vector<std::ctype<char>::mask>
table(classic_table(), classic_table()+table_size);
// Anything we want to separate tokens, we mark its spot in the table as 'space'.
table[','] = (mask)space;
table['.'] = (mask)space;
table['?'] = (mask)space;
// and return a pointer to the table:
return &table[0];
}
my_ctype(size_t refs=0) : std::ctype<char>(get_table(), false, refs) { }
};
Using that, reading the words is pretty easy:
int main(int argc, char **argv) {
std::ifstream infile(argv[1]); // open the file.
infile.imbue(std::locale(std::locale(), new my_ctype())); // use our classifier
// Create a vector containing the words from the file:
std::vector<std::string> words(
(std::istream_iterator<std::string>(infile)),
std::istream_iterator<std::string>());
// and now we're ready to process the words in the vector
// though it might be worth considering using `std::transform`, to take
// the input from the file and process it directly.