Count unique words in a string in C++


I want to count how many unique words are in a string s, where punctuation and the newline character (\n) separate the words. So far I've used the logical OR operator to count the word separators in the string, and added 1 to the result to get the number of words in s.
My current code returns 12 as the number of words. Since 'ab', 'AB', 'aB', 'Ab' (and likewise the variants of 'zzzz') are all the same word and not unique, how can I ignore the case variants of a word? I followed the link http://www.cplusplus.com/reference/algorithm/unique/, but that reference works on the items of a vector, and I am using a string, not a vector.
Here is my code:
#include <iostream>
#include <string>
using namespace std;
bool isWordSeparator(char & c) {
return c == ' ' || c == '-' || c == '\n' || c == '?' || c == '.' || c == ','
|| c == '?' || c == '!' || c == ':' || c == ';';
}
int countWords(string s) {
int wordCount = 0;
if (s.empty()) {
return 0;
}
for (int x = 0; x < s.length(); x++) {
if (isWordSeparator(s.at(x))) {
wordCount++;
}
}
return wordCount+1;
}
int main() {
string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";
int number_of_words = countWords(s);
cout << "Number of Words: " << number_of_words << endl;
return 0;
}

What you need to make your code case-insensitive is tolower().
You can apply it to your original string using std::transform:
std::transform(s.begin(), s.end(), s.begin(), ::tolower);
I should add, however, that your current code is much closer to C than to C++; perhaps you should check out what the standard library has to offer.
I suggest istringstream + istream_iterator for tokenizing and either unique_copy or set for getting rid of the duplicates, like this: https://ideone.com/nb4BEH

You could create a set of strings, save the position of the last separator (starting with 0) and use substr to extract each word, then insert it into the set. When done, just return the set's size.
You could make the whole operation easier with a split function, which tokenizes the string for you. std::string has no split member of its own, so this means a library routine such as boost::split. All you have to do is insert the returned tokens into the set and again return its size.
Edit: as per comments, you need a custom comparator to ignore case for comparisons.
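The set-plus-substring idea above can be sketched as follows. This is only a minimal illustration, not the answerer's code: the names isSeparator and countUniqueWords are made up for the example, and instead of a custom case-insensitive comparator it lower-cases each word before inserting it, which is a simpler swap with the same effect for plain ASCII text.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: treats whitespace and punctuation as word separators.
bool isSeparator(char c) {
    unsigned char uc = static_cast<unsigned char>(c);
    return std::isspace(uc) || std::ispunct(uc);
}

// Counts distinct words, ignoring case, by slicing the text at separators.
std::size_t countUniqueWords(const std::string& text) {
    std::set<std::string> words;
    std::size_t start = 0;  // index of the first character of the current word
    for (std::size_t i = 0; i <= text.size(); ++i) {
        if (i == text.size() || isSeparator(text[i])) {
            if (i > start) {  // skip empty slices between adjacent separators
                std::string word = text.substr(start, i - start);
                for (char& ch : word) {
                    ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
                }
                words.insert(word);
            }
            start = i + 1;  // the next word begins after this separator
        }
    }
    return words.size();
}
```

Called on the string from the question, countUniqueWords should report 2, since every token is a case variant of either ab or zzzz.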

First of all I'd suggest rewriting isWordSeparator like this:
bool isWordSeparator(char c) {
return std::isspace(c) || std::ispunct(c);
}
since your current implementation doesn't handle all the punctuation and space, like \t or +.
Also, incrementing wordCount whenever isWordSeparator is true is incorrect, for example when several separators appear in a row, as in "?!".
So, a less error-prone approach would be to substitute all separators by space and then iterate words inserting them into an (unordered) set:
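#include <algorithm>
#include <cctype>
#include <iterator>
#include <sstream>
#include <string>
#include <unordered_set>
int countWords(std::string s) {
// Turn every separator into a space and lower-case everything else.
// The explicit return type is needed because ' ' is a char and tolower returns int.
std::transform(s.begin(), s.end(), s.begin(), [](char c) -> char {
if (isWordSeparator(c)) {
return ' ';
}
return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
});
std::unordered_set<std::string> uniqWords;
std::stringstream ss(s);
// operator>> splits on whitespace, so each extracted token is one word.
std::copy(std::istream_iterator<std::string>(ss), std::istream_iterator<std::string>(), std::inserter(uniqWords, uniqWords.end()));
return static_cast<int>(uniqWords.size());
}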

While splitting the string into words, insert all words into a std::set. This will get rid of the duplicates. Then it's just a matter of calling set::size() to get the number of unique words.
I'm using the boost::split() function from the Boost string algorithm library in my solution, because it is almost standard nowadays.
Explanations in the comments in code...
#include <iostream>
#include <string>
#include <set>
#include <cctype>
#include <boost/algorithm/string.hpp>
using namespace std;
// Function suggested by user 'mshrbkv':
bool isWordSeparator(char c) {
return std::isspace(c) || std::ispunct(c);
}
// This is used to make the set case-insensitive.
// Alternatively you could call boost::to_lower() to make the
// string all lowercase before calling boost::split().
struct IgnoreCaseCompare {
bool operator()( const std::string& a, const std::string& b ) const {
return boost::ilexicographical_compare( a, b );
}
};
int main()
{
string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";
// Define a set that will contain only unique strings, ignoring case.
set< string, IgnoreCaseCompare > words;
// Split the string by using your isWordSeparator function
// to define the delimiters. token_compress_on collapses multiple
// consecutive delimiters into only one.
boost::split( words, s, isWordSeparator, boost::token_compress_on );
// Now the set contains only the unique words.
cout << "Number of Words: " << words.size() << endl;
for( auto& w : words )
cout << w << endl;
return 0;
}
Demo: http://coliru.stacked-crooked.com/a/a3b51a6c6a3b4ee8

You can consider SQLite c++ wrapper

Related

To remove the Duplicates from the given string (without sorting it) [duplicate]

The task is to remove duplicates from a string. I came up with a solution, but it sorts the string first.
So I wanted to know whether there is a way of removing duplicates from a string without sorting it.
Test Cases :
"allcbcd" -> "alcbd"
My Code (the one in which string has to be sorted) :
#include <iostream>
#include <string>
#include <algorithm>
using namespace std;
string removeDup(string s)
{
if (s.length() == 0)
{
return "";
}
sort(s.begin(), s.end());
char ch = s[0];
string ans = removeDup(s.substr(1));
if (ch == ans[0])
{
return ans;
}
return (ch + ans);
}
int main()
{
cout << removeDup("allcbbcd") << endl;
return 0;
}

Make a boolean array of size 256 considering only ASCII values.
Loop over the string and check the ASCII index value of the character in the array. If it is already set to true, then ignore that character. If not, add that character to your resultant string and set the ASCII index value to true in the array.
Finally, print the resultant string.
If you want to support UTF-8 characters, use a map instead of an array.
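The boolean-array approach described above might look like the sketch below. The function name removeDuplicates is an assumption for illustration, and the table is indexed by the unsigned byte value so that characters with negative char values do not index out of bounds.

```cpp
#include <array>
#include <string>

// Keeps the first occurrence of every byte value and drops later repeats,
// preserving the original order (no sorting involved).
std::string removeDuplicates(const std::string& s) {
    std::array<bool, 256> seen{};  // value-initialized: every entry starts as false
    std::string result;
    for (char c : s) {
        unsigned char index = static_cast<unsigned char>(c);
        if (!seen[index]) {
            seen[index] = true;  // remember that this character was already emitted
            result += c;
        }
    }
    return result;
}
```

For the test case from the question, removeDuplicates("allcbcd") yields "alcbd".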

Here is one way to do it without sorting (or using a recursive loop):
#include <iostream>
#include <string>
#include <algorithm>
using namespace std;
string removeDup(string s)
{
string::size_type index = 0;
string::iterator end = s.end();
// Stop at the logical end: anything past `end` is leftover from earlier remove() calls.
while (s.begin() + index < end) {
end = remove(s.begin()+index+1, end, s[index]);
++index;
}
s.erase(end, s.end());
return s;
}
int main()
{
cout << removeDup("allcbbcd") << endl;
return 0;
}

C++ count the number of words in a string that end in 'y' or 'z'

I'm trying to write a program that looks at the last letter of each word in a single string, determines whether the word ends in y or z, and counts those words.
For example:
"fez day" -> 2
"day fyyyz" -> 2
Everything I've looked up uses what looks to be arrays, but I don't know how to use those yet. I'm trying to figure out how to do it using for loops.
I honestly don't know where to start. I feel like some of my smaller programs could be used to help this, but I'm struggling in trying to figure out how to combine them.
This code counts the amount of words in a string:
int words = 0;
bool connectedLetter = false; // true while we are inside a word
for (auto c : s)
{
if (c == ' ')
{
connectedLetter = false;
}
if ( c != ' ' && connectedLetter == false)
{
++words;
connectedLetter = true;
}
}
and it might be useful for figuring out how to get the code to see separate words.
I've used this program to count the number of vowels in the entire string:
int vowels{0};
for (auto c : s)
{
if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u'
|| c == 'A' || c == 'E' || c == 'I' || c == 'O' || c == 'U')
{
++vowels;
}
}
and then I've done a small program to see every other letter in a string
auto len = s.size();
for (auto i = 0; i < len; i = i + 2)
{
result += s.at(i);
}
I feel like I know the concepts behind it, but combining them is what's stopping me.

You may also use existing C++ functions that are dedicated to doing what you want.
The solution is to take advantage of basic iostream functionality. You may know that the extraction operator >> extracts words from a stream (like std::cin or any other stream) until it hits the next white space.
So reading words is simple:
std::string word{}; std::cin >> word;
will read a complete word from std::cin.
OK, we have a std::string and no stream. But here C++ helps you with std::istringstream, which wraps a std::string in a stream object. You can then use all iostream functionality with this string stream.
Then, for counting elements that meet a special requirement, we have a standard algorithm from the C++ library: std::count_if.
It expects a begin and an end iterator. Here we simply use std::istream_iterator, which calls the extraction operator >> for every word in the stream.
With a lambda, given to std::count_if, we check whether a word meets the required condition.
We then get a very compact piece of code.
#include <iostream>
#include <sstream>
#include <string>
#include <algorithm>
#include <iterator>
int main() {
// test string
std::string testString{ "day fyyyz" };
// We want to extract words from the string, so, convert string to stream.
std::istringstream iss{ testString };
// count words, meeting a special condition
std::cout << std::count_if(std::istream_iterator<std::string>(iss), {},
[](const std::string& s) { return s.back() == 'y' || s.back() == 'z'; });
return 0;
}
Of course there are tons of other possible solutions.
Edit
Pete Becker asked for a more flexible solution. Here, too, C++ offers dedicated functionality: std::sregex_token_iterator.
With it we can specify any word pattern as a regex and then simply get or count the matches.
The result is an even simpler piece of code:
#include <iostream>
#include <string>
#include <vector>
#include <iterator>
#include <regex>
// \b anchors the match at the end of a word, so "days" is not counted for its "day" prefix.
const std::regex re{ R"(\w*[zy]\b)" };
int main() {
// test string
std::string s{ "day, fyyyz, abc , zzz" };
// count words, meeting a special condition
std::cout << std::vector(std::sregex_token_iterator(s.begin(), s.end(), re), {}).size();
return 0;
}

If you're not going to use an array (or something similar, like a string) it's probably easiest to just use two ints. For simplicity, let's call them current and previous. You'll also need a count, which you'll want to initialize to 0.
1. Start by initializing both to EOF.
2. Read a character into current.
3. If current is a space or EOF (well, anything you don't consider part of a word), and previous is z or previous is y, increment count.
4. If current is EOF, print out count, and you're done.
5. Copy the value in current into previous.
6. Go back to step 2.
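A possible rendering of these steps is sketched below. It reads from any std::istream rather than directly from std::cin so that it can be tried on a string, and it returns the count instead of printing it; both choices, and the function names, are assumptions made for the example. Only whitespace is treated as "not part of a word" here.

```cpp
#include <cctype>
#include <cstdio>   // EOF
#include <istream>
#include <sstream>
#include <string>

// Counts words ending in 'y' or 'z' using only two ints of state.
int countWordsEndingInYOrZ(std::istream& in) {
    int previous = EOF;
    int current = EOF;
    int count = 0;
    for (;;) {
        current = in.get();  // yields EOF once the input is exhausted
        bool endOfWord = (current == EOF) || std::isspace(current);
        if (endOfWord && (previous == 'y' || previous == 'z')) {
            ++count;
        }
        if (current == EOF) {
            return count;
        }
        previous = current;
    }
}

// Convenience wrapper (hypothetical) so the function can be tried on a string.
int countInString(const std::string& text) {
    std::istringstream in(text);
    return countWordsEndingInYOrZ(in);
}
```

With the examples from the question, countInString("fez day") and countInString("day fyyyz") both give 2.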

std::string is much smarter than many people realize. In particular, it has member functions find_first_of, find_first_not_of, find_last_of, and find_last_not_of that are very helpful for simple parsing. I'd approach it like this:
std::string str = "fez day"; // for example
std::string targets = "yz";
int target_count = 0;
char delims = ' ';
std::string::size_type pos = str.find_first_not_of(delims);
while (pos < str.length()) {
pos = str.find_first_of(delims, pos);
if (pos == std::string::npos)
pos = str.length();
if (targets.find(str[pos-1]) != std::string::npos)
++target_count;
pos = str.find_first_not_of(delims, pos);
}
std::cout << target_count << '\n';
Now, if I need to change this to accommodate comma-separated words, I just change
char delims = ' ';
to
std::string delims = " ,";
or to
const char* delims = " ,"; // my preference
and if I need to change the characters that I'm looking for, I just change the contents of targets. (In fact, I'd use const char* targets = "yz"; and search with std::strchr, which reduces overhead a bit, but that's not particularly important.)

How to get a word vector from a string?

I want to store words separated by spaces as individual string elements in a vector.
The input is a string that may or may not end in a symbol (comma, period, etc.).
All symbols will be separated by spaces too.
I created this function, but it doesn't return a vector of words.
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (size_t character = 0; character < sentence.size(); ++character)
{
if (sentence[character] == ' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
return word_vector;
}
What did I do wrong?

Your problem has already been resolved by the answers and comments.
I would like to add that such functionality already exists in C++.
You could take advantage of the fact that the extraction operator extracts space-separated tokens from a stream. Because a std::string is not a stream, we can first put the string into a std::istringstream and then extract from this stream via std::istream_iterator.
We could make life even easier.
For roughly 10 years we have had a dedicated C++ facility for splitting strings into tokens, explicitly designed for this purpose: std::sregex_token_iterator. And because we have such a dedicated facility, we should simply use it.
The idea behind it is the iterator concept. In C++ we have many containers, and always iterators to iterate over the elements in these containers. A string with similar elements (tokens) separated by a delimiter can also be seen as such a container. And with std::sregex_token_iterator we can iterate over the elements/tokens/substrings of the string, effectively splitting it up.
This iterator is very powerful and you can do much fancier things with it, but that is too much for here. The important point is that splitting a string into tokens is a one-liner, for example a variable definition using a range constructor to iterate over the tokens.
See some examples below:
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>
#include <regex>
const std::regex delimiter{ " " };
const std::regex reWord{ "(\\w+)" };
int main() {
// Some debug print function
auto print = [](const std::vector<std::string>& sv) -> void {
std::copy(sv.begin(), sv.end(), std::ostream_iterator<std::string>(std::cout, "\n")); std::cout << "\n"; };
// The test string
std::string test{ "word1 word2 word3 word4." };
//-----------------------------------------------------------------------------------------
// Solution 1: use istringstream and then extract from there
std::istringstream iss1(test);
// Define a vector (CTAD), use its range constructor and, the std::istream_iterator as iterator
std::vector words1(std::istream_iterator<std::string>(iss1), {});
print(words1); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 2: directly use dedicated function sregex_token iterator
std::vector<std::string> words2(std::sregex_token_iterator(test.begin(), test.end(), delimiter, -1), {});
print(words2); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 3: directly use dedicated function sregex_token iterator and look for words only
std::vector<std::string> words3(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {});
print(words3); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 4: Use such iterator in an algorithm, to copy data to a vector
std::vector<std::string> words4{};
std::copy(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {}, std::back_inserter(words4));
print(words4); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 5: Use such iterator in an algorithm for direct output
std::copy(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {}, std::ostream_iterator<std::string>(std::cout,"\n"));
return 0;
}

You added the index instead of the character:
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (size_t i = 0; i < sentence.size(); ++i)
{
char character = sentence[i];
if (character == ' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
return word_vector;
}

Since your mistake came only from naming your loop variable character even though it is actually an index rather than a character, I would suggest using a range-based for loop here, since it avoids this kind of confusion. The clean solution is obviously to do what @ArminMontigny said, but I assume you are not allowed to use string streams. The code would look like this:
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (char& character: sentence) // Now `character` is actually a character.
{
if (character==' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
word_vector.push_back(result_word); // In your solution, you forgot to push the last word into the vector.
return word_vector;
}
int main() {
string sentence="Maybe try range based loops";
vector<string> result= single_words(sentence);
for(string& word: result)
cout<<word<<" ";
return 0;
}

How to check if a string is all lowercase and alphanumerics?

Is there a method that checks for these cases? Or do I need to parse each letter in the string, and check if it's lower case (letter) and is a number/letter?

You can use islower(), isalnum() to check for those conditions for each character. There is no string-level function to do this, so you'll have to write your own.

Assuming that the "C" locale is acceptable (or swap in a different set of characters for criteria), use find_first_not_of()
#include <string>
bool testString(const std::string& str)
{
std::string criteria("abcdefghijklmnopqrstuvwxyz0123456789");
return std::string::npos == str.find_first_not_of(criteria);
}

It's not very well known, but a locale actually does have functions to determine characteristics of entire strings at a time. Specifically, the ctype facet of a locale has a scan_is and a scan_not that scan for the first character that fits a specified mask (alpha, numeric, alphanumeric, lower, upper, punctuation, space, hex digit, etc.), or the first that doesn't fit it, respectively. Other than that, they work a bit like std::find_if, returning whatever you passed as the "end" to signal failure, otherwise returning a pointer to the first item in the string that doesn't fit what you asked for.
Here's a quick sample:
#include <locale>
#include <iostream>
#include <iomanip>
#include <string>
int main() {
std::string inputs[] = {
"alllower",
"1234",
"lower132",
"including a space"
};
// We'll use the "classic" (C) locale, but this works with any
std::locale loc(std::locale::classic());
// A mask specifying the characters to search for:
std::ctype_base::mask m = std::ctype_base::lower | std::ctype_base::digit;
for (int i=0; i<4; i++) {
char const *pos;
char const *b = inputs[i].data();
char const *e = b + inputs[i].size(); // avoids dereferencing end()
std::cout << "Input: " << std::setw(20) << inputs[i] << ":\t";
// finally, call the actual function:
if ((pos=std::use_facet<std::ctype<char> >(loc).scan_not(m, b, e)) == e)
std::cout << "All characters match mask\n";
else
std::cout << "First non-matching character = \"" << *pos << "\"\n";
}
return 0;
}
I suspect most people will prefer to use std::find_if though -- using it is nearly the same, but can be generalized to many more situations quite easily. Even though this has much narrower applicability, it's not really a lot easier to use (though I suppose if you're scanning large chunks of text, it might well be at least a little faster).

You could use tolower and strcmp to compare the original string with its lower-cased copy, and check the digits individually per character.
(OR) Do both per character as below.
#include <algorithm>
#include <cctype>
#include <string>
static inline bool is_not_alphanum_lower(char c)
{
// Reject anything that is neither a digit nor a lowercase letter.
// (Testing !islower alone would also reject digits.)
unsigned char uc = static_cast<unsigned char>(c);
return !(isdigit(uc) || islower(uc));
}
bool string_is_valid(const std::string &str)
{
return std::find_if(str.begin(), str.end(), is_not_alphanum_lower) == str.end();
}
I used some info from:
Determine if a string contains only alphanumeric characters (or a space)

Just use std::all_of
bool lowerAlnum = std::all_of(str.cbegin(), str.cend(), [](const char c){
return isdigit(c) || islower(c);
});
If you don't care about locale (i.e. the input is pure 7-bit ASCII) then the condition can be optimized into
[](const char c){ return ('0' <= c && c <= '9') || ('a' <= c && c <= 'z'); }

If your strings contain ASCII-encoded text and you like to write your own functions (like I do) then you can use this:
bool is_lower_alphanumeric(const string& txt)
{
for(char c : txt)
{
if (!((c >= '0' and c <= '9') or (c >= 'a' and c <= 'z'))) return false;
}
return true;
}

string analysis

If a string may include several unnecessary characters, such as @, #, $, %, how do I find them and delete them?
I know this requires a loop, but I do not know how to represent something such as @, #, $, %.
If you can give me a code example, I would really appreciate it.

The usual standard C++ approach would be the erase/remove idiom:
#include <string>
#include <algorithm>
#include <iostream>
struct OneOf {
std::string chars;
OneOf(const std::string& s) : chars(s) {}
bool operator()(char c) const {
return chars.find_first_of(c) != std::string::npos;
}
};
int main()
{
std::string s = "string with @, #, $, %";
s.erase(std::remove_if(s.begin(), s.end(), OneOf("@#$%")), s.end());
std::cout << s << '\n';
}
and yes, boost offers some neat ways to write it shorter, for example using boost::erase_all_regex
#include <string>
#include <iostream>
#include <boost/algorithm/string/regex.hpp>
int main()
{
std::string s = "string with @, #, $, %";
boost::erase_all_regex(s, boost::regex("[@#$%]"));
std::cout << s << '\n';
}

If you want to get fancy, there is Boost.Regex; otherwise you can use the STL replace function in combination with the strchr function.

And if you, for some reason, have to do it yourself C-style, something like this would work:
char* oldstr = ... something something dark side ...
int oldstrlen = strlen(oldstr); // length without the terminating null
char* newstr = new char[oldstrlen + 1]; // allocate memory for the new nicer string, plus the terminator
char* p = newstr; // get a pointer to the beginning of the new string
for ( int i=0; i<oldstrlen; i++ ) // iterate over the original string
if (oldstr[i] != '@' && oldstr[i] != '#' && oldstr[i] != '$' && oldstr[i] != '%') // check that the current character is not a bad one
*p++ = oldstr[i]; // append it to the new string
*p = 0; // dont forget the null-termination

I think for this I'd use std::remove_copy_if:
#include <string>
#include <algorithm>
#include <iostream>
#include <iterator>
struct bad_char {
bool operator()(char ch) const {
return ch == '@' || ch == '#' || ch == '$' || ch == '%';
}
};
int main() {
std::string in("This#is#a$string%with#extra#stuff$to%ignore");
std::string out;
std::remove_copy_if(in.begin(), in.end(), std::back_inserter(out), bad_char());
std::cout << out << "\n";
return 0;
}
Result:
Thisisastringwithextrastufftoignore
Since the data containing these unwanted characters will normally come from a file of some sort, it's also worth considering getting rid of them as you read the data from the file instead of reading the unwanted data into a string, and then filtering it out. To do this, you could create a facet that classifies the unwanted characters as white space:
#include <locale>
#include <sstream>
#include <vector>
struct filter: std::ctype<char>
{
filter(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table()
{
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::mask());
rc['@'] = std::ctype_base::space;
rc['#'] = std::ctype_base::space;
rc['$'] = std::ctype_base::space;
rc['%'] = std::ctype_base::space;
return &rc[0];
}
};
To use this, you imbue the input stream with a locale using this facet, and then read normally. For the moment I'll use an istringstream, though you'd normally use something like an istream or ifstream:
int main() {
std::istringstream in("This#is#a$string%with#extra#stuff$to%ignore");
in.imbue(std::locale(std::locale(), new filter));
std::copy(std::istream_iterator<char>(in),
std::istream_iterator<char>(),
std::ostream_iterator<char>(std::cout));
return 0;
}

Is this C or C++? (You've tagged it both ways.)
In pure C, you pretty much have to loop through character by character and delete the unwanted ones. For example:
char *buf; /* must point at a writable, null-terminated string */
int len = strlen(buf);
int i, j;
for (i = 0; i < len; i++)
{
if (buf[i] == '@' || buf[i] == '#' || buf[i] == '$' /* etc */)
{
for (j = i; j < len; j++)
{
buf[j] = buf[j+1];
}
i --;
}
}
This isn't very efficient - it checks each character in turn and shuffles them all up if there's one you don't want. You have to decrement the index afterwards to make sure you check the new next character.

General algorithm:
Build a string that contains the characters you want purged: "@#$%"
Iterate character by character over the subject string.
Search if each character is found in the purge set.
If a character matches, discard it.
If a character doesn't match, append it to a result string.
Depending on the string library you are using, there are functions/methods that implement one or more of the above steps, such as strchr() or find() to determine if a character is in a string.
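The general algorithm above can be sketched with std::string::find doing the membership test. The function name purge is hypothetical, and the purge set is passed in as an argument rather than hard-coded.

```cpp
#include <string>

// Returns a copy of `subject` without any character that occurs in `unwanted`.
std::string purge(const std::string& subject, const std::string& unwanted) {
    std::string result;
    for (char c : subject) {
        // find() returns npos when c is not in the purge set, so c is kept.
        if (unwanted.find(c) == std::string::npos) {
            result += c;
        }
    }
    return result;
}
```

For instance, purge("a@b#c", "@#$%") gives "abc". Building a separate result string keeps the loop simple; the erase/remove idiom shown in other answers does the same job in place.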

Use a character literal, i.e. a would be 'a'. You haven't said whether you're using C++ strings (in which case you can use the find and replace methods) or C strings, in which case you'd use something like this (this is by no means the best way, but it's a simple way):
void RemoveChar(char* szString, char c)
{
while(*szString != '\0')
{
if(*szString == c)
memmove(szString,szString+1,strlen(szString+1)+1); // regions overlap, so memmove rather than memcpy
else
szString++; // only advance when nothing was removed, so consecutive matches are caught
}
}

You can use a loop and call find_last_of (http://www.cplusplus.com/reference/string/string/find_last_of/) repeatedly to find the last character that you want to replace, replace it with blank, and then continue working backwards in the string.
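A minimal sketch of that backwards loop, assuming the unwanted characters are replaced by a space as the answer suggests (the function name blankOut is made up for the example):

```cpp
#include <string>

// Replaces every character of `text` that occurs in `unwanted` with a space,
// walking from the end of the string towards the front.
std::string blankOut(std::string text, const std::string& unwanted) {
    std::string::size_type pos = text.find_last_of(unwanted);
    while (pos != std::string::npos) {
        text[pos] = ' ';
        if (pos == 0) {
            break;  // nothing left in front of position 0
        }
        pos = text.find_last_of(unwanted, pos - 1);  // continue the search before pos
    }
    return text;
}
```

Here blankOut("a#b$c", "#$") gives "a b c". To delete the characters instead of blanking them, text.erase(pos, 1) could be used in place of the assignment.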

Something like this would do :
bool is_bad(char c)
{
if( c == '@' || c == '#' || c == '$' || c == '%' )
return true;
else
return false;
}
int main(int argc, char **argv)
{
string str = "a #test ##string";
str.erase(std::remove_if(str.begin(), str.end(), is_bad), str.end() );
}
If your compiler supports lambdas (or if you can use boost), it can be made even shorter. Example using boost::lambda :
string str = "a #test ##string";
str.erase(std::remove_if(str.begin(), str.end(), (_1 == '@' || _1 == '#' || _1 == '$' || _1 == '%')), str.end() );
(yay two lines!)

A character is represented in C/C++ by single quotes, e.g. '@', '#', etc. (except for a few that need to be escaped).
To search for a character in a string, use strchr(). Here is a link to a sample code:
http://www.cplusplus.com/reference/clibrary/cstring/strchr/
