string analysis - c++

If a string may include several unnecessary characters, such as #, #, $, %, how can I find and delete them?
I know this requires a loop, but I do not know how to represent something such as #, #, $, %.
If you can give me a code example, I would really appreciate it.

The usual standard C++ approach would be the erase/remove idiom:
#include <string>
#include <algorithm>
#include <iostream>
struct OneOf {
std::string chars;
OneOf(const std::string& s) : chars(s) {}
bool operator()(char c) const {
return chars.find_first_of(c) != std::string::npos;
}
};
int main()
{
std::string s = "string with #, #, $, %";
s.erase(std::remove_if(s.begin(), s.end(), OneOf("##$%")), s.end());
std::cout << s << '\n';
}
And yes, Boost offers some neat ways to write it shorter, for example using boost::erase_all_regex:
#include <string>
#include <iostream>
#include <boost/algorithm/string/regex.hpp>
int main()
{
std::string s = "string with #, #, $, %";
erase_all_regex(s, boost::regex("[##$%]"));
std::cout << s << '\n';
}

If you want to get fancy, there is Boost.Regex; otherwise you can use the STL replace function in combination with the strchr function.
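A minimal sketch of the replace-plus-strchr idea, assuming "the STL replace function" means std::replace_if and that blanking out the unwanted characters (rather than deleting them) is acceptable. The name blank_out is made up for illustration:

```cpp
#include <algorithm>
#include <cstring>
#include <string>

// Returns a copy of `s` in which every character listed in `unwanted`
// has been replaced by a space. `blank_out` is a hypothetical helper.
std::string blank_out(std::string s, const char* unwanted) {
    std::replace_if(
        s.begin(), s.end(),
        // strchr also matches the terminating '\0', so exclude it explicitly
        [unwanted](char c) { return c != '\0' && std::strchr(unwanted, c) != nullptr; },
        ' ');
    return s;
}
```

For example, blank_out("a#b$c", "#$%") gives "a b c". If the characters should disappear entirely, the erase/remove idiom is the better fit.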

And if you, for some reason, have to do it yourself C-style, something like this would work:
char* oldstr = ... something something dark side ...
size_t oldstrlen = strlen(oldstr); // length without the terminator
char* newstr = new char[oldstrlen + 1]; // allocate memory for the new nicer string, plus the terminator
char* p = newstr; // get a pointer to the beginning of the new string
for ( size_t i=0; i<oldstrlen; i++ ) // iterate over the original string
if (oldstr[i] != '#' && oldstr[i] != '#' && etc....) // check that the current character is not a bad one
*p++ = oldstr[i]; // append it to the new string
*p = 0; // don't forget the null-termination

I think for this I'd use std::remove_copy_if:
#include <string>
#include <algorithm>
#include <iterator>
#include <iostream>
struct bad_char {
bool operator()(char ch) const {
return ch == '#' || ch == '#' || ch == '$' || ch == '%';
}
};
int main() {
std::string in("This#is#a$string%with#extra#stuff$to%ignore");
std::string out;
std::remove_copy_if(in.begin(), in.end(), std::back_inserter(out), bad_char());
std::cout << out << "\n";
return 0;
}
Result:
Thisisastringwithextrastufftoignore
Since the data containing these unwanted characters will normally come from a file of some sort, it's also worth considering getting rid of them as you read the data from the file instead of reading the unwanted data into a string, and then filtering it out. To do this, you could create a facet that classifies the unwanted characters as white space:
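#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <sstream>
#include <vector>
struct filter: std::ctype<char>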
{
filter(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table()
{
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::mask());
rc['#'] = std::ctype_base::space;
rc['#'] = std::ctype_base::space;
rc['$'] = std::ctype_base::space;
rc['%'] = std::ctype_base::space;
return &rc[0];
}
};
To use this, you imbue the input stream with a locale using this facet, and then read normally. For the moment I'll use an istringstream, though you'd normally use something like an istream or ifstream:
int main() {
std::istringstream in("This#is#a$string%with#extra#stuff$to%ignore");
in.imbue(std::locale(std::locale(), new filter));
std::copy(std::istream_iterator<char>(in),
std::istream_iterator<char>(),
std::ostream_iterator<char>(std::cout));
return 0;
}

Is this C or C++? (You've tagged it both ways.)
In pure C, you pretty much have to loop through character by character and delete the unwanted ones. For example:
char *buf;
int len = strlen(buf);
int i, j;
for (i = 0; i < len; i++)
{
if (buf[i] == '#' || buf[i] == '#' || buf[i] == '$' /* etc */)
{
for (j = i; j < len; j++)
{
buf[j] = buf[j+1];
}
i --;
}
}
This isn't very efficient - it checks each character in turn and shuffles them all up if there's one you don't want. You have to decrement the index afterwards to make sure you check the new next character.

General algorithm:
Build a string that contains the characters you want purged: "##$%"
Iterate character by character over the subject string.
Search if each character is found in the purge set.
If a character matches, discard it.
If a character doesn't match, append it to a result string.
Depending on the string library you are using, there are functions/methods that implement one or more of the above steps, such as strchr() or find() to determine if a character is in a string.
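The steps above can be sketched with std::string, using find() for the "is it in the purge set" test. This is only one possible rendering; the function name strip_chars is made up for illustration:

```cpp
#include <string>

// Returns a copy of `subject` without any character found in `purge`.
// `strip_chars` is a hypothetical name, not something from a library.
std::string strip_chars(const std::string& subject, const std::string& purge) {
    std::string result;
    result.reserve(subject.size());              // at most as long as the input
    for (char c : subject) {                     // iterate character by character
        if (purge.find(c) == std::string::npos)  // not in the purge set
            result += c;                         // keep it
        // otherwise: discard it
    }
    return result;
}
```

For example, strip_chars("a#b$c%d", "#$%") gives "abcd".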

Use single quotes to write a character literal, i.e. a would be 'a'. You haven't said whether you're using C++ strings (in which case you can use the find and replace methods) or C strings, in which case you'd use something like this (this is by no means the best way, but it's a simple way):
void RemoveChar(char* szString, char c)
{
while(*szString != '\0')
{
if(*szString == c)
memmove(szString,szString+1,strlen(szString+1)+1); // the regions overlap, so memmove rather than memcpy
else
szString++; // only advance when nothing was removed, so runs of c are handled
}
}

You can use a loop and call find_last_of (http://www.cplusplus.com/reference/string/string/find_last_of/) repeatedly to find the last character that you want to replace, replace it with blank, and then continue working backwards in the string.
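A rough sketch of that backwards loop. Note that it erases each match instead of replacing it with a blank, which is an assumption about what the asker wants; the name erase_backwards is made up:

```cpp
#include <string>

// Walks backwards through `s` with find_last_of and erases every
// character that appears in `unwanted`. Hypothetical helper name.
std::string erase_backwards(std::string s, const std::string& unwanted) {
    std::string::size_type pos = s.find_last_of(unwanted);
    while (pos != std::string::npos) {
        s.erase(pos, 1);          // drop the match; everything after it is already clean
        if (pos == 0)
            break;                // nothing left in front of it
        pos = s.find_last_of(unwanted, pos - 1);  // continue searching to the left
    }
    return s;
}
```

Working from the back means each erase only shifts characters that have already been checked.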

Something like this would do :
bool is_bad(char c)
{
if( c == '#' || c == '#' || c == '$' || c == '%' )
return true;
else
return false;
}
int main(int argc, char **argv)
{
string str = "a #test ##string";
str.erase(std::remove_if(str.begin(), str.end(), is_bad), str.end() );
}
If your compiler supports lambdas (or if you can use boost), it can be made even shorter. Example using boost::lambda :
string str = "a #test ##string";
str.erase(std::remove_if(str.begin(), str.end(), (_1 == '#' || _1 == '#' || _1 == '$' || _1 == '%')), str.end() );
(yay two lines!)

A character is represented in C/C++ by single quotes, e.g. '#', '#', etc. (except for a few that need to be escaped).
To search for a character in a string, use strchr(). Here is a link to a sample code:
http://www.cplusplus.com/reference/clibrary/cstring/strchr/
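In case the link goes away, here is a small sketch of how strchr() could be used for this particular task on a C string, filtering in place in a single pass. The function name remove_chars_inplace is an invention for this example:

```cpp
#include <cstring>

// Removes, in place, every character of `s` that appears in `unwanted`.
// `remove_chars_inplace` is a hypothetical name used for illustration.
void remove_chars_inplace(char* s, const char* unwanted) {
    char* out = s;                                   // next position to write to
    for (const char* in = s; *in != '\0'; ++in) {
        if (std::strchr(unwanted, *in) == nullptr)   // not one of the bad characters
            *out++ = *in;                            // keep it
    }
    *out = '\0';                                     // terminate the shortened string
}
```

Called on a buffer holding "a#b$c" with "#$%" as the unwanted set, this leaves "abc" in the buffer.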

Related

Word counter returning incorrect number of words

I've been trying to create a program that reads text from a file and stores it in a string. I feed the string to a function that counts every word in the string.
However, it's only accurate assuming the user leaves some whitespace at the end of a line and doesn't create blank lines... not a very good word counter.
Creating a blank line results in a false increment to the word count.
I'm not sure if my main problem is using a boolean to do this or checking for whitespace and '\n' characters.
bool countingLetters = false;
int wordCount = 0;
for (int i = 0; i < text.length(); i++)
{
if (text[i] == ' ' && countingLetters == true)
{
countingLetters = false;
wordCount++;
}
if (text[i] != ' ' && countingLetters == false)
{
countingLetters = true;
}
if (text[i] == '\n' && countingLetters == true)
{
countingLetters = false;
wordCount++;
}
}
Your code is basically a state machine. To complete your solution, just account for the end of the string.
Add this to the end of your code:
if(countingLetters) { // word at the end of string, without any space character
wordCount++;
}
Or, since std::string behaves like a C-style string here (text[text.length()] is '\0' since C++11), you can index one past the last character and handle '\0' the same way as space and '\n'.
To improve your code, use isspace (which covers more whitespace characters, including '\t', etc.), and prefer an else if chain. Also, it's not good practice to compare with == true; just use the boolean as the condition.
Or maybe isalpha(c) fits your needs better.
bool countingLetters = false;
int wordCount = 0;
for (char c:text) {
if (!isalpha(c) && countingLetters) { // this also works for newline
countingLetters = false;
++wordCount;
} else if (isalpha(c) && !countingLetters) {
countingLetters = true;
} // otherwise just skip
}
if(countingLetters) { // word at the end of string, without any space character
++wordCount;
}
And it's not acceptable to insert an extra character just for such a simple task. For example, text may be const.
An alternative is to count the beginning of a "word".
Let us say the beginning of a word is a letter after a non-letter. We can adjust this if desired.
int wordCount = 0;
int prior = '\n'; // some non-letter
for (int i = 0; i < text.length(); i++) {
if (isalpha(text[i]) && !isalpha(prior)) {
wordCount++;
}
prior = text[i];
}
C++ also provides some very high-level ways to do this.
One is by using a loop over a stringstream, which splits text on whitespace:
#include <sstream>
#include <string>
std::size_t count_words( const std::string& s )
{
std::size_t count = 0;
std::istringstream ss( s );
std::string t;
while (ss >> t) count += 1;
return count;
}
Another is using a stream iterator algorithm:
#include <iterator>
#include <sstream>
#include <string>
std::size_t count_words( const std::string& s )
{
std::istringstream ss( s );
return std::distance(
std::istream_iterator <std::string> ( ss ),
std::istream_iterator <std::string> ()
);
}
Yet another is using a regular expression:
#include <iterator>
#include <regex>
#include <string>
std::size_t count_words( const std::string& s )
{
std::regex re( "\\w+" );
return std::distance(
std::sregex_iterator( s.begin(), s.end(), re ),
std::sregex_iterator()
);
}
I’m sure there are many more, but those three are the ones that come off the top of my head.

C++ count the number of words in a string that end in 'y' or 'z'

I'm trying to write a program that looks at the last letter of each word in a single string and determines if it ends in y or z and count it.
For example:
"fez day" -> 2
"day fyyyz" -> 2
Everything I've looked up uses what looks to be arrays, but I don't know how to use those yet. I'm trying to figure out how to do it using for loops.
I honestly don't know where to start. I feel like some of my smaller programs could be used to help this, but I'm struggling in trying to figure out how to combine them.
This code counts the number of words in a string:
int words = 0;
bool connectedLetter = false;
for (auto c : s)
{
if (c == ' ')
{
connectedLetter = false;
}
if ( c != ' ' && connectedLetter == false)
{
++words;
connectedLetter = true;
}
}
and it might be useful to try and figure out how to get the code to see separate words.
I've used this program to count the number of vowels in the entire string:
int vowels{0};
for (auto c : s)
{
if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u'
|| c == 'A' || c == 'E' || c == 'I' || c == 'O' || c == 'U')
{
++vowels;
}
}
and then I've done a small program to see every other letter in a string
auto len = s.size();
for (auto i = 0; i < len; i = i + 2)
{
result += s.at(i);
}
I feel like I know the concepts behind it, but it's putting them together that is stopping me.
You can also use existing C++ facilities that are dedicated to doing what you want.
The solution is to take advantage of basic iostream functionality. You may know that the extraction operator >> will extract words from a stream (like std::cin or any other stream), reading until it hits the next white space.
So reading words is simple:
std::string word{}; std::cin >> word;
will read a complete word from std::cin.
OK, we have a std::string and no stream. But here C++ helps you with std::istringstream, which wraps a std::string in a stream object. You can then use all iostream functionality with this string stream.
Then, for counting elements that meet a special requirement, we have a standard algorithm from the C++ library: std::count_if.
It expects a begin and an end iterator. Here we simply use std::istream_iterator, which calls the extraction operator >> for each string in the stream.
With a lambda passed to std::count_if, we check whether a word meets the required condition.
We then get a very compact piece of code.
#include <iostream>
#include <sstream>
#include <string>
#include <algorithm>
#include <iterator>
int main() {
// test string
std::string testString{ "day fyyyz" };
// We want to extract words from the string, so, convert string to stream.
std::istringstream iss{ testString };
// count words, meeting a special condition
std::cout << std::count_if(std::istream_iterator<std::string>(iss), {},
[](const std::string& s) { return s.back() == 'y' || s.back() == 'z'; });
return 0;
}
Of course there are tons of other possible solutions.
Edit
Pete Becker asked for a more flexible solution. Here, too, C++ offers dedicated functionality: std::sregex_token_iterator.
With it we can specify any word pattern with a regex and then simply get or count the matches.
The result is an even simpler piece of code:
#include <iostream>
#include <string>
#include <vector>
#include <iterator>
#include <regex>
const std::regex re{ R"(\w+[zy])" };
int main() {
// test string
std::string s{ "day, fyyyz, abc , zzz" };
// count words, meeting a special condition
std::cout << std::vector(std::sregex_token_iterator(s.begin(), s.end(), re), {}).size();
return 0;
}
If you're not going to use an array (or something similar, like a string) it's probably easiest to just use two ints. For simplicity, let's call them current and previous. You'll also need a count, which you'll want to initialize to 0.
1. Start by initializing both to EOF.
2. Read a character into current.
3. If current is a space or EOF (well, anything you don't consider part of a word), and previous is z or previous is y, increment count.
4. If current is EOF, print out count, and you're done.
5. Copy the value in current into previous.
6. Go back to step 2.
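The steps above could look roughly like this. It's a sketch that returns the count instead of printing it and treats anything that isn't a letter as "not part of a word"; both choices, and the name count_yz_endings, are assumptions made for the example:

```cpp
#include <cctype>
#include <cstdio>    // EOF
#include <istream>
#include <sstream>

// Counts words ending in 'y' or 'z' using only two ints, current and previous.
// `count_yz_endings` is a hypothetical name.
int count_yz_endings(std::istream& in) {
    int count = 0;
    int previous = EOF;
    for (;;) {
        int current = in.get();   // yields EOF at the end of the input
        bool part_of_word = (current != EOF) && std::isalpha(current) != 0;
        // A word has just ended if we left the word and its last letter was y or z.
        if (!part_of_word && (previous == 'y' || previous == 'z'))
            ++count;
        if (current == EOF)
            return count;
        previous = current;
    }
}
```

To try it on a string rather than std::cin, wrap the string in a std::istringstream and pass that in.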
std::string is much smarter than many people realize. In particular, it has member functions find_first_of, find_first_not_of, find_last_of, and find_last_not_of that are very helpful for simple parsing. I'd approach it like this:
std::string str = "fez day"; // for example
std::string targets = "yz";
int target_count = 0;
char delims = ' ';
std::string::size_type pos = str.find_first_not_of(delims);
while (pos < str.length()) {
pos = str.find_first_of(delims, pos);
if (pos == std::string::npos)
pos = str.length();
if (targets.find(str[pos-1]) != std::string::npos)
++target_count;
pos = str.find_first_not_of(delims, pos);
}
std::cout << target_count << '\n';
Now, if I need to change this to accommodate comma-separated words, I just change
char delims = ' ';
to
std::string delims = " ,";
or to
const char* delims = " ,"; // my preference
and if I need to change the characters that I'm looking for, I just change the contents of targets. (In fact, I'd use const char* targets = "yz"; and search with std::strchr, which reduces overhead a bit, but that's not particularly important.)
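For reference, a sketch of the const char* variant wrapped in a function, with std::strchr doing the target lookup. The function name and parameter order are invented here, and the text is assumed to contain no embedded '\0' characters (strchr would treat one as a match):

```cpp
#include <cstring>
#include <string>

// Counts the words in `str` (separated by any character in `delims`)
// whose last character is one of `targets`. Hypothetical helper.
int count_words_ending_in(const std::string& str, const char* targets, const char* delims) {
    int count = 0;
    std::string::size_type pos = str.find_first_not_of(delims);    // start of first word
    while (pos != std::string::npos) {
        std::string::size_type end = str.find_first_of(delims, pos);  // one past the word
        if (end == std::string::npos)
            end = str.length();
        if (std::strchr(targets, str[end - 1]) != nullptr)         // last letter of the word
            ++count;
        pos = str.find_first_not_of(delims, end);                  // start of next word
    }
    return count;
}
```

With this, count_words_ending_in("fez day", "yz", " ") gives 2, and passing " ," as delims handles comma-separated words.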

Count unique words in a string in C++

I want to count how many unique words are in string 's', where punctuation and the newline character (\n) separate each word. So far I've used the logical or operator to check how many word separators are in the string, and added 1 to the result to get the number of words in string s.
My current code returns 12 as the number of words. Since 'ab', 'AB', 'aB', 'Ab' (and the same for 'zzzz') are all the same and not unique, how can I ignore the variants of a word? I followed the link: http://www.cplusplus.com/reference/algorithm/unique/, but the reference counts unique items in a vector. But I am using a string and not a vector.
Here is my code:
#include <iostream>
#include <string>
using namespace std;
bool isWordSeparator(char & c) {
return c == ' ' || c == '-' || c == '\n' || c == '?' || c == '.' || c == ','
|| c == '?' || c == '!' || c == ':' || c == ';';
}
int countWords(string s) {
int wordCount = 0;
if (s.empty()) {
return 0;
}
for (int x = 0; x < s.length(); x++) {
if (isWordSeparator(s.at(x))) {
wordCount++;
}
}
return wordCount+1;
}
int main() {
string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";
int number_of_words = countWords(s);
cout << "Number of Words: " << number_of_words << endl;
return 0;
}
What you need to make your code case-insensitive is tolower().
You can apply it to your original string using std::transform:
std::transform(s.begin(), s.end(), s.begin(), ::tolower);
I should add, however, that your current code is much closer to C than to C++; perhaps you should check out what the standard library has to offer.
I suggest istringstream + istream_iterator for tokenizing and either unique_copy or set for getting rid of the duplicates, like this: https://ideone.com/nb4BEH
You could create a set of strings, save the position of the last separator (starting with 0) and use substr to extract each word, then insert it into the set. When done, just return the set's size.
You could make the whole operation easier by using a split function such as boost::split (std::string itself has no split member), which tokenizes the string for you. All you have to do is insert all of the elements of the returned container into the set and again return its size.
Edit: as per comments, you need a custom comparator to ignore case for comparisons.
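A sketch of the substring-plus-set approach with a case-insensitive comparator. It assumes that any whitespace or punctuation character is a separator; the names CaseInsensitiveLess and count_unique_words are made up for this example:

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Orders strings while ignoring case, so "ab" and "AB" count as the same key.
struct CaseInsensitiveLess {
    bool operator()(const std::string& a, const std::string& b) const {
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            [](unsigned char x, unsigned char y) {
                return std::tolower(x) < std::tolower(y);
            });
    }
};

// Counts distinct words, ignoring case. Hypothetical helper.
std::size_t count_unique_words(const std::string& s) {
    std::set<std::string, CaseInsensitiveLess> words;
    std::size_t start = 0;                     // position just after the last separator
    for (std::size_t i = 0; i <= s.size(); ++i) {
        bool at_separator = (i == s.size())
            || std::isspace(static_cast<unsigned char>(s[i]))
            || std::ispunct(static_cast<unsigned char>(s[i]));
        if (at_separator) {
            if (i > start)                     // skip empty pieces such as in "?!"
                words.insert(s.substr(start, i - start));
            start = i + 1;
        }
    }
    return words.size();
}
```

For the string in the question this should give 2 (one entry for the "ab" variants, one for the "zzzz" variants).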
First of all I'd suggest rewriting isWordSeparator like this:
bool isWordSeparator(char c) {
return std::isspace(c) || std::ispunct(c);
}
since your current implementation doesn't handle all the punctuation and space, like \t or +.
Also, incrementing wordCount whenever isWordSeparator is true is incorrect, for example if you have something like ?!.
So, a less error-prone approach would be to substitute all separators by space and then iterate words inserting them into an (unordered) set:
#include <iterator>
#include <string>
#include <unordered_set>
#include <algorithm>
#include <cctype>
#include <sstream>
int countWords(std::string s) {
// explicit return type: ' ' is a char but std::tolower returns int
std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) -> char {
if (isWordSeparator(c)) {
return ' ';
}
return static_cast<char>(std::tolower(c));
});
std::unordered_set<std::string> uniqWords;
std::stringstream ss(s);
std::copy(std::istream_iterator<std::string>(ss), std::istream_iterator<std::string>(), std::inserter(uniqWords, uniqWords.end()));
return uniqWords.size();
}
While splitting the string into words, insert all words into a std::set. This will get rid of the duplicates. Then it's just a matter of calling set::size() to get the number of unique words.
I'm using the boost::split() function from the Boost String Algorithms library in my solution, because Boost is almost standard nowadays.
Explanations are in the comments in the code...
#include <iostream>
#include <string>
#include <set>
#include <boost/algorithm/string.hpp>
using namespace std;
// Function suggested by user 'mshrbkv':
bool isWordSeparator(char c) {
return std::isspace(c) || std::ispunct(c);
}
// This is used to make the set case-insensitive.
// Alternatively you could call boost::to_lower() to make the
// string all lowercase before calling boost::split().
struct IgnoreCaseCompare {
bool operator()( const std::string& a, const std::string& b ) const {
return boost::ilexicographical_compare( a, b );
}
};
int main()
{
string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";
// Define a set that will contain only unique strings, ignoring case.
set< string, IgnoreCaseCompare > words;
// Split the string by using your isWordSeparator function
// to define the delimiters. token_compress_on collapses multiple
// consecutive delimiters into only one.
boost::split( words, s, isWordSeparator, boost::token_compress_on );
// Now the set contains only the unique words.
cout << "Number of Words: " << words.size() << endl;
for( auto& w : words )
cout << w << endl;
return 0;
}
Demo: http://coliru.stacked-crooked.com/a/a3b51a6c6a3b4ee8
You can consider SQLite c++ wrapper

How to extract words out of a string and store them in different array in c++

How can I split a string and store the words in a separate array without using strtok or istringstream, and find the greatest word? I am only a beginner, so I should accomplish this using only basic functions in string.h like strlen, strcpy, etc. Is it possible to do so? I've tried to do this and am posting what I have done. Please correct my mistakes.
#include<iostream.h>
#include<stdio.h>
#include<string.h>
void count(char n[])
{
char a[50], b[50];
for(int i=0; n[i]!= '\0'; i++)
{
static int j=0;
for(j=0;n[j]!=' ';j++)
{
a[j]=n[j];
}
static int x=0;
if(strlen(a)>x)
{
strcpy(b,a);
x=strlen(a);
}
}
cout<<"Greatest word is:"<<b;
}
int main( int, char** )
{
char n[100];
gets(n);
count(n);
}
The code in your example looks like it's written in C. Functions like strlen and strcpy originate in C (although they are also part of the C++ standard library, for compatibility, via the header cstring).
You should start learning C++ using the Standard Library and things will get much easier. Things like splitting strings and finding the greatest element can be done using a few lines of code if you use the functions in the standard library, e.g:
// The text
std::string text = "foo bar foobar";
// Wrap text in stream.
std::istringstream iss{text};
// Read tokens from stream into vector (split at whitespace).
std::vector<std::string> words{std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
// Get the greatest word.
auto greatestWord = *std::max_element(std::begin(words), std::end(words), [] (const std::string& lhs, const std::string& rhs) { return lhs.size() < rhs.size(); });
Edit:
If you really want to dig down into the nitty-gritty parts using only functions from std::string, here's how you can split the text into words (I leave finding the greatest word to you, which shouldn't be too hard):
// Use vector to store words.
std::vector<std::string> words;
std::string text = "foo bar foobar";
std::string::size_type beg = 0, end;
do {
end = text.find(' ', beg);
if (end == std::string::npos) {
end = text.size();
}
words.emplace_back(text.substr(beg, end - beg));
beg = end + 1;
} while (beg < text.size());
I would write two functions. The first one skips blank characters for example
const char * SkipSpaces( const char *p )
{
while ( *p == ' ' || *p == '\t' ) ++p;
return ( p );
}
And the second one copies non blank characters
const char * CopyWord( char *s1, const char *s2 )
{
while ( *s2 != ' ' && *s2 != '\t' && *s2 != '\0' ) *s1++ = *s2++;
*s1 = '\0';
return ( s2 );
}
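A sketch of how those two functions might be combined to find the longest word. The driver FindLongestWord and the 100-character limit (borrowed from the asker's char n[100]) are assumptions for this example:

```cpp
#include <cstddef>
#include <cstring>

const std::size_t kMaxLen = 100;   // assumed limit, matching the asker's buffer size

const char* SkipSpaces(const char* p) {
    while (*p == ' ' || *p == '\t') ++p;
    return p;
}

const char* CopyWord(char* s1, const char* s2) {
    while (*s2 != ' ' && *s2 != '\t' && *s2 != '\0') *s1++ = *s2++;
    *s1 = '\0';
    return s2;
}

// Copies the longest word of `text` into `longest`, which must have room for
// kMaxLen chars. Returns false if `text` is too long for the local buffer.
bool FindLongestWord(const char* text, char* longest) {
    longest[0] = '\0';
    if (std::strlen(text) >= kMaxLen) return false;
    char word[kMaxLen];
    const char* p = SkipSpaces(text);
    while (*p != '\0') {
        p = CopyWord(word, p);                          // extract the next word
        if (std::strlen(word) > std::strlen(longest))   // longer than the best so far?
            std::strcpy(longest, word);
        p = SkipSpaces(p);                              // move to the start of the next word
    }
    return true;
}
```

When two words are equally long, this keeps the first one.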
Try to read each word into a small array (assuming no word is longer than 35 characters). You can find a word by looking for two successive spaces, then pass that array to strlen() and check whether the previous word was longer: if so, drop the new word, otherwise keep it.
After all this, do not forget to reset the word array to '\0' (null characters) after every word you catch, or this would happen:
Let's say the 1st word in that array was 'happen' and the 2nd 'to'. If you don't reset it, your array after the 1st catch will be:
happen
and after the 2nd catch:
toppen
Try this. Here ctr will be the number of elements in the array (or vector) of individual words of the sentence. You can split the sentence on whatever character you want by changing the function call in main.
#include<iostream>
#include<string>
#include<vector>
using namespace std;
void split(string s, char ch){
vector <string> vec;
string tempStr;
int ctr{};
size_t index{s.length()};
for(size_t i{}; i<=index; i++){
if(i==index || s[i]==ch){ // end of a word: store it without the delimiter
vec.push_back(tempStr);
ctr++;
tempStr="";
continue;
}
tempStr += s[i];
}
for(string S: vec)
cout<<S<<endl;
}
int main(){
string s;
getline(cin, s);
split(s, ' ');
return 0;
}

How to check if a string is all lowercase and alphanumerics?

Is there a method that checks for these cases? Or do I need to parse each letter in the string, and check if it's lower case (letter) and is a number/letter?
You can use islower(), isalnum() to check for those conditions for each character. There is no string-level function to do this, so you'll have to write your own.
Assuming that the "C" locale is acceptable (or swap in a different set of characters for criteria), use find_first_not_of()
#include <string>
bool testString(const std::string& str)
{
std::string criteria("abcdefghijklmnopqrstuvwxyz0123456789");
return std::string::npos == str.find_first_not_of(criteria);
}
It's not very well known, but a locale actually does have functions to determine characteristics of entire strings at a time. Specifically, the ctype facet of a locale has a scan_is and a scan_not that scan for the first character that fits a specified mask (alpha, numeric, alphanumeric, lower, upper, punctuation, space, hex digit, etc.), or the first that doesn't fit it, respectively. Other than that, they work a bit like std::find_if, returning whatever you passed as the "end" to signal failure, otherwise returning a pointer to the first item in the string that doesn't fit what you asked for.
Here's a quick sample:
#include <locale>
#include <string>
#include <iostream>
#include <iomanip>
int main() {
std::string inputs[] = {
"alllower",
"1234",
"lower132",
"including a space"
};
// We'll use the "classic" (C) locale, but this works with any
std::locale loc(std::locale::classic());
// A mask specifying the characters to search for:
std::ctype_base::mask m = std::ctype_base::lower | std::ctype_base::digit;
for (int i=0; i<4; i++) {
char const *pos;
char const *b = inputs[i].data();
char const *e = b + inputs[i].size(); // avoids dereferencing end()
std::cout << "Input: " << std::setw(20) << inputs[i] << ":\t";
// finally, call the actual function:
if ((pos=std::use_facet<std::ctype<char> >(loc).scan_not(m, b, e)) == e)
std::cout << "All characters match mask\n";
else
std::cout << "First non-matching character = \"" << *pos << "\"\n";
}
return 0;
}
I suspect most people will prefer to use std::find_if though -- using it is nearly the same, but can be generalized to many more situations quite easily. Even though this has much narrower applicability, it's not really a lot easier to use (though I suppose if you're scanning large chunks of text, it might well be at least a little faster).
You could use tolower and strcmp to compare the original string with its lowercased copy, and check for digits individually per character.
(OR) Do both per character as below.
#include <algorithm>
#include <cctype>
#include <string>
static inline bool is_not_alphanum_lower(char c)
{
unsigned char uc = static_cast<unsigned char>(c);
// digits are not "lower", so accept a character if it is a digit or a lowercase letter
return !(std::isdigit(uc) || std::islower(uc));
}
bool string_is_valid(const std::string &str)
{
return std::find_if(str.begin(), str.end(), is_not_alphanum_lower) == str.end();
}
I used some info from:
Determine if a string contains only alphanumeric characters (or a space)
Just use std::all_of
bool lowerAlnum = std::all_of(str.cbegin(), str.cend(), [](const char c){
return isdigit(c) || islower(c);
});
If you don't care about locale (i.e. the input is pure 7-bit ASCII) then the condition can be optimized into
[](const char c){ return ('0' <= c && c <= '9') || ('a' <= c && c <= 'z'); }
If your strings contain ASCII-encoded text and you like to write your own functions (like I do) then you can use this:
bool is_lower_alphanumeric(const string& txt)
{
for(char c : txt)
{
if (!((c >= '0' and c <= '9') or (c >= 'a' and c <= 'z'))) return false;
}
return true;
}