C++ writing a function that extracts words from a paragraph

C++ writing a function that extracts words from a paragraph - c++

The program I am writing reads a text file, breaks the paragraph into individual words, compares them to a list of "sensitive words" and if a word from the text file matches a word from the Sensitive word list, it is censored. I have wrote functions that find the beginning of each word, and a function that will censor or replace words on the Sensitive word list with "#####" (which I left out of this post). A word in this case is any string that contains alphanumeric characters.
The function I am having trouble with is the function that will "extract" or return the individual words to compare to the sensitive word list (extractWord). At the moment it just returns the first letter of the last word in the sentence. So right now all the function does is return "w". I need all the individual words.
Here is what I have so far ...
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
bool wordBeginsAt (const std::string& message, int pos);
bool isAlphanumeric (char c); //
std::string extractWord (const std::string& fromMessage, int beginningAt);
int main()
{
string word = "I need to break these words up individually. 12345 count as words";
string newWord;
for (int i = 0; i < word.length(); ++i)
{
if (wordBeginsAt(word, i))
{
newWord = extractWord(word, i);
}
}
//cout << newWord; // testing output
return 0;
}
bool wordBeginsAt (const std::string& message, int pos)
{
if(pos==0)
{return true;}
else
if (isAlphanumeric(message[pos])==true && isAlphanumeric(message[pos- 1])==false)
{
return true;
}
else
return false;
}
bool isAlphanumeric (char c)
{
return (c >= 'A' && c <= 'Z')
|| (c >= 'a' && c <= 'z')
|| (c >= '0' && c <= '9');
}
std::string extractWord (const std::string& fromMessage, int beginningAt)
{
string targetWord= "";
targetWord = targetWord + fromMessage[beginningAt];
return targetWord;
}
edit: after trying to use targetWord as an array (which I couldn't define the size) and using several different for and while loops within extractWord I found a solution:
std::string extractWord (const std::string& fromMessage, int beginningAt)
{
string targetWord= "";
while (isAlphanumeric(fromMessage[beginningAt++]))
{
targetWord = targetWord + fromMessage[beginningAt-1];
}
return targetWord;

Since this is a C++ question, how about using modern C++, instead of using dressed-up C code? The modern C++ library has all the algorithms and functions needed to implement all of this work for you:
#include <algorithm>
#include <cctype>
std::string paragraph;
// Somehow, figure out how to get your paragraph into this std::string, then:
auto b=paragraph.begin(), e=paragraph.end();
while (b != e)
{
// Find first alphanumeric character, using POSIX isalnum()
auto p=std::find_if(b, e, [](char c) { return isalnum(c); });
// Find the next non-alphanumeric chararacter
b=std::find_if(p, e, [](char c) { return !isalnum(c); });
if (isbadword(std::string(p, b)))
std::fill(p, b, '#');
}
This does pretty much what you asked, in a fraction of the size of all that manual code that manually searches this stuff. All you have to do is to figure out what...
bool isbadword(const std::string &s)
...needs to do.
Your homework assignment is how to slightly tweak this code to avoid, in certain specific situations, calling isbadword() with an empty string.

Related

Count unique words in a string in C++

I want to count how many unique words are in string 's' where punctuations and newline character (\n) separates each word. So far I've used the logical or operator to check how many wordSeparators are in the string, and added 1 to the result to get the number of words in string s.
My current code returns 12 as the number of word. Since 'ab', 'AB', 'aB', 'Ab' (and same for 'zzzz') are all same and not unique, how can I ignore the variants of a word? I followed the link: http://www.cplusplus.com/reference/algorithm/unique/, but the reference counts unique item in a vector. But, I am using string and not vector.
Here is my code:
#include <iostream>
#include <string>
using namespace std;
bool isWordSeparator(char & c) {
return c == ' ' || c == '-' || c == '\n' || c == '?' || c == '.' || c == ','
|| c == '?' || c == '!' || c == ':' || c == ';';
}
int countWords(string s) {
int wordCount = 0;
if (s.empty()) {
return 0;
}
for (int x = 0; x < s.length(); x++) {
if (isWordSeparator(s.at(x))) {
wordCount++;
return wordCount+1;
int main() {
string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";
int number_of_words = countWords(s);
cout << "Number of Words: " << number_of_words << endl;
return 0;
}

What you need to make your code case-insensitive is tolower().
You can apply it to your original string using std::transform:
std::transform(s.begin(), s.end(), s.begin(), ::tolower);
I should add however that your current code is much closer to C than to C++, perhaps you should check out what standard library has to offer.
I suggest istringstream + istream_iterator for tokenizing and either unique_copy or set for getting rid of the duplicates, like this: https://ideone.com/nb4BEH

You could create a set of strings, save the position of the last separator (starting with 0) and use substring to extract the word, then insert it into the set. When done just return the set's size.
You could make the whole operation easier by using string::split - it tokenizes the string for you. All you have to do is insert all of the elements in the returned array to the set and again return it's size.
Edit: as per comments, you need a custom comparator to ignore case for comparisons.

First of all I'd suggest rewriting isWordSeparator like this:
bool isWordSeparator(char c) {
return std::isspace(c) || std::ispunct(c);
}
since your current implementation doesn't handle all the punctuation and space, like \t or +.
Also, incrementing wordCount when isWordSeparator is true is incorrect for example if you have something like ?!.
So, a less error-prone approach would be to substitute all separators by space and then iterate words inserting them into an (unordered) set:
#include <iterator>
#include <unordered_set>
#include <algorithm>
#include <cctype>
#include <sstream>
int countWords(std::string s) {
std::transform(s.begin(), s.end(), s.begin(), [](char c) {
if (isWordSeparator(c)) {
return ' ';
}
return std::tolower(c);
});
std::unordered_set<std::string> uniqWords;
std::stringstream ss(s);
std::copy(std::istream_iterator<std::string>(ss), std::istream_iterator<std::string(), std::inserter(uniqWords));
return uniqWords.size();
}

While splitting the string into words, insert all words into a std::set. This will get rid of the duplicates. Then it's just a matter of calling set::size() to get the number of unique words.
I'm using the boost::split() function from the boost string algorithm library in my solution, because is almost standard nowadays.
Explanations in the comments in code...
#include <iostream>
#include <string>
#include <set>
#include <boost/algorithm/string.hpp>
using namespace std;
// Function suggested by user 'mshrbkv':
bool isWordSeparator(char c) {
return std::isspace(c) || std::ispunct(c);
}
// This is used to make the set case-insensitive.
// Alternatively you could call boost::to_lower() to make the
// string all lowercase before calling boost::split().
struct IgnoreCaseCompare {
bool operator()( const std::string& a, const std::string& b ) const {
return boost::ilexicographical_compare( a, b );
}
};
int main()
{
string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";
// Define a set that will contain only unique strings, ignoring case.
set< string, IgnoreCaseCompare > words;
// Split the string by using your isWordSeparator function
// to define the delimiters. token_compress_on collapses multiple
// consecutive delimiters into only one.
boost::split( words, s, isWordSeparator, boost::token_compress_on );
// Now the set contains only the unique words.
cout << "Number of Words: " << words.size() << endl;
for( auto& w : words )
cout << w << endl;
return 0;
}
Demo: http://coliru.stacked-crooked.com/a/a3b51a6c6a3b4ee8

You can consider SQLite c++ wrapper

How to extract words out of a string and store them in different array in c++

How to split a string and store the words in a separate array without using strtok or istringstream and find the greatest word?? I am only a beginner so I should accomplish this using basic functions in string.h like strlen, strcpy etc. only. Is it possible to do so?? I've tried to do this and I am posting what I have done. Please correct my mistakes.
#include<iostream.h>
#include<stdio.h>
#include<string.h>
void count(char n[])
{
char a[50], b[50];
for(int i=0; n[i]!= '\0'; i++)
{
static int j=0;
for(j=0;n[j]!=' ';j++)
{
a[j]=n[j];
}
static int x=0;
if(strlen(a)>x)
{
strcpy(b,a);
x=strlen(a);
}
}
cout<<"Greatest word is:"<<b;
}
int main( int, char** )
{
char n[100];
gets(n);
count(n);
}

The code in your example looks like it's written in C. Functions like strlen and strcpy originates in C (although they are also part of the C++ standard library for compatibility via the header cstring).
You should start learning C++ using the Standard Library and things will get much easier. Things like splitting strings and finding the greatest element can be done using a few lines of code if you use the functions in the standard library, e.g:
// The text
std::string text = "foo bar foobar";
// Wrap text in stream.
std::istringstream iss{text};
// Read tokens from stream into vector (split at whitespace).
std::vector<std::string> words{std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
// Get the greatest word.
auto greatestWord = *std::max_element(std::begin(words), std::end(words), [] (const std::string& lhs, const std::string& rhs) { return lhs.size() < rhs.size(); });
Edit:
If you really want to dig down in the nitty-gritty parts using only functions from std::string, here's how you can do to split the text into words (I leave finding the greatest word to you, which shouldn't be too hard):
// Use vector to store words.
std::vector<std::string> words;
std::string text = "foo bar foobar";
std::string::size_type beg = 0, end;
do {
end = text.find(' ', beg);
if (end == std::string::npos) {
end = text.size();
}
words.emplace_back(text.substr(beg, end - beg));
beg = end + 1;
} while (beg < text.size());

I would write two functions. The first one skips blank characters for example
const char * SkipSpaces( const char *p )
{
while ( *p == ' ' || *p == '\t' ) ++p;
return ( p );
}
And the second one copies non blank characters
const char * CopyWord( char *s1, const char *s2 )
{
while ( *s2 != ' ' && *s2 != '\t' && *s2 != '\0' ) *s1++ = *s2++;
*s1 = '\0';
return ( s2 );
}

try to get a word in a small array(obviously no word is >35 characters) you can get the word by checking two successive spaces and then put that array in strlen() function and then check if the previous word was larger then drop that word else keep the new word
after all this do not forget to initialize the word array with '\0' or null character after every word catch or this would happen:-
let's say 1st word in that array was 'happen' and 2nd 'to' if you don't initialize then your array will be after 1st catch :
happen
and 2nd catch :
*to*ppen

Try this. Here ctr will be the number of elements in the array(or vector) of individual words of the sentence. You can split the sentence from whatever letter you want by changing function call in main.
#include<iostream>
#include<string>
#include<vector>
using namespace std;
void split(string s, char ch){
vector <string> vec;
string tempStr;
int ctr{};
int index{s.length()};
for(int i{}; i<=index; i++){
tempStr += s[i];
if(s[i]==ch || s[i]=='\0'){
vec.push_back(tempStr);
ctr++;
tempStr="";
continue;
}
}
for(string S: vec)
cout<<S<<endl;
}
int main(){
string s;
getline(cin, s);
split(s, ' ');
return 0;
}

How to sort file names with numbers and alphabets in order in C?

I have used the following code to sort files in alphabetical order and it sorts the files as shown in the figure:
for(int i = 0;i < maxcnt;i++)
{
for(int j = i+1;j < maxcnt;j++)
{
if(strcmp(Array[i],Array[j]) > 0)
{
strcpy(temp,Array[i]);
strcpy(Array[i],Array[j]);
strcpy(Array[j],temp);
}
}
}
But I need to sort it as order seen in Windows explorer
How to sort like this way? Please help

For a C answer, the following is a replacement for strcasecmp(). This function recurses to handle strings that contain alternating numeric and non-numeric substrings. You can use it with qsort():
int strcasecmp_withNumbers(const void *void_a, const void *void_b) {
const char *a = void_a;
const char *b = void_b;
if (!a || !b) { // if one doesn't exist, other wins by default
return a ? 1 : b ? -1 : 0;
}
if (isdigit(*a) && isdigit(*b)) { // if both start with numbers
char *remainderA;
char *remainderB;
long valA = strtol(a, &remainderA, 10);
long valB = strtol(b, &remainderB, 10);
if (valA != valB)
return valA - valB;
// if you wish 7 == 007, comment out the next two lines
else if (remainderB - b != remainderA - a) // equal with diff lengths
return (remainderB - b) - (remainderA - a); // set 007 before 7
else // if numerical parts equal, recurse
return strcasecmp_withNumbers(remainderA, remainderB);
}
if (isdigit(*a) || isdigit(*b)) { // if just one is a number
return isdigit(*a) ? -1 : 1; // numbers always come first
}
while (*a && *b) { // non-numeric characters
if (isdigit(*a) || isdigit(*b))
return strcasecmp_withNumbers(a, b); // recurse
if (tolower(*a) != tolower(*b))
return tolower(*a) - tolower(*b);
a++;
b++;
}
return *a ? 1 : *b ? -1 : 0;
}
Notes:
Windows needs stricmp() rather than the Unix equivalent strcasecmp().
The above code will (obviously) give incorrect results if the numbers are really big.
Leading zeros are ignored here. In my area, this is a feature, not a bug: we usually want UAL0123 to match UAL123. But this may or may not be what you require.
See also Sort on a string that may contain a number and How to implement a natural sort algorithm in c++?, although the answers there, or in their links, are certainly long and rambling compared with the above code, by about a factor of at least four.

Natural sorting is the way that you must take here . I have a working code for my scenario. You probably can make use of it by altering it according to your needs :
#ifndef JSW_NATURAL_COMPARE
#define JSW_NATURAL_COMPARE
#include <string>
int natural_compare(const char *a, const char *b);
int natural_compare(const std::string& a, const std::string& b);
#endif
#include <cctype>
namespace {
// Note: This is a convenience for the natural_compare
// function, it is *not* designed for general use
class int_span {
int _ws;
int _zeros;
const char *_value;
const char *_end;
public:
int_span(const char *src)
{
const char *start = src;
// Save and skip leading whitespace
while (std::isspace(*(unsigned char*)src)) ++src;
_ws = src - start;
// Save and skip leading zeros
start = src;
while (*src == '0') ++src;
_zeros = src - start;
// Save the edges of the value
_value = src;
while (std::isdigit(*(unsigned char*)src)) ++src;
_end = src;
}
bool is_int() const { return _value != _end; }
const char *value() const { return _value; }
int whitespace() const { return _ws; }
int zeros() const { return _zeros; }
int digits() const { return _end - _value; }
int non_value() const { return whitespace() + zeros(); }
};
inline int safe_compare(int a, int b)
{
return a < b ? -1 : a > b;
}
}
int natural_compare(const char *a, const char *b)
{
int cmp = 0;
while (cmp == 0 && *a != '\0' && *b != '\0') {
int_span lhs(a), rhs(b);
if (lhs.is_int() && rhs.is_int()) {
if (lhs.digits() != rhs.digits()) {
// For differing widths (excluding leading characters),
// the value with fewer digits takes priority
cmp = safe_compare(lhs.digits(), rhs.digits());
}
else {
int digits = lhs.digits();
a = lhs.value();
b = rhs.value();
// For matching widths (excluding leading characters),
// search from MSD to LSD for the larger value
while (--digits >= 0 && cmp == 0)
cmp = safe_compare(*a++, *b++);
}
if (cmp == 0) {
// If the values are equal, we need a tie
// breaker using leading whitespace and zeros
if (lhs.non_value() != rhs.non_value()) {
// For differing widths of combined whitespace and
// leading zeros, the smaller width takes priority
cmp = safe_compare(lhs.non_value(), rhs.non_value());
}
else {
// For matching widths of combined whitespace
// and leading zeros, more whitespace takes priority
cmp = safe_compare(rhs.whitespace(), lhs.whitespace());
}
}
}
else {
// No special logic unless both spans are integers
cmp = safe_compare(*a++, *b++);
}
}
// All else being equal so far, the shorter string takes priority
return cmp == 0 ? safe_compare(*a, *b) : cmp;
}
#include <string>
int natural_compare(const std::string& a, const std::string& b)
{
return natural_compare(a.c_str(), b.c_str());
}

What you want to do is perform "Natural Sort". Here is a blog post about it, explaining implementation in python I believe. Here is a perl module that accomplishes it. There also seems to be a similar question at How to implement a natural sort algorithm in c++?

Taking into account that this has a c++ tag, you could elaborate on #Joseph Quinsey's answer and create a natural_less function to be passed to the standard library.
using namespace std;
bool natural_less(const string& lhs, const string& rhs)
{
return strcasecmp_withNumbers(lhs.c_str(), rhs.c_str()) < 0;
}
void example(vector<string>& data)
{
std::sort(data.begin(), data.end(), natural_less);
}
I took the time to write some working code as an exercise
https://github.com/kennethlaskoski/natural_less

Modifying this answer:
bool compareNat(const std::string& a, const std::string& b){
if (a.empty())
return true;
if (b.empty())
return false;
if (std::isdigit(a[0]) && !std::isdigit(b[0]))
return true;
if (!std::isdigit(a[0]) && std::isdigit(b[0]))
return false;
if (!std::isdigit(a[0]) && !std::isdigit(b[0]))
{
if (a[0] == b[0])
return compareNat(a.substr(1), b.substr(1));
return (toUpper(a) < toUpper(b));
//toUpper() is a function to convert a std::string to uppercase.
}
// Both strings begin with digit --> parse both numbers
std::istringstream issa(a);
std::istringstream issb(b);
int ia, ib;
issa >> ia;
issb >> ib;
if (ia != ib)
return ia < ib;
// Numbers are the same --> remove numbers and recurse
std::string anew, bnew;
std::getline(issa, anew);
std::getline(issb, bnew);
return (compareNat(anew, bnew));
}
toUpper() function:
std::string toUpper(std::string s){
for(int i=0;i<(int)s.length();i++){s[i]=toupper(s[i]);}
return s;
}
Usage:
#include <iostream> // std::cout
#include <string>
#include <algorithm> // std::sort, std::copy
#include <iterator> // std::ostream_iterator
#include <sstream> // std::istringstream
#include <vector>
#include <cctype> // std::isdigit
int main()
{
std::vector<std::string> str;
str.push_back("20.txt");
str.push_back("10.txt");
str.push_back("1.txt");
str.push_back("z2.txt");
str.push_back("z10.txt");
str.push_back("z100.txt");
str.push_back("1_t.txt");
str.push_back("abc.txt");
str.push_back("Abc.txt");
str.push_back("bcd.txt");
std::sort(str.begin(), str.end(), compareNat);
std::copy(str.begin(), str.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
}

Your problem is that you have an interpretation behind parts of the file name.
In lexicographical order, Slide1 is before Slide10 which is before Slide5.
You expect Slide5 before Slide10 as you have an interpretation of the substrings 5 and 10 (as integers).
You will run into more problems, if you had the
name of the month in the filename, and would expect them to be ordered by date (i.e. January comes before August). You will need to adjust your sorting to this interpretation (and the "natural" order will depend on your interpretation, there is no generic solution).
Another approach is to format the filenames in a way that your sorting and the lexicographical order agree. In your case, you would use leading zeroes and a fixed length for the number. So Slide1 becomes Slide01, and then you will see that sorting them lexicographically will yield the result you would like to have.
However, often you cannot influence the output of an application, and thus cannot enforce your format directly.
What I do in those cases: write a little script/function that renames the file to a proper format, and then use standard sorting algorithms to sort them. The advantage of this is that you do not need to adapt your sorting, and can use existing software for the sorting.
On the downside, there are situations where this is not feasible (as filenames need to be fixed).

How to check if a string is all lowercase and alphanumerics?

Is there a method that checks for these cases? Or do I need to parse each letter in the string, and check if it's lower case (letter) and is a number/letter?

You can use islower(), isalnum() to check for those conditions for each character. There is no string-level function to do this, so you'll have to write your own.

Assuming that the "C" locale is acceptable (or swap in a different set of characters for criteria), use find_first_not_of()
#include <string>
bool testString(const std::string& str)
{
std::string criteria("abcdefghijklmnopqrstuvwxyz0123456789");
return (std::string::npos == str.find_first_not_of(criteria);
}

It's not very well known, but a locale actually does have functions to determine characteristics of entire strings at a time. Specifically, the ctype facet of a locale has a scan_is and a scan_not that scan for the first character that fits a specified mask (alpha, numeric, alphanumeric, lower, upper, punctuation, space, hex digit, etc.), or the first that doesn't fit it, respectively. Other than that, they work a bit like std::find_if, returning whatever you passed as the "end" to signal failure, otherwise returning a pointer to the first item in the string that doesn't fit what you asked for.
Here's a quick sample:
#include <locale>
#include <iostream>
#include <iomanip>
int main() {
std::string inputs[] = {
"alllower",
"1234",
"lower132",
"including a space"
};
// We'll use the "classic" (C) locale, but this works with any
std::locale loc(std::locale::classic());
// A mask specifying the characters to search for:
std::ctype_base::mask m = std::ctype_base::lower | std::ctype_base::digit;
for (int i=0; i<4; i++) {
char const *pos;
char const *b = &*inputs[i].begin();
char const *e = &*inputs[i].end();
std::cout << "Input: " << std::setw(20) << inputs[i] << ":\t";
// finally, call the actual function:
if ((pos=std::use_facet<std::ctype<char> >(loc).scan_not(m, b, e)) == e)
std::cout << "All characters match mask\n";
else
std::cout << "First non-matching character = \"" << *pos << "\"\n";
}
return 0;
}
I suspect most people will prefer to use std::find_if though -- using it is nearly the same, but can be generalized to many more situations quite easily. Even though this has much narrower applicability, it's not really a lot easier to user (though I suppose if you're scanning large chunks of text, it might well be at least a little faster).

You could use the tolower & strcmp to compare if the original_string and the tolowered string.And do the numbers individually per character.
(OR) Do both per character as below.
#include <algorithm>
static inline bool is_not_alphanum_lower(char c)
{
return (!isalnum(c) || !islower(c));
}
bool string_is_valid(const std::string &str)
{
return find_if(str.begin(), str.end(), is_not_alphanum_lower) == str.end();
}
I used the some info from:
Determine if a string contains only alphanumeric characters (or a space)

Just use std::all_of
bool lowerAlnum = std::all_of(str.cbegin(), str.cend(), [](const char c){
return isdigit(c) || islower(c);
});
If you don't care about locale (i.e. the input is pure 7-bit ASCII) then the condition can be optimized into
[](const char c){ return ('0' <= c && c <= '9') || ('a' <= c && c <= 'z'); }

If your strings contain ASCII-encoded text and you like to write your own functions (like I do) then you can use this:
bool is_lower_alphanumeric(const string& txt)
{
for(char c : txt)
{
if (!((c >= '0' and c <= '9') or (c >= 'a' and c <= 'z'))) return false;
}
return true;
}

string analysis

IF a string may include several un-necessary elements, e.g., such as #, #, $,%.
How to find them and delete them?
I know this requires a loop iteration, but I do not know how to represent sth such as #, #, $,%.
If you can give me a code example, then I will be really appreciated.

The usual standard C++ approach would be the erase/remove idiom:
#include <string>
#include <algorithm>
#include <iostream>
struct OneOf {
std::string chars;
OneOf(const std::string& s) : chars(s) {}
bool operator()(char c) const {
return chars.find_first_of(c) != std::string::npos;
}
};
int main()
{
std::string s = "string with #, #, $, %";
s.erase(remove_if(s.begin(), s.end(), OneOf("##$%")), s.end());
std::cout << s << '\n';
}
and yes, boost offers some neat ways to write it shorter, for example using boost::erase_all_regex
#include <string>
#include <iostream>
#include <boost/algorithm/string/regex.hpp>
int main()
{
std::string s = "string with #, #, $, %";
erase_all_regex(s, boost::regex("[##$%]"));
std::cout << s << '\n';
}

If you want to get fancy, there is Boost.Regex otherwise you can use the STL replace function in combination with the strchr function..

And if you, for some reason, have to do it yourself C-style, something like this would work:
char* oldstr = ... something something dark side ...
int oldstrlen = strlen(oldstr)+1;
char* newstr = new char[oldstrlen]; // allocate memory for the new nicer string
char* p = newstr; // get a pointer to the beginning of the new string
for ( int i=0; i<oldstrlen; i++ ) // iterate over the original string
if (oldstr[i] != '#' && oldstr[i] != '#' && etc....) // check that the current character is not a bad one
*p++ = oldstr[i]; // append it to the new string
*p = 0; // dont forget the null-termination

I think for this I'd use std::remove_copy_if:
#include <string>
#include <algorithm>
#include <iostream>
struct bad_char {
bool operator()(char ch) {
return ch == '#' || ch == '#' || ch == '$' || ch == '%';
}
};
int main() {
std::string in("This#is#a$string%with#extra#stuff$to%ignore");
std::string out;
std::remove_copy_if(in.begin(), in.end(), std::back_inserter(out), bad_char());
std::cout << out << "\n";
return 0;
}
Result:
Thisisastringwithextrastufftoignore
Since the data containing these unwanted characters will normally come from a file of some sort, it's also worth considering getting rid of them as you read the data from the file instead of reading the unwanted data into a string, and then filtering it out. To do this, you could create a facet that classifies the unwanted characters as white space:
struct filter: std::ctype<char>
{
filter(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table()
{
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::mask());
rc['#'] = std::ctype_base::space;
rc['#'] = std::ctype_base::space;
rc['$'] = std::ctype_base::space;
rc['%'] = std::ctype_base::space;
return &rc[0];
}
};
To use this, you imbue the input stream with a locale using this facet, and then read normally. For the moment I'll use an istringstream, though you'd normally use something like an istream or ifstream:
int main() {
std::istringstream in("This#is#a$string%with#extra#stuff$to%ignore");
in.imbue(std::locale(std::locale(), new filter));
std::copy(std::istream_iterator<char>(in),
std::istream_iterator<char>(),
std::ostream_iterator<char>(std::cout));
return 0;
}

Is this C or C++? (You've tagged it both ways.)
In pure C, you pretty much have to loop through character by character and delete the unwanted ones. For example:
char *buf;
int len = strlen(buf);
int i, j;
for (i = 0; i < len; i++)
{
if (buf[i] == '#' || buf[i] == '#' || buf[i] == '$' /* etc */)
{
for (j = i; j < len; j++)
{
buf[j] = buf[j+1];
}
i --;
}
}
This isn't very efficient - it checks each character in turn and shuffles them all up if there's one you don't want. You have to decrement the index afterwards to make sure you check the new next character.

General algorithm:
Build a string that contains the characters you want purged: "##$%"
Iterate character by character over the subject string.
Search if each character is found in the purge set.
If a character matches, discard it.
If a character doesn't match, append it to a result string.
Depending on the string library you are using, there are functions/methods that implement one or more of the above steps, such as strchr() or find() to determine if a character is in a string.

use the characterizer operator, ie a would be 'a'. you haven't said whether your using C++ strings(in which case you can use the find and replace methods) or C strings in which case you'd use something like this(this is by no means the best way, but its a simple way):
void RemoveChar(char* szString, char c)
{
while(*szString != '\0')
{
if(*szString == c)
memcpy(szString,szString+1,strlen(szString+1)+1);
szString++;
}
}

You can use a loop and call find_last_of (http://www.cplusplus.com/reference/string/string/find_last_of/) repeatedly to find the last character that you want to replace, replace it with blank, and then continue working backwards in the string.

Something like this would do :
bool is_bad(char c)
{
if( c == '#' || c == '#' || c == '$' || c == '%' )
return true;
else
return false;
}
int main(int argc, char **argv)
{
string str = "a #test ##string";
str.erase(std::remove_if(str.begin(), str.end(), is_bad), str.end() );
}
If your compiler supports lambdas (or if you can use boost), it can be made even shorter. Example using boost::lambda :
string str = "a #test ##string";
str.erase(std::remove_if(str.begin(), str.end(), (_1 == '#' || _1 == '#' || _1 == '$' || _1 == '%')), str.end() );
(yay two lines!)

A character is represented in C/C++ by single quotes, e.g. '#', '#', etc. (except for a few that need to be escaped).
To search for a character in a string, use strchr(). Here is a link to a sample code:
http://www.cplusplus.com/reference/clibrary/cstring/strchr/

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

C++ writing a function that extracts words from a paragraph - c++

Related

Count unique words in a string in C++

How to extract words out of a string and store them in different array in c++

How to sort file names with numbers and alphabets in order in C?

How to check if a string is all lowercase and alphanumerics?

string analysis

Categories

Resources