Separating alphabetic characters in C++ STL - c++

I've been practicing C++ for a competition next week. And in the sample problem I've been working on, requires splitting of paragraphs into words. Of course, that's easy. But this problem is so weird, that the words like: isn't should be separated as well: isn and t. I know it's weird but I have to follow this.
I have a function split() that takes a constant char delimiter as one of the parameter. It's what I use to separate words from spaces. But I can't figure out this one. Even numbers like: phil67bs should be separated as phil and bs.
And no, I don't ask for full code. A pseudocode will do, or something that will help me understand what to do. Thanks!
PS: Please no recommendations for external libs. Just the STL. :)

Filter out numbers, spaces and anything else that isn't a letter by using a proper locale. See this SO thread about treating everything but numbers as a whitespace. So use a mask and do something similar to what Jerry Coffin suggests but only for letters:
struct alphabet_only: std::ctype<char>
{
alphabet_only(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table()
{
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::space);
std::fill(&rc['A'], &rc['['], std::ctype_base::upper);
std::fill(&rc['a'], &rc['{'], std::ctype_base::lower);
return &rc[0];
}
};
And, boom! You're golden.
Or... you could just do a transform:
char changeToLetters(const char& input){ return isalpha(input) ? input : ' '; }
vector<char> output;
output.reserve( myVector.size() );
transform( myVector.begin(), myVector.end(), insert_iterator(output), ptr_fun(changeToLetters) );
Which, um, is much easier to grok, just not as efficient as Jerry's idea.
Edit:
Changed 'Z' to '[' so that the value 'Z' is filled. Likewise with 'z' to '{'.

This sounds like a perfect job for the find_first_of function which finds the first occurrence of a set of characters. You can use this to look for arbitrary stop characters and generate words from the spaces between such stop characters.
Roughly:
size_t previous = 0;
for (; ;) {
size_t next = str.find_first_of(" '1234567890", previous);
// Do processing
if (next == string::npos)
break;
previous = next + 1;
};

Just change your function to delimit on anything that isn't an alphabetic character. Is there anything in particular that you are having trouble with?
Break down the problem: First, write a function that gets the first "word" from the sentence. This is easy; just look for the first non-alphabetic character. The next step is to remove all leading non-alphabetic character from the remaining string. From there, just repeat.

You can do something like this:
vector<string> split(const string& str)
{
vector<string> splits;
string cur;
for(int i = 0; i < str.size(); ++i)
{
if(str[i] >= '0' && str[i] <= '9')
{
if(!cur.empty())
{
splits.push_back(cur);
}
cur="";
}
else
{
cur += str[i];
}
}
if(! cur.empty())
{
splits.push_back(cur);
}
return splits;
}

let's assume that the input is in a std::string (use std::getline(cin, line) for example to read a full line from cin)
std::vector<std::string> split(std::string const& input)
{
std::string::const_iterator it(input), end(input.end());
std::string current;
vector<std::string> words;
for(; it != end; ++it)
{
if (isalpha(*it))
{
current.push_back(*it); // add this char to the current word
}
else
{
// push the current word in to the result list
words.push_back(current);
current.clear(); // next word
}
}
return words;
}
I've not tested it, but I guess it ought to work...

Related

What is the reason behind the debugging getting stopped abruptly in the following code?

Here is the code to find the number of matches of a string, which is input from the user, can be found in the file temp.txt. If, for example, we want love to be counted, then matches like love, lovely, beloved should be considered. We also want to count the total number of words in temp.txt file.
I am doing a line by line reading here, not word by word.
Why does the debugging stop at totalwords += counting(line)?
/*this code is not working to count the words*/
#include<iostream>
#include<fstream>
#include<string>
using namespace std;
int totalwords{0};
int counting(string line){
int wordcount{0};
if(line.empty()){
return 1;
}
if(line.find(" ")==string::npos){wordcount++;}
else{
while(line.find(" ")!=string::npos){
int index=0;
index = line.find(" ");
line.erase(0,index);
wordcount++;
}
}
return wordcount;
}
int main() {
ifstream in_file;
in_file.open("temp.txt");
if(!in_file){
cerr<<"PROBLEM OPENING THE FILE"<<endl;
}
string line{};
int counter{0};
string word {};
cout<<"ENTER THE WORD YOU WANT TO COUNT IN THE FILE: ";
cin>>word;
int n {0};
n = ( word.length() - 1 );
while(getline(in_file>>ws,line)){
totalwords += counting(line);
while(line.find(word)!=string::npos){
counter++;
int index{0};
index = line.find(word);
line.erase(0,(index+n));
}
}
cout<<endl;
cout<<counter<<endl;
cout<<totalwords;
return 0;
}
line.erase(0, index); doesn't erase the space, you need
line.erase(0, index + 1);
Your code reveals a few problems...
At very first, counting a single word for an empty line doesn't appear correct to me. Second, erasing again and again from the string is pretty inefficient, with every such operation all of the subsequent characters are copied towards the front. If you indeed wanted to do so you might rather want to search from the end of the string, avoiding that. But you can actually do so without ever modifying the string if you use the second parameter of std::string::find (which defaults to 0, so has been transparent to you...):
int index = line.find(' ' /*, 0*); // first call; 0 is default, thus implicit
index = line.find(' ', index + 1); // subsequent call
Note that using the character overload is more efficient if you search for a single character anyway. However, this variant doesn't consider other whitespace like e. g. tabulators.
Additionally, the variant as posted in the question doesn't consider more than one subsequent whitespace! In your erasing variant – which erases one character too few, by the way – you would need to skip incrementing the word count if you find the space character at index 0.
However I'd go with a totally new approach, looking at each character separately; you need a stateful loop for in that case, though, i.e. you need to remember if you already are within a word or not. It might look e. g. like this:
size_t wordCount = 0; // note: prefer an unsigned type, negative values
// are meaningless anyway
// size_t is especially fine as it is guaranteed to be
// large enough to hold any count the string might ever
// contain characters
bool inWord = false;
for(char c : line)
{
if(isspace(static_cast<unsigned char>(c)))
// you can check for *any* white space that way...
// note the cast to unsigned, which is necessary as isspace accepts
// an int and a bare char *might* be signed, thus result in negative
// values
{
// no word any more...
inWord = false;
}
else if(inWord)
{
// well, nothing to do, we already discovered a word earlier!
//
// as we actually don't do anything here you might just skip
// this block and check for the opposite: if(!inWord)
}
else
{
// OK, this is the start of a word!
// so now we need to count a new one!
++wordCount;
inWord = true;
}
}
Now you might want to break words at punctuation characters as well, so you might actually want to check for:
if(isspace(static_cast<unsigned char>(c)) || ispunct(static_cast<unsigned char>(c))
A bit shorter is the following variant:
if(/* space or punctuation */)
{
inWord = false;
}
else
{
wordCount += inWord; // adds 0 or 1 depending on the value
inWord = false;
}
Finally: All code is written freely, thus unchecked – if you find a bug, please fix yourself...
debugging getting stopped abruptly
Does debugging indeed stop at the indicated line? I observed instead that the program hangs within the while loop in counting. You may make this visible by inserting an indicator output (marked by HERE in following code):
int counting(string line){
int wordcount{0};
if(line.empty()){
return 1;
}
if(line.find(" ")==string::npos){wordcount++;}
else{
while(line.find(" ")!=string::npos){
int index=0;
index = line.find(" ");
line.erase(0,index);
cout << '.'; // <--- HERE: indicator output
wordcount++;
}
}
return wordcount;
}
As Jarod42 pointed out, the erase call you are using misses the space itself. That's why you are finding spaces and “counting words” forever.
There is also an obvious misconception about words and separators of words visible in your code:
empty lines don't contain words
consecutive spaces don't indicate words
words may be separated by non-spaces (parentheses for example)
Finally, as already mentioned: if the problem is about counting total words, it's not necessary to discuss the other parts. And after the test (see HERE) above, it also appears to be independent on file input. So your code could be reduced to something like this:
#include <iostream>
#include <string>
int counting(std::string line) {
int wordcount = 0;
if (line.empty()) {
return 1;
}
if (line.find(" ") == std::string::npos) {
wordcount++;
} else {
while (line.find(" ") != std::string::npos) {
int index = 0;
index = line.find(" ");
line.erase(0, index);
wordcount++;
}
}
return wordcount;
}
int main() {
int totalwords = counting("bla bla");
std::cout << totalwords;
return 0;
}
And in this form, it's much easier to see if it works. We expect to see a 2 as output. To get there, it's possible to try correcting your erase call, but the result would then still be wrong (1) since you are actually counting spaces. So it's better to take the time and carefully read Aconcagua's insightful answer.

makeValidWord(std::string word) not working properly

I'm programming a hash table thing in C++, but this specific piece of code will not run properly. It should return a string of alpha characters and ' and -, but I get cases like "t" instead of "art" when I try to input "'aRT-*".
isWordChar() return a bool value depending on whether the input is a valid word character or not using isAlpha()
// Words cannot contain any digits, or special characters EXCEPT for
// hyphens (-) and apostrophes (') that occur in the middle of a
// valid word (the first and last characters of a word must be an alpha
// character). All upper case characters in the word should be convertd
// to lower case.
// For example, "can't" and "good-hearted" are considered valid words.
// "12mOnkEYs-$" will be converted to "monkeys".
// "Pa55ive" will be stripped "paive".
std::string WordCount::makeValidWord(std::string word) {
if (word.size() == 0) {
return word;
}
string r = "";
string in = "";
size_t incr = 0;
size_t decr = word.size() - 1;
while (incr < word.size() && !isWordChar(word.at(incr))) {
incr++;
}
while (0 < decr && !isWordChar(word.at(decr))) {
decr--;
}
if (incr > decr) {
return r;
}
while (incr <= decr) {
if (isWordChar(word.at(incr)) || word.at(incr) == '-' || word.at(incr) == '\'') {
in =+ word.at(incr);
}
incr++;
}
for (size_t i = 0; i < in.size(); i++) {
r += tolower(in.at(i));
}
return r;
}
Assuming you can use standard algorithms its better to rewrite your function using them. This achieves 2 goals:
code is more readable, since using algorithms shows intent along with code itself
there is less chance to make error
So it should be something like this:
std::string WordCount::makeValidWord(std::string word) {
auto first = std::find_if(word.cbegin(), word.cend(), isWordChar);
auto last = std::find_if(word.crbegin(), word.crend(), isWordChar);
std::string i;
std::copy_if(first, std::next(last), std::back_inserter(i), [](char c) {
return isWordChar(c) || c == '-' || c == '\'';
});
std::string r;
std::transform(i.cbegin(), i.cend(), std::back_inserter(r), std::tolower);
return r;
}
I am going to echo #Someprogrammerdude and say: Learn to use a debugger!
I pasted your code into Visual Studio (changed isWordChar() to isalpha()), and stepped it through with the debugger. Then it was pretty trivial to notice this happening:
First loop of while (incr <= decr) {:
Second loop:
Ooh, look at that; the variable in does not update correctly - instead of collecting a string of the correct characters it only holds the last one. How can that be?
in =+ word.at(incr); Hey, that is not right, that operator should be +=.
Many errors are that easy and effortless to find and correct if you use a debugger. Pick one up today. :)

C++-Tokenizer extremely slow

I just don't get what i do wrong. The unicode tokenizer function shown below is very very slow. Maybe somebody can give me a hint how to speed it up?
Thanks for any help. By the way, ustring is Glib::ustring.
sep1 are separators that should not show up in the results
sep2 are separators that should be single tokens in the result
void tokenize(const ustring & u, const ustring & sep1,
const ustring & sep2, vector<ustring> & tokens) {
ustring s;
s.reserve(100);
ostringstream os;
gunichar c;
for (int i = 0; i < u.length(); i++) {
c = u[i];
if (sep1.find(c) != ustring::npos) {
tokens.push_back(s);
s = "";
}
else if (sep2.find(c) != ustring::npos) {
tokens.push_back(s);
s = "";
s.append(1, c);
tokens.push_back(s);
s = "";
}
else {
s.append(1, c);
}
}
if (s!="")
tokens.push_back(s);
}
I now changed it to (now between 1 and 2 seconds) :
ustring s;
s.reserve(100);
ostringstream os;
gunichar c;
set<gunichar> set_sep1;
int i=0;
for (i=0;i<sep1.size();i++)
{
set_sep1.insert(sep1[i]);
}
set<gunichar> set_sep2;
for (i=0;i<sep2.size();i++)
{
set_sep2.insert(sep2[i]);
}
int start_index=-1;
int ulen=u.length();
i=0;
for (ustring::const_iterator it=u.begin();it!=u.end();++it)
{
c=*it;
if (set_sep1.find(c)!=set_sep1.end())
{
if (start_index!=-1 && start_index<i)
tokens.push_back(u.substr(start_index,i-start_index));
start_index=i+1;
s="";
}
else if (set_sep2.find(c)!=set_sep2.end())
{
tokens.push_back(s);
s="";
tokens.push_back(s);
start_index=i+1;
s="";
}
i++;
}
if (start_index!=-1 && start_index<ulen)
tokens.push_back(u.substr(start_index,ulen-start_index));
The to things that may be "very very slow" here are:
ustring::length()
ustring::append()
Random access to ustring via operator[] : e.g. c=u[i];
Try the following:
Instead of calling u.length() in a loop, store the length in a variable and in the loop compare to that variable
Append the characters of the current token to an ostringstream or wostringstream instead of a ustring
Iterate through ustring using iterators instead of indices involving random access.
Example:
for(ustring::const_iterator it = u.cbegin(); it != u.cend(); it++)
{
c = *it;
//implementation follows
}
I think the following will speed up your code significantly, but it's a little work to find out. Currently you are:
iterating over each character of u.
doing a find in sep1 to see if that character is among the separators.
appending one character at a time, as needed.
Assuming your list of separators is smaller than the strings you are parsing, you are probably better off doing the following:
For each seperator, do a find to see if the separator is in the string
If found, append the entire substring in one go, and do a find on the remaing substring.
A second optimization is to order your separators in terms of most likely to succeed. If for e.g. "," is the most commonly use separator, make sure the find runs on that first. If one separator is much hotter than others, this will make a big difference.

Efficient way to check if std::string has only spaces

I was just talking with a friend about what would be the most efficient way to check if a std::string has only spaces. He needs to do this on an embedded project he is working on and apparently this kind of optimization matters to him.
I've came up with the following code, it uses strtok().
bool has_only_spaces(std::string& str)
{
char* token = strtok(const_cast<char*>(str.c_str()), " ");
while (token != NULL)
{
if (*token != ' ')
{
return true;
}
}
return false;
}
I'm looking for feedback on this code and more efficient ways to perform this task are also welcome.
if(str.find_first_not_of(' ') != std::string::npos)
{
// There's a non-space.
}
In C++11, the all_of algorithm can be employed:
// Check if s consists only of whitespaces
bool whiteSpacesOnly = std::all_of(s.begin(),s.end(),isspace);
Why so much work, so much typing?
bool has_only_spaces(const std::string& str) {
return str.find_first_not_of (' ') == str.npos;
}
Wouldn't it be easier to do:
bool has_only_spaces(const std::string &str)
{
for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
{
if (*it != ' ') return false;
}
return true;
}
This has the advantage of returning early as soon as a non-space character is found, so it will be marginally more efficient than solutions that examine the whole string.
To check if string has only whitespace in c++11:
bool is_whitespace(const std::string& s) {
return std::all_of(s.begin(), s.end(), isspace);
}
in pre-c++11:
bool is_whitespace(const std::string& s) {
for (std::string::const_iterator it = s.begin(); it != s.end(); ++it) {
if (!isspace(*it)) {
return false;
}
}
return true;
}
Here's one that only uses STL (Requires C++11)
inline bool isBlank(const std::string& s)
{
return std::all_of(s.cbegin(),s.cend(),[](char c) { return std::isspace(c); });
}
It relies on fact that if string is empty (begin = end) std::all_of also returns true
Here is a small test program: http://cpp.sh/2tx6
Using strtok like that is bad style! strtok modifies the buffer it tokenizes (it replaces the delimiter chars with \0).
Here's a non modifying version.
const char* p = str.c_str();
while(*p == ' ') ++p;
return *p != 0;
It can be optimized even further, if you iterate through it in machine word chunks. To be portable, you would also have to take alignment into consideration.
I do not approve of you const_casting above and using strtok.
A std::string can contain embedded nulls but let's assume it will be all ASCII 32 characters before you hit the NULL terminator.
One way you can approach this is with a simple loop, and I will assume const char *.
bool all_spaces( const char * v )
{
for ( ; *v; ++v )
{
if( *v != ' ' )
return false;
}
return true;
}
For larger strings, you can check word-at-a-time until you reach the last word, and then assume the 32-bit word (say) will be 0x20202020 which may be faster.
Something like:
return std::find_if(
str.begin(), str.end(),
std::bind2nd( std::not_equal_to<char>(), ' ' ) )
== str.end();
If you're interested in white space, and not just the space character,
then the best thing to do is to define a predicate, and use it:
struct IsNotSpace
{
bool operator()( char ch ) const
{
return ! ::is_space( static_cast<unsigned char>( ch ) );
}
};
If you're doing any text processing at all, a collection of such simple
predicates will be invaluable (and they're easy to generate
automatically from the list of functions in <ctype.h>).
it's highly unlikely you'll beat a compiler optimized naive algorithm for this, e.g.
string::iterator it(str.begin()), end(str.end())
for(; it != end && *it == ' '; ++it);
return it == end;
EDIT: Actually - there is a quicker way (depending on size of string and memory available)..
std::string ns(str.size(), ' ');
return ns == str;
EDIT: actually above is not quick.. it's daft... stick with the naive implementation, the optimizer will be all over that...
EDIT AGAIN: dammit, I guess it's better to look at the functions in std::string
return str.find_first_not_of(' ') == string::npos;
I had a similar problem in a programming assignment, and here is one other solution I came up with after reviewing others. here I simply create a new sentence without the new spaces. If there are double spaces I simply overlook them.
string sentence;
string newsent; //reconstruct new sentence
string dbl = " ";
getline(cin, sentence);
int len = sentence.length();
for(int i = 0; i < len; i++){
//if there are multiple whitespaces, this loop will iterate until there are none, then go back one.
if (isspace(sentence[i]) && isspace(sentence[i+1])) {do{
i++;
}while (isspace(sentence[i])); i--;} //here, you have to dial back one to maintain at least one space.
newsent +=sentence[i];
}
cout << newsent << "\n";
Hm...I'd do this:
for (auto i = str.begin(); i != str.end() ++i)
if (!isspace(i))
return false;
Pseudo-code, isspace is located in cctype for C++.
Edit: Thanks to James for pointing out that isspace has undefined behavior on signed chars.
If you are using CString, you can do
CString myString = " "; // All whitespace
if(myString.Trim().IsEmpty())
{
// string is all whitespace
}
This has the benefit of trimming all newline, space and tab characters.

C++ Remove new line from multiline string

Whats the most efficient way of removing a 'newline' from a std::string?
#include <algorithm>
#include <string>
std::string str;
str.erase(std::remove(str.begin(), str.end(), '\n'), str.cend());
The behavior of std::remove may not quite be what you'd expect.
A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.
See an explanation of it here.
If the newline is expected to be at the end of the string, then:
if (!s.empty() && s[s.length()-1] == '\n') {
s.erase(s.length()-1);
}
If the string can contain many newlines anywhere in the string:
std::string::size_type i = 0;
while (i < s.length()) {
i = s.find('\n', i);
if (i == std::string:npos) {
break;
}
s.erase(i);
}
You should use the erase-remove idiom, looking for '\n'. This will work for any standard sequence container; not just string.
Here is one for DOS or Unix new line:
void chomp( string &s)
{
int pos;
if((pos=s.find('\n')) != string::npos)
s.erase(pos);
}
Slight modification on edW's solution to remove all exisiting endline chars
void chomp(string &s){
size_t pos;
while (((pos=s.find('\n')) != string::npos))
s.erase(pos,1);
}
Note that size_t is typed for pos, it is because npos is defined differently for different types, for example, -1 (unsigned int) and -1 (unsigned float) are not the same, due to the fact the max size of each type are different. Therefore, comparing int to size_t might return false even if their values are both -1.
s.erase(std::remove(s.begin(), s.end(), '\n'), s.end());
The code removes all newlines from the string str.
O(N) implementation best served without comments on SO and with comments in production.
unsigned shift=0;
for (unsigned i=0; i<length(str); ++i){
if (str[i] == '\n') {
++shift;
}else{
str[i-shift] = str[i];
}
}
str.resize(str.length() - shift);
std::string some_str = SOME_VAL;
if ( some_str.size() > 0 && some_str[some_str.length()-1] == '\n' )
some_str.resize( some_str.length()-1 );
or (removes several newlines at the end)
some_str.resize( some_str.find_last_not_of(L"\n")+1 );
Another way to do it in the for loop
void rm_nl(string &s) {
for (int p = s.find("\n"); p != (int) string::npos; p = s.find("\n"))
s.erase(p,1);
}
Usage:
string data = "\naaa\nbbb\nccc\nddd\n";
rm_nl(data);
cout << data; // data = aaabbbcccddd
All these answers seem a bit heavy to me.
If you just flat out remove the '\n' and move everything else back a spot, you are liable to have some characters slammed together in a weird-looking way. So why not just do the simple (and most efficient) thing: Replace all '\n's with spaces?
for (int i = 0; i < str.length();i++) {
if (str[i] == '\n') {
str[i] = ' ';
}
}
There may be ways to improve the speed of this at the edges, but it will be way quicker than moving whole chunks of the string around in memory.
If its anywhere in the string than you can't do better than O(n).
And the only way is to search for '\n' in the string and erase it.
for(int i=0;i<s.length();i++) if(s[i]=='\n') s.erase(s.begin()+i);
For more newlines than:
int n=0;
for(int i=0;i<s.length();i++){
if(s[i]=='\n'){
n++;//we increase the number of newlines we have found so far
}else{
s[i-n]=s[i];
}
}
s.resize(s.length()-n);//to delete only once the last n elements witch are now newlines
It erases all the newlines once.
About answer 3 removing only the last \n off string code :
if (!s.empty() && s[s.length()-1] == '\n') {
s.erase(s.length()-1);
}
Will the if condition not fail if the string is really empty ?
Is it not better to do :
if (!s.empty())
{
if (s[s.length()-1] == '\n')
s.erase(s.length()-1);
}
To extend #Greg Hewgill's answer for C++11:
If you just need to delete a newline at the very end of the string:
This in C++98:
if (!s.empty() && s[s.length()-1] == '\n') {
s.erase(s.length()-1);
}
...can now be done like this in C++11:
if (!s.empty() && s.back() == '\n') {
s.pop_back();
}
Optionally, wrap it up in a function. Note that I pass it by ptr here simply so that when you take its address as you pass it to the function, it reminds you that the string will be modified in place inside the function.
void remove_trailing_newline(std::string* str)
{
if (str->empty())
{
return;
}
if (str->back() == '\n')
{
str->pop_back();
}
}
// usage
std::string str = "some string\n";
remove_trailing_newline(&str);
Whats the most efficient way of removing a 'newline' from a std::string?
As far as the most efficient way goes--that I'd have to speed test/profile and see. I'll see if I can get back to you on that and run some speed tests between the top two answers here, and a C-style way like I did here: Removing elements from array in C. I'll use my nanos() timestamp function for speed testing.
Other References:
See these "new" C++11 functions in this reference wiki here: https://en.cppreference.com/w/cpp/string/basic_string
https://en.cppreference.com/w/cpp/string/basic_string/empty
https://en.cppreference.com/w/cpp/string/basic_string/back
https://en.cppreference.com/w/cpp/string/basic_string/pop_back