Erase words from a string (in C++)

I want to delete some words from a string, but my code doesn't work. I don't get any errors or warnings, but I think my string ends up empty. Could someone help me with this? I tried to convert my initial strings into 2 vectors, so that I can navigate more easily then
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;
int main()
{
string s("Somewhere down the road");
string t("down");
istringstream iss(s);
vector <string> plm;
vector <string> plm2;
do
{
string sub;
iss >> sub;
plm.push_back(sub);
} while (iss);
for(unsigned int i=0 ; i<plm.size();i++){
cout<<plm[i];}
istringstream ist(t);
do
{
string subb;
ist >> subb;
plm2.push_back(subb);
} while (ist);
for(int i=0;i<plm.size();i++){
for(int j=0;j<plm2.size();i++){
{if (plm[i]==plm2[j])
plm.erase(plm.begin()+j);}}}
for(int i=0 ; i<plm.size();i++)
cout<<plm[i];
}

Warning: this is really just a comment that's too long to fit in a comment field. Oh, and a bit of a rant at that.
I'm sure glad we have these modern languages to make life so much easier than it was decades ago. Consider, for example, what this job looked like in the long-since moribund SNOBOL 4 programming language:
s = 'somewhere down the road'
del s 'down' = :s(del)
OUTPUT = s
God, it's nice that we've since made so much progress that we don't have to deal with 3 whole lines of code, and we can now do the job with only 52 lines instead (oh, except that the 52 lines don't actually work, but let's ignore that for the moment).
I guess, in fairness, we can do the job a little more compactly in C++ though. One obvious way would be with std::remove_copy, some stream iterators, and a stringstream or two:
std::istringstream input("somewhere down the road");
std::string del_str("down");
std::istream_iterator<std::string> in(input), end;
std::ostringstream result;
std::remove_copy(in, end, std::ostream_iterator<std::string>(result, " "), del_str);
std::cout << result.str();
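For anyone who wants to try that snippet, here is roughly how it could look as a complete function with the headers it needs. This is only a sketch: remove_word is a name made up for this example, and note that the ostream_iterator writes a space after every word, so the result ends with a trailing space.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical wrapper around the std::remove_copy approach shown above.
// Copies every whitespace-separated word of `text` except those equal to
// `word` into the result, each followed by a single space.
std::string remove_word(const std::string& text, const std::string& word) {
    std::istringstream input(text);
    std::ostringstream result;
    std::remove_copy(std::istream_iterator<std::string>(input),
                     std::istream_iterator<std::string>(),
                     std::ostream_iterator<std::string>(result, " "),
                     word);
    return result.str();
}
```

Calling remove_word("somewhere down the road", "down") gives "somewhere the road " (with the trailing space).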

There is no benefit in converting to a vector; std::string itself already provides everything you need for what you want to do. If you want a vector anyway, you could do it this way:
vector<char> v;
v.assign(s.c_str(), s.c_str() + s.length()); // without...
v.assign(s.c_str(), s.c_str() + s.length() + 1); // including...
// ... terminating null character
Working on the string directly, though, it gets easy:
size_t pos = s.find(t);
if(pos != string::npos)
{
s.erase(pos, t.length());
}
This does not care, however, about leaving multiple whitespace or if t is not an entire word within s (e. g. t = "down"; s = "I'm going to downtown."; would result in s == "I'm going to town."), but you did not do so either...
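If the whole-word and leftover-whitespace caveats matter, one possible way to handle them is sketched below. It is not the only way, and the names are made up for this example: erase_whole_word only erases a match when the characters on both sides are not alphanumeric, and it swallows one following space so that no double space is left behind.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: erase every occurrence of `word` in `s` that is not
// glued to other alphanumeric characters on either side.
std::string erase_whole_word(std::string s, const std::string& word) {
    if (word.empty()) return s;
    auto is_word_char = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    std::size_t pos = 0;
    while ((pos = s.find(word, pos)) != std::string::npos) {
        const std::size_t end = pos + word.length();
        const bool left_ok = (pos == 0) || !is_word_char(s[pos - 1]);
        const bool right_ok = (end == s.size()) || !is_word_char(s[end]);
        if (left_ok && right_ok) {
            std::size_t count = word.length();
            // Also remove one following space so no double space remains.
            if (end < s.size() && s[end] == ' ') ++count;
            s.erase(pos, count);
        } else {
            pos = end; // match sits inside a longer word: skip past it
        }
    }
    return s;
}
```

With this, "Somewhere down the road" becomes "Somewhere the road", while "I'm going to downtown." is left untouched.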

First problem is, if std::string::erase is called only with the beginning position, it erases everything until the end of string.
Second problem is, that the code will just erase all letters which are in the second string, one by one. I.e. not the entire word - for that, you would need to check if the entire word matches, and only then erase (the entire length of the word). Ask yourself - what will happen in the code, if e.g. the first two letters will match, but not the rest of the word?

In your second for loop you never incremented j and inside the if (plm[i]==plm2[j]) block you used j instead of i as your offset in erase().
for(int i=0;i<plm.size();i++)
{
for(int j=0;j<plm2.size();j++)//here you need to increment j
{
if (plm[i]==plm2[j])
plm.erase(plm.begin()+i);//here the offset should be i
}
}
Another thing: don't use a do...while loop to read from the stringstream and push back onto the vector. If the read fails, you will push invalid data (an empty string) onto the vector. Instead, try something like:
string sub;
while(iss >> sub)
plm.push_back(sub);//only if reading is successful
...//do the same for the other istringstream too
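One more thing to watch with the index-based loop: after plm.erase(plm.begin()+i) the next element slides into position i, so the following i++ steps over it. A way to sidestep that, sketched here under the assumption that you are free to restructure the code, is the erase-remove idiom. split_words and remove_words are names invented for this example.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Split on whitespace; only successfully extracted words are stored.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream iss(text);
    std::vector<std::string> words;
    std::string word;
    while (iss >> word)
        words.push_back(word);
    return words;
}

// Hypothetical helper: drop every element of `words` that also appears in
// `unwanted`, without juggling indexes while erasing.
std::vector<std::string> remove_words(std::vector<std::string> words,
                                      const std::vector<std::string>& unwanted) {
    words.erase(
        std::remove_if(words.begin(), words.end(),
                       [&unwanted](const std::string& w) {
                           return std::find(unwanted.begin(), unwanted.end(), w)
                                  != unwanted.end();
                       }),
        words.end());
    return words;
}
```

Usage would be along the lines of remove_words(split_words(s), split_words(t)), which for the strings in the question leaves "Somewhere", "the", "road".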

You do not increment j; this is the first thing I saw in your code. Fix that first, and if it still doesn't work, then ask!

Related

How do I make an alphabetized list of all distinct words in a file with the number of times each word was used?

I am writing a program using Microsoft Visual C++. In the program I must read in a text file and print out an alphabetized list of all distinct words in that file with the number of times each word was used.
I have looked up different ways to alphabetize a string but they do not work with the way I have my string initialized.
// What is inside my text file
Any experienced programmer engaged in writing programs for use by others knows
that, once his program is working correctly, good output is a must. Few people
really care how much time and trouble a programmer has spent in designing and
debugging a program. Most people see only the results. Often, by the time a
programmer has finished tackling a difficult problem, any output may look
great. The programmer knows what it means and how to interpret it. However,
the same cannot be said for others, or even for the programmer himself six
months hence.
string lines;
getline(input, lines); // Stores what is in file into the string
I expect an alphabetized list of words with the number of times each word was used. So far, I do not know how to begin this process.
It's rather simple, std::map automatically sorts based on key in the key/value pair you get. The key/value pair represents word/count which is what you need. You need to do some filtering for special characters and such.
EDIT: std::stringstream is a nice way of splitting std::string using whitespace delimiter as it's the default delimiter. Therefore, using stream >> word you will get whitespace-separated words. However, this might not be enough due to punctuation. For example: Often, has a comma which we need to filter out. Therefore, I used std::replace_if, which replaces punctuation characters and digits with whitespace.
Now a new problem arises. In your example, you have: "must.Few" which will be returned as one word. After replacing . with a space we have "must Few". So I'm using another stringstream on the filtered "word" to make sure I have only words in the final result.
In the second loop you will notice if(word == "") continue;, this can happen if the string is not trimmed. If you look at the code you will find out that we aren't trimming after replacing puncts and digits. That is, "Often," will be "Often " with trailing whitespace. The trailing whitespace causes the second loop to extract an empty word. This is why I added the condition to ignore it. You can trim the filtered result and then you wouldn't need this check.
Finally, I have added ignorecase boolean to check if you wish to ignore the case of the word or not. If you wish to do so, the program will simply convert the word to lowercase and then add it to the map. Otherwise, it will add the word the same way it found it. By default, ignorecase = true, if you wish to consider case, just call the function differently: count_words(input, false);.
Edit 2: In case you're wondering, the statement counts[word] will automatically create key/value pair in the std::map IF there isn't any key matching word. So when we call ++: if the word isn't in the map, it will create the pair, and increment value by 1 so you will have newly added word. If it exists already in the map, this will increment the existing value by 1 and hence it acts as a counter.
The program:
#include <iostream>
#include <map>
#include <sstream>
#include <cstring>
#include <cctype>
#include <string>
#include <iomanip>
#include <algorithm>
std::string to_lower(const std::string& str) {
std::string ret;
for (char c : str)
ret.push_back(tolower(c));
return ret;
}
std::map<std::string, size_t> count_words(const std::string& str, bool ignorecase = true) {
std::map<std::string, size_t> counts;
std::stringstream stream(str);
while (stream.good()) {
// wordW may have multiple words connected by special chars/digits
std::string wordW;
stream >> wordW;
// filter special chars and digits
std::replace_if(wordW.begin(), wordW.end(),
[](const char& c) { return std::ispunct(c) || std::isdigit(c); }, ' ');
// now wordW may have multiple words separated by whitespace, extract them
std::stringstream word_stream(wordW);
while (word_stream.good()) {
std::string word;
word_stream >> word;
// ignore empty words
if (word == "") continue;
// add to count.
ignorecase ? counts[to_lower(word)]++ : counts[word]++;
}
}
return counts;
}
void print_counts(const std::map<std::string, size_t>& counts) {
for (auto pair : counts)
std::cout << std::setw(15) << pair.first << " : " << pair.second << std::endl;
}
int main() {
std::string input = "Any experienced programmer engaged in writing programs for use by others knows \
that, once his program is working correctly, good output is a must.Few people \
really care how much time and trouble a programmer has spent in designing and \
debugging a program.Most people see only the results.Often, by the time a \
programmer has finished tackling a difficult problem, any output may look \
great.The programmer knows what it means and how to interpret it.However, \
the same cannot be said for others, or even for the programmer himself six \
months hence.";
auto counts = count_words(input);
print_counts(counts);
return 0;
}
I have tested this with Visual Studio 2017 and here is part of the output:
a : 5
and : 3
any : 2
be : 1
by : 2
cannot : 1
care : 1
correctly : 1
debugging : 1
designing : 1
As others have already noted, an std::map handles the counting you care about quite easily.
Iostreams already have a tokenizer to break an input stream up into words. In this case, though, we want to "think" of only letters as characters that can make up words. A stream uses a locale to make that sort of decision, so to change how it's done, we need to define a locale that classifies characters as we see fit.
struct alpha_only: std::ctype<char> {
alpha_only(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table() {
// everything is white space
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::space);
// except lower- and upper-case letters, which are classified accordingly:
// (the end of the range passed to std::fill is exclusive, hence the + 1)
std::fill(&rc['a'], &rc['z'] + 1, std::ctype_base::lower);
std::fill(&rc['A'], &rc['Z'] + 1, std::ctype_base::upper);
return &rc[0];
}
};
With that in place, we tell the stream to use our ctype facet, then simply read words from the file and count them in the map:
std::cin.imbue(std::locale(std::locale(), new alpha_only));
std::map<std::string, std::size_t> counts;
std::string word;
while (std::cin >> word)
++counts[to_lower(word)];
...and when we're done with that, we can print out the results:
for (auto w : counts)
std::cout << w.first << ": " << w.second << "\n";
I'd probably start by inserting all of those words into an array of strings. Then start with the first index of the array and compare it with all of the other indexes; whenever you find a match, add 1 to a counter. After you have gone through the array, display the word you were searching for and how many matches there were, then move on to the next element, compare it with all of the other elements, display, and so on. Alternatively, if you keep a parallel array of integers that holds the number of matches, you could do all the comparisons in one pass and all the displays in another.
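A rough sketch of that compare-everything idea, assuming the words are already split into a vector. count_by_comparison and the counted flags are made up for this example; the flags just make sure a word that was already matched is not reported a second time. Note that this is quadratic in the number of words and does not sort the result.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: counts each distinct word by comparing it with every
// later word. Results appear in order of first occurrence, not alphabetically.
std::vector<std::pair<std::string, int>>
count_by_comparison(const std::vector<std::string>& words) {
    std::vector<std::pair<std::string, int>> result;
    std::vector<bool> counted(words.size(), false); // already matched earlier?
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (counted[i]) continue;
        int matches = 1; // the word itself
        for (std::size_t j = i + 1; j < words.size(); ++j) {
            if (words[j] == words[i]) {
                ++matches;
                counted[j] = true;
            }
        }
        result.emplace_back(words[i], matches);
    }
    return result;
}
```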
EDIT:
Everyone's answer seems more elegant because of the map's inherent sorting. My answer functions more as a parser, that later sorts the tokens. Therefore my answer is only useful to the extent of a tokenizer or lexer, whereas Everyone's answer is only good for sorted data.
You first probably want to read in the text file. You want to use a streambuf iterator to read in the file.
You will now have a string called content, which is the content of your file. Next you will want to iterate, or loop, over the contents of this string. To do that you'll want to use an iterator. There should be a string outside of the loop that stores the current word. You will iterate over the content string, and each time you hit a letter character, you will add that character to your current word string. Then, once you hit a space character, you will take that current word string, and push it back into the wordString vector. (Note: that means that this will ignore non-letter characters, and that only spaces denote word separation.)
Now that we have a vector of all of our words in strings, we can use std::sort, to sort the vector in alphabetical order.(Note: capitalized words take precedence over lowercase words, and therefore will be sorted first.) Then we will iterate over our vector of stringWords and convert them into Word objects (this is a little heavy-weight), that will store their appearances and the word string. We will push these Word objects into a Word vector, but if we discover a repeat word string, instead of adding it into the Word vector, we'll grab the previous entry and increment its appearance count.
Finally, once this is all done, we can iterate over our Word object vector and output the word followed by its appearances.
Full Code:
#include <vector>
#include <fstream>
#include <iostream>
#include <streambuf>
#include <algorithm>
#include <string>
class Word //define word object
{
public:
Word(){appearances = 1;}
~Word(){}
int appearances;
std::string mWord;
};
bool isLetter(const char x)
{
return((x >= 'a' && x <= 'z') || (x >= 'A' && x <= 'Z'));
}
int main()
{
std::string srcFile = "myTextFile.txt"; //what file are we reading
std::ifstream ifs(srcFile);
std::string content( (std::istreambuf_iterator<char>(ifs) ),
( std::istreambuf_iterator<char>() )); //read in the file
std::vector<std::string> wordStringV; //create a vector of word strings
std::string current = ""; //define our current word
for(auto it = content.begin(); it != content.end(); ++it) //iterate over our input
{
const char currentChar = *it; //make life easier
if(currentChar == ' ')
{
wordStringV.push_back(current);
current = "";
continue;
}
else if(isLetter(currentChar))
{
current += *it;
}
}
std::sort(wordStringV.begin(), wordStringV.end(), std::less<std::string>());
std::vector<Word> wordVector;
for(auto it = wordStringV.begin(); it != wordStringV.end(); ++it) //iterate over wordString vector
{
std::vector<Word>::iterator wordIt;
//see if the current word string has appeared before...
for(wordIt = wordVector.begin(); wordIt != wordVector.end(); ++wordIt)
{
if((*wordIt).mWord == *it)
break;
}
if(wordIt == wordVector.end()) //...if not create a new Word obj
{
Word theWord;
theWord.mWord = *it;
wordVector.push_back(theWord);
}
else //...otherwise increment the appearances.
{
++((*wordIt).appearances);
}
}
//print the words out
for(auto it = wordVector.begin(); it != wordVector.end(); ++it)
{
Word theWord = *it;
std::cout << theWord.mWord << " " << theWord.appearances << "\n";
}
return 0;
}
Side Notes
Compiled with g++ version 4.2.1 with target x86_64-apple-darwin, using the compiler flag -std=c++11.
If you don't like iterators you can instead do
for(std::size_t i = 0; i < content.size(); ++i)
{
char currentChar = content[i];
}
It's important to note that if you are capitalization-agnostic, you can simply apply std::tolower in the current += *it; statement (i.e. current += std::tolower(*it);).
Also, you seem like a beginner and this answer might have been too heavyweight, but you're asking for a basic parser and that is no easy task. I recommend starting by parsing simpler strings like math equations. Maybe make a calculator app.
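Since the word strings are sorted before they are counted, equal words end up next to each other, so the counting step could also just compare each word with the last entry added. A minimal sketch of that variation, using std::pair in place of the Word class; count_sorted is a name invented here:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: sorts the words, then counts runs of equal
// neighbours. The result is alphabetical (uppercase before lowercase).
std::vector<std::pair<std::string, int>>
count_sorted(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    std::vector<std::pair<std::string, int>> counts;
    for (const std::string& w : words) {
        if (!counts.empty() && counts.back().first == w)
            ++counts.back().second;    // same word as the previous one
        else
            counts.emplace_back(w, 1); // first time we see this word
    }
    return counts;
}
```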

How to read a complex input with istream&, string& and getline in c++?

I am very new to C++, so I apologize if this isn't a good question but I really need help in understanding how to use istream.
There is a project I have to create where it takes several amounts of input that can be on one line or multiple and then pass it to a vector (this is only part of the project and I would like to try the rest on my own), for example if I were to input this...
>> aaa bb
>> ccccc
>> ddd fff eeeee
Makes a vector of strings with "aaa", "bb", "ccccc", "ddd", "fff", "eeeee"
The input can be a char or string and the program stops asking for input when the return key is hit.
I know getline() gets a line of input and I could probably use a while loop to try and get the input such as...(correct me if I'm wrong)
while(!string.empty())
getline(cin, string);
However, I don't truly understand istream and it doesn't help that my class has not gone over pointers so I don't know how to use istream& or string& and pass it into a vector. On the project description, it said to NOT use stringstream but use functionality from getline(istream&, string&). Can anyone give somewhat of a detailed explanation as to how to make a function using getline(istream&, string&) and then how to use it in the main function?
Any little bit helps!
You're on the right track already; you'd only have to pre-fill the string with some dummy value to enter the while loop at all. More elegant:
std::string line;
do
{
std::getline(std::cin, line);
}
while(!line.empty());
This should already do the trick reading line by line (but possibly multiple words on one line!) and exiting, if the user enters an empty line (be aware that whitespace followed by newline won't be recognised as such!).
However, if anything on the stream goes wrong, you'll be trapped in an endless loop processing previous input again and again. So best check the stream state as well:
if(!std::getline(std::cin, line))
{
// this is some sample error handling - do whatever you consider appropriate...
std::cerr << "error reading from console" << std::endl;
return -1;
}
As there might be multiple words on a single line, you'd still have to split them. There are several ways to do so; quite an easy one is using a std::istringstream. You'll discover that it resembles what you are likely used to from std::cin:
std::istringstream s(line);
std::string word;
while(s >> word)
{
// append to vector...
}
Be aware that operator>> skips leading whitespace and stops at the first whitespace after the word (or at the end of the stream, if reached), so you don't have to deal with whitespace explicitly.
OK, you're not allowed to use std::stringstream (well, I used std::istringstream, but I suppose this little difference doesn't count, does it?). That changes matters a little: it gets more complex, but on the other hand we can decide ourselves what counts as a word and what as a separator... We might consider punctuation marks as separators just like whitespace, but allow digits to be part of words, so we'd accept e. g. ab.7c d as "ab", "7c", "d":
auto begin = line.begin();
auto end = begin;
while(end != line.end()) // iterate over each character
{
if(std::isalnum(static_cast<unsigned char>(*end)))
{
// we are inside a word; don't touch begin to remember where
// the word started
++end;
}
else
{
// non-alpha-numeric character!
if(end != begin)
{
// we discovered a word already
// (i. e. we did not move begin together with end)
words.emplace_back(begin, end);
// ('words' being your std::vector<std::string> to place the input into)
}
++end;
begin = end; // skip whatever we had already
}
}
// corner case: a line might end with a word NOT followed by whitespace
// this isn't covered within the loop, so we need to add another check:
if(end != begin)
{
words.emplace_back(begin, end);
}
It shouldn't be too difficult to adjust to different interpretations of what is a separator and what counts as word (e. g. std::isalpha(...) || *end == '_' to detect underscore as part of words, but digits not). There are quite a few helper functions you might find useful...
You could input the value of the first column, then call functions based on the value:
void Process_Value_1(std::istream& input, std::string& value);
void Process_Value_2(std::istream& input, std::string& value);
int main()
{
// ...
std::string first_value;
while (input_file >> first_value)
{
if (first_value == "aaa")
{
Process_Value_1(input_file, first_value);
}
else if (first_value == "ccc")
{
Process_Value_2(input_file, first_value);
}
//...
}
return 0;
}
A sample function could be:
void Process_Value_1(std::istream& input, std::string& value)
{
std::string b;
input >> b;
std::cout << value << "\t" << b << std::endl;
input.ignore(1000, '\n'); // Ignore until newline.
}
There are other methods to perform the process, such as using tables of function pointers and std::map.

How can I reach the second word in a string?

I'm new here and this is my first question, so don't be too harsh :]
I'm trying to reverse a sentence, i.e. every word separately.
The problem is that I just can't reach the second word, or even reach the ending of a 1-word sentence. What is wrong?
char* reverse(char* string)
{
int i = 0;
char str[80];
while (*string)
str[i++] = *string++;
str[i] = '\0'; //null char in the end
char temp;
int wStart = 0, wEnd = 0, ending = 0; //wordStart, wordEnd, sentence ending
while (str[ending]) /*####This part just won't stop####*/
{
//skip spaces
while (str[wStart] == ' ')
wStart++; //wStart - word start
//for each word
wEnd = wStart;
while (str[wEnd] != ' ' && str[wEnd])
wEnd++; //wEnd - word ending
ending = wEnd; //for sentence ending
for (int k = 0; k < (wStart + wEnd) / 2; k++) //reverse
{
temp = str[wStart];
str[wStart++] = str[wEnd];
str[wEnd--] = temp;
}
}
return str;
}
Your code is somewhat unidiomatic for C++ in that it doesn't actually make use of a lot of common and convenient C++ facilities. In your case, you could benefit from
std::string, which takes care of maintaining a buffer big enough to accommodate your string data.
std::istringstream, which can easily split a string on spaces for you.
std::reverse which can reverse a sequence of items.
Here's an alternative version which uses these facilities:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>
std::string reverse( const std::string &s )
{
// Split the string on spaces by iterating over the stream
// elements and inserting them into the 'words' vector.
std::vector<std::string> words;
std::istringstream stream( s );
std::copy(
std::istream_iterator<std::string>( stream ),
std::istream_iterator<std::string>(),
std::back_inserter( words )
);
// Reverse the words in the vector.
std::reverse( words.begin(), words.end() );
// Join the words again (inserting one space between two words)
std::ostringstream result;
std::copy(
words.begin(),
words.end(),
std::ostream_iterator<std::string>( result, " " )
);
return result.str();
}
At the end of the first word, after it's traversed, str[wEnd] is a space and you remember this index when you assign ending = wEnd. Immediately, you reverse the characters in the word. At that point, str[ending] is not a space because you included that space in the letter-reversal of the word.
Depending on whether there are extra spaces between words in the rest of the input, execution varies from this point, but it does eventually end with you reversing a word that ended at the null terminator on the string because you end the loop that increments wEnd on that null terminator and include it in the final word reversal.
The very next iteration walks off of the initialized part of the input string and the execution is undetermined from there because, heck, who knows what's in that array (str is stack-allocated, so it's whatever's sitting around in the memory occupied by the stack at that point).
On top of all of that, you don't update wStart except in the reversal loop, and it never moves to wEnd all the way (see the loop exit condition), so come to think of it, you're never getting past that first word. Assuming that was fixed, you'd still have the problem I outlined at first.
All this assumes that you didn't just send this function something longer than 80 characters and break it that way.
Oh, and as mentioned in one of the comments on the question, you're returning stack-allocated local storage, which isn't going to go anywhere good either.
Hoo, boy. Where to start?
In C++, use std::string instead of char* if you can.
char[80] is an overflow risk if string is input by a user; it should be dynamically allocated. Preferably by using std::string; otherwise use new / new[]. If you meant to use C, then malloc.
cmbasnett also pointed out that you can't actually return str (and get the expected results) if you declare / allocate it the way you did. Traditionally, you'd pass in a char* destination and not allocate anything in the function at all.
Set ending to wEnd + 1; wEnd points to the last non-null character of the string in question (eventually, if it works right), so in order for str[ending] to break out of the loop, you have to increment once to get to the null char. Disregard that, I misread the while loop.
It looks like you need to use ((wEnd - wStart) + 1), not (wStart + wEnd). Although you should really use something like while(wEnd > wStart) instead of a for loop in this context.
You also should be setting wStart = ending; or something before you leave the loop, because otherwise it's going to get stuck on the first word.
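Putting those points together, a version that reverses the letters of every word separately could look roughly like this. It is a sketch that swaps the char buffer for std::string and the hand-written swap loop for std::reverse; reverse_each_word is a name chosen for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical rewrite: reverses the letters of each space-separated word,
// keeping the words and the spaces between them in their original places.
std::string reverse_each_word(std::string s) {
    std::size_t start = 0;
    while (start < s.size()) {
        // skip spaces
        while (start < s.size() && s[start] == ' ') ++start;
        // find the end of the word (one past its last character)
        std::size_t end = start;
        while (end < s.size() && s[end] != ' ') ++end;
        // reverse only the word itself, never the space or terminator
        std::reverse(s.begin() + start, s.begin() + end);
        start = end; // continue after this word
    }
    return s;
}
```

Because the string is returned by value, there is no local buffer left dangling, and there is no 80-character limit either. For "hello world" this returns "olleh dlrow".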

Wordlist transfer for an anagram program

I'm almost finished with my program, but there's one last bug that I'm having problems ferreting out. The program is supposed to check about 10 scrambled words against a wordlist to see what the scrambled words are anagrams of. To do this, I alphabetized each word in the wordlist (apple would become aelpp), set that as the key of a map, and made the corresponding entry the original, unalphabetized word.
The program is messing up when it comes to the entries in the map. When the entry is six characters or less, the program tags a random character on the end of the string. I've narrowed down what can be causing the problem to a single loop:
while(myFile){
myFile.getline(str, 30);
int h=0;
for (; str[h] != 0; h++)//setting the initial version of str
{
strInit[h]=str[h]; //strInit is what becomes the entry into the map.
}
strInit[h+1]='\0'; //I didn't know if the for loop would include the null char
cout<<strInit; //Personal error-checking; not necessary for the program
}
And if it's necessary, here's the entire program:
Program
Prevent issues, use normal functions:
myFile.getline(str, 30);
strncpy(strInit, str, 30);
Prevent more issues, use standard strings:
std::string strInit, str;
while (std::getline(myFile, str)) {
strInit = str;
// do stuff
}
Best not to use raw C arrays at all! Here's a version, using modern C++:
#include <string>
std::string str;
while (std::getline(myFile, str))
{
// do something useful with str
// Example: mymap[str] = f(str);
std::cout << str; //Personal error-checking; not necessary for the program
}

Cleaning a string of punctuation in C++

Ok so before I even ask my question I want to make one thing clear. I am currently a student at NIU for Computer Science and this does relate to one of my assignments for a class there. So if anyone has a problem read no further and just go on about your business.
Now for anyone who is willing to help, here's the situation. For my current assignment we have to read a file that is just a block of text. For each word in the file we are to clear any punctuation in the word (e.g. "can't" would end up as "can" and "that--to" would end up as "that"; the quotes are only there to mark the examples).
The problem I've run into is that I can clean the string fine and then insert it into the map that we are using but for some reason with the code I have written it is allowing an empty string to be inserted into the map. Now I've tried everything that I can come up with to stop this from happening and the only thing I've come up with is to use the erase method within the map structure itself.
So what I am looking for is two things: any suggestions about how I could a) fix this without simply erasing it and b) improve the code I have already written.
Here are the functions I have written to read in from the file and then the one that cleans it.
Note: the function that reads in from the file calls the clean_entry function to get rid of punctuation before anything is inserted into the map.
Edit: Thank you Chris. Numbers are allowed :). If anyone has any improvements to the code I've written or any criticisms of something I did I'll listen. At school we really don't get feedback on the correct, proper, or most efficient way to do things.
int get_words(map<string, int>& mapz)
{
int cnt = 0; //set out counter to zero
map<string, int>::const_iterator mapzIter;
ifstream input; //declare instream
input.open( "prog2.d" ); //open instream
assert( input ); //assure it is open
string s; //temp strings to read into
string not_s;
input >> s;
while(!input.eof()) //read in until EOF
{
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
}
input.close(); //close instream
for(mapzIter = mapz.begin(); mapzIter != mapz.end(); mapzIter++)
cnt = cnt + mapzIter->second;
return cnt; //return number of words in instream
}
void clean_entry(const string& non_clean, string& clean)
{
int i, j, begin, end;
for(i = 0; isalnum(non_clean[i]) == 0 && non_clean[i] != '\0'; i++);
begin = i;
if(begin ==(int)non_clean.length())
return;
for(j = begin; isalnum(non_clean[j]) != 0 && non_clean[j] != '\0'; j++);
end = j;
clean = non_clean.substr(begin, (end-begin));
for(i = 0; i < (int)clean.size(); i++)
clean[i] = tolower(clean[i]);
}
The problem with empty entries is in your while loop. If you get an empty string, you clean the next one, and add it without checking. Try changing:
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
to
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() > 0)
{
mapz[not_s]++; //increment occurence
}
input >>s;
EDIT: I notice you are checking if the characters are alphanumeric. If numbers are not allowed, you may need to revisit that area as well.
Further improvements would be to
declare variables only when you use them, and in the innermost scope
use c++-style casts instead of the c-style (int) casts
use empty() instead of length() == 0 comparisons
use the prefix increment operator for the iterators (i.e. ++mapzIter)
A blank string is a valid instance of the string class, so there's nothing special about adding it into the map. What you could do is first check whether it's empty, and only increment if it isn't:
if (!not_s.empty())
mapz[not_s]++;
Style-wise, there's a few things I'd change, one would be to return clean from clean_entry instead of modifying it:
string not_s = clean_entry(s);
...
string clean_entry(const string &non_clean)
{
string clean;
... // as before
if(begin ==(int)non_clean.length())
return clean;
... // as before
return clean;
}
This makes it clearer what the function is doing (taking a string, and returning something based on that string).
The function 'get_words' is doing a lot of distinct actions that could be split out into other functions. There's a good chance that by splitting it up into its individual parts, you would have found the bug yourself.
From the basic structure, I think you could split the code into (at least):
getNextWord: Return the next (non blank) word from the stream (returns false if none left)
clean_entry: What you have now
getNextCleanWord: Calls getNextWord, and if 'true' calls clean_entry. Returns 'false' if no words left.
The signatures of 'getNextWord' and 'getNextCleanWord' might look something like:
bool getNextWord (std::ifstream & input, std::string & str);
bool getNextCleanWord (std::ifstream & input, std::string & str);
The idea is that each function does a smaller more distinct part of the problem. For example, 'getNextWord' does nothing but get the next non blank word (if there is one). This smaller piece therefore becomes an easier part of the problem to solve and debug if necessary.
The main component of 'get_words' then can be simplified down to:
std::string nextCleanWord;
while (getNextCleanWord (input, nextCleanWord))
{
++mapz[nextCleanWord];
}
An important aspect to development, IMHO, is to try to Divide and Conquer the problem. Split it up into the individual tasks that need to take place. These sub-tasks will be easier to complete and should also be easier to maintain.