read multiple lines but especially... parsing them efficiently - c++

I need to read multiple lines, each beginning with a specific keyword.
It's a basic problem, but I could use a hand.
Here is the kind of input I have:
keyword1 0.0 0.0
keyword1 1.0 5.0
keyword2 10.0
keyword3 0.5
keyword4 6.0
The rules are:
lines containing keyword1 & keyword2 SHOULD be in that order AND before any other lines.
lines containing keyword3 & keyword4 can be in any order
keyword1 HAS TO be followed by 2 doubles
keyword2, 3 & 4 HAVE TO be followed by 1 double
at the end of a block of lines containing all four keywords followed by their doubles, the "loop" breaks and a calculation is triggered.
Here's the source I have:
using namespace std;
int main (int argc, const char * argv[]) {
vector<double> arrayInputs;
string line;
double keyword1_first, keyword1_second, keyword4,
keyword3, keyword2;
bool inside_keyword1=false, after_keyword2=false,
keyword4_defined=false, keyword3_defined=false ;
//cin.ignore();
while (getline(cin, line)) {
if (inside_keyword1 && after_keyword2 && keyword3 && keyword4) {
break;
}
else
{
std::istringstream split(line);
std::vector<std::string> tokens;
char split_char = ' ';
for (std::string each; std::getline(split, each, split_char); tokens.push_back(each));
if (tokens.size() > 2)
{
if (tokens[0] != "keyword1") return EXIT_FAILURE; // input format error
else
{
keyword1_first = atof(tokens[1].c_str());
keyword1_second = atof(tokens[2].c_str());
inside_keyword1 = true;
}
}
else
{
if (tokens[0] == "keyword2")
{
if (inside_keyword1)
{
keyword2 = atof(tokens[1].c_str());
after_keyword2 = true;
}
else return EXIT_FAILURE; // cannot define anything else keyword2 after keyword1 definition
}
else if (tokens[0] == "keyword3")
{
if (inside_keyword1 && after_keyword2)
{
keyword3 = atof(tokens[1].c_str());
keyword3_defined = true;
}
else return EXIT_FAILURE; // cannot define keyword3 outside a keyword1
}
else if (tokens[0] == "keyword4")
{
if (inside_keyword1 && after_keyword2)
{
keyword4 = atof(tokens[1].c_str());
keyword4_defined = true;
}
else return EXIT_FAILURE; // cannot define keyword4 outside a keyword1
}
}
}
}
// Calculation
// output
return EXIT_SUCCESS;
}
My question is: Is there a more efficient way to go about this besides using booleans in the reading/parsing loop ?

You ask about something "more efficient", but it seems you don't have a particular performance objective. So what you want here is probably more like a Code Review. There's a site for that, in particular:
https://codereview.stackexchange.com/
But anyway...
You are correct to intuit that four booleans are not really called for here. That's 2^4 = 16 different "states", many of which you should never be able to get to. (Your specification explicitly forbids, for instance, keyword3_defined == true when after_keyword2 == false).
Program state can be held in enums and booleans, sure. That makes it possible for a "forgetful" loop to revisit a line of code under different circumstances, yet still remember what phase of processing it is in. It's useful in many cases, including in sophisticated parsers. But if your task is linear and simple, it's better to implicitly "know" the state based on having reached a certain line of code.
As an educational example to show the contrast I'm talking about, here's a silly state machine to read in a letter A followed by any number of letter Bs:
enum State {
beforeReadingAnA,
haveReadAnA,
readingSomeBs,
doneReadingSomeBs
};
State s = beforeReadingAnA;
char c;
while(true) {
switch (s) {
case beforeReadingAnA:
cin >> c;
if (cin.good() && c == 'A') {
// good! accept and state transition to start reading Bs...
s = haveReadAnA;
} else {
// ERROR: expected an A
return EXIT_FAILURE;
}
break;
case haveReadAnA:
// We've read an A, so state transition into reading Bs
s = readingSomeBs;
break;
case readingSomeBs:
cin >> c;
if (cin.good() && c == 'B') {
// good! stay in the readingSomeBs state
} else if (cin.eof()) {
// reached the end of the input after 0 or more Bs
s = doneReadingSomeBs;
} else {
// ERROR: expected a B or the EOF
return EXIT_FAILURE;
}
break;
case doneReadingSomeBs:
// all done!
return EXIT_SUCCESS;
}
}
As mentioned, it's a style of coding that can be very, very useful. Yet for this case it's ridiculous. Compare with a simple linear piece of code that does the same thing:
// beforeReadingAnA is IMPLICIT
char c;
cin >> c;
if (cin.fail() || c != 'A')
    return EXIT_FAILURE;
// haveReadAnA is IMPLICIT
do {
    // readingSomeBs is IMPLICIT
    cin >> c;
    if (cin.eof())
        return EXIT_SUCCESS;
    if (cin.fail() || c != 'B')
        return EXIT_FAILURE;
} while (true);
// doneReadingSomeBs is IMPLICIT (the loop only exits via return)
All the state variables disappear. They are unnecessary because the program just "knows where it is". If you rethink your example then you can probably do the same. You won't need four booleans because you can put your cursor on a line of code and say with confidence what those four boolean values would have to be if that line of code happens to be running.
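To make that concrete for the keyword input in the question, here is a minimal sketch rather than a definitive implementation. It assumes one or more keyword1 entries come first, and that the input is read token by token, so line breaks are treated like any other whitespace. The names Block and readBlock are made up for illustration and do not appear in the original code.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for one parsed block (not from the original post).
struct Block {
    std::vector<std::pair<double, double>> keyword1;  // one pair per keyword1 line
    double keyword2 = 0.0;
    double keyword3 = 0.0;
    double keyword4 = 0.0;
};

// Returns true if a complete, correctly ordered block was read into b.
bool readBlock(std::istream& in, Block& b) {
    std::string key;

    // Phase 1: one or more keyword1 entries, each with two doubles.
    while (in >> key && key == "keyword1") {
        double first = 0.0, second = 0.0;
        if (!(in >> first >> second)) return false;
        b.keyword1.emplace_back(first, second);
    }
    if (!in || b.keyword1.empty()) return false;

    // Phase 2: key already holds the token that ended phase 1;
    // it has to be keyword2, followed by one double.
    if (key != "keyword2" || !(in >> b.keyword2)) return false;

    // Phase 3: keyword3 and keyword4, once each, in either order.
    std::string keyA, keyB;
    double valA = 0.0, valB = 0.0;
    if (!(in >> keyA >> valA >> keyB >> valB)) return false;
    if (keyA == "keyword3" && keyB == "keyword4") {
        b.keyword3 = valA;
        b.keyword4 = valB;
    } else if (keyA == "keyword4" && keyB == "keyword3") {
        b.keyword4 = valA;
        b.keyword3 = valB;
    } else {
        return false;
    }
    return true;
}
```

There are no state flags: which phase the parser is in follows from which statement is executing. Pass cin for real input, or a std::istringstream to try it out.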
As far as efficiency goes, the <iostream> classes can make life easier than you have it here and be more idiomatically C++ without invoking C-isms like atof or ever having to use c_str(). Let's look at a simplified excerpt of your code that just reads the doubles associated with "keyword1".
string line;
getline(cin, line);
istringstream split(line);
vector<string> tokens;
char split_char = ' ';
string each;
while (getline(split, each, split_char)) {
tokens.push_back(each);
}
double keyword1_first, keyword1_second;
if (tokens.size() > 2) {
if (tokens[0] != "keyword1") {
return EXIT_FAILURE; // input format error
} else {
keyword1_first = atof(tokens[1].c_str());
keyword1_second = atof(tokens[2].c_str());
}
}
Contrast that with this:
string keyword;
cin >> keyword;
if (keyword != "keyword1") {
return EXIT_FAILURE;
}
double keyword1_first, keyword1_second;
cin >> keyword1_first >> keyword1_second;
Magic. The >> operator is overloaded on the type of the variable you are reading into, so iostreams know how to interpret the input. If a stream encounters a problem interpreting the input in the way you ask for, it generally stops and leaves the offending characters in the buffer, so that (once you have cleared the error state) you can try reading them another way. (In the case of asking for a string, the behavior is to read a series of characters up to whitespace...if you actually wanted an entire line, you would use getline as you had done.)
The error handling is something you'll have to deal with, however. It's possible to tell iostreams to use exception-handling methodology, so that the standard response to encountering a problem (such as a random word in a place where a double was expected) would be to throw an exception, which terminates your program if nothing catches it. But the default is to set a failure flag that you need to test:
cin erratic behaviour
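As a rough illustration of testing that flag, the sketch below tries to read one double and, on failure, clears the error state and throws away the rest of the offending line. The helper name readDouble is invented for this example, and discarding the whole line is only one possible recovery policy.

```cpp
#include <istream>
#include <limits>
#include <sstream>

// Tries to read one double from in. On failure, clears the error flags and
// discards the rest of the current line so that a later read can succeed.
bool readDouble(std::istream& in, double& value) {
    if (in >> value) {
        return true;  // extraction succeeded, the stream is still usable
    }
    in.clear();  // reset failbit, otherwise every later read fails too
    in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    return false;
}
```

Without the call to clear(), the stream stays in its failed state and every subsequent extraction returns immediately without reading anything.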
There's nuance to iostream, so you probably want to do some survey of Q&A...I've been learning a bit myself lately while answering/asking here:
Output error when input isn't a number. C++
When to use printf/scanf vs cout/cin?

c++ appending text into a string until see a specific character

I have more than one input files like this:
>1aab_
GKGDPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKT
MSAKEKGKFEDMAKADKARYEREMKTYIPPKGE
>1j46_A
MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAE
KWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK
>1k99_A
MKKLKKHPDFPKKPLTPYFRFFMEKRAKYAKLHPEMSNLDLTKILSKKYK
ELPEKKKMKYIQDFQREKQEFERNLARFREDHPDLIQNAKK
>2lef_A
MHIKKPLNAFMLYMKEMRANVVAESTLKESAAINQILGRRWHALSREEQA
KYYELARKERQLHMQLYPGWSARDNYGKKKKRKREK
Here is what I have to do:
vector <string> names;
vector <string> seqs;
names.resize(total); //"total" is already known.
seqs.resize(total);
counter=0;char input;
while ((input = myInput.get()) != EOF)
{
if(input=='>')
names[counter] = take the whole line (>1aab_, >1j46_A, and so on)
else
until the next '>' is seen, append the character to sequence[counter]
counter++;
}
Finally it will be like this:
names[0]=">1aab_"
sequence[0]="GKGDPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE"
and so on..
I have been thinking about this for 2 hours and couldn't figure it out. Can anyone help with that? Thanks in advance.
There's a few ways to solve it; I'll present some examples but I'm not testing/compiling this code, so there may be minor bugs - the logic is the important bit.
Since your pseudocode appears to be processing the input character by character, I've taken that as a requirement.
The way you seem to be thinking about it would be implemented with essentially a pair of loops - one for reading the name, the other for reading the sequence - which are enclosed in an outer loop, in order to process all records.
This would look something like the following:
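// first character in file should be a '>', indicating the start
// of a record.
// Note: input must be declared as an int (not a char) so that EOF
// can be told apart from every valid character value.
input = myInput.get();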
if (input != '>')
{
std::cerr << "Malformed input file!" << std::endl;
return /*...*/;
}
do
{
// record name continues up until the newline
while ((input = myInput.get()) != EOF)
{
if (input == '\n' || input == '\r')
break;
names[counter].push_back(input);
}
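// read sequence until we hit a '>' or EOF
while ((input = myInput.get()) != EOF)
{
if (input == '>')
{
// advance to next record number
counter++;
break;
}
// skip line endings so the sequence ends up as one unbroken string
if (input == '\n' || input == '\r')
continue;
sequence[counter].push_back(input);
}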
} while (input != EOF && counter < total);
You'll also notice I moved the check for the initial '>' to before the loop, just as a way of ingesting (and discarding) the character, as well as a basic sanity check of the input. This is because we really use this character to mark the end of the sequence (rather than the "start of a record") - when we enter the loop, we assume we're already reading the record name.
Another way to approach it is to use a state machine. Essentially, this utilises additional variables to track the state the parser is in.
For this particular case, you only have two states: either you're reading a record name, or the sequence. So, we can just use a single boolean to track which state we're in.
Armed with the state variable, we can then make decisions about what to do with the character we just read based upon the state we're in. At the simplest level here, if we're in "read the record name" state, we add the character to the names variable, otherwise we add it to the sequence variable.
// state flag to indicate if we're currently reading a name line,
// i.e. a line starting with ">"
// This should be set true by the first record we encounter, so
// we'll set it false (to indicate we're reading a sequence) in
// order to allow us to detect bad input files.
bool reading_name = false;
// indicate we're on the first record, so we can avoid incrementing
// the record counter
bool first_record = true;
// process input character-by-character until end of file
while ((input = myInput.get()) != EOF)
{
// check for start of new record
if (input == '>')
{
// for robustness, verify we're not already reading a name,
// as this probably indicates invalid input
if (reading_name)
{
std::cerr << "Input is malformed?!" << std::endl;
break;
}
// switch to reading name state
reading_name = true;
// advance to next record, but only if it isn't the first record
if (first_record)
{
// disable the first_record flag, and explicitly set the
// record counter to 0.
first_record = false;
counter = 0;
}
else if (++counter >= total)
{
std::cerr << "Error: too many records!" << std::endl;
break;
}
}
// first character in file should start a new record
else if (first_record)
{
std::cerr << "Missing record start character at beginning of input!" << std::endl;
break;
}
// make sure we are processing a valid record number
else if (counter >= total)
{
std::cerr << "Invalid record number!" << std::endl;
break;
}
// continue reading the name
else if (reading_name)
{
// check if we've reached the end of the line; you
// may also want/need to check for \r if your input
// files may have Windows-style line endings
if (input == '\n')
{
// switch to reading sequence state
reading_name = false;
}
else
{
// add character to current name
names[counter].push_back(input);
}
}
// continue reading the sequence
else
{
// you might need to handle line ending characters here,
// maybe just skipping them?
// add character to current sequence
sequence[counter].push_back(input);
}
}
This adds a fair amount of complexity, which is of questionable value for this particular exercise, but does make adding additional states easier in future. It also has the benefit of only a single place in the code where I/O is done, which reduces the chances of errors (not checking for EOF, overflow array bounds, etc.).
In this case, we're actually using the '>' character as an indicator that a new record is starting, so we add a bit of extra logic to make sure that all works properly with the record counter. You could also just use a signed integer for your counter variable and start it at -1, so it will increment to 0 at the start of the first record, but using signed variables to index into arrays isn't a good idea.
There are more ways to approach this problem, but hopefully this gives you somewhere to start on your own solution.
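One of those other ways, if character-by-character processing is not actually required, is to read whole lines with std::getline. The following is only a sketch under that assumption. The function name readRecords is invented here, and it grows the two vectors as it goes instead of relying on a known total.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads records of the form ">name" followed by one or more sequence lines.
// Returns false if sequence data appears before the first '>' line.
bool readRecords(std::istream& in,
                 std::vector<std::string>& names,
                 std::vector<std::string>& seqs) {
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows-style line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;

        if (line[0] == '>') {
            names.push_back(line);  // keeps the leading '>' as in the question
            seqs.emplace_back();    // start an empty sequence for this record
        } else if (seqs.empty()) {
            return false;           // data before any record header
        } else {
            seqs.back() += line;    // join the wrapped sequence lines
        }
    }
    return true;
}
```

Because getline strips the newline, the wrapped sequence lines are concatenated without any line-ending characters in between.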

Is there a way to check input data type using only basic concepts?

I'm being challenged to find ways to perform tasks that usually require the use of headers (besides iostream and iomanip) or greater-than-basic C++ knowledge. How can I check the data type of user input using only logical operators, basic arithmetic (+, -, *, /, %), if statements, and while loops?
Obviously the input variable has a declared data type in the first place, but this problem is covering the possibility of the user inputting the wrong data type.
I've tried several methods including the if (!(cin >> var1)) trick, but nothing works correctly. Is this possible at all?
Example
int main() {
int var1, var2;
cin >> var1;
cin >> var2;
cout << var1 << " - " << var2 << " = " << (var1-var2);
return 0;
}
It's possible to input asdf and 5.25 here, so how do I check that the input aren't integers as expected, using only the means I stated earlier?
I understand this problem is vague in many ways, mostly because the restrictions are extremely specific and listing everything I'm allowed to use would be a pain. I guess part of the problem as mentioned in the comments is figuring out how to distinguish between data types in the first place.
You can do that using simple operations, although it might be a little difficult. For example, the following function checks whether the input is a decimal integer (an optional sign followed by digits). You can extend the idea to accept a single period between the digits for floating-point numbers.
Add a comment if you need further help.
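bool isNumber(const char *inp){
    int i = 0;
    // skip an optional leading sign
    if (inp[0] == '+' || inp[0] == '-') i = 1;
    // a sign on its own (or an empty string) is not a number
    if (!inp[i]) return false;
    // every remaining character has to be a digit
    for (; inp[i]; i++){
        if (!(inp[i] >= '0' && inp[i] <= '9'))
            return false;
    }
    return true;
}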
General checking after reading is done like this:
stream >> variable;
if (stream.fail()) {
// not successful
}
This can be done on any std::istream. It works for standard types (any numeric type, char, string, etc.), stopping at whitespace. If your variable could not be read, fail() returns true. (Avoid testing good() here: it also returns false when a successful read merely reached the end of the input.) You can customize it for your own classes (including control over the stream's failure state):
istream & operator>>(istream & stream, YourClass & c)
{
// Read the data from stream into c
return stream;
}
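As a small, hedged example of such a custom extractor: the Point type and its "x,y" text format below are made up purely for illustration. The operator reports a malformed separator by setting failbit itself, which is how a user-defined type takes part in the same post-read check as the built-in types.

```cpp
#include <istream>
#include <sstream>

// Hypothetical type, read from text of the form "x,y" (e.g. "3,4").
struct Point {
    int x = 0;
    int y = 0;
};

std::istream& operator>>(std::istream& in, Point& p) {
    Point tmp;
    char sep = '\0';
    if (in >> tmp.x >> sep >> tmp.y) {
        if (sep == ',') {
            p = tmp;  // commit only when the whole value parsed correctly
        } else {
            in.setstate(std::ios::failbit);  // wrong separator: report failure
        }
    }
    return in;
}
```

After `stream >> point;` the caller checks the stream exactly as it would after reading an int.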
For your specific problem: Suppose you read the characters 42. There is no way of distinguishing between reading it as
- an int
- a double
as both would be perfectly fine. You have to specify the input format more precisely.
The standard library is not magic - you just have to parse the data read from the user, similarly to what the standard library does.
First read the input from the user:
std::string s;
cin >> s;
(you may use getline instead if you want to read a whole line)
Then you can go on parsing it; we'll try to distinguish between integer ( *[+-]?[0-9]+ *), real number ( *[+-]?[0-9]+(\.[0-9]*)?([Ee][+-]?[0-9]+)? *), string ( *"[^"]*" *) and anything else ("bad").
enum TokenType {
Integer,
Real,
String,
Bad
};
The basic building block is a routine that "eats" consecutive digits; this will help us with the [0-9]* and [0-9]+ parts.
void eatdigits(const char *&rp) {
while(*rp>='0' && *rp<='9') rp++;
}
Also, a routine that skips whitespace can be handy:
void skipws(const char *&rp) {
while(*rp==' ') rp++;
// feel free to skip also tabs and whatever
}
Then we can attack the real problem
TokenType categorize(const char *rp) {
first, we want to skip the whitespace
skipws(rp);
then, we'll try to match the easiest stuff: the string
if(*rp=='"') {
// Skip the opening quote, then the string content
rp++;
while(*rp && *rp!='"') rp++;
// If the string stopped with anything different than " we
// have a parse error
if(!*rp) return Bad;
// Otherwise, skip the closing quote and the trailing whitespace
rp++;
skipws(rp);
// And check if we got at the end
return *rp?Bad:String;
}
Then, on to numbers, notice that the real and integer definitions start in the same way; we have a common branch:
// If there's a + or -, it's fine, skip it
if(*rp=='+' || *rp=='-') rp++;
const char *before=rp;
// Skip the digits
eatdigits(rp);
// If we didn't manage to find any digit, it's not a valid number
if(rp==before) return Bad;
// If it ends here or after whitespace, it's an integer
if(!*rp) return Integer;
before = rp;
skipws(rp);
if(before!=rp) return *rp?Bad:Integer;
If we notice that there's still stuff, we tackle the real number:
// Maybe something after the decimal dot?
if(*rp=='.') {
rp++;
eatdigits(rp);
}
// Exponent
if(*rp=='E' || *rp=='e') {
rp++;
if(*rp=='+' || *rp=='-') rp++;
before=rp;
eatdigits(rp);
if(before==rp) return Bad;
}
skipws(rp);
return *rp?Bad:Real;
}
You can easily invoke this routine after reading the input.
(notice that here the string thing is just for fun, cin does not have any special processing for double-quotes delimited strings).

How do I check that stream extraction has consumed all input?

In the following function, I try to see if a string s is convertible to type T by seeing if I can read a type T, and if the input is completely consumed afterwards:
template <class T>
bool can_be_converted_to(const std::string& s, T& t)
{
std::istringstream i(s);
i>>std::boolalpha;
i>>t;
if (i and i.eof())
return true;
else
return false;
}
However, can_be_converted_to<bool>("true") evaluates to false, because i.eof() is false at the end of the function.
This is correct, even though the function has read the entire string, because it hasn't attempted to read past the end of the string. (So, apparently this function works for int and double because istringstream reads past the end when reading these.)
So, assuming that I should indeed be checking (i and <input completely consumed>):
Q: How do I check that the input was completely consumed w/o using eof()?
Use peek() or get() to check what's next in the stream:
return (i >> std::boolalpha >> t && i.peek() == EOF);
Your version doesn't work for integers, either. Consider this input: 123 45. It'll read 123 and report true, even though there are still some characters left in the stream.
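Put together, one possible reworking of the function along these lines could look like the sketch below. Skipping trailing whitespace before the peek() is an added assumption about what should count as "completely consumed"; drop that line if trailing spaces should be rejected.

```cpp
#include <sstream>
#include <string>

// Returns true if s parses as a T with nothing but whitespace left over.
template <class T>
bool can_be_converted_to(const std::string& s, T& t) {
    std::istringstream in(s);
    if (!(in >> std::boolalpha >> t)) {
        return false;  // the extraction itself failed
    }
    if (in.eof()) {
        return true;   // the extraction already ran into the end of the input
    }
    in >> std::ws;     // assumption: trailing whitespace is acceptable
    return in.peek() == std::istringstream::traits_type::eof();
}
```

With this version "123 45" is rejected for int, while "true" is accepted for bool regardless of whether the extraction happened to set eofbit.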
In many implementations of the standard library the eof will only be set after you tried reading beyond the end. You can verify that in your code by doing:
char _;
if (i && !(i >> _)) { // i is in a valid state, but
// reading a single extra char fails,
// so the input was completely consumed
}
Extending on jrok's answer, you can use i.get() just as easily as
i.peek(), at least in this case. (I don't know if there is any reason
to prefer one to the other.)
Also, following the convention that white space is never anything but a
separator, you might want to extract it before checking for the end.
Something like:
return i >> std::ws && i.get() == std::istream::traits_type::eof();
Some older implementations of std::ws were buggy, and would put the
stream in an error state. In that case, you'd have to inverse the test,
and do something like:
return !(i >> std::ws) || i.get() == std::istream::traits_type::eof();
Or just read the std::ws before the condition, and depend uniquely on
the i.get().
(I don't know if buggy std::ws is still a problem. I developed a
version of it that worked back when it was, and I've just continued to
use it.)
I would like to offer a completely different approach:
Take your input string, tokenise it yourself, and then convert the individual fields using boost::lexical_cast<T>.
Reason: I wasted an afternoon on parsing a string containing 2 int and 2 double fields, separated by spaces. Doing the following:
int i, j;
double x, y;
std::istringstream ins{str};
ins >> i >> j >> x >> y;
// how to check errors???...
parses the correct input such as
`"5 3 9.9e+01 5.5e+02"`
correctly, but does not detect the problem with this:
`"5 9.6e+01 5.5e+02"`
What happens is that i will be set to 5 (OK), j will be set to 9 (??), x to 6.0 (=0.6e+01), y to 550 (OK). I was quite surprised to see failbit not being set... (platform info: OS X 10.9, Apple Clang++ 6.0, C++11 mode).
Of course you can say now, "But wait, the Standard states that it should be so", and you may be right, but knowing that it is a feature rather than a bug does not reduce the pain if you want to do proper error checking without writing miles of code.
OTOH, if you use "Marius"'s excellent tokeniser function and split str first on whitespace then suddenly everything becomes very easy. Here is a slightly modified version of the tokeniser. I re-wrote it to return a vector of strings; the original is a template that puts the tokens in a container with elements convertible to strings. (For those who need such a generic approach please consult the original link above.)
// \param str: the input string to be tokenized
// \param delimiters: string of delimiter characters
// \param trimEmpty: if true then empty tokens will be trimmed
// \return a vector of strings containing the tokens
std::vector<std::string> tokenizer(
const std::string& str,
const std::string& delimiters = " ",
const bool trimEmpty = false
) {
std::vector<std::string> tokens;
std::string::size_type pos, lastPos = 0;
const char* strdata = str.data();
while(true) {
pos = str.find_first_of(delimiters, lastPos);
if(pos == std::string::npos) {
// no more delimiters
pos = str.length();
if(pos != lastPos || !trimEmpty) {
tokens.emplace_back(strdata + lastPos, pos - lastPos);
}
break;
} else {
if(pos != lastPos || !trimEmpty) {
tokens.emplace_back(strdata + lastPos, pos - lastPos);
}
}
lastPos = pos + 1;
}
return tokens;
}
and then just use it like this (ParseError is some exception object):
std::vector<std::string> tokens = tokenizer(str, " \t", true);
if (tokens.size() < 4)
throw ParseError{"Too few fields in " + str};
try {
unsigned int i{ boost::lexical_cast<unsigned int>(tokens[0]) },
j{ boost::lexical_cast<unsigned int>(tokens[1]) };
double x{ boost::lexical_cast<double>(tokens[2]) },
y{ boost::lexical_cast<double>(tokens[3]) };
// print or process i, j, x, y ...
} catch(const boost::bad_lexical_cast& error) {
throw ParseError{"Could not parse " + str};
}
Note: you can use the Boost split or the tokenizer if you wish, but they were slower than Marius' tokeniser (at least in my environment).
Update: Instead of boost::lexical_cast<T> you can use the C++11 "std::sto*" functions (e.g. stoi to convert a string token to an int). These throw two kinds of exceptions: std::invalid_argument if the conversion could not be performed and std::out_of_range if the converted value cannot be represented.
You could either catch these separately or catch their common base class std::logic_error. Modifications to the example code above are left as an exercise to the reader :-)
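For what it's worth, here is one way that exercise might be sketched. Note that std::stoi and std::stod stop at the first character they cannot use and do not complain about what follows, so the sketch checks the optional position argument to make sure each token was consumed in full. The helper names toInt, toDouble and parseFields are invented for this example, and a plain istringstream stands in for the tokeniser.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Converts a whole token to int. std::stoi alone would turn "9.6e+01" into 9,
// so leftover characters are reported as std::invalid_argument.
int toInt(const std::string& token) {
    std::size_t used = 0;
    int value = std::stoi(token, &used);
    if (used != token.size()) throw std::invalid_argument("trailing characters");
    return value;
}

// Same idea for double.
double toDouble(const std::string& token) {
    std::size_t used = 0;
    double value = std::stod(token, &used);
    if (used != token.size()) throw std::invalid_argument("trailing characters");
    return value;
}

// Parses exactly two ints and two doubles separated by whitespace.
bool parseFields(const std::string& str, int& i, int& j, double& x, double& y) {
    std::istringstream in(str);
    std::vector<std::string> tokens;
    for (std::string tok; in >> tok;) tokens.push_back(tok);
    if (tokens.size() != 4) return false;
    try {
        const int ti = toInt(tokens[0]);
        const int tj = toInt(tokens[1]);
        const double tx = toDouble(tokens[2]);
        const double ty = toDouble(tokens[3]);
        i = ti; j = tj; x = tx; y = ty;  // commit only if everything parsed
    } catch (const std::logic_error&) {  // invalid_argument or out_of_range
        return false;
    }
    return true;
}
```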

Skip reading a line in a INI file if its length greater than n in C++

I want to skip reading a line in the INI file if it has more than 1000 characters. This is the code I'm using:
#define MAX_LINE 1000
char buf[MAX_LINE];
CString strTemp;
str.Empty();
for(;;)
{
is.getline(buf,MAX_LINE);
strTemp=buf;
if(strTemp.IsEmpty()) break;
str+=strTemp;
if(str.Find("^")>-1)
{
str=str.Left( str.Find("^") );
do
{
is.get(buf,2);
} while(is.gcount()>0);
is.getline(buf,2);
}
else if(strTemp.GetLength()!=MAX_LINE-1) break;
}
//is.getline(buf,MAX_LINE);
return is;
...
The problem I'm facing is that if a line exceeds 1000 characters, the code seems to fall into an infinite loop (it is unable to read the next line). How can I make getline skip that line and read the next one?
const std::size_t max_line = 1000; // not a macro, macros are disgusting
std::string line;
while (std::getline(is, line))
{
if (line.length() > max_line)
continue;
// else process the line ...
}
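If the fixed-size buffer has to stay, the stall in the original code can be explained by the member function istream::getline setting failbit when a line does not fit in the buffer; until that flag is cleared, every later read fails. The following sketch (the function name readShortLines and its maxLen parameter are invented for illustration) clears the flag and discards the remainder of the overlong line:

```cpp
#include <cstddef>
#include <istream>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Returns all lines of at most maxLen characters, skipping longer ones.
std::vector<std::string> readShortLines(std::istream& is, std::size_t maxLen) {
    std::vector<char> buf(maxLen + 1);  // room for the terminating '\0'
    std::vector<std::string> lines;
    for (;;) {
        if (is.getline(buf.data(), static_cast<std::streamsize>(buf.size()))) {
            lines.emplace_back(buf.data());  // the line fit into the buffer
        } else if (is.eof()) {
            break;                           // nothing left to read
        } else {
            // failbit without eofbit: the line was too long for the buffer.
            is.clear();  // make the stream usable again
            is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
        }
    }
    return lines;
}
```

The std::getline version with std::string remains the simpler choice; this variant is mainly useful for seeing why the char-buffer code got stuck.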
How about checking the return value of getline and breaking if that fails?
...or if is is an istream, you could check for an eof() condition to break out of the loop.
#define MAX_LINE 1000
char buf[MAX_LINE];
CString strTemp;
str.Empty();
while(is.eof() == false)
{
is.getline(buf,MAX_LINE);
strTemp=buf;
if(strTemp.IsEmpty()) break;
str+=strTemp;
if(str.Find("^")>-1)
{
str=str.Left( str.Find("^") );
do
{
is.get(buf,2);
} while((is.gcount()>0) && (is.eof() == false));
is.getline(buf,2);
}
else if(strTemp.GetLength()!=MAX_LINE-1)
{
break;
}
}
return is;
For something completely different:
std::string strTemp;
str.Empty();
while(std::getline(is, strTemp)) {
if(strTemp.empty()) break;
str+=strTemp.c_str(); //don't need .c_str() if str is also a std::string
int pos = str.Find("^"); //extracted this for speed
if(pos>-1){
str=str.Left(pos);
//Did not translate this part since it was buggy
} else {
//not sure of the intent here either
//it would stop reading if the line was less than 1000 characters.
}
}
return is;
This uses strings for ease of use, and no maximum limits on lines. It also uses the std::getline for the dynamic/magic everything, but I did not translate the bit in the middle since it seemed very buggy to me, and I couldn't interpret the intent.
The part in the middle simply reads two characters at a time until it reaches the end of the file, and then everything after that would have done bizarre stuff since you weren't checking return values. Since it was completely wrong, I didn't interpret it.

Cleaning a string of punctuation in C++

Ok so before I even ask my question I want to make one thing clear. I am currently a student at NIU for Computer Science and this does relate to one of my assignments for a class there. So if anyone has a problem read no further and just go on about your business.
Now for anyone who is willing to help, here's the situation. For my current assignment we have to read a file that is just a block of text. For each word in the file we are to clear any punctuation in the word (e.g. "can't" would end up as "can" and "that--to" would end up as "that", obviously without the quotes; the quotes are only there to mark the examples).
The problem I've run into is that I can clean the string fine and then insert it into the map that we are using, but for some reason the code I have written allows an empty string to be inserted into the map. I've tried everything that I can come up with to stop this from happening, and the only thing I've come up with is to use the erase method of the map itself.
So what I am looking for is two things: suggestions on how I could a) fix this without simply erasing the entry and b) improve the code I have already written.
Here are the functions I have written to read in from the file and then the one that cleans it.
Note: the function that reads in from the file calls the clean_entry function to get rid of punctuation before anything is inserted into the map.
Edit: Thank you Chris. Numbers are allowed :). If anyone has any improvements to the code I've written or any criticisms of something I did, I'll listen. At school we really don't get feedback on the correct, proper, or most efficient way to do things.
int get_words(map<string, int>& mapz)
{
int cnt = 0; //set out counter to zero
map<string, int>::const_iterator mapzIter;
ifstream input; //declare instream
input.open( "prog2.d" ); //open instream
assert( input ); //assure it is open
string s; //temp strings to read into
string not_s;
input >> s;
while(!input.eof()) //read in until EOF
{
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
}
input.close(); //close instream
for(mapzIter = mapz.begin(); mapzIter != mapz.end(); mapzIter++)
cnt = cnt + mapzIter->second;
return cnt; //return number of words in instream
}
void clean_entry(const string& non_clean, string& clean)
{
int i, j, begin, end;
for(i = 0; isalnum(non_clean[i]) == 0 && non_clean[i] != '\0'; i++);
begin = i;
if(begin ==(int)non_clean.length())
return;
for(j = begin; isalnum(non_clean[j]) != 0 && non_clean[j] != '\0'; j++);
end = j;
clean = non_clean.substr(begin, (end-begin));
for(i = 0; i < (int)clean.size(); i++)
clean[i] = tolower(clean[i]);
}
The problem with empty entries is in your while loop. If you get an empty string, you clean the next one, and add it without checking. Try changing:
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
to
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() > 0)
{
mapz[not_s]++; //increment occurence
}
input >>s;
EDIT: I notice you are checking if the characters are alphanumeric. If numbers are not allowed, you may need to revisit that area as well.
Further improvements would be to:
- declare variables only when you use them, and in the innermost scope
- use C++-style casts instead of the C-style (int) casts
- use empty() instead of length() == 0 comparisons
- use the prefix increment operator for the iterators (i.e. ++mapzIter)
A blank string is a valid instance of the string class, so there's nothing special about adding it into the map. What you could do is first check whether it's empty, and only increment if it isn't:
if (!not_s.empty())
mapz[not_s]++;
Style-wise, there's a few things I'd change, one would be to return clean from clean_entry instead of modifying it:
string not_s = clean_entry(s);
...
string clean_entry(const string &non_clean)
{
string clean;
... // as before
if(begin ==(int)non_clean.length())
return clean;
... // as before
return clean;
}
This makes it clearer what the function is doing (taking a string, and returning something based on that string).
The function 'get_words' is doing a lot of distinct actions that could be split out into other functions. There's a good chance that by splitting it up into its individual parts, you would have found the bug yourself.
From the basic structure, I think you could split the code into (at least):
getNextWord: Return the next (non blank) word from the stream (returns false if none left)
clean_entry: What you have now
getNextCleanWord: Calls getNextWord, and if 'true' calls clean_entry. Returns 'false' if no words left.
The signatures of 'getNextWord' and 'getNextCleanWord' might look something like:
bool getNextWord (std::ifstream & input, std::string & str);
bool getNextCleanWord (std::ifstream & input, std::string & str);
The idea is that each function does a smaller more distinct part of the problem. For example, 'getNextWord' does nothing but get the next non blank word (if there is one). This smaller piece therefore becomes an easier part of the problem to solve and debug if necessary.
The main component of 'get_words' then can be simplified down to:
std::string nextCleanWord;
while (getNextCleanWord (input, nextCleanWord))
{
++mapz[nextCleanWord];
}
An important aspect to development, IMHO, is to try to Divide and Conquer the problem. Split it up into the individual tasks that need to take place. These sub-tasks will be easier to complete and should also be easier to maintain.
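As a rough sketch of how that split might look in practice: the version below takes a std::istream rather than a std::ifstream so it can be tried on any stream, folds getNextWord into getNextCleanWord for brevity, and returns the cleaned string by value. The names cleanEntry and countWords are assumptions for this illustration, not the poster's code.

```cpp
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Returns the first run of alphanumeric characters in raw, lowercased.
std::string cleanEntry(const std::string& raw) {
    std::size_t begin = 0;
    while (begin < raw.size() &&
           !std::isalnum(static_cast<unsigned char>(raw[begin]))) ++begin;
    std::size_t end = begin;
    while (end < raw.size() &&
           std::isalnum(static_cast<unsigned char>(raw[end]))) ++end;
    std::string clean = raw.substr(begin, end - begin);
    for (char& c : clean)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return clean;
}

// Fetches the next word that is non-empty after cleaning; false when none left.
bool getNextCleanWord(std::istream& input, std::string& word) {
    std::string raw;
    while (input >> raw) {
        word = cleanEntry(raw);
        if (!word.empty()) return true;  // pure-punctuation tokens are skipped
    }
    return false;
}

// Counts every cleaned word and returns the total number of words seen.
int countWords(std::istream& input, std::map<std::string, int>& counts) {
    int total = 0;
    std::string word;
    while (getNextCleanWord(input, word)) {
        ++counts[word];
        ++total;
    }
    return total;
}
```

Since the empty-string check lives in exactly one place, an empty key can no longer reach the map, and looping on the extraction itself avoids the separate eof() test.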