How do I check that stream extraction has consumed all input? - c++

In the following function, I try to see if a string s is convertible to type T by seeing if I can read a value of type T from it, and whether the input is completely consumed afterwards:
template <class T>
bool can_be_converted_to(const std::string& s, T& t)
{
std::istringstream i(s);
i>>std::boolalpha;
i>>t;
if (i and i.eof())
return true;
else
return false;
}
However, can_be_converted_to<bool>("true") evaluates to false, because i.eof() is false at the end of the function.
This is correct, even though the function has read the entire string, because it hasn't attempted to read past the end of the string. (So, apparently this function works for int and double because istringstream reads past the end when reading these.)
So, assuming that I should indeed be checking (i and <input completely consumed>):
Q: How do I check that the input was completely consumed w/o using eof()?

Use peek() or get() to check what's next in the stream:
return (i >> std::boolalpha >> t && i.peek() == EOF);
Your version doesn't work for integers, either. Consider this input: 123 45. It'll read 123 and report true, even though there are still some characters left in the stream.
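Folded back into the question's function, the check might look like this (a sketch; note that trailing whitespace, e.g. "42 ", still yields false here — the std::ws approach in a later answer handles that case):
#include <ios>
#include <sstream>
#include <string>

template <class T>
bool can_be_converted_to(const std::string& s, T& t)
{
    std::istringstream i(s);
    // succeed only if extraction worked and nothing is left in the stream
    return i >> std::boolalpha >> t
        && i.peek() == std::istringstream::traits_type::eof();
}
With this version, can_be_converted_to<bool>("true", b) is true, while input with trailing characters, such as "123 45" for an int, is rejected.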

The eofbit is only set after a read has actually attempted to go past the end of the stream. You can verify that in your code by doing:
char _;
if (i && !(i >> _)) {
    // i is in a valid state, but reading a single extra char fails,
    // so the whole input really was consumed
}

Extending on jrok's answer, you can use i.get() just as easily as
i.peek(), at least in this case. (I don't know if there is any reason
to prefer one to the other.)
Also, following the convention that white space is never anything but a
separator, you might want to extract it before checking for the end.
Something like:
return i >> std::ws && i.get() == std::istream::traits_type::eof();
Some older implementations of std::ws were buggy, and would put the
stream in an error state. In that case, you'd have to invert the test,
and do something like:
return !(i >> std::ws) || i.get() == std::istream::traits_type::eof();
Or just read the std::ws before the condition, and depend uniquely on
the i.get().
(I don't know if buggy std::ws is still a problem. I developed a
version of it that worked back when it was, and I've just continued to
use it.)
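For reference, a sketch of the whole check written that way (the std::ws extraction is done as its own statement, so even a buggy std::ws that sets failbit at end-of-stream does not change the result):
#include <ios>
#include <sstream>
#include <string>

template <class T>
bool can_be_converted_to(const std::string& s, T& t)
{
    std::istringstream i(s);
    if (!(i >> std::boolalpha >> t))
        return false;
    i >> std::ws;   // discard trailing whitespace; the result is deliberately ignored
    return i.get() == std::istream::traits_type::eof();
}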

I would like to offer a completely different approach:
Take your input string, tokenise it yourself, and then convert the individual fields using boost::lexical_cast<T>.
Reason: I wasted an afternoon on parsing a string containing 2 int and 2 double fields, separated by spaces. Doing the following:
int i, j;
double x, y;
std::istringstream ins{str};
ins >> i >> j >> x >> y;
// how to check errors???...
parses the correct input such as
`"5 3 9.9e+01 5.5e+02"`
correctly, but does not detect the problem with this:
`"5 9.6e+01 5.5e+02"`
What happens is that i will be set to 5 (OK), j will be set to 9 (??), x to 6.0 (=0.6e+01), y to 550 (OK). I was quite surprised to see failbit not being set... (platform info: OS X 10.9, Apple Clang++ 6.0, C++11 mode).
Of course you can say now, "But wait, the Standard states that it should be so", and you may be right, but knowing that it is a feature rather than a bug does not reduce the pain if you want to do proper error checking without writing miles of code.
OTOH, if you use "Marius"'s excellent tokeniser function and split str first on whitespace then suddenly everything becomes very easy. Here is a slightly modified version of the tokeniser. I re-wrote it to return a vector of strings; the original is a template that puts the tokens in a container with elements convertible to strings. (For those who need such a generic approach please consult the original link above.)
// \param str: the input string to be tokenized
// \param delimiters: string of delimiter characters
// \param trimEmpty: if true then empty tokens will be trimmed
// \return a vector of strings containing the tokens
std::vector<std::string> tokenizer(
const std::string& str,
const std::string& delimiters = " ",
const bool trimEmpty = false
) {
std::vector<std::string> tokens;
std::string::size_type pos, lastPos = 0;
const char* strdata = str.data();
while(true) {
pos = str.find_first_of(delimiters, lastPos);
if(pos == std::string::npos) {
// no more delimiters
pos = str.length();
if(pos != lastPos || !trimEmpty) {
tokens.emplace_back(strdata + lastPos, pos - lastPos);
}
break;
} else {
if(pos != lastPos || !trimEmpty) {
tokens.emplace_back(strdata + lastPos, pos - lastPos);
}
}
lastPos = pos + 1;
}
return tokens;
}
and then just use it like this (ParseError is some exception object):
std::vector<std::string> tokens = tokenizer(str, " \t", true);
if (tokens.size() < 4)
throw ParseError{"Too few fields in " + str};
try {
unsigned int i{ boost::lexical_cast<unsigned int>(tokens[0]) },
j{ boost::lexical_cast<unsigned int>(tokens[1]) };
double x{ boost::lexical_cast<double>(tokens[2]) },
y{ boost::lexical_cast<double>(tokens[3]) };
// print or process i, j, x, y ...
} catch(const boost::bad_lexical_cast& error) {
throw ParseError{"Could not parse " + str};
}
Note: you can use Boost's split or tokenizer if you wish, but they were slower than Marius' tokeniser (at least in my environment).
Update: Instead of boost::lexical_cast<T> you can use the C++11 "std::sto*" functions (e.g. stoi to convert a string token to an int). These throw two kinds of exceptions: std::invalid_argument if the conversion could not be performed and std::out_of_range if the converted value cannot be represented.
You could either catch these separately or catch their common parent, std::logic_error. Modifications to the example code above are left as an exercise to the reader :-)
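For instance, a sketch of the try block above rewritten with std::stoi/std::stod (the function process() and its error messages are mine; the pos out-parameter is how you can check that a whole token was consumed, which boost::lexical_cast does for you):
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Stands in for "some exception object" from the example above.
struct ParseError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

void process(const std::vector<std::string>& tokens, const std::string& str)
{
    if (tokens.size() < 4)
        throw ParseError{"Too few fields in " + str};
    try {
        std::size_t pos = 0;
        int i = std::stoi(tokens[0], &pos);     // std::stoul if unsigned is really needed
        if (pos != tokens[0].size())
            throw ParseError{"Trailing junk in " + tokens[0]};
        int j = std::stoi(tokens[1]);
        double x = std::stod(tokens[2]);
        double y = std::stod(tokens[3]);
        std::cout << i << ' ' << j << ' ' << x << ' ' << y << '\n';
    } catch (const std::logic_error&) {         // invalid_argument and out_of_range
        throw ParseError{"Could not parse " + str};
    }
}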

Related

How to read a complex input with istream&, string& and getline in c++?

I am very new to C++, so I apologize if this isn't a good question but I really need help in understanding how to use istream.
There is a project I have to create that takes several pieces of input, which can be on one line or spread over multiple lines, and then passes them into a vector (this is only part of the project, and I would like to try the rest on my own). For example, if I were to input this...
>> aaa bb
>> ccccc
>> ddd fff eeeee
Makes a vector of strings with "aaa", "bb", "ccccc", "ddd", "fff", "eeeee"
The input can be a char or string and the program stops asking for input when the return key is hit.
I know getline() gets a line of input and I could probably use a while loop to try and get the input such as...(correct me if I'm wrong)
while(!string.empty())
getline(cin, string);
However, I don't truly understand istream and it doesn't help that my class has not gone over pointers so I don't know how to use istream& or string& and pass it into a vector. On the project description, it said to NOT use stringstream but use functionality from getline(istream&, string&). Can anyone give somewhat of a detailed explanation as to how to make a function using getline(istream&, string&) and then how to use it in the main function?
Any little bit helps!
You're on the right track already; the only catch is that you'd have to pre-fill the string with some dummy value to enter the while loop at all. More elegant:
std::string line;
do
{
std::getline(std::cin, line);
}
while(!line.empty());
This should already do the trick of reading line by line (but possibly with multiple words on one line!) and exiting if the user enters an empty line (be aware that whitespace followed by a newline won't be recognised as an empty line!).
However, if anything on the stream goes wrong, you'll be trapped in an endless loop processing previous input again and again. So best check the stream state as well:
if(!std::getline(std::cin, line))
{
// this is some sample error handling - do whatever you consider appropriate...
std::cerr << "error reading from console" << std::endl;
return -1;
}
As there might be multiple words on a single line, you'd yet have to split them. There are several ways to do so; quite an easy one is using a std::istringstream – you'll discover that it resembles what you are likely used to with std::cin:
std::istringstream s(line);
std::string word;
while(s >> word)
{
// append to vector...
}
Be aware that operator>> skips leading whitespace and stops at the first trailing whitespace character (or at the end of the stream, if reached), so you don't have to deal with whitespace explicitly.
OK, you're not allowed to use std::stringstream (well, I used std::istringstream, but I suppose this little difference doesn't count, does it?). That changes matters a little and it gets more complex; on the other hand, we can decide ourselves what counts as a word and what as a separator... We might treat punctuation marks as separators just like whitespace, but allow digits to be part of words, so we'd accept e.g. ab.7c d as "ab", "7c", "d":
auto begin = line.begin();
auto end = begin;
while(end != line.end()) // iterate over each character
{
if(std::isalnum(static_cast<unsigned char>(*end)))
{
// we are inside a word; don't touch begin to remember where
// the word started
++end;
}
else
{
// non-alpha-numeric character!
if(end != begin)
{
// we discovered a word already
// (i. e. we did not move begin together with end)
words.emplace_back(begin, end);
// ('words' being your std::vector<std::string> to place the input into)
}
++end;
begin = end; // skip whatever we had already
}
}
// corner case: a line might end with a word NOT followed by whitespace
// this isn't covered within the loop, so we need to add another check:
if(end != begin)
{
words.emplace_back(begin, end);
}
It shouldn't be too difficult to adjust this to different interpretations of what is a separator and what counts as a word (e.g. std::isalpha(...) || *end == '_' to treat an underscore as part of a word, but not digits). There are quite a few helper functions you might find useful...
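If it helps to see it end to end, here is one possible way of wrapping the above into a function that uses getline(istream&, string&) and calling it from main (the names read_words and words are mine, not prescribed by the assignment):
#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Reads lines from 'in' until an empty line (or a stream error) and appends
// each alphanumeric word to 'words'.
void read_words(std::istream& in, std::vector<std::string>& words)
{
    std::string line;
    while (std::getline(in, line) && !line.empty())
    {
        auto begin = line.begin();
        auto end = begin;
        while (end != line.end())
        {
            if (std::isalnum(static_cast<unsigned char>(*end)))
                ++end;                        // still inside a word
            else
            {
                if (end != begin)             // a word just ended
                    words.emplace_back(begin, end);
                ++end;
                begin = end;                  // skip the separator
            }
        }
        if (end != begin)                     // word at the very end of the line
            words.emplace_back(begin, end);
    }
}

int main()
{
    std::vector<std::string> words;
    read_words(std::cin, words);
    for (const auto& w : words)
        std::cout << w << '\n';
}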
You could input the value of the first column, then call functions based on the value:
void Process_Value_1(std::istream& input, std::string& value);
void Process_Value_2(std::istream& input, std::string& value);
int main()
{
    // ...
    std::string first_value;
    while (input_file >> first_value)
    {
        if (first_value == "aaa")
        {
            Process_Value_1(input_file, first_value);
        }
        else if (first_value == "ccc")
        {
            Process_Value_2(input_file, first_value);
        }
        //...
    }
    return 0;
}
A sample function could be:
void Process_Value_1(std::istream& input, std::string& value)
{
    std::string b;
    input >> b;
    std::cout << value << "\t" << b << std::endl;
    input.ignore(1000, '\n'); // Ignore until newline.
}
There are other methods to perform the process, such as using tables of function pointers and std::map.
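As a rough sketch of that last idea, dispatching through a std::map of function pointers (the handler bodies are only illustrative, and std::cin stands in for input_file):
#include <iostream>
#include <map>
#include <string>

void Process_Value_1(std::istream& input, std::string& value)
{
    std::string b;
    input >> b;
    std::cout << value << "\t" << b << std::endl;
    input.ignore(1000, '\n');
}

void Process_Value_2(std::istream& input, std::string& value)
{
    std::cout << value << "\n";
    input.ignore(1000, '\n');
}

int main()
{
    using Handler = void (*)(std::istream&, std::string&);
    const std::map<std::string, Handler> dispatch = {
        { "aaa", &Process_Value_1 },
        { "ccc", &Process_Value_2 },
    };

    std::string first_value;
    while (std::cin >> first_value)
    {
        const auto it = dispatch.find(first_value);
        if (it != dispatch.end())
            it->second(std::cin, first_value);  // call the matching handler
        else
            std::cin.ignore(1000, '\n');        // unknown key: skip the rest of the line
    }
    return 0;
}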

Reading from FileStream with arbitrary delimiter

I have encountered a problem reading messages from a file using C++. Usually what people do is create a file stream and then use the getline() function to fetch a message. getline() can take an additional parameter as the delimiter, so that it returns each "line" separated by that delimiter rather than the default '\n'. However, this delimiter has to be a char. In my use case, the delimiter in the message may be something else, like "|--|", so I am looking for a solution that accepts a string as the delimiter instead of a char.
I have searched StackOverFlow a little bit and found some interesting posts.
Parse (split) a string in C++ using string delimiter (standard C++)
This one gives a solution using string::find() and string::substr() to parse with an arbitrary delimiter. However, all the solutions there assume the input is a string rather than a stream. In my case, the file stream data is too big/wasteful to fit into memory at once, so it should be read in message by message (or a batch of messages at a time).
Actually, reading through the gcc implementation of std::getline(), it seems much easier to handle the case where the delimiter is a single char: every time you load in a chunk of characters, you can simply search for the delimiter and split there. It is different if your delimiter is more than one char; the delimiter itself may straddle two chunks and cause many other corner cases.
I'm not sure whether anyone else has faced this kind of requirement before and how you handled it elegantly. It seems it would be nice to have a standard function like istream& getNext(istream&& is, string& str, string delim). This seems like a general use case to me; why isn't something like it in the standard library, so that people no longer need to implement their own versions separately?
Thank you very much
The STL simply does not natively support what you are asking for. You will have to write your own function (or find a 3rd party function) that does what you need.
For instance, you can use std::getline() to read up to the first character of your delimiter, and then use std::istream::get() to read subsequent characters and compare them to the rest of your delimiter. For example:
std::istream& my_getline(std::istream &input, std::string &str, const std::string &delim)
{
    if (delim.empty())
        throw std::invalid_argument("delim cannot be empty!");
    if (delim.size() == 1)
        return std::getline(input, str, delim[0]);
    str.clear();
    std::string temp;
    char ch;
    bool found = false;
    do
    {
        if (!std::getline(input, temp, delim[0]))
            break;
        str += temp;
        if (input.eof())
        {
            // getline() stopped at end-of-stream, not at delim[0], so there
            // is no partially matched delimiter that needs to be restored
            return input;
        }
        found = true;
        for (std::string::size_type i = 1; i < delim.size(); ++i)
        {
            if (!input.get(ch))
            {
                if (input.eof())
                    input.clear(std::ios_base::eofbit);
                str.append(delim.c_str(), i);
                return input;
            }
            if (delim[i] != ch)
            {
                // mismatch: give the partially matched delimiter back to the result
                str.append(delim.c_str(), i);
                str += ch;
                found = false;
                break;
            }
        }
    }
    while (!found);
    return input;
}
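Assuming my_getline() above is available in the same file (it additionally needs <istream>, <stdexcept> and <string>), usage might look like this; std::istringstream stands in for the file stream, and the sample messages are made up. Note that a single delimiter character may legitimately appear inside a message:
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream file("first msg|--|second, with | inside|--|last msg");
    std::string msg;
    while (my_getline(file, msg, "|--|"))
        std::cout << '[' << msg << "]\n";   // prints the three messages in turn
}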
if you are ok with reading byte by byte, you could build a state transition table implementation of a finite state machine to recognize your stop condition
std::string delimiter = "someString";
// initialize the table with a row per delimiter character, a column per possible byte value, and all zeros
std::vector<std::vector<int> > table(delimiter.size(), std::vector<int>(256, 0));
int endState = delimiter.size();
// in the state that is looking for character i, seeing exactly that character moves the machine to the next state
for(unsigned int i = 0; i < delimiter.size(); i++){
    table[i][static_cast<unsigned char>(delimiter[i])] = i + 1;
}
Now you can use it like this:
int currentState = 0;
int read = 0;
bool done = false;
// 'in' is whatever std::istream you are reading from
while(!done && (read = in.get()) >= 0){
    currentState = table[currentState][read];
    if(currentState == endState){
        done = true;
    }
    // do your streamy stuff
}
Granted, this only works for delimiters made of single-byte characters, and because the table simply falls back to the start state on a mismatch it can miss a match that overlaps a partial match (a complete solution would add KMP-style failure transitions), but it will do fine for simple cases like your example.
It seems it is easiest to create something like getline(): read to the last character of the separator. Then check if the string is long enough for the separator and, if so, whether it ends with the separator. If it does not, carry on reading:
std::string getline(std::istream& in, std::string const& separator) {
    std::istreambuf_iterator<char> it(in), end;
    if (separator.empty()) { // empty separator -> return the entire stream
        return std::string(it, end);
    }
    std::string rc;
    char last(separator.back());
    for (; it != end; ++it) {
        rc.push_back(*it);
        if (rc.back() == last
            && separator.size() <= rc.size()
            && rc.substr(rc.size() - separator.size()) == separator) {
            ++it;                                     // consume the final separator character
            rc.resize(rc.size() - separator.size());  // drop the separator from the result
            return rc;
        }
    }
    return rc; // no separator was found
}
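A quick way to exercise it, assuming the getline() above sits in the same file (which then needs <istream>, <iterator> and <string>); the sample data and separator are made up:
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream in("alpha|--|beta|--|gamma");
    while (in.peek() != std::char_traits<char>::eof())
        std::cout << '[' << getline(in, "|--|") << "]\n";   // [alpha] [beta] [gamma]
}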

Can I use 2 or more delimiters in C++ function getline? [duplicate]

(This question was closed as a duplicate of: How can I read and parse CSV files in C++?)
I would like to know how I can use 2 or more delimiters with the getline function. That's my problem:
The program reads a text file... each line is going to be like:
New York, Paris, 100
CityA, CityB, 200
I am using getline(file, line), but I get the whole line, when I want to get CityA, then CityB and then the number; and if I use ',' as the delimiter, I won't know where one line ends and the next begins, so I'm trying to figure out a solution.
So, how could I use both the comma and \n as delimiters?
By the way, I'm working with std::string, not char arrays, so strtok is not an option :/
some scratch:
string line;
ifstream file("text.txt");
if(file.is_open())
while(!file.eof()){
getline(file, line);
// here I need to get each string before comma and \n
}
You can read a line using std::getline, then pass the line to a std::stringstream and read the comma separated values off it
string line;
ifstream file("text.txt");
if(file.is_open()){
    while(getline(file, line)){ // get a whole line
        std::stringstream ss(line);
        while(getline(ss, line, ',')){
            // You now have separate entities here
        }
    }
}
No, std::getline() only accepts a single character, to override the default delimiter. std::getline() does not have an option for multiple alternate delimiters.
The correct way to parse this kind of input is to use the default std::getline() to read the entire line into a std::string, then construct a std::istringstream, and then parse it further, into comma-separated values.
However, if you are truly parsing comma-separated values, you should be using a proper CSV parser.
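To make that concrete for the question's format, one possible sketch (std::istringstream stands in for the question's ifstream, and trimming the leading space on the second field is left out):
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream file("New York, Paris, 100\nCityA, CityB, 200\n");
    std::string line;
    while (std::getline(file, line))             // '\n' is handled here
    {
        std::istringstream ss(line);
        std::string cityA, cityB, amount;
        if (std::getline(ss, cityA, ',') &&      // ',' is handled here
            std::getline(ss, cityB, ',') &&
            std::getline(ss, amount))
        {
            int value = std::stoi(amount);       // stoi skips the leading space
            std::cout << cityA << " -> " << cityB << ": " << value << '\n';
        }
    }
}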
Often, it is more intuitive and efficient to parse character input in a hierarchical, tree-like manner, where you start by splitting the string into its major blocks, then go on to process each of the blocks, splitting them up into smaller parts, and so on.
An alternative to this is to tokenize like strtok does -- from the beginning of input, handling one token at a time until the end of input is encountered. This may be preferred when parsing simple inputs, because it is straightforward to implement. This style can also be used when parsing inputs with nested structure, but this requires maintaining some kind of context information, which might grow too complex to maintain inside a single function or limited region of code.
Someone relying on the C++ std library usually ends up using a std::stringstream, along with std::getline to tokenize string input. But, this only gives you one delimiter. They would never consider using strtok, because it is a non-reentrant piece of junk from the C runtime library. So, they end up using streams, and with only one delimiter, one is obligated to use a hierarchical parsing style.
But zneak brought up std::string::find_first_of, which takes a set of characters and returns the position nearest to the beginning of the string containing a character from the set. And there are other member functions: find_last_of, find_first_not_of, and more, which seem to exist for the sole purpose of parsing strings. But std::string stops short of providing useful tokenizing functions.
Another option is the <regex> library, which can do anything you want, but it is new and you will need to get used to its syntax.
But, with very little effort, you can leverage existing functions in std::string to perform tokenizing tasks, and without resorting to streams. Here is a simple example. get_to() is the tokenizing function and tokenize demonstrates how it is used.
The code in this example will be slower than strtok, because it constantly erases characters from the beginning of the string being parsed, and also copies and returns substrings. This makes the code easy to understand, but it does not mean more efficient tokenizing is impossible. It wouldn't even be that much more complicated than this -- you would just keep track of your current position, use this as the start argument in std::string member functions, and never alter the source string. And even better techniques exist, no doubt.
To understand the example's code, start at the bottom, where main() is and where you can see how the functions are used. The top of this code is dominated by basic utility functions and dumb comments.
#include <iostream>
#include <string>
#include <utility>
namespace string_parsing {
// in-place trim whitespace off ends of a std::string
inline void trim(std::string &str) {
auto space_is_it = [] (char c) {
// A few asks:
// * Suppress criticism WRT localization concerns
// * Avoid jumping to conclusions! And seeing monsters everywhere!
// Things like...ah! Believing "thoughts" that assumptions were made
// regarding character encoding.
// * If an obvious, portable alternative exists within the C++ Standard Library,
// you will see it in 2.0, so no new defect tickets, please.
// * Go ahead and ignore the rumor that using lambdas just to get
// local function definitions is "cheap" or "dumb" or "ignorant."
// That's the latest round of FUD from...*mumble*.
return c > '\0' && c <= ' ';
};
for(auto rit = str.rbegin(); rit != str.rend(); ++rit) {
if(!space_is_it(*rit)) {
if(rit != str.rbegin()) {
str.erase(&*rit - &*str.begin() + 1);
}
for(auto fit=str.begin(); fit != str.end(); ++fit) {
if(!space_is_it(*fit)) {
if(fit != str.begin()) {
str.erase(str.begin(), fit);
}
return;
} } } }
str.clear();
}
// get_to(string, <delimiter set> [, delimiter])
// The input+output argument "string" is searched for the first occurance of one
// from a set of delimiters. All characters to the left of, and the delimiter itself
// are deleted in-place, and the substring which was to the left of the delimiter is
// returned, with whitespace trimmed.
// <delimiter set> is forwarded to std::string::find_first_of, so its type may match
// whatever this function's overloads accept, but this is usually expressed
// as a string literal: ", \n" matches commas, spaces and linefeeds.
// The optional output argument "found_delimiter" receives the delimiter character just found.
template <typename D>
inline std::string get_to(std::string& str, D&& delimiters, char& found_delimiter) {
const auto pos = str.find_first_of(std::forward<D>(delimiters));
if(pos == std::string::npos) {
// When none of the delimiters are present,
// clear the string and return its last value.
// This effectively makes the end of a string an
// implied delimiter.
// This behavior is convenient for parsers which
// consume chunks of a string, looping until
// the string is empty.
// Without this feature, it would be possible to
// continue looping forever, when an iteration
// leaves the string unchanged, usually caused by
// a syntax error in the source string.
// So the implied end-of-string delimiter takes
// away the caller's burden of anticipating and
// handling the range of possible errors.
found_delimiter = '\0';
std::string result;
std::swap(result, str);
trim(result);
return result;
}
found_delimiter = str[pos];
auto left = str.substr(0, pos);
trim(left);
str.erase(0, pos + 1);
return left;
}
template <typename D>
inline std::string get_to(std::string& str, D&& delimiters) {
char discarded_delimiter;
return get_to(str, std::forward<D>(delimiters), discarded_delimiter);
}
inline std::string pad_right(const std::string& str,
std::string::size_type min_length,
char pad_char=' ')
{
if(str.length() >= min_length ) return str;
return str + std::string(min_length - str.length(), pad_char);
}
inline void tokenize(std::string source) {
std::cout << source << "\n\n";
bool quote_opened = false;
while(!source.empty()) {
// If we just encountered an open-quote, only include the quote character
// in the delimiter set, so that a quoted token may contain any of the
// other delimiters.
const char* delimiter_set = quote_opened ? "'" : ",'{}";
char delimiter;
auto token = get_to(source, delimiter_set, delimiter);
quote_opened = delimiter == '\'' && !quote_opened;
std::cout << " " << pad_right('[' + token + ']', 16)
<< " " << delimiter << '\n';
}
std::cout << '\n';
}
}
int main() {
string_parsing::tokenize("{1.5, null, 88, 'hi, {there}!'}");
}
This outputs:
{1.5, null, 88, 'hi, {there}!'}
[] {
[1.5] ,
[null] ,
[88] ,
[] '
[hi, {there}!] '
[] }
I don't think that's how you should attack the problem (even if you could do it); instead:
Use what you have to read in each line
Then split up that line by the commas to get the pieces that you want.
If strtok will do the job for #2, you can always convert your string into a char array.

Is there a way to check input data type using only basic concepts?

I'm being challenged to find ways to perform tasks that usually require the use of headers (besides iostream and iomanip) or greater-than-basic C++ knowledge. How can I check the data type of user input using only logical operators, basic arithmetic (+, -, *, /, %), if statements, and while loops?
Obviously the input variable has a declared data type in the first place, but this problem is covering the possibility of the user inputting the wrong data type.
I've tried several methods including the if (!(cin >> var1)) trick, but nothing works correctly. Is this possible at all?
Example
int main() {
int var1, var2;
cin >> var1;
cin >> var2;
cout << var1 << " - " << var2 << " = " << (var1-var2);
return 0;
}
It's possible to input asdf and 5.25 here, so how do I check that the inputs aren't integers as expected, using only the means I stated earlier?
I understand this problem is vague in many ways, mostly because the restrictions are extremely specific and listing everything I'm allowed to use would be a pain. I guess part of the problem as mentioned in the comments is figuring out how to distinguish between data types in the first place.
You can do that using simple operations, although it might be a little difficult; for example, the following function can be used to check whether the input is a decimal integer. You can extend the idea and check for a period in between for floating point numbers (a sketch of that extension follows the code below).
Add a comment if you need further help.
bool isNumber(const char *inp){
    int i = 0;
    if (inp[0] == '+' || inp[0] == '-') i = 1;
    if (!inp[i]) return false;             // reject "", "+" and "-" on their own
    for (; inp[i]; i++){
        if (!(inp[i] >= '0' && inp[i] <= '9'))
            return false;
    }
    return true;
}
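A sketch of the suggested extension for floating point input: it allows an optional sign and at most one '.', and requires at least one digit somewhere:
bool isDecimalNumber(const char *inp){
    int i = 0;
    if (inp[0] == '+' || inp[0] == '-') i = 1;
    bool sawDigit = false, sawDot = false;
    for (; inp[i]; i++){
        if (inp[i] >= '0' && inp[i] <= '9')
            sawDigit = true;
        else if (inp[i] == '.' && !sawDot)
            sawDot = true;                 // accept a single decimal point
        else
            return false;
    }
    return sawDigit;                       // rejects "", "+", "-", "." and "+."
}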
General checking after reading is done like this:
stream >> variable;
if (stream.fail()) {
    // not successful
}
This can be done on any input stream. It works for standard types (any numeric type, char, string, etc.), stopping at whitespace. If your variable could not be read, fail() returns true; checking good() instead would also flag a successful read that happens to hit end-of-file, which is usually not what you want. You can customize it for your own classes (including control over the failure state):
istream & operator>>(istream & stream, YourClass & c)
{
// Read the data from stream into c
return stream;
}
For your specific problem: Suppose you read the characters 42. There is no way of distinguishing between reading it as
- an int
- a double
as both would be perfectly fine. You have to specify the input format more precisely.
The standard library is not magic - you just have to parse the data read from the user, similarly to what the standard library does.
First read the input from the user:
std::string s;
cin >> s;
(you may use getline instead if you want to read a whole line)
Then you can go on parsing it; we'll try to distinguish between integer ( *[+-]?[0-9]+ *), real number ( *[+-]?[0-9]+(\.[0-9]*)?([Ee][+-]?[0-9]+)? *), string ( *"[^"]*" *) and anything else ("bad").
enum TokenType {
Integer,
Real,
String,
Bad
};
The basic building block is a routine that "eats" consecutive digits; this will help us with the [0-9]* and [0-9]+ parts.
void eatdigits(const char *&rp) {
while(*rp>='0' && *rp<='9') rp++;
}
Also, a routine that skips whitespace can be handy:
void skipws(const char *&rp) {
while(*rp==' ') rp++;
// feel free to skip also tabs and whatever
}
Then we can attack the real problem
TokenType categorize(const char *rp) {
first, we want to skip the whitespace
skipws(rp);
then, we'll try to match the easiest stuff: the string
if(*rp=='"') {
// Skip the string content
while(*rp && *rp!='"') rp++;
// If the string stopped with anything different than " we
// have a parse error
if(!*rp) return Bad;
// Otherwise, skip the trailing whitespace
skipws(rp);
// And check if we got at the end
return *rp?Bad:String;
}
Then, on to numbers, notice that the real and integer definitions start in the same way; we have a common branch:
// If there's a + or -, it's fine, skip it
if(*rp=='+' || *rp=='-') rp++;
const char *before=rp;
// Skip the digits
eatdigits(rp);
// If we didn't manage to find any digit, it's not a valid number
if(rp==before) return Bad;
// If it ends here or after whitespace, it's an integer
if(!*rp) return Integer;
before = rp;
skipws(rp);
if(before!=rp) return *rp?Bad:Integer;
If we notice that there's still stuff, we tackle the real number:
// Maybe something after the decimal dot?
if(*rp=='.') {
rp++;
eatdigits(rp);
}
// Exponent
if(*rp=='E' || *rp=='e') {
rp++;
if(*rp=='+' || *rp=='-') rp++;
before=rp;
eatdigits(rp);
if(before==rp) return Bad;
}
skipws(rp);
return *rp?Bad:Real;
}
You can easily invoke this routine after reading the input.
(notice that here the string thing is just for fun, cin does not have any special processing for double-quotes delimited strings).
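For example, invoking it after reading input could look like this (a sketch that assumes the fragments above have been assembled into a single categorize() function, with the TokenType enum, in the same file; a whole line is read so that quoted strings containing spaces can be classified):
#include <iostream>
#include <string>

int main()
{
    std::string s;
    while (std::getline(std::cin, s)) {
        switch (categorize(s.c_str())) {
            case Integer: std::cout << "integer\n"; break;
            case Real:    std::cout << "real\n";    break;
            case String:  std::cout << "string\n";  break;
            case Bad:     std::cout << "bad\n";     break;
        }
    }
}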

Cleaning a string of punctuation in C++

OK, so before I even ask my question I want to make one thing clear. I am currently a Computer Science student at NIU and this does relate to one of my assignments for a class there. So if anyone has a problem with that, read no further and just go on about your business.
Now, for anyone who is willing to help, here's the situation. For my current assignment we have to read a file that is just a block of text. For each word in the file we are to clear any punctuation from the word (e.g. "can't" would end up as "can" and "that--to" would end up as "that"; obviously without the quotes, which were just used to mark the examples).
The problem I've run into is that I can clean the string fine and then insert it into the map we are using, but for some reason the code I have written allows an empty string to be inserted into the map. I've tried everything I can come up with to stop this from happening, and the only thing that works is to use the erase method of the map itself.
So what I am looking for is two things: a) any suggestions about how I could fix this without simply erasing the entry, and b) any improvements that I could make to the code I have already written.
Here are the functions I have written to read in from the file and then the one that cleans it.
Note: the function that reads in from the file calls the clean_entry function to get rid of punctuation before anything is inserted into the map.
Edit: Thank you Chris. Numbers are allowed :). If anyone has any improvements to the code I've written or any criticisms of something I did, I'll listen. At school we really don't get feedback on the correct, proper, or most efficient way to do things.
int get_words(map<string, int>& mapz)
{
int cnt = 0; //set out counter to zero
map<string, int>::const_iterator mapzIter;
ifstream input; //declare instream
input.open( "prog2.d" ); //open instream
assert( input ); //assure it is open
string s; //temp strings to read into
string not_s;
input >> s;
while(!input.eof()) //read in until EOF
{
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
}
input.close(); //close instream
for(mapzIter = mapz.begin(); mapzIter != mapz.end(); mapzIter++)
cnt = cnt + mapzIter->second;
return cnt; //return number of words in instream
}
void clean_entry(const string& non_clean, string& clean)
{
int i, j, begin, end;
for(i = 0; isalnum(non_clean[i]) == 0 && non_clean[i] != '\0'; i++);
begin = i;
if(begin ==(int)non_clean.length())
return;
for(j = begin; isalnum(non_clean[j]) != 0 && non_clean[j] != '\0'; j++);
end = j;
clean = non_clean.substr(begin, (end-begin));
for(i = 0; i < (int)clean.size(); i++)
clean[i] = tolower(clean[i]);
}
The problem with empty entries is in your while loop. If you get an empty string, you clean the next one, and add it without checking. Try changing:
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() == 0)
{
input >> s;
clean_entry(s, not_s);
}
mapz[not_s]++; //increment occurence
input >>s;
to
not_s = "";
clean_entry(s, not_s);
if((int)not_s.length() > 0)
{
mapz[not_s]++; //increment occurence
}
input >>s;
EDIT: I notice you are checking if the characters are alphanumeric. If numbers are not allowed, you may need to revisit that area as well.
Further improvements would be to
declare variables only when you use them, and in the innermost scope
use c++-style casts instead of the c-style (int) casts
use empty() instead of length() == 0 comparisons
use the prefix increment operator for the iterators (i.e. ++mapzIter)
A blank string is a valid instance of the string class, so there's nothing special about adding it into the map. What you could do is first check if it's empty, and only increment in that case:
if (!not_s.empty())
mapz[not_s]++;
Style-wise, there's a few things I'd change, one would be to return clean from clean_entry instead of modifying it:
string not_s = clean_entry(s);
...
string clean_entry(const string &non_clean)
{
string clean;
... // as before
if(begin ==(int)non_clean.length())
return clean;
... // as before
return clean;
}
This makes it clearer what the function is doing (taking a string, and returning something based on that string).
The function 'getWords' is doing a lot of distinct actions that could be split out into other functions. There's a good chance that by splitting it up into its individual parts, you would have found the bug yourself.
From the basic structure, I think you could split the code into (at least):
getNextWord: Return the next (non blank) word from the stream (returns false if none left)
clean_entry: What you have now
getNextCleanWord: Calls getNextWord, and if that returns 'true', calls clean_entry. Returns 'false' if no words are left.
The signatures of 'getNextWord' and 'getNextCleanWord' might look something like:
bool getNextWord (std::ifstream & input, std::string & str);
bool getNextCleanWord (std::ifstream & input, std::string & str);
The idea is that each function does a smaller more distinct part of the problem. For example, 'getNextWord' does nothing but get the next non blank word (if there is one). This smaller piece therefore becomes an easier part of the problem to solve and debug if necessary.
The main component of 'getWords' then can be simplified down to:
std::string nextCleanWord;
while (getNextCleanWord (input, nextCleanWord))
{
++map[nextCleanWord];
}
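A minimal sketch of what those two helpers might look like, assuming the clean_entry from the question and the signatures given above:
#include <fstream>
#include <string>

void clean_entry(const std::string& non_clean, std::string& clean);  // as in the question

// Fetch the next whitespace-separated word; returns false when input runs out.
bool getNextWord(std::ifstream& input, std::string& str)
{
    if (input >> str)
        return true;
    return false;
}

// Keep reading until a word survives cleaning, or the input runs out.
bool getNextCleanWord(std::ifstream& input, std::string& str)
{
    std::string raw;
    while (getNextWord(input, raw))
    {
        str.clear();
        clean_entry(raw, str);
        if (!str.empty())
            return true;
    }
    return false;
}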
An important aspect to development, IMHO, is to try to Divide and Conquer the problem. Split it up into the individual tasks that need to take place. These sub-tasks will be easier to complete and should also be easier to maintain.