Example of overloading C++ extraction operator >> to parse data


I am looking for a good example of how to overload the stream input operator (operator>>) to parse some data with simple text formatting. I have read this tutorial but I would like to do something a bit more advanced. In my case I have fixed strings that I would like to check for (and ignore). Supposing the 2D point format from the link were more like
Point{0.3 =>
0.4 }
where the intended effect is to parse out the numbers 0.3 and 0.4. (Yes, this is an awfully silly syntax, but it incorporates several ideas I need). Mostly I just want to see how to properly check for the presence of fixed strings, ignore whitespace, etc.
Update:
Oops, the comment I made below has no formatting (this is my first time using this site).
I found that whitespace can be skipped with something like
std::cin >> std::ws;
And for eating up strings I have
static bool match_string(std::istream &is, const char *str){
    size_t nstr = strlen(str);
    while(nstr){
        if(is.peek() == *str){
            is.ignore(1);
            ++str;
            --nstr;
        }else{
            is.setstate(is.rdstate() | std::ios_base::failbit);
            return false;
        }
    }
    return true;
}
Now it would be nice to be able to get the position (line number) of a parsing error.
Update 2:
Got line numbers and comment parsing working, using just 1 character look-ahead. The final result can be seen here in AArray.cpp, in the function parse(). The project is a (de)serializable C++ PHP-like array class.

Your operator>>(istream &, object &) should get data from the input stream, using its formatted and/or unformatted extraction functions, and put it into your object.
If you want to be more safe (after a fashion), construct and test an istream::sentry object before you start. If you encounter a syntax error, you may call setstate( ios_base::failbit ) to prevent any other processing until you call my_stream.clear().
See <istream> (and istream.tcc if you're using SGI STL) for examples.

Related

Efficiently reading two comma-separated floats in brackets from a string without being affected by the global locale

I am a developer of a library and our old code uses sscanf() and sprintf() to read/write a variety of internal types from/to strings. We have had issues with users who used our library and had a locale that was different from the one we based our XML files on ("C" locale). In our case this resulted in incorrect values parsed from those XML files and those submitted as strings in run-time. The locale may be changed by a user directly but can also be changed without the knowledge of the user. This can happen if the locale-changes occurs inside another library, such as GTK, which was the "perpetrator" in one bug report. Therefore, we obviously want to remove any dependency from the locale to permanently free ourselves from these issues.
I have already read other questions and answers in the context of float/double/int/... especially if they are separated by a character or located inside brackets, but so far the proposed solutions I found were not satisfying to us. Our requirements are:
No dependencies on libraries other than the standard library. Using anything from boost is therefore, for example, not an option.
Must be thread-safe. This is meant in specific regarding the locale, which can be changed globally. This is really awful for us, as therefore a thread of our library can be affected by another thread in the user's program, which may also be running code of a completely different library. Anything affected by setlocale() directly is therefore not an option. Also, setting the locale before starting to read/write and setting it back to the original value thereafter is not a solution due to race conditions in threads.
While efficiency is not the topmost priority (#1 & #2 are), it is still definitely of our concern, as strings may be read and written in run-time quite frequently, depending on the user's program. The faster, the better.
Edit: As an additional note: boost::lexical_cast is not guaranteed to be unaffected by the locale (source: Locale invariant guarantee of boost::lexical_cast<>). So that would not be a solution even without requirement #1.
I gathered the following information so far:
First of all, what I saw being suggested a lot is using boost's lexical_cast but unfortunately this is not an option for us as at all, as we can't require all users to also link to boost (and because of the lacking locale-safety, see above). I looked at the code to see if we can extract anything from it but I found it difficult to understand and too large in length, and most likely the big performance-gainers are using locale-dependent functions anyways.
Many functions introduced in C++11, such as std::to_string, std::stod, std::stof, etc. depend on the global locale just the way sscanf and sprintf do, which is extremely unfortunate and to me not understandable, considering that std::thread has been added.
std::stringstream seems to be a solution in general, since it is thread-safe in the context of the locale, but also in general if guarded right. However, if it is constructed freshly every time it can be slow (good comparison: http://www.boost.org/doc/libs/1_55_0/doc/html/boost_lexical_cast/performance.html). I assume this can be solved by having one such stream per thread configured and available, clearing it each time after usage. However, a problem is that it doesn't solve formats as easily as sscanf() does, for example: " { %g , %g } ".
sscanf() patterns that we, for example, need to be able to read are:
" { %g , %g }"
" { { %g , %g } , { %g , %g } }"
" { top: { %g , %g } , left: { %g , %g } , bottom: { %g , %g } , right: { %g , %g }"
Writing these with stringstreams seems no big deal, but reading them seems problematic, especially considering the whitespaces.
Should we use std::regex in this context or is this overkill? Are stringstreams a good solution for this task or is there any better way to do this given the mentioned requirements? Also, are there any other problems in the context of thread-safety and locales that I have not considered in my question - especially regarding the usage of std::stringstream?

In your case the stringstream seems to be the best approach, as you can control its locale independently of the global locale that was set. But it's true that formatted reading is not as easy as with sscanf().
From the point of view of performance, stream input with regex is overkill for this kind of simple comma-separated reading: in an informal benchmark it was more than 10 times slower than scanf().
You can easily write a little auxiliary class to facilitate reading formats like the ones you have enumerated. The general idea is described in another SO answer. Usage can be as easy as:
sst >> mandatory_input(" { ")>> x >> mandatory_input(" , ")>>y>> mandatory_input(" } ");
If you're interested, I wrote one some time ago. The full article has examples and explanation as well as source code. The class is 70 lines of code, but most of them provide error processing functions in case these are needed. It has acceptable performance, but is still slower than scanf().
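The linked class is not reproduced here, so the following is only a rough sketch of what such a helper could look like; this mandatory_input is a hypothetical stand-in, not the author's implementation. The sketch also imbues std::locale::classic() explicitly, because a newly constructed stream copies whatever the global locale is at that moment. parse_pair is an illustrative name.

```cpp
#include <istream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical stand-in for the helper named above: whitespace in the
// pattern matches any run of whitespace (including none); every other
// character must match exactly, otherwise failbit is set.
class mandatory_input
{
public:
    explicit mandatory_input(const char* pattern) : pattern_(pattern) {}

    friend std::istream& operator>>(std::istream& is, const mandatory_input& m)
    {
        for (const char* p = m.pattern_; is && *p != '\0'; ++p) {
            if (*p == ' ' || *p == '\t' || *p == '\n') {
                if (!is.eof())
                    is >> std::ws;   // optional whitespace
            } else if (is.peek() == std::char_traits<char>::to_int_type(*p)) {
                is.ignore();         // literal character matched
            } else {
                is.setstate(std::ios_base::failbit);
            }
        }
        return is;
    }

private:
    const char* pattern_;
};

// Parses " { %g , %g } " without depending on the global locale.
bool parse_pair(const std::string& text, double& x, double& y)
{
    std::istringstream sst(text);
    sst.imbue(std::locale::classic());   // "C" locale: '.' is the decimal point
    sst >> mandatory_input(" { ") >> x >> mandatory_input(" , ") >> y
        >> mandatory_input(" } ");
    return !sst.fail();
}
```

The imbue call affects only this stream object, so a later setlocale() or std::locale::global() call from another thread does not change how the numbers are read.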

Based on the suggestions by Christophe and some other Stack Overflow answers I found, I created a set of two function templates and one class to achieve all the stream parsing functionality we required. They are sufficient to parse the formats proposed in the question.
The following function strips preceding whitespace and then skips an optional character:
template<char matchingCharacter>
std::istream& optionalChar(std::istream& inputStream)
{
    if (inputStream.fail())
        return inputStream;
    inputStream >> std::ws;
    if (inputStream.peek() == matchingCharacter)
        inputStream.ignore();
    else
        // If peek is executed but no further characters remain,
        // the failbit will be set, we want to undo this
        inputStream.clear(inputStream.rdstate() & ~std::ios::failbit);
    return inputStream;
}
The second function strips preceding whitespace and then checks for a mandatory character. If it doesn't match, the fail bit is set:
template<char matchingCharacter>
std::istream& mandatoryChar(std::istream& inputStream)
{
    if (inputStream.fail())
        return inputStream;
    inputStream >> std::ws;
    if (inputStream.peek() == matchingCharacter)
        inputStream.ignore();
    else
        inputStream.setstate(std::ios_base::failbit);
    return inputStream;
}
It makes sense to reuse one stringstream per thread (call strStream.str(std::string()) and clear() before each use) to increase performance, as hinted at in my question. With the optional character checks I could make the parsing more lenient towards other styles. Here is an example usage:
// Format is: " { { %g , %g } , { %g , %g } } " but we are lenient regarding the format,
// so this is also allowed: " { %g %g } { %g %g } "
std::stringstream sstream(inputString);
sstream.clear();
sstream >> optionalChar<'{'> >> mandatoryChar<'{'> >> val1 >>
optionalChar<','> >> val2 >>
mandatoryChar<'}'> >> optionalChar<','> >> mandatoryChar<'{'> >> val3 >>
optionalChar<','> >> val4;
if (sstream.fail())
logError(inputString);
Addition - Checking for mandatory strings:
Last but not least I created a class for checking for mandatory strings in streams from scratch, based on the idea by Christophe. Header-file:
class MandatoryString
{
public:
    MandatoryString(char const* mandatoryString);
    friend std::istream& operator>> (std::istream& inputStream, const MandatoryString& mandatoryString);
private:
    char const* m_chars;
};
Cpp file:
MandatoryString::MandatoryString(char const* mandatoryString)
    : m_chars(mandatoryString)
{}
std::istream& operator>> (std::istream& inputStream, const MandatoryString& mandatoryString)
{
    if (inputStream.fail())
        return inputStream;
    char const* currentMandatoryChar = mandatoryString.m_chars;
    while (*currentMandatoryChar != '\0')
    {
        static const std::locale spaceLocale("C");
        if (std::isspace(*currentMandatoryChar, spaceLocale))
        {
            inputStream >> std::ws;
        }
        else
        {
            int peekedChar = inputStream.get();
            if (peekedChar != *currentMandatoryChar)
            {
                inputStream.setstate(std::ios::failbit);
                break;
            }
        }
        ++currentMandatoryChar;
    }
    return inputStream;
}
The MandatoryString class is used similarly to the above functions, e.g.:
sstream >> MandatoryString(" left");
Conclusion:
While this solution might be more verbose than sscanf, it gives us all the flexibility we needed while being able to use stringstreams, which makes this solution generally thread-safe and independent of the global locale (provided the stream is imbued with a fixed locale such as std::locale::classic(), since a newly constructed stream copies the global locale). It is also easy to check for errors, and once a fail bit is set, parsing is halted inside the suggested functions. For very long sequences of values to parse from a string, this can actually become more readable than sscanf: for example, it allows splitting the parsing across multiple lines, with each mandatory string on the same line as its corresponding variable. After overloading the stream operators << and >> for our internal types, everything looks very clean and is easily maintainable. Parsing multiple hexadecimals also works fine; we just reset the previously set std::hex value to std::dec after the operation is done.
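The stream reuse and the std::hex reset mentioned in this answer could look roughly like the sketch below. reset_stream and read_hex are hypothetical helper names, not part of the author's code.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: points `stream` at new text so that one stream object
// can be reused instead of constructing a fresh one for every parse.
void reset_stream(std::stringstream& stream, const std::string& text)
{
    stream.str(text);   // replace the buffer contents
    stream.clear();     // drop eofbit/failbit left over from the last parse
}

// Reads one hexadecimal value, then restores decimal mode for later reads.
// Manipulators are applied even when the extraction fails, so the stream
// does not stay in hex mode after an error.
bool read_hex(std::stringstream& stream, unsigned& value)
{
    stream >> std::hex >> value >> std::dec;
    return !stream.fail();
}
```

Formatting flags such as std::hex are sticky and survive both str() and clear(), which is why the reset to std::dec matters when the stream object is reused.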

How can I Portably Catch and Handle UTF "EN DASH" Minuses During c++ STL File Reading?

I'm maintaining a large open source project, so I'm running into an odd fringe case on the I/O front.
When my app parses a user parameter file containing a line of text like the following:
CH3 CH2 CH2 CH2 −68.189775 2 180.0 ! TraPPE 1
...at first it looks innocent because it is formatted as desired. But then I see the minus is a UTF character (−) rather than (-).
I'm just using STL's >> with the ifstream object.
When it attempts to convert to a negative and fails on the UTF character STL apparently just sets the internal flag to "bad", which was triggering my logic that stops the reading process. This is sort of good as without that logic I would have had an even harder time tracking it down.
But it's definitely not my desired error handling. I want to catch common minus like characters when reading a double with >>, replace them and complete the conversion if the string is otherwise a properly formatted negative number.
This appears to be happening to my users relatively frequently as they're copying and pasting from programs (calculator or Excel perhaps in Windows?) to get their file values.
I was somewhat surprised not to find this problem on Stack Overflow, as it seems pretty ubiquitous. I found some reference to this on this question:
c++ error cannot be used as a function, some stray error [closed]
...but that was a slightly different problem, in which the code contained that kind of similar, but noncompatible "minus-like" EN DASH UTF character.
Does anyone have a good solution (preferably compact, portable, and reusable) for catching such bad minuses when reading doubles or signed integers?
Note:
I don't want to use Boost or C++11 as, believe it or not, some of my users on certain supercomputers don't have access to those libraries. I'm trying to keep it as portable as possible.

Maybe a custom std::num_get facet would work for you. Other character-to-value aspects can be overridden as well.
#include <iostream>
#include <locale>
#include <sstream>
#include <string>
// Facet that also accepts U+2212 MINUS SIGN (decimal 8722) as a leading minus.
class num_get : public std::num_get<wchar_t>
{
public:
    iter_type do_get( iter_type begin, iter_type end, std::ios_base & str,
                      std::ios_base::iostate & error, float & value ) const
    {
        bool neg=false;
        // Never dereference the iterator at the end of the input.
        if(begin!=end && *begin==8722) {
            ++begin;
            neg=true;
        }
        iter_type i = std::num_get<wchar_t>::do_get(begin, end, str, error, value);
        if (neg && !(error & std::ios_base::failbit))
            value=-value;
        return i;
    }
};
int main() {
    std::locale new_locale(std::cin.getloc(), new num_get);
    // Parsing wchar_t streams makes life easier, but in principle
    // it should work with char (e.g. UTF-8) as well.
    // \u2212 is the MINUS SIGN from the question's input line.
    static const std::wstring ws(L"CH3 CH2 CH2 CH2 \u221268.189775 2 180.0 ! TraPPE 1");
    std::basic_stringstream<wchar_t> wss(ws);
    float f=0;
    // Imbue this new locale into wss
    wss.imbue(new_locale);
    // Skip the four leading atom-type tokens
    for(int i=0;i<4;i++) {
        std::wstring s;
        wss >> s >> std::ws;
        std::wcerr << s << std::endl;
    }
    wss >> f;
    std::wcerr << f << std::endl;
}

Not gonna happen except manually. There are many characters in Unicode, there's an Em Dash as well as an En Dash, and most likely quite a few more. For example, did you consider the possibility of an Em Dash and then a non-breaking-space and then some numbers? Or an RTL override? Unicode is legend because the possibilities are nearly endless, and double-legend in C++ because the Standard support for it could be charitably described as ISIS's support for sanity.
The only real way to do this is to find each situation as your users report it, and handle it manually, i.e., do not use operator>> for double.
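One way to do that manual handling, sketched here under the assumption that the parameter file is UTF-8 encoded, is to normalize each line before it reaches operator>>. The function name normalize_minus and the set of characters handled are illustrative choices; the list is not exhaustive. The code avoids C++11 and Boost.

```cpp
#include <cstddef>
#include <string>

// Replaces a few UTF-8 encoded minus look-alikes with ASCII '-':
//   U+2212 MINUS SIGN = E2 88 92
//   U+2013 EN DASH    = E2 80 93
//   U+2014 EM DASH    = E2 80 94
std::string normalize_minus(const std::string& line)
{
    std::string out;
    out.reserve(line.size());
    for (std::size_t i = 0; i < line.size(); ++i) {
        const unsigned char c0 = static_cast<unsigned char>(line[i]);
        if (c0 == 0xE2 && i + 2 < line.size()) {
            const unsigned char c1 = static_cast<unsigned char>(line[i + 1]);
            const unsigned char c2 = static_cast<unsigned char>(line[i + 2]);
            const bool minus_sign = (c1 == 0x88 && c2 == 0x92);
            const bool dash = (c1 == 0x80 && (c2 == 0x93 || c2 == 0x94));
            if (minus_sign || dash) {
                out += '-';
                i += 2;   // skip the two continuation bytes
                continue;
            }
        }
        out += line[i];
    }
    return out;
}
```

A typical use would be to read each line with std::getline, pass it through normalize_minus, and parse the result with a std::istringstream. Note that this rewrites dashes everywhere on the line, including in trailing comments.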

Is it possible to manipulate some text with an user-defined I/O manipulator?

Is there a (clean) way to manipulate some text from std::cin before inserting it into a std::string, so that the following would work:
cin >> setw(80) >> Uppercase >> mystring;
where mystring is std::string (I don't want to use any wrappers for strings).
Uppercase is a manipulator. I think it needs to act on the Chars in the buffer directly (no matter what is considered uppercase rather than lowercase now). Such a manipulator seems difficult to implement in a clean way, as user-defined manipulators, as far as I know, are used to just change or mix some pre-determined format flags easily.

(Non-extended) manipulators usually only set flags and data which the extractors afterwards read and react to. (That is what xalloc, iword, and pword are for.) What you could, obviously, do, is to write something analogous to std::get_money:
struct uppercasify {
    uppercasify(std::string &s) : ref(s) {}
    uppercasify(const uppercasify &other) : ref(other.ref) {}
    std::string &ref;
};
std::istream &operator>>(std::istream &is, uppercasify uc) { // or &&uc in C++11
    is >> uc.ref;
    boost::to_upper(uc.ref); // requires <boost/algorithm/string.hpp>
    return is;
}
cin >> setw(80) >> uppercasify(mystring);
Alternatively, cin >> uppercase could return not a reference to cin, but an instantiation of some (template) wrapper class uppercase_istream, with the corresponding overload for operator>>. I don't think having a manipulator modify the underlying stream buffer's contents is a good idea.
If you're desperate enough, I guess you could also imbue a hand-crafted locale resulting in uppercasing strings. I don't think I'd let anything like that go through a code review, though – it's simply just waiting to surprise and bite the next person working on the code.

You may want to check out boost iostreams. Its framework allows defining filters which can manipulate the stream. http://www.boost.org/doc/libs/1_49_0/libs/iostreams/doc/index.html

C++: std::istream check for EOF without reading / consuming tokens / using operator>>

I would like to test if a std::istream has reached the end without reading from it.
I know that I can check for EOF like this:
if (is >> something)
but this has a series of problems. Imagine there are many, possibly virtual, methods/functions which expect std::istream& passed as an argument.
This would mean I have to do the "housework" of checking for EOF in each of them, possibly with different type of something variable, or create some weird wrapper which would handle the scenario of calling the input methods.
All I need to do is:
if (!IsEof(is)) Input(is);
the method IsEof should guarantee that the stream is not changed for reading, so that the above line is equivalent to:
Input(is)
as regards the data read in the Input method.
If there is no generic solution that would work for any std::istream, is there any way to do this for std::ifstream or cin?
EDIT:
In other words, the following assert should always pass:
while (!IsEof(is)) {
int something;
assert(is >> something);
}

The istream class has an eof bit that can be checked by using the is.eof() member.
Edit: So you want to see if the next character is the EOF marker without removing it from the stream? if (is.peek() == EOF) is probably what you want then. See the documentation for istream::peek

That's impossible. How is the IsEof function supposed to know that the next item you intend to read is an int?
Should the following also not trigger any asserts?
while(!IsEof(in))
{
int x;
double y;
if( rand() % 2 == 0 )
{
assert(in >> x);
} else {
assert(in >> y);
}
}
That said, you can use the exceptions method to keep the "house-keeping" in one place.
Instead of
if(!IsEof(is)) Input(is)
try
is.exceptions( ifstream::eofbit /* | ifstream::failbit etc. if you like */ );
try {
Input(is);
} catch(const ifstream::failure& ) {
}
It doesn't stop you from reading before it's "too late", but it does obviate the need to have if(is >> x) if(is >> y) etc. in all the functions.

Normally,
if (is)
{
}
is enough. There is also .good(), .bad(), .fail() for more exact information
Here is a reference link: http://www.cplusplus.com/reference/iostream/istream/

There are good reasons why there is no isEof function: it is hard to specify in a usable way. For instance, operator>> usually begins by skipping whitespace (depending on a flag) while some other input functions are able to read spaces. How would your isEof() handle the situation? Begin by skipping spaces or not? Would it depend on the flag used by operator>> or not? Would it restore the whitespace in the stream or not?
My advice is use the standard idiom and characterize input failure instead of trying to predict only one cause of them: you'd still need to characterize and handle the others.

No, in the general case there is no way of knowing if the next read operation will reach eof.
If the stream is connected to a keyboard, the EOF condition is that I will type Ctrl+Z/Ctrl+D at the next prompt. How would IsEof(is) detect that?

Reading a fixed number of chars with >> on an istream

I was trying out a few file reading strategies in C++ and I came across this.
ifstream ifsw1("c:\\trys\\str3.txt");
char ifsw1w[3];
do {
ifsw1 >> ifsw1w;
if (ifsw1.eof())
break;
cout << ifsw1w << flush << endl;
} while (1);
ifsw1.close();
The content of the file were
firstfirst firstsecond
secondfirst secondsecond
When I see the output it is printed as
firstfirst
firstsecond
secondfirst
I expected the output to be something like:
fir
stf
irs
tfi
.....
Moreover I see that "secondsecond" has not been printed. I guess that the last read has met the eof and the cout might not have been executed. But the first behavior is not understandable.

The extraction operator has no concept of the size of the ifsw1w variable, and (by default) is going to extract characters until it hits whitespace, null, or eof. These are likely being stored in the memory locations after your ifsw1w variable, which would cause bad bugs if you had additional variables defined.
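To get the desired behavior, you should be able to call
ifsw1.width(3);
before each extraction to limit the number of characters extracted. At most width - 1 characters are stored, followed by a terminating '\0', and the width resets to 0 after every extraction.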
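The width limit also applies when extracting into a std::string, which avoids the fixed buffer altogether. Below is a small sketch that produces the "fir", "stf", "irs" pieces the question expected. read_chunks is a hypothetical helper name.

```cpp
#include <iomanip>
#include <istream>
#include <string>
#include <vector>

// Reads whitespace-separated words in pieces of at most `n` characters.
std::vector<std::string> read_chunks(std::istream& is, int n)
{
    std::vector<std::string> chunks;
    std::string piece;
    // setw must be repeated: the width resets to 0 after every extraction.
    while (is >> std::setw(n) >> piece)
        chunks.push_back(piece);
    return chunks;
}
```

Unlike the char array case, a width of n here yields up to n characters per piece, since std::string needs no room for a terminator. Whitespace still ends a piece early, so "firstfirst" comes out as "fir", "stf", "irs", "t".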

It's virtually impossible to use std::istream& operator>>(std::istream&, char *) safely -- it's like gets in this regard -- there's no way for you to specify the buffer size. The stream just writes to your buffer, going off the end. (Your example above invokes undefined behavior). Either use the overloads accepting a std::string, or use std::getline(std::istream&, std::string).
Checking eof() is incorrect. You want fail() instead. You really don't care if the stream is at the end of the file, you care only if you have failed to extract information.
For something like this you're probably better off just reading the whole file into a string and using string operations from that point. You can do that using a stringstream:
#include <fstream> //For ifstream
#include <string> //For string
#include <sstream> //For stringstream
#include <iostream> //As before
std::ifstream myFile(...);
std::stringstream ss;
ss << myFile.rdbuf(); //Read the file into the stringstream.
std::string fileContents = ss.str(); //Now you have a string, no loops!

You're trashing the memory... it's reading past the 3 chars you defined (it's reading until a space or a new line is met...).
Read char by char to achieve the output you had mentioned.
Edit : Irritate is right, this works too (with some fixes and not getting the exact result, but that's the spirit):
char ifsw1w[4];
do{
ifsw1.width(4);
ifsw1 >> ifsw1w;
if(ifsw1.eof()) break;
cout << ifsw1w << flush << endl;
}while(1);
ifsw1.close();

The code has undefined behavior. When you do something like this:
char ifsw1w[3];
ifsw1 >> ifsw1w;
The operator>> receives a pointer to the buffer, but has no idea of the buffer's actual size. As such, it has no way to know that it should stop reading after two characters (and note that it should be 2, not 3 -- it needs space for a '\0' to terminate the string).
Bottom line: in your exploration of ways to read data, this code is probably best ignored. About all you can learn from code like this is a few things you should avoid. It's generally easier, however, to just follow a few rules of thumb than try to study all the problems that can arise.
Use std::string to read strings.
Only use fixed-size buffers for fixed-size data.
When you do use fixed buffers, pass their size to limit how much is read.
When you want to read all the data in a file, std::copy can avoid a lot of errors:
std::vector<std::string> strings;
std::copy(std::istream_iterator<std::string>(myFile),
std::istream_iterator<std::string>(),
std::back_inserter(strings));

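To keep operator>> from skipping leading whitespace, you could use "noskipws":
ifsw1 >> noskipws >> ifsw1w;
But if you want to get only 3 characters, I suggest you use the get method. Note that get(s, n) stores at most n - 1 characters plus a terminating '\0', so reading 3 characters needs a 4-element buffer:
ifsw1.get(ifsw1w,4); // with char ifsw1w[4]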
