Read only certain characters from inputstream (C++) - c++

I have a class BitSet with a data member d_bits. d_bits is an object that has member functions readString and resize. Now I want to define the extraction operator (operator>>) for BitSet, which ignores leading whitespace and then reads zeroes and ones from the istream and stops once a different character is encountered. I made the following cumbersome function:
istream &operator>>(istream &in, BitSet &bitSet){
while (in.peek() == ' ')
in.ignore();
string bits;
char ch;
while (in.peek() == '1' || in.peek() == '0'){
in.get(ch);
bits.append(1, ch);
}
bitSet.d_bits.resize(bits.length());
bitSet.d_bits.readString(bits);
return in;
}
It works correctly, but I am unhappy with it for several reasons. Ignoring whitespace characters one by one seems unnecessarily tedious. Also, once the whitespace is ignored, reading the characters one by one and appending them to a string seems excessively slow. I looked through the member functions of istream for something more convenient, but I could not find a better way of doing this. So is this the preferred way of extracting only part of the contents of an istream?
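One possible tidier shape, offered as a sketch rather than a definitive answer: let an istream::sentry skip the leading whitespace in one step, then walk the stream buffer directly so that no per-character sentry is constructed. The helper name read_bits is hypothetical, and it returns a std::string instead of touching BitSet, whose d_bits type is not shown in the question.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical helper: returns the run of '0'/'1' characters at the front of
// the stream (after leading whitespace) and leaves the next character unread.
std::string read_bits(std::istream& in) {
    typedef std::char_traits<char> traits;
    std::string bits;
    std::istream::sentry ok(in);   // skips whitespace while skipws is set
    if (!ok) return bits;          // stream already failed or hit end of input
    std::streambuf* buf = in.rdbuf();
    int c = buf->sgetc();          // look at the current character
    while (c == '0' || c == '1') {
        bits.push_back(static_cast<char>(c));
        c = buf->snextc();         // consume it, then look at the next one
    }
    if (traits::eq_int_type(c, traits::eof()))
        in.setstate(std::ios_base::eofbit);   // we bypassed istream, so report EOF ourselves
    if (bits.empty())
        in.setstate(std::ios_base::failbit);  // no bit characters were found
    return bits;
}
```

Inside operator>> the result could then be handed to d_bits.resize and d_bits.readString as in the original function.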

Related

Difference between istream::get(char&) and operator>> (char&)

My question seems to be the same as this one, but I didn't find an answer since the original question seems to ask something more specific.
In C++98, what is the difference between
char c;
cin.get(c);
and
char c;
cin >> c;
?
I've checked the cplusplus reference for get and operator>>, and they look the same to me.
I've tried above code and they seem to behave the same when I input a char.
The difference shows up when there is a whitespace character at the front of the stream buffer.
Consider the input ' foo'
char c;
cin.get(c);
Will store ' ' in c
However
char c;
cin >> c;
Will skip the whitespace and store 'f' in c
In addition to what's already been said, std::istream::get() is an unformatted input function, so it affects the stream's gcount(), unlike the formatted extractor. Most overloads of get() and getline() have been made obsolete by std::string, its stream extractors, and std::getline(). I'd say to use std::istream::get() whenever you need a single, unformatted character straight from the buffer (by using its single or zero argument overload). It's certainly quicker than turning off the skipping of whitespace first before using the formatted extractor. Also use std::string instead of raw character buffers and is >> str for formatted data or std::getline(is, str) for unformatted data.
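The behaviour described in these answers can be checked with a small sketch that uses std::istringstream in place of cin. The function names are made up for illustration.

```cpp
#include <cassert>
#include <ios>
#include <sstream>
#include <string>

// Unformatted read: get() returns the very next character, whitespace included.
char first_char_get(const std::string& input) {
    std::istringstream in(input);
    char c = '\0';
    in.get(c);
    return c;
}

// Formatted read: operator>> skips leading whitespace because skipws is on by default.
char first_char_extract(const std::string& input) {
    std::istringstream in(input);
    char c = '\0';
    in >> c;
    return c;
}

// Formatted read with whitespace skipping switched off via std::noskipws.
char first_char_extract_noskipws(const std::string& input) {
    std::istringstream in(input);
    char c = '\0';
    in >> std::noskipws >> c;
    return c;
}

// gcount() reports how many characters the last unformatted read consumed.
std::streamsize gcount_after_get(const std::string& input) {
    std::istringstream in(input);
    char c = '\0';
    in.get(c);
    return in.gcount();
}
```

With the input " foo", the first function yields the space and the second yields 'f', matching the explanation above.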

Is it possible to manipulate some text with an user-defined I/O manipulator?

Is there a (clean) way to manipulate some text from std::cin before inserting it into a std::string, so that the following would work:
cin >> setw(80) >> Uppercase >> mystring;
where mystring is std::string (I don't want to use any wrappers for strings).
Uppercase is a manipulator. I think it needs to act on the Chars in the buffer directly (no matter what is considered uppercase rather than lowercase now). Such a manipulator seems difficult to implement in a clean way, as user-defined manipulators, as far as I know, are used to just change or mix some pre-determined format flags easily.
(Non-extended) manipulators usually only set flags and data which the extractors afterwards read and react to. (That is what xalloc, iword, and pword are for.) What you could, obviously, do, is to write something analogous to std::get_money:
struct uppercasify {
uppercasify(std::string &s) : ref(s) {}
uppercasify(const uppercasify &other) : ref(other.ref) {}
std::string &ref;
};
std::istream &operator>>(std::istream &is, uppercasify uc) { // or &&uc in C++11
is >> uc.ref;
boost::to_upper(uc.ref);
return is;
}
cin >> setw(80) >> uppercasify(mystring);
Alternatively, cin >> uppercase could return not a reference to cin, but an instantiation of some (template) wrapper class uppercase_istream, with the corresponding overload for operator>>. I don't think having a manipulator modify the underlying stream buffer's contents is a good idea.
If you're desperate enough, I guess you could also imbue a hand-crafted locale resulting in uppercasing strings. I don't think I'd let anything like that go through a code review, though – it's simply just waiting to surprise and bite the next person working on the code.
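For readers without Boost, the get_money-style wrapper can be sketched with std::toupper instead of boost::to_upper. This is a variation on the code above, not the answerer's exact version, and read_upper is a hypothetical helper added only to exercise it.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Wrapper that remembers which string the extraction should fill.
struct uppercasify {
    explicit uppercasify(std::string& s) : ref(s) {}
    std::string& ref;
};

// Extract one whitespace-delimited word, then upper-case it in place.
std::istream& operator>>(std::istream& is, uppercasify uc) {
    if (is >> uc.ref) {
        for (char& ch : uc.ref)
            ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
    }
    return is;
}

// Hypothetical helper: reads the first word of `input` in upper case.
std::string read_upper(const std::string& input) {
    std::istringstream in(input);
    std::string word;
    in >> uppercasify(word);
    return word;
}
```

The cast to unsigned char matters because passing a negative char value to std::toupper is undefined behaviour.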
You may want to check out boost iostreams. Its framework allows defining filters which can manipulate the stream. http://www.boost.org/doc/libs/1_49_0/libs/iostreams/doc/index.html

Override >> operator like int

This is part of a homework assignment. I don't want an answer, just help. I have to make a class called MyInt that can store a positive integer of any size. I can only use the cstring, cctype, iomanip, and iostream libraries. I really don't understand where to even begin on this.
6) Create an overload of the extraction operator >> for reading integers from an input stream. This operator should ignore any leading white space before the number, then read consecutive digits until a non-digit is encountered (this is the same way that >> for a normal int works, so we want to make ours work the same way). This operator should only extract and store the digits in the object. The "first non-digit" encountered after the number may be part of the next input, so should not be extracted. You may assume that the first non-whitespace character in the input will be a digit. i.e. you do not have to error check for entry of an inappropriate type (like a letter) when you have asked for a number.
Example: Suppose the following code is executed, and the input typed is " 12345 7894H".
MyInt x, y;
char ch;
cin >> x >> y >> ch;
The value of x should now be 12345, the value of y should be 7894 and the value of ch should be 'H'.
The last state of my code is as follows:
istream& operator>>(istream& s, MyInt& N){
N.Resize(5);
N.currentSize=1;
char c;
int i = 0;
s >> c;
N.DigitArray[i++] = C2I(c);
N.currentSize++;
c = s.peek();
while(C2I(c) != -1){
s >> c;
if(N.currentSize >= N.maxSize)
N.Resize(N.maxSize + 5);
N.DigitArray[i] = C2I(c);
i++;
N.currentSize++;
}
}
It almost works! Now it grabs the right number but it doesn't end when I hit enter, I have to enter a letter for it to end.
You can create an operator>> overload for your class this way (as a free function, not inside the class):
std::istream& operator>>(std::istream& lhs, MyInt& rhs) {
// read from lhs into rhs
// then return lhs to allow chaining
return lhs;
}
You can use the members peek and read of istream to read in characters, isspace to test whether a character is whitespace, and isdigit to check whether a character is a digit (isspace and isdigit are in the <cctype> header).
First of all, your operator>> should be concerned only with extracting the sequence of chars from the stream and knowing when to stop based on your rules for that. Then, it should defer to a constructor of myInt to actually ingest that string. After all, that class will probably want to expose constructors like:
myInt bigone ("123456123451234123121");
for more general-purpose use, right? And, functions should have a single responsibility.
So your general form will be:
istream& operator>> (istream& is, myInt& x) // x must be a reference, or the caller's object is never updated
{
string s = extract_digits_from_stream(is);
x = myInt(s);
return is; // chaining
}
Now how do you extract just digits from a stream and stop at a non-digit? Well, the peek function comes to mind, as does unget. I'd look at source code for the extraction operator for regular integers and see what it does.
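A minimal sketch of the extract_digits_from_stream helper named above, assuming std::string is acceptable as in that answer (the homework's library restrictions would call for a char array instead). It uses std::ws for the leading whitespace and peek so that the first non-digit stays in the stream.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Skips leading whitespace, consumes consecutive digits, and leaves the
// first non-digit unread. Sets failbit if no digit was found.
std::string extract_digits_from_stream(std::istream& is) {
    std::string digits;
    is >> std::ws;  // discard leading whitespace in one step
    while (true) {
        const int next = is.peek();  // look without consuming
        if (next == std::char_traits<char>::eof() ||
            !std::isdigit(static_cast<unsigned char>(next)))
            break;
        digits.push_back(static_cast<char>(is.get()));
    }
    if (digits.empty())
        is.setstate(std::ios_base::failbit);
    return digits;
}
```

Because the loop only peeks at the terminating character, a newline ends the number just as a letter does, which is the behaviour the question was missing.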

Ignore byte-order marks in C++, reading from a stream

I have a function to read the value of one variable (integer, double, or boolean) on a single line in an ifstream:
template <typename Type>
void readFromFile (ifstream &in, Type &val)
{
string str;
getline (in, str);
stringstream ss(str);
ss >> val;
}
However, it fails on text files created with editors inserting a BOM (byte order mark) at the beginning of the first line, which unfortunately includes {Note,Word}pad. How can I modify this function to ignore the byte-order mark if present at the beginning of str?
(I'm assuming you're on Windows, since using U+FEFF as a signature in UTF-8 files is mostly a Windows thing and should simply be avoided elsewhere)
You could open the file as a UTF-8 file and then check to see if the first character is U+FEFF. You can do this by opening a normal char based fstream and then use wbuffer_convert to treat it as a series of code units in another encoding. VS2010 doesn't yet have great support for char32_t so the following uses UTF-16 in wchar_t.
std::fstream fs(filename);
std::wbuffer_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> wb(fs.rdbuf());
std::wistream is(&wb);
// if you don't do this on the stack remember to destroy the objects in reverse order of creation. is, then wb, then fs.
std::wistream::int_type ch = is.get();
const std::wistream::int_type ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;
if(ZERO_WIDTH_NO_BREAK_SPACE != ch)
is.putback(ch);
// now the stream can be passed around and used without worrying about the extra character in the stream.
int i;
readFromStream<int>(is,i);
Remember that this should be done on the file stream as a whole, not inside readFromFile on your stringstream, because ignoring U+FEFF should only be done if it's the very first character in the whole file, if at all. It shouldn't be done anywhere else.
On the other hand, if you're happy using a char based stream and just want to skip U+FEFF if present, then James Kanze's suggestion seems good, so here's an implementation:
std::fstream fs(filename);
char a,b,c;
a = fs.get();
b = fs.get();
c = fs.get();
if (a != (char)0xEF || b != (char)0xBB || c != (char)0xBF) {
fs.clear(); // a file shorter than three bytes leaves failbit set, which would block seekg
fs.seekg(0);
} else {
std::cerr << "Warning: file contains the so-called 'UTF-8 signature'\n";
}
Additionally if you want to use wchar_t internally the codecvt_utf8_utf16 and codecvt_utf8 facets have a mode that can consume 'BOMs' for you. The only problem is that wchar_t is widely recognized to be worthless these days* and so you probably shouldn't do this.
std::wifstream fin(filename);
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header));
* wchar_t is worthless because it is specified to do just one thing; provide a fixed size data type that can represent any code point in a locale's character repertoire. It does not provide a common representation between locales (i.e., the same wchar_t value can be different characters in different locales so you cannot necessarily convert to wchar_t, switch to another locale, and then convert back to char in order to do iconv-like encoding conversions.)
The fixed sized representation itself is worthless for two reasons; first, many code points have semantic meanings and so understanding text means you have to process multiple code points anyway. Secondly, some platforms such as Windows use UTF-16 as the wchar_t encoding, which means a single wchar_t isn't even necessarily a code point value. (Whether using UTF-16 this way is even conformant to the standard is ambiguous. The standard requires that every character supported by a locale be representable as a single wchar_t value; If no locale supports any character outside the BMP then UTF-16 could be seen as conformant.)
You have to start by reading the first byte or two of the stream, and deciding whether it is part of a BOM or not. It's a bit of a pain, since you can only putback a single byte, whereas you typically will want to read four. The simplest solution is to open the file, read the initial bytes, memorize how many you need to skip, then seek back to the beginning and skip them.
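The read, remember, seek-back approach can be sketched as follows. This version only recognises the three-byte UTF-8 signature (the UTF-16 and UTF-32 marks the answer alludes to are not handled), it assumes a seekable stream positioned at its start, and the name skip_utf8_bom is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Returns how many leading bytes belong to a UTF-8 BOM (3 or 0) and leaves
// the stream positioned just past them.
std::size_t skip_utf8_bom(std::istream& in) {
    char head[3] = {0, 0, 0};
    in.read(head, 3);  // read the initial bytes
    const bool has_bom = in.gcount() == 3 &&
        static_cast<unsigned char>(head[0]) == 0xEF &&
        static_cast<unsigned char>(head[1]) == 0xBB &&
        static_cast<unsigned char>(head[2]) == 0xBF;
    const std::size_t skip = has_bom ? 3 : 0;  // remember how many to skip
    in.clear();  // a short read sets eofbit and failbit
    in.seekg(static_cast<std::streamoff>(skip), std::ios_base::beg);
    return skip;
}
```

Calling this once right after opening the file, before any getline, keeps the check at the start of the whole file rather than at the start of each line.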
I solved it with a not-so-clean workaround: removing every character outside the printable ASCII range, which strips the BOM bytes along with any other non-ASCII text:
bool isNotAlnum(unsigned char c)
{
return (c < ' ' || c > '~');
}
...
str.erase(remove_if(str.begin(), str.end(), isNotAlnum), str.end());
Here's a simple C++ function to skip the BOM on an input stream on Windows. This assumes byte-sized data, as in UTF-8:
// skip BOM for UTF-8 on Windows
void skip_bom(auto& fs) {
const unsigned char boms[]{ 0xef, 0xbb, 0xbf };
bool have_bom{ true };
for(const auto& c : boms) {
if((unsigned char)fs.get() != c) have_bom = false;
}
if(!have_bom) { fs.clear(); fs.seekg(0); } // clear() first: a file shorter than three bytes leaves failbit set
return;
}
It simply checks the first three bytes for the UTF-8 BOM signature, and skips them if they all match. There's no harm if there's no BOM.
Edit: This works with a file stream, but not with cin. I found it did work with cin on Linux with GCC-11, but that's clearly not portable. See @Dúthomhas's comment below.

Example of overloading C++ extraction operator >> to parse data

I am looking for a good example of how to overload the stream input operator (operator>>) to parse some data with simple text formatting. I have read this tutorial but I would like to do something a bit more advanced. In my case I have fixed strings that I would like to check for (and ignore). Supposing the 2D point format from the link were more like
Point{0.3 =>
0.4 }
where the intended effect is to parse out the numbers 0.3 and 0.4. (Yes, this is an awfully silly syntax, but it incorporates several ideas I need). Mostly I just want to see how to properly check for the presence of fixed strings, ignore whitespace, etc.
Update:
Oops, the comment I made below has no formatting (this is my first time using this site).
I found that whitespace can be skipped with something like
std::cin >> std::ws;
And for eating up strings I have
static bool match_string(std::istream &is, const char *str){
size_t nstr = strlen(str);
while(nstr){
if(is.peek() == *str){
is.ignore(1);
++str;
--nstr;
}else{
is.setstate(is.rdstate() | std::ios_base::failbit);
return false;
}
}
return true;
}
Now it would be nice to be able to get the position (line number) of a parsing error.
Update 2:
Got line numbers and comment parsing working, using just 1 character look-ahead. The final result can be seen here in AArray.cpp, in the function parse(). The project is a (de)serializable C++ PHP-like array class.
Your operator>>(istream &, object &) should get data from the input stream, using its formatted and/or unformatted extraction functions, and put it into your object.
If you want to be more safe (after a fashion), construct and test an istream::sentry object before you start. If you encounter a syntax error, you may call setstate( ios_base::failbit ) to prevent any other processing until you call my_stream.clear().
See <istream> (and istream.tcc if you're using SGI STL) for examples.
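Putting these pieces together for the Point format from the question, a sketch might look like the following. The Point struct and the match_literal helper are assumptions made for illustration (match_literal is a std::string variant of the question's match_string); the sentry check and the failbit handling follow the advice above.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

struct Point { double x = 0.0; double y = 0.0; };  // hypothetical target type

// Consumes `literal` exactly; sets failbit at the first mismatch.
bool match_literal(std::istream& is, const std::string& literal) {
    for (char expected : literal) {
        if (is.peek() != std::char_traits<char>::to_int_type(expected)) {
            is.setstate(std::ios_base::failbit);
            return false;
        }
        is.ignore();  // consume the matched character
    }
    return true;
}

// Parses "Point{ <x> => <y> }" with arbitrary whitespace around the numbers.
std::istream& operator>>(std::istream& is, Point& p) {
    std::istream::sentry guard(is);  // skips leading whitespace, checks stream state
    if (!guard) return is;
    Point tmp;  // parse into a temporary so p is untouched on failure
    if (match_literal(is, "Point{") && (is >> tmp.x >> std::ws) &&
        match_literal(is, "=>") && (is >> tmp.y >> std::ws) &&
        match_literal(is, "}"))
        p = tmp;
    return is;
}
```

After a failed parse the stream stays in the fail state until clear() is called, so chained extractions stop at the first syntax error.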