Using a regex_iterator on an istream - c++

I want to be able to solve problems like this: Getting std::ifstream to handle LF, CR, and CRLF? where an istream needs to be tokenized by a complex delimiter, such that the only way to tokenize the istream is to:
Read the istream a character at a time
Collect the characters
When a delimiter is hit return the collection as a token
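The three steps above can be sketched as follows. This is only an illustration: read_token and tokenize are hypothetical names, and the delimiters are assumed to be "\n", "\n\r" or "\r", mirroring the regex used further down. Unlike that regex, this sketch also returns a trailing token that has no delimiter after it.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one token from `in`, stopping at "\n", "\n\r"
// or "\r". Returns false once the stream has nothing left to offer.
bool read_token(std::istream& in, std::string& token) {
    token.clear();
    char c;
    while (in.get(c)) {                          // read a character at a time
        if (c == '\r') return true;              // lone CR ends the token
        if (c == '\n') {
            if (in.peek() == '\r') in.get(c);    // swallow the optional CR after LF
            return true;
        }
        token += c;                              // collect the character
    }
    return !token.empty();                       // trailing token without delimiter
}

// Hypothetical helper: collects every token of the stream into a vector.
std::vector<std::string> tokenize(std::istream& in) {
    std::vector<std::string> tokens;
    std::string token;
    while (read_token(in, token)) tokens.push_back(token);
    return tokens;
}
```

Only one character of lookahead (peek) is needed here, which is why this works on a plain istream.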
Regexes are very good at tokenizing strings with complex delimiters:
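string foo{ "A\nB\rC\n\r" };
vector<string> bar;
// The regex must be a named object: regex_iterator keeps a pointer to it, and C++14 rejects a temporary here
const regex delimited("(.*)(?:\n\r?|\r)");
// This puts {"A", "B", "C"} into bar
transform(sregex_iterator(foo.cbegin(), foo.cend(), delimited), sregex_iterator(), back_inserter(bar), [](const smatch& i){ return i[1].str(); });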
But I can't use a regex_iterator on an istream :( My solution has been to slurp the istream and then run the regex_iterator over it, but the slurping step seems superfluous.
Is there an unholy combination of istream_iterator and regex_iterator out there somewhere, or if I want it do I have to write it myself?

This question is about code appearance:
Since a regex works one character at a time, this question asks for a library that parses the istream one character at a time, rather than reading and parsing it one character at a time in user code
Since parsing an istream one character at a time still copies that character to a temporary variable (a buffer), this code seeks to avoid buffering the whole input itself, depending on a library to abstract that instead
C++11's regexes use a modified ECMA-262 grammar, which does not support look-behinds (look-aheads are supported): https://stackoverflow.com/a/14539500/2642059 In principle a regex engine without look-behinds could match using only an input_iterator_tag, but clearly those implemented in C++11 do not.
boost::regex_iterator, on the other hand, does support the boost::match_partial flag (which is not available in the C++11 regex flags). boost::match_partial allows the user to slurp part of the file and run the regex over that. On a mismatch due to end of input, the regex will "hold its finger" at that position and await more being added to the buffer. You can see an example here: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html In the average case, like "A\nB\rC\n\r", this can save buffer size.
boost::match_partial has 4 drawbacks:
In the worst case, like "ABC\n", this saves no buffer space and the user must slurp the whole istream
If the programmer guesses a buffer size that is too large, that is, one that contains the delimiter and a significant amount more, the benefits of the reduction in buffer size are squandered
Any time the buffer size selected is too small, additional computations will be required compared to the slurping of the entire file, therefore this method excels in a delimiter-dense string
The inclusion of boost always causes bloat
Circling back to answer the question: A standard library regex_iterator cannot operate on an input_iterator_tag, so slurping the whole istream is required. A boost::regex_iterator allows the user to possibly slurp less than the whole istream. Because this is a question about code appearance though, and because boost::regex_iterator's worst case requires slurping the whole file anyway, it is not a good answer to this question.
For the best code appearance slurping the whole file and running a standard regex_iterator over it is your best bet.
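A minimal sketch of that slurp-then-match approach, reusing the regex from the question. The function name tokenize_slurped is made up for illustration, and the istreambuf_iterator copy is just one common way to slurp a stream.

```cpp
#include <istream>
#include <iterator>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: slurps `in` and returns the text before each
// "\n", "\n\r" or "\r" delimiter.
std::vector<std::string> tokenize_slurped(std::istream& in) {
    // Slurp: copy every remaining character of the stream into one string.
    const std::string text((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());

    // The regex is a named object because regex_iterator only stores a
    // pointer to it, so it has to outlive the iteration.
    const std::regex delimited("(.*)(?:\n\r?|\r)");

    std::vector<std::string> tokens;
    for (std::sregex_iterator it(text.cbegin(), text.cend(), delimited), end;
         it != end; ++it) {
        tokens.push_back((*it)[1].str());   // capture group 1 is the token
    }
    return tokens;
}
```

Because the regex requires a delimiter after each token, trailing text without a delimiter is not reported.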

I think not. istream_iterator has the input_iterator_tag tag, whereas regex_iterator expects to be initialized using bi-directional iterators (bidirectional_iterator_tag).
If your delimiter regex is complex enough that you want to avoid parsing the stream yourself, the best way to do this is indeed to slurp the istream.
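The iterator categories can be checked at compile time. A small sketch, where the alias names stream_category and string_category are mine:

```cpp
#include <iterator>
#include <regex>
#include <string>
#include <type_traits>

// Category of the iterator you get from a stream.
using stream_category =
    std::iterator_traits<std::istream_iterator<char>>::iterator_category;

// Category of the iterator that sregex_iterator is built on.
using string_category =
    std::iterator_traits<std::string::const_iterator>::iterator_category;

// istream_iterator is single-pass: exactly input_iterator_tag.
static_assert(std::is_same<stream_category, std::input_iterator_tag>::value,
              "istream_iterator is an input iterator");

// string iterators are random access, which satisfies the bidirectional
// requirement that regex_iterator places on its iterator type.
static_assert(std::is_base_of<std::bidirectional_iterator_tag, string_category>::value,
              "string iterators are at least bidirectional");
```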

Related

Why do we need to use a stringstream when splitting a string?

Please note, I've never used streams before today, so my understanding of them remains rather vague. Apologies if I say something appallingly stupid.
Here I have a short bit of code that splits up a stringstream into a bunch of strings at each space.
vector<string> words;
stringstream ss("some random words that I wrote just now");
string word;
while(getline(ss, word, ' ')){
words.push_back(word);
}
I'm wondering why we're using a stringstream here, rather than just a string.
This would work like:
Create a string object at memory location x
When referenced, go through each character and check if it is a space. All previous characters should be saved somewhere temporary.
If it is a space, grab all the stuff we've just stored and stick it on the end of the vector, then, clear the temp storage thing. If it's not a space, go back to step 2
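Written out by hand, those steps might look like the sketch below. The name split_on_space is hypothetical, and the final check for leftover characters is an extra step the list above does not mention but the last word needs.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `text` at every single space, the way the
// steps above describe it, without any stream.
std::vector<std::string> split_on_space(const std::string& text) {
    std::vector<std::string> words;
    std::string current;                 // the "temporary storage"
    for (char c : text) {
        if (c == ' ') {
            words.push_back(current);    // a space ends the current word
            current.clear();
        } else {
            current += c;                // keep collecting characters
        }
    }
    if (!current.empty()) {
        words.push_back(current);        // last word has no space after it
    }
    return words;
}
```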
What's storing "some random words that I wrote just now" as a stringstream going to do to help us here?
Is it just making a stream of characters so that we can check through them? Is this necessary? Are we always doing this, even in other languages?
I'm wondering why we're using a stringstream here, rather than just a string.
If this is the question, then one big reason why stringstream is used is simply -- because it works with little effort by the programmer. The less code you write, the less chance for bugs to occur.
Your method of using just std::string and searching for spaces requires the C++ programmer to write all of those steps (create a string, manually search for spaces, etc). It may be trivial to write, but even the best programmers can make mistakes in trivial code. The code may have bugs, may not cover all of the corner cases, etc.
As to ease of use:
When a C++ programmer sees stringstream used to separate a string at whitespace, the purpose of the code is immediately known.
If, on the other hand, a programmer decides to parse the data manually by using just string and searching for spaces, another programmer reading the code will not immediately see what it does. Sure, the other programmer may work it out quickly, but I can bet they will ask "why didn't you use stringstream?".
What's storing "some random words that I wrote just now" as a stringstream going to do to help us here? Is it just making a stream of characters so that we can check through them? Is this necessary?
std::stringstream just allows you to use the usual input/output operations such as >> and std::getline on a string. You can't use std::getline to read from a std::string, so you put the string in a std::stringstream first. You can totally parse a string by looping over the characters yourself as you described.
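To make the two operations concrete, here is a small sketch with made-up function names. It uses std::istringstream since only reading is needed. std::getline splits at every single delimiter, so doubled spaces produce an empty word, while >> skips any run of whitespace.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits with std::getline and ' ' as the delimiter.
std::vector<std::string> split_with_getline(const std::string& text) {
    std::istringstream ss(text);
    std::vector<std::string> words;
    std::string word;
    while (std::getline(ss, word, ' ')) {
        words.push_back(word);
    }
    return words;
}

// Hypothetical helper: splits with operator>>, which skips whitespace runs.
std::vector<std::string> split_with_extraction(const std::string& text) {
    std::istringstream ss(text);
    std::vector<std::string> words;
    std::string word;
    while (ss >> word) {
        words.push_back(word);
    }
    return words;
}
```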
Are we always doing this, even in other languages?
Not in Python at least. There you would just do words = line.split(' ').

A better replacement for istrstream?

istrstream was perfect for my needs - basically, take a fixed char buffer, and give me a simple way to extract lines with getline() and test for eof()
I'm switching our projects to C++17 compliance - which has deprecated istrstream - apparently because there are too many C++ programmers who cannot fathom fixed buffer memory management (are you serious?!)
At any rate, the istringstream provides the same use semantics, but it imposes the need to now copy the entire fixed character buffer at construction time.
This is an anti-pattern.
What I am looking for is either a way to use a string_view in place of a string for the istringstream, or alternately a better replacement for stringstream which itself handles externally managed fixed buffer (it need only point into it, it never need worry about managing that resource, just as strstream did).
Currently, in VS 2017, this is illegal, and if I understand things correctly, is illegal everywhere in the current state-of-art of C++ (I'm sure you'll correct me if I'm wrong!)
std::string_view raw_view(reinterpret_cast<const char *>(raw_buffer.get()), raw_buffer.size());
std::istringstream raw_stream(raw_view);
So - ideas?
Note: Peter Sommerlad has a proposal for this exact idea here for the C++ standards body:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0448r1.pdf
Continue using istrstream for the time being. It likely won't be removed until either P0448 (using std::span<char> as the source/destination of a stream buffer) or P0408 (the ability to move data into/out of stringstreams) is adopted by the standard. Either of those would serve your needs well.
That being said, if all you're trying to do is get substrings between \ns, it would be far more efficient (even with the above proposals) to just use a regex search. Or just a regular search, since you're just looking for \n. That would give you a pair of iterators that represents a line. Using iostreams for line-by-line processing of an already-loaded character buffer is overkill and will never be as efficient as the alternative.
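A rough sketch of the "regular search" idea, assuming the buffer is a plain char array of known size. The names line_range and find_lines are made up for illustration. Each line is reported as a pair of pointers into the original buffer, so nothing is copied.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// A line is a half-open range [first, second) inside the caller's buffer.
using line_range = std::pair<const char*, const char*>;

// Hypothetical helper: finds every '\n'-separated line in an existing buffer.
std::vector<line_range> find_lines(const char* buffer, std::size_t size) {
    std::vector<line_range> lines;
    const char* const end = buffer + size;
    const char* start = buffer;
    while (start != end) {
        const char* newline = std::find(start, end, '\n');
        lines.emplace_back(start, newline);   // the '\n' itself is excluded
        if (newline == end) break;            // last line had no '\n'
        start = newline + 1;                  // continue after the '\n'
    }
    return lines;
}
```

The buffer has to outlive the returned ranges, which is the same ownership arrangement istrstream relied on.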

Using Getline on a Binary File

I have read that getline behaves as an unformatted input function, which I believe should allow it to be used on a binary file. Let's say for example that I've done this:
ofstream output("foo.txt", ios_base::binary);
const auto foo = "lorem ipsum";
output.write(foo, strlen(foo) + 1);
output.close();
ifstream input("foo.txt", ios_base::binary);
string bar;
getline(input, bar, '\0');
Is that breaking any rules? It seems to work fine, I think I've just traditionally seen arrays handled by writing the size and then writing the array.
No, it's not breaking any rules that I can see.
Yes, it's more common to write an array with a prefixed size, but using a delimiter to mark the end can work perfectly well also. The big difference is that (like with a text file) you have to read through data to find the next item. With a prefixed size, you can look at the size, and skip directly to the next item if you don't need the current one. Of course, you also need to ensure that if you're using something to mark the end of a field, that it can never occur inside the field (or come up with some way of detecting when it's inside a field, so you can read the rest of the field when it does).
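For comparison, a size-prefixed layout could look like the sketch below. The function names and the choice of a 32-bit size written in the host's byte order are assumptions made for illustration, not a recommended file format. The point is skip_record: it moves past an item without reading its contents.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes a 32-bit size (host byte order) and the bytes.
void write_record(std::ostream& out, const std::string& payload) {
    const std::uint32_t size = static_cast<std::uint32_t>(payload.size());
    out.write(reinterpret_cast<const char*>(&size), sizeof size);
    out.write(payload.data(), size);
}

// Hypothetical helper: reads only the size, then seeks past the payload.
bool skip_record(std::istream& in) {
    std::uint32_t size = 0;
    if (!in.read(reinterpret_cast<char*>(&size), sizeof size)) return false;
    in.seekg(size, std::ios_base::cur);
    return static_cast<bool>(in);
}

// Hypothetical helper: reads the size, then exactly that many bytes.
bool read_record(std::istream& in, std::string& payload) {
    std::uint32_t size = 0;
    if (!in.read(reinterpret_cast<char*>(&size), sizeof size)) return false;
    payload.resize(size);
    if (size == 0) return true;
    return static_cast<bool>(in.read(&payload[0], size));
}
```

With a '\0' delimiter instead, the only way to reach the second item is to read through every byte of the first.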
Depending on the situation, that can mean (for example) using Unicode text. This gives you a lot of options for values that can't occur inside the text (because they aren't legal Unicode). That, on the other hand, would also mean that your "binary" file is really a text file, and has to follow some basic text-file rules to make sense.
Which is preferable depends on how likely it is that you'll want to read random pieces of the file rather than reading through it from beginning to end, as well as the difficulty (if any) of finding a unique delimiter and, if you don't have one, the complexity of making the delimiter recognizable from data inside a field. If the data is only meaningful if written in order, then having to read it in order doesn't really pose a problem. If you can read individual pieces meaningfully, then being able to do so is much more likely to be useful.
In the end, it comes down to a question of what you want out of your file being "binary". In the typical case, all "binary" really means is that end-of-line markers that might otherwise be translated from a new-line character to (for example) a carriage-return/line-feed pair won't be. Depending on the OS you're using, it might not even mean that much though; for example, on Linux, there's normally no difference between binary and text mode at all.
Well, there are no rules broken and you'll get away with that just fine, except that you may miss the precision of reading binary from a stream object.
With binary input, you usually want to know how many characters were read successfully, which you can obtain afterwards with gcount(). Using std::getline will not reflect the bytes read in gcount().
Of course, you can simply get that information from the size of the string you passed into std::getline. But the stream will no longer encapsulate the number of bytes you consumed in the last unformatted operation.
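The difference can be seen by comparing the free function std::getline with the member function istream::getline. A small sketch with made-up helper names:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: gcount() as seen after the free function std::getline.
// The free function does not update the stream's count.
std::streamsize gcount_after_free_getline(std::istream& in, std::string& out) {
    std::getline(in, out, '\0');
    return in.gcount();
}

// Hypothetical helper: gcount() as seen after the member istream::getline.
// The member function counts every extracted character, delimiter included.
std::streamsize gcount_after_member_getline(std::istream& in, char* buffer,
                                            std::streamsize size) {
    in.getline(buffer, size, '\0');
    return in.gcount();
}
```

On a fresh stream holding "lorem ipsum" plus its terminating '\0', the first helper should report 0 while the string holds 11 characters, and the second should report 12.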

C++ tokenization

I am writing a lexer in C++ and I am reading from a file character by character. How do you do tokenization in this case? I can't use strtok since I have characters, not a string. Do I somehow need to keep reading until I reach a delimiter?
The answer is Yes. You need to keep reading until you hit a delimiter.
There are multiple solutions.
The simplest thing to do is exactly that: keep a buffer (std::string) of the characters you already read until you reach a delimiter. At that point, you build a token from the accumulated characters in the buffer, clear the buffer, and push the delimiter (if necessary) in the buffer.
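A bare-bones sketch of that buffer approach. The function name lex and the default set of punctuation are assumptions for illustration: whitespace separates tokens and is dropped, while punctuation characters are delimiters that also become tokens of their own.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical lexer loop: accumulate characters until a delimiter shows up.
std::vector<std::string> lex(std::istream& in,
                             const std::string& punctuation = "();+") {
    std::vector<std::string> tokens;
    std::string buffer;

    // Turns the accumulated characters into a token and clears the buffer.
    auto flush = [&tokens, &buffer] {
        if (!buffer.empty()) {
            tokens.push_back(buffer);
            buffer.clear();
        }
    };

    char c;
    while (in.get(c)) {
        if (c == ' ' || c == '\t' || c == '\n') {
            flush();                              // whitespace only separates
        } else if (punctuation.find(c) != std::string::npos) {
            flush();
            tokens.push_back(std::string(1, c));  // delimiter is a token too
        } else {
            buffer += c;
        }
    }
    flush();                                      // token at end of input
    return tokens;
}
```

Here the delimiter is emitted as its own token right away instead of being pushed back into the buffer, which amounts to the same thing for single-character delimiters.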
Another solution would be to read ahead: i.e., pick up the entire line with std::getline (for example), and then check what's on this line. In general the end-of-line is a natural token delimiter.
This works well... when delimiters are easy.
Unfortunately some languages, like C++, have awkward grammars. For example, in C++ >> can be either:
the operator >> (for right-shift and stream extraction)
the end of two nested templates (ie could be rewritten as > >)
In those cases... well, just don't bother with the difference in the tokenizer, and let your AST building pass disambiguate, it's got more information.
On the basis of the information you provided:
If you want to read up to a delimiter from a file, use the istream member function getline(char*, streamsize, char).
getline() is used to read up to n - 1 characters or up to a delimiter.
Example:
#include <fstream>
#include <iostream>
using namespace std;
int main()
{
    fstream f;
    f.open("test.cpp", ios::in);
    char c[2] = ""; // room for 1 character plus the terminating '\0'
    f.getline(c, 2, ' ');
    cout << c; // up to 1 char or until a space
}

Is there any way to read characters that satisfy certain conditions only from stdin in C++?

I am trying to read some characters that satisfy a certain condition from stdin with the iostream library, while leaving those that do not satisfy the condition in stdin so that the skipped characters can be read later. Is it possible?
For example, I want characters in a-c only and the input stream is abdddcxa.
First read in all characters in a-c: abca. After this input has finished, start reading the remaining characters, dddx. (These two reads can't happen simultaneously. They might be in two different functions.)
Wouldn't it be simpler to read everything, then split the input into the two parts you need and finally send each part to the function that needs to process it?
Keeping the data in the stdin buffer is akin to using globals, it makes your program harder to understand and leaves the risk of other code (or the user) changing what is in the buffer while you process it.
On the other hand, dividing your program into "the part that reads the data", "the part that parses the data and divides the workload" and the "part that does the work" makes for a better structured program which is easy to understand and test.
You can probably use regex to do the actual split.
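Instead of a regex, the split can also be done with std::partition_copy once the input has been read into a string. A sketch, where split_by_range is a hypothetical name and the a-c condition is taken from the question:

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <utility>

// Hypothetical helper: `first` receives the characters in a-c, `second`
// receives everything else. Relative order is kept in both parts.
std::pair<std::string, std::string> split_by_range(const std::string& input) {
    std::pair<std::string, std::string> parts;
    std::partition_copy(input.begin(), input.end(),
                        std::back_inserter(parts.first),
                        std::back_inserter(parts.second),
                        [](char c) { return c >= 'a' && c <= 'c'; });
    return parts;
}
```

For the input abdddcxa this gives abca and dddx, and each part can then be handed to the function that needs it.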
What you're asking for is the putback method (for more details see: http://www.cplusplus.com/reference/istream/istream/putback/). You would have to read everything, filter the part that you don't want to keep out, and put it back into the stream. So for instance:
cin >> myString;
// Do stuff to fill putbackBuf[] with characters in reverse order to be put back
pPutbackBuf = &putbackBuf[0];
do {
    cin.putback(*(pPutbackBuf++));
} while (*pPutbackBuf);
Another solution (which is not exactly what you're asking for) would be to split the input into two strings and then feed the "non-inputted" string into a stringstream and pass that to whatever function needs to do something with the rest of the characters.
What you want to do is not possible in general; ungetc and putback exist, but they're not guaranteed to work for more than one character. They don't actually change stdin; they just push back on an input buffer.
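What a stream can do reliably is look one character ahead. The sketch below (read_leading_run is a made-up name) uses peek to take only the leading run of a-c characters and leave everything after it untouched. Note that this stops at the first character outside the range, so it does not reach the later c and a in abdddcxa.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: consumes characters while the next one is in a-c.
// The first character that does not match stays in the stream.
std::string read_leading_run(std::istream& in) {
    std::string run;
    for (int next = in.peek();
         next != std::char_traits<char>::eof() && next >= 'a' && next <= 'c';
         next = in.peek()) {
        run += static_cast<char>(in.get());
    }
    return run;
}
```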
What you could do instead is to explicitly keep a buffer of your own, by reading the input into a string and processing that string. Streams don't let you safely rewind in many cases, though.
No, random access is not possible for streams (except for fstream and stringstream). You will have to read in the whole line/input and process the resulting string (which you could, however, do using iostreams/std::stringstream if you think it is the best tool for that -- I don't think that but iostreams gurus may differ).