C++ tokenization

I am writing a lexer in C++ and reading from a file character by character. How do you do tokenization in this case? I can't use strtok since I have characters, not a string. Do I somehow need to keep reading until I reach a delimiter?

The answer is Yes. You need to keep reading until you hit a delimiter.

There are multiple solutions.
The simplest thing to do is exactly that: keep a buffer (std::string) of the characters you have already read until you reach a delimiter. At that point, you build a token from the accumulated characters in the buffer, clear the buffer, and push the delimiter (if it is itself a token) into the buffer.
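Here is a minimal sketch of that buffer approach; the delimiter set and the flat std::vector<std::string> result type are placeholder assumptions, so adapt them to your grammar:

#include <cctype>
#include <istream>
#include <string>
#include <vector>

// Treat whitespace and a few punctuation characters as delimiters (placeholder set).
bool is_delimiter(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0 || c == '(' || c == ')' || c == ';';
}

std::vector<std::string> tokenize(std::istream& in)
{
    std::vector<std::string> tokens;
    std::string buffer;                // characters accumulated so far
    char c;
    while (in.get(c)) {
        if (is_delimiter(c)) {
            if (!buffer.empty()) {     // emit the accumulated token
                tokens.push_back(buffer);
                buffer.clear();
            }
            if (!std::isspace(static_cast<unsigned char>(c)))
                tokens.push_back(std::string(1, c));   // the delimiter is itself a token
        } else {
            buffer += c;               // keep accumulating
        }
    }
    if (!buffer.empty())
        tokens.push_back(buffer);      // final token, if the input doesn't end in a delimiter
    return tokens;
}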
Another solution would be to read ahead of time: i.e., pick up an entire line with std::getline (for example), and then check what is on that line. In general, the end of a line is a natural token delimiter.
This works well... when delimiters are easy.
Unfortunately some languages, like C++, have awkward grammars. For example, in C++ >> can be either:
the operator >> (for right-shift and stream extraction)
the end of two nested templates (i.e., it could be rewritten as > >)
In those cases... well, just don't bother with the difference in the tokenizer; let your AST-building pass disambiguate, since it has more information.

On the basis of the information you provided:
If you want to read up to a delimiter from a file, use the istream::getline(char*, streamsize, char) member function.
getline() is used to read up to n characters or up to a delimiter.
Example:
#include <fstream>
#include <iostream>
using namespace std;

int main()
{
    fstream f;
    f.open("test.cpp", ios::in);
    char c[2];                // the buffer must be allocated by the caller
    f.getline(c, 2, ' ');     // reads up to 1 character or until a space
    cout << c;
}

Related

The best method for filling char array (gets vs cin.getline)

I'm using C++11. I'm wondering if there are any advantages to using cin.getline() compared to gets().
I need to fill a char array.
Also, should I use fgets or getline for files?
I'm wondering if there are any advantages to using cin.getline() compared to gets().
I am assuming you really mean gets, not fgets.
Yes, there definitely is. gets is known to be a security problem. cin.getline() does not suffer from that problem.
It's worth comparing fgets and cin.getline.
The only difference that I see is that fgets will include the newline character in the output while cin.getline won't.
Most of the time, the newline character is ignored by application code, so it is better to use cin.getline() or istream::getline() in general. If the presence of the newline character in the output is important to you for some reason, you should use fgets.
Another reason to prefer istream::getline is that you can specify a character for the delimiter. If you need to parse a comma-separated values (CSV) file, you can use:
std::ifstream fstr("some file name.csv");
char data[256];                           // buffer for one field
fstr.getline(data, sizeof data, ',');     // read up to the next comma
Of course.
First of all, gets doesn't check the length of the input, so if the input is longer than the char array, you get a buffer overflow.
On the other hand, cin.getline lets you specify the maximum number of characters to read, so the buffer cannot be overrun.
In any case, the consensus among C++ programmers is that you should avoid raw character arrays and prefer std::string with std::getline.
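For illustration, a small sketch of both bounded alternatives (the buffer size is an arbitrary assumption):

#include <iostream>
#include <string>

int main()
{
    // Bounded read into a raw array: at most 127 characters plus the terminating '\0'.
    char buf[128];
    std::cin.getline(buf, sizeof buf);

    // Preferred in C++: read a whole line into a std::string, no size limit to manage.
    std::string line;
    std::getline(std::cin, line);

    std::cout << buf << '\n' << line << '\n';
}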

Input Redirection reading integers and char C++

Thanks for taking your time to read this!
I'm having trouble parsing a file supplied via input redirection: I need to read both integers and characters.
Without using getline(), how do you read in the file including integers, characters, and any amount of whitespace? (I know the >> operator skips whitespace, but it fails when it hits a character while extracting an integer.)
Thanks!
The first thing you need to realise is that, fundamentally, there are no things like "integers" in your file. Your file does not contain typed data: it contains bytes.
Now, since C++ doesn't support any text encodings, for our purposes here we can consider bytes equivalent to "characters". (In reality, you'll probably layer something like a UTF-8 support library on top of your code, at which point "characters" takes on a whole new meaning. But we'll save that discussion for another day.)
At the most basic, then, we can just extract a bunch of bytes. Let's say 50 at a time:
std::ifstream ifs("filename.dat");
constexpr size_t CHUNK_SIZE = 50;
char buf[CHUNK_SIZE];
while (ifs.read(buf, CHUNK_SIZE) || ifs.gcount() > 0) {   // the second test catches the final, partial chunk
    const size_t num_extracted = ifs.gcount();
    parseData(buf, num_extracted);
}
The function parseData would then examine those bytes in whatever manner you see fit.
For many text files this is unnecessarily arduous. So, as you've discovered, the IOStreams part of the C++ Standard Library provides us with some shortcuts. For example, std::getline will read bytes up to a delimiter, rather than reading a certain quantity of bytes.
Using this, we can read in things "line by line" — assuming a "line" is a sequence of bytes terminated by a \n (or \r\n if your platform performs line-ending translation, and you haven't put the stream into binary mode):
std::ifstream ifs("filename.dat");
std::string line;
while (std::getline(ifs, line)) {
    parseLine(line);
}
Instead of \n you can provide, as a third argument to std::getline, some other delimiter.
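For example, a minimal sketch that reads comma-delimited fields into a std::string (the file name is made up):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream ifs("values.csv");          // hypothetical file name
    std::string field;
    while (std::getline(ifs, field, ',')) {   // ',' instead of the default '\n'
        std::cout << field << '\n';
    }
}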
The other facility it offers is operator>>, which will pick out tokens (sequences of bytes delimited by whitespace) and attempt to "lexically cast" them; that is, it'll try to interpret friendly human ASCII text as C++ data. So if your input is "123 abc", you can pull out the "123" into an int with value 123, and the "abc" into another string.
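A quick sketch of that extraction, using a std::istringstream to stand in for your input:

#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::istringstream in("123 abc");
    int number;
    std::string word;
    in >> number >> word;            // "lexically casts" the first token, keeps the second as text
    std::cout << number + 1 << ' ' << word << '\n';   // prints: 124 abc
}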
If you need more complex parsing, though, you're back to the initial offering, and to the conclusion of my answer: read everything and parse it byte-by-byte as you see fit. To help with this, there's sscanf inherited from the C standard library, or spooky incantations from Boost; or you could just write your own algorithms.
The above is true of any compatible input stream, be it a std::ifstream, a std::istringstream, or the good old ready-provided std::istream instance named std::cin (which I guess is how you're accepting the data, given your mention of input redirection: shell scripting?).

Using a regex_iterator on an istream

I want to be able to solve problems like this one: "Getting std::ifstream to handle LF, CR, and CRLF?", where an istream needs to be tokenized by a complex delimiter, such that the only way to tokenize the istream is to:
Read the istream in a character at a time
Collect the characters
When a delimiter is hit return the collection as a token
Regexes are very good at tokenizing strings with complex delimiters:
string foo{ "A\nB\rC\n\r" };
vector<string> bar;
regex delim{ "(.*)(?:\n\r?|\r)" };  // named lvalue: constructing the iterator from a temporary regex is disallowed since C++14
// This puts {"A", "B", "C"} into bar
transform(sregex_iterator(foo.cbegin(), foo.cend(), delim),
          sregex_iterator(),
          back_inserter(bar),
          [](const smatch& i){ return i[1].str(); });
But I can't use a regex_iterator on an istream :( My solution has been to slurp the istream and then run the regex_iterator over it, but the slurping step seems superfluous.
Is there an unholy combination of istream_iterator and regex_iterator out there somewhere, or if I want it do I have to write it myself?
This question is about code appearance:
Since we know that a regex works one character at a time, this question is asking to use a library to parse the istream one character at a time, rather than reading and parsing the istream one character at a time internally
Since parsing an istream one character at a time will still copy that character to a temporary variable (buffer), this code seeks to avoid buffering all the input internally, depending on a library instead to abstract that away
C++11's regexes use the ECMA-262 grammar, which supports lookaheads but not lookbehinds: https://stackoverflow.com/a/14539500/2642059 This means that a regex could, in principle, match using only an input_iterator_tag, but clearly those implemented in C++11 do not.
boost::regex_iterator, on the other hand, does support the boost::match_partial flag (which is not available in C++11's regex flags). boost::match_partial allows the user to slurp part of the file and run the regex over that; on a mismatch due to end of input, the regex will "hold its finger" at that position and await more data being added to the buffer. You can see an example here: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html In the average case, like "A\nB\rC\n\r", this can save buffer size.
boost::match_partial has 4 drawbacks:
In the worst case, like "ABC\n", this saves the user nothing and the whole istream must still be slurped
If the programmer guesses a buffer size that is too large, that is, one that contains the delimiter and a significant amount more, the benefits of the reduction in buffer size are squandered
Any time the selected buffer size is too small, additional computation is required compared to slurping the entire file, so this method excels only on delimiter-dense input
The inclusion of boost always causes bloat
Circling back to answer the question: a standard library regex_iterator cannot operate on an input iterator, so slurping the whole istream is required. A boost::regex_iterator allows the user to possibly slurp less than the whole istream. Because this is a question about code appearance, though, and because boost::regex_iterator's worst case requires slurping the whole file anyway, it is not a good answer to this question.
For the best code appearance slurping the whole file and running a standard regex_iterator over it is your best bet.
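As an illustration, here is a sketch of that slurp-then-iterate approach, reusing the line-ending pattern from the question (the file name is an assumption):

#include <fstream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

int main()
{
    std::ifstream ifs("input.txt");   // hypothetical file name
    // Slurp the whole stream into a string.
    std::string contents{ std::istreambuf_iterator<char>(ifs),
                          std::istreambuf_iterator<char>() };

    std::regex delim{ "(.*)(?:\n\r?|\r)" };
    std::vector<std::string> tokens;
    for (auto it = std::sregex_iterator(contents.cbegin(), contents.cend(), delim);
         it != std::sregex_iterator(); ++it) {
        tokens.push_back((*it)[1].str());   // capture group 1 holds the token
    }
}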
I think not. istream_iterator is tagged with input_iterator_tag, whereas regex_iterator expects to be initialized with bidirectional iterators (bidirectional_iterator_tag).
If your delimiter regex is complex enough to avoid reading the stream yourself, the best way to do this is to indeed slurp the istream.

Is there any way to read characters that satisfy certain conditions only from stdin in C++?

I am trying to read characters that satisfy a certain condition from stdin with the iostream library, while leaving those that do not satisfy the condition in stdin so that the skipped characters can be read later. Is this possible?
For example, I want characters in a-c only and the input stream is abdddcxa.
First read all the characters in a-c - abca; after this input is finished, start reading the remaining characters dddx. (These two reads can't happen simultaneously. They might be in two different functions.)
Wouldn't it be simpler to read everything, then split the input into the two parts you need and finally send each part to the function that needs to process it?
Keeping the data in the stdin buffer is akin to using globals: it makes your program harder to understand and leaves the risk of other code (or the user) changing what is in the buffer while you process it.
On the other hand, dividing your program into "the part that reads the data", "the part that parses the data and divides the workload" and the "part that does the work" makes for a better structured program which is easy to understand and test.
You can probably use regex to do the actual split.
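As a minimal sketch of that read-then-split approach, with the a-c condition from the question hard-coded:

#include <iostream>
#include <iterator>
#include <string>

int main()
{
    // Read all of stdin into one string.
    std::string input{ std::istreambuf_iterator<char>(std::cin),
                       std::istreambuf_iterator<char>() };

    // Split it into the two parts: characters in a-c, and everything else.
    std::string wanted, rest;
    for (char c : input) {
        (c >= 'a' && c <= 'c' ? wanted : rest) += c;
    }

    std::cout << wanted << '\n' << rest << '\n';   // for "abdddcxa" this prints "abca" then "dddx"
}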
What you're asking for is the putback method (for more details see: http://www.cplusplus.com/reference/istream/istream/putback/). You would have to read everything, filter out the part that you don't want to process yet, and put it back into the stream. So for instance:
cin >> myString;
// Do stuff to fill putbackBuf[] with characters, in reverse order, to be put back
pPutbackBuf = &putbackBuf[0];
do {
    cin.putback(*(pPutbackBuf++));
} while (*pPutbackBuf);
Another solution (which is not exactly what you're asking for) would be to split the input into two strings and then feed the "non-inputted" string into a stringstream and pass that to whatever function needs to do something with the rest of the characters.
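For instance, a small sketch of that stringstream hand-off; processRest and the example string are hypothetical stand-ins:

#include <iostream>
#include <sstream>
#include <string>

// Hypothetical consumer of the leftover characters; here it just echoes them.
void processRest(std::istream& in)
{
    char c;
    while (in.get(c))
        std::cout << c;
}

int main()
{
    std::string rest = "dddx";           // the "non-inputted" part from the earlier split
    std::istringstream leftover(rest);   // wrap it in a stream
    processRest(leftover);               // the consumer reads it like any other istream
}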
What you want to do is not possible in general; ungetc and putback exist, but they're not guaranteed to work for more than one character. They don't actually change stdin; they just push back on an input buffer.
What you could do instead is to explicitly keep a buffer of your own, by reading the input into a string and processing that string. Streams don't let you safely rewind in many cases, though.
No, random access is not possible for streams (except for fstream and stringstream). You will have to read in the whole line/input and process the resulting string (which you could, however, do using iostreams/std::stringstream if you think it is the best tool for that -- I don't think so, but iostreams gurus may differ).

C++ reading CSVs

I'm having a bit of trouble reading CSVs. I have multiple types of data, so I am not sure how to get this to work:
string, string, bool, bool, int
I can't simply use >> to read in the data since the delimiter is not whitespace. scanf doesn't work, since it reads from standard input rather than from a file, and getline only reads in strings and also includes the \n char for some reason.
How can I read my CSV properly?
You CAN use getline. There's an overload where the third argument is a char for the delimiter. Just throw it all in a loop.
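Here is a sketch of that loop for the record layout in the question (string, string, bool, bool, int); the file name, the field names, and the bool/int conversions are assumptions:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream file("data.csv");              // hypothetical file name
    std::string line;
    while (std::getline(file, line)) {           // one record per line
        std::istringstream record(line);
        std::string name, type, boolStr1, boolStr2, countStr;
        std::getline(record, name, ',');
        std::getline(record, type, ',');
        std::getline(record, boolStr1, ',');
        std::getline(record, boolStr2, ',');
        std::getline(record, countStr);          // last field: read to end of line
        bool flag1 = (boolStr1 == "true" || boolStr1 == "1");
        bool flag2 = (boolStr2 == "true" || boolStr2 == "1");
        int count = std::stoi(countStr);
        std::cout << name << ' ' << type << ' '
                  << flag1 << ' ' << flag2 << ' ' << count << '\n';
    }
}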
Another option (which isn't typically recommended for C++, though), is fscanf. You're right that scanf is no good for you, but fscanf is its file-based equivalent.
Another canonical solution typically employed in C, but which isn't so strongly recommended in C++, is to go ahead and use getline, and then use strtok or a simple parser to parse each line.
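If you go that route, a sketch of the getline-plus-strtok combination could look like this (strtok needs a writable, null-terminated buffer, so each line is copied into one; the file name is an assumption):

#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::ifstream file("data.csv");   // hypothetical file name
    std::string line;
    while (std::getline(file, line)) {
        std::vector<char> buf(line.begin(), line.end());
        buf.push_back('\0');                       // strtok requires a mutable, null-terminated buffer
        for (char* tok = std::strtok(buf.data(), ","); tok != nullptr;
             tok = std::strtok(nullptr, ",")) {
            std::cout << "[" << tok << "] ";       // each comma-separated field
        }
        std::cout << '\n';
    }
}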