Tokenizer efficiency question - C++

I'm writing a compiler front end for a project and I'm trying to understand the best way to tokenize the source code.
I can't choose between two approaches:
1) the tokenizer reads all tokens up front:
bool Parser::ReadAllTokens()
{
    Token token;
    while( m_Lexer->ReadToken( &token ) )
    {
        m_Tokens->push_back( token );
        token.Reset(); // reset the token values
    }
    return !m_Tokens->empty();
}
and then the parsing phase begins, operating on the m_Tokens list. This way the methods getNextToken(), peekNextToken() and ungetToken() are relatively easy to implement with an iterator, and the parsing code reads clearly, e.g.:
getNextToken();
useToken();
getNextToken();
peekNextToken();
if( peeked is something )
    ungetToken();
...
...
2) the parsing phase begins, and each token is created and consumed on demand (the code seems less clear this way).
Which method is best, and why? And how do they compare in efficiency?
Thanks in advance for the answers.

Traditionally, compiler construction classes teach you to read tokens one by one as you parse. The reason is that, back in the day, memory resources were scarce: you had kilobytes at your disposal, not gigabytes as you do today.
Having said that, I don't mean to recommend reading all tokens in advance and then parsing from your list of tokens. Input is of arbitrary size, and if you hog too much memory, the system will become slow. Since it looks like you only need one token of lookahead, I'd read one token at a time from the input stream. The operating system will buffer and cache the input stream for you, so it'll be fast enough for most purposes.

It would be better to use something like Boost::Spirit to tokenise. Why reinvent the wheel?

Your method (1) is generally overkill: there is no need to tokenize an entire file prior to parsing it.
A good way to go is to implement a buffered tokenizer, which stores in a list the tokens that were peeked or ungotten, consumes elements from that list on "get", and reads tokens from the file when the list is empty (a la FILE*).
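A rough sketch of that idea (the Token and Lexer stand-ins below are assumptions for illustration, not the asker's actual classes):

#include <deque>
#include <iostream>
#include <string>

// Minimal stand-ins for the asker's types.
struct Token { std::string text; };
struct Lexer {
    std::istream &in;
    bool ReadToken(Token *t) { return static_cast<bool>(in >> t->text); }
};

// Buffered tokenizer: unget pushes tokens back onto a list, get consumes
// from that list first and reads from the lexer only when the list is empty.
class TokenBuffer {
public:
    explicit TokenBuffer(Lexer &lexer) : m_Lexer(lexer) {}

    bool getNextToken(Token &token)
    {
        if (!m_Pending.empty()) {
            token = m_Pending.front();
            m_Pending.pop_front();
            return true;
        }
        return m_Lexer.ReadToken(&token); // list empty: read from the input
    }

    bool peekNextToken(Token &token)
    {
        if (!getNextToken(token))
            return false;
        ungetToken(token); // put it back so the next get sees it again
        return true;
    }

    void ungetToken(const Token &token) { m_Pending.push_front(token); }

private:
    Lexer &m_Lexer;
    std::deque<Token> m_Pending;
};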

The first method is better, as you will still be able to understand the code three months later...


Why do we need to use a stringstream when splitting a string?

Please note, I've never used streams before today, so my understanding of them remains rather vague. Apologies if I say something appallingly stupid.
Here I have a short bit of code that splits up a stringstream into a bunch of strings at each space.
vector<string> words;
stringstream ss("some random words that I wrote just now");
string word;
while(getline(ss, word, ' ')){
    words.push_back(word);
}
I'm wondering why we're using a stringstream here, rather than just a string.
This would work like:
1. Create a string object at memory location x.
2. When referenced, go through each character and check if it is a space, saving the characters seen so far somewhere temporary.
3. If it is a space, grab all the stuff we've just stored and stick it on the end of the vector, then clear the temporary storage. If it's not a space, go back to step 2.
What's storing "some random words that I wrote just now" as a stringstream going to do to help us here?
Is it just making a stream of characters so that we can check through them? Is this necessary? Are we always doing this, even in other languages?
I'm wondering why we're using a stringstream here, rather than just a string.
If this is the question, then one big reason why stringstream is used is simply that it works, with little effort by the programmer. The less code you write, the less chance for bugs to occur.
Your method of using just std::string and searching for spaces requires the C++ programmer to write all of those steps (create a string, manually search for spaces, etc). It may be trivial to write, but even the best programmers can make mistakes in trivial code. The code may have bugs, may not cover all of the corner cases, etc.
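For comparison, a minimal sketch of that manual approach, with every step written out by hand (like the getline loop, it keeps empty strings for consecutive spaces):

#include <string>
#include <vector>

std::vector<std::string> split_on_spaces(const std::string &s)
{
    std::vector<std::string> words;
    std::string current; // the "temporary storage"
    for (char c : s) {
        if (c == ' ') {
            words.push_back(current); // flush what we've gathered so far
            current.clear();
        } else {
            current += c; // keep saving characters
        }
    }
    words.push_back(current); // corner case: don't forget the final word
    return words;
}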
As to ease of use:
When a C++ programmer sees stringstream used for separating a string on whitespace, the purpose of the code is immediately known.
If, on the other hand, a programmer decides to manually parse the data by using just string and searching for spaces, the purpose of the code is not immediately apparent to another programmer reading it. Sure, the realization may come quickly, but I can bet the other programmer will ask "why didn't you use stringstream?".
What's storing "some random words that I wrote just now" as a stringstream going to do to help us here? Is it just making a stream of characters so that we can check through them? Is this necessary?
std::stringstream just allows you to use the usual input/output operations such as >> and std::getline on a string. You can't use std::getline to read directly from an std::string, so you put the string in a std::stringstream first. You can totally parse a string by looping over the characters yourself as you described.
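For instance, operator>>, which splits on any whitespace, makes the same loop even shorter:

#include <sstream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> words;
    std::stringstream ss("some random words that I wrote just now");
    std::string word;
    while (ss >> word) // >> skips whitespace and extracts one word
        words.push_back(word);
}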
Are we always doing this, even in other languages?
Not in Python at least. There you would just do words = line.split(' ').

C++ how to check if the std::cin buffer is empty

The title is misleading because I'm more interested in finding an alternate solution. My gut feeling is that checking whether the buffer is empty is not the most ideal solution (at least in my case).
I'm new to C++ and have been following Bjarne Stroustrup's Programming Principles and Practices using C++. I'm currently on Chapter 7, where we are "refining" the calculator from Chapter 6. (I'll put the links for the source code at the end of the question.)
Basically, the calculator can take multiple inputs from the user, delimited by semi-colons.
> 5+2; 10*2; 5-1;
= 7
> = 20
> = 4
>
But I'd like to get rid of the prompt character ('>') for the last two answers, and display it again only when user input is asked for. My first instinct was to find a way to check if the buffer is empty: if so, print the prompt character, and if not, proceed with printing the answer. But after a bit of googling I realized the task is not as easy as I initially thought... and also that maybe that wasn't a good idea to begin with.
I guess essentially my question is how to get rid of the '>' characters for the last two answers when there are multiple inputs. But if checking the cin buffer is possible and is not a bad idea after all, I'd love to know how to do it.
Source code: https://gist.github.com/Spicy-Pumpkin/4187856492ccca1a24eaa741d7417675
Header file: http://www.stroustrup.com/Programming/PPP2code/std_lib_facilities.h
^ You need this header file. I assume it is written by the author himself.
Edit: I did look around the web for some solutions, but to be honest none of them made any sense to me. It's been like 4 days since I picked up C++ and I have a very thin background in programming, so sometimes even googling is a little tough..
As you've discovered, this is a deceptively complicated task, because there are multiple issues at play here: both the C++ library and the actual underlying file.
C++ library
std::cin, and C++ input streams, use an intermediate buffer, a std::streambuf. Input from the underlying file, or an interactive terminal, is not read character by character, but rather in moderately sized chunks, where possible. Let's say:
int n;
std::cin >> n;
Let's say that when this is done and over with, n contains the number 42. Well, what actually happened is that std::cin, more than likely, did not read just the two characters '4' and '2', but also whatever additional characters were available on the std::cin stream. The remaining characters were stored in the std::streambuf, and the next input operation will read them before actually reading the underlying file.
And it is equally likely that the above >> did not actually read anything from the file, but rather fetched the '4' and the '2' characters from the std::streambuf, that were left there after the previous input operation.
It is possible to examine the underlying std::streambuf, and determine whether there's anything unread there. But this doesn't really help you.
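For reference, that examination looks like this (in_avail() is a standard std::streambuf member, but as explained next, the count alone doesn't tell you enough):

#include <iostream>

int main()
{
    // Number of characters that can be read from std::cin's internal
    // buffer without touching the underlying file (0 means "unknown").
    std::streamsize pending = std::cin.rdbuf()->in_avail();
    std::cout << pending << " characters already buffered\n";
}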
Suppose you were about to execute the above >> operator, looked at the underlying std::streambuf, and discovered that it contains a single character, '4'. That still doesn't tell you much. You would need to know what the next character on std::cin is. It could be a space or a newline, in which case all you'll get from the >> operator is 4. Or the next character could be '2', in which case >> will swallow at least '42', and possibly more digits.
You can certainly implement all this logic yourself, look at the underlying std::streambuf, and determine whether it will satisfy your upcoming input operation. Congratulations: you've just reinvented the >> operator. You might as well just parse the input, a character at a time, yourself.
The underlying file
Suppose you've determined that std::cin does not have sufficient buffered input to satisfy your next input operation. Now you need to know whether or not input is available on the underlying file.
This now becomes an operating system-specific subject matter. This is no longer covered by the standard C++ library.
Conclusion
This is doable, but in all practical situations the best solution here is to use an operating system-specific approach instead of C++ input streams, and read and buffer your input yourself. On Linux, for example, the classical approach is to set fd 0 to non-blocking mode, so that read() does not block; to determine whether or not there's available input, just try to read() it. If you did read something, put it into a buffer that you can look at later. Once you've consumed all previously-read buffered input, and you truly need to wait for more, poll() the file descriptor until input is there.
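A minimal sketch of that Linux approach (error handling omitted, buffer sizes arbitrary):

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>
#include <string>

int main()
{
    // Put fd 0 (stdin) into non-blocking mode.
    int flags = fcntl(0, F_GETFL, 0);
    fcntl(0, F_SETFL, flags | O_NONBLOCK);

    std::string buffer;
    char chunk[4096];

    // Try to read whatever is available right now; read() won't block.
    ssize_t n = read(0, chunk, sizeof chunk);
    if (n > 0) {
        buffer.append(chunk, n); // got input: keep it for later parsing
    } else {
        // Nothing available: wait until stdin actually becomes readable.
        pollfd pfd = {0, POLLIN, 0};
        poll(&pfd, 1, -1);
    }
}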

SPIMI algorithm misunderstanding

I'm trying to implement a single-pass in-memory indexer in C++, but I think there is something wrong in the algorithm, or (most probably) I have misunderstood it:
SPIMI-INVERT(token_stream)
    output_file = NEWFILE()
    dictionary = NEWHASH()
    while (free memory available)
        token ← next(token_stream)
        if term(token) ∉ dictionary
            then postings_list = ADDTODICTIONARY(dictionary, term(token))
            else postings_list = GETPOSTINGSLIST(dictionary, term(token))
        if full(postings_list)
            then postings_list = DOUBLEPOSTINGSLIST(dictionary, term(token))
        ADDTOPOSTINGSLIST(postings_list, docID(token))
    sorted_terms ← SORTTERMS(dictionary)
    WRITEBLOCKTODISK(sorted_terms, dictionary, output_file)
    return output_file
Let's assume that I have done all the parsing and turned all the documents into a stream of tokens, where each token is a (term, doc_id) pair.
http://nlp.stanford.edu/IR-book/html/htmledition/single-pass-in-memory-indexing-1.html says that SPIMI-INVERT function is called for every block.
Alright, let's start then:
1. We read the stream block by block, so now I have one single block, and send it to the SPIMI-INVERT function as an argument.
2. The function does some processing with the tokens for the dictionary.
3. Somehow (maybe because the dictionary is too big) we run out of free memory while we are in the while loop.
4. The algorithm breaks the loop and writes the current dictionary to disk.
But from the outside world (as the caller of the function) I have no idea whether the block that I sent as an argument was processed completely or not. Don't you think there is something wrong here?
Since there have been no answers so far, and after talking to my professor, I am answering my own question.
I must say that the algorithm is not really clear (my professor was not sure either), so I am answering this question according to how I interpreted it:
token_stream is a file that contains tokens (term, doc-id pairs)

while (there is a token in token_stream)
    dict = new dictionary()
    while (there is free memory available)
        token = next_token()
        dict.add_to_postingList(token)
    write_dict_to_file(dict)
    delete dict

// Assuming the posting list is dynamically sized
// and the dictionary knows whether a term exists.
Here I implemented it in C++; it might be useful.
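For illustration, a minimal C++ sketch of this interpretation (this is not the implementation referred to above; next_token, the block file names, and the MEMORY_BUDGET token-count proxy for "free memory available" are all assumptions):

#include <fstream>
#include <istream>
#include <string>
#include <unordered_map>
#include <vector>

struct TokenPair { std::string term; int doc_id; };

// Hypothetical token source: reads one (term, doc-id) pair, false at end.
bool next_token(std::istream &token_stream, TokenPair &t)
{
    return static_cast<bool>(token_stream >> t.term >> t.doc_id);
}

void spimi_invert(std::istream &token_stream)
{
    const std::size_t MEMORY_BUDGET = 1000000; // proxy for "free memory available"
    int block_no = 0;
    TokenPair t;

    while (token_stream) {
        // One in-memory dictionary per block; posting lists grow dynamically.
        std::unordered_map<std::string, std::vector<int>> dict;
        std::size_t used = 0;

        while (used < MEMORY_BUDGET && next_token(token_stream, t)) {
            dict[t.term].push_back(t.doc_id); // adds the term if missing
            ++used;
        }
        if (dict.empty())
            break; // nothing was read: end of stream

        // Sorting the terms and merging the blocks later is omitted here.
        std::ofstream out("block_" + std::to_string(block_no++) + ".txt");
        for (const auto &entry : dict) {
            out << entry.first;
            for (int id : entry.second)
                out << ' ' << id;
            out << '\n';
        }
    }
}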

Improving efficiency of std::string in a compiler

I'm attempting to build a scanner for a compiler of a C-like language and am getting caught up on an efficient way to generate tokens. I have a scan function:
vector<Token> scan(string &input);
And also a main function, which reads in a lexically correct file and removes comments (the language does not support /* */ comments). I am using a DFA with maximal munch to generate tokens, and I'm pretty sure that part of the scanner is reasonably efficient. However, the scanner does not handle large files well, because they all end up in one string, and the repeated concatenation of line 1001 onto the previous 1000 lines is killing the scanner. Unfortunately my FSM cannot deal with comments, because they are allowed to contain any Unicode and other odd characters. I was wondering: is there a better way to go from a file on stdin to a vector of tokens, keeping in mind that the function scan must take a single string and return a single vector, and all tokens must be in a single vector at the end of scanning? Anyway, here is the code which "scans". Please don't laugh at my bad idea too hard :)
string in = "";
string build;
while(true)
{
    getline(cin, build);
    if( cin.eof() )
        break;
    if( build.find("//") != string::npos )
        build = build.substr(0, build.find("//", 0));
    in += " " + build;
}
try {
    vector<Token> wlpp = scan(in);
    ...
    ...
A couple of things that you might want to consider:
in += " " + build;
is very inefficient and probably not what you want in that loop, but that doesn't seem to be where you're running into problems. (At the very least, get some idea of the size of your inputs and do in.reserve(size) before the loop.)
A better design for your scanner might be a class that wraps the input file as an istream_iterator<Token>, implementing an appropriate operator>> for Token. If you really wanted the tokens in a vector, you could then do something like vector<Token> v((istream_iterator<Token>(cin)), istream_iterator<Token>()); and be done with it (the extra parentheses avoid the most vexing parse). Your operator>> would then just swallow comments and populate a token before returning.
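A rough sketch of that design (the Token layout and the comment handling here are assumptions; a real scanner would run its DFA where the single-word read is):

#include <iostream>
#include <iterator>
#include <string>
#include <vector>

struct Token {
    std::string text;
};

// Reads one token, swallowing "//" comments to the end of the line.
std::istream &operator>>(std::istream &in, Token &tok)
{
    std::string word;
    while (in >> word) {
        if (word.rfind("//", 0) == 0) { // comment: discard the rest of the line
            std::string rest;
            std::getline(in, rest);
            continue;
        }
        tok.text = word; // the real DFA lexing would happen here
        return in;
    }
    return in; // extraction failed: stream signals end of tokens
}

int main()
{
    std::vector<Token> tokens{std::istream_iterator<Token>(std::cin),
                              std::istream_iterator<Token>()};
    std::cout << tokens.size() << " tokens\n";
}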

Counting lines of code

I was doing some research on line counters for C++ projects and I'm very interested in the algorithms they use. Does anyone know where I can look at some implementations of such algorithms?
There's cloc, which is a free open-source source lines of code counter. It has support for many languages, including C++. I personally use it to get the line count of my projects.
At its SourceForge page you can find the Perl source code for download.
Well, if by line counters, you mean programs which count lines, then the algorithm is pretty trivial: just count the number of '\n' in the code. If, on the other hand, you mean programs which count C++ statements, or produce other metrics... Although not 100% accurate, I've gotten pretty good results in the past just by counting '}' and ';' (ignoring those in comments and string and character literals, of course). Anything more accurate would probably require parsing the actual C++.
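The trivial version is a one-liner with the standard library (the file name below is just an example):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>

int main()
{
    std::ifstream file("main.cpp"); // example input file
    auto lines = std::count(std::istreambuf_iterator<char>(file),
                            std::istreambuf_iterator<char>(), '\n');
    std::cout << lines << " lines\n";
}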
You don't need to actually parse the code to count lines; it's enough to tokenise it.
The algorithm could look like:
int lastLine = -1;
int lines = 0;
for (const Token &token : tokens) {
    if (isCode(token) && lastLine != token.line) {
        ++lines;
        lastLine = token.line;
    }
}
The only information you need to collect during tokenisation is:
- what type of token it is (an operator, an identifier, a comment...). You don't need to be very precise here, actually, as you only need to distinguish "non-code tokens" (comments) from "code tokens" (anything else);
- at which line in the file the token occurs.
As for how to tokenise, that's for you to figure out, but hand-writing a tokeniser for such a simple case shouldn't be hard. You could use flex, but that's probably overkill.
EDIT
I've mentioned "tokenisation", let me describe it for you quickly:
Tokenisation is the first stage of compilation. The input of tokenisation is text (multi-line program), and the output is a sequence of "tokens", as in: symbols with some meaning. For instance, the following program:
#include "something.h"
/*
This is my program.
It is quite useless.
*/
int main() {
    return something(2+3); // this is equal to 5
}
could produce a token stream that looks like:
PreprocessorDirective("include")
StringLiteral("something.h")
PreprocessorDirectiveEnd
MultiLineComment(...)
Keyword(INT)
Identifier("main")
Symbol(LeftParen)
Symbol(RightParen)
Symbol(LeftBrace)
Keyword(RETURN)
Identifier("something")
Symbol(LeftParen)
NumericLiteral(2)
Operator(PLUS)
NumericLiteral(3)
Symbol(RightParen)
Symbol(Semicolon)
SingleLineComment(" this is equal to 5")
Symbol(RightBrace)
Et cetera.
Tokens, depending on their type, may have arbitrary meta-data attached to them (e.g. the symbol type, the operator type, the identifier text, or perhaps the number of the line where the token was found).
Such stream of tokens is then fed to the parser, which uses grammar production rules written in terms of these tokens, for instance, to build a syntax tree.
Writing a full parser that would give you a complete syntax tree of the code is challenging, and especially challenging if it's C++ we're talking about. However, tokenising (or "lexing" or "lexical analysis") is easier, especially when you're not too concerned about the details, and you should be able to write a tokeniser yourself using a finite state machine.
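A rough sketch of such a state machine, with just enough states to separate comments from code and line numbers attached (string and character literals are left out for brevity, and the output feeds directly into the counting loop above):

#include <cctype>
#include <string>
#include <vector>

enum class State { Code, Slash, LineComment, BlockComment, BlockStar };

struct Tok { bool isCode; int line; };

// Emits one entry per non-whitespace code character and silently skips
// comments; the counting loop only needs the isCode flag and the line.
std::vector<Tok> lex(const std::string &src)
{
    std::vector<Tok> toks;
    State st = State::Code;
    int line = 1;
    for (char c : src) {
        switch (st) {
        case State::Code:
            if (c == '/') st = State::Slash;
            else if (!std::isspace((unsigned char)c)) toks.push_back({true, line});
            break;
        case State::Slash: // saw '/': comment opener or plain code?
            if (c == '/') st = State::LineComment;
            else if (c == '*') st = State::BlockComment;
            else {
                toks.push_back({true, line}); // the '/' itself was code
                if (!std::isspace((unsigned char)c)) toks.push_back({true, line});
                st = State::Code;
            }
            break;
        case State::LineComment:
            if (c == '\n') st = State::Code;
            break;
        case State::BlockComment:
            if (c == '*') st = State::BlockStar;
            break;
        case State::BlockStar: // '*' inside a block comment: maybe closing
            if (c == '/') st = State::Code;
            else if (c != '*') st = State::BlockComment;
            break;
        }
        if (c == '\n') ++line;
    }
    return toks;
}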
As for how to actually use the output to count lines of code (i.e. lines on which at least one "code" token, i.e. any token except a comment, starts), see the algorithm I described earlier.
I think part of the reason people are having so much trouble understanding your problem is that "count the lines of C++" is itself an algorithm. Perhaps what you're trying to ask is "How do I identify a line of C++ in a file?" That is an entirely different question, which Kos seems to have done a pretty good job of explaining.