Should a JSON lexer be greedy? - c++

I am building a toy JSON parser in C++, just for the learning experience.
While building the lexer, I came across a dilemma: should the lexer be greedy? If so, where is this defined? I could not find any directive in either JSON or ECMA-404.
In particular, while trying to tokenize the following (invalid number):
0.x123
Should my lexer try to parse it as the invalid number "0.x123" (greedy behavior) or as the invalid number "0.x" followed by the valid number "123" (but ultimately parsing it as an invalid sequence of tokens)?
Also, while tokenizing strings, should it be the lexer's responsibility to check whether the string is valid (for instance, that a backslash is followed only by one of the allowable escape characters), or should I check this constraint in a separate semantic-analysis step? I guess this is more of an architectural preference, but I am curious about your opinions.

Invalid is invalid. If you can't parse it, bail at the earliest opportunity and raise an error.
There's no need to be greedy here because you'll just waste time processing data that has zero impact on the situation.
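For example, here is a minimal sketch (function name and error messages invented for illustration, exponents omitted) of how the number path of such a lexer can bail out at the first character that cannot continue a valid JSON number:
#include <cctype>
#include <stdexcept>
#include <string>
// Hypothetical sketch: lex one JSON number starting at pos in src. It consumes
// characters only while they can still form a valid number and throws as soon
// as they cannot, so "0.x123" fails right at the 'x' without touching "123".
std::string lexNumber(const std::string& src, std::size_t& pos) {
    const std::size_t start = pos;
    if (pos < src.size() && src[pos] == '-') ++pos;                    // optional minus sign
    if (pos >= src.size() || !std::isdigit((unsigned char)src[pos]))
        throw std::runtime_error("expected digit at position " + std::to_string(pos));
    if (src[pos] == '0') ++pos;                                        // a leading 0 stands alone
    else while (pos < src.size() && std::isdigit((unsigned char)src[pos])) ++pos;
    if (pos < src.size() && src[pos] == '.') {                         // optional fraction part
        ++pos;
        if (pos >= src.size() || !std::isdigit((unsigned char)src[pos]))
            throw std::runtime_error("expected digit after '.' at position " + std::to_string(pos));
        while (pos < src.size() && std::isdigit((unsigned char)src[pos])) ++pos;
    }
    return src.substr(start, pos - start);
}
As for string escapes, checking them in the lexer versus in a later pass is indeed mostly an architectural choice; doing it in the lexer means the exact input position is still at hand when the error is reported.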

Related

How to exclude parts of input from being parsed?

OK, so I've set up a complete Bison grammar (+ its Lex counterpart) and this is what I need:
Is there any way I can set up a grammar rule so that a specific portion of input is excluded from being parsed, but instead retrieved as-is?
E.g.
external_code : EXT_CODE_START '{' '}';
For instance, how could I get the part between the curly brackets as a string, without allowing the parser to consume it (since it'll be "external" code, it won't abide by my current language rules... so, it's ok - text is fine).
How would you go about that?
Should I tackle the issue by adding a token to the Lexer? (same as I do with string literals, for example?)
Any ideas are welcome! (I hope you've understood what I need...)
P.S. Well, I also thought of treating the whole situation pretty much as I do with C-style multiline comments (= capture when the comment begins, in the Lexer, and then - from within a custom function, keep going until the end-of-comment is found). That'd definitely be some sort of solution. But isn't there anything... easier?
You can call the lexer's input/yyinput function to read characters from the input stream and do something with them (and they won't be tokenized so the parser will never see them).
You can use lexer states, putting the lexer in a different state where it will skip over the excluded text, rather than returning it as tokens.
The problem with either of the above from a parser action is dealing with the parser's one token lookahead, which occurs in some (but not all) cases. For example, the following will probably work:
external_code: EXT_CODE_START '{' { skip_external_code(); } '}'
as the action will be in a default reduction state with no lookahead. In this case, skip_external_code could either just set the lexer state (the second option above), or call input until it reaches the matching } and then call unput once (the first option above).
Note that the skip_external_code function needs to be defined in the third section of the lexer file so it has access to static functions and macros in the lexer (which both of these techniques depend on).
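As a rough sketch of the first option (names and structure are illustrative only; input()/unput() are the C-scanner macros, a C++ scanner would use yyinput()), skip_external_code might look like this in the lexer file's third section:
/* Hypothetical sketch: read raw characters until the '}' matching the
   already-consumed '{', then give that '}' back so the parser still sees
   it as a token. Assumes braces inside the external code are balanced. */
static void skip_external_code(void) {
    int depth = 1;   /* the opening '{' was already returned to the parser */
    int c = 0;
    while (depth > 0 && (c = input()) != EOF) {
        if (c == '{') ++depth;
        else if (c == '}') --depth;
        /* append c to a buffer here if the raw text itself is needed */
    }
    if (c == '}') unput(c);   /* let the grammar's '}' token match normally */
}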

Is it a bad idea to use regex to tokenize a string for a lexer?

I'm not sure how I'm going to tokenize the source for the lexer. For now, I can only think of using regex to split the string into an array according to given rules (identifiers, symbols such as +, -, etc.).
For instance,
begin x:=1;y:=2;
then I want to tokenize the word (begin), the variables (x and y in this case), and each symbol (:, =, ;).
Using regexes is a common way of implementing a lexer. If you don't want to use them then you'll sort of end up implementing some regex parts yourself anyway.
Although performance-wise it can be more efficient if you do it yourself, it isn't a must.
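For instance, a rough sketch of such a regex-driven tokenizer in C++ with std::regex (the token names and exact patterns are invented for this example), applied to the posted input:
#include <iostream>
#include <regex>
#include <string>
// Hypothetical sketch: one alternation per token class; std::sregex_iterator
// walks the input and reports which alternative matched at each position.
int main() {
    const std::string src = "begin x:=1;y:=2;";
    // keyword | identifier | number | ":=" | single-char symbol | whitespace
    const std::regex tok(R"(((?:begin|end)\b)|([A-Za-z_]\w*)|(\d+)|(:=)|([;:=+\-*/])|(\s+))");
    for (std::sregex_iterator it(src.begin(), src.end(), tok), end; it != end; ++it) {
        const std::smatch& m = *it;
        if (m[6].matched) continue;                      // skip whitespace
        const char* kind = m[1].matched ? "KEYWORD"
                         : m[2].matched ? "IDENT"
                         : m[3].matched ? "NUMBER"
                         : m[4].matched ? "ASSIGN" : "SYMBOL";
        std::cout << kind << " '" << m.str() << "'\n";
    }
}
Listing ":=" before the single-character symbols is what makes it come out as one ASSIGN token rather than ':' followed by '='.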
Using regular expressions is THE traditional way to generate your tokens.
lex and yacc (or flex and bison) are a traditional compiler-creation pair, where lex does nothing except tokenize symbols and pass them to YACC.
http://en.wikipedia.org/wiki/Lex_%28software%29
YACC is a stack based state machine (pushdown automaton) that processes the symbols.
I think regex processing is the way to go for parsing symbols of any level of complexity. As Oak mentions, you'll end up writing your own (probably inferior) regex parser. The only exception would be if it is dead simple, and even your posted example starts to exceed "dead simple".
in lex syntax:
:= return ASSIGN_TOKEN_OR_WHATEVER;
begin return BEGIN_TOKEN;
[0-9]+ return NUMBER;
[a-zA-Z][a-zA-Z0-9]* return WORD;
Character sequences are optionally passed along with the token.
Individual characters that are tokens in their own right (e.g. ";") get passed along unmodified. It's not the only way, but I have found it to work very well.
Have a look:
http://www.faqs.org/docs/Linux-HOWTO/Lex-YACC-HOWTO.html

How to make flex try the second longest matching regular expression?

This question might sound a little confusing. I'm using Flex to pass tokens to Bison.
The behavior I want is that Flex matches the longest regular expression and passes that token (it DOES work like this), but if that token doesn't work with the grammar, it then matches the second longest regular expression and passes that token.
I'm struggling to think of a way to create this behavior. How could I make this happen?
To clarify, for example, say I have two rules:
"//" return TOKEN_1;
"///" return TOKEN_2;
Given the string "///", I'd like it to first pass TOKEN_2 (it does).
If TOKEN_2 doesn't fit with the grammar as specified in Bison, it then passes TOKEN_1 (which is also valid).
How can I create this behavior?
In flex, you can have a rule that tries to do something but fails and tries the second-best rule by using the REJECT macro:
REJECT directs the scanner to proceed on to the "second best" rule which matched the input (or a prefix of the input). The rule is chosen as described above in "How the Input is Matched", and yytext and yyleng set up appropriately. It may either be one which matched as much text as the originally chosen rule but came later in the flex input file, or one which matched less text.
(source: The Flex Manual Page).
So to answer your question about getting the second-longest expression, you might be able to do this using REJECT (though you have to be careful, because it could just pick something of the same length with equal priority).
Note that flex will run slower with REJECT being used because it needs to maintain extra logic to "fall back" to worse matches at any point. I'd suggest only using this if there's no other way to fix your problem.
Sorry, but you can't do that. I'm actually unsure how much flex talks to bison. I do know there is a mode for REPL parsing, and I know there is another mode that parses it all at once.
You'll have to inline the rule. For example, instead of // and /, you write a rule which accepts ///, and then another that assumes /// means // followed by /. But that gets messy, and I only did that in a specific case in my code.
I would just have the lexer scan two tokens, // and /, and then have the parser deal with the cases when they are supposed to be regarded as one token or as separate tokens. I.e. a grammar rule that begins with /// can actually be refactored into one which starts with // and /. In other words, do not have a TOKEN_2 at all. In some cases this sort of thing can be tricky though, because the LALR(1) parser has only one token of lookahead. It has to make a shift or reduce decision based on seeing the // only, without regard for the / which follows.
I had an idea for solving this with a hacky approach involving a lexical tie-in, but it proved unworkable.
The main flaw with the idea is that there isn't any easy way to do error recovery in yacc which is hidden from the user. If a syntax error is triggered, that is visible. The yyerror function could contain a hack to try to hide this, but it lacks the context information.
In other words, you can't really use Yacc error actions to trigger a backtracking search for another parse.
This is tough for bison/yacc to deal with, as it doesn't do backtracking. Even if you use a backtracking parser generator like btyacc, it doesn't really help unless it also backtracks through the lexer (which would likely require a parser generator with integrated lexer.)
My suggestion would be to have the lexer recognize a slash immediately followed by a slash specially and return a different token:
\//\/ return SLASH;
\/ return '/'; /* not needed if you have the catch-all rule: */
. return *yytext;
Now you need to 'assemble' multi-slash 'tokens' as non-terminals in the grammar.
single_slash: SLASH | '/' ;
double_slash: SLASH SLASH | SLASH '/' ;
triple_slash: SLASH SLASH SLASH | SLASH SLASH '/' ;
However, you'll now likely find you have conflicts in the grammar due to the 1-token lookahead not being enough. You may be able to resolve those by using btyacc or bison's %glr-parser option.

What exactly is a token, in relation to parsing?

I have to use a parser and writer in C++. I am trying to implement the functions; however, I do not understand what a token is. One of my functions/operations is to check whether there are more tokens to produce:
bool Parser::hasMoreTokens()
How exactly do I go about this? Please help, SO!
I am opening a text file with text in it; all words are lowercased. How do I go about checking whether it has more tokens?
This is what I have:
bool Parser::hasMoreTokens() {
    while (source.peek() != NULL) {
        return true;
    }
    return false;
}
Tokens are the output of lexical analysis and the input to parsing. Typically they are things like
numbers
variable names
parentheses
arithmetic operators
statement terminators
That is, roughly, the biggest things that can be unambiguously identified by code that just looks at its input one character at a time.
One note, which you should feel free to ignore if it confuses you: The boundary between lexical analysis and parsing is a little fuzzy. For instance:
Some programming languages have complex-number literals that look, say, like 2+3i or 3.2e8-17e6i. If you were parsing such a language, you could make the lexer gobble up a whole complex number and make it into a token; or you could have a simpler lexer and a more complicated parser, and make (say) 3.2e8, -, 17e6i be separate tokens; it would then be the parser's job (or even the code generator's) to notice that what it's got is really a single literal.
In some programming languages, the lexer may not be able to tell whether a given token is a variable name or a type name. (This happens in C, for instance.) But the grammar of the language may distinguish between the two, so that you'd like "variable foo" and "type name foo" to be different tokens. (This also happens in C.) In this case, it may be necessary for some information to be fed back from the parser to the lexer so that it can produce the right sort of token in each case.
So "what exactly is a token?" may not always have a perfectly well defined answer.
A token is whatever you want it to be. Traditionally (and for good reasons), language specifications broke the analysis into two parts: the first part broke the input stream into tokens, and the second parsed the tokens. (Theoretically, I think you can write any grammar in only a single level, without using tokens, or, what is the same thing, using individual characters as tokens. I wouldn't like to see the results of that for a language like C++, however.) But the definition of what a token is depends entirely on the language you are parsing: most languages, for example, treat white space as a separator (but not Fortran); most languages will predefine a set of punctuation/operators using punctuation characters, and not allow these characters in symbols (but not COBOL, where "abc-def" would be a single symbol). In some cases (including in the C++ preprocessor), what is a token depends on context, so you may need some feedback from the parser. (Hopefully not; that sort of thing is for very experienced programmers.)
One thing is probably sure (unless each character is a token): you'll have to read ahead in the stream. You typically can't tell whether there are more tokens by just looking at a single character. I've generally found it useful, in fact, for the tokenizer to read a whole token at a time, and keep it until the parser needs it. A function like hasMoreTokens would in fact scan a complete token.
(And while I'm at it, if source is an istream: istream::peek does not return a pointer, but an int.)
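For example, a hypothetical corrected version (assuming source is a std::istream member and that "more tokens" just means "more non-whitespace input"), shown as a rough sketch rather than a drop-in fix:
#include <istream>
// Hypothetical minimal Parser, just enough to show the corrected check.
struct Parser {
    std::istream& source;
    bool hasMoreTokens();
};
// Skip whitespace, then check whether any character is left.
// std::istream::peek() returns an int (the next character or the EOF value), not a pointer.
bool Parser::hasMoreTokens() {
    source >> std::ws;   // consume leading whitespace; sets eofbit at end of input
    return source.peek() != std::istream::traits_type::eof();
}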
A token is the smallest unit of a programming language that has a meaning. A parenthesis (, a name foo, and an integer 123 are all tokens. Reducing a text to a series of tokens is generally the first step of parsing it.
A token is usually akin to a word in spoken language. In C++, (int, float, 5.523, const) will be tokens. It is the minimal unit of text which constitutes a semantic element.
When you split a large unit (long string) into a group of sub-units (smaller strings), each of the sub-units (smaller strings) is referred to as a "token". If there are no more sub-units, then you are done parsing.
How do I tokenize a string in C++?
A token is a terminal in a grammar: a sequence of one or more symbols that is defined by the sequence itself, i.e. it does not derive from any other production defined in the grammar.

Finite State Machine parser

I would like to parse a self-designed file format with an FSM-like parser in C++ (this is a teach-myself-c++-the-hard-way-by-doing-something-big-and-difficult kind of project :)). I have a tokenized string with newlines signifying the end of a, euh... line. See here for an input example. All the comments and junk will be filtered out, so I have a std::string like this:
global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ...
Syntax explanation:
{ } are scopes, and capitalized words signify that a list of options/files is to follow.
\n are only important in a list of options/files, signifying the end of the list.
So I thought that a FSM would be simple/extensible enough for my needs/knowledge. As far as I can tell (and want my file design to be), I don't need concurrent states or anything fancy like that. Some design/implementation questions:
Should I use an enum or an abstract class + derivatives for my states? The first is probably better for a small syntax, but could get ugly later, and the second is the exact opposite. I'm leaning toward the first, for its simplicity. enum example and class example. EDIT: what about this suggestion to use goto? I thought gotos were considered evil in C++.
When reading a list, I need to NOT ignore \n. My preferred way of reading the string, via a stringstream, ignores \n by default. So I need a simple way of telling the (same!) stringstream not to ignore newlines when a certain state is enabled.
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
Here's the draft states I have in mind:
upper: reads global, exe, lib+ target names...
normal: inside a scope, can read SOURCES..., create user variables...
list: adds items to a list until a newline is encountered.
Each scope will have a kind of conditional (e.g. win32:global { gcc:CFLAGS = ... }) and will need to be handled in the exact same fashion everywhere (even in the list state, per item).
Thanks for any input.
If you have nesting scopes, then a Finite State Machine is not the right way to go, and you should look at a Context Free Grammar parser. An LL(1) parser can be written as a set of recursive functions, or an LALR(1) parser can be written using a parser generator such as Bison.
If you add a stack to an FSM, then you're getting into pushdown automaton territory. A nondeterministic pushdown automaton is equivalent to a context free grammar (though a deterministic pushdown automaton is strictly less powerful.) LALR(1) parser generators actually generate a deterministic pushdown automaton internally. A good compiler design textbook will cover the exact algorithm by which the pushdown automaton is constructed from the grammar. (In this way, adding a stack isn't "hacky".) This Wikipedia article also describes how to construct the LR(1) pushdown automaton from your grammar, but IMO, the article is not as clear as it could be.
If your scopes nest only finitely deep (i.e. you have the upper, normal and list levels but you don't have nested lists or nested normals), then you can use a FSM without a stack.
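To make the "set of recursive functions" idea concrete, here is a bare-bones hypothetical sketch in which each call to parseScope handles one {...} level and recursion handles arbitrary nesting (all names invented):
#include <stdexcept>
#include <string>
#include <vector>
// Hypothetical sketch of an LL(1)-style recursive descent parser over a token
// stream in which "{" opens a scope, "}" closes it, and anything else is
// treated as an ordinary item inside the current scope.
struct MiniParser {
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    void parseScope() {                       // call with the current token being "{"
        expect("{");
        while (pos < tokens.size() && tokens[pos] != "}") {
            if (tokens[pos] == "{")
                parseScope();                 // nested scope: recursion provides the stack
            else
                ++pos;                        // consume an ordinary item
        }
        expect("}");
    }
    void expect(const std::string& t) {
        if (pos >= tokens.size() || tokens[pos] != t)
            throw std::runtime_error("expected '" + t + "'");
        ++pos;
    }
};
The call stack plays the role of the pushdown automaton's stack here, which is exactly what a flat FSM lacks once scopes can nest arbitrarily deep.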
There are two stages to analyzing a text input stream for parsing:
Lexical Analysis: This is where your input stream is broken into lexical units. It looks at a sequence of characters and generates tokens (analogous to words in spoken or written languages). Finite state machines are very good at lexical analysis provided you've made good design decisions about the lexical structure. From your data above, individual lexemes would be things like your keywords (e.g. "global"), identifiers (e.g. "bitwise", "SOURCES"), symbolic tokens (e.g. "{", "}", ".", "/"), numeric values, escape values (e.g. "\n"), etc.
Syntactic / Grammatic Analysis: Upon generating a sequence of tokens (or perhaps while you're doing so) you need to be able to analyze the structure to determine if the sequence of tokens is consistent with your language design. You generally need some sort of parser for this, though if the language structure is not very complicated, you may be able to do it with a finite state machine instead. In general (and since you want nesting structures in your case in particular) you will need to use one of the techniques Ken Bloom describes.
So in response to your questions:
Should I use an enum or an abstract class + derivatives for my states?
I found that for small tokenizers, a matrix of state/transition values is suitable, something like next_state = state_transitions[current_state][current_input_char]. In this case, next_state and current_state are some integer type (possibly an enumerated type). Input errors are detected when you transition to an invalid state. The end of a token is identified when you are in a valid end state and there is no valid transition to another state for the next input character. If you're concerned about space, you could use a vector of maps instead. Making the states classes is possible, but I think that's probably making things more difficult than you need.
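A toy sketch of that table-driven idea (states, character classes, and names all invented; a real lexer would track accepting states and token boundaries as described above):
#include <string>
// Hypothetical sketch of next_state = state_transitions[current_state][char_class]
// for a lexer that recognizes identifiers ([A-Za-z][A-Za-z0-9]*) and integers ([0-9]+).
enum State     { START, IN_IDENT, IN_NUMBER, ERROR_STATE, NUM_STATES };
enum CharClass { LETTER, DIGIT, OTHER, NUM_CLASSES };
static const State state_transitions[NUM_STATES][NUM_CLASSES] = {
    /* START       */ { IN_IDENT,    IN_NUMBER,   ERROR_STATE },
    /* IN_IDENT    */ { IN_IDENT,    IN_IDENT,    ERROR_STATE },  // digits allowed after the first letter
    /* IN_NUMBER   */ { ERROR_STATE, IN_NUMBER,   ERROR_STATE },
    /* ERROR_STATE */ { ERROR_STATE, ERROR_STATE, ERROR_STATE },
};
static CharClass classify(char c) {
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return LETTER;
    if (c >= '0' && c <= '9') return DIGIT;
    return OTHER;
}
// Feeds a whole string through the table; a real tokenizer would instead stop
// at the last valid accepting state to delimit the end of the current token.
static State run(const std::string& input) {
    State state = START;
    for (char c : input)
        state = state_transitions[state][classify(c)];
    return state;
}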
When reading a list, I need to NOT ignore \n.
You can either create a token called "\n", or a more generalized escape token (an identifier preceded by a backslash). If you're talking about identifying line breaks in the source, then those are simply characters you need to create transitions for in your state transition matrix (be aware of the difference between Unix and Windows line breaks, however; you could create an FSM that operates on either).
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
This is where you will need a grammar or pushdown automaton unless you can guarantee that the nesting will not exceed a certain level. Even then, it will likely make your FSM very complex.
Here's the draft states I have in mind: ...
See my comments on lexical and grammatical analysis above.
For parsing I always try to use something already proven to work: ANTLR with ANTLRWorks which is of great help for designing and testing a grammar. You can generate code for C/C++ (and other languages) but you need to build the ANTLR runtime for those languages.
Of course, if you find flex or bison easier to use, you can use them too (I believe they generate only C and C++, but I may be wrong since I haven't used them for some time).