Tokens and Grammar - c++

I'm reading through Programming: Principles and Practice Using C++. In Chapter 6 we're creating a calculator and using tokens to identify each element of the expression one by one. Then we use a grammar to set the rules for each element (I believe?). I don't understand either of these properly. The author jumps through some of this text without properly explaining it and assumes that you just "get" what's going on. Or I'm missing something.
Tokens:
So I understand that we "split" (tokenize?) each element so they can be evaluated separately. But I don't understand how the tokens are created.
Here is their example:
class Token {       
public:
char kind;
double value;
};
From what I understand, we create a class called Token and make its members public. Inside this class we define two variables, kind and value. Next, do we initialise the variables as below?
Token t;                // t is a Token
t.kind = '+';          // t represents a +
Token t2;                  // t2 is another Token
t2.kind = '8';             // we use the digit 8 as the “kind” for numbers
t2.value = 3.14;
So my question about tokens is: why are the values '+', '8' and 3.14? Could they be anything, or is there a reason it's '8'? And why the value 3.14? Can t.kind be '-' or '*', etc.?
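As a rough illustration of how such tokens might be produced: for an operator the kind is simply the character itself (so '-', '*' and so on all work), '8' is only an arbitrary tag meaning "this token is a number", and 3.14 is just a sample value. The `tokenize` helper below is hypothetical (it is not the book's `Token_stream`) and assumes the input contains only numbers, the operators shown and whitespace.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Same shape as the Token in the question: a kind tag plus a value.
struct Token {
    char kind;     // '+', '-', '*', '/', '%', '(' or ')' for symbols; '8' marks a number
    double value;  // meaningful only when kind == '8'
};

// Hypothetical helper: split an expression string into Tokens.
std::vector<Token> tokenize(const std::string& text) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        const char c = text[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;  // whitespace separates tokens but is not a token itself
        } else if (std::isdigit(static_cast<unsigned char>(c)) || c == '.') {
            std::size_t used = 0;
            const double v = std::stod(text.substr(i), &used);  // read the whole number
            tokens.push_back(Token{'8', v});                    // every number gets kind '8'
            i += used;
        } else if (std::string("+-*/%()").find(c) != std::string::npos) {
            tokens.push_back(Token{c, 0.0});  // an operator is its own kind; value is unused
            ++i;
        } else {
            throw std::runtime_error("bad token");
        }
    }
    return tokens;
}
```

Calling `tokenize("45+11.5/7")` yields five tokens: a number (45), '+', a number (11.5), '/', and a number (7).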
Grammar:
So I've now learned that a grammar can be defined using any terminology I want, but the following grammar doesn't make sense to me.
// a simple expression grammar:
Expression:
    Term
    Expression "+" Term         // addition
    Expression "-" Term         // subtraction
Term:
    Primary
    Term "*" Primary            // multiplication
    Term "/" Primary            // division
    Term "%" Primary            // remainder (modulo)
Primary:
    Number
    "(" Expression ")"          // grouping
Number:
    floating-point-literal
Could someone possibly come up with a smaller example? I don't understand how he got the above grammar from 45+11.5/7. I understand that we need to set rules for this program so that it will evaluate *, / and () before + and -. But how does the above grammar accomplish that?
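One way to see how the grammar enforces precedence is to write one function per rule. An Expression only adds or subtracts whole Terms, and a Term has already finished all of its multiplications and divisions before it returns, so * and / bind tighter than + and -. The sketch below is a minimal illustration under some assumptions: the `Parser` class and `evaluate` function are hypothetical names, it reads characters directly instead of using the book's Token stream, it leaves out "%", and it does not check for trailing garbage.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical minimal parser: one member function per grammar rule.
class Parser {
public:
    explicit Parser(const std::string& text) : text_(text), pos_(0) {}

    // Expression: Term, then any number of ("+" | "-") Term
    double expression() {
        double left = term();
        while (true) {
            const char op = peek();
            if (op == '+')      { ++pos_; left += term(); }
            else if (op == '-') { ++pos_; left -= term(); }
            else return left;
        }
    }

private:
    // Term: Primary, then any number of ("*" | "/") Primary
    double term() {
        double left = primary();
        while (true) {
            const char op = peek();
            if (op == '*') { ++pos_; left *= primary(); }
            else if (op == '/') {
                ++pos_;
                const double d = primary();
                if (d == 0) throw std::runtime_error("divide by zero");
                left /= d;
            }
            else return left;
        }
    }

    // Primary: Number | "(" Expression ")"
    double primary() {
        if (peek() == '(') {
            ++pos_;
            const double v = expression();  // a full Expression restarts inside parentheses
            if (peek() != ')') throw std::runtime_error("')' expected");
            ++pos_;
            return v;
        }
        std::size_t used = 0;
        const double v = std::stod(text_.substr(pos_), &used);  // Number
        pos_ += used;
        return v;
    }

    // Skip blanks and return the next character, or '\0' at end of input.
    char peek() {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
        return pos_ < text_.size() ? text_[pos_] : '\0';
    }

    std::string text_;
    std::size_t pos_;
};

double evaluate(const std::string& text) { return Parser(text).expression(); }
```

For 45+11.5/7, `expression()` reads the Term 45, sees '+', and asks `term()` for the right-hand side; `term()` reads 11.5, sees '/', and divides by 7 before returning. The addition therefore happens last, giving roughly 46.64.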

Related

How to interpret Regex subtraction with grouping

I would be grateful if someone could explain how the following regex should be interpreted; it is from the W3C reference for Namespaces in XML 1.0, and defines an NCName ([4]) as:
Name - (Char* ':' Char*) /* An XML Name, minus the ":" */
I can understand subtraction when applied to lists, such as:
[a-z-[aeiuo]] representing the list of all consonants (see http://www.regular-expressions.info/charclasssubtract.html), but not when applied to a group (apologies if this is the wrong term) as shown above.
The comment indicates how I should interpret the regex, but I'm struggling; why not just:
Name - ( ':' )
if the intention is for NCName to be Name minus ':', then why are the zero or more characters required on either side (I'm not asking a separate question, just indicating my area of confusion)?
Please accept my thanks in advance.
The documents published by W3C use a variant of the EBNF Notation to describe the languages standardized by them.
It is described in section "6 Notation" of the XML Recommendation.
The example you posted:
NCName ::= Name - (Char* ':' Char*) /* An XML Name, minus the ":" */
How to read it:
NCName is the object described by the rule;
::= separates the name of the described object (on the left) from the expression that describes it (on the right);
Name is an object already described by another rule;
- is the except symbol; A - B in EBNF means "matches A but doesn't match B";
(...) - the parentheses create a group; they make the expression inside them behave as a single item;
Char is another object already described by another rule in the documentation; it basically means a Unicode character;
* - repetition, matches the previous item zero or more times;
':' - string in single or double quotes is a string literal; it represents itself; here, the colon character;
Put together, it means an NCName is a Name that doesn't contain :.
The comment seems incorrect (or maybe it is just badly worded).
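To see why the Char* on either side is needed, remember that A - B subtracts whole strings, not characters. Name - (Char* ':' Char*) removes every Name with a colon anywhere in it, whereas Name - ':' would remove only the single one-character string ":". A small sketch of the two readings, with hypothetical function names and assuming the input has already been validated as a Name:

```cpp
#include <string>

// Assumption: `name` already matches the XML `Name` production.

// Name - (Char* ':' Char*): reject every string that has a colon anywhere in it.
bool is_ncname(const std::string& name) {
    return name.find(':') == std::string::npos;
}

// What Name - ':' would mean instead: reject only the one-character string ":".
bool is_name_minus_single_colon(const std::string& name) {
    return name != ":";
}
```

With these, "xlink:href" is rejected by `is_ncname` but would still be accepted by `is_name_minus_single_colon`, which is not what the specification intends.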

How to read the identifier 'class' in Flex?

I am trying to write a compiler for the COOL language and am right now at lexical analysis. As I understand it, Flex matches the longest pattern.
Suppose the input is:
class A inherits B
Now suppose my token for class is returned by the following pattern:
^"class" return CLASS;
For my inherits token:
^"class"[ ]+[a-zA-Z]+[0-9]?[ ]+"inherits"[ ]+ return INHERITS;
Since Flex matches the longest pattern, it will always return INHERITS and never CLASS. Is there a workaround for this problem?
I can return the token for class alone here. But how do I return the token for inherits, since it MUST be preceded by a class token and its name, and followed by another string token?
If I try to impose those constraints on inherits, then Flex will match the longer pattern, not class alone.
Should I then return the enum/number for the class keyword individually? And if I do that, how do I identify the 'inherits' keyword?
EDIT:
class A inherits B {
main(): SELF_TYPE{...}
}
How does Flex match against main? My reflexer differentiates between TypeID, which is A, and main, which it declares an ObjectID. The only way it can do that is by looking ahead at the parenthesis: if it finds (, it declares an ObjectID. But if I do that, I run into the same problem as above: Flex will never match ( on its own but always main(.
You are trying to do too much in Flex, and perhaps you misunderstand the role and boundaries of the lexical phase. You shouldn't be attempting to parse the whole sentence with a Flex regex alone. Flex's job is to consume a stream of text, and convert it to a stream of integer tokens. The sentence you've provided:
class A inherits B
represents multiple tokens from a language that requires parsing. Flex is not a parser, it is a lexical scanner/tokenizer. (Technically it is a parser of bytes or characters, but you want to "parse" atomic units that represent the words of your language, not characters).
So there are 4 distinct tokens (atomic units), also known as TERMINALS in the above sentence: [CLASS, A, INHERITS, B].
You need an IDENTIFIER rule in Flex so that any word that is not a keyword falls through to IDENTIFIER. The tokens returned by Flex to the parser are then:
CLASS IDENTIFIER INHERITS IDENTIFIER
The job for Flex is to parse each word / token and convert the text to distinct integer values to be consumed by Bison or any other parser.
You typically have a Yacc/Bison BNF grammar to handle:
class_decl:
CLASS IDENTIFIER
| CLASS IDENTIFIER INHERITS IDENTIFIER
;
Your Lex rules would then look like this. You need to return the IDENTIFIER token to the parser while attaching the actual symbol (A, B), which you get from the yytext variable:
LETTER      [a-zA-Z_]
DIGIT       [0-9]
LETTERDIGIT [a-zA-Z0-9_]
%%
"class"     return(CLASS);
"inherits"  return(INHERITS);
{LETTER}{LETTERDIGIT}* {
    yylval.sym = new Symbol(yytext);
    yylval.sym->line = line;
    fprintf(stderr, "TOKEN IDENTIFIER(%s)\n", yytext);
    return(IDENTIFIER);
}
If you are really trying to do all of this within Flex, then it is possible, but you will end up with a mess, much like trying to parse HTML with regex... :)
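The division of labour can also be sketched by hand, without Flex. The scanner below reads one whole word at a time and only afterwards decides whether it is a keyword or an identifier, which has the same effect as the keyword rules plus the catch-all identifier rule above. All names here (`Kind`, `Lexeme`, `scan`) are hypothetical, and the sketch simply skips anything that is not a word.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class Kind { Class, Inherits, Identifier };

struct Lexeme {
    Kind kind;
    std::string text;  // the matched word, comparable to yytext
};

// Hypothetical hand-written scanner: words only, everything else is skipped.
std::vector<Lexeme> scan(const std::string& input) {
    std::vector<Lexeme> out;
    std::size_t i = 0;
    while (i < input.size()) {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isalpha(c) || c == '_') {
            const std::size_t start = i;
            // Consume the longest run of letters, digits and underscores.
            while (i < input.size() &&
                   (std::isalnum(static_cast<unsigned char>(input[i])) || input[i] == '_')) {
                ++i;
            }
            const std::string word = input.substr(start, i - start);
            Kind kind = Kind::Identifier;           // default: fall through to identifier
            if (word == "class") kind = Kind::Class;
            else if (word == "inherits") kind = Kind::Inherits;
            out.push_back(Lexeme{kind, word});
        } else {
            ++i;  // whitespace, braces, parentheses: ignored in this sketch
        }
    }
    return out;
}
```

For the input "class A inherits B" this produces CLASS, IDENTIFIER(A), INHERITS, IDENTIFIER(B). Deciding that INHERITS must follow CLASS IDENTIFIER is then left to the parser's grammar rule.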

Mapping a simple expression to a valid regex

I have an application where grammar school teachers can place an answer box on a page after a question. The answer box is configured by the teacher with an answer line that specifies acceptable answers to the question. I don't expect them to give me a valid regular expression for the answer so I let them write the answer in a simplified form, where a '*' represents 0 or more of anything and ',' separates multiple acceptable answers. So an answer line that contained
*cup,glass
would accept 'teacup' , 'coffee cup' , 'cup' or 'glass' but not 'cup holder'.
Is there a way I can map the answer line they provide to a single regex that I can compare the student's answer with to give me a true or false answer, i.e., it's an acceptable answer to the question, or it isn't?
Thanks
The language isn't specified in the question as I write this - the exact form of the answer will depend heavily on that. Let's assume JavaScript, as most of the poster's tags seem JavaScript-related.
function toRegexp(e) {
    return new RegExp(
        "^(?:"+
        e.split(/,/).map(
            function(x){
                return x.replace(/([.?+^$[\]\\(){}|-])/g, "\\$1").replace(/\*/g,".*");
            }
        ).join("|")+
        ")$", "i");
}
(With thanks to this answer for the bit that escapes the special characters.)
I'm not sure what language you are doing this in, but given an input string, e.g. *cup,glass:
add ^( to the start
add )$ to the end
replace all * with .*
replace all , with |
Giving ^(.*cup|glass)$.
All of those steps should be pretty trivial in any language.
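The steps above, plus escaping of regex metacharacters, can be sketched in C++ with std::regex. This is only an illustration under stated assumptions: the function names `answer_line_to_pattern` and `is_accepted` are hypothetical, the default ECMAScript regex grammar is used, and matching is case-insensitive to mirror the "i" flag in the JavaScript version.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: turn a teacher's answer line into an anchored regex pattern.
std::string answer_line_to_pattern(const std::string& line) {
    const std::string special = ".?+^$[]\\(){}|";  // characters that need a backslash
    std::string pattern = "^(?:";
    std::istringstream parts(line);
    std::string part;
    bool first = true;
    while (std::getline(parts, part, ',')) {  // ',' separates acceptable answers
        if (!first) pattern += '|';
        first = false;
        for (char c : part) {
            if (c == '*') {
                pattern += ".*";              // '*' means "zero or more of anything"
            } else {
                if (special.find(c) != std::string::npos) pattern += '\\';
                pattern += c;
            }
        }
    }
    pattern += ")$";
    return pattern;
}

// True if the student's answer matches any alternative in the answer line.
bool is_accepted(const std::string& line, const std::string& answer) {
    const std::regex re(answer_line_to_pattern(line),
                        std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(answer, re);
}
```

For the answer line `*cup,glass` the generated pattern is `^(?:.*cup|glass)$`, which accepts "teacup", "coffee cup", "cup" and "glass" but rejects "cup holder".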
Thanks for all your input. The solution I arrived at is http://jsfiddle.net/76zXf/12/. This seems to do everything I asked for, plus allow any number of spaces before and after the correct answer, and allow "+" and "-" in the answer. (So an Answer Line of "3,+3,-3" works fine for the question "What is the square root of 9?") I was able to do it with a simpler build function than proprelkey posted:
$('#answerLine').change(function() {
    s = this.value.replace(/\*/g,'.*'); // change all "*" to ".*" (the g flag replaces every occurrence)
    s = s.replace(/\+/g,'\\+'); // change all "+" to "\+"
    s = s.replace(/\-/g,'\\-'); // change all "-" to "\-"
    a1 = s.split(/,/); // get individual terms into an array
    a2 = a1.map( // for each term . . .
        function(x){
            exp = '^\\s*' + x + '\\s*$'; // build complete reg expression
            return( exp ); // return this expression to array
        }
    );
    re = RegExp( a2.join("|"),'i'); // our final, complete regExp
});
I hope I'm not missing any important cases. Thanks again.

'R: Invalid use of repetition operators'

I'm writing a small function in R as follows:
tags.out <- as.character(tags.out)
tags.out.unique <- unique(tags.out)
z <- NROW(tags.out.unique)
for (i in 1:10) {
l <- length(grep(tags.out.unique[i], x = tags.out))
tags.count <- append(x = tags.count, values = l) }
Basically I'm looking to take each element of the unique character vector (tags.out.unique) and count its occurrences in the vector as it was before the unique function was applied.
The above section of code works correctly. However, when I replace for (i in 1:10) with for (i in 1:z), or even some number larger than 10 (18000 for example), I get the following error:
Error in grep(tags.out.unique[i], x = tags.out) :
invalid regular expression 'c++', reason 'Invalid use of repetition operators'
I would be extremely grateful if anyone were able to help me understand what's going on here.
Many thanks.
The "+" in "c++" (which you're passing to grep as a pattern string) has a special meaning. However, you want the "+" to be interpreted literally as the character "+", so instead of
grep(pattern="c++", x="this string contains c++")
you should do
grep(pattern="c++", x="this string contains c++", fixed=TRUE)
If you google [regex special characters] or something similar, you'll see that "+", "*" and many others have a special meaning. In your case you want them to be interpreted literally -- see ?grep.
It would appear that one of the elements of tags.out.unique is c++, which is (as the error message plainly states) an invalid regular expression.
You are currently programming inefficiently. The R-inferno is worth a read, noting especially that Growing objects is generally bad form -- it can be extremely inefficient in some cases. If you are going to have a blanket rule, then "not growing objects" is a better one than "avoid loops".
Given that you are simply trying to count the number of times each value occurs, there is no need for the loop or the regex:
counts <- table(tags.out)
# the unique values
names(counts)
should give you the results you want.
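For comparison, here is the same idea outside R: counting by exact string equality means a value such as "c++" is never interpreted as a pattern, so no escaping is needed. This is only a sketch in C++; the `count_values` name is hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Count how often each distinct string occurs, comparing whole strings literally.
std::map<std::string, int> count_values(const std::vector<std::string>& values) {
    std::map<std::string, int> counts;
    for (const std::string& v : values) {
        ++counts[v];  // operator[] creates a zero entry the first time a value is seen
    }
    return counts;
}
```

The keys of the returned map play the role of `names(counts)` and the mapped integers are the counts.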

Writing regular expressions and rules in Sublime Text 2 syntax definitions

I'm very interested in Syntax Definitions for Sublime Text 2.
I've studied the basics, but I don't know how to write an RE (and rules) for something like
variable = sentence, i.e. myvar = func(foo, bar) + baz
I can't write anything better than ^\s*([^=\n]+)=([^=\n]+\n) (that doesn't work)
How do I write this RE properly?
Also, I have some difficulty defining an RE for a block:
IF i FROM .. TO ..
...
ELSE
...
END IF
How do I write it?
In this case you have to write a parser. A regex won't work because the patterns may vary. You've already noticed it when you stated 'variable = sentence'.
For this, you can use Spoofax or JavaCUP for grammar definitions. I'll give you a snippet in JavaCUP:
Scanner issues: suppose 'variable' follows the pattern: (_|[a-zA-Z])(_|[a-zA-Z])*
and 'number' is: ([0-9])+
Note that number could be any decimal or int, but here I state it as that pattern, supposing my language only deals with integer (or whatever that pattern means :) ).
Now we can declare our grammar following the JavaCUP syntax. Which is more or less like:
expression ::= variable "=" sentence;
sentence ::= sentence "+" sentence;
sentence ::= sentence "-" sentence;
sentence ::= sentence "*" sentence;
sentence ::= sentence "/" sentence;
sentence ::= number;
...and that goes further.
If you've never taken a compilers class, this may seem very difficult to follow. There are also many restrictions on the grammar needed to avoid infinite loops in the parser, depending on which kind you're using (LR or LL).
Anyway, the real answer to your question is: you can't do what you want with regex alone; you'll need more concepts.
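To give a feel for what "more concepts" means in practice, here is a small hand-written recognizer for a simplified form of the grammar above. A rule such as sentence ::= sentence "+" sentence is left-recursive, and a naive top-down (LL-style) function for it would call itself forever without consuming input, so the sketch rewrites it as "a number followed by any number of operator-number pairs". The `AssignmentChecker` class and `is_assignment` function are hypothetical, only integer operands are handled, and function calls such as func(foo, bar) are out of scope.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical recognizer for:
//   assignment ::= variable "=" sentence
//   sentence   ::= number, then any number of ("+" | "-" | "*" | "/") number
class AssignmentChecker {
public:
    explicit AssignmentChecker(const std::string& s) : s_(s), i_(0) {}

    bool matches() {
        if (!variable()) return false;
        if (!symbol('=')) return false;
        if (!number()) return false;
        // Repetition replaces the left-recursive rule, so this loop always consumes input.
        while (symbol('+') || symbol('-') || symbol('*') || symbol('/')) {
            if (!number()) return false;
        }
        skip_blanks();
        return i_ == s_.size();  // the whole line must be consumed
    }

private:
    unsigned char at(std::size_t k) const { return static_cast<unsigned char>(s_[k]); }

    void skip_blanks() {
        while (i_ < s_.size() && std::isspace(at(i_))) ++i_;
    }

    bool symbol(char c) {
        skip_blanks();
        if (i_ < s_.size() && s_[i_] == c) { ++i_; return true; }
        return false;
    }

    // variable: (_|[a-zA-Z])(_|[a-zA-Z])*
    bool variable() {
        skip_blanks();
        const std::size_t start = i_;
        while (i_ < s_.size() && (std::isalpha(at(i_)) || s_[i_] == '_')) ++i_;
        return i_ > start;
    }

    // number: ([0-9])+
    bool number() {
        skip_blanks();
        const std::size_t start = i_;
        while (i_ < s_.size() && std::isdigit(at(i_))) ++i_;
        return i_ > start;
    }

    std::string s_;
    std::size_t i_;
};

bool is_assignment(const std::string& line) { return AssignmentChecker(line).matches(); }
```

`is_assignment("myvar = 1 + 2 * 3")` returns true, while `is_assignment("myvar = 1 +")` returns false because an operator must be followed by a number.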