How to interpret Regex subtraction with grouping - regex

I would be grateful if someone could explain how the following regex should be interpreted; it is from the W3C reference for Namespaces in XML 1.0, and defines an NCName ([4]) as:
Name - (Char* ':' Char*) /* An XML Name, minus the ":" */
I can understand subtraction when applied to lists, such as:
[a-z-[aeiuo]] representing the list of all consonants (see http://www.regular-expressions.info/charclasssubtract.html), but not when applied to a group (apologies if this is the wrong term) as shown above.
The comment indicates how I should interpret the regex, but I'm struggling; why not just:
Name - ( ':' )
If the intention is for NCName to be Name minus ':', then why are the zero or more characters required on either side? (I'm not asking a separate question, just indicating my area of confusion.)
Please accept my thanks in advance.

The documents published by W3C use a variant of the EBNF Notation to describe the languages standardized by them.
It is described in section "6 Notation" of the XML Recommendation.
The example you posted:
NCName ::= Name - (Char* ':' Char*) /* An XML Name, minus the ":" */
How to read it:
NCName is the object described by the rule;
::= separates the name of the described object (on the left) from the expression that describes it (on the right);
Name is an object already described by another rule;
- is the except symbol; A - B in EBNF means "matches A but doesn't match B";
(...) - the parentheses create a group; they make the expression inside them behave as a single item;
Char is another object already described by another rule in the documentation; it basically means a Unicode character;
* - repetition, matches the previous item zero or more times;
':' - string in single or double quotes is a string literal; it represents itself; here, the colon character;
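Put together, it means an NCName is a Name that doesn't contain ':'. The Char* on either side is needed because - subtracts whole strings: (Char* ':' Char*) matches every string that contains a colon, whereas Name - (':') would only exclude the single-character string ':'.
The comment seems incorrect (or maybe it is just badly worded).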
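To make the difference between the two subtractions concrete, here is a minimal Python sketch. It assumes a heavily simplified, ASCII-only stand-in for the Name production (the real rule in the XML Recommendation allows many more characters), and the function names are made up for illustration.

```python
import re

# Simplified stand-in for the XML "Name" production (ASCII only; assumption).
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")

# Char* ':' Char*  ->  any string that contains a colon anywhere.
CONTAINS_COLON = re.compile(r".*:.*", re.DOTALL)

# ':' on its own  ->  only the one-character string ":".
ONLY_COLON = re.compile(r":")


def is_ncname(s: str) -> bool:
    """Name - (Char* ':' Char*): a Name that is not a string containing ':'."""
    return bool(NAME.fullmatch(s)) and not CONTAINS_COLON.fullmatch(s)


def is_name_minus_colon(s: str) -> bool:
    """Name - (':'): a Name that is not exactly the string ':'."""
    return bool(NAME.fullmatch(s)) and not ONLY_COLON.fullmatch(s)


for candidate in ["svg", "xlink:href", ":"]:
    print(candidate, is_ncname(candidate), is_name_minus_colon(candidate))
```

With the second definition, "xlink:href" would still be accepted, because the only string removed from Name is ":" itself.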

Related

Tokens and Grammar

I'm reading through Programming: Principles and Practice Using C++. In Chapter 6 we're creating a calculator and using tokens to identify each character of the equation one by one. Then we use a grammar to set the rules for each element (I believe?). I don't understand either of these properly. He jumps through some of this text without properly explaining and assumes that you just "get" what's going on. Or I'm missing something.
Tokens:
So I understand that we "split" (or tokenize?) each element so they can be evaluated separately. But I don't understand how they are created.
Here is their example:
class Token {
public:
    char kind;
    double value;
};
So from what I understand: we create a class called Token. We make it public. Then inside this class we define 2 variables, kind and value. Next, do we initialise the variables below?
Token t;                // t is a Token
t.kind = '+';          // t represents a +
Token t2;                  // t2 is another Token
t2.kind = '8';             // we use the digit 8 as the “kind” for numbers
t2.value = 3.14;
So my question for tokens is: why are the values '+', '8' and 3.14? Could they be anything, or is there a reason it's '8'? And why the value 3.14? Can t.kind be '-' or '*', etc.?
Grammar:
So I've now learned that a grammar can be defined with any terminology I want, but the following grammar doesn't make sense to me.
// a simple expression grammar:
Expression:
    Term
    Expression "+" Term         // addition
    Expression "-" Term         // subtraction
Term:
    Primary
    Term "*" Primary            // multiplication
    Term "/" Primary            // division
    Term "%" Primary            // remainder (modulo)
Primary:
    Number
    "(" Expression ")"          // grouping
Number:
    floating-point-literal
Could someone possibly come up with a smaller example? I don't understand how he got the above grammar from 45+11.5/7. I understand that we need to set a rule for this program so that it will evaluate *, / and () before + and -. But how does the above grammar accomplish that?
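For reference, a smaller version of that grammar (without %) can be sketched as a recursive-descent evaluator. This is an illustration in Python rather than the book's C++ code; the names tokenize, Parser and evaluate are hypothetical, and "num" plays the role the book gives to the '8' kind. Each grammar rule becomes one function, and because expression() can only obtain its operands by calling term(), every * and / is finished before a + or - is applied.

```python
import re


def tokenize(text):
    """Split text into (kind, value) pairs; numbers get kind "num"."""
    tokens = []
    for number, op in re.findall(r"\s*(?:(\d+\.?\d*)|(\S))", text):
        tokens.append(("num", float(number)) if number else (op, None))
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # Kind of the next token, or None at end of input.
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def next(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expression(self):
        # Expression: Term | Expression "+" Term | Expression "-" Term
        left = self.term()
        while self.peek() in ("+", "-"):
            op, _ = self.next()
            right = self.term()
            left = left + right if op == "+" else left - right
        return left

    def term(self):
        # Term: Primary | Term "*" Primary | Term "/" Primary
        left = self.primary()
        while self.peek() in ("*", "/"):
            op, _ = self.next()
            right = self.primary()
            left = left * right if op == "*" else left / right
        return left

    def primary(self):
        # Primary: Number | "(" Expression ")"
        kind, value = self.next()
        if kind == "num":
            return value
        if kind == "(":
            result = self.expression()
            if self.peek() != ")":
                raise ValueError("expected ')'")
            self.next()
            return result
        raise ValueError(f"unexpected token {kind!r}")


def evaluate(text):
    return Parser(tokenize(text)).expression()


print(evaluate("2+3*4"))    # 14.0
print(evaluate("(2+3)*4"))  # 20.0
```

For 45+11.5/7, expression() reads the Term 45, sees "+", and asks term() for the right-hand side; term() consumes all of 11.5/7 before returning, so the sum is 45 + (11.5 / 7).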

Creating a rule for a printable character in lex/yacc

I would like to create a grammar rule for a printable character (any character for which the C isprint() function returns true).
For this purpose I created the following regex rule inside my lex file:
[\x20-\x7E] { yylval.ch = strdup(yytext); return CHARACTER; }
The regular expression contains all the printable characters based on their ASCII hexadecimal value.
On my first attempt this rule was located at the bottom, but any printable character that was already matched by an earlier rule obviously wasn't included. For example, if my input was the character '+' and I had a previous rule:
"+" { return PLUS_OPERATOR; }
The parser accepted it as a PLUS_OPERATOR and not as CHARACTER.
Then I tried to place the character rule at the top of my scanner, and for the same reason as before, none of the following rules for characters in the printable range could be matched.
My question is: what can I do to create a rule that will match all printable characters, but also keep rules for specific characters?
The only thing that I can think of is to put it at the bottom and use a grammar rule combining all the one-character token rules and the character rule (e.g. CHAR : PLUS_OPERATOR | MINUS_OPERATOR | EQUAL_OPERATOR | CHARACTER).
I have a lot more than 3 one-character rules in my lex file, so obviously I'm looking for a more elegant solution.
The only solution is the one you propose: create a non-terminal which is the union of all the relevant terminals.
Personally, I find grammars much more readable if single-character tokens are written as themselves, so I would write:
printable: '+' | '-' | '=' | CHAR
in the bison file, and in the scanner:
[-+=] { yylval.ch = yytext[0]; return yylval.ch; }
[[:print:]] { yylval.ch = yytext[0]; return CHAR; }
(which in turn requires the semantic type to be a union with both char and char* fields; the advantage is that you don't need to worry about freeing the strings created for operator characters.)
That is about as elegant as it gets, I'm afraid.
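The split between scanner and grammar can be illustrated with a small Python simulation. This is only a sketch of the idea, not lex/bison code: scan and is_printable_token are hypothetical names, only three operator characters are handled, and the printable range is assumed to be ASCII 0x20 to 0x7E as in the question.

```python
OPERATORS = set("+-=")


def scan(text):
    """Yield (token_type, value) pairs, mimicking the two scanner rules."""
    for ch in text:
        if ch in OPERATORS:
            # Rule [-+=]: the token type is the character itself.
            yield (ch, ch)
        elif 0x20 <= ord(ch) <= 0x7E:
            # Catch-all rule: any other printable character.
            yield ("CHAR", ch)
        # Non-printable input (e.g. a newline) is ignored in this sketch.


def is_printable_token(token_type):
    """Grammar side: printable: '+' | '-' | '=' | CHAR"""
    return token_type in OPERATORS or token_type == "CHAR"


tokens = list(scan("a+b"))
print(tokens)
print(all(is_printable_token(kind) for kind, _ in tokens))
```

The scanner still reports '+' as its own token so that other grammar rules can use it, while the printable nonterminal accepts it alongside CHAR.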

In Boost Spirit Qi, how do I match every character up to the next whitespace (with pre-skip)

Within a boost::spirit::qi grammar rule, how do you match a string of characters up to and excluding the next whitespace character, as defined by the supplied skipper?
For example, if the grammar is a set of attributes defined as:
attributeList = '(' >> *attribute >> ')';
attribute = (name : value) | (name : value units);
How do I match any character for name up to and excluding the first skipper character?
For example, for name, I would like to pre-skip, then match all characters except ':' or a skipper character. Do I have to instantiate a skipper within the grammar class, so that I can do something like:
name = +qi::char_ !(skipper | ':');
or can I access the existing supplied skipper object somehow and reference it directly? Also, I don't believe this needs to be wrapped in qi::lexeme[]...
Thanks in advance for correcting the error of my ways.
In order to do this, you'll need to suppress skipping, so qi::lexeme will have to be involved (or at least qi::no_skip, but you'd only use it to reimplement qi::lexeme), and to do precisely what you write you'll also need the skip parser. Then you could write
qi::lexeme[ +(qi::char_ - ':' - skipper) ]
The requirements seem rather lax, though. It is unusual to allow even non-printable characters such as the bell sign (ASCII 7) in identifiers. I don't know what exactly you're trying to do, so I can't answer such design questions for you, but to me it seems like there's a good chance you'd be happier with a more standard rule such as
qi::lexeme[ qi::alpha >> *qi::alnum ]
(for a very simple example. Your mileage may vary wrt underscores etc.)

How to read the identifier 'class' in Flex?

I am trying to write a compiler for the COOL language and am right now at lexical analysis. As I understand it, Flex matches the longest pattern.
Thus if you have in Flex:
class A inherits B
Now if my token for class is returned by following pattern:
^"class" return CLASS;
For my inherits token:
^"class"[ ]+[a-zA-Z]+[0-9]?[ ]+"inherits"[ ]+ return INHERITS;
Now since flex matches the largest pattern, it will always return INHERITS and never class. Is there a work around to this problem?
I can here return token for class alone. But how do I return token for inherits since it MUST be preceded by a class token and its name followed by another string token?
But if I try to impose constraints on inherits, then flex will match the largest pattern not class alone.
Then should I return the enums/number for class identifier individually? And if I do that, how do I identify 'inherits' identifier?
EDIT:
class A inherits B {
main(): SELF_TYPE{...}
}
How does Flex match against main? My reflexer differentiates between a TypeID, which is A, and main, which it declares an ObjectID. The only way it can do that is by looking ahead at the parenthesis, and if it finds (, it declares an ObjectID. But if I do that, I encounter the same problem as above: Flex will never match against ( but always main(.
You are trying to do too much in Flex, and perhaps you misunderstand the role and boundaries of the lexical phase. You shouldn't be attempting to parse the whole sentence with a Flex regex alone. Flex's job is to consume a stream of text, and convert it to a stream of integer tokens. The sentence you've provided:
class A inherits B
represents multiple tokens from a language that requires parsing. Flex is not a parser, it is a lexical scanner/tokenizer. (Technically it is a parser of bytes or characters, but you want to "parse" atomic units that represent the words of your language, not characters).
So there are 4 distinct tokens (atomic units), also known as TERMINALS in the above sentence: [CLASS, A, INHERITS, B].
You need an IDENTIFIER rule for Flex, such that anything that doesn't match a token, falls through to an IDENTIFIER, so the tokens returned by Flex to the parser are:
CLASS IDENTIFIER INHERITS IDENTIFIER
The job for Flex is to parse each word / token and convert the text to distinct integer values to be consumed by Bison or any other parser.
You typically have a Yacc/Bison BNF grammar to handle:
class_decl:
CLASS IDENTIFIER
| CLASS IDENTIFIER INHERITS IDENTIFIER
;
So your Lex rules would look like this. You need to return the IDENTIFIER token to the parser, while attaching the actual symbol (A, B), which you get from the yytext variable:
LETTER [a-zA-Z_]
DIGIT [0-9]
LETTERDIGIT [a-zA-Z0-9_]
%%
"class" return(CLASS);
"inherits" return(INHERITS);
{LETTER}{LETTERDIGIT}* {
    yylval.sym = new Symbol(yytext);
    yylval.sym->line = line;
    fprintf(stderr, "TOKEN IDENTIFIER(%s)\n", yytext);
    return(IDENTIFIER);
}
If you are really trying to do all of this within Flex, then it is possible, but you will end up with a mess, like if you try to parse HTML with regex... :)
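To show the token stream the parser would receive, here is a rough Python sketch of the same idea. It is not how Flex works internally: Flex picks the longest match and breaks ties by rule order, whereas this sketch matches an identifier first and then looks it up in a keyword table. The names KEYWORDS and tokenize are made up for the example.

```python
import re

KEYWORDS = {"class": "CLASS", "inherits": "INHERITS"}
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(source):
    """Return a list of (token_type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1  # whitespace only separates tokens
            continue
        match = IDENTIFIER.match(source, pos)
        if match:
            word = match.group()
            # Keywords get their own token type; everything else is IDENTIFIER.
            tokens.append((KEYWORDS.get(word, "IDENTIFIER"), word))
            pos = match.end()
        else:
            # Punctuation such as "(" or "{" is returned as itself.
            tokens.append((source[pos], source[pos]))
            pos += 1
    return tokens


print(tokenize("class A inherits B"))
print(tokenize("main()"))
```

The scanner never needs to know that inherits follows a class name, or that main is followed by a parenthesis; those relationships are expressed in the grammar, which sees only the sequence of token types.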

Writing regular expressions and rules in Sublime Text 2 syntax definitions

I'm very interested in syntax definitions for Sublime Text 2.
I've studied the basics, but I don't know how to write a regex (and rules) for something like
variable = sentence, i.e. myvar = func(foo, bar) + baz
I can't write anything better than ^\s*([^=\n]+)=([^=\n]+\n) (that doesn't work)
How do I write this regex in the proper way?
Also, I have some difficulties in defining a regex for a block
IF i FROM .. TO ..
...
ELSE
...
END IF
How do I write it?
In this case you have to write a parser. A regex won't work because the patterns may vary. You've already noticed it when you stated 'variable = sentence'.
For this, you can use Spoofax or JavaCUP for grammar definitions. I'll give you a snippet in JavaCUP:
Scanner issues: suppose 'variable' follows the pattern: (_|[a-zA-Z])(_|[a-zA-Z])*
and 'number' is: ([0-9])+
Note that number could be any decimal or int, but here I state it as that pattern, supposing my language only deals with integer (or whatever that pattern means :) ).
Now we can declare our grammar following the JavaCUP syntax. Which is more or less like:
expression ::= variable "=" sentence;
sentence ::= sentence "+" sentence;
sentence ::= sentence "-" sentence;
sentence ::= sentence "*" sentence;
sentence ::= sentence "/" sentence;
sentence ::= number;
...and that goes further.
If you've never taken a compilers class, this may seem very difficult to follow. Plus, there are a lot of grammar restrictions to avoid infinite loops in the parser, depending on which kind you're using (LR or LL).
Anyway, the real answer to your question is: you can't do what you want with regex alone; you'll need more concepts.