How to specify a specific string in Regex

I'm tinkering around with flex and bison to create a small calculator program. The token will be something like this:
read A
read B
sum := A + B
write sum
read and write will be keywords indicating reading a value in or writing a value to the output. ":=" is the assignment operator. A and B are identifiers, which can be strings. There will also be line comments (//comment) and block comments (/* asdfsd */).
Would these regular expressions be correct for the little grammar I described?
[:][=] //assignment operator
[ \t] //skipping whitespace
[a-zA-Z0-9]+ //identifiers
[Rr][Ee][Aa][Dd] //read symbols, not case-sensitive
[/][/] //comment
For the assignment operator and the comment regex, can I just do this instead? Would flex and bison accept it?
":=" //assignment operator
"//" //comment

Yes, ":=" and "//" will work, though the comment rule should really be "//".* because you want to skip everything after the // (until the end of the line). If you just match "//", flex will try to tokenize what comes after it, which you don't want because a comment doesn't have to consist of valid tokens (and even if it did, those tokens should not be seen by the parser).
Furthermore, [Rr][Ee][Aa][Dd] should be placed before the identifier rule. Otherwise it will never be matched (because if two rules can match the same lexeme, flex will pick the one that comes first in the file). It can also be written more succinctly as (?i:read), or you can enable case insensitivity globally with %option caseless and just write read.
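As a rough illustration of both points, here is a small Python emulation of how flex resolves conflicts: longest match first, then rule order on ties. This is not flex itself; the names RULES and tokenize, the OP rule, and the decision to skip newlines as whitespace are assumptions made up for the sketch.

```python
import re

# Hypothetical rule table mirroring the flex rules discussed above.
# Order only matters when two rules match the same number of characters.
RULES = [
    ("COMMENT", re.compile(r"//.*")),         # "//".* : skipped
    ("SPACE",   re.compile(r"[ \t\n]+")),     # whitespace: skipped
    ("ASSIGN",  re.compile(r":=")),
    ("READ",    re.compile(r"read", re.I)),   # must precede IDENT
    ("WRITE",   re.compile(r"write", re.I)),
    ("IDENT",   re.compile(r"[a-zA-Z0-9]+")),
    ("OP",      re.compile(r"[-+*/]")),       # assumed operator set
]

def tokenize(text):
    """Return (name, lexeme) pairs: longest match wins, first rule wins ties."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly greater, so an earlier rule keeps a tie.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if best_name not in ("COMMENT", "SPACE"):
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("Read A // not := tokens\nreader := A + B"))
```

In this sketch "Read" comes out as READ because the keyword rule is listed before IDENT and both match four characters, "reader" comes out as IDENT because the identifier rule matches more characters, and nothing inside the comment is tokenized.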

You can start with the following (with the ignore-case option enabled):
(read|write)\s+[a-z]+ will match a read/write statement;
[a-z]+\s:=[a-z+\/* -]* will match an assignment with simple arithmetic;
\/\/.* will match an inline comment;
\/\*[\s\S]*\*\/ will match multi-line comments.
Keep in mind that these are basic regexes and may not be suitable for more complex syntax.
You can try them out on Regex101.com, for example.
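One caveat worth checking with the block-comment pattern above: [\s\S]* is greedy, so when a file contains more than one block comment it runs from the first /* to the last */. The Python snippet below is only a sketch of that behaviour in a backtracking regex engine; the lazy variant relies on the *? quantifier, which flex does not support.

```python
import re

source = "a /* one */ b /* two */ c"

greedy = re.compile(r"/\*[\s\S]*\*/")    # pattern from the list above
lazy = re.compile(r"/\*[\s\S]*?\*/")     # non-greedy variant (not valid in flex)

greedy_matches = greedy.findall(source)
lazy_matches = lazy.findall(source)

print(greedy_matches)  # one match that swallows " b " between the comments
print(lazy_matches)    # two separate comments
```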

Related

Regular expression to count ALL newline characters in C++

I am trying to write a rules.l file to generate flex to read through any given input and print out every possible thing (for example - string, int, +, -, if, else, etc), along with its length, token, and what line it is on. Everything works as it should, except that it is not counting newline characters within a string literal.
I have googled my heart out and read all kinds of things, and they all say that just using the expression \n should allow me to count every newline in the text.
I also use [ \t] to eat whitespace.
My output should say:
< line: 14, lexeme: |"last"|
but instead it says:
> line: 10, lexeme: |"last"|
Any input/advice would be greatly appreciated!
Here is a bit of my .l file for context:
%option noyywrap
%{
int line_number = 1;
%}
%%
if { return TOK_IF; }
else { return TOK_ELSE; }
.
.
.
[a-zA-Z]([a-zA-Z]|[0-9]|"_")* { return TOK_IDENTIFIER; }
\"(\\.|[^"\\])*\" { return TOK_STRINGLIT; }
[ \t]+ ;
[\n] {++line_number;}
I'd suggest you add
%option yylineno
to your Flex file, and then use the yylineno variable instead of trying to count newlines yourself. Flex gets the value right, and usually manages to optimise the computation.
That said, \"([^"])*\" is not the optimal way to read string literals, because it will terminate at the first quote. That will fail disastrously if the string literal is "\"Bother,\" he said. \"It's too short.\""
Here's a better one:
\"(\\(.|\n)|[^\\"\n])*\"
(That will not match string literals which include unescaped newline characters; in C++, those are not legal. But you'll need to add another rule to match the erroneous string and produce an appropriate error message.)
I suppose it is possible that you must conform to the artificial requirements of a course designed by someone unaware of the yylineno feature. In that case, the simple solution of adding line_number = yylineno; at the beginning of every rule would probably be considered cheating.
What you will need to do is what flex itself does (but it doesn't make mistakes, and we programmers do): figure out which rules might match text including one or more newlines, and insert code in those specific rules to count the newlines matched. Typically, the rules in question are multi-line comments and string literals themselves (since the string literal might include a backslash line continuation.)
One way to figure out which rules might be matching newlines is to turn on the yylineno feature, and then examine the code generated by flex. Search for YY_RULE_SETUP in that file; the handler for every rule (including the ones whose action does nothing) starts with that macro invocation. If you have enabled %option yylineno, flex figures out which rules might match a newline, and inserts code before YY_RULE_SETUP to fix yylineno. These rules start with the comment /* rule N can match eol */, where N is the index of the rule. You'll need to count rules in your source file to match N with the line number. Or you can look for the #line directive in the generated code.
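To make the manual approach concrete, here is a minimal Python sketch of the bookkeeping: whatever text a rule consumes, count the newlines inside it rather than relying on a bare \n rule. The scanner loop and the name scan_strings are hypothetical; the pattern is a Python spelling of the string-literal regex suggested above. In a flex action the equivalent would be a loop over yytext.

```python
import re

# A quote, then escaped characters (including backslash-newline)
# or ordinary characters, then a closing quote.
STRING_LIT = re.compile(r'"(\\(.|\n)|[^\\"\n])*"')

def scan_strings(text):
    """Return (line where the literal starts, lexeme) for each string literal."""
    line_number, pos, found = 1, 0, []
    while pos < len(text):
        m = STRING_LIT.match(text, pos)
        # Anything that is not a string literal is consumed one character at a time.
        lexeme = m.group(0) if m else text[pos]
        if m:
            found.append((line_number, lexeme))
        # Count every newline in the consumed text, including those inside literals.
        line_number += lexeme.count("\n")
        pos += len(lexeme)
    return found

# The first literal contains a backslash line continuation.
sample = 'x = "first\\\nsecond"\ny = "last"\n'
print(scan_strings(sample))
```

Here "last" is reported on line 3; a scanner that only counted newlines matched by a standalone \n rule would report line 2, which is the kind of undercount described in the question.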

The most efficient lookahead substitute for jflex

I am writing a tokenizer in jflex. I need to match words like interferon-a as one token, and words like interferon-alpha as three.
Obvious solution would be lookaheads, but they do not work in jflex. For a similar task, I wrote a function matching one additional wildcard character after the matched pattern, checking if it is a whitespace in java code and pushing it back with or without a part of the matched string.
REGEX = [:letter:]+\-[:letter:]\.
From string interferon-alpha it would match interferon-al.
Then, in Java code section it would check if the last character of the match is a whitespace. It is not, so -al would be pushed back and interferon returned.
In the case of interferon-a, the whitespace would be pushed back and interferon-a returned.
However, this function does not work if matched string does not have anything succeeding. Also, it seems quite clunky. Hence, I was wondering if there is any 'nicer' way of ensuring that the following character is a whitespace without actually matching and returning it.
JFlex certainly has a lookahead facility, the same as (f)lex. Unlike Java regex lookahead assertions, the JFlex lookahead can only be applied at the end of a match, but it is otherwise similar. It is described in the Semantics section of the JFlex manual:
In a lexical rule, a regular expression r may be followed by a look-ahead expression. A look-ahead expression is either $ (the end of line operator) or / followed by an arbitrary regular expression. In both cases the look-ahead is not consumed and not included in the matched text region, but it is considered while determining which rule has the longest match…
So you could certainly write the rule:
[:letter:]+\-[:letter:]/\s
However, you cannot put such a rule in a macro definition (REGEX = …), as the manual also mentions (in the section on macros):
The regular expression on the right hand side must be well formed and must not contain the ^, / or $ operators.
So the lookahead operator can only be used in a pattern rule.
Note that \s matches any whitespace character, including newline characters, while . does not match any newline character. I think that's what led to your comment that REGEX = [:letter:]+\-[:letter:]\. "does not work if matched string does not have anything succeeding" (I'm guessing that you meant "does not have anything succeeding it on the same line", and also that you intended to write . rather than \.).
Rather than testing for following whitespace, you might (depending on your language) prefer to test for a non-word character:
[:letter:]+\-[:letter:]/\W
or to craft a more precise specification as a set of Unicode properties, as in the definition of \W (also found in the linked section of the JFlex manual).
Having said all that, I'd like to repeat the advice from my previous answer to a similar question of yours: put more specific patterns first. For example, using the following pair of patterns will guarantee that the first one picks up words with a single letter suffix, while avoiding the need to explicitly pushback.
[:letter:]+(-[:letter:])? { /* matches 'interferon' or 'interferon-a' */ }
[:letter:]+/-[:letter:]+ { /* matches only 'interferon' from 'interferon-alpha' */ }
Of course, in this case you could easily avoid the collision between the second pattern and the first pattern by using {2,} instead of + for the second repetition, but it's perfectly OK to rely on pattern ordering since it's often inconvenient to guarantee that patterns don't overlap.
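The interaction between the two patterns can be sketched in Python. This is an emulation under stated assumptions, not JFlex: [A-Za-z] stands in for [:letter:], the rule names and first_token are invented for the example, and the trailing context is modelled as a second capture group whose length counts toward the match but is left out of the token.

```python
import re

LETTER = "[A-Za-z]"  # ASCII stand-in for JFlex's [:letter:]

# (name, pattern, trailing context); the context may be None.
RULES = [
    ("WORD_WITH_SUFFIX", rf"{LETTER}+(?:-{LETTER})?", None),
    ("WORD_BEFORE_LONG_SUFFIX", rf"{LETTER}+", rf"-{LETTER}+"),
]

def first_token(text):
    """Return (rule name, lexeme) for the token at the start of text, or None."""
    best = None  # (total length, name, lexeme)
    for name, pattern, lookahead in RULES:
        m = re.match(f"({pattern})({lookahead or ''})", text)
        if not m:
            continue
        total = m.end()      # the lookahead counts toward the match length...
        lexeme = m.group(1)  # ...but is not part of the returned token
        # Strictly greater, so the earlier rule wins a tie.
        if best is None or total > best[0]:
            best = (total, name, lexeme)
    return best[1:] if best else None

print(first_token("interferon-a"))
print(first_token("interferon-alpha"))
```

For interferon-a both rules cover twelve characters, so the first rule wins and the whole word is one token. For interferon-alpha the second rule covers sixteen characters including its lookahead, so it wins and only interferon is returned.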

How can I use regular expressions to match a 'broken' string, or a proper string?

What I mean is that I need a regular expression that can match either something like this...
"I am a sentence."
or something like this...
"I am a sentence.
(notice the missing quotation mark at the end of the second one). My attempt at this so far is
["](\\.|[^"])*["]*
but that isn't working. Thanks for the help!
Edit for clarity: I am intending for this to be something like a C style string. I want functionality that will match with a string even if the string is not closed properly.
You could write the pattern as:
["](\\.|[^"\n])*["]?
which only has two small changes:
It excludes newline characters inside the string, so that an invalid string only matches up to the end of the line. (. does not match a newline, but a negated character class does, unless the newline is explicitly excluded.)
It makes the closing double quote optional rather than arbitrarily repeated.
However, it is hard to imagine a use case in which you just want to silently ignore the error. So I would recommend writing two rules:
["](\\.|[^"\n])*["] { /* valid string */ }
["](\\.|[^"\n])* { /* invalid string */ }
Note that the first pattern is guaranteed to match a valid string because it will match one more character than the other pattern and (f)lex always goes with the longer match.
Also, writing two overlapping rules like that does not cause any execution overhead, because of the way (f)lex compiles the patterns. In effect, the common prefix is automatically factored out.
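A quick way to convince yourself that the two rules sort inputs correctly is to compare match lengths by hand. The Python sketch below does that; classify is a hypothetical helper, and the comparison emulates the longest-match rule rather than running (f)lex.

```python
import re

VALID = re.compile(r'"(\\.|[^"\n])*"')   # closed string
INVALID = re.compile(r'"(\\.|[^"\n])*')  # same prefix, no closing quote

def classify(line):
    """Label the string that starts at the beginning of line."""
    valid, invalid = VALID.match(line), INVALID.match(line)
    # Longest match wins; VALID is listed first, so it would also win a tie.
    if valid and (not invalid or valid.end() >= invalid.end()):
        return "valid string"
    return "invalid string" if invalid else "no string"

print(classify('"I am a sentence."'))
print(classify('"I am a sentence.'))
```

For the closed string the first pattern matches one character more than the second, so it is chosen; for the unclosed string only the second pattern matches at all.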

How to create a regex without certain group of letters in lex

I've recently started learning lex, so I was practicing and decided to make a program which recognises the declaration of a simple variable (sort of).
This is my code :
%{
#include "stdio.h"
%}
dataType "int"|"float"|"char"|"String"
alphaNumeric [_\*a-zA-Z][0-9]*
space [ ]
variable {dataType}{space}{alphaNumeric}+
%option noyywrap
%%
{variable} printf("ok");
. printf("incorect");
%%
int main(){
yylex();
}
Some cases when the output should return ok
int var3
int _varR3
int _AA3_
And if I type int float as input, it returns ok, which is wrong because both are reserved words.
So my question is: what should I modify to make my expression reject the 'dataType' words after the space?
Thank you.
A preliminary consideration: Typically, detection of the construction you point out is not done at the lexing phase, but at the parsing phase. On yacc/bison, for instance, you would have a rule that only matches a "type" token followed by an "identifier" token.
To achieve that with lex/flex though, you could consider playing around with the negation (^) and trailing context (/) operators. Or...
If you're running flex, perhaps simply surrounding all your regexes with parentheses and passing the -l flag would do the trick. Notice there are a few differences between lex and flex, as described in the Flex manual.
This is really not the way to solve this particular problem.
The usual way of doing it would be to write separate pattern rules to recognize keywords and variable names. (Plus a pattern rule to ignore whitespace.) That means that the tokenizer will return two tokens for the input int var3. Recognizing that the two tokens are a valid declaration is the responsibility of the parser, which will repeatedly call the tokenizer in order to parse the token stream.
However, if you really want to recognize two words as a single token, it is certainly possible. (F)lex does not allow negative lookaheads in regular expressions, but you can use the pattern matching precedence rule to capture erroneous tokens.
For example, you could do something like this:
dataType int|float|char|String
id [[:alpha:]_][[:alnum:]_]*
%%
{dataType}[[:space:]]+{dataType} { puts("Error: two types"); }
{dataType}[[:space:]]+{id} { puts("Valid declaration"); }
/* ... more rules ... */
The above uses Posix character classes instead of writing out the possible characters. See man isalpha for a list of Posix character classes; the character class component [:xxxxx:] contains exactly the characters accepted by the isxxxxx standard library function. I fixed the pattern so that it allows more than one space between the dataType and the id, and simplified the pattern for ids.
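The precedence trick can be tried out with a small Python emulation of the two rules. The names RULES and check are made up for this sketch, \s stands in for the whitespace class, and only the first token of the input is examined.

```python
import re

DATA_TYPE = r"(?:int|float|char|String)"
IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

# Same order as the flex rules above: the error rule comes first.
RULES = [
    ("Error: two types", re.compile(rf"{DATA_TYPE}\s+{DATA_TYPE}")),
    ("Valid declaration", re.compile(rf"{DATA_TYPE}\s+{IDENT}")),
]

def check(text):
    """Pick the rule with the longest match; the earlier rule wins ties."""
    best_label, best_len = "no match", 0
    for label, pattern in RULES:
        m = pattern.match(text)
        if m and m.end() > best_len:
            best_label, best_len = label, m.end()
    return best_label

for sample in ("int var3", "int float", "int floaty"):
    print(sample, "->", check(sample))
```

Both rules match all of int float, so the earlier error rule is chosen. For int floaty the declaration rule matches one character more, so an identifier that merely starts with a type name is still accepted.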

Finding Elvis ?:

I have been tasked to find Elvis (using eclipse search). Is there any regex that I can use to find him?
The "Elvis Operator" (?:) is a shortening of Java's ternary operator.
I have tried \?[\s\S]*[:] but it doesn't match multiline.
Is there such a refactoring where I could change Elvis into an if-else block?
Edit
Sorry, I had posted a regex for the ternary operator. If your problem is multiline matching, you could use this:
\?(\p{Z}|\r\n|\n)*:
You'll need to explicitly match line delimiters if you want to match across multiple lines. \R will match any of them (platform-independent), in Eclipse 3.4 anyway, or you can use the proper one for your file (\r, \n, \r\n). E.g. \?.*\R*.*: will work if there's only one line break. You can't use \R in a character class, though, so if you don't know how many lines the operator might span, you'd have to construct a character class with your line delimiter and any character that might appear in an operand. Something like ([-\r\n\w\s\[\](){}=!/%*+&^|."']*)\?([-\r\n\w\s\[\](){}=!/%*+&^|."']*):([-\r\n\w\s\[\](){}=!/%*+&^|."']*). I've included parentheses to capture the operands as groups so you could find and replace.
You've got a pretty big problem, though, if this is Java (and probably any other language). The ternary conditional ?: operator creates an expression, while an if statement is not an expression. Consider:
boolean even = true;
int foo = even ? 2 : 3;
int bar = if (even) 2 else 3;
The third line is syntactically incorrect; the two conditional constructs are not equivalent. (What you'd actually get from the second line if you used my regex to find and replace is if (int foo = even) 2 else 3; which has additional problems.)
So, you can find the ?: operators with the regex above (or something similar; I may have missed some characters you need to include in the class), but you won't necessarily be able to replace them with 'if' statements.