How to capture a literal in ANTLR4?

I am looking to make a rule for a regex character class that is of the form:
character_range
: '[' literal '-' literal ']'
;
For example, with [1-5]+ I could match the string "1234543" but not "129". However, I'm having a hard time figuring out how I would define a "literal" in antlr4. Normally I would do [a-zA-Z], but then this is just ascii and won't include something such as é. So how would I do that?

Actually, you don't want to match an entire literal, because a literal can be more than one character. Instead you only need a single character for the match.
In the parser:
character_range: OPEN_BRACKET LETTER DASH LETTER CLOSE_BRACKET;
And in the lexer:
OPEN_BRACKET: '[';
CLOSE_BRACKET: ']';
DASH: '-';
LETTER: [\p{L}];
The character class used in the LETTER lexer rule is Unicode Letters as described in the Unicode description file of ANTLR. Other possible character classes are listed in the UAX #44 Annex of the Unicode Character DB. You may need others like Numbers, Punctuation or Separators for all possible regex character classes.
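Once the parser has the two LETTER tokens, checking input against the range comes down to comparing code points. Here is a minimal C++ sketch of that idea, assuming the input is already decoded to UTF-32; the CharRange type and both function names are made up for illustration and are not part of ANTLR:

```cpp
#include <cassert>
#include <string>

// Hypothetical model of a parsed range such as [1-5] or [a-é].
struct CharRange {
    char32_t low;   // code point of the first LETTER
    char32_t high;  // code point of the second LETTER
};

// True when the code point lies inside the range, both ends included.
bool contains(const CharRange& range, char32_t c) {
    return range.low <= c && c <= range.high;
}

// Semantics of `[low-high]+`: at least one code point, all inside the range.
bool matches_one_or_more(const CharRange& range, const std::u32string& text) {
    if (text.empty()) {
        return false;
    }
    for (char32_t c : text) {
        if (!contains(range, c)) {
            return false;
        }
    }
    return true;
}
```

Because the comparison works on code points rather than bytes, a range such as [a-é] behaves the same way as [1-5].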

You can also define a range of unicode characters. Try something like this in your lexer rules:
fragment LETTER: [a-zA-Z];
fragment LETTER_UNICODE: [\u0080-\uFFFF];
UTF8CHAR: ( LETTER | LETTER_UNICODE );

Related

C++ special characters in a regular expression

I have to parse a regular expression which can contain special symbols such as \s and \d. The problem is that I can't distinguish the \ when I am parsing the expression: '\s' == 's', so I cannot tell a special character from a basic character. How can I solve this?
Raw string literals since C++11 can help you to improve the readability:
"a\\sb" // matches: a[whitespace]b
"a\\\\sb" // matches: a\sb
becomes:
R"(a\sb)" // matches: a[whitespace]b
R"(a\\sb)" // matches: a\sb
You're confusing user input and character literals. You catch the user input \ by comparing all input characters with the character literal '\\'.
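To make the escaping levels concrete, here is a small sketch with std::regex (default ECMAScript grammar). The function names are invented for this example; the patterns are the two from the answer above.

```cpp
#include <cassert>
#include <regex>
#include <string>

// The escaped literal and the raw literal spell the same four characters: a \ s b
bool same_pattern_text() {
    return std::string("a\\sb") == R"(a\sb)";
}

// True when `text` is exactly: 'a', one whitespace character, 'b'.
bool matches_a_space_b(const std::string& text) {
    static const std::regex pattern(R"(a\sb)");
    return std::regex_match(text, pattern);
}

// True when `text` is exactly the four characters a \ s b.
bool matches_literal_backslash_s(const std::string& text) {
    static const std::regex pattern(R"(a\\sb)");
    return std::regex_match(text, pattern);
}
```

The regex engine only ever sees the characters left after the compiler has processed the string literal, which is why the non-raw form needs every backslash doubled.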

Antlr4 match Cyrillic, Latin, Polish and Greek Letters plus special characters?

In my Antlr4 grammar I need a regular expression that matches Latin, Cyrillic, Polish and Greek letters plus special characters. This is what I have:
STRING: ['][\p{L} 0-9\[\]\^\$\.\|\?\*\+\(\)\\~`\!##%&\-_+={}""<>:;,\/°]*['];
So I am saying that a String starts and ends with '. Inside I can have any letter (\p{L}), number and special character except '. I have tested this on regex101.com and it is exactly what I want. But in Antlr4 it is not working. Instead the closest thing I get is:
['][a-zA-Z0-9 \[\]\^\$\.\|\?\*\+\(\)\\~`\!##%&\-_+={}""<>:;,\/°]*[']
But the problem is that something like 'Ąłćórżnęł' won't be accepted in my language, but it should be.
Am I doing something wrong in Antlr4, or is that a limitation? How could I get it to work in Antlr4? STRING is a lexer rule.
\p{L} is not supported by ANTLR versions before 4.7 (newer versions accept Unicode property escapes such as \p{L} inside a lexer character set). On older versions you will have to write the ranges out by hand like this: [\u1234-\u5678] (replace \u.... with your hexadecimal Unicode code points), where \u1234 is the start of the range and \u5678 the end. Note that you can put more than one range in your character set: [\u1234-\u1238\u3456-\u5679].
Thanks, but how about a regular expression in Antlr4 where I allow everything inside of a String except a character like '. But I say that a string start with and end with '
That would look like this:
STRING : '\'' ~[']* '\'';
and with escaped quotes and not allowing line breaks, do this:
STRING : '\'' ( ~['\r\n] | '\\' ['\\] )* '\'';
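For comparison, roughly the same rule can be sketched as a std::regex in C++. This is an assumption-laden translation, not ANTLR's behaviour: the function name is made up, and unlike the lexer rule above the plain-character set here also excludes the backslash, so a backslash is only accepted as the start of an escape.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Accepts a single-quoted string on one line in which \' and \\ are escapes.
// Pattern pieces:
//   [^'\\\r\n]  any character except quote, backslash, CR, LF
//   \\['\\]     a backslash followed by a quote or another backslash
bool is_quoted_string(const std::string& text) {
    static const std::regex pattern(R"('(?:[^'\\\r\n]|\\['\\])*')");
    return std::regex_match(text, pattern);
}
```

With this stricter set, a lone trailing backslash as in 'a\' is rejected instead of being read as an ordinary character.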

Recognize special characters

I've got a little question (I've searched Google first):
Is there a way to match all special Unicode characters except quotes?
I have this code:
STRING: '"' (NUMBER|LETTER|' '|'!'|'?'|':'|'.'|'/'|'*')* '"';
fragment LETTER: ('a'..'z'|'A'..'Z');
fragment DIGIT: ('0'..'9');
Is there a more efficient way?
Thanks for feedback!
~["], or the old v3 style ~'"', matches any character except a quote.
If you also want to exclude line breaks, do something like this:
STRING : '"' ~["\r\n]* '"';
From the official docs:
~x
Match any single character not in the set described by x. Set x can be a single character literal, a range, or a subrule set like ~('x'|'y'|'z') or ~[xyz]. Here is a rule that uses ~[\r\n]* to match any characters other than line-break characters:
COMMENT : '#' ~[\r\n]* '\r'? '\n' -> skip ;

c++ regexp for not preceded by backslash and preceded by backslash

I can only find negative lookbehind for this , something like (?<!\\).
But this won't compile in C++ and flex. It seems that neither regex.h nor flex supports this?
I am trying to implement a shell which has to treat special characters like >, < or | as a normal argument string if preceded by a backslash. In other words, only treat a special character as special if it is preceded by 0 or an even number of '\'.
So echo \\>a or echo abc>a should direct output to a
but echo \>a should print >a
What regular expression should I use?
I'm using flex and yacc to parse the input.
In a Flex rule file, you'd use \\ to match a single backslash '\' character. This is because the \ is used as an escape character in Flex.
BACKSLASH \\
LITERAL_BACKSLASH \\\\
LITERAL_LESSTHAN \\<
LITERAL_GREATERTHAN \\>
LITERAL_VERTICALBAR \\\|
If I follow you correctly, in your case you want "\>" to be treated as literal '>' but "\\>" to be treated as literal '\' followed by special redirect. You don't need negative look behind or anything particularly special to accomplish this as you can build one rule that would accept both your regular argument characters and also the literal versions of your special characters.
For purposes of discussion, let's assume that your argument/parameter can contain any character but ' ', '\t', and the special forms of '>', '<', '|'. The rule for the argument would then be something like:
ARGUMENT ([^ \t\\><|]|\\\\|\\>|\\<|\\\|)+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', '\', and your special characters
\\\\ matches any instance of "\\" (i.e. a literal backslash)
\\> matches any instance of "\>" (i.e. a literal greater than)
\\< matches any instance of "\<" (i.e. a literal less than)
\\\| matches any instance of "\|" (i.e. a literal vertical bar/pipe)
Actually... You can probably just shorten that rule to:
ARGUMENT ([^ \t\\><|]|\\[^ \t\r\n])+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\[^ \t\r\n] matches any character preceded by a '\' in your input except for whitespace (which will handle all of your special characters and allow for literal forms of all other characters)
If you want to allow for literal whitespace in your arguments/parameters then you could shorten the rule even further but be careful with using \\. for the second half of the rule alternation as it may or may not match " \n" (i.e. eat your trailing command terminator character!).
Hope that helps!
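The same left-to-right idea can be sketched by hand in C++, which makes the "a backslash protects the next character" rule explicit. This is only an illustration under my own assumptions: the Token type and tokenize function are invented here, escapes are resolved in the token text, and unlike the flex rule above an escaped blank is kept as part of the argument.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One token of a shell-like command line (hypothetical type for this sketch).
struct Token {
    bool is_operator;  // true for an unescaped >, < or |
    std::string text;  // the operator, or the argument with escapes resolved
};

// Scan left to right: a backslash makes the next character literal,
// unescaped blanks end an argument, unescaped > < | become operators.
std::vector<Token> tokenize(const std::string& line) {
    std::vector<Token> tokens;
    std::string current;
    bool in_argument = false;
    auto flush = [&] {
        if (in_argument) {
            tokens.push_back({false, current});
            current.clear();
            in_argument = false;
        }
    };
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (c == '\\' && i + 1 < line.size()) {
            current += line[++i];  // take the escaped character literally
            in_argument = true;
        } else if (c == ' ' || c == '\t') {
            flush();
        } else if (c == '>' || c == '<' || c == '|') {
            flush();
            tokens.push_back({true, std::string(1, c)});
        } else {
            current += c;  // also covers a lone backslash at end of line
            in_argument = true;
        }
    }
    flush();
    return tokens;
}
```

For the inputs in the question, echo \>a yields the two arguments echo and >a, while echo \\>a yields echo, a literal \, the redirect operator and a.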
You cannot easily extract single escaped characters from a command-line, since you will not know the context of the character. In the simplest case, consider the following:
LessThan:\<
BackslashFrom:\\<
In the first one, < is an escaped character; in the second one, it is not. If your language includes quotes (as most shells do), things become even more complicated. It's a lot better to parse the string left to right, one entity at a time. (I'd use flex myself, because I've stopped wasting my time writing and testing lexers, but you might have some pedagogical reason to do so.)
If you really need to find a special character which shouldn't be special, just search for it (in C++98, where you don't have raw literals, you'll have to escape all of the backslashes):
regex: (\\\\)*\\[<>|]
(An even number -- possibly 0 -- of \, then a \ and a <, > or |)
as a C string => "(\\\\\\\\)*\\\\[<>|]"
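With C++11 raw string literals the pattern can be written without the extra layer of escaping. A small sketch with std::regex follows; the function names are invented, and regex_match is used so that the whole run of backslashes plus the special character has to match:

```cpp
#include <cassert>
#include <regex>
#include <string>

// The raw literal and the escaped C string above are the same pattern text.
bool same_pattern_text() {
    return std::string(R"((\\\\)*\\[<>|])") == "(\\\\\\\\)*\\\\[<>|]";
}

// True when `run` is an even number of backslashes (possibly 0),
// then one more backslash, then one of < > |, so the special
// character at the end is escaped.
bool is_escaped_special(const std::string& run) {
    static const std::regex pattern(R"((\\\\)*\\[<>|])");
    return std::regex_match(run, pattern);
}
```

Used with regex_search instead, the pattern can also start matching in the middle of a longer run of backslashes, so the surrounding context still has to be checked.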

Vim regex not matching spaces in a character class

I'm using vim to do a search and replace with this command:
%s/lambda\s*{\([\n\s\S]\)*//gc
I'm trying to match for all word, endline and whitespace characters after a {. For instance, the entirety of this line should match:
lambda {
FactoryGirl.create ...
Instead, it only matches up to the newline and no spaces before FactoryGirl. I've tried manually replacing all the spaces before, just in case there were tab characters instead, but no dice. Can anyone explain why this doesn't work?
The \s is an atom for whitespace; \n, though it looks similar, syntactically is an escape sequence for a newline character. Inside the collection atom [...], you cannot include other atoms, only characters (including some special ones like \n). From :help /[]:
The following translations are accepted when the 'l' flag is not
included in 'cpoptions' {not in Vi}:
\e <Esc>
\t <Tab>
\r <CR> (NOT end-of-line!)
\b <BS>
\n line break, see above |/[\n]|
\d123 decimal number of character
\o40 octal number of character up to 0377
\x20 hexadecimal number of character up to 0xff
\u20AC hex. number of multibyte character up to 0xffff
\U1234 hex. number of multibyte character up to 0xffffffff
NOTE: The other backslash codes mentioned above do not work inside
[]!
So, either specify the whitespace characters literally [ \t\n...], use the corresponding character class expression [[:space:]...], or combine the atom with the collection via logical or \%(\s\|[...]\).
Vim interprets characters inside of the [ ... ] character classes differently. They are not taken literally, since that regex wouldn't fully match lambda {sss or lambda {\\\. What \s and \S are interpreted as... I still can't explain.
However, I was able to achieve nearly what I wanted with:
%s/lambda\s*{\([\n a-zA-Z]\)*//gc
That ignores punctuation, which I wanted. This works, but is dangerous:
%s/lambda\s*{\([\n a-zA-Z]\|.\)*//gc
Because adding on a character after the last character like } causes vim to hang while globbing. So my solution was to add the punctuation I needed into the character class.