Recognize special characters - regex

I have a quick question (I searched Google first):
Is there a way to match all special Unicode characters except quotes?
I have this code:
STRING: '"' (NUMBER|LETTER|' '|'!'|'?'|':'|'.'|'/'|'*')* '"';
fragment LETTER: ('a'..'z'|'A'..'Z');
fragment DIGIT: ('0'..'9');
Is there a more efficient way?
Thanks for feedback!

~["], or the old v3 style ~'"', matches any character except a quote.
If you also want to exclude line breaks, do something like this:
STRING : '"' ~["\r\n]* '"';
From the official docs:
~x
Match any single character not in the set described by x. Set x can be a single character literal, a range, or a subrule set like ~('x'|'y'|'z') or ~[xyz]. Here is a rule that uses ~ to match any character other than characters using ~[\r\n]*:
COMMENT : '#' ~[\r\n]* '\r'? '\n' -> skip ;
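For readers who want to try the negated-set idea outside ANTLR, here is a minimal sketch using Python's re module, where [^...] plays roughly the role of ANTLR's ~[...]. The names STRING_RE and is_string_literal and the sample inputs are illustrative assumptions, not part of the grammar above.

```python
import re

# Rough Python analogue of: STRING : '"' ~["\r\n]* '"';
# [^"\r\n] corresponds to ANTLR's ~["\r\n].
STRING_RE = re.compile(r'"[^"\r\n]*"')

def is_string_literal(text: str) -> bool:
    """Return True if the whole text is one double-quoted string."""
    return STRING_RE.fullmatch(text) is not None

print(is_string_literal('"héllo, wörld! #42"'))  # True
print(is_string_literal('"line\nbreak"'))        # False: contains a line break
print(is_string_literal('"a"b"'))                # False: unescaped inner quote
```

Because the set is negated, every character that is not explicitly excluded is accepted, so there is no need to list letters, digits and punctuation one by one.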

Related

How to capture a literal in antlr4?

I am looking to make a rule for a regex character class that is of the form:
character_range
: '[' literal '-' literal ']'
;
For example, with [1-5]+ I could match the string "1234543" but not "129". However, I'm having a hard time figuring out how to define a "literal" in ANTLR4. Normally I would use [a-zA-Z], but that covers only ASCII and won't include something such as é. How would I do that?
Actually, you don't want to match an entire literal, because a literal can be more than one character. Instead you only need a single character for the match.
In the parser:
character_range: OPEN_BRACKET LETTER DASH LETTER CLOSE_BRACKET;
And in the lexer:
OPEN_BRACKET: '[';
CLOSE_BRACKET: ']';
DASH: '-';
LETTER: [\p{L}];
The character class used in the LETTER lexer rule is the Unicode Letter category (\p{L}), as described in ANTLR's Unicode documentation. Other possible character classes are listed in Unicode Standard Annex #44 (Unicode Character Database). You may need others like Numbers, Punctuation or Separators for all possible regex character classes.
You can also define a range of Unicode characters. Try something like this in your lexer rules:
fragment LETTER: [a-zA-Z];
fragment LETTER_UNICODE: [\u0080-\uFFFF];
UTF8CHAR: ( LETTER | LETTER_UNICODE );
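To see what "any Unicode letter" means in practice, the following Python sketch checks a '[' letter '-' letter ']' range by hand. It relies on unicodedata.category, whose result starts with "L" for the letter categories that \p{L} denotes. The function names are hypothetical and only illustrate the idea; this is not how ANTLR implements it.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    """True if ch is a single character in a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return len(ch) == 1 and unicodedata.category(ch).startswith("L")

def parse_character_range(text: str):
    """Parse '[' letter '-' letter ']' and return (low, high), or None if malformed."""
    if len(text) != 5 or text[0] != "[" or text[2] != "-" or text[4] != "]":
        return None
    low, high = text[1], text[3]
    if not (is_letter(low) and is_letter(high)):
        return None
    return low, high

print(parse_character_range("[a-z]"))            # ('a', 'z')
print(parse_character_range("[\u00e0-\u00e9]"))  # ('à', 'é')
print(parse_character_range("[1-5]"))            # None: digits are not letters
```

The last call returns None for the same reason the LETTER rule alone would not accept [1-5]: digits belong to the Number categories, so a rule for them has to be added separately.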

Allow only specific escape characters in regular expression

I have been looking through other regex questions, but have not been able to find an answer. I am working on a grammar in ANTLR4 and there is a regular expression that has been eluding me.
I am trying to match any character, except for \ followed by anything other than [btnrf"\].
I have tried ( ~([\\][.]) | [\\][btnrf"\] ) but the ~ only negates a single character as far as I can tell. I get the error:
error AC0050: extraneous input '[.]' expecting RPAREN while looking for lexer rule element
It seems like it shouldn't be too hard to exclude \* but allow the small list of acceptable escaped characters. I have been on http://www.regex101.com and I don't have any trouble matching the allowable characters, but for some reason I just can't figure out how to disallow escape characters besides the ones mentioned above, while also allowing all other characters.
Manually specifying every valid input character seems like overkill, but that may be what it comes down to. Something like:
[a-zA-Z0-9_!##$%^&*()\-+=/.,<>;':\b\t\n\r\f\"\\]*
That may not be 100% valid, but the idea is just to list all valid characters, which by default would exclude any invalid escape characters. It seems like there should be a simpler way. Any tips or links to useful information would be greatly appreciated.
The actual rule that I have so far, which allows anything enclosed in double quotes as a valid string:
STRING : '"' (~[\"] | '\\"')* '"';
I don't have ANTLR handy, but the following seems to do what you're after:
\([^\\].\)\|\(\\[btnrf\\"\\\\]\)
This effectively allows either anything other than a backslash followed by any character, or a backslash followed by one of the specified characters.
For example, putting that pattern in a file regexfile, and given a datafile containing
\a
\b
\\
xy
then running grep -f regexfile datafile will exclude the \a and return:
\b
\\
xy
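The same "ordinary character, or backslash plus an allowed letter" idea can be sketched for whole quoted strings in Python. This is a hedged illustration rather than tested ANTLR: the names STRING_RE and is_valid_string are assumptions, and the ANTLR rule in the comment is my reading of how the pattern would translate.

```python
import re

# Each element of the string body is either
#   - one character that is neither a double quote nor a backslash, or
#   - a backslash followed by one of the permitted escape characters.
# A plausible ANTLR4 counterpart (untested assumption):
#   STRING : '"' ( ~["\\] | '\\' [btnrf"\\] )* '"' ;
STRING_RE = re.compile(r'"(?:[^"\\]|\\[btnrf"\\])*"')

def is_valid_string(text: str) -> bool:
    """Return True if text is a quoted string using only the allowed escapes."""
    return STRING_RE.fullmatch(text) is not None

print(is_valid_string(r'"tab\t and quote \" ok"'))  # True
print(is_valid_string(r'"bad \a escape"'))          # False: \a is not allowed
print(is_valid_string(r'"trailing backslash\"'))    # False: closing quote is escaped
```

The key point is that the backslash is excluded from the "ordinary character" alternative, so a backslash can only ever be consumed together with one of the listed escape characters.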

Antlr4 match Cyrillic, Latin, Polish and Greek Letters plus special characters?

In my ANTLR4 grammar I need a regular expression that matches Latin, Cyrillic, Polish and Greek letters plus special characters. This is what I have:
STRING: ['][\p{L} 0-9\[\]\^\$\.\|\?\*\+\(\)\\~`\!##%&\-_+={}""<>:;,\/°]*['];
So I am saying that a string starts and ends with '. Inside I can have any letter (\p{L}), digit, or special character except '. I have tested this on regex101.com and it does exactly what I want, but in ANTLR4 it does not work. The closest I can get is:
['][a-zA-Z0-9 \[\]\^\$\.\|\?\*\+\(\)\\~`\!##%&\-_+={}""<>:;,\/°]*[']
But the problem is that something like 'Ąłćórżnęł' won't be accepted in my language, although it should be.
Am I doing something wrong in ANTLR4, or is this a limitation? How can I get it to work in ANTLR4? STRING is a lexer rule.
\p{L} was not supported by ANTLR before version 4.7. On older versions you will have to write these ranges out by hand like this: [\u1234-\u5678] (replace \u.... with your hexadecimal Unicode code points), where \u1234 is the start of the range and \u5678 the end. Note that you can put more than one range in your character set: [\u1234-\u1238\u3456-\u5679].
Thanks, but how about a rule in ANTLR4 that allows everything inside a string except a character like ', where the string starts and ends with '?
That would look like this:
STRING : '\'' ~[']* '\'';
To support escaped quotes and disallow line breaks, do this:
STRING : '\'' ( ~['\r\n] | '\\' ['\\] )* '\'';
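The hand-written ranges suggested above can be tried out quickly in Python before committing them to a grammar. In this sketch the choice of Unicode blocks is an assumption about which alphabets are needed, and LETTER_RANGES, STRING_RE and is_string are hypothetical names.

```python
import re

# Hand-written code point ranges, in the style of [\u1234-\u5678].
# Which blocks to include is an assumption; adjust them to your alphabets.
LETTER_RANGES = (
    "a-zA-Z"
    "\u00C0-\u00FF"  # Latin-1 Supplement (covers ó; also contains the signs × and ÷)
    "\u0100-\u017F"  # Latin Extended-A (covers Ą, ł, ć, ż, ę)
    "\u0370-\u03FF"  # Greek and Coptic
    "\u0400-\u04FF"  # Cyrillic
)

# A single-quoted string whose body may contain only letters, digits and spaces.
STRING_RE = re.compile("'[" + LETTER_RANGES + "0-9 ]*'")

def is_string(text: str) -> bool:
    """Return True if text is a single-quoted string over the allowed characters."""
    return STRING_RE.fullmatch(text) is not None

print(is_string("'Ąłćórżnęł'"))   # True
print(is_string("'Привет 42'"))   # True
print(is_string("'it's'"))        # False: unescaped quote inside the string
```

A whitelist like this needs maintenance whenever a new alphabet is required, which is why the negated-set rule shown above is usually the simpler choice when "anything except the quote" is acceptable.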

ANTLR4 - Need an explanation on this String Literals

In my assignment, I have this description for the string lexer rule:
"String literals consist of zero or more characters enclosed by double
quotes ("). Use escape sequences (listed below) to represent special
characters within a string. It is a compile-time error for a new line
or EOF character to appear inside a string literal.
All the supported escape sequences are as follows:
\b backspace
\f formfeed
\r carriage return
\n newline
\t horizontal tab
\" double quote
\\ backslash
The following are valid examples of string literals:
"This is a string containing tab \t"
"He asked me: \"Where is John?\""
A string literal has a type of string."
And this is my String lexer:
STRINGLIT: '"'(('\\'('b'|'t'|'n'|'f'|'r'|'\"'|'\\'))|~('\n'))*'"';
Can anybody check whether my lexer rule meets the requirement? If it doesn't, please tell me your correction; I don't really understand the requirement or ANTLR4.
With ANTLR4, instead of writing '\\' ('b' | 't' | 'n'), you can write '\\' [btn]. Also, as J Earls mentioned in a comment, you'll want to include the quote in your negated set, as well as \r and the literal \.
This ought to do the trick:
STRINGLIT
: '"' ( '\\' [btnfr"\\] | ~[\r\n\\"] )* '"'
;
Try this:
QUOTE: '"';
STRINGLIT: QUOTE ( '\\' [bfrnt"\\] | ~[\b\f\r\n\t"\\] )* QUOTE
{self.text = self.text[1:-1]};
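As a rough cross-check of the rules above, the following Python sketch validates a literal against the assignment's escape list, strips the quotes with the same [1:-1] slice used in the lexer action, and decodes the escapes. ESCAPES, STRINGLIT_RE and string_value are hypothetical names, and the regex is an approximation of the lexer rule rather than generated from it.

```python
import re

# Escape characters from the assignment, mapped to the characters they denote.
ESCAPES = {"b": "\b", "f": "\f", "r": "\r", "n": "\n", "t": "\t",
           '"': '"', "\\": "\\"}

# Same shape as the lexer rule: a supported escape, or any character that is
# not a line break, a backslash, or a double quote.
STRINGLIT_RE = re.compile(r'"(?:\\[bfrnt"\\]|[^\r\n\\"])*"')

def string_value(literal: str) -> str:
    """Validate a string literal, strip its quotes, and decode its escapes."""
    if STRINGLIT_RE.fullmatch(literal) is None:
        raise ValueError(f"invalid string literal: {literal!r}")
    body = literal[1:-1]  # same slice as the self.text[1:-1] action
    # After validation, every backslash is followed by a key of ESCAPES.
    return re.sub(r"\\(.)", lambda m: ESCAPES[m.group(1)], body)

print(string_value(r'"He asked me: \"Where is John?\""'))
# He asked me: "Where is John?"
```

Passing a literal with an unsupported escape such as \q, or one containing a raw line break, raises ValueError, which loosely corresponds to the compile-time error the assignment asks for.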

Vim regex not matching spaces in a character class

I'm using vim to do a search and replace with this command:
%s/lambda\s*{\([\n\s\S]\)*//gc
I'm trying to match all word, end-of-line and whitespace characters after a {. For instance, the entirety of this line should match:
lambda {
FactoryGirl.create ...
Instead, it only matches up to the newline and no spaces before FactoryGirl. I've tried manually replacing all the spaces before, just in case there were tab characters instead, but no dice. Can anyone explain why this doesn't work?
The \s is an atom for whitespace; \n, though it looks similar, is syntactically an escape sequence for a newline character. Inside the collection atom [...], you cannot include other atoms, only characters (including some special ones like \n). From :help /[]:
The following translations are accepted when the 'l' flag is not
included in 'cpoptions' {not in Vi}:
\e <Esc>
\t <Tab>
\r <CR> (NOT end-of-line!)
\b <BS>
\n line break, see above |/[\n]|
\d123 decimal number of character
\o40 octal number of character up to 0377
\x20 hexadecimal number of character up to 0xff
\u20AC hex. number of multibyte character up to 0xffff
\U1234 hex. number of multibyte character up to 0xffffffff
NOTE: The other backslash codes mentioned above do not work inside
[]!
So, either specify the whitespace characters literally [ \t\n...], use the corresponding character class expression [[:space:]...], or combine the atom with the collection via logical or \%(\s\|[...]\).
Vim interprets characters inside [ ... ] character classes differently. The \s and \S are not taken literally either, since in that case the regex would fully match lambda {sss or lambda {\\\, and it doesn't. What \s and \S are interpreted as, I still can't explain.
However, I was able to achieve nearly what I wanted with:
%s/lambda\s*{\([\n a-zA-Z]\)*//gc
That ignores punctuation, which I wanted. This works, but is dangerous:
%s/lambda\s*{\([\n a-zA-Z]\|.\)*//gc
Adding a character such as } after the last character causes Vim to hang while matching. So my solution was to add the punctuation I needed to the character class.