ANTLR4: match Cyrillic, Latin, Polish and Greek letters plus special characters?

In an ANTLR4 grammar I need a regular expression that matches Latin, Cyrillic, Polish and Greek letters plus special characters. This is what I have:
STRING: ['][\p{L} 0-9\[\]\^\$\.\|\?\*\+\(\)\\~`\!@#%&\-_+={}""<>:;,\/°]*['];
So I am saying that a string starts and ends with '. Inside I can have any letter (\p{L}), digit, or special character except '. I tested this on regex101.com and it does exactly what I want, but it does not work in ANTLR4. The closest I can get is:
['][a-zA-Z0-9 \[\]\^\$\.\|\?\*\+\(\)\\~`\!@#%&\-_+={}""<>:;,\/°]*[']
The problem is that something like 'Ąłćórżnęł' is not accepted in my language, but it should be.
Am I doing something wrong in ANTLR4, or is this a limitation? How can I get this to work in ANTLR4? STRING is a lexer rule.

\p{L} is not supported by ANTLR versions before 4.7 (later versions accept Unicode properties such as \p{L} inside lexer character sets). In older versions you will have to write the ranges out by hand, like this: [\u1234-\u5678] (replace \u1234 and \u5678 with your hexadecimal Unicode code points), where \u1234 is the start of the range and \u5678 the end. Note that you can put more than one range in a character set: [\u1234-\u1238\u3456-\u5679].
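The hand-written ranges can be tried out in an ordinary regex engine before moving them into the grammar. Below is a minimal Python sketch of that idea. The block boundaries in LETTER_RANGES are assumptions chosen for illustration (Latin-1 and Latin Extended, Greek, Cyrillic), and they are only an approximation of \p{L}: the Latin range, for instance, also contains a few non-letter symbols.

```python
import re

# Hand-written code point ranges standing in for \p{L} (assumed, approximate):
#   \u00C0-\u024F  Latin-1 letters and Latin Extended-A/B (covers Polish)
#   \u0370-\u03FF  Greek and Coptic
#   \u0400-\u04FF  Cyrillic
LETTER_RANGES = r"a-zA-Z\u00C0-\u024F\u0370-\u03FF\u0400-\u04FF"

# A quoted string made of digits, spaces and letters from the ranges above.
STRING = re.compile(r"'[0-9 " + LETTER_RANGES + r"]*'")


def is_string_token(text: str) -> bool:
    """Return True if the whole text is one quoted string token."""
    return STRING.fullmatch(text) is not None
```

With these ranges, is_string_token("'Ąłćórżnęł'") is accepted, while a string containing characters outside the listed blocks is rejected.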
Thanks, but how about a rule in ANTLR4 that allows everything inside a string except a character like ', where a string starts and ends with '?
That would look like this:
STRING : '\'' ~[']* '\'';
and to support escaped quotes while disallowing line breaks, do this (the backslash is excluded from the plain characters so that it can only start an escape):
STRING : '\'' ( ~['\\\r\n] | '\\' ['\\] )* '\'';
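For experimenting outside ANTLR, here is a rough Python regex counterpart of the rule with escaped quotes. It is a sketch under one assumption: a backslash may appear only as the start of an escape (\' or \\), so the backslash is left out of the plain-character set.

```python
import re

# Opening quote, then any number of either
#   - a plain character (not a quote, backslash, CR or LF), or
#   - a backslash followed by a quote or another backslash,
# then the closing quote.
STRING = re.compile(r"'(?:[^'\\\r\n]|\\['\\])*'")


def is_string_token(text: str) -> bool:
    """Return True if the whole text is one quoted string token."""
    return STRING.fullmatch(text) is not None
```

Here r"'it\'s'" is accepted, while "'it's'" (unescaped quote inside) and a string containing a line break are rejected.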

Related

How to capture a literal in antlr4?

I am looking to make a rule for a regex character class that is of the form:
character_range
: '[' literal '-' literal ']'
;
For example, with [1-5]+ I could match the string "1234543" but not "129". However, I'm having a hard time figuring out how to define a "literal" in ANTLR4. Normally I would use [a-zA-Z], but that is just ASCII and won't include something such as é. So how would I do that?
Actually, you don't want to match an entire literal, because a literal can be more than one character. Instead you only need a single character for the match.
In the parser:
character_range: OPEN_BRACKET LETTER DASH LETTER CLOSE_BRACKET;
And in the lexer:
OPEN_BRACKET: '[';
CLOSE_BRACKET: ']';
LETTER: [\p{L}];
The character class used in the LETTER lexer rule is the Unicode Letter category, as described in ANTLR's Unicode documentation. Other possible character classes are listed in UAX #44 of the Unicode Character Database. You may need others, such as Numbers, Punctuation or Separators, to cover all possible regex character classes.
You can also define a range of unicode characters. Try something like this in your lexer rules:
fragment LETTER: [a-zA-Z];
fragment LETTER_UNICODE: [\u0080-\uFFFF];
UTF8CHAR: ( LETTER | LETTER_UNICODE );
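To see which characters a "Unicode letter" class covers, the same check can be sketched in Python with the standard unicodedata module. The helper names is_letter and parse_character_range are hypothetical and only mirror the parser rule above; the sketch treats a character as a letter when its general category starts with "L".

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True if ch is a single code point in a Unicode letter category (L*)."""
    return len(ch) == 1 and unicodedata.category(ch).startswith("L")


def parse_character_range(text: str):
    """Parse '[x-y]' where x and y are single letters.

    Returns the pair (x, y), or None if the text has another shape.
    """
    if len(text) == 5 and text[0] == "[" and text[2] == "-" and text[4] == "]":
        low, high = text[1], text[3]
        if is_letter(low) and is_letter(high) and low <= high:
            return low, high
    return None
```

A range such as [a-é] parses when é is a single precomposed code point, whereas [1-5] is rejected by this sketch because digits are not letters; a digit class would need its own token.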

Why do I need to write \\d instead of \d in a C++ regex?

I'm starting to learn about regular expressions and have written some code in C++.
My task is: implement a function that replaces each digit in the given string with a '#' character.
For my example, the input string is "12 points".
I know I need to use \d to match a digit. I tried this: std::regex_replace(input,std::regex("\d"),"#");
but it does not work: the output is still "12 points".
Then I searched the internet and found:
std::regex_replace(input,std::regex("\\d"),"#");
which outputs "## points".
Can anyone help me understand what "\\d" is?
\d means digit. However, \ is a special character that has to be escaped itself, so in \\d the first \ escapes the second, marking it as a regular character instead of giving it its special meaning.
When you use "\d" in a C++ application, the \ is an escape character in C++. So "\d" is read as an escape sequence rather than as a backslash followed by d.
Regex then gets a string that doesn't have \d in it. \d is not a standard C++ escape sequence, so what it gets instead depends on the compiler; common compilers warn and drop the backslash, leaving just d.
When you use "\\d" you are escaping the \. So C++ reads the string as \d, as you intended.
An example of when you'd use an escape character, is when you want to output a quote. "\"" would output a single double quote.
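The two layers of escaping can be illustrated in Python, which also uses the backslash in string literals. This is only a sketch of the idea: Python treats an unknown escape such as "\d" differently from C++, so the example sticks to "\\d" and to a raw string, and uses the plain pattern "d" to stand in for what a C++ compiler commonly produces from "\d". (C++11 has a comparable raw string literal, R"(\d)".)

```python
import re

text = "12 points"

# "\\d" in source code is the two-character string \d, which the regex
# engine reads as "any digit". A raw string spells the same pattern
# without doubling the backslash.
same_pattern = "\\d" == r"\d"
pattern_length = len("\\d")

# The regex engine receives \d and replaces every digit.
masked = re.sub("\\d", "#", text)

# If the backslash is lost before the regex engine sees the pattern,
# the pattern is the plain letter d, which does not occur in the input.
unchanged = re.sub("d", "#", text)
```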

allow only specific escape characters in regular expression

I have been looking through other regex questions, but have not been able to find an answer. I am working on a grammar in ANTLR4 and there is a regular expression that has been eluding me.
I am trying to match any character, except for \ followed by anything other than [btnrf"\].
I have tried ( ~([\\][.]) | [\\][btnrf"\] ) but the ~ only negates a single character as far as I can tell. I get the error:
error AC0050: extraneous input '[.]' expecting RPAREN while looking for lexer rule element
It seems like it shouldn't be too hard to exclude \* but allow the small list of acceptable escaped characters. I have been on http://www.regex101.com and I don't have any trouble matching the allowable characters, but for some reason I just can't figure out how to disallow escape characters besides the ones mentioned above, while also allowing all other characters.
Manually specifying every valid input character seems like overkill, but that may be what it comes down to. Something like:
[a-zA-Z0-9_!@#$%^&*()\-+=/.,<>;':\b\t\n\r\f\"\\]*
That may not be 100% valid, but the idea is just to list all valid characters, which by default would exclude any invalid escape sequences. It seems like there should be a simpler way. Any tips or links to useful information would be greatly appreciated.
The actual rule that I have so far, which allows anything enclosed in double quotes as a valid string:
STRING : '"' (~[\"] | '\\"')* '"';
I don't have ANTLR handy, but the following seems to do what you're after :
\([^\\].\)\|\(\\[btnrf\\"\\\\]\)
so it effectively allows "EITHER anything other than a backslash followed by any character, OR a backslash followed by a specified character".
For example, putting that string in a file regexfile, and given a datafile containing
\a
\b
\\
xy
then performing grep -f regexfile datafile will exclude the \a, and return :
\b
\\
xy
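The same "plain character or whitelisted escape" idea can be written as a single alternation over a whole quoted string. The Python sketch below assumes the string is delimited by double quotes and that a backslash is valid only in front of b, t, n, r, f, a double quote or another backslash. In ANTLR4 the corresponding lexer rule should look roughly like STRING : '"' ( ~["\\] | '\\' [btnrf"\\] )* '"'; but that translation is untested here.

```python
import re

# Opening quote, then any number of either
#   - a character that is neither a double quote nor a backslash, or
#   - a backslash followed by one of the allowed escape characters,
# then the closing quote.
STRING = re.compile(r'"(?:[^"\\]|\\[btnrf"\\])*"')


def is_valid_string(text: str) -> bool:
    """Return True if text is a quoted string with only allowed escapes."""
    return STRING.fullmatch(text) is not None
```

Because the plain-character alternative excludes the backslash, an escape such as \a cannot slip through as two ordinary characters.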

Groovy : RegEx for matching Alphanumeric and underscore and dashes

I am working on Grails 1.3.6 application. I need to use Regular Expressions to find matching strings.
It needs to find whether a string has anything other than Alphanumeric characters or "-" or "_" or "*"
An example string looks like:
SDD884MMKG_JJGH1222
What I came up with so far is:
String regEx = "^[a-zA-Z0-9*-_]+\$"
The problem with the above is that it doesn't search for special characters at the end or beginning of the string.
I had to add a "\" before the "$", or else it gives a compilation error:
- Groovy:illegal string body character after dollar sign;
Can anyone suggest a better RegEx to use in Groovy/Grails?
The problem is the unescaped hyphen in the middle of the character class. Fix it by using:
String regEx = "^[a-zA-Z0-9*_-]+\$";
Or even shorter:
String regEx = "^[\\w*-]+\$";
By placing an unescaped - in the middle of the character class, you make the regex treat it as a range between * (ASCII 42) and _ (ASCII 95), matching everything in this range.
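The effect of the accidental range is easy to reproduce. The following Python sketch compares the two character classes; the pattern names are made up for the illustration, and character classes behave the same way here as in Groovy/Java regexes.

```python
import re

# *-_ is read as a range from '*' (code 42) to '_' (code 95).
RANGE_BUG = re.compile(r"^[a-zA-Z0-9*-_]+$")

# With the hyphen last, it is a literal '-' and no range is formed.
FIXED = re.compile(r"^[a-zA-Z0-9*_-]+$")

sample = "A=B@C"  # '=' (61) and '@' (64) lie inside the accidental range

bug_accepts_sample = RANGE_BUG.match(sample) is not None
fixed_accepts_sample = FIXED.match(sample) is not None
fixed_accepts_valid = FIXED.match("SDD884MMKG_JJGH1222") is not None
```

The buggy class accepts the sample containing = and @, while the fixed class rejects it and still accepts the example string from the question.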
In Groovy, the $ character in a string introduces interpolation (e.g. Hello ${name}). These so-called GStrings are only processed when the string is enclosed in double quotes, which is why you need the extra escaping there.
Groovy also lets you write strings without that feature by enclosing them in ' (single quotes). Yet the easiest way to write a regexp is the slashy syntax with /.
assert "SDD884MMKG_JJGH1222" ==~ /^[a-zA-Z0-9*-_]+$/
See Regular Expressions for further "shortcuts".
The other points from @anubhava remain valid!
It's easier to reverse it:
String regEx = "^[^a-zA-Z0-9\\*\\-\\_]+\$" /* This matches everything _but_ alnum and *-_ */
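One way to read the "reverse" approach is to search for a single disallowed character anywhere in the input instead of matching the whole string. The Python sketch below assumes that reading; has_disallowed is a hypothetical helper name, and the pattern has no anchors so that one offending character is enough.

```python
import re

# Any one character that is not alphanumeric, '*', '_' or '-'.
DISALLOWED = re.compile(r"[^a-zA-Z0-9*_-]")


def has_disallowed(text: str) -> bool:
    """Return True if text contains at least one disallowed character."""
    return DISALLOWED.search(text) is not None
```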

In PostgreSQL, why is \s treated differently from \w?

Here is the example that confuses me:
select ' w' ~ '^\s\w$';
This results in "false", but seems like it should be true.
select ' w' ~ '^\\s\w*$';
This results in "true", but:
Why does \s need the extra backslash?
If it truly does, why does \w not need the extra backslash?
Thanks for any help!
I think you tested it the wrong way, because I'm getting the opposite of your results.
select ' w' ~ '^\s\w$';
Is returning 1 in my case. Which actually makes sense because it is matching the space at the beginning of the text, followed by the letter at the end.
select ' w' ~ '^\\s\w*$';
Is returning 0 and it makes sense too. Here you're trying to match a backslash at the beginning of the text followed by an s and then, by any number of letters, numbers or underscores.
A piece of text that would match your second regex would be: '\sw'
The string constants are first parsed and interpreted as strings, including escaped characters. Escaping of unrecognized sequences is handled differently by different parsers, but generally, besides errors, the most common behavior is to ignore the backslash.
In the first example, the right-hand string constant is first being interpreted as '^sw$', where both \s and \w are not recognized string escape sequences.
In the second example, the right-hand constant is interpreted as '^\sw*$', where the \\ in \\s is an escaped \.
After the strings are interpreted they are then applied as a regular expression, '^\sw*$' matching ' w' where '^sw$' does not.
Some languages use backslash as an escape character. Regexes do that, C-like languages do that, and some rare and odd dialects of SQL do that. PostgreSQL does it. PostgreSQL is translating the backslash escaping to arrive at a string value, and then feeding that string value to the regex parser, which AGAIN translates whatever backslashes survived the first translation, if any. In your first regex, none did.
For example, in a string literal or a regex, \n doesn't mean a backslash followed by a lowercase n. It means a newline. Depending on the language, a backslash followed by a lowercase s will mean either just a lowercase s, or nothing. In PostgreSQL, an invalid escape sequence in a string literal translates as the escaped character: '\w' translates to 'w'. All the regex parser sees there is the w. By chance, you used the letter w in the string you're matching against. It's not matching that w in the left-hand string because it's a word character; it's matching it because it's a lowercase w. Change it to lowercase x and it'll stop matching.
If you want to put a backslash in a string literal, you need to escape it with another backslash: '\\'. This is why \\s in your second regex worked. Add a second backslash to \w if you want to match any word character with that one.
This is a horrible pain. It's why JavaScript, Perl, and other languages have special conventions for regex literals like /\s\w/, and why C# programmers use the @"string literal" feature to disable backslash escaping in strings they intend to use as regexes.
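The two translation passes can be simulated in a few lines of Python. This is a simplified model, not PostgreSQL's actual parser: unescape_literal is a hypothetical helper that drops each backslash and keeps the following character, and it ignores special escapes such as \n. It corresponds to literals with backslash escapes enabled (E'...' strings, or standard_conforming_strings turned off). With standard_conforming_strings on, which is the default since PostgreSQL 9.1, ordinary '...' literals keep their backslashes, which would account for the opposite results reported in one of the answers.

```python
import re


def unescape_literal(literal: str) -> str:
    # Simplified first pass: a backslash is dropped and the character
    # after it is kept as-is. Special escapes (newline, tab, ...) are
    # deliberately not modelled.
    out = []
    chars = iter(literal)
    for ch in chars:
        if ch == "\\":
            out.append(next(chars, ""))
        else:
            out.append(ch)
    return "".join(out)


# What the regex engine receives after the first pass.
first = unescape_literal(r"^\s\w$")
second = unescape_literal(r"^\\s\w*$")

# Second pass: the regex engine interprets whatever backslashes survived.
first_matches = re.search(first, " w") is not None
second_matches = re.search(second, " w") is not None
```

In this model the first pattern reaches the regex engine as ^sw$ and fails on ' w', while the second arrives as ^\sw*$ and matches.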