C++ special characters in a regular expression


I have to parse a regular expression that can contain special symbols such as \s and \d. The problem is that I can't detect the \ while parsing the expression: '\s' == 's', so I cannot distinguish a special character from a basic character. How can I solve this?

Raw string literals, available since C++11, can improve readability:
"a\\sb" // matches: a[whitespace]b
"a\\\\sb" // matches: a\sb
becomes:
R"(a\sb)" // matches: a[whitespace]b
R"(a\\sb)" // matches: a\sb

You're confusing user input with character literals. In user input a backslash is an ordinary character; you detect it by comparing each input character with the character literal '\\'.
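A minimal sketch of that comparison, written in Python for brevity (the question is about C++, where the test would be c == '\\', but the scanning logic is the same). The function name tokenize and the token labels 'escape' and 'char' are made up for this illustration.

```python
def tokenize(pattern):
    """Split a regex pattern into ('escape', c) and ('char', c) tokens."""
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == '\\':  # character literal for ONE backslash
            if i + 1 >= len(pattern):
                raise ValueError("dangling backslash at end of pattern")
            # The character after the backslash is the special symbol.
            tokens.append(('escape', pattern[i + 1]))
            i += 2
        else:
            tokens.append(('char', ch))
            i += 1
    return tokens


# The raw string stands in for text typed by the user: a, \, s, b
print(tokenize(r"a\sb"))
# The same letter without a backslash is a plain character.
print(tokenize("asb"))
```

Because the backslash arrives as its own character in the input, \s and s produce different tokens.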

Related

How to capture a literal in antlr4?

I am looking to make a rule for a regex character class that is of the form:
character_range
: '[' literal '-' literal ']'
;
For example, with [1-5]+ I could match the string "1234543" but not "129". However, I'm having a hard time figuring out how to define a "literal" in antlr4. Normally I would use [a-zA-Z], but that is ASCII only and won't include something such as é. So how would I do that?

Actually, you don't want to match an entire literal, because a literal can be more than one character. Instead you only need a single character for the match.
In the parser:
character_range: OPEN_BRACKET LETTER DASH LETTER CLOSE_BRACKET;
And in the lexer:
OPEN_BRACKET: '[';
CLOSE_BRACKET: ']';
LETTER: [\p{L}];
The character class used in the LETTER lexer rule is Unicode Letters as described in the Unicode description file of ANTLR. Other possible character classes are listed in the UAX #44 Annex of the Unicode Character DB. You may need others like Numbers, Punctuation or Separators for all possible regex character classes.
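To see what the Unicode Letter category covers outside of ANTLR, here is a rough Python sketch that checks the same five-token shape by hand. The helper names is_letter and parse_character_range are hypothetical, and the check relies on unicodedata general categories starting with "L"; it is not how ANTLR implements \p{L}.

```python
import unicodedata


def is_letter(ch):
    """True if ch is a single code point in a Unicode Letter category (Lu, Ll, Lt, Lm, Lo)."""
    return len(ch) == 1 and unicodedata.category(ch).startswith('L')


def parse_character_range(text):
    """Return (low, high) for text shaped like '[x-y]' with letter endpoints, else None."""
    if len(text) == 5 and text[0] == '[' and text[2] == '-' and text[4] == ']':
        low, high = text[1], text[3]
        if is_letter(low) and is_letter(high):
            return low, high
    return None


print(parse_character_range('[a-\u00e9]'))  # endpoints 'a' and 'é' are both letters
print(parse_character_range('[1-5]'))       # digits are category Nd, not letters
```

The second call returns None, which mirrors the remark above that other categories such as Numbers are needed to cover every regex character class.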

You can also define a range of unicode characters. Try something like this in your lexer rules:
fragment LETTER: [a-zA-Z];
fragment LETTER_UNICODE: [\u0080-\uFFFF];
UTF8CHAR: ( LETTER | LETTER_UNICODE );

How to exclude part of a string using regex and append that part to the end of the string?

I've got a little problem with regex.
I have a few strings in one file that look like this:
TEST.SYSCOP01.D%%ODATE
TEST.SYSCOP02.D%%ODATE
TEST.SYSCOP03.D%%ODATE
...
What I need is to define the correct regex and change those string names to:
TEST.D%%ODATE.SYSCOP.#01
TEST.D%%ODATE.SYSCOP.#02
TEST.D%%ODATE.SYSCOP.#03
So far I have this regex:
r".SYSCOP[0-9]{2}.D%%ODATE" - for finding this in the file
But what should the replacement look like? I need the numbers from the original string at the end of the new string name.
.D%%ODATE.SYSCOP.# - this is just a string, not a regex, and it didn't work
Any ideas?

Find: (SYSCOP)(\d+)\.(D%%ODATE)
Replace: $3.$1.#$2 or \3.\1.#\2 for Python

You may use capturing groups with backreferences in the replacement part:
s = re.sub(r'(\.SYSCOP)([0-9]{2})(\.D%%ODATE)', r'\3\1.#\2', s)
Each \N in the replacement pattern refers to the Nth pair of parentheses (capturing group) in the pattern, so you can rearrange the matched parts as needed.
Note that . must be escaped to match a literal dot.
Note the raw string literals: the r prefix before a string literal helps you avoid excessive backslashes. '\3\1.#\2' is not the same as r'\3\1.#\2'; print both literals and see for yourself. Inside a raw string literal, string escape sequences such as \a, \f, \n or \r are not processed, so the backslash stays a literal backslash and is available to build regex escape sequences. (As patterns, r'\n' and '\n' both match a newline: the first is a regex escape sequence that matches a newline, and the second is a literal LF character.)
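A short runnable check of the substitution, using the three sample names from the question; the variable names lines and renamed are only for illustration.

```python
import re

lines = [
    "TEST.SYSCOP01.D%%ODATE",
    "TEST.SYSCOP02.D%%ODATE",
    "TEST.SYSCOP03.D%%ODATE",
]

# Group 1: ".SYSCOP", group 2: the two digits, group 3: ".D%%ODATE"
pattern = re.compile(r'(\.SYSCOP)([0-9]{2})(\.D%%ODATE)')

# Reorder the groups: date part first, then SYSCOP, then ".#" and the digits.
renamed = [pattern.sub(r'\3\1.#\2', line) for line in lines]
for name in renamed:
    print(name)

# Without the r prefix, \3, \1 and \2 are octal string escapes (control
# characters), so the replacement no longer contains any backreference.
print(len(r'\3\1.#\2'), len('\3\1.#\2'))
```

The raw replacement is 8 characters long, while the non-raw one collapses to 5, which is why the two literals behave differently.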

How is python regex '\\\\' evaluated? [duplicate]

I'm reading the Python documentation for the re library and am quite confused by the following paragraph:
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
How is \\\\ evaluated?
\\\\ -> \\\ -> \\ cascadingly
or \\\\ -> \\ in pairs?
I know \ is a meta character just like |, I can do
>>> re.split('\|', 'a|b|c|d') # split by literal '|'
['a', 'b', 'c', 'd']
but
>>> re.split('\\', 'a\b\c\d') # split by literal '\'
Traceback (most recent call last):
gives me an error; it seems that, unlike \|, the \\ is evaluated more than once.
And I tried
>>> re.split('\\\\', 'a\b\c\d')
['a\x08', 'c', 'd']
which makes me even more confused...

There are two things going on here - how strings are evaluated, and how regexes are evaluated.
'a\b\c\d' in Python code represents the string a<backspace>\c\d.
'\\\\' in Python code represents the string \\.
The string \\ is a regex pattern that matches the character \.
Your problem here is that the string you're searching is not what you expect.
\b is the backspace character, \x08. \c and \d are not valid string escape sequences, so Python keeps the backslash; such sequences have triggered a DeprecationWarning since Python 3.6 and are expected to become an error in a future version.
I assume you meant to write r'a\b\c\d' or 'a\\b\\c\\d'.

re.split('\\', 'a\b\c\d') # split by literal '\'
You forgot that \ in the second string is an escape character as well. It works if both arguments are changed:
re.split(r'\\', 'a\\b\\c\\d')
The r prefix marks a "raw" string: escape sequences are not processed.

Think about the implications of evaluating backslashes cascadingly:
If you wanted the string \n (not the newline symbol, but literally \n), you couldn't find a sequence of characters to get said string.
\n would be the newline symbol, \\n would be evaluated to \n, which in turn would become the newline symbol again. This is why escape sequences are evaluated in pairs.
So you need to write \\ within a string literal to get a single \, but the string itself must contain two backslashes for the regex to match a literal \. Therefore you need to write \\\\ to match a literal backslash.
You have a similar problem with your a\b\c\d string. The parser will try to evaluate the escape sequences, and \b is a valid sequence for 'backspace', represented as \x08. You need to escape the backslashes here, too, like a\\b\\c\\d.
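The two layers can be checked directly in an interpreter. This is a small sketch; the variable names pattern, text and parts are only for illustration.

```python
import re

# Layer 1, the string literal: four typed backslashes become two characters.
pattern = '\\\\'
print(len(pattern))        # 2
print(pattern == r'\\')    # True: the raw literal spells the same string

# Layer 2, the regex: the two-character pattern \\ matches ONE backslash.
text = r'a\b\c\d'          # raw, so the backslashes stay in the data
parts = re.split(pattern, text)
print(parts)               # ['a', 'b', 'c', 'd']

# A pattern of a single backslash is an incomplete regex escape.
try:
    re.compile('\\')
except re.error as exc:
    print("re.error:", exc)

# In a non-raw literal, \b is one backspace character, not a backslash and a b.
print('a\b' == 'a\x08')    # True
```

The evaluation is in pairs at each layer: \\\\ becomes \\ when the literal is read, and \\ becomes a single literal \ when the regex is compiled.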

Lex/Flex: Regular expression for string literals in C/C++?

I am looking at the ANSI C grammar.
That page includes a lot of regular expressions in Lex/Flex for ANSI C.
I am having trouble understanding the regular expression for string literals.
They have mentioned regular expression as \"(\\.|[^\\"])*\"
As I understand it, \" is used for double quotes, \\ is for the escape character, . is for any character except the escape character, and * is for zero or more times.
[^\\"] implies characters except \ , " .
So, in my opinion, the regular expression should be \"(\\.)*\".
Can you give some strings for which this regular expression fails?
Or:
Why have they used [^\\"]?

The regex \"(\\.)*\" that you proposed matches strings that consist of \ symbols alternating with any characters like:
"\z\x\p\r"
This regular expression would therefore fail to match a string like:
"hello"
The string "hello" would be matched by the regex \".*\" but that would also match the string """" or "\" both of which are invalid.
To get rid of these invalid matches we can use \"[^\\"]*\", but this will now fail to match a string like "\a\a\a" which is a valid string.
As we saw \"(\\.)*\" does match this string, so all we need to do is combine these two to get \"(\\.|[^\\"])*\".
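The same comparison can be run with Python's re module. This is a sketch under the assumption that the Lex pattern translates directly: the \" of Lex becomes a plain " in Python, and fullmatch stands in for Lex matching the whole token. The names STRING_LITERAL, NAIVE and candidates are made up for the example.

```python
import re

# Lex: \"(\\.|[^\\"])*\"   (an escaped pair, or any char except \ and ")
STRING_LITERAL = re.compile(r'"(\\.|[^\\"])*"')
# Lex: \"(\\.)*\"          (only escaped pairs between the quotes)
NAIVE = re.compile(r'"(\\.)*"')

candidates = [
    r'"hello"',         # plain characters only
    r'"\z\x\p\r"',      # escaped pairs only
    r'"say \"hi\""',    # a mix, with escaped quotes inside
    r'"\"',             # invalid: the closing quote is escaped
    '""""',             # invalid as ONE literal: it is two empty strings
]

for text in candidates:
    full = STRING_LITERAL.fullmatch(text) is not None
    naive = NAIVE.fullmatch(text) is not None
    print(f'{text!r:20} combined={full!s:5} escapes-only={naive}')
```

Only the combined pattern accepts all three valid literals while rejecting the two invalid ones.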

In PostgreSQL, why is \s treated differently from \w?

Here is the example that confuses me:
select ' w' ~ '^\s\w$';
This results in "false", but seems like it should be true.
select ' w' ~ '^\\s\w*$';
This results in "true", but:
Why does \s need the extra backslash?
If it truly does, why does \w not need the extra backslash?
Thanks for any help!

I think you tested it the wrong way, because I'm getting the opposite of your results.
select ' w' ~ '^\s\w$';
This returns 1 in my case, which makes sense because it matches the space at the beginning of the text, followed by the letter at the end.
select ' w' ~ '^\\s\w*$';
This returns 0, and that makes sense too. Here you're trying to match a backslash at the beginning of the text, followed by an s and then by any number of letters, numbers or underscores.
A piece of text that would match your second regex would be: '\sw'

The string constants are first parsed and interpreted as strings, including escaped characters. Unrecognized escape sequences are handled differently by different parsers; apart from raising an error, the most common behavior is to drop the backslash.
In the first example, the right-hand string constant is first interpreted as '^sw$', because neither \s nor \w is a recognized string escape sequence.
In the second example, the right-hand constant is interpreted as '^\sw*$', because \\s escapes the \.
After the strings are interpreted, they are applied as regular expressions: '^\sw*$' matches ' w', whereas '^sw$' does not.
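The two passes can be imitated in Python. This is only a simulation under stated assumptions: unescape_legacy is a hypothetical helper that models backslash processing in the string constant in a simplified way (a backslash is dropped and the next character kept, with no handling of \n, octal escapes and so on), and Python's re stands in for the database's regex engine.

```python
import re


def unescape_legacy(literal):
    """Simplified model of pass 1: drop each backslash, keep the next character."""
    out = []
    i = 0
    while i < len(literal):
        if literal[i] == '\\' and i + 1 < len(literal):
            out.append(literal[i + 1])  # '\\' -> '\', '\s' -> 's', '\w' -> 'w'
            i += 2
        else:
            out.append(literal[i])
            i += 1
    return ''.join(out)


# Raw strings hold the text exactly as typed between the SQL quotes.
first = unescape_legacy(r'^\s\w$')      # what the regex engine receives
second = unescape_legacy(r'^\\s\w*$')

print(first, re.search(first, ' w') is not None)
print(second, re.search(second, ' w') is not None)
```

After the first pass the patterns are ^sw$ and ^\sw*$, so only the second one still contains a regex escape and matches ' w'.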

Some languages use backslash as an escape character. Regexes do that, C-like languages do that, and some SQL dialects do that. PostgreSQL does it in E'...' strings, and in ordinary '...' strings when standard_conforming_strings is off (the default before version 9.1). In that case PostgreSQL translates the backslash escapes to arrive at a string value, and then feeds that string value to the regex parser, which AGAIN translates whatever backslashes survived the first translation, if any. In your first regex, none did.
For example, in a string literal or a regex, \n doesn't mean a backslash followed by a lowercase n. It means a newline. Depending on the language, a backslash followed by a lowercase s will mean either just a lowercase s, or nothing. In PostgreSQL, an unrecognized escape sequence in a string literal translates to the escaped character: '\w' translates to 'w'. All the regex parser sees there is the w. By chance, you used the letter w in the string you're matching against. The pattern isn't matching that w in the left-hand operand because it's a word character; it's matching it because it's a lowercase w. Change it to a lowercase x and it'll stop matching.
If you want to put a backslash in a string literal, you need to escape it with another backslash: '\\'. This is why \\s in your second regex worked. Add a second backslash to \w if you want that one to match any word character.
This is a horrible pain. It's why JavaScript, Perl, and other languages have special conventions for regex literals like /\s\w/, and why C# programmers use the @"string literal" feature (verbatim strings) to disable backslash escaping in strings they intend to use as regexes.
