allow only specific escape characters in regular expression

I have been looking through other regex questions, but have not been able to find an answer. I am working on a grammar in ANTLR4 and there is a regular expression that has been eluding me.
I am trying to match any character, except for \ followed by anything other than [btnrf"\].
I have tried ( ~([\\][.]) | [\\][btnrf"\] ) but the ~ only negates a single character as far as I can tell. I get the error:
error AC0050: extraneous input '[.]' expecting RPAREN while looking for lexer rule element
It seems like it shouldn't be too hard to exclude \* but allow the small list of acceptable escaped characters. I have been on http://www.regex101.com and I don't have any trouble matching the allowable characters, but for some reason I just can't figure out how to disallow escape characters besides the ones mentioned above, while also allowing all other characters.
Manually specifying every valid input character seems like overkill, but that may be what it comes down to. Something like:
[a-zA-Z0-9_!@#$%^&*()\-+=/.,<>;':\b\t\n\r\f\"\\]*
That may not be 100% valid, but the idea is just to list all valid characters, which by default would exclude any invalid escape sequences. It seems like there should be a simpler way. Any tips or links to useful information would be greatly appreciated.
The actual rule that I have so far, which allows anything enclosed in double quotes as a valid string:
STRING : '"' (~[\"] | '\\"')* '"';

I don't have ANTLR handy, but the following seems to do what you're after :
\([^\\].\)\|\(\\[btnrf\\"\\\\]\)
In effect, this allows "EITHER anything other than a backslash followed by any character, OR a backslash followed by one of the specified characters".
For example, put that string in a file regexfile, and take a datafile containing
\a
\b
\\
xy
Then grep -f regexfile datafile will exclude the \a and return:
\b
\\
xy
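
The same idea can be checked outside ANTLR. The following is a minimal Python sketch, not ANTLR syntax: it assumes a string literal is a double quote, then any run of items that are either a character other than a quote or backslash, or a backslash followed by one of b t n r f " \, and then a closing quote. The names STRING_LITERAL and is_valid_string are made up for illustration.

```python
import re

# Hypothetical pattern: a quoted string whose only legal escapes are
# \b \t \n \r \f \" and \\ . Any other character is allowed unescaped,
# except a bare double quote or a bare backslash.
STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\[btnrf"\\])*"')

def is_valid_string(text):
    """Return True if the whole text is one valid string literal."""
    return STRING_LITERAL.fullmatch(text) is not None

print(is_valid_string(r'"tab\there"'))      # allowed escape
print(is_valid_string(r'"bad \a escape"'))  # \a is not in the allowed list
```

In ANTLR4 lexer syntax the two alternatives would presumably be written as ~["\\] and '\\' [btnrf"\\], but that has not been tested here.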

Related

Flex regular expression String [duplicate]

I've got a regular expression that matches strings opening with " and closing with " and can contain \".
The regular expression is this \"".*[^\\]"\".
I don't understand what's the " that is followed after \" and after the [^\\].
Also this regular expression works when I have a \n inside a string but the . rule on flex doesn't match a \n.
I just tested for example the string "aaaaa\naaa\naaaa".
It matched it with no problem.
I made a regex for flex that matches what I need. It's this one \"(([^\\\"])|([\\\"]))*\". I understand how this works though.
Also, I just tested my solution against "", an empty string, and it doesn't work. The patterns from all the answers have been tested too, and they don't work either.

The pattern is naive and, in fact, wrong. It doesn't handle escaped quotes correctly, because it assumes that the closing quote is the first one that is not preceded by a backslash. That assumption is false.
The closing quote can be preceded by a literal backslash (a backslash that is escaped by another backslash, so the second backslash no longer escapes the quote). Example: "abcde\\" (the content of this string is abcde\).
This is the pattern to deal with all cases:
\"[^"\\]*(?s:\\.[^"\\]*)*\"
or perhaps (I don't know exactly where you need to escape literal quotes in a flex pattern):
\"[^\"\\]*(?s:\\.[^\"\\]*)*\"
Note that the s modifier allows the dot to match newlines inside the non-capturing group.
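
To see how the unrolled pattern behaves, here is a small Python sketch rather than flex. It assumes re.DOTALL as the counterpart of the (?s:...) group, and the names QUOTED and first_string are invented for the example.

```python
import re

# Unrolled form: opening quote, a run of ordinary characters, then any number
# of (backslash + any character, newline included) each followed by more
# ordinary characters, then the closing quote.
QUOTED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"', re.DOTALL)

def first_string(text):
    """Return the first quoted string found in text, or None."""
    m = QUOTED.search(text)
    return m.group(0) if m else None

# The closing quote is found even though a backslash precedes it,
# because that backslash is itself escaped.
print(first_string(r'"abcde\\" rest'))
```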

I just figured out everything :P
This \"".*[^\\]"\" works because in flex it means: I want to match something that starts with " and ends with ". Inside these quotes there will be another matching pattern (that's why there are the unexplained " characters I was wondering about in my question) that can be any sequence of characters but CANNOT end with \.
What confused me more was the use of ., because in flex it matches any character except a newline \n. So I mistakenly thought that it wouldn't match a string such as "aaa\naaa".
But in reality it does match, because flex reads the \ first and then the n, as two separate characters.
A TRUE newline would be something like this:
"something
like
this"
But compilers in -ansi C, for example (I haven't tested other versions), do not let you declare a string that spans several lines.
I hope my answer is clear enough. Cheers.

Your pattern does not match "hello" but it matches ""hello"".
If you want to match anything that is in quotes and may contain \", try something like:
/(\"[\na-zA-Z\\"]*\")/gs

Replace quotes inside quoted string with escaped quotes in notepad++?

I am using Notepad++ to find (".*)"(.*) and replace it with \1\"\2 but it doesn't seem to work. I don't know why.
Example:
Someone said "My name is "sean""
I want it to be:
Someone said "My name is \"sean\""
Edit: In my case the closing quote is always at the end of the line, so will (".*)"(.*"$) work?
Edit2: Also, the first quote is preceded by a comma, so I will use (,".*)"(.*"$). It may not work in some cases, but I think it will work with my file.
Now there is a problem with the replacement: it doesn't add \", it just adds some space.

It should work... you just need to do a little fixing...
The Find what regex should be ("[^"]*)("\w*)(")([^"]*")
The Replace with expression should be \1\\\2\\\3\4
Make sure you select the Search Mode to be "Regular expression"
Explanation...
This is quite tricky - I've assumed that the quoted text WITHIN quotes is just a single word. If you assume something else it becomes very hard to pin down.
You need to find a
" followed by
[^"]* - any number of characters that are NOT a " and then
("\w*)(") - a quoted word, and then finally
([^"]*") - any additional number of non-quote characters + a final quote
This is important because regular expression matching is greedy by default, and a .* would continue to match all characters, including ", until the end of the string.
In the replacement string you need to have \\ to represent a single \.
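
The find/replace pair can be tried outside Notepad++. This is only a sketch in Python, whose re module happens to accept the same group references and \\ in the replacement; Notepad++ uses a different regex engine, so treat it as a check of the logic, not of the editor. FIND, REPLACE and escape_inner_quotes are names chosen for the example.

```python
import re

# Same four groups as above: opening quote and text, the inner quoted word,
# its closing quote, and the rest up to the final quote.
FIND = r'("[^"]*)("\w*)(")([^"]*")'
# \\ in the replacement produces one literal backslash.
REPLACE = r'\1\\\2\\\3\4'

def escape_inner_quotes(line):
    """Escape the quotes around a single quoted word inside a quoted string."""
    return re.sub(FIND, REPLACE, line)

print(escape_inner_quotes('Someone said "My name is "sean""'))
```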

Matching `\` in Flex

I am trying to create a simple state machine in flex which has to ensure that strings spanning multiple lines must have \ for line breaks. Concretely:
"this is \
ok"
"this is not
ok"
The first one is valid. The second one is not.
I have the following state machine:
expectstring BEGIN(expectstr);
<expectstr>[^\n] {num_lines++;}
<expectstr>\ {flag = true;}
<expectstr>\n {printf("%s\n", flag ? "True" : "False");}
But when I try to compile this state machine, flex tells me that the rule with \ can not be matched. Why is that?
I have looked at this but cannot figure it out.

In flex, the following pattern matches anything other than a newline:
.
You can also write that as
[^\n]
but . is more normal.
In order to match a backslash you can write
\\
"\\"
[\\]
Again, the first would be the usual way.
It's important to understand that [...] is a way of representing a set of characters, and that most regular expression operators are just ordinary characters inside the brackets. Similarly, "..." is a way of representing a sequence of characters, and most regular expression operators are just ordinary characters inside the quotes.
Thus,
[a|b] matches one character if it is an a, a |, or a b
"a|b" matches the three-character sequence a | b
and|but matches either of the three-character sequences and or but.
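
The three cases can be checked with a quick Python sketch. Python's re module has no "..." quoting operator, so re.escape is used here as a stand-in for flex's quoted form; that substitution is an assumption of this example, not something flex does.

```python
import re

# Inside [...] the | is an ordinary character: one of 'a', '|', 'b'.
in_class = re.compile(r'[a|b]')
# Stand-in for flex's "a|b": the literal three-character sequence a | b.
literal = re.compile(re.escape('a|b'))
# Outside brackets and quotes, | is alternation.
either = re.compile(r'and|but')

print(bool(in_class.fullmatch('|')))
print(bool(literal.fullmatch('a|b')))
print(bool(either.fullmatch('but')))
```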
Since flex lets you match regular expressions, you really don't need to manually build a state machine. Just use an appropriate regular expression. For example, the following will match strings which start and end with ", in which \ may be used to escape itself as well as newlines, and in which newlines (other than escaped ones) are illegal. I think that's your goal.
\"([^"\n\\]|\\(.|\n))*\"
You should make sure you understand how it works; there are lots of good explanations of regular expressions on the internet (and even more bad ones, so try to find one written by someone who knows what they are talking about). Here's the summary:
\"              A literal double-quote
(...)*          Any number of repetitions of:
  [^"\n\\]        Anything other than a double-quote, newline, or backslash
  |               Or
  \\              A literal backslash, followed by
  (...)           Grouping
    .               Anything other than a newline
    |               Or
    \n              A newline
\"              A literal double-quote
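
As a rough check of the pattern, here it is transcribed for Python's re module (a sketch, not flex; STRING and is_valid are illustrative names). The two inputs are the valid and invalid strings from the question.

```python
import re

# Ordinary characters exclude the quote, newline, and backslash; a backslash
# may be followed by any character, including a newline.
STRING = re.compile(r'"(?:[^"\n\\]|\\(?:.|\n))*"')

def is_valid(text):
    """Return True if text is a single well-formed string."""
    return STRING.fullmatch(text) is not None

print(is_valid('"this is \\\nok"'))    # newline escaped with a backslash
print(is_valid('"this is not\nok"'))   # bare newline inside the string
```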

In PostgreSQL, why is \s treated differently from \w?

Here is the example that confuses me:
select ' w' ~ '^\s\w$';
This results in "false", but seems like it should be true.
select ' w' ~ '^\\s\w*$';
This results in "true", but:
Why does \s need the extra backslash?
If it truly does, why does \w not need the extra backslash?
Thanks for any help!

I think you tested it incorrectly, because I'm getting the opposite of your results.
select ' w' ~ '^\s\w$';
This returns 1 in my case, which makes sense because it matches the space at the beginning of the text, followed by the letter at the end.
select ' w' ~ '^\\s\w*$';
This returns 0, and that makes sense too. Here you're trying to match a backslash at the beginning of the text, followed by an s and then by any number of letters, digits or underscores.
A piece of text that would match your second regex would be: '\sw'
Check the fiddle here.

The string constants are first parsed and interpreted as strings, including escaped characters. Escaping of unrecognized sequences is handled differently by different parsers, but generally, besides errors, the most common behavior is to ignore the backslash.
In the first example, the right-hand string constant is first being interpreted as '^sw$', where both \s and \w are not recognized string escape sequences.
In the second example the right-hand constant is interpreted as '^\sw*$', where \\s escapes the \.
After the strings are interpreted they are then applied as a regular expression, '^\sw*$' matching ' w' where '^sw$' does not.
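
The two stages described above can be imitated in a few lines of Python. This is a simplified model, not PostgreSQL's actual parser: it assumes that a handful of known escapes are translated and that, for anything else, the backslash is dropped and the next character kept. KNOWN, parse_literal and matches are hypothetical names.

```python
import re

# Assumed escape table for the model; a real parser recognizes more.
KNOWN = {'n': '\n', 't': '\t', 'r': '\r', 'b': '\b', 'f': '\f', '\\': '\\'}

def parse_literal(body):
    """Stage 1: turn the text between the quotes into the string value."""
    return re.sub(r'\\(.)',
                  lambda m: KNOWN.get(m.group(1), m.group(1)),
                  body, flags=re.DOTALL)

def matches(subject, literal_body):
    """Stage 2: use the parsed value as a regular expression."""
    return re.search(parse_literal(literal_body), subject) is not None

print(parse_literal(r'^\s\w$'))     # both backslashes are lost
print(parse_literal(r'^\\s\w*$'))   # one backslash survives, in front of s
```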

Some languages use backslash as an escape character. Regexes do that, C-like languages do that, and some rare and odd dialects of SQL do that. PostgreSQL does it. PostgreSQL translates the backslash escapes to arrive at a string value, and then feeds that string value to the regex parser, which AGAIN translates whatever backslashes survived the first translation -- if any. In your first regex, none did.
For example, in a string literal or a regex, \n doesn't mean a backslash followed by a lowercase n. It means a newline. Depending on the language, a backslash followed by a lowercase s will mean either just a lowercase s, or nothing. In PostgreSQL, an invalid escape sequence in a string literal translates as the escaped character: '\w' translates to 'w'. All the regex parser sees there is the w. By chance, you used the letter w in the string you're matching against. It's not matching that w in the left-hand string because it's a word character; it's matching it because it's a lowercase w. Change it to lowercase x and it'll stop matching.
If you want to put a backslash in a string literal, you need to escape it with another backslash: '\\'. This is why \\s in your second regex worked. Add a second backslash to \w if you want to match any word character with that one.
This is a horrible pain. It's why JavaScript, Perl, and other languages have special conventions for regex literals like /\s\w/, and why C# programmers use the @"string literal" feature to disable backslash escaping in strings they intend to use as regexes.
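
Python's raw string literals are another instance of the same convention. A brief sketch (the variable names are arbitrary):

```python
import re

# In an ordinary literal, each regex backslash must itself be escaped.
escaped = '^\\s\\w$'
# A raw literal switches off string-level escaping, so the regex is
# written exactly as the regex parser should see it.
raw = r'^\s\w$'

print(escaped == raw)
print(bool(re.search(raw, ' w')))
```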

c++ regexp for not preceded by backslash and preceded by backslash

I can only find negative lookbehind for this, something like (?<!\\).
But this won't compile in C++ or flex. It seems that neither regex.h nor flex supports this?
I am trying to implement a shell that has to treat special characters such as >, < or | as part of a normal argument string if they are preceded by a backslash. In other words, a special character should be treated as special only if it is preceded by zero or an even number of '\'.
So echo \\>a or echo abc>a should direct output to a
but echo \>a should print >a
What regular expression should I use?
I'm using flex and yacc to parse the input.

In a Flex rule file, you'd use \\ to match a single backslash '\' character. This is because the \ is used as an escape character in Flex.
BACKSLASH \\
LITERAL_BACKSLASH \\\\
LITERAL_LESSTHAN \\<
LITERAL_GREATERTHAN \\>
LITERAL_VERTICALBAR \\\|
If I follow you correctly, in your case you want "\>" to be treated as literal '>' but "\\>" to be treated as literal '\' followed by special redirect. You don't need negative look behind or anything particularly special to accomplish this as you can build one rule that would accept both your regular argument characters and also the literal versions of your special characters.
For purposes of discussion, let's assume that your argument/parameter can contain any character but ' ', '\t', and the special forms of '>', '<', '|'. The rule for the argument would then be something like:
ARGUMENT ([^ \t\\><|]|\\\\|\\>|\\<|\\\|)+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\\\ matches any instance of "\\" (i.e. a literal backslash)
\\> matches any instance of "\>" (i.e. a literal greater than)
\\< matches any instance of "\<" (i.e. a literal less than)
\\\| matches any instance of "\|" (i.e. a literal vertical bar/pipe)
Actually... You can probably just shorten that rule to:
ARGUMENT ([^ \t\\><|]|\\[^ \t\r\n])+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\[^ \t\r\n] matches any character preceded by a '\' in your input except for whitespace (which will handle all of your special characters and allow for literal forms of all other characters)
If you want to allow for literal whitespace in your arguments/parameters then you could shorten the rule even further but be careful with using \\. for the second half of the rule alternation as it may or may not match " \n" (i.e. eat your trailing command terminator character!).
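
The shortened ARGUMENT rule can be tried on the three commands from the question. This is a Python sketch, not a flex scanner: it assumes the unescaped operators are matched by a second alternative, [><|], and that whitespace is simply skipped. TOKEN and tokenize are names invented here.

```python
import re

# ARGUMENT from the shortened rule, plus the three unescaped operators.
ARGUMENT = r'(?:[^ \t\\><|]|\\[^ \t\r\n])+'
TOKEN = re.compile(ARGUMENT + r'|[><|]')

def tokenize(command):
    """Split a command line into arguments and redirect/pipe operators."""
    return TOKEN.findall(command)

print(tokenize(r'echo \>a'))    # the escaped > stays inside the argument
print(tokenize(r'echo \\>a'))   # escaped backslash, then a real redirect
print(tokenize('echo abc>a'))
```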
Hope that helps!

You cannot easily extract single escaped characters from a command-line, since you will not know the context of the character. In the simplest case, consider the following:
LessThan:\<
BackslashFrom:\\<
In the first one, < is an escaped character; in the second one, it is not. If your language includes quotes (as most shells do), things become even more complicated. It's a lot better to parse the string left to right, one entity at a time. (I'd use flex myself, because I've stopped wasting my time writing and testing lexers, but you might have some pedagogical reason to do so.)
If you really need to find a special character which shouldn't be special, just search for it (in C++98, where you don't have raw literals, you'll have to escape all of the backslashes):
regex: (\\\\)*\\[<>|]
(An even number -- possibly 0 -- of \, then a \ and a <, > or |)
as a C string => "(\\\\\\\\)*\\\\[<>|]"
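
Here is the same search as a Python sketch. One caveat, which is an addition of this example and not part of the regex above: an unanchored search can start in the middle of a run of backslashes, so \\< would be reported as containing \<. The sketch therefore puts a guard in front (start of input or a non-backslash character). ESCAPED_SPECIAL and find_escaped are hypothetical names.

```python
import re

# Guard, then an even number of backslashes, then a backslash and <, > or |.
# Only the part after the guard is captured.
ESCAPED_SPECIAL = re.compile(r'(?:^|[^\\])((?:\\\\)*\\[<>|])')

def find_escaped(command):
    """Return each escaped <, > or | together with its leading backslashes."""
    return ESCAPED_SPECIAL.findall(command)

print(find_escaped(r'echo \>a'))    # one escaped >
print(find_escaped(r'echo \\>a'))   # nothing: the > here is a real redirect
```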
