Flex regular expression String [duplicate] - regex

I've got a regular expression that matches strings opening with " and closing with " and can contain \".
The regular expression is this \"".*[^\\]"\".
I don't understand the purpose of the " that follows \" and of the " that follows [^\\].
Also this regular expression works when I have a \n inside a string but the . rule on flex doesn't match a \n.
I just tested for example the string "aaaaa\naaa\naaaa".
It matched it with no problem.
I made a regex for flex that matches what I need. It's this one \"(([^\\\"])|([\\\"]))*\". I understand how this works though.
I also tested my solution against an empty string (""). It doesn't work. I tested all of the posted answers as well, and they don't work either.

The pattern is a little naive and, in fact, wrong. It doesn't handle escaped quotes correctly because it assumes that the closing quote is the first one that is not preceded by a backslash. This is a false assumption.
The closing quote can be preceded by a literal backslash (a backslash that is escaped by another backslash, so the second backslash no longer escapes the quote). Example: "abcde\\" (the content of this string is abcde\).
This is the pattern to deal with all cases:
\"[^"\\]*(?s:\\.[^"\\]*)*\"
or perhaps (I don't know exactly where you need to escape literal quotes in a flex pattern):
\"[^\"\\]*(?s:\\.[^\"\\]*)*\"
Note that the s modifier allows the dot to match newlines inside the non-capturing group.
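To make the difference concrete, here is a small sketch in Python rather than flex (an assumption on my part: Python's re module accepts the same (?s:...) group since 3.6, but flex syntax differs in places). NAIVE is my own stand-in for the "first quote not preceded by a backslash" idea, not the asker's exact flex rule.

```python
import re

# Hypothetical stand-in for the naive idea: stop at the first quote
# that is not preceded by a backslash.
NAIVE = re.compile(r'".*?[^\\]"', re.DOTALL)
# The full pattern from the answer; (?s:...) lets the dot match newlines
# inside that group only.
FULL = re.compile(r'"[^"\\]*(?s:\\.[^"\\]*)*"')

# The first literal ends with an escaped backslash, then a second literal follows.
text = r'"abcde\\" and "x"'

naive_match = NAIVE.search(text).group()
full_match = FULL.search(text).group()
print(naive_match)  # "abcde\\" and "   (runs past the real closing quote)
print(full_match)   # "abcde\\"

# The empty string literal: the naive form needs at least one character inside.
empty_naive = NAIVE.fullmatch('""')
empty_full = FULL.fullmatch('""')
print(empty_naive is None, empty_full is not None)  # True True
```

The naive form overshoots because the quote after \\ looks "escaped" to it, and it rejects "" because it always demands one non-backslash character before the closing quote.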

I just figured out everything :P
This \"".*[^\\]"\" works because in flex it means: I want to match something that starts with " and ends with ". Inside these quotes there will be another matching pattern (that's why there are the unexplained " that I was wondering about in my question) that can be any sequence of characters, but CANNOT end with \.
What confused me more was the use of ., because in flex it matches any character except a newline \n. So I mistakenly thought that it wouldn't match a string such as "aaa\naaa".
But in reality it does match, because flex reads the two characters \ and n, not a newline.
A TRUE newline would be something like this:
"something
like
this"
But C compilers in -ansi mode, for example (I haven't tested other modes), do not let you declare a string literal that spans several lines.
I hope my answer is clear enough. Cheers.

Your pattern does not match "hello" but it matches ""hello"".
If you want to match anything that is in quotes and may contain \", try something like:
/(\"[\na-zA-Z\\"]*\")/gs

Related

How to exclude part of a string using regex and add that part at the end of the string?

I've got a little problem with regex.
I got few strings in one file looking like this:
TEST.SYSCOP01.D%%ODATE
TEST.SYSCOP02.D%%ODATE
TEST.SYSCOP03.D%%ODATE
...
What I need is to define the correct regex and change those string names to:
TEST.D%%ODATE.SYSCOP.#01
TEST.D%%ODATE.SYSCOP.#02
TEST.D%%ODATE.SYSCOP.#03
So far I have this regex:
r".SYSCOP[0-9]{2}.D%%ODATE" - for finding this in the file
But what should the replacement look like? I need the numbers from the string at the end of the new string name.
.D%%ODATE.SYSCOP.# - this is just a string, not a regex, and it didn't work
Any idea?
Find: (SYSCOP)(\d+)\.(D%%ODATE)
Replace: $3.$1.#$2 or \3.\1.#\2 for Python
You may use capturing groups with backreferences in the replacement part:
s = re.sub(r'(\.SYSCOP)([0-9]{2})(\.D%%ODATE)', r'\3\1.#\2', s)
Each \N in the replacement pattern refers to the Nth capturing group (pair of parentheses) in the pattern, so you may rearrange the matched value as you need.
Note that . must be escaped to match a literal dot.
Please mind the raw string literal, the r prefix before the string literals helps you avoid excessive backslashes. '\3\1.#\2' is not the same as r'\3\1.#\2', you may print the string literals and see for yourself. In short, inside raw string literals, string escape sequences like \a, \f, \n or \r are not recognized, and the backslash is treated as a literal backslash, just the one that is used to build regex escape sequences (note that r'\n' and '\n' both match a newline since the first one is a regex escape sequence matching a newline and the second is a literal LF symbol.)
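A short, self-contained run of the approach above on the three sample names from the question (the list name `names` and the variable `renamed` are just illustrative):

```python
import re

names = [
    "TEST.SYSCOP01.D%%ODATE",
    "TEST.SYSCOP02.D%%ODATE",
    "TEST.SYSCOP03.D%%ODATE",
]

# Three capturing groups: ".SYSCOP", the two digits, ".D%%ODATE".
pattern = re.compile(r'(\.SYSCOP)([0-9]{2})(\.D%%ODATE)')

# \3\1.#\2 reorders the groups: date part, SYSCOP part, then ".#" and the digits.
renamed = [pattern.sub(r'\3\1.#\2', name) for name in names]
for new_name in renamed:
    print(new_name)  # e.g. TEST.D%%ODATE.SYSCOP.#01

# Why the r prefix matters: without it, \3, \1 and \2 are octal character escapes.
raw_len = len(r'\3\1.#\2')     # 8 characters, backslashes kept
cooked_len = len('\3\1.#\2')   # 5 characters, three of them control characters
print(raw_len, cooked_len)     # 8 5
```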

Why do I need to write \\d instead of \d in a C++ regex?

I'm starting to learn about regular expressions and I have written some code in C++.
My task is: implement a function that replaces each digit in the given string with a '#' character.
For my example, the input string is "12 points".
I know I need to use \d to match a digit. I tried this: std::regex_replace(input,std::regex("\d"),"#");
but it does not work: the output is still "12 points".
Then I searched the internet and the result is:
std::regex_replace(input,std::regex("\\d"),"#");
The output is "## points".
Can anyone help me understand what "\\d" means?
\d matches a decimal digit. However, inside a C++ string literal the \ is a special (escape) character, so it needs to be escaped itself: in "\\d" you escape the \ so that it reaches the regex engine as an ordinary backslash instead of being consumed by the compiler.
When you use "\d" in a C++ application, the \ is an escape character in C++. So it doesn't treat the following d as a d.
Regex then gets a string that doesn't have \d in it. "\d" is not a valid C++ escape sequence; to my knowledge compilers such as GCC and Clang warn about it and drop the backslash, leaving just d, which is why nothing in "12 points" was replaced.
When you use "\\d" you are escaping the \, so C++ reads the string as \d, as you intended.
An example of when you'd use an escape character, is when you want to output a quote. "\"" would output a single double quote.
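The same two-layer idea can be checked quickly in Python, used here only as an analogy (its string-literal rules are similar to, but not the same as, C++'s). The sketch assumes the regex that reached the C++ engine was effectively just d.

```python
import re

# In an ordinary string literal, "\\d" is two characters: a backslash and a d.
pattern = "\\d"
pattern_len = len(pattern)
same_as_raw = (pattern == r"\d")
print(pattern_len, same_as_raw)  # 2 True

# The regex engine receives \d and replaces every digit.
digits_replaced = re.sub(pattern, "#", "12 points")
print(digits_replaced)  # ## points

# If the backslash is lost before the regex engine sees it, the pattern is just
# the letter d, and "12 points" contains no d, so nothing changes.
letter_replaced = re.sub("d", "#", "12 points")
print(letter_replaced)  # 12 points
```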

Replace quotes inside quoted string with escaped quotes in notepad++?

I am using Notepad++ to find (".*)"(.*) and replace it with \1\"\2 but it doesn't seem to work. I don't know why.
Example:
Someone said "My name is "sean""
I want it to be:
Someone said "My name is \"sean\""
Edit: In my case the closing quote is always on the end of line so will (".*)"(.*"$) work?
Edit2: Also the first quote is preceded with a comma so I will use (,".*)"(.*"$) though it may not work in some cases but I think it will work with my file.
Now the problem is with the replacement: it doesn't add \", it just adds some space.
It should work... you just need to do a little fixing...
The Find what regex should be ("[^"]*)("\w*)(")([^"]*")
The Replace with expression should be \1\\\2\\\3\4
Make sure you select the Search Mode to be "Regular expression"
Explanation...
This is quite tricky - I've assumed that the quoted text WITHIN quotes is just a single word. If you assume something else it becomes very hard to pin down.
You need to find a
" followed by
[^"]* - any number of characters that are NOT a " and then
("\w*)(") - a quoted word, and then finally
([^"]*") - any additional number of non-quote characters + a final quote
This is important because regular expression matching is greedy by default, and a .* would continue to match all characters, including ", until the end of the string.
In the replacement string you need to have \\ to represent a single \
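If you want to sanity-check the find/replace pair outside Notepad++, here is a rough equivalent with Python's re module. This is an assumption-laden sketch: Notepad++ uses a different regex engine, but for this particular pattern and replacement the two should behave the same way.

```python
import re

line = 'Someone said "My name is "sean""'

# Same groups as in the answer: opening quote + text, inner quoted word,
# inner closing quote, and the rest up to the final quote.
find = r'("[^"]*)("\w*)(")([^"]*")'

# In the replacement, \\ stands for one literal backslash.
replace = r'\1\\\2\\\3\4'

result = re.sub(find, replace, line)
print(result)  # Someone said "My name is \"sean\""
```

As the answer says, this only covers a single quoted word inside the outer quotes.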

What does it mean "you can’t hide the terminating delimiter of a pattern inside a regex construct" in the "Programming Perl"?

Sorry, but once again I need help to understand rather complicated snippet from the "Programming Perl" book. Here it is (what is obscure to me marked as bold):
patterns are parsed like double-quoted strings, all the normal double-quote conventions will work, including variable interpolation (unless you use single quotes
as the delimiter) and special characters indicated with backslash escapes. These are applied before the string is interpreted as a regular expression (This is one of the
few places in the Perl language where a string undergoes more than one pass of
processing). ...
Another consequence of this two-pass parsing is that the ordinary Perl tokener
finds the end of the regular expression first, just as if it were looking for the
terminating delimiter of an ordinary string. Only after it has found the end of the
string (and done any variable interpolation) is the pattern treated as a regular
expression. Among other things, this means you can’t “hide” the terminating
delimiter of a pattern inside a regex construct (such as a bracketed character class
or a regex comment, which we haven’t covered yet). Perl will see the delimiter
wherever it is and terminate the pattern at that point.
First, why does it say "Only after it has found the end of the string" and not the end of the regular expression, which is what it was looking for, as stated before?
Second, what does "you can’t “hide” the terminating delimiter of a pattern inside a regex construct" mean? Why can't I hide the terminating delimiter /, when I can place it wherever I want, either in the regexp directly /A\/C/ or in an interpolated variable (even without \):
my $s = 'A/';
my $p = 'A/C';
say $p =~ /$s/;
outputs 1.
While I was writing and re-reading my question I thought that this snippet tells about using a single-quote as a regexp delimiter, then it all seems quite cohesive. Is my assumption correct?
My appreciation.
It says "end of the string" instead of "end of the regular expression" because at that point it's treating the regex as if it were just a string.
It's trying to say that this does not work:
/foo[-/_]/
Even though normal regex metacharacters are not special inside [], Perl will see the regex as /foo[-/ and complain about an unterminated class.
It's trying to say that Perl does not parse the regex as it reads it. First it finds the end of the regex in your source code as if it were a quoted string, so the only special character is \. Then it interpolates any variables. Then it parses the result as a regular expression.
You can hide the terminating delimiter with \ because that works in ordinary strings. You can hide the delimiter inside an interpolated variable, because interpolation happens after the delimiter is found. If you use a bracketing delimiter (e.g. { } or [ ]), you can nest matching pairs of delimiters inside the regex, because q{} works like that too.
But you can't hide it inside any other regex construct.
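The first pass can be modeled with a tiny scanner. This is a toy sketch in Python, not Perl's actual tokenizer: the function name pattern_text is made up, and it handles only a single non-bracketing delimiter, with \ as the only special character.

```python
def pattern_text(source: str, delim: str = "/") -> str:
    """Return the text between the opening delimiter and the first
    unescaped closing delimiter, ignoring all regex syntax."""
    if not source.startswith(delim):
        raise ValueError("source must start with the delimiter")
    i = len(delim)
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            i += 2          # skip the backslash and whatever it escapes
            continue
        if ch == delim:
            return source[len(delim):i]
        i += 1
    raise ValueError("unterminated pattern")

# The / inside [...] ends the pattern early, leaving an unterminated class.
in_class = pattern_text("/foo[-/_]/")
print(in_class)   # foo[-

# A backslash hides the delimiter, exactly as it would in a quoted string.
escaped = pattern_text(r"/A\/C/")
print(escaped)    # A\/C
```

Only after this step would interpolation and regex parsing happen, which is why a / coming from an interpolated variable causes no trouble.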
Say you want to match a *. You would use
m/\*/
But what if you used * as your delimiter? The following doesn't work:
m*\**
because it's interpreted as
m/*/
as seen in the following:
$ perl -e'm*\**'
Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE / at -e line 1.
Take the string literal
"a\"b"
It produces the string
a"b
Similarly, the match operator
m*a\*b*
produces the regex pattern
a*b
If you want to match a literal *, you have to use other means. In other words:
m*a\*b* === m/a*b/ matches pattern a*b
m*a\x{2A}b* === m/a\*b/ matches pattern a\*b

Regex for matching literal strings

I'm trying to write a regular expression which will match a string. For simplicity, I'm only concerned with double quote (") strings for the moment.
So far I have this: "\"[^\"]*\""
This works for most strings but fails when there is an escaped double quote such as this:
"a string \" with an escaped quote"
In this case, it only matches up to the escaped quote.
I've tried several things to allow an escaped quote but so far I've been unsuccessful, can anyone give me a hand?
I've managed to solve it myself:
"\"(\\.|[^\"\\])*\""
Try this:
"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"
If you want a multi-line escaped string you can use:
"[^"\\]*(?:\\.[^"\\]*)*"
You need a negative lookbehind. Check if this works?
"\"[^\"]*(?<!\\)"
(?<!\\)" is supposed to match a " that's not preceded by \.
Try:
"((\\")|[^"(\\")])+"
From Regular Expression Library.
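Two of the patterns suggested above, the alternation form "(\\.|[^"\\])*" and the unrolled form "[^"\\]*(?:\\.[^"\\]*)*", can be compared side by side. The sketch below uses Python's re module with re.DOTALL (my choice, so that an escaped newline is accepted too); the sample strings are made up for illustration.

```python
import re

# Alternation form: either an escaped character or any non-quote, non-backslash.
ALTERNATION = re.compile(r'"(?:\\.|[^"\\])*"', re.DOTALL)
# Unrolled form: plain runs, separated by escaped characters.
UNROLLED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"', re.DOTALL)

samples = [
    r'"a string \" with an escaped quote"',  # escaped quote inside
    r'""',                                   # empty literal
    r'"ends with a backslash \\"',           # escaped backslash before the end
    r'"unterminated \"',                     # last quote is escaped: no closing quote
]

alternation_results = [ALTERNATION.fullmatch(s) is not None for s in samples]
unrolled_results = [UNROLLED.fullmatch(s) is not None for s in samples]

for s, a, u in zip(samples, alternation_results, unrolled_results):
    print(s, a, u)
# Both forms accept the first three samples and reject the last one.
```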
Usually you want to accept escaped anything.
" [^"\\]* (?: \\. [^"\\]* )* " would be the fastest.
"[^"\\]*(?:\\.[^"\\]*)*" compressed.
POSIX does not, AFAIK, support lookaround - without it, there is really no way to do this with just regular expressions. However, according to a POSIX emulator I have (no access to a native environment or library), This might get you close, in certain cases:
"[^\"]*"|"[^\]*\\|\\[^\"]*[\"]
It will capture the part before and the part after the escaped quote. With this source string (ignore the line breaks, and imagine it's all in one string):
I want to match "this text" and "This text, where there is an escaped
slash (\\), and an \"escaped quote\" (\")", but I also want to handle\\ escaped
back-slashes, as in "this text, with a \\ backslash: \\" -- with a little
text behind it!
it will capture these groups:
"this text" -- simple, quoted string
"This text, where there is an escaped slash (\ -- part 1 of quoted string
\), and an \ -- part 2
"escaped quote\ -- part 3
" (\ -- part 4
")" -- part 5, and ends with a quote
\\ -- not part of a quoted string
"this text, with a \ -- part 1 of quoted string
\ backslash: \ -- part 2
\" -- part 3, and ends with a quote
With further analysis you can combine them, as appropriate:
If the group starts and ends with a ", then it is fine on its own
If the group starts with a ", and ends with a \, then it needs to be IMMEDIATELY followed by another match group that either ends with a quote character itself, or recursively continues to be IMMEDIATELY followed by another match group
If the group does not immediately follow another match, it is not part of a quoted string
I think that's all the analysis that you need - but make sure to test it!!!
Let me know if this idea helps!
EDIT:
Additional note: just to be clear, for this to work all quotes in the entire source string must be escaped if they are not to be used as delimiters, and backslashes must be escaped everywhere as well