Regex for matching literal strings

I'm trying to write a regular expression which will match a string. For simplicity, I'm only concerned with double quote (") strings for the moment.
So far I have this: "\"[^\"]*\""
This works for most strings but fails when there is an escaped double quote such as this:
"a string \" with an escaped quote"
In this case, it only matches up to the escaped quote.
I've tried several things to allow an escaped quote but so far I've been unsuccessful, can anyone give me a hand?
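The failure described above is easy to reproduce. A minimal sketch in Python (an assumption; the question does not name its language), using the question's own pattern and sample string:

```python
import re

# The question's pattern: a quote, any run of non-quote characters, a quote.
NAIVE = re.compile(r'"[^"]*"')

sample = r'"a string \" with an escaped quote"'

first = NAIVE.search(sample)
print(first.group(0))  # the match stops at the escaped quote
```

The character class [^"] has no notion of escaping, so the first " it meets, escaped or not, ends the match.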

I've managed to solve it myself:
"\"(\\.|[^\"\\])*\""

Try this:
"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"
If you want a multi-line escaped string you can use:
"[^"\\]*(?:\\.[^"\\]*)*"

You need a negative lookbehind. Check if this works?
"\"[^\"]*(?<!\\)"
(?<!\\)" is supposed to match a " that is not preceded by \.
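Since the answer asks whether this works, here is a quick check, sketched in Python's re module (an assumption; other engines with lookbehind should behave the same way on this input):

```python
import re

# The lookbehind pattern from the answer, written as a raw string.
LOOKBEHIND = re.compile(r'"[^"]*(?<!\\)"')

sample = r'"a string \" with an escaped quote"'

found = [m.group(0) for m in LOOKBEHIND.finditer(sample)]
print(found)
```

On this input the only match begins at the escaped quote, not at the opening one: [^"]* cannot step over the \" in the middle, so no match starting at the opening quote can reach the end of the string.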

Try:
"((\\")|[^"(\\")])+"
From Regular Expression Library.

Usually you want to accept escaped anything.
" [^"\\]* (?: \\. [^"\\]* )* " would be the fastest.
"[^"\\]*(?:\\.[^"\\]*)*" compressed.

POSIX does not, AFAIK, support lookaround - without it, there is really no way to do this with just regular expressions. However, according to a POSIX emulator I have (no access to a native environment or library), this might get you close, in certain cases:
"[^\"]*"|"[^\]*\\|\\[^\"]*[\"]
It will capture the part before and the part after the escaped quote. With this source string (ignore the line breaks, and imagine it's all in one string):
I want to match "this text" and "This text, where there is an escaped
slash (\\), and an \"escaped quote\" (\")", but I also want to handle\\ escaped
back-slashes, as in "this text, with a \\ backslash: \\" -- with a little
text behind it!
it will capture these groups:
"this text" -- simple, quoted string
"This text, where there is an escaped slash (\ -- part 1 of quoted string
\), and an \ -- part 2
"escaped quote\ -- part 3
" (\ -- part 4
")" -- part 5, and ends with a quote
\\ -- not part of a quoted string
"this text, with a \ -- part 1 of quoted string
\ backslash: \ -- part 2
\" -- part 3, and ends with a quote
With further analysis you can combine them, as appropriate:
If the group starts and ends with a ", then it is fine on its own
If the group starts with a ", and ends with a \, then it needs to be IMMEDIATELY followed by another match group that either ends with a quote character itself, or recursively continues to be IMMEDIATELY followed by another match group
If the group does not immediately follow another match, it is not part of a quoted string
I think that's all the analysis that you need - but make sure to test it!!!
Let me know if this idea helps!
EDIT:
Additional note: just to be clear, for this to work all quotes in the entire source string must be escaped if they are not to be used as delimiters, and backslashes must be escaped everywhere as well

Related

replace single-quote with double-quote, if and only if quote is after specific string

I'm working in notepad++, and using its find-replace dialog box.
NP++ documentation states: Notepad++ regular expressions use the Boost regular expression library v1.70, which is based on PCRE (Perl Compatible Regular Expression) syntax. ref: https://npp-user-manual.org/docs/searching
What I'm trying to do should be simple, but I'm a regex novice, and after 2-3 hrs of web searches and playing with online regex testers, I give up.
I want to replace all single quotes ' with double quote " , but if and only if the ' is to the RIGHT of one or more #, ie inside a python comment.
For example,
list1 = ['apple','banana','pear'] # All 'single quotes' to LEFT of # remained unchanged.
list2 = ['tomato','carrot'] # All 'single quotes' to RIGHT of one or more # are replaced
# # with "double quotes", like this.
The np++ file is over 800 lines, manual replacement would be tedious & error prone. Advice appreciated.
This regex should do what you want:
(^[^#]*#|(?<!^)\G)[^'\n]*\K'
It looks for a ' which is preceded by either
^[^#]*# : start of line and some number of non-# characters followed by a #; or
(?<!^)\G : the start of line or the end of the previous match (\G), with a negative lookbehind for start of line (?<!^), meaning that it only matches at the end of the previous match
and then some number of characters other than ' or newline, [^'\n]* (excluding newline prevents the match from wrapping around the end of the previous line).
We then use \K to reset the match, so that everything before that is discarded from the match, and the regex only matches the '.
That can then be replaced with ".
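If the file can be processed outside Notepad++, the same rule can be sketched in plain Python. This swaps the regex for a string split, because Python's standard re module has neither \G nor \K. The function name requote_comment is hypothetical, and like the first regex above it assumes the first # on a line starts the comment:

```python
def requote_comment(line: str) -> str:
    """Replace ' with " only to the right of the first # on the line."""
    # partition() returns (line, "", "") when there is no #, so such
    # lines come back unchanged.
    code, hash_sign, comment = line.partition("#")
    return code + hash_sign + comment.replace("'", '"')


before = "list2 = ['tomato','carrot'] # All 'single quotes' here"
after = requote_comment(before)
print(after)
```

A # inside a string literal would be treated as the start of a comment here, which is the same limitation the first regex has.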
Update
You can avoid matching apostrophes within words by only matching ones that are either preceded or followed by a non-word character:
(^[^#]*#|(?<!^)\G)[^'\n]*\K('(?=\W)|(?<=\W)')
Update 2
You can also deal with the case where there are # characters in strings by qualifying the first part of the regex with the requirement for there to be matched pairs of quotes beforehand:
(?:^[^'#]*(?:'[^']*'[^#']*)*[^'#]*#|(?<!^)\G)[^'\n]*\K(?:'(?=\W)|(?<=\W)')

Regex Lookahead/behind to find character unless followed by the same

I'm really not good with Regex and have been messing about to achieve the following all morning:
I want to find unicode characters, i.e. "\00026", in an SQL string before saving to the database and escape the "\" by replacing it with "\\", unless it already has two "\" characters.
\\(?=[0])(?<![\\])
Is what I have written, which as I understand it does:
find the "\" character, positive look ahead for a "0", and look behind to check it isn't preceded by a "\"
But it's not working, so clearly I have misunderstood!
I can shorten it to \\(?=[0])
But then I get the "\" before the 0, even if it is preceded by another "\"
So how do I do:
Replace("\00026", "regex", "\\") to get "\\00026"
AND ensure that
Replace("\\00026", "regex", "\\") also gives "\\00026"
All help much appreciated!
EDIT:
This must parse an entire string and replace all occurrences, not just the first as well - just to be clear. Also I am using VB.net if it makes much difference.
Let me explain why your regex does not work.
\\ - Matches \
(?=[0]) - Checks (not matches) if the next character is 0
(?<![\\]) - Checks (but not matches) if the preceding character (that is \) is not \.
The last condition will always fail the match, as \ is \. So, not much sense, right?
If you want to match / in /000xx whole strings (e.g. separated with spaces), where x is any digit, you can use
\B(?<!/)/(?!/)(?=000\d{2})
To match the string even in context like w/00023, you can remove \B:
(?<!/)/(?!/)(?=000\d{2})
If you do not care about 0s, but just any digits:
(?<!/)/(?!/)(?=\d)
And in case you have \ (not /), just replace / with \\ in the above regular expressions.
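As a sketch of that substitution with \ instead of /, here it is in Python (an assumption; the question uses VB.NET, whose Regex.Replace also replaces every occurrence). The function name escape_lone_backslashes is hypothetical:

```python
import re

# A backslash that is not preceded by a backslash, not followed by one,
# and is followed by a digit.
LONE_BACKSLASH = re.compile(r'(?<!\\)\\(?!\\)(?=\d)')


def escape_lone_backslashes(s: str) -> str:
    # In a replacement template, \\ stands for one literal backslash,
    # so r'\\\\' inserts two.
    return LONE_BACKSLASH.sub(r'\\\\', s)


single = escape_lone_backslashes(r'\00026')
already = escape_lone_backslashes(r'\\00026')
mixed = escape_lone_backslashes(r'a \00026 b \\00041 c \00027')
print(single, already, mixed)
```

The lookbehind sits before the backslash being matched, so it inspects the character to its left rather than the backslash itself.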
You can use the following regex:
(?<!/)/(?=0)
And replace with //

Flex regular expression String [duplicate]

I've got a regular expression that matches strings opening with " and closing with " and can contain \".
The regular expression is this \"".*[^\\]"\".
I don't understand what's the " that is followed after \" and after the [^\\].
Also this regular expression works when I have a \n inside a string but the . rule on flex doesn't match a \n.
I just tested for example the string "aaaaa\naaa\naaaa".
It matched it with no problem.
I made a regex for flex that matches what I need. It's this one \"(([^\\\"])|([\\\"]))*\". I understand how this works though.
Also, I just tested my solutions against "", an empty string, and they don't work. The patterns from everyone who answered have been tested too and don't work on it either.
The pattern is a little naive and even indeed false. It doesn't handle correctly escaped quotes because it assumes that the closing quote is the first one that is not preceded by a backslash. This is a false assumption.
The closing quote can be preceded by a literal backslash (a backslash that is escaped with another backslash, so the second backslash is no longer escaping the quote), example: "abcde\\" (so the content of this string is abcde\)
This is the pattern to deal with all cases:
\"[^"\\]*(?s:\\.[^"\\]*)*\"
or perhaps (I don't know exactly where you need to escape literal quotes in a flex pattern):
\"[^\"\\]*(?s:\\.[^\"\\]*)*\"
Note that the s modifier allows the dot to match newlines inside the non capturing group.
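To see the difference outside flex, here is a sketch in Python (3.6 or later, which accepts the scoped (?s:...) group). NAIVE is my own stand-in for the "first quote not preceded by a backslash" assumption, not the flex pattern itself:

```python
import re

NAIVE = re.compile(r'".*?[^\\]"')
ROBUST = re.compile(r'"[^"\\]*(?s:\\.[^"\\]*)*"')
ROBUST_NO_S = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

# Two strings on one line; the first ends with an escaped backslash.
text = r'"abcde\\" and "x"'
naive_hits = [m.group(0) for m in NAIVE.finditer(text)]
robust_hits = [m.group(0) for m in ROBUST.finditer(text)]
print(naive_hits)
print(robust_hits)

# A backslash followed by a real newline: only the (?s:...) version accepts it.
continued = '"line one\\\nline two"'
with_s = ROBUST.fullmatch(continued)
without_s = ROBUST_NO_S.fullmatch(continued)
print(with_s is not None, without_s is not None)
```

The naive pattern runs past the end of the first string and swallows the opening quote of the second, while the robust one consumes each backslash together with the character after it and so finds both strings.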
I just figured out everything :P
This \"".*[^\\]"\" works because in flex it means: I want to match something that starts with " and ends with ". Inside these quotes there will be another matching pattern(that's why there are the unexplained ", as I was pondering their existence in my question) that can be any set of any characters, but CANNOT end with \.
What confused me more was the use of ., cause in flex it means that it will match any character except a new line \n. So I was mistakenly thinking that it won't match a string such as "aaa\naaa".
But the reality is it will match it, because when flex reads it will read first \ and then n.
The TRUE newline would be, something like this:
"something
like
this"
But compilers in -ansi C, for example (I haven't tested versions other than ansi), do not let you declare a string that spans several lines.
I hope my answer is clear enough. Cheers.
Your pattern does not match "hello" but it matches ""hello"".
if you want to match anything that is in quotes and may contain \" try something like:
/(\"[\na-zA-Z\\"]*\")/gs

What regex expression will match all characters except ", except when it is \"?

I'm trying to parse an apache log, and I'm having problems with the right syntax for the referer because it is a string inside " (double-quotes), that can also have \" inside it.
"([^"]*)" doesn't work when there is a \" in the string.
How do I start at the 1st double-quote, then take all characters that are not double-quotes, unless it's \", in which case I include it, and keep going?
You could use this:
"((?:[^"]|\\")*)"
It will match zero or more of any character other than a double-quote or a backslash-double-quote pair, all surrounded by double-quotes.
Could there be other escapes in the string, for example "hello \\"? In that case, you need a more general approach:
"((?:\\.|[^"\\])*)"
How about this? A negative-lookbehind to exclude a \ before the closing "
"(.+?)(?<!\\)"
This will match two quotes with any number of escaped quotes in-between:
"\([^"]\|\\"\)*"
First it looks for a quote. Next it searches for zero to infinity of the following:
a non-quote character
a quote character preceded by a backslash
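To compare two of the suggestions above on concrete input, here is a sketch in Python's re module (an assumption; the question does not say which engine parses the log). The tail string is a made-up fragment of a log line, not real data:

```python
import re

# Hypothetical tail of a log line: a referer containing \" and a user agent.
tail = r'"http://example.com/?q=\"x\"" "Mozilla/5.0"'

FIRST = re.compile(r'"((?:[^"]|\\")*)"')      # first suggestion in the thread
GENERAL = re.compile(r'"((?:\\.|[^"\\])*)"')  # the more general approach

first_referer = FIRST.match(tail).group(1)
general_fields = GENERAL.findall(tail)
print(first_referer)
print(general_fields)
```

In a backtracking engine that tries alternatives left to right, [^"] also accepts the backslash, so the first pattern closes the field at the escaped quote. The general pattern excludes the backslash from the character class and consumes it only as part of an escape pair, which yields both fields intact.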

How can I match double-quoted strings with escaped double-quote characters?

I need a Perl regular expression to match a string. I'm assuming only double-quoted strings, that a \" is a literal quote character and NOT the end of the string, and that a \\ is a literal backslash character and should not escape a quote character. If it's not clear, some examples:
"\"" # string is 1 character long, contains double quote
"\\" # string is 1 character long, contains backslash
"\\\"" # string is 2 characters long, contains backslash and double quote
"\\\\" # string is 2 characters long, contains two backslashes
I need a regular expression that can recognize all 4 of these possibilities, and all other simple variations on those possibilities, as valid strings. What I have now is:
/".*[^\\]"/
But that's not right - it won't match any of those except the first one. Can anyone give me a push in the right direction on how to handle this?
/"(?:[^\\"]|\\.)*"/
This is almost the same as Cal's answer, but has the advantage of matching strings containing escape codes such as \n.
The ?: characters are there to prevent the contained expression being saved as a backreference, but they can be removed.
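The four examples from the question can be checked against this pattern. A sketch in Python rather than Perl, with a capture group added around the body (my addition) and a hypothetical unescape helper that treats every escape as "the next character, literally":

```python
import re

# Same alternation as above, with a capture group around the body.
STRING = re.compile(r'"((?:[^\\"]|\\.)*)"')


def unescape(body: str) -> str:
    # Simplification: drop the backslash and keep the escaped character.
    return re.sub(r'\\(.)', r'\1', body)


examples = [r'"\""', r'"\\"', r'"\\\""', r'"\\\\"']
lengths = [len(unescape(STRING.fullmatch(lit).group(1))) for lit in examples]
print(lengths)

# A lone escaped quote with no closing quote is rejected.
unterminated = STRING.fullmatch(r'"\"')
print(unterminated)
```

The lengths agree with the comments in the question: 1, 1, 2 and 2 characters.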
NOTE: as pointed out by Louis Semprini, this is limited to 32kb texts due to a recursion limit built into Perl's regex engine (that unfortunately silently returns a failure when hit, instead of crashing loudly).
How about this?
/"([^\\"]|\\\\|\\")*"/
This matches zero or more of the following: a character that isn't a backslash or a quote, OR two backslashes, OR a backslash then a quote.
A generic solution(matching all backslashed characters):
/ \A " # Start of string and opening quote
(?: # Start group
[^\\"] # Anything but a backslash or a quote
| # or
\\. # Backslash and anything
)* # End of group
" \z # Closing quote and end of string
/xms
See Text::Balanced. It's better than reinventing the wheel. Use gen_delimited_pat to see the resulting pattern and learn from it.
Regexp::Common is another useful tool to be aware of. It contains regexps for many common cases, including quoted strings:
use Regexp::Common;
my $str = '" this is a \" quoted string"';
if ($str =~ $RE{quoted}) {
# do something
}
Here's a very simple way:
/"(?:\\?.)*?"/
Just remember if you're embedding such a regex in a string to double the backslashes.
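A sketch of what that doubling looks like, in Python (an assumption; the answer itself is about Perl), where a raw string avoids it:

```python
import re

# The same regex written twice: once as a raw string, once as an ordinary
# string literal in which every backslash has to be doubled.
raw_pattern = r'"(?:\\?.)*?"'
plain_pattern = '"(?:\\\\?.)*?"'
same = raw_pattern == plain_pattern

hit = re.search(raw_pattern, r'x "a \" b" y').group(0)
print(same, hit)
```

Both literals produce the identical pattern text, so the choice only affects how readable the source is.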
Try this piece of code : (\".+")