Replace double quoted strings by single quoted except for GStrings - regex

My OCD has gotten the better of me and I'm going through my Groovy codebase, replacing simple double-quoted strings with single-quoted strings.
However, I want to avoid GStrings that actually contain dollar symbols and variables.
I'm using IntelliJ to do the substitution, and the following almost works:
From: "([^$\"\n\r]+)"
To: '$1'
It captures strings without any dollars in, but only partially skips any strings that contain them.
For example it matches the quotes between two double quoted strings in this case:
foo("${var}": "bar")
           ^^^^
Is it possible to create a regex that would skip a whole string that contained dollars, so in the above case it skips "${var}" and selects "bar", instead of erroneously selecting ": "?
EDIT: Here's a section of code to try against
table.columns.elements.each{ columnName, column ->
    def columnText = "${columnName} : ${column.dataType}"
    cols += "${columnText}\n"
    if (columnText.length() > width) {
        width = columnText.length()
    }
    height++
}
builder."node"("id": table.elementName) {
    builder."data"("key": "d0") {
        builder."y:ShapeNode"()
    }
}
def foo() {
    def string = """
a multiline quote using triple quotes with ${var} gstring vars in.
"""
}

Do single and triple quote replacements separately.
Single quotes:
Use a look ahead for an even number of quotes after your hit. A negative look behind stops it matching the inner quotes of triple quoted strings.
Find: (?<!")"([^"$]*)"(?=(?:(?:[^"\r\n]*"){2})*[^"]*$)
Replace: '$1'
See live demo.
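As a rough check outside the IDE, the single-quote pattern can be tried with Python's re module. This is only a sketch: the helper name requote_line is made up for illustration, it works on one line at a time, and it assumes Python's regex dialect behaves like IntelliJ's for this pattern.

```python
import re

# The "Find" pattern from the answer, applied to one line at a time.
PATTERN = re.compile(r'(?<!")"([^"$]*)"(?=(?:(?:[^"\r\n]*"){2})*[^"]*$)')

def requote_line(line):
    """Swap "..." for '...' unless the string contains a dollar sign."""
    return PATTERN.sub(r"'\1'", line)

print(requote_line('foo("${var}": "bar")'))
print(requote_line('builder."data"("key": "d0") {'))
```

Note that a plain string containing an apostrophe (such as "it's") would still be converted, so results are worth reviewing before committing.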
Triple quotes:
Use a simpler match for triple quoted strings, since they are on their own lines.
Find: """([^"$]*?)"""
Replace: '''$1'''
See live demo, which includes a triple-quoted string that contains a variable.

You need to make sure the first quote comes after even number of quotes:
^[^\n\r"]*(?:(?:"[^"\n\r]*){2})*"([^$\"\n\r]+)"
Here you can play with it.
Explanation:
^[^"\n\r]* - some non-quotes at the beginning
"[^"\n\r]* - a quote, then some more non-quotes
(?:"[^"\n\r]*){2} - let's have two of this
(?:(?:...)) - actually, let's have 0, 2, 4, 6, ... whatever amount of this
Then your regex comes to match the right string: "([^$\"\n\r]+)"
If IntelliJ supports that, then you can make it faster by replacing the non-capturing groups (?:...) with atomic groups (?>...).
This regex finds the last string in the line so you'll have to run the replace several times.
Update
Updated the negated character classes with the newline characters. Now it works well for multi-line texts too. Still, you'll have to run it several times because it finds only one string per line.
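The "run it several times" step can be sketched in Python. One change here is an assumption of this sketch and not part of the answer: the prefix before the string is wrapped in a capture group so that the replacement can write it back. The helper name requote_until_stable is also invented.

```python
import re

# The answer's pattern with one change (an assumption of this sketch):
# the prefix is wrapped in capture group 1 so it can be written back.
PATTERN = re.compile(r'^([^\n\r"]*(?:(?:"[^"\n\r]*){2})*)"([^$"\n\r]+)"')

def requote_until_stable(line):
    """Replace one plain string per pass until a pass changes nothing."""
    while True:
        updated = PATTERN.sub(r"\1'\2'", line)
        if updated == line:
            return line
        line = updated

print(requote_until_stable('builder."data"("key": "d0") {'))
```

Each pass converts the last remaining plain string on the line, so the loop ends once only GStrings (or no double quotes) are left.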

Related

Regex to find pattern NOT enclosed in quotes [duplicate]

I would like to find a regex that will pick out all commas that fall outside quote sets.
For example:
'foo' => 'bar',
'foofoo' => 'bar,bar'
This would pick out the single comma on line 1, after 'bar',
I don't really care about single vs double quotes.
Has anyone got any thoughts? I feel like this should be possible with readaheads, but my regex fu is too weak.
This will match any string up to and including the first non-quoted ",". Is that what you are wanting?
/^([^"]|"[^"]*")*?(,)/
If you want all of them (and as a counter-example to the guy who said it wasn't possible) you could write:
/(,)(?=(?:[^"]|"[^"]*")*$)/
which will match all of them. Thus
'test, a "comma,", bob, ",sam,",here'.gsub(/(,)(?=(?:[^"]|"[^"]*")*$)/,';')
replaces all the commas not inside quotes with semicolons, and produces:
'test; a "comma,"; bob; ",sam,";here'
If you need it to work across line breaks just add the m (multiline) flag.
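For readers without Ruby at hand, the same lookahead can be tried in Python. This is a minimal sketch; the name OUTSIDE_COMMA is made up, and only double quotes are treated as quotes, as in the pattern above.

```python
import re

# A comma counts as "outside" when everything after it, up to the end of the
# string, is made of non-quote characters or complete "..." groups.
OUTSIDE_COMMA = re.compile(r',(?=(?:[^"]|"[^"]*")*$)')

text = 'test, a "comma,", bob, ",sam,",here'
print(OUTSIDE_COMMA.sub(';', text))
# prints: test; a "comma,"; bob; ",sam,";here
```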
The regexes below match all the commas that are outside the double quotes:
,(?=(?:[^"]*"[^"]*")*[^"]*$)
Or (PCRE only):
"[^"]*"(*SKIP)(*F)|,
"[^"]*" matches a whole double-quoted block. That is, in the input buz,"bar,foo" this part would match "bar,foo" only. The following (*SKIP)(*F) then forces that match to fail and tells the engine not to retry inside the skipped text. The engine then moves on to the alternative after the | symbol and tries it against the rest of the string, where it matches only the comma just after buz. Note that this never matches the comma inside the double quotes, because the double-quoted part has already been skipped.
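Python's standard re module does not support the (*SKIP)(*F) verbs, but the same "match it, then ignore it" idea can be approximated with an alternation and a replacement function. A sketch, with the helper name replace_outside_commas invented for illustration:

```python
import re

# Quoted blocks are matched first and put back unchanged; a comma only
# reaches group 1 when it was not swallowed by a quoted block.
TOKEN = re.compile(r'"[^"]*"|(,)')

def replace_outside_commas(text, replacement=";"):
    return TOKEN.sub(
        lambda m: replacement if m.group(1) else m.group(0),
        text,
    )

print(replace_outside_commas('buz,"bar,foo"'))
```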
The regex below matches all the commas that are inside the double quotes:
,(?!(?:[^"]*"[^"]*")*[^"]*$)
While it's possible to hack it with a regex (and I enjoy abusing regexes as much as the next guy), you'll get in trouble sooner or later trying to handle substrings without a more advanced parser. Possible ways to get in trouble include mixed quotes, and escaped quotes.
This function will split a string on commas, but not those commas that are within a single- or double-quoted string. It can be easily extended with additional characters to use as quotes (though character pairs like « » would need a few more lines of code) and will even tell you if you forgot to close a quote in your data:
function splitNotStrings(str){
  var parse=[], inString=false, escape=0, end=0
  for(var i=0, c; c=str[i]; i++){ // looping over the characters in str
    if(c==='\\'){ escape^=1; continue} // 1 when odd number of consecutive \
    if(c===','){
      if(!inString){
        parse.push(str.slice(end, i))
        end=i+1
      }
    }
    else if(splitNotStrings.quotes.indexOf(c)>-1 && !escape){
      if(c===inString) inString=false
      else if(!inString) inString=c
    }
    escape=0
  }
  // now we finished parsing, strings should be closed
  if(inString) throw SyntaxError('expected matching '+inString)
  if(end<i) parse.push(str.slice(end, i))
  return parse
}
splitNotStrings.quotes="'\"" // add other (symmetrical) quotes here
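The same character-by-character approach can be sketched in Python. This is a loose port, not a line-for-line translation: the name split_outside_quotes is invented, and unlike the JavaScript version it always keeps the final field, even when it is empty.

```python
def split_outside_quotes(text, quotes="'\""):
    """Split on commas that are not inside a quoted run."""
    parts = []
    in_string = None   # the quote character that opened the current string
    escaped = False    # True after an odd number of consecutive backslashes
    start = 0
    for i, ch in enumerate(text):
        if ch == "\\":
            escaped = not escaped
            continue
        if ch == "," and in_string is None:
            parts.append(text[start:i])
            start = i + 1
        elif ch in quotes and not escaped:
            if ch == in_string:
                in_string = None
            elif in_string is None:
                in_string = ch
        escaped = False
    if in_string is not None:
        raise ValueError("expected matching " + in_string)
    parts.append(text[start:])
    return parts

print(split_outside_quotes("'foo' => 'bar,bar', x"))
```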
Try this regular expression:
(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*=>\s*(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*,
This does also allow strings like “'foo\'bar' => 'bar\\',”.
MarkusQ's answer worked great for me for about a year, until it didn't. I just got a stack overflow error on a line with about 120 commas and 3682 characters total. In Java, like this:
String[] cells = line.split("[\t,](?=(?:[^\"]|\"[^\"]*\")*$)", -1);
Here's my extremely inelegant replacement that doesn't stack overflow:
private String[] extractCellsFromLine(String line) {
    List<String> cellList = new ArrayList<String>();
    while (true) {
        String[] firstCellAndRest;
        if (line.startsWith("\"")) {
            firstCellAndRest = line.split("([\t,])(?=(?:[^\"]|\"[^\"]*\")*$)", 2);
        }
        else {
            firstCellAndRest = line.split("[\t,]", 2);
        }
        cellList.add(firstCellAndRest[0]);
        if (firstCellAndRest.length == 1) {
            break;
        }
        line = firstCellAndRest[1];
    }
    return cellList.toArray(new String[cellList.size()]);
}
@SocialCensus, the example you gave in the comment to MarkusQ, where you throw in ' alongside the ", doesn't work with the example MarkusQ gave right above it if we change sam to sam's: (test, a "comma,", bob, ",sam's,",here) has no match against (,)(?=(?:[^"']|["|'][^"']*")*$). In fact, the problem itself, "I don't really care about single vs double quotes", is ambiguous. You have to be clear what you mean by quoting either with " or with '. For example, is nesting allowed or not? If so, to how many levels? If only 1 nested level, what happens to a comma outside the inner nested quotation but inside the outer nesting quotation? You should also consider that single quotes occur by themselves as apostrophes (i.e., like the counter-example I gave earlier with sam's). Finally, the regex you made doesn't really treat single quotes on par with double quotes, since it assumes the last type of quotation mark is necessarily a double quote, and replacing that last double quote with ['|"] also has a problem if the text doesn't come with correct quoting (or if apostrophes are used), though I suppose we probably could assume all quotes are correctly delineated.
MarkusQ's regexp answers the question: find all commas that have an even number of double quotes after them (i.e., are outside double quotes) and disregard all commas that have an odd number of double quotes after them (i.e., are inside double quotes). This is generally the same solution as what you probably want, but let's look at a few anomalies. First, if someone leaves off a quotation mark at the end, then this regexp finds all the wrong commas rather than finding the desired ones or failing to match any. Of course, if a double quote is missing, all bets are off, since it might not be clear whether the missing one belongs at the end or at the beginning; however, there is a case that is legitimate and where the regex could conceivably fail (this is the second "anomaly"). If you adjust the regexp to go across text lines, then you should be aware that quoting multiple consecutive paragraphs requires that you place a single double quote at the beginning of each paragraph and leave out the quote at the end of each paragraph except for the very last one. This means that over the space of those paragraphs, the regex will fail in some places and succeed in others.
Examples and brief discussions of paragraph quoting and of nested quoting can be found here http://en.wikipedia.org/wiki/Quotation_mark .
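The first anomaly (a missing closing quote) is easy to reproduce. The Python sketch below uses the even-number-of-quotes lookahead quoted earlier in this thread; the sample inputs are invented.

```python
import re

# The "even number of quotes after the comma" lookahead.
OUTSIDE_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

balanced = 'a,"b,c",d'
unbalanced = 'a,"b,c'   # closing quote left off

print(OUTSIDE_COMMA.sub(';', balanced))    # a;"b,c";d
print(OUTSIDE_COMMA.sub(';', unbalanced))  # a,"b;c
```

With the quote missing, the comma inside the intended string is replaced and the one outside it is left alone, which is exactly the reversal described above.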

regex to remove comma between double quotes notepad++

I am trying to remove commas inside double quotes from a CSV file in Notepad++. This is what I have:
1070,17,2,GN3-670,"COLLAR B, M STAY","2,606.45"
and I need this:
1070,17,2,GN3-670,"COLLAR B M STAY","2606.45"
I am trying to use Notepad++'s find/replace option with a regular expression pattern.
I tried all kinds of combinations but didn't manage to do it :( The file contains 1 million rows.
After a whole day of this I'm no longer sure a simple regex can do it. Maybe I should go with a script... Python?
mrki, this will do what you want (tested in N++):
Search: ("[^",]+),([^"]+")
Replace: $1$2 or \1\2
How does this work? The first parentheses capture the beginning of the string up to (but not including) the comma into Group 1. The second parentheses capture the end of the string after the comma into Group 2. The replacement substitutes the string with a concatenation of Group 1 and Group 2.
In more detail: in the first parentheses, we match the opening double quote, then one or more characters that are neither a double quote nor a comma. That is the meaning of [^",]+. In the second parentheses, we match one or more characters that are not a double quote with [^"]+, then the closing double quote.
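The same search and replace can be tried outside Notepad++, for example in Python, where the replacement is written \1\2. A small sketch using the row from the question; the name ONE_COMMA is made up.

```python
import re

# Same search pattern as above; Python writes the replacement as \1\2.
ONE_COMMA = re.compile(r'("[^",]+),([^"]+")')

row = '1070,17,2,GN3-670,"COLLAR B, M STAY","2,606.45"'
print(ONE_COMMA.sub(r'\1\2', row))
# prints: 1070,17,2,GN3-670,"COLLAR B M STAY","2606.45"

# Only one comma per quoted field is removed in a single pass.
print(ONE_COMMA.sub(r'\1\2', '"a,b,c"'))
# prints: "ab,c"
```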
Try the following
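import re
string = '1070,17,2,GN3-670,"COLLAR B, M STAY","2,606.45"'
print(re.sub(r',(?=[^"]*"[^"]*(?:"[^"]*"[^"]*)*$)', "", string))
This removes every comma that is followed by an odd number of double quotes, i.e. every comma between quotes.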
Just an update to @zx81's brilliant solution.
Let's say you have 2 commas between quotes.
Then the regex search has to be modified as follows:
("[^",]+),([^",]+),([^"]+")
The replacement needs to be modified to
$1$2$3
and so on: modify it depending on the number of commas.
I tried exploring whether a recursive regex was possible, but that does not seem to be possible as of now.
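If a script is an option, the number of commas per field does not have to be known in advance. A Python sketch, assuming fields contain no escaped or doubled quotes; the function name strip_commas_in_quotes is invented.

```python
import re

def strip_commas_in_quotes(row):
    """Delete every comma that sits inside a "..." field, however many."""
    return re.sub(r'"[^"]*"', lambda m: m.group(0).replace(",", ""), row)

print(strip_commas_in_quotes('1,"a,b,c","2,606.45"'))
```

The pattern matches each quoted field as a whole, and the replacement function removes the commas from just that field.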
For a line with multiple instances of "comma within double quotes", I can think of the following perl script - you need to have a header line without this kind of instance so that you know how many comma-separated fields there should be.
#! /usr/bin/perl -w
use strict;
my $n_fields = "";
while (<>) {
    s/\s+$//;
    if (/^\#/) { # header line
        my @t = split(/,/);
        $n_fields = scalar(@t); # total number of fields
    } else { # actual data
        my $n_commas = $_ =~s/,/,/g; # total number of commas
        foreach my $i (0 .. $n_commas - $n_fields) { # iterate ($n_commas - $n_fields + 1) times
            s/(\"[^",]+),([^"]+\")/$1\\x2c$2/g; # single replacement per previous answers
        }
        s/\"//g; # removal of double quotes (if you want)
    }
    print "$_\n";
}

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
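The same pattern behaves the same way in Python. In this sketch the sample text is written as a raw string so that the backslashes are real characters in the input; the name QUOTED is made up.

```python
import re

# An opening quote, then either a non-quote/non-backslash character or a
# backslash followed by any character, repeated, then the closing quote.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

# Raw string: the backslashes below are real characters in the text.
source = r' function(){ return " Is big \"problem\", \no? "; }'
match = QUOTED.search(source)
print(match.group(0))
```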
This one comes from nanorc.sample, available in many Linux distros. It is used for syntax highlighting of C-style strings:
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
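A Python sketch of this "unrolled" form for both quote types follows. The helper quoted() and the sample strings are invented for illustration, and the groups are made non-capturing so that the whole match is what gets reported.

```python
import re

def quoted(q):
    """Unrolled pattern for a string delimited by the quote character q."""
    return rf"{q}[^{q}\\]*(?:\\.[^{q}\\]*)*{q}"

PATTERN = re.compile(quoted('"') + "|" + quoted("'"))

double = r'"a \"b\""'
single = r"'c \'d\''"
text = f"say {double} and {single} end"

print([m.group(0) for m in PATTERN.finditer(text)])
```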
Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
1. Every quoted string starts with the character ".
2. It may contain any number of any characters: .*? (lazy match), ending with a non-escape character [^\\].
3. Statement (2) is lazy(!) optional, because the string can be empty (""). So: (.*?[^\\])??
4. Finally, every quoted string ends with the character ", but it can be preceded by any number of escape-sign pairs (\\\\)+; this part is greedy (possessive) optional: ((\\\\)+)?+, because the string can be empty or have no trailing pairs.
An option that has not been touched on before is:
1. Reverse the string.
2. Perform the matching on the reversed string.
3. Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
    // Match close
    '([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
    '((?:' +
        // Match escaped close quote
        '(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
        // Match everything that's not the close quote
        '(?:(?!\\1).)' +
    '){0,})' +
    // Match open
    '(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
    'g'
);
// Reverse the matched strings.
const matches = myString
    // Reverse the string.
    .split('').reverse().join('')
    // '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
    // Match the quoted
    .match(regExp)
    // ['"hctam "\dluohs"\ siht"', '"dluohs"']
    // Reverse the matches
    .map(x => x.split('').reverse().join(''))
    // ['"this \"should\" match"', '"should"']
    // Re order the matches
    .reverse();
    // ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
This regexp can be easily broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Sorry, I don't have a high enough reputation to include images, so, it's just a link for now.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
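Here is one that works with both " and ', and you can easily add other quote characters at the start.
("|')(?:\\.|(?!\1).)*?\1
It uses the backreference \1 to match exactly what the first group captured (" or '). A backreference has no special meaning inside a character class, so [^\1] does not mean "anything but the opening quote"; the negative lookahead (?!\1). expresses that instead.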
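To see a backreference-based pattern at work, here is a small Python sketch. It uses the lookahead form (?!\1). for "any character except the opening quote", since a backreference is not usable inside a character class. The sample text is invented.

```python
import re

# Group 1 captures the opening quote; \1 must match the same character again.
# (?!\1). means "any character that is not the opening quote".
QUOTED = re.compile(r'''(["'])(?:\\.|(?!\1).)*?\1''')

text = '''x = "it's" + 'say "hi"' '''
print([m.group(0) for m in QUOTED.finditer(text)])
```

Each match ends at the same kind of quote that opened it, so the apostrophe in the first string and the double quotes in the second are left alone.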
http://www.regular-expressions.info/backref.html
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and a linear, manual seek. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
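The two steps translate directly to other languages. A Python sketch, with the function name blank_out_strings and the sample line invented for illustration:

```python
import re

def blank_out_strings(line):
    # Step 1: turn escaped quotes \" into ' so they no longer end a string.
    line = line.replace('\\"', "'")
    # Step 2: every remaining "..." is now a simple run with no quotes inside.
    return re.sub(r'"[^"]*"', '"x"', line)

print(blank_out_strings(r'call("a \"quoted\" word", 1)'))
```

As in the Java version, the original string contents are discarded, so this suits cases where quoted text only needs to be neutralised before further parsing.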
If your IDE is IntelliJ IDEA, you can skip these escaping headaches: store your regex in a String variable, and when you paste it between the double quotes, IntelliJ automatically escapes it into a valid Java string literal.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"