replace word using Regex, but not in Quotes in C# [duplicate] - regex

From this q/a, I deduced that matching all instances of a given regex that are not inside quotes is impossible. That is, it can't handle escaped quotes (ex: "this whole \"match\" should be taken"). If there is a way to do it that I don't know about, that would solve my problem.
If not, however, I'd like to know if there is any efficient alternative that could be used in JavaScript. I've thought about it a bit, but can't come up with any elegant solutions that would work in most, if not all, cases.
Specifically, I just need the alternative to work with .split() and .replace() methods, but if it could be more generalized, that would be the best.
For Example:
An input string of: +bar+baz"not+or\"+or+\"this+"foo+bar+
replacing + with #, not inside quotes, would return: #bar#baz"not+or\"+or+\"this+"foo#bar#

Actually, you can match all instances of a regex not inside quotes for any string in which each opening quote is closed again. Say, as in your example above, you want to match \+.
The key observation here is that a word is outside quotes if there is an even number of quotes following it. This can be modeled as a look-ahead assertion:
\+(?=([^"]*"[^"]*")*[^"]*$)
Now, you'd like to not count escaped quotes. This gets a little more complicated. Instead of [^"]*, which advances to the next quote, you need to consider backslashes as well and use [^"\\]*. After you arrive at either a backslash or a quote, you need to ignore the next character if you encountered a backslash, or else advance to the next unescaped quote. That looks like (\\.|"([^"\\]*\\.)*[^"\\]*"). Combined, you arrive at
\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
I admit it is a little cryptic. =)
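A minimal sketch of how the final pattern could be used from JavaScript for both .replace() and .split(). The variable names are illustrative, and the groups are rewritten as non-capturing (?:...), which is a change from the pattern above, made here on the assumption that you do not want .split() to add captured text to its result.

```javascript
// Same look-ahead as above, but with non-capturing groups (an assumption made
// here so that split() returns only the pieces between the separators).
const outsideQuotes = /\+(?=(?:[^"\\]*(?:\\.|"(?:[^"\\]*\\.)*[^"\\]*"))*[^"]*$)/g;

// The example input from the question; "\\" in a JS literal is one backslash.
const input = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';

const replaced = input.replace(outsideQuotes, '#');
const pieces = input.split(outsideQuotes);

console.log(replaced); // #bar#baz"not+or\"+or+\"this+"foo#bar#
console.log(pieces);   // five pieces; the quoted section stays in one piece
```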

Azmisov, resurrecting this question because you said you were looking for any efficient alternative that could be used in JavaScript and any elegant solutions that would work in most, if not all, cases.
There happens to be a simple, general solution that wasn't mentioned.
Compared with alternatives, the regex for this solution is amazingly simple:
"[^"]+"|(\+)
The idea is that we match but ignore anything within quotes to neutralize that content (on the left side of the alternation). On the right side, we capture all the + that were not neutralized into Group 1, and the replace function examines Group 1. Here is full working code:
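<script>
var subject = '+bar+baz"not+these+"foo+bar+';
var regex = /"[^"]+"|(\+)/g;
// Quoted sections match the left branch and leave Group 1 unset.
var replaced = subject.replace(regex, function(m, group1) {
  // No Group 1: a quoted section, so return the match unchanged.
  if (!group1) return m;
  // Group 1 set: a + outside quotes.
  else return "#";
});
document.write(replaced);
</script>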
Online demo
You can use the same principle to match or split. See the question and article in the reference, which will also point you to code samples.
Hope this gives you a different idea of a very general way to do this. :)
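As a rough sketch of the "split" case, the same match-and-skip pattern can drive a manual split. The helper name splitOutsideQuotes is hypothetical and not part of the original answer; it only illustrates one way to reuse the regex.

```javascript
// Hypothetical helper: split `subject` on "+" signs that are outside
// double-quoted sections, using the match-and-skip pattern from above.
function splitOutsideQuotes(subject) {
  const regex = /"[^"]+"|(\+)/g;
  const pieces = [];
  let start = 0;
  let m;
  while ((m = regex.exec(subject)) !== null) {
    // Group 1 unset: a quoted section, which simply stays inside the piece.
    if (m[1] === undefined) continue;
    pieces.push(subject.slice(start, m.index));
    start = m.index + m[0].length;
  }
  pieces.push(subject.slice(start)); // text after the last separator
  return pieces;
}

const parts = splitOutsideQuotes('+bar+baz"not+these+"foo+bar+');
console.log(parts); // [ '', 'bar', 'baz"not+these+"foo', 'bar', '' ]
```

The empty strings at both ends mirror what the built-in split() does when the subject starts or ends with a separator.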
What about Empty Strings?
The above is a general answer to showcase the technique. It can be tweaked depending on your exact needs. If you worry that your text might contain empty strings, just change the quantifier inside the string-capture expression from + to *:
"[^"]*"|(\+)
See demo.
What about Escaped Quotes?
Again, the above is a general answer to showcase the technique. Not only can the "ignore this match" regex be refined to your needs, you can also add multiple expressions to ignore. For instance, if you want to make sure escaped quotes are adequately ignored, you can start by adding an alternation \\"| in front of the other two in order to match (and ignore) straggling escaped double quotes.
Next, within the section "[^"]*" that captures the content of double-quoted strings, you can add an alternation to ensure escaped double quotes are matched before their " has a chance to turn into a closing sentinel, turning it into "(?:\\"|[^"])*"
The resulting expression has three branches:
\\" to match and ignore
"(?:\\"|[^"])*" to match and ignore
(\+) to match, capture and handle
Note that in other regex flavors we could do this job more easily with lookbehind, which JS did not support when this was written (lookbehind was added to the language in ES2018).
The full regex becomes:
\\"|"(?:\\"|[^"])*"|(\+)
See regex demo and full script.
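For illustration, here is a short sketch of the three-branch expression applied with a replace callback. The second input (a stray escaped quote outside any quoted section) is an example made up here to exercise the first branch.

```javascript
// Branch 1 and 2 are matched and ignored; only branch 3 sets Group 1.
const regex = /\\"|"(?:\\"|[^"])*"|(\+)/g;

function replacePlusOutsideQuotes(subject) {
  return subject.replace(regex, (match, group1) =>
    group1 === undefined ? match : '#');
}

// The input from the question ("\\" in a JS literal is one backslash).
const fromQuestion = replacePlusOutsideQuotes('+bar+baz"not+or\\"+or+\\"this+"foo+bar+');
// Made-up input: an escaped quote outside quotes, then a real quoted section.
const strayEscape = replacePlusOutsideQuotes('\\"+"a+b"');

console.log(fromQuestion); // #bar#baz"not+or\"+or+\"this+"foo#bar#
console.log(strayEscape);  // \"#"a+b"
```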
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...

You can do it in three steps.
Use a regex global replace to extract all string body contents into a side-table.
Do your comma translation
Use a regex global replace to swap the string bodies back
Code below
// Step 1
var sideTable = [];
myString = myString.replace(
    /"(?:[^"\\]|\\.)*"/g,
    function (_) {
      var index = sideTable.length;
      sideTable[index] = _;
      return '"' + index + '"';
    });
// Step 2, replace commas with newlines
myString = myString.replace(/,/g, "\n");
// Step 3, swap the string bodies back
myString = myString.replace(/"(\d+)"/g,
    function (_, index) {
      return sideTable[index];
    });
If you run that after setting
myString = '{:a "ab,cd, efg", :b "ab,def, egf,", :c "Conjecture"}';
you should get
{:a "ab,cd, efg"
:b "ab,def, egf,"
:c "Conjecture"}
It works, because after step 1,
myString = '{:a "0", :b "1", :c "2"}'
sideTable = ['"ab,cd, efg"', '"ab,def, egf,"', '"Conjecture"'];
so the only commas in myString are outside strings. Step 2 then turns commas into newlines:
myString = '{:a "0"\n :b "1"\n :c "2"}'
Finally we replace the strings that only contain numbers with their original content.
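The three steps can also be wrapped in a reusable function. The sketch below is one possible arrangement, with the hypothetical name transformOutsideStrings and a transform callback that are not part of the original answer; it is applied here to the + to # example from the question.

```javascript
// Hypothetical wrapper around the three steps above; `transform` is applied
// only to the text that is outside double-quoted strings.
function transformOutsideStrings(input, transform) {
  const sideTable = [];
  // Step 1: park every string literal (escapes included) in the side table.
  const parked = input.replace(/"(?:[^"\\]|\\.)*"/g, (literal) => {
    sideTable.push(literal);
    return '"' + (sideTable.length - 1) + '"';
  });
  // Step 2: the transformation can no longer touch string contents.
  const transformed = transform(parked);
  // Step 3: swap the original literals back in.
  return transformed.replace(/"(\d+)"/g, (_, index) => sideTable[Number(index)]);
}

const result = transformOutsideStrings(
  '+bar+baz"not+or\\"+or+\\"this+"foo+bar+',
  (text) => text.replace(/\+/g, '#')
);
console.log(result); // #bar#baz"not+or\"+or+\"this+"foo#bar#
```

This assumes the transform does not create or alter the numeric "0", "1", ... placeholders, since step 3 relies on finding them intact.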

Although the answer by zx81 seems to be the cleanest and best-performing one, it needs these fixes to correctly catch the escaped quotes:
1. var subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
2. var regex = /"(?:[^"\\]|\\.)*"|(\+)/g;
3. The already mentioned "group1 === undefined" or "!group1" check.
Fix 2 in particular seems important in order to take everything asked in the original question into account.
It should be mentioned, though, that this method implicitly requires the string to not have escaped quotes outside of unescaped quote pairs.
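A small sketch of both points: the corrected regex on the question's input, and the limitation just mentioned. The second input is made up here to show what happens when an escaped quote appears outside a quoted section.

```javascript
// The regex from the fix above: a quoted string with escapes, or a + (Group 1).
const regex = /"(?:[^"\\]|\\.)*"|(\+)/g;
const replacePlus = (s) =>
  s.replace(regex, (m, group1) => (group1 === undefined ? m : '#'));

// Input from the question: every opening quote is closed, escapes are inside.
const wellFormed = replacePlus('+bar+baz"not+or\\"+or+\\"this+"foo+bar+');
// Made-up input with an escaped quote OUTSIDE a quoted section: the regex
// takes that quote as an opening quote, so the wrong + ends up replaced.
const strayEscape = replacePlus('\\"+"a+b"');

console.log(wellFormed);  // #bar#baz"not+or\"+or+\"this+"foo#bar#
console.log(strayEscape); // \"+"a#b"
```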

Related

How can I strip double quotes and braces from my strings before insert in Rails4?

I am parsing values from xml and saving them to variables. I was able to strip all but the braces and double quotes from the string. The value displays like this on the page: ["MPEG Video"].
Here is an example of the parse saving it to a variable:
@video_format = REXML::XPath.each(media_parse_doc, "//track[@type='Video']/Format/text()") { |element| element }
I tried using .ts like this:
@video_format = (REXML::XPath.each(media_parse_doc, "//track[@type='Video']/Format/text()") { |element| element } ).ts('[]"','')
but it did not work. I saw some examples telling you to use gsub, and I looked at the api dock for gsub, but I don't understand the logic in the examples well enough to apply it correctly to my own case. Here is one of the examples:
"foobar".gsub(/^./, "") # => "oobar"
I understand it is removing the first character, but I don't know how to set it up to remove " and [.
Why the /^? Is that ascii for something? Can someone please show me the correct syntax to remove the double quotes and braces from my variables and explain the logic so I can better understand how to use it on my own in the future?
Thank you for the help!
If you want to understand regular expressions, check out http://rubular.com/.
"foobar".gsub(/^./, "") # => "oobar" that particular example will substitute the first letter of the string with "" (i.e., nothing). The reason is that the ^ says "pin the match to the beginning of the string", and the . says "match any character", so it'll match any character at the beginning of the string. The enclosing / characters are just the standard delimiters for a regular expression, so it's only the ^. that you need to figure out.
To replace double quotes: 'fo"o"bar'.gsub(/"/, "") # => "foobar"
To replace left square bracket: 'fo[o[bar'.gsub(/\[/, "") # => "foobar" (because square brackets are special characters in regex, you have to prefix them with a \ when you want to use them as 'normal' characters).
To replace all quotes and square brackets in one: 'fo[o"[b]"ar'.gsub(/("|\[|\])/, "") # => "foobar"
(The parentheses indicate a group, and the pipes | indicate 'or'. So, ("|\[|\]) means "match any of the things in this group: a quote, or a left square bracket, or a right square bracket".)
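The alternation group can also be written as a character class. The sketch below uses JavaScript rather than Ruby purely for illustration; the pattern /["\[\]]/ itself is also accepted by Ruby's gsub, and the variable names are made up here.

```javascript
// The value as it was displayed on the page in the question.
const raw = '["MPEG Video"]';

// A character class matches any ONE of the listed characters: " [ or ]
// The g flag replaces every occurrence, like Ruby's gsub.
const cleaned = raw.replace(/["\[\]]/g, '');

// The ^. example from above: ^ anchors at the start, . is any one character.
const withoutFirst = 'foobar'.replace(/^./, '');

console.log(cleaned);      // MPEG Video
console.log(withoutFirst); // oobar
```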
But really what you should do is do a good intro tutorial to regular expressions and start from the basics. Once you understand that, it shouldn't be too hard to start composing simple regular expressions of your own.
If you're on a mac, this app is very useful for writing your own regex's: http://krillapps.com/patterns/

Parsing variables within a string using a Regular Expression

I've got a bit of a problem with regular expressions with ColdFusion.
I have a string:
Hi my name is {firstname}. and i live in {towncity} my email address is {email}
What I would like to know is how would I go about finding all strings, within my string, that are encased within a set of {} brackets? I would like to split all the matching strings into an array so I can use the results of query data.
Also is this a commonly used pattern for processing strings within matching strings for merging variable data ?
Any help greatly appreciated.
Simple Answer
To find all the brace-encased strings, you can use rematch and the simple expression \{[^{}]+\}
Explanation
The backslashes \ before each brace are to escape them, and have them act as literal braces (they carry special meaning otherwise).
The [^...] is a negative character class, saying match any single char that is NOT one of those contained within, and the greedy + quantifier tells it to match as many as possible, but at least one, from the preceding item.
Thus using [^{}]+ between the braces means it will not match nested or unmatched braces. (Whilst using \{.*?\} could match two opening braces. Note: the *? is a lazy quantifier, it matches nothing (if possible), but as many as required.)
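To make the difference between the two patterns concrete, here is a small sketch in JavaScript (not ColdFusion, so match() stands in for reMatch, and the sample string '{{a}' is made up for the comparison).

```javascript
const text = 'Hi my name is {firstname}. and i live in {towncity} my email address is {email}';

// Negated character class: nothing between the braces may itself be a brace.
const placeholders = text.match(/\{[^{}]+\}/g);

// With a doubled opening brace the two patterns disagree.
const negated = '{{a}'.match(/\{[^{}]+\}/g); // skips the first, unmatched brace
const lazy = '{{a}'.match(/\{.*?\}/g);       // starts at the first brace it sees

console.log(placeholders); // [ '{firstname}', '{towncity}', '{email}' ]
console.log(negated);      // [ '{a}' ]
console.log(lazy);         // [ '{{a}' ]
```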
Extended Answer
However, since you say that the results come from a query, a way to only match the values you're dealing with is to use the query's ColumnList to form an expression:
`\{(#ListChangeDelims(QueryName.ColumnList,'|')#)\}`
This changes ColumnList into a pipe-delimited list - a set of alternatives, grouped by the parentheses - i.e. the generated pattern will be like:
\{(first_name|towncity|email)\}
(with the contents of that group going into capture group 1).
To actually populate the text (rather than just matching) you could do something similar, except there is no need for a regex here, just a straight replace whilst looping through columns:
<cfloop index="CurColumn" list=#QueryName.ColumnList#>
<cfset text = replace( text , '{#CurColumn#}' , QueryName[CurColumn][CurrentRow] , 'all' ) />
</cfloop>
(Since this is a standard replace, there's no need to escape the braces with backslashes; they have no special meaning here.)
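The same merge idea can be sketched outside ColdFusion as well. The JavaScript below is only an illustration: fillTemplate and the row object are hypothetical stand-ins for the loop over the query's columns, and unknown placeholders are left untouched by choice.

```javascript
// Hypothetical helper: fill {name} placeholders from a plain object that
// plays the role of one query row.
function fillTemplate(template, row) {
  return template.replace(/\{([^{}]+)\}/g, (whole, name) =>
    // Leave the placeholder as-is when the row has no such column.
    Object.prototype.hasOwnProperty.call(row, name) ? String(row[name]) : whole);
}

const row = { firstname: 'Ann', towncity: 'Leeds' };
const filled = fillTemplate('Hi {firstname} from {towncity}, mail: {email}', row);

console.log(filled); // Hi Ann from Leeds, mail: {email}
```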
Use the reMatch(reg_expression, string_to_search) function.
The details on Regular Expressions in Coldfusion 10 are here. (I believe the regexp in CF8 would be roughly the same.)
Use the following code.
<cfset str = "Hi my name is {firstname}. And I live in {towncity} my email address is {email}.">
<cfoutput>Search string: <b>#str#</b><br />Search result:<br /></cfoutput>
<cfset ret = reMatch("\{[\w\s\(\)\+\.@-]+\}", str)>
<cfdump var ="#ret#">
This returns an array with the following entries.
{firstname}
{towncity}
{email}
The [] brackets in CF regular expressions define a character set to match a single character. You put + after the brackets to match one or more characters from the character set defined inside the []. For example, to match one or more upper case letters you could write [A-Z]+.
As detailed in the link above, CF defines shortcuts to match various characters. The ones I used in the code are: \w to match an alpha-numeric character or an underscore, \s to match a whitespace character (including space, tab, newline, etc.).
To match the following special characters +*?.[^$({|\ you escape them by writing backslash \ before them.
An exception to this is the dash - character, which cannot be escaped with a backslash. So, to use it as a literal simply place it at the very end of the character set, like I did above.
Using the above regular expression you can extract characters from the following string, for example.
<cfset str = "Hi my name is { John Galt}. And I live in {St. Peters-burg } my email address is {john@exam_ple.com}.">
The result would be an array with the following entries.
{ John Galt}
{St. Peters-burg }
{john@exam_ple.com}
There may be much better ways to do this, but using something like rematch( '{.*?}', yourstring ) would give you an array of all the matches.
For future reference, I did this with the excellent RegExr, a really nice online regex checker. Full disclosure, it's not specifically for ColdFusion, but it's a great way to test things out.

how to group in regex matching correctly?

consider following scenario
input string = "WIPR.NS"
i have to replace this with "WIPR2.NS"
i am using following logic.
match pattern = "(.*)\.NS$" \\ any string that ends with .NS
replace pattern = "$12.NS"
In above case, since there is no group with index 12, i get result $12.NS
But what i want is "WIPR2.NS".
If i don't have digit 2 to replace, it works in all other cases but not working for 2.
How to resolve this case?
Thanks in advance,
Alok
Usually depends entirely on your regex engine (I'm not familiar with those that use $1 to represent a capture group, I'm more used to \1 but you'd have the same problem with that).
Some will provide a delimiter that you can use, like:
replace pattern = "${1}2.NS"
which clearly indicates that you want capture group 1 followed by the literal 2.NS.
In fact, by looking at this page, it appears that's exactly the way to do it (assuming .NET):
To replace with the first backreference immediately followed by the digit 9, use ${1}9. If you type $19, and there are less than 19 backreferences, the $19 will be interpreted as literal text, and appear in the result string as such.
Also keep in mind that Jay provides an excellent answer for this specific use case that doesn't require capture groups at all (by just replacing .NS with 2.NS).
You may want to look into that as a possibility - I'll leave this answer here since:
it's the accepted answer; and
it's probably better for the more complex cases, like changing X([A-Z])4([A-Z]) with X${1}5${2}, where you have variable text on either side of the bit you wish to modify.
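The question does not name its regex engine, so as an illustration only, here are two ways to sidestep the "group 1 followed by a digit" ambiguity in JavaScript: a replacement function, and a named group. The group name symbol is made up for this sketch.

```javascript
const input = 'WIPR.NS';

// A replacement function receives the captured text directly, so there is
// no "$1 followed by 2" ambiguity to resolve.
const viaFunction = input.replace(/(.*)\.NS$/, (match, symbol) => symbol + '2.NS');

// A named group (ES2018) is referenced as $<name>, which cannot run into
// the digit that follows it.
const viaNamedGroup = input.replace(/(?<symbol>.*)\.NS$/, '$<symbol>2.NS');

console.log(viaFunction);   // WIPR2.NS
console.log(viaNamedGroup); // WIPR2.NS
```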
You don't need to do anything with what precedes the .NS, since only what is being matched is subject to replacement.
match pattern = "\.NS$" (any string that ends with .NS -- don't forget to escape the .)
replace pattern = "2.NS"
You can further refine this with lookaround zero-width assertions, but that depends on your regex engine, and you have not specified the environment/programming language in which you are working.
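Both variants can be sketched in JavaScript, chosen here only as an example environment since the question does not name one. The second form uses a zero-width look-ahead so that nothing is consumed and the 2 is simply inserted.

```javascript
// Replace only the suffix; the text before it is never part of the match.
const simple = 'WIPR.NS'.replace(/\.NS$/, '2.NS');

// Zero-width look-ahead: match the empty position just before ".NS" at the
// end of the string and insert "2" there.
const viaLookahead = 'WIPR.NS'.replace(/(?=\.NS$)/, '2');

// A string that does not end in .NS is left unchanged.
const untouched = 'WIPR.BO'.replace(/\.NS$/, '2.NS');

console.log(simple);       // WIPR2.NS
console.log(viaLookahead); // WIPR2.NS
console.log(untouched);    // WIPR.BO
```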

How to match any word from a word group

I'm trying to create a pattern that would identify a money amount in a string. My expression so far is:
(\d{1,3}[\.,\s]{0,2})*\d{3}[\.,\s]{0,2}\d{0,2}[\s]{0,2}[zl|zł|zlotych|złotych|pln|PLN]{0,1}
and my main problem is with the last part: [zl|zł|zlotych|złotych|pln|PLN], which should find one of the national notations for a money value (something like $ or usd or dollars), but I'm doing it wrong, since it also matches something like '108.1 z'. Is it possible to change the last part so that it would match only whole expressions like 'zl', 'pln' and so on, and not single letters?
Yes, don't use [], which defines a character class, but instead use () to group your words.
(\d{1,3}[\.,\s]{0,2})*\d{3}[\.,\s]{0,2}\d{0,2}[\s]{0,2}(zl|zł|zlotych|złotych|pln|PLN)?
As you had it written, [zl|zł|zlotych|złotych|pln|PLN], means "match any of the characters contained in the []", or the equivalent of: [zl|łotychpnPLN] (duplicates removed)
If you don't want the money symbol captured, then start the group with ?:, i.e.:
(\d{1,3}[\.,\s]{0,2})*\d{3}[\.,\s]{0,2}\d{0,2}[\s]{0,2}(?:zl|zł|zlotych|złotych|pln|PLN)?
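A quick way to see the difference is to run both versions against the problematic input. The sketch below uses JavaScript as an example engine (the question does not name one); the variable names are made up, and the patterns are otherwise copied from above.

```javascript
// Original: the last part is a character class, so it matches ONE character.
const withClass = /(\d{1,3}[\.,\s]{0,2})*\d{3}[\.,\s]{0,2}\d{0,2}[\s]{0,2}[zl|zł|zlotych|złotych|pln|PLN]{0,1}/;
// Fixed: the last part is an optional non-capturing group of whole words.
const withGroup = /(\d{1,3}[\.,\s]{0,2})*\d{3}[\.,\s]{0,2}\d{0,2}[\s]{0,2}(?:zl|zł|zlotych|złotych|pln|PLN)?/;

const classMatch = withClass.exec('108.1 z')[0];      // the lone "z" is accepted
const groupMatch = withGroup.exec('108.1 z')[0];      // the lone "z" is left out
const groupZl = withGroup.exec('108.1 zl')[0];        // a whole "zl" is accepted
const groupLong = withGroup.exec('108.1 zlotych')[0]; // stops after "zl"

console.log(JSON.stringify([classMatch, groupMatch, groupZl, groupLong]));
// ["108.1 z","108.1 ","108.1 zl","108.1 zl"]
```

The last result shows that alternatives are tried in order, so with zl listed before zlotych the match ends after zl; listing the longer words first is one way to capture the full word.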
Use parentheses (which delimit groups) rather than square brackets (which delimit character classes) around that last group.
As a matter of style, use ? instead of {0,1}.
(\d{1,3}[\.,\s]{0,2})*\d{3}[\.,\s]{0,2}\d{0,2}[\s]{0,2}(zl|zł|zlotych|złotych|pln|PLN)?
You have a few problems here. First off, inside [] characters are taken as literals, so the first two [] blocks should be [.,\s].
Next (as the other answers say), the last [] block needs to be a group, not a character class, so replace the [] with ().
Finally, at the end you can replace {0, 1} with ?. It won't make a difference, but it's neater.
The regex should look like this:
(\d{1,3}[.,\s]{0,2})*\d{3}[.,\s]{0,2}\d{0,2}[\s]{0,2}(zl|zł|zlotych|złotych|pln|PLN)?
For the future, for regex questions it's really helpful if you post a typical input string and desired match along with your question!

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
This matches everything between B= and ; unless it is a ;. So it matches everything between B= and the first ; thereafter.
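The three quantifier choices can be compared side by side. This is a sketch in JavaScript (the question's language is not stated), reading the capture group directly instead of going through a replace.

```javascript
const s = 'A=abc;B=def_3%^123+-;C=123;';

// Greedy: .+ runs to the end and backs off only as far as the LAST semicolon.
const greedy = /B=(.+);/.exec(s)[1];
// Lazy: .+? grows one character at a time and stops at the FIRST semicolon.
const lazy = /B=(.+?);/.exec(s)[1];
// Negated class: semicolons cannot be part of the captured value at all.
const negated = /B=([^;]+);/.exec(s)[1];

console.log(greedy);  // def_3%^123+-;C=123
console.log(lazy);    // def_3%^123+-
console.log(negated); // def_3%^123+-
```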
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
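import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FindDemo {
    public static void main(String[] args) {
        String s = "A=abc;B=def_3%^123+-;C=123;";
        Pattern p = Pattern.compile("B=(.*?);");
        Matcher m = p.matcher(s);
        // find() looks for the first match anywhere in the string
        if (m.find()) {
            System.out.println(m.group(1));
        }
    }
}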
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.
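For completeness, the getInnerString helper from the question's pseudocode could be sketched as a "find" rather than a "replace". This version is in JavaScript, which is an assumption since the question's language is unknown; escapeRegExp is a helper introduced here so that tokens containing regex metacharacters are treated literally.

```javascript
// Helper introduced for this sketch: escape regex metacharacters in a token.
function escapeRegExp(token) {
  return token.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns the text between the first startToken and the NEXT endToken,
// or null when no such pair exists.
function getInnerString(inStr, startToken, endToken) {
  const pattern = new RegExp(
    escapeRegExp(startToken) + '([\\s\\S]*?)' + escapeRegExp(endToken));
  const m = pattern.exec(inStr);
  return m ? m[1] : null;
}

const myString = 'A=abc;B=def_3%^123+-;C=123;';
const myB = getInnerString(myString, 'B=', ';');

console.log(myB); // def_3%^123+-
```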