Regular Expressions: Dealing with + (Plus) Sign - regex

I am a little confused with regular expressions. My intent in below example is to replace all 'NE_NS+' with 'NE_OS+_OE_NS'.When I am giving below code, I don't see any issue with replace results
tempString1 = tempString1.replace (/\NE__NS_+/g,'NE__OS_+_OE__NS_');
When I am giving below code, I see that there are issues. My intent here is to replace all < mn+> with < mn> [No space between < and m]
tempString2 = tempString2.replace (/\<mn>+/g,'<mn>');
and right code for above replace seems to be
tempString3 = tempString3.replace (/\<mn>\+/g,'<mn>');
Why is '+' not relevant in replace example of tempString1 while it is relevant in tempString2 example and wont work until I change it as per code in tempString3?
I have tough time understanding regex. Any books/articles that can help me understand them. I am a novice at regular expression.

The problem
Let's take a closer look at your various regexes. You'll understand what's going on:
tempString1 \NE__NS_+
Clearly, one or more _ are expected at the end.
tempString2 \<mn>+
The \ is simply ignored because it is an escape character. Moreover, it used before < that's don't need to be escaped. Again, > are expected one or more times.
tempString3 \<mn>\+
Here the + is escaped, indicating that it is not a meta-character but the plus sign that has to be matched from your temporary string.
The solution
To sum it up, if you want to match NE_NS+, the plus sign must be escaped.
So your regex will be:
NE_NS\+
If you want to match < mn+>, you'll use \s for matching a blank character (space, tabulation, carriage return etc). Again, you must escape + since it's a meta character.
So, you end up with:
<\smn\+>
There is more...
Use the powerful Debuggex to visualize your regex.
Secondly, use Regexr to quickly live test your regex against a given input text.

Related

Get all matches for a certain pattern using RegEx

I am not really a RegEx expert and hence asking a simple question.
I have a few parameters that I need to use which are in a particular pattern
For example
$$DATA_START_TIME
$$DATA_END_TIME
$$MIN_POID_ID_DLAY
$$MAX_POID_ID_DLAY
$$MIN_POID_ID_RELTM
$$MAX_POID_ID_RELTM
And these will be replaced at runtime in a string with their values (a SQL statement).
For example I have a simple query
select * from asdf where asdf.starttime = $$DATA_START_TIME and asdf.endtime = $$DATA_END_TIME
Now when I try to use the RegEx pattern
\$\$[^\W+]\w+$
I do not get all the matches(I get only a the last match).
I am trying to test my usage here https://regex101.com/r/xR9dG0/2
If someone could correct my mistake, I would really appreciate it.
Thanks!
This will do the job:
\$\$\w+/g
See Demo
Just Some clarifications why your regex is doing what is doing:
\$\$[^\W+]\w+$
Unescaped $ char means end of string, so, your pattern is matching something that must be on the end of the string, that's why its getting only the last match.
This group [^\W+] doesn't really makes sense, groups starting with [^..] means negate the chars inside here, and \W is the negation of words, and + inside the group means literally the char +, so you are saying match everything that is Not a Not word and that is not a + sign, i guess that was not what you wanted.
To match the next word just \w+ will do it. And the global modifier /g ensures that you will not stop on the first match.
This should work - Based on what you said you wanted to match this should work . Also it won't match $$lower_case_strings if that's what you wanted. If not, add the "i" flag also.
\${2}[A-Z_]+/g

Regular Expression using vbscript for two numbers: not valid:"$123456789012" and valid: "12345678912"

I wanted to create regular Expression to find the number with 10 or more digits and that number should not have $ symbol in front of it.
Eg:
not valid:$123456789012 and valid: 12345678912.
Apart from this I have more validations for example finding number pattern: 3digits - 4digits, 3digits - 4digits - 5digits.
But for now I am able to create pattern for all those but unable to do for $<number>, could you please help.
Sorry for not mentioning in the beginning - this is using vbscript.
Code:
RegularExpressionObject.IgnoreCase = True
RegularExpressionObject.Global = True
RegularExpressionObject.Pattern =
"(([0-9]{3}-[0-9]{4})|([0-9]{3}-[0-9]{4}-[0-9]{5})|([0-9]{3}-[0-9]{4}-[0-9]{5}-[‌​0-9]{6})|([0-9]{10}))"
'pattern:CCID or 3(digits)-4(digits), 3(digits)-4(digits)-5(digits), 3(digits)-4(digits)-5(digits)-6(digits), 10digts and above number
Set Matches = RegularExpressionObject.Execute(Rescomts)
If (Matches.Count <> 0) Then
In the above code, regular expression pattern 10 digits are allowed, but I wanted to ignore 10 digits starting from $ symbol
As #Sam said, you would probably use a negative look-behind:
(?<!\$)\b[0-9]{10,}\b
Example: http://regex101.com/r/vR3iF6
Depending on the language selected, look-around may not be available. For example, Javascript limits the use of (?<!\$) so you may need to write it as
[^$]\b[0-9]{10,}\b
Example: http://regex101.com/r/oH5uE4
Some languages support \d, which may make your regex look cleaner. Others don't, and you'll need to write [0-9].
EDIT
Per #AlanMoore's suggestion, extraneous characters may also interfere. You might be able to get around this by using \b[^$]\b instead of just a single \b:
\b[^$]\b[0-9]{10,}\b
Example: http://regex101.com/r/lL2bN9
This would get rid of preceding spaces, non-digit characters, etc.
This regular expression seems to work:
^[^\$][0-9]{9,}$

how to avoid to match the last letter in this regexp?

I have a quesion about regexp in tcl:
first output: TIP_12.3.4 %
second output: TIP_12.3.4 %
and sometimes the output maybe look like:
first output: TIP_12 %
second output: TIP_12 %
I want to get the number 12.3.4 or 12 using the following exgexp:
output: TIP_(/[0-9].*/[0-9])
but why it does not matches 12.3.4 or 12%?
You need to escape the dot, else it stands for "match every character". Also, I'm not sure about the slashes in your regexp. Better solution:
/TIP_(\d+\.?)+/
Your problem is that / is not special in Tcl's regular expression language at all. It's just an ordinary printable non-letter character. (Other languages are a little different, as it is quite common to enclose regular expressions in / characters; this is not the case in Tcl.) Because it is a simple literal, using it in your RE makes it expect it in the input (despite it not being there); unsurprisingly, that makes the RE not match.
Fixing things: I'd use a regular expression like this: output: TIP_([\d.]+) under the assumption that the data is reasonably well formatted. That would lead to code like this:
regexp {output: TIP_([0-9.]+)} $input -> dottedDigits
Everything not in parentheses is a literal here, so that the code is able to find what to match. Inside the parentheses (the bit we're saving for later) we want one or more digits or periods; putting them inside a square-bracketed-set is perfect and simple. The net effect is to store the 12.3.4 in the variable dottedDigits (if found) and to yield a boolean result that says whether it matched (i.e., you can put it in an if condition usefully).
NB: the regular expression is enclosed in braces because square brackets are also Tcl language metacharacters; putting the RE in braces avoids trouble with misinterpretation of your script. (You could use backslashes instead, but they're ugly…)
Try this :
output: TIP_(/([0-9\.^%]*)/[0-9])
Capture group 1.
Demo here :
http://regexr.com?31f6g
The following expression works for me:
{TIP_((\d+\.?)+)}

eregi_replace to preg_replace conversion stuff

Regular expressions are not strong point.
I can do simple stuff, but this one has just got my goat !!
So could someone give me a hand with this one.
Here's the comment in the code :
// If utf8 detection didnt work before, strip those weird characters for an underscore, as a last resort.
eregi_replace("[^a-z0-9 \-\.\(\)\/\\]","_",$str);
to (here's what I tried)
preg_replace("{[^a-z0-9 \-\.\(\)\/\\]}i","_",$str);
Any regex pros out there who give me a hand?
You need to specify regexp identifier such as # or /
preg_replace("#[^a-z0-9 \-\.\(\)\/\\]#i","_",$str);
So you should enclose your regular expression in those identifier characters.
First, I believe the { and } are fine as delimiters for the expression from the flags, but I know there are some regex flavors that don't support it, so it might be a good idea to just use something like ! or #
Second, I am not sure how the expression before worked, because AFAIK escaping with a \ character does not work with ERE expressions. You have to represent special characters like ^, -, and ] by their position within the class (^ cannot be the first character, ] must be the first character, and - must be either the first or the last character). The - character in the first expression would be interpreted as a range specifier (in this case a character in the range between \ and \). Additionally, the \ characters are treated literally, so you've got a confusing looking and largely redundant regex.
The replacement expression, however, needs to be in preg notation/flavor, so there are rule changes:
Very few things need to be escaped in a character class, even with the new rules
The \ character needs to be escaped twice - once for the string, and then one more time for the regex - otherwise, it will escape the closing bracket ]
Assuming you want to match a dash (or rather match something OTHER than a dash, it needs to be moved to the end of the class
So, here is some code (link) that I believe does what you need it to do:
$source = 'hello! ##$%^&* wazzup-dawg?.()/\\[]{}<>:"';
$blah = preg_replace('![^a-z0-9 .()/\\\\-]!i','_',$source);
print($blah);
preg_replace("{[^a-z0-9]-.()/\/}i","_",$str)
works just fine.
I tried it with all # and / and { and they all worked.

replace word using Regex, but not in Quotes in C# [duplicate]

From this q/a, I deduced that matching all instances of a given regex not inside quotes, is impossible. That is, it can't match escaped quotes (ex: "this whole \"match\" should be taken"). If there is a way to do it that I don't know about, that would solve my problem.
If not, however, I'd like to know if there is any efficient alternative that could be used in JavaScript. I've thought about it a bit, but can't come with any elegant solutions that would work in most, if not all, cases.
Specifically, I just need the alternative to work with .split() and .replace() methods, but if it could be more generalized, that would be the best.
For Example:
An input string of: +bar+baz"not+or\"+or+\"this+"foo+bar+
replacing + with #, not inside quotes, would return: #bar#baz"not+or\"+or+\"this+"foo#bar#
Actually, you can match all instances of a regex not inside quotes for any string, where each opening quote is closed again. Say, as in you example above, you want to match \+.
The key observation here is, that a word is outside quotes if there are an even number of quotes following it. This can be modeled as a look-ahead assertion:
\+(?=([^"]*"[^"]*")*[^"]*$)
Now, you'd like to not count escaped quotes. This gets a little more complicated. Instead of [^"]* , which advanced to the next quote, you need to consider backslashes as well and use [^"\\]*. After you arrive at either a backslash or a quote, you need to ignore the next character if you encounter a backslash, or else advance to the next unescaped quote. That looks like (\\.|"([^"\\]*\\.)*[^"\\]*"). Combined, you arrive at
\+(?=([^"\\]*(\\.|"([^"\\]*\\.)*[^"\\]*"))*[^"]*$)
I admit it is a little cryptic. =)
Azmisov, resurrecting this question because you said you were looking for any efficient alternative that could be used in JavaScript and any elegant solutions that would work in most, if not all, cases.
There happens to be a simple, general solution that wasn't mentioned.
Compared with alternatives, the regex for this solution is amazingly simple:
"[^"]+"|(\+)
The idea is that we match but ignore anything within quotes to neutralize that content (on the left side of the alternation). On the right side, we capture all the + that were not neutralized into Group 1, and the replace function examines Group 1. Here is full working code:
<script>
var subject = '+bar+baz"not+these+"foo+bar+';
var regex = /"[^"]+"|(\+)/g;
replaced = subject.replace(regex, function(m, group1) {
if (!group1) return m;
else return "#";
});
document.write(replaced);
Online demo
You can use the same principle to match or split. See the question and article in the reference, which will also point you code samples.
Hope this gives you a different idea of a very general way to do this. :)
What about Empty Strings?
The above is a general answer to showcase the technique. It can be tweaked depending on your exact needs. If you worry that your text might contain empty strings, just change the quantifier inside the string-capture expression from + to *:
"[^"]*"|(\+)
See demo.
What about Escaped Quotes?
Again, the above is a general answer to showcase the technique. Not only can the "ignore this match" regex can be refined to your needs, you can add multiple expressions to ignore. For instance, if you want to make sure escaped quotes are adequately ignored, you can start by adding an alternation \\"| in front of the other two in order to match (and ignore) straggling escaped double quotes.
Next, within the section "[^"]*" that captures the content of double-quoted strings, you can add an alternation to ensure escaped double quotes are matched before their " has a chance to turn into a closing sentinel, turning it into "(?:\\"|[^"])*"
The resulting expression has three branches:
\\" to match and ignore
"(?:\\"|[^"])*" to match and ignore
(\+) to match, capture and handle
Note that in other regex flavors, we could do this job more easily with lookbehind, but JS doesn't support it.
The full regex becomes:
\\"|"(?:\\"|[^"])*"|(\+)
See regex demo and full script.
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
You can do it in three steps.
Use a regex global replace to extract all string body contents into a side-table.
Do your comma translation
Use a regex global replace to swap the string bodies back
Code below
// Step 1
var sideTable = [];
myString = myString.replace(
/"(?:[^"\\]|\\.)*"/g,
function (_) {
var index = sideTable.length;
sideTable[index] = _;
return '"' + index + '"';
});
// Step 2, replace commas with newlines
myString = myString.replace(/,/g, "\n");
// Step 3, swap the string bodies back
myString = myString.replace(/"(\d+)"/g,
function (_, index) {
return sideTable[index];
});
If you run that after setting
myString = '{:a "ab,cd, efg", :b "ab,def, egf,", :c "Conjecture"}';
you should get
{:a "ab,cd, efg"
:b "ab,def, egf,"
:c "Conjecture"}
It works, because after step 1,
myString = '{:a "0", :b "1", :c "2"}'
sideTable = ["ab,cd, efg", "ab,def, egf,", "Conjecture"];
so the only commas in myString are outside strings. Step 2, then turns commas into newlines:
myString = '{:a "0"\n :b "1"\n :c "2"}'
Finally we replace the strings that only contain numbers with their original content.
Although the answer by zx81 seems to be the best performing and clean one, it needes these fixes to correctly catch the escaped quotes:
var subject = '+bar+baz"not+or\\"+or+\\"this+"foo+bar+';
and
var regex = /"(?:[^"\\]|\\.)*"|(\+)/g;
Also the already mentioned "group1 === undefined" or "!group1".
Especially 2. seems important to actually take everything asked in the original question into account.
It should be mentioned though that this method implicitly requires the string to not have escaped quotes outside of unescaped quote pairs.