parentheses only used for capture group in regular expression matching? - regex

I get confused when reading regular expressions written by others. My question is: can parentheses () only be used for capture groups? Or can we also use them to make a regular expression easier to read by grouping a logical set of operations together, especially in a long regular expression?
I ask because I sometimes see people add extra parentheses to a long regular expression even when they do not need the capture group match information. I mean, even if they only need to know whether the whole regular expression matches, they still add some parentheses.

Parentheses are used for grouping, and by default they also make the regex engine capture the subpattern inside the parentheses.
If you don't want to capture the text inside, use a non-capturing group with this syntax:
(?:...)
This will also save some memory while processing a long and complex regex.
See more on Non capturing groups
Here is a basic and simple example of how and where () should be placed carefully. Say the regex is this:
^\w+|\d+$
Here we are selecting a word at the start or digits at the end of the input:
foo -a123 bar baz 123
Thus we are matching foo and 123.
If you place brackets () like this:
^(\w+|\d+)$
then your match will fail as we are asserting presence of both anchors ^ and $ around a word OR a number in input.
Correct regex with brackets would be:
(?:^\w+|\d+$)
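The effect of where the parentheses sit can be checked with a short Python sketch. It uses Python's re module as an assumption (the answer does not name a flavor), and the variable names are only illustrative:

```python
import re

text = "foo -a123 bar baz 123"

# No parentheses: '^' binds only to \w+ and '$' binds only to \d+.
ungrouped = re.findall(r"^\w+|\d+$", text)

# Parentheses around the alternation only: both anchors now apply to
# either branch, so the whole input would have to be one word or one number.
anchored_both = re.search(r"^(\w+|\d+)$", text)

# Non-capturing group around everything: same matches as the ungrouped form.
grouped = re.findall(r"(?:^\w+|\d+$)", text)

print(ungrouped)      # ['foo', '123']
print(anchored_both)  # None
print(grouped)        # ['foo', '123']
```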

Related

use case for ?: in tcl regexp

I read the documentation of ?: in Tcl regexp, which says that it matches an expression without capturing it.
I tried it and it worked fine.
My query is: what is the proper use case for this option? If we do not want to capture a sequence, we simply won't put brackets there.
Is it just an alternate way, or is there some special condition where we should use it? Kindly clarify.
Easy: You need to group several elements in your Regex, but you don't need them as a capturing group for reference.
a+ (b+|c+) OR (a+ b+)|c+
I need parentheses for grouping. But if I run it like this, the engine will capture all those matches. This takes extra memory and can cost performance. If I don't need the capturing groups later for reference, I can use ?: to get grouping without that overhead:
a+ (?:b+|c+) OR (?:a+ b+)|c+
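As a rough illustration of why the grouping itself matters here, the two patterns can be compared in Python (an assumption, since the question is about Tcl; the variable names are made up for this sketch). The parentheses change what the | applies to, and with ?: neither pattern creates a capture group:

```python
import re

# The alternation applies only to the part after "a+ ".
tail_choice = re.compile(r"a+ (?:b+|c+)")
# The alternation is between "a+ b+" as a unit and "c+".
whole_choice = re.compile(r"(?:a+ b+)|c+")

print(bool(tail_choice.fullmatch("aa cc")))     # True
print(bool(whole_choice.fullmatch("aa cc")))    # False
print(bool(whole_choice.fullmatch("cc")))       # True

# Number of capture groups in each compiled pattern.
print(tail_choice.groups, whole_choice.groups)  # 0 0
```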
First, have a look at the Tcl regex reference:
(expression)
Parentheses surrounding an expression specify a nested expression. The substring matching expression is captured and can be referred to via the back reference mechanism, and also captured into any corresponding match variable specified as an argument to the command.
(?:expression)
matches expression without capturing it.
While the first part describing capturing group ability to capture subtext to be referred to with backreferences is universal, the second part dwelling on initializing variables based on the capturing group is specific to Tcl.
Bearing that in mind, Tcl regex usage can be greatly simplified with non-capturing groups in case you have a pattern with a number of capturing groups, and you want to modify it by adding another group in-between existing groups.
Say, you want to match strings like abc 1234 (comment) and use {(\w+)\s+(\d+)\s+\(([^()]+)\)}:
regexp {(\w+)\s+(\d+)\s+\(([^()]+)\)} $a - body num comment
However, you were asked to also match strings with any number of word+space+digits in-between 1234 and comment. If you write
set a1 "abc 1234 more 5678 text 890 here 678 (comment)"
regexp {(\w+)\s+(\d+)(\s+\w+\s+\d+)*\s+\(([^()]+)\)} $a1 - body1 num1 comment1
                     ^^^^^^^^^^^^^^^
then $comment1 will hold a value you would not expect: it receives what the new third group captured instead of the comment.
Turning it into a non-capturing group fixes the issue.
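The same group-shifting effect can be reproduced outside Tcl. Below is a minimal sketch in Python (not the Tcl code from this answer; it only assumes the same pattern and sample string) comparing the capturing and non-capturing versions:

```python
import re

a1 = "abc 1234 more 5678 text 890 here 678 (comment)"

capturing = re.match(r"(\w+)\s+(\d+)(\s+\w+\s+\d+)*\s+\(([^()]+)\)", a1)
non_capturing = re.match(r"(\w+)\s+(\d+)(?:\s+\w+\s+\d+)*\s+\(([^()]+)\)", a1)

# With the extra capturing group, group 3 is the last repetition of the
# inserted group, and the comment has moved to group 4.
print(capturing.groups())      # ('abc', '1234', ' here 678', 'comment')

# With (?:...), the comment is the third group again.
print(non_capturing.groups())  # ('abc', '1234', 'comment')
```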
See IDEONE demo
For other common uses of a non-capturing group, please refer to Are optional non-capturing groups redundant post.
You can use non-capturing (?:...) parentheses in a regex when matching multiple word options which you then do not want to capture.
(?:one|two|three)

Negative lookahead alternative

For a URL pattern such as this one:
/detail.php?a=BYGhs5w8e9o&b=234844617545&h=9827a
I would like Google Analytics to match only the URL's with the a and b parameters in it:
/orderdetail.php?a=BYGhs5w8e9o&b=234844617545
And thus strip out:
&h=9827a
The main goal is to be able to setup a goal in Google Analytics which covers only the a and b parameters and ignores the h parameter.
Is there an easy way to accomplish this without a negative lookahead?
Standard regular expressions do not need negative lookahead for this. Just do a match and replace. Searching for:
(/detail.php\?a=\w+&b=\w+)&h=\w+
and replacing with \1 works with the regular expressions in Notepad++ version 6.5.5. Google's regular expressions may be subtly different.
The above works by surrounding the wanted text with capturing parentheses and leaving the unwanted part outside. The ? needs escaping because, unescaped, it means the previous item (i.e. the p) is optional. The \w sequence means any "word" character, so \w+ means a run of one or more word characters.
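For illustration, the same match-and-replace can be sketched in Python. This is an assumption for demonstration only, since Notepad++ and Google Analytics use their own regex flavors. The dot is additionally escaped here so that it only matches a literal ".":

```python
import re

url = "/detail.php?a=BYGhs5w8e9o&b=234844617545&h=9827a"

# Group 1 is the part to keep; "&h=..." is matched but left outside the group.
pattern = r"(/detail\.php\?a=\w+&b=\w+)&h=\w+"

# Replace the whole match with just the captured part.
stripped = re.sub(pattern, r"\1", url)
print(stripped)  # /detail.php?a=BYGhs5w8e9o&b=234844617545
```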

Regex for AND operator

I need a regex that matches
from origin to id= and from ;type to cases.
I applied an "OR" condition, but it satisfies only one condition. Any suggestions?
origin=eBook;id=N27F-00000-00;type=cases
Regex:
(^(.*id=)|(;type=cases.*))
You are mistaking some fundamentals of regular expressions, which I'll explain in a minute. But for now, try this:
id=(.*?);type=cases
A regular expression search can match anywhere inside a string, so it can match just part of the string, and you don't need to use .* on either side of the pattern (unless you want to capture that information).
Since we aren't matching the .* in the beginning, you won't need to start from the beginning of the string (^).
There is no such thing as an AND operator, since an entire regular expression must match by default.
Update
This will still match the whole chunk id=N27F-00000-00;type=cases. Since I used parentheses around the important part (N27F-00000-00), it will be placed in a "match group". If you don't want to deal with match groups, you can use "lookarounds":
(?<=id=).*?(?=;type=cases)
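The difference between the two patterns can be seen in a short Python sketch (Python is an assumption here, as the question does not name a regex flavor, and not every flavor supports lookbehind):

```python
import re

s = "origin=eBook;id=N27F-00000-00;type=cases"

# Capture group: the overall match includes the delimiters,
# and the wanted part is available as group 1.
with_group = re.search(r"id=(.*?);type=cases", s)
print(with_group.group(0))  # id=N27F-00000-00;type=cases
print(with_group.group(1))  # N27F-00000-00

# Lookarounds: the delimiters are asserted but not consumed,
# so the overall match is already just the wanted part.
with_lookarounds = re.search(r"(?<=id=).*?(?=;type=cases)", s)
print(with_lookarounds.group(0))  # N27F-00000-00
```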

Sublime Text 2 - Regex Search - Non-Capture Group Syntax

I'm trying to use ST2's regex capability in search & replace, but can't figure out how to properly make a non-capturing group. For this example, I want to find instances of "DEAN" which are not followed by "UMBER", i.e. to distinguish "DEANCARE" from "DEANUMBER".
From what I've read and used in the past, the syntax with a non-capture should be:
DEAN(?:UMBER)
Which should match "DEANCARE" but not "DEANUMBER". Yet instead, Sublime Text only finds "DEANUMBER" as if I had typed:
DEAN(UMBER)
Using square brackets on the first (or each) of the unwanted letters does work:
DEAN[^U]
But I'd still prefer to use a non-matching group, both for other purposes and to avoid having to explicitly not-match each individual character. Do I have a syntax mistake, or maybe a conceptual error in how ST2's regex works?
A non capturing group is the same as a group except it does not capture the matching portion of the regex in a back-reference.
If you were to use the regex DEAN(?:UMBER) on the string DEANUMBER then you would have a match, but referencing \1 in, e.g. a search and replace would give you nothing, because the group is non-capturing.
Using DEAN(UMBER), on the other hand, you could do a search and replace with "made of L\1", which would produce "made of LUMBER", because the match of the first (capturing) group is being back-referenced by \1. This is of course a rather pointless example; if you want to learn more about groups and back-referencing, I'd suggest you read this or some other documentation/tutorial on the matter.
As suggested in the comments, what you want is a negative lookahead, written (?!...): DEAN(?!UMBER) matches DEAN only when it is not followed by UMBER.
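The three behaviours discussed in this answer can be compared in a small Python sketch. Python's re module is used here as an assumption in place of Sublime Text's engine, and the trailing \w* is added only so that the whole word shows up in the output:

```python
import re

text = "DEANCARE DEANUMBER"

# Non-capturing group: still requires UMBER to follow DEAN.
non_capturing = re.findall(r"DEAN(?:UMBER)", text)
print(non_capturing)  # ['DEANUMBER']

# Negative lookahead: DEAN must NOT be followed by UMBER.
negative_lookahead = re.findall(r"DEAN(?!UMBER)\w*", text)
print(negative_lookahead)  # ['DEANCARE']

# Capturing group: \1 in the replacement refers to the captured UMBER.
replaced = re.sub(r"DEAN(UMBER)", r"made of L\1", "DEANUMBER")
print(replaced)  # made of LUMBER
```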

What does (?: do in a regular expression

I have come across a regular expression that I don't fully understand - can somebody help me in deciphering it:
^home(?:\/|\/index\.asp)?(?:\?.+)?$
It is used in url matching and the above example matches the following urls:
home
home/
home/?a
home/?a=1
home/index.asp
home/index.asp?a
home/index.asp?a=1
It seems to me that the question marks within the brackets (?: don't do anything. Can somebody enlighten me?
The version of regex being used is the one supplied with Classic ASP and is being run on the server if that helps at all.
(?:) creates a non-capturing group. It groups things together without creating a backreference.
A backreference is a part you can refer to in the expression or in a possible replacement (usually by writing \1 or $1 etc., depending on flavor). You can also usually extract them from a match afterwards when using regex in a programming language. The only reason for using (?:) is to avoid creating a new backreference, which avoids incrementing the group number and saves (a usually negligible amount of) memory.
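As a quick check, the pattern from the question can be run against the listed URLs. The sketch below uses Python's re module as an assumption (the question is about the Classic ASP engine, whose behaviour may differ in details):

```python
import re

pattern = re.compile(r"^home(?:\/|\/index\.asp)?(?:\?.+)?$")

urls = [
    "home",
    "home/",
    "home/?a",
    "home/?a=1",
    "home/index.asp",
    "home/index.asp?a",
    "home/index.asp?a=1",
]

# Every URL from the question matches the pattern.
all_match = all(pattern.match(u) is not None for u in urls)
print(all_match)       # True

# Both groups are non-capturing, so no backreferences are created.
print(pattern.groups)  # 0
```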
It's a non-capture group, which essentially is the same as using (...), but the content isn't retained (not available as a back reference).
If you're doing something like this: (abc)(?:123)(def) You'll get abc in $1 and def in $2, but 123 will only be matched.
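A minimal sketch of that numbering, assuming Python's re module (where the groups are read with groups() rather than $1 and $2):

```python
import re

m = re.match(r"(abc)(?:123)(def)", "abc123def")

# The non-capturing group takes part in the overall match...
print(m.group(0))  # abc123def

# ...but only the two capturing groups are numbered and stored.
print(m.groups())  # ('abc', 'def')
```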
From documentation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
It's really easy.
Every pair of capturing parentheses creates a variable in memory so that you can use the parenthesized value afterward. To not save it in memory, just put ?: right after the opening parenthesis, like this (?:), and then fill in the rest as you need.
That's it, nothing else.