What does (?: do in a regular expression - regex

I have come across a regular expression that I don't fully understand - can somebody help me in deciphering it:
^home(?:\/|\/index\.asp)?(?:\?.+)?$
It is used in url matching and the above example matches the following urls:
home
home/
home/?a
home/?a=1
home/index.asp
home/index.asp?a
home/index.asp?a=1
It seems to me that the question marks within the brackets (?: don't do anything. Can somebody enlighten me?
The version of regex being used is the one supplied with Classic ASP and is being run on the server if that helps at all.

(?:) creates a non-capturing group. It groups things together without creating a backreference.
A backreference is a part you can refer to in the expression or in a possible replacement (usually by saying \1 or $1 etc., depending on flavor). You can also usually extract them from a match afterwards when using regex in a programming language. The only reason for using (?:) is to avoid creating a new backreference, which avoids incrementing the group number and saves (a usually negligible amount of) memory.
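As a rough check of the pattern from the question, here is a minimal sketch using Python's re module. That choice is an assumption made for illustration: Classic ASP uses the VBScript RegExp object, but this particular pattern uses only syntax the two engines share. The "home/other" URL is a hypothetical non-matching case added here.

```python
import re

# Pattern from the question. Both (?:...) groups only group; they capture nothing.
pattern = re.compile(r'^home(?:\/|\/index\.asp)?(?:\?.+)?$')

urls = [
    "home", "home/", "home/?a", "home/?a=1",
    "home/index.asp", "home/index.asp?a", "home/index.asp?a=1",
]

# Every URL listed in the question should match.
all_match = all(pattern.match(u) is not None for u in urls)

# A hypothetical URL that should not match.
other = pattern.match("home/other")

# Number of capturing groups in the pattern: none, because both groups use ?:
group_count = pattern.groups

print(all_match, other, group_count)
```

Without the ?: the pattern would match the same URLs, but it would expose two numbered groups.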

It's a non-capture group, which essentially is the same as using (...), but the content isn't retained (not available as a back reference).
If you're doing something like this: (abc)(?:123)(def) You'll get abc in $1 and def in $2, but 123 will only be matched.
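The same example can be sketched in Python (an assumption; the thread does not name a language for it), where groups are read with m.group(n) rather than $n:

```python
import re

m = re.match(r'(abc)(?:123)(def)', 'abc123def')

# Only the two capturing groups are numbered; (?:123) is matched but not stored.
groups = m.groups()   # ('abc', 'def')
whole = m.group(0)    # 'abc123def'

print(groups, whole)
```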

From documentation:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

It's really easy.
Every pair of parentheses creates a variable in memory so that you can use the parenthesized value afterwards. To avoid saving it in memory, put ?: at the start of the parentheses, like this: (?:), and then fill in the rest as you need.
That's it, nothing else.

Related

Vim Regex Negative Look Arounds and Capture Groups

Say you have the following text
foobar
bar
And you want the following as your desired output
foobar
foobar
You could use the following regex
s/\v(foo)@<!(bar)/foo\2/g
My earlier mistake was thinking that the back-reference for bar was \1 and not \2; I didn't think that the group inside the lookaround counted as a capture group. What's intriguing me now is what happens if you use \1 instead. The output you get is the following
foobar
foo
Using the logic stated above, if \1 refers to the first capture group, (foo), then I expect that the output would be
foobar
foofoo
After thinking about it for a bit, I suspect the answer is that, since it's a negative lookbehind, the match only succeeds when the specified text foo is not present. As such, the stored capture group is empty. This would result in foo being the output if \1 is the specified back reference. Am I correct in my deduction?
What causes me to be rather certain about this is if I were to change the regex around to use a positive lookbehind instead with a reference to the first capture group, as follows
s/\v(foo)@<=(bar)/foo\1/g
The output would then become
foofoo
bar
Meaning that since it's a positive lookbehind, the capture group (foo) matches when foo is present, thus the stored capture group would have to be foo.
The source of this confusion is the fact that Perl regex works in the fashion that regex lookarounds are not included as a capture group. If I am correct in what I have stated above, I'm curious as to why there is this difference between vim regex and Perl regex.
I'm curious as to why there is this difference between vim regex and Perl regex.
Because they're two different regex engines. If they worked in the exact same way, there wouldn't be a Vim regex engine and a Perl regex engine, they'd both be the Perl regex engine.
At some point™, Vim made a regex engine and decided on certain things. One of those, evidently, is that the lookaround operators apply to the preceding atom, and when that atom is a (...) group it still counts as a capturing group. If you wanna talk further divergence from Perl, @<= allows non-fixed-width patterns in Vim, but not in Perl (and several other engines). It's just how it was designed. The "why" is something only the people who made it can answer definitively, so I won't answer that.
If you absolutely wanna exclude the group from the group counting, you can prefix a %, as per :h /\%(\) to make it a non-capturing group (i.e. s/\v%(foo)@<!(bar)/foo\1/g). Note that non-capturing groups still act like normal groups, but you cannot refer to them when substituting.
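For comparison, here is a minimal sketch of the Perl-style behaviour the question refers to, using Python's re module as a stand-in (an assumption; Python is not part of the question). There the lookbehind is written (?<!foo), it is not a group at all, and so (bar) is group 1:

```python
import re

text = "foobar\nbar"

# (?<!foo) is a zero-width assertion, not a group, so (bar) is group 1.
pattern = re.compile(r'(?<!foo)(bar)')

result = pattern.sub(r'foo\1', text)
print(result)
```

This is the counterpart of the Vim substitution with the non-capturing %(foo) group: only the bar on the second line is rewritten.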
While I'm already writing an answer though, let me introduce you to \zs and \ze, by far one of the best additions to the Vim regex engine (in my biased opinion):
\zs defines where the actual match starts. It won't affect groups, but it has several useful side effects. In your case specifically, it lets you completely drop the positive lookbehind. It won't let you drop the negative lookbehind (because regex), but it'll let you simplify your regex a little. Equivalently, \ze determines where the match ends.
Your second example can be simplified to:
s/\vfoo\zs(bar)/\1
\zs tells the engine to start the match just before (bar). If it helps, you can think of every regex as being prefixed with \zs and postfixed with \ze - explicitly defining it just changes those bounds. This doesn't affect number grouping and \<n>-saving.
What this means is that only the space selected by bar is considered a match, and that bit is replaced - the other bits are left intact.
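Python's re module has no \zs, so as a rough analogue (an assumption for illustration, not Vim behaviour) the same effect can be sketched either with a positive lookbehind or by capturing the prefix and putting it back. The sample text and the replacement baz are hypothetical:

```python
import re

text = "foobar bar"

# Option 1: the lookbehind keeps foo out of the match, like foo\zs in Vim.
with_lookbehind = re.sub(r'(?<=foo)bar', 'baz', text)

# Option 2: match foo as well, then re-insert it through group 1.
with_capture = re.sub(r'(foo)bar', r'\1baz', text)

print(with_lookbehind, with_capture)
```

Both leave the standalone bar alone and only touch the bar that follows foo.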
Your first regex with a negative lookbehind doesn't simplify as well (because regex overall feels intended for forward operations, so anything operating backwards tends to be messy), but for longer regexes, it can still shorten the regex dramatically. Here's what that substitution looks like:
s/\v(foo)@<!\zebar/foo
Expanded:
s/\v        Very magic
(foo)@<!    Not prefixed with foo. Can be made non-capturing, but it has no actual relevance for this regex specifically
\ze         End the match
bar         bar
/foo        Substitute the "area" selected by "not prefixed with foo" with foo
('scuse the terrible diagram, I've never made one of these before and I don't remember how they're generally made)
This one uses \ze because your goal is, indirectly, to replace the space matched by the negative lookbehind with its own content (foo). Unfortunately, Vim only stores actual matched values, meaning \1 can't be used to insert foo, because it's not there yet. This is probably something all engines do, because you can't guess the content of (?<=ab.d), for instance.
That being said, if you just want to avoid confusion with group numbering, non-capturing groups is the way to go for now. \zs and \ze, while fantastic, are mildly confusing at first and might not be the best idea to throw on top of learning everything else in Vim for the time being.
And finally, an unexpected plugin recommendation: haya14busa/incsearch.vim (no affiliation, just a user), which previews your substitutions and searches so you can tell what's going to happen before you go ahead with a substitution or a search. It might not help with your confusion around group numbering, but you'll at least be able to see when you're using the wrong group number before you substitute.

VSCode Regex Find/Replace In Files: can't get a numbered capturing group followed by numbers to work out

I have a need to replace this:
fixed variable 123
with this:
fixed variable 234
In VSCode this matches fine:
fixed(.*)123
I can't find any way to make it put the capture in the output if a number follows:
fixed$1234
fixed${1}234
But the find replace window just looks like this:
I read that VSCode uses Rust-flavoured regexes. The linked page indicates ${1}234 should work, but VSCode just puts it in the output literally.
Tried named capture in a style according to here
fixed(?P<n>.*)123 //"invalid regular expression" error
VSCode doesn't seem to understand ${1}:
ps; I appreciate I could hack it in the contrived example with
FIND: fixed (.*) 123
REPL: fixed $1 234
And this does work in vscode:
but not all my data consistently has the same character before the number
After a lot of investigation by myself and @Wiktor, we discovered a workaround for this apparent bug in VS Code's search (aka find across files) and replace functionality in the specific case where the replacement has a single capture group followed by digits, like
$1234 where the intent is to replace with capture group 1 $1 followed by 234 or any digits. But $1234 is the actual undesired replaced output.
[This works fine in the find/replace widget for the current file but not in the find/search across files.]
There are (at least) two workarounds. Using two consecutive groups, like $1$2234, works properly, as does $1$`234 (or the same with $` placed before $1).
So you could create a sham capture group as in (.*?)()(\d{3}) where capture group 2 has nothing in it just to get 2 consecutive capture groups in the replace or
use your initial search regex (.*?)(\d{3}) and then use $` just before or after your "real" capture group $1.
OP has filed an issue https://github.com/microsoft/vscode/issues/102221
Oddly, I just discovered that replacing with a single digit like $11 works fine but as soon as you add two or more it fails, so $112 fails.
I'd like to share some more insights and my reasoning when I searched for a workaround.
Main workaround idea is using two consecutive backreferences in the replacement.
I tried all backreference syntax described at Replacement Strings Reference: Matched Text and Backreferences. It appeared that none of \g<1>, \g{1}, ${1}, $<1>, $+{1}, etc. work. However, there are some other backreferences, like $' (inserts the portion of the string that follows the matched substring) or $` (inserts the portion of the string that precedes the matched substring). However, these two backreferences do not work in VS Code file search and replace feature, they do not insert any text when used in the replacement pattern.
So, we may use $` or $' as empty placeholders in the replacement pattern.
Find What:      fix(.*?)123
Replace With:
fix$'$1234
fix$`$1234
Or, as in my preliminary test, already provided in Mark's answer, a "technical" capturing group matching an empty string, (), can be introduced into the pattern so that a backreference to that group can be used as a "guard" before the subsequent "meaningful" backreference:
Find What: fixed()(.*)123 (see () in the pattern that can be referred to using $1)
Replace With: fixed$1$2234
Here, $1 is a "guard" placeholder allowing correct parsing of $2 backreference.
Side note about named capturing groups
Named capturing groups are supported, but you should use the .NET/PCRE/Java named capturing group syntax, (?<name>...). Unfortunately, none of the known named backreference syntaxes work in the replacement pattern. I tried the $+{name} Boost/Perl syntax, $<name>, and ${name}; none work.
Conclusion
So, there are several issues here that need to be addressed:
We need an unambiguous numbered backreference syntax (\g<1>, ${1}, or $<1>)
We need to make sure $' or $` work as expected or are parsed as literal text (same as $_ (used to include the entire input string in the replacement string) or $+ (used to insert the text matched by the highest-numbered capturing group that actually participated in the match) backreferences that are not recognized by Visual Studio Code file search and replace feature), current behavior when they do not insert any text is rather undefined
We need to introduce named backreference syntax (like \g<name> or ${name}).
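To illustrate what an unambiguous backreference syntax buys you, here is a minimal sketch in Python's re module, which does provide \g<1> and \g<name> in replacement strings. This is only a comparison with another engine; it makes no claim about what VS Code accepts. The group name middle is a hypothetical choice.

```python
import re

text = "fixed variable 123"

# \g<1> ends the group reference explicitly, so the digits 234 stay literal.
numbered = re.sub(r'fixed(.*)123', r'fixed\g<1>234', text)

# The same replacement with a named group (Python spells it (?P<name>...)).
named = re.sub(r'fixed(?P<middle>.*)123', r'fixed\g<middle>234', text)

print(numbered)
print(named)
```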

Complicated regex to match anything NOT within quotes

I have this regex which scans a text for the word very: (?i)(?:^|\W)(very)[\W$] which works. My goal is to upgrade it and avoid doing a match if very is within quotes, standalone or as part of a longer block.
Now, I have this other regex which is matching anything NOT inside curly quotes: (?<![\S"])([^"]+)(?![\S"]) which also works.
My problem is that I cannot seem to combine them. For example the string:
Fred Smith very loudly said yesterday at a press conference that fresh peas will "very, very defintely not" be served at the upcoming county fair.
In this bit we have 3 instances of very, but I'm only interested in matching the first one and ignoring the whole Smith quotation.
What you describe is kind of tricky to handle with a regular expression. It's difficult to determine whether you are inside a quote. Your second regex is not effective as it only ignores the first very that is directly to the right of the quote and still matches the second one.
Drawing inspiration from this answer, that in turn references another answer that describes how to regex match a pattern unless ... I can capture the matches you want.
The basic idea is to use alternation | and match all the things you don't want and then finally match (and capture) what you do want in the final clause. Something like this:
"[^"]*"|(very)
We match quoted strings in the first clause but we don't capture them in a group and then we match (and capture) the word very in the second clause. You can find this match in the captured group. How you reference a captured group depends on your regex environment.
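As a sketch of how the captured group might be picked out, here is the same idea in Python (an assumption; the question does not name its regex environment). The \b word boundaries and the IGNORECASE flag are additions made here to mirror the (?i) and word-delimiter intent of the original regex:

```python
import re

text = ('Fred Smith very loudly said yesterday at a press conference that '
        'fresh peas will "very, very defintely not" be served at the '
        'upcoming county fair.')

# First alternative swallows quoted strings; only the second one captures.
pattern = re.compile(r'"[^"]*"|\b(very)\b', re.IGNORECASE)

# Keep only the matches where group 1 took part, i.e. very outside quotes.
hits = [m.start(1) for m in pattern.finditer(text) if m.group(1) is not None]

print(hits)
```

Matches where group 1 is None are the quoted strings, and they are simply discarded.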
See this regex101 fiddle for a test case.
This regex
(?i)(?<!(((?<DELIMITER>[ \t\r\n\v\f]+)(")(?<FILLER>((?!").)*))))\bvery\b(?!(((?<FILLER2>((?!").)*)(")(?<DELIMITER2>[ \t\r\n\v\f]+))))
could work under two conditions:
your regex engine allows unlimited lookbehind
quotes are delimited by spaces
Try it on http://regexstorm.net/tester

use case for ?: in tcl regexp

I read the documentation of ?: in tcl regexp. Which says that it matches an expression without capturing it.
I tried and it worked fine.
My query is: what is the proper use case for this option? If we do not want to capture a sequence, we simply won't put brackets there.
Is it just an alternate way, or have some special condition, where we should use this? Kindly clarify.
Easy: You need to group several elements in your Regex, but you don't need them as a capturing group for reference.
a+ (b+|c+) OR (a+ b+)|c+
I need parentheses for grouping. But if I run it like this, the engine will capture all those matches, which costs some memory and performance. If I don't need the capturing groups later for reference, I can use ?: to get grouping without that overhead:
a+ (?:b+|c+) OR (?:a+ b+)|c+
First, have a look at the Tcl regex reference:
(expression)
Parentheses surrounding an expression specify a nested expression. The substring matching expression is captured and can be referred to via the back reference mechanism, and also captured into any corresponding match variable specified as an argument to the command.
(?:expression)
matches expression without capturing it.
While the first part describing capturing group ability to capture subtext to be referred to with backreferences is universal, the second part dwelling on initializing variables based on the capturing group is specific to Tcl.
Bearing that in mind, Tcl regex usage can be greatly simplified with non-capturing groups in case you have a pattern with a number of capturing groups, and you want to modify it by adding another group in-between existing groups.
Say, you want to match strings like abc 1234 (comment) and use {(\w+)\s+(\d+)\s+\(([^()]+)\)}:
regexp {(\w+)\s+(\d+)\s+\(([^()]+)\)} $a - body num comment
However, you were asked to also match strings with any number of word+space+digits in-between 1234 and comment. If you write
set a1 "abc 1234 more 5678 text 890 here 678 (comment)"
regexp {(\w+)\s+(\d+)(\s+\w+\s+\d+)*\s+\(([^()]+)\)} $a1 - body1 num1 comment1
                     ^^^^^^^^^^^^^^^
$comment1 will hold a value you would not expect: the text captured by the newly added group, not the comment.
Turning it into a non-capturing group fixes the issue.
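The group shift can be sketched outside Tcl as well. The following uses Python's re module (an assumption made only for illustration; Python returns groups as a tuple instead of filling match variables), with the same string and patterns as above:

```python
import re

a1 = "abc 1234 more 5678 text 890 here 678 (comment)"

# Extra capturing group: it becomes group 3 and pushes the comment to group 4.
capturing = re.search(r'(\w+)\s+(\d+)(\s+\w+\s+\d+)*\s+\(([^()]+)\)', a1)

# Non-capturing version: the comment stays in group 3.
non_capturing = re.search(r'(\w+)\s+(\d+)(?:\s+\w+\s+\d+)*\s+\(([^()]+)\)', a1)

print(capturing.groups())
print(non_capturing.groups())
```

In the capturing version the third slot holds the last repetition of the inserted group, which is what ends up in the third match variable in the Tcl call.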
See IDEONE demo
For other common uses of a non-capturing group, please refer to Are optional non-capturing groups redundant post.
You can use (?:) parentheses in a regex when matching multiple word options which you then do not want to capture.
(?:one|two|three)
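One place where the difference is directly visible is an API whose return value depends on the groups in the pattern. A minimal sketch, assuming Python's re.findall and a made-up sample string:

```python
import re

text = "one fish two fish"

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r'(one|two|three) fish', text)

# With a non-capturing group, findall returns the whole match.
whole = re.findall(r'(?:one|two|three) fish', text)

print(captured)
print(whole)
```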

Sublime Text 2 - Regex Search - Non-Capture Group Syntax

I'm trying to use ST2's regex capability in search & replace, but can't figure out how to properly make a non-capturing group. For this example, I want to find instances of "DEAN" which are not followed by "UMBER", i.e. to distinguish "DEANCARE" from "DEANUMBER"
From what I've read and used in the past, the syntax with a non-capture should be:
DEAN(?:UMBER)
Which should match "DEANCARE" but not "DEANUMBER". Yet instead, Sublime Text only finds "DEANUMBER" as if I had typed:
DEAN(UMBER)
Using square brackets on the first (or each) of the unwanted letters does work:
DEAN[^U]
But I'd still prefer to use a non-matching group, both for other purposes and to avoid having to explicitly not-match each individual character. Do I have a syntax mistake, or maybe a conceptual error in how ST2's regex works?
A non capturing group is the same as a group except it does not capture the matching portion of the regex in a back-reference.
If you were to use the regex DEAN(?:UMBER) on the string DEANUMBER then you would have a match, but referencing \1 in, e.g. a search and replace would give you nothing, because the group is non-capturing.
Using DEAN(UMBER), on the other hand, you could do a search and replace with made of L\1, which would produce made of LUMBER, because the match of the first (capturing) group is being back-referenced by \1. This of course is a very pointless example. If you want to learn more about groups and back-referencing, I'd suggest you read this or some other documentation/tutorial on the matter.
As suggested in the comments, what you want is a negative lookahead.
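A minimal sketch of the difference, using Python's re module as a stand-in (an assumption; Sublime Text uses its own regex engine, though the (?!...) lookahead syntax is the same). The trailing \w* is added here only so the whole word is returned:

```python
import re

text = "DEANCARE DEANUMBER"

# Non-capturing group: UMBER must be present, it is just not stored.
non_capturing = re.findall(r'DEAN(?:UMBER)', text)

# Negative lookahead: DEAN must NOT be followed by UMBER.
lookahead = re.findall(r'DEAN(?!UMBER)\w*', text)

print(non_capturing)
print(lookahead)
```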