Is it possible to assess the contents of a capture group again? - regex

More out of curiosity than anything else (given the time I've spent on this), I'm trying to see if I can use the replace function in Sublime Text 3 with regex to convert
style="bla: bla; bla:bla;"
into
bla:bla;
bla:bla;
I'm able to create a capture for just bla: bla; bla:bla;, without style or the quotation marks:
(?<=\sstyle=")(.*)(?=") https://regex101.com/r/tctJti/1
After that though, I'm stuck. I've also tried capturing every bla:bla separately, but that doesn't seem to help either, since the capture group then only holds the last thing it captured:
\s*style="((.*?;)*)" https://regex101.com/r/tctJti/2
What I would need is a way to tell Sublime to ignore everything outside the capture group from my first example, then look for semicolons inside that single capture group and turn them into newlines. Is something like that even possible, or is that by definition a two-step conversion?

You may use this PCRE regex to match:
(?:\bstyle="|(?!^)\G)([^;"]+;?)\s*(?:"$)?
And replace it by:
$1\n
RegEx Demo
\G asserts position at the end of the previous match or the start of the string for the first match. By placing (?!^) we ensure that \G is not matched at start of the line.
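Outside an editor, where the replacement can be a function, the two-step idea from the question (isolate the attribute value, then break it up at semicolons) fits into a single substitution call. The following is a minimal Python sketch, assuming the standard re module. The helper name style_to_lines is made up for illustration, and the sketch does not use \G, which Python's re module does not support:

```python
import re

def style_to_lines(text: str) -> str:
    """Replace each style="..." attribute with its declarations, one per line."""
    def expand(match):
        # Step 2: split the captured attribute value on ';' and drop empty parts
        declarations = [d.strip() for d in match.group(1).split(";") if d.strip()]
        return "\n".join(d + ";" for d in declarations)

    # Step 1: find the attribute and capture only what is between the quotes
    return re.sub(r'\bstyle="([^"]*)"', expand, text)

result = style_to_lines('style="bla: bla; bla:bla;"')
print(result)
```

The callback plays the role of the "look inside the capture group again" step that a plain editor replacement string cannot express.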

Related

VSCode Regex Find/Replace In Files: can't get a numbered capturing group followed by numbers to work out

I have a need to replace this:
fixed variable 123
with this:
fixed variable 234
In VSCode this matches fine:
fixed(.*)123
I can't find any way to make it put the capture in the output if a number follows:
fixed$1234
fixed${1}234
But the find replace window just looks like this:
I read that VSCode uses Rust-flavoured regexes. The linked documentation indicates ${1}234 should work, but VSCode just puts it in the output as literal text.
I also tried a named capture group, using the syntax described here:
fixed(?P<n>.*)123 //"invalid regular expression" error
VSCode doesn't seem to understand ${1}:
P.S. I appreciate I could hack around it in this contrived example with
FIND: fixed (.*) 123
REPL: fixed $1 234
And this does work in vscode:
but not all my data consistently has the same character before the number
After a lot of investigation by myself and @Wiktor, we discovered a workaround for this apparent bug in VS Code's search (find across files) and replace functionality. It occurs in the specific case where the replacement contains a single capture group followed by digits, like
$1234, where the intent is to insert capture group 1 ($1) followed by 234 (or any other digits). Instead, the literal text $1234 ends up in the output.
[This works fine in the find/replace widget for the current file but not in the find/search across files.]
There are (at least) two workarounds. Using two consecutive groups, like $1$2234, works properly, as does $1$`234 (or placing the $` before $1 instead).
So you could create a sham capture group, as in (.*?)()(\d{3}), where capture group 2 is empty, just to get two consecutive capture groups in the replacement, or
use your initial search regex (.*?)(\d{3}) and then put $` just before or after your "real" capture group $1.
OP has filed an issue https://github.com/microsoft/vscode/issues/102221
Oddly, I just discovered that a single digit after the group, like $11, works fine, but as soon as you add two or more digits it fails, so $112 fails.
I'd like to share some more insights and my reasoning when I searched for a workaround.
Main workaround idea is using two consecutive backreferences in the replacement.
I tried every backreference syntax described at Replacement Strings Reference: Matched Text and Backreferences. None of \g<1>, \g{1}, ${1}, $<1>, $+{1}, etc. appeared to work. There are some other backreferences, though, like $' (inserts the portion of the string that follows the matched substring) or $` (inserts the portion of the string that precedes the matched substring). These two do not work in the VS Code file search and replace feature either: they insert no text at all when used in the replacement pattern.
So, we may use $` or $' as empty placeholders in the replacement pattern.
Find What:      fix(.*?)123
Replace With:
fix$'$1234
fix$`$1234
Or, as in my preliminary test, already provided in Mark's answer, a "technical" capturing group matching an empty string, (), can be introduced into the pattern so that a backreference to that group can be used as a "guard" before the subsequent "meaningful" backreference:
Find What: fixed()(.*)123 (see () in the pattern that can be referred to using $1)
Replace With: fixed$1$2234
Here, $1 is a "guard" placeholder allowing correct parsing of $2 backreference.
Side note about named capturing groups
Named capturing groups are supported, but you should use the .NET/PCRE/Java named capturing group syntax, (?<name>...). Unfortunately, none of the known named backreference syntaxes work in the replacement pattern. I tried the Boost/Perl syntax $+{name}, as well as $<name> and ${name}; none of them work.
Conclusion
So, there are several issues here that need to be addressed:
We need an unambiguous numbered backreference syntax (\g<1>, ${1}, or $<1>)
We need to make sure $' and $` either work as expected or are parsed as literal text, as happens with $_ (used to include the entire input string in the replacement string) and $+ (used to insert the text matched by the highest-numbered capturing group that actually participated in the match), which the Visual Studio Code file search and replace feature does not recognize. The current behavior, where they insert no text at all, is effectively undefined
We need to introduce named backreference syntax (like \g<name> or ${name}).
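For comparison, here is what unambiguous numbered and named backreferences look like in an engine that has them. This is a small Python sketch using Python's re module, not VS Code's engine, so it says nothing about how VS Code behaves. The group name rest is made up for illustration:

```python
import re

source = "fixed variable 123"

# \g<1> closes the group reference explicitly, so "234" is literal text
numbered = re.sub(r"fixed(.*)123", r"fixed\g<1>234", source)

# Same idea with a named group; Python spells the group (?P<name>...)
named = re.sub(r"fixed(?P<rest>.*)123", r"fixed\g<rest>234", source)

print(numbered)
print(named)
```

Both calls give "fixed variable 234", which is the output the question was after.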

Complicated regex to match anything NOT within quotes

I have this regex which scans a text for the word very: (?i)(?:^|\W)(very)[\W$] which works. My goal is to upgrade it and avoid doing a match if very is within quotes, standalone or as part of a longer block.
Now, I have this other regex which is matching anything NOT inside curly quotes: (?<![\S"])([^"]+)(?![\S"]) which also works.
My problem is that I cannot seem to combine them. For example the string:
Fred Smith very loudly said yesterday at a press conference that fresh peas will "very, very defintely not" be served at the upcoming county fair.
In this bit we have 3 instances of very, but I'm only interested in matching the first one and ignoring the whole Smith quotation.
What you describe is kind of tricky to handle with a regular expression. It's difficult to determine whether you are inside a quote. Your second regex is not effective as it only ignores the first very that is directly to the right of the quote and still matches the second one.
Drawing inspiration from this answer, that in turn references another answer that describes how to regex match a pattern unless ... I can capture the matches you want.
The basic idea is to use alternation | and match all the things you don't want and then finally match (and capture) what you do want in the final clause. Something like this:
"[^"]*"|(very)
We match quoted strings in the first clause but we don't capture them in a group and then we match (and capture) the word very in the second clause. You can find this match in the captured group. How you reference a captured group depends on your regex environment.
See this regex101 fiddle for a test case.
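As one way to read the match out of the captured group, here is a minimal Python sketch of the alternation idea. The \b word boundaries are an addition not present in the pattern above (they keep words such as "every" from matching), and the function name unquoted_positions is made up for illustration:

```python
import re

# The first branch swallows quoted spans without capturing them;
# only the second branch fills group 1.
PATTERN = re.compile(r'"[^"]*"|\b(very)\b', re.IGNORECASE)

def unquoted_positions(text):
    """Return the start offsets of 'very' found outside double quotes."""
    return [m.start(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

sample = ('Fred Smith very loudly said that fresh peas will '
          '"very, very definitely not" be served.')
positions = unquoted_positions(sample)
print(positions)  # [11]
```

Matches where group 1 is empty are the quoted spans, so they are simply skipped.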
This regex
(?i)(?<!(((?<DELIMITER>[ \t\r\n\v\f]+)(")(?<FILLER>((?!").)*))))\bvery\b(?!(((?<FILLER2>((?!").)*)(")(?<DELIMITER2>[ \t\r\n\v\f]+))))
could work under two conditions:
your regex engine allows unlimited lookbehind
quotes are delimited by spaces
Try it on http://regexstorm.net/tester

Notepad++ Regex - Issue with ^ anchor and repeating patterns

When one tries to remove some characters from the start of a line and the anchored pattern can be found again after the first replace, it will be removed again.
For a very simple example, given the input 012345, the search pattern ^. and an empty replacement, Notepad++ will remove the whole line when using Replace All. This is most likely because the cursor is still at the start of the line after the first replacement and thus matches the ^ anchor again.
How can one ensure that only the actual first character is removed (in my case the expected output would be 12345)?
You can see my workaround in my answer, but maybe there is another nice trick to achieve it.
One can match the rest of the line, capture the match into a group and then use this group as replacement. The pattern in the question could be adjusted to ^.(.*) and be replaced by $1.
This will force the cursor to move forward in the string, so the ^ anchor can't match again.
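The same pattern and replacement can be tried outside Notepad++. Below is a small Python sketch. Note that Python's re.sub does not show the re-matching behaviour described in the question, so this only illustrates what the pattern captures and what the replacement keeps:

```python
import re

text = "012345\nabcdef"

# ^. drops the first character; (.*) captures the rest of the line,
# so each match consumes a whole line and group 1 is written back.
result = re.sub(r"^.(.*)", r"\1", text, flags=re.MULTILINE)
print(result)
```

Each line loses exactly one leading character, giving 12345 and bcdef.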
Another workaround could be finding:
^.(.)?
and replacing it with:
\1
I'm sure this has been the subject of a bug report, but I couldn't find one as of now. In N++:
Anchors are buggy.
With Replace All, replacements are not supposed to be subject to re-matching. But they are when the replacement string is invisible or zero-length.
Take care with them.

Is it possible to say in Regex "if the next word does not match this expression"?

I'm trying to detect occurrences of words italicized with *asterisks* around them. However, I want to ensure they are not within a link. So it should find "text" in here is some *text* but not within http://google.com/hereissome*text*intheurl.
My first instinct was to use look aheads, but it doesn't seem to work if I use a URL regex such as John Gruber's:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
And put it in a look ahead at the beginning of the pattern, followed by the rest of the pattern.
(?=URLPATTERN)\*[a-zA-Z\s]\*
So how would I do this?
You can use this alternation technique: on the left-hand side, first match everything you want to discard; then, on the right-hand side, use a capture group to match the desired text.
https?:\/\/\S*|(\*\S+\*)
You can then use captured group #1 for your emphasized text.
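A minimal Python sketch of reading capture group 1, assuming the standard re module; the function name emphasized is made up for illustration:

```python
import re

# URLs are consumed by the left branch and never reach group 1
PATTERN = re.compile(r"https?://\S*|(\*\S+\*)")

def emphasized(text):
    """Return the *emphasized* words that are not part of a URL."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]

sample = "here is some *text* but not http://google.com/hereissome*text*intheurl"
found = emphasized(sample)
print(found)  # ['*text*']
```

Because the URL branch comes first and consumes the whole URL, the asterisks inside it are never offered to the second branch.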
RegEx Demo
The following regexp:
^(?!http://google.com/hereissome.*text.*intheurl).*
Matches every line except those starting with http://google.com/hereissome*text*intheurl. This is called a negative lookahead. Some regexp libraries may not support it; Python's does.
Here is a link to Mastering Lookahead and Lookbehind.

Sublime Text 2 - Regex Search - Non-Capture Group Syntax

I'm trying to use ST2's regex capability in search & replace, but can't figure out how to properly make a non-capturing group. For this example, I want to find instances of "DEAN" which are not followed by "UMBER", i.e. to distinguish "DEANCARE" from "DEANUMBER"
From what I've read and used in the past, the syntax with a non-capture should be:
DEAN(?:UMBER)
Which should match "DEANCARE" but not "DEANUMBER". Yet instead, Sublime Text only finds "DEANUMBER" as if I had typed:
DEAN(UMBER)
Using square brackets on the first (or each) of the unwanted letters does work:
DEAN[^U]
But I'd still prefer a group-based non-match, both for other purposes and to avoid having to explicitly exclude each individual character. Do I have a syntax mistake, or maybe a conceptual error in how ST2's regex works?
A non capturing group is the same as a group except it does not capture the matching portion of the regex in a back-reference.
If you were to use the regex DEAN(?:UMBER) on the string DEANUMBER then you would have a match, but referencing \1 in, e.g. a search and replace would give you nothing, because the group is non-capturing.
Using DEAN(UMBER), on the other hand, you could do a search and replace with made of L\1, which would produce made of LUMBER, because the match of the first (capturing) group is being back-referenced by \1. This is of course a rather pointless example; if you want to learn more about groups and back-referencing, I'd suggest you read this or some other documentation/tutorial on the matter.
As suggested in the comments, what you want is a negative lookahead.
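To make the difference concrete, here is a small Python sketch contrasting the two constructs. The trailing \w* in the lookahead pattern is an addition so that the whole word is returned, not just DEAN:

```python
import re

words = "DEANCARE DEANUMBER"

# (?:UMBER) still requires UMBER to be present; it only skips capturing
non_capturing = re.findall(r"DEAN(?:UMBER)", words)

# (?!UMBER) succeeds only where UMBER does NOT follow DEAN
negative_lookahead = re.findall(r"DEAN(?!UMBER)\w*", words)

print(non_capturing)       # ['DEANUMBER']
print(negative_lookahead)  # ['DEANCARE']
```

So the non-capturing group finds the word the question wanted to exclude, while the negative lookahead finds the one it wanted to keep.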