RegExp: If-Clause for capturing group possible?

tl;dr:
I am searching for a way to match the closing character sequence based upon the style of the opening sequence syntax in PHP with PCRE-style regular expressions.
The task
I am writing a module to capture all translatable strings from written PHP code. One responsibility of this module will be to also capture any translation context stated within the code. This context is provided as part of an options array.
In PHP (as far as I recall, starting with version 5.4), there are two styles for defining an array:
a) array(...)
b) [...]
I now want to write a regular expression that is able to recognize both styles. The pattern should be able to correctly match the ending character sequence depending on the style chosen to start the array.
Unfortunately, I was not able to find any documentation on how to apply the IF-statement to a given capturing group.
In theory it should look something like this:
/ ... (array\(|\[) ... (?(?=\1==\[)\]|\)) ... /
(Note: "..." in the line above should indicate that the regex pattern is longer than stated here. This should only serve as an example for what I am trying to achieve)
The (?(?=\1==\[)\]|\)) part, translated into plain language: if the content of the first capturing group is an opening square bracket, then the pattern should match a closing square bracket; otherwise a closing round bracket is required.
Is it possible to achieve something like this? Any help is greatly appreciated!
Thanks in advance
Chris

The regex answer is
(?:array(\()|\[).*?(?(1)\)|])
See the regex demo
Details
(?:array(\()|\[) - a non-capturing group matching either array( while capturing ( into Group 1, or [ char
.*? - any 0 or more chars other than line break chars, as few as possible
(?(1)\)|]) - a conditional construct: if Group 1 is matched (the ( char is in the group memory buffer) the ) must match at the current position, else ].
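As a rough illustration, the conditional can also be tried outside PHP. The sketch below uses Python's re module, which accepts the same (?(1)yes|no) syntax; the helper name match_array and the sample strings are made up for this example and are not part of the original answer.

```python
import re

# Group 1 is set only when the "array(" branch matched, so the
# conditional requires ")" after array( and "]" after a bare "[".
ARRAY_RE = re.compile(r"(?:array(\()|\[).*?(?(1)\)|])")

def match_array(text):
    """Return the matched array literal, or None if the delimiters do not pair up."""
    m = ARRAY_RE.search(text)
    return m.group(0) if m else None

print(match_array("array('context' => 'menu')"))  # array(...) style
print(match_array("['context' => 'menu']"))       # [...] style
print(match_array("array('context' => 'menu']"))  # mismatched closing bracket
```

Like the original pattern, this uses a lazy .*? and therefore stops at the first closing delimiter, so nested arrays would need a different approach.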

If you want to capture the values using the same capturing group, you could also use a branch reset group (?| so that group 1 holds the value whichever branch matched.
To get the values between the opening and closing parentheses or square brackets, you could use a negated character class [^...] to match any char except those listed in the character class.
(?|array(\([^()]*\))|(\[[^][]*]))
Explanation
(?| Branch reset group
array match literally
( Capture group 1
\([^()]*\) Match (...)
) Close group 1
| Or
( Capture group 1 again (the branch reset restarts the numbering in each alternative)
\[[^][]*] Match [...]
) Close group 1
) close branch reset group
Regex demo

Related

Google Data Studio Regexp Replace formula - delete all characters after ? and #

I have a dashboard in Google Data Studio
I'm trying to create a custom field that removes all the characters that come after a # or ? sign (and the sign itself, of course). But this formula does not work, and I don't know why
I was trying this one
REGEXP_REPLACE(Landing Page,'(#|\?)(.*)','')
Could you please help?
The pattern you tried (#|\?)(.*) captures either # or ? using a capturing group with an alternation |, followed by capturing any char 0+ times in another capturing group.
But in the replacement there is an empty string specified, removing all that is matched.
You could make use of a character class ([#?]) in a capturing group to capture one of the listed.
To only do the replacement where there is something after the match, you could match 1+ times any character except a newline using .+
To remove what comes after the matched character, you could refer to the capturing group using \\1 so that you keep the # or ? and remove what is matched afterwards.
The pattern could look like:
([#?]).+
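A minimal sketch of this replacement, written with Python's re module rather than Data Studio's REGEXP_REPLACE (the two engines differ, so treat this only as an approximation of the behaviour). The function name strip_after_marker and the sample paths are assumptions made for the example.

```python
import re

def strip_after_marker(url):
    # Group 1 keeps the first "#" or "?"; ".+" consumes the 1+ characters after it.
    return re.sub(r"([#?]).+", r"\1", url)

print(strip_after_marker("/pricing?utm_source=mail"))
print(strip_after_marker("/docs#install"))
print(strip_after_marker("/home"))  # no marker, so the string is left unchanged
```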

Find and replace with variable text

Trying to batch a bunch of conversions with a regex find and replace and I'm not actually sure it's possible, let alone how to achieve this.
Using Sublime text as an editor, open to other tools to accomplish this if possible.
Two sample lines :
Session::flash( 'error', 'Only users with permission may view the directory user.' );
Session::flash( 'error', 'System user ID does not exist.' );
Desired outcome:
flash('Only users with permission may view the directory user.')->error();
flash('System user ID does not exist.')->error();
Current Regex that matches:
Session::flash(\s*'error',.* )
Is it possible, that the text lines can be saved and reused in the replace lines? Hoping for a solution along the lines of $variable so that I may replace the strings with something like
Wishful line:
flash('$variable')->error();
Thanks folks!
You could use 2 capturing groups and refer to them in the replacement.
\bSession::flash\(\s*'([^']+)',\s*('[^']+')\s*\);
In the replacement use:
flash($2)->$1();
Explanation
\bSession::flash\(\s* Match a word boundary to prevent Session being part of a longer word, then match Session::flash( followed by 0+ whitespace chars
'([^']+)' Match ', then capture in group 1 one or more chars other than ' using a negated character class, then match ' again
,\s* Match a comma followed by 0+ whitespace chars
('[^']+') Capture in group 2 a ', then one or more chars other than ', and a closing '
\s*\); Match 0+ whitespace chars followed by );
Regex demo
Result:
flash('Only users with permission may view the directory user.')->error();
flash('System user ID does not exist.')->error();
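For batch conversion outside an editor, the same two-group idea could be scripted. This is a sketch in Python, where backreferences in the replacement are written \1 and \2 instead of $1 and $2; the () after the method name is added so the output matches the desired outcome stated in the question. The names FLASH_RE and convert are hypothetical.

```python
import re

# Group 1: the flash level (e.g. error); group 2: the quoted message.
FLASH_RE = re.compile(r"\bSession::flash\(\s*'([^']+)',\s*('[^']+')\s*\);")

def convert(line):
    # Swap the two captures and append "()" to turn the level into a method call.
    return FLASH_RE.sub(r"flash(\2)->\1();", line)

print(convert("Session::flash( 'error', 'System user ID does not exist.' );"))
```

Because [^']+ stops at the first quote, messages that contain an escaped ' would not be converted by this sketch.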
What you're looking for here is a capture group and a backreference.
In a regular expression anything wrapped in ( and ) is captured for later use by whatever performed the regular expression match, which in this case is Sublime Text. The number of capture groups that are supported varies depending on the regular expression library in use, but you generally get at least 10.
In use, every occurrence of () creates a capture, with the first capture being numbered 1, the second one 2 and so on (generally the entire match is also available as capture 0). Using the sequence \1 or $1 means "use the contents of the first capture group".
As an example, consider the regular expression ^([a-z]).\1. Breaking it down:
^ - match starting at the start of a line
( - start a capture
[a-z] - match a single lower case letter
) - end a capture
. - match any character
\1 - match whatever the contents of the first capture was
Given this input:
abc
aba
bab
This regular expression matches aba and bab because the first character in both cases is captured as \1 and needs to match later. However abc doesn't match because in that case \1 is a, but the third character is c.
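The backreference example can be checked quickly with a few lines of Python (used here only as a convenient test bed; the behaviour of \1 inside the pattern is the same in Sublime Text):

```python
import re

# \1 must repeat whatever single letter group 1 captured.
pattern = re.compile(r"^([a-z]).\1")

matches = [s for s in ["abc", "aba", "bab"] if pattern.search(s)]
print(matches)  # ['aba', 'bab']
```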
The result of the capture can be used in the replacement text in the same way. If you modify your regular expression, you can capture the text you want to keep and use it in the replacement.
As a note, your regex as outlined in your question above doesn't match in Sublime because ( starts a capture group, and thus does not match the ( that's actually in the text. If you're using Sublime and you turn on the Highlight Matches option in the Find/Replace panel, you'll see that your regex is not considered a match.
Find:
Session::flash\(\s*'error'\s*,(.*)\);
Replace:
flash(\1)->error();
Result:
flash( 'Only users with permission may view the directory user.' )->error();
flash( 'System user ID does not exist.' )->error();
This is more or less the regex outlined in your question, except:
The ( and ) in your regex have been replaced with \( and \), which is to tell the regex that this should match a literal ( and not be considered to start a capture.
The .* is changed to (.*) which means "whatever text appears here, capture it for later use".
The replacement text refers to the text captured as \1 and puts it back in the replacement.

Is there a way to match Regex based on previous capture group, not captured previously?

Okay, so the task is that there is a string that can look like post, or post put, or even get put post. All of these must be matched. Preferably, deviations like [space]post or get[space] should not be matched.
Currently I came up with this
^(post|put|delete|get)(( )(post|put|delete|get))*$
However I'm not satisfied with it, because I had to specify (post|put|delete|get) twice. It also matches duplications like post post.
I'd like to somehow use a backreference(?) to the first group so that I don't have to specify the same condition twice.
However, backreference \1 would help me only match post post, for example, and that's the opposite of what I want. I'd like to match a word in the first capture group that was NOT previously found in the string.
Is this even possible? I've been looking through SO questions, but my Google-fu is eluding me.
If you are using a PCRE-based regex engine, you may use subroutine calls like (?n) to recurse the subpatterns.
^(post|put|delete|get)( (?!\1)(?1))*$
                              ^^^^
See the regex demo
Expression details:
^ - start of string
(post|put|delete|get) - Group 1 matching one of the alternatives as literal substrings
( (?!\1)(?1))* - zero or more sequences of:
- a space
(?!\1) - a negative lookahead that fails the match if the text after the current location is identical to the one captured into Group 1 due to backreference \1
(?1) - a subroutine call to the first capture group (i.e. it uses the same pattern used in Group 1)
$ - end of string
UPDATE
In order to avoid matching strings like get post post, you also need to add a negative lookahead into Group 1, so that the subroutine call is aware that we do not want to match a value that occurs again later in the string.
^((post|put|delete|get)(?!.*\2))( (?1))*$
See the regex demo
The difference is that we capture the alternations into Group 2 and add the negative lookahead (?!.*\2) to disallow any occurrences of the word we captured further in the string. The ( (?1))* remains intact: now, the subroutine recurses the whole Capture Group 1 subpattern with the lookahead.
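Engines without subroutine calls need a different route. Python's re module, for instance, has no (?1), so the sketch below swaps in another technique: the alternation is written once in a variable and interpolated into the pattern, and a single negative lookahead at the start rejects any string in which a whole word occurs twice. The names VERBS, UNIQUE_VERBS_RE and is_valid are made up for the example.

```python
import re

VERBS = "post|put|delete|get"  # written once, reused twice below

UNIQUE_VERBS_RE = re.compile(
    rf"^(?!.*\b(\w+)\b.*\b\1\b)"        # fail if any whole word appears twice
    rf"(?:{VERBS})(?: (?:{VERBS}))*$"    # one verb, then space-separated verbs
)

def is_valid(s):
    return UNIQUE_VERBS_RE.match(s) is not None

for sample in ["post", "get put post", "post post", "get post post", " post", "get "]:
    print(repr(sample), is_valid(sample))
```

This is not the subroutine approach from the answer, only an approximation of its result for engines that lack that feature.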

Capture filename parts: Why doesn't this regexp work?

I'm fairly new to regexp and I am missing something about capturing groups.
Let's suppose I have a filepath like that
test.orange.john.edn
I want to capture two groups:
test.orange.john (which is the body)
edn (which is the extension)
I used this (and variants of it, taking the $ outside, etc.)
^([a-z]*.)*.([a-z]*$)
But it captures ed only
What did I miss? I do not understand why n is not captured, nor the body...
I found answers on the web to capture the extension but I do not understand the problem there.
Thanks
The ^([a-z]*.)*.([a-z]*$) regex is very inefficient as there are lots of unnecessary backtracking steps here.
The start of string is matched, and then [a-z]*. is matched 0+ times. That means, the engine matches as many [a-z] as possible (i.e. it matches test up to the first dot), and then . matches the dot (but only because . matches any character!). So, this ([a-z]*.)* matches test.orange.john.edn only capturing edn since repeating capturing groups only keep the last captured value.
You already have edn in Group 1 at this step. Now, .([a-z]*$) should allocate a substring for the . (any character) pattern. Backtracking goes back and finds n - now, Group 1 only contains ed.
For your task, you should escape the last . to match a literal dot and perhaps, the best expression is
^(.*)\.(.*)$
See demo
It will match all the string up to the end with the first (.*), and then will backtrack to find the last . symbol (so, Group 1 will have all text from the beginning till the last .), and then capturing the rest of the string into Group 2.
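If a dot does not have to be present (i.e. if a file name has no extension), make the extension part optional. The first group then has to be lazy and the extension must not contain dots; otherwise the greedy (.*) would consume the whole string and the optional group would never take part in the match:
^(.*?)(?:\.([^.]*))?$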
See another demo
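To see the two groups in practice, here is a small sketch in Python. split_name uses the ^(.*)\.(.*)$ pattern from above. split_name_optional is a variant for names without an extension that relies on a lazy first group and a dot-free extension; both function names are assumptions of this example.

```python
import re

def split_name(filename):
    # Greedy (.*) runs to the end, then backtracks to the LAST dot.
    m = re.match(r"^(.*)\.(.*)$", filename)
    return m.groups() if m else None

def split_name_optional(filename):
    # Lazy body plus an optional, dot-free extension; group 2 is None if absent.
    m = re.match(r"^(.*?)(?:\.([^.]*))?$", filename)
    return m.groups() if m else None

print(split_name("test.orange.john.edn"))
print(split_name_optional("test.orange.john.edn"))
print(split_name_optional("README"))
```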
You can try with:
^([a-z.]+)\.([a-z]+)$
online example

regular expression match word or absence of word

I am trying to match the word group or match the absence of the word group
http://rubular.com/r/TKJPFvnzZ0
I can match a space but I would like it to actually match nothing. I am struggling with finding the correct syntax.
Match group 3 should contain either group or empty string.
Thanks!
Not sure if I understood you correctly, but would this solve your problem:
^I post a "(.*?)" to the "(.*?)"(?: (group))? which the entire world can see$
?
Basically says that group is optional.
The ?: inside the parentheses marks that group as a "non-capturing group", which means that we're only enclosing that part of the expression in parentheses to group it, but we don't want to capture the content for later use. group is simply enclosed in parentheses because we want to capture that match as a group.
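A short sketch of how the optional group behaves, written in Python rather than the Ruby engine behind the Rubular link. The sample sentences are invented, since the linked test data is not shown here. In Python an optional group that did not take part in the match is None, so the helper (hypothetically named parse) maps it to an empty string.

```python
import re

SENTENCE_RE = re.compile(
    r'^I post a "(.*?)" to the "(.*?)"(?: (group))? which the entire world can see$'
)

def parse(sentence):
    m = SENTENCE_RE.match(sentence)
    if m is None:
        return None
    # Group 3 is "group" when the word is present, otherwise None -> "".
    return m.group(1), m.group(2), m.group(3) or ""

print(parse('I post a "hello" to the "wall" group which the entire world can see'))
print(parse('I post a "hello" to the "wall" which the entire world can see'))
```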