match character not enclosed by braces recursively - regex

I'm trying to split a string on pipes when they are not enclosed by parentheses.
I've got a regex that works, unless the parentheses are nested:
~\([^)]*\)(*SKIP)(*F)|\|~
test(test(test|tester)|test)|test
The second and third pipes are matched; only the last one should match.

You may use the following regex based on a subroutine:
(\((?:[^()]++|(?1))*\))(*SKIP)(*F)|\|
Details
(\((?:[^()]++|(?1))*\)) - Group 1 that matches
\( - a (
(?:[^()]++|(?1))* - 0 or more occurrences of:
[^()]++ - any 1+ chars other than ( and )
| - or
(?1) - the whole Group 1 pattern is recursed (note that (?R) would not work here since it would recurse the whole regex pattern)
\) - a ) char
(*SKIP)(*F) - PCRE verb sequence that omits the currently matched text and makes the regex engine search for the next match beginning from the end of the current match
| - or
\| - a literal |
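For readers who are not on a PCRE engine: Python's built-in re module has neither subroutine recursion nor the (*SKIP)(*F) verbs, so the pattern above cannot be used there as written. The sketch below swaps in a different technique, a simple depth counter, to get the same split. The function name split_top_level is made up for illustration, and the code assumes the parentheses in the input are balanced.

```python
def split_top_level(text, sep="|", open_ch="(", close_ch=")"):
    """Split text on sep, ignoring separators nested inside parentheses.

    Hypothetical helper: a depth counter stands in for the recursive
    PCRE pattern. Assumes the parentheses in text are balanced.
    """
    parts = []
    current = []
    depth = 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
        if ch == sep and depth == 0:
            # A separator outside all parentheses ends the current part.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


print(split_top_level("test(test(test|tester)|test)|test"))
# ['test(test(test|tester)|test)', 'test']
```

Only the last pipe sits at depth 0, so it is the only split point, which is the behaviour the question asks for.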

Related

regex getting words between '|'

I am trying to get the full words between two '|' characters
example string: {{person label|Jens Addle|border=red}}
here I would like to get the string: Jens Addle
I have attempted with the following:
(([A-Z]\w+))
However, this separates the result into two words and I would like to get it as a single entity.
This should put the value into $1.
Key is escaping the pipes, capturing what is in between and being non-greedy about it.
\|(.+?)\|
This should work in your case: /\|(.*?)\|/gm, or without the flags \|(.*?)\|.
This regex matches all characters between two | characters. (\| - the | character, (.*?) - match everything, as few as possible, and capture it)
You can use
\|\K[^|]*(?=\|)
(?<=\|)[^|]*(?=\|)
Details:
(?<=\|) - a location that is immediately preceded with a | char
\|\K - matches a | char and then "forgets" it
[^|]* - zero or more chars other than a | char
(?=\|) - a location that is immediately followed with a | char.
Matching 1 or more words between the pipe chars can be done using a capture group.
Note that [A-Z]\w+ matches at least 2 characters.
\|([A-Z]\w+(?: \w+)*)(?=\|)
\| Match |
( Capture group 1
[A-Z]\w+ Match an uppercase char A-Z and 1+ word characters
(?: \w+)* Optionally repeat matching a space and 1+ word characters
) Close group 1
(?=\|) Positive lookahead, assert | to the right
To take the format of the example string into account, you might also make the pattern a bit more specific:
{{[^|]*\|([A-Z]\w+(?: \w+)*)\|[^|]*}}
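As a quick check of the patterns in this thread, here is a minimal sketch using Python's built-in re module on the example string from the question. The variable names are illustrative only. The \K variant is left out because Python's re does not support \K; the lookbehind form does the same job.

```python
import re

sample = "{{person label|Jens Addle|border=red}}"

# Capture-group form: the pipes are part of the match, the value is in group 1.
first = re.search(r"\|(.+?)\|", sample).group(1)

# Lookaround form: the match itself excludes the surrounding pipes.
between = re.findall(r"(?<=\|)[^|]*(?=\|)", sample)

# Stricter form: an uppercase letter, word chars, then optional further words.
strict = re.findall(r"\|([A-Z]\w+(?: \w+)*)(?=\|)", sample)

print(first)    # Jens Addle
print(between)  # ['Jens Addle']
print(strict)   # ['Jens Addle']
```

All three return the name as a single string, which is what the original attempt (([A-Z]\w+)) could not do because it has no way to match the space.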

Regex to capture everything after optional token

I have fields which contain data in the following possible formats (each line is a different possibility):
AAA - Something Here
AAA - Something Here - D
Something Here
Note that the first group of letters (AAA) can be of varying lengths.
What I am trying to capture is the "Something Here" or "Something Here - D" (if it exists) using PCRE, but I can't get the Regex to work properly for all three cases. I have tried:
- (.*) which works fine for cases 1 and 2 but obviously not 3;
(?<= - )(.*) which also works fine for cases 1 and 2;
(?! - )(.+)| - (.+) works for cases 2 and 3 but not 1.
I feel like I'm on the verge of it but I can't seem to crack it.
Thanks in advance for your help.
Edit: I realized that I was unclear in my requirements. If there is a trailing " - D" (the letter in the data is arbitrary but should only be a single character), that needs to be captured as well.
About the patterns that you tried:
- (.*) This pattern matches the first occurrence of " - " and then the rest of the line. It will match too much for the second example, as the .* will also match the second occurrence of " - ".
(?<= - )(.*) This pattern matches the same text as the first one, but without the leading " - ", because the lookbehind only asserts that it occurs directly to the left.
(?! - )(.+)| - (.+) This pattern uses a negative lookahead, (?! - ), which asserts that what is directly to the right is not " - ". As none of the examples start with " - ", the .+ matches the whole line right after the lookahead, and the second alternative after | is not evaluated.
If the first group of letters can be of varying length, you could make the match either specific matching 1 or more uppercase characters [A-Z]+ or 1+ word characters \w+.
To get a more broad match, you could match 1 or more non whitespace characters using \S+
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*
Explanation
^ Start of string
(?:\S+\h-\h)? Optionally match the first group of non whitespace chars followed by - between horizontal whitespace chars
\K Clear the match buffer (Forget what is currently matched)
\S+ Match 1+ non whitespace characters
(?: Non capture group
\h(?!-\h) Match a horizontal whitespace char and assert what is directly to the right is not - followed by another horizontal whitespace char
\S+ Match 1+ non whitespace chars
)* Close non capture group and repeat 0+ times to match more "words" separated by spaces
Edit
To match an optional hyphen and trailing single character, you could add an optional non capturing group (?:-\h\S\h*)?$ and assert the end of the string if the pattern should match the whole string:
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*\h*(?:-\h\S\h*)?$
You may use
^(?:.*? - )?\K.*?(?= - | *$)
^(?:.*?\h-\h)?\K.*?(?=\h-\h|\h*$)
Details
^ - start of string
(?:.*? - )? - an optional non-capturing group matching any 0+ chars other than line break chars, as few as possible, up to and including the first space-hyphen-space
\K - match reset operator
.*? - any 0+ chars other than line break chars as few as possible
(?= - | *$) - a space-hyphen-space, or 0+ spaces up to the end of the string, must follow immediately on the right.
Note that \h matches any horizontal whitespace chars.
^(?:[A-Z]+ - \K)?.*\S
Since "Something Here" can be anything, there's no reason to specially describe the eventual last letter in the pattern. You don't need something more complicated.
With this pattern I assume that you are not interested in the trailing spaces, which is why I ended it with \S. If you want to keep them, remove the \S and change the previous quantifier to +.
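The last pattern can be tried outside PCRE as well. The sketch below is an adaptation for Python's built-in re module, which has no \K, so a capture group takes its place and the optional prefix is left out of the group instead of being "forgotten". The name strip_prefix is hypothetical, and the prefix is assumed to be uppercase letters only, as in the examples.

```python
import re

# Same idea as ^(?:[A-Z]+ - \K)?.*\S, with a capture group instead of \K.
PATTERN = re.compile(r"^(?:[A-Z]+ - )?(.*\S)")


def strip_prefix(field):
    """Return the field without its optional 'AAA - ' prefix.

    Hypothetical helper; returns None when the field has no
    non-whitespace character at all.
    """
    match = PATTERN.match(field)
    return match.group(1) if match else None


for field in ("AAA - Something Here", "AAA - Something Here - D", "Something Here"):
    print(repr(strip_prefix(field)))
# 'Something Here'
# 'Something Here - D'
# 'Something Here'
```

Because .* is greedy and must end on \S, trailing spaces are dropped while a trailing " - D" is kept.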

Search / and replace it with ; in xml tag with sublime text 3

I am working on an .xml file with this tag
<Categories><![CDATA[Test/Test1-Test2-Test3|Test4/Test5-Test6|Test7/Test8]]></Categories>
and I am trying to replace / with ; by using regular expressions in Sublime Text 3.
The output should be
<Categories><![CDATA[Test;Test1-Test2-Test3|Test4;Test5-Test6|Test7;Test8]]></Categories>
When I use (<Categories>\S+)\/(.+</Categories>) it matches the whole line, and of course if I use \/ it matches every / in the .xml file.
Could you please help?
For your example string, you could use \G to assert the position at the end of the previous match, use \K to forget what has already been matched, and then match a forward slash.
In the replacement use a ;
Use a positive lookahead to assert what is on the right is ]]></Categories>
(?:<Categories><!\[CDATA\[|\G(?!^))[^/]*\K/(?=[^][]*]]></Categories>)
Explanation
(?: Non capturing group
<Categories><!\[CDATA\[ Match <Categories><![CDATA[
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start
) Close non capturing group
[^/]* Match 0+ times not / using a negated character class
\K/ Forget what was matched, then match /
(?= Positive lookahead, assert what is on the right is
[^][]*]]></Categories> Match 0+ times not [ or ], then match ]]></Categories>
) Close positive lookahead
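If the same replacement has to be scripted outside the editor, the \G and \K approach does not carry over to Python's built-in re module, which supports neither. A possible alternative, sketched below under that assumption, is to match the whole Categories element and rewrite only its CDATA content in a replacement callback. The names CATEGORIES and replace_slashes are made up for this example.

```python
import re

# Group 1: opening tag plus CDATA start, group 2: the content, group 3: the close.
CATEGORIES = re.compile(
    r"(<Categories><!\[CDATA\[)(.*?)(\]\]></Categories>)", re.DOTALL
)


def replace_slashes(xml_text):
    """Replace / with ; only inside <Categories><![CDATA[...]]></Categories>."""

    def fix(match):
        # Only the CDATA content (group 2) is modified.
        return match.group(1) + match.group(2).replace("/", ";") + match.group(3)

    return CATEGORIES.sub(fix, xml_text)


line = "<Categories><![CDATA[Test/Test1-Test2-Test3|Test4/Test5-Test6|Test7/Test8]]></Categories>"
print(replace_slashes(line))
# <Categories><![CDATA[Test;Test1-Test2-Test3|Test4;Test5-Test6|Test7;Test8]]></Categories>
```

The slash in the closing </Categories> tag and slashes in other elements are left alone, because only group 2 is rewritten.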

How to do a find replace around some function call

I have a lot of calls in lots of different files to os.getenv('some_var'). I would like to replace all of these with os.environ['some_var'].
I know how to replace all instances of os.getenv with os.environ but not how to replace the (.*) with [.*] without losing the text inside.
Try this regex:
(os\.)[^()]*\(([^()]*)\)
Replace each match with \1environ[\2]
Explanation:
(os\.) - matches os. and capture in group 1
[^()]*\( - matches 0+ occurrences of any character that is neither a ( nor ) followed by (
([^()]*) - matches 0+ occurrences of any character that is neither a ( nor ). This substring is captured in Group 2
\) - matches )
You can match the call and capture the text inside the parentheses using this regex:
os\.getenv\('([^']+)'\)
And replace it with os.environ['\1']
This regex has three parts:
os\.getenv\(' - This literally matches os.getenv('
([^']+) - This captures the text between the single quotes in group 1
'\) - This literally matches ')
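To try the second approach outside an editor, here is a minimal sketch with Python's built-in re module. The dot in os.getenv is escaped so it only matches a literal dot. The names GETENV_CALL and getenv_to_environ are illustrative, and the pattern assumes single-quoted names with no default argument, as in the question.

```python
import re

# Matches os.getenv('name') and captures name in group 1.
GETENV_CALL = re.compile(r"os\.getenv\('([^']+)'\)")


def getenv_to_environ(source):
    """Rewrite os.getenv('x') calls as os.environ['x'] in a source string."""
    return GETENV_CALL.sub(r"os.environ['\1']", source)


print(getenv_to_environ("path = os.getenv('some_var')"))
# path = os.environ['some_var']
```

A call with a second argument, such as os.getenv('some_var', 'fallback'), is not matched and stays as it is, since the pattern requires the closing parenthesis right after the quoted name.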

Perl Regex match balanced parentheses

Following strings - match:
"MNO(A=(B=C) D=(E=F)) PQR(X=(G=H) I=(J=(K=L)))" - "MNO"
"MNO(A=(B=C) D=(E=F))" - "MNO"
"MNO" - "MNO"
"RAX.MNO(A=(B=C) D=(E=F)) PQR(X=(G=H) I=(J=(K=L)))" - "RAX.MNO"
"RAX.MNO(A=(B=C) D=(E=F))" - "RAX.MNO"
"RAX.MNO" - "RAX.MNO"
Inside every brace, there can be unlimited groups of them, but they have to be closed properly.
Any ideas? Don't know how to test properly for closure.
I have to use a Perl-Regular-Expression.
In Perl or PHP, for example, you could use a regex like
/\((?:[^()]++|(?R))*\)/
to match balanced parentheses and their contents.
To remove all those matches from a string $subject in Perl, you could use
$subject =~ s/\((?:[^()]++|(?R))*\)//g;
Explanation:
\( # Match a (
(?: # Start of non-capturing group:
[^()]++ # Either match one or more characters except (), don't backtrack
| # or
(?R) # Match the entire regex again, recursively
)* # Any number of times
\) # Match a )
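For comparison, the same removal can be sketched without regex recursion, which Python's built-in re module does not offer. The function below swaps in a depth counter and drops every character that sits inside a parenthesised group. The name strip_balanced is made up for illustration, and the code assumes the parentheses in the input are balanced, as the question requires.

```python
def strip_balanced(text):
    """Remove every parenthesised group, including nested ones.

    Hypothetical helper: a depth counter replaces the recursive pattern.
    Assumes the parentheses in text are balanced.
    """
    out = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            # Keep only characters outside all parentheses.
            out.append(ch)
    return "".join(out)


print(strip_balanced("RAX.MNO(A=(B=C) D=(E=F))"))
# RAX.MNO
print(strip_balanced("MNO(A=(B=C) D=(E=F)) PQR(X=(G=H) I=(J=(K=L)))"))
# MNO PQR
```

For inputs with several top-level groups, the removal leaves "MNO PQR"; taking the first whitespace-separated token, for example strip_balanced(s).split()[0], gives the "MNO" listed in the question.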