Explode string with comma when comma is not inside any brackets - regex

I have the string "xyz(text1,(text2,text3)),asd". I want to explode it on , but only on commas that are not inside any brackets (here, the parentheses ()).
I saw many such solutions on Stack Overflow, but they didn't work with my pattern. (example1) (example2)
What is the correct regex for my pattern?
In my case xyz(text1,(text2,text3)),asd
result should be
xyz(text1,(text2,text3)) and asd.

You may use a matching approach using a regex with a subroutine:
preg_match_all('~\w+(\((?:[^()]++|(?1))*\))?~', $s, $m)
See the regex demo
Details
\w+ - 1+ word chars
(\((?:[^()]++|(?1))*\))? - an optional capturing group matching
\( - a (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1+ chars other than ( and )
| - or
(?1) - the whole Group 1 pattern
\) - a ).
PHP demo:
$rx = '/\w+(\((?:[^()]++|(?1))*\))?/';
$s = 'xyz(text1,(text2,text3)),asd';
if (preg_match_all($rx, $s, $m)) {
print_r($m[0]);
}
Output:
Array
(
[0] => xyz(text1,(text2,text3))
[1] => asd
)

If the requirement is to split at , but only outside nested parentheses, another idea is to use preg_split and skip the parenthesized parts, also with a recursive pattern.
$res = preg_split('/(\((?>[^)(]*(?1)?)*\))(*SKIP)(*F)|,/', $str);
See this pattern demo at regex101 or a PHP demo at eval.in
The left side of the pipe character matches and skips whatever is inside the parentheses.
The right side matches the remaining commas, which are outside the parentheses.
The pattern is a variant of several common patterns for matching nested parentheses.
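For comparison, the same top-level split can be done without regex recursion by tracking nesting depth. This is a minimal sketch in Python, not part of either answer above: the function name split_top_level and its parameters are hypothetical, and it assumes the parentheses are balanced.

```python
def split_top_level(text, sep=",", opener="(", closer=")"):
    """Split text on sep, but only where the nesting depth is zero."""
    parts = []
    current = []
    depth = 0
    for ch in text:
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
        if ch == sep and depth == 0:
            # Top-level separator: close the current piece.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


result = split_top_level("xyz(text1,(text2,text3)),asd")
print(result)  # ['xyz(text1,(text2,text3))', 'asd']
```

A depth counter is easier to port to engines that lack (?1) recursion or (*SKIP)(*F), at the cost of being a loop rather than a one-line pattern.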

How to delete two groups of characters with regex?

I have this type of string:
First part: [[archive 726|The Archive]] is a great start
And I want to print:
First part: The Archive is a great start
Here is what I've come up with so far:
input.gsub!(/\[\[(.*?)\|/,"")
print input
> "First part: The Archive]] is a great start"
How can I also match the ]]?
You may use
input.gsub!(/\[\[[^\]\[]*\|(.*?)\]\]/, '\1')
See the Rubular demo and a Ruby demo.
Details
\[\[ - a [[ substring
[^\]\[]* - any 0 or more chars other than [ and ], as many as possible (if there are multiple | chars inside [[...]], replace * with *? to match as few as possible)
\| - a | char
(.*?) - Group 1 (the group value is referred to with \1 from the replacement pattern, mind the single quotes around \1): any 0 or more chars other than line break chars, as few as possible
\]\] - a ]] substring.
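The same pattern carries over to other engines largely unchanged. As a rough illustration (Python's re.sub here is a stand-in for Ruby's gsub, not something the answer used), the replacement keeps only Group 1:

```python
import re

text = "First part: [[archive 726|The Archive]] is a great start"

# Same pattern as the Ruby answer: drop "[[...|" and "]]", keep the label.
cleaned = re.sub(r"\[\[[^\]\[]*\|(.*?)\]\]", r"\1", text)
print(cleaned)  # First part: The Archive is a great start
```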

Regex for matching a specific pattern only if it doesn't match other pattern

I need to create a regex to find genetic sequences, and I am stuck on one specific problem. After the start codon ATG come other codons of three nucleotides each, and the match ends with one of three possible stop codons: TAA, TAG or TGA. What if the stop codon comes right after the start (ATG) codon? My current regex works when there are intermediate codons between the start and stop codons, but if there are none, the regex matches ALL of the sequence after the start codon. I know why it does that, but I have no idea how to change it to work the way I want.
My regex should look for AGGAGG (exactly this pattern), then A, C, G or T (from 4 to 12 times) then ATG (exactly this pattern), then A, C, G or T (in triples (for example, ACG, TGC and etc.), doesn't matter how long) UNTIL it matches TAA, TAG or TGA. The search should end after that and start again after that.
Example of a good match:
XXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXX
AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT
There are two matches in the sequence - from 0 to 25 and from 28 to 44.
My current regex(don't mind the first two brackets):
$seq =~ /(AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3,3}){0,}(TAA|TAG|TGA)/ig
The problem comes from the default use of greedy quantifiers.
With (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*(TAA|TAG|TGA), the 4th group ([ACTG]{3})* matches as many triples as possible, and only then is the 5th group considered (backtracking if needed).
Your sequence contains TAGTAG. The greedy quantifier leads to the first TAG being captured in group 4 and the second one being captured as the ending group.
You may use a lazy quantifier instead: (AGGAGG)([ACGT]{4,12})(ATG)([ACTG]{3})*?(TAA|TAG|TGA) (note the added question mark, which makes the quantifier lazy).
That way, the first TAG encountered is treated as the ending group.
Demo.
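The difference is easy to see by comparing match spans on the sequence from the question. This is a small sketch in Python rather than Perl, using non-capturing groups for brevity; the spans are end-exclusive.

```python
import re

seq = "AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT"

# Greedy: the codon loop grabs as many triples as it can before backtracking.
greedy = re.compile(r"AGGAGG[ACGT]{4,12}ATG(?:[ACGT]{3})*(?:TAA|TAG|TGA)")
# Lazy: the codon loop stops at the first stop codon it can reach.
lazy = re.compile(r"AGGAGG[ACGT]{4,12}ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")

greedy_spans = [m.span() for m in greedy.finditer(seq)]
lazy_spans = [m.span() for m in lazy.finditer(seq)]

print(greedy_spans)  # [(0, 28), (28, 47)]
print(lazy_spans)    # [(0, 25), (28, 44)]
```

The lazy spans agree with the expected matches stated in the question (0 to 25 and 28 to 44).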
According to the pattern you gave, you could have overlapping matches. The following will find all matches, including overlapping matches:
local our @matches;
$seq =~ /
   (
      ( AGGAGG )
      ( [ACGT]{4,12} )
      ( ATG )
      ( (?: (?! TAA|TAG|TGA ) [ACTG]{3} )* )
      ( TAA|TAG|TGA )
   )
   (?{ push @matches, [ $-[1], $1, $2, $3, $4, $5, $6 ] })
   (?!)
/xg;
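Engines without embedded code blocks can approximate this with a capturing lookahead. The sketch below is in Python and is a reduced variant, not an equivalent of the Perl code: it reports at most one match per start position, whereas the (?{ ... })(?!) trick also records alternative matches from the same start. The tempered loop (?:(?!TAA|TAG|TGA)[ACGT]{3})* is the same idea as in the Perl pattern.

```python
import re

seq = "AGGAGGTATGATGCGTACGGGCTAGTAGAGGAGGTATGATGTAGTAGCATGCT"

# The lookahead consumes nothing, so the scan resumes one character later
# and matches that overlap earlier ones are still found.
pattern = re.compile(
    r"(?=(AGGAGG[ACGT]{4,12}ATG"
    r"(?:(?!TAA|TAG|TGA)[ACGT]{3})*"   # codons that are not stop codons
    r"(?:TAA|TAG|TGA)))"
)

hits = [(m.start(1), m.group(1)) for m in pattern.finditer(seq)]
for start, text in hits:
    print(start, text)
```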
An essential feature of Perl regexes, as opposed to the plain regexes used by tools like grep, is the lazy quantifier: a ? following the * or + quantifier. It makes the quantifier match zero (or one) or more occurrences of the token preceding * (or +), but as few as possible.
$seq =~ /((AGGAGG)([ACGT]{4,12})(ATG)([ACGT]{3})*?(TAA|TAG|TGA))/igx

Replace commas enclosed in curly braces (but not quite)

This seems to be a duplicate question of an already asked one but not really. What I'm looking for is one or more regular expressions without the help of any programming language to change the following text String.Concat( new string[] { "some", "random", "text", string1, string2, "end" }) into "some" + "random" + "text" + string1 + string2 + "end".
I was thinking of using two regular expressions, replacing the commas with pluses, and then removing the String.Concat( new string[] { ... }). The second part is quite easy, but I am struggling with the first regular expression. I used a positive look-behind expression, but it matches only the first comma: (?<=String\.Concat\(new string\[\] \{[^,}]*),
I'm not an expert but I think that this is a limitation of the regular expression engine. Once the first comma is matched, the regular expression engine moves the starting matching index after the comma and it doesn't match anymore the look-behind group before it.
Is there a regular expression to make this substitution, pluses instead of commas, without the help of any programming language?
Just like you said: first replace all commas with plus signs:
Regex 1: /,/g
Replacement 1: " +"
Then remove all unnecessary stuff, capture what you need and use a backreference to the captured group as replacement:
Regex 2: /String\.Concat\(\s*new\s*string\s*\[\]\s*{\s*(.*?)\s*}\)/g
Replacement 2: "$1" (or however you can specify backreferences).
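To check the two substitutions outside an editor, here is a hedged sketch in Python. It swaps the order (unwrap first, then replace commas) so that commas elsewhere in a file are left alone, and it assumes no commas occur inside the string literals.

```python
import re

src = 'String.Concat( new string[] { "some", "random", "text", string1, string2, "end" })'

# Step 1: keep only what sits between "{" and "})".
inner = re.sub(
    r"String\.Concat\(\s*new\s*string\s*\[\]\s*\{\s*(.*?)\s*\}\)",
    r"\1",
    src,
)

# Step 2: turn each comma (with surrounding whitespace) into " + ".
joined = re.sub(r"\s*,\s*", " + ", inner)
print(joined)  # "some" + "random" + "text" + string1 + string2 + "end"
```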
I'm assuming you're using a text editor:
,
substitution:
+
See the demo
Second:
.*?{(.*?)}.*
Replacement
$1
See the demo
You should do it in two steps.
First substitution
s/^\s*String\s*\.Concat\s*\(\s*new\s+string\[\]\s*\{\s*("[^\}]*")\s*\}\)\s*$/$1/i
gets you
"some", "random", "text", string1, string2, "end"
Second one
s/\s*("[^"]*"|\b\w+\b)\s*,\s*/$1 + /g
returns your desired output
"some" + "random" + "text" + string1 + string2 + "end"
Check this: https://ideone.com/QuRKmt (sample code in Perl).
Update Notepad++ single pass solution
(Note you can also do this in RegexFormat8 using the Boost extended replacement option)
Find (?:String\.Concat\(\s*new\s*string\s*\[\]\s*{\s*([^,})]*?)\s*(?=,|}\))|\G(?!^)\s*,\s*([^,})]*?)\s*(?=,|}\)))(?:}\)|(?=(?:\s*,\s*[^,})]*)+}\)))
Replace (?1$1: + $2)
(without conditional replace its + $1$2 https://regex101.com/r/Vqe44r/1)
Formatted-expand mode
(?:
String \. Concat\( \s* new \s* string \s* \[\] \s* { \s*
( [^,})]*? ) # (1)
\s*
(?= , | }\) )
|
\G
(?! ^ )
\s* , \s*
( [^,})]*? ) # (2)
\s*
(?= , | }\) )
)
(?:
}\)
|
(?=
(?:
\s* , \s*
[^,})]*
)+
}\)
)
)
Output
"some" + "random" + "text" + string1 + string2 + "end"

PowerShell -replace to get string between two different characters

I am currently using split to get what I need, but I am hoping there is a better way in PowerShell.
Here is the string:
server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000
I want to get the server and database values without the database= or server= prefixes.
Here is the method I am currently using:
$databaseserver = (($details.value).split(';')[0]).split('=')[1]
$database = (($details.value).split(';')[1]).split('=')[1]
This outputs to:
ss8.server.com
CSSDatabase
I would like it to be as simple as possible.
Thank you in advance
Replacing approach
You may use the following regex replace:
$s = 'server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000'
$dbserver = $s -replace '^server=([^;]+).*', '$1'
$db = $s -replace '^[^;]*;database=([^;]+).*', '$1'
The technique is to match and capture (with (...)) what we need and just match what we need to remove.
Pattern details:
^ - start of the line
server= - a literal substring
([^;]+) - Group 1 (what $1 refers to) matching 1+ chars other than ;
.* - any 0+ chars other than a newline, as many as possible
Pattern 2 is almost the same, the capturing group is shifted a bit to capture another detail, and some more literal values are added to match the right context.
Note: if the values you need to extract may appear anywhere in the string, replace ^ in the first one and ^[^;]*; pattern in the second one with .*?\b (any 0+ chars other than a newline, as few as possible followed with a word boundary).
Matching approach
With a -match, you may do it the following way:
$s -match '^server=(.+?);database=([^;]+)'
The $Matches[1] will contain the server details and $Matches[2] will hold the DB info:
Name Value
---- -----
2 CSSDatabase
1 ss8.server.com
0 server=ss8.server.com;database=CSSDatabase
Pattern details
^ - start of string
server= - literal substring
(.+?) - Group 1: any 1+ non-linebreak chars as few as possible
;database= - literal substring
([^;]+) - 1+ chars other than ;
Another solution with a RegEx and named capture groups, similar to Wiktor's Matching Approach.
$s = 'server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000'
$RegEx = '^server=(?<databaseserver>[^;]+);database=(?<database>[^;]+)'
if ($s -match $RegEx){
$Matches.databaseserver
$Matches.database
}
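For readers outside PowerShell, the named-group idea translates directly. A minimal sketch in Python, assuming (as the answers above do) that server= comes first and database= follows immediately; Python spells named groups (?P<name>...):

```python
import re

s = ("server=ss8.server.com;database=CSSDatabase;uid=WS_CSSDatabase;"
     "pwd=abc123-1cda23-123-A7A0-CC54;Max Pool Size=5000")

m = re.match(r"server=(?P<databaseserver>[^;]+);database=(?P<database>[^;]+)", s)
if m:
    print(m.group("databaseserver"))  # ss8.server.com
    print(m.group("database"))        # CSSDatabase
```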

extract a part of string using regex

I have a text file with pattern as below.
"s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}"
I am trying to extract the values cap1, cap2, CAP3.
I am trying with the regex "([a-z]|[|])cap1(\{(.*?)\})([a-z]|[|]|[0-9])" but with no luck. Any help is appreciated.
As I understand it, you want to extract the values of cap1, cap2 and CAP3 one by one. That takes three regexes.
For cap1
cap1\{([^\}]*)\}
Explanation
cap1\{ match text cap1{,
([^\}]*) capture any characters except } to group $1,
\} match text }.
For cap2
cap2\[([^\]]*)\]
Explanation
cap2\[ match text cap2[,
([^\]]*) capture any characters except ] to group $1,
\] match text ].
For CAP3
CAP3\{\[([^\]]*)\]\|\[([^\]]*)\]\}
Explanation
CAP3\{ match text CAP3{,
\[([^\]]*)\]\|\[([^\]]*)\] capture any characters except ] to groups $1, $2 respectively,
\} match text }.
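The three patterns can be tried against the sample string as follows. This is an illustrative sketch in Python (the thread does not say which language the asker uses); it assumes each name occurs once in the input.

```python
import re

s = ("s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}"
     "|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]"
     "|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}")

cap1 = re.search(r"cap1\{([^}]*)\}", s).group(1)
cap2 = re.search(r"cap2\[([^\]]*)\]", s).group(1)
# CAP3 has two bracketed parts, so it yields two groups.
cap3 = re.search(r"CAP3\{\[([^\]]*)\]\|\[([^\]]*)\]\}", s).groups()

print(cap1)  # capture|these|values
print(cap2)  # capture|these|values
print(cap3)  # ('capture|these|values', 'capture|these|values')
```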
Additional: thanks to a comment from @Borodin, note that you don't need lookarounds for this task. If you want to do a search and replace, however, lookarounds may be necessary.
For cap1: (?<=cap1\{)([^\}]*)(?=\})
For cap2: (?<=cap2\[)([^\]]*)(?=\])
For CAP3: (?<=CAP3\{)\[([^\]]*)\]\|\[([^\]]*)\](?=\})
Using a pattern such as this should work:
[{\[]+([^}{\]\[]+)[\]}]+
Code:
$searchText =~ m/[{\[]+([^}{\]\[]+)[\]}]+/
Example:
https://regex101.com/r/qI3fI6/1
Update
I apologise -- I initially mistook your question for something more trivial
Essentially you want to perform a split on pipe | characters, excluding those found inside pairs of brackets or braces [ ... ] or { ... }. As long as you don't need to take account of nesting inside brackets of the same type (i.e. braces will only ever contain brackets, and brackets will only ever contain braces) then it is simply done like this
my @matches = $s =~ m{ \w+ ( \{ [^{}]* \} | \[ [^\[\]]* \] ) }gx;
print "$_\n" for @matches;
output
{capture|these|values}
[capture|these|values]
{[capture|these|values]|[capture|these|values]}
The data you show has no instances of braces containing braces, or brackets containing brackets, but I suspect that there is no theoretical limit to the nesting of your data, in which case some recursion is necessary
The regex pattern in the program below defines the text that can appear inside a pair of matching brackets as a pipe-delimited sequence of
another pair of matching brackets and their content [ ... ]
another pair of matching braces and their content { ... }
a sequence of word characters like capture and values
A pattern matching that is inside the second pair of capturing parentheses. It is a recursive pattern that calls itself using relative numbering (?-1). That could also be absolute numbering (?2), but it would have to be changed if the number of preceding captures was changed
The complete pattern looks for and captures a series of word characters immediately before the recursive pattern to account for the cap1, cap2 etc. This allows the result of a global search to be assigned directly to a hash, with the result shown below
use strict;
use warnings;
my $s = "s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}";
my %captures = $s =~ m{
( (?> \w+ ) )
(
\{ (?-1) (?> \| (?-1) )* \} |
\[ (?-1) (?> \| (?-1) )* \] |
\w+
)
}gx;
use Data::Dump;
dd \%captures;
output
{
cap1 => "{capture|these|values}",
cap2 => "[capture|these|values]",
CAP3 => "{[capture|these|values]|[capture|these|values]}",
}
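Where the regex engine has no recursion (Python's built-in re module, for example), arbitrary nesting can be handled with an explicit stack instead. The following is a sketch of that swapped-in technique, not a translation of the Perl pattern: the names PAIRS, balanced_end and named_groups are hypothetical, and it only checks that brackets balance, not that the content is a pipe-delimited list.

```python
import re

PAIRS = {"{": "}", "[": "]"}
NAME_BEFORE_BRACKET = re.compile(r"(\w+)(?=[{\[])")


def balanced_end(text, start):
    """Return the index just past the bracket group that opens at text[start]."""
    stack = []
    for i in range(start, len(text)):
        ch = text[i]
        if ch in PAIRS:
            stack.append(PAIRS[ch])        # remember which closer we expect
        elif ch in PAIRS.values():
            if not stack or stack.pop() != ch:
                raise ValueError(f"mismatched {ch!r} at index {i}")
            if not stack:
                return i + 1               # outermost group just closed
    raise ValueError("unbalanced brackets")


def named_groups(text):
    """Map each name to the balanced bracket group that follows it."""
    result = {}
    pos = 0
    while (m := NAME_BEFORE_BRACKET.search(text, pos)) is not None:
        end = balanced_end(text, m.end())
        result[m.group(1)] = text[m.end():end]
        pos = end
    return result


s = ("s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}"
     "|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]"
     "|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}")
captures = named_groups(s)
print(captures)
```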
Original answer
It looks like you want all identifiers that are preceded by a pipe | character and followed by either a square or curly opening bracket [ or {
This program will do that for you
use strict;
use warnings;
use v5.10;
my $s = "s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}";
for ( $s ) {
my @captures = /\|(\w+)[\[\{]/g;
say for @captures;
}
output
cap1
cap2
CAP3