How do I use perl regex to extract the contents within the outermost parentheses?
text = (-(A + (B - C)))
output = -(A + (B - C))
Thanks
It can be done with this recursive pattern, (\(((?:[^()]++|(?1))*)\)), though there are
several ways to do it.
Formatted and tested:
( # (1 start), Recursion code group
\( # Opening (
( # (2 start), Capture, inner core
(?: # Cluster group
[^()]++ # Possessive, not parentheses
| # or,
(?1) # Recurse to group 1
)* # End cluster, do 0 to many times
) # (2 end)
\) # Closing )
) # (1 end)
Output
** Grp 0 - ( pos 4 , len 16 )
(-(A + (B - C)))
** Grp 1 - ( pos 4 , len 16 )
(-(A + (B - C)))
** Grp 2 - ( pos 5 , len 14 )
-(A + (B - C))
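A minimal Perl sketch of how the pattern above might be applied. The sub name outer_contents is hypothetical (it is not part of the answer), and the sketch assumes Perl 5.10 or later, which is needed for (?1) recursion and possessive quantifiers.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text inside the first balanced,
# outermost pair of parentheses, or undef if there is none.
sub outer_contents {
    my ($text) = @_;
    my $re = qr{
        (                     # group 1: recursion target
            \(                # opening paren
            (                 # group 2: the inner contents
                (?: [^()]++   # a run of non-paren characters, possessive
                  | (?1)      # or a nested balanced group
                )*
            )
            \)                # closing paren
        )
    }x;
    return $text =~ $re ? $2 : undef;
}

print outer_contents('text = (-(A + (B - C)))'), "\n";   # prints -(A + (B - C))
```

Group 2 holds the contents without the outer parentheses, which is why the sketch returns $2 rather than $1.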
I don't see that anything more than this is required
use strict;
use warnings 'all';
my $text = "(-(A + (B - C)))";
my ($result) = $text =~ / \( (.*) \) /x;
print $result, "\n";
output
-(A + (B - C))
The pattern captures everything after the first opening parenthesis up to the last closing parenthesis. From your question, I don't think there's a need to check that the string is balanced.
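The two approaches agree on the question's input but can differ when a string holds more than one parenthesized group. The sketch below illustrates that difference; the sub names are hypothetical, and the balanced variant is one possible recursive formulation (Perl 5.10+), not the exact pattern from the other answer.

```perl
use strict;
use warnings;

# Greedy capture: from the first "(" through the last ")" in the string.
sub greedy_contents {
    my ($text) = @_;
    my ($inner) = $text =~ / \( (.*) \) /x;
    return $inner;
}

# Balanced capture: contents of the first balanced pair only.
# Group 1 is the contents; (?1) recurses into it for nested pairs.
sub balanced_contents {
    my ($text) = @_;
    my ($inner) = $text =~ / \( ( (?: [^()]++ | \( (?1) \) )* ) \) /x;
    return $inner;
}

print greedy_contents('(a) + (b)'),   "\n";   # prints a) + (b
print balanced_contents('(a) + (b)'), "\n";   # prints a
```

For a single outer pair such as (-(A + (B - C))), both subs return -(A + (B - C)), so the simpler pattern is enough when the input is known to have that shape.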
If the regular expression is e.g. ^(?<object>[\-\w]+)/([\-\w]+)$, will one invoke the second capturing group as $2 or as $1? In other words, are anonymous capturing groups absolutely or relatively numbered?
Use $2 to refer to the second capturing group. Note that I would not call it "anonymous"; "unnamed" would suit better here.
See a sample regex demo.
See PCRE docs:
PCRE supports the use of named as well as numbered capturing parentheses. The names are just an additional way of identifying the parentheses, which still acquire numbers.
In PCRE, capture groups, named or not, are numbered sequentially in the order their opening parentheses appear.
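A short Perl sketch of the same behaviour, using the regex from the question. The sub name split_path and the sample input foo-bar/baz are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: match "object/name" and return the captures
# by number ($1, $2) and by name ($+{object}).
sub split_path {
    my ($path) = @_;
    return unless $path =~ m{^(?<object>[\-\w]+)/([\-\w]+)$};
    return ($1, $2, $+{object});
}

my ($first, $second, $named) = split_path('foo-bar/baz');
print "\$1 = $first, \$2 = $second, \$+{object} = $named\n";
```

The named group is still group 1, so $1 and $+{object} hold the same text, and the unnamed group that follows is $2.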
Here is an example where the groups are annotated, indented and numbered (mixed with some conditionals).
# ==============================
# Variations of the same thing
# ==============================
1 ( a )?
2 ( b )?
3 ( c )?
c (?(1)
|
c (?(2)
|
c (?(3) | (*FAIL) )
)
)
# ==============================
4 (
5 ( a )?
6 ( b )?
7 ( c )?
4 )
c (?(2)
|
c (?(3)
|
c (?(4) | (*FAIL) )
)
)
# ==============================
8 (?<A> a )?
9 (?<B> b )?
10 (?<C> c )?
c (?(<A>)
|
c (?(<B>)
|
c (?(<C>) | (*FAIL) )
)
)
# ==============================
11 (?<M>
12 (?<A> a )?
13 (?<B> b )?
14 (?<C> c )?
c (?(<A>)
|
c (?(<B>)
|
c (?(<C>) | (*FAIL) )
)
)
11 )
# ==============================
A Branch Reset, (?|...), numbers groups a little differently.
Every branch starts counting from the same number: the next group number
available where the BR starts.
Going past the BR, the numbering resumes at 1+ the largest number assigned
in any single branch.
Example:
# Super Branch with Conditionals
1 ( a ) # (1)
(?|
x
br 2 ( y ) # (2)
z
(?|
br 3 ( u ) # (3)
4 ( u ) # (4)
c (?(1)
5 ( R ) # (5)
| (?|
br 6 ( x ) # (6)
|
br 6 ( x ) # (6)
c (?(2)
a
|
7 ( b ) # (7)
)
8 ( c ) # (8)
)
)
9 ( u ) # (9)
10 ( u ) # (10)
|
br 3 ( e ) # (3)
4 ( e ) # (4)
5 ( e ) # (5)
|
br 3 ( c ) # (3)
)
11 ( K ) # (11)
|
br 2 ( # (2 start)
p
3 ( # (3 start)
q
(?|
br 4 ( M ) # (4)
5 ( M ) # (5)
6 ( M ) # (6)
7 ( M ) # (7)
(?|
br 8 ( T ) # (8)
9 ( T ) # (9)
10 ( T ) # (10)
|
br 8 ( D ) # (8)
9 ( D ) # (9)
)
12 ( R ) # (12)
13 ( R ) # (13)
|
br 4 ( B ) # (4)
5 ( B ) # (5)
6 ( B ) # (6)
|
br 4 ( v ) # (4)
)
3 ) # (3 end)
r
2 ) # (2 end)
14 ( o ) # (14)
15 ( i ) # (15)
|
br 2 ( t ) # (2)
s
3 ( w ) # (3)
)
16 ( Z ) # (16)
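A much smaller branch reset case can be checked directly in Perl (5.10+). This is only a sketch; the sub name groups and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return groups 1..3, showing unset groups as "-".
# Branch 1 uses groups 1 and 2; branch 2 reuses group 1.
# The group after the branch reset is number 3 in both cases.
sub groups {
    my ($text) = @_;
    return unless $text =~ / (?| (a) (b) | (c) ) (d) /x;
    return map { defined $_ ? $_ : '-' } ($1, $2, $3);
}

print join(' ', groups('abd')), "\n";   # prints a b d
print join(' ', groups('cd')),  "\n";   # prints c - d
```

When the second branch matches, group 2 is simply unset, and the trailing (d) is still group 3 because the first branch used two numbers.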
Addendum for Dot-Net counting
There are 2 options for counting Dot-Net captures:
1. Count named capture groups
2. Named groups last
Obviously, without 1 you don't get 2.
Example: Don't count named groups
1 ( # (1 start)
(?'overall'
^
(?= [^&] )
(?:
(?<scheme> [^:/?#]+ )
:
)?
(?:
//
2 ( ) # (2)
(?<authority> [^/?#]* )
)?
(?<path> [^?#]* )
(?:
\?
(?<query> [^#]* )
)?
3 ( ) # (3)
(?:
\#
(?<fragment> .* )
)?
)
1 ) # (1 end)
Example: Count named groups
1 ( # (1 start)
2 (?'overall' # (2 start)
^
(?= [^&] )
(?:
3 (?<scheme> [^:/?#]+ ) # (3)
:
)?
(?:
//
4 ( ) # (4)
5 (?<authority> [^/?#]* ) # (5)
)?
6 (?<path> [^?#]* ) # (6)
(?:
\?
7 (?<query> [^#]* ) # (7)
)?
8 ( ) # (8)
(?:
\#
9 (?<fragment> .* ) # (9)
)?
2 ) # (2 end)
1 ) # (1 end)
Example: Count named groups, and Named groups last
1 ( # (1 start)
4 (?'overall' #_(4 start)
^
(?= [^&] )
(?:
5 (?<scheme> [^:/?#]+ ) #_(5)
:
)?
(?:
//
2 ( ) # (2)
6 (?<authority> [^/?#]* ) #_(6)
)?
7 (?<path> [^?#]* ) #_(7)
(?:
\?
8 (?<query> [^#]* ) #_(8)
)?
3 ( ) # (3)
(?:
\#
9 (?<fragment> .* ) #_(9)
)?
4 ) #_(4 end)
1 ) # (1 end)
I have a text file:
Text.txt
2015-08-31 05:55:54,881 INFO (ClientThread.java:173) - Login successful for user = Test, client = 123.456.789.100:12345
2015-08-31 05:56:51,354 INFO (ClientThread.java:325) - Closing connection 123.456.789.100:12345
I would like output to be:
2015-08-31 05:55:54 Login Test 123.456.789.100
2015-08-31 05:56:51 Closing connection 123.456.789.100
Code:
$files = Get-Content "Text.txt"
$grep = $files | Select-String "serviceClient:" , "Unregistered" |
Where {$_ -match '^(\S+)+\s+([^,]+).*?-\s+(\w+).*?(\S+)$' } |
Foreach {"$($matches[1..4])"} | Write-Host
How can I do it with the current code?
Add another group and make it optional.
^(\S+)+\s+([^,]+).*?-\s+(\w+)(?:.*?=\s+(\w+))?.*?(\S+?)(?::\d+)?$
^
( \S+ )+ # (1)
\s+
( [^,]+ ) # (2)
.*? - \s+
( \w+ ) # (3)
(?:
.*? = \s+
( \w+ ) # (4)
)?
.*?
( \S+? ) # (5)
(?: : \d+ )?
$
Output:
** Grp 0 - ( pos 0 , len 121 )
2015-08-31 05:55:54,881 INFO (ClientThread.java:173) - Login successful for user = Test, client = 123.456.789.100:12345
** Grp 1 - ( pos 0 , len 10 )
2015-08-31
** Grp 2 - ( pos 11 , len 8 )
05:55:54
** Grp 3 - ( pos 57 , len 5 )
Login
** Grp 4 - ( pos 85 , len 4 )
Test
** Grp 5 - ( pos 100 , len 15 )
123.456.789.100
----------------
** Grp 0 - ( pos 123 , len 97 )
2015-08-31 05:56:51,354 INFO (ClientThread.java:325) - Closing connection 123.456.789.100:12345
** Grp 1 - ( pos 123 , len 10 )
2015-08-31
** Grp 2 - ( pos 134 , len 8 )
05:56:51
** Grp 3 - ( pos 180 , len 7 )
Closing
** Grp 4 - NULL
** Grp 5 - ( pos 199 , len 15 )
123.456.789.100
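The question uses PowerShell; the sketch below applies the same pattern in Perl purely to show how the optional group behaves. The sub name summarize is hypothetical, and unset groups are simply dropped before joining.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the pattern above to one log line and
# join the groups that actually matched with single spaces.
sub summarize {
    my ($line) = @_;
    my $re = qr{^(\S+)+\s+([^,]+).*?-\s+(\w+)(?:.*?=\s+(\w+))?.*?(\S+?)(?::\d+)?$};
    my @groups = $line =~ $re or return;
    return join ' ', grep { defined } @groups;   # group 4 is undef when there is no "user = ..."
}

my @lines = (
    '2015-08-31 05:55:54,881 INFO (ClientThread.java:173) - Login successful for user = Test, client = 123.456.789.100:12345',
    '2015-08-31 05:56:51,354 INFO (ClientThread.java:325) - Closing connection 123.456.789.100:12345',
);
print summarize($_), "\n" for @lines;
```

As in the group listing above, group 3 captures only the first word after the dash, so the second line yields "Closing" rather than "Closing connection".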
I'm having trouble applying a regex to keep only one of two specific consecutive characters in a column. I have the following file in which C-O appears for number 1 and number 2, as indicated. I would like to write a new file in which only the C-O in number 1 is present. This needs to be repeated throughout the file, for example between number 2 and 3 (keep number 2), between number 3 and 4 (keep number 3), etc.
Input:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Output:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 N 133.6704
2 H 28.3581
This is what I have so far, hope my logic is semi-clear. I'm still learning and any commentary is greatly appreciated!
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'data.txt';
open my $fh, '<', $file or die "Can't read $file: $!";
while (my $line = <fh>) {
chomp $line;
my @column = split(/\t/,$line);
if ($column[1] =~ s/COCO/\s+/g) {
print "@columns\n";
}
}
You could maybe do it all at once. Read the whole file into a string.
Then put it through this regex.
# s/(?m)(^\h+(\d+)\h+C.*\s+^\h+\2\h+O.*\n)\s*^\h+(?!\2)(\d+)\h+C.*\s+^\h+\3\h+O.*\n(?!\s*\z)/$1/g
(?xm-)
# C-O in the bottom of a segment
( # (1 start), Keep this
^ \h+ # new line
( \d+ ) # (2), col 1 number
\h+ C .* \s+ # C
^ \h+ # next line
\2 \h+ O .* \n # \2 .. O
) # (1 end)
# Throw this away
# C-O in the top of next segment
\s*
^ \h+ # new line
(?! \2 ) # Not \2
( \d+ ) # (3), col 1 num
\h+ C .* \s+ # C
^ \h+ # next line
\3 \h+ O .* \n # \3 .. O
(?! \s* \z ) # Not the last in file
Perl code:
use strict;
use warnings;
local $/; # slurp mode: read all of DATA as one string ("" would be paragraph mode)
my $input = <DATA>;
print "Input:\n$input\n";
$input =~
s/(?xm-)
# C-O in the bottom of a segment
( # (1 start), Keep this
^ \h+ # new line
( \d+ ) # (2), col 1 number
\h+ C .* \s+ # C
^ \h+ # next line
\2 \h+ O .* \n # \2 .. O
) # (1 end)
# Throw this away
# C-O in the top of next segment
\s*
^ \h+ # new line
(?! \2 ) # Not \2
( \d+ ) # (3), col 1 num
\h+ C .* \s+ # C
^ \h+ # next line
\3 \h+ O .* \n # \3 .. O
(?! \s* \z ) # Not the last in file
/$1/g;
print "Output:\n$input\n";
__DATA__
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Code output:
Input:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Output:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 N 133.6704
2 H 28.3581
Having this string:
"example( other(1), 123, [25]).othermethod(456)"
How can I capture only the arguments of the main functions:
"other(1), 123, [25]" and "456"
I am trying this:
http://regex101.com/r/cR0uS9/2
The same goes for an HTML example. Given this:
<div>
<div>
<div>12</div>
<div>34</div>
</div>
</div>
<div>56</div>
I want to get:
<div>
<div>12</div>
<div>34</div>
</div>
and 56 as second match.
Here's a pattern that doesn't use recursion:
\w+\s*\((?P<parameters>(?:(?:(?:[^()]*\([^()]*\))+|[^()]*)(?:,(?!\s*\))|(?=\))))*)\)
Caveats:
Does not support more than 2 levels of nested braces. e.g.
a(b(c()))
Strings containing ( or ) will trip it up. e.g.
a(")")
You'll find the parameters in the group called "parameters".
Demo.
Explanation:
\w+ # function name
\s* # white space
\(
(?P<parameters> # parameters:
(?:
# two possibilities: 1: a simple parameter, like "12", "'hello'", or "3*1+2"
# 2: the parameter contains braces.
# we'll try to consume pairs of braces. If that fails, we'll simply match a parameter.
(?:
(?: # match a pair of braces ()
[^()]*
\(
[^()]*
\)
)+ # consume as many pairs of braces as possible. Make sure there's at least one, though, because we can't go matching nothing.
|
[^()]* # since there are no more (pairs of) braces, simply consume the function's parameters.
)
# next, either consume a "," or assert there's a ")"
(?:
,
(?! # make sure there is another parameter after the comma
\s*
\)
)
|
(?=
\)
)
)
)*
)
\)
P.S.: I haven't managed to come up with an acceptable pattern for the HTML example yet.
This does some recursion. Use it in a global find function.
# '~(?is)(?:([a-z]\w*)\s*\(((?&core)|)\))(?(DEFINE)(?<core>(?>(?&content)|(?:[a-z]\w*\s*\(|\()(?:(?=.)(?&core)|)\))+)(?<content>(?>(?![a-z]\w*\s*\(|[()]).)+))~'
(?xis-)
(?:
( [a-z] \w* ) # (1), Start-Delimiter, Function
\s* \(
( # (2), CORE
(?&core)
|
)
\) # End-Delimiter, close paren
)
# ///////////////////////
# // Subroutines
# // ---------------
(?(DEFINE)
# core
(?<core>
(?>
(?&content)
|
(?: # Start-Delimiter
[a-z] \w* \s* \( # Function
| \( # Or, a open paren
)
(?:
(?= . )
(?&core) # Recurse core
|
)
\) # End-Delimiter, close paren
)+
)
# content
(?<content>
(?>
(?!
[a-z] \w* \s* \(
| [()]
)
.
)+
)
)
Output:
** Grp 0 - ( pos 0 , len 29 )
example( other(1), 123, [25])
** Grp 1 - ( pos 0 , len 7 )
example
** Grp 2 - ( pos 8 , len 20 )
other(1), 123, [25]
** Grp 3 - NULL
** Grp 4 - NULL
-----------------------
** Grp 0 - ( pos 30 , len 16 )
othermethod(456)
** Grp 1 - ( pos 30 , len 11 )
othermethod
** Grp 2 - ( pos 42 , len 3 )
456
** Grp 3 - NULL
** Grp 4 - NULL
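A sketch of the "global find" in Perl, using the one-line form of the regex above. The sub name top_level_calls is hypothetical; it collects group 1 (the function name) and group 2 (the argument text) for each match.

```perl
use strict;
use warnings;

# Hypothetical helper: return [name, arguments] pairs for every
# top-level call found in the string.
sub top_level_calls {
    my ($text) = @_;
    my $re = qr~(?is)(?:([a-z]\w*)\s*\(((?&core)|)\))(?(DEFINE)(?<core>(?>(?&content)|(?:[a-z]\w*\s*\(|\()(?:(?=.)(?&core)|)\))+)(?<content>(?>(?![a-z]\w*\s*\(|[()]).)+))~;
    my @calls;
    while ($text =~ /$re/g) {
        push @calls, [$1, $2];   # name, raw argument text
    }
    return @calls;
}

for my $call (top_level_calls('example( other(1), 123, [25]).othermethod(456)')) {
    print "$call->[0]: [$call->[1]]\n";
}
```

The argument text is returned exactly as matched, so the first call keeps its leading space, as in the Grp 2 listing above; trim it afterwards if that matters.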
For the html div -
# '~(?s)(?:<div>((?&core)|)</div>)(?(DEFINE)(?<core>(?>(?&content)|<div>(?:(?=.)(?&core)|)</div>)+)(?<content>(?>(?!</?div>).)+))~'
(?xs-)
(?:
<div> # Start-Delimiter <div>
( # (1), CORE
(?&core)
|
)
</div> # End-Delimiter </div>
)
# ///////////////////////
# // Subroutines
# // ---------------
(?(DEFINE)
# core
(?<core>
(?>
(?&content)
|
<div> # Start-Delimiter <div>
(?:
(?= . )
(?&core) # Recurse core
|
)
</div> # End-Delimiter </div>
)+
)
# content
(?<content>
(?>
(?! </?div> )
.
)+
)
)
Output:
** Grp 0 - ( pos 0 , len 82 )
<div>
<div>
<div>12</div>
<div>34</div>
</div>
</div>
** Grp 1 - ( pos 5 , len 71 )
<div>
<div>12</div>
<div>34</div>
</div>
** Grp 2 - NULL
** Grp 3 - NULL
---------------------------
** Grp 0 - ( pos 84 , len 13 )
<div>56</div>
** Grp 1 - ( pos 89 , len 2 )
56
** Grp 2 - NULL
** Grp 3 - NULL
Perhaps regex is not the best way to parse this; tell me if it is not. Anyway, here are some examples of what the syntax tree looks like:
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
Anyway, what I am trying to do is pull the connective out (and, then, once, etc) and its corresponding head (CC,IN,CC), which I already know for each syntax tree so it can act as an anchor, and I also need to retrieve its parent (in the first it is S, second SBARTMP, and third it is S), and its siblings, if there are any (in the first none, in the second left hand side sibling, and third left-hand-side and right-hand-side sibling). Anything higher than the parent is not included
my $pos = "(\\\w|-)*";
my $sibling = qr{\s*(\\((?:(?>[^()]+)|(?1))*\\))\s*};
my $connective = "once";
my $re = qr{(\(\w*\s*$sibling*\s*\\(IN\s$connective\\)\s*$sibling*\s*\))};
This code works for things like:
my $test1 = "(X (SBAR-TMP (IN once) (S sdf) (S sdf)))";
my $test2 = "(X (SBAR-TMP (IN once))";
my $test3 = "(X (SBAR-TMP (IN once) (X as))";
my $test4 = "(X (SBAR-TMP (X adsf) (IN once))";
It will throw away the X on top and keep everything else. However, once the siblings have stuff embedded in them, it no longer matches, because the regex does not go deeper.
my $test = "(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))";
I am not sure how to account for this. I am kind of new to the extended patterns for Perl, just started learning it. To clarify a bit about what the regex is doing: it looks for the connective within two parentheses and the capital-letter/- combo, looks for a complete parent of the same format closing with two parentheses and then should look for any number of siblings that have all their parentheses paired off.
To only get the nearest 'parent' to your anchor connective you can
do it as a recursive parent with a FAIL or do it directly.
(for some reason I can't edit my other posts, must be cookies being deleted).
use strict;
use warnings;
my $connective = qr/ \((?:IN|CC)\s(?:once|and|then)\)/x;
my $sibling = qr/
\s*
(
(?! $connective )
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s*
/x;
my $regex1 = qr/
\( ( [\w-]+ \s* $sibling* \s* $connective \s* $sibling* ) \) #1
/x;
my $regex2 = qr/
( #1
\( \s*
( #2
[\w-]+ \s*
(?> $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
| (?1)
)
)
\s*
\)
)
/x;
my $sample = qq/
(X (SBAR-TMP (IN once) (S sdf) (S sdf)))
(X (SBAR-TMP (IN once))
(X (SBAR-TMP (IN once) (X as))
(X (SBAR-TMP (X adsf) (IN once))
(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
/;
while ($sample =~ /$regex1/xg) {
print "Found: $1\n";
}
print '-' x 20, "\n";
while ($sample =~ /$regex2/xg) {
print "Found: $2\n";
}
__END__
Why did you give up on this, you almost had it. Try this:
use strict;
use warnings;
my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
my $sibling = qr/
\s*
(
(?!$connective)
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s*
/x;
my $regex = qr/
( #1
\(
\s* [\w-]+ \s*
(?> $sibling* \s* $connective \s* $sibling*
| (?1)
)
\s*
\)
)
/x;
my @tests = (
'(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',
'(X (SBAR-TMP (IN once))',
'(X (SBAR-TMP (IN once) (X as))',
'(X (SBAR-TMP (X adsf) (IN once))',
);
for my $sample (@tests)
{
while ($sample =~ /$regex/xg) {
print "Found: $1\n";
}
}
my $another =<<EOS;
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S
(S
(NP blah
(VP blah)
)
(CC then)
(NP blah
(VP blah
(PP blah)
)
)
)
)
EOS
print "\n---------\n";
while ($another =~ /$regex/xg) {
print "\nFound:\n$1\n";
}
__END__
This should work as well
use strict;
use warnings;
my $connective = qr/(?: \((?:IN|CC)\s(?:once|and|then)\) )/x;
my $sibling = qr/
(?: \s*
(
(?!$connective)
\(
(?:
(?> (?: [^()]+ ) )
| (?-1)
)*
\)
)
\s* )
/x;
my $regex = qr/
( #1
\( \s*
( #2
[\w-]+ \s*
(?> $sibling* \s* $connective (?(R)(*FAIL)) \s* $sibling*
| (?1)
)
)
\s*
\)
)
/x;
my @tests = (
'(X (SBAR-TMP (IN once) (S sdf) (S sdf)))',
'(X (SBAR-TMP (IN once))',
'(X (SBAR-TMP (IN once) (X as))',
'(X (SBAR-TMP (X adsf) (IN once))',
'(X (SBAR-TMP (IN once) (MORE stuff (MORE stuff))))',
);
for my $sample (@tests)
{
while ($sample =~ /$regex/xg) {
print "Found: $2\n";
}
}
my $another = "
(S (CC and))
(SBARTMP (IN once) (NP otherstuff))
(S (S (NP blah (VP blah)) (CC then) (NP blah (VP blah (PP blah))) ))
";
print "\n---------\n";
while ($another =~ /$regex/xg) {
print "\nFound:\n$2\n";
}
__END__