Perl: Keep only one of two consecutive characters - regex

I'm having trouble applying a regex to keep only one of two specific pairs of consecutive entries in a column. In the following file, the C-O pair appears for both number 1 and number 2. I would like to write a new file in which only the C-O pair of number 1 is present. This needs to be repeated throughout the file: between numbers 2 and 3 keep the pair of number 2, between numbers 3 and 4 keep the pair of number 3, and so on.
Input:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Output:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 N 133.6704
2 H 28.3581
This is what I have so far, hope my logic is semi-clear. I'm still learning and any commentary is greatly appreciated!
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'data.txt';
open my $fh, '<', $file or die "Can't read $file: $!";
while (my $line = <$fh>) {
    chomp $line;
    my @column = split(/\t/, $line);
    if ($column[1] =~ s/COCO/\s+/g) {
        print "@column\n";
    }
}

You could maybe do it all at once. Read the whole file into a string.
Then put it through this regex.
# s/(?m)(^\h*(\d+)\h+C.*\s+^\h*\2\h+O.*\n)\s*^\h*(?!\2)(\d+)\h+C.*\s+^\h*\3\h+O.*\n(?!\s*\z)/$1/g
(?xm-)
                         # C-O at the bottom of a segment
(                        # (1 start), Keep this
    ^ \h*                # new line, optional leading whitespace
    ( \d+ )              # (2), col 1 number
    \h+ C .* \s+         # C
    ^ \h*                # next line
    \2 \h+ O .* \n       # \2 .. O
)                        # (1 end)
                         # Throw this away:
                         # C-O at the top of the next segment
\s*
^ \h*                    # new line
(?! \2 )                 # Not \2
( \d+ )                  # (3), col 1 num
\h+ C .* \s+             # C
^ \h*                    # next line
\3 \h+ O .* \n           # \3 .. O
(?! \s* \z )             # Not the last in file
Perl code:
use strict;
use warnings;
undef $/;    # slurp mode: read all of DATA in one go
my $input = <DATA>;
print "Input:\n$input\n";
$input =~
s/(?xm-)
                         # C-O at the bottom of a segment
(                        # (1 start), Keep this
    ^ \h*                # new line, optional leading whitespace
    ( \d+ )              # (2), col 1 number
    \h+ C .* \s+         # C
    ^ \h*                # next line
    \2 \h+ O .* \n       # \2 .. O
)                        # (1 end)
                         # Throw this away:
                         # C-O at the top of the next segment
\s*
^ \h*                    # new line
(?! \2 )                 # Not \2
( \d+ )                  # (3), col 1 num
\h+ C .* \s+             # C
^ \h*                    # next line
\3 \h+ O .* \n           # \3 .. O
(?! \s* \z )             # Not the last in file
/$1/g;
print "Output:\n$input\n";
__DATA__
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Code output:
Input:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 C 36.6954
2 O -118.5597
2 N 133.6704
2 H 28.3581
Output:
1 H 27.5310
1 H 27.0882
1 C 36.8857
1 O -118.2564
2 N 133.6704
2 H 28.3581
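The same filtering can also be sketched as a line-by-line pass instead of one multi-line substitution. This is only a minimal sketch, not the method of the answer above: the helper names drop_repeated_co and is_co_pair are hypothetical, every non-blank line is assumed to be whitespace-separated with the number in the first column and the atom in the second, and a C/O pair is dropped whenever the two input rows directly before it are a C/O pair of a different number.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the input lines, returns the lines to keep.
# Assumes every non-blank line looks like "<number> <atom> <value>".
sub drop_repeated_co {
    my @rows = map { [ split ' ', $_ ] } grep { /\S/ } @_;
    my @keep;
    my $i = 0;
    while ($i < @rows) {
        my $is_repeat =
               $i >= 2 && $i + 1 < @rows
            && is_co_pair($rows[$i],     $rows[$i + 1])   # C/O opening this number
            && is_co_pair($rows[$i - 2], $rows[$i - 1])   # C/O closing the previous one
            && $rows[$i - 1][0] != $rows[$i][0];          # and the numbers differ
        if ($is_repeat) {
            $i += 2;                                      # skip both rows of the pair
            next;
        }
        push @keep, join ' ', @{ $rows[$i] };
        $i++;
    }
    return @keep;
}

# True when the two rows are a C row followed by an O row with the same number.
sub is_co_pair {
    my ($c, $o) = @_;
    return $c->[1] eq 'C' && $o->[1] eq 'O' && $c->[0] == $o->[0];
}

my @lines = (
    '1 H 27.5310', '1 H 27.0882', '1 C 36.8857', '1 O -118.2564',
    '2 C 36.6954', '2 O -118.5597', '2 N 133.6704', '2 H 28.3581',
);
print "$_\n" for drop_repeated_co(@lines);
```

Because the comparison is made against the input rows rather than the rows already kept, a number that consists of nothing but a C/O pair would be dropped as well; whether that is wanted depends on the real data.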

Related

need return value for captured group from last captured string in perl

I have XML files from which I want to capture the init value for each parameter. I am copying part of the XML below for reference.
I have the port name and the parameter name (e.g. MNO) available.
E.g. the port name is XYZ and the parameter name is MNO;
the port name is PQR and the parameter names are ABC and GHI.
There can be multiple such tags under one container.
<R-PORT-PROTOTYPE UUID="Oac11eff016c6bb667f357a89xOac11f0ad174240e817fa858f00">
<SHORT-NAME>XYZ</SHORT-NAME>
<REQUIRED-COM-SPECS>
<PARAMETER-REQUIRE-COM-SPEC>
<INIT-VALUE>
<APPLICATION-VALUE-SPECIFICATION>
<SHORT-LABEL>Init_Val</SHORT-LABEL>
<CATEGORY>VALUE</CATEGORY>
<SW-VALUE-CONT>
<SW-VALUES-PHYS>
<V>0.071</V>
</SW-VALUES-PHYS>
</SW-VALUE-CONT>
</APPLICATION-VALUE-SPECIFICATION>
</INIT-VALUE>
<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">/SoftwareTypes/Interfaces/MNO</PARAMETER-REF>
</PARAMETER-REQUIRE-COM-SPEC>
</REQUIRED-COM-SPECS>
</R-PORT-PROTOTYPE>
<R-PORT-PROTOTYPE UUID="Oac11eff016c6bb667f357a89xOac11f0ad174240e817f8f55900">
<SHORT-NAME>PQR</SHORT-NAME>
<REQUIRED-COM-SPECS>
<PARAMETER-REQUIRE-COM-SPEC>
<INIT-VALUE>
<APPLICATION-VALUE-SPECIFICATION>
<SHORT-LABEL>Init_0</SHORT-LABEL>
<CATEGORY>VALUE</CATEGORY>
<SW-VALUE-CONT>
<SW-VALUES-PHYS>
<V>80</V>
</SW-VALUES-PHYS>
</SW-VALUE-CONT>
</APPLICATION-VALUE-SPECIFICATION>
</INIT-VALUE>
<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">/SoftwareTypes/Interfaces/ABC</PARAMETER-REF>
</PARAMETER-REQUIRE-COM-SPEC>
<PARAMETER-REQUIRE-COM-SPEC>
<INIT-VALUE>
<APPLICATION-VALUE-SPECIFICATION>
<SHORT-LABEL>Int_ghi</SHORT-LABEL>
<CATEGORY>VALUE</CATEGORY>
<SW-VALUE-CONT>
<SW-VALUES-PHYS>
<V>-80</V>
</SW-VALUES-PHYS>
</SW-VALUE-CONT>
</APPLICATION-VALUE-SPECIFICATION>
</INIT-VALUE>
<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">/SoftwareTypes/Interfaces/GHI</PARAMETER-REF>
</PARAMETER-REQUIRE-COM-SPEC>
</REQUIRED-COM-SPECS>
</R-PORT-PROTOTYPE>
regex :
if($test_string=~ /<R-PORT-PROTOTYPE.*?<short-name>Port_name<\/short-name>.*?<V>(.*?)<\/.*?<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">.*?Parameter_name<\/PARAMETER-REF>/gis) {
print $2;
}
I need the output to be 80 if the parameter is ABC and -80 if the parameter is GHI.
I suggest using XML::LibXML.
Here I've combined two XPath queries to find the V nodes for which either:
SHORT-NAME is XYZ and PARAMETER-REF (with DEST == PARAMETER-DATA-PROTOTYPE) contains MNO, or
SHORT-NAME is PQR and PARAMETER-REF (with DEST == PARAMETER-DATA-PROTOTYPE) contains ABC or GHI.
Example:
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $dom = XML::LibXML->load_xml(location => 'doc.xml');
my $query = q{
    //R-PORT-PROTOTYPE/SHORT-NAME[text()="XYZ"]/..
    //PARAMETER-REF[@DEST="PARAMETER-DATA-PROTOTYPE"][
        contains(text(),'MNO')
    ]/..//V
    |
    //R-PORT-PROTOTYPE/SHORT-NAME[text()="PQR"]/..
    //PARAMETER-REF[@DEST="PARAMETER-DATA-PROTOTYPE"][
        contains(text(),'ABC') or contains(text(),'GHI')
    ]/..//V
};
foreach my $vnode ($dom->findnodes($query)) {
print $vnode->to_literal() . "\n";
}
Output:
0.071
80
-80
There are two ways to get either or both values with a regex:
1 - Linear https://regex101.com/r/NYbvI8/1
# https://regex101.com/r/NYbvI8/1
# if($test_string=~ /<R-PORT-PROTOTYPE.*?<short-name>PQR<\/short-name>(?:.*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO1<\/PARAMETER-REF>)?(?:.*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO2<\/PARAMETER-REF>)?(?(1)|(?(2)|(?!)))/gis)
<R-PORT-PROTOTYPE .*? <short-name>PQR</short-name>
(?:
.*?
<V>
( .*? ) # (1)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO1</PARAMETER-REF>
)?
(?:
.*?
<V>
( .*? ) # (2)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO2</PARAMETER-REF>
)?
(?(1)
| (?(2)
| (?!)
)
)
2 - Out of order https://regex101.com/r/gQJ3cO/1
# https://regex101.com/r/t4M9UB/1
# if($test_string=~ /<R-PORT-PROTOTYPE.*?<short-name>PQR<\/short-name>(?:(?:(?(1)(?!)).*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO1<\/PARAMETER-REF>)|(?:(?(2)(?!)).*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO2<\/PARAMETER-REF>)){1,2}/gis)
<R-PORT-PROTOTYPE .*? <short-name>PQR</short-name>
(?:
(?:
(?(1) (?!) )
.*?
<V>
( .*? ) # (1)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO1</PARAMETER-REF>
)
|
(?:
(?(2) (?!) )
.*?
<V>
( .*? ) # (2)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO2</PARAMETER-REF>
)
){1,2}

Perl nested parentheses expression

How do I use perl regex to extract the contents within the outermost parentheses?
text = (-(A + (B - C)))
output = -(A + (B - C))
Thanks
It can be done with (\(((?:[^()]++|(?1))*)\)), although there are several
ways to do it.
Formatted and tested:
( # (1 start), Recursion code group
\( # Opening (
( # (2 start), Capture, inner core
(?: # Cluster group
[^()]++ # Possessive, anything that is not a parenthesis
| # or,
(?1) # Recurse to group 1
)* # End cluster, do 0 to many times
) # (2 end)
\) # Closing )
) # (1 end)
Output
** Grp 0 - ( pos 4 , len 16 )
(-(A + (B - C)))
** Grp 1 - ( pos 4 , len 16 )
(-(A + (B - C)))
** Grp 2 - ( pos 5 , len 14 )
-(A + (B - C))
I don't see that anything more than this is required
use strict;
use warnings 'all';
my $text = "(-(A + (B - C)))";
my ($result) = $text =~ / \( (.*) \) /x;
print $result, "\n";
output
-(A + (B - C))
The pattern captures everything from after the first opening parenthesis to before the last closing parenthesis. From your question, I don't think there's a need to check that the string is balanced.
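The two answers differ once a string contains more than one top-level group. The following is a small sketch for comparing them; the helper names inner_balanced and inner_greedy and the second test string are assumptions made for illustration.

```perl
use strict;
use warnings;

# Recursive pattern from the first answer: group 1 is a balanced
# parenthesised group, group 2 is the text inside it.
my $balanced = qr/ ( \( ( (?: [^()]++ | (?1) )* ) \) ) /x;

# Hypothetical helpers for the comparison.
sub inner_balanced { my ($s) = @_; return $s =~ $balanced ? $2 : undef }
sub inner_greedy   { my ($s) = @_; return $s =~ / \( (.*) \) /x ? $1 : undef }

for my $text ('(-(A + (B - C)))', 'f(a) + g(b)') {
    print "text:     $text\n";
    print "balanced: ", inner_balanced($text), "\n";
    print "greedy:   ", inner_greedy($text),   "\n";
}
```

For the string in the question both helpers return -(A + (B - C)). For f(a) + g(b) the recursive pattern returns a, while the greedy one runs from the first opening parenthesis to the last closing one and returns a) + g(b.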

How to match the main subgroup in a regular expression?

Having this string:
"example( other(1), 123, [25]).othermethod(456)"
How can I capture only the arguments of the main functions:
"other(1), 123, [25]" and "456"
I am trying this:
http://regex101.com/r/cR0uS9/2
The same question applies to an HTML example. Given this:
<div>
<div>
<div>12</div>
<div>34</div>
</div>
</div>
<div>56</div>
I want to get:
<div>
<div>12</div>
<div>34</div>
</div>
and 56 as second match.
Here's a pattern that doesn't use recursion:
\w+\s*\((?P<parameters>(?:(?:(?:[^()]*\([^()]*\))+|[^()]*)(?:,(?!\s*\))|(?=\))))*)\)
Caveats:
Does not support more than 2 levels of nested braces. e.g.
a(b(c()))
Strings containing ( or ) will trip it up. e.g.
a(")")
You'll find the parameters in the group called "parameters".
Demo.
Explanation:
\w+ # function name
\s* # white space
\(
(?P<parameters> # parameters:
(?:
# two possibilities: 1: a simple parameter, like "12", "'hello'", or "3*1+2"
# 2: the parameter contains braces.
# we'll try to consume pairs of braces. If that fails, we'll simply match a parameter.
(?:
(?: # match a pair of braces ()
[^()]*
\(
[^()]*
\)
)+ # consume as many pairs of braces as possible. Make sure there's at least one, though, because we can't go matching nothing.
|
[^()]* # since there are no more (pairs of) braces, simply consume the function's parameters.
)
# next, either consume a "," or assert there's a ")"
(?:
,
(?! # make sure there is another parameter after the comma
\s*
\)
)
|
(?=
\)
)
)
)*
)
\)
P.S.: I haven't managed to come up with an acceptable pattern for the HTML example yet.
This does some recursion. Use it in a global find function.
# '~(?is)(?:([a-z]\w*)\s*\(((?&core)|)\))(?(DEFINE)(?<core>(?>(?&content)|(?:[a-z]\w*\s*\(|\()(?:(?=.)(?&core)|)\))+)(?<content>(?>(?![a-z]\w*\s*\(|[()]).)+))~'
(?xis-)
(?:
( [a-z] \w* ) # (1), Start-Delimiter, Function
\s* \(
( # (2), CORE
(?&core)
|
)
\) # End-Delimiter, close paren
)
# ///////////////////////
# // Subroutines
# // ---------------
(?(DEFINE)
# core
(?<core>
(?>
(?&content)
|
(?: # Start-Delimiter
[a-z] \w* \s* \( # Function
| \( # Or, a open paren
)
(?:
(?= . )
(?&core) # Recurse core
|
)
\) # End-Delimiter, close paren
)+
)
# content
(?<content>
(?>
(?!
[a-z] \w* \s* \(
| [()]
)
.
)+
)
)
Output:
** Grp 0 - ( pos 0 , len 29 )
example( other(1), 123, [25])
** Grp 1 - ( pos 0 , len 7 )
example
** Grp 2 - ( pos 8 , len 20 )
other(1), 123, [25]
** Grp 3 - NULL
** Grp 4 - NULL
-----------------------
** Grp 0 - ( pos 30 , len 16 )
othermethod(456)
** Grp 1 - ( pos 30 , len 11 )
othermethod
** Grp 2 - ( pos 42 , len 3 )
456
** Grp 3 - NULL
** Grp 4 - NULL
For the html div -
# '~(?s)(?:<div>((?&core)|)</div>)(?(DEFINE)(?<core>(?>(?&content)|<div>(?:(?=.)(?&core)|)</div>)+)(?<content>(?>(?!</?div>).)+))~'
(?xs-)
(?:
<div> # Start-Delimiter <div>
( # (1), CORE
(?&core)
|
)
</div> # End-Delimiter </div>
)
# ///////////////////////
# // Subroutines
# // ---------------
(?(DEFINE)
# core
(?<core>
(?>
(?&content)
|
<div> # Start-Delimiter <div>
(?:
(?= . )
(?&core) # Recurse core
|
)
</div> # End-Delimiter </div>
)+
)
# content
(?<content>
(?>
(?! </?div> )
.
)+
)
)
Output:
** Grp 0 - ( pos 0 , len 82 )
<div>
<div>
<div>12</div>
<div>34</div>
</div>
</div>
** Grp 1 - ( pos 5 , len 71 )
<div>
<div>12</div>
<div>34</div>
</div>
** Grp 2 - NULL
** Grp 3 - NULL
---------------------------
** Grp 0 - ( pos 84 , len 13 )
<div>56</div>
** Grp 1 - ( pos 89 , len 2 )
56
** Grp 2 - NULL
** Grp 3 - NULL
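To try the recursive idea on the function-call string from the question, here is a minimal Perl sketch. It uses a deliberately simpler pattern than the one above, and it assumes that function names match \w+ and that parentheses are balanced and never occur inside string literals; the helper name top_level_args is hypothetical.

```perl
use strict;
use warnings;

# Simplified pattern: a name followed by one balanced parenthesised group.
my $call = qr/
    ( \w+ ) \s*                      # (1) function name
    (                                # (2) the balanced group, parens included
        \(
        ( (?: [^()]++ | (?2) )* )    # (3) everything between the outer parens
        \)
    )
/x;

# Hypothetical helper: returns the argument text of each top-level call.
sub top_level_args {
    my ($s) = @_;
    my @args;
    while ($s =~ /$call/g) {
        my $inner = $3;
        $inner =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @args, $inner;
    }
    return @args;
}

print "$_\n" for top_level_args('example( other(1), 123, [25]).othermethod(456)');
# prints:
# other(1), 123, [25]
# 456
```

Nested calls such as other(1) are consumed as part of the outer match, so the global search resumes after the closing parenthesis of example(...) and only reports top-level calls.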

Perl regular expression {} quantifier multiple matches

I'm trying to parse a file in which each line has 3 floats (1, +1.0 and -1.0 being valid values). While the regular expression in the snippet matches a float value, I'm not sure how I should use the Perl quantifier {n} to match multiple floats within a single line.
#!/usr/bin/perl
use strict;
use warnings;
open(my $fh, "<", "floatNumbers.txt") or die "Cannot open < floatNumbers.txt";
while(<$fh>)
{
if ($_=~m/([-+]?\d*[\.[0-9]*]?\s*)/)
{
print $1."\n";
}
}
This is the snippet with which I tried to match 3 floats within a line. Could readers help me with the correct usage of the {} quantifier?
if ($_=~m/([-+]?\d*[\.[0-9]*]?\s*){3}/)
You're trying to do extraction and validation at the same time. I'd go with:
sub is_float {
return $_[0] =~ /
^
[-+]?
(?: \d+(?:\.[0-9]*)? # 9, 9., 9.9
| \.[0-9]+ # .9
)
\z
/x;
}
while (<$fh>) {
    my @fields = split;
    if (@fields != 3 || grep { !is_float($_) } @fields) {
        warn("Syntax error at line $.\n");
        next;
    }
    print("@fields\n");
}
Note that your validation considered ., [ and ...0...0... to be numbers. I fixed that.
Quantifiers just allow you to specify how many times you want to match something in a regex.
For example, /(ba){3}/ matches ba exactly 3 times in a row:
bababanfnfd = bababa, but
baba = no match.
You can also use (taken from: http://perldoc.perl.org/perlrequick.html):
a? = match 'a' 1 or 0 times
a* = match 'a' 0 or more times, i.e., any number of times
a+ = match 'a' 1 or more times, i.e., at least once
a{n,m} = match at least n times, but not more than m times.
a{n,} = match at least n times
a{n} = match exactly n times
This is a generalized pattern that I think does what you are talking about:
# ^\s*(?:[-+]?(?=[^\s\d]*\d)\d*\.?\d*(?:\s+|$)){3}$
^ # BOL
\s* # optional whitespaces
(?: # Grouping start
[-+]? # optional -+
(?= [^\s\d]* \d ) # lookahead for \d
\d* \.? \d* # match this form (everything optional but guaranteed a \d)
(?: \s+ | $ ) # whitespaces or EOL
){3} # Grouping end, do 3 times
$ # EOL
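One detail worth keeping in mind with {3}: when a quantifier is applied to a capturing group, Perl keeps only the text of the last repetition in that group, so {3} can validate a line but cannot by itself hand back all three numbers. The sketch below shows the difference; the float pattern follows is_float from the first answer, and the helper names three_floats and last_of_three are hypothetical.

```perl
use strict;
use warnings;

# Float pattern in the spirit of is_float above: 9, 9., 9.9 or .9, optional sign.
my $float = qr/[-+]?(?:\d+(?:\.\d*)?|\.\d+)/;

# Hypothetical helper: in list context, returns the three floats of a line,
# or an empty list when the line does not consist of exactly three floats.
sub three_floats {
    my ($line) = @_;
    return $line =~ /^\s*($float)\s+($float)\s+($float)\s*$/;
}

# Same validation written with {3}: the single capture group is overwritten
# on every repetition, so only the third float is left in $1.
sub last_of_three {
    my ($line) = @_;
    return $line =~ /^\s*(?:($float)(?:\s+|$)){3}$/ ? $1 : undef;
}

my $line = "1 +1.0 -1.0\n";
print join(',', three_floats($line)), "\n";   # prints 1,+1.0,-1.0
print last_of_three($line), "\n";             # prints -1.0
```

Writing the group out three times (or validating first and then using split, as in the first answer) is the usual way to get at every value.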

How to match words separated with single space vs words separated with multiple spaces

I need to separate the key and values from the text that looks like below
Student ID: 0
Department ID = 18432
Name     XYZ
Subjects:
Computer Architecture
Advanced Network Security 2
In the above example, Student ID, Department ID and Name are the keys and 0, 18432 and XYZ are the values. The keys are separated from the values by :, = or multiple spaces. I tried a regex such as
$line =~ /(([\w\(\)]*\s)*)([=:\s?]?)\s*(\S.*)?$/;
$key = $2;
$colon=$3;
$value = $4;
The problem I am facing is identifying when a word is separated with single space and when it is separated by more than one.
The output I get is
line is Student ID: 0
key is Student , value is ID: 0
while I want the key to be Student ID and the value to be 0. For lines like Subjects: and Computer Architecture, the key should be Subjects and Computer Architecture respectively. Later logic handles lines with no value or colon by appending them to the previous key, so the result looks like Subjects=Computer Architecture;Advanced Network Security 2
Update: Thanks, ikegami, for suggesting that I use the lookbehind operator. But I still seem to have a problem solving it.
$line=~/^(?: ( [^:=]+ ) (?<!\s\s)\s* [:=]\s*|\s*)(.*)$/x;
By (?<!\s\s)\s* [:=]\s*|\s* I mean: when there are two or more spaces, consume all of them; when there are no two consecutive spaces, look for : or = and consume the surrounding spaces. So if I pass the line below to the expression, shouldn't I be getting $1=Name and $2=ABC XYZ?
Name   ABC XYZ
What I get instead is an empty key and the value Name ABC XYZ.
If
Name Eric Brine
Computer Architecture x86
means
key: Name Eric value: Brine
key: Computer Architecture value: x86
then you want
# Requires 5.10
if (/
^
(?: (?<key> [^:=]+ (?<!\s) ) \s* [:=] \s* (?<val> .* )
| (?<key> .+ (?<!\s) ) \s+ (?<val> \S+ )
)
\s* $
/x) {
my $key = $+{key};
my $val = $+{val};
...
}
or
if (/
    ^
    (?: ( [^:=]+ (?<!\s) ) \s* [:=] \s* ( .* )
    |   ( .+ (?<!\s) ) \s+ ( \S+ )
    )
    \s* $
/x) {
    my ($key,$val) = defined($1) ? ($1,$2) : ($3,$4);
    ...
}
If
Name Eric Brine
Computer Architecture x86
means
key: Name value: Eric Brine
key: Computer value: Architecture x86
then you want
# Requires 5.10
if (/
^
(?: (?<key> [^:=]+ (?<!\s) ) \s* [:=]
| (?<key> \S+ ) \s
)
\s*
(?<val> .* )
/x) {
my $key = $+{key};
my $val = $+{val};
...
}
or
if (/
^
(?: ( [^:=]+ (?<!\s) ) \s* [:=]
| ( \S+ ) \s
)
\s*
( .* )
/x) {
my $key = defined($1) ? $1 : $2;
my $val = $3;
...
}
Note that you can remove all the space and line breaks. For example, the last snippet can be written as:
if (/^(?:([^:=]+(?<!\s))\s*[:=]|(\S+)\s)\s*(.*)/) {
my $key = defined($1) ? $1 : $2;
my $val = $3;
...
}
Try specifying the key part as two bits of text with an optional space in between;
$line =~ /([\w\(\)]*\s?[\w\(\)]*)\s*([=:]?)\s*(\S.*)?$/;
That should capture both one-word and two-word keys.
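Here is a sketch of one more reading of the question, in which a run of two or more whitespace characters counts as a separator just like : or =, and a line with no separator at all is treated as a key without a value. This is an assumption about the intended format rather than something taken from the answers above, and the helper name split_key_value is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from the answers: returns (key, value).
# Assumes a separator is ':' or '=' or a run of two or more whitespace
# characters; a line without any separator is a key with an empty value.
sub split_key_value {
    my ($line) = @_;
    if ($line =~ /^\s*(\S.*?)\s*(?:[:=]|\s{2,})\s*(.*?)\s*$/) {
        return ($1, $2);
    }
    (my $key = $line) =~ s/^\s+|\s+$//g;   # no separator: trim and use as key
    return ($key, '');
}

for my $line ('Student ID: 0', 'Department ID = 18432',
              'Name    XYZ', 'Subjects:', 'Computer Architecture') {
    my ($key, $value) = split_key_value($line);
    print "key=<$key> value=<$value>\n";
}
```

The key is matched lazily, so it stops at the first separator. A value that itself contains a colon, such as 10:30, therefore stays intact.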