Regex value comparison - regex

I want to compare two numbers isolated from this sample data:
'gi|112807938|emb|CU075707.1|_Xenopus_tropicalis_finished_cDNA,_clone_TNeu129d01 C1:TCONS_00039972(XLOC_025068),_12.9045:32.0354,_Change:1.3118,_p:0.00025,_q:0.50752 C2:TCONS_00045925(XLOC_029835),_10.3694:43.8379,_Change:2.07985,_p:0.0004,_q:0.333824',
'gi|115528274|gb|BC124894.1|_Xenopus_laevis_islet-1,_mRNA_(cDNA_clone_MGC:154537_IMAGE:8320777),_complete_cds C1:TCONS_00080221(XLOC_049570),_17.9027:40.8136,_Change:1.18887,_p:0.00535,_q:0.998852 C2:TCONS_00092192(XLOC_059015),_17.8995:35.5534,_Change:0.990066,_p:0.0355,_q:0.998513',
'gi|118404233|ref|NM_001078963.1|_Xenopus_(Silurana)_tropicalis_pancreatic_lipase-related_protein_2_(pnliprp2),_mRNA C1:TCONS_00031955(XLOC_019851),_0.944706:5.88717,_Change:2.63964,_p:0.01915,_q:0.998852 C2:TCONS_00036655(XLOC_023660),_2.31819:11.556,_Change:2.31757,_p:0.0358,_q:0.998513',
using the following regex:
#!/usr/bin/perl -w
use strict;
use File::Slurp;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
my (#log_change, #largest_change);
foreach (#intersect) {
chomp;
my #condition1_match = ($_ =~ /C1:.*?Change:(-?\d+\.\d+)|C1:.*?Change:(-?inf)/); # Sometimes the value is 'inf' or '-inf'. This allows either a numerical or inf value to be captured.
my #condition2_match = ($_ =~ /C2:.*?Change:(-?\d+\.\d+)|C2:.*?Change:(-?inf)/);
push #log_change, "#condition1_match\t#condition2_match";
}
print Dumper (\#log_change);
Which gives this output:
'1.3118 2.07985 ',
'1.18887 0.990066 ',
'2.63964 2.31757 ',
Ideally, within the same loop I now want to make a comparison between the values held in #condition1_match and #condition2_match such that the larger value is pushed onto a new array, unless comparing against a non numerical 'inf' in which case push the numerical value.
Something like this:
my (#log_change, #largest_change);
foreach (#intersect) {
chomp;
my #condition1_match = ($_ =~ /C1:.*?Change:(-?\d+\.\d+)|C1:.*?Change:(-?inf)/);
my #condition2_match = ($_ =~ /C2:.*?Change:(-?\d+\.\d+)|C2:.*?Change:(-?inf)/);
push #log_change, "#condition1_match\t#condition2_match";
unless ($_ =~ /Change:-?inf/) {
if (#condition1_match > #condition2_match) {
push #largest_change, #condition1_match;
}
else {
push #largest_change, #condition2_match;
}
}
}
print Dumper (\#largest_change);
Which gives:
'2.07985',
undef,
'0.990066',
undef,
'2.31757',
undef,
as well as a lot of this error message:
Use of uninitialized value $condition2_match[1] in join or string at intersect.11.8.pl line 114.
I'm unsure as to what exactly the error message means, as well as why I'm getting undef values in my #largest_change

As you've written your code, #condition_match1 and #condition_match2 will be created with 2 elements -- corresponding to the 2 capture groups in your regular expression -- each time there is a match. But one of these elements will always necessarily be undef, leading to the uninitialized ... warnings.
In this case, you can repair this program by putting the | inside the capture group:
my ($condition1_match) = ($_ =~ /C1:.*?Change:(-?\d+\.\d+|-?inf)/);
my ($condition2_match) = ($_ =~ /C2:.*?Change:(-?\d+\.\d+|-?inf)/);
so that there is a single capture group and the matching operation produces a list with a single, defined element.
In addition, the comparison
if (#condition1_match > #condition2_match) {
is probably not doing what you think it is doing. In Perl, a numerical comparison between two arrays is a comparison of array lengths. What you apparently mean to do is to compare the defined value in each of those arrays, so you would need to do something more cumbersome like:
my $condition1_match = $condition1_match[0] // $condition1_match[1];
my $condition2_match = $condition2_match[0] // $condition2_match[1];
if ($condition1_match > $condition2_match) {
push #largest_change, $condition1_match;
} else {
push #largest_change, $condition2_match;
}

Related

Save result from flip-flop in variable?

I have about 1kB of text from STDIN
my $f = join("", <STDIN>);
and I would like to get the content between open1 and close1, so /open1/../close1/ comes to mind.
I have only seen it been used in one liners and in scripts in while-loops and $_.
Question
How can I get the result from /open1/../close1/ in my script when everything is in $f?
Capturing all matches with a single regular expression
If you want to capture all the lines between open1 and start1 markers (excluding the markers), it is easily done with a single regular expression:
my $f = join("", <STDIN>);
my #matches = ( $f =~ m/\bopen1\b(.*?)\bclose1\b/gs );
for my $m (#matches) {
print "$m";
}
where
s modifier treats the string as a single line;
g modifier captures all the matches;
(.*?) matches a group of any characters using the lazy quantifier
Using the range operator
The range operator (so-called flip-flop) is not very convenient for this task if you want to avoid capturing the markers, because an expression like /open1/ .. /close1/ returns true for the lines matching the patterns.
The expression /^open1$/ .. /^close1$/ returns false until /^open1$/ is true. The left regular expression stops being evaluated once it matches the line, and keeps returning true until /^close1$/ becomes true. When the right expression matches, the cycle is restarted. Thus, the open1 and close1 markers are included into $matches.
It is even less convenient, if the input is stored in a variable, because you will need to read the contents of the variable line by line, e.g.:
my $matches = "";
my #lines = split /\n/, $f;
foreach my $line (#lines) {
if ($line =~ m/^open1$/ .. $line =~ m/^close1$/) {
$matches .= "$line\n";
}
}
Note, it is possible to use arbitrary Perl expressions as operands of the range operator. I wouldn't recommend this code, as it is not very efficient, and not very readable. At the same time it is easy to adapt the first example to the case where the open1 and close1 markers are included into the set of matches, e.g.:
my #matches = ( $f =~ m/\bopen1\b(.*?)\bclose1\b/gs );
for my $m (#matches) {
print "open1${m}close1\n";
}
You can rewrite how $f is generated so that it takes advantage of the flip-flop inside a while loop:
my ( $f, $matched );
while ( <> ) {
$f .= $_;
$matched .= $_ if /open1/ .. /close1/;
}
Another way is to create a new inputs stream out of the contents of $f.
open my $fh, '<', \$f;
while (<$fh>) {
if (/open1/ .. /close1/) {
...
}
}
You can also employ split. To get what is between the first pair of open1 and close1
my $open_to_close = (split /open1|close1/, $f)[1];
The delimiter can be either open1 or close1, so returned is a list of three elements: before open1, between them, and after close1. We take the second element.
If there are more open1/close1 pairs take all odd-indexed elements.
Either get the array as well
my #parts = split /open1|close1/, $f;
my #all_open_to_close = #parts[ grep { $_ & 1 } 0..$#parts ];
or get it directly from the list
my #all_open_to_close =
grep { CORE::state $i; ++$i % 2 == 0 } split /open1|close1/, $f;
The state is a feature
from v5.10. If you already use that you don't need CORE:: prefix.

In array listitem how to store values with comma separation and end with dot using perl

use strict;
use warnings;
my $tmp = join "\n", <DATA>;
my #biblables = ();
List items will be fetched and storing into #biblables in a while loop
while($tmp=~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g)
{
push(#biblables, "\\cite{$2}, ");
}
print #biblables;
While printing this we are getting the output like as:
\cite{BuI2001},\cite{BuI2002},\cite{BuI2003},\cite{BuI2004},\cite{BuI2005},\cite{BuI2006},
However we need the output like this
\cite{BuI2001},\cite{BuI2002},\cite{BuI2003},\cite{BuI2004},\cite{BuI2005},\cite{BuI2006}.
Hence we can use post regex to insert dot at the end of the listitem in array
while($tmp=~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g)
{
my $post = $';
if($post!~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/)
{ push(#biblables, "\\cite{$2}."); }
else { push(#biblables, "\\cite{$2}, "); }
}
print #biblables;
Could you please advise me if there is short way to get this output
#
__DATA__
\bibitem[{BuI (2001)}]{BuI2001}
\bibitem[{BuII (2002)}]{BuI2002}
\bibitem[{BuIII (2003)}]{BuI2003}
\bibitem[{BuIV (2004)}]{BuI2004}
\bibitem[{BuV (2005)}]{BuI2005}
\bibitem[{BuVI (2006)}]{BuI2006}
You can add the comma and period after the fact:
while($tmp=~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g){
push(#biblables, "\\cite{$2}");
}
print join ', ', #biblables;
print ".\n";
If you read from a filehandle you can use eof to determine that you are on the last line, at which point you replace the comma by the dot in the last element. This allows you to build the array completely in the loop, as required.
use warnings;
use strict;
open my $fh, '<', 'bibitems.txt';
my #biblabels;
while (<$fh>) {
push #biblabels, "\\cite{$2}," if /\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/;
$biblabels[-1] =~ tr/,/./ if eof;
}
print "$_ " for #biblabels;
print "\n";
This prints your desired output.
The oef returns true if the next read will return end-of-file. This means that you've just read the last line, which got put on the array if it matched. This function is rarely needed but here it seems to find a fitting purpose. Note that eof and eof() behave a little differently. Please see the eof page.
If the other capture in the regex is meant to be used change the above to if (...) { ... }. Note that what is in {} is in Latex called citation keys, while the (optional) labels are things inside []. I'd go with the array name of #citkeys for clarity.
If you're determine to add the comma's and dots to the elements when
matching in the regex while loop, it can be done like this.
Since you don't know the total matches yet, just keep a reference to
the most recently pushed element.
Then append the , or . as needed.
Code
use strict;
use warnings;
$/ = undef;
my $tmp = <DATA>;
my #biblables = ();
my $ref = undef;
while( $tmp =~ /\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g )
{
$$ref .= ", " if defined $ref;
$ref = \$biblables[ push(#biblables,"\\cite{$2}") ];
}
$$ref .= "." if defined $ref;
print #biblables;
__DATA__
\bibitem[{BuI (2001)}]{BuI2001}
\bibitem[{BuII (2002)}]{BuI2002}
\bibitem[{BuIII (2003)}]{BuI2003}
\bibitem[{BuIV (2004)}]{BuI2004}
\bibitem[{BuV (2005)}]{BuI2005}
\bibitem[{BuVI (2006)}]{BuI2006}
Output
\cite{BuI2001}, \cite{BuI2002}, \cite{BuI2003}, \cite{BuI2004}, \cite{BuI2005}, \cite{BuI2006}.

Why does my program crash after one line of input?

I am writing a simple program which capitalizes each word in a sentence. It gets a multi-line input. I then loop through the input lines, split each word in the line, capitalize it and then join the line again. This works fine if the input is one sentence, but as soon as I input two lines my program crashes (and if I wait too long my computer freezes.)
Here is my code
#input = <STDIN>;
foreach(#input)
{
#reset #words
#words= ();
#readability
$lines =$_;
#split sentence
#words = split( / /, $lines );
#capitalize each word
foreach(#words){
$words[$k] = ucfirst;
$k++;
}
#join sentences again
$lines = join(' ', #words);
#create output line
$output[$i]=$lines;
$i++;
}
#print the result
print "\nResult:\n";
foreach(#output){
print $output[$j],"\n";
$j++;
}
Could someone please tell me why it crashes?
use strict (and be told about not properly handled variables like your indices)
use for var (array) to get a usable item without an index (Perl isn't Javascript)
What isn't there can't be wrong (e.g. push instead of index)
In code:
use strict; # always!
my #input = <STDIN>; # the loop need in- and output
my #output = ();
for my $line (#input) # for makes readability *and* terseness easy
{
chomp $line; # get rid of eol
#split sentence
my #words = split( / /, $line );
#capitalize each word
for my $word (#words){ # no danger of mishandling indices
$word = ucfirst($word);
}
#join sentences again
$line = join(' ', #words);
#create output line
push #output, $line;
}
#print the result
print "\nResult:\n";
for my $line (#output){
print $line, "\n";
}
The problem is that you are using global variables throughout, so they are keeping their values across iterations of the loop. You have reset #words to an empty list even though you didn't need to - it is overwritten when you assign the result of split to it - but $k is increasing endlessly.
$k is initially set to undef which evaluates as zero, so for the first sentence everything is fine. But you leave $k set to the number of elements in #words so it starts from there instead of from zero for the next sentence. Your loop over #words becomes endless because you are assigning to (and so creating) $words[$k] so the array is getting longer as fast as you are looping through it.
The same problem applies to $i and $j, but execution never gets as far as reusing those.
Alshtough this was the only way of working in Perl 4, over twenty years ago, Perl 5 has made programming very much nicer to write and debug. You can now declare variables with my, and you can use strict which (among other things) insists that every variable you use must be declared, otherwise your program won't compile. There is also use warnings which is just as invaluable. In this case it would have warned you that you were using an undefined variable $k etc. to index the arrays.
If I apply use strict and use warnings, declare all of your variables and initialise the counters to zero then I get a working program. It's still not very elegant, and there are much better ways of doing it, but the error has gone away.
use strict;
use warnings;
my #input = <STDIN>;
my #output;
my $i = 0;
foreach (#input) {
# readability
my $lines = $_;
# split sentence
my #words = split ' ', $lines;
# capitalize each word
my $k = 0;
foreach (#words) {
$words[$k] = ucfirst;
$k++;
}
# join sentences again
$lines = join ' ', #words;
#create output line
$output[$i] = $lines;
$i++;
}
print "\nResult:\n";
my $j = 0;
foreach (#output) {
print $output[$j], "\n";
$j++;
}

When there is a similar pattern in an expression, how to extract the occurence of the last instance in perl?

The value of $s is dynamic. I need to extract the values that occur after the last | in between each [].
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = split(/\]/, $s);
foreach my $part (#parts)
{
# Need to extract the values that occur after the last '|'
# (for example: !, .1iit, 10:48AM, Calculator, Coffee)
# and store each of the values separately in a hash
}
Could someone help me out in this?
Thanks,
Best to transform the string into a more useful data structure, then take the needed elements. Why is this best? Because right now you need the last element, but perhaps next time you will need some other part. Since its not harder to do it right, why not?
#!/usr/bin/perl
use strict;
use warnings;
# Only needed for Dumper
use Data::Dumper;
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
# Extract each group between []
# Then transform each group into an array reference by splitting on |
my #groups = map { [ split /\|/ ] } ($s =~ /\[([^\]]*)\]/g);
# Inspect the data structure
print Dumper \#groups;
# Print only the last element of each sub-array
print "$_\n" for map {$_->[-1]} #groups;
If needed the third elements of the sub-arrays could be transformed into hashrefs quite easily too. ,however since that wasn't needed, I leave that as an exercise for the reader (I always love saying that when I get the chance!).
Edit: since I found it interesting I ended up creating these hashrefs, here is the code that would replace the my #groups line:
my #groups = map { [ map { /\{([^\}]*)\}/ ? { split /(?:=|,)/, $1 } : $_ } (split /\|/) ] } ($s =~ /\[([^\]]*)\]/g);
or more properly commented (map commands are read from the back, so the comments start at the bottom and follow by number, comments like #/N pair with those like #N)
my #groups = map { #/1
[ #/2
map { #/3
/\{([^\}]*)\}/ #4 ... and if any element (separated by pipes in #3)
# is surrounded by curly braces
? { #5 ... then return a hash ref
split /(?:=|,)/, $1 #6 ... whose elements are given
# pairwise between '=' or ',' signs
} #/5
: $_ #7 ... otherwise (from 'if' in #4 ) return the element as is
} (split /\|/) #3 ... where each element is separated by pipes (i.e. |)
] #2 ... return an array ref
} ($s =~ /\[([^\]]*)\]/g); #1 For each element between sqr braces (i.e. [])
The generic way:
#subparts = split /\|/, $part;
$tail = $subparts[$#subparts];
If you only ever need the last part separately:
$part =~ /([^\|]*)$/ and $tail = $1;
my ($value) = $part =~ m/[^|]\|(.+)$/;
print "$part => $value\n";
and another way:
my $s =
"[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = $s =~ m/\|([^|]+)]/g;
print join( "\n", #parts );
Since you insist on a regex:
#matches = $s =~ /\|([^|]+?)]/g
Using /g will dump all matches into the array #matches
You really don't need a regex... just use split(). The results are stored in %results
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
foreach my $part (split(/\]/, $s))
{
#pieces = split(/\|/, $part);
$results{$pieces[-1]} = $pieces[-1];
}
With regexes, when you think “I want the last of,” you should immediately think of the pattern .* because regex greed does just what you want.
For example, matching /^(.*)a(.*)$/ chops up "abababab" into
ababab in $1
a matched by the literal in the pattern
b in $2
Let's think through the process of the match. Imagine .* as Augustus Gloop.
Augustus: Ausgezeichnet! The ^ anchor means I get to start at the beginning. From there, I shall eat all the candies!
Willie Wonka: But, my dear Augustus, you must share with the other children.
Augustus: Fine, I get "abababa" and they get "b". Happy?
Willie Wonka: But the next child in line doesn't like b candies.
Augustus: Then I shall keep "ababab" for myself and leave "ab" for the others.
At this point, Augustus has his big pile, humble little Charlie Bucket gets his single a, and Veruca Salt—although scowling about the meager quantity—gets at least something now.
In other words, $2 contains everything after the last a. To be persnickety, the ^ and $ anchors are redundant, but I like keeping them for added emphasis.
Putting this into action, you could write
#! /usr/bin/env perl
use strict;
use warnings;
sub last_fields {
local($_) = #_;
my #last;
push #last, $1 =~ /^.*\|(.+)$/ ? $1 : undef
while /\[(.*?)\]/g;
wantarray ? #last : \#last;
}
The outer while breaks up the string into [...] chunks and assumes that right square-bracket cannot occur inside a chunk. Within each chunk, we use /^.*\|(.+)$/ to capture in $1 everything after the last pipe.
Testing it with your example looks like
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!]" .
"[0|0|{A=167,B=2,C=67,D=17}|.1iit]" .
"[196|0|{A=244,B=6,C=67,D=12}|10:48AM]" .
"[204|0|{A=9,B=201,C=61,D=11}|Calculator]" .
"[66|0|{A=145,B=450,C=49,D=14}|Coffee]";
use Test::More tests => 6;
my #lasts = last_fields $s;
# yes, is_deeply could do this in a single call,
# but it's laid out explicitly here for expository benefit
is $lasts[0], "!";
is $lasts[1], ".1iit";
is $lasts[2], "10:48AM";
is $lasts[3], "Calculator";
is $lasts[4], "Coffee";
is scalar #lasts, 5;
All the tests pass:
$ ./match-last-of
1..6
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
The output of prove is nicer. Run it yourself to see the color coding.
$ prove ./match-last-of
./match-last-of .. ok
All tests successful.
Files=1, Tests=6, 0 wallclock secs ( 0.02 usr 0.01 sys + 0.02 cusr 0.00 csys = 0.05 CPU)
Result: PASS

Using Perl, how can I build a dynamic regexp by passing in an argument to a subroutine?

I would like to create subroutine with a dynamically created regxp. Here is what I have so far:
#!/usr/bin/perl
use strict;
my $var = 1234567890;
foreach (1 .. 9){
&theSub($_);
}
sub theSub {
my $int = #_;
my $var2 = $var =~ m/(??{$int})/;
print "$var2\n";
}
It looks like it will work, but it seems that once the $int in the regex gets evaluated for the first time, it's there forever.
Is there anyway to do something similar to this, but have the regex pick up the new argument each time the sub is called?
The easiest way to fix your code is to add parentheses around my, and remove ??{. Here is the fixed program:
#!/usr/bin/perl
use strict;
my $var = 1234567890;
foreach (1 .. 9){
theSub($_);
}
sub theSub {
my($int) = #_;
my($var2) = $var =~ m/($int)/;
print "$var2\n";
}
One of the problematic lines in your code was my $int = #_, which was equivalent to my $int = 1, because it evaluated #_ in scalar context, yielding the number of elements in #_. To get the first argument of your sub, use my($int) = #_;, which evaluates #_ in list context, or fetch the first element using my $int = $_[0];, or fetch+remove the first element using my $int = shift;
There was a similar problem in the my $var2 = line, you need the parentheses there as well to evaluate the regexp match in list context, yielding the list of ($1, $2, ...), and assigning $var2 = $1.
The construct (??{...}) you were trying to use had the opposite effect to what you wanted: (among doing other things) it compiled your regexp the first time it was used for matching. For regexps containing $ or #, but not containing ??{...}, Perl recompiles the regexp automatically for each match, unless you specify the o flag (e.g. m/$int/o).
The construct (??{...}) means: use Perl code ... to generate a regexp, and insert that regexp here. To get more information, search for ??{ on http://perldoc.perl.org/perlre.html . The reason why it didn't work in your example is that you would have needed an extra layer of parentheses to capture $1, but even with my ($var2) = $var =~ m/((??{$int}))/ it wouldn't have worked, because ??{ has an undocumented property: it forces the compilation of its argument the first time the regexp is used for matching, so my ($var2) = $var =~ m/((??{$int + 5}))/ would have always matched 6.
my $int = #_;
This will give you the count of parameters, always '1' in your case.
I think you want
my $int = shift;
To dynamically pass a regexp to a function, rather than dynamically build it in the function, use qr//.
#!/usr/bin/perl
use strict;
my $var = 1234567890;
foreach (1 .. 9){
&theSub(qr/$int/);
}
sub theSub {
my($regexp) = #_;
my($var2) = ($var =~ $regexp);
print "$var2\n";
}
qr// accepts the same trailing arguments that m// does: i, m, s, and x
my $int is the scalar context, he has ($int) for the list context and that puts $_[0] into $int. In the following only 10 is put into $int and the rest 11 to 99 are lost.
my ($int)=(10..99);
print $int;
10