Perl regular expression unintendedly seems to be modifying a source String - regex

Am trying to match some data patterns from a file in a Perl program. Since the match could be over multiple lines, I have made the line seperator as undefined.
$/ = undef ;
Now, since the match can be across multiple lines and more than one, I am using smgi modifiers.
if ( $msgText =~ /$msgTypeExpr/smgi )
Now, the problem I am having is that the variable $msgText above gets modified though I am not replacing it.
Here is the relevant code:
open (HANDLE1,"$file") || die "cannot open file \n";
$/ = undef ;
while ( my $msgText = <HANDLE1> )
{
my $msgTypeExpr = "<city\\W+";
print "Attempt 1:\n";
if ( $msgText =~ /$msgTypeExpr/smgi )
{
print "matched\n";
}
else
{
print " not matched \n";
}
print "Attempt 2:\n";
if ( $msgText =~ /$msgTypeExpr/smgi )
{
print "matched\n";
}
else
{
print " not matched \n";
}
}
The test input file looks like this:
<city
name="abc">
</city>
One would expect the pattern to match twice but it is matching the first time only and not the second time around.
I have temporarily fixed this issue with assigning to a temp variable for now before matching and using that temp variable to match.
my $tmpMsgText = $msgText ;
This is the first time I am posting a question on this forum, so please pardon any etiquettes mistakes I may have made and also please be kind enough to point them out so that I don't repeat in the future.

if (//g) makes no sense. "If it matches and keeps on matching until there's no match"? Get rid of the g.
I don't know why you're using s or m either.
The s is useless since the pattern doesn't contain ..
The m is useless since the pattern doesn't contain ^ or $.
In reality, //g in scalar context acts as an iterator.
$ perl -E'$_ = "abc"; /(.)/g && say $1; /(.)/g && say $1;'
a
b

first of all, I'm not sure about reading a file like that. Modifying those Perl-special variables, like $/ should be done with local, like this:
local $/ = undef;
this way the variable is only modified in the current scope (thus eliminating possible action-in-the-distance bugs). By setting $/ to undef you will read the entire file in one go, so there is no point putting a while loop there. I'd read the whole file like this:
open my $fh, "<", "somefile" or die;
my $content = do { local $/ = undef; <$fh> };
the do block restricts the modified $/ value only to that one statement (it creates a new scope).
About regex matching: remove the /g modifier after the regex. If I remember correctly, it will remember the last regex search position and continue from there. Also for detecting if a string was altered or not, print the variable before and after those matches. You will see, that they are not modified.
Instead of:
if ( $msgText =~ /$msgTypeExpr/smgi )
put:
if ( $msgText =~ /$msgTypeExpr/smi )

Related

In array listitem how to store values with comma separation and end with dot using perl

use strict;
use warnings;
my $tmp = join "\n", <DATA>;
my #biblables = ();
List items will be fetched and storing into #biblables in a while loop
while($tmp=~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g)
{
push(#biblables, "\\cite{$2}, ");
}
print #biblables;
While printing this we are getting the output like as:
\cite{BuI2001},\cite{BuI2002},\cite{BuI2003},\cite{BuI2004},\cite{BuI2005},\cite{BuI2006},
However we need the output like this
\cite{BuI2001},\cite{BuI2002},\cite{BuI2003},\cite{BuI2004},\cite{BuI2005},\cite{BuI2006}.
Hence we can use post regex to insert dot at the end of the listitem in array
while($tmp=~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g)
{
my $post = $';
if($post!~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/)
{ push(#biblables, "\\cite{$2}."); }
else { push(#biblables, "\\cite{$2}, "); }
}
print #biblables;
Could you please advise me if there is short way to get this output
#
__DATA__
\bibitem[{BuI (2001)}]{BuI2001}
\bibitem[{BuII (2002)}]{BuI2002}
\bibitem[{BuIII (2003)}]{BuI2003}
\bibitem[{BuIV (2004)}]{BuI2004}
\bibitem[{BuV (2005)}]{BuI2005}
\bibitem[{BuVI (2006)}]{BuI2006}
You can add the comma and period after the fact:
while($tmp=~m/\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g){
push(#biblables, "\\cite{$2}");
}
print join ', ', #biblables;
print ".\n";
If you read from a filehandle you can use eof to determine that you are on the last line, at which point you replace the comma by the dot in the last element. This allows you to build the array completely in the loop, as required.
use warnings;
use strict;
open my $fh, '<', 'bibitems.txt';
my #biblabels;
while (<$fh>) {
push #biblabels, "\\cite{$2}," if /\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/;
$biblabels[-1] =~ tr/,/./ if eof;
}
print "$_ " for #biblabels;
print "\n";
This prints your desired output.
The oef returns true if the next read will return end-of-file. This means that you've just read the last line, which got put on the array if it matched. This function is rarely needed but here it seems to find a fitting purpose. Note that eof and eof() behave a little differently. Please see the eof page.
If the other capture in the regex is meant to be used change the above to if (...) { ... }. Note that what is in {} is in Latex called citation keys, while the (optional) labels are things inside []. I'd go with the array name of #citkeys for clarity.
If you're determine to add the comma's and dots to the elements when
matching in the regex while loop, it can be done like this.
Since you don't know the total matches yet, just keep a reference to
the most recently pushed element.
Then append the , or . as needed.
Code
use strict;
use warnings;
$/ = undef;
my $tmp = <DATA>;
my #biblables = ();
my $ref = undef;
while( $tmp =~ /\\bibitem\[([^\[\]]*)\]{([^\{\}]*)}/g )
{
$$ref .= ", " if defined $ref;
$ref = \$biblables[ push(#biblables,"\\cite{$2}") ];
}
$$ref .= "." if defined $ref;
print #biblables;
__DATA__
\bibitem[{BuI (2001)}]{BuI2001}
\bibitem[{BuII (2002)}]{BuI2002}
\bibitem[{BuIII (2003)}]{BuI2003}
\bibitem[{BuIV (2004)}]{BuI2004}
\bibitem[{BuV (2005)}]{BuI2005}
\bibitem[{BuVI (2006)}]{BuI2006}
Output
\cite{BuI2001}, \cite{BuI2002}, \cite{BuI2003}, \cite{BuI2004}, \cite{BuI2005}, \cite{BuI2006}.

Issue with regex matching in Perl

I am new to Perl and any help will be appreciated. I have 2 variables: $release and $env_type. I want to check if a string contains $release_$env_type, then do something. For example,
$release="beta";
$env_type="testing";
so string is beta_testing
Code snippet:
if ( $_ =~ /${release}_${env_type}/ ) {
#do Something
}
This if condition doesn't get resolved. Kindly let me know what is the correct syntax to make this check? I searched on Google but didn't get any good post..
Kindly help!
I have a file with contents:
admin_vh_c9_simv2_edg=/console,/consolehelp
idminternal_vh_c9_simv2_edg=/oim,/soa-infra
sso_vh_c9_simv2_edg=/oim,/soa-infra,/odsm
my $env_type = "edg";
my $release = "c9_simv2";
#Input file containing contexts
my $idmInternal = "./IdmContexts.conf";
if ( !-e $idmInternal ) {
die "Unable to find the file $idmInternal!\n";
}
open( MYFILE, $idmInternal );
while (<MYFILE>) {
chomp;
if ( $_ =~ /${release}_${env_type}/ ) {
push( #filtered, $_ );
}
}
Your code is fine. The problem is elsewhere. The following prints match.
my $release="beta";
my $env_type="testing";
$_ = "so string is beta_testing";
if ( $_ =~ /${release}_${env_type}/ ) {
print "match\n";
}
Note: /\Q${release}_${env_type}/ would be better. It'll make sure that special characters in the interpolated variables match themselves.
Most likely problem: You read the value of $release and/or $env_type from a file, and forgot to chomp the trailing newline.
If you are using $_ then this will work.
if (m/${release}_${env_type}/)
{
# Do something
}
The m// match operator binds automatically to $_. There is no need to bind it explicitly.
To really tell what is going on, you can inject code before the test. For example compile a regex first and then print it.
# compile the regex first
my $regex = qr/${release}_${env_type}/;
say qq{\$regex="$regex"};
# then print your scanned text
say qq{\$_="$_"};
if ( m/$regex/ ) {
# do something
}
If you're going to explicitly bind to a regex, then use variables:
my $string = $_;
if ( $string =~ m/$regex/ ) {
}
Otherwise, simply match the "context variable" ($_).
if ( m/$regex/ ) {
}
Also, USUW would help spot a few problems, proactively:
# Before everything else
use strict;
use warnings;

regular expression code

I need to find match between two tab delimited files files like this:
File 1:
ID1 1 65383896 65383896 G C PCNXL3
ID1 2 56788990 55678900 T A ACT1
ID1 1 56788990 55678900 T A PRO55
File 2
ID2 34 65383896 65383896 G C MET5
ID2 2 56788990 55678900 T A ACT1
ID2 2 56788990 55678900 T A HLA
what I would like to do is to retrive the matching line between the two file. What I would like to match is everyting after the gene ID
So far I have written this code but unfortunately perl keeps giving me the error:
use of "Use of uninitialized value in pattern match (m//)"
Could you please help me figure out where i am doing it wrong?
Thank you in advance!
use strict;
open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach my $line (#sample1) {
my #tab = split (/\t/, $line);
my $chr = $tab[1];
my $start = $tab[2];
my $end = $tab[3];
my $ref = $tab[4];
my $alt = $tab[5];
my $name = $tab[6];
foreach my $item (#sample2){
my #fields = split (/\t/,$item);
if ( $fields[1] =~ m/$chr(.*)/
&& $fields[2] =~ m/$start(.*)/
&& $fields[4] =~ m/$ref(.*)/
&& $fields[5] =~ m/$alt(.*)/
&& $fields[6] =~ m/$name(.*)/
) {
print $line, "\n", $item;
}
}
}
On its surface your code seems to be fine (although I didn't debug it). If you don't have an error I cannot spot, could be that the input data has RE special character, which will confuse the regular expression engine when you put it as is (e.g. if any of the variable has the '$' character). Could also be that instead of tab you have spaces some where, in which case you'll indeed get an error, because your split will fail.
In any case, you'll be better off composing just one regular expression that contains all the fields. My code below is a little bit more Perl Idiomatic. I like using the implicit $_ which in my opinion makes the code more readable. I just tested it with your input files and it does the job.
use strict;
open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach (#sample1) {
(my $id, my $chr, my $start, my $end, my $ref, my $alt, my $name) =
m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
#print "$rex\n";
foreach (#sample2) {
if( m/$rex/ ) {
print "$id - $_";
}
}
}
Also, how regular is the input data? Do you have exactly one tab between the fields? If that is the case, there is no point to split the lines into 7 different fields - you only need two: the ID portion of the line, and the rest. The first regex would be
(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;
And you are searching $restOfLine within the second file in a similar technique as above.
If your files are huge and performance is an issue, you should consider putting the first regular expressions (or strings) in a map. That will give you O(n*log(m)) where n and m are the number of lines in each file.
Finally, I have a similar challenge when I need to compare logs. The logs are supposed to be identical, with the exception of a time mark at the beginning of each line. But more importantly: most lines are the same and in order. If this is what you have, and it make sense for you, you can:
First remove the IDxxx from each line: perl -pe "s/ID\d+ +//" file >cleanfile
Then use BeyondCompare or Windiff to compare the files.
I played a bit with your code. What you wrote there was actually three loops:
one over the lines of the first file,
one over the lines of the second file, and
one over all fields in these lines. You manually unrolled this loop.
The rest of this answer assumes that the files are strictly tab-seperated and that any other whitespace matters (even at the end of fields and lines).
Here is a condensed version of the code (assumes open filehandles $file1, $file2, and use strict):
my #sample2 = <$file2>;
SAMPLE_1:
foreach my $s1 (<$file1>) {
my (undef, #fields1) = split /\t/, $s1;
my #regexens = map qr{\Q$_\E(.*)}, #fields1;
SAMPLE_2:
foreach my $s2 (#sample2) {
my (undef, #fields2) = split /\t/, $s2;
for my $i (0 .. $#regexens) {
$fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
}
# only gets here if all regexes matched
print $s1, $s2;
}
}
I did some optimisations: precompiling the various regexes and storing them in an array, quoting the contents of the fields etc. However, this algorithm is O(n²), which is bad.
Here is an elegant variant of that algorithm that knows that only the first field is different — the rest of the line has to be the same character for character:
my #sample2 = <$file2>;
foreach my $s1 (<$file1>) {
foreach my $s2 (#sample2) {
print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
}
}
I just test for string equality of the rest of the line. While this algorithm is still O(n²), it outperforms the first solution roughly by an order of magnitude simply by avoiding braindead regexes here.
Finally, here is an O(n) solution. It is a variant of the previous one, but executes the loops after each other, not inside each other, therefore finishing in linear time. We use hashes:
# first loop via map
my %seen = map {reverse(split /\t/, $_, 2)}
# map {/\S/ ? $_ : () } # uncomment this line to handle empty lines
<$file1>;
# 2nd loop
foreach my $line (<$file2>) {
my ($id2, $key) = split /\t/, $line, 2;
if (defined (my $id1 = $seen{$key})) {
print "$id1\t$key";
print "$id2\t$key";
}
}
%seen is a hash that has the rest of the line as a key and the first field as a value. In the second loop, we retrieve the rest of the line again. If this line was present in the first file, we reconstruct the whole line and print it out. This solution is better than the others and scales well up- and downwards, because of its linear complexity
How about:
#!/usr/bin/perl
use File::Slurp;
use strict;
my ($ina, $inb) = #ARGV;
my #lines_a = File::Slurp::read_file($ina);
my #lines_b = File::Slurp::read_file($inb);
my $table_b = {};
my $ln = 0;
# Store all lines in second file in a hash with every different value as a hash key
# If there are several identical ones we store them also, so the hash values are lists containing the id and line number
foreach (#lines_b) {
chomp; # strip newlines
$ln++; # count current line number
my ($id, $rest) = split(m{[\t\s]+}, $_, 2); # split on whitespaces, could be too many tabs or spaces instead
if (exists $table_b->{$rest}) {
push #{ $table_b->{$rest} }, [$id, $ln]; # push to existing list if we already found an entry that is the same
} else {
$table_b->{$rest} = [ [$id, $ln] ]; # create new entry if this is the first one
}
}
# Go thru first file and print out all matches we might have
$ln = 0;
foreach (#lines_a) {
chomp;
$ln++;
my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
if (exists $table_b->{$rest}) { # if we have this entry print where it is found
print "$ina:$ln:\t\t'$id\t$rest'\n " . (join '\n ', map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } #{ $table_b->{$rest} }) . "\n";
}
}

Values from IF statement regex match (Perl)

I'm currently extracting values from a table within a file via REGEX line matches against the table rows.
foreach my $line (split("\n", $file)) {
if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
my ($val1, $val2, $val3) = ($1, $2, $3);
# $val's used here
}
}
I purposely assign vals for clarity in the code. Some of my table rows contain 10+ vals (aka columns) - is there a more efficient method of assigning the vals instead of doing ... = ($1, $2, ..., $n)?
A match in list context yields a list of the capture groups. If it fails, it returns an empty list, which is false. You can therefore
if( my ( $val1, $val2, $val3 ) = $line =~ m/^(\S+)\s*(\S+)\s*(\S+)$/ ) {
...
}
However, a number of red flags are apparent in this code. That regexp capture looks very similar to a split:
if( my ( $val2, $val2, $val3 ) = split ' ', $line ) {
...
}
Secondly, why split $file by linefeeds; if you are reading the contents of a file, far nicer is to actually read a single line at once:
while( my $line = <$fh> ) {
...
}
I assume that this is not your actual code, because if so, it will not work:
foreach my $line (split("\n", $file)) {
if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
my ($val1, $val2, $val3) = ($1, $2, $3);
}
# all the $valX variables are now out of scope
}
You should also be aware that \s* will also match the empty string, and may cause subtle errors. For example:
"a bug" =~ /^(\S+)\s*(\S+)\s*(\S+)$/;
# the captures are now: $1 = "a"; $2 = "bu"; $3 = "g"
Even despite the fact that \S+ is greedy, the anchors ^ ... $ will force the regex to fit, hence allowing the empty strings to split the words.
If your intention is to capture all the words that are separated by whitespace, using split is your best option, as others have already mentioned.
open my $fh, "<", "file.txt" or die $!;
my #stored;
while (<$fh>) {
my #vals = split;
push(#stored, \#vals) if #vals; # ignore empty values
}
This will store any captured values into a two-dimensional array. Using the file handle directly and reading line-by-line is the preferred method, unless for some reason you actually need to have the entire file in memory.
Looks like you are just using a table with a space delimiter.You can use the split function:
#valuearray = split(" ", $line)
And then address the elements as:
#valuearray[0] ,#valuearray[1] etc..

Perl regexp use of uninitialized pattern patch

My script loads some stuff from some files in some arrays, you enter a text from the keyboard, the script searches the relevant part of the text in those arrays, if it finds it, it does something, if not, well, another thing, at least in theory.
I get the following errors:
Use of uninitialized value in pattern match (m//) at emo_full_dynamic.pl line 120, <STDIN> chunk 2.
Modification of a read-only value attempted at emo_full_dynamic.pl line 121, <STDIN> chunk 2.
line 120 = $plm3 =~ /arr_(\w+.txt)/;
My problem, I think, is at $plm3 =~ /arr_(\w+.txt)/;. I used it so that I can store the name of an array in $1.
Here's my code:
#!/usr/bin/perl
use warnings;
$idx = 0;
$oldsep = $/;
opendir(DIR, 'c:/downloads/text_files/arrs/');
#files = readdir(DIR);
while ($idx <= $#files )
{
$value = $files[$idx];
if ( $value !~ m/^arr/i)
{
splice #files, $idx, 1;
}
else
{
$idx++;
}
}
foreach $plm (#files)
{
if($plm =~ m/txt$/)
{
open(ARR, "C:/downloads/text_files/arrs/$plm") or die $!;
while(<ARR>)
{ {
chomp($_);
$plm =~ m/arr_(\w+).txt/;
push(#{$1}, $_);
}
close ARR;
}
}
$plm = 0;
$idx = 0;
$stare = <STDIN>;
chomp($stare);
while($stare)
{
foreach $plm2 (#files)
{
if($plm2 =~ m/txt$/)
{
$plm2 =~ m/arr_(\w+).txt/;
if(grep $stare =~ m/$_/i, #{$1})
{
$flag = 1;
}
else
{
$flag = 0;
}
}
}
if($flag == 1)
{
$/ = "%\n";
$plm3 =~ /arr_(\w+.txt)/;
open SUPARARE, "C:/downloads/text_files/replies/$1" or die $!;
etc etc....
First of all, it's always a good idea to use strict pragma -- unless you have a valid reason to avoid it --.
Second, I don't see $plm3 initialized anywhere in your code. You have probably forgot to initialize it.
I think you are assigning something to variable $1 on line 121
Apparently there are some copy/paste issues which negates my initial answer.
Other mistakes, great and small:
You don't use strict. (fatal flaw)
Your opendir is used once, then never closed.
You use global filehandles, instead of lexical (e.g. open my $fh, ...)
Using a complext loop + splice instead of grep (#files=grep /^arr/i, #files)
Using chomp($_) when chomp per default chomps the $_ variable
I don't even know what this line means:
if(grep $stare =~ m/$_/i, #{$1}) {
You seem to be using a pattern match, where $_ is the pattern (which in this case is.. what? Nothing? Anything?), whose return value is used as a grep pattern for an array reference, that may or may not be initialized. A very horrible statement. If it indeed works as intended, the readability is very low.
Redeclaring $/ seems like a frivolous thing to do in this context, but I can't really tell, as the script ends there.