Find multiline blocks - regex

I'm trying to find occurrences of BLOB_SMUGHO, from the file test.out from the bottom of the file. If found, return a chunk of data which I'm interested in between the string "2014.10"
I'm getting Use of uninitialized value $cc in pattern match (m//) at
What is wrong with this script?
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);
use File::ReadBackwards;
my $find = "BLOB_SMUGHO";
my $chnkdelim = "\n[" . strftime "%Y.%m", localtime;
my $fh = File::ReadBackwards->new('test.out', $chnkdelim, 0) or die "err-file: $!\n";
while ( defined(my $line = $fh->readline) ) {
if(my $cc =~ /$find/){
print $cc;
}
}
close($fh);
In case it helps, here is some sample content from test.out:
2014.10.31 lots and
lots of
gibbrish
2014.10.31 which I'm not
interested
in. It
also
2014.10.31 spans
across thousands of
lines and somewhere in the middle there will be
2014.10.31
this precious word BLOB_SMUGHO and
2014.10.31 certain other
2014.10.31 words
2014.10.31
this precious word BLOB_SMUGHO and
2014.10.31
this precious word BLOB_SMUGHO and
which
I
will
be
interested
in.
And I'm expecting to capture all the multiple occurrences of the chunk of the text from bottom of the file.
2014.10.31
this precious word BLOB_SMUGHO and

First, you have written your match incorrectly due to misunderstanding the =~ operator:
if(my $cc =~ /$find/){ # incorrect, like saying if(undef matches something)
If you want to match what is in $line against the pattern between /.../ then do:
if($line =~ /$find/) {
The match operator expects a value on its left side as well as a pattern on its right side. You were using it like an assignment operator.
If you need to capture the match(es) into a variable or list, add a capture group to the pattern and put the variable to the left of an equals sign:
if(my ($cc) = $line =~ /($find)/) { # wrap $cc in () for list context
By the way, I think you are better off writing:
if($line =~ /$find/) {
print $line;
or, if you want to print only what you matched:
print $&;
Since you aren't capturing a substring, it doesn't really matter here.
Now, as to how to match everything between two patterns, the task is easier if you don't match line by line, but match across newlines using the /s modifier.
In Perl, you can set the record separator to undef and use slurp mode.
local $/ = undef;
my $s = <>; # read all lines into $s
Now to scan $s for patterns
while($s =~ /(START.*?STOP)/gsm) { print "$1\n"; } # print the pattern inclusive of START and STOP
Or to capture between START and STOP
while($s =~ /START(.*?)STOP/gsm) { print "$1\n"; } # print the pattern between of START and STOP
So in your case the start pattern is 2014.10.31 and stop is BLOB_SMUGHO
while($s =~ /(2014\.10\.31.*?BLOB_SMUGHO)/gsm) {
print "$1\n";
}
NOTE: Regex modifiers in Perl come after the final /. Here I use /gsm: g for global matching (the loop finds successive matches by remembering the last position), s so that . also matches a newline, and m for multiline mode (^ and $ match at line boundaries).
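Note that a lazy START.*?STOP match begins at the first START it sees, so it can run across several date-stamped chunks before reaching the search word. A minimal sketch of an alternative, assuming every chunk begins on a line starting with the date stamp: split the slurped text in front of each stamp, keep the chunks that contain the word, and reverse them to get bottom-up order. The helper name chunks_from_bottom and the shortened sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the chunks of $text that contain $find,
# last chunk first. A chunk starts at a line beginning with $stamp.
sub chunks_from_bottom {
    my ($text, $stamp, $find) = @_;
    # Split in front of every line that starts with the stamp; the
    # lookahead keeps the stamp at the head of each chunk.
    my @chunks = split /^(?=\Q$stamp\E)/m, $text;
    return reverse grep { /\Q$find\E/ } @chunks;
}

# Shortened, made-up version of the sample file.
my $sample = join "\n",
    '2014.10.31 lots of',
    'gibberish',
    '2014.10.31',
    'this precious word BLOB_SMUGHO and',
    '2014.10.31 certain other',
    '2014.10.31',
    'this precious word BLOB_SMUGHO and',
    'which I will be interested in.',
    '';

my @found = chunks_from_bottom($sample, '2014.10', 'BLOB_SMUGHO');
print "--- chunk ---\n$_" for @found;
```

This reads the whole file into memory, so for very large files File::ReadBackwards with a fixed $line =~ /$find/ test may still be the better fit.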

Related

How to find all the words that begin with a|b and end with a|b. (Ex: “adverb” and “balalaika”)

The following perl program has a regex written to serve my purpose. But, this captures results present within a string too. How can I only get strings separated by spaces/newlines/tabs?
The test data I used is present below:
http://sainikhil.me/stackoverflow/dictionaryWords.txt
use strict;
use warnings;
sub print_a_b {
my $file = shift;
my $pattern = qr/(a|b|A|B)\S*(a|b|A|B)/;
open my $fp, $file;
my $cnt = 0;
while(my $line = <$fp>) {
if($line =~ $pattern) {
print $line;
$cnt = $cnt+1;
}
}
print $cnt;
}
print_a_b @ARGV;
You could consider using an anchor like \b (word boundary).
That restricts the regexp so that it can only start and end at the edge of a word.
\b(a|b|A|B)\S*(a|b|A|B)\b
Simpler, as Avinash Raj adds in the comments:
(?i)\b[ab]\S*[ab]\b
(using the case insensitive flag or modifier)
If you have multiple words in the same line then you can use word boundaries in a regex like this:
(?i)\b[ab][a-z]*[ab]\b
The pattern code is:
$pattern = qr/\b[ab][a-z]*[ab]\b/i;
However, if you want to match only lines that consist of a single such word, then you can use:
(?i)^[ab][a-z]*[ab]$
Update: regarding your comment about "lines that begin and end with the same character", you can use this regex:
(?i)\b([a-z])[a-z]*\1\b
But if you want any character and not letters only like above you can use:
(?i)\b(.)[a-z]*\1\b
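To see what the word boundaries change, here is a small sketch. The helper name ab_words and the test sentence are made up for illustration. It collects every whole word that begins and ends with a or b, ignoring case.

```perl
use strict;
use warnings;

# Hypothetical helper: return the whole words in $line that begin and
# end with "a" or "b" (case-insensitive).
sub ab_words {
    my ($line) = @_;
    # In list context a /g match returns all captured substrings.
    return $line =~ /\b([ab][a-z]*[ab])\b/gi;
}

my @words = ab_words("An adverb, a balalaika, a cabbage and a Bob");
print join(",", @words), "\n";   # prints "adverb,balalaika,Bob"
```

Without the \b anchors the pattern would also pick up "abba" inside "cabbage", which is the behaviour the question wanted to avoid. Note that a one-letter word such as "a" is not matched, because the pattern needs separate first and last characters.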

Perl grep a multi line output for a pattern

I have the below code where I am trying to grep for a pattern in a variable. The variable has a multiline text in it.
Multiline text in $output looks like this
_skv_version=1
COMPONENTSEQUENCE=C1-
BEGIN_C1
COMPONENT=SecurityJNI
TOOLSEQUENCE=T1-
END_C1
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
BEGIN_CMD_OPTIONS_RELEASE
-useideenv
The code I am using to grep for the pattern
use strict;
use warnings;
my $cmd_pattern = "CMD_ID=|CMD_USES_ASSET_ENV=";
my @matching_lines;
my $output = `cmd to get output` ;
print "output is : $output\n";
if ($output =~ /^$cmd_pattern(?:null_)?(\w+([\.]?\w+)*)/s ) {
print "1 is : $1\n";
push (@matching_lines, $1);
}
I am getting the multiline output as expected from $output but the regex pattern match which I am using on $output is not giving me any results.
Desired output
jdk1.7.0_80
ivy
jdk1.7.3_80
msdotnet_VS2013_x64
ant_1.7.1
Regarding your regular expression:
You need a while, not an if (otherwise you'll only be matching once); when you make this change you'll also need the /gc modifiers
You don't really need the /s modifier, as that one makes . match \n, which you're not making use of (see note at the end)
You want to use the /m modifier so that ^ matches the beginning of every new line, and not just the beginning of the string
You want to add \s* to your regular expression right after ^, because in at least one of your lines you have a leading space
You need parentheses around $cmd_pattern; otherwise, you're getting two alternatives, the first one being ^CMD_ID= and the second one being CMD_USES_ASSET_ENV= followed by the rest of your expression
You can also simplify the (\w+([\.]?\w+)*) bit down to (.+).
The result would be:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
print "1 is : $1\n";
push (@matching_lines, $1);
}
That being said, your regular expression still won't split ivy and jdk1.7.3_80 on its own; I would suggest adding a split and removing the null_ prefix with something like:
while ($output =~ /^\s*(?:$cmd_pattern)(?:null_)?(.+)/gcm ) {
my $text = $1;
my @text;
if ($text =~ /,/) {
@text = split /,(?:null_)?/, $text;
}
else {
@text = $text;
}
for (@text) {
print "1 is : $_\n";
push (@matching_lines, $_);
}
}
The only problem you're left with is the lone line CMD_ID=null. I'm gonna leave that to you :-)
(I recently wrote a blog post on best practices for regular expressions - http://blog.codacy.com/2016/03/30/best-practices-for-regular-expressions/ - you'll find there a note to always require the /s in Perl; the reason I mention here that you don't need it is that you're not using the ones you actually need, and that might mean you weren't certain of the meaning of /s)
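Putting the points above together, one possible complete version looks like the sketch below. It is not the only way to do it. The heredoc stands in for the real command output, the leading space on one line is an assumption based on the remark about leading whitespace, and treating a bare "null" value as something to skip is a guess at what the desired output implies.

```perl
use strict;
use warnings;

# Stand-in for the real command output (shortened).
my $output = <<'END';
CMD_ID=null
CMD_USES_ASSET_ENV=null_jdk1.7.0_80
 CMD_USES_ASSET_ENV=null_ivy,null_jdk1.7.3_80
BEGIN_C1_T1
CMD_ID=msdotnet_VS2013_x64
CMD_ID=ant_1.7.1
CMD_FILE=path/to/abcI.vc12.sln
END

my @matching_lines;
# /m lets ^ match at every line start; /g walks through all matches.
while ($output =~ /^[ \t]*(?:CMD_ID|CMD_USES_ASSET_ENV)=(.+)/gm) {
    for my $value (split /,/, $1) {
        $value =~ s/^null_//;        # drop the "null_" prefix
        next if $value eq 'null';    # assumption: skip the bare "null"
        push @matching_lines, $value;
    }
}
print "$_\n" for @matching_lines;
```

Stripping the prefix after the split, rather than inside the main pattern, keeps the regular expression short and handles the comma-separated case and the single-value case the same way.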

Printing first instance of match in each line of file (Perl)

I have the following in an executable .pl file:
#!/usr/bin/env perl
$file = 'TfbG_peaks.txt';
open(INFO, $file) or die("Could not open file.");
foreach $line (<INFO>) {
if ($line =~ m/[^_]*(?=_)/){
#print $line; #this prints lines, which means there are matches
print $1; #but this prints nothing
}
}
Based on my reading at http://goo.gl/YlEN7 and http://goo.gl/VlwKe, print $1; should print the first match in each line, but it doesn't. Help!
No, $1 should print the string saved by so-called capture groups (created by the bracketing construct - ( ... )). For example:
if ($line =~ m/([^_]*)(?=_)/){
print $1;
# now this will print something,
# unless the string begins with an underscore
# (which still matches the pattern, as * is read as 'zero or more instances')
# are you sure you don't need `+` here?
}
The pattern in your original code didn't have any capture groups, that's why $1 was empty (undef, to be precise) there. And (?=...) didn't count, as these were used to add a look-ahead subexpression.
$1 prints what the first capture ((...)) in the pattern captured.
Maybe you were thinking of
print $& if $line =~ /[^_]*(?=_)/; # BAD
or
print ${^MATCH} if $line =~ /[^_]*(?=_)/p; # 5.10+
But the following would be simpler (and work before 5.10):
print $1 if $line =~ /([^_]*)_/;
Note: You'll get a performance boost when the pattern doesn't match if you add a leading ^ or (?:^|_) (whichever is appropriate).
print $1 if $line =~ /^([^_]*)_/;
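As a small self-contained illustration of the capture-group version, the sketch below wraps the anchored pattern in a function. The name before_underscore and the sample strings are made up. It returns undef when a line has no underscore, so a stale $1 from an earlier match is never printed.

```perl
use strict;
use warnings;

# Hypothetical helper: text before the first underscore, or undef
# when the line contains no underscore at all.
sub before_underscore {
    my ($line) = @_;
    return $line =~ /^([^_]*)_/ ? $1 : undef;
}

for my $line ("TfbG_peak_1", "no underscore here") {
    my $prefix = before_underscore($line);
    print defined $prefix ? "$prefix\n" : "(no match)\n";
}
```

Only reading $1 right after a successful match matters here: capture variables keep their old values when a later match fails.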

Perl search and replace enters endless loop

I am trying to match and replace in multiple files some string using
local $/;
open(FILE, "<error.c");
$document=<FILE>;
close(FILE);
$found=0;
while($document=~s/([a-z_]+)\.h/$1_new\.h/gs){
$found=$found+1;
};
open(FILE, ">error.c");
print FILE "$document";
close(FILE);
It enters an endless loop, because the result of the substitution is matched again by the regular expression searched for. But shouldn't this be avoided by the s///g construct?
EDIT:
I found that a foreach loop will not do exactly what I want either (it will replace all occurrences, but print only one of them). The reason seems to be that Perl's substitution and search behave quite differently in the foreach() and while() constructs. To get a solution that replaces in multiple files and also prints each individual replacement, I came up with the following body:
# mandatory user inputs
my @files;
my $subs;
my $regex;
# additional user inputs
my $fileregex = '.*';
my $retval = 0;
my $opt_testonly=0;
foreach my $file (@files){
print "FILE: $file\n";
if(not($file =~ /$fileregex/)){
print "filename does not match regular expression for filenames\n";
next;
}
# read file
local $/;
if(not(open(FILE, "<$file"))){
print STDERR "ERROR: could not open file\n";
$retval = 1;
next;
};
my $string=<FILE>;
close(FILE);
my @locations_orig;
my @matches_orig;
my @results_orig;
# find matches
while ($string =~ /$regex/g) {
push #locations_orig, [ $-[0], $+[0] ];
push #matches_orig, $&;
my $result = eval("\"$subs\"");
push #results_orig, $result;
print "MATCH: ".$&." --> ".$result." #[".$-[0].",".$+[0]."]\n";
}
# reverse order
my @locations = reverse(@locations_orig);
my @matches = reverse(@matches_orig);
my @results = reverse(@results_orig);
# number of matches
my $length=$#matches+1;
my $count;
# replace matches
for($count=0;$count<$length;$count=$count+1){
substr($string, $locations[$count][0], $locations[$count][1]-$locations[$count][0]) = $results[$count];
}
# write file
if(not($opt_testonly) and $length>0){
open(FILE, ">$file"); print FILE $string; close(FILE);
}
}
exit $retval;
It first reads the file and creates lists of the matches, their positions and the replacement text (printing each match). Second, it replaces all occurrences starting from the end of the string (so that the positions of earlier matches do not change). Finally, if matches were found, it writes the string back to the file. It can surely be made more elegant, but it does what I want.
$1_new will still match ([a-z_]+). It enters an endless loop because you use while there. With the s///g construct, ONE iteration will replace EVERY occurrence in the string.
To count the replacements use:
$replacements = ($document =~ s/([a-z_]+)\.h/$1_new\.h/gs);
$replacements will contain the number of replaced matches (s/// returns that count directly; putting "() =" in front would always give 1).
If you essentially just want the matches, not the replacements:
@matches = $document =~ /([a-z_]+)\.h/gs;
You can then take $replacement = scalar @matches to obtain their count.
I'd say you're over-engineering this. I did this in the past with:
perl -i -p -e 's/([a-z_]+)\.h/$1_new\.h/g' error.c
This works correctly when the substituted string contains the matching pattern.
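The /g option is like a loop in itself. I think you want this:
while($document=~s/([a-z_]+)(?<!_new)\.h/$1_new\.h/s){
$found=$found+1;
};
Because you are replacing the match with itself and more, you need a negative lookbehind assertion, so that names which already end in _new are not matched again. (A lookahead placed before \.h would not help: at that point the next character is always the dot, never _new.)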
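For comparison, here is a minimal sketch that needs no loop at all: a single s///g pass rewrites every header name, and its return value is the number of substitutions. The sample text is invented, and the negative lookbehind (?<!_new) is an addition of this sketch so that running it twice does not produce names ending in _new_new.

```perl
use strict;
use warnings;

# Made-up file content standing in for error.c.
my $document = qq{#include "error.h"\n#include "util_io.h"\n#include "done_new.h"\n};

# s///g rewrites every match in one pass and returns the number of
# substitutions it made (an empty string if there were none, hence "|| 0").
# (?<!_new) skips names that already carry the suffix.
my $found = ($document =~ s/([a-z_]+)(?<!_new)\.h/${1}_new.h/g) || 0;

print "replaced $found\n";
print $document;
```

Writing ${1}_new instead of $1_new is only for readability; Perl parses both the same way, because numeric variable names consist of digits only.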

Regular expression in index function

I am looking for occurrence of "CCGTCAATTC(A|C)TTT(A|G)AGT" in a text file.
$text = 'CCGTCAATTC(A|C)TTT(A|G)AGT';
if ($line=~/$text/){
chomp($line);
$pos=index($line,$text);
}
Searching is working, but I am not able to get the position of "text" in line.
It seems index does not accept a regular expression as the substring.
How can I make this work?
Thanks
The @- array holds the offsets of the starting positions of the last successful match. The first element is the offset of the whole matching pattern, and subsequent elements are offsets of parenthesized subpatterns. So, if you know there was a match, you can get its offset as $-[0].
You don't need to use index at all, just a regex. The portion of $line that comes before your regex match will be stored in $` (or $PREMATCH if you've chosen to use English;). You can get the index of the match by checking the length of $`, and you can get the match itself from the $& (or $MATCH) variable:
$text = 'CCGTCAATTC(A|C)TTT(A|G)AGT';
if ($line =~ /$text/) {
$pos = length($PREMATCH); # needs "use English;", otherwise write length($`)
}
Assuming you want to get $pos to continue matching on the remaining part of $line, you can use the $' (or $POSTMATCH) variable to get the portion of $line that comes after the match.
See http://perldoc.perl.org/perlvar.html for detailed information on these special variables.
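A short sketch of the @- approach, which gives the same position as length($`) without touching the prematch variable. The helper name match_offset and the test sequence are made up, and the alternations are written as character classes.

```perl
use strict;
use warnings;

# Hypothetical helper: 0-based offset of the first match of $re in
# $line, or -1 when there is no match (mirroring what index returns).
sub match_offset {
    my ($line, $re) = @_;
    # $-[0] is the start offset of the last successful match.
    return $line =~ $re ? $-[0] : -1;
}

my $re   = qr/CCGTCAATTC[AC]TTT[AG]AGT/;
my $line = "GGGG" . "CCGTCAATTCATTTGAGT" . "TTTT";   # made-up sequence
print match_offset($line, $re), "\n";                # prints 4
```

The matching $+[0] entry holds the offset just past the end of the match, so $+[0] - $-[0] is the length of the matched text.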
Based on your comments, it seems like what you are after is matching the 50 characters directly following the match. So, a simple solution would be:
my ($match) = $line =~ /CCGTCAATTC[AC]TTT[AG]AGT(.{50})/;
As you see, [AG] is equivalent to A|G. If you wish to match multiple times, you can use an array @matches, and the /g global option on the regex. E.g.
my @matches = $line =~ /CCGTCAATTC[AC]TTT[AG]AGT(.{50})/g;
You can do this to keep the matching pattern:
my ($pattern, $match) = $line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{50})/g;
Or in a loop:
while ($line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{50})/g) {
my ($pattern, $match) = ($1, $2);
}
Note that there must be no ; inside the while condition. An earlier version of this loop had one after /g, and a commenter reported having a hard time finding the cause of the resulting errors.