Regular expression in index function - regex

I am looking for occurrence of "CCGTCAATTC(A|C)TTT(A|G)AGT" in a text file.
$text = 'CCGTCAATTC(A|C)TTT(A|G)AGT';
if ($line=~/$text/){
chomp($line);
$pos=index($line,$text);
}
Searching is working, but I am not able to get the position of "text" in line.
It seems index does not accepts a regular expression as substring.
How can I make this work.
Thanks

The #- array holds the offsets of the starting positions of the last successful match. The first element is the offset of the whole matching pattern, and subsequent elements are offsets of parenthesized subpatterns. So, if you know there was a match, you can get its offset as $-[0].

You don't need to use index at all, just a regex. The portion of $line that comes before your regex match will be stored in $` (or $PREMATCH if you've chosen to use English;). You can get the index of the match by checking the length of $`, and you can get the match itself from the $& (or $MATCH) variable:
$text = 'CCGTCAATTC(A|C)TTT(A|G)AGT';
if ($line =~ /$text/) {
$pos = length($PREMATCH);
}
Assuming you want to get $pos to continue matching on the remaining part of $line, you can use the $' (or $POSTMATCH) variable to get the portion of $line that comes after the match.
See http://perldoc.perl.org/perlvar.html for detailed information on these special variables.

Based on your comments, it seems like what you are after is matching the 50 characters directly following the match. So, a simple solution would be:
my ($match) = $line =~ /CCGTCAATTC[AC]TTT[AG]AGT(.{50})/;
As you see, [AG] is equivalent to A|G. If you wish to match multiple times, you can use an array #matches, and the /g global option on the regex. E.g.
my #matches = $line =~ /CCGTCAATTC[AC]TTT[AG]AGT(.{50})/g;
You can do this to keep the matching pattern:
my ($pattern, $match) = $line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{50})/g;
Or in a loop:
while ($line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{50})/g;) {
my ($pattern, $match) = ($1, $2);
}

while ($line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{50})/g;) {
I like it, but no ; in while.
I had hard times to search for the reason of errors. T_T.

Related

Replace only the second occurance of string in a line in perl regex

I have a string like "ven|ven|vett|vejj|ven|ven". Treat each "|" delimiter for each column.
By splitting the string with "|" saving all the columns in array and reading each column into $str
So, I'm trying to do this as
$string =~ s/$str/venky/g if $str =~ /ven/i; # it will do globally.
Which not met the requirement.
On-demand basis, I need to replace string at the particular number of occurrence of the string.
For example, I've a request to change 2nd occurrence of "ven" to venky.
Then how can I met this requirement simply? Is it some-thing like
$string =~ s/ven/venky/2;
As of my knowledge we have 'o' for replace once and 'g' for globally. I'm struggling for the solution to get the replacement at particular occurrence. And I should not use pos() to get the position, because string keeps on change. It becomes difficult to trace it every-time. That's my intention.
Please help me on this regard.
There is no flag that you can add to the regex that will do this.
The easiest way would be to split and loop. However, if you insist to use one regex, it is doable:
/^(?:[^v]|v[^e]|ve[^n])*ven(?:[^v]|v[^e]|ve[^n])*\Kven/
If you want to replace the Nth occurrence instead of the second, you can do:
/^(?:(?:[^v]|v[^e]|ve[^n])*ven){N-1}(?:[^v]|v[^e]|ve[^n])*\Kven/
The general idea:
(?:[^v]|v[^e]|ve[^n])* - matches any string that isn't part of ven
\K is a cool matcher that drops everything matched so far, so you can sort of use it as a lookbehind with variable length
Currently you're replacing every instance of'ven' with 'venky' if your string contains a match for ven, which of course it does.
What I assume you're trying to do is to substitute 'ven' for 'venky' within your string if it's the second element:
my $string = 'ven|ven|vett|vejj|ven|ven';
my #elements = split(/\|/, $string);
my $count;
foreach (#elements){
$count++;
s/$_/venky/g if /ven/i and $count == 2;
}
print join('|', #elements);
print "\n";
Your approach was already pretty good. What you described makes sense, but I think you are having trouble implementing it.
I created a function to do the work. It takes 4 arguments:
$string is the string we want to work on
$n is the nth occurance you want to replace
$needle is the thing you want to replace – thing needle in a haystack
Note that right now we allow to pass stuff that might contain regular expressions. So you would have to use quotemeta on it or match with /\Q$needle\E/
$replacement is the replacement for the $needle
The idea is to split up the string, then check each element if it matches the pattern ($needle) and keep track of how many have matched. If the nth one is reached, replace it and stop processing. Then put the string back together.
use strict;
use warnings;
use feature 'say';
say replace_nth_occurance("ven|ven|vett|vejj|ven|ven", 2, 'ven', 'venky');
sub replace_nth_occurance {
my ($string, $n, $needle, $replacement) = #_;
# take the string appart
my #elements = split /\|/, $string;
my $count = 0; # keep track of ...
foreach my $e (#elements) {
$count++ if $e =~ m/$needle/; # ... how many matches we've found
if ($count == $n) {
$e =~ s/$needle/$replacement/; # replace
last; # and stop processing
}
}
# put it back into the pipe-separated format
return join '|', #elements;
}
Output:
ven|venky|vett|vejj|ven|ven
To replace the n'th occurrence of "ven" to "venky":
my $n = 3;
my $test = "seven given ravens";
$test =~ s/ven/--$n == 0 ? "venky" : $&/eg;
This uses the ability with the /e flag to specify the substitution part as an expression.

Perl how do you assign a varanble to a regex match result

How do you create a $scalar from the result of a regex match?
Is there any way that once the script has matched the regex that it can be assigned to a variable so it can be used later on, outside of the block.
IE. If $regex_result = blah blah then do something.
I understand that I should make the regex as non-greedy as possible.
#!/usr/bin/perl
use strict;
use warnings;
# use diagnostics;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Outlook';
my #Qmail;
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\\s\*owner \#/";
my $outlook = Win32::OLE->new('Outlook.Application')
or warn "Failed Opening Outlook.";
my $namespace = $outlook->GetNamespace("MAPI");
my $folder = $namespace->Folders("test")->Folders("Inbox");
my $items = $folder->Items;
foreach my $msg ( $items->in ) {
if ( $msg->{Subject} =~ m/^(.*test alert) / ) {
my $name = $1;
print " processing Email for $name \n";
push #Qmail, $msg->{Body};
}
}
for(#Qmail) {
next unless /$regex|^\s*description/i;
print; # prints what i want ie lines that start with owner and description
}
print $sentence; # prints ^\\s\*offense \ # not lines that start with owner.
One way is to verify a match occurred.
use strict;
use warnings;
my $str = "hello what world";
my $match = 'no match found';
my $what = 'no what found';
if ( $str =~ /hello (what) world/ )
{
$match = $&;
$what = $1;
}
print '$match = ', $match, "\n";
print '$what = ', $what, "\n";
Use Below Perl variables to meet your requirements -
$` = The string preceding whatever was matched by the last pattern match, not counting patterns matched in nested blocks that have been exited already.
$& = Contains the string matched by the last pattern match
$' = The string following whatever was matched by the last pattern match, not counting patterns matched in nested blockes that have been exited already. For example:
$_ = 'abcdefghi';
/def/;
print "$`:$&:$'\n"; # prints abc:def:ghi
The match of a regex is stored in special variables (as well as some more readable variables if you specify the regex to do so and use the /p flag).
For the whole last match you're looking at the $MATCH (or $& for short) variable. This is covered in the manual page perlvar.
So say you wanted to store your last for loop's matches in an array called #matches, you could write the loop (and for some reason I think you meant it to be a foreach loop) as:
my #matches = ();
foreach (#Qmail) {
next unless /$regex|^\s*description/i;
push #matches_in_qmail $MATCH
print;
}
I think you have a problem in your code. I'm not sure of the original intention but looking at these lines:
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\s*owner #/";
I'll step through that as:
Assign $regexto the string ^\s*owner #.
Assign $sentence to value of running a match within $regex with the regular expression /^s*owner $/ (which won't match, if it did $sentence will be 1 but since it didn't it's false).
I think. I'm actually not exactly certain what that line will do or was meant to do.
I'm not quite sure what part of the match you want: the captures, or something else. I've written Regexp::Result which you can use to grab all the captures etc. on a successful match, and Regexp::Flow to grab multiple results (including success statuses). If you just want numbered captures, you can also use Data::Munge
You can do the following:
my $str ="hello world";
my ($hello, $world) = $str =~ /(hello)|(what)/;
say "[$_]" for($hello,$world);
As you see $hello contains "hello".
If you have older perl on your system like me, perl 5.18 or earlier, and you use $ $& $' like codequestor's answer above, it will slow down your program.
Instead, you can use your regex pattern with the modifier /p, and then check these 3 variables: ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} for your matching results.

Find multiline blocks

I'm trying to find occurrences of BLOB_SMUGHO, from the file test.out from the bottom of the file. If found, return a chunk of data which I'm interested in between the string "2014.10"
I'm getting Use of uninitialized value $cc in pattern match (m//) at
Whats is wrong with this script?
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);
use File::ReadBackwards;
my $find = "BLOB_SMUGHO";
my $chnkdelim = "\n[" . strftime "%Y.%m", localtime;
my $fh = File::ReadBackwards->new('test.out', $chnkdelim, 0) or die "err-file: $!\n";
while ( defined(my $line = $fh->readline) ) {
if(my $cc =~ /$find/){
print $cc;
}
}
close($fh);
In case if this helps, here is a sample content of test.out
2014.10.31 lots and
lots of
gibbrish
2014.10.31 which I'm not
interested
in. It
also
2014.10.31 spans
across thousands of
lines and somewhere in the middle there will be
2014.10.31
this precious word BLOB_SMUGHO and
2014.10.31 certain other
2014.10.31 words
2014.10.31
this precious word BLOB_SMUGHO and
2014.10.31
this precious word BLOB_SMUGHO and
which
I
will
be
interested
in.
And I'm expecting to capture all the multiple occurrences of the chunk of the text from bottom of the file.
2014.10.31
this precious word BLOB_SMUGHO and
First, you have written your match incorrectly due to misunderstanding the =~ operator:
if(my $cc =~ /$find/){ # incorrect, like saying if(undef matches something)
If you want to match what is in $line against the pattern between /.../ then do:
if($line =~ /$find/) {
The match operator expects a value on left side as well as right side. you were using it like an assignment operator.
If you need to capture the match(es) into a variable or list, then add it to the left of an equal sign:
if(my ($cc) = $line =~ /$find/) { <-- wrap $cc in () for list context
By the way, I think you are better off writing:
if($line =~ /$find/) {
print $line;
or if you want to print what you matched only
print $0;
Since you aren't capturing a substring, it doesnt really matter here.
Now, as to how to match everything between two patterns, the task is easier if you don't match line by line, but match across newlines using the /s modifier.
In Perl, you can set the record separator to undef and use slurp mode.
local $/ = undef;
my $s = <>; # read all lines into $s
Now to scan $s for patterns
while($s =~ /(START.*?STOP)/gsm) { print "$1\n"; } # print the pattern inclusive of START and STOP
Or to capture between START and STOP
while($s =~ /START(.*?)STOP/gsm) { print "$1\n"; } # print the pattern between of START and STOP
So in your case the start pattern is 2014.10.31 and stop is BLOB_SMUGHO
while($s =~ /(2014\.10\.31.*?BLOB_SMUGHO)/gsm) {
print "$1\n";
}
NOTE: Regex modifiers in Perl come after the last / so if you see I use /gsm for multiline, match newline, and global matching (get multiple matches in a loop by remembering the last location).

How can I extract a substring up to the first digit?

How can I find the first substring until I find the first digit?
Example:
my $string = 'AAAA_BBBB_12_13_14' ;
Result expected: 'AAAA_BBBB_'
Judging from the tags you want to use a regular expression. So let's build this up.
We want to match from the beginning of the string so we anchor with a ^ metacharacter at the beginning
We want to match anything but digits so we look at the character classes and find out this is \D
We want 1 or more of these so we use the + quantifier which means 1 or more of the previous part of the pattern.
This gives us the following regular expression:
^\D+
Which we can use in code like so:
my $string = 'AAAA_BBBB_12_13_14';
$string =~ /^\D+/;
my $result = $&;
Most people got half of the answer right, but they missed several key points.
You can only trust the match variables after a successful match. Don't use them unless you know you had a successful match.
The $&, $``, and$'` have well known performance penalties across all regexes in your program.
You need to anchor the match to the beginning of the string. Since Perl now has user-settable default match flags, you want to stay away from the ^ beginning of line anchor. The \A beginning of string anchor won't change what it does even with default flags.
This would work:
my $substring = $string =~ m/\A(\D+)/ ? $1 : undef;
If you really wanted to use something like $&, use Perl 5.10's per-match version instead. The /p switch provides non-global-perfomance-sucking versions:
my $substring = $string =~ m/\A\D+/p ? ${^MATCH} : undef;
If you're worried about what might be in \D, you can specify the character class yourself instead of using the shortcut:
my $substring = $string =~ m/\A[^0-9]+/p ? ${^MATCH} : undef;
I don't particularly like the conditional operator here, so I would probably use the match in list context:
my( $substring ) = $string =~ m/\A([^0-9]+)/;
If there must be a number in the string (so, you don't match an entire string that has no digits, you can throw in a lookahead, which won't be part of the capture:
my( $substring ) = $string =~ m/\A([^0-9]+)(?=[0-9])/;
$str =~ /(\d)/; print $`;
This code print string, which stand before matching
perl -le '$string=q(AAAA_BBBB_12_13_14);$string=~m{(\D+)} and print $1'
AAAA_BBBB_

How to user Perl to find all instances of strings that match a regexp?

How to user Perl to find and print all strings that match a regexp?
The following only finds the first match.
$text="?Adsfsadfgaasdf.
?Bafadfdsaadsfadsf.
xcxvfdgfdg";
if($text =~ m/\\?([^\.]+\.)/) {
print "$1\n";
}
EDIT1: /g doesn't work
#!/usr/bin/env perl
$text="?Adsfsadfgaasdf.
?Bafadfdsaadsfadsf.
xcxvfdgfdg";
if($text =~ m/\\?([^\.]+\.)/g) {
print "$1\n";
}
$ ./test.pl
?Adsfsadfgaasdf.
The problem is that the /g modifier does not use capture groups for multiple matches. You need to either iterate over the matches in scalar context, or catch the returned list in list context. For example:
use v5.10; # required for say()
$text="?Adsfsadfgaasdf.
?Bafadfdsaadsfadsf.
xcxvfdgfdg";
while ($text =~ /\?([^.]+\.)/g) { # scalar context
say $1;
}
for ($text =~ /\?[^.]+\./g) { # list context
say; # match is held in $_
}
Note in the second case, I skipped the parens, because in list context the whole match is returned if there are no parens. You may add parens to select part of the string.
Your version, using if, uses scalar context, which saves the position of the most recent match, but does not continue. A way to see what happens is:
if($text =~ m/\?([^\.]+\.)/g) {
print "$1\n";
}
say "Rest of string: ", substr $text, pos;
pos gives the position of the most recent match.
In previous answer #TLP correctly wrote that matching should be in list context.
use Data::Dumper;
$text="?Adsfsadfgaasdf.
?Bafadfdsaa.
dsfadsf.
xcxvfdgfdg";
#arr = ($text =~ /\?([^\.]+\.)/g);
print Dumper(#arr);
Expected result:
$VAR1 = 'Adsfsadfgaasdf.';
$VAR2 = 'Bafadfdsaa.';
You seem to be missing the /g flag, which tells perl to repeat the match as many times as possible.