Unable to match string between | in Perl - regex

I have a text read from a fasta file and I am trying to read the accession number in Perl. But I am not getting an output. Here's the code:
use strict;
use warnings;
sub main {
my $file = "PXXXXX.fasta";
if(!open(FASTA, $file)) {
die "Could not find $file\n";
}
my $myLine = <FASTA>;
my $pat = "|";
my #Num = $myLine =~ /$pat(.*?)$pat/;
print($Num[0]);
close(FASTA);
}
main();
The content of the FASTA filehandle is:
sp|P27455|MOMP_CHLPN Major outer membrane porin OS=Chlamydia pneumoniae OX=83558 GN=ompA PE=2 SV=1
MKKLLKSALLSAAFAGSVGSLQALPVGNPSDPSLLIDGTIWEGAAGDPCDPCATWCDAIS
LRAGFYGDYVFDRILKVDAPKTFSMGAKPTGSAAANYTTAVDRPNPAYNKHLHDAEWFTN
AGFIALNIWDRFDVFCTLGASNGYIRGNSTAFNLVGLFGVKGTTVNANELPNVSLSNGVV
ELYTDTSFSWSVGARGALWECGCATLGAEFQYAQSKPKVEELNVICNVSQFSVNKPKGYK
GVAFPLPTDAGVATATGTKSATINYHEWQVGASLSYRLNSLVPYIGVQWSRATFDADNIR
IAQPKLPTAVLNLTAWNPSLLGNATALSTTDSFSDFMQIVSCQINKFKSRKACGVTVGAT
LVDADKWSLTAEARLINERAAHVSGQFRF
Any clue how to fix the code to return: P27455 ?

The pipe | holds a special meaning in regular expressions. You need to escape it. The easiest way to do that is by using \Q and \E.
$myLine =~ /\Q$pat\E(.*?)\Q$pat\E/;
Or you could use the quotemeta built-in.
my $pat = quotemeta "|";
my #Num = $myLine =~ /$pat(.*?)$pat/; # or use [^$pat]+
You can also just not use a regular expression search and simply split the line. If you always want the second column, this will do just as well.
my (undef, $num) = split /\|/, $line;

Looks like you are trying to split the line on the | character, so use the split function.
my #Num = split /\|/, $myLine;
This splits $myLine on |. Note that you may have to change the index on #Num to get the correct item out of it.

Related

How to find all the words that begin with a|b and end with a|b. (Ex: “adverb” and “balalaika”)

The following perl program has a regex written to serve my purpose. But, this captures results present within a string too. How can I only get strings separated by spaces/newlines/tabs?
The test data I used is present below:
http://sainikhil.me/stackoverflow/dictionaryWords.txt
use strict;
use warnings;
sub print_a_b {
my $file = shift;
$pattern = qr/(a|b|A|B)\S*(a|b|A|B)/;
open my $fp, $file;
my $cnt = 0;
while(my $line = <$fp>) {
if($line =~ $pattern) {
print $line;
$cnt = $cnt+1;
}
}
print $cnt;
}
print_a_b #ARGV;
You could consider using an anchor like \b: word boundary
That would help apply the regexp only after and before a word.
\b(a|b|A|B)\S*(a|b|A|B)\b
Simpler, as Avinash Raj adds in the comments:
(?i)\b[ab]\S*[ab]\b
(using the case insensitive flag or modifier)
If you have multiple words in the same line then you can use word boundaries in a regex like this:
(?i)\b[ab][a-z]*[ab]\b
The pattern code is:
$pattern = /\b[ab][a-z]*[ab]\b/i;
However, if you want to check for lines with only has a word, then you can use:
(?i)$[ab][a-z]*[ab]$
Update: for your comment * lines that begin and end with the same character*, you can use this regex:
(?i)\b([a-z])[a-z]*\1\b
But if you want any character and not letters only like above you can use:
(?i)\b(.)[a-z]*\1\b

Load regex from file and match groups with it in Perl

I have a file containing regular expressions, e.g.:
City of (.*)
(.*) State
Now I want to read these (line by line), match them against a string, and print out the extraction (matched group). For example: The string City of Berlin should match with the first expression City of (.*) from the file, after that Berlin should be extracted.
This is what I've got so far:
use warnings;
use strict;
my #pattern;
open(FILE, "<pattern.txt"); # open the file described above
while (my $line = <FILE>) {
push #pattern, $line; # store it inside the #pattern variable
}
close(FILE);
my $exampleString = "City of Berlin"; # line that should be matched and
# Berlin should be extracted
foreach my $p (#pattern) { # try each pattern
if (my ($match) = $exampleString =~ /$p/) {
print "$match";
}
}
I want Berlin to be printed.
What happens with the regex inside the foreach loop?
Is it not compiled? Why?
Is there even a better way to do this?
Your patterns contain a newline character which you need to chomp:
while (my $line = <FILE>) {
chomp $line;
push #pattern, $line;
}
First off - chomp is the root of your problem.
However secondly - your code is also very inefficient. Rather than checking patterns in a foreach loop, consider instead compiling a regex in advance:
#!/usr/bin/env perl
use strict;
use warnings;
# open ( my $pattern_fh, '<', "pattern.txt" ) or die $!;
my #patterns = <DATA>;
chomp(#patterns);
my $regex = join( '|', #patterns );
$regex = qr/(?:$regex)/;
print "Using regex of: $regex\n";
my $example_str = 'City of Berlin';
if ( my ($match) = $example_str =~ m/$regex/ ) {
print "Matched: $match\n";
}
Why is this better? Well, because it scales more efficiently. With your original algorithm - if I have 100 lines in the patterns file, and 100 lines to check as example str, it means making 10,000 comparisons.
With a single regex, you're making one comparison on each line.
Note - normally you'd use quotemeta when reading in regular expressions, which will escape 'meta' characters. We don't want to do this in this case.
If you're looking for even more concise, you can use map to avoid needing an intermediate array:
my $regex = join( '|', map { chomp; $_ } <$pattern_fh> );
$regex = qr/(?:$regex)/;
print "Using regex of: $regex\n";
my $example_str = 'City of Berlin';
if ( my ($match) = $example_str =~ m/$regex/ ) {
print "Matched: $match\n";
}

Print line of pattern match in Perl regex

I am looking for a keyword in a multiline input using a regex like this,
if($input =~ /line/mi)
{
# further processing
}
The data in the input variable could be like this,
this is
multi line text
to be matched
using perl
The code works and matches the keyword line correctly. However, I would also like to obtain the line where the pattern was matched - "multi line text" - and store it into a variable for further processing. How do I go about this?
Thanks for the help.
You can grep out the lines into an array, which will then also serve as your conditional:
my #match = grep /line/mi, split /\n/, $input;
if (#match) {
# ... processing
}
TLP's answer is better but you can do:
if ($input =~ /([^\n]+line[^\n]+)/i) {
$line = $1;
}
I'd look if the match is in the multiline-String and in case it is, split it into lines and then look for the correct index number (starting with 0!):
#!/usr/bin/perl
use strict;
use warnings;
my $data=<<END;
this is line
multi line text
to be matched
using perl
END
if ($data =~ /line/mi){
my #lines = split(/\r?\n/,$data);
for (0..$#lines){
if ($lines[$_] =~ /line/){
print "LineNr of Match: " . $_ . "\n";
}
}
}
Did you try his?
This works for me. $1 represents the capture of regex inside ( and )
Provided there is only one match in one of the lines.If there are matches in multiple lines, then only the first one will be captured.
if($var=~/(.*line.*)/)
{
print $1
}
If you want to capture all the lines which has the string line then use below:
my #a;
push #a,$var=~m/(.*line.*)/g;
print "#a";

perl regular expression finding pattern only in the front of the text

Suppose there is a text like this:
|-SAMPLE-D2
|---SAMPLE-D1
|---SAMPLE3
I want to count the number of "-" after |.
I tried to parse that by using the following regular expression in perl
$count=()= /-/g;
but this is problematic because the first two has "-" somewhere else in the text as well as in the front. How should I form my regex or use other function in perl to get the number of "-" right after "|"?
Regex to match the dashes after the starting |:
/^\|([\-]*)/
To count dashes that are not preceded by a letter, use a negative look-behind assertion.
$count = () = /(?<!\w)-/g
If the vertical line only ever comes at the start you can get the string of repeating minuses with:
my ($match) = $txt =~ /^\|(-*)/;
The brackets around $match cause the captured portion of the regex to be put into it
then get the number of minuses using
my $minus_count = length($match || '');
The
|| '')
bit
Initialises $match if the regex above found no matches at all, to stop length moaning about uninitialised variables (if you have warnings on)
Not sure if you can count in Regex directly but you can extract capture groups and do a simple arithmetic with their string lengths:
#!/usr/bin/perl
use warnings;
my $inFile = $ARGV[0];
open(FILEHANDLE, "<", $inFile) || die("Could not open file ".$inFile);
my #fileLines = <FILEHANDLE>;
my $lineNo = 0;
my $rslt;
foreach my $line(#fileLines) {
chomp($line);
$line =~ s/^\s+//;
$line =~ s/\s+$//;
$lineNo++;
print "\n".$lineNo." = <".$line.">";
if($line =~ m/^\|-+(.+)/) {
my $text = $1;
print "\n\ttext = <".$text.">";
my $minCnt = length($line) - length($text) - 1;
print "\n\tminus count = <".$minCnt.">";
}
}
close(FILEHANDLE);

How do I do optional matching in a regular expression using Perl?

I want to extract the size value from a string. The string can be be formatted in one of two ways:
Data-Size: (2000 bytes)
or
file Data-Size: (2082 bytes)
If the string is present in a file, it will appear only once.
So far I have:
#!/usr/bin/perl
use strict;
use warnings;
open FILE, "</tmp/test";
my $input = do { local $/; <FILE> };
my ($length) = $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/m;
$length or die "could not get data length\n";
print "length: $length\n";
The problem seems to be with making the word file optional. I thought I could do this with:
(file)?
But this seems to be stopping matches when the word file is not present. Also when the word file is there it sets $length to the string "file". I think this is because the parenthesis around file also mean extraction.
So how do I match either of the two strings and extract the size value?
You want the second capture in $length. To do that, you could use
my (undef, $length) = $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/;
or
my $length = ( $input =~ /(file)?\s*Data-Size: \((\d+) bytes\)/ )[1];
But a much better approach would be to avoid capturing something you're not interested in capturing.
my ($length) = $input =~ /(?:file)?\s*Data-Size: \((\d+) bytes\)/;
Of course, you'd get the same result from
my ($length) = $input =~ /Data-Size: \((\d+) bytes\)/;
By the way, I removed the needless /m. /m changes the meaning of ^ and $, yet neither are present in the pattern.
Just my 2 cents, you can make optional matching other way:
/(file|)\s*Data-Size: ((\d+) bytes)/