Having trouble capturing storing strings in Perl regular expression? - regex

So I've been working around with this problem for a while now.
I have a file with one hundred FASTA sequences arranged like this:
>gi|192567|gb|AAA37417.1| cystic fibrosis transmembrane conductance regulator [Mus musculus]
MQKSPLEKASFISKLFFSWTTPILRKGYRHHLELSDIYQAPSADSADHLSEKLEREWDREQASKKNPQLIHALRRCFFWRFLFYGILLYLGEVTKAVQPVLLGRIIASYDPENKVERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHRIGMQMRTAMFSLIYKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFIWIAPLQVTLLMGLLWDLLQFSAFCGLGLLIILVIFQAILGKMMVKYRDQRAAKINERLVITSEIIDNIYSVKAYCWESAMEKMIENLREVELKMTRKAAYMRFFTSSAFFFSGFFVVFLSVLPYTVINGIVLRKIFTTISFCIVLRMSVTRQFPTAVQIWYDSFGMIRKIQDFLQKQEYKVLEYNLMTTGIIMENVTAFWEEGFGELLQKAQQSNGDRKHSSDENNVSFSHLCLVGNPVLKNINLNIEKGEMLAITGSTGLGKTSLLMLILGELEASEGIIKHSGRVSFCSQFSWIMPGTIKENIIFGVSYDEYRYKSVVKACQLQQDITKFAEQDNTVLGEGGVTLSGGQRARISLARAVYKDADLYLLDSPFGYLDVFTEEQVFESCVCKLMANKTRILVTSKMEHLRKADKILILHQGTSYFYGTFSELQSLRPSFSSKLMGYDTFDQFTEERRSSILTETLRRFSVDDSSAPWSKPKQSFRQTGEVGEKRKNSILNSFSSVRKISIVQKTPLCIDGESDDLQEKRLSLVPDSEQGEAALPRSNMIATGPTFPGRRRQSVLDLMTFTPNSGSSNLQRTRTSIRKISLVPQISLNEVDVYSRRLSQDSTLNITEEINEEDLKECFLDDVIKIPPVTTWNTYLRYFTLHKGLLLVLIWCVLVFLVEVAASLFVLWLLKNNPVNSGNNGTKISNSSYVVIITSTSFYYIFYIYVGVADTLLALSLFRGLPLVHTLITASKILHRKMLHSILHAPMSTISKLKAGGILNRFSKDIAILDDFLPLTIFDFIQLVFIVIGAIIVVSALQPYIFLATVPGLVVFILLRAYFLHTAQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFRRQTYFETLFHKALNLHTANWFMYLATLRWFQMRIDMIFVLFFIVVTFISILTTGEGEGTAGIILTLAMNIMSTLQWAVNSSIDTDSLMRSVSRVFKFIDIQTEESMYTQIIKELPREGSSDVLVIKNEHVKKSDIWPSGGEMVVKDLTVKYMDDGNAVLENISFSISPGQRVGLLGRTGSGKSTLLSAFLRMLNIKGDIEIDGVSWNSVTLQEWRKAFGVITQKVFIFSGTFRQNLDPNGKWKDEEIWKVADEVGLKSVIEQFPGQLNFTLVDGGYVLSHGHKQLMCLARSVLSKAKIILLDEPSAHLDPITYQVIRRVLKQAFAGCTVILCEHRIEAMLDCQRFLVIEESNVWQYDSLQALLSEKSIFQQAISSSEKMRFFQGRHSSKHKPRTQITALKEETEEEVQETRL
I've written a subroutine that opens the file, and reads each sequence one at a time. For each sequence I would like to add both the gi number at the beginning and the long sequence in capital letters as strings to a growing array. However, I'm having trouble writing a regular expression to store these values. Here is my current subroutine, which I tweaked to see if I was actually storing the gi number:
sub getFASTA {
my ($filename) = #_;
my #FASTA_arr;
$/ = "\n\n";
open (my $fh, '<', $filename) or
die ("Could not open file: $filename");
while (<$fh>) {
chomp $_;
$_ =~ /^>gi|(\d*?)|/s;
say "$1";
}
close $fh;
#say join(" ", #FASTA_arr);
}
However, trying to run this returns:
Use of uninitialized value $1 in string at sequenceAlignment.pl line 30, <$fh> chunk 1.
This is returned for each sequence, so 100 times in total.
So any idea of what is wrong? I'm almost certain that it is a problem with the regular expression, because when I changed it to "$_ =~ /(>gi|)/s;", it worked correctly, just with 100 ">gi|"s printing out.

| means OR in a regex. Escape it. (Seems like perl figured out what you "really" meant when it was at the end of the capture group and didn't have a 2nd operand)

Related

How to capture multiple words using regex on this particular text?

I'm trying to extract the best paying job titles from this sample text:
Data Scientist
#1 in Best Paying Jobs
5,100 Projected Jobs $250,000 Median Salary 0.5% Unemployment Rate
Programmer
#2 in Best Paying Jobs
4,000 Projected Jobs $240,000 Median Salary 1.0% Unemployment Rate
SAP Module Consultant
#3 in Best Paying Jobs
3,000 Projected Jobs $220,000 Median Salary 0.2% Unemployment Rate
by using the following regex and Perl code.
use File::Glob;
local $/ = undef;
my $file = #ARGV[0];
open INPUT, "<", $file
or die "Couldn't open file $!\n";
my $content = <INPUT>;
my $regex = "^\w+(\w+)*$\n\n#(\d+)";
my #arr_found = ($content =~ m/^\w+(\w+)*$\n\n#(\d+)/g);
close (INPUT);
Q1: The regex finds only the one-word titles*. How to make it find the multiple word titles and how to forward (i.e. how to properly capture) those found titles into the Perl array?
Q2: I defined the regex into a Perl variable and tried to use that variable for the regex operation like:
my #arr_found = ($content =~ m/"$regex"/g);
but it gave error. How to make it?
* When I apply the regex ^\w+(\w+)*$\n\n#(\d+) on Sublime Text 2, it finds only the one word titles.
Why not process line-by-line, simple and easy
use warnings;
use strict;
use feature 'say';
my $file = shift || die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my (#jobs, $prev_line);
while (my $line = <$fh>) {
chomp $line;
next if not $line =~ /\S/;
if ($line =~ /^\s*#[0-9]/) {
push #jobs, $prev_line;
}
$prev_line = $line;
}
say for #jobs;
This relies on the requirement that the #N line is the first non-empty line after the jobs title.
It prints
Data Scientist
Programmer
SAP Module Consultant
The question doesn't say whether rankings are wanted as well but there is a hint in the regex that they may be. Then, assuming that the ordering in the file is "correct" you can iterate over the array indices and print elements (titles) with their indices (rank).
Or, to be certain, capture them in the regex, /^\s*#([0-9]+)/. Then you can directly print both the title and its rank, or perhaps store them in a hash with key-value pairs rank => title.
As for the regex, there are a few needed corrections. To compose a regex ahead of matching, what is a great idea, you want the qr operator. To work with multi-line strings you need the /m modifier. (See perlretut.) The regex itself needs fixing. For example
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*[0-9]/m;
my #titles = $content =~ /$regex/g
what captures a line followed by at least one empty line and then #N on another line.
If the ranking of titles is needed as well then capture it, too, and store in a hash
my $regex = qr/^(.+)?(?:\n\s*)+\n\s*#\s*([0-9]+)/m;
my %jobs = reverse $content =~ /$regex/g;
or maybe better not push it with reverse-ing the list of matches but iterate through pairs instead
my %jobs;
while ($content =~ /$regex/g) {
$jobs{$2} = $1;
}
since with this we can check our "catch" at each iteration, do other processing, etc. Then you can sort the keys to print in order
say "#$_ $jobs{$_}" for sort { $a <=> $b } keys %jobs;
and just in general pick jobs by their rank as needed.
I think that it's fair to say that the regex here is much more complex than the first program.
Answers for your questions:
you are capturing the second word only and you do not allow for space in between them. That's why it won't match e.g. Data Scientist
use the qr// operator to compile regexes with dynamic content. The error stems from the $ in the middle of the regex which Perl regex compiler assumes you got wrong, because $ should only come at the end of a regex.
The following code should achieve what you want. Note the two-step approach:
Find matching text
beginning of a line (^)
one-or-more words separated by white space (\w+(?:\s+\w+)*, no need to capture match)
2 line ends (\n\n)
# followed by a number (\d+)
apply regex multiple times (/g) and treat strings as multiple lines (/m, i.e. ^ will match any beginning of a line in the input text)
Split match at line ends (\n) and extract the 1st and the 3rd field
as we know $match will contain three lines, this approach is much easier than writing another regex.
#!/usr/bin/perl
use strict;
use warnings;
use feature qw(say);
use File::Slurper qw(read_text);
my $input = read_text($ARGV[0])
or die "slurp: $!\n";
my $regex = qr/^(\w+(?:\s+\w+)*\n\n#\d+)/m;
foreach my $match ($input =~ /$regex/g) {
#say $match;
my($title, undef, $rank) = split("\n", $match);
$rank =~ s/^#//;
say "MATCH '${title}' '${rank}'";
}
exit 0;
Test run over the example text you provided in your question.
$ perl dummy.pl dummy.txt
MATCH 'Data Scientist' '1'
MATCH 'Programmer' '2'
MATCH 'SAP Module Consultant' '3'
UNICODE UPDATE: as suggested by #Jan's answer the code can be improved like this:
my $regex = qr/^(\w+(?:\s+\w+)*\R\R#\d+)/m;
...
my($title, undef, $rank) = split(/\R/, $match);
That is probably the more generic approach, as UTF-8 is the default for File::Slurper::read_text() anyway...
You were not taking whitespaces (as in Data Scientist) into account:
^\w+.*$\R+#(\d+)
See a demo on regex101.com.
\R is equal to (?>\r\n|\n|\r|\f|\x0b|\x85) (matches Unicode newlines sequences).

perl regex: searching thru entire line of file

I'm a regex newbie, and I am trying to use a regex to return a list of dates from a text file. The dates are in mm/dd/yy format, so for years it would be '55' for '1955', for example. I am trying to return all entries from years'50' to '99'.
I believe the problem I am having is that once my regex finds a match on a line, it stops right there and jumps to the next line without checking the rest of the line. For example, I have the dates 12/12/12, 10/10/57, 10/09/66 all on one line in the text file, and it only returns 10/10/57.
Here is my code thus far. Any hints or tips? Thank you
open INPUT, "< dates.txt" or die "Can't open input file: $!";
while (my $line = <INPUT>){
if ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
A few points about your code
You must always use strict and use warnings 'all' at the top of all your Perl programs
You should prefer lexical file handles and the three-parameter form of open
If your regex pattern contains literal slashes then it is clearest to use a non-standard delimiter so that they don't need to be escaped
Although recent releases of Perl have fixed the issue, there used to be a significant performance hit when using $&, so it is best to avoid it, at least for now. Put capturing parentheses around the whole pattern and use $1 instead
This program will do as you ask
use strict;
use warnings 'all';
open my $fh, '<', 'dates.txt' or die "Can't open input file: $!";
while ( <$fh> ) {
print $1, "\n" while m{(\d\d/\d\d/[5-9][0-9])}g
}
output
10/10/57
10/09/66
You are printing $& which gets updated whenever any new match is encountered.
But in this case you need to store the all the previous matches and the updated one too, so you can use array for storing all the matches.
while(<$fh>) {
#dates = $_ =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g;
print "#dates\n" if(#dates);
}
You just need to change the 'if' to a 'while' and the regex will take up where it left off;
open INPUT, "< a.dat" or die "Can't open input file: $!";
while (my $line = <INPUT>){
while ($line =~ /(\d\d)\/(\d\d)\/([5-9][0-9])/g){
print "$&\n" ;
}
}
# Output given line above
# 10/10/57
# 10/09/66
You could also capture the whole of the date into one capture variable and use a different regex delimiter to save escaping the slashes:
while ($line =~ m|(\d\d/\d\d/[5-9]\d)|g) {
print "$1\n" ;
}
...but that's a matter of taste, perhaps.
You can use map also to get year range 50 to 99 and store in array
open INPUT, "< dates.txt" or die "Can't open input file: $!";
#as = map{$_ =~ m/\d\d\/\d\d\/[5-9][0-9]/g} <INPUT>;
$, = "\n";
print #as;
Another way around it is removing the dates you don't want.
$line =~ s/\d\d\/\d\d\/[0-4]\d//g;
print $line;

Perl, problems scrubbing a file with regex effeciently

First I will explain the problem my program attempts to solve. I have two input files, one contains lines of "good" numbers:
100000
100001
100002
100003
100004
The other file is a file of "raw" numbers that I want to check each line of and see if the line contains one of the "good" numbers above followed by 4 more numbers, the additional 4 numbers can be anything. so if the file containing the raw numbers is :
8881000001234
1000014321
999991000021234567
00234100001
1000041234
100002123
1000029876
after scrubbing with regex the matched numbers would be
1000001234
1000014321
1000021234
1000041234
1000029876
The way I have done this so far is to store the "good" numbers in an array then slurped the "raw" numbers into a scalar
my $FH
my#good_nums
open $FH, '<', 'good_numbers' or die $!;
while(<$FH>) { chomp; push #good_nums, $_; }
close $FH;
open $FH, '<', 'raw_numbers' or die $!;
my $raw_nums = do { local $/; <$FH> };
close $FH;
then with those I can do this :
my #matches;
foreach my $num (#good_nums) {
push #matches, $raw_nums =~ /$num\d{4}/g;
}
So #matches contains the correct matches and this has been working well.
But now I have developed the need to also capture the lines from "raw" numbers that did not match. I can capture the non matching lines by putting the "raw" numbers into an array (instead of slurping them) and join the #good_nums array into a regex :
my $QRnums = '(?:' . (join '|', #good_nums) . ')';
$QRnums = qr/$QRnums/;
my #raw_nums;
open my $FH, '<', 'raw_numbers' or die $!;
while(<$FH>) { chomp; push #raw_nums, $_; }
my #matches;
my #junk;
for (#raw_nums) {
if ($_ =~ /($QRnums\d{4})/g) {
push #matches, $1;
} else {
push #junk, $_;
}
}
This is working but when I increase the number of lines in each file to 150,000 or more, the latter solution takes 4 or 5 times longer than the former solution. I know there must be another Perl solution that can solve my problem efficiently but I am at a loss. I am not very good with intermediate Perl and beyond.. Is there a better way to do this or can my first solution be rewritten so that I can obtain the non matches in an array too? Apart from needing to solve the problem explained in the opening of my post, I am open to anything.
Only cache your regex one time, and then compare using $_ =~ $QRnums.
Additionally, there's no need to slurp your other file, just do line-by-line process instead.
my $QRnums = '(?:' . (join '|', #good_nums) . ')';
$QRnums = qr/($QRnums\d{4})/;
my #matches;
my #junk;
open my $FH, '<', 'raw_numbers' or die $!;
while (<$FH>) {
chomp;
if ($_ =~ $QRnums) {
push #matches, $1;
} else {
push #junk, $_;
}
}
Also, if your regex should be bounded to the start of the string ^, then I would suggest that you add that: $QRnums = qr/^($QRnums\d{4})/;
Addendum
From perlop - Regexp Quote-Like Operators
Since Perl may compile the pattern at the moment of execution of the qr() operator, using qr() may have speed advantages in some situations, notably if the result of qr() is used standalone:
and later:
Precompilation of the pattern into an internal representation at the moment of qr() avoids a need to recompile the pattern every time a match /$pat/ is attempted. (Perl has many other internal optimizations, but none would be triggered in the above example if we did not use qr() operator.)
Basically, because your #good_nums list was potentially very large, it made sense to cache that if we could so that it the regex test only needed to be compiled once.

Perl Regex Match Text String and Extract Following Number

I have a giant text data file (~100MB) that is a concatenation of a bunch of data files with various header information then some columns of data. Here's the problem. I want to extract a particular number from the header info before each of these data sets and then append that to another column in the data (and write out that data to a different file).
The header info that I want is of the format ex: BGA 1
Where what I want for that extra data column is the # after word BGA. It will be a number between 1 and maybe 20000. I can write the regex to pull the word BGA, but I don't seem to be able to figure out how to just get the digit after it.
To add EXTRA fun, that text "BGA 1" is repeated in each data section TWICE.
Here's what I have so far, which actually doesn't work... I want it to at least print "BGA" everytime it encounters the word BGA, but it prints nothing.... Any help would be appreciated.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'alldata.txt';
open my $info, $file or die "Could not open $file: $!";
$_="";
while(my $line = <$info>){
if ($line eq "/BGA/"){
print <>,"\n";
}
}
close $file;
if ($line =~ /BGA\s(\d+)/){
#your code
print "BGA number $1 \n";
#your code
}
And $1 variable will have the number you want
If there is more than one BGA per line, you'll need to allow the regex to match more than once per line:
while (my $line = <$info>) {
while ( $line =~ /BGA\s(\d+)/g ) {
print "$1\n";
}
}
This should print out all the BGA numbers as a single column. Without any further information it's hard to answer this any better.
First, a 100 MB file is not giant. Don't be so defeatist. You could even slurp it into memory:
Let's look at the few critical places in your code:
while(my $line = <$info>) {
if ($line eq "/BGA/") {
Your condition $line eq "/BGA/" tests if the line literally consists of the string "/BGA/". But, that can never be true for the line with at least have the input record separator, i.e. the contents of $/ at the end because you did not chomp it. In any case, what you want is to match lines that contain "BGA" anywhere and the proper Perl syntax to do that is
if ($line =~ /BGA/) {
Now, once you fix that, you are going to run into a problem with the following statement:
print <>,"\n";
What you really want is print $line;. The diamond operator, <>, in list context is going to try to slurp from STDIN or any files specified as arguments on the command line. Not a good idea.
Others have pointed out how to match the string "BGA" followed by a digit. For better answers, you are going to need to show examples of input and expected output.

Removing CRLF (0D 0A) from string in Perl

I've got a Perl script which consumes an XML file on Linux and occasionally there are CRLF (Hex 0D0A, Dos new lines) in some of the node values which.
The system which produces the XML file writes it all as a single line, and it looks as if it occasionally decides that this is too long and writes a CRLF into one of the data elements. Unfortunately there's nothing I can do about the providing system.
I just need to remove these from the string before I process it.
I've tried all sorts of regex replacement using the perl char classes, hex values, all sorts and nothing seems to work.
I've even run the input file through dos2unix before processing and I still can't get rid of the erroneous characters.
Does anyone have any ideas?
Many Thanks,
Typical, After battling for about 2 hours, I solved it within 5 minutes of asking the question..
$output =~ s/[\x0A\x0D]//g;
Finally got it.
$output =~ tr/\x{d}\x{a}//d;
These are both whitespace characters, so if the terminators are always at the end, you can right-trim with
$output =~ s/\s+\z//;
A few options:
1. Replace all occurrences of cr/lf with lf: $output =~ s/\r\n/\n/g; #instead of \r\n might want to use \012\015
2. Remove all trailing whitespace: output =~ s/\s+$//g;
3. Slurp and split:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
sub main{
createfile();
outputfile();
}
main();
sub createfile{
(my $file = $0)=~ s/\.pl/\.txt/;
open my $fh, ">", $file;
print $fh "1\n2\r\n3\n4\r\n5";
close $fh;
}
sub outputfile{
(my $filei = $0)=~ s/\.pl/\.txt/;
(my $fileo = $0)=~ s/\.pl/out\.txt/;
open my $fin, "<", $filei;
local $/; # slurp the file
my $text = <$fin>; # store the text
my #text = split(/(?:\r\n|\n)/, $text); # split on dos or unix newlines
close $fin;
local $" = ", "; # change array scalar separator
open my $fout, ">", $fileo;
print $fout "#text"; # should output numbers separated by comma space
close $fout;
}