Perl: Matching 3 pairs of numbers from 4 consecutive numbers - regex

I am writing some code and I need to do the following:
Given a 4 digit number like "1234" I need to get 3 pairs of numbers (the first 2, the 2 in the middle, and the last 2), in this example I need to get "12" "23" and "34".
I am new to perl and don't know anything about regex. In fact, I am writing a script for personal use and I've started reading about Perl some days ago because I figured it was going to be a better language for the task at hand (need to do some statistics with the numbers and find patterns)
I have the following code but when testing I processed 6 digit numbers, because I "forgot" that the numbers I would be processing are 4 digits, so it failed with the real data, of course
foreach $item (#totaldata)
{
my $match;
$match = ($item =~ m/(\d\d)(\d\d)(\d\d)/);
if ($match)
{
($arr1[$i], $arr2[$i], $arr3[$i]) = ($item =~ m/(\d\d)(\d\d)(\d\d)/);
$processednums++;
$i++;
}
}
Thank you.

You can move last matching position with pos()
pos directly accesses the location used by the regexp engine to store the offset, so assigning to pos will change that offset..
my $item = 1234;
my #arr;
while ($item =~ /(\d\d)/g) {
push #arr, $1;
pos($item)--;
}
print "#arr\n"; # 12 23 34

The simplest way would be to use a global regex pattern search
It is nearly always best to separate verificaton of the input data from processing, so the program below first rejects any values that are not four characters long or that contain a non-digit character
Then the regex pattern finds all points in the string that are followed by two digits, and captures them
use strict;
use warnings 'all';
for my $val ( qw/ 1234 6572 / ) {
next if length($val) != 4 or $val =~ /\D/;
my #pairs = $val =~ /(?=(\d\d))/g;
print "#pairs\n";
}
output
12 23 34
65 57 72

Here's a pretty loud example demonstrating how you can use substr() to fetch out the portions of the number, while ensuring that what you're dealing with is in fact exactly a four-digit number.
use warnings;
use strict;
my ($one, $two, $three);
while (my $item = <DATA>){
if ($item =~ /^\d{4}$/){
$one = substr $item, 0, 2;
$two = substr $item, 1, 2;
$three = substr $item, 2, 2;
print "one: $one, two: $two, three: $three\n";
}
}
__DATA__
1234
abcd
a1b2c3
4567
891011
Output:
one: 12, two: 23, three: 34
one: 45, two: 56, three: 67

foreach $item (#totaldata) {
if ( my #match = $item =~ m/(?=(\d\d))/ ) {
($heads[$i], $middles[$i], $tails[$i]) = #match;
$processednums++;
$i++;
}
}

Related

Loop through global substitution

$num = 6;
$str = "1 2 3 4";
while ($str =~ s/\d/$num/g)
{
print $str, "\n";
$num++;
}
Is it possible to do something like the above in perl? I would like the loop to run only 4 times and to finish with $str being 6 7 8 9.
You don't need the loop: the /g modifier causes the substitution to be repeated as many times as it matches. What you want is the /e modifier to compute the substitution. Assuming the effect you were after is to add 5 to each number, the example code is as follows.
$str = "1 2 3 4";
$str =~ s/(\d)/$1+5/eg;
print "$str\n";
If you really wanted to substitute numbers starting with 6, then this works.
$num = 6;
$str = "1 2 3 4";
$str =~ s/\d/$num++/eg;
print "$str\n";
You could with the match operator.
my $new_str = '';
while ($str =~ /\G(.*?)\d/gc) {
$new_str .= $1 . $num++;
}
$new_str .= substr($str, pos($str));
However, #TFBW's solution is probably what you want unless you're writing a tokenizer for a big parser. In a tokenizer, it would look more like:
# So we don't have to say «$str =~ all» over the place.
# Also, allows us to use redo.
for ($str) {
/\G \s+ /xgc; # Skip whitespace.
if (/\G (\d+) /xgc) {
# Do something with $1.
redo;
}
...
last if /\G \z /xgc;
die "Unrecognized token";
}

Is there a built-in perl function that determines length of numeric variable?

I am parsing a text file with the following format that needs to be distinguished.
cat dog [numeric, 00-23]
pickle fence [numeric, 0-5]
tack glue [numeric,1-53]
dent paint [numeric01-15]
If the minimum of the range is a single digit, then I need to process it a certain way. If the minimum of the range is double-digit (including 00,01,02,etc), I need to process it another way.
if($line =~ /\[numeric.*(\d+)\-(\d+)/i){
$rangemin=$1;
$rangemax=$2;
#find # of digits in $rangemin
#length() doesn't work
#I'm trying to find a function that finds number of digits so length of `00` or `01` or `02` or etc. returns `2`
}
How do I find the # of digits of $rangemin?
Your regular expression grabs the leading 0 because .* is very greedy.
use warnings;
use strict;
while (my $line = <DATA>) {
if ($line =~ /\[numeric[\D]*(\d+)\-(\d+)/i){
my $rangemin = $1;
my $len = length $rangemin;
print "rangemin=$rangemin len=$len\n";
}
}
__DATA__
cat dog [numeric, 00-23]
pickle fence [numeric, 0-5]
tack glue [numeric,1-53]
dent paint [numeric01-15]
Output:
rangemin=00 len=2
rangemin=0 len=1
rangemin=1 len=1
rangemin=01 len=2
You can use the length built-in, just like you would for a string; Perl implicitly converts arguments based on the operator/builtin that they are being passed to:
#!perl
use strict;
use warnings;
my $number = 5;
print length($number), "\n";
You can also use a base-10 logarithm to determine how long a number is:
#!perl
use strict;
use warnings;
my $number = 12345;
my $length = $number > 0
? int(log($number)/log(10)) + 1
: $number == 0
? 0
: int(log(abs($number))/log(10)) + 2;
print $length, "\n";
# 5
**EDIT: as #Keith Thompson points out, log is not defined for numbers <= 0. Use abs to prevent this condition.

Extract journal title from Genbank file using perl without using $1, $2 etc

This is a part of my input Genbank file:
LOCUS AC_000005 34125 bp DNA linear VRL 03-OCT-2005
DEFINITION Human adenovirus type 12, complete genome.
ACCESSION AC_000005 BK000405
VERSION AC_000005.1 GI:56160436
KEYWORDS .
SOURCE Human adenovirus type 12
ORGANISM Human adenovirus type 12
Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus.
REFERENCE 1 (bases 1 to 34125)
AUTHORS Davison,A.J., Benko,M. and Harrach,B.
TITLE Genetic content and evolution of adenoviruses
JOURNAL J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)
PUBMED 14573794
And I want to extract the journal title for example J. Gen. Virol. (not including the issue number and pages)
This is my code and it doesn't give any result so I am wondering what goes wrong. I did use parentheses for $1, $2 etc... And though it worked my tutor told me to try without using that method, use substr instead.
foreach my $line (#lines) {
if ( $line =~ m/JOURNAL/g ) {
$journal_line = $line;
$character = substr( $line, $index, 2 );
if ( $character =~ m/\s\d/ ) {
print substr( $line, 12, $index - 13 );
print "\n";
}
$index++;
}
}
Another way to do this, is to take advantage of BioPerl, which can parse GenBank files:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $io=Bio::SeqIO->new(-file=>'AC_000005.1.gb', -format=>'genbank');
my $seq=$io->next_seq;
foreach my $annotation ($seq->annotation->get_Annotations('reference')) {
print $annotation->location . "\n";
}
If you run this script with AC_000005.1 saved in a file called AC_000005.1.gb, you get:
J. Gen. Virol. 84 (PT 11), 2895-2908 (2003)
J. Virol. 68 (1), 379-389 (1994)
J. Virol. 67 (2), 682-693 (1993)
J. Virol. 63 (8), 3535-3540 (1989)
Nucleic Acids Res. 9 (23), 6571-6589 (1981)
Submitted (03-MAY-2002) MRC Virology Unit, Church Street, Glasgow G11 5JR, U.K.
Rather than matching and using substr, it is much easier to use a single regex to capture the whole JOURNAL line and use brackets to capture the text representing the journal information:
foreach my $line (#lines) {
if ($line =~ /JOURNAL\s+(.+)/) {
print "Journal information: $1\n";
}
}
The regular expression looks for JOURNAL followed by one or more whitespace characters, and (.+) captures the rest of the characters in the line.
To get the text without using $1, I think you're trying to do something like this:
if ($line =~ /JOURNAL/) {
my $ix = length('JOURNAL');
# variable containing the journal name
my $j_name;
# while the journal name is not defined...
while (! $j_name) {
# starting with $ix = the length of the word JOURNAL, get character $ix in the string
if (substr($line, $ix, 1) =~ /\s/) {
# if it is whitespace, increase $ix by one
$ix++;
}
else {
# if it isn't whitespace, we've found the text!!!!!
$j_name = substr($line, $ix);
}
}
If you already know how many characters there are in the left-hand column, you can just do substr($line, 12) (or whatever) to retrieve a substring of $line starting at character 12:
foreach my $line (#lines) {
if ($line =~ /JOURNAL/) {
print "Journal information: " . substr($line, 12) . "\n";
}
}
You can combine the two techniques to eliminate the issue number and dates from the journal data:
if ($line =~ /JOURNAL/) {
my $j_name;
my $digit;
my $indent = 12; # the width of the left-hand column
my $ix = $indent; # we'll use this to track the characters in our loop
while (! $digit) {
# starting with $ix = the length of the indent,
# get character $ix in the string
if (substr($line, $ix, 1) =~ /\d/) {
# if it is a digit, we've found the number of the journal
# we can stop looping now. Whew!
$digit = $ix;
# set j_name
# get a substring of $line starting at $indent going to $digit
# (i.e. of length $digit - $indent)
$j_name = substr($line, $indent, $digit-$indent);
}
$ix++;
}
print "Journal information: $j_name\n";
}
I think it would have been easier just to get the data from the Pubmed API! ;)

Pattern matching in perl (Lookahead and Condition on word Index)

I have a long string, containing alphabetic words and each delimited by one single character ";" . The whole string also starts and ends with a ";" .
How do I count the number of occurrences of a pattern (started with ";") if index of a success match is divisible by 5.
Example:
$String = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;"
$Pattern = ";the(?=;f)"
OUTPUT: 1
Since:
Note 1: In above case, the $Pattern ;the(?=;f) exists as the 1st and 10th words in the $String; however; the output result would be 1, since only the index of second match (10) is divisible by 5.
Note 2: Every word delimited by ";" counts toward the index set.
Index of the = 1 -> this does not match since 1 is not divisible by 5
Index of fox = 2
Index of jumped = 3
Index of over = 4
Index of the = 5 -> this does not match since the next word (dog) starts with "d" not "f"
Index of dog = 6
Index of the = 7 -> this does not match since 7 is not divisible by 5
Index of duck = 8
Index of and = 9
Index of the = 10 -> this does match since 10 is divisible by 5 and the next word (frog) starts with "f"
Index of frog = 11
If possible, I am wondering if there is a way to do this with a single pattern matching without using list or array as the $String is extremely long.
Use Backtracking control verbs to process the string 5 words at a time
One solution is to add a boundary condition that the pattern is preceded by 4 other words.
Then setup an alteration so that if your pattern is not matched, the 5th word is gobbled and then skipped using backtracking control verbs.
The following demonstrates:
#!/usr/bin/env perl
use strict;
use warnings;
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;";
my $pattern = qr{;the(?=;f)};
my #matches = $string =~ m{
(?: ;[^;]* ){4} # Preceded by 4 words
(
$pattern # Match Pattern
|
;(*SKIP)(*FAIL) # Or consume 5th word and skip to next part of string.
)
}xg;
print "Number of Matches = " . #matches . "\n";
Outputs:
Number of Matches = 1
Live Demo
Supplemental Example using Numbers 1 through 100 in words
For additional testing, the following constructs a string of all numbers in word format from 1 to 100 using Lingua::EN::Numbers.
For the pattern it looks for a number that's a single word with the next number that begins with the letter S.
use Lingua::EN::Numbers qw(num2en);
my $string = ';' . join( ';', map { num2en($_) } ( 1 .. 100 ) ) . ';';
my $pattern = qr{;\w+(?=;s)};
my #matches = $string =~ m{(?:;[^;]*){4}($pattern|;(*SKIP)(*FAIL))}g;
print "#matches\n";
Outputs:
;five ;fifteen ;sixty ;seventy
Reference for more techniques
The following question from last month is a very similar problem. However, I provided 5 different solutions in addition to the one demonstrated here:
In Perl, how to count the number of occurences of successful matches based on a condition on their absolute positions
You can count the number of semicolons in each substring up to the matching position. For a million-word string, it takes 150 seconds.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = join ';', q(),
map { qw( the fox jumped over the dog the duck and the frog)[int rand 11] }
1 .. 1000;
$string .= ';';
my $pattern = qr/;the(?=;f)/;
while ($string =~ /$pattern/g) {
my $count = substr($string, 0, pos $string) =~ tr/;//;
say $count if 0 == $count % 5;
}
Revised Answer
One relatively simple way to achieve what you want is by replacing the delimiters in the original text that occur on a 5-word-index boundary:
$text =~ s/;/state $idx++ % 5 ? ',' : ';'/eg;
Now you just need to trivially adjust your $pattern to look for ;the,f instead of ;the;f. You can use the =()= pseudo-operator to return the count:
my $count =()= $text =~ /;the(?=,f)/g;
Original answer after the break. (Thanks to #choroba for pointing out the correct interpretation of the question.)
Character-Based Answer
This uses the /g regex modifier in combination with pos() to look at matching words. For illustration, I print out all matches (not just those on 5-character boundaries), but I print (match) beside those on 5-char boundaries. The output is:
;the;fox;jumped;over;the;dog;the;duck;and;the;frog
^....^....^....^....^....^....^....^....^....^....
`the' #0 (match)
`the' #41
And the code is:
#!/usr/bin/env perl
use 5.010;
my $text = ';the;fox;jumped;over;the;dog;the;duck;and;the;frog';
say $text;
say '^....^....' x 5;
my $pat = qr/;(the)(?=;f)/;
#$pat = qr/;([^;]+)/;
while ($text =~ /$pat/g) {
my $pos = pos($text) - length($1) - 1;
say "`$1' \#$pos". ($pos % 5 ? '' : ' (match)');
}
First of, pos is also possible as a left hand side expression. You could make use of the \G assertion in combination with index (since speed is of concern for you). I expanded your example to showcase that it only "matches" for divisibles of 5 (your example also allowed for indices not divisible by 5 to be 1 a solution, too). Since you only wanted the number of matches, I only used a $count variable and incremented. If you want something more, use the normal if {} clause and do something in the block.
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;or;the;fish";
my $pattern = qr/;the(?=;f)/;
my ($index,$count, $position) = (0,0,0);
while(0 <= ($position = index $string, ';',$position)){
pos $string = $position++; #add one to $position, to terminate the loop
++$count if (!(++$index % 5) and $string =~/\G$pattern/);
}
say $count; # says 1, not 2
You could use the experimental features of regexes to solve you problem (especially the (?{}) blocks). Before you do, you really should read the corresponding section in the perldocs.
my ($index, $count) = (0,0);
while ($string =~ /; # the `;'
(?(?{not ++$index % 5}) # if with a code condition
the(?=;f) # almost your pattern, but we'll have to count
|(*FAIL)) # else fail
/gx) {
$count++;
}

Storing Numerical Data in a Variable through matching in Perl

I am a beginner at Perl and want to store some data from a file format into a variable. Specifically, each line of the file has a format like the following:
ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0
I was able to use matching to get the line I wanted (with the code below).
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
}
However, I want to store the three coordinate values as variables or in a hash. Is it possible to use matching to store the coordinate values rather than having to use substring.
In this instance I would simply split each record into an array and verify the identifying fields. The coordinate values can simply be extracted from the array if the line has been found to be relevant.
Like this
use strict;
use warnings;
my $residue_name = 'ASP';
my $residue_number = 72;
while (<DATA>) {
my #fields = split;
next unless $fields[0] eq 'ATOM'
and $fields[2] eq 'CB'
and $fields[3] eq $residue_name
and $fields[5] == $residue_number;
my #coords = #fields[6, 7, 8];
print "#coords\n";
}
__DATA__
ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0
output
-2.80100 -7.45000 -2.09400
You can get the end of the line which is AFTER the match with $' (see http://perldoc.perl.org/perlvar.html), and split around spaces like in :
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
$_ = $';
(undef, $x, $y, $z) = split /\s+/;
...
}
(undef is necessary, because $_ will start with some spaces, thus the first variable will be empty)
You can also write something like :
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
$_ = $';
/\s+(-?\d+\.?\d*)\s+(-?\d+\.?\d*)\s+(-?\d+\.?\d*)/;
($x, $y, $z) = ($1, $2, $3);
}
In fact, as always in Perl, there are lots of ways to do it...