Storing Numerical Data in a Variable through matching in Perl - regex

I am a beginner at Perl and want to store some data from a file format into a variable. Specifically, each line of the file has a format like the following:
ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0
I was able to use matching to get the line I wanted (with the code below).
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
}
However, I want to store the three coordinate values as variables or in a hash. Is it possible to use matching to store the coordinate values rather than having to use substring.

In this instance I would simply split each record into an array and verify the identifying fields. The coordinate values can simply be extracted from the array if the line has been found to be relevant.
Like this
use strict;
use warnings;
my $residue_name = 'ASP';
my $residue_number = 72;
while (<DATA>) {
my #fields = split;
next unless $fields[0] eq 'ATOM'
and $fields[2] eq 'CB'
and $fields[3] eq $residue_name
and $fields[5] == $residue_number;
my #coords = #fields[6, 7, 8];
print "#coords\n";
}
__DATA__
ATOM 575 CB ASP 2 72 -2.80100 -7.45000 -2.09400 C_3 4 0 -0.28000 0 0
output
-2.80100 -7.45000 -2.09400

You can get the end of the line which is AFTER the match with $' (see http://perldoc.perl.org/perlvar.html), and split around spaces like in :
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
$_ = $';
(undef, $x, $y, $z) = split /\s+/;
...
}
(undef is necessary, because $_ will start with some spaces, thus the first variable will be empty)
You can also write something like :
if ($line =~ /^ATOM\s+\d+\s+(CB+)\s+$residue_name+\s+\d+\s+$residue_number/)
{
$_ = $';
/\s+(-?\d+\.?\d*)\s+(-?\d+\.?\d*)\s+(-?\d+\.?\d*)/;
($x, $y, $z) = ($1, $2, $3);
}
In fact, as always in Perl, there are lots of ways to do it...

Related

Perl: Matching 3 pairs of numbers from 4 consecutive numbers

I am writing some code and I need to do the following:
Given a 4 digit number like "1234" I need to get 3 pairs of numbers (the first 2, the 2 in the middle, and the last 2), in this example I need to get "12" "23" and "34".
I am new to perl and don't know anything about regex. In fact, I am writing a script for personal use and I've started reading about Perl some days ago because I figured it was going to be a better language for the task at hand (need to do some statistics with the numbers and find patterns)
I have the following code but when testing I processed 6 digit numbers, because I "forgot" that the numbers I would be processing are 4 digits, so it failed with the real data, of course
foreach $item (#totaldata)
{
my $match;
$match = ($item =~ m/(\d\d)(\d\d)(\d\d)/);
if ($match)
{
($arr1[$i], $arr2[$i], $arr3[$i]) = ($item =~ m/(\d\d)(\d\d)(\d\d)/);
$processednums++;
$i++;
}
}
Thank you.
You can move last matching position with pos()
pos directly accesses the location used by the regexp engine to store the offset, so assigning to pos will change that offset..
my $item = 1234;
my #arr;
while ($item =~ /(\d\d)/g) {
push #arr, $1;
pos($item)--;
}
print "#arr\n"; # 12 23 34
The simplest way would be to use a global regex pattern search
It is nearly always best to separate verificaton of the input data from processing, so the program below first rejects any values that are not four characters long or that contain a non-digit character
Then the regex pattern finds all points in the string that are followed by two digits, and captures them
use strict;
use warnings 'all';
for my $val ( qw/ 1234 6572 / ) {
next if length($val) != 4 or $val =~ /\D/;
my #pairs = $val =~ /(?=(\d\d))/g;
print "#pairs\n";
}
output
12 23 34
65 57 72
Here's a pretty loud example demonstrating how you can use substr() to fetch out the portions of the number, while ensuring that what you're dealing with is in fact exactly a four-digit number.
use warnings;
use strict;
my ($one, $two, $three);
while (my $item = <DATA>){
if ($item =~ /^\d{4}$/){
$one = substr $item, 0, 2;
$two = substr $item, 1, 2;
$three = substr $item, 2, 2;
print "one: $one, two: $two, three: $three\n";
}
}
__DATA__
1234
abcd
a1b2c3
4567
891011
Output:
one: 12, two: 23, three: 34
one: 45, two: 56, three: 67
foreach $item (#totaldata) {
if ( my #match = $item =~ m/(?=(\d\d))/ ) {
($heads[$i], $middles[$i], $tails[$i]) = #match;
$processednums++;
$i++;
}
}

Perl regex to replace part of one string with a portion of another

I have a need in Perl to replace a section of one string with most of another. :-) This needs be done for multiple pairs of strings.
For example, I need to replace
"/root_vdm_2/fs_clsnymigration"
within
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
with
rfsn_clsnymigration
so that I end up with
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
(without the leading "/root_vdm_2" part) ... but I am sufficiently sleep-deprived to have lost sight of how to accomplish this.
Help ?
Try this regex:
^\/root_vdm_2\/fs_clsnymigration
Substitute with:
\/rfsn_clsnymigration
example:
$string = "/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1";
$string=~s/^\/root_vdm_2\/fs_clsnymigration/\/rfsn_clsnymigration/;
print $string;
Output:
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
EDIT 1
$string = "/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02";
$string=~s/^\/root_vdm_.\/fs_[^\/]*/\/rfsn_clsnymigration/gm;
print $string;
Output:
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/rfsn_clsnymigration/users/Marketing,rfsw_users
/rfsn_clsnymigration/sandi_users,rfsw_sandi
/rfsn_clsnymigration/Analytics,rfsw_pci
/rfsn_clsnymigration/camnt01/AV,rfsw_camnt01
/rfsn_clsnymigration/sfa,rfss_stcloud01
/rfsn_clsnymigration/data4,rfss_stcloud03
/rfsn_clsnymigration/depart1,rfss_stcloud02
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($lhs, $rhs) = split(/,/, $_, 2);
my #parts = split(/\//, $lhs);
splice(#parts, 1, 2, $rhs);
print join('/', #parts) . "\n";
}
__DATA__
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02
My challenge was to replace part of $string1 with all of $string2, split on the commas.
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02
The difficulty I saw initially was how to replace /root_vdm_2/fs_clsnymigration with rfsn_clsnymigration, and I allowed myself to think that a regexp was the best approach.
Although far less eloquent, this gets the job done:
foreach $line (#lines) {
chop $line;
($orig,$replica) = split /\,/, $line;
chop substr $orig, 0, 1;
#pathparts = split /\//, $orig;
$rootvdm = shift #pathparts;
#pathparts[0] = $replica;
$newpath = "/" . join ('/', #pathparts);
print " here's \$newpath:$newpath\n";
}
... which yields something like this:
here's $newpath:/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU
here's $newpath:/rfsw_users/users/Marketing
here's $newpath:/rfsw_sandi/sandi_users
here's $newpath:/rfsw_pci/Analytics
here's $newpath:/rfsw_camnt01/camnt01/AV
here's $newpath:/rfss_stcloud01/sfa
here's $newpath:/rfss_stcloud03/data4
here's $newpath:/rfss_stcloud02/depart1

Extract journal title from Genbank file using perl without using $1, $2 etc

This is a part of my input Genbank file:
LOCUS AC_000005 34125 bp DNA linear VRL 03-OCT-2005
DEFINITION Human adenovirus type 12, complete genome.
ACCESSION AC_000005 BK000405
VERSION AC_000005.1 GI:56160436
KEYWORDS .
SOURCE Human adenovirus type 12
ORGANISM Human adenovirus type 12
Viruses; dsDNA viruses, no RNA stage; Adenoviridae; Mastadenovirus.
REFERENCE 1 (bases 1 to 34125)
AUTHORS Davison,A.J., Benko,M. and Harrach,B.
TITLE Genetic content and evolution of adenoviruses
JOURNAL J. Gen. Virol. 84 (Pt 11), 2895-2908 (2003)
PUBMED 14573794
And I want to extract the journal title for example J. Gen. Virol. (not including the issue number and pages)
This is my code and it doesn't give any result so I am wondering what goes wrong. I did use parentheses for $1, $2 etc... And though it worked my tutor told me to try without using that method, use substr instead.
foreach my $line (#lines) {
if ( $line =~ m/JOURNAL/g ) {
$journal_line = $line;
$character = substr( $line, $index, 2 );
if ( $character =~ m/\s\d/ ) {
print substr( $line, 12, $index - 13 );
print "\n";
}
$index++;
}
}
Another way to do this, is to take advantage of BioPerl, which can parse GenBank files:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $io=Bio::SeqIO->new(-file=>'AC_000005.1.gb', -format=>'genbank');
my $seq=$io->next_seq;
foreach my $annotation ($seq->annotation->get_Annotations('reference')) {
print $annotation->location . "\n";
}
If you run this script with AC_000005.1 saved in a file called AC_000005.1.gb, you get:
J. Gen. Virol. 84 (PT 11), 2895-2908 (2003)
J. Virol. 68 (1), 379-389 (1994)
J. Virol. 67 (2), 682-693 (1993)
J. Virol. 63 (8), 3535-3540 (1989)
Nucleic Acids Res. 9 (23), 6571-6589 (1981)
Submitted (03-MAY-2002) MRC Virology Unit, Church Street, Glasgow G11 5JR, U.K.
Rather than matching and using substr, it is much easier to use a single regex to capture the whole JOURNAL line and use brackets to capture the text representing the journal information:
foreach my $line (#lines) {
if ($line =~ /JOURNAL\s+(.+)/) {
print "Journal information: $1\n";
}
}
The regular expression looks for JOURNAL followed by one or more whitespace characters, and (.+) captures the rest of the characters in the line.
To get the text without using $1, I think you're trying to do something like this:
if ($line =~ /JOURNAL/) {
my $ix = length('JOURNAL');
# variable containing the journal name
my $j_name;
# while the journal name is not defined...
while (! $j_name) {
# starting with $ix = the length of the word JOURNAL, get character $ix in the string
if (substr($line, $ix, 1) =~ /\s/) {
# if it is whitespace, increase $ix by one
$ix++;
}
else {
# if it isn't whitespace, we've found the text!!!!!
$j_name = substr($line, $ix);
}
}
If you already know how many characters there are in the left-hand column, you can just do substr($line, 12) (or whatever) to retrieve a substring of $line starting at character 12:
foreach my $line (#lines) {
if ($line =~ /JOURNAL/) {
print "Journal information: " . substr($line, 12) . "\n";
}
}
You can combine the two techniques to eliminate the issue number and dates from the journal data:
if ($line =~ /JOURNAL/) {
my $j_name;
my $digit;
my $indent = 12; # the width of the left-hand column
my $ix = $indent; # we'll use this to track the characters in our loop
while (! $digit) {
# starting with $ix = the length of the indent,
# get character $ix in the string
if (substr($line, $ix, 1) =~ /\d/) {
# if it is a digit, we've found the number of the journal
# we can stop looping now. Whew!
$digit = $ix;
# set j_name
# get a substring of $line starting at $indent going to $digit
# (i.e. of length $digit - $indent)
$j_name = substr($line, $indent, $digit-$indent);
}
$ix++;
}
print "Journal information: $j_name\n";
}
I think it would have been easier just to get the data from the Pubmed API! ;)

A Better Regex Solution in Perl?

Here's my problem:
I have text files with five columns. The last always has a single digit. Backslashes are illegal in the first three. Spaces may show up in the first column. I remove everything after the last # in the first column. The columns are separated by spaces. I can set the column width to pretty much any value I want, giving me some control as to the spacing between columns.
So, I might have something like this:
D Smith Application Database Read 2
I have code that transforms it into this:
grant read on database 'Application'.'Database' to 'D Smith';
Here is the Regex code I have created to delimit each field and avoid confusing any spaces in the first field from the delimiting spacing.
while (<>) {
s/^ //m;
if (/^([^\\]+?)( {80,})/) {
my $atindex = rindex($1,"#",);
my $username = substr($1,0,$atindex);
if ($atindex != -1) {
s/^([^\\]+?)( {80,})/$username $2/m;
s/ {2,}/ \\ \\ /g;
s/\\ \d$//gm;
s/ \\ $//gm;
}
}
What this does is make \\ \\ the delimiter between fields. Then I use this code for the transformation:
if (/([^\\]+) \\ \\ ([^\\]+) \\ \\ ([^\\]+) \\ \\ ([^\\]+)\n/) {
if ($4 eq "any") {
my $execany = "execute any";
print "grant $execany on database '$2'.'$3' to user '$1';\n";
} else {
print "grant $4 on database '$2'.'$3' to user '$1';\n";
}
I'm doing this because I couldn't figure out a way to discern the spaces between the fields from the spaces that might occur in the first field. Is there a better way? This works sufficiently quickly, but it's not elegant.
Are the columns constant width? If so, skip the regular expression and simply use substr:
Data Format
D Smith Application Database Read 2
012345678901234567890123456789012345678901234567890
Program
use strict;
use warnings;
use feature qw(say);
while ( my $line = <> ) {
chomp $line;
( my $user = substr( $line, 0, 10 )) =~ s/\s*$//;
( my $file = substr( $line, 12, 15 )) =~ s/\s*$//;
( my $db = substr( $line, 28, 12 )) =~ s/\s*$//;
( my $op = substr( $line, 41, 9 )) =~ s/\s*$//;
( my $num = substr ( $line, 50 )) =~ s/\s*$//;
say qq(User = "$user", File = "$file", DB = "$db", OP = "$op", NUM = "$num");
}
The s/\s*$//; trims the right side of the string removing white space.
If you don't want to use all of those substrings, and only your first field might have a space in it, then you can use substr to split out that first field, and split on the rest of the fields:
while ( my $line = <> ) {
chomp $line;
( my $user = substr( $line, 0, 10 ) ) =~ s/\s*$//;
my ( $file, $db, $op, $num ) = split /\s+/, substr( $line, 12 );
....
}
Another Solution
Are the columns constant width? ... Nice solution. unpack could also be used with constant widths. – Kenosis
Let's use unpack!
while ( my $line = <> ) {
chomp $line;
my ( $user, $file, $db, $op, $num ) = unpack ("A12A16A13A9A*", $line);
say qq(User = "$user", File = "$file", DB = "$db", OP = "$op", NUM = "$num");
}
Yes, that's easy to understand. At least I don't have to right trim my strings like I did with substr. See the pack/unpack tutorial.
As I describe in the comments to your question, as long as you can ensure that two simple assumptions are valid, you have no need for a lot of complicated hairy regexing. Those assumptions are:
that, for every pair of columns, at least two spaces separate the end of the value in the first column, and the beginning of the value in the second;
that no column's value contains a string of two or more spaces.
(If you can't guarantee those assumptions for a separator consisting of two or more spaces, perhaps you can for three or more, or four or more, &c. You're better off delimiting your columns with something that you can be certain will never appear in any value, but absent that, rules like these are the best you can hope to do.)
Given those assumptions, you can just split() the string on substrings of two or more spaces, something like this:
while (<>) {
$_ =~ s#^\s+##;
my #fields = split(/\s{2,}/, $_);
# print your commands, interpolating values from #fields
}
Or, more simply and readably still, you can do something like this:
while (my $line = <STDIN>) {
# the same leading-space cleanup and split...
$line =~ s#^\s+##;
my #fields = split(/\s{2,}/, $line);
# ...and then we assign values to a hash with meaningful keys...
my %values = ('user' => $fields[0],
'application' => $fields[1],
'database' => $fields[2],
'permission' => (lc($fields[3]) eq 'any'
? 'execany'
: $fields[3]));
# ...so that our interpolation and printing becomes much more
# readable.
print "grant $values{'permission'}"
. " on database '$values{'application'}'.'$values{'database'}"
. " to user '$values{'user'}';"
. "\n";
};
You'd do well also to add some validity checking, i.e. make sure all the values you expect in a given row are present and correctly formatted and emit some useful notice, or just die() outright, if they're not.
To match lines like this:
D Smith Application Database Read 2
F J Perl Foobar Database2 Write 4
Something Whatever Database3 Any 1
into the relevant columns 1 to 5, where column 1 can contain spaces, anchor on end-of-line ($):
while (<>) {
next unless /^\s*(.+?)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)$/;
my $grant_type = $4;
$grant_type = 'execute any' if lc $grant_type eq 'any';
print "grant $grant_type on '$2'.'$3' to '$1'\n";
}
result:
grant Read on 'Application'.'Database' to 'D Smith'
grant Write on 'Foobar'.'Database2' to 'F J Perl'
grant execute any on 'Whatever'.'Database3' to 'Something'
Given you have two+ spaces between fields, perhaps the following will be helpful:
use strict;
use warnings;
while (<>) {
my ( $user, $app, $db, $perm ) = grep $_, split /\s{2,}/;
$perm = 'execute any' if lc $perm eq 'any';
print "grant $perm on database '$app'.'$db' to user '$user';\n";
}
You can omit the initial-space substitution by grepping the result of split. $perm is changed only if it's any after the split.
As you say only the first column contains spaces we can use split to break up the columns,
and splice to remove the last four... Then just use string interpolation to re-constitute
the first column - no complex repular expressions required, no assumptions about fixed
column spacing and no assumptions about double spacing.. Probably want to add some more
validity checks (make sure values are valid)
use strict;
use Const::Fast qw(const);
const my $N => 4;
while(<>){
## Split the string on spaces...
chomp;
my #Q = split;
next if #Q <= $N;
## And remove the last four columns...
my ($app,$db,$perm,$flag) = splice #Q,-$N,$N;
## Sort out name and perm...
( my $user = "#Q" ) =~ s{#[^#]+}{}mxs;
$perm = 'execute any' if 'any' eq lc $perm;
## Print out statement... using named variables makes life easier!
print "grant $perm on database '$app'.'$db' to user '$user';\n";
}

Perl Text Extraction

I just need to extract the numbers in each of these items and store them separately, whats the best way to do this ?
IF the data is something like
p °c 4'i
App data usage stats
E
iii
! 12:12PM
Received data
Sent data
Current usage
598KB
28KB
626KB :18%
Get Current Stat Browser App
J
Battery Level
I tried this, but I get only 18 as an output in this case.
foreach my $line (#lines) {
if ($line =~/ :[ ]*(\d+)[ ]*(KB|%)/) {
$value = $1;
print "the value is $value\n";
push (#array, $1);
}
}
Thanks,
Loop over every line, and using a regular expression
foreach my $line (#lines) {
if ($line =~ /(\d+)/) {
push (#array, $1);
}
}
And you'll have all the numbers in your #array array
Here's one way to do it. Note that it does not care about which kind of numbers it extracts, as per your request.
It splits the line on colons in max two fields, key and value. Then we extract numbers from the values and insert into the hash. This part will effectively skip all lines where values do not contain numbers. This is also where you would insert stricter checks, e.g. if ($value =~ /(\d+)\s*KB/i) would only capture numbers followed by KB (I opted to add case insensitivity).
use strict;
use warnings;
use Data::Dumper;
my %hash;
while (<DATA>) {
my ($key, $value) = split /\s*:\s*/, $_, 2;
if ($value =~ /(\d+)/) {
$hash{$key} = $1;
}
}
print Dumper \%hash;
__DATA__
Received data : 598 KB
Sent data : 28 KB
Current usage : 626 KB
Battery Level : 35 %
Output:
$VAR1 = {
'Sent data' => '28',
'Current usage' => '626',
'Battery Level' => '35',
'Received data' => '598'
};