Removing blank elements from an array - regex

In my Perl code, I am accessing an email. I need to fetch the table in it and parse it into an array.
I did it using:
my #plain = split(/\n/,$plaintext);
However, there are many blank elements in #plain. It has 572 elements and about half of them are empty.
Am I doing anything wrong here? What do I need to add/change in my code to get rid of the blank elements?

grep the output so you only get entries that contain non-whitepace characters.
my #plain = grep { /\S/ } split(/\n/,$plaintext);

The correct way to do it is here from #dave-cross
Quick and dirty if you're not up for fixing your split:
foreach(#plain){
if( ( defined $_) and !($_ =~ /^$/ )){
push(#new, $_);
}
}
edit: how it works
There are going to be more elegant and efficient ways of doing it than the above, but as with everything perl-y tmtowtdi! The way this works is:
Loop through the array #plain, making $_ set to current array element
foreach(#plain){
Check the current element to see if we're interested in it:
( defined $_) # has it had any value assigned to it
!($_ =~ /^$/ ) # ignore those which have been assigned a blank value eg. ''
If the current element passes those checks push it to #new
push(#new, $_);

One line addition is required in your code, and it works
#plain= grep { $_ ne '' } #plain;

Here is what i used, too late but this is good one , can be used in future
$t = "1.2,3.4,3.12,3.18,3.27";
my #to = split(',',$t);
foreach $t ( #to ){
push ( #valid , $t );
}
my $max = (sort { $b <=> $a } #valid)[0];
print $max

Related

Make hash from regex expression patten match -error

I am making a hash from regex expression. I run my program below and I have a check at the end to see if my hash made ok. But I keep getting an error for the value., I get this ARRAY(0x1a1c740), when it should be 437768. Keys can display ok. I didnt do split because i need the key to be the first part of a species name. This is what i am matching.
# "aaaaaaaaaa","aaaaaaaaaa","437768","Cryptophyta sp. CR-MAL06",0
Thanks very much for your help that you may give.
use strict;
use warnings;
open (my $in_fh,"$ARGV[0]") or die "Failed to open file: $!\n";
open (my $out_fh, ">genus.txt");
my %hash;
while ( my $line = <$in_fh> ) {
#
# "aaaaaaaaaa","aaaaaaaaaa","437768","Cryptophyta sp. CR-MAL06",0
#
if ($line =~ m/\"+\w+\"+\,+\"+\w+\"+\,+\"+(\d+)\"+\,+\"+(\w+)+.+/) {
my $v = $1;
my $k = $2;
$hash{$k} = [$v];
}
}
if (exists $hash{'Cryptophyta'}) {
print $out_fh $hash{'Cryptophyta'};
}
else {
print $out_fh "NO\n";
}
close $in_fh;
close $out_fh;
Change this line:
$hash{$k} = [$v];
to
$hash{$k} = $v;
[$v] is a reference to an array but you want to store a scalar.
[ ] creates an array, assigns the result of the enclosed expression to that array, and returns a reference to the array. It is that reference you are printing.
You were probably trying to support multiple matches. Two problems:
You continually create a new array with one element. Replace
$hash{$k} = [ $v ];
with
push #{ $hash{$k} }, $v;
You print the reference to the array rather than the contents of the array. Replace
print $out_fh $hash{'Cryptophyta'};
with
print $out_fh join(', ', #{ $hash{'Cryptophyta'} });

Why does my program crash after one line of input?

I am writing a simple program which capitalizes each word in a sentence. It gets a multi-line input. I then loop through the input lines, split each word in the line, capitalize it and then join the line again. This works fine if the input is one sentence, but as soon as I input two lines my program crashes (and if I wait too long my computer freezes.)
Here is my code
#input = <STDIN>;
foreach(#input)
{
#reset #words
#words= ();
#readability
$lines =$_;
#split sentence
#words = split( / /, $lines );
#capitalize each word
foreach(#words){
$words[$k] = ucfirst;
$k++;
}
#join sentences again
$lines = join(' ', #words);
#create output line
$output[$i]=$lines;
$i++;
}
#print the result
print "\nResult:\n";
foreach(#output){
print $output[$j],"\n";
$j++;
}
Could someone please tell me why it crashes?
use strict (and be told about not properly handled variables like your indices)
use for var (array) to get a usable item without an index (Perl isn't Javascript)
What isn't there can't be wrong (e.g. push instead of index)
In code:
use strict; # always!
my #input = <STDIN>; # the loop need in- and output
my #output = ();
for my $line (#input) # for makes readability *and* terseness easy
{
chomp $line; # get rid of eol
#split sentence
my #words = split( / /, $line );
#capitalize each word
for my $word (#words){ # no danger of mishandling indices
$word = ucfirst($word);
}
#join sentences again
$line = join(' ', #words);
#create output line
push #output, $line;
}
#print the result
print "\nResult:\n";
for my $line (#output){
print $line, "\n";
}
The problem is that you are using global variables throughout, so they are keeping their values across iterations of the loop. You have reset #words to an empty list even though you didn't need to - it is overwritten when you assign the result of split to it - but $k is increasing endlessly.
$k is initially set to undef which evaluates as zero, so for the first sentence everything is fine. But you leave $k set to the number of elements in #words so it starts from there instead of from zero for the next sentence. Your loop over #words becomes endless because you are assigning to (and so creating) $words[$k] so the array is getting longer as fast as you are looping through it.
The same problem applies to $i and $j, but execution never gets as far as reusing those.
Alshtough this was the only way of working in Perl 4, over twenty years ago, Perl 5 has made programming very much nicer to write and debug. You can now declare variables with my, and you can use strict which (among other things) insists that every variable you use must be declared, otherwise your program won't compile. There is also use warnings which is just as invaluable. In this case it would have warned you that you were using an undefined variable $k etc. to index the arrays.
If I apply use strict and use warnings, declare all of your variables and initialise the counters to zero then I get a working program. It's still not very elegant, and there are much better ways of doing it, but the error has gone away.
use strict;
use warnings;
my #input = <STDIN>;
my #output;
my $i = 0;
foreach (#input) {
# readability
my $lines = $_;
# split sentence
my #words = split ' ', $lines;
# capitalize each word
my $k = 0;
foreach (#words) {
$words[$k] = ucfirst;
$k++;
}
# join sentences again
$lines = join ' ', #words;
#create output line
$output[$i] = $lines;
$i++;
}
print "\nResult:\n";
my $j = 0;
foreach (#output) {
print $output[$j], "\n";
$j++;
}

Perl: Strip out part of a string in an array

I read all the questions that looked similar and am not gleaning an answer.
I saw a lot of "remove this or add that" but not a "move to another array..."
This question is below all of you but I am a Perl Newblet and could really use an elegant solution help.
I have an array with an unknown # of elements, each element containing a string similar to {img_names_will_change.jpg}some unknown text.
I need a subroutine that will strip the {yadayada.jpg} from each element and add the yadayada.jpg portion to a second array.
However, I still need each element in the original array to survive but without the {....}.
I looked into using substr or regex but got lost in the syntax.
I'll be RTFM on regex as well.
If i get you right, this could be a solution:
my #names = (
'{img_names_will_change.jpg}some unknown text',
'{img_names_will_change.jpg}some unknown text',
'{img_names_will_change.jpg}some unknown text'
);
my #extract;
foreach my $name ( #names ) {
if ( $name =~ m/{(\w+\.\w+)}/ ) {
push #extract, $1;
}
}
use Data::Dumper;
print Dumper #extract;
Output
$VAR1 = 'img_names_will_change.jpg';
$VAR2 = 'img_names_will_change.jpg';
$VAR3 = 'img_names_will_change.jpg';
Extracting the Imagename with {(\w+\.\w+)} and push it into another array.
I got it. Just added the rest of the string into $2 and applied it to $original. Thanks Paulchenkiller!
foreach my $orignal ( #original ) {
#Extracts the text from within "{}" and pushes it into #images
if ( $original =~ m/{(\w+\.\w+)}(.*)/ ) {
push #images, $1;
#Strips "{..}" out of #original
$original = $2;
}
}

A Better Regex Solution in Perl?

Here's my problem:
I have text files with five columns. The last always has a single digit. Backslashes are illegal in the first three. Spaces may show up in the first column. I remove everything after the last # in the first column. The columns are separated by spaces. I can set the column width to pretty much any value I want, giving me some control as to the spacing between columns.
So, I might have something like this:
D Smith Application Database Read 2
I have code that transforms it into this:
grant read on database 'Application'.'Database' to 'D Smith';
Here is the Regex code I have created to delimit each field and avoid confusing any spaces in the first field from the delimiting spacing.
while (<>) {
s/^ //m;
if (/^([^\\]+?)( {80,})/) {
my $atindex = rindex($1,"#",);
my $username = substr($1,0,$atindex);
if ($atindex != -1) {
s/^([^\\]+?)( {80,})/$username $2/m;
s/ {2,}/ \\ \\ /g;
s/\\ \d$//gm;
s/ \\ $//gm;
}
}
What this does is make \\ \\ the delimiter between fields. Then I use this code for the transformation:
if (/([^\\]+) \\ \\ ([^\\]+) \\ \\ ([^\\]+) \\ \\ ([^\\]+)\n/) {
if ($4 eq "any") {
my $execany = "execute any";
print "grant $execany on database '$2'.'$3' to user '$1';\n";
} else {
print "grant $4 on database '$2'.'$3' to user '$1';\n";
}
I'm doing this because I couldn't figure out a way to discern the spaces between the fields from the spaces that might occur in the first field. Is there a better way? This works sufficiently quickly, but it's not elegant.
Are the columns constant width? If so, skip the regular expression and simply use substr:
Data Format
D Smith Application Database Read 2
012345678901234567890123456789012345678901234567890
Program
use strict;
use warnings;
use feature qw(say);
while ( my $line = <> ) {
chomp $line;
( my $user = substr( $line, 0, 10 )) =~ s/\s*$//;
( my $file = substr( $line, 12, 15 )) =~ s/\s*$//;
( my $db = substr( $line, 28, 12 )) =~ s/\s*$//;
( my $op = substr( $line, 41, 9 )) =~ s/\s*$//;
( my $num = substr ( $line, 50 )) =~ s/\s*$//;
say qq(User = "$user", File = "$file", DB = "$db", OP = "$op", NUM = "$num");
}
The s/\s*$//; trims the right side of the string removing white space.
If you don't want to use all of those substrings, and only your first field might have a space in it, then you can use substr to split out that first field, and split on the rest of the fields:
while ( my $line = <> ) {
chomp $line;
( my $user = substr( $line, 0, 10 ) ) =~ s/\s*$//;
my ( $file, $db, $op, $num ) = split /\s+/, substr( $line, 12 );
....
}
Another Solution
Are the columns constant width? ... Nice solution. unpack could also be used with constant widths. – Kenosis
Let's use unpack!
while ( my $line = <> ) {
chomp $line;
my ( $user, $file, $db, $op, $num ) = unpack ("A12A16A13A9A*", $line);
say qq(User = "$user", File = "$file", DB = "$db", OP = "$op", NUM = "$num");
}
Yes, that's easy to understand. At least I don't have to right trim my strings like I did with substr. See the pack/unpack tutorial.
As I describe in the comments to your question, as long as you can ensure that two simple assumptions are valid, you have no need for a lot of complicated hairy regexing. Those assumptions are:
that, for every pair of columns, at least two spaces separate the end of the value in the first column, and the beginning of the value in the second;
that no column's value contains a string of two or more spaces.
(If you can't guarantee those assumptions for a separator consisting of two or more spaces, perhaps you can for three or more, or four or more, &c. You're better off delimiting your columns with something that you can be certain will never appear in any value, but absent that, rules like these are the best you can hope to do.)
Given those assumptions, you can just split() the string on substrings of two or more spaces, something like this:
while (<>) {
$_ =~ s#^\s+##;
my #fields = split(/\s{2,}/, $_);
# print your commands, interpolating values from #fields
}
Or, more simply and readably still, you can do something like this:
while (my $line = <STDIN>) {
# the same leading-space cleanup and split...
$line =~ s#^\s+##;
my #fields = split(/\s{2,}/, $line);
# ...and then we assign values to a hash with meaningful keys...
my %values = ('user' => $fields[0],
'application' => $fields[1],
'database' => $fields[2],
'permission' => (lc($fields[3]) eq 'any'
? 'execany'
: $fields[3]));
# ...so that our interpolation and printing becomes much more
# readable.
print "grant $values{'permission'}"
. " on database '$values{'application'}'.'$values{'database'}"
. " to user '$values{'user'}';"
. "\n";
};
You'd do well also to add some validity checking, i.e. make sure all the values you expect in a given row are present and correctly formatted and emit some useful notice, or just die() outright, if they're not.
To match lines like this:
D Smith Application Database Read 2
F J Perl Foobar Database2 Write 4
Something Whatever Database3 Any 1
into the relevant columns 1 to 5, where column 1 can contain spaces, anchor on end-of-line ($):
while (<>) {
next unless /^\s*(.+?)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)$/;
my $grant_type = $4;
$grant_type = 'execute any' if lc $grant_type eq 'any';
print "grant $grant_type on '$2'.'$3' to '$1'\n";
}
result:
grant Read on 'Application'.'Database' to 'D Smith'
grant Write on 'Foobar'.'Database2' to 'F J Perl'
grant execute any on 'Whatever'.'Database3' to 'Something'
Given you have two+ spaces between fields, perhaps the following will be helpful:
use strict;
use warnings;
while (<>) {
my ( $user, $app, $db, $perm ) = grep $_, split /\s{2,}/;
$perm = 'execute any' if lc $perm eq 'any';
print "grant $perm on database '$app'.'$db' to user '$user';\n";
}
You can omit the initial-space substitution by grepping the result of split. $perm is changed only if it's any after the split.
As you say only the first column contains spaces we can use split to break up the columns,
and splice to remove the last four... Then just use string interpolation to re-constitute
the first column - no complex repular expressions required, no assumptions about fixed
column spacing and no assumptions about double spacing.. Probably want to add some more
validity checks (make sure values are valid)
use strict;
use Const::Fast qw(const);
const my $N => 4;
while(<>){
## Split the string on spaces...
chomp;
my #Q = split;
next if #Q <= $N;
## And remove the last four columns...
my ($app,$db,$perm,$flag) = splice #Q,-$N,$N;
## Sort out name and perm...
( my $user = "#Q" ) =~ s{#[^#]+}{}mxs;
$perm = 'execute any' if 'any' eq lc $perm;
## Print out statement... using named variables makes life easier!
print "grant $perm on database '$app'.'$db' to user '$user';\n";
}

regular expression code

I need to find match between two tab delimited files files like this:
File 1:
ID1 1 65383896 65383896 G C PCNXL3
ID1 2 56788990 55678900 T A ACT1
ID1 1 56788990 55678900 T A PRO55
File 2
ID2 34 65383896 65383896 G C MET5
ID2 2 56788990 55678900 T A ACT1
ID2 2 56788990 55678900 T A HLA
what I would like to do is to retrive the matching line between the two file. What I would like to match is everyting after the gene ID
So far I have written this code but unfortunately perl keeps giving me the error:
use of "Use of uninitialized value in pattern match (m//)"
Could you please help me figure out where i am doing it wrong?
Thank you in advance!
use strict;
open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach my $line (#sample1) {
my #tab = split (/\t/, $line);
my $chr = $tab[1];
my $start = $tab[2];
my $end = $tab[3];
my $ref = $tab[4];
my $alt = $tab[5];
my $name = $tab[6];
foreach my $item (#sample2){
my #fields = split (/\t/,$item);
if ( $fields[1] =~ m/$chr(.*)/
&& $fields[2] =~ m/$start(.*)/
&& $fields[4] =~ m/$ref(.*)/
&& $fields[5] =~ m/$alt(.*)/
&& $fields[6] =~ m/$name(.*)/
) {
print $line, "\n", $item;
}
}
}
On its surface your code seems to be fine (although I didn't debug it). If you don't have an error I cannot spot, could be that the input data has RE special character, which will confuse the regular expression engine when you put it as is (e.g. if any of the variable has the '$' character). Could also be that instead of tab you have spaces some where, in which case you'll indeed get an error, because your split will fail.
In any case, you'll be better off composing just one regular expression that contains all the fields. My code below is a little bit more Perl Idiomatic. I like using the implicit $_ which in my opinion makes the code more readable. I just tested it with your input files and it does the job.
use strict;
open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach (#sample1) {
(my $id, my $chr, my $start, my $end, my $ref, my $alt, my $name) =
m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
#print "$rex\n";
foreach (#sample2) {
if( m/$rex/ ) {
print "$id - $_";
}
}
}
Also, how regular is the input data? Do you have exactly one tab between the fields? If that is the case, there is no point to split the lines into 7 different fields - you only need two: the ID portion of the line, and the rest. The first regex would be
(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;
And you are searching $restOfLine within the second file in a similar technique as above.
If your files are huge and performance is an issue, you should consider putting the first regular expressions (or strings) in a map. That will give you O(n*log(m)) where n and m are the number of lines in each file.
Finally, I have a similar challenge when I need to compare logs. The logs are supposed to be identical, with the exception of a time mark at the beginning of each line. But more importantly: most lines are the same and in order. If this is what you have, and it make sense for you, you can:
First remove the IDxxx from each line: perl -pe "s/ID\d+ +//" file >cleanfile
Then use BeyondCompare or Windiff to compare the files.
I played a bit with your code. What you wrote there was actually three loops:
one over the lines of the first file,
one over the lines of the second file, and
one over all fields in these lines. You manually unrolled this loop.
The rest of this answer assumes that the files are strictly tab-seperated and that any other whitespace matters (even at the end of fields and lines).
Here is a condensed version of the code (assumes open filehandles $file1, $file2, and use strict):
my #sample2 = <$file2>;
SAMPLE_1:
foreach my $s1 (<$file1>) {
my (undef, #fields1) = split /\t/, $s1;
my #regexens = map qr{\Q$_\E(.*)}, #fields1;
SAMPLE_2:
foreach my $s2 (#sample2) {
my (undef, #fields2) = split /\t/, $s2;
for my $i (0 .. $#regexens) {
$fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
}
# only gets here if all regexes matched
print $s1, $s2;
}
}
I did some optimisations: precompiling the various regexes and storing them in an array, quoting the contents of the fields etc. However, this algorithm is O(n²), which is bad.
Here is an elegant variant of that algorithm that knows that only the first field is different — the rest of the line has to be the same character for character:
my #sample2 = <$file2>;
foreach my $s1 (<$file1>) {
foreach my $s2 (#sample2) {
print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
}
}
I just test for string equality of the rest of the line. While this algorithm is still O(n²), it outperforms the first solution roughly by an order of magnitude simply by avoiding braindead regexes here.
Finally, here is an O(n) solution. It is a variant of the previous one, but executes the loops after each other, not inside each other, therefore finishing in linear time. We use hashes:
# first loop via map
my %seen = map {reverse(split /\t/, $_, 2)}
# map {/\S/ ? $_ : () } # uncomment this line to handle empty lines
<$file1>;
# 2nd loop
foreach my $line (<$file2>) {
my ($id2, $key) = split /\t/, $line, 2;
if (defined (my $id1 = $seen{$key})) {
print "$id1\t$key";
print "$id2\t$key";
}
}
%seen is a hash that has the rest of the line as a key and the first field as a value. In the second loop, we retrieve the rest of the line again. If this line was present in the first file, we reconstruct the whole line and print it out. This solution is better than the others and scales well up- and downwards, because of its linear complexity
How about:
#!/usr/bin/perl
use File::Slurp;
use strict;
my ($ina, $inb) = #ARGV;
my #lines_a = File::Slurp::read_file($ina);
my #lines_b = File::Slurp::read_file($inb);
my $table_b = {};
my $ln = 0;
# Store all lines in second file in a hash with every different value as a hash key
# If there are several identical ones we store them also, so the hash values are lists containing the id and line number
foreach (#lines_b) {
chomp; # strip newlines
$ln++; # count current line number
my ($id, $rest) = split(m{[\t\s]+}, $_, 2); # split on whitespaces, could be too many tabs or spaces instead
if (exists $table_b->{$rest}) {
push #{ $table_b->{$rest} }, [$id, $ln]; # push to existing list if we already found an entry that is the same
} else {
$table_b->{$rest} = [ [$id, $ln] ]; # create new entry if this is the first one
}
}
# Go thru first file and print out all matches we might have
$ln = 0;
foreach (#lines_a) {
chomp;
$ln++;
my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
if (exists $table_b->{$rest}) { # if we have this entry print where it is found
print "$ina:$ln:\t\t'$id\t$rest'\n " . (join '\n ', map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } #{ $table_b->{$rest} }) . "\n";
}
}