A Better Regex Solution in Perl? - regex

Here's my problem:
I have text files with five columns. The last always has a single digit. Backslashes are illegal in the first three. Spaces may show up in the first column. I remove everything after the last # in the first column. The columns are separated by spaces. I can set the column width to pretty much any value I want, giving me some control as to the spacing between columns.
So, I might have something like this:
D Smith Application Database Read 2
I have code that transforms it into this:
grant read on database 'Application'.'Database' to 'D Smith';
Here is the Regex code I have created to delimit each field and avoid confusing any spaces in the first field from the delimiting spacing.
while (<>) {
s/^ //m;
if (/^([^\\]+?)( {80,})/) {
my $atindex = rindex($1,"#",);
my $username = substr($1,0,$atindex);
if ($atindex != -1) {
s/^([^\\]+?)( {80,})/$username $2/m;
s/ {2,}/ \\ \\ /g;
s/\\ \d$//gm;
s/ \\ $//gm;
}
}
What this does is make \\ \\ the delimiter between fields. Then I use this code for the transformation:
if (/([^\\]+) \\ \\ ([^\\]+) \\ \\ ([^\\]+) \\ \\ ([^\\]+)\n/) {
if ($4 eq "any") {
my $execany = "execute any";
print "grant $execany on database '$2'.'$3' to user '$1';\n";
} else {
print "grant $4 on database '$2'.'$3' to user '$1';\n";
}
I'm doing this because I couldn't figure out a way to discern the spaces between the fields from the spaces that might occur in the first field. Is there a better way? This works sufficiently quickly, but it's not elegant.

Are the columns constant width? If so, skip the regular expression and simply use substr:
Data Format
D Smith Application Database Read 2
012345678901234567890123456789012345678901234567890
Program
use strict;
use warnings;
use feature qw(say);
while ( my $line = <> ) {
chomp $line;
( my $user = substr( $line, 0, 10 )) =~ s/\s*$//;
( my $file = substr( $line, 12, 15 )) =~ s/\s*$//;
( my $db = substr( $line, 28, 12 )) =~ s/\s*$//;
( my $op = substr( $line, 41, 9 )) =~ s/\s*$//;
( my $num = substr ( $line, 50 )) =~ s/\s*$//;
say qq(User = "$user", File = "$file", DB = "$db", OP = "$op", NUM = "$num");
}
The s/\s*$//; trims the right side of the string removing white space.
If you don't want to use all of those substrings, and only your first field might have a space in it, then you can use substr to split out that first field, and split on the rest of the fields:
while ( my $line = <> ) {
chomp $line;
( my $user = substr( $line, 0, 10 ) ) =~ s/\s*$//;
my ( $file, $db, $op, $num ) = split /\s+/, substr( $line, 12 );
....
}
Another Solution
Are the columns constant width? ... Nice solution. unpack could also be used with constant widths. – Kenosis
Let's use unpack!
while ( my $line = <> ) {
chomp $line;
my ( $user, $file, $db, $op, $num ) = unpack ("A12A16A13A9A*", $line);
say qq(User = "$user", File = "$file", DB = "$db", OP = "$op", NUM = "$num");
}
Yes, that's easy to understand. At least I don't have to right trim my strings like I did with substr. See the pack/unpack tutorial.

As I describe in the comments to your question, as long as you can ensure that two simple assumptions are valid, you have no need for a lot of complicated hairy regexing. Those assumptions are:
that, for every pair of columns, at least two spaces separate the end of the value in the first column, and the beginning of the value in the second;
that no column's value contains a string of two or more spaces.
(If you can't guarantee those assumptions for a separator consisting of two or more spaces, perhaps you can for three or more, or four or more, &c. You're better off delimiting your columns with something that you can be certain will never appear in any value, but absent that, rules like these are the best you can hope to do.)
Given those assumptions, you can just split() the string on substrings of two or more spaces, something like this:
while (<>) {
$_ =~ s#^\s+##;
my #fields = split(/\s{2,}/, $_);
# print your commands, interpolating values from #fields
}
Or, more simply and readably still, you can do something like this:
while (my $line = <STDIN>) {
# the same leading-space cleanup and split...
$line =~ s#^\s+##;
my #fields = split(/\s{2,}/, $line);
# ...and then we assign values to a hash with meaningful keys...
my %values = ('user' => $fields[0],
'application' => $fields[1],
'database' => $fields[2],
'permission' => (lc($fields[3]) eq 'any'
? 'execany'
: $fields[3]));
# ...so that our interpolation and printing becomes much more
# readable.
print "grant $values{'permission'}"
. " on database '$values{'application'}'.'$values{'database'}"
. " to user '$values{'user'}';"
. "\n";
};
You'd do well also to add some validity checking, i.e. make sure all the values you expect in a given row are present and correctly formatted and emit some useful notice, or just die() outright, if they're not.

To match lines like this:
D Smith Application Database Read 2
F J Perl Foobar Database2 Write 4
Something Whatever Database3 Any 1
into the relevant columns 1 to 5, where column 1 can contain spaces, anchor on end-of-line ($):
while (<>) {
next unless /^\s*(.+?)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)$/;
my $grant_type = $4;
$grant_type = 'execute any' if lc $grant_type eq 'any';
print "grant $grant_type on '$2'.'$3' to '$1'\n";
}
result:
grant Read on 'Application'.'Database' to 'D Smith'
grant Write on 'Foobar'.'Database2' to 'F J Perl'
grant execute any on 'Whatever'.'Database3' to 'Something'

Given you have two+ spaces between fields, perhaps the following will be helpful:
use strict;
use warnings;
while (<>) {
my ( $user, $app, $db, $perm ) = grep $_, split /\s{2,}/;
$perm = 'execute any' if lc $perm eq 'any';
print "grant $perm on database '$app'.'$db' to user '$user';\n";
}
You can omit the initial-space substitution by grepping the result of split. $perm is changed only if it's any after the split.

As you say only the first column contains spaces we can use split to break up the columns,
and splice to remove the last four... Then just use string interpolation to re-constitute
the first column - no complex repular expressions required, no assumptions about fixed
column spacing and no assumptions about double spacing.. Probably want to add some more
validity checks (make sure values are valid)
use strict;
use Const::Fast qw(const);
const my $N => 4;
while(<>){
## Split the string on spaces...
chomp;
my #Q = split;
next if #Q <= $N;
## And remove the last four columns...
my ($app,$db,$perm,$flag) = splice #Q,-$N,$N;
## Sort out name and perm...
( my $user = "#Q" ) =~ s{#[^#]+}{}mxs;
$perm = 'execute any' if 'any' eq lc $perm;
## Print out statement... using named variables makes life easier!
print "grant $perm on database '$app'.'$db' to user '$user';\n";
}

Related

perl count line in double looping, if match regular expression plus 1

I open a file by putting the line to an array. Inside this file based on the regular expression that contains a duplicate value. If the regular expression is a match I want to count it. The regular expression may look like this
$b =~ /\/([^\/]+)##/. I want to match $1 value.
my #array = do
{
open my $FH, '<', 'abc.txt' or die 'unable to open the file\n';
<$FH>;
};
Below is the way I do, it will get the same line in my file. Thank for help.
foreach my $b (#array)
{
$conflictTemp = 0;
$b =~ /\/([^\/]+)##/;
$b = $1;
#print "$b\n";
foreach my $c (#array)
{
$c =~ /\/([^\/]+)##/;
$c = $1;
if($b eq $c)
{
$conflictTemp ++;
#print "$b , $c \n"
#if($conflictTemp > 1)
#{
# $conflict ++;
#}
}
}
}
Below is the some sample data, two sentences are duplicates
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
It looks like you're trying to iterate each element of the array, select some data via pattern match, and then count dupes. Is that correct?
Would it not be easier to:
my %count_of;
while ( <$FH> ) {
my ( $val ) = /\/([^\/]+)##/;
$count_of{$val}++;
}
And then, for the variables that have more than one (e.g. there's a duplicate):
print join "\n", grep { $count_of{$_} > 1 } keys %count_of;
Alternatively, if you're just wanting to play 'spot the dupe':
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my $match = qr/\/([^\/]+)##/;
while ( <DATA> ) {
my ( $value ) = m/$match/ or next;
print if $seen{$value}++;
}
__DATA__
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
The problem has been solved by the previous answer - I just want to offer an alternate flavour that;
Spells out the regex
Uses the %seen hash to record the line the pattern first appears; to enable
slightly more detailed reporting
use v5.12;
use warnings;
my $regex = qr/
\/ # A literal slash followed by
( # Capture to $1 ...
[^\/]+ # ... anything that's not a slash
) # close capture to $1
## # Must be immdiately followed by literal ##
/x;
my %line_num ;
while (<>) {
next unless /$regex/ ;
my $pattern = $1 ;
if ( $line_num{ $pattern } ) {
say "'$pattern' appears on lines ", $line_num{ $pattern }, " and $." ;
next ;
}
$line_num{ $pattern } = $. ; # Record the line number
}
# Ran on data above will produce;
# '.cdtproject' appears on lines 7 and 8
# '.cdtbuild' appears on lines 10 and 11

Perl regex to replace part of one string with a portion of another

I have a need in Perl to replace a section of one string with most of another. :-) This needs be done for multiple pairs of strings.
For example, I need to replace
"/root_vdm_2/fs_clsnymigration"
within
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
with
rfsn_clsnymigration
so that I end up with
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
(without the leading "/root_vdm_2" part) ... but I am sufficiently sleep-deprived to have lost sight of how to accomplish this.
Help ?
Try this regex:
^\/root_vdm_2\/fs_clsnymigration
Substitute with:
\/rfsn_clsnymigration
example:
$string = "/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1";
$string=~s/^\/root_vdm_2\/fs_clsnymigration/\/rfsn_clsnymigration/;
print $string;
Output:
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
EDIT 1
$string = "/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02";
$string=~s/^\/root_vdm_.\/fs_[^\/]*/\/rfsn_clsnymigration/gm;
print $string;
Output:
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/rfsn_clsnymigration/users/Marketing,rfsw_users
/rfsn_clsnymigration/sandi_users,rfsw_sandi
/rfsn_clsnymigration/Analytics,rfsw_pci
/rfsn_clsnymigration/camnt01/AV,rfsw_camnt01
/rfsn_clsnymigration/sfa,rfss_stcloud01
/rfsn_clsnymigration/data4,rfss_stcloud03
/rfsn_clsnymigration/depart1,rfss_stcloud02
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($lhs, $rhs) = split(/,/, $_, 2);
my #parts = split(/\//, $lhs);
splice(#parts, 1, 2, $rhs);
print join('/', #parts) . "\n";
}
__DATA__
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02
My challenge was to replace part of $string1 with all of $string2, split on the commas.
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02
The difficulty I saw initially was how to replace /root_vdm_2/fs_clsnymigration with rfsn_clsnymigration, and I allowed myself to think that a regexp was the best approach.
Although far less eloquent, this gets the job done:
foreach $line (#lines) {
chop $line;
($orig,$replica) = split /\,/, $line;
chop substr $orig, 0, 1;
#pathparts = split /\//, $orig;
$rootvdm = shift #pathparts;
#pathparts[0] = $replica;
$newpath = "/" . join ('/', #pathparts);
print " here's \$newpath:$newpath\n";
}
... which yields something like this:
here's $newpath:/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU
here's $newpath:/rfsw_users/users/Marketing
here's $newpath:/rfsw_sandi/sandi_users
here's $newpath:/rfsw_pci/Analytics
here's $newpath:/rfsw_camnt01/camnt01/AV
here's $newpath:/rfss_stcloud01/sfa
here's $newpath:/rfss_stcloud03/data4
here's $newpath:/rfss_stcloud02/depart1

How to remove strings which do not start or end with specific substring?

Unfortunately, I'm not a regex expert, so I need a little help.
I'm looking for the solution how to grep an array of strings to get two lists of strings which do not start (1) or end (2) with the specific substring.
Let's assume we have an array with strings matching to the following rule:
[speakerId]-[phrase]-[id].txt
i.e.
10-phraseone-10.txt 11-phraseone-3.txt 1-phraseone-2.txt
2-phraseone-1.txt 3-phraseone-1.txt 4-phraseone-1.txt
5-phraseone-3.txt 6-phraseone-2.txt 7-phraseone-2.txt
8-phraseone-10.txt 9-phraseone-2.txt 10-phrasetwo-1.txt
11-phrasetwo-1.txt 1-phrasetwo-1.txt 2-phrasetwo-1.txt
3-phrasetwo-1.txt 4-phrasetwo-1.txt 5-phrasetwo-1.txt
6-phrasetwo-3.txt 7-phrasetwo-10.txt 8-phrasetwo-1.txt
9-phrasetwo-1.txt 10-phrasethree-10.txt 11-phrasethree-3.txt
1-phrasethree-1.txt 2-phrasethree-11.txt 3-phrasethree-1.txt
4-phrasethree-3.txt 5-phrasethree-1.txt 6-phrasethree-3.txt
7-phrasethree-1.txt 8-phrasethree-1.txt 9-phrasethree-1.txt
Let's introduce variables:
$speakerId
$phrase
$id1, $id2
I would like to grep a list and obtain an array:
with elements which contain specific $phrase but we exclude those strigns which simultaneously start with specific $speakerId AND end with one of specified id's (for instance $id1 or $id2)
with elements which have specific $speakerId and $phrase but do NOT contain one of specific ids at the end (warning: remember to not exclude the 10 or 11 for $id=1 , etc.)
Maybe someone coulde use the following code to write the solution:
#AllEntries = readdir(INPUTDIR);
#Result1 = grep(/blablablahere/, #AllEntries);
#Result2 = grep(/anotherblablabla/, #AllEntries);
closedir(INPUTDIR);
Assuming a basic pattern to match your example:
(?:^|\b)(\d+)-(\w+)-(?!1|2)(\d+)\.txt(?:\b|$)
Which breaks down as:
(?:^|\b) # starts with a new line or a word delimeter
(\d+)- # speakerid and a hyphen
(\w+)- # phrase and a hyphen
(\d+) # id
\.txt # file extension
(?:\b|$) # end of line or word delimeter
You can assert exclusions using negative look-ahead. For instance, to include all matches that do not have the phrase phrasetwo you can modify the above expression to use a negative look-ahead:
(?:^|\b)(\d+)-(?!phrasetwo)(\w+)-(\d+)\.txt(?:\b|$)
Note how I include (?!phrasetwo). Alternatively, you find all phrasethree entries that end in an even number by using a look-behind instead of a look-ahead:
(?:^|\b)(\d+)-phrasethree-(\d+)(?<![13579])\.txt(?:\b|$)
(?<![13579]) just makes sure the last number of the ID falls on an even number.
It sounds a bit like you're describing a query function.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
my ( $set_a, $set_b ) = query( 2, 'phrasethree', [ 1, 3 ] );
print Dumper( { a => $set_a, b => $set_b } );
# a) fetch elements which
# 1. match $phrase
# 2. exclude $speakerId
# 3. match #ids
# b) fetch elements which
# 1. match $phrase
# 2. match $speakerId
# 3. exclude #ids
sub query {
my ( $speakerId, $passPhrase, $id_ra ) = #_;
my %has_id = map { ( $_ => 0 ) } #{$id_ra};
my ( #a, #b );
while ( my $filename = glob '*.txt' ) {
if ( $filename =~ m{\A ( \d+ )-( .+? )-( \d+ ) [.] txt \z}xms ) {
my ( $_speakerId, $_passPhrase, $_id ) = ( $1, $2, $3 );
if ( $_passPhrase eq $passPhrase ) {
if ( $_speakerId ne $speakerId
&& exists $has_id{$_id} )
{
push #a, $filename;
}
if ( $_speakerId eq $speakerId
&& !exists $has_id{$_id} )
{
push #b, $filename;
}
}
}
}
return ( \#a, \#b );
}
I like the approach with pure regular expressions using negative lookaheads and -behinds. However, it's a little bit hard to read. Maybe code like this could be more self-explanatory. It uses standard perl idioms that are readable like english in some cases:
my #all_entries = readdir(...);
my #matching_entries = ();
foreach my $entry (#all_entries) {
# split file name
next unless /^(\d+)-(.*?)-(\d+).txt$/;
my ($sid, $phrase, $id) = ($1, $2, $3);
# filter
next unless $sid eq "foo";
next unless $id == 42 or $phrase eq "bar";
# more readable filter rules
# match
push #matching_entries, $entry;
}
# do something with #matching_entries
If you really want to express something that complex in a grep list transformation, you could write code like this:
my #matching_entries = grep {
/^(\d)-(.*?)-(\d+).txt$/
and $1 eq "foo"
and ($3 == 42 or $phrase eq "bar")
# and so on
} readdir(...)

regular expression code

I need to find match between two tab delimited files files like this:
File 1:
ID1 1 65383896 65383896 G C PCNXL3
ID1 2 56788990 55678900 T A ACT1
ID1 1 56788990 55678900 T A PRO55
File 2
ID2 34 65383896 65383896 G C MET5
ID2 2 56788990 55678900 T A ACT1
ID2 2 56788990 55678900 T A HLA
what I would like to do is to retrive the matching line between the two file. What I would like to match is everyting after the gene ID
So far I have written this code but unfortunately perl keeps giving me the error:
use of "Use of uninitialized value in pattern match (m//)"
Could you please help me figure out where i am doing it wrong?
Thank you in advance!
use strict;
open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach my $line (#sample1) {
my #tab = split (/\t/, $line);
my $chr = $tab[1];
my $start = $tab[2];
my $end = $tab[3];
my $ref = $tab[4];
my $alt = $tab[5];
my $name = $tab[6];
foreach my $item (#sample2){
my #fields = split (/\t/,$item);
if ( $fields[1] =~ m/$chr(.*)/
&& $fields[2] =~ m/$start(.*)/
&& $fields[4] =~ m/$ref(.*)/
&& $fields[5] =~ m/$alt(.*)/
&& $fields[6] =~ m/$name(.*)/
) {
print $line, "\n", $item;
}
}
}
On its surface your code seems to be fine (although I didn't debug it). If you don't have an error I cannot spot, could be that the input data has RE special character, which will confuse the regular expression engine when you put it as is (e.g. if any of the variable has the '$' character). Could also be that instead of tab you have spaces some where, in which case you'll indeed get an error, because your split will fail.
In any case, you'll be better off composing just one regular expression that contains all the fields. My code below is a little bit more Perl Idiomatic. I like using the implicit $_ which in my opinion makes the code more readable. I just tested it with your input files and it does the job.
use strict;
open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach (#sample1) {
(my $id, my $chr, my $start, my $end, my $ref, my $alt, my $name) =
m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
#print "$rex\n";
foreach (#sample2) {
if( m/$rex/ ) {
print "$id - $_";
}
}
}
Also, how regular is the input data? Do you have exactly one tab between the fields? If that is the case, there is no point to split the lines into 7 different fields - you only need two: the ID portion of the line, and the rest. The first regex would be
(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;
And you are searching $restOfLine within the second file in a similar technique as above.
If your files are huge and performance is an issue, you should consider putting the first regular expressions (or strings) in a map. That will give you O(n*log(m)) where n and m are the number of lines in each file.
Finally, I have a similar challenge when I need to compare logs. The logs are supposed to be identical, with the exception of a time mark at the beginning of each line. But more importantly: most lines are the same and in order. If this is what you have, and it make sense for you, you can:
First remove the IDxxx from each line: perl -pe "s/ID\d+ +//" file >cleanfile
Then use BeyondCompare or Windiff to compare the files.
I played a bit with your code. What you wrote there was actually three loops:
one over the lines of the first file,
one over the lines of the second file, and
one over all fields in these lines. You manually unrolled this loop.
The rest of this answer assumes that the files are strictly tab-seperated and that any other whitespace matters (even at the end of fields and lines).
Here is a condensed version of the code (assumes open filehandles $file1, $file2, and use strict):
my #sample2 = <$file2>;
SAMPLE_1:
foreach my $s1 (<$file1>) {
my (undef, #fields1) = split /\t/, $s1;
my #regexens = map qr{\Q$_\E(.*)}, #fields1;
SAMPLE_2:
foreach my $s2 (#sample2) {
my (undef, #fields2) = split /\t/, $s2;
for my $i (0 .. $#regexens) {
$fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
}
# only gets here if all regexes matched
print $s1, $s2;
}
}
I did some optimisations: precompiling the various regexes and storing them in an array, quoting the contents of the fields etc. However, this algorithm is O(n²), which is bad.
Here is an elegant variant of that algorithm that knows that only the first field is different — the rest of the line has to be the same character for character:
my #sample2 = <$file2>;
foreach my $s1 (<$file1>) {
foreach my $s2 (#sample2) {
print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
}
}
I just test for string equality of the rest of the line. While this algorithm is still O(n²), it outperforms the first solution roughly by an order of magnitude simply by avoiding braindead regexes here.
Finally, here is an O(n) solution. It is a variant of the previous one, but executes the loops after each other, not inside each other, therefore finishing in linear time. We use hashes:
# first loop via map
my %seen = map {reverse(split /\t/, $_, 2)}
# map {/\S/ ? $_ : () } # uncomment this line to handle empty lines
<$file1>;
# 2nd loop
foreach my $line (<$file2>) {
my ($id2, $key) = split /\t/, $line, 2;
if (defined (my $id1 = $seen{$key})) {
print "$id1\t$key";
print "$id2\t$key";
}
}
%seen is a hash that has the rest of the line as a key and the first field as a value. In the second loop, we retrieve the rest of the line again. If this line was present in the first file, we reconstruct the whole line and print it out. This solution is better than the others and scales well up- and downwards, because of its linear complexity
How about:
#!/usr/bin/perl
use File::Slurp;
use strict;
my ($ina, $inb) = #ARGV;
my #lines_a = File::Slurp::read_file($ina);
my #lines_b = File::Slurp::read_file($inb);
my $table_b = {};
my $ln = 0;
# Store all lines in second file in a hash with every different value as a hash key
# If there are several identical ones we store them also, so the hash values are lists containing the id and line number
foreach (#lines_b) {
chomp; # strip newlines
$ln++; # count current line number
my ($id, $rest) = split(m{[\t\s]+}, $_, 2); # split on whitespaces, could be too many tabs or spaces instead
if (exists $table_b->{$rest}) {
push #{ $table_b->{$rest} }, [$id, $ln]; # push to existing list if we already found an entry that is the same
} else {
$table_b->{$rest} = [ [$id, $ln] ]; # create new entry if this is the first one
}
}
# Go thru first file and print out all matches we might have
$ln = 0;
foreach (#lines_a) {
chomp;
$ln++;
my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
if (exists $table_b->{$rest}) { # if we have this entry print where it is found
print "$ina:$ln:\t\t'$id\t$rest'\n " . (join '\n ', map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } #{ $table_b->{$rest} }) . "\n";
}
}

Values from IF statement regex match (Perl)

I'm currently extracting values from a table within a file via REGEX line matches against the table rows.
foreach my $line (split("\n", $file)) {
if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
my ($val1, $val2, $val3) = ($1, $2, $3);
# $val's used here
}
}
I purposely assign vals for clarity in the code. Some of my table rows contain 10+ vals (aka columns) - is there a more efficient method of assigning the vals instead of doing ... = ($1, $2, ..., $n)?
A match in list context yields a list of the capture groups. If it fails, it returns an empty list, which is false. You can therefore
if( my ( $val1, $val2, $val3 ) = $line =~ m/^(\S+)\s*(\S+)\s*(\S+)$/ ) {
...
}
However, a number of red flags are apparent in this code. That regexp capture looks very similar to a split:
if( my ( $val2, $val2, $val3 ) = split ' ', $line ) {
...
}
Secondly, why split $file by linefeeds; if you are reading the contents of a file, far nicer is to actually read a single line at once:
while( my $line = <$fh> ) {
...
}
I assume that this is not your actual code, because if so, it will not work:
foreach my $line (split("\n", $file)) {
if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
my ($val1, $val2, $val3) = ($1, $2, $3);
}
# all the $valX variables are now out of scope
}
You should also be aware that \s* will also match the empty string, and may cause subtle errors. For example:
"a bug" =~ /^(\S+)\s*(\S+)\s*(\S+)$/;
# the captures are now: $1 = "a"; $2 = "bu"; $3 = "g"
Even despite the fact that \S+ is greedy, the anchors ^ ... $ will force the regex to fit, hence allowing the empty strings to split the words.
If your intention is to capture all the words that are separated by whitespace, using split is your best option, as others have already mentioned.
open my $fh, "<", "file.txt" or die $!;
my #stored;
while (<$fh>) {
my #vals = split;
push(#stored, \#vals) if #vals; # ignore empty values
}
This will store any captured values into a two-dimensional array. Using the file handle directly and reading line-by-line is the preferred method, unless for some reason you actually need to have the entire file in memory.
Looks like you are just using a table with a space delimiter.You can use the split function:
#valuearray = split(" ", $line)
And then address the elements as:
#valuearray[0] ,#valuearray[1] etc..