Values from IF statement regex match (Perl) - regex

I'm currently extracting values from a table within a file via REGEX line matches against the table rows.
foreach my $line (split("\n", $file)) {
if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
my ($val1, $val2, $val3) = ($1, $2, $3);
# $val's used here
}
}
I purposely assign vals for clarity in the code. Some of my table rows contain 10+ vals (aka columns) - is there a more efficient method of assigning the vals instead of doing ... = ($1, $2, ..., $n)?

A match in list context yields a list of the capture groups. If it fails, it returns an empty list, which is false. You can therefore
if( my ( $val1, $val2, $val3 ) = $line =~ m/^(\S+)\s*(\S+)\s*(\S+)$/ ) {
...
}
However, a number of red flags are apparent in this code. That regexp capture looks very similar to a split:
if( my ( $val2, $val2, $val3 ) = split ' ', $line ) {
...
}
Secondly, why split $file by linefeeds; if you are reading the contents of a file, far nicer is to actually read a single line at once:
while( my $line = <$fh> ) {
...
}

I assume that this is not your actual code, because if so, it will not work:
foreach my $line (split("\n", $file)) {
if ($line =~ /^(\S+)\s*(\S+)\s*(\S+)$/) {
my ($val1, $val2, $val3) = ($1, $2, $3);
}
# all the $valX variables are now out of scope
}
You should also be aware that \s* will also match the empty string, and may cause subtle errors. For example:
"a bug" =~ /^(\S+)\s*(\S+)\s*(\S+)$/;
# the captures are now: $1 = "a"; $2 = "bu"; $3 = "g"
Even despite the fact that \S+ is greedy, the anchors ^ ... $ will force the regex to fit, hence allowing the empty strings to split the words.
If your intention is to capture all the words that are separated by whitespace, using split is your best option, as others have already mentioned.
open my $fh, "<", "file.txt" or die $!;
my #stored;
while (<$fh>) {
my #vals = split;
push(#stored, \#vals) if #vals; # ignore empty values
}
This will store any captured values into a two-dimensional array. Using the file handle directly and reading line-by-line is the preferred method, unless for some reason you actually need to have the entire file in memory.

Looks like you are just using a table with a space delimiter.You can use the split function:
#valuearray = split(" ", $line)
And then address the elements as:
#valuearray[0] ,#valuearray[1] etc..

Related

The perfect regex to extract a particular string in perl

I have a text file abc.txt that looks like this:
dQdC(sA1B2C3,sC5) = A lot of stuff
a = b = c
Baseball
dQdC(sC2V3X1,sD5) = A lot of stuff again
Now I want create two arrays in perl, one of which will contain A1B2C3 and C2V3X1, the other array will contain C5 and D5. I don't care about the other intermediate lines. To achieve this goal, I am trying this perl script:
for (my $in=0;$in<=$#lines;$in++){
if ($lines[$in]=~/dQdC\(s([A-Z0-9]+?),s([A-Z0-9]+?)\)/) {
print "1111"; #this line is just to check if it is at all going inside the loop
#A = $1;
#B = $2;
}
However, it is not even going inside the loop. So I guess I did something wrong with the regex. Will someone please tell me what I am doing wrong here?
my (#a, #b);
while ($file =~ /^dQdC\(s(\w+),s(\w+)\)/mg) {
push #a, $1;
push #b, $2;
}
or
my (#a, #b);
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
push #a, $1;
push #b, $2;
}
}
Working with parallel arrays isn't nice.
Alternative 1: Hash
my %hash = $file =~ /^dQdC\(s(\w+),s(\w+)\)/mg;
or
my %hash;
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
$hash{$1} = $2;
}
}
Alternative 2: AoA
use List::Util qw( pairs ); # 1.29+
my #pairs = pairs( $file =~ /^dQdC\(s(\w+),s(\w+)\)/mg );
or
my #pairs;
while (<$fh>) {
if (/^dQdC\(s(\w+),s(\w+)\)/) {
push #pairs, [ $1, $2 ];
}
}
If the format of your target lines is always as shown
use warnings;
use strict;
my $file = ...
my (#ary_1, #ary_2);
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>)
{
my ($v1, $v2) = /dQdC\(s([^,]+),s([^\)]+)/ or next;
push #ary_1, $v1;
push #ary_2, $v2;
}
which captures between ( and a , and then between a , and ). The first pattern might as well be s(.*?), as there is no benefit of the negated character class since the following , still need be matched (but I left it with [^...] for consistency with the other one).
Comments
In general better process a file line-by-line, unless there are specific reasons to read it first
C-style loop is rarely needed. To iterate over array index use for my $i (0..$#ary)
Please use warnings; and use strict; always
Try this:
(?<=\(s)([A-Z0-9]+)(?=,)
It matches substrings that come between (s and , using lookbehind and lookahead.
Similarily, use (?<=,s)([A-Z0-9]+)(?=\)) to capture the substrings between ,s and ).
Putting them together, you can create two capturing groups, each containing the different kind of substrings: (A1B2C3, C2V3X1), (C5, D5)

Save result from flip-flop in variable?

I have about 1kB of text from STDIN
my $f = join("", <STDIN>);
and I would like to get the content between open1 and close1, so /open1/../close1/ comes to mind.
I have only seen it been used in one liners and in scripts in while-loops and $_.
Question
How can I get the result from /open1/../close1/ in my script when everything is in $f?
Capturing all matches with a single regular expression
If you want to capture all the lines between open1 and start1 markers (excluding the markers), it is easily done with a single regular expression:
my $f = join("", <STDIN>);
my #matches = ( $f =~ m/\bopen1\b(.*?)\bclose1\b/gs );
for my $m (#matches) {
print "$m";
}
where
s modifier treats the string as a single line;
g modifier captures all the matches;
(.*?) matches a group of any characters using the lazy quantifier
Using the range operator
The range operator (so-called flip-flop) is not very convenient for this task if you want to avoid capturing the markers, because an expression like /open1/ .. /close1/ returns true for the lines matching the patterns.
The expression /^open1$/ .. /^close1$/ returns false until /^open1$/ is true. The left regular expression stops being evaluated once it matches the line, and keeps returning true until /^close1$/ becomes true. When the right expression matches, the cycle is restarted. Thus, the open1 and close1 markers are included into $matches.
It is even less convenient, if the input is stored in a variable, because you will need to read the contents of the variable line by line, e.g.:
my $matches = "";
my #lines = split /\n/, $f;
foreach my $line (#lines) {
if ($line =~ m/^open1$/ .. $line =~ m/^close1$/) {
$matches .= "$line\n";
}
}
Note, it is possible to use arbitrary Perl expressions as operands of the range operator. I wouldn't recommend this code, as it is not very efficient, and not very readable. At the same time it is easy to adapt the first example to the case where the open1 and close1 markers are included into the set of matches, e.g.:
my #matches = ( $f =~ m/\bopen1\b(.*?)\bclose1\b/gs );
for my $m (#matches) {
print "open1${m}close1\n";
}
You can rewrite how $f is generated so that it takes advantage of the flip-flop inside a while loop:
my ( $f, $matched );
while ( <> ) {
$f .= $_;
$matched .= $_ if /open1/ .. /close1/;
}
Another way is to create a new inputs stream out of the contents of $f.
open my $fh, '<', \$f;
while (<$fh>) {
if (/open1/ .. /close1/) {
...
}
}
You can also employ split. To get what is between the first pair of open1 and close1
my $open_to_close = (split /open1|close1/, $f)[1];
The delimiter can be either open1 or close1, so returned is a list of three elements: before open1, between them, and after close1. We take the second element.
If there are more open1/close1 pairs take all odd-indexed elements.
Either get the array as well
my #parts = split /open1|close1/, $f;
my #all_open_to_close = #parts[ grep { $_ & 1 } 0..$#parts ];
or get it directly from the list
my #all_open_to_close =
grep { CORE::state $i; ++$i % 2 == 0 } split /open1|close1/, $f;
The state is a feature
from v5.10. If you already use that you don't need CORE:: prefix.

perl count line in double looping, if match regular expression plus 1

I open a file by putting the line to an array. Inside this file based on the regular expression that contains a duplicate value. If the regular expression is a match I want to count it. The regular expression may look like this
$b =~ /\/([^\/]+)##/. I want to match $1 value.
my #array = do
{
open my $FH, '<', 'abc.txt' or die 'unable to open the file\n';
<$FH>;
};
Below is the way I do, it will get the same line in my file. Thank for help.
foreach my $b (#array)
{
$conflictTemp = 0;
$b =~ /\/([^\/]+)##/;
$b = $1;
#print "$b\n";
foreach my $c (#array)
{
$c =~ /\/([^\/]+)##/;
$c = $1;
if($b eq $c)
{
$conflictTemp ++;
#print "$b , $c \n"
#if($conflictTemp > 1)
#{
# $conflict ++;
#}
}
}
}
Below is the some sample data, two sentences are duplicates
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
It looks like you're trying to iterate each element of the array, select some data via pattern match, and then count dupes. Is that correct?
Would it not be easier to:
my %count_of;
while ( <$FH> ) {
my ( $val ) = /\/([^\/]+)##/;
$count_of{$val}++;
}
And then, for the variables that have more than one (e.g. there's a duplicate):
print join "\n", grep { $count_of{$_} > 1 } keys %count_of;
Alternatively, if you're just wanting to play 'spot the dupe':
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my $match = qr/\/([^\/]+)##/;
while ( <DATA> ) {
my ( $value ) = m/$match/ or next;
print if $seen{$value}++;
}
__DATA__
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
The problem has been solved by the previous answer - I just want to offer an alternate flavour that;
Spells out the regex
Uses the %seen hash to record the line the pattern first appears; to enable
slightly more detailed reporting
use v5.12;
use warnings;
my $regex = qr/
\/ # A literal slash followed by
( # Capture to $1 ...
[^\/]+ # ... anything that's not a slash
) # close capture to $1
## # Must be immdiately followed by literal ##
/x;
my %line_num ;
while (<>) {
next unless /$regex/ ;
my $pattern = $1 ;
if ( $line_num{ $pattern } ) {
say "'$pattern' appears on lines ", $line_num{ $pattern }, " and $." ;
next ;
}
$line_num{ $pattern } = $. ; # Record the line number
}
# Ran on data above will produce;
# '.cdtproject' appears on lines 7 and 8
# '.cdtbuild' appears on lines 10 and 11

A Better Regex Solution in Perl?

Here's my problem:
I have text files with five columns. The last always has a single digit. Backslashes are illegal in the first three. Spaces may show up in the first column. I remove everything after the last # in the first column. The columns are separated by spaces. I can set the column width to pretty much any value I want, giving me some control as to the spacing between columns.
So, I might have something like this:
D Smith Application Database Read 2
I have code that transforms it into this:
grant read on database 'Application'.'Database' to 'D Smith';
Here is the Regex code I have created to delimit each field and avoid confusing any spaces in the first field from the delimiting spacing.
while (<>) {
s/^ //m;
if (/^([^\\]+?)( {80,})/) {
my $atindex = rindex($1,"#",);
my $username = substr($1,0,$atindex);
if ($atindex != -1) {
s/^([^\\]+?)( {80,})/$username $2/m;
s/ {2,}/ \\ \\ /g;
s/\\ \d$//gm;
s/ \\ $//gm;
}
}
What this does is make \\ \\ the delimiter between fields. Then I use this code for the transformation:
if (/([^\\]+) \\ \\ ([^\\]+) \\ \\ ([^\\]+) \\ \\ ([^\\]+)\n/) {
if ($4 eq "any") {
my $execany = "execute any";
print "grant $execany on database '$2'.'$3' to user '$1';\n";
} else {
print "grant $4 on database '$2'.'$3' to user '$1';\n";
}
I'm doing this because I couldn't figure out a way to discern the spaces between the fields from the spaces that might occur in the first field. Is there a better way? This works sufficiently quickly, but it's not elegant.
Are the columns constant width? If so, skip the regular expression and simply use substr:
Data Format
D Smith Application Database Read 2
012345678901234567890123456789012345678901234567890
Program
use strict;
use warnings;
use feature qw(say);
while ( my $line = <> ) {
chomp $line;
( my $user = substr( $line, 0, 10 )) =~ s/\s*$//;
( my $file = substr( $line, 12, 15 )) =~ s/\s*$//;
( my $db = substr( $line, 28, 12 )) =~ s/\s*$//;
( my $op = substr( $line, 41, 9 )) =~ s/\s*$//;
( my $num = substr ( $line, 50 )) =~ s/\s*$//;
say qq(User = "$user", File = "$file", DB = "$db", OP = "$op", NUM = "$num");
}
The s/\s*$//; trims the right side of the string removing white space.
If you don't want to use all of those substrings, and only your first field might have a space in it, then you can use substr to split out that first field, and split on the rest of the fields:
while ( my $line = <> ) {
chomp $line;
( my $user = substr( $line, 0, 10 ) ) =~ s/\s*$//;
my ( $file, $db, $op, $num ) = split /\s+/, substr( $line, 12 );
....
}
Another Solution
Are the columns constant width? ... Nice solution. unpack could also be used with constant widths. – Kenosis
Let's use unpack!
while ( my $line = <> ) {
chomp $line;
my ( $user, $file, $db, $op, $num ) = unpack ("A12A16A13A9A*", $line);
say qq(User = "$user", File = "$file", DB = "$db", OP = "$op", NUM = "$num");
}
Yes, that's easy to understand. At least I don't have to right trim my strings like I did with substr. See the pack/unpack tutorial.
As I describe in the comments to your question, as long as you can ensure that two simple assumptions are valid, you have no need for a lot of complicated hairy regexing. Those assumptions are:
that, for every pair of columns, at least two spaces separate the end of the value in the first column, and the beginning of the value in the second;
that no column's value contains a string of two or more spaces.
(If you can't guarantee those assumptions for a separator consisting of two or more spaces, perhaps you can for three or more, or four or more, &c. You're better off delimiting your columns with something that you can be certain will never appear in any value, but absent that, rules like these are the best you can hope to do.)
Given those assumptions, you can just split() the string on substrings of two or more spaces, something like this:
while (<>) {
$_ =~ s#^\s+##;
my #fields = split(/\s{2,}/, $_);
# print your commands, interpolating values from #fields
}
Or, more simply and readably still, you can do something like this:
while (my $line = <STDIN>) {
# the same leading-space cleanup and split...
$line =~ s#^\s+##;
my #fields = split(/\s{2,}/, $line);
# ...and then we assign values to a hash with meaningful keys...
my %values = ('user' => $fields[0],
'application' => $fields[1],
'database' => $fields[2],
'permission' => (lc($fields[3]) eq 'any'
? 'execany'
: $fields[3]));
# ...so that our interpolation and printing becomes much more
# readable.
print "grant $values{'permission'}"
. " on database '$values{'application'}'.'$values{'database'}"
. " to user '$values{'user'}';"
. "\n";
};
You'd do well also to add some validity checking, i.e. make sure all the values you expect in a given row are present and correctly formatted and emit some useful notice, or just die() outright, if they're not.
To match lines like this:
D Smith Application Database Read 2
F J Perl Foobar Database2 Write 4
Something Whatever Database3 Any 1
into the relevant columns 1 to 5, where column 1 can contain spaces, anchor on end-of-line ($):
while (<>) {
next unless /^\s*(.+?)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)$/;
my $grant_type = $4;
$grant_type = 'execute any' if lc $grant_type eq 'any';
print "grant $grant_type on '$2'.'$3' to '$1'\n";
}
result:
grant Read on 'Application'.'Database' to 'D Smith'
grant Write on 'Foobar'.'Database2' to 'F J Perl'
grant execute any on 'Whatever'.'Database3' to 'Something'
Given you have two+ spaces between fields, perhaps the following will be helpful:
use strict;
use warnings;
while (<>) {
my ( $user, $app, $db, $perm ) = grep $_, split /\s{2,}/;
$perm = 'execute any' if lc $perm eq 'any';
print "grant $perm on database '$app'.'$db' to user '$user';\n";
}
You can omit the initial-space substitution by grepping the result of split. $perm is changed only if it's any after the split.
As you say only the first column contains spaces we can use split to break up the columns,
and splice to remove the last four... Then just use string interpolation to re-constitute
the first column - no complex repular expressions required, no assumptions about fixed
column spacing and no assumptions about double spacing.. Probably want to add some more
validity checks (make sure values are valid)
use strict;
use Const::Fast qw(const);
const my $N => 4;
while(<>){
## Split the string on spaces...
chomp;
my #Q = split;
next if #Q <= $N;
## And remove the last four columns...
my ($app,$db,$perm,$flag) = splice #Q,-$N,$N;
## Sort out name and perm...
( my $user = "#Q" ) =~ s{#[^#]+}{}mxs;
$perm = 'execute any' if 'any' eq lc $perm;
## Print out statement... using named variables makes life easier!
print "grant $perm on database '$app'.'$db' to user '$user';\n";
}

regular expression code

I need to find match between two tab delimited files files like this:
File 1:
ID1 1 65383896 65383896 G C PCNXL3
ID1 2 56788990 55678900 T A ACT1
ID1 1 56788990 55678900 T A PRO55
File 2
ID2 34 65383896 65383896 G C MET5
ID2 2 56788990 55678900 T A ACT1
ID2 2 56788990 55678900 T A HLA
what I would like to do is to retrive the matching line between the two file. What I would like to match is everyting after the gene ID
So far I have written this code but unfortunately perl keeps giving me the error:
use of "Use of uninitialized value in pattern match (m//)"
Could you please help me figure out where i am doing it wrong?
Thank you in advance!
use strict;
open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach my $line (#sample1) {
my #tab = split (/\t/, $line);
my $chr = $tab[1];
my $start = $tab[2];
my $end = $tab[3];
my $ref = $tab[4];
my $alt = $tab[5];
my $name = $tab[6];
foreach my $item (#sample2){
my #fields = split (/\t/,$item);
if ( $fields[1] =~ m/$chr(.*)/
&& $fields[2] =~ m/$start(.*)/
&& $fields[4] =~ m/$ref(.*)/
&& $fields[5] =~ m/$alt(.*)/
&& $fields[6] =~ m/$name(.*)/
) {
print $line, "\n", $item;
}
}
}
On its surface your code seems to be fine (although I didn't debug it). If you don't have an error I cannot spot, could be that the input data has RE special character, which will confuse the regular expression engine when you put it as is (e.g. if any of the variable has the '$' character). Could also be that instead of tab you have spaces some where, in which case you'll indeed get an error, because your split will fail.
In any case, you'll be better off composing just one regular expression that contains all the fields. My code below is a little bit more Perl Idiomatic. I like using the implicit $_ which in my opinion makes the code more readable. I just tested it with your input files and it does the job.
use strict;
open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach (#sample1) {
(my $id, my $chr, my $start, my $end, my $ref, my $alt, my $name) =
m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
#print "$rex\n";
foreach (#sample2) {
if( m/$rex/ ) {
print "$id - $_";
}
}
}
Also, how regular is the input data? Do you have exactly one tab between the fields? If that is the case, there is no point to split the lines into 7 different fields - you only need two: the ID portion of the line, and the rest. The first regex would be
(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;
And you are searching $restOfLine within the second file in a similar technique as above.
If your files are huge and performance is an issue, you should consider putting the first regular expressions (or strings) in a map. That will give you O(n*log(m)) where n and m are the number of lines in each file.
Finally, I have a similar challenge when I need to compare logs. The logs are supposed to be identical, with the exception of a time mark at the beginning of each line. But more importantly: most lines are the same and in order. If this is what you have, and it make sense for you, you can:
First remove the IDxxx from each line: perl -pe "s/ID\d+ +//" file >cleanfile
Then use BeyondCompare or Windiff to compare the files.
I played a bit with your code. What you wrote there was actually three loops:
one over the lines of the first file,
one over the lines of the second file, and
one over all fields in these lines. You manually unrolled this loop.
The rest of this answer assumes that the files are strictly tab-seperated and that any other whitespace matters (even at the end of fields and lines).
Here is a condensed version of the code (assumes open filehandles $file1, $file2, and use strict):
my #sample2 = <$file2>;
SAMPLE_1:
foreach my $s1 (<$file1>) {
my (undef, #fields1) = split /\t/, $s1;
my #regexens = map qr{\Q$_\E(.*)}, #fields1;
SAMPLE_2:
foreach my $s2 (#sample2) {
my (undef, #fields2) = split /\t/, $s2;
for my $i (0 .. $#regexens) {
$fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
}
# only gets here if all regexes matched
print $s1, $s2;
}
}
I did some optimisations: precompiling the various regexes and storing them in an array, quoting the contents of the fields etc. However, this algorithm is O(n²), which is bad.
Here is an elegant variant of that algorithm that knows that only the first field is different — the rest of the line has to be the same character for character:
my #sample2 = <$file2>;
foreach my $s1 (<$file1>) {
foreach my $s2 (#sample2) {
print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
}
}
I just test for string equality of the rest of the line. While this algorithm is still O(n²), it outperforms the first solution roughly by an order of magnitude simply by avoiding braindead regexes here.
Finally, here is an O(n) solution. It is a variant of the previous one, but executes the loops after each other, not inside each other, therefore finishing in linear time. We use hashes:
# first loop via map
my %seen = map {reverse(split /\t/, $_, 2)}
# map {/\S/ ? $_ : () } # uncomment this line to handle empty lines
<$file1>;
# 2nd loop
foreach my $line (<$file2>) {
my ($id2, $key) = split /\t/, $line, 2;
if (defined (my $id1 = $seen{$key})) {
print "$id1\t$key";
print "$id2\t$key";
}
}
%seen is a hash that has the rest of the line as a key and the first field as a value. In the second loop, we retrieve the rest of the line again. If this line was present in the first file, we reconstruct the whole line and print it out. This solution is better than the others and scales well up- and downwards, because of its linear complexity
How about:
#!/usr/bin/perl
use File::Slurp;
use strict;
my ($ina, $inb) = #ARGV;
my #lines_a = File::Slurp::read_file($ina);
my #lines_b = File::Slurp::read_file($inb);
my $table_b = {};
my $ln = 0;
# Store all lines in second file in a hash with every different value as a hash key
# If there are several identical ones we store them also, so the hash values are lists containing the id and line number
foreach (#lines_b) {
chomp; # strip newlines
$ln++; # count current line number
my ($id, $rest) = split(m{[\t\s]+}, $_, 2); # split on whitespaces, could be too many tabs or spaces instead
if (exists $table_b->{$rest}) {
push #{ $table_b->{$rest} }, [$id, $ln]; # push to existing list if we already found an entry that is the same
} else {
$table_b->{$rest} = [ [$id, $ln] ]; # create new entry if this is the first one
}
}
# Go thru first file and print out all matches we might have
$ln = 0;
foreach (#lines_a) {
chomp;
$ln++;
my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
if (exists $table_b->{$rest}) { # if we have this entry print where it is found
print "$ina:$ln:\t\t'$id\t$rest'\n " . (join '\n ', map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } #{ $table_b->{$rest} }) . "\n";
}
}