printing all lines when multiple matching values in table perl - regex

I have two tables: $conversion and $table. In my script I'm checking if there is a match between cols[5] from $conversion and cols[2] from $table, if this is the case I print out the value from another column in $conversion, namely the corresponding value in cols[1].
This is all working fine.
However, some values in cols[5] from $conversion are the same. In that case I want to print, of course, every row from $conversion that matches. Right now the script prints only the value for the last match it finds while going through the file. So when cols[5] from $conversion contains the same value 4 times, only the value for the 4th match appears in the output. Any hint on how to solve this?
This is my script:
my %hash = ();
while (<$conversion>) {
    chomp;
    my @cols = split(/\t/);
    my $keyfield = $cols[5];
    my $keyfield2 = $cols[1];
    $hash{$keyfield} = $keyfield2;
}
seek $table,0,0; #cursor resetting
while (<$table>) {
    my @cols = split(/\t/);
    my $keyfield = $cols[2];
    if (exists($hash{$keyfield})) {
        print $output "$cols[0]", "\t", "$hash{$keyfield}", "\t", "$cols[1]\n";
    }
}

Don't store a single $cols[1]; store an array of them:
push @{ $hash{$keyfield} }, $keyfield2;
You'll need to dereference the array reference when printing (interpolating the array joins its elements with spaces):
print $output "$cols[0]","\t","@{ $hash{$keyfield} }","\t","$cols[1]\n";
If you want unique values, you can use a hash instead of an array.
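A minimal sketch of the unique-values variant, assuming the same tab-separated layout as in the question. The helper name collect_unique and the sample rows are made up for illustration; the inner hash is used as a set, so a repeated cols[1] value for the same cols[5] key is stored only once.

```perl
use strict;
use warnings;

# Hypothetical helper: map each cols[5] value to the set of cols[1] values.
sub collect_unique {
    my @lines = @_;
    my %hash;
    for my $line (@lines) {
        chomp $line;
        my @cols = split /\t/, $line;
        next unless defined $cols[5];          # skip rows that are too short
        $hash{ $cols[5] }{ $cols[1] } = 1;     # inner hash acts as a set
    }
    return \%hash;
}

# Made-up sample rows: K1 appears three times, twice with the value X.
my $seen = collect_unique(
    "a\tX\tc\td\te\tK1\n",
    "a\tY\tc\td\te\tK1\n",
    "a\tX\tc\td\te\tK1\n",
    "a\tZ\tc\td\te\tK2\n",
);

for my $key (sort keys %$seen) {
    print "$key\t", join(",", sort keys %{ $seen->{$key} }), "\n";
}
```

When printing in the second loop, iterate over keys %{ $hash{$keyfield} } instead of dereferencing an array.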

my %hash = ();
while (<$conversion>) {
    chomp;
    my @cols = split(/\t/);
    my $keyfield = $cols[5];
    my $keyfield2 = $cols[1];
    push @{ $hash{$keyfield} }, $keyfield2;
    # $hash{$keyfield} = $keyfield2;
}
seek $table,0,0; #cursor resetting
while (<$table>) {
    my @cols = split(/\t/);
    my $keyfield = $cols[2];
    if (exists($hash{$keyfield})) {
        foreach (@{ $hash{$keyfield} }) {
            print $output "$cols[0]","\t","$_","\t","$cols[1]\n";
        }
    }
}

Related

How to remove an ID from a string

I have a string that looks like this, they are ids in a table:
1,2,3,4,5,6,7,8,9
If someone deletes something from the database, I will need to update the string. I know that the substitution below removes the value but not the commas. Any idea how I can check whether the id has a comma before and after it, so my string doesn't break?
$new_values = $original_values[0];
$new_values =~ s/$car_id//;
Result: 1,2,,4,5,6,7,8,9 using the above sample (bad). It should be 1,2,4,5,6,7,8,9.
To remove the $car_id from the string:
my $car_id = 3;
my $new_values = q{1,2,3,4,5,6,7,8,9};
$new_values = join q{,}, grep { $_ != $car_id }
split /,/, $new_values;
say $new_values;
# Prints:
# 1,2,4,5,6,7,8,9
If you already removed the id(s), and you need to remove the extra commas, reformat the string like so:
my $new_values = q{,,1,2,,4,5,6,7,8,9,,,};
$new_values = join q{,}, grep { /\d/ } split /,/, $new_values;
say $new_values;
# Prints:
# 1,2,4,5,6,7,8,9
You can use
s/^$car_id,|,$car_id\b//
Details
^ - start of string
$car_id - variable value
, - comma
| - or
, - comma
$car_id - variable value
\b - word boundary.
If $car_id can contain regex metacharacters, quote it with \Q...\E:
s/^\Q$car_id\E,|,\Q$car_id\E\b//
Another approach is to store an extra leading and trailing comma (,1,2,3,4,5,6,7,8,9,)
The main benefit is that it makes it easier to search for the id using SQL (since you can search for ,$car_id,). Same goes for editing it.
On the Perl side, you'd use
s/,\K\Q$car_id\E,// # To remove
substr($_, 1, -1) # To get actual string
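A small sketch of the wrapped-comma idea, assuming the stored string does not already carry the extra commas, so the helper adds and strips them itself. The name remove_id is hypothetical; \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one id from a comma-separated list by
# temporarily wrapping the list in commas.
sub remove_id {
    my ($list, $id) = @_;
    my $wrapped = ",$list,";          # every id is now delimited on both sides
    $wrapped =~ s/,\K\Q$id\E,//;      # \K keeps the leading comma in place
    return substr($wrapped, 1, -1);   # strip the extra commas again
}

print remove_id("1,2,3,4,5,6,7,8,9", 3), "\n";   # 1,2,4,5,6,7,8,9
print remove_id("1,2,3", 1), "\n";               # 2,3
print remove_id("1,12,2", 2), "\n";              # 1,12
```

Because each id is matched together with both of its delimiters, removing 2 does not touch 12, and ids at the start or end of the list need no special case.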
Ugly way: use regex to remove the value, then simplify
$new_values = $original_values[0];
$new_values =~ s/$car_id//;
$new_values =~ s/,+/,/g; # collapse runs of commas
$new_values =~ s/^,|,$//g; # drop a leading or trailing comma
Nice way: split and merge
$new_values = $original_values[0];
my @values = split(/,/, $new_values);
my $index = 0;
$index++ until $index > $#values or $values[$index] eq $car_id;
splice(@values, $index, 1) if $index <= $#values; # only if the id was found
$new_values = join(',', @values);

Remove lines containing same string

If a line IN(..) and a line OUT(..) have the same string in their parentheses, then remove the line OUT(..).
My input file is like :
IN(ABC);
IN(DEF);
IN(FGH);
OUT(QWE);
OUT(ABC);
OUT(DEF);
My desired output is:
IN(ABC);
IN(DEF);
IN(FGH);
OUT(QWE);
On the assumption that all IN(...) lines are before the OUT(...) lines (i.e. sorted), the following should work:
my %in;
while (<DATA>) {
    if (/^IN\((.*?)\)/) {
        $in{$1} = 1;
    } elsif (/^OUT\((.*?)\)/) {
        if ($in{$1}) {
            next;
        }
    }
    print $_;
}
__DATA__
IN(ABC);
IN(DEF);
IN(FGH);
OUT(QWE);
OUT(ABC);
OUT(DEF);
The idea is to use a hash to track which IN values have been seen. Go through the data line by line: if it's an IN line, store the value and print the line. If it's an OUT line whose value is not among the recorded IN values, print it as well; otherwise, skip it.
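If the IN(...) lines are not guaranteed to come first, one option is to make two passes: collect all IN values, then filter. This is only a sketch under that assumption; the helper name filter_out_lines and the sample lines are invented, and it holds all lines in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: drop OUT(x) lines whose x also appears in any IN(x)
# line, regardless of the order of the lines.
sub filter_out_lines {
    my @lines = @_;
    my %in;
    # First pass: record every IN value.
    for (@lines) {
        $in{$1} = 1 if /^IN\((.*?)\)/;
    }
    # Second pass: keep everything except OUT lines with a recorded value.
    return grep { !( /^OUT\((.*?)\)/ && $in{$1} ) } @lines;
}

my @kept = filter_out_lines(
    "OUT(ABC);", "IN(ABC);", "OUT(QWE);", "IN(DEF);", "OUT(DEF);",
);
print "$_\n" for @kept;
```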

Matching Values in Hashes

I have two arrays of hashes. I want to narrow down the second one according to variables in the first.
The first array contains hashes with keys seqname, source, feature, start, end, score, strand, frame, geneID and transcriptID.
The second array contains hashes with keys
organism, geneID, number, motifnumber, position, strand and sequence.
What I want to do is remove from the first array all the hashes whose geneID is not found in any of the hashes of the second array (note that both types of hash have the geneID key). Simply put, I want to keep those hashes in the first array whose geneID values are found in the hashes of the second array.
My attempt at this so far was with two loops:
my @subset; # define a new array for the wanted hashes to go into.
for my $i (0 .. $#first_hash_array){ # Begin loop to go through the hashes of the first array.
    for my $j (0 .. $#second_hash_array){ # Begin loop through the hashes of the 2nd array.
        if ($second_hash_array[$j]{geneID} =~ m/$first_hash_array[$i]{geneID}/)
        {
            push @subset, $second_hash_array[$j];
        }
    }
}
However I'm not sure that this is the right way to go about this.
For starters, $a =~ /$b/ doesn't check for equality. You'd need
$second_hash_array[$j]{geneID} =~ m/^\Q$first_hash_array[$i]{geneID}\E\z/
or simply
$second_hash_array[$j]{geneID} eq $first_hash_array[$i]{geneID}
for that.
Secondly,
for my $i (0 .. $#first_hash_array) {
    ... $first_hash_array[$i] ...
}
can be written more succinctly as
for my $first (@first_hash_array) {
    ... $first ...
}
Next on the list is that
for my $second (@second_hash_array) {
    if (...) {
        push @subset, $second;
    }
}
can add $second to @subset more than once. You either need to add a last
# Perform the push if the condition is true for any element.
for my $second (@second_hash_array) {
    if (...) {
        push @subset, $second;
        last;
    }
}
or move the push out of the loop
# Perform the push if the condition is true for all elements.
my $flag = 1;
for my $second (@second_hash_array) {
    if (!...) {
        $flag = 0;
        last;
    }
}
if ($flag) {
    push @subset, $second;
}
depending on what you want to do.
To remove from an array, one would use splice. But removing from an array messes up all the indexes, so it's better to iterate the array backwards (from last to first index).
Not only is it complicated, it's also expensive. Every time you splice, all subsequent elements in the array need to be moved.
A better approach is to filter the elements and assign the resulting list back to the array.
my @new_first_hash_array;
for my $first (@first_hash_array) {
    my $found = 0;
    for my $second (@second_hash_array) {
        if ($first->{geneID} eq $second->{geneID}) {
            $found = 1;
            last;
        }
    }
    if ($found) {
        push @new_first_hash_array, $first;
    }
}
@first_hash_array = @new_first_hash_array;
Iterating through @second_hash_array repeatedly is needlessly expensive. Build a lookup hash of its geneIDs once instead:
my %geneIDs_to_keep;
for (@second_hash_array) {
    ++$geneIDs_to_keep{ $_->{geneID} };
}
my @new_first_hash_array;
for (@first_hash_array) {
    if ($geneIDs_to_keep{ $_->{geneID} }) {
        push @new_first_hash_array, $_;
    }
}
@first_hash_array = @new_first_hash_array;
Finally, we can replace that for with a grep to give the following simple and efficient answer:
my %geneIDs_to_keep;
++$geneIDs_to_keep{ $_->{geneID} } for @second_hash_array;
@first_hash_array = grep $geneIDs_to_keep{ $_->{geneID} }, @first_hash_array;
This is how I would do it.
Create an array req_geneID for the required geneIDs and put all geneIDs from the second array of hashes in it.
Traverse the first array and check whether each geneID is contained in the req_geneID array (it's easy in Ruby using "include?", but you can do the same in Perl).
Finally, delete the hashes that do not match any geneID in req_geneID, using delete in Perl:
for (keys %hash)
{
delete $hash{$_};
}
Hope this helps.. :)

regular expression help: catch this: |TrxId=475665|

For example I have a string:
MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|
and I want to catch this: |TrxId=475665|
After TrxId= there can be any digits, and any number of them, so the regex should also catch:
|TrxId=111333| and |TrxId=0000011112222| and |TrxId=123|
TrxId=(\d+)
That would give a group (1) with the TrxId.
PS: Use global modifier.
The regex should look somewhat like this:
TrxId=[0-9]+
It will match TrxId= followed by at least one digit.
An example solution in Python:
In [107]: data = 'MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|'
In [108]: m = re.search(r'\|TrxId=(\d+)\|', data)
In [109]: m.group(0)
Out[109]: '|TrxId=475665|'
In [110]: m.group(1)
Out[110]: '475665'
/MsgNam\=.*?\|(TrxId\=\d+)\|.*/
for example in perl:
$a = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100111|";
$a =~ /MsgNam\=.*?\|(TrxId\=\d+)\|.*/;
print $1;
will print TrxId=475665
You know what your delimiters look like, so you don't need a regex, you need to split. Here's an implementation in Perl.
use strict;
use warnings;
my $input = "MsgNam=WMS.WEATXT|VersionsNr=0|TrxId=475665|MndNr=0257|Werk=0000|WeaNr=0171581054|WepNr=|WeaTxtTyp=110|SpraNam=ru|WeaTxtNr=2|WeaTxtTxt=100 111|";
my @first_array = split(/\|/,$input); #splitting $input on "|"
#split drops trailing empty fields, so the final "|" in $input
#does not produce an extra empty element. The grep below is only
#a safeguard against undefined elements.
@first_array = grep{defined}@first_array;
#Also filter out elements that do not have an equals sign appearing.
@first_array = grep{/=/}@first_array;
#Now, put these elements into an associative array:
my %assoc_array;
foreach(@first_array)
{
    if(/^([^=]+)=(.+)$/)
    {
        $assoc_array{$1} = $2;
    }
    else
    {
        #Something weird may be happening...
        #we may have an element starting with "=", or an empty value such as "WepNr=".
        #Do what you want: throw a warning, die, silently move on, etc.
    }
}
if(exists $assoc_array{TrxId})
{
    print "|TrxId=" . $assoc_array{TrxId} . "|\n";
}
else
{
    print "Sorry, TrxId not found!\n";
}
The code above yields the expected output:
|TrxId=475665|
Now, obviously this is more complex than some of the other answers, but it's also a bit more robust in that it allows you to search for more keys as well.
This approach does have a potential issue if your keys appear more than once. In that case, it's easy enough to modify the code above to collect an array reference of values for each key.
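One way to handle repeated keys is sketched below: each key maps to an array reference holding every value seen for it. The helper name parse_fields and the shortened input string are illustrative only, and this version accepts empty values, which the stricter (.+) pattern above rejects.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "key=value|key=value|..." into a hash that maps
# each key to an array reference of all its values.
sub parse_fields {
    my ($input) = @_;
    my %values_for;
    for my $field (split /\|/, $input) {
        # Skip anything that is not key=value; empty values are allowed here.
        next unless $field =~ /^([^=]+)=(.*)$/;
        push @{ $values_for{$1} }, $2;
    }
    return \%values_for;
}

my $parsed = parse_fields("TrxId=475665|MndNr=0257|TrxId=123|WepNr=|");
print "TrxId: @{ $parsed->{TrxId} }\n";   # TrxId: 475665 123
```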

In Perl, how many groups are in the matched regex?

I would like to tell the difference between a number 1 and string '1'.
The reason that I want to do this is because I want to determine the number of capturing parentheses in a regular expression after a successful match. According to the perlop doc, a list (1) is returned when there are no capturing groups in the pattern. So if I get a successful match and a list (1), then I cannot tell whether the pattern has no parens or has one pair of parens that matched a '1'. I can resolve that ambiguity if there is a difference between number 1 and string '1'.
You can tell how many capturing groups are in the last successful match by using the special @+ array. $#+ is the number of capturing groups. If that's 0, then there were no capturing parentheses.
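A short sketch of that check. The wrapper group_count is a hypothetical name; it relies only on $#+, the last index of the @+ array, being the number of groups in the pattern of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: number of capture groups in the pattern of a
# successful match; returns nothing (undef in scalar context) on failure.
sub group_count {
    my ($string, $re) = @_;
    return unless $string =~ $re;
    return $#+;    # last index of @+ is the number of groups in the pattern
}

print group_count("1", qr/1/), "\n";          # 0
print group_count("1", qr/(1)/), "\n";        # 1
print group_count("foo", qr/(f)(o)/), "\n";   # 2
```

This separates the two cases from the question without inspecting the returned list: both matches return a list containing 1, but only the second pattern has a group.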
For example, bitwise operators behave differently for strings and integers:
~1 = 18446744073709551614
~'1' = Î ('1' = 0x31, ~'1' = ~0x31 = 0xce = 'Î')
#!/usr/bin/perl
($b) = ('1' =~ /(1)/);
print isstring($b) ? "string\n" : "int\n";
($b) = ('1' =~ /1/);
print isstring($b) ? "string\n" : "int\n";
sub isstring() {
return ($_[0] & ~$_[0]);
}
isstring returns either 0 (as a result of the numeric bitwise op), which is false, or "\0" (as a result of the bitwise string ops, see perldoc perlop), which is true because it is a non-empty string.
If you want to know the number of capture groups a regex matched, just count them. Don't look at the values they return, which appears to be your problem:
You can get the count by looking at the result of the list assignment, which returns the number of items on the right hand side of the list assignment:
my $count = my @array = $string =~ m/.../g;
If you don't need to keep the capture buffers, assign to an empty list:
my $count = () = $string =~ m/.../g;
Or do it in two steps:
my @array = $string =~ m/.../g;
my $count = @array;
You can also use the @+ or @- variables, using some of the tricks I show in the first pages of Mastering Perl. These arrays have the starting and ending positions of each of the capture buffers. The values in index 0 apply to the entire pattern, the values in index 1 are for $1, and so on. The last index, then, is the total number of capture buffers. See perlvar.
Perl converts between strings and numbers automatically as needed. Internally, it tracks the values separately. You can use Devel::Peek to see this in action:
use Devel::Peek;
$x = 1;
$y = '1';
Dump($x);
Dump($y);
The output is:
SV = IV(0x3073f40) at 0x3073f44
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 1
SV = PV(0x30698cc) at 0x3073484
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x3079bb4 "1"\0
CUR = 1
LEN = 4
Note that the dump of $x has a value for the IV slot, while the dump of $y doesn't but does have a value in the PV slot. Also note that simply using the values in a different context can trigger stringification or numification and populate the other slots. For example, if you did $x . '' or $y + 0 before peeking at the value, you'd get this:
SV = PVIV(0x2b30b74) at 0x3073f44
REFCNT = 1
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 1
PV = 0x3079c5c "1"\0
CUR = 1
LEN = 4
At which point 1 and '1' are no longer distinguishable at all.
Check for the definedness of $1 after a successful match. The logic goes like this:
If the list is empty then the pattern match failed
Else if $1 is defined then the list contains all the captured substrings
Else the match was successful, but there were no captures
Your question doesn't make a lot of sense, but it appears you want to know the difference between:
$a = "foo";
@f = $a =~ /foo/;
and
$a = "foo1";
@f = $a =~ /foo(1)?/;
Since they both return the same thing regardless if a capture was made.
The answer is: Don't try and use the returned array. Check to see if $1 is not equal to ""