Error in regex hyphen usage inside brackets utilizing perl - regex

I have this perl script that compares two arrays to give me back those results found in both of them. The problem arises I believe in a regular expression, where it encounters a hyphen ( - ) inside of brackets [].
I am getting the following error:
Invalid [] range "5-3" in regex; marked by <-- HERE in m/>gi|403163623|ref|XP_003323683.2| leucyl-tRNA synthetase [Puccinia graminis f. sp. tritici CRL 75-3 <-- HERE 6-700-3]
MAQSTPSSIQELMDKKQKEATLDMGGNFTKRDDLIRYEKEAQEKWANSNIFQTDSPYIENPELKDLSGEE
LREKYPKFFGTFPYPYMNGSLHLGHAFTISKIEFAVGFERMRGRRALFPVGWHATGMPIKSASDKIIREL
EQFGQDLSKFDSQSNPMIETNEDKSATEPTTASESQDKSKAKKGKIQAKSTGLQYQFQIMESIGVSRTDI
PKFADPQYWLQYFPPIAKNDLNAFGARVDWRRSFITTDINPYYDAFVRWQMNRLKEKGYVKFGERYTIYS
PKDGQPCMDHDRSSGERLGSQEYTCLKMKVLEWGPQAGDLAAKLGGKDVFFV at comparer line 21, <NUC> chunk 168.
I thought the error could be solved by just adding \Q..\E in the regex so as to bypass the [] but this has not worked. Here is my code, and thanks in advance for any and all help that you may offer.
#cyt = <CYT>;
#nuc = <NUC>;
$cyt = join ('',#cyt);
$cyt =~ /\[([^\]]+)\]/g;
#shared = '';
foreach $nuc (#nuc) {
if ($cyt =~ $nuc) {
push #shared, $nuc;
}
}
print #shared;
What I am trying to achieve with this code is compare two different lists loaded into the arrays #cyt and #nuc. I then compare the name in between the [] of one of the elements in list to to the name in [] of the other. All those finds are then pushed into #shared. Hope that clarifies it a bit.

Your question describes a set intersection, which is covered in the Perl FAQ.
How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
Use a hash. Here's code to do both and more. It assumes that each
element is unique in a given array:
my (#union, #intersection, #difference);
my %count = ();
foreach my $element (#array1, #array2) { $count{$element}++ }
foreach my $element (keys %count) {
push #union, $element;
push #{ $count{$element} > 1 ? \#intersection : \#difference }, $element;
}
Note that this is the symmetric difference, that is, all elements in
either A or in B but not in both. Think of it as an xor operation.
Applying it to your problem gives the code below.
Factor out the common code to find the names in the data files. This sub assumes that
every [name] will be entirely contained within a given line rather than crossing a newline boundary
each line of input will contain at most one [name]
If these assumptions are invalid, please provide more representative samples of your inputs.
Note the use of the /x regex switch that tells the regex parser to ignore most whitespace in patterns. In the code below, this permits visual separation between the brackets that are delimiters and the brackets surrounding the character class that captures names.
sub extract_names {
my($fh) = #_;
my %name;
while (<$fh>) {
++$name{$1} if /\[ ([^\]]+) \]/x;
}
%name;
}
Your question uses old-fashioned typeglob filehandles. Note that the paramter extract_names expects is a filehandle. Convenient parameter passing is one of many benefits of indirect filehandles, such as those created below.
open my $cyt, "<", "cyt.dat" or die "$0: open: $!";
open my $nuc, "<", "nuc.dat" or die "$0: open: $!";
my %cyt = extract_names $cyt;
my %nuc = extract_names $nuc;
With the names from cyt.dat in the hash %cyt and likewise for nuc.dat and %nuc, the code here iterates over the keys of both hashes and increments the corresponding keys in %shared.
my %shared;
for (keys %cyt, keys %nuc) {
++$shared{$_};
}
At this point, %shared represents a set union of the names in cyt.dat and nuc.dat. That is, %shared contains all keys from either %cyt or %nuc. To compute the set difference, we observe that the value in %shared for a key present in both inputs must be greater than one.
The final pass below iterates over the keys in sorted order (because hash keys are stored internally in an undefined order). For truly shared keys (i.e., those whose values are greater than one), the code prints them and deletes the rest.
for (sort keys %shared) {
if ($shared{$_} > 1) {
print $_, "\n";
}
else {
delete $shared{$_};
}
}

Related

How do i read lines from a file into a hash in Perl

I'm working with a file format that has lines of information that looks like this:
ATOM 1 N LYS A 56 20.508 14.774 -7.432 1.00 50.83 N
All i want is the first number, and the three numbers following '56' in the example above; so im using regular expressions to get that information. How do i then put that info into a hash?
So far i have:
my $pdb_file = $ARGV[0];
open (PDBFILE, "<$pdb_file") or die ("$pdb_file not found");
while (<PDBFILE>) {
if ($_=~ /^ATOM\s+(\d+)\s+\w+\s+\w+\s+\w+\s+\d+\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)/) {
my $atom = $1;
my $xcor = $2;
my $ycor = $3;
my $zcor = $4;
print "AtomNumber: $atom\t xyz: $xcor $ycor $zcor\n";
}
}
Instead of using a regex, I would instead recommend using split to split it into fields on whitespace. This will be faster and more robust, it doesn't depend on a detailed knowledge of the format of each field (which could change, like if a number has a minus sign which you forgot to take into account). And it's a lot easier to understand.
my #fields = split /\s+/, $line;
Then you can pick out the fields (for example, the first number is field 2, so $fields[1]) and put them into your hash.
my %coordinate = (
atom => $fields[1],
x => $fields[6],
y => $fields[7],
z => $fields[8]
);
You're reading a bunch of lines, so you're going to make a bunch of hashes which have to go somewhere. I'd recommend putting them all in another hash with some sort of unique field as the key. Possibly the atom field.
$atoms{$coordinate{atom}} = \%coordinate;

Perl String Regular Expression

i need some help in Perl regular expressions.
I have this string:
{
"ITEM":[
{
"-itemID": "1000000" ,
"-itemName": "DisneyJuniorLA" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000001" ,
"-itemName": "31 minutos" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000002" ,
"-itemName": "Plaza Sésamo" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
]
}
The string is in a variable: $jsonString
I have another variable: $itemName
I want to only keep in $jsonString the itemId value that is above itemName (where itemName equals $itemName)
I would really appreciate your help. I am really amateur in regular expressions.
Thank you!
Notwithstanding that your JSON string is very slightly malformed (there's an extra comma after the last element in the array that should be fixed by whoever's generating the "JSON"), attempting to use regexps to handle this just means you now have two problems instead of one.
More specifically, objects within JSON are explicitly unordered sets of key/value pairs. It's perfectly possible that whatever's changing the JSON could be rewritten such that the JSON is semantically identical but serialised differently, making anything that relies on the current structure brittle and error prone.
Instead, use a proper JSON decoder, and then traverse the resulting object hierarchy directly to find the desired element:
use JSON;
use utf8;
# decode the JSON
my $obj = decode_json($jsonString);
# get the ITEM array
my $itemRef = $obj->{ITEM};
# find all elements matching the item name
my #match = grep { $_->{'-itemName'} eq $itemName } #{$itemRef};
# extract the item ID
if (#match) {
my $itemID = $match[0]->{'-itemID'};
print $itemID;
}
Don't use a regular expression to parse JSON. Use JSON.
Basically :
use strict;
use warnings;
use Data::Dumper;
use JSON;
my $json_string;
{
open( my $json_in, "<", 'test.json' );
local $/;
$json_string = <$json_in>;
}
my $json = decode_json ( $json_string );
print Dumper \$json;
foreach my $item ( #{ $json -> {'ITEM'} } ) {
print $item -> {'-itemID'},"\n";
print $item -> {'-itemName'},"\n";
}
But you have to fix your json first. (There's a trailing comma that shouldn't be there. )
JSON is a defined data transfer structure. Whilst you can technically treat it as 'plain text' and extract things from the text, that's definitively the wrong way to do things.
It might work fine for a good long time, but if your source program changes a little - and change their output, whilst still sticking to the JSON standard - your code will break unexpectedly, and you may not realise. That can set off a domino effect of breakages, making a whole system or site just crash and burn. And worse yet - the source of this crash and burn will be hidden away in some script that hasn't been touched in years, so will be very difficult to fix.
This is one of my pet peeves as a professional sysadmin. Please don't even go there.

Retrieve the coding amino-acid when there is certain pattern in a DNA sequence

I would like to retrieve the coding amino-acid when there is certain pattern in a DNA sequence. For example, the pattern could be: ATAGTA. So, when having:
Input file:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
The ideal output would be a table having for each amino-acid the number of times is coded by the pattern. Here in sequence1, pattern codes only for one amino-acid, but in sequence2 it codes for two. I would like to have this tool working to scale to thousands of sequences. I've been thinking about how to get this done, but I only thought to: replace all nucleotides different than the pattern, translate what remains and get summary of the coded amino-acids.
Please let me know if this task can be performed by an already available tool.
Thanks for your help. All the best, Bernardo
Edit (due to the confusion generated with my post):
Please forget the original post and sequence1 and sequence2 too.
Hi all, and sorry for the confusion. The input fasta file is a *.ffn file derived from a GenBank file using 'FeatureExtract' tool (http://www.cbs.dtu.dk/services/FeatureExtract/download.php), so a can imagine they are already in frame (+1) and there is no need to get amino-acids coded in a frame different than +1.
I would like to know for which amino-acid the following sequences are coding for:
AGAGAG
GAGAGA
CTCTCT
TCTCTC
The unique strings I want to get coding amino-acids are repeats of three AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to retrieve coding amino-acids for repeats of four or more.
Thanks again, Bernardo
Here's some code that should at least get you started. For example, you can run like:
./retrieve_coding_aa.pl file.fa ATAGTA
Contents of retrieve_coding_aa.pl:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Bio::SeqIO;
use Bio::Tools::CodonTable;
use Data::Dumper;
my $pattern = $ARGV[1];
my $fasta = Bio::SeqIO->new ( -file => $ARGV[0], -format => 'fasta');
while (my $seq = $fasta->next_seq ) {
my $pos = 0;
my %counts;
for (split /($pattern)/ => $seq->seq) {
if ($_ eq $pattern) {
my $dist = $pos % 3;
unless ($dist == 0) {
my $num = 3 - $dist;
s/.{$num}//;
chop until length () % 3 == 0;
}
my $table = Bio::Tools::CodonTable->new();
$counts{$_}++ for split (//, $table->translate($_));
}
$pos += length;
}
print $seq->display_id() . ":\n";
map {
print "$_ => $counts{$_}\n"
}
sort {
$counts{$a} <=> $counts{$b}
}
keys %counts;
print "\n";
}
Here are the results using the sample input:
sequence1:
S => 1
sequence2:
V => 1
I => 1
The Bio::Tools::CodonTable class also supports non-standard codon usage tables. You can change the table using the id pointer. For example:
$table = Bio::Tools::CodonTable->new( -id => 5 );
or:
$table->id(5);
For more information, including how to examine these tables, please see the documentation here: http://metacpan.org/pod/Bio::Tools::CodonTable
I will stick to that first version of what you wanted cause the addendum only confused me even more. (frame?)
I only found ATAGTA once in sequence2 but I assume you want the mirror images/reverse sequence as well, which would be ATGATA in this case. Well my script doesn't do that so you would have to write it up twice in the input_sequences file but that should be no problem I would think.
I work with a file like yours which I call "dna.txt" and a input sequences file called "input_seq.txt". The result file is a listing of patterns and their occurences in the dna.txt file (including overlap-results but it can be set to non-overlap as explained in the awk).
input_seq.txt:
GC
ATA
ATAGTA
ATGATA
dna.txt:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
results.txt:
GC,6
ATA,2
ATAGTA,2
ATGATA,1
Code is awk calling another awk (but one of them is simple). You have to run
"./match_patterns.awk input_seq.txt" to get the results file generated.:
*match_patterns.awk:*
#! /bin/awk -f
{return_value= system("awk -vsubval="$1" -f test.awk dna.txt")}
test.awk:
#! /bin/awk -f
{string=$0
do
{
where = match(string, subval)
# code is for overlapping matches (i.e ATA matches twice in ATATAC)
# for non-overlapping replace +1 by +RLENGTH in following line
if (RSTART!=0){count++; string=substr(string,RSTART+1)}
}
while (RSTART != 0)
}
END{print subval","count >> "results.txt"}
Files have to be all in the same directory.
Good luck!

Config::Simple and creating arrays

I am attempting to write a script that collects user input using getopts. I need to be able to limit restrict the values the user can enter. I see how to set a default value but, I have been unable to find any way to set a list of allowed values... so,
I am attempting to use Config::Simple to create an array from values stored in a text file to use to validate against.
contents of values.txt
ChangeCategories resolution, storm
contents of main.pl
#---create array from values.txt ChangeCategories
my #chg_cats = $cfg->param("ChangeCategories");
unlink $_ for #chg_cats;
#----grab user input via getopts
my $change_categories = $opt_c || die "Please enter a valid change category; #chg_cats";
The issue occurs when I attempt to do the pattern match, it is matching only the first value listed on the ChangeCategories line in the values.txt file.
#---pattern mathching code
my $valid_category;
chomp(#chg_cats);
foreach (#chg_cats) {
##foreach my $line (#chg_cats) {
if(($_ =~ $change_categories) )
#if(($_ =~ m/$change_categories/) )
#if(($_ eq $change_categories) )
As you can see, I have tried numerous constructs to correct this and verify that I get correct matching results every time. I am not sure if this is somehow related to 'chomping' but, I have tried every pattern I can think of. I am a beginner to Perl and would very much appreciate any and all help.... and if anyone can tell me an easier/cleaner way to achieve this result, I would be very grateful

Comparison between 2 files and printing the difference between them in a sequential manner

I have 2 files namely
**a.txt**
cars
bikes
bus
vehicle
atv
**b.txt**
hawk
hero
atv
bus
***result.txt***
cars
bikes
vehicle
hawk
hero
I want to print the difference between the 2 files. Now I have tried a code but it gives me the difference in a random manner; I want it to display it in a proper sequential manner. Can anyone help me out in this.
use strict;
my %results = ();
open FILE1, "<a.txt"
or die "Could not open file: $! \n";
while (my $line = <FILE1>) {
$results{$line}=1;
}
close FILE1;
open FILE2, "<b.txt"
or die "Could not open file: $! \n";
while (my $line = <FILE2>) {
$results{$line}++;
}
close FILE2;
open OUTFILE, ">>result.txt"
or die "Cannot open $outfile for writing \n";
foreach my $line (keys %results) {
print OUTFILE $line if $results{$line} == 1;
}
close OUTFILE;
close OUTFILE1;
The output that I am getting with this code is
***result.txt***
cars
hawk
bikes
hero
vehicle
Hashes store their elements in a random order, therefore we have to carry positional information with us. In the following solution, I use the hash value to carry an unique ID.
Code
#!/usr/bin/perl
use strict; use warnings; use Data::Dumper;
my #a = qw( cars bikes bus vehicle atv );
my #b = qw( hawk hero atv bus );
my $i = 0;
my %ahash = map {$_ => ++$i} #a;
my %bhash = map {$_ => ++$i} #b;
my %different = map {
($ahash{$_} and $bhash{$_})
? ()
: ($_ => ($ahash{$_} or $bhash{$_}))
} (keys %ahash, keys %bhash);
my #sorted = sort {$different{$a} <=> $different{$b}} keys %different;
print Dumper(\#sorted);
Discussion
For the purpose of demonstration, I didn't use files, but stored the records in arrays #a and #b.
I build hashes %ahash and %bhash from these arrays, using the array element as key and using an unique number as value. In the case of the %ahash, these IDs are the same as the line number. You can think of the IDs in %bhash as line numbers with an offset.
I then gather all key-value pairs that differ between the hashes. To do that, I take the list of all keys in %ahash and %bhash. For each of these keys, I test for the existence of this key in both hashes (I don't use exists, but rather test against the value, as I can guarantee that no ID with a false value is assigned—all numbers except 0 evaluate true). If the key exists in both hashes, I return the empty list (). If the key exists in only one of the hashes, I return a two-element list containing the key and the value, which is either in %ahash or %bhash.
In the next step, I sort all different keys in ascending order by their associated value, and store the keys in this order in #sorted. You would print them to a file, I simply dumped them, and got this output:
$VAR1 = [
'cars',
'bikes',
'vehicle',
'hawk',
'hero'
];
I believe this meets your criteria.
Basically, you had a quite elegant algorithm to find differences, but you didn't incorporate the sorting data, as I did with my line-number like IDs.
Style notes
In most cases, using lexical filehandles with my is preferred over using global filehandles. Also, the three-argument form of open is more flexible and worth getting used to. For example, I'd write your third line as
my $filename1 = "a.txt";
open my $file1, '<', $filename1 or die qq{Could not open "$filename1": $!\n};
but then again, TIM TOWTDI.
perl -lne '$X{$_}++;if(eof){$a++;}if($a==2){foreach (keys %X){if($X{$_}==1){print $_}}}' file1 file2
tested below:
> cat temp
cars
bikes
bus
vehicle
atv
> cat temp2
hawk
hero
atv
bus
> perl -lne '$X{$_}++;if(eof){$a++;}if($a==2){foreach (keys %X){if($X{$_}==1){print $_}}}' temp temp2
cars
hawk
hero
vehicle
bikes
>
Since the data is stored as a hash reference the order is not preserved. See How can I print a Perl hash in specific order?
$ diff a.txt b.txt
$ is not part of the code, it is there to denote a bash command prompt.
comm may also be of use to you.