I have a list of instances from a database, and I am accounting for some number of them. I am writing a proc that gives me the instances which have not yet been accounted for, by 'subtracting' one list from the other. Some instances show up in the full db list and in the accounted-for list, yet still appear in the missed list, which implies to me that the subtraction portion of the code is wrong.
Moreover, the database has around 31000 elements and almost all are accounted for except about 300; only around 8 are double counted, which seems odd. Here is my subtraction proc, which reads from two files and writes to a third:
proc checkDif {} {
set fp [open "accounted_db.tcl" r]
set file_data_accounted [read $fp]
close $fp
set fp [open "innovus_db.tcl" r]
set file_data_db [read $fp]
close $fp
foreach elem $file_data_db {dict set y $elem 1}
foreach elem $file_data_accounted {dict unset y $elem}
set res [dict keys $y]
set fileName "missed_db.tcl"
set fileId [open $fileName "w"]
foreach inst $res {
puts $fileId $inst
}
close $fileId
return [llength $res]
}
If your data is one key per line (as it is in your output) and could contain spaces but not newlines (in the keys) then you should change your input code to do:
set fp [open "accounted_db.tcl" r]
set file_data_accounted [split [read $fp] "\n"]
close $fp
set fp [open "innovus_db.tcl" r]
set file_data_db [split [read $fp] "\n"]
close $fp
In fact, it's probably best to factor out that little code sequence:
proc readLines {filename} {
set fp [open $filename]
set lines [split [read $fp] "\n"]
close $fp
return $lines
}
# Might as well factor out an efficient reverse operation at the same time
proc writeLines {filename lines} {
set fp [open $filename "w"]
puts $fp [join $lines "\n"]
close $fp
}
proc checkDif {} {
set file_data_accounted [readLines "accounted_db.tcl"]
set file_data_db [readLines "innovus_db.tcl"]
set y {}; # In case file_data_db is empty; unlikely, but paranoia is good…
foreach elem $file_data_db {
dict set y $elem 1
}
foreach elem $file_data_accounted {
dict unset y $elem
}
set res [dict keys $y]
writeLines "missed_db.tcl" $res
return [llength $res]
}
Without doing the split, you're relying on the data being a true Tcl list from the outset. It might be, but only if you are generating it as such; just hoping everything works is not a recommended approach (in large-scale code). By doing explicit splitting (actually very fast) you avoid problems from things like spaces, backslashes and braces in your data; it's my experience that they don't usually crop up in test data, but do more commonly occur in production data…
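For readers who think more easily in Python, here is a loose analogue of the point above. It is only a sketch: Python has no Tcl list parser, so plain whitespace splitting stands in for "treat the file contents as a list", and the instance names and the helper name `missed` are invented for illustration.

```python
def missed(db_text, accounted_text, by_line=True):
    """Return the entries of db_text that do not occur in accounted_text."""
    # by_line=True mirrors [split $data "\n"]; by_line=False splits on any
    # whitespace, which breaks up keys that contain spaces.
    split = str.splitlines if by_line else str.split
    accounted = set(split(accounted_text))
    return [item for item in split(db_text) if item not in accounted]


# Hypothetical data: two of the three keys contain a space.
db_text = "top/u1\ntop/u 2\ntop/u 3\n"
accounted_text = "top/u1\ntop/u 2\n"

by_line_result = missed(db_text, accounted_text)
by_word_result = missed(db_text, accounted_text, by_line=False)
print(by_line_result)   # the whole key "top/u 3" is reported as missed
print(by_word_result)   # only the fragment "3" survives the subtraction
```

Splitting on whitespace silently turns one key into several, so the subtraction compares fragments rather than the keys that are actually in the files.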
I am trying to set up a separate incr for each entry in a list. I thought that I could assign an integer to each list entry...
set list {
{/run 00}
{/run/shm 00}
{/boot 00}
}
and use the following code as part of a foreach loop to increment the value...
lset list 1 [expr {[lindex $list 1] + 1}]
What I am finding is that the value increments correctly but when the code executes a second and third time the value has reset to 00, so it never increases past 1 on each pass.
If I set up a basic increment for a standard variable as part of the code..
set counter 00
incr counter
it quite happily increments on each run of the code and the counter increases by 1 until I break the code.
Any advice or help in getting this working would be much appreciated. I am definitely not a Tcl expert, so if I am trying to accomplish this the wrong way please let me know. :)
Thanks in advance for your help.
If you change your data structure slightly to flatten it out instead of using a list of pairs, it becomes usable as a dict. And there's a dict incr command:
This adds the given increment value (an integer that defaults to 1 if not specified) to the value that the given key maps to in the dictionary value contained in the given variable, writing the resulting dictionary value back to that variable. Non-existent keys are treated as if they map to 0. It is an error to increment a value for an existing key if that value is not an integer. The updated dictionary value is returned.
Example usage:
% set list {/run 0 /run/shm 0 /boot 0}
/run 0 /run/shm 0 /boot 0
% dict incr list /boot
/run 0 /run/shm 0 /boot 1
% puts $list
/run 0 /run/shm 0 /boot 1
If you want to do this in a command, you have to pass by name and use upvar so the changes are made in the right stack frame:
% proc demo {fstab_} {
upvar 1 $fstab_ fstab
dict incr fstab /run
}
% demo list
/run 1 /run/shm 0 /boot 1
% puts $list
/run 1 /run/shm 0 /boot 1
And to update every value:
% foreach dir [dict keys $list] { dict incr list $dir }
% puts $list
/run 2 /run/shm 1 /boot 2
I would expect that doing this:
for {set idx 0} {$idx < [llength $list]} {incr idx} {
lset list $idx 1 [expr {[lindex $list $idx 1] + 1}]
}
would increment every numeric value in that list, which is what I believe you want to do. However, doing this:
foreach pair $list {
lset pair 1 [expr {[lindex $pair 1] + 1}]
}
will not work. Tcl conceptually copies the sublist items out of the main list in foreach, so changes to the pairs aren't reflected back. Conceptually, Tcl also copies the value to hand to foreach in the first place. Of course, these copies are not actually real, as that would be very expensive! Instead Tcl uses shared references with copy-on-write-to-shared semantics, a system that works very well given that we can check the sharing status very cheaply (which is enabled by Tcl's threading model; values are never shared between threads, so sharing-state decisions can be lock-free and local).
A consequence of this is that Tcl explicitly rejects weird at-a-distance state changes of the kinds that cause weird bugs sometimes in languages with different semantics. If you're changing something, it'd better be a variable (as those are the main mutable things) and you'll have it right there in front of you when you do the change.
I have the following code:
proc list_backslash {} {
array unset options
array set options {
-inputs {vdd}
-outputs {vss}
}
set inputs { vdd2 vdd dvdd }
set outputs { vss2 vss dvss }
set updateOptions [ list \
-inputs $inputs \
-outputs $outputs ]
array set options $updateOptions
foreach {k v} [array get options] {
puts "$k => $v"
}
}
Since I have a lot of key-value pairs in updateOptions, there is a severe case of backslashitis! Is there a better way to code updateOptions? I tried subst + braces {} and realized it does not preserve the list structure, thus dooming it.
Generally speaking, if you need to continue a line you have to use a quoting mechanism of some kind with Tcl. Otherwise, the command call ends when the line ends. The [brackets] can include multiple statements too; it's legal, but really not recommended.
But that does mean that sometimes you've got awkward alternatives. Perhaps you'll be best off with doing this:
set updateOptions {
-inputs $inputs
-outputs $outputs
}
foreach {key value} $updateOptions {
set options($key) [subst $value]
}
The array set command isn't especially efficient until you get to huge numbers of options (many thousands) when the code is inside a procedure.
Or if you've got Tcl 8.6, dict map is perhaps better:
array set options [dict map {key value} $updateOptions {subst $value}]
Be aware that subst is not a particularly efficient command in Tcl 8.6 except when used with a literal argument. That's because with variable arguments, it compiles them to bytecode at runtime.
option a) Put it all on one line.
option b) Structure the code as:
set options(-inputs) $inputs
set options(-outputs) $outputs
option c) Learn to like backslashes.
I am trying to create a new array/list from an existing list of items. I want to check whether the item exists first; if it does not, create it and then add a value to it. If it already exists, just append a value. I also need a way to get the length of the total.
set Area {23401 23402 23403}
foreach Item $Area {
set ElExist [info exist ${Item}lst]
if {$ElExist == 0} {
set ${Item}lst {};
lappend ${Item}lst $TotalRecords
} else {
lappend ${Item}lst $TotalRecords
}
set CurrentOptinon [llength ${Item}lst]
}
If I was writing that code, I'd do it like this:
set Area {23401 23402 23403}
foreach Item $Area {
upvar 0 ${Item}lst lst
lappend lst $TotalRecords
set CurrentOptinon [llength $lst]
}
This will behave the same as your code, but it's so much shorter. Here are the tricks in use:
lappend creates a variable if it didn't already exist.
upvar 0 makes a local alias to a variable. So much simpler.
The alias removes the need for magic with llength, but otherwise you could have done:
set CurrentOptinon [llength [set ${Item}lst]]
The $ syntax is in many ways just a short-cut for a call to the single-argument version of set, which reads the named variable. Except if you write set then you can use substitutions in your variable name. As a rule of thumb, if you're extensively using variable names in variables without aliasing, you're probably doing something wrong (unless you really do need the name).
You're using weird variable names. Much better would be an array.
set Area {23401 23402 23403}
foreach Item $Area {
lappend lst($Item) $TotalRecords
set CurrentOptinon [llength $lst($Item)]
}
However, this is likely to require you to change code elsewhere.
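As a cross-language illustration of the "one container keyed by item, instead of one variable per item" idea, a Python sketch might look like this. The value assigned to `total_records` is a placeholder, since the question does not show where TotalRecords comes from.

```python
from collections import defaultdict

areas = ["23401", "23402", "23403"]
total_records = 42  # placeholder; the real value comes from elsewhere in the asker's code

# One mapping keyed by area code replaces the computed variable names.
# A missing key starts out as an empty list, much as lappend creates the
# variable on first use.
lists_by_area = defaultdict(list)

for item in areas:
    lists_by_area[item].append(total_records)
    current_length = len(lists_by_area[item])
```

After the loop every area has exactly one entry, and `current_length` holds the length of the list for the last area processed.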
I would like to retrieve the coding amino-acid when there is certain pattern in a DNA sequence. For example, the pattern could be: ATAGTA. So, when having:
Input file:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
The ideal output would be a table giving, for each amino-acid, the number of times it is coded by the pattern. Here in sequence1 the pattern codes for only one amino-acid, but in sequence2 it codes for two. I would like this tool to scale to thousands of sequences. I've been thinking about how to get this done, but the only approach I came up with is: replace all nucleotides that are not part of the pattern, translate what remains, and summarise the coded amino-acids.
Please let me know if this task can be performed by an already available tool.
Thanks for your help. All the best, Bernardo
Edit (due to the confusion generated with my post):
Please forget the original post and sequence1 and sequence2 too.
Hi all, and sorry for the confusion. The input fasta file is a *.ffn file derived from a GenBank file using the 'FeatureExtract' tool (http://www.cbs.dtu.dk/services/FeatureExtract/download.php), so I imagine the sequences are already in frame (+1) and there is no need to get amino-acids coded in a frame other than +1.
I would like to know for which amino-acid the following sequences are coding for:
AGAGAG
GAGAGA
CTCTCT
TCTCTC
The unique strings I want to get coding amino-acids are repeats of three AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to retrieve coding amino-acids for repeats of four or more.
Thanks again, Bernardo
Here's some code that should at least get you started. For example, you can run like:
./retrieve_coding_aa.pl file.fa ATAGTA
Contents of retrieve_coding_aa.pl:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Bio::SeqIO;
use Bio::Tools::CodonTable;
use Data::Dumper;
my $pattern = $ARGV[1];
my $fasta = Bio::SeqIO->new ( -file => $ARGV[0], -format => 'fasta');
while (my $seq = $fasta->next_seq ) {
my $pos = 0;
my %counts;
for (split /($pattern)/ => $seq->seq) {
if ($_ eq $pattern) {
my $dist = $pos % 3;
unless ($dist == 0) {
my $num = 3 - $dist;
s/.{$num}//;
chop until length () % 3 == 0;
}
my $table = Bio::Tools::CodonTable->new();
$counts{$_}++ for split (//, $table->translate($_));
}
$pos += length;
}
print $seq->display_id() . ":\n";
map {
print "$_ => $counts{$_}\n"
}
sort {
$counts{$a} <=> $counts{$b}
}
keys %counts;
print "\n";
}
Here are the results using the sample input:
sequence1:
S => 1
sequence2:
V => 1
I => 1
The Bio::Tools::CodonTable class also supports non-standard codon usage tables. You can change the table using the id pointer. For example:
$table = Bio::Tools::CodonTable->new( -id => 5 );
or:
$table->id(5);
For more information, including how to examine these tables, please see the documentation here: http://metacpan.org/pod/Bio::Tools::CodonTable
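To make the frame handling in the script above easier to follow, here is a small Python sketch of the same idea: find each occurrence of the pattern, keep only the codons that lie entirely inside the match in frame +1, translate them, and count. It assumes the standard genetic code and does not use BioPerl or Biopython; the function name `coded_amino_acids` and the table construction are illustrative, not part of the original answer.

```python
import re
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def coded_amino_acids(seq, pattern):
    """Count amino acids coded by in-frame codons lying fully inside each match."""
    counts = Counter()
    for match in re.finditer(re.escape(pattern), seq):
        start = (match.start() + 2) // 3 * 3   # round up to a codon boundary
        end = match.end() // 3 * 3             # round down to a codon boundary
        for i in range(start, end, 3):
            # Codons with ambiguous bases are reported as "X" (an assumption).
            counts[CODON_TABLE.get(seq[i:i + 3], "X")] += 1
    return counts


print(coded_amino_acids("ATGGCGCATAGTAATGC", "ATAGTA"))
print(coded_amino_acids("ATGATAGTAATGCGCGC", "ATAGTA"))
```

On the two sample sequences this gives one S for sequence1 and one I plus one V for sequence2, which agrees with the results listed above.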
I will stick to that first version of what you wanted cause the addendum only confused me even more. (frame?)
I only found ATAGTA once in sequence2 but I assume you want the mirror images/reverse sequence as well, which would be ATGATA in this case. Well my script doesn't do that so you would have to write it up twice in the input_sequences file but that should be no problem I would think.
I work with a file like yours, which I call "dna.txt", and an input sequences file called "input_seq.txt". The result file is a listing of patterns and their occurrences in the dna.txt file (including overlapping matches, but it can be set to non-overlapping as explained in the awk comments).
input_seq.txt:
GC
ATA
ATAGTA
ATGATA
dna.txt:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
results.txt:
GC,6
ATA,2
ATAGTA,2
ATGATA,1
The code is one awk script calling another (one of them is simple). Run "./match_patterns.awk input_seq.txt" to generate the results file.
match_patterns.awk:
#! /bin/awk -f
{return_value= system("awk -vsubval="$1" -f test.awk dna.txt")}
test.awk:
#! /bin/awk -f
{string=$0
do
{
where = match(string, subval)
# code is for overlapping matches (i.e ATA matches twice in ATATAC)
# for non-overlapping replace +1 by +RLENGTH in following line
if (RSTART!=0){count++; string=substr(string,RSTART+1)}
}
while (RSTART != 0)
}
END{print subval","count >> "results.txt"}
Files have to be all in the same directory.
Good luck!
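The counting loop in test.awk can be restated in Python as the sketch below. Two differences are worth flagging: awk's match() treats the pattern as a regular expression, whereas this sketch matches literal text, and the function name `count_matches` is made up for the example.

```python
def count_matches(text, pattern, overlapping=True):
    """Count occurrences of a literal pattern in text."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        # Advance by one character to allow overlaps (the +1 in the awk code),
        # or by the pattern length to skip past the match (the +RLENGTH variant).
        step = 1 if overlapping else len(pattern)
        pos = text.find(pattern, pos + step)
    return count


dna = ["ATGGCGCATAGTAATGC", "ATGATAGTAATGCGCGC"]
for pattern in ["GC", "ATA", "ATAGTA", "ATGATA"]:
    total = sum(count_matches(line, pattern) for line in dna)
    print(f"{pattern},{total}")
```

Run on the two sample sequences, this reproduces the counts shown in results.txt above.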
I have this perl script that compares two arrays to give me back those results found in both of them. The problem arises I believe in a regular expression, where it encounters a hyphen ( - ) inside of brackets [].
I am getting the following error:
Invalid [] range "5-3" in regex; marked by <-- HERE in m/>gi|403163623|ref|XP_003323683.2| leucyl-tRNA synthetase [Puccinia graminis f. sp. tritici CRL 75-3 <-- HERE 6-700-3]
MAQSTPSSIQELMDKKQKEATLDMGGNFTKRDDLIRYEKEAQEKWANSNIFQTDSPYIENPELKDLSGEE
LREKYPKFFGTFPYPYMNGSLHLGHAFTISKIEFAVGFERMRGRRALFPVGWHATGMPIKSASDKIIREL
EQFGQDLSKFDSQSNPMIETNEDKSATEPTTASESQDKSKAKKGKIQAKSTGLQYQFQIMESIGVSRTDI
PKFADPQYWLQYFPPIAKNDLNAFGARVDWRRSFITTDINPYYDAFVRWQMNRLKEKGYVKFGERYTIYS
PKDGQPCMDHDRSSGERLGSQEYTCLKMKVLEWGPQAGDLAAKLGGKDVFFV at comparer line 21, <NUC> chunk 168.
I thought the error could be solved by just adding \Q..\E in the regex so as to bypass the [] but this has not worked. Here is my code, and thanks in advance for any and all help that you may offer.
@cyt = <CYT>;
@nuc = <NUC>;
$cyt = join ('',@cyt);
$cyt =~ /\[([^\]]+)\]/g;
@shared = '';
foreach $nuc (@nuc) {
if ($cyt =~ $nuc) {
push @shared, $nuc;
}
}
print @shared;
What I am trying to achieve with this code is to compare two different lists loaded into the arrays @cyt and @nuc. I then compare the name between the [] of each element in one list to the name in [] of the other. All matches are then pushed into @shared. Hope that clarifies it a bit.
Your question describes a set intersection, which is covered in the Perl FAQ.
How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
Use a hash. Here's code to do both and more. It assumes that each
element is unique in a given array:
my (@union, @intersection, @difference);
my %count = ();
foreach my $element (@array1, @array2) { $count{$element}++ }
foreach my $element (keys %count) {
push @union, $element;
push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
}
Note that this is the symmetric difference, that is, all elements in
either A or in B but not in both. Think of it as an xor operation.
Applying it to your problem gives the code below.
Factor out the common code to find the names in the data files. This sub assumes that (1) every [name] will be entirely contained within a given line rather than crossing a newline boundary, and (2) each line of input will contain at most one [name].
If these assumptions are invalid, please provide more representative samples of your inputs.
Note the use of the /x regex switch that tells the regex parser to ignore most whitespace in patterns. In the code below, this permits visual separation between the brackets that are delimiters and the brackets surrounding the character class that captures names.
sub extract_names {
my($fh) = @_;
my %name;
while (<$fh>) {
++$name{$1} if /\[ ([^\]]+) \]/x;
}
%name;
}
Your question uses old-fashioned typeglob filehandles. Note that the parameter extract_names expects is a filehandle. Convenient parameter passing is one of many benefits of indirect filehandles, such as those created below.
open my $cyt, "<", "cyt.dat" or die "$0: open: $!";
open my $nuc, "<", "nuc.dat" or die "$0: open: $!";
my %cyt = extract_names $cyt;
my %nuc = extract_names $nuc;
With the names from cyt.dat in the hash %cyt and likewise for nuc.dat and %nuc, the code here iterates over the keys of both hashes and increments the corresponding keys in %shared.
my %shared;
for (keys %cyt, keys %nuc) {
++$shared{$_};
}
At this point, %shared represents a set union of the names in cyt.dat and nuc.dat. That is, %shared contains all keys from either %cyt or %nuc. To compute the set intersection, we observe that the value in %shared for a key present in both inputs must be greater than one.
The final pass below iterates over the keys in sorted order (because hash keys are stored internally in an undefined order). It prints the truly shared keys (i.e., those whose values are greater than one) and deletes all the others from %shared.
for (sort keys %shared) {
if ($shared{$_} > 1) {
print $_, "\n";
}
else {
delete $shared{$_};
}
}
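For comparison, the same extract-then-intersect approach can be sketched in Python using sets. The header lines below are invented stand-ins for the real cyt and nuc data, and the sketch makes the same assumption as the Perl sub: at most one [name] per line.

```python
import re

# A '[', then one or more characters that are not ']', then a ']'.
NAME = re.compile(r"\[([^\]]+)\]")


def extract_names(lines):
    """Collect the bracketed name from each line that has one."""
    names = set()
    for line in lines:
        found = NAME.search(line)
        if found:
            names.add(found.group(1))
    return names


# Hypothetical sample headers; real input would be read from the data files.
cyt_lines = [
    ">gi|1| protein A [Puccinia graminis f. sp. tritici CRL 75-36-700-3]",
    ">gi|2| protein B [Homo sapiens]",
]
nuc_lines = [
    ">gi|3| protein C [Puccinia graminis f. sp. tritici CRL 75-36-700-3]",
    ">gi|4| protein D [Mus musculus]",
]

shared = sorted(extract_names(cyt_lines) & extract_names(nuc_lines))
print(shared)
```

Because the names are compared as plain strings rather than being interpolated into a regular expression, hyphens and brackets inside them cannot trigger the "Invalid [] range" error from the question.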