Config::Simple and creating arrays - regex

I am attempting to write a script that collects user input using getopts. I need to be able to restrict the values the user can enter. I see how to set a default value, but I have been unable to find any way to set a list of allowed values, so
I am attempting to use Config::Simple to create an array from values stored in a text file and validate the input against it.
contents of values.txt
ChangeCategories resolution, storm
contents of main.pl
#---create array from values.txt ChangeCategories
my @chg_cats = $cfg->param("ChangeCategories");
unlink $_ for @chg_cats;
#----grab user input via getopts
my $change_categories = $opt_c || die "Please enter a valid change category; @chg_cats";
The issue occurs when I attempt the pattern match: it matches only the first value listed on the ChangeCategories line in the values.txt file.
#---pattern matching code
my $valid_category;
chomp(@chg_cats);
foreach (@chg_cats) {
##foreach my $line (@chg_cats) {
if(($_ =~ $change_categories) )
#if(($_ =~ m/$change_categories/) )
#if(($_ eq $change_categories) )
As you can see, I have tried numerous constructs to correct this and to verify that I get correct matching results every time. I am not sure whether this is somehow related to 'chomping', but I have tried every pattern I can think of. I am a beginner to Perl and would very much appreciate any and all help. If anyone can tell me an easier/cleaner way to achieve this result, I would be very grateful.
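One possibly cleaner approach, sketched here in Python rather than Perl: treat the allowed values as a set and test for exact membership instead of using a pattern match. The whitespace stripping reflects a guess, not a confirmed diagnosis: splitting `resolution, storm` on commas can leave a leading space on ` storm`, which would make only the first value compare equal. The function names and the one-line "key values" layout are assumptions for illustration.

```python
def parse_allowed(line):
    """Parse a 'Key value1, value2' line into (key, set of allowed values)."""
    key, _, rest = line.strip().partition(" ")
    # Strip each value so ' storm' and 'storm' compare equal.
    values = {v.strip() for v in rest.split(",") if v.strip()}
    return key, values


def is_valid(choice, allowed):
    """Exact membership test; no regex, so partial matches are rejected."""
    return choice.strip() in allowed


key, allowed = parse_allowed("ChangeCategories resolution, storm\n")
print(key, sorted(allowed))
print(is_valid("storm", allowed))
print(is_valid("sto", allowed))
```

An exact comparison also avoids the case where a user-supplied value such as `sto` matches as a substring of `storm`.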

How to get every combination of a combolist

I have a program where you can import a combolist, ex.
test:lol
hello:hi
sup:hey
The first one being the "username" and last one being the "password".
I want so that every single username will have a line with each password of the list. I would like all of this put into a separate string list.
Such as:
test:lol
test:hi
test:hey
hello:lol
hello:hi
hello:hey
sup:lol
sup:hi
sup:hey
I think this might be a really simple code, something to do with first separating the list into two lists, then for each username, add every password, but my brains can't really think of anything to come up with a solution for this.
thank you in advance :)
Yes, you're right.
First, get two lists, users and passwords, by splitting each string at the colon.
Second, traverse users and passwords and combine them. In pseudocode:
foreach uName in users do
    foreach pwd in passwords do
        print uName.concat(":").concat(pwd)
    end
end
Explanation:
Get first `uName` from `users`
Get first `pwd` from `passwords`
output "`uName`:`pwd`"
Get second `pwd` from `passwords`
output "`uName`:`pwd`"
...
Get second `uName` from `users`
Get first `pwd` from `passwords`
output "`uName`:`pwd`"
...
Feel free to ask if anything is left unclear.
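The nested loop above can be sketched in Python as follows. The question does not name a language, so this is only an illustration; `all_combinations` is a hypothetical helper name, and `itertools.product` stands in for the two nested `foreach` loops.

```python
from itertools import product


def all_combinations(lines):
    """Pair every username with every password from 'user:password' lines."""
    users, passwords = [], []
    for line in lines:
        # Split at the first colon only, in case a password contains ':'.
        user, _, pwd = line.strip().partition(":")
        users.append(user)
        passwords.append(pwd)
    # product() iterates users in the outer loop and passwords in the inner one.
    return [f"{u}:{p}" for u, p in product(users, passwords)]


for combo in all_combinations(["test:lol", "hello:hi", "sup:hey"]):
    print(combo)
```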

cts:value-match on xs:dateTime() type in Marklogic

I have a variable $yearMonth := "2015-02"
I have to search for this date in an element Date of type xs:dateTime.
I want to use a regular expression to find all documents having a date of the form "2015-02-??".
I have a path range index enabled on ModifiedInfo/Date.
I am using the following code but get an invalid cast error:
let $result := cts:value-match(cts:path-reference("ModifiedInfo/Date"), xs:dateTime("2015-02-??T??:??:??.????"))
I have also tried the following code and get the same error:
let $result := cts:value-match(cts:path-reference("ModifiedInfo/Date"), xs:dateTime(xs:date("2015-02-??"),xs:time("??:??:??.????")))
Kindly help :)
It seems you are trying to use a wildcard search on a path range index whose data type is xs:dateTime.
MarkLogic does not currently support this. There are several ways to handle this scenario:
You may create a field index.
You may change it to a string index, which supports wildcard search.
You may run this workaround to support your existing system:
for $x in cts:values(cts:path-reference("ModifiedInfo/Date"))
return if(starts-with(xs:string($x), '2015-02')) then $x else ()
This query fetches all values from the lexicon and keeps only those that start with the desired year and month.
You can solve this by combining a couple of cts:element-range-query constructors inside an and-query:
let $target := "2015-02"
(: assumes an element range index of type xs:dateTime on the Date element :)
let $low := xs:dateTime($target || "-01T00:00:00")
let $high := $low + xs:yearMonthDuration("P1M")
return
  cts:search(
    fn:doc(),
    cts:and-query((
      cts:element-range-query(xs:QName("Date"), ">=", $low),
      cts:element-range-query(xs:QName("Date"), "<", $high)
    ))
  )
From the cts:element-range-query documentation:
If you want to constrain on a range of values, you can combine multiple cts:element-range-query constructors together with cts:and-query or any of the other composable cts:query constructors, as in the last part of the example below.
You could also consider calling cts:values with a cts:query parameter that searches for values between, for instance, 2015-02-01 and 2015-03-01. Mind, though, that if multiple dates occur within one document, you will still need to post-filter manually (as in Navin's third option), but it could potentially speed up the post-filtering a lot.
HTH!
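The range idea in the answers above amounts to turning "2015-02" into a half-open interval, from the first instant of the month up to (but not including) the first instant of the next month. A minimal sketch of that arithmetic in Python, independent of MarkLogic; `month_bounds` and `in_month` are hypothetical helper names:

```python
from datetime import datetime


def month_bounds(year_month):
    """Return (low, high) so that low <= t < high covers the whole month."""
    year, month = (int(part) for part in year_month.split("-"))
    low = datetime(year, month, 1)
    # Roll over to January of the next year after December.
    high = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    return low, high


def in_month(value, year_month):
    low, high = month_bounds(year_month)
    return low <= value < high


print(month_bounds("2015-02"))
print(in_month(datetime(2015, 2, 28, 23, 59, 59), "2015-02"))
```

Using `<` on the upper bound avoids having to know how many days the month has.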

Perl regex.. match words exactly 2 times...Input is a JSON file

I am a beginner with regular expressions and need some pointers in resolving an issue. I have a JSON file that looks like this:
JSON format
{"record-type":"int-stats","time":1389309548046925,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548041555,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548041554,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548046151,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548041667,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548042626,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548035666,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548035635,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548042255,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548041715,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548046161,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548023422,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548041617,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548046676,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548045675,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548046172,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
{"record-type":"int-stats","time":1389309548034534,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}
{"record-type":"int-stats","time":1389309548012345,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}
{"record-type":"int-stats","time":1389309548025232,"host-id":"x.x.x.x","port":"ab-0/0/46","latency":11315}
{"record-type":"int-stats","time":1389309548023423,"host-id":"x.x.x.x","port":"ab-0/0/47","latency":102668}
{"record-type":"int-stats","time":1389309548252352,"host-id":"x.x.x.x","port":"ab-0/0/9","latency":347776}
I need to extract "port":"ab-0/0/44" and the "time" associated with that port. I am trying to calculate the time difference between any two such occurrences, e.g. 1st occurrence -> "time":1389309548046925 "port":"ab-0/0/44", 2nd occurrence -> "time":1389309548041555 "port":"ab-0/0/44". The calculated time difference must be stored in a variable. I tried a regular expression like this: /\"time\":\\d+\.*\"port\":\".b-0\/0\/44\"/. Any help is appreciated. Thanks in advance!
Use the JSON module. It's rather simple.
use strict;
use warnings;
use JSON;
while (<>) {
    /\S/ or next;    # skip blank lines
    my $data = decode_json($_);
    print "port -> $data->{port}\n";
    print "time -> $data->{time}\n";
}
With your data, I get output like this:
port -> ab-0/0/44
time -> 1389309548046925
port -> ab-0/0/45
time -> 1389309548046925
... etc
I'm not sure how you want to calculate your time, but I assume that doing arithmetic is something you can figure out best on your own.
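For the arithmetic step, here is a hedged sketch in Python of one way to do it: decode each line as JSON, collect the times for one port, and subtract consecutive values. Computing "earlier occurrence minus later occurrence" is an assumption about what the asker wants, and the helper names are hypothetical.

```python
import json


def times_for_port(lines, port):
    """Collect the 'time' field of every record whose 'port' matches."""
    times = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("port") == port:
            times.append(record["time"])
    return times


def consecutive_differences(times):
    """Difference between each occurrence and the one that follows it."""
    return [first - second for first, second in zip(times, times[1:])]


sample = [
    '{"record-type":"int-stats","time":1389309548046925,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}',
    '{"record-type":"int-stats","time":1389309548046925,"host-id":"x.x.x.x","port":"ab-0/0/45","latency":36940}',
    '{"record-type":"int-stats","time":1389309548041555,"host-id":"a.b.c.d","port":"ab-0/0/44","latency":108992}',
]
differences = consecutive_differences(times_for_port(sample, "ab-0/0/44"))
print(differences)
```

With the first two occurrences from the question, the single difference is 1389309548046925 - 1389309548041555 = 5370.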

Retrieve the coding amino-acid when there is certain pattern in a DNA sequence

I would like to retrieve the coding amino-acid when there is certain pattern in a DNA sequence. For example, the pattern could be: ATAGTA. So, when having:
Input file:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
The ideal output would be a table giving, for each amino acid, the number of times it is coded by the pattern. Here in sequence1 the pattern codes for only one amino acid, but in sequence2 it codes for two. I would like this tool to scale to thousands of sequences. I've been thinking about how to get this done, but the only approach I came up with is: replace all nucleotides that are not part of the pattern, translate what remains, and summarize the coded amino acids.
Please let me know if this task can be performed by an already available tool.
Thanks for your help. All the best, Bernardo
Edit (due to the confusion generated with my post):
Please forget the original post and sequence1 and sequence2 too.
Hi all, and sorry for the confusion. The input fasta file is a *.ffn file derived from a GenBank file using the 'FeatureExtract' tool (http://www.cbs.dtu.dk/services/FeatureExtract/download.php), so I imagine the sequences are already in frame (+1) and there is no need to get amino acids coded in a frame other than +1.
I would like to know which amino acids the following sequences code for:
AGAGAG
GAGAGA
CTCTCT
TCTCTC
The only strings for which I want the coded amino acids are repeats of three AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to retrieve coded amino acids for repeats of four or more.
Thanks again, Bernardo
Here's some code that should at least get you started. For example, you can run it like this:
./retrieve_coding_aa.pl file.fa ATAGTA
Contents of retrieve_coding_aa.pl:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Bio::SeqIO;
use Bio::Tools::CodonTable;
use Data::Dumper;
my $pattern = $ARGV[1];
my $fasta = Bio::SeqIO->new ( -file => $ARGV[0], -format => 'fasta');
while (my $seq = $fasta->next_seq ) {
    my $pos = 0;
    my %counts;
    for (split /($pattern)/ => $seq->seq) {
        # remember the untrimmed length so $pos stays correct after trimming
        my $len = length;
        if ($_ eq $pattern) {
            my $dist = $pos % 3;
            unless ($dist == 0) {
                # drop leading bases up to the next codon boundary,
                # then trailing bases that do not complete a codon
                my $num = 3 - $dist;
                s/.{$num}//;
                chop until length () % 3 == 0;
            }
            my $table = Bio::Tools::CodonTable->new();
            $counts{$_}++ for split (//, $table->translate($_));
        }
        $pos += $len;
    }
    print $seq->display_id() . ":\n";
    map {
        print "$_ => $counts{$_}\n"
    }
    sort {
        $counts{$a} <=> $counts{$b}
    }
    keys %counts;
    print "\n";
}
Here are the results using the sample input:
sequence1:
S => 1
sequence2:
V => 1
I => 1
The Bio::Tools::CodonTable class also supports non-standard codon usage tables. You can change the table using the id pointer. For example:
$table = Bio::Tools::CodonTable->new( -id => 5 );
or:
$table->id(5);
For more information, including how to examine these tables, please see the documentation here: http://metacpan.org/pod/Bio::Tools::CodonTable
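To make the frame logic of the script explicit, here is a rough Python sketch of the same idea without BioPerl: find each occurrence of the pattern, keep only the codons of frame +1 that lie entirely inside it, and count the translated amino acids. The codon table is the standard genetic code built from the usual TCAG ordering; `coded_amino_acids` is a hypothetical name, unknown codons are mapped to "X" as an assumption, and matches are taken without overlap.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def coded_amino_acids(sequence, pattern):
    """Count amino acids whose frame +1 codon lies fully inside a pattern match."""
    counts = Counter()
    start = sequence.find(pattern)
    while start != -1:
        end = start + len(pattern)
        first = start + (-start % 3)  # first codon boundary at or after start
        for i in range(first, end - 2, 3):
            counts[CODON_TABLE.get(sequence[i:i + 3], "X")] += 1
        start = sequence.find(pattern, end)  # continue after this match
    return counts


print(dict(coded_amino_acids("ATGGCGCATAGTAATGC", "ATAGTA")))
print(dict(coded_amino_acids("ATGATAGTAATGCGCGC", "ATAGTA")))
```

On the two sample sequences this gives S once for sequence1 and I and V once each for sequence2, which agrees with the results listed above.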
I will stick to the first version of what you wanted, because the addendum only confused me even more. (Frame?)
I only found ATAGTA once in sequence2, but I assume you want the mirror image/reverse sequence as well, which would be ATGATA in this case. My script doesn't do that, so you would have to list both in the input sequences file, but that should be no problem.
I work with a file like yours, which I call "dna.txt", and an input sequences file called "input_seq.txt". The result file is a listing of patterns and their occurrences in the dna.txt file (including overlapping matches, but it can be set to non-overlapping as explained in the awk code).
input_seq.txt:
GC
ATA
ATAGTA
ATGATA
dna.txt:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
results.txt:
GC,6
ATA,2
ATAGTA,2
ATGATA,1
The code is one awk script calling another (but one of them is simple). You have to run
"./match_patterns.awk input_seq.txt" to generate the results file.
match_patterns.awk:
#! /bin/awk -f
{return_value= system("awk -vsubval="$1" -f test.awk dna.txt")}
test.awk:
#! /bin/awk -f
{
    string = $0
    do {
        where = match(string, subval)
        # code is for overlapping matches (i.e ATA matches twice in ATATAC)
        # for non-overlapping replace +1 by +RLENGTH in following line
        if (RSTART != 0) { count++; string = substr(string, RSTART+1) }
    } while (RSTART != 0)
}
END { print subval","count >> "results.txt" }
Files have to be all in the same directory.
Good luck!
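The overlapping versus non-overlapping distinction in the awk comments can be sketched in Python like this. Note the difference from the awk version: this sketch searches for a literal substring, whereas awk's match() treats the pattern as a regular expression. `count_matches` is a hypothetical helper name.

```python
def count_matches(text, pattern, overlapping=True):
    """Count occurrences of a literal pattern, optionally allowing overlaps."""
    count = 0
    # Step 1 character past each hit to allow overlaps, or a whole pattern
    # length to forbid them (the +1 versus +RLENGTH choice in the awk code).
    step = 1 if overlapping else len(pattern)
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + step)
    return count


print(count_matches("ATATAC", "ATA"))
print(count_matches("ATATAC", "ATA", overlapping=False))
```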

Error in regex hyphen usage inside brackets utilizing perl

I have this perl script that compares two arrays to give me back those results found in both of them. The problem arises I believe in a regular expression, where it encounters a hyphen ( - ) inside of brackets [].
I am getting the following error:
Invalid [] range "5-3" in regex; marked by <-- HERE in m/>gi|403163623|ref|XP_003323683.2| leucyl-tRNA synthetase [Puccinia graminis f. sp. tritici CRL 75-3 <-- HERE 6-700-3]
MAQSTPSSIQELMDKKQKEATLDMGGNFTKRDDLIRYEKEAQEKWANSNIFQTDSPYIENPELKDLSGEE
LREKYPKFFGTFPYPYMNGSLHLGHAFTISKIEFAVGFERMRGRRALFPVGWHATGMPIKSASDKIIREL
EQFGQDLSKFDSQSNPMIETNEDKSATEPTTASESQDKSKAKKGKIQAKSTGLQYQFQIMESIGVSRTDI
PKFADPQYWLQYFPPIAKNDLNAFGARVDWRRSFITTDINPYYDAFVRWQMNRLKEKGYVKFGERYTIYS
PKDGQPCMDHDRSSGERLGSQEYTCLKMKVLEWGPQAGDLAAKLGGKDVFFV at comparer line 21, <NUC> chunk 168.
I thought the error could be solved by just adding \Q..\E in the regex so as to bypass the [] but this has not worked. Here is my code, and thanks in advance for any and all help that you may offer.
@cyt = <CYT>;
@nuc = <NUC>;
$cyt = join ('',@cyt);
$cyt =~ /\[([^\]]+)\]/g;
@shared = '';
foreach $nuc (@nuc) {
    if ($cyt =~ $nuc) {
        push @shared, $nuc;
    }
}
print @shared;
What I am trying to achieve with this code is to compare two different lists loaded into the arrays @cyt and @nuc. I then compare the name between the [] of each element in one list to the name in [] of the elements in the other. All matches are then pushed into @shared. Hope that clarifies it a bit.
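The error message suggests that a whole record is being interpreted as a pattern, so the bracketed organism name is parsed as a character class and `5-3` becomes an invalid range. A minimal Python sketch of the same effect and of two ways around it: escaping the text before using it as a pattern (the counterpart of Perl's \Q..\E or quotemeta), or skipping regular expressions and testing for a plain substring. The sample strings are taken from the error message; the helper names are hypothetical.

```python
import re

name = "[Puccinia graminis f. sp. tritici CRL 75-36-700-3]"
text = "leucyl-tRNA synthetase [Puccinia graminis f. sp. tritici CRL 75-36-700-3]"


def is_valid_pattern(candidate):
    """Report whether a string compiles as a regular expression."""
    try:
        re.compile(candidate)
        return True
    except re.error:
        return False


def contains_literal(haystack, needle):
    """Match needle literally by escaping every regex metacharacter in it."""
    return re.search(re.escape(needle), haystack) is not None


print(is_valid_pattern(name))        # the raw name is rejected as a pattern
print(contains_literal(text, name))  # the escaped name matches literally
print(name in text)                  # a plain substring test needs no regex
```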
Your question describes a set intersection, which is covered in the Perl FAQ.
How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
Use a hash. Here's code to do both and more. It assumes that each
element is unique in a given array:
my (@union, @intersection, @difference);
my %count = ();
foreach my $element (@array1, @array2) { $count{$element}++ }
foreach my $element (keys %count) {
    push @union, $element;
    push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
}
Note that this is the symmetric difference, that is, all elements in
either A or in B but not in both. Think of it as an xor operation.
Applying it to your problem gives the code below.
Factor out the common code to find the names in the data files. This sub assumes that every [name] is entirely contained within a given line rather than crossing a newline boundary, and that each line of input contains at most one [name].
If these assumptions are invalid, please provide more representative samples of your inputs.
Note the use of the /x regex switch that tells the regex parser to ignore most whitespace in patterns. In the code below, this permits visual separation between the brackets that are delimiters and the brackets surrounding the character class that captures names.
sub extract_names {
    my($fh) = @_;
    my %name;
    while (<$fh>) {
        ++$name{$1} if /\[ ([^\]]+) \]/x;
    }
    %name;
}
Your question uses old-fashioned typeglob filehandles. Note that the parameter extract_names expects is a filehandle. Convenient parameter passing is one of many benefits of indirect filehandles, such as those created below.
open my $cyt, "<", "cyt.dat" or die "$0: open: $!";
open my $nuc, "<", "nuc.dat" or die "$0: open: $!";
my %cyt = extract_names $cyt;
my %nuc = extract_names $nuc;
With the names from cyt.dat in the hash %cyt and likewise for nuc.dat and %nuc, the code here iterates over the keys of both hashes and increments the corresponding keys in %shared.
my %shared;
for (keys %cyt, keys %nuc) {
++$shared{$_};
}
At this point, %shared represents a set union of the names in cyt.dat and nuc.dat. That is, %shared contains all keys from either %cyt or %nuc. To compute the set intersection, we observe that the value in %shared for a key present in both inputs must be greater than one.
The final pass below iterates over the keys in sorted order (because hash keys are stored internally in an undefined order). For truly shared keys (i.e., those whose values are greater than one), the code prints them and deletes the rest.
for (sort keys %shared) {
    if ($shared{$_} > 1) {
        print $_, "\n";
    }
    else {
        delete $shared{$_};
    }
}
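For comparison, the same extract-then-intersect approach can be sketched in Python with sets. It makes the same assumption as the Perl sub (at most one [name] per line, never split across lines); the sample lines in the usage below are invented for illustration, and the function names simply mirror the Perl ones.

```python
import re

# Capture the text between the first pair of square brackets on a line.
NAME_IN_BRACKETS = re.compile(r"\[([^\]]+)\]")


def extract_names(lines):
    """Return the set of bracketed names found in an iterable of lines."""
    names = set()
    for line in lines:
        match = NAME_IN_BRACKETS.search(line)
        if match:
            names.add(match.group(1))
    return names


def shared_names(cyt_lines, nuc_lines):
    """Names present in both inputs, in sorted order."""
    return sorted(extract_names(cyt_lines) & extract_names(nuc_lines))


# Invented sample data, not taken from the real cyt/nuc files.
cyt_lines = [">id1 protein A [Species one 75-3]", "MAQSTPSS", ">id2 protein B [Species two]"]
nuc_lines = [">id3 protein C [Species two]", ">id4 protein D [Species three]"]
print(shared_names(cyt_lines, nuc_lines))
```

Because the names are compared as plain strings, hyphens and other regex metacharacters inside the brackets need no escaping.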