Parse CSV based on random header layout perl - regex

I am trying some code using Tie::Handle::CSV, however I suppose this could be done using other modules or none at all.
What I want to do is take a file of random layout and match it to what I predict the headers will be. Then I want to arrange it to my table structure.
DATA
First Name,last name,date of birth
Jim,Johansen,08/25/1989
OR
2nd Name,1st Name,D.O.B
Johansen,Jim,08/25/1989
OR
2nd Name,1st Name,D.O.B,city,county
Johansen,Jim,08/25/1989,milwaukee,N/A
As you can see, I have varying data structures. I want it to arrange them to my schema, even if the fields I request are empty. The way I want to do this is by searching my input with my column variables.
Here is what I am trying.
Code
use Tie::Handle::CSV;
my $name1 =qr/First Name|Name|1st Name/i;
my $name2 =qr/Last Name|Maiden Name|2nd Name/i;
my $date_of_birth =qr/date of birth|D.O.B/i;
my $city =qr/city|town/i;
my $csv_fh = Tie::Handle::CSV->new('list.txt', header => 1);
while (my $csv_line = <$csv_fh>)
{
print $csv_line->{'$date_of_birth'}.",".$csv_line->{'$name1'}." ".$csv_line->{'$name2'}.",".$csv_line->{'$city'}.\n"; ##note I am searching for the column {$'colummn regex'} instead of {'column'} to see if my input file matches any of the header options.
}
close $csv_fh;
My output is blank since this module is not understanding the regex I am implimenting. However, my output would contain the columns specified if I used their literal names, i.e.
The out put I want would be:
Scenario 1
Date of Birth,Name,City ##my implemented header
08/25/1989,Jim Johansen, ##noting also that if there is no 'city' in the input data, leave blank.
Scenario 2
Date of Birth,Name,City ##my implemented header
08/25/1989,Jim Johansen,
Scenario 3
Date of Birth,Name,City ##my implemented header
08/25/1989,Jim Johansen,milwaukee
Perhaps there is a better option than a module or even my regex variables. Has anyone had to parse csvs in ever changing layouts?

You never use $name1, $name2, etc, much less in a match (or substitution) operator, so you ever execute any regex match.
my $field_names = $csv_fh->header();
my ($name1_header) = grep /First Name|^Name$|1st Name/i, #$field_names;
my ($name2_header) = grep /Last Name|Maiden Name|2nd Name/i, #$field_names;
my ($dob_header ) = grep /date of birth|D\.O\.B/i, #$field_names;
my ($city_header ) = grep /city|town/i, #$field_names;
my #recognized_fields = ( $name1_header, $name2_header, $dob_header, $city_header );
my %recognized_fields = map { $_ => 1 } #recognized_fields;
my #other_headers = grep !$recognized_fields{$_}, #$field_names;
while (my $row = <$csv_fh>) {
my $name1 = $name1_header ? $row->{$name1_header} : undef;
my $name2 = $name2_header ? $row->{$name2_header} : undef;
my $dob = $dob_header ? $row->{$dob_header } : undef;
my $city = $city_header ? $row->{$city_header } : undef;
my #other_fields = #$row{#other_headers};
...
}

Related

How do i read lines from a file into a hash in Perl

I'm working with a file format that has lines of information that looks like this:
ATOM 1 N LYS A 56 20.508 14.774 -7.432 1.00 50.83 N
All i want is the first number, and the three numbers following '56' in the example above; so im using regular expressions to get that information. How do i then put that info into a hash?
So far i have:
my $pdb_file = $ARGV[0];
open (PDBFILE, "<$pdb_file") or die ("$pdb_file not found");
while (<PDBFILE>) {
if ($_=~ /^ATOM\s+(\d+)\s+\w+\s+\w+\s+\w+\s+\d+\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)/) {
my $atom = $1;
my $xcor = $2;
my $ycor = $3;
my $zcor = $4;
print "AtomNumber: $atom\t xyz: $xcor $ycor $zcor\n";
}
}
Instead of using a regex, I would instead recommend using split to split it into fields on whitespace. This will be faster and more robust, it doesn't depend on a detailed knowledge of the format of each field (which could change, like if a number has a minus sign which you forgot to take into account). And it's a lot easier to understand.
my #fields = split /\s+/, $line;
Then you can pick out the fields (for example, the first number is field 2, so $fields[1]) and put them into your hash.
my %coordinate = (
atom => $fields[1],
x => $fields[6],
y => $fields[7],
z => $fields[8]
);
You're reading a bunch of lines, so you're going to make a bunch of hashes which have to go somewhere. I'd recommend putting them all in another hash with some sort of unique field as the key. Possibly the atom field.
$atoms{$coordinate{atom}} = \%coordinate;

First instance after specific string

So, I have this RegEx that captures a specific string I need (thanks to Shawn Mehan):
>?url:\'\/watch\/(video[\w-\/]*)
It works great, but now I need to mod my criteria. I need to capture ONLY the first URL after EACH instance of: videos:[{title:. Bolded all instances below and also bolded the first URL I'd want captured as an example.
How might I approach this? I have a VBScript that will dump each URL to a text file, so I just need help selecting the correct URLs from the blob below. Thinking something like, "if this string is found, do this, loop". Setting the regex global to false should only grab the first instance each round, right? A basic example would help.
I believe I have all of the pieces I need, but I'm not quite sure how to put them together. I'm expecting the code below to loop through and find the index of each instance of "videos:[{title:", then the regex to grab the first URL after (regexp global set to false) based on the pattern, then write the found URL to my text file, loop until all are found. Not working...
(larger portion of html_dump: http://pastebin.com/6i5gmeTB)
Set objWshShell = Wscript.CreateObject("Wscript.Shell")
Set fso = CreateObject("Scripting.FileSystemObject")
Set objRegExp = new RegExp
objRegExp.Global = False
objRegExp.Pattern = ">?url:\'\/watch\/(video[\w-\/]*)"
filename = fso.GetParentFolderName(WScript.ScriptFullName) & "\html_dump.txt" 'Text file contains html
set urldump = fso.opentextfile(filename,1,true)
do until urldump.AtEndOfStream
strLine = urldump.ReadLine()
strSearch = InStrRev(strLine, "videos:[{title:") 'Attempting to find the position of "videos:[{title:" to grab the first URL after.
If strSearch >0 then
Set myMatches = objRegExp.Execute(strLine) 'This matches the URL pattern.
For Each myMatch in myMatches
strCleanURL = myMatch.value
next
'===Writes clean urls to txt file...or, it would it if worked===
filename1 = fso.GetParentFolderName(WScript.ScriptFullName) & "\URLsClean.txt" 'Creates and writes to this file
set WriteURL = fso.opentextfile(filename1,2,true)
WriteURL.WriteLine strCleanURL
WriteURL.Close
else
End if
loop
urldump.close
var streams = [ {streamID:138, cards:[{cardId: 59643,cardTypeId: 48,clickCount: 84221,occurredOn: '2015-08-17T15:30:17.000-07:00',expiredOn: '',header: 'Latest News Headlines', subHeader: 'Here are some of the latest headlines from around the world.', link: '/watch/playlist/544/Latest-News-Headlines', earn: 3, playlistRevisionID: 3427, image: 'http%3A%2F%2Fpthumbnails.5min.com%2F10380591%2F519029502_3_o.jpg', imageParamPrefix: '?', size: 13, durationMin: 15, durationTime: '14:34',pos:0,trkId:'2gs55j6u0nz8', true,videos:[{title:'World\'s First Sky Pool Soon To Appear In South London',thumbnail:'http%3A%2F%2Fpthumbnails.5min.com%2F10380509%2F519025436_c_140_105.jpg',durationTime:'0:39',url:'/watch/video/716424/worlds-first-sky-pool-soon-to-appear-in-south-london',rating:'4.2857'},{title:'Treasure Hunters Find $4.5 Million in Spanish Coins',thumbnail:'http%3A%2F%2Fpthumbnails.5min.com%2F10380462%2F519023092_3.jpg',durationTime:'0:54',url:'/watch/video/715927/treasure-hunters-find-4-5-million-in-spanish-coins',rating:'4.25'},{title:'Former President Jimmy Carter Says Cancer Has Spread to Brain',thumbnail:'http%3A%2F%2Fpthumbnails.5min.com%2F10380499%2F519024920_c_140_105.jpg',durationTime:'1:59',url:'/watch/video/716363/former-president-jimmy-carter-says-cancer-has-spread-to-brain',rating:'2.8889'},{title:'Josh Duggar Had Multiple Accounts on AshleyMadison.Com',thumbnail:'http%3A%2F%2Fpthumbnails.5min.com%2F10380505%2F519025222_c_140_105.jpg',durationTime:'1:30',
Assuming that your input comes from a file and is in correct JSON format you could do something like this in PowerShell:
$jsonfile = 'C:\path\to\input.txt'
$json = Get-Content $jsonfile -Raw | ConvertFrom-Json
$json.streams.cards | ForEach-Object { $_.videos[0].url }
The above is assuming that streams is the topmost key in your JSON data.
Note that the code requires at least PowerShell v3.

Perl String Regular Expression

i need some help in Perl regular expressions.
I have this string:
{
"ITEM":[
{
"-itemID": "1000000" ,
"-itemName": "DisneyJuniorLA" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000001" ,
"-itemName": "31 minutos" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
{
"-itemID": "1000002" ,
"-itemName": "Plaza Sésamo" ,
"-thumbUrl": "" ,
"-packageID": "1" ,
"-itemPrice": "0" ,
"-isLock": "true"
},
]
}
The string is in a variable: $jsonString
I have another variable: $itemName
I want to only keep in $jsonString the itemId value that is above itemName (where itemName equals $itemName)
I would really appreciate your help. I am really amateur in regular expressions.
Thank you!
Notwithstanding that your JSON string is very slightly malformed (there's an extra comma after the last element in the array that should be fixed by whoever's generating the "JSON"), attempting to use regexps to handle this just means you now have two problems instead of one.
More specifically, objects within JSON are explicitly unordered sets of key/value pairs. It's perfectly possible that whatever's changing the JSON could be rewritten such that the JSON is semantically identical but serialised differently, making anything that relies on the current structure brittle and error prone.
Instead, use a proper JSON decoder, and then traverse the resulting object hierarchy directly to find the desired element:
use JSON;
use utf8;
# decode the JSON
my $obj = decode_json($jsonString);
# get the ITEM array
my $itemRef = $obj->{ITEM};
# find all elements matching the item name
my #match = grep { $_->{'-itemName'} eq $itemName } #{$itemRef};
# extract the item ID
if (#match) {
my $itemID = $match[0]->{'-itemID'};
print $itemID;
}
Don't use a regular expression to parse JSON. Use JSON.
Basically :
use strict;
use warnings;
use Data::Dumper;
use JSON;
my $json_string;
{
open( my $json_in, "<", 'test.json' );
local $/;
$json_string = <$json_in>;
}
my $json = decode_json ( $json_string );
print Dumper \$json;
foreach my $item ( #{ $json -> {'ITEM'} } ) {
print $item -> {'-itemID'},"\n";
print $item -> {'-itemName'},"\n";
}
But you have to fix your json first. (There's a trailing comma that shouldn't be there. )
JSON is a defined data transfer structure. Whilst you can technically treat it as 'plain text' and extract things from the text, that's definitively the wrong way to do things.
It might work fine for a good long time, but if your source program changes a little - and change their output, whilst still sticking to the JSON standard - your code will break unexpectedly, and you may not realise. That can set off a domino effect of breakages, making a whole system or site just crash and burn. And worse yet - the source of this crash and burn will be hidden away in some script that hasn't been touched in years, so will be very difficult to fix.
This is one of my pet peeves as a professional sysadmin. Please don't even go there.

Retrieve the coding amino-acid when there is certain pattern in a DNA sequence

I would like to retrieve the coding amino-acid when there is certain pattern in a DNA sequence. For example, the pattern could be: ATAGTA. So, when having:
Input file:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
The ideal output would be a table having for each amino-acid the number of times is coded by the pattern. Here in sequence1, pattern codes only for one amino-acid, but in sequence2 it codes for two. I would like to have this tool working to scale to thousands of sequences. I've been thinking about how to get this done, but I only thought to: replace all nucleotides different than the pattern, translate what remains and get summary of the coded amino-acids.
Please let me know if this task can be performed by an already available tool.
Thanks for your help. All the best, Bernardo
Edit (due to the confusion generated with my post):
Please forget the original post and sequence1 and sequence2 too.
Hi all, and sorry for the confusion. The input fasta file is a *.ffn file derived from a GenBank file using 'FeatureExtract' tool (http://www.cbs.dtu.dk/services/FeatureExtract/download.php), so a can imagine they are already in frame (+1) and there is no need to get amino-acids coded in a frame different than +1.
I would like to know for which amino-acid the following sequences are coding for:
AGAGAG
GAGAGA
CTCTCT
TCTCTC
The unique strings I want to get coding amino-acids are repeats of three AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to retrieve coding amino-acids for repeats of four or more.
Thanks again, Bernardo
Here's some code that should at least get you started. For example, you can run like:
./retrieve_coding_aa.pl file.fa ATAGTA
Contents of retrieve_coding_aa.pl:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Bio::SeqIO;
use Bio::Tools::CodonTable;
use Data::Dumper;
my $pattern = $ARGV[1];
my $fasta = Bio::SeqIO->new ( -file => $ARGV[0], -format => 'fasta');
while (my $seq = $fasta->next_seq ) {
my $pos = 0;
my %counts;
for (split /($pattern)/ => $seq->seq) {
if ($_ eq $pattern) {
my $dist = $pos % 3;
unless ($dist == 0) {
my $num = 3 - $dist;
s/.{$num}//;
chop until length () % 3 == 0;
}
my $table = Bio::Tools::CodonTable->new();
$counts{$_}++ for split (//, $table->translate($_));
}
$pos += length;
}
print $seq->display_id() . ":\n";
map {
print "$_ => $counts{$_}\n"
}
sort {
$counts{$a} <=> $counts{$b}
}
keys %counts;
print "\n";
}
Here are the results using the sample input:
sequence1:
S => 1
sequence2:
V => 1
I => 1
The Bio::Tools::CodonTable class also supports non-standard codon usage tables. You can change the table using the id pointer. For example:
$table = Bio::Tools::CodonTable->new( -id => 5 );
or:
$table->id(5);
For more information, including how to examine these tables, please see the documentation here: http://metacpan.org/pod/Bio::Tools::CodonTable
I will stick to that first version of what you wanted cause the addendum only confused me even more. (frame?)
I only found ATAGTA once in sequence2 but I assume you want the mirror images/reverse sequence as well, which would be ATGATA in this case. Well my script doesn't do that so you would have to write it up twice in the input_sequences file but that should be no problem I would think.
I work with a file like yours which I call "dna.txt" and a input sequences file called "input_seq.txt". The result file is a listing of patterns and their occurences in the dna.txt file (including overlap-results but it can be set to non-overlap as explained in the awk).
input_seq.txt:
GC
ATA
ATAGTA
ATGATA
dna.txt:
>sequence1
ATGGCGCATAGTAATGC
>sequence2
ATGATAGTAATGCGCGC
results.txt:
GC,6
ATA,2
ATAGTA,2
ATGATA,1
Code is awk calling another awk (but one of them is simple). You have to run
"./match_patterns.awk input_seq.txt" to get the results file generated.:
*match_patterns.awk:*
#! /bin/awk -f
{return_value= system("awk -vsubval="$1" -f test.awk dna.txt")}
test.awk:
#! /bin/awk -f
{string=$0
do
{
where = match(string, subval)
# code is for overlapping matches (i.e ATA matches twice in ATATAC)
# for non-overlapping replace +1 by +RLENGTH in following line
if (RSTART!=0){count++; string=substr(string,RSTART+1)}
}
while (RSTART != 0)
}
END{print subval","count >> "results.txt"}
Files have to be all in the same directory.
Good luck!

Error in regex hyphen usage inside brackets utilizing perl

I have this perl script that compares two arrays to give me back those results found in both of them. The problem arises I believe in a regular expression, where it encounters a hyphen ( - ) inside of brackets [].
I am getting the following error:
Invalid [] range "5-3" in regex; marked by <-- HERE in m/>gi|403163623|ref|XP_003323683.2| leucyl-tRNA synthetase [Puccinia graminis f. sp. tritici CRL 75-3 <-- HERE 6-700-3]
MAQSTPSSIQELMDKKQKEATLDMGGNFTKRDDLIRYEKEAQEKWANSNIFQTDSPYIENPELKDLSGEE
LREKYPKFFGTFPYPYMNGSLHLGHAFTISKIEFAVGFERMRGRRALFPVGWHATGMPIKSASDKIIREL
EQFGQDLSKFDSQSNPMIETNEDKSATEPTTASESQDKSKAKKGKIQAKSTGLQYQFQIMESIGVSRTDI
PKFADPQYWLQYFPPIAKNDLNAFGARVDWRRSFITTDINPYYDAFVRWQMNRLKEKGYVKFGERYTIYS
PKDGQPCMDHDRSSGERLGSQEYTCLKMKVLEWGPQAGDLAAKLGGKDVFFV at comparer line 21, <NUC> chunk 168.
I thought the error could be solved by just adding \Q..\E in the regex so as to bypass the [] but this has not worked. Here is my code, and thanks in advance for any and all help that you may offer.
#cyt = <CYT>;
#nuc = <NUC>;
$cyt = join ('',#cyt);
$cyt =~ /\[([^\]]+)\]/g;
#shared = '';
foreach $nuc (#nuc) {
if ($cyt =~ $nuc) {
push #shared, $nuc;
}
}
print #shared;
What I am trying to achieve with this code is compare two different lists loaded into the arrays #cyt and #nuc. I then compare the name in between the [] of one of the elements in list to to the name in [] of the other. All those finds are then pushed into #shared. Hope that clarifies it a bit.
Your question describes a set intersection, which is covered in the Perl FAQ.
How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
Use a hash. Here's code to do both and more. It assumes that each
element is unique in a given array:
my (#union, #intersection, #difference);
my %count = ();
foreach my $element (#array1, #array2) { $count{$element}++ }
foreach my $element (keys %count) {
push #union, $element;
push #{ $count{$element} > 1 ? \#intersection : \#difference }, $element;
}
Note that this is the symmetric difference, that is, all elements in
either A or in B but not in both. Think of it as an xor operation.
Applying it to your problem gives the code below.
Factor out the common code to find the names in the data files. This sub assumes that
every [name] will be entirely contained within a given line rather than crossing a newline boundary
each line of input will contain at most one [name]
If these assumptions are invalid, please provide more representative samples of your inputs.
Note the use of the /x regex switch that tells the regex parser to ignore most whitespace in patterns. In the code below, this permits visual separation between the brackets that are delimiters and the brackets surrounding the character class that captures names.
sub extract_names {
my($fh) = #_;
my %name;
while (<$fh>) {
++$name{$1} if /\[ ([^\]]+) \]/x;
}
%name;
}
Your question uses old-fashioned typeglob filehandles. Note that the paramter extract_names expects is a filehandle. Convenient parameter passing is one of many benefits of indirect filehandles, such as those created below.
open my $cyt, "<", "cyt.dat" or die "$0: open: $!";
open my $nuc, "<", "nuc.dat" or die "$0: open: $!";
my %cyt = extract_names $cyt;
my %nuc = extract_names $nuc;
With the names from cyt.dat in the hash %cyt and likewise for nuc.dat and %nuc, the code here iterates over the keys of both hashes and increments the corresponding keys in %shared.
my %shared;
for (keys %cyt, keys %nuc) {
++$shared{$_};
}
At this point, %shared represents a set union of the names in cyt.dat and nuc.dat. That is, %shared contains all keys from either %cyt or %nuc. To compute the set difference, we observe that the value in %shared for a key present in both inputs must be greater than one.
The final pass below iterates over the keys in sorted order (because hash keys are stored internally in an undefined order). For truly shared keys (i.e., those whose values are greater than one), the code prints them and deletes the rest.
for (sort keys %shared) {
if ($shared{$_} > 1) {
print $_, "\n";
}
else {
delete $shared{$_};
}
}