Adding lists in a Python elif statement - python-2.7

I have two data files (datafile1 and datafile2) and I want to add some information from datafile2 to datafile1, but only if it meets certain requirements, and then write all of the information to a new file.
Here is an example of datafile1 (I changed the tabs so it's easier to see):
#OTU S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 Seq
OTU49 0 0 0 0 0 16 0 0 0 0 0 0 1 0 0 0 0 0 catat
OTU171 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 gattt
OTU803 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 aactt
OTU2519 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 aattt
Here is an example of datafile2:
#GInumber OTU Accssn Ident Len M Gap Qs Qe Ss Se evalue bit phylum class order family genus species
1366104624 OTU49 MG926900 82.911 158 23 4 2 157 18 173 2.17e-29 139 Arthropoda Insecta Hymenoptera Braconidae Leiophron NA
342734543 OTU171 JN305047 95.513 156 7 0 2 157 23 178 9.63e-63 250 Arthropoda Insecta Lepidoptera Limacodidae Euphobetron Euphobetron cupreitincta
290756623 OTU803 GU580785 96.753 154 5 0 4 157 10 163 5.75e-65 257 Arthropoda Insecta Lepidoptera Geometridae Apocheima Apocheima pilosaria
296792336 OTU2519 GU688553 98.039 153 3 0 1 153 18 170 9.56e-68 267 Arthropoda Insecta Lepidoptera Geometridae Operophtera Operophtera brumata
What I would like to do is, for every line of datafile1, find the line in datafile2 with the same "OTU", and from datafile2 always add GInumber, Accssn, Ident, Len, M, Gap, Qs, Qe, Ss, Se, evalue, bit, phylum, and class. Depending on the value of Ident, I would also like to add order, family, genus, and species, according to these criteria:
Case #1: Ident > 98.0, add order, family, genus, and species
Case #2: Ident between 96.5 and 98.0, add order, family, "NA", "NA"
Case #3: Ident between 95.0 and 96.5, add order, "NA", "NA", "NA"
Case #4: Ident < 95.0 add "NA", "NA", "NA", "NA"
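One way to express these four cases is a small helper. This is only a sketch: it assumes that a value exactly on a threshold falls into the higher case, and the function and parameter names are made up for illustration.

```python
NA = "NA"

def taxonomy_for(ident, order, family, genus, species,
                 order_level=95.0, family_level=96.5, species_level=98.0):
    """Return [order, family, genus, species], with NA for levels Ident does not support."""
    if ident >= species_level:
        return [order, family, genus, species]
    if ident >= family_level:
        return [order, family, NA, NA]
    if ident >= order_level:
        return [order, NA, NA, NA]
    return [NA, NA, NA, NA]

# Ident of OTU171 in the example data: only the order is kept.
print(taxonomy_for(95.513, "Lepidoptera", "Limacodidae",
                   "Euphobetron", "Euphobetron cupreitincta"))
```

Because the function always returns a four-element list, the caller can extend a row with it without any special handling per case.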
The desired output would be this:
#OTU S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 Seq GInumber Accssn Ident Len M Gap Qs Qe Ss Se evalue bit phylum class order family genus species
OTU49 0 0 0 0 0 16 0 0 0 0 0 0 1 0 0 0 0 0 catat 1366104624 MG926900 82.911 158 23 4 2 157 18 173 2.17e-29 139 Arthropoda Insecta NA NA NA NA
OTU171 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 gattt 342734543 JN305047 95.513 156 7 0 2 157 23 178 9.63e-63 250 Arthropoda Insecta Lepidoptera NA NA NA
OTU803 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 aactt 290756623 GU580785 96.753 154 5 0 4 157 10 163 5.75e-65 257 Arthropoda Insecta Lepidoptera Geometridae NA NA
OTU2519 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 aattt 296792336 GU688553 98.039 153 3 0 1 153 18 170 9.56e-68 267 Arthropoda Insecta Lepidoptera Geometridae Operophtera Operophtera brumata
I wrote this script:
import csv
#Files
besthit_taxonomy_unique_file = "datafile2.txt"
OTUtablefile = "datafile1.txt"
outputfile = "outputfile.txt"
#Settings
OrderLevel = float(95.0)
FamilyLevel = float(96.5)
SpeciesLevel = float(98.0)
#Importing the OTU table, which is tab delimited
OTUtable = list(csv.reader(open(OTUtablefile, 'rU'), delimiter='\t'))
headerOTUs = OTUtable.pop(0)
#Importing the best hit taxonomy table, which is tab delimited
taxonomytable = list(csv.reader(open(besthit_taxonomy_unique_file, 'rU'), delimiter='\t'))
headertax = taxonomytable.pop(0)
headertax.pop(1)
#Getting the header info
totalheader = headerOTUs + headertax
#Merging and assigning the taxonomy at the appropriate level
outputtable = []
NAs = 4 * ["NA"] #This is a list of NAs so that I can add the appropriate number, depending on the Identity
for item in OTUtable:
    OTU = item #Just to prevent issues with the list of lists
    OTUIDtable = OTU[0]
    print OTUIDtable
    for thing in taxonomytable:
        row = thing #Just to prevent issues with the list of lists
        OTUIDtax = row[1]
        if OTUIDtable == OTUIDtax:
            OTU.append(row[0])
            OTU += row[2:15]
            PercentID = float(row[3])
            if PercentID >= SpeciesLevel:
                OTU += row[15:]
            elif FamilyLevel <= PercentID < SpeciesLevel:
                OTU += row[15:17]
                OTU += NAs[:2]
            elif OrderLevel <= PercentID < FamilyLevel:
                print row[15]
                OTU += row[15]
                OTU += NAs[:3]
            else:
                OTU += NAs
    outputtable.append(OTU)
#Writing the output file
f1 = open(outputfile, 'w')
for item in totalheader[0:-1]:
    f1.write(str(item) + '\t')
f1.write(str(totalheader[-1]) + '\n')
for row in outputtable:
    currentrow = row
    for item in currentrow[0:-1]:
        f1.write(str(item) + '\t')
    f1.write(str(currentrow[-1]) + '\n')
For the most part the output is correct, except in case #3 (Ident between 95 and 96.5), where the script writes the entry for order with a tab between every letter.
Here is an example of the output:
#OTU S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 Seq GInumber Accssn Ident Len M Gap Qs Qe Ss Se evalue bit phylum class order family genus species
OTU49 0 0 0 0 0 16 0 0 0 0 0 0 1 0 0 0 0 0 catat 1366104624 MG926900 82.911 158 23 4 2 157 18 173 2.17e-29 139 Arthropoda Insecta NA NA NA NA
OTU171 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 gattt 342734543 JN305047 95.513 156 7 0 2 157 23 178 9.63e-63 250 Arthropoda Insecta L e p i d o p t e r a NA NA NA
OTU803 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 aactt 290756623 GU580785 96.753 154 5 0 4 157 10 163 5.75e-65 257 Arthropoda Insecta Lepidoptera Geometridae NA NA
OTU2519 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 aattt 296792336 GU688553 98.039 153 3 0 1 153 18 170 9.56e-68 267 Arthropoda Insecta Lepidoptera Geometridae Operophtera Operophtera brumata
I just can't figure out what's going wrong. The rest of the time the order seems to contain the correct info, but for this one case it seems as if the information in order is stored as a list of lists. However, the output to the screen is this:
OTU171
Lepidoptera
This doesn't seem to indicate a list of lists...
I would be happy for any insights. I also appreciate if anyone has ideas for making my code more pythonic.
Andreanna
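The symptom in case #3 is consistent with how += works on a list: it extends the list with every element of the right-hand side, and iterating over a string yields its individual characters. row[15] is a single string, whereas row[15:17] and row[15:] are lists, which would explain why only the order-only branch is affected. A minimal sketch (the taxonomy list is a shortened, hypothetical stand-in for the last columns of a datafile2 row):

```python
# Hypothetical stand-in for the order/family/genus/species columns of one row.
taxonomy = ["Lepidoptera", "Limacodidae", "Euphobetron", "Euphobetron cupreitincta"]

broken = ["OTU171"]
broken += taxonomy[0]       # a string: the list is extended one character at a time

fixed = ["OTU171"]
fixed.append(taxonomy[0])   # append adds the whole string as a single element

sliced = ["OTU171"]
sliced += taxonomy[0:1]     # a one-element slice is a list, so += adds one element

print(broken)
print(fixed)
print(sliced)
```

In the script, replacing OTU += row[15] with OTU.append(row[15]) or OTU += row[15:16] should keep the order in a single column. print row[15] shows a plain string because the value itself is fine; the splitting only happens when the string is added to the list.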

Related

How to grep this line "12/15-12:24:51 <1692> ## 0 0 0 0 0 0 0 0 0 0 691 0"

I've a file called test.txt
12/15-12:24:51 <1692> ## 0 0 0 0 0 0 0 0 0 0 691 0
12/15-12:24:51 <1692> END SESSION SUMMARY
12/15-12:24:55 <1692> INFO: SESSION SUMMARY
12/15-12:24:55 <1692> + - ch G B C L S T X Y -
12/15-12:24:55 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:24:55 <1692> END SESSION SUMMARY
12/15-12:24:59 <1692> INFO: SESSION SUMMARY
12/15-12:24:59 <1692> + - ch G B C L S T X Y -
12/15-12:24:59 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:24:59 <1692> END SESSION SUMMARY
12/15-12:25:03 <1692> INFO: SESSION SUMMARY
12/15-12:25:03 <1692> + - ch G B C L S T X Y -
12/15-12:25:03 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:25:03 <1692> END SESSION SUMMARY
12/15-12:25:07 <1692> INFO: SESSION SUMMARY
12/15-12:25:07 <1692> + - ch G B C L S T X Y -
12/15-12:25:07 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:25:07 <1692> END SESSION SUMMARY
and I need the output to be
12/15-12:24:51 <1692> ## 0 0 0 0 0 0 0 0 0 0 691 0
12/15-12:24:55 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:24:59 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:25:03 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
12/15-12:25:07 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0
I tried the following, but couldn't get it to work:
cat test.txt | perl -e '$str = do { local $/; <> }; while ($str =~ /(\d\d):(\d\d):(\d\d)?\s.*/) { print "$1:$2:$3:$4\n"}'
Your one-liner has some mistakes. I will go through them, then show you a solution.
cat test.txt |
You don't need to cat into a pipe, just use the file name as argument when using diamond operator <>.
perl -e '$str = do { local $/; <> };
This slurps the entire file into a single string. This is not useful in your case. This is only useful if you are expecting matches that include newlines.
while ($str =~ /(\d\d):(\d\d):(\d\d)?\s.*/) {
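Without the /g modifier, every test of this condition starts matching from the beginning of the string again, so the loop never moves past the first time stamp (and, because that match always succeeds, it never ends). This is especially bad because you slurped the file, so you are not running in line-by-line mode.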
The regex will try to match one of the time stamps, I assume, e.g. 12:25:07. Why you would want to do that is beyond me, since each line in your input has such a time stamp, rendering the whole operation useless. You want to try to match something that is unique for the lines you do want.
print "$1:$2:$3:$4\n"}'
This part prints 4 capture groups, and you only have 3 (2 fixed and 1 optional). It will not print the entire line.
What you want is something simplistic like this:
perl -ne'print if /\#\#/' test.txt
Which will go through the file line-by-line, check each line for ## and print the lines found.
Or if you are using *nix, just grep '##' test.txt
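For comparison, the same line-by-line filter can be sketched in Python. The function name and the inline sample lines are illustrative only; in practice the lines would come from iterating over the open file.

```python
def lines_with_marker(lines, marker="##"):
    """Yield only the lines that contain the marker, one line at a time."""
    for line in lines:
        if marker in line:
            yield line

# A few lines copied from test.txt.
sample = [
    "12/15-12:24:51 <1692> ## 0 0 0 0 0 0 0 0 0 0 691 0",
    "12/15-12:24:51 <1692> END SESSION SUMMARY",
    "12/15-12:24:55 <1692> INFO: SESSION SUMMARY",
    "12/15-12:24:55 <1692> + - ch G B C L S T X Y -",
    "12/15-12:24:55 <1692> ## 0 0 0 0 0 0 0 0 0 0 692 0",
]

for line in lines_with_marker(sample):
    print(line)
```

As with the Perl version, nothing is slurped into memory and no capture groups are needed, because the whole matching line is printed.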

Divide full name into first and last when middle name is present

I wanted to see if this was doable in SAS. I have a dataset of the members of congress and want to split full name into first and last. However, occasionally they seem to list their middle initial or name. It is from a .txt file.
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
Good day,
SAS is a bit clunky when it comes to strings. However, it can be done. As others have mentioned, defining the logic is the really hard part.
Begin with some raw data...
data begin;
input raw_str $ 1-100;
cards;
Norton, Eleanor Holmes [D-DC] 16 0 440 288 0
Cohen, Steve [D-TN] 15 0 320 209 0
Schakowsky, Janice D. [D-IL] 6 0 289 186 0
McGovern, James P. [D-MA] 8 1 252 139 0
Clarke, Yvette D. [D-NY] 7 0 248 166 0
Moore, Gwen [D-WI] 2 3 244 157 1
Hastings, Alcee L. [D-FL] 13 1 235 146 0
Raskin, Jamie [D-MD] 8 1 232 136 0
Grijalva, Raul M. [D-AZ] 9 1 228 143 0
Khanna, Ro [D-CA] 4 0 223 150 0
; run;
First I select the leading names up to the first bracket, then count the number of strings in them:
data names;
set begin;
names_only = scan(raw_str,1,'[');
Nr_of_str = countw(names_only,' ');
run;
Assumption: First string is the last name.
If there are only 2 strings, the first and last are pretty easy with scan and substring:
data names2;
set names;
if Nr_of_str = 2 then do;
last_name = scan(names_only, 1, ' ');
_FirstBlank = find(names_only, ' ');
first_name = strip(substr(names_only, _FirstBlank));
end;
run;
Assumption: there are only 3 strings.
approach 1. Middle name has dot in it. Filter it out.
approach 2. Middle name is shorter than real name:
data names3;
set names2;
if Nr_of_str > 2 then do;
last_name = scan(names_only, 1, ' '); /*this should still hold*/
_FirstBlank = find(names_only, ' '); /*Substring approach */
first_name = strip(substr(names_only, _FirstBlank));
second_str = scan(names_only, 2, ' ');
third_str = scan(names_only, 3, ' ');
if find(second_str,'.') = 0 then /*1st approach */
first_name = scan(names_only, 2, ' ');
else
first_name = scan(names_only, 3, ' ');
if length(second_str) > length(third_str) then /*2nd approach: compare second and third strings */
first_name = second_str;
else
first_name = third_str;
end;
run;
For more, see the SAS documentation on the substr and scan functions.
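The same splitting logic can also be sketched outside SAS. The Python below is only an illustration under the assumptions visible in the sample rows: the last name comes before the comma, the given names run up to the first bracket, and the first given name is the first name. The pattern and function name are made up for this example.

```python
import re

# Assumed layout, based on the sample rows: "Last, First [Middle] [P-ST] numbers..."
LINE_RE = re.compile(r"^(?P<last>[^,]+),\s*(?P<given>[^\[]+?)\s*\[")

def split_name(line):
    """Return (last, first, middle) or None if the line does not fit the layout."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    given = m.group("given").split()
    # First given word is taken as the first name; anything after it is the middle part.
    return m.group("last").strip(), given[0], " ".join(given[1:])

print(split_name("Norton, Eleanor Holmes [D-DC] 16 0 440 288 0"))
print(split_name("Khanna, Ro [D-CA] 4 0 223 150 0"))
```

A name whose first given word is an initial would need the dot or length checks described above.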

Transform categorical column into dummy columns using Power Query M

Using Power Query "M" language, how would you transform a categorical column containing discrete values into multiple "dummy" columns? I come from the Python world and there are several ways to do this but one way would be below:
>>> import pandas as pd
>>> dataset = pd.DataFrame(list('ABCDACDEAABADDA'),
columns=['my_col'])
>>> dataset
my_col
0 A
1 B
2 C
3 D
4 A
5 C
6 D
7 E
8 A
9 A
10 B
11 A
12 D
13 D
14 A
>>> pd.get_dummies(dataset)
my_col_A my_col_B my_col_C my_col_D my_col_E
0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
4 1 0 0 0 0
5 0 0 1 0 0
6 0 0 0 1 0
7 0 0 0 0 1
8 1 0 0 0 0
9 1 0 0 0 0
10 0 1 0 0 0
11 1 0 0 0 0
12 0 0 0 1 0
13 0 0 0 1 0
14 1 0 0 0 0
Interesting question. Here's an easy, scalable method I've found:
Create a custom column of all ones (Add Column > Custom Column > Formula = 1).
Add an index column (Add Column > Index Column).
Pivot on the custom column (select my_col > Transform > Pivot Column).
Replace null values with 0 (select all columns > Transform > Replace Values).
Here's what the M code looks like for this process:
#"Added Custom" = Table.AddColumn(#"Previous Step", "Custom", each 1),
#"Added Index" = Table.AddIndexColumn(#"Added Custom", "Index", 0, 1),
#"Pivoted Column" = Table.Pivot(#"Added Index", List.Distinct(#"Added Index"[my_col]), "my_col", "Custom"),
#"Replaced Value" = Table.ReplaceValue(#"Pivoted Column",null,0,Replacer.ReplaceValue,Table.ColumnNames(#"Pivoted Column"))
Once you've completed the above, you can remove the index column if desired.

Convert this Word DataFrame into Zero One Matrix Format DataFrame in Python Pandas

I want to convert a DataFrame of user_Id and skills into a zero-one matrix DataFrame of users and their corresponding skills.
Input DataFrame
user_Id skills
0 user1 [java, hdfs, hadoop]
1 user2 [python, c++, c]
2 user3 [hadoop, java, hdfs]
3 user4 [html, java, php]
4 user5 [hadoop, php, hdfs]
Desired Output DataFrame
user_Id java c c++ hadoop hdfs python html php
user1 1 0 0 1 1 0 0 0
user2 0 1 1 0 0 1 0 0
user3 1 0 0 1 1 0 0 0
user4 1 0 0 0 0 0 1 1
user5 0 0 0 1 1 0 0 1
You can join a new DataFrame to the user_Id column: convert the lists to str with astype if needed (otherwise omit that step), remove the [] with strip, and use get_dummies:
df = df[['user_Id']].join(df['skills'].astype(str).str.strip('[]').str.get_dummies(', '))
print (df)
user_Id c c++ hadoop hdfs html java php python
0 user1 0 0 1 1 0 1 0 0
1 user2 1 1 0 0 0 0 0 1
2 user3 0 0 1 1 0 1 0 0
3 user4 0 0 0 0 1 1 1 0
4 user5 0 0 1 1 0 0 1 0
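If the resulting column names still contain quotes, build the dummy columns separately and strip them:
df1 = df['skills'].astype(str).str.strip('[]').str.get_dummies(', ')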
#if necessary remove ' from columns names
df1.columns = df1.columns.str.strip("'")
df = pd.concat([df['user_Id'], df1], axis=1)
print (df)
user_Id c c++ hadoop hdfs html java php python
0 user1 0 0 1 1 0 1 0 0
1 user2 1 1 0 0 0 0 0 1
2 user3 0 0 1 1 0 1 0 0
3 user4 0 0 0 0 1 1 1 0
4 user5 0 0 1 1 0 0 1 0
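To make the encoding itself explicit, here is a plain-Python sketch (no pandas) of what the zero-one matrix contains. It assumes the skills are real lists rather than strings, and the variable names are illustrative.

```python
# Input data copied from the question, keyed by user_Id.
users = {
    "user1": ["java", "hdfs", "hadoop"],
    "user2": ["python", "c++", "c"],
    "user3": ["hadoop", "java", "hdfs"],
    "user4": ["html", "java", "php"],
    "user5": ["hadoop", "php", "hdfs"],
}

# One column per distinct skill, in sorted order.
columns = sorted({skill for skills in users.values() for skill in skills})

# One row per user: 1 if the user has the skill, else 0.
matrix = {
    user: [1 if column in skills else 0 for column in columns]
    for user, skills in users.items()
}

print(columns)
for user, row in matrix.items():
    print(user, row)
```

The sorted column order here matches the alphabetical column order in the get_dummies output above.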

Using first function

I need to create a new variable WHLDR given the conditions below, and I'm not sure the last else if is correct. The intent for that branch: when multi > 1 and ref_1 = 0, the first id that has rel = 0 and ref_1 = 1 should get whldr = 1, otherwise whldr = 0, and so on for the following ids. My code and sample data are below.
data temp_all;
merge temp_1 (in=inA)
temp_2 (in=inB)
temp_3 (in=inC)
;
by id;
firstid=first.id;
if multi = 1 then do;
if rel = 0 then whldr=1;
else whldr = 0;
end;
else if multi > 1 and ref_1 >= 1 then do;
if rel =0 and ref_1=1 then whldr=1;
else whldr = 0;
end;
else if multi > 1 and ref_1 = 0 then do;
if rel =0 and ref_1=1 then do;
if rel =0 and ref_0 ne '0' then do;
if first.id=1 then whldr=1 ;
else whldr=0;
end;
end;
end;
run;
Here is sample data:
data have ;
input id a rel b multi ;
cards;
105 . 0 0 1
110 1 0 1 1
110 0 1 1 1
110 . 2 1 1
113 1 0 1 1
113 2 1 1 1
113 0 2 1 1
113 0 2 1 1
135 1 0 1 1
135 0 1 1 1
176 1 0 1 1
176 0 1 1 1
189 1 0 1 1
189 2 1 1 1
189 0 4 1 1
189 0 4 1 1
;
If you have a variable named WHLDR and you want the first observation where it has the value 1 then you can run a data step like this.
data want ;
set have (obs=1);
where whldr=1 ;
run;