Extracting Specific Columns from Multiple Files & Writing to File Python - python-2.7

I have seven tab-delimited files; each file has exactly the same number and names of columns but different data. Below is a sample of what each of the seven files looks like:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change)
000001 000001 ZZ 1:1 01 01 NOTEST 0 0 0 0 1 1 no
I am basically trying to read all seven files, extract the third, fourth and tenth columns (gene, locus, log2(fold_change)), and write those columns to a new file, so the file looks something like this:
gene name locus log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change)
ZZ 1:1 0 0 0 0
All the log2(fold_change) values are obtained from the tenth column of each of the seven files.
What I have so far is below, and I need help constructing a more efficient, Pythonic way to accomplish the task above. Note that the code does not yet accomplish the task explained above and still needs some work:
import csv
import glob
from collections import defaultdict

dicti = defaultdict(list)
filetag = []

def read_data(file, base):
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        for row in reader:
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))

name_of_fold = raw_input("Folder name to stored output files in: ")
for file in glob.glob("*.txt"):
    base = file[0:3] + "-log2(fold_change)"
    filetag.append(base)
    read_data(file, base)

with open("output.txt", "w") as out:
    out.write("gene name" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
    for k, v in dicti.items():
        out.write(k + "\t" + v[1][1][3] + "\t" + "".join([int(z[0][0:3]) * "\t" + z[1][9] for z in v]) + "\n")
So, the code above runs, but it is not what I am looking for, and here is why. The output is the issue: I am writing a tab-delimited output file with the gene in the first column (k) and v[1][1][3], the locus of that particular gene, in the second. The part I am having a tough time coding is this piece of the output line:
"".join([int(z[0][0:3]) * "\t" + z[1][9] for z in v])
I am trying to list the fold change from each of the seven files for that particular gene and locus and write each value to its correct column. I basically multiply the file's column number by "\t" to ensure that the value goes to the right column. The problem is that when the value from the next file comes along, writing continues from where it left off rather than starting again from the beginning of the line, which is not what I want.
Here is what I mean, for instance:
gene name locus log2(fold change) from file 1 .... log2(fold change) from file7
ZZ 1:3 0
0
The first log2 value is recorded based on its file's column number, say 2, so I multiply that column number (2) by "\t" before the fold_change value, and it is recorded with no problem. But the value from the last file, say the seventh, then ends up offset from where the previous write finished rather than from the start of the line, so it never lands in the seventh column.
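A minimal sketch of one way around this, reusing the dicti and filetag structures above (illustrative only, assuming the sorted filetag order defines the column order): give each gene a fixed-length row with one slot per file, so every fold-change lands in its own column no matter which file is processed first.
order = {tag: i for i, tag in enumerate(sorted(filetag))}  # column slot per file
with open("output.txt", "w") as out:
    out.write("gene name\tlocus\t" + "\t".join(sorted(filetag)) + "\n")
    for gene, entries in dicti.items():
        row = [""] * len(filetag)
        locus = entries[0][1][3]          # column 4 of the original line
        for tag, fields in entries:
            row[order[tag]] = fields[9]   # column 10: log2(fold_change)
        out.write(gene + "\t" + locus + "\t" + "\t".join(row) + "\n")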

Here is my first approach:
import glob
import numpy as np

with open('output.txt', 'w') as out:
    fns = glob.glob('*.txt')  # Here you can change the pattern of the files (e.g. 'file_experiment_*.txt')
    # Title row:
    titles = ['gene_name', 'locus'] + [str(file + 1) + '_log2(fold_change)' for file in range(len(fns))]
    out.write('\t'.join(titles) + '\n')
    # Data row:
    data = []
    for idx, fn in enumerate(fns):
        file = np.genfromtxt(fn, skip_header=1, usecols=(2, 3, 9), dtype=np.str, autostrip=True)
        if idx == 0:
            data.extend([file[0], file[1]])
        data.append(file[2])
    out.write('\t'.join(data))
Content of the created file output.txt (Note: I created just three files for testing):
gene_name locus 1_log2(fold_change) 2_log2(fold_change) 3_log2(fold_change)
ZZ 1:1 0 0 0

I am using re instead of csv. The main problem with your code is the for loop which writes the output to the file. I am posting the complete code; hope this solves the problem you have.
import collections
import glob
import re

dicti = collections.defaultdict(list)
filetag = []

def read_data(file, base):
    with open(file, 'r') as f:
        for row in f:
            r = re.compile(r'([^\s]*)\s*')
            row = r.findall(row.strip())[:-1]
            print row
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))

def main():
    name_of_fold = raw_input("Folder name to stored output files in: ")
    for file in glob.glob("*.txt"):
        base = file[0:3] + "-log2(fold_change)"
        filetag.append(base)
        read_data(file, base)
    with open("output", "w") as out:
        data = ("genename" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
        r = re.compile(r'([^\s]*)\s*')
        data = r.findall(data.strip())[:-1]
        out.write('{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30} {0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
        out.write('\n')
        for key in dicti:
            print 'locus = ' + str(dicti[key][1])
            data = (key + "\t" + dicti[key][1][1][3] + "\t" + "".join([len(z[0][0:3]) * "\t" + z[1][9] for z in dicti[key]]) + "\n")
            data = r.findall(data.strip())[:-1]
            out.write('{0[0]:<30}{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30}{0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
            out.write('\n')

if __name__ == '__main__':
    main()
I also changed the name of the output file from output.txt to just output, as the former may interfere with the code since the code reads all .txt files. I am attaching the output I got, which I assume is the format that you wanted.
Thanks
gene name locus 1.t-log2(fold_change) 2.t-log2(fold_change) 3.t-log2(fold_change) 4.t-log2(fold_change) 5.t-log2(fold_change) 6.t-log2(fold_change) 7.t-log2(fold_change)
ZZ 1:1 0 0 0 0 0 0 0

Remember to append \n to the end of each line to create a line break. This method is very memory efficient, as it just processes one row at a time.
import csv
import os
import glob

# Your folder location where the input files are saved.
name_of_folder = '...'
output_filename = 'output.txt'
input_files = glob.glob(os.path.join(name_of_folder, '*.txt'))

with open(os.path.join(name_of_folder, output_filename), 'w') as file_out:
    headers_read = False
    for input_file in input_files:
        if input_file == os.path.join(name_of_folder, output_filename):
            # If the output file is in the list of input files, ignore it.
            continue
        with open(input_file, 'r') as fin:
            reader = csv.reader(fin)
            if not headers_read:
                # Read column headers just once.
                headers = reader.next()[0].split()
                headers = headers[2:4] + [headers[9]]  # Zero-based indexing.
                file_out.write("\t".join(headers + ['\n']))
                headers_read = True
            else:
                _ = reader.next()  # Ignore header row.
            for line in reader:
                if line:  # Ignore blank lines.
                    line_out = line[0].split()
                    file_out.write("\t".join(line_out[2:4] + [line_out[9]] + ['\n']))
>>> !cat output.txt
gene locus log2(fold_change)
ZZ 1:1 0
ZZ 1:1 0

Related

use a custom function to find all words in a column

Background
The following question is a variation from Unnest grab keywords/nextwords/beforewords function.
1) I have the following word_list
word_list = ['crayons', 'cars', 'camels']
2) And df1
import pandas as pd

l = ['there are many crayons, in the blue box crayons that are',
     'cars! i like a lot of sports cars because they go fast',
     'the camels, in the middle east have many camels to ride ']
df1 = pd.DataFrame(l, columns=['Text'])
df1
Text
0 there are many crayons, in the blue box crayons that are
1 cars! i like a lot of sports cars because they go fast
2 the camels, in the middle east have many camels to ride
3) I also have a function find_next_words which uses word_list to grab words from the Text column in df1:
def find_next_words(row, word_list):
    sentence = row[0]
    trigger_words = []
    next_words = []
    for keyword in word_list:
        words = sentence.split()
        for index in range(0, len(words) - 1):
            if words[index] == keyword:
                trigger_words.append(keyword)
                next_words.append(words[index + 1:index + 3])
    return pd.Series([trigger_words, next_words], index=['TriggerWords', 'NextWords'])
4) And it's pieced together with the following
df2 = df1.join(df1.apply(lambda x: find_next_words(x, word_list), axis=1))
Output
Text TriggerWords NextWords
0 [crayons] [[that, are]]
1 [cars] [[because, they]]
2 [camels] [[to, ride]]
Problem
5) The output misses the following
crayons, from row 0 of Text column df1
cars! from row 1 of Text column df1
camels, from row 2 of Text column df1
Goal
6) Grab all corresponding words from df1 even if the words in df1 have a slight variation e.g. crayons, cars! from the words in word_list
(For this toy example, I know I could easily fix this problem by just adding these word variations to word_list = ['crayons,', 'crayons', 'cars!', 'cars', 'camels,', 'camels']. But this would be impractical to do with my real word_list, which contains ~20K words.)
Desired Output
Text TriggerWords NextWords
0 [crayons, crayons] [[in, the], [that, are]]
1 [cars, cars] [[i,like],[because, they]]
2 [camels, camels] [[in, the], [to, ride]]
Questions
How do I 1) tweak my word_list (e.g. with a regex?) or 2) tweak my find_next_words function to achieve my desired output?
You can tweak your regex to something like this:
\b(crayons|cars|camels)\b(?:[^a-z\n]*([a-z]*)[^a-z\n]*([a-z]*))
Regex Demo
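As a rough sketch (not the exact code from this answer), the pattern can be built from word_list inside find_next_words and driven with re.findall; word_list and df1 are assumed to be defined as in the question:
import re
import pandas as pd

def find_next_words(row, word_list):
    # build the alternation from word_list, then capture the two words that
    # follow each keyword, skipping punctuation and spaces in between
    pattern = r'\b(' + '|'.join(map(re.escape, word_list)) + r')\b(?:[^a-z\n]*([a-z]*)[^a-z\n]*([a-z]*))'
    matches = re.findall(pattern, row[0])
    trigger_words = [m[0] for m in matches]
    next_words = [list(m[1:]) for m in matches]
    return pd.Series([trigger_words, next_words], index=['TriggerWords', 'NextWords'])

df2 = df1.join(df1.apply(lambda x: find_next_words(x, word_list), axis=1))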
Using import nltk, change
words = sentence.split()
to
words = nltk.word_tokenize(sentence)
This tokenizes 'crayons,' into 'crayons' and ',' instead of the single token 'crayons,', which allows find_next_words to correctly identify all the words from word_list in the Text column.
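For reference, a minimal sketch of the function with only that line changed (assuming nltk and its 'punkt' tokenizer data are installed); note that the NextWords slices may now include punctuation tokens, which can be filtered out if needed:
import nltk
import pandas as pd

def find_next_words(row, word_list):
    sentence = row[0]
    trigger_words = []
    next_words = []
    for keyword in word_list:
        # word_tokenize splits 'crayons,' into 'crayons' and ',', so the
        # keyword comparison now also matches the punctuated variants
        words = nltk.word_tokenize(sentence)
        for index in range(0, len(words) - 1):
            if words[index] == keyword:
                trigger_words.append(keyword)
                next_words.append(words[index + 1:index + 3])
    return pd.Series([trigger_words, next_words], index=['TriggerWords', 'NextWords'])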

Removing data from loop and saving to a file

I have two loops which have two variables each.
cutoff1 and cutoff2 contain measured data, and BxTime and ByTime contain time data from 0 to 300 s (it was used to set up the scale in matplotlib). I have used this loop:
Loop = zip(BxTime, cutoff1)
for tup in Loop:
    print(tup[0], tup[1])

Loop2 = zip(ByTime, cutoff2)
for tup in Loop2:
    print(tup[0], tup[1])
It prints, in my PyCharm environment, a vertical list of measurements and the times of their occurrence, first from Loop, then from Loop2. My question here is a bit complex, because I need to:
Save these loops to a file which will write my data vertically: first column cutoff1, second column BxTime, third column cutoff2, fourth column ByTime.
Second, after or before step 1, I need to erase certain measurements. Any ideas?
UPDATE:
def writeData(fileName, tups):
    '''takes a filename, creates the file blank (and deletes an existing one)
    and a list of 2-tuples of numbers. writes any tuple to the file where
    the second value is > 100'''
    with open(fileName, "w") as f:
        for (BxTime, cutoff1) in tups:
            if cutoff1 <= 100:
                continue  # goes to the next tuple
            f.write(str(BxTime) + '\t' + str(cutoff1) + "\n")

Loop = zip(BxTime, cutoff1)
# for tup in Loop:
#     print(tup[0], tup[1])
Loop2 = zip(ByTime, cutoff2)
# for tup in Loop2:
#     print(tup[0], tup[1])

writeData('filename1.csv', Loop)
writeData('filename2.csv', Loop2)
I have used that code, but:
There are still measurements which contain 100.0.
Before saving to a file I have to wait until the whole loop is printed; how can I avoid that?
Is there any other way to save it to a text file instead of CSV, which I can later open in Excel?
def writeData(fileName, tups):
    '''takes a filename, creates the file blank (and deletes an existing one)
    and a list of 2-tuples of numbers. writes any tuple to the file where
    the second value is >= 100'''
    with open(fileName, "w") as f:
        for (tim, dat) in tups:
            if dat < 100:
                continue  # goes to the next tuple
            f.write(str(tim) + '\t' + str(dat) + "\n")  # adapt to '\r\n' for Windows
Use it like this:
writeData("kk.csv", [(1,101),(2,99),(3,200),(4,99.999)])
Output:
1 101
3 200
Should work with your zipped Loop and Loop2 of (time, data).
You might need to change the line-end from '\n' to '\r\n' on windows.
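For the four-column layout asked about in the original question (cutoff1, BxTime, cutoff2, ByTime side by side), a hedged sketch, assuming the four lists have equal length and have already been filtered:
# Hypothetical example, not part of the answer above: write the two series
# side by side, tab-separated, so the file opens cleanly in Excel.
with open('combined.txt', 'w') as f:
    f.write('cutoff1\tBxTime\tcutoff2\tByTime\n')
    for c1, t1, c2, t2 in zip(cutoff1, BxTime, cutoff2, ByTime):
        f.write('{}\t{}\t{}\t{}\n'.format(c1, t1, c2, t2))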

Reading in TSP file Python

I need to figure out how to read in this data from the file 'berlin52.tsp'.
This is the format I'm using
NAME: berlin52
TYPE: TSP
COMMENT: 52 locations in Berlin (Groetschel)
DIMENSION : 52
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
1 565.0 575.0
2 25.0 185.0
3 345.0 750.0
4 945.0 685.0
5 845.0 655.0
6 880.0 660.0
7 25.0 230.0
8 525.0 1000.0
9 580.0 1175.0
10 650.0 1130.0
And this is my current code
# Open input file
infile = open('berlin52.tsp', 'r')
# Read instance header
Name = infile.readline().strip().split()[1] # NAME
FileType = infile.readline().strip().split()[1] # TYPE
Comment = infile.readline().strip().split()[1] # COMMENT
Dimension = infile.readline().strip().split()[1] # DIMENSION
EdgeWeightType = infile.readline().strip().split()[1] # EDGE_WEIGHT_TYPE
infile.readline()
# Read node list
nodelist = []
N = int(intDimension)
for i in range(0, int(intDimension)):
x,y = infile.readline().strip().split()[1:]
nodelist.append([int(x), int(y)])
# Close input file
infile.close()
The code should read in the file and output a list of nodes with the values "1, 2, 3..." and so on, while the x and y values are stored to be used later for distance calculations. It can collect the headers, at least; the problem arises when creating the list of nodes.
This is the error I get though
ValueError: invalid literal for int() with base 10: '565.0'
What am I doing wrong here?
This is a file in TSPLIB format. To load it in Python, take a look at the Python package tsplib95, available through PyPI or on GitHub.
Documentation is available at https://tsplib95.readthedocs.io/
You can convert the TSPLIB file to a networkx graph and retrieve the necessary information from there.
You are feeding the string "565.0" into nodelist.append([int(x), int(y)]).
It is telling you it doesn't like that because that string is not an integer. The .0 at the end makes it a float.
So if you change that to nodelist.append([float(x), float(y)]), as just one possible solution, then you'll see that your problem goes away.
Alternatively, you can try removing or separating the '.0' from your string input.
There are two problems with the code above. I have run the code and found the following problems in the lines below:
Dimension = infile.readline().strip().split()[1]
This line should be
Dimension = infile.readline().strip().split()[2]
Instead of 1 it should be 2, because with index 1 Dimension is ':' and with index 2 Dimension is '52' (the DIMENSION line has a space before the colon).
Both are of string type.
The second problem is with the line
N = int(intDimension)
It will be
N = int(Dimension)
And lastly, in the line
for i in range(0, int(intDimension)):
Just simply use
for i in range(0, N):
Now everything will be alright I think.
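Putting those fixes together with the float() conversion suggested above gives a sketch like this (untested against the full berlin52.tsp file; EDGE_WEIGHT_TYPE also uses index 2 here because that header line, like DIMENSION, has a space before the colon in the sample shown):
# Open input file
infile = open('berlin52.tsp', 'r')

# Read instance header
Name = infile.readline().strip().split()[1]             # NAME: berlin52
FileType = infile.readline().strip().split()[1]         # TYPE: TSP
Comment = infile.readline().strip().split()[1]          # COMMENT: ...
Dimension = infile.readline().strip().split()[2]        # DIMENSION : 52
EdgeWeightType = infile.readline().strip().split()[2]   # EDGE_WEIGHT_TYPE : EUC_2D
infile.readline()                                       # NODE_COORD_SECTION

# Read node list
nodelist = []
N = int(Dimension)
for i in range(0, N):
    x, y = infile.readline().strip().split()[1:]
    nodelist.append([float(x), float(y)])                # coordinates are floats

# Close input file
infile.close()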
nodelist.append([int(x), int(y)])
The int() function can't convert x (the string '565.0') to an int because of the '.'. Add
x = x[:len(x)-2]
y = y[:len(y)-2]
to remove the '.0'.

Separating columns from a file by last column

I'm trying to do this: if the last column is a negative number from -1 to -5, then write the second and last columns to the file "neg.txt". If the last column is a positive number, the second and last columns need to be written to "pos.txt". Both of my output files end up empty after execution. I don't know what's wrong with the code; I thought the if statement could handle multiple conditions. I also tried regular expressions, but that didn't work, so I made it as simple as possible to see what is not working.
The input file looks like this:
abandon odustati -2
abandons napusta -2
abandoned napusten -2
absentee odsutne -1
absentees odsutni -1
aboard na brodu 1
abducted otet -2
accepted prihvaceno 1
My code is:
from urllib.request import urlopen
import re

pos = open('lek_pos.txt', 'w')
neg = open('lek_neg.txt', 'w')

allCondsAreOK1 = (parts[2] == '1' and parts[2] == '2' and
                  parts[2] == '3' and parts[2] == '4' and parts[2] == '5')
allCondsAreOK2 = (parts[2] == '-1' and parts[2] == '-2' and
                  parts[2] == '-3' and parts[2] == '-4' and parts[2] == '-5')

with open('leksicki_resursi.txt') as pos:
    for line in pos:
        parts = line.split()  # split line into parts
        if len(parts) > 1:  # if at least 2 columns (parts)
            if allCondsAreOK:
                pos.write(parts[1] + parts[2])
            elif allCondsAreOK2:
                neg.write(parts[1] + parts[2])
            else:
                print("nothing matches")
You don't need a regex; you just need an if/elif checking whether, after casting to int, the last value falls between -5 and -1. If it does, you write to the neg file; if the value is any non-negative number, you write to the pos file:
with open('leksicki_resursi.txt') as f, open('lek_pos.txt', 'w') as pos, open('lek_neg.txt', 'w') as neg:
    for row in map(str.split, f):
        a, b = row[1], int(row[-1])
        if b >= 0:
            pos.write("{},{}\n".format(a, b))
        elif -5 <= b <= -1:
            neg.write("{},{}\n".format(a, b))
If the positive nums must also be between 1-5 then you can do something similar to the negative condition:
if 5 >= int(b) >= 0:
    pos.write("{},{}\n".format(a, b))
elif -5 <= int(b) <= -1:
    neg.write("{},{}\n".format(a, b))
Also if you have empty lines you can filter them out:
for row in filter(None,map(str.split, f)):

Need help in improving the speed of my code for duplicate columns removal in Python

I have written code to take a text file as input and print only the variants which repeat more than once. By variants I mean chr positions in the text file.
The input file looks like this:
chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254
As you can see, rows 2 and 3 repeat. I'm just taking the first 3 columns and checking whether they are the same. Here, chr1 1049083 1049083 repeats in both row 2 and row 3, so I print out that there is one duplicate along with its position.
I have written the code below. Though it does what I want, it's quite slow: it takes about 5 minutes to run on a file which has 700,000 rows. I wanted to know if there is a way to speed things up.
Thanks!
#!/usr/bin/env python
""" takes in an input file and
prints out only the variants that occur more than once """

import shlex
import collections

rows = open('variants.txt', 'r').read().split("\n")
# removing the header and storing it in a new variable
header = rows.pop()

indices = []
for row in rows:
    var = shlex.split(row)
    indices.append("_".join(var[0:3]))

dup_list = []
ind_tuple = collections.Counter(indices).items()
for x, y in ind_tuple:
    if y > 1:
        dup_list.append(x)

print dup_list
print len(dup_list)
Note: In this case the entire row 2 is a duplicate of row 3, but that is not necessarily the case all the time. Duplicates of the chr positions (the first three columns) are what I'm looking for.
EDIT:
Edited the code as per the suggestion of damienfrancois. Below is my new code:
import shlex

f = open('variants.txt', 'r')

indices = {}
for line in f:
    row = line.rstrip()
    var = shlex.split(row)
    index = "_".join(var[0:3])
    if indices.has_key(index):
        indices[index] = indices[index] + 1
    else:
        indices[index] = 1

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1

print dup_pos
I used time to see how long both versions of the code take.
My original code:
time run remove_dup.py
14428
CPU times: user 181.75 s, sys: 2.46 s, total: 184.20 s
Wall time: 209.31 s
Code after modification:
time run remove_dup2.py
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s
I don't see any significant improvement in the time.
Some suggestions:
do not read the whole file at once; read line by line and process it on the fly; you'll save memory operations
let indices be a defaultdict and increment the value at key "_".join(var[0:3]); this saves the costly (guessing here, you should use a profiler) collections.Counter(indices).items() step (see the sketch after this list)
try PyPy or a Python compiler
split your data into as many subsets as your computer has cores, apply the program to each subset in parallel, then merge the results
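A minimal sketch of the first two suggestions (reading line by line and counting with a defaultdict); this is not the original poster's exact code:
import collections

counts = collections.defaultdict(int)
with open('variants.txt') as f:
    for line in f:
        parts = line.split()
        if len(parts) < 3:          # skip blank or malformed lines
            continue
        counts["_".join(parts[0:3])] += 1

dup_list = [pos for pos, n in counts.items() if n > 1]
print len(dup_list)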
HTH
A big time sink is probably the if..has_key() portion of the code. In my experience, try-except is a lot faster...
f = open('variants.txt', 'r')

indices = {}
for line in f:
    var = line.split()
    index = "_".join(var[0:3])
    try:
        indices[index] += 1
    except KeyError:
        indices[index] = 1
f.close()

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1

print dup_pos
Another option would be to replace the four try/except lines with:
indices[index] = 1 + indices.get(index, 0)
This approach only tells you how many of the lines are duplicated, not how many times each is repeated (so if one line is duped 3x, it will say one...).
If you are only trying to count the duplicates and not delete or note them, you could tally the lines of the file as you go and compare this to the length of the indices dictionary; the difference is the number of dupe lines (instead of looping back through and re-counting). This might save a little time, but gives a different answer:
#!/usr/bin/env python
f = open('variants.txt', 'r')

indices = {}
total_len = 0
for line in f:
    total_len += 1
    var = line.split()
    index = "_".join(var[0:3])
    indices[index] = 1 + indices.get(index, 0)
f.close()

print "Number of duplicated lines:", total_len - len(indices.keys())
I'd be curious to hear what your benchmarks are for code that does not include the has_key() test...