multithreads in python over a large multifasta

My problem is very simple, but I cannot figure out how to solve it.
I have a list of about one million sequences, and each of them needs to be aligned to a sequencing adapter. I'm thinking of doing the alignment in Python using the pairwise2 module from Biopython. I would like to use this tool because I need to collect all the alignment scores, do some math, and select the sequences based on that math. The code below works, but it is slow because only one alignment runs at a time.
from sys import argv
from Bio import SeqIO, pairwise2
def align_call(record, adapter):
    score = pairwise2.align.localms(record.seq, adapter.seq, 1, -3, -1, -0.99, one_alignment_only=1, score_only=1)
    print record.id + " " + record.seq + " " + adapter.id + " " + str(score)
    #results.append(line)
    return
if __name__ == '__main__':
    fastaSeq = argv[1]
    threads = argv[2]
    fastaAdapt = argv[3]
    listSeq = []
    adpt = list(SeqIO.parse(fastaAdapt, "fasta"))
    for record in SeqIO.parse(fastaSeq, "fasta"):
        align_call(record, adpt[0])
Therefore, I was thinking of changing the code to use multithreading or multiprocessing to speed things up by running n jobs in parallel, based on the number of threads the computer has. So I came up with something like this:
from sys import argv
from multiprocessing import Pool
from Bio import SeqIO, pairwise2
results = []
def align_call(record, adapter):
    score = pairwise2.align.localms(record.seq, adapter.seq, 1, -3, -1, -0.99, one_alignment_only=1, score_only=1)
    line = record.id + " " + record.seq + " " + adapter.id + " " + str(score)
    results.append(line)
    return results
if __name__ == '__main__':
    fastaSeq = argv[1]
    threads = argv[2]
    fastaAdapt = argv[3]
    listSeq = []
    adpt = list(SeqIO.parse(fastaAdapt, "fasta"))
    for record in SeqIO.parse(fastaSeq, "fasta"):
        pool = Pool(processes=1)
        result = pool.apply_async(align_call, args= (record, adpt[0]))
        print result.get()
The script works, but I cannot control how many sequences are sent at a time, and when there are a lot of them I run out of cores and memory.
Any ideas on how I can do this? Suggestions?
I tried implementing a Queue, but it did not work.
Thanks,
Luigi

You might want to look into vsearch (or usearch).
It is pretty fast and supports multithreading:
https://github.com/torognes/vsearch
vsearch --usearch_global big.fasta -db adapter_seq.fasta --uc outputfile.txt --id 0.8 --iddef 0 --threads 6
--id is the minimum identity to the target sequence that you will accept (80% in this case).
--iddef selects the identity definition (0 = identity based on the shortest sequence, 2 = terminal gaps stripped, ...).
You can then read in outputfile.txt and grab the alignment scores, matches, gaps, alignment length, query name, and so on for each sequence.
With the desired query names collected, you can use them to pull the relevant sequences from the original FASTA file.
If you just want the sequences that matched the adapter at or above x %, you can use --matched instead of --uc, which will give you a FASTA file of the sequences that matched above the --id threshold.
Alternatively, --samout will give you a SAM file with the sequence name, alignment length, and CIGAR string for the alignment, as well as the query sequence.
https://github.com/torognes/vsearch/releases/download/v2.7.1/vsearch_manual.pdf has the full set of output options and commands
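Reading the --uc output back into Python could look roughly like the sketch below. It assumes the usual tab-separated UC layout with ten columns (record type first, percent identity fourth, compressed alignment eighth, query label ninth, target label tenth); please check that against the manual for your vsearch version. The function name read_uc_hits and its min_identity parameter are made up for illustration.

```python
def read_uc_hits(lines, min_identity=0.0):
    """Yield (query, target, identity, cigar) for hit ('H') records.

    Assumes the 10-column, tab-separated UC layout described above.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or fields[0] != "H":
            continue  # skip no-hit ('N') and malformed records
        identity = float(fields[3])
        if identity >= min_identity:
            yield fields[8], fields[9], identity, fields[7]
```

With an open file handle, something like `wanted = set(q for q, _, _, _ in read_uc_hits(handle, 80.0))` would collect the query names to pull from the original FASTA file.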

I think the pool should be created before the SeqIO loop. You also need a lock or a callback to ensure the output is in the right order.
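A minimal sketch of that idea, with the pool created once before any work is submitted. Instead of a lock or callback, it uses Pool.imap, which returns results in input order. The helper names (biopython_score, score_item, score_all, main) and the chunksize of 1000 are illustrative choices, not part of the question's code, and the Biopython call assumes a release that still ships pairwise2.

```python
import sys
from functools import partial
from multiprocessing import Pool


def biopython_score(seq, adapter_seq):
    """Local alignment score with the question's parameters (needs Biopython)."""
    from Bio import pairwise2  # imported lazily so the helpers load without it
    return pairwise2.align.localms(seq, adapter_seq, 1, -3, -1, -0.99,
                                   score_only=True)


def score_item(item, adapter_seq, scorer):
    """Worker: takes an (id, sequence) pair and returns (id, score)."""
    seq_id, seq = item
    return seq_id, scorer(seq, adapter_seq)


def score_all(pool, items, adapter_seq, scorer=biopython_score, chunksize=1000):
    """Return an iterator of (id, score) in input order, using an existing pool."""
    worker = partial(score_item, adapter_seq=adapter_seq, scorer=scorer)
    return pool.imap(worker, items, chunksize=chunksize)


def main(fasta_seq, n_procs, fasta_adapt):
    from Bio import SeqIO
    adapter = next(SeqIO.parse(fasta_adapt, "fasta"))
    # Plain (id, string) pairs are cheap to send to the worker processes.
    items = ((rec.id, str(rec.seq)) for rec in SeqIO.parse(fasta_seq, "fasta"))
    pool = Pool(processes=n_procs)  # one pool for the whole run
    try:
        for seq_id, score in score_all(pool, items, str(adapter.seq)):
            print("%s %s %s" % (seq_id, adapter.id, score))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv[1], int(sys.argv[2]), sys.argv[3])
```

The argument order follows the question's script (sequences, number of processes, adapter). Sending records in chunks keeps the per-task overhead small compared with one apply_async call per sequence.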

Related

Python Outputting Text in Hex

I'm working with a very large text file (58 GB) that I'm attempting to split into smaller chunks. The problem I'm running into is that the smaller chunks appear to be in hex. I'm also having my terminal print each line to stdout, and there the lines look like normal strings to me. Is this known behavior? I've never encountered an issue where Python keeps spitting out hex before. Even odder, when I tried using Ubuntu's split from the command line, it also generated everything in hex.
Code snippet below:
from datetime import datetime
from os import path
working_dir = '/SECRET/'
output_dir = path.join(working_dir, 'output')
test_file = 'SECRET.txt'
report_file = 'SECRET_REPORT.txt'
output_chunks = 100000000
output_base = 'SECRET'
input = open(test_file, 'r')
report_output = open(report_file, 'w')
count = 0
at_line = 0
output_f = None
for line in input:
    if count % output_chunks == 0:
        if output_f:
            report_output.write('[{}] wrote {} lines to {}. Total count is {}'.format(
                datetime.now(), output_chunks, str(output_base + str(at_line) + '.txt'), count))
            output_f.close()
        output_f = open('{}{}.txt'.format(output_base, str(at_line)), 'wb')
        at_line += 1
    output_f.write(line.encode('ascii', 'ignore'))
    print line.encode('ascii', 'ignore')
    count += 1

Here's what was going on:
Each line started with a NUL character. When I opened parts of the file using head or PyCharm's terminal, it looked normal, but Sublime Text picked up on the NUL characters and rendered the output in hex. I had to strip '\x00' from each line of the output, and then it started looking the way I expected.
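For reference, the fix could be sketched along these lines: split the file in binary mode and drop every NUL byte from each line before writing it. The function name split_clean and its parameters are hypothetical, not taken from the script above.

```python
import os


def split_clean(src_path, out_dir, lines_per_chunk, base="chunk"):
    """Split src_path into numbered files, removing NUL bytes from each line."""
    paths = []
    out = None
    with open(src_path, "rb") as src:
        for count, line in enumerate(src):
            if count % lines_per_chunk == 0:
                if out:
                    out.close()
                # Chunks are numbered 0, 1, 2, ... in the order they are written.
                paths.append(os.path.join(out_dir, "%s%d.txt" % (base, len(paths))))
                out = open(paths[-1], "wb")
            out.write(line.replace(b"\x00", b""))
    if out:
        out.close()
    return paths
```

Working on bytes avoids any decoding step, so the only change to the data is the removal of the NUL characters.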

Python v3 inconsistent regex match returns

I'm writing a small Python script that takes a log file, matches strings within it, and saves them along with another custom string, "goal ", to another text file. Then I take some values from the second file and add them to a list. The problem is that, depending on the length of the custom string (e.g. "goalgoalgoal "), the list of values varies in length. Currently, I'm working with a log file that includes 1031 matches of the string "goal ", but the length of the list varies anywhere between ~980 and 1029.
Here is the code:
for line in inputfile:
    if "Started---" in line:
        startTime = line[11:23]
        testfile.write("\n"+"Start"+"\n"+"goal "+ startTime+"\n")
        counterLines +=1
    elif "done!" in line:
        testfile.write("\n"+find_between(line, "| ", "done!")+"\n")
    elif "Errors:" in line:
        testfile.write("\n"+"Errors:"+line.split("Errors:",1)[1]+"\n")
    elif "Warnings:" in line:
        testfile.write("\n"+"Warnings:"+line.split("Warnings:",1)[1]+"\n")
    elif "Successes:" in line:
        testfile.write("\n"+"Successes:"+line.split("Successes:",1)[1]+"\n")
    elif "END---" in line:
        endTime = line[11:23]
        testfile.write("\n"+"End"+"\n"+"endTime "+ endTime+"\n")
    else:
        print("nothing found")
testfileread = open(filePath+"\\testFile.txt", "r")
startTimesList = []
endTimesList = []
for line in testfileread:
    matchObj = re.match(r'goal', line)
    if matchObj:
        startTimesList.append(line)
print(len(startTimesList))
Do you have ideas why the code doesn't work as expected?
Thank you in advance!

Most probably it's because you don't flush testFile.txt after writing is completed; as a result, there is an unpredictable amount of data in the file when you start reading it. Calling testfile.flush() should fix the problem. Alternatively, wrap the writing logic in a with block.
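A small sketch of the with-block variant: the file is closed (and therefore flushed) when the block ends, so the later read sees everything that was written. The function name count_goal_lines and the temporary file location are illustrative only.

```python
import os
import re
import tempfile


def count_goal_lines(n_matches):
    """Write n_matches 'goal' lines, then read the file back and count them."""
    path = os.path.join(tempfile.mkdtemp(), "testFile.txt")
    # Leaving the with block closes the file, which flushes buffered writes.
    with open(path, "w") as testfile:
        for i in range(n_matches):
            testfile.write("\nStart\ngoal %d\n" % i)
    # Only now is it safe to reopen the file for reading.
    with open(path, "r") as testfileread:
        return sum(1 for line in testfileread if re.match(r"goal", line))
```

Calling count_goal_lines(1031) should give the same count on every run, because no data is left sitting in the write buffer.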

How to iterate a python list and compare items in a string or another list

Following my earlier question, I have tried to write code that returns a string if it contains a search term from a given list, as follows.
import re
from nltk import tokenize
from nltk.tokenize import sent_tokenize
def foo():
    List1 = ['risk','cancer','ocp','hormone','OCP',]
    txt = "Risk factors for breast cancer have been well characterized. Breast cancer is 100 times more frequent in women than in men.\
Factors associated with an increased exposure to estrogen have also been elucidated including early menarche, late menopause, later age\
at first pregnancy, or nulliparity. The use of hormone replacement therapy has been confirmed as a risk factor, although mostly limited to \
the combined use of estrogen and progesterone, as demonstrated in the WHI (2). Analysis showed that the risk of breast cancer among women using \
estrogen and progesterone was increased by 24% compared to placebo. A separate arm of the WHI randomized women with a prior hysterectomy to \
conjugated equine estrogen (CEE) versus placebo, and in that study, the use of CEE was not associated with an increased risk of breast cancer (3).\
Unlike hormone replacement therapy, there is no evidence that oral contraceptive (OCP) use increases risk. A large population-based case-control study \
examining the risk of breast cancer among women who previously used or were currently using OCPs included over 9,000 women aged 35 to 64 \
(half of whom had breast cancer) (4). The reported relative risk was 1.0 (95% CI, 0.8 to 1.3) among women currently using OCPs and 0.9 \
(95% CI, 0.8 to 1.0) among prior users. In addition, neither race nor family history was associated with a greater risk of breast cancer among OCP users."
    words = txt
    corpus = " ".join(words).lower()
    sentences1 = sent_tokenize(corpus)
    a = [" ".join([sentences1[i-1],j]) for i,j in enumerate(sentences1) if [item in List1] in word_tokenize(j)]
    for i in a:
        print i,'\n','\n'
foo()
The problem is that Python's IDLE does not print anything. What could I have done wrong? It just runs the code and I get this:
>
>

Your question isn't very clear to me, so please correct me if I'm getting this wrong. Are you trying to match the list of keywords (in List1) against the text (in txt)? That is:
For each keyword in List1
Do a match against every sentence in txt.
Print the sentence if it matches?
Instead of writing a complicated regular expression to solve your problem, I have broken it down into two parts.
First I break the whole text into a list of sentences. Then I write a simple regular expression and run it over every sentence. The trouble with this approach is that it is not very efficient, but hey, it solves your problem.
I hope this small chunk of code can help guide you to the real solution.
import re
def foo():
    List1 = ['risk','cancer','ocp','hormone','OCP',]
    txt = "blah blah blah - truncated"
    matches = []
    sentences = re.split(r'\.', txt)
    keyword = List1[0]
    # compile once; escape the keyword so it is matched literally
    pattern = re.compile(re.escape(keyword))
    for sentence in sentences:
        if pattern.search(sentence):
            matches.append(sentence)
    print("Sentence matching the word (" + keyword + "):")
    for match in matches:
        print(match)
--------- Generate random number -----
from random import randint
List1 = ['risk','cancer','ocp','hormone','OCP',]
print(randint(0, len(List1) - 1)) # gives you a random index; use it to access List1
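To extend the same idea from one keyword to the whole list, a sketch like the following might help. It assumes a naive sentence split (after '.', '!' or '?' followed by whitespace) rather than NLTK's tokenizer, and it matches keywords case-insensitively as whole words; the function name sentences_with_keywords is hypothetical.

```python
import re


def sentences_with_keywords(text, keywords):
    """Return the sentences that contain any keyword as a whole word."""
    pattern = re.compile(
        r"\b(?:%s)\b" % "|".join(re.escape(k) for k in keywords),
        re.IGNORECASE,
    )
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if pattern.search(s)]
```

Because of the word boundaries, 'ocp' matches "OCP" but not "OCPs"; drop the `\b` anchors if substring matches are wanted.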

Speedy test on R data frame to see if row values in one column are inside another column in the data frame

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest.
variable
FO.variable
Here's a link with the dput output of a sample of the dataframe: http://dpaste.com/2SJ6DPX
Please let me know if there's a better way of sharing this data.
All I want to do is create an additional binary Keep column, which should be:
1 if FO.variable is inside variable
0 if FO.variable is not inside variable
It seems like a simple thing... In Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past hours trying to get this in R and failing.
Here's what I've tried:
Using grepl for pattern matching. I've used grepl before, but this time I'm trying to pass a column instead of a string. My early attempts failed because I tried to combine grepl and ifelse, which resulted in grepl using only the first value in the column instead of the entire column.
My next attempt was to use transform and grep, based on another post on SO. I didn't think this would give me my exact answer, but I figured it would get me close enough to figure it out from there... The code ran for a while, then errored with an invalid subscript.
transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])
My next attempt was to use str_detect, but I don't think this is the right approach, because I want the row-level value and I think 'any' will literally use any value in the vector?
kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))
EDIT: I just tried a for loop. I would prefer a vectorized approach, but I'm pretty desperate at this point. I haven't used for loops before, as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right; I'm not sure whether I got the syntax wrong:
for(i in 1:nrow(dd)){
    if(dd[i,4] %in% dd[i,2])
        dd$test[i] <- 1
}
As I mentioned, my ideal output is an additional column with 1 or 0 depending on whether FO.variable is inside variable. For example, the first three records in the sample data would be 1, and the 4th record would be 0, since "Direct/Unknown" is not within "Organic Search, System Email".
A bonus would be a solution that runs fast. The apply options were taking a long, long time, perhaps because they were looping over every iteration across both columns?
This turned out to be nowhere near as simple as I would have thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how best to approach this.

I read the data
df = dget("http://dpaste.com/2SJ6DPX.txt")
then split the 'variable' column into its parts and figured out the lengths of each entry
v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v) ## sapply(v, length) in R-3.1.3
Then I unlisted v and created an index that maps each element of the unlisted v to the row it came from
uv = unlist(v)
idx = rep(seq_along(v), len)
Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable
test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE
Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))
f0 = function(dd) {
    v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
    len = lengths(v)
    uv = unlist(v)
    idx = rep(seq_along(v), len)
    keep = logical(nrow(dd))
    keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
    keep
}
(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compared with (the admittedly simpler and easier to understand)
f1 = function(dd)
    mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)
f1a = function(dd)
    mapply(grepl, as.character(dd$FO.variable),
           as.character(dd$variable), fixed=TRUE)
f2 = function(dd)
    apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))
with
> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
    expr     min       lq      mean   median       uq     max neval
  f0(df)  57.559  64.6940  70.26804  69.4455  74.1035  98.322   100
  f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183   100
 f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115   100
  f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704   100
Two subtle but important additions during the development of the timings were to use fixed=TRUE in the regular expression, and to coerce the factors to character.

I would go with a simple mapply in your case; as you correctly said, row-wise operations will be very slow. Also (as suggested by Martin), setting fixed = TRUE and converting to character beforehand will significantly improve performance.
transform(dd, Keep = mapply(grepl,
                            as.character(FO.variable),
                            as.character(variable),
                            fixed = TRUE))
#    VisitorIDTrue                        variable value      FO.variable FO.value  Keep
# 22      44888657 Direct / Unknown,Organic Search     1 Direct / Unknown        1  TRUE
#  2      44888657   Direct / Unknown,System Email     1 Direct / Unknown        1  TRUE
#  6      44888657             Direct / Unknown,TV     1 Direct / Unknown        1  TRUE
# 10      44888657     Organic Search,System Email     1 Direct / Unknown        1 FALSE
# 18      44888657               Organic Search,TV     1 Direct / Unknown        1 FALSE
# 14      44888657                 System Email,TV     1 Direct / Unknown        1 FALSE
# 24      44888657 Direct / Unknown,Organic Search     1   Organic Search        1  TRUE
#  4      44888657   Direct / Unknown,System Email     1   Organic Search        1 FALSE
...

Here is a data.table approach that I think is very similar in spirit to Martin's:
require(data.table)
dt <- data.table(df)
dt[,`:=`(
    fch = as.character(FO.variable),
    rn = 1:.N
)]
dt[,keep:=FALSE]
dtvars <- dt[,strsplit(as.character(variable),',',fixed=TRUE),by=rn]
setkey(dt,rn,fch)
dt[dtvars,keep:=TRUE]
dt[,c("fch","rn"):=NULL]
The idea is to
identify all pairs of rn & variable (saved in dtvars) and
see which of these pairs match rn & FO.variable pairs (in the original table, dt).

Conditional quit multiprocess in python

I'm trying to build a Python script that runs several processes in parallel. Basically, the processes are independent, work on different folders, and leave their output as text files in those folders. But in some special cases, a process might terminate with a special (boolean) status. If so, I want all the other processes to terminate right away. What is the best way to do this?
I've fiddled with multiprocessing.Condition() and multiprocessing.Manager after reading the excellent tutorial by Doug Hellmann:
http://pymotw.com/2/multiprocessing/communication.html
However, I do not seem to understand how to get a multiprocessing process to monitor a status indicator and quit if it takes a special value.
To illustrate this, I've written the small script below. It more or less does what I want, but ends in an exception. Suggestions on more elegant ways to proceed are gratefully welcomed.
br,
Gro
import multiprocessing
def input(i):
    """arbitrary chosen value(8) gives special status = 1"""
    if i == 8:
        value = 1
    else:
        value = 0
    return value
def sum_list(list):
    """accumulative sum of list"""
    sum = 0
    for x in range(len(list)):
        sum = sum + list[x]
    return sum
def worker(list,i):
    value = input(i)
    list[i] = value
    print 'processname',i
if __name__ == '__main__':
    mgr = multiprocessing.Manager()
    list = mgr.list([0]*20)
    jobs = [ multiprocessing.Process(target=worker, args=(list,i))
             for i in range(20)
             ]
    for j in jobs:
        j.start()
        sumlist = sum_list(list)
        print sumlist
        if sumlist == 1:
            break
    for j in jobs:
        j.join()
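One possible direction, offered as a sketch rather than a definitive answer: share a multiprocessing.Event between the jobs, let the special job set it, and have every job check it between units of work. In the script above, the exception most likely comes from the break, which leaves some jobs unstarted, so the later join() is called on processes that were never started. The names worker and run_all, the steps and special parameters, and the sleep standing in for real work are all assumptions made for illustration.

```python
import multiprocessing
import time


def worker(i, stop_event, steps=50, special=8):
    """Do `steps` units of work; job number `special` raises the stop flag."""
    for _ in range(steps):
        if stop_event.is_set():   # another job asked everyone to quit
            return "stopped"
        time.sleep(0.01)          # stand-in for one unit of real work
        if i == special:          # hypothetical special status
            stop_event.set()
            return "special"
    return "finished"


def run_all(n_jobs=20):
    """Start all jobs, wait for them, and report whether the flag was raised."""
    stop_event = multiprocessing.Event()
    jobs = [multiprocessing.Process(target=worker, args=(i, stop_event))
            for i in range(n_jobs)]
    for j in jobs:
        j.start()
    for j in jobs:                # every job was started, so join() is valid
        j.join()
    return stop_event.is_set()
```

Call run_all(20) from inside an `if __name__ == '__main__':` block, as in the original script. Jobs stop at their next check of the flag rather than instantly; if they must be killed mid-step, calling terminate() on each Process from the parent is another option.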
