Removing data from loop and saving to a file - python-2.7

I have two loops, each working with two variables.
cutoff1 and cutoff2 contain measured data, and BxTime and ByTime contain time data from 0 to 300 s (used to set up the scale in matplotlib). I have used this loop:
Loop = zip(BxTime, cutoff1)
for tup in Loop:
    print (tup[0], tup[1])

Loop2 = zip(ByTime, cutoff2)
for tup in Loop2:
    print (tup[0], tup[1])
This prints, in my PyCharm environment, a vertical list of measurements and the times of their occurrence, first from Loop, then from Loop2. My question here is a bit complex, because I need to:
Save these loops to a file that writes my data vertically: first column cutoff1, second column BxTime, third column cutoff2, fourth column ByTime.
Second, after or before step 1, I need to erase specific measurements. Any ideas?
UPDATE:
def writeData(fileName, tups):
    '''Takes a filename, creates the file blank (deleting an existing one),
    and takes a list of 2-tuples of numbers. Writes every tuple to the file
    where the second value is > 100.'''
    with open(fileName, "w") as f:
        for (BxTime, cutoff1) in tups:
            if cutoff1 <= 100:
                continue  # go to the next tuple
            f.write(str(BxTime) + '\t' + str(cutoff1) + "\n")
Loop = zip(BxTime, cutoff1)
# for tup in Loop:
#     print (tup[0], tup[1])

Loop2 = zip(ByTime, cutoff2)
# for tup in Loop2:
#     print (tup[0], tup[1])

writeData('filename1.csv', Loop)
writeData('filename2.csv', Loop2)
I have used that code, but:
There are still measurements which contain 100.0.
Before saving to a file I have to wait until the whole loop is printed; how can I avoid that?
Is there any other way to save it to a text file instead of CSV, which I can later open in Excel?

def writeData(fileName, tups):
    '''Takes a filename, creates the file blank (deleting an existing one),
    and takes a list of 2-tuples of numbers. Writes every tuple to the file
    where the second value is >= 100.'''
    with open(fileName, "w") as f:
        for (tim, dat) in tups:
            if dat < 100:
                continue  # go to the next tuple
            f.write(str(tim) + '\t' + str(dat) + "\n")  # adapt to '\r\n' for Windows
Use it like this:
writeData("kk.csv", [(1,101),(2,99),(3,200),(4,99.999)])
Output:
1 101
3 200
Should work with your zipped Loop and Loop2 of (time, data).
You might need to change the line ending from '\n' to '\r\n' on Windows.
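If you also want the four-column layout described in the question (cutoff1, BxTime, cutoff2, ByTime side by side), here is a minimal sketch along the same lines; writeColumns is a hypothetical helper, and it assumes the four lists from the question are in scope and have equal length:

def writeColumns(fileName, bx_time, cutoff1, by_time, cutoff2):
    '''Hypothetical helper: writes four tab-separated columns in the order
    cutoff1, BxTime, cutoff2, ByTime; assumes all lists have equal length.'''
    with open(fileName, "w") as f:
        for bt, c1, yt, c2 in zip(bx_time, cutoff1, by_time, cutoff2):
            f.write('\t'.join(str(v) for v in (c1, bt, c2, yt)) + '\n')

writeColumns('combined.txt', BxTime, cutoff1, ByTime, cutoff2)

A tab-separated .txt file written this way opens directly in Excel, so there is no need for a separate CSV step.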

Related

fortran averaging columns frame by frame in specific format

I would like to average the first column and the last column shown here (that is my output). My code does several things, such as comparing specific atoms named "CA" for each frame (I have 1000 frames), excluding the nearest neighbours depending on the cutoff value, counting those contacts, and printing them separately as I wish (the input file looks like this). I would like to print a file that gives me output something like the following, as an example:
1 0
2 12
3 12
....
100 16
I need guidance on how to achieve that, by forming a loop or a condition.
open(unit=11, file="0-1000.gro", status="old", action="read")
do f = 1, frames, 10
25 format (F10.5,F10.5,F10.5)
  do h = 1, natoms_frames
    read(11,format11) nom(h), resname(h), atmtype(h), num(h), x(h), y(h), z(h)
  end do
  read(11,25) lasta(f), lastb(f), lastc(f)
  count = 0
  do h = 1, natoms_frames
    if (atmtype(h).eq.' CA') then
      count = count + 1
      CAx(count) = x(h)
      CAy(count) = y(h)
      CAz(count) = z(h)
    end if
  end do
  do h = 1, count
    avg_cal = 0
    cal = 0
    do hh = h+3, count
      if (h.ne.hh) then
        ! finding distance formula from the gro file
        distance = sqrt((CAx(hh)-CAx(h))**2 + (CAy(hh)-CAy(h))**2 + (CAz(hh)-CAz(h))**2)
        if (distance.le.cutoff) then
          cal = cal + 1
          set = set + 1
          final_set = final_set + 1
          avg_cal = avg_cal + 1
        end if
      end if
    end do
    write(*,*) h, cal, final_set
  end do
end do ! end of frames
close(11)
end program num_contacts
Writing to a file (in ASCII) is the same as printing to the screen, except that you need to specify the file's unit.
So first, open your file:
open(unit = 10, file = "filename.dat", access = "sequential", form = "formatted", status = "new", iostat = ierr)
where 'unit' is a unique integer that will be associated with your file (as an ID). Note that 'form' is 'formatted' because you want to output something readable by a human. Then you can start your loop:
do h = 1, natoms_frames
  ! BUNCH OF STUFF TO CALCULATE REQUIRED VARIABLES
  write(10, *) h, cal, final_set
end do
Finally, close your file:
close(10)

Simulation in python for Unique matches of demand supply using lists

I have data in the form of a list of lists, where I am trying to match demand and supply such that each demand item matches uniquely to one supply item.
dmndId_w_freq = [['a',4,[1,2,3,4]], ['b',6,[5,6,7,8,3,4]], ['c',7,[6,5,7,9,8,3,4]], ['d',8,[1,6,3,4,5,6,7,10]]]
num_sims = 1
for sim_count in range(num_sims):
    dmndID_used_flag = {}
    splID_used_flag = {}
    dmndID_splId_one_match = {}
    for i in dmndId_w_freq:
        if i[0] not in dmndID_used_flag.keys():
            for j in i[2]:
                #print j
                #print "CLICK TO CONTINUE..."
                #raw_input()
                if j in splID_used_flag.keys():
                    i[2].remove(j)
            dmndID_splId_one_match[i[0]] = i[2][0]
            splID_used_flag[i[2][0]] = 1
            dmndID_used_flag[i[0]] = 1
    print
    print dmndID_splId_one_match
    #print splID_used_flag
    #print dmndID_used_flag
    #raw_input()
I would expect the output dmndID_splId_one_match to be something like {'a':1,'b':5,'c':6,'d':3}.
But I end up getting {'a':1,'b':5,'c':6,'d':6}
So I am NOT getting a unique match, as the supply item 6 is getting mapped to demand 'c' as well as demand 'd'.
I tried to debug the code by looping through it and it seems that the problem is in the for-loop
for j in i[2]:
The code is not going through all the elements of i[2]. This does not happen with the 'a' and 'b' parts, but starts happening with the 'c' part. It goes over 6 and 5 but skips 7 (the third element of the list [6,5,7,9,8,3,4]). Similarly, in the 'd' part it skips the second element, 6, in the list [1,6,3,4,5,6,7,10]. That is why the mapping happens twice, since I am not able to remove it.
What am I doing wrong that it is not executing the for-loop as expected?
Is there a better way to write this code?
I figured out the problem in the for loop: I am looping through i[2] and trying to modify it within the loop, which makes the iteration skip elements.
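For reference, a minimal sketch of one way around this, using the same dmndId_w_freq data from the question: build a filtered list of still-unused supply IDs instead of calling remove() on the list being iterated.

dmndId_w_freq = [['a',4,[1,2,3,4]], ['b',6,[5,6,7,8,3,4]],
                 ['c',7,[6,5,7,9,8,3,4]], ['d',8,[1,6,3,4,5,6,7,10]]]

splID_used_flag = {}
dmndID_splId_one_match = {}
for dmnd_id, freq, supply_ids in dmndId_w_freq:
    # keep only the supply IDs that have not been claimed yet
    available = [s for s in supply_ids if s not in splID_used_flag]
    if available:
        dmndID_splId_one_match[dmnd_id] = available[0]
        splID_used_flag[available[0]] = 1

print dmndID_splId_one_match   # {'a': 1, 'b': 5, 'c': 6, 'd': 3} (key order may vary)

This produces the unique matching expected in the question, because each demand only ever sees supply IDs that are still free.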

Python: referring to each duplicate item in a list by unique index

I am trying to extract particular lines from a txt output file. The lines I am interested in are a few lines above and a few lines below the key_string that I am using to search through the results. The key string is the same for each result.
fi = open('Inputfile.txt')
fo = open('Outputfile.txt', 'a')

lines = fi.readlines()
filtered_list = []
for item in lines:
    if item.startswith("key string"):
        filtered_list.append(lines[lines.index(item)-2])
        filtered_list.append(lines[lines.index(item)+6])
        filtered_list.append(lines[lines.index(item)+10])
        filtered_list.append(lines[lines.index(item)+11])

fo.writelines(filtered_list)
fi.close()
fo.close()
The output file contains the right lines for the first record, but they are repeated for every record available. How can I update the indexing so it reads every individual record? I've tried to find a solution, but as a novice programmer I was struggling to use the enumerate() function or the collections package.
First of all, it would probably help if you said what exactly goes wrong with your code (a stack trace, it doesn't work at all, etc.). Anyway, here are some thoughts. You can try to divide your problem into subproblems to make it easier to work with. In this case, let's separate finding the relevant lines from collecting them.
First, let's find the indexes of all the relevant lines.
key = "key string"
relevant = []
for i, item in enumerate(lines):
    if item.startswith(key):
        relevant.append(i)
enumerate is actually quite simple. It takes a list, and returns a sequence of (index, item) pairs. So, enumerate(['a', 'b', 'c']) returns [(0, 'a'), (1, 'b'), (2, 'c')].
What I had written above can be achieved with a list comprehension:
relevant = [i for (i, item) in enumerate(lines) if item.startswith(key)]
So, we have the indexes of the relevant lines. Now, let's collect them. You are interested in the line 2 lines before each one and the lines 6, 10 and 11 lines after it. If your first line contains the key, then you have a problem: you don't really want lines[-1], that's the last item! Also, you need to handle the situation in which your offset would take you past the end of the list; otherwise Python will raise an IndexError.
out = []
for r in relevant:
    for offset in -2, 6, 10, 11:
        index = r + offset
        if 0 <= index < len(lines):
            out.append(lines[index])
You could also catch the IndexError, but that won't save us much typing, as we have to handle negative indexes anyway.
The whole program would look like this:
key = "key string"

with open('Inputfile.txt') as fi:
    lines = fi.readlines()

relevant = [i for (i, item) in enumerate(lines) if item.startswith(key)]

out = []
for r in relevant:
    for offset in -2, 6, 10, 11:
        index = r + offset
        if 0 <= index < len(lines):
            out.append(lines[index])

with open('Outputfile.txt', 'a') as fo:
    fo.writelines(out)
To get rid of duplicates you can convert the list to a set; for example:
x = ['a', 'b', 'a']
y = set(x)
print(y)
will result in something like:
{'a', 'b'}
(a set does not preserve the original order).
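If the original order matters, a small sketch of the standard seen-set idiom (not from the answer above, just a common pattern) would be:

x = ['a', 'b', 'a']
seen = set()
unique = []
for item in x:
    if item not in seen:      # first time we meet this value
        seen.add(item)
        unique.append(item)
print(unique)                 # ['a', 'b']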

Why doesn't this double loop work as I expected?

I have two txt files, with 50000 and 25000 records, and I want to compare them to find which records appear in both. But only the first line is compared and added to the list res1 (the prints were just to get an idea of how it works). When I run the code it prints the tuple (as expected), but then it only prints the values in lineCue and skips the second loop; the resulting list holds only the first value taken by lineCue, not all the values repeated in both files.
When I tried another way, the list contained 24808 repetitions... :(
contratos = 'C:\\CONTRATOS.txt'
cuentas = 'C:\\CUENTAS0.txt'

res1 = [[], []]  # res1[0] -> ID, res1[1] -> NO ID
res2 = []        # res2 -> REPE

with open(cuentas, 'rb') as cue:
    with open(contratos, 'rb') as con:
        for lineCue in cue.xreadlines():
            print(lineCue)
            for lineCon in con.xreadlines():
                print(lineCue, lineCon)
                if lineCue == lineCon:
                    res1[0].append(lineCon)

print(res1[0])
output:
['O199924\r\n']
files:
https://dl.dropboxusercontent.com/u/33113171/CONTRATOS.txt
https://dl.dropboxusercontent.com/u/33113171/CUENTAS0.txt
In the first iteration of the outer loop you read the whole file con. You need to read it from the start each time; to do so, call con.seek(0) to go back to the beginning of the file before entering the inner loop.
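A minimal sketch of that fix applied to the code from the question (same file names and list; iterating the file object directly replaces xreadlines()):

contratos = 'C:\\CONTRATOS.txt'
cuentas = 'C:\\CUENTAS0.txt'
res1 = [[], []]

with open(cuentas, 'rb') as cue:
    with open(contratos, 'rb') as con:
        for lineCue in cue:
            con.seek(0)          # rewind CONTRATOS before each inner pass
            for lineCon in con:
                if lineCue == lineCon:
                    res1[0].append(lineCon)

print(res1[0])

Rereading the second file once per outer line is still slow for files this size; loading one file into a set and testing membership would avoid the nested loop entirely, but seek(0) is the minimal change that fixes the behaviour described above.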

Need help in improving the speed of my code for duplicate columns removal in Python

I have written code that takes a text file as input and prints only the variants which occur more than once. By variants I mean the chr positions in the text file.
The input file looks like this:
chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254
As you can see, rows 2 and 3 repeat. I'm just taking the first 3 columns and checking whether they are the same. Here, chr1 1049083 1049083 repeats in both row 2 and row 3, so I print out that there is one duplicate and its position.
I have written the code below. Though it does what I want, it's quite slow: it takes about 5 minutes to run on a file which has 700,000 rows. I wanted to know if there is a way to speed things up.
Thanks!
#!/usr/bin/env python
""" takes an input file and
prints out only the variants that occur more than once """

import shlex
import collections

rows = open('variants.txt', 'r').read().split("\n")

# removing the header and storing it in a new variable
header = rows.pop()

indices = []
for row in rows:
    var = shlex.split(row)
    indices.append("_".join(var[0:3]))

dup_list = []
ind_tuple = collections.Counter(indices).items()
for x, y in ind_tuple:
    if y > 1:
        dup_list.append(x)

print dup_list
print len(dup_list)
Note: In this case the entire row 2 is a duplicate of row 3, but that is not necessarily always the case. Duplicates of the chr positions (the first three columns) are what I'm looking for.
EDIT:
I edited the code as per damienfrancois' suggestion. Below is my new code:
f = open('variants.txt', 'r')

indices = {}
for line in f:
    row = line.rstrip()
    var = shlex.split(row)
    index = "_".join(var[0:3])
    if indices.has_key(index):
        indices[index] = indices[index] + 1
    else:
        indices[index] = 1

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1

print dup_pos
I used time to see how long both versions of the code take.
My original code:
time run remove_dup.py
14428
CPU times: user 181.75 s, sys: 2.46 s, total: 184.20 s
Wall time: 209.31 s
Code after modification:
time run remove_dup2.py
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s
I don't see any significant improvement in the time.
Some suggestions:
Do not read the whole file at once; read it line by line and process it on the fly, so you save memory operations.
Let indices be a defaultdict and increment the value at key "_".join(var[0:3]); this saves the costly (guessing here, you should use a profiler) collections.Counter(indices).items() step. See the sketch below.
Try PyPy or a Python compiler.
Split your data into as many subsets as your computer has cores, apply the program to each subset in parallel, then merge the results.
HTH
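A minimal sketch of the defaultdict suggestion, assuming the same whitespace-separated variants.txt layout as in the question (chr position in the first three columns):

from collections import defaultdict

indices = defaultdict(int)
with open('variants.txt', 'r') as f:
    for line in f:
        var = line.split()
        if len(var) >= 3:                      # skip blank or short lines
            indices["_".join(var[0:3])] += 1   # count each chr position

dup_list = [pos for pos, n in indices.iteritems() if n > 1]
print len(dup_list)

This also swaps shlex.split() for line.split(), as the answer below does; shlex carries a lot of overhead for simple whitespace-separated fields.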
A big time sink is probably the if..has_key() portion of the code. In my experience, try-except is a lot faster...
f = open('variants.txt', 'r')

indices = {}
for line in f:
    var = line.split()
    index = "_".join(var[0:3])
    try:
        indices[index] += 1
    except KeyError:
        indices[index] = 1
f.close()

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1

print dup_pos
Another option would be to replace the four try/except lines with:
indices[index] = 1 + indices.get(index,0)
This approach only tells you how many of the lines are duplicated, not how many times each is repeated. (So if one line is duplicated 3x, it will still count as one...)
If you are only trying to count the duplicates, and not delete or note them, you could tally the lines of the file as you go and compare that to the length of the indices dictionary; the difference is the number of duplicate lines (instead of looping back through and re-counting). This might save a little time, but gives a different answer:
#!/usr/bin/env python

f = open('variants.txt', 'r')

indices = {}
total_len = 0
for line in f:
    total_len += 1
    var = line.split()
    index = "_".join(var[0:3])
    indices[index] = 1 + indices.get(index, 0)
f.close()

print "Number of duplicated lines:", total_len - len(indices.keys())
I'd be curious to hear what your benchmarks are for code that does not include the has_key() test...