I am doing some programming exercises in Fortran 90, and I have to write data to a file in columns, with the name of each column written as a header. I struggled with this because my data come from a do while loop like this:
do while (a<b)
   k = 2*a - b
   a = a + c
   write(3,100) k,a
end do
100 format ('k',E5.1,X,'a',I2)
so when I set the format like this, I get k and a on each line of my data file, like:
k1 a2
k7 a21
k33 a2
and I don't know (and haven't found in the book I am reading) how to write the name of each column just once, to get a file like:
k a
1 2
7 21
33 2
Any ideas how to do that?
You apply the format specifier to every line, and consequently get a and k on each line. You need to write a header line first, and then the data:
write(3,'(a5,1X,a2)') 'k','a'
do while (a<b)
   k = 2*a - b
   a = a + c
   write(3, '(E5.1,1X,I2)') k,a
end do
I would like to average the first column and the last column shown here (that is my output). My code does several things: for each frame (I have 1000 frames) it compares specific atoms with the name "CA", excludes the nearest neighbours depending on the cutoff value, and counts those contacts and prints them separately as I wish (the input file looks like this). I would like to print a file whose output looks something like the following example:
1 0
2 12
3 12
....
100 16
I need guidance on forming a loop or a condition to achieve that.
open(unit=11,file="0-1000.gro", status="old", action="read")
do f=1,frames,10
25 format (F10.5,F10.5,F10.5)
   do h=1,natoms_frames
      read(11,format11) nom(h),resname(h),atmtype(h),num(h),x(h),y(h),z(h)
   end do
   read(11,25) lasta(f),lastb(f),lastc(f)
   count=0
   do h=1, natoms_frames
      if (atmtype(h).eq.' CA') then
         count=count+1
         CAx(count)=x(h)
         CAy(count)=y(h)
         CAz(count)=z(h)
      end if
   end do
   do h=1, count
      avg_cal=0
      cal=0
      do hh=h+3, count
         if (h.ne.hh) then
            ! finding distance formula from the gro file
            distance = sqrt((CAx(hh)-CAx(h))**2 + (CAy(hh)-CAy(h))**2 + (CAz(hh)-CAz(h))**2)
            if (distance.le.cutoff) then
               cal = cal+1
               set = set+1
               final_set = final_set+1
               avg_cal = avg_cal+1
            end if
         end if
      end do
      write(*,*) h,cal,final_set
   end do
end do ! end of frames
close(11)
end program num_contacts
Writing to a file (in ASCII) is the same as printing on screen, except that you need to specify the file's unit.
So first, open your file:
open(unit = 10, file = "filename.dat", access = "sequential", form = "formatted", status = "new", iostat = ierr)
where 'unit' is a unique integer that will be associated with your file (as an ID). Note that 'form' is 'formatted', as you want to output something readable by a human. Then you can start your loop:
do h = 1, natoms_frames
   ! BUNCH OF STUFF TO CALCULATE REQUIRED VARIABLES
   write(10, *) h, cal, final_set
end do
Finally, close your file:
close(10)
I have seven tab-delimited files; each file has exactly the same number and names of columns, but different data in each. Below is a sample of what each of the seven files looks like:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change)
000001 000001 ZZ 1:1 01 01 NOTEST 0 0 0 0 1 1 no
I am basically trying to read all seven files, extract the third, fourth and tenth columns (gene, locus, log2(fold_change)), and write those columns to a new file, so that the file looks something like this:
gene name locus log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change)
ZZ 1:1 0 0 0 0
All of the log2(fold_change) values are obtained from the tenth column of each of the seven files.
What I have so far is below, and I need help constructing a more efficient, Pythonic way to accomplish the task above; note that the code does not yet accomplish that task and still needs some work:
import csv
import glob
from collections import defaultdict

dicti = defaultdict(list)
filetag = []

def read_data(file, base):
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        for row in reader:
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))

name_of_fold = raw_input("Folder name to store output files in: ")
for file in glob.glob("*.txt"):
    base = file[0:3] + "-log2(fold_change)"
    filetag.append(base)
    read_data(file, base)
with open("output.txt", "w") as out:
    out.write("gene name" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
    for k, v in dicti.items():  # iterate over key/value pairs, not keys alone
        out.write(k + "\t" + v[1][1][3] + "\t" + "".join([int(z[0][0:3]) * "\t" + z[1][9] for z in v]) + "\n")
So, the code above is working code, but it is not what I am looking for; here is why. The issue is the output: I am writing a tab-delimited output file with the gene in the first column (k), v[1][1][3] is the locus of that particular gene, and finally, the part I am having a tough time coding is this part of the output:
"".join([ int(z[0][0:3]) * "\t" + z[1][9] for z in v ])
I am trying to list the fold change from each of the seven files at that particular gene and locus, and then write each value to its correct column. I basically multiply the column number of the file by "\t", which should ensure the value goes to the right column. The problem is that when the column for the next file comes along, the writing starts from where it left off, whereas I want to start again from the beginning of the line.
Here is what I mean, for instance:
gene name locus log2(fold change) from file 1 .... log2(fold change) from file7
ZZ 1:3 0
0
The first log2 value is recorded based on its column number, say 2; to ensure it is recorded in the right place, I multiply that column number (2) by "\t" before the fold_change value, and it is recorded without a problem. But then the last column, the seventh for instance, will not be recorded in column seven, because the writing continues from where the last write ended.
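In other words, what I want is for each file's value to land in a fixed column. A minimal sketch of that idea: build a row with one slot per file, assign each value by its file's index, and join the row once at the end (the names num_files and values_by_file here are hypothetical, for illustration only):
num_files = 7
values_by_file = {0: '0', 3: '0.5', 6: '1.2'}  # hypothetical example data

row = [''] * num_files                 # start from an empty row each time
for file_index, value in values_by_file.items():
    row[file_index] = value            # each value goes into its own slot
print('\t'.join(row))                  # join once; empty slots keep the columns aligned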
Here is my first approach:
import glob
import numpy as np

with open('output.txt', 'w') as out:
    fns = glob.glob('*.txt')  # Here you can change the pattern of the file (e.g. 'file_experiment_*.txt')
    # Title row:
    titles = ['gene_name', 'locus'] + [str(i + 1) + '_log2(fold_change)' for i in range(len(fns))]
    out.write('\t'.join(titles) + '\n')
    # Data row:
    data = []
    for idx, fn in enumerate(fns):
        cols = np.genfromtxt(fn, skip_header=1, usecols=(2, 3, 9), dtype=np.str, autostrip=True)
        if idx == 0:
            data.extend([cols[0], cols[1]])
        data.append(cols[2])
    out.write('\t'.join(data))
Content of the created file output.txt (Note: I created just three files for testing):
gene_name locus 1_log2(fold_change) 2_log2(fold_change) 3_log2(fold_change)
ZZ 1:1 0 0 0
I am using re instead of csv. The main problem with your code is the for loop that writes the output to the file. I am posting the complete code below; hope this solves the problem you have.
import collections
import glob
import re

dicti = collections.defaultdict(list)
filetag = []

def read_data(file, base):
    with open(file, 'r') as f:
        for row in f:
            r = re.compile(r'([^\s]*)\s*')
            row = r.findall(row.strip())[:-1]
            print row
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))

def main():
    name_of_fold = raw_input("Folder name to store output files in: ")
    for file in glob.glob("*.txt"):
        base = file[0:3] + "-log2(fold_change)"
        filetag.append(base)
        read_data(file, base)
    with open("output", "w") as out:
        data = ("genename" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
        r = re.compile(r'([^\s]*)\s*')
        data = r.findall(data.strip())[:-1]
        out.write('{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30} {0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
        out.write('\n')
        for key in dicti:
            print 'locus = ' + str(dicti[key][1])
            data = (key + "\t" + dicti[key][1][1][3] + "\t" + "".join([len(z[0][0:3]) * "\t" + z[1][9] for z in dicti[key]]) + "\n")
            data = r.findall(data.strip())[:-1]
            out.write('{0[0]:<30}{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30}{0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
            out.write('\n')

if __name__ == '__main__':
    main()
I also changed the name of the output file from output.txt to output, as the former may interfere with the code, since the code picks up all .txt files. I am attaching the output I got, which I assume is the format you wanted.
Thanks
gene name locus 1.t-log2(fold_change) 2.t-log2(fold_change) 3.t-log2(fold_change) 4.t-log2(fold_change) 5.t-log2(fold_change) 6.t-log2(fold_change) 7.t-log2(fold_change)
ZZ 1:1 0 0 0 0 0 0 0
Remember to append \n to the end of each line to create a line break. This method is very memory efficient, as it processes just one row at a time.
import csv
import os
import glob

# Your folder location where the input files are saved.
name_of_folder = '...'
output_filename = 'output.txt'
input_files = glob.glob(os.path.join(name_of_folder, '*.txt'))

with open(os.path.join(name_of_folder, output_filename), 'w') as file_out:
    headers_read = False
    for input_file in input_files:
        if input_file == os.path.join(name_of_folder, output_filename):
            # If the output file is in the list of input files, ignore it.
            continue
        with open(input_file, 'r') as fin:
            reader = csv.reader(fin)
            if not headers_read:
                # Read column headers just once.
                headers = reader.next()[0].split()
                headers = headers[2:4] + [headers[9]]  # Zero based indexing.
                file_out.write("\t".join(headers + ['\n']))
                headers_read = True
            else:
                _ = reader.next()  # Ignore header row.
            for line in reader:
                if line:  # Ignore blank lines.
                    line_out = line[0].split()
                    file_out.write("\t".join(line_out[2:4] + [line_out[9]] + ['\n']))
>>> !cat output.txt
gene locus log2(fold_change)
ZZ 1:1 0
ZZ 1:1 0
I'm importing a very complex .xls file that often combines multiple cells together in the variable names. After importing it into Stata, only the first cell has a variable name, and the other 3 are blank. Is it possible to write a loop to rename all the variables (which come in sets of 4)?
For instance, the variables go: Russia, B, C, D but I would like them to be named Russia_A, Russia_B, Russia_C, Russia_D. Is there a way to do this with a loop or command within Stata?
It's impossible to have blank variable names in Stata, as your own example attests. On the information given, your variable names come in fours, so you can loop over them. One basic technique is just to cycle over 1, 2, 3, 4 and act accordingly. This example works; if it's not what you want, a minimal reproducible example showing how your case differs is essential.
clear
input Russia B C D Germany E F G France H I J
42 42 42 42 42 42 42 42 42 42 42 42
end
tokenize "A B C D"
local i = 0
foreach v of var * {
    local ++i
    if `i' == 1 local stub "`v'"
    rename `v' `stub'_``i''
    if `i' == 4 local i = 0
}
ds
Russia_A Russia_C Germany_A Germany_C France_A France_C
Russia_B Russia_D Germany_B Germany_D France_B France_D
tokenize is possibly the least familiar command here, so see its help if needed.
All that said, it's unlikely that this is a useful data structure. See help reshape.
Here's another way to do it. We set up a counter running over all the variables. This perhaps is more of a finger exercise in macro manipulation.
clear
input Russia B C D Germany E F G France H I J
42 42 42 42 42 42 42 42 42 42 42 42
end
tokenize "A B C D"
forval j = 1/4 {
    local sub`j' "``j''"
}
unab all : *
tokenize "`all'"
local J : word count `all'
forval j = 1/`J' {
    local k = mod(`j', 4)
    if `k' == 0 local k = 4
    if `k' == 1 local stub "``j''"
    rename ``j'' `stub'`sub`k''
}
ds
I am new to Fortran but I am trying to adapt a Fortran code and I am having trouble doing something which I think is probably quite simple.
I want to adapt a Fortran file called original.f so that it creates an input file called input.inp and populates it with 4 integers calculated earlier in original.f, so that input.inp looks like this, for example:
&input
A = 1
B = 2
C = 3
D = 4
&end
I know how to write this format:
OPEN(UNIT=10,FILE='input.inp')
WRITE (10,00001) 1,2,3,4
...
...
...
00001 Format (/2x,'&input',
& /2x,'A = ',i4,
& /2x,'B = ',i4,
& /2x,'C = ',i4,
& /2x,'D = ',i4,
& /2x,'&end')
(or something like this that I can fiddle with when I get it working), but I am not sure how to create the input.inp file, write this into it, and then use this input file.
The input file needs to be used to run an executable called "exec". I would run this in bash as:
./exec < input.inp > output.out
where output.out contains two arrays called eg(11) and ai(11,6,2) (with their dimensions given), like:
eg(1)= 1
eg(2)= 2
...
...
...
eg(11)= 11
ai(1,1,1)= 111
ai(1,2,1)= 121
...
...
...
ai(11,6,2)=1162
Finally, I need to read these values back into original.f so that they can be used further down in the file. I have declared these arrays at the beginning of original.f as:
COMMON /DATA / eg(11),ai(11,6,2)
But I am not sure of the Fortran needed to read the data line by line from output.out to populate these arrays.
Any help for any of the stages in this process would be hugely appreciated.
Thank you very much
James
Since you have shown how you create the input file, I assume the question is how to read it. The code shows how "a" and "b" can be read from successive lines after skipping the first line. On Windows, if the resulting executable is a.exe, the commands a.exe < data.txt or type data.txt | a.exe will read from data.txt.
program xread
   implicit none
   character (len=10) :: words(3)
   integer, parameter :: iu = 5 ! assuming unit 5 is standard input
   integer :: a, b
   read (iu,*)          ! skip line with &input
   read (iu,*) words    ! read "a", "=", and "1" into 3 strings
   read (words(3),*) a  ! read integer from 3rd string
   read (iu,*) words    ! read "b", "=", and "2" into 3 strings
   read (words(3),*) b  ! read integer from 3rd string
   print*, "a =", a, " b =", b
end program xread
If I understand the expanded question correctly, you have to work with an output file, produced by some other code you did not write, with lines like eg(1) = ....
For the simplest case where you know the number of elements and their ordering beforehand, you can simply search each line for the equals sign from behind:
program readme
   implicit none
   character(100) :: buffer
   integer :: i, j, k, pos, eg(11), ai(11,6,2)

   do i = 1, 11
      read (*, '(a)') buffer  ! read the whole line; list-directed input would stop at the first blank
      pos = index(buffer, '=', back = .true.)
      read (buffer(pos+1:), *) eg(i)
   end do

   ! I have assumed an arbitrary ordering here
   do k = 1, 2
      do i = 1, 11
         do j = 1, 6
            read (*, '(a)') buffer
            pos = index(buffer, '=', back = .true.)
            read (buffer(pos+1:), *) ai(i,j,k)
         end do
      end do
   end do
end program
Assuming here for simplicity that the data are provided to standard input.
I have written some code that takes a text file as input and prints only the variants that occur more than once. By variants I mean chr positions in the text file.
The input file looks like this:
chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407
chr1 1113121 1113121 G A intronic TTLL10 0.13 rs12092254
As you can see, rows 2 and 3 repeat. I'm just taking the first 3 columns and checking whether they are the same. Here, chr1 1049083 1049083 repeats in both row 2 and row 3, so I print out that there is one duplicate, and its position.
I have written the code below. Though it does what I want, it's quite slow: it takes about 5 minutes to run on a file that has 700,000 rows. I wanted to know if there is a way to speed things up.
Thanks!
#!/usr/bin/env python
""" takes in an input file and
prints out only the variants that occur more than once """

import shlex
import collections

rows = open('variants.txt', 'r').read().split("\n")
# removing the header and storing it in a new variable
header = rows.pop()

indices = []
for row in rows:
    var = shlex.split(row)
    indices.append("_".join(var[0:3]))

dup_list = []
ind_tuple = collections.Counter(indices).items()
for x, y in ind_tuple:
    if y > 1:
        dup_list.append(x)

print dup_list
print len(dup_list)
Note: in this case the entire row 2 is a duplicate of row 3, but that is not necessarily always the case. Duplicates of the chr positions (the first three columns) are what I'm looking for.
EDIT:
Edited the code as per the suggestion of damienfrancois. Below is my new code:
import shlex

f = open('variants.txt', 'r')

indices = {}
for line in f:
    row = line.rstrip()
    var = shlex.split(row)
    index = "_".join(var[0:3])
    if indices.has_key(index):
        indices[index] = indices[index] + 1
    else:
        indices[index] = 1

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
I used time to see how long both versions of the code take.
My original code:
time run remove_dup.py
14428
CPU times: user 181.75 s, sys: 2.46 s, total: 184.20 s
Wall time: 209.31 s
Code after modification:
time run remove_dup2.py
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s
I don't see any significant improvement in the time.
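To see where the time actually goes, rather than just how long the whole run takes, the script can be profiled; a minimal sketch, assuming the script is saved as remove_dup2.py:
python -m cProfile -s cumulative remove_dup2.py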
Some suggestions:
do not read the whole file at once; read it line by line and process it on the fly; you'll save memory operations
let indices be a defaultdict and increment the value at key "_".join(var[0:3]); this saves the costly (guessing here; one should use a profiler) collections.Counter(indices).items() step (a sketch follows this list)
try pypy or a python compiler
split your data into as many subsets as your computer has cores, apply the program to each subset in parallel, then merge the results
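A minimal sketch of the first two suggestions combined, streaming the file line by line and counting with a defaultdict; it assumes the fields contain no quoted whitespace, so plain str.split stands in for shlex.split, and the filename variants.txt is taken from the question:
import collections

counts = collections.defaultdict(int)
with open('variants.txt') as f:
    for line in f:                           # stream; no whole-file read
        var = line.split()
        if len(var) >= 3:                    # skip blank or short lines
            counts["_".join(var[0:3])] += 1  # increment in place

# number of chr positions that occur more than once
print(sum(1 for n in counts.values() if n > 1))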
HTH
A big time sink is probably the if..has_key() portion of the code. In my experience, try-except is a lot faster...
f = open('variants.txt', 'r')

indices = {}
for line in f:
    var = line.split()  # plain str.split; much faster than shlex.split
    index = "_".join(var[0:3])
    try:
        indices[index] += 1
    except KeyError:
        indices[index] = 1
f.close()

dup_pos = 0
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1
print dup_pos
Another option would be to replace the four try/except lines with:
indices[index] = 1 + indices.get(index,0)
This approach only tells you how many of the lines are duplicated, not how many times each is repeated. (So if one line is duplicated 3x, it will count it once...)
If you are only trying to count the duplicates, and not delete or record them, you can tally the lines of the file as you go and compare that tally to the length of the indices dictionary; the difference is the number of duplicate lines (instead of looping back through and re-counting). This might save a little time, but it gives a different answer:
#!/usr/bin/env python
f = open('variants.txt', 'r')

indices = {}
total_len = 0
for line in f:
    total_len += 1
    var = line.split()
    index = "_".join(var[0:3])
    indices[index] = 1 + indices.get(index, 0)
f.close()

print "Number of duplicated lines:", total_len - len(indices.keys())
I'd be curious to hear what your benchmarks are for code that does not include the has_key() test...
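For reference, the same counting can also be done by feeding a generator straight into collections.Counter (which the original code already imports), avoiding the intermediate indices list; a minimal sketch, again assuming str.split is acceptable in place of shlex.split:
import collections

with open('variants.txt') as f:
    # count each chr position (first three columns) while streaming the file
    counter = collections.Counter(
        "_".join(line.split()[0:3]) for line in f if line.strip()
    )

dup_list = [pos for pos, n in counter.items() if n > 1]
print(len(dup_list))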