I want to speed up some scripts that I need to analyze astronomical data. I have scripts that are run independently for every single image file that I have (which are in .fits file format). I was trying to simply start multiple files at once using Multiprocessing.Pool() and map. My code looks like this:
def sextract(name):
# creates a class instance and calls the method
astro = astrometry(name, *some_more_args)
astro.sextract()
arguments = ['PART1_cal.fits', 'PART2_cal.fits', 'PART3_cal.fits',
'PART4_cal.fits', 'PART5_cal.fits']
with closing(Pool()) as pool:
pool.map(sextract, arguments)
pool.terminate()
and the astrometry.sextract method looks like this:
def sextract(self):
sex = 'sextractor'
headerlist = ['JD-START', 'OBSERVER', 'DATE', 'TM-START', 'OBJECT',
'FLTRNAME', 'RA', 'DEC', 'AIRMASS',
'EXP_TIME']
calib = self._from
# calls the SourceExtrator in the shell with a config-file and
# the file name
os.system(sex + ' -c ' + self.sexpath + ' ' + calib)
# copies the extracted file to a folder on an external HDD
shutil.copy2(self.ext_path, self._to + '/' +
calib.split('/')[-1])
# opens the old .fits image and extracts the header
old = fits.open(calib)
o = old[0].header
# opens the new(copied) .fits image and extracts the header
# this produces the error
extr = fits.open(self._to + '/' + calib.split('/')[-1],
mode='update')
e = extr[1].header
# goes through the old header entries and copies all entries
# in the to the new header
for g in headerlist:
e[g] = o[g]
# saves the new header to the new(copied) file and closes both
# files
extr[1].header = e
extr.flush()
extr.close()
old.close()
Every time I run this code, I get an IOError: Header missing END card., which is thrown by the line where I open the new .fits and put it in the variable extr. At first I thought this would happen because the copy procedure would take too long, and the python tries to access the file before it's finished copying. I inserted a time.sleep(10) to check that, but the error still appears.
I tried using pool.apply() and the normal map() and in both cases I don't get an error. When using pool.apply() it takes almost as long as for the map() case though, so I haven't really accomplished anything doing it this way.
I know this is a very specific question, but I hope someone might know where my problem lies, or got something similar working already.
Related
I am trying to run the script below. The intention of the script is to open different fasta files one after the other, and extract the geneID. The script works well if I don't use the glob.glob function. I get this message TypeError: coercing to Unicode: need string or buffer, list found
files='/home/pathtofiles/files'
#print files
#sys.exit()
for file in files:
fastas=sorted(glob.glob(files + '/*.fasta'))
#print fastas[0]
output_handle=(open(fastas, 'r+'))
genes_files=list(SeqIO.parse(output_handle, 'fasta'))
geneID=genes_files[0].id
print geneID
I am running of ideas on how to direct the script to open when file after another to give me the require information.
I see what you are trying to do, but let me first explain why your current approach is not working.
You have a path to a directory with fasta files and you want to loop over the files in that directory. But observe what happens if we do:
>>> files='/home/pathtofiles/files'
>>> for file in files:
>>> print file
/
h
o
m
e
/
p
a
t
h
t
o
f
i
l
e
s
/
f
i
l
e
s
Not the list of filenames you expected! files is a string and when you apply a for loop on a string you simply iterate over the characters in that string.
Also, as doctorlove correctly observed, in your code fastas is a list and open expects a path to a file as first argument. That's why you get the TypeError: ... need string, ... list found.
As an aside (and this is more a problem on Windows then on Linux or Mac), but it is good practice to always use raw string literals (prefix the string with an r) when working with pathnames to prevent the unwanted expansion of backslash escaped sequences like \n and \t to newline and tab.
>>> path = 'C:\Users\norah\temp'
>>> print path
C:\Users
orah emp
>>> path = r'C:\Users\norah\temp'
>>> print path
C:\Users\norah\temp
Another good practice is to use os.path.join() when combining pathnames and filenames. This prevents subtle bugs where your script works on your machine bug gives an error on the machine of your colleague who has a different operating system.
I would also recommend using the with statement when opening files. This assures that the filehandle gets properly closed when you're done with it.
As a final remark, file is a built-in function in Python and it is bad practice to use a variable with the same name as a built-in function because that can cause bugs or confusion later on.
Combing all of the above, I would rewrite your code like this:
import os
import glob
from Bio import SeqIO
path = r'/home/pathtofiles/files'
pattern = os.path.join(path, '*.fasta')
for fasta_path in sorted(glob.glob(pattern)):
print fasta_path
with open(fasta_path, 'r+') as output_handle:
genes_records = SeqIO.parse(output_handle, 'fasta')
for gene_record in genes_records:
print gene_record.id
This is way I solved the problem, and this script works.
import os,sys
import glob
from Bio import SeqIO
def extracting_information_gene_id():
#to extract geneID information and add the reference gene to each different file
files=sorted(glob.glob('/home/path_to_files/files/*.fasta'))
#print file
#sys.exit()
for file in files:
#print file
output_handle=open(file, 'r+')
ref_genes=list(SeqIO.parse(output_handle, 'fasta'))
geneID=ref_genes[0].id
#print geneID
#sys.exit()
#to extract the geneID as a reference record from the genes_files
query_genes=(SeqIO.index('/home/path_to_file/file.fa', 'fasta'))
#print query_genes[geneID].format('fasta') #check point
#sys.exit()
ref_gene=query_genes[geneID].format('fasta')
#print ref_gene #check point
#sys.exit()
output_handle.write(str(ref_gene))
output_handle.close()
query_genes.close()
extracting_information_gene_id()
print 'Reference gene sequence have been added'
I want to download multiple specific links(images´ urls) into a txt file(or any file where all links can be listed underneath each others).
I get them but the code wrtite each link on the top of the other one and at the end it stays only a link :(. Also I want not repeated urls
def dlink(self, image_url):
r = self.session.get(image_url, stream=True)
with open('Output.txt','w') as f:
f.write(image_url + '\n')
The issue is most simply that opening a file with mode 'w' truncates any existing file. You should change 'w' to 'a' instead. This will open an existing file for writing, but append instead of truncating.
More fundamentally, the problem may be that you are opening the file over and over in a loop. This is very inefficient. The only time the approach you use could be really useful is if your program is approaching the OS-imposed limit on number of open files. If this is not the case, I would recommended putting the loop inside the with block, keeping the mode as 'w' since you open the file just once now, and passing the open file to your dlink function.
Edit
Huge mistake of my part, as it is a method, and you will call it several times, if you open it in write mode ('w') or similar, it will Overwrites the existing file if the file exists.
So, if you use the 'a' way, you can see that:
Opens a file for appending. The file pointer is at the end of the file
if the file exists. That is, the file is in the append mode. If the
file does not exist, it creates a new file for writing.
The other problem radics in image_url is a list, so you need to write it line by line:
def dlink(self, image_url):
r = self.session.get(image_url, stream=True)
with open('Output.txt','a') as f:
for url in list(set(image_url)):
f.write(image_url + '\n')
another way to do it:
your_file = open('Output.txt', 'a')
r = self.session.get(image_url, stream=True)
for url in list(set(image_url)):
your_file.write("%s\n" % url)
your_file.close() #dont forget close it :)
the file open mode is wrong,'w' mode make this file was overwritten every time you open it,not appended to it. replace it to 'a' mode.
you can see this https://stackoverflow.com/a/23566951/8178794 for more detail
Open a file with option w overwrite the file if existring, use the mode a to append data to an existing file.
Try :
import requests
from os.path import splitext
# use mode='a' to append result without erasing filename
def dlink(url, filename, mode='w'):
r = requests.get(url)
if r.status_code != 200:
return
# here the link is valid
with open(filename, mode) as desc:
desc.write(url)
def dimg(img_url, img_name):
r = requests.get(img_url, stream=True)
if r.status_code != 200:
return
_, ext = splitext(img_url)
with open(img_name + ext, 'wb') as desc:
for chunk in r:
desc.write(chunk)
dlink('https://image.flaticon.com/teams/slug/freepik.jpg', 'links.txt')
dlink('https://image.flaticon.com/teams/slug/freepik.jpg', 'links.txt', 'a')
dimg('https://image.flaticon.com/teams/slug/freepik.jpg', 'freepik')
I am new to GitPython and I am trying to get the content of a file within a commit. I am able to get each file from a specific commit, but I am getting an error each time I run the command. Now, I know that the file exist in GitPython, but each time I run my program, I am getting the following error:
returned non-zero exit status 1
I am using Python 2.7.6 and Ubuntu Linux 14.04.
I know that the file exist, since I also go directly into Git from the command line, check out the respective commit, search for the file, and find it. I also run the cat command on it, and the file contents are displayed. Many times when the error shows up, it says that the file in question does not exist. I am trying to go through each commit with GitPython, get every blob or file from each individual commit, and run an external Java program on the content of that file. The Java program is designed to return a string to Python. To capture the string returned from my Java code, I am also using subprocess.check_output. Any help will be greatly appreciated.
I tried passing in the command as a list:
cmd = ['java', '-classpath', '/home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*:', 'java_gram.mainJava','absolute/path/to/file']
subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=False)
And I have also tried passing the command as a string:
subprocess.check_output('java -classpath /home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*: java_gram.mainJava {file}'.format(file=entry.abspath.strip()), shell=True)
Is it possible to access the contents of a file from GitPython?
For example, say there is a commit and it has one file foo.java
In that file is the following lines of code:
foo.java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
public class foo{
public static void main(String[] args) throws Exception{}
}
I want to access everything in the file and run an external program on it.
Any help would be greatly appreciated. Below is a piece of the code I am using to do so
#! usr/bin/env python
__author__ = 'rahkeemg'
from git import *
import git, json, subprocess, re
git_dir = '/home/rahkeemg/Documents/GitRepositories/WhereHows'
# make an instance of the repository from specified path
repo = Repo(path=git_dir)
heads = repo.heads # obtain the different repositories
master = heads.master # get the master repository
print master
# get all of the commits on the master branch
commits = list(repo.iter_commits(master))
cmd = ['java', '-classpath', '/home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*:', 'java_gram.mainJava']
# start at the very 1st commit, or start at commit 0
for i in range(len(commits) - 1, 0, -1):
commit = commits[i]
commit_num = len(commits) - 1 - i
print commit_num, ": ", commit.hexsha, '\n', commit.message, '\n'
for entry in commit.tree.traverse():
if re.search(r'\.java', entry.path):
current_file = str(entry.abspath.strip())
# add the current file or blob to the list for the command to run
cmd.append(current_file)
print entry.abspath
try:
# This is the scenario where I pass arguments into command as a string
print subprocess.check_output('java -classpath /home/rahkeemg/workspace/CSCI499_Java/bin/:/usr/local/lib/*: java_gram.mainJava {file}'.format(file=entry.abspath.strip()), shell=True)
# scenario where I pass arguments into command as a list
j_response = subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=False)
except subprocess.CalledProcessError as e:
print "Error on file: ", current_file
# Use pop on list to remove the last string, which is the selected file at the moment, to make place for the next file.
cmd.pop()
First of all, when you traverse the commit history like this, the file will not be checked out. All you get is the filename, maybe leading to the file or maybe not, but certainly it will not lead to the file from different revision than currently checked-out.
However, there is a solution to this. Remember that in principle, anything you could do with some git command, you can do with GitPython.
To get file contents from specific revision, you can do the following, which I've taken from that page:
git show <treeish>:<file>
therefore, in GitPython:
file_contents = repo.git.show('{}:{}'.format(commit.hexsha, entry.path))
However, that still wouldn't make the file appear on disk. If you need some real path for the file, you can use tempfile:
f = tempfile.NamedTemporaryFile(delete=False)
f.write(file_contents)
f.close()
# at this point file with name f.name contains contents of
# the file from path entry.path at revision commit.hexsha
# your program launch goes here, use f.name as filename to be read
os.unlink(f.name) # delete the temp file
I have an output file from a code which its name will ends to "_x.txt" and I want to connect two codes which second code will use this file as an input and will add more data into it. Finally, it will ends into "blabla_x_f.txt"
I am trying to work it out as below, but seems it is not correct and I could not solve it. Please help:
inf = str(raw_input(*+"_x.txt"))
with open(inf+'_x.txt') as fin, open(inf+'_x_f.txt','w') as fout:
....(other operations)
The main problem is that the "blabla" part of the file could change to any thing every time and will be random strings, so the code needs to be flexible and just search for whatever ends with "_x.txt".
Have a look at Python's glob module:
import glob
files = glob.glob('*_x.txt')
gives you a list of all files ending in _x.txt. Continue with
for path in files:
newpath = path[:-4] + '_f.txt'
with open(path) as in:
with open(newpath, 'w') as out:
# do something
Normally I try to avoid the use of macros, so I actually don't know how to use them beyond the very most basic ones, but I'm trying to do some meta-manipulation so I assume macros are needed.
I have an enum listing various log entries and their respective id, e.g.
enum LogID
{
LOG_ID_ITEM1=0,
LOG_ID_ITEM2,
LOG_ID_ITEM3=10,
...
}
which is used within my program when writing data to the log file. Note that they will not, in general, be in any order.
I do most of my log file post-processing in Matlab so I'd like to write the same variable names and values to a file for Matlab to load in. e.g., a file looking like
LOG_ID_ITEM1=0;
LOG_ID_ITEM2=1;
LOG_ID_ITEM3=10;
...
I have no idea how to go about doing this, but it seems like it shouldn't be too complicated. If it helps, I am using c++11.
edit:
For clarification, I'm not looking for the macro itself to write the file. I want a way to store the enum element names and values as strings and ints somehow so I can then use a regular c++ function to write everything to file. I'm thinking the macro might then be used to build up the strings and values into vectors? Does that work? If so, how?
I agree with Adam Burry that a separate script is likely best for this. Not sure which languages you're familiar with, but here's a quick Python script that'll do the job:
#!/usr/bin/python
'''Makes a .m file from an enum in a C++ source file.'''
from __future__ import print_function
import sys
import re
def parse_cmd_line():
'''Gets a filename from the first command line argument.'''
if len(sys.argv) != 2:
sys.stderr.write('Usage: enummaker [cppfilename]\n')
sys.exit(1)
return sys.argv[1]
def make_m_file(cpp_file, m_file):
'''Makes an .m file from enumerations in a .cpp file.'''
in_enum = False
enum_val = 0
lines = cpp_file.readlines()
for line in lines:
if in_enum:
# Currently processing an enumeration
if '}' in line:
# Encountered a closing brace, so stop
# processing and reset value counter
in_enum = False
enum_val = 0
else:
# No closing brace, so process line
if '=' in line:
# If a value is supplied, use it
ev_string = re.match(r'[^=]*=(\d+)', line)
enum_val = int(ev_string.group(1))
# Write output line to file
e_out = re.match(r'[^=\n,]+', line)
m_file.write(e_out.group(0).strip() + '=' +
str(enum_val) + ';\n')
enum_val += 1
else:
# Not currently processing an enum,
# so check for an enum definition
enumstart = re.match(r'enum \w+ {', line)
if enumstart:
in_enum = True
def main():
'''Main function.'''
# Get file names
cpp_name = parse_cmd_line()
m_name = cpp_name.replace('cpp', 'm')
print('Converting ' + cpp_name + ' to ' + m_name + '...')
# Open the files
try:
cpp_file = open(cpp_name, 'r')
except IOError:
print("Couldn't open " + cpp_name + ' for reading.')
sys.exit(1)
try:
m_file = open(m_name, 'w')
except IOError:
print("Couldn't open " + m_name + ' for writing.')
sys.exit(1)
# Translate the cpp file
make_m_file(cpp_file, m_file)
# Finish
print("Done.")
cpp_file.close()
m_file.close()
if __name__ == '__main__':
main()
Running ./enummaker.py testenum.cpp on the following file of that name:
/* Random code here */
enum LogID {
LOG_ID_ITEM1=0,
LOG_ID_ITEM2,
LOG_ID_ITEM3=10,
LOG_ID_ITEM4
};
/* More random code here */
enum Stuff {
STUFF_ONE,
STUFF_TWO,
STUFF_THREE=99,
STUFF_FOUR,
STUFF_FIVE
};
/* Yet more random code here */
produces a file testenum.m containing the following:
LOG_ID_ITEM1=0;
LOG_ID_ITEM2=1;
LOG_ID_ITEM3=10;
LOG_ID_ITEM4=11;
STUFF_ONE=0;
STUFF_TWO=1;
STUFF_THREE=99;
STUFF_FOUR=100;
STUFF_FIVE=101;
This script assumes that the closing brace of an enum block is always on a separate line, that the first identifier is defined on the line following the opening brace, that there are no blank lines between the braces, that enum appears at the start of a line, and that there is no space following the = and the number. Easy enough to modify the script to overcome these limitations. You could have your makefile run this automatically.
Have you considered "going the other way"? It usually makes more sense to maintain your data definitions in a (text) file, then as part of your build process you can generate a C++ header and include it. Python and mako is a good tool for doing this.