My code for this works perfectly: I can print to the screen exactly how I want. However, I want it to write to a file so that I can view the file instead of the printed output. I've tried the following, but I'm running into a few issues. Here is the working code that prints to the screen:
from xml.dom import minidom
import sys
import os, fnmatch

def find_files(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename

for filename in find_files('c:/Python27', '*file.xml'):
    print ('Found file.xml:', filename)
    xmldoc = minidom.parse(filename)
    itemlist = xmldoc.getElementsByTagName('Game')
    for item in itemlist:
        year = item.getElementsByTagName('Year')
        for s in year:
            print item.attributes['name'].value, s.attributes['value'].value
Error message:
TypeError: function takes exactly 1 argument (2 given)
Code with the write function instead:
from xml.dom import minidom
import sys
import os, fnmatch

def find_files(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename

f = open('test.txt','w')
for filename in find_files('c:/Python27', '*file.xml'):
    f.write('Found file.xml:', filename)
    xmldoc = minidom.parse(filename)
    itemlist = xmldoc.getElementsByTagName('Game')
    for item in itemlist:
        year = item.getElementsByTagName('Year')
        for s in year:
            f.write(item.attributes['name'].value), f.write(s.attributes['value'].value)
If you want to combine your two arguments into a single string (which f.write will accept) you can do something like:
f.write("Found file.xml:" + filename + "\n")
The + concatenates the pieces and gives you a single string with a newline at the end, so the elements you were looking for stack neatly in the final file.
As it is, the error message is telling you exactly what the problem is: f.write really does take only one argument, and the comma in the function call passes a second one.
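For completeness, here is one way the writing loop might look once each line is joined into a single string (a minimal sketch reusing the names from the code above, with a with block so the file is closed automatically):

# Sketch: one string per f.write call, with explicit newlines
with open('test.txt', 'w') as f:
    for filename in find_files('c:/Python27', '*file.xml'):
        f.write('Found file.xml: ' + filename + '\n')
        xmldoc = minidom.parse(filename)
        for item in xmldoc.getElementsByTagName('Game'):
            for s in item.getElementsByTagName('Year'):
                f.write(item.attributes['name'].value + ' ' +
                        s.attributes['value'].value + '\n')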
I have a csv file with image urls and file names in two columns. Some file names in the file are repeated, but their respective links are unique. I want to save all the images, so if an image named filename.jpg already exists, I want the next ones to be saved as filename_2, filename_3, and so on.
I use a simple urllib.urlretrieve call to get the images.
The imports:
import csv
import os
import re
import urllib
First, store your csv data.
file_names = []
urls = []
with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    for file_name, url in reader:
        file_names.append(file_name)
        urls.append(url)
file.close()
Make a new list to store your new file names in.
new_file_names = []
Iterate through the file_names list.
for file_name in file_names:
Grab the file extension. There are many image extensions: .jpg, .png, etc.
This assumes the file extension is only 4 characters long, including the dot. Anytime you see [-4:] in this answer, be careful of that. If it is an issue, use regex (or os.path.splitext, shown below) to get the file extension instead.
    file_ext = file_name[-4:]
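If the fixed four-character slice is a concern, a more robust alternative (using the standard library rather than regex; os is already imported above) is os.path.splitext, which handles extensions of any length:

    # Alternative sketch: splitext works for .jpg, .jpeg, .png, etc.
    root_part, file_ext = os.path.splitext(file_name)   # e.g. ('photo', '.jpeg')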
Next iterate through the new_file_names list to see if we grab any matches with file_name from the file_names list.
    for temp_file_name in new_file_names:
        if temp_file_name == file_name:
When we get a match, first check if it already ends with '_\d+' + file_ext, i.e. an underscore, followed by any digits, followed by the file extension.
            check = re.search('_\d+' + file_ext, temp_file_name)
If the check is True, we now want to see what that number is and add one.
            if check:
                number = int(check.group(0)[1:-4]) + 1
Now we do roughly the inverse of the previous regex, so we only keep the file name plus the trailing _ without the numbers, then append the new number and the file_ext.
                inverse = re.search('.*_(?=\d+' + file_ext + ')', file_name)
                file_name = inverse.group(0) + str(number) + file_ext
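To illustrate how those two patterns work together, here is a quick walk-through on a hypothetical duplicate called 'photo_1.jpg' (not from the question's data):

# check finds the trailing '_<digits>' plus extension
check = re.search('_\d+' + '.jpg', 'photo_1.jpg')                # matches '_1.jpg'
number = int(check.group(0)[1:-4]) + 1                           # '1' -> 2
# inverse keeps everything up to and including the underscore
inverse = re.search('.*_(?=\d+' + '.jpg' + ')', 'photo_1.jpg')   # matches 'photo_'
print inverse.group(0) + str(number) + '.jpg'                    # photo_2.jpg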
This else branch handles the very first duplicate occurrence, adding _1 to the end of the file_name.
            else:
                file_name = file_name[:-4] + '_1' + file_ext
Append the file_name to the new_file_names list.
    new_file_names.append(file_name)
Set a folder (if you want) to store your images. If the folder doesn't exist, it will create one for you.
path = 'img/'
try:
    os.makedirs(path)
except OSError:
    if not os.path.isdir(path):
        raise
Finally, to save the images, we use a for loop and zip up new_file_names and urls. Inside the loop we use urllib.urlretrieve to download the images.
for file_name, url in zip(new_file_names, urls):
    urllib.urlretrieve(url, path + file_name)
I am trying to run the script below. The intention of the script is to open different fasta files one after the other and extract the geneID. The script works well if I don't use the glob.glob function; with it, I get this message: TypeError: coercing to Unicode: need string or buffer, list found
files='/home/pathtofiles/files'
#print files
#sys.exit()

for file in files:
    fastas=sorted(glob.glob(files + '/*.fasta'))
    #print fastas[0]
    output_handle=(open(fastas, 'r+'))
    genes_files=list(SeqIO.parse(output_handle, 'fasta'))
    geneID=genes_files[0].id
    print geneID
I am running out of ideas on how to direct the script to open one file after another to give me the required information.
I see what you are trying to do, but let me first explain why your current approach is not working.
You have a path to a directory with fasta files and you want to loop over the files in that directory. But observe what happens if we do:
>>> files='/home/pathtofiles/files'
>>> for file in files:
...     print file
/
h
o
m
e
/
p
a
t
h
t
o
f
i
l
e
s
/
f
i
l
e
s
Not the list of filenames you expected! files is a string and when you apply a for loop on a string you simply iterate over the characters in that string.
Also, as doctorlove correctly observed, in your code fastas is a list and open expects a path to a file as first argument. That's why you get the TypeError: ... need string, ... list found.
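To make that concrete, a minimal sketch (reusing the directory from the question) of looping over the list that glob.glob returns and opening each element individually:

fastas = sorted(glob.glob('/home/pathtofiles/files/*.fasta'))  # a list of path strings
for fasta in fastas:             # iterate over the list itself
    handle = open(fasta, 'r')    # each element is a single path string that open() accepts
    handle.close()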
As an aside (this is more of a problem on Windows than on Linux or Mac): it is good practice to always use raw string literals (prefix the string with an r) when working with pathnames, to prevent unwanted expansion of backslash escape sequences like \n and \t into newline and tab.
>>> path = 'C:\Users\norah\temp'
>>> print path
C:\Users
orah emp
>>> path = r'C:\Users\norah\temp'
>>> print path
C:\Users\norah\temp
Another good practice is to use os.path.join() when combining pathnames and filenames. This prevents subtle bugs where your script works on your machine but gives an error on the machine of a colleague who has a different operating system.
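For instance (hypothetical names, REPL-style as above):
>>> import os
>>> os.path.join('data', 'sample.fasta')
'data/sample.fasta'        # on Windows this becomes 'data\\sample.fasta'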
I would also recommend using the with statement when opening files. This ensures that the filehandle gets properly closed when you're done with it.
As a final remark, file is a built-in function in Python and it is bad practice to use a variable with the same name as a built-in function because that can cause bugs or confusion later on.
Combining all of the above, I would rewrite your code like this:
import os
import glob
from Bio import SeqIO
path = r'/home/pathtofiles/files'
pattern = os.path.join(path, '*.fasta')
for fasta_path in sorted(glob.glob(pattern)):
    print fasta_path
    with open(fasta_path, 'r+') as output_handle:
        genes_records = SeqIO.parse(output_handle, 'fasta')
        for gene_record in genes_records:
            print gene_record.id
This is the way I solved the problem, and this script works.
import os,sys
import glob
from Bio import SeqIO

def extracting_information_gene_id():
    #to extract geneID information and add the reference gene to each different file
    files=sorted(glob.glob('/home/path_to_files/files/*.fasta'))
    #print file
    #sys.exit()
    for file in files:
        #print file
        output_handle=open(file, 'r+')
        ref_genes=list(SeqIO.parse(output_handle, 'fasta'))
        geneID=ref_genes[0].id
        #print geneID
        #sys.exit()
        #to extract the geneID as a reference record from the genes_files
        query_genes=(SeqIO.index('/home/path_to_file/file.fa', 'fasta'))
        #print query_genes[geneID].format('fasta') #check point
        #sys.exit()
        ref_gene=query_genes[geneID].format('fasta')
        #print ref_gene #check point
        #sys.exit()
        output_handle.write(str(ref_gene))
        output_handle.close()
        query_genes.close()

extracting_information_gene_id()
print 'Reference gene sequence have been added'
I have multiple folders, each containing csvs. I am trying to concatenate the csvs in each subdirectory and then export the result, so at the end I would have the same number of outputs as folders: Folder1.csv, Folder2.csv, ... Folder99.csv, etc. This is what I have so far:
import os
from glob import glob
import pandas as pd
import numpy as np

rootDir = 'D:/Data'
OutDirectory = 'D:/OutPut'
os.chdir(rootDir)

# The directory has folders as follows
# D:/Data/Folder1
# D:/Data/Folder2
# D:/Data/Folder3
# ....
# .....
# D:/Data/Folder99
# Each folders (Folder1, Folder2,..etc.) has many csvs.

frame = pd.DataFrame()
list_ = []
for (dirname, dirs, files) in os.walk(rootDir):
    for filename in files:
        if filename.endswith('.csv'):
            df = pd.read_csv(filename,index_col=None, na_values=['-999'], delim_whitespace= True, header = 0, skiprows = 2)
            OutFile = '%s.csv' % OutputFname
            list_.append(df)
frame = pd.concat(list_)
df.to_csv(OutDirectory+OutFile, sep = ',', header= True)
I am getting the following error:
IOError: File file200150101.csv does not exist
You need to concatenate dirname and filename for a full path to your files. Change this line like so:
df = pd.read_csv(os.path.join(dirname, filename) ,index_col=None, na_values=['-999'], delim_whitespace= True, header = 0, skiprows = 2)
Edit:
I don't know how pandas works because I have never used it, but I think your problem is that you defined everything you wanted done to the CSVs in the inner loop that loops over the files only (at least the indentation looks that way, though that could also be a formatting problem that occurred when you pasted your code here on SO).
I rewrote your code and fixed some things that I think might be the problem:
First, I renamed your variables that start with capital letters because, for me, it always looks weird to have variables with capitalized names.
I moved your list variable into the outer loop because it should be reset every time you enter a new directory, as you want all CSVs to be merged per folder.
And finally, I fixed the indentation. In Python, indentation tells the interpreter which commands are in the inner or outer loop.
My code now looks like this. You might have to change some things because I can't test it right now:
import os
from glob import glob
import pandas as pd
import numpy as np

rootDir = 'D:/Data'
outDir = 'D:/OutPut'
os.chdir(rootDir)
dirs = os.listdir(rootDir)

frame = pd.DataFrame()
for dirname in dirs:
    # the outer loop loops over directories! the actual directory is stored in dirname
    list = []  # collect csv data for every directory, not in general
    files = glob('%s/*.csv' % (dirname))
    for filename in files:
        # the inner loop loops over the files in the 'dirname' folder
        df = pd.read_csv(filename,index_col=None, na_values=['-999'], delim_whitespace= True, header = 0, skiprows = 2)
        # all csv data should be in 'list' now
        outFile = '%s.csv' % dirname  # define the name for the output csv
        list.append(df)  # do that for every file
    # at this point, all files in the actual directory were processed
    frame = pd.concat(list)  # and then merge the CSVs
    # ...actually not sure how pd.concat works, but I guess it does merge the data
    frame.to_csv(os.path.join(outDir, outFile), sep = ',', header= True)  # save the data
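As a quick hedged illustration of pd.concat (not from the answer above; the frames here are made up): given a list of DataFrames with the same columns, it stacks them row-wise into one DataFrame:

import pandas as pd

df_a = pd.DataFrame({'x': [1, 2]})
df_b = pd.DataFrame({'x': [3, 4]})
merged = pd.concat([df_a, df_b])   # rows of df_a followed by rows of df_b
print merged.shape                 # (4, 1)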
This is just part of a long Python script. There is a file called aqfile and it has many parameters. I would like to extract what is next to "OWNER" and "NS".
Note:
OWNER = text
NS = numbers
I could extract what is next to OWNER, because it is just text:
for line in aqfile.readlines():
    if string.find(line,"OWNER")>0:
        print line
        m=re.search('<(.*)>',line)
        owner=incorp(m.group(1))
        break
But when I try to modify the script to extract the numbers:
for line in aqfile.readlines():
    if string.find(line,"NS")>0:
        print line
        m=re.search('<(.*)>',line)
        ns=incorp(m.group(1))
        break
it doesn't work anymore.
Can anyone help me?
This is the whole script:
#Make a CSV file of dataset names, pulseprog and, if available, (part of) the title
#Note: the whole file tree is read into memory!!! Do not start too high in the tree!!!
import os
import os.path
import fnmatch
import re
import string

max=20000
outfiledesc=0

def incorp(c):
    #Replace " with """ and CR/LF with blanks
    c=c.replace('"','"""')
    c=c.replace("\r"," ")
    c=c.replace("\n"," ")
    return "\"%s\"" % (c)

def process(arg,root,files):
    global max
    global outfiledesc
    #Get name, expno, procno from the root
    if "proc" in files:
        procno = incorp(os.path.basename(root))
        oneup = os.path.dirname(root)
        oneup = os.path.dirname(oneup)
        aqdir=oneup
        expno = incorp(os.path.basename(oneup))
        oneup = os.path.dirname(oneup)
        dsname = incorp(os.path.basename(oneup))
        #Read the title file, if any
        if (os.path.isfile(root + "/title")):
            f=open(root+"/title","r")
            title=incorp(f.read(max))
            f.close()
        else:
            title=""
        #Grab the pulse program name from the acqus parameter
        aqfile=open(aqdir+"/acqus")
        for line in aqfile.readlines():
            if string.find(line,"PULPROG")>0:
                print line
                m=re.search('<(.*)>',line)
                pulprog=incorp(m.group(1))
                break
        towrite= "%s;%s;%s;%s;%s\n" % (dsname,expno,procno,pulprog,title)
        outfiledesc.write(towrite)

#Main program
dialogline1="Starting point of the search"
dialogline2="Maximum length of the title"
dialogline3="output CSV file"
def1="/opt/topspin3.2/data/nmrafd/nmr"
def2="20000"
def3="/home/nmrafd/filelist.csv"

result = INPUT_DIALOG("CSV file creator","Create a CSV list",[dialogline1,dialogline2,dialogline3],[def1,def2,def3])
start=result[0]
tlength=int(result[1])
outfile=result[2]

#Search for procs files. They should be in any dataset.
outfiledesc = open(outfile,"w")
print start
os.path.walk(start,process,"")
outfiledesc.close()
I am using Python 2.7 and imported Tkinter and TK.
What I am trying to do is take a source path (a directory path) and concatenate it with a file picked through the Windows Explorer dialog. This will enable the user to not have to type in a file name.
I realized I wasn't using a return and would get the following error:
TypeError: cannot concatenate 'str' and 'NoneType' objects
After searching here for this error I found I needed to add a return. I tried to put the string in the parentheses but it doesn't work. I am definitely missing something.
Here is a sample of my code:
from Tkinter import *
from Tkinter import Tk
from tkFileDialog import askopenfilename
source = '\\\\Isfs\\data$\\GIS Carto\TTP_Draw_Count' ## this is a public directory path
filename = ''
filename = getFileName() ##this part is in a different def area.
with open (os.path.join(source + filename), 'r' ) as f: ## this is where it is failing.
def getFileName():
    Tk().withdraw()
    filename = askopenfilename()
    return getFileName()
I need to concatenate the source + filename to be used to process a csv file.
I didn't want to put all the code here since it is long and requires a csv file and custom dictionary to merge. All of that works. I hope I have put enough information in this question.
def getFileName():
    Tk().withdraw()
    filename = askopenfilename()
    return getFileName()
You aren't returning the filename that you get here. Change this to:
def getFileName():
    Tk().withdraw()
    filename = askopenfilename()
    return filename
Also note that askopenfilename gets the full path of the chosen file, so source+filename will evaluate to something like u'\\\\Isfs\\data$\\GIS Carto\\TTP_Draw_CountC:/Users/kevin/Desktop/myinput.txt'
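If the goal is still to end up with a usable path, one possible sketch (not part of the answer above; initialdir and the variable names are assumptions): use the full path returned by askopenfilename directly, or rejoin only its base name with source when the file is expected to live under that share:

import os

def getFilePath():
    Tk().withdraw()
    return askopenfilename(initialdir=source)   # full path picked by the user

chosen = getFilePath()
with open(chosen, 'r') as f:                    # the full path can be opened directly
    data = f.read()
# or, addressed relative to the share:
# share_path = os.path.join(source, os.path.basename(chosen))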