The glob.glob function to extract data from files - python-2.7

I am trying to run the script below. The intention of the script is to open different fasta files one after the other, and extract the geneID. The script works well if I don't use the glob.glob function. I get this message TypeError: coercing to Unicode: need string or buffer, list found
files='/home/pathtofiles/files'
#print files
#sys.exit()
for file in files:
fastas=sorted(glob.glob(files + '/*.fasta'))
#print fastas[0]
output_handle=(open(fastas, 'r+'))
genes_files=list(SeqIO.parse(output_handle, 'fasta'))
geneID=genes_files[0].id
print geneID
I am running of ideas on how to direct the script to open when file after another to give me the require information.

I see what you are trying to do, but let me first explain why your current approach is not working.
You have a path to a directory with fasta files and you want to loop over the files in that directory. But observe what happens if we do:
>>> files='/home/pathtofiles/files'
>>> for file in files:
>>> print file
/
h
o
m
e
/
p
a
t
h
t
o
f
i
l
e
s
/
f
i
l
e
s
Not the list of filenames you expected! files is a string and when you apply a for loop on a string you simply iterate over the characters in that string.
Also, as doctorlove correctly observed, in your code fastas is a list and open expects a path to a file as first argument. That's why you get the TypeError: ... need string, ... list found.
As an aside (and this is more a problem on Windows then on Linux or Mac), but it is good practice to always use raw string literals (prefix the string with an r) when working with pathnames to prevent the unwanted expansion of backslash escaped sequences like \n and \t to newline and tab.
>>> path = 'C:\Users\norah\temp'
>>> print path
C:\Users
orah emp
>>> path = r'C:\Users\norah\temp'
>>> print path
C:\Users\norah\temp
Another good practice is to use os.path.join() when combining pathnames and filenames. This prevents subtle bugs where your script works on your machine bug gives an error on the machine of your colleague who has a different operating system.
I would also recommend using the with statement when opening files. This assures that the filehandle gets properly closed when you're done with it.
As a final remark, file is a built-in function in Python and it is bad practice to use a variable with the same name as a built-in function because that can cause bugs or confusion later on.
Combing all of the above, I would rewrite your code like this:
import os
import glob
from Bio import SeqIO
path = r'/home/pathtofiles/files'
pattern = os.path.join(path, '*.fasta')
for fasta_path in sorted(glob.glob(pattern)):
print fasta_path
with open(fasta_path, 'r+') as output_handle:
genes_records = SeqIO.parse(output_handle, 'fasta')
for gene_record in genes_records:
print gene_record.id

This is way I solved the problem, and this script works.
import os,sys
import glob
from Bio import SeqIO
def extracting_information_gene_id():
#to extract geneID information and add the reference gene to each different file
files=sorted(glob.glob('/home/path_to_files/files/*.fasta'))
#print file
#sys.exit()
for file in files:
#print file
output_handle=open(file, 'r+')
ref_genes=list(SeqIO.parse(output_handle, 'fasta'))
geneID=ref_genes[0].id
#print geneID
#sys.exit()
#to extract the geneID as a reference record from the genes_files
query_genes=(SeqIO.index('/home/path_to_file/file.fa', 'fasta'))
#print query_genes[geneID].format('fasta') #check point
#sys.exit()
ref_gene=query_genes[geneID].format('fasta')
#print ref_gene #check point
#sys.exit()
output_handle.write(str(ref_gene))
output_handle.close()
query_genes.close()
extracting_information_gene_id()
print 'Reference gene sequence have been added'

Related

Python: Returning a filename for matching a specific condition

import sys, hashlib
import os
inputFile = 'C:\Users\User\Desktop\hashes.txt'
sourceDir = 'C:\Users\User\Desktop\Test Directory'
hashMatch = False
for root, dirs, files in os.walk(sourceDir):
for filename in files:
sourceDirHashes = hashlib.md5(filename)
for digest in inputFile:
if sourceDirHashes.hexdigest() == digest:
hashMatch = True
break
if hashMatch:
print str(filename)
else:
print 'hash not found'
Contents of inputFile =
2899ebdb5f7a90a216e97b3187851fc1
54c177418615a90a6424cb945f7a6aec
dd18bf3a8e0a2a3e53e2661c7fb53534
Contents of sourceDir files =
test
test 1
test 2
I almost have the code working, I'm just tripping up somewhere. My current code that I have posted always returns the else statement, that the hash hasn't been found, even although they do as I have verified this. I have provided the content of my sourceDir so that someone case try this, the file names are test, test 1 and test 2, the same content is in the files.
I must add however, I am not looking for the script to print the actual file content, but rather the name of the file.
Could anyone suggest to where I am going wrong and why it is saying the condition is false?
You need to open the inputFile using open(inputFile, 'rt') then you can read the hashes. Also when you do read the hashes make sure you strip them first to get rid of new line characters \n at the end of the lines

Os.walk - WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect:

new to python and looking for some help on a problem I am having with os.walk. I have had a solid look around and cannot find the right solution to my problem.
What the code does:
Scans a users selected HD or folder and returns all the filenames, subdirs and size. This is then manipulated in pandas (not in code below) and exported to an excel spreadsheet in the formatting I desired.
However, in the first part of the code, in Python 2.7, I am currently experiencing the below error:
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: 'E:\03. Work\Bre\Files\folder2\icons greyscale flatten\._Icon_18?10 Stainless Steel.psd'
I have explored using raw string (r') but to no avail. Perhaps I am writing it wrong.
I will note that I never get this in 3.5 or on cleanly labelled selected folders. Due to Pandas and pysinstaller problems with 3.5, I am hoping to stick with 2.7 until the error with 3.5 is resolved.
import pandas as pd
import xlsxwriter
import os
from io import StringIO
#Lists for Pandas Dataframes
fpath = []
fname = []
fext = []
sizec = []
# START #Select file directory to scan
filed = raw_input("\nSelect a directory to scan: ")
#Scan the Hard-Drive and add to lists for Pandas DataFrames
print "\nGetting details..."
for root, dirs, files in os.walk(filed):
for filename in files:
f = os.path.abspath(root) #File path
fpath.append(f)
fname.append(filename) #File name
s = os.path.splitext(filename)[1] #File extension
s = str(s)
fext.append(s)
p = os.path.join(root, filename) #File size
si = os.stat(p).st_size
sizec.append(si)
print "\nDone!"
Any help would be greatly appreciated :)
In order to traverse filenames with unicode characters, you need to give os.walk a unicode path name.
Your path contains a unicode character, which is being displayed as ? in the exception.
If you pass in the unicode path, like this os.walk(unicode(filed)) you should not get that exception.
As noted in Convert python filenames to unicode sometimes you'll get a bytestring if the path is "undecodable" by Python 2.

How to get a file to be used as input of the program that ends with special character in python

I have an output file from a code which its name will ends to "_x.txt" and I want to connect two codes which second code will use this file as an input and will add more data into it. Finally, it will ends into "blabla_x_f.txt"
I am trying to work it out as below, but seems it is not correct and I could not solve it. Please help:
inf = str(raw_input(*+"_x.txt"))
with open(inf+'_x.txt') as fin, open(inf+'_x_f.txt','w') as fout:
....(other operations)
The main problem is that the "blabla" part of the file could change to any thing every time and will be random strings, so the code needs to be flexible and just search for whatever ends with "_x.txt".
Have a look at Python's glob module:
import glob
files = glob.glob('*_x.txt')
gives you a list of all files ending in _x.txt. Continue with
for path in files:
newpath = path[:-4] + '_f.txt'
with open(path) as in:
with open(newpath, 'w') as out:
# do something

Registered Trademark: Why does strip remove ® but replace can't find it? How do I remove symbols from folder and file names?

If the registered trademark symbol does not appear at the end of a file or folder name, strip cannot be used. Why doesn't replace work?
I have some old files and folders named with a registered trademark symbol that I want to remove.
The files don't have an extension.
folder: "\data\originals\Word Finder®"
file 1: "\data\originals\Word Finder® DA"
file 2: "\data\originals\Word Finder® Thesaurus"
For the folder, os.rename(p,p.strip('®')) works. However, replace os.rename(p,p.replace('®','')) does not work on either the folder or the files.
Replace works on strings fed to it, ie:
print 'Registered® Trademark®'.replace('®',''). Is there a reason the paths don't follow this same logic?
note:
I'm using os.walk() to get the folder and file names
I have been unable to recreate your issue, so I'm not sure why it isn't working for you. Here is a workaround though: instead of using the registered character in your source code with the string methods, try being more explicit with something like this:
import os
for root, folders, files in os.walk(os.getcwd()):
for fi in files:
oldpath = os.path.join(root, fi)
newpath = os.path.join(root, fi.decode("utf-8").replace(u'\u00AE', '').encode("utf-8"))
os.rename(oldpath, newpath)
Explicitly specifying the unicode codepoint you're looking for can help eliminate the number of places your code could be going wrong. The interpreter no longer has to worry about the encoding of your source code itself.
My original question 'Registered Trademark: Why does strip remove ® but replace can't find it?' is no longer applicable. The problem isn't strip or replace, but how os.rename() deals with unicode characters. So, I added to my question.
Going off of what Cameron said, os.rename() seems like it doesn't work with unicode characters. (please correct me if this is wrong - I don't know much about this). shutil.move() ultimately gives the same result that os.rename() should have.
Despite ScottLawson's suggestion to use u'\u00AE' instead of '®', I could not get it to work.
Basically, use shutil.move(old_name,new_name) instead.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import shutil
import os
# from this answer: https://stackoverflow.com/q/1033424/3889452
def remove(value):
deletechars = '®'
for c in deletechars:
value = value.replace(c,'')
return value
for root, folders, files in os.walk(r'C:\Users\myname\da\data\originals\Word_4_0'):
for f in files:
rename = remove(f)
shutil.move(os.path.join(root,f),os.path.join(root,rename))
for folder in folders:
rename = remove(folder)
shutil.move(os.path.join(root,folder),os.path.join(root,rename))
This also works for the immediate directory (based off of this) and catches more symbols, chars, etc. that aren't included in string.printable and ® doesn't have to appear in the python code.
import shutil
import os
import string
directory_path = r'C:\Users\myname\da\data\originals\Word_4_0'
for file_name in os.listdir(directory_path):
new_file_name = ''.join(c for c in file_name if c in string.printable)
shutil.move(os.path.join(directory_path,file_name),os.path.join(directory_path,new_file_name))

How to replace string in multiple files in the folder?

I m trying to read two files and replace content of one file with content of other file in files present in folder which also has sub directories.
But its tell sub process not defined.
i'm new to python and shell script can anybody help me with this please?
import os
import sys
import os.path
f = open ( "file1.txt",'r')
g = open ( "file2.txt",'r')
text1=f.readlines()
text2=g.readlines()
i = 0;
for line in text1:
l = line.replace("\r\n", "")
t = text2[i].replace("\r\n", "")
args = "find . -name *.tml"
Path = subprocess.Popen( args , shell=True )
os.system(" sed -r -i 's/" + l + "/" + t + "/g' " + Path)
i = i + 1;
To specifically address your actual error, you need to import the subprocess module as you are making use of it (oddly) in your code:
import subprocess
After that, you will find more problems. I will try and keep it as simple as possible with my suggestions. Code first, then I will break it down. Keep in mind, there are more robust ways to accomplish this task. But I am doing my best to keep in mind your experience level and making it make your current approach as closely as possible.
import subprocess
import sys
# 1
results = subprocess.Popen("find . -name '*.tml'",
shell=True, stdout=subprocess.PIPE)
if results.wait() != 0:
print "error trying to find tml files"
sys.exit(1)
# 2
tml_files = []
for tml in results.stdout:
tml_files.append(tml.strip())
if not tml_files:
print "no tml files found"
sys.exit(0)
tml_string = " ".join(tml_files)
# 3
with open ("file1.txt") as f, open("file2.txt") as g:
while True:
# 4
f_line = f.readline()
if not f_line:
break
g_line = g.readline()
if not g_line:
break
f_line = f_line.strip()
g_line = g_line.strip()
if not f_line or not g_line:
continue
# 5
cmd = "sed -i -e 's/%s/%s/g' %s" % \
(f_line.strip(), g_line.strip(), tml_string)
ret = subprocess.Popen(cmd, shell=True).wait()
if ret != 0:
print "error doing string replacement"
sys.exit(1)
You do not need to read in your entire files at once. If they are large this could be a lot of memory. You can consume a line at a time, and you can also make use of what is called "context managers" when you open the files. This will ensure they close properly no matter what happens:
We start with a subprocess command that is run only once to find all your .tml files. Your version had the same command being run multiple times. If the search path is the same, then we only need it once. This checks the exit code of the command and quits if it failed.
We loop over stdout on the subprocess command, and add the stripped lines to a list. This is a more robust way of your replace("\r\n"). It removes whitespace. A "list comprehension" would be better suited here (down the line). If we didn't find any tml files, then we have no work to do, so we exit. Otherwise, we join them together in a space-separated string to be suitable for our command later.
This is called "context managers". You can open the file in a way that no matter what they will be closed properly. The file is open for the length of the context within that code block. We are going to loop forever, and break when appropriate.
We pull a line, one at a time, from each file. If either line is blank, we reached the end of the file and cannot do any more work, so we break out. We then strip the newlines, and if either string is empty (blank line) we still can't do any work, but we just continue to the next available line.
A modified version of your sed command. We construct the command string on each loop for the source and replacement strings, and tack on the tml file string. Bear in mind this is a very naive approach to the replacement. It really expects your replacement strings to be safe characters and not break the s///g sed format. But we run that with another subprocess command. The wait() simply waits for the return code, and we check it for an error. This approach replaces your os.system() version.
Hope this helps. Eventually you can improve this to do more checking and safe operations.