Concat a String in python - python-2.7

These are my files:
2015125_0r89_PEO.txt
2015125_0r89_PED.txt
2015125_0r89_PEN.txt
2015126_0r89_PEO.txt
2015126_0r89_PED.txt
2015126_0r89_PEN.txt
2015127_0r89_PEO.txt
2015127_0r89_PED.txt
2015127_0r89_PEN.txt
and I want to rename them to this:
US.CAR.PEO.D.2015.125.txt
US.CAR.PED.D.2015.125.txt
US.CAR.PEN.D.2015.125.txt
US.CAR.PEO.D.2015.126.txt
US.CAR.PED.D.2015.126.txt
US.CAR.PEN.D.2015.126.txt
US.CAR.PEO.D.2015.127.txt
US.CAR.PED.D.2015.127.txt
US.CAR.PEN.D.2015.127.txt
This is my code so far:
import os
paths = (os.path.join(root, filename)
         for root, _, filenames in os.walk('C:\\data\\MAX\\')  # location of the files
         for filename in filenames)
for path in paths:
    a = path.split("_")
    b = a[2].split(".")
    c = "US.CAR." + b[0] + ".D." + a[0]
    print c
When I run the script it raises no error, but it does not rename the .txt files, which is what it is supposed to do.
Any help?

Building the full path first and then manipulating it gives bad results, because the split then operates on the directory part as well. It is better to take the file name alone, build the new name from it, and then rename the file itself, like this:
for root, _, filenames in os.walk('C:\\data\\MAX\\'):
    for name in filenames:
        print "original:", name
        a = name.split("_")
        b = a[2].split(".")
        # a[0] is e.g. '2015125': the first four characters are the year, the rest is the day
        new = "US.CAR.{}.D.{}.{}.{}".format(b[0], a[0][:4], a[0][4:], b[1])  # don't forget the file extension
        print "new:", new
        os.rename(os.path.join(root, name), os.path.join(root, new))
String formatting is also easier to read than a chain of concatenations.
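The renaming rule can also be isolated in a small function, so that it can be checked before any file is touched. This is a minimal sketch, assuming every name has the form <year><day-of-year>_<anything>_<code>.<ext> as in the listing in the question; build_new_name is a hypothetical helper name, not part of the original answer.

```python
import os


def build_new_name(old_name, prefix="US.CAR"):
    """Map '2015125_0r89_PEO.txt' to 'US.CAR.PEO.D.2015.125.txt'.

    Hypothetical helper: assumes a 4-digit year followed by the day of year,
    then exactly two more underscore-separated fields.
    """
    stem, ext = os.path.splitext(old_name)      # '2015125_0r89_PEO', '.txt'
    date_part, _, code = stem.split("_")        # '2015125', '0r89', 'PEO'
    year, day = date_part[:4], date_part[4:]    # '2015', '125'
    return "{0}.{1}.D.{2}.{3}{4}".format(prefix, code, year, day, ext)


print(build_new_name("2015125_0r89_PEO.txt"))
```

Inside the os.walk loop, the result would then be passed to os.rename together with the directory, as in the answer above.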

Related

Traversing multiple folders for searching the same file in multiple folders in python

I want to search for the same file in multiple folders.
I have tried os.walk(path), but it does not traverse the nested folders:
for current_root, folders, file_names in os.walk(self.path, topdown=True):
    for i in folders:
        print i
        for filename in file_names:
            count += 1
            file_path = os.path.join(current_root + '\\' + filename)
            #print file_path
            self.location_dictionary[file_path] = filename
My code prints all the folders, but it does not enter the nested folders recursively.
Example: I have subdir, subdir1 and subdir2, and inside subdir there is another directory called abc.
Both subdir and abc contain a file with the same name, and I want to read that file.
os.walk does not work that way.
For each current_root it traverses, it provides the lists of directories and files directly under it, and it descends into the subdirectories by itself.
You're nesting the file loop inside the folder loop, so the files of a directory are processed once per subfolder, and not at all when the directory has no subfolders.
Here you don't need the folder list (so just discard it with _). current_root already contains that information for your files:
for current_root, _, file_names in os.walk(self.path, topdown=True):
    for filename in file_names:
        count += 1
        file_path = os.path.join(current_root, filename)
        #print file_path
        self.location_dictionary[file_path] = filename
Aside: a dictionary with the full path as key and the file name as value is probably not what you want: the same information could be stored in a set or list, and os.path.basename could be used to compute the file name. Maybe you meant the reverse (file name => full path), which works provided that there are no duplicate file names.
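The reverse mapping (file name => full paths) could be sketched as follows. This is only an illustration, assuming that duplicate file names should all be kept, so each name maps to a list of paths; index_files is a hypothetical function name.

```python
import os
from collections import defaultdict


def index_files(top):
    """Return {filename: [full paths]} for every file below `top`."""
    locations = defaultdict(list)
    # os.walk visits nested folders by itself; each iteration describes one directory.
    for current_root, _, file_names in os.walk(top):
        for filename in file_names:
            locations[filename].append(os.path.join(current_root, filename))
    return dict(locations)
```

With the layout from the question, a file present in both subdir and subdir/abc would appear under one key with two paths.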

The glob.glob function to extract data from files

I am trying to run the script below. Its purpose is to open several FASTA files one after the other and extract the geneID from each. The script works well if I don't use the glob.glob function; with it, I get this message: TypeError: coercing to Unicode: need string or buffer, list found
files='/home/pathtofiles/files'
#print files
#sys.exit()
for file in files:
    fastas=sorted(glob.glob(files + '/*.fasta'))
    #print fastas[0]
    output_handle=(open(fastas, 'r+'))
    genes_files=list(SeqIO.parse(output_handle, 'fasta'))
    geneID=genes_files[0].id
    print geneID
I have run out of ideas on how to make the script open one file after another and give me the required information.
I see what you are trying to do, but let me first explain why your current approach is not working.
You have a path to a directory with fasta files and you want to loop over the files in that directory. But observe what happens if we do:
>>> files='/home/pathtofiles/files'
>>> for file in files:
...     print file
...
/
h
o
m
e
/
p
a
t
h
t
o
f
i
l
e
s
/
f
i
l
e
s
Not the list of filenames you expected! files is a string and when you apply a for loop on a string you simply iterate over the characters in that string.
Also, as doctorlove correctly observed, in your code fastas is a list and open expects a path to a file as first argument. That's why you get the TypeError: ... need string, ... list found.
As an aside (and this is more of a problem on Windows than on Linux or Mac), it is good practice to always use raw string literals (prefix the string with an r) when working with pathnames, to prevent the unwanted expansion of backslash escape sequences like \n and \t to newline and tab.
>>> path = 'C:\Users\norah\temp'
>>> print path
C:\Users
orah emp
>>> path = r'C:\Users\norah\temp'
>>> print path
C:\Users\norah\temp
Another good practice is to use os.path.join() when combining pathnames and filenames. This prevents subtle bugs where your script works on your machine but gives an error on the machine of a colleague who has a different operating system.
I would also recommend using the with statement when opening files. This ensures that the file handle gets properly closed when you're done with it.
As a final remark, file is a built-in name in Python 2, and it is bad practice to use a variable with the same name as a built-in, because that can cause bugs or confusion later on.
Combining all of the above, I would rewrite your code like this:
import os
import glob
from Bio import SeqIO
path = r'/home/pathtofiles/files'
pattern = os.path.join(path, '*.fasta')
for fasta_path in sorted(glob.glob(pattern)):
    print fasta_path
    with open(fasta_path, 'r+') as output_handle:
        genes_records = SeqIO.parse(output_handle, 'fasta')
        for gene_record in genes_records:
            print gene_record.id
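If Biopython is not available, the same glob loop can be tried with the standard library alone. This is a rough sketch, assuming each file contains a FASTA header line of the form ">ID description"; first_record_id and first_ids are hypothetical helpers and are not a replacement for the SeqIO parser.

```python
import glob
import os


def first_record_id(fasta_path):
    """Return the ID from the first '>' header line, or None if there is none."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith('>'):
                fields = line[1:].split()
                return fields[0] if fields else None
    return None


def first_ids(directory):
    """Map each *.fasta path in `directory` to the ID of its first record."""
    pattern = os.path.join(directory, '*.fasta')
    # glob.glob returns a list of paths; each path is opened on its own.
    return dict((p, first_record_id(p)) for p in sorted(glob.glob(pattern)))
```

The point it illustrates is the same as above: open() receives one path at a time, never the whole list returned by glob.glob.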
This is the way I solved the problem, and this script works.
import os,sys
import glob
from Bio import SeqIO
def extracting_information_gene_id():
    #to extract geneID information and add the reference gene to each different file
    files=sorted(glob.glob('/home/path_to_files/files/*.fasta'))
    #print file
    #sys.exit()
    for file in files:
        #print file
        output_handle=open(file, 'r+')
        ref_genes=list(SeqIO.parse(output_handle, 'fasta'))
        geneID=ref_genes[0].id
        #print geneID
        #sys.exit()
        #to extract the geneID as a reference record from the genes_files
        query_genes=(SeqIO.index('/home/path_to_file/file.fa', 'fasta'))
        #print query_genes[geneID].format('fasta') #check point
        #sys.exit()
        ref_gene=query_genes[geneID].format('fasta')
        #print ref_gene #check point
        #sys.exit()
        output_handle.write(str(ref_gene))
        output_handle.close()
        query_genes.close()
extracting_information_gene_id()
print 'Reference gene sequence have been added'

Python shutil file move in os walk for loop

The code below searches within a directory for any PDFs and for each one it finds it moves into the corresponding folder which has '_folder' appended.
Could it be expressed in simpler terms? It's practically unreadable. Also if it can't find the folder, it destroys the PDF!
import os
import shutil
for root, dirs, files in os.walk(folder_path_variable):
    for file1 in files:
        if file1.endswith('.pdf') and not file1.startswith('.'):
            filenamepath = os.path.join(root, file1)
            name_of_file = file1.split('-')[0]
            folderDest = filenamepath.split('/')[:9]
            folderDest = '/'.join(folderDest)
            folderDest = folderDest + '/' + name_of_file + '_folder'
            shutil.move(filenamepath2, folderDest)
Really I want to traverse the same directory after constructing the variable name_of_file and if that variable is in a folder name, it performs the move. However I came across issues trying to nest another for loop...
I would try something like this:
for root, dirs, files in os.walk(folder_path_variable):
    for filename in files:
        if filename.endswith('.pdf') and not filename.startswith('.'):
            filepath = os.path.join(root, filename)
            filename_prefix = filename.split('-')[0]
            dest_dir = os.path.join(root, filename_prefix + '_folder')
            if not os.path.isdir(dest_dir):
                os.mkdir(dest_dir)
            os.rename(filepath, os.path.join(dest_dir, filename))
The answer by John Zwinck is correct, except it contains a bug where if the destination folder already exists, a folder within that folder is created and the pdf is moved to that location. I have fixed this by adding a 'break' statement within the inner for loop (for filename in files).
The code below now executes correctly. It looks for a folder named after the PDF's prefix (the part of the file name before the first '-') with '_folder' appended. If the folder exists, the PDF is moved into it; if it doesn't, it is created and the PDF is moved into it.
for root, dirs, files in os.walk(folder_path_variable):
    for filename in files:
        if filename.endswith('.pdf') and not filename.startswith('.'):
            filepath = os.path.join(root, filename)
            filename_prefix = filename.split('-')[0]
            dest_dir = os.path.join(root, filename_prefix + '_folder')
            if not os.path.isdir(dest_dir):
                os.mkdir(dest_dir)
            os.rename(filepath, os.path.join(dest_dir, filename))
            break
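Another way to keep os.walk from re-processing a PDF that has just been moved is to prune the destination folders from the walk. This is a sketch of that alternative, not the fix described above; it assumes that destination folders are exactly the ones whose names end in '_folder', and move_pdfs is a hypothetical function name.

```python
import os


def move_pdfs(top):
    """Move each 'prefix-*.pdf' under `top` into a sibling 'prefix_folder'."""
    for root, dirs, files in os.walk(top):
        # With the default topdown=True, editing dirs in place stops os.walk
        # from descending into the removed entries (here: destination folders).
        dirs[:] = [d for d in dirs if not d.endswith('_folder')]
        for filename in files:
            if filename.endswith('.pdf') and not filename.startswith('.'):
                dest_dir = os.path.join(root, filename.split('-')[0] + '_folder')
                if not os.path.isdir(dest_dir):
                    os.mkdir(dest_dir)
                os.rename(os.path.join(root, filename),
                          os.path.join(dest_dir, filename))
```

Because no break is involved, every PDF in a directory is handled, including several PDFs that share the same prefix.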

How to rename JPG files with running order using Python

I am quite new to Python programming and I am trying to rename 100 files with the ".jpg" extension, located in a specific folder, using Python.
I need the files to be renamed in running order, starting from number 1. This is the code I have started writing:
import os,glob,fnmatch
os.chdir(r"G:\desktop\Project\test")
for files in glob.glob("*.jpg"):
    print files
When I run it, I get:
>>>
er3.jpg
IMG-20160209-ssdeWA0000.jpg
IMG-20160209-WA0000.jpg
sd4.jpg
tyu2.jpg
uj7.jpg
we3.jpg
yh7.jpg
>>>
So the code is OK so far.
I need all the file names to become 1, 2, 3, 4, ... in running order. Is this possible with Python 2.7?
If you simply want to rename all files as 1.jpg, 2.jpg etc. you can do this:
import os
import glob
os.chdir(r"G:\desktop\Project\test")
for index, oldfile in enumerate(glob.glob("*.jpg"), start=1):
    newfile = '{}.jpg'.format(index)
    os.rename(oldfile, newfile)
enumerate() is used to get the index of each file in the list returned by glob(), so that it can be used to create the new filename. Note that it lets you specify the start index, so I've started from 1 rather than the Python default of zero.
If you want the list of files to sort properly, you'll want the filenames to be padded with zeros as well (001.jpg, etc.). In that case simply replace newfile = '{}.jpg'.format(index) with newfile = '{:03}.jpg'.format(index).
See the docs for more on str.format().
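The numbering can be previewed without renaming anything. This is a small sketch, assuming the files should be numbered in alphabetical order of their old names; numbered_names and its width parameter are hypothetical names introduced for illustration.

```python
def numbered_names(old_names, width=3, ext='.jpg'):
    """Pair each old name with a zero-padded sequential name, starting at 1."""
    pairs = []
    # Sorting first makes the numbering reproducible; glob's order is not guaranteed.
    for index, old in enumerate(sorted(old_names), start=1):
        # '{0:0{1}d}' pads the index with zeros up to `width` digits.
        pairs.append((old, '{0:0{1}d}{2}'.format(index, width, ext)))
    return pairs


print(numbered_names(['we3.jpg', 'er3.jpg', 'sd4.jpg']))
```

Each pair could then be passed to os.rename once the preview looks right.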
To rename all the JPG files in a particular folder, first get the list of all the files contained in the folder.
os.listdir gives you the list of all the files in the images path.
Use enumerate to get the index numbers from which the new image names are built.
import os
images_path = r"D:\shots_images"
image_list = os.listdir(images_path)
for i, image in enumerate(image_list):
    ext = os.path.splitext(image)[1]
    if ext == '.jpg':
        src = images_path + '/' + image
        dst = images_path + '/' + str(i) + '.jpg'
        os.rename(src, dst)
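# Rename every .jpeg file in the folder to .jpg
# (os.listdir() without an argument requires Python 3)
import os
from os import path
os.chdir("//Users//User1//Desktop//newd//pics")
for file in os.listdir():
    name, ext = path.splitext(file)
    if ext == '.jpeg':
        dst = '{}.jpg'.format(name)
        os.rename(file, dst)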

Python: Returning a filename for matching a specific condition

import sys, hashlib
import os
inputFile = 'C:\Users\User\Desktop\hashes.txt'
sourceDir = 'C:\Users\User\Desktop\Test Directory'
hashMatch = False
for root, dirs, files in os.walk(sourceDir):
    for filename in files:
        sourceDirHashes = hashlib.md5(filename)
        for digest in inputFile:
            if sourceDirHashes.hexdigest() == digest:
                hashMatch = True
                break
        if hashMatch:
            print str(filename)
        else:
            print 'hash not found'
Contents of inputFile =
2899ebdb5f7a90a216e97b3187851fc1
54c177418615a90a6424cb945f7a6aec
dd18bf3a8e0a2a3e53e2661c7fb53534
Contents of sourceDir files =
test
test 1
test 2
I almost have the code working, I'm just tripping up somewhere. The code I have posted always reaches the else branch and reports that the hash hasn't been found, even though the hashes do match, as I have verified. I have provided the contents of my sourceDir so that someone can try this: the file names are test, test 1 and test 2, and all the files have the same content.
I must add however, I am not looking for the script to print the actual file content, but rather the name of the file.
Could anyone suggest to where I am going wrong and why it is saying the condition is false?
You need to open inputFile with open(inputFile, 'rt') before you can read the hashes; as written, the loop iterates over the characters of the path string. Also, when you read the hashes, strip each line first to get rid of the newline character \n at its end.
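Put together, the two points might look like the sketch below. It assumes one hex digest per line in the hash file and, mirroring the question's code, hashes the file name rather than the file contents; load_hashes and matching_names are hypothetical helper names. The encode() call is there for Python 3, where hashlib.md5 needs bytes.

```python
import hashlib
import os


def load_hashes(hash_file):
    """Read one hex digest per line, dropping newlines and blank lines."""
    with open(hash_file, 'rt') as handle:
        return set(line.strip() for line in handle if line.strip())


def matching_names(source_dir, hash_file):
    """Return the file names whose MD5 (of the name itself) is listed."""
    wanted = load_hashes(hash_file)
    matches = []
    for root, _, files in os.walk(source_dir):
        for filename in files:
            # Mirrors the question: the name is hashed, not the file contents.
            digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
            if digest in wanted:
                matches.append(filename)
    return matches
```

Testing membership in a set also removes the need for the hashMatch flag, which in the question is never reset between files.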