Extract zipfiles and gzfiles from a zip folder - python-2.7

I can extract a zip folder that contains several compressed files, but I don't know how to extract the zip and gz files inside it without repeating the same procedure twice.
import zipfile, fnmatch, os
rootPath = zipDataDirectory
rootPath2 = workingDirectory
pattern = '*.zip'
pattern2 = '*.gz'
for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, pattern):
        print(os.path.join(root, filename))
        zipfile.ZipFile(os.path.join(root, filename)).extractall(os.path.join(root, os.path.splitext(filename)[0]))
I tried the following code, but it does not work:
extensionZip = "*.zip"
extensionGz = "*.gz"
for item in os.listdir(workingDirectory):
    if item.endswith(extensionZip):
        zipfile.ZipFile(item).extractall
    else:
        gzip.GzipFile.extract(item)
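The attempt above fails for a few reasons: str.endswith treats "*.zip" literally (the * is not a wildcard there), extractall is referenced but never called, and gzip.GzipFile has no extract method. A minimal sketch of handling both formats in one pass might look like the following. The function name extract_archives is hypothetical, and the sketch assumes that each .gz file holds a single compressed file whose output name is the original name minus the .gz suffix.

```python
import gzip
import os
import shutil
import zipfile


def extract_archives(directory):
    """Extract every .zip and .gz file found directly in `directory`."""
    extracted = []
    # The listing is taken once, so files created below are not revisited.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        base, ext = os.path.splitext(path)
        if ext.lower() == '.zip':
            # Unpack the archive into a folder named after the zip file.
            with zipfile.ZipFile(path) as archive:
                archive.extractall(base)
            extracted.append(base)
        elif ext.lower() == '.gz':
            # A .gz file is a single compressed stream: decompress it
            # into a file with the same name minus the .gz suffix.
            with gzip.open(path, 'rb') as source:
                with open(base, 'wb') as target:
                    shutil.copyfileobj(source, target)
            extracted.append(base)
    return extracted
```

Calling extract_archives(workingDirectory) after the outer zip has been extracted would then process the inner zip and gz files with a single loop.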

Related

Traversing multiple folders to search for the same file in multiple folders in Python

I want to search for the same file in multiple folders.
I have tried os.walk(path), but it does not traverse the nested folders:
for current_root, folders, file_names in os.walk(self.path, topdown=True):
    for i in folders:
        print i
        for filename in file_names:
            count+= 1
            file_path = os.path.join(current_root + '\\' + filename)
            #print file_path
            self.location_dictionary[file_path] = filename
With my code, all folders are printed, but the nested folders are not entered recursively.
For example, I have subdir, subdir1 and subdir2, and subdir contains another directory called abc.
Both subdir and abc contain a file with the same name, and I want to read that file.
os.walk does not work that way.
For each current_root it traverses, it provides the lists of directories and files directly under it.
You're nesting the loops, so the file loop only runs for directories that contain subfolders; the files of a directory without subfolders (such as abc) are never visited.
Here you don't need the folder list (so just ignore that value). current_root already contains that information for your files:
for current_root, _, file_names in os.walk(self.path, topdown=True):
    for filename in file_names:
        count+= 1
        file_path = os.path.join(current_root,filename)
        #print file_path
        self.location_dictionary[file_path] = filename
Aside: a dictionary with the full path as key and the filename as value is probably not what you want. The same information could be stored in a set or list, and os.path.basename could be used to compute the filename. Maybe you meant the reverse (filename => full path), which only works if there are no duplicate filenames.
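Since the question says the same file name occurs in several folders, one possible way around the duplicate-name problem is to map each filename to a list of full paths. This is only a sketch; the function name index_files and the use of collections.defaultdict are assumptions, not part of the original code.

```python
import os
from collections import defaultdict


def index_files(top):
    """Map each file name to every full path where it occurs under `top`."""
    locations = defaultdict(list)
    for current_root, _, file_names in os.walk(top):
        for filename in file_names:
            # Duplicate names in different folders end up in the same list.
            locations[filename].append(os.path.join(current_root, filename))
    return locations
```

Looking up index_files(path)[name] then gives every location of that file, including the copies in nested folders.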

Python shutil file move in os walk for loop

The code below searches a directory for PDFs and moves each one it finds into the corresponding folder, whose name has '_folder' appended.
Could it be expressed in simpler terms? It's practically unreadable. Also, if it can't find the folder, it destroys the PDF!
import os
import shutil
for root, dirs, files in os.walk(folder_path_variable):
    for file1 in files:
        if file1.endswith('.pdf') and not file1.startswith('.'):
            filenamepath = os.path.join(root, file1)
            name_of_file = file1.split('-')[0]
            folderDest = filenamepath.split('/')[:9]
            folderDest = '/'.join(folderDest)
            folderDest = folderDest + '/' + name_of_file + '_folder'
            shutil.move(filenamepath, folderDest)
Really I want to traverse the same directory after constructing the variable name_of_file and if that variable is in a folder name, it performs the move. However I came across issues trying to nest another for loop...
I would try something like this:
for root, dirs, files in os.walk(folder_path_variable):
    for filename in files:
        if filename.endswith('.pdf') and not filename.startswith('.'):
            filepath = os.path.join(root, filename)
            filename_prefix = filename.split('-')[0]
            dest_dir = os.path.join(root, filename_prefix + '_folder')
            if not os.path.isdir(dest_dir):
                os.mkdir(dest_dir)
            os.rename(filepath, os.path.join(dest_dir, filename))
The answer by John Zwinck is correct, except that it contains a bug: if the destination folder already exists, a folder is created within that folder and the PDF is moved to that location. I have fixed this by adding a 'break' statement within the inner for loop (for filename in files).
The code below now executes correctly. It looks for a folder named after the PDF's prefix (the part before the first '-') with '_folder' appended. If the folder exists, the PDF is moved into it. If it doesn't, one is created with that name and the PDF is moved into it.
for root, dirs, files in os.walk(folder_path_variable):
    for filename in files:
        if filename.endswith('.pdf') and not filename.startswith('.'):
            filepath = os.path.join(root, filename)
            filename_prefix = filename.split('-')[0]
            dest_dir = os.path.join(root, filename_prefix + '_folder')
            if not os.path.isdir(dest_dir):
                os.mkdir(dest_dir)
            os.rename(filepath, os.path.join(dest_dir, filename))
            break
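The nested folder described above appears because os.walk later descends into an already existing destination folder and finds the PDF that was just moved there. An alternative sketch, assuming destination folders can always be recognised by the '_folder' suffix, is to prune them from dirs in place so that os.walk never enters them. The function name move_pdfs and its suffix parameter are hypothetical.

```python
import os


def move_pdfs(top, suffix='_folder'):
    """Move each PDF into a sibling folder named <prefix><suffix>."""
    moved = []
    for root, dirs, files in os.walk(top):
        # Editing dirs in place (top-down walk) stops os.walk from
        # descending into destination folders and re-processing PDFs.
        dirs[:] = [d for d in dirs if not d.endswith(suffix)]
        for filename in files:
            if filename.endswith('.pdf') and not filename.startswith('.'):
                prefix = filename.split('-')[0]
                dest_dir = os.path.join(root, prefix + suffix)
                if not os.path.isdir(dest_dir):
                    os.mkdir(dest_dir)
                dest = os.path.join(dest_dir, filename)
                os.rename(os.path.join(root, filename), dest)
                moved.append(dest)
    return moved
```

With this approach every PDF in a directory is processed, because the inner loop is not cut short.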

Python finds a string in multiple files recursively and returns the file path

I'm learning Python and would like to search for a keyword in multiple files recursively.
I have an example function which should find the *.doc extension in a directory.
Then, the function should open each file with that file extension and read it.
If a keyword is found while reading the file, the function should identify the file path and print it.
Else, if the keyword is not found, python should continue.
To do that, I have defined a function which takes two arguments:
def find_word(extension, word):
    # define the path for os.walk
    for dname, dirs, files in os.walk('/rootFolder'):
        #search for file name in files:
        for fname in files:
            #define the path of each file
            fpath = os.path.join(dname, fname)
            #open each file and read it
            with open(fpath) as f:
                data=f.read()
            # if data contains the word
            if word in data:
                #print the file path of that file
                print (fpath)
            else:
                continue
Could you give me a hand to fix this code?
Thanks,
import os

def find_word(extension, word):
    for root, dirs, files in os.walk('/DOC'):
        # filter files for given extension:
        files = [fi for fi in files if fi.endswith(".{ext}".format(ext=extension))]
        for filename in files:
            path = os.path.join(root, filename)
            # open each file and read it
            with open(path) as f:
                # split() will create list of words and set will
                # create list of unique words
                words = set(f.read().split())
            if word in words:
                print(path)
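The same idea can be sketched as a function that takes the root folder as a parameter and returns the matching paths instead of printing them, which makes it easier to try out on a small folder. The name find_files_containing and its top parameter are hypothetical, and the sketch assumes the files are plain text that open() can read.

```python
import os


def find_files_containing(top, extension, word):
    """Return paths of files under `top` with `extension` that contain `word`."""
    matches = []
    suffix = '.' + extension.lstrip('.')
    for root, _, files in os.walk(top):
        for filename in files:
            if not filename.endswith(suffix):
                continue
            path = os.path.join(root, filename)
            with open(path) as f:
                # split() breaks the text on whitespace, so only whole
                # words match (not substrings of longer words).
                if word in f.read().split():
                    matches.append(path)
    return matches
```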
.doc files are binary Word documents rather than plain text, so they cannot be read meaningfully with a simple text editor or Python's open(). In this case, you can use other Python modules such as python-docx (which handles the newer .docx format).
Update
For doc files (previous to Word 2007) you can also use other tools such as catdoc or antiword. Try the following.
import subprocess

def doc_to_text(filename):
    # Run catdoc on the file and return the text it writes to stdout.
    return subprocess.Popen(
        ['catdoc', '-w', filename],
        stdout=subprocess.PIPE
    ).communicate()[0]

print(doc_to_text('fixtures/doc.doc'))
If you are trying to read a .doc file in your code, then this won't work. You will have to change the part where you read the file.
Here are some links on reading a .doc file in Python:
extracting text from MS word files in python
Reading/Writing MS Word files in Python

Find and rename all files in directory matching a certain pattern

I'm attempting to write a program that will loop through every subfolder and find and rename all the files whose names match a given pattern. The files are all .jpg files and have the following pattern:
[0-9][0-9][0-9]_UsersfirstnameUserslastname[0-9][0-9][0-9].jpg
So, for instance, one folder would have the following:
452_AlexBobenko002.jpg
452_AlexBobenko003.jpg
452_AlexBobenko007.jpg
Then it would go to another folder where the following files exists:
834_CatDonald001.jpg
...
834_CatDonald126.jpg
I would like to rename the files so that there is an underscore after the last letter and before the last set of 3 digits. So the pattern would go from:
[0-9][0-9][0-9]_UsersfirstnameUserslastname[0-9][0-9][0-9].jpg
to
[0-9][0-9][0-9]_UsersfirstnameUserslastname_[0-9][0-9][0-9].jpg
and from the above example I would have:
452_AlexBobenko002.jpg --> 452_AlexBobenko_002.jpg
452_AlexBobenko003.jpg --> 452_AlexBobenko_003.jpg
452_AlexBobenko007.jpg --> 452_AlexBobenko_007.jpg
and
834_CatDonald001.jpg --> 834_CatDonald_001.jpg
...
834_CatDonald126.jpg --> 834_CatDonald_126.jpg
So far I have been able to locate the desired files with the following:
import glob
import os

path = mydir
folders = [filename for filename in os.listdir(path) if filename.startswith('EMP-')]
subfolders = [[] for i in range(len(folders))]
# Will populate the empty sublist of subfolders with the contents of each distinct folder
for i in range(len(folders)):
    subfolders[i] = [subfolder for subfolder in os.listdir(path +'\\%s' %folders[i])]
for z_1 in range(len(folders)):
    for z_2 in range(len(subfolders[z_1])):
        if os.path.isdir(path + '\\%s\\%s' % (folders[z_1], subfolders[z_1][z_2])) == True:
            for file in glob.glob(path + '\\%s\\%s\\[0-9][0-9][0-9]_*.jpg' % (folders[z_1], subfolders[z_1][z_2])):
                #rename(file)
                pass
I really have no clue how to rename them.
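One possible approach to the renaming step is a regular expression that captures everything up to the last letter and the final three digits separately, then joins the two parts with an underscore. The sketch below is not a definitive solution: the names PATTERN, new_name and rename_tree are hypothetical, and it assumes the name part consists only of ASCII letters, as in the examples above.

```python
import os
import re

# Three digits, an underscore, letters, then the final three digits and .jpg.
PATTERN = re.compile(r'^(\d{3}_[A-Za-z]+)(\d{3})\.jpg$')


def new_name(filename):
    """Return the name with an underscore before the last 3 digits, or None."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return '{0}_{1}.jpg'.format(match.group(1), match.group(2))


def rename_tree(top):
    """Rename every matching file under `top`; return (old, new) path pairs."""
    renamed = []
    for root, _, files in os.walk(top):
        for filename in files:
            target = new_name(filename)
            if target is not None:
                old = os.path.join(root, filename)
                new = os.path.join(root, target)
                os.rename(old, new)
                renamed.append((old, new))
    return renamed
```

Files that already contain the underscore do not match the pattern, so running rename_tree a second time leaves them unchanged.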

Downloading data from website

I use the following code for downloading two files in a folder from a website.
I want to download only the files on the page whose names contain "MOD09GA.A2008077.h22v05.005.2008080122814.hdf" and "MOD09GA.A2008077.h23v05.005.2008080122921.hdf", but I don't know how to select them. The code below downloads all the files, while I only need two of them.
Does anyone have any ideas?
URL = 'http://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.005/2008.03.17/';
% Local path on your machine
localPath = 'E:/myfolder/';
% Read html contents and parse file names with ending *.hdf
urlContents = urlread(URL);
ret = regexp(urlContents, '"\S+.hdf.xml"', 'match');
% Loop over all files and download them
for k=1:length(ret)
    filename = ret{k}(2:end-1);
    filepathOnline = strcat(URL, filename);
    filepathLocal = fullfile(localPath, filename);
    urlwrite(filepathOnline, filepathLocal);
end
Try the regexp with tokens instead:
localPath = 'E:/myfolder/';
urlContents = 'aaaa "MOD09GA.A2008077.h22v05.005.2008080122814.hdf.xml" and "MOD09GA.A2008077.h23v05.005.2008080122921.hdf.xml" aaaaa';
ret = regexp(urlContents , '"(\S+)(?:\.\d+){2}(\.hdf\.xml)"', 'tokens');
%// Loop over each file name
for k=1:length(ret)
    filename = [ret{k}{:}];
    filepathLocal = fullfile(localPath, filename)
end