downloading multiple files with urllib.urlretrieve - python-2.7

I'm trying to download multiple files from a website.
The url resembles this: foo.com/foo-1.pdf.
Since I want those files to be stored in a directory of my choice,
I have written the following code:
import os
from urllib import urlretrieve
ext = ".pdf"
for i in range(1,37):
    print "fetching file " + str(i)
    url = "http://foo.com/Lec-" + str(i) + ext
    myPath = "/dir/"
    filename = "Lec-"+str(i)+ext
    fullfilename = os.path.join(myPath, filename)
    x = urlretrieve(url, fullfilename)
EDIT : Complete error message.
Traceback (most recent call last):
  File "scraper.py", line 10, in <module>
    x = urlretrieve(url, fullfilename)
  File "/usr/lib/python2.7/urllib.py", line 94, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/lib/python2.7/urllib.py", line 244, in retrieve
    tfp = open(filename, 'wb')
IOError: [Errno 2] No such file or directory: '/dir/Lec-1.pdf'
I'd be grateful if someone could point out where I have gone wrong.
Thanks in advance!

Your code works for me (Python 3.9), so make sure the directory you specified exists and that your script has write access to it. Also, it looks like you are trying to open a file that does not exist, so make sure the file has been downloaded before opening it:
fullfilename = os.path.abspath("d:/DownloadedFiles/Lec-1.pdf")
print(fullfilename)
if os.path.exists(fullfilename):  # open file only if it exists
    with open(fullfilename, 'rb') as file:
        content = file.read()  # read file's content
        print(content[:150])  # print only the first 150 bytes
The output would be as follows:
C:/Users/Administrator/PycharmProjects/Tests/dtest.py
d:\DownloadedFiles\Lec-1.pdf
b'%PDF-1.6\r%\xe2\xe3\xcf\xd3\r\n2346 0 obj <</Linearized 1/L 1916277/O 2349/E 70472/N 160/T 1869308/H [ 536 3620]>>\rendobj\r \r\nxref\r\n2346 12\r\n0000000016 00000 n\r'
Process finished with exit code 0
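A note on the traceback in the question: the failing call is open(filename, 'wb'), and in write mode Errno 2 usually means the parent directory (here /dir/) does not exist, since the file itself would be created. The following is a minimal sketch of that situation, not the asker's actual setup; the helper name ensure_parent_dir and the temporary "downloads" directory are assumptions made for illustration, and no network access is involved.

```python
import errno
import os
import tempfile


def ensure_parent_dir(path):
    """Create the directory that will contain `path` if it is missing.

    Hypothetical helper, not part of urllib.
    """
    parent = os.path.dirname(path)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)


# Hypothetical target: the "downloads" directory does not exist yet.
base = tempfile.mkdtemp()
target = os.path.join(base, "downloads", "Lec-1.pdf")

# Opening for writing fails while the parent directory is missing.
try:
    open(target, "wb").close()
    failed_errno = None
except IOError as exc:  # IOError is an alias of OSError on Python 3
    failed_errno = exc.errno

# After creating the directory, the same open() call succeeds.
ensure_parent_dir(target)
with open(target, "wb") as fh:
    fh.write(b"%PDF-")
```

Calling such a helper on fullfilename before urlretrieve(url, fullfilename) would be one way to apply this to the loop in the question.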

Related

Saving files in Python using "with" method

I want to create a file and save it in JSON format. Every example I find uses the 'open' method. I am using Python 2.7 on Windows. Please help me understand why 'open' is necessary for a file I am saving for the first time.
I have read every tutorial I could find and researched this issue, but still with no luck. I do not want to create the file outside of my program and then have my program overwrite it.
Here is my code:
def savefile():
    filename = filedialog.asksaveasfilename(
        initialdir = "./Documents/WorkingDirectory/",
        title = "Save file",
        filetypes = (("JSON files","*.json"), ("All files", "*.")))
    with open(filename, 'r+') as currentfile:
        data = currentfile.read()
        print (data)
Here is the error I get:
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Python27\lib\lib-tk\Tkinter.py", line 1542, in __call__
    return self.func(*args)
  File "C:\Users\CurrentUser\Desktop\newproject.py", line 174, in savefile
    with open(filename, 'r+') as currentfile:
IOError: [Errno 2] No such file or directory: u'C:/Users/CurrentUser/Documents/WorkingDirectory/test.json'
OK, I figured it out! The problem was the mode 'r+', which requires the file to already exist. Since I am creating the file, there is no need for read and write, just write. So I changed the mode to 'w' and that fixed it. I also appended '.json' so that the extension is added automatically after the filename.
def savefile():
    filename = filedialog.asksaveasfilename(
        initialdir = "./Documents/WorkingDirectory/",
        title = "Save file",
        filetypes = (("JSON files","*.json"), ("All files", "*.")))
    with open(filename + ".json", 'w') as currentfile:
        line1 = currentfile.write(stringone)
        line2 = currentfile.write(stringtwo)
        print (line1,line2)
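To illustrate the difference between the two modes without a Tkinter dialog, here is a minimal sketch. The path and the data dictionary are made up for the example, and json.dump is used in place of the separate write calls above; that substitution is an assumption about what "save it to json format" is meant to achieve.

```python
import json
import os
import tempfile

# Hypothetical path to a file that does not exist yet.
path = os.path.join(tempfile.mkdtemp(), "test.json")

# 'r+' opens an existing file for reading and writing; it never creates one.
try:
    open(path, "r+").close()
    r_plus_created = True
except IOError:
    r_plus_created = False

# 'w' creates the file (or truncates an existing one).
data = {"name": "example", "values": [1, 2, 3]}
with open(path, "w") as fh:
    json.dump(data, fh)

# Read it back to confirm the round trip.
with open(path) as fh:
    loaded = json.load(fh)
```

The with statement still needs open because open is what creates the file object; with only guarantees that the file is closed afterwards.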

ConvertAPI convertapi-python failing with error to convert PDF_PPT input file to PPTX output file

I followed the instructions for this API conversion service, but get an error that is not covered in their documentation. Below is just a test to see if I can get it working for 1 PDF file. I have over 100 to convert, so I would rather use a proven service, rather than create my own converter.
Here's the code I used:
# -*- coding: utf-8 -*-
''' Spyder Editor '''
import convertapi
path = r'C:\conversion_test'
filename = r'10-cleaning-data-in-python-folder_4-cleaning-data-for-analysis-folder_ch4_slides'
path_and_filename = path + filename
convertapi.api_secret = 'LoDZP1klb1farkdh'
convertapi.convert('pptx', {'File': path_and_filename }, from_format = 'pdf').save_files(path)
The error I am getting is: AttributeError: 'ApiError' object has no attribute 'message'
Here's the full stack trace.
  File "C:\ProgramData\Anaconda3\envs\convertapi\lib\site-packages\IPython\core\interactiveshell.py", line 3284, in run_code
    self.showtraceback(running_compiled_code=True)
  File "C:\ProgramData\Anaconda3\envs\convertapi\lib\site-packages\IPython\core\interactiveshell.py", line 2023, in showtraceback
    self._showtraceback(etype, value, stb)
  File "C:\ProgramData\Anaconda3\envs\convertapi\lib\site-packages\ipykernel\zmqshell.py", line 546, in _showtraceback
    u'evalue' : py3compat.safe_unicode(evalue),
  File "C:\ProgramData\Anaconda3\envs\convertapi\lib\site-packages\ipython_genutils\py3compat.py", line 65, in safe_unicode
    return unicode_type(e)
  File "C:\ProgramData\Anaconda3\envs\convertapi\lib\site-packages\convertapi\exceptions.py", line 14, in __str__
    message = "%s Code: %s. %s" % (self.message, self.code, self.invalid_parameters)
AttributeError: 'ApiError' object has no attribute 'message'
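Two things can be read off the code and trace above. First, the AttributeError is raised while the console is trying to print an ApiError, so it hides whatever the service actually reported. Second, path + filename joins the two strings with no separator between them. The sketch below shows only that second point; the short filename and the '.pdf' extension are assumptions for illustration (the original filename has no extension shown), and ntpath is used so the Windows-style result is the same on any platform.

```python
import ntpath  # Windows flavour of os.path, importable on any OS

path = r'C:\conversion_test'
filename = 'slides'  # hypothetical short name standing in for the real one

# Plain concatenation glues the folder name and file name together.
concatenated = path + filename

# Joining inserts the separator; the '.pdf' extension is an assumption.
joined = ntpath.join(path, filename + '.pdf')

print(concatenated)
print(joined)
```

Whether this is the cause of the underlying API error cannot be confirmed from the trace alone, but checking that the input path exists before calling the service is cheap.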

Errno 22 when using shutil.copyfile on dictionary values in python

I am getting an error message that I can't seem to resolve. I have a csv file that I am trying to read in order to generate pdf files based on the county each map falls in. If there is only one map in a county, I do not need to append the files and simply want to copy the map to a new directory with a new name. (The appending code is TBD once this hurdle is resolved, as I am sure I will run into the same issue when using PyPDF2.) shutil.copyfile does not seem to recognize the path as valid for County3, which meets the condition to execute this command.
Map.csv file
County Maps
County1 C:\maps\map1.pdf
County1 C:\maps\map2.pdf
County2 C:\maps\map1.pdf
County2 C:\maps\map3.pdf
County3 C:\maps\map3.pdf
County4 C:\maps\map2.pdf
County4 C:\maps\map3.pdf
County4 C:\maps\map4.pdf
My code:
import csv, os
import shutil
from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter
merged_file = PdfFileMerger()
counties = {}
with open(r'C:\maps\Maps.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    for n, row in enumerate(reader):
        if not n:
            continue
        county, location = row
        if county not in counties:
            counties[county] = list()
        counties[county].append((location))
for k, v in counties.items():
    newPdfFile = ('C:\maps\Maps\JoinedMaps\County-' + k +'.pdf')
    if len(str(v).split(',')) > 1:
        print newPdfFile
    else:
        shutil.copyfile(str(v),newPdfFile)
        print 'v: ' + str(v)
Feedback message:
C:\maps\Maps\JoinedMaps\County-County4.pdf
C:\maps\Maps\JoinedMaps\County-County1.pdf
v: ['C:\\maps\\map3.pdf']
Traceback (most recent call last):
  File "<module2>", line 22, in <module>
  File "C:\Python27\ArcGIS10.5\lib\shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 22] invalid mode ('rb') or filename: "['C:\\\\maps\\\\map3.pdf']"
There are no blank lines in the csv file. In the csv file I tried changing the back slashes to forward slashes, double slashes, etc. I still get the error message. Is it because data is returned in brackets? If so, how do I strip these?
You are actually trying to open a file literally named ['C:\maps\map3.pdf']. You can tell because the error message shows the filename it is trying to open:
IOError: [Errno 22] invalid mode ('rb') or filename: "['C:\\\\maps\\\\map3.pdf']"
This value arises because you are converting the dictionary value, which is a list here, to a string:
shutil.copyfile(str(v),newPdfFile)
What you need to do is check if the list has more than one member or not, then step through each member of the list (the v) and copy the file.
for k, v in counties.items():
    newPdfFile = (r'C:\maps\Maps\JoinedMaps\County-' + k +'.pdf')
    if len(v) > 1:
        print newPdfFile
    else:
        for filename in v:
            shutil.copyfile(filename, newPdfFile)
            print('v: {}'.format(filename))
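The effect of str() on a list can be seen in isolation. This is a small sketch using a one-entry dictionary modelled on the County3 row; no files are touched.

```python
# One-entry dictionary modelled on the County3 row of the csv file.
counties = {"County3": [r"C:\maps\map3.pdf"]}
v = counties["County3"]

# str() on a list gives its repr: brackets, quotes and doubled backslashes.
as_text = str(v)

# Indexing (or iterating) gives back the path string itself.
first = v[0]

print(as_text)  # the bracketed text that shutil.copyfile was handed
print(first)    # the actual path
```

So there is nothing to strip: the brackets only appear because the list was turned into text, and len(v) counts the maps directly without splitting on commas.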

Unzip csv file from Zip on ftp server

I want to log into an FTP server (not a public URL), download a csv file that is located inside a zip file, and save it to a particular directory:
#log in OK
# this is the zip file I want to download
fpath = strDate + ".zip"
#set where to save file
ExtDir = "A:\\LOCAL\\DIREC\\TORY\\"
ExtDir = ExtDir + strdate + "\\"
ExtFile = ExtDir + "Download.zip"
#download files
#use zipfile.ZipFile as alternative method to open(ExtFile, 'w')
with zipfile.ZipFile(ExtFile,'w') as outzip:
    ftp.retrbinary('RETR %s' % fpath , outzip.write)
    outzip.close
I get this error
  File "C:\Program Files (x86)\Python 2.7\lib\ftplib.py", line 419, in retrbinary
    callback(data)
  File "C:\Program Files (x86)\Python 2.7\lib\zipfile.py", line 1123, in write
    st = os.stat(filename)
TypeError: stat() argument 1 must be encoded string without null bytes, not str
Fixed using:
ftp.retrlines('RETR %s' % fpath ,lambda s, w=outzip.write: w(s+"\n"))
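For context on the traceback: ZipFile.write expects the name of a file on disk to add to the archive, whereas retrbinary hands its callback raw chunks of data, which is why os.stat complains. A different approach from the one posted, sketched below under assumptions, is to save the downloaded bytes to an ordinary file opened in 'wb' mode and only then open it with zipfile for reading. The function name download_and_extract and its parameters are hypothetical; ftp is assumed to be a logged-in ftplib.FTP object.

```python
import os
import zipfile


def download_and_extract(ftp, remote_name, local_zip, out_dir):
    """Download a zip over FTP in binary mode, then extract it.

    `ftp` is assumed to be a logged-in ftplib.FTP (anything with a
    compatible retrbinary method works). Returns the archive's member names.
    """
    # Step 1: write the raw bytes to disk unchanged.
    with open(local_zip, "wb") as fh:
        ftp.retrbinary("RETR %s" % remote_name, fh.write)

    # Step 2: open the saved archive for reading and unpack it.
    with zipfile.ZipFile(local_zip, "r") as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Usage sketch (host and credentials are placeholders, not from the post):
# import ftplib
# ftp = ftplib.FTP("ftp.example.com")
# ftp.login("user", "password")
# download_and_extract(ftp, fpath, ExtFile, ExtDir)
```

Binary transfer matters here because a zip archive is not line-oriented text; a line-based transfer can alter its bytes.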

Python pdf to txt

I would like to convert pdf file to txt. Here is my code:
testFile = urllib.URLopener()
testFile.retrieve("http://url_to_download" , "/Users/gabor_dev/Desktop/pdf_tst/tst.pdf")
content = ""
pdf = pyPdf.PdfFileReader(file("/Users/gabor_dev/Desktop/pdf_tst/tst.pdf", "rb"))
for i in range(0, pdf.getNumPages()):
    f = open("/Users/gabor_dev/Desktop/pdf_tst/xxx.txt",'a')
    content= pdf.getPage(i).extractText() + "\n"
    c=content.split()
    for a in c:
        f.write(" ")
        f.write(a)
        f.write('\n')
        f.close()
My pdf is downloaded, but when I try to convert it to my txt only the first word of the pdf shows up in my txt file, and then I get this error:
Traceback (most recent call last):
File "/Users/gabor_dev/PycharmProjects/text_class_tst/textClass.py", line 26, in <module>
f.write(" ")
ValueError: I/O operation on closed file
What am I doing wrong?
Thank you!
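The error means the file was already closed when a later write was attempted. It is better to use with open, which closes the file once, after all the writes have finished: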
import urllib
import pyPdf
testFile = urllib.URLopener()
testFile.retrieve("http://www.pdf995.com/samples/pdf.pdf" , "./tst.pdf")
content = ""
pdf = pyPdf.PdfFileReader(file("./tst.pdf", "rb"))
with open("./xxx.txt",'a') as f :
    for i in range(0, pdf.getNumPages()):
        content= pdf.getPage(i).extractText() + "\n"
        c=content.split()
        for a in c:
            f.write(" ")
            f.write(a)
            f.write('\n')
Tested and works
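The same behaviour can be reproduced without pyPdf. The sketch below uses a made-up word list and a temporary file: closing the file inside the loop lets exactly one word through before the ValueError, while a with block keeps the file open for the whole loop.

```python
import os
import tempfile

words = ["alpha", "beta", "gamma"]  # stand-in for the words extracted from a page
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Buggy pattern: close() sits inside the loop, so the second write fails.
f = open(path, "a")
error = None
try:
    for w in words:
        f.write(w + "\n")
        f.close()
except ValueError as exc:
    error = str(exc)

with open(path) as fh:
    first_attempt = fh.read()  # only the first word made it to disk

# Fixed pattern: the with block closes the file once, after the loop ends.
with open(path, "w") as f:
    for w in words:
        f.write(w + "\n")

with open(path) as fh:
    fixed = fh.read()
```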