In python 2.7.3, I try to merge two files in one.
I download a file over the Internet. The entire file size is exactly 3,197,743 bytes. I download it in two parts, one part is 3,000,000 bytes in size, the second part is 197,743 in size. Then, I want to merge the two files to reconstruct the entire file.
Here my code :
import requests
import shutil
URL = 'some_URL'
headers = {'user-agent': 'Agent'}
headers.update({'range': 'bytes=0-2999999'})
response = requests.get(URL, headers=headers)
file = open('some_file', 'wb')
file.write(response.content)
file.close()
headers2 = {'user-agent': 'Agent'}
headers2.update({'range': 'bytes=3000000-'})
response2 = requests.get(URL, headers=headers2)
file2 = open('some_file2', 'wb')
file2.write(response2.content)
file2.close()
source = open('some_file2','rb')
destination = open('some_file','ab')
shutil.copyfileobj(source,destination)
destination.close()
source.close()
At the end, I have one file ('some-file' in the example) which size is exactly 3,197,743 bytes but the file is corrupted. I tried this with a PDF file.
Where is the problem ?
I tried to solve your problem with different approaches and used diff tool to identify whether the program retrieves part files differently. I identified that there are no difference, thus I am not really sure what's wrong.
However, I propose following solution to resolve your usecase
import urllib2
URL = "http://traffic.org/general-reports/traffic_pub_gen19.pdf"
req = urllib2.urlopen(URL)
CHUNK = 3000000
with open("some_file.pdf", 'wb') as fp:
while True:
chunk = req.read(CHUNK)
if not chunk: break
fp.write(chunk)
Related
I'm using django to generate personalized file, but while doing so a file is generated, and in terms of space using it is quite a poor thing to do.
this is how i do it right now:
with open(filename, 'wb') as f:
pdf.write(f) #pdf is an object of pyPDF2 library
with open(filename, 'rb') as f:
return send_file(data=f, filename=filename) #send_file is a HTTPResponse parametted to download file data
So in the code above a file is generated.
The easy fix would be to deleted the file after downloading it, but i remember in java using stream object to handle this case.
Is it possible to do so in Python?
EDIT:
def send_file(data, filename, mimetype=None, force_download=False):
disposition = 'attachment' if force_download else 'inline'
filename = os.path.basename(filename)
response = HttpResponse(data, content_type=mimetype or 'application/octet-stream')
response['Content-Disposition'] = '%s; filename="%s"' % (disposition, filename)
return response
Without knowing the exact details of the pdf.write and send_file functions, I expect in both cases they will take an object that conforms to the BinaryIO interface. So, you could try using a BytesIO to store the content in an in-memory buffer, rather than writing out to a file:
with io.BytesIO() as buf:
pdf.write(buf)
buf.seek(0)
send_file(data=buf, filename=filename)
Depending on the exact nature of the above-mentioned functions, YMMV.
I'm currently trying to save a file via requests, it's rather large, so I'm instead streaming it.
I'm unsure how to specifically do this, as I keep getting different errors. This is what I have so far.
def download_file(url, matte_upload_path, matte_servers, job_name, count):
local_filename = url.split('/')[-1]
url = "%s/static/downloads/%s_matte/%s/%s" % (matte_servers[0], job_name, count, local_filename)
with requests.get(url, stream=True) as r:
r.raise_for_status()
fs = FileSystemStorage(location=matte_upload_path)
print(matte_upload_path, 'matte path upload')
with open(local_filename, 'wb') as f:
for chunk in r.iter_content(chunk_size=8192):
f.write(chunk)
fs.save(local_filename, f)
return local_filename
but it returns
io.UnsupportedOperation: read
I'm basically trying to have requests save it to the specific location via django, any help would be appreciated.
I was able to solve this, by using a tempfile to save the python requests, then saving it via the FileSystemStorage
local_filename = url.split('/')[-1]
url = "%s/static/downloads/%s_matte/%s/%s" % (matte_servers[0], job_name, count, local_filename)
response = requests.get(url, stream=True)
fs = FileSystemStorage(location=matte_upload_path)
lf = tempfile.NamedTemporaryFile()
# Read the streamed image in sections
for block in response.iter_content(1024 * 8):
# If no more file then stop
if not block:
break
# Write image block to temporary file
lf.write(block)
fs.save(local_filename, lf)
I have written code for downloading a file through an API. It works fine as I can see. The file size is the same. But the file has no longer a default file icon. I am pretty new at this and maybe I am doing something wrong. I am reading the file as I would for a standard textfile and saving it in the same way with the binary option. So how can the files be same size and still something seems to be missing in the downloaded file? Is there a better way to download files?
This is the code on the server:
file_location = 'static/File.pkg'
try:
with open(file_location, 'rb') as f:
filex_data = f.read()
response = HttpResponse(filex_data, content_type='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename="File.pkg"'
return response
This is the code on my local computer:
url = 'http://myServer/waprfile/'
x = requests.get(url, data=data, headers=headers)
f = open("TheNewFile.pgk", "ab")
f.write(x.content)
f.close()
I'm trying to shorten or simplify my code.
I want to download a log file from an internal server which is updated every 10 seconds, but I'm only running my script every 10 or 15 minutes.
The log file is semicolon seperated and has many rows in it I don't use. So my workflow is as following.
get current date in YYYYMMDD format
download the file
delay for waiting that the file is finished downloading
trim the file to the rows I need
only process last line of the file
delete the files
I'm new to python and if you could help me to shorten/simplify my code in less steps I would be thankful.
import urllib
import time
from datetime import date
today = str(date.today())
import csv
url = "http://localserver" + today + ".log"
urllib.urlretrieve (url, "output.log")
time.sleep(15)
with open("output.log","rb") as source:
rdr= csv.reader(source, delimiter=';')
with open("result.log","wb") as result:
wtr= csv.writer( result )
for r in rdr:
wtr.writerow( (r[0], r[1], r[2], r[3], r[4], r[5], r[15], r[38], r[39], r[42], r[54], r[90], r[91], r[92], r[111], r[116], r[121], r[122], r[123], r[124]) )
with open('result.log') as myfile:
print (list(myfile)[-1]) #how do I access certain rows here?
You could probably make use of the advanced module, requests as below. The timeout can be increased depending on the time it takes for the download to complete successfully. Furthermore, the two with open statements can be consolidated in a single line. What is more, in order to load the line one by one in to the memory, we can make use of iter_lines generator. Note that stream=True should be set in order to load line one at a time.
from datetime import date
import csv
import requests
# Declare variables
today = str(date.today())
url = "http://localserver" + today + ".log"
outfile = 'output.log'
# Instead of waiting for 15 seconds explicitly consider using requests module
# with timeout parameter
response = requests.get(url, timeout=15, stream=True)
if response.status_code != 200:
print('Failed to get data:', response.status_code)
with open(outfile, 'w') as dest:
writer = csv.writer(dest)
# Walk through the request response line by line w/o loadin gto memory
line = list(response.iter_lines())[-1]
# Decode the response to string and split line by line
reader = csv.reader(line.decode('utf-8').splitlines(), delimiter=';')
# Read line by line for the splitted content and write to file
for r in reader:
writer.writerow((r[0], r[1], r[2], r[3], r[4], r[5], r[15], r[38], r[39], r[42], r[54], r[90], r[91], r[92],
r[111], r[116], r[121], r[122], r[123], r[124]))
print('File written successfully: ' + outfile)
hi im trying to watermark a pdf fileusing pypdf2 though i get this error i cant figure out what goes wrong.
i get the following error:
Traceback (most recent call last): File "test.py", line 13, in <module>
page.mergePage(watermark.getPage(0)) File "C:\Python27\site-packages\PyPDF2\pdf.py", line 1594, in mergePage
self._mergePage(page2) File "C:\Python27\site-packages\PyPDF2\pdf.py", line 1651, in _mergePage
page2Content, rename, self.pdf) File "C:Python27\site-packages\PyPDF2\pdf.py", line 1547, in
_contentStreamRename
op = operands[i] KeyError: 0
using python 2.7.6 with pypdf2 1.19 on windows 32bit.
hopefully someone can tell me what i do wrong.
my python file:
from PyPDF2 import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input = PdfFileReader(open("test.pdf", "rb"))
watermark = PdfFileReader(open("watermark.pdf", "rb"))
# print how many pages input1 has:
print("test.pdf has %d pages." % input.getNumPages())
print("watermark.pdf has %d pages." % watermark.getNumPages())
# add page 0 from input, but first add a watermark from another PDF:
page = input.getPage(0)
page.mergePage(watermark.getPage(0))
output.addPage(page)
# finally, write "output" to document-output.pdf
outputStream = file("outputs.pdf", "wb")
output.write(outputStream)
outputStream.close()
Try writing to a StringIO object instead of a disk file. So, replace this:
outputStream = file("outputs.pdf", "wb")
output.write(outputStream)
outputStream.close()
with this:
outputStream = StringIO.StringIO()
output.write(outputStream) #write merged output to the StringIO object
outputStream.close()
If above code works, then you might be having file writing permission issues. For reference, look at the PyPDF working example in my article.
I encountered this error when attempting to use PyPDF2 to merge in a page which had been generated by reportlab, which used an inline image canvas.drawInlineImage(...), which stores the image in the object stream of the PDF. Other PDFs that use a similar technique for images might be affected in the same way -- effectively, the content stream of the PDF has a data object thrown into it where PyPDF2 doesn't expect it.
If you're able to, a solution can be to re-generate the source pdf, but to not use inline content-stream-stored images -- e.g. generate with canvas.drawImage(...) in reportlab.
Here's an issue about this on PyPDF2.