Save pdf from django-wkhtmltopdf to server (instead of returning as a response) - django

I'm writing a Django function that takes some user input, and generates a pdf for the user. However, the process for generating the pdf is quite intensive, and I'll get a lot of repeated requests so I'd like to store the generated pdfs on the server and check if they already exist before generating them.
The problem is that django-wkhtmltopdf (which I'm using for generation) is meant to return to the user directly, and I'm not sure how to store it on the file.
I have the following, which works for returning a pdf at /pdf:
urls.py
urlpatterns = [
url(r'^pdf$', views.createPDF.as_view(template_name='site/pdftemplate.html', filename='my_pdf.pdf'))
]
views.py
class createPDF(PDFTemplateView):
filename = 'my_pdf.pdf'
template_name = 'site/pdftemplate.html'
So that works fine to create a pdf. What I'd like is to call that view from another view and save the result. Here's what I've got so far:
#Create pdf
pdf = createPDF.as_view(template_name='site/pdftemplate.html', filename='my_pdf.pdf')
pdf = pdf(request).render()
pdfPath = os.path.join(settings.TEMP_DIR,'temp.pdf')
with open(pdfPath, 'w') as f:
f.write(pdf.content)
This creates temp.pdf and is about the size I'd expect but the file isn't valid (it renders as a single completely blank page).
Any suggestions?

Elaborating on the previous answer given: to generate a pdf file and save to disk do this anywhere in your view:
...
context = {...} # build your context
# generate response
response = PDFTemplateResponse(
request=self.request,
template=self.template_name,
filename='file.pdf',
context=context,
cmd_options={'load-error-handling': 'ignore'})
# write the rendered content to a file
with open("file.pdf", "wb") as f:
f.write(response.rendered_content)
...
I have used this code in a TemplateView class so request and template fields were set like that, you may have to set it to whatever is appropriate in your particular case.

Well, you need to take a look to the code of wkhtmltopdf, first you need to use the class PDFTemplateResponse in wkhtmltopdf.views to get access to the rendered_content property, this property get us access to the pdf file:
response = PDFTemplateResponse(
request=<your_view_request>,
template=<your_template_to_render>,
filename=<pdf_filename.pdf>,
context=<a_dcitionary_to_render>,
cmd_options={'load-error-handling': 'ignore'})
Now you could use the rendered_content property to get access to the pdf file:
mail.attach('pdf_filename.pdf', response.rendered_content, 'application/pdf')
In my case I'm using this pdf to attach to an email, you could store it.

Related

Stop spinner after Django FileResponse Save

I have a template link of a url that runs a view function that generates a file and returns a FileResponse. Works great, but in cases it can take a while to generate the file so I'd like to start and stop a spinner before and after.
I've tried using a click event on the link to run and $.ajax() or $.get() function that sends the url, and in this way I can start the spinner. But the FileResponse doesn't generate a Save window in this case. (code below)
Is there a way to generate and save a file in a Django view via JavaScript? The following never opens the file save window.
$("#ds_downloads a").click(function(e){
urly='/datasets/{{ds.id}}/augmented/'+$(this).attr('ref')
startDownloadSpinner()
$.ajax({
type: 'GET',
url: urly
}).done(function() {
spinner_dl.stop();
})
})
Adding the function below that creates the FileResponse. This generates a local system file save window allowing the user to save the file locally. The event of either opening or closing (save) that window would be the time to stop the spinner, but I can't seem to access it via javascript.
with open(fn, 'w', newline='', encoding='utf-8') as csvfile:
writer = csv.writer(csvfile, delimiter='\t',
quotechar='', quoting=csv.QUOTE_NONE)
writer.writerow(['col1','col2','col3'])
for f in features:
geoms = f.geoms.all()
gobj = augGeom(geoms)
row = [
gobj['col1'],
gobj['col2'],
gobj['col3']]
writer.writerow(row)
response = FileResponse(open(fn, 'rb'),content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename="'+os.path.basename(fn)+'"'
Just use HttpResponse, the browser adds the spinner automatically. No need to add extra spinner.
from csv import writer
from django.http import HttpResponse
response = HttpResponse(content_type='text/csv')
response.status_code = 200
response["Content-Disposition"] = "attachment; filename={}".format(filename)
try:
csv_writer = writer(response)
# define headers
csv_writer.writerow(headers)
[csv_writer.writerow(row) for row in features.values_list(*headers)]
except Exception:
raise
return response
I came up with a somewhat klugey solution: I generate a POST request with $.ajax to a function-based view that initiates a Celery task to perform the file creation and save it to the filesystem. It returns the Celery task_id to the browser while the task runs, feeding that to a celery_progress.js routine that checks the progress of the task and on completion returns the filename; an href link is created with that in the markup, and a click() generated on it. That ajax function starts and stops a spinner with spinner.js at the appropriate points.
Not posting code b/c it is a kluge and I wouldn't recommend it. I'm guessing there's a better way to do this with a django form, but this works, the project is overdue, and I haven't got the time to look further.

Directing Output Paths of Altered Files

How can I direct the destination of the output file to my db?
My models.py is structured like so:
class Model(models.Model):
char = models.CharField(max_length=50, null=False, blank=False)
file = models.FileField(upload_to=upload_location, null=True, blank=True)
I have the user enter a value for 'char', and then the value of 'char' is printed on to a file. The process of successfully printing onto the file is working, however, the file is outputting to my source directory.
My goal is to have the output file 'pdf01.pdf' output to my db and be represented as 'file' so that the admin can read it.
Much of the information in the Dango docs has been focussed on directing the path of objects imported by the user directly, not on files that have been created internally. I have been reading mostly from these docs:
Models-Fields
Models
File response objects
Outputting PDFs
I have seen it recommend to write to a buffer, not a file, then save the buffer contents to my db however I haven't been able to find many examples of how to do that relevant to my situation online.
Perhaps there is a relevant gap in my knowledge regarding buffers and BytesIO? Here is the function I have been using to alter the pdf, I have been using BytesIO to temporarily store files throughout the process but have not been able to figure out how to use it to direct the output anywhere specific.
can = canvas.Canvas(BytesIO(), pagesize=letter)
can.drawString(10, 10, char)
can.save()
BytesIO().seek(0)
text_pdf = PdfFileReader(BytesIO())
base_file = PdfFileReader(open("media/01.pdf", "rb"))
page = base_file.getPage(0)
page.mergePage(text_pdf.getPage(0))
PdfFileWriter().addPage(page)
PdfFileWriter().write(open("pdf01.pdf", "wb")
FileField does not store files directly in the database. Files get uploaded in a location on the filesystem determined by the upload_to argument. Only some metadata are stored in the DB, including the path of the file in your filesystem.
If you want to have the contents of the files in the database, you could create a new File model that includes a BinaryField to store the data and a CharField to store the URL from which the file can be fetched. To feed the data of PdfFileWriter to the binary field of Django, perhaps the most appropriate would be to use BytesIO.
I found this workaround to direct the file to a desired location (in this case both my media_cdn folder and also output it to an admin.)
I set up an admin action to perform the function that outputs the file so the admin will have access to both the output version in the form of both an HTTP response and through the media_cdn storage.
Hope this helps anyone who struggles with the same problem.
#admin.py
class edit_and_output():
def output:
author = Account.email
#alter file . . .
with open('media_cdn/account/{0}.pdf'.format(author), 'wb') as out_file:
output.write(out_file)
response = HttpResponse(content_type='application/pdf')
response['Content-Disposition'] = 'attachment;filename="{0}.pdf"'.format(author)
output.write(response)

Django - writing a PDF created with xhtml2pdf to server's disk

I'm trying to write a script that will save a pdf created by xhtml2pdf directly to the server, without doing the usual route of prompting the user to download it to their computer. Documents() is the Model I am trying to save to, and the new_project and output_filename variables are set elsewhere.
html = render_to_string(template, RequestContext(request, context)).encode('utf8')
result = open(output_filename, "wb")
pdf = CreatePDF(src=html, dest=results, path = "", encoding = 'UTF-8', link_callback=link_callback) #link callback was originally set to link_callback, defined below
result.close()
if not pdf.err:
new_doc=Documents()
new_doc.project=new_project
new_doc.posted_by=old_mess[0].from_user_fk.username
new_doc.documents = result
new_doc.save()
With this configuration when it reaches new_doc.save() I get the error: 'file' object has no attribute '_committed'
Does anyone know how I can fix this? Thanks!
After playing around with it I found a working solution. The issue was I was not creating the new Document while result (the pdf) was still open.
"+" needed to be added to open() so that the pdf file was available for reading and writing, and not just writing.
Note that this does save the pdf in a different folder first (Files). If that is not the desired outcome for your application you will need to delete it.
html = render_to_string(template, RequestContext(request, context)).encode('utf8')
results = StringIO()
result = open("Files/"+output_filename, "w+b")
pdf = CreatePDF(src=html, dest=results, path = "", encoding = 'UTF-8', link_callback=link_callback) #link callback was originally set to link_callback, defined below
if not pdf.err:
result.write(results.getvalue())
new_doc=Documents()
new_doc.project=new_project
new_doc.documents.save(output_filename, File(result))
new_doc.posted_by=old_mess[0].from_user_fk.username
new_doc.save()
result.close()

Django-Weasyprint image issue

As it says in the docs page, I defined a img tag in my html file as follows:
<img src='{% static 'image.png' %}'/>
This url exists in the server and I even made a different view with a http response and the image is displayed just fine. Here is the code for both views:
The pdf-weasyprint view:
def card_view(request):
template = loader.get_template('card.html')
context = {'sample': None
}
html = template.render(RequestContext(request, context))
response = HttpResponse(mimetype='application/pdf')
HTML(string=html).write_pdf(response)
return response
The html view:
def card_view2(request):
context = {'sample': None,
}
return render_to_response('card.html', context,
context_instance=RequestContext(request))
I thought the default url fetcher was supposed to find and render the image (it's a png - so no format issue should be involved)
Any ideas? Any help would be appreciated!!
What exactly is the issue? Do you get anything in the logs? (You may need to configure logging if your server does not log stderr.) What does the generated HTML look like?
I’d really need answers to the above to confirm, but my guess is that the image’s URL is relative, but with HTML(string=...) WeasyPrint has no idea of what is the base URL. Try something like this. (I’m not sure of the Django details.)
HTML(string=html, base_url=request.build_absolute_uri()).write_pdf(response)
This will make a real HTTP request on your app, which may deadlock on a single-threaded server. (I think the development server defaults to a single thread.)
To avoid that and the cost of going through the network, you may want to look into writing a custom "URL fetcher". It could be anywhere from specialized to just this one image, to a full Django equivalent of Flask-WeasyPrint.
Here it is a URL fetcher which reads (image) files locally, without performing an HTTP request:
from weasyprint import HTML, CSS, default_url_fetcher
import mimetypes
def weasyprint_local_fetcher(url):
if url.startswith('local://'):
filepath = url[8:]
with open(filepath, 'rb') as f:
file_data = f.read()
return {
'string': file_data,
'mime_type': mimetypes.guess_type(filepath)[0],
}
return default_url_fetcher(url)
In order to use it, use the local:// scheme in your URLs, for example:
<img src="local://myapp/static/images/image.svg" />
Then, pass the fetcher to the HTML __init__ method:
html = HTML(
string=html_string,
url_fetcher=weasyprint_local_fetcher,
)

Django ImageField validation (is it sufficient)?

I have a lot of user uploaded content and I want to validate that uploaded image files are not, in fact, malicious scripts. In the Django documentation, it states that ImageField:
"Inherits all attributes and methods from FileField, but also validates that the uploaded object is a valid image."
Is that totally accurate? I've read that compressing or otherwise manipulating an image file is a good validation test. I'm assuming that PIL does something like this....
Will ImageField go a long way toward covering my image upload security?
Django validates the image uploaded via form using PIL.
See https://code.djangoproject.com/browser/django/trunk/django/forms/fields.py#L519
try:
# load() is the only method that can spot a truncated JPEG,
# but it cannot be called sanely after verify()
trial_image = Image.open(file)
trial_image.load()
# Since we're about to use the file again we have to reset the
# file object if possible.
if hasattr(file, 'reset'):
file.reset()
# verify() is the only method that can spot a corrupt PNG,
# but it must be called immediately after the constructor
trial_image = Image.open(file)
trial_image.verify()
...
except Exception: # Python Imaging Library doesn't recognize it as an image
raise ValidationError(self.error_messages['invalid_image'])
PIL documentation states the following about verify():
Attempts to determine if the file is broken, without actually decoding
the image data. If this method finds any problems, it raises suitable
exceptions. This method only works on a newly opened image; if the
image has already been loaded, the result is undefined. Also, if you
need to load the image after using this method, you must reopen the
image file.
You should also note that ImageField is only validated when uploaded using form. If you save the model your self (e.g. using some kind of download script), the validation is not performed.
Another test is with the file command. It checks for the presence of "magic numbers" in the file to determine its type. On my system, the file package includes libmagic as well as a ctypes-based wrapper /usr/lib64/python2.7/site-packages/magic.py. It looks like you use it like:
import magic
ms = magic.open(magic.MAGIC_NONE)
ms.load()
type = ms.file("/path/to/some/file")
print type
f = file("/path/to/some/file", "r")
buffer = f.read(4096)
f.close()
type = ms.buffer(buffer)
print type
ms.close()
(Code from here.)
As to your original question: "Read the Source, Luke."
django/core/files/images.py:
"""
Utility functions for handling images.
Requires PIL, as you might imagine.
"""
from django.core.files import File
class ImageFile(File):
"""
A mixin for use alongside django.core.files.base.File, which provides
additional features for dealing with images.
"""
def _get_width(self):
return self._get_image_dimensions()[0]
width = property(_get_width)
def _get_height(self):
return self._get_image_dimensions()[1]
height = property(_get_height)
def _get_image_dimensions(self):
if not hasattr(self, '_dimensions_cache'):
close = self.closed
self.open()
self._dimensions_cache = get_image_dimensions(self, close=close)
return self._dimensions_cache
def get_image_dimensions(file_or_path, close=False):
"""
Returns the (width, height) of an image, given an open file or a path. Set
'close' to True to close the file at the end if it is initially in an open
state.
"""
# Try to import PIL in either of the two ways it can end up installed.
try:
from PIL import ImageFile as PILImageFile
except ImportError:
import ImageFile as PILImageFile
p = PILImageFile.Parser()
if hasattr(file_or_path, 'read'):
file = file_or_path
file_pos = file.tell()
file.seek(0)
else:
file = open(file_or_path, 'rb')
close = True
try:
while 1:
data = file.read(1024)
if not data:
break
p.feed(data)
if p.image:
return p.image.size
return None
finally:
if close:
file.close()
else:
file.seek(file_pos)
So it looks like it just reads the file 1024 bytes at a time until PIL says it's an image, then stops. This obviously does not integrity-check the entire file, so it really depends on what you mean by "covering my image upload security": illicit data could be appended to an image and passed through your site. Someone could DOS your site by uploading a lot of junk or a really big file. You could be vulnerable to an injection attack if you don't check any uploaded captions or make assumptions about the image's uploaded filename. And so on.