How does one use magic to verify file type in a Django form clean method? - django

I have written an email form class in Django with a FileField. I want to check the uploaded file for its type via checking its mimetype. Subsequently, I want to limit file types to pdfs, word, and open office documents.
To this end, I have installed python-magic and would like to check file types as follows per the specs for python-magic:
mime = magic.Magic(mime=True)
file_mime_type = mime.from_file('address/of/file.txt')
However, recently uploaded files lack addresses on my server. I also do not know of any method of the mime object akin to "from_file_content" that checks for the mime type given the content of the file.
What is an effective way to use magic to verify file types of uploaded files in Django forms?

Stan described good variant with buffer. Unfortunately the weakness of this method is reading file to the memory. Another option is using temporary stored file:
import tempfile
import magic
with tempfile.NamedTemporaryFile() as tmp:
for chunk in form.cleaned_data['file'].chunks():
tmp.write(chunk)
print(magic.from_file(tmp.name, mime=True))
Also, you might want to check the file size:
if form.cleaned_data['file'].size < ...:
print(magic.from_buffer(form.cleaned_data['file'].read()))
else:
# store to disk (the code above)
Additionally:
Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows NT or later).
So you might want to handle it like so:
import os
tmp = tempfile.NamedTemporaryFile(delete=False)
try:
for chunk in form.cleaned_data['file'].chunks():
tmp.write(chunk)
print(magic.from_file(tmp.name, mime=True))
finally:
os.unlink(tmp.name)
tmp.close()
Also, you might want to seek(0) after read():
if hasattr(f, 'seek') and callable(f.seek):
f.seek(0)
Where uploaded data is stored

Why no trying something like that in your view :
m = magic.Magic()
m.from_buffer(request.FILES['my_file_field'].read())
Or use request.FILES in place of form.cleaned_data if django.forms.Form is really not an option.

mime = magic.Magic(mime=True)
attachment = form.cleaned_data['attachment']
if hasattr(attachment, 'temporary_file_path'):
# file is temporary on the disk, so we can get full path of it.
mime_type = mime.from_file(attachment.temporary_file_path())
else:
# file is on the memory
mime_type = mime.from_buffer(attachment.read())
Also, you might want to seek(0) after read():
if hasattr(f, 'seek') and callable(f.seek):
f.seek(0)
Example from Django code. Performed for image fields during validation.

You can use django-safe-filefield package to validate that uploaded file extension match it MIME-type.
from safe_filefield.forms import SafeFileField
class MyForm(forms.Form):
attachment = SafeFileField(
allowed_extensions=('xls', 'xlsx', 'csv')
)

In case you're handling a file upload and concerned only about images,
Django will set content_type for you (or rather for itself?):
from django.forms import ModelForm
from django.core.files import File
from django.db import models
class MyPhoto(models.Model):
photo = models.ImageField(upload_to=photo_upload_to, max_length=1000)
class MyForm(ModelForm):
class Meta:
model = MyPhoto
fields = ['photo']
photo = MyPhoto.objects.first()
photo = File(open('1.jpeg', 'rb'))
form = MyForm(files={'photo': photo})
if form.is_valid():
print(form.instance.photo.file.content_type)
It doesn't rely on content type provided by the user. But
django.db.models.fields.files.FieldFile.file is an undocumented
property.
Actually, initially content_type is set from the request, but when
the form gets validated, the value is updated.
Regarding non-images, doing request.FILES['name'].read() seems okay to me.
First, that's what Django does. Second, files larger than 2.5 Mb by default
are stored on a disk. So let me point you at the other answer
here.
For the curious, here's the stack trace that leads to updating
content_type:
django.forms.forms.BaseForm.is_valid: self.errors
django.forms.forms.BaseForm.errors: self.full_clean()
django.forms.forms.BaseForm.full_clean: self._clean_fields()
django.forms.forms.BaseForm._clean_fiels: field.clean()
django.forms.fields.FileField.clean: super().clean()
django.forms.fields.Field.clean: self.to_python()
django.forms.fields.ImageField.to_python

Related

Filling MS Word Template from Django

I found some python docs relating to docxtpl at this link:
https://docxtpl.readthedocs.io/en/latest/
I followed the instruction and entered the code found at this site into a view and created the associated URL. When I go to the URL I would like for a doc to be generated - but I get an error that no HTTP response is being returned. I understand I am not defining one, but I am a bit confused about what HTTP response I need to define (I am still very new to this). The MS word template that I have saved is titled 'template.docx'.
Any help would be greatly appreciated!
VIEWS.PY
def doc_test(request):
doc = DocxTemplate("template.docx")
context = { 'ultimate_consignee' : "World company" }
doc.render(context)
doc.save("generated_doc.docx")
I would like accessing this view to generate the doc, where the variables are filled with what is defined in the context above.
Gist: Read the contents of the file and return the data in an HTTP response.
First of all, you'll have to save the file in memory so that it's easier to read. Instead of saving to a file name like doc.save("generated_doc.docx"), you'll need to save it to a file-like object.
Then read the contents of this file-like object and return it in an HTTP response.
import io
from django.http import HttpResponse
def doc_test(request):
doc = DocxTemplate("template.docx")
# ... your other code ...
doc_io = io.BytesIO() # create a file-like object
doc.save(doc_io) # save data to file-like object
doc_io.seek(0) # go to the beginning of the file-like object
response = HttpResponse(doc_io.read())
# Content-Disposition header makes a file downloadable
response["Content-Disposition"] = "attachment; filename=generated_doc.docx"
# Set the appropriate Content-Type for docx file
response["Content-Type"] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
return response
Note: This code may or may not work because I haven't tested it. But the general principle remains the same i.e. read the contents of the file and return it in an HTTP response with appropriate headers.
So if this code doesn't work, maybe because the package you're using doesn't support writing to file-like objects or for some other reason, then it would be a good idea to ask the creator of the package or file an issue on their Github about how to read the contents of the file.
Here is a more concise solution:
import os
from io import BytesIO
from django.http import FileResponse
from docxtpl import DocxTemplate
def downloadWord(request, pk):
context = {'first_name' : 'xxx', 'sur_name': 'yyy'}
byte_io = BytesIO()
tpl = DocxTemplate(os.path.join(BASE_PATH, 'template.docx'))
tpl.render(context)
tpl.save(byte_io)
byte_io.seek(0)
return FileResponse(byte_io, as_attachment=True, filename=f'generated_{pk}.docx')

How to use validators on FileField content

In my model, I want to use a validator to analyze the content of a file, the thing I can not figure out is how to access the content of the file to parse through it as the file has not yet been saved (which is good) when the validators are running.
I'm not understanding how to get the data from the value passed to the validator into a file (I assume I should use tempfile) so I can then open it and evaluate the data.
Here's a simplified example, in my real code, I want to open the file and evaluate it with csv.
in Models.py
class ValidateFile(object):
....
def __call__(self, value):
# value is the fieldfile object but its not saved
# I believe I need to do something like:
temp_file = tempfile.TemporaryFile()
temp_file.write(value.read())
# Check the data in temp_file
....
class MyItems(models.Model):
data = models.FileField(upload_to=get_upload_path,
validators=[FileExtensionValidator(allowed_extensions=['cv']),
ValidateFile()])
Thanks for the help!
Take a look how this is done in the ImageField implementation:
So your ValidateFile class may be something like this:
from io import BytesIO
class ValidateFile(object):
def __call__(self, value):
if value is None:
#do something when None
return None
if hasattr(value, 'temporary_file_path'):
file = value.temporary_file_path()
else:
if hasattr(value, 'read'):
file = BytesIO(value.read())
else:
file = BytesIO(value['content'])
#Now validate your file
No need for tempfile:
The value passed to a FileField validator is an instance of FieldFile, as already mentioned by the OP.
Under the hood, the FieldFile instance might already use a tempfile.NamedTemporaryFile (source), or it might wrap an in-memory file, but you need not worry about that:
To "evaluate the data" you can simply treat the FieldFile instance as any Python file object.
For example, you could iterate over it:
def my_filefield_validator(value):
# note that value is a FieldFile instance
for line in value:
... # do something with line
The documentation says:
In addition to the API inherited from File such as read() and write(), FieldFile includes several methods that can be used to interact with the underlying file: ...
and the FieldFile class provides
... a wrapper around the result of the Storage.open() method, which may be a File object, or it may be a custom storage’s implementation of the File API.
An example of such an underlying file implementation is the InMemoryUploadedFile docs/source.
Also from the docs:
The File class is a thin wrapper around a Python file object with some Django-specific additions
Also note: class-based validators vs function-based validators

How to limit file types on file uploads for ModelForms with FileFields?

My goal is to limit a FileField on a Django ModelForm to PDFs and Word Documents. The answers I have googled all deal with creating a separate file handler, but I am not sure how to do so in the context of a ModelForm. Is there a setting in settings.py I may use to limit upload file types?
Create a validation method like:
def validate_file_extension(value):
if not value.name.endswith('.pdf'):
raise ValidationError(u'Error message')
and include it on the FileField validators like this:
actual_file = models.FileField(upload_to='uploaded_files', validators=[validate_file_extension])
Also, instead of manually setting which extensions your model allows, you should create a list on your setting.py and iterate over it.
Edit
To filter for multiple files:
def validate_file_extension(value):
import os
ext = os.path.splitext(value.name)[1]
valid_extensions = ['.pdf','.doc','.docx']
if not ext in valid_extensions:
raise ValidationError(u'File not supported!')
Validating with the extension of a file name is not a consistent way. For example I can rename a picture.jpg into a picture.pdf and the validation won't raise an error.
A better approach is to check the content_type of a file.
Validation Method
def validate_file_extension(value):
if value.file.content_type != 'application/pdf':
raise ValidationError(u'Error message')
Usage
actual_file = models.FileField(upload_to='uploaded_files', validators=[validate_file_extension])
An easier way of doing it is as below in your Form
file = forms.FileField(widget=forms.FileInput(attrs={'accept':'application/pdf'}))
Django since 1.11 has a FileExtensionValidator for this purpose:
class SomeDocument(Model):
document = models.FileFiled(validators=[
FileExtensionValidator(allowed_extensions=['pdf', 'doc'])])
As #savp mentioned, you will also want to customize the widget so that users can't select inappropriate files in the first place:
class SomeDocumentForm(ModelForm):
class Meta:
model = SomeDocument
widgets = {'document': FileInput(attrs={'accept': 'application/pdf,application/msword'})}
fields = '__all__'
You may need to fiddle with accept to figure out exactly what MIME types are needed for your purposes.
As others have mentioned, none of this will prevent someone from renaming badstuff.exe to innocent.pdf and uploading it through your form—you will still need to handle the uploaded file safely. Something like the python-magic library can help you determine the actual file type once you have the contents.
For a more generic use, I wrote a small class ExtensionValidator that extends Django's built-in RegexValidator. It accepts single or multiple extensions, as well as an optional custom error message.
class ExtensionValidator(RegexValidator):
def __init__(self, extensions, message=None):
if not hasattr(extensions, '__iter__'):
extensions = [extensions]
regex = '\.(%s)$' % '|'.join(extensions)
if message is None:
message = 'File type not supported. Accepted types are: %s.' % ', '.join(extensions)
super(ExtensionValidator, self).__init__(regex, message)
def __call__(self, value):
super(ExtensionValidator, self).__call__(value.name)
Now you can define a validator inline with the field, e.g.:
my_file = models.FileField('My file', validators=[ExtensionValidator(['pdf', 'doc', 'docx'])])
I use something along these lines (note, "pip install filemagic" is required for this...):
import magic
def validate_mime_type(value):
supported_types=['application/pdf',]
with magic.Magic(flags=magic.MAGIC_MIME_TYPE) as m:
mime_type=m.id_buffer(value.file.read(1024))
value.file.seek(0)
if mime_type not in supported_types:
raise ValidationError(u'Unsupported file type.')
You could probably also incorporate the previous examples into this - for example also check the extension/uploaded type (which might be faster as a primary check than magic.) This still isn't foolproof - but it's better, since it relies more on data in the file, rather than browser provided headers.
Note: This is a validator function that you'd want to add to the list of validators for the FileField model.
I find that the best way to check the type of a file is by checking its content type. I will also add that one the best place to do type checking is in form validation. I would have a form and a validation as follows:
class UploadFileForm(forms.Form):
file = forms.FileField()
def clean_file(self):
data = self.cleaned_data['file']
# check if the content type is what we expect
content_type = data.content_type
if content_type == 'application/pdf':
return data
else:
raise ValidationError(_('Invalid content type'))
The following documentation links can be helpful:
https://docs.djangoproject.com/en/3.1/ref/files/uploads/ and https://docs.djangoproject.com/en/3.1/ref/forms/validation/
I handle this by using a clean_[your_field] method on a ModelForm. You could set a list of acceptable file extensions in settings.py to check against in your clean method, but there's nothing built-into settings.py to limit upload types.
Django-Filebrowser, for example, takes the approach of creating a list of acceptable file extensions in settings.py.
Hope that helps you out.

Django form validation, clean(), and file upload

Can someone illuminate me as to exactly when an uploaded file is actually written to the location returned by "upload_to" in the FileField, in particular with regards to the order of field, model, and form validation and cleaning?
Right now I have a "clean" method on my model which assumes the uploaded file is in place, so it can do some validation on it. It looks like the file isn't yet saved, and may just be held in a temporary location or in memory. If that is the case, how do I "open" it or find a path to it if I need to execute some external process/program to validate the file?
Thanks,
Ian
The form cleansing has nothing to do with actually saving the file, or with saving any other data for that matter. The file isn't saved until to you run the save() method of the model instance (note that if you use ModelName.objects.create() this save() method is called for you automatically).
The bound form will contain an open File object, so you should be able to do any validation on that object directly. For example:
form = MyForm(request.POST, request.FILES)
if form.is_valid():
file_object = form.cleaned_data['myFile']
#run any validation on the file_object, or define a clean_myFile() method
# that will be run automatically when you call form.is_valid()
model_inst = MyModel('my_file' = file_object,
#assign other attributes here....
)
model_inst.save() #file is saved to disk here
What do you need to do on it? If your validation will work without a temporary file, you can access the data by calling read() on what your file field returns.
def clean_field(self):
_file = self.cleaned_data.get('filefield')
contents = _file.read()
If you do need it on the disk, you know where to go from here :) write it to a temporary location and do some magic on it!
Or write it as a custom form field. This is the basic idea how I go about verification of an MP3 file using the 'mutagen' library.
Notes:
first check the file size then if correct size write to tmp location.
Will write the file to temporary location specified in SETTINGS check its MP3 and then delete it.
The code:
from django import forms
import os
from mutagen.mp3 import MP3, HeaderNotFoundError, InvalidMPEGHeader
from django.conf import settings
class MP3FileField(forms.FileField):
def clean(self, *args, **kwargs):
super(MP3FileField, self).clean(*args, **kwargs)
tmp_file = args[0]
if tmp_file.size > 6600000:
raise forms.ValidationError("File is too large.")
file_path = getattr(settings,'FILE_UPLOAD_TEMP_DIR')+'/'+tmp_file.name
destination = open(file_path, 'wb+')
for chunk in tmp_file.chunks():
destination.write(chunk)
destination.close()
try:
audio = MP3(file_path)
if audio.info.length > 300:
os.remove(file_path)
raise forms.ValidationError("MP3 is too long.")
except (HeaderNotFoundError, InvalidMPEGHeader):
os.remove(file_path)
raise forms.ValidationError("File is not valid MP3 CBR/VBR format.")
os.remove(file_path)
return args

Processing file uploads before object is saved

I've got a model like this:
class Talk(BaseModel):
title = models.CharField(max_length=200)
mp3 = models.FileField(upload_to = u'talks/', max_length=200)
seconds = models.IntegerField(blank = True, null = True)
I want to validate before saving that the uploaded file is an MP3, like this:
def is_mp3(path_to_file):
from mutagen.mp3 import MP3
audio = MP3(path_to_file)
return not audio.info.sketchy
Once I'm sure I've got an MP3, I want to save the length of the talk in the seconds attribute, like this:
audio = MP3(path_to_file)
self.seconds = audio.info.length
The problem is, before saving, the uploaded file doesn't have a path (see this ticket, closed as wontfix), so I can't process the MP3.
I'd like to raise a nice validation error so that ModelForms can display a helpful error ("You idiot, you didn't upload an MP3" or something).
Any idea how I can go about accessing the file before it's saved?
p.s. If anyone knows a better way of validating files are MP3s I'm all ears - I also want to be able to mess around with ID3 data (set the artist, album, title and probably album art, so I need it to be processable by mutagen).
You can access the file data in request.FILES while in your view.
I think that best way is to bind uploaded files to a form, override the forms clean method, get the UploadedFile object from cleaned_data, validate it anyway you like, then override the save method and populate your models instance with information about the file and then save it.
a cleaner way to get the file before be saved is like this:
from django.core.exceptions import ValidationError
#this go in your class Model
def clean(self):
try:
f = self.mp3.file #the file in Memory
except ValueError:
raise ValidationError("A File is needed")
f.__class__ #this prints <class 'django.core.files.uploadedfile.InMemoryUploadedFile'>
processfile(f)
and if we need a path, ther answer is in this other question
You could follow the technique used by ImageField where it validates the file header and then seeks back to the start of the file.
class ImageField(FileField):
# ...
def to_python(self, data):
f = super(ImageField, self).to_python(data)
# ...
# We need to get a file object for Pillow. We might have a path or we might
# have to read the data into memory.
if hasattr(data, 'temporary_file_path'):
file = data.temporary_file_path()
else:
if hasattr(data, 'read'):
file = BytesIO(data.read())
else:
file = BytesIO(data['content'])
try:
# ...
except Exception:
# Pillow doesn't recognize it as an image.
six.reraise(ValidationError, ValidationError(
self.error_messages['invalid_image'],
code='invalid_image',
), sys.exc_info()[2])
if hasattr(f, 'seek') and callable(f.seek):
f.seek(0)
return f