Appropriate file upload validation - django

Background
In a targeted issue tracking application (in Django) users are able to add file attachments to internal messages. Files are mainly different image formats, office documents and spreadsheets (Microsoft or OpenOffice), PDFs and PSDs.
A custom file field type (extending FileField) currently validates that the files don't exceed a given size and that the file's content_type is in the application's MIME type 'white list'. But as the user base is very varied (multinational and multi-platform) we frequently have to adjust our white list, because users on old or brand new application versions send different MIME types (even though the files are valid and are opened correctly by other users within the business).
Note: Files are not 'executed' by Apache, they are just stored (with Unix permissions 600) and can be downloaded by users.
Question
What are the pros and cons of the different types of validation?
A few options:
MIME type white list or black list
File extension white list or black list
Django file upload input validation and security even suggests "you have to actually read the file to be sure it's a JPEG, not an .EXE" (is that even viable when numerous types of files are to be accepted?)
Is there a 'right' way to validate file uploads?
Edit
Let me clarify. I can understand that actually checking the entire file in the program that it should be opened with to ensure it works and isn't broken would be the only way to fully confirm that the file is what it says it is, and that it isn't corrupted.
But the files in question are like email attachments. We can't possibly verify that every PSD is a valid and working Photoshop image, and the same goes for JPG or any other type. Even if a file is what it says it is, we couldn't guarantee that it's a fully functional file.
So what I was hoping to get at is: Is file magic absolutely crucial? What protection does it really add? And again, does a MIME type whitelist actually add any protection that a file extension whitelist doesn't? If a file has a file extension of CSV, JPG, GIF, DOC or PSD, is it really viable to check that it is what it says it is, even though the application itself doesn't depend on the file?
Is it dangerous to use a simple file extension whitelist that excludes the obvious offenders (EXE, BAT, etc.) and, I think, thereby disallows files that are dangerous to the users?

The best way to validate that a file is what it says it is is by using magic.
Er, that is, file magic. Files can be identified by the first few bytes of their content. It's generally more accurate than extensions or MIME types, since you're judging what a file is by what it contains rather than by what the browser or user claimed it to be.
There's an article on FileMagic on the Python wiki
You might also look into using the python-magic package
Note that you don't need to get the entire file before using magic to determine what it is. You can read the first chunk of the file and send those bytes to be identified by file magic.
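As a rough illustration of reading only the first chunk, the sketch below identifies a file from its leading bytes. It uses a small hand-written signature table instead of python-magic so that it runs with the standard library alone. The function name sniff_type, the labels and the table contents are assumptions made for this example; a real deployment would more likely hand the same chunk to libmagic.

```python
import io

# Hand-picked signatures (all at offset 0) for a few formats mentioned in the
# question. The table is illustrative and far from complete.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    (b"8BPS", "psd"),
    (b"MZ", "windows-executable"),
]


def sniff_type(fileobj, chunk_size=2048):
    """Guess a type from the first chunk of a binary file-like object.

    Returns one of the labels above, or None if no signature matches.
    The stream position is restored afterwards.
    """
    start = fileobj.tell()
    head = fileobj.read(chunk_size)
    fileobj.seek(start)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None


# An EXE renamed to photo.jpg is still recognised by its content.
renamed_exe = io.BytesIO(b"MZ\x90\x00" + b"\x00" * 60)
print(sniff_type(renamed_exe))  # windows-executable
```

Only chunk_size bytes are read, so the cost does not grow with the size of the upload.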
Clarification
Just to point out that using magic to identify a file really just means reading the first small chunk of a file. It's definitely more overhead than just checking the extension, but not too much work. All that file magic does is check that the file "looks" like it's the file you want. It's like checking the file extension, only you're looking at the first few chars of the content instead of the last few chars of the filename. It's harder to spoof than just changing the filename. I'd recommend against a MIME type whitelist. A file extension whitelist should work fine for your needs; just make sure that you include all possible extensions. Otherwise a perfectly valid file might be rejected just because it ends with .jpeg instead of .jpg.
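A minimal sketch of such an extension whitelist, assuming a hypothetical helper named has_allowed_extension and an illustrative set of extensions (neither comes from the question's code). It lower-cases the extension and lists both spellings of common suffixes so that .jpeg and .JPG are not rejected by accident.

```python
import os

# Illustrative whitelist; ".jpeg" and ".jpg" are both listed on purpose.
ALLOWED_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".gif", ".psd", ".pdf", ".csv",
    ".doc", ".docx", ".xls", ".xlsx", ".odt", ".ods",
}


def has_allowed_extension(filename, allowed=ALLOWED_EXTENSIONS):
    """Return True if the final extension of filename is whitelisted."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in allowed


print(has_allowed_extension("Report.PDF"))     # True
print(has_allowed_extension("holiday.jpeg"))   # True
print(has_allowed_extension("setup.exe"))      # False
print(has_allowed_extension("photo.jpg.exe"))  # False
```

Only the last extension is considered, so a double extension such as photo.jpg.exe is judged by .exe.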

Related

How to identify whether a file is DICOM or not which has no extension

I have few files in my GCP bucket folder like below:
image1.dicom
image2.dicom
image3
file1
file4.dicom
Now, I also want to check whether the files which have no extension (i.e. image3 and file1) are DICOM or not.
I use pydicom reader to read dicom files to get the data.
dicom_dataset = pydicom.dcmread("dicom_image_file_path")
Please suggest whether there is a way to validate, in a single statement, whether the above two files are DICOM or not.
You can use the pydicom.misc.is_dicom function or do:
from pydicom import dcmread
from pydicom.errors import InvalidDicomError

try:
    ds = dcmread(filename)
except InvalidDicomError:
    pass  # not a valid DICOM file
Darcy has the best general answer if you're looking to check file types prior to processing them. Apart from checking whether the file is a DICOM file, it will also make sure the file itself doesn't have any DICOM problems.
However, another way to check quickly, which may or may not be better depending on your use case, is to simply check the file's signature (or magic number, as they're sometimes known).
See https://en.wikipedia.org/wiki/List_of_file_signatures
Basically, if the four bytes at offsets 128 to 131 of the file (right after the 128-byte preamble) are "DICM", then it should be a DICOM file.
If you only want to check 'is it DICOM?' for a set of files, and do it very quickly, this might be another approach.
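The signature check described above can be sketched with the standard library only. The helper name looks_like_dicom is made up for this example, and how the bytes are fetched from the bucket is left out. Files that were written without the 128-byte preamble would be missed by this check, whereas a full parse may still be able to handle them.

```python
import io


def looks_like_dicom(fileobj):
    """Return True if bytes 128..131 of a binary stream spell b"DICM"."""
    fileobj.seek(128)
    return fileobj.read(4) == b"DICM"


# Synthetic data: a 128-byte preamble followed by the DICM marker.
sample = io.BytesIO(b"\x00" * 128 + b"DICM" + b"\x02\x00")
print(looks_like_dicom(sample))                # True
print(looks_like_dicom(io.BytesIO(b"short")))  # False
```

Only the first 132 bytes are needed, so the extension-less files do not have to be read in full.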

how can I get the original file type after user modification using PE header in legacy driver

I have developed a legacy driver to allow and block the transfer of specific files from hard disk to external devices. This works fine.
The issue I face is that the user is able to modify the file name and file type.
How can I find the original file type and file name after the user has modified them?
Is it possible to find the original file type using portable executable header ?
(Files type for example .pdf,.txt)
During my research I found that some tools are able to find the original file type. How do they do it? Something similar is done by http://checkfiletype.com/
Thanks in advance. Can you provide any solution for this.
This game has a name.
The name of the game is "last one who moves wins".
I will gladly exfiltrate files by base85 encoding them and dropping that as content in an allowed type.
Your users will no doubt come up with other clever ways.
Now if you were doing this for virus control I'd say just examine the file contents and, if it looks like an executable, say no. The first two bytes of a Windows executable (an EXE or DLL in PE format) are always MZ.
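To make the MZ check concrete, here is a user-space sketch in Python (not driver code). The function names are invented for this example. The second function goes one step further and follows the 32-bit offset stored at 0x3C of the DOS header to look for the "PE\0\0" signature, which is how the PE header is located. As the answer points out, this only recognises executables; it does nothing against content re-encoded into an allowed type.

```python
import io
import struct


def starts_with_mz(fileobj):
    """Return True if the stream begins with the DOS 'MZ' magic."""
    fileobj.seek(0)
    return fileobj.read(2) == b"MZ"


def has_pe_signature(fileobj):
    """Return True if the stream has an MZ header pointing at 'PE\\0\\0'."""
    fileobj.seek(0)
    header = fileobj.read(64)
    if len(header) < 64 or header[:2] != b"MZ":
        return False
    # e_lfanew: little-endian 32-bit offset of the PE header, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
    fileobj.seek(pe_offset)
    return fileobj.read(4) == b"PE\x00\x00"


# Synthetic 64-byte DOS header whose e_lfanew points just past itself.
stub = b"MZ" + b"\x00" * 58 + struct.pack("<I", 64) + b"PE\x00\x00"
print(starts_with_mz(io.BytesIO(stub)))    # True
print(has_pe_signature(io.BytesIO(stub)))  # True
```

Because the check reads content rather than the name, renaming the file to .pdf or .txt does not change the result.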

Shall I rename user uploaded files?

Django==1.11.6
There are file upload attacks. But modern Django seems well guarded against them.
Django security guide is here:
https://docs.djangoproject.com/en/1.11/topics/security/#user-uploaded-content
Concerning user uploaded files it is much shorter than other security guides.
On the Internet we can find this kind of advice:
The application should not use the file name supplied by the user.
Instead, the uploaded file should be renamed according to a
predetermined convention.
Well, I think that renaming is a good idea.
Shall I rename user uploaded files, or is keeping the original names not dangerous with modern Django?
There are a couple of reasons why you should (in some cases, need to) rename uploaded files. So it does not even matter whether Django has good measures against some attacks.
You have to deal with duplicate file names
File names can be veeery long
File names can contain characters that are not supported by the backend's file system
Special characters in file names can cause problems when you want to access the files using a URL
File names can contain lower/uppercase characters which might lead to duplicates on filesystems that are case-insensitive
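One possible renaming convention that sidesteps the problems listed above is to generate the stored name yourself and keep only a whitelisted, lower-cased extension. The sketch below is an assumption-laden example: the function name attachment_path, the "attachments/" prefix and the extension set are invented here. The (instance, filename) signature is the one Django expects for a callable passed as upload_to on a FileField.

```python
import os
import uuid

# Illustrative set; extend it to whatever the application accepts.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".pdf"}


def attachment_path(instance, filename):
    """Build a storage path that ignores the user-supplied base name."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        ext = ""  # drop unknown extensions rather than trusting them
    # 32 hex characters: fixed length, URL-safe, no case collisions.
    return "attachments/{}{}".format(uuid.uuid4().hex, ext)


print(attachment_path(None, "My Holiday Photo (final).JPG"))
```

If the original name is still needed for display, it could be kept in a separate model field instead of on disk.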

OSX- Auto Delete file after x-time

Can we add metadata to a file so that it is unlinked/removed automatically after x-time? That is, the system would automatically remove the file if it finds that particular metadata attached to it.
Note- file can be present at any location, and user may move that file anywhere on their system, but based on that metadata file should get deleted(i.e system should call unlink/remove) for that file.
Is there a cocoa/objective-c/c++ api to set such metadata/attributes of a file?
The main point is that I am creating an application through which I provide some trial files to the user, and those files are also usable by other applications which recognise them. After the trial expires, I want to delete those files, but the user can always move my files to a different location and use them forever. How can I protect those files from permanent use?
No, there is no built-in mechanism to auto-delete a file based on some metadata.
You could add the feature yourself, with an accompanying agent that would trawl for files with the metadata and delete them when the time came.
If you are doing this for good housekeeping you can follow @Petesh's answer.
If you are doing this because you really want those files gone then no. The user could move the file to a USB stick and remove it, or edit the metadata, etc.
Your earlier question "Completely restricting all types of access to a folder" seems to be addressing the same issue, and the suggestions are the same as given there: use encryption or implement your own file system.
E.g. have a special "trial file" format which is the same as the ordinary format - which is readable by other apps - but encrypted and includes an expiry date. Your app then decrypts the file, checks the date, and either does its thing or reports to the user the file is out of date.
The system isn't unbreakable, but it's a reasonable barrier: easy for you to do, too hard for the average user to break.
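A partial sketch of the "trial file" idea, written in Python for brevity rather than Objective-C. It covers only the expiry-date half: the payload is prefixed with an expiry timestamp and an HMAC tag so that editing the date is detected. It does not encrypt the payload, because a real cipher would need a platform or third-party crypto library that is deliberately left out here. The names wrap_trial, unwrap_trial and SECRET_KEY are hypothetical.

```python
import hashlib
import hmac
import struct
import time

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key shipped in the app
TAG_SIZE = hashlib.sha256().digest_size     # 32 bytes


def wrap_trial(payload, expires_at, key=SECRET_KEY):
    """Prefix payload with an expiry timestamp and an HMAC over both."""
    body = struct.pack(">Q", int(expires_at)) + payload
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return tag + body


def unwrap_trial(blob, now=None, key=SECRET_KEY):
    """Return the payload, or raise ValueError if modified or expired."""
    tag, body = blob[:TAG_SIZE], blob[TAG_SIZE:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if len(body) < 8 or not hmac.compare_digest(tag, expected):
        raise ValueError("trial file is corrupt or has been modified")
    (expires_at,) = struct.unpack(">Q", body[:8])
    current = time.time() if now is None else now
    if current >= expires_at:
        raise ValueError("trial file has expired")
    return body[8:]


blob = wrap_trial(b"trial data", expires_at=2000)
print(unwrap_trial(blob, now=1000))  # b'trial data'
```

Since the key lives inside the app, a determined user can still extract it, which matches the "reasonable barrier, not unbreakable" caveat above.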

How to get the content-type of a file

I am implementing a HTTP/1.0 server that processes GET or HEAD request.
I've finished Date, Last-Modified, and Content-Length, but I don't know how to get the Content-Type of a file.
It has to return directory for a directory (which I can detect using the stat() function) and, for a regular file, text/html for a text or HTML file and image/gif for an image or GIF file.
Should this be hard-coded, using the name of the file?
I wonder if there is any function to get this Content-Type.
You could either look at the file extension (which is what most web servers do; see e.g. the /etc/mime.types file), or you could use libmagic to automatically determine the content type by looking at the first few bytes of the file.
It depends how sophisticated you want to be.
If the files in question are all properly named and there are only a few types to handle, a switch based on the file suffix is sufficient. Going to the extreme case, making the right decision no matter what the file is would probably require either duplicating the functionality of the Unix file command or running it on the file in question (and then translating the output to the proper Content-Type).
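The simple suffix-based approach could look like the sketch below. It is written in Python to illustrate the lookup logic only; a C server would do the same with a small table and strcmp. The table contents, the fallback type and the function name content_type_for are assumptions for this example.

```python
import os
import tempfile

# Small illustrative table; a real server would load something like
# /etc/mime.types instead of hard-coding entries.
SUFFIX_TO_TYPE = {
    ".html": "text/html",
    ".htm": "text/html",
    ".txt": "text/plain",
    ".gif": "image/gif",
    ".jpg": "image/jpeg",
}
DEFAULT_TYPE = "application/octet-stream"


def content_type_for(path):
    """Map a path to a Content-Type value using only its suffix."""
    if os.path.isdir(path):
        return "directory"  # as required by the question
    suffix = os.path.splitext(path)[1].lower()
    return SUFFIX_TO_TYPE.get(suffix, DEFAULT_TYPE)


print(content_type_for("index.html"))            # text/html
print(content_type_for("logo.GIF"))              # image/gif
print(content_type_for(tempfile.gettempdir()))   # directory
```

Unknown suffixes fall back to a generic binary type here, which is one common choice; sniffing the content with libmagic is the more sophisticated alternative mentioned above.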