What are the ramifications of null bytes and multipart/form-data? - django

A third party is sending us a flat file that is supposed to contain exclusively printable ASCII characters. However, we've discovered that there's a string of about 50 0x00 bytes in the middle of the file.
We want to be able to upload the file to our web application, but I've discovered that Django doesn't seem to like the null characters in the multipart/form-data. If I remove the null characters, the upload succeeds. (Sorry I don't have the stack trace available at the moment, but will produce one if necessary)
We can pre-process the file to remove the null characters and/or work with our third party to fix their file generator, but I don't like to leave mystical problems like this.
Does this sound like a bug in Django or is there some aspect of multipart/form-data that I don't fully understand? Do I need to set a transfer encoding of some sort so Django doesn't get hung up on the null characters?

Nope, no transfer encoding is needed (or ever used by browsers) on form data. It's perfectly valid to include a run of 50 null bytes in a multipart/form-data value... indeed, given that most binary files contain a lot of nulls, that situation should arise as often as not with file uploads!
Which makes me question whether it's really a Django bug, or whether something else is going on. Let's have that stack trace!
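To see why no transfer encoding is needed, here is a minimal, Django-free sketch (the boundary and field names are made up) showing that a run of null bytes is just ordinary payload inside a multipart/form-data body:

```python
# Build a multipart/form-data body by hand; the 50-null run needs no
# transfer encoding and survives a naive extraction intact.
payload = b"printable ASCII" + b"\x00" * 50 + b"more printable ASCII"
boundary = b"----sketchboundary1234"  # made-up boundary

body = (
    b"--" + boundary + b"\r\n"
    b'Content-Disposition: form-data; name="file"; filename="data.txt"\r\n'
    b"Content-Type: application/octet-stream\r\n"
    b"\r\n"
    + payload +
    b"\r\n--" + boundary + b"--\r\n"
)

# Extract the part body: everything between the blank line that ends the
# part headers and the closing delimiter.
start = body.index(b"\r\n\r\n") + 4
end = body.index(b"\r\n--" + boundary + b"--")
assert body[start:end] == payload  # the nulls pass through untouched
```

The part body is delimited purely by the boundary, so any byte value (including 0x00) is legal in between; a conforming parser should never care.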

Related

how to store additional data in a text file apart from its content - C++

I am doing this small university project, where I have to create a console-based text editor with some features, and making files password protected is one of them. As I said, it's a university project for an introductory OOP course, so it doesn't need to be the most secure thing on the planet. I am planning to use a simple Caesar cipher to encrypt my file.
The only problem is the password. I'll use the password as the encryption key and it will work, but the problem is handling the case where the password is wrong. If no checks are placed then it would just show gibberish, but I want to make it so that it displays a message in case of a wrong password.
The idea I have come up with is to somehow store the hash of the unencrypted file in that text file (but it shouldn't show that hash when I open the file up with Notepad), and after decrypting with the provided password, I can just hash the contents and check whether it matches the hidden hash stored in that file. Is it possible?
I am using Windows, by the way, and portability is not an issue.
In general, you can't design a data format where nothing but plain text is a valid subset of it, yet which also carries metadata (a hash or anything else). Just think about it: how do you store something other than text (i.e. metadata) in a file where every single byte is to be interpreted as text?
That said, there are some tricks to hide the metadata in plain sight. With Unicode, the palette of tricks is wider. For example, you can use space-like characters to encode metadata, or to indicate its presence, in a way the user won't notice. Consider the Unicode BOM (U+FEFF, the zero-width no-break space): it won't be seen in Notepad, yet it can serve as a metadata marker. You could do something similar.
They already mentioned alternate data streams. While one of those could hold the metadata, an alternate data stream doesn't survive archival, e-mailing, uploading to Google Drive/OneDrive/Dropbox, copying with a program that isn't aware of it, or copying to a filesystem that doesn't support it (e.g. a CD or a flash drive formatted as FAT).
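There is also a simpler way out for this particular project: since the whole file is encrypted anyway, the hash doesn't need steganographic hiding at all. It can be prepended to the plaintext as a header line before encryption, so decrypting with the wrong password makes the check fail. A toy sketch (the Caesar-over-code-points cipher and the header format are my own invention, not a real scheme, and certainly not secure):

```python
import hashlib

def caesar(text: str, shift: int) -> str:
    # Toy cipher: shift every code point modulo the Unicode range.
    return "".join(chr((ord(c) + shift) % 0x110000) for c in text)

def save(plaintext: str, shift: int) -> str:
    # Store the plaintext's hash as a header line, then encrypt the whole thing.
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return caesar(digest + "\n" + plaintext, shift)

def load(blob: str, shift: int) -> str:
    decrypted = caesar(blob, -shift)
    digest, _, body = decrypted.partition("\n")
    # surrogatepass so hashing can't blow up on garbage produced by a wrong key
    if hashlib.sha256(body.encode("utf-8", "surrogatepass")).hexdigest() != digest:
        raise ValueError("wrong password or corrupted file")
    return body
```

With the right key the header round-trips and the hashes match; with a wrong key the decrypted "hash" line is gibberish that won't match the hash of the decrypted body, so you get a clean error message instead of displaying garbage.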

Using General Unicode Properties

I am trying to take advantage of the regex functionality: \p{UNICODE PROPERTY NAME}
However, I am struggling to understand the mapping of those property names.
I went directly to the Unicode.org website (http://www.unicode.org/Public/UCD/latest/ucd/) and downloaded the file 'UnicodeData.txt', which has the category listed... but this only shows 27,268 character values.
But I understand there are 65k characters in UTF-8 or UCS-2... so I am confused why the Unicode.org download only has 27k rows.
... am I missing a point here somewhere ?
I am sure I'm just being blind to something simple here ... if someone can help me understand.... I'd be grateful !
Everything is fine so far. The characters you see are all but the CJK ones (Chinese-Japanese-Korean). The Unicode consortium left those out of the main UnicodeData file to keep it at a reasonable size.
If you want to look up properties for single characters only (and not in bulk), you can use websites that prepare that data for you, like Graphemica, FileFormat or (my own) Codepoints.net.
If, however, you need bulk lookups, Unicode also provides the data as an XML file with a specific syntax that groups codepoints together. That might be the best choice for processing the data.
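If you happen to be working in Python, note that the stdlib's unicodedata module already bundles the full character database, including the CJK ranges that have no individual rows in UnicodeData.txt:

```python
import unicodedata

# General_Category lookup for a few characters, including a CJK ideograph
# that UnicodeData.txt only covers via a range record.
for ch in "A5\u00e9\u4e2d":
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  "
          f"{unicodedata.name(ch, '<unnamed>')}")
```

(As an aside on \p{...}: Python's built-in re module does not support Unicode property escapes; the third-party regex package does.)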

Appropriate file upload validation

Background
In a targeted issue-tracking application (in Django), users are able to add file attachments to internal messages. Files are mainly different image formats, office documents and spreadsheets (Microsoft or OpenOffice), PDFs and PSDs.
A custom file field type (extending FileField) currently validates that files don't exceed a given size and that the file's content_type is in the application's MIME-type whitelist. But as the user base is very varied (multinational and multi-platform), we frequently have to adjust our whitelist, because users on old or brand-new application versions send different MIME types (even though the files are valid, and open correctly for other users within the business).
Note: Files are not 'executed' by apache, they are just stored (with unix permissions 600) and can be downloaded by users.
Question
What are the pros and cons of the different types of validation?
A few options:
MIME-type whitelist or blacklist
File-extension whitelist or blacklist
Django file upload input validation and security even suggests "you have to actually read the file to be sure it's a JPEG, not an .EXE" (is that even viable when numerous types of files are to be accepted?)
Is there a 'right' way to validate file uploads?
Edit
Let me clarify. I can understand that actually checking the entire file in the program that it should be opened with to ensure it works and isn't broken would be the only way to fully confirm that the file is what it says it is, and that it isn't corrupted.
But the files in question are like email attachments. We can't possibly verify that every PSD is a valid and working Photoshop image; the same goes for JPG or any other type. Even if a file is what it says it is, we couldn't guarantee that it's fully functional.
So what I was hoping to get at is: is file magic absolutely crucial? What protection does it really add? And again, does a MIME-type whitelist actually add any protection that a file-extension whitelist doesn't? If a file has an extension of CSV, JPG, GIF, DOC or PSD, is it really viable to check that it is what it says it is, even though the application itself doesn't depend on the file?
Is it dangerous to use a simple file-extension whitelist, excluding the obvious offenders (EXE, BAT, etc.) and disallowing files that are dangerous to the users?
The best way to validate that a file is what it says it is, is by using magic.
Er, that is, file magic. Files can be identified by the first few bytes of their content. It's generally more accurate than extensions or MIME types, since you're judging what a file is by what it contains rather than by what the browser or user claimed it to be.
There's an article on FileMagic on the Python wiki
You might also look into using the python-magic package
Note that you don't need to get the entire file before using magic to determine what it is. You can read the first chunk of the file and send those bytes to be identified by file magic.
Clarification
Just to point out that using magic to identify a file really just means reading the first small chunk of the file. It's definitely more overhead than just checking the extension, but not too much work. All that file magic does is check that the file "looks" like the file you want. It's like checking the file extension, only you're looking at the first few bytes of the content instead of the last few characters of the filename. It's harder to spoof than just changing the filename. I'd recommend against a MIME-type whitelist. A file-extension whitelist should work fine for your needs; just make sure that you include all possible extensions. Otherwise a perfectly valid file might be rejected just because it ends with .jpeg instead of .jpg.
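For a concrete feel for what that first-bytes check looks like, here's a library-free sketch (the signature table is a small made-up sample; python-magic knows far more formats, including ones whose signature isn't at offset 0):

```python
# A few well-known "magic number" signatures at offset 0.
MAGIC_SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"%PDF-": "application/pdf",
    b"8BPS": "image/vnd.adobe.photoshop",  # PSD
}

def sniff(first_bytes):
    """Return a MIME type guessed from a file's leading bytes, or None."""
    for sig, mime in MAGIC_SIGNATURES.items():
        if first_bytes.startswith(sig):
            return mime
    return None

# A renamed .exe still announces itself: Windows executables begin with b"MZ",
# which matches nothing in the whitelist above.
assert sniff(b"MZ\x90\x00\x03") is None
assert sniff(b"\x89PNG\r\n\x1a\n....") == "image/png"
```

Only the first couple of kilobytes need to be read, which is why this check is cheap even for large uploads.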

Once something is HTML or URL encoded should it ever be decoded? Is encoding enough?

First-time AntiXSS 4 user here. In order to make my application more secure, I've used Microsoft.Security.Application.Encoder.UrlEncode on QueryString parameters and Microsoft.Security.Application.Encoder.HtmlEncode on a parameter entered into a form field.
I have multiple questions and I would appreciate it if you could try to answer all of them (it doesn't have to be at once or by the same person; any answers at all would be very helpful).
My first question is: am I using these methods appropriately (that is, am I using an appropriate AntiXSS method for an appropriate situation)?
My second question is: once I've encoded something, should it ever be decoded? I am confused because I know that the HttpUtility class provides ways to both encode and decode, so why isn't the same done in AntiXSS? If this helps, the parameters that I've encoded are never going to be treated as anything other than text inside the application.
My third question is related to the second one, but I wanted to emphasize it because it's important (and is probably the source of my overall confusion). I've heard that the .NET framework automatically decodes things like QueryStrings, hence no need for an explicit decode method. If that is so, then what is the point of HTML-encoding something in the first place if it is going to be undone? It just... doesn't seem safe. What am I missing, especially since, as mentioned, the HttpUtility class provides for decoding?
And the last question: does AntiXSS help against SQL injection at all, or does it only protect against XSS attacks?
It's hard to say if you're using it correctly. If you use UrlEncode when building a query string which is then output as a link in a page, then yes, that's correct. If you're HTML-encoding when you write something out as a value, then yes, that's correct (well, kind of: if it's set via an HTML attribute you ought to use HtmlAttributeEncode, but they're pretty much the same).
The .NET decoders work with AntiXSS's encoded values, so there was no point in me rewriting them. *grin*
The point of encoding is that you do it when you output. So, for example, if a user has input window.alert('Numpty!') on a form and you just put that input raw in your output, the JavaScript would run. If you encode it first, you would see < become &lt; and so on.
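The thread is about .NET's AntiXSS, but the encode-at-output idea is language-agnostic; here's the same principle illustrated with Python's stdlib:

```python
from html import escape

user_input = "<script>window.alert('Numpty!')</script>"

# Encoding at output time turns active markup into inert text: the angle
# brackets and quotes become character references the browser renders literally.
safe = escape(user_input, quote=True)
print(safe)
# &lt;script&gt;window.alert(&#x27;Numpty!&#x27;)&lt;/script&gt;
```

The stored value stays raw; only the representation sent to the browser is encoded, which is why decoding it back (as .NET does with query strings on input) is not a safety problem as long as you encode again on every output.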
No, SQL injection is an entirely different problem.

How can I get Django to output bad characters instead of returning an error

I've got some weird characters in my database, which seem to mess up Django when returning a page. I get this error:
TemplateSyntaxError at /search/legacy/
Caught an exception while rendering: Could not decode to UTF-8 column 'maker' with text 'i� G�r'
(The actual text is slightly different, but since it is a company name I've changed it.)
How can I get Django to output this text? I'm currently running the site on SQLite (fast dev); is this the issue?
Also, on a completely unrelated note, is it possible to use a database view?
Thanks!
Probably not.
Django uses UTF-8 strings internally, and it seems that your database returns an invalid string. You should fix the data in the database and use UTF-8 exclusively throughout your application (data import, database, templates, source files, ...).
I have a related problem with a site owner who uses Apple's Pages for article creation, then does a copy-paste into a Django admin textbox. This process creates 'funny characters' that screw up Django and/or MySQL (you wouldn't believe the number of different double left/right quote characters there are). I can't 'fix' the customer, so I have a function that looks for known strangeness and translates it to something useful beforehand. A complete PITA.
That's a bit of a confusing error message, and without more details I'm not clear on the source of the problem (the phrasing "decode to UTF-8" seems backwards, as normally you would encode to UTF-8). Perhaps Django is expecting to find data in some other encoding, trying to decode it and re-encode it as UTF-8, and choking on characters that aren't valid in the encoding it expects?
In general, you want to make sure that you're storing UTF-8 in your database, and that internally you're using unicode objects (not str objects) everywhere in your code.
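If you really do want to display something rather than crash while you clean the data up, the usual escape hatch is decoding with an error handler at the point where you read the bytes (the sample bytes here are invented, not the asker's actual column; the real fix remains repairing the stored encoding):

```python
raw = b"caf\xe9 G\xf6r"  # latin-1 bytes that are invalid as UTF-8

try:
    raw.decode("utf-8")          # strict decode raises on the bad bytes
except UnicodeDecodeError:
    print("strict decode fails")

# errors="replace" never raises; bad bytes become U+FFFD ('caf? G?r'-style output)
print(raw.decode("utf-8", errors="replace"))
# Decoding with the correct source encoding recovers the text.
print(raw.decode("latin-1"))
```

Note that errors="replace" loses information (every bad byte collapses to the replacement character), so treat it as a display-only stopgap, not a data migration strategy.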
Some other reading that may be helpful:
Unicode in the real world
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Django Tips: UTF-8, ASCII Encoding Errors, Urllib2, and MySQL