Regular expression to check if a file is a text file - regex

I have an uploader in JSP through which several kinds of files can be uploaded. I need to perform a check (I think with a regular expression, just a simple check on the file extension) so that the JSP engine can read the content of the file and tell whether it is an image or a plain text file. I can accept only plain text (text or XML) and must discard all other kinds of files. Could someone help me or suggest another way to do that?

As a very basic check you could verify the file extension with something like \.txt$; however, this doesn't prevent people from uploading other file types with a .txt extension. You might be better off checking the MIME type of the uploaded file; a JSP example can be found here.
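A minimal sketch of combining the two checks, written in Python rather than JSP purely for illustration. The names ALLOWED_NAME and looks_like_plain_text are hypothetical, and treating "decodes as UTF-8 and contains no NUL byte" as "plain text" is an assumption that may not suit every upload.

```python
import re

# Hypothetical whitelist: only names ending in .txt or .xml pass the first check.
ALLOWED_NAME = re.compile(r"\.(txt|xml)$", re.IGNORECASE)

def looks_like_plain_text(filename, data):
    """Return True if the name has an allowed extension and the bytes look like text."""
    if not ALLOWED_NAME.search(filename):
        return False
    # A NUL byte almost never appears in plain text or XML.
    if b"\x00" in data:
        return False
    # Assumption: uploads are expected to be UTF-8 (which also covers ASCII).
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

The extension test alone would accept a renamed image; the content test is what rejects it.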

Related

Does fin in C++ work with .doc files?

I used fin to read in a .doc file, and then store all the text in a string. When I tried printing the string, I just saw unknown characters.
When I copied the contents of the .doc file into a .txt file and then read the .txt file in using fin, everything worked fine.
My question is whether fin works with complex files (such as .doc) or just with .txt files. I only had text in my .doc file (no graphics or anything), but the font was Calibri, which is not the font that fout uses to print text to a .doc file.
If by fin you mean an ifstream, yes, it will work to read the file contents. However, in the case of complex files you have to deal with the file format; the C++ library will not automatically extract just the text contents. In the case where you saved the file as text, the text is all that is left, and so that's all a stream would read.
fstream by default does all operations in text mode, and .doc files use the MS-DOC binary file format. So when you tried to read the .doc file and print it, it probably showed characters that you couldn't understand because they were binary data.
If you try to read any file with fstream, it does read it.
I tried reading a .mp4 file in binary mode using fstream and it did read the file (I can confirm that because I wrote the read contents to another file, and that file turned out to be the same video).
So the answer to your question is that you can read any file with fstream, but fstream does these operations in only two ways: text or binary.
So reading just any file won't do much good unless you want to do something like copying the file contents to another file.
You first need to understand the .doc file format. Start by reading the doc (computing) wikipage. The format is very complex (so you'll need months of work at least) but more or less documented.
You could consider a different approach to your overall goal. For example, if you need to parse a .doc file (produced by some Microsoft Word software), you might use libreoffice, which provides a library to parse it, or you could find another library (e.g. DocxFactory, wvware, ...), or you could use a COM interface to Word (on a Microsoft Windows operating system with Microsoft Word installed).
If your goal is to generate some document, you might consider the PDF format (which is a standard), perhaps using some text formatter like LaTeX or Lout to generate it, or some library (e.g. cairo, PoDoFo, etc ...).
"My question is whether fin works with complex files (such as .doc)"
BTW, C++ standard IO is capable of reading binary files, but you need to write your own parser for them (so you need to understand your file format precisely). You should prefer open formats to proprietary formats.

Opening an existing .doc file using ofstream in C++

Assuming I have a file with a .doc extension on the Windows platform, how can I open the file to output its contents on the screen using the ofstream object in C++? I am aware that the object can be used to open files in text and binary modes. But I would like to know if a .doc (or even .pdf) file can be opened and its contents read.
I've never actually done this before, but after reading up on it, I think I might have a suggestion. The .docx format is actually just XML that is zipped up. After unzipping, the file is located at word/document.xml. Doing this in a program is where it gets fun.
Two options: If you're using C++ CLR (.NET) then Microsoft has an SDK for you. It should make it pretty easy to open Office documents.
Otherwise if you're just using regular C++, you might have to do some extra work.
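Open the file and unzip it using a library that reads ZIP archives (for example minizip, which builds on zlib; zlib by itself handles the compressed streams, not the archive layout)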
Find the document.xml file inside
Parse the XML document. You'll probably want to use some kind of XML parsing library for this. You'll have to look up the specs for the XML to figure out how to get the text you want.
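The three steps above can be sketched as follows. This is an illustration in Python, not the C++ the question asks about, and it applies only to .docx files, not to legacy binary .doc files. The function name docx_text is hypothetical; the archive member word/document.xml comes from the answer above, and the sketch assumes the body text sits in w:t elements inside w:p paragraphs of the WordprocessingML namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's "{uri}tag" form.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text(path_or_file):
    """Return the paragraph text of a .docx file, one paragraph per line."""
    # Steps 1 and 2: open the ZIP container and pull out word/document.xml.
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    # Step 3: parse the XML and join the text runs of each paragraph.
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        runs = [t.text or "" for t in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Formatting, tables, headers and footers are ignored here; a real extractor would need to consult the format specification for those.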
The C++ standard library has the ifstream class, which can be used to read simple text files and binary files too.
It is up to you to interpret the bytes in the file. To interpret a binary file properly you need to know the format of the file.
If you think of MS Word files then I would start from here: http://en.wikipedia.org/wiki/Office_Open_XML to understand MS Word 2007 format.
You might find the Boost Iostreams library ( http://www.boost.org/doc/libs/1_52_0/libs/iostreams/doc/home.html ) somewhat useful if you want to write a filter yourself.

how to get xsl from existing pdf?

Is it possible to get the .xsl file from an existing .pdf file?
I know that with Apache FOP you can get a .pdf file from a .xml and .xsl but I would like to go in the other direction. Any idea?
XML+XSL->PDF works with Apache FOP, but is PDF->XSL somehow possible?
The reason why I would like to do that is because I want to open a PDF that has a form inside, edit it adding some information to the form and then save it again as PDF.
I already have the edited form as .xml and I'm trying to generate the PDF, but then I need a .xsl file for the layout... so I thought that maybe I could reuse the layout from the original PDF, as they will be the same. Is there a better approach? I would like to avoid creating a specific XSL file for every form.
Thanks
Definitely not the XSLT file, since that's not even part of what FOP does. FOP only works with FO documents, the fact that it allows you to use XML+XSLT to get the FO source is just a nice usability feature. However, once it gets the FO file, it doesn't know how that was obtained, so it can't embed in any way the XSLT file.
You could post-process the PDF file using another tool, like PDFBox, to embed any metadata you want.

Determine if file is binary or text

Is there a way to determine if a file is a binary or text file using the File Management functions or MFC?
In the File Management functions, GetFileType doesn't seem to distinguish between binary and text files. Same with the dwFileAttributes attribute here.
In MFC, I tried looking at CFile::GetStatus(), but the m_attribute doesn't say anything about whether files are binary or text.
Does anyone know a way to do this using one of these two libraries? Thank you.
(I'd like to know because I am trying to make a function that recursively goes through a directory. I rewrite the text files (using CStdioFile) and replace some words here and there... but it seems to corrupt any images I have in the directory. I'd like to be able to copy the images too... but I need a way to distinguish between binary and text files so I can treat them differently.)
As far as I know, there's no simple API to do this, MFC or otherwise. However, there's a bunch of useful ideas in these similar questions:
How do I distinguish between 'binary' and 'text' files?
How to identify the file content as ASCII or binary
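One common heuristic, sketched here in Python rather than with the Win32 or MFC APIs: sample the start of the file and call it binary if it contains a NUL byte or a large share of non-printable bytes. The function names, the 1024-byte sample and the 30% threshold are all assumptions chosen for illustration, and text in encodings such as UTF-16 would be misclassified.

```python
# Bytes treated as "texty": printable ASCII plus common whitespace/control codes.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\f\b"

def chunk_is_binary(chunk, threshold=0.30):
    """Guess whether a sample of bytes comes from a binary file."""
    if not chunk:
        return False  # an empty file is treated as text
    if b"\x00" in chunk:
        return True   # NUL bytes are a strong sign of binary data
    # Delete every texty byte; whatever remains is suspicious.
    suspicious = chunk.translate(None, TEXT_BYTES)
    return len(suspicious) / len(chunk) > threshold

def file_is_binary(path, sample_size=1024):
    """Apply the heuristic to the first sample_size bytes of a file."""
    with open(path, "rb") as handle:
        return chunk_is_binary(handle.read(sample_size))
```

In the directory walk described in the question, files flagged as binary would be copied byte for byte and only the rest rewritten.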

Appropriate file upload validation

Background
In a targeted issue tracking application (in Django) users are able to add file attachments to internal messages. Files are mainly different image formats, office documents and spreadsheets (Microsoft or Open Office), PDFs and PSDs.
A custom file field type (extending FileField) currently validates that the files don't exceed a given size and that the file's content_type is in the application's MIME type 'white list'. But as the user base is very varied (multinational and multi-platform), we frequently have to adjust our white list, because users with old or brand new application versions send different MIME types (even though the files are valid and are opened correctly by other users within the business).
Note: Files are not 'executed' by apache, they are just stored (with unix permissions 600) and can be downloaded by users.
Question
What are the pros and cons of the different types of validation?
A few options:
MIME type white list or black list
File extension white list or black list
Django file upload input validation and security even suggests "you have to actually read the file to be sure it's a JPEG, not an .EXE" (is that even viable when numerous types of files are to be accepted?)
Is there a 'right' way to validate file uploads?
Edit
Let me clarify. I can understand that actually checking the entire file in the program that it should be opened with to ensure it works and isn't broken would be the only way to fully confirm that the file is what it says it is, and that it isn't corrupted.
But the files in question are like email attachments. We can't possibly verify that every PSD is a valid and working Photoshop image, and the same goes for JPG or any other type. Even if a file is what it says it is, we couldn't guarantee that it's a fully functional file.
So what I was hoping to get at is: Is file magic absolutely crucial? What protection does it really add? And again, does a MIME type whitelist actually add any protection that a file extension whitelist doesn't? If a file has a file extension of CSV, JPG, GIF, DOC or PSD, is it really viable to check that it is what it says it is, even though the application itself doesn't depend on the file?
Is it dangerous to use a simple file extension whitelist that excludes the obvious offenders (EXE, BAT, etc.) and so, I think, disallows files that are dangerous to the users?
The best way to validate that a file is what it says it is, is by using magic.
Er, that is, magic. Files can be identified by the first few bytes of their content. It's generally more accurate than extensions or mime types, since you're judging what a file is by what it contains rather than what the browser or user claimed it to be.
There's an article on FileMagic on the Python wiki
You might also look into using the python-magic package
Note that you don't need to get the entire file before using magic to determine what it is. You can read the first chunk of the file and send those bytes to be identified by file magic.
Clarification
Just to point out that using magic to identify a file really just means reading the first small chunk of a file. It's definitely more overhead than just checking the extension, but not too much work. All that file magic does is check that the file "looks" like the file you want. It's like checking the file extension, only you're looking at the first few chars of the content instead of the last few chars of the filename. It's harder to spoof than just changing the filename. I'd recommend against a MIME type whitelist. A file extension whitelist should work fine for your needs; just make sure that you include all possible extensions. Otherwise a perfectly valid file might be rejected just because it ends with .jpeg instead of .jpg.
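To make the idea concrete, here is a hand-rolled sketch that compares the first bytes of an upload against a few well-known signatures. It does not use the python-magic package mentioned above; SIGNATURES and sniff_type are hypothetical names, and the table covers only a handful of the formats listed in the question.

```python
# (leading bytes, label) pairs for a few widely documented file signatures.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    (b"8BPS", "psd"),
    (b"PK\x03\x04", "zip"),  # also the container for .docx, .xlsx and .odt
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc and .xls
    (b"MZ", "windows-executable"),
]

def sniff_type(head):
    """Return a label for the first matching signature, or None if nothing matches."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None
```

Only the first few bytes are needed, so a Django upload handler could call sniff_type on the first chunk it receives. Formats without a fixed signature, such as CSV, return None and would still need another check such as the extension whitelist.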