Does fin in C++ work with .doc files? - c++

I used fin to read in a .doc file, and then store all the text in a string. When I tried printing the string, I just saw unknown characters.
When I copied the contents of the .doc file into a .txt file and then read the .txt file in using fin, everything worked fine.
My question is whether fin works with complex files (such as .doc) or just with .txt files. I only had text in my .doc file (no graphics or anything), but the font was Calibri, which is not the font that fout uses to print text to a .doc file.

If by fin you mean an fistream yes it will work to read the file contents, however in the case of complex files you have to deal with the file format, the c++ library will not automatically extract just the text contents. In the case where you saved the file as text that's all that is left and so that's all a stream would read.

fstream by default does all operations in text mode and .doc files use MS-DOC binary file format. So probably when you tried to read the doc file and print it, it showed characters that you couldn't understand (probably that was binary).
If you try to read any file in fstream, it does read it.
I tried reading a .mp4 file in binary using fstream and it did read the file( i can assure that because i pasted the read contents in another file and that file turned out to be the same video).
So answer to your question is you can read any file in fstream but fstream does all this operations in only two ways, either text or binary.
So reading just any file won't do much good unless you want to do something like copying the file contents to another.

You first need to understand the .doc file format. Read first the doc (computing) wikipage. It is very complex (so you'll need months of work at least) but more or less documented.
You could consider a different approach to your overall goal. For example, if you need to parse a .doc file (provided by some Microsoft Word software), you might use libreoffice which provides some library to parse it, or you could find another library (e.g. DocxFactory, wvware, ...), or you could use some COM interface to Word (on a Microsoft Windows operating system with MicroSoft Word installed).
If your goal is to generate some document, you might consider the PDF format (which is a standard), perhaps using some text formatter like LaTeX or Lout to generate it, or some library (e.g. cairo, PoDoFo, etc ...).
My question is whether fin works with complex files (such as .doc)
BTW, C++ standard IO is capable of reading binary files, but you need to write your parser for them (so you need to understand precisely your file format). You should prefer open formats to proprietary formats.

Related

outputting pdf files using C++ file I/O

I was trying to output a string to a file using File I/O of C++, but I decided to change the extension of the output file to .pdf, .docx and so on.
None of them seem to open up.
So are there any more internal translations required in order to make that file a proper(file which opens) pdf or docx??
void convert(char *file){
ifstream in(file);
ofstream out("out.pdf",ios::out);
char str;
while(in && out){
in.get(str);
out.put(str);
}
in.close();
out.close();
cout<<"Done!!!";
}
PDF is (and also docx which actually is OOXML; and of course also OpenDocument & DocBook) a file format (and quite a complex one), specified in a complex English document (the ISO 32000 standard) defining which sequences of bytes are valid (and what they represent).
Grossly speaking a PDF file contains elementary steps to draw ink on paper or pixels on screen. With gross simplification, these steps include things like "move the pen to position x=23 y=50", "choose the Arial font of 10pt", "draw the word abc in that font at that position", "choose the pink color for the pen" etc etc, but the details are much more complex.
File extensions don't mean anything (except as an important convention).
To generate a PDF file, you either need to spend weeks or months to study the specification of that PDF format, or to use some library for that (see this question, and also podofo, poppler and several other libraries). Even with the help of a library, you need to understand something about that format.
Are there any more internal translations required in order to make that file
I'm not sure to understand what you mean by translation (a better word would be "converter"). You could generate PDF by sending some other content to a suitable program emitting PDF. You might consider using some document formatter (like LaTeX thru pdflatex, or Lout, or some older variant of troff, etc..) which you would feed with a higher-level file format containing text formatting directives mixed with (some encoding of) the text to be formatted. On Unix like systems you might even use some command pipeline (to avoid writing some "temporary" file), perhaps using popen(3) and related functions.
To write a pdf, docx, etc. you need to setup a file header, and format the data correctly. Changing the extension does not really do much if anything at all.
You will have to use an outside an outside library or format the code yourself. Below are a couple of examples of extensions you can use:
PDF
docx

How do I input and output various file types in c++

I've seen a lot of examples of i/o with text files I'm just wondering if you can do the same with other file types like mp3's, jpg's, zip files, etc..?
Will iostream and fstream work for all of these or do I need another library? Do I need a new sdk?
It's all binary data so I'd think it would be that simple. But I've been unpleasently surprised before.
Could I convert all files to text or binary?
It depend on what you mean by "work"
You can think of those files as a book written in Greek.
If you want to just mess with binary representation (display text in Greek on screen) then yes, you can do that.
If you want to actually extract some info: edit video stream, remove voice from audio (actually understand what is written), then you would need to either parse file format yourself (learn Greek) or use some specialized library (hire a translator).
Either way, filestreams are suited to actually access those files data (and many libraries do use them under the hood)
You can work on binary streams by opening them with openmode binary :
ifstream ifs("mydata.mp3", ios_base::binary);
Then you read and write any binary content. However, if you need to generate or modify such content, play a video or display a piture, the you you need to know the inner details of the format you are using. This can be exremely complex, so a library would be recomended. And even with a library, advanced programming skills are required.
Examples of open source libraries: ffmpeg for usual audio/video format, portaudio for audio, CImg for image processing (in C++), libpng for png graphic format, lipjpeg for jpeg. Note that most libraries offer a C api.
Some OS also supports some native file types (example, windows bitmaps).
You can open these files using fstream, but the important thing to note is you must be intricately aware of what is contained within the file in order to process it.
If you just want to open it and spit out junk, then you can definitely just start at the first line of the file and exhaustively push all data into your console.
If you know what the file looks like on the inside, then you can process it just as you would any other file.
There may be specific libraries for processing specific files, but the fstream library will allow you to access any file you'd like.
All files are just bytes. There's nothing stopping you from reading/writing those bytes however you see fit.
The trick is doing something useful with those bytes. You could read the bytes from a .jpg file, for example, but you have to know what those bytes mean, and that's complicated. Usually it's best to use libraries written by people who know about the format in question, and let them deal with that complexity.

How to read files containing different formats/extensions?

I am very new to c++ and I wanted to write a program that would read and extract data from files with/of different format (example: .dat). I just want to read and extract the data from it. Some people say something about file headers, structures and bodies, what are they actually ?
Basically, you need a different strategy (code) for each file format.
A file with extension .txt usually contains ASCII data and is simple to read.
A file with extension .doc contains binary data for MS Word and is virtually impossible to read with something other than MS Word.
All other file formats are somwhere inbetween these extremes.
The file extension will give you a hint about the files contents. Often people use the extension as a synonym for the actual file format. So we say "I have a .WAV file" when we actually mean "I have a binary file in RIFF/WAVE format with an .wav extension"
Some file formats (Like .WAV .MP3 .TIFF and so on) contain a (well documented) header which describes the file's structure in the first few bytes.
So Header means: The first few Bytes of a file which describe the contents/structure/layout of the file. For example in the first few bytes of a .WAV file you'll find number of channels, sampling rate, etc which explains how the rest of the file needs to be read in, interpreted and send to an audio device.
Some other popular extensions (like .dat .bin .hex) say not much more than "this is binary data in an unspecified format/structure." So you need (a lot of) additional information to read these files in a meaningful way.
Wikipedia article about file extensions
Wikipedia article about file formats
For each type of file there will be a specification defining the format. There may well be headers(information about the data stored in the file) and data structures(ways of organising the actual data in the file), others may just be plain text files where a new line character separates lines.
To write code to interpret a file, for instance .jpg you would need to get the file format specification for JPEG, read it, and then implement it in your code. You would do this for each file format you needed to read in your program.
The structure and content of common files like images, videos, sound, CAD data, text processing... is extremely complex. Mastering them would take you more than a lifetime.
Files often begin with a signature, i.e. a small number of bytes that is deemed to be unique and can be used to check the file type. But there is no standardization at all. For instance, a MS Bitmap image begins with the letters "BM", while xml content begins with a string like "?xml version="1.0" encoding="UTF-8"?".
A header is an initial section of the file that gives information about the data itself such as data type and size, allowing the interpret the subsequent data correctly. For instance, the TIFF image format has a complex header that can contain dozens of "tags" before the bitmap data.
Here is an example.

Difference between .dat and .txt in c++

I am using .txt files in my program for reading and writing records (records contains both text and numerals). Recently i came to know that .dat file also can be used like .txt for file operations. I would like to know the difference between the two and the advantages and disadvantages of one over another.
Text files or .txt files are a bit hard to parse in programs and easy to read. whereas .dat is usually used to store data that is not just plain text.
Generally .txt files contains letters, characters and symbols which is readable.
.dat is binary text file in which data is not always printable on screen.
The extension of a file is a helper so that the operating system (or user) can choose the appropriate program to open it. The actual file contents do not matter. There are some conventions what extensions to use but there is nothing from keeping you to use any arbitrary extension for your files. For instance you can rename a .jar file to .zip-file and be able to open the file with pkunzip.
So for C++ the extension does not matter, but for you as a programmer it may give a hint of the file contents i.e. open it in text or binary mode.
In most languages like C/C++ there is no difference what is your file type in file operations(Read, Write or Edit).
just if you want to work with binary files you should open them in binary format because if you reached \0 in text file it's file end. Dat files are binary too!
If you want to store and read some data, XML file and somtimes DAT files are better because of good libraries to read them. they don't need hard parsing of Text files

Opening an existing .doc file using ofstream in C++

Assuming I have a file with .doc extension in Windows platform, how can I open the the file for outputting its contents on the screen using the ofstream object in C++? I am aware that the object can be used to open files in text and binary modes. But I would like to know if a .doc (or even .pdf) file can be opened and its contents read.
I've never actually done this before, but after reading up on it, I think I might have a suggestion. The .docx format is actually just XML that is zipped up. After unzipping, the file is located at word/document.xml. Doing this in a program is where it gets fun.
Two options: If you're using C++ CLR (.NET) then Microsoft has an SDK for you. It should make it pretty easy to open Office documents.
Otherwise if you're just using regular C++, you might have to do some extra work.
Open the file and unzip it using a library like zlib
Find the document.xml file inside
Parse the XML document. You'll probably want to use some kind of XML parsing library for this. You'll have to look up the specs for the XML to figure out how to get the text you want.
C++ std library has ifstream class that can be used to read simple text files, and for read binary files too.
It is up to you to interpret these bytes in the file. To proper interpret the binary file you need to know the format of the file.
If you think of MS Word files then I would start from here: http://en.wikipedia.org/wiki/Office_Open_XML to understand MS Word 2007 format.
You might find the Boost Iostreams library ( http://www.boost.org/doc/libs/1_52_0/libs/iostreams/doc/home.html ) somehow useful if you want to make some filter by yourself.