I have a PDF file with tables, images and so on. I want to translate text of this PDF file into another language and create a PDF file that is similar to the first file but contains translated text (it should have images, tables, ... like first file).
How can I write a program in C++ that does this work?
I have a program that extracts the text from a PDF file and translates it, but I can't create an output PDF file with the tables and images in their original positions. How can I create a PDF file that has the same layout as the original file?
Your program should read the PDF into an in-memory structure (like a tree of objects), then translate the text leaves in memory, then write the memory structure back out as a PDF.
To do so, you need a PDF parsing library that allows you to manipulate the object representation.
I am not a C++ dev, so I don't know the C++ library universe; but from a quick search on Google, it looks like PoDoFo can do this job.
I want to store videos and images in C++ files. Can we store data other than text, such as images and videos, in C++ files?
Yes. If you save any picture onto your computer, you can just change its file extension to .cpp. The operating system will then treat it as a C++ file, but the contents are still image data: if you open it as source code you will see garbage, and it will not compile.
Changing the extension back to .png or .jpg makes the file open as an image again, because its bytes never changed.
I used fin to read in a .doc file, and then store all the text in a string. When I tried printing the string, I just saw unknown characters.
When I copied the contents of the .doc file into a .txt file and then read the .txt file in using fin, everything worked fine.
My question is whether fin works with complex files (such as .doc) or just with .txt files. I only had text in my .doc file (no graphics or anything), but the font was Calibri, which is not the font that fout uses to print text to a .doc file.
If by fin you mean an ifstream, then yes, it can read the file contents. With complex files, however, you have to deal with the file format yourself; the C++ library will not automatically extract just the text. When you saved the file as plain text, the text was all that was left, so that is all the stream read.
fstream performs all operations in text mode by default, and .doc files use the binary MS-DOC file format. So when you read the .doc file and printed it, you were most likely looking at raw binary data.
fstream will read any file you give it.
I tried reading an .mp4 file in binary mode using fstream and it did read the file (I can confirm that because I wrote the contents to another file and that file turned out to be the same video).
So the answer to your question is that you can read any file with fstream, but fstream only works in one of two modes: text or binary.
Reading an arbitrary file therefore won't do much good unless you only want to do something like copy its contents to another file.
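As a minimal sketch of the byte-for-byte copy described above, something like the following should work for any file type. The function name copy_file_binary is made up for this illustration; the only assumption is that both paths are accessible to the program.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: copies src to dst byte for byte.
// Both streams are opened with std::ios::binary so that no newline
// translation is applied on platforms that distinguish text mode.
bool copy_file_binary(const std::string& src, const std::string& dst) {
    std::ifstream in(src, std::ios::binary);
    if (!in) return false;

    std::ofstream out(dst, std::ios::binary | std::ios::trunc);
    if (!out) return false;

    char buffer[4096];
    // read() fails on the last, partial chunk, but gcount() still
    // reports how many bytes were actually placed in the buffer.
    while (in.read(buffer, sizeof buffer) || in.gcount() > 0) {
        out.write(buffer, in.gcount());
    }
    out.flush();
    return out.good();
}
```

Calling copy_file_binary("video.mp4", "copy.mp4") would reproduce the experiment from the answer; the program never needs to understand the MP4 format to do this.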
You first need to understand the .doc file format. Start with the "Doc (computing)" Wikipedia page. The format is very complex (so you'll need months of work at least) but more or less documented.
You could consider a different approach to your overall goal. For example, if you need to parse a .doc file (produced by Microsoft Word), you might use LibreOffice, which provides a library to parse it, or you could find another library (e.g. DocxFactory, wvware, ...), or you could use a COM interface to Word (on a Microsoft Windows operating system with Microsoft Word installed).
If your goal is to generate some document, you might consider the PDF format (which is a standard), perhaps using some text formatter like LaTeX or Lout to generate it, or some library (e.g. cairo, PoDoFo, etc ...).
Regarding "My question is whether fin works with complex files (such as .doc)":
BTW, C++ standard IO is capable of reading binary files, but you need to write your parser for them (so you need to understand precisely your file format). You should prefer open formats to proprietary formats.
I am very new to C++ and I want to write a program that reads and extracts data from files of different formats (example: .dat). I just want to read and extract the data from them. Some people say something about file headers, structures and bodies; what are they actually?
Basically, you need a different strategy (code) for each file format.
A file with extension .txt usually contains ASCII data and is simple to read.
A file with extension .doc contains binary data for MS Word and is very hard to read without MS Word or a library that implements the format.
All other file formats are somewhere in between these extremes.
The file extension will give you a hint about the file's contents. Often people use the extension as a synonym for the actual file format. So we say "I have a .WAV file" when we actually mean "I have a binary file in RIFF/WAVE format with a .wav extension".
Some file formats (Like .WAV .MP3 .TIFF and so on) contain a (well documented) header which describes the file's structure in the first few bytes.
So "header" means: the first few bytes of a file, which describe the contents/structure/layout of the file. For example, in the first few bytes of a .WAV file you'll find the number of channels, the sampling rate, etc., which explain how the rest of the file needs to be read in, interpreted and sent to an audio device.
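To make the idea of a header concrete, here is a rough sketch that pulls the channel count and sampling rate out of a .WAV header. It assumes the common "canonical" layout in which the "fmt " chunk immediately follows the RIFF/WAVE identifiers and all integers are little-endian; real files may contain other chunks first, which this sketch simply rejects. The names WavInfo, parse_wav_header and read_wav_info are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical result type holding the fields we care about.
struct WavInfo {
    std::uint16_t channels;
    std::uint32_t sample_rate;
    std::uint16_t bits_per_sample;
};

// Little-endian decoding that does not depend on the host byte order.
static std::uint16_t le16(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
static std::uint32_t le32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Parses the first 36 bytes of a canonical WAV file.
// Assumed layout: "RIFF" size "WAVE" "fmt " fmt-size format channels
// sample-rate byte-rate block-align bits-per-sample.
bool parse_wav_header(const unsigned char* data, std::size_t size, WavInfo& out) {
    if (size < 36) return false;
    if (std::memcmp(data, "RIFF", 4) != 0) return false;
    if (std::memcmp(data + 8, "WAVE", 4) != 0) return false;
    if (std::memcmp(data + 12, "fmt ", 4) != 0) return false;
    out.channels = le16(data + 22);
    out.sample_rate = le32(data + 24);
    out.bits_per_sample = le16(data + 34);
    return true;
}

// Reads the header bytes from disk in binary mode and parses them.
bool read_wav_info(const std::string& path, WavInfo& out) {
    std::ifstream in(path, std::ios::binary);
    unsigned char header[36];
    if (!in.read(reinterpret_cast<char*>(header), sizeof header)) return false;
    return parse_wav_header(header, sizeof header, out);
}
```

The parsing is kept separate from the file reading so that the same function can be applied to bytes that come from memory or a network stream.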
Some other popular extensions (like .dat .bin .hex) say not much more than "this is binary data in an unspecified format/structure." So you need (a lot of) additional information to read these files in a meaningful way.
Wikipedia article about file extensions
Wikipedia article about file formats
For each type of file there will be a specification defining the format. There may well be headers (information about the data stored in the file) and data structures (ways of organising the actual data in the file); other files may just be plain text where a newline character separates lines.
To write code to interpret a file, for instance a .jpg, you would need to get the file format specification for JPEG, read it, and then implement it in your code. You would do this for each file format you needed to read in your program.
The structure and content of common files like images, videos, sound, CAD data, text processing... is extremely complex. Mastering them would take you more than a lifetime.
Files often begin with a signature, i.e. a small number of bytes that is meant to be unique and can be used to check the file type. There is no universal standard for signatures, though. For instance, an MS Bitmap image begins with the letters "BM", while XML content often begins with a declaration like <?xml version="1.0" encoding="UTF-8"?>.
A header is an initial section of the file that gives information about the data itself, such as data type and size, allowing a program to interpret the subsequent data correctly. For instance, the TIFF image format has a complex header that can contain dozens of "tags" before the bitmap data.
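A signature check along these lines could be sketched as follows. The function name guess_type is hypothetical, and the list of signatures is only the two from the text plus the "%PDF-" marker that PDF files start with; a real tool would need many more entries, and an XML file may legitimately start with a byte-order mark or without a declaration, which this sketch does not handle.

```cpp
#include <string>

// Returns true if `head` starts with the given signature.
static bool starts_with(const std::string& head, const std::string& sig) {
    return head.size() >= sig.size() && head.compare(0, sig.size(), sig) == 0;
}

// Hypothetical helper: guesses a file type from its first few bytes.
// `head` is expected to hold the beginning of the file, read in binary mode.
std::string guess_type(const std::string& head) {
    if (starts_with(head, "BM")) return "bmp";      // MS Bitmap
    if (starts_with(head, "<?xml")) return "xml";   // XML declaration
    if (starts_with(head, "%PDF-")) return "pdf";   // PDF header
    return "unknown";
}
```

Because the check looks only at the bytes, it gives the same answer no matter what extension the file happens to have.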
Here is an example.
I am using .txt files in my program for reading and writing records (the records contain both text and numerals). Recently I came to know that .dat files can also be used like .txt files for file operations. I would like to know the difference between the two and the advantages and disadvantages of one over the other.
Text files (.txt files) are easy for humans to read but a bit harder to parse in programs, whereas .dat is usually used to store data that is not just plain text.
Generally .txt files contain letters, digits and symbols that are readable.
A .dat file is usually a binary file, in which the data is not always printable on screen.
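The difference between the two representations can be illustrated with a small sketch that stores the same number once as text and once as raw bytes. The function names are invented for this example, and the binary form assumes the file is read back on a machine with the same byte order.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Writes the value as human-readable decimal digits.
bool write_as_text(const std::string& path, std::int32_t value) {
    std::ofstream out(path);
    out << value;
    out.flush();
    return out.good();
}

// Writes the value as its raw 4-byte in-memory representation.
bool write_as_binary(const std::string& path, std::int32_t value) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(&value), sizeof value);
    out.flush();
    return out.good();
}

// Reads back a value written by write_as_binary.
bool read_as_binary(const std::string& path, std::int32_t& value) {
    std::ifstream in(path, std::ios::binary);
    return static_cast<bool>(in.read(reinterpret_cast<char*>(&value), sizeof value));
}

// Returns the size of a file in bytes, or -1 if it cannot be opened.
long file_size(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    return in ? static_cast<long>(in.tellg()) : -1;
}
```

For the value 123456789, the text file holds the nine characters "123456789", while the binary file holds four bytes that a text editor would show as unreadable symbols. Neither result depends on whether the files are named .txt or .dat.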
The extension of a file is a helper so that the operating system (or user) can choose the appropriate program to open it. It has no effect on the actual file contents. There are some conventions about which extensions to use, but there is nothing keeping you from using any arbitrary extension for your files. For instance, you can rename a .jar file to a .zip file and open it with pkunzip.
So for C++ the extension does not matter, but for you as a programmer it may give a hint about the file contents, e.g. whether to open the file in text or binary mode.
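In most languages, like C/C++, the file type makes no difference to file operations (read, write or edit).
However, if you want to work with binary files you should open them in binary mode. In text mode, some platforms alter the data: on Windows, for example, line endings are translated and a Ctrl-Z byte (0x1A) can be treated as the end of the file. A \0 byte does not end the file, but it does end any C-style string you read the data into. .dat files are usually binary too!
If you want to store and read some data, XML files and sometimes .dat files are better because there are good libraries to read them. They don't need the hard parsing that text files do.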
I have a folder that contains 300 different files. There are 150 .cft files and 150 .s01 files. Each .cft file has a corresponding .s01 file of the same name. I would like to create a program that can read the files from the folder and place each .cft file and its corresponding .s01 file into an excel document. I would like the .cft file to be on the first worksheet in the document and the .s01 file to be on the second sheet. Then I would like the program to save the file and name it (---------).xls. The (---------) would be the name of the .cft and .s01 file since they are both the same.
So!!! I wrote a program that is able to take the .cft file and the .s01 file, append them and place them in a user defined .xls document. However...I don't want to manually get the names of the 150 files and have to type each one into the program. I also don't want the files to be placed on the same worksheet.
So!!!! I don't want to waste time trying to code something impossible, so before I spend anymore time on this I have a few questions:
Is it possible to read all of the files in a folder and match files of the same name but with different types?
If this is possible, is it then possible to place the corresponding .cft file and .s01 file in the same excel document but on different worksheets?
Then, is it possible to create and save this worksheet as (---------).xls, (-------) being the name of the matching .cft and .s01 file?
So basically...I want to write this code because I am lazy and I don't want to do anything manually ><;;; lol
Example:
The main folder contains 8 files:
dog.cft dog.s01 cat.cft cat.s01 tree.cft tree.s01 bird.cft bird.s01
The program reads all of the files in the folder and recognizes that dog.cft and dog.s01 go together.
The program then creates an excel document and on worksheet 1 places dog.cft and on worksheet 2 places dog.s01.
The program then saves the excel document as dog.xls
Then the program loops through the main folder repeating this process for each of the .cft and .s01 pairs until all 150 pairs have been separated and saved in their own excel document.
I don't know if I'm dreaming a little too big with this but any advice is much appreciated!
Personally I would do this with a macro in Excel rather than in C++, because working with Excel is much easier that way. All of the requirements can be met using VBA within Excel.
Yes, it's possible.
For the listing of files in a folder, you can use the Windows API functions FindFirstFile and FindNextFile. When you finish iterating the folder, you'll need to call FindClose.
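Once the file names have been collected (with the FindFirstFile/FindNextFile loop described above, or any other directory listing), the matching step could be sketched like this. The code does not call the Windows API itself; it only pairs names, and the FilePair type and pair_by_stem function are hypothetical. It assumes the extensions are spelled exactly ".cft" and ".s01" (case-sensitive).

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record for one matched pair of files.
struct FilePair {
    std::string cft;
    std::string s01;
};

// Groups file names by their stem (the part before the last dot) and keeps
// only the stems for which both a .cft and a .s01 file are present.
std::map<std::string, FilePair> pair_by_stem(const std::vector<std::string>& names) {
    std::map<std::string, FilePair> pairs;
    for (const std::string& name : names) {
        const std::size_t dot = name.rfind('.');
        if (dot == std::string::npos) continue;  // no extension at all
        const std::string stem = name.substr(0, dot);
        const std::string ext = name.substr(dot);
        if (ext == ".cft") {
            pairs[stem].cft = name;
        } else if (ext == ".s01") {
            pairs[stem].s01 = name;
        }
    }
    // Drop stems that are missing one of the two files.
    for (auto it = pairs.begin(); it != pairs.end();) {
        if (it->second.cft.empty() || it->second.s01.empty()) {
            it = pairs.erase(it);
        } else {
            ++it;
        }
    }
    return pairs;
}
```

Each key of the returned map (for example "dog") can then serve as the base name of the output workbook, with the two members of the pair going to separate worksheets.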
For creating the Excel spreadsheet and working with the workbook's sheets, you can use COM automation. Here's a link to an article on doing so from C++ (MFC); the article explains where to find one that isn't MFC based.
If you get started and have specific questions about either of the tasks, please post them as separate questions. This should have been two individual questions, in fact - one about iterating the content of a folder and a different one about working with Excel files from C++.