How can I search PDF? - c++

I'm doing a small project in C++ on Linux. I need to search ten or more PDF files and find the required data. How can I do so?
To make my question clearer, here is an example:
Suppose I have ten textbooks, all about C++, and I need information about the topic "array". How can I search the PDFs and find that data?

Take a look at pdftotext.
If you actually want to write the code yourself, you'll probably have to learn how to navigate the internals of a PDF file. There have been some answers on how to do that, for example one pointing to this article, which has C code for a basic PDF parser on its second page.
xtractpro
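As a rough illustration of the pdftotext route, here is a minimal sketch that shells out to the tool and scans its output for a keyword. It assumes the `pdftotext` command-line program is installed and on the PATH; the function names `pdf_to_text` and `find_lines` are made up for this example, and the matching is a plain case-insensitive substring test rather than anything smarter.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

// Lower-case copy of a string (ASCII only), used for case-insensitive matching.
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Return every line of `text` that contains `keyword`, ignoring case.
std::vector<std::string> find_lines(const std::string& text, const std::string& keyword) {
    std::vector<std::string> hits;
    const std::string needle = to_lower(keyword);
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (to_lower(line).find(needle) != std::string::npos) hits.push_back(line);
    }
    return hits;
}

// Run `pdftotext <path> -` (the "-" sends the text to stdout) and capture it.
// Assumptions: pdftotext is on PATH and `path` contains no single quotes.
// Returns an empty string if the command cannot be started.
std::string pdf_to_text(const std::string& path) {
    const std::string cmd = "pdftotext '" + path + "' - 2>/dev/null";
    FILE* pipe = popen(cmd.c_str(), "r");  // POSIX, fine on Linux
    std::string out;
    if (pipe == nullptr) return out;
    char buf[4096];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, pipe)) > 0) out.append(buf, n);
    pclose(pipe);
    return out;
}
```

For the ten-textbook case, one would call something like `find_lines(pdf_to_text("book1.pdf"), "array")` for each file in turn and print the returned lines; the file name here is only a placeholder.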

Related

Knowledge required for making a simple CLI C++ program that finds lyric for a song and displays it to the user

I'm a beginner to C++ and I want to make a program that finds the lyrics for a song entered by the user and displays them on the screen.
What are the things I should know about to build this program?
Thanks
What you essentially need is to revisit your C++ tutorials and relearn the basics. Take the first rung before stepping onto the middle of the ladder.
That being said, you will need to do the following, in no particular order:
Find a lyrics provider and read their API documentation (or have the lyrics in a file or db!)
Learn how to parse XML or JSON or text
Learn how to retrieve data over a network or from file or db
Learn how to construct searches
If you are looking for a code snippet then I am afraid you are on the wrong site. You should look at Github for that as that is too bulky for a Q&A site.
Good luck!

How to generate C++ library with xerces for specific XML

I've gone through this xerces C++ tutorial, which shows how you might write a nice C++ class that lets you access the data from your XML using simple function calls. The problem is that 200 lines of C++ seems like an excessive amount of work just to grab a couple of pieces of data from an XML file. I am hoping to find something that will take in my XML file and spit out C++. I have searched online for a tool that generates this for me, but I can't find anything.

How to read and write doc, pdf files using files in c++

I'm writing a C++ program that works with files, and I need to take input from existing files such as doc and pdf files. How do I program that in C++? And after reading the input, how can I write those details into a new doc or pdf file? Can anyone explain this with an example?
C++ as a language doesn't equip you with features such as "write to DOC file" or "read from PDF file". The only tool available to you as a programmer is raw byte-by-byte reading and writing. To make your brand-new file PDF/DOC/etc. compatible, you have to conform to the chosen file format. The same goes for reading: you need to understand which portions of the raw byte array are responsible for what.
In general, this task is called "parsing" or "serialization". It's a good idea to use an existing parser for the particular file format instead of reinventing the wheel. Moreover, some file formats may be covered by patents, so you may not be allowed to work with them without purchasing a license.
Some clues so far:
PDF parsing in C++ (PoDoFo)
Microsoft word Text Parser in "C"
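To make the "raw bytes" point above concrete, here is a small sketch that reads the first few bytes of a file in binary mode and checks whether they look like the start of a PDF. PDF files normally begin with the signature `%PDF-` followed by a version number; the helper names `read_prefix` and `has_pdf_signature` are invented for this illustration, and recognizing the signature is of course only the very first step of real parsing.

```cpp
#include <fstream>
#include <string>

// Read up to `count` bytes from the start of a file, in binary mode.
// Returns fewer bytes (possibly none) if the file is short or cannot be opened.
std::string read_prefix(const std::string& path, std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    std::string buf(count, '\0');
    in.read(&buf[0], static_cast<std::streamsize>(count));
    buf.resize(static_cast<std::size_t>(in.gcount()));  // keep only what was read
    return buf;
}

// True if `data` starts with the five-byte PDF signature "%PDF-".
bool has_pdf_signature(const std::string& data) {
    return data.compare(0, 5, "%PDF-") == 0;
}
```

A call such as `has_pdf_signature(read_prefix("input.pdf", 5))` (the file name is a placeholder) tells you whether it is worth handing the file to a PDF library at all. Everything after the header, such as objects, streams and the cross-reference table, is where an existing parser saves the real work.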
There are some libraries available on the web now (the question is from 2013; maybe there weren't many back then).
Apart from the links in the selected answer, you can try PDFTron. It also supports newer features, e.g. linearization.
Here is one of their samples:
https://www.pdftron.com/documentation/samples/cpp/TextExtractTest
(That program itself contains 4 if blocks, with slightly different features of the library/SDK, to try)
There should be more, search on the web for PDF parsing libraries.

Extracting key words from HTML to C++ under linux

I am working on a simple client-server project. The client is written in Java; it sends keywords to a C++ server running on Linux and receives a list of the best-ranked URLs (ranked by the number of occurrences of the keywords). The server's job is to go through some URLs in search of the keywords and return the best-fitting URLs. The problem is that I have to parse HTML pages to find occurrences of the keywords, and I also need to extract links from each visited page so I can search those as well. My question is: what library can I use to do that? Remember that only C++ Linux libraries are suitable for me. There were some similar topics, and I went through most of them, but some of the libraries parse only HTML files, and I don't want to download every site I visit; I want to parse it on the fly and store just its rank and URL. Some of them look a bit complicated to me, for instance first converting HTML to XML or something else and only then working on the results in C++. Is there something simple and sufficient for what I need? Any advice will be appreciated.
I don't think regular expressions are appropriate for HTML parsing. I'm using libxml2, and I enjoy it very much - easy to use, portable and lightning fast.
To fetch pages from the web using C/C++ you could use the libcurl library. To extract URLs and other less trivial pieces from a page you can use a regex library.
Separating the HTML tags from the real content can also be done without the use of a library.
For more advanced work you could use Qt, which offers classes such as QWebPage (which uses WebKit) that give you access to the DOM of the page and let you extract individual HTML objects (e.g. single cells of a table) rather easily.
You can try xerces-c. It's a powerful library for XML parsing. It supports reading XML on the fly, as well as DOM and SAX parsing.
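As a sketch of the "without a library" remark above, the following strips everything between `<` and `>` from a page already held in memory and counts case-insensitive occurrences of a keyword in what remains. The function names are made up for this example, and the approach is deliberately naive: it does not handle comments, `<script>` or `<style>` bodies, or character entities, which is exactly where a real parser such as libxml2 earns its keep.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-case copy of a string (ASCII only).
static std::string lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Drop everything between '<' and '>' and return the remaining text.
// Each removed tag is replaced by one space so adjacent words stay separated.
std::string strip_tags(const std::string& html) {
    std::string text;
    bool in_tag = false;
    for (char c : html) {
        if (c == '<') {
            in_tag = true;
        } else if (c == '>' && in_tag) {
            in_tag = false;
            text += ' ';
        } else if (!in_tag) {
            text += c;
        }
    }
    return text;
}

// Count non-overlapping, case-insensitive occurrences of `keyword` in `text`.
std::size_t count_occurrences(const std::string& text, const std::string& keyword) {
    if (keyword.empty()) return 0;
    const std::string hay = lower_ascii(text);
    const std::string needle = lower_ascii(keyword);
    std::size_t count = 0;
    for (std::size_t pos = hay.find(needle); pos != std::string::npos;
         pos = hay.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}
```

Because the functions work on a `std::string`, the page body fetched over the network can be ranked in memory with `count_occurrences(strip_tags(body), keyword)` and then discarded, so nothing needs to be saved to disk.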

Combining two PDF files in C++

In C++ I'm generating a PDF report with libHaru. I'm looking for some way to append two pages from an existing PDF file to the end of my report. Is there any free way to do that?
Thanks.
Try PoDoFo
http://podofo.sourceforge.net/
You should be able to open both of the PDFs as PdfMemDocument objects using PdfMemDocument::Load( filename ).
Then acquire references to the two pages you want to copy and add them to the end of the document using InsertPages. Alternatively, remove all but the last two pages of the source document, then call PdfDocument::Append on the report and pass it the trimmed source document. It's hard to say which would be faster or more stable.
Hope that helps,
Troy
You can use the Ghostscript utility pdf2ps to convert the PDF files to PostScript, append the PostScript files, and then convert them back to a PDF using ps2pdf.