Epub library for C++ - c++

Is there a C++ library for creating EPUB files? I need to use it with Qt.
My program can export HTML and CSS, but I don't know how to convert that to an EPUB.

From my googling, it appears that most EPUB files are written by hand and there isn't a generally accepted SDK. I found a nice tutorial that walks you through making EPUB files, and I did see some other links about using it with Qt. Maybe someone knows of a good open source project somewhere?
epub tutorial

Once you've got the HTML and CSS, you're most of the way there; what remains is the content.opf file, which basically lists all the files in the epub document and the overall metadata (author, publisher, ISBN, etc); and the table of contents. epub 2.0.1 uses the toc.ncx file as a table of contents--it's basically an xml document. epub 3.0 uses the toc.xhtml, which is much more intuitive--it's essentially an ordered list in a nav element. You can do either epub 2.0.1 or epub 3.0; there's enough backwards compatibility built in that older devices will be able to read an epub 3.0 file--as long as you include both a toc.ncx and a toc.xhtml.
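As a rough illustration of the "ordered list in a nav element" idea, here is a minimal C++ sketch that builds a toc.xhtml string from a list of chapters. The function names (xmlEscape, buildNavDocument) and the chapter file names are made up for this example; check the output against the spec before relying on it.

```cpp
#include <string>
#include <utility>
#include <vector>

// Escape the characters that are special in XML text and attribute values.
std::string xmlEscape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default: out += c;
        }
    }
    return out;
}

// Build an EPUB 3 navigation document from (href, title) pairs.
// Hypothetical helper: one flat list, no nested sections.
std::string buildNavDocument(
    const std::vector<std::pair<std::string, std::string>>& chapters) {
    std::string xml =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" "
        "xmlns:epub=\"http://www.idpf.org/2007/ops\">\n"
        "<head><title>Table of Contents</title></head>\n"
        "<body>\n"
        "<nav epub:type=\"toc\">\n"
        "<ol>\n";
    for (const auto& ch : chapters) {
        xml += "<li><a href=\"" + xmlEscape(ch.first) + "\">" +
               xmlEscape(ch.second) + "</a></li>\n";
    }
    xml += "</ol>\n</nav>\n</body>\n</html>\n";
    return xml;
}
```

With Qt you could write the returned string to toc.xhtml with QFile, the same way you already export your HTML.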
You may have to tinker with your CSS; epub doesn't support everything, and the device manufacturers all seem to interpret things differently; it's very "browser wars"-ish.
I find the IDPF's epub spec is the best place to go for formatting info. Here's the relevant bits:
content.opf
toc.xhtml
toc.ncx
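To make the content.opf part more concrete, here is a minimal sketch, in C++, of generating an EPUB 3 package document that lists both toc.xhtml and toc.ncx plus your own files. The struct and function names (ManifestItem, buildContentOpf) and the "pub-id" identifier are assumptions for illustration, and all values passed in are assumed to be XML-safe already; validate the result with an EPUB checker.

```cpp
#include <string>
#include <vector>

// One manifest entry. All field values are assumed to be XML-safe already.
struct ManifestItem {
    std::string id;         // unique within the package, e.g. "ch1"
    std::string href;       // path relative to content.opf
    std::string mediaType;  // e.g. "application/xhtml+xml" or "text/css"
    bool inSpine;           // true if the item is part of the reading order
};

// Build a minimal EPUB 3 content.opf (hypothetical helper).
// 'modified' is expected in the form 2011-01-01T12:00:00Z.
std::string buildContentOpf(const std::string& title,
                            const std::string& identifier,
                            const std::string& language,
                            const std::string& modified,
                            const std::vector<ManifestItem>& items) {
    std::string xml =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        "<package xmlns=\"http://www.idpf.org/2007/opf\" version=\"3.0\" "
        "unique-identifier=\"pub-id\">\n"
        "  <metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n"
        "    <dc:identifier id=\"pub-id\">" + identifier + "</dc:identifier>\n"
        "    <dc:title>" + title + "</dc:title>\n"
        "    <dc:language>" + language + "</dc:language>\n"
        "    <meta property=\"dcterms:modified\">" + modified + "</meta>\n"
        "  </metadata>\n"
        "  <manifest>\n"
        // Both tables of contents, for EPUB 3 and EPUB 2 reading systems.
        "    <item id=\"nav\" href=\"toc.xhtml\" "
        "media-type=\"application/xhtml+xml\" properties=\"nav\"/>\n"
        "    <item id=\"ncx\" href=\"toc.ncx\" "
        "media-type=\"application/x-dtbncx+xml\"/>\n";
    for (const auto& item : items) {
        xml += "    <item id=\"" + item.id + "\" href=\"" + item.href +
               "\" media-type=\"" + item.mediaType + "\"/>\n";
    }
    xml += "  </manifest>\n  <spine toc=\"ncx\">\n";
    for (const auto& item : items) {
        if (item.inSpine) xml += "    <itemref idref=\"" + item.id + "\"/>\n";
    }
    xml += "  </spine>\n</package>\n";
    return xml;
}
```

Stylesheets and images go in the manifest only; the spine lists just the content documents in reading order.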

Related

Railo 4 - which document formats are supported by Cfindex / Lucene?

I thought I had a simple question, but somehow I can't find a source for the answer: which document formats can be indexed by the Lucene version that is packaged with Railo 4.0?
Somehow .doc and .pdf seem to work well, but .docx and .rtf just don't seem to get indexed. Is there a list available somewhere? And for all formats that aren't supported, what would be the best way to get that content indexed by cfindex as well?
<cfindex
collection = "#collection#"
action = "update"
type = "file"
key ="#ABSfilepath#"
title="#ABSfilepath#"
>
thanks!
Question also posted to Railo mailing list: web link.
Railo 4 uses Lucene 2.4.1 - how do you tell? Same way you tell the version for all third-party software that Railo uses: locate the JAR file (in the lib/ext directory), open that archive (using 7-zip or equivalent), and look at META-INF/MANIFEST.MF where you find content like this:
Specification-Title: Lucene Search Engine: core
Specification-Version: 2.4.1
Specification-Vendor: The Apache Software Foundation
Implementation-Title: org.apache.lucene
Implementation-Version: 2.4.1 750176 - 2009-03-04 21:56:52
Implementation-Vendor: The Apache Software Foundation
This seems to be a pretty old version and doesn't look like it has any docs on the Apache Lucene website. (It might be possible to upgrade Lucene by replacing the relevant JARs, but this might also cause dependency issues; do so at your own risk.)
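If you want to script that version check rather than opening each archive by hand, a minimal sketch in C++ could look like the following. It assumes the MANIFEST.MF text has already been extracted from the JAR (reading the archive itself needs a zip library), the function name manifestValue is made up for this example, and it ignores the manifest format's continuation lines for long values.

```cpp
#include <sstream>
#include <string>

// Return the value of a "Key: value" header in MANIFEST.MF text,
// or an empty string if the key is not present.
// Simplification: wrapped (continuation) lines are not joined.
std::string manifestValue(const std::string& manifestText,
                          const std::string& key) {
    std::istringstream in(manifestText);
    std::string line;
    const std::string prefix = key + ": ";
    while (std::getline(in, line)) {
        // Tolerate Windows-style CRLF line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.compare(0, prefix.size(), prefix) == 0) {
            return line.substr(prefix.size());
        }
    }
    return "";
}
```

Calling manifestValue(text, "Specification-Version") on the manifest shown above would give you the Lucene version string.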
Since the Lucene website doesn't help, a search for "lucene 2.4.1 indexable documents" brings back a pertinent question about v2.3.2 which asks:
Does Lucene java supports parsing of extensions *.docx, *.pptx, *.mpp i.e.
Microsoft Windows 2007 documents?
With the response:
Lucene doesn't actually support any of the document types. What happens
is that some program is used to parse the files into an indexable stream
and that stream is indexed. That used to be POI in the old days.
Ok, so assuming that is still accurate, Lucene doesn't control the filetypes, Apache POI does.
Checking the JARs tells us Railo 4.0 uses Apache POI v3.8, and looking at the POI changelog reveals that .docx support arrived in v3.5.
So, your .docx files should be supported along with the other MS Office formats. If they are definitely not being indexed, you probably need to identify whether it's a POI issue, a Lucene issue, or a Railo issue. Creating a simple reproducible test case with both .doc and .docx documents is probably a good first step.
Beyond that, you'll need someone familiar with Lucene/POI to advise. There may or may not be log files that contain details of possible indexing/retrieval errors, or ways to interact with Lucene directly (not via Railo/cfindex) that can help identify where the issue lies.

Is there such a thing as a printer markup language?

I would like to print a document. Its content consists of tables and text in different colors. Does a lightweight printer file format exist that can be used like a template?
PS, PDF, and DOC files are, in my opinion, too heavy to parse. Is there perhaps an XML or YAML file format that supports:
Easy creation (maybe with a WYSIWYG-Editor)
Parsing and manipulation with Library-Support
Easy sending to the printer (maybe with Library-Support)
Or do I have to do it the usual way and paint within a CDC?
I noticed you’re using MFC (so, Windows). In that case the answer is a qualified yes. In recent versions of Windows, Microsoft offers the XPS Document API which lets you create and manipulate a PDF-like document using XML, which can then be printed using the XPS Print API.
(For earlier versions of Windows that don’t support this API, you could try to deal with the XPS file format directly, but that is probably a lot harder than using CDC. Even with the API you will be working at a fairly low level.)
End users can generate XPS documents using the XPS print driver that is available for free from Microsoft (and bundled with certain MS products—they probably already have it on their system).
There is no universal language that is supported across all (or even many) printers. While PCL and PS are the most used, there are also printers which only work with specific printer drivers because they only support a proprietary data format (often pre-rendered on the client).
However, you could use XSL-FO to create documents which can then be rendered to a printer driver using library support.
I think something like TeX or LaTeX (or even troff or groff) may meet your needs. Google them and see.
There are also libraries for producing print-ready documents. Look at http://libharu.sourceforge.net/ for example. It outputs a printer-ready PDF (note that it builds the PDF through its API rather than rendering HTML source).
I think that PostScript is a really good choice for that.
It is actually a very simple language, and it should be very easy to parse because it is stack-oriented. Also, most printers support it, and even if yours does not, you can use Ghostscript to convert it to many different formats (consider GS a "virtual PS-supporting printer").
Finally there are a lot of books and tutorials for the language.
About the parsing -- you can actually define new variables and functions in PS. So, maybe, your problem can be solved (almost) entirely using PS.
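To give an idea of how little is needed, here is a minimal sketch in C++ that emits a PostScript page with colored text lines. The names (TextLine, psEscape, buildPostScript), the font, and the page positions are assumptions for illustration; it handles plain ASCII text only and does not lay out tables.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One line of text with an RGB color (components in the range 0..1).
struct TextLine {
    std::string text;
    double r, g, b;
};

// Escape characters that are special inside a PostScript (...) string.
std::string psEscape(const std::string& s) {
    std::string out;
    for (char c : s) {
        if (c == '(' || c == ')' || c == '\\') out += '\\';
        out += c;
    }
    return out;
}

// Build a one-page PostScript document (hypothetical helper).
std::string buildPostScript(const std::vector<TextLine>& lines) {
    std::ostringstream ps;
    ps << "%!PS\n"
       << "/Helvetica findfont 12 scalefont setfont\n";
    int y = 720;  // start one inch below the top of a 792 pt (US Letter) page
    for (const auto& line : lines) {
        ps << line.r << ' ' << line.g << ' ' << line.b << " setrgbcolor\n"
           << "72 " << y << " moveto\n"
           << '(' << psEscape(line.text) << ") show\n";
        y -= 14;  // move down one line
    }
    ps << "showpage\n";
    return ps.str();
}
```

The resulting text can be sent to a PostScript printer or passed through Ghostscript for other targets.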
HTML + CSS can be printed -- properly. CSS was designed to support this with the media attribute to specify that your CSS is for printer layout, not for screen layout. Tools like PRINCE (free + commercial versions) exist to render this for printing.
I think postscript is the markup language used by printers. I read this somewhere, so correct me if postscript is now outdated.
http://en.wikipedia.org/wiki/PostScript
For a more powerful suite you can use LaTeX. It lets you create templates into which you can just copy the text.
On a more GUI friendly note, MS-Word and other word processors have templates. The issue is they are not of a common standard or markup.
You can also use HTML to render stuff in a common markup but it will not be very printer friendly.

Help programmatically add text to an existing PDF

I need to write a program that displays a PDF which a third party supplies. I need to insert text data into the form before displaying it to the user. I do have the option to convert the PDF into another format, but it has to look exactly like the original PDF. C++ is the preferred language, but other languages can be investigated (e.g. C#). It needs to work on a Windows desktop machine.
What libraries, tools, strategies, or other programming languages do you suggest I investigate to accomplish this task? Are there any online examples you could direct me to?
Thank-you in advance.
What about PoDoFo:
The PoDoFo library is a free, portable C++ library which includes classes to parse PDF files and modify their contents into memory. The changes can be written back to disk easily. The parser can also be used to extract information from a PDF file (for example the parser could be used in a PDF viewer). Besides parsing PoDoFo includes also very simple classes to create your own PDF files. All classes are documented so it is easy to start writing your own application using PoDoFo.
iTextSharp is a free library that you can use in .Net applications. Take a look at the iText page - that is for the iText project, which is a Java library. iTextSharp is part of that project, and is a port to C# and .Net.
Consider Python. It has a lot of PDF libraries (for both creating and extracting), e.g.:
http://pypi.python.org/pypi/pdfsplit/0.4.2
http://pypi.python.org/pypi/JagPDF/1.4.0
http://pypi.python.org/pypi/pdfminer/20091129
http://pypi.python.org/pypi/podofo/0.0.1
http://pypi.python.org/pypi/pyFPDF/1.52
There are also good tools for using C/C++ code in Python and for creating .exe files from Python scripts. Even if you decide to use a different language, consider Python as a prototyping language!

Load Excel data into Linux / wxWidgets C++ application?

I'm using wxWidgets to write cross-platform applications. In one of the applications I need to be able to load data from Microsoft Excel (.xls) files, but I need this to work on Linux as well, so I assume I cannot use OLE or whatever technology is available on Windows.
I see that there are many open source programs that can read excel files (OpenOffice, KOffice, etc.), so I wonder if there is some library that I could use?
The Excel files it needs to support are very simple, straight tabular data. I don't need to extract any formatting, just the column/row position and the data itself.
Suggested reference: What is a simple and reliable C library for working with Excel files?
I came across other libraries (chicago on sf.net, xlsLib) but they seem to be outdated.
I can say that I know of a wxWidgets application that reads Excel .xls and .xlsx files on any platform. For the .xlsx files we used an XML parser and zip stream reader and grab the data we need, pretty easy to get going. For the .xls files we used: ExcelFormat, which works well and we found the author to be very generous with his support.
Maybe just some encouragement to give it a go? It was a couple of days work to get working.
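One small piece of the .xlsx route, sketched in C++: in the sheet XML, cells normally carry an A1-style reference in their r attribute (for example r="B3"), so getting the column/row position comes down to parsing that string. The names below (CellPosition, parseCellReference) are made up for this example, and this is not the code from the application mentioned above.

```cpp
#include <string>

struct CellPosition {
    int column;  // zero-based: A -> 0, B -> 1, ..., Z -> 25, AA -> 26
    int row;     // zero-based: row "1" -> 0
};

// Parse an A1-style reference such as "B3" (uppercase letters, then digits).
// Returns false on malformed input and leaves 'out' untouched.
bool parseCellReference(const std::string& ref, CellPosition& out) {
    std::size_t i = 0;
    int column = 0;
    // Column letters form a bijective base-26 number: A=1 ... Z=26, AA=27.
    while (i < ref.size() && ref[i] >= 'A' && ref[i] <= 'Z') {
        column = column * 26 + (ref[i] - 'A' + 1);
        ++i;
    }
    if (i == 0 || i == ref.size()) return false;  // need letters and digits
    int row = 0;
    for (; i < ref.size(); ++i) {
        if (ref[i] < '0' || ref[i] > '9') return false;
        row = row * 10 + (ref[i] - '0');
    }
    if (row == 0) return false;  // rows are numbered from 1
    out.column = column - 1;
    out.row = row - 1;
    return true;
}
```

You would call this for each cell element your XML parser hands you and store the value at the resulting position.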
Maybe http://www.libxl.com/ can help?
I think that it is not something easy to do. xls files are quite complex and it is a proprietary format.
Maybe this is a stupid idea, but why don't you upload your document to Google Docs and access it there? There are some APIs for accessing your documents.
Two potential problems:
- Your app needs internet access.
- Currently there is no C++ API.
But there are APIs for several languages, including Python; see http://code.google.com/intl/fr/apis/gdata/articles/python_client_lib.html

library for doing diffs

I've been tasked with creating a tool that can diff and merge the configuration files for my company's product. The configurations are stored as either XML or URL-encoded strings. I'm looking for a library, preferably open source with a license compatible with commercial software, that can do these diffs. Our app is written in C++, so C++ libraries would be best, but I'm willing to look at libraries that are C#-specific since I can write a wrapper that exposes it to C++ via COM. Three-way diffs would be ideal, but two-way is acceptable. If it has an understanding of XML, that would also be a plus (since XML nodes can be reordered without changing the document, etc). Any library suggestions? Should I even consider writing my own diff tools in the hopes of giving it semantic knowledge of our formats?
Thanks to this similar question, I've already discovered this google library, which seems really great, but I'm still looking for other options. It also seems to be able to output the diffs in HTML format (using the <ins> and <del> tags that I didn't know existed before I discovered it), which could be really handy, but it seems to be a unified diff only. I'm going to need to display the results in a web browser, and probably have to build an interface for doing the merges in the browser as well. I don't expect a library to be able to help with these tasks, but it must produce output in a format that is amenable to me building this on top of it. I'm currently envisioning something along the lines of TortoiseMerge (side-by-side diffs, not unified), except browser-based. Any tips/tricks/design ideas on how to present this would be appreciated too.
Subversion comes with libsvn_diff and libsvn_delta, licensed under the Apache Software License.
Here is a C++ library that can diff what the author calls semistructured data. It deals nicely with HTML and XML. Since your data is XML it would make a lot of sense to use this instead of plain text diff. This is especially the case when the files are machine generated.
I am currently trying to use this library to build a tool that diffs Visual Studio project files. These are basically XML files and using a plain diff tool like Winmerge is too painful because Visual Studio pretty much mucks up the whole file by crazy reordering. The idea is to do some kind of a structured diff to address the problem.
For diffing the XML I would propose that you normalize it first: sort all the elements in alphabetic order, then generate a stream of tokens/xml that represents the original document but is independent of the original formatting. After running the diff, parse the result to get a tree containing what was added / removed.
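A minimal sketch of that normalize-then-diff idea, in C++. It assumes the document has already been flattened into one string token per setting (the "key=value" form is made up for illustration; for real XML you would sort sibling elements rather than a flat list), and it uses a plain longest-common-subsequence diff, which is one common technique and not necessarily what any of the libraries mentioned here do.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct DiffOp {
    char kind;          // ' ' unchanged, '-' removed, '+' added
    std::string token;
};

// Normalize by sorting, so that reordering alone produces no differences.
std::vector<std::string> normalize(std::vector<std::string> tokens) {
    std::sort(tokens.begin(), tokens.end());
    return tokens;
}

// Two-way diff of token sequences based on the longest common subsequence.
std::vector<DiffOp> diffTokens(const std::vector<std::string>& a,
                               const std::vector<std::string>& b) {
    const std::size_t n = a.size(), m = b.size();
    // lcs[i][j] = length of the LCS of a[i..] and b[j..]
    std::vector<std::vector<int>> lcs(n + 1, std::vector<int>(m + 1, 0));
    for (std::size_t i = n; i-- > 0;) {
        for (std::size_t j = m; j-- > 0;) {
            lcs[i][j] = (a[i] == b[j])
                            ? lcs[i + 1][j + 1] + 1
                            : std::max(lcs[i + 1][j], lcs[i][j + 1]);
        }
    }
    // Walk the table from the start, emitting one operation per step.
    std::vector<DiffOp> ops;
    std::size_t i = 0, j = 0;
    while (i < n && j < m) {
        if (a[i] == b[j]) {
            ops.push_back({' ', a[i]});
            ++i;
            ++j;
        } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
            ops.push_back({'-', a[i]});
            ++i;
        } else {
            ops.push_back({'+', b[j]});
            ++j;
        }
    }
    while (i < n) ops.push_back({'-', a[i++]});
    while (j < m) ops.push_back({'+', b[j++]});
    return ops;
}
```

The list of operations is easy to turn into side-by-side or <ins>/<del> output. The table needs memory proportional to the product of the two lengths, which is fine for configuration files but not for very large inputs.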