Office.Interop for document format conversion - server-side

I have generated documents ( in Docx, Xlsx, PDF formats) using ReportViewer.WebForms.
Problem is, I need few additional formats (Html and Rtf), so i made conversions using Microsoft.Office.Interop.Word. Basicly it's opens file and saves it in different format.
Whats need to be done in server side (besides installing word, only word is enough?)?I know it's bad practice, but are there any other solutions?
P.S.Commercial librarys like Aspose.Words are awesome, but project is too small to buy license.
P.P.S. Those two formats will be rearly used, so performance is not very big issue. Documents quite simple, no more than simple 3 tables.

Related

"Best" Input File Formats for C++? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I am starting work on a new piece of software that will end up needing some robust and expandable file IO. There are a lot of formats out there. XML, JSON, INI, etc. However, there are always plusses and minuses so I thought I would ask for some community input.
Here are some rough requirements:
The format is a "standard"...I don't want to reinvent the wheel if I don't have to. It doesn't have to be a formal IEEE standard, but something you could Google and get some information on as a new user, may have some support tools (editors) beyond vi. (Though the software users will generally be computer savvy and happy to use vi.)
Easily integrates with C++. I don't want to have to pull along a 100mb library and three different compilers to get it up and running.
Supports tabular input (2d, n-dimensional)
Supports POD types
Can expand as more inputs are required, binds well to variables, etc.
Parsing speed is not terribly important
Ideally, as easy to write (reflect) as it is to read
Works well on Windows and Linux
Supports compositing (one file referencing another file to read, and so on.)
Human Readable
In a perfect world, I would use a header-only library or some clean STL implementation, but I'm fine with leveraging Boost or some small external library if it works well.
So, what are your thoughts on various formats? Drawbacks? Advantages?
Edit
Options to consider? Anything else to add?
XML
YAML
SQLite
Google Protocol Buffers
Boost Serialization
INI
JSON
There is one excellent format that meets all your criteria:
SQLite!
Please read article about using SQLite as an application file format. Also, please watch Google Tech Talk by D. Richard Hipp (SQLite author) about this very topic.
Now, lets see how SQLite meets your requirements:
The format is a "standard"
SQLite has become format of choice for most mobile environments, and for many desktop apps (Firefox, Thunderbird, Google Chrome, Adobe Reader, you name it).
Easily integrates with C++
SQLite has standard C interface, which is only one source file and one header file. There are C++ wrappers too.
Supports tabular input (2d, n-dimensional)
SQLite table is as tabular as you could possibly imagine. To represent say 3-dimensional data, create table with columns x,y,z,value and store your data as a set of rows like this:
x1,y1,z1,value1
x2,y2,z2,value2
...
Supports POD types
I assume by POD you meant Plain Old Data, or BLOB. SQLite lets you store BLOB fields as is.
Can expand as more inputs are required, binds well to variables
This is where it really shines.
Parsing speed is not terribly important
But SQLite speed is superb. In fact, parsing is basically transparent.
Ideally, as easy to write (reflect) as it is to read
Just use INSERT to write and SELECT to read - what could be easier?
Works well on Windows and Linux
You bet, and all other platforms as well.
Supports compositing (one file referencing another file to read)
You can ATTACH one database to another.
Human Readable
Not in binary, but there are many excellent SQLite browsers/editors out there. I like SQLite Expert Personal on Windows and sqliteman on Linux. There is also SQLite editor plugin for Firefox.
There are other advantages that SQLite gives you for free:
Data is indexable which makes it very fast to search. You just cannot do this using XML, JSON or any other text-only formats.
Data can be edited partially, even when amount of data is very large. You do not have to rewrite few gigabytes just to edit one value.
SQLite is fully transactional: it guarantees that your data is consistent at all times. Even if your application (or whole computer) crashes, your data will be automatically restored to last known consistent state on next first attempt to connect to the database.
SQLite stores your data verbatim: you do not need to worry about escaping junk characters in your data (including zero bytes embedded in your strings) - simply always use prepared statements, that's all it takes to make it transparent. This can be big and annoying problem when dealing with text data formats, XML in particular.
SQLite stores all strings in Unicode: UTF-8 (default) or UTF-16. In other words, you do not need to worry about text encodings or international support for your data format.
SQLite allows you to process data in small chunks (row by row in fact), thus it works well in low memory conditions. This can be a problem for any text based formats, because often they need to load all text into memory to parse it. Granted, there are few efficient stream-based XML parsers out there, but in general any XML parser will be quite memory greedy compared to SQLite.
Having worked quite a bit with both XML and json, here's my rather subjective opinion of both as extendable serialization formats:
The format is a "standard": Yes for both
Easily integrates with C++: Yes for both. In each case you'll probably wind up with some kind of library to handle it. On Linux, libxml2 is a standard, and libxml++ is a C++ wrapper for it; you should be able to get both of those from your distro's package manager. It will take some small effort to get those working on Windows. There appears to be some support in Boost for json, but I haven't used it; I've always dealt with json using libraries. Really, the library route is not very onerous for either.
Supports tabular input (2d, n-dimensional): Yes for both
Supports POD types: Yes for both
Can expand as more inputs are required: Yes for both - that's one big advantage to both of them.
Binds well to variables: If what you mean is some way inside the file itself to say "This piece of data must be automatically deserialized into this variable in my program", then no for both.
As easy to write (reflect) as it is to read: Depends on the library you use, but in my experience yes for both. (You can actually do a tolerable job of writing json using printf().)
Works well on Windows and Linux: Yes for both, and ditto Mac OS X for that matter.
Supports one file referencing another file to read: If you mean something akin to a C #include, then XML has some ability to do this (e.g. document entities), while json doesn't.
Human readable: Both are typically written in UTF-8, and permit line breaks and indentation, and thus can be human-readable. However, I've just been working with a 479 KB XML file that's all on one line, so I had to run it through a prettyprinter to make sense of it. json can also be pretty unreadable, but in my experience is often formatted better than XML.
When starting new projects, I generally prefer json; it's more compact and more human-readable. The main reason I might select XML over json would be if I were worried about receiving badly-formed documents, since XML supports automated document format validation, while you have to write your own validation code with json.
Check out google buffers. This handles most of your requirements.
From their documentation, the high level steps are:
Define message formats in a .proto file.
Use the protocol buffer compiler.
Use the C++ protocol buffer API to write and read messages.
For my purposes, I think the way to go is XML.
The format is a standard, but allows for modification and flexibility for the schema to change as the program requirements evolve.
There are several library options. Some are larger (Xerces-C) some are smaller (ezxml), but there are many options, so we won't be locked in to a single provider or very specific solution.
It can supports tabular input (2d, n-dimensional). This requires more parsing work on "our" end, and is likely the weakest point for XML.
Supports POD types: Absolutely.
Can expand as more inputs are required, binds well to variables, etc. through schema modifications and parser modifications.
Parsing speed is not terribly important, so processing a text file or files is not an issue.
XML can be programmatically written just as easily as read.
Works well on Windows and Linux or any other OS that supports C and text files.
Supports compositing (one file referencing another file to read, and so on.)
Human Readable with many text editors (Sublime, vi, etc.) supporting syntax highlighting out of the box. Many web browsers display the data well.
Thanks for all the great feedback! I think if we wanted a purely binary solution, Protocol Buffers or boost::serialization is likely the way that we would go.

KORMARC to MARC21 converter

Does anyone know if there is a free open-source solution to convert KORMARC (Korean MARC) into MARC21 (aka USMARC)?
While I'm not certain it has KORMARC support, you may want to try USEMARCON if you can find a mapping. From the USEMARCON page:
USEMARCON facilitates the conversion of catalogue records from one MARC format to another e.g. from UKMARC to UNIMARC. The software was designed as a toolbox-style application, allowing users with detailed knowledge of the source and target MARC formats to develop rules governing the behaviour of the conversion. Rules files may be supplemented by additional tables for more accurate conversion of MARC-specific character sets or coded information. The tables and rules files are simple ASCII text files and can be created using any standard text editor such as MS Windows Notepad.
Also, this thread from the Ask a Korean Studies Librarian Google Group might be useful, particularly the following message:
Library of Congress once tried to download records from the National
Library of Korea (NLK) to use as order records. LC wrote a
specification and developed a in-house program to convert KORMARC to
USMARC. Since NLK records only provide script, LC used a
transliterator to provide romanization for Voyager system developed by
non-LC programmer. The feedback of this method is not very positive
by LC staff. ... In stead of converting KORMARC to USMARC, a few research libraries
including LC is currently using MarcEdit with Excel spreadsheets which
are provided by Korean vendors based on contract. Vendors provide
both Korean script and romanization for several elements of MARC
fields (ISBN, title, author, publisher, place, series, etc.) in
different columns of spreadsheet for your order items. It sounds a
lot simpler to set up initially. And once MarcEdit is set up
properly, it creates MARC records.

psd file format

I am attempting to find documentation of the psd file format so I can read in a .psd and then save out the individual layers as files, along with do other modifications. Does anyone know of any document in on the .psd file format? (Just for reference, I will be writing this in C++)
If there are any code examples of loading a .psd file in C++ then I would appreciate them being linked.
(Please not turn this into a "just use XXX software". This is not homework, or anything related to that. I am doing this because I think it will be a fun project to work on. I will ask for posts to be down voted if this happens.)
There's also some Objective-C code on GitHub (should be easily understandable for anyone with a C++ background), also source of this gem, which appears to sum it up nicely:
At this point, I'd like to take a moment to speak to you about the Adobe PSD format.
PSD is not a good format. PSD is not even a bad format. Calling it such would be an
insult to other bad formats, such as PCX or JPEG. No, PSD is an abysmal format. Having
worked on this code for several weeks now, my hate for PSD has grown to a raging fire
that burns with the fierce passion of a million suns.
If there are two different ways of doing something, PSD will do both, in different
places. It will then make up three more ways no sane human would think of, and do those
too. PSD makes inconsistency an art form. Why, for instance, did it suddenly decide
that these particular chunks should be aligned to four bytes, and that this alignement
should not be included in the size? Other chunks in other places are either unaligned,
or aligned with the alignment included in the size. Here, though, it is not included.
Either one of these three behaviours would be fine. A sane format would pick one. PSD,
of course, uses all three, and more.
Trying to get data out of a PSD file is like trying to find something in the attic of
your eccentric old uncle who died in a freak freshwater shark attack on his 58th
birthday. That last detail may not be important for the purposes of the simile, but
at this point I am spending a lot of time imagining amusing fates for the people
responsible for this Rube Goldberg of a file format.
Earlier, I tried to get a hold of the latest specs for the PSD file format. To do this,
I had to apply to them for permission to apply to them to have them consider sending
me this sacred tome. This would have involved faxing them a copy of some document or
other, probably signed in blood. I can only imagine that they make this process so
difficult because they are intensely ashamed of having created this abomination. I
was naturally not gullible enough to go through with this procedure, but if I had done
so, I would have printed out every single page of the spec, and set them all on fire.
Were it within my power, I would gather every single copy of those specs, and launch
them on a spaceship directly into the sun.
PSD is not my favourite file format.
Just so you are warned. :)
This will not be a fun project, the .psd format is big. It incorporates every feature Adobe has put into Photoshop over many years.
I believe the specification can be had from Adobe, but they don't just hand it out to the public. You'll have to contact them and jump through some hoops first.
The PSD file format specification as written by Adobe is here;
http://www.adobe.com/devnet-apps/photoshop/fileformatashtml/
Last update: June 2012. As far as I know this is the best available source about the PSD file format even there are few mistakes.
First I recommend starting by dividing PSD into blocks.
Enjoy!
MyPSD::CPSD class is a C++ class that can load images saved in Adobe's Photoshop native format.
http://www.codeproject.com/Articles/10398/Import-Adobe-Photoshop-psd-images
MolecularMatters psd_sdk seems like a good library to take inspiration form: https://github.com/MolecularMatters/psd_sdk
It allows to read layers from a .psd file and much more.

How to convert pdf documents to html files?

Should remain format,looks almost the same as original.
A couple of examples:
This page discusses how to use software called pdftohtml to convert in Ubuntu.
This page lists shareware (probably Windows) which converts PDF to various MS formats, including htm.
I even found a couple of videos (a Google video and one on www.break.com). I didn't look at them because I think they'll just describe how to use some software.
These are obviously unsatisfactory if you want to know how to do it yourself.
I think PDF started out as a compressed 'postscript' file, but these days would probably contain images (of scanned documents, for example).
If that's the case, don't bother looking for text, you can extract the images and create HTML pages to display the images. This should at least enable you to preserve the formatting.
At the very least, you could screen-capture the PDF pages to create the images. Crude, I know, but it would work whether the PDF was postscript or images.

library for doing diffs

I've been tasked with creating a tool that can diff and merge the configuration files for my company's product. The configurations are stored as either XML or URL-encoded strings. I'm looking for a library, preferably open source with a license compatible with commercial software, that can do these diffs. Our app is written in C++, so C++ libraries would be best, but I'm willing to look at libraries that are C#-specific since I can write a wrapper that exposes it to C++ via COM. Three-way diffs would be ideal, but two-way is acceptable. If it has an understanding of XML, that would also be a plus (since XML nodes can be reordered without changing the document, etc). Any library suggestions? Should I even consider writing my own diff tools in the hopes of giving it semantic knowledge of our formats?
Thanks to this similar question, I've already discovered this google library, which seems really great, but I'm still looking for other options. It also seems to be able to output the diffs in HTML format (using the <ins> and <del> tags that I didn't know existed before I discovered it), which could be really handy, but it seems to be a unified diff only. I'm going to need to display the results in a web browser, and probably have to build an interface for doing the merges in the browser as well. I don't expect a library to be able to help with these tasks, but it must produce output in a format that is amenable to me building this on top of it. I'm currently envisioning something along the lines of TortoiseMerge (side-by-side diffs, not unified), except browser-based. Any tips/tricks/design ideas on how to present this would be appreciated too.
Subversion comes with libsvn_diff and libsvn_delta licensed under Apache Software License.
Here is a C++ library that can diff what the author calls semistructured data. It deals nicely with HTML and XML. Since your data is XML it would make a lot of sense to use this instead of plain text diff. This is especially the case when the files are machine generated.
I am currently trying to use this library to build a tool that diffs Visual Studio project files. These are basically XML files and using a plain diff tool like Winmerge is too painful because Visual Studio pretty much mucks up the whole file by crazy reordering. The idea is to do some kind of a structured diff to address the problem.
For diffing the XML I would propose that you normalize it first: sort all the elements in alphabetic order, then generate a stream of tokens/xml that represents the original document but is independent of the original formatting. After running the diff, parse the result to get a tree containing what was added / removed.