So, I've got an app that needs to deal with files created by Adobe InDesign (.INDD), and while the XMP Metadata is useful, there are additional things that I want to know about the files that do not appear to be in the metadata.
Specifically, I would want to know the number of actual pages (not just number of page previews created), and what the dimensions of those pages are.
Has anyone run across any toolkit, sdk, etc. that can get me this information?
This will be for a non-open source commercial app, so licenses are a potential roadblock. Also, this app will not be a plug-in for any Adobe product, so the InDesign Plugin SDK is not an option either.
C++ is the preferred language.
.indd is a proprietary format owned by Adobe, and Adobe does not publish a specification for it, so there is no supported way to read it outside of InDesign. If the documents are saved in the .idml format, which is a ZIP package of XML files, it's quite possible and not very difficult. But if all you have to work with is a bunch of .indd files that someone else created, you're going to have to use a plugin or scripts together with InDesign.
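For the .idml case, a minimal sketch of counting pages and reading their dimensions might look like the following. It is written in Python for brevity, and it assumes that each page appears as a `Page` element inside a `Spreads/*.xml` part and carries a `GeometricBounds` attribute of the form "top left bottom right" in points. The function name `idml_page_sizes` is made up for this example, and real documents may need extra handling (for instance, page order is defined by designmap.xml, not by file name).

```python
import zipfile
import xml.etree.ElementTree as ET


def idml_page_sizes(idml_file):
    """Return a list of (width, height) tuples in points, one per page found.

    Assumption: pages are <Page> elements in Spreads/*.xml with a
    GeometricBounds attribute ordered as "top left bottom right".
    """
    sizes = []
    with zipfile.ZipFile(idml_file) as package:
        # Sorted by part name only; true page order comes from designmap.xml.
        for name in sorted(package.namelist()):
            if not (name.startswith("Spreads/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(package.read(name))
            for element in root.iter():
                # Compare local names so a namespace prefix does not matter.
                if element.tag.rsplit("}", 1)[-1] != "Page":
                    continue
                parts = element.get("GeometricBounds", "").split()
                if len(parts) != 4:
                    continue  # skip pages without usable bounds
                top, left, bottom, right = (float(value) for value in parts)
                sizes.append((right - left, bottom - top))
    return sizes
```

Calling `len(idml_page_sizes("document.idml"))` would then give the page count, and each tuple the width and height of one page.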
Larger context: we're working on an intranet portal's search engine, which needs to be able to search within all Office file types: doc, docx, xls, xlsx, ppt, and pptx. With the search algorithm already in place, we've implemented the indexer using Office automation; however, the client is concerned that this is (1) error-prone and (2) not recommended by Microsoft (and also not covered by their license).
I've read the previous answers on this topic on SO, but they would require us to integrate a very large number of distinct libraries to cover all the edge cases, which we don't have the resources to do.
Hence, we're looking for a simple web service to which we can submit any of these documents and which would return simple plain-text output (or HTML, or even PDF; we've got parsers for both).
Are there any such services (free or paid) that cover all of the file formats above?
Many thanks.
I would suggest trying Apache Tika: it's free and open source. It can extract text content from MS Office file formats (and from other popular formats, too). A server application is included, which you can run on your own server.
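As a rough sketch of what talking to a self-hosted Tika server could look like: the server accepts a document via HTTP PUT on its `/tika` endpoint and returns plain text when asked with `Accept: text/plain`. The address `http://localhost:9998` is the server's default port but is an assumption about your deployment, and the helper names are made up for this example.

```python
import urllib.request


def build_tika_request(document_bytes, server="http://localhost:9998"):
    """Build a PUT request asking a Tika server for plain-text output."""
    return urllib.request.Request(
        url=server.rstrip("/") + "/tika",
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(document_bytes, server="http://localhost:9998"):
    """Send the document to the Tika server and return the extracted text."""
    request = build_tika_request(document_bytes, server)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

The indexer would read each doc/docx/xls/xlsx/ppt/pptx file as bytes and pass it to `extract_text`, with no Office installation involved.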
I'm not sure about a service; however, if you can manage and deploy three .NET assemblies for DOC/DOCX, XLS/XLSX, and PPT/PPTX, then you may try the Aspose components: Aspose.Words, Aspose.Cells, and Aspose.Slides respectively. These DLLs don't require MS Office to be installed on your server, and they work fine on any Windows OS and in both 32-bit and 64-bit environments. You may also see the documentation. These components provide many advanced features for dealing with document elements as well. Please see if this might help in your scenario.
Disclosure: I work as a developer evangelist at Aspose.
Is anyone aware of a good, general purpose file preview component for MFC/C++ desktop applications?
Specifically, I'm looking for a component that I could embed in my application that would allow a broad range of file types (text files, multimedia, etc.) to be previewed without the need for original applications (such as MS Word, etc.) to be installed.
I could only find one, via Google:
http://www.file-viewer-sdk.com/
Unfortunately, these folks want $60k for unlimited redistribution, which is outside of our budget.
Anyone have any recommendations? If not a component, is anyone using another general-purpose strategy that works well for them?
You can write your own shell preview host once you know the interfaces.
You might want to check out AutoVue, originally made by Cimmetry and since acquired by Oracle.
Our product makes limited use of their SDK to do some document conversions (mostly RTF->PS), and that works well enough for us.
What solutions are there? I only know of solutions for replacing bookmarks in Word (.doc) files with Apache POI.
Are there also ways to change images, layouts, and text styles in .doc and .ppt documents?
I'm thinking of replacing regions in Word and PowerPoint documents as part of bulk processing.
Platform: MS-Office 2003
What are your platform limitations?
Obviously Apache POI will get you at least part of the way there.
Microsoft's own COM APIs are fairly powerful and are documented here. I would recommend using them if a) you are not running in a server (many users, multithreaded) environment; b) you can have a proper version of PowerPoint installed on the production machine; and c) you can code against a COM object model.
It's a bit pricey, but Aspose.Slides is a very powerful library for manipulating PowerPoint files.
If you include using other Office suites as an option, here's a list of possible solutions:
Apache POI-HSLF
PowerPoint 2007 APIs
OpenOffice.org UNO
Using POI you can't edit the .pptx file format, but you don't depend on applications installed on the system. The other two options, by contrast, rely on other applications, but they are definitely better for dealing with presentations. OpenOffice has better compatibility with older formats, by the way. Also, if you use UNO, you'll have a wide choice of languages: UNO bindings exist for Java, C++, Python and other languages.
My experience is not directly with PowerPoint, but I've actually rolled my own WordML (XML) generator. It a) removed all dependencies on Word, b) was very fast, and c) let me build up documents from scratch.
But it was a lot of work to create, and I was only creating a write-only implementation.
I'm not as familiar with PowerPoint, so this is conjecture, but you may be able to roll your own by reading XML (PowerPoint 2003??) and/or cracking the Office Open XML file (zipped XML), then using XPath to manipulate the data, and then saving everything back to disk.
This won't work on the older OLE Compound Document based PowerPoint files, though.
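To make the "crack the zip, edit the XML, save it back" idea concrete, here is a small Python sketch for .pptx files. It assumes slide parts live under `ppt/slides/slideN.xml` and that visible text sits in DrawingML `a:t` elements. The function name `replace_slide_text` is made up for this example, and ElementTree may rewrite namespace prefixes on the slides it touches, so treat it as an illustration, not a production-ready round trip.

```python
import zipfile
import xml.etree.ElementTree as ET

DRAWINGML = "http://schemas.openxmlformats.org/drawingml/2006/main"


def replace_slide_text(source, target, old, new):
    """Copy a .pptx package, replacing `old` with `new` in slide text runs."""
    with zipfile.ZipFile(source) as src, \
            zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            is_slide = (item.filename.startswith("ppt/slides/slide")
                        and item.filename.endswith(".xml"))
            if is_slide:
                root = ET.fromstring(data)
                changed = False
                # XPath-style lookup of every text run on the slide.
                for run_text in root.findall(".//{%s}t" % DRAWINGML):
                    if run_text.text and old in run_text.text:
                        run_text.text = run_text.text.replace(old, new)
                        changed = True
                if changed:
                    # Only re-serialize slides that were actually modified.
                    data = ET.tostring(root, encoding="utf-8",
                                       xml_declaration=True)
            dst.writestr(item.filename, data)
```

A placeholder split across several runs (PowerPoint does this when formatting changes mid-word) would not be matched by this simple per-run replacement.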
I've done something like that before: programmatically accessed and manipulated PowerPoint presentations. Back when I did it, it was all in C++ using COM, but similar principles apply to C#/VB .NET apps, since they do COM interop very easily.
What you're looking for is called the Office Document Model. Basically, Office applications expose their documents programmatically, as trees of objects that define their contents. These objects are accessible via an API, and you can manipulate them, add new ones, and do whatever other processing you want. It's exceedingly powerful; you can use it to manipulate pretty much all aspects of a document. But you'll need an installation of Office and Visual Studio to be able to use it.
Some links:
Intro: http://msdn.microsoft.com/en-us/library/d58327k6.aspx
Hope this helps!
Apparently new users can only include one link per posting. How lame! :)
Here's the other link I meant to include:
Example of manipulating PowerPoint documents programmatically: http://msdn.microsoft.com/en-us/library/cc668192.aspx
Our win32 applications (written in C++) have been around for over 10 years, and haven't been updated to follow "good practices" in terms of where they keep files. The application defaults to installing in the "C:\AppName" folder, and keeps application-generated files, configuration files, downloaded files, and saved user documents in subfolders of that folder.
Presumably, it's "best practices" to default to installing under "c:\Program Files\AppName" nowadays. But if we do that, where should we keep the rest of our files? Starting from Vista, writing to the program files folder is problematic, and there seem to be a million other places that you can put different files, and I'm confused.
Is there a reference somewhere for what goes where?
Edit: To expand on questions people have asked so far:
I'm familiar with the SHGetFolderPath function, but there are lots and lots of options that you can get from it, and I can't find a resource that says "Here is exactly what each of these options is used for, and when you might want to use it".
Up until now, we've done the "All files, including saved user files, under one folder" thing, and it's worked fine - but not when people want to install the app under the Program Files folder. For some reason, the virtualization monkeying around that Vista does isn't working for our application; if we're going to be making changes anyway, we might as well make an effort to do things the "right" way, since we don't want to have to change it again in 12 months time.
Further question:
We include some "sample" documents with our app, which we update every now and again. Is it appropriate to install them into My Documents, if we'll be overwriting them every few months? Or is My Documents assumed to be totally safe for users to mess around in?
If we can't install them to My Documents, where should we put them so that users can see them easily?
Presumably, it's "best practices" to default to installing under "c:\Program Files\AppName"
Close, but not quite. Users can configure the name of the Program Files folder and may not even have a C: drive. Instead, install to %ProgramFiles%\AppName, resolving the base folder from the %ProgramFiles% environment variable.
Note you should assume you only have read access to this folder after the installation has finished. For program data files where you might need write access, use %AppData%\AppName.
Finally, are you sure yours is the only app with that name? If you're not 100% certain of that, you might want to include your company name in there as well.
The mechanisms you use to retrieve those variables will vary depending on your programming platform. It normally comes down to the SHGetFolderPath() Win32 function in the end, but different platforms like Java or .NET may provide simpler abstractions as well.
Some guidelines are in this Knowledge Base article: How to write a Windows XP Application that stores user and application data in the correct location by using Visual C++. Also, if you search MSDN for Windows Logo Program you will find documentation regarding what an app needs to do to be truly compliant.
SHGetKnownFolderPath can get you the directories you need. If backwards compatibility with XP and earlier is required, use the deprecated SHGetFolderPath.
Having said that, if your app came with documentation that said "everything used by this app is in this directory" I would love it ;)
Use the Windows SHGetFolderPath() function to get the correct directories.
Edit: To reply to your other question, added in the edit: Where to put the sample files of your application does very much depend on whether your application is installed for a single user or for all users, and whether the person installing the application can be assumed to be the one who uses it.
If your program is to be used by multiple users on a system, copying stuff into "My Documents" is not going to work - the files would be accessible only for the user installing the application. Worse, if the only user of your application needed to install as Administrator, then [s]he will not have access to the files either. So unless you are fairly certain that there is only one user for your application, and they have sufficient permissions to install the application using their own account, don't use "My Documents".
IMO you should install sample files into the directory identified by CSIDL_COMMON_APPDATA. This will give you exactly one copy for all users, and since you want every user to see the original, unaltered sample files all users should consider them read-only. In fact, your setup program should probably make them read-only. Opening one of the samples will work for all users, but as soon as they try to save their modifications the application should detect that the file is read-only, and open the "Save As" dialog, pointing to "My Documents" or suitable directory inside. That will also keep all user modifications when the installer updates the sample files later on.
It is of course somewhat more difficult for the users to find the sample files. You could add a link to the samples folder to the start menu group of your application, so that access to the files is fast, and of course you should properly document everything.
For your application binaries, you can assume that you may write to the Program Files directory at install time (use the %ProgramFiles% environment variable to support installations other than the default English version, e.g. in German installations this will be c:\Programme by default). Wikipedia lists the most common variables. Another option is the SHGetFolderPath function or the newer SHGetKnownFolderPath function.
For user data, you should assume that the application is running with limited access rights and may only write to the user's home directory. The same applies to registry entries. This path should probably be configurable by the user, as the home directory may actually be on a network server and a user might have a second disk attached for data storage. For information on the current (Vista) filesystem guidelines see this article.
Regarding plugins, this might be more complicated. The best practice seems to be offering the option to install for the current user only, placing the plugin in the user directory, or to install for all users and place the files into your program files directory (but remember to check for write permission and request elevated access if needed).
There are plenty of environment variables, like %USERPROFILE%, %HOMEPATH% and %APPDATA%, all of which point to user-specific directories where you can put your user-specific files.
For system-wide storage you can use %ALLUSERSPROFILE%; that is the place where you should put read/write data files that are not specific to any user.
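As a small illustration of the environment-variable approach, here is a sketch in Python (the same lookups work with `getenv` in C++). The function name `storage_locations`, the company/app folder layout, and the example names "Contoso" and "AppName" are assumptions made for this example. The SHGetFolderPath family mentioned elsewhere in this thread remains the more robust route, since environment variables can be missing or overridden.

```python
import ntpath
import os


def storage_locations(app_name, company, env=None):
    """Map each kind of file to a base folder taken from the environment.

    Returns None for a location whose variable is not set.
    """
    env = os.environ if env is None else env

    def under(variable):
        base = env.get(variable)
        # ntpath builds Windows-style paths regardless of the host platform.
        return ntpath.join(base, company, app_name) if base else None

    return {
        "binaries": under("ProgramFiles"),         # read-only after install
        "per_user_data": under("APPDATA"),         # settings, per-user files
        "shared_data": under("ALLUSERSPROFILE"),   # data shared by all users
    }
```

For example, `storage_locations("AppName", "Contoso")` on a typical machine would place per-user data under the user's %APPDATA% folder in `Contoso\AppName`.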
Sorry I don't know the correct answer, but...
Do you have a business case for wanting to do that? Are your customers complaining that files aren't stored where they expect? Are your applications crippled in some way because you store files in non-standard locations? If not, I don't see a reason for spending time and budget to redo your file storage strategy just to meet "best" practice. If your programs just work, then IMHO you should leave them alone and spend money and time on things that matter.
There is a directory structure under c:\users for user oriented data.
There is documentation for porting apps from older windows OSs to Vista.
Check out http://www.innovateon.com and follow the links to Vista. There is documentation regarding certification that has the details on topics like this.
We have a similar app created ~10 years ago using MFC. The easiest thing to do was create a folder right off of C:\ (e.g. C:\OurApp). No install files, no special permissions, no registry changes, etc. Clients (and particularly their sys admins) LOVE it.
One other consideration - are you planning to all of a sudden change the installation folder for existing clients (assuming this is installed in many locations)? If something isn't broke, why fix it?
We're in the middle of deploying a new software system to lots of users in lots of places (200+ users across 8 countries). In the past we've written a manual for the users and then updated it every so often. This works OK, in that all the users have the same manual and it covers the main things, but it has its problems: it doesn't get updated that often, we sometimes miss updates, and some users will have old copies.
We've been talking about using a wiki during the testing and deployment phases to build a knowledge base about the system. Ideally we'd then like some way to convert that into some form of electronic document that we can then prettify and send out as the official manual, as well as letting users use and update the wiki.
Has anyone else done anything similar? Any suggestions for wiki systems, workflows, document formats, etc.?
Most wikis support export to PDF, e.g.:
MediaWiki PDF Export
DokuWiki PDF Export
TWiki PDF Export
You can write something that generates LaTeX from the wiki and renders a manual to PDF. With packages like hyperref you can retain cross-references as hyperlinks.
Additionally, you can integrate content from multiple sources such as a data dictionary into the LaTeX document, which can be mixed and matched with the wiki content. You could also set the architecture up so it can support cross-referencing that goes either way.
FrameMaker could also support this using generated MIF files, and you could also use Lout in a similar way, or convert your wiki content to DocBook, which would allow you to use any of the many rendering options available for that format.
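To give a feel for the wiki-to-LaTeX route, here is a deliberately tiny Python sketch. It assumes MediaWiki-style markup and handles only `== Heading ==`, `=== Subheading ===` and `[[Page]]` links, turning links into hyperref cross-references. The function names and the label scheme are made up for this example; a real converter would need to cover far more markup (and escape more characters, such as backslashes and braces).

```python
import re

_SPECIALS = {"&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_"}


def escape(text):
    """Escape a few characters that LaTeX treats specially."""
    return "".join(_SPECIALS.get(ch, ch) for ch in text)


def label(title):
    """Turn a page or section title into a LaTeX-safe label."""
    return re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()


def wiki_to_latex(wiki_text):
    """Convert a small subset of wiki markup to LaTeX body text."""
    lines = []
    for line in wiki_text.splitlines():
        heading = re.fullmatch(r"(={2,3})\s*(.+?)\s*\1", line.strip())
        if heading:
            command = "section" if len(heading.group(1)) == 2 else "subsection"
            title = heading.group(2)
            lines.append("\\%s{%s}\\label{%s}"
                         % (command, escape(title), label(title)))
            continue
        # re.split with a capturing group puts link targets at odd indices.
        parts = re.split(r"\[\[([^\]|]+)\]\]", line)
        converted = []
        for index, part in enumerate(parts):
            if index % 2:
                converted.append("\\hyperref[%s]{%s}"
                                 % (label(part), escape(part)))
            else:
                converted.append(escape(part))
        lines.append("".join(converted))
    return "\n".join(lines)
```

The output is meant to be dropped into a document that loads the hyperref package, so each `[[Page]]` link becomes a clickable reference to the section with the matching label.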
As an aside, the following Stack Overflow postings discuss various systems for maintaining documentation.
Application (Not a Markup Language) for Producing a User Manual
Can LaTeX be used for producing any documentation that accompanies software?
What tools are used to write documentation?
What tools does your team use for writing user manuals?
How best to write documentation (ideally in latex) targeting both the web (html) and paper (pdf)?
Best tool(s) for working with DocBook XML documents?
What is the recommended toolchain for formatting XML DocBook?
Is a successor for TeX/LaTeX in sight?
MadCap Flare is a help-and-manual authoring tool that uses HTML for the source of each topic. You could pretty easily do a mass import of the wiki pages. The import would then require some cleanup, but after that you have a nice single-source system that can output CHM, web-browsable help, PDF, DOC/DOCX, etc.
How are you storing the help source at the moment? Is it MS Word files, MS help, LaTeX?
If you put your help source files under version control then you will get all the benefits of a wiki without having to migrate to a new system - people can make edits to the help files easily - those changes can be tracked, reverted etc. and you get the prettified manuals as before.
I followed Node's links and came across some MediaWiki pages that I thought were noteworthy.
Extension:OpenDocument Export
Extension:PDF Writer
Category:Data extraction extensions
I gave a previous answer which may be useful for the "wiki to PDF" part -- look at using the open source PediaPress code or functionality. You can get ODFs from it too, although their PDFs are already quite pretty (but you might want to rebrand it and restyle it for your company I suppose).