We have a requirement to extract the content of an uploaded PDF and then show it in a text area with the same formatting. Our application runs on ColdFusion 10.
We used cfpdf, but the extracted text comes back as a single paragraph with no formatting.
Is there any way in ColdFusion to read a PDF as text while retaining the source formatting? (100% accuracy is not expected.)
Note:
We tried iText and it works very well; however, it comes with a licensing cost that I don't want to pay, as I would only use this single feature from the bundle that iText provides.
We also tried PDFBox, but the formatting is not retained.
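One library-independent way to approximate the layout is to work from positioned text fragments rather than the flat text stream. Most PDF libraries can report the coordinates of each run of text (PDFBox, for instance, exposes them through its TextPosition class). The lines can then be rebuilt on a character grid. The sketch below is in Python and purely illustrative: the `Fragment` type, the `render_layout` function, and the `char_width` and `line_tolerance` parameters are assumptions made for this example, not part of any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """A run of text plus the position of its left edge (hypothetical type)."""
    x: float   # distance from the left page edge, in points
    y: float   # distance from the top page edge, in points
    text: str


def render_layout(fragments: List[Fragment],
                  char_width: float = 6.0,
                  line_tolerance: float = 3.0) -> str:
    """Place fragments on a monospaced grid so indents and columns survive."""
    # Group fragments into lines: a fragment joins the current line when its
    # y value is within line_tolerance of that line's first fragment.
    lines: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    rendered = []
    for line in lines:
        out = ""
        for frag in sorted(line, key=lambda f: f.x):
            column = int(round(frag.x / char_width))
            if out and column <= len(out):
                # Fragments overlap on the grid: keep a single space between them.
                out += " "
            else:
                # Pad with spaces up to the fragment's column.
                out += " " * (column - len(out))
            out += frag.text
        rendered.append(out)
    return "\n".join(rendered)
```

The result is only faithful in a monospaced font, so the text area would need one. `char_width` has to be tuned to the document's typical glyph width.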
We are planning to render millions of PDFs with Apache FOP, using XSL-FO as input.
Is there a decent WYSIWYG XSLT designer that makes it easy to design an XSLT stylesheet to transform the XML input data into the XSL-FO that FOP requires?
I see a lot of commercial ones (Ecrion, Antenna House, ...). Are there any open source ones?
The only somewhat decent editor I have found is MiniScribus Scribe, but I got stuck when I wanted to insert a horizontal line, and the ODT file I opened lost its table formatting in Scribe. It says that it doesn't yet support headers/footers or table borders, so not so decent.
There are some converters that could be of good use, such as HTML-to-FO and ODT-to-FO converters, but the FO code they generated caused a lot of exceptions in Apache FOP. The ODT/HTML file I was testing with was a single page containing only a table, two horizontal lines, and some unformatted text.
These tools, the converters as well as the editor, are still in beta, so maybe there will be a decent solution eventually; so far I have not been able to find one.
I would like to convert 100+ RTF files to Wiki Markup, but I can only find "Wiki to RTF" converters on the web and even here on StackOverflow.
I only need RTF --> Wiki Markup
Is there anything like this out there?
I simply asked the wrong question.
I did some research and found that there is no converter that turns RTF directly into a "Wiki format".
The "better" question: Save Word file as Wiki markup.
There are some approaches using Microsoft Word to save as .txt (Wiki markup):
http://www.mediawiki.org/wiki/Extension:Word2MediaWikiPlus
http://www.microsoft.com/en-us/download/details.aspx?id=12298
http://techwiki.openstructs.org/index.php/Wiki_converters
http://www.consumingexperience.com/2008/03/convert-word-doc-or-webpage-to-wiki.html
Good luck!
It may be unwieldy with hundreds of files, but I have used wikEd to convert RTF and Word formatted text to wiki markup.
Wikedbox is a usable implementation of wikEd without installing it:
http://www.appropedia.org/index.php?title=Appropedia:Wikedbox&action=edit
'''Wikedbox HELPS YOU CONVERT HTML (WEB FORMATTED CONTENT) TO MEDIAWIKI.'''. More instructions at [[Appropedia:WikEd]].
INSTRUCTIONS:
If you're not in edit mode, click the "edit" tab now.
Paste in your content.
Click the red [W] to convert. (Middle row, second block from the left.)
Your text is ready to be copied and used on your wiki page. (You may need to make further corrections.)
'''DO NOT SAVE THIS PAGE'''. (You'll be blocked from doing so, anyway, unless you're an admin.)
You can use Pandoc:
pandoc -f rtf -t mediawiki example.rtf -o example.wiki
This converts the RTF file to MediaWiki markup. It needs a Pandoc version that can read RTF; older releases could only write it. I don't know which wiki you want to use or whether it understands Markdown (use -t markdown for that), but Pandoc can also write other output formats (see the Pandoc User's Guide).
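For the 100+ files mentioned in the question, the Pandoc call can be scripted. The following is a minimal Python sketch, assuming a Pandoc build that includes the RTF reader and a `pandoc` executable on the PATH. The function names, the `.wiki` extension, and the directory layout are illustrative choices, not anything Pandoc prescribes.

```python
import subprocess
from pathlib import Path
from typing import List


def pandoc_command(rtf_file: Path, out_dir: Path) -> List[str]:
    """Build the pandoc argument list for one RTF file."""
    target = out_dir / (rtf_file.stem + ".wiki")
    return ["pandoc", "-f", "rtf", "-t", "mediawiki",
            str(rtf_file), "-o", str(target)]


def convert_all(src_dir: Path, out_dir: Path) -> None:
    """Convert every .rtf file in src_dir, one pandoc call per file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for rtf_file in sorted(src_dir.glob("*.rtf")):
        # check=True raises CalledProcessError if pandoc reports a failure.
        subprocess.run(pandoc_command(rtf_file, out_dir), check=True)
```

Calling `convert_all(Path("rtf"), Path("wiki"))` would write one `.wiki` file per input. Swap `mediawiki` for another Pandoc writer name if the target wiki uses a different syntax.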
I wanted to add a PDF generation button for articles. Everything was working well until I noticed that the file sizes are upwards of 4MB for a document with 200KB of JPG images and about 120KB of HTML. So I passed the CFDocument output through the CFPDF tag, which reduced it to 1.5MB. Better. Then I put it through Acrobat's web optimizer, which took it down to 335KB. I cannot find an "optimizing" option in either CFDocument or CFPDF; I was hoping for a quality setting or something similar. I should also note that CFDocument takes a while to process (relatively speaking). Since the optimize function was only added in ColdFusion 9, I'm guessing that I'm out of luck until this server is upgraded. True?
<cfdocument format="pdf"
            localurl="true"
            name="loc.tempPDF">
    <cfoutput>#loc.articleContent#</cfoutput>
</cfdocument>
<cfpdf action="write"
       destination="#expandPath('\pdf\temp.pdf')#"
       source="loc.tempPDF"
       overwrite="yes"
       saveOption="linear" />
That's correct: there is currently no way to optimize PDFs in ColdFusion 8 with the native cfdocument or cfpdf tags. If you absolutely have to make this happen without upgrading to CF9 (which has much improved PDF compression), then you could look at the iText library for generating PDFs via Java.
I built a Document Management System using CF 6.1 a few years ago. I used GhostScript to create, concatenate and optimize PDFs.
Here's a blog post showing how you can use GS to optimize a PDF's size.
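As a rough illustration of the Ghostscript approach (not necessarily what the blog post shows), the Python sketch below builds a command line that rewrites a PDF through the pdfwrite device with a quality preset. The function names and the default preset are assumptions for this example. The executable is `gs` on Linux/macOS, while Windows installs typically ship `gswin32c` or `gswin64c`.

```python
import subprocess
from typing import List


def gs_optimize_command(source: str, destination: str,
                        preset: str = "/ebook",
                        executable: str = "gs") -> List[str]:
    """Build a Ghostscript call that rewrites a PDF with a quality preset."""
    return [
        executable,                  # assumed to be on the PATH
        "-sDEVICE=pdfwrite",         # write a new PDF
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=" + preset,   # /screen, /ebook, /printer or /prepress
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        "-sOutputFile=" + destination,
        source,
    ]


def optimize_pdf(source: str, destination: str, preset: str = "/ebook") -> None:
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(gs_optimize_command(source, destination, preset), check=True)
```

The `/screen` preset downsamples images most aggressively and `/prepress` the least, so the preset is effectively the "quality setting" the question asks for. From ColdFusion, the same command line could be run with cfexecute.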
I have a web page where the user should be able to either print the page or save it to his/her computer.
How may I save it as a Word or PDF document?
Thanks.
For the MS Word requirement, most versions of Office can interpret basic html/xml. So you might consider the old cfcontent hack as a simpler alternative to POI. (The Word package is not quite as mature as the spreadsheet package.)
Basically you generate html, but use cfheader/cfcontent to tell the browser the content is really a Word document. It is obviously not a true MS Word file. But it is simpler than most options.
http://msdn.microsoft.com/en-us/library/aa155477.aspx
<cfheader name="Content-Disposition" value="attachment; filename=someFile.doc">
<cfcontent type="application/msword">
... your html code here ...
For Microsoft Office documents you can use the Apache POI project. This means that in your ColdFusion code you need to use some basic Java code to call the POI methods.
However, if you choose a PDF document, things are much easier: you can use the cfdocument tag with the PDF format option.
Using the POI or OpenOffice interface (depending on your version) you can create a Word doc. Using the built-in PDF generation tools, you can create a PDF doc. However, you can only present that as an option.
There is no way to override the browser's save/print menu functions. No matter how you handle it, I can save the source document instead of the .doc or .pdf. Similarly, you cannot prevent me from printing the original document instead of a prepared PDF.
Here is a method that has worked for me:
Create PDF or FlashPaper with ColdFusion
However, just like printing, you will have to sacrifice some graphics, so this would be best used for exporting content (but as you did not specify, I'm just clarifying that this is possible but at a cost).
Hope that helps.
Use cfdocument to display the page as a PDF; users can then click the save (disk) icon in the viewer to save it to their computer. Alternatively, cfdocument's filename attribute writes the PDF to a file on the server instead of streaming it to the browser, and you can then serve that file as a download (for example with cfheader and cfcontent) so that users are prompted to save it.
The format should be retained, so that the result looks almost the same as the original.
A couple of examples:
This page discusses how to use software called pdftohtml to convert in Ubuntu.
This page lists shareware (probably Windows) which converts PDF to various MS formats, including htm.
I even found a couple of videos (a Google video and one on www.break.com). I didn't look at them because I think they'll just describe how to use some software.
These are obviously unsatisfactory if you want to know how to do it yourself.
I think PDF started out as a compressed PostScript file, but these days a PDF will often contain images (of scanned documents, for example).
If that's the case, don't bother looking for text: you can extract the images and create HTML pages to display them. This should at least enable you to preserve the formatting.
At the very least, you could screen-capture the PDF pages to create the images. Crude, I know, but it would work whether the PDF contains PostScript-style text or images.