I would like to download normal webpages and web-hosted PPTs and PDFs in Python. However, to minimize the amount of data I need to download, I would like to fetch just the text and ignore any images.
This sounds feasible for normal websites, but I'm not sure whether it's possible for PPTs and PDFs. How can I accomplish this?
I'm planning to use the textract module to extract the content of these pages after downloading them, but I'd be interested to know if there are alternatives that would make my problem easier to solve.
Take a look at the textract library. It covers pretty much all your requirements, i.e., HTML, PDF and PPT.
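A minimal sketch of that route (file names here are placeholders; textract picks an extraction backend from the file extension and returns bytes):

    import textract

    for path in ("slides.pptx", "paper.pdf", "page.html"):  # placeholder names
        text = textract.process(path)  # returns bytes
        print(text.decode("utf-8")[:200])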
Is there a way to extend Wagtail's documents to show file previews? There's a service, FilePreviews.io (I have not tested it), which looks cool, but the jump from the free plan to the Pro plan is a huge leap in cost. I am hoping someone has figured this out already and can point me to the solution. Thank you.
FilePreviews.io's 100 documents a month on the free plan seems pretty generous to me. You could try to build something similar, e.g. using ImageMagick to create PDF thumbnails:
http://duncanlock.net/blog/2013/11/18/how-to-create-thumbnails-for-pdfs-with-imagemagick-on-linux/
and use a service like Aspose to convert common file formats to PDFs:
https://www.aspose.com
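For the ImageMagick half, the thumbnail step is a one-liner; a minimal sketch (assumes ImageMagick and Ghostscript are installed, and the file names are placeholders):

    # Sketch: render page one of a PDF to a 300px-tall PNG thumbnail.
    import subprocess

    def pdf_thumbnail(pdf_path, thumb_path, height=300):
        # "[0]" selects the first page; -thumbnail resizes and strips metadata
        subprocess.run(
            ["convert", "-thumbnail", f"x{height}", f"{pdf_path}[0]", thumb_path],
            check=True,
        )

    pdf_thumbnail("report.pdf", "report_thumb.png")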
Or you could do it manually, by adding a thumbnail field to your document model, and telling users to provide their own thumbnails.
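A rough sketch of that manual route, assuming you swap in a custom document model (the import path and the WAGTAILDOCS_DOCUMENT_MODEL setting follow current Wagtail; older releases use wagtail.wagtaildocs instead):

    # Hypothetical sketch: a document model with a user-supplied thumbnail field.
    # Point WAGTAILDOCS_DOCUMENT_MODEL at this model in settings.py.
    from django.db import models
    from wagtail.documents.models import AbstractDocument, Document

    class ThumbnailedDocument(AbstractDocument):
        thumbnail = models.ImageField(upload_to="doc_thumbnails/", blank=True)

        admin_form_fields = Document.admin_form_fields + ("thumbnail",)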
But if you or your client are uploading more than 100 documents a month, and you want reliable thumbnail generation for multiple document types, $49 a month may be better value than working it all out yourself.
My client has requested that my application include an automated e-mailing system. As part of this system, I generate HTML and use automation to send it via Outlook.
However, they also require a PDF copy of the HTML document to be sent as an attachment. My initial attempts used libHaru, which proved difficult to work with efficiently: I had to create the PDF from scratch, computing the position of every line in a table, positioning all the text, and so on.
I was wondering if there would be a way to programmatically convert HTML code (or an HTML file if need be) into a PDF document either by using Win32/MFC itself or an external library.
Thanks in advance!
EDIT: Just to clarify, I am looking for solutions which minimize external dependencies.
You should evaluate the wkhtmltopdf utility:
http://code.google.com/p/wkhtmltopdf/
You can call it from the command line without the need to run a setup.
I generate my output documents as HTML, then call ShellExecute(...) to convert them to PDF. It's great!
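The whole conversion is a single command, so any process-launching mechanism works; a minimal sketch (in Python for brevity, but the ShellExecute call passes the same arguments):

    # Sketch: wkhtmltopdf takes an input HTML file and an output PDF path.
    import subprocess

    subprocess.run(["wkhtmltopdf", "report.html", "report.pdf"], check=True)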
Internally it uses WebKit + Qt, so compatibility with modern HTML is good.
Hope it helps.
I'd take a look at PDF Creator, which can be used as a COM object (one that acts pretty much like a printer). I haven't used it to print HTML, so I'm not sure, but my guess is that you'll end up having to instantiate a web browser control to render the HTML and then feed the output from there to the PDF control.
Some possible answers are in this thread:
C++ Library to Convert HTML to PDF?
Not sure if they will satisfy your particular requirements, but these might at least get you started.
Edit:
Some other possible options here.
Not MFC, but you could try QtWebKit. It can render HTML and export it to PDF, PNG, or JPEG.
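A hedged sketch of the idea through the PyQt4 bindings (the C++ API mirrors this; note QtWebKit is deprecated in later Qt releases):

    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication, QPrinter
    from PyQt4.QtWebKit import QWebView

    app = QApplication([])
    view = QWebView()

    def export_pdf(ok):
        printer = QPrinter()
        printer.setOutputFormat(QPrinter.PdfFormat)
        printer.setOutputFileName("page.pdf")
        view.print_(printer)  # paints the rendered page into the PDF
        app.quit()

    view.loadFinished.connect(export_pdf)
    view.load(QUrl("file:///path/to/page.html"))  # placeholder path
    app.exec_()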
I am looking for a Django app that can help smooth the process of uploading large documents via HTTP POST.
The documents range anywhere from 150 MB to 500 MB.
I wrote a small library that handles PDF uploads, passes them to my Scribd library, and through that embeds them on my site.
Currently my model is quite simple: it takes a FileField (preferably a PDF) and uploads the file, then uses the Scribd library to send it directly to Scribd for encoding.
The problem is that somewhere along the actual upload process it times out, with no errors in the log. I have adjusted my Django app's file-size limits and Apache's, and I am a bit lost at the moment, not knowing where to go from here.
I want to eliminate the manual work, so ideally I'd still like to handle uploads through my site.
Any help or pointers would be appreciated.
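For what it's worth, limits on uploads this size usually live in more than one layer; a hedged checklist of the standard knobs (values below are illustrative, not recommendations):

    # settings.py: files larger than this are streamed to a temp file, not held in RAM
    FILE_UPLOAD_MAX_MEMORY_SIZE = 2621440  # 2.5 MB, the Django default

    # Apache equivalents, shown as comments since this file is Python:
    #   LimitRequestBody 524288000   # allow request bodies up to ~500 MB
    #   Timeout 600                  # give slow uploads longer before the connection drops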
Probably a bit too late, but have you tried django-bft?
I want to pull the text out of html files for indexing purposes, and do so as fast as possible. Rather than create something from scratch, I want to see how much I can find already done for me.
Currently I'm just piping the output of html2text, which works, but between it being Python and its attempts to prettify the text, I'm sure the speed could be improved.
So, with Linux/unix being priority, what (c/c++) libraries would be best suited to this kind of task?
To extract the text you can use an HTML parser like htmlcxx or libxml. You can also use any XML library after tidying up the HTML. For indexing the text you can use CLucene.
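If a quick baseline helps before committing to C++, the same libxml2 machinery is reachable via lxml; a minimal sketch:

    # Sketch: lxml wraps libxml2, so the parsing runs in C even from Python.
    from lxml import html

    def extract_text(raw_html):
        tree = html.fromstring(raw_html)
        # note: text_content() keeps <script>/<style> text; drop those nodes first if needed
        return tree.text_content()

    print(extract_text("<html><body><p>Hello <b>world</b></p></body></html>"))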
How can I convert a PDF to HTML? The output should retain the formatting and look almost the same as the original.
A couple of examples:
This page discusses how to use software called pdftohtml to do the conversion on Ubuntu.
This page lists shareware (probably Windows-only) which converts PDF to various MS formats, including HTM.
I even found a couple of videos (a Google video and one on www.break.com). I didn't watch them, because I think they'll just describe how to use some software.
These are obviously unsatisfactory if you want to know how to do it yourself.
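If you do want to run the conversion yourself, the pdftohtml tool mentioned above is one command; a rough sketch (assumes poppler-utils is installed, file names are placeholders):

    # Sketch: -c ("complex") mode tries to preserve the original layout.
    import subprocess

    subprocess.run(["pdftohtml", "-c", "input.pdf", "output"], check=True)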
I think PDF started out as a compressed 'PostScript' file, but these days one may well contain images (of scanned documents, for example).
If that's the case, don't bother looking for text; you can extract the images and create HTML pages to display them. This should at least enable you to preserve the formatting.
At the very least, you could screen-capture the PDF pages to create the images. Crude, I know, but it would work whether the PDF holds PostScript-style text or images.
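A hedged sketch of that image route, using poppler's pdfimages instead of screen captures (file names are placeholders):

    # Sketch: dump the embedded images, then wrap them in a bare-bones HTML page.
    import glob
    import subprocess

    subprocess.run(["pdfimages", "-j", "scan.pdf", "page"], check=True)  # -j: write JPEGs where possible

    imgs = "\n".join(f'<img src="{name}">' for name in sorted(glob.glob("page-*")))
    with open("scan.html", "w") as f:
        f.write(f"<html><body>\n{imgs}\n</body></html>")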