How to convert pdf documents to html files? - pdf-to-html

The formatting should be preserved, so the result looks almost the same as the original.

A couple of examples:
This page discusses how to use a piece of software called pdftohtml to do the conversion on Ubuntu.
This page lists shareware (probably Windows-only) that converts PDF to various MS formats, including HTM.
I even found a couple of videos (a Google video and one on www.break.com). I didn't look at them because I think they'll just describe how to use some software.
These are obviously unsatisfactory if you want to know how to do it yourself.
I think PDF started out as a compressed 'PostScript' file, but these days a PDF would probably also contain images (of scanned documents, for example).
If that's the case, don't bother looking for text; you can extract the images and create HTML pages to display them. This should at least enable you to preserve the formatting.
At the very least, you could screen-capture the PDF pages to create the images. Crude, I know, but it would work whether the PDF was PostScript or images.
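If you go the image route, here is a minimal sketch of the idea in Python, assuming the PyMuPDF (fitz) library is available; the file names and the 2x zoom factor are placeholders:

import fitz  # PyMuPDF; pip install pymupdf

def pdf_to_image_html(pdf_path, out_html="output.html"):
    # Render each page to a PNG and wrap the images in a bare-bones HTML page.
    doc = fitz.open(pdf_path)
    img_tags = []
    for page in doc:
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # 2x zoom for readability
        img_name = "page_%d.png" % (page.number + 1)
        pix.save(img_name)
        img_tags.append('<img src="%s" alt="page %d"><br>' % (img_name, page.number + 1))
    with open(out_html, "w", encoding="utf-8") as f:
        f.write("<html><body>\n" + "\n".join(img_tags) + "\n</body></html>")

pdf_to_image_html("input.pdf")  # hypothetical input file

This preserves the look of each page exactly, at the cost of losing selectable text.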

Related

Only loading the text content of a URL in Python

I would like to download normal webpages, web-hosted PPTs and PDFs in Python. However, to minimize the amount of data I would need to download, I would like to download just the text and ignore any images.
This sounds feasible with normal websites, but I'm not sure if it's possible for PPTs and PDFs. How can I accomplish this?
I'm planning to use the textract module to extract the content of these pages after downloading them, but I'd be interested to know if there are alternatives that would make my problem easier to solve.
Take a look at the textract library. It covers pretty much all your requirements, i.e. HTML, PDF and PPT.
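For illustration, a minimal sketch of that suggestion, assuming textract is installed; the URL is a placeholder, and note that textract works on local files, so you still download the whole file first:

import urllib.request
import textract  # pip install textract

url = "http://example.com/slides.pptx"  # hypothetical URL
local_path = "slides.pptx"
urllib.request.urlretrieve(url, local_path)  # fetch the raw file

text = textract.process(local_path)  # returns the extracted text as bytes
print(text.decode("utf-8"))

The bandwidth saving therefore applies mainly to skipping separate image URLs on normal webpages; images embedded inside a PPT or PDF are downloaded along with the file and simply ignored during extraction.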

Why is my libharu PDF oversized with .png images?

I am creating a PDF using libharu in C++ (compiled as a .cgi) that features .png images.
The code works, but my PDFs are ridiculously oversized.
Each page features one image of around 30 KB and around 4 text characters in libharu's built-in font. If I open a 20-page output file of 25 MB and "print" it to a file in my operating system, it becomes 256 KB or so with no visible change to the images.
I think the issue is related to libharu because this guy sees it too, here. He is using PHP, so libharu compiled as a .cgi (my C++ code is also a compiled .cgi, linked to libharu).
Another guy here on Stack Overflow has also seen size issues with libharu, but his problem does not mention anything to do with .png, so it may be unrelated.
Code for reference:
WorkingGraphic = HPDF_LoadPngImageFromMem(*gPdfPtr,
                                          PngAssets[AssetIndex],   // image data pointer
                                          PngSizes[AssetIndex]);   // data length

// Render as appropriate
HPDF_Page_DrawImage(*BlitParams->page,
                    WorkingGraphic,
                    BlitParams->OutputRect->X,
                    BlitParams->OutputRect->Y,
                    BlitParams->OutputRect->Width,
                    BlitParams->OutputRect->Height);
Does anyone know how to drive libharu so it creates sensibly sized PDFs when you use .png images?
Right, I don't know how to remove a question, but maybe this info will be useful to others anyway.
I may have had the same issue as this fellow here, where I have duplicated this answer.
What I needed to do was enable compression of the PDF, which I had not done.
Documentation link
C Code:
HPDF_SetCompressionMode (pdf, HPDF_COMP_ALL);
It's because I hadn't done enough research to know that the PDF format does not natively support .png (or, if it has been updated to do so, libharu still doesn't). This option tells libharu to use zlib to compress everything it can, including your images.
The implementation is not perfect (you will still see a size difference if you zip your output PDF), but it is acceptable for my use case.
If you don't need the full-size image in the PDF, you can reduce the image to a thumbnail using the GDI+ APIs, equal in size to however big you want the image to appear in the PDF.
Save the scaled PNG to a temporary file and pass that thumbnail PNG to Haru PDF. This will reduce the size of the PDF file.
The trade-off is that the image will be pixelated when the viewer zooms in.
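If you end up scripting the image preparation rather than calling GDI+, the same downscale-before-embedding idea looks roughly like this with Pillow in Python; the file names and target size are placeholders:

from PIL import Image  # Pillow; pip install Pillow

img = Image.open("original.png")       # hypothetical source image
img.thumbnail((400, 300))              # shrink in place, keeping the aspect ratio
img.save("thumb_for_pdf.png", "PNG")   # hand this file to libharu instead of the original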

Office.Interop for document format conversion

I have generated documents (in DOCX, XLSX and PDF formats) using ReportViewer.WebForms.
The problem is, I need a few additional formats (HTML and RTF), so I made the conversions using Microsoft.Office.Interop.Word. Basically, it opens the file and saves it in a different format.
What needs to be done on the server side (besides installing Word; is Word alone enough)? I know it's bad practice, but are there any other solutions?
P.S. Commercial libraries like Aspose.Words are awesome, but the project is too small to buy a license.
P.P.S. Those two formats will be rarely used, so performance is not a big issue. The documents are quite simple, no more than 3 simple tables.
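For illustration, a minimal sketch of the same open-and-save-as approach driven from Python via pywin32 (the paths are placeholders; Word itself still has to be installed on the server, and for DOCX inputs Word alone should be enough). The C# equivalent goes through the same Documents.Open/SaveAs calls in Microsoft.Office.Interop.Word:

import win32com.client  # pywin32; requires Word installed on the (Windows) machine

WD_FORMAT_RTF = 6   # WdSaveFormat.wdFormatRTF
WD_FORMAT_HTML = 8  # WdSaveFormat.wdFormatHTML

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
try:
    doc = word.Documents.Open(r"C:\reports\report.docx")  # hypothetical path
    doc.SaveAs(r"C:\reports\report.html", WD_FORMAT_HTML)
    doc.SaveAs(r"C:\reports\report.rtf", WD_FORMAT_RTF)
    doc.Close(False)  # close without saving further changes
finally:
    word.Quit()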

RTF to Wiki Converter?

I would like to convert 100+ RTF files to Wiki Markup, but I can only find "Wiki to RTF" converters on the web and even here on StackOverflow.
I only need RTF --> Wiki Markup
Is there anything like this out there?
I simply asked the wrong question.
I did some research and found out that there is no converter which converts RTF directly to a wiki format.
The better question: save a Word file as wiki markup.
There are some approaches using Microsoft Word to save as .txt (wiki markup):
http://www.mediawiki.org/wiki/Extension:Word2MediaWikiPlus
http://www.microsoft.com/en-us/download/details.aspx?id=12298
http://techwiki.openstructs.org/index.php/Wiki_converters
http://www.consumingexperience.com/2008/03/convert-word-doc-or-webpage-to-wiki.html
Good luck!
It may be unwieldy with hundreds of files, but I have used wikEd to convert RTF and Word formatted text to wiki markup.
Wikedbox is a usable implementation of wikEd without installing it:
http://www.appropedia.org/index.php?title=Appropedia:Wikedbox&action=edit
'''Wikedbox HELPS YOU CONVERT HTML (WEB FORMATTED CONTENT) TO MEDIAWIKI.'''. More instructions at [[Appropedia:WikEd]].
INSTRUCTIONS:
If you're not in edit mode, click the "edit" tab now.
Paste in your content.
Click the red [W] to convert. (Middle row, second block from the left.)
Your text is ready to be copied and used on your wiki page. (You may need to make further corrections.)
'''DO NOT SAVE THIS PAGE'''. (You'll be blocked from doing so, anyway, unless you're an admin.)
You can use Pandoc. Recent versions (2.14 and later, if I remember correctly) include an RTF reader, so you can convert straight to MediaWiki markup:
pandoc -s input.rtf -f rtf -t mediawiki -o output.wiki
I don't know which wiki you want to use; if it understands Markdown instead, you can target that with -t markdown, and Pandoc supports many other output formats too (see the Pandoc User's Guide).
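With a hundred-plus files, a small script can drive Pandoc in a loop. A minimal Python sketch, assuming the RTF files sit in an rtf_docs directory (the directory name and output extension are placeholders):

import pathlib
import subprocess

for rtf_file in pathlib.Path("rtf_docs").glob("*.rtf"):
    wiki_file = rtf_file.with_suffix(".wiki")
    # Requires a Pandoc new enough to ship the RTF reader.
    subprocess.run(
        ["pandoc", "-f", "rtf", "-t", "mediawiki", str(rtf_file), "-o", str(wiki_file)],
        check=True,
    )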

How would I get a subset of Wikipedia's pages?

How would I get a subset (say 100 MB) of Wikipedia's pages? I've found you can get the whole dataset as XML, but it's more like 1 or 2 GB; I don't need that much.
I want to experiment with implementing a map-reduce algorithm.
Having said that, if I could just find 100 MB worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.
Edit: Any that aren't torrents? I can't get those at work.
The stackoverflow database is available for download.
Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
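A minimal sketch of that approach in Python with the requests library (the output file, size target and one-second delay are placeholder choices):

import time
import requests  # pip install requests

TARGET_BYTES = 100 * 1024 * 1024  # roughly 100 MB
seen = set()
total = 0

with open("wiki_sample.html", "wb") as out:
    while total < TARGET_BYTES:
        # Special:Random redirects to a random article; r.url is the page we landed on.
        r = requests.get("https://en.wikipedia.org/wiki/Special:Random", timeout=30)
        if r.url in seen:
            continue  # discard duplicates
        seen.add(r.url)
        out.write(r.content)
        total += len(r.content)
        time.sleep(1)  # keep the request rate polite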
One option is to download the entire Wikipedia dump and then use only part of it. You can either decompress the entire thing and then use a simple script to split the file into smaller files (e.g. here), or, if you are worried about disk space, you can write a script that decompresses and splits on the fly, stopping the decompression at any stage you want. Wikipedia Dump Reader can be your inspiration for decompressing and processing on the fly, if you're comfortable with Python (look at mparser.py).
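A minimal sketch of the decompress-and-stop-early idea with Python's bz2 module (the file names and the 100 MB limit are placeholders; note the truncated output will not be well-formed XML, which may be fine for map-reduce experiments):

import bz2

LIMIT = 100 * 1024 * 1024  # stop after roughly 100 MB of decompressed XML
written = 0

with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as src, \
        open("wiki_sample.xml", "wb") as dst:
    for chunk in iter(lambda: src.read(1024 * 1024), b""):  # decompress 1 MB at a time
        dst.write(chunk)
        written += len(chunk)
        if written >= LIMIT:
            break  # no need to decompress the rest of the dump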
If you don't want to download the entire thing, you're left with the option of scraping. The Export feature might be helpful for this, and the wikipediabot was also suggested in this context.
If you wanted to get a copy of the stackoverflow database, you could do that from the creative commons data dump.
Out of curiosity, what are you using all this data for?
You could use a web crawler and scrape 100MB of data?
There are a lot of Wikipedia dumps available. Why do you want to choose the biggest (the English Wikipedia)? The Wikinews archives are much smaller.
One smaller subset of Wikipedia articles comprises the 'meta' wiki articles. This is in the same XML format as the entire article dataset, but smaller (around 400 MB as of March 2019), so it can be used for software validation (for example, testing Gensim scripts).
https://dumps.wikimedia.org/metawiki/latest/
You want to look for any files with the -articles.xml.bz2 suffix.