PDF to Word Doc in Python - python-2.7

I've read through the other Stack Overflow questions regarding this, but they don't answer my issue, so downvote away. It's Python 2.7.
All I want to do is use Python to convert a PDF to a Word doc, or at minimum convert it to text so I can copy and paste it into a Word doc.
This is the code I have so far. All it prints is the female gender symbol.
Is my code wrong? Am I approaching this wrong? Do some PDFs just not work with PDFMiner? Do you know of any other alternatives to accomplish my goal of converting a PDF to Word, besides using PyPDF2 or PDFMiner?
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file('Bottom Dec.pdf', 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

print convert_pdf_to_txt(1)

You can also use the pdf2docx library, which converts a PDF directly into a .docx file:

from pdf2docx import Converter

pdf_file = r'E:\Muhammad UMER LAR.pdf'
doc_file = r'E:\Lari.docx'

c = Converter(pdf_file)
c.convert(doc_file)
c.close()
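If you only need part of the document, pdf2docx's convert() also accepts a page range. A minimal sketch, reusing the illustrative paths above (verify the start/end keyword arguments against the pdf2docx version you have installed):

from pdf2docx import Converter

# Convert only the first three pages (0-based start, exclusive end);
# the paths are the same illustrative Windows paths used above.
c = Converter(r'E:\Muhammad UMER LAR.pdf')
c.convert(r'E:\Lari.docx', start=0, end=3)
c.close()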

Another alternative is the Aspose.Words Cloud SDK for Python, which you can install from pip, for PDF to DOC conversion.
import asposewordscloud
import asposewordscloud.models.requests
api_client = asposewordscloud.ApiClient()
api_client.configuration.host = 'https://api.aspose.cloud'
# Get AppKey and AppSID from https://dashboard.aspose.cloud/
api_client.configuration.api_key['api_key'] = 'xxxxxxxxxxxxxxxxxxxxx' # Put your appKey here
api_client.configuration.api_key['app_sid'] = 'xxxxxxxxx-xxxx-xxxxx-xxxx-xxxxxxxxxx' # Put your appSid here
words_api = asposewordscloud.WordsApi(api_client)
filename = '02_pages.pdf'
remote_name = 'TestPostDocumentSaveAs.pdf'
dest_name = 'TestPostDocumentSaveAs.doc'
#upload PDF file to storage
request_storage = asposewordscloud.models.requests.UploadFileRequest(filename, remote_name)
response = words_api.upload_file(request_storage)
#Convert PDF to DOC and save to storage
save_options = asposewordscloud.SaveOptionsData(save_format='doc', file_name=dest_name)
request = asposewordscloud.models.requests.SaveAsRequest(remote_name, save_options)
result = words_api.save_as(request)
print("Result {}".format(result))
I'm a developer evangelist at Aspose.


Wolframalpha in python with GTTS

I am trying to make a Friday-like virtual assistant using this code:
import os
from gtts import gTTS
import time
import playsound
import speech_recognition as sr

while True:
    def speak(text):
        tts = gTTS(text=text, lang="en")
        filename = "voice.mp3"
        tts.save(filename)
        playsound.playsound(filename)

    def get_audio():
        r = sr.Recognizer()
        with sr.Microphone() as source:
            audio = r.listen(source)
            said = ""
            try:
                said = r.recognize_google(audio)
                print(said)
            except Exception as e:
                print("Exception: " + str(e))
        return said

    text = get_audio()
    if "who are you" in text:
        speak(" I am Monday the virtual assistant")
And I was wondering how to add Wolfram Alpha to it, so that I could say "search for ...", and it would then speak the answer from Wolfram Alpha.
Any help would be amazing :)
Install the wolframalpha package (pip install wolframalpha).
Then add the following to your code:
import wolframalpha

if 'search for ' in text:
    text = text.replace("search for ", "")
    client = wolframalpha.Client(app_id)
    res = client.query(text)
    print(next(res.results).text)
    speak(next(res.results).text)
To use the API, you have to go to the homepage, sign up for an account, create an app and get an app id.
To avoid getting any errors, keep the indentation in your 'speak' function uniform.
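Putting it together with the question's loop, a rough sketch might look like the following. The speak() and get_audio() functions are assumed to be the ones defined in the question, and app_id is a placeholder for the App ID you create in the Wolfram Alpha dashboard:

import wolframalpha

app_id = "YOUR-APP-ID"  # placeholder: create an app on the Wolfram Alpha developer portal
client = wolframalpha.Client(app_id)

while True:
    text = get_audio()  # assumes get_audio() from the question is defined
    if "who are you" in text:
        speak("I am Monday the virtual assistant")
    elif "search for" in text:
        query = text.replace("search for", "").strip()
        res = client.query(query)
        try:
            answer = next(res.results).text
        except StopIteration:
            answer = "I could not find an answer for that."
        print(answer)
        speak(answer)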

how to convert .docx file to html using python?

import mammoth

f = open(r"D:\filename.docx", 'rb')
document = mammoth.convert_to_html(f)

I am unable to get an .html file when I run this code; please help me get one. Also, when I do convert to .html, the images that were inserted into the Word file do not come across into the .html file. Can you please help me get the images from the .docx into the .html?
Try this:
import mammoth
f = open("path_to_file.docx", 'rb')
b = open('filename.html', 'wb')
document = mammoth.convert_to_html(f)
b.write(document.value.encode('utf8'))
f.close()
b.close()
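On the images part of the question: by default mammoth embeds images as inline data URIs in the generated HTML. If you want them written out as separate files next to the HTML instead, mammoth lets you supply a custom image converter. A rough sketch, assuming mammoth's documented images API (mammoth.images.img_element); the file names and paths are illustrative:

import mammoth

image_counter = [0]

def save_image(image):
    # Called once per image in the .docx; writes the image to disk and
    # returns the attributes to use for the generated <img> tag.
    image_counter[0] += 1
    extension = image.content_type.partition("/")[2]  # e.g. "png" or "jpeg"
    image_name = "image_{0}.{1}".format(image_counter[0], extension)
    with image.open() as image_bytes, open(image_name, "wb") as out:
        out.write(image_bytes.read())
    return {"src": image_name}

with open("path_to_file.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(
        docx_file,
        convert_image=mammoth.images.img_element(save_image),
    )

with open("filename.html", "wb") as html_file:
    html_file.write(result.value.encode("utf8"))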
This may be late to answer the question, but in case someone is still looking for an answer where the Word tables/images should remain the same after conversion to HTML, the answer below should help.
import win32com.client as win32

# Open MS Word
word = win32.gencache.EnsureDispatch('Word.Application')
wordFilePath = r"C:\filename.docx"
doc = word.Documents.Open(wordFilePath)

# change the extension to .html
txt_path = wordFilePath.split('.')[0] + '.html'

# wdFormatFilteredHTML has value 10
# saves the doc as an html
doc.SaveAs(txt_path, 10)
doc.Close()

# noinspection PyBroadException
try:
    word.ActiveDocument()
except Exception:
    word.Quit()
I suggest you try the following code:

import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value

Replace White-space with hyphen then create URL

I'm trying to speed up a web-scraping process by sending raw data to Python instead of correctly pre-formatted data.
Current data is received as an excel file with data formatted as:
26 EXAMPLE RD EXAMPLEVILLE SA 5000
Data is formatted in excel via macros to:
Replace all spaces with hyphen
Change all text to lower-case
Paste text onto end of http://example.com/property/
Formatted data is http://www.example.com/property/26-example-rd-exampleville-sa-5000
What I'm trying to accomplish:
Get Python to go into the Excel sheet, follow the formatting rules listed above, then pass the records to the scraper.
Here is the code I have been trying to compile - please go easy, I am VERY new.
Any advice or reading sources related to python formatting would be appreciated.
#!/usr/bin/python
# -*- coding: utf-8 -*-

import requests
import csv
from lxml import html
import xlrd

# URL_BUILDER
# Source file for UNFORMATTED DATA
file_location = r"C:\Python27\Projects\REA_SCRAPER\NewScraper\ScrapeFile.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('((PythonScraperDNC))')

# REA_SCRAPER
# Pass data from URL_BUILDER to URL_LIST []
URL_LIST = []

# Search phrase to capture suitable URLs for scraping
text2search = \
'''<p class="property-value__title">
RECENTLY SOLD
</p>'''

# Write sales .CSV file
with open('Results.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    for (index, url) in enumerate(URL_LIST):
        page = requests.get(url)
        print '<Scanning Url For Sale>'
        if text2search in page.text:
            tree = html.fromstring(page.content)
            (title,) = (x.text_content() for x in tree.xpath('//title'))
            (price,) = (x.text_content() for x in tree.xpath('//div[@class="property-value__price"]'))
            (sold,) = (x.text_content().strip() for x in tree.xpath('//p[@class="property-value__agent"]'))
            writer.writerow([title, price, sold])
        else:
            writer.writerow(['No Sale'])
If you're just trying to figure out how to do the formatting in Python:
text = '26 EXAMPLE RD EXAMPLEVILLE SA 5000'
url = 'http://example.com/property/' + text.replace(' ', '-').lower()
print(url)
# Output:
# http://example.com/property/26-example-rd-exampleville-sa-5000
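To pull the raw addresses straight out of the spreadsheet and build the URL list in one pass (instead of formatting them in Excel with macros first), a rough sketch using xlrd could look like this. The file path and sheet name are taken from the question; the assumption that the unformatted address sits in the first column is mine:

import xlrd

file_location = r"C:\Python27\Projects\REA_SCRAPER\NewScraper\ScrapeFile.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('((PythonScraperDNC))')

URL_LIST = []
for row in range(sheet.nrows):
    # Assumes the unformatted address is in the first column of each row
    raw_address = str(sheet.cell_value(row, 0)).strip()
    if raw_address:
        slug = raw_address.replace(' ', '-').lower()
        URL_LIST.append('http://www.example.com/property/' + slug)

print(URL_LIST[:5])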

How can I use nessrest api (python) to export nessus scan reports in xml?

I am trying to automate running and downloading Nessus scans using Python. I have been using the nessrest API for Python, and am able to successfully run a scan, but am not able to successfully download the report in Nessus format.
Any ideas how I can do this? I have been using the module scan_download, but that actually executes before my scan even finishes.
Thanks for the help in advance!
Just looking back at this question, here's an example of using the nessrest API to pull down CSV report exports from your Nessus host:
#!/usr/bin/python2.7

import sys
import os
import io
from nessrest import ness6rest

file_format = 'csv'  # options: nessus, csv, db, html
dbpasswd = ''

scan = ness6rest.Scanner(url="https://nessus:8834", login="admin", password="P#ssword123", insecure=True)
scan.action(action='scans', method='get')
folders = scan.res['folders']
scans = scan.res['scans']

if scan:
    scan.action(action='scans', method='get')
    folders = scan.res['folders']
    scans = scan.res['scans']

    for f in folders:
        if not os.path.exists(f['name']):
            if not f['type'] == 'trash':
                os.mkdir(f['name'])

    for s in scans:
        scan.scan_name = s['name']
        scan.scan_id = s['id']
        folder_name = next(f['name'] for f in folders if f['id'] == s['folder_id'])
        folder_type = next(f['type'] for f in folders if f['id'] == s['folder_id'])

        # skip trash items
        if folder_type == 'trash':
            continue

        if s['status'] == 'completed':
            file_name = '%s_%s.%s' % (scan.scan_name, scan.scan_id, file_format)
            file_name = file_name.replace('\\', '_')
            file_name = file_name.replace('/', '_')
            file_name = file_name.strip()
            relative_path_name = folder_name + '/' + file_name

            # PDF not yet supported:
            # the nessrest wrapper returns the PDF as a string object instead of a byte object,
            # making writing and correctly encoding the file a chore...
            # other formats can be written out in text mode
            file_modes = 'wb'
            # DB is binary mode
            # if args.format == "db":
            #     file_modes = 'wb'

            with io.open(relative_path_name, file_modes) as fp:
                if file_format != "db":
                    fp.write(scan.download_scan(export_format=file_format))
                else:
                    fp.write(scan.download_scan(export_format=file_format, dbpasswd=dbpasswd))
You can see more examples here:
https://github.com/tenable/nessrest/tree/master/scripts
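On the original problem of scan_download running before the scan has finished: one option is to poll the scan's status with the same generic scan.action() call used above and only download once Nessus reports the scan as completed. A rough sketch, assuming the /scans/{id} response exposes the status under res['info']['status'] as in the Nessus 6 REST API:

import time

# Wait for the scan to finish before downloading the report
while True:
    scan.action(action='scans/' + str(scan.scan_id), method='get')
    if scan.res['info']['status'] == 'completed':
        break
    time.sleep(30)  # poll every 30 seconds

report = scan.download_scan(export_format='nessus')  # 'nessus' is the XML export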

What data 'structure' does fs.get_last_version return?

When I use get_last_version to get an image from the database, what is actually returned? i.e. is it an array, the merged binary data of all the chunks that make up the file (as a string), or something else?
dbname = 'grid_files'
db = connection[dbname]
fs = gridfs.GridFS(db)
filename = "my_image.jpg"
my_image_file = fs.get_last_version(filename=filename)
I'm wanting to base64 encode my_image_file with:
import base64
encoded_img_file = base64.b64encode(my_image_file)
return encoded_img_file
But I'm getting a 500 error.
I haven't been able to glean what is actually returned when using get_last_version from the docs:
http://api.mongodb.org/python/current/api/gridfs/#gridfs.GridFS.get_last_version
More research:
I followed the logic from this post:
http://blog.pythonisito.com/2012/05/gridfs-mongodb-filesystem.html
Running Python in a shell on the server, I could see that Binary() data was returned (see below), so should I be able to base64 encode this as demonstrated above?
>>> import pymongo
>>> import gridfs
>>> import os
>>> hostname = os.environ['OPENSHIFT_MONGODB_DB_URL']
>>> conn = pymongo.MongoClient(host=hostname)
>>> db = conn.grid_files
>>> fs = gridfs.GridFS(db)
>>> list(db.fs.chunks.find())
[{u'files_id': ObjectId('52db4d9e70914413718f2ec4'), u'_id': ObjectId('52db4d9e7
0914413718f2ec5'), u'data': Binary('lots of binary code', 0), u'n': 0}]
Unless there is a better answer, this is what I've come up with.
get_last_version returns a GridOut object, a file-like wrapper around the stored chunks, so calling read() on it gives you the file's bytes (the Binary() data seen in the shell is how the individual chunks are stored, not what get_last_version hands back).
In regards to base64 encoding it and returning it, this is how I did it:
dbname = 'grid_files'
db = connection[dbname]
fs = gridfs.GridFS(db)
filename = "my_image.jpg"
my_image_file = fs.get_last_version(filename=filename)
encoded_img_file = base64.b64encode(my_image_file.read())
return encoded_img_file
Then accessed it on the front end with:
$("#my_img").attr("src", "data:image/png;base64," + data);