import mammoth
f = open("D:\filename.docx", 'rb')
document = mammoth.convert_to_html(f)
I am unable to get a .html file while i run this code,please help me to get it, When i converted to .html file i am not getting images inserted into word file into .html file,Can you please help me how to get images into .html from .docx?
Try this:
import mammoth
f = open("path_to_file.docx", 'rb')
b = open('filename.html', 'wb')
document = mammoth.convert_to_html(f)
b.write(document.value.encode('utf8'))
f.close()
b.close()
This is may be late to answer this question but just incase if someone still looking for the answer where word "tables/images/" should remains same after conversion to html below answer would help.
import win32com.client as win32
# Open MS Word
word = win32.gencache.EnsureDispatch('Word.Application')
wordFilePath = "C:\filename.docx"
doc = word.Documents.Open(wordFilePath)
# change to a .html
txt_path = wordFilePath.split('.')[0] + '.html'
# wdFormatFilteredHTML has value 10
# saves the doc as an html
doc.SaveAs(txt_path, 10)
doc.Close()
# noinspection PyBroadException
try:
word.ActiveDocument()
except Exception:
word.Quit()
I suggest you to try the following code
import mammoth
with open("document.docx", "rb") as docx_file:
result = mammoth.convert_to_html(docx_file)
html = result.value
Related
I'm having trouble with converting full transcribed speech to a text file. Eventually, I get what I need but not the entire text from the audio file. Let me note this (1 Pic), I can see the whole text when I use print() function but get only one line of that text when I try to write it to .txt file (2 Pic).
Also, you can look at my code if you need additional info and stuff. Thank you in advance!
from google.cloud import speech
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'PATH'
client = speech.SpeechClient()
with open('sample.wav', "rb") as audio_file:
content = audio_file.read()
audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
sample_rate_hertz=8000,
language_code="en-US",
# Enable automatic punctuation
enable_automatic_punctuation=True,
)
response = client.recognize(config=config, audio=audio)
for result in response.results:
extr = result.alternatives[0].transcript
print(extr)
with open("guru9.txt","w+") as f:
f.write(extr)
f.close()
What happens in your code is, per iteration you open, write, close your file. You should move out your opening and closing of your file outside the loop.
myfile = open("guru9.txt","w+")
for result in response.results:
extr = result.alternatives[0].transcript
myfile.write(extr)
myfile.close()
I have a website that has a button. When it's clicked, it returns a number of pandas dataframers into an excel file and returns that excel file automatically as download.
It seems to work ok, except when I open the file, it seems to be corrupted. It asks if some of the tabs should be recovered. I'm using the code below. Any suggestions are appreciated what could be the cause for this.
import io
from flask.helpers import make_response
from pandas.io.excel import ExcelWriter
output = io.BytesIO()
writer = ExcelWriter(output)
dfs = [df1,df2....]
tabs ['tab1','tab2',....]
for df, tab_name in zip(dfs, tab_names):
df.to_excel(writer, tab_name)
writer.close()
resp = make_response(output.getvalue())
resp.headers['Content-Disposition'] = 'attachment; filename=output.xlsx'
resp.headers["Content-type"] = "text/csv"
return resp
You'll need to to add
output.seek(0)
after you close the writer.
You might also find it easier to write
return send_file(output, attachment_filename="output.xlsx", as_attachment=True)
(after importing send_file from flask)
I'm trying to speed up a process of webscraping by sending raw data to python in lieu of correctly formatted data.
Current data is received as an excel file with data formatted as:
26 EXAMPLE RD EXAMPLEVILLE SA 5000
Data is formatted in excel via macros to:
Replace all spaces with hyphen
Change all text to lower-case
Paste text onto end of http://example.com/property/
Formatted data is http://www.example.com/property/26-example-rd-exampleville-sa-5000
What i'm trying to accomplish:
Get python to go into excel sheet and follow formatting rules listed above, then pass the records to the scraper.
Here is the code I have been trying to compile - please go easy i am VERY new.
Any advice or reading sources related to python formatting would be appreciated.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import csv
from lxml import html
import xlrd
# URL_BUILDER
# Source File for UNFORMATTED DATA
file_location = "C:\Python27\Projects\REA_SCRAPER\NewScraper\ScrapeFile.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('((PythonScraperDNC))')
# REA_SCRAPER
# Pass Data from URL_BUILDER to URL_LIST []
URL_LIST = []
# Search Phrase to capture suitable URL's for Scraping
text2search = \
'''<p class="property-value__title">
RECENTLY SOLD
</p>'''
# Write Sales .CSV file
with open('Results.csv', 'wb') as csv_file:
writer = csv.writer(csv_file)
for (index, url) in enumerate(URL_LIST):
page = requests.get(url)
print '<Scanning Url For Sale>'
if text2search in page.text:
tree = html.fromstring(page.content)
(title, ) = (x.text_content() for x in tree.xpath('//title'))
(price, ) = (x.text_content() for x in tree.xpath('//div[#class="property-value__price"]'))
(sold, ) = (x.text_content().strip() for x intree.xpath('//p[#class="property-value__agent"]'))
writer.writerow([title, price, sold])
else:
writer.writerow(['No Sale'])
If you're just trying to figure out how to do the formatting in Python:
text = '26 EXAMPLE RD EXAMPLEVILLE SA 5000'
url = 'http://example.com/property/' + text.replace(' ', '-').lower()
print(url)
# Output:
# http://example.com/property/26-example-rd-exampleville-sa-5000
I've read though the other stack overflow questions regarding this but it doesn't answer my issue, so down vote away. Its version 2.7.
All I want to do is use python to convert a PDF to a Word doc. At minimum convert to text so I can copy and paste into a word doc.
This is the code I have so far. All it prints is the female gender symbol.
Is my code wrong? Am I approaching this wrong? Do some PDFs just not work with PDFMiner? Do you know of any other alternatives to accomplish my goal of converting a PDF to Word, besides using PyPDF2 or PDFMiner?
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file('Bottom Dec.pdf', 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
text = retstr.getvalue()
fp.close()
device.close()
retstr.close()
return text
print convert_pdf_to_txt(1)
from pdf2docx import Converter
pdf_file = 'E:\Muhammad UMER LAR.pdf'
doc_file= 'E:\Lari.docx'
c=Converter(pdf_file)
c.convert(doc_file)
c.close()
Another alternative solution is Aspose.Words Cloud SDK for Python, you can install it from pip for PDF to DOC conversion.
import asposewordscloud
import asposewordscloud.models.requests
api_client = asposewordscloud.ApiClient()
api_client.configuration.host = 'https://api.aspose.cloud'
# Get AppKey and AppSID from https://dashboard.aspose.cloud/
api_client.configuration.api_key['api_key'] = 'xxxxxxxxxxxxxxxxxxxxx' # Put your appKey here
api_client.configuration.api_key['app_sid'] = 'xxxxxxxxx-xxxx-xxxxx-xxxx-xxxxxxxxxx' # Put your appSid here
words_api = asposewordscloud.WordsApi(api_client)
filename = '02_pages.pdf'
remote_name = 'TestPostDocumentSaveAs.pdf'
dest_name = 'TestPostDocumentSaveAs.doc'
#upload PDF file to storage
request_stoarge = asposewordscloud.models.requests.UploadFileRequest(filename,remote_name)
response = words_api.upload_file(request_stoarge)
#Convert PDF to DOC and save to storage
save_options = asposewordscloud.SaveOptionsData(save_format='doc', file_name=dest_name)
request = asposewordscloud.models.requests.SaveAsRequest(remote_name, save_options)
result = words_api.save_as(request)
print("Result {}".format(result))
I'm developer evangelist at Aspose.
I am new to Python and seem to be having trouble getting an image to download and save to a file. I was wondering if someone could point out my error. I have tried two methods in various ways to no avail. Here is my code below:
# Ask user to enter URL
url= http://hosted.ap.org/dynamic/stories/A/AF_PISTORIUS_TRIAL?SITE=AP&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2014-04-15-15-48-52
timestamp = datetime.date.today()
soup = BeautifulSoup(urllib2.urlopen(url).read())
#soup = BeautifulSoup(requests.get(url).text)
# ap
links = soup.find("td", {'class': 'ap-mediabox-td'}).find_all('img', src=True)
for link in links:
imgfile = open('%s.jpg' % timestamp, "wb")
link = link["src"].split("src=")[-1]
imgurl = "http://hosted.ap.org/dynamic/files" + link
download_img = urllib2.urlopen(imgurl).read()
#download_img = requests.get(imgurl, stream=True)
#imgfile.write(download_img.content)
imgfile.write(download_img)
imgfile.close()
# link outputs: /photos/F/f5cc6144-d991-4e28-b5e6-acc0badcea56-small.jpg
# imgurl outputs: http://hosted.ap.org/dynamic/files/photos/F/f5cc6144-d991-4e28-b5e6-acc0badcea56-small.jpg
I receive no console error, just an empty picture file.
The relative path of the image can be obtained as simply as by doing:
link = link["src"]
Your statement:
link = link["src"].split("src=")[-1]
is excessive. Replace it with above and you should get the image file created. When I tried it out, I could get the image file to be created. However, I was not able to view the image. It said, the image was corrupted.
I have had success in the past doing the same task using python's requests library using this code snippet:
r = requests.get(url, stream=True)
if r.status_code == 200:
with open('photo.jpg', 'wb') as f:
for chunk in r.iter_content():
f.write(chunk)
f.close()
url in the snippet above would be your imgurl computed with the changes I suggested at the beginning.
Hope this helps.