Python: Trouble getting image to download/save to file - python-2.7

I am new to Python and seem to be having trouble getting an image to download and save to a file. I was wondering if someone could point out my error. I have tried two methods in various ways to no avail. Here is my code below:
import datetime
import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = "http://hosted.ap.org/dynamic/stories/A/AF_PISTORIUS_TRIAL?SITE=AP&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2014-04-15-15-48-52"
timestamp = datetime.date.today()
soup = BeautifulSoup(urllib2.urlopen(url).read())
#soup = BeautifulSoup(requests.get(url).text)
# ap
links = soup.find("td", {'class': 'ap-mediabox-td'}).find_all('img', src=True)
for link in links:
    imgfile = open('%s.jpg' % timestamp, "wb")
    link = link["src"].split("src=")[-1]
    imgurl = "http://hosted.ap.org/dynamic/files" + link
    download_img = urllib2.urlopen(imgurl).read()
    #download_img = requests.get(imgurl, stream=True)
    #imgfile.write(download_img.content)
    imgfile.write(download_img)
    imgfile.close()
    # link outputs: /photos/F/f5cc6144-d991-4e28-b5e6-acc0badcea56-small.jpg
    # imgurl outputs: http://hosted.ap.org/dynamic/files/photos/F/f5cc6144-d991-4e28-b5e6-acc0badcea56-small.jpg
I receive no console error, just an empty picture file.

The relative path of the image can be obtained simply with:
link = link["src"]
Your statement:
link = link["src"].split("src=")[-1]
does more than is needed. Replace it with the line above and the image file should be created. When I tried it, the file was created, but I could not view it: the viewer reported that the image was corrupted.
I have had success in the past doing the same task with Python's requests library, using this snippet:
import requests

r = requests.get(url, stream=True)
if r.status_code == 200:
    # The with block closes the file, so no explicit f.close() is needed
    with open('photo.jpg', 'wb') as f:
        # Without chunk_size, iter_content yields one byte at a time
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
Here url would be your imgurl, computed with the change I suggested at the beginning.
Hope this helps.
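For readers on Python 3, where urllib2 no longer exists, the same streaming idea can be sketched with only the standard library. This is a minimal sketch and not the answerer's code: download_to_file is a hypothetical helper name, and it swaps requests for urllib.request plus shutil.copyfileobj. It returns the number of bytes written, so an empty file like the one described in the question is easy to detect.

```python
import shutil
import urllib.request


def download_to_file(url, path):
    """Stream the body at `url` into `path` and return the bytes written.

    Hypothetical helper for illustration; assumes Python 3.
    """
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        # copyfileobj reads and writes in blocks, so large files
        # are never held in memory all at once.
        shutil.copyfileobj(response, out)
        return out.tell()
```

Called as download_to_file(imgurl, 'photo.jpg'), a return value of 0 would mean the server sent an empty body, which points at the URL or the request rather than at the file-writing code.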

Related

Save a file from requests using django filesystem

I'm trying to save a file downloaded via requests. It is rather large, so I'm streaming it instead of loading it all at once.
I'm unsure how to do this correctly, as I keep getting different errors. This is what I have so far:
def download_file(url, matte_upload_path, matte_servers, job_name, count):
    local_filename = url.split('/')[-1]
    url = "%s/static/downloads/%s_matte/%s/%s" % (matte_servers[0], job_name, count, local_filename)
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        fs = FileSystemStorage(location=matte_upload_path)
        print(matte_upload_path, 'matte path upload')
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
            fs.save(local_filename, f)
    return local_filename
but it raises
io.UnsupportedOperation: read
I'm basically trying to have requests save it to the specific location via django, any help would be appreciated.
I solved this by streaming the response into a temporary file and then saving that file through FileSystemStorage:
local_filename = url.split('/')[-1]
url = "%s/static/downloads/%s_matte/%s/%s" % (matte_servers[0], job_name, count, local_filename)
response = requests.get(url, stream=True)
fs = FileSystemStorage(location=matte_upload_path)
lf = tempfile.NamedTemporaryFile()
# Read the streamed image in sections
for block in response.iter_content(1024 * 8):
    # If no more file then stop
    if not block:
        break
    # Write image block to temporary file
    lf.write(block)
fs.save(local_filename, lf)
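The original error makes sense once the file modes are compared: io.UnsupportedOperation: read is what Python raises when something tries to read from a handle opened write-only with 'wb', whereas tempfile.NamedTemporaryFile defaults to 'w+b', which allows both. The sketch below shows that difference without Django; spool_chunks is a hypothetical helper name, and the explicit seek(0) is a defensive assumption rather than something the answer above does.

```python
import tempfile


def spool_chunks(chunks):
    """Write byte chunks to a temporary file and rewind it for reading.

    Hypothetical helper for illustration only.
    """
    # NamedTemporaryFile defaults to mode 'w+b', so the handle is both
    # writable and readable, unlike open(name, 'wb').
    tmp = tempfile.NamedTemporaryFile()
    for chunk in chunks:
        if chunk:  # skip empty chunks
            tmp.write(chunk)
    tmp.seek(0)  # rewind so a later reader starts at the first byte
    return tmp
```

In the Django case, the returned handle would play the role of lf in fs.save(local_filename, lf). The temporary file is deleted when the handle is closed, so it should stay open until the save has finished.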

Django, Store jpg file received as string in http POST

I am receiving an http request from a desktop application with a screenshot. I cannot speak with the developer or see source code, so all I have is the http request I am getting.
The file isn't in request.FILES, it is in request.POST.
@csrf_exempt
def create_contract_event_handler(request, contract_id, event_type):
    keyboard_events_count = request.POST.get('keyboard_events_count')
    mouse_events_count = request.POST.get('mouse_events_count')
    screenshot_file = request.POST.get('screenshot_file')
    barr2 = bytes(screenshot_file.encode(encoding='utf8'))
    with open('.test/output.jpeg', 'wb') as f:
        f.write(barr2)
The file is corrupted.
The binary starts like this, I don't know if that helps:
����JFIFHH��C
%# , #&')*)-0-(0%()(��C
(((((((((((((((((((((((((((((((((((((((((((((((((((�� `"��
Also, if I try to open the image with PIL, I get the following error:
from PIL import Image
im = Image.open('./test/output.jpg')
#OSError: cannot identify image file './test/output.jpg'
In the end I got access to the code on the other side. The 'filename' was missing from the header, which is why the file arrived in the POST dictionary instead of the FILES dictionary.
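A likely reason the file written from request.POST was corrupted is that JPEG data is not valid UTF-8, so once it has been decoded to text the original bytes cannot be recovered by encoding again. The sketch below illustrates this under the assumption that the field was decoded as UTF-8 with invalid bytes replaced, which is consistent with the replacement characters in the dump above; looks_like_jpeg and through_text are hypothetical helper names.

```python
JPEG_MAGIC = b"\xff\xd8"  # JPEG files start with this start-of-image marker


def looks_like_jpeg(data):
    """Return True if the byte string starts with the JPEG marker."""
    return data[:2] == JPEG_MAGIC


def through_text(data):
    """Simulate binary data being decoded as UTF-8 text and encoded again.

    Assumption: invalid bytes are replaced rather than raising an error.
    """
    return data.decode("utf-8", errors="replace").encode("utf-8")


# The first bytes of a typical JFIF file, then the same bytes after
# a round trip through text.
original = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
damaged = through_text(original)
```

Here looks_like_jpeg(original) is true while looks_like_jpeg(damaged) is false, which matches PIL being unable to identify the saved file. A check like this on the first two bytes is a quick way to tell a lossy text round trip apart from a problem in the file-writing code.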

how to convert .docx file to html using python?

import mammoth
f = open(r"D:\filename.docx", 'rb')  # raw string so that \f is not read as a form feed
document = mammoth.convert_to_html(f)
When I run this code I do not get an .html file. Also, when I did manage to convert to .html, the images inserted in the Word file were missing from the output. How can I get the images from the .docx into the .html?
Try this:
import mammoth
f = open("path_to_file.docx", 'rb')
b = open('filename.html', 'wb')
document = mammoth.convert_to_html(f)
b.write(document.value.encode('utf8'))
f.close()
b.close()
This may be late, but in case someone is still looking for an answer where Word tables and images are preserved after conversion to HTML, the following may help.
import win32com.client as win32
# Open MS Word
word = win32.gencache.EnsureDispatch('Word.Application')
wordFilePath = r"C:\filename.docx"  # raw string so that \f is not read as a form feed
doc = word.Documents.Open(wordFilePath)
# change to a .html
txt_path = wordFilePath.split('.')[0] + '.html'
# wdFormatFilteredHTML has value 10
# saves the doc as an html
doc.SaveAs(txt_path, 10)
doc.Close()
# noinspection PyBroadException
try:
    word.ActiveDocument()
except Exception:
    word.Quit()
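Two details of the path handling in this snippet are easy to trip over. In an ordinary (non-raw) string literal, "C:\filename.docx" contains a form feed character because \f is an escape sequence, and split('.')[0] cuts the path at the first dot anywhere in it, not at the extension. The following is a minimal Python 3 sketch of a more robust variant; html_path_for is a hypothetical helper name and the example paths are made up.

```python
from pathlib import PureWindowsPath


def html_path_for(docx_path):
    """Return the same Windows path with its extension replaced by .html.

    Hypothetical helper for illustration only.
    """
    return str(PureWindowsPath(docx_path).with_suffix(".html"))


plain = "C:\filename.docx"  # \f is parsed as a form feed character
raw = r"C:\filename.docx"   # raw string keeps the backslash and the f

dotted = r"C:\my.docs\report.v2.docx"  # made-up path with extra dots
naive = dotted.split(".")[0] + ".html"  # cuts at the first dot
robust = html_path_for(dotted)          # replaces only the final suffix
```

PureWindowsPath only manipulates the string and never touches the file system, so the helper behaves the same on any operating system.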
I suggest trying the following code:
import mammoth
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value

PDF to Word Doc in Python

I've read through the other Stack Overflow questions on this, but they don't answer my issue, so downvote away. I'm using Python 2.7.
All I want to do is use python to convert a PDF to a Word doc. At minimum convert to text so I can copy and paste into a word doc.
This is the code I have so far. All it prints is the female gender symbol.
Is my code wrong? Am I approaching this wrong? Do some PDFs just not work with PDFMiner? Do you know of any other alternatives to accomplish my goal of converting a PDF to Word, besides using PyPDF2 or PDFMiner?
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file('Bottom Dec.pdf', 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
print convert_pdf_to_txt(1)
from pdf2docx import Converter
# Raw strings keep the backslashes in Windows paths literal
pdf_file = r'E:\Muhammad UMER LAR.pdf'
doc_file = r'E:\Lari.docx'
c = Converter(pdf_file)
c.convert(doc_file)
c.close()
Another alternative is the Aspose.Words Cloud SDK for Python, which you can install with pip and use for PDF to DOC conversion.
import asposewordscloud
import asposewordscloud.models.requests
api_client = asposewordscloud.ApiClient()
api_client.configuration.host = 'https://api.aspose.cloud'
# Get AppKey and AppSID from https://dashboard.aspose.cloud/
api_client.configuration.api_key['api_key'] = 'xxxxxxxxxxxxxxxxxxxxx' # Put your appKey here
api_client.configuration.api_key['app_sid'] = 'xxxxxxxxx-xxxx-xxxxx-xxxx-xxxxxxxxxx' # Put your appSid here
words_api = asposewordscloud.WordsApi(api_client)
filename = '02_pages.pdf'
remote_name = 'TestPostDocumentSaveAs.pdf'
dest_name = 'TestPostDocumentSaveAs.doc'
# Upload PDF file to storage
request_storage = asposewordscloud.models.requests.UploadFileRequest(filename,remote_name)
response = words_api.upload_file(request_storage)
# Convert PDF to DOC and save to storage
save_options = asposewordscloud.SaveOptionsData(save_format='doc', file_name=dest_name)
request = asposewordscloud.models.requests.SaveAsRequest(remote_name, save_options)
result = words_api.save_as(request)
print("Result {}".format(result))
I'm a developer evangelist at Aspose.

Python: Have correct image url, cannot download image

I am getting the correct url for the image. But I can't seem to be able to download the image and save it to a file. I am new to python so any guidance would be greatly appreciated. I have tried this with several other article sources and haven't had any trouble downloading the image once I get the url. Guess it doesn't like africom?
url: http://www.africom.mil/Newsroom/Article/12058/multinational-participation-plays-key-factor-to-exercise-african-lion
soup = BeautifulSoup(urllib2.urlopen(url).read())
links = soup.find("div", {'class': 'usafricom_ArticlePhotoContainer'}).find_all('img', src=True)
for link in links:
    imgfile = open('%s' % timestamp + "_" + title.encode("utf-8") + ".jpg", "wb")
    link = link["src"].split("src=")[-1]
    imgurl = "www.africom.mil" + link + ".jpg"
    download_img = urllib2.urlopen(imgurl).read()
    imgfile.write(download_img)
    imgfile.close()
Your question does not say which error you are seeing. When I tried your code, I hit this one:
ValueError: unknown url type: www.africom.mil/Image/12059/High/030414-M-XI134-002.jpg
This error is because of this line in your code:
imgurl = "www.africom.mil" + link + ".jpg"
The URL has no scheme (http://), so urllib2 cannot tell how to open it. Change it to:
imgurl = "http://www.africom.mil" + link + ".jpg"
and check. It worked for me with this change.
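An alternative to concatenating the host by hand is to resolve the image's src against the page URL, which carries the scheme along automatically. This is a minimal Python 3 sketch using urllib.parse.urljoin (in Python 2.7 the same function lives in the urlparse module); absolute_image_url is a hypothetical helper name, and the src value is the one implied by the error message above.

```python
from urllib.parse import urljoin


def absolute_image_url(page_url, src):
    """Resolve an <img> src attribute against the URL of the page it came from.

    Hypothetical helper for illustration only.
    """
    return urljoin(page_url, src)


page = ("http://www.africom.mil/Newsroom/Article/12058/"
        "multinational-participation-plays-key-factor-to-exercise-african-lion")
src = "/Image/12059/High/030414-M-XI134-002"

# Scheme and host come from the page URL; the path comes from src.
imgurl = absolute_image_url(page, src) + ".jpg"
```

If src is already an absolute URL, urljoin returns it unchanged, so the same helper works for sites that serve images from a different host.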