Python: Have correct image url, cannot download image - python-2.7

I am getting the correct URL for the image, but I can't seem to download the image and save it to a file. I am new to Python, so any guidance would be greatly appreciated. I have tried this with several other article sources and haven't had any trouble downloading the image once I had the URL. Guess it doesn't like africom?
url: http://www.africom.mil/Newsroom/Article/12058/multinational-participation-plays-key-factor-to-exercise-african-lion
soup = BeautifulSoup(urllib2.urlopen(url).read())
links = soup.find("div", {'class': 'usafricom_ArticlePhotoContainer'}).find_all('img', src=True)
for link in links:
    imgfile = open('%s' % timestamp + "_" + title.encode("utf-8") + ".jpg", "wb")
    link = link["src"].split("src=")[-1]
    imgurl = "www.africom.mil" + link + ".jpg"
    download_img = urllib2.urlopen(imgurl).read()
    imgfile.write(download_img)
    imgfile.close()

Your question does not say which error you are seeing. When I ran your code, I got this error:
ValueError: unknown url type: www.africom.mil/Image/12059/High/030414-M-XI134-002.jpg
This error is because of this line in your code:
imgurl = "www.africom.mil" + link + ".jpg"
The URL has no scheme (http://), so urllib2 does not know how to open it. Change the line to:
imgurl = "http://www.africom.mil" + link + ".jpg"
With this change it worked for me.
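As an alternative to hard-coding the host, the image URL can be resolved against the page URL. This is only a minimal sketch, assuming Python 3, where urljoin lives in urllib.parse (in Python 2.7 it is in the urlparse module). The helper name absolute_image_url and the sample src value are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_image_url(page_url, src):
    """Resolve an <img> src attribute against the URL of the page it came from."""
    return urljoin(page_url, src)


page_url = ("http://www.africom.mil/Newsroom/Article/12058/"
            "multinational-participation-plays-key-factor-to-exercise-african-lion")
# Hypothetical src value; the real attribute on the page may look different.
src = "/Image/12059/High/030414-M-XI134-002.jpg"

print(absolute_image_url(page_url, src))
```

Because the result always carries the scheme of the page URL, it can be passed straight to urlopen without the "unknown url type" error.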

Related

Django, Store jpg file received as string in http POST

I am receiving an http request from a desktop application with a screenshot. I cannot speak with the developer or see source code, so all I have is the http request I am getting.
The file isn't in request.FILES, it is in request.POST.
@csrf_exempt
def create_contract_event_handler(request, contract_id, event_type):
    keyboard_events_count = request.POST.get('keyboard_events_count')
    mouse_events_count = request.POST.get('mouse_events_count')
    screenshot_file = request.POST.get('screenshot_file')
    barr2 = bytes(screenshot_file.encode(encoding='utf8'))
    with open('.test/output.jpeg', 'wb') as f:
        f.write(barr2)
        f.close()
The file is corrupted.
The binary starts like this, I don't know if that helps:
����JFIFHH��C
%# , #&')*)-0-(0%()(��C
(((((((((((((((((((((((((((((((((((((((((((((((((((�� `"��
Also, if I try to open the image with PIL, I get the following error:
from PIL import Image
im = Image.open('./test/output.jpg')
#OSError: cannot identify image file './test/output.jpg'
In the end I got access to the client code. The 'filename' was missing from the header, and for that reason I was getting the file in request.POST instead of in the request.FILES dictionary.
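For anyone wondering why the saved file was corrupted: by the time the data shows up in request.POST it has already been decoded to text, and that is lossy for binary data. Here is a minimal sketch of the effect in Python 3, assuming the decoding replaces invalid byte sequences with U+FFFD (which would match the � characters shown in the question). The sample header bytes are illustrative, not taken from the real request.

```python
# First bytes of a typical JFIF/JPEG file (illustrative sample).
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# Decoding binary data as UTF-8 text replaces each invalid byte with U+FFFD.
as_text = jpeg_header.decode("utf-8", errors="replace")

# Re-encoding the text cannot restore the original bytes.
round_tripped = as_text.encode("utf-8")

print(jpeg_header)
print(round_tripped)
print(round_tripped == jpeg_header)
```

Once the original bytes have been replaced there is no way to get them back, so the fix has to happen on the sending side, as it did here.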

how to convert .docx file to html using python?

import mammoth
f = open(r"D:\filename.docx", 'rb')
document = mammoth.convert_to_html(f)
When I run this code I don't get an .html file. Please help me produce one. Also, when I did convert to .html, the images inserted in the Word file did not appear in the .html file. How can I get the images from the .docx into the .html?
Try this:
import mammoth
f = open("path_to_file.docx", 'rb')
b = open('filename.html', 'wb')
document = mammoth.convert_to_html(f)
b.write(document.value.encode('utf8'))
f.close()
b.close()
This may be late, but in case someone is still looking for a way to keep Word tables and images intact after conversion to HTML, the answer below may help.
import win32com.client as win32
# Open MS Word
word = win32.gencache.EnsureDispatch('Word.Application')
# Raw string so that "\f" is not read as a form-feed escape
wordFilePath = r"C:\filename.docx"
doc = word.Documents.Open(wordFilePath)
# Change the extension to .html
txt_path = wordFilePath.split('.')[0] + '.html'
# wdFormatFilteredHTML has value 10
# Save the doc as HTML
doc.SaveAs(txt_path, 10)
doc.Close()
# noinspection PyBroadException
try:
    word.ActiveDocument()
except Exception:
    word.Quit()
I suggest trying the following code:
import mammoth
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value
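The snippet above stops at the html string, so nothing is written to disk yet. Below is a rough sketch of saving it, assuming (as far as I know) that result.value is an HTML fragment rather than a full page. The helper write_html_document, the sample string, and the output path are all made up for illustration, and only the standard library is used.

```python
import io
import os
import tempfile


def write_html_document(fragment, path):
    """Wrap an HTML fragment in a minimal page and save it as UTF-8."""
    page = (
        "<!DOCTYPE html>\n<html>\n<head><meta charset=\"utf-8\"></head>\n"
        "<body>\n" + fragment + "\n</body>\n</html>\n"
    )
    # io.open with an explicit encoding behaves the same on Python 2.7 and 3.
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(page)


# Stand-in for result.value from the conversion step.
html = u"<p>Caf\u00e9</p>"
out_path = os.path.join(tempfile.gettempdir(), "document.html")
write_html_document(html, out_path)
```

Declaring the charset in the page keeps non-ASCII text readable when the file is opened directly in a browser.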

Python 3 | Beautiful Soup For Loop, Variable not assigned

I'm in the process of learning Python, and I decided to practice by writing a program that searches for text on a site called "Library of Babel" (https://libraryofbabel.info/).
I'm using BeautifulSoup to extract the text from the HTML, and then regular expressions to search it. In this case I was testing with just the letter "a".
But for some reason the code raises an error saying that the variable I'm searching for the "a" in is not assigned.
Code:
import re
import requests
from bs4 import BeautifulSoup
url = "https://libraryofbabel.info/browse.cgi"
pages,data=[],[]
r = requests.get(url)
r = r.text
soup = BeautifulSoup(r,"html.parser")
for text in soup.findAll("li",{"onclick":"gethexfromlist(this.innerHTML); enterhex();"}):
    page = text.string
    pages.append(page)
for eachRoom in pages:
    url = "https://libraryofbabel.info/browse.cgi?" + eachRoom
    for eachWall in range(1,5):
        url = url + "-w" + str(eachWall)
        for eachShelf in range(1,6):
            url = url + "s-" + str(eachShelf)
            for eachVolume in range(1,33):
                if len(str(eachVolume)) == 1:
                    url = url + "-v0" + str(eachVolume)
                else:
                    url = url + "-v" + str(eachVolume)
                for eachPage in range(1,411):
                    url = url + ":" + str(eachPage)
                    r = requests.get(url)
                    r = r.text
                    soup = BeautifulSoup(r,"html.parser")
                    for text in soup.findAll("div",{"class":"bookrealign"}):
                        rdata = text.string
                    if data == []:
                        data = re.findall(r"a",rdata)
                    else:
                        break
Error:
Traceback (most recent call last):
  File "C:\Users\...", line 37, in <module>
    data = re.findall(r"a",rdata)
NameError: name 'rdata' is not defined
Thanks in advance for any help given :)
Your if is outside the inner for loop, and soup.findAll("div",{"class":"bookrealign"}) finds nothing, so the loop body never runs and rdata never gets defined.
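One way to avoid this kind of NameError is to do the searching inside the loop and start from an empty result, so that an empty findAll result simply yields no matches. A minimal sketch, with a plain list of strings standing in for the .string values of the tags; the function name find_matches is made up for illustration.

```python
import re


def find_matches(texts, pattern):
    """Collect regex matches from each text, skipping missing values."""
    matches = []
    for text in texts:
        # BeautifulSoup's .string is None when a tag has several children.
        if text is None:
            continue
        matches.extend(re.findall(pattern, text))
    return matches


print(find_matches([], r"a"))                # no tags found: empty list, no error
print(find_matches(["banana", None], r"a"))  # three matches from "banana"
```

Nothing is read from a loop variable after the loop, so it does not matter whether the loop ran zero times or many.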

Python: Trouble getting image to download/save to file

I am new to Python and seem to be having trouble getting an image to download and save to a file. I was wondering if someone could point out my error. I have tried two methods in various ways to no avail. Here is my code below:
# Ask user to enter URL
url = "http://hosted.ap.org/dynamic/stories/A/AF_PISTORIUS_TRIAL?SITE=AP&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2014-04-15-15-48-52"
timestamp = datetime.date.today()
soup = BeautifulSoup(urllib2.urlopen(url).read())
#soup = BeautifulSoup(requests.get(url).text)
# ap
links = soup.find("td", {'class': 'ap-mediabox-td'}).find_all('img', src=True)
for link in links:
    imgfile = open('%s.jpg' % timestamp, "wb")
    link = link["src"].split("src=")[-1]
    imgurl = "http://hosted.ap.org/dynamic/files" + link
    download_img = urllib2.urlopen(imgurl).read()
    #download_img = requests.get(imgurl, stream=True)
    #imgfile.write(download_img.content)
    imgfile.write(download_img)
    imgfile.close()
# link outputs: /photos/F/f5cc6144-d991-4e28-b5e6-acc0badcea56-small.jpg
# imgurl outputs: http://hosted.ap.org/dynamic/files/photos/F/f5cc6144-d991-4e28-b5e6-acc0badcea56-small.jpg
I receive no console error, just an empty picture file.
You can get the relative path of the image simply with:
link = link["src"]
Your statement:
link = link["src"].split("src=")[-1]
is unnecessary. Replace it with the line above and the image file should be created. When I tried it, the file was created, but I could not view it: the viewer reported that the image was corrupted.
I have had success in the past doing the same task with Python's requests library, using this snippet:
r = requests.get(url, stream=True)
if r.status_code == 200:
    with open('photo.jpg', 'wb') as f:
        # Write the body in chunks; the with block closes the file
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
Here url would be your imgurl, computed with the change I suggested at the beginning.
Hope this helps.
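If the saved file still opens as empty or corrupted, it can help to check what was actually downloaded before blaming the writing code. A small sketch, assuming the server is expected to return a JPEG: JPEG data starts with the bytes FF D8 FF, so anything else (an HTML error page, for instance) is not an image. The helper name looks_like_jpeg and the sample byte strings are made up for illustration.

```python
def looks_like_jpeg(data):
    """Return True if data begins with the JPEG start-of-image marker."""
    return data[:3] == b"\xff\xd8\xff"


print(looks_like_jpeg(b"\xff\xd8\xff\xe0\x00\x10JFIF"))  # start of a JPEG
print(looks_like_jpeg(b"<html>Not found</html>"))        # an error page
print(looks_like_jpeg(b""))                              # empty response
```

Calling this on the downloaded bytes (download_img in the question) before writing them would show whether the problem is the URL or the file handling.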

Program for Backup - Python

I'm trying to run the following code in Python 2.7 on Windows 7. Its purpose is to back up a specified folder to a target folder, naming the archive according to the given pattern.
However, I'm not able to get it to work. The output is always 'Backup FAILED'.
Please advise on how I can resolve this and get the code working.
Thanks.
Code :
backup_ver1.py
import os
import time
import sys
sys.path.append('C:\Python27\GnuWin32\bin')
source = 'C:\New'
target_dir = 'E:\Backup'
target = target_dir + os.sep + time.strftime('%Y%m%d%H%M%S') + '.zip'
zip_command = "zip -qr {0} {1}".format(target,''.join(source))
print('This is a program for backing up files')
print(zip_command)
if os.system(zip_command)==0:
    print('Successful backup to', target)
else:
    print('Backup FAILED')
See if escaping the backslashes helps:
source = 'C:\\New'
target_dir = 'E:\\Backup'
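To see why escaping can matter, here is a short sketch comparing the three ways of writing a Windows path fragment. It uses the end of the GnuWin32 path from the question, where "\b" happens to be a real escape sequence (backspace). The variable names are made up for illustration.

```python
# "\b" inside a normal string literal is the backspace character (0x08).
unescaped = 'GnuWin32\bin'
# A raw string keeps the backslash as a literal character.
raw = r'GnuWin32\bin'
# Doubling the backslash has the same effect as the raw string.
escaped = 'GnuWin32\\bin'

print(repr(unescaped))
print(repr(raw))
print(raw == escaped)
```

Separately, sys.path only controls where Python looks for modules to import. Whether that is part of the failure here is an assumption, since the question does not show the message printed by the zip command itself.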