I am getting the correct URL for the image, but I can't seem to download the image and save it to a file. I am new to Python, so any guidance would be greatly appreciated. I have tried this with several other article sources and had no trouble downloading the image once I had the URL. I guess it doesn't like africom?
url: http://www.africom.mil/Newsroom/Article/12058/multinational-participation-plays-key-factor-to-exercise-african-lion
soup = BeautifulSoup(urllib2.urlopen(url).read())
links = soup.find("div", {'class': 'usafricom_ArticlePhotoContainer'}).find_all('img', src=True)
for link in links:
    imgfile = open('%s' % timestamp + "_" + title.encode("utf-8") + ".jpg", "wb")
    link = link["src"].split("src=")[-1]
    imgurl = "www.africom.mil" + link + ".jpg"
    download_img = urllib2.urlopen(imgurl).read()
    imgfile.write(download_img)
    imgfile.close()
I am not sure which error you are seeing, since your question does not mention one. When I ran your code, I got this error:
ValueError: unknown url type: www.africom.mil/Image/12059/High/030414-M-XI134-002.jpg
This error is because of this line in your code:
imgurl = "www.africom.mil" + link + ".jpg"
The URL does not specify a scheme (the http:// part), so urllib2 cannot tell how to open it. Change the line to:
imgurl = "http://www.africom.mil" + link + ".jpg"
and check. It worked for me with this change.
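One way to avoid assembling the address by hand is to resolve the image's src against the page URL. The following is a minimal sketch, assuming Python 3 (where the URL helpers live in urllib.parse rather than urllib2); the helper name absolute_image_url is made up for illustration, and the src value is the path from the error message above.

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = ("http://www.africom.mil/Newsroom/Article/12058/"
            "multinational-participation-plays-key-factor-to-exercise-african-lion")


def absolute_image_url(page_url, src):
    """Resolve a possibly relative src against the page it came from."""
    absolute = urljoin(page_url, src)
    # Fail early, with a clear message, if the result still has no scheme.
    if urlparse(absolute).scheme not in ("http", "https"):
        raise ValueError("no usable scheme in %r" % absolute)
    return absolute


img_url = absolute_image_url(PAGE_URL, "/Image/12059/High/030414-M-XI134-002.jpg")
print(img_url)
```

Because the scheme and host come from the page URL, the same code keeps working if the site is later served over https.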
I am receiving an http request from a desktop application with a screenshot. I cannot speak with the developer or see source code, so all I have is the http request I am getting.
The file isn't in request.FILES, it is in request.POST.
#csrf_exempt
def create_contract_event_handler(request, contract_id, event_type):
    keyboard_events_count = request.POST.get('keyboard_events_count')
    mouse_events_count = request.POST.get('mouse_events_count')
    screenshot_file = request.POST.get('screenshot_file')
    barr2 = bytes(screenshot_file.encode(encoding='utf8'))
    with open('.test/output.jpeg', 'wb') as f:
        f.write(barr2)
        f.close()
The file is corrupted.
The binary starts like this, I don't know if that helps:
����JFIFHH��C
%# , #&')*)-0-(0%()(��C
(((((((((((((((((((((((((((((((((((((((((((((((((((�� `"��
Also, if I try to open the image with PIL, I get the following error:
from PIL import Image
im = Image.open('./test/output.jpg')
#OSError: cannot identify image file './test/output.jpg'
In the end I got access to the code on the other side: the 'filename' was missing from the header, which is why the file arrived in the POST dictionary instead of the FILES dictionary.
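This also suggests why the saved file was unreadable, although the thread does not confirm it: once binary upload data has been decoded into a text field, invalid byte sequences are typically replaced, and encoding the text back to UTF-8 cannot restore them. A minimal sketch, assuming Python 3 and using the usual first bytes of a JFIF file as sample data:

```python
# First bytes of a typical JPEG/JFIF file: SOI marker, APP0 marker,
# segment length, then the ASCII identifier "JFIF".
original = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# A lenient text decoder turns every invalid sequence into U+FFFD,
# the replacement character.
as_text = original.decode("utf-8", errors="replace")

# Encoding the text again produces different bytes, so the image is corrupt.
round_tripped = as_text.encode("utf-8")

print(len(original), len(round_tripped))
print(round_tripped == original)
```

The four replacement characters at the start of as_text line up with the four unreadable characters in front of "JFIF" in the dump quoted in the question.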
import mammoth
f = open(r"D:\filename.docx", 'rb')
document = mammoth.convert_to_html(f)
When I run this code I do not get an .html file. Also, when I did convert to .html, the images inserted into the Word file did not appear in the output. Can you please help me get the images from the .docx into the .html?
Try this:
import mammoth
f = open("path_to_file.docx", 'rb')
b = open('filename.html', 'wb')
document = mammoth.convert_to_html(f)
b.write(document.value.encode('utf8'))
f.close()
b.close()
This may be late, but in case someone is still looking for a way to keep Word tables and images intact after conversion to HTML, the answer below should help.
import win32com.client as win32
# Open MS Word
word = win32.gencache.EnsureDispatch('Word.Application')
wordFilePath = r"C:\filename.docx"  # raw string, so \f is not read as a form feed
doc = word.Documents.Open(wordFilePath)
# change to a .html
txt_path = wordFilePath.split('.')[0] + '.html'
# wdFormatFilteredHTML has value 10
# saves the doc as an html
doc.SaveAs(txt_path, 10)
doc.Close()
# noinspection PyBroadException
try:
    word.ActiveDocument()
except Exception:
    word.Quit()
I suggest trying the following code:
import mammoth
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value
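The snippet above stops at an HTML string in memory. A minimal sketch of getting it onto disk follows; save_html_fragment is a hypothetical helper name, and the page wrapper is an assumption added so that browsers read the file as UTF-8. The sample fragment stands in for result.value. As far as I know, mammoth inlines images as data URIs by default, so they should travel with the string, but check that against your own document.

```python
import io
import os
import tempfile


def save_html_fragment(fragment, out_path):
    """Wrap an HTML fragment in a minimal page and write it as UTF-8."""
    page = (u'<!DOCTYPE html>\n<html><head><meta charset="utf-8"></head>\n'
            u"<body>\n" + fragment + u"\n</body></html>\n")
    with io.open(out_path, "w", encoding="utf-8") as out:
        out.write(page)
    return page


# Stand-in for result.value from mammoth.convert_to_html(docx_file).
sample = u"<p>Caf\u00e9 menu</p>"
out_path = os.path.join(tempfile.mkdtemp(), "document.html")
page = save_html_fragment(sample, out_path)
print(out_path)
```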
I'm in the process of learning Python, and I decided to practice my programming by writing a program that searches for text on a site called "Library of Babel" (https://libraryofbabel.info/).
I'm using BeautifulSoup to get the actual text out of the HTML and then regular expressions to search for what I'm looking for; in this case I was testing with just the letter "a".
But for some reason the code raises an error saying that the variable I'm searching for the "a" in is not defined.
Code:
import re
import requests
from bs4 import BeautifulSoup
url = "https://libraryofbabel.info/browse.cgi"
pages,data=[],[]
r = requests.get(url)
r = r.text
soup = BeautifulSoup(r,"html.parser")
for text in soup.findAll("li",{"onclick":"gethexfromlist(this.innerHTML); enterhex();"}):
    page = text.string
    pages.append(page)
for eachRoom in pages:
    url = "https://libraryofbabel.info/browse.cgi?" + eachRoom
    for eachWall in range(1,5):
        url = url + "-w" + str(eachWall)
        for eachShelf in range(1,6):
            url = url + "s-" + str(eachShelf)
            for eachVolume in range(1,33):
                if len(str(eachVolume)) == 1:
                    url = url + "-v0" + str(eachVolume)
                else:
                    url = url + "-v" + str(eachVolume)
                for eachPage in range(1,411):
                    url = url + ":" + str(eachPage)
                    r = requests.get(url)
                    r = r.text
                    soup = BeautifulSoup(r,"html.parser")
                    for text in soup.findAll("div",{"class":"bookrealign"}):
                        rdata = text.string
                    if data == []:
                        data = re.findall(r"a",rdata)
                    else:
                        break
Error:
Traceback (most recent call last):
File "C:\Users\...", line 37, in <module>
data = re.findall(r"a",rdata)
NameError: name 'rdata' is not defined
Thanks in advance for any help given :)
Your if is outside the loop, and soup.findAll("div",{"class":"bookrealign"}) finds nothing, so rdata never gets defined.
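A minimal sketch of that fix, with the network calls left out: do the matching inside the loop, so nothing is read when the search finds no elements. The function name find_matches and the sample strings are illustrative only; None stands for a tag whose .string is empty, which BeautifulSoup reports for tags with more than one child.

```python
import re


def find_matches(texts, pattern):
    """Collect regex matches from every non-empty text in texts."""
    matches = []
    for raw in texts:
        # Skip tags that had no single string to offer.
        if raw is None:
            continue
        matches.extend(re.findall(pattern, raw))
    return matches


print(find_matches(["banana", None, "xyz"], r"a"))  # ['a', 'a', 'a']
print(find_matches([], r"a"))                       # []
```

In the original script, texts would be the .string of each element returned by soup.findAll("div",{"class":"bookrealign"}); an empty result then yields an empty list instead of a NameError.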
I am new to Python and seem to be having trouble getting an image to download and save to a file. I was wondering if someone could point out my error. I have tried two methods in various ways to no avail. Here is my code below:
# Ask user to enter URL
url = "http://hosted.ap.org/dynamic/stories/A/AF_PISTORIUS_TRIAL?SITE=AP&SECTION=HOME&TEMPLATE=DEFAULT&CTIME=2014-04-15-15-48-52"
timestamp = datetime.date.today()
soup = BeautifulSoup(urllib2.urlopen(url).read())
#soup = BeautifulSoup(requests.get(url).text)
# ap
links = soup.find("td", {'class': 'ap-mediabox-td'}).find_all('img', src=True)
for link in links:
    imgfile = open('%s.jpg' % timestamp, "wb")
    link = link["src"].split("src=")[-1]
    imgurl = "http://hosted.ap.org/dynamic/files" + link
    download_img = urllib2.urlopen(imgurl).read()
    #download_img = requests.get(imgurl, stream=True)
    #imgfile.write(download_img.content)
    imgfile.write(download_img)
    imgfile.close()
# link outputs: /photos/F/f5cc6144-d991-4e28-b5e6-acc0badcea56-small.jpg
# imgurl outputs: http://hosted.ap.org/dynamic/files/photos/F/f5cc6144-d991-4e28-b5e6-acc0badcea56-small.jpg
I receive no console error, just an empty picture file.
The relative path of the image can be obtained simply with:
link = link["src"]
Your statement:
link = link["src"].split("src=")[-1]
is unnecessary. Replace it with the line above and the image file should be created. When I tried it, the file was created, but I could not view it: the viewer reported that the image was corrupted.
I have had success in the past doing the same task with Python's requests library, using this code snippet:
r = requests.get(url, stream=True)
if r.status_code == 200:
    with open('photo.jpg', 'wb') as f:
        # Write the body in chunks; the with block closes the file.
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
url in the snippet above would be your imgurl computed with the changes I suggested at the beginning.
Hope this helps.
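Since both attempts in this thread ended with a file that would not open, it can help to check what the server actually sent before writing it. The sketch below is hedged: it assumes the response is available as a file-like object (for example, what urlopen returns), the name save_if_jpeg is hypothetical, and the only check is the three-byte signature that JPEG files start with. A BytesIO object stands in for the network response.

```python
import io
import os
import shutil
import tempfile

JPEG_MAGIC = b"\xff\xd8\xff"  # every JPEG file starts with these bytes


def save_if_jpeg(stream, out_path):
    """Write stream to out_path only if it starts like a JPEG."""
    head = stream.read(len(JPEG_MAGIC))
    if head != JPEG_MAGIC:
        return False  # empty body, HTML error page, or some other format
    with open(out_path, "wb") as f:
        f.write(head)
        shutil.copyfileobj(stream, f)  # copy the remainder in chunks
    return True


out_dir = tempfile.mkdtemp()
good = io.BytesIO(JPEG_MAGIC + b"\xe0\x00\x10JFIF...")
bad = io.BytesIO(b"<html>Not found</html>")
print(save_if_jpeg(good, os.path.join(out_dir, "photo.jpg")))
print(save_if_jpeg(bad, os.path.join(out_dir, "error.jpg")))
```

A False result points at the request (wrong URL, blocked client, error page) rather than at the file-writing code.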
I'm trying to run the following code in Python 2.7 on Windows 7. Its purpose is to back up a specified folder into another specified folder, using the naming pattern given.
However, I can't get it to work. The output is always 'Backup FAILED'.
Please advise on how I can resolve this and get the code working.
Thanks.
Code :
backup_ver1.py
import os
import time
import sys
sys.path.append('C:\Python27\GnuWin32\bin')
source = 'C:\New'
target_dir = 'E:\Backup'
target = target_dir + os.sep + time.strftime('%Y%m%d%H%M%S') + '.zip'
zip_command = "zip -qr {0} {1}".format(target,''.join(source))
print('This is a program for backing up files')
print(zip_command)
if os.system(zip_command)==0:
    print('Successful backup to', target)
else:
    print('Backup FAILED')
See if escaping the backslashes helps:
source = 'C:\\New'
target_dir = 'E:\\Backup'
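If escaping alone does not help, the external zip program may not be reachable: appending to sys.path affects Python imports, not where os.system looks for executables. An alternative that avoids the external program altogether is the standard library's shutil.make_archive. This is a sketch of a different technique, not the original script; the function name backup is hypothetical and temporary folders stand in for the real paths, which could be written as raw strings such as r'C:\New' and r'E:\Backup'.

```python
import os
import shutil
import tempfile
import time
import zipfile


def backup(source, target_dir):
    """Zip the contents of source into target_dir; return the archive path."""
    base = os.path.join(target_dir, time.strftime("%Y%m%d%H%M%S"))
    # make_archive appends the .zip extension to base itself.
    return shutil.make_archive(base, "zip", root_dir=source)


# Demo with throwaway folders in place of the real source and target.
source = tempfile.mkdtemp()
target_dir = tempfile.mkdtemp()
with open(os.path.join(source, "notes.txt"), "w") as f:
    f.write("hello")

archive = backup(source, target_dir)
print("Successful backup to", archive)
```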