Scraping images using beautiful soup - python-2.7

I'm trying to scrape the image from an article using beautiful soup. It seems to work but I can't open the image. I get a file format error every time I try to access the image from my desktop. Any insights?
timestamp = time.asctime()
# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())
# Create a new file to write content to
txt = open('%s.jpg' % timestamp, "wb")
# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    txt.write('\n' + "Image(s): " + download_img.read() + '\n' + '\n')
txt.close()

You are prepending a newline and a text label to the data for every image, essentially corrupting it.
Also, you are writing every image into the same file, again corrupting them.
Put the logic for writing the files inside the loop, don't add any extra data to the images, and it should work fine.
# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
    timestamp = time.asctime()
    txt = open('%s.jpg' % timestamp, "wb")
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    txt.write(download_img.read())
    txt.close()
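One caveat worth noting: time.asctime() yields names containing spaces and colons, which some filesystems reject, and two images fetched in the same second would overwrite each other. A minimal variant of the same loop that numbers the files instead (assuming the same soup and urllib2 setup as above):
links = soup.find('figure').find_all('img', src=True)
for i, link in enumerate(links):
    src = link["src"].split("src=")[-1]
    # one uniquely numbered file per image, written in binary mode
    download_img = urllib2.urlopen(src)
    with open('article_img_%d.jpg' % i, 'wb') as out:
        out.write(download_img.read())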

Related

Python pdfminer extract image produces multiple images per page (should be single image)

I am attempting to extract images that are in a PDF. The file I am working with is 2+ pages: page 1 is text, and pages 2-n are images (one per page, or possibly a single image spanning multiple pages; I do not have control over the origin).
I am able to parse the text out of page 1, but when I try to get the images I get 3 images per image page. I cannot determine the image type, which makes saving them difficult. Additionally, trying to save each page's 3 pictures as a single img produces nothing usable (as in, it cannot be opened via Finder on OS X).
Sample:
fp = open('the_file.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    pdf_item = device.get_result()
    for thing in pdf_item:
        if isinstance(thing, LTImage):
            save_image(thing)
        if isinstance(thing, LTFigure):
            find_images_in_thing(thing)

def find_images_in_thing(outer_layout):
    for thing in outer_layout:
        if isinstance(thing, LTImage):
            save_image(thing)
save_image either writes a file per image in pageNum_imgNum format in 'wb' mode or a single image per page in 'a' mode. I have tried numerous file extensions with no luck.
Resources I've looked into:
http://denis.papathanasiou.org/posts/2010.08.04.post.html (outdated pdfminer version)
http://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html
It's been a while since this question has been asked, but I'll contribute for the sake of the community, and potentially for your benefit :)
I've been using an image parser called pdfimages, available through the poppler PDF processing framework. It also outputs several files per image; it seems like a relatively common behavior for PDF generators to 'tile' or 'strip' the images into multiple images that then need to be pieced together when scraping, but appear to be entirely intact while viewing the PDF. The formats/file extensions that I have seen through pdfimages and elsewhere are: png, tiff, jp2, jpg, ccitt. Have you tried all of those?
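For what it's worth, pdfimages can also be driven from a script. A minimal sketch, assuming poppler's pdfimages binary is installed and on your PATH (the -j flag writes JPEG streams out as .jpg files instead of converting them):
import subprocess

# produces the_file-000.jpg, the_file-001.ppm, ... in the current directory
subprocess.call(['pdfimages', '-j', 'the_file.pdf', 'the_file'])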
Have you tried something like this?
from binascii import b2a_hex
def determine_image_type(stream_first_4_bytes):
    """Find out the image file type based on the magic number comparison of the first 4 (or 2) bytes"""
    file_type = None
    bytes_as_hex = b2a_hex(stream_first_4_bytes).decode()
    if bytes_as_hex.startswith('ffd8'):
        file_type = '.jpeg'
    elif bytes_as_hex == '89504e47':
        file_type = '.png'
    elif bytes_as_hex == '47494638':
        file_type = '.gif'
    elif bytes_as_hex.startswith('424d'):
        file_type = '.bmp'
    return file_type
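For instance, you could wire it into your save_image logic like this (a rough sketch: page_num and img_num are placeholders for however you build your pageNum_imgNum names, and get_rawdata() is pdfminer's accessor for the embedded stream bytes):
# thing is an LTImage instance from the loop in the question
data = thing.stream.get_rawdata()
ext = determine_image_type(data[:4]) or '.bin'  # fall back for unknown types
with open('%s_%s%s' % (page_num, img_num, ext), 'wb') as out:
    out.write(data)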
A (partial) solution for the image tiling problem is posted here: PDF: extracted images are sliced / tiled
I would use an image library to find the image type:
import io
from PIL import Image
image = Image.open(io.BytesIO(thing.stream.get_data()))
print(image.format)

Using lxml, how can I append to a document and then wrap the entire document in a tag so I can search it with xpath?

Without the entire document being wrapped in a single tag, xpath gives me the "Extra content at end of the document" error. By itself that is no issue; I could wrap the entire thing in one tag. But my program writes to this document many times, and having to go into the document and edit it by hand defeats the purpose of having the program.
This is my code for writing to the document:
def write():
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    post = open('post.txt', 'w')
    document = etree.Element('document')
    title = raw_input('title>>')
    while 1:
        message = raw_input('post>>')
        post.write(message + '\n')
        if '[done]' in message:
            tags = raw_input('tags>>')
            break
    post = open('post.txt', 'r')
    postf = post.read()
    article = etree.SubElement(document, 'article', title=title, date=st, tags=tags)
    article.text = postf
    post.close()
    with open('postf.txt', 'a') as file:
        file.write(etree.tostring(article, pretty_print=True) + '\n')
    return document, article
And this is the code for searching the document:
if search in command:
    query = command.replace(search + ' ', "")  # remove precursor
    post = open('postf.txt', 'r')
    postf = str(post.read())
    root = etree.fromstring(postf)
    articles = root.xpath('//article[contains(@tags, "%s")]' % query)
    for article in articles:
        print etree.tostring(article, pretty_print=True)
Is there a step in there somewhere I can add that will wrap the entire document in a single tag after each "write()" function is called?
Let me know if it is necessary to post my full program, but I am fairly certain that this is the only part of the code that would affect what I am looking to do. If not, leave a comment and I will edit the rest in. Thank you.
You can create a "virtual wrapper" around your "multi-root" XML file by placing the following file in the same directory as that file:
<!DOCTYPE doc [
<!ENTITY e SYSTEM "article.xml">
]>
<doc>&e;</doc>
You can then target XPath expressions at this virtual document. That way you retain the ability to append data to the real article.xml, while being able to execute XPath queries at any time.
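For example, if the wrapper above is saved as wrapper.xml and its SYSTEM entity points at your postf.txt, a sketch of querying it with lxml (external entity resolution is on by default, but being explicit doesn't hurt; query is the variable from your search code):
from lxml import etree

parser = etree.XMLParser(resolve_entities=True)
root = etree.parse('wrapper.xml', parser).getroot()
articles = root.xpath('//article[contains(@tags, "%s")]' % query)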

scraping multiple images using beautiful soup

I'm trying to grab all of the img links in the slideshow on a Reuters article. I was wondering if someone could explain to me why this only grabs the first image and no others?
Here's the article for reference: http://www.reuters.com/article/2014/04/11/us-cocoa-gold-westafrica-insight-idUSBREA3A0DP20140411
links = soup.find_all("div", {'id': 'frame_fd1fade'})
for link in links:
    for img in link.find_all('img', src=True):
        img = img["src"].split("src=")[-1]
        print img
In the page source for the article, there's only one div with id="frame_fd1fade". Within it, there is one img tag, which contains the first picture. You'll have to look into the mechanism that the page uses to change pictures and use that somehow to get your images.
Try running this to see how many instances of frame_fd1fade there are in the source:
import urllib
import re
f = urllib.urlopen("http://www.reuters.com/article/2014/04/11/us-cocoa-gold-westafrica-insight-idUSBREA3A0DP20140411")
cnt = 0
for line in f:
    if re.search("frame_fd1fade", line):
        cnt += 1
print "cnt =", cnt

Get content and image from RSS url in django-yarr

I'm using django-yarr for my RSS reader application. Is there any way to fetch content from an RSS URL and save it in the database? Or is there any library that could do that?
Are you looking to read data from an RSS feed, process it, and save it?
Use Requests to fetch the data.
import requests
req = requests.get('http://feeds.bbci.co.uk/news/technology/rss.xml')
req.text  # XML as a string
BeautifulSoup, lxml or ElementTree to process the data (or similar libraries that can process xml)
from bs4 import BeautifulSoup
soup = BeautifulSoup(req.text)
images = soup.findAll('media:thumbnail')
Finally do whatever you want with the data
for image in images:
    thing = DjangoModelThing()
    thing.image = image.attrs.get('url')
    thing.save()
UPDATE
Alternatively you could grab each article from the RSS
articles = soup.findAll('item')
for article in articles:
    title = article.find('title')
    description = article.find('description')
    link = article.find('link')
    images = article.find('media:thumbnail')
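If you'd rather not pick the XML apart by hand, the feedparser library handles RSS directly. A short sketch (in my experience media:thumbnail elements surface as entry.media_thumbnail, a list of attribute dicts, though that's worth verifying against your feed):
import feedparser

feed = feedparser.parse('http://feeds.bbci.co.uk/news/technology/rss.xml')
for entry in feed.entries:
    print entry.title
    print entry.link
    if hasattr(entry, 'media_thumbnail'):
        print entry.media_thumbnail[0].get('url')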

Saving animated GIFs using PIL (image saved does not animate)

I have Apache2 + PIL + Django + X-sendfile. My problem is that when I save an animated GIF, it won't "animate" when I output it through the browser.
Here is my code to display the image located outside the public accessible directory.
def raw(request, uuid):
    target = str(uuid).split('.')[:-1][0]
    image = Uploads.objects.get(uuid=target)
    path = image.path
    filepath = os.path.join(path, "%s.%s" % (image.uuid, image.ext))
    response = HttpResponse(mimetype=mimetypes.guess_type(filepath))
    response['Content-Disposition'] = 'filename="%s"' % smart_str(image.filename)
    response["X-Sendfile"] = filepath
    response['Content-length'] = os.stat(filepath).st_size
    return response
UPDATE
It turns out that it works. My problem is when I try to upload an image via URL. It probably doesn't save the entire GIF?
def handle_url_file(request):
"""
Open a file from a URL.
Split the file to get the filename and extension.
Generate a random uuid using rand1()
Then save the file.
Return the UUID when successful.
"""
try:
file = urllib.urlopen(request.POST['url'])
randname = rand1(settings.RANDOM_ID_LENGTH)
newfilename = request.POST['url'].split('/')[-1]
ext = str(newfilename.split('.')[-1]).lower()
im = cStringIO.StringIO(file.read()) # constructs a StringIO holding the image
img = Image.open(im)
filehash = checkhash(im)
image = Uploads.objects.get(filehash=filehash)
uuid = image.uuid
return "%s" % (uuid)
except Uploads.DoesNotExist:
img.save(os.path.join(settings.UPLOAD_DIRECTORY,(("%s.%s")%(randname,ext))))
del img
filesize = os.stat(os.path.join(settings.UPLOAD_DIRECTORY,(("%s.%s")%(randname,ext)))).st_size
upload = Uploads(
ip = request.META['REMOTE_ADDR'],
filename = newfilename,
uuid = randname,
ext = ext,
path = settings.UPLOAD_DIRECTORY,
views = 1,
bandwidth = filesize,
source = request.POST['url'],
size = filesize,
filehash = filehash,
)
upload.save()
#return uuid
return "%s" % (upload.uuid)
except IOError, e:
raise e
Any ideas?
Thanks!
Wenbert
Where does that Image class come from and what does Image.open do?
My guess is that it does some sanitizing of the image data (which is a good thing), but only saves the first frame of the GIF.
Edit:
I'm convinced this is an issue with PIL. The PIL documentation on GIF says:
PIL reads GIF87a and GIF89a versions of the GIF file format. The library writes run-length encoded GIF87a files.
To verify, you can write the contents of im directly to disk and compare with the source image.
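For example, a quick check along those lines, reusing the im buffer from the question's handle_url_file:
# dump the downloaded bytes untouched; if this copy animates in a browser
# but the PIL-saved one does not, the frames are lost inside img.save()
with open('raw_copy.gif', 'wb') as out:
    out.write(im.getvalue())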
The problem is saving a PIL-opened version of the image. When you save it out via PIL, it will only save the first frame.
However, there's an easy workaround: make a temp copy of the file, open that with PIL, and if you detect that it's an animated GIF, save the original file rather than the PIL-opened version.
If you save the original animated GIF file and then stream it back into your HTTP response, it will come through animated to the browser.
Example code to detect if your PIL object is an animated GIF:
def image_is_animated_gif(image):
    # verify image format
    if image.format.lower() != 'gif':
        return False
    # verify GIF is animated by attempting to seek beyond the initial frame
    try:
        image.seek(1)
    except EOFError:
        return False
    else:
        return True
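Plugged into the question's handle_url_file, the workaround could look roughly like this (im is the cStringIO buffer and img the PIL image from that code; dest mirrors its upload path):
raw_bytes = im.getvalue()
dest = os.path.join(settings.UPLOAD_DIRECTORY, "%s.%s" % (randname, ext))
if image_is_animated_gif(img):
    # write the untouched bytes so every frame survives
    with open(dest, 'wb') as out:
        out.write(raw_bytes)
else:
    img.save(dest)  # fine for single-frame images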