scraping multiple images using beautiful soup - python-2.7

I'm trying to grab all of the img links in the slideshow on a Reuters article. I was wondering if someone could explain to me why this only grabs the first image and no others?
Here's the article for reference: http://www.reuters.com/article/2014/04/11/us-cocoa-gold-westafrica-insight-idUSBREA3A0DP20140411
links = soup.find_all("div", {'id': 'frame_fd1fade'})
for link in links:
for img in link.find_all('img', src=True):
img = img["src"].split("src=")[-1]
print img

In the page source for the article, there's only one div with id="frame_fd1fade". Within it, there is one img tag, which contains the first picture. You'll have to look into the mechanism that the page uses to change pictures and use that somehow to get your images.
Try running this to see how many instances of frame_fd1fade there are in the source:
import urllib
import re
f = urllib.urlopen("http://www.reuters.com/article/2014/04/11/us-cocoa-gold-westafrica-insight-idUSBREA3A0DP20140411")
cnt = 0
for line in f:
if re.search("frame_fd1fade", line):
cnt += 1
print "cnt =", cnt

Related

Export data from BeautifulSoup to CSV

[DISCLAIMER] I have been through plenty of the other answers on the area, but they do not seem to work for me.
I want to be able to export the data I have scraped as a CSV file.
My question is how do I write the piece of code which outputs the data to a CSV?
Current Code
import requests
from bs4 import BeautifulSoup
url = "http://implementconsultinggroup.com/career/#/6257"
r = requests.get(url)
req = requests.get(url).text
soup = BeautifulSoup(r.content)
links = soup.find_all("a")
for link in links:
if "career" in link.get("href") and 'COPENHAGEN' in link.text:
print "<a href='%s'>%s</a>" %(link.get("href"), link.text)
Output from the code
View Position
</a>
<a href='/career/management-consultants-to-help-our-customers-succeed-with-
it/'>
Management consultants to help our customers succeed with IT
COPENHAGEN • At Implement Consulting Group, we wish to make a difference in
the consulting industry, because we believe that the ability to create Change
with Impact is a precondition for success in an increasingly global and
turbulent world.
View Position
</a>
<a href='/career/management-consultants-within-process-improvement/'>
Management consultants within process improvement
COPENHAGEN • We are looking for consultants with profound
experience in Six Sigma, Lean and operational
management
Code I have tried
with open('ImplementTest1.csv',"w") as csv_file:
writer = csv.writer(csv_file)
writer.writerow(["link.get", "link.text"])
csv_file.close()
Output in CSV format
Column 1: Url Links
Column 2: Job description
E.g
Column 1: /career/management-consultants-to-help-our-customers-succeed-with-
it/
Column 2: Management consultants to help our customers succeed with IT
COPENHAGEN • At Implement Consulting Group, we wish to make a difference in
the consulting industry, because we believe that the ability to create Change
with Impact is a precondition for success in an increasingly global and
turbulent world.
Try this script and get the csv output:
import csv ; import requests
from bs4 import BeautifulSoup
outfile = open('career.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["job_link", "job_desc"])
res = requests.get("http://implementconsultinggroup.com/career/#/6257").text
soup = BeautifulSoup(res,"lxml")
links = soup.find_all("a")
for link in links:
if "career" in link.get("href") and 'COPENHAGEN' in link.text:
item_link = link.get("href").strip()
item_text = link.text.replace("View Position","").strip()
writer.writerow([item_link, item_text])
print(item_link, item_text)
outfile.close()

Python pdfminer extract image produces multiple images per page (should be single image)

I am attempting to extract images that are in a PDF. The file I am working with is 2+ pages. Page 1 is text and pages 2-n are images (one per page, or it may be a single image spanning multiple pages; I do not have control over the origin).
I am able to parse the text out from page 1 but when I try to get the images I am getting 3 images per image page. I cannot determine the image type which makes saving it difficult. Additionally trying to save each pages 3 pictures as a single img provides no result (as in cannot be opened via finder on OSX)
Sample:
fp = open('the_file.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
interpreter.process_page(page)
pdf_item = device.get_result()
for thing in pdf_item:
if isinstance(thing, LTImage):
save_image(thing)
if isinstance(thing, LTFigure):
find_images_in_thing(thing)
def find_images_in_thing(outer_layout):
for thing in outer_layout:
if isinstance(thing, LTImage):
save_image(thing)
save_image either writes a file per image in pageNum_imgNum format in 'wb' mode or a single image per page in 'a' mode. I have tried numerous file extensions with no luck.
Resources I've looked into:
http://denis.papathanasiou.org/posts/2010.08.04.post.html (outdatted pdfminer version)
http://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html
It's been a while since this question has been asked, but I'll contribute for the sake of the community, and potentially for your benefit :)
I've been using an image parser called pdfimages, available through the poppler PDF processing framework. It also outputs several files per image; it seems like a relatively common behavior for PDF generators to 'tile' or 'strip' the images into multiple images that then need to be pieced together when scraping, but appear to be entirely intact while viewing the PDF. The formats/file extensions that I have seen through pdfimages and elsewhere are: png, tiff, jp2, jpg, ccitt. Have you tried all of those?
Have you tried something like this?
from binascii import b2a_hex
def determine_image_type (stream_first_4_bytes):
"""Find out the image file type based on the magic number comparison of the first 4 (or 2) bytes"""
file_type = None
bytes_as_hex = b2a_hex(stream_first_4_bytes).decode()
if bytes_as_hex.startswith('ffd8'):
file_type = '.jpeg'
elif bytes_as_hex == '89504e47':
file_type = '.png'
elif bytes_as_hex == '47494638':
file_type = '.gif'
elif bytes_as_hex.startswith('424d'):
file_type = '.bmp'
return file_type
A (partial) solution for the image tiling problem is posted here: PDF: extracted images are sliced / tiled
I would use in image library to find the image type:
import io
from PIL import Image
image = Image.open(io.BytesIO(thing.stream.get_data()))
print(image.format)

Web-scraping with Python: NoneType error, can't scrape table's data

this is my first attempt at coding so please forgive my daftness. I'm trying to learn web scraping by practising with this link:
https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0
I've honestly spent hours trying to figure out what's wrong with my code here:
import csv
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('tbody')
list_of_rows = []
for row in table.find('tr'):
list_of_cells = []
for cell in row.findAll('td'):
list_of_cells.append()
list_of_rows.append(list_of_cells)
outfile = open("./indarb.csv","wb")
writer = csv.writer(outfile)
My terminal then spits out this: 'NoneType' object has no attribute 'find', saying there's an error in line 13. Not sure if it helps in queries but this is a list of what I've tried:
Different permutations of 'find'/'findAll'
Instead of '.find', used '.findAll'
Instead of '.findAll', used '.find'
Different permutations for line 10
Tried soup.find('tbody')
Tried soup.find('table')
Opened source code, tried soup.find('table', attrs={'class':'table table-condensed'})
Different permutations for line 13
similarly tried with just 'tr' tag; or
tried adding 'attrs={}' stuff
I've really tried but can't figure out why I can't scrape that simple 10 row table. If anyone could post code that works, that'd be phenomenal. Thank you for your patience!
The URL you request in your code is not HTML but JSON.
You have a few mistakes, the biggest is you are using BeautifulSoup3 which has not been developed for years, you should be use bs4, you also need to use find_all when you want want multiple tags. Also you have not passed cell to list_of_cells.append() on line 13 so that is the cause of your other error:
from bs4 import BeautifulSoup
url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0%27'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('table')
list_of_rows = []
for row in table.find_all('tr'):
list_of_cells = []
for cell in row.find_all('td'):
list_of_cells.append(cell)
list_of_rows.append(list_of_cells)
I am not sure exactly what you want but that appends the tds from the first table on the page. There is also and api you can use and adownloadable csv if you do actually want the data.

Python lxml xpath no output

For educational purposes I am trying to scrape this page using lxml and requests in Python.
Specifically I just want to print the research areas of all the professors on the page.
This is what I have done till now
import requests
from lxml import html
response=requests.get('http://cse.iitkgp.ac.in/index.php?secret=d2RkOUgybWlNZzJwQXdLc28wNzh6UT09')
parsed_body=html.fromstring(response.content)
for row in parsed_body.xpath('//div[#id="maincontent"]//tr[position() mod 2 = 1]'):
for column in row.xpath('//td[#class="fcardcls"]/tr[2]/td/font/text()'):
print column.strip()
But it is not printing anything. I was struggling quite a bit with xpaths and was intially using the copy xpath feature in chrome. I followed what was done in the following SO questions/answers and cleaned up my code quite a bit and got rid of ' tbody ' in the xpaths. Still the code returns a blank.
1. Empty List Returned
2. Python-lxml-xpath problem
First of all, the main content with the desired data inside is loaded from a different endpoint via an XHR request - simulate that in your code.
Here is the complete working code printing names and a list of research areas per name:
import requests
from lxml import html
response = requests.get('http://cse.iitkgp.ac.in/faculty4.php?_=1450503917634')
parsed_body = html.fromstring(response.content)
for row in parsed_body.xpath('.//td[#class="fcardcls"]'):
name = row.findtext(".//a[#href]/b")
name = ' '.join(name.split()) # getting rid of multiple spaces
research_areas = row.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", ")
print(name, research_areas)
The idea here is use the fact that all "professor blocks" are located in td elements with class="fcardcls". For every block, get the name from the bold link text and research areas from the following string after Research Areas: bold text.

Scraping images using beautiful soup

I'm trying to scrape the image from an article using beautiful soup. It seems to work but I can't open the image. I get a file format error every time I try to access the image from my desktop. Any insights?
timestamp = time.asctime()
# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())
# Create a new file to write content to
txt = open('%s.jpg' % timestamp, "wb")
# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
link = link["src"].split("src=")[-1]
download_img = urllib2.urlopen(link)
txt.write('\n' + "Image(s): " + download_img.read() + '\n' + '\n')
txt.close()
You are appending a new line and text to the start of the data for every image, essentially corrupting it.
Also, you are writing every image into the same file, again corrupting them.
Put the logic for writing the files inside the loop, and don't add any extra data to the images and it should work fine.
# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
timestamp = time.asctime()
txt = open('%s.jpg' % timestamp, "wb")
link = link["src"].split("src=")[-1]
download_img = urllib2.urlopen(link)
txt.write(download_img.read())
txt.close()