Issue in scraping data from a website using beautiful soup

Issue in scraping data from a website using beautiful soup - python-2.7

I am trying to scrape list of 41 items & their prices from a website. But my output csv is missing some 2-3 items which come at the end of the page. Reason for this being, some devices have their price mentioned in different class than rest of the devices.
Recursion in my code is running for name and price together and for items where price is mentioned under different class, it is taking the price value from the next device. Hence, it is skipping last 2-3 items as prices for those devices already entered in recursion for previous devices.
Below is the referred code:
# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.deviceListGridView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?taxoStyle=SMARTPHONES&showMoreListSize=1000').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('AT&T_2012-12-28.csv', 'wb') as csvfile:
spamwriter = csv.writer(csvfile, delimiter=',')
spamwriter.writerow(["Date","Month","Day of Week","Device Name","Price"])
items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=True)
prices = soup.findAll('div', {"class": "listGrid-price"})
for item, price in zip(items, prices):
textcontent = u' '.join(price.stripped_strings)
if textcontent:
spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(item.string).encode('utf8').replace('â„¢','').replace('Â®','').strip(),textcontent])
Price are usually mentioned under listGrid-price but for some 2-3 items which are outofstock at the moment price is under listGrid-price-outOfStock I need to include this also in my recursion so that right price comes before the item and loop runs for all the devices.
Please pardon my ignorance as I am new to programming

You can use a comparator function, to make custom comparison and pass it to your findAll().
So if you modify your line with prices assignment to:
prices = soup.findAll('div', class_=match_both)
and define the function as:
def match_both(arg):
if arg == "listGrid-price" or arg == "listGrid-price-outOfStock":
return True
return False
(function can be made much more concise, verbosity here just to give you an idea of how it works)
it will thus compare to both and return a match in any of the cases.
More info can be found in documentation. (The has_six_characters variant)
Now, since you also asked how to exclude particular text.
text argument to findAll() can also have custom comparators.
So in this case, you don't want text saying Write a review to match and cause a shift in price vs text.
Hence your edited script to exclude review part:
# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
def match_both(arg):
if arg == "listGrid-price" or arg == "listGrid-price-outOfStock":
return True
return False
def not_review(arg):
if not arg:
return arg
return "Write a review" not in arg
page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.deviceListGridView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?taxoStyle=SMARTPHONES&showMoreListSize=1000').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('AT&T_2012-12-28.csv', 'wb') as csvfile:
spamwriter = csv.writer(csvfile, delimiter=',')
spamwriter.writerow(["Date","Month","Day of Week","Device Name","Price"])
items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=not_review)
prices = soup.findAll('div', class_=match_both)
for item, price in zip(items, prices):
textcontent = u' '.join(price.stripped_strings)
if textcontent:
spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(item.string).encode('utf8').replace('â„¢','').replace('Â®','').strip(),textcontent])

Related

BeautifulSoup unable to get Inner tags

I am currently trying to scrape product data off lazada.sg using bs4 in the below code.
from bs4 import BeautifulSoup
import requests
url = "https://www.lazada.sg/shop-mobiles/"
page = requests.get(url)
content = page.text #read html
soup = BeautifulSoup(content, 'html.parser')
products = soup.find_all("div", {"class" : "c16H9d"}) #find div tags containing product details
with open("test.txt", 'w') as f:
f.write(str(products))
However the output in test.txt is just [].
I found that the above class is in <div id="root">, which I extracted and got this result.
How will I be able to access the 'inner div tags'?
Here is a snippet of the page source.

Data is dynamically loaded from script tag. You can regex out and use json library to view. You will need to adjust print line presumably for 2.7
import requests, re, json, pprint
r = requests.get('https://www.lazada.sg/shop-mobiles/')
p = re.compile(r'window.pageData=(.*)<')
data = json.loads(p.findall(r.text)[0])
for item in data['mods']['listItems']:
pprint.pprint(item)
break # delete me later

Scrapy: Return Each Item in a New CSV Row Using Item Loader

I am attempting to produce a csv output of select items contained in a particular class (title, link, price) that parses out each item in its own column, and each instance in its own row using itemloaders and the items module.
I can produce the output using a self-contained spider (without use of items module), however, I'm trying to learn the proper way of detailing the items in the items module, so that I can eventually scale up projects using the proper structure. (I will detail this code as 'Working Row Output Spider Code' below)
I have also attempted to incorporate solutions determined or discussed in related posts; in particular:
Writing Itemloader By Item to XML or CSV Using Scrapy posted by Sam
Scrapy Return Multiple Items posted by Zana Daniel
by using a for loop as he notes at the bottom of the comments section. However, I can get scrapy to accept the for loop, it just doesn't result in any change, that is the items are still grouped in single fields rather than being output into independent rows.
Below is a detail of the code contained in two project attempts --'Working Row Output Spider Code' that does not incorporate items module and items loader, and 'Non Working Row Output Spider Code'-- and the corresponding output of each.
Working Row Output Spider Code: btobasics.py
import scrapy
import urlparse
class BasicSpider(scrapy.Spider):
name = 'basic'
allowed_domains = ['http://http://books.toscrape.com/']
start_urls = ['http://books.toscrape.com//']
def parse(self, response):
titles = response.xpath('//*[#class="product_pod"]/h3//text()').extract()
links = response.xpath('//*[#class="product_pod"]/h3/a/#href').extract()
prices = response.xpath('//*[#class="product_pod"]/div[2]/p[1]/text()').extract()
for item in zip(titles, links, prices):
# create a dictionary to store the scraped info
scraped_info = {
'title': item[0],
'link': item[1],
'price': item[2],
}
# yield or give the scraped info to scrapy
yield scraped_info
Run Command to produce CSV: $ scrapy crawl basic -o output.csv
Working Row Output WITHOUT STRUCTURED ITEM LOADERS
Non Working Row Output Spider Code: btobasictwo.py
import datetime
import urlparse
import scrapy
from btobasictwo.items import BtobasictwoItem
from scrapy.loader.processors import MapCompose
from scrapy.loader import ItemLoader
class BasicSpider(scrapy.Spider):
name = 'basic'
allowed_domains = ['http://http://books.toscrape.com/']
start_urls = ['http://books.toscrape.com//']
def parse(self, response):
# Create the loader using the response
links = response.xpath('//*[#class="product_pod"]')
for link in links:
l = ItemLoader(item=BtobasictwoItem(), response=response)
# Load fields using XPath expressions
l.add_xpath('title', '//*[#class="product_pod"]/h3//text()',
MapCompose(unicode.strip))
l.add_xpath('link', '//*[#class="product_pod"]/h3/a/#href',
MapCompose(lambda i: urlparse.urljoin(response.url, i)))
l.add_xpath('price', '//*[#class="product_pod"]/div[2]/p[1]/text()',
MapCompose(unicode.strip))
# Log fields
l.add_value('url', response.url)
l.add_value('date', datetime.datetime.now())
return l.load_item()
Non Working Row Output Items Code: btobasictwo.items.py
from scrapy.item import Item, Field
class BtobasictwoItem(Item):
# Primary fields
title = Field()
link = Field()
price = Field()
# Log fields
url = Field()
date = Field()
Run Command to produce CSV: $ scrapy crawl basic -o output.csv
Non Working Row Code Output WITH STRUCTURED ITEM LOADERS
As you can see, when attempting to incorporate the items module, itemloaders and a for loop to structure the data, it does not seperate the instances by row, but rather puts all instances of a particular item (title, link, price) in 3 fields.
I would greatly appreciate any help on this, and apologize for the lengthy post. I just wanted to document as much as possible so that anyone wanting to assist could run the code themselves, and/or fully appreciate the problem from my documentation. (please leave a comment instructing on length of post if you feel it is not appropriate to be this lengthly).
Thanks very much

You need to tell your ItemLoader to use another selector:
def parse(self, response):
# Create the loader using the response
links = response.xpath('//*[#class="product_pod"]')
for link in links:
l = ItemLoader(item=BtobasictwoItem(), selector=link)
# Load fields using XPath expressions
l.add_xpath('title', './/h3//text()',
MapCompose(unicode.strip))
l.add_xpath('link', './/h3/a/#href',
MapCompose(lambda i: urlparse.urljoin(response.url, i)))
l.add_xpath('price', './/div[2]/p[1]/text()',
MapCompose(unicode.strip))
# Log fields
l.add_value('url', response.url)
l.add_value('date', datetime.datetime.now())
yield l.load_item()

How to get attribute values from variable in Python

So I'm doing a relatively simple project so I can teach myself Python. I've come to a point where I'm stuck. So I have a variable named element in pycharm debugger which shows as
This variable is type Tag, which is correct to me. In element I want to see if the class="schedule_dgrd_time/result"which is not the case in the above image.
I see that within element there is an attrs.
How can I access that value? If I do element.string I get the text value which in this case would be Sat.(...I could make that work), but I was wondering if I can check the class attribute value first.
I've been searching for this for a couple days now and just can't get it. I've googled myself to death at this point. Any help or pointers would be greatly appreciated. Thanks for reading.
Update
Here is my code
import urllib2
import datetime
import re
from bs4 import BeautifulSoup
# today's date
date = datetime.datetime.today().strftime('%-m/%d/%Y')
validDay = "Mon\.|Tue\.|Wed\.|Thu(r)?(s)?\.|Fri\."
website = "http://www.texassports.com/schedule.aspx?path=baseball"
opener = urllib2.build_opener()
##add headers that make it look like I'm a browser
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
page = opener.open(website)
# turn page into html object
soup = BeautifulSoup(page, 'html.parser')
#print soup.prettify()
#get all home games
all_rows = soup.find_all('tr', class_='schedule_home_tr')
# see if any game is today
# entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('.*({}).*'.format(date)))]
# hard coding for testing weekend
entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('3/11/2017'))]
time = "schedule_dgrd_time/result"
for elements in entryForToday:
for element in elements:
#this is where I'm stuck.
# if element.attrs:
# print element.attrs['class'][0]
I know a double nested for loop is not ideal so if you have a better way I'm glad to hear it. Thanks

So I was able to figure out. I have some NavigableString which doesn't have attrs so that was throwing an error. element.attrs['class'][0] does work now. I had to check if isinstanceOf a tag, if not it would skip it. Anywho, my code is below for anyone that is interested.
import urllib2
import datetime
import re
from bs4 import BeautifulSoup
from bs4 import Tag
# today's date
date = datetime.datetime.today().strftime('%-m/%d/%Y')
validDay = "Mon\.|Tue\.|Wed\.|Thu(r)?(s)?\.|Fri\."
website = "http://www.texassports.com/schedule.aspx?path=baseball"
opener = urllib2.build_opener()
##add headers that make it look like I'm a browser
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
page = opener.open(website)
# turn page into html object
soup = BeautifulSoup(page, 'html.parser')
#print soup.prettify()
#get all home games
all_rows = soup.find_all('tr', class_='schedule_home_tr')
# see if any game is today
# entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('.*({}).*'.format(date)))]
# hard coding for testing weekend
entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('3/14/2017'))]
classForTime = "schedule_dgrd_time/result"
timeOfGame = "none";
if entryForToday:
entryForToday = [t for t in entryForToday if t.findAll('td',
class_='schedule_dgrd_game_day_of_week',
text=re.compile('.*({}).*'.format(validDay)))]
if entryForToday:
for elements in entryForToday:
for element in elements:
if isinstance(element, Tag):
if element.attrs['class'][0] == classForTime:
timeOfGame = element.text
# print element.text
break
print timeOfGame

Python web crawler using BeautifulSoup, trouble getting URLs

so I am trying to build a dynamic web crawler to get all url links within links.
so far i am able to get all the links for Chapters, but when I trying to do section links from each chapter, my output does not print out anything.
the code i used :
#########################Chapters#######################
import requests
from bs4 import BeautifulSoup, SoupStrainer
import re
base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"
for title in range (1,4):
url = base_url.format(title=title)
r = requests.get(url)
for link in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')):
if link.has_attr('href'):
if 'chapt' in link['href']:
href = "http://law.justia.com" + link['href']
leveltwo(href)
#########################Sections#######################
def leveltwo(item_url):
r = requests.get(item_url)
soup = BeautifulSoup((r.content),"html.parser")
section = soup.find('div', {'class': 'primary-content' })
for sublinks in section.find_all('a'):
sectionlinks = sublinks.get('href')
print (sectionlinks)

With some minor modifications to your code, I was able to get it to run and output the sections. Mainly, you needed to fix your indentation, and define a function before you call it.
#########################Chapters#######################
import requests
from bs4 import BeautifulSoup, SoupStrainer
import re
def leveltwo(item_url):
r = requests.get(item_url)
soup = BeautifulSoup((r.content),"html.parser")
section = soup.find('div', {'class': 'primary-content' })
for sublinks in section.find_all('a'):
sectionlinks = sublinks.get('href')
print (sectionlinks)
base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"
for title in range (1,4):
url = base_url.format(title=title)
r = requests.get(url)
for link in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')):
try:
if 'chapt' in link['href']:
href = "http://law.justia.com" + link['href']
leveltwo(href)
else:
continue
except KeyError:
continue
#########################Sections#######################
output:
/codes/alabama/2015/title-3/chapter-1/section-3-1-1/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-2/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-3/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-4/index.html etc.

You don't need any try/except blocks, you can use href=True with find or find_all to only select the anchor tags with href's or a css select a[href] as below, the chapter links are in the first ul with inside the article tag with the id #maincontent so you don't need to filter at all:
base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"
import requests
from bs4 import BeautifulSoup
def leveltwo(item_url):
r = requests.get(item_url)
soup = BeautifulSoup(r.content, "html.parser")
section_links = [a["href"] for a in soup.select('div .primary-content a[href]')]
print (section_links)
for title in range(1, 4):
url = base_url.format(title=title)
r = requests.get(url)
for link in BeautifulSoup(r.content, "html.parser").select("#maincontent ul:nth-of-type(1) a[href]"):
href = "http://law.justia.com" + link['href']
leveltwo(href)
If you were to use find_all you simply need to pass find_all(.., href=True) to filter your anchor tags to only select ones that have hrefs.

Automating pulling csv files off google Trends

pyGTrends does not seem to work. Giving errors in Python.
pyGoogleTrendsCsvDownloader seems to work, logs in, but after getting 1-3 requests (per day!) complains about exhausted quota, even though manual download with the same login/IP works flawlessly.
Bottom line: neither work. Searching through stackoverflow: many questions from people trying to pull csv's from Google, but no workable solution I could find...
Thank you in advance: whoever will be able to help. How should the code be changed? Do you know of another solution that works?
Here's the code of pyGoogleTrendsCsvDownloader.py
import httplib
import urllib
import urllib2
import re
import csv
import lxml.etree as etree
import lxml.html as html
import traceback
import gzip
import random
import time
import sys
from cookielib import Cookie, CookieJar
from StringIO import StringIO
class pyGoogleTrendsCsvDownloader(object):
'''
Google Trends Downloader
Recommended usage:
from pyGoogleTrendsCsvDownloader import pyGoogleTrendsCsvDownloader
r = pyGoogleTrendsCsvDownloader(username, password)
r.get_csv(cat='0-958', geo='US-ME-500')
'''
def __init__(self, username, password):
'''
Provide login and password to be used to connect to Google Trends
All immutable system variables are also defined here
'''
# The amount of time (in secs) that the script should wait before making a request.
# This can be used to throttle the downloading speed to avoid hitting servers too hard.
# It is further randomized.
self.download_delay = 0.25
self.service = "trendspro"
self.url_service = "http://www.google.com/trends/"
self.url_download = self.url_service + "trendsReport?"
self.login_params = {}
# These headers are necessary, otherwise Google will flag the request at your account level
self.headers = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'),
("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
("Accept-Language", "en-gb,en;q=0.5"),
("Accept-Encoding", "gzip, deflate"),
("Connection", "keep-alive")]
self.url_login = 'https://accounts.google.com/ServiceLogin?service='+self.service+'&passive=1209600&continue='+self.url_service+'&followup='+self.url_service
self.url_authenticate = 'https://accounts.google.com/accounts/ServiceLoginAuth'
self.header_dictionary = {}
self._authenticate(username, password)
def _authenticate(self, username, password):
'''
Authenticate to Google:
1 - make a GET request to the Login webpage so we can get the login form
2 - make a POST request with email, password and login form input values
'''
# Make sure we get CSV results in English
ck = Cookie(version=0, name='I4SUserLocale', value='en_US', port=None, port_specified=False, domain='www.google.com', domain_specified=False,domain_initial_dot=False, path='/trends', path_specified=True, secure=False, expires=None, discard=False, comment=None, comment_url=None, rest=None)
self.cj = CookieJar()
self.cj.set_cookie(ck)
self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cj))
self.opener.addheaders = self.headers
# Get all of the login form input values
find_inputs = etree.XPath("//form[#id='gaia_loginform']//input")
try:
#
resp = self.opener.open(self.url_login)
if resp.info().get('Content-Encoding') == 'gzip':
buf = StringIO( resp.read())
f = gzip.GzipFile(fileobj=buf)
data = f.read()
else:
data = resp.read()
xmlTree = etree.fromstring(data, parser=html.HTMLParser(recover=True, remove_comments=True))
for input in find_inputs(xmlTree):
name = input.get('name')
if name:
name = name.encode('utf8')
value = input.get('value', '').encode('utf8')
self.login_params[name] = value
except:
print("Exception while parsing: %s\n" % traceback.format_exc())
self.login_params["Email"] = username
self.login_params["Passwd"] = password
params = urllib.urlencode(self.login_params)
self.opener.open(self.url_authenticate, params)
def get_csv(self, throttle=False, **kwargs):
'''
Download CSV reports
'''
# Randomized download delay
if throttle:
r = random.uniform(0.5 * self.download_delay, 1.5 * self.download_delay)
time.sleep(r)
params = {
'export': 1
}
params.update(kwargs)
params = urllib.urlencode(params)
r = self.opener.open(self.url_download + params)
# Make sure everything is working ;)
if not r.info().has_key('Content-Disposition'):
print "You've exceeded your quota. Continue tomorrow..."
sys.exit(0)
if r.info().get('Content-Encoding') == 'gzip':
buf = StringIO( r.read())
f = gzip.GzipFile(fileobj=buf)
data = f.read()
else:
data = r.read()
myFile = open('trends_%s.csv' % '_'.join(['%s-%s' % (key, value) for (key, value) in kwargs.items()]), 'w')
myFile.write(data)
myFile.close()

Although I don't know python, I may have a solution. I am currently doing the same thing in C# and though I didn't get the .csv file, I got created a custom URL through code and then downloaded that HTML and saved to a text file (also through code). In this HTML (at line 12) is all the information needed to create the graph that is used on Google Trends. However, this has alot of unnecessary text within it that needs to be cut down. But either way, you end up with the same result. The Google Trends data. I posted a more detailed answer to my question here:
Downloading .csv file from Google Trends

There is an alternative module named pytrends - https://pypi.org/project/pytrends/ It is really cool. I would recommend this.
Example usage:
import numpy as np
import pandas as pd
from pytrends.request import TrendReq
pytrend = TrendReq()
#It is the term that you want to search
pytrend.build_payload(kw_list=["Eminem is the Rap God"])
# Find which region has searched the term
df = pytrend.interest_by_region()
df.to_csv("path\Eminem_InterestbyRegion.csv")
Potentially if you have a list of terms to search you could make use of "for loop" to automate the insights as per your wish.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Issue in scraping data from a website using beautiful soup - python-2.7

Related

BeautifulSoup unable to get Inner tags

Scrapy: Return Each Item in a New CSV Row Using Item Loader

How to get attribute values from variable in Python

Python web crawler using BeautifulSoup, trouble getting URLs

Automating pulling csv files off google Trends

Categories

Resources