How to get attribute values from a variable in Python - python-2.7

So I'm doing a relatively simple project so I can teach myself Python. I've come to a point where I'm stuck. I have a variable named element that the PyCharm debugger shows as type Tag, which seems correct to me.
Within element I want to check whether class="schedule_dgrd_time/result", which is not the case for the element currently shown in the debugger.
I can see that within element there is an attrs dictionary.
How can I access that value? If I do element.string I get the text value, which in this case would be Sat. (...I could make that work), but I was wondering if I can check the class attribute value first.
I've been searching for this for a couple days now and just can't get it. I've googled myself to death at this point. Any help or pointers would be greatly appreciated. Thanks for reading.
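For reference, a bs4 Tag keeps its attributes in the attrs dictionary, and get() is a safe way to read one. A minimal sketch of the check being described (the variable name element and the class string are taken from the question; this only works on Tag objects, not NavigableString):
classes = element.get('class')  # None if the tag has no class attribute
if classes and classes[0] == 'schedule_dgrd_time/result':
    print element.text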
Update
Here is my code
import urllib2
import datetime
import re
from bs4 import BeautifulSoup
# today's date
date = datetime.datetime.today().strftime('%-m/%d/%Y')
validDay = "Mon\.|Tue\.|Wed\.|Thu(r)?(s)?\.|Fri\."
website = "http://www.texassports.com/schedule.aspx?path=baseball"
opener = urllib2.build_opener()
##add headers that make it look like I'm a browser
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
page = opener.open(website)
# turn page into html object
soup = BeautifulSoup(page, 'html.parser')
#print soup.prettify()
#get all home games
all_rows = soup.find_all('tr', class_='schedule_home_tr')
# see if any game is today
# entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('.*({}).*'.format(date)))]
# hard coding for testing weekend
entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('3/11/2017'))]
time = "schedule_dgrd_time/result"
for elements in entryForToday:
    for element in elements:
        # this is where I'm stuck.
        # if element.attrs:
        #     print element.attrs['class'][0]
I know a doubly nested for loop is not ideal, so if you have a better way I'm glad to hear it. Thanks
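For what it's worth, one way to avoid the inner loop is to let BeautifulSoup do the filtering directly; a sketch, assuming the time cell is a td carrying that class (time_class is just an illustrative name):
time_class = "schedule_dgrd_time/result"
for row in entryForToday:
    cell = row.find('td', class_=time_class)  # None if the row has no such cell
    if cell is not None:
        print cell.text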

So I was able to figure it out. Some of the children are NavigableString objects, which don't have attrs, and that was throwing an error. element.attrs['class'][0] does work now; I just had to check isinstance(element, Tag) first and skip everything else. Anywho, my code is below for anyone who is interested.
import urllib2
import datetime
import re
from bs4 import BeautifulSoup
from bs4 import Tag
# today's date
date = datetime.datetime.today().strftime('%-m/%d/%Y')
validDay = "Mon\.|Tue\.|Wed\.|Thu(r)?(s)?\.|Fri\."
website = "http://www.texassports.com/schedule.aspx?path=baseball"
opener = urllib2.build_opener()
##add headers that make it look like I'm a browser
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
page = opener.open(website)
# turn page into html object
soup = BeautifulSoup(page, 'html.parser')
#print soup.prettify()
#get all home games
all_rows = soup.find_all('tr', class_='schedule_home_tr')
# see if any game is today
# entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('.*({}).*'.format(date)))]
# hard coding for testing weekend
entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('3/14/2017'))]
classForTime = "schedule_dgrd_time/result"
timeOfGame = "none";
if entryForToday:
entryForToday = [t for t in entryForToday if t.findAll('td',
class_='schedule_dgrd_game_day_of_week',
text=re.compile('.*({}).*'.format(validDay)))]
if entryForToday:
for elements in entryForToday:
for element in elements:
if isinstance(element, Tag):
if element.attrs['class'][0] == classForTime:
timeOfGame = element.text
# print element.text
break
print timeOfGame

Related

Webscraping is buggy through AWS Lambda, but works fine in VS Code and on EC2 instance

My dependencies are fine, Lambda doesn't raise any errors, and the code runs smoothly. I also checked memory (512MB) and timeout (5 mins). But instead of a list of HTML divs I'm getting a list of empty lists. Interestingly, there are quite a few nested lists, so the count might even match the number of divs I'm trying to scrape; they're just completely empty.
import requests
from bs4 import BeautifulSoup
def lambda_handler(event, context):
    url3='https://www.szybko.pl/l/na-sprzedaz/lokal-mieszkalny/Wroc%C5%82aw?assetType=lokal-mieszkalny&localization_search_text=Wroc%C5%82aw&market=aftermarket&price_min_sell=200000&price_max_sell=400000&meters_min=30&rooms_min=2'

    def get_last_page3(url):
        result = requests.get(url)
        source = result.content
        soup = BeautifulSoup(source, 'html.parser')
        last_page = soup.find_all("li",{'class': 'blank'})[1].text
        return int(last_page)

    def get_list_of_soups3(url):
        list_of_soups=[]
        for page in range(1,get_last_page3(url)+1):
            try:
                result = requests.get(url+'&strona='+str(page))
                source = result.content
                soup = BeautifulSoup(source, 'html.parser')
                ads = soup.find_all("div",{'class': "gt-listing-item-asset listing-item"})
                list_of_soups.append(ads)
            except Exception as e:
                print(e)
                break
        return list_of_soups

    all_ads3 = []
    try:
        for soup in get_list_of_soups3(url3):
            for s in soup:
                name = s.find("a")['aria-label'].replace('Szczegóły ogłoszenia - ','')
                district = s.find("a",{'class': 'mapClassClick list-elem-address popup-gmaps'}).text.replace('\n','').replace(' ','').replace(', dolnośląskie','')
                price = s.find("span",{'class': 'listing-price'}).text.strip().replace(' zł','').replace(' ','')[:6]
                rooms = s.find("li",{'class': 'asset-feature rooms'}).text.replace(' ','')
                sq = s.find("li",{'class': 'asset-feature area'}).text.replace('m²','').replace(',','.')
                price_sq = s.find("span",{'class': 'listing-price'}).find('i').text.replace('zł/m²','').replace(' ','').strip()
                link = s.find('a')['href'].strip()
                ad=[name,district,int(price),int(rooms),round(float(sq)),int(price_sq),link]
                all_ads3.append(ad)
    except Exception as e:
        print('error: website changed or unresponsive',e)
    return get_list_of_soups3(url3)
Also, similar code scraping a similar website works perfectly fine from both the IDE and Lambda, and both Lambdas are configured in the same way. I'm using Python with the requests and Beautiful Soup libraries.
I was able to solve this by changing the HTML class of the divs scraped in the second function. I achieved this with print-statement debugging.
I'm not sure what the reason is; my guess would be that maybe Lambda couldn't handle a photo thumbnail that was included in the original div? Maybe it has something to do with the way ads are generated on this particular website?
The code below also includes my print statements in comments and has the try/except removed. The crucial change is in line 29: ads = soup.find_all("div",{'class': "listing-content"})
import requests
from bs4 import BeautifulSoup
def lambda_handler(event, context):
    # # Scraping url3: szybko.pl
    url3='https://www.szybko.pl/l/na-sprzedaz/lokal-mieszkalny/Wroc%C5%82aw?assetType=lokal-mieszkalny&localization_search_text=Wroc%C5%82aw&market=aftermarket&price_min_sell=200000&price_max_sell=400000&meters_min=30&rooms_min=2'

    def get_last_page3(url):
        result = requests.get(url)
        source = result.content
        #print('SOURCE:',source)
        soup = BeautifulSoup(source, 'html.parser')
        last_page = soup.find_all("li",{'class': 'blank'})[1].text
        print('PAGE:',last_page)
        return int(last_page)

    def get_list_of_soups3(url):
        list_of_soups=[]
        for page in range(1,get_last_page3(url)+1):
            try:
                result = requests.get(url+'&strona='+str(page))
                #print('RESULT:',result)
                source = result.content
                soup = BeautifulSoup(source, 'html.parser')
                #print('SOUP:',soup) #it's fine
                ads = soup.find_all("div",{'class': "listing-content"})
                #print('ADS:',ads)
                list_of_soups.append(ads)
            except Exception as e:
                print(e)
                break
        return list_of_soups

    all_ads3 = []
    for soup in get_list_of_soups3(url3):
        for s in soup:
            name = s.find("a",{'class': 'listing-title-heading hide-overflow-text'}).find("div",{'class': "tooltip"}).text  #.replace('Szczegóły ogłoszenia - ','')
            district = s.find("a",{'class': 'mapClassClick list-elem-address popup-gmaps'}).text.replace('\n','').replace(' ','').replace(', dolnośląskie','').strip()
            price = s.find("div",{'class': 'listing-title'}).find_all("span")[2]['content']  #.text.strip().replace(' zł','').replace(' ','')[:6]
            rooms = s.find("li",{'class': 'asset-feature rooms'}).text.replace(' ','')
            sq = s.find("li",{'class': 'asset-feature area'}).text.replace('m²','').replace(',','.')
            price_sq = int(price)/round(float(sq))
            link = s.find('a')['href'].strip()
            ad=[name,district,int(price),int(rooms),round(float(sq)),int(price_sq),link]
            all_ads3.append(ad)
    return len(all_ads3)
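A quick way to sanity-check the scraper outside Lambda is to call the handler directly; a minimal sketch (the event and context arguments are unused by this handler, so None is fine):
if __name__ == '__main__':
    print(lambda_handler(None, None))  # prints the number of ads collected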

Spider won't run after updating Scrapy

As seems to frequently happen here, I am quite new to Python 2.7 and Scrapy. Our project has us scraping website data, following some links, doing more scraping, and so on. This was all working fine. Then I updated Scrapy.
Now when I launch my spider, I get the following message:
This wasn't coming up anywhere previously (none of my prior error messages looked anything like this). I am now running Scrapy 1.1.0 on Python 2.7, and none of the spiders that had previously worked on this project are working.
I can provide some example code if need be, but my (admittedly limited) knowledge of Python suggests to me that it's not even getting to my script before bombing out.
EDIT:
OK, so this code is supposed to start at the first authors page for Deakin University academics on The Conversation, and go through and scrape how many articles they have written and comments they have made.
import scrapy
from ltuconver.items import ConversationItem
from ltuconver.items import WebsitesItem
from ltuconver.items import PersonItem
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
import bs4
class ConversationSpider(scrapy.Spider):
    name = "urls"
    allowed_domains = ["theconversation.com"]
    start_urls = [
        'http://theconversation.com/institutions/deakin-university/authors']

    # URL grabber
    def parse(self, response):
        requests = []
        people = Selector(response).xpath('//*[@id="experts"]/ul[*]/li[*]')
        for person in people:
            item = WebsitesItem()
            item['url'] = 'http://theconversation.com/'+str(person.xpath('a/@href').extract())[4:-2]
            self.logger.info('parseURL = %s',item['url'])
            requests.append(Request(url=item['url'], callback=self.parseMainPage))
        soup = bs4.BeautifulSoup(response.body, 'html.parser')
        try:
            nexturl = 'https://theconversation.com'+soup.find('span',class_='next').find('a')['href']
            requests.append(Request(url=nexturl))
        except:
            pass
        return requests

    # go to the URLs and grab the info
    def parseMainPage(self, response):
        person = Selector(response)
        item = PersonItem()
        item['name'] = str(person.xpath('//*[@id="outer"]/header/div/div[2]/h1/text()').extract())[3:-2]
        item['occupation'] = str(person.xpath('//*[@id="outer"]/div/div[1]/div[1]/text()').extract())[11:-15]
        item['art_count'] = int(str(person.xpath('//*[@id="outer"]/header/div/div[3]/a[1]/h2/text()').extract())[3:-3])
        item['com_count'] = int(str(person.xpath('//*[@id="outer"]/header/div/div[3]/a[2]/h2/text()').extract())[3:-3])
And in my Settings, I have:
BOT_NAME = 'ltuconver'
SPIDER_MODULES = ['ltuconver.spiders']
NEWSPIDER_MODULE = 'ltuconver.spiders'
DEPTH_LIMIT=1
Apparently my six.py file was corrupt (or something like that). After swapping it out with the same file from a colleague, it started working again 8-\
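If anyone hits something similar, a quick diagnostic (not part of the original fix) is to check which six module the interpreter actually imports and what version it reports; a stale or corrupt copy shows up in the path:
import six
print six.__version__  # version of the six that Scrapy will import
print six.__file__     # path to the module actually being loaded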

Python web crawler using BeautifulSoup, trouble getting URLs

So I am trying to build a dynamic web crawler to get all the URL links within links.
So far I am able to get all the links for chapters, but when I try to get the section links from each chapter, my output does not print out anything.
The code I used:
#########################Chapters#######################
import requests
from bs4 import BeautifulSoup, SoupStrainer
import re
base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"
for title in range (1,4):
    url = base_url.format(title=title)
    r = requests.get(url)
    for link in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            if 'chapt' in link['href']:
                href = "http://law.justia.com" + link['href']
                leveltwo(href)

#########################Sections#######################
def leveltwo(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup((r.content),"html.parser")
    section = soup.find('div', {'class': 'primary-content' })
    for sublinks in section.find_all('a'):
        sectionlinks = sublinks.get('href')
        print (sectionlinks)
With some minor modifications to your code, I was able to get it to run and output the sections. Mainly, you needed to fix your indentation, and define a function before you call it.
#########################Chapters#######################
import requests
from bs4 import BeautifulSoup, SoupStrainer
import re
def leveltwo(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup((r.content),"html.parser")
    section = soup.find('div', {'class': 'primary-content' })
    for sublinks in section.find_all('a'):
        sectionlinks = sublinks.get('href')
        print (sectionlinks)

base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"
for title in range (1,4):
    url = base_url.format(title=title)
    r = requests.get(url)
    for link in BeautifulSoup((r.content),"html.parser",parse_only=SoupStrainer('a')):
        try:
            if 'chapt' in link['href']:
                href = "http://law.justia.com" + link['href']
                leveltwo(href)
            else:
                continue
        except KeyError:
            continue
#########################Sections#######################
output:
/codes/alabama/2015/title-3/chapter-1/section-3-1-1/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-2/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-3/index.html
/codes/alabama/2015/title-3/chapter-1/section-3-1-4/index.html etc.
You don't need any try/except blocks: you can use href=True with find or find_all to select only the anchor tags that have an href, or a CSS select a[href] as below. The chapter links are in the first ul inside the article tag with the id #maincontent, so you don't need to filter at all:
base_url = "http://law.justia.com/codes/alabama/2015/title-{title:01d}/"
import requests
from bs4 import BeautifulSoup
def leveltwo(item_url):
    r = requests.get(item_url)
    soup = BeautifulSoup(r.content, "html.parser")
    section_links = [a["href"] for a in soup.select('div .primary-content a[href]')]
    print (section_links)

for title in range(1, 4):
    url = base_url.format(title=title)
    r = requests.get(url)
    for link in BeautifulSoup(r.content, "html.parser").select("#maincontent ul:nth-of-type(1) a[href]"):
        href = "http://law.justia.com" + link['href']
        leveltwo(href)
If you were to use find_all you simply need to pass find_all(.., href=True) to filter your anchor tags to only select ones that have hrefs.
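For example, the outer loop above could equally be written with find_all (a sketch equivalent to the CSS select, keeping the 'chapt' substring check from the original code):
for title in range(1, 4):
    url = base_url.format(title=title)
    r = requests.get(url)
    for link in BeautifulSoup(r.content, "html.parser").find_all("a", href=True):
        if 'chapt' in link['href']:
            leveltwo("http://law.justia.com" + link['href'])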

Using Python and Mechanize to submit data in the website's html

I have this website and there are four input boxes: Symbol, Expiry Date, From, and To. I have written code to scrape data from Symbol and Expiry Date, which looks like this:
import requests
import json
from bs4 import BeautifulSoup
r = requests.get("http://www.mcxindia.com/sitepages/BhavCopyCommodityWise.aspx")
soup = BeautifulSoup(r.content)
pop = []
pop_dates = []
count = 0
print soup.prettify()
option_list = soup.findAll("option")
#print option_list
for value in option_list:
    #print value
    if value.find(text = True):
        text = ''.join(value.find(text = True))
        text1 = text.encode('ascii')
        if count < 32:
            pop.append(text1)
        while count == 32 or count > 32:
            pop_dates.append(text1)
            break
        count = count + 1
print pop
print pop_dates
What I want to do is supply the dates for From and To from my code, have the site take that input, use it in the website's HTML, and give the output as it normally would on that website. How can I do this? I heard mechanize can do this sort of thing, but how could I use mechanize in this case?
You can try out something like this:
from mechanize import Browser
from bs4 import BeautifulSoup
br = Browser()
br.set_handle_robots( False )
br.addheaders = [('User-agent', 'Firefox')]
br.open("http://www.mcxindia.com/sitepages/BhavCopyCommodityWise.aspx")
br.select_form("form1")
#now enter the dates according to your choice
br.form["mTbFromDate"] = "date-From"
br.form["mTbFromDate"] = "date-To"
response = br.submit()
#now read the response with BeautifulSoup and do whatever you want
soup = BeautifulSoup(response.read())
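From there the response parses like any other page; for instance, to just dump whatever table rows come back after submitting (the tag choice here is purely illustrative, the actual markup may differ):
for row in soup.find_all('tr'):
    print row.get_text(strip=True)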

Issue in scraping data from a website using beautiful soup

I am trying to scrape a list of 41 items and their prices from a website, but my output CSV is missing some 2-3 items that come at the end of the page. The reason is that some devices have their price mentioned in a different class than the rest of the devices.
The loop in my code runs over name and price together, and for items whose price is mentioned under a different class it picks up the price value of the next device. Hence it skips the last 2-3 items, since the prices for those devices have already been consumed for previous devices.
Below is the code in question:
# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.deviceListGridView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?taxoStyle=SMARTPHONES&showMoreListSize=1000').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('AT&T_2012-12-28.csv', 'wb') as csvfile:
spamwriter = csv.writer(csvfile, delimiter=',')
spamwriter.writerow(["Date","Month","Day of Week","Device Name","Price"])
items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=True)
prices = soup.findAll('div', {"class": "listGrid-price"})
for item, price in zip(items, prices):
textcontent = u' '.join(price.stripped_strings)
if textcontent:
spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(item.string).encode('utf8').replace('™','').replace('®','').strip(),textcontent])
Prices are usually mentioned under listGrid-price, but for the 2-3 items that are out of stock at the moment the price is under listGrid-price-outOfStock. I need to include this class as well, so that the right price is paired with each item and the loop runs for all the devices.
Please pardon my ignorance, as I am new to programming.
You can use a comparator function to make a custom comparison and pass it to your findAll().
So if you modify your prices assignment line to:
prices = soup.findAll('div', class_=match_both)
and define the function as:
def match_both(arg):
    if arg == "listGrid-price" or arg == "listGrid-price-outOfStock":
        return True
    return False
(function can be made much more concise, verbosity here just to give you an idea of how it works)
it will thus compare to both and return a match in any of the cases.
More info can be found in the documentation (the has_six_characters variant).
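As a side note, findAll() also accepts a list of strings as an attribute filter, which would match either class without a helper function (a shorter sketch of the same idea):
prices = soup.findAll('div', class_=["listGrid-price", "listGrid-price-outOfStock"])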
Now, since you also asked how to exclude particular text: the text argument to findAll() can also take custom comparators.
In this case, you don't want the text saying Write a review to match and cause a shift between prices and items.
Hence your edited script, with the review part excluded:
# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
def match_both(arg):
    if arg == "listGrid-price" or arg == "listGrid-price-outOfStock":
        return True
    return False

def not_review(arg):
    if not arg:
        return arg
    return "Write a review" not in arg
page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.deviceListGridView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?taxoStyle=SMARTPHONES&showMoreListSize=1000').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('AT&T_2012-12-28.csv', 'wb') as csvfile:
spamwriter = csv.writer(csvfile, delimiter=',')
spamwriter.writerow(["Date","Month","Day of Week","Device Name","Price"])
items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=not_review)
prices = soup.findAll('div', class_=match_both)
for item, price in zip(items, prices):
textcontent = u' '.join(price.stripped_strings)
if textcontent:
spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(item.string).encode('utf8').replace('™','').replace('®','').strip(),textcontent])