finding text in repeating tag - python-2.7

I'm trying to get specific text that is inside a span class on a web page. I can get the first instance, but I'm not sure how to iterate to get the one I need.
<div class="pricing-base__plan-pricing">
<div class="pricing-base__plan-price pricing-base__plan-price--annual">
<sup class="pricing-base__price-symbol">$</sup>
<span class="pricing-base__price-value">14</span></div>
<div class="pricing-base__plan-price pricing-base__plan-price--monthly">
<sup class="pricing-base__price-symbol">$</sup>
<span class="pricing-base__price-value">18</span>
</div>
<div class="pricing-base__term">
<div class="pricing-base__term-wrapper">
<div class="pricing-base__date">mo*</div>
</div>
I need to get the "18" in the line
<span class="pricing-base__price-value">18</span>
That number changes quite often, and that is what my code is looking to scrape.

You can use a class selector, as shown below, to retrieve a list of all the prices, then index into that list to get the annual and monthly values:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.gotomeeting.com/meeting/pricingc')
soup = bs(r.content, 'lxml')
prices = [item.text for item in soup.select('.pricing-base__price-value')]
monthly = prices[1]
annual = prices[0]
You could also add in parent classes:
monthly = soup.select_one('.pricing-base__plan-price--monthly .pricing-base__price-value').text
annual = soup.select_one('.pricing-base__plan-price--annual .pricing-base__price-value').text
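If you just want to experiment against the fragment posted in the question rather than the live page, a minimal sketch looks like this (the html string below is the markup copied from the question):
from bs4 import BeautifulSoup as bs

html = '''
<div class="pricing-base__plan-pricing">
<div class="pricing-base__plan-price pricing-base__plan-price--annual">
<sup class="pricing-base__price-symbol">$</sup>
<span class="pricing-base__price-value">14</span></div>
<div class="pricing-base__plan-price pricing-base__plan-price--monthly">
<sup class="pricing-base__price-symbol">$</sup>
<span class="pricing-base__price-value">18</span>
</div>
</div>
'''

soup = bs(html, 'html.parser')

# all price spans in document order -> ['14', '18']
prices = [item.text for item in soup.select('.pricing-base__price-value')]
print(prices)

# or scope to the monthly block so you don't rely on ordering -> 18
print(soup.select_one('.pricing-base__plan-price--monthly .pricing-base__price-value').text)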

Related

web scraping dynamic list

<div class="col col-1-1"><h2 class="heading">Flowers</h2><ul class="icon-list"> <li class="col col-1-2 no-gutter">
<svg class="icon icon--medium">
<use xlink:href="https://"></use>
</svg>
measure 1<span class="icon-list__count">81</span> </li>
<li class="col col-1-2 no-gutter">
<svg class="icon icon--medium">
<use xlink:href="https://"></use>
</svg>
measure 2 <span class="icon-list__count">52</span> </li>
<li class="col col-1-2 no-gutter">
<svg class="icon icon--medium">
<use xlink:href="https://"></use>
</svg>
measure 3<span class="icon-list__count">29</span> </li>
</ul></div>
This is one example of a list of measures for one type of flower. How can I scrape the value of each measure and store it in a Python dictionary? I hope the code can be flexible enough to allow for the possibility that another page might have measures 2 and 3 only, or measures 3 and 4 (a new measure not appearing on this page), or completely new measures 4 and 5.
I'm new to Python - I would appreciate any advice.
BeautifulSoup works best when you are scraping a mostly static website rather than a dynamic one.
Try using unique identifiers present in a tag to navigate this tree-like structure. The piece of code below will give you a dictionary with each measure name as key and its count as value.
from bs4 import BeautifulSoup
import re
html = '<div class="col col-1-1"><h2 class="heading">Flowers</h2><ul class="icon-list"><li class="col col-1-2 no-gutter"><svg class="icon icon--medium"><use xlink:href="https://"></use></svg>measure 1<span class="icon-list__count">81</span></li><li class="col col-1-2 no-gutter"><svg class="icon icon--medium"><use xlink:href="https://"></use></svg>measure 2 <span class="icon-list__count">52</span></li><li class="col col-1-2 no-gutter"><svg class="icon icon--medium"><use xlink:href="https://"></use></svg>measure 3<span class="icon-list__count">29</span></li></ul></div>'
soup = BeautifulSoup(html,'lxml')
li_tags = soup.find_all('li')  # their .text values: ['measure 181', 'measure 2 52', 'measure 329']
span_tags = soup.find_all('span', class_='icon-list__count')  # their .text values: ['81', '52', '29']
li_list = []
for li in li_tags:
    li_list.append(li.text)
measure_dict = {}
for i in range(len(li_list)):
    li_list[i] = re.sub(span_tags[i].text, '', li_list[i])  # turns 'measure 181' into 'measure 1', and likewise
    measure_dict[li_list[i]] = span_tags[i].text  # use int(span_tags[i].text) here if you want the values as integers
print(measure_dict)
#{'measure 1': '81', 'measure 2 ': '52', 'measure 3': '29'}
The code will stay flexible as long as the identifier I have used here, class='icon-list__count', is present on every page you access and still wraps the data you want to scrape. If it is not, you will have to traverse the HTML tags and identify your target data on your own.
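If you would rather not depend on re.sub to strip the count out of the <li> text, a minimal alternative sketch (re-parsing the same html string as above) pulls the name and the count out of each <li> separately; get_text and extract are standard BeautifulSoup calls:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')   # re-parse the same html string used above
measure_dict = {}
for li in soup.find_all('li'):
    span = li.find('span', class_='icon-list__count')
    if span is None:
        continue                     # skip any <li> without a count
    count = span.get_text(strip=True)
    span.extract()                   # drop the span so only the measure name is left in the <li>
    name = li.get_text(strip=True)
    measure_dict[name] = count
print(measure_dict)
# {'measure 1': '81', 'measure 2': '52', 'measure 3': '29'}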
If the part of the website you want to scrape is rendered with JavaScript, it is better to use Selenium, as it is a better tool for scraping dynamic websites.
Advice:
Reading the documentation of the module is far more helpful than watching random YT videos in the long run!
Try using the re module whenever you want to work with strings; in the long run it is often more flexible than the predefined string methods.

How do I consider an element's ancestor when parsing with BeautifulSoup?

I'm using Python 3.7, Django, and BeautifulSoup. I am currently looking for "span" elements in my document that contain the text "Review". I do so like this:
html = urllib2.urlopen(req, timeout=settings.SOCKET_TIMEOUT_IN_SECONDS).read()
my_soup = BeautifulSoup(html, features="html.parser")
rev_elts = my_soup.findAll("span", text=re.compile("Review"))
for rev_elt in rev_elts:
    ... processing
but I'd like to add a wrinkle: I don't want to consider those elements if they have a div ancestor with the class "child". So, for example, I don't want to match something like this:
<div class="child">
<p>
<span class="s">Reviews</span>
...
</p>
</div>
How can I adjust my search to take this into account?
If you are using BeautifulSoup 4.7+, it has improved CSS selector support. It handles many selectors up through CSS level 4 and a couple of custom ones like :contains(). In addition, it handles complex selectors inside pseudo-classes like :not(), something CSS level 4 was supposed to allow but which has since been pushed out to the level 5 selector draft.
So in this example we will use the custom :contains selector to search for spans which contain the text Review. In addition, we will say we don't want it to match div.child span.
from bs4 import BeautifulSoup
html = """
<div>
<p><span>Review: Let's find this</span></p>
</div>
<div class="child">
<p><span>Review: Do not want this</span></p>
</div>
"""
soup = BeautifulSoup(html, features="html.parser")
spans = soup.select('span:contains(Review):not(div.child span)')
print(spans)
Output
[<span>Review: Let's find this</span>]
Depending on your case, maybe :contains isn't robust enough. In that case, you can still do something similar. Soup Sieve is the underlying library included with Beautiful Soup 4.7+, and you can import it directly to filter your regular expression returns:
from bs4 import BeautifulSoup
import soupsieve as sv
import re
html = """
<div>
<p><span>Review: Let's find this</span></p>
</div>
<div class="child">
<p><span>Review: Do not want this</span></p>
</div>
"""
soup = BeautifulSoup(html, features="html.parser")
spans = soup.find_all("span", text=re.compile("Review"))
spans = sv.filter(':not(div.child span)', spans)
print(spans)
Output
[<span>Review: Let's find this</span>]
A CSS selector is the way to go in this case, as @facelessuser has answered. But just in case you are wondering, this can be done without using CSS selectors as well.
You can iterate over all of an element's parents with .parents. You can define a custom filter function which returns False if any of the parents has a class of "child", and True otherwise (in addition to all your other conditions).
from bs4 import BeautifulSoup, Tag
html="""
<div class="child">
<p><span id="1">Review</span></p>
</div>
<div>
<p><span id="2">Review</span></p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
def my_func(item):
    if isinstance(item, Tag) and item.name == 'span' and 'Review' in item.text:
        for parent in item.parents:
            if parent.has_attr('class'):
                if 'child' in parent.get('class'):
                    return False
        return True
my_spans = soup.find_all(my_func)
print(my_spans)
Outputs:
[<span id="2">Review</span>]

Scrapy error loop xpath

I have the following HTML structure:
<div id="mod_imoveis_result">
<a class="mod_res" href="#">
<div id="g-img-imo">
<div class="img_p_results">
<img src="/img/image.jpg">
</div>
</div>
</a>
</div>
This is a product results page, so there are 7 blocks per page with that mod_imoveis_result id. I need to get the image src from every block. Each page has 7 blocks like the one above.
I tried:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
class QuotesSpider(scrapy.Spider):
    name = "magichat"
    start_urls = ['https://magictest/results']

    def parse(self, response):
        for bimb in response.xpath('//div[@id="mod_imoveis_result"]'):
            yield {
                'img_url': bimb.xpath('//div[@id="g-img-imo"]/div[@class="img_p_results"]/img/@src').extract_first(),
                'text': bimb.css('#titulo_imovel::text').extract_first()
            }
        next_page = response.xpath('//a[contains(@class, "num_pages") and contains(@class, "pg_number_next")]/@href').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
I can't understand why the text target is OK but img_url gets the first result for every block on the page. Each page has 7 blocks, so there are 7 texts and 7 img_urls, but the img_url is the same for the other 6 blocks while the text is right. Why?
If I change extract_first to extract I get the other URLs, but they all come back in the same brackets. Example:
text: 1aaaa
img_url : a,b,c,d,e,f,g
but i need
text: 1aaaa
img_url: a
text: 2aaaa
img_url: b
What is wrong with that loop?
// selects from the root node, i.e. the whole document, so the inner xpath //div[@id="g-img-imo"]/... always matches the first <div id="g-img-imo"> on the page, no matter which block you are looping over. That is why every item gets the same img_url.
. selects the current node mentioned in your xpath, irrespective of how deep it is.
In your case xpath('.//div[@id="g-img-imo"]/div[@class="img_p_results"]/img/@src') makes the selection relative to the current block, i.e. it finds the node marked with the arrow inside that block:
<div id="mod_imoveis_result">
<a class="mod_res" href="#">
---> <div id="g-img-imo">
<div class="img_p_results">
<img src="/img/image.jpg">
</div>
</div>
</a>
</div>
I hope I made it clear.
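Putting that together, the loop from the question would look roughly like this (only the img_url xpath changes; the rest is kept as posted):
def parse(self, response):
    for bimb in response.xpath('//div[@id="mod_imoveis_result"]'):
        yield {
            # the leading "." makes the xpath relative to the current block
            'img_url': bimb.xpath('.//div[@id="g-img-imo"]/div[@class="img_p_results"]/img/@src').extract_first(),
            'text': bimb.css('#titulo_imovel::text').extract_first()
        }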
If each image wrapper has its own distinct class, as it does here with img_p_results, you can also target that div directly and extract the image URL:
//*[@class="img_p_results"]/img/@src
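Note that this selector is still absolute, so inside the per-block loop it would have the same first-match problem. One way to use it (a rough sketch, assuming the titles and images come back in the same page order) is to extract everything for the page in one go inside parse:
def parse(self, response):
    img_urls = response.xpath('//*[@class="img_p_results"]/img/@src').extract()
    titles = response.css('#titulo_imovel::text').extract()
    for text, img_url in zip(titles, img_urls):
        yield {'img_url': img_url, 'text': text}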

How do I scrape nested data using selenium and Python

I basically want to scrape Litigation Paralegal under <h3 class="Sans-17px-black-85%-semibold"> and Olswang under <span class="pv-entity__secondary-title Sans-15px-black-55%">, but I can't seem to get to it. Here's the HTML:
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
And here is what I've been doing at the moment with selenium in my code:
if tree.xpath('//*[@class="pv-entity__summary-info"]'):
    experience_title = tree.xpath('//*[@class="Sans-17px-black-85%-semibold"]/h3/text()')
    print(experience_title)
    experience_company = tree.xpath('//*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text()')
    print(experience_company)
My output:
Experience title : []
[]
Your XPath expressions are incorrect:
//*[@class="Sans-17px-black-85%-semibold"]/h3/text() means the text content of an h3 that is a child of an element with the class attribute "Sans-17px-black-85%-semibold". Instead you need
//h3[@class="Sans-17px-black-85%-semibold"]/text()
which means the text content of an h3 element with the class attribute "Sans-17px-black-85%-semibold".
In //*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text() you forgot a slash before text() (you need /text(), not just text()). Also, the target span has no class name pv-position-entity__secondary-title. You need to use
//span[@class="pv-entity__secondary-title Sans-15px-black-55%"]/text()
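To check the corrected expressions, you can run them against the HTML fragment from the question with lxml (a quick sketch; the fragment variable below is a trimmed copy of that HTML):
from lxml import html as lxml_html

# the <div class="pv-entity__summary-info"> fragment from the question
fragment = '''
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
</div>
'''
tree = lxml_html.fromstring(fragment)
print(tree.xpath('//h3[@class="Sans-17px-black-85%-semibold"]/text()'))
# ['Litigation Paralegal']
print(tree.xpath('//span[@class="pv-entity__secondary-title Sans-15px-black-55%"]/text()'))
# ['Olswang']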
You can get both of these easily with CSS selectors and I find them a lot easier to read and understand than XPath.
driver.find_element_by_css_selector("div.pv-entity__summary-info > h3").text
driver.find_element_by_css_selector("div.pv-entity__summary-info span.pv-entity__secondary-title").text
. indicates a class name
> indicates a child (one level below only)
A space between selectors indicates a descendant (any level below)
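If you want to sanity-check those two selectors without a live page, the same strings also work with BeautifulSoup's select_one (just a sketch against the HTML posted in the question, not the Selenium calls themselves):
from bs4 import BeautifulSoup

# the <div class="pv-entity__summary-info"> block from the question
html = '''
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('div.pv-entity__summary-info > h3').text)                            # Litigation Paralegal
print(soup.select_one('div.pv-entity__summary-info span.pv-entity__secondary-title').text) # Olswang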
Here are some references to get you started.
CSS Selectors Reference
CSS Selectors Tips
Advanced CSS Selectors

How to write this in regular expression in Python?

I have a big HTML file from which I need to parse some data using regular expressions. The first item is the name of the restaurant. The restaurant names are in this format:
Update:
<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"></head><body><div class="businessresult clearfix">
<div class="leftcol">
<div id="bizTitle0" class="itemheading">
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
</div>
<div class="itemcategories">
Categories: Italian, Seafood
</div>
<div class="itemneighborhoods">
Neighborhood: Marina/Cow Hollow
</div>
</div>
<div class="rightcol">
<div class="rating"><img src="yelp_listings_files/stars_map.html" alt="4 star rating" title="4 star rating" class="stars_4 " height="325" width="83"></div> <a class="reviews" href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco">270 reviews</a>
<address>
1809 Union St<br>San Francisco, CA 94123<br>
</address><div class="phone">
(415) 409-8001
</div>
</div>
There are 40 restaurants altogether. I think there are two spaces after the "." in the number. I need to list all of them, from 1 to 40. I have tried using:
re.findall("[./0-9]", string_Name)
It outputs only the numbers. I want to get the number and all the restaurant names. How can I do that?
The answer by Blender gives the number and the restaurant names together. That's fine, but I want the number and the restaurant name in separate variables.
Parse the HTML:
import re
from bs4 import BeautifulSoup
html = '''
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
<a href="https://courses.ischool.berkeley.edu/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5. Ristorante Parma
</a>
'''
soup = BeautifulSoup(html)
for link in soup.find_all('a', text=re.compile(r'^\d')):
    print link.get_text()
And the output:
1. Capannina
5. Ristorante Parma
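Since the follow-up asks for the number and the name in separate variables, one option (just a sketch built on the snippet above) is to split each link's text on the first dot:
for link in soup.find_all('a', text=re.compile(r'^\d')):
    text = link.get_text().strip()     # e.g. '1. Capannina'
    number, name = text.split('.', 1)
    print(number.strip())              # 1
    print(name.strip())                # Capannina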
You shouldn't run regexes on HTML directly (prefer an HTML parser first), but if you do, try this regex:
(\d+)\.\s+([^<]+)
(\d+) - one or more digits (captured)
\. - a literal dot
\s+ - one or more whitespace characters
([^<]+) - one or more characters that are not < (captured)
The presence of the brackets () creates a capture group. The contents of the capture group 1 will be the number. The contents of the capture group 2 will be the name.
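For example, run against the two anchors from the snippet above, a quick sketch of that regex (with strip() to drop the trailing newline captured by [^<]+) gives:
import re

html = '''
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
<a href="https://courses.ischool.berkeley.edu/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5. Ristorante Parma
</a>
'''

for number, name in re.findall(r'(\d+)\.\s+([^<]+)', html):
    print(number + ' ' + name.strip())
# 1 Capannina
# 5 Ristorante Parma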