How do I scrape nested data using selenium and Python

I basically want to scrape Litigation Paralegal under <h3 class="Sans-17px-black-85%-semibold"> and Olswang under <span class="pv-entity__secondary-title Sans-15px-black-55%">, but I can't seem to get to them. Here's the HTML:
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
And here is what I've been doing at the moment with selenium in my code:
if tree.xpath('//*[@class="pv-entity__summary-info"]'):
    experience_title = tree.xpath('//*[@class="Sans-17px-black-85%-semibold"]/h3/text()')
    print(experience_title)
    experience_company = tree.xpath('//*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text()')
    print(experience_company)
My output:
Experience title : []
[]

Your XPath expressions are incorrect.
//*[@class="Sans-17px-black-85%-semibold"]/h3/text() selects the text of an h3 that is a child of an element whose class attribute is "Sans-17px-black-85%-semibold". The class is on the h3 itself, so you need
//h3[@class="Sans-17px-black-85%-semibold"]/text()
which selects the text of the h3 element whose class attribute is "Sans-17px-black-85%-semibold".
In //*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text() you forgot the slash before text() (it must be /text(), not text()). The target span also has no pv-position-entity__secondary-title class. Use
//span[@class="pv-entity__secondary-title Sans-15px-black-55%"]/text()
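The corrected expressions can be checked offline without a browser. This is only a sketch: it uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made so it runs anywhere), and ElementTree supports just a subset of XPath, so the element is selected and its .text is read in place of the /text() step. The snippet is a trimmed copy of the HTML from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the HTML from the question (well-formed, so an XML parser accepts it).
snippet = """
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
</div>
"""

root = ET.fromstring(snippet)

# ElementTree has no text() step, so select the element and read .text instead.
title = root.find(".//h3[@class='Sans-17px-black-85%-semibold']").text
company = root.find(".//span[@class='pv-entity__secondary-title Sans-15px-black-55%']").text

# The pattern from the question looks for an h3 *below* the classed element and finds none.
wrong = root.findall(".//*[@class='Sans-17px-black-85%-semibold']/h3")

print(title)
print(company)
print(wrong)
```

With lxml the same predicates apply, and /text() can be appended as in the expressions above.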

You can get both of these easily with CSS selectors and I find them a lot easier to read and understand than XPath.
driver.find_element_by_css_selector("div.pv-entity__summary-info > h3").text
driver.find_element_by_css_selector("div.pv-entity__summary-info span.pv-entity__secondary-title").text
. indicates a class name
> indicates a child (one level below only)
A space indicates a descendant (any number of levels below)
Here are some references to get you started.
CSS Selectors Reference
CSS Selectors Tips
Advanced CSS Selectors

Related

web scraping dynamic list

<div class="col col-1-1"><h2 class="heading">Flowers</h2><ul class="icon-list"> <li class="col col-1-2 no-gutter">
<svg class="icon icon--medium">
<use xlink:href="https://"></use>
</svg>
measure 1<span class="icon-list__count">81</span> </li>
<li class="col col-1-2 no-gutter">
<svg class="icon icon--medium">
<use xlink:href="https://"></use>
</svg>
measure 2 <span class="icon-list__count">52</span> </li>
<li class="col col-1-2 no-gutter">
<svg class="icon icon--medium">
<use xlink:href="https://"></use>
</svg>
measure 3<span class="icon-list__count">29</span> </li>
</ul></div>
This is one example of a list of measures for one type of flower. How can I scrape the values of the measures and store them in a Python dictionary? The code should be flexible enough to handle another page that has only measures 2 and 3, or measures 3 and 4 (a new measure that does not appear on this page), or completely new measures 4 and 5.
I'm new to Python and would appreciate any advice.

BeautifulSoup works well when the page you are scraping is mostly static rather than dynamic.
Use the unique identifiers present in the tags to navigate the tree-like structure. The code below builds a dictionary with each measure name as the key and its count as the value.
from bs4 import BeautifulSoup
import re
html = '<div class="col col-1-1"><h2 class="heading">Flowers</h2><ul class="icon-list"><li class="col col-1-2 no-gutter"><svg class="icon icon--medium"><use xlink:href="https://"></use></svg>measure 1<span class="icon-list__count">81</span></li><li class="col col-1-2 no-gutter"><svg class="icon icon--medium"><use xlink:href="https://"></use></svg>measure 2 <span class="icon-list__count">52</span></li><li class="col col-1-2 no-gutter"><svg class="icon icon--medium"><use xlink:href="https://"></use></svg>measure 3<span class="icon-list__count">29</span></li></ul></div>'
soup = BeautifulSoup(html, 'lxml')
li_tags = soup.find_all('li')  # texts: 'measure 181', 'measure 2 52', 'measure 329'
span_tags = soup.find_all('span', class_='icon-list__count')  # texts: '81', '52', '29'
measure_dict = {}
for li, span in zip(li_tags, span_tags):
    # Strip the trailing count from the <li> text: 'measure 181' -> 'measure 1'.
    # re.escape keeps the count from being read as a regex pattern.
    name = re.sub(re.escape(span.text) + r'\s*$', '', li.text).strip()
    measure_dict[name] = span.text  # use int(span.text) if you want integer values
print(measure_dict)
# {'measure 1': '81', 'measure 2': '52', 'measure 3': '29'}
The code stays flexible as long as the identifier used here, class="icon-list__count", is present on every page you access and holds the data you want to scrape. If it is not, you will have to walk the HTML tags yourself to find the data.
If the part of the site you want to scrape is rendered with JavaScript, Selenium is the better tool, since it handles dynamic websites.
Advice:
Reading the module's documentation is far more helpful in the long run than watching random YouTube videos.
Try the re module when you need to manipulate strings; it is more flexible than the built-in string methods.

What is AttributeError: object has no attribute 'w3c'?

I am trying to perform drag and drop with the Python WebDriver bindings, but without success. I have tried the simple drag-and-drop API, drag and drop by offset, and action chains; nothing worked for me. A few people mention that it has worked for them. Could someone please guide me here?
from selenium.webdriver.common.action_chains import ActionChains
def test_drag_and_drop(self):
    source = self.find_elements("xpath=xpath_of_source")
    destination = self.find_elements("id=id_of_destination")
    ActionChains(self).drag_and_drop(source, destination).perform()
    return(self)
Getting error : AttributeError: object has no attribute 'w3c'?
Draggable part HTML Code:
<div id="textBox" class="whiteBox textBox" style="height:160px;width:100%;">
<span style="padding-top:4px;padding-bottom:4px;clear:both;float:left;" _attr="constant" _type="textName">
<div class="simpleClass" contenteditable="false" dontcancelselect="true" onselectstart="GetBrowser().allowDrag(event, this)" draggable="true">
Text1
<img class="textBox_icon" contenteditable="false" src="img/text_box.gif" style="display:none">
</div>
</span>
<span> same for Text2 </span>
<span> same for Text3 </span>
Droppable part HTML Code :
<div id="messageDiv" class="contentEditableOuterContainer multiLine" style="position:relative">
<pre id="messagearea" class="contentEditableContainer multiLine inputpre" contenteditable="true" spellcheck="false">
Source : xpath=//div[@id='textBox']//div[contains(text(),'Text1')]
Destination : id=messagearea
We can drag "Text1" to droppable area as many times.

How can I use non-ASCII characters?

I am using Scrapy and XPath to parse a web site in Russian.
In this topic, alecxe showed me how to construct the XPath expression to get the values. However, I don't understand how to handle the case where Param1_name is in Russian.
Here is the xpath expression:
//*[text()="Param1_name_in_russian"]/following-sibling::text()
Html snippet:
<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Param1_name_in_russian</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>
EDITED based on comments
I assume I didn't state the question properly, since none of the suggested solutions worked for me: when I tested the suggested XPath expressions in the Scrapy console, the output was empty. So here is more detailed information about the web site I need to parse:
link to the web-site: link to real-estate web site
screenshot of what I need to parse:

Declare the source file's encoding at the top of the file, and make sure it matches the encoding the file is actually saved in. For Cyrillic text that means utf-8; latin-1 cannot represent those characters. See the documentation on source encoding declarations for a thorough explanation.
I'll be using lxml instead of Scrapy below, but the logic is the same.
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import html
markup = u"""<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Некий текст</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>"""
tree = html.fromstring(markup)
pone_val = tree.xpath(u"//*[text()='Некий текст']/following-sibling::text()")
print pone_val
Result:
['" Param1_value"']
[Finished in 0.5s]
Note that since this is a unicode string, the u at the beginning of the XPath is necessary, as in @warwaruk's comment on your question.
Let us know if this helps.
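For comparison, here is a minimal sketch of the same lookup in Python 3, where every str is already unicode and no u prefix or encoding declaration is needed. It assumes the standard library's xml.etree.ElementTree rather than lxml or Scrapy, and a hypothetical well-formed excerpt modelled on the markup above. ElementTree has no following-sibling axis, so the text after the label is read from the element's .tail.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed excerpt modelled on the markup above.
markup = """
<div class="obj-params-col">
<p><b>Некий текст</b> Param1_value</p>
<p><strong>Param2_name</strong> Param2_value</p>
</div>
"""

root = ET.fromstring(markup)

# [.='...'] matches on the element's full text (Python 3.7+).
label = root.find(".//b[.='Некий текст']")

# ElementTree has no following-sibling::text(); the text after </b> is in .tail.
value = label.tail.strip()
print(value)
```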
EDIT:
Based on the site's markup, there's actually a better way to get the values. Again, using lxml and not Scrapy since the difference between the two here is just .extract() anyway. Basically, check my XPath for the name, room, square, and floor.
import requests as rq
from lxml import html
url = "http://www.lun.ua/%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-%D0%BA%D0%B8%D0%B5%D0%B2"
r = rq.get(url)
tree = html.fromstring(r.text)
divs = tree.xpath("//div[@class='obj-left']")
for div in divs:
    name = div.xpath("./h3/span/a/text()")[0]
    details = div.xpath(".//div[@class='obj-params-col'][1]")[0]
    room = details.xpath("./p[1]/text()[last()]")[0]
    square = details.xpath("./p[2]/text()[last()]")[0]
    floor = details.xpath("./p[3]/text()[last()]")[0]
    print name.encode("utf-8")
    print room.encode("utf-8")
    print square.encode("utf-8")
    print floor.encode("utf-8")
This doesn't print them out all well on my end (getting some [Decode error - output not utf-8]). However, I believe that encoding aside, using this approach is much better scraping practice overall.
Let us know what you think.

How to write this in regular expression in Python?

I have a big HTML file from which I need to parse some data using Regular expression. The first is the name of restaurant. Hotel names are in this format:
Update:
<html><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8"></head><body><div class="businessresult clearfix">
<div class="leftcol">
<div id="bizTitle0" class="itemheading">
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
</div>
<div class="itemcategories">
Categories: Italian, Seafood
</div>
<div class="itemneighborhoods">
Neighborhood: Marina/Cow Hollow
</div>
</div>
<div class="rightcol">
<div class="rating"><img src="yelp_listings_files/stars_map.html" alt="4 star rating" title="4 star rating" class="stars_4 " height="325" width="83"></div> <a class="reviews" href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco">270 reviews</a>
<address>
1809 Union St<br>San Francisco, CA 94123<br>
</address><div class="phone">
(415) 409-8001
</div>
</div>
There are 40 hotels altogether. I think there are two spaces after the . following the number. I need to list all the hotels from 1 to 40. I have tried using:
re.findall("[./0-9]", string_Name)
It outputs only the numbers. I want to get the number and all the hotel names. How can I do that?
The answer by Blender gives the rating and the restaurant list. That's fine, but I want the rating and the restaurant name in different variables.

Parse the HTML:
import re
from bs4 import BeautifulSoup
html = '''
<a href="https://courses.ischool.berkeley.edu/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
<a href="https://courses.ischool.berkeley.edu/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5. Ristorante Parma
</a>
'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
for link in soup.find_all('a', text=re.compile(r'^\d')):
    print link.get_text()
And the output:
1. Capannina
5. Ristorante Parma

You shouldn't run regexes on HTML directly (prefer an HTML parser first), but try this regex:
(\d+)\.\s+([^<]+)
It matches, in order:
(\d+) one or more digits
\. a literal dot
\s+ one or more whitespace characters
([^<]+) one or more characters other than <
Each pair of parentheses () creates a capture group. Capture group 1 holds the number and capture group 2 holds the name.
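As a rough sketch of how the two capture groups end up in separate variables, the snippet below runs the regex with re.findall on a hypothetical two-entry excerpt shaped like the listing HTML (the shortened href values are assumptions). On the full page the pattern could also match other "digits, dot, whitespace" text, so treat this as an illustration rather than a complete solution.

```python
import re

# Hypothetical two-entry excerpt in the same shape as the listing HTML.
page = '''<a href="/biz/capannina-san-francisco" id="bizTitleLink0">1. Capannina
</a>
<a href="/biz/ristorante-parma-san-francisco" id="bizTitleLink4">5. Ristorante Parma
</a>'''

pattern = re.compile(r'(\d+)\.\s+([^<]+)')

numbers = []
names = []
# findall returns one (group 1, group 2) tuple per match.
for number, name in pattern.findall(page):
    numbers.append(int(number))
    names.append(name.strip())  # drop the newline before </a>

print(numbers)
print(names)
```

The strip() call matters because group 2 runs up to the next <, which includes the line break before the closing tag.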

Trouble accessing attribute after using BeautifulSoup's findAll

I'm trying to scrape sites like this one on the BBC website to grab the relevant parts of the programme listing, and I've just started using BeautifulSoup to do this.
The parts of interest start with sections like:
<li about="/programmes/p013zzsl#segment" class="segment track" id="segmentevent-p013zzsm" typeof="po:MusicSegment">
<li about="/programmes/p014003v#segment" class="segment speech alt" id="segmentevent_p014003w" typeof="po:SpeechSegment">
What I've done so far is opened the HTML as soup and then used soup.findAll(typeof=['po:MusicSegment', 'po:SpeechSegment']) to give a ResultSet of the parts I'm interested in the order in which they appear.
What I then want to do is check whether a section refers to po:MusicSegment or po:SpeechSegment in HTML that looks like:
<li about="/programmes/p01400m9#segment" class="segment track" id="segmentevent-p01400mb" typeof="po:MusicSegment"> <span class="artist-image"> <span class="depiction" rel="foaf:depiction"><img alt="" height="63" src="http://static.bbci.co.uk/programmes/2.54.3/img/thumbnail/artists_default.jpg" width="112"/></span> </span> <script type="text/javascript"> window.programme_data.tracklist.push({ segment_event_pid : "p01400mb", segment_pid : "p01400m9", playlist : "http://www.bbc.co.uk/programmes/p01400m9.emp" }); </script> <h3> <span rel="mo:performer"> <span class="artist no-image" property="foaf:name" typeof="mo:MusicArtist">Mala</span> </span> <span class="title" property="dc:title">Calle F</span> </h3></li>
I want to access the typeof attribute associated with <li>, but if this chunk of HTML (as a BS4 tag) is called section and I enter section.li, it returns None.
Note that if I do section.img instead, I get something back:
<img alt="" height="63" src="http://static.bbci.co.uk/programmes/2.54.3/img/thumbnail/artists_default.jpg" width="112"/>
and I could then do, e.g. section.img['height'] to get back u'63'
What I want is something analogous for the section.li part, so section.li['typeof'] to give me po:MusicSegment or po:SpeechSegment
Of course, I could simply convert each result to text and then do a simple string search, but searching by attribute seems more elegant.

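Each item in the ResultSet returned by findAll is already the <li> tag itself, so section.li looks for an <li> nested inside it and finds none. Read the attribute straight from the item while iterating over the list: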
soup = BeautifulSoup('<li about="/programmes/p013zzsl#segment" class="segment track" id="segmentevent-p013zzsm" typeof="po:MusicSegment"><li about="/programmes/p014003v#segment" class="segment speech alt" id="segmentevent_p014003w" typeof="po:SpeechSegment">')
for elem in soup.findAll(typeof=['po:MusicSegment', 'po:SpeechSegment']):
    print elem['typeof']
returns
po:MusicSegment
po:SpeechSegment
and then conditionally perform your other tasks:
if elem['typeof'] == 'po:MusicSegment':
    do.something()
elif elem['typeof'] == 'po:SpeechSegment':
    do.something_else()

We Keep Coding
