Selenium Python click a link to javascript in an unordered list - python-2.7

I'm trying to click a JavaScript link with Selenium. It's for a 5-star rating widget.
The five-stars link is the exact item shown below. The other items (i.e. 4 stars and lower) are not fully shown.
<div id="percentages_and_ratings">
<div id="percentages">
<div id="rating">
<ul id="personality-rating" class="star-rating profile_rating " onmouseout="Votes.publicStarOut(this)" onmouseover="Votes.publicStarOver(this)">
<li id="current-personality-3198779465475184989-1" class="current-rating" style="width: 0%;"></li>
<li>...
<li>...
<li>...
<li>...
<li>
<a class="five-stars" title="" href="javascript:processVoteNote('vote', 'personality', 5, '222222222222222', false, '', '', Profile.profileHeadingVote);">5</a>
</li>
<li class="cant-tell" style="display: none;">
<li class="click-away">
The selenium unit test output looks like
driver.find_element_by_xpath("(//a[contains(text(),'5')])[2]").click()
but that doesn't work. Selecting the xpath, CSS, HTML with firebug doesn't work either. Any ideas? I've been at it for a few nights now so it's time to ask :-)
I'm using Selenium web driver and python 2.7
Here is how I ended up solving it:
id = self.getID(driver)
script = "$(processVoteNote('vote', 'personality', 5, '"+id+"', false, '', '', Profile.profileHeadingVote));"
driver.execute_script(script)
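The string concatenation above would break if the ID ever contained a quote character. A minimal sketch of building the same call with json.dumps doing the quoting; build_vote_script is a hypothetical helper name, the argument order is copied from the link's href, and the $(...) wrapper is left out on the assumption that it is not needed for the call to run:

```python
import json


def build_vote_script(profile_id, stars=5):
    """Build the JavaScript call that the star link's href would run.

    build_vote_script is a hypothetical helper; the argument order is
    copied from the href shown in the question.
    """
    # json.dumps yields a valid JavaScript string literal, escaping quotes.
    return (
        "processVoteNote('vote', 'personality', %d, %s, false, '', '', "
        "Profile.profileHeadingVote);" % (stars, json.dumps(profile_id))
    )


script = build_vote_script("222222222222222")
print(script)
# driver.execute_script(script)  # assumes an existing Selenium WebDriver
```

The resulting string can be passed to driver.execute_script exactly as in the snippet above.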

Based on the sample HTML you posted,
browser.find_element_by_class_name('five-stars').click() should successfully select and click that link. If more than one element on the page has that class name, you could use browser.find_elements_by_class_name('five-stars'), iterate through that list to identify the relevant links, and then click them.
If you want to use an XPath search, I'd recommend using xPath Tester to try out different patterns.

Related

web scraping dynamic list

<div class="col col-1-1"><h2 class="heading">Flowers</h2><ul class="icon-list"> <li class="col col-1-2 no-gutter">
<svg class="icon icon--medium">
<use xlink:href="https://"></use>
</svg>
measure 1<span class="icon-list__count">81</span> </li>
<li class="col col-1-2 no-gutter">
<svg class="icon icon--medium">
<use xlink:href="https://"></use>
</svg>
measure 2 <span class="icon-list__count">52</span> </li>
<li class="col col-1-2 no-gutter">
<svg class="icon icon--medium">
<use xlink:href="https://"></use>
</svg>
measure 3<span class="icon-list__count">29</span> </li>
</ul></div>
This is one example of a list of measures for one type of flower. How can I scrape the values of the measures and store them in a Python dictionary? Ideally the code would be flexible enough to handle another page that has measures 2 and 3 only, or measures 3 and 4 (a new measure not appearing on this page), or completely new measures 4 and 5.
New to Python - would appreciate any advice.
BeautifulSoup works best when you are scraping a mostly static website.
Try using unique identifiers present in a tag to navigate this tree-like structure. This piece of code will give you a dictionary with each measure name as the key and its count as the value.
from bs4 import BeautifulSoup
import re
html = '<div class="col col-1-1"><h2 class="heading">Flowers</h2><ul class="icon-list"><li class="col col-1-2 no-gutter"><svg class="icon icon--medium"><use xlink:href="https://"></use></svg>measure 1<span class="icon-list__count">81</span></li><li class="col col-1-2 no-gutter"><svg class="icon icon--medium"><use xlink:href="https://"></use></svg>measure 2 <span class="icon-list__count">52</span></li><li class="col col-1-2 no-gutter"><svg class="icon icon--medium"><use xlink:href="https://"></use></svg>measure 3<span class="icon-list__count">29</span></li></ul></div>'
soup = BeautifulSoup(html,'lxml')
li_tags = soup.find_all('li')  # texts: ['measure 181', 'measure 2 52', 'measure 329']
span_tags = soup.find_all('span', class_='icon-list__count')  # texts: ['81', '52', '29']
li_list = []
for li in li_tags:
    li_list.append(li.text)
measure_dict = {}
for i in range(len(li_list)):
    li_list[i] = re.sub(span_tags[i].text, '', li_list[i])  # converting 'measure 181' into 'measure 1' and likewise
    measure_dict[li_list[i]] = span_tags[i].text  # if you want the values as integers, use int(span_tags[i].text) in this line
print(measure_dict)
# {'measure 1': '81', 'measure 2 ': '52', 'measure 3': '29'}
The code will be flexible as long as the identifier I have used here, class='icon-list__count', is present on every page you access and contains the data you want to scrape. If it is not, you will have to traverse the HTML tags and identify the desired data yourself.
If the website uses JavaScript to render the part you want to scrape, it's better to use Selenium, which is better suited to dynamic websites.
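One caveat with the re.sub approach above: it removes every occurrence of the count, so a label whose own digits match the count (for example 'measure 4' with a count of '4') would lose them too. A small sketch that strips only the trailing count; split_label_and_count is a hypothetical helper, and plain strings stand in for li.text and span.text:

```python
def split_label_and_count(li_text, count_text):
    """Return (label, count) given an <li>'s full text and its count text."""
    li_text = li_text.strip()
    # Remove the count only where it appears as a suffix of the text.
    if count_text and li_text.endswith(count_text):
        li_text = li_text[: -len(count_text)]
    return li_text.strip(), int(count_text)


# Stand-ins for (li.text, span.text) pairs scraped from a page.
pairs = [("measure 181", "81"), ("measure 2 52", "52"), ("measure 44", "4")]
measure_dict = dict(split_label_and_count(text, count) for text, count in pairs)
print(measure_dict)
```

Stripping the whitespace also avoids keys with a trailing space such as 'measure 2 '.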
Advice:
Reading the documentation of the module is far more helpful than watching random YT videos in the long run!
Try using the re module when you need to manipulate strings; it is often more flexible than the built-in string methods.

How do I scrape nested data using selenium and Python

I basically want to scrape Litigation Paralegal under <h3 class="Sans-17px-black-85%-semibold"> and Olswang under <span class="pv-entity__secondary-title Sans-15px-black-55%">, but I can't seem to get to it. Here's the HTML:
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
And here is what I've been doing at the moment with selenium in my code:
if tree.xpath('//*[@class="pv-entity__summary-info"]'):
    experience_title = tree.xpath('//*[@class="Sans-17px-black-85%-semibold"]/h3/text()')
    print(experience_title)
    experience_company = tree.xpath('//*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text()')
    print(experience_company)
My output:
Experience title : []
[]
Your XPath expressions are incorrect:
//*[@class="Sans-17px-black-85%-semibold"]/h3/text() means the text content of an h3 that is a child of an element with the class attribute "Sans-17px-black-85%-semibold". Instead you need
//h3[@class="Sans-17px-black-85%-semibold"]/text()
which means the text content of the h3 element with the class attribute "Sans-17px-black-85%-semibold".
In //*[@class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text() you forgot a slash before text() (you need /text(), not just text()). Also, the target span has no class name pv-position-entity__secondary-title. You need to use
//span[@class="pv-entity__secondary-title Sans-15px-black-55%"]/text()
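The corrected expressions can be checked offline. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made only to keep the example dependency-free). ElementTree supports just a subset of XPath, so the text is read from the .text attribute instead of text(), and the markup is a trimmed copy of the snippet in the question:

```python
import xml.etree.ElementTree as ET

snippet = """
<div class="pv-entity__summary-info">
  <h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
  <h4>
    <span class="visually-hidden">Company Name</span>
    <span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
  </h4>
</div>
"""

root = ET.fromstring(snippet.strip())

# Match each element itself by its exact class attribute value.
title = root.find(".//h3[@class='Sans-17px-black-85%-semibold']")
company = root.find(
    ".//span[@class='pv-entity__secondary-title Sans-15px-black-55%']"
)

print(title.text)
print(company.text)
```

Note that an exact match on the class attribute only works while the attribute string is identical; any extra class on the element would make the predicate fail.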
You can get both of these easily with CSS selectors and I find them a lot easier to read and understand than XPath.
driver.find_element_by_css_selector("div.pv-entity__summary-info > h3").text
driver.find_element_by_css_selector("div.pv-entity__summary-info span.pv-entity__secondary-title").text
. indicates class name
> indicates child (one level below only)
(a space) indicates a descendant (any levels below)
Here are some references to get you started.
CSS Selectors Reference
CSS Selectors Tips
Advanced CSS Selectors

Kentico - New fields on existing document types won't render

In Kentico 7, I added 3 new fields to the Page (menu item) document type: small_desc, long_desc and icon_class - this is in addition to the existing fields MenuItemID, MenuItemName and MenuItemTeaserImage.
On a Repeater WebPart I added the following transformation:
<li class="...">
<a class="<%# Eval("icon_class") %>" href="<%# GetDocumentUrl() %>">
<%# Eval("MenuItemName") %>
</a>
<p class="..."><%# Eval("small_desc") %></p>
</li>
A strange thing happens. While viewing the page with the Repeater in Preview mode, everything renders correctly:
<li class="...">
<a class="unique_class" href="/url.htm">
Document Title
</a>
<p class="...">A description I just added to the document.</p>
</li>
But in Live mode, I see:
<li class="...">
<a class="" href="/url.htm">
Document Title
</a>
<p class="..."></p>
</li>
So...
We've run through a plethora of troubleshooting steps...
there are absolutely no exceptions in our Event Log
everything is checked in
server cache cleared
application restarted
browser cache cleared and hard reloaded on multiple browsers and machines
My assumption was Kentico didn't like it when you add new fields to existing (Kentico default) document types. I cloned a completely new document type earlier, added all brand spanking new fields, ran a repeater on a list of new documents, and every single field showed up. I'm certain I could do that - just clone Page (menu item) and recreate all of my pages, but for (I hope) obvious reasons I'm not going to do that. Kentico Support hasn't been able to give any good direction so I turn to you smart folks!
What columns are set in the repeater's Columns property? Is it possible that only some columns are listed there and the new ones are missing? If it is blank, all the columns should be loaded (not good for performance though).

Trouble accessing attribute after using BeautifulSoup's findAll

I'm trying to scrape sites like this one on the BBC website to grab the relevant parts of the programme listing, and I've just started using BeautifulSoup to do this.
The parts of interest start with sections like:
<li about="/programmes/p013zzsl#segment" class="segment track" id="segmentevent-p013zzsm" typeof="po:MusicSegment">
<li about="/programmes/p014003v#segment" class="segment speech alt" id="segmentevent_p014003w" typeof="po:SpeechSegment">
What I've done so far is open the HTML as soup and then use soup.findAll(typeof=['po:MusicSegment', 'po:SpeechSegment']) to get a ResultSet of the parts I'm interested in, in the order in which they appear.
What I then want to do is check whether a section refers to po:MusicSegment or po:SpeechSegment in HTML that looks like:
<li about="/programmes/p01400m9#segment" class="segment track" id="segmentevent-p01400mb" typeof="po:MusicSegment"> <span class="artist-image"> <span class="depiction" rel="foaf:depiction"><img alt="" height="63" src="http://static.bbci.co.uk/programmes/2.54.3/img/thumbnail/artists_default.jpg" width="112"/></span> </span> <script type="text/javascript"> window.programme_data.tracklist.push({ segment_event_pid : "p01400mb", segment_pid : "p01400m9", playlist : "http://www.bbc.co.uk/programmes/p01400m9.emp" }); </script> <h3> <span rel="mo:performer"> <span class="artist no-image" property="foaf:name" typeof="mo:MusicArtist">Mala</span> </span> <span class="title" property="dc:title">Calle F</span> </h3></li>
I want to access the typeof attribute associated with <li>, but if this chunk of HTML (as a BS4 tag) is called section and I enter section.li, it returns None.
Note that if I do section.img instead, I get something back:
<img alt="" height="63" src="http://static.bbci.co.uk/programmes/2.54.3/img/thumbnail/artists_default.jpg" width="112"/>
and I could then do, e.g. section.img['height'] to get back u'63'
What I want is something analogous for the section.li part, so section.li['typeof'] to give me po:MusicSegment or po:SpeechSegment
Of course, I could simply convert each result to text and then do a simple string search, but searching by attribute seems more elegant.
I'd iterate over the list returned by findAll:
soup = BeautifulSoup('<li about="/programmes/p013zzsl#segment" class="segment track" id="segmentevent-p013zzsm" typeof="po:MusicSegment"><li about="/programmes/p014003v#segment" class="segment speech alt" id="segmentevent_p014003w" typeof="po:SpeechSegment">')
for elem in soup.findAll(typeof=['po:MusicSegment', 'po:SpeechSegment']):
    print(elem['typeof'])
returns
po:MusicSegment
po:SpeechSegment
and then conditionally perform your other tasks:
if elem['typeof'] == 'po:MusicSegment':
    do.something()
elif elem['typeof'] == 'po:SpeechSegment':
    do.something_else()
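If the number of segment types grows, a dictionary of handlers can replace the if/elif chain. A minimal sketch: handle_music and handle_speech are hypothetical names, and plain dicts stand in for the BeautifulSoup tags, since both support elem['typeof']:

```python
def handle_music(elem):
    # Hypothetical handler for music segments.
    return "music: " + elem["about"]


def handle_speech(elem):
    # Hypothetical handler for speech segments.
    return "speech: " + elem["about"]


# Map each typeof value to the function that should process it.
HANDLERS = {
    "po:MusicSegment": handle_music,
    "po:SpeechSegment": handle_speech,
}

# Stand-ins for the tags returned by soup.findAll(...).
segments = [
    {"typeof": "po:MusicSegment", "about": "/programmes/p013zzsl#segment"},
    {"typeof": "po:SpeechSegment", "about": "/programmes/p014003v#segment"},
]

results = [HANDLERS[elem["typeof"]](elem) for elem in segments]
print(results)
```

With real tags, the list comprehension would iterate over the ResultSet from findAll instead of the stand-in list.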

can't add a link to an entire div section

I have a problem with TinyMCE in Joomla 2.5.4. I have tried for a few days now to add a link to a div section (like <div> something </div>) but failed; the anchor is stripped from the HTML because TinyMCE sees that as invalid in HTML4. After three days of research I gave up, and instead of a div I used an unordered list.
Now when I try to add a link to a list item (like <li> <p> something </p> </li>) TinyMCE rearranges everything and moves the anchor inside the list item (like <li> <a href="#"> <p> something </p> </a> </li>).
I have tried pretty much everything, from valid_elements : "[]" to text filter: No Filtering, but I have run out of ideas.
Can anyone please help me?
Try playing around with TinyMCE's html5 options: http://www.tinymce.com/tryit/html5_formats.php
Hit "view source" to see how they're doing it. It's mainly this option inside tinyMCE.init:
schema: "html5",