I am using Scrapy and XPath to parse web-site in Russian language.
In this topic, alecxe suggested me how to construct the xpath expression to get the values. However, I don't understand how can I handle the case when the Param1_name is in Russian?
Here is the xpath expression:
//*[text()="Param1_name_in_russian"]/following-sibling::text()
Html snippet:
<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Param1_name_in_russian</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>
EDITED based on comments
I assume I didn't specify properly the question since all suggested solutions didn't work for me i.e. when I tested the suggested XPath expressions in Scrapy console output was nothing. Thus, I provide more detailed information about web-site that I need to parse:
link to the web-site: link to real-estate web site
screenshot of what I need to parse:
Consider declaring your encoding at the beginning of the file as latin-1. See the documentation for a thorough explanation as to why.
I'll be using lxml instead of Scrapy below, but the logic is the same.
Code:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
from lxml import html
markup = """div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Некий текст</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>"""
tree = html.fromstring(markup)
pone_val = tree.xpath(u"//*[text()='Некий текст']/following-sibling::text()")
print pone_val
Result:
['" Param1_value"']
[Finished in 0.5s]
Note that since this is a unicode string, the u at the beginning of the Xpath is necessary, same as #warwaruk's comment in your question.
Let us know if this helps.
EDIT:
Based on the site's markup, there's actually a better way to get the values. Again, using lxml and not Scrapy since the difference between the two here is just .extract() anyway. Basically, check my XPath for the name, room, square, and floor.
import requests as rq
from lxml import html
url = "http://www.lun.ua/%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-%D0%BA%D0%B8%D0%B5%D0%B2"
r = rq.get(url)
tree = html.fromstring(r.text)
divs = tree.xpath("//div[#class='obj-left']")
for div in divs:
name = div.xpath("./h3/span/a/text()")[0]
details = div.xpath(".//div[#class='obj-params-col'][1]")[0]
room = details.xpath("./p[1]/text()[last()]")[0]
square = details.xpath("./p[2]/text()[last()]")[0]
floor = details.xpath("./p[3]/text()[last()]")[0]
print name.encode("utf-8")
print room.encode("utf-8")
print square.encode("utf-8")
print floor.encode("utf-8")
This doesn't print them out all well on my end (getting some [Decode error - output not utf-8]). However, I believe that encoding aside, using this approach is much better scraping practice overall.
Let us know what you think.
Related
when executing build book, index.html is gaining "automatic contributions" that I have not ordered; example of the "contribution":
</div>
<div id="prerequisites" class="section level1">
<h1><span class="header-section-number"> 1</span> Prerequisites</h1>
<p>This is a <em>sample</em> book written in <strong>Markdown</strong>. You can use anything that
Pandoc’s Markdown supports, e.g., a math equation <span class="math inline">\(a^2 + b^2 = c^2\)
</span>
...
</p>class="uri">https://yihui.org/tinytex/</a>.</p>
This "gift" is reflected into the ePUB output, but not into the html output
Seems improper to have to edit index.html to exclude this "addition".
Anyone has been able to avoid this effect?
Thank you
I'm using Python 3.7, Django, and BeautifulSoup. I am currnently looking for "span" elements in my document that contain the text "Review". I do so like this
html = urllib2.urlopen(req, timeout=settings.SOCKET_TIMEOUT_IN_SECONDS).read()
my_soup = BeautifulSoup(html, features="html.parser")
rev_elts = my_soup.findAll("span", text=re.compile("Review"))
for rev_elt in rev_elts:
... processing
but I'd like to add a wrinkle to where I don't want to consider those elements if they have a DIV ancestor with the class "child". So for example, I don't want to consider something like this
<div class="child">
<p>
<span class="s">Reviews</span>
...
</p>
</div>
How can I adjust my search to take this into account?
If you are using BeautifulSoup 4.7+, it has some improved CSS selector support. It handles many selectors up through CSS level 4 and a couple of custom ones like :contains(). In addition to all of that, it handles complex selectors in pseudo-classes like :not() which level 4 was supposed to handle, but they've recently pushed that support out to CSS level 5 selector support.
So in this example we will use the custom :contains selector to search for spans which contain the text Review. In addition, we will say we don't want it to match div.class span.
from bs4 import BeautifulSoup
html = """
<div>
<p><span>Review: Let's find this</span></p>
</div>
<div class="child">
<p><span>Review: Do not want this</span></p>
</div>
"""
soup = BeautifulSoup(html, features="html.parser")
spans = soup.select('span:contains(Review):not(div.child span)')
print(spans)
Output
[<span>Review: Let's find this</span>]
Depending on your case, maybe :contains isn't robust enough. In that case, you can still do something similar. Soup Sieve is the underlying library included with Beautiful Soup 4.7+, and you can import it directly to filter your regular expression returns:
from bs4 import BeautifulSoup
import soupsieve as sv
import re
html = """
<div>
<p><span>Review: Let's find this</span></p>
</div>
<div class="child">
<p><span>Review: Do not want this</span></p>
</div>
"""
soup = BeautifulSoup(html, features="html.parser")
spans = soup.find_all("span", text=re.compile("Review"))
spans = sv.filter(':not(div.child span)', spans)
print(spans)
Output
[<span>Review: Let's find this</span>]
CSS selector is the way to go in this case as #facelessuser has answered. But just in case you are wondering this can be done without using css selector as well.
You can iterate over all of an element’s parents with .parents. You could define a custom filter function which checks if any of the parents has a class of "child" and return True otherwise (in addition to all your other conditions).
from bs4 import BeautifulSoup, Tag
html="""
<div class="child">
<p><span id="1">Review</span></p>
</div>
<div>
<p><span id="2">Review</span></p>
</div>
"""
soup=BeautifulSoup(html,'html.parser')
def my_func(item):
if isinstance(item,Tag) and item.name=='span' and 'Review' in item.text:
for parent in item.parents:
if parent.has_attr('class'):
if 'child' in parent.get('class'):
return False
return True
my_spans=soup.find_all(my_func)
print(my_spans)
Outputs:
[<span id="2">Review</span>]
I basically want to scrape Litigation Paralegal under <h3 class="Sans-17px-black-85%-semibold"> and Olswang under <span class="pv-entity__secondary-title Sans-15px-black-55%">, but I can't see to get to it. Here's the HTML at code:
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
And here is what I've been doing at the moment with selenium in my code:
if tree.xpath('//*[#class="pv-entity__summary-info"]'):
experience_title = tree.xpath('//*[#class="Sans-17px-black-85%-semibold"]/h3/text()')
print(experience_title)
experience_company = tree.xpath('//*[#class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text()')
print(experience_company)
My output:
Experience title : []
[]
Your XPath expressions are incorrect:
//*[#class="Sans-17px-black-85%-semibold"]/h3/text() means text content of h3 which is child of element with class name attribute "Sans-17px-black-85%-semibold". Instead you need
//h3[#class="Sans-17px-black-85%-semibold"]/text()
which means text content of h3 element with class name attribute "Sans-17px-black-85%-semibold"
In //*[#class="pv-position-entity__secondary-title pv-entity__secondary-title Sans-15px-black-55%"]text() you forgot a slash before text() (you need /text(), not just text()). And also target span has no class name pv-position-entity__secondary-title. You need to use
//span[#class="pv-entity__secondary-title Sans-15px-black-55%"]/text()
You can get both of these easily with CSS selectors and I find them a lot easier to read and understand than XPath.
driver.find_element_by_css_selector("div.pv-entity__summary-info > h3").text
driver.find_element_by_css_selector("div.pv-entity__summary-info span.pv-entity__secondary-title").text
. indicates class name
> indicates child (one level below only)
indicates a descendant (any levels below)
Here are some references to get you started.
CSS Selectors Reference
CSS Selectors Tips
Advanced CSS Selectors
I'm trying to scrape some stock prices, and variations, from Google Finance using python3 but I just can't figure out if there's something wrong with the page, or my regex. I'm thinking that either the svg graphic or the many script tags throughout the page are making the regex parsers fail to properly analyze the code.
I have tested this regex on many online regex builders/testers and it looks ok. As ok as a regex designed for HTML can be, anyway.
The Google Finance page I'm testing this out on is https://www.google.com/finance?q=NYSE%3AAAPL
And my python code is the following
import urllib.request
import re
page = urllib.request.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
text = page.read().decode('utf-8')
m = re.search("id=\"price-panel.*>(\d*\d*\d\.\d\d)</span>.*\((-*\d\.\d\d%)\)", text, re.S)
print(m.groups())
It would extract the stock price and its percent variation.
I have also tried using python2 + BeautifulSoup, like so
soup.find(id='price-panel')
but it returns empty even for this simple query. This is especially why I'm thinking that there's something weird with the html.
And here's the most important bit of html that I'm aiming for
<div id="price-panel" class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span>
</span>
<div class="id-price-change nwp goog-inline-block">
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span>
<span class="down" id="ref_22144_cp">(-1.16%)</span>
</span>
</div>
</div>
<div>
<span class="nwp">
Real-time:
<span class="unchanged" id="ref_22144_ltt">3:42PM EDT</span>
</span>
<div class="mdata-dis">
<span class="dis-large"><nobr>NASDAQ
real-time data -
Disclaimer
</nobr></span>
<div>Currency in USD</div>
</div>
</div>
</div>
I'm wondering if any of you have encountered a similar problem with this page and/or can figure out if there's anything wrong with my code. Thanks in advance!
You might try a different URL that will be easier to parse, such as: http://www.google.com/finance/info?q=AAPL
The catch is that Google has said that using this API in an application for public consumption is against their Terms of Service. Maybe there is an alternative that Google will allow you to use?
I managed to get it working using BeautifulSoup, on the link posted originally.
Here's the bit of code I finaly used:
response = urllib2.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
html = response.read()
soup = BeautifulSoup(html, "lxml")
aaplPrice = soup.find(id='price-panel').div.span.span.text
aaplVar = soup.find(id='price-panel').div.div.span.find_all('span')[1].string.split('(')[1].split(')')[0]
aapl = aaplPrice + ' ' + aaplVar
I couldn't get it working with BeautifulSoup before because I was actually trying to parse the table in this page https://www.google.com/finance?q=NYSE%3AAAPL%3BNYSE%3AGOOG, not the one I posted.
Neither method described on my question has worked on this page.
I'm trying to scrape sites like this one on the BBC website to grab the relevant parts of the programme listing, and I've just started using BeautifulSoup to do this.
The parts of interest start with sections like:
<li about="/programmes/p013zzsl#segment" class="segment track" id="segmentevent-p013zzsm" typeof="po:MusicSegment">
<li about="/programmes/p014003v#segment" class="segment speech alt" id="segmentevent_p014003w" typeof="po:SpeechSegment">
What I've done so far is opened the HTML as soup and then used soup.findAll(typeof=['po:MusicSegment', 'po:SpeechSegment']) to give a ResultSet of the parts I'm interested in the order in which they appear.
What I then want to do is check whether a section refers to po:MusicSegment or po:SpeechSegment in HTML that looks like:
<li about="/programmes/p01400m9#segment" class="segment track" id="segmentevent-p01400mb" typeof="po:MusicSegment"> <span class="artist-image"> <span class="depiction" rel="foaf:depiction"><img alt="" height="63" src="http://static.bbci.co.uk/programmes/2.54.3/img/thumbnail/artists_default.jpg" width="112"/></span> </span> <script type="text/javascript"> window.programme_data.tracklist.push({ segment_event_pid : "p01400mb", segment_pid : "p01400m9", playlist : "http://www.bbc.co.uk/programmes/p01400m9.emp" }); </script> <h3> <span rel="mo:performer"> <span class="artist no-image" property="foaf:name" typeof="mo:MusicArtist">Mala</span> </span> <span class="title" property="dc:title">Calle F</span> </h3></li>
I want to access the typeof attribute associated with <li>, but if this chunk of HTML (as a BS4 tag) is called section and I enter section.li, it returns None.
Note that if I do section.img instead, I get something back:
<img alt="" height="63" src="http://static.bbci.co.uk/programmes/2.54.3/img/thumbnail/artists_default.jpg" width="112"/>
and I could then do, e.g. section.img['height'] to get back u'63'
What I want is something analogous for the section.li part, so section.li['typeof'] to give me po:MusicSegment or po:SpeechSegment
Of course, I could simply convert each result to text and then do a simple string search, but searching by attribute seems more elegant.
I'd iterate over the list returned by findAll:
soup = BeautifulSoup('<li about="/programmes/p013zzsl#segment" class="segment track" id="segmentevent-p013zzsm" typeof="po:MusicSegment"><li about="/programmes/p014003v#segment" class="segment speech alt" id="segmentevent_p014003w" typeof="po:SpeechSegment">')
for elem in soup.findAll(typeof=['po:MusicSegment', 'po:SpeechSegment']):
print elem['typeof']
returns
po:MusicSegment
po:SpeechSegment
and then conditionally perform your other tasks:
if elem['typeof'] == 'po:MusicSegment'
do.something()
elif elem['typeof'] == 'po:SpeechSegment':
do.something_else()