I'm trying to scrape a site that has this type of div:
<div class="mindatath">Density:</div>
<div class="mindatam2">
3.98 - 4.1 g/cm
<sup>3</sup>
(Measured) 3.997 g/cm
<sup>3</sup>
(Calculated)
</div>
</div>
OK, I need the value in the mindatam2 div, but there are a lot of divs with this class. How can I relate the two divs so that I know which value to extract?
I tried using Scrapy to show all the div values:
response.xpath('//div[@class="mindatam2"]/text()').extract()
If all your densities have a similar format, you can use a regex.
For example:
response.xpath('//div[@class="mindatam2"]/text()').re(r'([\d\.\-\s]+)g/cm')
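If you instead want to tie each value to its label, you can anchor the XPath on the mindatath text and step to the adjacent sibling. A rough sketch, assuming the two divs are adjacent siblings as in your snippet:
density = response.xpath(
    '//div[@class="mindatath"][contains(., "Density")]'
    '/following-sibling::div[@class="mindatam2"][1]'
).xpath('string(.)').re(r'([\d\.\-\s]+)g/cm')
The [1] keeps only the first mindatam2 after each label, so every label on the page selects its own value.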
Is there a way to generate a PDF in Rails based on a div id?
Sample code:
<div id="pdf_download">
<h1>Hello Welcome to PDF</h1>
</div>
<div id="seconf_pdf">
<h2>Second PDF</h2>
</div>
Now I want to download only the div with id "pdf_download" as a PDF. Is it possible? Can anyone explain how to achieve it?
There seems to be a gem that does that: wicked_pdf. Internally it depends on wkhtmltopdf, so you will need to be able to install that on your production environment.
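If you want to sanity-check wkhtmltopdf itself before wiring up the gem, you can feed it just the fragment you care about. A minimal sketch (Python here only for the test; it assumes wkhtmltopdf is on your PATH, and the file names are placeholders):
import subprocess

# Write just the fragment you want rendered, then point wkhtmltopdf at it.
html = '<div id="pdf_download"><h1>Hello Welcome to PDF</h1></div>'
with open('fragment.html', 'w') as f:
    f.write(html)
subprocess.check_call(['wkhtmltopdf', 'fragment.html', 'out.pdf'])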
I'm trying to scrape some stock prices, and their variations, from Google Finance using Python 3, but I just can't figure out whether there's something wrong with the page or with my regex. I'm thinking that either the svg graphic or the many script tags throughout the page are making the regex parsers fail to properly analyze the code.
I have tested this regex on many online regex builders/testers and it looks ok. As ok as a regex designed for HTML can be, anyway.
The Google Finance page I'm testing this out on is https://www.google.com/finance?q=NYSE%3AAAPL
And my Python code is the following:
import urllib.request
import re
page = urllib.request.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
text = page.read().decode('utf-8')
m = re.search("id=\"price-panel.*>(\d*\d*\d\.\d\d)</span>.*\((-*\d\.\d\d%)\)", text, re.S)
print(m.groups())
It would extract the stock price and its percent variation.
I have also tried using Python 2 + BeautifulSoup, like so:
soup.find(id='price-panel')
but it returns empty even for this simple query. This is especially why I'm thinking that there's something weird with the html.
And here's the most important bit of HTML that I'm aiming for:
<div id="price-panel" class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span class="unchanged" id="ref_22144_l"><span class="unchanged">96.41</span><span></span></span>
</span>
<div class="id-price-change nwp goog-inline-block">
<span class="ch bld"><span class="down" id="ref_22144_c">-1.13</span>
<span class="down" id="ref_22144_cp">(-1.16%)</span>
</span>
</div>
</div>
<div>
<span class="nwp">
Real-time:
<span class="unchanged" id="ref_22144_ltt">3:42PM EDT</span>
</span>
<div class="mdata-dis">
<span class="dis-large"><nobr>NASDAQ
real-time data -
Disclaimer
</nobr></span>
<div>Currency in USD</div>
</div>
</div>
</div>
I'm wondering if any of you have encountered a similar problem with this page and/or can figure out if there's anything wrong with my code. Thanks in advance!
You might try a different URL that will be easier to parse, such as: http://www.google.com/finance/info?q=AAPL
The catch is that Google has said that using this API in an application for public consumption is against their Terms of Service. Maybe there is an alternative that Google will allow you to use?
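For what it's worth, that endpoint used to return a JSON array prefixed with //, so you'd strip that before parsing. A sketch of how that looked (treat the 'l' and 'c' field names as assumptions; they come from the old undocumented format):
import json
import urllib.request

raw = urllib.request.urlopen('http://www.google.com/finance/info?q=AAPL').read().decode('utf-8')
# The response historically started with "//", so strip it before parsing.
data = json.loads(raw.lstrip('/ \n'))
print(data[0]['l'], data[0]['c'])  # 'l' = last price, 'c' = change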
I managed to get it working using BeautifulSoup, on the link posted originally.
Here's the bit of code I finally used (imports included for completeness):
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen('https://www.google.com/finance?q=NYSE%3AAAPL')
html = response.read()
soup = BeautifulSoup(html, "lxml")
aaplPrice = soup.find(id='price-panel').div.span.span.text
aaplVar = soup.find(id='price-panel').div.div.span.find_all('span')[1].string.split('(')[1].split(')')[0]
aapl = aaplPrice + ' ' + aaplVar
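An alternative sketch keyed on the class names visible in the HTML above, which tends to be less brittle than chaining .div.span.span (same Python 2 + BeautifulSoup setup):
panel = soup.find(id='price-panel')
# span.pr holds the price; the second span inside span.ch holds "(-1.16%)".
price = panel.find('span', class_='pr').get_text(strip=True)
var = panel.find('span', class_='ch').find_all('span')[1].get_text(strip=True).strip('()')
print price + ' ' + var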
I couldn't get it working with BeautifulSoup before because I was actually trying to parse the table in this page https://www.google.com/finance?q=NYSE%3AAAPL%3BNYSE%3AGOOG, not the one I posted.
Neither method described in my question worked on that page.
I am using Scrapy and XPath to parse a web site in Russian.
In this topic, alecxe suggested how to construct the XPath expression to get the values. However, I don't understand how to handle the case where Param1_name is in Russian.
Here is the xpath expression:
//*[text()="Param1_name_in_russian"]/following-sibling::text()
Html snippet:
<div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Param1_name_in_russian</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>
EDITED based on comments
I assume I didn't specify the question properly, since none of the suggested solutions worked for me, i.e. when I tested the suggested XPath expressions in the Scrapy console, the output was nothing. Thus, here is more detailed information about the web site that I need to parse:
link to the web-site: link to real-estate web site
screenshot of what I need to parse:
Consider declaring your encoding at the beginning of the file as utf-8, and make sure the file is actually saved as UTF-8 (Cyrillic characters cannot be represented in latin-1 at all). See the documentation for a thorough explanation as to why.
I'll be using lxml instead of Scrapy below, but the logic is the same.
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from lxml import html
markup = """div class="obj-params">
<div class="wrap">
<div class="obj-params-col" style="min-width:50%;">
<p>
<b>Некий текст</b>" Param1_value"</p>
<p>
<strong>Param2_name_in_russian</strong>" Param2_value</p>
<p>
<strong>Param3_name_in_russian</strong>" Param3_value"</p>
</div>
</div>
<div class="wrap">
<div class="obj-params-col">
<p>
<b>Param4_name_in_russian</b>Param4_value</p>
<div class="inline-popup popup-hor left">
<b>Param5_name</b>
<a target="_blank" href="link">Param5_value</a></div></div>"""
tree = html.fromstring(markup)
pone_val = tree.xpath(u"//*[text()='Некий текст']/following-sibling::text()")
print pone_val
Result:
['" Param1_value"']
[Finished in 0.5s]
Note that since this is a unicode string, the u at the beginning of the XPath expression is necessary, as @warwaruk also pointed out in his comment on your question.
Let us know if this helps.
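For reference, the equivalent in the Scrapy shell would be along these lines (note the same u prefix):
response.xpath(u'//*[text()="Некий текст"]/following-sibling::text()').extract()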
EDIT:
Based on the site's markup, there's actually a better way to get the values. Again, using lxml and not Scrapy since the difference between the two here is just .extract() anyway. Basically, check my XPath for the name, room, square, and floor.
import requests as rq
from lxml import html
url = "http://www.lun.ua/%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-%D0%BA%D0%B8%D0%B5%D0%B2"
r = rq.get(url)
tree = html.fromstring(r.text)
divs = tree.xpath("//div[@class='obj-left']")
for div in divs:
name = div.xpath("./h3/span/a/text()")[0]
details = div.xpath(".//div[@class='obj-params-col'][1]")[0]
room = details.xpath("./p[1]/text()[last()]")[0]
square = details.xpath("./p[2]/text()[last()]")[0]
floor = details.xpath("./p[3]/text()[last()]")[0]
print name.encode("utf-8")
print room.encode("utf-8")
print square.encode("utf-8")
print floor.encode("utf-8")
These don't all print cleanly on my end (I get some [Decode error - output not utf-8]). However, encoding aside, I believe this approach is much better scraping practice overall.
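If the decode errors come from the console rather than the data, one workaround in Python 2 is to wrap stdout and then print the unicode values (e.g. print name) directly, instead of calling .encode("utf-8") on each one. A sketch, assuming a UTF-8-capable terminal:
import sys
import codecs
# Encode on the way out so plain `print name` works with unicode objects.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)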
Let us know what you think.
I'm building a Django project that uses Jinja2 for templates. Some page contents are submitted by the client with a WYSIWYG editor, and things are going fine on the detail pages.
But the list pages render the sliced contents incorrectly.
My code:
<div class="summary ">
<div class="content">{{ question.content[:200]|e}}...</div>
</div>
But the output is:
<p>what i want to show here is raw text without markups</p>...
The expected result is that HTML markup like <p></p>, <section>, etc. is gone (filtered out) and only the raw text shows!
So how can I fix it? Thanks in advance!
Use the striptags filter:
striptags(value)
Strip SGML/XML tags and replace adjacent whitespace by one space.
<div class="content">{{ question.content|striptags}}...</div>
Jinja2 striptags filter test will also help you to understand how it works.
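To see the combined effect with your 200-character cut-off outside Django, here is a minimal sketch using a bare Jinja2 Environment (the sample content is made up):
from jinja2 import Environment

env = Environment()
tmpl = env.from_string(u'{{ question.content|striptags|truncate(200) }}')
print(tmpl.render(question={'content': u'<p>raw text without markups</p>'}))
# -> raw text without markups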
Hope that helps.
Got a very particular problem here:
I've been developing a tumblr-hosted site locally, using the API to pull in posts without having to copy and paste the project into tumblr a million times. I decided I liked the API better and would just use that in production, but now that it's time to deploy I realize that I have to go back to the custom theme, {block:Posts} method.
I have the posts feeding into a Cycle2 slideshow, with 3 slides containing 3 posts each for a total of 9 playlists viewable without going back to the archive. This method works perfectly with the API, but gets messed up in the custom theme. Here's my current code:
<div class="cycle-slideshow">
{block:Posts}
{block:Text}
<div class="slide-wrapper">
<div class="post">
{block:Post1}
{block:Title}<h2>{Title}</h2>{/block:Title}
<div class="blog_item">
{Body}
</div>
{/block:Post1}
</div>
<!--two more posts before end of slide... -->
</div>
{/block:Text}
{/block:Posts}
</div> <!-- end of slide wrapper; 2 more of these before the end of the slideshow div -->
I also tried scrapping the post numbers, but still no dice. In Tumblr's docs, they say that
Example: {block:Post5}I'm the fifth post!{/block:Post5} will only be rendered on the fifth post being displayed.
I'm wondering if "being displayed" refers to the HTML visibility of the post, and if so, whether that's interfering with the Cycle plugin. The result is one ill-formatted post per slide, and then, after cycling through 2 blank slides, the next-oldest post takes its place. I'll be pleasantly surprised if anybody has ever had a similar problem, but I would kill for some advice. Here's the development site for reference (the second carousel is working because it's still hooked up to the API). Thanks!!
Generally speaking, the following code is what you'd want in order to have 3 slideshows with 3 posts each.
Note that in the Additional Settings on the Customize screen, you'd have to set the post count to 9 per page in order for this to work properly. I wrapped it in an Index Page block, otherwise this is going to look nasty on a Permalink Page.
{block:IndexPage}
{block:Posts}
{block:Post1}<div class="cycle-slideshow">{/block:Post1}
{block:Post4}<div class="cycle-slideshow">{/block:Post4}
{block:Post7}<div class="cycle-slideshow">{/block:Post7}
<div class="slide-wrapper">
{block:Text}
<div class="post">
{block:Title}<h2>{Title}</h2>{/block:Title}
<div class="blog_item">
{Body}
</div>
</div>
{/block:Text}
{block:Photo}
...
{/block:Photo}
...
</div>
{block:Post3}</div>{/block:Post3}
{block:Post6}</div>{/block:Post6}
{block:Post9}</div>{/block:Post9}
{/block:Posts}
{/block:IndexPage}
However, if you're wanting 3 slideshows with the post types split between the slideshows, the code would look more like the following.
Note that in this scenario, if you were to have 4 texts posts out of 9, all 4 text posts would end up in the Text slideshow. You'd have to use Javascript or CSS to remove or hide the additional posts if you're very strict about your 3.
{block:IndexPage}
<div class="cycle-slideshow">
{block:Posts}
{block:Text}
<div class="slide-wrapper">
<div class="post">
{block:Title}<h2>{Title}</h2>{/block:Title}
<div class="blog_item">
{Body}
</div>
</div>
</div>
{/block:Text}
{/block:Posts}
</div>
<div class="cycle-slideshow">
{block:Posts}
{block:Photo}
<div class="slide-wrapper">
...
</div>
{/block:Photo}
{/block:Posts}
</div>
{/block:IndexPage}
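Unrelated to the theme code, but since you contrasted it with the API route you used in development: pulling the same nine text posts through the v2 API looks roughly like this (the blog name and API key are placeholders):
import requests

resp = requests.get(
    'https://api.tumblr.com/v2/blog/example.tumblr.com/posts/text',
    params={'api_key': 'YOUR_API_KEY', 'limit': 9},
)
for post in resp.json()['response']['posts']:
    print(post['title'], len(post['body']))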
If you need me to clarify anything, let me know.