Create List from path expression with Python - python-2.7

I'm currently working on a web scraper without any frameworks and experiencing an issue where I test an XPath expression to, say, get the table data on a Wikipedia page. However, when I scrape it and print it to the console, it only returns an empty list. Can anyone please advise? And perhaps suggest some useful books on XPath for web scraping? (I have Safari Books Online, if that helps.)
import requests
from lxml import html
page = requests.get('https://en.wikipedia.org/wiki/L.A.P.D._(band)')
tree = html.fromstring(page.content)
# OK
bandName = tree.xpath('//*[@id="firstHeading"]/text()')
overview = tree.xpath('//*[@id="mw-content-text"]/p[1]//text()')
print(bandName)
print(overview)
#Trouble Code
yearsActive = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[6]//text()')
print(yearsActive)
members = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[11]/td[1]/ul/li/a//text()')
print(members)
UPDATE: While conducting more testing I discovered that print(len(members)) returns zero, which seems to indicate something is wrong with my XPath expression; yet when I test the members expression in the Chrome console it returns a list of band members.

Your XPath fails because the raw HTML tables don't have tbody elements. The tbody elements in this case are likely generated by the browser (see the related question below):
>>> yearsActive = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[6]/td/text()')
>>> print yearsActive
[u'1989\u20131992']
>>> members = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[10]/td[1]//text()[normalize-space()]')
>>> print members
['James Shaffer', 'Reginald Arvizu', 'David Silveria', '\nRichard Morrill', '\nPete Capra', '\nCorey (surname unknown)', '\nDerek Campbell', '\nTroy Sandoval', '\nJason Torres', '\nKevin Guariglia']
In the future, it is often useful to inspect the HTML you actually receive from requests.get(), in case an XPath that works fine in browser tools unexpectedly fails when run from code.
Related : Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?
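As a quick illustration of the tbody pitfall, here is a minimal, self-contained sketch using an inline HTML string instead of the live Wikipedia page (the markup is a made-up stand-in, not the real page):

```python
from lxml import html

# Raw markup as a server typically sends it: there is no <tbody>.
# Browsers insert <tbody> when building the DOM, which is why
# DevTools-copied XPaths often contain it.
raw = b"<html><body><table id='t'><tr><td>1989-1992</td></tr></table></body></html>"
tree = html.fromstring(raw)

# The DevTools-style path with /tbody/ matches nothing in the raw markup...
with_tbody = tree.xpath('//table[@id="t"]/tbody/tr/td/text()')
# ...while the same path without /tbody/ finds the cell.
without_tbody = tree.xpath('//table[@id="t"]/tr/td/text()')
print(with_tbody)      # []
print(without_tbody)   # ['1989-1992']
```

Note that lxml's HTML parser (libxml2) does not insert tbody on its own, so the tree you query matches the raw markup, not the browser's DOM.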

Related

Scraping Aliexpress site with Python don't Give me Correct Result

I have a problem scraping the AliExpress site.
https://www.aliexpress.com/item/Free-gift-100-Factory-Original-Unlocked-Apple-iphone-4G-8GB-16GB-32GB-Cell-phone-3-5/32691056589.html
This is one example URL. This is what I want to get:
r = requests.get('https://www.aliexpress.com/item/Free-gift-100-Factory-Original-Unlocked-Apple-iphone-4G-8GB-16GB-32GB-Cell-phone-3-5/32691056589.html')
With BeautifulSoup:
content = soup.find('div', {'id':'j-product-tabbed-pane'})
With lxml parsing:
root = html.fromstring(r.content)
results = root.xpath('//img[@alt="aeProduct.getSubject()"]')
f = open('result.html', 'w')
f.write(html.tostring(results[0]))
f.close()
This is my code, but it gives me a wrong result. Inspecting in the browser shows those elements, but the code above doesn't give me anything. I think requests.get doesn't return the correct contents. But why, and how can I solve this problem? Do they detect me as a bot? Can anyone help?
Thank you, everyone.
Try this:
1. Use a user agent.
2. Use a proxy.
3. Disable JavaScript for this site and refresh it, then check whether the page still has the element. If it is loaded by JavaScript, you should find a way to render the JS.

How to scrape hidden text from a web page?

I am trying to scrape some text from a web page. On my web page there is a list of words being shown. Some of them are visible, and some others become visible when I click on "+ More". Once clicked, the list of words is always the same (same order, same words). However, some of them are in bold and some are struck through (deleted). So basically each item of the database has some features. What I would like to do: for each item, tell me which features are available and which are not. My problem is to overcome the "+ More" button.
My script works fine only for those words which are shown, not for those hidden behind "+ More". What I am trying to do is collect all the words that fall under the del node. I initially thought that, through lxml, the web page would be loaded as it appears in Chrome's Inspect Element, and I wrote my code accordingly:
from lxml import html
tree = html.fromstring(br.open(current_url).get_data())
mydata = {}
if len(tree.xpath('//del[text()="some text"]')) > 0:
    mydata['some text'] = 'text is deleted from the web page!'
else:
    mydata['some text'] = 'text is not deleted'
Every time I ran this code, what I collected was only the data shown on the web page by default, not the complete list of words that would appear after clicking "+ More".
I have tried Selenium, but as far as I understand it is meant not for parsing but for interacting with the web page. However, if I run this:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.mywebpage.co.uk')
a = driver.find_element_by_xpath('//del[text()="some text"]')
I either get the element or an error. I would like to get an empty list so I could do:
mydata = {}
if len(driver.find_element_by_xpath('//del[text()="some text"]')) > 0:
    mydata['some text'] = 'text is deleted from the web page!'
else:
    mydata['some text'] = 'text is not deleted'
or find another way to get these "hidden" elements captured by the script.
My question is: has anyone had this type of problem? How did they sort it out?
If I understand correctly, you want to find the element in a list. However, Selenium throws a NoSuchElementException if the element is not present on the page, instead of returning a list.
The question I have is why do you want a list? Judging by your example you want to see if an element is present on the page or not. You can easily achieve this by using a try/except.
from selenium.common.exceptions import NoSuchElementException
try:
    driver.find_element_by_xpath('//del[text()="some text"]')
    mydata['some text'] = 'text is deleted from the web page!'
except NoSuchElementException:
    mydata['some text'] = 'text is not deleted'
Now, if you really need a list, you could search the page for multiple elements. This returns all the elements that match the locator in a list.
To do this replace:
driver.find_element_by_xpath('//del[text()="some text"]')
with the plural elements form:
driver.find_elements_by_xpath('//del[text()="some text"]')

Having issues with Python xpath scraping

I'm back again with a question for the wonderful people here :)
I've recently begun getting back into Python (50% done at Codecademy, lol) and decided to make a quick script for web-scraping the spot price of gold in CAD. This will eventually be part of a much bigger script... but I'm VERY rusty and thought it would be a good project.
My issue:
I have been following the guide over at http://docs.python-guide.org/en/latest/scenarios/scrape/ to accomplish my goal, however my script always returns/prints
<Element html at 0xRANDOM>
with RANDOM being (I assume) a random hex number. This happens no matter what website I use.
My Code:
#!/bin/python
# Scrape current gold spot price in CAD
from lxml import html
import requests

def scraped_price():
    page = requests.get('http://goldprice.org/gold-price-canada.html')
    tree = html.fromstring(page.content)
    print "The full page is: ", tree  # added for debug; WHERE ERROR OCCURS
    bid = tree.xpath("//span[@id='gpotickerLeftCAD_price']/text()")
    print "Scraped content: ", bid
    return bid

gold_scraper = scraped_price()
My research:
1) www.w3schools.com/xsl/xpath_syntax.asp
This is where I figured out to use '//span' to find all span elements and then used the @id to narrow it down to the one I need.
2)Scraping web content using xpath won't work
This makes me think I simply have a bad tree.xpath setup. However I cannot seem to figure out where or why.
Any assistance would be greatly appreciated.
<Element html at 0xRANDOM>
What you see printed is the string representation of lxml.html's Element class. If you want to see the actual HTML content, use tostring():
print(html.tostring(tree, pretty_print=True))
You are also getting Scraped content: [] printed, which means that no elements matched the locator. And if you look at the printed HTML, there is indeed no element with id="gpotickerLeftCAD_price" in the downloaded source.
The prices on this particular site are retrieved dynamically with continuous JSONP GET requests issued periodically. You can either look into simulating these requests, or stay on a higher level automating a browser via selenium. Demo (using PhantomJS headless browser):
>>> import time
>>> from selenium import webdriver
>>>
>>> driver = webdriver.PhantomJS()
>>> driver.get("http://goldprice.org/gold-price-canada.html")
>>> while True:
...     print(driver.find_element_by_id("gpotickerLeftCAD_price").text)
...     time.sleep(1)
...
1,595.28
1,595.28
1,595.28
1,595.28
1,595.28
1,595.19
...

Python lxml xpath no output

For educational purposes I am trying to scrape this page using lxml and requests in Python.
Specifically I just want to print the research areas of all the professors on the page.
This is what I have done till now
import requests
from lxml import html
response=requests.get('http://cse.iitkgp.ac.in/index.php?secret=d2RkOUgybWlNZzJwQXdLc28wNzh6UT09')
parsed_body=html.fromstring(response.content)
for row in parsed_body.xpath('//div[@id="maincontent"]//tr[position() mod 2 = 1]'):
    for column in row.xpath('//td[@class="fcardcls"]/tr[2]/td/font/text()'):
        print column.strip()
But it is not printing anything. I was struggling quite a bit with XPaths and was initially using the copy-XPath feature in Chrome. I followed what was done in the following SO questions/answers, cleaned up my code quite a bit, and got rid of tbody in the XPaths. Still the code returns a blank.
1. Empty List Returned
2. Python-lxml-xpath problem
First of all, the main content with the desired data inside is loaded from a different endpoint via an XHR request - simulate that in your code.
Here is the complete working code printing names and a list of research areas per name:
import requests
from lxml import html

response = requests.get('http://cse.iitkgp.ac.in/faculty4.php?_=1450503917634')
parsed_body = html.fromstring(response.content)
for row in parsed_body.xpath('.//td[@class="fcardcls"]'):
    name = row.findtext(".//a[@href]/b")
    name = ' '.join(name.split())  # getting rid of multiple spaces
    research_areas = row.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", ")
    print(name, research_areas)
The idea here is to use the fact that all "professor blocks" are located in td elements with class="fcardcls". For every block, get the name from the bold link text and the research areas from the string following the Research Areas: bold text.
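The same extraction pattern can be tried on an inline snippet first. The markup below is a made-up stand-in for one faculty block (the exact tag layout is an assumption mirroring the description above):

```python
from lxml import html

# Hypothetical stand-in for one td.fcardcls block from the page.
snippet = (b"<html><body><table><tr>"
           b"<td class='fcardcls'>"
           b"<a href='#'><b>Jane  Doe</b></a>"
           b"<b>Research Areas: </b>Networks, Security"
           b"</td></tr></table></body></html>")
tree = html.fromstring(snippet)

row = tree.xpath('.//td[@class="fcardcls"]')[0]
name = ' '.join(row.findtext(".//a[@href]/b").split())  # collapse the double space
areas = row.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", ")
print(name)   # Jane Doe
print(areas)  # ['Networks', 'Security']
```

Testing the expressions on a small controlled snippet like this makes it easy to tell an XPath problem apart from a download problem.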

Scraping data off flipkart using scrapy

I am trying to scrape some information from flipkart.com; for this purpose I am using Scrapy. The information I need is for every product on Flipkart.
I have used the following code for my spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import TutorialItem

class WebCrawler(CrawlSpider):
    name = "flipkart"
    allowed_domains = ['flipkart.com']
    start_urls = ['http://www.flipkart.com/store-directory']
    rules = [
        Rule(LinkExtractor(allow=['/(.*?)/p/(.*?)']), 'parse_flipkart', cb_kwargs=None, follow=True),
        Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), follow=True)
    ]

    @staticmethod
    def parse_flipkart(response):
        hxs = HtmlXPathSelector(response)
        item = TutorialItem()
        item['featureKey'] = hxs.select('//td[@class="specsKey"]/text()').extract()
        yield item
My intent is to crawl through every product category page (matched by the second rule) and follow the product pages (first rule) within each category page to scrape data from them.
One problem is that I cannot find a way to control the crawling and scraping.
Second flipkart uses ajax on its category page and displays more products when a user scrolls to the bottom.
I have read other answers and assessed that selenium might help solve the issue. But I cannot find a proper way to implement it into this structure.
Suggestions are welcome..:)
ADDITIONAL DETAILS
I had earlier used a similar approach
the second rule I used was
Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), 'parse_category', follow=True)

@staticmethod
def parse_category(response):
    hxs = HtmlXPathSelector(response)
    count = hxs.select('//td[@class="no_of_items"]/text()').extract()
    for num in range(1, int(count[0]), 15):
        ajax_url = response.url + "&start=" + str(num) + "&ajax=true"
        return Request(ajax_url, callback="parse_category")
Now I was confused about whether to use "parse_category" or "parse_flipkart" as the callback.
Thank you for your patience
Not sure what you mean when you say that you can't find a way to control the crawling and scraping. Creating a spider for this purpose is already taking it under control, isn't it? If you create proper rules and parse the responses properly, that is all you need. In case you are referring to the actual order in which the pages are scraped, you most likely don't need to do this. You can just parse all the items in whichever order, but gather their location in the category hierarchy by parsing the breadcrumb information above the item title. You can use something like this to get the breadcrumb in a list:
response.css(".clp-breadcrumb").xpath('./ul/li//text()').extract()
You don't actually need Selenium, and I believe it would be overkill for this simple issue. Using your browser (I'm using Chrome currently), press F12 to open the developer tools. Go to one of the category pages and open the Network tab in the developer window. If there is anything there, click the Clear button to tidy things up a bit. Now scroll down until you see that additional items are being loaded, and you will see additional requests listed in the Network panel. Filter them by Documents and click on one of the requests in the left pane. You can see the URL for the request and the query parameters that you need to send. Note the start parameter, which will be the most important, since you will have to issue this request multiple times while increasing this value to get new items. You can check the response in the Preview pane, and you will see that the response from the server is exactly what you need: more items. The rule you use for the items should pick up those links too.
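The start-parameter pagination described above can be sketched as a small helper that builds the ajax URLs up front. The start and ajax=true parameter names come from the question; the page size of 15 and the category URL are assumptions for illustration:

```python
def ajax_page_urls(base_url, total_items, page_size=15):
    """Build the ajax URLs for one category by stepping the 'start' parameter."""
    return ["%s&start=%d&ajax=true" % (base_url, start)
            for start in range(1, total_items, page_size)]

# Hypothetical category URL; inside a spider, base_url would be response.url
# and total_items would come from the no_of_items cell.
urls = ajax_page_urls("http://www.flipkart.com/mobiles/pr?sid=xyz", 40)
print(urls)
```

In the spider, each of these URLs would be yielded as a Request whose callback parses the returned items.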
For a more detailed overview of scraping with Firebug, you can check out the official documentation.
Since there is no need to use Selenium for your purpose, I shall not cover this point beyond adding a few links that show how to use Selenium with Scrapy, should the need ever arise:
https://gist.github.com/cheekybastard/4944914
https://gist.github.com/irfani/1045108
http://snipplr.com/view/66998/