How to scrape hidden text from a web page? - python-2.7

I am trying to scrape some text from a web page. The page shows a list of words: some are visible right away, others only become visible when I click on "+ More". Once clicked, the list of words is always the same (same order, same words). However, some of the words are in bold and some are struck through (wrapped in del tags). So basically each item of the database has some features, and what I would like to do is, for each item, tell which features are present and which are not. My problem is getting past the "+ More" button.
My script works fine only for the words that are shown initially, not for those hidden behind "+ More". What I am trying to do is collect all the words that fall under a del node. I initially thought that, with lxml, the web page would be loaded as it appears in Chrome's inspect-element view, and I wrote my code accordingly:
from lxml import html

# br is the existing browser object used to fetch the page (e.g. mechanize)
tree = html.fromstring(br.open(current_url).get_data())
mydata = {}
if len(tree.xpath('//del[text()="some text"]')) > 0:
    mydata['some text'] = 'text is deleted from the web page!'
else:
    mydata['some text'] = 'text is not deleted'
Every time I run this code, what I collect is only the part of the data that is shown on the web page by default, not the complete list of words that would appear after clicking "+ More".
I have tried Selenium, but as far as I understand it is meant for interacting with the web page rather than for parsing it. However, if I run this:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.mywebpage.co.uk')
a = driver.find_element_by_xpath('//del[text()="some text"]')
I either get the element or an error. I would like to get an empty list so I could do:
mydata = {}
if len(driver.find_element_by_xpath('//del[text()="some text"]')) > 0:
    mydata['some text'] = 'text is deleted from the web page!'
else:
    mydata['some text'] = 'text is not deleted'
or find another way to get these "hidden" elements captured by the script.
My question is: has anyone had this type of problem? How did you sort it out?

If I understand correctly, you want to find the element as part of a list. However, Selenium's find_element_by_xpath raises a NoSuchElementException if the element is not present on the page, instead of returning a list.
The question I have is why do you want a list? Judging by your example you want to see if an element is present on the page or not. You can easily achieve this by using a try/except.
from selenium.common.exceptions import NoSuchElementException

try:
    driver.find_element_by_xpath('//del[text()="some text"]')
    mydata['some text'] = 'text is deleted from the web page!'
except NoSuchElementException:
    mydata['some text'] = 'text is not deleted'
Now, if you really need a list, you can search the page for multiple elements instead. This returns all the elements that match the locator as a list, which is simply empty when there is no match.
To do this, replace:
driver.find_element_by_xpath('//del[text()="some text"]')
with the plural find_elements variant:
driver.find_elements_by_xpath('//del[text()="some text"]')
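For example, a minimal sketch of that list-based check, reusing the URL, XPath and dictionary pattern from the question:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.mywebpage.co.uk')

mydata = {}
# find_elements_* returns a list, which is simply empty when nothing matches
matches = driver.find_elements_by_xpath('//del[text()="some text"]')
if len(matches) > 0:
    mydata['some text'] = 'text is deleted from the web page!'
else:
    mydata['some text'] = 'text is not deleted'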

Related

browser.click() & browser.send_keys() conflict - Selenium 3.0 Python 2.7

I am currently trying to implement a subtitle downloader with the help of the http://www.yifysubtitles.com website.
The first part of my code is to click on the accept cookies button and then send keys to search the movie of interest.
url = "http://www.yifysubtitles.com"
profile = SetProfile() # A function returning my favorite profile for Firefox
browser = webdriver.Firefox(profile)
WindowSize(400, 400)
browser.get(url)
accept_cookies = WebDriverWait(browser, 100).until(
EC.element_to_be_clickable((By.CLASS_NAME, "cc_btn.cc_btn_accept_all")))
accept_cookies_btn = browser.find_element_by_class_name("cc_btn.cc_btn_accept_all")
accept_cookies_btn.click()
search_bar = browser.find_element_by_id("qSearch")
search_bar.send_keys("Harry Potter and the Chamber of Secrets")
search_bar.send_keys(Keys.RETURN)
print "Succesfully clicked!"
But it only works once, if not randomly. If I turn on my computer and run the code, it does click, makes the search and prints the last statement. The second time, it doesn't click but still makes the search and prints the final statement.
After each try, I close the session with the browser.quit() method.
Any idea on what might be the issue here?
Adding explicit waits for both the button and the search bar should solve your problem; see the sketch below.
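A rough sketch of what those waits could look like, reusing the locators from the question (untested):

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 10)

# Wait until the cookie button is actually clickable, then click the returned element
accept_cookies_btn = wait.until(
    EC.element_to_be_clickable((By.CLASS_NAME, "cc_btn.cc_btn_accept_all")))
accept_cookies_btn.click()

# Likewise wait for the search bar instead of locating it immediately
search_bar = wait.until(
    EC.presence_of_element_located((By.ID, "qSearch")))
search_bar.send_keys("Harry Potter and the Chamber of Secrets")
search_bar.send_keys(Keys.RETURN)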
Thanks,D

Create List from path expression with Python

I'm currently working on a web scraper without any frameworks, and I'm running into an issue: I test an XPath expression to, say, get the table data on a Wikipedia page, but when I scrape it and print the result to the console I only get an empty list. Can anyone please advise, and perhaps suggest some useful books on XPath for web scraping? (I have safaribooks, if that helps.)
import requests
from lxml import html

page = requests.get('https://en.wikipedia.org/wiki/L.A.P.D._(band)')
tree = html.fromstring(page.content)

# OK
bandName = tree.xpath('//*[@id="firstHeading"]/text()')
overview = tree.xpath('//*[@id="mw-content-text"]/p[1]//text()')
print(bandName)
print(overview)

# Trouble code
yearsActive = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[6]//text()')
print(yearsActive)
members = tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody/tr[11]/td[1]/ul/li/a//text()')
print(members)
UPDATE: While conducting more testing I discovered that print(len(members)) returns zero, which seems to indicate something is wrong with my XPath expression. Yet when I test the members expression in the Chrome console it returns a list of band members.
Your XPath fails because the raw HTML tables don't have tbody elements. The tbody elements you see are generated by the browser (see the related question linked below):
>>> yearsActive = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[6]/td/text()')
>>> print yearsActive
[u'1989\u20131992']
>>> members = tree.xpath('//*[@id="mw-content-text"]/table[1]/tr[10]/td[1]//text()[normalize-space()]')
>>> print members
['James Shaffer', 'Reginald Arvizu', 'David Silveria', '\nRichard Morrill', '\nPete Capra', '\nCorey (surname unknown)', '\nDerek Campbell', '\nTroy Sandoval', '\nJason Torres', '\nKevin Guariglia']
In the future, it is often useful to inspect the HTML that you actually receive from requests.get(), in case an XPath that works fine in the browser tools unexpectedly fails when run from your code.
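For example, a quick sanity check on the served markup (reusing the requests/lxml setup from the question) should make the missing tbody visible:

import requests
from lxml import html

page = requests.get('https://en.wikipedia.org/wiki/L.A.P.D._(band)')
tree = html.fromstring(page.content)

# If the served HTML really has no tbody wrappers, this prints 0 ...
print len(tree.xpath('//*[@id="mw-content-text"]/table[1]/tbody'))
# ... while dropping tbody from the path still finds the table rows
print len(tree.xpath('//*[@id="mw-content-text"]/table[1]/tr'))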
Related: Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?

Python lxml xpath no output

For educational purposes I am trying to scrape this page using lxml and requests in Python.
Specifically I just want to print the research areas of all the professors on the page.
This is what I have done till now
import requests
from lxml import html

response = requests.get('http://cse.iitkgp.ac.in/index.php?secret=d2RkOUgybWlNZzJwQXdLc28wNzh6UT09')
parsed_body = html.fromstring(response.content)
for row in parsed_body.xpath('//div[@id="maincontent"]//tr[position() mod 2 = 1]'):
    for column in row.xpath('//td[@class="fcardcls"]/tr[2]/td/font/text()'):
        print column.strip()
But it is not printing anything. I was struggling quite a bit with XPaths and was initially using the "copy XPath" feature in Chrome. I followed what was done in the following SO questions/answers, cleaned up my code quite a bit and got rid of tbody in the XPaths. Still the code returns a blank.
1. Empty List Returned
2. Python-lxml-xpath problem
First of all, the main content with the desired data inside is loaded from a different endpoint via an XHR request - simulate that in your code.
Here is the complete working code printing names and a list of research areas per name:
import requests
from lxml import html

response = requests.get('http://cse.iitkgp.ac.in/faculty4.php?_=1450503917634')
parsed_body = html.fromstring(response.content)

for row in parsed_body.xpath('.//td[@class="fcardcls"]'):
    name = row.findtext(".//a[@href]/b")
    name = ' '.join(name.split())  # getting rid of multiple spaces

    research_areas = row.xpath('.//*[. = "Research Areas: "]/following-sibling::text()')[0].split(", ")
    print(name, research_areas)
The idea here is to use the fact that all "professor blocks" are located in td elements with class="fcardcls". For every block, get the name from the bold link text, and the research areas from the text node that follows the bold "Research Areas: " label.

How to scrape through search results spanning multiple pages with lxml

I'm using lxml to scrape a site. I want to scrape through a search result that contains 194 items, but my scraper is only able to scrape the first page of the results. How can I scrape the rest of the search results?
url = 'http://www.alotofcars.com/new_car_search.php?pg=1&byshowroomprice=0.5-500&bycity=Gotham'
response_object = requests.get(url)
# Build DOM tree
dom_tree = html.fromstring(response_object.text)
After this come the scraping functions:
def enter_mmv_in_database(dom_tree, engine):
    # Getting make, model, variant
    name_selector = CSSSelector('[class="secondary-cell"] p a')
    name_results = name_selector(dom_tree)
    for n in name_results:
        mmv = str(`n.text_content()`).split('\\xa0')
        make, model, variant = mmv[0][2:], mmv[1], mmv[2][:-2]
        # Now push make, model, variant in Database
        print make, model, variant
Looking at the list I receive, I can see that only the first page of search results is parsed. How can I parse the whole search result?
I've tried to navigate through that website but it seems to be offline. Yet, I would like to help with the logic.
What I usually do is:
Make a request to the search URL (with parameters filled)
With lxml, extract the number of the last available page from the pagination div (see the sketch after the loop below).
Loop from first page to the last one, making requests and scraping desired data:
for page_number in range(1, last + 1):
    # make a request, replacing 'page_number' in the 'pg' GET variable
    url = "http://www.alotofcars.com/new_car_search.php?pg={}&byshowroomprice=0.5-500&bycity=Gotham".format(page_number)
    response_object = requests.get(url)
    dom_tree = html.fromstring(response_object.text)
    ...
    ...
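Since the site appears to be offline I can't check its markup, so here is only a rough sketch of step 2; the pagination div class and the numeric link text are assumptions that would need adjusting to the real page. It would run on the dom_tree built from the first request, before the loop above:

# Hypothetical pagination markup: adjust the div class / link structure to the real page
page_links = dom_tree.xpath('//div[@class="pagination"]//a/text()')
numbers = [int(text) for text in page_links if text.strip().isdigit()]
last = max(numbers) if numbers else 1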
I hope this helps. Let me know if you have any further questions.

Scraping data off flipkart using scrapy

I am trying to scrape some information from flipkart.com; for this purpose I am using Scrapy. The information I need is for every product on Flipkart.
I have used the following code for my spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import TutorialItem

class WebCrawler(CrawlSpider):
    name = "flipkart"
    allowed_domains = ['flipkart.com']
    start_urls = ['http://www.flipkart.com/store-directory']
    rules = [
        Rule(LinkExtractor(allow=['/(.*?)/p/(.*?)']), 'parse_flipkart', cb_kwargs=None, follow=True),
        Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), follow=True)
    ]

    @staticmethod
    def parse_flipkart(response):
        hxs = HtmlXPathSelector(response)
        item = FlipkartItem()
        item['featureKey'] = hxs.select('//td[@class="specsKey"]/text()').extract()
        yield item
My intent is to crawl through every product category page (specified by the second rule) and follow the product pages (first rule) within each category page, scraping data from the product pages.
One problem is that I cannot find a way to control the crawling and scraping.
Second, Flipkart uses AJAX on its category pages and displays more products when a user scrolls to the bottom.
I have read other answers and assessed that selenium might help solve the issue. But I cannot find a proper way to implement it into this structure.
Suggestions are welcome..:)
ADDITIONAL DETAILS
I had earlier used a similar approach
The second rule I used was:
Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), 'parse_category', follow=True)
@staticmethod
def parse_category(response):
    hxs = HtmlXPathSelector(response)
    count = hxs.select('//td[@class="no_of_items"]/text()').extract()
    for num in range(1, count, 15):
        ajax_url = response.url + "&start=" + str(num) + "&ajax=true"
        return Request(ajax_url, callback="parse_category")
Now I was confused about what to use for the callback: "parse_category" or "parse_flipkart".
Thank you for your patience
Not sure what you mean when you say that you can't find a way to control the crawling and scraping. Creating a spider for this purpose is already taking it under control, isn't it? If you create proper rules and parse the responses properly, that is all you need. In case you are referring to the actual order in which the pages are scraped, you most likely don't need to do this. You can just parse all the items in whichever order, but gather their location in the category hierarchy by parsing the breadcrumb information above the item title. You can use something like this to get the breadcrumb in a list:
response.css(".clp-breadcrumb").xpath('./ul/li//text()').extract()
You don't actually need Selenium, and I believe it would be overkill for this simple issue. Using your browser (I'm using Chrome currently), press F12 to open the developer tools. Go to one of the category pages and open the Network tab in the developer window. If there is anything there, click the Clear button to tidy things up a bit. Now scroll down until you see that additional items are being loaded, and you will see additional requests listed in the Network panel. Filter them by Documents and click on one of them; you can see the URL of the request and the query parameters that you need to send. Note the start parameter, which will be the most important one, since you will have to call this request multiple times while increasing its value to get new items. If you check the response in the Preview pane, you will see that what the server returns is exactly what you need: more items. The rule you use for the items should pick up those links too.
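If you would rather drive those AJAX pages explicitly from a spider callback instead of relying on the rules, a rough, untested sketch could look like this; the start and ajax parameters come from your own snippet, the '/p/' pattern from your first rule, and the item-count extraction will likely need cleaning:

import urlparse
from scrapy.http import Request

# method of your WebCrawler spider
def parse_category(self, response):
    # Total number of items in this category (selector taken from your snippet)
    count = int(response.xpath('//td[@class="no_of_items"]/text()').extract()[0])

    # Only the initial category page schedules the AJAX pages, 15 items at a time
    if "&start=" not in response.url:
        for start in range(15, count, 15):
            ajax_url = response.url + "&start=" + str(start) + "&ajax=true"
            yield Request(ajax_url, callback=self.parse_category)

    # Follow the product links found in this chunk and parse them as products
    for href in response.xpath('//a[contains(@href, "/p/")]/@href').extract():
        yield Request(urlparse.urljoin(response.url, href), callback=self.parse_flipkart)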
For a more detailed overview of scraping with Firebug, you can check out the official documentation.
Since there is no need to use Selenium for your purpose, I won't cover it beyond adding a few links that show how to use Selenium with Scrapy, should the need ever occur:
https://gist.github.com/cheekybastard/4944914
https://gist.github.com/irfani/1045108
http://snipplr.com/view/66998/