How to scrape through search results spanning multiple pages with lxml - python-2.7

I'm using lxml to scrape through a site. I want to scrape through a search result, that contains 194 items. My scraper is able to scrape only the first page of search results. How can I scrape the rest of the search results?
import requests
from lxml import html
from lxml.cssselect import CSSSelector

url = 'http://www.alotofcars.com/new_car_search.php?pg=1&byshowroomprice=0.5-500&bycity=Gotham'
response_object = requests.get(url)
# Build DOM tree
dom_tree = html.fromstring(response_object.text)
After this there are scraping functions
def enter_mmv_in_database(dom_tree, engine):
    # Getting make, model, variant
    name_selector = CSSSelector('[class="secondary-cell"] p a')
    name_results = name_selector(dom_tree)
    for n in name_results:
        mmv = str(`n.text_content()`).split('\\xa0')
        make, model, variant = mmv[0][2:], mmv[1], mmv[2][:-2]
        # Now push make, model, variant in Database
        print make, model, variant
By looking at the list I receive, I can see that only the first page of search results is parsed. How can I parse the whole set of search results?

I've tried to navigate to that website but it seems to be offline, so I'll help with the logic instead.
What I usually do is:
Make a request to the search URL (with the parameters filled in).
With lxml, extract the number of the last available page from the pagination div (see the sketch below the loop example).
Loop from the first page to the last one, making requests and scraping the desired data:
for page_number in range(1, last + 1):
    # make requests, replacing 'page_number' in the 'pg' GET variable
    url = "http://www.alotofcars.com/new_car_search.php?pg={}&byshowroomprice=0.5-500&bycity=Gotham".format(page_number)
    response_object = requests.get(url)
    dom_tree = html.fromstring(response_object.text)
    ...
    ...
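For the second step, here is a minimal sketch of how the last available page number could be pulled out of the pagination block; the '.pagination a' selector and the numeric link text are assumptions, since the site seems to be offline and I cannot check its real markup:
import requests
from lxml import html
from lxml.cssselect import CSSSelector

# fetch the first results page and parse it
first_page = requests.get('http://www.alotofcars.com/new_car_search.php?pg=1&byshowroomprice=0.5-500&bycity=Gotham')
dom_tree = html.fromstring(first_page.text)

# assumed markup: pagination links sit in an element with class "pagination"
# and the text of each numbered link is its page number ("1", "2", ..., plus "Next"/"Last")
page_links = CSSSelector('.pagination a')(dom_tree)
page_numbers = [int(a.text_content()) for a in page_links if a.text_content().strip().isdigit()]
last = max(page_numbers) if page_numbers else 1
With last known, the loop above covers every page of the 194 results.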
I hope this helps. Let me know if you have any further questions.

Related

Web crawling and extracting data using Scrapy

I am new to Python as well as Scrapy.
I am trying to crawl a seed URL https://www.health.com/patients/status/. This seed URL contains many URLs, but I want to fetch only the URLs that contain Faci/Details/#somenumber. The link structure is like below:
https://www.health.com/patients/status/ -> https://www.health.com/Faci/Details/2
                                        -> https://www.health.com/Faci/Details/3
                                        -> https://www.health.com/Faci/Details/4
https://www.health.com/Faci/Details/2   -> https://www.health.com/provi/details/64
                                        -> https://www.health.com/provi/details/65
https://www.health.com/Faci/Details/3   -> https://www.health.com/provi/details/70
                                        -> https://www.health.com/provi/details/71
Inside each https://www.health.com/Faci/Details/2 page there are links such as https://www.health.com/provi/details/64 and https://www.health.com/provi/details/65. Finally, I want to fetch some data from the https://www.health.com/provi/details/#somenumber URLs. How can I achieve this?
So far I have tried the code below from the Scrapy tutorial, and I am able to crawl only the URLs that contain https://www.health.com/Faci/Details/#somenumber; it does not go on to https://www.health.com/provi/details/#somenumber. I tried to set a depth limit in the settings.py file, but it didn't work.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from news.items import NewsItem

class MySpider(CrawlSpider):
    name = 'provdetails.com'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    rules = (
        Rule(LinkExtractor(allow=('/Faci/Details/\d+', )), follow=True),
        Rule(LinkExtractor(allow=('/provi/details/\d+', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = NewsItem()
        item['id'] = response.xpath("//title/text()").extract()
        item['name'] = response.xpath("//title/text()").extract()
        item['description'] = response.css('p.introduction::text').extract()
        filename = 'details.txt'
        with open(filename, 'wb') as f:
            f.write(item)
        self.log('Saved file %s' % filename)
        return item
Please help me to proceed further.
To be honest, the regex-based and mighty Rule/LinkExtractor often gave me a hard time. For a simple project, one approach is to extract all links on the page and then look at the href attribute. If the href matches your needs, yield a new Request for it. For instance:
from scrapy.http import Request
from scrapy.selector import Selector
...
# follow links
for href in sel.xpath('//div[@class="contentLeft"]//div[@class="pageNavigation nobr"]//a').extract():
    linktext = Selector(text=href).xpath('//a/text()').extract_first()
    if linktext and linktext.strip() == "Weiter":
        link = Selector(text=href).xpath('//a/@href').extract()[0]
        url = response.urljoin(link)
        print url
        yield Request(url, callback=self.parse)
Some remarks on your code:
response.xpath(...).extract()
This returns a list; you may want to have a look at extract_first(), which gives you the first item (or None).
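For example, applied to the fields from your spider (just an illustration):
# extract_first() gives you one string (or None) instead of a one-element list
item['id'] = response.xpath("//title/text()").extract_first()
item['description'] = response.css('p.introduction::text').extract_first()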
with open(filename, 'wb') as f:
This will overwrite the file on every item; you will only keep the last item saved. You also open the file in binary mode ('b'), but judging by the filename you want to write text. Use 'a' to append instead, and see the open() docs.
An alternative is to use the -o flag and let Scrapy's built-in feed export store the items as JSON or CSV, e.g. scrapy crawl provdetails.com -o items.json.
return item
It is good style to yield items instead of returning them; at the very least, if you need to create several items from one page, you have to yield them.
Another good approach is to use one parse function per type of page.
For instance, every page in start_urls ends up in parse(). From there you could extract the links and yield a Request for each /Faci/Details/N page with callback=parse_faci_details. In parse_faci_details() you again extract the links of interest, create Requests and pass them via callback= to e.g. parse_provi_details().
In that function you create the items you need.
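A rough sketch of that structure, using a plain scrapy.Spider instead of CrawlSpider; the link-extraction XPaths are placeholders that would need to match the real pages:
import scrapy
from news.items import NewsItem

class ProviSpider(scrapy.Spider):
    name = 'provdetails'
    allowed_domains = ['health.com']
    start_urls = ['https://www.health.com/patients/status/']

    def parse(self, response):
        # one parse function per kind of page: this one handles the status page
        for href in response.xpath('//a[contains(@href, "/Faci/Details/")]/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_faci_details)

    def parse_faci_details(self, response):
        # the facility pages link to the provider pages we actually want
        for href in response.xpath('//a[contains(@href, "/provi/details/")]/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_provi_details)

    def parse_provi_details(self, response):
        # finally, build the items here
        item = NewsItem()
        item['name'] = response.xpath("//title/text()").extract_first()
        item['description'] = response.css('p.introduction::text').extract_first()
        yield item
This way each callback deals with exactly one kind of page, which tends to be easier to debug than one big Rule set.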

Retrieve all URLs to web pages on the same domain, but no media or anchors

I'm new to Python, and extremely impressed by the number of libraries at my disposal. I already have a function that uses Beautiful Soup to extract URLs from a site, but not all of them are relevant: I only want web pages (no media) on the same website (domain or subdomain, but no other domains). I'm trying to manually program around examples I run into, but I feel like I'm reinventing the wheel - surely this is a common problem in internet applications.
Here's an example list of URLs that I might retrieve from a website, say http://example.com, with markings for whether or not I want them and why. Hopefully this illustrates the issue.
Good:
example.com/page - it links to another page on the same domain
example.com/page.html - has a filetype ending, but it's an HTML page
subdomain.example.com/page.html - it's on the same site, though on a subdomain
/about/us - it's a relative link, so it doesn't have the domain in it, but it's implied
Bad:
otherexample.com/page - bad, the domain doesn't match
example.com/image.jpg - bad, it's an image and not a page
/ - bad - sometimes there's just a slash in the "a" tag, but that's a reference to the page I'm already on
#anchor - this is also a relative link, but it's on the same page, so there's no need for it
I've been writing cases in if statements for each of these...but there has to be a better way!
Edit: Here's my current code, which returns nothing:
ignore_values = {"", "/"}

def desired_links(href):
    # ignore if href is not set
    if not href:
        return False
    # ignore if it is just a link to the same page
    if href.startswith("#"):
        return False
    # skip ignored values
    if href in ignore_values:
        return False

def explorePage(pageURL):
    # Get web page
    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open(pageURL)
    html = response.read()

    # Parse web page for links
    soup = BeautifulSoup(html, 'html.parser')
    links = [a["href"] for a in soup.find_all("a", href=desired_links)]
    for link in links:
        print(link)
    return

def main():
    explorePage("http://xkcd.com")
BeautifulSoup is quite flexible in helping you create and apply rules to attribute values. You can create a filtering function and use it as the value of the href argument to find_all().
For example, something for you to start with:
ignore_values = {"", "/"}

def desired_links(href):
    # ignore if href is not set
    if not href:
        return False
    # ignore if it is just a link to the same page
    if href.startswith("#"):
        return False
    # skip ignored values
    if href in ignore_values:
        return False
    # TODO: more rules
    # you would probably need the "urlparse" module for a proper url analysis
    return True
Usage:
links = [a["href"] for a in soup.find_all("a", href=desired_links)]
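To flesh out the TODO with the domain and file-type rules from your list, a sketch along these lines could work; the allowed domain ("example.com") and the set of extensions that count as pages are assumptions you would adapt:
from urlparse import urlparse
import posixpath

ALLOWED_DOMAIN = "example.com"  # assumption: the site being crawled
PAGE_EXTENSIONS = {"", ".html", ".htm", ".php", ".asp", ".aspx"}  # assumption: what counts as a page

def desired_links(href):
    if not href or href in ignore_values or href.startswith("#"):
        return False
    parsed = urlparse(href)
    # relative links like "/about/us" have no netloc and are implicitly on the same site
    if parsed.netloc:
        same_domain = (parsed.netloc == ALLOWED_DOMAIN or
                       parsed.netloc.endswith("." + ALLOWED_DOMAIN))
        if not same_domain:
            return False
    # reject media files by extension; keep extension-less paths and HTML-ish pages
    extension = posixpath.splitext(parsed.path)[1].lower()
    return extension in PAGE_EXTENSIONS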
You should take a look at Scrapy and its Link Extractors.

How to scrape hidden text from a web page?

I am trying to scrape some text from a web page. On my web page there is a list of words being shown. Some of them are visible, and some others only become visible when I click on "+ More". Once clicked, the list of words is always the same (same order, same words). However, some of them are in bold and some are wrapped in "del" tags. So basically each item of the database has some features. What I would like to do: for each item, tell me which features are available and which are not. My problem is overcoming the "+ More" button.
My script works fine only for those words which are shown, and not for those which are hidden behind "+ More". What I am trying to do is collect all the words that fall under the "del" node. I initially thought that, through lxml, the web page would be loaded as it appears in Chrome's Inspect Element, and I wrote my code accordingly:
from lxml import html

tree = html.fromstring(br.open(current_url).get_data())
mydata = {}
if len(tree.xpath('//del[text()="some text"]')) > 0:
    mydata['some text'] = 'text is deleted from the web page!'
else:
    mydata['some text'] = 'text is not deleted'
Every time I run this code, what I collect is only the part of the data being shown on the web page, not the complete list of words that would be shown after clicking "+ More".
I have tried Selenium, but as far as I understand it is not meant for parsing but rather for interacting with the web page. However, if I run this:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.mywebpage.co.uk')
a = driver.find_element_by_xpath('//del[text()="some text"]')
I either get the element or an error. I would like to get an empty list so I could do:
mydata = {}
if len(driver.find_element_by_xpath('//del[text()="some text"]')) > 0:
    mydata['some text'] = 'text is deleted from the web page!'
else:
    mydata['some text'] = 'text is not deleted'
or find another way to get these "hidden" elements captured by the script.
My question is: has anyone had this type of problem? How did they sort it out?
If I understand correctly, you want to find the element in a list. However, Selenium throws a NoSuchElementException if the element is not available on the page, instead of returning a list.
The question I have is why do you want a list? Judging by your example you want to see if an element is present on the page or not. You can easily achieve this by using a try/except.
from selenium.common.exceptions import NoSuchElementException

try:
    driver.find_element_by_xpath('//del[text()="some text"]')
    mydata['some text'] = 'text is deleted from the web page!'
except NoSuchElementException:
    mydata['some text'] = 'text is not deleted'
Now if you really really need this list you could search the page for multiple elements. This will return all the elements that match the locator in a list.
To do this replace:
driver.find_element_by_xpath('//del[text()="some text"]')
with the plural form (find_elements):
driver.find_elements_by_xpath('//del[text()="some text"]')
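That way your original length check works, since the plural form returns a (possibly empty) list instead of raising; a small sketch of the idea:
mydata = {}
# find_elements_* returns a list, which is empty when nothing matches
elements = driver.find_elements_by_xpath('//del[text()="some text"]')
if len(elements) > 0:
    mydata['some text'] = 'text is deleted from the web page!'
else:
    mydata['some text'] = 'text is not deleted'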

CSV parsing program & how to flatten multiple lists into a single list

I've been working on a small program that needs to do the following:
Take a CSV file 'domains_prices.csv' with a column of domains and a price for each, e.g.:
http://www.example1.com,$20
http://www.example2.net,$30
and so on
and then a second file, 'orders_list.csv', which is just a single column of blog post URLs from the same domains listed in the first file, e.g.:
http://www.example2.net/blog-post-1
http://www.example1.com/some-article
http://www.example3.net/blog-post-feb-19
and so on
I need to check the full URLs from orders_list against the domains in the first file, look up the price of a blog post on that domain, and then output all the blog post URLs into a new file with the price for each, e.g.:
http://www.example2.net/blog-post-1, $30
and then there would be a total amount at the end of the output file.
My plan was to create a dict for domains_prices with the domain and price as k, v, put all the URLs from orders_list in a list, and then compare the elements of that list against the prices in the dict.
This is my code. I'm stuck towards the end: parsed_orders_list seems to be returning all the URLs as individual lists, so I'm thinking I should put all those URLs into a single list?
Also, the commented-out code at the end is the operation I intend to do once I have the correct list of URLs, comparing them against the k, v of the dict; I'm not sure if that's correct either.
Please note this is also the first ever full Python program I've created from scratch, so if it's horrendous, that's why :)
import csv
from urlparse import urlparse

# get the csv file with all domains and prices in
reader = csv.reader(open("domains_prices.csv", 'r'))
# get all the completed blog post urls
reader2 = csv.reader(open('orders_list.csv', 'r'))

domains_prices = {}
orders_list = []

for row in reader2:
    # put the blog post urls into a list
    orders_list.append(','.join(row))

for domain, price in reader:
    # strip the domains
    domain = domain.replace('http://', '').replace('/', '')
    # insert the domains and prices into the dictionary
    domains_prices[domain] = price

for i in orders_list:
    # iterate over the blog post urls in orders_list and
    # then parse them with urlparse
    data = urlparse(i)
    # use netloc to get just the domain from each blog post url
    parsed_orders = data.netloc
    parsed_orders_list = parsed_orders.split()
    print parsed_orders_list

"""
for k in parsed_orders:
    if k in domains_prices:
        print k, domains_prices[k]
"""
With the help of someone else I've figured it out; I made the following changes to the 'for i in orders_list' section:
parsed_orders = []

for i in orders_list:
    # iterate over the blog post urls in orders_list and
    # then parse them with urlparse
    data = urlparse(i)
    # use netloc to get just the domain from each blog post url, then put each netloc into a list
    parsed_orders.append(data.netloc)

# print parsed_orders - to check that I'm getting a list of netloc urls back

# iterate over the list of urls and the dict of domains and prices to match them up
for k in parsed_orders:
    if k in domains_prices:
        print k, domains_prices[k]
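For the last part of the question, writing the matched URLs and prices to a new file with a total at the end, a minimal sketch could be appended after the loop above; the output filename 'orders_with_prices.csv' is just an example, and it assumes the prices keep their '$' prefix as in domains_prices.csv:
import csv

total = 0.0
with open('orders_with_prices.csv', 'wb') as out_file:
    writer = csv.writer(out_file)
    for url, domain in zip(orders_list, parsed_orders):
        if domain in domains_prices:
            price = domains_prices[domain]
            writer.writerow([url, price])
            # strip the '$' so the price can be summed
            total += float(price.lstrip('$'))
    writer.writerow(['Total', '$%.2f' % total])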

Problems Scraping a Page With Beautiful Soup

I am using Beautiful Soup to try and scrape a page.
I am trying to follow this tutorial.
I am trying to get the contents of the following page after submitting a Stock Ticker Symbol:
http://www.cboe.com/delayedquote/quotetable.aspx
The tutorial is for a page with a "GET" method, my page is a "POST". I wonder if that is part of the problem?
I want to use the first text box, under where it says:
“Enter a Stock or Index symbol below for delayed quotes.”
Relevant code:
import urllib
import urllib2

user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = {'User-Agent': user_agent}
values = {'ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol': 'IBM'}
data = urllib.urlencode(values)
request = urllib2.Request("http://www.cboe.com/delayedquote/quotetable.aspx", data, headers)
response = urllib2.urlopen(request)
The call does not fail, but I do not get the set of options and prices returned to me like when I run the page interactively. I just get a bunch of garbled HTML.
Thanks in advance!
Ok - I think I figured out the problem (and found another). I decided to switch to 'mechanize' from 'urllib2'. Unfortunately, I kept having problems getting the data. Finally, I realized that there are two 'submit' buttons, so I tried passing the name parameter when submitting the form. That did the trick as far as getting the correct response.
However, the next problem was that I could not get BeautifulSoup to parse the HTML and find the necessary tags. A brief Google search revealed others having similar problems. So, I gave up on BeautifulSoup and just did a basic regex on the HTML. Not as elegant as BeautifulSoup, but effective.
Ok - enough speechifying. Here's what I came up with:
import mechanize
import re

br = mechanize.Browser()
url = 'http://www.cboe.com/delayedquote/quotetable.aspx'
br.open(url)
br.select_form(name='aspnetForm')
br['ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol'] = 'IBM'

# here's the key step that was causing the trouble - pass the name parameter
# for the button when calling submit
response = br.submit(name="ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$btnSubmit")
data = response.read()

match = re.search(r'Bid</font><span> \s*([0-9]{1,4}\.[0-9]{2})', data, re.MULTILINE | re.IGNORECASE)
if match:
    print match.group(1)
else:
    print "There was a problem retrieving the quote"