Can only scrape the base price with requests html - python-requests-html

For the product listed below, I fail to scrape the prices of the more expensive options.
Strangely, I can save a direct URL link and see the correct price when the page loads. However, when I scrape the same link with requests-html, I only get the base price of the base product.
from requests_html import HTMLSession
url = 'https://www.staegerag.ch/shop/index.php?id_product=321&controller=product&search_query=level&results=2#/273-farbe-gold_tone_light_oak/233-sprachsteuerung-nein'
session = HTMLSession()
r = session.get(url)
print(r.html.find('span[id="our_price_display"]', first=True).text)
Results in 1'399.00 CHF instead of 1'699.00 CHF

After various Google searches, I came across a simple solution. Since I'm still a relatively new kid on the programming block, I hadn't noticed that my code never called Chromium at all. The price I want to scrape is rendered by JavaScript, so I just had to render the page in Chromium via r.html.render().
Solution:
from requests_html import HTMLSession
url = 'https://www.staegerag.ch/shop/index.php?id_product=321&controller=product&search_query=level&results=2#/273-farbe-gold_tone_light_oak/233-sprachsteuerung-nein'
session = HTMLSession()
r = session.get(url)
r.html.render()
print(r.html.find('span[id="our_price_display"]', first=True).text)
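Note that on its first run, r.html.render() downloads a local copy of Chromium, which can take a while. If the rendered price still lags behind on slower pages, render() accepts wait, sleep and timeout arguments; a minimal sketch (the values here are assumptions to tune per site):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get(url)  # same url as above
# Give the page's JavaScript extra time to run before parsing.
r.html.render(wait=2, sleep=2, timeout=20)
print(r.html.find('span[id="our_price_display"]', first=True).text)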

Related

How to scrape through search results spanning multiple pages with lxml

I'm using lxml to scrape through a site. I want to scrape through a search result, that contains 194 items. My scraper is able to scrape only the first page of search results. How can I scrape the rest of the search results?
import requests
from lxml import html
from lxml.cssselect import CSSSelector

url = 'http://www.alotofcars.com/new_car_search.php?pg=1&byshowroomprice=0.5-500&bycity=Gotham'
response_object = requests.get(url)
# Build DOM tree
dom_tree = html.fromstring(response_object.text)
After this there are scraping functions
def enter_mmv_in_database(dom_tree, engine):
    # Getting make, model, variant
    name_selector = CSSSelector('[class="secondary-cell"] p a')
    name_results = name_selector(dom_tree)
    for n in name_results:
        mmv = repr(n.text_content()).split('\\xa0')
        make, model, variant = mmv[0][2:], mmv[1], mmv[2][:-2]
        # Now push make, model, variant in Database
        print make, model, variant
By looking at the list I receive, I can see that only the first page of search results is parsed. How can I parse the whole of the search results?
I've tried to navigate through that website but it seems to be offline. Yet, I would like to help with the logic.
What I usually do is:
Make a request to the search URL (with parameters filled)
With lxml, extract the number of the last available page from the pagination div (a sketch of this step follows the loop below).
Loop from the first page to the last one, making requests and scraping the desired data:
for page_number in range(1, last + 1):
    # make requests replacing 'page_number' in 'pg' GET variable
    url = "http://www.alotofcars.com/new_car_search.php?pg={}&byshowroomprice=0.5-500&bycity=Gotham".format(page_number)
    response_object = requests.get(url)
    dom_tree = html.fromstring(response_object.text)
    ...
    ...
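For step 2, a minimal sketch of extracting the last page number; the div.pagination selector is an assumption, so inspect the site's actual markup and adjust it:
import requests
from lxml import html

url = 'http://www.alotofcars.com/new_car_search.php?pg=1&byshowroomprice=0.5-500&bycity=Gotham'
dom_tree = html.fromstring(requests.get(url).text)
# Collect the numeric page links from the pagination block and take the highest.
page_labels = [a.text_content() for a in dom_tree.cssselect('div.pagination a')]
last = max(int(label) for label in page_labels if label.strip().isdigit())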
I hope this helps. Let me know if you have any further questions.

Scraping data off flipkart using scrapy

I am trying to scrape some information from flipkart.com; for this purpose I am using Scrapy. The information I need is for every product on Flipkart.
I have used the following code for my spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import FlipkartItem

class WebCrawler(CrawlSpider):
    name = "flipkart"
    allowed_domains = ['flipkart.com']
    start_urls = ['http://www.flipkart.com/store-directory']
    rules = [
        Rule(LinkExtractor(allow=['/(.*?)/p/(.*?)']), 'parse_flipkart', cb_kwargs=None, follow=True),
        Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), follow=True)
    ]

    @staticmethod
    def parse_flipkart(response):
        hxs = HtmlXPathSelector(response)
        item = FlipkartItem()
        item['featureKey'] = hxs.select('//td[@class="specsKey"]/text()').extract()
        yield item
My intent is to crawl through every product category page (specified by the second rule) and follow the product pages (first rule) within each category page to scrape data from the product pages.
One problem is that I cannot find a way to control the crawling and scraping.
Second flipkart uses ajax on its category page and displays more products when a user scrolls to the bottom.
I have read other answers and assessed that selenium might help solve the issue. But I cannot find a proper way to implement it into this structure.
Suggestions are welcome..:)
ADDITIONAL DETAILS
I had earlier used a similar approach
the second rule I used was
Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), 'parse_category', follow=True)

@staticmethod
def parse_category(response):
    hxs = HtmlXPathSelector(response)
    # extract() returns a list of strings, so take the first entry and cast it
    count = int(hxs.select('//td[@class="no_of_items"]/text()').extract()[0])
    for num in range(1, count, 15):
        ajax_url = response.url + "&start=" + str(num) + "&ajax=true"
        yield Request(ajax_url, callback="parse_category")
Now I was confused about what to use for the callback, "parse_category" or "parse_flipkart".
Thank you for your patience
Not sure what you mean when you say that you can't find a way to control the crawling and scraping. Creating a spider for this purpose is already taking it under control, isn't it? If you create proper rules and parse the responses properly, that is all you need. In case you are referring to the actual order in which the pages are scraped, you most likely don't need to do this. You can just parse all the items in whichever order, but gather their location in the category hierarchy by parsing the breadcrumb information above the item title. You can use something like this to get the breadcrumb in a list:
response.css(".clp-breadcrumb").xpath('./ul/li//text()').extract()
You don't actually need Selenium, and I believe it would be overkill for this simple issue. Using your browser (I'm using Chrome currently), press F12 to open the developer tools. Go to one of the category pages, and open the Network tab in the developer window. If there is anything here, click the Clear button to clear things up a bit. Now scroll down until you see that additional items are being loaded, and you will see additional requests listed in the Network panel. Filter them by Documents and click on one of the requests in the left pane. You can see the URL for the request and the query parameters that you need to send. Note the start parameter, which will be the most important, since you will have to call this request multiple times while increasing this value to get new items. You can check the response in the Preview pane, and you will see that the response from the server is exactly what you need: more items. The rule you use for the items should pick up those links too.
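To make that concrete, here is a minimal sketch of issuing those AJAX requests from a Scrapy spider. The category URL, the page size of 15, the total of 300, and the CSS selectors are all assumptions for illustration; verify the real values in the developer tools first:
import scrapy

class CategoryAjaxSpider(scrapy.Spider):
    name = "flipkart_ajax"
    # Hypothetical category URL -- substitute a real one.
    start_urls = ["http://www.flipkart.com/mobiles/pr?sid=tyy"]

    def parse(self, response):
        # Schedule one AJAX request per page of 15 items, mimicking
        # what the page's infinite scroll does.
        total = 300  # assumed; in practice parse the item count from the page
        for start in range(15, total, 15):
            url = "{}&start={}&ajax=true".format(response.url, start)
            yield scrapy.Request(url, callback=self.parse_items)
        # The first response already contains the first page of items.
        for request in self.parse_items(response):
            yield request

    def parse_items(self, response):
        # Each AJAX response is an HTML fragment with more product links.
        for href in response.css('a[href*="/p/"]::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_product)

    def parse_product(self, response):
        yield {"featureKey": response.xpath('//td[@class="specsKey"]/text()').extract()}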
For a more detailed overview of scraping with Firebug, you can check out the official documentation.
Since there is no need to use Selenium for your purpose, I shall not cover this point beyond adding a few links that show how to use Selenium with Scrapy, if the need ever occurs:
https://gist.github.com/cheekybastard/4944914
https://gist.github.com/irfani/1045108
http://snipplr.com/view/66998/

Connecting urls using Python's urllib2

I'm scraping some stock information using urllib2.
Some of my code is as follows.
import urllib2
from BeautifulSoup import BeautifulSoup

cap_url = "http://wisefn.stock.daum.net/company/c1010001.aspx?cmp_cd=%s" % code
cap_req = urllib2.Request(cap_url)
cap_data = urllib2.urlopen(cap_req).read()
~
~
~
depr_url = "http://wisefn.stock.daum.net/company/cF3002.aspx?cmp_cd=%s&frq=Q&rpt=ISM&finGubun=MAIN" % code
depr_req = urllib2.Request(depr_url)
depr_data = urllib2.urlopen(depr_req).read()
~
~
~
transaction_url = "http://www.shinhaninvest.com/goodicyber/mk/1206.jsp?code=%s" % code
transaction_data = urllib2.urlopen(transaction_url).read()
soup = BeautifulSoup(transaction_data, fromEncoding="utf-8")
As you know, %s is the stock code. With a given stock code, I'm scraping all of the stock information. The total number of stock codes is over 1,600. Then I write the gathered information to Excel with xlwt.
However, I can't connect to some URLs or get their information, even though I can connect by manually typing those URLs into a browser.
What's the problem? And how can I speed up scraping pages?
First I would check the robots.txt file of the website. It most likely prohibits the default Python user agent, so you may consider changing the user agent of urllib2.
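A minimal sketch of sending a browser-like User-Agent header with urllib2 (the header string is just an example):
import urllib2

# 'code' is the stock code variable from the question.
url = "http://wisefn.stock.daum.net/company/c1010001.aspx?cmp_cd=%s" % code
req = urllib2.Request(url, headers={
    # Pretend to be a regular browser instead of "Python-urllib/2.x".
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36'
})
data = urllib2.urlopen(req).read()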
Second, the website content might be generated by JavaScript, and if so, urllib2 cannot evaluate it. For this purpose you can use the Selenium driver, the PyQt framework, or similar.
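For the JavaScript case, a minimal Selenium sketch (assumes a Chrome driver is installed; any browser driver works):
from selenium import webdriver

driver = webdriver.Chrome()  # requires chromedriver on your PATH
driver.get(cap_url)  # one of the URLs from the question
html = driver.page_source  # fully rendered HTML, JavaScript executed
driver.quit()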

Amazon Product Advertising API: Get Average Customer Rating

When using Amazon's web service to get any product's information, is there a direct way to get the Average Customer Rating (1-5 stars)? Here are the parameters I'm using:
Service=AWSECommerceService
Version=2011-08-01
Operation=ItemSearch
SearchIndex=Books
Title=A Game of Thrones
ResponseGroup=Large
I would expect it to have a customer rating of 4.5 and total reviews of 2177. But instead I get the following in the response.
<CustomerReviews><IFrameURL>http://www.amazon.com/reviews/iframe?...</IFrameURL></CustomerReviews>
Is there a way to get the overall customer rating, besides for reading the <IFrameURL/> value, making another HTTP request for that page of reviews, and then screen scraping the HTML? That approach is fragile since Amazon could easily change the reviews page structure which would bust my application.
You can scrape from here. Just replace the asin with what you need.
http://www.amazon.com/gp/customer-reviews/widgets/average-customer-review/popover/ref=dpx_acr_pop_?contextId=dpx&asin=B000P0ZSHK
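For example, a minimal Python sketch of pulling the star rating out of that widget (the regex is an assumption; Amazon changes its markup frequently, so verify it against the current page):
import re
import requests

asin = 'B000P0ZSHK'
url = ('http://www.amazon.com/gp/customer-reviews/widgets/'
       'average-customer-review/popover/ref=dpx_acr_pop_?contextId=dpx&asin=' + asin)
html = requests.get(url).text
# Assumed pattern -- adjust to the widget's actual markup.
match = re.search(r'([0-9.]+) out of 5 stars', html)
print(match.group(1) if match else 'rating not found')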
As far as I know, Amazon changed its API, so it's no longer possible to get the review rank information. If you check this link, the note says:
As of November 8, 2010, only the iframe URL is returned in the request content.
However, testing with the params you used to get the iframe, it seems that now even the iframe doesn't work anymore. Accordingly, even in the latest API reference, in the chapter "Motivating Customers to Buy", the "reviews" part is completely missing.
However: since I'm also very interested in whether it's still possible somehow to get the review rank information - maybe not even using the Amazon API but a competitor's API - I'll set up a bounty if anybody can provide something helpful on that. The bounty will be set on this topic in two days.
You can grab the iframe review URL and then use CSS to position it so only the star rating shows. It's not ideal since you're not getting raw data, but it's an easy way to add the rating to your page.
Sample of this in action: http://spamtech.co.uk/positioning-content-inside-an-iframe/
Here is a VBS script that would scrape the rating. Paste the code below to a text file, rename it to Test.vbs and double click to run on Windows.
sAsin = InputBox("What is your ASIN?", "Amazon Standard Identification Number (ASIN)", "B000P0ZSHK")
If sAsin <> "" Then
    sHtml = SendData("http://www.amazon.com/gp/customer-reviews/widgets/average-customer-review/popover/ref=dpx_acr_pop_?contextId=dpx&asin=" & sAsin)
    sRating = ExtractHtml(sHtml, "<span class=""a-size-base a-color-secondary"">(.*?)<\/span>")
    sReviews = ExtractHtml(sHtml, "<a class=""a-size-small a-link-emphasis"".*?>.*?See all(.*?)<\/a>")
    MsgBox sRating & vbCrLf & sReviews
End If

Function ExtractHtml(sHtml, sPattern)
    Set oRegExp = New RegExp
    oRegExp.Pattern = sPattern
    oRegExp.IgnoreCase = True
    Set oMatch = oRegExp.Execute(sHtml)
    If oMatch.Count = 1 Then
        ExtractHtml = Trim(oMatch.Item(0).SubMatches(0))
    End If
End Function

Function SendData(sUrl)
    Dim oHttp 'As XMLHTTP30
    Set oHttp = CreateObject("Msxml2.XMLHTTP")
    oHttp.open "GET", sUrl, False
    oHttp.send
    SendData = Replace(oHttp.responseText, vbLf, "")
End Function
Amazon has completely removed support for accessing rating/review information from their API. The docs mention a Response Element in the form of customer rating, but that doesn't work either.
Google Shopping uses Viewpoints for some of its reviews, among other sources.
This is not possible from PAPI. You either need to scrape it yourself, or you can use free/cheaper third-party alternatives for that.
We use the amazon-price API from RapidAPI for this, it supports price/rating/review count fetching for up to 1000 products in a single request.

Is there an Amazon.com API to retrieve product reviews? [closed]

Do any of the AWS APIs/Services provide access to the product reviews for items sold by Amazon? I'm interested in looking up reviews by (ASIN, user_id) tuple. I can see that the Product Advertising API returns a URL to a page (for embedding in an IFRAME) containing the URLs, but I am interested in a machine-readable format of the review data, if possible.
Update 2:
Please see @jpillora's comment. It's probably the most relevant regarding Update 1.
I just tried out the Product Advertising API (as of 2014-09-17); it seems that this API only returns a URL pointing to an iframe containing just the reviews. I guess you'd have to screen scrape, though I imagine that would break Amazon's TOS.
Update 1:
Maybe. I wrote the original answer below earlier. I don't have time to look into this right now because I'm no longer on a project concerned with Amazon reviews, but their webpage at Product Advertising API states "The Product Advertising API helps you advertise Amazon products using product search and look up capability, product information and features such as Customer Reviews..." as of 2011-12-08. So I hope someone looks into it and posts back here; feel free to edit this answer.
Original:
Nope.
Here is an interesting forum discussion about the fact, including theories as to why: http://forums.digitalpoint.com/showthread.php?t=1932326
If I'm wrong, please post what you find. I'm interested in getting the reviews content, as well as allowing submitting reviews to Amazon, if possible.
You might want to check this link: http://reviewazon.com/. I just stumbled across it and haven't looked into it, but I'm surprised there's no mention on their site about the drop of Reviews from the Amazon Product Advertising API posted at: https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html
Here's my quick take on it - you can easily retrieve the review counts themselves with a bit more work:
import urllib2

countries = ['com', 'co.uk', 'ca', 'de']
books = [
    'http://www.amazon.%s/Glass-House-Climate-Millennium-ebook/dp/B005U3U69C',
    'http://www.amazon.%s/The-Japanese-Observer-ebook/dp/B0078FMYD6',
    'http://www.amazon.%s/Falling-Through-Water-ebook/dp/B009VJ1622',
]

for book in books:
    print '-' * 40
    print book.split('%s/')[1]
    for country in countries:
        asin = book.split('/')[-1]
        title = book.split('/')[3]
        url = 'http://www.amazon.%s/product-reviews/%s' % (country, asin)
        try:
            page = urllib2.urlopen(url).read().lower()
        except:
            page = ""
        print '%s=%s' % (country, page.count('member-review'))
print '-' * 40
According to the Amazon Product Advertising API License Agreement (https://affiliate-program.amazon.com/gp/advertising/api/detail/agreement.html), and specifically its point 4.b.iii:
You will use Product Advertising Content only ... to send end users to and drive sales on the Amazon Site.
which means it's prohibited for you to show Amazon product reviews taken through their API to sell products on your site. It's only allowed to redirect your site visitors to Amazon and collect the affiliate commissions.
I would use something like @mfs's answer above. Unfortunately, their answer would only work for up to 10 reviews, since this is the maximum that can be displayed on one page.
You may consider the following code:
import re
import requests

nreviews_re = {'com': re.compile('\d[\d,]+(?= customer review)'),
               'co.uk': re.compile('\d[\d,]+(?= customer review)'),
               'de': re.compile('\d[\d\.]+(?= Kundenrezens\w\w)')}
no_reviews_re = {'com': re.compile('no customer reviews'),
                 'co.uk': re.compile('no customer reviews'),
                 'de': re.compile('Noch keine Kundenrezensionen')}

def get_number_of_reviews(asin, country='com'):
    url = 'http://www.amazon.{country}/product-reviews/{asin}'.format(country=country, asin=asin)
    html = requests.get(url).text
    try:
        return int(re.compile('\D').sub('', nreviews_re[country].search(html).group(0)))
    except:
        if no_reviews_re[country].search(html):
            return 0
        else:
            return None  # to distinguish from 0, and handle more cases if necessary
Running this with 1433524767 (which has significantly different review counts for the three countries of interest) I get:
>> print get_number_of_reviews('1433524767', 'com')
3185
>> print get_number_of_reviews('1433524767', 'co.uk')
378
>> print get_number_of_reviews('1433524767', 'de')
16
Hope it helps
As said by others above, Amazon has discontinued providing reviews in its API. However, I found this nice tutorial that does the same with Python. Here is the code he gives; it works for me! He uses Python 2.7.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Written as part of https://www.scrapehero.com/how-to-scrape-amazon-product-reviews-using-python/
from lxml import html
import json
import re
import requests
from dateutil import parser as dateparser
from time import sleep

def ParseReviews(asin):
    # This script has only been tested with Amazon.com
    amazon_url = 'http://www.amazon.com/dp/' + asin
    # Add some recent user agent to prevent amazon from blocking the request
    # Find some chrome user agent strings here https://udger.com/resources/ua-list/browser-detail?browser=Chrome
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    page = requests.get(amazon_url, headers=headers).text

    parser = html.fromstring(page)
    XPATH_AGGREGATE = '//span[@id="acrCustomerReviewText"]'
    XPATH_REVIEW_SECTION = '//div[@id="revMHRL"]/div'
    XPATH_AGGREGATE_RATING = '//table[@id="histogramTable"]//tr'
    XPATH_PRODUCT_NAME = '//h1//span[@id="productTitle"]//text()'
    XPATH_PRODUCT_PRICE = '//span[@id="priceblock_ourprice"]/text()'

    raw_product_price = parser.xpath(XPATH_PRODUCT_PRICE)
    product_price = ''.join(raw_product_price).replace(',', '')
    raw_product_name = parser.xpath(XPATH_PRODUCT_NAME)
    product_name = ''.join(raw_product_name).strip()
    total_ratings = parser.xpath(XPATH_AGGREGATE_RATING)
    reviews = parser.xpath(XPATH_REVIEW_SECTION)

    ratings_dict = {}
    reviews_list = []

    # grabbing the rating section in product page
    for ratings in total_ratings:
        extracted_rating = ratings.xpath('./td//a//text()')
        if extracted_rating:
            rating_key = extracted_rating[0]
            raw_raing_value = extracted_rating[1]
            rating_value = raw_raing_value
            if rating_key:
                ratings_dict.update({rating_key: rating_value})

    # Parsing individual reviews
    for review in reviews:
        XPATH_RATING = './div//div//i//text()'
        XPATH_REVIEW_HEADER = './div//div//span[contains(@class,"text-bold")]//text()'
        XPATH_REVIEW_POSTED_DATE = './/a[contains(@href,"/profile/")]/parent::span/following-sibling::span/text()'
        XPATH_REVIEW_TEXT_1 = './/div//span[@class="MHRHead"]//text()'
        XPATH_REVIEW_TEXT_2 = './/div//span[@data-action="columnbalancing-showfullreview"]/@data-columnbalancing-showfullreview'
        XPATH_REVIEW_COMMENTS = './/a[contains(@class,"commentStripe")]/text()'
        XPATH_AUTHOR = './/a[contains(@href,"/profile/")]/parent::span//text()'
        XPATH_REVIEW_TEXT_3 = './/div[contains(@id,"dpReviews")]/div/text()'

        raw_review_author = review.xpath(XPATH_AUTHOR)
        raw_review_rating = review.xpath(XPATH_RATING)
        raw_review_header = review.xpath(XPATH_REVIEW_HEADER)
        raw_review_posted_date = review.xpath(XPATH_REVIEW_POSTED_DATE)
        raw_review_text1 = review.xpath(XPATH_REVIEW_TEXT_1)
        raw_review_text2 = review.xpath(XPATH_REVIEW_TEXT_2)
        raw_review_text3 = review.xpath(XPATH_REVIEW_TEXT_3)

        author = ' '.join(' '.join(raw_review_author).split()).strip('By')
        # cleaning data
        review_rating = ''.join(raw_review_rating).replace('out of 5 stars', '')
        review_header = ' '.join(' '.join(raw_review_header).split())
        review_posted_date = dateparser.parse(''.join(raw_review_posted_date)).strftime('%d %b %Y')
        review_text = ' '.join(' '.join(raw_review_text1).split())

        # grabbing hidden comments if present
        if raw_review_text2:
            json_loaded_review_data = json.loads(raw_review_text2[0])
            json_loaded_review_data_text = json_loaded_review_data['rest']
            cleaned_json_loaded_review_data_text = re.sub('<.*?>', '', json_loaded_review_data_text)
            full_review_text = review_text + cleaned_json_loaded_review_data_text
        else:
            full_review_text = review_text
        if not raw_review_text1:
            full_review_text = ' '.join(' '.join(raw_review_text3).split())

        raw_review_comments = review.xpath(XPATH_REVIEW_COMMENTS)
        review_comments = ''.join(raw_review_comments)
        review_comments = re.sub('[A-Za-z]', '', review_comments).strip()
        review_dict = {
            'review_comment_count': review_comments,
            'review_text': full_review_text,
            'review_posted_date': review_posted_date,
            'review_header': review_header,
            'review_rating': review_rating,
            'review_author': author
        }
        reviews_list.append(review_dict)

    data = {
        'ratings': ratings_dict,
        'reviews': reviews_list,
        'url': amazon_url,
        'price': product_price,
        'name': product_name
    }
    return data

def ReadAsin():
    # Add your own ASINs here
    AsinList = ['B01ETPUQ6E', 'B017HW9DEW']
    extracted_data = []
    for asin in AsinList:
        print "Downloading and processing page http://www.amazon.com/dp/" + asin
        extracted_data.append(ParseReviews(asin))
        sleep(5)
    f = open('data.json', 'w')
    json.dump(extracted_data, f, indent=4)

if __name__ == '__main__':
    ReadAsin()
Here is the link to his website: reviews scraping with Python 2.7.
Unfortunately you can only get an iframe URL with the reviews, the content itself is not accessible.
Source: http://docs.amazonwebservices.com/AWSECommerceService/2011-08-01/DG/CHAP_MotivatingCustomerstoBuy.html#GettingCustomerReviews
Check out RapidAPI: https://rapidapi.com/blog/amazon-product-reviews-api/
By using this API, we can get Amazon product reviews.
You can use the Amazon Product Advertising API. It has a Response Group 'Reviews' which you can use with the operation 'ItemLookup'. You need to know the ASIN, i.e. the unique item ID of the product.
Once you set all the parameters and execute the signed URL, you will receive an XML response which contains a link to the customer reviews under the "IFrameURL" tag.
Use this URL and pattern-search the HTML returned from it to extract the reviews. Each review in the HTML has a unique review ID, under which you can get all the data for that particular review. A sketch of this last step follows below.
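A minimal sketch, assuming you already have the signed ItemLookup response as a string (the signing step is omitted, and the review-ID pattern is an assumption to adapt to the page's actual markup):
import re
import requests

xml_response = "..."  # the signed ItemLookup response (Reviews response group)
match = re.search(r'<IFrameURL>(.*?)</IFrameURL>', xml_response)
if match:
    # Note: the URL may contain XML entities such as &amp; that need unescaping.
    reviews_html = requests.get(match.group(1)).text
    # Hypothetical parsing step: collect the per-review block IDs.
    review_ids = re.findall(r'id="(R[A-Z0-9]+)"', reviews_html)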