Hello, I am trying to scrape an e-commerce website that loads data with scrolling and a "load more" button. I followed this "How to scrape website with infinte scrolling?" link, but when I tried the code the spider closes without any products. Maybe the structure has changed. I would like some help getting started; I am quite new to web scraping.
Edit:
OK, I am scraping this website, http://www.jabong.com/women/, which has subcategories. I am trying to scrape the products of all the subcategories. I tried the code above but that didn't work for me, so after doing some research I wrote a spider that runs but doesn't satisfy my goal. So far I have tried this:
import scrapy
#from scrapy.exceptions import CloseSpider
from scrapy.spiders import Spider
#from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from koovs.items import product
#from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

class propubSpider(scrapy.Spider):
    name = 'koovs'
    allowed_domains = ['jabong.com']
    max_pages = 40

    def start_requests(self):
        for i in range(self.max_pages):
            yield scrapy.Request('http://www.jabong.com/women/clothing/tops-tees-shirts/tops/?page=%d' % i, callback=self.parse)

    def parse(self, response):
        for sel in response.xpath('//*[@id="catalog-product"]/section[2]'):
            item = product()
            item['price'] = sel.xpath('//*[@class="price"]/span[2]/text()').extract()
            item['image'] = sel.xpath('//*[@class="primary-image thumb loaded"]/img').extract()
            item['title'] = sel.xpath('//*[@data-original-href]/@href').extract()
            yield item  # without this yield the spider produces no output
The above code works for one category, and only if I specify the number of pages. The website has a lot of products in a given category and I don't know how many pages they span, so I decided to use a crawl spider to go through all the categories and product pages and fetch the data. But I am very new to Scrapy; any help would be highly appreciated.
The first thing you need to understand is that the DOM structure of a website often changes, so a scraper written in the past may or may not work for you now.
So the best approach when scraping a website is to find a hidden API or a hidden URL that can only be seen when you analyze the website's network traffic. This not only gives you a reliable solution for scraping but also saves bandwidth, which is very important when doing broad crawling, since most of the time you don't need to download the whole page.
Let's take the example of the website you are crawling to get more clarity. When you visit this page you can see the button that says Show More Product. Go to the developer tools of your browser and select the network analyzer. When you click on the button you will see the browser sending a GET request to this link. Check the response and you will see a list of all the products on the first page. Now when you analyze this link, you can see it has a parameter page=1. Change it to page=2 and you will see the list of all products on the second page.
Now go ahead and write the spider. It will be something like:
import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.spider import BaseSpider
from scrapy.http import Request
from jabong.items import product

class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/tops-tees-shirts/tops/?page=1&limit=52&sortField=popularity&sortBy=desc",
    ]
    page = 1

    def parse(self, response):
        products = response.xpath('//*[@id="catalog-product"]/section[2]/div')
        if not products:
            raise CloseSpider("No more products!")

        for p in products:
            item = product()
            item['price'] = p.xpath('a/div/div[2]/span[@class="standard-price"]/text()').extract()
            item['title'] = p.xpath('a/div/div[1]/text()').extract()
            if item['title']:
                yield item

        self.page += 1
        yield Request(url="http://www.jabong.com/women/clothing/tops-tees-shirts/tops/?page=%d&limit=52&sortField=popularity&sortBy=desc" % self.page,
                      callback=self.parse,
                      dont_filter=True)
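You can then run the spider from the project directory and export the scraped items to a file, for example (this assumes the usual Scrapy project layout, with the product item defined in the project's items.py):
scrapy crawl jabong -o products.json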
N.B.- This example is just for educational purpose. Please refer to the website's Terms and Conditions/Privacy Policy/Robots.txt before crawling/scraping/storing any data from the website.
Related
I have a Django 2.2 based project that uses a custom user model, the built-in auth app and Django-All-Auth for user management. Almost every page on the site is behind a login, and we use varying levels of permissions to determine what can be accessed by a user.
So far, so good, but now I'm being asked to designate everything behind a specific part of the site as "sensitive", requiring a second login prompt using the same login credentials. What this means is that the client wants to see a login appear when they try to access anything under /top-secret/ the first time in a set time, say 30 mins, regardless of whether they're already logged in or not.
I've dug around on the internet for ideas on how to do this, but so far I've been unable to find a good solution. Has anyone here had any experience with something similar, and if so, could they point me in the right direction?
Figured out how to make this work in my situation. I debated deleting this question, but on the off chance that there's someone else out there that wants to do something similar, here's how I did it.
Create a new middleware with the following:
from allauth.account.adapter import get_adapter
from datetime import datetime
from dateutil.relativedelta import relativedelta
from django.conf import settings
from django.shortcuts import redirect
from django.urls import reverse
from django.utils import timezone

class ExtraLoginPromptMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if request.user.is_authenticated:
            for url in settings.EXTRA_LOGIN_URLS:
                if request.path.startswith(url):
                    last_seen = request.session.get(settings.EXTRA_LOGIN_SESSION_VARIABLE)
                    if not last_seen:
                        last_seen = request.user.last_login
                    else:
                        last_seen = datetime.fromisoformat(last_seen)

                    if last_seen + relativedelta(seconds=settings.EXTRA_LOGIN_EXPIRY) < timezone.now():
                        # NB, this uses allauth; if you want to use django.auth, just
                        # call django.contrib.auth's logout instead
                        get_adapter(request).logout(request)
                        return redirect('{}?next={}'.format(reverse('account_login'), url))
                    else:
                        request.session[settings.EXTRA_LOGIN_SESSION_VARIABLE] = timezone.now().isoformat()

        return self.get_response(request)
To make use of this, you'll want to add this middleware after SessionMiddleware and AuthenticationMiddleware.
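A minimal sketch of the MIDDLEWARE setting (the dotted path to ExtraLoginPromptMiddleware is hypothetical; adjust it to wherever you placed the class):
MIDDLEWARE = [
    'django.middleware.security.SecurityMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.middleware.csrf.CsrfViewMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django.contrib.messages.middleware.MessageMiddleware',
    'django.middleware.clickjacking.XFrameOptionsMiddleware',
    # must come after SessionMiddleware and AuthenticationMiddleware
    'myproject.middleware.ExtraLoginPromptMiddleware',  # hypothetical path
]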
You'll also want to define a few variables in your settings:
EXTRA_LOGIN_URLS = ['/top-secret/', 'another-secret']
EXTRA_LOGIN_EXPIRY = 120 # How long before the "extra security" session expires
EXTRA_LOGIN_SESSION_VARIABLE = 'extra-login-last-seen'
I am new to Scrapy and am using Scrapy with Python 2.7 for web automation. I want to click an HTML button on a website, which opens a login form. My problem is that I just want to click the button and transfer control to the new page. I have read all the similar questions, but none were satisfactory because they all involve logging in directly or using Selenium.
Below is the HTML code for the button. I want to visit http://example.com/login, where the login page is.
<div class="pull-left">
    <a href="http://example.com/login">Employers</a>
</div>
I have written code to extract the link, but how do I visit that link and carry out the next step? Below is my code.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'pro'
    url = "http://login-page.com/"

    def start_requests(self):
        yield scrapy.Request(self.url, self.parse_login)

    def parse_login(self, response):
        employers = response.css("div.pull-left a::attr(href)").extract_first()
        print employers
Do I need to use yield every time, with a callback to a new function, just to visit a link, or is there another way to do it?
What you need is to yield a new request or, more easily, use response.follow as in the docs:
def parse_login(self, response):
    next_page = response.css("div.pull-left a::attr(href)").extract_first()
    if next_page is not None:
        yield response.follow(next_page, callback=self.next_page_parse)
As for the callback, it basically depends on how easily the page can be parsed; for example, check the general spiders section in the docs. A sketch of a follow-up callback is shown below.
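For instance, the callback can simply be another parse method on the same spider; this is a minimal sketch, assuming you just want to pull something off the page behind the link (the selector and the yielded field name are hypothetical):

def next_page_parse(self, response):
    # we are now on the page behind the "Employers" link (e.g. the login page);
    # extract whatever you need here, or yield further requests from this point
    page_title = response.css("title::text").extract_first()  # hypothetical example
    yield {'login_page_title': page_title}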
I'm trying to build a super simple dashboard to show users their Google Analytics data, well formatted.
I'm using oAuth2Client and Django 1.10.4 and Python 3.5.
I've followed the example in the documentation and now I have a very simple app: the landing page asks you to log in, you click a link to authorise, the Google page loads and asks whether you want to share your GA data, and if you accept you are redirected to a page you can see only if you are logged in. All good so far.
However, I can't manage to actually get the user's data. What's the best way to get, for example, the list of properties in a user's account, or even better, the number of page views a property had in the last week?
This is my code so far:
polls/views.py
from django import http
from oauth2client.contrib.django_util import decorators
from django.views.generic import ListView
# from polls.models import GoogleAnalytic
from django.http import HttpResponse

# Google Analytics
from apiclient.discovery import build

# def index(request):
#     return http.HttpResponse("Hello and Welcome! </br> </br> Click <a href='/profile_enabled'> here</a> to login")

@decorators.oauth_required
def get_profile_required(request):
    resp, content = request.oauth.http.request(
        'https://www.googleapis.com/plus/v1/people/me')
    return http.HttpResponse(content)

@decorators.oauth_enabled
def get_profile_optional(request):
    if request.oauth.has_credentials():
        # this could be passed into a view
        # request.oauth.http is also initialized
        return http.HttpResponse('You are signed in.</br> </br>' + 'User email: {}'.format(
            request.oauth.credentials.id_token['email']) + "</br></br>Click <a href='/ga'> here </a> to view your metrics")
    else:
        return http.HttpResponse(
            'Hello And Welcome!</br></br>'
            'You need to sign in to view your data. </br></br>' +
            'Here is an OAuth Authorize link: <a href="{}">Authorize</a>'
            .format(request.oauth.get_authorize_redirect()))

########## MY CODE! ###############
@decorators.oauth_required
def google_analytics(object):
    return HttpResponse('These are your results for last week:')
urls.py
from django.conf import urls
from polls import views
import oauth2client.contrib.django_util.site as django_util_site

urlpatterns = [
    urls.url(r'^$', views.get_profile_optional),
    urls.url(r'^profile_required$', views.get_profile_required),
    # urls.url(r'^profile_enabled$', views.get_profile_optional),
    urls.url(r'^oauth2/', urls.include(django_util_site.urls)),
    urls.url(r'^ga/$', views.google_analytics),
]
settings.py
[...]
GOOGLE_OAUTH2_CLIENT_ID = 'XXX.apps.googleusercontent.com'
GOOGLE_OAUTH2_CLIENT_SECRET = 'XXX'
GOOGLE_OAUTH2_SCOPES = ('email','https://www.googleapis.com/auth/analytics')
So my problem is that I don't really understand where Django saves the token used to access that particular user's data. I know it works, because it prints out the email address correctly, but I can't figure out what I should add to def google_analytics(object): to actually call specific Google API methods.
If anyone has experience with this kind of thing, I would really appreciate some help! Thanks!
If you want to fetch Google Analytics configuration details, e.g. accounts, web properties, profiles, filters, goals, etc., you can do that using the Google Analytics Management API V3.
If you want to fetch data for certain dimensions and metrics from a Google Analytics view (aka profile), you can do that using either the Core Reporting API V3 or the Analytics Reporting API V4.
I think you will find Python API examples in their respective guides.
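As a rough illustration only, and assuming you reuse the authorized HTTP object that oauth2client already gives you for the signed-in user (request.oauth.http in the code above), a helper that lists the user's views with the Management API and then pulls last week's pageviews with the Core Reporting API v3 might look something like this; the helper name is made up, and picking the first view is just a placeholder for a real selection:

from apiclient.discovery import build

def last_week_pageviews(oauth_http):
    # Management API v3: list the user's views (profiles) across all accounts/properties
    analytics = build('analytics', 'v3', http=oauth_http)
    profiles = analytics.management().profiles().list(
        accountId='~all', webPropertyId='~all').execute()
    view_id = profiles['items'][0]['id']  # assumes the user has at least one view

    # Core Reporting API v3: pageviews for the last 7 days for that view
    result = analytics.data().ga().get(
        ids='ga:%s' % view_id,
        start_date='7daysAgo',
        end_date='today',
        metrics='ga:pageviews').execute()
    return result.get('rows', [])

Inside google_analytics(request) you could then call such a helper with request.oauth.http and render the rows instead of the placeholder string.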
I'm trying to scrape a website that uses Ajax to load its different pages.
Although my Selenium browser is navigating through all the pages, the Scrapy response stays the same, and it ends up scraping the same response (number-of-pages times).
Proposed Solution :
I read in some answers that by using
hxs = HtmlXPathSelector(self.driver.page_source)
you can change the page source and then scrape it. But it is not working, and after adding it the browser also stopped navigating.
code
def parse(self, response):
    self.driver.get(response.url)
    pages = int(response.xpath('//p[@class="pageingP"]/a/text()')[-2].extract())

    for i in range(pages):
        next = self.driver.find_element_by_xpath('//a[text()="Next"]')
        print response.xpath('//div[@id="searchResultDiv"]/h3/text()').extract()[0]
        try:
            next.click()
            time.sleep(3)
            #hxs = HtmlXPathSelector(self.driver.page_source)
            for sel in response.xpath("//tr/td/a"):
                item = WarnerbrosItem()
                item['url'] = response.urljoin(sel.xpath('@href').extract()[0])
                request = scrapy.Request(item['url'], callback=self.parse_job_contents, meta={'item': item}, dont_filter=True)
                yield request
        except:
            break
    self.driver.close()
Please Help.
When using Selenium and Scrapy together, after having Selenium perform the click, I've read the page back for Scrapy using:
from scrapy.http import TextResponse

resp = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
That would go where your HtmlXPathSelector line went. All the Scrapy code from that point to the end of the routine would then need to refer to resp (the page rendered after the click) rather than response (the page rendered before the click).
The time.sleep(3) may give you issues, as it doesn't guarantee the page has actually loaded; it's just an unconditional wait. It might be better to use something like
WebDriverWait(self.driver, 30).until(test page has changed)
which waits until the page you are waiting for passes a specific test, such as finding the expected page number or manufacturer's part number.
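As a rough sketch of both ideas together, and assuming the header at //div[@id="searchResultDiv"]/h3 (the XPath from the question's code) changes from page to page, the click-and-wait step might look like this, where next is the element found in the question's loop:

from selenium.webdriver.support.ui import WebDriverWait
from scrapy.http import TextResponse

old_header = self.driver.find_element_by_xpath('//div[@id="searchResultDiv"]/h3').text
next.click()
# wait up to 30s until the header text differs from the page we just left
WebDriverWait(self.driver, 30).until(
    lambda driver: driver.find_element_by_xpath('//div[@id="searchResultDiv"]/h3').text != old_header)
resp = TextResponse(url=self.driver.current_url,
                    body=self.driver.page_source,
                    encoding='utf-8')
for sel in resp.xpath("//tr/td/a"):
    ...  # build items from resp instead of response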
I'm not sure what the impact is of closing the driver at the end of every pass through parse(). I've used the following snippet in my spider to close the driver when the spider is closed.
# assumes: from selenium import webdriver, from scrapy import signals,
# and from scrapy.xlib.pydispatch import dispatcher (in older Scrapy versions)
def __init__(self, filename=None):
    # wire us up to selenium
    self.driver = webdriver.Firefox()
    dispatcher.connect(self.spider_closed, signals.spider_closed)

def spider_closed(self, spider):
    self.driver.close()
Selenium isn't in any way connected to Scrapy, nor to its response object, and in your code I don't see you changing the response object.
You'll have to work with them independently.
I don't think this is actually possible, but is there any clean and tidy way to get paginated content working with Django sitemaps?
For example, my landing page has news on it, and there are no permalinks to news posts, the only way to use them is to paginate 5 at a time through them all.
Another part gets lists of items in various genres and other criteria.
If it isn't possible, what is the best way to handle it? Not provide URLs to the sitemap for any of these pages? Provide just the first page of each paginated listing?
My best idea is that I should just give the landing page as an url, and not bother with the listing pages at all since they aren't really important search engine-wise. But if this is the best course of action, how can I just provide a link to the landing page from within the sitemaps framework?
Any suggestions are welcome.
I've included urls for my paginated list views in the XML sitemap using the following code:
from django.conf import settings
from django.contrib.sitemaps import Sitemap
from django.core.paginator import Paginator
from django.core.urlresolvers import reverse

from myapp.models import Article  # import path assumed; adjust to your app

class ArticleListSitemap(Sitemap):
    def items(self):
        objects = Article.objects.all()
        paginator = Paginator(objects, settings.ARTICLE_PAGINATE_BY)
        return paginator.page_range

    def location(self, page):
        return reverse('article_list', kwargs={'page': page})
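The sitemap class is then registered with the sitemap view like any other Sitemap. A minimal sketch, assuming django.contrib.sitemaps is in INSTALLED_APPS; the module path and dictionary key are placeholders:

# urls.py
from django.conf.urls import url
from django.contrib.sitemaps.views import sitemap

from myapp.sitemaps import ArticleListSitemap  # module path assumed

sitemaps = {
    'article_pages': ArticleListSitemap,
}

urlpatterns = [
    url(r'^sitemap\.xml$', sitemap, {'sitemaps': sitemaps},
        name='django.contrib.sitemaps.views.sitemap'),
]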