Scraping data off flipkart using scrapy - python-2.7

I am trying to scrape some information from flipkart.com for this purpose I am using Scrapy. The information I need is for every product on flipkart.
I have used the following code for my spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import TutorialItem
class WebCrawler(CrawlSpider):
name = "flipkart"
allowed_domains = ['flipkart.com']
start_urls = ['http://www.flipkart.com/store-directory']
rules = [
Rule(LinkExtractor(allow=['/(.*?)/p/(.*?)']), 'parse_flipkart', cb_kwargs=None, follow=True),
Rule(LinkExtractor(allow=['/(.*?)/pr?(.*?)']), follow=True)
]
#staticmethod
def parse_flipkart(response):
hxs = HtmlXPathSelector(response)
item = FlipkartItem()
item['featureKey'] = hxs.select('//td[#class="specsKey"]/text()').extract()
yield item
What my intent is to crawl through every product category page(specified by the second rule) and follow the product page(first rule) within the category page to scrape data from the products page.
One problem is that I cannot find a way to control the crawling and scrapping.
Second flipkart uses ajax on its category page and displays more products when a user scrolls to the bottom.
I have read other answers and assessed that selenium might help solve the issue. But I cannot find a proper way to implement it into this structure.
Suggestions are welcome..:)
ADDITIONAL DETAILS
I had earlier used a similar approach
the second rule I used was
Rule(LinkExtractor(allow=['/(.?)/pr?(.?)']),'parse_category', follow=True)
#staticmethod
def parse_category(response):
hxs = HtmlXPathSelector(response)
count = hxs.select('//td[#class="no_of_items"]/text()').extract()
for page num in range(1,count,15):
ajax_url = response.url+"&start="+num+"&ajax=true"
return Request(ajax_url,callback="parse_category")
Now i was confused on what to use for callback "parse_category" or "parse_flipkart"
Thank you for your patience

Not sure what you mean when you say that you can't find a way to control the crawling and scraping. Creating a spider for this purpose is already taking it under control, isn't it? If you create proper rules and parse the responses properly, that is all you need. In case you are referring to the actual order in which the pages are scraped, you most likely don't need to do this. You can just parse all the items in whichever order, but gather their location in the category hierarchy by parsing the breadcrumb information above the item title. You can use something like this to get the breadcrumb in a list:
response.css(".clp-breadcrumb").xpath('./ul/li//text()').extract()
You don't actually need Selenium, and I believe it would be an overkill for this simple issue. Using your browser (I'm using Chrome currently), press F12 to open the developer tools. Go to one of the category pages, and open the Network tab in the developer window. If there is anything here, click the Clear button to clear things up a bit. Now scroll down until you see that additional items are being loaded, and you will see additional requests listed in the Network panel. Filter them by Documents (1) and click on the request in the left pane (2). You can see the URL for the request (3) and the query parameters that you need to send (4). Note the start parameter which will be the most important since you will have to call this request multiple times while increasing this value to get new items. You can check the response in the Preview pane (5), and you will see that the request from the server is exactly what you need, more items. The rule you use for the items should pick up those links too.
For a more detail overview of scraping with Firebug, you can check out the official documentation.
Since there is no need to use Selenium for your purpose, I shall not cover this point more than adding a few links that show how to use Selenium with Scrapy, if the need ever occurs:
https://gist.github.com/cheekybastard/4944914
https://gist.github.com/irfani/1045108
http://snipplr.com/view/66998/

Related

How do I check if a user has entered the URL from another website in Django?

I want an effect to be applied when a user is entering my website. So therefore I want to check for when a user is coming from outside my website so the effect isnt getting applied when the user is surfing through different urls inside the website, but only when the user is coming from outside my website
You can't really check for where a user has come from specifically. You can check if the user has just arrived on your site by setting a session variable when they load one of your pages. You can check for it before you set it, and if they don't have it, then they have just arrived and you can apply your effect. There's some good examples of how sessions work here: https://developer.mozilla.org/en-US/docs/Learn/Server-side/Django/Sessions
There's a couple of ways to handle this. If you are using function based views, you can just create a separate util function and include it at the top of every page, eg,
utils.py
def first_visit(request):
"""returns the answer to the question 'first visit for session?'
make sure SESSION_EXPIRE_AT_BROWSER_CLOSE set to False in settings for persistance"""
if request.session['first_visit']:
#this is not the first session because the session variable is used.
return False
else:
#This is the first visit
...#do something
#set the session variable so you only do the above once
request.session[first_visit'] = True
return True
views.py
from utils.py import first_visit
def show_page(request):
first_visit = first_visit(request)
This approach gives you some control. For example, you may not want to run it on pages that require login, because you will already have run it on the login page.
Otherwise, the best approach depends on what will happen on the first visit. If you want just to update a template (eg, perhaps to show a message or run a script on th epage) you can use a context processor which gives you extra context for your templates. If you want to interrupt the request, perhaps to redirect it to a separate page, you can create a simple piece of middleware.
docs for middleware
docs for context processors
You may also be able to handle this entirely by javascript. This uses localStorage to store whether or not this is the user's first visit to the site and displays the loading area for 5 seconds if there is nothing in localStorage. You can include this in your base template so it runs on every page.
function showMain() {
document.getElementByID("loading").style.display = "none";
document.getElementByID("main").style.display = "block";
}
const secondVisit = localStorage.getItem("secondVisit");
if (!secondVisit) {
//show loading screen
document.getElementByID("loading").style.display = "block";
document.getElementByID("main").style.display = "none";
setTimeout(5000, showMain)
localStorage.setItem("secondVisit", "true" );
} else {
showMain()
}

Django Forbid HttpResponse

I am working on a tiny movie manager by using the out-of-the-box admin module in Django.
I add a "Play" link on the movie admin page to play the movie, by passing the id of this movie. So the backend is something like this:
import subprocess
def play(request, movie_id):
try:
m = Movie.objects.get(pk=movie_id)
subprocess.Popen([PLAYER_PATH, m.path + '/' + m.name])
return HttpResponseRedirect("/admin/core/movie")
except Movie.DoesNotExist:
return HttpResponse(u"The movie is not exist!")
As the code above reveals, every time I click the "play" link, the page will be refreshed to /admin/core/movie, which is the movie admin page, I just do not want the backend to do this kind of things, because I may use the "Search" functions provided by the admin module, so the URL before clicking on "Play" may be something like: "/admin/core/movie/?q=gun", if that response takes effect, then the query criteria will be removed.
So, my thought is whether I can forbid the HttpResponse, in order to let me stay on the current page.
Any suggestions on this issue ?
Thanks in advance.
I used the custom action in admin to implement this function.
So finally I felt that actions are something like procedures, which have no return values, and requests are something like methods(views) with return values...
Thanks !

Turn down webpage depending on time of day (DJANGO)

How can i use django templates to remove the contents of a page depending on the time of day?
I'm creating an online food delivery site. The delivery orders are sectioned off into times.
For example for a 7pm delivery drop, i would want the page to show normally until 6pm that day. Then at 6:01pm i would want the page to say something like "This delivery time is not available"
To be honest you shouldn't rely on template logic if you want to prevent unwanted behaviour.
You can create two variables, e.g.
from django.conf import settings
TIME_OPEN = getattr(settings, 'TIME_OPEN', datetime.now().replace(hour=10, minute=30, second=0, microsecond=0).time())
TIME_CLOSED = getattr(settings, 'TIME_CLOSED', datetime.now().replace(hour=21, minute=30, second=0, microsecond=0).time())
In your url patterns you could add something like:
if TIME_OPEN < datetime.now().time() < TIME_CLOSED:
urlpatterns += patterns('shop.customers.views',
(r'^checkout/$', 'checkout'),
)
Based on your new variables you could add a context_processor that supplies a context variable to each template for UI logic, e.g. {'shop_open': True}.
Mind you these examples rely on server time, so you would have to check, because it can differ from your local machine. Another approach could be to create a decorator which can be wrapped around views that require certain times.
So, just to make sure; Don't rely on template logic, protect your views

Sitecore Web Forms for Marketers and DMS - not recording campaigns, goals and dropout info

In WFFM there is an option so that, when someone abandons the form, any data that was entered in the form itself is recorded and should be accessible via the Dropout Report.
I have a WFFM for which I have turned on Analytics and turned on the dropout feature. Unfortunately I don't see any data being recorded in the DB and the Dropout Report is visible, but empty.
I see from the javascript code included in the WFFM folder that a series of AJAX calls are supposed to save the fields on blur events -- with calls to /sitecore modules/web/Web Forms for Marketers/Tracking.aspx
I tried debugging the Javascript code, but the method supposed to post the info to /sitecore modules/web/Web Forms for Marketers/Tracking.aspx is never being called. Can you think of any reasons for this code not to work? Also, does anyone know which table this information is supposed to be recorded? Is it the fields table in the WFFM DB?
Finally, even though I have turned on analytics on this particular WFFM form and I have associated a campaign and a goal to the submission of the form, none of these is being recorded. I see that the data entered in the form is stored successfully and is displaying in the Data Report, but no info about the Campaign nor the Goal are recorded in the DB.
I even checked manually directly in the DMS DB running:
select top 10
p.DateTime, p.UrlText, cp.CampaignName
,i.Url, vi.VisitId
from pages p
inner join ItemUrls i on p.ItemId = i.ItemId
inner join Visits vi on vi.VisitId = p.VisitId
inner join GeoIps g on vi.Ip = g.Ip
left join Campaigns cp on cp.CampaignId = vi.CampaignId
order by p.DateTime desc
This one shows that the page where the form is rendered is being hit, but no campaign is associated to the visit.
Then I tried the following:
select pe.datetime, ped.Name, pg.UrlText from PageEvents pe
inner join PageEventDefinitions ped on ped.PageEventDefinitionId = pe.PageEventDefinitionId
inner join Pages pg on pg.PageId = pe.PageId
order by pe.DateTime desc
But I don't see any entry for this particular campaign nor for the goal (while I see entries for other campaigns and goals associated to non-WFFM Sitecore items)
Any advice would be greatly appreciated!
Thanks,
Francesco
EDIT
The sc.webform.js file contains this method:
_create: function () {
var self = this,
options = this.options;
if (options.tracking) {
this.element.find("input[type!='submit'], select, textarea")
.bind('focus', function (e) { self.onFocusField(e, this) })
.bind('blur change', function (e) { self.onBlurField(e, this) });
this.element.find("select")
.change(function () { $scw.webform.controls.updateAnalyticsListValue(this) });
this.element.find("input[type='checkbox'], input[type='radio']")
.click(function () { $scw.webform.controls.updateAnalyticsListValue(this) });
}
this.element.find(".scfDatePickerTextBox").each(function () { $scw.webform.controls.datePicker(this) });
},
This is supposed to be called by the form on sc.webform widget initialization. It should bind the focus and blur change events for all input fields, drop downs and text areas. Unfortunately, when I tried to put a break point inside this method, it never gets called.
SECOND EDIT
Interesting. I figured out that the whole thing should start from this line of Javascript code embedded in the page that contains the WFFM form:
<script type="text/javascript">
$scwhead.ready(function() {
$scw('#form_A8BF483419174F97A2830E12CBCF7E4F').webform({formId: "{A8BF4834-1917-4F97-A283-0E12CBCF7E4F}",pageId: "{21C24144-B964-4FBA-8388-D9B90EBBC17C}",eventCountId: "pagecolumns_0_columncontent_0_bottomrow_0_form_A8BF483419174F97A2830E12CBCF7E4F_form_A8BF483419174F97A2830E12CBCF7E4F_eventcount",tracking: true})
});
</script>
Once I put a break point here, I was finally able to trace into the _create method of the jQuery.UI widget defined in sc.webform.js. The code that calls _create is actually inside the jQuery.UI library. Kinda makes sense, right?
Finally, the code inside _create is executed, the blur events are bound to the TrackEvents method, also defined within the widget:
_trackEvents: function(events) {
$scw.ajax({
type: 'POST',
url: "/sitecore modules/web/Web Forms for Marketers/Tracking.aspx" + location.search,
data: {track: JSON.stringify(events)},
dataType: 'json'
});
What doesn't make sense is that now, even though I can finally see trackEvents being called whenever I tab from field to field in the WFFM form (why wasn't working before it's a mistery to me), I don't see any data recorded in the WFFM DB. I even tried a quick query in the DB:
select f.Timestamp, f.StorageName, fi.Value, fi.FieldName
from Form f
inner join Field fi on f.Id = fi.FormId
order by f.Timestamp desc, FieldName
Does anybody know where is Tracking.aspx supposed to save the captured field informations?
This may be silly to ask, but did you configure the data source correctly for your WFFM? I mean, obviously, you're using WFFM..but is it set to use SQL or is it using the "file" that WFFM uses by default as it's database.
like this to use SQL:
<!-- MSSQL-->
<formsDataProvider type="Sitecore.Forms.Data.DataProviders.WFMDataProvider,Sitecore.Forms.Core">
<param desc="connection string">Database=Sitecore_WebForms;Data Source=xxx;user id=xxx;password=xxx;Connect Timeout=30</param>
</formsDataProvider>
<!-- SQLite -->
<!--<formsDataProvider type="Sitecore.Forms.Data.DataProviders.SQLite.SQLiteWFMDataProvider,Sitecore.Forms.Core">
<param desc="connection string">Data Source=/data/sitecore_webforms.db;version=3;BinaryGUID=true</param>
</formsDataProvider>-->
If you don't configure that correctly, I'm wondering if somehow data is being recorded in one place but not another? Also, another question I have is to ask if this is a dev environment, are you running webforms in live mode? It just seems to me like this is a configuration issue.
We are experiencing the exact same problem on 6.5 update 6 and WFFM 2.3.3 rev. 111209. We can see the asynchronous calls to the server including the probably well formed json object containing the correct event.
Example:
track:[{"fieldId":"{E0A0BCDD-85E1-4D8D-9E76-5ABD240423C9}","type":"Field Completed","value":"test","formId":"{0F3B57C1-1B6A-43B9-A5A6-2E958C168B31}","pageId":"{025AFF68-62B9-42CE-B49F-0C36311E1976}","ticks":16}]
We don't see any of the dropouts arrive in the database, though...
Have you made sure your campaigns and goals have been deployed? If you have switched databases they may not be. To redeploy do this:
For each Goal in System -> Marketing Center -> Goals
Change the workflow state to draft
Save
Then in the review ribbon click deploy.
This will create an entry in the pageeventdefinition table and allow
you to query.
Don't forget to do the same for campaigns.

django admin actions on all the filtered objects

Admin actions can act on the selected objects in the list page.
Is it possible to act on all the filtered objects?
For example if the admin search for Product names that start with "T-shirt" which results with 400 products and want to increase the price of all of them by 10%.
If the admin can only modify a single page of result at a time it will take a lot of effort.
Thanks
The custom actions are supposed to be used on a group of selected objects, so I don't think there is a standard way of doing what you want.
But I think I have a hack that might work for you... (meaning: use at your own risk and it is untested)
In your action function the request.GET will contain the q parameter used in the admin search. So if you type "T-Shirt" in the search, you should see request.GET look something like:
<QueryDict: {u'q': [u'T-Shirt']}>
You could completely disregard the querystring parameter that your custom action function receives and build your own queryset based on that request.GET's q parameter. Something like:
def increase_price_10_percent(modeladmin, request, queryset):
if request.GET['q'] is None:
# Add some error handling
queryset=Product.objects.filter(name__contains=request.GET['q'])
# Your code to increase price in 10%
increase_price_10_percent.short_description = "Increases price 10% for all products in the search result"
I would make sure to forbid any requests where q is empty. And where you read name__contains you should be mimicking whatever filter you created for the admin of your product object (so, if the search is only looking at the name field, name__contains might suffice; if it looks at the name and description, you would have a more complex filter here in the action function too).
I would also, maybe, add an intermediate page stating what models will be affected and have the user click on "I really know what I'm doing" confirmation button. Look at the code for django.contrib.admin.actions for an example of how to list what objects are being deleted. It should point you in the right direction.
NOTE: the users would still have to select something in the admin page, otherwise the action function would never get called.
This is a more generic solution, is not fully tested(and its pretty naive), so it might break with strange filters. For me works with date filters, foreign key filters, boolean filters.
def publish(modeladmin,request,queryset):
kwargs = {}
for filter,arg in request.GET.items():
kwargs.update({filter:arg})
queryset = queryset.filter(**kwargs)
queryset.update(published=True)