Content scrapers and Django - django

I have an application built with Django. Part of it relies on data that I aggregate from other websites. I'm wondering how I should approach building the scraper/aggregator.
The advantages I see in building it as a Django app are:
the ability to use Django's models & database API
the ability to use Django's other methods
On the other hand I think the disadvantage would be scalability in the long run.
Should I build the scraper/aggregator as an app in my Django project or as a separate script that runs on its own?
Would love to hear your thoughts.

Neither of your points requires it to run within Django. And since it will not depend on the web/HTTP interface, having it be a separate module is the only option that makes sense.
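A minimal sketch of that separation, using only the standard library. The names `HeadlineParser`, `parse_headlines`, `run_scraper`, and the `fetch`/`store` callbacks are hypothetical and not from the original posts. The parsing code knows nothing about Django; in a real project `store` might wrap a model call (for example an `update_or_create` on an assumed `Headline` model) and be driven by cron or a management command rather than a view.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects the text and href of every <a> tag found inside an <h2>."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.current_href = None
        self.current_text = []
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            # Links without an href leave current_href as None and are skipped.
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.headlines.append({"title": title, "url": self.current_href})
            self.current_href = None
        elif tag == "h2":
            self.in_h2 = False


def parse_headlines(html):
    """Pure function: HTML string in, list of {'title', 'url'} dicts out."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


def run_scraper(fetch, store, urls):
    """Fetch each URL, parse it, and hand every record to `store`.

    `fetch` and `store` are injected, so the web framework (if any)
    stays outside this module.
    """
    count = 0
    for url in urls:
        for record in parse_headlines(fetch(url)):
            store(record)
            count += 1
    return count
```

Because fetching and storing are passed in, the same module can be exercised with a canned HTML string and a plain list, then pointed at the real site and the real database later.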

I have just published a Django app, django-dynamic-scraper, on GitHub. It is built on top of the scraping framework Scrapy: you can build Scrapy scrapers in the Django admin and use Django model classes to store your scraped data. Maybe this is of some use for people with similar problems.

If it's a Django app whose scraping runs inside a view, it will only run when someone loads the page. That could slow the loading.
Making another script could be a nicer idea but could produce inaccurate data.
I think it actually depends on the context.

Related

Does using Backbone/Ember turn Django into a simple REST API?

I have read a couple of articles about using new JS frameworks like Backbone.js or Ember.js.
I have come to this conclusion:
If I use a JS framework like Backbone.js/Ember.js, I then move the logic from the back-end (Django) to the front-end.
Therefore, will Django actually be used only for its Models?
Does that mean that Django views and Django templates are not needed anymore, and the Django back-end more or less turns into a "basic" REST API that will be consumed by the front end?
Do you agree? Is that then the purpose of Django in this case?
Is turning the Django backend into a REST API one of the most suitable use cases when using a framework like Backbone.js/Ember.js for the front-end?
Thanks.
Django is perfectly fine to be used this way; you still get the admin, the models, the ORM and all the third-party plugins. However, it isn't blazingly fast, so if you're doing simple document-level, non-relational REST mapping, you might want to look into Node.js and MongoDB, for instance.
If you're sticking with Django (like we are, we like the structure it gives us), you can use one of the REST plugins:
Django REST Framework: a perfect match since DRF 2.0, under very active development!
Django Tastypie (check out backbone-tastypie.js for integration)
Django Piston (might be a bit stale, or has development picked up lately?)
If you only want to work on frontend development, check out Backend-as-a-Service providers like cloudmine.me or firebase.com that handle all the backend stuff for you, for a price of course.
Django may seem unnecessary once you start thinking about single-page solutions and JavaScript applications. But if you want your site to be 'fail proof', it wouldn't be impossible to develop both a client-side JavaScript version of the site and a backend Django version, in case the user's JavaScript, or your site's, fails at some level. Of course this requires creating your site twice, and it probably isn't needed in the age of modern browsers, but it would be one of the few instances where you would mix the two for a complete solution.
Yes, that's about it. You can use it to manage authentication to resources and such, and maybe use a main view for your application, but you won't need server-side templating since these frameworks are made to work with JSON/XML responses.
That's why a lot of people are moving to a lighter backend combined with Backbone or Ember instead of a complete solution like Django. You can also use Django for caching JSON responses, which makes your application appear faster.
We are doing that and use django-piston to make it easier on ourselves.
Normally you build your entire website in Django, and only one page is a "single-page app" using Backbone.js. Usually that page is very interactive, with lots of small, frequent updates that need to be shown to the user very quickly. Because of the large number of changes and user interactions, this page is constructed on the client side, so that you use the user's PC resources and not the server's. The rest of the pages can use Django, because it offers you a very stable and secure framework for the server side.
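The "backend reduced to authentication plus JSON" idea discussed in this thread can be sketched in a framework-agnostic way. Everything here is hypothetical (the `ARTICLES` data, the `article_detail` function, and its return shape); it is not Django's API. In Django the same role would be played by a view returning a JSON response, or by one of the REST packages mentioned above.

```python
import json

# Stand-in for rows that would normally come from the ORM.
ARTICLES = {
    1: {"id": 1, "title": "Hello"},
    2: {"id": 2, "title": "World"},
}


def article_detail(article_id, is_authenticated):
    """Return (status, headers, body) for a GET on a single article.

    No template is rendered: the server only checks access and
    serializes data; the JS front end decides how to display it.
    """
    headers = [("Content-Type", "application/json")]
    if not is_authenticated:
        return 401, headers, json.dumps({"error": "authentication required"})
    article = ARTICLES.get(article_id)
    if article is None:
        return 404, headers, json.dumps({"error": "not found"})
    return 200, headers, json.dumps(article)
```

A Backbone or Ember model would then fetch this resource and render it client-side, which is the division of labour the answers describe.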

How to migrate Django project to Google App Engine

I am looking for a guide to migrating a Django project to Google App Engine and using Google's datastore. Most of the guides I found were linked to Django-Appengine using Django-nonrel (but I want to use GAE's native support).
Going through GAE getting started guide, it says:
Google App Engine supports any framework written in pure Python that speaks CGI (and any WSGI-compliant framework using a CGI adaptor), including Django, CherryPy, Pylons, web.py, and web2py. You can bundle a framework of your choosing with your application code by copying its code into your application directory.
I understand that I won't be able to use some features of Django in that case (mainly the admin) and would also need to restructure the models.
From other reading, I also found that the latest GAE SDK now includes Django 1.3 on Python 2.5.
I tried to put all the files from my Django application into a GAE project, but couldn't get it all to work together.
Please provide a basic guide that I can use to migrate my Django project to Google App Engine.
Thanks.
For an existing Django app, using django-nonrel is the simplest approach; it is very popular so you should be able to find help with specific errors you get quickly.
Another approach is written up in this article: http://code.google.com/appengine/articles/pure_django.html -- it goes the other way, taking an App Engine app that uses Django for dispatch, templates, and forms, but not for models, and describes how to make it run in a native Django environment. Maybe you can glean some useful hints for your situation from it.
I've used django-nonrel, which behaves pretty much like Django, except that operations with JOINs will return errors. I've basically worked around this by avoiding ManyToMany fields and building that functionality manually with an intermediate table.
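The intermediate-table workaround can be illustrated with plain Python data, as a rough sketch only. The `Author`, `Book`, and `Authorship` records and the in-memory collections are hypothetical stand-ins for datastore-backed models; the point is that a many-to-many lookup becomes two simple queries instead of one JOIN.

```python
from collections import namedtuple

Author = namedtuple("Author", "id name")
Book = namedtuple("Book", "id title")
# Hand-written link record that replaces a ManyToMany field.
Authorship = namedtuple("Authorship", "author_id book_id")

# In-memory stand-ins for three separate tables / entity kinds.
AUTHORS = {1: Author(1, "Ann"), 2: Author(2, "Bob")}
BOOKS = {10: Book(10, "Django Basics"), 11: Book(11, "Datastore Notes")}
AUTHORSHIPS = [Authorship(1, 10), Authorship(1, 11), Authorship(2, 11)]


def books_for_author(author_id):
    """Resolve an author's books without a JOIN."""
    # Query 1: filter the link table on a single field.
    book_ids = [link.book_id for link in AUTHORSHIPS
                if link.author_id == author_id]
    # Query 2: fetch the related rows by primary key.
    return [BOOKS[book_id] for book_id in book_ids]
```

The cost of this approach is the extra round trip and the need to keep the link records consistent by hand.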
So far I've run into two problems with django-nonrel:
1. No access to ancestor queries, which can be run in a transaction. There's a pending pull request for this feature though.
2. You can't specify fields that are not indexed. This could significantly increase your write costs. I have an idea to fix this, but I haven't done so yet.
(Edit: You CAN specify fields that are not indexed, and I've verified this works well).
2 (new). Google is pushing a new database backend called ndb that does automatic caching and batching, which will not be available with django-nonrel.
If you decide not to use django-nonrel, the main difference is that Django models do not run under App Engine. You'll have to rewrite your models to inherit from App Engine's db.Model. Your forms that use Django's ModelForm will need to use google.appengine.ext.db.djangoforms instead. Once you're on App Engine, you'd have to port back to Django if you ever take your app somewhere else.
If you already have a Django application you might want to check this out. You won't work with App Engine's datastore but Google Cloud SQL might fit your needs.

Do we still need django-nonrel now that GAE (allegedly) supports Django out of the box?

According to this question:
Django on Google App Engine
The easiest way to get started with GAE/Django is with the Django non-rel bundle. However now that the latest Python/GAE SDK includes a build of Django, do we still need this?
What's the best practice for getting started with Django on GAE right now?
Thanks
Update: It seems that webapp2 is the easiest choice for new projects.
This guest article suggests that
"App Engine does come with some Django support, but this is mainly
only the templating and views."
django-nonrel still seems to be your best bet, although I'd caution you that, according to their blog, further development and/or maintenance may not happen.
Django's standard models don't have a backend supporting GAE's datastore. Hence you can't use Django models or, consequently, Django's model forms. What you'd have to do is use models derived from GAE's Python db.Model. Instead of using Django's ModelForm class for forms, you would use google.appengine.ext.db.djangoforms. Note that this applies specifically to ModelForms; other forms work fine since they're not tied to the database.
I can think of two good reasons to use Django-nonrel:
1a) You have an existing project on Django. Using Django-nonrel would be the laziest way to go. Rewriting models to GAE's models isn't too hard, but it could be a small pain, especially if
1b) you use a lot of existing Django components, and you'd have to go through all of them to update the models and forms.
2) You want to hedge your bets against GAE. Using Django-nonrel will allow you to switch over to MongoDB with very little effort, since Django-nonrel has a functioning MongoDB backend. The current Django-nonrel maintainers seem to be more interested in MongoDB.
Having worked with Django-nonrel, I've so far run into some reasons why it may be a bad choice:
1) No support for ancestor queries. There's an outstanding pull request for this, but it won't be compatible with any other DB backend.
2) ndb is coming out, and it seems it will have a few more benefits that likely won't see support in Django-nonrel.
If you do use GAE's native db API, the main benefit from Django would be the form validation. Otherwise, webapp2+jinja2+gae db.Models() would provide similar functionality to Django.

Django Admin site and Forms on AppEngine

I'm developing a web site hosted on App Engine and wanted to use Django for some tasks. I've read these two answers:
Django on Google App Engine
Django and App Engine
But those are pretty old, and my question is a little more specific. I've taken a look at django-nonrel and it seems good, but I haven't used it and can't confirm anything.
So, the question is: can I use the admin site and the forms from Django with this package? If not, do you know of any other patch that allows me to use them?
Thank you very much!
If you use django-nonrel, then you can use the Django admin site, but it will be limited to the types of queries you can do on App Engine. I personally found it easier to code my own simple admin interfaces than to try to make things work in the Django admin.
Regarding forms, regular Django Forms and ModelForms work quite well.
Yes, you can (both Admin and forms).
(definitely) :)
I installed djangoappengine 3 months ago and work on it daily under Eclipse (Windows).
If you have some experience with Django it should be easy. I faced many more problems with the Eclipse integration, but nothing unfeasible (even for a newbie, which I still am).
You just have to start from here:
http://www.allbuttonspressed.com/projects/djangoappengine#installation
Be careful anyway: there are some limitations due to the Datastore capabilities.
A lot of work has been done to circumvent them (dbindexer, specific decorators...), and if you're planning to develop an app from scratch you will find your way (keeping "NoSQL" in mind). But if you plan to migrate a plain vanilla SQL app, it may cause you some pain...
Last point: instances loading Django and all its libraries may be slow to start on App Engine; an issue to consider:
http://code.google.com/p/googleappengine/issues/detail?id=1695
Hope it helps.
Florent

Django and App Engine

I wanted to check the current status of running Django on Google App Engine and what the benefits are of running Django on GAE over simply using webapp.
Django's main killer feature, IMHO, is its reusable apps and middleware. Unfortunately, most current Django apps use models or model forms (django-tags, django-reviews, django-profiles, Pinax apps).
So what are the remaining features or benefits of Django that can still run on Google App Engine (given what's disabled: the popular Django apps, session and authentication middleware, users and admin, models, etc.)?
Also, is there a list of the Django apps that work in App Engine as well?
app-engine-patch currently has most of Django functional, including sessions, contrib.auth, sites, and some other standard Django apps. However, its main drawback (in my opinion) is that it uses a zip file of a modified version of Django to achieve this, and the current maintainers don't seem to have kept pace with current Django releases. The consensus of the past and present maintainers seems to be that this approach is too cumbersome to maintain, and therefore no one is currently maintaining it.
google-app-engine-django uses a monkey-patch approach on the latest Django version included in the production GAE runtime, so as long as Google continues to track Django releases you'll be kept up to date regarding Django. However, it has not yet fully ported contrib.auth, so you can only authenticate with Google accounts, which can be a big drawback depending on whether you want contrib.auth User models to work as you know them on SQL backends. There is also no Django admin support in the helper as there is in app-engine-patch. A fork of google-app-engine-django exists which adds in some of the contrib apps, such as flatpages, sites, and sitemaps. Also note that it only works on Django versions up to 1.1 until Django 1.2 is added to use_library (issue #3230), unless you upload Django as a zip file.
On the horizon, the original developer of app-engine-patch has been working on the django-nonrel branch, but this may be pretty far away from being included in a django release. This django developers thread has a lot of information about these efforts.
Separately, there is a google summer of code project working on integrating some aspects of nonrel db's.
app-engine-patch gets most of those things working inside AppEngine - so you can (mostly) use straight Modelforms, use the Django users and admin, etc.
I've only used it for fairly simple projects (being quite new to Django), but they claim that most Django apps will work on App Engine with (at most) minor modifications. For instance, app-engine-patch uses the App Engine Model classes rather than the Django classes, and some of the basic views are too inefficient to run on App Engine.
Added: google-app-engine-django is similar, but provides a BaseModel that appears identical to Django's BaseModel. My understanding is that google-app-engine-django was released by Google, then forked to create app-engine-patch. The maintainers of app-engine-patch seem to have some different goals from the creators of google-app-engine-django, so you may find that one of the two suits your needs better than the other.
Google have provided some articles on running Django apps on appengine; the most recent is actually a guest post from the authors of app-engine-patch.
I've had the best success by simply picking and choosing the Django features that I need and patching them into webapp myself. In my latest project I actually just cut out the webapp stuff entirely. I still import and call several webapp utility functions, but it is mostly a hand rolled application built from the good parts of GAE and Django.
You might be interested to check out web2py, another Python framework that supposedly has less friction between GAE and a "normal" web server.
It is now quite easy to use full Django on GAE:
https://developers.google.com/appengine/articles/django-nonrel#ps
The Django version provided with App Engine has been updated to 1.2.5 with the latest SDK release (1.4.2, changelog). This version is available through the use_library() declaration, so you no longer need to mess around with monkey patching to the same extent.
The Google App Engine (GAE) Python 2.7 runtime provides several third-party libraries that your application can use, in addition to the Python standard library, GAE tools, and GAE Python runtime environment. One of them is Django. The following is copied from the GAE docs page on third-party libraries:
To use Django in Python 2.7, specify the WSGI application and Django library in app.yaml:
    ...
    handlers:
    - url: /.*
      script: main.app  # a WSGI application in the main module's global scope
    libraries:
    - name: django
      version: "1.2"
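For context, `script: main.app` refers to a module-level WSGI callable named `app` in a file called `main.py`. The sketch below is a hand-written stand-in that runs with only the standard library; it is an illustration, not code from the GAE docs. A real Django project would instead assign Django's own WSGI handler to `app`, and that wiring is not shown here.

```python
# main.py: hypothetical stand-in for the module referenced as `main.app`.


def app(environ, start_response):
    """Minimal WSGI application living in the module's global scope."""
    path = environ.get("PATH_INFO", "/")
    body = ("Hello from %s" % path).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    # WSGI expects an iterable of byte strings.
    return [body]
```

Any object with this call signature can be named in the `script:` line, which is why the same handler entry works for Django, webapp2, or a bare function like this one.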