The best way to integrate Django and Scrapy - django

I know of some approaches, such as scrapy-djangoitem, but as its documentation mentions:
DjangoItem is a rather convenient way to integrate Scrapy projects with Django models, but bear in mind that Django ORM may not scale well if you scrape a lot of items (ie. millions) with Scrapy. This is because a relational backend is often not a good choice for a write intensive applications (such as a web crawler), specially if the database is highly normalized and with many indices.
So what is the best way to use scraped items with a database and Django models?

It is not about the Django ORM but rather about the database you choose as the backend. What it says is that if you expect to write millions of items to your tables, relational database systems (MySQL, Postgres, ...) might not be your best choice. Performance can get even worse if you add many indices, since your application is write-heavy: on every write, the database must update the B-trees or other structures that back each index.
I would suggest sticking with Postgres or MySQL for now and looking for another solution only if you start to have performance issues at the database level.
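One common way to reduce per-write overhead on a relational backend is to buffer scraped items and write them in batches. The following is a minimal sketch, not the scrapy-djangoitem API: the class name `BatchingPipeline` and the `write_batch` callable are hypothetical, and only the `process_item` / `close_spider` method names follow Scrapy's item pipeline interface.

```python
class BatchingPipeline:
    """Scrapy-style item pipeline that buffers items and writes them in batches."""

    def __init__(self, write_batch, batch_size=1000):
        # write_batch is a hypothetical callable that persists a list of dicts.
        # With Django it could wrap something like
        #   Page.objects.bulk_create(Page(**row) for row in rows)
        # where Page is an assumed model, not one from the question.
        self.write_batch = write_batch
        self.batch_size = batch_size
        self.buffer = []

    def process_item(self, item, spider):
        # Called by Scrapy for every scraped item.
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def close_spider(self, spider):
        # Write whatever is left when the crawl ends.
        self.flush()

    def flush(self):
        if self.buffer:
            self.write_batch(self.buffer)
            self.buffer = []


# Demonstration without Scrapy or Django: collect the batches in a list.
written = []
pipeline = BatchingPipeline(write_batch=written.append, batch_size=2)
for i in range(5):
    pipeline.process_item({"url": f"https://example.com/{i}"}, spider=None)
pipeline.close_spider(spider=None)
```

Scrapy builds pipelines without constructor arguments unless a `from_crawler` classmethod is defined, so a real project would need a subclass or `from_crawler` hook that supplies `write_batch`.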

Related

Collecting Relational Data and Adding to a Database Periodically with Python

I have a project that:
fetches data from Active Directory
fetches data from different services based on the Active Directory data
aggregates the data
About 50,000 rows have to be added to the database every 15 minutes.
I'm using PostgreSQL as the database and Django as the ORM tool, but I'm not sure that Django is the right tool for such a project. I have to drop and add 50,000 rows of data, and I'm worried about performance.
Is there another way to do this?
50k rows/15m is nothing to worry about.
But I'd make sure to use bulk_create to avoid 50k round trips to the database, which might be a problem depending on your database networking setup.
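A minimal sketch of that idea follows. The helper name `bulk_insert` is hypothetical; it only assumes that `model` is a Django model class, so that `model.objects.bulk_create(objs, batch_size=...)` is available, and that each row is a dict of field values.

```python
from itertools import islice


def bulk_insert(model, rows, batch_size=1000):
    """Insert rows (an iterable of dicts) using one bulk_create call per batch.

    `model` is expected to be a Django model class; the helper itself is an
    illustration, not part of Django.
    """
    rows = iter(rows)
    created = 0
    while True:
        # Build at most batch_size unsaved instances at a time, so the whole
        # data set never has to sit in memory as model objects.
        chunk = [model(**row) for row in islice(rows, batch_size)]
        if not chunk:
            return created
        model.objects.bulk_create(chunk, batch_size=batch_size)
        created += len(chunk)
```

With 50,000 rows and `batch_size=1000` this issues about 50 insert statements instead of 50,000. If the old rows are deleted first, wrapping the delete and the inserts in `transaction.atomic()` would keep readers from seeing a half-empty table.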
There are certainly other ways, if that's what you're asking. But the Django ORM is quite flexible overall, and if you write your queries carefully there will be no significant overhead. 50,000 rows in 15 minutes is not a large volume. I am using the Django ORM with PostgreSQL to process millions of records a day.
You can write a custom Django management command for this purpose, then call it like this:
python manage.py collectdata
Here is the documentation link

Django supported alternative to noSQL

We need a reasonable insert and query speed over huge tables so I considered using some noSQL adapter with Django. Unfortunately:
Django does not provide official support for noSQL databases.
In our original schema some Big Data are relational to other Big Data making the data duplication unacceptable.
Project deadlines are enemies of hot stuff like this.
So, as far as I can see, PostgreSQL should be the way to go for this scenario, right?!
Please let me know any other detail that may be relevant to this question!
Bonus to anyone that can point out some useful database techniques like database sharding...
Well, there is a fork of the Django project that uses MongoDB as the backend. You can read about it here. The code on GitHub is here. Just as a heads-up, MongoDB is a NoSQL database that supports sharding and replication, so I think this might be something that you are looking for.

Mongodb vs PostgreSQL in django

I am not that experienced with Django yet, but we are starting a project soon and we are wondering which database to use for our backend (MongoDB or PostgreSQL).
I've read a lot of posts describing the differences between them, but I still can't decide which to go for, taking into consideration that I have never worked with MongoDB before.
So which should I go for?
Thanks a lot in advance
MongoDB is non-relational, and as such you cannot do things like joins, etc.
For this reason, many of the django.contrib apps and other third-party apps are likely not to work with MongoDB.
But MongoDB might be very useful if you need to store schemaless complex objects that won't go straight into PostgreSQL (of course you could JSON-serialize them and put them in a text field, but using MongoDB instead is just way better, since it allows you to run searches, ...).
So, the best suggestion is to use two databases:
PostgreSQL for the standard applications, such as django core, authentication, ...
MongoDB only for your application, when you have to store non-relational, complex objects
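One way to express this split is a database router. The sketch below is written against the router interface of current Django releases (`db_for_read`, `db_for_write`, `allow_relation`, `allow_migrate`). The app label `documents` and the alias `mongo` are hypothetical, and whether a MongoDB backend can be configured as a Django database alias at all depends on the fork or third-party backend in use, which is an assumption here.

```python
class DocumentRouter:
    """Send models from selected apps to a separate database alias."""

    document_apps = {"documents"}  # hypothetical app label
    document_db = "mongo"          # hypothetical alias in settings.DATABASES

    def db_for_read(self, model, **hints):
        if model._meta.app_label in self.document_apps:
            return self.document_db
        return None  # no opinion: Django falls back to the "default" alias

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)

    def allow_relation(self, obj1, obj2, **hints):
        # Only allow relations between objects stored in the same database.
        sides = {obj._meta.app_label in self.document_apps for obj in (obj1, obj2)}
        return len(sides) == 1

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Document apps live only on document_db; everything else stays off it.
        return (db == self.document_db) == (app_label in self.document_apps)
```

The router would be enabled by listing its dotted path in the `DATABASE_ROUTERS` setting; django.contrib apps such as authentication then keep using the PostgreSQL `default` database.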
You also might want to use the raw_* methods, which skip lots of (mostly unnecessary) validation by the Django ORM.
Just remember that databases, especially SQL vs. NoSQL, are not drop-in replacements for each other. They each have their own features, pros and cons, so you have to find out which one best suits your needs in each case, not just pick one and use it for everything.
UPDATE
I forgot to say: remember that you have to use the django-nonrel fork in order to make django support non-relational databases. It is currently a fork of django 1.3, but a 1.4-based version is work-in-progress.

Data Warehouse and Django

This is more of an architectural question than a technological one per se.
I am currently building a business website/social network that needs to store large volumes of data and use that data to draw analytics (consumer behavior).
I am using Django and a PostgreSQL database.
Now my question is: I want to expand this architecture to include a data warehouse. The ideal would be: the operational DB would be the current Django PostgreSQL database, and the data warehouse would be something additional, preferably in a multidimensional model.
We are still in a very early phase, we are going to test with 50 users, so something primitive such as a one-column table for starters would be enough.
I would like to know if somebody has experience with this situation and could recommend a framework for creating a data warehouse, all while maintaining the operational DB with the Django models for ease of use (if possible).
Thank you in advance!
Here are some cool Open Source tools I used recently:
Kettle - great ETL tool, you can use this to extract the data from your operational database into your warehouse. Supports any database with a JDBC driver and makes it very easy to build e.g. a star schema.
Saiku - nice Web 2.0 frontend built on Pentaho Mondrian (MDX implementation). This allows your users to easily build complex aggregation queries (think Pivot table in Excel), and the Mondrian layer provides caching etc. to make things go fast. Try the demo here.
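To make the star schema idea concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in for the warehouse. The table and column names (`dim_user`, `dim_date`, `fact_page_view`) and the sample data are hypothetical; in the setup described above, an ETL tool such as Kettle would load these tables from the operational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table surrounded by dimension tables: the "star".
conn.executescript("""
CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE fact_page_view (
    user_id INTEGER REFERENCES dim_user(user_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    views   INTEGER
);
""")

conn.executemany("INSERT INTO dim_user VALUES (?, ?)", [(1, "DE"), (2, "US")])
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(1, "2024-01-30", "2024-01"), (2, "2024-02-01", "2024-02")],
)
conn.executemany(
    "INSERT INTO fact_page_view VALUES (?, ?, ?)",
    [(1, 1, 3), (2, 1, 5), (1, 2, 2)],
)

# A typical analytical query: aggregate the facts along two dimensions.
rows = conn.execute("""
SELECT u.country, d.month, SUM(f.views)
FROM fact_page_view AS f
JOIN dim_user AS u ON u.user_id = f.user_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY u.country, d.month
ORDER BY u.country, d.month
""").fetchall()
```

This is the kind of aggregation that a layer like Mondrian generates and caches when users build pivot-table style reports.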
My answer does not necessarily apply to data warehousing. In your case I see the possibility to implement a NoSQL database solution alongside an OLTP relational storage, which in this case is PostgreSQL.
Why consider NoSQL? In addition to the obvious scalability benefits, NoSQL databases offer a number of advantages that will probably apply to your scenario, for instance the flexibility of having records with different sets of fields, and key-based access.
Since you're still in "trial" stage you might find it easier to decide for a NoSQL database solution depending on your hosting provider. For instance AWS have SimpleDB, Google App Engine provide their own DataStore, etc. However there are plenty of other NoSQL solutions you can go for that have nice Python bindings.

Django nonrel access to different NoSQL databases at the same time?

I'm new to the NoSQL world. From the forums and articles I've read, many users try to "mix" NoSQL tools; for example, they use Cassandra and MongoDB together to make a "powerful system". Because I am starting out with MongoDB, I downloaded the DjanMon project (I'm a Django fan ^_^) and, of course, the special version of Django that supports NoSQL: Django NonRel. I noticed that the settings file doesn't "oblige" you to use one specific NoSQL solution, unlike Django with an RDBMS, where you must specify MySQL, PostgreSQL, or another backend. So, is it possible to mix several (or at least two) NoSQL solutions using Django (for example MongoDB + Cassandra)?
There's nothing to stop you using multiple storage solutions, whether SQL or NoSQL - but the NoSQL solutions all have different architectures, data models and APIs (For example, MongoDB is a document-oriented database, whereas Cassandra is Column-oriented), so you can't usually swap one for another without some effort.
Can you clarify what you are actually trying to achieve? I.e. why are you interested in mixing these two specific solutions?