Django model: change db_table dynamically

I have a large number of sets of data. Each set comprises several database tables, and the schema for those tables is identical across sets. Each set of tables can have over a million rows. Each set of data belongs to one job; there are no relations between jobs, and a user can own one or more jobs. Sets of tables get imported and eventually deleted as a unit. From a performance point of view it is better to keep them as separate sets of tables.
So I would like to have several generic Django models, one for each of the tables in a set.
I have achieved it in my views.py file by using code similar to this:
from foobar.models import Foo, Bar

def my_view(request):
    prefix = request.GET.get('prefix')
    Foo._meta.db_table = prefix + '_foo'
    Bar._meta.db_table = prefix + '_bar'
    ...
    foobar_list = Foo.objects.filter(bar_id=myval)
    ...
My questions are: Is it safe to use this code with multiple concurrent users of a Django-based web application? Are the model classes shared across users? What would happen if two requests arrived simultaneously?
EDIT NO 2: I have considered Lie Ryan's answer and the comments and come up with this code:
from django.http import HttpResponse, HttpResponseNotFound
from django.db import models
from django.template import RequestContext, loader

def getModels(prefix):
    table_map = {}
    table_map["foo"] = type(str(prefix + '_foo'), (models.Model,), {
        '__module__': 'foobar.models',
        'id': models.IntegerField(primary_key=True),
        'foo': models.TextField(blank=True),
    })
    table_map["foo"]._meta.db_table = prefix + '_foo'
    table_map["bar"] = type(str(prefix + '_bar'), (models.Model,), {
        '__module__': 'foobar.models',
        'id': models.IntegerField(primary_key=True),
        'foo': models.ForeignKey(prefix + '_foo', null=True, blank=True),
    })
    table_map["bar"]._meta.db_table = prefix + '_bar'
    return table_map
def foobar_view(request):
    prefix = request.GET.get('prefix')
    if prefix is not None and prefix.isdigit():
        table_map = getModels(prefix)
        foobar_list = table_map["bar"].objects.all().order_by('foo__foo')
        template = loader.get_template('foobar/foobar.html')
        context = RequestContext(request, {
            'foobar_list': foobar_list,
        })
        return HttpResponse(template.render(context))
    else:
        return HttpResponseNotFound('<h1>Page not found</h1>')
Now my question is, is this second draft of the edited code safe with concurrent multiple users?

This technique is called sharding. No, it is not safe to do this if you serve concurrent requests with threads.
What you can do is to dynamically construct multiple classes pointing to different db_tables, and use a factory to select the right class.
tables = ["foo", "bar"]
table_map = {}
for tbl in tables:
    class T(models.Model):
        class Meta:
            db_table = tbl
        ... table definition ...
    table_map[tbl] = T
And then create a function that selects the right table_map based on how you shard your data.
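A minimal sketch of such a factory, assuming the per-prefix classes are built once with type() and cached at module level, so every request for the same prefix gets the same class object instead of mutating a shared model (the cache and naming scheme here are illustrative, not part of the original answer):

from django.db import models

_shard_models = {}  # prefix -> model class, built once per process

def foo_model_for(prefix):
    if prefix not in _shard_models:
        _shard_models[prefix] = type(str(prefix + '_foo'), (models.Model,), {
            '__module__': 'foobar.models',
            # A dynamically built Meta class sets db_table per shard.
            'Meta': type('Meta', (object,), {'db_table': prefix + '_foo'}),
            'foo': models.TextField(blank=True),
        })
    return _shard_models[prefix]

Because each class is created only once and never mutated afterwards, concurrent threads can safely share it.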
Also be careful of SQL injection if you accept the table name from user input.
Alternatively, some database systems like PostgreSQL allow multiple schemas per database, which might be a better way to separate your data in certain circumstances.
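As a rough illustration of that schema route (PostgreSQL only; the job_<id> naming is hypothetical): each job's tables live in their own schema, and a request points the connection at the right one before querying, so the Django models themselves never change.

from django.db import connection

def select_job_schema(job_id):
    # job_id is assumed to be an integer, which keeps this safe from SQL
    # injection. Unqualified table names now resolve in job_<id> first,
    # falling back to public.
    cursor = connection.cursor()
    cursor.execute('SET search_path TO job_%d, public' % int(job_id))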

Related

Is it possible to write a QuerySet method that modifies the dataset but delays evaluation (similar to prefetch_related)?

I'm working on a QuerySet class that does something similar to prefetch_related but allows the query to link data that's in an unconnected database (basically, linking records from the Django app's database to records in a legacy system, using a shared unique key), something along the lines of:
class UserFoo(models.Model):
    '''Uses the django database & can link to User model'''
    user = models.OneToOneField(User, related_name='userfoo')
    foo_record = models.CharField(
        max_length=32,
        db_column="foo",
        unique=True
    )  # uuid pointing to legacy db table

    @property
    def foo(self):
        if not hasattr(self, '_foo'):
            self._foo = Foo.objects.get(uuid=self.foo_record)
        return self._foo

    @foo.setter
    def foo(self, foo_obj):
        self._foo = foo_obj
and then
class Foo(models.Model):
    '''Uses legacy database'''
    id = models.AutoField(primary_key=True)
    uuid = models.CharField(max_length=32)  # uuid for Foo legacy db table
    …

    @property
    def user(self):
        if not hasattr(self, '_user'):
            self._user = User.objects.get(userfoo__foo_record=self.uuid)
        return self._user

    @user.setter
    def user(self, user_obj):
        self._user = user_obj
Run normally, a query that matches 100 foos (each with, say, one user record) will end up requiring 101 queries: one to get the foos, and a hundred more for the user records (one look-up per foo when its user property is called).
To get around this, I am making something similar to prefetch_related that pulls all of the matching records for a query by the shared key, which means I just need one additional query to get the remaining records.
My code looks something like this:
class FooWithUserQuerySet(models.query.QuerySet):
    def with_foo(self):
        qs = self._clone()
        foo_idx = {}
        for record in self.all():
            foo_idx.setdefault(record.uuid, []).append(record)
        users = User.objects.filter(
            userfoo__foo_record__in=foo_idx.keys()
        ).select_related('django', 'relations', 'here')
        user_idx = {}
        for user in users:
            user_idx[user.userfoo.foo_record] = user
        for fid, frecords in foo_idx.items():
            user = user_idx.get(fid)
            for frecord in frecords:
                if user:
                    setattr(frecord, 'user', user)
        return qs
This works, but any extra data attached to a foo is lost if the query is later modified, that is, if the queryset is re-ordered or filtered in any way.
I would like a way to create a method that does exactly what I am doing now, but defers the work until the moment the queryset is evaluated (and re-runs it whenever the query changes), so that foo records always have a User record.
Some notes:
the example has been highly simplified. There are actually a lot of tables that link up to the legacy data, so for example although there is a one-to-one relationship between Foo and User, there will be some cases where a queryset contains multiple Foo records with the same key.
the legacy database is on a different server and server platform, so I can't join the two tables in the database server itself
ideally I'd like the User data to be cached, so that even if the records are sorted or sliced I don't have to re-run the foo query.
Basically, I don't know enough about the internals of how lazy evaluation of querysets works to do the necessary coding. I have jumped back and forth through the source code of django.db.models.query, but it really is a fairly dense read, and I'm hoping someone out there who's worked with this already can offer some pointers.
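No answer was recorded for this one, but for what it's worth, here is a sketch of one possible approach. Newer Django versions funnel every queryset evaluation through the private QuerySet._fetch_all() hook, so the injection can be re-run after each evaluation; this relies on private internals and is only a sketch, not a tested implementation:

from django.contrib.auth.models import User
from django.db import models

class FooWithUserQuerySet(models.query.QuerySet):
    def with_foo(self):
        clone = self._clone()
        clone._inject_users = True
        return clone

    def _clone(self, **kwargs):
        # Keep the flag alive across filter()/order_by()/slicing.
        clone = super(FooWithUserQuerySet, self)._clone(**kwargs)
        clone._inject_users = getattr(self, '_inject_users', False)
        return clone

    def _fetch_all(self):
        # Let the base class fill the result cache, then attach users to
        # the cached Foo instances with one extra query.
        super(FooWithUserQuerySet, self)._fetch_all()
        if getattr(self, '_inject_users', False) and self._result_cache:
            uuids = [f.uuid for f in self._result_cache]
            users = User.objects.filter(
                userfoo__foo_record__in=uuids).select_related('userfoo')
            user_idx = dict((u.userfoo.foo_record, u) for u in users)
            for foo in self._result_cache:
                foo.user = user_idx.get(foo.uuid)
            self._inject_users = False  # don't re-query on later iterations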

Django aggregate multiple columns after arithmetic operation

I have a really strange problem with Django 1.4.4.
I have this model :
class LogQuarter(models.Model):
    timestamp = models.DateTimeField()
    domain = models.CharField(max_length=253)
    attempts = models.IntegerField()
    success = models.IntegerField()
    queue = models.IntegerField()
    ...
I need to gather the first 20 domains with the highest sent value. The sent value is attempts - queue.
This is my query:
obj = LogQuarter.objects\
    .aggregate(Sum(F('attempts')-F('queue')))\
    .values('domain')\
    .filter(**kwargs)\
    .order_by('-sent')[:20]
I tried with extra too and it isn't working.
It's really basic SQL; I am surprised that Django can't do this.
Does someone have a solution?
You can actually do this via subclassing some of the aggregation functionality. This requires digging into the code to really understand, but here's what I coded up to do something similar for MAX and MIN. (Note: this code is based on Django 1.4 / MySQL.)
Start by subclassing the underlying aggregation class and overriding the as_sql method. This method writes the actual SQL to the database query. We have to make sure to quote the field that gets passed in correctly and associate it with the proper table name.
from django.db.models.sql import aggregates

class SqlCalculatedSum(aggregates.Aggregate):
    sql_function = 'SUM'
    sql_template = '%(function)s(%(field)s - %(other_field)s)'

    def as_sql(self, qn, connection):
        # self.col is currently a tuple, where the first item is the table
        # name and the second item is the primary column name. Assuming our
        # calculation is on two fields in the same table, we can use that
        # to our advantage. qn is the underlying DB quoting object and
        # quotes things appropriately. The column entry in the self.extra
        # var is the actual database column name for the secondary column.
        self.extra['other_field'] = '.'.join(
            [qn(c) for c in (self.col[0], self.extra['column'])])
        return super(SqlCalculatedSum, self).as_sql(qn, connection)
Next, subclass the general model aggregation class and override the add_to_query method. This method is what determines how the aggregate gets added to the underlying query object. We want to be able to pass in the field name (e.g. queue) but get the corresponding DB column name (in case it is something different).
from django.db import models

class CalculatedSum(models.Aggregate):
    name = SqlCalculatedSum

    def add_to_query(self, query, alias, col, source, is_summary):
        # Utilize the fact that self.extra is set to all of the extra
        # kwargs passed in on initialization. We want to get the
        # corresponding database column name for whatever field we pass
        # in to the "variable" kwarg.
        self.extra['column'] = query.model._meta.get_field(
            self.extra['variable']).db_column
        query.aggregates[alias] = self.name(
            col, source=source, is_summary=is_summary, **self.extra)
You can then use your new class in an annotation like this:
queryset.annotate(calc_attempts=CalculatedSum('attempts', variable='queue'))
Assuming your attempts and queue fields have those same db column names, this should generate SQL similar to the following:
SELECT SUM(`LogQuarter`.`attempts` - `LogQuarter`.`queue`) AS calc_attempts
And there you go.
I am not sure you can do Sum(F('attempts') - F('queue')); it should throw an error in the first place. I guess an easier approach would be to use extra:
result = LogQuarter.objects.extra(select={'sent':'(attempts-queue)'}, order_by=['-sent'])[:20]
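Folding the asker's filter and slice back in, the extra()-based version of the full query would look something like this (kwargs as in the question; note that an extra select field must be named explicitly in values()):

result = LogQuarter.objects.filter(**kwargs).extra(
    select={'sent': '(attempts - queue)'},
    order_by=['-sent'],
).values('domain', 'sent')[:20]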

django join querysets from multiple tables

If I have queries on multiple tables like:
d = Relations.objects.filter(follow=request.user).filter(date_follow__lt=last_checked)
r = Reply.objects.filter(reply_to=request.user).filter(date_reply__lt=last_checked)
article = New.objects.filter(created_by=request.user)
vote = Vote.objects.filter(voted=article).filter(date__lt=last_checked)
and I want to display the results from all of them ordered by date (that is, not listing all the replies, then all the votes, etc.).
Somehow, I want to 'join' all these results in a single queryset.
Is that possible?
It seems like you need different objects to have common operations ...
1) In this case it might be better to abstract these properties into a super class: you could have an Event class that defines a user field, and all your other event classes would subclass it.
class Event(models.Model):
    user = models.ForeignKey(User)
    date = ...

class Reply(Event):
    # additional fields

class Vote(Event):
    # additional fields
Then you would be able to do the following:
Event.objects.order_by("date")  # returns Reply, Vote and Event objects
Check out http://docs.djangoproject.com/en/1.2/topics/db/models/#id5 for info on model inheritance.
2) You could also have an Event model with a generic relation to another object. This sounds cleaner to me, as a Vote is conceptually not an "event". Check out: http://docs.djangoproject.com/en/dev/ref/contrib/contenttypes/#id1
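A rough sketch of what option 2 could look like on the Django versions of that era (field names are illustrative; GenericForeignKey lived in django.contrib.contenttypes.generic back then):

from django.contrib.auth.models import User
from django.contrib.contenttypes import generic
from django.contrib.contenttypes.models import ContentType
from django.db import models

class Event(models.Model):
    user = models.ForeignKey(User)
    date = models.DateTimeField(auto_now_add=True)
    # Generic relation: points at any model instance (Reply, Vote, ...).
    content_type = models.ForeignKey(ContentType)
    object_id = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey('content_type', 'object_id')

You would create one Event row alongside each Reply, Vote, etc., after which a single Event.objects.filter(user=request.user).order_by('-date') returns the whole activity stream.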
Anyway, I think your problem is a matter of design.
In addition to Sebastien's proposal number 2: Django actually has some built-in functionality that you could "abuse" for this. For the admin, it already has a model that logs user actions and references objects through a generic foreign key relation. I think you could just sub-class this model and use it for your purposes:
from django.contrib.admin.models import LogEntry, ADDITION
from django.utils.encoding import force_unicode
from django.contrib.contenttypes.models import ContentType

class MyLog(LogEntry):
    class Meta:
        db_table = 'my_log_table'  # use another name here

def log_addition(request, object):
    LogEntry.objects.log_action(
        user_id=request.user.pk,
        content_type_id=ContentType.objects.get_for_model(object).pk,
        object_id=object.pk,
        object_repr=force_unicode(object),
        action_flag=ADDITION
    )
You can now log all your notifications etc. where they happen with log_addition(request, object), and then filter the log table for your purposes! If you also want to log changes / deletions etc., you can write yourself some helper functions for that!
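For example, a small helper along these lines (the name is hypothetical) would produce the combined, date-ordered stream the question asks for, since LogEntry already carries user and action_time fields:

def recent_activity(request, last_checked):
    # All logged events for this user since the last check, newest first.
    return MyLog.objects.filter(
        user=request.user,
        action_time__gt=last_checked,
    ).order_by('-action_time')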

How do I perform a batch insert in Django?

In MySQL, you can insert multiple rows into a table in one query, for n > 0:
INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9), ..., (n-2, n-1, n);
Is there a way to achieve the above with Django queryset methods? Here's an example:
values = [(1, 2, 3), (4, 5, 6), ...]
for value in values:
    SomeModel.objects.create(first=value[0], second=value[1], third=value[2])
I believe the above calls an insert query for each iteration of the for loop. I'm looking for a single query; is that possible in Django?
These answers are outdated. bulk_create was introduced in Django 1.4:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create
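With it, the loop from the question collapses into a single query (SomeModel and values as in the question):

SomeModel.objects.bulk_create([
    SomeModel(first=first, second=second, third=third)
    for first, second, third in values
])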
I recently looked for such a thing myself (inspired by QuerySet.update(), as I imagine you are too). To my knowledge, no bulk create exists in the current production framework (1.1.1 as of today). We ended up creating a custom manager for the model that needed bulk-create, and created a function on that manager to build an appropriate SQL statement with the sequence of VALUES parameters.
Something like (apologies if this does not work... hopefully I've adapted this runnably from our code):
from django.db import models, connection

class MyManager(models.Manager):
    def create_in_bulk(self, values):
        base_sql = "INSERT INTO tbl_name (a,b,c) VALUES "
        values_sql = []
        values_data = []
        for value_list in values:
            placeholders = ['%s' for i in range(len(value_list))]
            values_sql.append("(%s)" % ','.join(placeholders))
            values_data.extend(value_list)
        sql = '%s%s' % (base_sql, ', '.join(values_sql))
        curs = connection.cursor()
        curs.execute(sql, values_data)

class MyObject(models.Model):
    # model definition as usual... assume:
    foo = models.CharField(max_length=128)
    # custom manager
    objects = MyManager()

MyObject.objects.create_in_bulk([('hello',), ('bye',), ('c',)])
This approach does run the risk of being very specific to a particular database. In our case, we wanted the function to return the IDs just created, so we had a postgres-specific query in the function to generate the requisite number of IDs from the primary key sequence for the table that represents the object. That said, it does perform significantly better in tests versus iterating over the data and issuing separate QuerySet.create() statements.
Here is a way to do batch inserts that still goes through Django's ORM (and thus retains the many benefits the ORM provides). This approach involves subclassing the InsertQuery class as well as creating a custom manager that prepares model instances for insertion into the database in much the same way that Django's save() method does. Most of the code for the BatchInsertQuery class below is straight from the InsertQuery class, with just a few key lines added or modified. To use the batch_insert method, pass in a set of model instances that you want to insert into the database. This approach frees up the code in your views from having to worry about translating model instances into valid SQL values; the manager class in conjunction with the BatchInsertQuery class handles that.
from django.db import models, connection
from django.db.models.sql import InsertQuery

class BatchInsertQuery(InsertQuery):

    ####################################################################
    def as_sql(self):
        """
        Constructs a SQL statement for inserting all of the model
        instances into the database.

        Differences from base class method:
        - The VALUES clause is constructed differently to account for the
          grouping of the values (actually, placeholders) into
          parenthetically-enclosed groups. I.e., VALUES (a,b,c),(d,e,f)
        """
        qn = self.connection.ops.quote_name
        opts = self.model._meta
        result = ['INSERT INTO %s' % qn(opts.db_table)]
        result.append('(%s)' % ', '.join([qn(c) for c in self.columns]))
        result.append('VALUES %s' % ', '.join(
            '(%s)' % ', '.join(values_group)
            for values_group in self.values))  # This line is different
        params = self.params
        if self.return_id and self.connection.features.can_return_id_from_insert:
            col = "%s.%s" % (qn(opts.db_table), qn(opts.pk.column))
            r_fmt, r_params = self.connection.ops.return_insert_id()
            result.append(r_fmt % col)
            params = params + r_params
        return ' '.join(result), params

    ####################################################################
    def insert_values(self, insert_values):
        """
        Adds the insert values to the instance. Can be called multiple
        times for multiple instances of the same model class.

        Differences from base class method:
        - Clears self.columns so that self.columns won't be duplicated
          for each set of inserted_values.
        - Appends the insert_values to self.values instead of extending,
          so that the values (actually the placeholders) remain grouped
          separately for the VALUES clause of the SQL statement.
          I.e., VALUES (a,b,c),(d,e,f)
        - Removes inapplicable code
        """
        self.columns = []  # This line is new
        placeholders, values = [], []
        for field, val in insert_values:
            placeholders.append('%s')
            self.columns.append(field.column)
            values.append(val)
        self.params += tuple(values)
        self.values.append(placeholders)  # This line is different

########################################################################
class ManagerEx(models.Manager):
    """
    Extended model manager class.
    """
    def batch_insert(self, *instances):
        """
        Issues a batch INSERT using the specified model instances.
        """
        cls = instances[0].__class__
        query = BatchInsertQuery(cls, connection)
        for instance in instances:
            values = [(f, f.get_db_prep_save(f.pre_save(instance, True)))
                      for f in cls._meta.local_fields]
            query.insert_values(values)
        return query.execute_sql()

########################################################################
class MyModel(models.Model):
    myfield = models.CharField(max_length=255)
    objects = ManagerEx()

########################################################################
# USAGE:
object1 = MyModel(myfield="foo")
object2 = MyModel(myfield="bar")
object3 = MyModel(myfield="bam")
MyModel.objects.batch_insert(object1, object2, object3)
You might get the performance you need by doing manual transactions. This allows you to create all the inserts in one transaction and then commit them all at once. Hopefully this will help you: http://docs.djangoproject.com/en/dev/topics/db/transactions/
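A sketch using the transaction API of the Django versions current at the time (commit_on_success was superseded by transaction.atomic in Django 1.6):

from django.db import transaction

@transaction.commit_on_success
def create_all(values):
    # Each create() still issues its own INSERT, but they are all
    # committed together at the end instead of once per row.
    for first, second, third in values:
        SomeModel.objects.create(first=first, second=second, third=third)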
No, it is not possible, because Django models are objects rather than tables, so table-level actions are not applicable to them. Django creates an object and then inserts its data into the table; therefore you can't create multiple objects in a single operation.

Fetching ManyToMany objects from multiple objects through intermediate tables

Is there an easy way to fetch the ManyToMany objects from a query that returns more than one object? The way I am doing it now doesn't feel as sexy as I would like it to. Here is how I am doing it now in my view:
contacts = Contact.objects.all()
# Use Custom Manager Method to Fetch Each Contacts Phone Numbers
contacts = PhoneNumber.objects.inject(contacts)
My Models:
class PhoneNumber(models.Model):
    number = models.CharField()
    type = models.CharField()
    # My Custom Manager
    objects = PhoneNumberManager()

class Contact(models.Model):
    name = models.CharField()
    numbers = models.ManyToManyField(PhoneNumber, through='ContactPhoneNumbers')

class ContactPhoneNumbers(models.Model):
    number = models.ForeignKey(PhoneNumber)
    contact = models.ForeignKey(Contact)
    ext = models.CharField()
My Custom Manager:
class PhoneNumberManager(models.Manager):
    def inject(self, contacts):
        contact_ids = ','.join([str(item.id) for item in contacts])
        cursor = connection.cursor()
        cursor.execute("""
            SELECT l.contact_id, l.ext, p.number, p.type
            FROM svcontact_contactphonenumbers l, svcontact_phonenumber p
            WHERE p.id = l.number_id AND l.contact_id IN(%s)
        """ % contact_ids)
        result = {}
        for row in cursor.fetchall():
            id = str(row[0])
            if not id in result:
                result[id] = []
            result[id].append({
                'ext': row[1],
                'number': row[2],
                'type': row[3]
            })
        for contact in contacts:
            id = str(contact.id)
            if id in result:
                contact.phonenumbers = result[id]
        return contacts
There are a couple things you can do to find sexiness here :-)
Django does not have any OOTB way to inject the properties of the through table into your Contact instances. An M2M table with extra data is a SQL concept, so Django wouldn't try to fight the relations, nor guess what should happen in the event of a namespace collision, etc. In fact, I'd go so far as to say that you probably do not want to inject arbitrary model properties onto your Contact object; if you find yourself needing to do that, it's probably a sign you should revise your model definition.
Instead, Django provides convenient ways to access the relation seamlessly, both in queries and for data retrieval, all the while preserving the integrity of the entities. In this case, you'll find that your Contact object offers a contactphonenumbers_set property that you can use to access the through data:
>>> c = Contact.objects.get(id=1)
>>> c.contactphonenumbers_set.all()
# Would produce a list of ContactPhoneNumbers objects for that contact
This means, in your case, that to iterate over all contact phone numbers (for example) you would:
for contact in Contact.objects.all():
    for phone in contact.contactphonenumbers_set.all():
        print phone.number.number, phone.number.type, phone.ext
If you really, really, really want to do the injection for some reason, you'll see you can do that using the 3-line code sample immediately above: just change the print statements into assignment statements.
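As an aside, each pass through that inner loop hits the database again to fetch phone.number; you can ask for the join up front with select_related, e.g.:

for phone in contact.contactphonenumbers_set.select_related('number'):
    print phone.number.number, phone.number.type, phone.ext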
On a separate note, just for future reference, you could have written your inject function without SQL statements. In Django, the through table is itself a model, so you can query it directly:
def inject(self, contacts):
    contact_phone_numbers = ContactPhoneNumbers.objects.\
        filter(contact__in=contacts)
    # And then do the result construction...
    # - use contact_phone_number.number to get the PhoneNumber (its number
    #   and type) and contact_phone_number.ext to get the extension
    # - use contact_phone_number.contact to get the contact instance
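Completing that thought, the ORM-based inject() might look like this sketch (same attribute names as in the question; select_related('number') pulls the PhoneNumber rows in the same query):

def inject(self, contacts):
    links = ContactPhoneNumbers.objects.filter(
        contact__in=contacts).select_related('number')
    result = {}
    for link in links:
        # Group the through-table rows by contact id, as the raw-SQL
        # version did.
        result.setdefault(link.contact_id, []).append({
            'ext': link.ext,
            'number': link.number.number,
            'type': link.number.type,
        })
    for contact in contacts:
        contact.phonenumbers = result.get(contact.id, [])
    return contacts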