How do I perform a batch insert in Django? - django

In MySQL, you can insert multiple rows into a table in one query for n > 0:
INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9), ..., (n-2, n-1, n);
Is there a way to achieve the above with Django queryset methods? Here's an example:
values = [(1, 2, 3), (4, 5, 6), ...]
for value in values:
    SomeModel.objects.create(first=value[0], second=value[1], third=value[2])
I believe the above is calling an insert query for each iteration of the for loop. I'm looking for a single query, is that possible in Django?

These answers are outdated. bulk_create was added in Django 1.4:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create

I recently looked for such a thing myself (inspired by QuerySet.update(), as I imagine you are too). To my knowledge, no bulk create exists in the current production framework (1.1.1 as of today). We ended up creating a custom manager for the model that needed bulk-create, and created a function on that manager to build an appropriate SQL statement with the sequence of VALUES parameters.
Something like (apologies if this does not work... hopefully I've adapted this runnably from our code):
from django.db import models, connection
class MyManager(models.Manager):
    def create_in_bulk(self, values):
        base_sql = "INSERT INTO tbl_name (a,b,c) VALUES "
        values_sql = []
        values_data = []
        for value_list in values:
            placeholders = ['%s' for i in range(len(value_list))]
            values_sql.append("(%s)" % ','.join(placeholders))
            values_data.extend(value_list)
        sql = '%s%s' % (base_sql, ', '.join(values_sql))
        curs = connection.cursor()
        curs.execute(sql, values_data)
class MyObject(models.Model):
    # model definition as usual... assume:
    foo = models.CharField(max_length=128)
    # custom manager
    objects = MyManager()
MyObject.objects.create_in_bulk( [('hello',), ('bye',), ('c', )] )
This approach does run the risk of being very specific to a particular database. In our case, we wanted the function to return the IDs just created, so we had a postgres-specific query in the function to generate the requisite number of IDs from the primary key sequence for the table that represents the object. That said, it does perform significantly better in tests versus iterating over the data and issuing separate QuerySet.create() statements.
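As a rough, database-only illustration of the statement that create_in_bulk builds, here is a minimal sketch using Python's built-in sqlite3 module instead of Django's connection. The helper name build_bulk_insert is hypothetical, and sqlite3 uses "?" placeholders where the answer's code uses "%s".

```python
import sqlite3

def build_bulk_insert(table, columns, rows):
    """Return (sql, params) for a single multi-row INSERT statement.

    table and columns are interpolated directly, so they must come from
    trusted code, never from user input. Only the values are parameterised.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    # One "(?, ?, ?)" group per row.
    group = "(" + ", ".join("?" for _ in columns) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join(group for _ in rows))
    # Flatten the row tuples into one parameter list, in row order.
    params = [value for row in rows for value in row]
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_name (a INTEGER, b INTEGER, c INTEGER)")
sql, params = build_bulk_insert(
    "tbl_name", ["a", "b", "c"], [(1, 2, 3), (4, 5, 6), (7, 8, 9)])
conn.execute(sql, params)  # one statement, three rows
conn.commit()
inserted = conn.execute("SELECT a, b, c FROM tbl_name ORDER BY a").fetchall()
```

Databases cap the number of bound parameters per statement, so a very large batch would probably need to be split into chunks.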

Here is a way to do batch inserts that still goes through Django's ORM (and thus retains the many benefits the ORM provides). This approach involves subclassing the InsertQuery class as well as creating a custom manager that prepares model instances for insertion into the database in much the same way that Django's save() method does. Most of the code for the BatchInsertQuery class below is straight from the InsertQuery class, with just a few key lines added or modified. To use the batch_insert method, pass in a set of model instances that you want to insert into the database. This approach frees the code in your views from having to translate model instances into valid SQL values; the manager class, in conjunction with the BatchInsertQuery class, handles that.
from django.db import models, connection
from django.db.models.sql import InsertQuery
class BatchInsertQuery( InsertQuery ):
    ####################################################################
    def as_sql(self):
        """
        Constructs a SQL statement for inserting all of the model instances
        into the database.
        Differences from base class method:
        - The VALUES clause is constructed differently to account for the
          grouping of the values (actually, placeholders) into
          parenthetically-enclosed groups. I.e., VALUES (a,b,c),(d,e,f)
        """
        qn = self.connection.ops.quote_name
        opts = self.model._meta
        result = ['INSERT INTO %s' % qn(opts.db_table)]
        result.append('(%s)' % ', '.join([qn(c) for c in self.columns]))
        result.append( 'VALUES %s' % ', '.join( '(%s)' % ', '.join(
            values_group ) for values_group in self.values ) ) # This line is different
        params = self.params
        if self.return_id and self.connection.features.can_return_id_from_insert:
            col = "%s.%s" % (qn(opts.db_table), qn(opts.pk.column))
            r_fmt, r_params = self.connection.ops.return_insert_id()
            result.append(r_fmt % col)
            params = params + r_params
        return ' '.join(result), params
    ####################################################################
    def insert_values( self, insert_values ):
        """
        Adds the insert values to the instance. Can be called multiple times
        for multiple instances of the same model class.
        Differences from base class method:
        - Clears self.columns so that self.columns won't be duplicated for each
          set of inserted_values.
        - Appends the insert_values to self.values instead of extending it, so
          that the values (actually the placeholders) remain grouped separately
          for the VALUES clause of the SQL statement. I.e., VALUES (a,b,c),(d,e,f)
        - Removes inapplicable code
        """
        self.columns = [] # This line is new
        placeholders, values = [], []
        for field, val in insert_values:
            placeholders.append('%s')
            self.columns.append(field.column)
            values.append(val)
        self.params += tuple(values)
        self.values.append( placeholders ) # This line is different
########################################################################
class ManagerEx( models.Manager ):
    """
    Extended model manager class.
    """
    def batch_insert( self, *instances ):
        """
        Issues a batch INSERT using the specified model instances.
        """
        cls = instances[0].__class__
        query = BatchInsertQuery( cls, connection )
        for instance in instances:
            values = [ (f, f.get_db_prep_save( f.pre_save( instance, True ) ) ) \
                for f in cls._meta.local_fields ]
            query.insert_values( values )
        return query.execute_sql()
########################################################################
class MyModel( models.Model ):
    myfield = models.CharField(max_length=255)
    objects = ManagerEx()
########################################################################
# USAGE:
object1 = MyModel(myfield="foo")
object2 = MyModel(myfield="bar")
object3 = MyModel(myfield="bam")
MyModel.objects.batch_insert(object1,object2,object3)

You might get the performance you need by doing manual transactions. What this will allow you to do is to create all the inserts in one transaction, then commit the transaction all at once. Hopefully this will help you: http://docs.djangoproject.com/en/dev/topics/db/transactions/
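To illustrate the idea outside Django, here is a minimal sketch with Python's built-in sqlite3 module (the table and column names are borrowed from the question's SQL example). Each row is still a separate INSERT statement, but all of them run inside one transaction that is committed once at the end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_name (a INTEGER, b INTEGER, c INTEGER)")
values = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# Using the connection as a context manager wraps the block in a single
# transaction: it commits if the block succeeds and rolls back on an exception.
with conn:
    for value in values:
        conn.execute("INSERT INTO tbl_name (a, b, c) VALUES (?, ?, ?)", value)

row_count = conn.execute("SELECT COUNT(*) FROM tbl_name").fetchone()[0]
```

This reduces the per-commit overhead, but it does not reduce the number of statements sent to the database, so it is not the same thing as a single multi-row INSERT.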

No, it is not possible, because Django models are objects rather than tables, so table actions are not applicable to Django models. Django creates an object and then inserts its data into the table, so you can't create multiple objects at one time.

Related

How to create a customized filter search function in Django?

I am trying to create a filter search bar that I can customize. For example, if I type a value into a search bar, then it will query a model and retrieve a list of instances that match the value. For example, here is a view:
class StudentListView(FilterView):
    template_name = "leads/student_list.html"
    context_object_name = "leads"
    filterset_class = StudentFilter
    def get_queryset(self):
        return Lead.objects.all()
and here is my filters.py:
class StudentFilter(django_filters.FilterSet):
    class Meta:
        model = Lead
        fields = {
            'first_name': ['icontains'],
            'email': ['exact'],
        }
So far, I can only create a filter search bar that returns a list of instances matching first_name or email (which are fields in the Lead model). However, this does not allow me to do more complicated tasks. Let's say I added time to the filter fields, and I would like to filter the Lead model not only by the time value I submitted, but also to include other Lead instances whose time value is near the one I submitted. Basically, I want something like the def form_valid() used in views, where I can query, calculate, and even alter the submitted values.
Moreover, if possible, I would like to create a filter field that is not necessarily an actual field in a model. Then, I would like to use the submitted value to do some calculations as I filter for the list of instances. If you have any questions, please ask me in the comments. Thank you.
You can do just about anything by defining a method on the filterset to map the user's input onto a queryset. Here's one I did earlier. Code much cut down ...
The filter coat_info_contains is defined as a CharFilter, but it is further parsed by the method which splits it into a set of substrings separated by commas. These substrings are then used to generate Q elements (OR logic) to match a model if the substring is contained in any of three model fields coating_1, coating_2 and coating_3
This filter is not implicitly connected to any particular model field. The connection is through the method= specification of the filter to the filterset's method, which can return absolutely any queryset on the model that can be programmed.
Hope I haven't cut out anything vital.
import django_filters as FD
from django.db.models import Q
class MemFilter( FD.FilterSet):
    class Meta:
        model = MyModel
        # fields = [fieldname, ... ] # default filters created for these. Not required if all declarative.
        # fields = { fieldname: [lookup_expr_1, ...], ...} # for specifying possibly multiple lookup expressions
        fields = {
            'ft':['gte','lte','exact'], 'mt':['gte','lte','exact'],
            ...
        }
    # declarative filters. Lots and lots of
    ...
    coat_info_contains = FD.CharFilter( field_name='coating_1',
        label='Coatings contain',
        method='filter_coatings_contains'
    )
    ...
    def filter_coatings_contains( self, qs, name, value):
        values = value.split(',')
        qlist = []
        for v in values:
            qlist.append(
                Q(coating_1__icontains = v) |
                Q(coating_2__icontains = v) |
                Q(coating_3__icontains = v) )
        return qs.filter( *qlist )
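To make the matching rule concrete without a database, here is a plain-Python sketch of the same semantics over dictionaries. Passing several Q objects to filter() combines them with AND, so every comma-separated substring must appear in at least one of the three coating fields. The function name matches_coatings and the sample records are hypothetical, and unlike the method above this sketch strips whitespace and skips empty substrings.

```python
def matches_coatings(record, value,
                     fields=("coating_1", "coating_2", "coating_3")):
    """Return True if every comma-separated term occurs in at least one field."""
    terms = [t.strip().lower() for t in value.split(",") if t.strip()]
    # all(...) is the AND across terms; any(...) is the OR across fields.
    return all(
        any(term in (record.get(field) or "").lower() for field in fields)
        for term in terms
    )

records = [
    {"id": 1, "coating_1": "AR", "coating_2": "Gold", "coating_3": ""},
    {"id": 2, "coating_1": "Silver", "coating_2": "", "coating_3": "AR broadband"},
    {"id": 3, "coating_1": "Gold", "coating_2": "", "coating_3": ""},
]

both_terms = [r["id"] for r in records if matches_coatings(r, "ar, gold")]
one_term = [r["id"] for r in records if matches_coatings(r, "ar")]
```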

Django - annotate against a prefetch QuerySet?

Is it possible to annotate/count against a prefetched query?
My initial query below is based on circuits. I then realised that if a site does not have any circuits, I won't have a 'None' category, which would show a site as down.
conn_data = Circuits.objects.all() \
    .values('circuit_type__circuit_type') \
    .exclude(active_link=False) \
    .annotate(total=Count('circuit_type__circuit_type')) \
    .order_by('circuit_type__monitor_priority')
So I changed to querying sites and using prefetch, which now has an empty circuits_set for any site that does not have an active link. Is there a Django way of creating the new totals against that circuits_set within conn_data? I was going to loop through all the sites manually and add the totals that way but wanted to know if there was a way to do this within the QuerySet instead?
My end result should look something like:
[
{'circuit_type__circuit_type': 'Fibre', 'total': 63},
{'circuit_type__circuit_type': 'DSL', 'total': 29},
{'circuit_type__circuit_type': 'None', 'total': 2}
]
prefetch query:
conn_data = SiteData.objects.prefetch_related(
    Prefetch(
        'circuits_set',
        queryset=Circuits.objects.exclude(
            active_link=False).select_related('circuit_type'),
    )
)
I don't think this will work. It's debatable whether it should work. Let's refer to what prefetch_related does.
Returns a QuerySet that will automatically retrieve, in a single batch, related objects for each of the specified lookups.
So what happens here is that two queries are dispatched and two lists are realized. These lists are then partitioned in memory and grouped to the correct parent records.
Count() and annotate() are directives to the DBMS that resolve to SQL
Select Count(id) from conn_data
Because of the way annotate and prefetch_related work, I think it's unlikely they will play nicely together. prefetch_related is just a convenience, though. From a practical perspective, running two separate ORM queries and assigning the results to SiteData records yourself is effectively the same thing. So something like ...
# Gets all active Circuits, counted and grouped by SiteData
Circuits.objects.exclude(active_link=False).values('sitedata_id').annotate(total=Count('sitedata_id'))
Then you just loop over your SiteData records and assign the counts.
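The "loop and assign" step can be sketched in plain Python. The rows below stand in for the output of the grouped count query and for the SiteData records; all field names and values here (sitedata_id, total, active_circuits, the site names) are hypothetical.

```python
# Stand-in for the grouped count query: one row per site that has active circuits.
count_rows = [
    {"sitedata_id": 1, "total": 2},
    {"sitedata_id": 3, "total": 1},
]

# Stand-in for the SiteData records.
sites = [
    {"id": 1, "name": "A"},
    {"id": 2, "name": "B"},
    {"id": 3, "name": "C"},
]

# Build a lookup once, then assign in a single pass over the sites.
totals = {row["sitedata_id"]: row["total"] for row in count_rows}
for site in sites:
    # Sites missing from the counts have no active circuit, so default to 0.
    site["active_circuits"] = totals.get(site["id"], 0)

down_sites = [site["name"] for site in sites if site["active_circuits"] == 0]
```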
OK, I got what I wanted with this. There is probably a better way of doing it, but it works nevertheless:
from collections import Counter
import operator
from django.db.models import Prefetch, Q
class ConnData(object):
    def __init__(self, priority='', c_type='', count=0 ):
        self.priority = priority
        self.c_type = c_type
        self.count = count
    def __repr__(self):
        return '{} {}'.format(self.__class__.__name__, self.c_type)
# get all the site data
conn_data = SiteData.objects.exclude(Q(site_type__site_type='Data Centre') | Q(site_type__site_type='Factory')) \
    .prefetch_related(
        Prefetch(
            'circuits_set',
            queryset=Circuits.objects.exclude(active_link=False).select_related('circuit_type'),
        )
    )
# create a list for the conns
conns = []
# add items to list of dictionaries with all required fields
for conn in conn_data:
    try:
        conn_type = conn.circuits_set.all()[0].circuit_type.circuit_type
        priority = conn.circuits_set.all()[0].circuit_type.monitor_priority
        conns.append({'circuit_type' : conn_type, 'priority' : priority})
    except (IndexError, AttributeError):
        # no active circuit (or no circuit type): create category for down sites
        conns.append({'circuit_type' : 'Down', 'priority' : 10})
# create new list for class data
conn_counts = []
# create counter data
conn_count_data = Counter(((d['circuit_type'], d['priority']) for d in conns))
# loop through counter data and add classes to list
for val, count in conn_count_data.items():
    cc = ConnData()
    cc.priority = val[1]
    cc.c_type = val[0]
    cc.count = count
    conn_counts.append(cc)
# sort the classes by priority
conn_counts = sorted(conn_counts, key=operator.attrgetter('priority'))

Django, general version of prefetch_related()?

Of course, I don't mean to do what prefetch_related does already.
I'd like to mimic what it does.
What I'd like to do is the following.
I have a list of MyModel instances.
A user either follows or doesn't follow each instance.
my_models = MyModel.objects.filter(**kwargs)
for my_model in my_models:
    my_model.is_following = Follow.objects.filter(user=user, target_id=my_model.id, target_content_type=MY_MODEL_CTYPE)
Here I have an n+1 query problem, and I think I can borrow what prefetch_related does. The description of prefetch_related says that it performs the query for all objects and, when the related attribute is required, gets it from the pre-performed queryset.
That's exactly what I'm after: perform the is_following query once for all the objects I'm interested in, and use that result instead of N individual queries.
One additional aspect is that I'd like to attach a queryset rather than the actual value, so that I can defer evaluation until pagination.
If that statement is too ambiguous: I'd like to pass the my_models queryset, with the is_following information attached, to another function (a DRF serializer, for instance).
How does prefetch_related accomplish something like above?
A solution where you can get only the is_following bit is possible with a subquery via .extra.
class MyModelQuerySet(models.QuerySet):
    def annotate_is_following(self, user):
        return self.extra(
            select = {'is_following': 'EXISTS( \
                SELECT `id` FROM `follow` \
                WHERE `follow`.`target_id` = `mymodel`.id \
                AND `follow`.`user_id` = %s)' % user.id
            }
        )
class MyModel(models.Model):
    objects = MyModelQuerySet.as_manager()
Usage:
my_models = MyModel.objects.filter(**kwargs).annotate_is_following(request.user)
Now another solution where you can get a whole list of following objects.
Because you have a GFK in the Follow class you need to manually create a reverse relation via GenericRelation. Something like:
class MyModelQuerySet(models.QuerySet):
    def with_user_following(self, user):
        return self.prefetch_related(
            Prefetch(
                'following',
                queryset=Follow.objects.filter(user=user) \
                    .select_related('user'),
                to_attr='following_user'
            )
        )
class MyModel(models.Model):
    following = GenericRelation(Follow,
        content_type_field='target_content_type',
        object_id_field='target_id',
        related_query_name='mymodels'
    )
    objects = MyModelQuerySet.as_manager()
    def get_first_following_object(self):
        if hasattr(self, 'following_user') and len(self.following_user) > 0:
            return self.following_user[0]
        return None
Usage:
my_models = MyModel.objects.filter(**kwargs).with_user_following(request.user)
Now you have access to the following_user attribute, a list with all Follow objects per MyModel instance, or you can use a method like get_first_following_object.
I'm not sure if this is the best approach, and I doubt this is what prefetch_related does, because I'm joining here.
I found there's a way to select extra columns in your query.
extra_select = """
    EXISTS(SELECT * FROM follow_follow
    WHERE follow_follow.target_object_id = myapp_mymodel.id AND
    follow_follow.target_content_type_id = %s AND
    follow_follow.user_id = %s)
"""
qs = self.extra(
    select={'is_following': extra_select},
    select_params=[CONTENT_TYPE_ID, user.id]
)
So you can do this with a join.
The prefetch_related way of doing it would be to run a separate queryset and look the attribute up in its results.
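The "separate query, then look it up in memory" idea can be sketched in plain Python. Here follow_rows stands in for the result of one query over the Follow table; in Django that query would presumably look like Follow.objects.filter(user=user, target_id__in=ids, target_content_type=MY_MODEL_CTYPE). The function name attach_is_following and the dictionary shapes are hypothetical. Note that this sketch attaches plain booleans eagerly, so it does not give the deferred evaluation the question asks for.

```python
def attach_is_following(items, follow_rows, user_id):
    """Set item['is_following'] for every item using one pass over follow_rows."""
    item_ids = {item["id"] for item in items}
    # One "query" for all items: the set of ids this user follows.
    followed = {
        row["target_id"]
        for row in follow_rows
        if row["user_id"] == user_id and row["target_id"] in item_ids
    }
    # Per-item work is now a set lookup instead of a database query.
    for item in items:
        item["is_following"] = item["id"] in followed
    return items

my_models = [{"id": 1}, {"id": 2}, {"id": 3}]
follow_rows = [
    {"user_id": 7, "target_id": 1},
    {"user_id": 7, "target_id": 3},
    {"user_id": 8, "target_id": 2},
]
result = attach_is_following(my_models, follow_rows, user_id=7)
flags = [item["is_following"] for item in result]
```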

Django model: change db_table dynamically

I have a large number of sets of data. Each set of data comprises several database tables. The schema for the sets of database tables is identical. Each set of tables can have over a million rows. Each set of data belongs to one job, and there are no relations between jobs. One or more jobs belong to a different user. Sets of tables get imported and eventually deleted as a set of tables. From a performance point of view, it is better to keep them as separate sets of tables.
So I would like to have several generic Django models, one for each of the several tables.
I have achieved it in my views.py file by using code similar to this:
from foobar.models import Foo, Bar
def my_view(request):
    prefix = request.GET.get('prefix')
    Foo._meta.db_table = prefix + '_foo'
    Bar._meta.db_table = prefix + '_bar'
    ....
    foobar_list = Foo.objects.filter(bar_id=myval)
    ...
My questions are: Is it safe to use this code with concurrent multiple users of a Django based web application? Are the models objects shared across users? What would happen if there were two requests simultaneously?
EDIT NO 2: I have considered Lie Ryan's answer and the comments and come up with this code:
from django.http import HttpResponse, HttpResponseNotFound
from django.db import models
from django.template import RequestContext, loader
def getModels(prefix):
    table_map = {}
    table_map["foo"] = type(str(prefix + '_foo'), (models.Model,), {
        '__module__': 'foobar.models',
        'id' : models.IntegerField(primary_key=True),
        'foo' : models.TextField(blank=True),
    })
    table_map["foo"]._meta.db_table = prefix + '_foo'
    table_map["bar"] = type(str(prefix + '_bar'), (models.Model,), {
        '__module__': 'foobar.models',
        'id' : models.IntegerField(primary_key=True),
        'foo' : models.ForeignKey(prefix + '_foo', null=True, blank=True),
    })
    table_map["bar"]._meta.db_table = prefix + '_bar'
    return table_map
def foobar_view(request):
    prefix = request.GET.get('prefix')
    if prefix is not None and prefix.isdigit():
        table_map = getModels(prefix)
        foobar_list = table_map["bar"].objects.order_by('foo__foo')
        template = loader.get_template('foobar/foobar.html')
        context = RequestContext(request, {
            'foobar_list': foobar_list,
        })
        return HttpResponse(template.render(context))
    else:
        return HttpResponseNotFound('<h1>Page not found</h1>')
Now my question is, is this second draft of the edited code safe with concurrent multiple users?
This technique is called sharding. No, it is not safe to do this if you serve concurrent requests with threads.
What you can do is to dynamically construct multiple classes pointing to different db_tables, and use a factory to select the right class.
tables = ["foo", "bar"]
table_map = {}
for tbl in tables:
    class T(models.Model):
        ... table definition ...
        class Meta:
            db_table = tbl
    table_map[tbl] = T
And then create a function that selects the right table_map based on how you shard your data.
Also be careful of injection if you accept table name from user input.
Alternatively, some database systems like PostgreSQL allow multiple schemas per database, which might be a better way to separate your data in certain circumstances.
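A minimal sketch of the factory idea, using plain Python classes as stand-ins for Django models so that it runs on its own. The names get_table_class, ALLOWED_TABLES, and the class naming scheme are hypothetical. The two points it tries to show are validating the user-supplied prefix before it reaches a table name, and caching one class per (prefix, table) behind a lock so that concurrent requests never mutate a shared class.

```python
import re
import threading

ALLOWED_TABLES = ("foo", "bar")
_PREFIX_RE = re.compile(r"[0-9]+")
_class_cache = {}
_cache_lock = threading.Lock()

def get_table_class(prefix, table):
    """Return the class bound to '<prefix>_<table>', creating it at most once."""
    # Reject anything that is not purely digits to guard against injection.
    if not _PREFIX_RE.fullmatch(prefix):
        raise ValueError("invalid prefix: %r" % (prefix,))
    if table not in ALLOWED_TABLES:
        raise ValueError("unknown table: %r" % (table,))
    key = (prefix, table)
    with _cache_lock:
        cls = _class_cache.get(key)
        if cls is None:
            # With Django, the base would be models.Model and db_table would
            # live in an inner Meta class; a plain attribute stands in here.
            cls = type("T%s_%s" % (prefix, table), (object,),
                       {"db_table": "%s_%s" % (prefix, table)})
            _class_cache[key] = cls
    return cls

foo_12 = get_table_class("12", "foo")
foo_34 = get_table_class("34", "foo")
```

Because each prefix gets its own class, a request for prefix 12 cannot change the table that a simultaneous request for prefix 34 is querying.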

Django aggregate multiple columns after arithmetic operation

I have a really strange problem with Django 1.4.4.
I have this model :
class LogQuarter(models.Model):
    timestamp = models.DateTimeField()
    domain = models.CharField(max_length=253)
    attempts = models.IntegerField()
    success = models.IntegerField()
    queue = models.IntegerField()
    ...
I need to gather the first 20 domains with the highest sent property. The sent property is attempts - queue.
This is my request:
obj = LogQuarter.objects\
    .aggregate(Sum(F('attempts')-F('queue')))\
    .values('domain')\
    .filter(**kwargs)\
    .order_by('-sent')[:20]
I tried with extra too and it isn't working.
It's really basic SQL, so I am surprised that Django can't do this.
Does anyone have a solution?
You can actually do this by subclassing some of the aggregation functionality. This requires digging into the code to really understand, but here's what I coded up to do something similar for MAX and MIN. (Note: this code is based on Django 1.4 / MySQL.)
Start by subclassing the underlying aggregation class and overriding the as_sql method. This method writes the actual SQL to the database query. We have to make sure to quote the field that gets passed in correctly and associate it with the proper table name.
from django.db.models.sql import aggregates
class SqlCalculatedSum(aggregates.Aggregate):
    sql_function = 'SUM'
    sql_template = '%(function)s(%(field)s - %(other_field)s)'
    def as_sql(self, qn, connection):
        # self.col is currently a tuple, where the first item is the table name and
        # the second item is the primary column name. Assuming our calculation is
        # on two fields in the same table, we can use that to our advantage. qn is
        # underlying DB quoting object and quotes things appropriately. The column
        # entry in the self.extra var is the actual database column name for the
        # secondary column.
        self.extra['other_field'] = '.'.join(
            [qn(c) for c in (self.col[0], self.extra['column'])])
        return super(SqlCalculatedSum, self).as_sql(qn, connection)
Next, subclass the general model aggregation class and override the add_to_query method. This method is what determines how the aggregate gets added to the underlying query object. We want to be able to pass in the field name (e.g. queue) but get the corresponding DB column name (in case it is something different).
from django.db import models
class CalculatedSum(models.Aggregate):
    name = SqlCalculatedSum
    def add_to_query(self, query, alias, col, source, is_summary):
        # Utilize the fact that self.extra is set to all of the extra kwargs passed
        # in on initialization. We want to get the corresponding database column
        # name for whatever field we pass in to the "variable" kwarg.
        self.extra['column'] = query.model._meta.get_field(
            self.extra['variable']).db_column
        query.aggregates[alias] = self.name(
            col, source=source, is_summary=is_summary, **self.extra)
You can then use your new class in an annotation like this:
queryset.annotate(calc_attempts=CalculatedSum('attempts', variable='queue'))
Assuming your attempts and queue fields have those same db column names, this should generate SQL similar to the following:
SELECT SUM(`LogQuarter`.`attempts` - `LogQuarter`.`queue`) AS calc_attempts
And there you go.
I am not sure you can do Sum(F('attempts')-F('queue')); it should throw an error in the first place. I guess an easier approach would be to use extra.
result = LogQuarter.objects.extra(select={'sent':'(attempts-queue)'}, order_by=['-sent'])[:20]
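For reference, here is the SQL the question seems to be after, sketched against an in-memory SQLite table with Python's built-in sqlite3 module. The table name logquarter and the sample rows are made up for illustration; the real table name depends on the Django app label. As far as I know, Django 1.8 and later accept expressions inside aggregates, so something like .values('domain').annotate(sent=Sum(F('attempts') - F('queue'))) should produce an equivalent query there, but that does not apply to 1.4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logquarter "
    "(domain TEXT, attempts INTEGER, success INTEGER, queue INTEGER)")
conn.executemany(
    "INSERT INTO logquarter (domain, attempts, success, queue) VALUES (?, ?, ?, ?)",
    [
        ("a.example", 10, 8, 2),
        ("a.example", 5, 5, 0),
        ("b.example", 20, 1, 19),
        ("c.example", 7, 7, 1),
    ],
)

# Sum (attempts - queue) per domain and keep the 20 largest totals.
top_domains = conn.execute(
    """
    SELECT domain, SUM(attempts - queue) AS sent
    FROM logquarter
    GROUP BY domain
    ORDER BY sent DESC
    LIMIT 20
    """
).fetchall()
```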