Django ORM Performance Issues

I am looping over 13,000 in-memory city names and generating queries to filter for something. I've encountered something I cannot explain...
When the loop has a single line:
cities = City.objects.filter(name__iexact=city)
performance is almost 800 items/second
When the loop measures the length of the returned set...
cities = City.objects.filter(name__iexact=city)
num_cities = len(cities)
performance drops to 8 items/second
I can't explain where the performance degradation occurs. Obviously, I'm missing something... Why would counting the number of items in an in-memory array that always holds between 0 and 3 items reduce performance by a factor of 100?

Django querysets are lazy, so QuerySet.filter does not actually evaluate the queryset, i.e. it does not run the query against the database. When you call len() on it, the queryset is evaluated: all matching rows are fetched from the database just to produce a count. That is why the second version is so much slower.
You'll get far better performance if you run the COUNT at the database level:
num_cities = City.objects.filter(name__iexact=city).count()
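For illustration, here is a minimal sketch of the loop with the count pushed to the database; city_names stands in for the in-memory list of 13,000 names from the question and is an assumed name:

# Sketch only: `city_names` is assumed to be the in-memory list from the question.
# .count() issues SELECT COUNT(*) instead of fetching every matching row.
for city in city_names:
    num_cities = City.objects.filter(name__iexact=city).count()
    # If you only need to know whether any match exists, .exists() is cheaper still:
    # has_match = City.objects.filter(name__iexact=city).exists()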

Related

Django query using a large amount of memory

I have a query that is causing memory spikes in my application. The code below is designed to show a single record, but occasionally shows 5 to 10 records. The problem is there are edge cases where 100,000 matching results trigger MultipleObjectsReturned. I believe this causes the high memory usage. The code is:
try:
    record = record_class.objects.get(**filter_params)
    context["record"] = record
except record_class.MultipleObjectsReturned:
    records = record_class.objects.filter(**filter_params)
    template_path = "record/%(type)s/%(type)s_multiple.html" % {"type": record_type}
    return render(request, template_path, {"records": records}, current_app=record_type)
I thought about adding a slice at the end of the filter query, so it looks like this:
records = record_class.objects.filter(**filter_params)[:20]
But the code still seems slow. Is there a way to limit the results to 20 in a way that does not load the entire query or cause high memory usage?
As the Django documentation says:
Use a subset of Python’s array-slicing syntax to limit your QuerySet to a certain number of results. This is the equivalent of SQL’s LIMIT and OFFSET clauses.
For example, this returns the first 5 objects (LIMIT 5):
Entry.objects.all()[:5]
So it seems that "limiting the results to 20 in a way that does not load the entire query" is already being fulfilled.
So your code is slow for some other reason, or maybe you are measuring its performance in the wrong way.
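One way to convince yourself of this, as a rough sketch using the names from the question: the slice adds LIMIT 20 to the SQL, and nothing is fetched until the queryset is iterated.

records = record_class.objects.filter(**filter_params)[:20]
print(records.query)    # the generated SQL should end with LIMIT 20
for record in records:  # only here does the (limited) query actually run
    ...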

Quantity of database queries vs application memory performance using Django

If I need a total of all objects in a queryset as well as a slice of field values from those objects, which option would be better considering speed and application memory use (I am using a PostgreSQL backend)?
Option a:
def get_data():
    queryset = MyObject.objects.all()
    total_objects = queryset.count()
    thumbs = queryset[:5].values_list('thumbnail', flat=True)
    return {'total_objects': total_objects, 'thumbs': thumbs}
Option b:
def get_data():
    objects = list(MyObject.objects.all())
    total_objects = len(objects)
    thumbs = [o.thumbnail for o in objects[:5]]
    return {'total_objects': total_objects, 'thumbs': thumbs}
If I understand things correctly, and certainly correct me if I am wrong:
Option a: It will hit the database twice and will keep only total_objects (an integer) and thumbs (a list of strings) in memory.
Option b: It will hit the database once and will keep a list of all the objects with all of their field data in memory, in addition to the items from option a.
Considering these options, and that there are potentially millions of instances of MyObject: is the speed cost of a second database hit (option a) preferable to the memory consumption of a single database hit (option b)?
My priority is for overall speed in returning the data, but I am concerned about the larger memory consumption slowing things down even more than the extra database hit.
Using SQL is the fastest method and will always beat the Python equivalent, even if it hits the database more. The difference is negligible in comparison. Remember, that's what SQL is meant to do - be fast and efficient.
Anyway, running a thousand loops using timeit, these are the results:
In [8]: %timeit get_data1() # Using ORM
1000 loops, best of 3: 628 µs per loop
In [9]: %timeit get_data2() # Using python
1000 loops, best of 3: 1.54 ms per loop
As you can see, the first method takes 628 microseconds per loop, while the second one takes 1.54 milliseconds. That's almost 2.5 times as much! A clear winner.
I used an SQLite database with only 100 objects in it (I used autofixture to spam the models). I'm guessing PostgreSQL will return different results, but I am still in favor of the first one.
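For reference, one way such a benchmark might be run from a Django shell, assuming get_data1 and get_data2 are defined as option a and option b from the question:

# Sketch only: times 1000 calls of each variant and prints the average per call.
import timeit

orm_time = timeit.timeit(get_data1, number=1000)     # option a: count() + sliced values_list
python_time = timeit.timeit(get_data2, number=1000)  # option b: list() everything, then len()
print(orm_time / 1000, python_time / 1000)           # average seconds per call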

SQL Server CE Delete Performance

I am using SQL Server CE 4.0 and am getting poor DELETE query performance.
My table has 300,000 rows in it.
My query is:
DELETE from tableX
where columnName1 = '<some text>' AND columnName2 = '<some other text>'
I am using a non-clustered index on the 2 fields columnName1 and columnName2.
I noticed that when the number of rows to delete is small (say < 2000), the index can help performance by 2-3X. However, when the number of rows to delete is larger (say > 15000), the index does not help at all.
My theory for this behavior is that when the number of rows is large, the index maintenance is killing the gains achieved by using the index (index seek instead of table scan). Is this correct?
Unfortunately, I can't get rid of the index because it significantly helps non-mutating query performance.
Also, what else can I do to improve the delete performance for the > 15,000 row case?
I am using SQL Server CE 4.0 on Windows 7 (32-bit).
My application is written in C++ and uses the OLE DB interface to manipulate the database.
There is something known as "the tipping point" where the cost of locating individual rows using a seek is not worth it, and it is easier to just perform a single scan instead of thousands of seeks.
A couple of things you may consider for performance:
have a filtered index, if those are supported in CE (I honestly have no idea)
instead of deleting 15,000 rows at once, batch the deletes into chunks (see the sketch after this list).
consider a "soft delete" - where you simply update an active column to 0. Then you can actually delete the rows in smaller batches in the background. I mean, is a user really sitting around and waiting actively for you to delete 15,000+ rows? Why?
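Here is a rough sketch of the batching idea. It is illustrative only: the original application is C++/OLE DB, but the pattern is language-agnostic, so it is shown in Python against a generic DB-API connection. `conn` and the `id` primary-key column on tableX are assumptions.

# Collect the keys of the rows to delete, then remove them in bounded chunks.
BATCH = 1000

cur = conn.cursor()
cur.execute(
    "SELECT id FROM tableX WHERE columnName1 = ? AND columnName2 = ?",
    ("<some text>", "<some other text>"),
)
ids = [row[0] for row in cur.fetchall()]

for start in range(0, len(ids), BATCH):
    chunk = ids[start:start + BATCH]
    placeholders = ",".join("?" for _ in chunk)
    cur.execute("DELETE FROM tableX WHERE id IN (%s)" % placeholders, chunk)
    conn.commit()  # committing per batch keeps each transaction small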

Concurrent query performance in Amazon Redshift

On Amazon Redshift, do concurrent queries affect each other's performance?
For example, let's say there are two queries: one on a relatively small table (~5m rows) retrieving all rows, and another on a large table (~500m rows). Both tables have the same fields, and neither uses compression. Both queries retrieve all data in their respective tables to compute their results. There are no joins or filters. Both queries retrieve about 2-4 fields for their computations.
Running by itself, the small query returns in about 700ms. However, while the large query is running (which by itself takes a few minutes), the small query returns in 4-6 seconds.
This is the observed behavior on a cluster with a single XL node.
Is this the expected behavior? Is there a configuration setting that will promise performance consistency of the small query, even if the large query is running?
Copy-pasted from: https://forums.aws.amazon.com/thread.jspa?threadID=137540#
I've performed some concurrent query benchmarking.
I created a straightforward query which by itself took about a minute to run. I then ran one of those queries at once, then two, then three, etc., and timed each query.
Each query basically halved database performance, which is what you'd expect: double the load, halve the performance. Actually, it's a bit better than halving; you get about an extra 10% of performance.
This behaviour held true up to 5 concurrent queries, which is the maximum number of concurrent queries configured on the database I was working with. If I ran six queries, the final query could not execute until one of the first queries had finished and freed up a slot.
Finally, vacuum acts much like a normal query: it halves performance. It's not special. Actually, vacuum is something more than halving; it's equivalent to a pretty heavy query.
There are no guarantees, because all of this is running on a fixed number of CPUs. With a fixed capacity for work, increasing the load lowers per-query throughput. The short answer is to get a bigger machine (i.e. more nodes).
The specifics for your case are covered here:
https://forums.aws.amazon.com/message.jspa?messageID=437015#
http://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html
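If you want to reproduce the measurement yourself, a rough sketch of timing the small query from Python follows; psycopg2 and the connection details are assumptions, and you would run it once with the cluster idle and once while the large query is running in another session.

# Sketch only: placeholder host, credentials, table, and column names.
import time
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="user", password="...")
cur = conn.cursor()

start = time.perf_counter()
cur.execute("SELECT field1, field2 FROM small_table")  # assumed table/columns
cur.fetchall()
print("small query took %.3f s" % (time.perf_counter() - start))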

Querying geonames for rows 1000-1200

I have been querying GeoNames for parks per state. Mostly there are under 1000 parks per state, but I just queried Connecticut, and there are just under 1200 parks there.
I already got rows 1-1000 with this query:
http://api.geonames.org/search?featureCode=PRK&username=demo&country=US&style=full&adminCode1=CT&maxRows=1000
But increasing maxRows to 1200 gives an error that I am querying for too many at once. Is there a way to query for rows 1000-1200?
I don't really see how to do it with their API.
Thanks!
You should be using the startRow parameter in the query to page results. The documentation notes that it takes an integer value (0-based indexing) and should be
Used for paging results. If you want to get results 30 to 40, use startRow=30 and maxRows=10. Default is 0.
So to get the next 1000 data points (1000-1999), you should change your query to
http://api.geonames.org/search?featureCode=PRK&username=demo&country=US&style=full&adminCode1=CT&maxRows=1000&startRow=1000
I'd suggest reducing the maxRows to something manageable as well - something that will put less of a load on their servers and make for quicker responses to your queries.
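As a small paging sketch, assuming the requests library and the parameters from the question (and noting that the demo username will not work for real use, so substitute your own):

import requests

BASE = "http://api.geonames.org/searchJSON"
params = {"featureCode": "PRK", "country": "US", "adminCode1": "CT",
          "style": "full", "username": "demo", "maxRows": 200}

rows, start = [], 0
while True:
    params["startRow"] = start
    data = requests.get(BASE, params=params).json()
    batch = data.get("geonames", [])
    rows.extend(batch)
    if len(batch) < params["maxRows"]:
        break          # fewer rows than requested means we've reached the end
    start += params["maxRows"]

print(len(rows), "parks fetched")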