Unable to iterate over all objects in table using objects.all() - django

I am writing a migration script that iterates over all objects of a Cassandra model (Cats). There are more than 30,000 Cat objects, but using Cats.objects.all() I am only able to iterate over 10,000 of them.
qs = Cats.objects.all()
print(qs.count()) # returns 30000
print(len(qs)) # returns 10000
Model:
from django_cassandra_engine.models import DjangoCassandraModel
class Cats(DjangoCassandraModel):
...
Cassandra backend used: django-cassandra-engine version 1.6.2

The default fetch size (aka page size) is 10K so you'll only get the first 10K rows returned. If you really want to get all the records in the table, you'll need to override the session defaults:
'cassandra': {
    ...
    'OPTIONS': {
        ...
        'session': {
            ...
            'default_fetch_size': 10000
        }
    }
}
But be careful about setting it to a very high value, because it can overload the coordinator node for the request and affect the performance of your cluster.
You should instead iterate through the results of one page, then request the next page until you've reached the end. Cheers!
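A minimal sketch of that page-by-page approach, using the DataStax cassandra-driver directly; the contact point, keyspace, and table name here are placeholders, not taken from the question:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('my_keyspace')  # placeholder keyspace

# Keep the page size modest; iterating the ResultSet fetches the next
# page transparently, so the loop still covers every row in the table.
statement = SimpleStatement("SELECT * FROM cats", fetch_size=1000)

rows_seen = 0
for row in session.execute(statement):
    rows_seen += 1

print(rows_seen)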

Related

Graphene-Django - how to pass an argument from Query to class DjangoObjectType

First of all, thanks! It has been a year since I last asked a question, as I always found an answer. You're a tremendous help.
Today, I do have a question I cannot sort out myself.
I hope you will be kind enough to help me with the matter.
Context: I work on a project with the Django framework, and I have some dynamic pages made with React.js. The API I'm using in between is GraphQL-based: Apollo for the client, graphene-django for the back end.
I want to build a dynamic page from a GraphQL query that has a nested set (a field declared on the DjangoObjectType class, built from a Django query), and I want to be able to dynamically filter the parent with an argument A and the set with an argument B. My problem is finding a way to pass argument B down to the set in order to filter it.
The GraphQL query I would like to achieve, based on the GraphQL documentation:
query DistributionHisto($id: ID, $limit: Int) {
  distributionHisto(id: $id) {
    id
    historical(limit: $limit) {
      id
      date
      histo
    }
  }
}
But I don't understand how to pass (limit:$limit) to my set in the back end.
Here is my schema.py:
import graphene
from graphene_django.types import DjangoObjectType

class DistributionType(DjangoObjectType):
    class Meta:
        model = DistributionTuple

    historical = graphene.List(HistoricalTimeSeriesType)

    def resolve_historical(self, info):
        return HistoricalTimeSeries.objects.filter(
            distribution_tuple_id=self.id
        ).order_by('date')[:2]

class Query(object):
    distribution_histo = graphene.List(
        graphene.NonNull(DistributionType),
        id=graphene.ID(),
        limit=graphene.Int()
    )

    def resolve_distribution_histo(self, info, id=None, limit=None):
        filter_q1 = {'id': id} if id else {}
        return DistributionTuple.objects.filter(**filter_q1)
I have tried a few things, but I haven't found a way to make it work so far.
At the moment, as you can see, the arg "limit" reaches a dead end in def resolve*, where ideally it would be passed up to the class DistributionSetHistoType, replacing the slice [:2] with [:limit] in resolve_distribution_slice_set().
I hope I have been clear; please let me know if that's not the case.
Thanks for your support.
This topic is called pagination.
Front-end selection:
const { loading, error, data, fetchMore } = useQuery(GET_ITEMS, {
  variables: {
    offset: 0,
    limit: 10
  },
});
Back-end selection:
Slicing with [:10] takes the first 10 elements of the queryset (note that QuerySet.count() does not accept a size argument):
DistributionTuple.objects.filter(**filter_q1)[:10]
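A hedged sketch of one way to get $limit down to the nested field: in graphene you can declare arguments directly on the historical field, and they are passed as keyword arguments to the matching resolver. The default of 2 is an assumption carried over from the original [:2] slice, and the model/type names come from the question's own code.
import graphene
from graphene_django.types import DjangoObjectType

class DistributionType(DjangoObjectType):
    class Meta:
        model = DistributionTuple

    # Declaring `limit` here makes it available as a field argument,
    # so historical(limit: $limit) reaches the resolver below.
    historical = graphene.List(HistoricalTimeSeriesType, limit=graphene.Int())

    def resolve_historical(self, info, limit=2):
        qs = HistoricalTimeSeries.objects.filter(
            distribution_tuple_id=self.id
        ).order_by('date')
        return qs[:limit] if limit else qs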

python code for directory api to batch retrieve all users from domain

Currently I have a method that retrieves all ~119,000 Gmail accounts and writes them to a CSV file using the Python code below, with the Admin SDK enabled and OAuth 2.0:
def get_accounts(self):
    students = []
    page_token = None
    params = {'customer': 'my_customer'}
    while True:
        try:
            if page_token:
                params['pageToken'] = page_token
            current_page = self.dir_api.users().list(**params).execute()
            students.extend(current_page['users'])
            # write each page of data to a file
            csv_file = CSVWriter(students, self.output_file)
            csv_file.write_file()
            # clear the list for the next page of data
            del students[:]
            page_token = current_page.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            break
I would like to retrieve all 119,000 in one lump sum, that is, without having to loop, or as a single batch call. Is this possible, and if so, can you provide example Python code? I have run into communication issues and have to rerun the process multiple times to obtain the ~119,000 accounts successfully (it takes about 10 minutes to download). I would like to minimize communication errors. Please advise if a better method exists or if a non-looping method is possible.
There's no way to do this as a batch because you need to know each pageToken and those are only given as the page is retrieved. However, you can increase your performance somewhat by getting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
Since the default page size when maxResults is not set is 100, adding maxResults: 500 will reduce the number of API calls by a factor of 5. While each call may take slightly longer, you should notice performance increases because you're making far fewer API calls and HTTP round trips.
You should also look at using the fields parameter to only specify user attributes you need to read in the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
    'customer': 'my_customer',
    'maxResults': 500,
    'fields': my_fields
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.
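Putting those suggestions together, a hedged sketch of the paging loop with the larger page size and the restricted field mask; dir_api is the question's Directory API service object, everything else is illustrative:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {'customer': 'my_customer', 'maxResults': 500, 'fields': my_fields}

users = []
page_token = None
while True:
    if page_token:
        params['pageToken'] = page_token
    current_page = dir_api.users().list(**params).execute()
    users.extend(current_page.get('users', []))
    page_token = current_page.get('nextPageToken')
    if not page_token:
        break

# ~119,000 accounts now take roughly 240 calls instead of roughly 1,190.
print(len(users))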

Haystack and Elasticsearch: Limit number of results

I have 2 servers with Haystack:
Server1: this has Elasticsearch installed
Server2: this doesn't have Elasticsearch; the queries are made to Server1
My issue is about pagination when I make queries from Server2 to Server1:
Server2 makes query to Server1
Server1 send all the results back to Server2
Server2 makes the pagination
But this is not optimal: if the query returns 10,000 objects, it will be slow.
I know that you can send Elasticsearch some values in the query (size, from, and to), but I don't know if this is possible using Haystack; I've checked the documentation and googled it and found nothing.
How could I configure the query in Haystack to receive the results 10 by 10?
Edit
Is it possible that if I do SearchQuerySet()[10000:10010] it will only ask for those 10 items?
Or will it ask for all the items and then filter them?
Edit2
I found this in the Haystack docs:
SearchQuery API - set_limits
It seems to be a function that does what I'm trying to do:
Restricts the query by altering either the start, end or both offsets.
And then I tried:
from haystack.query import SearchQuerySet
sqs = SearchQuerySet()
sqs.query.set_limits(low=0, high=4)
sqs.filter(content='anything')
The result is the full list, as if I had never added the set_limits line.
Why is it not working?
Haystack works a bit differently from the Django ORM. After limiting the query, you should call get_results() in order to get the limited results. This is actually smart, because it avoids multiple requests to Elasticsearch.
Example:
# Assume you have 800 records.
sqs = SearchQuerySet()
sqs.query.set_limits(low=0, high=4)
len(sqs) # Will return 800 records
len(sqs.get_results()) # Will return first 4 records.
Hope that it helps.
Adding on to Yigit's answer: if you want to apply these offsets to filtered records, just add the filter condition when you form the SearchQuerySet.
Also remember that once the limits are set, you can't change them by setting them again; you would need to form the SearchQuerySet() again, or use the method that clears the limits.
results = SearchQuerySet().filter(content="keyword")
# we have a filtered result set; now let's find specific records
results.query.set_limits(0, 4)
return results.query.get_results()
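Building on both answers, a hedged sketch of fetching results 10 by 10: form a fresh SearchQuerySet per page (since limits can't be reset once applied), limit it, and read the page with get_results(). The helper name get_page and the PAGE_SIZE constant are illustrative, not Haystack API.
from haystack.query import SearchQuerySet

PAGE_SIZE = 10

def get_page(keyword, page_number):
    # A fresh SearchQuerySet per page, because limits cannot be changed once set.
    sqs = SearchQuerySet().filter(content=keyword)
    low = page_number * PAGE_SIZE
    sqs.query.set_limits(low=low, high=low + PAGE_SIZE)
    return sqs.query.get_results()

first_page = get_page('anything', 0)   # results 0-9
second_page = get_page('anything', 1)  # results 10-19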

Optimizing database queries in Django

I have a bit of code that is causing my page to load pretty slowly (49 queries in 128 ms). This is the landing page for my site -- so it needs to load snappily.
The following is my views.py that creates a feed of latest updates on the site and is causing the slowest load times from what I can see in the Debug toolbar:
def product_feed(request):
    """ Return all site activity from friends, etc. """
    latestparts = Part.objects.all().prefetch_related('uniparts').order_by('-added')
    latestdesigns = Design.objects.all().order_by('-added')
    latest = list(latestparts) + list(latestdesigns)
    latestupdates = sorted(latest, key=lambda x: x.added, reverse=True)
    latestupdates = latestupdates[0:8]
    # only get the unique avatars that we need to put on the page so we're
    # not pinging for images for each update
    uniqueusers = User.objects.filter(id__in=Part.objects.values_list('adder', flat=True))
    return render_to_response("homepage.html", {
        "uniqueusers": uniqueusers,
        "latestupdates": latestupdates
    }, context_instance=RequestContext(request))
The query that takes the most time seems to be:
latest = list(latestparts) + list(latestdesigns) (25 ms)
There are two others, at 17 ms (sitewide announcements) and 25 ms (adding tagged items to each product feed item), that I am also investigating.
Does anyone see any ways in which I can optimize the loading of my activity feed?
You never need more than 8 items, so limit your queries. And don't forget to make sure that added in both models is indexed.
latestparts = Part.objects.all().prefetch_related('uniparts').order_by('-added')[:8]
latestdesigns = Design.objects.all().order_by('-added')[:8]
For bonus marks, eliminate the magic number.
After making those queries a bit faster, you might want to check out memcache to store the most common query results.
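A minimal sketch of that suggestion, with the magic number pulled out into a constant; FEED_SIZE is an illustrative name, not taken from the original code.
# Both querysets are limited in the database, and the magic number 8 is
# replaced with a named constant (FEED_SIZE is illustrative).
FEED_SIZE = 8

latestparts = (Part.objects.all()
               .prefetch_related('uniparts')
               .order_by('-added')[:FEED_SIZE])
latestdesigns = Design.objects.all().order_by('-added')[:FEED_SIZE]

latestupdates = sorted(
    list(latestparts) + list(latestdesigns),
    key=lambda x: x.added,
    reverse=True,
)[:FEED_SIZE]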
Moreover, I believe adder is a ForeignKey to the User model.
Part.objects.distinct().values_list('adder', flat=True)
The line above is a QuerySet of unique adder values, which I believe is exactly what you meant.
It saves you from performing a subquery.
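If you still need the actual User objects for the avatars (as the original view does), a hedged variant that keeps the User queryset but deduplicates inside the subquery:
# Keep the User lookup from the original view, but make the subquery of
# adder ids distinct so each id appears only once.
uniqueusers = User.objects.filter(
    id__in=Part.objects.values_list('adder', flat=True).distinct()
)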

Django caching a large list

My django application deals with 25MB binary files. Each of them has about 100,000 "records" of 256 bytes each.
It takes me about 7 seconds to read the binary file from disk and decode it using python's struct module. I turn the data into a list of about 100,000 items, where each item is a dictionary with values of various types (float, string, etc.).
My django views need to search through this list. Clearly 7 seconds is too long.
I've tried using django's low-level caching API to cache the whole list, but that won't work because there's a maximum size limit of 1MB for any single cached item. I've tried caching the 100,000 list items individually, but that takes a lot more than 7 seconds - most of the time is spent unpickling the items.
Is there a convenient way to store a large list in memory between requests? Can you think of another way to cache the object for use by my django app?
Edit the item size limit to be 10 MB (larger than the 1 MB default): add
-I 10m
to /etc/memcached.conf and restart memcached.
Also edit this class in memcached.py, located in /usr/lib/python2.7/dist-packages/django/core/cache/backends, to look like this:
class MemcachedCache(BaseMemcachedCache):
    "An implementation of a cache binding using python-memcached"

    def __init__(self, server, params):
        import memcache
        memcache.SERVER_MAX_VALUE_LENGTH = 1024 * 1024 * 10  # added limit to accept 10 MB
        super(MemcachedCache, self).__init__(server, params,
                                             library=memcache,
                                             value_not_found_exception=ValueError)
I'm not able to add comments yet, but I wanted to share my quick fix for this problem, since I had the same issue with python-memcached behaving strangely when you change SERVER_MAX_VALUE_LENGTH at import time.
Besides the __init__ edit that FizxMike suggests, you can also edit the _cache property in the same class. Doing so, you can instantiate the python-memcached Client, passing server_max_value_length explicitly, like this:
from django.core.cache.backends.memcached import BaseMemcachedCache

DEFAULT_MAX_VALUE_LENGTH = 1024 * 1024

class MemcachedCache(BaseMemcachedCache):
    def __init__(self, server, params):
        # options from the settings['CACHES'][connection]
        self._options = params.get("OPTIONS", {})
        import memcache
        memcache.SERVER_MAX_VALUE_LENGTH = self._options.get(
            'SERVER_MAX_VALUE_LENGTH', DEFAULT_MAX_VALUE_LENGTH)
        super(MemcachedCache, self).__init__(server, params,
                                             library=memcache,
                                             value_not_found_exception=ValueError)

    @property
    def _cache(self):
        if getattr(self, '_client', None) is None:
            server_max_value_length = self._options.get(
                "SERVER_MAX_VALUE_LENGTH", DEFAULT_MAX_VALUE_LENGTH)
            # one could optionally send more parameters here through the
            # OPTIONS settings; simplified here for brevity
            self._client = self._lib.Client(self._servers,
                                            server_max_value_length=server_max_value_length)
        return self._client
I also prefer to create another backend that inherits from BaseMemcachedCache and use it instead of editing the Django code (see the settings sketch after the reference link below).
Here's the Django memcached backend module for reference:
https://github.com/django/django/blob/master/django/core/cache/backends/memcached.py
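A hedged sketch of wiring such a custom backend up in settings.py; the dotted path myproject.cache.MemcachedCache and the 10 MB limit are assumptions for illustration.
# Hypothetical settings.py entry: point Django's cache at the custom backend
# and pass the larger value limit through OPTIONS.
CACHES = {
    'default': {
        'BACKEND': 'myproject.cache.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
        'OPTIONS': {
            'SERVER_MAX_VALUE_LENGTH': 1024 * 1024 * 10,  # 10 MB, matching -I 10m
        },
    }
}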
Thanks for all the help on this thread!