Haystack and Elasticsearch: Limit number of results - django

I have 2 servers with Haystack:
Server1: This has elasticsearch installed
Server2: This doesn't have elasticsearch, the queries are made to Server1
My issue is about pagination when I make queries from Server2 to Server1:
Server2 makes query to Server1
Server1 send all the results back to Server2
Server2 makes the pagination
But this is not optimal: if the query returns 10,000 objects, the response will be slow.
I know that you can send Elasticsearch some values in the query (size and from), but I don't know if this is possible using Haystack; I've checked the documentation and googled it and found nothing.
How could I configure the query in Haystack to receive the results 10 at a time?
Edit
Is it possible that if I do SearchQuerySet()[10000:10010] it will only ask for those 10 items?
Or will it ask for all the items and then slice them?
Edit2
I found this on Haystack Docs:
SearchQuery API - set_limits
It seems to be a function that does what I'm trying to do:
Restricts the query by altering either the start, end or both offsets.
And then I tried to do:
from haystack.query import SearchQuerySet
sqs = SearchQuerySet()
sqs.query.set_limits(low=0, high=4)
sqs.filter(content='anything')
The result is the full list, as if I had never added the set_limits line.
Why is it not working?

Haystack works a bit differently from the Django ORM. After limiting the query, you should call get_results() in order to get the limited results. This is actually smart, because it avoids multiple requests to Elasticsearch.
Example:
# Assume you have 800 records.
sqs = SearchQuerySet()
sqs.query.set_limits(low=0, high=4)
len(sqs)  # Will return 800 (the full result count)
len(sqs.get_results())  # Will return 4 (only the first 4 records)
Hope that helps.

Adding on to Yigit's answer: if you want these offsets applied to filtered records, just add the filter condition when you form the SearchQuerySet.
Also remember that once the limits are set you can't change them by setting them again; you would need to build the SearchQuerySet() again, or use the query's clear_limits() method to reset them.
results = SearchQuerySet().filter(content="keyword")
# we have a filtered result set; now let's fetch specific records
results.query.set_limits(0, 4)
return results.query.get_results()
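Regarding the edit in the question: a SearchQuerySet is lazy, and slicing it should translate into a limited query against the backend (Haystack fills in the offsets for you), so for plain paging you generally don't need to call set_limits() yourself. A minimal sketch:
from haystack.query import SearchQuerySet

results = SearchQuerySet().filter(content='anything')

# Slicing the lazy SearchQuerySet should only ask Elasticsearch for this
# window of 10 hits (size/from are filled in behind the scenes).
page = results[20:30]

for result in page:
    print(result.object)  # the underlying model instance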

Related

How to fetch a certain number of records from paginated dynamodb table?

I am trying to work with the first 50 records, i.e. the first scan page, returned from the get_paginator method.
This is how I scan through the table and get paginated results, over which I loop and do some post-processing:
dynamo_client = boto3.client('dynamodb')
paginator = dynamo_client.get_paginator("scan")
for page in paginator.paginate(TableName=table_name):
    yield from page["Items"]
Is it possible to only work on, say, the 1st scanned page, and start explicitly from the 2nd page onwards? Summing it up, I am trying to process the first page's results in one Lambda function and the 2nd page onwards in another Lambda function. How can I achieve this?
You need to pass the NextToken to your other Lambda somehow.
On the paginator response there is a NextToken property. You can then pass that as the StartingToken in the PaginationConfig of the paginator.paginate() call.
Somewhat contrived example:
dynamo_client = boto3.client('dynamodb')
paginator = dynamo_client.get_paginator("scan")
token = ""

# Grab the first page
for page in paginator.paginate(TableName=table_name):
    # do some work
    dowork(page["Items"])
    # grab the token
    token = page["NextToken"]
    # stop iterating after the first page for some reason
    break

# This will continue iterating where the last iterator left off
for page in paginator.paginate(TableName=table_name,
                               PaginationConfig={'StartingToken': token}):
    # do some work
    dowork(page["Items"])
Let's say you were trying to use a Lambda to iterate over all the DynamoDB items in a table. You could have the iterator run until a time limit, break, then queue up the next Lambda function, passing along the NextToken for it to resume with.
You can learn more via the API doc which details what this does or see some further examples on GitHub.
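If you would rather not pull the token out of the page yourself, a possible alternative is to let the paginator cap the work with MaxItems and read the iterator's resume_token afterwards. A rough sketch, reusing table_name and dowork from the answer above (the 50-item cap is an assumption taken from the question):
import boto3

dynamo_client = boto3.client("dynamodb")
paginator = dynamo_client.get_paginator("scan")

# First Lambda: consume at most 50 items, then stop.
page_iterator = paginator.paginate(
    TableName=table_name,
    PaginationConfig={"MaxItems": 50, "PageSize": 50},
)
for page in page_iterator:
    dowork(page["Items"])

# Populated once MaxItems is reached; None if the table had nothing left.
token = page_iterator.resume_token

# Second Lambda: resume where the first one stopped.
if token:
    for page in paginator.paginate(
        TableName=table_name,
        PaginationConfig={"StartingToken": token},
    ):
        dowork(page["Items"])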

python code for directory api to batch retrieve all users from domain

Currently I have a method that retrieves all ~119,000 Gmail accounts and writes them to a CSV file, using the Python code below and the enabled Admin SDK + OAuth 2.0:
def get_accounts(self):
    students = []
    page_token = None
    params = {'customer': 'my_customer'}
    while True:
        try:
            if page_token:
                params['pageToken'] = page_token
            current_page = self.dir_api.users().list(**params).execute()
            students.extend(current_page['users'])
            # write each page of data to a file
            csv_file = CSVWriter(students, self.output_file)
            csv_file.write_file()
            # clear the list for the next page of data
            del students[:]
            page_token = current_page.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            break
I would like to retrieve all 119,000 accounts in one lump sum, that is, without having to loop, or as a batch call. Is this possible, and if so, can you provide example Python code? I have run into communication issues and have to rerun the process multiple times to obtain all ~119,000 accounts successfully (it takes about 10 minutes to download), so I would like to minimize communication errors. Please advise if a better or non-looping method exists.
There's no way to do this as a batch because you need to know each pageToken and those are only given as the page is retrieved. However, you can increase your performance somewhat by getting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
since the default page size when maxResults is not set is 100, adding maxResults: 500 will reduce the number of API calls by a factor of 5. While each call may take slightly longer, you should notice a performance increase because you're making far fewer API calls and HTTP round trips.
You should also look at using the fields parameter to only specify user attributes you need to read in the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
    'customer': 'my_customer',
    'maxResults': 500,
    'fields': my_fields
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.
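Putting the two suggestions together, a rough sketch of the paging loop with larger pages, trimmed fields, and googleapiclient's num_retries option to smooth over transient HTTP errors (CSVWriter, self.dir_api and self.output_file are taken from the question's code; the retry count is arbitrary):
from googleapiclient import errors

def get_accounts(self):
    params = {
        'customer': 'my_customer',
        'maxResults': 500,
        'fields': 'nextPageToken,users(primaryEmail,name,suspended)',
    }
    page_token = None
    while True:
        if page_token:
            params['pageToken'] = page_token
        try:
            # num_retries re-issues the request with exponential backoff
            # on transient failures, which should cut down on failed runs
            current_page = self.dir_api.users().list(**params).execute(num_retries=5)
        except errors.HttpError:
            break
        # write each page of data to a file, as in the original loop
        csv_file = CSVWriter(current_page.get('users', []), self.output_file)
        csv_file.write_file()
        page_token = current_page.get('nextPageToken')
        if not page_token:
            break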

Why is the database still being hit even after memcached stores the result?

I am making a few basic queries with ActiveRecord and attempting to store the results in memcached using the dalli gem.
I have the following code:
@page = Rails.cache.fetch 'pages/index' do
  Page.find_by_name('index')
end
@grid_items = Rails.cache.fetch 'pages/index/grid_items' do
  @page.grid_items
end
Storing and retrieving work fine for 'pages/index', but 'pages/index/grid_items' still results in a database query. The following output from my localhost shows this:
Cache read: pages/index
Cache fetch_hit: pages/index
Cache read: pages/index/grid_items
Cache fetch_hit: pages/index/grid_items
GridItem Load (0.7ms) SELECT "grid_items".* FROM "grid_items" WHERE (status = 't') AND (page_id = 1) ORDER BY "grid_items"."position" ASC
I have tried using "includes" instead of "joins", without luck. I have also checked the result of Rails.cache.fetch in the console and it returns the correct data.
The answer was that I needed to make sure that ActiveRecord actually hit the database. I was caching a lazy query object (an ActiveRecord relation), not the results.
To trigger a database call I ended up using this, to cast the query to an array:
@grid_items = Rails.cache.fetch 'pages/index/grid_items' do
  @page.grid_items.live.to_a
end

Optimizing database queries in Django

I have a bit of code that is causing my page to load pretty slowly (49 queries in 128 ms). This is the landing page for my site, so it needs to load snappily.
The following is my views.py code that creates a feed of the latest updates on the site and is causing the slowest load times, from what I can see in the Debug Toolbar:
def product_feed(request):
    """ Return all site activity from friends, etc. """
    latestparts = Part.objects.all().prefetch_related('uniparts').order_by('-added')
    latestdesigns = Design.objects.all().order_by('-added')
    latest = list(latestparts) + list(latestdesigns)
    latestupdates = sorted(latest, key=lambda x: x.added, reverse=True)
    latestupdates = latestupdates[0:8]
    # only get the unique avatars that we need to put on the page so we're not pinging for images for each update
    uniqueusers = User.objects.filter(id__in=Part.objects.values_list('adder', flat=True))
    return render_to_response("homepage.html", {
        "uniqueusers": uniqueusers,
        "latestupdates": latestupdates
    }, context_instance=RequestContext(request))
The line that takes the most time seems to be:
latest = list(latestparts) + list(latestdesigns) (25ms)
There are two others, at 17ms (sitewide announcements) and 25ms (adding tagged items to each product feed item), that I am also investigating.
Does anyone see any ways in which I can optimize the loading of my activity feed?
You never need more than 8 items, so limit your queries. And don't forget to make sure that added in both models is indexed.
latestparts = Part.objects.all().prefetch_related('uniparts').order_by('-added')[:8]
latestdesigns = Design.objects.all().order_by('-added')[:8]
For bonus marks, eliminate the magic number.
After making those queries a bit faster, you might want to check out memcache to store the most common query results.
Moreover, I believe adder is a ForeignKey to the User model.
Part.objects.distinct().values_list('adder', flat=True)
The line above is a QuerySet of unique adder values; I believe that's exactly what you meant.
It saves you performing a subquery.
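Putting both answers together, a sketch of what the view could look like with the limits applied and the magic number pulled out (the model import path and the db_index on added are assumptions):
from itertools import chain

from django.contrib.auth.models import User
from django.shortcuts import render_to_response
from django.template import RequestContext

from .models import Design, Part  # assumed location of the models

FEED_SIZE = 8  # the former magic number

def product_feed(request):
    """ Return the latest site activity, capped at FEED_SIZE items. """
    # Slicing before evaluation adds a LIMIT to each SQL query.
    latestparts = (Part.objects.all()
                   .prefetch_related('uniparts')
                   .order_by('-added')[:FEED_SIZE])
    latestdesigns = Design.objects.all().order_by('-added')[:FEED_SIZE]

    # Merge the two small result sets in Python and keep the newest overall.
    latestupdates = sorted(chain(latestparts, latestdesigns),
                           key=lambda x: x.added, reverse=True)[:FEED_SIZE]

    # Unique adder ids, as suggested above, feeding the avatar lookup.
    uniqueusers = User.objects.filter(
        id__in=Part.objects.distinct().values_list('adder', flat=True))

    return render_to_response("homepage.html", {
        "uniqueusers": uniqueusers,
        "latestupdates": latestupdates,
    }, context_instance=RequestContext(request))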

how to get amazon prices using Boto?

It seems that Boto is the official Amazon API module for Python, and this one is for Tornado, so here are my questions:
Does it offer pagination (requesting only 10 products, since Amazon offers 10 products per page, and I only want to get the first page)? If so, how (sample code?)
How then do I parse the product results? I've used python-amazon-simple-product-api but sadly it doesn't offer pagination, so it keeps iterating over all the offers.
Generally, pagination is performed by the client requesting the API. To do this in boto, you'll need to cut the work up yourself. So for instance, say you make a call to AWS via boto using the get_all_instances method; you'll need to store those results somehow and then keep track of which servers have been displayed and which have not. To my knowledge, boto does not have the LIMIT functionality most devs are used to from MySQL. Personally, I scan all my instances and stash them in Mongo like so:
for r in conn.get_all_instances():  # loop through all reservations
    groups = [g.name for g in r.groups]  # get a list of groups for this reservation
    group_names = ','.join(groups)  # join the groups into a comma-separated list
    for x in r.instances:  # loop through all instances within the reservation
        name = x.tags.get('Name', '')  # get instance name from the 'Name' tag
        new_record = {"tagname": name, "ip_address": x.private_ip_address,
                      "external_ip_nat": x.ip_address, "type": x.instance_type,
                      "state": x.state, "base_image": x.image_id, "placement": x.placement,
                      "public_ec2_dns": x.public_dns_name,
                      "launch_time": x.launch_time, "parent": ObjectId(account['_id'])}
        new_record['groups'] = group_names
        systems_coll.update({'_id': x.id}, {"$set": new_record}, upsert=True)
        error = db.error()
        if error is not None:
            print "err:%s:" % str(error)
You could also wrap these in try/except blocks; up to you. Once you get the data out of boto, it should be trivial to do the cut-up work.
-- Jess
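To make the cut-up work concrete, a minimal client-side paging sketch over results you have already fetched (paginate_list is a hypothetical helper; conn is the boto EC2 connection from the snippet above):
def paginate_list(items, page_size=10):
    # Yield successive page_size-sized chunks from an already-fetched list.
    for start in range(0, len(items), page_size):
        yield items[start:start + page_size]

# Fetch everything once and flatten reservations into instances...
reservations = conn.get_all_instances()
instances = [i for r in reservations for i in r.instances]

# ...then hand out only the page you need, e.g. the first 10.
first_page = next(paginate_list(instances, page_size=10))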