Get all HITs with a certain status - amazon-web-services

The SearchHITs function seems almost useless for doing any actual searching. It merely pulls a listing of your HITs and doesn't allow for any filters. Is the only way to search to iterate through all the results? For example:
my_reviewable_hits = []
for page in range(5, 50):
    res = m.conn.search_hits(sort_direction='Descending', page_size=100, page_number=page)
    for hit in res:
        if hit.HITStatus == 'Reviewable':
            my_reviewable_hits.append(hit)

Yes. You have to iterate through all of them.
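Rather than guessing a fixed page range, you can page until search_hits stops returning full pages. A minimal sketch, assuming the same boto connection m and HITStatus attribute from the question:
def get_reviewable_hits(conn, page_size=100):
    # Page through all HITs; stop once a page comes back short or empty.
    reviewable = []
    page = 1
    while True:
        hits = list(conn.search_hits(sort_direction='Descending',
                                     page_size=page_size, page_number=page))
        reviewable.extend(h for h in hits if h.HITStatus == 'Reviewable')
        if len(hits) < page_size:
            break
        page += 1
    return reviewable

my_reviewable_hits = get_reviewable_hits(m.conn)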

Related

Efficiently updating a large number of records based on a field in that record using Django

I have about a million Comment records that I want to update based on that comment's body field. I'm trying to figure out how to do this efficiently. Right now, my approach looks like this:
update_list = []
qs = Comment.objects.filter(word_count=0)
for comment in qs:
    model_obj = Comment.objects.get(id=comment.id)
    model_obj.word_count = len(model_obj.body.split())
    update_list.append(model_obj)
Comment.objects.bulk_update(update_list, ['word_count'])
However, this hangs and seems to time out in my migration process. Does anybody have suggestions on how I can accomplish this?
It's not easy to determine the memory footprint of a Django object, but an absolute minimum is the amount of space needed to store all of its data. My guess is that you may be running out of memory and page-thrashing.
You probably want to work in batches of, say, 1000 objects at a time. Use Queryset slicing, which returns another queryset. Try something like
BATCH_SIZE = 1000
start = 0
base_qs = Comment.objects.filter(word_count=0)
while True:
    batch_qs = base_qs[start : start + BATCH_SIZE]
    start += BATCH_SIZE
    if not batch_qs.exists():
        break
    update_list = []
    for comment in batch_qs:
        # no need to re-fetch the object; the queryset already yields instances
        comment.word_count = len(comment.body.split())
        update_list.append(comment)
    Comment.objects.bulk_update(update_list, ['word_count'])
    print(f'Processed batch starting at {start}')
Each trip around the loop will free the space occupied by the previous trip when it replaces batch_qs and update_list. The print statement will allow you to watch it progress at a hopefully acceptable, regular rate!
Warning - I have never tried this. I'm also wondering whether slicing and filtering will play nice with each other. Django won't let you call filter() on a queryset once a slice has been taken, so if you instead wanted to slice your way through the entire table, you would have to do the per-row filtering in Python:
base_qs = Comment.objects.all()
...
while True:
    batch_qs = base_qs[start : start + BATCH_SIZE]
    ...
    for comment in batch_qs:
        if comment.word_count == 0:
            ...
This way you are slicing through rows of the entire DB table and updating only the subset of each slice that needs it. This feels "safer". Anybody know for sure?
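A way to sidestep both concerns is to paginate by primary key instead of by offset, so the batches stay stable even as updated rows drop out of the word_count=0 filter. A minimal sketch, assuming the Comment model from the question and an ordinary auto-incrementing primary key:
BATCH_SIZE = 1000
last_pk = 0
while True:
    # Re-run the filter each pass; ordering by pk and resuming from the
    # last seen pk means rows dropping out of the filter as they are
    # updated can never cause other rows to be skipped.
    batch = list(
        Comment.objects.filter(word_count=0, pk__gt=last_pk)
        .order_by('pk')[:BATCH_SIZE]
    )
    if not batch:
        break
    for comment in batch:
        comment.word_count = len(comment.body.split())
    Comment.objects.bulk_update(batch, ['word_count'])
    last_pk = batch[-1].pk
    print(f'Processed {len(batch)} comments up to pk {last_pk}')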

SuiteScript 2.0: Are there any search result limitations when executing a saved search via "getInputData" stage of map/reduce script?

I am currently building a map/reduce script in NetSuite which passes the results of a saved search from the getInputData stage to the map stage. This is being done by first running a WHILE loop in the getInputData stage to obtain the internal ids of each entry, inserting into an array, then passing over to the map stage. Like so:
// run saved search - unlimited rows from saved search.
var results = [];
var start = 0;
var pageSize = 1000; // getRange returns at most 1000 results per call
var count;
do {
    var subresults = invoiceSearch.run().getRange({ start: start, end: start + pageSize });
    results = results.concat(subresults);
    count = subresults.length;
    start += pageSize; // end is exclusive, so advance by pageSize, not pageSize + 1
} while (count == pageSize);
var invSearchArray = [];
if (invoiceSearch) {
    // NOTE: .run().each has a limit of 4,000 results, hence the do-while loop above.
    for (var i = 0; i < results.length; i++) {
        var invObj = {};
        invObj['invID'] = results[i].getValue({ name: 'internalid' });
        invSearchArray.push(invObj);
    }
}
return invSearchArray;
I implemented it this way because I feared there would be result restrictions, just as the ".run().each" function has (limited to 4000 results).
I made the assumption that passing the search object directly from getInputData to map would be restricted to 4,000 results as well. Can someone offer clarity on whether there are such restrictions? Am I right to fear the script halting prematurely because search results beyond 4,000 cannot be processed in the getInputData stage of a map/reduce script?
Any example to aid me in understanding how a search object is processed in a map/reduce script would be most appreciated.
Thanks
If you simply return the Search instance, all results will be passed along to map, beyond the 1000 or 4000 limits of the getRange and each methods.
If the Search has 8500 results, all 8500 will get passed to map.
function getInputData() {
    return search.load(...); // alternatively search.create(...)
}

How to make an MPMediaQuery to return results based on releaseDate

I'm trying to create an MPMediaQuery that will return results in chronological order, preferably ascending or descending based on the query itself.
Right now my query returns items in ascending chronological order (oldest at the top), but I'd like to be able to reverse the order.
I need my results to be in an MPMediaQuery so that I can use it with the MPMediaPlayer.
let qryPodcasts = MPMediaQuery()
let titleFilter = MPMediaPropertyPredicate(value: "This American Life",
                                           forProperty: MPMediaItemPropertyPodcastTitle,
                                           comparisonType: .equalTo)
qryPodcasts.addFilterPredicate(titleFilter)
EDIT:
I've been able to get a step closer to my goal; however, I still have a problem that causes a crash.
I've added this code, which sorts the resulting query based on releaseDate. However, not every podcast sets this property, so it can be nil, causing a crash:
let myItems = qryPodcasts.items?.sorted { ($0.releaseDate)! > ($1.releaseDate)! }
let podCollection = MPMediaItemCollection(items: myItems!)
myMP.setQueue(with: podCollection)
How can I avoid this error and how do I handle items without a releaseDate?
Just for this part:
How can I avoid this error and how do I handle items without a releaseDate?
Avoid using forced-unwrapping (!) and provide default values for nil:
let myItems = qryPodcasts.items?.sorted { ($0.releaseDate ?? Date.distantFuture) > ($1.releaseDate ?? Date.distantFuture) }
You'd better check all the other forced unwrappings in your code: are you really 100% sure those values can never be nil?

How to make this django attribute name search better?

lcount = Open_Layers.objects.all()
form = SearchForm()
if request.method == 'POST':
    form = SearchForm(request.POST)
    if form.is_valid():
        data = form.cleaned_data
        val = form.cleaned_data['LayerName']
        a = Open_Layers()
        data = []
        for e in lcount:
            if e.Layer_name == val:
                data = val
        return render_to_response('searchresult.html', {'data': data})
    else:
        form = SearchForm()
else:
    return render_to_response('mapsearch.html', {'form': form})
This only returns a result if a particular name matches exactly. How do I change it so that a search for "Park" returns Park1, Park2, Parking, Parkin, i.e. all occurrences of "park"?
You can improve your searching logic by using a list to accumulate the results and the re module to match a larger set of words.
However, this is still pretty limited, error-prone, and hard to maintain, let alone evolve. Plus you'll never get results as nice as you would from a real search engine.
So instead of trying to manually reinvent the wheel, the car and the highway, you should spend some time setting up haystack. This is now the de facto standard for search in Django.
Use Whoosh as a backend at first; it's easier to set up. If your search gets slow, replace it with Solr.
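For orientation, a minimal sketch of the haystack-plus-Whoosh wiring in settings.py (the index path is illustrative; BASE_DIR is the usual Django settings convention):
# settings.py - minimal django-haystack + Whoosh wiring (a sketch)
import os

INSTALLED_APPS += ['haystack']  # add haystack to your existing apps

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        'PATH': os.path.join(BASE_DIR, 'whoosh_index'),  # illustrative index location
    },
}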
EDIT:
A simple, clean alternative:
Open_Layers.objects.filter(Layer_name__icontains=val)
This will perform a SQL LIKE, adding the % wildcards for you.
This is going to kill your database if used too often, but I guess that's probably not an issue with your current project.
BTW, you probably want to rename Open_Layers to OpenLayers, as this is the Python PEP 8 naming convention.
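Dropped into the question's view, the whole loop collapses to one queryset (a sketch, assuming the Layer_name field from the question):
if form.is_valid():
    val = form.cleaned_data['LayerName']
    # Case-insensitive substring match: 'Park' finds Park1, Parking, Parkin, ...
    data = Open_Layers.objects.filter(Layer_name__icontains=val)
    return render_to_response('searchresult.html', {'data': data})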
Instead of
if e.Layer_name == val:
    data = val
use
if val in e.Layer_name:
    data.append(e.Layer_name)
(and you don't need the line data = form.cleaned_data)
I realise this is an old post, but anyway:
There's a fuzzy logic string comparison already in the python standard library.
import difflib
Mainly have a look at:
difflib.SequenceMatcher(None, a='string1', b='string2', autojunk=True).ratio()
more info here:
http://docs.python.org/library/difflib.html#sequencematcher-objects
It returns a ratio of how close the two strings are, between zero and one. So instead of testing whether they're equal, you choose a similarity threshold.
One thing to watch out for: you may want to convert both strings to lower case first.
string1.lower()
Also note you may want to implement your favourite method of splitting the string, i.e. .split() or something using re, so that a search for 'David' against 'David Brent' ranks higher.
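Pulling those pieces together, a minimal sketch of a fuzzy name search (the fuzzy_matches helper and the 0.6 threshold are illustrative, assuming objects with a Layer_name attribute as in the question):
import difflib

def fuzzy_matches(query, layers, threshold=0.6):
    # Compare lowercased strings; keep layers whose similarity ratio clears
    # the threshold, or whose name contains the query as a whole word so
    # that 'david' still matches 'David Brent'.
    query = query.lower()
    scored = []
    for layer in layers:
        name = layer.Layer_name.lower()
        ratio = difflib.SequenceMatcher(None, query, name).ratio()
        if ratio >= threshold or query in name.split():
            scored.append((ratio, layer.Layer_name))
    return [name for ratio, name in sorted(scored, reverse=True)]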

Why does django-sphinx only output 20 results? How can I get the rest?

Doing a search using django-sphinx gives me results._sphinx that says there were 68 results, but when I iterate over them, I can only get at the first 20 of them.
I'm SURE there's a way around this, and that this is by design, but it's officially stumping the heck out of me. Does anybody know how to get the complete queryset?
I figured this out finally.
Apparently, the querysets only return 20 hits unless you slice them explicitly. Or something like that.
So, if you want to iterate over the whole thing, you have to do:
for result in results[0:results.count()]:
    print result
Or something to that effect, which will query the entire thing explicitly. Ugh. This should be clearly documented... but it's not.
After hacking through the source, I set the _limit variable explicitly. It does the job, and issues an actual LIMIT:
qs = MyEntity.search.query(query_string)
qs._limit = limit
for result in qs:
    print result
This works for me:
In the Sphinx config file:
max_matches = 5000
In the Django code:
desc_obj = Dictionary.search.query(search_desc)
desc_obj._maxmatches = 5000
Or in settings:
SPHINX_MAX_MATCHES = 5000