Ordering by sum of difference - django

I have a model that has one attribute with a list of floats:
values = ArrayField(models.FloatField(default=0), default=list, size=64, verbose_name=_('Values'))
Currently, I'm getting my entries and order them according to the sum of all diffs with another list:
def diff(l1, l2):
return sum([abs(v1-v2) for v1, v2 in zip(l1, l2)])
list2 = [0.3, 0, 1, 0.5]
entries = Model.objects.all()
entries.sort(key=lambda t: diff(t.values, list2))
This works fast if my numer of entries is very slow small. But I'm afraid with a large number of entries, the comparison and sorting of all the entries will get slow since they have to be loaded from the database. Is there a way to make this more efficient?

best way is to write it yourself, right now you are iterating over a list over 4 times!
although this approach looks pretty but it's not good.
one thing that you can do is:
have a variable called last_diff and set it to 0
iterate through all entries.
iterate though each entry.values
from i = 0 to the end of list, calculate abs(entry.values[i]-list2[i])
sum over these values in a variable called new_diff
if new_diff > last_diff break from inner loop and push the entry into its right place (it's called Insertion Sort, check it out!)
in this way, in average scenario, time complexity is much lower than what you are doing now!
and maybe you must be creative too. I'm gonna share some ideas, check them for yourself to make sure that they are fine.
assuming that:
values list elements are always positive floats.
list2 is always the same for all entries.
then you may be able to say, the bigger the sum over the elements in values, the bigger the diff value is gonna be, no matter what are the elements in list2.
then you might be able to just forget about whole diff function. (test this!)

The only way to makes this really go faster, is to move as much work as possible to the database, i.e. the calculations and the sorting. It wasn't easy, but with the help of this answer I managed to actually write a query for that in almost pure Django:
class Unnest(models.Func):
function = 'UNNEST'
class Abs(models.Func):
function = 'ABS'
class SubquerySum(models.Subquery):
template = '(SELECT sum(%(field)s) FROM (%(subquery)s) _sum)'
x = [0.3, 0, 1, 0.5]
pairdiffs = Model.objects.filter(pk=models.OuterRef('pk')).annotate(
pairdiff=Abs(Unnest('values')-Unnest(models.Value(x, ArrayField(models.FloatField())))),
).values('pairdiff')
entries = Model.objects.all().annotate(
diff=SubquerySum(pairdiffs, field='pairdiff')
).order_by('diff')
The unnest function turns each element of the values into a row. In this case it happens twice, but the two resulting columns are instantly subtracted and made positive. Still, there are as many rows per pk as there are values. These need to be summed, but that's not as easy as it sounds. The column can't be simply be aggregated. This was by far the most tricky part—even after fiddling with it for so long, I still don't quite understand why Postgres needs this indirection. Of the few options there are to make it work, I believe a subquery is the single one expressible in Django (and only as of 1.11).
Note that the above behaves exactly the same as with zip, i.e. the when one array is longer than the other, the remainder is ignored.
Further improvements
While it will be a lot faster already when you don't have to retrieve all rows anymore and loop over them in Python, it doesn't change yet that it results in a full table scan. All rows will have to be processed, every single time. You can do better, though. Have a look into the cube extension. Use it to calculate the L1 distance—at least, that seems what you're calculating—directly with the <#> operator. That will require the use of RawSQL or a custom Expression. Then add a GiST index on the SQL expression cube("values"), or directly on the field if you're able to change the type from float[] to cube. In case of the latter, you might have to implement your own CubeField too—I haven't found any package yet that provides it. In any case, with all that in place, top-N queries on the lowest distance will be fully indexed hence blazing fast.

Related

Slow insertion using Neptune and Gremlin

I'm having problems with the insertion using gremlin to Neptune.
I am trying to insert many nodes and edges, potentially hundred thousands of nodes and edges, with checking for existence.
Currently, we are using inject to insert the nodes, and the problem is that it is slow.
After running the explain command, we figured out that the problem was the coalesce and the where steps - it takes more than 99.9% of the run duration.
I want to insert each node and edge only if it doesn’t exist, and that’s why I am using the coalesce and where steps.
For example, the query we use to insert nodes with inject:
properties_list = [{‘uid’:’1642’},{‘uid’:’1322’}…]
g.inject(properties_list).unfold().as_('node')
.sideEffect(__.V().where(P.eq('node')).by(‘uid).fold()
.coalesce(__.unfold(), __.addV(label).property(Cardinality.single,'uid','1')))
With 1000 nodes in the graph and properties_list with 100 elements, running the query above takes around 30 seconds, and it gets slower as the number of nodes in the graph increases.
Running a naive injection with the same environment as the query above, without coalesce and where, takes less than 1 second.
I’d like to hear your suggestions and to know what are the best practices for inserting many nodes and edges (with checking for existence).
Thank you very much.
If you have a set of IDs that you want to check for existence, you can speed up the query significantly by also providing just a list of IDs to the query and calculating the intersection of the ones that exist upfront. Then, having calculated the set that need updates you can just apply them in one go. This will make a big difference. The reason you are running into problems is that the mid traversal V has a lot of work to do. In general it would be better to use actual IDs rather than properties (UID in your case). If that is not an option the same technique will work for property based IDs. The steps are:
Using inject or sideEffect insert the IDs to be found as one list and the corresponding map containing the changes to conditionally be applied in a separate map.
Find the intersection of the ones that exist and those that do not.
Using that set of non existing ones, apply the updates using the values in the set to index into your map.
Here is a concrete example. I used the graph-notebook for this but you can do the same thing in code:
Given:
ids = "['1','2','9998','9999']"
and
data = "[['id':'1','value':'XYZ'],['id':'9998','value':'ABC'],['id':'9999','value':'DEF']]"
we can do something like this:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by()
which correctly finds the ones that do not already exist:
{'id': 9998, 'value': 'ABC'}
{'id': 9999, 'value': 'DEF'}
You can use this pattern to construct your conditional inserts a lot more efficiently (I hope :-) ). So to add the new vertices you might do:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by().
addV('test').
property(id,select('d').select('id')).
property('value',select('d').select('value'))
v[9998]
v[9999]
As a side note, we are adding two new steps to Gremlin - mergeV and mergeE that will allow this to be done much more easily and in a more declarative style. Those new steps should be part of the TinkerPop 3.6 release.

Elasticsearch scoring on multiple indexes: dfs_query_then_fetch returns the same scores as query_then_fetch

I have multiple indices in Elasticsearch (and the corresponding documents in Django created using django-elasticsearch-dsl). All of the indices have these settings:
settings = {'number_of_shards': 1,
'number_of_replicas': 0}
Now, I am trying to perform a search across all the 10 indices. In order to retrieve consistent scoring between the results from different indices, I am using dfs_query_then_fetch:
search = Search(index=['mov*'])
search = search.params(search_type='dfs_query_then_fetch')
objects = search.query("multi_match", query='Tom & Jerry', fields=['title', 'actors'])
I get bad results due to inconsistent scoring. A book called 'A story of Jerry and his friend Tom' from one index can be ranked higher than the cartoon 'Tom & Jerry' from another index. The reason is that dfs_query_then_fetch is not working. When I remove it or substitute with the simple query_then_fetch, I get absolutely the same results with the identical scoring.
I have tested it on URI requests as well, and I always get the same scores for both search types.
What can be the reason for it?
UPDATE: The results are actually not the same, but they are only really slightly different, e.g. a score of 50.1 with dfs and 50.0 without dfs, while the same model within one index has a score of 80.0.
If the number of shards is 1, then dfs_query_then_fetch and query_then_fetch will return the same result. DFS query will do a query to all shards and then show you results based on the scores computed, but in this case there is only one shard.
Regarding the scoring, you might wanna have a look at your actors field too. Also, do let us know what are the analyzer and tokenizer if you have used custom ones?

Boost::accumulator's percentile giving wrong values

I am using boost::accumulators::tag::extended_p_square_quantile for calculating percentile. In this, I also need to feed probabilities to the accumulator so I did this m_acc = AccumulatorType(boost::accumulators::extended_p_square_probabilities = probs); where probs is a vector containing the probabilities.
Values in the prob vector are {0.5,0.3,0.9,0.7}
I provided some sample values to accumulator.
But when I try to get the percentile using boost::accumulators::quantile(m_acc, boost::accumulators::quantile_probability = probs[0]); it returns incorrect values and even nan sometimes.
What is wrong here?
I ran into this problem and wasted lot of time to figured out the problem and therefore want to answer this.
Problem is with the vector. Vector should be shorted in increasing order of its values.
Change the vector values to this {0.3,0.5,0.7,0.9} and it will work as expected.
So if someone is using tag::extended_p_square_quantile for percentile(which supports multiple probabilities) then (s)he needs to give probabilities(vector/array/list) in sorted order.
This isn't the case with tag::p_square_quantile because we can give only one value(probability) in it.

How to optimize use of querysets with lists

I have a model that has a couple million objects. Each object represents a call made/received by a company.
To simplify things, let's say this model, Call, has these fields:
calldate, context, channel.
My goal is to know the average # of calls made and received during each hour of the day of the month (load by hour). The catch is: I need to find this for port1 and port2 separately.
As of now, my code works fine, except that it takes around 1 whole minute to give me the result for a range of 4 months and I it seems extremely inefficient.
I've done some simple profiling and discovered that the extend is taking around 99% of the processing time:
queryset = Call.objects.filter(calldate__gte='SOME_DATE')
port1, port2 = [],[]
port1.extend(queryset.filter(context__icontains="e1-1"))
port2.extend(queryset.filter(context__icontains="e1-2"))
channels_in_port1 = ["Port/%d-2" % x for x in range(1,32)]
channels_in_port2 = ["Port/%d-2" % x for x in range(32,63)]
for i in channels_in_port1:
port1.extend(queryset.filter(channel__icontains=i))
for i in channels_in_port2:
port2.extend(queryset.filter(channel__icontains=i))
port1 and port2 have around 150k objects combined now.
As soon as I have all calls for port1 and port2, I'm good to go. The rest of the code is basically some for loops for port1 and port2 that sums up and takes the average of calls according to the hour/day/month. Trivial stuff.
I tried to avoid using any "extend" by using itertools.chain and chaining the querysets instead. However, that made the processing time shift to the part where I do the trivial for loops to calculate the load by hour.
Any alternatives? Better ways to filter the queryset?
Thanks very much!!
Have you considered using django's aggregate functions? http://docs.djangoproject.com/en/dev/topics/db/aggregation/
I presume your problem is with the second set of extends, ie those within the for loops, rather than the first. (The first is completely unnecessary, in any case: rather than defining an empty list up front and extending it, you can just do port1 = list(queryset.filter(context__icontains="e1-1")).)
Anyway, to summarize what I think you are trying to do: you want to get all Call objects for a certain date, in two blocks depending on the value for channel: one where it contains values from 0 to 31, and one with values between 32 and 62.
It seems like you could do this with just two queries, without any extending at all:
port1 = queryset.filter(channel__range=["Port/1-2", "Port/31-2"])
port2 = queryset.filter(channel__range=["Port/1-32", "Port/31-62"])
Does that not do what you want?
Edit in response to comment but that's then just two queries which you can extend, or concatenate. The problem with your code as posted is that you are doing 31 queries and extend operations for each port, which is bound to be expensive. If you just do one each, plus one extend/concat, that will be much cheaper.

recursively find subsets

Here is a recursive function that I'm trying to create that finds all the subsets passed in an STL set. the two params are an STL set to search for subjects, and a number i >= 0 which specifies how big the subsets should be. If the integer is bigger then the set, return empty subset
I don't think I'm doing this correctly. Sometimes it's right, sometimes its not. The stl set gets passed in fine.
list<set<int> > findSub(set<int>& inset, int i)
{
list<set<int> > the_list;
list<set<int> >::iterator el = the_list.begin();
if(inset.size()>i)
{
set<int> tmp_set;
for(int j(0); j<=i;j++)
{
set<int>::iterator first = inset.begin();
tmp_set.insert(*(first));
the_list.push_back(tmp_set);
inset.erase(first);
}
the_list.splice(el,findSub(inset,i));
}
return the_list;
}
From what I understand you are actually trying to generate all subsets of 'i' elements from a given set right ?
Modifying the input set is going to get you into trouble, you'd be better off not modifying it.
I think that the idea is simple enough, though I would say that you got it backwards. Since it looks like homework, i won't give you a C++ algorithm ;)
generate_subsets(set, sizeOfSubsets) # I assume sizeOfSubsets cannot be negative
# use a type that enforces this for god's sake!
if sizeOfSubsets is 0 then return {}
else if sizeOfSubsets is 1 then
result = []
for each element in set do result <- result + {element}
return result
else
result = []
baseSubsets = generate_subsets(set, sizeOfSubsets - 1)
for each subset in baseSubssets
for each element in set
if no element in subset then result <- result + { subset + element }
return result
The key points are:
generate the subsets of lower rank first, as you'll have to iterate over them
don't try to insert an element in a subset if it already is, it would give you a subset of incorrect size
Now, you'll have to understand this and transpose it to 'real' code.
I have been staring at this for several minutes and I can't figure out what your train of thought is for thinking that it would work. You are permanently removing several members of the input list before exploring every possible subset that they could participate in.
Try working out the solution you intend in pseudo-code and see if you can see the problem without the stl interfering.
It seems (I'm not native English) that what you could do is to compute power set (set of all subsets) and then select only subsets matching condition from it.
You can find methods how to calculate power set on Wikipedia Power set page and on Math Is Fun (link is in External links section on that Wikipedia page named Power Set from Math Is Fun and I cannot post it here directly because spam prevention mechanism). On math is fun mainly section It's binary.
I also can't see what this is supposed to achieve.
If this isn't homework with specific restrictions i'd simply suggest testing against a temporary std::set with std::includes().