Splitting a list based on another list's values in Mathematica

In Mathematica I have a list of point coordinates
size = 50; n = 20; (* n matches the length of clusterIndices below *)
points = Table[{RandomInteger[{0, size}], RandomInteger[{0, size}]}, {i, 1, n}];
and a list of cluster indices these points belong to
clusterIndices = {1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1};
What is the easiest way to split the points into two separate lists based on the clusterIndices values?
EDIT:
The solution I came up with:
pointIndices =
Map[#[[2]] &,
GatherBy[MapIndexed[{#1, #2[[1]]} &, clusterIndices], First],
{2}];
pointsByCluster = Map[Part[points, #] &, pointIndices];
Is there a better way to do this?

As @High Performance Mark and @Nicholas Wilson said, I'd start by combining the two lists via Transpose or Thread. In this case,
In[1]:= Transpose[{clusterIndices, points}]==Thread[{clusterIndices, points}]
Out[1]= True
At one point, I looked at which was faster, and I think Thread is marginally faster. But, it only really matters when you are using very long lists.
@High Performance Mark makes a good point in suggesting Select. But it would only allow you to pull out a single cluster at a time. The code for selecting cluster 1 is as follows:
Select[Transpose[{clusterIndices, points}], #[[1]]==1& ][[All, All, 2]]
Since you seem to want to generate all clusters, I'd suggest doing the following:
GatherBy[Transpose[{clusterIndices, points}], #[[1]]& ][[All, All, 2]]
which has the advantage of being a one-liner; the only tricky part is selecting the correct Part of the resulting list. The trick in determining how many All terms are necessary is to note that
Transpose[{clusterIndices, points}][[All,2]]
is required to get the points back out of the transposed list. But, the "clustered" list has one additional level, hence the second All.
It should be noted that the second parameter of GatherBy is a function that accepts one parameter, and it can be interchanged with any function you wish to use. As such, it is very useful. However, if you'd like to transform your data as you're gathering it, I'd look at Reap and Sow.
Edit: Reap and Sow are somewhat underused and fairly powerful. They're somewhat confusing to use, but I suspect GatherBy is implemented using them internally. For instance,
Reap[Sow[#[[2]], #[[1]]] & /@ Transpose[{clusterIndices, points}], _, #2 &]
does the same thing as my previous code without the hassle of stripping the indices off the points. Essentially, Sow tags each point with its index, then Reap gathers all of the tags (_ for the second parameter) and outputs only the points. Personally, I use this instead of GatherBy, and I've encoded it into a function that I load, as follows:
SelectEquivalents[x_List, f_:Identity, g_:Identity, h_:(#2 &)] :=
  Reap[Sow[g[#], {f[#]}] & /@ x, _, h][[2]];
Note: this code is a modified form of what was in the help files in 5.x. But, the 6.0 and 7.0 help files removed a lot of the useful examples, and this was one of them.
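As a usage sketch for the question above (my example, reusing the clusterIndices and points defined earlier): tagging each {index, point} pair with First and keeping only Last groups the points by cluster:
SelectEquivalents[Transpose[{clusterIndices, points}], First, Last]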

Here's a succinct way to do this using the new SplitBy function in version 7.0 that should be pretty fast:
SplitBy[Transpose[{points, clusterIndices}], Last][[All, All, 1]]
If you aren't using 7.0, you can implement this as:
Split[Transpose[{points, clusterIndices}], Last[#]==Last[#2]& ][[All, All, 1]]
Update
Sorry, I didn't see that you only wanted two groups, which I think of as clustering, not splitting. Here's some code for that:
FindClusters[Thread[Rule[clusterIndices, points]]]

How about this?
points[[Flatten[Position[clusterIndices, #]]]] & /@ Union[clusterIndices]

I don't know about 'better', but the more usual way in functional languages would not be to add indices to label each element (your MapIndexed) but instead to just run along each list:
Map[#1[[2]] &,
Sort[GatherBy[
Thread[ {#1, #2} &[clusterIndices, points]],
#1[[1]] &], #1[[1]][[1]] < #2[[1]][[1]] &], {2}]
Most people brought up in Lisp/ML/etc. will instantly recognize the Thread function as the way to implement the zip idea from those languages.
I added the Sort because it looks like your implementation will run into trouble if clusterIndices = {2, ..., 2, 1, ...}. On the other hand, I would still need to add a line to fix the problem that if clusterIndices has a 3 but no 2, the output indices will be wrong. It is not clear from your fragment how you intend to retrieve things, though.
I reckon you will find list processing much easier if you refresh yourself with a hobby project like building a simple CAS in a language like Haskell where the syntax is so much more suited to functional list processing than Mathematica.

If I think of something simpler I will add to the post.
Map[#[[1]] &, GatherBy[Thread[{points, clusterIndices}], #[[2]] &], {2}]

My first step would be to execute
Transpose[{clusterIndices, points}]
and my next step would depend on what you want to do with that; Select comes to mind.

Related

How to understand "pre" in Preintegration of IMU?

I see preintegration in the IMU-fused SLAM literature, which mentions that preintegration is useful for avoiding recomputation of the IMU integration between two consecutive keyframes.
However, when I look at some open-source SLAM code, e.g. ORB-SLAM3, I haven't found anything special about preintegration, because it just looks like an ordinary integration (with the same time intervals, same interpolation, same full calculation over multiple intervals), and I don't see an expected pre-calculated value that is reused time and time again.
So my question is: is preintegration just an alias for ordinary IMU integration, or how should the "pre" be understood?
I've finally made sense of preintegration.
It is a misunderstanding to think of preintegration as a pre-calculated value that is reused every time a new integration is computed.
In fact, preintegration means that the incoming IMU measurements at instants 0, 1, 2, ..., k-1 are integrated into relative quantities as they arrive, so that at a key instant k a bundle adjustment can be performed over the interval [0, k] without recomputing the integration at instants 0, 1, 2, ..., k-1.
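To make the "pre" concrete, here is a sketch in the standard preintegration notation (e.g. Forster et al.; the symbols are mine, not taken from ORB-SLAM3). Between frames i and k one accumulates, from the raw gyro/accelerometer readings $\tilde\omega_j, \tilde a_j$ and bias estimates $b_g, b_a$:
$$\Delta R_{ik} = \prod_{j=i}^{k-1} \mathrm{Exp}\big((\tilde\omega_j - b_g)\,\Delta t\big), \qquad \Delta v_{ik} = \sum_{j=i}^{k-1} \Delta R_{ij}\,(\tilde a_j - b_a)\,\Delta t, \qquad \Delta p_{ik} = \sum_{j=i}^{k-1} \Big[\Delta v_{ij}\,\Delta t + \tfrac{1}{2}\,\Delta R_{ij}\,(\tilde a_j - b_a)\,\Delta t^2\Big]$$
These relative terms depend only on the measurements and bias estimates, not on the state $(R_i, v_i, p_i)$ at frame i, so when the optimizer later changes the keyframe states the same $\Delta R_{ik}, \Delta v_{ik}, \Delta p_{ik}$ can be plugged back in without re-running the integration; that is the "pre".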

Slow insertion using Neptune and Gremlin

I'm having problems with the insertion using gremlin to Neptune.
I am trying to insert many nodes and edges, potentially hundreds of thousands of nodes and edges, with checking for existence.
Currently, we are using inject to insert the nodes, and the problem is that it is slow.
After running the explain command, we figured out that the problem was the coalesce and the where steps - it takes more than 99.9% of the run duration.
I want to insert each node and edge only if it doesn’t exist, and that’s why I am using the coalesce and where steps.
For example, the query we use to insert nodes with inject:
properties_list = [{'uid': '1642'}, {'uid': '1322'}, ...]
g.inject(properties_list).unfold().as_('node')
 .sideEffect(__.V().where(P.eq('node')).by('uid').fold()
 .coalesce(__.unfold(), __.addV(label).property(Cardinality.single, 'uid', '1')))
With 1000 nodes in the graph and properties_list with 100 elements, running the query above takes around 30 seconds, and it gets slower as the number of nodes in the graph increases.
Running a naive injection with the same environment as the query above, without coalesce and where, takes less than 1 second.
I'd like to hear your suggestions and to know what the best practices are for inserting many nodes and edges (with existence checks).
Thank you very much.
If you have a set of IDs that you want to check for existence, you can speed up the query significantly by also providing just a list of IDs to the query and calculating upfront which of them already exist. Then, having calculated the set that needs updates, you can just apply them in one go. This will make a big difference. The reason you are running into problems is that the mid-traversal V() has a lot of work to do. In general it would be better to use actual IDs rather than properties (UID in your case). If that is not an option, the same technique will work for property-based IDs. The steps are:
Using inject or a sideEffect, insert the IDs to be found as one list, and the corresponding map containing the changes to be conditionally applied as a separate map.
Work out which of those IDs already exist and which do not.
Using the set of non-existing ones, apply the updates, using the values in that set to index into your map.
Here is a concrete example. I used the graph-notebook for this but you can do the same thing in code:
Given:
ids = "['1','2','9998','9999']"
and
data = "[['id':'1','value':'XYZ'],['id':'9998','value':'ABC'],['id':'9999','value':'DEF']]"
we can do something like this:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by()
which correctly finds the ones that do not already exist:
{'id': 9998, 'value': 'ABC'}
{'id': 9999, 'value': 'DEF'}
You can use this pattern to construct your conditional inserts a lot more efficiently (I hope :-) ). So to add the new vertices you might do:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by().
addV('test').
property(id,select('d').select('id')).
property('value',select('d').select('value'))
v[9998]
v[9999]
As a side note, we are adding two new steps to Gremlin, mergeV and mergeE, that will allow this to be done much more easily and in a more declarative style. Those new steps should be part of the TinkerPop 3.6 release.
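For reference, a rough sketch of how that might look with gremlin-python once mergeV lands (based on the TinkerPop 3.6 proposal, so treat the names and semantics as assumptions until the release):
from gremlin_python.process.traversal import Merge, T

# Match on the uid property; only create the vertex (with its label and
# value) when no match is found.
g.merge_v({'uid': '1642'}).option(
    Merge.on_create, {T.label: 'test', 'uid': '1642', 'value': 'XYZ'}
).iterate()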

A thousand nested if/and/or

I have a huge "sheet application"; in one row, one of the fields, called "price", literally has 433 IF statements in the formula bar. Basically, there are a lot of options, and based on them I have to change the price in the field (if the user picks option 1, set price 1, and so on). I was just wondering if there is a saner way of writing that, because
it's seriously tiring and incomprehensible
it usually doesn't even work, because I have to throw in an IF(AND()) or IF(OR()) from time to time, which breaks my workflow and makes debugging impossible.
Any help greatly appreciated.
Instead of AND you can use multiplication (*), and instead of OR use addition (+); that way you can easily use ARRAYFORMULA if needed. Also have a look here for alternatives: https://webapps.stackexchange.com/q/123729/186471
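For example, a minimal sketch with made-up columns and prices (not taken from the asker's sheet), where AND becomes * and OR becomes + inside an ARRAYFORMULA:
=ARRAYFORMULA(IF((A2:A="option 1")*(B2:B="yes"), 10,
 IF((A2:A="option 2")+(A2:A="option 3"), 20, )))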
Update
Try this instead of your pastebin formula:
=ARRAYFORMULA(IFERROR(IF((D24<>"")*(E24=""),
VLOOKUP(D24, {data!A:B; data!F1:G23}, 2, 0),
VLOOKUP(D24&E24, {{data!F24:F37&data!F50; data!F28&data!F49},
{data!I23; data!I23; data!I23; data!I23:I29;
data!I29; data!I29; data!I29; data!I30; data!G28}}, 2, 0))))
and a data sheet laid out as shown in the answer's screenshot (image not reproduced here).

What are hp.Discrete and hp.RealInterval? Can I include more values in hp.RealInterval instead of just 2?

I am using hyperparameter tuning with the HParams Dashboard in TensorFlow 2.0-beta0, as suggested here: https://www.tensorflow.org/tensorboard/r2/hyperparameter_tuning_with_hparams
I am confused by step 1 and could not find a better explanation. My questions relate to the following lines:
HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16, 32]))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
My question:
I want to try more dropout values instead of just two (0.1 and 0.2). If I write more values into it, it throws an error: 'maximum 2 arguments can be given'. I tried to look for documentation but could not find anything about where these hp.Discrete and hp.RealInterval functions come from.
Any help would be appreciated. Thank you!
Good question. The notebook tutorial is lacking in many respects. At any rate, here is how you do it at a certain resolution res:
for dropout_rate in tf.linspace(
        HP_DROPOUT.domain.min_value,
        HP_DROPOUT.domain.max_value,
        res):
    ...  # build and train a model using this dropout_rate
Looking at the implementation, to me it really doesn't seem to be grid search but Monte Carlo/random search (note: this is not 100% correct, please see my edit below).
So on every iteration a random float from that real interval is chosen.
If you want grid-search behavior, just use "Discrete". That way you can even mix and match grid search with random search, which is pretty cool!
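For the original question (trying more than two dropout values), a minimal sketch, assuming the usual hparams import; hp.Discrete takes a plain list, so it is not limited to two values:
from tensorboard.plugins.hparams import api as hp

# An explicit grid of dropout values instead of a RealInterval.
HP_DROPOUT = hp.HParam('dropout', hp.Discrete([0.1, 0.2, 0.3, 0.4, 0.5]))

for dropout_rate in HP_DROPOUT.domain.values:
    ...  # build and train a model with this dropout_rate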
Edit, 27th of July '22 (based on the comment of @dpoiesz):
Just to make it a little clearer: since it is sampled from the intervals, concrete values are returned. Those are then added to the grid dimension and grid search is performed using them.
RealInterval is a (min, max) tuple from which the hparam will pick a number.
Here is a link to the implementation for better understanding.
The thing is that, as currently implemented, there does not seem to be any difference between the two unless you call the sample_uniform method.
Note that tf.linspace breaks the mentioned sample code when saving the current value.
See https://github.com/tensorflow/tensorboard/issues/2348,
in particular OscarVanL's comment about his quick-and-dirty workaround.

Ordering by sum of difference

I have a model that has one attribute with a list of floats:
values = ArrayField(models.FloatField(default=0), default=list, size=64, verbose_name=_('Values'))
Currently, I'm getting my entries and ordering them according to the sum of all diffs with another list:
def diff(l1, l2):
    return sum([abs(v1 - v2) for v1, v2 in zip(l1, l2)])

list2 = [0.3, 0, 1, 0.5]
entries = list(Model.objects.all())  # pulls every row into memory
entries.sort(key=lambda t: diff(t.values, list2))
This works fast if my number of entries is small. But I'm afraid that with a large number of entries, the comparison and sorting of all the entries will get slow, since they all have to be loaded from the database. Is there a way to make this more efficient?
The best way is to write it yourself; right now you are iterating over the list more than 4 times!
Although this approach looks pretty, it's not good.
One thing that you can do is:
Have a variable called last_diff and set it to 0.
Iterate through all entries.
Iterate through each entry.values:
from i = 0 to the end of the list, calculate abs(entry.values[i] - list2[i]),
summing these values in a variable called new_diff;
if new_diff > last_diff, break from the inner loop and push the entry into its right place (it's called insertion sort, check it out!).
This way, in the average scenario, the time complexity is much lower than what you are doing now (see the sketch below).
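A rough Python sketch of that idea (my own illustration, not the asker's code; it assumes you only need the N closest entries, which is the common case, and drops an entry as soon as its running sum can no longer beat the current top N):
import bisect

list2 = [0.3, 0, 1, 0.5]
N = 10  # say we only need the 10 closest entries

best_diffs = []    # ascending diffs of the best entries found so far
best_entries = []  # the entries themselves, kept in sync with best_diffs

for entry in Model.objects.all():
    # the worst diff we are still willing to accept
    cutoff = best_diffs[-1] if len(best_diffs) >= N else float('inf')
    d = 0.0
    for v1, v2 in zip(entry.values, list2):
        d += abs(v1 - v2)
        if d >= cutoff:          # early exit: cannot enter the top N anymore
            break
    else:
        pos = bisect.bisect(best_diffs, d)   # insertion-sort step
        best_diffs.insert(pos, d)
        best_entries.insert(pos, entry)
        if len(best_diffs) > N:              # keep only the N best
            best_diffs.pop()
            best_entries.pop()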
And maybe you can be creative too. I'm going to share some ideas; check them for yourself to make sure they are fine.
Assuming that:
the values list elements are always positive floats, and
list2 is always the same for all entries,
then you may be able to say that the bigger the sum over the elements in values, the bigger the diff value is going to be, no matter what the elements of list2 are.
Then you might be able to just forget about the whole diff function. (Test this!)
The only way to make this really go faster is to move as much work as possible to the database, i.e. the calculations and the sorting. It wasn't easy, but with the help of this answer I managed to actually write a query for that in almost pure Django:
from django.contrib.postgres.fields import ArrayField
from django.db import models

class Unnest(models.Func):
    function = 'UNNEST'

class Abs(models.Func):
    function = 'ABS'

class SubquerySum(models.Subquery):
    template = '(SELECT sum(%(field)s) FROM (%(subquery)s) _sum)'

x = [0.3, 0, 1, 0.5]
pairdiffs = Model.objects.filter(pk=models.OuterRef('pk')).annotate(
    pairdiff=Abs(Unnest('values') - Unnest(models.Value(x, ArrayField(models.FloatField())))),
).values('pairdiff')
entries = Model.objects.all().annotate(
    diff=SubquerySum(pairdiffs, field='pairdiff')
).order_by('diff')
The unnest function turns each element of values into a row. In this case it happens twice, but the two resulting columns are immediately subtracted and made positive. Still, there are as many rows per pk as there are values. These need to be summed, but that's not as easy as it sounds: the column can't simply be aggregated. This was by far the trickiest part; even after fiddling with it for so long, I still don't quite understand why Postgres needs this indirection. Of the few options there are to make it work, I believe a subquery is the only one expressible in Django (and only as of 1.11).
Note that the above behaves exactly the same as with zip, i.e. when one array is longer than the other, the remainder is ignored.
Further improvements
While it will already be a lot faster when you don't have to retrieve all rows and loop over them in Python, this still results in a full table scan. All rows have to be processed, every single time. You can do better, though. Have a look at the cube extension. Use it to calculate the L1 distance (at least, that seems to be what you're calculating) directly with the <#> operator. That will require the use of RawSQL or a custom Expression. Then add a GiST index on the SQL expression cube("values"), or directly on the field if you're able to change the type from float[] to cube. In the latter case, you might have to implement your own CubeField too; I haven't found any package yet that provides one. In any case, with all that in place, top-N queries on the lowest distance will be fully indexed and hence blazing fast.
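For completeness, a rough sketch of what the RawSQL route might look like (my own illustration; it assumes the cube extension is installed, that the values array can be fed to cube(float8[]), and that a GiST index has been created on the expression cube("values")):
from django.db import models
from django.db.models.expressions import RawSQL

x = [0.3, 0, 1, 0.5]

# <#> is the cube extension's taxicab (L1) distance operator; with a GiST
# index on cube("values"), the ORDER BY ... LIMIT below should be able to
# use a KNN index scan instead of a full table scan.
entries = Model.objects.annotate(
    diff=RawSQL('cube("values") <#> cube(%s::float8[])', (x,),
                output_field=models.FloatField()),
).order_by('diff')[:10]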