How can I do a conditional sort in Gremlin/Neptune? - amazon-web-services

I am trying to build a query that will return a vertex's both edges and sort them conditionally by the value of a different property, depending on whether the original vertex is the out vertex or the in vertex of the given edge. I have two properties for this sort ("origin_pinned" and "target_pinned"): if the edge's in vertex is the original vertex, I want to use the "origin_pinned" value; if it is the out vertex, I want to use the "target_pinned" value.
This must be done in a single query.
I tried to run the following query, but it does not seem to have any effect:
g.V('id123').bothE().order().by(values(choose(inV().is(V('id123')),
constant('origin_pinned'), constant('target_pinned'))), desc)

The values step will not work the way you are trying to use it. You did not include any sample data in the question, but using the air-routes data set, the query can most likely be simplified as shown below:
gremlin> g.V(44).as('a').
......1>   bothE().
......2>   order().
......3>     by(coalesce(
......4>          where(inV().as('a')).constant('origin_pinned'),
......5>          constant('target_pinned')))
==>e[57948][3742-contains->44]
==>e[54446][3728-contains->44]
==>e[4198][13-route->44]
==>e[4776][31-route->44]
==>e[4427][20-route->44]
==>e[4015][8-route->44]
==>e[5061][44-route->8]
==>e[5062][44-route->13]
==>e[5063][44-route->20]
==>e[5064][44-route->31]
And to prove that the reverse also works:
gremlin> g.V(44).as('a').
......1>   bothE().
......2>   order().
......3>     by(coalesce(
......4>          where(outV().as('a')).constant('origin_pinned'),
......5>          constant('target_pinned')))
==>e[5061][44-route->8]
==>e[5062][44-route->13]
==>e[5063][44-route->20]
==>e[5064][44-route->31]
==>e[57948][3742-contains->44]
==>e[54446][3728-contains->44]
==>e[4198][13-route->44]
==>e[4776][31-route->44]
==>e[4427][20-route->44]
==>e[4015][8-route->44]
To check things are working as expected, we can do:
gremlin> g.V(44).as('a').
......1>   bothE().as('e').
......2>   coalesce(
......3>     where(inV().as('a')).constant('origin_pinned'),
......4>     constant('target_pinned')).as('p').
......5>   order().
......6>   select('e','p')
==>[e:e[57948][3742-contains->44],p:origin_pinned]
==>[e:e[54446][3728-contains->44],p:origin_pinned]
==>[e:e[4198][13-route->44],p:origin_pinned]
==>[e:e[4776][31-route->44],p:origin_pinned]
==>[e:e[4427][20-route->44],p:origin_pinned]
==>[e:e[4015][8-route->44],p:origin_pinned]
==>[e:e[5061][44-route->8],p:target_pinned]
==>[e:e[5062][44-route->13],p:target_pinned]
==>[e:e[5063][44-route->20],p:target_pinned]
==>[e:e[5064][44-route->31],p:target_pinned]

I managed to figure it out with the help of Kelvin Lawrence's answer:
g.V('id123').as('a').
  bothE().
  order().
    by(coalesce(
         where(outV().as('a')).values('origin_pinned'),
         values('target_pinned'),
         constant(0)),
       desc)
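For anyone running this from Python against Neptune, the equivalent gremlin_python traversal would look roughly like this (a sketch only; the endpoint is a placeholder and the connection setup is an assumption, not part of the original answer):
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import Order

# placeholder Neptune endpoint, adjust to your cluster
conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# reserved Python words get a trailing underscore in gremlin_python (as_, etc.)
edges = (g.V('id123').as_('a').
           bothE().
           order().
             by(__.coalesce(
                  __.where(__.outV().as_('a')).values('origin_pinned'),
                  __.values('target_pinned'),
                  __.constant(0)),
                Order.desc).
           toList())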

Related

Gremlin: how do I find vertices whose properties *contain* a certain value

Imagine I have vertices with properties whose values are lists, e.g.:
g.addV('v').property('prop',['a','b','c'])
How do I find cases where prop contains a certain value?
This seemed the obvious thing to try:
g.V().has(P.within('prop'),'a')
But it doesn't work:
gremlin_python.driver.protocol.GremlinServerError: 599: Could not locate method: DefaultGraphTraversal.has([within([a]), test])
If you use the VertexProperty list cardinality (see multi-properties in the docs), you can accomplish it like this:
>>> g.addV('v').property(list_, 'prop', 'a').property(list_, 'prop', 'b').property(list_, 'prop', 'c').next()
v[0]
>>> g.V().has('prop', within('a')).toList()
[v[0]]
Note that list_ comes from the Cardinality enum, via from gremlin_python.process.traversal import Cardinality.
If it's a real list (not a multi-valued property), then you'll have to unfold the value:
gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.addV('v').property('prop',['a','b','c'])
==>v[0]
gremlin> g.V().filter(values('prop').unfold().is('a'))
==>v[0]
// or
gremlin> g.V().has('prop', unfold().is('a'))
==>v[0]
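Since the question uses gremlin_python, the same filters would look roughly like this from Python (a sketch assuming g is a remote traversal source; where plays the role of filter here, and is becomes is_ because it is a Python keyword):
from gremlin_python.process.graph_traversal import __

# keep vertices whose 'prop' list contains the value 'a'
g.V().where(__.values('prop').unfold().is_('a')).toList()
# or, equivalently
g.V().has('prop', __.unfold().is_('a')).toList()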
Note that this filter requires a full scan over all vertices, as individual list entries cannot be indexed. Hence you should take a look at Jason's answer, as multi-properties are usually a much better choice.

Filtering multidimensional views in xtensor

I am trying to filter a 2D xtensor view with a simple condition. I found the xt::filter function, but when I use it, it only returns the first column of the filtered view. I need the 2D filtered view. What is the best way to do it?
I could check the condition line by line, get all the indexes myself, and then use xt::view to only show the needed lines, but I am hoping for a more sophisticated method using the xtensor toolset.
My current filter, which returns only one direction, looks like this:
auto unfiltered = xt::view(...);
auto filtered = xt::filter(unfiltered, xt::view(unfiltered, xt::all(), 0) > tresh);
EDIT:
It is possible I was not completely clear. I need a 2D view where I keep only those lines where the first element of the line is greater than the threshold.
xt::view(unfiltered, xt::all(), 0)
is creating a view that only contains the first column of unfiltered. The following should do what you expect:
auto unfiltered = xt::view(...);
auto filtered = xt::filter(unfiltered, unfiltered > tresh);
EDIT: sorry for the misunderstanding, here is an update following the OP's remark:
The condition is not broadcast to the shape of the expression to filter; a workaround for now is:
auto unfiltered = xt::view(...);
auto filtered = xt::filter(unfiltered,
xt::broadcast(xt::view(unfiltered, xt::all(), 0, xt::newaxis()),
unfiltered.shape()) > tresh);
I'll open an issue for this.
Also notice that filter returns a 1D expression (because the elements satisfying a condition may be scattered in the original expression), so you need to reshape it to get a 2D expression.

Randomly set one-third of NaNs in a column to one value and the rest to another value

I'm trying to impute missing values in a dataframe df. I have a column A with 300 NaNs. I want to randomly set 2/3 of them to value1 and the rest to value2.
Please help.
EDIT: I'm actually trying to do this on dask, which does not support item assignment. This is what I have currently. Initially, I thought I'd try to convert all NaNs to value1:
da.where(df.A.isnull() == True, 'value1', df.A)
I got the following error:
ValueError: need more than 0 values to unpack
As the comment suggests, you can solve this with Series.where.
The following will work, but I cannot promise how efficient this is. (I suspect it may be better to produce a whole column of replacements at once with numpy.random.choice.)
import random

df['A'] = df['A'].where(~df['A'].isnull(),
                        lambda df: df.map(
                            lambda x: random.choice(['value1', 'value1', x])))
Explanation: if the value is not null (NaN), certainly keep the original. Where it is null, replace it with the corresponding value of the dataframe produced by the first lambda. This maps values of the dataframe (chunks) to randomly choose the original value for 1/3 and 'value1' for the others.
Note that, depending on your data, this likely has changed the data type of the column.
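For reference, the whole-column idea mentioned above might look roughly like this in plain pandas (a sketch reusing df and column A from the question; with dask you would likely apply the same logic per partition, e.g. via map_partitions):
import numpy as np

# build a full column of random replacements: 'value1' with probability 2/3,
# 'value2' otherwise, then keep the original values wherever they are not null
replacements = np.random.choice(['value1', 'value2'], size=len(df), p=[2.0 / 3, 1.0 / 3])
df['A'] = df['A'].where(~df['A'].isnull(), replacements)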

Getting significance level, alpha, from KS test results?

I am trying to find the significance level/alpha level (to eventually get the confidence level) of my Kolmogorov-Smirnov test results, and I feel like I'm going crazy because this doesn't seem explained well enough anywhere (in a way that I understand).
I have sample data that I want to see if it comes from one of four probability distribution functions: Cauchy, Gaussian, Student's t, and Laplace. (I am not doing a two-sample test.)
Here is sample code for Cauchy:
### Cauchy Distribution Function
data = [-1.058, 1.326, -4.045, 1.466, -3.069, 0.1747, 0.6305, 5.194, 0.1024, 1.376, -5.989, 1.024, 2.252, -1.451, -5.041, 1.542, -3.224, 1.389, -2.339, 4.073, -1.336, 1.081, -2.573, 3.788, 2.26, -0.6905, 0.9064, -0.7214, -0.3471, -1.152, 1.904, 2.082, -2.471, 0.6434, -1.709, -1.125, -1.607, -1.059, -1.238, 6.042, 0.08664, 2.69, 1.013, -0.7654, 2.552, 0.7851, 0.5365, 4.351, 0.9444, -2.056, 0.9638, -2.64, 1.165, -1.103, -1.624, -1.082, 3.615, 1.709, 2.945, -5.029, -3.57, 0.6126, -2.88, 0.4868, 0.4222, -0.2062, -1.337, -0.326, -2.784, 6.724, -0.1316, 4.681, 6.839, -1.987, -5.372, 1.522, -2.347, 0.4531, -1.154, -3.631, 0.426, -4.271, 1.687, -1.612, -1.438, 0.8777, 0.06759, 0.6114, -1.296, 0.07865, -1.104, -1.454, -1.62, -1.755, 0.7868, -3.312, 1.054, -2.183, -7.066, -0.04661, 1.612, 1.441, -1.768, -0.2443, -0.7033, -1.16, 0.2529, 0.2441, -1.962, 0.568, 1.568, 8.385, 0.7192, -1.084, 0.9035, 3.376, -0.7172, -0.1221, 3.267, 0.4064, -0.4894, -2.001, 1.63, -2.891, 0.6244, 2.381, -1.037, -1.705, -0.5223, -0.2912, 1.77, -3.792, 0.1716, 4.121, -0.9119, -0.1166, 5.694, -5.904, 0.5485, -2.788, 2.582, -1.553, 1.95, 3.886, 1.066, -0.475, 0.5701, -0.9367, -2.728, 4.588, -5.544, 1.373, 1.807, 2.919, 0.8946, 0.6329, -1.34, -0.6154, 4.005, 0.204, -1.201, -4.912, -4.766, 0.0554, 3.484, -2.819, -5.131, 2.108, -1.037, 1.603, 2.027, 0.3066, -0.3446, -1.833, -2.54, 2.828, 4.763, 0.9926, 2.504, -1.258, 0.4298, 2.536, -1.214, -3.932, 1.536, 0.03379, -3.839, 4.788, 0.04021, -0.2701, -2.139, 0.1339, 1.795, -2.12, 5.558, 0.8838, 1.895, 0.1073, 2.011, -1.267, -1.08, -1.12, -1.916, 1.524, -1.883, 5.348, 0.115, -1.059, -0.4772, 1.02, -0.4057, 1.822, 4.011, -3.246, -7.868, 2.445, 2.271, 0.5377, 0.2612, 0.7397, -1.059, 1.177, 2.706, -4.805, -0.7552, -4.43, -0.4607, 1.536, -4.653, -0.5952, 0.8115, -0.4434, 1.042, 1.179, -0.1524, 0.2753, -1.986, -2.377, -1.21, 2.543, -2.632, -2.037, 4.011, 1.98, -2.589, -4.9, 1.671, -0.2153, -6.109, 2.497]
import scipy.stats as ss

def C(data):
    stuff = []
    # vary gamma
    for scale in xrange(1, 101, 1):
        ks_statistic, pvalue = ss.kstest(data, "cauchy", args=(scale,))
        stuff.append((ks_statistic, pvalue, scale))
    bestks = min(c[0] for c in stuff)
    bestrow = [row for row in stuff if row[0] == bestks]
    return bestrow
I am trying to fit this function to my data, and to return the scale parameter (gamma) that corresponds to the highest probability of being fit with a Cauchy Distribution. The corresponding ks-statistic and p-value also get returned. I thought that this would be done by finding the minimum ks-statistic, which would be the curve that yields the smallest distance between any given data point and distribution-curve point. I realize that I need, though, to find "alpha" so that I can find my probability that the sample data is from a Cauchy Distribution, with the specified scale/gamma value I found.
I have referenced many sources trying to explain how to find "alpha", but I have no clue how to do this in my code.
Thank you for any help and insight!
I think this question is actually outside the scope of SO because it involves statistics. You would probably be better off asking on, say, Cross Validated. However, let me offer one or two remarks.
The K-S test is used for testing whether a given set of data has arisen from a given, fully specified distribution function. (Even for this purpose it might not be optimal.) It's not intended, as far as I know, as a measure of fit amongst alternatives.
To make inferences about probabilities one must have a viable probability model for the data in the first place. In this case, what is the space of alternatives and how are probabilities assigned to them under the null and alternative hypotheses?
Now, to get to that unhelpful comment that I offered. Thanks for being so tactful about it! This is what I was trying to express.
You try scales from 1 to 100 in unit steps. I wanted to point out that scales less than one produce curious results. Now I see some close fits, which is especially true when p-values are considered; there's nothing to tell them apart from the one for scale=2. Here's a plot.
Each triple gives (scale, K-S, p).
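For example, triples of that form can be generated with a loop along these lines (a minimal sketch that mirrors the kstest call from the question, uses the data defined above, and also scans scales below one):
import numpy as np
import scipy.stats as ss

# scan fractional scales, including values below 1, instead of unit steps
for scale in np.arange(0.1, 3.01, 0.1):
    ks_statistic, pvalue = ss.kstest(data, "cauchy", args=(scale,))
    print("scale=%.1f  K-S=%.4f  p=%.4f" % (scale, ks_statistic, pvalue))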
The main thing might be, what do you want from your data?

Compare two dictionaries and display the image based on the key in Python

How can I compare two dictionaries and, based on the matching keys, display the images? I mean, if a key in the first dictionary is also in the second, then I have to take the image based on that key. I have given it a try, and the code is:
for key in res_lst_srt:
    if key in resizedlist:
        b, g, r = cv2.split(images[i])
        img = cv2.merge((r, g, b))
        plt.subplot(2, 3, i + 1), plt.imshow(img)
        plt.xticks([]), plt.yticks([])
plt.show()
I have taken the query image separately, and I have got the distance between the query image and all the database images. The distance dict has keys and values, and the database image dict has keys and values. I want to retrieve the image which matches best, with the minimum distance, based on the key.
Thanks in advance!
It seems to me that you have not properly grasped the dict concept; you should study it a little to understand how it works with simple elements (numbers, strings), and only once you have got it, try it with heavier data such as OpenCV images.
Try this piece of code:
dict1 = {'a': 1, 'b': 2, 'c': 3}
dict2 = {'e': 1, 'd': 2, 'c': 4}
print dict1
print dict2
# note that this code is not optimized!!
# there are plenty of ways you can do better
# but prob. is the easiest way == better way to understand it
for k1 in dict1.keys():
    for k2 in dict2.keys():
        if k1 == k2:
            print 'keys match'
            mergedvalues = dict1[k1] + dict2[k2]
            print 'merged value is:', mergedvalues
For better ways to compare two dicts, going deeper into the Python way of handling dicts and other data structures (such as lists, sets, etc.) and the operations on them, this answer is nice. But I think you should understand how dicts work first.
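For example, one more direct way to find the matching keys and combine their values (a small sketch using plain built-ins, independent of the image data):
dict1 = {'a': 1, 'b': 2, 'c': 3}
dict2 = {'e': 1, 'd': 2, 'c': 4}

# set intersection gives the keys present in both dicts
common_keys = set(dict1) & set(dict2)

# combine the values for the matching keys
merged = {k: dict1[k] + dict2[k] for k in common_keys}
print(merged)  # {'c': 7}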