I'm new to using Gremlin (up until now I was accessing Neptune using openCypher and gave up due to how slow it was) and I'm getting really confused over some stuff here.
Basically what I'm trying to do is -
Let us say we have some graph A-->B-->C. There are multiple such graphs in the database, so I'm looking for the specific A, B, C nodes that have the property 'idx' equal to '1'. I want to add a node D{'idx' = '1'} and an edge so that I end up with
A-->B-->C-->D
It is safe to assume A,B,C already exist and are connected together.
Also, we wish to add D only if it doesn't already exist.
So what I currently have is this:
g.V().
hasLabel('A').has('idx', '1').
out().hasLabel('B').has('idx', '1').
out().hasLabel('C').has('idx', '1').as('c').
V().hasLabel('D').has('idx', '1').fold().
coalesce(
unfold(),
addV('D').property('idx','1')).as('d').
addE('TEST_EDGE').from('c').to('d')
Now the problem is that, well, this doesn't work, and I don't understand Gremlin enough to understand why. This returns from Neptune as "An unexpected error has occurred in Neptune" with the code "InternalFailureException".
Another thing to mention is that if the node D does exist, I don't get an error at all, and in fact the node is properly connected to the graph as it should be.
Furthermore, I've seen in a different post that using ".as('c')" shouldn't work, since there is a 'fold' step afterwards which makes it unusable (for a reason I still don't understand, probably because I'm not sure how this entire .as/.store/.aggregate business works).
That post suggests using ".aggregate('c')" instead, but doing so changes the returned error to "addE(TEST_EDGE) could not find a Vertex for from() - encountered: BulkSet". This, added to the fact that the code I wrote actually works and connects node D to the graph if it already exists, makes me even more confused.
So I'm lost
Any help or clarification or explanation or simplification would be much appreciated
Thank you! :)
A few comments before getting to the query:
If the intent is to have multiple subgraphs of (A->B->C), then you may not want to use this labeling scheme. Labels are meant to be of lower variation - think of labels as groups of vertices of the same "type".
A lookup of a vertex by an ID is the fastest way to find a vertex in a TinkerPop-based graph database. Just be aware of that as you build your access patterns. Instead of doing something like `hasLabel('x').has('idx','y')`, if both of those items combined make a unique vertex, you may also want to think of creating a composite ID of something like 'x-y' for that vertex for faster access/lookup.
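For instance, here is a quick gremlin-python sketch of that idea (the endpoint, label, and composite-ID scheme below are made-up placeholders, not something from your graph):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.traversal import T

# Hypothetical connection; substitute your own Neptune endpoint.
g = traversal().withRemote(
    DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g'))

# Write: bake the label and 'idx' into a composite vertex ID up front.
g.addV('A').property(T.id, 'A-1').property('idx', '1').iterate()

# Read: a direct lookup by ID, the fastest access path.
a = g.V('A-1').next()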
On the query...
The first part of the query looks good. I think you have a good understanding of the imperative nature of Gremlin right up until you get to the second V() in the query. That V() is going to tell Neptune to start evaluating against all vertices in the graph again, but we want to continue evaluating beyond the 'C' vertex.
Unless you need to return an output in either case of existence or non-existence, you could get away with just doing the following without a coalesce() step:
g.V().
  hasLabel('A').has('idx', '1').
  out().hasLabel('B').has('idx', '1').
  out().hasLabel('C').has('idx', '1').
  where(not(out().hasLabel('D').has('idx', '1'))).
  addE('TEST_EDGE').to(
    addV('D').property('idx', '1'))
The where() clause allows us to check for the non-existence of a downstream edge and vertex without losing our place in the traversal. The traversal only continues if the specified condition is not() found, in this case. If it is not found, the traversal continues from where we left off (the 'C' vertex), so we can feed that 'C' vertex directly into an addE() step to create our new edge and new 'D' vertex.
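In case it is useful, here is a rough gremlin-python equivalent of the query above (a sketch that assumes you already have a traversal source g connected to Neptune):

from gremlin_python.process.graph_traversal import __

# Keep our place at the 'C' vertex, require that no downstream 'D' with
# idx '1' exists yet, then create the edge and the new vertex in one go.
(g.V().
   hasLabel('A').has('idx', '1').
   out().hasLabel('B').has('idx', '1').
   out().hasLabel('C').has('idx', '1').
   where(__.not_(__.out().hasLabel('D').has('idx', '1'))).
   addE('TEST_EDGE').to(__.addV('D').property('idx', '1')).
   iterate())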
Related
I'm having problems with insertion using Gremlin into Neptune.
I am trying to insert many nodes and edges, potentially hundreds of thousands of each, while checking for existence.
Currently, we are using inject to insert the nodes, and the problem is that it is slow.
After running the explain command, we figured out that the problem was the coalesce and the where steps - it takes more than 99.9% of the run duration.
I want to insert each node and edge only if it doesn’t exist, and that’s why I am using the coalesce and where steps.
For example, the query we use to insert nodes with inject:
properties_list = [{'uid': '1642'}, {'uid': '1322'}, …]
(g.inject(properties_list).unfold().as_('node')
  .sideEffect(__.V().where(P.eq('node')).by('uid').fold()
              .coalesce(__.unfold(),
                        __.addV(label).property(Cardinality.single, 'uid', '1'))))
With 1000 nodes in the graph and properties_list with 100 elements, running the query above takes around 30 seconds, and it gets slower as the number of nodes in the graph increases.
Running a naive injection with the same environment as the query above, without coalesce and where, takes less than 1 second.
I'd like to hear your suggestions and to know the best practices for inserting many nodes and edges (with existence checks).
Thank you very much.
If you have a set of IDs that you want to check for existence, you can speed up the query significantly by also providing just a list of IDs to the query and calculating the intersection of the ones that exist up front. Then, having calculated the set that needs updates, you can just apply them in one go. This will make a big difference. The reason you are running into problems is that the mid-traversal V() has a lot of work to do. In general it would be better to use actual IDs rather than properties ('uid' in your case); if that is not an option, the same technique will work for property-based IDs. The steps are:
1. Using inject or sideEffect, insert the IDs to be found as one list, and the corresponding map containing the changes to conditionally be applied in a separate map.
2. Find the intersection of the ones that exist and those that do not.
3. Using that set of non-existing ones, apply the updates, using the values in the set to index into your map.
Here is a concrete example. I used the graph-notebook for this but you can do the same thing in code:
Given:
ids = "['1','2','9998','9999']"
and
data = "[['id':'1','value':'XYZ'],['id':'9998','value':'ABC'],['id':'9999','value':'DEF']]"
we can do something like this:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by()
which correctly finds the ones that do not already exist:
{'id': 9998, 'value': 'ABC'}
{'id': 9999, 'value': 'DEF'}
You can use this pattern to construct your conditional inserts a lot more efficiently (I hope :-) ). So to add the new vertices you might do:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by().
addV('test').
property(id,select('d').select('id')).
property('value',select('d').select('value'))
v[9998]
v[9999]
As a side note, we are adding two new steps to Gremlin, mergeV and mergeE, that will allow this to be done much more easily and in a more declarative style. Those new steps should be part of the TinkerPop 3.6 release.
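To give a rough idea of where that is heading (a sketch only; 3.6 is unreleased at the time of writing, so exact names and syntax may differ), the conditional insert above might collapse to something like this in gremlin-python:

from gremlin_python.process.traversal import Merge, T

# Proposed mergeV step: look the vertex up by ID and create it from the
# on_create map only if it is absent. Treat this as an approximation.
(g.mergeV({T.id: '9998'}).
   option(Merge.on_create, {T.label: 'test', 'value': 'ABC'}).
   iterate())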
I'm using the getClosest command to find a vertex.
ForceVertex1 = hatInstance.vertices.getClosest(coordinates=((x, y, z),))
This is a dictionary object with key 0 and two values (hatInstance.vertices[1] and the coordinates of the vertex). The specific output:
{0: (mdb.models['EXP-100'].rootAssembly.instances['hatInstance-100'].vertices[1], (62.5242172081597, 101.192447407436, 325.0))}
Whenever I try to create a set, the vertex isn't accepted
mainAssembly.Set(vertices=ForceVertex1[0][0],name='LoadSet1')
I also tried a different way:
tolerance = 1.0e-3
matchedVertices = []
for vertex in hatInstance.vertices:
    # pointOn holds the coordinates of the vertex
    x = vertex.pointOn[0][0]
    y = vertex.pointOn[0][1]
    z = vertex.pointOn[0][2]
    if abs(x - xTarget) < tolerance and abs(y - yTarget) < tolerance and abs(z - zTarget) < tolerance:
        matchedVertices.append(hatInstance.vertices[vertex.index:vertex.index + 1])
xTarget etc. being my target coordinates; despite this I still don't get a vertex object.
For those struggling with this, I solved it.
Don't use the getClosest command, as it returns a dictionary object, despite the manual recommending it. I couldn't convert this dictionary object, specifically a key and a value within it, to a standalone object (vertex).
Instead use Instance.vertices.getByBoundingSphere(center=,radius=)
The center is basically a tuple of the coordinates and the radius is the tolerance. This returns an array of vertices
If you want the geometrical object you just have to access the dictionary.
One way to do it is:
ForceVertex1 = hatInstance.vertices.getClosest(coordinates=((x, y, z),))[0][0]
This will return the vertex object only, which you can assign to a set or whatever.
Edit: Found a solution to actually address the original question:
part=mdb.models[modelName].parts[partName]
v=part.vertices.getClosest(coordinates=(((x,y,z)),))
Note the formatting requirement for coordinates, ((( )),), three sets of parentheses with a comma. This will find the vertex closest to the specified point. In order to use this to create a set, I found you need to massage the Abaqus Python interface into returning the vertex in a format that uses their "getSequenceFromMask" method. In order to create a set, the edges, faces, and/or vertices need to be of type "Sequence", which is internal to Abaqus. To do this, I then use the following code:
v2=part.vertices.findAt((((v[0][1])),))
part.Set(name='setName', vertices=v2)
Note, v[0][1] will give you the point at which the vertex lies. Note again the format of the specified point using the findAt method, (((point)),), with three sets of parentheses and a comma. This will return a vertex that uses the getSequenceFromMask method in Abaqus (you can check by typing v2 then Enter in the Python box at the bottom of CAE; this works with Abaqus 2020). This is type "Sequence" (you can check by typing type(v2)), and this can be used to create a set.
If you do not format the point in findAt correctly (e.g., findAt(v[0][1]), without the parentheses and comma), it will return an identical vertex to the one you get by accessing the dictionary returned by getClosest (e.g., v[0][0]). This is type 'Vertex' and cannot be used to create a set, even though the method asks for a vertex.
If you know the exact point where the vertex is, then you do not need the first step; you can simply use the findAt method with the correct formatting. However, the tolerance for findAt is very small (1e-6), and it will return an empty sequence if nothing is found within the tolerance. If you only have a ballpark idea of where the vertex is located, then you need to use the getClosest method first. This indeed gets the closest vertex to the specified point, which may or may not be the one you are interested in.
Original post:
None of these answers work for a similar problem I am having while trying to create a set of faces within some range near a point. If I use getClosest as follows
f=mdb.models['Model-1'].parts['Part-1'].faces.getClosest(coordinates=((0,0,0),), searchTolerance=1)
mdb.models['Model-1'].parts['Part-1'].Set(faces=f, name='faceSet')
I get an error "TypeError: Keyword error on faces".
If I access the dictionary via face=f[0], I get error "Feature Creation Failed". If I access the tuple within the dictionary via f[0][0], I get the error "TypeError: keyword error on faces" again.
The option to use .getByBoundingSphere doesn't work either, because the faces in my model are massive, and the faces have to be completely contained within the sphere for Abaqus to "get" them, basically requiring me to create a sphere that encompasses the entire model.
My solution was to create my own script as follows:
import numpy as np

model = mdb.models['Model-1']
part = model.parts['Part-1']
faceSave = []
faceSave2 = []
x = np.arange(-1, 1, 0.1)
y = np.arange(-1, 1, 0.1)
z = np.arange(-1, 1, 0.1)
for x1 in x:
    for y1 in y:
        for z1 in z:
            f = part.faces.findAt(((x1, y1, z1),))
            if len(f) > 0 and f[0] not in faceSave2:
                faceSave.append(f)
                faceSave2.append(f[0])
part.Set(faces=faceSave, name='faceSet')
This works, but it's extraordinarily slow, in part because findAt will throw a warning to the console whenever it doesn't find a face, and it usually doesn't find a face with this approach. The code above basically looks within a small cube for any faces and puts them in the list faceSave. faceSave2 is set up to ensure that duplicate faces aren't added to the list. Accessing the tuple (e.g., f[0] in the code above) gives the unique information about the face, whereas f is just a pointer to the findAt command. Strangely, you can use the pointer f to create a Set, but you cannot use the actual face object f[0] to create a set. The problem with this approach for general use is that the tolerance for findAt is super small, so you either have to be confident where things are located in your model, or make the step size 1e-6 in np.arange() to ensure you don't miss a face that's in the cube. With a tiny step size, expect the code to take forever.
At any rate, I can use a tuple (or a list of tuples) obtained via "findAt" to create a Set in Abaqus. However, I cannot use the tuple obtained via "getClosest" to make a set, even though I see no difference between the two objects. It's unfortunate, because getClosest gives me the exact info I need effectively immediately without my jumbled mess of for-loops.
#anarchoNobody:
Thank you so much for your edited answer!
This workaround works great, also with faces. I spent a lot of hours trying to figure out why .getClosest does not provide a working result for creating a set, but with the workaround and the right number of brackets it works.
If applied to several faces, the code has to be slightly modified:
faces = (
    mdb.models['Model-1'].rootAssembly.instances['TT-1'].faces.getClosest(
        coordinates=((10.0, 10.0, 10.0),), searchTolerance=2),
    mdb.models['Model-1'].rootAssembly.instances['TT-1'].faces.getClosest(
        coordinates=((-10.0, 10.0, 10.0),), searchTolerance=2),
)
faces1 = (
    mdb.models['Model-1'].rootAssembly.instances['Tube-1'].faces.findAt(
        (((faces[0][0][1])),)),
    mdb.models['Model-1'].rootAssembly.instances['Tube-1'].faces.findAt(
        (((faces[1][0][1])),)),
)
mdb.models['Model-1'].rootAssembly.Surface(name='TT-inner-surf', side1Faces=faces1)
I wrote a little code in Fortran, but the code doesn't behave as I thought, and I can't figure out where the problem is.
I will not put the code here because it is 1200 lines long, but here is its philosophy:
I create a 3D grid represented by a four-dimensional array (I store a vector of 2 elements at each point of the grid, corresponding to the nature of the site and to what is occupying the site). This grid represents what we call a crystal (where atoms can be found periodically).
Once this grid is constructed, the code scans each point of the grid and looks at the neighboring sites to count the different types of atoms or the vacancies.
For this last point, I use a triply nested loop to explore the different sites, and I check the different neighboring sites using either if or select case statements. As I want my grid to be periodic, I have the mod function in the argument of the if or the select case.
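To illustrate the idea, here is a minimal Python sketch (not the original Fortran, which is not posted) of a periodic neighbor scan; one caveat worth checking in the Fortran is that mod() takes the sign of its first argument (mod(-1, N) is -1), so for wrapping at the lower boundary modulo() is usually the function you want:

# Illustrative sketch with hypothetical names; Python's % always returns a
# non-negative result for a positive modulus, which is what periodic
# wrapping needs (Fortran's modulo() behaves this way, mod() does not).
N = 4
grid = [[[-1 for _ in range(N)] for _ in range(N)] for _ in range(N)]

def count_neighbors(grid, i, j, k, element):
    # Count how many of the six nearest periodic neighbors hold `element`.
    offsets = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    return sum(1 for di, dj, dk in offsets
               if grid[(i + di) % N][(j + dj) % N][(k + dk) % N] == element)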
The problem is that sometimes it finds a different element in a neighboring site than the element actually present at that site. As an example:
In the two output files where all the coordinates are written along with the element type, I have grid(0,0,1) = -1 (which corresponds to an empty site).
But while the code is looking at the neighboring sites of grid(0,0,1), it says that there is actually an element indexed 2 in grid(0,0,1).
I have looked carefully at the block in the triply nested loop, but it seems fine.
I would like to know if anyone has already met this kind of problem, or knows if there are any problems with using mod in an if or select case argument.
If some of you want to look closer, I can send you the code, with some explanations.
Arrays are usually dimensioned as:
REAL(KIND=8),DIMENSION(0:N) ::A
or
REAL(KIND=8),DIMENSION(N) :: A
In the latter example, indices are assumed to start at 1.
You could also go (-N:N) or (10:191)
If you use the compiler switch '-check bounds' or '-check all', you will see if you are going outside the array, etc. This is not an uncommon thing to get hosed up, but the program will abort quickly when an index is outside the bounds.
Once it works, remove the -check bounds and/or -check all.
Thanks for your consideration, francescalus and haraldkl.
It was not related to the dimensions of the arrays, Holmz, but thank you for trying to help.
It seems I have finally managed to fix it. I will post another answer if I fully understand why it was not working properly.
Apparently, it was related to the combination of a different argument order between the procedure call and the subroutine header, plus a declaration in the subroutine with intent(inout).
It was as if the intent(inout) was masking the problem, but that is a bit strange to me.
Some explanations about the code:
As I said, the code creates a 3D grid where each intersection of the grid corresponds to a crystallographic site. I attribute a value to each site: -1 for an empty site, 1 for a crystal atom (0 if there is a vacancy instead of a crystal atom), and 2, 3, 4, 5 for different impurities. Actually, the empty sites and the sites which receive crystal atoms are not of the same type; that's why an empty site and a vacancy are distinguished. The impurities can only occupy empty sites and are forbidden from occupying crystal sites.
The aim of the code is to explore the configurational space of the system, in other words all the possible distributions we can obtain with the different elements. To do so, I start from an initial configuration, randomly choose two sites (respecting the occupation rules), and virtually swap them. I calculate the energy of the old and new configurations; if the new one has a lower energy I keep it, and if not, I keep the old one. The calculation of the energy is based on knowing the environment of each vacancy and impurity, so we need to know their neighbors. I repeat the whole procedure again and again to converge to the most stable (and so the most probable) configuration.
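For what it's worth, here is a minimal Python sketch of the loop described above (pick_two_sites() and energy() are hypothetical placeholders for the occupation-rule-respecting site choice and the neighbor-based energy calculation):

# Greedy configurational search as described; temperature is not included.
def relax(grid, n_steps):
    e_old = energy(grid)                      # energy() is a placeholder
    for _ in range(n_steps):
        a, b = pick_two_sites(grid)           # respects the occupation rules
        grid[a], grid[b] = grid[b], grid[a]   # virtually swap the two sites
        e_new = energy(grid)
        if e_new <= e_old:
            e_old = e_new                     # keep the lower-energy state
        else:
            grid[a], grid[b] = grid[b], grid[a]  # otherwise revert the swap
    return grid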
The next step is to include the temperature effect, and to add the second type of empty sites.
Have a nice day,
M.
I am working on a user behavior project. Based on user interaction I have got some data. There is a nice sequence which smoothly increases and decreases over time, but there are little discrepancies, which are very bad. Please refer to the graph below:
You can also find the data here:
2.0789 2.09604 2.11472 2.13414 2.15609 2.17776 2.2021 2.22722 2.25019 2.27304 2.29724 2.31991 2.34285 2.36569 2.38682 2.40634 2.42068 2.43947 2.45099 2.46564 2.48385 2.49747 2.49031 2.51458 2.5149 2.52632 2.54689 2.56077 2.57821 2.57877 2.59104 2.57625 2.55987 2.5694 2.56244 2.56599 2.54696 2.52479 2.50345 2.48306 2.50934 2.4512 2.43586 2.40664 2.38721 2.3816 2.36415 2.33408 2.31225 2.28801 2.26583 2.24054 2.2135 2.19678 2.16366 2.13945 2.11102 2.08389 2.05533 2.02899 2.00373 1.9752 1.94862 1.91982 1.89125 1.86307 1.83539 1.80641 1.77946 1.75333 1.72765 1.70417 1.68106 1.65971 1.64032 1.62386 1.6034 1.5829 1.56022 1.54167 1.53141 1.52329 1.51128 1.52125 1.51127 1.50753 1.51494 1.51777 1.55563 1.56948 1.57866 1.60095 1.61939 1.64399 1.67643 1.70784 1.74259 1.7815 1.81939 1.84942 1.87731
1.89895 1.91676 1.92987
I want to smooth out this sequence. The technique should be able to eliminate numbers with the characteristics of X and Y, i.e., errors in an otherwise monotonically increasing or decreasing run.
If it cannot eliminate them, the technique should be able to shift them so that the series is not affected by the errors.
What I have tried and how it failed:
I tried to test the difference between values. In some special cases it works, but for a sequence like the one presented here, the distances between numbers are not such that I can cut out the errors.
I tried applying a counter: a change is accepted only if it exceeds some X, otherwise the point is mapped to the previous point. Here I have great trouble deciding on the value of X; because this is based on user interaction, I am not really in control of it. If the user interaction is such that its plot would be a zigzag pattern, I end up in a 'no user movement data detected at all' situation.
Please share the techniques that you are aware of.
PS: The data made available in this example is a particular case. There is no typical pattern in which the numbers are going to occur, but we expect some range to be continuous across all the examples. The solution I am seeking is generic.
I do not know how much effort you want to put into this problem, but if you want theoretical guarantees, topological persistence seems well adapted to your problem, imho.
Basically, with that method you can filter local maxima/minima by fixing a scale, and there are theoretical proofs saying that if your sampling is close to your function, then you extract the correct number of maxima with persistence.
You can see these slides (mainly pages 7-9) to get an idea of the method.
Basically, if you take your points as a landscape and imagine a water level starting from the maximum height and decreasing, you get some peaks.
Every peak has a time when it is born, which is when it becomes emerged, and a time when it dies, which is when it merges with a higher peak. A persistence diagram then plots a point for every peak whose x/y coordinates are its times of birth/death (by convention the first peak does not die and is not shown).
If a peak is a global maximum, it will be further from the diagonal in the persistence diagram than a local maximum peak. To remove local maxima you have to remove the peaks close to the diagonal. There are four local maxima in your example, as you can see in the persistence diagram of your data (thanks for providing the data, btw), and two global ones (the first peak is not pictured in a persistence diagram):
If you add noise to your data like this:
you will still get a very decent persistence diagram that will allow you to filter the local maxima as you want:
Please ask if you want more details or references.
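If you want to experiment without pulling in a topology library, here is a small self-contained Python sketch of 0-dimensional persistence for a 1D series (my own illustration of the method, not code from the slides):

def persistence_pairs(values):
    # Return (birth, death) value pairs for the peaks of a 1D series by
    # lowering a water level from the highest sample to the lowest.
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    parent = [-1] * n                 # -1 means "not yet emerged"

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for i in order:
        left = i > 0 and parent[i - 1] != -1
        right = i < n - 1 and parent[i + 1] != -1
        parent[i] = i
        if left and right:            # two peaks merge: the lower one dies
            rl, rr = find(i - 1), find(i + 1)
            lo, hi = (rl, rr) if values[rl] < values[rr] else (rr, rl)
            pairs.append((values[lo], values[i]))
            parent[lo] = hi
            parent[i] = hi
        elif left:                    # extend the peak on the left
            parent[i] = find(i - 1)
        elif right:                   # extend the peak on the right
            parent[i] = find(i + 1)
    return pairs

The persistence of a peak is its birth value minus its death value; peaks with small persistence are the ones close to the diagonal, so filtering on that value gives you the scale parameter discussed above.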
Since you cannot decide on a cutoff frequency, nor even on the filter you want to use, I would implement several and let the user set the parameters.
The first thing that I thought of is a running average, and you can see that there are many things to set to get different outputs.
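For example, a minimal running-average sketch in Python (assuming numpy; the window length is the parameter you would expose to the user):

import numpy as np

def running_average(data, window=5):
    # Centered moving average; larger windows smooth more aggressively.
    kernel = np.ones(window) / window
    # mode='same' keeps the output the same length as the input; the first
    # and last window//2 points are edge-affected, so read them with care.
    return np.convolve(data, kernel, mode='same')

# e.g. smoothed = running_average(values, window=7)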
I am new to RapidMiner 5 and just want to know how to find the noise in my data, show it in a chart, and delete it.
A complex problem, because it depends on what you mean by noise.
If you mean finding individual attributes whose values are plain wrong then you could plot a histogram view and work out some sort of limits on what constitutes a valid value. You could then impose that rule by using Filter Examples to remove them.
If you mean finding attributes that have some sort of random jitter applied to them it would be difficult to detect these. Only by knowing beforehand what the expected shape of the distribution is could you compare with observation and do something about it. However, the action to take is by no means obvious.
If you mean finding examples within an example set that are obviously different from other examples then you could consider using the various outlier functions. The simplest one to get started is Detect Outlier (Distances). This finds a set number of outliers (default 10) based on a distance calculation that uses all the attributes for examples. It creates a new attribute called outlier that is set to true or false. You could then use the Filter Examples operator to remove those that are set to true.
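For intuition, this is roughly the kind of computation such a distance-based operator performs (a Python sketch of the general idea, not RapidMiner's actual implementation):

import numpy as np

def flag_outliers(X, k=10):
    # Mark the k examples that are farthest from their nearest neighbor.
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # ignore self-distances
    nearest = d.min(axis=1)                    # distance to nearest neighbor
    outlier = np.zeros(len(X), dtype=bool)
    outlier[np.argsort(nearest)[-k:]] = True   # the k most isolated examples
    return outlier

You could then drop the flagged rows, which is the equivalent of following Detect Outlier (Distances) with Filter Examples.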
Hope that helps at least as a start.