Slow insertion using Neptune and Gremlin

I'm having problems with slow insertion into Neptune using Gremlin.
I am trying to insert many nodes and edges, potentially hundreds of thousands of them, while checking for existence.
Currently, we are using inject to insert the nodes, and the problem is that it is slow.
After running the explain command, we figured out that the problem is the coalesce and where steps: they take more than 99.9% of the run duration.
I want to insert each node and edge only if it doesn’t exist, and that’s why I am using the coalesce and where steps.
For example, the query we use to insert nodes with inject:
properties_list = [{'uid':'1642'},{'uid':'1322'}…]
g.inject(properties_list).unfold().as_('node')
.sideEffect(__.V().where(P.eq('node')).by('uid').fold()
.coalesce(__.unfold(), __.addV(label).property(Cardinality.single,'uid','1')))
With 1000 nodes in the graph and properties_list with 100 elements, running the query above takes around 30 seconds, and it gets slower as the number of nodes in the graph increases.
Running a naive injection with the same environment as the query above, without coalesce and where, takes less than 1 second.
I’d like to hear your suggestions and to know what are the best practices for inserting many nodes and edges (with checking for existence).
Thank you very much.

If you have a set of IDs that you want to check for existence, you can speed up the query significantly by also providing just a list of IDs to the query and calculating the intersection of the ones that exist upfront. Then, having calculated the set that need updates you can just apply them in one go. This will make a big difference. The reason you are running into problems is that the mid traversal V has a lot of work to do. In general it would be better to use actual IDs rather than properties (UID in your case). If that is not an option the same technique will work for property based IDs. The steps are:
Using inject or sideEffect insert the IDs to be found as one list and the corresponding map containing the changes to conditionally be applied in a separate map.
Find the intersection of the ones that exist and those that do not.
Using that set of non existing ones, apply the updates using the values in the set to index into your map.
Here is a concrete example. I used the graph-notebook for this but you can do the same thing in code:
Given:
ids = "['1','2','9998','9999']"
and
data = "[['id':'1','value':'XYZ'],['id':'9998','value':'ABC'],['id':'9999','value':'DEF']]"
we can do something like this:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by()
which correctly finds the ones that do not already exist:
{'id': 9998, 'value': 'ABC'}
{'id': 9999, 'value': 'DEF'}
You can use this pattern to construct your conditional inserts a lot more efficiently (I hope :-) ). So to add the new vertices you might do:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by().
addV('test').
property(id,select('d').select('id')).
property('value',select('d').select('value'))
v[9998]
v[9999]
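As a rough illustration of what the where(without('exist')) filter does, here is the same set logic in plain Python. This is only a sketch: rows_to_insert is a made-up helper name, and the list of existing IDs is assumed to have been fetched from the graph already (for example with the g.V().hasId(...).id() part of the queries above).

```python
def rows_to_insert(candidate_rows, existing_ids):
    """Return the rows whose 'id' is not already present in the graph."""
    existing = set(existing_ids)  # set membership is O(1) per lookup
    return [row for row in candidate_rows if row["id"] not in existing]


# Same sample data as above; assume the graph already holds vertices '1' and '2'.
data = [
    {"id": "1", "value": "XYZ"},
    {"id": "9998", "value": "ABC"},
    {"id": "9999", "value": "DEF"},
]
new_rows = rows_to_insert(data, existing_ids=["1", "2"])
```

Only the rows left in new_rows would then need an addV, which is the point of computing the difference once up front instead of doing one existence lookup per candidate.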
As a side note, we are adding two new steps to Gremlin - mergeV and mergeE that will allow this to be done much more easily and in a more declarative style. Those new steps should be part of the TinkerPop 3.6 release.

Related

Adding a node and edge to a graph using Gremlin behaving strange

I'm new to using Gremlin (up until now I was accessing Neptune using openCypher and gave up due to how slow it was) and I'm getting really confused over some stuff here.
Basically what I'm trying to do is -
Let us say we have some graph A-->B-->C. There are multiple such graphs in the database, so I'm looking for the specific A,B,C nodes that have the property 'idx' equals '1'. I want to add a node D{'idx' = '1'} and an edge so I will end up having
A-->B-->C-->D
It is safe to assume A,B,C already exist and are connected together.
Also, we wish to add D only if it doesn't already exist.
So what I currently have is this:
g.V().
hasLabel('A').has('idx', '1').
out().hasLabel('B').has('idx', '1').
out().hasLabel('C').has('idx', '1').as('c').
V().hasLabel('D').has('idx', '1').fold().
coalesce(
unfold(),
addV('D').property('idx','1')).as('d').
addE('TEST_EDGE').from('c').to('d')
Now the problem is that this doesn't work, and I don't understand Gremlin well enough to see why. Neptune returns "An unexpected error has occurred in Neptune" with the code "InternalFailureException".
Another thing to mention is that if node D does exist, I don't get an error at all, and in fact the node is properly connected to the graph as it should be.
Furthermore, I've seen in a different post that using ".as('c')" shouldn't work, since the 'fold' step afterwards makes it unusable (for a reason I still don't understand, probably because I'm not sure how .as, .store and .aggregate work).
That post suggests using ".aggregate('c')" instead, but doing so changes the returned error to "addE(TEST_EDGE) could not find a Vertex for from() - encountered: BulkSet". This, added to the fact that the code I wrote actually works and connects node D to the graph if it already exists, makes me even more confused.
So I'm lost
Any help or clarification or explanation or simplification would be much appreciated
Thank you! :)
A few comments before getting to the query:
If the intent is to have multiple subgraphs of (A->B->C), then you may not want to use this labeling scheme. Labels are meant to be of lower variation - think of labels as groups of vertices of the same "type".
A lookup of a vertex by an ID is the fastest way to find a vertex in a TinkerPop-based graph database. Just be aware of that as you build your access patterns. Instead of doing something like `hasLabel('x').has('idx','y')`, if both of those items combined make a unique vertex, you may also want to think of creating a composite ID of something like 'x-y' for that vertex for faster access/lookup.
On the query...
The first part of the query looks good. I think you have a good understanding of the imperative nature of Gremlin just up until you get to the second V() in the query. That V() is going to tell Neptune to start evaluating against all vertices in the graph again. But we want to continue evaluating beyond the 'C' vertex.
Unless you need to return an output in either case of existence or non-existence, you could get away with just doing the following without a coalesce() step:
g.V().
hasLabel('A').has('idx', '1').
out().hasLabel('B').has('idx', '1').
out().hasLabel('C').has('idx', '1').
where(not(out().hasLabel('D').has('idx','1'))).
addE('TEST_EDGE').to(
addV('D').property('idx','1'))
The where clause allows us to do the check for the non-existence of a downstream edge and vertex without losing our place in the traversal. It will only continue the traversal if the condition specified is not() found in this case. If it is not found, the traversal continues with where we left off (the 'C' vertex). So we can feed that 'C' vertex directly into an addE() step to create our new edge and new 'D' vertex.
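To make the effect of the where(not(...)) guard concrete, here is a small in-memory model in Python. It is only a sketch and does not talk to Neptune: vertices are modeled as (label, idx) tuples, the graph as a dict of outgoing edges, and the function name add_d_if_missing and the edge label 'E' are made up for illustration.

```python
def add_d_if_missing(out_edges, c_vertex, idx):
    """Attach a ('D', idx) vertex to c_vertex unless one is already attached.

    out_edges maps a vertex to a list of (edge_label, target_vertex) pairs.
    Returns True if an edge was added, False if the guard filtered it out.
    """
    d_vertex = ("D", idx)
    neighbours = out_edges.setdefault(c_vertex, [])
    # Equivalent of where(not(out().hasLabel('D').has('idx', idx)))
    if any(target == d_vertex for _, target in neighbours):
        return False
    # Equivalent of addE('TEST_EDGE').to(addV('D').property('idx', idx))
    neighbours.append(("TEST_EDGE", d_vertex))
    return True


# A->B->C subgraph with idx '1'; the 'E' edge label is a placeholder.
graph = {
    ("A", "1"): [("E", ("B", "1"))],
    ("B", "1"): [("E", ("C", "1"))],
}
first = add_d_if_missing(graph, ("C", "1"), "1")   # adds D
second = add_d_if_missing(graph, ("C", "1"), "1")  # guard stops a duplicate
```

Running it twice shows the behaviour you want from the query: the second call is a no-op because the 'C' vertex already has its 'D' neighbour.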

Elasticsearch scoring on multiple indexes: dfs_query_then_fetch returns the same scores as query_then_fetch

I have multiple indices in Elasticsearch (and the corresponding documents in Django created using django-elasticsearch-dsl). All of the indices have these settings:
settings = {'number_of_shards': 1,
'number_of_replicas': 0}
Now, I am trying to perform a search across all the 10 indices. In order to retrieve consistent scoring between the results from different indices, I am using dfs_query_then_fetch:
search = Search(index=['mov*'])
search = search.params(search_type='dfs_query_then_fetch')
objects = search.query("multi_match", query='Tom & Jerry', fields=['title', 'actors'])
I get bad results due to inconsistent scoring. A book called 'A story of Jerry and his friend Tom' from one index can be ranked higher than the cartoon 'Tom & Jerry' from another index. The reason is that dfs_query_then_fetch is not working. When I remove it or substitute with the simple query_then_fetch, I get absolutely the same results with the identical scoring.
I have tested it on URI requests as well, and I always get the same scores for both search types.
What can be the reason for it?
UPDATE: The results are actually not the same, but they are only really slightly different, e.g. a score of 50.1 with dfs and 50.0 without dfs, while the same model within one index has a score of 80.0.
If the number of shards is 1, then dfs_query_then_fetch and query_then_fetch will return the same result. dfs_query_then_fetch first collects term statistics from all the shards involved and then scores the results using those global statistics, but in this case there is only one shard.
Regarding the scoring, you might want to have a look at your actors field too. Also, do let us know which analyzer and tokenizer you are using, if you have custom ones.
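To illustrate the single-shard point, here is a small Python sketch comparing per-shard IDF with the IDF computed from pooled statistics. It assumes the BM25 IDF formula used by Lucene's default similarity, log(1 + (N - n + 0.5) / (n + 0.5)); the document counts and frequencies are invented numbers, and the function names are hypothetical.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """BM25 inverse document frequency for one term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def local_idfs(shards):
    """IDF as each shard would compute it from its own statistics."""
    return [bm25_idf(doc_count, doc_freq) for doc_count, doc_freq in shards]


def global_idf(shards):
    """IDF computed from statistics pooled across all shards (the DFS idea)."""
    total_docs = sum(doc_count for doc_count, _ in shards)
    total_freq = sum(doc_freq for _, doc_freq in shards)
    return bm25_idf(total_docs, total_freq)


# (documents in shard, documents containing the term): made-up numbers.
one_shard = [(1000, 10)]
two_shards = [(1000, 10), (1000, 500)]

same = local_idfs(one_shard)[0] == global_idf(one_shard)
per_shard = local_idfs(two_shards)
pooled = global_idf(two_shards)
```

With one shard the local and pooled values coincide, so both search types score identically. With several shards whose term frequencies differ, the per-shard values diverge and the pooled value sits between them.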

Storm and stop words

I am new to the Storm framework (https://storm.incubator.apache.org/about/integrates.html).
I tested my code locally and I think it will perform better if I remove stop words, but I searched online and couldn't find any example of removing stop words in Storm.
If the size of the stop words list is small enough to fit in memory, the most straightforward approach would be to simply filter the tuples with an implementation of storm Filter that knows that list. This Filter could possibly poll the DB every so often to get the latest list of stop words if this list evolves over time.
If the size of the stop words list is bigger, then you can use a QueryFunction, called from your topology with the stateQuery function, which would:
receive a batch of tuples to check (say 10000 at a time)
build a single query from their content and look up corresponding stop words in persistence
attach a boolean to each tuple specifying what to do with each one
+ add a Filter right after that to filter based on that boolean.
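The batch flow above can be sketched independently of Storm. The Python below is not Storm API code: flag_stop_words stands in for the QueryFunction, drop_flagged for the Filter that follows it, and fake_lookup for the single query against persistence. All three names are hypothetical.

```python
def flag_stop_words(batch, lookup_stop_words):
    """Attach a boolean to every word using one lookup for the whole batch.

    lookup_stop_words takes a set of words and returns the subset that are
    stop words, so persistence is hit once per batch, not once per tuple.
    """
    found = lookup_stop_words(set(batch))
    return [(word, word in found) for word in batch]


def drop_flagged(flagged):
    """The Filter step: keep only the words not flagged as stop words."""
    return [word for word, is_stop in flagged if not is_stop]


# Stand-in for the persistence layer.
STOP_WORDS = {"the", "a", "of"}


def fake_lookup(words):
    return words & STOP_WORDS


flagged = flag_stop_words(["the", "storm", "of", "tuples"], fake_lookup)
kept = drop_flagged(flagged)
```

The design point is the same as in the list: the cost of the lookup is amortised over the batch, and the filtering itself stays a trivial per-tuple check.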
And if you feel adventurous:
Another and faster approach would be to use a bloom filter approximation. I heard that Algebird is meant to provide this kind of functionality and targets both Scalding and Storm (how cool is that?), but I don't know how stable it is nor do I have any experience in practically plugging it into Storm (maybe Sunday if it's rainy...).
Also, Cascading (which is not directly related to Storm but has a very similar set of primitive abstractions on top of map reduce) suggests in this tutorial a method based on left joins. Such joins exist in Storm and the right branch could possibly be fed with a FixedBatchSpout emitting all stop words every time, or even a custom spout that reads the latest version of the list of stop words from persistence every time, so maybe that would work too? Maybe? This also assumes the size of the stop words list is relatively small though.
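For the Bloom filter idea mentioned above, here is a minimal self-contained sketch in Python. It is not Algebird and not tuned for production; the bit-array size and number of hashes are arbitrary assumptions, and the class and method names are made up.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


stop_filter = BloomFilter()
for word in ("the", "a", "of"):
    stop_filter.add(word)
```

A word for which might_contain returns False is definitely not a stop word. A True answer can occasionally be a false positive, so whether that is acceptable depends on how much it matters if a non-stop word is dropped now and then.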

Check a fingerprint in the database

I am saving the fingerprints in a "blob" field. Is the only way to compare these prints to retrieve all the prints saved in the database and then build a vector to check with the function "identify_finger"? Or can I check directly in the database using a SELECT?
I'm working with libfprint. In this code the verification is done in a vector:
def test_identify():
    cur = DB.cursor()
    cur.execute('select id, fp from print')
    id = []
    gallary = []
    for row in cur.fetchall():
        data = pyfprint.pyf.fp_print_data_from_data(str(row['fp']))
        gallary.append(pyfprint.Fprint(data_ptr = data))
        id.append(row['id'])
    n, fp, img = FingerDevice.identify_finger(gallary)
There are two fundamentally different ways to use a fingerprint database. One is to verify the identity of a person who is known through other means, and one is to search for a person whose identity is unknown.
A simple library such as libfprint is suitable for the first case only. Since you're using it to verify someone you can use their identity to look up a single row from the database. Perhaps you've scanned more than one finger, or perhaps you've stored multiple scans per finger, but it will still be a small number of database blobs returned.
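A minimal sketch of that verification-style lookup, in Python with an in-memory SQLite database standing in for the real one. The table and column names (print, id, fp) follow the question's query; treating id as the person's identity, allowing several rows per person, and the helper name load_prints_for are all assumptions.

```python
import sqlite3


def load_prints_for(conn, person_id):
    """Fetch only the stored prints belonging to one claimed identity."""
    cur = conn.execute("SELECT id, fp FROM print WHERE id = ?", (person_id,))
    return [row[1] for row in cur.fetchall()]


# Demo data: person 1 has two stored scans, person 2 has one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE print (id INTEGER, fp BLOB)")
conn.executemany(
    "INSERT INTO print VALUES (?, ?)",
    [(1, b"scan-a"), (1, b"scan-b"), (2, b"scan-c")],
)
candidates = load_prints_for(conn, 1)
```

The small list in candidates is what would be turned into Fprint objects and handed to the matcher, instead of the whole table.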
A fingerprint search algorithm must be designed from the ground up to narrow the search space, to compare quickly, and to rank the results and deal with false positives. Just as a Google search may come up with pages totally unrelated to what you're looking for, so too will a fingerprint search. There are companies that devote their entire existence to solving this problem.
Another way would be to have a mysql plugin that knows how to work with fingerprint images and select based on what you are looking for.
I really doubt that there is such a thing.
You could also try to parallelize the fingerprint comparison, i.e. by calling:
FingerDevice.identify_finger(gallary)
in parallel, on different cores/machines
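A rough sketch of how the gallery could be split and matched in parallel, in Python. The match_chunk callable is hypothetical: it is assumed to compare one already-captured scan against a slice of the gallery and return the index of the match within that slice, or None. Whether threads give a real speed-up depends on the underlying C call releasing the GIL, which is also an assumption; separate processes or machines would follow the same split-and-merge shape.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_identify(gallery, match_chunk, workers=4):
    """Return the gallery index of the first match, or None."""
    chunks = chunked(gallery, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(match_chunk, chunks))
    offset = 0
    for chunk, local_index in zip(chunks, results):
        if local_index is not None:
            return offset + local_index
        offset += len(chunk)
    return None


# Stand-in matcher: the "fingerprint" we are looking for is the value 7.
def fake_match(chunk):
    return chunk.index(7) if 7 in chunk else None


found_at = parallel_identify(list(range(10)), fake_match)
```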
You can't check directly from the database using a SELECT because each scan is different and will produce different blobs. libfprint does the hard work of comparing different scans and judging if they are from the same person or not
What zinking and Tudor are saying, I think, is that if you understand how that judgement process works (which is, by the way, by minutiae comparison) you can develop a method of storing the relevant data for the process (the minutiae, maybe?) in the database and then a method for fetching the relevant values, maybe a kind of index or some type of extension to the database.
In other words, you would have to reimplement the libfprint algorithms in a more complex (and beautiful) way, instead of just accepting the libfprint method of comparing the scan with all stored fingerprint in a loop.
Other solutions for speeding up your program:
use C:
I only know sufficient C to write kind of hello-world programs, but it was not hard to write code in pure C to use the fp_identify_finger_img function of libfprint and I can tell you it is much faster than pyfprint.identify_finger.
You can continue doing the enrollment part of the stuff in python. I do it.
use a time / location based SELECT:
If you know your users are more likely to scan their fingerprints at some times than at others, or at some places than at others (maybe arriving at work at a certain time, or leaving, or entering the building by one gate or another), you can collect data at each scan to estimate those probabilities, and create parallel tables that rank the users by their probability of arriving at each time and location.
We know that identify_finger tries to identify fingers in a loop with the fingerprint objects you provided in a list, so we can use that and give it the objects sorted in a way in which the more likely user for that time and that location will be the first in the list and so on.
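A small Python sketch of that ordering step. The shape of the scan history (user id, hour, location) and the function name order_by_likelihood are assumptions for illustration; in practice the counts would come from the parallel tables described above.

```python
from collections import Counter


def order_by_likelihood(user_ids, scan_history, hour, location):
    """Sort users so those seen most often at this hour and place come first.

    scan_history is an iterable of (user_id, hour, location) tuples.
    Users never seen in this context keep their original relative order.
    """
    counts = Counter(
        uid for uid, h, loc in scan_history if h == hour and loc == location
    )
    return sorted(user_ids, key=lambda uid: counts[uid], reverse=True)


history = [
    (3, 8, "gate1"),
    (3, 8, "gate1"),
    (2, 8, "gate1"),
    (1, 17, "gate2"),
]
ordered = order_by_likelihood([1, 2, 3], history, hour=8, location="gate1")
```

The gallery passed to identify_finger would then be built in the order given by ordered, so the most probable users are compared first.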

JPA 2.0: Batching queries with IN clause

I am looking for a strategy to batch all my queries (with IN clause) to overcome the restrictions by databases on IN clause (See here).
I usually get lists of size 100000 to 305000, so this has become very important to tackle.
I have tried two strategies so far.
Strategy 1:
Create an entity and hence a table with one column to hold such values (can we create temp tables on the fly with JPA 2.0 vendor-independent?) and use the data from the temp table as a subquery to the original query before eventually cleaning up the temp table.
Advantage: Very performant queries. Really quick, I must admit for the numbers I have mentioned, it was mostly under a minute.
Possible drawback: Use of temp table which is actually a permanent one in my case thus far.
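For illustration, here is what Strategy 1 looks like at the SQL level, sketched in Python with an in-memory SQLite database. This is not JPA code; the item and wanted_ids tables and the function name are invented for the example, and other databases have their own temp-table syntax.

```python
import sqlite3


def select_by_ids_via_temp_table(conn, ids):
    """Load the ids into a temp table and use it as a subquery."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS wanted_ids (id INTEGER PRIMARY KEY)"
    )
    conn.execute("DELETE FROM wanted_ids")
    conn.executemany(
        "INSERT OR IGNORE INTO wanted_ids (id) VALUES (?)",
        ((i,) for i in ids),
    )
    rows = conn.execute(
        "SELECT id, name FROM item "
        "WHERE id IN (SELECT id FROM wanted_ids) ORDER BY id"
    ).fetchall()
    conn.execute("DELETE FROM wanted_ids")  # clean up after the query
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 1001)],
)
rows = select_by_ids_via_temp_table(conn, range(1, 5001, 2))
```

The IN list never appears in the statement text, so its length is not limited by the database's restriction on the number of literals or bind parameters.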
Strategy 2:
Calculate the batch size for the given input list and for each batch execute the query and accumulate the result.
Advantage: No temp tables. Easy for any threads within the same transaction.
Disadvantage: A big disadvantage is the amount of time it takes to execute all the batches. For the mentioned numbers, this is at an unacceptable level at the moment. It takes anything between 5 and 15 minutes!
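Strategy 2 can be sketched the same way, again in Python with SQLite rather than JPA. The batch size of 500 is an arbitrary assumption that stays under typical limits on bind parameters; the item table and the function name are made up.

```python
import sqlite3


def select_by_ids_in_batches(conn, ids, batch_size=500):
    """Run one IN query per batch of ids and accumulate the rows."""
    results = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        query = f"SELECT id, name FROM item WHERE id IN ({placeholders})"
        results.extend(conn.execute(query, batch).fetchall())
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 1001)],
)
wanted = list(range(1, 5001, 2))  # 2500 ids, so five batches
rows = select_by_ids_in_batches(conn, wanted)
```

Each batch is a separate round trip, which is where the time goes for lists in the hundreds of thousands. A larger batch size reduces the number of round trips up to whatever limit the database imposes.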
I would appreciate any feedback, suggestions or improvements from all you JPA gurus.
Thanks.
I only tested up to 50,000 integers but I have some pretty good performance data around splitting large lists using various methods, with CLR and a numbers table leading the pack at the higher end:
Splitting a list of integers : another roundup
Not sure if you are using integers or strings but the results should be roughly equivalent.
As an aside, I'll confess I have no idea what JPA 2.0 is, but I assume you can control the format of the lists that it sends to SQL Server.