TimeoutError and TimeLimitExceededException when running AWS Neptune query

I have about 2 million nodes and their transaction connections stored in a Neptune database.
I am running two different queries that hit similar issues, and I don't know how to solve either of them.
The first query tries to generate a 2-hop graph starting from one user: g.V(source).outE('friend').otherV().outE('friend').toList(). For a 1-hop graph the query works fine, but for 2 hops or more I get the following error:
gremlin_python.driver.protocol.GremlinServerError: 598: {"detailedMessage":"A timeout occurred within the script or was otherwise cancelled directly during evaluation of [1e582e78-bab5-462c-9f24-5597d53ef02f]","code":"TimeLimitExceededException","requestId":"1e582e78-bab5-462c-9f24-5597d53ef02f"}
The second query finds a path (it does not need to be the shortest path, just any path) from a source node to a target node: g.V().hasId(str(source)).repeat(__.out().simplePath()).until(__.hasId(str(target))).path().limit(1).toList()
The query works for pairs of nodes that are relatively close (at most 4 hops apart), but for pairs that are further apart I get the following error:
*** tornado.ioloop.TimeoutError: Operation timed out after 30 seconds
I was wondering if anyone has suggestions on how to solve these time limit errors. I would really appreciate any help with this, thanks!

This is a known bug in the TinkerPop 3.4.9 Python client. Please see the thread on the Gremlin users mailing list for details of the issue and the workaround:
https://groups.google.com/g/gremlin-users/c/K0EVG3T-UrM
You can change the 30-second timeout using the following code snippet:
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.driver.tornado.transport import TornadoTransport
from gremlin_python.structure.graph import Graph

graph = Graph()
# Pass None to disable the 30-second client-side read/write timeouts.
connection = DriverRemoteConnection(
    endpoint, 'g',
    transport_factory=lambda: TornadoTransport(read_timeout=None, write_timeout=None))
g = graph.traversal().withRemote(connection)
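Note that the snippet above only removes the client-side (tornado) 30-second timeout; the server-side TimeLimitExceededException from the first query is governed by Neptune's neptune_query_timeout DB parameter, and long path searches can also be bounded in the query itself. Below is a minimal sketch of a depth-capped version of the path query; the cap of 6 hops is an arbitrary assumption, and source/target are the same variables as above.

from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import P

# Stop expanding a branch once the target is reached or 6 hops have been
# explored, then keep only the paths that actually end at the target.
result = (g.V().hasId(str(source))
           .repeat(__.out().simplePath())
           .until(__.hasId(str(target)).or_(__.loops().is_(P.gte(6))))
           .hasId(str(target))
           .path()
           .limit(1)
           .toList())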

Related

Power Query & Zabbix API - DataSource.Error (500) Internal Server Error when requesting a large amount of data

I am using Power Query to fetch data from Zabbix using their API. It works fine when I fetch data for a few days, but as I increase the period and the amount of data surpasses millions of rows, I just get the error below after waiting for some time, and the query doesn't return anything.
I am using Web.Contents to get the data as follows:
I have added that timeout, as you can see above, but the error occurs well before 5 minutes have passed. How should I solve this? Is there a way to fetch large amounts of data in Power Query in parts, rather than all at once? Or does this error happen because of connection parameters inherent to the Zabbix configuration?
My team changed all the possible parameters regarding server memory and nothing seemed to work. One thing to note: although Power Query hits the same (500) Internal Server Error whether I get data for a period of 3 days or 30 days, in the first case it shows the error much faster, while in the latter it takes much longer and eventually ends with the same error.
Thanks!
It's a PHP memory limit being hit; you should raise the maximum memory.
For example, in a standard Apache setup you would edit /etc/httpd/conf.d/zabbix.conf and set php_value memory_limit to a greater value (then restart Apache!).
The default is 128M; the "right" setting depends on the memory available on the system and the maximum amount of data you want to fetch.
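For illustration only (512M is an assumed value; pick one that fits your server's RAM and workload), the relevant directive in /etc/httpd/conf.d/zabbix.conf would end up looking like:

php_value memory_limit 512M

followed by an Apache restart, e.g. systemctl restart httpd on a RHEL/CentOS system.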

How to know when elasticsearch is ready for query after adding new data?

I am trying to do some unit tests using Elasticsearch. I first use the index API about 100 times to add new data to my index, then I use the search API with aggs. The problem is that if I don't pause for 1 second after adding the data 100 times, I get random results; if I wait 1 second, I always get the same result.
I'd rather not wait an arbitrary amount of time in my tests; that seems like bad practice. Is there a way to know when the data is ready?
I already wait until I get a success response from the Elasticsearch index API, but that does not seem to be enough.
First, I'd suggest indexing your documents with a single bulk request: it saves some time by avoiding HTTP/TCP overhead.
To answer your question, you should consider using the refresh=true parameter (or refresh=wait_for) while indexing your 100 documents.
As stated in the documentation, it will "refresh the relevant primary and replica shards (not the whole index) immediately after the operation occurs, so that the updated document appears in search results immediately".
More about it here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html
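A minimal sketch of the bulk-plus-refresh approach with the official Python client (the endpoint, index name, and document shape are assumptions):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local test endpoint

actions = [{"_index": "test-index", "_source": {"value": i}} for i in range(100)]

# refresh="wait_for" makes the call return only once the new documents are
# visible to search, so a following search with aggs is deterministic.
helpers.bulk(es, actions, refresh="wait_for")

With this in place, the one-second sleep in the tests should no longer be necessary.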

BigQueryIO - only first day table can be created, despite having CreateDisposition.CREATE_IF_NEEDED

I have a Dataflow job processing data from Pub/Sub, defined like this:
read from pub/sub -> process (my function) -> group into day windows -> write to BQ
I'm using Write.Method.FILE_LOADS because of bounded input.
My job works fine, processing lots of GBs of data, but it fails and retries forever when it has to create another table. The job is meant to run continuously and create the day tables on its own; it does fine on the first few, but then gives me, indefinitely:
Processing stuck in step write-bq/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at least 05h30m00s without outputting or completing in state finish
Before this happens it also throws:
Load job <job_id> failed, will retry: {"errorResult":{"message":"Not found: Table <name_of_table> was not found in location US","reason":"notFound"}
The error itself is correct, because this table doesn't exist. The problem is that the job should create it on its own because of the CreateDisposition.CREATE_IF_NEEDED option.
The number of day tables that it creates correctly without a problem depends on the number of workers. It seems that once some worker creates a table, its CreateDisposition changes to CREATE_NEVER, causing the problem, but that's only my guess.
A similar problem was reported here, but without any definitive answer:
https://issues.apache.org/jira/browse/BEAM-3772?focusedCommentId=16387609&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16387609
The ProcessElement definition here seems to give some clues, but I cannot really tell how it works with multiple workers: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L138
I am using Apache Beam SDK 2.15.0.
I encountered the same issue, which is still not fixed in Beam 2.27.0 as of January 2021. Therefore I had to develop a workaround: a custom PTransform which checks whether the target table exists before the BigQueryIO stage. It uses the BigQuery Java client for this, plus a Guava cache and a windowing strategy (fixed, checking every 15 s), to sustain heavy traffic of about 5000 elements per second. Here is the code: https://gist.github.com/matthieucham/85459eff5fdea8d115be520e2dd5ccc1
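The linked gist is in Java; purely to illustrate the idea of the pre-check, an equivalent lookup-and-create step with the BigQuery Python client could look like the sketch below (project, dataset, table name, and schema are all hypothetical):

from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client()

def ensure_table(table_id, schema):
    # Create the day table up front so the load job never hits "notFound".
    try:
        client.get_table(table_id)  # raises NotFound if the table is missing
    except NotFound:
        client.create_table(bigquery.Table(table_id, schema=schema))

ensure_table(
    "my-project.my_dataset.events_20210101",      # hypothetical day table
    [bigquery.SchemaField("payload", "STRING")],  # hypothetical schema
)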
There was a bug in the past that caused this error, but that particular one was fixed in the following commit: https://github.com/apache/beam/commit/d6b4dcec5f297f5c1bd08f345f0e1e5c756775c2#diff-3f40fd931c8b8b972772724369cea310 Can you check whether the version of Beam you are running includes this commit?

This instance has too many database splits to complete the operation

Until now, no error was issued even when 20 were created per database, but when I suddenly created more than 16, I got an error.
This is totally different from what is described in https://cloud.google.com/spanner/quotas. I do not understand the reason at all.
It is not clear whether you're talking about 20 and 16 secondary indexes, or what type of operation you were trying to execute.
Just to be clear, splits are not indexes. Splits are not exposed to users in Cloud Spanner; see more details on the topic here:
https://cloud.google.com/spanner/docs/schema-and-data-model#database-splits
The "Too many database splits" error indicates that you need more nodes to manage your dataset. A node may manage up to 2 TB of data:
https://cloud.google.com/spanner/quotas#database_limits

SimpleDB Incremental Index

I understand SimpleDB doesn't have an auto increment, but I am working on a script where I need to query the database by sending the ID of the last record I've already pulled and then pull all subsequent records. In normal SQL fashion, if there were 6,200 records and I already had 6,100 of them, when I ran the script I would query for records with an ID greater than 6100. Looking at the response object, I don't see anything I can use; it just seems like there should be a sequential index there. The other option I was considering is a real timestamp. Any ideas are much appreciated.
Using a timestamp was perfect for what I needed to do. I followed this article to help me on my way: http://aws.amazon.com/articles/1232 I would still welcome it if anyone knows a way to get an incremental index number.
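As a rough sketch of the timestamp approach (the domain and attribute names are hypothetical, and this uses the legacy boto SimpleDB client): SimpleDB stores everything as strings and compares lexicographically, so ISO-8601 timestamps sort correctly in a select.

import boto

conn = boto.connect_sdb()
domain = conn.get_domain('transactions')   # hypothetical domain name

last_seen = '2012-06-01T12:00:00Z'         # timestamp of the last record pulled
query = ("select * from `transactions` "
         "where `created_at` > '%s' order by `created_at` asc" % last_seen)

# SimpleDB requires the order-by attribute to appear in a predicate, which it does.
for item in domain.select(query):
    print(item.name, item)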