I'm trying to perform a bulk update of edges within my property graph using the following Gremlin command:
g.E().hasLabel('foo').property('expiry', 1607808123)
The count of edges in my graph with this label is ~3 million. I expect this operation to be long running, but what I'm encountering is a TimeLimitExceededException from Neptune after approximately 3 minutes. Is there a supported way to perform long-running bulk updates in Amazon Neptune?
Have you tried increasing the neptune_query_timeout value in the database cluster parameter group associated with your cluster? I think the default is 120 seconds, but you can set a larger value.
https://docs.aws.amazon.com/en_us/neptune/latest/userguide/parameters.html
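If you prefer to script that change rather than edit it in the console, a minimal boto3 sketch could look like the following (the parameter group name is a placeholder, and as far as I recall the value is expressed in milliseconds):

import boto3

neptune = boto3.client("neptune")

# Raise the query timeout on the cluster parameter group (placeholder name).
# neptune_query_timeout is, as far as I recall, given in milliseconds,
# so 600000 is roughly 10 minutes. If the parameter turns out to be static,
# ApplyMethod may need to be "pending-reboot" instead.
neptune.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-neptune-cluster-params",
    Parameters=[
        {
            "ParameterName": "neptune_query_timeout",
            "ParameterValue": "600000",
            "ApplyMethod": "immediate",
        }
    ],
)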
I read the docs for the AWS Glue GetPartitions API and noticed that it has a "Segment" parameter:
"Segment": {
"SegmentNumber": number,
"TotalSegments": number
},
I checked the explanations for Segment here and it says:
Defines a non-overlapping region of a table's partitions, allowing multiple requests to be run in parallel.
SegmentNumber - The zero-based index number of the segment. For example, if the total number of segments is 4, SegmentNumber values range from 0 through 3.
TotalSegments - The total number of segments. Minimum value of 1. Maximum value of 10.
It looks strikingly similar to Segments in DynamoDB, which are used by DynamoDB's parallel scan feature (as discussed here: Scan vs Parallel Scan in AWS DynamoDB?). However, the difference I noticed is that segments in Glue are limited to only 10, while they seem unlimited in DynamoDB. Therefore:
1. I would like to know what Glue Segments are, whether they are really the same concept, and whether the use cases for parallel scans in DynamoDB (e.g. table size over 20 GB, read throughput not fully utilized) apply here too. (A sketch of how I understand segmented calls would be made follows below.)
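For concreteness, here is roughly how I would expect the Segment parameter to be used from boto3 to read partitions in parallel (the database and table names are placeholders):

import boto3
from concurrent.futures import ThreadPoolExecutor

glue = boto3.client("glue")

def fetch_segment(segment_number, total_segments=10):
    # Fetch every partition belonging to one non-overlapping segment.
    paginator = glue.get_paginator("get_partitions")
    partitions = []
    for page in paginator.paginate(
        DatabaseName="my_database",            # placeholder
        TableName="my_partitioned_table",      # placeholder
        Segment={"SegmentNumber": segment_number, "TotalSegments": total_segments},
    ):
        partitions.extend(page["Partitions"])
    return partitions

# Run up to 10 segments concurrently (10 being the documented maximum for Glue).
with ThreadPoolExecutor(max_workers=10) as pool:
    all_partitions = [p for seg in pool.map(fetch_segment, range(10)) for p in seg]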
Also, Presto/Trino has a Glue-segment-related parameter in config.properties, as seen in Trino AWS Glue Configuration Properties:
hive.metastore.glue.partitions-segments - Number of segments for partitioned Glue tables, defaults to 5.
which indicates that segments can be utilized by Presto/Trino. However, when I tried setting the parameter to different numbers, queries kept using the same number of splits and the speed remained the same (I used the same number of nodes every time). Therefore:
2. How do Glue Segments fit into Presto/Trino's conceptual model of nodes, tasks, and splits?
And in this article about how to use the AWS Glue Data Catalog as the metastore for Hive on AWS EMR, there is this notice about throttling:
In EMR 5.20.0 or later, parallel partition pruning is enabled automatically for Spark and Hive when the AWS Glue Data Catalog is used as the metastore. This change significantly reduces query planning time by executing multiple requests in parallel to retrieve partitions. The total number of segments that can be executed concurrently ranges between 1 and 10. The default value is 5, which is a recommended setting. You can change it by specifying the property aws.glue.partition.num.segments in the hive-site configuration classification. If throttling occurs, you can turn off the feature by changing the value to 1.
3. How might Glue Segments cause throttling, and in what kind of scenario? (I think this one may become clear as my understanding of Glue Segments starts to form.)
We're running a large bulk load into AWS Neptune and can no longer query the graph for node counts without the query timing out. What options do we have to ensure we can audit the total counts in the graph?
The query fails both from curl and from a SageMaker notebook.
There are a few things you could consider.
The easiest is to just increase the timeout specified in the cluster and/or instance parameter group, so that the query can (hopefully) complete.
If your Neptune engine version is 1.0.5.x then you can use the DFE engine to improve Gremlin count performance. You just need to enable the DFE engine by setting DFEQueryEngine=viaQueryHint in the cluster parameter group.
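For example, a count query opted into the DFE might look like the following with gremlin_python (the endpoint is a placeholder, and I am assuming Neptune#useDFE is the hint that the viaQueryHint setting expects):

from gremlin_python.driver import client

# Placeholder endpoint; assumes DFEQueryEngine=viaQueryHint is set in the parameter
# group and that the Neptune#useDFE hint opts a single query into the DFE engine.
gremlin = client.Client("wss://your-neptune-cluster:8182/gremlin", "g")

result = gremlin.submit("g.with('Neptune#useDFE', true).V().count()").all().result()
print(result)   # e.g. [3000000]
gremlin.close()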
If you get the status of the load, it will show you the number of records processed so far. In this context a record is not a row from a CSV file or RDF format file; it is the count of triples loaded in the RDF case, and the count of property values and labels in the property graph case. As a simple example, imagine a CSV file with 100 rows where each row has 6 columns. Excluding the ID column, that leaves a label and 4 properties per row, so the total number of records to load will be 100 * 5, i.e. 500. If you have sparse rows, the calculation will be approximate unless you add up every non-ID column that has an actual value.
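A quick way to check that from a script or notebook is to call the loader status endpoint directly. A minimal sketch, with the endpoint and load id as placeholders (field names can vary slightly by engine version):

import requests

NEPTUNE_ENDPOINT = "https://your-neptune-cluster:8182"   # placeholder
LOAD_ID = "your-load-id"                                 # returned when the load was started

# Ask the bulk loader for the status of this load job.
resp = requests.get(f"{NEPTUNE_ENDPOINT}/loader/{LOAD_ID}")
resp.raise_for_status()

# overallStatus reports, among other things, how many records have been processed so far.
print(resp.json()["payload"]["overallStatus"])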
If you have the Neptune Streams feature enabled, you can inspect the stream and find the last vertex or edge created. Note that enabling streams just for this purpose may not be the ideal choice, as writing to the stream adds some overhead and will slow the load down.
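If streams are already on, a sketch of peeking at the most recent change record could look like this (the endpoint is a placeholder, and I am assuming the property-graph stream endpoint and the LATEST iterator type here):

import requests

NEPTUNE_ENDPOINT = "https://your-neptune-cluster:8182"   # placeholder

# Fetch just the most recent change record from the property-graph change stream.
resp = requests.get(
    f"{NEPTUNE_ENDPOINT}/propertygraph/stream",
    params={"iteratorType": "LATEST", "limit": 1},
)
resp.raise_for_status()
print(resp.json()["records"][-1])   # last vertex/edge change written to the stream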
Recently we faced an outage due to a 403 rateLimitExceeded error. We are trying to set up an alert using a GCP metric for this error. However, the metrics bigquery.googleapis.com/query/count and bigquery.googleapis.com/job/num_in_flight are not showing the number of running queries correctly. We believe we crossed the threshold of 100 concurrent queries several times over the past few days, but Metrics Explorer shows a maximum of only 5, and only on a few occasions. Do these metrics need any other configuration to show the right number, or should we use some other way to create an alert that fires when we have crossed 80% of the concurrent query limit?
Background
I have a website deployed on multiple machines. I want to create a Google Custom Metric that captures its throughput, i.e. how many calls were served.
The idea was to create a custom metric that collects information about served requests and, once per minute, writes that information to the metric. So on each machine this code runs at most once per minute, but the process happens on every machine in my cluster.
Running the code locally works perfectly.
The problem
I'm getting this error: Grpc.Core.RpcException:
Status(StatusCode=InvalidArgument, Detail="One or more TimeSeries
could not be written: One or more points were written more frequently
than the maximum sampling period configured for the metric. {Metric:
custom.googleapis.com/web/2xx, Timestamps: {Youngest Existing:
'2019/09/28-23:58:59.000', New: '2019/09/28-23:59:02.000'}}:
timeSeries[0]; One or more points were written more frequently than
the maximum sampling period configured for the metric. {Metric:
custom.googleapis.com/web/4xx, Timestamps: {Youngest Existing:
'2019/09/28-23:58:59.000', New: '2019/09/28-23:59:02.000'}}:
timeSeries1")
Then I read in the custom metric limitations that:
Rate at which data can be written to a single time series = one point per minute
I was thinking that Google Cloud Custom Metrics would handle the concurrency issues for me.
According to their limitations, the only option for me to implement real-time monitoring is to put another application in front that collects the information from all machines and writes it to the custom metric. That sounds to me like too much work for a real-world use case.
What am I missing?
If you add the machine name as a label on the metric, each machine writes to its own time series and you get per-machine metrics.
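A minimal sketch of what that looks like in Python (your .NET client has equivalent calls); the metric type is taken from the error above, and the "machine" label name is just illustrative:

import socket
import time
from google.cloud import monitoring_v3

project_id = "your-project-id"      # placeholder
served_2xx = 42                     # whatever your per-minute counter holds

client = monitoring_v3.MetricServiceClient()
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/web/2xx"
series.metric.labels["machine"] = socket.gethostname()   # one time series per machine
series.resource.type = "global"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": served_2xx}})
series.points = [point]

# Each machine now writes at most one point per minute to its *own* time series,
# so the one-point-per-minute limit is respected even across many machines.
client.create_time_series(name=f"projects/{project_id}", time_series=[series])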
To SUM these metrics, go to Stackdriver > Metrics Explorer, group your metrics by project ID or by a label, and then apply the SUM aggregation.
https://cloud.google.com/monitoring/charts/metrics-selector#alignment
You can save the chart in a custom dashboard.
We have been using Couchbase for about two years, but we finally decided to switch to the Amazon DynamoDB service for many reasons.
Now I have started migrating the data to DynamoDB. At first everything was going as expected, but after some time the response time from DynamoDB got higher and higher and the migration process slowed down.
I tried to change my strategies but with no luck.
What can I do to improve the response time?
Basically, I am scanning an SQL table, getting 100 items per query, then asking Couchbase to retrieve the data I want about those 100 items. At first I was getting high response times (as shown in the image below).
The following info might help:
I am running the migration code on an EC2 micro instance running Ubuntu 14.04 with Node.js v4.4.1.
After looking at the graphs I started measuring the time for each DynamoDB request (so I don't know what the average was at first); the average response time is 800 ms across about 150,000 requests (get & put only, no batch commands or queries).
I am storing the items in two tables, one with an integer hash key and the other with an integer hash key plus a sort key.
The second table is a huge one (about 4.4 million items and counting).
Thanks to @Vor for mentioning "hot partition keys". I did some further reading and, following the Guidelines for Working with Tables, I applied what they recommend; distributing my requests into batch requests is what made the difference (roughly as sketched below).
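For anyone hitting the same thing, this is roughly what the batched writes look like (shown in Python with boto3 for illustration; the Node.js SDK has an equivalent BatchWriteItem operation, and the table name and item source are placeholders):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my_migrated_table")     # placeholder

# batch_writer buffers items into BatchWriteItem calls (up to 25 items per request)
# and retries unprocessed items, which means far fewer round trips than one
# PutItem call per item.
with table.batch_writer() as batch:
    for item in items_from_couchbase:           # e.g. the 100-item pages described above
        batch.put_item(Item=item)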
Thank you all for your help.