Timeout when deleting vertices by looping over a DataFrame in AWS Glue

I want to delete vertices by looping over a DataFrame. Suppose I delete each vertex based on some columns of the DataFrame.
My function is written as follows, and it times out:
def delete_vertices_for_label(rows):
    conn = self.remote_connection()
    g = self.traversal_source(conn)
    for row in rows:
        entries = row.asDict()
        create_traversal = __.hasLabel(str(entries["~label"]))
        for key, value in entries.items():
            if key == '~id':
                pass
            elif key == '~label':
                pass
            else:
                create_traversal = create_traversal.has(key, value)
        g.V().coalesce(create_traversal).drop().iterate()
I have used this function successfully against a local TinkerGraph; however, when I run it in a Glue job that manipulates data in AWS Neptune, it fails.
I also created the Lambda function below, and it still times out:
def run_sample_gremlin_basedon_property():
    remoteConn = DriverRemoteConnection('ws://' + CLUSTER_ENDPOINT + ":" +
                                        CLUSTER_PORT + '/gremlin', 'g')
    graph = Graph()
    g = graph.traversal().withRemote(remoteConn)
    create_traversal = __.hasLabel("Media")
    create_traversal.has("Media_ID", "99999")
    create_traversal.has("src_name", "NET")
    print("create_traversal:", create_traversal)
    g.V().coalesce(create_traversal).drop().iterate()

Dropping a vertex involves dropping its associated properties and edges as well, so depending on the data it can take a long time. The drop step was optimized in one of the engine releases [1], so ensure that you are on that version or newer. If you still get timeouts, set an appropriate timeout value on the cluster using the cluster parameter for query timeouts.
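For reference, a minimal sketch of raising that cluster-level timeout (the neptune_query_timeout parameter, in milliseconds), shown here with the AWS SDK for JavaScript v3; the parameter group name and value are placeholders, and the same change can be made from the console or CLI:
import { NeptuneClient, ModifyDBClusterParameterGroupCommand } from '@aws-sdk/client-neptune';

const client = new NeptuneClient({ region: 'us-east-1' });

async function raiseNeptuneQueryTimeout(): Promise<void> {
  // Raise the Gremlin query timeout on the cluster's parameter group.
  await client.send(new ModifyDBClusterParameterGroupCommand({
    DBClusterParameterGroupName: 'my-neptune-cluster-params', // placeholder name
    Parameters: [{
      ParameterName: 'neptune_query_timeout',
      ParameterValue: '120000', // 2 minutes, in milliseconds
      ApplyMethod: 'pending-reboot',
    }],
  }));
}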
Note: This answer is based on EmmaYang's communication with AWS Support. It looks like the Glue job was configured in a manner that needs a high timeout. I'm not familiar enough with Glue to comment further on that (Emma, can you please elaborate?)
[1] https://docs.aws.amazon.com/neptune/latest/userguide/engine-releases-1.0.1.0.200296.0.html

Related

Lambda random long execution while running QLDB query

I have a Lambda triggered by an SQS FIFO queue when there are messages on the queue. Basically this Lambda gets the message from the queue and connects to QLDB through a VPC endpoint in order to run a simple SELECT query and a subsequent INSERT query. The table selected by the query has an index on the field used in the WHERE condition.
Flow (all the services are running "inside" a VPC):
SQS -> Lambda -> VPC interface endpoint -> QLDB
Query SELECT:
SELECT FIELD1, FIELD2 FROM TABLE1 WHERE FIELD3 = "ABCDE"
Query INSERT:
INSERT INTO TABLE1 .....
This Lambda uses a shared connection/session to QLDB, and this is how I'm connecting to it:
import { QldbDriver, RetryConfig } from 'amazon-qldb-driver-nodejs'

let driverQldb: QldbDriver
const ledgerName = 'MyLedger'

export function connectQLDB(): QldbDriver {
  if (!driverQldb) {
    const retryLimit = 4
    const retryConfig = new RetryConfig(retryLimit)
    const maxConcurrentTransactions = 1500
    driverQldb = new QldbDriver(ledgerName, {}, maxConcurrentTransactions, retryConfig)
  }
  return driverQldb
}
When I run a load test that simulates around 200 requests/messages per second to that Lambda over a 15-minute interval, I start seeing random long executions of that Lambda while it runs the queries on QLDB (mainly the SELECT query). Sometimes the same query returns data in around 100 ms and sometimes it takes more than 40 seconds, which results in Lambda timeouts. I have increased the Lambda timeout to 1 minute, but this is not the best approach, and sometimes it is still not enough.
The VPC endpoint metrics are showing around 250 active connections and 1000 new connections during this load test execution. Is there any QLDB metric that could help to identify the root cause of this behavior?
Could it be related to some QLDB limitation (like the 1500 active sessions described here: https://docs.aws.amazon.com/qldb/latest/developerguide/limits.html#limits.default) or something related to concurrency read/write iops?
scodeler, I've read through the Node.js QLDB driver, and I think there's an order-of-operations error. If you provide your own backoff function in the RetryConfig, as in RetryConfig(4, newBackoffFunction), you should see a significant performance improvement in your Lambdas completing.
The driver's default backoff,
const exponentialBackoff: number = Math.min(SLEEP_CAP_MS, Math.pow(SLEEP_BASE_MS * 2, retryAttempt));
which, summarized, returns
return Math.random() * exponentialBackoff;
does not match best-practice exponential backoff with jitter:
const newBackoffFunction: BackoffFunction = (retryAttempt: number, error: Error, transactionId: string) => {
  const exponentialBackoff: number = Math.min(SLEEP_CAP_MS, SLEEP_BASE_MS * Math.pow(2, retryAttempt));
  const jitterRand: number = Math.random();
  const delayTime: number = jitterRand * exponentialBackoff;
  return delayTime;
}
The difference is that the SLEEP_BASE_MS should be multiplied by 2 ^ retryAttempt, and not (SLEEP_BASE_MS x 2) ^ retryAttempt.
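For completeness, a minimal sketch of wiring that custom backoff into the connection helper from the question, assuming the RetryConfig and QldbDriver constructor arguments shown above; the SLEEP_BASE_MS and SLEEP_CAP_MS values here are illustrative, not the driver's own constants:
import { QldbDriver, RetryConfig, BackoffFunction } from 'amazon-qldb-driver-nodejs'

const SLEEP_BASE_MS = 10    // illustrative values; tune for your workload
const SLEEP_CAP_MS = 5000

// Exponential backoff with full jitter, as described above.
const newBackoffFunction: BackoffFunction = (retryAttempt: number, error: Error, transactionId: string) => {
  const exponentialBackoff = Math.min(SLEEP_CAP_MS, SLEEP_BASE_MS * Math.pow(2, retryAttempt))
  return Math.random() * exponentialBackoff
}

let driverQldb: QldbDriver
export function connectQLDB(): QldbDriver {
  if (!driverQldb) {
    const retryConfig = new RetryConfig(4, newBackoffFunction)
    driverQldb = new QldbDriver('MyLedger', {}, 1500, retryConfig)
  }
  return driverQldb
}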
Hope this helps!

Best practice for querying large amounts of data in DynamoDB

I have a scenario: query the list of students in a school, by year, and then use that information to do some other task, let's say printing a certificate for each student.
I'm using the Serverless Framework to handle that scenario with this Lambda:
const queryStudent = async (_school_id, _year) => {
  const params = {
    TableName: `schoolTable`,
    KeyConditionExpression: 'partition_key = :school_id AND begins_with(sort_key, :year)',
    ExpressionAttributeValues: { ':school_id': _school_id, ':year': _year },
  };
  try {
    let _students = [];
    let items;
    do {
      items = await dynamoClient.query(params).promise();
      // Accumulate each page of results.
      _students = _students.concat(items.Items);
      params.ExclusiveStartKey = items.LastEvaluatedKey;
    } while (typeof items.LastEvaluatedKey != 'undefined');
    return _students;
  } catch (e) {
    console.log('Error: ', e);
  }
};
const mainHandler = async (event, context) => {
  …
  let students = await queryStudent(body.school_id, body.year);
  await printCerificate(students);
  …
}
So far, it's working well with about 5k students (just sample data).
My concern: is this a scalable solution for querying large amounts of data in DynamoDB?
As far as I know, Lambda has a limited execution time; if the number of students goes up to a million, does the above solution still work?
Any best-practice approach for this scenario is very much appreciated and welcome.
If you think about scaling, there are multiple potential bottlenecks here, which you could address:
Hot Partition: right now you store all students of a single school in a single item collection. That means they will be stored on a single storage node under the hood. If you run many queries against this, you might run into throughput limitations. You can use techniques like read/write sharding here, e.g. add a suffix to the partition key and do scatter-gather across the shards.
Lambda: Query: If you want to query a million records, this is going to take time. Lambda might not be able to do that (and the processing) within 15 minutes, and if it fails before it's completely through, you lose the information about how far you got. You could do checkpointing for this, i.e. save the LastEvaluatedKey somewhere else, check whether it exists on new Lambda invocations, and start from there (see the sketch after this list).
Lambda: Processing: You seem to be creating a certificate for each student in a year in the same Lambda function that does the querying. This won't scale if it's a synchronous process and you have a million students. If something fails, you also have to consider retries and build that logic into your code.
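A hedged sketch of that checkpointing idea, keeping the query shape from the question but adding a time guard so the LastEvaluatedKey can be handed back (to a Step Functions loop, for example) instead of being lost on timeout; the 30-second margin and where you persist the key are assumptions:
import { DynamoDB } from 'aws-sdk';
import { Context } from 'aws-lambda';

const dynamoClient = new DynamoDB.DocumentClient();

const queryStudentsWithCheckpoint = async (
  schoolId: string,
  year: string,
  context: Context,
  startKey?: DynamoDB.DocumentClient.Key
) => {
  const students: DynamoDB.DocumentClient.ItemList = [];
  const params: DynamoDB.DocumentClient.QueryInput = {
    TableName: 'schoolTable',
    KeyConditionExpression: 'partition_key = :school_id AND begins_with(sort_key, :year)',
    ExpressionAttributeValues: { ':school_id': schoolId, ':year': year },
    ExclusiveStartKey: startKey,
  };
  let lastEvaluatedKey: DynamoDB.DocumentClient.Key | undefined;
  do {
    const page = await dynamoClient.query(params).promise();
    students.push(...(page.Items ?? []));
    lastEvaluatedKey = page.LastEvaluatedKey;
    params.ExclusiveStartKey = lastEvaluatedKey;
    // Stop early while there is still time left; the caller persists the
    // checkpoint and re-invokes with it as startKey.
    if (context.getRemainingTimeInMillis() < 30000) {
      break;
    }
  } while (lastEvaluatedKey);
  return { students, lastEvaluatedKey };
};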
If you want this to scale to a million students per school, I'd probably change the architecture to something like this:
You have a Step Function that you invoke when you want to print the certificates. This Step Function has a single Lambda function. The Lambda function queries the table across the sharded partition keys and writes each student into an SQS queue of certificate-printing tasks. If the Lambda notices it's close to the runtime limit, it returns the LastEvaluatedKey, and the Step Function recognizes that and starts the function again with this offset. The SQS queue can invoke Lambda functions to actually create the certificates, possibly in batches.
This way you decouple query from processing and also have built-in retry logic for failed tasks in the form of the SQS/Lambda integration. You also include the checkpointing for the query across many items.
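A hedged sketch of the fan-out side of that design, pushing students onto the certificate queue in SendMessageBatch groups of 10; the queue URL and message shape are placeholders:
import { SQS } from 'aws-sdk';

const sqs = new SQS();
// Placeholder URL for the certificate-printing task queue.
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/certificate-tasks';

const enqueueCertificateTasks = async (students: Array<Record<string, any>>) => {
  // SendMessageBatch accepts at most 10 entries per call.
  for (let i = 0; i < students.length; i += 10) {
    const batch = students.slice(i, i + 10);
    await sqs.sendMessageBatch({
      QueueUrl: QUEUE_URL,
      Entries: batch.map((student, idx) => ({
        Id: String(i + idx),
        MessageBody: JSON.stringify(student),
      })),
    }).promise();
  }
};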
Implementing this requires more effort, so I'd first figure out if a million students per school per year is a realistic number :-)

Power Query | Loop with delay

I'm new to Power Query and trying to do the following:
Get updates from a server.
Transform them.
Post the data back.
While the code works just fine, I'd like it to run every N minutes until the application closes.
Also, the LastMessageId variable should be re-evaluated after each call to GetUpdates(), and I need to somehow call GetUpdates() again with it.
I've tried Function.InvokeAfter but didn't figure out how to run it more than once.
Recursion blows the stack, of course.
The only solution I see is to use List.Generate, but I'm struggling to understand how it can be used with a delay.
let
    // Get list of records
    GetUpdates = (optional offset as number) as list => 1,
    Updates = GetUpdates(),
    // Store last update_id
    LastMessageId = List.Last(Updates)[update_id],
    // Prepare and respond (body omitted in the question)
    Process = (item as record) as record =>
    // Map the Process function to each item in the list of records
    Map = List.Transform(Updates, each Process(_))
in
    Map
Power BI does not support continuous automatic re-loading of data in the Desktop.
Online, you can enforce a refresh as fast as every 15 minutes using DirectQuery [1].
Alternative methods:
You could do this in Excel and use VBA to re-execute the query on a schedule.
Streaming data in Power BI [2]
Streaming data with Flow and Power BI [3]
[1] Supported DirectQuery Sources
[2] Realtime Streaming in PowerBI
[3] Streaming data with Flow
[4] Don't forget to enable historic logging!

redis.py: How to flush all the queries in a pipeline

I have a redis pipeline, say:
r = redis.Redis(...).pipeline()
Suppose I need to remove any residual queries, if present, from the pipeline without executing them. Is there anything like r.clear()?
I have searched the docs and the source code and am unable to find anything.
The command list is simply a Python list object. You can inspect it like so:
from redis import StrictRedis
r = StrictRedis()
pipe = r.pipeline()
pipe.set('KEY1', 1)
pipe.set('KEY2', 2)
pipe.set('KEY3', 3)
pipe.command_stack
[(('SET', 'KEY1', 1), {}), (('SET', 'KEY2', 2), {}), (('SET', 'KEY3', 3), {})]
This has not yet been sent to the server so you can just pop() or remove the commands you don't want. You can also just assign an empty list, pipe.command_stack = [].
If there is a lot, you could simply re-assign a new pipeline object to pipe.
Hope this is what you meant.
Cheers
Joe
Use:
pipe.reset()
Besides the obvious advantage of not relying on implementation details (such as the command_stack mentioned before), this method will take care of interrupting the current ongoing transaction (if any) and returning the connection to the pool.

Bing Maps REST Services: multiple locations for geocoding

I have an ASP application which retrieves a set of locations from a data source, uses the Bing Maps REST services to geocode the addresses, and then displays them in a table and on a map in pages of 10 results at a time.
Currently, the application processes the locations sequentially...
var geocodeRequest = "http://ecn.dev.virtualearth.net/REST/v1/Locations/" + fullAddress.replace('&', ' ').replace(',', ' ') + "?output=json&jsonp=GeocodeCallback&key=" + getCredentials;
CallRestService(geocodeRequest);
......
function GeocodeCallback(result) {
    if (result &&
        result.resourceSets &&
        result.resourceSets.length > 0 &&
        result.resourceSets[0].resources &&
        result.resourceSets[0].resources.length > 0) {
        // Set the map view using the returned bounding box
        var bbox = result.resourceSets[0].resources[0].bbox;
        var viewBoundaries = MM.LocationRect.fromLocations(new MM.Location(bbox[0], bbox[1]), new MM.Location(bbox[2], bbox[3]));
        map.setView({ bounds: viewBoundaries });
        // Add a pushpin at the found location
        MM.Location.prototype.locID = null;
        var location = new MM.Location(result.resourceSets[0].resources[0].point.coordinates[0], result.resourceSets[0].resources[0].point.coordinates[1]);
        location.locID = tableRowIndex;
        locs.push(location);
        .....
Is there any way to speed this up by passing 10 locations in one call and then processing result.resourceSets[0], result.resourceSets[1], etc.?
How would multiple addresses be passed into the REST services call (comma-delimited?)
Thanks
Bing has two REST-accessible geocoding APIs. One of them is the one you're using, which only supports one address at a time, and the other is the Dataflow API, which is designed for high-volume batch processing. Neither really seems right for you as your system is currently designed.
Depending on where you're getting your street addresses from (all you mention is 'a datasource'), you might be able to do a big-batch geocode for all the locations in your datasource - move the geocoding from request time to a batch process, and just use the request-time geocoding for the ones the batch process hasn't gotten to yet.
There is no way of doing this as things stand right now. It has been proposed to support this natively in JavaScript (I think), but I do not think it has been implemented yet. If you want some concurrency, you could look at web workers:
http://en.wikipedia.org/wiki/Web_Workers
https://developer.mozilla.org/En/Using_web_workers
But these are not supported in IE yet. Maybe you could check out HTML5 async; I do not know whether it can be used on the script element that is created when you call the REST services.
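For what it's worth, a hedged sketch of the concurrency idea in a modern setting, issuing several geocode requests at once with fetch and Promise.all instead of the JSONP callback above; the key and address list are placeholders, and the endpoint is the one from the question:
const BING_MAPS_KEY = 'YOUR_KEY'; // placeholder

async function geocodeBatch(addresses: string[]): Promise<any[]> {
  // Fire one request per address concurrently rather than waiting for each
  // response before starting the next call.
  const requests = addresses.map(address => {
    const url = 'http://ecn.dev.virtualearth.net/REST/v1/Locations/'
      + encodeURIComponent(address.replace(/&/g, ' ').replace(/,/g, ' '))
      + '?output=json&key=' + BING_MAPS_KEY;
    return fetch(url).then(res => res.json());
  });
  // Each resolved value has the same shape GeocodeCallback handles above
  // (result.resourceSets[0].resources[0]...).
  return Promise.all(requests);
}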