Lambda random long execution while running QLDB query - amazon-web-services

I have a lambda triggered by an SQS FIFO queue whenever there are messages on that queue. Basically, this lambda gets the message from the queue and connects to QLDB through a VPC endpoint in order to run a simple SELECT query and a subsequent INSERT query. The table targeted by the query has an index on the field used in the WHERE condition.
Flow (all the services are running "inside" a VPC):
SQS -> Lambda -> VPC interface endpoint -> QLDB
Query SELECT:
SELECT FIELD1, FIELD2 FROM TABLE1 WHERE FIELD3 = "ABCDE"
Query INSERT:
INSERT INTO TABLE1 .....
This lambda uses a shared connection/session on QLDB, and this is how I'm connecting to it:
import { QldbDriver, RetryConfig } from 'amazon-qldb-driver-nodejs'

let driverQldb: QldbDriver
const ledgerName = 'MyLedger'

export function connectQLDB(): QldbDriver {
    if (!driverQldb) {
        const retryLimit = 4
        const retryConfig = new RetryConfig(retryLimit)
        const maxConcurrentTransactions = 1500
        driverQldb = new QldbDriver(ledgerName, {}, maxConcurrentTransactions, retryConfig)
    }
    return driverQldb
}
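For reference, the queries run through this shared driver roughly like the sketch below (the parameter value and inserted document shape are illustrative, not my exact code):

export async function runQueries(): Promise<void> {
    const driver = connectQLDB()
    // Both statements run inside one optimistic transaction; the driver
    // retries the whole lambda on OCC conflicts according to its RetryConfig.
    await driver.executeLambda(async (txn) => {
        const selectResult = await txn.execute(
            'SELECT FIELD1, FIELD2 FROM TABLE1 WHERE FIELD3 = ?', 'ABCDE'
        )
        const rows = selectResult.getResultList()
        await txn.execute('INSERT INTO TABLE1 ?', {
            FIELD1: 'value1',
            FIELD2: 'value2',
            FIELD3: 'ABCDE'
        })
    })
}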
When I run a load test that simulates around 200 requests/messages per second to that lambda over a 15-minute interval, I start seeing random long executions for the lambda while running the queries on QLDB (mainly the SELECT query). Sometimes the same query returns data in around 100 ms, and sometimes it takes more than 40 seconds, which results in lambda timeouts. I have raised the lambda timeout to 1 minute, but this is not the best approach, and sometimes it is not enough either.
The VPC endpoint metrics are showing around 250 active connections and 1000 new connections during this load test execution. Is there any QLDB metric that could help to identify the root cause of this behavior?
Could it be related to some QLDB limitation (like the 1500 active sessions described here: https://docs.aws.amazon.com/qldb/latest/developerguide/limits.html#limits.default) or something related to concurrency read/write iops?

scodeler, I've read through the Node.js QLDB driver, and I think there's an order-of-operations error. If you provide your own backoff function in the RetryConfig, as in RetryConfig(4, newBackoffFunction), you should see a significant improvement in your lambdas completing.
The driver's default backoff,
const exponentialBackoff: number = Math.min(SLEEP_CAP_MS, Math.pow(SLEEP_BASE_MS * 2, retryAttempt));
which, summarized, returns
return Math.random() * exponentialBackoff;
does not match best-practice jitter functions:
const newBackoffFunction: BackoffFunction = (retryAttempt: number, error: Error, transactionId: string) => {
    const exponentialBackoff: number = Math.min(SLEEP_CAP_MS, SLEEP_BASE_MS * Math.pow(2, retryAttempt));
    const jitterRand: number = Math.random();
    const delayTime: number = jitterRand * exponentialBackoff;
    return delayTime;
}
The difference is that the SLEEP_BASE_MS should be multiplied by 2 ^ retryAttempt, and not (SLEEP_BASE_MS x 2) ^ retryAttempt.
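For completeness, here is a sketch of how that custom backoff plugs into the question's connectQLDB(); the SLEEP_CAP_MS and SLEEP_BASE_MS constants here are assumed stand-ins for the driver's internal defaults, not values confirmed from its source:

const SLEEP_CAP_MS = 5000 // assumed cap, in ms
const SLEEP_BASE_MS = 10  // assumed base, in ms

// newBackoffFunction as defined above
const retryConfig = new RetryConfig(4, newBackoffFunction)
const driverQldb = new QldbDriver('MyLedger', {}, 1500, retryConfig)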
Hope this helps!

Related

Process 10 req/s and save to cloud storage - recommended method?

I have 10 requests per second of data I want to save that looks like the entry below. I need to save this data after a CloudRun function completes. (My infrastructure is on google-cloud-platform). The data will be used as a data set for machine learning.
{
    "text": "1k characters",
    "text2": "1k characters",
    "metadata1": "enum (100 vals)",
    "metadata2": "number value"
}
I planned to save this as a non-awaited function to google-cloud-storage either in one folder or in folders based on the metadata1 enum. Is either better than the other?
Is this the appropriate route to take?
I think pubsub is overkill as suggested in this SO answer.
I can propose 2 patterns, but in both cases you need to store the messages:
Either use PubSub to stack the messages. Then use Dataflow to read PubSub and sink to Cloud Storage, or use an on-demand service (Cloud Run, for example) to pull your PubSub subscription and write a file with all the messages read (you can trigger your Cloud Run with Cloud Scheduler, every hour for example).
Or store the messages in BigQuery, and then perform a query export to GCS regularly (again with Cloud Scheduler + Cloud Functions/Run). It's my preferred solution, because maybe one day you will have to process your messages differently, or get metrics/perform analytics on them.
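A minimal sketch of that second pattern in Node.js, assuming the messages are already landing in BigQuery (the dataset, table, and bucket names are illustrative):

const { BigQuery } = require('@google-cloud/bigquery');
const { Storage } = require('@google-cloud/storage');

const bigquery = new BigQuery();
const storage = new Storage();

// Scheduled (e.g. via Cloud Scheduler) function that exports the
// accumulated messages to a GCS file as newline-delimited JSON
exports.exportMessages = async () => {
    const destination = storage
        .bucket('my-ml-dataset') // hypothetical bucket
        .file(`exports/messages-${Date.now()}.json`);

    await bigquery
        .dataset('messages')     // hypothetical dataset
        .table('raw_messages')   // hypothetical table
        .extract(destination, { format: 'JSON' });

    console.log('Export complete');
};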
@guillaume's answer is definitely the best one, but for ease of implementation, I decided to just save them directly to GCS.
// "enum" is a reserved word in JavaScript, so that field is renamed here
const saveData = async ({ text, text2, enumVal, number }) => {
    try {
        const timestamp = new Date().getTime()
        const folder = enumVal
        const fileName = `${folder}/${enumVal}-${timestamp}.json`
        const file = bucket.file(fileName)
        const contents = JSON.stringify({ text, text2, enumVal, number })
        return file.save(contents)
    } catch (e) {
        console.log(`Failed to save file, ${e.message}`)
    }
}
It added some latency, but overall I estimated the cost at about $10 a month in server costs, compared to the PubSub method, which came out around $50-100 a month when I tried to estimate it (or more; it was hard to determine, and it did assume each message is billed as 1MB even when it's under 1MB).
The BigQuery method Guillaume provided appeared to have no cost, since 1TB of transferred data is free each month. I could be wrong on this. I may switch to it later on.

How to query big data in DynamoDB in best practice

I have a scenario: query the list of students in a school, by year, and then use that information to do some other tasks, let's say printing a certificate for each student.
I'm using the serverless framework to deal with that scenario with this Lambda:
const queryStudent = async (_school_id, _year) => {
    const params = {
        TableName: `schoolTable`,
        // Bind the values via ExpressionAttributeValues; the expression
        // cannot reference JavaScript variables directly
        KeyConditionExpression: 'partition_key = :school_id AND begins_with(sort_key, :year)',
        ExpressionAttributeValues: {
            ':school_id': _school_id,
            ':year': _year,
        },
    };
    try {
        let _students = [];
        let items;
        do {
            items = await dynamoClient.query(params).promise();
            // Accumulate pages instead of overwriting the previous one
            _students = _students.concat(items.Items);
            params.ExclusiveStartKey = items.LastEvaluatedKey;
        } while (typeof items.LastEvaluatedKey !== 'undefined');
        return _students;
    } catch (e) {
        console.log('Error: ', e);
    }
};
const mainHandler = async (event, context) => {
    …
    let students = await queryStudent(body.school_id, body.year);
    await printCerificate(students)
    …
}
So far, it’s working well with about 5k students (just sample data)
My concern: is this a scalable solution for querying large amounts of data in DynamoDB?
As far as I know, Lambda has a limited execution time; if the number of students goes up to a million, does the above solution still work?
Any best practice approach for this scenario is very appreciated and welcome.
If you think about scaling, there are multiple potential bottlenecks here, which you could address:
Hot Partition: right now you store all students of a single school in a single item collection. That means that they will be stored on a single storage node under the hood. If you run many queries against this, you might run into throughput limitations. You can use things like read/write sharding here, e.g. add a suffix to the partition key and do scatter-gather with the data.
Lambda: Query: If you want to query a million records, this is going to take time. Lambda might not be able to do that (and the processing) in 15 minutes, and if it fails before it's completely through, you lose the information about how far you've gotten. You could do checkpointing for this, i.e. save the LastEvaluatedKey somewhere else, check whether it exists on new Lambda invocations, and start from there.
Lambda: Processing: You seem to be creating a certificate for each student in a year in the same Lambda function that does the querying. This is a solution that won't scale if it's a synchronous process and you have a million students. If stuff fails, you also have to consider retries and build that logic into your code.
If you want this to scale to a million students per school, I'd probably change the architecture to something like this:
You have a Step Function that you invoke when you want to print the certificates. This Step Function has a single Lambda function. The Lambda function queries the table across the sharded partition keys and writes each student into an SQS queue of certificate-printing tasks. If the Lambda notices it's close to the runtime limit, it returns the LastEvaluatedKey, and the Step Function recognizes that and starts the function again with this offset. The SQS queue can invoke Lambda functions to actually create the certificates, possibly in batches.
This way you decouple query from processing and also have built-in retry logic for failed tasks in the form of the SQS/Lambda integration. You also include the checkpointing for the query across many items.
Implementing this requires more effort, so I'd first figure out, if a million students per school per year is a realistic number :-)
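A rough sketch of what that query-and-checkpoint function could look like; the queue URL, table/key names, and the 30-second safety margin are illustrative assumptions, not details from the question:

const AWS = require('aws-sdk');
const dynamoClient = new AWS.DynamoDB.DocumentClient();
const sqs = new AWS.SQS();

exports.handler = async (event, context) => {
    let lastKey = event.lastEvaluatedKey; // handed back in by the Step Function on re-entry
    do {
        const page = await dynamoClient.query({
            TableName: 'schoolTable',
            KeyConditionExpression: 'partition_key = :school_id AND begins_with(sort_key, :year)',
            ExpressionAttributeValues: { ':school_id': event.school_id, ':year': event.year },
            ExclusiveStartKey: lastKey,
        }).promise();

        // Fan each student out to the certificate-printing queue
        for (const student of page.Items) {
            await sqs.sendMessage({
                QueueUrl: process.env.STUDENT_QUEUE_URL, // hypothetical queue
                MessageBody: JSON.stringify(student),
            }).promise();
        }
        lastKey = page.LastEvaluatedKey;

        // Checkpoint: return the cursor before hitting the Lambda time limit,
        // so the Step Function can re-invoke the function with it
        if (lastKey && context.getRemainingTimeInMillis() < 30000) {
            return { done: false, lastEvaluatedKey: lastKey };
        }
    } while (lastKey);
    return { done: true };
};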

Lambda SQL Server RDS Connection Leak

Problem
I'm using mssql v6.2.0 in a Lambda that is invoked frequently (consistently ~25 concurrent invocations under standard load).
I seem to be having trouble with connection pooling or something because I keep having tons of open DB connections which overwhelm my database (SQL Server on RDS) causing the Lambdas to just time out waiting for query results.
I have read the docs, various similar questions, Github issues, etc. but nothing has worked for this particular issue.
Things I've Learned Already
I did learn that pooling is possible across invocations due to the fact that variables outside the handler function are shared across invocations in the same container. This makes me think I should see just a few connections for each container running my Lambda, but I don't know how many that is so it's hard to verify. Bottom line is that pooling should keep me from having tons and tons of open connections, so something isn't working right.
There are several different ways to use mssql and I have tried several of them. Notably I've tried specifying max pool size with both large and small values but got the same results.
AWS recommends that you check to see if there's already a pool before trying to create a new one. I tried that to no avail. It was something like pool = pool || await createPool() (see the sketch at the end of this list).
I know that RDS Proxy exists to help with situations like this, but it appears it isn't offered (at this time) for SQL Server instances.
I do have the ability to slow down my data a bit, but this has a slight impact on the performance of the product as a whole, so I don't want to do that just to avoid solving a DB connections issue.
Left unchecked, I saw as many as 700 connections to the DB at once, leading me to think there's a leak of some kind and it's maybe not just a reasonable result of high usage.
I didn't find a way to shorten the TTL for connections on the SQL Server side as recommended by this re:Invent slide. Perhaps that is part of the answer?
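For reference, that "reuse the pool if it exists" pattern is usually written by caching the connect promise in module scope, so concurrent invocations in a warm container share one pool instead of racing to create several. A rough sketch with mssql (the input and query are illustrative):

const sql = require('mssql');

// Cache the promise, not the pool: concurrent invocations in a warm
// container all await the same in-flight connection attempt.
let poolPromise;
function getPool() {
    if (!poolPromise) {
        poolPromise = new sql.ConnectionPool(process.env.connectionString).connect();
    }
    return poolPromise;
}

exports.handler = async (event) => {
    const pool = await getPool();
    const result = await pool.request()
        .input('MyCol', sql.TYPES.VarChar, event.value)        // illustrative input
        .query('SELECT * FROM MyTable WHERE MyCol = @MyCol');  // illustrative query
    return result.recordset;
};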
Code
'use strict';

/* Dependencies */
const sql = require('mssql');
const fs = require('fs').promises;
const path = require('path');
const AWS = require('aws-sdk');
const GeoJSON = require('geojson');

AWS.config.update({ region: 'us-east-1' });
var iotdata = new AWS.IotData({ endpoint: process.env['IotEndpoint'] });

/* Export */
exports.handler = async function (event) {
    let myVal = event.Records[0].Sns.Message;

    // Gather prerequisites in parallel
    let [
        query1,
        query2,
        pool
    ] = await Promise.all([
        fs.readFile(path.join(__dirname, 'query1.sql'), 'utf8'),
        fs.readFile(path.join(__dirname, 'query2.sql'), 'utf8'),
        sql.connect(process.env['connectionString'])
    ]);

    // Query DB for updated data
    let results = await pool.request()
        .input('MyCol', sql.TYPES.VarChar, myVal)
        .query(query1);

    // Prepare IoT Core message
    let params = {
        topic: `${process.env['MyTopic']}/${results.recordset[0].TopicName}`,
        payload: convertToGeoJsonString(results.recordset),
        qos: 0
    };

    // Publish results to MQTT topic
    try {
        await iotdata.publish(params).promise();
        console.log(`Successfully published update for ${myVal}`);

        // Query 2
        await pool.request()
            .input('MyCol1', sql.TYPES.Float, results.recordset[0]['Foo'])
            .input('MyCol2', sql.TYPES.Float, results.recordset[0]['Bar'])
            .input('MyCol3', sql.TYPES.VarChar, results.recordset[0]['Baz'])
            .query(query2);
    } catch (err) {
        console.log(err);
    }
};

/**
 * Convert query results to GeoJSON for API response
 * @param {Array|Object} data - The query results
 */
function convertToGeoJsonString(data) {
    let result = GeoJSON.parse(data, { Point: ['Latitude', 'Longitude'] });
    return JSON.stringify(result);
}
Question
Please help me understand why I'm getting runaway connections and how to fix it. For bonus points: what's the ideal strategy for handling high DB concurrency on Lambda?
Ultimately this service needs to handle several times the current load -- I realize this becomes a quite intense load. I'm open to options like read replicas or other read-performance-boosting measures as long as they're compatible with SQL Server, and they're not just a cop out for writing proper DB access code.
Please let me know if I can improve the question. I know there are similar ones out there but I have read/tried a lot of them and didn't find them to help. Thanks in advance!
Related Material
https://forums.aws.amazon.com/thread.jspa?messageID=678029 (old, but similar)
https://www.slideshare.net/AmazonWebServices/best-practices-for-using-aws-lambda-with-rdsrdbms-solutions-srv320 re:Invent slide deck
https://www.jeremydaly.com/reuse-database-connections-aws-lambda/ Relevant info but for MySQL instead of SQL Server
Answer
I finally found the answer after 4 days of effort. All I needed to do was scale up the DB. The code is actually fine as-is.
I went from db.t2.micro to db.t3.small (or 1 vCPU, 1GB RAM to 2 vCPU and 2GB RAM) at a net cost of roughly $15/mo.
Theory
In my case, the DB probably couldn't handle the processing (which involves several geographic calculations) for all my invocations at once. I did see CPU go up, but I assumed that was a result of the high number of open connections. When the queries slowed down, concurrent invocations piled up as Lambdas waited for results, finally causing them to time out and not close their connections properly.
Comparisons:
db.t2.micro:
200+ DB connections (climbing continuously if left running)
50+ concurrent invocations
5000+ ms Lambda duration when things slow down, ~300 ms under no load
db.t3.small:
25-35 DB connections (steady)
~5 concurrent invocations
~33 ms Lambda duration <-- ten times faster!
[CloudWatch dashboard screenshot]
Summary
I think this issue was confusing to me because it didn't smell like a capacity issue. Almost every time I've dealt with high DB connections in the past, it has been a code error. Having tried options there, I thought it was "some magical gotcha of serverless" that I needed to understand. In the end it was as simple as changing DB tiers. My takeaway is that DB capacity issues can manifest themselves in ways other than high CPU and memory usage, and that high connections may be a result of something besides a code bug.
Update (4 months in)
This continues to work very well. I'm impressed that doubling the DB resources seems to have given more than 2x performance. Now, when the DB connections get really high due to load (or a temporary bug during development), even over 1k, the DB handles it. I'm not seeing any issues at all with DB connections timing out or with the database getting bogged down under load. Since the original time of writing I've added several CPU-intensive queries to support reporting workloads, and it continues to handle all these loads simultaneously.
We've also deployed this setup to production for one customer since the time of writing and it handles that workload without issue.
So a connection pool is no good on Lambda at all; what you can do is reuse connections.
The trouble is that every Lambda execution opens a pool, which will just flood the DB like you're seeing. You want one connection per Lambda container, and you can use a DB class like the one below (this is rough, but let me know if you've got questions).
import mysql from 'mysql2/promise' // assumed client; the original omitted its import

export default class MySQL {
    constructor() {
        this.connection = null
    }

    async getConnection() {
        if (this.connection === null || this.connection.state === 'disconnected') {
            return this.createConnection()
        }
        return this.connection
    }

    async createConnection() {
        this.connection = await mysql.createConnection({
            host: process.env.dbHost,
            user: process.env.dbUser,
            password: process.env.dbPassword,
            database: process.env.database,
        })
        return this.connection
    }

    async query(sql, params) {
        await this.getConnection()
        // The semicolon before the destructuring assignment matters: without it,
        // the bracket would be parsed as an index into the previous declaration.
        let err, rows;
        [err, rows] = await to(this.connection.query(sql, params))
        if (err) {
            console.log(err)
            return false
        }
        return rows
    }
}

// Helper that resolves a promise into an [error, data] tuple
function to(promise) {
    return promise
        .then((data) => [null, data])
        .catch((err) => [err])
}
What you need to understand is that a Lambda execution is a little virtual machine that does a task and then stops. It does sit there for a while, and if anyone else needs it, it gets reused along with the container and its connection; for a single task there are never multiple connections from a single Lambda.
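A hypothetical usage sketch, instantiating the class once in module scope so a warm container reuses the connection across invocations (the table and event fields are illustrative):

const db = new MySQL()

exports.handler = async (event) => {
    // One connection per container; warm invocations skip createConnection()
    const rows = await db.query('SELECT * FROM users WHERE id = ?', [event.userId])
    return rows
}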
Hope this helps! Let me know if you need any more detail. Oh, and welcome to Stack Overflow; that's a well-constructed question.

Power Query | Loop with delay

I'm new to PQ and trying to do the following:
Get updates from server
Transform it.
Post data back.
While the code works just fine, I'd like it to run every N minutes until the application closes.
Also, the LastMessageId variable should be re-evaluated after each call of GetUpdates(), and I need to somehow call GetUpdates() again with it.
I've tried Function.InvokeAfter but didn't figure out how to run it more than once.
Recursion blows the stack, of course.
The only solution I see is List.Generate, but I struggle to understand how it can be used with a delay.
let
    // Get list of records (stub implementation; the real one calls the server)
    GetUpdates = (optional offset as number) as list => {1},
    Updates = GetUpdates(),
    // Store last update_id
    LastMessageId = List.Last(Updates)[update_id],
    // Prepare and post the response (body elided in the original)
    Process = (item as record) as record => item,
    // Map the Process function to each item in the list of records
    Map = List.Transform(Updates, each Process(_))
in
    Map
PowerBI does not support continuous automatic re-loading of data in the desktop.
Online, you can enforce a refresh as often as every 15 minutes using DirectQuery [1].
Alternative methods:
You could do this in Excel and use VBA to re-execute the query on a schedule
Streaming data in PowerBI [2]
Streaming data with Flow and PowerBI [3]
[1]: Supported DirectQuery Sources
[2]: Realtime Streaming in PowerBI
[3]: Streaming data with Flow
[4]: Don't forget to enable historic logging!

Timeout when deleting vertices by looping over a DataFrame in Glue

I want to delete vertices by looping over a DataFrame.
Suppose I delete vertices based on some columns of the DataFrame.
My function is written this way, and it times out:
def delete_vertices_for_label(self, rows):  # self added; the original referenced self without declaring it
    conn = self.remote_connection()
    g = self.traversal_source(conn)
    for row in rows:
        entries = row.asDict()
        create_traversal = __.hasLabel(str(entries["~label"]))
        for key, value in entries.items():  # iteritems() in the original is Python 2
            if key in ("~id", "~label"):
                continue
            create_traversal = create_traversal.has(key, value)
        g.V().coalesce(create_traversal).drop().iterate()
I have succeeded in using this function locally on TinkerGraph; however, when I try to run the above function in Glue, which manipulates data in AWS Neptune, it fails.
I also created the Lambda function below and still hit the same timeout issue.
def run_sample_gremlin_basedon_property():
    remoteConn = DriverRemoteConnection('ws://' + CLUSTER_ENDPOINT + ':' +
                                        CLUSTER_PORT + '/gremlin', 'g')
    graph = Graph()
    g = graph.traversal().withRemote(remoteConn)
    create_traversal = __.hasLabel("Media")
    create_traversal.has("Media_ID", "99999")
    create_traversal.has("src_name", "NET")
    print("create_traversal:", create_traversal)
    g.V().coalesce(create_traversal).drop().iterate()
Dropping a vertex involves dropping its associated properties and edges as well, so depending on the data, it can take a large amount of time. The drop step was optimized in one of the engine releases [1], so ensure that you are on a version newer than that. If you still get timeouts, set an appropriate timeout value on the cluster using the cluster parameter for timeouts (neptune_query_timeout).
Note: This answer is based on EmmaYang's communication with AWS Support. It looks like the Glue job was configured in a manner that needs a high timeout. I'm not familiar enough with Glue to comment more on that (Emma, can you please elaborate?).
[1] https://docs.aws.amazon.com/neptune/latest/userguide/engine-releases-1.0.1.0.200296.0.html