How to query big data in DynamoDB in best practice - amazon-web-services

I have a scenario: query the list of student in school, by year, and then use that information to do some other tasks, let say printing a certificate for each student
I'm using the serverless framework to deal with that scenario with this Lambda:
const queryStudent = async (_school_id, _year) => {
var params = {
TableName: `schoolTable`,
KeyConditionExpression: 'partition_key = _school_id AND begins_with(sort_key, _year)',
};
try {
let _students = [];
let items;
do {
items = await dynamoClient.query(params).promise();
_students = items.Items;
params.ExclusiveStartKey = items.LastEvaluatedKey;
} while (typeof items.LastEvaluatedKey != 'undefined');
return _students;
} catch (e) {
console.log('Error: ', e);
}
};
const mainHandler = async (event, context) => {
…
let students = await queryStudent(body.school_id, body.year);
await printCerificate(students)
…
}
So far, it’s working well with about 5k students (just sample data)
My concern: is it a scalable solution to query large data in DynamoDB?
As I know, Lambda has limited time execution, if the number of student goes up to a million, does the above solution still work?
Any best practice approach for this scenario is very appreciated and welcome.

If you think about scaling, there are multiple potential bottlenecks here, which you could address:
Hot Partition: right now you store all students of a single school in a single item collection. That means that they will be stored on a single storage node under the hood. If you run many queries against this, you might run into throughput limitations. You can use things like read/write sharding here, e.g. add a suffix to the partition key and do scatter-gatter with the data.
Lambda: Query: If you want to query a million records, this is going to take time. Lambda might not be able to do that (and the processing) in 15 minutes and if it fails before it's completely through, you lose the information how far you've come. You could do checkpointing for this, i.e. save the LastEvaluatedKey somewhere else and check if it exists on new Lambda invocations and start from there.
Lambda: Processing: You seem to be creating a certificate for each student in a year in the same Lambda function you do the querying. This is a solution that won't scale if it's a synchronous process and you have a million students. If stuff fails, you also have to consider retries and build that logic in your code.
If you want this to scale to a million students per school, I'd probably change the architecture to something like this:
You have a Step Function that you invoke when you want to print the certificates. This step function has a single Lambda function. The Lambda function queries the table across sharded partition keys and writes each student into an SQS queue for certificate-printing tasks. If Lambda notices, it's close to the runtime limit, it returns the LastEvaluatedKey and the step function recognizes thas and starts the function again with this offset. The SQS queue can invoke Lambda functions to actually create the certificates, possibly in batches.
This way you decouple query from processing and also have built-in retry logic for failed tasks in the form of the SQS/Lambda integration. You also include the checkpointing for the query across many items.
Implementing this requires more effort, so I'd first figure out, if a million students per school per year is a realistic number :-)

Related

Process 10req/s and save to cloud storage - recommended method?

I have 10 requests per second of data I want to save that looks like the entry below. I need to save this data after a CloudRun function completes. (My infrastructure is on google-cloud-platform). The data will be used as a data set for machine learning.
{
"text": "1k characters",
"text2": "1k characters",
"metadata1": "enum (100 vals)",
"metadata2": "number value"
}
I planned to save this as a non-awaited function to google-cloud-storage either in one folder or in folders based on the metadata1 enum. Is either better than the other?
Is this the appropriate route to take?
I think pubsub is overkill as suggested in this SO answer.
I can propose you 2 patterns, but in both case you need to store the messages:
Either use PubSub to stack the messages. Then, use Dataflow to read pubsub and to sink to Cloud Storage. Or use a on demand service (Cloud Run for exemple) to pull your PubSub subscription and write a file with all the message read (You can trigger your Cloud Run with Cloud Scheduler, every hour for example)
Or store the message in BigQuery, and then perform query export to GCS regularly (again with a Cloud Scheduler + Cloud Functions/Run). It's my preferred solution, because, maybe a day, you will have to process differently your message, and to get metrics/perform analytics on them.
#guillaume's answer is definitely the best one, but for ease of implement-ability, I decided to just save them directly to GCS.
const saveData = async ({ text, text2, enum, number }) => {
try {
const timestamp = new Date().getTime()
const folder = enum
const fileName = `${folder}/${enum}-${timestamp}.json`
const file = bucket.file(fileName)
const contents = JSON.stringify({ text, text2, enum, number })
return file.save(contents)
}
} catch (e) {
console.log(`Failed to save file, ${e.message}`)
}
}
It added some latency, but overall I estimated the cost to be about $10 in server costs a month as compared to pubsub method which when trying to determine the cost, put it around $50-100 bucks a month (or more, was hard to determine. But it did assume that each message is 1MB if it's under 1MB).
The big query method Guillaume provided appeared to have no cost since 1TB of transferred data is free each month. I could be wrong on this. I may switch to this later on.

Lambda SQL Server RDS Connection Leak

Problem
I'm using mssql v6.2.0 in a Lambda that is invoked frequently (consistently ~25 concurrent invocations under standard load).
I seem to be having trouble with connection pooling or something because I keep having tons of open DB connections which overwhelm my database (SQL Server on RDS) causing the Lambdas to just time out waiting for query results.
I have read the docs, various similar questions, Github issues, etc. but nothing has worked for this particular issue.
Things I've Learned Already
I did learn that pooling is possible across invocations due to the fact that variables outside the handler function are shared across invocations in the same container. This makes me think I should see just a few connections for each container running my Lambda, but I don't know how many that is so it's hard to verify. Bottom line is that pooling should keep me from having tons and tons of open connections, so something isn't working right.
There are several different ways to use mssql and I have tried several of them. Notably I've tried specifying max pool size with both large and small values but got the same results.
AWS recommends that you check to see if there's already a pool before trying to create a new one. I tried that to no avail. It was something like pool = pool || await createPool()
I know that RDS Proxy exists to help with situations like this, but it appears it isn't offered (at this time) for SQL Server instances.
I do have the ability to slow down my data a bit, but this has a slight impact on the performance of the product as a whole, so I don't want to do that just to avoid solving a DB connections issue.
Left unchecked, I saw as many as 700 connections to the DB at once, leading me to think there's a leak of some kind and it's maybe not just a reasonable result of high usage.
I didn't find a way to shorten the TTL for connections on the SQL Server side as recommended by this re:Invent slide. Perhaps that is part of the answer?
Code
'use strict';
/* Dependencies */
const sql = require('mssql');
const fs = require('fs').promises;
const path = require('path');
const AWS = require('aws-sdk');
const GeoJSON = require('geojson');
AWS.config.update({ region: 'us-east-1' });
var iotdata = new AWS.IotData({ endpoint: process.env['IotEndpoint'] });
/* Export */
exports.handler = async function (event) {
let myVal= event.Records[0].Sns.Message;
// Gather prerequisites in parallel
let [
query1,
query2,
pool
] = await Promise.all([
fs.readFile(path.join(__dirname, 'query1.sql'), 'utf8'),
fs.readFile(path.join(__dirname, 'query2.sql'), 'utf8'),
sql.connect(process.env['connectionString'])
]);
// Query DB for updated data
let results = await pool.request()
.input('MyCol', sql.TYPES.VarChar, myVal)
.query(query1);
// Prepare IoT Core message
let params = {
topic: `${process.env['MyTopic']}/${results.recordset[0].TopicName}`,
payload: convertToGeoJsonString(results.recordset),
qos: 0
};
// Publish results to MQTT topic
try {
await iotdata.publish(params).promise();
console.log(`Successfully published update for ${myVal}`);
//Query 2
await pool.request()
.input('MyCol1', sql.TYPES.Float, results.recordset[0]['Foo'])
.input('MyCol2', sql.TYPES.Float, results.recordset[0]['Bar'])
.input('MyCol3', sql.TYPES.VarChar, results.recordset[0]['Baz'])
.query(query2);
} catch (err) {
console.log(err);
}
};
/**
* Convert query results to GeoJSON for API response
* #param {Array|Object} data - The query results
*/
function convertToGeoJsonString(data) {
let result = GeoJSON.parse(data, { Point: ['Latitude', 'Longitude']});
return JSON.stringify(result);
}
Question
Please help me understand why I'm getting runaway connections and how to fix it. For bonus points: what's the ideal strategy for handling high DB concurrency on Lambda?
Ultimately this service needs to handle several times the current load -- I realize this becomes a quite intense load. I'm open to options like read replicas or other read-performance-boosting measures as long as they're compatible with SQL Server, and they're not just a cop out for writing proper DB access code.
Please let me know if I can improve the question. I know there are similar ones out there but I have read/tried a lot of them and didn't find them to help. Thanks in advance!
Related Material
https://forums.aws.amazon.com/thread.jspa?messageID=678029 (old, but similar)
https://www.slideshare.net/AmazonWebServices/best-practices-for-using-aws-lambda-with-rdsrdbms-solutions-srv320 re:Invent slide deck
https://www.jeremydaly.com/reuse-database-connections-aws-lambda/ Relevant info but for MySQL instead of SQL Server
Answer
I finally found the answer after 4 days of effort. All I needed to do was scale up the DB. The code is actually fine as-is.
I went from db.t2.micro to db.t3.small (or 1 vCPU, 1GB RAM to 2 vCPU and 2GB RAM) at a net cost of roughly $15/mo.
Theory
In my case, the DB probably couldn't handle the processing (which involves several geographic calculations) for all my invocations at once. I did see CPU go up, but I assumed that was a result of the high open connections. When the queries slowed down, the concurrent invocations pile up as Lambdas start to wait for results, finally causing them to time out and not close their connections properly.
Comparisions:
db.t2.micro:
200+ DB connections (goes up continuously if you leave it running)
50+ concurrent invocations
5000+ ms Lambda duration when things slow down, ~300ms under no load
db.t3.small:
25-35 DB connections (constantly)
~5 concurrent invocations
~33 ms Lambda duration <-- ten times faster!
CloudWatch Dashboard
Summary
I think this issue was confusing to me because it didn't smell like a capacity issue. Almost every time I've dealt with high DB connections in the past, it has been a code error. Having tried options there, I thought it was "some magical gotcha of serverless" that I needed to understand. In the end it was as simple as changing DB tiers. My takeaway is that DB capacity issues can manifest themselves in ways other than high CPU and memory usage, and that high connections may be a result of something besides a code bug.
Update (4 months in)
This continues to work very well. I'm impressed that doubling the DB resources seems to have given > 2x performance. Now, when due to load (or a temporary bug during development), the db connections get really high (even over 1k) the DB handles it. I'm not seeing any issues at all with db connections timing out or the database getting bogged down due to load. Since the original time of writing I've added several CPU-intensive queries to support reporting workloads, and it continues to handle all these loads simultaneously.
We've also deployed this setup to production for one customer since the time of writing and it handles that workload without issue.
So a connection pool is no good on Lambda at all what you can do is reuse connections.
Trouble is every Lambda execution opens a pool it'll just flood the DB like you're getting, you want 1 connection per lambda container, you can use a db class like so (this is rough but lemmy know if you've got questions)
export default class MySQL {
constructor() {
this.connection = null
}
async getConnection() {
if (this.connection === null || this.connection.state === 'disconnected') {
return this.createConnection()
}
return this.connection
}
async createConnection() {
this.connection = await mysql.createConnection({
host: process.env.dbHost,
user: process.env.dbUser,
password: process.env.dbPassword,
database: process.env.database,
})
return this.connection
}
async query(sql, params) {
await this.getConnection()
let err
let rows
[err, rows] = await to(this.connection.query(sql, params))
if (err) {
console.log(err)
return false
}
return rows
}
}
function to(promise) {
return promise.then((data) => {
return [null, data]
}).catch(err => [err])
}
What you need to understand is A lambda execution is a little virtual machine that does a task and then stops, it does sit there for a while and if anyone else needs it then it gets reused along with the container and connection for a single task there's never multiple connections to a single lambda.
Hope this helps let me know if ya need any more detail! Oh and welcome to stackoverflow, that's a well-constructed question.

Distribute work stored in table to multiple processes

I have a database table where each row represents a work to be done. This table is filled up/receive work through a rest API. Apart from a rest-service taking up the work, I have another service which uses actors to process this work.
I need suggestions in distributing this work evenly across these workers. This work is not one time, it is kind of done at an interval until user deletes that.
Therefore I need a mechanism where
The work as it comes is distributed evenly.
If the second service(work consumer) fails it can again boot up with all the records in table and distribute the work again.
Each actor represents one row of the work table.
class WorkActor(workId: String)(implicit system: ActorSystem, materializer: ActorMaterializer) extends Actor {
// read the record from table or whereever you want to read
override def preStart(): Unit = {
logger.info("WorkActor start ===> " + self)
}
override def receive: Receive = {
case _ => {}
}
}
Create an Akka cluster sharding region to dispatch the request from rest api to corresponding actor. Calling startShardingRegion function to return an actorRef. Then you could send the message to this sharding actorRef by rest API, and then corresponding will help you handle the message.
final case class CommandEnvelope(id: String, payload: Any)
def startShardingRegion(role: String)(implicit system: ActorSystem) = {
ClusterSharding(system).start(
typeName = role,
entityProps = Props(classOf[WorkActor]),
settings = ClusterShardingSettings(system),
extractEntityId = ClusterConfig.extractEntityId,
extractShardId = ClusterConfig.extractShardId
)
}
// sharding key
object ClusterConfig {
private val numberOfShards = 100
val extractEntityId: ShardRegion.ExtractEntityId = {
case CommandEnvelope(id, payload) => (id, payload)
}
val extractShardId: ShardRegion.ExtractShardId = {
case CommandEnvelope(id, _) => (id.hashCode % numberOfShards).toString
case ShardRegion.StartEntity(id) => (id.hashCode % numberOfShards).toString
}
}
Read or recover the data from preStart function in the actor. There are many choice. You may read the uncompleted work from MQ (Kafka), Akka persistence (RDS, Cassandra) etc.
SBR has open source solution. That is an advanced topic if your business logic works.
https://github.com/TanUkkii007/akka-cluster-custom-downing
The general outline of a solution is to use Akka Cluster, Cluster Sharding, and Akk Cluster Singleton. When the cluster is considered formed (generally when some minimum number of members have joined the cluster), you start the Cluster Sharding system (sharding work items by the DB's primary key) and then a Cluster Singleton will read the DB table and send work items to Cluster Sharding for distribution among the nodes of the cluster. Akka Streams and particularly Alpakka's Slick JDBC integration may prove useful within the singleton. Another cluster singleton to periodically check on jobs may also be useful to recover from cluster node failures (but see below for something to consider there).
Two notes:
If using Cluster Sharding and Cluster Singleton, you probably want to consider what happens in a split-brain situation: this is a distributed system and the probability of a split-brain eventually happening can be presumed to be 100%. In the split-brain scenario, you will very likely have the same jobs being performed simultaneously by different sides of the split, so you need to ask if that's acceptable in your use-case.
If not, then you will need a component which monitors the communications between nodes in your cluster to detect a split-brain and takes steps to resolve the condition: Lightbend's Split Brain Resolver is a good choice if you aren't interested in implementing this yourself.
In a related vein, if the jobs consist of many steps which must be performed, a question to ask is, if a cluster or node fails after completing, say, eight of ten steps, is it acceptable to redo steps 1-8 vs. starting with step 9? If the answer to this is "no", then you'll need to persist the intermediate state of the job. Akka Persistence is a great choice here, though you may want to read up on event sourcing. If using Persistence with Cluster Sharding and Cluster Singleton, it should be noted, you will almost certainly need to handle split-brains (see previous item).

GCP Dataflow droppedDueToClosedWindow & Commit request for stage S8 and key 8 is larger than 2GB

We run into problems with our Dataflow on Google Cloud. Our pipeline consists of various input steps, which get data pushed in with GCP PubSub. We then aggregate the data and sort it. These 1 steps are clearly too heavy for Dataflow and the window we configured. We get an exception [2] on the step. Also we see these metrics:
droppedDueToClosedWindow 3,838,662 Bids/AggregateExchangeOrders
droppedDueToClosedWindow 21,060,627 Asks/AggregateExchangeOrders
Now I am seeking advice how to attack this issue. Should I break down the steps, so for example iterations and sorting can be done with parallel steps?
Is there a way to get more information about what exactly happens?
Should we increase the number of workers? (Currently 1).
We are rather new with Dataflow. .. Good advice is most welcome.
Edit: I am adding a bit of details on the steps.
This is how the steps below are 'chained' together:
#Override
public PCollection<KV<KV<String, String>, List<ExchangeOrder>>> expand(PCollection<KV<String, KV<String, String>>> input) {
return input.apply("PairWithType", new ByPairWithType(type))
.apply("UnfoldExchangeOrders", new ByAggregatedExchangeOrders())
.apply("AggregateExchangeOrders", GroupByKey.<KV<String, String>, KV<String, KV<BigDecimal, BigDecimal>>>create())
.apply("ReorderExchangeOrders", ParDo.of(new ReorderExchangeOrders()));
}
AggregateExchangeOrders:
So here, clearly we iterate through a collection of orders, and parse the type (twice), so it'a big decimal.
Which makes me think, we could skip one parse step as described here:
Convert string to BigDecimal in java
#ProcessElement
public void processElement(ProcessContext c) {
KV<String, KV<String, String>> key = c.element().getKey();
List<KV<String, String>> value = c.element().getValue();
value.forEach(
exchangeOrder -> {
try {
BigDecimal unitPrice = BigDecimal.valueOf(Double.valueOf(exchangeOrder.getKey()));
BigDecimal quantity = BigDecimal.valueOf(Double.valueOf(exchangeOrder.getValue()));
if (quantity.compareTo(BigDecimal.ZERO) != 0) {
// Exclude exchange orders with no quantity.
c.output(KV.of(key.getValue(), KV.of(key.getKey(), KV.of(unitPrice, quantity))));
}
} catch (NumberFormatException e) {
// Exclude exchange orders with invalid element.
}
});
}
...next we group and sort. (And optionally reverse it), it seems this step is not taking a huge load.
ReorderExchangeOrders:
#ProcessElement
public void processElement(ProcessContext c) {
KV<String, String> pairAndType = c.element().getKey();
Iterable<KV<String, KV<BigDecimal, BigDecimal>>> exchangeOrderBook = c.element().getValue();
List<ExchangeOrder> list = new ArrayList<>();
exchangeOrderBook.forEach(exchangeOrder -> list.add(
new ExchangeOrder(exchangeOrder.getKey(), exchangeOrder.getValue().getKey(), exchangeOrder.getValue().getValue())));
// Asks are sorted in ASC order
Collections.sort(list);
// Bids are sorted in DSC order
if (pairAndType.getValue().equals(EXCHANGE_ORDER_TYPE.BIDS.toString())) {
Collections.reverse(list);
}
c.output(KV.of(pairAndType, list));
}
[ 1 ] Dataflow screenshot:
[ 2 ] Exception: Commit request for stage S8 and key 8 is larger than 2GB and cannot be processed.
java.lang.IllegalStateException: Commit request for stage S8 and key 8 is larger than 2GB and cannot be processed. This may be caused by grouping a very large amount of data in a single window without using Combine, or by producing a large amount of data from a single input element.
com.google.cloud.dataflow.worker.StreamingDataflowWorker$Commit.getSize(StreamingDataflowWorker.java:327)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.lambda$new$0(StreamingDataflowWorker.java:342)
The error message is kind of straightforward.
The root cause of the problem, as many of the comments point out, is that the structure that contains all the results for one of the DoFn's is larger than 2GB, and your best option would be to partition your data in some way to make your work units smaller.
In the code I see that some of the structures returned by DoFn's are nested structures in the form KV>. This arrangement forces Dataflow to send the whole response back in one monolithic bundle, and prevents it from chunking it into smaller pieces.
One possible solution would be to use composite keys instead of nested structures for as long as possible in the pipeline, and only combine them when strictly necessary.
For example,
instead of KV>, the DoFn could return
KV<(concat(Key1, Key2)), Value>
This would split the work units into much smaller sets that can then be dispatched in parallel to multiple workers.
To answer the other questions, increasing the number of workers will have no effect as the huge collection generated by DoFn looks like is not splittable. Adding logging to see how the collection arrives at 2GB might provide useful tips to prevent this.

How to retrieve the current state of a running Step Functions in AWS

I'm giving the AWS' Step Functions a try and I'm interested in them for implementing long-running procedures. One functionality I would like to provide to my users is the possibility of showing execution's progress. Using describeExecution I can verify if some execution is still running or done. But progress is a logical measure and Step Functions itself has no way to tell me how much of the process is left.
For that, I need to provide the logic myself. I can measure the progress in the tasks of the state machine knowing the total number of steps needed to take and counting the number of steps already taken. I can store this information in the state of the machine which is passed among steps while the machine is running. But how can I extract this state using API? Of course, I can store this information is an external storage like DynamoDb but that's not very elegant!
The solution I have found my self (so far this is the only), is using getExecutionHistory API. This API returned a list of events that are generated for the Step Functions and it can include input or output (or neither) based on whether the event is for a starting a lambda function or is it for the time a lambda function has exited. You can call the API like this:
var params = {
executionArn: 'STRING_VALUE', /* required */
maxResults: 10,
reverseOrder: true
};
stepfunctions.getExecutionHistory(params, function(err, data) {
if (err) console.log(err, err.stack); // an error occurred
else console.log(data); // successful response
});
By reversing the order of the list of events, we can get the latest ones first. Then we can look for the latest output in the list. The first one you'll find will be the latest version of the output which is the current state of the Step Functions.