Here is my problem:
I have to compare a candidate object with some criteria with millions of other candidates in db. Since lambda allows only 5 minutes of execution so it causes timeout.
My Solution:
I planned to do this comparison with 10,000 chunks of candidates so I have to call 10 lambda functions (through SNS) to process 100,000 candidates and then save results of each lambda in some DynamoDB table. But how to get a callback when all lambda functions are done processing so that I can collect those results for individual lambdas and then calculate final results. How to achieve this or is there any better way to acheive my goal. Any help is most appreciated.
I'm not sure if AWS Lambda is truly a good fit for your use case. However just focusing on the main part of your question, you could use DynamoDB Atomic Counters to determine when all processing is complete. You would do the following:
Initially insert a record in DynamodB with a field like numberOfLambdaCalls attribute set to the number of concurrent executions you are kicking off, and a completedLambdaCalls attribute set to 0.
As each function completes, as part of updating the DynamoDB record they would increment the completedLambdaCalls attribute atomically.
Each function could check the returned result of the update to see if they were the one to complete the processing like if numberOfLambdaCalls == completedLambdaCalls and if they are, perform whatever action is necessary to trigger your response.
Related
I would like to get some advice to see whether step function is suitable for my use case.
I have a bunch of user records generated at random time. I need to do some pre-processing and validation before putting them into a pool. I have a stage which runs periodically (1-5min) to collect records from the pool and combine them, then publish them.
I need realtime traceability/monitor of each record and I need to notify the user once the record is published.
Here is a diagram to illustrate the flow.
Is a step function suitable for my use case? if not, is there any alternative which help me to simplify the solution? Thanks
Yes, Step Functions is an option. Step Function "State Machines" add the greatest value vs other AWS serverless workflow patterns such as event-driven or pub/sub when the scenario involves complex branching/retry logic and observability requirements. SM logic is explicit and visual, which makes it simple to reason about the workflow. For each State Machine (SM) execution, you can easily trace the exact path the execution took and where it failed. This added functionality is reflected in its higher cost.
In any case, you need to gather records until its time to collect them. This batching requirement means that your achitecture will need more elements than just a State Machine. Here are some ideas:
(1) A SM preprocesses Records one-by-one as they arrive
One option is to use State Machines to orchestrate the preprocessing and validation only. Each arriving event record kicks off a SM execution. Pre-processed records go into a queue, from which they are periodically polled and sent to be combined.
[Records EventBrige event] -> [preprocessing SM] -> [Record queue] -> [polling lambda] -> [Combining Service]
(2) Preprocess and process bached records in a end-to-end State Machine
Gather records in a queue as they arrive. A lambda periodically polls the queue and begins the SM execution on a batch of records. A SM Map Task pre-processes and validates the records in parallel then calls the combining service, all within a single execution. This setup gives you the greatest visibility, but is more complex because you have to handle cases where a single record causes the batched execution to fail.
[Records arrive] -> [Record source queue] -> [polling lambda gets batch] -> [SM for preprocessing, collecting and combining]
Other
There are plenty of other combinations, including chaining SM's together if necessary. Or avoiding SM's altogether. Which option is best for you will depend on which pain points matter most to you: observability, error handling, simplicity, cost.
We have a AWS step function that processes csv files. These CSV files records can be anything from 1 to 4000.
Now, I want to create another inner AWS step function that will process these csv records. The problem is for each record I need to hit another API and for that I want all of the record to be executed asynchronously.
For example - CSV recieved having records of 2500
The step function called another step function 2500 times (The other step function will take a CSV record as input) process it and then store the result in Dynamo or in any other place.
I have learnt about the callback pattern in aws step function but in my case I will be passing 2500 tokens and I want the outer step function to process them when all the 2500 records are done processing.
So my question is this possible using the AWS step function.
If you know any article or guide for me to reference then that would be great.
Thanks in advance
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).
Hi I was just wondering if someone could give me clarification on the benefits of batchwrite
If we have let's say 36 items we want to write to dynamodb, i'm using a AWS lambda function and the way I see it I have two options (Pseudo code)
Option One
for item in items:
putItem(item)
Option two
for item in items:
if cnt < 25
batch.push(item)
if cnt == 25
batchWrite(batch)
cnt = 0
cnt++
I feel like option one is quick and dirty but if my items would rarely go over 100 is it that bad (would I time out my lambda etc ..) ?
Anyway best practice clarification on this would be great.
For both of your options, you should implement some kind of parallel send for better controlling and performance, like Promise.allSettled in JS and asyncio parallel in python.
Like mentioned in the comments aboce, using batchWrite does in fact reduce the number of requests called, but one critical limitation is it only accepts up to 25 requests and 16 MB total size in one batch requests , if you expect the requests will exceed this limitation, you need to, and you should implement a way to divide the requests into multiple batches, so that each batch size is under 25 requests and 16 MB.
Using putItem is more simple than using batchWrite, as it doesn't have the limitation mentioned above. But again, you will initiate a lot of API requests.
Both the method does not affect the cost, as AWS does not charge you by the number of API called to them, according to their pricing description, but the actual data write to and read from the DynamoDB table, which are known are WCU and RCU respectively. Btw, data transfer in is not charged.
In conclusion, for both putItem and batchWrite, what you have to concern is (1) how to implement the requests and handle retry incase of error. (2) The quantity of the record to be inserted.
Just wondering whats the best way to handle the fact that dynamodb can only write batch sizes of max 25.
I have 3 Lambdas (there are more but I am simplifying down so we don't get side tracked)
GetNItemsFromExternalSourceLambda
SaveAllToDynamoDBLambda
AnalyzeDynamoDBLambda
Here is what happens:
GetNItemsFromExternalSourceLambda can potentially fetch 250 items in 1 rest call it makes to an external api.
It then invokes SaveAllToDynamoDBLambda and passes a) all these items and b) paging info e.g. {pageNum:1, pageSize : 250, numPages:5 } in the payload
SaveAllToDynamoDBLambda needs to save all items to a dynamodb table and then , based on the paging info will either a) re-invoke GetNItemsFromExternalSourceLambda (to fetch next page of data) or b) invoke AnalyzeDynamoDBLambda
these steps can loop many times obviously until we have got all the data from the external source before finally proceeding to last step
the final AnalyzeDynamoDBLambda then is some lambda that processes all the data that was fetched and saved to the db
So my problems lies in fact that SaveAllToDynamoDBLambda can only write 25 items in a batch, which means I would have to tell my GetNItemsFromExternalSourceLambda to only fetch 25 items at a time from the external source which is not ideal. (being able to fetch 250 at a time would be a lot better)
One could extend the timeout period of the SaveAllToDynamoDBLambda so that it could do multiple batch writes inside one invocation but i dont like that approach.
I could also zip up the 250 items and save to s3 in one upload which could trigger a stream event but I would have same issue on the other side of that solution.
just wondering whats a better approach while still being able to invoke AnalyzeDynamoDBLambda only after all info from all rest calls has been saved to dynamodb.
Basically the problem is you need a way of subdividing the large batch (250 items in this case) down to batches of 25 of less.
A very simple solution would be to use a Kinesis stream in the middle. Kinesis can take up to 500 records per PutRecords call. You can then use GetRecords with a Limit of 25 and put the records into Dynamo with a single BatchWriteItem call.
Make sure you look at the size limits as well before deciding if this solution will work for you.
If I scan or query in DynamoDB it is possible to set the Limit property. The DynamoDB documentation says the following:
The maximum number of items to evaluate (not necessarily the number of
matching items).
So the problem with this is if you set filters and such it won't return all the items.
My goal that I'm trying to figure out how to achieve is to have a filter in a scan or query, but have it return x number of items. No matter what. I'm ok with having to use LastEvaluatedKey and make multiple requests, but I would like to try to make it as seamless and easy as possible (so not doing that would be best.
The only way I have thought to do this is to set the Limit property to say 1 or something. Then just keep scanning or querying using the LastEvaluatedKey until I reach that x number of items I'm looking for. Problem is, this seems VERY wasteful and inefficient. I mean if you have a table of millions of records you might have to make thousands and thousands of requests. It doesn't seem like it scales very well. Of course I'm sure it's no different than what DynamoDB would be doing behind the scenes.
But is there a way to do this more efficiently where I can reduce the number of requests I have to make? Or is that the only way to achieve this?
How would you achieve this goal?
A single Query operation will read up to the maximum number of items set (if using the Limit parameter) or a maximum of 1 MB of data and then apply any filtering to the results using FilterExpression.
You're 100% right that Limit is applied before FilterExpression. Meaning Dynamo might return some number or documents less than the Limit while other documents that satisfy the FilterExpression still exist in the table but aren't returned.
Its sounds like it would be unacceptable for your api to behave in the same manner. That is going to mean that in some cases, a single request to your service will result in multiple requests to Dynamo. Also, keep in mind that there is no way to predict what the LastEvaluatedKey will be which would be required to parallelize these requests. So in the case that your service makes multiple requests to Dynamo, they will be serial. To me, this is a rather heavy tradeoff but, if it is a requirement that you satisfy the Limit whenever possible, you have options.
First, Dynamo will automatically page at 1 MB. That means you could simply send your query to Dynamo without a Limit and implement the Limit on your end. You may still need to make multiple requests to ensure that your've satisfied the Limit but this approach will result in the fewest number of requests to Dynamo. The trade off here is the total data being read and transferred. Chances are your Limit will not happen to line up perfectly with the 1 MB limit which means the excess data being read, filtered, and transferred is wasted.
You already mentioned the other extreme of sending a Limit of 1 and pointed out that will result in the maximum number of requests to Dynamo
Another approach along these lines is to create some sort of probabilistic function that takes the Limit given to your service by the client and computes a new Limit for Dynamo. For example, your FilterExpression filters out about half of the documents in the table. That means you can multiply the client Limit by 2 and that would be a reasonable Limit to send to Dynamo. Of the approaches we've talked about so far, this one has the highest potential for efficiency however, it also has the highest potential for complexity. For example, you might find that using a simple linear function is not good enough and instead you need to use machine learning to find a multi-variate non-linear function to calculate the new Limit. This approach also heavily depends on the uniformity of your data in Dynamo as well as your access patterns. Again, you might need machine learning to optimize for those variables.
In any of the cases where you are implementing the Limit on your end, if you plan on sending back the LastEvaluatedKey to the client for subsequent calls to your service, you will also need to take care to keep track of the LastEvaluatedKey that you evaluated. You will no longer be able to rely on the LastEvaluatedKey returned from Dynamo.
The final approach would be to reorganize/regroup your data either with a GSI, a separate table that you keep in sync using Dynamo Streams or a different schema altogether with the goal of not requiring a FilterExpression.