I want to retrieve x records from a database using a custom select statement; the output will be an array of JSON data. I then want to pass each element of the array to another Lambda function, in parallel.
So if 1000 records are returned, 1000 Lambda functions need to be executed in parallel (I will increase my account limit as needed). If 30 out of 1000 fail, the main task that retrieved the records needs to know about it.
I'm struggling to put together this simple flow.
I currently use JavaScript and AWS Aurora. I'm not looking for Node.js/JavaScript code that retrieves the data, just the AWS Step Functions configuration and how to build an array within each function.
Thank you.
if 1000 records are returned, 1000 lambda functions need to be
executed in parallel
What you are trying to achieve is not supported by Step Functions. A state machine task cannot be modified based on the input it receives. For instance, a Parallel task cannot be configured to add or remove branches based on the number of items it received in an array input.
You should probably consider using an SQS Lambda trigger. The records retrieved from the DB can be added to an SQS queue, which will then trigger a Lambda function for each item received.
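As a sketch of that idea in Python with boto3 (the queue URL is a placeholder; SQS SendMessageBatch caps batches at 10 entries, hence the chunking):

```python
import json

def chunk(items, size=10):
    """Split a list into batches of at most `size`
    (SQS SendMessageBatch accepts up to 10 entries per call)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def enqueue_records(records, queue_url):
    """Push each DB record to SQS; the queue's Lambda trigger then
    invokes the worker function for the enqueued items.
    Assumes AWS credentials are configured; `queue_url` is a placeholder."""
    import boto3  # assumed available in the Lambda runtime
    sqs = boto3.client("sqs")
    for batch in chunk(records):
        sqs.send_message_batch(
            QueueUrl=queue_url,
            Entries=[
                {"Id": str(i), "MessageBody": json.dumps(rec)}
                for i, rec in enumerate(batch)
            ],
        )
```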
If 30 out of 1000 fail, the main task that was retrieving the records
needs to know about it.
There are various ways to achieve this. SQS won't delete an item from the queue if the Lambda returns an error. You can configure a DLQ and RedrivePolicy based on your requirements. Or you may want to come up with a custom solution that keeps count of the failing Lambdas and invokes the service that fetches records from the DB.
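For the DLQ route, a redrive policy on the source queue might look like the following (the ARN and threshold are placeholders): after maxReceiveCount failed processing attempts, SQS moves the message to the dead-letter queue, whose depth the main task can then monitor or alarm on.

```json
{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:failed-records-dlq",
  "maxReceiveCount": 3
}
```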
Related
I have a use case where I want a scheduled lambda to read from a DynamoDB table until there are no records left to process from its DynamoDB query. I don't want to run lots of instances of the lambda, as it will hit a REST endpoint each time and I don't want to overload this external service.
The reason I am thinking I can't use DynamoDB streams (please correct me if I am wrong here) is
that this DDB is where messages will be sent when a legacy service is down. The scheduled error-handler lambda that reads them would not want to try to process them as soon as they are inserted, as it is likely the legacy service is still down. (Is it possible with streams to update one row in the DB, say legacy_service = alive, and then trigger a lambda ONLY for the rows where processed_status = false?)
I also don't want multiple instances of the lambda running at one time, as I don't want to throttle the legacy service.
I would like a scheduled lambda that queries the DynamoDB table for all records that have processed_status = false; the query has a limit to only retrieve a small batch (1 or 2 messages) and process them (I have this part implemented already). When this lambda is finished, I would like it to trigger again and again until there are no records in the DDB with processed_status = false.
This can be done with recursive functions; there is a good tutorial here: https://labs.ebury.rocks/2016/11/22/recursive-amazon-lambda-functions/
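A rough sketch of that recursive pattern (function and parameter names are mine, and the self-invocation is abstracted behind `invoke_self`, which in a real Lambda would be boto3's `lambda.invoke` with `InvocationType='Event'`):

```python
def handler(event, invoke_self, fetch_batch, process):
    """One 'tick' of the recursive pattern: fetch a small batch of
    processed_status = false items, process them, then re-invoke
    ourselves asynchronously if the batch was non-empty."""
    batch = fetch_batch(limit=2)
    if not batch:
        return "done"        # nothing left: stop recursing
    for item in batch:
        process(item)
    invoke_self(event)       # schedule the next tick
    return "continue"
```

Because each invocation handles only a small batch and then hands off to a single successor, the legacy service only ever sees one worker at a time.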
Introduction
We are building an application to process a monthly file, and there are many AWS components involved in this project:
A Lambda reads the file from S3, parses it, and pushes it to DynamoDB with a PENDING flag for each record.
Another Lambda processes these records after the first Lambda is done, and flags each record as PROCESSED when it's done with it.
Problem:
We want to send a result to SQS after all records are processed.
Our approach
Is to use DynamoDB Streams to trigger a Lambda each time a record gets updated, and have the Lambda query DynamoDB to check if all records are processed, sending the notification when that's true.
Questions
Is there any other approach that can achieve this goal without triggering a Lambda each time a record gets updated?
Is there a better approach that doesn't include DynamoDB Streams?
I would recommend DynamoDB Streams, as they are reliable enough; triggering a lambda for an update is pretty cheap, and an execution will usually take 1-100 ms. Even if you have millions of executions it is a robust solution. There is a way to have a shared counter of processed messages using ElastiCache: once you receive an update and the counter is 0, you are complete.
Are there any other approach that can achieve this goal without
triggering Lambda each time a record gets updated?
Another option is having a scheduled Lambda execution that checks the status of all processed records in the DB (query for PROCESSED) and moves them to SQS. Depending on the load, you could define how often it should run. (Trigger it using a CloudWatch scheduled event.)
What about having a table monthly_file_process with a row for every month, holding an extra counter?
Once the S3 file is read, count the records and persist the total as the counter. With every PROCESSED record, decrease the counter; if the counter is 0 after the update, send the SQS notification. This entire thing, including sending to SQS, could be done from the second Lambda which processes the records, with just the extra step of checking the counter.
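A minimal sketch of that counter update, assuming a hypothetical monthly_file_process table with a `remaining` attribute (DynamoDB's `ADD` action makes the decrement atomic, so concurrent Lambdas can't race on the count):

```python
def should_notify(updated_attributes):
    """The last PROCESSED update drives the counter to 0."""
    return updated_attributes.get("remaining") == 0

def mark_processed(month_key):
    """Atomically decrement the pending counter for this month's file;
    returns True when this was the last record (time to notify SQS).
    Table and attribute names are hypothetical."""
    import boto3  # assumed available in the Lambda runtime
    table = boto3.resource("dynamodb").Table("monthly_file_process")
    resp = table.update_item(
        Key={"month": month_key},
        UpdateExpression="ADD remaining :dec",
        ExpressionAttributeValues={":dec": -1},
        ReturnValues="UPDATED_NEW",
    )
    return should_notify(resp["Attributes"])
```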
I have written a Cloud Storage trigger-based Cloud Function. I have 10-15 files landing at 5-second intervals in a cloud bucket, which loads data into a BigQuery table (truncate and load).
While there are 10 files in the bucket, I want the Cloud Function to process them sequentially, i.e. 1 file at a time, as all the files access the same table.
Currently the Cloud Function is triggered for multiple files at a time, and it fails in the BigQuery operation as multiple files try to access the same table.
Is there any way to configure this in Cloud Functions?
Thanks in Advance!
You can achieve this by using Pub/Sub and the max instances parameter on Cloud Functions.
Firstly, use the notification capability of Google Cloud Storage and sink the events into a Pub/Sub topic.
Now you will receive a message every time an event occurs on the bucket. If you want to filter on file creation only (object finalize), you can apply a filter on the subscription. I wrote an article on this.
Then, create an HTTP function (an HTTP function is required if you want to apply a filter) with max instances set to 1. This way, only 1 function can execute at a time, so no concurrency!
Finally, create a Pub/Sub subscription on the topic, with a filter or not, to call your function over HTTP.
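The steps above might be wired up roughly like this (all resource names are placeholders; check the flags against your gcloud/gsutil version):

```shell
# 1. Route bucket events to a Pub/Sub topic, only for new objects
gsutil notification create -t gcs-events -f json -e OBJECT_FINALIZE gs://my-bucket

# 2. Deploy the HTTP function, capped at one concurrent instance
gcloud functions deploy process-file \
    --runtime=nodejs18 --trigger-http \
    --max-instances=1 --entry-point=processFile

# 3. Push subscription that calls the function over HTTP
gcloud pubsub subscriptions create gcs-events-sub \
    --topic=gcs-events \
    --push-endpoint=https://REGION-PROJECT_ID.cloudfunctions.net/process-file
```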
EDIT
Thanks to your code, I understood what happens. In fact, BigQuery is a declarative system. When you perform a request or a load job, a job is created and it works in the background.
In Python, you can explicitly wait for the end of the job, but with pandas I didn't find how!
I just found a Google Cloud page explaining how to migrate from pandas to the BigQuery client library. As you can see, there is a line at the end
# Wait for the load job to complete.
job.result()
that waits for the end of the job.
You did it well in the _insert_into_bigquery_dwh function, but that's not the case in the staging _insert_into_bigquery_staging one. This can lead to 2 issues:
The dwh function works on the old data, because the staging load isn't yet finished when you trigger this job.
If the staging load takes, let's say, 10 seconds and runs in the "background" (you don't explicitly wait for the end in your code), and the dwh takes 1 second, the next file is processed at the end of the dwh function, even while the staging load continues to run in the background. And that leads to your issue.
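A hedged sketch of what a blocking staging load might look like, assuming the google-cloud-bigquery client library (function and table names are placeholders, not the asker's actual code):

```python
def insert_into_staging(dataframe, table_id):
    """Load the parsed file into the staging table and block until the
    load job finishes, so the dwh step never reads half-loaded data.
    `table_id` is a placeholder like 'project.dataset.staging'."""
    from google.cloud import bigquery  # assumed installed
    client = bigquery.Client()
    job = client.load_table_from_dataframe(dataframe, table_id)
    job.result()  # wait for the load job to complete before returning
    return job
```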
The architecture you describe isn't the same as the one in the documentation you linked. Note that in the flow diagram and the code samples, the storage events trigger the Cloud Function, which streams the data directly to the destination table. Since BigQuery allows multiple streaming insert jobs, several functions could execute at the same time without problems. In your use case, the intermediate table loaded with write-truncate for data cleaning makes a big difference, because each execution needs the previous one to finish, thus requiring a sequential processing approach.
I would like to point out that Pub/Sub doesn't let you configure the rate at which messages are sent: if 10 messages arrive at the topic, they will all be sent to the subscriber, even if processed one at a time. Limiting the function to one instance may lead to overhead for the above reason and could increase latency as well. That said, since the expected workload is 15-30 files a day, the above may not be a big concern.
If you'd like to have parallel executions, you may try creating a new table for each message and setting a short expiration deadline for it using the table.expires setter, so that multiple executions don't conflict with each other. Here is the related library reference. Otherwise, the great answer from Guillaume would completely get the job done.
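As an illustration of that idea, a sketch that creates a per-message staging table with an expiration, assuming the google-cloud-bigquery library (all names are placeholders; in the Python client, expiration is set by assigning to the `expires` property):

```python
import datetime

def staging_table_for_message(client, dataset, message_id, ttl_minutes=30):
    """Create a dedicated staging table for one Pub/Sub message so
    parallel executions don't clash, and let BigQuery garbage-collect
    it via the table's expiration time. `client` is a bigquery.Client."""
    from google.cloud import bigquery  # assumed installed
    table = bigquery.Table(f"{dataset}.staging_{message_id}")
    table.expires = (datetime.datetime.now(datetime.timezone.utc)
                     + datetime.timedelta(minutes=ttl_minutes))
    return client.create_table(table)
```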
I have one AWS Lambda that kicks off (via SNS events) multiple Lambdas, which in turn kick off (via SNS events) multiple Lambdas. All of these Lambdas are writing files to S3, and I need to know when all files have been written. There will be another Lambda which will send a final SNS message containing references to all the files produced. The amount of fan-out in the second set of Lambdas is unknown, as it depends on the first fan-out.
If this were a single fan-out, I would know how many files to look for, but as it is a 2-step fan-out I am unsure how to monitor for all the files. Has anybody dealt with this before? Thanks.
I would create a DynamoDB table for tracking this process. Create a single record in the table when the initial Lambda function kicks off, with a unique ID like a UUID if you don't already have one for this process. Also add that unique ID to the SNS messages; this will be the key used for all updates performed by the other processes. When the first process creates the record, also add a splitters_invoked property with the number of second-level splitter functions it is invoking, and a splitters_complete property set to 0.
Inside the second-level splitter functions you can use the DynamoDB Conditional Updates feature to update the DynamoDB record with the list of created files and their S3 locations. The second-level splitter functions will also use the DynamoDB Atomic Counters feature to update the splitters_complete count just before they exit.
At the "process" level, each of those invocations will perform another Conditional Update to the DynamoDB record, flagging the individual file they just processed as complete.
Finally, configure DynamoDB Streams to trigger another Lambda function. This Lambda function will check two conditions: splitters_complete equals splitters_invoked, and all files in the file list are marked as "completed". Then it will know it can perform the final step in your process.
Alternatively, if you don't want to keep the list of S3 file locations in the DynamoDB table, simply use atomic counters for that as well: one counter for the total number of files created by the second-level splitters, and another counter for the file-processing functions.
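A sketch of the counter side of this design in Python with boto3 (table and attribute names are hypothetical):

```python
def splitter_finished(tracking_id):
    """Atomically bump splitters_complete just before a second-level
    splitter exits; returns the updated record so the stream-triggered
    Lambda can compare it with splitters_invoked. Names are placeholders."""
    import boto3  # assumed available in the Lambda runtime
    table = boto3.resource("dynamodb").Table("fanout_tracking")
    resp = table.update_item(
        Key={"id": tracking_id},
        UpdateExpression="ADD splitters_complete :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="ALL_NEW",
    )
    return resp["Attributes"]

def fanout_done(record):
    """True once every splitter reported in and every file is flagged done."""
    return (record["splitters_complete"] == record["splitters_invoked"]
            and all(f.get("done") for f in record.get("files", [])))
```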
There seems to be no way to tell lambdas to pull records in a scheduled manner.
This means that my lambda function never gets invoked unless the size of records meets the batch specification.
I'd like my lambda function to be invoked eagerly, so that it can pull records after a specified time elapses as well.
Imagine that you are building a real-time analytics service that does not fill the specified batch size for a long time during off-peak hours.
Is there any workaround to pull records periodically?
This means that my lambda function never gets invoked unless the size of records meets the batch specification.
That is not correct to my knowledge - can you provide the documentation that says so?
To my knowledge
AWS uses a daemon to poll the stream and check for new records. The daemon is what triggers the Lambda, and it happens in one of two cases:
The batch size crossed the specified limit (the one configured in Lambda).
A certain amount of time has passed (I don't know exactly how much) and the current batch is not empty.
I have made massive use of Kinesis and Lambda; I configured the batch limit to 500 records (per invocation).
I have had invocations with fewer than 500 records, sometimes even ~20 records; this is a fact.
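The two trigger conditions above can be modeled as a small decision function (the exact wait time AWS uses is not documented here, so the 1-second default is a placeholder):

```python
def should_invoke(batch_len, seconds_since_last_invoke,
                  batch_size=500, max_wait_seconds=1.0):
    """Model of the poller's decision: fire when the batch is full, or
    when some time has passed and at least one record is buffered."""
    if batch_len >= batch_size:
        return True          # batch size crossed the configured limit
    return batch_len > 0 and seconds_since_last_invoke >= max_wait_seconds
```

This is why invocations with ~20 records are perfectly normal during off-peak periods: the time condition fires before the batch ever fills.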