Orchestrate AWS Lambda functions for a data preprocessing pipeline

I have the following ETL job to do using AWS services.
Input is a list of files that need to be processed.
For each file, several Lambdas are invoked in a synchronous pipeline (e.g. each file needs to be downloaded from somewhere, then transformed, etc.).
The whole pipeline needs to wait for all files to be processed before invoking the next Lambda, which analyzes all of the newly generated information.
That Lambda will dictate how all the files continue their processing.
The diagram is something like this:
I've seen that there are tools like AWS Step Functions and AWS Glue, and this looks to me very much like a producer-consumer problem, so maybe SQS comes into play.
Anyway, what would be the best approach to build a pipeline like this?
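As a point of reference, the fan-out/fan-in shape described above maps fairly directly onto a Step Functions Map state: the Map iterates over the file list, runs the per-file Lambda chain for each entry, and only after every iteration finishes does the machine move on to the analysis Lambda. Below is a minimal sketch of such a definition, written as a Python dict of Amazon States Language; all state names and Lambda ARNs are hypothetical placeholders.

import json

# Sketch of a Step Functions definition (Amazon States Language built as a Python dict).
# All state names and Lambda ARNs are hypothetical placeholders.
definition = {
    "StartAt": "ProcessEachFile",
    "States": {
        "ProcessEachFile": {
            "Type": "Map",                 # fan out: one iteration per input file
            "ItemsPath": "$.files",
            "MaxConcurrency": 10,
            "Iterator": {
                "StartAt": "DownloadFile",
                "States": {
                    "DownloadFile": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:download-file",
                        "Next": "TransformFile",
                    },
                    "TransformFile": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-file",
                        "End": True,
                    },
                },
            },
            "Next": "AnalyzeResults",      # fan in: runs only after every file has finished
        },
        "AnalyzeResults": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:analyze-results",
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))    # paste into a Step Functions state machine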

Related

How to deploy AWS Lambda using CodePipeline without rebuilding it if there are no function code changes?

So we have a pipeline that zips the code for our Lambdas, uploads it to S3, and then builds every Lambda we have again with the new version of the zipped code.
Now, the problem is that every single Lambda is rebuilt on every pipeline run, even if there are no changes to that Lambda's code (e.g. only 1 of 10 Lambdas has a code change).
What would be the best approach, or what check could we add to our pipeline, so that we only build the Lambdas whose code changed? I'm open to any suggestions, even creating new pipelines and breaking these Lambdas into pieces.
I think the best way is to add a stage before the zip step that checks which files changed in the last merge.
Simply take those file names and work out which Lambdas were affected.
Then pass on the list of Lambdas that need to be redeployed.
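A minimal sketch of such a stage, assuming each Lambda lives in its own top-level folder of the repository (the folder layout and commit range are assumptions):

import subprocess

# Sketch: list the Lambda folders touched since the last merge,
# assuming one top-level folder per Lambda (hypothetical layout).
def changed_lambdas(base_ref="HEAD~1", head_ref="HEAD"):
    diff = subprocess.run(
        ["git", "diff", "--name-only", base_ref, head_ref],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    # Treat the first path segment as the Lambda name, e.g. "orders-handler/app.py".
    return sorted({path.split("/")[0] for path in diff if "/" in path})

if __name__ == "__main__":
    # Only these Lambdas would be zipped and redeployed in the later stages.
    print(changed_lambdas())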
What we do at our company is have a single pipeline per Lambda/repo. We have a few mono-repos that deploy all the Lambdas in the repo at once, but still through a single pipeline. If you're concerned about the cost of pipelines sticking around, you could always delete them and then have another job recreate them when you need to deploy a new change.
We've got everything done through CloudFormation scripts, so it's all simple scripts running here and there to create pipelines.
Curious what is the reason to have one pipeline deploy all 10 lambdas?

Is there a way to delete messages in SQS with Dataflow once they arrive in PubSub?

I have the following infrastructure in place: Dataflow is used to send messages from AWS SQS to Google Cloud's Pub/Sub.
Messages are read with Java and Apache Beam (SqsIO).
Is there a way with Dataflow to delete the messages in AWS SQS once they arrive in / are read into Pub/Sub, and what would that look like? Can this be done in Java with Apache Beam?
Thank you for any answers in advance!
There's no built-in support for message deletion, but you can add code that deletes the messages read from AWS SQS using a Beam ParDo. However, you must perform such a deletion with care.
A Beam runner performs reading using one or more workers. A given work item could fail at any time, and a runner usually re-runs a failed work item. Additionally, most runners fuse multiple steps. For example, if you have a Read transform followed by a delete ParDo, a runner may fuse these transforms and execute them together. Now, if a work item fails after partially deleting data, a re-run of that work item may fail or may produce incorrect data.
The usual solution is to add a fusion break between the two steps. You can achieve this with Beam's Reshuffle.viaRandomKey() transform (or just by adding any transform that uses GroupByKey). For example, the flow of your program can be as follows.
pipeline
    .apply(SqsIO.read())                    // read messages from SQS
    .apply(Reshuffle.viaRandomKey())        // fusion break: checkpoints the read before the delete
    .apply(ParDo.of(new DeleteSQSDoFn()))   // custom DoFn that deletes each message from SQS
    .apply(BigQuery.Write(...))             // placeholder for the downstream write (e.g. BigQueryIO)

Code pipeline to build a branch on pull request

I am trying to make a CodePipeline that will build my branch when I make a pull request to the master branch in AWS. There are many developers in my organisation and every developer works on their own branch. I am not very familiar with creating Lambda functions. Hoping for a solution.
You can dynamically create pipelines every time a new pull request is created. Look at the CodeCommit triggers (in the old CodePipeline UI); you need a Lambda for this.
Basically it works like this: copy the existing pipeline and update the source branch.
It is not the best, but AFAIK the only way to do what you want.
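A rough sketch of that copy-and-repoint step with boto3 (the pipeline names and the position of the CodeCommit source action are assumptions):

import boto3

codepipeline = boto3.client("codepipeline")

# Sketch: clone an existing pipeline definition for a new branch, e.g. from a Lambda
# fired by a CodeCommit "branch created" trigger. All names are hypothetical.
def create_branch_pipeline(template_pipeline="main-pipeline", branch="feature-x"):
    pipeline = codepipeline.get_pipeline(name=template_pipeline)["pipeline"]

    # Give the copy its own name and point its source action at the new branch
    # (assumes the source action is the first action of the first stage).
    pipeline["name"] = f"{template_pipeline}-{branch}"
    pipeline["stages"][0]["actions"][0]["configuration"]["BranchName"] = branch

    codepipeline.create_pipeline(pipeline=pipeline)
    return pipeline["name"]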
I was there and would not recommend it for the following reasons:
I hit the limit of 20 in my region: "Maximum number of pipelines with change detection set to periodically checking for source changes" - but you definitely want this feature (https://docs.aws.amazon.com/codepipeline/latest/userguide/limits.html).
The branch-deleted trigger does not work correctly, so you cannot delete the created pipeline once the branch has been merged into master.
I would recommend using GitHub.com if you need a workflow like the one you described. Sorry about that.
I have recently implemented an approach that uses CodeBuild GitHub webhook support to run initial unit tests and build, and then publish the source repository and built artefacts as a zipped archive to S3.
You can then use the S3 archive as a source in CodePipeline, where you can then transition your PR artefacts and code through Integration testing, Staging deployments etc...
This is quite a powerful pattern, although one trap here is that if you have a lot of pull requests being created at the same time, you can get CodePipeline executions being superseded, given that only one execution can proceed through a given stage at a time (this is actually a really important property, especially if your integration tests run against shared resources and you don't want multiple instances of your application running data setup/teardown tasks at the same time). To overcome this, I publish an S3 notification to an SQS FIFO queue when CodeBuild publishes the S3 artifact, and then poll the queue, copying each artifact to a different S3 location that triggers CodePipeline, but only if there are currently no executions waiting to execute after the first CodePipeline source stage.
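Roughly, that gate could look like the sketch below (the queue URL, bucket/key layout and pipeline name are placeholders, and the "busy" check depends on how your stages are laid out):

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
codepipeline = boto3.client("codepipeline")

# Sketch: release the next PR artifact to CodePipeline only when nothing is already
# executing past the source stage. All names below are hypothetical placeholders.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pr-artifacts.fifo"
PIPELINE_NAME = "pr-validation-pipeline"
TRIGGER_BUCKET = "codepipeline-source-bucket"
TRIGGER_KEY = "pr-artifact.zip"

def pipeline_is_busy():
    state = codepipeline.get_pipeline_state(name=PIPELINE_NAME)
    # Busy if any stage after the source stage is still in progress.
    return any(
        stage.get("latestExecution", {}).get("status") == "InProgress"
        for stage in state["stageStates"][1:]
    )

def release_next_artifact():
    if pipeline_is_busy():
        return  # leave the message on the FIFO queue; try again on the next poll
    messages = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1).get("Messages", [])
    for message in messages:
        # The body is assumed to hold "bucket,key" for the archive CodeBuild published.
        src_bucket, src_key = message["Body"].split(",", 1)
        s3.copy_object(
            Bucket=TRIGGER_BUCKET, Key=TRIGGER_KEY,
            CopySource={"Bucket": src_bucket, "Key": src_key},
        )
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])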
We can support dynamic branches with the following approach.
One of the limitations of AWS CodePipeline is that we have to specify the branch name while creating the pipeline. We can, however, work around this using the architecture shown below.
flow diagram
Create a Lambda function that takes the GitHub webhook data as input and, using boto3, integrates with the pipeline (pull the pipeline definition and update it); add an API Gateway so the Lambda function can be invoked as a REST call; and finally create a webhook on the GitHub repository.
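A hedged sketch of the boto3 part of that Lambda (the pipeline name, the position of the source action, the "Branch" configuration key and the webhook payload shape are all assumptions):

import boto3

codepipeline = boto3.client("codepipeline")

# Sketch: repoint an existing CodePipeline at the branch named in a GitHub webhook payload.
# Pipeline name, stage/action layout and payload fields are hypothetical.
def lambda_handler(event, context):
    branch = event["ref"].split("/")[-1]   # e.g. "refs/heads/feature-x" -> "feature-x"

    # Pull the current pipeline definition, as described above.
    pipeline = codepipeline.get_pipeline(name="my-pr-pipeline")["pipeline"]

    # Update the branch on the source action (assumed to be the first stage/action).
    pipeline["stages"][0]["actions"][0]["configuration"]["Branch"] = branch

    # Push the modified definition back.
    codepipeline.update_pipeline(pipeline=pipeline)
    return {"statusCode": 200, "body": f"pipeline now tracks {branch}"}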
External links:
https://aws.amazon.com/quickstart/architecture/git-to-s3-using-webhooks/
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/codepipeline.html
Related thread: Dynamically change branches on AWS CodePipeline

Can an AWS Lambda modify a json file on itself?

I have an AWS Lambda function which has an array in a .json file. The thing is that I want to modify that .json, but after the run the JSON remains exactly the same as before the run.
The logs I placed there make me think it actually is being modified, but I wonder if a Lambda goes back to its original definition after the run.
To be honest, the information I need to hold in that JSON will always be just a small set of settings, but they should be easy to modify without doing a deploy, and I'm trying to avoid using a DB or an S3 bucket.
Regards,
Daniel
You're not going to be able to do this. Lambda stores the deployment package (i.e. the .zip or .jar file you used to deploy) and uses that package for the next Lambda it spins up. This new Lambda may or may not be the one that just ran.
The easiest way will be to store this in an S3 bucket. Note, though, that just as in multi-threaded programming you may have many processes (Lambda instances) running at the same time, so resource contention is something to watch out for.
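For instance, a small settings file kept in S3 could be read and written with boto3 along these lines (the bucket and key names are placeholders):

import json
import boto3

s3 = boto3.client("s3")

# Sketch: keep the small settings JSON in S3 instead of inside the deployment package.
# Bucket and key names are hypothetical.
BUCKET = "my-app-settings"
KEY = "settings.json"

def load_settings():
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    return json.loads(body)

def save_settings(settings):
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps(settings).encode("utf-8"))

def lambda_handler(event, context):
    settings = load_settings()
    settings["last_run"] = event.get("time")   # example modification
    save_settings(settings)
    return settings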
Consider the following behaviour of a Lambda function: say you spin one Lambda instance up and then send it a second message. If the first invocation finished before you sent the second message, the same instance will handle it. That is why you see the file changed: the second invocation ran on the same instance, with the same files. I would suggest loading the JSON into memory and not changing the file directly. That will solve your problem.
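In other words, something like the sketch below (assuming the settings ship inside the deployment package as settings.json): read the file once per container and mutate only the in-memory copy, accepting that the changes are lost on a cold start.

import json

# Sketch: load the bundled settings.json once per container and only touch the in-memory copy.
with open("settings.json") as f:               # bundled with the deployment package
    SETTINGS = json.load(f)                    # survives warm invocations only

def lambda_handler(event, context):
    SETTINGS["invocations"] = SETTINGS.get("invocations", 0) + 1   # reset on cold start
    return SETTINGS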
AWS Lambda images are immutable. You need to deploy a new state file (the JSON with the array) or use some kind of external storage for it.

Best way to automate a process to be run from command line (via AWS)

I am working on a web application to provide software as a web-based service using AWS, but I'm stuck on the implementation.
I will be using a Content Management System (probably Joomla) to manage user logins and front-end tasks such as receiving the file uploads that serve as input. The program that provides the service needs to be run from the command line, and I am not sure of the best way to automate this process (starting the program once the input file has been received). It is an intensive program that will take at least an hour per run, and runs should be sequential if there is more than one input at any one time, so there needs to be a queue where each element records the file path of the input file, the file path of the output folder, and ideally the email address to notify when the job is done.
I have looked into AWS Data Pipeline, the Simple Workflow Service, Simple Queue Service and Simple Notification Service, but I'm still not sure exactly how these could be used to trigger the start of the process, beginning with the input file being uploaded.
Any help would be greatly appreciated!
There are a number of ways to architect this type of process; here is one approach that would work (a rough sketch of the queue handoff follows this list):
On upload, put the file in an S3 bucket so that it can be accessed by any instance later.
As part of the upload process, send a message to an SQS queue that includes the bucket/key of the uploaded file and the email address of the user who uploaded it.
Either with Lambda, or with a cron process on one or more purpose-built instances, check the SQS queue and process each request.
At the end of the processing phase, send the email notification to the user that the process is complete.
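A rough sketch of that handoff, with every name a hypothetical placeholder: the upload side enqueues a job, and a worker (e.g. a cron process on an instance, given the hour-long runtime) takes one job at a time.

import json
import subprocess
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/processing-jobs"

def enqueue_job(bucket, key, output_prefix, email):
    # Called by the upload handler once the input file is in S3.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(
            {"bucket": bucket, "key": key, "output_prefix": output_prefix, "email": email}
        ),
    )

def process_next_job():
    # Long-poll for one job; a message reappears if processing dies before it is deleted.
    messages = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    ).get("Messages", [])
    for message in messages:
        job = json.loads(message["Body"])
        # Download the input from S3 (omitted) and run the existing command-line program
        # ("my-processing-tool" is a placeholder for the real binary).
        subprocess.run(["my-processing-tool", job["key"], job["output_prefix"]], check=True)
        # Email the user (e.g. via SNS/SES, omitted), then remove the job from the queue.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])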
You can absolutely use Data Pipeline to automate this process.
Take a look at managed preconditions and the following samples:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-preconditions.html
https://github.com/awslabs/data-pipeline-samples/tree/master/samples