How to automate the Updating/Editing of Amazon Data Pipeline - amazon-web-services

I want to use the AWS Data Pipeline service and have created some pipelines using the manual JSON-based mechanism, which uses the AWS CLI to create, put and activate the pipeline.
My question is: how can I automate the editing or updating of the pipeline if something changes in the pipeline definition? Things that I can imagine changing include the schedule time, the addition or removal of Activities or Preconditions, references to DataNodes, the resource definitions, etc.
Once the pipeline is created, we cannot edit quite a few things as mentioned here in the official doc: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-manage-pipeline-modify-console.html#dp-edit-pipeline-limits
This makes me believe that if I want to automate the updating of the pipeline, I would have to delete and re-create/activate a new pipeline? If so, the next question is: how can I create an automated process which identifies the previous version's ID, deletes it and creates a new one? Essentially I am trying to build a release-management flow for this, where the configuration JSON file is released and deployed automatically.
Most commands like activate, delete, list-runs, put-pipeline-definition etc. take the pipeline-id, which is not known until a new pipeline is created. I am unable to find anything which remains constant across updates or recreation (the unique-id and name parameters of the create-pipeline command are consistent, but I can't use them for the above-mentioned tasks; I need the pipeline-id for that).
Of course I could write shell scripts which grep and parse the output, but is there a better way? Is there some other info that I am missing?
Thanks a lot.

You cannot fully edit schedules or change references, so deleting and re-creating pipelines seems to be the best way for your scenario.
You'll need the pipeline-id to delete a pipeline. Is it not possible to keep a record of that somewhere? You can have a file with the last used id stored locally or in S3 for instance.
Some other ways I can think of are:
- If you have only one pipeline in the account, you can call list-pipelines and use the only result.
- If you have the pipeline name, you can call list-pipelines and find the id (see the sketch below).
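For the release-management flow in the question, here is a minimal boto3 sketch of that delete/re-create/activate cycle, keyed on the pipeline name. The name, definition file and unique-id handling are assumptions, and it assumes the definition JSON is already in the low-level API format (objects/parameters/values), which differs from the CLI shorthand:

```python
import json
import boto3

client = boto3.client("datapipeline")

PIPELINE_NAME = "my-etl-pipeline"              # hypothetical: the name acts as the stable key
DEFINITION_FILE = "pipeline-definition.json"   # hypothetical: the released definition

def find_pipeline_id_by_name(name):
    """Page through list_pipelines and return the id of the pipeline with this name."""
    kwargs = {}
    while True:
        resp = client.list_pipelines(**kwargs)
        for p in resp["pipelineIdList"]:
            if p["name"] == name:
                return p["id"]
        if not resp.get("hasMoreResults"):
            return None
        kwargs = {"marker": resp["marker"]}

def release(unique_id):
    """Delete the previous version (if any), then create, define and activate a new one."""
    old_id = find_pipeline_id_by_name(PIPELINE_NAME)
    if old_id:
        client.delete_pipeline(pipelineId=old_id)

    new_id = client.create_pipeline(name=PIPELINE_NAME, uniqueId=unique_id)["pipelineId"]

    with open(DEFINITION_FILE) as f:
        definition = json.load(f)
    client.put_pipeline_definition(
        pipelineId=new_id,
        pipelineObjects=definition["objects"],
        parameterObjects=definition.get("parameters", []),
        parameterValues=definition.get("values", []),
    )
    client.activate_pipeline(pipelineId=new_id)
    return new_id
```

Storing the returned id (locally or in S3, as suggested above) then covers the other commands that need it, such as list-runs.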

Related

Can I configure Google DataFlow to keep nodes up when I drain a pipeline

I am deploying a pipeline to Google Cloud DataFlow using Apache Beam. When I want to deploy a change to the pipeline, I drain the running pipeline and redeploy it. I would like to make this faster. It appears from the logs that on each deploy DataFlow builds up new worker nodes from scratch: I see Linux boot messages going by.
Is it possible to drain the pipeline without tearing down the worker nodes so the next deployment can reuse them?
rewriting Inigo's answer here:
Answering the original question: no, there is no way to do that. Updating should be the way to go. I was not aware it was marked as experimental (we should probably change that), but the update approach has not changed in the last 3 years I have been using Dataflow. As for the special cases where update does not work: even if the feature you describe existed, the workers would still need the new code, so there is not really much to save, and update should work in most other cases.
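For what it's worth, a hedged sketch of that update path with the Beam Python SDK: passing update=True together with the running job's name asks Dataflow to replace the job in place instead of requiring a drain and cold redeploy. The project, region, bucket, topics and job name below are placeholders, and compatible transform names are assumed:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders: substitute your own project, region, bucket and topic names.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    job_name="my-streaming-job",   # must match the name of the running job
    update=True,                   # request an in-place update instead of a new job
    streaming=True,
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "Transform" >> beam.Map(lambda msg: msg.decode("utf-8").upper().encode("utf-8"))
     | "Write" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/events-out"))
```

The transform names ("Read", "Transform", "Write") need to stay compatible between the old and new graph, or be mapped via the transform_name_mapping option.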

Is it possible to re-run a job in Google Cloud Dataflow after it has succeeded

Maybe the question sounds stupid, but I was wondering: once a job has finished successfully and has an ID, is it possible to start the same job again?
Or is it necessary to create another one?
Because otherwise I would end up with jobs of the same name throughout the list.
I just want to know if there is a way to restart it without recreating it again.
It's not possible to run the exact same job again, but you can create a new job with the same name that runs the same code. It will just have a different job ID and show up as a separate entry in the job list.
If you want to make running repeated jobs easier, you can create a template. This will let you create jobs from that template via a gcloud command instead of having to run your pipeline code.
Cloud Dataflow does have a re-start function (see the SDK). One suggested pattern (to help with deployment) is to create a template for the graph you want to run repeatedly AND execute the template, as sketched below.
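A hedged sketch of that template pattern with the Beam Python SDK and the Dataflow REST API via google-api-python-client (classic templates; project, bucket, paths and job names are placeholders): stage the graph once as a template, then launch fresh jobs from it without re-running the pipeline code. Each launch gets its own job ID and its own entry in the job list.

```python
# Stage the pipeline as a classic template (run this once; nothing executes yet).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                                  # placeholder
    region="us-central1",                                  # placeholder
    temp_location="gs://my-bucket/tmp",                    # placeholder
    template_location="gs://my-bucket/templates/my-job",   # where the template is staged
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
     | "Count" >> beam.combiners.Count.Globally()
     | "Format" >> beam.Map(str)
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts"))

# Launch a new job from the staged template whenever you want to "re-run" it
# (the REST equivalent of `gcloud dataflow jobs run`).
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
dataflow.projects().locations().templates().launch(
    projectId="my-project",
    location="us-central1",
    gcsPath="gs://my-bucket/templates/my-job",
    body={"jobName": "my-job-rerun-001"},   # a new job ID is assigned on each launch
).execute()
```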

What's the point of full configurations in .tfstate?

So I've gone through some basic reading:
https://blog.gruntwork.io/an-introduction-to-terraform-f17df9c6d180
and
https://blog.gruntwork.io/how-to-manage-terraform-state-28f5697e68fa
So I understand that .tfstate tells the terraform CLI which resources it's actually responsible for managing. But couldn't that be done more minimally with a list of ID's?
Why does .tfstate need to contain full configurations of all resources, if terraform refresh is run implicitly before terraform apply?
Would the fetch not get complete information from the infrastructure? And then that could be used to do the diff etc...
I suppose if you get complete information every time, you might as well record it. But I'm wondering if it's a necessary step. Thanks!

Code pipeline to build a branch on pull request

I am trying to make a code pipeline which will build my branch when I make a pull request to the master branch in AWS. I have many developers working in my organisation and all the developers work on their own branch. I am not very familiar with creating Lambda functions. Hoping for a solution.
You can dynamically create pipelines every time a new pull request has been created. Look for CodeCommit triggers (in the old CodePipeline UI); you need Lambda for this.
Basically it works like this: copy the existing pipeline and update the source branch.
It is not the best approach, but AFAIK it is the only way to do what you want.
I was there and would not recommend it for the following reasons:
I hit this limit of 20 in my region: "Maximum number of pipelines with change detection set to periodically checking for source changes" - but, you definitely want this feature ( https://docs.aws.amazon.com/codepipeline/latest/userguide/limits.html )
The branch-deleted trigger does not work correctly, so you cannot delete the created pipeline when the branch has been merged into master.
I would recommend using GitHub.com if you need a workflow like the one you described. Sorry for this.
I have recently implemented an approach that uses CodeBuild GitHub webhook support to run initial unit tests and build, and then publish the source repository and built artefacts as a zipped archive to S3.
You can then use the S3 archive as a source in CodePipeline, where you can then transition your PR artefacts and code through Integration testing, Staging deployments etc...
This is quite a powerful pattern, although one trap is that if a lot of pull requests are created at the same time, CodePipeline executions can be superseded, given that only one execution can proceed through a given stage at a time. (This is actually a really important property, especially if your integration tests run against shared resources and you don't want multiple instances of your application running data setup/teardown tasks at the same time.) To overcome this, I publish an S3 notification to an SQS FIFO queue when CodeBuild publishes the S3 artifact, and then poll the queue, copying each artifact to a different S3 location that triggers CodePipeline, but only if there are currently no executions waiting to execute after the first CodePipeline source stage; a rough sketch of that poller follows below.
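A rough boto3 sketch of that gating poller, under my own naming assumptions (the queue URL, pipeline name, bucket and key are placeholders, and the "busy" check simply looks for an in-progress stage after Source), not a definitive implementation:

```python
import json
import time
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
codepipeline = boto3.client("codepipeline")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pr-artifacts.fifo"  # placeholder
PIPELINE_NAME = "pr-integration-pipeline"      # placeholder
TRIGGER_BUCKET = "pr-pipeline-trigger-bucket"  # bucket/key watched by the CodePipeline source
TRIGGER_KEY = "source/artifact.zip"            # placeholder

def pipeline_is_busy(name):
    """True if any stage after Source still has an execution in progress."""
    state = codepipeline.get_pipeline_state(name=name)
    for stage in state["stageStates"][1:]:  # skip the Source stage itself
        if stage.get("latestExecution", {}).get("status") == "InProgress":
            return True
    return False

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        # Hold the next PR artifact back until the pipeline has drained past Source,
        # so executions don't get superseded mid-pipeline.
        while pipeline_is_busy(PIPELINE_NAME):
            time.sleep(30)

        record = json.loads(msg["Body"])["Records"][0]["s3"]  # assumes an S3 event notification body
        s3.copy_object(
            Bucket=TRIGGER_BUCKET,
            Key=TRIGGER_KEY,
            CopySource={"Bucket": record["bucket"]["name"], "Key": record["object"]["key"]},
        )
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```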
We can very well have dynamic branching support with the following approach.
One of the limitations in AWS code-pipeline is that we have to specify branch names while creating the pipeline. We can however overcome this issue using the architecture shown below.
flow diagram
Create a Lambda function which takes the GitHub webhook data as input and, using boto3, integrates it with the pipeline (pull the pipeline definition and update it). Put an API Gateway in front so the Lambda function can be invoked as a REST call, and finally create a webhook on the GitHub repository; a sketch of the boto3 part follows below.
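A minimal sketch of the boto3 piece of that Lambda; the pipeline name, the webhook payload handling and the source-action configuration key are assumptions that depend on your source provider:

```python
import boto3

codepipeline = boto3.client("codepipeline")

PIPELINE_NAME = "pr-build-pipeline"  # placeholder

def handler(event, context):
    """Hypothetical Lambda handler wired behind API Gateway as the GitHub webhook target."""
    # Assumes a push-event payload; pull-request payloads carry the branch elsewhere.
    branch = event["ref"].split("/")[-1]  # e.g. "refs/heads/feature-x" -> "feature-x"

    # Pull the current pipeline definition.
    pipeline = codepipeline.get_pipeline(name=PIPELINE_NAME)["pipeline"]

    # Point the source action at the branch from the webhook. The configuration key
    # depends on the source provider (e.g. "BranchName" for CodeCommit sources).
    source_action = pipeline["stages"][0]["actions"][0]
    source_action["configuration"]["BranchName"] = branch

    # Push the modified definition back (this bumps the pipeline's version).
    codepipeline.update_pipeline(pipeline=pipeline)
    return {"statusCode": 200, "body": f"pipeline now tracks {branch}"}
```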
External links:
https://aws.amazon.com/quickstart/architecture/git-to-s3-using-webhooks/
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/codepipeline.html
Related thread: Dynamically change branches on AWS CodePipeline

What is the difference between deploy and redeploy in SAS DI Studio?

I am guessing that it might just retain the metadata ID (redeploy) as opposed to generating a new one (deploy); is that the only difference, though?
That is the only difference, but it is very important. You should always redeploy jobs that any flow depends on. If you deploy a job that was already added to a flow, the flow will be damaged.
I disagree with the previous comment by #barjey that the flow is damaged. It is not damaged; it's just that a new metadata ID is created, in the form of a new .sas file for the same job (like JOB_NAME_00000.sas), and deployed.
This adds a lot of confusion, and many versions of the same job float around, which is incorrect. That's the reason a job is always redeployed: so that the previous version of the code is overwritten and the new changes are reflected in the flow.
You redeploy a job to incorporate environmental changes, by automatically identifying the environment to which you have deployed the job. These changes are then reflected at the back end where the job is actually saved (the job which is scheduled in a flow).