Amazon Web Services Data Pipeline - amazon-web-services

Can we use existing EC2 instance details while configuring Data Pipeline? If it is possible, what EC2 details do we need to provide while creating a pipeline?

Yes, it is possible. According to AWS support:
"You can install Task Runner on computational resources that you manage, such as an Amazon EC2 instance, or a physical server or workstation. Task Runner can be installed anywhere, on any compatible hardware or operating system, provided that it can communicate with the AWS Data Pipeline web service.
This approach can be useful when, for example, you want to use AWS Data Pipeline to process data that is stored inside your organization’s firewall. By installing Task Runner on a server in the local network, you can access the local database securely and then poll AWS Data Pipeline for the next task to run. When AWS Data Pipeline ends processing or deletes the pipeline, the Task Runner instance remains running on your computational resource until you manually shut it down. The Task Runner logs persist after pipeline execution is complete."
I did this myself because it takes a while for the pipeline to start up; the start-up time can be 10-15 minutes depending on unknown factors.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-task-runner-user-managed.html
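For reference, here is a minimal sketch (using boto3; the worker-group name, bucket, and pipeline names are placeholders) of what the pipeline side looks like when you run Task Runner yourself. The only "EC2 detail" the pipeline needs is a worker group name that matches what you pass to Task Runner on your instance, instead of an Ec2Resource with a runsOn reference:

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Create an empty pipeline (names/IDs below are hypothetical).
pipeline_id = dp.create_pipeline(
    name="on-instance-pipeline", uniqueId="on-instance-pipeline-1"
)["pipelineId"]

# Instead of defining an Ec2Resource and pointing the activity at it with
# "runsOn", give the activity a "workerGroup". Any Task Runner started with
# the same worker group name will pick the task up.
objects = [
    {
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "pipelineLogUri", "stringValue": "s3://my-bucket/dp-logs/"},
        ],
    },
    {
        "id": "MyActivity",
        "name": "MyActivity",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "echo hello from my own instance"},
            {"key": "workerGroup", "stringValue": "my-workers"},
        ],
    },
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)

# On the EC2 instance (or on-premise box) you manage, start Task Runner with
# the matching worker group, roughly:
#   java -jar TaskRunner-1.0.jar --config credentials.json \
#        --workerGroup=my-workers --region=us-east-1 \
#        --logUri=s3://my-bucket/task-runner-logs/
```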

Related

AWS Batch trigger a Spring Boot microservice

I have the 2 requirements below; can you please share any suggestions?
AWS Batch triggers an ECR Fargate task (on demand)
AWS Batch triggers a Spring app deployed in ECR (running permanently)
So for option 1, I need to start a Spring Boot app, which should start in ECR Fargate. I understood that from AWS Batch we can specify the Fargate cluster, so that when the AWS Batch job runs the app will get started.
For option 2, I have a Spring Boot app deployed in ECR Fargate that is always running, and it contains a Spring Batch job. Now AWS Batch needs to trigger that Spring Batch job. Is this possible? If so, can you please share an implementation sample?
Also, from my client app or program I need to update AWS Batch, saying whether the job succeeded or failed. Can you share a sample for that as well?
AWS Batch only executes ECS tasks, not ECS services. For option 1 - to launch a container (your app that does the work you want) within ECS Fargate, you would need to specify an AWS Batch compute environment as Fargate, a job queue that references the compute environment, and a job definition of the task (what container to run, what command to send, what CPU and memory resources are required). See the Learn AWS Batch workshop or the AWS Batch Getting Started documentation for more information on this.
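As a rough illustration, assuming you have already created a Fargate compute environment, a job queue (here called `fargate-queue`), and a job definition (`spring-app-jobdef`) as described above, submitting and tracking a job from code might look like this boto3 sketch (all names are placeholders):

```python
import time
import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Submit a job to the (hypothetical) Fargate-backed queue; the job definition
# already says which container image to run and what CPU/memory it needs.
job = batch.submit_job(
    jobName="spring-app-run",
    jobQueue="fargate-queue",
    jobDefinition="spring-app-jobdef",
    containerOverrides={"command": ["java", "-jar", "app.jar", "--batch-run"]},
)

# Poll until the job reaches a terminal state.
while True:
    desc = batch.describe_jobs(jobs=[job["jobId"]])["jobs"][0]
    if desc["status"] in ("SUCCEEDED", "FAILED"):
        print("Job finished with status:", desc["status"])
        break
    time.sleep(15)
```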
For option 2 - AWS Batch and Spring Batch are orthogonal solutions. You should just call the Spring Batch API endpoint directly OR rely on AWS Batch. Using both is not recommended unless this is something you don't have control over.
But to answer your question - calling a non-AWS API endpoint is handled in your container and application code. AWS Batch does not prevent this, but you would need to make sure that the container has secure access to the proper credentials to call the Spring Boot app. Once your Batch job calls the API you have two choices:
Immediately exit and track the status of the Spring Batch operations elsewhere (i.e. the Batch job's only task is to call the API, and SUCCESS = "able to send the API request successfully", FAIL = "not able to call the API").
Call the API, then enter a loop where you poll the status of the Spring Batch job until it completes successfully or not, exiting the AWS Batch job with the same state as the Spring Batch job did (a sketch of this follows below).
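For that second choice, the container command run by the AWS Batch job could be a small script along these lines (a sketch only; the `/jobs/run` and `/jobs/{id}/status` endpoints and the status values are hypothetical and depend on how your Spring Boot app exposes its Spring Batch jobs):

```python
import sys
import time
import requests

BASE_URL = "https://my-spring-app.internal"  # hypothetical Spring Boot endpoint

# Trigger the Spring Batch job via the app's (hypothetical) REST API.
resp = requests.post(f"{BASE_URL}/jobs/run", timeout=30)
resp.raise_for_status()
execution_id = resp.json()["executionId"]

# Poll the job status until it reaches a terminal state, then exit the
# AWS Batch job with the same outcome (exit code 0 = SUCCEEDED, non-zero = FAILED).
while True:
    status = requests.get(
        f"{BASE_URL}/jobs/{execution_id}/status", timeout=30
    ).json()["status"]
    if status == "COMPLETED":
        sys.exit(0)
    if status in ("FAILED", "STOPPED", "ABANDONED"):
        sys.exit(1)
    time.sleep(30)
```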

How to run resource intensive tasks with Airflow

We have a long running (3h) model training task which runs every 3 days and smaller prediction pipelines that run daily.
For both cases we use Jenkins + the EC2 plugin to spin up large instances (workers) and run the pipelines on them. This serves two purposes:
Keep pipelines isolated, so every pipeline has all the resources of one instance.
Save costs: large instances run only for several hours, not 24/7.
With Jenkins + the EC2 plugin I am not responsible for copying code to the worker and reporting the result of the execution back; Jenkins does it under the hood.
Is there any way to achieve the same behaviour with Airflow?
Airflow 1.10 released a host of new AWS integrations that give you a few options for doing something like this on AWS.
https://airflow.apache.org/integration.html#aws-amazon-web-services
If you are running your task in a containerized setting, it sounds like the ECSOperator, or the KubernetesPodOperator if you're using Kubernetes, could be what you need.
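As a rough sketch of the ECS route (assuming Airflow 1.10's contrib ECSOperator and placeholder names for the cluster, task definition, and container), the long-running training job could be expressed as an ECS task, so the heavy compute only exists while the task runs:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.ecs_operator import ECSOperator  # Airflow 1.10 contrib path

default_args = {"owner": "data-team", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="model_training",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",  # or a 3-day schedule for the training job
    catchup=False,
) as dag:

    # Placeholder cluster/task-definition/container names; the task definition
    # is what pins the large CPU/memory the training job needs.
    train = ECSOperator(
        task_id="train_model",
        cluster="ml-cluster",
        task_definition="model-training:1",
        launch_type="FARGATE",  # Fargate also needs network configuration (subnets,
                                # security groups), omitted here for brevity
        region_name="us-east-1",
        overrides={
            "containerOverrides": [
                {"name": "trainer", "command": ["python", "train.py"]}
            ]
        },
    )
```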

How would I configure an EC2 instance to download from MySQL and process the data based on a script?

Coming from the serverless framework.
I can currently launch an EC2 instance from Lambda.
I do not know how to then tell that EC2 instance to download all the data for a user from MySQL into memory, perform some row-level operations, and send the intermediate results to DynamoDB or another persistent storage service.
Sub-questions that I have:
Where would the script be stored? If it is a Python script, how would I let the EC2 instance get access to it upon launch?
Can I include an auto-shutdown in the script, so that once the task has completed the instance shuts itself down? If not, I can use DynamoDB to keep track.
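One common pattern for both sub-questions is sketched below with boto3 (the AMI, bucket, script name, and instance profile are placeholders): keep the script in S3, pass a small user-data shell script when the Lambda launches the instance so the instance fetches and runs the script on boot, and end the script with a shutdown; with InstanceInitiatedShutdownBehavior set to terminate, that shutdown also terminates the instance.

```python
import boto3

ec2 = boto3.client("ec2")

# User data runs as root on first boot: fetch the script from S3, run it,
# then power off. With InstanceInitiatedShutdownBehavior="terminate" the
# power-off terminates the instance, so nothing keeps running (and billing).
USER_DATA_TEMPLATE = """#!/bin/bash
aws s3 cp s3://my-bucket/jobs/process_user.py /tmp/process_user.py
python3 /tmp/process_user.py --user-id {user_id}
shutdown -h now
"""

def launch_worker(event, context):
    user_data = USER_DATA_TEMPLATE.format(user_id=event["user_id"])
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI with Python preinstalled
        InstanceType="r5.large",
        MinCount=1,
        MaxCount=1,
        UserData=user_data,
        InstanceInitiatedShutdownBehavior="terminate",
        # Instance profile granting access to S3 (script) and DynamoDB (results).
        IamInstanceProfile={"Name": "worker-instance-profile"},
    )
```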

How do I setup AWS Data Pipeline to copy on-premise Hive data to S3?

I read through the documentation, which talks about MySQL and RDS, but could not find anything on moving on-premise Hive/Hadoop data to S3. I would appreciate any links or articles.
You can use S3DistCp to copy HDFS data from your on-premise cluster to S3 and vice versa.
Normally Data Pipeline instantiates an Ec2Resource instance in the AWS cloud and runs the TaskRunner on this instance. The corresponding activity in the pipeline that is marked as 'runsOn' for the Ec2Resource is then run on this instance. For details refer to the documentation here.
But any S3DistCp running on an EC2 instance will not have access to your on-premise HDFS. To have access to on-premise resources the corresponding activities will have to be executed by a TaskRunner running on an on-premise box. For details on how to set this up refer to the documentation here.
The Task Runner is a standalone Java application provided by AWS that can be run manually on any self-managed box. It connects to the Data Pipeline service over the AWS API to get metadata about tasks pending execution, and then executes them on the same box where it is running.
In the case of automated Ec2Resource provisioning, Data Pipeline instantiates the EC2 instance and runs this same Task Runner on it, all of which is transparent to us.
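Putting the pieces together, the on-premise copy could be a ShellCommandActivity bound to the worker group of a Task Runner you start on an edge node of your Hadoop cluster. A sketch of the pipeline object only (it assumes the on-premise Hadoop client has S3 credentials configured, and uses placeholder bucket and worker-group names):

```python
# Pipeline object for put_pipeline_definition: note "workerGroup" instead of
# a "runsOn" reference to an Ec2Resource, so the activity runs on-premise.
copy_activity = {
    "id": "CopyHiveDataToS3",
    "name": "CopyHiveDataToS3",
    "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        # hadoop distcp (or the S3DistCp jar) run from the on-premise cluster
        {"key": "command",
         "stringValue": "hadoop distcp hdfs:///warehouse/my_table s3a://my-bucket/my_table/"},
        {"key": "workerGroup", "stringValue": "onprem-hadoop-workers"},
    ],
}
```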

Using AWS SQS for Job Queuing but minimizing "workers" uptime

I am designing my first Amazon AWS project and I could use some help with the queue processing.
This service accepts processing jobs, either via an ASP.NET Web API service or a GUI web site (which just calls the API). Each job has one or more files associated with it and some rules about the type of job. I want to queue each job as it comes in, presumably using AWS SQS. The jobs will then be processed by a "worker", which is a Python script with a .NET wrapper. The Python script is an existing batch processor that cannot be altered/customized for AWS, hence the .NET wrapper that manages the AWS portions and passes the correct parameters to Python.
The issue is that we will not have a huge number of jobs, but each job is somewhat compute intensive. One of the reasons to go to AWS was to minimize infrastructure costs. I plan on having the frontend web site (Web API + ASP.NET MVC4 site) run on Elastic Beanstalk. But I would prefer not to have a dedicated worker machine always online polling for jobs, since these workers need to be somewhat "beefier" instances (for processing) and it would cost us a lot for them to mostly sit doing nothing.
Is there a way to run only the web portion on Beanstalk and then have the worker process spin up only if there are items in the queue? I realize I could have a micro "controller" instance always online polling and have it control the compute spin-up, but even that seems like it shouldn't be needed. Can EC2 instances be started based on a non-zero SQS queue size? So basically: the Web API adds a job to the queue, something watches the queue and sees it is non-zero, this triggers the EC2 worker to start, it spins up and polls the queue on startup. It processes until the queue is empty, then something triggers it to shut down.
You can use Auto Scaling in conjunction with SQS to dynamically start and stop EC2 instances. There is an AWS blog post that describes the architecture you are thinking of.
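The gist of that architecture, sketched below with boto3 (the Auto Scaling group, queue, and policy names are placeholders), is an Auto Scaling group of worker instances that normally sits at zero capacity, plus a CloudWatch alarm on the queue depth that fires a scale-out policy when messages appear; a matching alarm or a self-shutdown in the worker scales back to zero when the queue is empty.

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Simple scaling policy: add one worker to the (hypothetical) Auto Scaling
# group, which normally has desired capacity 0.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="job-workers",
    PolicyName="scale-out-on-jobs",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# Alarm: if the queue has any visible messages for a minute, fire the policy.
# A second alarm on an empty queue (or the worker shutting itself down when
# the queue is drained) handles scaling back in.
cloudwatch.put_metric_alarm(
    AlarmName="jobs-waiting",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "job-queue"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```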