I need to run Java code that talks to a MySQL DB in AWS and does some ETL on a nightly schedule. Which AWS service can I use for this?
I would recommend looking at the following:
AWS Glue ETL
AWS Batch
AWS ECS / Fargate scheduled tasks (see the sketch below)
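If the ECS/Fargate route fits (e.g. your Java ETL job is already packaged as a container image), a minimal scheduling sketch with boto3 might look like the following. Everything named here is a placeholder assumption, not taken from your setup: the rule name, cluster, task definition, role, and subnet.

import boto3

events = boto3.client("events")

# All names and ARNs below are placeholders -- substitute your own cluster,
# task definition, role, and subnet.
rule_name = "nightly-etl"
cluster_arn = "arn:aws:ecs:us-east-1:123456789012:cluster/etl-cluster"
task_def_arn = "arn:aws:ecs:us-east-1:123456789012:task-definition/etl-task"
events_role_arn = "arn:aws:iam::123456789012:role/ecsEventsRole"

# Fire every night at 02:00 UTC.
events.put_rule(Name=rule_name, ScheduleExpression="cron(0 2 * * ? *)")

# Point the rule at a Fargate task that runs the containerised Java ETL job.
events.put_targets(
    Rule=rule_name,
    Targets=[{
        "Id": "nightly-etl-task",
        "Arn": cluster_arn,
        "RoleArn": events_role_arn,
        "EcsParameters": {
            "TaskDefinitionArn": task_def_arn,
            "TaskCount": 1,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": ["subnet-0123456789abcdef0"],
                    "AssignPublicIp": "ENABLED",
                },
            },
        },
    }],
)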
Related
I have an on-premises schedule that's too dense, and there are a lot of errors. I'd like to explore options for migrating some of the workload to the cloud. Can I connect to on-premises resources using AWS Batch? I'd like to connect to an on-premises database/warehouse, run some of these jobs in the cloud using Spot instances, and drop the output into S3, but I wasn't sure whether to use AWS Batch or AWS Glue or a combination of the two. Is there a different option?
I have two questions to ask:
So my company has 2 instances of Airflow running, one on a GCP-provisioned cluster and another on an AWS-provisioned cluster. Since GCP has Composer, which helps you manage Airflow, is there a way to sort of integrate the Airflow DAGs on the AWS cluster to be managed by GCP as well?
For batch ETL/streaming jobs (in Python), GCP has Dataflow (Apache Beam). What's the AWS equivalent of that?
Thanks!
No, you can't do that. For now you have to provision and manage Airflow on AWS yourself. There are some options you can choose from: EC2, ECS + Fargate, or EKS.
Dataflow is roughly equivalent to Amazon Elastic MapReduce (EMR) or AWS Batch. Moreover, if you want to keep running your current Apache Beam jobs, you can run Apache Beam on EMR and everything should work the same.
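If you do go with EMR, the Beam pipeline code itself shouldn't need to change; you mainly swap the runner. A minimal sketch, assuming apache-beam is installed on the cluster and treating the S3 paths as placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Same Beam code you would hand to Dataflow; only the runner option changes.
options = PipelineOptions(["--runner=SparkRunner"])

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("s3://my-bucket/input/*.csv")       # placeholder input
     | "SplitFields" >> beam.FlatMap(lambda line: line.split(","))
     | "Write" >> beam.io.WriteToText("s3://my-bucket/output/result"))    # placeholder output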
I use an AWS EMR cluster to run Hive queries. For query optimization purposes, sometimes I need to kill a long-running step but keep the EMR cluster alive so I can keep using it. Is there a way to do that, either in the Hive CLI or the AWS console?
Please refer to the EMR documentation on canceling steps for the details. To cancel steps using the AWS CLI:
aws emr cancel-steps --cluster-id j-2QUAJ7T3OTEI8 --step-ids s-3M8DKCZYYN1QE
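If you prefer to script it, the same call is available in boto3; a minimal sketch using the same IDs as the CLI example above (note that on older EMR releases cancel-steps only applies to steps that are still pending):

import boto3

emr = boto3.client("emr")

# Cancel a single long-running step; the cluster itself stays up.
response = emr.cancel_steps(
    ClusterId="j-2QUAJ7T3OTEI8",
    StepIds=["s-3M8DKCZYYN1QE"],
)
print(response["CancelStepsInfoList"])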
I want to automate a Redshift insert query to be run every day.
We use an AWS environment. I was told using Lambda is not the right approach. Which is the best ETL process to automate a query in Redshift?
For automating SQL on Redshift you have 3 options (at least)
Simple - cron
Use an EC2 instance and set up a cron job on it to run your SQL code.
psql -U youruser -p 5439 -h hostname_of_redshift -f your_sql_file
Feature rich - Airflow (Recommended)
If you have a complex schedule to run, it is worth investing time in learning and using Apache Airflow. This also needs to run on a server (EC2) but offers a lot of functionality (a minimal DAG sketch follows below).
https://airflow.apache.org/
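A minimal DAG sketch, assuming an Airflow connection called redshift_default pointing at your cluster; the SQL is a placeholder and the PostgresOperator import path varies between Airflow versions:

from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator  # import path differs in Airflow 2.x

# Runs the insert once a day; "redshift_default" is an assumed connection id.
with DAG(
    dag_id="redshift_nightly_insert",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_insert = PostgresOperator(
        task_id="run_insert",
        postgres_conn_id="redshift_default",
        sql="INSERT INTO reporting.daily_summary SELECT * FROM staging.events;",  # placeholder SQL
    )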
AWS serverless - AWS Data Pipeline (NOT Recommended)
https://aws.amazon.com/datapipeline/
CloudWatch -> Lambda -> EC2 method, described below by John Rotenstein
This is a good method when you want to stay AWS-centric; it will be cheaper than having a dedicated EC2 instance.
One option:
Use Amazon CloudWatch Events on a schedule to trigger an AWS Lambda function
The Lambda function launches an EC2 instance with a User Data script (see the sketch after this list). Configure Shutdown Behavior as Terminate.
The EC2 instance executes the User Data script
When the script is complete, it should call sudo shutdown -h now to shut down and terminate the instance
The EC2 instance will only be billed per-second.
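A minimal sketch of the Lambda function in step 2, in Python with boto3. The AMI, instance type, and the script inside User Data are placeholder assumptions; the instance needs a psql client and network access to Redshift:

import boto3

def lambda_handler(event, context):
    # User Data script: run the SQL, then shut down (which terminates the instance
    # because InstanceInitiatedShutdownBehavior is set to "terminate").
    user_data = """#!/bin/bash
psql -U youruser -p 5439 -h hostname_of_redshift -f /home/ec2-user/your_sql_file
sudo shutdown -h now
"""

    ec2 = boto3.client("ec2")
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI with psql installed
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        UserData=user_data,
        InstanceInitiatedShutdownBehavior="terminate",
    )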
Redshift now supports scheduled queries natively: https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-schedule-query.html
You can use boto3 and psycopg2 to run the queries by creating a Python script and scheduling it in cron to be executed daily.
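A minimal sketch of such a script, with the connection details taken from environment variables and a placeholder query; boto3 only comes in if you also want to fetch credentials or manage the cluster from the same script:

import os
import psycopg2

# Connection details are assumed to come from the environment.
conn = psycopg2.connect(
    host=os.environ["REDSHIFT_HOST"],
    port=5439,
    dbname=os.environ["REDSHIFT_DB"],
    user=os.environ["REDSHIFT_USER"],
    password=os.environ["REDSHIFT_PASSWORD"],
)

with conn, conn.cursor() as cur:
    # Placeholder query -- replace with your daily insert.
    cur.execute("INSERT INTO reporting.daily_summary SELECT * FROM staging.events;")

conn.close()

A cron entry such as 0 2 * * * python /home/ec2-user/run_insert.py would then run it nightly.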
You can also try to convert your queries into Spark jobs and schedule those jobs to run in AWS Glue daily. If you find that difficult, you can also look into Spark SQL and give it a shot. If you go with Spark SQL, keep memory usage in mind, as Spark SQL is pretty memory-intensive.
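If you do go the Glue route, the job skeleton usually looks something like the following; the table name, query, and output path are placeholders, and it assumes the source table is visible to the job (e.g. via the Glue Data Catalog):

import sys

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Placeholder Spark SQL -- assumes "staging_events" is a table the job can see.
result = spark.sql("SELECT event_date, COUNT(*) AS n FROM staging_events GROUP BY event_date")
result.write.mode("overwrite").parquet("s3://my-bucket/daily-summary/")  # placeholder path

job.commit()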
I am running a Spark cluster on AWS EMR. How do I get all the details of the jobs and executors that are running on AWS EMR without using the Spark UI? I am going to use it for monitoring and optimization.
You can check out Nagios or Ganglia for cluster health, but you can't see the jobs running on Spark with these tools.
If you are using AWS EMR, you can do that using the lynx text browser, something like below.
Log in to the master node of the cluster.
Try the command below:
lynx http://localhost:4040
Note: before you type the command, make sure you are running a job.
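If you want to collect this programmatically for monitoring rather than browse it in lynx, the same UI on port 4040 also exposes the data as JSON under /api/v1; a minimal sketch, run on the master node while a job is active:

import requests

base = "http://localhost:4040/api/v1"

# List running applications, then pull their jobs and executors.
for app in requests.get(f"{base}/applications").json():
    app_id = app["id"]
    jobs = requests.get(f"{base}/applications/{app_id}/jobs").json()
    executors = requests.get(f"{base}/applications/{app_id}/executors").json()
    print(app_id, len(jobs), "jobs,", len(executors), "executors")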