I have a notebook created in Databricks, and I would like to run it as a job on demand from AWS Lambda. That is, when a file arrives in my S3 bucket, I would like to run the Databricks notebook job for ETL.
The Databricks cluster has the autotermination_minutes parameter set to 60. Sometimes my job does not run because the cluster has auto-terminated after being idle for 60 minutes. Is there any way I can restart the cluster from AWS Lambda before running the job?
Thanks
Yes, you can, using the Databricks Clusters API.
You just need to pass the cluster ID to the start endpoint.
Here is the link
https://docs.databricks.com/api/latest/clusters.html#start
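As a rough sketch of what that call could look like from a Lambda function: the code below POSTs the cluster ID to the `clusters/start` endpoint using only the standard library. The environment variable names (`DATABRICKS_HOST`, `DATABRICKS_TOKEN`, `DATABRICKS_CLUSTER_ID`) and the use of a personal access token are assumptions for illustration, not something from the question.

```python
import json
import os
import urllib.request


def build_start_request(host: str, token: str, cluster_id: str) -> urllib.request.Request:
    """Build a POST request for the Clusters API 'start' endpoint."""
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/clusters/start",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def lambda_handler(event, context):
    # Hypothetical environment variables; configure them on the Lambda function.
    request = build_start_request(
        os.environ["DATABRICKS_HOST"],
        os.environ["DATABRICKS_TOKEN"],
        os.environ["DATABRICKS_CLUSTER_ID"],
    )
    # The call returns once the start has been accepted, not once the cluster is ready.
    with urllib.request.urlopen(request, timeout=30) as response:
        return {"status": response.status}
```

The start request is asynchronous, so the cluster will still be coming up when the handler returns. If the job must not be submitted until the cluster is running, the function would also need to poll the cluster state before triggering it.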
Related
I need to run Java code that talks to a MySQL database in AWS and does some ETL nightly. Which AWS service can I use for this?
I would recommend looking at the following:
AWS Glue ETL
AWS Batch
AWS ECS / Fargate scheduled tasks
How can I turn EMR clusters on and off? The only option I see is to terminate a cluster permanently. What if I do not need the cluster at night and I do not want to create a new cluster every morning?
You can't do this. Stopping an EMR cluster is not supported. You simply terminate it when you don't need it.
To protect your data, you should be using EMRFS, which allows an EMR cluster to read data directly from S3. This way, there is no need to copy any data from S3 to HDFS.
You can enable the scale-up/scale-down policies available in the EMR UI and resize your cluster based on multiple metrics, e.g. RAM or CPU utilization. You can also create an external job that sends scale-up/scale-down commands to EMR via awscli, and schedule such jobs to run in the morning and in the evening.
From my experience, resizing works well on task nodes, while resizing core nodes requires an HDFS sync that works only if you are not running any tasks on the cluster.
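For the scheduled resize idea, here is a minimal sketch using boto3's `modify_instance_groups` instead of awscli. The cluster ID, instance group ID, node count and environment variable names are placeholders I made up; the function takes the EMR client as an argument so it can be called from a cron script or a Lambda handler alike.

```python
import os


def resize_instance_group(emr_client, cluster_id, instance_group_id, instance_count):
    """Ask EMR to resize one instance group (e.g. the task group) to a new node count."""
    if instance_count < 0:
        raise ValueError("instance_count must be zero or greater")
    emr_client.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[
            {"InstanceGroupId": instance_group_id, "InstanceCount": instance_count}
        ],
    )


def lambda_handler(event, context):
    import boto3  # imported here so the helper above has no AWS dependency

    # Hypothetical environment variables and event field, for illustration only.
    resize_instance_group(
        boto3.client("emr"),
        os.environ["EMR_CLUSTER_ID"],
        os.environ["EMR_TASK_GROUP_ID"],
        int(event.get("instance_count", 0)),
    )
```

One schedule could invoke this with a larger count in the morning and another with 0 in the evening, which matches the task-node case described above.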
How can I run my PySpark code on AWS EMR from AWS Lambda? Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?
You need a transient cluster for this case, which will auto-terminate once your job is completed or the timeout is reached, whichever occurs first.
You can access this link on how to initialise the same.
What are the processes available to create an EMR cluster:
Using boto3 / AWS CLI / Java SDK
Using CloudFormation
Using Data Pipeline
Do I have to use AWS Lambda to create an auto-terminating EMR cluster to run my S3-stored code once?
No. It isn't mandatory to use Lambda to create an auto-terminating cluster.
You just need to specify the --auto-terminate flag when creating the cluster with the AWS CLI (the boto3 and Java SDK equivalent is setting KeepJobFlowAliveWhenNoSteps to false). In this case you need to submit the job along with the cluster config. Ref
Note:
It's not possible to create an auto-terminating cluster using CloudFormation. By design, CloudFormation assumes that the resources being created will be permanent to some extent.
If you REALLY had to do it this way, you could make an AWS API call to delete the CF stack upon finishing your EMR tasks.
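To illustrate the boto3 route, here is a sketch of a `run_job_flow` request for a transient cluster that runs one PySpark script from S3 and then terminates. The release label, instance types, node counts, cluster name and the default role names are assumptions; adjust them to whatever your account uses.

```python
def transient_cluster_request(script_s3_uri, log_s3_uri):
    """Build run_job_flow arguments for a cluster that terminates after its only step."""
    return {
        "Name": "transient-pyspark-job",   # hypothetical name
        "ReleaseLabel": "emr-6.15.0",      # assumed release; pick your own
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_s3_uri,
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # This is what makes the cluster auto-terminate once all steps are done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "run-pyspark-script",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes the default EMR roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(emr_client, script_s3_uri, log_s3_uri):
    """Create the cluster and return its ID."""
    response = emr_client.run_job_flow(**transient_cluster_request(script_s3_uri, log_s3_uri))
    return response["JobFlowId"]
```

Calling `launch(boto3.client("emr"), "s3://my-bucket/job.py", "s3://my-bucket/logs/")` (bucket names made up) works the same from a laptop, a cron job or a Lambda handler, which is why Lambda is optional here.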
How can I run my PySpark code on AWS EMR from AWS Lambda?
You can design your Lambda to submit the Spark job.
You can find an example here
In my use case I have one parameterised Lambda which invokes CF to create the cluster, submit the job, and terminate the cluster.
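If the cluster already exists, a Lambda can submit the Spark job as an EMR step rather than going through CloudFormation. This is only a sketch of that alternative, not the setup described above; the environment variable names and step name are hypothetical.

```python
import os


def spark_step(script_s3_uri, name="pyspark-step"):
    """Describe an EMR step that runs spark-submit on a script stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
        },
    }


def submit_spark_step(emr_client, cluster_id, script_s3_uri):
    """Add the step to a running cluster and return the new step ID."""
    response = emr_client.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[spark_step(script_s3_uri)],
    )
    return response["StepIds"][0]


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above have no AWS dependency

    # Hypothetical environment variables set on the Lambda function.
    step_id = submit_spark_step(
        boto3.client("emr"),
        os.environ["EMR_CLUSTER_ID"],
        os.environ["PYSPARK_SCRIPT_S3_URI"],
    )
    return {"step_id": step_id}
```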
I want to automate a Redshift insert query to run every day.
We use an AWS environment. I was told that using Lambda is not the right approach. What is the best ETL process for automating a query in Redshift?
For automating SQL on Redshift you have 3 options (at least):
Simple - cron
Use an EC2 instance and set up a cron job on it to run your SQL code.
psql -U youruser -p 5439 -h hostname_of_redshift -d your_database -f your_sql_file
Feature rich - Airflow (Recommended)
If you have a complex schedule to run, then it is worth investing time in learning and using Apache Airflow. This also needs to run on a server (EC2) but offers a lot of functionality.
https://airflow.apache.org/
AWS serverless - AWS Data Pipeline (NOT Recommended)
https://aws.amazon.com/datapipeline/
Cloudwatch->Lambda->EC2 method described below by John Rotenstein
This is a good method when you want to be AWS-centric; it will be cheaper than having a dedicated EC2 instance.
One option:
Use Amazon CloudWatch Events on a schedule to trigger an AWS Lambda function
The Lambda function launches an EC2 instance with a User Data script. Configure Shutdown Behavior as Terminate.
The EC2 instance executes the User Data script
When the script is complete, it should call sudo shutdown now -h to shut down and terminate the instance
The EC2 instance will only be billed per-second.
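The steps above could be sketched roughly as follows. The AMI ID, instance type, environment variable names and the command placed in the User Data script are placeholders, and the AMI is assumed to already have whatever the command needs (for example psql and the SQL file).

```python
import os


def build_user_data(command):
    """Shell script run at first boot: do the work, then power off the instance."""
    # No 'set -e', so the shutdown runs even if the command fails.
    return "\n".join(["#!/bin/bash", command, "shutdown -h now", ""])


def build_run_instances_params(ami_id, command, instance_type="t3.micro"):
    """Arguments for ec2.run_instances for a one-shot worker instance."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,  # assumed size; choose what the job needs
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": build_user_data(command),
        # With this setting, an OS-level shutdown terminates the instance.
        "InstanceInitiatedShutdownBehavior": "terminate",
    }


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above have no AWS dependency

    # Hypothetical environment variables set on the Lambda function.
    params = build_run_instances_params(os.environ["WORKER_AMI_ID"], os.environ["WORKER_COMMAND"])
    response = boto3.client("ec2").run_instances(**params)
    return {"instance_id": response["Instances"][0]["InstanceId"]}
```

User Data scripts run as root, so `sudo` is not needed inside the script itself.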
Redshift now supports scheduled queries natively: https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-schedule-query.html
You can use boto3 and psycopg2 to run the queries by creating a Python script and scheduling it in cron to be executed daily.
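A minimal sketch of such a script, using psycopg2 only (boto3 would come in if you fetched credentials from AWS, which is left out here). The environment variable names are assumptions, and the connection is passed in as a factory so the query logic stays separate from the driver.

```python
import os


def run_statement(connect, sql):
    """Open a connection, run one SQL statement, commit, and always close."""
    conn = connect()
    try:
        cur = conn.cursor()
        try:
            cur.execute(sql)
        finally:
            cur.close()
        conn.commit()
    except Exception:
        conn.rollback()  # leave nothing half-applied if the statement fails
        raise
    finally:
        conn.close()


def main():
    import psycopg2  # imported here so run_statement works with any DB-API driver

    def connect():
        # Hypothetical environment variables holding the connection details.
        return psycopg2.connect(
            host=os.environ["REDSHIFT_HOST"],
            port=5439,
            dbname=os.environ["REDSHIFT_DB"],
            user=os.environ["REDSHIFT_USER"],
            password=os.environ["REDSHIFT_PASSWORD"],
        )

    with open(os.environ["SQL_FILE"]) as sql_file:
        run_statement(connect, sql_file.read())
```

Call `main()` from the script's entry point and point a daily crontab entry at the script.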
You can also try to convert your queries into Spark jobs and schedule those jobs to run in AWS Glue daily. If you find it difficult, you can also look into Spark SQL and give it a shot. If you are going with Spark SQL, keep in mind the memory usage as Spark SQL is pretty memory intensive.
I'm running an instance in Amazon AWS and it runs non-stop every day. I'm using an Ubuntu EC2 instance which runs Apache, the Mirthconnect tool and a LAMP server. I want to run this instance only during a particular time window each day. I would prefer not to use any additional AWS services such as CloudWatch. Is there a way we could achieve this?
The main purpose is to use Mirthconnect to fetch data from a MySQL database.
There are 3 solutions.
AWS Data Pipeline - You can schedule the instance start/stop just like cron. It will cost you one hour of t1.micro instance for every start/stop
AWS Lambda - Define a Lambda function that gets triggered at a predefined time. Your Lambda function can start/stop instances. Your cost will be minimal or $0
Write a shell script and run it as a cron job or run it on demand. The script will use AWS CLI commands to start and stop the instance.
I used Data Pipeline for a long time before moving to Lambda. Data Pipeline is trivial to set up: just paste the AWS CLI commands to stop and start instances. Lambda is more involved.
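For the Lambda option, the function body can be quite small. This is a sketch assuming two scheduled triggers that each pass a hypothetical event such as `{"action": "start", "instance_ids": ["i-EXAMPLE"]}`; the event shape is my own invention.

```python
def apply_action(ec2_client, action, instance_ids):
    """Start or stop the given EC2 instances."""
    if action == "start":
        ec2_client.start_instances(InstanceIds=instance_ids)
    elif action == "stop":
        ec2_client.stop_instances(InstanceIds=instance_ids)
    else:
        raise ValueError(f"unknown action: {action!r}")


def lambda_handler(event, context):
    import boto3  # imported here so apply_action has no AWS dependency

    # 'action' and 'instance_ids' are hypothetical fields supplied by the schedule.
    apply_action(boto3.client("ec2"), event["action"], event["instance_ids"])
    return {"action": event["action"], "instance_ids": event["instance_ids"]}
```

The same `apply_action` helper could also be called from a cron script on another machine if you want to avoid Lambda entirely.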
I guess for that you'll need another machine that is on 24x7, on which you can write a cron job in Python using boto, or in any other language such as bash.
I don't see how you can start an instance that is in the stopped state without using any other machine.
Or you can have a simple Raspberry Pi running at home which does the ON-OFF work for you using the AWS CLI or simple Python. How about that? ;)