How to schedule the HDFS Balancer in Cloudera Manager? - hdfs

Does anyone have an idea of how to schedule the HDFS balancer in Cloudera Manager so that it runs on a regular basis - probably every Saturday?

You can use the hdfs balancer command and put it in the crontab on one of your nodes:
sudo -u hdfs hdfs balancer -threshold 5

I have scheduled this in my hdfs user's crontab and it works fine:
# rebalancing HDFS weekly job
00 10 * * SAT hdfs balancer -threshold 5 >> /app/hadoop_users/sf/rebal_log.txt

Related

What is the best Airflow architecture for AWS EMR clusters?

I have an AWS EMR cluster with 1 master node, 30 core nodes, and some auto-scaled task nodes.
Currently, hundreds of Hive and MySQL jobs are run by Oozie on the cluster.
I'm going to move some jobs from Oozie to Airflow.
I googled how to apply Airflow to my cluster.
I found claims that all DAGs should be located on every node and that an Airflow worker must be installed on every node.
But my DAGs will be updated frequently and new DAGs will be added frequently, while the cluster has about 100 nodes, some of them auto-scaled.
And, as you know, only the master node has the Hive/MySQL applications on the cluster.
So I am very confused.
Can anyone tell me what Airflow architecture to apply to my EMR cluster?
Airflow worker nodes are not the same as EMR nodes.
In a typical setup, a Celery worker (an "Airflow worker node") reads from a queue of jobs and executes them using the appropriate operator (in this case probably a SparkSubmitOperator or possibly an SSHOperator).
Celery workers would not run on your EMR nodes, as those are dedicated to running Hadoop jobs.
Celery workers would likely run on EC2 instances outside of your EMR cluster.
One common solution for having the same DAGs on every Celery worker is to put the DAGs on network storage (such as EFS) and mount the network drive on the Celery worker EC2 instances.
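For illustration, here is a minimal sketch of such a DAG, assuming Airflow 2.x with the SSH provider installed (apache-airflow-providers-ssh); the connection id "emr_master_ssh" and the HQL script path are hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="emr_hive_job",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the Hive script on the EMR master node, which is where the
    # Hive client is installed; the Celery worker itself stays outside EMR.
    run_hive = SSHOperator(
        task_id="run_hive_query",
        ssh_conn_id="emr_master_ssh",
        command="hive -f /home/hadoop/etl/daily_job.hql",
    )

With this pattern, only the Celery workers need the DAG files; the EMR nodes never run any Airflow components at all.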

AWS cron job every minute on an EC2 free instance

I have 1 AWS EC2 free-tier instance where I have deployed my website, and now I want to set up a cron job (crontab) that executes a PHP file once per minute. Do I have to purchase anything on AWS, or can I run cron every minute on the free instance too?
Cron jobs have nothing to do with AWS services, so think of this as part of your website's functionality. What you need to do is log into your server through SSH and write the cron job you need.

AWS: Automating queries in redshift

I want to automate a Redshift insert query to run every day.
We use an AWS environment. I was told that using Lambda is not the right approach. What is the best ETL process for automating a query in Redshift?
For automating SQL on Redshift you have several options:
Simple - cron
Use an EC2 instance and set up a cron job on it to run your SQL code:
psql -U youruser -p 5439 -h hostname_of_redshift -f your_sql_file
Feature rich - Airflow (Recommended)
If you have a complex schedule to run, then it is worth investing time in learning and using Apache Airflow. This also needs to run on a server (EC2), but it offers a lot of functionality; a sketch of a DAG follows the link below.
https://airflow.apache.org/
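As a minimal sketch, assuming Airflow 2.x with the Postgres provider installed (Redshift speaks the Postgres wire protocol); the connection id, table names, and SQL here are hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="redshift_daily_insert",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Runs the insert against Redshift through a Postgres-style connection
    daily_insert = PostgresOperator(
        task_id="daily_insert",
        postgres_conn_id="redshift_default",
        sql="""
            INSERT INTO reporting.daily_totals
            SELECT order_date, SUM(amount) FROM sales GROUP BY order_date;
        """,
    )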
AWS serverless - AWS data pipeline (NOT Recommended)
https://aws.amazon.com/datapipeline/
CloudWatch->Lambda->EC2 method described below by John Rotenstein
This is a good method when you want to be AWS-centric, and it will be cheaper than having a dedicated EC2 instance.
One option:
Use Amazon CloudWatch Events on a schedule to trigger an AWS Lambda function
The Lambda function launches an EC2 instance with a User Data script. Configure Shutdown Behavior as Terminate.
The EC2 instance executes the User Data script
When the script is complete, it should call sudo shutdown now -h to shut down and terminate the instance
The EC2 instance will only be billed per-second.
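A minimal sketch of that Lambda function using boto3 (which is available in the Lambda runtime); the AMI id, instance type, and SQL file path are hypothetical, and the user data script assumes psql and credentials (e.g. a .pgpass file) are already baked into the AMI:

import boto3

USER_DATA = """#!/bin/bash
psql -U youruser -p 5439 -h hostname_of_redshift -f /home/ec2-user/your_sql_file.sql
sudo shutdown now -h
"""

def lambda_handler(event, context):
    ec2 = boto3.client("ec2")
    # With shutdown behavior "terminate", the shutdown at the end of the
    # user data script terminates (not just stops) the instance.
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical AMI with psql preinstalled
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        UserData=USER_DATA,
        InstanceInitiatedShutdownBehavior="terminate",
    )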
Redshift now supports scheduled queries natively: https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-schedule-query.html
You can use boto3 and psycopg2 to run the queries by creating a Python script and scheduling it in cron to be executed daily.
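A minimal sketch of such a script; the cluster identifier, database, user, and SQL are hypothetical, and boto3 is used here to fetch temporary credentials so no password has to be stored in the script:

import boto3
import psycopg2

def run_daily_insert():
    # Fetch short-lived credentials for the cluster instead of storing a password
    redshift = boto3.client("redshift", region_name="us-east-1")
    creds = redshift.get_cluster_credentials(
        DbUser="youruser",
        DbName="yourdb",
        ClusterIdentifier="your-cluster",
    )
    conn = psycopg2.connect(
        host="hostname_of_redshift",
        port=5439,
        dbname="yourdb",
        user=creds["DbUser"],
        password=creds["DbPassword"],
    )
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO reporting.daily_totals "
                "SELECT order_date, SUM(amount) FROM sales GROUP BY order_date"
            )
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    run_daily_insert()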
You can also try to convert your queries into Spark jobs and schedule those jobs to run in AWS Glue daily. If you find that difficult, you can also look into Spark SQL and give it a shot. If you go with Spark SQL, keep the memory usage in mind, as Spark SQL is fairly memory-intensive.
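If you go the Spark SQL route, the job body can be as small as a single SQL statement; here is a minimal sketch (the table names are hypothetical, and a production Glue job would usually reach Redshift through a Glue connection rather than catalog tables):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily_insert").enableHiveSupport().getOrCreate()

# The same daily insert expressed as Spark SQL over catalog tables
spark.sql("""
    INSERT INTO TABLE reporting.daily_totals
    SELECT order_date, SUM(amount)
    FROM sales
    GROUP BY order_date
""")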

Databricks auto_termination is set to 60 minutes, and my job fails

I have a notebook created in Databricks, and I would like to run this job on demand from AWS Lambda. That is, when a file arrives in my S3 bucket, I would like to run the Databricks notebook job for my ETL purposes.
The Databricks cluster has the autotermination_minutes parameter set to 60 minutes. Sometimes my job does not run, since the cluster auto-terminates when it is idle for 60 minutes. Is there any way I can restart the cluster from AWS Lambda before running the job?
Thanks
Yes you can, using the Databricks Clusters API.
You just need to pass the cluster id.
Here is the link:
https://docs.databricks.com/api/latest/clusters.html#start
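A minimal sketch of calling that endpoint from a Lambda function with the requests library; the workspace URL, token, and cluster id are hypothetical and would normally come from environment variables or Secrets Manager:

import requests

DATABRICKS_HOST = "https://your-workspace.cloud.databricks.com"
TOKEN = "dapi-your-token"  # hypothetical personal access token
CLUSTER_ID = "0101-120000-abcd123"  # hypothetical cluster id

def start_cluster():
    # Ask the workspace to start the terminated cluster; you can poll
    # the clusters/get endpoint until it reports RUNNING before
    # submitting the notebook job.
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/start",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"cluster_id": CLUSTER_ID},
    )
    resp.raise_for_status()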

Is it possible to have cron job running on Amazon EC2 instance?

I'm currently working on a scraping tool for information analysis, and I put a cron job on an AWS EC2 instance (Ubuntu 14 LTS server).
The cron job runs by executing a Laravel artisan command.
The following is what I have entered in crontab -e:
0 20 * * * php /var/www/html/artisan data:get
But this doesn't run every day, and I found that the cron service had stopped for no reason.
Is it possible to have a cron job on an AWS EC2 instance?
If not, what's the solution?
Yes, cron should just run on Amazon EC2 instances.
Did you look at the answers to this question?
Cannot get cron to work on Amazon EC2?