I am new to Airflow. I am trying to schedule a job that uses BashOperator to run a spark-submit command. It works, but it keeps Airflow busy until the Spark job completes.
cmd = "ssh hadoop@<ipaddress> spark-submit \
--master yarn \
--deploy-mode cluster \
--executor-memory 2g \
--executor-cores 2 \
t = BashOperator(task_id='task1',bash_command=cmd,dag=dag)
How can I make Airflow just submit the bash command and move on to the next task?
I am currently running Airflow on a standalone EC2 machine.
Also, how can I make Airflow run multiple tasks at the same time?
Is there a GCP command for creating OR replacing a Cloud Run job? I'm using GitHub Actions to create Cloud Run and Cloud Scheduler jobs, and I have to keep switching between these commands:
gcloud alpha run jobs create
gcloud alpha run jobs update
Is there a way to create the job and overwrite it if it already exists?
gcloud beta run jobs deploy was recently added to gcloud and does what you're looking for: it creates the job if it does not exist and updates it otherwise. See the gcloud reference documentation for details.
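As a minimal sketch of how the deploy command mentioned above might be called from a CI step: the job name, image URL, and region below are placeholders (assumptions, not values from the question), and the DRY_RUN switch is a hypothetical convenience so the command can be inspected before anything is deployed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder values (assumptions); override them via the environment.
JOB_NAME="${JOB_NAME:-my-job}"
IMAGE_URL="${IMAGE_URL:-gcr.io/my-project/my-image:latest}"
REGION="${REGION:-europe-west1}"

# Create the job if it is missing, update it otherwise.
# With DRY_RUN=1 (the default here) the command is only printed.
deploy_job() {
  local cmd=(gcloud beta run jobs deploy "$1" --image "$2" --region "$3")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

deploy_job "$JOB_NAME" "$IMAGE_URL" "$REGION"
```

Setting DRY_RUN=0 in the workflow would run the same command for real, so a single step covers both the first deployment and later updates.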
To create a new Cloud Run job:
gcloud beta run jobs create JOB_NAME --image IMAGE_URL OPTIONS
To update an existing job:
gcloud beta run jobs update JOB_NAME
If you want a single command that handles both creation and update, you can write your own shell script. For example:
#!/usr/bin/env bash
set -e
set -o pipefail
set -u
export JOB_NAME=my_job
# Capture the describe output, or the marker NOT_EXIST if the job is not found.
res=$(gcloud beta run jobs describe "$JOB_NAME" --region=europe-west1 || echo "NOT_EXIST")
echo "#######Result : $res"
if [ "$res" = "NOT_EXIST" ]; then
  echo "Creating your job..."
  # Add --image and any other options your job needs.
  gcloud beta run jobs create "$JOB_NAME"
else
  echo "Updating your job..."
  gcloud beta run jobs update "$JOB_NAME"
fi
I intend to create an auto-terminating EMR cluster that executes a Spark application and shuts down.
If I submit the application as a step to an existing cluster that does not auto-terminate, using the following command, it works and the application completes in 3 minutes.
aws emr add-steps --cluster-id xxx \
--steps Name=imdbetlapp,Jar=command-runner.jar,Args=\
s3://bucketname/run_etl.py],ActionOnFailure=CONTINUE --region us-east-1
However, when I use the following command to create an auto-terminating cluster with a step, the application keeps running for more than 30 minutes.
aws emr create-cluster --applications Name=Hadoop Name=Spark --use-default-roles \
--bootstrap-actions Path=s3://bucketname/emr_bootstrap.sh,Name=installPython \
--log-uri s3://logbucketname/elasticmapreduce/ \
--configurations https://s3.amazonaws.com/bucketname/emr_configurations.json \
--steps Name=imdbetlapp,Jar=command-runner.jar,Args=[spark-submit,--deploy-mode,cluster,\
--files,s3://bucketname/etl_module/aws_config.cfg,s3://bucketname/run_etl.py] \
--release-label emr-5.29.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
InstanceGroupType=CORE,InstanceCount=1,InstanceType=m4.large \
--auto-terminate --region us-east-1
What am I missing?
I have zipped my ETL Python module and uploaded it along with the actual folder and the configuration file aws_config.cfg. It works perfectly when submitted as a step to an existing cluster: I can see output being written to another S3 bucket. However, if I issue a CLI command to create a cluster and execute the step, the step keeps executing forever.
I created a cluster on Dataproc and it works great. However, after the cluster has been idle for a while (~90 min), the master node stops automatically. This happens to every cluster I create. There is a similar question here: Keep running Dataproc Master node
It looks like an initialization action problem, but that post does not give me enough information to fix the issue. Below is the command I used to create the cluster:
gcloud dataproc clusters create $CLUSTER_NAME \
--project $PROJECT \
--bucket $BUCKET \
--region $REGION \
--zone $ZONE \
--master-machine-type $MASTER_MACHINE_TYPE \
--master-boot-disk-size $MASTER_DISK_SIZE \
--worker-boot-disk-size $WORKER_DISK_SIZE \
--num-workers=$NUM_WORKERS \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh,gs://dataproc-initialization-actions/datalab/datalab.sh \
--metadata gcs-connector-version=$GCS_CONNECTOR_VERSION \
--metadata bigquery-connector-version=$BQ_CONNECTOR_VERSION \
--scopes cloud-platform \
--metadata JUPYTER_CONDA_PACKAGES=numpy:scipy:pandas:scikit-learn \
--optional-components=ANACONDA,JUPYTER \
I need the BigQuery connector, GCS connector, Jupyter and DataLab for my cluster.
How can I keep my master node running? Thank you.
As summarized in the comment thread, this is indeed caused by Datalab's auto-shutdown feature. There are a couple of ways to change this behavior:
Upon first creating the Datalab-enabled Dataproc cluster, log in to Datalab and click on the "Idle timeout in about ..." text to disable it: https://cloud.google.com/datalab/docs/concepts/auto-shutdown#disabling_the_auto_shutdown_timer - The text will change to "Idle timeout is disabled"
Edit the initialization action to set the environment variable as suggested by yelsayed:
function run_datalab(){
  # DATALAB_DISABLE_IDLE_TIMEOUT_PROCESS=true turns off Datalab's idle auto-shutdown.
  if docker run -d --restart always --net=host -e "DATALAB_DISABLE_IDLE_TIMEOUT_PROCESS=true" \
      -v "${DATALAB_DIR}:/content/datalab" ${VOLUME_FLAGS} datalab-pyspark; then
    echo 'Cloud Datalab Jupyter server successfully deployed.'
  else
    err 'Failed to run Cloud Datalab'
  fi
}
Then use your custom initialization action instead of the stock gs://dataproc-initialization-actions one. It could also be worth filing a tracking issue in the GitHub repo for Dataproc initialization actions, suggesting that the timeout be disabled by default or that an easy metadata-based option be provided. The auto-shutdown behavior is probably not what most users expect on a Dataproc cluster, since the master also performs roles other than running the Datalab service.
How can I run a periodic job in the background on an EMR cluster?
I have script.sh with a cron job and application.py in S3, and I want to launch the cluster with this command:
aws emr create-cluster \
--name "Test cluster" \
--release-label emr-5.12.0 \
--applications Name=Hive Name=Pig Name=Ganglia Name=Spark \
--ec2-attributes KeyName=myKey \
--instance-type m3.xlarge \
--instance-count 3 \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,
Finally, I want the cron job from script.sh to execute application.py.
I don't understand how to install cron on the master node. The Python file also needs some libraries, which should be installed too.
You need to SSH into the master node, and then perform the crontab setup from there, not on your local machine:
Connect to the Master Node Using SSH
Secure Shell (SSH) is a network protocol you can use to create a
secure connection to a remote computer. After you make a connection,
the terminal on your local computer behaves as if it is running on the
remote computer. Commands you issue locally run on the remote
computer, and the command output from the remote computer appears in
your terminal window.
When you use SSH with AWS, you are connecting to an EC2 instance,
which is a virtual server running in the cloud. When working with
Amazon EMR, the most common use of SSH is to connect to the EC2
instance that is acting as the master node of the cluster.
Using SSH to connect to the master node gives you the ability to
monitor and interact with the cluster. You can issue Linux commands on
the master node, run applications such as Hive and Pig interactively,
browse directories, read log files, and so on. You can also create a
tunnel in your SSH connection to view the web interfaces hosted on the
master node. For more information, see View Web Interfaces Hosted on
Amazon EMR Clusters.
cron is installed by default on most Linux systems, so you don't need to install it manually.
To schedule a Spark job with cron, follow these steps:
Log in to the master node (SSH into the master).
Run the command
crontab -e
Add the line below to the crontab, then save and exit (:wq in vi):
*/15 * * * * /script-path/script.sh
cron will now run the script every 15 minutes.
Please refer to this link to learn more about cron.
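To see which values a step expression such as */15 selects in a crontab field, here is a small illustrative sketch. The helper expand_step is hypothetical (it is not part of cron) and handles only the simple "*/N" form. Note that the hour field matters as well: a * there means every hour, while a literal 0 restricts the job to the hour starting at midnight.

```shell
#!/usr/bin/env bash
set -euo pipefail

# List the values that a cron step field "*/STEP" matches in the range 0..MAX.
expand_step() {
  local step="$1" max="$2" out=() v
  for ((v = 0; v <= max; v += step)); do
    out+=("$v")
  done
  echo "${out[*]}"
}

# Minute field (0..59) with */15, and hour field (0..23) with */6.
echo "minute */15 -> $(expand_step 15 59)"
echo "hour   */6  -> $(expand_step 6 23)"
```

So a minute field of */15 combined with * in the hour field fires at minutes 0, 15, 30, and 45 of every hour.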
Hope this helps.
I have built a local cluster on my laptop (pseudo-distributed mode), where I run different MapReduce commands like
hadoop-streaming -D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-files my_mapper.py,my_reducer.py \
-mapper my_mapper.py \
-reducer my_reducer.py \
-input /aws/input/input_warc.txt \
-output /aws/output
Now I have to run it on EMR. There are two options: the console and the AWS CLI. I want to run exactly the same commands as above. I think that if I SSH to the EMR master, I should be able to run this command. Is this the right way, or does this approach have any drawbacks?
Yes, you may SSH to your cluster and run your jobs there, but you may also use the Step API (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-steps.html) to run arbitrary commands on the master instance, including distributed jobs like your example. You can add Steps to a cluster using the AWS CLI ("aws emr add-steps ...", or during cluster creation with "aws emr create-cluster ... --steps ..."), the AWS SDKs (such as the AWS Java SDK), or the AWS EMR Console.
Some advantages of the Step API include that it captures the output of each step so that you can view it via the AWS CLI, SDK, or AWS Console, and you can also check the status of Steps to determine when they have completed.
One disadvantage of the Step API is that currently Steps all run sequentially, so you can't have multiple Steps running in parallel.
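As a rough sketch of the Step API route described above, the streaming job from the question could be written as a JSON step definition and passed to add-steps. The cluster ID, bucket name, step name, and S3 paths are hypothetical placeholders, the compression options from the question are left out for brevity, and the script only prints the add-steps command rather than submitting anything.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholders; replace with your own cluster ID and bucket.
CLUSTER_ID="${CLUSTER_ID:-j-EXAMPLE}"
BUCKET="${BUCKET:-my-bucket}"
STEP_FILE="${STEP_FILE:-streaming_step.json}"

# Write the step definition as JSON. The -files value contains a comma,
# which is unambiguous inside a JSON string.
cat > "$STEP_FILE" <<EOF
[
  {
    "Type": "STREAMING",
    "Name": "warc-streaming",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-files", "s3://${BUCKET}/my_mapper.py,s3://${BUCKET}/my_reducer.py",
      "-mapper", "my_mapper.py",
      "-reducer", "my_reducer.py",
      "-input", "s3://${BUCKET}/input/input_warc.txt",
      "-output", "s3://${BUCKET}/output"
    ]
  }
]
EOF

# Print the command instead of running it, so nothing is submitted by accident.
echo "aws emr add-steps --cluster-id ${CLUSTER_ID} --steps file://${STEP_FILE}"
```

Keeping the step in a file sidesteps the comma handling of the inline shorthand syntax, and the same file can be reused with create-cluster's --steps option.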