I have a transient EMR cluster up and ready, and I want to run a simple PySpark script in an EMR notebook.
Is there any way to create and modify the EMR notebook through Terraform?
Thanks in advance.
As far as I know, this is not possible. AWS says: "You create an EMR notebook using the Amazon EMR console. Creating notebooks using the AWS CLI or the Amazon EMR API is not supported." [AWS Documentation on creating EMR Notebook][1]
You can create the notebook via the console; it is stored in S3 as an .ipynb file. By giving its relative path, you can then execute the notebook on the cluster. See the boto3 documentation for more info: [Boto3 Documentation][2]
[1]: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-create.html
[2]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.start_notebook_execution
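As a rough illustration of the boto3 route mentioned above, here is a minimal sketch of starting an execution of a notebook that was already created in the console. The helper names (`build_notebook_execution_request`, `start_execution`), the IDs, the role name, and the region are placeholders I made up; only `start_notebook_execution` and its parameters come from the boto3 EMR client.

```python
def build_notebook_execution_request(editor_id, relative_path, cluster_id, service_role):
    """Build keyword arguments for EMR.Client.start_notebook_execution.

    editor_id     -- ID of the EMR notebook created in the console (placeholder value below)
    relative_path -- path of the .ipynb file relative to the notebook's S3 location
    cluster_id    -- ID of the running EMR cluster that should execute the notebook
    service_role  -- name or ARN of the IAM role used for the notebook execution
    """
    return {
        "EditorId": editor_id,
        "RelativePath": relative_path,
        "ExecutionEngine": {"Id": cluster_id, "Type": "EMR"},
        "ServiceRole": service_role,
    }


def start_execution(request, region_name="us-east-1"):
    """Send the request to EMR and return the notebook execution ID."""
    import boto3  # imported lazily so the request builder works without boto3 installed

    emr = boto3.client("emr", region_name=region_name)
    response = emr.start_notebook_execution(**request)
    return response["NotebookExecutionId"]


if __name__ == "__main__":
    # All values here are hypothetical placeholders, not real resources.
    request = build_notebook_execution_request(
        editor_id="e-EXAMPLE",
        relative_path="my_notebook.ipynb",
        cluster_id="j-EXAMPLE",
        service_role="EMR_Notebooks_DefaultRole",
    )
    print(request)
    # start_execution(request)  # uncomment once AWS credentials and real IDs are in place
```

This only runs an existing notebook; creating the notebook itself still has to happen in the console, as the quoted documentation says.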
Yes, you can create and modify an EMR cluster from Terraform and choose which tools will be installed, but this seems like the "hard way". A SageMaker notebook or the new Glue DataBrew tool would be easier.
Related
I am trying to work with Glue interactive sessions in a SageMaker notebook by configuring the glue-conda-pyspark kernel via AWS lifecycle configurations. It worked earlier, when I created the notebook instance. Now the instance is running with the configuration, but I can no longer see the conda Glue PySpark kernel in the kernel list. Could anybody help with the create script and start script needed to run the notebook with glue-pyspark?
I am configuring it using this AWS doc: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-sagemaker.html#is-sagemaker-existing
and I also used the AWS sample scripts on GitHub: https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/install-conda-package-single-environment/on-start.sh
Could someone help me decide which would be better: AWS EMR or creating my own cluster in AWS? I am using Airflow to create an AWS EMR cluster via Terraform, run the job, and destroy the cluster. However, has anyone created a Spark cluster in AWS without EMR, e.g. using ECS Fargate and the bitnami/spark Docker image (e.g. link), or something along the same lines? Thank you.
I am aware of the AWS CloudFormation EMR resource for creating clusters, but I could not find any instructions about EMR notebooks. Is there a CloudFormation resource for EMR notebooks, or a similar alternative?
EMR Notebooks can only be created manually using the AWS EMR console. From the documentation (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-create.html):
You create an EMR notebook using the Amazon EMR console. Creating notebooks using the AWS CLI or the Amazon EMR API is not supported.
Since there is no API for this, I don't think there will be a way to create notebooks using CloudFormation or similar tools.
I use an AWS EMR cluster to run Hive queries. For query optimization purposes, I sometimes need to kill a long-running step but keep the EMR cluster alive so I can keep using it. Is there a way to do this, either in the Hive CLI or the AWS console?
Please refer here for the details. To cancel steps using the AWS CLI:
aws emr cancel-steps --cluster-id j-2QUAJ7T3OTEI8 --step-ids s-3M8DKCZYYN1QE
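If you would rather do this from a script, the same operation is available in boto3 as `cancel_steps`. The following is only a sketch: the helper names and the `force` flag are my own, and the IDs are the ones from the CLI example above. As far as I know, cancelling a step that is already running (rather than pending) requires a reasonably recent EMR release, so check that against your cluster version.

```python
def build_cancel_steps_request(cluster_id, step_ids, force=False):
    """Build keyword arguments for EMR.Client.cancel_steps.

    force=False asks EMR to send an interrupt to the step;
    force=True asks it to terminate the step's process instead.
    """
    return {
        "ClusterId": cluster_id,
        "StepIds": list(step_ids),
        "StepCancellationOption": "TERMINATE_PROCESS" if force else "SEND_INTERRUPT",
    }


def cancel_steps(request):
    """Submit the cancellation and return a {step_id: status} mapping."""
    import boto3  # imported lazily so the request builder works without boto3 installed

    emr = boto3.client("emr")
    response = emr.cancel_steps(**request)
    # Each entry reports whether the cancel request was SUBMITTED or FAILED.
    return {info["StepId"]: info["Status"] for info in response["CancelStepsInfoList"]}


if __name__ == "__main__":
    request = build_cancel_steps_request("j-2QUAJ7T3OTEI8", ["s-3M8DKCZYYN1QE"])
    print(request)
    # print(cancel_steps(request))  # uncomment once AWS credentials are configured
```

Either way the cluster itself keeps running; only the listed steps are cancelled.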
I am submitting a Spark application job JAR to EMR, and it uses a property file. I can put the file in S3, and then download it and copy it to some location on the EMR box while the cluster is being created. If this is the best way, how can I do it at bootstrap time, while creating the EMR cluster itself?
Check the following screenshot.
Under "Edit software settings" you can add your own configuration, either inline or as a JSON file stored in S3, and use it to pass configuration parameters to the EMR cluster at creation time. For more details, please check the following links:
Amazon EMR Cluster Configurations
Configuring Applications
AWS CLI
Hope this helps.
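To give an idea of what such a configuration looks like, here is a small sketch that builds the JSON for the `spark-defaults` classification. The helper name and the property value are just examples I picked; the same list can be pasted into the software settings box, saved as a JSON file in S3, or passed as `Configurations` when creating the cluster through the API.

```python
import json


def spark_defaults_configuration(properties):
    """Return an EMR configuration list that sets spark-defaults properties.

    EMR expects property values as strings, so every value is converted with str().
    """
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {key: str(value) for key, value in properties.items()},
        }
    ]


if __name__ == "__main__":
    # Example property only; replace with the settings your job actually needs.
    config = spark_defaults_configuration({"spark.executor.memory": "4g"})
    print(json.dumps(config, indent=2))
```

Note that this covers settings EMR knows how to apply through classifications; it does not by itself copy an arbitrary property file onto the nodes.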