I am aware of the AWS CloudFormation EMR resource for creating clusters, but I could not find any instructions for EMR Notebooks. Is there a CloudFormation resource for EMR Notebooks, or a similar alternative?
EMR Notebooks can only be created manually using the AWS EMR console. From the documentation (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-create.html):
You create an EMR notebook using the Amazon EMR console. Creating notebooks using the AWS CLI or the Amazon EMR API is not supported.
Since there is no API for this, I don't think there is a way to create notebooks using CloudFormation or similar tools.
Related
I want to create a managed MongoDB using DocumentDB on AWS via Terraform.
I created a DocumentDB Elastic Cluster via the UI, and it seems to work fine. Now I want to create this cluster via Terraform, but I can't find documentation for it.
I read that only DocumentDB's 'Elastic Cluster' supports the MongoDB sharding APIs (and not the 'Instance Based Cluster').
This is the HashiCorp doc for DocumentDB, but I don't see any reference to Elastic Clusters.
DocumentDB is relatively new. I don't think it's possible to do this with Terraform yet.
You can do it using CloudFormation
Using the AWS CDK
Or the AWS CLI
I think it will be available soon: when something is possible with other IaC tools, Terraform usually doesn't take long to catch up.
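As a rough illustration of the API route that the CLI and CDK options build on, here is a minimal Python sketch using the boto3 `docdb-elastic` client. The helper names, default shard values, and credentials are hypothetical placeholders, and the password is passed in plain text only to keep the sketch short; check the request fields against the current boto3 documentation before relying on them.

```python
def build_elastic_cluster_request(cluster_name, admin_user, admin_password,
                                  shard_count=2, shard_capacity=2):
    """Build keyword arguments for the docdb-elastic create_cluster call.

    The defaults are illustrative assumptions, not recommendations.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    return {
        "clusterName": cluster_name,
        "authType": "PLAIN_TEXT",          # password supplied directly
        "adminUserName": admin_user,
        "adminUserPassword": admin_password,
        "shardCount": shard_count,         # number of shards
        "shardCapacity": shard_capacity,   # vCPUs per shard
    }


def create_elastic_cluster(request, region_name=None):
    """Send the request to AWS (needs boto3 and valid credentials)."""
    import boto3  # imported here so the builder works without boto3

    client = boto3.client("docdb-elastic", region_name=region_name)
    return client.create_cluster(**request)
```

Calling `create_elastic_cluster(build_elastic_cluster_request("demo", "admin", "..."))` would attempt to create a real, billable cluster, so it is kept separate from the request builder.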
Could someone help me decide which would be better: AWS EMR or running my own cluster on AWS? I am using Airflow to create an AWS EMR cluster via Terraform, run the job, and destroy the cluster. Has anyone created a Spark cluster on AWS without EMR, e.g. using ECS Fargate and the bitnami/spark Docker image, or something along the same lines? Thank you.
I have a transient EMR cluster up and ready, and I want to run a simple PySpark script in an EMR notebook.
Is there any way to create and modify the EMR notebook through Terraform?
Thanks in advance.
As far as I know, AWS says: "You create an EMR notebook using the Amazon EMR console. Creating notebooks using the AWS CLI or the Amazon EMR API is not supported." [AWS Documentation on creating EMR Notebook][1]
You can create a notebook via the console; the notebook is stored in S3 as an .ipynb file. By giving its relative path, you can then execute the notebook on the cluster. See boto3 for more info: [Boto3 Documentation][2]
[1]: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-create.html
[2]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.start_notebook_execution
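The notebook execution described above could look roughly like the following sketch, which wraps the boto3 `start_notebook_execution` call linked in [2]. The helper names and argument values are hypothetical; the notebook (editor) ID, cluster ID, and service role must come from your own account.

```python
def build_notebook_execution_request(editor_id, relative_path, cluster_id,
                                     service_role, name=None):
    """Build keyword arguments for EMR.Client.start_notebook_execution."""
    request = {
        "EditorId": editor_id,            # ID of the notebook created in the console
        "RelativePath": relative_path,    # path of the .ipynb relative to the notebook's S3 location
        "ExecutionEngine": {"Id": cluster_id, "Type": "EMR"},
        "ServiceRole": service_role,      # role used for the notebook execution
    }
    if name:
        request["NotebookExecutionName"] = name
    return request


def start_notebook_execution(request, region_name=None):
    """Start the execution and return its ID (needs boto3 and credentials)."""
    import boto3  # imported here so the builder works without boto3

    emr = boto3.client("emr", region_name=region_name)
    response = emr.start_notebook_execution(**request)
    return response["NotebookExecutionId"]
```

Keeping the request builder separate from the AWS call makes it possible to inspect or log the request before anything runs on the cluster.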
Yes, you can create and modify an EMR cluster from Terraform and choose which tools will be installed, but this seems like the "hard way". A SageMaker notebook or the new Glue DataBrew tool would be easier.
I am not good at writing shell scripts or bootstrap actions for EMR. Can I use a preconfigured AMI snapshot to create the cluster?
You can use a preconfigured AMI for EMR. However, there are some restrictions in that you must start with a supported EMR AMI. I have done this many times to create encrypted root volumes for EMR (copying the AMI and enabling encryption).
Amazon EMR now supports launching clusters with custom Amazon Linux AMIs
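A minimal sketch of the workflow described above (copy a supported AMI with encryption enabled, then launch EMR from the copy), expressed as boto3 request builders. The function names, instance types, and role names are assumptions for illustration; `copy_image` and `run_job_flow` with `CustomAmiId` are the boto3 calls the requests are meant for.

```python
def build_encrypted_copy_request(source_ami_id, source_region, name):
    """Keyword arguments for EC2.Client.copy_image with encryption enabled."""
    return {
        "Name": name,
        "SourceImageId": source_ami_id,
        "SourceRegion": source_region,
        "Encrypted": True,  # encrypt the snapshots of the copied AMI
    }


def build_custom_ami_cluster_request(cluster_name, release_label, custom_ami_id,
                                     instance_type="m5.xlarge", instance_count=3):
    """Keyword arguments for EMR.Client.run_job_flow using a custom AMI."""
    return {
        "Name": cluster_name,
        "ReleaseLabel": release_label,
        "CustomAmiId": custom_ami_id,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default EMR role names; replace them if your account uses others.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With credentials configured, the requests would be passed as `boto3.client("ec2").copy_image(**copy_request)` and `boto3.client("emr").run_job_flow(**cluster_request)`, using the `ImageId` returned by the copy as `custom_ami_id`.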
I created an AWS EMR cluster through the regular EMR cluster wizard in the AWS Management Console, where I was able to select a security configuration. When you export the CLI command, it appears as --security-configuration 'mySecurityConfigurationValue'.
I now need to create a similar EMR cluster through AWS Data Pipeline, but I don't see any option for specifying this security-configuration field.
The only similar fields I see are EmrManagedSlaveSecurityGroup, EmrManagedMasterSecurityGroup, AdditionalSlaveSecurityGroups, AdditionalMasterSecurityGroups, and SubnetId. I already have all of those filled out in my Pipeline configuration but I just need to also specify the security-configuration. Any thoughts?
Unfortunately, DataPipeline does not support the Security Configurations feature (as well as other features that were introduced in the EMR 5.x versions like using a custom AMI).
One solution for this is to:
1. Replace the EmrCluster in your pipeline with an EC2 resource
2. Use a ShellCommandActivity on the EC2 resource to run the aws emr create-cluster CLI command
3. Use a bootstrap step to install TaskRunner on the cluster
4. Replace all the runsOn properties in your pipeline with workerGroup so the tasks run on the EMR cluster you created in step 2
5. Add a final ShellCommandActivity at the end of the pipeline to terminate the cluster using the CLI
Since you are now spinning up your cluster using the CLI, you have access to all kinds of features like security configurations, custom AMIs, instance fleets, etc., and you can still orchestrate the tasks using DataPipeline.
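To make the create and terminate commands from the workaround concrete, here is a small Python sketch that assembles the two CLI command lines a ShellCommandActivity could run. The helper names and all argument values (cluster name, release label, subnet ID) are placeholders; the flags themselves are standard `aws emr` options.

```python
import shlex


def build_create_cluster_command(name, release_label, security_configuration,
                                 subnet_id, instance_type="m5.xlarge",
                                 instance_count=3):
    """Return the argv list for `aws emr create-cluster`."""
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Spark",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
        "--ec2-attributes", f"SubnetId={subnet_id}",
        # The option that the EmrCluster pipeline object does not expose:
        "--security-configuration", security_configuration,
    ]


def build_terminate_command(cluster_id):
    """Return the argv list for terminating the cluster at the end."""
    return ["aws", "emr", "terminate-clusters", "--cluster-ids", cluster_id]


if __name__ == "__main__":
    argv = build_create_cluster_command(
        "pipeline cluster", "emr-5.36.0",
        "mySecurityConfigurationValue", "subnet-0123456789abcdef0")
    # shlex.join quotes arguments so the string is safe to paste into a shell
    print(shlex.join(argv))
    print(shlex.join(build_terminate_command("j-EXAMPLE")))
```

The create command prints the new cluster ID as JSON, which the pipeline would need to capture and pass to the final terminate step.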