How to run an AWS EMR cluster on multiple subnets? - amazon-web-services

Currently we create instances using a config.json file that EMR uses to configure the cluster. This file specifies a single subnet ("Ec2SubnetId"), and ALL of my EMR instances end up in that subnet. How do I let the cluster use multiple subnets?
Here is the Terraform template I am pushing to S3.
{
  "Applications": [
    {"Name": "Spark"},
    {"Name": "Hadoop"}
  ],
  "BootstrapActions": [
    {
      "Name": "Step1-stuff",
      "ScriptBootstrapAction": {
        "Path": "s3://${artifact_s3_bucket_name}/artifacts/${build_commit_id}/install-stuff.sh",
        "Args": ["${stuff_args}"]
      }
    },
    {
      "Name": "setup-cloudWatch-agent",
      "ScriptBootstrapAction": {
        "Path": "s3://${artifact_s3_bucket_name}/artifacts/${build_commit_id}/setup-cwagent-emr.sh",
        "Args": ["${build_commit_id}"]
      }
    }
  ],
  "Configurations": [
    {
      "Classification": "spark",
      "Properties": {
        "maximizeResourceAllocation": "true"
      }
    }
  ],
  "Instances": {
    "AdditionalMasterSecurityGroups": [ "${additional_master_security_group}" ],
    "AdditionalSlaveSecurityGroups": [ "${additional_slave_security_group}" ],
    "Ec2KeyName": "privatekey-${env}",
    "Ec2SubnetId": "${data_subnet}",
    "InstanceGroups": [

You cannot currently achieve what you are trying to do: an EMR cluster always ends up with all of its nodes in the same subnet.
Using instance fleets you can indeed configure a set of subnets, but at launch time AWS picks the best one and puts all of your instances there.
From the EMR Documentation, under "Use the Console to Configure Instance Fleets":
For Network, enter a value. If you choose a VPC for Network, choose a single EC2 Subnet or CTRL + click to choose multiple EC2 subnets. The subnets you select must be the same type (public or private). If you choose only one, your cluster launches in that subnet. If you choose a group, the subnet with the best fit is selected from the group when the cluster launches.
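For reference, here is a minimal sketch of the instance-fleets form of the Instances block (the subnet IDs and instance types are placeholders): you can list several subnets under Ec2SubnetIds, but EMR still launches every node into the single subnet it considers the best fit.
"Instances": {
  "Ec2SubnetIds": [
    "subnet-aaaa1111",
    "subnet-bbbb2222"
  ],
  "InstanceFleets": [
    {
      "InstanceFleetType": "MASTER",
      "TargetOnDemandCapacity": 1,
      "InstanceTypeConfigs": [ {"InstanceType": "m5.xlarge"} ]
    },
    {
      "InstanceFleetType": "CORE",
      "TargetOnDemandCapacity": 2,
      "InstanceTypeConfigs": [ {"InstanceType": "m5.xlarge"} ]
    }
  ]
}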

Related

How to share EFS among different ECS tasks and hosted in different instances

Currently, the tasks we defined use bind mounts to share EFS-persisted data among the containers of a single task; let's say taskA saves to /efs/cache/taskA.
We are looking for a way to share taskA's EFS data with the taskB containers in ECS, so that taskB can access taskA's data by bind-mounting it as well.
Can we use bind mounts in ECS to achieve this, or is there an alternative? Thanks
taskB definition looks like:
containerDefinitions": [
"mountPoints": [
{
"readOnly": null,
"containerPath": "/efs/cache/taskA",
"sourceVolume": "efs_cache_taskA"
},
...],
"volumes": [
{
"fsxWindowsFileServerVolumeConfiguration": null,
"efsVolumeConfiguration": null,
"name": "efs_cache_taskA",
"host": {
"sourcePath": "/efs/cache/taskA"
},
"dockerVolumeConfiguration": null
},
...
}
You no longer need to mount EFS on the EC2 instances and then use bind mounts. ECS now has a native integration with EFS (for both EC2 and Fargate launch types) that lets you configure tasks to mount the same file system (or access point) without having to configure the EC2 hosts at all (in fact it works with Fargate as well). See this blog post series for more info.
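As a minimal sketch (the file system ID, root directory, and paths below are placeholders), both the taskA and taskB definitions could declare the same EFS-backed volume via efsVolumeConfiguration instead of a host path:
"volumes": [
  {
    "name": "efs_cache",
    "efsVolumeConfiguration": {
      "fileSystemId": "fs-12345678",
      "rootDirectory": "/cache",
      "transitEncryption": "ENABLED"
    }
  }
],
"containerDefinitions": [
  {
    "mountPoints": [
      {
        "sourceVolume": "efs_cache",
        "containerPath": "/efs/cache",
        "readOnly": false
      }
    ]
  }
]
Because the volume points at the file system itself rather than a directory on a particular host, any task that declares it sees the same data.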

Packer google cloud authentication within vpc

We are using Packer to build images in a GCP compute instance. Packer tries to fetch the source image based on project and image name, as follows:
https://www.googleapis.com/compute/v1/projects/<project-name>/global/images/<image-name>?alt=json
Then it throws an error:
oauth2: cannot fetch token: Post https://accounts.google.com/o/oauth2/token: dial tcp 108.177.111.84:443: i/o timeout
For security reasons our compute instance has no external IP address and therefore no internet access, so accounts.google.com is not reachable. How, then, can we authenticate against the Google APIs?
I tried enabling firewall rules and adding routes for internet access, but per the requirement stated here, an instance without an external IP address still has no internet access.
This means we need a separate way to authenticate to the Google APIs.
But does Packer support this?
Here is the Packer builder we have:
"builders": [
{
"type": "googlecompute",
"project_id": "test",
"machine_type": "n1-standard-4",
"source_image_family": "{{user `source_family`}}",
"source_image": "{{user `source_image`}}",
"source_image_project_id": "{{user `source_project_id`}}",
"region": "{{user `region`}}",
"zone": "{{user `zone`}}",
"network": "{{user `network`}}",
"subnetwork": "{{user `subnetwork`}}",
"image_name": "test-{{timestamp}}",
"disk_size": 10,
"disk_type": "pd-ssd",
"state_timeout": "5m",
"ssh_username": "build",
"ssh_timeout": "1000s",
"ssh_private_key_file": "./gcp-instance-key.pem",
"service_account_email": "test-account#test-mine.iam.gserviceaccount.com",
"omit_external_ip": true,
"use_internal_ip": true,
"metadata": {
"user": "build"
}
}
To do what you want manually, you will need an SSH tunnel open on a working compute instance inside the project, or in a VPC that has peering enabled with the network the instance you want to reach is on.
If you then use a CI runner such as gitlab-ci, be sure to create the runner inside the same VPC or in a peered VPC.
If you don't want to create an instance with an external IP, you could try opening a VPN connection to the project and doing it through the VPN.
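A rough sketch of the tunnel approach (the bastion name, zone, and port are hypothetical, and this assumes Packer's HTTP client honors the standard proxy environment variables): open a SOCKS proxy through a VM that does have outbound access and route Packer's API calls through it.
# Run in a separate terminal: open a SOCKS5 proxy via a VM that can reach the internet
gcloud compute ssh bastion-vm --zone us-central1-a -- -D 1080 -N

# Route Packer's Google API traffic through the tunnel (template name is hypothetical)
HTTPS_PROXY=socks5://127.0.0.1:1080 packer build packer.json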

Mounting an elastic file system to an AWS Batch compute environment

I'm trying to get my Elastic File System (EFS) mounted in my Docker container so it can be used with AWS Batch. Here is what I did:
Created a new AMI that is optimized for the Elastic Container Service (ECS). I followed this guide here to make sure it had ECS on it. I also put the mount into the /etc/fstab file and verified that my EFS was mounted at /mnt/efs after a reboot.
Tested an EC2 instance with my new AMI and verified I could pull the Docker container and pass it my mount point via
docker run --volume /mnt/efs:/home/efs -it mycontainer:latest
Interactively running the Docker image shows me my data inside EFS.
Set up a new compute environment with my new AMI that mounts EFS on boot.
Created a job definition file:
{
  "jobDefinitionName": "MyJobDEF",
  "jobDefinitionArn": "arn:aws:batch:us-west-2:#######:job-definition/Submit:8",
  "revision": 8,
  "status": "ACTIVE",
  "type": "container",
  "parameters": {},
  "retryStrategy": {
    "attempts": 1
  },
  "containerProperties": {
    "image": "########.ecr.us-west-2.amazonaws.com/mycontainer",
    "vcpus": 1,
    "memory": 100,
    "command": [
      "ls",
      "/home/efs"
    ],
    "volumes": [
      {
        "host": {
          "sourcePath": "/mnt/efs"
        },
        "name": "EFS"
      }
    ],
    "environment": [],
    "mountPoints": [
      {
        "containerPath": "/home/efs",
        "readOnly": false,
        "sourceVolume": "EFS"
      }
    ],
    "ulimits": []
  }
}
Ran the job and viewed the log.
Anyway, while it does not say "no file /home/efs found", it does not list anything in my EFS, which is populated; I'm interpreting that as the container mounting an empty EFS. What am I doing wrong? Is my AMI not mounting the EFS in the compute environment?
I covered this in a recent blog post
https://medium.com/arupcitymodelling/lab-note-002-efs-as-a-persistence-layer-for-aws-batch-fcc3d3aabe90
You need to set up a launch template for your batch instances, and you need to make sure that your subnets/security groups are configured properly.
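For example (the file system ID below is a placeholder), the launch template's user data can mount EFS at boot so that the host sourcePath used in the job definition (/mnt/efs) exists on every Batch instance; Batch expects launch-template user data in MIME multi-part format:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"

# Install the EFS mount helper and mount the file system at boot
yum install -y amazon-efs-utils
mkdir -p /mnt/efs
mount -t efs fs-12345678:/ /mnt/efs

--==MYBOUNDARY==--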

Where are the volumes located when using ECS and Fargate?

I have the following setup (I've stripped out the unimportant fields):
{
  "ECSTask": {
    "Type": "AWS::ECS::TaskDefinition",
    "Properties": {
      "ContainerDefinitions": [
        {
          "Name": "mysql",
          "Image": "mysql",
          "MountPoints": [{"SourceVolume": "mysql", "ContainerPath": "/var/lib/mysql"}]
        }
      ],
      "RequiresCompatibilities": ["FARGATE"],
      "Volumes": [{"Name": "mysql"}]
    }
  }
}
It seems to work (the container does start properly), but I'm not quite sure where exactly this volume is being saved. I assumed it would be an EBS volume, but I don't see one there. I guess it's internal to my task, but in that case, how do I access it? How can I control its limits (min/max size, etc.)? How can I create a backup of this volume?
Thanks.
Fargate does not support persistent volumes. Any volume attached to a Fargate task is ephemeral and cannot be initialized from an external source or backed up, sadly.
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_data_volumes.html

How to set a custom environment variable in EMR to be available for a spark Application

I need to set a custom environment variable in EMR so that it is available when running a Spark application.
I have tried adding this:
...
--configurations '[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Configurations": [],
        "Properties": { "SOME-ENV-VAR": "qa1" }
      }
    ],
    "Properties": {}
  }
]'
...
and I also tried replacing "spark-env" with "hadoop-env",
but nothing seems to work.
There is this answer from the AWS forums, but I can't figure out how to apply it.
I'm running EMR 5.3.1 and launching it with a preconfigured step from the CLI: aws emr create-cluster...
Add the custom configuration, like the JSON below, to a file, say custom_config.json:
[
  {
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "VARIABLE_NAME": "VARIABLE_VALUE"
        }
      }
    ]
  }
]
Then, when creating the EMR cluster, pass the file reference to the --configurations option:
aws emr create-cluster --configurations file://custom_config.json --other-options...
For me, replacing spark-env with yarn-env fixed the issue.
Use the yarn-env classification to pass environment variables to the worker nodes.
Use the spark-env classification to pass environment variables to the driver when using deploy mode client. With deploy mode cluster, use yarn-env.
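As a sketch of that last suggestion (the variable name and value are the ones from the question, with the hyphens replaced by underscores since shell variable names cannot contain hyphens), the yarn-env variant of the configuration looks like this:
[
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "SOME_ENV_VAR": "qa1"
        }
      }
    ]
  }
]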