I'm using AWS CDK to set up my infrastructure. I'm wondering if there is any way to create an ETL job through an EMR Serverless application with AWS CDK?
I can create the serverless application with CDK, but I can't find how to create a job.
There's not currently a built-in way to create a job with CDK (or CloudFormation). This is partially because CDK is typically used to deploy infrastructure while something like Airflow or Step Functions would be used to trigger an actual job on a recurring basis.
You could, in theory, write a custom resource to trigger a job. Here's an example of how to do so with Python CDK. This code creates an EMR Serverless application, a role that can be used with the job (no access granted in this case), and a custom resource that starts the job. Note that the policy associated with the custom resource needs to have iam:PassRole access granted to the EMR Serverless job execution role.
from aws_cdk import Stack
from aws_cdk import aws_emrserverless as emrs
from aws_cdk import aws_iam as iam
from aws_cdk import custom_resources as custom
from constructs import Construct


class EmrServerlessJobRunStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Create a serverless Spark app
        serverless_app = emrs.CfnApplication(
            self,
            "spark_app",
            release_label="emr-6.9.0",
            type="SPARK",
            name="cdk-spark",
        )

        # We need an execution role to run the job; this one has no access to anything,
        # but will be granted PassRole access by the Lambda that's starting the job.
        role = iam.Role(
            scope=self,
            id="spark_job_execution_role",
            assumed_by=iam.ServicePrincipal("emr-serverless.amazonaws.com"),
        )

        # Create a custom resource that starts a job run
        myjobrun = custom.AwsCustomResource(
            self,
            "serverless-job-run",
            on_create={
                "service": "EMRServerless",
                "action": "startJobRun",
                "parameters": {
                    "applicationId": serverless_app.attr_application_id,
                    "executionRoleArn": role.role_arn,
                    "name": "cdkJob",
                    "jobDriver": {
                        "sparkSubmit": {
                            "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py"
                        }
                    },
                },
                "physical_resource_id": custom.PhysicalResourceId.from_response("jobRunId"),
            },
            policy=custom.AwsCustomResourcePolicy.from_sdk_calls(
                resources=custom.AwsCustomResourcePolicy.ANY_RESOURCE
            ),
        )

        # Ensure the Lambda can pass the earlier-created role to EMR Serverless
        myjobrun.grant_principal.add_to_policy(
            iam.PolicyStatement(
                effect=iam.Effect.ALLOW,
                resources=[role.role_arn],
                actions=["iam:PassRole"],
                conditions={
                    "StringLike": {
                        "iam:PassedToService": "emr-serverless.amazonaws.com"
                    }
                },
            )
        )
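If you also want the job run to be cancelled when the stack is destroyed, one option is to pass an on_delete call alongside on_create in the same AwsCustomResource. This is only a sketch I haven't deployed; it assumes the EMR Serverless cancelJobRun action and CDK's PhysicalResourceIdReference helper, which resolves to the jobRunId recorded at create time:

            on_delete={
                "service": "EMRServerless",
                "action": "cancelJobRun",
                "parameters": {
                    "applicationId": serverless_app.attr_application_id,
                    # Resolves to the jobRunId captured as the physical resource ID by on_create
                    "jobRunId": custom.PhysicalResourceIdReference(),
                },
            },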
I'm trying to create an ECS cluster and then launch an EC2 instance into that cluster. However, this is not happening.
My code:
ecs_client = boto3.client(
    'ecs',
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    region_name=region
)

ec2_client = boto3.client(
    'ec2',
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    region_name=region
)

response = ecs_client.create_cluster(
    clusterName=cluster_name
)

response = ec2_client.run_instances(
    # Use the official ECS image
    ImageId="ami-0128839b21d19300e",
    MinCount=1,
    MaxCount=1,
    InstanceType="t2.micro",
    IamInstanceProfile={
        "Name": "ecsInstanceRole"
    },
    UserData="#!/bin/bash \n echo ECS_CLUSTER=" + cluster_name + " >> /etc/ecs/ecs.config"
)
The instance profile references the existing ecsInstanceRole.
From what I've read, the UserData should make this possible, but it is not working at the moment.
I tried to replicate your issue in us-east-1, but your boto3 code works fine. I had no problems creating a cluster and launching an instance into that cluster using your boto3 script. Your code will, by default, launch the instance into the default VPC.
Thus, the fault must be outside of the code provided. Possible causes include a misconfigured default VPC, custom changes to the ecsInstanceRole permissions, or a lack of connectivity to the ECS service.
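To rule out problems with the default VPC, you could launch the instance into an explicit subnet and security group and confirm that the ECS agent can reach the ECS endpoint. A rough sketch, reusing ec2_client and cluster_name from the question; the subnet and security group IDs are placeholders to replace with your own:

response = ec2_client.run_instances(
    ImageId="ami-0128839b21d19300e",          # ECS-optimized AMI, as in the question
    MinCount=1,
    MaxCount=1,
    InstanceType="t2.micro",
    IamInstanceProfile={"Name": "ecsInstanceRole"},
    # Explicit networking instead of relying on the default VPC
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "SubnetId": "subnet-0123456789abcdef0",    # placeholder
        "Groups": ["sg-0123456789abcdef0"],        # placeholder
        "AssociatePublicIpAddress": True,          # so the ECS agent can reach the ECS endpoint
    }],
    UserData="#!/bin/bash\necho ECS_CLUSTER=" + cluster_name + " >> /etc/ecs/ecs.config"
)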
I am trying to list all the EC2 instances of type t2.micro that are running in the Sydney region. I have written a Lambda function in Python (Python 3.6). Below is my code:
import json
import boto3
from pprint import pprint


def lambda_handler(event, context):
    client = boto3.client("ec2")
    response = client.describe_instances(
        Filters=[
            {
                'Name': 'instance-state-name',
                'Values': ['running'],
            },
            {
                'Name': 'instance-type',
                'Values': ['t2.micro'],
            },
            {
                'Name': 'availability-zone',
                'Values': ['ap-southeast-2'],
            },
        ],
    )
    print(response)
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
I have given the Lambda all the necessary permissions, including AmazonEC2FullAccess, AmazonEC2ReadOnlyAccess and AWSLambdaBasicExecutionRole. I increased the memory to 3 GB and the timeout to 15 minutes. Still no luck. Any help would be appreciated.
Unfortunately I can't comment, so I'm posting this as an answer. If you attach the Lambda to a VPC without a NAT gateway, boto3 calls to the AWS APIs won't work.
By the way, if you are only trying to describe instances from the Lambda, you don't need to configure a VPC for it at all.
If you do need the VPC, see the AWS documentation on creating a NAT gateway.
The issue came down to the Lambda being associated with public subnets that use an internet gateway.
Because the Lambda connects outbound via the VPC, it cannot reach the internet without a NAT: the Lambda itself is never assigned a public IP address, which prevents the internet gateway from doing any good.
Alternatively, for many services you can use a VPC endpoint instead; in your case (EC2) there is an endpoint available. It is important to understand that, while your EC2 instances live in a VPC, the EC2 API operations are performed against the EC2 service endpoint rather than against the resources inside your VPC.
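As a rough sketch of that option (the VPC, subnet, and security group IDs below are placeholders), an interface endpoint for the EC2 API in ap-southeast-2 could be created with boto3 like this:

import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")

# Interface endpoint so a Lambda in private subnets can reach the EC2 API
# without a NAT gateway. The IDs below are placeholders.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.ap-southeast-2.ec2",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,   # lets boto3 keep using the default EC2 endpoint name
)
print(response["VpcEndpoint"]["VpcEndpointId"])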
I've prepared a simple Lambda function in AWS to terminate long-running EMR clusters after a certain threshold is reached. The code snippet is tested locally and works perfectly fine. I pushed it into a Lambda and took care of the library dependencies, so that's also fine. The Lambda is triggered from a CloudWatch rule, which is a simple cron schedule. I'm using an existing IAM role which has these 7 policies attached to it:
SecretsManagerReadWrite
AmazonSQSFullAccess
AmazonS3FullAccess
CloudWatchFullAccess
AWSGlueServiceRole
AmazonSESFullAccess
AWSLambdaRole
I've configured the Lambda to be inside the same VPC and security group as the EMR clusters. Still, I'm getting this error consistently:
An error occurred (AccessDeniedException) when calling the ListClusters operation: User: arn:aws:sts::xyz:assumed-role/dev-lambda-role/terminate_inactive_dev_emr_clusters is not authorized to perform: elasticmapreduce:ListClusters on resource: *: ClientError
Traceback (most recent call last):
File "/var/task/terminate_dev_emr.py", line 24, in terminator
ClusterStates=['STARTING', 'BOOTSTRAPPING', 'RUNNING', 'WAITING']
File "/var/runtime/botocore/client.py", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/var/runtime/botocore/client.py", line 612, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the ListClusters operation: User: arn:aws:sts::xyz:assumed-role/dev-lambda-role/terminate_inactive_dev_emr_clusters is not authorized to perform: elasticmapreduce:ListClusters on resource: *
My lambda function looks something like this:
import pytz
import boto3
from datetime import datetime, timedelta


def terminator(event, context):
    ''' cluster lifetime limit in hours '''
    LIMIT = 7
    TIMEZONE = 'Asia/Kolkata'
    AWS_REGION = 'eu-west-1'

    print('Start cluster check')
    emr = boto3.client('emr', region_name=AWS_REGION)
    local_tz = pytz.timezone(TIMEZONE)
    today = local_tz.localize(datetime.today(), is_dst=None)
    lifetimelimit = today - timedelta(hours=LIMIT)
    clusters = emr.list_clusters(
        CreatedBefore=lifetimelimit,
        ClusterStates=['STARTING', 'BOOTSTRAPPING', 'RUNNING', 'WAITING']
    )
    if clusters['Clusters'] is not None:
        for cluster in clusters['Clusters']:
            description = emr.describe_cluster(ClusterId=cluster['Id'])
            if (len(description['Cluster']['Tags']) == 1
                    and description['Cluster']['Tags'][0]['Key'] == 'dev.ephemeral'):
                print('Terminating Cluster: [{id}] with name [{name}]. It was active since: [{time}]'.format(
                    id=cluster['Id'],
                    name=cluster['Name'],
                    time=cluster['Status']['Timeline']['CreationDateTime'].strftime('%Y-%m-%d %H:%M:%S')))
                emr.terminate_job_flows(JobFlowIds=[cluster['Id']])
    print('cluster check done')
    return
Any help is appreciated.
As the error message indicates, the Lambda does not have permission to call ListClusters on EMR. Since you are working with EMR clusters and also want to terminate them, you should give the Lambda function an IAM role with the capabilities it needs. Create a new IAM policy from the AWS console (say, EMRFullAccess). Here is what it looks like:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "elasticmapreduce:*",
            "Resource": "*"
        }
    ]
}
After creating the policy, create a new role from the AWS console with Lambda as the service and attach the newly created policy. After that, attach this role to your Lambda function. That should solve the issue :-)
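If you'd rather script the same steps with boto3 instead of clicking through the console, a rough sketch might look like this (the role name dev-lambda-role comes from the error message above; adjust names to your setup):

import json
import boto3

iam = boto3.client("iam")

# Policy document from the answer above
emr_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "elasticmapreduce:*",
            "Resource": "*"
        }
    ]
}

# Create the managed policy and attach it to the Lambda's existing role
policy = iam.create_policy(
    PolicyName="EMRFullAccess",
    PolicyDocument=json.dumps(emr_policy),
)
iam.attach_role_policy(
    RoleName="dev-lambda-role",   # role name taken from the error message; adjust as needed
    PolicyArn=policy["Policy"]["Arn"],
)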
I need to provide somebody with read-only AWS CLI access to our CloudWatch billing metrics ONLY. I'm not sure how to do this, since CloudWatch doesn't have any specific resources that one can control access to. This means there are no ARNs to specify in an IAM policy, and as a result, any resource designation in the policy is "*". More info regarding the CloudWatch ARN limitations can be found here. I looked into using namespaces, but I believe the "aws-portal" namespace is for the console. Any direction or ideas are greatly appreciated.
With the current CloudWatch ARN limitations the IAM policy would look something like this.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "cloudwatch:DescribeMetricData",
                "cloudwatch:GetMetricData"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}
As you say, you will not be able to achieve this within CloudWatch. According to the docs:
CloudWatch doesn't have any specific resources for you to control access to... For example, you can't give a user access to CloudWatch data for only a specific set of EC2 instances or a specific load balancer. Permissions granted using IAM cover all the cloud resources you use or monitor with CloudWatch.
An alternative option might be to:
Use scheduled events on a Lambda function to periodically export the relevant billing metrics from CloudWatch to an S3 bucket. For example, using the Python SDK, the Lambda might look something like this:
import boto3
from datetime import datetime, timedelta


def lambda_handler(event, context):
    try:
        bucket_name = "so-billing-metrics"
        filename = '-'.join(['billing', datetime.now().strftime("%Y-%m-%d-%H")])
        region_name = "us-east-1"
        dimensions = {'Name': 'Currency', 'Value': 'USD'}
        metric_name = 'EstimatedCharges'
        namespace = 'AWS/Billing'
        start_time = datetime.now() - timedelta(hours=1)
        end_time = datetime.now()

        # Create CloudWatch client
        cloudwatch = boto3.client('cloudwatch', region_name=region_name)

        # Get billing metrics for the last hour
        metrics = cloudwatch.get_metric_statistics(
            Dimensions=[dimensions],
            MetricName=metric_name,
            Namespace=namespace,
            StartTime=start_time,
            EndTime=end_time,
            Period=60,
            Statistics=['Sum'])

        # Save data to temp file
        with open('/tmp/billingmetrics', 'w') as f:
            # Write header and data
            f.write("Timestamp,Cost,Unit\n")
            for entry in metrics['Datapoints']:
                f.write(",".join([entry['Timestamp'].strftime('%Y-%m-%d %H:%M:%S'),
                                  str(entry['Sum']), entry['Unit']]) + "\n")

        # Upload temp file to S3
        s3 = boto3.client('s3')
        with open('/tmp/billingmetrics', 'rb') as data:
            s3.upload_fileobj(data, bucket_name, filename)
    except Exception as e:
        print(str(e))
        return 0
    return 1
Note: You will need to ensure that the Lambda function has the relevant permissions to write to S3 and read from CloudWatch.
Restrict the IAM user/role to read-only access to the S3 bucket, for example with a policy along the lines of the sketch below.
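As a rough sketch, an inline policy granting read-only access to just that bucket could be attached to the user like this (the bucket name matches the Lambda above; the user name is hypothetical):

import json
import boto3

iam = boto3.client("iam")

# Read-only access to the bucket the Lambda writes its exports to
readonly_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::so-billing-metrics",
                "arn:aws:s3:::so-billing-metrics/*"
            ]
        }
    ]
}

iam.put_user_policy(
    UserName="billing-metrics-reader",   # hypothetical IAM user
    PolicyName="BillingMetricsBucketReadOnly",
    PolicyDocument=json.dumps(readonly_policy),
)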
When I configure DNS query logging with Route 53, I can create a resource policy for Route 53 to log to my log group. I can confirm this policy with the CLI command aws logs describe-resource-policies and see something like:
{
    "resourcePolicies": [
        {
            "policyName": "test-logging-policy",
            "policyDocument": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"route53.amazonaws.com\"},\"Action\":[\"logs:CreateLogStream\",\"logs:PutLogEvents\"],\"Resource\":\"arn:aws:logs:us-east-1:xxxxxx:log-group:test-route53*\"}]}",
            "lastUpdatedTime": 1520865407511
        }
    ]
}
The CLI also has put-resource-policy to create one of these. I also see that Terraform has a resource aws_cloudwatch_log_resource_policy which does the same.
So the question: How do I do this with CloudFormation?
You can't use the CloudWatch console to create or edit a resource policy. You must use the CloudWatch API, one of the AWS SDKs, or the AWS CLI.
There is no CloudFormation support for creating a resource policy right now, but you can create a custom Lambda resource to do this:
https://gist.github.com/sudharsans/cf9c52d7c78a81818a4a47872982bd76
CloudFormation custom resource:

AddResourcePolicy:
  Type: Custom::AddResourcePolicy
  Version: '1.0'
  Properties:
    ServiceToken: arn:aws:lambda:us-east-1:872673965194:function:test-lambda-deploy-Lambda-15R963QKCI80A
    CloudWatchLogsLogGroupArn: !GetAtt LogGroup.Arn
    PolicyName: "testpolicy"
Lambda (abridged from the gist above):

import cfnresponse
import boto3

# CloudWatch Logs client used by the helpers below
client = boto3.client("logs")


def PutPolicy(arn, policyname):
    response = client.put_resource_policy(
        policyName=policyname,
        policyDocument="....",
    )
    return


def handler(event, context):
    ......
    if event['RequestType'] == "Delete":
        DeletePolicy(PolicyName)
    if event['RequestType'] == "Create":
        PutPolicy(CloudWatchLogsLogGroupArn, PolicyName)
    responseData['Data'] = "SUCCESS"
    status = cfnresponse.SUCCESS
    .....
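For reference, a more complete version of PutPolicy, with the policy document reconstructed from the describe-resource-policies output in the question, might look like this (a sketch; the function name is mine and the log-group ARN is the masked one from the question):

import json
import boto3

logs = boto3.client("logs")

def put_route53_logging_policy(log_group_arn, policy_name):
    # Allow Route 53 to create log streams and put query logs into the log group
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "route53.amazonaws.com"},
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": log_group_arn,
            }
        ],
    }
    return logs.put_resource_policy(
        policyName=policy_name,
        policyDocument=json.dumps(policy_document),
    )

# Example usage with the masked ARN from the question
put_route53_logging_policy("arn:aws:logs:us-east-1:xxxxxx:log-group:test-route53*", "test-logging-policy")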
Four years later, this still doesn't seem to work through CloudFormation, although there is apparently support for this included now (the AWS::Logs::ResourcePolicy resource type).