Create EMR cluster with auto-termination setting using Airflow's [CreateEMRJobFlowOperator] - amazon-web-services

I'm trying to create an EMR cluster with an auto-termination idle timeout setting using an Airflow DAG.
It doesn't accept the AutoTerminationPolicy parameter and fails parameter validation with the following error:
raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Unknown parameter in input: "AutoTerminationPolicy", must be one of: Name, LogUri, LogEncryptionKmsKeyId, AdditionalInfo, AmiVersion, ReleaseLabel, Instances, Steps, BootstrapActions, SupportedProducts, NewSupportedProducts, Applications, Configurations, VisibleToAllUsers, JobFlowRole, ServiceRole, Tags, SecurityConfiguration, AutoScalingRole, ScaleDownBehavior, CustomAmiId, EbsRootVolumeSize, RepoUpgradeOnBoot, KerberosAttributes, StepConcurrencyLevel, ManagedScalingPolicy, PlacementGroupConfigs
JOB_FLOW_OVERRIDES = {
    'Name': 'X',
    'ReleaseLabel': "{{ReleaseLabel}}",
    "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    "AutoTerminationPolicy": {
        "IdleTimeout": 60 * 10
    },
    'LogUri': LogUri,
    'Instances': {
        'Ec2SubnetId': "{{Ec2SubnetId}}",
        'EmrManagedMasterSecurityGroup': "{{EmrManagedMasterSecurityGroup}}",
        'ServiceAccessSecurityGroup': "{{ServiceAccessSecurityGroup}}",
        'EmrManagedSlaveSecurityGroup': "{{EmrManagedSlaveSecurityGroup}}",
        'InstanceGroups': [
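For context, the create-job-flow operator forwards this dictionary to the EMR RunJobFlow API through boto3, so the ParamValidationError above is raised client-side by the botocore installed in the Airflow environment (its API model predates AutoTerminationPolicy), not by EMR itself. A rough sketch of the equivalent direct call, assuming a boto3/botocore version recent enough to know the field:
import boto3

emr = boto3.client("emr")

# Equivalent direct call (Jinja-templated values would already be rendered
# by Airflow at this point). An older botocore raises the same
# ParamValidationError locally because AutoTerminationPolicy is missing
# from its API model; upgrading boto3/botocore is the likely remedy.
response = emr.run_job_flow(**JOB_FLOW_OVERRIDES)

# Hypothetical fallback: attach the idle-timeout policy after the cluster
# is created (requires a boto3 version that exposes this API).
emr.put_auto_termination_policy(
    ClusterId=response["JobFlowId"],
    AutoTerminationPolicy={"IdleTimeout": 60 * 10},
)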

Related

AWS SSM error while targets.1.member.values failed to satisfy constraint: Member must have length less than or equal to 50

I am trying to run an SSM command on more than 50 EC2 instances in my fleet. Using AWS boto3's SSM client, I am running a specific command on my nodes. My code is given below. After running the code, an unexpected error shows up.
# describe EC2 instances and collect their IDs
instances = client.describe_instances()
instance_ids = [
    inst["InstanceId"]
    for reservation in instances["Reservations"]
    for inst in reservation["Instances"]
]  # might contain more than 50 instances

# run command
run_cmd_resp = ssm_client.send_command(
    Targets=[
        {"Key": "InstanceIds", "Values": instance_ids},
    ],
    DocumentName="AWS-RunShellScript",
    DocumentVersion="1",
    Parameters={
        "commands": ["#!/bin/bash", "ls -ltrh", "# some commands"]
    }
)
On executing this, I get the error below:
An error occurred (ValidationException) when calling the SendCommand operation: 1 validation error detected: Value '[...91 instance IDs...]' at 'targets.1.member.values' failed to satisfy constraint: Member must have length less than or equal to 50.
How do I run the SSM command on my whole fleet?
As shown in the error message and the boto3 documentation (link), the number of instances in one send_command call is limited to 50. To run the SSM command on all instances, split the original list into batches of 50.
FYI: if your account has a fair number of instances, describe_instances() can't retrieve all instance info in one API call, so it is better to check whether NextToken is in the response.
ref: How do you use "NextToken" in AWS API calls
# describe EC2 instances and collect their IDs
instances = client.describe_instances()
instance_ids = [
    inst["InstanceId"]
    for reservation in instances["Reservations"]
    for inst in reservation["Instances"]
]
# follow NextToken to retrieve any remaining pages
while "NextToken" in instances:
    instances = client.describe_instances(NextToken=instances["NextToken"])
    instance_ids += [
        inst["InstanceId"]
        for reservation in instances["Reservations"]
        for inst in reservation["Instances"]
    ]

# run the command in batches of at most 50 instances
for i in range(0, len(instance_ids), 50):
    target_instances = instance_ids[i : i + 50]
    run_cmd_resp = ssm_client.send_command(
        Targets=[
            {"Key": "InstanceIds", "Values": target_instances},
        ],
        DocumentName="AWS-RunShellScript",
        DocumentVersion="1",
        Parameters={
            "commands": ["#!/bin/bash", "ls -ltrh", "# some commands"]
        }
    )
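As a side note, a boto3 paginator can take care of the NextToken bookkeeping automatically; a minimal sketch of the same ID collection, assuming the same client object as above:
# same idea as above, using a paginator instead of manual NextToken checks
paginator = client.get_paginator("describe_instances")
instance_ids = [
    inst["InstanceId"]
    for page in paginator.paginate()
    for reservation in page["Reservations"]
    for inst in reservation["Instances"]
]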
Finally, following Rohan Kishibe's answer, I implemented the batched execution below for the SSM AWS-RunShellScript document.
import math

ec2_ids_all = [...]  # all instance IDs fetched by pagination
PG_START, PG_STOP = 0, 50
PG_SIZE = 50
PG_COUNT = math.ceil(len(ec2_ids_all) / PG_SIZE)

for page in range(PG_COUNT):
    cmd = ssm.send_command(
        Targets=[{"Key": "InstanceIds", "Values": ec2_ids_all[PG_START:PG_STOP]}],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["ls -ltrh", "# other commands"]},
    )
    PG_START += PG_SIZE
    PG_STOP += PG_SIZE
In this way, the total set of instance IDs is split into batches and executed accordingly. One can also save the command IDs and batch instance IDs in a mapping for future use.
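For illustration, a minimal sketch of such a mapping plus a later status check, assuming the same ssm client, ec2_ids_all list, and PG_SIZE as above (commands_by_id and the status print-out are illustrative additions, not part of the original answer):
# hypothetical bookkeeping: remember which instance IDs each command targeted
commands_by_id = {}  # CommandId -> list of targeted instance IDs

for start in range(0, len(ec2_ids_all), PG_SIZE):
    batch = ec2_ids_all[start : start + PG_SIZE]
    cmd = ssm.send_command(
        Targets=[{"Key": "InstanceIds", "Values": batch}],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["ls -ltrh", "# other commands"]},
    )
    commands_by_id[cmd["Command"]["CommandId"]] = batch

# later: check the per-instance status of each command
for command_id, batch in commands_by_id.items():
    for instance_id in batch:
        invocation = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        print(instance_id, invocation["Status"])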

Import of Azure Data Explorer ARM template fails

I exported the ARM template for my Azure Data Explorer from the development environment.
Now I'm trying to import it into the test environment, but the process fails:
New-AzResourceGroupDeployment : 12:18:24 - The deployment 'template' failed with error(s). Showing 3 out of 10 error(s). Status Message: [BadRequest] Validation Errors found: mapping does not exist (Code:EventHubValidationErrorFound) Status Message: [BadRequest] Validation Errors found: mapping does not exist (Code:EventHubValidationErrorFound) Status Message: [BadRequest] Validation Errors found: mapping does not exist (Code:EventHubValidationErrorFound) CorrelationId: b27cdf8e-c583-4dee-8dbc-2b0e4876b8ca
I have several data connections from my Azure Data Explorer to an Event Hub:
{
    "type": "Microsoft.Kusto/Clusters/Databases/EventHubConnections",
    "apiVersion": "2018-09-07-preview",
    "name": "[concat(parameters('Clusters_xyzazne_name'), '/asd/asd-fondi')]",
    "location": "North Europe",
    "dependsOn": [
        "[resourceId('Microsoft.Kusto/Clusters/Databases', parameters('Clusters__name'), 'DNA_R_NRT')]",
        "[resourceId('Microsoft.Kusto/Clusters', parameters('Clusters__name'))]"
    ],
    "kind": "EventHub",
    "properties": {
        "eventHubResourceId": "[concat(parameters('namespaces_ehub_externalid'), '/eventhubs/fondi')]",
        "consumerGroup": "fondi_consumer",
        "tableName": "fondi",
        "mappingRuleName": "fondi_mapping",
        "dataFormat": "multijson"
    }
}
I'm trying to import an Azure Data Explorer ARM template into another environment, but it fails.

An error has occurred: The server encountered an error processing the Lambda response

I am using AWS Lex and AWS Lambda to create a chatbot. The request and response formats are as follows.
Event being passed to AWS Lambda:
{
    "alternativeIntents": [
        {
            "intentName": "AMAZON.FallbackIntent",
            "nluIntentConfidence": null,
            "slots": {}
        }
    ],
    "botVersion": "$LATEST",
    "dialogState": "ConfirmIntent",
    "intentName": "OrderBeverage",
    "message": "you want to order 2 pints of beer",
    "messageFormat": "PlainText",
    "nluIntentConfidence": {
        "score": 0.92
    },
    "responseCard": null,
    "sentimentResponse": null,
    "sessionAttributes": {},
    "sessionId": "2021-05-10T09:13:06.841Z-bSWmdHVL",
    "slotToElicit": null,
    "slots": {
        "Drink": "beer",
        "Quantity": "2",
        "Unit": "pints"
    }
}
Response format:
{
    "statusCode": 200,
    "dialogAction": {
        "type": "Close",
        "fulfillmentState": "Fulfilled",
        "message": {
            "contentType": "PlainText",
            "content": "Message to convey to the user. For example, Thanks, your pizza has been ordered."
        }
    }
}
AWS Lambda Python implementation:
import json

def lambda_handler(event, context):
    # TODO implement
    slots = event["slots"]
    drink, qty, unit = slots["Drink"], slots["Quantity"], slots["Unit"]
    retStr = "your order of " + qty + " " + unit + " of " + drink + " is coming right up!"
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {
                "contentType": "PlainText",
                "content": retStr
            },
        }
    }
The formats are in accordance with the documentation; however, I am still getting an error processing the Lambda response. What is the issue?
This error occurs when the execution of the Lambda function fails and throws an error back to Amazon Lex.
I have attempted to recreate your environment using the python code shared and the test input event.
The output format that you have specified in the original post is correct. Your problem appears to lie with the input test event. The input message that you are using differs from what Lex is actually sending to your Lambda function.
Try adding some additional debugging to your Lambda function to log the event that Lex passes into it and then use the logged event as your new test event.
Ensure that you have CloudWatch logging enabled for the Lambda function so that you can view the input message in the logs.
Here's how my Lambda function looks:
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

def dispatch(event):
    # TODO implement
    slots = event["slots"]
    drink, qty, unit = slots["Drink"], slots["Quantity"], slots["Unit"]
    retStr = "your order of " + qty + " " + unit + " of " + drink + " is coming right up!"
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {
                "contentType": "PlainText",
                "content": retStr
            },
        }
    }

def lambda_handler(event, context):
    logger.debug('event={}'.format(event))
    response = dispatch(event)
    logger.debug(response)
    return response
Now if you test via the Lex console, you will find your error in the CloudWatch logs:
[ERROR] KeyError: 'slots'
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 33, in lambda_handler
    response = dispatch(event)
  File "/var/task/lambda_function.py", line 19, in dispatch
    slots= event["slots"];
Using this error trace and the logged event, you should see that slots is nested within currentIntent.
You will need to update your code to extract the slot values from the correct place.
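For example, a minimal sketch of that change, assuming a Lex V1 input event shaped like the one logged above:
def dispatch(event):
    # in the Lex V1 input event, slots live under currentIntent
    slots = event["currentIntent"]["slots"]
    drink, qty, unit = slots["Drink"], slots["Quantity"], slots["Unit"]
    retStr = "your order of " + qty + " " + unit + " of " + drink + " is coming right up!"
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {"contentType": "PlainText", "content": retStr},
        }
    }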
Trust this helps you.

Pyspark - read data from elasticsearch cluster on EMR

I am trying to read data from Elasticsearch with PySpark, using the elasticsearch-hadoop connector in Spark. The ES cluster sits on AWS EMR, which requires credentials to sign in. My script is as follows:
from pyspark import SparkContext, SparkConf

sc.stop()
conf = SparkConf().setAppName("ESTest")
sc = SparkContext(conf=conf)

es_read_conf = {
    "es.host": "vhost", "es.nodes": "node", "es.port": "443",
    "es.query": '{ "query": { "match_all": {} } }',
    "es.input.json": "true", "es.net.https.auth.user": "aws_access_key",
    "es.net.https.auth.pass": "aws_secret_key", "es.net.ssl": "true",
    "es.resource": "index/type", "es.nodes.wan.only": "true"
}

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf
)
PySpark keeps throwing this error:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on
[index] failed; server [node:443] returned [403|Forbidden:]
I checked everything, and it all made sense except for the user and pass entries: would the AWS access key and secret key work here? We don't want to use the console user and password here for security purposes. Is there a different way to do the same thing?

Apache Drill: Not able to query the database

I am using Ubuntu 14.04.
I have started to explore querying HDFS using Apache Drill, installed it on my local system, and configured the storage plugin to point to a remote HDFS. Below is the configuration:
{
    "type": "file",
    "enabled": true,
    "connection": "hdfs://devlpmnt.mycrop.kom:8020",
    "workspaces": {
        "root": {
            "location": "/",
            "writable": false,
            "defaultInputFormat": null
        }
    },
    "formats": {
        "json": {
            "type": "json"
        }
    }
}
After creating a JSON file, "rest.json", I ran the query:
select * from hdfs.`/tmp/rest.json` limit 1
I am getting the following error:
org.apache.drill.common.exceptions.UserRemoteException: PARSE ERROR: From line 1, column 15 to line 1, column 18: Table 'hdfs./tmp/rest.json' not found
I would appreciate it if someone could help me figure out what is wrong.
Thanks in advance!!