Can someone please help? I am trying to create an EMR environment with Spark installed, from a Data Pipeline configuration in the AWS console. When I choose 'Run job on an EMR cluster', the EMR cluster is always created with Pig and Hive by default, not Spark.
I understand that I can add Spark as a bootstrap action, as described here, but when I do I get an error. Here is my pipeline configuration:
Name: xxx.xxxxxxx.processing.dp
Build using a template: Run job on an Elastic MapReduce cluster
Parameters:
EC2 key pair (optional): xxx_xxxxxxx_emr_key
EMR step(s):
spark-submit --deploy-mode cluster s3://xxx.xxxxxxx.scripts.bucket/CSV2Parquet.py s3://xxx.xxxxxxx.scripts.bucket/
EMR Release Label: emr-4.3.0
Bootstrap action(s) (optional): s3://support.elasticmapreduce/spark/install-spark,-v,1.4.0.b
Where does the AMI version go? And does the above look correct?
Here's the error I get when I activate the data pipeline:
Unable to create resource for #EmrClusterObj_2017-01-13T09:00:07 due to: The supplied bootstrap action(s): 'bootstrap-action.6255c495-578a-441a-9d05-d03981fc460d' are not supported by release 'emr-4.3.0'. (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; Request ID: b1b81565-d96e-11e6-bbd2-33fb57aa2526)
If I specify a later EMR release, do I get Spark installed by default?
Many thanks for any help here.
Regards.
That install-spark bootstrap action is only for 3.x AMI versions. If you are using a releaseLabel (emr-4.x or beyond), the applications to install are specified in a different way.
I myself have never used Data Pipeline, but I see that if, when you are creating a pipeline, you click "Edit in Architect" at the bottom, you can then click on the EmrCluster node and select Applications from the "Add an optional field..." dropdown. That is where you may add Spark.
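For reference, once you add Spark via that Applications field, the exported pipeline-definition JSON for the EmrCluster object should look roughly like this (a sketch only; the `applications` field is how 4.x+ release labels install applications, and the id and key pair values below are placeholders taken from the question):

```json
{
  "objects": [
    {
      "id": "EmrClusterObj",
      "type": "EmrCluster",
      "releaseLabel": "emr-4.3.0",
      "applications": ["spark"],
      "keyPair": "xxx_xxxxxxx_emr_key",
      "terminateAfter": "2 Hours"
    }
  ]
}
```

Note there is no bootstrap action at all here; on release-label clusters the application list replaces the old install-spark mechanism.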
Related
I am new to AWS and am trying to run an AWS Data Pipeline that loads data from DynamoDB into S3, but I am getting the error below. Please help.
Unable to create resource for #EmrClusterForBackup_2020-05-01T14:18:47 due to: Instance type 'm3.xlarge' is not supported. (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; Request ID: 3bd57023-95e4-4d0a-a810-e7ba9cdc3712)
I was facing the same problem when my DynamoDB table and S3 bucket were in the us-east-2 region and the pipeline in us-east-1, since I was not allowed to create a pipeline in us-east-2.
Once I created the DynamoDB table and S3 bucket in us-east-1, and the pipeline in the same region, it worked well even with the m3.xlarge instance type.
It is always good to use the latest generation of instances. They are technologically more advanced and sometimes even cheaper.
So there is no reason to start on older generations; they exist only for backward compatibility, for people who already have infrastructure running on those machines.
I think this should help you. AWS will force you to use m3 if you use DynamoDBDataNode or resizeClusterBeforeRunning:
https://aws.amazon.com/premiumsupport/knowledge-center/datapipeline-override-instance-type/?nc1=h_ls
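The knowledge-center workaround boils down to editing the EmrCluster object in Architect and pinning the instance types yourself. In pipeline-definition JSON that looks roughly like this (a sketch; field names are from the Data Pipeline EmrCluster object reference, and the values are illustrative — note that leaving resizing enabled can override your instance types back to m3.xlarge):

```json
{
  "id": "EmrClusterForBackup",
  "type": "EmrCluster",
  "releaseLabel": "emr-5.24.0",
  "masterInstanceType": "m4.xlarge",
  "coreInstanceType": "m4.xlarge",
  "coreInstanceCount": "1",
  "region": "us-east-1",
  "resizeClusterBeforeRunning": "false"
}
```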
I faced the same error, but just changing from m3.xlarge to m4.xlarge didn't solve the problem. The DynamoDB table I was trying to export was in eu-west-2, but at the time of writing Data Pipeline is not available in eu-west-2. I found I had to edit the pipeline to change the following:
Instance type from m3.xlarge to m4.xlarge
Release label from emr-5.23.0 to emr-5.24.0 (not strictly necessary for export, but required for import [1])
Hardcode the region to eu-west-2
So the end result was:
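In pipeline-definition terms, the edited EmrCluster object ends up looking roughly like this (a sketch; field names per the Data Pipeline EmrCluster reference, values taken from the three changes above):

```json
{
  "id": "EmrClusterForBackup",
  "type": "EmrCluster",
  "releaseLabel": "emr-5.24.0",
  "masterInstanceType": "m4.xlarge",
  "coreInstanceType": "m4.xlarge",
  "region": "eu-west-2"
}
```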
[1] From: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-prereq.html
On-Demand Capacity works only with EMR 5.24.0 or later
DynamoDB tables configured for On-Demand Capacity are supported only when using Amazon EMR release version 5.24.0 or later. When you use a template to create a pipeline for DynamoDB, choose Edit in Architect and then choose Resources to configure the Amazon EMR cluster that AWS Data Pipeline provisions. For Release label, choose emr-5.24.0 or later.
As title.
We have stage and prod EMR clusters, and we need to run the emrfs delete s3_path command on both clusters via Jenkins jobs.
However, while emrfs delete runs successfully on the stage EMR cluster, it fails on prod. Below is the log:
22:54:51 Clear meta-store before loading into DW table.
22:54:51 ----------------------------------------------------------
22:54:51
22:54:51 Pseudo-terminal will not be allocated because stdin is not a terminal.
22:54:52 19/03/23 02:54:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22:54:53 EmrFsApplication.scala(91): dynamoDB endPoint = dynamodb.us-east-1.amazonaws.com
22:54:53 EmrFsApplication.scala(99): s3 endPoint = s3.amazonaws.com
22:54:53 EmrFsApplication.scala(107): sqs endPoint = sqs.us-east-1.amazonaws.com
22:54:54 Metadata 'EmrFSMetadata' does not exist
I don't know why EmrFSMetadata does not exist on my prod EMR. Is there some special setting that needs to be applied to the prod cluster?
Thanks.
Ah, I think I got the answer: our prod EMR doesn't have EMRFS consistent view enabled.
That's why the metadata store doesn't exist.
Question can be closed.
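For completeness: consistent view is what creates and populates the EmrFSMetadata DynamoDB table, so a cluster launched without it has no metadata store for emrfs commands to operate on. It can be enabled at cluster creation time with the emrfs-site configuration classification (a sketch; property names are from the EMR consistent-view documentation, and the table name shown is the default):

```json
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.consistent": "true",
      "fs.s3.consistent.metadata.tableName": "EmrFSMetadata"
    }
  }
]
```

This would need to be applied when the prod cluster is created; it cannot be toggled on a running cluster.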
I have entered AWS credentials in Jenkins at /credentials, however they do not show up in the drop down list for the Post Build steps in the AWS Elastic Beanstalk plugin.
If I click Validate Credentials, I get this strange error.
Failure
com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [EnvironmentVariableCredentialsProvider: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)), SystemPropertiesCredentialsProvider: Unable to load AWS credentials from Java system properties (aws.accessKeyId and aws.secretKey), com.amazonaws.auth.profile.ProfileCredentialsProvider#5c932b96: profile file cannot be null, com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper#32abba7: The requested metadata is not found at http://169.254.169.254/latest/meta-data/iam/security-credentials/]
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:136)
I don't know where it got that IP address. When I search for that IP in the Jenkins directory, I turn up with
-bash-4.2$ grep -r 169.254.169.254 *
plugins/ec2/AMI-Scripts/ubuntu-init.py:conn = httplib.HTTPConnection("169.254.169.254")
The contents of that file is here: https://pastebin.com/3ShanSSw
There are actually 2 different Amazon Elastic Beanstalk plugins.
AWSEB Deployment Plugin, v 0.3.19, Aldrin Leal
AWS Beanstalk Publisher Plugin, v 1.7.4, David Tanner
Neither of them works; neither displays the credentials in the drop-down list. Since updating Jenkins, I am unable even to show "Deploy to Elastic Beanstalk" as a post-build step for the first one (v0.3.19), even though it is the only one installed.
For the 2nd plugin (v1.7.4), I see this screen shot:
When I fill in what I can, and run it, it gives the error
No credentials provided for build!!!
Environment found (environment id='e-yfwqnurxh6', name='appenvironment'). Attempting to update environment to version label 'sprint5-13'
'appenvironment': Attempt 0/5
'appenvironment': Problem:
com.amazonaws.services.elasticbeanstalk.model.AWSElasticBeanstalkException: No Application Version named 'sprint5-13' found. (Service: AWSElasticBeanstalk; Status Code: 400; Error Code: InvalidParameterValue; Request ID: af9eae4f-ad56-426e-8fe4-4ae75548f3b1)
I tried to add an S3 sub-task to the Elastic Beanstalk deployment, but it failed with an exception.
No credentials provided for build!!!
Root File Object is a file. We assume its a zip file, which is okay.
Uploading file awseb-4831053374102655095.zip as s3://appname-sprint5-15.zip
ERROR: Build step failed with exception
com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 7C4734153DB2BC36; S3 Extended Request ID: x7B5HflSeiIw++NGosos08zO5DxP3WIzrUPkZOjjbBv856os69QRBVgic62nW3GpMtBj1IxW7tc=), S3 Extended Request ID: x7B5HflSeiIw++NGosos08zO5DxP3WIzrUPkZOjjbBv856os69QRBVgic62nW3GpMtBj1IxW7tc=
Jenkins is hopelessly out of date and unmaintained. I added the Post Build Task plugin, installed the eb tool as the jenkins user, ran eb init in the job directory, and edited .elasticbeanstalk/config.yml to add the lines:
deploy:
  artifact: target/AppName-Sprint5-SNAPSHOT-bin.zip
Then I entered the shell command to deploy the build:
/var/lib/jenkins/.local/bin/eb deploy -l sprint5-${BUILD_NUMBER}
For the Elastic Beanstalk plugin, the right place to configure the AWS keys is the Jenkins master configuration page:
http://{jenkinsURL}/configure
Problem:
I have an EMR cluster (along with a number of other resources) defined in a CloudFormation template. I use the AWS REST API to provision my stack, and it works; I can provision the stack successfully.
Then I made one change: I specified a custom AMI for my EMR cluster. Now my stack creation fails because the EMR provisioning fails. The only information I can find is an error on the console: "null: Error provisioning instances." Digging into each instance, I see that the master node failed with "Status: Terminated. Last state change reason: Time out occurred during bootstrap".
I have S3 logging configured for my EMR cluster, but there are no logs in the S3 bucket.
Details:
I updated my CloudFormation template like so:
my_stack.cfn.yaml:
rMyEmrCluster:
  Type: AWS::EMR::Cluster
  ...
  Properties:
    ...
    CustomAmiId: "ami-xxxxxx" # <-- I added this
Custom AMI details:
I am adding a custom AMI because I need to encrypt the root EBS volume on all of my nodes. (This is required per documentation)
The steps I took to create my custom AMI:
I launched the base AMI that is used by AWS for EMR nodes: emr 5.7.0-ami-roller-27 hvm ebs (ID: ami-8a5cb8f3)
I created an image from my running instance
I created a copy of this image with EBS root-volume encryption enabled, using the default encryption key. (I had to create my own base image from a running instance, because you are not allowed to create an encrypted copy of an AMI you don't own.)
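The steps above can be sketched with the AWS CLI (all IDs and names below are placeholders; `copy-image` with `--encrypted` performs the encrypted copy using the default KMS key when no key is specified):

```shell
# 1. Launch the base AMI used by EMR nodes (instance type is illustrative)
aws ec2 run-instances --image-id ami-8a5cb8f3 --instance-type m4.large

# 2. Create an (unencrypted) image from the running instance
aws ec2 create-image --instance-id i-0123456789abcdef0 \
    --name "emr-base-unencrypted"

# 3. Copy the image with EBS encryption enabled
aws ec2 copy-image --source-image-id ami-11111111 --source-region us-east-1 \
    --region us-east-1 --name "emr-base-encrypted" --encrypted
```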
I wonder if this might be a permissions issue, or perhaps my AMI is misconfigured in some way. But it would be prudent for me to find some logs first, to figure out exactly what is going wrong with node provisioning.
I feel stupid. I accidentally used a completely unrelated AMI (a Red Hat 7 image) as the base image, instead of the AMI that EMR uses for its nodes by default: emr 5.7.0-ami-roller-27 hvm ebs (ami-8a5cb8f3)
I'll leave this question and answer up in case someone else makes the same mistake.
Make sure you create your custom AMI from the correct base AMI: emr 5.7.0-ami-roller-27 hvm ebs (ami-8a5cb8f3)
You mention that you created your custom AMI based on an EMR AMI. However, according to the documentation you linked, you should actually base your AMI on "the most recent EBS-backed Amazon Linux AMI". Your custom AMI does not need to be based on an EMR AMI, and indeed I suppose that doing so could cause some problems (though I have not tried it myself).
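If you go the documented route, the most recent EBS-backed Amazon Linux AMI can be looked up programmatically rather than by browsing the console; one way (assuming the public SSM parameter namespace that AWS publishes for Amazon Linux AMIs) is:

```shell
# Resolve the latest EBS-backed Amazon Linux AMI ID for the current region
aws ssm get-parameters \
    --names /aws/service/ami-amazon-linux-latest/amzn-ami-hvm-x86_64-ebs \
    --query 'Parameters[0].Value' --output text
```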
I am working on EMR template with autoscaling.
While a static EMR setup with an instance group works fine, I cannot attach an
AWS::ApplicationAutoScaling::ScalableTarget
As troubleshooting, I've split my template into two separate ones. The first creates a normal EMR cluster (which works fine). The second has a ScalableTarget definition, which fails to attach with this error:
11:29:34 UTC+0100 CREATE_FAILED AWS::ApplicationAutoScaling::ScalableTarget AutoscalingTarget EMR instance group doesn't exist: Failed to find Cluster XXXXXXX
Funny thing is that this cluster DOES exist.
I also had a look at IAM roles but everything seems to be ok there...
Can anyone advise on this matter?
Has anyone gotten an autoscaling instance group to work via CloudFormation?
I have already tried this and raised a request with AWS. This autoscaling feature is not yet available through CloudFormation. For now, I use CloudFormation for the custom EMR security group, S3, etc., and in the Outputs tab I emit the command-line command (aws emr create-cluster ...). After getting the output, I run that command to launch the cluster.
Actually, autoscaling can be enabled at cluster launch time by using --auto-scaling-role. If we use CloudFormation for EMR, the autoscaling feature is not available because CloudFormation launches the cluster without --auto-scaling-role.
I hope this can be useful...
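To make that concrete: with the CLI route, you would pass something like `aws emr create-cluster ... --auto-scaling-role EMR_AutoScaling_DefaultRole --instance-groups file://instance-groups.json`, where the instance-groups file attaches the scaling policy directly to the instance group. A sketch of that file (structure per the EMR automatic-scaling documentation; the metric, thresholds, and sizes are illustrative only):

```json
[
  {
    "InstanceGroupType": "CORE",
    "InstanceType": "m4.xlarge",
    "InstanceCount": 2,
    "AutoScalingPolicy": {
      "Constraints": { "MinCapacity": 2, "MaxCapacity": 10 },
      "Rules": [
        {
          "Name": "ScaleOutOnLowYarnMemory",
          "Action": {
            "SimpleScalingPolicyConfiguration": {
              "AdjustmentType": "CHANGE_IN_CAPACITY",
              "ScalingAdjustment": 1,
              "CoolDown": 300
            }
          },
          "Trigger": {
            "CloudWatchAlarmDefinition": {
              "ComparisonOperator": "LESS_THAN",
              "EvaluationPeriods": 1,
              "MetricName": "YARNMemoryAvailablePercentage",
              "Namespace": "AWS/ElasticMapReduce",
              "Period": 300,
              "Statistic": "AVERAGE",
              "Threshold": 15.0,
              "Unit": "PERCENT"
            }
          }
        }
      ]
    }
  }
]
```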