Error running emrfs delete - Metadata 'EmrFSMetadata' does not exist - amazon-web-services

As title.
We have stage and prod EMR clusters, and we need to run the emrfs delete s3_path command on both clusters via Jenkins jobs.
However, emrfs delete succeeds on the stage EMR cluster but fails on prod. Below is the log:
22:54:51 Clear meta-store before loading into DW table.
22:54:51 ----------------------------------------------------------
22:54:51
22:54:51 Pseudo-terminal will not be allocated because stdin is not a terminal.
22:54:52 19/03/23 02:54:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22:54:53 EmrFsApplication.scala(91): dynamoDB endPoint = dynamodb.us-east-1.amazonaws.com
22:54:53 EmrFsApplication.scala(99): s3 endPoint = s3.amazonaws.com
22:54:53 EmrFsApplication.scala(107): sqs endPoint = sqs.us-east-1.amazonaws.com
22:54:54 Metadata 'EmrFSMetadata' does not exist
I don't know why 'EmrFSMetadata' does not exist on my prod EMR cluster. Are there some special settings that need to be applied to the prod one?
Thanks.

Ah, I think I found the answer: our prod EMR cluster does not have EMRFS consistent view enabled.
That's why the metadata table is missing.
Question can be closed.
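For anyone who hits the same message: the 'EmrFSMetadata' DynamoDB table only exists when EMRFS consistent view is enabled, so emrfs delete has nothing to operate on otherwise. A rough way to confirm this from the master node (the config path below is an assumption and may vary by EMR release):
# Consistent view is on only if fs.s3.consistent is set to true in emrfs-site.xml
grep -A 1 fs.s3.consistent /usr/share/aws/emr/emrfs/conf/emrfs-site.xml
# If it is on, this prints the backing DynamoDB metadata table (EmrFSMetadata by default)
emrfs describe-metadata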

Related

AWS Database Migration Service (DMS) - start replication task issue

I have created an AWS DMS replication instance, a replication task, and source and target endpoints using Terraform.
Now, when I run start-replication-task from the Windows AWS CLI, it throws this SSL error:
Error running command 'aws dms start-replication-task --start-replication-task-type start-replication --replication-task-arn arn:aws:dms:us-west-2:accountnumber:task:xxxxxxx': exit status 254. Output: C:\Program Files\Amazon\AWSCLIV2\urllib3\connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host 'dms.us-east-1.amazonaws.com'. Adding certificate verification is strongly advised.
My CLI version is aws-cli/2.1.6 Python/3.7.9 Windows/10 exe/AMD64 prompt/off.
There is no proxy configured.
Any suggestions on this issue?
Thanks.
Make sure your credentials are set correctly for the account and region where your replication task lives, by putting the right profile under [default] in ~/.aws/credentials.
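For example, assuming the task lives in us-west-2 (as in the ARN above), the default profile would look something like this, with placeholder values:
# ~/.aws/credentials
[default]
aws_access_key_id = <key for the account that owns the replication task>
aws_secret_access_key = <matching secret>

# ~/.aws/config
[default]
region = us-west-2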
To start the replication task you have to fill in two mandatory fields:
{
"ReplicationTaskArn": "string",
"StartReplicationTaskType": "string"
}
StartReplicationTaskType --> Valid values: start-replication | resume-processing | reload-target
start-replication is valid only for the first run of a newly created task, so you may need to change it to reload-target.
Ref:
https://docs.aws.amazon.com/dms/latest/APIReference/API_StartReplicationTask.html
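Putting that together, rerunning the task would look something like this (ARN copied from the question; the region is assumed to match the task's ARN):
aws dms start-replication-task \
    --replication-task-arn arn:aws:dms:us-west-2:accountnumber:task:xxxxxxx \
    --start-replication-task-type reload-target \
    --region us-west-2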

'm3.xlarge' is not supported in AWS Data Pipeline

I am new to AWS and am trying to run an AWS Data Pipeline that loads data from DynamoDB into S3, but I am getting the error below. Please help.
Unable to create resource for #EmrClusterForBackup_2020-05-01T14:18:47 due to: Instance type 'm3.xlarge' is not supported. (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; Request ID: 3bd57023-95e4-4d0a-a810-e7ba9cdc3712)
I was facing the same problem when my DynamoDB table and S3 bucket were created in the us-east-2 region and the pipeline was in us-east-1, since I was not allowed to create a pipeline in us-east-2.
But once I created the DynamoDB table and S3 bucket in us-east-1 and then the pipeline in the same region, it worked well, even with the m3.xlarge instance type.
It is always good to use the latest generation of instances. They are technologically more advanced and sometimes even cheaper,
so there is no reason to start on older generations; they are kept around mainly for people who already have infrastructure on those machines, i.e. for backward compatibility.
I think this should help you: AWS will force you to use m3 if you use DynamoDBDataNode or resizeClusterBeforeRunning.
https://aws.amazon.com/premiumsupport/knowledge-center/datapipeline-override-instance-type/?nc1=h_ls
I faced the same error, but just changing from m3.xlarge to m4.xlarge didn't solve the problem. The DynamoDB table I was trying to export was in eu-west-2, but at the time of writing Data Pipeline is not available in eu-west-2. I found I had to edit the pipeline to change the following:
Instance type: from m3.xlarge to m4.xlarge
Release label: from emr-5.23.0 to emr-5.24.0 (not strictly necessary for export, but required for import [1])
Region: hardcoded to eu-west-2
So the end result was:
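Roughly, the EmrClusterForBackup resource in the pipeline definition looked like this (a sketch based on the changes above; the id and the remaining fields follow the DynamoDB export template and may differ in your pipeline):
{
  "id": "EmrClusterForBackup",
  "name": "EmrClusterForBackup",
  "type": "EmrCluster",
  "releaseLabel": "emr-5.24.0",
  "region": "eu-west-2",
  "masterInstanceType": "m4.xlarge",
  "coreInstanceType": "m4.xlarge",
  "coreInstanceCount": "1"
}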
[1] From: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-prereq.html
On-Demand Capacity works only with EMR 5.24.0 or later
DynamoDB tables configured for On-Demand Capacity are supported only when using Amazon EMR release version 5.24.0 or later. When you use a template to create a pipeline for DynamoDB, choose Edit in Architect and then choose Resources to configure the Amazon EMR cluster that AWS Data Pipeline provisions. For Release label, choose emr-5.24.0 or later.

spark read from different account s3 and write to my account s3

I have a Spark job which needs to read data from an S3 bucket in another account (the "Data Account") and process it.
Once it is processed, it should write back to an S3 bucket in my own account.
So I configured the access and secret key of the Data Account in my Spark session like below:
val hadoopConf=sc.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key","DataAccountKey")
hadoopConf.set("fs.s3a.secret.key","DataAccountSecretKey")
hadoopConf.set("fs.s3a.endpoint", "s3.ap-northeast-2.amazonaws.com")
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
val df = spark.read.json("s3a://DataAccountS/path")
/* Reading is success */
df.limit(3).write.json("s3a://myaccount/test/")
With this, reading is fine, but I am getting the error below when writing:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 301, AWS Service: Amazon S3, AWS Request ID: A5E574113745D6A0, AWS Error Code: PermanentRedirect, AWS Error Message: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
But if I don't configure the Data Account details and just write some dummy data to my own S3 from Spark, it works.
So how should I configure things so that both reading from the other account's S3 and writing to my account's S3 work?
If your Spark classpath has the Hadoop 2.7 JARs on it, you can use secrets-in-the-path as the technique, i.e. a URL like s3a://DataAccountKey:DataAccountSecretKey@DataAccount/path. Be aware this will log the secrets everywhere.
Hadoop 2.8+ JARs will tell you off for logging your secrets everywhere, but add per-bucket bindings:
spark.hadoop.fs.s3a.bucket.DataAccount.access.key DataAccountKey
spark.hadoop.fs.s3a.bucket.DataAccount.secret.key DataAccountSecretKey
spark.hadoop.fs.s3a.bucket.DataAccount.endpoint s3.ap-northeast-2.amazonaws.com
Then, for all interaction with that bucket, these per-bucket options override the main settings.
Note: if you want to use this, don't think that just dropping hadoop-aws-2.8.jar onto your classpath will work; you'll only get classpath errors. All of the hadoop-* JARs need to move to 2.8, and the AWS SDK needs to be updated too.
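If you would rather set these in code than in spark-defaults, the same per-bucket keys can go through the Hadoop configuration. A sketch, using the names from the question (assuming the other account's bucket really is DataAccountS, as in the s3a URL above):
// Per-bucket credentials/endpoint for the other account's bucket (Hadoop 2.8+ s3a)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.bucket.DataAccountS.access.key", "DataAccountKey")
hadoopConf.set("fs.s3a.bucket.DataAccountS.secret.key", "DataAccountSecretKey")
hadoopConf.set("fs.s3a.bucket.DataAccountS.endpoint", "s3.ap-northeast-2.amazonaws.com")

// Your own bucket keeps using the base fs.s3a.* settings (or the instance profile)
val df = spark.read.json("s3a://DataAccountS/path")
df.limit(3).write.json("s3a://myaccount/test/")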

How to configure Spark running in local mode on Amazon EC2 to use the IAM role for S3

I'm running Spark 2 in local mode on an Amazon EC2 instance. When I try to read data from S3 I get the following exception:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively)
I can set the AccessKey and the SecretKey manually in the code, but I'd rather not, for security reasons.
The EC2 instance has an IAM role that allows it full access to the relevant S3 bucket. For every other Amazon API call that is sufficient, but Spark seems to be ignoring it.
Can I get Spark to use this IAM role instead of the AccessKey and SecretKey?
Switch to using the s3a:// scheme (with the Hadoop 2.7.x JARs on your classpath) and this happens automatically. The "s3://" scheme with non-EMR versions of Spark/Hadoop is not the connector you want (it's old, non-interoperable, and has been removed from recent versions).
I am using hadoop-2.8.0 and spark-2.2.0-bin-hadoop2.7.
Spark-S3-IAM integration works well with the following AWS packages on the driver:
spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 ...
Scala code snippet:
sc.textFile("s3a://.../file.gz").count()
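If you are on Hadoop 2.8+ and want to be explicit instead of relying on the default credential chain, you can pin s3a to the instance profile (the provider class comes from the AWS SDK bundled with hadoop-aws; the bucket name below is a placeholder):
// Force s3a to take credentials from the EC2 instance profile (the attached IAM role)
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider",
  "com.amazonaws.auth.InstanceProfileCredentialsProvider")
sc.textFile("s3a://my-bucket/file.gz").count()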

AWS Data Pipeline configured EMR cluster running Spark

Please can someone help? I'm trying to do exactly this: I cannot create an EMR environment with a Spark installation from within a Data Pipeline configuration in the AWS console. I choose 'Run job on an EMR cluster', but the EMR cluster is always created with Pig and Hive as default, not Spark.
I understand that I can add Spark as a bootstrap action, as said here. This is what I set up when I do:
Name: xxx.xxxxxxx.processing.dp
Build using a template: Run job on an Elastic MapReduce cluster
Parameters:
EC2 key pair (optional): xxx_xxxxxxx_emr_key
EMR step(s):
spark-submit --deploy-mode cluster s3://xxx.xxxxxxx.scripts.bucket/CSV2Parquet.py s3://xxx.xxxxxxx.scripts.bucket/
EMR Release Label: emr-4.3.0
Bootstrap action(s) (optional): s3://support.elasticmapreduce/spark/install-spark,-v,1.4.0.b
Where does the AMI bit go? And does the above look correct?
Here's the error I get when I activate the data pipeline:
Unable to create resource for #EmrClusterObj_2017-01-13T09:00:07 due to: The supplied bootstrap action(s): 'bootstrap-action.6255c495-578a-441a-9d05-d03981fc460d' are not supported by release 'emr-4.3.0'. (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; Request ID: b1b81565-d96e-11e6-bbd2-33fb57aa2526)
If I specify a later version of EMR, do I get Spark installed by default?
Many thanks for any help here.
Regards.
That install-spark bootstrap action is only for 3.x AMI versions. If you are using a releaseLabel (emr-4.x or beyond), the applications to install are specified in a different way.
I myself have never used Data Pipeline, but I see that if, when you are creating a pipeline, you click "Edit in Architect" at the bottom, you can then click on the EmrCluster node and select Applications from the "Add an optional field..." dropdown. That is where you may add Spark.
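In pipeline-definition terms, that corresponds to something like the following EmrCluster object (a sketch; the id and instance fields are placeholders, and the application name should be whatever the Applications dropdown accepts):
{
  "id": "EmrClusterObj",
  "name": "EmrClusterObj",
  "type": "EmrCluster",
  "releaseLabel": "emr-4.3.0",
  "applications": ["spark"],
  "masterInstanceType": "m4.xlarge",
  "coreInstanceType": "m4.xlarge",
  "coreInstanceCount": "1"
}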