Writing parquet file from transient EMR Spark cluster fails on S3 credentials - amazon-web-services

I am creating a transient EMR Spark cluster programmatically, reading a vanilla S3 object, converting it to a Dataframe and writing a parquet file.
Running on a local cluster (with S3 credentials provided) everything works.
Spinning up a transient cluster and submitting the job fails on the write to S3 with the error:
AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records.
But my job is able to read the vanilla object from S3, and it logs to S3 correctly. Additionally, I see that EMR_EC2_DefaultRole is set as the EC2 instance profile, that EMR_EC2_DefaultRole has the proper S3 permissions, and that my bucket has a policy set for EMR_EC2_DefaultRole.
I gather that the 'filesystem' I am trying to write the parquet file to is special, but I cannot figure out what needs to be set for this to work.

Arrrrgggghh! Basically as soon as I had posted my question, the light bulb went off.
In my Spark job I had
val cred: AWSCredentials = new DefaultAWSCredentialsProviderChain().getCredentials
session.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", cred.getAWSAccessKeyId)
session.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", cred.getAWSSecretKey)
which were necessary when running locally in a test cluster, but clobbered the good values when running on EMR. I changed the block to
overrideCredentials.foreach { cred =>
  session.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", cred.getAWSAccessKeyId)
  session.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", cred.getAWSSecretKey)
}
and pushed the credential retrieval into my test harness (which is, of course, where it should have been all the time.)
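For context, here is a minimal sketch of that pattern; the Option-valued overrideCredentials parameter and the test-harness wiring are assumptions for illustration, not the exact original code:

import com.amazonaws.auth.{AWSCredentials, DefaultAWSCredentialsProviderChain}
import org.apache.spark.sql.SparkSession

// Only the local test harness passes Some(credentials); on EMR this is None, so the
// instance profile (EMR_EC2_DefaultRole) supplies credentials untouched.
def configureS3(session: SparkSession, overrideCredentials: Option[AWSCredentials]): Unit =
  overrideCredentials.foreach { cred =>
    session.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", cred.getAWSAccessKeyId)
    session.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", cred.getAWSSecretKey)
  }

// In the test harness only (hypothetical):
// configureS3(session, Some(new DefaultAWSCredentialsProviderChain().getCredentials))
// On EMR:
// configureS3(session, None)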

If you are running on EC2 with the open-source Hadoop S3 connectors (that is, not on EMR), use the S3A connector, as it includes the EC2 IAM credential provider as the last entry in its default chain of credential providers.
The instance's IAM credentials are short-lived and include a session token: if you copy them out by hand you will need to refresh them at least every hour and set all three items: access key, secret key and session token.
As I said: S3A handles this, with its IAM credential provider issuing a new GET against the instance metadata HTTP server whenever the previous credentials expire.
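As an illustration, a minimal sketch of a Spark job writing Parquet through s3a:// while relying on the instance profile; the bucket paths are placeholders, and pinning the provider explicitly is optional since S3A falls back to it by default:

import org.apache.spark.sql.SparkSession

val session = SparkSession.builder().appName("parquet-to-s3a").getOrCreate()

// Optional: pin the credential provider to the EC2 instance profile instead of copying keys.
session.sparkContext.hadoopConfiguration.set(
  "fs.s3a.aws.credentials.provider",
  "com.amazonaws.auth.InstanceProfileCredentialsProvider")

// Read and write through the s3a:// scheme; no access or secret keys are set anywhere.
val df = session.read.textFile("s3a://my-input-bucket/input.txt").toDF("line")
df.write.parquet("s3a://my-output-bucket/output/")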

Related

Jenkins job trigger when s3 bucket updated

I am looking for a way to trigger a Jenkins job whenever the S3 bucket is updated with a file of a particular format.
I have tried the Lambda function method with an "Add trigger -> S3 bucket PUT" event. I have followed this article, but it's not working. I have also found that AWS SNS and AWS SQS can be used for this, but some say that approach is outdated. So which is the simplest way to trigger a Jenkins job when the S3 bucket is updated?
I just want a trigger whenever the zip file produced by job A in jenkins1 is uploaded to the S3 bucket called 'testbucket' in AWS environment2. The two Jenkins instances are in different AWS accounts, each in its own private VPC. (A diagram of the Jenkins workflow accompanied the original question.)
The approach you are using seems solid and a good way to go. I'm not sure what specific issue you are having, so I'll list a couple of things that could explain why this might not be working for you:
Permissions issue - Check that the Lambda can be invoked by the S3 service. If you are setting this up in the console (manually) then you probably don't have to worry about it, since the permissions should be set up automatically. If you're doing this through infrastructure as code, it's something you need to add.
Lambda VPC config - Your Lambda will need to run in a subnet of the same VPC that the Jenkins instance runs in. By default a Lambda is not associated with any VPC and will not have access to the Jenkins instance (unless Jenkins is publicly reachable over the internet).
I found this other Stack Overflow post that describes the SNS/SQS setup if you want to continue down that path: Trigger Jenkins job when a S3 file is updated
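If it helps, here is a rough sketch of what the Lambda side could look like, written against the aws-lambda-java-events library; the environment variables, job name and the use of Jenkins' "Trigger builds remotely" token are assumptions for illustration:

import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.S3Event
import java.net.{HttpURLConnection, URL}
import scala.jdk.CollectionConverters._

class S3ToJenkinsHandler extends RequestHandler[S3Event, java.lang.Void] {
  // Hypothetical configuration, supplied through Lambda environment variables.
  private val jenkinsUrl = sys.env("JENKINS_URL")   // a private address reachable from the VPC
  private val jobName    = sys.env("JENKINS_JOB")
  private val token      = sys.env("JENKINS_TOKEN") // the job's "Trigger builds remotely" token

  override def handleRequest(event: S3Event, context: Context): java.lang.Void = {
    event.getRecords.asScala.foreach { record =>
      val key = record.getS3.getObject.getKey
      if (key.endsWith(".zip")) {                   // only react to the zip artifact
        val conn = new URL(s"$jenkinsUrl/job/$jobName/build?token=$token")
          .openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        context.getLogger.log(s"Triggered $jobName for $key, HTTP ${conn.getResponseCode}")
        conn.disconnect()
      }
    }
    null
  }
}

Remember that the Lambda's VPC configuration and security groups must allow it to reach the Jenkins instance, as noted above.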

Error 400: Bad request in Amazon SageMaker Ground Truth text Labeling task

I am trying to use Amazon AWS to annotate my text data. It's a CSV of 10 rows, including a header ("orgiginalText, replyText") and the text data. I put my data in an S3 bucket and created an IAM role with S3 and SageMaker full-access permissions. When I try to 'Create labeling job', it gives me an error 400 Bad Request when connecting to S3. Is there anything else to be considered? I have been stuck on this small task for 2 days and can't go forward.
A few things to check:
Can you access the S3 bucket by other means, e.g. with the AWS CLI from your local machine or from an EC2 instance?
Can you access the S3 bucket from a SageMaker notebook instance using the same execution role you created for SageMaker GroundTruth?
Does this issue persist with a particular bucket? Did you try creating another bucket, copying the data there and pointing the job at that bucket instead?
Are you using the S3 bucket and Groundtruth labeling job in the same AWS region?
The 400 error can be due to a variety of reasons. From the S3 perspective, it may happen because the bucket is in a transitional state (being created or deleted, etc.).
From SageMaker's side, it can be due to many reasons, a few of which are listed here: https://docs.amazonaws.cn/sagemaker/latest/APIReference/CommonErrors.html
Please try the above approaches and let me know your findings.
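For the first two checks, a quick way to test outside the Ground Truth console is to list the bucket with the AWS SDK using whatever credentials or role the environment provides; the bucket name and region below are placeholders:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.jdk.CollectionConverters._

object CheckS3Access {
  def main(args: Array[String]): Unit = {
    val s3 = AmazonS3ClientBuilder.standard()
      .withRegion("us-east-1")                     // must match the labeling job's region
      .build()

    val request = new ListObjectsV2Request()
      .withBucketName("my-groundtruth-bucket")     // placeholder bucket name
      .withMaxKeys(5)

    // A 400/403 AmazonS3Exception here points at permissions or region, not at Ground Truth itself.
    s3.listObjectsV2(request).getObjectSummaries.asScala.foreach(o => println(o.getKey))
  }
}

Running the same check from a SageMaker notebook instance that uses the Ground Truth execution role helps isolate role-specific permission problems.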

What and how should I provision Terraform remote state S3 bucket and state locking DynamoDB table?

I have multiple git repositories (e.g. cars repo, garage repo) where each one deploys multiple AWS services/resources using terraform .tf files.
I would like each repo to save its state in an S3 remote backend, such that when a repo deploys its resources from the prod or dev workspace, the state is kept in the correct S3 bucket (prod/dev).
The S3 buckets and folders will look something like:
# AWS Prod bucket
terraform_prod_states Bucket:
- Path1: /cars/cars.tfstate
- Path2: /garage/garage.tfstate
# AWS Dev bucket
terraform_dev_states Bucket:
- Path1: /cars/cars.tfstate
- Path2: /garage/garage.tfstate
But prior to having repos deploy and save state in the remote backend, the S3 buckets and permissions need to be set up.
The question:
Who should set up the S3 buckets/permissions/DynamoDB tables (for locking)? What would be best practice?
Options:
Should the S3 buckets and tables be created one time manually from AWS management console?
Should I have a separate repo that is responsible for preparing all the required AWS infrastructure: buckets/permissions/DynamoDB? (In this case, I assume the infra repo should also keep a remote state with locking - who sets that up?)
Should every repo (cars, garage) take care of checking whether the S3 buckets and DynamoDB tables exist and, if required, prepare the remote-state resources for its own use?
Feels like chicken and egg here.
You can add it to the Terraform script, like:
resource "aws_s3_bucket" "b" {
bucket = lower("${terraform.workspace}.mysite${var.project_name}")
acl = "private"
tags = {
Environment = "${terraform.workspace}"
}
}
but many more settings related to security etc. are required; this is all in the documentation, and you can also see this tutorial: https://www.bacancytechnology.com/blog/aws-s3-bucket-using-terraform
Should the S3 buckets and tables be created one time manually from AWS management console?
Not necessarily - I would advise creating a separate folder with a Terraform configuration for managing the state (a one-time setup). I'd call the bucket tfstate, but there are no restrictions whatsoever other than the required IAM permissions. Also, use lifecycle to prevent the bucket from being destroyed via Terraform, and enable versioning. If you would like state locking and consistency, also create a DynamoDB table with a primary key named LockID of type string (this is required) and specify the table name when configuring your remote state.
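A minimal sketch of that one-time setup and of the backend block the other repos would then use (bucket, table and region names are placeholders; the inline acl/versioning syntax matches the provider style used elsewhere in this thread):

# One-time state infrastructure, applied once and rarely touched again
resource "aws_s3_bucket" "tfstate" {
  bucket = "tfstate-example-bucket"
  acl    = "private"

  versioning {
    enabled = true
  }

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

# In each repo (e.g. cars), pointing at the bucket/table above
terraform {
  backend "s3" {
    bucket         = "tfstate-example-bucket"
    key            = "cars/cars.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
  }
}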
Who should set the S3 buckets/permissions/dynamodb tables (for locking)? What will be best practice?
You will, inside the one-time TF state setup. I would keep it as part of the same repository that holds the other TF configurations instead of creating a new repository for a single TF file (which is very likely bound to not change).
Should every repo (cars, garage) take care of checking whether the S3 buckets and DynamoDB tables exist and, if required, prepare the remote-state resources for its own use?
No, don't do any checking - as I've mentioned, just do it manually.
You will always have a chicken-and-egg problem here, as you've noticed, so do it manually; the state for a single S3 bucket is going to be extremely minimal and can be imported into the state if need be.
The goal is to isolate that chicken-and-egg problem to a situation where the chicken won't be laying any more eggs, so to speak - you only need to set up the remote state once, and it rarely changes.
Still much better than manually creating buckets.
We create the TF backend infrastructure using a CloudFormation stack.
Make sure you create a KMS key and alias, an S3 bucket with KMS encryption, and a DynamoDB table.

AWS S3 replication without versioning

I have enabled AWS S3 replication in an account to replicate the same S3 data to another account, and it all works fine. But I don't want to use S3 versioning because of its additional cost.
So is there any other way to accommodate this scenario?
The automated Same-Region Replication (SRR) and Cross-Region Replication (CRR) require versioning to be activated because of the way data is replicated between S3 buckets. For example, a new version of an object might be uploaded while the bucket is still being replicated, which can lead to problems without separate versions.
If you do not wish to retain other versions, you can configure Amazon S3 Lifecycle Rules to expire (delete) older versions.
An alternative method would be to run the AWS CLI aws s3 sync command at regular intervals to copy the data between buckets. This command would need to be run on an Amazon EC2 instance or even your own computer. It could be triggered by a cron schedule (Linux) or a Scheduled Task (Windows).

Pass AWS credentials to PigStorage function

How can I pass AWS credentials (aws_access_key and aws_secret_key) to the Pig PigStorage function?
Thanks
Given this question is tagged with EMR, I am going to assume you are using AWS EMR for the Hadoop cluster. If this is the case, then no further setup is required to access S3. The EMR service automatically configures the Hadoop filesystem (which PigStorage will use) with either the AWS credentials of the user who started the cluster or the instance role requested. Just provide the S3 location and Pig will interface with S3 according to the policy and permissions of that user/role.
A = LOAD 's3://<yourbucket>/<path>/' using PigStorage('\t') as (id:int, field2:chararray, field3:chararray);
I wasn't very explicit and didn't give an example of my use case, sorry. I needed this because I have to use two different AWS access keys, and using something like s3n://access:secret@bucket did not solve it. I solved it by changing the PigStorage function: storing the results in HDFS and, in the cleanupOnSuccess method, invoking a method that uploads the HDFS files to S3 with the credentials. This way I can pass the credentials to the PigStorage function when it is used to store; of course, I also changed the PigStorage constructor to receive these arguments.