Custom ACL for EMR Hive output objects written to S3

set fs.s3.canned.acl=BucketOwnerFullControl;
The line above is an example of configuring EMR Hive jobs to write objects to S3 using a canned ACL (http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-s3-acls.html).
I was wondering if I can set a custom ACL in the same way.
Use case:
EMR writes to S3 in regionA, and the bucket is then replicated to regionB, where Athena queries the replicated objects. Even though the regionB account owns the destination bucket, the objects replicated into it from regionA are not owned by regionB.
So if anyone knows a way to set the ACL of the objects to allow cross-account reads, I would appreciate the help.
Thanks.

EMR doesn't support writing objects with custom ACLs to s3 at the moment.
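Since only canned ACLs can be applied from the job itself, one possible workaround (not something EMR does for you; the bucket name, prefix and canonical user IDs below are placeholders) is to grant the cross-account read after the objects have been written, for example with the AWS CLI:
# Placeholder names/IDs: grant read to the other account on each object under a prefix.
# Note: put-object-acl replaces the whole ACL, so the owner's full-control grant is re-added.
for key in $(aws s3api list-objects-v2 --bucket my-emr-output-bucket \
    --prefix hive/output/ --query 'Contents[].Key' --output text); do
  aws s3api put-object-acl --bucket my-emr-output-bucket --key "$key" \
      --grant-read id="$DEST_ACCOUNT_CANONICAL_ID" \
      --grant-full-control id="$OWNER_CANONICAL_ID"
done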

Related

Error 400: Bad request in Amazon SageMaker Ground Truth text Labeling task

I am trying to use AWS to annotate my text data. It's a CSV of 10 rows including the header "orgiginalText, replyText" and the text data. I put my data in an S3 bucket and created an IAM role with S3 and SageMaker full access. When I want to 'Create labeling job', it gives me a 400 Bad Request error connecting to S3. Is there anything else to be considered? I have been stuck on this small task for 2 days and can't move forward.
A few things to check:
Can you access the S3 bucket by other means, like the AWS CLI from your local machine or from an EC2 instance?
Can you access the S3 bucket from a SageMaker notebook instance using the same execution role you created for SageMaker Ground Truth?
Does this issue persist only with this particular bucket? Did you try creating another bucket, copying the data there, and pointing to that bucket instead?
Are you using the S3 bucket and the Ground Truth labeling job in the same AWS region?
The 400 error can be due to a variety of reasons. From the S3 perspective, it may happen when the bucket is in a transitional state, such as being created or deleted.
From the SageMaker side, it can be due to many reasons, a few of which are listed here: https://docs.amazonaws.cn/sagemaker/latest/APIReference/CommonErrors.html
Please try the above approaches and let me know your findings.
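As a rough way to test the first two checks from the CLI (bucket and region names below are placeholders), something like this can confirm which identity you are using, whether it can list the bucket, and which region the bucket is in:
# Placeholder bucket/region; run with the same credentials or execution role used by Ground Truth
aws sts get-caller-identity
aws s3 ls s3://my-labeling-bucket/ --region us-east-1
aws s3api get-bucket-location --bucket my-labeling-bucket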

Moving Across AWS Regions: us-east-1 to us-east-2

I have the following currently created in the AWS us-east-1 region and, per the request of our AWS architect, I need to move it all to us-east-2, completely, and continue developing in us-east-2 only. What are the easiest options, with the least work and coding (as this is a one-time deal), to move everything?
S3 bucket with a ton of folders and files.
Lambda function.
AWS Glue database with a ton of crawlers.
AWS Athena with a ton of tables.
Thank you so much for taking a look at my little challenge :)
There is no easy answer for your situation. There are no simple ways to migrate resources between regions.
Amazon S3 bucket
You can certainly create another bucket and then copy the content across, either using the AWS Command-Line Interface (CLI) aws s3 sync command or, for a huge number of files, using S3DistCp running under Amazon EMR.
If there are previous Versions of objects in the bucket, it's not easy to replicate them. Hopefully you have Versioning turned off.
Also, it isn't easy to get the same bucket name in the other region. Hopefully you will be allowed to use a different bucket name. Otherwise, you'd need to move the data elsewhere, delete the bucket, wait a day, create the same-named bucket in another region, then copy the data across.
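As a minimal sketch with placeholder bucket names, a single cross-region sync could look like this:
# Placeholder bucket names; copies current objects (not old versions) from us-east-1 to us-east-2
aws s3 sync s3://my-old-bucket s3://my-new-bucket \
    --source-region us-east-1 --region us-east-2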
AWS Lambda function
If it's just a small number of functions, you could simply recreate them in the other region. If the code is stored in an Amazon S3 bucket, you'll need to move the code to a bucket in the new region.
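As a sketch of recreating a single function with the CLI (function name, runtime, handler and role ARN below are placeholders; environment variables, layers and triggers would need to be recreated separately):
# Download the current deployment package from us-east-1 (placeholder names)
URL=$(aws lambda get-function --function-name my-function --region us-east-1 \
      --query 'Code.Location' --output text)
curl -o my-function.zip "$URL"
# Recreate the function in us-east-2
aws lambda create-function --region us-east-2 --function-name my-function \
    --runtime python3.9 --handler lambda_function.lambda_handler \
    --role arn:aws:iam::123456789012:role/my-lambda-role \
    --zip-file fileb://my-function.zip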
AWS Glue
Not sure about this one. If you're moving the data files, you'll need to recreate the database anyway. You'll probably need to create new jobs in the new region (but I'm not that familiar with Glue).
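One rough approach (the database name below is a placeholder) is to dump the existing catalog definitions with the CLI and use them as a reference when recreating the database, tables and crawlers in the new region:
# Placeholder database name; dump the existing catalog entries from us-east-1 for reference
aws glue get-databases --region us-east-1
aws glue get-tables --database-name my_db --region us-east-1 > my_db_tables.json
# Recreate the database in us-east-2; tables and crawlers would then be recreated
# pointing at the new S3 bucket (e.g. via aws glue create-table / create-crawler)
aws glue create-database --region us-east-2 --database-input '{"Name": "my_db"}'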
Amazon Athena
If your data is moving, you'll need to recreate the tables anyway. You can use the Athena interface to show the DDL commands required to recreate a table. Then, run those commands in the new region, pointing to the new S3 bucket.
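The DDL can also be pulled via the CLI as a rough alternative to the console (database, table and results location below are placeholders), then edited to point at the new bucket and re-run in us-east-2:
# Placeholder names; fetch the CREATE TABLE statement for an existing table in us-east-1
QID=$(aws athena start-query-execution --region us-east-1 \
      --query-string "SHOW CREATE TABLE my_db.my_table" \
      --result-configuration OutputLocation=s3://my-athena-results/ \
      --query 'QueryExecutionId' --output text)
aws athena get-query-results --region us-east-1 --query-execution-id "$QID"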
AWS Support
If this is an important system for your company, it would be prudent to subscribe to AWS Support. They can provide advice and guidance for these types of situations, and might even have some tools that can assist with a migration. The cost of support would be minor compared to the savings in your time and effort.
Is it possible for you to create CloudFormation stacks (from existing resources) using the console, then copy the contents of those stacks and run them in the other region (replacing values where they need to be)?
See this link: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/resource-import-new-stack.html
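A rough sketch of that import flow with the CLI (stack, change set and file names are placeholders; template.yaml must describe the existing resources with DeletionPolicy: Retain, and resources.json lists their identifiers):
# Import existing resources into a new stack in us-east-1 (placeholder names/files)
aws cloudformation create-change-set --region us-east-1 \
    --stack-name my-imported-stack --change-set-name import-existing \
    --change-set-type IMPORT \
    --resources-to-import file://resources.json \
    --template-body file://template.yaml
aws cloudformation execute-change-set --region us-east-1 \
    --stack-name my-imported-stack --change-set-name import-existing
# The same template can then be deployed in us-east-2 with aws cloudformation create-stack,
# after replacing region-specific values such as bucket names and ARNs.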

Is it possible to replicate a specific S3 folder between 2 buckets?

Does anyone know if it is possible to replicate just a folder of a bucket between 2 buckets using AWS S3 replication feature?
P.S.: I don't want to replicate the entire bucket, just one folder of the bucket.
If it is possible, what configuration do I need to add to filter that folder in the replication?
Yes. Amazon S3's Replication feature allows you to replicate objects at a prefix (say, folder) level from one S3 bucket to another, within the same region or across regions.
From the AWS S3 Replication documentation,
The objects that you want to replicate — You can replicate all of the objects in the source bucket or a subset. You identify a subset by providing a key name prefix, one or more object tags, or both in the configuration.
For example, if you configure a replication rule to replicate only objects with the key name prefix Tax/, Amazon S3 replicates objects with keys such as Tax/doc1 or Tax/doc2. But it doesn't replicate an object with the key Legal/doc3. If you specify both prefix and one or more tags, Amazon S3 replicates only objects having the specific key prefix and tags.
Refer to this guide on how to enable replication using the AWS console. Step 4 talks about enabling replication at the prefix level. The same can be done via CloudFormation and the CLI as well.
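For example, with the CLI it could look like this (bucket names and role ARN are placeholders; both buckets need versioning enabled first):
# Placeholder bucket/role names; replicates only objects under the Tax/ prefix
aws s3api put-bucket-replication --bucket source-bucket \
    --replication-configuration '{
      "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
      "Rules": [{
        "ID": "ReplicateTaxPrefix",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {"Prefix": "Tax/"},
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"}
      }]
    }'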
Yes, you can do this using the Cross-Region Replication feature. You can replicate the objects either in the same region or in a different one. The replicated objects in the new bucket keep their original storage class, object names and object permissions.
However, you can change the owner of the replicas to the owner of the destination bucket.
That said, this feature has some limitations:
You cannot replicate objects that were already present in the source bucket before you created the replication rule; only objects created after the rule is in place are replicated.
You cannot use SSE-C encryption in replication.
You can do this with the sync command.
aws s3 sync s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME
You must grant the destination account the permissions to perform the cross-account copy.
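To copy only one folder (prefix) rather than the whole bucket, the same command can be pointed at the prefix, for example:
# Copies only the objects under my-folder/ (folder name is a placeholder)
aws s3 sync s3://SOURCE_BUCKET_NAME/my-folder/ s3://NEW_BUCKET_NAME/my-folder/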

AWS S3: Is there a way to replicate objects from destination to source bucket

We have S3 replication infrastructure in place to redirect PUTs/GETs to the replica (destination) S3 bucket if the primary (source) is down.
But I'm wondering how to copy objects from the destination bucket back to the source once the primary is restored.
You can use Cross-Region Replication - Amazon Simple Storage Service.
This can also be configured for bi-directional sync:
Configure CRR for bucket-1 to replicate to bucket-2
Configure CRR for bucket-2 to replicate to bucket-1
I tested it and it works!
CRR requires that you have Versioning activated in both buckets. This means that if objects are overwritten, then the previous versions of those objects are still retained. You will be charged for storage of all the versions of each object. You can delete old versions if desired.
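As a minimal sketch with placeholder bucket names, the prerequisite versioning can be enabled like this before adding the two mirrored replication rules:
# Versioning must be enabled on both buckets before replication rules can be created
aws s3api put-bucket-versioning --bucket bucket-1 \
    --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket bucket-2 \
    --versioning-configuration Status=Enabled
# Then create one replication rule on bucket-1 targeting bucket-2,
# and a mirror rule on bucket-2 targeting bucket-1.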

How to keep both data on aws s3 and glacier

I want to keep a backup of an AWS S3 bucket. If I use Glacier, it will archive the files from the bucket and move them to Glacier, but it will also remove the files from S3. I don't want to delete the files from S3. One option is to try an EBS volume: you can mount the AWS S3 bucket with s3fs and copy it to the EBS volume. Another way is to do an rsync of the existing bucket to a new bucket which will act as a clone. Is there any other way?
What you are looking for is cross-region replication:
https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/
Set up versioning and set up the replication.
On the target bucket, you could set up a lifecycle policy to archive to Glacier (or you could just use the bucket as a backup as-is).
(This will only work between two regions, i.e. the buckets cannot be in the same region.)
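For example, an archive rule on the target bucket could look like this (the bucket name and the 30-day delay are placeholders):
# Placeholder bucket name; transitions all objects to Glacier 30 days after creation
aws s3api put-bucket-lifecycle-configuration --bucket my-backup-bucket \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "ArchiveToGlacier",
        "Filter": {"Prefix": ""},
        "Status": "Enabled",
        "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]
      }]
    }'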
If you want your data to be present in both the primary and backup locations, then this is more of a data replication use case.
Consider using AWS Lambda, which is an event-driven compute service.
You can write a simple piece of code to copy the data wherever you want; it will execute every time there is a change in the S3 bucket.
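As a sketch of the wiring (the bucket name and function ARN are placeholders; the function must already contain the copy logic and must grant s3.amazonaws.com permission to invoke it), the bucket notification could be configured like this:
# Placeholder names; invokes the copy function on every new object in the primary bucket
aws s3api put-bucket-notification-configuration --bucket my-primary-bucket \
    --notification-configuration '{
      "LambdaFunctionConfigurations": [{
        "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:copy-to-backup",
        "Events": ["s3:ObjectCreated:*"]
      }]
    }'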
For more info check the official documentation.