Cross-account S3 copy of 100 million files - amazon-web-services

I have 100 million small CSV files that I have to copy from one AWS account to another.
I tried doing a parallel S3 copy using boto3 and also tried aws s3 sync, but due to the large number of files I could not get it done in a reasonable amount of time.
Is there any way to copy this many files from an S3 bucket in one account to a bucket in another account?

You can:
Generate a list of objects by using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects
Pass the list to S3 Batch Operations and configure it to perform a Copy operation
See: Cross-account bulk transfer of files using Amazon S3 Batch Operations | AWS Storage Blog
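If you prefer to script the job creation, here is a minimal boto3 sketch (not a full implementation): the account ID, bucket names, Batch Operations role ARN, and inventory manifest location/ETag are placeholders you would replace with your own values, and the manifest is assumed to be an S3 Inventory CSV report.
import boto3

# Sketch only: IDs, ARNs, and ETag below are placeholders.
s3control = boto3.client("s3control", region_name="us-east-1")

response = s3control.create_job(
    AccountId="111111111111",  # destination account
    ConfirmationRequired=False,
    Priority=10,
    RoleArn="arn:aws:iam::111111111111:role/batch-operations-copy-role",
    Operation={
        "S3PutObjectCopy": {
            "TargetResource": "arn:aws:s3:::awsexampledestinationbucket"
        }
    },
    Manifest={
        # Use the S3 Inventory report as the job manifest.
        "Spec": {"Format": "S3InventoryReport_CSV_20161130"},
        "Location": {
            "ObjectArn": "arn:aws:s3:::inventory-bucket/manifest.json",
            "ETag": "exampleetag",
        },
    },
    Report={
        "Bucket": "arn:aws:s3:::awsexampledestinationbucket",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "Prefix": "batch-op-reports",
        "ReportScope": "FailedTasksOnly",
    },
)
print(response["JobId"])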

Imagine you want to transfer files between accounts (A & B).
Attach a bucket policy to the source bucket in Account A
1. Get the Amazon Resource Name (ARN) of the IAM identity (user or role) in Account B (destination account).
2. From Account A, attach a bucket policy to the source bucket that allows the IAM identity in Account B to get objects:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DelegateS3Access",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::222222222222:user/Jane"},
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::awsexamplesourcebucket/*",
        "arn:aws:s3:::awsexamplesourcebucket"
      ]
    }
  ]
}
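If you want to apply that bucket policy programmatically from Account A, a rough boto3 sketch could look like the following (the bucket name and principal ARN are the same placeholders used in the policy above):
import json
import boto3

s3 = boto3.client("s3")  # credentials for Account A

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DelegateS3Access",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::222222222222:user/Jane"},
        "Action": ["s3:ListBucket", "s3:GetObject"],
        "Resource": [
            "arn:aws:s3:::awsexamplesourcebucket/*",
            "arn:aws:s3:::awsexamplesourcebucket",
        ],
    }],
}

# Attach the policy to the source bucket.
s3.put_bucket_policy(
    Bucket="awsexamplesourcebucket",
    Policy=json.dumps(bucket_policy),
)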
Attach an IAM policy to a user or role in Account B
From Account B, create an IAM customer managed policy that allows an IAM user or role to copy objects from the source bucket in Account A to the destination bucket in Account B.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::awsexamplesourcebucket",
        "arn:aws:s3:::awsexamplesourcebucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::awsexampledestinationbucket",
        "arn:aws:s3:::awsexampledestinationbucket/*"
      ]
    }
  ]
}
Attach the customer managed policy to the IAM user or role that you want to use to copy objects between accounts.
Use the IAM user or role in Account B to perform the cross-account copy
After you set up the bucket policy and IAM policy, the IAM user or role in Account B can perform the copy from Account A to Account B. Then, Account B owns the copied objects.
To synchronize all content from a source bucket in Account A to a destination bucket in Account B, the IAM user or role in Account B can run the sync command using the AWS Command Line Interface (AWS CLI):
aws s3 sync s3://awsexamplesourcebucket s3://awsexampledestinationbucket
AWS Reference

Related

AWS S3 bucket - Allow downloading files for every IAM role and user from a specific AWS account

I am looking for an S3 bucket policy that will allow all IAM roles and users from a different account to download files from a bucket located in my AWS account.
Thanks for the help.
You can apply object level permissions to another account via a bucket policy.
By using the principal of the root of the account, every IAM entity in that account is able to interact with the bucket using the permissions in your bucket policy.
An example bucket policy using the root of the account is below.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Example permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AccountB-ID:root"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::awsexamplebucket1"
      ]
    }
  ]
}
More information is available in the Bucket owner granting cross-account bucket permissions documentation
For that, you would need to provide cross-account access to the objects in your bucket by giving the IAM role or user in the second account permission to download (GetObject) objects from the needed bucket.
The following AWS post provides details on how to define the IAM policy:
https://aws.amazon.com/premiumsupport/knowledge-center/cross-account-access-s3/
In your case, you just need the GetObject permission.
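Once the bucket policy grants access, any principal in the other account can download objects with its normal credentials. A minimal boto3 sketch, assuming s3:GetObject has been granted as described and using a placeholder bucket and key:
import boto3

# Runs under an IAM user/role in the account named in the bucket policy's Principal.
s3 = boto3.client("s3")

# GetObject is all that is needed to download a single file.
s3.download_file(
    Bucket="awsexamplebucket1",
    Key="reports/2021-01-01.csv",
    Filename="2021-01-01.csv",
)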

AWS S3 data lake cross account usage

We have the following scenario:
AWS Account A (application) writes data from an application to an S3 bucket owned by Account B (data lake). The analysts in Account C (reporting) want to process the data and build reports and dashboards on top of it.
Account A can write data to the data lake with --acl bucket-owner-full-control to allow Account B access. But Account C still cannot see and process the data.
One solution (a bad one, in our eyes) is for Account B to copy the data to the same location (overwriting it), effectively taking ownership of the data in the process and eliminating the issue. We don't want that, because ... ugly.
We tried assuming roles in the different accounts, but it does not work for all our infrastructure. E.g. S3 access via CLI or console is OK, but using it from EMR in Account C does not work. Also, we have on-premises infrastructure (local task runners) where this mechanism is not an option.
Maintaining IAM roles for all accounts and users is too much effort. We aim for an automatic solution, not one where we have to take action every time a new user or account is added.
Do you have any suggestions?
One nice and clean way is to use a bucket policy granting read access to the external account (account C) by supplying the account ARN as the principal.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Grant read access to reporting account",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::insertReportingAccountIdHere:root"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::yourdatalakebucket",
        "arn:aws:s3:::yourdatalakebucket/*"
      ]
    }
  ]
}
This lets the reporting account manage the (ListBucket, GetObject) permissions on the bucket for its own users, meaning you can now create an IAM policy in Account C with the permission to fetch data from the specified data lake bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadingFilesFromTheDataLake",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::yourdatalakebucket",
        "arn:aws:s3:::yourdatalakebucket/*"
      ]
    }
  ]
}
This policy can then be attached to any Account C IAM role or user group you want. For example, you could attach it to your standard Developer or Analyst roles to give access to large groups of users, or you could attach it to a service role to give a particular service access to the bucket.
There is a guide on the Amazon S3 documentation site on how to do this.
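As a rough sketch, attaching that policy to an existing role in Account C with boto3 might look like the following; the policy name DataLakeReadAccess and the role name Analyst are placeholders, not names from the original setup:
import json
import boto3

iam = boto3.client("iam")  # credentials for Account C (reporting)

read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowReadingFilesFromTheDataLake",
        "Effect": "Allow",
        "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket",
            "s3:GetObject",
            "s3:GetObjectAcl",
        ],
        "Resource": [
            "arn:aws:s3:::yourdatalakebucket",
            "arn:aws:s3:::yourdatalakebucket/*",
        ],
    }],
}

# Create a customer managed policy and attach it to an existing role.
policy = iam.create_policy(
    PolicyName="DataLakeReadAccess",
    PolicyDocument=json.dumps(read_policy),
)
iam.attach_role_policy(
    RoleName="Analyst",
    PolicyArn=policy["Policy"]["Arn"],
)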
You can do this via the following documentation:
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_enable-console-saml.html
Steps:
Create SAML provider
Create a role for the SAML provider (example below)
Assign users roles based on SAML conditions
E.g., you can create S3 Readers and S3 Writers and assign permissions based on those roles.
Example Assume Role with SAML:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Federated": "arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:saml-provider/ExampleOrgSSOProvider"},
      "Action": "sts:AssumeRoleWithSAML",
      "Condition": {
        "StringEquals": {
          "saml:edupersonorgdn": "ExampleOrg",
          "saml:aud": "https://signin.aws.amazon.com/saml"
        }
      }
    }
  ]
}
Hope it helps.
In our case, we solved it using roles in the DataLake account (B), both for write (WriterRole) and read (ReaderRole) access. When writing to the DataLake from Account A, your writer assumes the "WriterRole" in Account B, which has the required permissions. When reading from Account C, you assume the "ReaderRole".
The issue with EMR reads we solved with EMRFS using IAM roles for requests to Amazon S3 (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-emrfs-iam-roles.html).
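As an illustration of this role-based approach, here is a boto3 sketch of reading from the data lake by assuming the ReaderRole (the role name comes from the description above; the account ID and bucket name are placeholders):
import boto3

# From Account C, assume the ReaderRole in the data lake account (B).
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::ACCOUNT_B_ID:role/ReaderRole",
    RoleSessionName="datalake-read",
)["Credentials"]

# Use the temporary credentials to read from the data lake bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="yourdatalakebucket"):
    for obj in page.get("Contents", []):
        print(obj["Key"])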

How to get AWS Glue crawler to assume a role in another AWS account to get data from that account's S3 bucket?

There are some CSV data files I need to get from S3 buckets belonging to a series of AWS accounts owned by a third party; the owner of the other accounts has created a role in each of the accounts which grants me access to those files. I can use the AWS web console (logged in to my own account) to switch to each role and get the files: one at a time, I switch to the role for an account and get that account's files, then move on to the next account, and so on.
I'd like to automate this process.
It looks like AWS Glue can do this, but I'm having trouble with the permissions.
What I need is to set up permissions so that an AWS Glue crawler can switch to the right role (belonging to each of the other AWS accounts) and get the data files from those accounts' S3 buckets.
Is this possible and if so how can I set it up? (e.g. what IAM roles/permissions are needed?) I'd prefer to limit changes to my own account if possible rather than having to ask the other account owner to make changes on their side.
If it's not possible with Glue, is there some other easy way to do it with a different AWS service?
Thanks!
(I've had a series of tries but I keep getting it wrong - my attempts are so far from being right that there's no point in me posting the details here).
Yes, you can automate your scenario with Glue by following these steps:
Create an IAM role in your AWS account. This role's name must start with AWSGlueServiceRole but you can append whatever you want. Add a trust relationship for Glue, such as:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Attach two IAM policies to your IAM role: the AWS managed policy named AWSGlueServiceRole, and a custom policy that provides the access needed to all of the target cross-account S3 buckets, such as:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BucketAccess",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::examplebucket1",
        "arn:aws:s3:::examplebucket2",
        "arn:aws:s3:::examplebucket3"
      ]
    },
    {
      "Sid": "ObjectAccess",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::examplebucket1/*",
        "arn:aws:s3:::examplebucket2/*",
        "arn:aws:s3:::examplebucket3/*"
      ]
    }
  ]
}
Add an S3 bucket policy to each target bucket that allows your IAM role the same S3 access that you granted it in your account, such as:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BucketAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::your_account_number:role/AWSGlueServiceRoleDefault"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::examplebucket1"
    },
    {
      "Sid": "ObjectAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::your_account_number:role/AWSGlueServiceRoleDefault"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::examplebucket1/*"
    }
  ]
}
Finally, create Glue crawlers and jobs in your account (in the same regions as the target cross account S3 buckets) that will ETL the data from the cross account S3 buckets to your account.
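As a hedged boto3 sketch of that last step, creating and starting one crawler against one of the cross-account buckets might look like the following; the crawler name, Glue database name, S3 prefix, and schedule are placeholder choices, while the role is the AWSGlueServiceRoleDefault role described above:
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # same region as the target bucket

# Crawl the cross-account bucket using the Glue service role created earlier.
glue.create_crawler(
    Name="cross-account-csv-crawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="cross_account_data",
    Targets={"S3Targets": [{"Path": "s3://examplebucket1/csv-data/"}]},
    Schedule="cron(0 6 * * ? *)",  # optional: run daily at 06:00 UTC
)
glue.start_crawler(Name="cross-account-csv-crawler")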
Using the AWS CLI, you can create named profiles for each of the roles you want to switch to, then refer to them from the CLI. You can then chain these calls, referencing the named profile for each role, and include them in a script to automate the process.
From Switching to an IAM Role (AWS Command Line Interface)
A role specifies a set of permissions that you can use to access AWS resources that you need. In that sense, it is similar to a user in AWS Identity and Access Management (IAM). When you sign in as a user, you get a specific set of permissions. However, you don't sign in to a role, but once signed in as a user you can switch to a role. This temporarily sets aside your original user permissions and instead gives you the permissions assigned to the role. The role can be in your own account or any other AWS account. For more information about roles, their benefits, and how to create and configure them, see IAM Roles, and Creating IAM Roles.
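Named profiles work the same way for boto3 as for the CLI. A sketch of iterating over them in a script, assuming the role profiles have already been configured in ~/.aws/config (all profile names, role ARNs, and bucket names below are placeholders):
import boto3

# ~/.aws/config (placeholder values):
# [profile account-one]
# role_arn = arn:aws:iam::111111111111:role/ThirdPartyReadRole
# source_profile = default
#
# [profile account-two]
# role_arn = arn:aws:iam::222222222222:role/ThirdPartyReadRole
# source_profile = default

# One (profile, bucket) pair per third-party account.
sources = [
    ("account-one", "third-party-bucket-one"),
    ("account-two", "third-party-bucket-two"),
]

for profile, bucket in sources:
    s3 = boto3.Session(profile_name=profile).client("s3")
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith("/"):
                s3.download_file(bucket, obj["Key"], obj["Key"].replace("/", "_"))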
You can achieve this with AWS Lambda and CloudWatch Rules.
You can create a Lambda function with a role attached to it; let's call this role Role A. Depending on the number of accounts, you can either create one function per account with a single CloudWatch rule to trigger all of the functions, or create one function for all the accounts (be mindful of the AWS Lambda limits).
Creating Role A
Create an IAM role (Role A) with the following policy, allowing it to assume the roles given to you by the other accounts containing the data.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1509358389000",
      "Effect": "Allow",
      "Action": [
        "sts:AssumeRole"
      ],
      "Resource": [
        ...
      ]
    }
  ]
}
The Resource list contains all the IAM role ARNs from the accounts containing the data; if you have one function per account, you can opt to have separate roles instead.
You will also need to make sure that a trust relationship with all the accounts is present in Role A's trust relationship policy document.
Attach Role A to the Lambda functions you will be running. You can use Serverless for development.
Now your Lambda function has Role A attached to it, and Role A has sts:AssumeRole permission on the roles created in the other accounts.
Assuming you have created one function per account, in your Lambda code you will first have to use STS to switch to the role of the other account, obtain temporary credentials, and pass these to the S3 client before fetching the required data.
If you have created one function for all the accounts, you can keep the role ARNs in an array and iterate over it; again, be aware of the AWS Lambda limits when doing this.
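A minimal handler sketch for the single-function variant, assuming placeholder role ARNs and bucket names, and copying the fetched objects into a bucket in your own account:
import boto3

# Placeholder role/bucket pairs for the accounts containing the data.
SOURCES = [
    {"role_arn": "arn:aws:iam::111111111111:role/DataAccessRole", "bucket": "source-bucket-1"},
    {"role_arn": "arn:aws:iam::222222222222:role/DataAccessRole", "bucket": "source-bucket-2"},
]
DESTINATION_BUCKET = "my-own-account-bucket"

def handler(event, context):
    sts = boto3.client("sts")
    dest_s3 = boto3.client("s3")  # uses Role A, attached to the function
    for source in SOURCES:
        # Switch to the role provided by the other account.
        creds = sts.assume_role(
            RoleArn=source["role_arn"],
            RoleSessionName="fetch-csv-files",
        )["Credentials"]
        src_s3 = boto3.client(
            "s3",
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )
        # Copy each object into our own bucket.
        for page in src_s3.get_paginator("list_objects_v2").paginate(Bucket=source["bucket"]):
            for obj in page.get("Contents", []):
                body = src_s3.get_object(Bucket=source["bucket"], Key=obj["Key"])["Body"].read()
                dest_s3.put_object(Bucket=DESTINATION_BUCKET, Key=obj["Key"], Body=body)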

AWS S3 Transfer Between Accounts Not Working

I am trying to copy data from a bucket in one account, in which I have an IAM user but not admin rights, to a bucket in another account, in which I am an admin, and failing. I can't even ls the source bucket.
I've followed the directions from AWS and various sources online to give myself list/read/get permissions on the source bucket, with no success. I can provide the details (e.g., the bucket policy json), but it is what is in the AWS docs and other places. What I've done works between two accounts I have admin access to.
This is "multi-region", in the sense that I'm in the US (mainly us-west-2) but the bucket is in eu-central-1. I am specifying the region in the aws cli, and set up a destination bucket in eu-central-1, but can't even list anyway.
I have done this a couple of times with my AWS accounts. I am guessing you have already set up cross-account access to your S3 bucket, but just to double check, here is what I do to grant cross-account access to an S3 bucket.
Account (A):
S3 bucket (testbucket)
Account (B):
IAM User (testuser) needs access to the S3 bucket testbucket in Account (A)
Here are things that need to happen:
Create a bucket policy on testbucket in Account (A) to grant read/list access to your test bucket.
example:
{
  "Version": "2012-10-17",
  "Id": "BUCKETPOLICY",
  "Statement": [
    {
      "Sid": "AllowS3ReadObject28",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::900000000:user/testuser"
        ]
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::testbucket",
        "arn:aws:s3:::testbucket/*"
      ]
    }
  ]
}
Create an IAM policy on testuser that also grants read, write, and list access to the bucket.
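The IAM side in Account B can be an inline policy on testuser. A rough boto3 sketch covering the read/list part (the policy name TestBucketReadAccess is a placeholder; testbucket and testuser are the names used above):
import json
import boto3

iam = boto3.client("iam")  # credentials for Account B

user_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject"],
        "Resource": [
            "arn:aws:s3:::testbucket",
            "arn:aws:s3:::testbucket/*",
        ],
    }],
}

# Attach the policy inline on the user that needs cross-account access.
iam.put_user_policy(
    UserName="testuser",
    PolicyName="TestBucketReadAccess",
    PolicyDocument=json.dumps(user_policy),
)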
It appears that your situation is:
Account A: Bucket A and User A (with limited access rights)
Account B: Bucket B and User B (with admin rights)
You can either push the data from Account A to Bucket B, or you can pull the data from Bucket A using Account B.
Pushing from Account A to Bucket B
Let's assume User A has access to Bucket A. All that's needed is to give User A permission to write to Bucket B. This can be done with a bucket policy on Bucket B:
{
  "Id": "PolicyB",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantAccessToUserA",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::BUCKET-B",
        "arn:aws:s3:::BUCKET-B/*"
      ],
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT-A:user/USER-A"
      }
    }
  ]
}
This grants all s3 permissions to User A on Bucket B. That's excessive, but presumably this is only temporary.
User A would then copy the files from Bucket A to Bucket B. For example:
aws s3 sync s3://BUCKET-A s3://BUCKET-B \
--acl bucket-owner-full-control \
--source-region SOURCE-REGION \
--region DESTINATION-REGION
Important: When copying the files, be sure to use the Access Control List that grants bucket-owner-full-control. This means that the files become owned by the owner of Bucket B. If you don't do this, the files are still owned by User A and can't be deleted by User B, even with admin rights!
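The same ACL matters if you do the copy with boto3 instead of the CLI. A sketch of copying a single object (the bucket names match the placeholders above; the key is a placeholder):
import boto3

s3 = boto3.client("s3")  # User A's credentials

# Copy one object and hand ownership to the Bucket B owner via the ACL.
s3.copy(
    CopySource={"Bucket": "BUCKET-A", "Key": "data/file.csv"},
    Bucket="BUCKET-B",
    Key="data/file.csv",
    ExtraArgs={"ACL": "bucket-owner-full-control"},
)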
Pulling from Bucket A using Account B
To do this, User B must be granted access to Bucket A. You will need enough access rights in Account A to add a bucket policy on Bucket A:
{
  "Id": "PolicyA",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantAccessToUserB",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::BUCKET-A",
        "arn:aws:s3:::BUCKET-A/*"
      ],
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT-B:user/USER-B"
      }
    }
  ]
}
Then, User B can copy the files across:
aws s3 sync s3://BUCKET-A s3://BUCKET-B \
--source-region SOURCE-REGION \
--region DESTINATION-REGION
(You might need to grant some more access rights, I didn't test the above policy.)
The fact that buckets are in different regions does not impact the permissions, but it does impact where you send the command. The command is sent to the destination region, which then pulls from the source region.
See: AWS CLI s3 sync command

Copy to Redshift from another accounts S3 bucket

Is it possible to copy from one AWS account's S3 bucket into another AWS account's Redshift cluster? The way I tried to do it was to log in with SQL Workbench to my AWS account (Account1) and use an IAM user from (Account2) to copy the file over like this:
copy my_table (town,name,number)
from 's3://other-s3-account-bucket/fileToCopy.tsv'
credentials 'aws_access_key_id=<other_accounts_aws_access_key_id>;aws_secret_access_key=<other_accounts_aws_secret_access_key>'
delimiter '\t';
I know the other account's user has S3 permissions after double-checking. Do I have to share IAM users or set up different permissions in order to do this?
You will need to "pull" the data from the other account's S3 bucket.
AWS Account A has an S3 bucket called source-bucket-account-a.
AWS Account B has a Redshift cluster called TargetCluster.
On bucket source-bucket-account-a, add a bucket policy allowing AWS Account B to read files.
A sample policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DelegateS3Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account-b-number>:root"
      },
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::source-bucket-account-a",
        "arn:aws:s3:::source-bucket-account-a/*"
      ]
    }
  ]
}
It's very similar to the following:
http://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example2.html
or the following:
http://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_policy-examples.html
Once the bucket policy is in place, you use the credentials for AWS Account B to run the copy command, because that account owns the Redshift cluster. In the copy command, you specify the bucket by its name, source-bucket-account-a.
The bucket policy has granted read access to AWS Account B so it can "pull" the data into Redshift.