Boto3 get only S3 buckets of specific region - amazon-web-services

The following code sadly lists all buckets from all regions, not only those in "eu-west-1" as specified. How can I change that?
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")
for bucket in s3.list_buckets()["Buckets"]:
    bucket_name = bucket["Name"]
    print(bucket_name)

s3 = boto3.client("s3", region_name="eu-west-1")
connects to the S3 API endpoint in eu-west-1; it does not limit the listing to buckets in eu-west-1, because list_buckets always returns every bucket in the account. One solution is to query each bucket's location and filter.
s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    if s3.get_bucket_location(Bucket=bucket["Name"])["LocationConstraint"] == "eu-west-1":
        print(bucket["Name"])
If you need a one-liner, use a Python list comprehension:
region_buckets = [bucket["Name"] for bucket in s3.list_buckets()["Buckets"] if s3.get_bucket_location(Bucket=bucket['Name'])['LocationConstraint'] == 'eu-west-1']
print(region_buckets)

The solution above does not always work, because 'LocationConstraint' can be null: buckets in us-east-1, for example, report None. Here is another solution:
s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    if s3.head_bucket(Bucket=bucket["Name"])["ResponseMetadata"]["HTTPHeaders"]["x-amz-bucket-region"] == "us-east-1":
        print(bucket["Name"])
The SDK method:
s3.head_bucket(Bucket=[INSERT_BUCKET_NAME_HERE])['ResponseMetadata']['HTTPHeaders']['x-amz-bucket-region']
... should always give you the bucket region. Thanks to sd65 for the tip: https://github.com/boto/boto3/issues/292
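Putting this together, a small helper could look like the following sketch (the get_bucket_region function name is just illustrative):
import boto3

s3 = boto3.client("s3")

def get_bucket_region(bucket_name):
    # head_bucket reports the region in the x-amz-bucket-region response header,
    # even for us-east-1 buckets where get_bucket_location returns None
    response = s3.head_bucket(Bucket=bucket_name)
    return response["ResponseMetadata"]["HTTPHeaders"]["x-amz-bucket-region"]

eu_buckets = [
    bucket["Name"]
    for bucket in s3.list_buckets()["Buckets"]
    if get_bucket_region(bucket["Name"]) == "eu-west-1"
]
print(eu_buckets)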

Related

S3 bucket size for subset bucket names

How can I use a custom list of S3 bucket names from a local file? Computing sizes sometimes takes too long for large buckets or other storage classes, and I am not sure why it sometimes doesn't show all S3 buckets.
with open('subsetbucketslist.txt') as f:
    allbuckets = f.read().splitlines()
How can I use a local file of bucket names as input?
By default it would list all buckets:
import boto3

total_size = 0
s3 = boto3.resource('s3')
for mybucket in s3.buckets.all():
    mybucket_size = sum(object.size for object in s3.Bucket(mybucket.name).objects.all())
    print(mybucket.name, mybucket_size)
If you want to calculate the size for particular buckets, then put those bucket names in your for loop:
import boto3

total_size = 0
s3 = boto3.resource('s3')

with open('subsetbucketslist.txt') as f:
    allbuckets = f.read().splitlines()

for bucket_name in allbuckets:
    mybucket_size = sum(object.size for object in s3.Bucket(bucket_name).objects.all())
    print(bucket_name, mybucket_size)
It's also worth mentioning that Amazon CloudWatch keeps track of bucket sizes (BucketSizeBytes). See: Metrics and dimensions - Amazon Simple Storage Service
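If walking every object is too slow, a sketch of the CloudWatch approach might look like the following (assuming the standard AWS/S3 BucketSizeBytes metric with the StandardStorage storage type; adjust the StorageType dimension for other storage classes):
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

def bucket_size_bytes(bucket_name):
    # BucketSizeBytes is published once per day, so look back a couple of days
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/S3',
        MetricName='BucketSizeBytes',
        Dimensions=[
            {'Name': 'BucketName', 'Value': bucket_name},
            {'Name': 'StorageType', 'Value': 'StandardStorage'},
        ],
        StartTime=datetime.utcnow() - timedelta(days=2),
        EndTime=datetime.utcnow(),
        Period=86400,
        Statistics=['Average'],
    )
    datapoints = response['Datapoints']
    if not datapoints:
        return 0
    return max(datapoints, key=lambda d: d['Timestamp'])['Average']

with open('subsetbucketslist.txt') as f:
    for name in f.read().splitlines():
        print(name, bucket_size_bytes(name))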

AWS lambda with dynamic trigger for S3 buckets

I have an S3 bucket named "archive_A". I have created a Lambda function (Python) that is triggered by any object "creation" or "permanently delete" event in the S3 bucket, retrieves the object's metadata, and inserts the collected metadata into DynamoDB.
For the S3 bucket archive_A, I have manually added the triggers, one for "creation" and another for "permanently delete", to my Lambda function via the GUI.
import boto3
from uuid import uuid4

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    dynamodb = boto3.resource('dynamodb')
    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        object_key = record['s3']['object']['key']
        size = record['s3']['object'].get('size', -1)
        event_name = record['eventName']
        event_time = record['eventTime']
        dynamoTable = dynamodb.Table('S3metadata')
        dynamoTable.put_item(
            Item={
                'Resource_id': str(uuid4()),
                'Bucket': bucket_name,
                'Object': object_key,
                'Size': size,
                'Event': event_name,
                'EventTime': event_time,
            }
        )
In the future there could be more S3 buckets like archive_B, archive_C, etc. In that case I would have to keep adding triggers manually for each S3 bucket, which is a bit cumbersome.
Is there any dynamic way of adding triggers to the Lambda function for S3 buckets named "archive_*", so that any future bucket such as "archive_G" gets the triggers added automatically?
Please suggest. I am quite new to AWS. An example would be easier to follow.
There is no in-built way to automatically add triggers for new buckets.
You could probably create an Amazon EventBridge rule that triggers on CreateBucket and calls an AWS Lambda function with details of the new bucket.
That Lambda function could then programmatically add a trigger on your existing Lambda function.
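A rough sketch of what that setup function could do, assuming the EventBridge rule forwards the CloudTrail CreateBucket event and the existing function's ARN is supplied in an environment variable (METADATA_LAMBDA_ARN is an illustrative name):
import os
import boto3

s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')

def lambda_handler(event, context):
    # CloudTrail-backed EventBridge events carry the new bucket name here
    bucket_name = event['detail']['requestParameters']['bucketName']
    if not bucket_name.startswith('archive_'):
        return

    target_arn = os.environ['METADATA_LAMBDA_ARN']

    # allow S3 to invoke the existing metadata Lambda for this bucket
    lambda_client.add_permission(
        FunctionName=target_arn,
        StatementId=f'allow-{bucket_name}',
        Action='lambda:InvokeFunction',
        Principal='s3.amazonaws.com',
        SourceArn=f'arn:aws:s3:::{bucket_name}',
    )

    # attach the same create/delete triggers that were added manually before
    s3.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration={
            'LambdaFunctionConfigurations': [{
                'LambdaFunctionArn': target_arn,
                'Events': ['s3:ObjectCreated:*', 's3:ObjectRemoved:*'],
            }]
        },
    )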

How do I get list of all S3 Buckets with given prefix using terraform?

I am writing a Terraform script to set up an event notification on multiple S3 buckets whose names start with a given prefix.
For example, I want to set up notifications for buckets starting with finance-data. With the help of the aws_s3_bucket data source, we can reference multiple S3 buckets that already exist and later use them in the aws_s3_bucket_notification resource. Example:
data "aws_s3_bucket" "source_bucket" {
# set of buckets on which event notification will be set
# finance-data-1 and finance-data-2 are actual bucket id
for_each = toset(["finance-data-1", "finance-data-2"])
bucket = each.value
}
resource "aws_s3_bucket_notification" "bucket_notification_to_lambda" {
for_each = data.aws_s3_bucket.source_bucket
bucket = each.value.id
lambda_function {
lambda_function_arn = aws_lambda_function.s3_event_lambda.arn
events = [
"s3:ObjectCreated:*",
"s3:ObjectRemoved:*"
]
}
}
In the aws_s3_bucket data source, I am not able to find an option to give a bucket prefix; instead I have to enter the bucket id for every bucket. Is there any way to achieve this?
Is there any way to achieve this?
No, there is not. You have to explicitly specify the buckets that you want.

Setting S3 Bucket permissions when writing between 2 AWS Accounts while running from Glue

I have a Scala jar which I am calling from an AWS Glue job. My jar writes a DataFrame to an S3 bucket in another AWS account which has KMS encryption turned on. I am able to write to the bucket, but I am not able to give the destination bucket owner permission to access the files. I can achieve this if I simply use the Glue writer, but with straight Spark it just does not work. I have read all the documentation and I am setting the following bucket policies in the Hadoop configuration.
def writeDataFrameInTargetLocation(sparkContext: SparkContext = null, dataFrame: DataFrame, location: String,
                                   fileFormat: String, saveMode: String, encryptionKey: Option[String] = Option.empty,
                                   kms_region: Option[String] = Option("us-west-2")): Unit = {
  if (encryptionKey.isDefined) {
    val region = if (kms_region.isDefined) kms_region.getOrElse("us-west-2") else "us-west-2"
    sparkContext.hadoopConfiguration.set("fs.s3.enableServerSideEncryption", "false")
    sparkContext.hadoopConfiguration.set("fs.s3.cse.enabled", "true")
    sparkContext.hadoopConfiguration.set("fs.s3.cse.encryptionMaterialsProvider", "com.amazon.ws.emr.hadoop.fs.cse.KMSEncryptionMaterialsProvider")
    sparkContext.hadoopConfiguration.set("fs.s3.cse.kms.keyId", encryptionKey.get) // KMS key to encrypt the data with
    sparkContext.hadoopConfiguration.set("fs.s3.cse.kms.region", region) // the region for the KMS key
    sparkContext.hadoopConfiguration.set("fs.s3.canned.acl", "BucketOwnerFullControl")
    sparkContext.hadoopConfiguration.set("fs.s3.acl.default", "BucketOwnerFullControl")
    sparkContext.hadoopConfiguration.set("fs.s3.acl", "bucket-owner-full-control")
    sparkContext.hadoopConfiguration.set("fs.s3.acl", "BucketOwnerFullControl")
  } else {
    sparkContext.hadoopConfiguration.set("fs.s3.canned.acl", "BucketOwnerFullControl")
    sparkContext.hadoopConfiguration.set("fs.s3.acl.default", "BucketOwnerFullControl")
    sparkContext.hadoopConfiguration.set("fs.s3.acl", "bucket-owner-full-control")
    sparkContext.hadoopConfiguration.set("fs.s3.acl", "BucketOwnerFullControl")
  }

  val writeDF = dataFrame
    .repartition(5)
    .write

  writeDF
    .mode(saveMode)
    .option(Header, true)
    .format(fileFormat)
    .save(location)
}
You are probably using the S3AFileSystem implementation for the "s3" scheme (i.e. URLs of the form "s3://..."). You can check that by looking at sparkContext.hadoopConfiguration.get("fs.s3.impl"). If that is the case, then you actually need to set the hadoop properties for "fs.s3a.*" not "fs.s3.*".
Then the correct settings would be:
sparkContext.hadoopConfiguration.set("fs.s3a.canned.acl", "BucketOwnerFullControl")
sparkContext.hadoopConfiguration.set("fs.s3a.acl.default", "BucketOwnerFullControl")
The S3AFileSystem implementation does not use any of the properties under "fs.s3". You can see that by investigating the code at the following Hadoop source link:
https://github.com/apache/hadoop/blob/43e8ac60971323054753bb0b21e52581f7996ece/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java#L268
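For reference, here is how the check and the corrected settings might look in PySpark (a sketch in Python, since the rest of this page uses Python; the property names are the same as in Scala):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext._jsc.hadoopConfiguration()

# see which filesystem implementation backs the "s3://" scheme
print(conf.get("fs.s3.impl"))

# if it is org.apache.hadoop.fs.s3a.S3AFileSystem, set the s3a properties instead
conf.set("fs.s3a.canned.acl", "BucketOwnerFullControl")
conf.set("fs.s3a.acl.default", "BucketOwnerFullControl")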

How to delete s3 life-cycle rule using boto3

I have n rules on an S3 bucket and I need to delete one of the rules that I configured using boto3, but I am not finding the command for that. If I use
response = s3.bucket_lifecycle.delete() or
response = s3.delete_bucket_lifecycle(Bucket='examplebucket',)
how will it find the particular rule that I need to delete?
There is no call that deletes a single rule; delete_bucket_lifecycle removes the whole configuration. Instead, read the current lifecycle configuration, drop the rule you no longer want, and write the remaining rules back. With the boto3 resource API (see the BucketLifecycleConfiguration documentation):
import boto3

s3 = boto3.resource('s3')
lifecycle = s3.BucketLifecycleConfiguration('bucket_name')

# keep every rule except the one to delete, then write the configuration back
remaining_rules = [rule for rule in lifecycle.rules if rule['ID'] != 'something']
lifecycle.put(LifecycleConfiguration={'Rules': remaining_rules})
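Equivalently, with the low-level client (a sketch, reusing the same illustrative bucket and rule names):
import boto3

s3 = boto3.client('s3')

# read the current configuration, drop the unwanted rule, write the rest back
rules = s3.get_bucket_lifecycle_configuration(Bucket='bucket_name')['Rules']
remaining_rules = [rule for rule in rules if rule['ID'] != 'something']
s3.put_bucket_lifecycle_configuration(
    Bucket='bucket_name',
    LifecycleConfiguration={'Rules': remaining_rules},
)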