Move S3 files older than 100 days to another bucket - amazon-web-services

Is there a way to find all files that are older than 100 days in one S3 bucket and move them to a different bucket? Solutions using AWS CLI or SDK both welcome.
In the src bucket, the files are organized like bucket/type/year/month/day/hour/file
S3://my-logs-bucket/logtype/2020/04/30/16/logfile.csv
For instance, on 2020/04/30, log files on or before 2020/01/21 will have to be moved.
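For reference, the cutoff implied here is simply the current date minus 100 days; a minimal sketch of that calculation:
from datetime import datetime, timedelta

cutoff = datetime(2020, 4, 30) - timedelta(days=100)
print(cutoff.date())  # 2020-01-21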

Here's some Python code that will:
Move files from Bucket-A to Bucket-B if they are older than a given period
Full names and paths will be retained
import boto3
from datetime import datetime, timedelta

SOURCE_BUCKET = 'bucket-a'
DESTINATION_BUCKET = 'bucket-b'

s3_client = boto3.client('s3')

# Create a reusable Paginator
paginator = s3_client.get_paginator('list_objects_v2')

# Create a PageIterator from the Paginator
page_iterator = paginator.paginate(Bucket=SOURCE_BUCKET)

# Loop through each object, looking for ones older than a given time period
for page in page_iterator:
    for object in page.get('Contents', []):  # .get() avoids a KeyError on an empty bucket
        if object['LastModified'] < datetime.now().astimezone() - timedelta(days=2):  # <-- Change time period here
            print(f"Moving {object['Key']}")

            # Copy object
            s3_client.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=object['Key'],
                CopySource={'Bucket': SOURCE_BUCKET, 'Key': object['Key']}
            )

            # Delete original object
            s3_client.delete_object(Bucket=SOURCE_BUCKET, Key=object['Key'])
It worked for me, but please test it on less-important data before deploying in production since it deletes objects!
The code uses a paginator in case there are over 1000 objects in the bucket.
You can change the time period as desired.
(In addition to the license granted under the terms of service of this site the contents of this post are licensed under MIT-0.)

As mentioned in my comments, you can create a lifecycle policy for the S3 bucket. Here are the steps to do it: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
Deleting/expiring objects with lifecycle rules is optional; you define the actions you want taken on the objects in your S3 bucket.
Lifecycle policies use different storage classes to transition your objects. Before configuring lifecycle policies, I suggest reading up on the different storage classes, as each has its own associated cost: Standard-IA, One Zone-IA, Glacier, and Deep Archive.
For your use case of 100 days, I recommend transitioning your logs to an archive storage class such as S3 Glacier. This might prove to be more cost-effective.
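For reference, a minimal sketch of configuring such a rule with boto3 (the bucket name, rule ID, and prefix below are placeholders, and it assumes a transition to Glacier after 100 days rather than a move to another bucket):
import boto3

s3_client = boto3.client('s3')

# Transition objects under the given prefix to Glacier once they are 100 days old
s3_client.put_bucket_lifecycle_configuration(
    Bucket='my-logs-bucket',                   # placeholder bucket name
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-old-logs',          # placeholder rule ID
                'Filter': {'Prefix': 'logtype/'},  # placeholder prefix
                'Status': 'Enabled',
                'Transitions': [
                    {'Days': 100, 'StorageClass': 'GLACIER'}
                ]
            }
        ]
    }
)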

Adding on from John's answer: if the objects are not in the root directory of the bucket, a few adjustments to the script need to be made. If they are in the root directory, use John's answer; this script only works if the objects are in a sub-directory. It moves objects from bucket/path/to/objects/ to bucket2/path/to/objects/, assuming you have access to each bucket from the same set of AWS CLI credentials.
import boto3
from datetime import datetime, timedelta

SOURCE_BUCKET = 'bucket-a'
SOURCE_PATH = 'path/to/objects/'
DESTINATION_BUCKET = 'bucket-b'
DESTINATION_PATH = 'path/to/send/objects/'  # <- you may need to add a prefix of the filenames to the end so that the paginator doesn't look at the 'objects' directory

s3_client = boto3.client('s3')

# Create a reusable Paginator
paginator = s3_client.get_paginator('list_objects_v2')

# Create a PageIterator from the Paginator, including the Prefix argument and an optional
# PaginationConfig argument to control the number of objects to iterate over (in case you have a lot)
page_iterator = paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PATH, PaginationConfig={'MaxItems': 10000})

# Loop through each object, looking for ones older than a given time period
for page in page_iterator:
    for object in page.get("Contents", []):
        if object['LastModified'] < datetime.now().astimezone() - timedelta(days=100):  # <-- Change time period here
            # Grab filename from path/to/filename
            FILENAME = object['Key'].rsplit('/', 1)[1]

            # Copy object
            s3_client.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=DESTINATION_PATH + FILENAME,
                CopySource={'Bucket': SOURCE_BUCKET, 'Key': object['Key']}
            )

            # Delete original object
            s3_client.delete_object(Bucket=SOURCE_BUCKET, Key=object['Key'])

Related

Unable to get object metadata from S3. Check object key, region and/or access permissions

I have a Lambda function that scans for text and is triggered by an S3 bucket. I get this error when trying to upload a photo directly into the S3 bucket using a browser:
Unable to get object metadata from S3. Check object key, region, and/or access permissions
However, if I hardcode the key (e.g., image01.jpg) which is in my bucket, there are no errors.
import json
import boto3

def lambda_handler(event, context):
    # Get bucket and file name
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    location = key[:17]

    s3Client = boto3.client('s3')
    client = boto3.client('rekognition', region_name='us-east-1')

    response = client.detect_text(Image={'S3Object': {'Bucket': 'myarrowbucket', 'Name': key}})
    detectedText = response['TextDetections']
I am confused, as it was working a few weeks ago, but now I am getting this error.
ANSWER
I have seen this question answered many times and I tried every solution; the one that worked for me was the 'key' name. I was getting the metadata error when the filename contained special characters, e.g. - or _, but when I changed the names of the uploaded files it worked. Hope this answer helps someone.
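A related point worth checking: the object key arrives URL-encoded in the S3 event, so keys containing spaces or other special characters can fail lookups unless they are decoded first. A minimal sketch of that decoding step, assuming the same handler shape as above:
import urllib.parse
import boto3

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    # Decode the key from the event before using it in API calls
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    client = boto3.client('rekognition', region_name='us-east-1')
    response = client.detect_text(Image={'S3Object': {'Bucket': bucket, 'Name': key}})
    return response['TextDetections']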

Boto3: copy objects within bucket

Is it possible to copy/duplicate objects from one prefix to another prefix in the same S3 bucket?
You can use copy_object() to copy an object in Amazon S3 to another prefix, another bucket and even another Region. The copying takes place entirely within S3, without needing to download/upload the object.
For example, to copy an object in mybucket from folder1/foo.txt to folder2/foo.txt, you could use:
import boto3

s3_client = boto3.client('s3')
response = s3_client.copy_object(
    CopySource='/mybucket/folder1/foo.txt',  # /Bucket-name/path/filename
    Bucket='mybucket',                       # Destination bucket
    Key='folder2/foo.txt'                    # Destination path/filename
)
An alternative using boto3 resource instead of client:
bucket = boto3.resource("s3").Bucket(my_bucket_name)
copy_source = {"Bucket": my_bucket_name, "Key": my_old_key}
bucket.copy(copy_source, my_new_key)
Where my_bucket_name, my_old_key and my_new_key are user defined variables.
Depending on the setup, additional arguments might be needed to instantiate a boto3 resource. A more complete instantiation call would be:
boto3.resource(
    "s3",
    endpoint_url=my_endpoint_url,
    aws_access_key_id=my_aws_access_key_id,          # Do not expose me in source code!
    aws_secret_access_key=my_aws_secret_access_key,  # Do not expose me in source code!
)

Grouping S3 Files Into Subfolders

I have a pipeline that moves approximately 1 TB of data, all CSV files. In this pipeline there are hundreds of files with different names. They have a date component, which is automatically partitioned. My question is how to use the CDK to automatically create subfolders based on the name of the file. In other words, the data comes in as broad category, but our data scientists need it at one more level of detail.
It appears that your requirement is to move incoming objects into folders based on information in their filename (Key).
This could be done by adding a trigger on the Amazon S3 bucket that triggers an AWS Lambda function when a new object is created.
Here is some code from Moving file based on filename with Amazon S3:
import boto3
import urllib.parse

def lambda_handler(event, context):
    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Only copy objects that were uploaded to the bucket root (to avoid an infinite loop)
    if '/' not in key:
        # Determine destination directory based on Key
        directory = key  # Your logic goes here to extract the directory name

        # Copy object
        s3_client = boto3.client('s3')
        s3_client.copy_object(
            Bucket=bucket,
            Key=f"{directory}/{key}",
            CopySource={'Bucket': bucket, 'Key': key}
        )

        # Delete source object
        s3_client.delete_object(
            Bucket=bucket,
            Key=key
        )
You would need to modify the code that determines the name of the destination directory based on the key of the new object.
It also assumes that new objects arrive in the top level (root) of the bucket and are then moved into sub-directories. If, instead, new objects arrive under a given path (e.g. incoming/), then set the S3 trigger to operate only on that path and remove the if '/' not in key logic, as in the sketch below.
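For that second case, a minimal sketch (assuming a hypothetical incoming/ prefix configured on the trigger, and a hypothetical rule that derives the destination folder from the filename):
import boto3
import urllib.parse

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # The trigger is scoped to the incoming/ prefix, so no root-level check is needed
    filename = key.rsplit('/', 1)[1]
    directory = filename.split('_')[0]  # hypothetical rule: folder name taken from the filename

    s3_client.copy_object(
        Bucket=bucket,
        Key=f"{directory}/{filename}",
        CopySource={'Bucket': bucket, 'Key': key}
    )
    s3_client.delete_object(Bucket=bucket, Key=key)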

Generate presigned s3 URL of latest object in the bucket using boto3

I have an S3 bucket with multiple folders. How can I generate an S3 presigned URL for the latest object in a folder requested by a user, using Python boto3?
You can do something like
import boto3
from botocore.client import Config
import requests

bucket = 'bucket-name'
folder = ''  # you can add a folder path here, e.g. 'logs/' -- don't forget the '/' at the end

s3 = boto3.client('s3', config=Config(signature_version='s3v4'))

objs = s3.list_objects(Bucket=bucket, Prefix=folder)['Contents']
latest = max(objs, key=lambda x: x['LastModified'])
print(latest)

print("Generating pre-signed url...")
url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={
        'Bucket': bucket,
        'Key': latest['Key']
    }
)
print(url)

response = requests.get(url)
print(response.url)
This returns the most recently modified object from the whole bucket; you can update the logic and the prefix value as needed.
If you are running this from a Kubernetes pod, a VM, or anywhere else, you can pass these values as environment variables, or store the latest key in a Python dict if required.
If it's a small bucket then recursively list the bucket, with prefix as needed. Sort the results by timestamp, and create the pre-signed URL for the latest.
If it's a very large bucket, this will be very inefficient and you should consider other ways to store the key of the latest file. For example: trigger a Lambda function whenever an object is uploaded and write that object's key into a LATEST item in DynamoDB (or other persistent store).
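For the large-bucket approach, a minimal sketch of such a Lambda function (the table name and key schema are assumptions; the table would need a string partition key, here called id):
import boto3
import urllib.parse

# Hypothetical DynamoDB table with a string partition key named 'id'
table = boto3.resource('dynamodb').Table('latest-object-tracker')

def lambda_handler(event, context):
    # Triggered on s3:ObjectCreated:*; overwrite a single LATEST item with the newest key
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    table.put_item(Item={'id': 'LATEST', 'key': key})
The presigned URL can then be generated from the key stored in that item instead of listing the bucket.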

list automated RDS snapshots created today and copy to other region using boto3

We are building an automated DR cold site in another region. Currently we are working on retrieving a list of the RDS automated snapshots created today and passing them to another function that copies them to another AWS region.
The issue is with the RDS boto3 client, which returns the snapshot creation time in its own datetime format, making filtering on creation date more difficult.
import boto3
from datetime import datetime

today = (datetime.today()).date()
rds_client = boto3.client('rds')
snapshots = rds_client.describe_db_snapshots(SnapshotType='automated')

harini = "datetime(" + today.strftime('%Y,%m,%d') + ")"
print(harini)
print(snapshots)

for i in snapshots['DBSnapshots']:
    if i['SnapshotCreateTime'].date() == harini:
        print(i['DBSnapshotIdentifier'])
        print(today)
Despite having already converted the date "harini" to the format 'SnapshotCreateTime': datetime(2015, 1, 1), the Lambda function is still unable to list the snapshots.
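For reference, the comparison can be made directly between date objects instead of against a formatted string; a minimal sketch:
from datetime import date
import boto3

rds_client = boto3.client('rds')
snapshots = rds_client.describe_db_snapshots(SnapshotType='automated')

for i in snapshots['DBSnapshots']:
    # SnapshotCreateTime is a datetime, so compare its date() to today's date directly
    if i['SnapshotCreateTime'].date() == date.today():
        print(i['DBSnapshotIdentifier'])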
A better method is to copy the snapshots as they are created, by invoking a Lambda function from a CloudWatch event.
See step by step instruction:
https://geektopia.tech/post.php?blogpost=Automating_The_Cross_Region_Copy_Of_RDS_Snapshots
Alternatively, you can issue a copy for each snapshot regardless of the date. The client will raise an exception for snapshots that already exist in the destination, and you can trap it like this:
# Written By GeekTopia
#
# Copy All Snapshots for an RDS Instance To a new region
# --Free to use under all conditions
# --Script is provided as is. No Warranty, Express or Implied

import json
import boto3
from botocore.exceptions import ClientError
import time

destinationRegion = "us-east-1"
sourceRegion = 'us-west-2'
rdsInstanceName = 'needbackups'

def lambda_handler(event, context):
    # We need two clients
    #   rdsDestinationClient -- Used to start the copy processes. All cross-region
    #                           copies must be started from the destination and reference the source.
    #   rdsSourceClient      -- Used to list the snapshots that need to be copied.
    rdsDestinationClient = boto3.client('rds', region_name=destinationRegion)
    rdsSourceClient = boto3.client('rds', region_name=sourceRegion)

    # List all automated snapshots for a single instance
    snapshots = rdsSourceClient.describe_db_snapshots(DBInstanceIdentifier=rdsInstanceName, SnapshotType='automated')

    for snapshot in snapshots['DBSnapshots']:
        # Check that the snapshot is NOT in the process of being created
        if snapshot['Status'] == 'available':
            # Get the source snapshot ARN - always use the ARN when copying snapshots across regions
            sourceSnapshotARN = snapshot['DBSnapshotArn']

            # Build a new snapshot name
            sourceSnapshotIdentifer = snapshot['DBSnapshotIdentifier']
            targetSnapshotIdentifer = "{0}-ManualCopy".format(sourceSnapshotIdentifer)
            targetSnapshotIdentifer = targetSnapshotIdentifer.replace(":", "-")

            # Adding a delay to stop from reaching the API rate limit when there is a large number of snapshots.
            # This should never occur in this use-case, but may if the script is modified to copy more than one instance.
            time.sleep(.2)

            # Execute copy
            try:
                copy = rdsDestinationClient.copy_db_snapshot(SourceDBSnapshotIdentifier=sourceSnapshotARN, TargetDBSnapshotIdentifier=targetSnapshotIdentifer, SourceRegion=sourceRegion)
                print("Started Copy of Snapshot {0} in {2} to {1} in {3} ".format(sourceSnapshotIdentifer, targetSnapshotIdentifer, sourceRegion, destinationRegion))
            except ClientError as ex:
                if ex.response['Error']['Code'] == 'DBSnapshotAlreadyExists':
                    print("Snapshot {0} already exists".format(targetSnapshotIdentifer))
                else:
                    print("ERROR: {0}".format(ex.response['Error']['Code']))

    return {
        'statusCode': 200,
        'body': json.dumps('Operation Complete')
    }
The code below will list the automated snapshots created today.
import boto3
from datetime import date, datetime

region_src = 'us-east-1'
client_src = boto3.client('rds', region_name=region_src)
date_today = datetime.today().strftime('%Y-%m-%d')

def get_db_snapshots_src():
    response = client_src.describe_db_snapshots(
        SnapshotType='automated',
        IncludeShared=False,
        IncludePublic=False
    )

    snapshotsInDay = []
    for i in response["DBSnapshots"]:
        if i["SnapshotCreateTime"].strftime('%Y-%m-%d') == date.isoformat(date.today()):
            snapshotsInDay.append(i)

    return snapshotsInDay
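As a hedged usage sketch (the destination region and the -dr-copy suffix are assumptions), the snapshots returned above could then be copied to the other region by their ARNs:
region_dest = 'us-west-2'
client_dest = boto3.client('rds', region_name=region_dest)

def copy_db_snapshots_to_dest():
    for snap in get_db_snapshots_src():
        # Automated snapshot identifiers contain ':', which is not allowed in a target identifier
        target_id = snap['DBSnapshotIdentifier'].replace(':', '-') + '-dr-copy'
        client_dest.copy_db_snapshot(
            SourceDBSnapshotIdentifier=snap['DBSnapshotArn'],
            TargetDBSnapshotIdentifier=target_id,
            SourceRegion=region_src
        )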