Is it possible to copy/duplicate objects from one prefix to another prefix in the same S3 bucket?
You can use copy_object() to copy an object in Amazon S3 to another prefix, another bucket and even another Region. The copying takes place entirely within S3, without needing to download/upload the object.
For example, to copy an object in mybucket from folder1/foo.txt to folder2/foo.txt, you could use:
import boto3

s3_client = boto3.client('s3')
response = s3_client.copy_object(
    CopySource='/mybucket/folder1/foo.txt',  # /Bucket-name/path/filename
    Bucket='mybucket',                       # Destination bucket
    Key='folder2/foo.txt'                    # Destination path/filename
)
An alternative using boto3 resource instead of client:
bucket = boto3.resource("s3").Bucket(my_bucket_name)
copy_source = {"Bucket": my_bucket_name, "Key": my_old_key}
bucket.copy(copy_source, my_new_key)
Where my_bucket_name, my_old_key and my_new_key are user defined variables.
Depending on the setup, additional arguments might be needed to instantiate a boto3 resource. A more complete instantiation call would be:
boto3.resource(
    "s3",
    endpoint_url=my_endpoint_url,
    aws_access_key_id=my_aws_access_key_id,  # Do not expose me in source code!
    aws_secret_access_key=my_aws_secret_access_key,  # Do not expose me in source code!
)
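To copy everything under one prefix rather than a single object, the same copy_object() call can be wrapped in a loop over a paginated listing. The following is a minimal sketch, not a definitive implementation: the helper names destination_key and copy_prefix are made up for illustration, and s3_client is assumed to be a boto3 S3 client such as boto3.client('s3').

```python
def destination_key(key, src_prefix, dst_prefix):
    """Swap src_prefix for dst_prefix at the start of key (hypothetical helper)."""
    if not key.startswith(src_prefix):
        raise ValueError(f"{key!r} does not start with {src_prefix!r}")
    return dst_prefix + key[len(src_prefix):]


def copy_prefix(s3_client, bucket, src_prefix, dst_prefix):
    """Copy every object under src_prefix to dst_prefix within one bucket.

    s3_client is assumed to be a boto3 S3 client.
    Returns the list of destination keys that were written.
    """
    copied = []
    # list_objects_v2 returns at most 1000 keys per call, so paginate
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        # An empty page has no "Contents" entry
        for obj in page.get("Contents", []):
            new_key = destination_key(obj["Key"], src_prefix, dst_prefix)
            s3_client.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
            )
            copied.append(new_key)
    return copied
```

Usage would look like copy_prefix(boto3.client('s3'), 'mybucket', 'folder1/', 'folder2/'). Note that a single copy_object() call handles objects up to 5 GB; larger objects need a multipart copy, which the managed copy() method takes care of.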
Related
I have a pipeline that moves approximately 1 TB of data, all CSV files. In this pipeline there are hundreds of files with different names. They have a date component, which is automatically partitioned. My question is how to use the CDK to automatically create subfolders based on the name of the file. In other words, the data comes in as a broad category, but our data scientists need it at one more level of detail.
It appears that your requirement is to move incoming objects into folders based on information in their filename (Key).
This could be done by adding a trigger on the Amazon S3 bucket that triggers an AWS Lambda function when a new object is created.
Here is some code from Moving file based on filename with Amazon S3:
import boto3
import urllib.parse  # 'import urllib' alone does not guarantee urllib.parse is loaded

def lambda_handler(event, context):
    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    # Only copy objects that were uploaded to the bucket root (to avoid an infinite loop)
    if '/' not in key:
        # Determine destination directory based on Key
        directory = key  # Your logic goes here to extract the directory name
        # Copy object
        s3_client = boto3.client('s3')
        s3_client.copy_object(
            Bucket=bucket,
            Key=f"{directory}/{key}",
            CopySource={'Bucket': bucket, 'Key': key}
        )
        # Delete source object
        s3_client.delete_object(
            Bucket=bucket,
            Key=key
        )
You would need to modify the code that determines the name of the destination directory based on the key of the new object.
It also assumes that new objects will arrive in the top level (root) of the bucket and then be moved into sub-directories. If, instead, new objects arrive under a given path (e.g. incoming/), then configure the S3 trigger to operate only on that path and remove the if '/' not in key logic.
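As an illustration of the "Your logic goes here" step in the handler, here is one way the directory name could be derived. It is only a sketch under an assumed naming convention of category_YYYY-MM-DD.csv, which the question does not actually specify; the function name directory_for and the fallback folder "unsorted" are also hypothetical.

```python
import re

# Assumed (hypothetical) filename convention: <category>_<YYYY-MM-DD>.csv
_FILENAME_PATTERN = re.compile(
    r"^(?P<category>[A-Za-z0-9-]+)_(?P<date>\d{4}-\d{2}-\d{2})\.csv$"
)


def directory_for(key):
    """Return the destination directory for a root-level object key.

    Keys that do not follow the assumed convention go to 'unsorted'.
    """
    match = _FILENAME_PATTERN.match(key)
    if match is None:
        return "unsorted"
    return match.group("category")
```

With this in place, the handler line would become directory = directory_for(key), so that sales_2024-01-15.csv is copied to sales/sales_2024-01-15.csv.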
I want to replicate a few prefixes into another bucket.
I do not want to use sync or the replication service. I want to do it with Lambda only. I have an ongoing migration running, so data arrives every 20 minutes; no transformation is required.
Can someone show me how to use AWS Lambda to sync and copy the objects?
You can configure a Trigger for an AWS Lambda function so that it is invoked when an object is added to the Amazon S3 bucket. The function will receive information about the object that triggered the function, including its Bucket and Key.
You should code the Lambda function to call CopyObject() to copy that object to the desired destination bucket. Here's an example in Python:
import boto3
import urllib.parse  # 'import urllib' alone does not guarantee urllib.parse is loaded

def lambda_handler(event, context):
    TARGET_BUCKET = 'my-target-bucket'
    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    # Copy object
    s3_client = boto3.client('s3')
    s3_client.copy_object(
        Bucket=TARGET_BUCKET,
        Key=key,
        CopySource={'Bucket': bucket, 'Key': key}
    )
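Since only a few prefixes should be replicated, the handler could filter on the key before copying. The sketch below is an assumption-laden variant of the example: the prefixes in PREFIXES and the helper names objects_to_copy and replicate are hypothetical, and s3_client is assumed to be a boto3 S3 client. It also loops over every record in the event instead of reading only the first one.

```python
import urllib.parse

# Hypothetical prefixes to replicate; replace with your own
PREFIXES = ("reports/", "exports/")


def objects_to_copy(event, prefixes=PREFIXES):
    """Return (bucket, key) pairs for records whose key starts with a wanted prefix."""
    selected = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 event notifications
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(tuple(prefixes)):
            selected.append((bucket, key))
    return selected


def replicate(event, s3_client, target_bucket):
    """Copy each selected object to target_bucket under the same key."""
    for bucket, key in objects_to_copy(event):
        s3_client.copy_object(
            Bucket=target_bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
        )
```

Inside lambda_handler this would be called as replicate(event, boto3.client('s3'), 'my-target-bucket'). Alternatively, the S3 trigger itself can be configured with a prefix filter, one trigger per prefix, so the function is never invoked for other keys.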
Is there a way to find all files that are older than 100 days in one S3 bucket and move them to a different bucket? Solutions using AWS CLI or SDK both welcome.
In the src bucket, the files are organized like bucket/type/year/month/day/hour/file
S3://my-logs-bucket/logtype/2020/04/30/16/logfile.csv
For instance, on 2020/04/30, log files on or before 2020/01/21 will have to be moved.
Here's some Python code that will:
Move files from Bucket-A to Bucket-B if they are older than a given period
Full names and paths will be retained
import boto3
from datetime import datetime, timedelta

SOURCE_BUCKET = 'bucket-a'
DESTINATION_BUCKET = 'bucket-b'

s3_client = boto3.client('s3')
# Create a reusable Paginator
paginator = s3_client.get_paginator('list_objects_v2')
# Create a PageIterator from the Paginator
page_iterator = paginator.paginate(Bucket=SOURCE_BUCKET)
# Objects last modified before this moment will be moved
cutoff = datetime.now().astimezone() - timedelta(days=2)  # <-- Change time period here
# Loop through each object, looking for ones older than a given time period
for page in page_iterator:
    # 'Contents' is missing when the bucket is empty, so default to an empty list
    for obj in page.get('Contents', []):
        if obj['LastModified'] < cutoff:
            print(f"Moving {obj['Key']}")
            # Copy object
            s3_client.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=obj['Key'],
                CopySource={'Bucket': SOURCE_BUCKET, 'Key': obj['Key']}
            )
            # Delete original object
            s3_client.delete_object(Bucket=SOURCE_BUCKET, Key=obj['Key'])
It worked for me, but please test it on less-important data before deploying in production since it deletes objects!
The code uses a paginator in case there are over 1000 objects in the bucket.
You can change the time period as desired.
(In addition to the license granted under the terms of service of this site the contents of this post are licensed under MIT-0.)
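The script above decides on LastModified. Since the keys in the question already encode the date (type/year/month/day/hour/file), the age could instead be derived from the key itself. This is a rough sketch assuming every key follows that layout exactly; key_date and should_move are hypothetical helper names, and a key with a different shape would raise an error.

```python
from datetime import date, timedelta


def key_date(key):
    """Parse the date out of a key shaped like 'type/YYYY/MM/DD/HH/file'."""
    parts = key.split("/")
    return date(int(parts[1]), int(parts[2]), int(parts[3]))


def should_move(key, today, days=100):
    """True if the date in the key is at least `days` days before `today`."""
    return key_date(key) <= today - timedelta(days=days)
```

For the example in the question, should_move('logtype/2020/01/21/16/logfile.csv', date(2020, 4, 30)) is True, while the same key dated 2020/01/22 is not yet due.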
As mentioned in my comments, you can create a lifecycle policy for an S3 bucket. Here are the steps to do it: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
Deleting (expiring) objects is optional in a lifecycle policy; you define which actions apply to the objects in your S3 bucket.
Lifecycle policies use different storage classes to transition your objects. Before configuring lifecycle policies, I suggest reading up on the different storage classes, as each has its own associated cost: Standard-IA, One Zone-IA, Glacier, and Deep Archive.
For your use case of 100 days, I recommend transitioning your logs to an archive storage class such as S3 Glacier. This might prove to be more cost-effective.
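If you prefer to set the policy from code rather than the console, boto3 exposes put_bucket_lifecycle_configuration(). The following is a sketch only: the rule ID, the prefix, and the helper names glacier_transition_rule and apply_lifecycle are assumptions for illustration, and s3_client is assumed to be a boto3 S3 client.

```python
def glacier_transition_rule(prefix, days=100, rule_id="archive-old-logs"):
    """Build a lifecycle rule that moves objects under prefix to Glacier.

    rule_id is a hypothetical name; any ID unique within the bucket works.
    """
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
    }


def apply_lifecycle(s3_client, bucket, rules):
    """Set the given rules as the bucket's lifecycle configuration."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )
```

A call such as apply_lifecycle(boto3.client('s3'), 'my-logs-bucket', [glacier_transition_rule('logtype/')]) would apply it. Be aware that this call replaces any lifecycle configuration already on the bucket, so existing rules need to be included in the list.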
Adding to John's answer: if the objects are not in the root directory of the bucket, a few adjustments to the script are needed. If they are in the root directory, use John's answer; this script only works when the objects are in a sub-directory, because it takes the filename from the part of the key after the last '/'. It moves objects from bucket/path/to/objects/ to bucket2/path/to/objects/, assuming you can access both buckets with the same set of AWS credentials.
import boto3
from datetime import datetime, timedelta

SOURCE_BUCKET = 'bucket-a'
SOURCE_PATH = 'path/to/objects/'
DESTINATION_BUCKET = 'bucket-b'
DESTINATION_PATH = 'path/to/send/objects/'  # <- you may need to add a prefix of the filenames to the end so that paginator doesn't look at the 'objects' directory

s3_client = boto3.client('s3')
# Create a reusable Paginator
paginator = s3_client.get_paginator('list_objects_v2')
# Create a PageIterator from the Paginator and include the Prefix argument and an optional
# PaginationConfig argument to limit the number of objects iterated over (in case you have a lot)
page_iterator = paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PATH, PaginationConfig={'MaxItems': 10000})
# Objects last modified before this moment will be moved
cutoff = datetime.now().astimezone() - timedelta(days=100)  # <-- Change time period here
# Loop through each object, looking for ones older than a given time period
for page in page_iterator:
    for obj in page.get("Contents", []):
        if obj['LastModified'] < cutoff:
            # grab filename from path/to/filename
            FILENAME = obj['Key'].rsplit('/', 1)[1]
            # Copy object
            s3_client.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=DESTINATION_PATH + FILENAME,
                CopySource={'Bucket': SOURCE_BUCKET, 'Key': obj['Key']}
            )
            # Delete original object
            s3_client.delete_object(Bucket=SOURCE_BUCKET, Key=obj['Key'])
I have an S3 bucket with multiple folders. How can I generate an S3 presigned URL for the latest object in whichever folder a user asks for, using Python and boto3?
You can do something like
import boto3
from botocore.client import Config
import requests

bucket = 'bucket-name'
folder = ''  # add a folder path here, ending with '/', e.g. 'reports/'; an empty prefix searches the whole bucket
s3 = boto3.client('s3', config=Config(signature_version='s3v4'))
# Note: list_objects returns at most 1000 keys per call
objs = s3.list_objects(Bucket=bucket, Prefix=folder)['Contents']
latest = max(objs, key=lambda x: x['LastModified'])
print(latest)
print("Generating pre-signed url...")
url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={
        'Bucket': bucket,
        'Key': latest['Key']
    }
)
print(url)
response = requests.get(url)
print(response.url)
As written, it returns the most recently modified object in the whole bucket; you can update the logic and the prefix value as needed.
If you are running in a Kubernetes pod, a VM, or similar, you can pass environment variables or use a Python dict to store the latest key if required.
If it's a small bucket then recursively list the bucket, with prefix as needed. Sort the results by timestamp, and create the pre-signed URL for the latest.
If it's a very large bucket, this will be very inefficient and you should consider other ways to store the key of the latest file. For example: trigger a Lambda function whenever an object is uploaded and write that object's key into a LATEST item in DynamoDB (or other persistent store).
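For the small-bucket approach, a paginated listing avoids missing objects once a folder holds more than 1000 keys. The sketch below assumes s3_client is a boto3 S3 client; the function names latest_key and presigned_url_for_latest and the one-hour expiry are illustrative choices, not anything from the question.

```python
def latest_key(s3_client, bucket, prefix=""):
    """Return the key of the most recently modified object under prefix, or None."""
    latest = None
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # An empty page has no "Contents" entry
        for obj in page.get("Contents", []):
            if latest is None or obj["LastModified"] > latest["LastModified"]:
                latest = obj
    return None if latest is None else latest["Key"]


def presigned_url_for_latest(s3_client, bucket, prefix="", expires_in=3600):
    """Return a presigned GET URL for the newest object under prefix, or None."""
    key = latest_key(s3_client, bucket, prefix)
    if key is None:
        return None
    return s3_client.generate_presigned_url(
        ClientMethod="get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # validity in seconds
    )
```

For a user asking about the folder reports/, the call would be presigned_url_for_latest(boto3.client('s3'), 'bucket-name', 'reports/'). This still lists every key under the prefix on each request, so the cost grows with the folder size.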
When using the AWS S3 service, I need to change the storage class of an existing key from STANDARD to STANDARD_IA.
change_storage_class from boto doesn't exist in boto3.
What is the equivalent in Boto3?
From the Amazon documentation:
You can also change the storage class of an object that is already stored in Amazon S3 by copying it to the same key name in the same bucket. To do that, you use the following request headers in a PUT Object copy request:
x-amz-metadata-directive set to COPY
x-amz-storage-class set to STANDARD, STANDARD_IA, or REDUCED_REDUNDANCY
In terms of boto3 copy code, this will look like:
import boto3

s3 = boto3.client('s3')
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.copy(
    copy_source, 'mybucket', 'mykey',
    ExtraArgs={
        'StorageClass': 'STANDARD_IA',
        'MetadataDirective': 'COPY'
    }
)
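To confirm the change took effect, the storage class can be read back with head_object(). This is a small sketch under two assumptions: s3_client is a boto3 S3 client, and the response omits StorageClass for objects in the STANDARD class (which matches my understanding of the S3 behaviour, so the function falls back to 'STANDARD'). The function name current_storage_class is made up for illustration.

```python
def current_storage_class(s3_client, bucket, key):
    """Return the storage class of an object, defaulting to 'STANDARD'.

    Assumes head_object leaves out 'StorageClass' for STANDARD objects.
    """
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response.get("StorageClass", "STANDARD")
```

After the copy, current_storage_class(s3, 'mybucket', 'mykey') would be expected to return 'STANDARD_IA'.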