Grouping S3 Files Into Subfolders - amazon-web-services

I have a pipeline that moves approximately 1 TB of data, all CSV files. In this pipeline there are hundreds of files with different names. They have a date component, which is automatically partitioned. My question is how to use the CDK to automatically create subfolders based on the name of the file. In other words, the data comes in as broad category, but our data scientists need it at one more level of detail.

It appears that your requirement is to move incoming objects into folders based on information in their filename (Key).
This could be done by adding a trigger on the Amazon S3 bucket that triggers an AWS Lambda function when a new object is created.
Here is some code from Moving file based on filename with Amazon S3:
import boto3
import urllib.parse
def lambda_handler(event, context):
    # Get the bucket and object key from the Event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    # Only copy objects that were uploaded to the bucket root (to avoid an infinite loop)
    if '/' not in key:
        # Determine destination directory based on Key
        directory = key  # Your logic goes here to extract the directory name
        # Copy object
        s3_client = boto3.client('s3')
        s3_client.copy_object(
            Bucket=bucket,
            Key=f"{directory}/{key}",
            CopySource={'Bucket': bucket, 'Key': key}
        )
        # Delete source object
        s3_client.delete_object(
            Bucket=bucket,
            Key=key
        )
You would need to modify the code that determines the name of the destination directory based on the key of the new object.
It also assumes that new objects will come into the top-level (root) of the bucket and then be moved into sub-directories. If, instead, new objects are coming in a given path (eg incoming/) then only set the S3 trigger to operate on that path and remove the if '/' not in key logic.
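The "your logic goes here" step depends entirely on how the files are named. As a minimal sketch, assume (hypothetically) that filenames look like <category>_<detail>_<date>.csv, for example sales_emea_2024-01-31.csv, and that one folder per category/detail is wanted. Neither this naming scheme nor the helper name destination_key comes from the original post:

```python
def destination_key(key: str) -> str:
    """Return the new key for an object uploaded to the bucket root.

    Assumes a hypothetical naming scheme: <category>_<detail>_<date>.csv
    Files that do not match the scheme are placed under 'unsorted/'.
    """
    stem = key.rsplit('.', 1)[0]   # drop the file extension
    parts = stem.split('_')
    if len(parts) < 3:
        return f"unsorted/{key}"
    category, detail = parts[0], parts[1]
    return f"{category}/{detail}/{key}"
```

In the handler, the value returned here would take the place of the {directory}/{key} destination passed to copy_object.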

Related

Unable to get object metadata from S3. Check object key, region and/or access permissions

I have a Lambda function that scans for text and is triggered by an S3 bucket. I get this error when I upload a photo directly into the S3 bucket using the browser:
Unable to get object metadata from S3. Check object key, region, and/or access permissions
However, if I hardcode the key (e.g., image01.jpg) which is in my bucket, there are no errors.
import json
import boto3
def lambda_handler(event, context):
    # Get bucket and file name
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    location = key[:17]
    s3Client = boto3.client('s3')
    client = boto3.client('rekognition', region_name='us-east-1')
    response = client.detect_text(Image={'S3Object': {'Bucket': 'myarrowbucket', 'Name': key}})
    detectedText = response['TextDetections']
I am confused, as it was working a few weeks ago, but now I am getting that error.
ANSWER
I have seen this question answered many times and tried every solution. The one that worked for me involved the key name: I was getting the metadata error when the filename contained special characters (e.g. - or _), but once I renamed the uploaded files it worked. Hope this answer helps someone.
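One likely explanation, offered as an assumption rather than a confirmed diagnosis of this case: S3 event notifications URL-encode the object key, so a space arrives as + and other special characters as %XX escapes. Passing the raw value on then points at a key that does not exist, which would also explain why a hardcoded key works. A minimal sketch that decodes the key first, using the same urllib.parse.unquote_plus call as the first answer on this page (the helper name bucket_and_key is hypothetical):

```python
import urllib.parse


def bucket_and_key(event):
    """Return (bucket, decoded key) from an S3 event notification."""
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    # Keys in S3 events are URL-encoded: "my photo.jpg" arrives as "my+photo.jpg"
    key = urllib.parse.unquote_plus(record['object']['key'])
    return bucket, key
```

The returned bucket and key can then be passed to detect_text in place of the hardcoded bucket name and the raw event key.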

AWS Lambda to write to Firebase Storage

I want to write a downloaded file to Firebase storage using AWS Lambda.
I have already written a dynamic link for Firebase Storage.
Can someone give me a hint on how to do that? I already have it working for S3; now I want to store the file in Firebase Storage instead.
Only index.html needs to be stored, under the Firebase Storage name google-list/index.html.
def lambda_handler(event, context):
    url = 'https://www.google.com/index.html'  # put your url here
    bucket = 'google-list'  # your s3 bucket
    key = 'index.html'  # your path
    # write to s3 want to replace with firestorage.
    # s3=boto3.client('s3')
    # http=urllib3.PoolManager()
    # s3.upload_fileobj(http.request('GET', url,preload_content=False), bucket, key, ExtraArgs={'ACL':'public-read'})

Move S3 files older than 100 days to another bucket

Is there a way to find all files that are older than 100 days in one S3 bucket and move them to a different bucket? Solutions using AWS CLI or SDK both welcome.
In the src bucket, the files are organized like bucket/type/year/month/day/hour/file
S3://my-logs-bucket/logtype/2020/04/30/16/logfile.csv
For instance, on 2020/04/30, log files on or before 2020/01/21 will have to be moved.
Here's some Python code that will:
Move files from Bucket-A to Bucket-B if they are older than a given period
Full names and paths will be retained
import boto3
from datetime import datetime, timedelta
SOURCE_BUCKET = 'bucket-a'
DESTINATION_BUCKET = 'bucket-b'
s3_client = boto3.client('s3')
# Create a reusable Paginator
paginator = s3_client.get_paginator('list_objects_v2')
# Create a PageIterator from the Paginator
page_iterator = paginator.paginate(Bucket=SOURCE_BUCKET)
# Loop through each object, looking for ones older than a given time period
for page in page_iterator:
    # 'Contents' is absent when a page holds no objects, so default to an empty list
    for obj in page.get('Contents', []):
        if obj['LastModified'] < datetime.now().astimezone() - timedelta(days=2):  # <-- Change time period here
            print(f"Moving {obj['Key']}")
            # Copy object
            s3_client.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=obj['Key'],
                CopySource={'Bucket': SOURCE_BUCKET, 'Key': obj['Key']}
            )
            # Delete original object
            s3_client.delete_object(Bucket=SOURCE_BUCKET, Key=obj['Key'])
It worked for me, but please test it on less-important data before deploying in production since it deletes objects!
The code uses a paginator in case there are over 1000 objects in the bucket.
You can change the time period as desired.
(In addition to the license granted under the terms of service of this site the contents of this post are licensed under MIT-0.)
As mentioned in my comments, you can create a lifecycle policy for an S3 bucket. Here are the steps: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
Deleting (expiring) objects with lifecycle rules is optional; you define the actions you want applied to the objects in your S3 bucket.
Lifecycle policies use different storage classes to transition your objects. Before configuring lifecycle policies, I suggest reading up on the different storage classes, as each has its own associated cost: Standard-IA, One Zone-IA, Glacier, and Deep Archive.
For your use case of 100 days, I recommend transitioning your logs to an archive storage class such as S3 Glacier. This might prove to be more cost-effective.
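As a rough sketch of the lifecycle approach with boto3: a rule like the one below transitions objects to the GLACIER storage class after 100 days. Note that lifecycle rules change the storage class (or expire objects) within the same bucket; they do not move objects to a different bucket. The rule ID, the helper names, and the example prefix are assumptions made for illustration:

```python
def glacier_transition_rule(prefix: str, days: int = 100) -> dict:
    """Build a lifecycle configuration that transitions objects under
    `prefix` to the GLACIER storage class once they are `days` old."""
    return {
        'Rules': [{
            'ID': f'archive-after-{days}-days',   # hypothetical rule name
            'Filter': {'Prefix': prefix},
            'Status': 'Enabled',
            'Transitions': [{'Days': days, 'StorageClass': 'GLACIER'}],
        }]
    }


def apply_lifecycle(bucket: str, config: dict) -> None:
    """Attach the configuration to the bucket (replaces any existing rules)."""
    import boto3  # imported here so the rule builder can be used without AWS access
    boto3.client('s3').put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=config,
    )
```

For the bucket in the question this might be called as apply_lifecycle('my-logs-bucket', glacier_transition_rule('logtype/')); test it on a non-critical bucket first, since it overwrites the bucket's existing lifecycle configuration.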
Adding on to John's answer: if the objects are not in the root directory of the bucket, a few adjustments to the script are needed. If they are in the root directory, use John's answer; this script only works if the objects are in a sub-directory. It moves objects from bucket/path/to/objects/ to bucket2/path/to/objects/, assuming you can access both buckets with the same set of AWS CLI credentials.
import boto3
from datetime import datetime, timedelta
SOURCE_BUCKET = 'bucket-a'
SOURCE_PATH = 'path/to/objects/'
DESTINATION_BUCKET = 'bucket-b'
DESTINATION_PATH = 'path/to/send/objects/'  # <- you may need to add a prefix of the filenames to the end so that paginator doesn't look at the 'objects' directory
s3_client = boto3.client('s3')
# Create a reusable Paginator
paginator = s3_client.get_paginator('list_objects_v2')
# Create a PageIterator from the Paginator and include Prefix argument and optional PaginationConfig argument to control the number of objects you want to iterate over (in case you have a lot)
page_iterator = paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PATH, PaginationConfig={'MaxItems': 10000})
# Loop through each object, looking for ones older than a given time period
for page in page_iterator:
    for obj in page.get("Contents", []):
        if obj['LastModified'] < datetime.now().astimezone() - timedelta(days=100):  # <-- Change time period here
            # grab filename from path/to/filename
            filename = obj['Key'].rsplit('/', 1)[1]
            # Skip "folder" placeholder objects whose key ends in '/'
            if not filename:
                continue
            # Copy object
            s3_client.copy_object(
                Bucket=DESTINATION_BUCKET,
                Key=DESTINATION_PATH + filename,
                CopySource={'Bucket': SOURCE_BUCKET, 'Key': obj['Key']}
            )
            # Delete original object
            s3_client.delete_object(Bucket=SOURCE_BUCKET, Key=obj['Key'])

Generate presigned s3 URL of latest object in the bucket using boto3

I have an S3 bucket with multiple folders. How can I generate an S3 presigned URL for the latest object, using Python boto3, for each folder a user asks for?
You can do something like
import boto3
from botocore.client import Config
import requests
bucket = 'bucket-name'
folder = ''  # empty prefix searches the whole bucket; for a folder use e.g. 'path/to/folder/' (keep the trailing '/')
s3 = boto3.client('s3', config=Config(signature_version='s3v4'))
objs = s3.list_objects(Bucket=bucket, Prefix=folder)['Contents']
latest = max(objs, key=lambda x: x['LastModified'])
print(latest)
print("Generating pre-signed url...")
url = s3.generate_presigned_url(
    ClientMethod='get_object',
    Params={
        'Bucket': bucket,
        'Key': latest['Key']
    }
)
print(url)
response = requests.get(url)
print(response.url)
As written, this returns the most recently modified object in the whole bucket; adjust the logic and the prefix value as needed.
If you are running in a Kubernetes pod, a VM, or similar, you can pass the values in as environment variables, or use a Python dict to store the latest key if required.
If it's a small bucket then recursively list the bucket, with prefix as needed. Sort the results by timestamp, and create the pre-signed URL for the latest.
If it's a very large bucket, this will be very inefficient and you should consider other ways to store the key of the latest file. For example: trigger a Lambda function whenever an object is uploaded and write that object's key into a LATEST item in DynamoDB (or other persistent store).
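For the small-bucket approach (list, compare timestamps, pick the latest), the "latest object per folder" step can be sketched as a plain function over the listing results. This is only an illustration: the helper name latest_key_per_folder and the sample data are made up, and "folder" is assumed to mean the first path segment of the key:

```python
from datetime import datetime, timezone


def latest_key_per_folder(objects):
    """Map each top-level folder to the key of its most recently modified object.

    `objects` is a list of dicts with 'Key' and 'LastModified', the same shape
    as the entries in the 'Contents' list of an S3 listing response.
    """
    latest = {}
    for obj in objects:
        key = obj['Key']
        if key.endswith('/'):
            continue  # skip zero-byte "folder" placeholder objects
        folder = key.split('/', 1)[0] if '/' in key else ''
        if folder not in latest or obj['LastModified'] > latest[folder]['LastModified']:
            latest[folder] = obj
    return {folder: obj['Key'] for folder, obj in latest.items()}


# Made-up listing entries for illustration
sample = [
    {'Key': 'a/old.csv', 'LastModified': datetime(2020, 1, 1, tzinfo=timezone.utc)},
    {'Key': 'a/new.csv', 'LastModified': datetime(2020, 2, 1, tzinfo=timezone.utc)},
    {'Key': 'b/only.csv', 'LastModified': datetime(2020, 1, 15, tzinfo=timezone.utc)},
    {'Key': 'b/', 'LastModified': datetime(2020, 3, 1, tzinfo=timezone.utc)},
]
print(latest_key_per_folder(sample))
```

Each key in the result can then be passed to generate_presigned_url as in the answer above.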

Copying existing files in a s3 Bucket to another s3 bucket

I have an existing S3 bucket which contains a large number of files. I want to run a Lambda function every minute and copy those files to another destination S3 bucket.
My function is:
import boto3
s3 = boto3.resource('s3')
clientname = boto3.client('s3')
def lambda_handler(event, context):
    bucket = 'test-bucket-for-transfer-check'
    try:
        response = clientname.list_objects(
            Bucket=bucket,
            MaxKeys=5
        )
        for record in response['Contents']:
            key = record['Key']
            copy_source = {
                'Bucket': bucket,
                'Key': key
            }
            try:
                destbucket = s3.Bucket('serverless-demo-s3-bucket')
                destbucket.copy(copy_source, key)
                print('{} transferred to destination bucket'.format(key))
            except Exception as e:
                print(e)
                print('Error getting object {} from bucket {}. '.format(key, bucket))
                raise e
    except Exception as e:
        print(e)
        raise e
Now, how can I make sure the function copies new files each time it runs?
Suppose the two buckets are Bucket-A and Bucket-B, and the task is to copy files from Bucket-A --> Bucket-B.
Lambda-1 (the code in the question) reads all files from Bucket-A and copies them one by one to Bucket-B in a loop.
In the same loop it also puts one entry into a DynamoDB table, say "Copy_Logs", with the columns File_Key and Flag, where File_Key is the object key of the file and Flag is set to false to record the state of the copy operation.
Now configure events on Bucket-B to invoke Lambda-2 on every put and multipart upload.
Lambda-2 reads the object key from the S3 notification payload and updates the matching record in the DynamoDB table, setting Flag to true.
The "Copy_Logs" table now records which files were copied successfully and which were not.
HIH
[Update]
I missed the last part. It's a one-way sync. You can also get the list of file names from both buckets, take the difference of the two sets, and upload the files that are missing to the target bucket.
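The set-difference idea can be sketched as follows. The helper names are hypothetical, and comparing by key alone is an assumption: it finds files that are missing from the target but will not notice a file whose contents changed under the same key:

```python
def keys_missing_from_target(source_keys, target_keys):
    """Return the keys present in the source bucket but not in the target, sorted."""
    return sorted(set(source_keys) - set(target_keys))


def list_keys(bucket):
    """List every key in a bucket, following pagination past 1000 objects."""
    import boto3  # imported lazily so the diff helper above works without AWS access
    paginator = boto3.client('s3').get_paginator('list_objects_v2')
    return [
        obj['Key']
        for page in paginator.paginate(Bucket=bucket)
        for obj in page.get('Contents', [])
    ]
```

Inside the Lambda function, the copy loop would then iterate over keys_missing_from_target(list_keys(source), list_keys(target)) instead of the first five keys returned by list_objects.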