I'm using Databricks with the Python runtime 8.2.
There are two S3 buckets:
Remote S3 - belongs to someone else; I have access ONLY to bucketname/my_prefix.
Local S3 - belongs to me; I have full control over the bucket policy (I can mount it to DBFS, copy files from DBFS to the local S3, etc.).
My mission is to copy LOTS of files from the remote S3, which I can't mount, to my local S3.
From the Databricks documentation, "Mount S3 using AWS access keys":
Important
When you mount an S3 bucket using keys, all users have read and write access to all the objects in the S3 bucket.
I have tried to use the boto3 library but I can't seem to have much luck.
Essentially, is there a way to copy files from remote S3 to dbfs?
Here is one version of the code I am using - I have tried SO many other ways.
# test if the list of days to copy is not empty; loop through each
# day and copy the S3 folder contents
if len(days_to_copy) != 0:
    for day in days_to_copy:
        # convert the items in the list from date to string and format yyyy/mm/dd
        day_to_copy = day.strftime('%Y/%m/%d')
        print(day_to_copy)

        source_bucket = "Remote-S3"
        source_key = "myclient/2021/10/06/part-000d339cb-c000.snappy.parquet"
        target_bucket = "Local-s3"
        target_key = "{}/".format(day_to_copy)

        copy_source = {
            'Bucket': source_bucket,
            'Key': source_key
        }
        copy_target = {
            'Bucket': target_bucket,
            'Key': target_key
        }
        print('copy_source', copy_source)
        print('copy_target', copy_target)

        s3.meta.client.copy(copy_source, 'Local-s3', '2021/10/06/part-000d339cb-c000.snappy.parquet')
This code gives this error:
ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
Thanks.
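For reference, here is a minimal sketch of what I'm trying to end up with, listing the remote prefix first so the exact keys are known before copying. The bucket names, prefix, and credential handling are placeholders, and my understanding is that the 404 on HeadObject means the Bucket/Key pair passed to copy() doesn't match an existing object.

import boto3

# Placeholder names; the real buckets, prefix, and credential handling differ.
SOURCE_BUCKET = "Remote-S3"
SOURCE_PREFIX = "myclient/2021/10/06/"
TARGET_BUCKET = "Local-s3"

s3 = boto3.client('s3')

# List the objects under the remote prefix and copy them one by one,
# keeping the same yyyy/mm/dd path in the target bucket.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get('Contents', []):
        source_key = obj['Key']
        # Drop the leading "myclient/" so the target key starts with yyyy/mm/dd
        target_key = source_key.split('/', 1)[1]
        s3.copy({'Bucket': SOURCE_BUCKET, 'Key': source_key}, TARGET_BUCKET, target_key)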
Related
I am planning to use the AWS Python SDK (Boto3) to copy files from one bucket to another. Below is the sample code I got from the AWS documentation:
dest_object.copy_from(CopySource={
    'Bucket': self.object.bucket_name,
    'Key': self.object.key
})
My question is: how do I trigger this code, and where should I deploy it?
I originally thought of a Lambda function, but I am looking for alternative options in case Lambda times out for larger files (1 TB, etc.).
Can I use Airflow to trigger this code somehow? Maybe invoke it through Lambda? Looking for suggestions from AWS experts.
The easiest way to copy new files to another bucket is to use Amazon S3 Replication. It will automatically copy new objects to the selected bucket, no code required.
However, this will not meet your requirement of deleting the incoming file after it is copied. Therefore, you should create an AWS Lambda function and add an S3 trigger. This will trigger the Lambda function whenever an object is created in the bucket.
The Lambda function should:
Extract the bucket and object name from the event parameter
Copy the object to the target bucket
Delete the original object
The code would look something like:
import boto3
import urllib.parse

TARGET_BUCKET = 'target_bucket'  # Change this

def lambda_handler(event, context):
    s3_resource = boto3.resource('s3')

    # Loop through each incoming object
    for record in event['Records']:

        # Get incoming bucket and key
        source_bucket = record['s3']['bucket']['name']
        source_key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # Copy object to different bucket
        copy_source = {
            'Bucket': source_bucket,
            'Key': source_key
        }
        s3_resource.Bucket(TARGET_BUCKET).Object(source_key).copy(copy_source)

        # Delete original object
        s3_resource.Bucket(source_bucket).Object(source_key).delete()
The copy process is unlikely to approach the 15-minute limit of AWS Lambda, but it is worth testing on large objects.
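For reference, a hedged sketch of wiring up the S3 trigger from code rather than the console; the bucket name and Lambda ARN are placeholders. S3 validates that it can invoke the function, so the Lambda's resource-based policy must already allow S3 to call it.

import boto3

SOURCE_BUCKET = 'source_bucket'  # placeholder
LAMBDA_ARN = 'arn:aws:lambda:us-east-1:123456789012:function:copy-and-delete'  # placeholder

s3 = boto3.client('s3')

# Send ObjectCreated events from the source bucket to the Lambda function
s3.put_bucket_notification_configuration(
    Bucket=SOURCE_BUCKET,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': LAMBDA_ARN,
                'Events': ['s3:ObjectCreated:*']
            }
        ]
    }
)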
I want to automatically move objects from a first S3 bucket to a second bucket. As and when a file is created or uploaded to the first bucket, it should be moved across to the second bucket. There shouldn't be any copy of the file left in the source bucket after the transfer.
I have seen examples of aws s3 sync, but that leaves a copy in the source bucket and it's not automated.
The aws s3 mv command from the CLI will move the files across, but how do I automate the process? Creating a Lambda notification and sending the files to the second bucket could solve it, but I am looking for a simpler, more automated solution. Not sure if there is anything we could do with SQS? Is there anything we can set on the source bucket which would automatically send the object to the second? Appreciate any ideas.
There is no "move" command in Amazon S3. Instead, it involves CopyObject() and DeleteObject(). Even the AWS CLI aws mv command does a Copy and Delete.
The easiest way to achieve your object is:
Configure an Event on the source bucket to trigger an AWS Lambda function
Code the Lambda function to copy the object to the target bucket, then delete the source object
Here's some sample code:
import boto3
import urllib.parse

TARGET_BUCKET = 'my-target-bucket'

def lambda_handler(event, context):

    # Get incoming bucket and key
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    source_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Copy object to different bucket
    s3_resource = boto3.resource('s3')
    copy_source = {
        'Bucket': source_bucket,
        'Key': source_key
    }
    s3_resource.Bucket(TARGET_BUCKET).Object(source_key).copy(copy_source)

    # Delete the source object
    s3_resource.Bucket(source_bucket).Object(source_key).delete()
It will copy the object to the same path in the destination bucket and then delete the source object.
The Lambda function should be assigned an IAM Role that has sufficient permission to GetObject and DeleteObject on the source bucket, and PutObject in the destination bucket.
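A hedged sketch of attaching such an inline policy with boto3; the role name, policy name, and bucket names are placeholders.

import json
import boto3

iam = boto3.client('iam')

# Placeholder names - substitute the real role and buckets
ROLE_NAME = 'my-copy-lambda-role'
SOURCE_BUCKET = 'my-source-bucket'
TARGET_BUCKET = 'my-target-bucket'

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Read and delete objects in the source bucket
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{SOURCE_BUCKET}/*"
        },
        {   # Write objects into the target bucket
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{TARGET_BUCKET}/*"
        }
    ]
}

# Attach the policy inline to the Lambda's execution role
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName='s3-move-objects',
    PolicyDocument=json.dumps(policy)
)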
At the moment I am facing a problem: I can't determine whether a file was PUT via AWS Transfer Family or via the S3 console.
Is there any way to tag, by default, files which are PUT on S3 via AWS Transfer Family?
Regards
Ribase
There is S3 object metadata, described in the Transfer Family user guide under post-upload processing, which indicates that Transfer Family uploaded the object.
One use case for the metadata is when an SFTP user has an inbox and an outbox. For the inbox, objects are put by an SFTP client. For the outbox, objects are put by the post-upload processing pipeline. If there is an S3 event notification, the downstream service on the processor side can do an S3 HeadObject call for the metadata, dismiss the event if the metadata is not present, and only process incoming files.
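A hedged sketch of that check with boto3; the metadata key used here is an assumption and should be confirmed against the Transfer Family user guide.

import boto3

s3 = boto3.client('s3')

def was_uploaded_by_transfer_family(bucket, key):
    """Return True if the object carries the Transfer Family upload metadata."""
    response = s3.head_object(Bucket=bucket, Key=key)
    metadata = response.get('Metadata', {})
    # Assumed metadata key; verify the exact name in the Transfer Family docs.
    user_agent = metadata.get('user-agent', '')
    return 'transfer' in user_agent.lower()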
You could also use Transfer Family managed workflows to apply a Tag step. An example of using the Tag step can be found in demo 1 of the AWS Transfer Family managed workflows demo video.
Configure the S3 bucket where Transfer Family is writing the files to trigger a Lambda using an Event Notification.
Use this Boto3 code in the Lambda. It will tag the file with the principal that placed the file in S3. If it is Transfer Family, then it is the role that was assigned to Transfer Family to write the files to the bucket. If it is a user uploading the files via the console, then it will be that user's role.
import boto3
import json
import urllib.parse

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    principal = event['Records'][0]['userIdentity']['principalId']

    try:
        s3 = boto3.client('s3')
        response = s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={
                'TagSet': [
                    {
                        'Key': 'Principal',
                        'Value': str(principal)
                    },
                ]
            }
        )
    except Exception as e:
        print('Error {}.'.format(e))
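As a usage note, a downstream consumer can read the tag back to tell how the object arrived. A short hedged sketch; the bucket and key are placeholders.

import boto3

s3 = boto3.client('s3')

# Placeholder bucket and key
tags = s3.get_object_tagging(Bucket='my-transfer-bucket', Key='incoming/file.csv')

# Pull out the 'Principal' tag written by the Lambda above, if present
principal = next(
    (t['Value'] for t in tags['TagSet'] if t['Key'] == 'Principal'),
    None
)
print('Uploaded by principal:', principal)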
I'm an AWS newbie trying to use Textract API, their OCR service.
As far as I understood I need to upload files to a S3 bucket and then run textract on it.
I have the bucket created and the file inside it, and I have set up the permissions.
But when I run my code it fails.
import boto3
import trp
# Document
s3BucketName = "textract-console-us-east-1-057eddde-3f44-45c5-9208-fec27f9f6420"
documentName = "ok0001_prioridade01_x45f3.pdf"
# Amazon Textract client
textract = boto3.client('textract',
                        region_name="us-east-1",
                        aws_access_key_id="xxxxxx",
                        aws_secret_access_key="xxxxxxxxx")

# Call Amazon Textract
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])
Here is the error I get:
botocore.errorfactory.InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the AnalyzeDocument operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.
What am I missing? How could I solve that?
You are missing an S3 access policy; you should add the AmazonS3ReadOnlyAccess policy if you want a quick solution for your needs.
A good practice is to apply the least-privilege principle and grant access only when needed. So I'd advise you to create a specific policy that grants access to your S3 bucket textract-console-us-east-1-057eddde-3f44-45c5-9208-fec27f9f6420 only, and only in the us-east-1 region.
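A hedged sketch of what such a scoped-down policy document could look like, written here as a Python dict; it grants read access to that one bucket only and is an illustration, not the exact policy you need.

# Minimal read-only access to the single Textract input bucket (illustrative only)
BUCKET = "textract-console-us-east-1-057eddde-3f44-45c5-9208-fec27f9f6420"

textract_s3_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*"
        }
    ]
}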
Amazon Textract currently supports PNG, JPEG, and PDF formats. Looks like you are using PDF.
Once you have a valid format, you can use the Python S3 API to read the object's data from the S3 bucket. Once you have read the object, you can pass the byte array to the analyze_document method. To see a full example of how to use the AWS SDK for Python (Boto3) with Amazon Textract to detect text, form, and table elements in document images, see:
https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/python/example_code/textract/textract_wrapper.py
Try following that code example to see if your issue is resolved.
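A hedged sketch of that approach; the bucket and document names are the ones from the question, and credentials are assumed to be configured in the environment.

import boto3

s3BucketName = "textract-console-us-east-1-057eddde-3f44-45c5-9208-fec27f9f6420"
documentName = "ok0001_prioridade01_x45f3.pdf"

s3 = boto3.client('s3', region_name="us-east-1")
textract = boto3.client('textract', region_name="us-east-1")

# Read the document bytes from S3 ...
obj = s3.get_object(Bucket=s3BucketName, Key=documentName)
document_bytes = obj['Body'].read()

# ... and pass them directly to Textract instead of an S3Object reference
response = textract.analyze_document(
    Document={'Bytes': document_bytes},
    FeatureTypes=["TABLES"])

Note that the synchronous analyze_document call has document size and format limits; for multi-page PDFs, the asynchronous start_document_analysis API is the usual route.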
"Could you provide some clearance on the params to use"
I just ran the Java V2 example and it works perfectly. In this example, I am using a PNG file located in a specific Amazon S3 bucket.
Make sure that when implementing this in Python, you set the same parameters (the S3 bucket name and the document name).
I am trying to unload the results of a particular query in Snowflake to an S3 location directly.
copy into 's3://bucket-name/folder/text.csv'
from <Some SQL Query>
file_format = (type = CSV file_extension = '.csv' field_optionally_enclosed_by = NONE empty_field_as_null = false)
max_file_size = 5000000000
storage_integration = aws
single = true;
The problem with this is that after the write is successful, the bucket owner cannot read the new file from S3 because of the ACL.
So, how do you add the canned ACL of "bucket-owner-full-control" while writing to S3 from Snowflake? And I am not very familiar with Google Cloud Storage; what would the scenario be for GCS buckets?
You might not be able to add a canned ACL to your COPY INTO statement; however, what you can do is add the required parameter to the Storage Integration.
When you create your Storage Integration or if you have to update it, please add this to the statement.
STORAGE_AWS_OBJECT_ACL = 'bucket-owner-full-control'
This should ensure that whatever data you unload to a bucket from Snowflake will let the bucket owner have full control over the object.
https://docs.snowflake.com/en/user-guide/data-unload-s3.html#configuring-support-for-amazon-s3-access-control-lists-optional
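A hedged sketch of applying that parameter from Python via the Snowflake connector; the connection details and the integration name (aws, as in the COPY INTO above) are placeholders, and altering an integration requires a sufficiently privileged role.

import snowflake.connector

# Placeholder connection details
conn = snowflake.connector.connect(
    account='my_account',
    user='my_user',
    password='my_password',
    role='ACCOUNTADMIN',  # a role allowed to alter storage integrations
)

try:
    cur = conn.cursor()
    # Add the canned ACL to the existing storage integration used by the COPY INTO
    cur.execute(
        "ALTER STORAGE INTEGRATION aws "
        "SET STORAGE_AWS_OBJECT_ACL = 'bucket-owner-full-control'"
    )
finally:
    conn.close()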