I am trying to unload the results of a particular query in Snowflake to an S3 location directly.
copy into 's3://bucket-name/folder/text.csv'
from <Some SQL Query>
file_format = (type = CSV file_extension = '.csv' field_optionally_enclosed_by = NONE empty_field_as_null = false)
max_file_size = 5000000000
storage_integration = aws
single = true;
The problem is that after the write succeeds, the bucket owner cannot read the new file from S3 because of the ACL.
So, how do you add the canned ACL "bucket-owner-full-control" while writing to S3 from Snowflake? I am not very familiar with Google Cloud Storage; what would the equivalent be for GCS buckets?
You may not be able to add a canned ACL to the COPY INTO statement itself; however, you can add the required parameter to the Storage Integration.
When you create your Storage Integration, or if you have to alter it, add this to the statement:
STORAGE_AWS_OBJECT_ACL = 'bucket-owner-full-control'
This should ensure that any data you unload from Snowflake to the bucket gives the bucket owner full control over the objects.
https://docs.snowflake.com/en/user-guide/data-unload-s3.html#configuring-support-for-amazon-s3-access-control-lists-optional
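For example, here is a minimal sketch of applying this to the existing integration from the question, run through the Snowflake Python connector (the integration name aws is taken from your storage_integration parameter; the connection details are placeholders, and any SQL client would work just as well):

import snowflake.connector

# Placeholder connection details -- replace with your account, user and auth method.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
)

# Add the object ACL parameter to the existing integration.
# Subsequent unloads through this integration should be written with the
# bucket-owner-full-control canned ACL.
conn.cursor().execute(
    "ALTER STORAGE INTEGRATION aws "
    "SET STORAGE_AWS_OBJECT_ACL = 'bucket-owner-full-control'"
)
conn.close()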
Related
I am planning to use the AWS Python SDK (Boto3) to copy files from one bucket to another. Below is the sample code I got from the AWS documentation:
dest_object.copy_from(CopySource={
    'Bucket': self.object.bucket_name,
    'Key': self.object.key
})
My question is how do I trigger this code and where should I deploy this code?
I originally thought of a Lambda function, but I am looking for alternative options in case Lambda times out for larger files (1 TB etc.).
Can I use Airflow to trigger this code somehow, or maybe invoke it through Lambda? Looking for suggestions from AWS experts.
The easiest way to copy new files to another bucket is to use Amazon S3 Replication. It will automatically copy new objects to the selected bucket, no code required.
However, this will not meet your requirement of deleting the incoming file after it is copied. Therefore, you should create an AWS Lambda function and add an S3 trigger. This will trigger the Lambda function whenever an object is created in the bucket.
The Lambda function should:
Extract the bucket and object name from the event parameter
Copy the object to the target bucket
Delete the original object
The code would look something like:
import boto3
import urllib.parse

TARGET_BUCKET = 'target_bucket'  # Change this

def lambda_handler(event, context):
    s3_resource = boto3.resource('s3')

    # Loop through each incoming object
    for record in event['Records']:

        # Get incoming bucket and key
        source_bucket = record['s3']['bucket']['name']
        source_key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # Copy object to different bucket
        copy_source = {
            'Bucket': source_bucket,
            'Key': source_key
        }
        s3_resource.Bucket(TARGET_BUCKET).Object(source_key).copy(copy_source)

        # Delete original object
        s3_resource.Bucket(source_bucket).Object(source_key).delete()
The copy process is unlikely to approach the 15-minute limit of AWS Lambda, but it is worth testing on large objects.
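If you do need to handle very large objects (the question mentions files around 1 TB), note that the managed copy() call performs a multipart copy on the S3 side rather than streaming the data through Lambda, and it can be tuned with a TransferConfig. A rough sketch, with illustrative (not recommended) values:

import boto3
from boto3.s3.transfer import TransferConfig

s3_resource = boto3.resource('s3')

# Illustrative tuning for very large objects: bigger parts and more
# parallel part-copies. Adjust to your workload.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,   # 100 MB
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=10,
)

copy_source = {'Bucket': 'source-bucket', 'Key': 'large-object.parquet'}  # placeholders
s3_resource.Bucket('target-bucket').Object('large-object.parquet').copy(copy_source, Config=config)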
I am trying to restore some objects in an S3 bucket from Glacier to the Standard tier permanently. Below is my code:
import boto3

def restoreObject(latest_object):
    s3_resource = boto3.resource('s3')
    my_bucket = s3_resource.Bucket('bucket_name')
    for bucket_object in my_bucket.objects.all():
        object_key = bucket_object.key
        # S3 reports Deep Archive objects with the storage class "DEEP_ARCHIVE"
        if bucket_object.storage_class == "DEEP_ARCHIVE":
            if object_key in latest_object:
                bucket_object.Object().restore_object(
                    RestoreRequest={'Days': 30, 'Tier': 'Standard'})
But this restores the bundle for a particular time only (30 days in my case)
Is there a way to restore bundles permanently from Glacier Deep Archive to the Standard tier?
To permanently change the Storage Class of an object in Amazon S3, either:
Write code to CopyObject with the same Key (overwriting itself) while specifying the new Storage Class, or
Transition objects using Amazon S3 Lifecycle to let S3 do it automatically
However, Lifecycle policies do not seem to support going from Glacier to Standard tiers.
Therefore, you would need to copy the object to itself to change the storage class:
copy_source = {'Bucket': 'bucket_name', 'Key': object_key}
my_bucket.copy(copy_source, object_key, ExtraArgs = {'StorageClass': 'STANDARD','MetadataDirective': 'COPY'})
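Note that an object in Glacier Deep Archive has to be restored (as in your existing code) before it can be copied. Here is a rough sketch that checks the restore status with HeadObject before copying the object onto itself (bucket and key names are placeholders):

import boto3

s3 = boto3.resource('s3')
bucket_name = 'bucket_name'      # placeholder
object_key = 'some/object/key'   # placeholder

# The Restore header only appears after a restore has been requested;
# ongoing-request="false" means the temporary copy is ready to read.
head = s3.meta.client.head_object(Bucket=bucket_name, Key=object_key)
if 'ongoing-request="false"' in head.get('Restore', ''):
    copy_source = {'Bucket': bucket_name, 'Key': object_key}
    s3.Bucket(bucket_name).copy(
        copy_source, object_key,
        ExtraArgs={'StorageClass': 'STANDARD', 'MetadataDirective': 'COPY'},
    )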
Here's a nice example: aws s3 bucket change storage class by object size · GitHub
I have created an S3 mock service in my code base.
// Create an S3 client
S3Client client = S3Client.builder()
        .serviceConfiguration(configuration)
        .credentialsProvider(AnonymousCredentialsProvider.create())
        .region(Region.of("us-west-2"))
        .endpointOverride(new URI("http://localhost:8001"))
        .build();

// Create an S3 bucket with the given bucket name.
CreateBucketResponse createBucketResponse = client.createBucket(
        CreateBucketRequest.builder().bucket(<BucketName>).build());

// Verify that the bucket has been created.
// If the bucket was not created, the following lines will throw a NoSuchBucketException.
HeadBucketRequest headBucketRequest = HeadBucketRequest.builder()
        .bucket(<BucketName>)
        .build();
HeadBucketResponse headBucketResponse = client.headBucket(headBucketRequest);

// Update the versioning status of the created bucket to Enabled.
PutBucketVersioningRequest versioningRequest = PutBucketVersioningRequest.builder()
        .bucket(<BucketName>)
        .versioningConfiguration(
                VersioningConfiguration.builder().status(BucketVersioningStatus.ENABLED).build()
        ).build();
PutBucketVersioningResponse result = client.putBucketVersioning(versioningRequest);

// Check whether the bucket versioning status is enabled.
BucketVersioningStatus bucketVersioningStatus = client.getBucketVersioning(
        GetBucketVersioningRequest.builder().bucket(<BucketName>).build()).status();
Note: the bucketVersioningStatus is null/empty for the above call. Against the real cloud, creating the bucket and setting the versioning status gives me proper results, but with the mock S3 client I am not getting the same expected outcome.
I am not sure if I have used the PutBucketVersioningRequest appropriately.
NOTE: I am using the S3 Client V2 and not the AmazonS3Client
Kindly help me in identifying the root cause of the issue.
At the moment I am facing a problem: I can't determine whether a file was PUT via AWS Transfer Family or via the S3 GUI.
Is there any way to tag files by default when they are PUT on S3 via AWS Transfer Family?
Regards
Ribase
The Transfer Family user guide describes, in its section on post-upload processing, S3 object metadata that indicates Transfer Family uploaded the object.
One use case for this metadata is an SFTP user with an inbox and an outbox. For the inbox, objects are put by an SFTP client; for the outbox, objects are put by the post-upload processing pipeline. When an S3 event notification fires, the downstream service on the processor side can make an S3 HeadObject call, dismiss the event if the metadata is not present, and only process incoming files.
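As a rough sketch of that HeadObject check in Python (the exact metadata key and value that Transfer Family writes should be confirmed against the user guide; the names below are assumptions based on its post-upload-processing section):

import boto3

s3 = boto3.client('s3')

def was_uploaded_by_transfer_family(bucket, key):
    """Return True if the object's metadata looks like it was written by Transfer Family."""
    response = s3.head_object(Bucket=bucket, Key=key)
    metadata = response.get('Metadata', {})
    # Assumed key/value pair; verify against the Transfer Family user guide.
    return metadata.get('user-agent') == 'AWSTransfer'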
You could also use Transfer Family managed workflows to apply a Tag step. An example of the Tag step can be found in demo 1 of the AWS Transfer Family managed workflows demo video.
Configure the S3 bucket where Transfer Family is writing the files to trigger a Lambda using an Event Notification.
Use this Boto3 code in the Lambda. It will tag the file with the principal that placed the file in S3. If it is Transfer Family, then it is the role that was assigned to Transfer Family to write the files to the bucket. If it is a user uploading the files via the Console, then it will be that user's role.
import boto3
import json
import urllib.parse

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    principal = event['Records'][0]['userIdentity']['principalId']

    try:
        s3 = boto3.client('s3')
        response = s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={
                'TagSet': [
                    {
                        'Key': 'Principal',
                        'Value': str(principal)
                    },
                ]
            }
        )
    except Exception as e:
        print('Error {}.'.format(e))
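A downstream consumer could then read that tag back to decide how the object arrived, for example (bucket and key are placeholders):

import boto3

s3 = boto3.client('s3')

# Read the Principal tag written by the Lambda above.
response = s3.get_object_tagging(Bucket='my-bucket', Key='incoming/file.csv')
principal = next(
    (tag['Value'] for tag in response['TagSet'] if tag['Key'] == 'Principal'),
    None,
)
print(principal)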
I'm using Databricks - python runtime 8.2.
There are two S3 buckets:
Remote S3 - belongs to someone else; I have access ONLY to bucketname/my_prefix
Local S3 - belongs to me; I have full access to the bucket policy (I can mount it to DBFS, for example, and copy files from DBFS to the local S3)
My mission is to copy LOTS of files from the remote S3, which I can't mount, to the local S3.
From Databricks Mount S3 using AWS access keys:
Important
When you mount an S3 bucket using keys, all users have read and write access to all the objects in the S3 bucket.
I have tried to use the boto3 library but I can't seem to have much luck.
Essentially, is there a way to copy files from remote S3 to dbfs?
Here is one version of the code I am using - I have tried SO many other ways.
import boto3

s3 = boto3.resource('s3')

# Test if the list of days to copy is not empty; loop through each
# day and copy the S3 folder contents
if len(days_to_copy) != 0:
    for day in days_to_copy:
        # Convert the items in the list from date to string with format yyyy/mm/dd
        day_to_copy = day.strftime('%Y/%m/%d')
        print(day_to_copy)

        source_bucket = "Remote-S3"
        source_key = "myclient/2021/10/06/part-000d339cb-c000.snappy.parquet"
        target_bucket = "Local-s3"
        target_key = "{}/".format(day_to_copy)

        copy_source = {
            'Bucket': source_bucket,
            'Key': source_key
        }
        copy_target = {
            'Bucket': target_bucket,
            'Key': target_key
        }
        print('copy_source', copy_source)
        print('copy_target', copy_target)

        s3.meta.client.copy(copy_source, 'Local-s3', '2021/10/06/part-000d339cb-c000.snappy.parquet')
This code gives this error:
ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
Thanks.