I have a large list of objects in a source S3 bucket and I want to selectively copy a subset of them into a destination bucket.
As per the doc here it seems it's possible with TransferManager.copy(from_bucket, from_key, to_bucket, to_key); however, I would need to do it one object at a time.
Is anyone aware of other ways, preferably to copy in a batched fashion instead of calling copy() for each object?
If you wish to copy a whole directory, you could use the AWS Command-Line Interface (CLI):
aws s3 cp --recursive s3://source-bucket/folder/ s3://destination-bucket/folder/
However, since you wish to selectively copy files, there's no easy way to indicate which files to copy (unless they all have the same prefix).
Frankly, when I need to copy selective files, I actually create an Excel file with a list of filenames. Then, I create a formula like this:
="aws s3 cp s3://source-bucket/"&A1&" s3://destination-bucket/"
Then just use Fill Down to replicate the formula. Finally, copy the commands and paste them into a Terminal window.
If you are asking whether there is a way to programmatically copy multiple objects between buckets using one API call, then the answer is no, this is not possible. Each API call will only copy one object. You can, however, issue multiple copy commands in parallel to make things go faster.
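For example, here is a minimal sketch (not your exact setup) of fanning the copies out over a thread pool with the AWS SDK for Java v1; the bucket names and key list are placeholders:

ExecutorService pool = Executors.newFixedThreadPool(10);
AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
List<String> keysToCopy = Arrays.asList("a.png", "b.png"); // placeholder key list

List<Future<?>> futures = new ArrayList<>();
for (String key : keysToCopy) {
    // each task performs one server-side copy
    futures.add(pool.submit(() ->
            s3.copyObject("source-bucket", key, "destination-bucket", key)));
}
for (Future<?> f : futures) {
    try {
        f.get(); // block until this copy finishes; throws if it failed
    } catch (InterruptedException | ExecutionException e) {
        System.err.println("Copy failed: " + e.getMessage());
    }
}
pool.shutdown();

The TransferManager.copy() you mentioned also returns immediately and runs the copy on its own thread pool, so starting several copies and then calling waitForCompletion() on each achieves much the same effect.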
I think it's possible via the S3 console, but there's no such option in the SDK. Although this isn't a batched solution, the script below selectively copies objects one at a time; if you keep the list of file names in an external file, it's just a matter of reading them in.
ArrayList<String> filesToBeCopied = new ArrayList<String>();
filesToBeCopied.add("sample.svg");
filesToBeCopied.add("sample.png");

String from_bucket_name = "bucket1";
String to_bucket = "bucket2";

BasicAWSCredentials creds = new BasicAWSCredentials("<key>", "<secret>");
final AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withRegion(Regions.AP_SOUTH_1)
        .withCredentials(new AWSStaticCredentialsProvider(creds))
        .build();

// Note: listObjectsV2 returns at most 1000 keys per call; for larger buckets,
// loop using result.getNextContinuationToken().
ListObjectsV2Result result = s3.listObjectsV2(from_bucket_name);
List<S3ObjectSummary> objects = result.getObjectSummaries();

try {
    for (S3ObjectSummary os : objects) {
        String bucketKey = os.getKey();
        if (filesToBeCopied.contains(bucketKey)) {
            s3.copyObject(from_bucket_name, bucketKey, to_bucket, bucketKey);
        }
    }
} catch (AmazonServiceException e) {
    System.err.println(e.getErrorMessage());
    System.exit(1);
}
I am trying to copy data from one S3 folder to another within the same bucket. I am using the copyObject function from the AmazonS3 class. I don't see any errors or exceptions, and I do get a result, but the file is not copied. I would at least expect some error if there is a failure. What am I doing wrong?
How do I know the actual error?
AmazonS3 s3client = new AmazonS3Client(new ProfileCredentialsProvider());
CopyObjectRequest copyObjRequest = new CopyObjectRequest(
sourceURI.getBucket(), sourceURI.getKey(), destinationURI.getBucket(), destinationURI.getKey());
CopyObjectResult copyResult = s3client.copyObject(copyObjRequest);
I have proper values in the source and destination URIs. Is it because of credentials? If it is not working because of missing credentials, I would expect an error from this code.
I suspect that you have incorrectly specified the destination of the copy.
The destination is a complete bucket and key. For example, if you are copying dogs/beagle.png to smalldogs/beagle.png, then it is not sufficient to specify the destination key as smalldogs/. That's how copies work on a regular file system like NTFS or NFS, but not how they work in object storage. What it results in instead is an object named smalldogs/ that appears to be a folder, but is actually a copy of the beagle image.
So, delete the errant object that you created and then specify the destination fully and correctly.
Note that the operation here is copy. If you want 'move' then you need to delete the source afterwards.
Here is some code based on yours that runs fine for me and results in the correct copied file:
String bkt = "my-bucket";
String src = "dogs/beagle.png";
String dst = "smalldogs/beagle.png";
AmazonS3Client s3 = new AmazonS3Client();
CopyObjectRequest req = new CopyObjectRequest(bkt, src, bkt, dst);
CopyObjectResult res = s3.copyObject(req);
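If a 'move' is what you are after, a minimal sketch (building on the variables above) is to delete the source key once the copy succeeds:

// To emulate a "move", delete the source object after the copy completes.
s3.copyObject(new CopyObjectRequest(bkt, src, bkt, dst));
s3.deleteObject(bkt, src); // only reached if copyObject did not throw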
I'm doing what I think is a very simple thing to check that alpakka is working:
val awsCreds = AwsBasicCredentials.create("xxx", "xxx")
val credentialsProvider = StaticCredentialsProvider.create(awsCreds)
implicit val staticCreds = S3Attributes.settings(S3Ext(context.system).settings.withCredentialsProvider(credentialsProvider)
.withS3RegionProvider(new AwsRegionProvider {val getRegion: Region = Region.US_EAST_2}))
val d = S3.checkIfBucketExists(action.bucket)
d foreach { msg => log.info("msgs: " + msg.toString)}
When I run this I get
msgs: NotExists
But the bucket referred to by action.bucket does exist, and I can access it using these credentials. What's more, when I modify the credentials (by changing the secret key), I get the same message. What I should get, according to the documentation, is AccessDenied.
I got to this point because I didn't think the environment was picking up on the right credentials - hence all the hard-coded values. But now I don't really know what could be causing this behavior.
Thanks
Update: The action object is just a case class consisting of a bucket and a path. I've checked in debug that action.bucket and action.path point to the things they should be - in this case an S3 bucket. I've also tried the above code with just the string bucket name in place of action.bucket.
Just my carelessness...
An errant copy-paste added an extra implicit actor system to the mix. Some changes were made to implicit materializers in Akka 2.6, and I think those, along with the extra implicit actor system, made for a weird mix.
I have a question about a Google Cloud Function triggered by an event on a storage bucket (I'm developing it in Python).
I have to read the data of the file that was just finalized (a PDF file) in the bucket that triggered the event. I was looking for the file payload on the event object passed to my function (data, context), but it seems there is no payload on that object.
Do I have to use the Cloud Storage library to get the file from the bucket? Is there a way to get the payload directly from the context of the triggered function?
Enrico
From checking the more complete example in the Firebase documentation, it indeed seems that the payload of the file is not included in the parameters. That makes sense, since there's no telling how big the finalized file is, or whether it would even fit in the memory of your Functions runtime.
So you will indeed have to grab the file from the bucket with a separate call, based on the information in the event metadata. The full Firebase example grabs the file name and other info from its context/data with:
exports.generateThumbnail = functions.storage.object().onFinalize(async (object) => {
  const fileBucket = object.bucket; // The Storage bucket that contains the file.
  const filePath = object.name; // File path in the bucket.
  const contentType = object.contentType; // File content type.
  const metageneration = object.metageneration; // Number of times metadata has been generated. New objects have a value of 1.
  ...
I'll see if I can find a more complete example. But I'd expect it to work similarly on raw Google Cloud Functions, which Firebase wraps, even when using Python.
Update: from looking at this Storage/Function/PubSub documentation that the Python binding is apparently based on, it looks like the path should be available as data['resource'] or as data['name'].
I am trying to enable versioning and lifecycle policies on my Amazon S3 buckets. I understand that it is possible to enable Versioning first and then apply a Lifecycle policy to that bucket. If you see the image below, that will confirm this idea.
I have then uploaded a file several times, which created several versions of the same file. I then deleted the file and am still able to see several versions. However, if I try to restore the file, I see that the Initiate Restore option is greyed out.
I would like to ask anyone who has had a similar issue to let me know what I am doing wrong.
Thanks,
Bucket Versioning on Amazon S3 keeps all versions of objects, even when they are deleted or when a new object is uploaded under the same key (filename).
As per your screenshot, all previous versions of the object are still available. They can be downloaded/opened in the S3 Management Console by selecting the desired version and choosing Open from the Actions menu.
If Versions: Hide is selected, then each object only appears once. Its contents are equal to the latest uploaded version of the object.
Deleting an object in a versioned bucket merely creates a Delete Marker as the most recent version. This makes the object appear as though it has been deleted, but the prior versions are still visible if you click the Versions: Show button at the top of the console. Deleting the Delete Marker will make the object reappear and the contents will be the latest version uploaded (before the deletion).
If you want a specific version of the object to be the "current" version, either:
Delete all versions since that version (making the desired version the latest version), or
Copy the desired version back to the same object (using the same key, which is the filename). This will add a new version, but the contents will be equal to the version you copied. The copy can be performed in the S3 Management Console -- just choose Copy and then Paste from the Actions Menu.
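If you would rather do this programmatically than through the console, here is a minimal sketch with the AWS SDK for Java v1; the bucket, key, and version IDs are placeholders you would look up yourself (e.g. via listVersions()):

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
String bucket = "my-bucket";          // placeholder
String key = "reports/file.pdf";      // placeholder
String deleteMarkerVersionId = "..."; // version ID of the delete marker (isDeleteMarker() && isLatest())
String desiredVersionId = "...";      // older version whose contents you want back

// Option 1: remove the delete marker so the object "reappears".
s3.deleteVersion(bucket, key, deleteMarkerVersionId);

// Option 2: copy a specific older version onto the same key, making it the current version.
s3.copyObject(new CopyObjectRequest(bucket, key, bucket, key)
        .withSourceVersionId(desiredVersionId));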
Initiate Restore is used with Amazon Glacier, which is an archival storage system. This option is not relevant unless you have created a Lifecycle Policy to move objects to Glacier.
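If objects really have transitioned to Glacier through a lifecycle rule, a restore can also be initiated from code; a minimal sketch with a reasonably recent AWS SDK for Java v1, assuming a 7-day restore window and placeholder names:

// Ask S3 to make a temporary, readable copy of a Glacier-archived object for 7 days.
AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
s3.restoreObjectV2(new RestoreObjectRequest("my-bucket", "archived/report.pdf", 7));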
With the new console, you can do it as follows.
Click on the Deleted Objects button
You will see your deleted object below; select it
Click on More -> Undo delete
If you have a lot of deleted files to restore, you might want to use a script to do the job for you.
The script should
Get the versions of objects in your bucket using the Get object versions API
Inspect the versions data to get each Delete Marker's (i.e. deleted object's) name and version ID
Delete the markers found, using those names and version IDs, with the Delete object API
Python example with boto:
This example script deletes the delete markers it finds one by one.
#!/usr/bin/env python
import boto

BUCKET_NAME = "examplebucket"
DELETE_DATE = "2015-06-08"

bucket = boto.connect_s3().get_bucket(BUCKET_NAME)
for v in bucket.list_versions():
    if (isinstance(v, boto.s3.deletemarker.DeleteMarker) and
            v.is_latest and
            DELETE_DATE in v.last_modified):
        bucket.delete_key(v.name, version_id=v.version_id)
Python example with boto3:
However, if you have thousands of objects, this could be a slow process. AWS does provide a way to batch delete objects with a maximum batch size of 1000.
The following example script searches your objects under a prefix, tests whether they are deleted (i.e. the current version is a delete marker), and then batch deletes them. It is set to list 500 object versions per request and to delete the markers in batches of no more than 1000 objects.
import boto3

client = boto3.client('s3')


def get_object_versions(bucket, prefix, max_key, key_marker):
    kwargs = dict(
        Bucket=bucket,
        EncodingType='url',
        MaxKeys=max_key,
        Prefix=prefix
    )
    if key_marker:
        kwargs['KeyMarker'] = key_marker
    response = client.list_object_versions(**kwargs)
    return response


def get_delete_markers_info(bucket, prefix, key_marker):
    markers = []
    max_markers = 500
    version_batch_size = 500
    while True:
        response = get_object_versions(bucket, prefix, version_batch_size, key_marker)
        key_marker = response.get('NextKeyMarker')
        delete_markers = response.get('DeleteMarkers', [])
        markers = markers + [dict(Key=x.get('Key'), VersionId=x.get('VersionId'))
                             for x in delete_markers if x.get('IsLatest')]
        print('{0} -- {1} delete markers ...'.format(key_marker, len(markers)))
        if len(markers) >= max_markers or key_marker is None:
            break
    return {"delete_markers": markers, "key_marker": key_marker}


def delete_delete_markers(bucket, prefix):
    key_marker = None
    while True:
        info = get_delete_markers_info(bucket, prefix, key_marker)
        key_marker = info.get('key_marker')
        delete_markers = info.get('delete_markers', [])
        if len(delete_markers) > 0:
            response = client.delete_objects(
                Bucket=bucket,
                Delete={
                    'Objects': delete_markers,
                    'Quiet': True
                }
            )
            print('Deleting {0} delete markers ... '.format(len(delete_markers)))
            print('Done with status {0}'.format(response.get('ResponseMetadata', {}).get('HTTPStatusCode')))
        else:
            print('No more delete markers found\n')
            break


delete_delete_markers(bucket='data-global', prefix='2017/02/18')
I have realised that I can perform an Initiate Restore operation once the object is stored on Glacier, as shown by the Storage Class of the object. To restore a previous copy on S3, the Delete Marker on the current object has to be removed.