I'm trying to restore files that a lifecycle rule transitioned to Glacier Deep Archive. When I try to restore them to a different directory with the command below in the AWS CLI, it throws an error after downloading a few files.
Command used to restore the directory:
aws s3 cp s3://xxxxxxx/cf-ant-prod/year=2020/ s3://xxxxxxxx/atest/ --force-glacier-transfer --storage-class STANDARD --recursive --profile mfa
Error: An error occurred (InvalidObjectState) when calling the CopyObject operation: Operation is not valid for the source object's storage class
As mentioned on your other question, the --force-glacier-transfer parameter does not restore objects stored in Glacier. It is simply a way to avoid warning notices.
To retrieve from Glacier Deep Archive you will need to:
Use restore-object to change the Storage Class to Standard or Standard-IA -- this will take some time to restore
Copy the file to your desired location
It is not possible to do an instant restore or a Restore+Copy.
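For example, a minimal sketch of the two steps against the bucket and prefix from the question (the key name is made up for illustration):
aws s3api restore-object --bucket xxxxxxx --key "cf-ant-prod/year=2020/example.csv" --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}' --profile mfa
Once head-object shows a Restore field with ongoing-request="false", the copy can proceed (keeping --force-glacier-transfer so the CLI does not skip the object, whose storage class is still DEEP_ARCHIVE):
aws s3api head-object --bucket xxxxxxx --key "cf-ant-prod/year=2020/example.csv" --profile mfa
aws s3 cp s3://xxxxxxx/cf-ant-prod/year=2020/example.csv s3://xxxxxxxx/atest/ --force-glacier-transfer --profile mfa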
As mentioned by John Rotenstein - it appears a simple restore of an object from Glacier must be done "in place" and once restored it can be manipulated (copied) as needed.
I was attempting to do something similar to the question topic via Lambda, and I struggled for a while because I found the documentation murky about the fact that restoreObject() requests are either an SQL Select object restoration OR a simple single-object restore... and, most significantly, about which parameters apply to which operational mode.
My goal was to restore an object out of Glacier to a new location/file name in the same bucket. The documentation strongly suggests that this is possible because there are parameters within OutputLocation that allow the BucketName and Prefix to be specified... but as it turns out, those parameters only apply to SQL Select object restorations.
The confusing part for me was the parameter documentation for the restoreObject() method: there isn't sufficient differentiation to know that you can't, for example, provide the Description parameter when making a simple restore request using the GlacierJobParameters parameter... What was frustrating was that I would get errors such as:
MalformedXML: The XML you provided was not well-formed or did not validate against our published schema
There was no indication as to where the published schema is located, and Googling for it yielded no results that seemed to apply to the S3 API... my hope was that I could get out of the API documentation and refer directly to the "published schema" (published where, and how?).
My suggestion would be that the documentation for the restoreObject() method be improved, and/or that the method be split into a simpleRestoreObject() and an sqlRestoreObject() so that the parameter schemas are cleanly distinct.
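For anyone hitting the same wall, here is my understanding of the two distinct request shapes expressed as aws s3api calls (the bucket, keys, and the SQL expression are placeholders): a simple restore only accepts Days and GlacierJobParameters, while Description, SelectParameters, and OutputLocation belong to the SELECT variant.
aws s3api restore-object --bucket my-bucket --key my-archived-object --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'
aws s3api restore-object --bucket my-bucket --key my-archived-data.csv --restore-request '{"Type": "SELECT", "Tier": "Standard", "Description": "hypothetical select restore", "SelectParameters": {"InputSerialization": {"CSV": {}}, "ExpressionType": "SQL", "Expression": "SELECT * FROM S3Object", "OutputSerialization": {"CSV": {}}}, "OutputLocation": {"S3": {"BucketName": "my-bucket", "Prefix": "restored/"}}}'
Mixing fields from the two shapes is what produced the MalformedXML error above for me.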
Restoring objects from S3 Glacier Deep Archive (or Glacier, for that matter) must be done individually, and before copying those objects to some other location.
One way to accomplish this is to first retrieve the list of objects in the desired folder using aws s3 ls, for example:
aws s3 ls s3://xxxxxxx/cf-ant-prod/year=2020/ --recursive
and, using each of those object names, running a restore command individually:
aws s3api restore-object --bucket xxxxxxx --key <keyName> --restore-request Days=7
This will initiate a standard restore request for each object, so expect this to take 12-24 hours. Then, once the restores are complete, you are free to copy those objects using your above syntax.
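If it helps, a rough sketch of driving the per-object restore from the listing (it assumes the keys contain no whitespace and reuses the profile from the question):
aws s3api list-objects-v2 --bucket xxxxxxx --prefix "cf-ant-prod/year=2020/" --query 'Contents[].Key' --output text --profile mfa | tr '\t' '\n' | while read -r key; do
  # issue a standard restore request for each archived object
  aws s3api restore-object --bucket xxxxxxx --key "$key" --restore-request Days=7 --profile mfa
done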
Another option would be to use a tool such as s3cmd, which supports recursive restores given a bucket and folder. However, you'll still have to wait for the restore requests to complete before running a cp command.
Related
We are using an S3 bucket to store a growing number of small JSON files (~1KB each) that contain some build-related data. Part of our pipeline involves copying these files from S3 and putting them into memory to do some operations.
That copy operation is done via an S3 CLI command that looks something like this:
aws s3 cp s3://bucket-path ~/some/local/path/ --recursive --profile dev-profile
The problem is that the number of json files on S3 is getting pretty large since more are being made every day. It's nothing even close to the capacity of the S3 bucket since the files are so small. However, in practical terms, there's no need to copy ALL these JSON files. Realistically the system would be safe just copying the most recent 100 or so. But we do want to keep older ones around for other purposes.
So my question boils down to: is there a clean way to copy a specific number of files from S3 (maybe sorted by most recent)? Is there some kind of pruning policy we can set on an S3 bucket to delete files older than X days or something?
The aws s3 sync command in the AWS CLI sounds perfect for your needs.
It will copy only files that are New or Modified since the last sync. However, it means that the destination will need to retain a copy of the 'old' files so that they are not copied again.
Alternatively, you could write a script (eg in Python) that lists the objects in S3 and then only copies objects added since the last time the copy was run.
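As a hedged sketch of both ideas (the bucket and path names are placeholders, and keys are assumed to contain no whitespace): a plain sync, and a CLI-only variant of such a script that copies just the 100 most recently modified objects by sorting on LastModified:
aws s3 sync s3://bucket-path ~/some/local/path/ --profile dev-profile
aws s3api list-objects-v2 --bucket my-build-bucket --query 'sort_by(Contents, &LastModified)[-100:].Key' --output text --profile dev-profile | tr '\t' '\n' | while read -r key; do
  # copy each of the 100 newest objects to the local path
  aws s3 cp "s3://my-build-bucket/$key" ~/some/local/path/ --profile dev-profile
done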
You can set lifecycle policies on the S3 bucket, which will remove objects after a certain period of time.
To copy only objects of a certain age, you will need to write a script.
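As for the lifecycle policy, a minimal sketch via the CLI, assuming the JSON files live under a builds/ prefix and should expire after 30 days (both the bucket name and the prefix are placeholders):
aws s3api put-bucket-lifecycle-configuration --bucket my-build-bucket --lifecycle-configuration '{"Rules": [{"ID": "expire-old-build-json", "Filter": {"Prefix": "builds/"}, "Status": "Enabled", "Expiration": {"Days": 30}}]}'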
I want to use Google Transfer to copy all folders/files in a specific directory in Bucket-1 to the root directory of Bucket-2.
I have tried to use Transfer with the filter option, but it doesn't copy anything across.
Any pointers on getting this to work within transfer or step by step for functions would be really appreciated.
I reproduced your issue and it worked for me using gsutil.
For example:
gsutil cp -r gs://SourceBucketName/example.txt gs://DestinationBucketName
Furthermore, I tried copying with the Transfer option and it also worked. These are the steps I followed with Transfer:
1 - Create new Transfer Job
Panel: “Select Source”:
2 - Select your source, for example a Google Cloud Storage bucket
3 - Select your bucket with the data which you want to copy.
4 - On the field “Transfer files with these prefixes” add your data (I used “example.txt”)
Panel “Select destination”:
5 - Select your destination Bucket
Panel “Configure transfer”:
6 - Select "Run now" if you want to complete the transfer now.
7 - Press “Create”.
For more information about copying from one bucket to another, you can check the official documentation.
So, a few things to consider here:
You have to keep in mind that Google Cloud Storage buckets don’t treat subdirectories the way you would expect. To the bucket it is basically all part of the file name. You can find more information about that in the How Subdirectories Work documentation.
This is also the reason why you cannot transfer a file that is inside a "directory" and expect to see only the file's name appear in the root of your target bucket. To give you an example:
If you have a file at gs://my-bucket/my-bucket-subdirectory/myfile.txt, once you transfer it to your second bucket it will still have the subdirectory in its name, so the result will be: gs://my-second-bucket/my-bucket-subdirectory/myfile.txt
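If you do want the file to land at the root of the destination bucket without the subdirectory prefix, one option (a sketch using the same example names) is to name the destination object explicitly:
gsutil cp gs://my-bucket/my-bucket-subdirectory/myfile.txt gs://my-second-bucket/myfile.txt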
This is why, if you are interested in automating this process, you should definitely give the Google Cloud Storage Client Libraries a try.
Additionally, you could also use the GCS Client with Google Cloud Functions. However, I would just suggest this if you really need the Event Triggers offered by GCF. If you just want the transfer to run regularly, for example on a cron job, you could still use the GCS Client somewhere other than a Cloud Function.
The Cloud Storage Tutorial might give you a good example of how to handle Storage events.
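And if a simple scheduled copy is all you need, a low-tech alternative to the client libraries is a cron entry driving gsutil rsync (the bucket names are reused from the example above and the schedule is arbitrary):
# crontab entry: mirror the folder into the second bucket every night at 02:00
0 2 * * * gsutil -m rsync -r gs://my-bucket/my-bucket-subdirectory gs://my-second-bucket/my-bucket-subdirectory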
Also, on your future posts, try to provide as much relevant information as possible. For this post, as an example, it would've been nice to know what file structure you have in your buckets and what output you have been getting. And if you can state your use case straight away, it will also prevent other users from suggesting solutions that don't apply to your needs.
Try this in Cloud Shell in the project:
gsutil cp -r gs://bucket1/foldername gs://bucket2
I have a bash script which iterates over an array of file names and removes them from S3.
The following command:
aws s3 rm "s3://myBucket/myFolder/myFile.txt"
will produce this output.
delete: s3://myBucket/myFolder/myFile.txt
I can see that the delete was successful by verifying it has been removed in the AWS console.
However if I iterate over the same list again, I get the same output even though the file is gone.
Is there any way -- using just the rm command -- of indicating that AWS CLI tried to delete the file but could not find it?
The aws s3 rm command uses the S3 API DeleteObject operation.
As you can see in the documentation, this adds a "delete marker" to the object. In a sense, it is "labelled" as deleted.
There doesn't seem to be any check, before these markers are created, that the underlying object actually exists.
As the S3 storage is distributed, consistency isn't guaranteed under all circumstances.
What this means is that if you were to carry out some operations on a file and then check it, the answer wouldn't be certain.
In the case of S3, the AWS docs say:
Amazon S3 offers eventual consistency for overwrite PUTS and DELETES in all regions.
"eventual consistency" means that at some undefined point in the future all the distributed nodes will catch up with the changes and the results returned from a query will be as expected, given the changes you have done
So basically, this is a long-winded way of saying: no, you can't get a confirmation that the file is deleted, and checking whether it exists afterwards will not work reliably.
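If a best-effort check is still useful despite that caveat, head-object exits non-zero when the key cannot be found, so a sketch along these lines works most of the time:
if aws s3api head-object --bucket myBucket --key myFolder/myFile.txt > /dev/null 2>&1; then
  aws s3 rm "s3://myBucket/myFolder/myFile.txt"
else
  echo "myFolder/myFile.txt not found (or not yet consistent)"
fi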
I upload folders/files by:
aws s3 cp files s3://my_bucket/
aws s3 cp folder s3://my_bucket/ --recursive
Is there a way to return/rollback to previous version?
Like git revert or something similar?
Here is the test file that I uploaded 4 times.
How do I get to a previous version (make it the "Latest version")?
For example, how do I make "Jan 17, 2018 12:48:13" or "Jan 17, 2018 12:24:30"
become the "Latest version", not in the GUI but by using the command line?
Here is how to get that done:
If you are using the CLI:
https://docs.aws.amazon.com/cli/latest/reference/s3api/get-object.html
Get the object with the version you want.
Then perform a put-object with the downloaded object.
https://docs.aws.amazon.com/cli/latest/reference/s3api/put-object.html
Your old S3 object will be the latest object now.
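In CLI terms, a sketch of those steps (the version id is a placeholder you would take from list-object-versions):
aws s3api list-object-versions --bucket my_bucket --prefix test_file
aws s3api get-object --bucket my_bucket --key test_file --version-id <versionId> test_file.old
aws s3api put-object --bucket my_bucket --key test_file --body test_file.old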
AWS S3 objects are immutable; you can only put and delete. A rename is a GET and PUT of the same object with a different name.
Hope it helps.
No. However, to protect against this in the future, you can enable versioning on your bucket and even configure the bucket to prevent automatic overwrites and deletes.
To enable versioning on your bucket, visit the Properties tab of the bucket and turn it on. After you have done so, copies or versions of each item within the bucket will contain version metadata, and you will be able to retrieve older versions of the objects you have uploaded.
Once you have enabled versioning, you will not be able to turn it off.
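Since the question asks about the command line, versioning can also be enabled without the console, for example:
aws s3api put-bucket-versioning --bucket my_bucket --versioning-configuration Status=Enabled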
EDIT (Updating my answer for your updated question):
You can't version your objects in this fashion. You are providing each object a unique Key, so S3 is treating it as a new object. You are going to need to use the same Key for each object PUT to use versioning correctly. The only way to get this to work would be to GET all of the objects from the bucket and find the most recent date in the Key programmatically.
EDIT 2:
https://docs.aws.amazon.com/AmazonS3/latest/dev/RestoringPreviousVersions.html
To restore previous versions you can:
One of the value propositions of versioning is the ability to retrieve previous versions of an object. There are two approaches to doing so:
Copy a previous version of the object into the same bucket. The copied object becomes the current version of that object and all object versions are preserved.
Permanently delete the current version of the object. When you delete the current object version, you, in effect, turn the previous version into the current version of that object.
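The first approach maps onto a single copy-object call; a sketch, with the version id again being a placeholder taken from list-object-versions:
aws s3api copy-object --bucket my_bucket --key test_file --copy-source "my_bucket/test_file?versionId=<versionId>"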
I wasn't able to get the answer I was looking for to this question. I figured it out myself by going to the AWS S3 console and would like to share it here.
So, the quickest way is to simply navigate to:
AWS Console -> S3 console -> the bucket -> the S3 object
You will see the following:
At this point you can simply navigate to all your object
versions by clicking on "Versions" and pick (download or move)
whichever version of the object you are interested in.
S3 allows you to enable versioning for your bucket. If you have versioning on, you should be able to find previous versions back. If not, you are out of luck.
See the following page for more information: https://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html
I've been searching for an answer to this question for quite some time but apparently I'm missing something.
I use s3cmd heavily to automate document uploads to AWS S3, via script. One of the parameters that can be used in s3cmd is --add-header, which I assume allows for lifecycle rules to be added.
My objective is to add this parameter and specify a +X (where X is days) to the upload. In the event of ... --add-header=...1 ... the lifecycle rule would delete this file after 24h.
I know this can be easily done via the console, but I would like to have a more detailed control over individual files/scripts.
I've read the parameters that can be passed to S3 via s3cmd, but I somehow can't understand how to put all of those together to get the intended result.
Thank you very much for any help or assistance!
The S3 API itself does not implement support for any request header that triggers lifecycle management at the object level.
The --add-header option for s3cmd can add headers that S3 understands, such as Content-Type, but there is no lifecycle header you can send using any tool.
You might be thinking of this:
If you make a GET or a HEAD request on an object that has been scheduled for expiration, the response will include an x-amz-expiration header that includes this expiration date and the corresponding rule Id
https://aws.amazon.com/blogs/aws/amazon-s3-object-expiration/
This is a response header, and it is read-only.
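If per-file control is the real goal, one workaround (not an s3cmd header, and only a sketch with made-up bucket, key, and tag names) is a bucket lifecycle rule filtered on an object tag, combined with tagging individual uploads:
aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration '{"Rules": [{"ID": "expire-tagged-objects", "Filter": {"Tag": {"Key": "expire-after", "Value": "1-day"}}, "Status": "Enabled", "Expiration": {"Days": 1}}]}'
aws s3api put-object --bucket my-bucket --key docs/report.pdf --body report.pdf --tagging "expire-after=1-day"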