I have implemented an architecture as per the link https://cloud.google.com/solutions/streaming-data-from-cloud-storage-into-bigquery-using-cloud-functions
But the issue arises when multiple files land in the bucket at the same time (e.g. 3 files arrive with the same timestamp, 21/06/2020, 12:13:54 UTC+5:30). In this scenario, the Cloud Function is unable to move all of these same-timestamp files to the success bucket after processing.
Can someone please suggest a solution?
Google Cloud Storage is not a file system. You can only CREATE, READ and DELETE a BLOB; therefore, you can't MOVE a file. The MOVE that exists in the console or in some client libraries (in Python, for example) performs a CREATE (copy the existing BLOB to the target name) and then a DELETE of the old BLOB.
Consequently, you can't keep the original timestamp when you perform a MOVE operation.
NOTE: because you perform a CREATE and a DELETE when you MOVE a file, you are charged for early deletion when you use storage classes such as Nearline, Coldline and Archive.
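As an illustration, here is a minimal sketch of such a "move" with the Python client library, which is just a copy followed by a delete (bucket and object names below are placeholders):

from google.cloud import storage

def move_blob(source_bucket_name, blob_name, dest_bucket_name, new_name):
    client = storage.Client()
    source_bucket = client.bucket(source_bucket_name)
    source_blob = source_bucket.blob(blob_name)
    dest_bucket = client.bucket(dest_bucket_name)

    # CREATE: copy the existing blob to the target bucket under the new name
    source_bucket.copy_blob(source_blob, dest_bucket, new_name)

    # DELETE: remove the original; the copy gets a fresh creation timestamp
    source_blob.delete()

move_blob("incoming-bucket", "data.csv", "success-bucket", "data.csv")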
Related
I have more than 10,000 files in a folder and I want to select some of them (around 2,000) and move them to another folder in the same bucket. I have the list of file names to be moved, and I'm looking for a way or a script to go through the files and move them to the destination folder. How can I do that easily?
Amazon S3 does not have a "move" operation. Instead, you can copy the objects to a new location and then delete the original objects.
From Performing large-scale batch operations on Amazon S3 objects - Amazon Simple Storage Service:
You can use S3 Batch Operations to perform large-scale batch operations on Amazon S3 objects. S3 Batch Operations can perform a single operation on lists of Amazon S3 objects that you specify.
You can provide the list of files in a CSV file and configure the batch to copy the objects to a new location. However, I'm not sure if you can then delete the list of source files, so it's not really "moving" the objects.
Frankly, the method I use is:
Create an Excel spreadsheet with a list of objects in column A
Create a formula in column B like: ="aws s3 mv s3://bucket/"&A1&" s3://bucket/destination/"&A1
Then, Fill Down to create the formula in every row
Finally, copy column B into a text file
Test a couple of lines to make sure it works correctly, then simply run the text file in a shell. It will move the files across. Not the world's fanciest method, but it should work fine for 2,000 files!
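If you would rather script it than generate CLI commands, a minimal boto3 sketch of the same copy-then-delete loop (the bucket name, destination prefix and keys.txt file below are placeholder assumptions):

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"          # placeholder bucket name
dest_prefix = "destination/"  # placeholder target "folder"

# keys.txt holds one object key per line (the ~2,000 files to move)
with open("keys.txt") as f:
    keys = [line.strip() for line in f if line.strip()]

for key in keys:
    new_key = dest_prefix + key.split("/")[-1]
    # "Move" = copy to the new key, then delete the original
    s3.copy_object(Bucket=bucket, Key=new_key,
                   CopySource={"Bucket": bucket, "Key": key})
    s3.delete_object(Bucket=bucket, Key=key)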
I have ~200,000 S3 files that I need to partition, and I have made an Athena query to produce a target S3 key for each of the original S3 keys. I can clearly create a script out of this, but how do I make the process robust/reliable?
I need to partition CSV files using info inside each CSV so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file.
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
I can make a single big script to copy them all, using Athena and a bit of SQL, but I am concerned about doing this reliably so that I can be sure everything is copied across and the script doesn't fail, time out, etc. Should I "just run the script"? From my machine, or is it better to run it on an EC2 instance first? Those kinds of questions.
This is a one-off, as the application code producing the files in s3 will start outputting directly to partitions.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the content of the files does not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
Since it is a one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that would loop through each object, 'peek' inside the object to see where it belongs, then issue a copy_object() command to send it to the right destination. (smart-open · PyPI is a great library for reading from an S3 object without having to download it first.)
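For example, a rough sketch of that loop with boto3 and smart-open, assuming the partition values can be read from the first data row of each CSV (the bucket, prefix and column names var1/var2 are placeholders):

import csv
import boto3
from smart_open import open as s3_open  # streams S3 objects without downloading them first

s3 = boto3.client("s3")
bucket = "bucket"             # placeholder bucket name
top_prefix = "top_prefix/"    # placeholder prefix from the question

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=top_prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".csv"):
            continue
        # 'Peek' inside the object to find its partition values
        with s3_open(f"s3://{bucket}/{key}", "r") as f:
            row = next(csv.DictReader(f))
        new_key = f"{top_prefix}var1={row['var1']}/var2={row['var2']}/{key.rsplit('/', 1)[-1]}"
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})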
One of our clients asks to get all the videos they uploaded to the system. The files are stored in S3. The client expects to get one link that will download an archive with all the videos.
Is there a way to create such an archive without downloading the files, archiving them and uploading them back to AWS?
So far I haven't found a solution.
Is it possible to do it with Glacier, or to move the files to a folder and expose it?
Unfortunately, you can't create zip-like archives from existing objects directly on S3. Similarly, you can't transfer them to Glacier to do this: Glacier is not going to produce a single zip or rar (or any type of) archive from multiple S3 objects for you.
Instead, you have to download the objects first, zip or rar them (or use whichever archiving format you prefer), and then re-upload the archive to S3. Then you can share the zip/rar with your customers.
There is also the possibility of using the multipart upload API to merge S3 objects without downloading them. But this requires programming a custom solution to merge the objects (not creating zip/rar-type archives).
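If you go the download-and-re-upload route described above, a minimal sketch with boto3 might look like this (the bucket name, prefix, archive key and link lifetime are placeholder assumptions):

import io
import zipfile
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"          # placeholder
prefix = "videos/client-a/"   # placeholder prefix holding this client's videos

# Build the archive in memory; for large video sets, stream to a temp file on disk instead.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:  # videos are already compressed
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            archive.writestr(obj["Key"].split("/")[-1], body)

buffer.seek(0)
s3.upload_fileobj(buffer, bucket, "archives/client-a.zip")

# A pre-signed URL gives the client a single download link (7 days is the maximum here)
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket, "Key": "archives/client-a.zip"},
    ExpiresIn=7 * 24 * 3600,
)
print(url)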
You can create a glacier archive for a specific prefix (what you see as a subfolder) by using AWS lifecycle rules to perform this action.
More information available here.
Original answer
There is no native way to be able to retrieve all objects as an archive via S3.
S3 simply exposes all objects as they are uploaded, unfortunately you will need to perform the archiving as a separate process afterwards.
I have a bucket (s3://Bucket1) and there are millions of files in it, with keys in a format like below:
s3://Bucket1/yyyy-mm-dd/
I want to move these files to a layout like:
s3://Bucket1/year/mm
Any help, script, or method will be really helpful.
I have tried aws s3 cp s3://Bucket1/ s3://Bucket1/ --include "2017-01-01*", but this is not working well, and on top of that I would still have to delete the source files separately.
The basic steps are:
Get a list of objects
Copy the objects to the new name
Delete the old objects
Get a list of objects
Given that you have millions of files, the best way to start is to use Amazon S3 Inventory to obtain a CSV file of all the objects.
Copy the objects to the new name
Then, write a script that reads the CSV file and issues a copy() command to copy the file to the new location. This could be written in any language that has an AWS SDK (eg Python).
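As a hedged sketch (not a definitive implementation), assuming the inventory report lists bucket and key in the first two columns and the keys look like yyyy-mm-dd/filename:

import csv
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

# S3 Inventory CSV: bucket in column 1, URL-encoded key in column 2 (adjust if your report differs)
with open("inventory.csv") as f:
    for bucket, key, *rest in csv.reader(f):
        key = unquote_plus(key)
        date_prefix, _, filename = key.partition("/")   # e.g. "2017-01-01" and "file.csv"
        if date_prefix.count("-") != 2:
            continue                                    # skip keys not in the yyyy-mm-dd/ layout
        year, month, _day = date_prefix.split("-")
        new_key = f"{year}/{month}/{filename}"
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})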
Delete the old objects
Rather than individually deleting the objects, use S3 object lifecycle management to delete the old files (see the sketch below the list). The benefits of using this method are:
There is no charge for the delete (whereas issuing millions of delete commands would involve a charge)
It can be done after the files have been copied, providing a chance to verify that all the files have been correctly copied (by checking the next S3 inventory output)
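One way to set up that deletion, sketched with boto3 and assuming the old keys all start with a year-and-dash prefix (the years and rule IDs below are placeholders):

import boto3

s3 = boto3.client("s3")

# One expiration rule per old year prefix: "2017-" matches "2017-01-01/..." but not the
# new "2017/01/..." layout. Note that this call replaces the bucket's entire existing
# lifecycle configuration.
rules = [
    {
        "ID": f"expire-old-{year}",
        "Filter": {"Prefix": f"{year}-"},
        "Status": "Enabled",
        "Expiration": {"Days": 1},
    }
    for year in range(2012, 2018)
]

s3.put_bucket_lifecycle_configuration(
    Bucket="Bucket1",
    LifecycleConfiguration={"Rules": rules},
)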
You could use the AWS CLI to issue an aws s3 mv command, which combines the copy and delete, effectively providing a rename function. However, shell scripts aren't that easy at this scale, and if things fail half-way the files will be in a mixed state. That's why I prefer the "copy all objects, and only then delete" method.
I need to move some files (thousands) to an Amazon S3 bucket, from where they will be displayed to the end user by another application (instead of the current one).
The problem is that these files currently have a creation/upload date (dates vary between 2012 and 2017, when they were uploaded to the current application), and when I move them they all end up with the same date. That is a problem because when you look at the files in the new application, you can't see the time hierarchy, which is sometimes very important.
Is there any way I can modify upload date of a file(s) in S3?
The Last Modification Date is generated by Amazon S3 and cannot be set via the API.
If dates and other information (eg user) are important to your application, you can store it as metadata on the object. Then, retrieve the metadata when displaying dates, user, etc.
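For example, a sketch of attaching the original upload date as user-defined metadata while copying an object into the new bucket (the bucket names, key and date value are placeholder assumptions):

import boto3

s3 = boto3.client("s3")

# Copy the object into the new bucket, carrying the original date as user metadata
s3.copy_object(
    Bucket="new-app-bucket", Key="docs/report.pdf",                 # placeholder destination
    CopySource={"Bucket": "old-app-bucket", "Key": "docs/report.pdf"},
    Metadata={"original-upload-date": "2014-03-07T12:30:00Z"},      # value taken from the old system
    MetadataDirective="REPLACE",  # required so the new metadata is applied during the copy
)

# Later, the new application reads the metadata back when displaying the file
head = s3.head_object(Bucket="new-app-bucket", Key="docs/report.pdf")
print(head["Metadata"]["original-upload-date"])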
What I did was rename the file to something else and then rename it back to its original name.
As you cannot rename directly, you have to copy the file to a new name, and then copy it back to its original name. (and delete the auxiliary file, of course)
It is not optimal, but that's the solution when using the AWS client. I hope one day AWS will have all the functions that FTP used to have.
You can just copy over the same object and the timestamp will update.
This technique is also used to prolong the expiration of an object in a bucket with a lifecycle rule.
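A quick sketch of that in-place copy with boto3; note that S3 rejects a copy of an object onto itself unless something changes, hence the metadata directive (bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "path/to/object"   # placeholders

# Copy the object onto itself to refresh its Last-Modified timestamp.
# MetadataDirective="REPLACE" is needed because S3 refuses a self-copy that changes nothing.
s3.copy_object(
    Bucket=bucket, Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    MetadataDirective="REPLACE",
)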