We have a 500 GB data set in an S3 bucket that contains some empty files, and we need to remove them. Is there a better way than copying everything to a Linux machine and running the find command to delete the empty files?
If you don't know which files are empty, you could request S3 Inventory. It is delivered once a day or once a week in CSV format. One of its fields is:
Size – Object size in bytes.
Thus, having the inventory file, you will be able to very efficiently identify, and then remove, empty files from your bucket.
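For example, a rough sketch in Python (boto3), assuming you have downloaded and un-gzipped one of the inventory CSV files locally as inventory.csv and that the columns are Bucket, Key, Size (adjust the indexes and the bucket name to your own configuration):

# Sketch: delete zero-byte objects listed in an S3 Inventory CSV.
# Assumes columns Bucket, Key, Size; keys in inventory reports are URL-encoded.
import csv
from urllib.parse import unquote

import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'          # placeholder bucket name

empty_keys = []
with open('inventory.csv', newline='') as f:
    for row in csv.reader(f):
        key, size = unquote(row[1]), int(row[2])
        if size == 0:
            empty_keys.append(key)

# delete_objects accepts at most 1000 keys per call
for i in range(0, len(empty_keys), 1000):
    batch = empty_keys[i:i + 1000]
    s3.delete_objects(
        Bucket=bucket,
        Delete={'Objects': [{'Key': k} for k in batch]},
    )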
I mounted the bucket onto an EC2 instance using s3fs and ran the empty file/dir check there; this method was more convenient.
I have a situation where thousands of files are created for a user by multiple backend instances and then uploaded to AWS S3 / Azure Storage. After all the files are created, the user wants to download them as a zip. I can create the zip and then get a pre-signed URL, but the archiving solutions I have tried all take far too much time (hours).
Is there any way of creating the zip dynamically from the multiple backend instances? I want to append to the zip after each file is created, from any backend instance.
Zip itself supports the use case you want. For example, the zip command on Linux:
When given the name of an existing zip archive, zip will replace identically named entries in the zip archive (matching the relative names as stored in the archive) or add entries for new names.
You need to persist the working zip file somewhere in a file system though. The most obvious choice I can think of is EFS, so that multiple instances can mount the file system and access the zip file.
If you don't want to modify the existing instances/workloads, you can even mount EFS on Lambda. Then set an S3 trigger for the Lambda to update the zip file every time a new file is uploaded.
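A rough sketch of such a Lambda handler in Python, assuming the EFS access point is mounted at /mnt/efs and the zip path is a placeholder:

# Sketch: S3-triggered Lambda that appends each newly uploaded object
# to a zip file stored on an EFS mount (assumed to be at /mnt/efs).
import os
import urllib.parse
import zipfile

import boto3

s3 = boto3.client('s3')
ZIP_PATH = '/mnt/efs/user-123/archive.zip'   # placeholder path on EFS

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # Download the new object to the Lambda's temp space
        local = '/tmp/' + os.path.basename(key)
        s3.download_file(bucket, key, local)

        # Append the file as a new entry in the zip on EFS.
        # Note: concurrent invocations would need locking around this step.
        with zipfile.ZipFile(ZIP_PATH, 'a', zipfile.ZIP_DEFLATED) as zf:
            zf.write(local, arcname=os.path.basename(key))

        os.remove(local)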
I don't think you can use only S3 for this, because you cannot update S3 objects in place. You would then need to download and re-upload the whole zip for every new file, which is really not ideal.
I have more than 10,000 files in a folder and I want to select some of them (around 2,000) and move them to another folder in the same bucket. I have the list of file names to be moved and I'm looking for a way, or a script, to go through the files and move them to the destination folder. How can I do that easily?
Amazon S3 does not have a "move" operation. Instead, you can copy the objects to a new location and then delete the original objects.
From Performing large-scale batch operations on Amazon S3 objects - Amazon Simple Storage Service:
You can use S3 Batch Operations to perform large-scale batch operations on Amazon S3 objects. S3 Batch Operations can perform a single operation on lists of Amazon S3 objects that you specify.
You can provide the list of files in a CSV file and configure the batch to copy the objects to a new location. However, I'm not sure if you can then delete the list of source files, so it's not really "moving" the objects.
Frankly, the method I use is:
Create an Excel spreadsheet with a list of objects in column A
Create a formula in column B like: ="aws s3 mv s3://bucket/"&A1&" s3://bucket/destination/"&A1
Then, Fill Down to create the formula in every row
Finally, copy column B into a text file
Test a couple of lines to make sure it works correctly, then simply run the text file in a shell. It will move the files across. Not the world's fanciest method, but it should work fine for 2,000 files!
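If you'd rather script it than build the spreadsheet, a rough Python (boto3) equivalent might look like this, assuming the key names to move are listed one per line in files_to_move.txt (the bucket and prefix names are placeholders):

# Sketch: "move" a list of objects by copying them to a new prefix
# and then deleting the originals.
import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'           # placeholder
source_prefix = ''             # e.g. 'incoming/'
dest_prefix = 'destination/'

with open('files_to_move.txt') as f:
    keys = [line.strip() for line in f if line.strip()]

for key in keys:
    src_key = source_prefix + key
    s3.copy_object(
        Bucket=bucket,
        Key=dest_prefix + key,
        CopySource={'Bucket': bucket, 'Key': src_key},
    )
    s3.delete_object(Bucket=bucket, Key=src_key)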
I have a TB-sized S3 bucket with pdf files. I need to migrate the old files to Glacier. I know that I can create a lifecycle rule to migrate files that are older than a certain number of days, but in my case the bucket currently contains both old and new pdf files and they were all added at the same time, so they may have the same upload date. In this case a lifecycle rule won't be useful.
Each pdf file has a field called capture_date, so I need to migrate the files based on that capture_date (i.e. migrate all pdf files whose capture_date < 2015-05-21, and so on).
Would a Fargate job be useful here? If so, please give a brief idea.
Please suggest your ideas. Thanks in advance.
S3 by itself will not read your pdf files. Thus you have to read them yourself, extract the data that determines which ones are old and which are new, and use the AWS SDK (or CLI) to move them to Glacier.
Since the files are not too big, you could use S3 Batch along with a Lambda function which would change the storage class to Glacier.
Alternatively, you could do this on an EC2 instance, using S3 Inventory's CSV list of your objects (assuming there is a large number of them).
And the most traditional way is to just list your bucket and iterate over each object.
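A rough outline of the "list and iterate" approach in Python (boto3). How you extract capture_date depends on where that field lives inside your pdfs, so read_capture_date() below is purely a hypothetical placeholder:

# Sketch: change the storage class of old pdfs to Glacier, based on a
# capture_date read from each file. read_capture_date() is hypothetical --
# replace it with whatever parses that field out of your pdfs.
from datetime import date

import boto3

s3 = boto3.client('s3')
bucket = 'my-pdf-bucket'       # placeholder
cutoff = date(2015, 5, 21)

def read_capture_date(body_bytes):
    # Hypothetical: parse the pdf (e.g. with a pdf library) and return a date
    raise NotImplementedError

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get('Contents', []):
        key = obj['Key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        if read_capture_date(body) < cutoff:
            # Copying an object onto itself with a new StorageClass
            # transitions it to Glacier.
            s3.copy_object(
                Bucket=bucket,
                Key=key,
                CopySource={'Bucket': bucket, 'Key': key},
                StorageClass='GLACIER',
            )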
I have ~200,000 S3 files that I need to partition, and I have made an Athena query to produce a target S3 key for each of the original S3 keys. I can clearly create a script out of this, but how do I make the process robust and reliable?
I need to partition the csv files using info inside each csv, so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file.
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
I can generate one big script of copy commands through Athena and a bit of SQL, but I am concerned about doing this reliably, so that I can be sure everything is copied across and the script doesn't fail, time out, etc. Should I "just run the script"? From my machine, or is it better to run it on an EC2 instance first? Those kinds of questions.
This is a one-off, as the application code producing the files in s3 will start outputting directly to partitions.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the contents of the files do not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
Since it is one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that loops through each object, 'peeks' inside it to see where it belongs, then issues a copy_object() command to send it to the right destination. (smart-open on PyPI is a great library for reading from an S3 object without having to download it first.)
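A rough sketch of that loop in Python with boto3 and smart-open; derive_partition() is a hypothetical placeholder for however var1 and var2 are read out of each csv, and the bucket/prefix names simply follow the example command above:

# Sketch: peek inside each csv to work out its partition, then copy it to
# the partitioned prefix. derive_partition() is hypothetical.
import boto3
from smart_open import open as s3_open

s3 = boto3.client('s3')
bucket = 'bucket'              # placeholder, as in the question
top_prefix = 'top_prefix/'

def derive_partition(first_line):
    # Hypothetical: parse the csv line and return e.g. ('X', 'Y')
    raise NotImplementedError

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=top_prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('/') or key.count('/') > 1:
            continue   # skip folder markers and files already under a partition prefix

        # Read just the first data line without downloading the whole file
        with s3_open(f's3://{bucket}/{key}', 'r') as f:
            first_line = f.readline()

        var1, var2 = derive_partition(first_line)
        filename = key.rsplit('/', 1)[-1]
        new_key = f'{top_prefix}var1={var1}/var2={var2}/{filename}'
        s3.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={'Bucket': bucket, 'Key': key},
        )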
We are using an S3 bucket to store a growing number of small JSON files (~1KB each) that contain some build-related data. Part of our pipeline involves copying these files from S3 and putting them into memory to do some operations.
That copy operation is done via an S3 CLI command that looks something like this:
aws s3 cp s3://bucket-path ~/some/local/path/ --recursive --profile dev-profile
The problem is that the number of JSON files in S3 is getting pretty large, since more are created every day. It's nowhere near the capacity of the S3 bucket, since the files are so small, but in practical terms there's no need to copy ALL of these JSON files. Realistically the system would be safe just copying the most recent 100 or so, but we do want to keep the older ones around for other purposes.
So my question boils down to: is there a clean way to copy a specific number of files from S3 (maybe sorted by most recent)? Is there some kind of pruning policy we can set on an S3 bucket to delete files older than X days or something?
The aws s3 sync command in the AWS CLI sounds perfect for your needs.
It will copy only files that are new or modified since the last sync. However, it means that the destination will need to retain a copy of the 'old' files so that they are not copied again.
Alternatively, you could write a script (eg in Python) that lists the objects in S3 and then only copies objects added since the last time the copy was run.
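For the "most recent 100" idea, a rough Python (boto3) sketch might look like this (the bucket name, local path, and the value of N are placeholders):

# Sketch: download only the N most recently modified objects from a bucket.
import os

import boto3

s3 = boto3.client('s3')
bucket = 'bucket-path'         # placeholder
local_dir = os.path.expanduser('~/some/local/path')
N = 100

# List everything, then keep the N newest by LastModified
objects = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket):
    objects.extend(page.get('Contents', []))

newest = sorted(objects, key=lambda o: o['LastModified'], reverse=True)[:N]

os.makedirs(local_dir, exist_ok=True)
for obj in newest:
    filename = obj['Key'].rsplit('/', 1)[-1]
    s3.download_file(bucket, obj['Key'], os.path.join(local_dir, filename))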
You can set lifecycle policies on the S3 bucket, which will remove objects after a certain period of time.
To copy only objects that are a certain number of days old, you will need to write a script.
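Going back to the lifecycle idea, a rough sketch of setting such a rule with boto3 (the bucket name and the 90-day retention period are placeholders; the same rule can also be created in the console):

# Sketch: add a lifecycle rule that deletes objects more than 90 days old.
# Note: this call replaces any existing lifecycle configuration on the bucket.
import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='bucket-path',                 # placeholder
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'expire-old-json',
                'Filter': {'Prefix': ''},  # apply to the whole bucket
                'Status': 'Enabled',
                'Expiration': {'Days': 90},
            },
        ]
    },
)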