I need to filter objects in GCS using some search criteria and capture the matching files in a Python list (the number of files varies every time). I want to move them to a separate GCS folder in the same bucket. Currently I loop through the list and call gsutil cp for every file. Since my list is dynamic, is there any way to do this without a loop? Sometimes the list has more than a million files.
Related
I have a bucket with 3 million objects. I don't even know how many folders there are in my S3 bucket, or what they are called. I want to list only the folders in the bucket. Is there any way to get a list of all folders?
I would use AWS CLI for this. To get started - have a look here.
Then it is mostly a matter of standard Linux-style commands (ls):
aws s3 ls s3://<bucket_name>/path/to/search/folder/ --recursive | grep '/$' > folders.txt
where:
the grep command reads what the aws s3 ls command returned and keeps only the entries ending in /.
the trailing > folders.txt saves the output to a file.
Note: grep (if I'm not wrong) is a Unix-only utility, but I believe you can achieve the same thing on Windows.
Note 2: depending on the number of files there this operation might (will) take a while.
Note 3: in systems like AWS S3, the term "folder" usually exists only to give users a visual similarity with standard file systems; internally it is treated as part of the object key. You can see this in the (web) console when you filter by "prefix".
Amazon S3 buckets with large quantities of objects are very difficult to use. The API calls that list bucket contents are limited to returning 1000 objects per API call. While it is possible to request 'folders' (by using Delimiter='/' and looking at CommonPrefixes), this would take repeated calls to obtain the hierarchy.
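For reference, the Delimiter/CommonPrefixes approach mentioned above could look roughly like the following sketch. It assumes a boto3 S3 client is passed in (for example one created with boto3.client("s3")); the function names list_folders and walk_folders are made up for illustration. Each level of the hierarchy needs its own paginated listing, which is why this is slow on a bucket with millions of objects.

```python
def list_folders(s3_client, bucket, prefix=""):
    """Return the immediate 'folder' prefixes directly under `prefix`."""
    paginator = s3_client.get_paginator("list_objects_v2")
    folders = []
    # Each page holds at most 1000 entries; the paginator issues the repeated calls.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for common in page.get("CommonPrefixes", []):
            folders.append(common["Prefix"])
    return folders


def walk_folders(s3_client, bucket, prefix=""):
    """Yield every folder prefix in the hierarchy, depth-first."""
    for folder in list_folders(s3_client, bucket, prefix):
        yield folder
        yield from walk_folders(s3_client, bucket, folder)
```

Calling list(walk_folders(client, "my-bucket")) would return every prefix in the bucket, where "my-bucket" is a placeholder name.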
Instead, I would recommend using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then play with that CSV file from code (or possibly Excel? Might be too big?) to obtain your desired listings.
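As a rough sketch of "playing with the CSV from code": the function below derives the set of folder prefixes from inventory rows. It assumes the object key is in the second column (after the bucket name) and that keys are percent-encoded in the CSV; both assumptions should be checked against a sample of your own inventory and its manifest. The name folders_from_inventory is hypothetical.

```python
import csv
from urllib.parse import unquote


def folders_from_inventory(rows, key_column=1):
    """Collect every folder prefix implied by the object keys in `rows`."""
    folders = set()
    for row in rows:
        key = unquote(row[key_column])  # assumption: keys are percent-encoded
        parts = key.split("/")[:-1]     # drop the file name, keep the path segments
        for depth in range(1, len(parts) + 1):
            folders.add("/".join(parts[:depth]) + "/")
    return sorted(folders)


def folders_from_csv_file(path):
    """Read an (uncompressed) inventory CSV file and return its folder prefixes."""
    with open(path, newline="") as handle:
        return folders_from_inventory(csv.reader(handle))
```

Because only a set of prefixes is kept in memory, this should cope with a file that is too big for Excel.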
Just be aware that doing anything on that bucket will not be fast.
I have a few directories with about 1000 files each. The size of a directory is about 30G each.
I need to upload these directories to s3 bucket, but each file needs to be in a separate directory.
I'm using AWS SDK for uploading.
What would be the most efficient way to do it?
Copy directories as they are and move files inside s3 bucket to final destinations?
Run separate command for each file?
For the 1st, I hope the AWS libraries will handle parallelism while uploading better than I can with the 2nd solution.
But 2nd approach saves me time for moving (in fact copying and deleting) objects inside s3.
Regards
Pawel
I have more than 10,000 files in a folder and I want to select some of them (around 2,000) and move them to another folder in the same bucket. I have the list of file names to be moved and I'm looking for a way or a script to go through the files and move them to the destination folder. How can I do that easily?
Amazon S3 does not have a "move" operation. Instead, you can copy the objects to a new location and then delete the original objects.
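A minimal sketch of that copy-then-delete pattern in Python, assuming a boto3 S3 client is passed in and that every object is small enough for a single copy_object call (which is limited to 5 GB). The helper names move_object and move_to_folder are invented for this example.

```python
def move_object(s3_client, bucket, source_key, dest_key):
    """'Move' one object within a bucket: copy it, then delete the original."""
    s3_client.copy_object(
        Bucket=bucket,
        Key=dest_key,
        CopySource={"Bucket": bucket, "Key": source_key},
    )
    # Only reached if the copy did not raise an exception.
    s3_client.delete_object(Bucket=bucket, Key=source_key)


def move_to_folder(s3_client, bucket, keys, dest_prefix):
    """Move each key in `keys` under `dest_prefix`, keeping only the file name."""
    for key in keys:
        filename = key.rsplit("/", 1)[-1]
        move_object(s3_client, bucket, key, dest_prefix + filename)
```

The delete is issued only after the copy call returns, so a failure part-way through leaves the source object in place.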
From Performing large-scale batch operations on Amazon S3 objects - Amazon Simple Storage Service:
You can use S3 Batch Operations to perform large-scale batch operations on Amazon S3 objects. S3 Batch Operations can perform a single operation on lists of Amazon S3 objects that you specify.
You can provide the list of files in a CSV file and configure the batch to copy the objects to a new location. However, I'm not sure if you can then delete the list of source files, so it's not really "moving" the objects.
Frankly, the method I use is:
Create an Excel spreadsheet with a list of objects in column A
Create a formula in column B like: ="aws s3 mv s3://bucket/"&A1&" s3://bucket/destination/"&A1
Then, Fill Down to create the formula in every row
Finally, copy column B into a text file
Test a couple of lines to make sure they work correctly, then simply run the text file in a shell. It will move the files across. Not the world's fanciest method, but it should work fine for 2000 files!
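If you would rather skip the spreadsheet, the same list of commands can be generated with a few lines of Python. This is only a sketch of the spreadsheet method above: the bucket name and destination prefix are placeholders, build_mv_commands is a made-up helper, and shlex.quote is used so that keys containing spaces survive the shell.

```python
import shlex


def build_mv_commands(keys, bucket, dest_prefix):
    """Build one 'aws s3 mv' command line per object key."""
    commands = []
    for key in keys:
        source = f"s3://{bucket}/{key}"
        dest = f"s3://{bucket}/{dest_prefix}{key}"
        commands.append(f"aws s3 mv {shlex.quote(source)} {shlex.quote(dest)}")
    return commands


if __name__ == "__main__":
    # Placeholder input; in practice read the keys from your list of file names.
    for line in build_mv_commands(["a.csv", "my file.csv"], "bucket", "destination/"):
        print(line)
```

Redirect the output to a text file and run it in a shell, as with the spreadsheet version.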
I have ~200.000 s3 files that I need to partition, and have made an Athena query to produce a target s3 key for each of the original s3 keys. I can clearly create a script out of this, but how to make the process robust/reliable?
I need to partition csv files using info inside each csv so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
And I can make a single big script to copy all through Athena and bit of SQL, but I am concerned about doing this reliably so that I can be sure that all are copied across, and not have the script fail, timeout etc. Should I "just run the script"? From my machine or better to put it in an ec2 1st? These kinds of questions
This is a one-off, as the application code producing the files in s3 will start outputting directly to partitions.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the content of the files does not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
Since it is one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
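On the robustness concern, one option (a sketch only, not something the approach above requires) is to drive the copies from Python and record each successful copy in a local file, so the run can simply be restarted after a failure or timeout. It assumes a boto3 S3 client and a list of (source_key, dest_key) pairs, for example exported from the Athena query; copy_with_checkpoint and the log file are hypothetical names.

```python
def copy_with_checkpoint(s3_client, bucket, mapping, done_path):
    """Copy each (source_key, dest_key) pair, skipping pairs finished in earlier runs.

    Returns a list of (source_key, error_message) for copies that failed.
    """
    try:
        with open(done_path) as handle:
            done = {line.rstrip("\n") for line in handle}
    except FileNotFoundError:
        done = set()  # first run: nothing copied yet

    failed = []
    with open(done_path, "a") as log:
        for source_key, dest_key in mapping:
            if source_key in done:
                continue
            try:
                s3_client.copy_object(
                    Bucket=bucket,
                    Key=dest_key,
                    CopySource={"Bucket": bucket, "Key": source_key},
                )
            except Exception as exc:  # keep going; report failures at the end
                failed.append((source_key, str(exc)))
                continue
            log.write(source_key + "\n")
            log.flush()
    return failed
```

Re-running the function with the same done_path retries only the keys that are not yet in the log, and an empty return value means everything was copied.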
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that would loop through each object, 'peek' inside the object to see where it belongs, then issue a copy_object() command to send it to the right destination. (smart-open, available on PyPI, is a great library for reading from an S3 object without having to download it first.)
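A sketch of the "peek" idea using plain boto3 calls rather than smart-open: it fetches only the first few kilobytes of each object with a ranged get_object request and reads the partition values from the first data row. It assumes each CSV has a header row, that columns named var1 and var2 (taken from the example key in the question) hold the partition values, and that the header plus first row fit in the bytes fetched. The helper names are invented.

```python
import csv
import io


def peek(s3_client, bucket, key, num_bytes=4096):
    """Fetch only the first `num_bytes` bytes of an object."""
    response = s3_client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes=0-{num_bytes - 1}"
    )
    return response["Body"].read()


def partition_key(source_key, first_bytes, top_prefix, columns=("var1", "var2")):
    """Build the destination key from the first data row of a CSV."""
    text = first_bytes.decode("utf-8", errors="ignore")
    row = next(csv.DictReader(io.StringIO(text)))
    partition = "/".join(f"{col}={row[col]}" for col in columns)
    filename = source_key.rsplit("/", 1)[-1]
    return f"{top_prefix}/{partition}/{filename}"


def copy_to_partition(s3_client, bucket, source_key, top_prefix):
    """Peek inside one object and copy it to its partition prefix."""
    dest_key = partition_key(source_key, peek(s3_client, bucket, source_key), top_prefix)
    s3_client.copy_object(
        Bucket=bucket, Key=dest_key, CopySource={"Bucket": bucket, "Key": source_key}
    )
    return dest_key
```

Since you already have the target keys from Athena, the peek step is optional in your case; it matters only when the mapping is not known in advance.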
We have a problem where some of the files in an S3 directory are in the ~500MiB range, but many other files are only KiB or bytes in size. I want to merge all the small files into fewer, bigger files on the order of ~500MiB.
What is the most efficient way to rewrite the data in an S3 folder without having to download, merge locally and push back to S3? Is there some utility or AWS command I can use to achieve this?
S3 is a storage service and has no compute capability. For what you are asking, you need compute (to merge). So you cannot do what you want without downloading, merging and uploading.
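The download, merge and upload steps still have to run on some compute, but deciding which small files go together can be planned from a listing alone. Below is a minimal sketch of that planning step, assuming you already have (key, size_in_bytes) pairs from a bucket listing. The function name plan_merges and the simple greedy grouping are illustrative choices, not a recommended algorithm.

```python
def plan_merges(objects, target_size=500 * 1024 * 1024):
    """Group small objects into batches of roughly `target_size` bytes.

    `objects` is an iterable of (key, size_in_bytes) pairs. Objects that are
    already at least `target_size` are left out of the plan.
    """
    batches = []
    current, current_size = [], 0
    for key, size in objects:
        if size >= target_size:
            continue  # already big enough, nothing to merge
        if current and current_size + size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch is then a list of keys to download, concatenate and upload as one object.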