There are many compressed files in Google Cloud Storage.
I want to unzip a zipped file, rename it, and save it back to another bucket.
I've seen a lot of posts, but I couldn't find any approach other than downloading the file with gsutil and handling it locally.
Is there another way?
To modify a file, such as unzipping it, you must read it, modify it, and then write it back. That means downloading the archive, unzipping it, and uploading the extracted files.
You can use gsutil or another tool, one of the SDKs, or the REST APIs for the transfer. To unzip the file, use a zip tool or one of the libraries that support zip operations.
You can use Cloud Functions for this. A function triggered when an object is created/finalized will do the job automatically for every new archive.
You can also reuse the same logic to iterate over existing zip files, do the processing, and move the results to another bucket. One thing to be aware of: a single invocation may not be able to process all files in one go, so for the files that already exist in the bucket you can run the same program from a local machine.
Google also publishes a list of client libraries for connecting to Cloud Storage.
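As a rough illustration rather than a drop-in solution, here is a minimal Python sketch of such a function using the google-cloud-storage client. The destination bucket name and the renaming rule are hypothetical and would need to match your setup:

import io
import zipfile
from google.cloud import storage

client = storage.Client()
DEST_BUCKET = "my-unzipped-bucket"  # hypothetical destination bucket

def unzip_object(event, context):
    """Triggered by a google.storage.object.finalize event."""
    name = event["name"]
    if not name.endswith(".zip"):
        return  # ignore non-archives

    # The whole archive is held in memory, which is fine for small/medium zips.
    src_bucket = client.bucket(event["bucket"])
    data = io.BytesIO(src_bucket.blob(name).download_as_bytes())

    dest_bucket = client.bucket(DEST_BUCKET)
    with zipfile.ZipFile(data) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            # Example rename: prefix each extracted file with the archive name.
            new_name = name.removesuffix(".zip") + "/" + member  # Python 3.9+
            dest_bucket.blob(new_name).upload_from_string(archive.read(member))

Deployed with the google.storage.object.finalize trigger, this runs once per uploaded archive; for a one-off backfill of archives already in the bucket, the same loop can be driven from a local machine over a bucket listing.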
I have a job that clones a repo and then uses s3 sync to copy changed files to an S3 bucket. I'd like to sync only the files that have actually changed. Since the repo is cloned first, every file has a new timestamp, so s3 sync always uploads everything. I thought about using --size-only, but my understanding is that this can miss files that have legitimately changed. What's the best way to go about this?
There is no out-of-the-box way to sync changed files if the mtime cannot be counted on. As you point out, if a file does not change in size, the --size-only flag will cause aws s3 sync to skip it. To my mind there are two basic paths; which one you use will depend on your exact needs.
Take advantage of Git
First off, you could use the fact that the files are stored in git to help restore the modification times. git itself will not store this metadata; the maintainers' philosophy is that doing so is a bad idea. I won't argue for or against that, but there are two basic ways around it:
You could store the metadata in git. There are multiple approaches to doing this; one is metastore, a tool installed alongside git that records the metadata and reapplies it later. This does require adding a tool for every user of your git repo, which may or may not be acceptable.
Another option is to attempt to recreate the mtime from metadata that's already in git. For instance, git-restore-mtime does this by using the timestamp of the most recent commit that modified the file. This would require running an external tool before running the sync command, but it shouldn't require any other workflow changes.
Using either of these options would allow a plain aws s3 sync command to work, since the timestamps would be consistent from one run to the next.
Do your own thing
Fundamentally, you want to upload files that have changed. aws s3 sync uses file size and modification timestamps to detect changes, but you could instead write a script or program that enumerates all the files you want to upload and uploads them along with a small piece of extra metadata, such as a SHA-256 hash of the content. On future runs, you enumerate the files in S3 with list-objects and call head-object on each object to read that metadata and see whether the hash has changed.
Alternatively, you could use the ETag of each object in S3, since that is returned by the list-objects call. The ETag formula isn't documented and is subject to change; that said, it is known, and you can find implementations of it here on Stack Overflow and elsewhere. You could calculate the ETag for your local files and then check whether the remote objects differ and need to be updated. That would save you having to call head-object on every object while checking for changes.
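For illustration, here is a minimal sketch of the hash-in-metadata approach using boto3 in Python. The bucket name, local root, and the "sha256" metadata key are all assumptions, not established conventions:

import hashlib
import os
import boto3
from botocore.exceptions import ClientError

LOCAL_ROOT = "repo"            # hypothetical checkout directory
BUCKET = "my-deploy-bucket"    # hypothetical bucket

s3 = boto3.client("s3")

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def remote_hash(key):
    """Return the stored hash for a key, or None if the object is missing."""
    try:
        head = s3.head_object(Bucket=BUCKET, Key=key)
    except ClientError:
        return None
    return head.get("Metadata", {}).get("sha256")

for dirpath, _, filenames in os.walk(LOCAL_ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        key = os.path.relpath(path, LOCAL_ROOT).replace(os.sep, "/")
        digest = sha256_of(path)
        if digest != remote_hash(key):
            # Upload and record the content hash as object metadata.
            s3.upload_file(path, BUCKET, key,
                           ExtraArgs={"Metadata": {"sha256": digest}})

On the first run everything is uploaded; on later runs only files whose content hash differs from the stored metadata are re-uploaded, regardless of mtime.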
I have an Amazon S3 bucket with tons of images. A subset of these images needs to be synced to a local machine for image analysis (AI) purposes. This has to be done regularly, ideally with a list of file names as input. Not all images need to be synced.
There are ways to synchronise S3 with Dropbox/Amazon Drive or other storage services, but none of them appears to offer the option of providing a list of the files that need to be synced.
How can this be implemented?
The first thing that springs to mind when talking about syncing with S3 is the aws s3 sync CLI command. It lets you sync a specific origin and destination folder, and it gives you --include and --exclude flags if you want to list specific files. The flags also accept wildcards (*), so if you have a naming convention you can use it to identify the files.
Both --include and --exclude can be repeated for multiple files, so depending on your OS you could either list all the files explicitly or write a small find script that identifies the files and singles them out.
Additionally, you can pass --delete, which removes any files in the destination path that are not in the origin.
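If the file list is long, a shell command with many repeated flags gets unwieldy; a small script can read the list and fetch each key directly. Here is a minimal sketch with boto3 (the bucket name, list file, and destination directory are hypothetical):

import os
import boto3

BUCKET = "my-image-bucket"       # hypothetical bucket name
LIST_FILE = "files_to_sync.txt"  # one S3 key per line
DEST_DIR = "images"

s3 = boto3.client("s3")

with open(LIST_FILE) as f:
    keys = [line.strip() for line in f if line.strip()]

for key in keys:
    local_path = os.path.join(DEST_DIR, key)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file(BUCKET, key, local_path)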
If I understand this correctly, I would use the AWS CLI with --include and --exclude filters:
https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#use-of-exclude-and-include-filters
I have millions of files in different folders in an S3 bucket.
The files are very small. I wish to download all the files under the folder named VER1. The VER1 folder contains many subfolders, and I want to download all of the millions of files under all of those subfolders.
(e.g. VER1 -> sub1 -> file1.txt, VER1 -> sub1 -> subsub1 -> file2.txt, etc.)
What is the fastest way to download all the files?
Using s3 cp? s3 sync?
Is there a way to download all the files located under the folder in parallel?
Use the AWS Command-Line Interface (CLI):
aws s3 sync s3://bucket/VER1 [name-of-local-directory]
From my experience, it will download in parallel but it won't necessarily use the full bandwidth because there is a lot of overhead for each object. (It is more efficient for large objects, since there is less overhead.)
It is possible that aws s3 sync might have problems with a large number of files. You'd have to try it to see whether it works.
If you really wanted full performance, you could write your own code that downloads in massive parallel, but the time saving would probably be lost in the time it takes you to write and test such a program.
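For a sense of scale, such a program does not have to be elaborate. Here is a rough sketch using boto3 and a thread pool (the bucket name and prefix are placeholders); it is a starting point, not a tuned solution:

import os
from concurrent.futures import ThreadPoolExecutor
import boto3

BUCKET = "bucket"   # placeholder
PREFIX = "VER1/"
DEST = "VER1"

s3 = boto3.client("s3")

def download(key):
    local_path = os.path.join(DEST, os.path.relpath(key, PREFIX))
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file(BUCKET, key, local_path)

# List every object under the prefix, paginating through the results.
paginator = s3.get_paginator("list_objects_v2")
keys = [
    obj["Key"]
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX)
    for obj in page.get("Contents", [])
    if not obj["Key"].endswith("/")
]

# Tiny objects are dominated by per-request overhead, so many threads help.
with ThreadPoolExecutor(max_workers=64) as pool:
    pool.map(download, keys)

Because the objects are so small, raising max_workers and running close to S3 (for example on an EC2 instance in the same region) matters more than raw bandwidth.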
Another option is to use aws s3 sync to download to an Amazon EC2 instance, then zip the files and simply download the zip file. That would reduce bandwidth requirements.
So I have a use case where I need to put files from an on-prem FTP server into S3.
Each file (XML) is 5 KB max.
The volume is about 100 files per minute.
The use case is such that as soon as a file arrives at the FTP location I need to put it into the S3 bucket immediately.
What could be the best way to achieve that?
Here are my options:
Using the AWS CLI at my FTP location (push mechanism).
Using Lambda (pull mechanism).
Writing a Java application to put the files into S3 from FTP.
Or is there anything built-in that I can leverage?
Basically, I need to put each file into S3 as soon as possible, because a UI is built on top of S3 and if a file does not arrive immediately I might be in trouble.
The easiest would be to use the AWS Command-Line Interface (CLI), or an API call if you wish to do it from application code.
It doesn't really make sense to do it via Lambda, because Lambda would have to somehow retrieve the file from the FTP server and then copy it to S3, so it would be doing the work twice.
You can certainly write a Java application to do it, or simply call the AWS CLI (itself written in Python), since it works out of the box.
You could either use aws s3 sync to copy all new/updated files, or copy specific files with aws s3 cp. With so many files it's probably best to specify the files explicitly; otherwise sync will waste time scanning historical files that don't need to be copied.
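If you do end up writing your own uploader, the core of it is small. Here is a hedged sketch that polls the FTP drop directory and pushes each new file to S3 with boto3; the directory, bucket name, and polling interval are hypothetical:

import os
import time
import boto3

WATCH_DIR = "/ftp/incoming"   # hypothetical FTP drop directory
BUCKET = "my-ui-bucket"       # hypothetical destination bucket
POLL_SECONDS = 2

s3 = boto3.client("s3")
seen = set()

while True:
    for name in os.listdir(WATCH_DIR):
        path = os.path.join(WATCH_DIR, name)
        if name in seen or not os.path.isfile(path):
            continue
        s3.upload_file(path, BUCKET, name)
        seen.add(name)  # a real tool would move/delete the file instead
    time.sleep(POLL_SECONDS)

A real implementation would also have to cope with files that are still being written by the FTP server, and should move or delete files after a successful upload rather than tracking them in memory.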
The ultimate best case would be for the files to be sent to S3 directly, without involving FTP at all!
A rename is not just renaming a folder in Amazon S3; under the hood it does a copy and a delete, which involves PUT requests, has a cost, and can be slow when the folder contains many large files (http://gerardvivancos.com/2016/04/12/Single-and-Bulk-Renaming-of-Objects-in-Amazon-S3/).
I came across the following page (http://gerardvivancos.com/2016/04/12/Single-and-Bulk-Renaming-of-Objects-in-Amazon-S3/), which talks about renaming via a script.
Can we execute a similar script via the Amazon SDK for Java API?
Does it still do a copy and delete internally, or does it just change the path?
Thanks.
That script is just issuing aws s3 mv commands in a loop. You can issue S3 move commands via the AWS Java SDK.
It is still copying and deleting internally; S3 has no true rename operation, so a move is always a copy followed by a delete.
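To make the copy-then-delete behaviour concrete, here is a small Python/boto3 sketch of a bulk "rename" (bucket and prefixes are placeholders); the AWS SDK for Java exposes the equivalent copyObject and deleteObject calls:

import boto3

BUCKET = "my-bucket"        # placeholder
OLD_PREFIX = "old-folder/"  # placeholder "folder" to rename
NEW_PREFIX = "new-folder/"

s3 = boto3.client("s3")

# "Renaming" a folder means copying every object to a new key
# and deleting the original; there is no rename API in S3.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=OLD_PREFIX):
    for obj in page.get("Contents", []):
        old_key = obj["Key"]
        new_key = NEW_PREFIX + old_key[len(OLD_PREFIX):]
        s3.copy_object(
            Bucket=BUCKET,
            Key=new_key,
            CopySource={"Bucket": BUCKET, "Key": old_key},
        )
        s3.delete_object(Bucket=BUCKET, Key=old_key)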