Transferring PDF files from a local folder to AWS

I have a monthly activity where I get hundreds of PDF files in a folder and need to transfer them to an AWS server. Currently I do this manually, but I need to automate the transfer of all the PDF files from my local folder to a specific folder on AWS.
This process also takes a lot of time (approx. 5 hours for 500 PDF files). Is there a way to speed up the process?

While copying from local to AWS you are presumably using a tool like WinSCP or another SSH client, so you could automate the same thing with a script:
scp -r /your/pdf/dir youruser@awshost:/home/user/path/
If you want to speed it up, you could run multiple scp commands in parallel in multiple terminals, and you could split the files into logically grouped directories as they are created.
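For example, a minimal sketch of the parallel approach, assuming key-based SSH is set up and using placeholder paths, user, and host names:
# send batches of 50 PDFs per scp call, with up to 8 scp processes running in parallel
find /your/pdf/dir -name '*.pdf' -print0 \
  | xargs -0 -n 50 -P 8 sh -c 'scp "$@" youruser@awshost:/home/user/path/' _
Each scp call opens its own SSH connection, so batching several files per call keeps the connection overhead down.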

You can zip the files, transfer the archive, and unzip it after the transfer.
Alternatively, write a program that iterates over all the files in your folder and uploads them to S3 using the S3 API.
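A rough sketch of both ideas, with placeholder paths, host, and bucket names (the second option assumes the AWS CLI is installed and configured):
# option 1: bundle the PDFs, copy the archive once, unzip it on the server
zip -r pdfs.zip /your/pdf/dir
scp pdfs.zip youruser@awshost:/home/user/path/
ssh youruser@awshost 'cd /home/user/path && unzip pdfs.zip'

# option 2: upload the folder straight to an S3 bucket instead of a server
aws s3 cp /your/pdf/dir s3://your-bucket/pdfs/ --recursive --exclude '*' --include '*.pdf'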

Related

Is it feasible to maintain directory structure when backing up to AWS S3 Glacier classes?

I am trying to back up 2 TB from a shared drive on a Windows Server to S3 Glacier.
There are maybe 100 folders (some may be nested) and perhaps 5000 files (some small, like spreadsheets and photos; others larger, like server images). My first question is: what counts as an object here?
Let's say I have Folder 1, which has 10 folders inside it. Each of the 10 folders has 100 files.
Would the number of objects be 1 folder + (10 folders * 100 files) = 1001 objects?
I am trying to understand how folder nesting is treated in S3. Do I have to manually create each folder as a prefix and then upload each file inside it using the AWS CLI? I am trying to recreate the shared-drive experience in the cloud, where I can browse the folders and download the files I need.
Amazon S3 does not actually support folders. It might look like it does, but it actually doesn't.
For example, you could upload an object to invoices/january.txt and the invoices directory will just magically 'appear'. Then, if you deleted that object, the invoices folder would magically 'disappear' (because it never actually existed).
So, feel free to upload objects to any location without creating the directories first.
However, if you click the Create folder button in the Amazon S3 management console, it will create a zero-length object with the name of the directory. This will make the directory 'appear' and it would be counted as an object.
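For example, uploading directly to a nested key needs no such step (bucket name is a placeholder):
# no need to create an 'invoices' folder first; the prefix appears automatically
aws s3 cp january.txt s3://your-bucket/invoices/january.txt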
The easiest way to copy the files from your Windows computer to an Amazon S3 bucket would be:
aws s3 sync directoryname s3://bucket-name/ --storage-class DEEP_ARCHIVE
It will upload all the files, including files in subdirectories. It will not create the folders, since they aren't necessary; however, the folders will still 'appear' in S3.

dynamically create / append to zip from multiple instances

I have a situation where thousands of files are created for a user by multiple backend instances and then uploaded to AWS S3 / Azure Storage. After all the files are created, the user wants to download them as a zip. I can create the zip and then get a pre-signed URL, but I have tried a few archiving solutions and all of them simply take too much time (hours).
Is there any way of creating the zip dynamically from the multiple backend instances? I want to append to the zip after each file is created, from any backend instance.
Zip itself supports the use case you want. For example, the zip command on Linux:
When given the name of an existing zip archive, zip will replace identically named entries in the zip archive (matching the relative names as stored in the archive) or add entries for new names.
You need to persist the working zip file somewhere on a file system, though. The most obvious choice I can think of is EFS, so that multiple instances can mount the file system and access the zip file.
If you don't want to modify the existing instances/workloads, you can even mount EFS on Lambda, then set an S3 trigger for the Lambda to update the zip file every time a new file is uploaded.
I don't think you can use only S3 for this, because you cannot update S3 objects in place. You would then need to download and re-upload the archive for every new file, which is really not ideal.
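As a rough illustration of the EFS approach (the mount point and file names are placeholders), any instance that produces a file can append it to the shared archive in place:
# /mnt/efs is the shared EFS mount; zip adds the entry or replaces one with the same name
zip -j /mnt/efs/user-123/archive.zip /tmp/newly-created-file.pdf
Note that zip does not coordinate concurrent writers, so if several instances can append at the same time you would still need to serialize the appends, for example with a lock file next to the archive.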

Faster method to download 6.5m objects from GCP bucket

I'm looking for a faster method to download a ton of objects (6.5 million in my case) from a bucket. The average object size is 2kb (it's a JSON file). The method I used was gsutil -m cp -r gs://<bucket>/<folder> . which took 14 hours for 1M objects.
It's not feasible to run this on my laptop for 7 days straight. Any ideas?
PS: I don't need them to be in individual JSON files. I'm thinking of creating a script that pulls a file from the bucket, adds a row to a CSV, then deletes the file.
Try downloading the files to a VM, compressing them into a single tgz (or bz2 or xz), uploading the archive back to the bucket, and then downloading that single archive.
Cloud Shell should work too.
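A rough sketch of that flow, run from a Compute Engine VM or Cloud Shell (bucket and folder names are placeholders):
# on the VM: pull the objects with parallel transfers, bundle them, push one archive back
gsutil -m cp -r gs://your-bucket/your-folder .
tar czf your-folder.tgz your-folder
gsutil cp your-folder.tgz gs://your-bucket/

# on the laptop: one large download instead of millions of tiny ones
gsutil cp gs://your-bucket/your-folder.tgz .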

How can I make sure files in a Storage Bucket are being served gzipped on my website?

I have a setup like this in Google Cloud Platform for my website, a Gatsbyjs project:
Push to repository
Trigger Cloud Build, which builds my Gatsby website into the public folder
Copy the files within the public folder to a Storage bucket, using rsync
However, when I visit my site (a Load Balancer connected to the bucket), the .js, .css, and .html files are not being served gzipped.
I know there seems to be a flag on the cp command for gzip, but how does this work with rsync?
Thanks
Compressing your files "on the fly" with rsync is not possible.
However, if you just want to apply gzip transport encoding to certain files, you can use the -j <extensions> option of gsutil rsync. This saves network bandwidth, but it leaves the data uncompressed in the Storage bucket.
If instead you want to take the uncompressed files from the public folder, send them to a bucket, and keep them compressed there, you will need to use gsutil cp -z. This compresses your files (if they are not already compressed in your public folder) and stores them compressed in the bucket.
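For example (bucket name is a placeholder; adjust the extension lists as needed):
# transport-only compression: saves upload bandwidth, objects are stored uncompressed
gsutil -m rsync -r -j html,css,js public gs://your-bucket

# stored with gzip content-encoding, so the bucket keeps and serves the files compressed
gsutil -m cp -r -z html,css,js public gs://your-bucket
Since you want the files served gzipped from the bucket, the cp -z route is the one that keeps the objects compressed at rest.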

aws s3 mv/sync command

I have about 2 million files nested in subfolders in a bucket and want to move all of them to another bucket. After spending a lot of time searching... I found a solution: use the AWS CLI mv/sync commands. Either use the move command, or use the sync command and then delete all the files after they have successfully synced.
aws s3 mv s3://mybucket/ s3://mybucket2/ --recursive
or it can be done as:
aws s3 sync s3://mybucket/ s3://mybucket2/
But the problem is: how would I know how many files/folders have been moved or synced, and how much time it will take...
And what if some exception occurs (the machine/server stops, or the internet disconnects for any reason)... do I have to execute the command again, or will it be sure to complete and move/sync all the files? How can I be sure about the number of files moved/synced and not moved/synced?
Or can I do something like this:
move a limited number of files, e.g. 100 thousand, and repeat until all files are moved...
or move files on the basis of upload time, e.g. files uploaded from a start date to an end date?
If yes, how?
To sync them use:
aws s3 sync s3://mybucket/ s3://mybucket2/
You can repeat the command after it finishes (or fails) without issue. It will check whether anything is missing or different in the target S3 bucket and process only that.
The time depends on the size of the files and on how many objects you have. Amazon counts directories as objects, so they matter too.
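A hedged sketch of how you could track and verify the result with the CLI (bucket names are placeholders):
# count objects and total size in each bucket; compare before and after
aws s3 ls s3://mybucket/ --recursive --summarize | tail -n 2
aws s3 ls s3://mybucket2/ --recursive --summarize | tail -n 2

# sync is safe to re-run after an interruption; it only copies what is missing or different
aws s3 sync s3://mybucket/ s3://mybucket2/

# once the counts match, remove the source objects
aws s3 rm s3://mybucket/ --recursive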