Unzip file on EC2 and save it on S3?

I have a zip file of about 20 GB containing about 400,000 images that I was able to download to my EC2 instance using wget. Now I want to unzip the files and save them to S3.
Preferably I wouldn't need to unzip them on the EC2 instance first. Can I somehow, over SSH, use unzip with some options to extract each file directly to S3?
I have found answers like this https://stackoverflow.com/a/9722141/2335675. But I have no understanding of what he actually means by "unzipping it to S3". Can I do this while connected to my EC2 instance by SSH? Does Amazon have some kind of built-in unzip command that extracts to S3 instead of the current server?
I can see other people have asked this question, but I'm unable to find a direct answer on how to actually do it.

How I solved it:
I created a secondary volume on my EC2 instance, sized at roughly three times the zip file, so there was also room for the extracted files. See the guide here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-add-volume-to-instance.html
While connected to the EC2 instance over SSH, I used the unzip command to unzip the file onto the new volume.
I used aws s3 cp myfolder s3://mybucket/myfolder --recursive to copy all the extracted files into my S3 bucket.
I deleted my temporary volume and all files on it.
Everything was done using SSH. No script or programming was required.
Remember that you need sudo to have permission for many of these steps.

The first solution:
Mount S3 on the EC2 instance using s3fs.
Extract the files to the mount point.
The second solution (a sketch follows below):
Use Python and its AWS library boto.
1. Extract one file to a temporary location using zipfile.
2. Upload it to S3 using boto.
3. Delete the temporary file.
4. Repeat from step 1 until the archive is exhausted.
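For reference, here is a minimal sketch of that second approach, using boto3 (the current AWS SDK for Python) in place of the older boto. The zip file name, bucket, key prefix, and temporary directory are all placeholders.

import os
import zipfile
import boto3

s3 = boto3.client('s3')
bucket = 'mybucket'                      # placeholder target bucket

with zipfile.ZipFile('images.zip') as zf:
    for name in zf.namelist():
        # extract a single entry to a temporary location
        tmp_path = zf.extract(name, '/tmp/unzip')
        if os.path.isdir(tmp_path):
            continue                     # skip directory entries
        # upload it to S3, then remove the temporary copy
        s3.upload_file(tmp_path, bucket, 'myfolder/' + name)
        os.remove(tmp_path)

This keeps only one extracted file on disk at a time, so the extra EBS volume from the accepted approach is not needed.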

Related

How to open a file that is stored in a bucket connected to the google cloud VM instance

I am new to Google Cloud, and I need to run a single Python script on a Compute Engine instance.
I opened a new Compute Engine VM instance, created a new bucket, and uploaded the script to the bucket. I can see that the VM is connected to the bucket, since when I run the command to list the buckets from the VM it finds the bucket and shows that the script is indeed there.
What I'm missing is how to run the script. Or more generally, how do I access these files?
I was looking for a suitable command but could not find one, though I have a feeling there should be such a command (since the VM can find the bucket and the files contained in it, I guess it can also access them somehow). How should I proceed to run the script from here?
The bucket's content is not attached to a volume in the VM. They are totally independent. With that being said, you first have to copy the python file from the bucket to your compute instance by using the gsutil cp command as below:
gsutil cp gs://my-bucket/main.py .
Once you have the file locally on your compute instance, you can simply run it, for example with python3 main.py.

On-premise file backup to AWS

Use case:
I have one directory on-premise that I want to back up, let's say every midnight, and restore if something goes wrong.
It doesn't seem like a complicated task, but reading through the AWS documentation, even this can be cumbersome and costly. Setting up Storage Gateway locally seems unnecessarily complex for a simple task like this, and setting it up on EC2 is costly as well.
What I have done:
Reading through this + some other blog posts:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
https://docs.aws.amazon.com/storagegateway/latest/userguide/WhatIsStorageGateway.html
What I have found:
1. Setting up a file gateway (locally or as an EC2 instance):
It just mounts the files to S3, and that's it. So my on-premise app would constantly write to this S3 bucket. The documentation doesn't mention anything about scheduled backup and recovery.
2. Setting up a volume gateway:
Here I can make a scheduled synchronization/backup to S3, but using a whole volume for it would be a big overhead.
3. Standalone S3:
Just use a bare S3 bucket and copy my backup there via the AWS API/SDK with a manually made scheduled job.
Solutions:
Using point 1 from above: enable versioning, and the versions of the files will serve as recovery points.
Using point 3.
I think I am looking for a mix of file and volume gateway: working on the file level and making an asynchronous scheduled snapshot of the files.
How should this be handled? Isn't there a really easy way that will just send a backup of a directory to AWS?
The easiest way to backup a directory to Amazon S3 would be:
Install the AWS Command-Line Interface (CLI)
Provide credentials via the aws configure command
When required, run the aws s3 sync command
For example:
aws s3 sync folder1 s3://bucketname/folder1/
This will copy any files from the source to the destination. It will only copy files that have been added or changed since a previous sync.
Documentation: sync — AWS CLI Command Reference
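If you prefer to drive the sync from a scheduled job (a nightly cron entry, for example) rather than typing it by hand, a small Python wrapper around the same command is enough. This is only a sketch; it assumes the AWS CLI is installed and already configured, and it reuses the placeholder bucket and folder names from above.

import subprocess

# Mirror the local folder to S3; only new or changed files are uploaded.
result = subprocess.run(
    ['aws', 's3', 'sync', 'folder1', 's3://bucketname/folder1/'],
    capture_output=True, text=True)
if result.returncode != 0:
    raise RuntimeError('Backup failed: ' + result.stderr)
print(result.stdout)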
If you want to be fancier and keep multiple backups, you could copy to a different target directory, create a zip file first and upload that, or even use a backup program like CloudBerry Backup that knows how to use S3 and can do traditional-style backups.

How can I move my media files stored in local machine to S3?

I have a Django application running on EC2. Currently, all my media files are stored on the instance. All the documents I uploaded to the models are on the instance too. Now I want to add S3 as my default storage. What I am worried about is how I am going to move my current media files to S3 after the integration.
I am thinking of running a one-time Python script, but I am looking for any built-in solution, or maybe just opinions.
The AWS CLI should do the job:
aws s3 cp path/to/file s3://your-bucket/
or, if you want the whole directory:
aws s3 cp path/to/dir s3://your-bucket/ --recursive
All options can be seen here: https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
The easiest method would be to use the AWS Command-Line Interface (CLI) aws s3 sync command. It can copy files to/from Amazon S3.
However, if there are complicated rules about where to move the files, then you could certainly use a Python script and the upload_file() command.
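If you go the script route, a rough one-off sketch with boto3's upload_file() could look like the following. The bucket name, local media directory, and key prefix are placeholders, not anything Django provides.

import os
import boto3

s3 = boto3.client('s3')
bucket = 'your-bucket'        # placeholder
media_root = 'media'          # local Django MEDIA_ROOT on the instance

for dirpath, _, filenames in os.walk(media_root):
    for filename in filenames:
        local_path = os.path.join(dirpath, filename)
        # keep the same relative path as the S3 key
        key = os.path.relpath(local_path, media_root).replace(os.sep, '/')
        s3.upload_file(local_path, bucket, 'media/' + key)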

aws cli copy command halted

I used PuTTY to get into my AWS instance and ran a cp command to copy files into my S3 bucket.
aws s3 cp local s3://server_folder --recursive
Partway through, my internet dropped out and the copy halted even though the AWS instance was still running properly. Is there a way to make sure the cp command keeps running even if I lose my connection?
You can alternatively use the MinIO Client, aka mc; it is open source and compatible with AWS S3. The MinIO client is available for Windows, macOS, and Linux.
The mc mirror command will help you copy local content to a remote AWS S3 bucket; if a network issue causes the upload to fail, mc session resume will restart the upload from where the connection was terminated.
mc supports these commands.
COMMANDS:
ls List files and folders.
mb Make a bucket or folder.
cat Display contents of a file.
pipe Write contents of stdin to one target. When no target is specified, it writes to stdout.
share Generate URL for sharing.
cp Copy one or more objects to a target.
mirror Mirror folders recursively from a single source to single destination.
diff Compute differences between two folders.
rm Remove file or bucket [WARNING: Use with care].
access Set public access permissions on bucket or prefix.
session Manage saved sessions of cp and mirror operations.
config Manage configuration file.
update Check for a new software update.
version Print version.
You can check docs.minio.io for more details.
Hope it helps.
Disclaimer: I work for Minio.

Downloading a file from the Internet into an S3 bucket

I would like to grab a file straight off the Internet and stick it into an S3 bucket, to then copy it over to a PIG cluster. Due to the size of the file and my not-so-good internet connection, downloading the file first onto my PC and then uploading it to Amazon might not be an option.
Is there any way I could go about grabbing a file off the internet and sticking it directly into S3?
Download the data via curl and pipe the contents straight to S3. The data is streamed directly to S3 and not stored locally, avoiding any memory issues.
curl "https://download-link-address/" | aws s3 cp - s3://aws-bucket/data-file
As suggested above, if download speed is too slow on your local computer, launch an EC2 instance, ssh in and execute the above command there.
For anyone (like me) less experienced, here is a more detailed description of the process via EC2:
Launch an Amazon EC2 instance in the same region as the target S3 bucket. The smallest available (default Amazon Linux) instance should be fine, but be sure to give it enough storage space to save your file(s). If you need transfer speeds above ~20 MB/s, consider selecting an instance with larger pipes.
Launch an SSH connection to the new EC2 instance, then download the file(s), for instance using wget. (For example, to download an entire directory via FTP, you might use wget -r ftp://name:passwd@ftp.com/somedir/.)
Using AWS CLI (see Amazon's documentation), upload the file(s) to your S3 bucket. For example, aws s3 cp myfolder s3://mybucket/myfolder --recursive (for an entire directory). (Before this command will work you need to add your S3 security credentials to a config file, as described in the Amazon documentation.)
Terminate/destroy your EC2 instance.
[2017 edit]
I gave the original answer back in 2013. Today I'd recommend using AWS Lambda to download a file and put it on S3. That's the desired effect: placing an object on S3 with no server involved.
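A minimal sketch of such a Lambda function is below. It assumes the invoking event carries the source URL plus the target bucket and key, and that the function's role can write to that bucket; all of these names are illustrative.

import boto3
import urllib3

s3 = boto3.client('s3')
http = urllib3.PoolManager()

def lambda_handler(event, context):
    # stream the download straight into S3 without buffering the whole body
    resp = http.request('GET', event['url'], preload_content=False)
    s3.upload_fileobj(resp, event['bucket'], event['key'])
    resp.release_conn()
    return {'status': 'uploaded', 'key': event['key']}

Keep Lambda's execution time limit in mind; this suits files that can be downloaded within that window.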
[Original answer]
It is not possible to do it directly.
Why not do this with an EC2 instance instead of your local PC? Upload speed from EC2 to S3 in the same region is very good.
Regarding stream reading/writing from/to S3, I use Python's smart_open.
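For example, something along these lines should copy an HTTP download into S3 in chunks without holding the whole file in memory. The URL, bucket, and key are placeholders, and it assumes a smart_open version with HTTP and S3 support installed (e.g. pip install smart_open[s3]).

from smart_open import open

# read from the HTTP source and write to S3 in 1 MB chunks
with open('https://download-link-address/data-file', 'rb') as fin, \
        open('s3://aws-bucket/data-file', 'wb') as fout:
    for chunk in iter(lambda: fin.read(1024 * 1024), b''):
        fout.write(chunk)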
You can stream the file from internet to AWS S3 using Python.
import boto3
import urllib3

s3 = boto3.resource('s3')
http = urllib3.PoolManager()
# Stream the HTTP response straight into the bucket; nothing is buffered locally.
s3.meta.client.upload_fileobj(
    http.request('GET', '<Internet_URL>', preload_content=False),
    '<bucket_name>', '<key>',
    ExtraArgs={'ServerSideEncryption': 'aws:kms', 'SSEKMSKeyId': '<alias_name>'})