I would like to grab a file straight off the Internet and put it into an S3 bucket, then copy it over to a PIG cluster. Due to the size of the file and my not-so-good internet connection, downloading the file onto my PC first and then uploading it to Amazon might not be an option.
Is there any way I could go about grabbing a file off the internet and putting it directly into S3?
Download the data via curl and pipe the contents straight to S3. The data is streamed directly to S3 and never stored locally, so neither disk space nor memory is an issue.
curl "https://download-link-address/" | aws s3 cp - s3://aws-bucket/data-file
As suggested above, if download speed is too slow on your local computer, launch an EC2 instance, ssh in and execute the above command there.
For anyone (like me) less experienced, here is a more detailed description of the process via EC2:
Launch an Amazon EC2 instance in the same region as the target S3 bucket. The smallest available (default Amazon Linux) instance should be fine, but be sure to give it enough storage space to save your file(s). If you need transfer speeds above ~20 MB/s, consider selecting an instance with more network bandwidth.
Open an SSH connection to the new EC2 instance, then download the file(s), for instance using wget. (For example, to download an entire directory via FTP, you might use wget -r ftp://name:passwd@ftp.com/somedir/.)
Using the AWS CLI (see Amazon's documentation), upload the file(s) to your S3 bucket. For example, aws s3 cp myfolder s3://mybucket/myfolder --recursive (for an entire directory). (Before this command will work, you need to add your S3 security credentials to a config file, as described in the Amazon documentation.)
Terminate/destroy your EC2 instance.
[2017 edit]
I gave the original answer back in 2013. Today I'd recommend using AWS Lambda to download the file and put it on S3. That achieves the desired effect: placing an object in S3 with no server involved.
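A minimal sketch of such a Lambda function, assuming a Python runtime and reusing the placeholder URL, bucket, and key from the curl example above; for large files you would also need to raise the function's timeout and memory, or switch to a multipart upload:

import urllib3
import boto3

s3 = boto3.client('s3')
http = urllib3.PoolManager()

def handler(event, context):
    # Placeholder values; in practice pass them in via the event or environment variables.
    source_url = 'https://download-link-address/'
    bucket = 'aws-bucket'
    key = 'data-file'
    # Stream the HTTP response body straight into S3 without writing to local disk.
    response = http.request('GET', source_url, preload_content=False)
    s3.upload_fileobj(response, bucket, key)
    return {'status': 'uploaded', 'bucket': bucket, 'key': key}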
[Original answer]
It is not possible to do it directly.
Why not do this with an EC2 instance instead of your local PC? Upload speed from EC2 to S3 in the same region is very good.
Regarding stream reading/writing from/to S3, I use Python's smart_open.
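A minimal sketch with smart_open, assuming the library is installed and using placeholder URL, bucket, and key names; it copies an HTTP download into S3 in chunks, so the whole file never sits in local memory or on disk:

import shutil
import urllib.request
from smart_open import open as s3_open

source_url = 'https://download-link-address/'   # placeholder download URL
target_uri = 's3://aws-bucket/data-file'        # placeholder bucket/key

# smart_open performs a multipart upload under the hood as the bytes are copied.
with urllib.request.urlopen(source_url) as src, s3_open(target_uri, 'wb') as dst:
    shutil.copyfileobj(src, dst)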
You can stream a file from the internet to AWS S3 using Python, for example:
import boto3
import urllib3

s3 = boto3.resource('s3')
http = urllib3.PoolManager()
# Pass the streaming HTTP response straight to upload_fileobj; nothing is written to disk.
s3.meta.client.upload_fileobj(
    http.request('GET', '<Internet_URL>', preload_content=False),
    '<bucket_name>', '<object_key>',
    ExtraArgs={'ServerSideEncryption': 'aws:kms', 'SSEKMSKeyId': '<alias_name>'})
I have a huge tar file in an S3 bucket that I want to decompress while it stays in the bucket. I do not have enough space on my local machine to download the tar file and upload it back to the S3 bucket. What's the best way to do this?
Amazon S3 does not have in-built functionality to manipulate files (such as compressing/decompressing).
I would recommend:
Launch an Amazon EC2 instance in the same region as the bucket
Login to the EC2 instance
Download the file from S3 using the AWS CLI
Untar the file
Upload desired files back to S3 using the AWS CLI
Amazon EC2 instances are charged per second, so choose a small instance type (e.g. t3a.micro) and the whole operation will be rather low-cost (perhaps under 1 cent).
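If you would rather script those steps than run the CLI commands by hand, here is a rough Python equivalent using boto3 and the standard tarfile module; the bucket name, key, and paths are placeholders, and it assumes the instance has enough disk for the archive plus its contents:

import os
import tarfile
import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'        # placeholder bucket name
tar_key = 'archive.tar'     # placeholder key of the tar file in S3

# 1. Download the tar file from S3 to the EC2 instance's disk.
s3.download_file(bucket, tar_key, '/tmp/archive.tar')

# 2. Untar it locally.
with tarfile.open('/tmp/archive.tar') as tar:
    tar.extractall('/tmp/extracted')

# 3. Upload the extracted files back to S3, preserving relative paths.
for root, _, files in os.walk('/tmp/extracted'):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, '/tmp/extracted')
        s3.upload_file(path, bucket, 'extracted/' + key)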
Use case:
I have one directory on-premises that I want to back up, let's say at every midnight, and I want to be able to restore it if something goes wrong.
This doesn't seem like a complicated task, but reading through the AWS documentation even this can be cumbersome and costly. Setting up a Storage Gateway locally seems unnecessarily complex for a simple task like this, and setting it up on EC2 is costly as well.
What I have done:
Reading through this + some other blog posts:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
https://docs.aws.amazon.com/storagegateway/latest/userguide/WhatIsStorageGateway.html
What I have found:
1. Setting up a file gateway (locally or as an EC2 instance):
It just mounts the files to an S3 bucket, and that's it, so my on-premises app would constantly write to this S3 bucket. The documentation doesn't mention anything about scheduled backup and recovery.
2. Setting up a volume gateway:
Here I can make a scheduled synchronization/backup to S3, but using a whole volume for it would be a big overhead.
3. Standalone S3:
Just using a bare S3 bucket and copying my backup there via the AWS API/SDK with a manually created scheduled job.
Solutions:
Using point 1 from above, enable versioning, and the versions of the files will serve as recovery points.
Using point 3
I think I am looking for a mix of the file and volume gateways: working at the file level while taking an asynchronous scheduled snapshot of the files.
How should this be handled? Isn't there a really easy way that will just send a backup of a directory to AWS?
The easiest way to backup a directory to Amazon S3 would be:
Install the AWS Command-Line Interface (CLI)
Provide credentials via the aws configure command
When required, run the aws s3 sync command
For example
aws s3 sync folder1 s3://bucketname/folder1/
This will copy any files from the source to the destination. It will only copy files that have been added or changed since a previous sync.
Documentation: sync — AWS CLI Command Reference
If you want to be more fancy and keep multiple backups, you could copy to a different target directory, or create a zip file first and upload the zip file, or even use a backup program like Cloudberry Backup that knows how to use S3 and can do traditional-style backups.
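If you go the zip-and-upload route with your own scheduled job, a minimal Python sketch with boto3 might look like the following (the bucket name and paths are placeholders; run it from cron or your scheduler of choice at midnight):

import datetime
import shutil
import boto3

s3 = boto3.client('s3')
bucket = 'bucketname'            # placeholder bucket
source_dir = '/path/to/folder1'  # placeholder directory to back up

# Create a timestamped zip archive so each nightly run is kept as a separate backup.
stamp = datetime.datetime.now().strftime('%Y-%m-%d')
archive = shutil.make_archive(f'/tmp/backup-{stamp}', 'zip', source_dir)

# Upload the archive; earlier archives remain in the bucket as independent recovery points.
s3.upload_file(archive, bucket, f'backups/backup-{stamp}.zip')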
I need to transfer all our files (with the folder structure) to AWS S3. I have researched a lot about how this is done.
Most places mention s3fs. But that looks a bit old, and when I tried to install s3fs on my existing CentOS 6 web server, it got stuck on the $ make command. (Yes, there is a Makefile.in.)
And as per this answer, AWS S3 Transfer Acceleration is the next best option. But I would still have to write a PHP script (my application is PHP) to transfer all the folders and files to S3. It works the same way as saving a file to S3 (the putObject API), just faster. Please correct me if I am wrong.
Is there any other, better solution (I would prefer FTP) to transfer 1 TB of files with folders from the CentOS 6 server to AWS S3? Is there any way to use an FTP client on EC2 to transfer the files from the external CentOS 6 server to AWS S3?
Use the aws s3 sync command of the AWS Command-Line Interface (CLI).
This will preserve your directory structure and can be restarted in case of disconnection. Each execution will only copy new, changed or missing files.
Be aware that 1TB is a lot of data and can take significant time to copy.
An alternative is to use AWS Snowball, which is a device that AWS can send to you. It can hold 50TB or 80TB of data. Simply copy your data to the device, then ship it back to AWS and they will copy the data to Amazon S3.
I have imported a VHD file from the local network to EC2 and have the data in the S3 bucket. It ran properly. I accidentally terminated the EC2 instance that was created. I still have the data in the S3 bucket in parts. Can I use that data, or do I have to re-upload the image? It's 40 GB and would take over one work day to push to the cloud.
Yes, if you do not specifically remove your file from your S3 bucket you should be able to use it again from any future EC2 instance. For example, you could use the aws-cli to "copy" the file from the S3 bucket to any number of EC2 instances.
If you had used the aws s3 mv or aws s3 rm command, then I would expect the file to be gone.
The bottom line is, if the file is still in the bucket, then you can still use it provided you have the permissions set correctly.
I restarted the conversion task and re-uploaded the VM.
I have a zip file that is about 20 GB in size and contains about 400,000 images, which I was able to move to my EC2 instance using wget. Now I want to unzip the files and save them to my S3 bucket.
Preferably, it would be great if I didn't need to unzip them on the EC2 instance first. Can I somehow use unzip options over SSH to extract each file directly to S3?
I have found answers like this one: https://stackoverflow.com/a/9722141/2335675. But I have no understanding of what he actually means by "unzipping it to S3". Can I do this while connected to my EC2 instance by SSH? Does Amazon have some kind of built-in unzip command that extracts to S3 instead of to the current server?
I can see other people have asked this question, but I'm unable to find a direct answer on how to actually do it.
How I solved it:
I created a secondary volume on my EC2 instance with roughly 3x the space of the zip file, so there was also room for the extracted files. See the guide here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-add-volume-to-instance.html
While connected to the EC2 instance by SSH, I used the unzip command to unzip the file onto the new volume.
I used aws s3 cp myfolder s3://mybucket/myfolder --recursive to move all my files into my S3 bucket.
I deleted my temporary volume and all files on it.
Everything was done using SSH. No script or programming was required.
Remember that you need to use sudo to have permission to do many of these things.
The first solution:
Mount S3 on EC2 using s3fs.
Extract the files to the mount point.
The second solution:
Using Python and its AWS library boto:
extract one file at a time to a temporary location using zipfile,
upload it to S3 using boto,
then delete the temporary file,
and repeat until all files have been processed.
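A minimal sketch of that loop, using boto3 (the current successor to boto) and Python's zipfile module; the bucket name, zip path, and key prefix are placeholders:

import os
import zipfile
import boto3

s3 = boto3.client('s3')
bucket = 'my-bucket'                 # placeholder bucket name
zip_path = '/home/ec2-user/big.zip'  # placeholder path to the 20 GB zip

with zipfile.ZipFile(zip_path) as archive:
    for name in archive.namelist():
        if name.endswith('/'):       # skip directory entries
            continue
        # Extract a single file to a temporary location...
        tmp_path = archive.extract(name, '/tmp/unzip')
        # ...upload it to S3...
        s3.upload_file(tmp_path, bucket, 'images/' + name)
        # ...then delete the temporary file before moving on.
        os.remove(tmp_path)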