Download and reuse Amazon public datasets on multiple EC2 instances

I have an EC2 instance running in us-east-1 that needs to be able to access/manipulate data available in the KITTI Vision Benchmark public dataset. I'd like to make this data available to the instance, but would also like to be able to reuse it with other instances in the future (more like a mounted S3 approach).
Other similar questions get at downloading from the bucket directly... and I understand that I can view the bucket and recursively download the data to a local folder using the AWS CLI from within the instance:
aws s3 ls --no-sign-request s3://avg-kitti/
aws s3 sync --no-sign-request s3://avg-kitti/ . or aws s3 cp --no-sign-request s3://avg-kitti/ . --recursive
However, this feels like a brute-force approach that would likely require me to increase my EBS volume size... and would limit my reuse of this data elsewhere (unless I were to snapshot and reuse the entire EBS volume). I did find some Stack Overflow answers mentioning that some of the open datasets are available as a snapshot you can copy over and attach as a volume. But the KITTI Vision Benchmark public dataset appears to live on S3, so I don't think it has a snapshot the way the EBS-hosted datasets do...
That being said, is there an easier way to copy the public data over to an existing S3 bucket of my own, and then mount my instance to that? I have played around with S3FS and feel like that might be my best bet, but I am worried about 1) the cost of copying/downloading all the data from the public bucket to my own, 2) the best approach for reusing this data on other instances, and 3) simply not knowing whether there's a better/cheaper way to make this data available without downloading it, or needing to download it again in the future.
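For reference, a minimal sketch of the "copy once into my own bucket, then mount it from each instance" idea described above, assuming credentials that can read the public bucket and write to your own; the destination bucket name and mount point are hypothetical, and the s3fs package name varies by distro:
aws s3 sync s3://avg-kitti/ s3://my-kitti-copy/                          # one-time copy into a bucket I own
sudo apt-get install -y s3fs                                             # FUSE client for mounting S3
sudo mkdir -p /mnt/kitti
s3fs my-kitti-copy /mnt/kitti -o iam_role=auto -o use_cache=/tmp/s3fs    # mount the copy on each instance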

Related

AWS Import-Snapshot with shared S3 Bucket

I am currently looking for a way of easily distributing customised volumes to clients.
An approach I am looking at is creating RAW disk images, saving them to S3 and having clients import them as snapshots using the AWS CLI.
My question is - who pays for the data access request/data transfer?
...I'm assuming it's the bucket owner, as there is no "requester pays" option for the Import-Snapshot command. Has anybody done anything similar?
Another approach is directly sharing snapshots to a client's account - but this involves an added charge on our part to create the ideally sized volumes + generate the snapshots to share.
Is there a better method of generating + sharing data (essentially what would become EBS volumes) of varying sizes and content?
The easiest method would be to create an Amazon Machine Image (AMI), which is a bootable snapshot. You can list it as a Community AMI.
Your clients can select the AMI when launching an Amazon EC2 instance. The boot disk will be exactly as you configured -- with the operating system, your application and all configurations that were saved on the disk.
There is no cost to you when a client uses the AMI.
See: Make an AMI public - Amazon Elastic Compute Cloud
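As a rough CLI sketch of that flow (the instance and image IDs below are placeholders):
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "client-image-v1"                          # build the AMI from the configured instance
aws ec2 modify-image-attribute --image-id ami-0123456789abcdef0 --launch-permission "Add=[{Group=all}]"  # make it public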

Does mounting an S3 bucket as a drive in an EC2 instance copy files locally or save them directly in the bucket?

I have built a script that scrapes a few thousand PDF files. I want to build a t2 instance that runs the script continuously for at least 2 weeks and saves the downloaded files in the S3 bucket. I read this tutorial but I have a doubt:
If I set the download folder to the mounted drive location, does mounting imply that the data will be stored on both EBS and S3, or will the files be saved directly in the S3 bucket?
I need this clarification because while building the instance I'll keep the storage low (~75 GB) and use an S3 bucket, since the total size of the scraped files is going to exceed 300 GB.
Thanks!
Yes, a mounted drive doesn't take up your local storage, so you could spin up an instance with only 8 GB. For the mounting tool I'd recommend https://github.com/kahing/goofys (very actively developed) instead of s3fs, which seems to be slow and drives CPU usage up pretty badly if you have large files. I've been using goofys for years with my micro instance plus a 300 GB mounted drive without any slowness or issues.
Another, even better, solution is to use the AWS CLI to transfer files directly to S3 without requiring any mounting technique. You can simply write a Python script with boto3 which first downloads the pdf, then copies it to S3, and then removes the pdf locally (that takes only a few seconds even for large files).
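As a rough sketch of both suggestions (the bucket name, mount point and file path are placeholders, and the no-mount variant is shown with the AWS CLI rather than boto3):
goofys my-scrape-bucket /mnt/s3                                                    # mount the bucket with goofys
aws s3 cp /tmp/page.pdf s3://my-scrape-bucket/pdfs/page.pdf && rm /tmp/page.pdf    # or: upload, then delete the local copy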
https://cloudkul.com/blog/mounting-s3-bucket-linux-ec2-instance/
An S3 bucket can be mounted in an AWS instance as a file system using s3fs. s3fs is a FUSE file system that allows you to mount an Amazon S3 bucket as a local file system. It behaves like a network-attached drive: it does not store anything on the Amazon EC2 instance, but the user can access the data on S3 from the EC2 instance.
The key point to take away from this is "network-attached drive," meaning it will not use any disk space on your EC2 instance aside from the dependencies you will need to install.
If the script you are using copies the file directly to a directory on the s3fs mount, it will not take up any space on the EBS volume.
If the script copies the pdf locally first, anywhere outside the s3fs mount, and then MOVES it to the s3fs mount, it is still fine. It will only take up space in the S3 bucket.
If the script copies the pdf locally first, anywhere outside the s3fs mount, and then COPIES it to the s3fs mount, it will leave a copy on EBS and take up space there as well. So you need to check whether you are copying or moving to the s3fs mount.
If you are copying, replace it with a move, or delete the source after a successful copy (see the sketch below).
So even 8 GB of space should be enough for the instance.
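To make the copy-versus-move distinction concrete, a small sketch (the bucket name, mount point and file paths are placeholders):
s3fs my-pdf-bucket /mnt/s3 -o iam_role=auto                        # mount the bucket with s3fs
mv /tmp/scraped.pdf /mnt/s3/scraped.pdf                            # option A - move: nothing is left on EBS, the file ends up only in S3
cp /tmp/scraped.pdf /mnt/s3/scraped.pdf && rm /tmp/scraped.pdf     # option B - copy: remember to delete the EBS copy afterwards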

What's a simple way to download a bunch of URLs into S3?

I have a bunch of URLs to some files I want in S3 (about 500); each is around 80-100 MB. I want to get them into S3 while staying within the free limits for everything other than S3.
What's the best way to approach this? I've put the URLs in a .txt file in S3 for now.
The way I would do it is:
Make an Excel spreadsheet of the filenames
Create a formula that creates a copy command with the filename (see below)
Launch an Amazon EC2 Linux instance in the same region as the bucket. The t2.micro is included in the free tier, but has relatively small network bandwidth. I'd splurge on a t2.large, but launch it as a Spot instance and you'll only pay a few cents. It depends whether you want to save time or save a few cents.
Connect to the EC2 instance and paste the commands from Excel
When finished, terminate the EC2 instance (it is charged per second)
The command you'd want in Excel is:
wget <URL>; aws s3 cp <filename> s3://my-bucket/<filename>; rm <filename>;
When launching the EC2 instance, also assign it an IAM Role that has permissions to access the S3 bucket.
Test it out by copying the first few files, one at a time. If that looks good, paste larger batches of 100 at a time. It might seem primitive, but it's a fast way to copy that many files. I'd do it differently if it were 1000+ files.
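If you'd rather skip Excel, the same idea works as a small shell loop over the urls.txt file (the bucket name is a placeholder):
# download each URL, push it to S3, then remove the local copy
while read -r url; do
  file=$(basename "$url")
  wget -q "$url" && aws s3 cp "$file" "s3://my-bucket/$file" && rm "$file"
done < urls.txt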

Best option to take a complete backup of an EC2 instance?

Currently I am taking a manual backup of our EC2 instance by zipping the data and downloading it locally as well as to Dropbox.
But I am wondering, can I have an option where I just take a complete copy of the whole system automatically every day, so that if something goes wrong/crashes I can replace it with the previous copy immediately rather than spending hours installing and configuring things?
I can see there is an option to take an "Image", but can I automate that so I keep just the 1 latest image and can replace the system with a single click?
You can create a single image (AMI) of your instance as a backup of your instance configuration.
And
To keep a backup of your data you can use snapshots of your volumes.
Snapshots store data incrementally whenever you make changes.
Whenever needed, you can create a volume from the snapshot and attach it to your instance.
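Roughly, with the AWS CLI (the volume, snapshot and instance IDs and the availability zone are placeholders):
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "data backup"                          # back up the data volume
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a                      # restore: new volume from the snapshot
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 --instance-id i-0123456789abcdef0 --device /dev/sdf    # attach it to the instance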
It is not a good idea to do an "external backup" of an EC2 instance snapshot before you have read the AWS pricing details.
First, AWS charges for every GB of data you transfer OUT of the AWS cloud. Check out the pricing: generally speaking, after the first GB, the rest is charged at $0.09/GB or more, versus S3 Standard pricing of ~$0.023/GB.
Second, a snapshot is actually charged at S3 pricing (see: Copying an Amazon EBS Snapshot), not EBS pricing. After weighing the transfer cost, perhaps you should consider keeping multiple snapshots rather than continuing with the transfer-out backup.
HOWEVER, if you happen to use an instance with ephemeral storage, snapshots will not help. You need to copy the data out of the ephemeral storage yourself. Then it is your choice whether to store it in S3 or elsewhere.
Third, if you worry about an AWS region going down, check the multiple-AZ option, or check out an alternate AWS region.
Fourth, when storing backup data in S3, you can always store it under Infrequent Access, which saves you some money, and you won't face an insane Glacier bill during an emergency restore (avoid Glacier unless you are pretty sure about your own requirements).
Fifth, after you have planned how to do everything inside AWS, you can write a bash script (AWS CLI) or use boto3 or another API to do the automatic backup, for example along the lines sketched below.
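A minimal sketch of such a script, meant to run from cron (the volume ID is a placeholder):
#!/bin/bash
# take a dated snapshot of the data volume
VOLUME_ID=vol-0123456789abcdef0
aws ec2 create-snapshot --volume-id "$VOLUME_ID" --description "auto backup $(date +%F)"
# list this volume's snapshots so older ones can be pruned with delete-snapshot
aws ec2 describe-snapshots --owner-ids self --filters "Name=volume-id,Values=$VOLUME_ID" \
  --query "Snapshots[].[SnapshotId,StartTime]" --output table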
Lastly, here is how AWS creates and maintains snapshots. Though each snapshot is deemed "incremental", when you delete an old snapshot:
"the snapshot deletion process is designed so that you need to retain only the most recent snapshot in order to restore the volume."
You can always "test" a restore by creating another EC2 instance that loads the backup snapshot, or you can mount the snapshot volume from another EC2 instance to check the contents.

Best setup to work with Amazon AWS

I have a website which pulls backups from different social media services, stores the data on the server, and then displays that data on my website. The content includes videos, images, and text data.
Currently I am using an EC2 instance with RDS and EBS. Data is stored in EBS volumes, but the amount of data is big, more than 1 TB, and it is increasing. Every time my EBS volume gets filled I attach another volume.
Then I added S3 to my setup. Cron jobs run and store data on S3, and the EC2 instance displays data from S3. I am using the PHP SDK for this purpose.
The problem I am facing is that S3 is very slow in my current setup.
Please suggest whether my setup is good or whether I need to change it, and how I can speed up S3, or whether I should take some other approach to my setup.
The EC2 instance is a large reserved instance running CentOS.
I have heard about s3fs, which mounts an S3 bucket to EC2 as a volume. Is this a good choice? When I mounted an S3 bucket to the EC2 instance the transfer rate was very slow.
I am new to AWS. My users do not access files directly from S3; they access them through my website, which is running on the EC2 instance.
RDS is a good choice for storing metadata such as tags, comments and other relevant information about your multimedia files. S3 is good for storing static content such as Video, Audio and Pictures. I think your approach with RDS and S3 is good enough.
EBS backed instances are good for persistence. If you store your metadata on RDS and static content on S3, the only reason why you should use EBS backed EC2 instances is that you have some configuration files which are unversioned right now. If that's not the case, assuming that your configuration is checked into version control and can be pulled on-demand for a fresh instance every time, then you might want to ditch EBS volumes in favor of ephemeral storage. That may give you some performance boost, nothing significant though.
Regarding your concern with S3's latency, yes, S3 is slow. While all your writes may happen directly to S3, I would highly recommend that you set up Amazon CloudFront for your S3 buckets and let your website consume multimedia content from CloudFront. CloudFront is a Content Delivery Network (CDN) which works with disk volumes (EBS-backed or ephemeral) as well as with S3. Setting it up takes no more than a few minutes. CloudFront also supports streaming media files over RTMP. You may need a library like GPAC for hinting multimedia files to make them streamable, if that's not being done already. You might then want to consider creating one distribution for video/audio files for streaming and another distribution for images, JavaScript, stylesheets and other text files.
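For example, a basic distribution in front of an S3 bucket can be created with a single CLI call (the bucket name here is a placeholder):
aws cloudfront create-distribution --origin-domain-name my-media-bucket.s3.amazonaws.com    # CDN in front of the bucket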
Hope this helps.
For faster downloading and uploading of files with Amazon S3 I use batch(), found here.
You can also use CloudFront to serve files faster. I think 9gag uses CloudFront as well.