File storage for social media application - amazon-web-services

I am launching a mobile application with a PHP backend hosted on 4 instances of AWS Elastic Beanstalk. For media storage (images and videos), I am not sure whether S3 is the better option or whether an EC2 instance with a shared directory will be fine.
My decision will be based on performance and throughput. For S3, I have never come across any documentation or reference that gives the throughput between EC2 and S3.

For your use case, S3 is the best option as far as image durability goes. Data transfer speeds between an EC2 instance and S3 are very fast, so you don't have to worry about that.
If you run into latency when transferring data between the EC2 instance and S3 because the instance and the S3 bucket are in different regions, AWS has just introduced S3 Transfer Acceleration: http://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
So using S3 for image file storage is the most durable and reliable option for your use case.
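Just to illustrate how little code the S3 route needs, here is a minimal sketch using boto3 (the bucket name and key layout are hypothetical, and the equivalent upload call exists in the AWS SDK for PHP if you stay in PHP):

```python
# Minimal sketch: store an uploaded image in S3 from an EC2/Beanstalk instance.
# "my-media-bucket" and the key layout are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")  # picks up the instance's IAM role credentials


def store_media(local_path, user_id, filename):
    key = f"uploads/{user_id}/{filename}"
    s3.upload_file(
        local_path,
        "my-media-bucket",
        key,
        ExtraArgs={"ContentType": "image/jpeg"},
    )
    return key  # save this key in your database alongside the post


# e.g. store_media("/tmp/photo.jpg", "42", "photo.jpg")
```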

Related

Can I use s3fs to perform "free data transfer" between AWS EC2 and S3?

I am looking to deploy a Python Flask app on an AWS EC2 (Ubuntu 20.04) instance. The app fetches data from an S3 bucket (in the same region as the EC2 instance) and performs some data processing.
I prefer using s3fs to handle the connection to my S3 bucket. However, I am unsure whether this will allow me to leverage the 'free data transfer' from S3 to EC2 in the same region, or whether I must use boto directly to facilitate this transfer.
My app works when deployed with s3fs, but I would have expected the data transfer to be much faster, so I am wondering whether EC2 is perhaps not able to "correctly" fetch data from S3 using s3fs.
All communication between Amazon EC2 and Amazon S3 in the same region will not incur a Data Transfer fee. It does not matter which library you are using.
In fact, communication between any AWS services in the same region will not incur Data Transfer fees.
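For what it's worth, a minimal s3fs sketch looks like the following (bucket and key are hypothetical); under the hood it still calls the S3 API, so same-region data transfer is free regardless of the library:

```python
# Read an object via s3fs from an EC2 instance in the same region as the bucket.
# "my-data-bucket/input/data.csv" is a hypothetical path.
import s3fs

fs = s3fs.S3FileSystem()  # uses the instance role or environment credentials

with fs.open("my-data-bucket/input/data.csv", "rb") as f:
    payload = f.read()

print(f"fetched {len(payload)} bytes")
```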

What storage mechanism to use on AWS that is accessible by both an application and a Windows executable

I am new to AWS infrastructure. I would like to know what storage mechanism I should use that can be accessed both by my WAR (hosted in AWS Elastic Beanstalk) and by a Windows service hosted on one of my AWS machines. I have a little knowledge of S3, EBS and EFS.
Use Case:
my webapp in Elastic Beanstalk would like to create objects in some storage system.
my executable on one of the AWS machines produces files that should be accessible by my webapp deployed in Elastic Beanstalk.
Questions:
Is it possible to share some storage between both my webapp and my executable?
If the answer to 1 is yes, what storage mechanism should I use?
Please advise.
S3 is the Simple Storage Service, which is reachable through a web interface. I reckon this is what you are looking for. It is reachable through a URL and is used for storing objects such as files, images, and so on.
EBS is a virtual hard disk that is attached to an EC2 instance (virtual machine).
EFS is the Elastic File System, a shared file system for Linux instances.
https://aws.amazon.com/s3/
If a full hierarchical file system is not needed and you only need to store plain objects, which seems to be the case according to the use case you have given, then Amazon S3 is the way to go.
Amazon EBS is a block storage service whose volumes can be attached to only one machine at a time; it resembles the physical hard disk attached to your home computer. For more information, read this:
EBS
Amazon EFS stands for Elastic File System. It is a file storage service rather than block storage; it differs from EBS in that it can be shared across multiple machines, and it resembles a NAS in a data center. For more information on EFS, read this:
EFS
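As a rough sketch of how both sides could share an S3 bucket (bucket name and keys are hypothetical; both the Beanstalk webapp and the Windows executable need IAM permissions for the bucket):

```python
# One process writes an object to the shared bucket, the other reads it back later.
import boto3

s3 = boto3.client("s3")
BUCKET = "shared-app-objects"  # hypothetical bucket name

# Producer side (e.g. the Windows executable): upload a generated file.
s3.put_object(Bucket=BUCKET, Key="reports/daily.csv", Body=b"id,value\n1,42\n")

# Consumer side (e.g. the webapp): fetch the same object.
obj = s3.get_object(Bucket=BUCKET, Key="reports/daily.csv")
print(obj["Body"].read().decode())
```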

AWS storage choice for file system S3 or EFS?

I want to use a file system to store XML files received over an SFTP connection to an EC2 instance. Which storage should I choose, S3 or EFS? Once the files are stored, I want to read them and process the data.
My understanding is that we should choose EFS, as S3 is not recommended for mounting as a file system. Also, it is easier to manage directory and sub-directory permissions with EFS.
The decision should depend on your budget and requirements as well.
If you want to read the files and process the data, then you can choose EFS:
Amazon EFS is a fully-managed service that makes it easy to set up and scale file storage in the Amazon Cloud. With a few clicks in the AWS Management Console, you can create file systems that are accessible to Amazon EC2 instances via a file system interface (using standard operating system file I/O APIs) and supports full file system access semantics (such as strong consistency and file locking).
Amazon EFS file systems can automatically scale from gigabytes to petabytes of data without needing to provision storage. Tens, hundreds, or even thousands of Amazon EC2 instances can access an Amazon EFS file system at the same time, and Amazon EFS provides consistent performance to each Amazon EC2 instance. Amazon EFS is designed to be highly durable and highly available. With Amazon EFS, there are no minimum fee or setup costs, and you pay only for the storage you use.
S3 would be an alternative solution if you want to download/upload the files/objects from different client platforms like Android, iOS, the web, etc.
It's hard to tell since you didn't specify the average file size, estimated storage requirements and the file usage pattern. The price difference between S3 and EFS is also an essential factor to consider.
Example:
The EC2 instance receives a file, processes it immediately, and stores the results in the database. The XML is just kept as a backup afterward and should be archived long-term for audit or recovery purposes.
In this case, I would recommend S3 with lifecycle policies to automatically migrate the data to the Glacier service for long-term archiving.
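A rough sketch of such a lifecycle rule with boto3 (bucket name, prefix, and transition period are hypothetical; the same rule can be created in the S3 console):

```python
# Transition objects under the "xml/" prefix to Glacier 30 days after creation.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="xml-archive-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-xml",
                "Filter": {"Prefix": "xml/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```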
Yes, for your use case it would be better to choose EFS, as it is easy to use and offers a simple interface that allows you to create and configure file systems quickly and easily.
When mounted on Amazon EC2 instances, an Amazon EFS file system provides a standard file system interface and file system access semantics, allowing you to seamlessly integrate Amazon EFS with your existing applications and tools. Multiple Amazon EC2 instances can access an Amazon EFS file system at the same time, allowing Amazon EFS to provide a common data source for workloads and applications running on more than one Amazon EC2 instance.
https://aws.amazon.com/documentation/efs/
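To illustrate the point about standard file system semantics, once the EFS file system is mounted (at a hypothetical mount point such as /mnt/efs) every instance just uses ordinary file I/O, with no SDK calls:

```python
# Plain file I/O against a mounted EFS path; /mnt/efs is a hypothetical mount point.
import os

EFS_DIR = "/mnt/efs/incoming"
os.makedirs(EFS_DIR, exist_ok=True)

# One instance writes a received XML file...
with open(os.path.join(EFS_DIR, "order-123.xml"), "w") as f:
    f.write("<order id='123'/>")

# ...and any other instance that has mounted the same file system can read it.
with open(os.path.join(EFS_DIR, "order-123.xml")) as f:
    print(f.read())
```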

Setting up AWS for data processing S3 or EBS?

Hey there, I am new to AWS and trying to piece together the best way to do this.
I have thousands of photos I'd like to upload and process on AWS. The software is Agisoft PhotoScan and it runs in stages. For the first stage I'd like to use an instance geared towards CPU/memory usage, and for the second stage one geared towards GPU/memory.
What is the best way to do this? Do I create a new EBS volume for each project and attach that volume to each instance when I need to? I see people saying to use S3; do I just create a bucket for each project and then attach the bucket to my instances?
Sorry for the basic questions; the more I read, the more questions I seem to have.
I'd recommend starting with S3 and seeing if it works - it will be cheaper and easier to set up. Switch to EBS volumes if you need to, but I doubt you will need to.
You could create a bucket for each project, or you could just create one bucket and segregate the images based on a file-name prefix (i.e. project1-image001.jpg).
You don't 'attach' buckets to EC2, but you should assign an IAM role to the instances as you create them, and then you can grant that IAM role permission to access the S3 bucket(s) of your choice.
Since you don't have a lot of AWS experience, keep things simple, and using S3 is about as simple as it gets.
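As a sketch of the prefix-per-project layout (bucket, prefix, and local paths are hypothetical), a processing instance could pull down one project's photos like this:

```python
# List and download every object under a project's prefix before a processing stage.
import os
import boto3

s3 = boto3.client("s3")  # credentials come from the instance's IAM role
BUCKET, PREFIX = "photoscan-projects", "project1/"  # hypothetical names

os.makedirs("/data", exist_ok=True)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        filename = obj["Key"].rsplit("/", 1)[-1]
        if not filename:  # skip "directory" placeholder keys
            continue
        s3.download_file(BUCKET, obj["Key"], f"/data/{filename}")
```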
You can go with AWS S3 to upload the photos. AWS S3 is similar to Google Drive.
If you want to use AWS EBS volumes instead of S3, the problems you may face are:
EBS volumes are accessible within an Availability Zone but not across the whole region, which means you have to create snapshots to move a volume to another Availability Zone. S3, by contrast, is accessible from anywhere.
EBS volumes are not designed for storing multimedia files; a volume is like a hard drive that you attach to an EC2 instance after launching it.
As per best practice, use AWS S3.
Based on your case, you can create a bucket for each project, or you can use a single bucket with multiple folders (key prefixes) to identify the projects.
Create an AWS IAM role with S3 access permissions and attach it to the EC2 instance. There is no need to put AWS credentials in the project; the EC2 instance will use the role to access S3, and the role does not have permanent credentials - they are rotated automatically.

How to setup shared persistent storage for multiple AWS EC2 instances?

I have a service hosted on Amazon Web Services. There I have multiple EC2 instances running with exactly the same setup and data, managed by an Elastic Load Balancer and scaling groups.
Those instances are web servers running PHP-based web applications, so currently the very same files are placed on every instance. But when the ELB / scaling group launches a new instance based on load rules etc., the files might not be up to date.
Additionally, I'd rather like to use a shared file system for PHP sessions etc. than sticky sessions.
So, my question is: for those reasons, and maybe more coming up in the future, I would like to have a shared file system entity which I can attach to my EC2 instances.
What way would you suggest to resolve this? Are there any solutions offered by AWS directly, so I can rely on their services rather than doing it on my own with DRBD and so on? What is the easiest approach? DRBD, NFS, ...? Is S3 also feasible for those purposes?
Thanks in advance.
As mentioned in a comment, AWS has announced EFS (http://aws.amazon.com/efs/) a shared network file system. It is currently in very limited preview, but based on previous AWS services I would hope to see it generally available in the next few months.
In the meantime there are a couple of third party shared file system solutions for AWS such as SoftNAS https://aws.amazon.com/marketplace/pp/B00PJ9FGVU/ref=srh_res_product_title?ie=UTF8&sr=0-3&qid=1432203627313
S3 is possible but not always ideal; the main blocker is that it does not natively support any filesystem protocols - instead, all interactions need to go via the AWS API or via HTTP calls. Additionally, when looking at using it as a session store, the 'eventually consistent' model will likely cause issues.
That being said, if all you need is updated resources, you could create a simple script, run either as a cron job or on startup, that downloads the files from S3.
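A rough sketch of that startup/cron sync, assuming a hypothetical bucket, prefix, and web root:

```python
# Pull the current web resources from S3 into the local web root on boot.
import os
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX, WEBROOT = "app-resources", "webroot/", "/var/www/html"  # hypothetical

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        rel = obj["Key"][len(PREFIX):]
        if not rel or rel.endswith("/"):  # skip "directory" placeholder keys
            continue
        dest = os.path.join(WEBROOT, rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(BUCKET, obj["Key"], dest)
```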
Finally, in the case of static resources like CSS/images, don't store them on your web server in the first place - there are plenty of articles covering the benefits of storing and serving static web resources directly from S3 while keeping the dynamic content on your server.
From what we can tell at this point, EFS is expected to provide basic NFS file sharing on SSD-backed storage. Once available, it will be a v1.0 proprietary file system. There is no encryption, and it is AWS-only. The data is completely under AWS control.
SoftNAS is a mature, proven advanced ZFS-based NAS Filer that is full-featured, including encrypted EBS and S3 storage, storage snapshots for data protection, writable clones for DevOps and QA testing, RAM and SSD caching for maximum IOPS and throughput, deduplication and compression, cross-zone HA and a 100% up-time SLA. It supports NFS with LDAP and Active Directory authentication, CIFS/SMB with AD users/groups, iSCSI multi-pathing, FTP and (soon) AFP. SoftNAS instances and all storage is completely under your control and you have complete control of the EBS and S3 encryption and keys (you can use EBS encryption or any Linux compatible encryption and key management approach you prefer or require).
The ZFS filesystem is a proven filesystem that is trusted by thousands of enterprises globally. Customers are running more than 600 million files in production on SoftNAS today - ZFS is capable of scaling into the billions.
SoftNAS is cross-platform and runs on cloud platforms other than AWS, including Azure, CenturyLink Cloud, Faction cloud, VMware vSphere/ESXi, VMware vCloud Air and Hyper-V, so your data is not limited to or locked into AWS. More platforms are planned. It provides cross-platform replication, making it easy to migrate data between any supported public cloud, private cloud, or premises-based data center.
SoftNAS is backed by industry-leading technical support from cloud storage specialists (it's all we do), something you may need or want.
Those are some of the more noteworthy differences between EFS and SoftNAS. For a more detailed comparison chart:
https://www.softnas.com/wp/nas-storage/softnas-cloud-aws-nfs-cifs/how-does-it-compare/
If you are willing to roll your own HA NFS cluster, and be responsible for its care, feeding and support, then you can use Linux and DRBD/corosync or any number of other Linux clustering approaches. You will have to support it yourself and be responsible for whatever happens.
There's also GlusterFS. It does well up to 250,000 files (in our testing) and has been observed to suffer from an IOPS brownout when approaching 1 million files, and IOPS blackouts above 1 million files (according to customers who have used it). For smaller deployments it reportedly works reasonably well.
Hope that helps.
CTO - SoftNAS
For keeping your web server sessions in sync, you can easily switch to Redis or Memcached as your session handler. This is a simple setting in php.ini, and all servers can then use the same Redis or Memcached server for sessions. You can use Amazon ElastiCache, which will manage the Redis or Memcached instance for you.
http://phpave.com/redis-as-a-php-session-handler/ <- explains how to set up Redis with PHP pretty easily
Keeping your files in sync is a little bit more complicated.
How do I push new code changes to all my web servers?
You could use Git. When you deploy, you can set up multiple servers as remotes and push your branch (master) to all of them, so every new build goes out to all web servers.
What about new machines that launch?
I would set up new machines to run an rsync script from a trusted source, your master web server. That way they sync their web folders with the master when they boot and will be identical even if the AMI had old web files in it.
What about files that change and need to be updated live?
Store any user-uploaded files in S3. If a user uploads a document on Server 1, the file is stored in S3 and its location is stored in a database. Then, if a different user is on Server 2, they can see and access the same file as if it were on Server 2: the file is retrieved from S3 and served to the client.
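A sketch of that pattern (bucket, key, and expiry are hypothetical): the receiving server uploads to S3 and records the key, and any other server can later hand the client a short-lived presigned URL instead of looking for the file on its own disk:

```python
# Upload on one server, serve from any server via a presigned URL.
import boto3

s3 = boto3.client("s3")
BUCKET = "user-uploads"  # hypothetical bucket

# On the server that receives the upload:
key = "users/42/report.pdf"
s3.upload_file("/tmp/report.pdf", BUCKET, key)  # then store `key` in the database

# On any other server, when the file is requested:
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": BUCKET, "Key": key},
    ExpiresIn=3600,  # link valid for one hour
)
# Redirect the client to `url` (or proxy the object via get_object).
```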
GlusterFS is also an open-source distributed file system used by many to create shared storage across EC2 instances.
Until Amazon EFS hits production, the best approach in my opinion is to build a storage backend that exports NFS from EC2 instances, perhaps using Pacemaker/Corosync to achieve HA.
You could create an EBS volume that stores the files and instruct Pacemaker to unmount/detach and then attach/mount the EBS volume on the healthy NFS cluster node.
Hi, we currently use a product called SoftNAS in our AWS environment. It allows us to choose between both EBS- and S3-backed storage. It has built-in replication as well as a high-availability option. It may be something you can check out; I believe they offer a free trial you can try on AWS.
We are using ObjectiveFS and it is working well for us. It uses S3 for storage and is straight forward to set up.
They've also written a doc on how to share files between EC2 instances.
http://objectivefs.com/howto/how-to-share-files-between-ec2-instances