Folks,
I've set up an SFTP server on an EC2 instance to receive files from remote customers. Each customer connects several times a day and uploads the same 3 files each time (the file names stay the same, but the contents change). This works fine as long as the number of customers connecting simultaneously stays under control, but I cannot control exactly when each customer connects (they have automated the connection process at their end). I anticipate a bottleneck if too many customers try to upload files at the same time, so I have been looking for alternatives to the whole process ("distributed file transfer" of some sort). That's when I stumbled upon AWS S3, which is distributed by design, and I was wondering if I could do something like the following:
Create a bucket called "incoming-files"
Create several folders inside this bucket, one for each customer
Set up a file transfer mechanism (I believe I'd have to use S3's SDK somehow)
Provide a client application for each customer, so that they can run it at their side to upload the files to their specific folders inside the bucket
This last point is easy with SFTP, since you can set a "root" folder for each user, so that when the user connects to the server they automatically land in their own folder. I'm not sure whether something like this can be worked out on S3. The file transfer mechanism would also have to provide not only credentials to access the bucket, but also "sub-credentials" to access the folder.
I have been digging into S3 but couldn't quite figure out whether this whole idea is (a) feasible and (b) practical. The other limitation of my original SFTP solution is that a single SFTP server is a single point of failure, which I'd be glad to avoid. I'd be thrilled if someone could shed some light on this (btw, other solutions are also welcome).
Note that I am trying to eliminate the SFTP server altogether, and not mount an S3 bucket as the "root folder" for the SFTP server.
Thank you
You can create an S3 policy that grants access only to a certain prefix (a "folder" in your plan). The only thing your customers need is permission to make PUT requests. You will also need to create a set of access keys for each customer.
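As a rough illustration of the per-prefix policy, here is a minimal sketch that builds an IAM policy document allowing only `s3:PutObject` under one customer's prefix. The bucket name comes from the question; the function name and the `customer-a` prefix are made up for the example, and you would still need to attach the resulting policy to that customer's IAM user yourself.

```python
import json


def customer_upload_policy(bucket: str, customer_prefix: str) -> dict:
    """Build an IAM policy document that only allows uploads under one prefix.

    The function name and the Sid are illustrative; the policy grammar
    (Version, Statement, Effect, Action, Resource) is standard IAM JSON.
    """
    prefix = customer_prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowUploadToOwnPrefix",
                "Effect": "Allow",
                # PUT only: the customer can upload but not list, read or delete.
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }


if __name__ == "__main__":
    # "customer-a" is a hypothetical customer folder.
    print(json.dumps(customer_upload_policy("incoming-files", "customer-a"), indent=2))
```

With a policy like this, a customer's keys can write to `incoming-files/customer-a/...` and nothing else, which is roughly the S3 counterpart of an SFTP "root" folder.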
It seems you're overcomplicating this. If SFTP is a bottleneck and is not redundant, you can always create an Auto Scaling group (with an ELB or DNS round-robin in front of it) and mount S3 on the EC2 instances with s3fs or goofys. If cost is not an issue, you can even mount EFS as an NFS share.
AWS has an example configuration here that seems like it may meet your needs pretty well.
I think you're definitely right to consider S3 over a traditional SFTP setup. If you do go with a server-based approach, I agree with Sergey's answer: an auto-scaling group of servers backed by shared EFS storage. You will, of course, have to own the maintenance of those servers, which may or may not be an issue depending on your expertise and your desire to do so.
A pure S3 solution, however, will almost certainly be cheaper and require less maintenance in the long run.
There is now an AWS managed SFTP service in the AWS Transfer family.
https://aws.amazon.com/blogs/aws/new-aws-transfer-for-sftp-fully-managed-sftp-service-for-amazon-s3/
Today we are launching AWS Transfer for SFTP, a fully-managed, highly-available SFTP service. You simply create a server, set up user accounts, and associate the server with one or more Amazon Simple Storage Service (S3) buckets. You have fine-grained control over user identity, permissions, and keys. You can create users within Transfer for SFTP, or you can make use of an existing identity provider. You can also use IAM policies to control the level of access granted to each user. You can also make use of your existing DNS name and SSH public keys, making it easy for you to migrate to Transfer for SFTP. Your customers and your partners will continue to connect and to make transfers as usual, with no changes to their existing workflows.
Related
Problem:
We need to transfer all files (CSV format) stored in an AWS S3 bucket to an on-premises LAN folder using Lambda functions. This will be a scheduled task that runs every hour; each run transfers the files from S3 to the on-premises LAN folder again, replacing the existing ones. The files are not large (ideally under a few MB each).
I have not been able to find any AWS managed service that accomplishes this task.
I am a newbie to AWS, so any solution to this problem is most welcome.
Thanks,
Actually, I am looking for a solution by which I can push S3 files to on-premise folder automatically
For that you need to make the on-premises network visible to whatever logic (Lambda or anything else) is "pushing" the content. The default solution is AWS Site-to-Site VPN.
There are multiple options for setting up the VPN; you can choose based on your needs.
Then the on-premise network will look just like another subnet.
However, a VPN has its own complexity and cost. In most cases it is much easier to "pull" the data from the on-premises environment.
There are multiple options for syncing the data. For a managed service, I could point to AWS Storage Gateway (the S3 File Gateway), which, based on your description, sounds like insane overkill.
Maybe you could start with a simple cron job (or a scheduled task if you are on Windows) that runs a CLI command to sync the S3 content or just copy specific files.
Check out S3 Sync, I think it will help you accomplish this task: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html#examples
To run any AWS CLI command on your computer, you will need to set up credentials, and the configured account or role must have permission to do the task (e.g. access S3).
Check out AWS CLI setup here: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
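To make the "pull from on-premises" idea concrete, here is a minimal sketch of a script that a cron job or scheduled task could run every hour. It only wraps `aws s3 sync` with `--exclude`/`--include` filters so that just the CSV files are pulled. The function names, bucket, prefix and local path are hypothetical, and the script assumes the AWS CLI is installed and configured as described above.

```python
import subprocess
from typing import List


def build_sync_command(bucket: str, prefix: str, local_dir: str) -> List[str]:
    """Return the argv for an `aws s3 sync` call that pulls only CSV files."""
    cleaned = prefix.strip("/")
    source = f"s3://{bucket}/{cleaned}" if cleaned else f"s3://{bucket}"
    return [
        "aws", "s3", "sync", source, local_dir,
        # Filters are applied in order: exclude everything, then re-include CSVs.
        "--exclude", "*",
        "--include", "*.csv",
    ]


def pull_csv_files(bucket: str, prefix: str, local_dir: str) -> None:
    """Run the sync; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(build_sync_command(bucket, prefix, local_dir), check=True)


if __name__ == "__main__":
    # Hypothetical bucket, prefix and LAN folder; replace with your own.
    print(" ".join(build_sync_command("example-bucket", "exports", "/mnt/lan/csv")))
```

Scheduling is then a matter of calling `pull_csv_files` from an hourly cron entry (or a Windows scheduled task) on a machine that can write to the LAN folder.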
I have set up an AWS Elastic Beanstalk instance where a server app is deployed. In the backend, users can change files in the images/ directory. But when the instances are autoscaled, the user files are not mirrored.
How can I solve this? Can I set up EC2 to create a new AMI each night based on the latest files and use that for autoscaling? Or is there a better approach?
There are a few options, but first you need to understand that in order to scale horizontally (add more servers), your servers have to be "stateless", meaning they can't hold any special information, including but not limited to user-uploaded files.
The current "best practice" (and the cheapest, most reliable, most robust, etc.) option is to save user-uploaded files on S3 instead of keeping them on the filesystem. Then, when you need them, read them from S3.
Another option, still in Preview Mode, is to use the new Elastic File System service to provide you with a scalable, reliable, shared hard drive. All of your servers can connect to your EFS drive, and they all have access to the same files. You can scale up and down without losing your files.
I have a service hosted on Amazon Web Services. There I have multiple EC2 instances running with the exact same setup and data, managed by an Elastic Load Balancer and scaling groups.
Those instances are web servers running web applications based on PHP. So currently there are the very same files etc. placed on every instance. But when the ELB / scaling group launches a new instance based on load rules etc., the files might not be up-to-date.
Additionally, I'd prefer to use a shared file system for PHP sessions etc. rather than sticky sessions.
So, for those reasons and maybe more coming up in the future, I would like to have a shared file system that I can attach to my EC2 instances.
How would you suggest solving this? Are there any solutions offered by AWS directly, so I can rely on their services rather than doing it on my own with DRBD and so on? What is the easiest approach? DRBD, NFS, ...? Is S3 also feasible for these purposes?
Thanks in advance.
As mentioned in a comment, AWS has announced EFS (http://aws.amazon.com/efs/) a shared network file system. It is currently in very limited preview, but based on previous AWS services I would hope to see it generally available in the next few months.
In the meantime there are a couple of third party shared file system solutions for AWS such as SoftNAS https://aws.amazon.com/marketplace/pp/B00PJ9FGVU/ref=srh_res_product_title?ie=UTF8&sr=0-3&qid=1432203627313
S3 is possible but not always ideal. The main blocker is that it does not natively support any filesystem protocols; instead, all interactions need to go through an AWS API or HTTP calls. Additionally, if you are looking at using it as a session store, the 'eventually consistent' model will likely cause issues.
That being said, if all you need is updated resources, you could create a simple script, run either from cron or on startup, that downloads the files from S3.
Finally, in the case of static resources like CSS and images, don't store them on your web server in the first place. There are plenty of articles covering the benefits of storing and serving static web resources directly from S3 while keeping the dynamic stuff on your server.
From what we can tell at this point, EFS is expected to provide basic NFS file sharing on SSD-backed storage. Once available, it will be a v1.0 proprietary file system. There is no encryption and it's AWS-only. The data is completely under AWS control.
SoftNAS is a mature, proven advanced ZFS-based NAS Filer that is full-featured, including encrypted EBS and S3 storage, storage snapshots for data protection, writable clones for DevOps and QA testing, RAM and SSD caching for maximum IOPS and throughput, deduplication and compression, cross-zone HA and a 100% up-time SLA. It supports NFS with LDAP and Active Directory authentication, CIFS/SMB with AD users/groups, iSCSI multi-pathing, FTP and (soon) AFP. SoftNAS instances and all storage is completely under your control and you have complete control of the EBS and S3 encryption and keys (you can use EBS encryption or any Linux compatible encryption and key management approach you prefer or require).
The ZFS filesystem is a proven filesystem that is trusted by thousands of enterprises globally. Customers are running more than 600 million files in production on SoftNAS today - ZFS is capable of scaling into the billions.
SoftNAS is cross-platform and runs on cloud platforms other than AWS, including Azure, CenturyLink Cloud, Faction cloud, VMware vSphere/ESXi, VMware vCloud Air and Hyper-V, so your data is not limited or locked into AWS. More platforms are planned. It provides cross-platform replication, making it easy to migrate data between any supported public cloud, private cloud, or on-premises data center.
SoftNAS is backed by industry-leading technical support from cloud storage specialists (it's all we do), something you may need or want.
Those are some of the more noteworthy differences between EFS and SoftNAS. For a more detailed comparison chart:
https://www.softnas.com/wp/nas-storage/softnas-cloud-aws-nfs-cifs/how-does-it-compare/
If you are willing to roll your own HA NFS cluster, and be responsible for its care, feeding and support, then you can use Linux and DRBD/corosync or any number of other Linux clustering approaches. You will have to support it yourself and be responsible for whatever happens.
There's also GlusterFS. It does well up to 250,000 files (in our testing) and has been observed to suffer from an IOPS brownout when approaching 1 million files, and IOPS blackouts above 1 million files (according to customers who have used it). For smaller deployments it reportedly works reasonably well.
Hope that helps.
CTO - SoftNAS
To keep your web server sessions in sync, you can easily switch to Redis or Memcached as your session handler. This is a simple setting in php.ini, and all servers can then use the same Redis or Memcached server for sessions. You can use Amazon's ElastiCache, which will manage the Redis or Memcached instance for you.
http://phpave.com/redis-as-a-php-session-handler/ <- explains how to set up Redis with PHP pretty easily
Keeping your files in sync is a little bit more complicated.
How do I push new code changes to all my web servers?
You could use Git. When you deploy, you can set up multiple servers as targets and push your branch (master) to all of them, so every new build goes out to all web servers.
What about new machines that launch?
I would set up new machines to run an rsync script against a trusted source, your master web server. That way they sync their web folders with the master when they boot and end up identical, even if the AMI had old web files in it.
What about files that change and need to be live updated?
Store any user-uploaded files in S3. If a user uploads a document on server 1, the file is stored in S3 and its location is stored in a database. Then, if a different user is on server 2, they can see the same file and access it as if it were on server 2. The file is retrieved from S3 and served to the client.
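The upload flow above can be sketched roughly as follows. This is only an illustration under a few assumptions: `s3_client` is expected to behave like a boto3 S3 client (its `put_object` and `get_object` calls are used), `db` is any DB-API connection (SQLite here for brevity), and the table layout, key scheme and function names are invented for the example.

```python
import sqlite3
import uuid


def create_schema(db) -> None:
    """Create the (hypothetical) table that maps file names to S3 locations."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS uploads ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "filename TEXT, bucket TEXT, s3_key TEXT)"
    )


def save_upload(s3_client, db, bucket: str, filename: str, data: bytes) -> str:
    """Store the bytes in S3 and record their location in the database."""
    # A random path component keeps uploads with the same name from colliding.
    key = f"uploads/{uuid.uuid4().hex}/{filename}"
    s3_client.put_object(Bucket=bucket, Key=key, Body=data)
    db.execute(
        "INSERT INTO uploads (filename, bucket, s3_key) VALUES (?, ?, ?)",
        (filename, bucket, key),
    )
    db.commit()
    return key


def load_upload(s3_client, db, filename: str) -> bytes:
    """Look up the newest upload with this name and fetch it from S3."""
    row = db.execute(
        "SELECT bucket, s3_key FROM uploads "
        "WHERE filename = ? ORDER BY id DESC LIMIT 1",
        (filename,),
    ).fetchone()
    if row is None:
        raise KeyError(filename)
    bucket, key = row
    return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
```

In a real deployment the connection would point at the shared application database rather than SQLite, so that any server behind the load balancer can resolve a file name to the same S3 object.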
GlusterFS is also an open source distributed file system used by many to create shared storage across EC2 instances.
Until Amazon EFS hits production, the best approach in my opinion is to build a storage backend exporting NFS from EC2 instances, maybe using Pacemaker/Corosync to achieve HA.
You could create an EBS volume that stores the files and instruct Pacemaker to unmount/detach it and then attach/mount it on the healthy NFS cluster node.
Hi, we currently use a product called SoftNAS in our AWS environment. It allows us to choose between EBS-backed and S3-backed storage. It has built-in replication as well as a high availability option. It may be something you can check out. I believe they offer a free trial you can try out on AWS.
We are using ObjectiveFS and it is working well for us. It uses S3 for storage and is straightforward to set up.
They've also written a doc on how to share files between EC2 instances.
http://objectivefs.com/howto/how-to-share-files-between-ec2-instances
We have several environments in Engine Yard. Each of them runs the same application, but at a different stage: production, staging, etc. There are about 10 environments in total. Now, we want to dump the production database every night and restore it on the rest of the environments so they have the latest data.
The problem is, an instance from one environment can't access instances in other environments. There are two ways to connect that are suitable for us:
SSH.
Specify the RDS host as the --host parameter to mysqldump. The RDS host is of the form environment.random_string.region.rds.amazonaws.com as opposed to a regular EC2 host name.
Neither of them works out of the box. The straightforward solution would be to generate RSA keys on all the servers that need access and add them to authorized_keys on all the servers that should allow access. However, this solution isn't scalable: every time we add or recreate an environment, we'd need to repeat the process.
Is there any better solution?
There is a way to set up a special backup configuration file on your other instances that would allow you to directly access the Production S3 bucket from another environment within the same account. There is some risk involved with this, since it would also technically give your non-production environment the ability to edit the contents of the production bucket.
There may be some other options depending on the specifics of your configuration. Your best option would be to open a ticket with the Engine Yard Support team so we can discuss your needs further.
Is it possible to set up a separate hub server that runs only an FTP or SFTP service?
Open inbound port 21/22 from all environments to the hub server, so all clients can download the database dump.
Open inbound port 3306 (or whichever database port you use) from the hub server to the RDS instance/database.
Run a cron job on the hub server to get the DB dump, push the dump to the other environments, and so on.
Back up your production database to an S3 bucket created for this purpose.
Use IAM roles to control how your other environments can connect to the same bucket.
Since the server of your Production environment should be known you can use a script to mysqldump that one server to the shared S3 bucket.
Once complete, your other servers can collect the data from that S3 bucket using a properly authorized IAM role.
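The steps above could be scripted along these lines. This is a minimal sketch, not a tested backup tool: it assumes `s3_client` behaves like a boto3 S3 client whose IAM role may write to the shared bucket, that `mysqldump` is on the PATH and reads its password from an option file such as `~/.my.cnf`, and that the dump is small enough to hold in memory. The key layout and function names are hypothetical.

```python
import datetime
import gzip
import subprocess
from typing import List, Optional


def dump_key(environment: str, day: datetime.date) -> str:
    """Build a dated object key, e.g. db-dumps/production/2024-01-31.sql.gz."""
    return f"db-dumps/{environment}/{day.isoformat()}.sql.gz"


def mysqldump_command(host: str, user: str, database: str) -> List[str]:
    """Return the argv for a consistent dump of one database."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--single-transaction",  # consistent InnoDB snapshot without locking tables
        database,
    ]


def nightly_dump(s3_client, bucket: str, environment: str, host: str,
                 user: str, database: str,
                 today: Optional[datetime.date] = None) -> str:
    """Dump the database, compress it, and upload it to the shared bucket."""
    day = today or datetime.date.today()
    result = subprocess.run(
        mysqldump_command(host, user, database),
        check=True,
        capture_output=True,
    )
    key = dump_key(environment, day)
    s3_client.put_object(Bucket=bucket, Key=key, Body=gzip.compress(result.stdout))
    return key
```

The other environments can then compute the same key for the current date and download the dump with their own read-only IAM role, so no SSH keys need to be exchanged between environments.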
Here is my scenario.
We have an ELB set up with two reserved EC2 instances acting as web servers behind it (Amazon Linux).
There are some rapidly changing files (pdf, xls, jpg, etc) on the web server which are consumed by the websites hosted on the EC2 instances. Code files are identical and we will be sure to update both the servers manually at the same time with new code as and when needed.
The main problem is the user uploaded content which is stored on the EC2 instance.
What is the best approach to make sure that the uploaded files are available on both servers almost instantly?
Many people have suggested rsync or unison, but this would involve setting up a cron job. I am looking for something like FileSystemWatcher in C#, which is triggered ONLY when the contents of the specified folder change. Moreover, because of the ELB, we cannot know which of the EC2 instances the user will be connected to when the files are uploaded.
To add to the above, we have one more staging server which pushes certain files to one of the EC2 web servers. We want these files replicated to the other instance too.
I was wondering whether S3 can solve the problem? Will this setup still work if we decide to enable auto scaling?
I am confused at this stage. Please help.
S3 is the right choice for your case. That way, you don't have to sync files between EC2 instances. It is also probably the best choice if you need to enable auto scaling. You should not keep any data on the EC2 instances; they should be stateless so that you can easily auto scale.
Using S3 requires your application to support it instead of writing directly to the local file system. This should be quite easy; there are libraries for most languages that can help you store files in S3.