AWS Storage Gateway for caching millions of files in S3

We have a use case wherein we need to access millions of files from a Java application. Currently we are storing them on an EBS volume. This is turning out to be an expensive option (as we have reached up to 15TB now), so we are looking at S3 as the file storage. We are okay with bearing the latency.
One option is to mount S3 using s3fs and access the files that way. But I was exploring AWS Storage Gateway to see whether it can provide better caching and faster access. We have faced quite a few issues with s3fs, so I was looking for alternatives.

Avoid using s3fs if possible, because it merely emulates a file system and is likely to run into problems under high utilization.
The best solution is for your application to access the files directly from Amazon via S3 API calls, rather than pretending that S3 is a filesystem. This works very nicely for large-scale applications, and you would have no administration/maintenance overhead because your application communicates directly with S3. You should seriously consider this option.
If you do really need to access the files via a filesystem, consider using AWS Storage Gateway – File Gateway, which can present S3 storage as an NFS share.
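If you do go the direct-API route, a minimal sketch with boto3 (the Python SDK) is below; the question mentions a Java application, and the AWS SDK for Java exposes the same GetObject operation, so the pattern carries over. The bucket and key names are hypothetical.
import boto3
s3 = boto3.client("s3")  # credentials come from the instance role or environment
def read_file(key):
    # fetch the object straight from S3 instead of going through a mounted filesystem
    response = s3.get_object(Bucket="my-app-files", Key=key)  # hypothetical bucket
    return response["Body"].read()
data = read_file("reports/2023/summary.xml")  # hypothetical key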

Related

What is better Mounting S3 bucket or copying files from S3 bucket to windows EC2 instance?

I have a use case where CSV files are stored in an S3 bucket by a service. My program running on a Windows EC2 instance has to use the CSV files dumped into that S3 bucket. Mounting or copying, which approach is better for using the files, and how should I go about it?
Mounting the bucket as a local Windows drive will just cache info about the bucket and copy the files locally when you try to access them. Either way you will end up having the files copied to the Windows machine. If you don't want to program the knowledge of the S3 bucket into your application then the mounting system can be an attractive solution, but in my experience it can be very buggy. I built a system on Windows machines in the past that used an S3 bucket mounting product, but after so many bugs and failures I ended up rewriting it to simply perform an aws s3 sync operation to a local folder before the process ran.
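If you go the copy-before-processing route without shelling out to the CLI, a rough boto3 equivalent of that sync step might look like this sketch (bucket, prefix, and local folder are hypothetical):
import os
import boto3
s3 = boto3.client("s3")
bucket = "incoming-csv-bucket"   # hypothetical bucket
prefix = "daily-dumps/"          # hypothetical prefix
local_dir = r"C:\data\csv"       # local folder the Windows job reads from
os.makedirs(local_dir, exist_ok=True)
# pull down every CSV under the prefix before the processing job starts
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith(".csv"):
            s3.download_file(bucket, key, os.path.join(local_dir, os.path.basename(key)))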
I always suggest copying, whether via the CLI, the SDK, or the API endpoints directly, in whatever way AWS recommends, rather than mounting.
Actually, S3 is not built for filesystem purposes. It's an object storage system. Not saying that you cannot do it, but it is not advisable. The correct way to use Amazon S3 is to put/get files using the S3 APIs.
And if you are concerned about network latency, I would say both will be about the same. If you are thinking about directly modifying/editing a file within the file system, you cannot: since Amazon S3 is designed for atomic operations, objects have to be completely replaced with their modified versions.

AWS CLI S3 CP performance is painfully slow

I've got an issue whereby uploads to and downloads from AWS S3 via the aws cli are very slow. By very slow I mean it consistently takes around 2.3s for a 211k file, which indicates an average download speed of less than 500Kb/s, extremely slow for such a small file. My webapp is heavily reliant on internal APIs, and I've narrowed down that the bulk of the APIs' round-trip time is predominantly spent uploading and downloading files from S3.
Some details:
Using the latest version of aws cli (aws-cli/1.14.44 Python/3.6.6, Linux/4.15.0-34-generic botocore/1.8.48) on an AWS hosted EC2 instance
Instance is running the latest version of Ubuntu (18.04)
Instance is in region ap-southeast-2a (Sydney)
Instance is granted role based access to S3 via a least privilege policy (i.e. minimum rights to the buckets that it needs access to)
Type is t2.micro which should have Internet Bandwidth of ~60Mb or so
S3 buckets are in ap-southeast-2
Same result with encrypted (default) and unencrypted files
Same result with files regardless of whether they have a random collection of alphanumeric characters in the object name
The issue persists consistently, even after multiple cp attempts and after a reboot the cp attempt consistently takes 2.3s
This leads me to wonder whether S3 or the EC2 instance (which is using a standard Internet Gateway) is throttled back
I've tested downloading the same file on the same instance from a webserver using wget and it takes 0.0008s (i.e. 0.8ms)
So to summarise:
Downloading the file from S3 via the AWS CLI takes 2.3s (i.e. 2300ms)
Downloading the same file from a webserver (> Internet > Cloudflare > AWS > LB > Apache) via wget takes 0.0008s (i.e. 0.8ms)
I need to improve AWS CLI S3 download performance because the API is going to be quite heavily used in the future.
Okay, this was a combination of things.
I'd had problems with the AWS PHP API SDK previously (mainly related to orphaned threads when copying files), so I had changed my APIs to use the AWS CLI for simplicity and reliability reasons, and although they worked, I encountered a few performance issues:
Firstly, because my instance had role-based access to my S3 buckets, the aws CLI was taking around 1.7s just to determine which region my buckets were in. Configuring the CLI to point to a default region overcame this.
Secondly, because PHP has to invoke a whole new shell when running an exec() command (e.g. exec("aws s3 cp s3://bucketname/objectname.txt /var/app_path/objectname.txt")), that is a very slow exercise. I know it's possible to offload shell commands via Gearman or similar, but since simplicity was one of my goals, I didn't want to go down that road.
Finally, because the AWS CLI uses Python, it takes almost 0.4s just to start up, before it even begins processing a command. That might not seem like a lot, but when my API is in production use it will have quite an impact on users and infrastructure alike.
To cut a long story short, I've done two things:
Reverted to using the AWS PHP API SDK instead of the AWS CLI
Referring to the correct S3 region name within my PHP code
My APIs are now performing much better, i.e. from 2.3s to an average of around 0.07s.
This doesn't make my original issue go away but at least performance is much better.
I found that when I tried to download an object using aws s3 cp, the download would hang close to finishing when the object size was greater than 500MB.
However, using get-object directly causes no hang or slowdown whatsoever. Therefore instead of using
aws s3 cp s3://my-bucket/path/to/my/object .
I get the object with
aws s3api get-object --bucket my-bucket --key path/to/my/object out-file
and I experience no slowdown.
AWS S3 is slow and painfully complex, and you can't easily search for files. If used with CloudFront it is faster and there are supposed to be advantages, but complexity shifts from very complex to insanely complex, because caching obfuscates any file changes, and invalidating the cache is hit and miss unless you change the file name, which involves changing the file name in the page referencing that file.
In practice, particularly if all or most of your traffic is located in the same region as your load balancer, I have found even a low-specced web server located in the same region is faster by factors of 10. If you need multiple web servers attached to a common volume, AWS only provides this in certain regions, so I got around this by using NFS to share the volume across multiple web servers. This gives you a file system that is mounted on a server you can log in to, and list and find files. S3 has become a turnkey solution for a problem that was solved better a couple of decades ago.
You may try using boto3 to download files instead of aws s3 cp.
Refer to Downloading a File from an S3 Bucket
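For reference, a minimal boto3 download sketch; pinning region_name up front also avoids the region-lookup delay described in the accepted answer above (bucket, key, and region are assumptions):
import boto3
# creating the client with an explicit region avoids an extra lookup round trip
s3 = boto3.client("s3", region_name="ap-southeast-2")
s3.download_file("my-bucket", "path/to/my/object", "object.out")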
While my download speeds weren't as slow as yours, I managed to max out my ISP's download bandwidth with aws s3 cp by adding the following configuration to my ~/.aws/config:
[profile default]
s3 =
max_concurrent_requests = 200
max_queue_size = 5000
multipart_threshold = 4MB
multipart_chunksize = 4MB
If you don't want to edit the config file, you can probably use CLI parameters instead. Have a look at the documentation: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html
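If you end up calling S3 from Python code rather than the CLI, the analogous knobs live in boto3's TransferConfig; a sketch mirroring the config above (bucket and key are hypothetical, and the values are the same assumptions as above, not recommendations):
import boto3
from boto3.s3.transfer import TransferConfig
config = TransferConfig(
    multipart_threshold=4 * 1024 * 1024,  # start multipart at 4MB, as in the CLI config above
    multipart_chunksize=4 * 1024 * 1024,  # 4MB parts
    max_concurrency=200,                  # mirrors max_concurrent_requests
)
s3 = boto3.client("s3")
s3.download_file("my-bucket", "big/object.bin", "object.bin", Config=config)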

AWS storage choice for file system S3 or EFS?

I want to use a file system to store XML files received over an SFTP connection to an EC2 instance. Which storage should I choose, S3 or EFS? Once the files are stored, I want to read them and process the data.
My understanding is that we should choose EFS, as mounting S3 as a file system is not recommended. Also, it is easier to manage directory and sub-directory permissions with EFS.
The decision should depend on the budget and requirements as well.
If you want to read the files and process the data, then you can choose EFS:
Amazon EFS is a fully-managed service that makes it easy to set up and scale file storage in the Amazon Cloud. With a few clicks in the AWS Management Console, you can create file systems that are accessible to Amazon EC2 instances via a file system interface (using standard operating system file I/O APIs) and supports full file system access semantics (such as strong consistency and file locking).
Amazon EFS file systems can automatically scale from gigabytes to petabytes of data without needing to provision storage. Tens, hundreds, or even thousands of Amazon EC2 instances can access an Amazon EFS file system at the same time, and Amazon EFS provides consistent performance to each Amazon EC2 instance. Amazon EFS is designed to be highly durable and highly available. With Amazon EFS, there are no minimum fee or setup costs, and you pay only for the storage you use.
And S3 would be an alternative solution if you want to download/upload the files/objects from different client platforms like Android, iOS, Web, etc.
It's hard to tell since you didn't specify the average file size, estimated storage requirements and the file usage pattern. The price difference between S3 and EFS is also an essential factor to consider.
Example:
The EC2 instance receives a file, processes it immediately and stores the results in the database. The XML is just kept as a backup afterwards and should be archived long-term for audit or recovery purposes.
In this case, I would recommend S3 and lifecycle policies to automatically migrate the data to Glacier for long-term archiving.
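A sketch of what such a lifecycle rule could look like with boto3 (the bucket name, prefix, and 30-day threshold are assumptions; the same rule can be set up in the console):
import boto3
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="xml-backup-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-processed-xml",
            "Filter": {"Prefix": "processed/"},  # hypothetical prefix
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],  # archive after 30 days
        }]
    },
)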
Yes, for your use case it would be better if you choose EFS as it is easy to use and offers a simple interface that allows you to create and configure file systems quickly and easily.
When mounted on Amazon EC2 instances, an Amazon EFS file system provides a standard file system interface and file system access semantics, allowing you to seamlessly integrate Amazon EFS with your existing applications and tools. Multiple Amazon EC2 instances can access an Amazon EFS file system at the same time, allowing Amazon EFS to provide a common data source for workloads and applications running on more than one Amazon EC2 instance.
https://aws.amazon.com/documentation/efs/

Using both AWS and GCP at the same time?

I have a social network website and I store all the media in S3. I'm planning to use AWS for S3 + Lambda and GCP for GCE and Cloud SQL. What are the cons of doing it this way? Bandwidth between GCP and S3 (since they're not on the same network)?
Thanks.
Using both services together can make sense when you're leveraging one provider's strengths, or for redundancy / disaster recovery. You might also find the pricing model of one provider suits your use-case better. The tradeoff is inconvenience, extra code to manage interoperability, learning two sets of APIs and libraries, and possibly latency.
A few use-cases I've seen personally:
Backing up S3 buckets to Cloud Storage in COLDLINE via the Transfer Job system; goal is to protect code and data backups against worst-case S3 data loss or account hacking in AWS
Using BigQuery to analyze logs pre-processed in AWS EMR and synced into Cloud Storage; depending on your workload BigQuery might cost a lot less than running a Redshift cluster
I've also heard arguments that Google's ML pipelines are superior in some domains, so this might be a common crossover case.
Given the bulk of your infrastructure is already in Google, have you considered using Cloud Functions and Cloud Storage instead of Lambda and S3?

How to setup shared persistent storage for multiple AWS EC2 instances?

I have a service hosted on Amazon Web Services. There I have multiple EC2 instances running with the exact same setup and data, managed by an Elastic Load Balancer and scaling groups.
Those instances are web servers running web applications based on PHP. So currently there are the very same files etc. placed on every instance. But when the ELB / scaling group launches a new instance based on load rules etc., the files might not be up-to-date.
Additionally, I'd rather like to use a shared file system for PHP sessions etc. than sticky sessions.
So, my question is: for those reasons, and maybe more coming up in the future, I would like to have a shared file system which I can attach to my EC2 instances.
What way would you suggest to resolve this? Are there any solutions offered by AWS directly, so I can rely on their services rather than doing it on my own with DRBD and so on? What is the easiest approach? DRBD, NFS, ...? Is S3 also feasible for these purposes?
Thanks in advance.
As mentioned in a comment, AWS has announced EFS (http://aws.amazon.com/efs/) a shared network file system. It is currently in very limited preview, but based on previous AWS services I would hope to see it generally available in the next few months.
In the meantime there are a couple of third-party shared file system solutions for AWS, such as SoftNAS: https://aws.amazon.com/marketplace/pp/B00PJ9FGVU
S3 is possible but not always ideal, the main blocker being that it does not natively support any filesystem protocols; instead, all interactions need to go via the AWS API or HTTP calls. Additionally, when looking at using it for session stores, the 'eventually consistent' model will likely cause issues.
That being said, if all you need is updated resources, you could create a simple script to run either as a cron job or on startup that downloads the files from S3.
Finally, in the case of static resources like CSS/images, don't store them on your webserver in the first place; there are plenty of articles covering the benefits of storing and accessing static web resources directly from S3 while keeping the dynamic stuff on your server.
From what we can tell at this point, EFS is expected to provide basic NFS file sharing on SSD-backed storage. Once available, it will be a v1.0 proprietary file system. There is no encryption, and it's AWS-only. The data is completely under AWS control.
SoftNAS is a mature, proven advanced ZFS-based NAS Filer that is full-featured, including encrypted EBS and S3 storage, storage snapshots for data protection, writable clones for DevOps and QA testing, RAM and SSD caching for maximum IOPS and throughput, deduplication and compression, cross-zone HA and a 100% up-time SLA. It supports NFS with LDAP and Active Directory authentication, CIFS/SMB with AD users/groups, iSCSI multi-pathing, FTP and (soon) AFP. SoftNAS instances and all storage is completely under your control and you have complete control of the EBS and S3 encryption and keys (you can use EBS encryption or any Linux compatible encryption and key management approach you prefer or require).
The ZFS filesystem is a proven filesystem that is trusted by thousands of enterprises globally. Customers are running more than 600 million files in production on SoftNAS today - ZFS is capable of scaling into the billions.
SoftNAS is cross-platform, and runs on cloud platforms other than AWS, including Azure, CenturyLink Cloud, Faction cloud, VMware vSphere/ESXi, VMware vCloud Air and Hyper-V, so your data is not limited to or locked into AWS. More platforms are planned. It provides cross-platform replication, making it easy to migrate data between any supported public cloud, private cloud, or premises-based data center.
SoftNAS is backed by industry-leading technical support from cloud storage specialists (it's all we do), something you may need or want.
Those are some of the more noteworthy differences between EFS and SoftNAS. For a more detailed comparison chart:
https://www.softnas.com/wp/nas-storage/softnas-cloud-aws-nfs-cifs/how-does-it-compare/
If you are willing to roll your own HA NFS cluster, and be responsible for its care, feeding and support, then you can use Linux and DRBD/corosync or any number of other Linux clustering approaches. You will have to support it yourself and be responsible for whatever happens.
There's also GlusterFS. It does well up to 250,000 files (in our testing) and has been observed to suffer from an IOPS brownout when approaching 1 million files, and IOPS blackouts above 1 million files (according to customers who have used it). For smaller deployments it reportedly works reasonably well.
Hope that helps.
CTO - SoftNAS
For keeping your webserver sessions in sync, you can easily switch to Redis or Memcached as your session handler. This is a simple setting in php.ini, and all the servers can use the same Redis or Memcached server for sessions. You can use Amazon's ElastiCache, which will manage the Redis or Memcached instance for you.
http://phpave.com/redis-as-a-php-session-handler/ <- explains how to set up Redis with PHP pretty easily
Keeping your files in sync is a little bit more complicated.
How do I push new code changes to all my webservers?
You could use Git. When you deploy, you can set it up to push your branch (master) to multiple servers, so every new build goes out to all webservers.
What about new machines that launch?
I would set up new machines to run an rsync script from a trusted source, your master web server. That way they sync their web folders with the master when they boot and will be identical even if the AMI had old web files in it.
What about files that change and need to be live updated?
Store any user-uploaded files in S3. So if a user uploads a document on Server 1, the file is stored in S3 and its location is stored in a database. Then if a different user is on Server 2, they can see the same file and access it as if it were on Server 2: the file is retrieved from S3 and served to the client.
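As a rough sketch of that flow with boto3 (bucket, key, the database call, and the one-hour expiry are assumptions): the upload happens on whichever server received the file, the key is recorded in the database, and any other server can later hand the client a short-lived presigned URL.
import boto3
s3 = boto3.client("s3")
bucket = "user-uploads-bucket"       # hypothetical bucket
# Server 1: store the uploaded document and record the key in the database
key = "uploads/42/contract.pdf"      # hypothetical key derived from user/file IDs
s3.upload_file("/tmp/contract.pdf", bucket, key)
# save_upload_record(user_id=42, s3_key=key)   # hypothetical DB call
# Server 2: look up the key and serve the file via a presigned URL
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket, "Key": key},
    ExpiresIn=3600,  # link valid for one hour
)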
GlusterFS is also an open source distributed file system used by many to create shared storage across EC2 instances
Until Amazon EFS hits production, the best approach in my opinion is to build a storage backend exporting NFS from EC2 instances, maybe using Pacemaker/Corosync to achieve HA.
You could create an EBS volume that stores the files and instruct Pacemaker to unmount/detach the volume and then attach/mount it on the healthy NFS cluster node.
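The detach/attach step that Pacemaker would trigger boils down to a couple of EC2 API calls; a hedged sketch with boto3 (the volume ID, instance ID, device name, and region are placeholders):
import boto3
ec2 = boto3.client("ec2", region_name="eu-west-1")  # placeholder region
volume_id = "vol-0123456789abcdef0"                 # placeholder EBS volume
healthy_node = "i-0123456789abcdef0"                # placeholder instance ID of the surviving NFS node
# detach the volume from the failed node
ec2.detach_volume(VolumeId=volume_id)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
# attach it to the healthy node, then mount and re-export it over NFS there
ec2.attach_volume(VolumeId=volume_id, InstanceId=healthy_node, Device="/dev/xvdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])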
Hi, we currently use a product called SoftNAS in our AWS environment. It allows us to choose between both EBS- and S3-backed storage. It has built-in replication as well as a high-availability option. It may be something you can check out; I believe they offer a free trial you can try out on AWS.
We are using ObjectiveFS and it is working well for us. It uses S3 for storage and is straightforward to set up.
They've also written a doc on how to share files between EC2 instances.
http://objectivefs.com/howto/how-to-share-files-between-ec2-instances