AWS EFS for critical small files with frequent updates - amazon-web-services

Our application's user base has reached 2M users and we are planning to scale up the application using AWS.
The main problem we are facing is the handling of shared data which includes cache, uploads, models, sessions, etc.
One option is AWS EFS, but it will kill the performance of the application, as the files are really small, ranging from a few bytes to a few MB, and are updated very frequently.
We can use Memcache/Redis for sessions and S3 for uploads but still need to manage cache, models, and some other shared files.
Is there any alternative to EFS or any way to make EFS work for this scenario where small files are updated frequently?

Small files and frequent updates should not be a problem for EFS.
The problem some users encountered in the original release was that two dimensions were tightly coupled: the amount of throughput available was a function of how much you were paying, and how much you were paying was a function of the total size of the filesystem (all files combined, regardless of individual file sizes)... so the larger, the faster.
But they have, since then, introduced "provisioned throughput," allowing you to decouple these two dimensions.
This default Amazon EFS throughput bursting mode offers a simple experience that is suitable for a majority of applications. Now with Provisioned Throughput, applications with throughput requirements greater than those allowed by Amazon EFS’s default throughput bursting mode can achieve the throughput levels required immediately and consistently independent of the amount of data.
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-efs-now-supports-provisioned-throughput/
If you use this feature, you pay for the difference between the throughput you provision and the throughput that would have been included anyway based on the size of the data.
See also Amazon EFS Performance in the Amazon Elastic File System User Guide.
Provisioned Throughput can be activated and deactivated, so don't confuse it with the two performance modes, General Purpose and Max I/O, one of which must be selected when creating the filesystem; that selection can't be changed later. The performance modes relate to an optional tradeoff in the underlying infrastructure, and the recommended practice is to select General Purpose unless you have a reason not to, based on observed metrics. Note that Max I/O does not have the same metadata consistency model as General Purpose.
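If it helps, here is a minimal boto3 sketch of creating an EFS filesystem with provisioned throughput and later adjusting the throughput mode; the creation token and the 64 MiB/s figure are illustrative assumptions, not a recommendation:

```python
import boto3

efs = boto3.client("efs")

# Create a filesystem whose throughput is decoupled from the amount of data stored.
fs = efs.create_file_system(
    CreationToken="shared-app-data",       # hypothetical idempotency token
    PerformanceMode="generalPurpose",      # cannot be changed after creation
    ThroughputMode="provisioned",
    ProvisionedThroughputInMibps=64.0,     # illustrative value; size to your workload
)

# Provisioned throughput can later be changed, or switched back to bursting.
efs.update_file_system(
    FileSystemId=fs["FileSystemId"],
    ThroughputMode="bursting",
)
```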

Related

Sagemaker Endpoint throttling exception

I have created an endpoint using Sagemaker, and designed my system so that it is called about 100 times simultaneously. This seemed to cause 'Model error' and take too much time. Do I need to create an endpoint for each event, and make one call per endpoint, instead?
You can check the CloudWatch logs to diagnose your model failure.
Real-time inference traffic scaling can be addressed by working on three independent dimensions:
hardware: choosing larger machines, or more machines. For example, you can load test your model endpoint with bigger and bigger machines and see what hardware size gives you acceptable latency. The autoscaling feature of SageMaker helps you address this automatically (a minimal sketch appears after this list). If you are deploying a deep neural net, you can also consider using appropriate accelerators, e.g. GPU (EC2 P3, EC2 G4) or Amazon Elastic Inference Accelerator, to make each prediction much faster.
software: you have 2 levers to tune here:
choosing a serving stack that is lean and fast. Different servers will handle load at different levels of performance. One common trick is to batch the load: for example, instead of hitting your server 100 times, can you hit it only once with a batch of 100 records (see the batching sketch after this list)? If clients cannot batch their requests, can you use micro-asynchrony so that you do the batching yourself after they have issued requests? You can usually configure such micro-batching in advanced deep learning servers such as TF Serving or MXNet Model Server (both can be used in SageMaker), but otherwise you can also do it yourself by having a queue (SQS) in front of your server.
model compilation - optimizing the model graph and its runtime. This is a very smart concept that leverages the fact that when you know where you're going to deploy (e.g. NVIDIA, Intel, ARM, etc), you have an insider edge and can refine your model artifact and create a bespoke runtime application that is tailor-made for this specific target platform. This can reduce memory consumption and latency by a double-digit percentage, and is an active area of ML research. In the SageMaker ecosystem, such a compilation task can be performed with SageMaker Neo, but the open source ecosystem is developing fast, with notably treelite (paper, doc) for decision tree compilation and TVM (paper, doc) for arbitrary neural net compilation. Both are dependencies of Neo, by the way.
science: some models are slower or heavier than others. If speed and concurrency are your priorities over accuracy, and if you already exploited all possible tricks at level (1) and (2) above, consider using fast-throughput models, eg linear models & logistic regression for structured data, MobileNet or SqueezeNet instead of large Resnets for classification (nice benchmark here), Yolov3 instead of FasterRCNN for detection (nice benchmark here), etc. But be aware that unlike levels (1) and (2), changing model science will alter accuracy.
As mentioned above, those three areas of improvement really are about real-time inference; if you can afford to pre-compute all possible model inputs, then the ultimate low-latency, high-throughput solution is to pre-compute offline a variety of input-prediction pairs of interest and serve them on demand from a fast database or local read-only store.
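On the hardware dimension, here is a hedged boto3 sketch of attaching a target-tracking autoscaling policy to an endpoint variant; the endpoint and variant names, capacities, and the 70-invocations-per-instance target are illustrative assumptions:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint/variant names.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# Let SageMaker scale the variant between 1 and 4 instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale out when each instance handles more than ~70 invocations per minute.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```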
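On the software dimension, a minimal sketch of client-side batching: one invoke_endpoint call carrying many records instead of one call per record (the endpoint name, CSV content type, and record layout are assumptions that depend on how your model was deployed):

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Instead of 100 separate requests, pack the records into a single CSV payload.
records = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # ... up to ~100 rows per request
payload = "\n".join(",".join(str(x) for x in row) for row in records)

response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",   # hypothetical endpoint name
    ContentType="text/csv",
    Body=payload,
)
predictions = response["Body"].read().decode("utf-8")
```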

bigstore increasing almost linearly Google Cloud

I use many APIs from Google Cloud. Recently I noticed that the bigstore is gradually increasing on a daily basis. I am worried that if this continues I won't be able to pay the bill.
I do not know, however, how to check where this increase is coming from. Is there a way to see which cloud functions are causing this increased traffic?
The reason I am surprised about the increase in the traffic of bigstore is because I have cron jobs that are running multiple times per day to store the data in BigQuery. I have not changed these settings, so I would assume that this traffic should not increase as shown on the chart.
One other explanation I can think of is that the amount of data that I am storing has increased, which is indeed true on a daily basis. But why does this increase the traffic?
What is the way to check this?
There are two main data sources you should use:
GCP-wide billing export. This will tell you an exact breakdown of your costs. This is important to make sure you target your effort where the cost is largest to you. It also provides some level of detail about what the usage is.
Enable access & storage logging. The access log will give you an exact accounting of incoming requests down to the number of bytes transferred, and the storage logs give you similar granularity into the cost of the storage itself (see the sketch below for enabling these on a bucket).
In addition, if you have a snapshot of your bigstore, your storage charges will increase over time as you replace or even rename files: where you once had two views of the same storage, each changed file forks into two copies (one is the current view of your storage, the other is the snapshot).
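As a minimal sketch using the google-cloud-storage Python client (the bucket names are hypothetical), turning on access/storage logging for a bucket looks roughly like this; the billing export itself is configured in the console rather than in code:

```python
from google.cloud import storage

client = storage.Client()

# Hypothetical buckets: the one to audit, and a separate one to receive the logs.
bucket = client.get_bucket("my-data-bucket")
bucket.enable_logging("my-access-logs-bucket", object_prefix="access-log")
bucket.patch()  # push the updated logging configuration to GCS
```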

AWS : Splitting software & data in different volumes

AWS recommends keeping data & OS on separate EBS volumes. I have a webserver running on EC2 with an EBS volume. On a bare VM, I install the following:
- webserver, wsgi, pip & related software/config (some in /etc some in /home/<user>)
- server code & static assets in /var/www/
- log files are written to /var/log/<respective-folder>
- maintenance scripts in /home/<user>/
The database server is separate. For a webserver, which of the above items would benefit from higher IOPS, and for which does it not matter? My understanding is that the server code & log files should be moved to a separate EBS volume with higher IOPS. Or should I just move all of my stuff (except the software I installed in /etc, i.e. the webserver) to a separate volume with better IOPS?
I would recommend that you have a separate EBS volume for code, logs, and maintenance in case you need to move it to another server. That allows you a faster TTR (time to resolution) than having to build an entire server.
The code shouldn't be changing much past deployment, so I would focus on a general purpose SSD here, and look towards a caching layer (Varnish for full-page caching, a CDN for static assets) more than worrying about disk I/O. A CDN is a quick win and mitigates most of the I/O for reading static assets. At 50 GB you get 150 IOPS, and with the static assets offloaded, the I/O should be fine.
As for logs, if you are a high-traffic site, then you should definitely focus on I/O here, as you don't want blocking I/O. This mainly concerns access logs rather than error logs, as the latter shouldn't go past ERROR level on a production system. If you aren't high traffic, then you should be fine with a general purpose SSD; at 10 GB you get 30 IOPS, and that is generally enough.
What are your maintenance scripts doing? If they are generating and outputting files, then you could use SSD, but if you need high I/O, you should revisit and optimize the code, as these disks can get expensive and that cost is usually wasted on maintenance that runs intermittently.
As for your web server, et cetera, that should be based on infrastructure as code, via OpsWorks or Puppet, and doesn't need much in terms of I/O, as those are usually memory-based processes once built and deployed.
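If it helps, here is a hedged boto3 sketch of carving out a separate general purpose (gp2) volume for something like /var/log; the instance ID, availability zone, size, and device name are illustrative assumptions:

```python
import boto3

ec2 = boto3.client("ec2")

instance_id = "i-0123456789abcdef0"   # hypothetical instance in us-east-1a

# A small gp2 volume for logs (~30 IOPS baseline at 10 GB).
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=10,
    VolumeType="gp2",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Attach it to the webserver instance; format and mount it at /var/log on the host.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId=instance_id,
    Device="/dev/xvdf",
)
```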

NuoDB and HDFS as storage

Using HDFS as storage for NuoDB - would this have a performance impact?
If I understand correctly, HDFS is better suited for batch-mode, write-once-read-many types of applications. Would it not increase the latency for a record to be fetched when it needs to be read from storage?
On top of that, given the HDFS block size concept, keeping file sizes small would increase the network traffic while data is being fetched. Am I missing something here? Please point it out.
How would NuoDB manage these kinds of latency gotchas?
Good afternoon,
My name is Elisabete and I am the Technical Support Engineer over at NuoDB. I believe that I may have just answered this via your post on our own forum, but I'm responding here as well for anyone else who's curious.
First... a mini lesson on NuoDB architecture/layout:
The most basic NuoDB set-up includes:
Broker Agent
Transaction Engine (TE)
Storage Manager (SM) connected to an Archive Directory
Broker Agents keep track of all the moving parts in the domain (collection of machines hosting NuoDB processes) and provide client applications with connection information for the next available Transaction Engine.
Transaction Engines process incoming SQL requests and manage transactions.
Storage Managers read and write data to and from "disk" (Archive Directory)
All of these components can reside on a single machine, but an optimal set up would have them spread across multiple host machines (allowing each process to take full advantage of the host's available CPU/RAM). Also, while it's possible to run with just one of each component, this is a case where more is definitely more. Additional Brokers provide resiliency, additional TE's increase performance/speed and additional SM's ensure durability.
OK, so now let's talk about storage:
This is the "Archive Directory" that your storage manager is writing to. Currently, we support three modes of storage:
Local File System
Amazon Web Services: Simple Storage Service (S3), Elastic Block Store (EBS)
Hadoop Distributed File System (HDFS)
So, to elaborate on how NuoDB works with HDFS... it doesn't know about the multiple machines that the HDFS layer is writing to. As far as the SM is concerned, it is reading and writing data atoms to a single directory. The HDFS layer decides how to then distribute and retrieve data to and from the cluster of machines it resides over.
And now to finally address the question of latency:
Here's the thing: whenever we introduce a remote storage device, we inevitably introduce some amount of additional latency, because the SM now has further to go when reading/writing atoms to/from memory. HDFS likely adds a bit more, because now it needs to do its magic of divvying up, distributing, retrieving and reassembling data. Add to that discrepancies in network speed, etc.
I imagine that the gained disk space outweighs the cost in travel time, but this is something you'd have to decide on a case by case basis.
Now, all of that said... I haven't mentioned that TE and SM's both have the ability to cache data to local memory. The size of this cache is something you can set, when starting up each process. NuoDB uses a combination of Multi-Version Concurrency Control (MVCC) and a near constant stream of communication between all of the processes, to ensure that data held in cache is kept up to date with all of the changes happening within the system. Garbage Collection also kicks in and clears out atoms in a Least Recently Used order, when the cache grows close to hitting its limit.
All of this helps reduce latency, because the TE's can hold onto the data they reference most often and grab copies of data they don't have from sibling TE's. When they do resort to asking the SM's for data, there's a chance that the SM (or one of its sibling SM's) has a copy of the requested data in local cache, saving itself the trip out to the Archive Directory.
Whew.. that was a lot and I absolutely glossed over more than a few concepts. These topics are covered in greater depth via the new suite of white papers (and the new "green book") available on our main website. I'm currently also working on some visual guides, to help explain all of this.
If you'd like to know more about NuoDB or if I didn't quite answer your question.... please reach out to me directly via the NuoDB Community Forums (I respond to posts there, a bit faster).
Thank you,
Elisabete
Technical Support Engineer at NuoDB

Cloud hosting - shared storage with direct access

We have an application deployed on AWS using the EC2 and EBS services.
The infrastructure is split into layers (independent instances):
application (with load balancer)
database (master-slave standard schema)
media server (streaming)
background processing (redis, delayed_job)
The application and database instances use a number of EBS block storage devices (root, data), which lets us attach/detach them and take EBS snapshots to S3. It's the pretty standard way AWS works.
But an EBS volume is located in a specific availability zone and can be attached to only one instance at a time.
The media server is one of the bottlenecks, so we'd like to scale it with a master/slave schema. For the media server storage we'd like to try a distributed file system that can be attached to multiple servers. What do you advise?
If you're not Facebook or Amazon, then you have no real reason to use something as elaborate as Hadoop or Cassandra. When you reach that level of growth, you'll be able to afford engineers who can choose/design the perfect solution to your problems.
In the meantime, I would strongly recommend GlusterFS for distributed storage. It's extremely easy to install, configure and get up and running. Also, if you're currently streaming files from local storage, you'll appreciate that GlusterFS also acts as local storage while remaining accessible by multiple servers. In other words, no changes to your application are required.
I can't tell you the exact configuration options for your specific application, but there are many available such as distributed, replicated, striped data. You can also play with cache settings to avoid hitting disks on every request, etc.
One thing to note: since GlusterFS is a layer above the other storage layers (particularly with Amazon), you might not get impressive disk performance. Actually, it might be much worse than what you have now, for the sake of scalability... basically, you could be better off designing your application to serve streaming media from a CDN that already has the right infrastructure for your type of application. It's something to think about.
HBase/Hadoop
Cassandra
MogileFS
A good similar question (if I understand correctly):
Lustre, Gluster or MogileFS?? for video storage, encoding and streaming
There are many distributed file systems; just find the one you need.
The above are just the ones I personally know of (I haven't tested them).