Since HDFS supports RAMDisk storage, what is the advantage of using Alluxio? In our case we are not going to integrate any other type of under storage besides HDFS.
Having the concept of Under Storage, and keeping data and metadata in sync between Alluxio and the Under Storage, is the key difference between Alluxio and HDFS. Beyond that, there are a few other differences that follow from Alluxio being designed to host hot data and implement the semantics of a distributed cache, whereas HDFS is designed to be a persistent storage service.
Alluxio provides configurable eviction policies.
Alluxio natively supports operations like setting TTLs (see link).
The number of block replicas in HDFS is a fixed constant chosen for durability (3 by default; the setrep command changes the replication level). In Alluxio, the number of block replicas can change automatically based on the popularity of each block: if a block is accessed by multiple applications on different servers, more copies can be created.
Alluxio supports tiered storage, so one can configure multiple tiers with MEM, SSD and HDD (see link).
I have come across these terms: Block Level Storage and File Level Storage. Can someone explain why one is better than the other?
Examples and a bit of reasoning would make it much easier to understand.
For example, AWS articles say that AWS EBS can be used for databases, but why is it better than file-level storage?
I like to think of it like this:
Amazon Elastic Block Store (Amazon EBS) is block storage. It is just like a USB disk that you plug into your computer. Information is stored in specific blocks on the disk and it is the job of the operating system to keep track of which blocks are used by each file. That's why disk formats vary between Windows and Linux.
Amazon Elastic File System (Amazon EFS) is a filesystem that is network-attached storage. It's just like the H: drive (or whatever) that companies provide their employees to store data on a fileserver. You mount the filesystem on your computer like a drive, but your computer sends files to the fileserver rather than managing the block allocation itself.
Amazon Simple Storage Service (Amazon S3) is object storage. You give it a file and it stores it as an object. You ask for the object and it gives it back. Amazon S3 is accessed via an API. It is not mounted as a disk. (There are some utilities that can mount S3 as a disk, but they actually just send API calls to the back-end and make it behave like a disk.)
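To make the API-driven model concrete, here is a minimal boto3 sketch; the bucket and key names are hypothetical placeholders:

```python
import boto3

# Object storage is driven entirely through API calls: you PUT a whole
# object and GET a whole object, rather than reading or writing disk blocks.
s3 = boto3.client("s3")

# Store some data as an object (bucket and key are placeholder names).
s3.put_object(Bucket="example-bucket", Key="reports/2021.csv", Body=b"col1,col2\n1,2\n")

# Retrieve the object; the whole body comes back over HTTPS.
response = s3.get_object(Bucket="example-bucket", Key="reports/2021.csv")
data = response["Body"].read()
```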
When it comes to modifying files, they behave differently:
Files on block storage (like a USB disk) can be modified by the operating system. For example, changing one byte or adding data to the end of the file.
Files on a filesystem (like the H: drive) can be modified by making a request to the fileserver, much like block storage.
Files in object storage (like S3) are immutable and cannot be modified. You can upload another file with the same name, which will replace the original file, but you cannot modify a file. (Uploaded files are called objects.)
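A short sketch of that difference, using a local file to stand in for block/file storage and boto3 for S3 (paths and names are placeholders):

```python
import boto3

# Block or file storage: the operating system can modify a file in place.
with open("/mnt/data/log.txt", "r+b") as f:
    f.seek(0, 2)                     # jump to the end of the file
    f.write(b"one more line\n")      # append without rewriting the rest

# Object storage: there is no in-place update. To "change" an object,
# you upload a complete replacement under the same key.
s3 = boto3.client("s3")
with open("/mnt/data/log.txt", "rb") as f:
    s3.put_object(Bucket="example-bucket", Key="log.txt", Body=f.read())
```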
Amazon S3 has other unique attributes, such as making objects available via the Internet, offering multiple storage classes for low-cost backups, and triggering events when objects are created/deleted. It's a building block for applications as opposed to a simple disk for storing data. Plus, there is no limit to the amount of data you can store.
Databases
Databases like to store their data in their own format that makes the data fast to access. Traditional databases are built to run on normal servers and they want fast access, so they store their data on directly-attached disks, which are block storage. Amazon RDS uses Amazon EBS for block storage.
A network-attached filesystem would slow disk access for a database, thereby reducing performance. However, sometimes this trade-off is worthwhile because it is easier to manage centralized network storage (SAN/NAS) than to keep adding disks to each individual server.
Some modern 'databases' (if you can use that term) like Presto can access data directly in Amazon S3 without loading the data into the database. Thus, the database processing layer is separated from the data layer. This makes it easier to access historical archived data since it doesn't need to be imported into the database.
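For instance, Amazon Athena (a managed, Presto-based service) can query files that sit in S3 without importing them first; a rough boto3 sketch, where the database, table and bucket names are hypothetical:

```python
import boto3

athena = boto3.client("athena")

# Run a SQL query directly against files stored in S3; nothing is
# loaded into a database first. Names below are placeholders.
execution = athena.start_query_execution(
    QueryString="SELECT year, COUNT(*) FROM archived_orders GROUP BY year",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)

# The query runs asynchronously; poll its status before reading results.
status = athena.get_query_execution(QueryExecutionId=execution["QueryExecutionId"])
print(status["QueryExecution"]["Status"]["State"])
```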
Suppose you have to implement a video streaming platform from scratch. It doesn't matter where you store the metadata; your not-very-popular video files will be stored on a file system, or in an object store if you want to use the cloud. If you choose AWS, then to boost S3 read performance you can make multiple read requests against the same video file; see Performance Guidelines for Amazon S3 (a rough sketch of such ranged reads follows the quote):
You can use concurrent connections to Amazon S3 to fetch different byte ranges from within the same object. This helps you achieve higher aggregate throughput versus a single whole-object request.
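A minimal sketch of such parallel byte-range reads with boto3 (bucket, key and chunk size are placeholders):

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
BUCKET, KEY = "example-videos", "movie.mp4"   # placeholder names
CHUNK = 8 * 1024 * 1024                       # 8 MiB per ranged request

def fetch_range(start, end):
    # Each worker asks S3 for a different byte range of the same object.
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return start, resp["Body"].read()

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
ranges = [(off, min(off + CHUNK, size) - 1) for off in range(0, size, CHUNK)]

# Issue the ranged GETs concurrently and reassemble the object in order.
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = dict(pool.map(lambda r: fetch_range(*r), ranges))
data = b"".join(parts[off] for off, _ in ranges)
```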
At the same time, as you know, disk I/O is sequential for HDD/SSD drives, so to boost read performance (ignoring the RAM needed to decompress/decrypt each video chunk) you have to read from multiple disks (YouTube uses RAID).
Why does S3 have better performance on concurrent byte-range requests against the same file? Isn't it stored on a single disk? I suppose S3 may have some replication factor and store the file on multiple disks, doesn't it?
We have a completely serverless application, with only lambdas and DynamoDB.
The lambdas are running in two regions, and the originals are stored in Cloud9.
DynamoDB is configured with all tables global (bidirectional multi-master replication across the two regions), and the schema definitions are stored in Cloud9.
The only data loss we need to worry about is DynamoDB, which even if it crashed in both regions is presumably diligently backed up by AWS.
Given all of that, what is the point of classic backups? If both regions were completely obliterated, we'd likely be out of business anyway, and anything short of that would be recoverable from AWS.
Not all AWS regions support backup and restore functionality. You'll need to roll your own solution for backups in unsupported regions.
If all the regions your application runs in support the backup functionality, you probably don't need to do it yourself. That is the point of going serverless: you let the platform handle routine DevOps tasks.
Redundancy through regional, and optionally cross-regional, replication for DynamoDB mainly provides durability, availability and fault tolerance for your data storage. However, even with these built-in capabilities, there can still be a need for backups.
For instance, if there is data corruption due to an external threat (such as an attack) or an application malfunction, you might still want to restore the data. This is where backups are useful for restoring data to a recent point in time.
There can also be compliance-related requirements that mandate taking backups of your database system.
Another use case is when new DynamoDB tables are needed for your build pipeline and quality assurance: it is more practical to re-use an existing snapshot of data from a backup than to copy from the live database (since that consumes the provisioned IOPS and can affect application behavior).
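A hedged boto3 sketch of both ideas, an on-demand backup plus restoring it into a fresh table for QA; the table and backup names are placeholders:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Take an on-demand backup of the production table (names are placeholders).
backup = dynamodb.create_backup(TableName="orders", BackupName="orders-pre-release")
backup_arn = backup["BackupDetails"]["BackupArn"]

# Restore that backup into a brand-new table for the QA environment,
# leaving the live table's provisioned throughput untouched.
dynamodb.restore_table_from_backup(
    TargetTableName="orders-qa",
    BackupArn=backup_arn,
)

# Optionally enable point-in-time recovery so restores to any recent
# moment are possible without explicit snapshots.
dynamodb.update_continuous_backups(
    TableName="orders",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)
```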
My web application requires extremely low-latency reads/writes of small data blobs (<10 KB) that can be stored as key-value pairs. I am considering DynamoDB (with DAX), EFS and ElastiCache. AWS claims that they all offer low latency, but I cannot find any head-to-head comparison, and it is not clear to me whether these three are even in the same league. Can someone share any insight?
You are trying to compare different storage systems for different use cases with different pricing models.
EFS is a filesystem for which you don't need to provision storage devices and which you can access from multiple EC2 instances. EFS might work fine for your use case, but you will need to manage files, meaning you will need to structure your data to fit into files. Alternatively, you might need a key-value or blob/object storage system, depending on the level of structure and retrieval you need. There are products that solve this problem for you, such as S3, DynamoDB, and ElastiCache (Redis or Memcached).
S3 is blob storage with no structure and no data types; objects can't be updated, only replaced. You can only query by listing the objects in a bucket. It is typically used for storing static media files.
DynamoDB is a non-relational (NoSQL) database, which can be used as a document or key-value store in which data is structured, strongly typed and queryable. It can store items up to 400 KB.
ElastiCache (Redis or Memcached) provides key-value stores which are typically used as a cache in front of a durable data store such as DynamoDB. In this case the application needs to be aware of the different layers, manage different APIs and handle the caching logic itself.
With DAX, you can seamlessly integrate a cache layer without putting the caching logic in the application. DAX currently provides a write-through cache for DynamoDB. The DAX API is compatible with the DynamoDB API, which makes it straightforward to add a cache layer if your application already uses DynamoDB: you substitute the DAX client for the DynamoDB client. Keep in mind that DAX currently offers Java, Node.js, Go, .NET and Python clients only.
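As an illustration of that substitution, a minimal boto3 sketch of a key-value lookup against DynamoDB; with DAX, roughly the same calls go through the DAX client instead (table and key names are placeholders):

```python
import boto3

# Plain DynamoDB access; with DAX the table object would be created from
# the DAX client (amazon-dax-client package) instead, but the put/get
# calls below stay essentially the same.
table = boto3.resource("dynamodb").Table("session-store")   # placeholder table

# Write a small (<10 KB) blob keyed by session id.
table.put_item(Item={"session_id": "abc123", "payload": "serialized-state"})

# Low-latency point read by key.
item = table.get_item(Key={"session_id": "abc123"}).get("Item")
```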
So it really depends on your workload. If you require sub-millisecond latency without the headache of managing a cache layer, and your application is in Java, Node.js, Go, .NET or Python, then DAX is for you.
Our project has a directory of huge files, with sizes ranging from 900 MB to 2 GB.
The objective is to allow end-users to download huge files using typical web browsers.
Can AWS S3, used as a file server, be a good option for this kind of application?
In short, yes:
The maximum object size is 5 terabytes.
There are various options for optimising the storage class to reduce cost.
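One common pattern for letting browser users download large objects directly from S3 is a pre-signed URL; a hedged boto3 sketch with placeholder bucket and key names:

```python
import boto3

s3 = boto3.client("s3")

# Generate a time-limited HTTPS link that a browser can download directly,
# so the 900 MB-2 GB files never pass through your own web servers.
# Bucket and key names are placeholders.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-downloads", "Key": "builds/installer-2.0.iso"},
    ExpiresIn=3600,  # link stays valid for one hour
)
print(url)  # hand this URL to the end-user's browser
```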
To answer your question: Yes. More info: Amazon S3 Standard offers high durability, availability, and performance object storage for frequently accessed data. Because it delivers low latency and high throughput, Standard is perfect for a wide variety of use cases including cloud applications, dynamic websites, content distribution, mobile and gaming applications, and big data analytics. Lifecycle management offers configurable policies to automatically migrate objects to the most appropriate storage class.
Key Features:
Low latency and high throughput performance.
Designed for 99.999999999% durability of objects.
Designed for 99.99% availability over a given year.
Backed by the Amazon S3 Service Level Agreement for availability.
Supports SSL for data in transit and encryption of data at rest.
Lifecycle management for automatic migration of objects between storage classes.
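Lifecycle rules can also be set programmatically; a hedged boto3 sketch that moves older downloads to cheaper storage classes (bucket name, prefix and day thresholds are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under a prefix to cheaper storage classes as they age.
# Bucket name, prefix and day thresholds are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-downloads",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-builds",
                "Filter": {"Prefix": "builds/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```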