In the FAQ for the AWS SimpleDB service I noticed this paragraph:
Q: When should I use Amazon S3 vs. Amazon SimpleDB?
Amazon S3 stores raw data. Amazon SimpleDB takes your data as input
and indexes all the attributes, enabling you to quickly query that
data. Additionally, Amazon S3 and Amazon SimpleDB use different types
of physical storage. Amazon S3 uses dense storage drives that are
optimized for storing larger objects inexpensively. Amazon SimpleDB
stores smaller bits of data and uses less dense drives that are
optimized for data access speed.
Can somebody explain how AWS SimpleDB reaches high data access speed using less dense drives?
As far as I know: more density -> more speed.
Amazon SimpleDB is a non-relational (NoSQL) data store. These days, if you're looking to use NoSQL on AWS, DynamoDB is recommended. SimpleDB is not even listed on the service menu.
It can reach high speeds because it is a database (with item-level data pre-loaded and indexed), while Amazon S3 is an object store (that only works at the object level).
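To make that concrete, here is a toy sketch (plain Python, nothing AWS-specific, with invented data) of the difference between scanning opaque objects and querying a pre-built attribute index:

```python
# Toy illustration only: not the actual SimpleDB implementation.
# An "object store" maps key -> opaque blob; finding items by an
# attribute means fetching and scanning every object (like S3).
object_store = {
    "item1": {"color": "red", "size": "L"},
    "item2": {"color": "blue", "size": "M"},
    "item3": {"color": "red", "size": "S"},
}

def scan_for(attr, value):
    # O(n): every object must be read, as with listing an S3 bucket
    return sorted(k for k, v in object_store.items() if v.get(attr) == value)

# A "database" builds an index per attribute at write time, so the
# same question becomes a single lookup (SimpleDB indexes all
# attributes on input).
index = {}
for key, attrs in object_store.items():
    for attr, value in attrs.items():
        index.setdefault((attr, value), set()).add(key)

def query(attr, value):
    # O(1) lookup into the pre-built index
    return sorted(index.get((attr, value), set()))

assert scan_for("color", "red") == query("color", "red") == ["item1", "item3"]
```

Both calls return the same answer; the point is that the index answers it without touching every stored object.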
Check out these posts, which may help you:
SimpleDB Essentials for High Performance Users: Part 1
SimpleDB Essentials for High Performance Users: Part 2
SimpleDB Essentials for High Performance Users: Part 3
SimpleDB Performance : 5 Steps to Achieving High Write Throughput
Related
I'm new to Redshift and need some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing and transforming, keeping temporary storage to pick up a specific slice from S3 and process it?
In other words, does Redshift have its own backend storage, the way Oracle or Microsoft SQL Server have their own physical servers where the data is stored?
Because, if I'm migrating from a conventional RDBMS to Redshift due to increased volume, would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question may seem basic, but I'm unable to find an answer on Amazon's websites or in any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance when there are 1+ million rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and is stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). It is this dedicated storage, and the way the data is stored, that gives Redshift its amazing speed.
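As a rough illustration of the columnar idea (a toy sketch, not Redshift internals), compare summing one column when the same table is stored row-wise versus column-wise:

```python
# Toy sketch, not Redshift internals: the same table stored two ways.
rows = [  # row-oriented: every row carries every column
    {"order_id": 1, "customer": "a", "amount": 10.0, "notes": "..."},
    {"order_id": 2, "customer": "b", "amount": 25.0, "notes": "..."},
    {"order_id": 3, "customer": "a", "amount": 5.0,  "notes": "..."},
]
# Summing one column still walks every field of every row.
total_row = sum(r["amount"] for r in rows)

columns = {  # column-oriented: one sequence per column
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount":   [10.0, 25.0, 5.0],
    "notes":    ["...", "...", "..."],
}
# The same aggregate touches only the "amount" column; on disk this
# layout also compresses much better, which is the columnar win for
# "wide" data-warehouse tables.
total_col = sum(columns["amount"])

assert total_row == total_col == 40.0
```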
The trade-off, however, is that while Redshift is amazing for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc.) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (eg ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
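Partitioning files in S3 is typically done with Hive-style key prefixes, so the engine can skip whole prefixes when a query filters on the partition columns. A small sketch (table and file names are hypothetical):

```python
# Hive-style partition prefixes (table/column names are hypothetical).
def partition_key(table, year, month, filename):
    return f"{table}/year={year:04d}/month={month:02d}/{filename}"

key = partition_key("sales", 2023, 5, "part-0001.parquet")
print(key)  # sales/year=2023/month=05/part-0001.parquet

# A query filtering on  year = 2023 AND month = 5  lets the engine
# read only objects under sales/year=2023/month=05/ and skip the rest.
```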
The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3 and use caching to run fast queries. The benefit is that this separates storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage; there is no link to S3 (although Amazon Redshift Spectrum also lets you query data held in S3, similar to AWS Athena)
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow
is a great data warehouse tool, so just use it for that!
We're building a Lambda architecture on the AWS stack. A lack of devops knowledge forces us to prefer AWS managed solutions over custom deployments.
Our workflow:
[Batch layer]
Kinesis Firehose -> S3 -Glue-> EMR (Spark) -Glue-> S3 views -----+
|===> Serving layer (ECS) => Users
Kinesis -> EMR (Spark Streaming) -> DynamoDB/ElastiCache views --+
[Speed layer]
We are already using 3 datastores: ElastiCache, DynamoDB and S3 (queried with Athena). The batch layer produces from 500,000 up to 6,000,000 rows each hour. Only the last hour's results should be queried by the serving layer, with low-latency random reads.
None of our databases fits the batch-insert & random-read requirements. DynamoDB doesn't fit batch inserts: it's too expensive because of the throughput required for them. Athena is MPP and, moreover, has a limit of 20 concurrent queries. ElastiCache is used by the streaming layer, and I'm not sure it's a good idea to perform batch inserts there.
Should we introduce a fourth storage solution or stay with the existing ones?
Considered options:
Persist batch output to DynamoDB and ElastiCache (the part of the data that is updated rarely and can be compressed/aggregated goes to DynamoDB; frequently updated data, ~8 GB/day, goes to ElastiCache).
Introduce another database (HBase on EMR over S3, or Amazon Redshift?) as a solution.
Use S3 Select over Parquet to overcome Athena's concurrent-query limit. That would also reduce query latency. But does S3 Select have any concurrent query limits? I can't find any related info.
The first option is bad because of batch inserts into the ElastiCache instance used by streaming. Also, does keeping batch- and speed-layer views in the same data stores follow the Lambda architecture?
The second option is bad because it adds a fourth database storage, isn't it?
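The DynamoDB cost concern in the question can be put into numbers with quick arithmetic, assuming standard writes of items no larger than 1 KB (so one write consumes one write capacity unit):

```python
# Back-of-the-envelope DynamoDB write-throughput estimate for the
# batch layer's worst-case hour (figures from the question above).
rows_per_hour = 6_000_000
writes_per_second = rows_per_hour / 3600   # sustained writes/s

# Assumption: standard writes of items <= 1 KB, so each write
# consumes 1 write capacity unit (WCU).
wcu_needed = writes_per_second
print(round(wcu_needed))  # 1667
```

Sustaining ~1667 provisioned WCUs just for the hourly batch load is the kind of cost that makes DynamoDB unattractive here.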
In this case you might want to use something like HBase or Druid; not only can they handle batch inserts and very low latency random reads, they could even replace the DynamoDB/ElastiCache component from your solution, since you can write directly to them from the incoming stream (to a different table).
Druid is probably superior for this, but as per your requirements, you'll want HBase, as it is available on EMR with the Amazon Hadoop distribution, whereas Druid doesn't come in a managed offering.
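For reference on the S3 Select option mentioned in the question: S3 Select is invoked per object with boto3's select_object_content. This sketch only builds the request parameters (the bucket, key, and SQL are hypothetical), leaving the actual AWS call commented out so it stays self-contained:

```python
# Sketch only: builds the kwargs for a hypothetical S3 Select call
# over a Parquet file. The names below are invented for illustration.
def build_select_request(bucket, key, sql):
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"JSON": {}},
    }

req = build_select_request(
    "my-views-bucket",            # hypothetical bucket
    "views/latest_hour.parquet",  # hypothetical key
    "SELECT s.user_id, s.score FROM S3Object s WHERE s.score > 100",
)
# With credentials configured, the call would be:
# boto3.client("s3").select_object_content(**req)
```

Note that S3 Select operates on a single object per call, so any fan-out across your weekly/hourly files has to be orchestrated client-side.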
I'm looking at using AWS Redshift to let users submit queries against old archived data which isn't available in my web page.
The total data I'm dealing with across all my users is a couple of terabytes. The data is already in an S3 bucket, split up into files by week. Most requests won't deal with more than a few files totaling 100 GB.
To keep costs down, should I use snapshots and delete the cluster when not in use, or should I have a smaller cluster which doesn't hold all of the data and only COPY data from S3 into a temporary table when running a query?
If you are just doing occasional queries where cost is more important than speed, you could consider using Amazon Athena, which can query data stored in Amazon S3. (Only in some AWS regions at the moment.) You are only charged for the amount of data read from disk.
To gain an appreciation for making Athena even better value, see: Analyzing Data in S3 using Amazon Athena
Amazon Redshift Spectrum can perform a similar job to Athena but requires an Amazon Redshift cluster to be running.
All other choices are really a trade-off between cost and access to your data. You could start by taking a snapshot of your Amazon Redshift database and then turning it off at night and on the weekends. Then, have a script that can restore it automatically for queries. Use fewer nodes to reduce costs -- this will make queries slower, but that doesn't seem to be an issue for you.
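The savings from shutting the cluster down off-hours are easy to estimate. The per-node rate, node count, and schedule below are illustrative assumptions, not current AWS prices:

```python
# Hypothetical figures: per-node on-demand rate and node count are
# assumptions for illustration, not actual AWS pricing.
rate_per_node_hour = 0.25   # assumed USD per node-hour
nodes = 4

hours_always_on = 24 * 7    # 168 hours in a week
hours_business = 10 * 5     # e.g. run 10 h/day, weekdays only

cost_always_on = rate_per_node_hour * nodes * hours_always_on  # 168.0
cost_paused = rate_per_node_hour * nodes * hours_business      # 50.0

print(cost_always_on - cost_paused)  # weekly saving: 118.0
```

Under these assumptions, snapshot-and-restore cuts the compute bill by roughly 70%, at the cost of restore time before each query session.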
I've seen many environments where critical data is backed up to Amazon S3 and it is assumed that this will basically never fail.
I know that Amazon reports that data stored in S3 has 99.999999999% durability (11 9's), but one thing that I'm struck by is the following passage from the AWS docs:
Amazon S3 provides a highly durable storage infrastructure designed
for mission-critical and primary data storage. Objects are redundantly
stored on multiple devices across multiple facilities in an Amazon S3
region.
So, S3 objects are only replicated within a single AWS region. Say there's an earthquake in N. California that decimates the whole region. Does that mean N. California S3 data has gone with it?
I'm curious what others consider best practices with respect to persisting mission-critical data in S3?
Given that S3 has 99.999999999% durability [1], what is the equivalent figure for DynamoDB?
[1] http://aws.amazon.com/s3/
This question implies something that is incorrect. Though S3 has an SLA (aws.amazon.com/s3-sla), that SLA covers availability (99.9%); it makes no reference to durability or the loss of objects in S3.
The 99.999999999% durability figure comes from Amazon's estimate of what S3 is designed to achieve and there is no related SLA.
Note that Amazon S3 is designed for 99.99% availability but the SLA kicks in at 99.9%.
There is no current DynamoDB SLA from Amazon, nor am I aware of any published figures from Amazon on the expected or designed durability of data in DynamoDB. I would suspect that it is less than S3 given the nature, relative complexities, and goals of the two systems (i.e., S3 is designed to simply store data objects very, very safely; DynamoDB is designed to provide super-fast reads and writes in a scalable distributed database while also trying to keep your data safe).
Amazon talks about customers backing up DynamoDB to S3 using MapReduce. They also say that some customers back up DynamoDB using Redshift, which has DynamoDB compatibility built in. I additionally recommend backing up to an off-AWS store to remove the single point of failure that is your AWS account.
Although the DynamoDB FAQ doesn't use the exact same wording, as you can see from my highlights below, both DynamoDB & S3 are designed to be fault tolerant, with data stored in three facilities.
I wasn't able to find exact figures reported anywhere, but from the information I did find, it looks like DynamoDB is pretty durable (on par with S3), although that won't stop it from having service interruptions from time to time. See this link:
http://www.forbes.com/sites/kellyclay/2013/02/20/amazons-aws-experiencing-problems-again/
S3 FAQ: http://aws.amazon.com/s3/faqs/#How_is_Amazon_S3_designed_to_achieve_99.999999999%_durability
Q: How durable is Amazon S3? Amazon S3 is designed to provide
99.999999999% durability of objects over a given year.
In addition, Amazon S3 is designed to sustain the concurrent loss of
data in two facilities.
Also Note: The "99.999999999%" figure for S3 is over a given year.
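As a back-of-the-envelope reading of that 11-nines annual figure (plain arithmetic, nothing more):

```python
# Expected annual object loss at 99.999999999% (11 nines) durability.
durability = 0.99999999999
annual_loss_rate = 1 - durability      # ~1e-11 per object per year

objects_stored = 10_000
expected_losses_per_year = objects_stored * annual_loss_rate  # ~1e-7

# Storing 10,000 objects, you would expect to lose a single object
# roughly once every 10,000,000 years on average.
years_per_single_loss = 1 / expected_losses_per_year
print(round(years_per_single_loss))  # ~10 million
```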
DynamoDB FAQ: http://aws.amazon.com/dynamodb/faqs/#Is_there_a_limit_to_how_much_data_I_can_store_in_Amazon_DynamoDB
Scale, Availability, and Durability
Q: How highly available is Amazon DynamoDB?
The service runs across Amazon’s proven, high-availability data
centers. The service replicates data across three facilities in an AWS
Region to provide fault tolerance in the event of a server failure or
Availability Zone outage.
Q: How does Amazon DynamoDB achieve high uptime and durability?
To achieve high uptime and durability, Amazon DynamoDB synchronously
replicates data across three facilities within an AWS Region.