Using AWS DynamoDB or Redshift to store analytics data

I would like to ask which service would suit me best. For example, a Facebook-like mobile app where I need to track every movement of a user, such as the pages visited or links clicked.
I am thinking of using DynamoDB to create multiple tables to track each different activity. When I run my analytics app, it will query all the data from each table (same hash key but different range keys, so I can query all the data) and compute the results in the app. So the main cost is the read throughput, which can easily be 250 reads/s (~$28/mth) for each table. Storage for each table has no limit, so is it free?
For Redshift, I would be paying for the storage size on a 100%-utilized-per-month basis for 160GB, which will cost me about $14.62/mth. Although it looks cheaper, I am not familiar with Redshift, so I am not sure what other hidden costs there are.
Thanks in advance!

Pricing for Amazon DynamoDB has several components:
Provisioned Throughput Capacity (the speed of the tables)
Indexed Data Storage (the cost of storing data)
Data Transfer (for data going from AWS to the Internet)
For example, 100GB of data storage would cost around $25.
If you want 250 reads/second, read capacity costs $0.0065 per hour for every 50 units, which is $0.0065 * 5 * 24 hours * 30 days = $23.40 (plus some write capacity units).
Pricing for Amazon Redshift is based upon the number and type of nodes. A 160GB dc1.large node would cost 25c/hour * 24 hours * 30 days = $180 per node (but only one node is probably required for your situation).
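As a quick sanity check, here is the same arithmetic in a few lines of Python, using the example rates quoted above (illustrative only; actual AWS prices change over time):

    # Back-of-the-envelope monthly costs from the example rates above
    # (illustrative only; check the current AWS pricing pages).
    HOURS_PER_MONTH = 24 * 30

    # DynamoDB: $0.0065 per hour for each block of 50 read capacity units
    read_units = 250
    dynamodb_reads = 0.0065 * (read_units / 50) * HOURS_PER_MONTH   # $23.40

    # Redshift: one dc1.large node at $0.25 per hour
    redshift_node = 0.25 * HOURS_PER_MONTH                          # $180.00

    print(f"DynamoDB read throughput: ${dynamodb_reads:.2f}/month")
    print(f"Redshift single node:     ${redshift_node:.2f}/month")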
Amazon Redshift therefore comes out as more expensive, but it is also a more-featured system. You can run complex SQL against Amazon Redshift, whereas you would have to write an application to retrieve, join and compute information from DynamoDB. Think of DynamoDB as a storage service, while Redshift is also a querying service.
The real decision, however, should be based on how you are going to use the data. If you can create an application that will work with DynamoDB, then use it. However, many people find the simplicity of using SQL on Redshift to be much easier.
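To make that difference concrete, here is a hypothetical Python sketch of the application-side work DynamoDB requires for a simple aggregation, next to the single SQL statement you would run on Redshift. The table, key and attribute names (user_events, user_id, event_type) are invented:

    # Sketch only: aggregate a user's activity by paging through a DynamoDB query.
    # Table, key and attribute names below are hypothetical.
    import boto3
    from boto3.dynamodb.conditions import Key
    from collections import Counter

    table = boto3.resource("dynamodb").Table("user_events")

    def count_events(user_id):
        """Query every item for one hash key and aggregate in the application."""
        counts = Counter()
        kwargs = {"KeyConditionExpression": Key("user_id").eq(user_id)}
        while True:
            page = table.query(**kwargs)
            for item in page["Items"]:
                counts[item.get("event_type", "unknown")] += 1
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
        return counts

    # The Redshift equivalent is a single statement:
    # SELECT event_type, COUNT(*) FROM user_events WHERE user_id = 'u123' GROUP BY event_type;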

Related

Using AWS Redshift for building a Multi-tenant SaaS application

We're building a multi-tenant SaaS application hosted on AWS that exposes and visualizes data in the front end via a REST API.
Now, for storage we're considering using AWS Redshift (Cluster or Serverless?) and then exposing the data using API Gateway and Lambda with the Redshift Data API.
The reason why I'm inclined to use Redshift as opposed to e.g. RDS is that it seems like a nice option to also be able to conduct data experiments internally while building our product.
My question is, would this be considered a good strategy?
Redshift is sized for very large data and tables. For example the minimum storage size is 1MB. That's 1MB for every column and across all the slices (minimum 2). A table with 5 columns and just a few rows will take 26MB on the smallest Redshift cluster size (default distribution style). Redshift shines when your tables have 10s of millions of rows minimum. It isn't clear from your case that you will have the data sizes that will run efficiently on Redshift.
The next concern would be about your workload. Redshift is a powerful analytics engine but is not designed for OLTP workloads. High volumes of small writes will not perform well; it wants batch writes. High concurrency of light reads will not work as well as a database designed for that workload.
At low levels of work Redshift can do these things - it is a database. But if you use it in a way it isn't optimized for, it likely isn't the most cost-effective option and won't scale well. If job A is the SaaS workload and analytics is job B, then choose the right database for job A. If this choice cannot do job B at the performance level you need, then add an analytics engine to the mix.
My $.02 and I'm the Redshift guy. If my assumptions about your workload are wrong please update with specific info.
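If you do prototype on Redshift, a sketch like the one below shows how the Redshift Data API mentioned in the question can report how much space each table really occupies, which makes the per-column, per-slice minimum-block overhead described above visible. The cluster identifier, database and user are placeholders:

    # Sketch only: list table sizes via the Redshift Data API (boto3).
    import time
    import boto3

    client = boto3.client("redshift-data")

    resp = client.execute_statement(
        ClusterIdentifier="my-cluster",     # placeholder
        Database="dev",                     # placeholder
        DbUser="awsuser",                   # placeholder
        Sql='SELECT "table", size FROM svv_table_info ORDER BY size DESC;',
    )
    statement_id = resp["Id"]

    # The Data API is asynchronous, so poll until the statement completes.
    while client.describe_statement(Id=statement_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

    for record in client.get_statement_result(Id=statement_id)["Records"]:
        table_name, size_mb = record[0]["stringValue"], record[1]["longValue"]
        print(f"{table_name}: {size_mb} MB")   # svv_table_info reports size in 1 MB blocks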

Can I have less than 2.5TB of disk for a BigTable node?

In the GCP user interface I can estimate the pricing for whatever disk size I wish to use, but when I want to create my BigTable instance I can only choose the number of nodes and each node comes with 2.5TB of SSD or HDD disk.
Is there a way to, for example, set up a BigTable cluster with 1 node and 1TB of SSD instead of the default 2.5TB?
Even in the GCP pricing calculator I can change the disk size, but I can't find where to configure it when creating the cluster (https://cloud.google.com/products/calculator#id=2acfedfc-4f5a-4a9a-a5d7-0470d7fa3973)
Thanks
If you only want a 1TB database, then only write 1TB and you'll be charged accordingly.
From the Bigtable pricing documentation:
Cloud Bigtable frequently measures the average amount of data in your Cloud Bigtable tables during a short time interval. For billing purposes, these measurements are combined into an average over a one-month period, and this average is multiplied by the monthly rate. You are billed only for the storage you use, including overhead for indexing and Cloud Bigtable's internal representation on disk. For instances that contain multiple clusters, Cloud Bigtable keeps a separate copy of your data with every cluster, and you are charged for every copy of your data.
When you delete data from Cloud Bigtable, the data becomes inaccessible immediately; however, you are charged for storage of the data until Cloud Bigtable compacts the table. This process typically takes up to a week.
In addition, if you store multiple versions of a value in a table cell, or if you have set an expiration time for one of your table's column families, you can read the obsolete and expired values until Cloud Bigtable completes garbage collection for the table. You are also charged for the obsolete and expired values prior to garbage collection. This process typically takes up to a week.
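To illustrate the point, here is a hypothetical sketch assuming the google-cloud-bigtable Python client (project, instance and cluster IDs are placeholders): creating a cluster only asks for a node count and a storage type; there is no disk-size parameter, and storage is billed on what you actually write.

    # Sketch only: create a single-node Bigtable cluster. Note there is no
    # "disk size" argument; you choose nodes and SSD/HDD, nothing else.
    from google.cloud import bigtable
    from google.cloud.bigtable import enums

    client = bigtable.Client(project="my-project", admin=True)

    instance = client.instance(
        "my-instance",
        display_name="Analytics",
        instance_type=enums.Instance.Type.PRODUCTION,
    )
    cluster = instance.cluster(
        "my-instance-c1",
        location_id="us-central1-b",
        serve_nodes=1,                                # number of nodes
        default_storage_type=enums.StorageType.SSD,   # SSD or HDD, not a size
    )

    operation = instance.create(clusters=[cluster])
    operation.result(timeout=300)   # wait for the instance to become ready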

Lambda architecture on AWS: choose database for batch layer

We're building a Lambda architecture on the AWS stack. A lack of devops knowledge forces us to prefer AWS managed solutions over custom deployments.
Our workflow:
[Batch layer]
Kinesis Firehose -> S3 -Glue-> EMR (Spark) -Glue-> S3 views -------+
                                                                   |===> Serving layer (ECS) => Users
Kinesis -> EMR (Spark Streaming) -> DynamoDB/ElastiCache views ----+
[Speed layer]
We are already using 3 datastores: ElastiCache, DynamoDB and S3 (queried with Athena). The batch layer produces from 500,000 up to 6,000,000 rows each hour. Only the last hour's results should be queried by the serving layer with low-latency random reads.
None of our databases fits the batch-insert and random-read requirements. DynamoDB does not fit batch inserts - it's too expensive because of the throughput required for them. Athena is MPP and, moreover, has a limit of 20 concurrent queries. ElastiCache is used by the streaming layer, and I'm not sure it's a good idea to perform batch inserts there.
Should we introduce the fourth storage solution or stay with existing?
Considered options:
Persist batch output to DynamoDB and ElastiCache (the part of the data that is updated rarely and can be compressed/aggregated goes to DynamoDB; frequently updated data, ~8GB/day, goes to ElastiCache).
Introduce another database (HBase on EMR over S3 / Amazon Redshift?) as a solution.
Use S3 Select over Parquet to overcome Athena's concurrent query limits. That would also reduce query latency. But does S3 Select have any concurrent query limits? I can't find any related info.
The first option is bad because of batch inserts into ElastiCache, which is used by the streaming layer. Also, does it follow the Lambda architecture to keep batch and speed layer views in the same data stores?
The second option is bad because it adds a fourth data store, isn't it?
In this case you might want to use something like HBase or Druid; not only can they handle batch inserts and very low latency random reads, they could even replace the DynamoDB/ElastiCache component from your solution, since you can write directly to them from the incoming stream (to a different table).
Druid is probably superior for this, but as per your requirements, you'll want HBase, as it is available on EMR with the Amazon Hadoop distribution, whereas Druid doesn't come in a managed offering.
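For a feel of how HBase covers both sides of that requirement, here is a hypothetical sketch using the happybase client against an HBase Thrift server on the EMR master node (the host, table name, column family and sample row keys are invented):

    # Sketch only: batch inserts plus low-latency random reads against HBase on EMR.
    # Assumes the HBase Thrift server is running on the EMR master node.
    import happybase

    connection = happybase.Connection(host="emr-master-node")   # placeholder host
    table = connection.table("views")                           # hypothetical table

    # Batch layer: write the hourly output in large batches.
    hourly_results = [("user42#2020010112", 17), ("user99#2020010112", 3)]   # sample output
    with table.batch(batch_size=10000) as batch:
        for row_key, view_count in hourly_results:
            batch.put(row_key.encode(), {b"v:count": str(view_count).encode()})

    # Serving layer: low-latency random read of a single key.
    row = table.row(b"user42#2020010112")
    print(int(row[b"v:count"]))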

Using AWS To Process Large Amounts Of Data With Serverless

I have about 300,000 transactions for each user in my DynamoDB database.
I would like to calculate the taxes based on those transactions in a serverless manner, if that is the cheapest way.
My thought process was that I should use AWS Step Functions to grab all of the transactions, store them in Amazon S3 as a CSV file, then use AWS Step Functions to iterate over each row of the CSV file. The problem is that once I read a row of the CSV, I would have to keep it in memory so that I can use it for later calculations. If this Lambda function runs out of time, I have no way to save the state, so this route is not feasible.
Another route, which would be expensive, is to keep two copies of each transaction in DynamoDB and perform the operations on the copy table, keeping the original data untouched. The problem with this is that the DynamoDB table is eventually consistent, so there could be a scenario where I read a dirty item.
Serverless is ideal for event-driven processing but for your batch use-case, it is probably easier to use an EC2 instance.
An Amazon EC2 t2.nano instance is under 1c/hour, as is a t2.micro instance with spot pricing, and both are billed per second.
There really isn't enough detail here to make a good suggestion. For example, how is the data organized in your DynamoDB table? How often do you plan on running this job? How quickly do you need the job to complete?
You mentioned price so I'm assuming that is the biggest factor for you.
Lambda tends to be cheapest for event-driven processing. The idea is that with any EC2/ECS event-driven system you would need to over-provision by some amount to handle spikes in traffic. The over-provisioned compute power is idle most of the time, but you still pay for it. In the case of Lambda, you pay a little more for the compute power, but you save money by needing less of it, since you don't need to over-provision.
Batch processing systems tend to lend themselves nicely to EC2 since they typically use 100% of the compute power throughout the duration of the job. At the end of the job, you shutdown all of the instances and you don't pay for them anymore. Also, if you use spot pricing, you can really push the price of your compute power down.
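As a hypothetical sketch of that batch approach (the bucket, key, column name and tax rate are all placeholders), a job run on a cheap or spot EC2 instance can simply pull the exported CSV from S3 and hold the running aggregate in memory, with no Lambda time limit to worry about:

    # Sketch only: aggregate an S3-exported CSV on an EC2 instance.
    # Bucket, key, column name and tax rate are hypothetical.
    import csv
    import io
    import boto3

    s3 = boto3.client("s3")

    obj = s3.get_object(Bucket="my-tax-exports", Key="user-123/transactions.csv")
    body = obj["Body"].read().decode("utf-8")        # fine for ~300,000 rows; stream for bigger files
    reader = csv.DictReader(io.StringIO(body))

    total = 0.0
    for row in reader:
        total += float(row["amount"])                # running aggregate held in memory on the instance

    tax_owed = total * 0.25                          # placeholder tax rate
    print(f"total={total:.2f}  tax={tax_owed:.2f}")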

Redshift COPY or snapshots?

I'm looking at using AWS Redshift to let users submit queries against old archived data which isn't available in my web page.
The total data I'm dealing with across all my users is a couple of terabytes. The data is already in an S3 bucket, split up into files by week. Most requests won't deal with more than a few files totaling 100GB.
To keep costs down, should I use snapshots and delete the cluster when not in use, or should I have a smaller cluster which doesn't hold all of the data and only COPY data from S3 into a temporary table when running a query?
If you are just doing occasional queries where cost is more important than speed, you could consider using Amazon Athena, which can query data stored in Amazon S3. (Only in some AWS regions at the moment.) You are only charged for the amount of data read from disk.
For tips on making Athena even better value, see: Analyzing Data in S3 using Amazon Athena
Amazon Redshift Spectrum can perform a similar job to Athena but requires an Amazon Redshift cluster to be running.
All other choices are really a trade-off between cost and access to your data. You could start by taking a snapshot of your Amazon Redshift database and then turning it off at night and on the weekends. Then, have a script that can restore it automatically for queries. Use fewer nodes to reduce costs -- this will make queries slower, but that doesn't seem to be an issue for you.
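Here is a minimal sketch of that snapshot-and-restore script using boto3 (the cluster and snapshot identifiers are placeholders):

    # Sketch only: "snapshot at night, restore on demand" for a Redshift cluster.
    import boto3

    redshift = boto3.client("redshift")

    def shut_down(cluster_id, snapshot_id):
        """Take a final snapshot and delete the cluster so it stops costing money."""
        redshift.delete_cluster(
            ClusterIdentifier=cluster_id,
            SkipFinalClusterSnapshot=False,
            FinalClusterSnapshotIdentifier=snapshot_id,
        )
        redshift.get_waiter("cluster_deleted").wait(ClusterIdentifier=cluster_id)

    def bring_back(cluster_id, snapshot_id):
        """Restore the cluster from the snapshot before users run their queries."""
        redshift.restore_from_cluster_snapshot(
            ClusterIdentifier=cluster_id,
            SnapshotIdentifier=snapshot_id,
        )
        redshift.get_waiter("cluster_available").wait(ClusterIdentifier=cluster_id)

    # Example usage (placeholders):
    # shut_down("archive-cluster", "archive-nightly")
    # bring_back("archive-cluster", "archive-nightly")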