We're building Lambda architecture on AWS stack. A lack of devops knowledge forces us to prefer AWS managed solution over custom deployments.
Our workflow:
[Batch layer]
Kinesys Firehouse -> S3 -Glue-> EMR (Spark) -Glue-> S3 views -----+
|===> Serving layer (ECS) => Users
Kinesys -> EMR (Spark Streaming) -> DynamoDB/ElasticCache views --+
[Speed layer]
We have already using 3 datastores: ElasticCache, DynamoDB and S3 (queried with Athena). Bach layer produce from 500,000 up to 6,000,000 row each hour. Only last hour results should be queried by serving layer with low latency random reads.
Neither of our databases fits batch-insert & random-read requirements. DynamoDB not fit batch-insert - it's too expensive because of throughput required for batch inserts. Athena is MPP and moreover has limitation of 20 concurrent queries. ElasticCache is used by streaming layer, not sure if it's good idea to perform batch inserts there.
Should we introduce the fourth storage solution or stay with existing?
Considered options:
Persist batch output to DynamoDB and ElasticCache (part of data that is updated rarely and can be compressed/aggregated goes to DynamoDB; frequently updated data ~8GB/day goes to elasticCache).
Introduce another database (HBase on EMR over S3/ Amazon redshift?) as a solution
Use S3 Select over parquet to overcome Athena concurrent query limits. That will also reduce query latency. But have S3 Select any concurrent query limits? I can't find any related info.
The first option is bad because of batch insert to ElasticCache used by streaming. Also does it follow Lambda architecture - keeping batch and speed layer views in the same data stores?
The second solution is bad because of the fourth database storage, isn't it?
In this case you might want to use something like HBase or Druid; not only can they handle batch inserts and very low latency random reads, they could even replace the DynamoDB/ElastiCache component from your solution, since you can write directly to them from the incoming stream (to a different table).
Druid is probably superior for this, but as per your requirements, you'll want HBase, as it is available on EMR with the Amazon Hadoop distribution, whereas Druid doesn't come in a managed offering.
Related
I am trying to consume some data in redshift using sagemaker to train some model. After some research, I found the best way to do so is first unloading the data from redshift to an S3 bucket. I assume sagemaker has API to directly interact with redshift, but why do we need to first unload it to an S3 bucket?
UNLOADing is a best practice and generally the method that the docs will promote. This is due to efficiency and performance. Redshift is a cluster with a single leader and multiple compute nodes. S3 is a cluster - a distributed object store. Having multiple compute nodes connect to S3 when moving data is far faster and less of a burden to the database.
Also, tools that you may be using with sagemaker (like EMR) are also clusters and will also benefit from multiple parallel connections to S3.
The larger the amount of data being moved the greater this benefit will be.
We're building a multi-tenant SaaS application hosted on AWS that exposes and visualizes data in the front end via a REST api.
Now, for storage we're considering using AWS Redshift (Cluster or Serverless?) and then exposing the data using API Gateway and Lambda with the Redshift Data API.
The reason why I'm inclined to using Redshift as opposed to e.g RDS is that it seems like a nice option to also be able to conduct data experiments internally when building our product.
My question is, would this be considered a good strategy?
Redshift is sized for very large data and tables. For example the minimum storage size is 1MB. That's 1MB for every column and across all the slices (minimum 2). A table with 5 columns and just a few rows will take 26MB on the smallest Redshift cluster size (default distribution style). Redshift shines when your tables have 10s of millions of rows minimum. It isn't clear from your case that you will have the data sizes that will run efficiently on Redshift.
The next concern would be about your workload. Redshift is a powerful analytics engine but is not designed for OLTP workloads. High volumes of small writes will not perform well; it wants batch writes. High concurrency of light reads will not work as well as a database designed for that workload.
At low levels of work Redshift can do these things - it is a database. But if you use it in a way it isn't optimized for it likely isn't the most cost effective option and won't scale well. If job A is the SAS workload and analytics is job B, then choose the right database for job A. If this choice cannot do job B at the performance level you need then add an analytics engine to the mix.
My $.02 and I'm the Redshift guy. If my assumptions about your workload are wrong please update with specific info.
I am looking for advice which service should I use. I am new to big data and confused with differences between them on AWS.
Use case:
I receive 60-100 csv files daily (each one can be from few MB to few GB). There are six corresponding schemas, and each file can be treated as part of only one table.
I need to load those files to the six database tables and execute joins between them and generate daily output. After generation of the output, the data present in database is no longer need, so we can truncate that tables and await on the next day.
Files have predictable naming patterns:
A_<timestamp1>.csv goes to A table
A_<timestamp2>.csv goes to A table
B_<timestamp1>.csv goes to B table
etc ...
Which service could be used for that purpose?
AWS Redshift (execute here joins)
AWS Glue (load to redshift)
AWS EMR (spark)
or maybe something else? I heard that spark could be used to do the joins, but what is the proper, optimal and performant way of doing that?
Edit:
Thanks for the responses. I see two options for now:
Use AWS Glue, setup 6 crawlers which will load on trigger files to specific AWS Glue Data Catalogs, execute SQL joins with Athena
Use AWS Glue, setup 6 crawlers which will load on trigger files to specific AWS Glue Data Catalogs, trigger spark job (AWS Glue in serverless form) to do the SQL joins and setup output to the S3.
Edit 2:
But according to the: https://carbonrmp.com/knowledge-hub/tech-engineering/athena-vs-spark-lessons-from-implementing-a-fully-managed-query-system/
Presto is designed for low latency and uses a massively parallel processing (MPP) approach which is fast but requires everything to happen at once and in memory. It’s all or nothing, if you run out of memory, then “Query exhausted resources at this scale factor”. Spark is designed for scalability and follows a map-reduce design [1]. The job is split and processed in chunks, which are generally processed in batches. If you double the workload without changing the resource, it should take twice as long instead of failing [2]
So Athena (aka Presto) is not scalable as much as I want. I've seen "Query exhausted resources at this scale factor" for my case.
Any possibility of changing the file type to a columnar format like parquet? Then you can use AWS EMR and spark should be able to handle the joins easily. Obviously, you need to optimize the query depending on the data/cluster size etc.
I have a use case wherein I want to take a data from DynamoDB and do some transformation on the data. After this I want to create 3 csv files (there will be 3 transformations on the same data) and dump them to 3 different s3 locations.
My architecture would be sort of following:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was using aws lamda. But according to my understanding, it's event based triggered rather time based scheduling, is that correct?
Yes it is possible but not using HiveActivity instead EMRActivity. If you look into Data pipeline documentation for HiveActivity, it clearly states its purpose and not suits your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look like. There is also a inbuilt template Export DynamoDB table to S3 in UI for AWS Data Pipeline which creates the basic structure for you, and then you can extend/customize to suit your requirements.
To your next question using Lambda, Of course lambda can be configured to have event based triggering or schedule based triggering, but I wouldn't recommend using AWS Lambda for any ETL operations as they are time bound & usual ETLs are longer than lambda time limits.
AWS has specific optimized feature offerings for ETLs, AWS Data Pipeline & AWS Glue, I would always recommend to choose between one of two. In case your ETL involves data sources not managed within AWS compute and storage services OR any speciality use case which can't be sufficed by above two options, then AWS Batch will be my next consideration.
Thanks amith for your answer. I have been busy for quite some time now. I did some digging after you posted your answer. Turns out we can dump the data to different s3 locations using Hive activity as well.
This is how the data pipeline would like in that case.
But I believe writing multiple hive activities, when your input source is DynamoDB table, is not a good idea since hive doesn't load any data in memory. It does all the computations on the actual table which could deteriorate the performance of the table. Even documentation suggests to export the data incase you need to make multiple queries to same data. Reference
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different type of aggregations on the same data once a day. Since dynamoDB doesn't support aggregations, I turned to Data pipeline using Hive. In the end we ended up using AWS Aurora which is My-SQL based.
I am considering using AWS DynamoDB for an application we are building. I understand that setting a backup job that exports data from DynamoDB to S3 involves a data pipeline with EMR. But my question is do I need to worry about having a backup job set up on day 1? What are the chances that a data loss would happen?
There are multiple use-cases for DynamoDB table data copy elsewhere:
(1) Create a backup in S3 on a daily basis, in order to restore in case of accidental deletion of data or worse yet drop table (code bugs?)
(2) Create a backup in S3 to become the starting point of your analytics workflows. Once this data is backed up in S3, you can combine it with, say, your RDBMS system (RDS or on-premise) or other S3 data from log files. Data Integration workflows could involve EMR jobs to be ultimately loaded into Redshift (ETL) for BI queries. Or directly load these into Redshift to do more ELT style - so transforms happen within Redshift
(3) Copy (the whole set or a subset of) data from one table to another (either within the same region or another region) - so the old table can be garbage collected for controlled growth and cost containment. This table-to-table copy could also be used as a readily consumable backup table in case of, say region-specific availability issues. Or, use this mechanism to copy data from one region to another to serve it from an endpoint closer to the DynamoDB client application that is using it.
(4) Periodic restore of data from S3. Possibly as a way to load back post-analytics data back into DynamoDB for serving it in online applications with high-concurrency, low-latency requirements.
AWS Data Pipeline helps schedule all these scenarios with flexible data transfer solutions (using EMR underneath).
One caveat when using these solutions is to note that this is not a point-in-time backup: so any changes to the underlying table happening during the backup might be inconsistent.
This is really subjective. IMO you shouldn't worry about them 'now'.
You can also use simpler solutions other than pipleline. Perhaps that will be a good place to start.
After running DynamoDB as our main production database for more than a year I can say it is a great experience. No data loss and no downtime. The only thing that we care about is sometimes SDK misbehaves and tweaking provisioned throughput.
data pipeline has limit regions.
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
I would recommend setting up a Data pipeline to backup on daily basis to an S3 bucket - If you want to be really safe.
Dynamo DB itself might be very reliable, but nobody can protect you from your own accidental deletions (what if by mistake you or your colleague ended up deleting a table from the console). So I would suggest setup a backup on daily basis - It doesn't any case cost so much.
You can tell the Pipeline to only consume say may 25% of the capacity while backup is going on so that your real users don't see any delay. Every backup is "full" (not incremental), so in some periodic interval, you can delete some old backups if you are concerned about storage.