Sanity check on AWS Big Data Architecture

We're currently looking to move our AWS architecture over to something that supports large amounts of data and can scale as we gain more customers. When this project started we stuck with what we knew: a Ruby app on an EC2 instance making RESTful API calls, storing the results in S3, and also storing everything in RDS. We have a SPA front end written in VueJS to display the stored data.
As our client list has grown, the outbound API calls and the subsequent data we are storing are also growing. I'm currently tasked with looking for a better solution and I wanted to get some feedback on what I was thinking so far. Currently we have around 5 million rows of relational data, which will only increase as our client list does. I could see us being in the low billions of rows in a year or two.
The Ruby app does a great job of queuing the outbound API calls, handling retries, and everything else in between. For this reason we thought about keeping the app and, rather than inserting directly into the RDS, having it simply dump the results into S3 as a CSV.
An S3 event trigger could then invoke a Lambda function to convert the raw CSV data into Parquet format (I was looking at something like PyArrow). From here we could move from the traditional RDS to something like Athena, which supports Parquet and would allow us to reuse most of our existing SQL queries.
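As a rough sketch of what that Lambda might look like (the raw/ and parquet/ prefixes and the handler wiring here are just assumptions, and PyArrow would need to be packaged as a Lambda layer):

```python
import io
import boto3
import pyarrow.csv as pv
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def handler(event, context):
    # Fired by an S3 ObjectCreated event for a CSV dropped by the Ruby app.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Read the CSV straight into an Arrow table.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    table = pv.read_csv(io.BytesIO(body))

    # Write it back out as Parquet under a separate prefix for Athena to query.
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="snappy")
    parquet_key = key.replace("raw/", "parquet/").replace(".csv", ".parquet")
    s3.put_object(Bucket=bucket, Key=parquet_key, Body=buf.getvalue())
```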
To further optimize performance for the user we thought about caching the results of commonly used queries in a DynamoDB table. Because the data is based on the scheduled external API calls, we could control when to bust the cache for those queries.
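A minimal sketch of that caching layer, assuming a hypothetical DynamoDB table named query_cache keyed on a hash of the query:

```python
import json
import time
import boto3

dynamodb = boto3.resource("dynamodb")
cache = dynamodb.Table("query_cache")  # hypothetical table, partition key "query_hash"

def get_cached_result(query_hash, run_query):
    """Return a cached result if present, otherwise run the query and cache it."""
    item = cache.get_item(Key={"query_hash": query_hash}).get("Item")
    if item:
        return json.loads(item["payload"])

    result = run_query()  # e.g. a function that executes the Athena query
    cache.put_item(Item={
        "query_hash": query_hash,
        "payload": json.dumps(result),
        "updated_at": int(time.time()),
    })
    return result

def bust_cache(query_hash):
    # Called after the scheduled external API run refreshes the underlying data.
    cache.delete_item(Key={"query_hash": query_hash})
```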
Big Data backends aren't really my thing, so any feedback is greatly appreciated. I know I have a lot more research to do into Parquet as it's new to me. Eventually we'd like to do some ML on this data, which I believe Parquet will also support. Thanks.

Related

How to efficiently aggregate data in billions of individual records in AWS?

At a high / theoretical level I know exactly the type of architecture I want to build and how it would work, but I'm attempting to construct this as cheaply as possible using AWS services and my lack of familiarity with the offerings of AWS has me running in circles.
The Data
We run a video streaming platform. On busy nights we have about 100 simultaneous live streams going with upwards of 30,000 viewers. We expect this number to rise to 100,000 in the next few years. A live stream lasts, on average, 2 hours.
We send a heartbeat from our player every 10 seconds with information about the viewer -- how much data they've viewed, how much data they've buffered, what quality they're streaming, etc.
These heartbeats are sent directly to an AWS Kinesis endpoint.
Finally, we want to retain all past messages for at least 5 years (hopefully longer) so that we can look at historic analytics.
Some back of the envelope calculations suggest we will have 0.1 * 60 * 60 * 2 * 100000 * 365 * 5 = 131 billion heartbeat messages five years from now.
Our Old Pipeline
Our old system had a single Kinesis consumer. Aggregate data was stored in DynamoDB. Whenever a message arrived we would read the record from DynamoDB, update the record, then write the new record back. This read-update-write loop limited the speed at which we could process messages and made it so that each message coming in was dependent on the messages before it, so they could not be processed in parallel.
Part of the reason for this setup is that our message schema was not well designed from the outset. We send the timestamp at which the message was sent, but we do not send "amount of video watched since last heartbeat". As a result in order to compute the total viewer time we need to look up the last heartbeat message sent by this player, subtract the timestamps, and add that value. Similar issues exist with many other metrics.
Our New Pipeline
We've begun to run into scaling issues. During our peak hours analytics can be delayed by as much as four hours while waiting for a backlog of messages to be processed. If this backlog reaches 24 hours Kinesis will start deleting data. So we need to fix our pipeline to remove this dependency on past messages so we can process them in parallel.
The first part of this was updating the messages sent by our players. Our new specification includes only metrics that can be trivially summed with no subtraction. So we can just keep adding to the "time viewed" metric, for instance, without any regard to past messages.
The second part of this was ensuring that Kinesis never backs up. We dump the raw messages to S3 as quickly as they arrive with no processing (Kinesis Data Firehose) so that we can crunch analytics on them at our leisure.
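For illustration, a hedged sketch of how such a Firehose delivery stream might be configured (the ARNs, names, and buffer settings are placeholders); the buffering hints and date-based prefix also speak to the small-file and partitioning concerns that come up below:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="heartbeats-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/heartbeats",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
    },
    ExtendedS3DestinationConfiguration={
        "BucketARN": "arn:aws:s3:::heartbeat-archive",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        # Buffer up to 128 MB or 5 minutes per object so S3 doesn't fill with tiny files.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
        # Hive-style date prefixes so downstream tools can prune by day/hour.
        "Prefix": "heartbeats/dt=!{timestamp:yyyy-MM-dd}/hr=!{timestamp:HH}/",
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
    },
)
```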
Finally, we now want to actually extract information from these analytics as quickly as possible. This is where I've hit a snag.
The Questions We Want to Answer
As this is an analytics pipeline, our questions mostly revolve around filtering these messages and then aggregating fields for the remaining messages (possibly, in fact likely, with grouping). For instance (a sample query follows this list):
How many Android users watched last night's stream in HD? (FILTER by stream and OS)
What's the average bandwidth usage among all users? (SUM and COUNT, with later division of the final aggregates which could be done on the dashboard side)
What percent of users last year were on any Apple device (iOS, tvOS, etc)? (COUNT, grouped by OS)
What's the average time spent buffering among Android users for streams in the past year? (a mix of all of the above)
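To make the first question concrete, here is roughly how one of these could be run through Athena from Python; the heartbeats table and its columns (viewer_id, os, quality, dt) are assumptions, not an existing schema:

```python
import boto3

athena = boto3.client("athena")

# "How many Android users watched last night's stream in HD?"
query = """
SELECT COUNT(DISTINCT viewer_id) AS hd_android_viewers
FROM heartbeats
WHERE stream_id = 'stream-2024-01-15'
  AND os = 'android'
  AND quality = 'hd'
  AND dt = '2024-01-15'   -- partition column, prunes the S3 scan
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://heartbeat-athena-results/"},
)
```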
Options
AWS Athena would allow us to query the data in S3 directly as if it were an ANSI SQL table. However reading up on Athena, unless the data is properly formatted it can be incredibly slow. Some benchmarks I've seen show that processing 1.1 billion rows of CSV data can take up to 2 minutes. I'm looking at processing 100x that much data
AWS EMR and AWS Redshift sound like they are built for this purpose, but are complicated to set up and have a high base cost to run (requiring an EC2 cluster to remain active at all times). AWS Redshift also requires data be loaded into it, which sounds like it might be a very slow process, delaying our access to analytics
AWS Glue sounds like it may be able to take the raw messages as they arrive in S3 and convert them to Parquet files for more rapid querying via Athena
We could run a job to regularly batch messages to reduce the total number that must be processed. While a stream is live we'll receive one message every 10 seconds, but we really only care about the totals for a given viewer. This means that when a 2-hour stream concludes we can combine the 720 messages we've received from that player into a single "summary" message about the viewer's experience during the whole stream. This would massively reduce the amount of data we need to process, but exactly how and when to trigger this process isn't clear to me
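One hedged sketch of that summarization step: since the counters in the new schema only ever increase, a scheduled job (an EventBridge rule firing a small script or Lambda, say) could collapse each viewer's heartbeats into one row per stream with a grouped MAX. The heartbeat_summaries table and column names below are assumptions, and the table would need to exist in a format Athena can insert into:

```python
import boto3

athena = boto3.client("athena")

# Collapse each viewer's heartbeats for a finished day into one summary row.
# Because the new counters only ever increase, MAX per viewer/stream is the stream total.
rollup = """
INSERT INTO heartbeat_summaries
SELECT
    stream_id,
    viewer_id,
    arbitrary(os)         AS os,
    MAX(seconds_viewed)   AS seconds_viewed,
    MAX(bytes_downloaded) AS bytes_downloaded,
    MAX(seconds_buffered) AS seconds_buffered
FROM heartbeats
WHERE dt = '2024-01-15'
GROUP BY stream_id, viewer_id
"""

athena.start_query_execution(
    QueryString=rollup,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://heartbeat-athena-results/"},
)
```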
The Ideal Architecture
This is a Big Data problem. The generic solution to Big Data problems is "don't take your data to your query, take your query to your data". If these messages were spread across 100 small storage nodes then each node could filter, sum, and count the subset of data they hold and pass these aggregates back to a central node which sums the sums and sums the counts. If each node is only operating on 1/100th of the data set then this kind of processing could theoretically be incredibly fast.
My Confusion
While I have a theoretical understanding of the "ideal" architecture, it's not clear to me if AWS works this way or how to construct a system that will function well like this.
S3 is a black box. It's not clear if Athena queries are run on individual nodes and aggregates are further reduced elsewhere, or if there's a system reading all of the data and aggregating it in a central location
Redshift requires the data be copied into a Redshift database. This doesn't sound fast, nor distributed
It's unclear to me how EMR works or if it will suit my purpose. Still researching
AWS Glue seems like it may need to be triggered by some event?
Parquet files seem to be like CSVs, where multiple records reside in a single file. Meanwhile I'm dumping one record per file. But perhaps there's a way to fix that? e.g. batching files every minute or every 5 minutes?
RDS or a similar service might be really good for this (indexing and whatnot) but would require a guaranteed schema (or necessitate migrating if our message schema changed) which is a concern. Migrating terabytes of data if we change our message schema sounds out of the question
Finally, along with wanting to get analytics results in as "real time" as possible (ideally we want to know within 1 minute when someone joins or leaves a stream), we want the dashboards to load quickly. Waiting 30 seconds to see the count of live viewers is horrendous. Dashboards should load in 2 seconds or less (ideally)
The plan is to use QuickSight to create dashboards (our old system had a hack-y Django app that read from our DynamoDB aggregates table, but I'd like to avoid creating more code for people to maintain)
I expect you are going to get a lot of different answers and opinions from the broad set of experts you have pinged with this. There is likely no single best answer to this as there are a lot of variables. Let me give you my best advice based on my experience in the field.
Kinesis to S3 is a good start and not moving data more than needed is the right philosophy.
You didn't mention Kinesis Data Analytics and this could be a solution for SOME of your needs. It is best for questions about what is happening in the data feed right now. The longer timeframe questions are better suited to the tools you mention. If you aren't too interested in what has happened in the last 10 minutes (or so), it could be fine to omit it.
S3 organization will be key to performing any analytics directly on the data there. You mention Parquet formatting, which is good, but partitioning is far more powerful. Organizing the S3 data into "days" or "hours" of data and setting up the partitioning based on this can greatly speed up any query that is limited to a time range (don't read what you don't need).
Important safety note on S3 - S3 is an object store and as such there is overhead for each object you reference. Having many small objects (10,000+) treated as a single set of data is going to be slow no matter what solution you go with. You need to fix this before you go forward with any solution. You see, it takes upwards of .5 sec to look up an object in S3, but if the file is small the transfer time is next to nothing. Now multiply .5 sec by all the objects you have and see how long it will take to read them. This is not a function of the downstream tool you choose but of the S3 organization you have. S3 objects that are part of a Big Data solution should be at least 100 MB in size to not suffer greatly from the object lookup time. The choice of Parquet or CSV files is moot without addressing object size and partitioning first.
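To illustrate, one common layout is Hive-style date prefixes registered as a partitioned table (for example over Parquet produced from the raw Firehose dump); everything below - bucket, table, columns, partition keys - is illustrative rather than a prescription:

```python
import boto3

athena = boto3.client("athena")

# External table over date/hour-prefixed Parquet objects, e.g.
#   s3://heartbeat-archive/heartbeats/dt=2024-01-15/hr=21/part-0001.parquet
# New partitions still need to be registered (MSCK REPAIR TABLE, ALTER TABLE ADD PARTITION,
# or partition projection).
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS heartbeats (
    viewer_id        string,
    stream_id        string,
    os               string,
    quality          string,
    seconds_viewed   bigint,
    bytes_downloaded bigint,
    seconds_buffered bigint
)
PARTITIONED BY (dt string, hr string)
STORED AS PARQUET
LOCATION 's3://heartbeat-archive/heartbeats/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://heartbeat-athena-results/"},
)
```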
Athena is good for occasional queries especially if the date ranges are limited. Is this the query pattern you expect? As you say "move the compute to the data" but if you use Athena to do large cross-sectional analytics where a large percentage of the data needs to be used, you are just moving the data to Athena every time you execute this query. Don't stop thinking about data movement at the point it is stored - think about the data movements to do the analytics also.
So a big question is how much data is needed and how often to support your analytics workloads and BI functions? This is the end result you are looking for. If a high percentage of the data is needed frequently then a warehouse solution like Redshift with the data loaded to disk is the right answer. The data load time to Redshift is quite fast as it parallel loads the data from S3 (you see S3 is a cluster and Redshift is a cluster and parallel loads can be done). If loading all your data into Redshift is what you need then the load time is not your main concern - the cost is. Big powerful tool with a price tag to match. The new RA3 instance type bends this curve down significantly for large data size clusters so could be a possibility.
Another tool you haven't mentioned is Redshift Spectrum. This brings several powerful technologies together that could be important to you. First is the power of Redshift, with the ability to choose smaller cluster sizes than would normally be used for your data size. S3 filtering and aggregation technology allows Spectrum to perform actions on the data in S3 (yes, the initial compute actions of the query are performed inside of S3, potentially greatly reducing the data moved to Redshift). If your query patterns support this data reduction in S3 then the data movement will be small and the Redshift cluster can be small (cheap) too. This can be a powerful compromise point for IoT solutions like yours since complex data models and joining are not needed.
You bring up Glue and conversion to Parquet. These can be good to do, but as I mentioned before, partitioning of the data in S3 is usually far more powerful. The value of Parquet will increase as the width of your data increases. Parquet is a columnar format, so it has an advantage when only a subset of "columns" is needed. The downside is the conversion time/cost and the loss of easy human readability (which can be huge during debug).
EMR is another choice you mention, but I generally advise clients against going with EMR unless they need the flexibility it brings to the analytics and they have the skills to use it well. Without these, EMR tends to be an unneeded cost sink.
If this is really going to be a Big Data solution then RDS (and Aurora) are not good choices. They are designed for transactional workloads, not analytics. The data size and analytics will not fit well or be cost effective.
Another tool in the space is S3 Select. Not likely what you are looking for but something to remember exists and can be a tool in the toolbox.
Hybrid solutions are common in this space if there are variable needs based on some factor. A common one is "time of day" - no one is running extensive reports at 3am so the needed performance is much less. Another is user group - some groups need simple analytics while others need much more power. Another factor is timeliness of data - does everyone need "up to the second" information or is daily information sufficient? Trying to have one tool that does everything for everybody, all the time, is often a path to an expensive, oversized solution.
Since Redshift Spectrum and Athena can point at the same S3 data (well organized, since both will benefit), both tools can coexist on the same data. Also, because Redshift is ideal for sifting through huge mounds of data, it is well suited to producing summary tables and then writing them (in partitioned Parquet) to S3 for tools like Athena to use. All these cloud services can be run on schedules, and this includes Redshift and EMR (Athena is query on demand), so they don't need to run all the time. Redshift with Spectrum can run a few hours a day to perform deep analytics and summarize data for writing to S3. Your data scientists can also use Redshift for their hardcore work while Athena supports dashboards using the daily summary data and Kinesis Data Analytics as a source.
Lastly, you bring up a 2 sec requirement for dashboards. This is definitely possible with Quicksight backed by Redshift or Athena, but it won't be met for arbitrarily complex / data intensive queries. To meet this you will need the engine to have enough horsepower to produce the data in question. Redshift with local data storage is likely the fastest (Redshift Spectrum with some data pruning done in S3 wins in some cases) and Athena is the weakest / slowest. But the power doesn't matter if the work is small, so your query workload will be a huge deciding factor. The fastest will be to load the needed data into Quicksight storage (SPICE), but this is another localized / summarized version of the data, so timeliness is again a factor (how often is this updated).
Based on designing similar systems and a bunch of guesses as to what you need I'd recommend that you:
Fix your object size (Kinesis can be configured to do this)
Partition your data by day
Set up a small Redshift cluster (4 X dc2.large) and use Spectrum to address the data in S3 (see the sketch after this list)
Connect Quicksight to Redshift
Measure the performance (and cost) and compare to requirements (there will likely be gaps)
Adjust the solution (summary tables to S3, Athena, SPICE, etc.) to meet goals
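For the Spectrum step above, a hedged sketch of pointing a small cluster at the existing S3 data; the host, credentials, Glue database, IAM role, and table names are all placeholders:

```python
import redshift_connector

# Connect to the small Redshift cluster (connection details are placeholders).
conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="admin",
    password="...",
)
cur = conn.cursor()

# Expose the Glue/Athena catalog (and therefore the partitioned S3 data) as a schema.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG
    DATABASE 'analytics'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
""")

# Queries against spectrum.heartbeats push filtering/aggregation down into S3.
cur.execute("""
    SELECT os, COUNT(DISTINCT viewer_id)
    FROM spectrum.heartbeats
    WHERE dt = '2024-01-15'
    GROUP BY os
""")
print(cur.fetchall())
conn.commit()
```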
The alternative is to hire someone who has set up such systems before and have them review the requirements in detail and make a less "guess-based" recommendation.
I would look into Druid. Not an AWS offering, but easily runs on AWS, with good integration with S3 and Kinesis.
Capable of reading from Kinesis at high speeds and making the data available for querying right away. It can also flatten and transform the data as it reads it.
Capable of doing rollups/aggregation/compaction during ingestion (and further reduce data in an async manner). From what you wrote, it seems to me that it could easily reduce the number of rows in the DB by a very large factor.
Capable of fast queries, using standard SQL.
Smart partitioning of the data to scan only the relevant dates.
The downside is that you will need to keep a cluster up and running for ingestion and for querying. It is pretty scalable, so you can start small.
On the upside - you're not using 10 different technologies (Athena/Glue/EMR/etc.)
You might want to consider contacting Imply, which can ease the deployment.
A usual approach a lot of companies take is to do the heavyweight lifting in Athena or BigQuery (or some other distributed SQL environment), aggregate intermediate results into multiple indexed + partitioned Postgres/MySQL/Redshift/ClickHouse tables, and then connect their APIs to read from those tables. Of course, this works fine except that as the amount of intermediate aggregated data increases, table indices grow and problems like cumulative sums or sorting become less and less efficient.
With your problem in hand, I think you can get a lot of help with AWS Lambda. AWS Lambda provides a very feasible serverless approach towards solving large granular data problems (if used correctly). For instance, assume that your pipeline partitions the incoming stream by YYYYMMDDHHMM and stores it into some S3 path which has a Lambda listening to it (as a trigger function); then your data ingest + aggregation become pretty much simultaneous processes. As soon as a minute is up, a new instance of the same Lambda function will be taking care of data landing in partition YYYYMMDDHHMM+1. So, this way, you can run thousands of simultaneous processes with a good bunch of Lambda functions doing the same thing in parallel. Of course, this is a rough picture, but I think it can greatly help.
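A minimal sketch of one such per-partition Lambda, assuming newline-delimited JSON heartbeats and illustrative column and prefix names (pandas and s3fs would need to be bundled with the function):

```python
import boto3
import pandas as pd

s3 = boto3.client("s3")

def handler(event, context):
    # Fired when a new object lands under a minute-level partition such as
    # s3://heartbeat-archive/raw/202401151230/... (partition scheme is illustrative).
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Newline-delimited JSON heartbeats; reading s3:// paths requires s3fs.
    df = pd.read_json(f"s3://{bucket}/{key}", lines=True)

    # Pre-aggregate this minute's slice so later jobs only merge small summaries.
    summary = (
        df.groupby(["stream_id", "viewer_id"])
          .agg(seconds_viewed=("seconds_viewed", "max"),
               seconds_buffered=("seconds_buffered", "max"))
          .reset_index()
    )

    out_key = key.replace("raw/", "summaries/") + ".parquet"
    summary.to_parquet(f"s3://{bucket}/{out_key}", index=False)
```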

Mirror Marketo Data in S3 Bucket for Visualization

I am looking to get all of the Activity and Lead data in Marketo to be mirrored in an AWS S3 bucket so that I can build dashboards on it in Quicksight, so preferably I'd like to stream the data from Marketo into S3 in real-time, and then use Glue and Athena to connect the data to Quicksight. However, the only way to get large volumes of data out of Marketo appears to be their Bulk Extract tool (one for Leads, one for Activity data).
The problem is that these API interfaces make any attempt at near real-time streaming really clunky. Currently, I have Lambda functions being triggered every hour to pull the most recent hour of Lead/Activity data and saving it as a gzipped CSV in S3. But Marketo's Bulk Extract tool has a request queue and requests often take longer than 15 minutes to process (15 minutes being Lambda's max timeout length). So at least once a day my requests are getting dropped.
The solution seems to be to instead run this on an EC2 instance that can juggle multiple requests and patiently wait for Marketo's queue. But I'd rather not get into all the async and error-handling issues that that approach may entail if there is an easier way to accomplish this.
As an alternative solution, Amazon AppFlow integrates with Marketo. But last I checked, it only works with Lead data, not Activity data. And there are restrictions on the filters you have to apply to the Lead data that make it clunky to work with anyway.
On Google I have found several companies that claim to offer seamless, reliable Marketo-to-S3 ETL, but I haven't yet researched their pricing or quality.
If anyone knows of a good approach to set up reliable and cost-efficient ETL between Marketo and S3 in a short period of time, I would very much appreciate it.
In a case like this, I would be tempted to recommend using an EC2 instance to run Singer with a Marketo input and CSV output, then set up something to move the CSV over to S3 as needed. That would be the absolute cheapest ETL solution, but this does suppose you have some comfort and familiarity with Python.
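A rough sketch of how that could be glued together, assuming the tap-marketo and target-csv Singer packages are installed and configured (the config paths, bucket name, and the assumption that target-csv writes into the working directory are all illustrative):

```python
import subprocess
from pathlib import Path
import boto3

def run_sync(bucket="marketo-mirror"):
    # Run the Singer pipeline: tap-marketo | target-csv.
    tap = subprocess.Popen(
        ["tap-marketo", "--config", "tap_config.json"], stdout=subprocess.PIPE
    )
    subprocess.run(
        ["target-csv", "--config", "target_config.json"],
        stdin=tap.stdout,
        check=True,
    )
    tap.wait()

    # Ship the resulting CSVs to S3.
    s3 = boto3.client("s3")
    for csv_file in Path(".").glob("*.csv"):
        s3.upload_file(str(csv_file), bucket, f"marketo/{csv_file.name}")

if __name__ == "__main__":
    run_sync()
```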
Also worth noting is that Stitch, Singer's paid product equivalent, supports native S3 export -- you could always first test with a non-Marketo data source and see if that performs the way you'd like, if you prefer spending money over time.

Best practices for setting up a data pipeline on AWS? (Lambda/EMR/Redshift/Athena)

Disclaimer: This is my first time ever posting on Stack Overflow, so excuse me if this is not the place for such a high-level question.
I just started working as a data scientist and I've been asked to set up an AWS environment for 'external' data. This data comes from different sources, in different formats (although it's mostly csv/xlsx). They want to store it on AWS and be able to query/visualize it with Tableau.
Despite my lack of AWS experience I managed to come up with a solution that's more or less working. This is my approach:
Raw csv/xlsx are grabbed using a Lambda
Data is cleaned and transformed using pandas/numpy in the same Lambda as 1.
The processed data is written to S3 folders as CSV (still within the same lambda)
Athena is used to index the data
Extra tables are created using Athena (some of which are views, others aren't)
Athena connector is set up for Tableau
It works, but it feels like a messy solution: the queries are slow and the Lambdas are huge. The data is often not as normalized as it could be, since normalizing it would increase query time even more. Storing as CSV also seems silly
I've tried to read up on best practices, but it's a bit overwhelming. I've got plenty questions, but it boils down to: What services should I be using in a situation like this? What does the high-level architecture look like?
I have a fairly similar use-case; however, it all comes down to the size of the project and how far you want to take robustness / future planning of the solution.
As a first iteration, what you have described above seems like it works and is a reasonable approach, but as you pointed out it is quite basic and clunky. If the external data is something you will be consistently ingesting and can foresee growing, I would strongly suggest you design a data lake system first; my recommendation would be to either use the AWS Lake Formation service or, if you want more control and to build from the ground up, use something like the 3x3x3 approach.
By designing your data lake correctly, managing the data in the future becomes much simpler, and it nicely partitions your files for future use / data diving.
A high-level architecture would be something like:
Lambda gets the data from the source and puts it into S3
The data lake system handles the file and auto-partitions + tags it
then,
Depending on how quickly you need to visualise your data, and if it is large amounts of data, potentially use AWS Glue Python shell or PySpark jobs instead of Lambda, which will handle your pandas/numpy a lot more cleanly.
I would also recommend converting your files into Parquet if you're using Athena or equivalent, for improved query speed. Remember, file partitioning is important to performance!
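As a hedged illustration of that Parquet + partitioning advice (the paths, columns, and partition key are made up, and writing straight to s3:// assumes s3fs is installed):

```python
import pandas as pd

# Clean the raw extract and write partitioned Parquet instead of CSV.
df = pd.read_csv("s3://external-data-raw/vendor_x/2024-01-15.csv")
df = df.dropna(subset=["customer_id"])
df["ingest_date"] = "2024-01-15"

# partition_cols writes one folder per date, e.g. .../ingest_date=2024-01-15/,
# which Athena can prune when queries filter on that column.
df.to_parquet(
    "s3://external-data-curated/vendor_x/",
    partition_cols=["ingest_date"],
    index=False,
)
```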
Note, the above is for quite a robust ingestion system and may be overkill if you have a basic use case with infrequent data ingestion.
If your data arrives in small packets but very frequently, you could even use a Kinesis layer in front of the Lambda-to-S3 step to pipe your data in a more organised manner. You could also use Redshift to host your files instead of S3 if you wanted a more contemporary warehouse solution. However, if you have many sources I would suggest sticking with S3 for simplicity.

How can I implement Amazon EMR to read data from my API calls?

All the examples I've seen are with Java programs?
I want to be able to track a user's behaviour while navigating my website by looking at all the API calls made by that user. All the API calls are based on data stored in a SQL database.
I also, for example, want to check all the keywords passed to my search API to get a list of the most searched terms.
I thought about using Oozie but does anyone have any other suggestions ?
There are several options for analyzing the data in your database.
Normal SQL experimentation
I'd suggest starting with normal SQL statements against your database to experiment with finding what data is of interest. This might be a little slow if you have millions of records, but gives you full flexibility to play around with the data.
Amazon EMR
Once you have identified the types of analysis you'd like to run on a regular basis (eg daily or weekly), you could launch an EMR cluster to perform analysis. Please note that this is a powerful but rather complex toolset and the time required to fully utilize it might not be worthwhile.
You can launch a transient cluster, which means that the cluster terminates once it has finished the jobs it has been given. Thus, the cluster can be triggered via a scheduled API call and will automatically terminate.
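A hedged sketch of launching such a transient cluster from a scheduled script (the release label, instance types, roles, and job script location are placeholders):

```python
import boto3

emr = boto3.client("emr")

# Transient cluster: it runs the supplied step(s) and then terminates on its own.
emr.run_job_flow(
    Name="nightly-analysis",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the steps finish
    },
    Steps=[{
        "Name": "run-analysis",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/analysis.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```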
Amazon Athena
Amazon Athena provides an SQL interface to data stored in Amazon S3. The common use-case is to analyze log files that are in S3 without having to load them into a database. Athena is powerful and processes data in parallel to give results back very quickly.
Bottom line: Start simple. Play with the existing data to figure out what you'd like to discover. Then optimize.

AWS Redshift vs Snowflake use cases

I was wondering if anyone has used both AWS Redshift and Snowflake and can speak to use cases where one is better. I have used Redshift, but recently someone suggested Snowflake as a good alternative. My use case is basically retail marketing data that will be used by a handful of analysts who are not terribly SQL savvy and will most likely have a reporting tool on top.
Redshift is a good product, but it is hard to think of a use case where it is better than Snowflake. Here are some reasons why Snowflake is better:
The admin console is brilliant, Redshift has none.
Scale-up/down happens in seconds to minutes, Redshift takes minutes to hours.
The documentation for both products is good, but Snowflake is better laid out and more accessible.
You need to know less "secret sauce" to make Snowflake work well. On Redshift you need to know and understand the performance impacts of things like distribution keys and sort keys, at a minimum.
The load processes for Snowflake are more elegant than Redshift. Redshift assumes that your data is in S3 already. Snowflake supports S3, but has extensions to JDBC, ODBC and dbAPI that really simplify and secure the ingestion process.
Snowflake has great support for in-database JSON, and is rapidly enhancing its XML. Redshift has a more complex approach to JSON, and recommends against it for all but smaller use cases, and does not support XML.
I can only think of two cases which Redshift wins hands-down. One is geographic availability, as Redshift is available in far more locations than Snowflake, which can make a difference in data transfer and statement submission times. The other is the ability to submit a batch of multiple statements. Snowflake can only accept one statement at a time, and that can slow down your batches if they comprise many statements, especially if you are on another continent to your server.
At Ajilius our developers use Redshift, Snowflake and Azure SQL Data Warehouse on a daily basis; and we have customers on all three platforms. Even with that choice, every developer prefers Snowflake as their go-to cloud DW.
I evaluated both Redshift (Redshift Spectrum with S3) and Snowflake.
In my POC, Snowflake was way better than Redshift. Snowflake integrates well with relational/NoSQL data. No upfront index or partition key is required. It works amazingly well without your having to worry about how you will access the data.
Redshift is very limited and has no JSON support. It's hard to understand the partitioning, and you have to do a lot of work to get something done. You can use Redshift Spectrum as a band-aid to access S3. Good luck with partitioning upfront: once you have created a partition structure in an S3 bucket, you are stuck with it, and there is no way to change it unless you reprocess all the data into a new structure. You will end up spending time fixing these issues instead of working on real business problems.
It's like comparing a smartphone to a Morse code machine. Redshift is the Morse code kind of implementation, and it's not for modern development.
We recently switched from Redshift to Snowflake for the following reasons:
Real-time data syncing
Handling of concurrent queries
Minimizing of database administration
Providing different amounts of computing power to different Looker users
A more in-depth writeup can be found on our data blog.
I evaluated Redshift and Snowflake, and a little bit of Athena and Spectrum as well. The latter two were non-starters in cases where we had big joins, as they would run out of memory. For Redshift, I could actually get a better price to performance ratio for a couple reasons:
allows me to choose a distribution key which is huge for co-located joins
allows for extreme discounts on three year reserved pricing, so much so that you can really upsize your compute at a reasonable cost
I could get better performance in most cases with Redshift, but it requires good MPP knowledge to setup the physical schema properly. The cost of expertise and complexity offsets some of the product cost.
Redshift stores JSON in a VARCHAR column. That can cause problems (OOM) when querying a subset of JSON elements across large tables, where the VARCHAR column is sized too big. In our case we had to define the VARCHAR as extremely large to accommodate a few records that had very large JSON documents.
Snowflake functionality is amazing, including:
ability to clone objects
deep functionality in handling JSON data
snowpipe for low maintenance loading, auto scaling loads, trickle updates
streams & tasks for home grown ETL
ability to scale storage and compute separately
ability to scale compute within a minute, requiring no data migration
and many more
One thing that I would caution about Snowflake is that one might be tempted to hire less skilled developers/DBAs to run the system. Performance in a bad schema design can be worked around using a huge compute cluster, but that may not be the best bang for the buck. Regardless, the functionality in Snowflake is amazing.