AWS for serverless Google Analytics-like tool

I'm currently brainstorming an idea and trying to figure out the missing pieces, or a better way to solve this problem.
Assume I have a product that customers can embed on their website. My end goal is to build a dashboard on my website showing relevant analytics (such as page loads, clicks, custom events) to my customers.
I separated this feature into two parts:
Collection of data
We can collect data from two sources:
Embed of https://my-bucket/$customerId/product.json
CloudFront logs -> S3 -> Kinesis Data Streams -> Kinesis Data Firehose -> S3
An HTTP POST /collect request to collect an event
API Gateway endpoint -> Lambda -> Kinesis Data Firehose -> S3
Access of data
My dashboard will be calling GET /analytics?event=click&from=...&to=...&productId=...
The first part is straightforward:
API Gateway route -> Lambda
The part I'm struggling with: how can my Lambda access the data that is, by that point, stored on S3?
So far, I have evaluated these options:
S3 + Glue -> Athena: Athena is not a high-availability service. To my understanding, some queries can take minutes to execute. I need something fast and snappy.
Kinesis Data Firehose -> DynamoDB: it is difficult to filter and sort in DynamoDB, and I'm afraid the high volume of analytics events would slow it down and make it impractical.
QuickSight: it doesn't expose an SQL way to get data.
Kinesis Analytics: it doesn't expose an SQL way to get data either.
Amazon OpenSearch Service: feels like overkill (?)
Redshift: looking into it next.
I'm most probably misnaming what I'm trying to do, as I can't seem to find any relevant help for a problem I would have thought must be quite common.
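For what it's worth, a minimal sketch of what the GET /analytics Lambda would look like if it took the Athena route (the database, table, and bucket names here are hypothetical). It also illustrates the latency concern: Athena queries are asynchronous and must be polled.

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical names; replace with your own Glue database/table and bucket.
DATABASE = "analytics_db"
OUTPUT = "s3://my-athena-results/"

def handler(event, context):
    params = event["queryStringParameters"]
    # NOTE: real code should use parameterized queries / strict input
    # validation to avoid SQL injection.
    query = (
        "SELECT event_time, event_type, product_id "
        "FROM events "
        f"WHERE event_type = '{params['event']}' "
        f"AND product_id = '{params['productId']}' "
        f"AND event_time BETWEEN from_iso8601_timestamp('{params['from']}') "
        f"AND from_iso8601_timestamp('{params['to']}')"
    )
    qid = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT},
    )["QueryExecutionId"]

    # Athena is asynchronous: poll until the query finishes. This wait is
    # exactly the "fast and snappy" concern raised above.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.5)

    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    return {"statusCode": 200, "body": str(rows)}
```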

Related

AWS EventBridge to S3

I am integrating an AWS partner (Freshservice). I am able to set up an event bus to see the data from the partner, but I need to get the ticket information into S3. I am thinking of using a Glue workflow, but I am uncertain whether this is the best method. The end result is to have the data available in QuickSight for analytics. Any thoughts on the best options?
The solution was not what I thought. I ended up going EventBridge -> Kinesis -> S3 -> Glue -> Athena
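For anyone wiring up the first hop of that flow, a minimal sketch of an EventBridge rule targeting a Kinesis stream via boto3 (the bus, stream, and role names are hypothetical):

```python
import boto3

events = boto3.client("events")

# Hypothetical names: the partner event bus, a pre-created Kinesis stream,
# and an IAM role that allows EventBridge to write to the stream.
BUS = "aws.partner/freshservice.com/your-bus-name"
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/ticket-events"
ROLE_ARN = "arn:aws:iam::123456789012:role/eventbridge-to-kinesis"

# Match every event the partner sends on this bus.
events.put_rule(
    Name="freshservice-tickets-to-kinesis",
    EventBusName=BUS,
    EventPattern='{"source": [{"prefix": "aws.partner/freshservice.com"}]}',
    State="ENABLED",
)

events.put_targets(
    Rule="freshservice-tickets-to-kinesis",
    EventBusName=BUS,
    Targets=[{
        "Id": "kinesis-stream",
        "Arn": STREAM_ARN,
        "RoleArn": ROLE_ARN,
    }],
)
```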

Looking for guidance/alternatives on ETL process on AWS

I'm working on an ETL pipeline using Apache NiFi. The flow runs hourly and is something like this:
1) data provider API -> Apache NiFi -> S3 landing
2) Athena query to transform the data -> S3 stage
3) Athena query to change field types and join with other data so it is ready for analysis -> S3 trusted
4) Glue -> Redshift
I found Glue to be expensive for sending data to Redshift, so I will code something ad hoc that uses the COPY command.
What I would like to ask is whether you can point me to a better tool/way to do this better/cheaper/more scalably, especially in steps 2 and 3.
I'm looking for ways to optimize this process and make it ready to receive millions of records per hour.
Thank you!
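Regarding the ad-hoc COPY step mentioned above, a minimal sketch using the Redshift Data API (the cluster, role, bucket, and table names are all hypothetical):

```python
import boto3

# Hand-rolled replacement for the Glue-to-Redshift step: issue a COPY
# straight from the trusted S3 prefix.
client = boto3.client("redshift-data")

client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql="""
        COPY trusted.events
        FROM 's3://my-trusted-bucket/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS PARQUET;
    """,
)
```

Note the Data API call is itself asynchronous; if you need confirmation, check on the statement afterwards with describe_statement.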
Interesting workflow.
You can actually use some neat combinations to automatically get data from S3 into Redshift.
You can do S3 (raw data) -> Lambda (off the PUT notification) -> Kinesis Firehose -> S3 (batched & transformed with a Firehose transformer) -> Redshift COPY (issued by Firehose)
This flow will completely automate updates based on your data. Hope this helps.
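A minimal sketch of the Lambda hop in that chain, triggered by the S3 PUT notification and forwarding the object's lines into Firehose (the stream name and line-per-record handling are assumptions):

```python
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

# Hypothetical delivery stream; Firehose batches the records back into
# S3 (optionally transformed) and can then COPY them into Redshift.
STREAM = "raw-to-redshift"

def handler(event, context):
    # Invoked by the S3 PUT notification on the raw-data bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # Forward each line of the object as one Firehose record.
        # (Real code must chunk: put_record_batch caps at 500 records.)
        firehose.put_record_batch(
            DeliveryStreamName=STREAM,
            Records=[{"Data": line + b"\n"} for line in body.splitlines()],
        )
```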
You can save your data in a partitioned fashion in S3.
Then use Glue Spark jobs to transform the data and implement joins and aggregations; that will be fast if written in an optimized way.
This will also save you cost, as Glue will process the data faster than expected, and to move the data into Redshift the COPY command is the best approach (see the sketch below).
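To make that concrete, a minimal Glue (PySpark) job sketch for the transform/join step; the database, table, and path names are hypothetical:

```python
# Minimal Glue (PySpark) job sketch for the transform/join step.
from awsglue.context import GlueContext
from awsglue.transforms import Join
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the partitioned landing data registered in the Glue catalog.
events = glue_context.create_dynamic_frame.from_catalog(
    database="landing_db", table_name="events"
)
dims = glue_context.create_dynamic_frame.from_catalog(
    database="landing_db", table_name="dimensions"
)

# Join and write back to S3 as Parquet, partitioned by date so later
# scans (Athena or Redshift Spectrum) stay cheap.
joined = Join.apply(events, dims, "dim_id", "id")
glue_context.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={
        "path": "s3://my-trusted-bucket/events/",
        "partitionKeys": ["dt"],
    },
    format="parquet",
)
```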
Read about AWS Glue: https://aws.amazon.com/glue/
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console.

Pipeline for parsing daily JSON data with AWS?

JSON files are posted daily to an S3 bucket. I want to take each JSON file, do some processing on it, then post the data to a new S3 bucket, where it will get picked up and stored in Redshift. What would be the recommended AWS pipeline for this? An AWS Lambda that triggers when a new JSON file is placed in S3, which then kicks off something like an AWS Batch job? Or something else? I am not familiar with all AWS services, so I might be overlooking something obvious.
So the flow looks like this:
s3 bucket -> data processing -> s3 bucket -> redshift
and it's the data processing step I'm not sure about - how to schedule something fairly scalable that runs daily and efficiently and puts the data back. The processing is parsing of JSON data and some aggregation and data clean-up.
and it's the data processing step I'm not sure about - how to schedule something fairly scalable that runs daily and efficiently and puts the data back.
Don't worry about scalability with Lambda; just focus on short-running jobs. Here is an example:
https://docs.aws.amazon.com/lambda/latest/dg/with-scheduledevents-example.html
I think one piece of the puzzle you're missing is the documentation for Schedule Expressions Using Rate or Cron: https://docs.aws.amazon.com/lambda/latest/dg/with-scheduledevents-example.html
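Putting those pieces together, a minimal sketch of a daily scheduled Lambda for the processing step (the bucket names and the clean-up logic are hypothetical placeholders):

```python
import json
import boto3

s3 = boto3.client("s3")

# Triggered once a day by a scheduled rule such as cron(0 6 * * ? *).
RAW_BUCKET = "daily-json-raw"
CLEAN_BUCKET = "daily-json-clean"

def handler(event, context):
    for obj in s3.list_objects_v2(Bucket=RAW_BUCKET).get("Contents", []):
        raw = json.loads(
            s3.get_object(Bucket=RAW_BUCKET, Key=obj["Key"])["Body"].read()
        )
        # Parsing / aggregation / clean-up goes here; this filter is just
        # a stand-in for the real transformation.
        cleaned = [r for r in raw if r.get("valid")]

        # Write back for Redshift to pick up (e.g. via COPY).
        s3.put_object(
            Bucket=CLEAN_BUCKET,
            Key=obj["Key"],
            Body="\n".join(json.dumps(r) for r in cleaned),
        )
```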

What is the right architecture/design for a JavaScript-client to AWS-database website tracking system

We wish to build a data pipeline system that tracks website interactions/events.
The goal is to track user behavior on a website, so we would like to choose the right architecture to implement it, given the following two constraints:
1) the system is on Amazon (AWS)
2) this is a budget-constrained project, so we cannot use Redshift for this purpose
Based on the above two constraints, my plan is to implement the following architecture:
website JavaScript --> AWS S3 --> (AWS Lambda) --> AWS RDS
Website JavaScript client - tracks user interactions and sends them to Firehose.
AWS Firehose data delivery to S3 - Firehose eventually writes the events into S3.
AWS Lambda (Python) - a periodic task that pulls the daily events from S3 and loads them into RDS.
The reason I have chosen RDS is its cost-effectiveness for this objective.
I'd appreciate any comments on the above implementation, or any other architecture you would recommend instead.
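A minimal sketch of that periodic Python Lambda (the connection details, bucket name, and table schema are all hypothetical placeholders):

```python
import json
from datetime import date, timedelta

import boto3
import pymysql  # bundle via a Lambda layer; not in the default runtime

s3 = boto3.client("s3")

# Hypothetical connection details and schema.
conn = pymysql.connect(
    host="tracking-db.xxxxxx.us-east-1.rds.amazonaws.com",
    user="etl",
    password="...",
    database="tracking",
)

def handler(event, context):
    # Runs on a daily schedule; Firehose writes under YYYY/MM/DD/ prefixes
    # by default, so load everything from yesterday's prefix.
    prefix = (date.today() - timedelta(days=1)).strftime("%Y/%m/%d/")
    objects = s3.list_objects_v2(Bucket="tracking-events", Prefix=prefix)
    with conn.cursor() as cur:
        for obj in objects.get("Contents", []):
            body = s3.get_object(
                Bucket="tracking-events", Key=obj["Key"])["Body"]
            for line in body.iter_lines():
                e = json.loads(line)
                cur.execute(
                    "INSERT INTO events (user_id, event_type, ts)"
                    " VALUES (%s, %s, %s)",
                    (e["userId"], e["type"], e["timestamp"]),
                )
    conn.commit()
```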
If I understand your question correctly, you are proposing the solution below to perform web analytics for your application:
WebServer --> Firehose --> AWS-S3 --> AWS-Lambda --> AWS-RDS
I see the following pros and cons with this design:
Pros:
low cost
easy to implement
Cons:
RDS may not be scalable enough to handle analytics on massive amounts of web-streaming data, which tends to grow rapidly
You need to handle load balancing, failure scenarios, and other complexities for Lambda
You need to handle data transformation for RDS, as it expects structured data to be ingested into relational tables
The proposal to store the data in S3 through Firehose sounds like a good solution. But please keep in mind that the minimum buffer interval for Firehose is one minute, so your application needs to tolerate this minor latency. You could use Kinesis Streams to get millisecond latency, but then you need to manage your own application code and instances to consume the stream.
After ingesting data into Kinesis Firehose or Streams, you may also explore the alternatives below:
Use Kinesis Analytics to track web user activity in real time, if it's available in your AWS region; it's currently only available in selected regions
Within Firehose, transform your data using a Lambda and store it in S3 in an optimized format for further analysis with AWS Athena (a sketch of such a transformation Lambda follows this list)
Use Elasticsearch as a destination and perform web analytics with the ELK stack instead of RDS
Though you mentioned that you cannot use Redshift, it may still be the best solution for time-series analysis. Exploring Redshift, Redshift Spectrum, and formatted data stored in S3 may still be a cost-effective solution with better capabilities
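For reference, a sketch of such a Firehose transformation Lambda. The recordId/result/data response shape is the contract Firehose expects; the event field names here are hypothetical:

```python
import base64
import json

def handler(event, context):
    # Firehose transformation Lambda: receives a batch of records and
    # must return each one with recordId, result, and re-encoded data.
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Example transform: keep only the fields Athena will query.
        slim = {
            "user": payload.get("userId"),
            "event": payload.get("type"),
            "ts": payload.get("timestamp"),
        }
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                (json.dumps(slim) + "\n").encode()
            ).decode(),
        })
    return {"records": output}
```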
Adding a few references from AWS, which you may go through before deciding on the solution:
Real-Time Web Analytics with Kinesis Data Analytics Solution
Near Real-time Analytics on Streaming Data with Amazon Kinesis and Amazon Elasticsearch
Schema-On-Read Analytics Pipeline Using Amazon Athena
Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required
Hey folks, this is getting more and more common.
Generally the pattern is: click events go to Kinesis Streams, then you can monitor user interaction with the website in real time using Kinesis Analytics. You can connect the stream to Firehose to offload data into an S3 bucket, as well as incorporate Lambdas to transform the data.
There is some major complexity around handling Lambdas and Kinesis streams in parallel, so this solution might not be as scalable as using Amazon MSK (AWS's managed Kafka). Or perhaps run a job to move your S3 data into RDS for whatever ad hoc reporting you might need.
Here is a pattern AWS already has: real-time-web-analytics-with-kinesis
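A minimal sketch of the ingest side of that pattern: a handler pushing click events onto a stream (the stream name and event fields are hypothetical):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream name; the website posts click events to an endpoint
# backed by this handler, which pushes them onto the stream.
STREAM = "click-events"

def handler(event, context):
    click = json.loads(event["body"])
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps(click),
        # Partition by user so one user's events stay ordered on a shard.
        PartitionKey=click.get("userId", "anonymous"),
    )
    return {"statusCode": 202}
```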

IoT Big Data design on AWS

I'm trying to design a big IoT solution for millions of devices, starting from zero. That's why I need a highly scalable platform like AWS.
My devices are going to report data using AWS IoT, and that's the only thing I've really decided. I need to store a lot of data, such as a temperature measurement every 15 minutes from each device. For those measurements I plan to insert directly into DynamoDB using IoT Rules, but on the other side I need a relational structure to store companies, temperature sensors, etc., so I thought I could store that in MySQL on RDS.
After that, I need to configure a proper analysis tool, so I was thinking of Kinesis, and of loading the data into Redshift after ETL using Data Pipeline, since AWS Glue doesn't support DynamoDB.
I'm new to some of these services, so I don't know exactly what I'm doing and I don't know if this approach is the best one. What do you think?
Thanks.
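A minimal sketch of the IoT Rules part of that plan, routing each measurement straight into DynamoDB (the topic, table, and role names are hypothetical):

```python
import boto3

iot = boto3.client("iot")

# Routes each reported temperature measurement into DynamoDB without
# any custom ingest code in between.
iot.create_topic_rule(
    ruleName="temperature_to_dynamodb",
    topicRulePayload={
        "sql": "SELECT deviceId, temperature, timestamp() AS ts "
               "FROM 'devices/+/temperature'",
        "actions": [{
            "dynamoDBv2": {
                "roleArn": "arn:aws:iam::123456789012:role/iot-to-dynamodb",
                "putItem": {"tableName": "temperature_measures"},
            }
        }],
        "ruleDisabled": False,
    },
)
```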
I would have your applications write the edge data (raw data) to an S3 bucket with this flow:
Edge (with credentials) -> API Gateway -> Lambda -> S3
Save your raw data as .json files in S3. Then you can use tools like Athena and QuickSight to visualize it.
The benefits of this are:
1) Your edge devices don't have to carry the AWS SDK
2) S3 is cheap and insanely scalable
3) The JSON format can be read by any service, so you are not locked into AWS for visualization. A sketch of the Lambda step is below.
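A minimal sketch of the Lambda in that flow, dropping each POSTed payload into S3 (the bucket name and key layout are hypothetical):

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; API Gateway proxies the device POST straight here.
BUCKET = "edge-raw-data"

def handler(event, context):
    body = json.loads(event["body"])
    now = datetime.now(timezone.utc)

    # A date-based prefix keeps Athena scans cheap and partitions obvious.
    key = f"{now:%Y/%m/%d}/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(body))
    return {"statusCode": 201, "body": json.dumps({"stored": key})}
```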