I am integrating with an AWS partner (Freshservice). I am able to set up an event bus to see the data from the partner, but I need to get the ticket information into S3. I am thinking of using a Glue workflow, but I am uncertain if this is the best method. The end goal is to have the data available in QuickSight for analytics. Any thoughts on the best options?
The solution was not what I thought. I ended up going EventBridge -> Kinesis -> S3 -> Glue -> Athena.
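For anyone following the same path, here is a rough sketch (Python/boto3) of the first hop: a rule on the partner event bus that forwards everything into a Kinesis stream, which then gets landed in S3. The bus name, stream ARN, and role ARN are placeholders for illustration, not the actual setup.

```python
import json
import boto3

events = boto3.client("events")

# Placeholder names -- substitute your partner event bus, stream, and IAM role.
EVENT_BUS = "aws.partner/freshworks.com/12345/freshservice"   # hypothetical partner bus/source name
STREAM_ARN = "arn:aws:kinesis:us-east-1:111122223333:stream/ticket-events"
ROLE_ARN = "arn:aws:iam::111122223333:role/eventbridge-to-kinesis"

# Rule on the partner event bus matching the partner's event source.
events.put_rule(
    Name="freshservice-tickets-to-kinesis",
    EventBusName=EVENT_BUS,
    EventPattern=json.dumps({"source": [EVENT_BUS]}),
    State="ENABLED",
)

# Send matched events into the Kinesis stream; from there they can be batched into S3.
events.put_targets(
    Rule="freshservice-tickets-to-kinesis",
    EventBusName=EVENT_BUS,
    Targets=[{
        "Id": "kinesis-target",
        "Arn": STREAM_ARN,
        "RoleArn": ROLE_ARN,
        "KinesisParameters": {"PartitionKeyPath": "$.id"},
    }],
)
```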
Related
I am using AWS RDS (MySQL) and I would like to sync this data to Amazon OpenSearch Service (Elasticsearch) in real time.
I am thinking that the best solution for this is AWS Glue, but I am not sure whether it can actually do what I want.
This is information for my RDS database:
■ RDS
・I would like to sync several MySQL tables to OpenSearch (1 table to 1 index).
・The schema of the tables changes dynamically.
・New columns may have been added or existing columns removed since the previous sync
(so I also have to sync these schema changes).
Could you tell me roughly whether I could do these things with AWS Glue?
I wonder if AWS Glue can deal with dynamic schema changes and sync in (near) real time.
Thank you in advance.
Glue now has an OpenSearch connector, but Glue is an ETL tool and does batch-style operations very well; event-based or very frequent loads into Elasticsearch might not be the best fit, and the cost can also be high.
https://docs.aws.amazon.com/glue/latest/ug/tutorial-elastisearch-connector.html
DMS can help, but not completely, since as you mentioned the schema keeps changing.
Logstash Solution
Since Elasticsearch 1.5, Logstash has had a JDBC input plugin that can be used to sync MySQL data into Elasticsearch.
AWS Native solution
You can have a Lambda function fire on a MySQL event (see "Invoking a Lambda function from an Amazon Aurora MySQL DB cluster").
The Lambda will write JSON to Kinesis Data Firehose, and Firehose can load it into OpenSearch.
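A minimal sketch of that Lambda (Python/boto3), assuming the Aurora trigger passes the changed row as the invocation payload and that a Firehose delivery stream (here hypothetically named mysql-to-opensearch) already exists with OpenSearch as its destination:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream configured with OpenSearch as its destination.
DELIVERY_STREAM = "mysql-to-opensearch"

def handler(event, context):
    """Invoked asynchronously from an Aurora MySQL trigger.

    `event` is assumed to be the changed row serialized by the trigger,
    e.g. {"table": "orders", "op": "INSERT", "row": {...}}.
    """
    record = json.dumps(event) + "\n"          # newline-delimited JSON for Firehose
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record={"Data": record.encode("utf-8")},
    )
    return {"status": "queued"}
```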
I'm currently brainstorming an idea and trying to figure out what are the missing pieces or a better way to solve this problem.
Assume I have a product that customers can embed on their website. My end goal is to build a dashboard on my website showing relevant analytics (such as page loads, clicks, and custom events) to my customers.
I separated this feature into 2 parts:
collection of data
We can collect data from 2 sources:
Embed of https://my-bucket/$customerId/product.json
CloudFront Logs -> S3 -> Kinesis Data Streams -> Kinesis Data Firehose -> S3
HTTP request POST /collect to collect an event
API Gateway endpoint -> Lambda -> Kinesis Data Firehose -> S3
access of data
My dashboard will be calling GET /analytics?event=click&from=...&to=...&productId=...
The first part is straightforward:
API Gateway route -> Lambda
The part I'm struggling with: how can my Lambda access the data that is, at that moment, stored on S3?
So far, I have evaluated these options:
S3 + Glue -> Athena: Athena is not a low-latency service. To my understanding, some queries could take minutes to execute. I need something that is fast and snappy (see the sketch after this list).
Kinesis Data Firehose -> DynamoDB: It is difficult to filter and sort on DynamoDB. I'm afraid that the high volume of analytics will slow it down and make it impractical.
QuickSight: It doesn't expose an SQL way to get data
Kinesis Analytics: It doesn't expose an SQL way to get data
Amazon OpenSearch Service: Feels overkill (?)
Redshift: Looking into it next
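For reference, here is a sketch of what the Athena option would look like from the dashboard Lambda (Python/boto3). The analytics database, events table, and results bucket are hypothetical, and the polling loop is exactly where the latency concern above bites:

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical database/table built over the Firehose output in S3,
# plus an S3 location Athena can write query results to.
DATABASE = "analytics"
OUTPUT = "s3://my-athena-results/analytics/"

def query_events(event_type, product_id, date_from, date_to):
    """Run one Athena query and return its rows (blocking; can take seconds or more)."""
    # Inputs should be validated/escaped before being interpolated into SQL.
    sql = f"""
        SELECT date_trunc('day', event_time) AS day, count(*) AS events
        FROM events
        WHERE event_type = '{event_type}'
          AND product_id = '{product_id}'
          AND event_time BETWEEN timestamp '{date_from}' AND timestamp '{date_to}'
        GROUP BY 1 ORDER BY 1
    """
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT},
    )["QueryExecutionId"]

    # Poll until the query finishes -- this is the latency the question worries about.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.5)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```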
I'm most probably misnaming what I'm trying to do, as I can't seem to find any relevant help for a problem I would think must be quite common.
We need to run an analysis of the data in Amazon DynamoDB. Since doing it in DDB isn't an option due to DDB's limitations around analytics, based on the recommendations I am leaning towards DDB -> S3 -> Athena.
It is a data-heavy, multi-tenant application with data streaming in from AWS IoT devices. The sync from DDB to Amazon S3 will probably run a couple of times a day. How do we set up incremental exports for this purpose?
There is an Athena connector that lets you query the data in your DynamoDB table directly using SQL:
https://docs.aws.amazon.com/athena/latest/ug/athena-prebuilt-data-connectors-dynamodb.html
https://dev.to/jdonboch/finally-dynamodb-support-in-aws-quicksight-sort-of-2lbl
Another solution for this use case is to write an AWS Step Functions workflow that, when invoked, reads data from an Amazon DynamoDB table, formats the data the way you want, and places it into an Amazon S3 bucket (an example that shows a similar use case will be available soon):
This is the reverse (there the source is an Amazon S3 bucket and the target is an Amazon DynamoDB table), but you can build the workflow so the target is an Amazon S3 bucket. Because it's a workflow, you can use a Lambda function scheduled to fire a few times a day based on a cron expression. The job of this Lambda function is to invoke the workflow using the Step Functions API.
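A rough sketch (Python/boto3) of the task such a workflow (or the scheduled Lambda itself) could run: paginate through the DynamoDB table, reformat the items, and drop them into S3 as newline-delimited JSON. The table and bucket names are placeholders; an incremental run would additionally filter on something like an updated-at attribute rather than scanning everything.

```python
import json
import datetime
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

TABLE = "iot-readings"           # placeholder table name
BUCKET = "my-analytics-exports"  # placeholder bucket name

def handler(event, context):
    """Invoked by the Step Functions workflow (or directly on a schedule)."""
    table = dynamodb.Table(TABLE)
    items, kwargs = [], {}
    while True:                               # paginate through the whole table
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    # Write the items as newline-delimited JSON under a date-based key.
    body = "\n".join(json.dumps(item, default=str) for item in items)
    key = f"ddb-export/{datetime.datetime.utcnow():%Y/%m/%d/%H%M}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return {"exported": len(items), "key": key}
```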
I'm working on an ETL pipeline using Apache NiFi, the flow runs hourly and is something like this:
data provider API -> Apache NiFi -> S3 landing
-> Athena query to transform the data -> S3 stage
-> Athena query to change field types and join with other data so it is ready for analysis -> S3 trusted
-> Glue -> Redshift
I found Glue to be expensive for sending data to Redshift, so I will code something ad hoc to use the COPY command.
The question I would like to ask is whether you can point me to a better/cheaper/more scalable tool or approach, especially for steps 2 and 3.
I'm looking for ways to optimize this process and make it ready to receive millions of records per hour.
Thank you!
Interesting workflow.
You can actually use some neat combinations to automatically get data from S3 into Redshift.
You can do S3 (raw data) -> Lambda (off a PUT notification) -> Kinesis Data Firehose -> S3 (batched & transformed with a Firehose transformer) -> Firehose COPY into Redshift.
This flow will completely automate updates based on your data. You can read more about it here. Hope this helps.
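In case it helps, the "Firehose transformer" step is just a Lambda that follows the Firehose record-transformation contract. A minimal sketch (Python) that simply re-emits each record with a trailing newline so the batches land in S3 as newline-delimited JSON; any real transformation would go where the comment indicates:

```python
import base64

def handler(event, context):
    """Kinesis Data Firehose transformation Lambda.

    Firehose invokes this with a batch of records; each must be returned with the
    same recordId, a result of Ok/Dropped/ProcessingFailed, and base64-encoded data.
    """
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.rstrip("\n") + "\n"   # put any real transformation here
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```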
You can save your data in S3 in a partitioned fashion.
Then use Glue Spark jobs to transform the data and implement the joins and aggregations; that will be fast if written in an optimized way.
This will also save you cost, as Glue will process the data faster than expected, and then the COPY command is the best approach for moving the data into Redshift.
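For that final load, a rough sketch of issuing the COPY yourself through the Redshift Data API (Python/boto3); the cluster, database, user, table, bucket, and role below are placeholders:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Placeholder identifiers -- substitute your own cluster, database, table, and IAM role.
COPY_SQL = """
    COPY analytics.trusted_events
    FROM 's3://my-trusted-bucket/events/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql=COPY_SQL,
)

# The call is asynchronous; poll describe_statement to see when the COPY finishes.
status = redshift_data.describe_statement(Id=response["Id"])["Status"]
print(status)   # e.g. SUBMITTED / STARTED / FINISHED / FAILED
```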
Read about AWS Glue: https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console.
I have a use-case where I want to sync two AWS Glue Data Catalog residing on different accounts.
Does Glue emit notifications that can be published when a new database/table/partition is created or deleted? Or is there some other way of knowing what is happening in the other Glue Data Catalog?
One way is to listen to CloudWatch Events notifications from that Glue account, but according to the documentation, CloudWatch notifications are not reliable:
https://docs.aws.amazon.com/glue/latest/dg/automating-awsglue-with-cloudwatch-events.html
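For context, listening to those events would be wired up roughly like this (Python/boto3): an EventBridge rule on the default bus matching Glue Data Catalog state-change events and targeting a replication Lambda. The Lambda ARN is a placeholder, and since these events are best effort they probably shouldn't be the only sync mechanism.

```python
import json
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Placeholder ARN of the function that replicates changes into the other catalog.
REPLICATOR_LAMBDA_ARN = "arn:aws:lambda:us-east-1:111122223333:function:catalog-replicator"

# Match Glue Data Catalog database/table state changes on the default event bus.
events.put_rule(
    Name="glue-catalog-changes",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": [
            "Glue Data Catalog Database State Change",
            "Glue Data Catalog Table State Change",
        ],
    }),
    State="ENABLED",
)
events.put_targets(
    Rule="glue-catalog-changes",
    Targets=[{"Id": "replicator", "Arn": REPLICATOR_LAMBDA_ARN}],
)

# EventBridge also needs permission to invoke the Lambda target.
lambda_client.add_permission(
    FunctionName=REPLICATOR_LAMBDA_ARN,
    StatementId="allow-eventbridge-glue-catalog-changes",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
)
```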
AWS provides open source scripts for that purpose. See here.
Not sure how reliable and fast they are, but worth trying.