How does Kinesis Firehose stream data to self-managed Elasticsearch?

I am hosting an Elasticsearch cluster in EKS and I'd like to stream all CloudWatch log groups to this cluster via Kinesis Firehose. But Kinesis Firehose doesn't support streaming data to any Elasticsearch cluster other than the AWS-hosted Elasticsearch Service.
What is the best way to stream data to a self-hosted ES cluster?

I think the best way is by means of a transformation Lambda function for Firehose. For this to work, you would have to choose a supported destination, e.g. S3. The function is normally used to transform the records, but you can program whatever logic you want, including uploading records to a custom ES cluster.
If you use Python, the function could use the elasticsearch package (added as a Lambda layer) to connect to your custom cluster and inject records into it. elasticsearch is the Python interface to ES and it will work with any ES cluster. A sketch of this approach is below.
An alternative is to use HTTP Endpoint for Your Destination. In this scenario, you could run a small EC2 instance or container which would receive the records from Firehose and then push them to ES. Just like before, the elasticsearch library could be used with Python.
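Here is a minimal sketch of the transformation-Lambda approach, assuming the elasticsearch package is available as a layer; the endpoint and index name are placeholders, not anything Firehose provides:

    # Sketch of a Firehose transformation Lambda that also forwards each
    # record to a self-hosted Elasticsearch cluster. Endpoint and index
    # name are hypothetical placeholders.
    import base64
    import json

    from elasticsearch import Elasticsearch  # provided via a Lambda layer

    # elasticsearch-py 7.x style client; assumed endpoint
    es = Elasticsearch(["https://es.internal.example.com:9200"])

    def handler(event, context):
        output = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))
            # Index into the self-hosted cluster.
            es.index(index="cloudwatch-logs", body=payload)
            # Return the record unchanged so Firehose can still deliver it
            # to the configured S3 destination.
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": record["data"],
            })
        return {"records": output}

If an ES write fails you can instead mark that record "ProcessingFailed", in which case Firehose delivers it to the S3 error prefix, which gives you a crude dead-letter mechanism.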

Related

RDS (dynamic schema) -> AWS OpenSearch by using AWS Glue

I am using AWS RDS (MySQL) and I would like to sync this data to AWS Elasticsearch in real time.
I am thinking that the best solution for this is AWS Glue, but I am not sure whether it can achieve what I want.
This is the information for my RDS database:
■ RDS
・I would like to sync several tables (MySQL) to OpenSearch (1 table to 1 index).
・The schema of the tables will change dynamically.
・New columns may be added, or existing columns removed, since the previous sync
(so I also have to sync these schema changes).
Could you tell me roughly whether I could do these things with AWS Glue?
I wonder if AWS Glue can deal with dynamic schema changes and sync in (near) real time.
Thank you in advance.
Glue now has an OpenSearch connector, but Glue is an ETL tool that does batch-style operations very well; event-based or very frequent loads into Elasticsearch might not be the best fit, and the cost can also be high.
https://docs.aws.amazon.com/glue/latest/ug/tutorial-elastisearch-connector.html
DMS can help, though not completely, since as you mentioned the schema keeps changing.
Logstash Solution
Logstash has a jdbc input plugin that can be used to sync MySQL data into Elasticsearch on a schedule.
AWS Native solution
You can have a Lambda function fire on MySQL events (see Invoking a Lambda function from an Amazon Aurora MySQL DB cluster).
The Lambda will write JSON to Kinesis Firehose, and Firehose can load it into OpenSearch.
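As a rough sketch of that native path, the Aurora-invoked Lambda only needs to relay the row payload into Firehose; the stream name and payload shape here are assumptions:

    # Hypothetical Lambda invoked from Aurora MySQL (e.g. via lambda_async)
    # that relays the row image into a Firehose delivery stream whose
    # destination is OpenSearch. The stream name is a placeholder.
    import json
    import boto3

    firehose = boto3.client("firehose")

    def handler(event, context):
        # `event` is whatever payload your SQL trigger passed along.
        firehose.put_record(
            DeliveryStreamName="rds-to-opensearch",  # assumed stream name
            Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
        )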

Why front Amazon AWS ElasticSearch with AWS Kinesis Firehose

I see several applications where data is sent to AWS Kinesis Firehose and then automatically transferred to AWS Elasticsearch, even though you can write directly to AWS Elasticsearch. If I don't need any kind of data transformation and I can write data directly to Elasticsearch, does fronting Elasticsearch with Kinesis Firehose still provide any advantage? For example, does it protect Elasticsearch from spikes in traffic?
Apart from transformations, the following reasons can be considered for putting Firehose in front of AWS ES:
Better control over streaming data
Since Elasticsearch has a limit on its write queue size, a burst of data lasting even a few seconds can cause ES to reject the writes it cannot queue in time, and you will lose the rejected data.
However, when Firehose is in front, it handles the retries for you and there is less chance of data loss.
Firehose is one-way to ES
Your ES cluster might contain confidential data, and if you allow users to make POST requests (required for some writes), you might expose the cluster to more users than necessary. Firehose can help you limit that by giving writer applications/users access only to the Firehose stream instead of the ES cluster.

What is the difference between Kinesis Streams and Kinesis Firehose?

Firehose is fully managed, whereas Streams has to be managed manually.
If other people are aware of other major differences, please add them. I'm just learning.
Thanks.
Amazon Kinesis Data Firehose can send data to:
Amazon S3
Amazon Redshift
Amazon Elasticsearch Service
Splunk
To do the same thing with Amazon Kinesis Data Streams, you would need to write an application that consumes data from the stream and then connects to the destination to store data.
So, think of Firehose as a pre-configured streaming application with a few specific options. Anything outside of those options would require you to write your own code.
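To make the contrast concrete, here is a sketch of the two producer calls in boto3 (the stream names and payloads are made up): writing to Firehose is the whole job, while writing to a Data Stream still leaves the consumer side to you.

    import boto3

    # Firehose: write and forget; delivery to S3/Redshift/ES/Splunk is managed.
    boto3.client("firehose").put_record(
        DeliveryStreamName="my-delivery-stream",  # placeholder
        Record={"Data": b'{"event": "click"}\n'},
    )

    # Kinesis Data Streams: the write needs a partition key, and you must
    # also run your own consumer application to read the stream and store
    # the data somewhere.
    boto3.client("kinesis").put_record(
        StreamName="my-stream",  # placeholder
        Data=b'{"event": "click"}',
        PartitionKey="user-123",
    )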

How can I integrate AWS RDS with the AWS Elasticsearch service

How do I integrate AWS RDS with the AWS Elasticsearch service? Is there any AWS service I can use to stream data from AWS RDS to AWS Elasticsearch for indexing?
I'm not seeing a magic way to do this like there is for DynamoDB.
I can think of three ways:
1. Set up your RDS instance to log all transactions, and set up Logstash to parse the inserts and updates and index them into ES.
2. Create a special log file that your app uses to record the inserts and updates. It is less work to set up Logstash this way.
3. Make your app send all inserts and updates through SNS. From there, distribute them to an ES SQS queue and an RDS SQS queue, and have workers (or Lambdas) for each queue do the inserts into their respective stores. A sketch of this fan-out follows the list.
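A minimal sketch of the third option, assuming an SNS topic fanned out to SQS queues with Lambda workers; the topic ARN, index name, and ES endpoint are all placeholders:

    # Hypothetical SNS fan-out: the app publishes every write once, and
    # separate SQS-backed workers apply it to RDS and to Elasticsearch.
    import json
    import boto3
    from elasticsearch import Elasticsearch

    sns = boto3.client("sns")
    es = Elasticsearch(["https://es.internal.example.com:9200"])  # assumed

    def publish_write(row):
        # Called by the app for every insert/update.
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:db-writes",  # assumed
            Message=json.dumps(row),
        )

    def es_queue_worker(event, context):
        # Lambda subscribed to the ES SQS queue; unwraps the SNS envelope
        # and indexes the row.
        for record in event["Records"]:
            body = json.loads(record["body"])
            row = json.loads(body["Message"])
            es.index(index="app-data", id=row["id"], body=row)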

Stream data from EC2 web server to Redshift

We would like to stream data directly from our EC2 web servers to Redshift. Do I need to use Kinesis? What is the best practice? I do not plan to do any special analysis on this data before storage. I would like a cost-effective solution (it might be costly to use DynamoDB as temporary storage before loading).
If cost is your primary concern, then the exact number of records per second combined with the record sizes becomes important.
If you are talking about a very low volume of messages, a custom app running on a t2.micro instance to aggregate the data is about as cheap as you can go, but it won't scale. The bigger downside is that you are responsible for monitoring, maintaining, and managing that EC2 instance.
The modern approach is to use a combination of Kinesis + Lambda + S3 + Redshift to have the data stream in, with no EC2 instances to manage!
The approach is described in this blog post: A Zero-Administration Amazon Redshift Database Loader
What that blog post doesn't mention is now with API Gateway if you do need to do any type of custom authentication or data transformation you can do that without needing an EC2 instance by using Lambda to broker the data into Kinesis.
This would look like:
API Gateway -> Lambda -> Kinesis -> Lambda -> S3 -> Redshift
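A sketch of that broker Lambda, where the token check and stream name are stand-ins for whatever auth and naming you actually use:

    # Hypothetical "broker" Lambda behind API Gateway: it authenticates the
    # request, then pushes the payload into Kinesis. No EC2 involved.
    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def handler(event, context):
        # Placeholder auth; swap in your real scheme.
        if event["headers"].get("x-api-token") != "expected-token":
            return {"statusCode": 403, "body": "forbidden"}
        kinesis.put_record(
            StreamName="web-events",  # assumed stream name
            Data=event["body"].encode("utf-8"),
            PartitionKey=event["requestContext"]["requestId"],
        )
        return {"statusCode": 200, "body": json.dumps({"ok": True})}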
Redshift is best suited for batch loading using the COPY command. A typical pattern is to load data to either DynamoDB, S3, or Kinesis, then aggregate the events before using COPY to Redshift.
See also this useful SO Q&A.
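For reference, the batch-load step boils down to a single COPY once the events are staged in S3. A sketch using psycopg2, where the cluster endpoint, table, bucket, and IAM role are all placeholders:

    # Load a batch of staged S3 objects into Redshift with one COPY.
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # assumed
        port=5439, dbname="events", user="loader", password="...",
    )
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY web_events
            FROM 's3://my-bucket/staged/2015-10-08/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            FORMAT AS JSON 'auto';
        """)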
I implemented such a system last year inside my company using Kinesis and the Kinesis connector. The Kinesis connector is just a standalone app released by AWS; we run it on a bunch of Elastic Beanstalk servers as Kinesis consumers. The connector aggregates messages to S3 every so often, or after every so many messages, and then triggers the COPY command in Redshift to load the data periodically. Since it runs on Elastic Beanstalk, you can tune the auto-scaling conditions to make sure the cluster grows and shrinks with the volume of data from the Kinesis stream.
BTW, AWS just announced Kinesis Firehose yesterday. I haven't played with it yet, but it definitely looks like a managed version of the Kinesis connector.