Is it possible to transfer data from Redshift to Elasticsearch? - amazon-web-services

I'm working on something related to the Amazon Elasticsearch Service. For that, I need to get data from Amazon Redshift. The data to be transferred is huge, i.e. 100 GB. Is there any way to get it directly from Redshift, or is it a two-step process like Redshift -> S3 -> Elasticsearch?

I see, at least in theory, two possible approaches for transferring data from Redshift to Elasticsearch:
Logstash, using the JDBC input plugin
elasticsearch-jdbc
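If you would rather prototype without Logstash, the same idea (query Redshift over its PostgreSQL-compatible interface and bulk-index the rows) can be sketched in Python. This is only a sketch: the connection details, table, and index names are placeholders, and psycopg2 plus the elasticsearch client are assumed.

# Sketch: read rows from Redshift over its PostgreSQL-compatible interface
# and bulk-index them into Elasticsearch. All connection details are placeholders.
import psycopg2
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="myuser", password="mypassword",
)
es = Elasticsearch(["https://my-es-endpoint:443"])

def generate_actions():
    # A server-side (named) cursor streams rows instead of loading 100 GB into memory.
    with conn.cursor(name="redshift_stream") as cur:
        cur.itersize = 10000
        cur.execute("SELECT id, col_a, col_b FROM my_table")
        for row in cur:
            yield {
                "_index": "my_index",
                "_id": row[0],
                "_source": {"col_a": row[1], "col_b": row[2]},
            }

# streaming_bulk groups the documents into bulk requests.
for ok, info in streaming_bulk(es, generate_actions(), chunk_size=5000):
    if not ok:
        print("failed:", info)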

Don't gzip the data you unload.
Use the bulk API to load into Elasticsearch.
Use a large number of records per bulk request (>5000) – fewer, larger bulk requests are better than many smaller ones.
When working with the AWS Elasticsearch service there is a risk of hitting the limits of the bulk queue size.
Process a single file in the Lambda and then recursively call the Lambda function with an event.
Before recursing, wait for a couple of seconds (setTimeout). When waiting, make sure that you aren't idle for 30 seconds, because your Lambda will stop.
Don't use S3 object creation to trigger your Lambda – you'll end up with multiple Lambda functions being called at the same time.
Don't bother trying to put Kinesis in the middle – unloading your data into Kinesis is almost certain to hit load limits in Kinesis.
Monitor your Elasticsearch bulk queue size with something like this:
curl https://%ES-SERVER:PORT%/_nodes/stats/thread_pool | jq '.nodes | to_entries[].value.thread_pool.bulk'
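A rough Python sketch of that single-file-per-invocation pattern follows. The bucket name, index name, event shape, and the use of boto3 plus the elasticsearch Python client are all assumptions; the original advice is written for Node.js with setTimeout, and time.sleep plays the same role here.

# Sketch: process one unloaded file per invocation, then re-invoke the Lambda
# asynchronously for the next file. Bucket, index, and event shape are assumptions.
import json
import time
import boto3
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")
es = Elasticsearch(["https://my-es-endpoint:443"])

def handler(event, context):
    keys = event["keys"]                      # remaining unloaded files in S3
    key = keys.pop(0)
    body = s3.get_object(Bucket="my-unload-bucket", Key=key)["Body"]

    actions = (
        {"_index": "my_index", "_source": json.loads(line)}
        for line in body.iter_lines()
    )
    bulk(es, actions, chunk_size=5000)        # few large bulk requests

    if keys:
        time.sleep(2)                         # brief pause before recursing
        lambda_client.invoke(
            FunctionName=context.function_name,
            InvocationType="Event",           # asynchronous, fire-and-forget
            Payload=json.dumps({"keys": keys}),
        )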

It looks like there is no direct data transfer pipeline for pushing data into Elasticsearch from Redshift. One alternative approach is to first dump the data into S3 and then push it into Elasticsearch.
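For the Redshift -> S3 step, UNLOAD is the usual tool. A minimal sketch, assuming psycopg2, with the IAM role ARN, bucket, and query as placeholders (and leaving the output uncompressed, per the advice above):

# Sketch: UNLOAD the table to S3 as uncompressed CSV, ready for bulk loading.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="myuser", password="mypassword",
)
unload_sql = """
    UNLOAD ('SELECT * FROM my_table')
    TO 's3://my-unload-bucket/my_table/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
    FORMAT AS CSV
    ALLOWOVERWRITE;
"""
with conn, conn.cursor() as cur:
    cur.execute(unload_sql)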

Related

Ingest large amounts of csv data into elastic search on a daily basis

I have some fairly large datasets (upwards of 60k rows in a CSV file) that I need to ingest into Elasticsearch on a daily basis (to keep the data updated).
I currently have two lambda functions handling this.
Lambda 1:
A Python Lambda (a Node.js one would run out of memory doing this task) is triggered when a .csv file is added to S3 (this file could have upwards of 60k rows). The Lambda converts this to JSON and saves it to another S3 bucket.
Lambda 2:
A Node.js Lambda that is triggered by the .json files generated by Lambda 1. This Lambda uses the Elasticsearch bulk API to try to insert all of the data into ES.
However, because of the large amount of data, we hit ES API rate limiting and fail to insert much of the data.
I have tried splitting the data and uploading smaller amounts at a time; however, this would then be a very long-running Lambda function.
I have also looked at adding the data to a Kinesis stream; however, even that has a limit on the amount of data you can add in each operation.
I am wondering what the best solution may be for inserting large amounts of data like this into ES. My next thought is possibly splitting the .json files into multiple smaller .json files and triggering the Lambda that adds data to ES for each smaller file. However, I am concerned that I would still just hit the rate limiting of the ES domain.
Edit: Looking into the Kinesis Firehose option, this seems like the best option, as I can set the buffer size to a maximum of 5 MB (this is the ES bulk API limit).
However, Firehose has a 1 MB limit per record, so I'd still need to do some form of processing in the Lambda that pushes to Firehose to split up the data before pushing.
I'd suggest designing the application to use SQS for queuing your messages rather than using firehose (which is more expensive and perhaps not the best option for your use case). Amazon SQS provides a lightweight queueing solution and is cheaper than Firehose (https://aws.amazon.com/sqs/pricing/)
Below is how it can work -
Lambda 1 converts each row to JSON and posts each JSON to SQS.
(Assuming each JSON message is less than 256 KB.)
The SQS queue acts as an event source for Lambda 2 and triggers it in batches of, say, 5000 messages.
(Ref - https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html)
Lambda 2 uses the payload received from SQS to insert into Elasticsearch using the Bulk API.
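A rough sketch of Lambda 2's handler, assuming the standard SQS event shape and the elasticsearch Python client (the index name is a placeholder):

# Sketch of Lambda 2: take a batch of SQS messages (each body is one JSON
# document) and index them with a single bulk request. The index name is made up.
import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["https://my-es-endpoint:443"])

def handler(event, context):
    actions = [
        {"_index": "my_index", "_source": json.loads(record["body"])}
        for record in event["Records"]        # the SQS batch delivered to Lambda
    ]
    bulk(es, actions)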
The batch size can be adjusted based on your observation of how the Lambda is performing. Make sure to adjust the visibility timeout and set up a DLQ so it runs reliably.
This also reduces S3 cost by avoiding storing the intermediate JSON in S3 for the second Lambda to pick up. The data ends up stored in Elasticsearch anyway, so keeping a duplicate copy of it should be avoided.
You could potentially create one record per row and push it to Firehose. When the data in the Firehose stream reaches the configured buffer size, it will be flushed to ES. This way only one Lambda is required, which can process the records from the CSV and push them to Firehose.
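A hedged sketch of that single Lambda, assuming boto3 and a made-up bucket, key, and delivery stream name (PutRecordBatch accepts at most 500 records per call, so the rows are chunked):

# Sketch: read the CSV from S3, turn each row into a JSON record, and push the
# rows to Firehose in batches. Bucket, key, and stream name are placeholders.
import csv
import io
import json
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

def handler(event, context):
    obj = s3.get_object(Bucket="my-csv-bucket", Key="data.csv")
    reader = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

    batch = []
    for row in reader:
        batch.append({"Data": (json.dumps(row) + "\n").encode("utf-8")})
        if len(batch) == 500:                  # PutRecordBatch limit per call
            firehose.put_record_batch(
                DeliveryStreamName="my-delivery-stream", Records=batch)
            batch = []
    if batch:
        firehose.put_record_batch(
            DeliveryStreamName="my-delivery-stream", Records=batch)

Firehose then buffers these records and flushes them to ES when the configured buffer size or interval is reached.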

How to query AWS load balancer log if there are terabytes of logs?

I want to query the AWS load balancer logs to automatically send a report for me on a schedule.
I am using Amazon Athena and AWS Lambda to trigger Athena. I created data table based on guide here: https://docs.aws.amazon.com/athena/latest/ug/application-load-balancer-logs.html
However, I encounter following issues:
The logs bucket increases in size day by day, and I notice that if an Athena query needs more than 5 minutes to return a result, it sometimes produces an "unknown error".
The maximum timeout for an AWS Lambda function is only 15 minutes, so I cannot keep increasing the Lambda function timeout to wait for Athena to return a result (in case Athena needs more than 15 minutes, for example).
Can you suggest a better solution to my problem? I am thinking of using the ELK stack, but I have no experience working with ELK. Can you show me the advantages and disadvantages of ELK compared to the combination of AWS Lambda + AWS Athena? Thank you!
First off, you don't need to keep your Lambda running while the Athena query executes. StartQueryExecution returns a query identifier that you can then poll with GetQueryExecution to determine when the query finishes.
Of course, that doesn't work so well if you're invoking the query as part of a web request, but I recommend not doing that. And, unfortunately, I don't see that Athena is tied into CloudWatch Events, so you'll have to poll for query completion.
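A minimal boto3 sketch of that start-then-poll pattern (the database name, output location, and the query itself are placeholders):

# Sketch: start an Athena query, then poll for completion instead of blocking.
import time
import boto3

athena = boto3.client("athena")

start = athena.start_query_execution(
    QueryString="SELECT elb, count(*) FROM alb_logs GROUP BY elb",
    QueryExecutionContext={"Database": "my_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = start["QueryExecutionId"]

while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)                              # poll; no event to wait on
print(state)

Once the state is SUCCEEDED, the results are already in the S3 output location and can be fetched with GetQueryResults or read directly from S3.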
With that out of the way, the problem with reading access logs with Athena is that it isn't easy to partition them. The example that AWS provides defines the table inside Athena, and the default partitioning scheme uses S3 paths that have segments of the form /column=value/. However, ALB access logs use a simpler yyyy/mm/dd partitioning scheme.
If you use AWS Glue, you can define a table format that uses this simpler scheme. I haven't done that so can't give you information other than what's in the docs.
Another alternative is to limit the amount of data in your bucket. This can save on storage costs as well as reduce query times. I would do something like the following:
Bucket_A is the destination for access logs, and the source for your Athena queries. It has a life-cycle policy that deletes logs after 30 (or 45, or whatever) days.
Bucket_B is set up to replicate logs from Bucket_A (so that you retain everything, forever). It immediately transitions all replicated files to "infrequent access" storage, which cuts the cost in half.
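The Bucket_A part of that setup is just a lifecycle rule; a sketch with boto3 (the bucket name and retention period are placeholders, and the Bucket_B replication and storage-class transition would be configured separately):

# Sketch: expire access logs in Bucket_A after 30 days. The bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-alb-access-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-access-logs",
                "Filter": {"Prefix": ""},       # apply to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)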
Elasticsearch is certainly a popular option. You'll need to convert the files in order to upload them. I haven't looked, but I'm sure there's a Logstash plugin that will do so. Depending on what you're looking to do for reporting, Elasticsearch may be better or worse than Athena.

What is the best way to copy large csv files from s3 to redshift?

I'm working on a task of copying csv files from s3 bucket to redshift. I've found multiple ways to do so but I'm not sure which one will be the best possible way to do it. Here's the scenario:
At regular intervals, multiple CSV files of around 500 MB - 1 GB will be added to my S3 bucket. The data can contain duplicates. The task is to copy the data to a Redshift table while ensuring that duplicate data is not present in Redshift.
Here are the ways I found which can be used:
Create an AWS Lambda function which will be triggered whenever a file is added to the S3 bucket.
Use AWS Kinesis
Use AWS Glue
I understand Lambda should not be used for jobs that take more than 5 minutes. So should I use it or just eliminate this option?
Kinesis can handle large amounts of data, but is it the best way to do it?
I'm not familiar with Glue or Kinesis, but I read that Glue can be slow.
If anyone can point me in the right direction, it will be really helpful.
You can definitely make it work with Lambda, if you leverage StepFunctions and the S3 Select option to filter subsets of data into smaller chunks. You'd have your Step Functions manage your ETL orchestration wherein you execute your lambdas that selectively pull from the large data file via the S3 select option. Your pre-process state--see links below--could be used to determine execution requirements, then execute multiple Lambdas, even in parallel, if you wish. Those lambdas would process the subsets of data to remove dups and perform any other ETL operations you might require. Then, you'd take the processed data and write to Redshift. Here are links that will help you put that architecture together:
Trigger State Machine Execution from S3 Event
Manage Lambda Processing Executions and workflow state
Use S3 Select to pull subsets from large data objects
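For the S3 Select piece specifically, here is a hedged boto3 sketch of pulling a filtered subset out of a large CSV object (the bucket, key, and SQL expression are placeholders):

# Sketch: use S3 Select to pull only the rows you need out of a large CSV object.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-data-bucket",
    Key="big-file.csv",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.\"status\" = 'active'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; collect the Records payloads as they arrive.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")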
Also, here's a link to a Python ETL pipeline example for the CDK that I built. You'll see an example of an S3 event-driven lambda along with data processing and DDB or MySQL writes. Will give you an idea as to how you can build out comprehensive Lambdas for ETL operations. You would just need to add a psycopg2 layer to your deployment for Redshift. Hope this helps.

AWS lambda extract large data and upload to s3

I am trying to write a Node.js Lambda function to query data from our database cluster and upload this to S3; we require this for further analysis. My doubt is: if the data to be queried from the DB is large (9 GB), how does the Lambda function handle it, given that the memory limit is 3008 MB?
There is also a disk storage limit of 500MB.
Therefore, you would need to stream the result to Amazon S3 as it is coming in from the database.
You might also run into a time limit of 15 minutes for a Lambda function, depending upon how fast the database can query and transfer that quantity of information.
You might consider an alternative strategy, such as having the Lambda function call Amazon Athena to query the database. The results of an Athena query are automatically saved to Amazon S3, which would avoid the need to transfer the data.
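A sketch of the streaming approach, assuming a PostgreSQL-compatible source and psycopg2 (all names are placeholders); S3 multipart parts must be at least 5 MB except the last:

# Sketch: stream rows from the database into an S3 multipart upload so the
# 9 GB result never has to fit in Lambda memory or /tmp. All names are placeholders.
import io
import boto3
import psycopg2

s3 = boto3.client("s3")
BUCKET, KEY = "my-export-bucket", "export/rows.csv"
PART_SIZE = 8 * 1024 * 1024                    # parts must be >= 5 MB (except the last)

conn = psycopg2.connect(host="db-host", dbname="mydb", user="myuser", password="mypassword")
mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts, buf = [], io.BytesIO()

def flush():
    # Upload the buffered bytes as the next part and reset the buffer.
    global buf
    resp = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=len(parts) + 1,
                          UploadId=mpu["UploadId"], Body=buf.getvalue())
    parts.append({"ETag": resp["ETag"], "PartNumber": len(parts) + 1})
    buf = io.BytesIO()

with conn.cursor(name="export") as cur:        # server-side cursor streams rows
    cur.itersize = 10000
    cur.execute("SELECT * FROM big_table")
    for row in cur:
        buf.write((",".join(map(str, row)) + "\n").encode("utf-8"))
        if buf.tell() >= PART_SIZE:
            flush()
if buf.tell():
    flush()

s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})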
Lambda has some limitations in terms of run time and space. It's better to use a crawler or a job in Amazon Glue; it's the easy way of doing this.
For that, go to Amazon Glue >> Jobs >> Create job, fill in the basic requirements like source and destination, and run the job. Glue doesn't have those size and time constraints.
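If you prefer scripting over the console, the same job can be sketched with boto3 (the role ARN, script location, and job name are placeholders):

# Sketch: create and start a Glue job from code instead of the console.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="export-db-to-s3",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-scripts/export_db_to_s3.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
)
run = glue.start_job_run(JobName="export-db-to-s3")
print(run["JobRunId"])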

Partition Kinesis firehose S3 records by event time

Firehose->S3 uses the current date as a prefix for creating keys in S3. So this partitions the data by the time the record is written. My firehose stream contains events which have a specific event time.
Is there a way to create S3 keys containing this event time instead? Processing tools downstream depend on each event being in an "hour-folder" related to when it actually happened. Or would that have to be an additional processing step after Firehose is done?
The event time could be in the partition key or I could use a Lambda function to parse it from the record.
Kinesis Firehose doesn't (yet) allow clients to control how the date suffix of the final S3 objects is generated.
The only option available to you is to add a post-processing layer after Kinesis Firehose. For example, you could schedule an hourly EMR job, using Data Pipeline, that reads all files written in the last hour and publishes them to the correct S3 destinations.
This is not an answer to the question; however, I would like to explain a little the idea behind storing records according to their arrival time.
First, a few words about streams. Kinesis is just a stream of data, and it has a concept of consuming. One can reliably consume a stream only by reading it sequentially. There is also the idea of checkpoints as a mechanism for pausing and resuming the consuming process. A checkpoint is just a sequence number which identifies a position in the stream. By specifying this number, one can start reading the stream from a certain event.
Now back to the default Firehose-to-S3 setup... Since the capacity of a Kinesis stream is quite limited, most probably one needs to store the data from Kinesis somewhere to analyze it later. The Firehose-to-S3 setup does this right out of the box: it just stores raw data from the stream in S3 buckets. But logically this data is still the same stream of records, and to be able to reliably consume (read) this stream one needs those sequential numbers for checkpoints. And those numbers are record arrival times.
What if I want to read records by creation time? It looks like the proper way to accomplish this is to read the S3 stream sequentially, dump it to some [time series] database or data warehouse, and do creation-time-based readings against that storage. Otherwise there will always be a non-zero chance of missing some bunches of events while reading the S3 (stream). So I would not suggest reordering the S3 buckets at all.
You'll need to do some post-processing or write a custom streaming consumer (such as Lambda) to do this.
We dealt with a huge event volume at my company, so writing a Lambda function didn't seem like a good use of money. Instead, we found batch-processing with Athena to be a really simple solution.
First, you stream into an Athena table, events, which can optionally be partitioned by an arrival-time.
Then, you define another Athena table, say, events_by_event_time which is partitioned by the event_time attribute on your event, or however it's been defined in the schema.
Finally, you schedule a process to run an Athena INSERT INTO query that takes events from events and automatically repartitions them into events_by_event_time; now your events are partitioned by event_time without requiring EMR, Data Pipeline, or any other infrastructure.
You can do this with any attribute on your events. It's also worth noting you can create a view that does a UNION of the two tables to query real-time and historic events.
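Concretely, the scheduled step is just an Athena query submitted the usual way; a hedged sketch, with the table, partition column, database, and output location all placeholders:

# Sketch: repartition yesterday's events by their event_time-derived partition.
import boto3

athena = boto3.client("athena")

query = """
    INSERT INTO events_by_event_time
    SELECT *
    FROM events
    WHERE arrival_date = date_add('day', -1, current_date)
"""
athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_events_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)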
I actually wrote more about this in a blog post here.
For future readers - Firehose supports Custom Prefixes for Amazon S3 Objects
https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
AWS started offering "Dynamic Partitioning" in Aug 2021:
Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
Look at https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html. You can implement a Lambda function which takes your records, processes them, determines the partition key, and then hands them back to Firehose to be delivered. You would also have to change the Firehose delivery stream to enable this partitioning and define your custom partition key/prefix/suffix.
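A hedged sketch of such a transformation Lambda, following the record format described in the dynamic partitioning docs (the event_time field and the year/month/day/hour partition keys are assumptions about your payload and delivery stream configuration):

# Sketch of a Firehose transformation Lambda for dynamic partitioning: decode each
# record, pull the event time out of the payload (assumed to be epoch seconds in an
# "event_time" field), and return it as partition keys. Data passes through unchanged.
import base64
import json
from datetime import datetime, timezone

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        ts = datetime.fromtimestamp(payload["event_time"], tz=timezone.utc)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": record["data"],
            "metadata": {
                "partitionKeys": {
                    "year": ts.strftime("%Y"),
                    "month": ts.strftime("%m"),
                    "day": ts.strftime("%d"),
                    "hour": ts.strftime("%H"),
                },
            },
        })
    return {"records": output}

Firehose then combines these partitionKeys with the prefix expressions you configure on the delivery stream to build the final S3 keys.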