Looking for guidance/alternatives on ETL process on AWS

I'm working on an ETL pipeline using Apache NiFi. The flow runs hourly and looks something like this:
1. data provider API -> Apache NiFi -> S3 landing
2. Athena query to transform the data -> S3 stage
3. Athena query to change field types and join with other data so it is ready for analysis -> S3 trusted
4. Glue -> Redshift
I found Glue to be expensive for sending data to Redshift, so I will code something ad hoc that uses the COPY command.
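For illustration, a minimal sketch of that ad hoc COPY step, assuming a psycopg2 connection and an IAM role already attached to the cluster (the table, bucket, and role names below are placeholders):

```python
# Hypothetical sketch: load the "trusted" S3 prefix into Redshift with COPY.
# Connection details, table, bucket, and IAM role ARN are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="...",
)

copy_sql = """
    COPY analytics.events_trusted
    FROM 's3://my-bucket/trusted/2024/01/01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift pulls the files from S3 in parallel
```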
The question I would like to ask is whether you can point me to a better/cheaper/more scalable tool or approach, especially for steps 2 and 3.
I'm looking for ways to optimize this process and make it ready to receive millions of records per hour.
Thank you!

Interesting workflow.
You can actually use some neat combinations to automatically get data from S3 into Redshift.
You can do S3 (raw data) -> Lambda (triggered off the S3 PUT notification) -> Kinesis Firehose -> S3 (batched & transformed with a Firehose transformer) -> Firehose's Redshift COPY
This flow will completely automate updates based on your data. You can read more about it here. Hope this helps.
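A rough sketch of the Lambda piece of that chain, assuming the Firehose delivery stream already exists (the stream name and line-per-record handling are hypothetical):

```python
# Hypothetical Lambda handler: triggered by the S3 PUT notification, reads the
# new object and forwards its lines to a Kinesis Firehose delivery stream.
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")
STREAM_NAME = "etl-to-redshift"  # placeholder delivery stream name

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # Firehose accepts at most 500 records per PutRecordBatch call
        lines = [line + b"\n" for line in body.splitlines() if line]
        for i in range(0, len(lines), 500):
            firehose.put_record_batch(
                DeliveryStreamName=STREAM_NAME,
                Records=[{"Data": line} for line in lines[i:i + 500]],
            )
```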

You can save your data in a partitioned fashion in S3.
Then use Glue Spark jobs to transform the data and implement the joins and aggregations; that will be fast if written in an optimized way.
This will also save you cost, as Glue will process the data faster than expected, and then the COPY command is the best approach for moving the data to Redshift.
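As a hedged sketch, a Glue Spark job along those lines might look like this (the catalog database, table, column, and bucket names are placeholders):

```python
# Hypothetical Glue job: read two partitioned catalog tables, join and
# aggregate them, then write the result back to S3 partitioned by date.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue = GlueContext(SparkContext.getOrCreate())

events = glue.create_dynamic_frame.from_catalog(
    database="landing", table_name="events").toDF()
users = glue.create_dynamic_frame.from_catalog(
    database="landing", table_name="users").toDF()

joined = (events.join(users, "user_id")
                .groupBy("user_id", "dt")
                .count())

(joined.write.mode("overwrite")
       .partitionBy("dt")
       .parquet("s3://my-bucket/trusted/events_by_user/"))
```

From there, a plain Redshift COPY of the trusted prefix (as in the question) finishes the load.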

Read about AWS Glue: https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console.

Related

What is the best way to copy large csv files from s3 to redshift?

I'm working on a task of copying CSV files from an S3 bucket to Redshift. I've found multiple ways to do so, but I'm not sure which one would be best. Here's the scenario:
At regular intervals, multiple CSV files of around 500 MB - 1 GB each will be added to my S3 bucket. The data can contain duplicates. The task is to copy the data to a Redshift table while ensuring that duplicate data is not present in Redshift.
Here are the options I found:
Create an AWS Lambda function that is triggered whenever a file is added to the S3 bucket.
Use AWS Kinesis
Use AWS Glue
I understand Lambda should not be used for jobs that take more than 5 minutes. So should I use it, or just eliminate this option?
Kinesis can handle large amounts of data, but is it the best way to do it?
I'm not familiar with Glue or Kinesis, but I have read that Glue can be slow.
If anyone can point me in the right direction, it would be really helpful.
You can definitely make it work with Lambda if you leverage Step Functions and the S3 Select option to filter subsets of data into smaller chunks. You'd have Step Functions manage your ETL orchestration, executing Lambdas that selectively pull from the large data file via S3 Select. Your pre-process state (see links below) could be used to determine execution requirements and then execute multiple Lambdas, even in parallel if you wish. Those Lambdas would process the subsets of data to remove duplicates and perform any other ETL operations you might require. Then you'd take the processed data and write it to Redshift. Here are links that will help you put that architecture together:
Trigger State Machine Execution from S3 Event
Manage Lambda Processing Executions and workflow state
Use S3 Select to pull subsets from large data objects
Also, here's a link to a Python ETL pipeline example for the CDK that I built. You'll see an example of an S3 event-driven Lambda along with data processing and DynamoDB or MySQL writes. It will give you an idea of how you can build out comprehensive Lambdas for ETL operations. You would just need to add a psycopg2 layer to your deployment for Redshift. Hope this helps.
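For reference, a minimal sketch of the S3 Select call such a Lambda might make (the bucket, key, and query are hypothetical):

```python
# Hypothetical: pull only a filtered subset of rows from a large CSV object
# with S3 Select, so each Lambda processes a small chunk.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-raw-bucket",
    Key="exports/big-file.csv",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.country = 'US'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

for event in resp["Payload"]:
    if "Records" in event:
        chunk = event["Records"]["Payload"].decode("utf-8")
        # ... de-duplicate / transform the rows here, then stage for Redshift
        print(chunk[:200])
```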

Pipeline for parsing daily json data with AWS?

JSON files are posted daily to an S3 bucket. I want to take that JSON file, do some processing on it, then post the data to a new S3 bucket, where it will get picked up and stored in Redshift. What would be the recommended AWS pipeline for this? An AWS Lambda that triggers when a new JSON file is placed on S3 and then kicks off something like an AWS Batch job? Or something else? I am not familiar with all AWS web services, so I might be overlooking something obvious.
So the flow looks like this:
s3 bucket -> data processing -> s3 bucket -> redshift
and it's the data processing step I'm not sure about: how to schedule something fairly scalable that runs daily, runs efficiently, and puts the data back. The processing is parsing of JSON data plus some aggregation and data cleanup.
"and it's the data processing step I'm not sure about - how to schedule something fairly scalable that runs daily and efficiently and puts the data back."
Don't worry about scalability with Lambda; just focus on short-running jobs. Here is an example:
https://docs.aws.amazon.com/lambda/latest/dg/with-scheduledevents-example.html
I think one piece of the puzzle you're missing is the documentation for Schedule Expressions Using Rate or Cron: https://docs.aws.amazon.com/lambda/latest/dg/with-scheduledevents-example.html
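To make the rate/cron piece concrete, here is a hedged boto3 sketch that attaches a daily schedule to an existing Lambda (function name, ARN, and rule name are placeholders):

```python
# Hypothetical: create an EventBridge (CloudWatch Events) rule that fires once
# a day and point it at the JSON-processing Lambda.
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_NAME = "process-daily-json"
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-daily-json"

rule = events.put_rule(
    Name="daily-json-processing",
    ScheduleExpression="cron(0 6 * * ? *)",  # every day at 06:00 UTC
    State="ENABLED",
)

events.put_targets(
    Rule="daily-json-processing",
    Targets=[{"Id": FUNCTION_NAME, "Arn": FUNCTION_ARN}],
)

# Allow the rule to invoke the function.
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId="allow-eventbridge-daily",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
```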

AWS data pipeline: dump data to 3 s3 nodes

I have a use case where I want to take data from DynamoDB and do some transformation on it. After that I want to create 3 CSV files (there will be 3 transformations on the same data) and dump them to 3 different S3 locations.
My architecture would be something like the following:
Is it possible to do this? I can't seem to find any documentation about it. If it's not possible using Data Pipeline, are there other services that could help with my use case?
These dumps will be scheduled daily. My other consideration was using AWS Lambda, but as I understand it, Lambda is event-triggered rather than time-scheduled; is that correct?
Yes, it is possible, but using EmrActivity instead of HiveActivity. If you look at the Data Pipeline documentation for HiveActivity, it clearly states its purpose, and it does not suit your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look. There is also a built-in template, Export DynamoDB table to S3, in the AWS Data Pipeline UI that creates the basic structure for you; you can then extend/customize it to suit your requirements.
To your next question about Lambda: of course Lambda can be configured with event-based or schedule-based triggering, but I wouldn't recommend AWS Lambda for any ETL operations, as Lambdas are time-bound and typical ETLs run longer than Lambda's time limits.
AWS has specific, optimized offerings for ETL: AWS Data Pipeline and AWS Glue. I would always recommend choosing between those two. If your ETL involves data sources not managed within AWS compute and storage services, or some specialty use case the two options above can't satisfy, then AWS Batch would be my next consideration.
Thanks amith for your answer. I have been busy for quite some time now. I did some digging after you posted your answer; it turns out we can dump the data to different S3 locations using a Hive activity as well.
This is how the data pipeline would look in that case.
But I believe writing multiple Hive activities, when your input source is a DynamoDB table, is not a good idea, since Hive doesn't load any data into memory. It does all the computations on the actual table, which could degrade the table's performance. Even the documentation suggests exporting the data in case you need to run multiple queries against the same data. Reference:
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different types of aggregations on the same data once a day. Since DynamoDB doesn't support aggregations, I turned to Data Pipeline using Hive. In the end we ended up using AWS Aurora, which is MySQL-based.
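For completeness, the "read the table once, write three transformed CSVs" idea can be sketched in plain Python as well, e.g. from a scheduled job (the table name, bucket, keys, and transformations are all hypothetical):

```python
# Hypothetical: scan the DynamoDB table once, apply three transformations in
# memory, and write each result as CSV to its own S3 location.
import csv
import io
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def scan_all(table_name):
    table = dynamodb.Table(table_name)
    items, resp = [], table.scan()
    items.extend(resp["Items"])
    while "LastEvaluatedKey" in resp:
        resp = table.scan(ExclusiveStartKey=resp["LastEvaluatedKey"])
        items.extend(resp["Items"])
    return items

def write_csv(rows, bucket, key):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue().encode("utf-8"))

items = scan_all("orders")  # single read of the table
transforms = {
    "daily/by_customer.csv": lambda rows: rows,  # placeholder transformations
    "daily/by_region.csv": lambda rows: rows,
    "daily/by_product.csv": lambda rows: rows,
}
for key, transform in transforms.items():
    write_csv(transform(items), "my-exports-bucket", key)
```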

Kinesis to S3 custom partitioning

Question
I've read this, this, and this article, but they provide contradictory answers to the question: how do I customize partitioning when ingesting data into S3 from a Kinesis stream?
More details
Currently, I'm using Firehose to deliver data from Kinesis Streams to Athena. Afterward, the data will be processed with EMR Spark.
From time to time I have to handle historical bulk ingest into Kinesis Streams. The issue is that my Spark logic depends heavily on data partitioning and the order of event handling. But Firehose supports partitioning only by ingestion_time (into the Kinesis stream), not by any other custom field (I need event_time).
For example, under Firehose's partition 2018/12/05/12/some-file.gz I can get data from the last few years.
Workarounds
Could you please help me to choose between the following options?
Copy/partition data from the Kinesis stream with the help of a custom Lambda. But this looks more complex and error-prone to me, maybe because I'm not very familiar with AWS Lambda. Moreover, I'm not sure how well it will perform on a bulk load. In this article it was said that the Lambda option is much more expensive than Firehose delivery.
Load data with Firehose, then launch a Spark EMR job to copy the data to another bucket with the right partitioning. At least this sounds simpler to me (biased, as I'm just starting with AWS Lambda), but it has the drawback of a double copy and an additional Spark job.
In one hour I could have up to 1M rows that take up to 40 MB of memory (compressed). From Using AWS Lambda with Amazon Kinesis I know that Kinesis-to-Lambda event sourcing has a limit of 10,000 records per batch. Would it be effective to process such a volume of data with Lambda?
While Kinesis does not allow you to define custom partitions, Athena does!
The Kinesis stream will stream into a table, say data_by_ingestion_time, and you can define another table data_by_event_time that has the same schema but is partitioned by event_time.
Now you can make use of Athena's INSERT INTO capability to repartition the data without needing to write a Hadoop or Spark job, and you get Athena's serverless scale-up for your data volume. You can use SNS, cron, or a workflow engine like Airflow to run this at whatever interval you need.
We dealt with this at my company, and the post below goes into more detail about the trade-offs of using EMR or a streaming solution, but this way you don't need to introduce any more systems like Lambda or EMR.
https://radar.io/blog/custom-partitions-with-kinesis-and-athena
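A hedged sketch of that repartitioning step, driving the Athena INSERT INTO from Python (the database, tables, and columns are placeholders, and the exact SELECT depends on your schema):

```python
# Hypothetical: copy recently ingested rows into a table partitioned by
# event_time instead of ingestion_time, using Athena INSERT INTO.
import boto3

athena = boto3.client("athena")

REPARTITION_SQL = """
INSERT INTO data_by_event_time
SELECT user_id, payload, event_time        -- partition column goes last
FROM data_by_ingestion_time
WHERE ingestion_time >= date_add('hour', -1, now())
"""

athena.start_query_execution(
    QueryString=REPARTITION_SQL,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```

Run it from cron, SNS, or Airflow at whatever interval matches your ingest.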
You may use the Kinesis stream and create the partitions as you want.
You create a producer, and in your consumer you create the partitions.
https://aws.amazon.com/pt/kinesis/data-streams/getting-started/
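If you go down that road, the idea is roughly that the producer puts records on the stream as usual and your own consumer derives the S3 prefix from each record's event_time. A hedged sketch (stream, bucket, and field names are hypothetical):

```python
# Hypothetical producer/consumer pair. The consumer (e.g. a Lambda with a
# Kinesis event source) writes each record under an event_time-based prefix.
import base64
import json
import boto3

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

def produce(event):
    kinesis.put_record(
        StreamName="events",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],  # spreads records across shards
    )

def consumer_handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Partition by the event's own timestamp, not by ingestion time:
        # "2018-12-05T12:34:56" -> "2018/12/05/12"
        dt = payload["event_time"][:13].replace("-", "/").replace("T", "/")
        key = f"events/{dt}/{record['kinesis']['sequenceNumber']}.json"
        s3.put_object(Bucket="my-partitioned-bucket", Key=key,
                      Body=json.dumps(payload).encode("utf-8"))
```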

How to copy data in bulk from Kinesis -> Redshift

When I read about AWS Data Pipeline, the idea immediately struck me: produce statistics to Kinesis and create a job in Data Pipeline that will consume data from Kinesis and COPY it to Redshift every hour, all in one go.
But it seems there is no node in Data Pipeline that can consume Kinesis. So now I have two possible plans of action:
Create an instance where Kinesis's data will be consumed and sent to S3, split by hour. Data Pipeline will copy from there to Redshift.
Consume from Kinesis and issue the COPY directly to Redshift on the spot.
What should I do? Is there no way to connect Kinesis to Redshift using AWS services only, without custom code?
It is now possible to do this without user code via a new managed service called Kinesis Firehose. It manages the desired buffering intervals, temporary uploads to S3, the upload to Redshift, error handling, and automatic throughput management.
That is already done for you!
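A hedged boto3 sketch of what setting that up looks like, reading from the Kinesis stream and COPYing into Redshift (cluster, role, bucket, and table details are placeholders; the console achieves the same with a few clicks):

```python
# Hypothetical: a Firehose delivery stream that reads from a Kinesis stream,
# buffers the data in S3, and COPYs it into a Redshift table automatically.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="stats-to-redshift",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/stats",
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
    },
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
        "ClusterJDBCURL": "jdbc:redshift://my-cluster.abc.us-east-1.redshift.amazonaws.com:5439/analytics",
        "CopyCommand": {
            "DataTableName": "public.stats",
            "CopyOptions": "FORMAT AS JSON 'auto'",
        },
        "Username": "etl_user",
        "Password": "...",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",
            "BucketARN": "arn:aws:s3:::my-firehose-staging",
            "BufferingHints": {"IntervalInSeconds": 900, "SizeInMBs": 64},
            "CompressionFormat": "UNCOMPRESSED",
        },
    },
)
```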
If you use the Kinesis Connector Library, there is a built-in connector to Redshift:
https://github.com/awslabs/amazon-kinesis-connectors
Depending on the logic you have to process, the connector can be really easy to implement.
You can create and orchestrate a complete pipeline with InstantStack to read data from Kinesis, transform it, and push it into Redshift or S3.