I am working on an AWS POC that uses several AWS components; the details of each one are below.
1- A Java function that generates data; I call it from a Lambda function via a CloudWatch scheduler.
2- AWS Data Pipeline to copy data from RDS to S3.
3- Hive scripts run with Athena over the S3 data.
4- QuickSight for visualization.
I am done creating the individual pieces, but I can't work out the best way to connect all these components so that the whole flow runs in one go.
One thought is to use Lambda as the connector for each step, but I have no template for connecting Lambda with Athena.
Can anyone suggest the best way to connect all of the above components so that everything runs in one go?
I am not familiar with Hive scripts or QuickSight, but a CloudFormation or Terraform stack should help you connect the various AWS components as your workflow demands.
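If you do go with Lambda as the connector, the piece you say you are missing (Lambda calling Athena) is just an SDK call from the handler. Here is a minimal sketch using the AWS SDK for Java; the database name, query and results bucket are placeholders you would swap for your own:

    import com.amazonaws.services.athena.AmazonAthena;
    import com.amazonaws.services.athena.AmazonAthenaClientBuilder;
    import com.amazonaws.services.athena.model.QueryExecutionContext;
    import com.amazonaws.services.athena.model.ResultConfiguration;
    import com.amazonaws.services.athena.model.StartQueryExecutionRequest;
    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;

    // Lambda handler that kicks off an Athena query over the data already landed in S3.
    public class AthenaStepHandler implements RequestHandler<Object, String> {

        @Override
        public String handleRequest(Object input, Context context) {
            AmazonAthena athena = AmazonAthenaClientBuilder.defaultClient();

            StartQueryExecutionRequest request = new StartQueryExecutionRequest()
                    // placeholder database and query -- replace with your own
                    .withQueryExecutionContext(new QueryExecutionContext().withDatabase("my_poc_db"))
                    .withQueryString("SELECT * FROM my_table LIMIT 10")
                    // Athena writes results here; the bucket is a placeholder
                    .withResultConfiguration(new ResultConfiguration()
                            .withOutputLocation("s3://my-athena-results-bucket/poc/"));

            // Returns immediately; poll GetQueryExecution if the next step needs to wait for completion.
            return athena.startQueryExecution(request).getQueryExecutionId();
        }
    }

You can then chain Lambdas like this with Step Functions, or let the CloudFormation/Terraform stack wire up the schedule and permissions between the steps.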
I am looking to trigger code every hour in AWS.
The code should: parse through a list of zip codes, fetch data for each zip code, and store that data somewhere in AWS.
Is there a specific AWS service I would use to parse through the list of zip codes and call the API for each zip code? Would this be Lambda?
How could I schedule this service to run every X hours? Do I have to use another AWS service to call my Lambda function (assuming that's the right answer to #1)?
Which AWS service could I use to store this data?
I tried looking up different approaches and services in AWS. I found that I could write serverless code in Lambda, which made me think it would be the answer to my first question. Then I tried to look into how that could be run every X hours, but that's where I struggled to work out whether I could still use Lambda for that. Then I needed to know what my options were for storing the data. I saw that Glue may be an option, but wasn't sure.
Yes, you can use Lambda to run your code (as long as the total run time is less than 15 minutes).
You can use Amazon EventBridge Scheduler to trigger the Lambda every 1 hour.
Which AWS service could I use to store this data?
That depends on the format of the data and how you will subsequently use it. Some options are:
Amazon DynamoDB for key-value, NoSQL data
Amazon Aurora for relational data
Amazon S3 for object storage
If you choose S3, you can still run SQL-like queries on the data using Amazon Athena.
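To make that concrete, here is a rough sketch of the kind of handler Amazon EventBridge Scheduler would invoke every hour, written with the AWS SDK for Java; the zip-code list, the external API endpoint and the S3 bucket name are made-up placeholders:

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    // Invoked by an EventBridge Scheduler schedule such as rate(1 hour). Uses Java 11's HttpClient.
    public class ZipCodeFetchHandler implements RequestHandler<Object, String> {

        private static final List<String> ZIP_CODES = List.of("10001", "94105", "60601"); // placeholder list
        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        private final HttpClient http = HttpClient.newHttpClient();

        @Override
        public String handleRequest(Object input, Context context) {
            for (String zip : ZIP_CODES) {
                try {
                    // Hypothetical external API -- replace with the real endpoint you are calling.
                    HttpRequest request = HttpRequest.newBuilder()
                            .uri(URI.create("https://api.example.com/data?zip=" + zip))
                            .build();
                    String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();

                    // Store one object per zip code per run; the bucket name is a placeholder.
                    String key = "zip-data/" + zip + "/" + System.currentTimeMillis() + ".json";
                    s3.putObject("my-zip-data-bucket", key, body);
                } catch (Exception e) {
                    context.getLogger().log("Failed to fetch/store zip " + zip + ": " + e.getMessage());
                }
            }
            return "done";
        }
    }
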
I am trying to create a job script using Java. In the AWS Glue console I can find only "Python, Spark", so does that mean we can't write the script in Java at all? If so, then what is this API used for: aws-java-sdk-glue
I even found an example: https://stackoverflow.com/questions/48256281/how-to-read-aws-glue-data-catalog-table-schemas-programmatically
From the above, it seems like we can write AWS Glue scripts in Java too. Can anyone please confirm this?
EDIT:
In Scala, we write: glueContext.getCatalogSource(database = "my_data_base", tableName = "my_table")
In Java, I found the class below, which has the methods withDatabaseName and withTableName:
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/glue/model/CatalogEntry.html
So may I know what the purpose of that class is?
The language option you see on the Glue console is for the script/code you will write to extract, transform and load the actual data that needs to be processed. The source can be a database or an S3 bucket, and the destination can be anything, depending on your use case.
Normally you create a Glue job or an S3 bucket from the AWS Management Console; when you don't want to do this manually, you need an SDK, which has the API call definitions you use to create AWS resources.
So the script inside a Glue job can be written only in Python or Scala, but when it comes to creating a Glue job you can use different languages/SDKs:
Java - https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/glue/AWSGlueClient.html
Python - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html
JavaScript - https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Glue.html
Ruby - https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/Glue/Client.html
All of the above are SDKs used to define resources in AWS, whereas the link below has the actual code used inside a Glue job:
https://github.com/aws-samples/aws-glue-samples
Java is not supported for the actual script definition of AWS Glue jobs.
The API you are referring to is the AWS SDK, which allows you to create and manage AWS Glue resources such as creating/running crawlers, viewing and managing the Glue catalog, creating job definitions, and so on.
So you can manage resources in the Glue service with the AWS SDK for Java, similar to how you manage resources in EC2, S3, and RDS with the AWS SDK for Java.
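For example, the use case from the question you linked (reading a catalog table's schema programmatically) looks roughly like this with aws-java-sdk-glue; the database and table names are placeholders:

    import com.amazonaws.services.glue.AWSGlue;
    import com.amazonaws.services.glue.AWSGlueClientBuilder;
    import com.amazonaws.services.glue.model.Column;
    import com.amazonaws.services.glue.model.GetTableRequest;
    import com.amazonaws.services.glue.model.Table;

    public class GlueCatalogExample {
        public static void main(String[] args) {
            AWSGlue glue = AWSGlueClientBuilder.defaultClient();

            // Look up a table in the Glue Data Catalog (names are placeholders).
            GetTableRequest request = new GetTableRequest()
                    .withDatabaseName("my_data_base")
                    .withName("my_table");
            Table table = glue.getTable(request).getTable();

            // Print the schema -- this inspects/manages Glue resources; it does not run the job's ETL script.
            for (Column column : table.getStorageDescriptor().getColumns()) {
                System.out.println(column.getName() + " : " + column.getType());
            }
        }
    }
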
I'm building a system that has a web service (AWS API Gateway + AWS Lambda + AWS RDS Aurora MySQL) fully integrated with a CI/CD pipeline (AWS CodePipeline) triggered by a Git webhook. I have a template that provisions the gateway, the Lambda and the RDS cluster. Additionally, I have a custom resource in my template that creates the database and the tables (no data ingestion for now).
Regarding the architecture previously mentioned, here I have a couple of questions:
In this scenario, is a custom resource for creating the schema the best approach according to standards?
Regarding data ingestion and schema updates, is it good practice to manage this within the pipeline, or is it better to do it outside (running incremental scripts manually)?
In case you manage schema changes within the pipeline process... how do you achieve that?
Thanks
For creating the initial schema, at this time the best choice is, as you said, a custom resource.
Regarding data ingestion/schema updates: if you're managing them under version control, then having some kind of pipeline is definitely the correct way to go; however, the difficulty lies in rollback scenarios (especially with data manipulation).
You could either use a pure Lambda action within CodePipeline (including functionality to test and roll back your changes) or integrate the Lambda function with a third-party solution for managing rolling updates to your SQL schema.
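As a rough sketch of the pure-Lambda-action option, the function below applies an incremental change over JDBC and then reports the result back to CodePipeline. The MySQL driver on the classpath, the environment variables, and the ALTER statement are all assumptions for illustration:

    import com.amazonaws.services.codepipeline.AWSCodePipeline;
    import com.amazonaws.services.codepipeline.AWSCodePipelineClientBuilder;
    import com.amazonaws.services.codepipeline.model.FailureDetails;
    import com.amazonaws.services.codepipeline.model.PutJobFailureResultRequest;
    import com.amazonaws.services.codepipeline.model.PutJobSuccessResultRequest;
    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Map;

    // Lambda invoked as a CodePipeline action: applies an incremental schema change, then reports back.
    public class SchemaMigrationAction implements RequestHandler<Map<String, Object>, Void> {

        @Override
        @SuppressWarnings("unchecked")
        public Void handleRequest(Map<String, Object> event, Context context) {
            Map<String, Object> job = (Map<String, Object>) event.get("CodePipeline.job");
            String jobId = (String) job.get("id");
            AWSCodePipeline codePipeline = AWSCodePipelineClientBuilder.defaultClient();

            // Connection details come from environment variables on the function (assumption).
            String url = System.getenv("DB_JDBC_URL"); // e.g. jdbc:mysql://<cluster-endpoint>:3306/mydb
            String user = System.getenv("DB_USER");
            String password = System.getenv("DB_PASSWORD");

            try (Connection conn = DriverManager.getConnection(url, user, password);
                 Statement stmt = conn.createStatement()) {
                // Placeholder incremental change; in practice you would read versioned scripts from the build artifact.
                stmt.execute("ALTER TABLE customers ADD COLUMN loyalty_tier VARCHAR(20)");
                codePipeline.putJobSuccessResult(new PutJobSuccessResultRequest().withJobId(jobId));
            } catch (Exception e) {
                codePipeline.putJobFailureResult(new PutJobFailureResultRequest()
                        .withJobId(jobId)
                        .withFailureDetails(new FailureDetails()
                                .withType("JobFailed")
                                .withMessage(e.getMessage())));
            }
            return null;
        }
    }
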
I am working on a problem where we intend to perform multiple transformations on data using EMR (SparkSQL).
After going through the documentation of AWS Data Pipeline and AWS Step Functions, I am slightly confused as to what use case each one tries to solve. I looked around but did not find an authoritative comparison between the two. There are multiple resources that show how I can use either of them to schedule and trigger Spark jobs on an EMR cluster.
Which one should I use for scheduling and orchestrating my processing EMR jobs?
More generally, in what situation would one be a better choice over the other as far as ETL/data processing is concerned?
Yes, there are many ways to achieve the same thing, and the difference is in the details and in your use case. I am even going to offer one more alternative :)
If you are doing a sequence of transformations and all of them run on one EMR cluster, maybe all you need is to create the cluster with the steps defined up front, or to submit the steps to the cluster via the API (see the sketch at the end of this answer). Steps execute in order on your cluster.
If you have different sources of data, or you want to handle more complex scenarios, then both AWS Data Pipeline and AWS Step Functions would work. AWS Step Functions is a generic way of implementing workflows, while Data Pipeline is a workflow specialized for working with data.
That means Data Pipeline is better integrated when it comes to dealing with data sources and outputs, and working directly with tools like S3, EMR, DynamoDB, Redshift, or RDS. So for a pure data pipeline problem, chances are AWS Data Pipeline is the better candidate.
Having said so, AWS Data Pipeline is not very flexible. If the data source you need is not supported, or if you want to execute some activity which is not integrated, then you need to hack your way around with shell scripts.
On the other hand, AWS Step Functions is not specialized; it integrates well with several AWS services and with AWS Lambda, meaning you can easily integrate with almost anything via serverless APIs.
So it really depends on what you need to achieve and the type of workload you have.
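To illustrate the first option (a plain sequence of steps on a single EMR cluster), submitting Spark steps through the EMR API looks roughly like this with the AWS SDK for Java; the cluster id and script locations are placeholders:

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class SubmitEmrSteps {
        public static void main(String[] args) {
            AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

            // Each transformation becomes one step; steps run in order on the cluster.
            StepConfig firstTransform = new StepConfig()
                    .withName("transform-1")
                    .withActionOnFailure("TERMINATE_CLUSTER")
                    .withHadoopJarStep(new HadoopJarStepConfig()
                            .withJar("command-runner.jar")
                            .withArgs("spark-submit", "--deploy-mode", "cluster",
                                      "s3://my-bucket/jobs/transform1.py")); // placeholder script location

            StepConfig secondTransform = new StepConfig()
                    .withName("transform-2")
                    .withActionOnFailure("TERMINATE_CLUSTER")
                    .withHadoopJarStep(new HadoopJarStepConfig()
                            .withJar("command-runner.jar")
                            .withArgs("spark-submit", "--deploy-mode", "cluster",
                                      "s3://my-bucket/jobs/transform2.py")); // placeholder script location

            emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                    .withJobFlowId("j-XXXXXXXXXXXXX") // placeholder cluster id
                    .withSteps(firstTransform, secondTransform));
        }
    }
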
I'm trying to implement what I think is a very simple process, but I don't really know what the best approach is.
I want to read a big CSV file (around 30 GB) from S3, apply some transformations, and load it into RDS MySQL, and I want this process to be repeatable.
I thought the best approach was AWS Data Pipeline, but I've found that this service is designed more for loading data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I found Coursera's dataduct wrapper, but after some research it seems that the project has been abandoned (the last commit was a year ago).
So I don't know if I should keep trying with AWS Data Pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know if they're any simpler.
Then I saw a video about AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little bit confused; can anyone enlighten me?
Thanks in advance
If you are trying to get the data into RDS so you can query it, there are other options that do not require moving it from S3 into RDS in order to run SQL-like queries.
You can now use Redshift Spectrum to read and query information from S3.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables
Step 1: Create an IAM Role for Amazon Redshift
Step 2: Associate the IAM Role with Your Cluster
Step 3: Create an External Schema and an External Table
Step 4: Query Your Data in Amazon S3
Or you can use Athena to query the data in S3 as well, if Redshift is more horsepower than the job needs.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
You could use an ETL tool to do the transformations on your CSV data and then load it into your RDS database. There are a number of open-source tools that do not require large licensing costs. That way you can pull the data into the tool, do your transformations, and then the tool will load the data into your MySQL database. For example, there are Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages) and has JDBC/ODBC-compliant drivers. With it you could create a script that performs your transformations and then loads the data into your MySQL database. And you would be using familiar SQL (I'm assuming you can already write SQL scripts), so there isn't a big learning curve.
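If you ultimately decide to script the transform-and-load yourself instead of adopting one of these tools, the rough shape in plain Java (streaming the CSV from S3 and batch-inserting over JDBC) is sketched below; the bucket, key, table, columns and JDBC settings are placeholders, and a 30 GB file would realistically need to be split up or run somewhere with enough time and bandwidth:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class CsvS3ToMySqlLoader {
        public static void main(String[] args) throws Exception {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            String jdbcUrl = "jdbc:mysql://my-rds-endpoint:3306/mydb"; // placeholder

            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                         s3.getObject("my-bucket", "exports/big-file.csv").getObjectContent(), // placeholders
                         StandardCharsets.UTF_8));
                 Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
                 PreparedStatement insert = conn.prepareStatement(
                         "INSERT INTO target_table (col_a, col_b) VALUES (?, ?)")) { // placeholder table/columns

                reader.readLine(); // skip the header row
                String line;
                int batched = 0;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",");            // naive split; a real CSV parser handles quoting
                    insert.setString(1, fields[0].trim());        // placeholder transformation
                    insert.setString(2, fields[1].toUpperCase()); // another placeholder transformation
                    insert.addBatch();
                    if (++batched % 1000 == 0) {
                        insert.executeBatch(); // flush in batches to keep memory use and round-trips down
                    }
                }
                insert.executeBatch();
            }
        }
    }
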