Can we write a script in Java for AWS Glue - amazon-web-services

I am trying to create a job script using Java. In the AWS Glue console, I can only find "Python, Spark", so does that mean we can't write a script using Java at all? If so, then what is this API used for: aws-java-sdk-glue?
I even found an example: https://stackoverflow.com/questions/48256281/how-to-read-aws-glue-data-catalog-table-schemas-programmatically
From the above, it seems we can write an AWS Glue script in Java too. Can anyone please confirm this?
EDIT:
In Scala, we write: glueContext.getCatalogSource(database = "my_data_base", tableName = "my_table")
In Java, I found the class below, which has the methods withDatabaseName and withTableName:
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/glue/model/CatalogEntry.html
So may I know what the purpose of the above class is?

The language option that you see on the Glue console is for the script/code that you will write to extract, transform and load the actual data that needs to be processed. The source can be a database or an S3 bucket, and the destination can be anything depending on your use case.
Normally you can create a Glue job or an S3 bucket from the AWS Management Console, and when you don't want to do this manually you need an SDK, which has the API call definitions that you use to create AWS resources.
So the script inside a Glue job can be written only in Python or Scala, but when it comes to creating a Glue job you can use different languages/SDKs.
Java - https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/glue/AWSGlueClient.html
Python - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html
JavaScript - https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Glue.html
Ruby - https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/Glue/Client.html
All of the above are SDKs used to define resources in AWS, whereas the link below has the actual code used inside a Glue job.
https://github.com/aws-samples/aws-glue-samples
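For a concrete feel of what that "actual code used inside a Glue job" looks like, here is a minimal sketch of a Glue job script in Python (PySpark). It reads the same catalog table named in the question; the output bucket is a hypothetical placeholder.

```python
# Minimal sketch of the script that runs *inside* a Glue job (Python/PySpark).
# The output bucket is a hypothetical placeholder.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from the Glue Data Catalog (the Python equivalent of getCatalogSource in Scala)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_data_base", table_name="my_table")

# Write the (optionally transformed) data out to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/output/"},
    format="parquet")

job.commit()
```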

Java is not supported for the actual script definition of AWS Glue jobs.
The API that you are referring to is the AWS SDK, which allows you to create and manage AWS Glue resources, such as creating/running crawlers, viewing and managing the Glue catalog, creating job definitions, etc.
So you can manage resources in the Glue service with the AWS SDK for Java, similar to how you manage resources in EC2, S3 and RDS with the AWS SDK for Java.
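To make the distinction concrete, here is a minimal sketch of using the SDK to define and run a Glue job. It is shown in Python (boto3), but the Java SDK's AWSGlueClient exposes the equivalent CreateJob and StartJobRun operations. The job name, role ARN and script location are hypothetical placeholders.

```python
# Minimal sketch: managing Glue *resources* with the SDK (boto3 shown here;
# AWSGlueClient in the Java SDK offers the same createJob/startJobRun calls).
# Job name, role ARN and script location are hypothetical placeholders.
import boto3

glue = boto3.client('glue', region_name='us-east-1')

# Create a job definition whose actual script (Python/Scala) lives in S3
glue.create_job(
    Name='my-etl-job',
    Role='arn:aws:iam::123456789012:role/MyGlueServiceRole',
    Command={
        'Name': 'glueetl',  # Spark ETL job
        'ScriptLocation': 's3://my-scripts-bucket/jobs/my_etl_script.py',
        'PythonVersion': '3',
    },
    GlueVersion='3.0',
)

# Kick off a run of that job
response = glue.start_job_run(JobName='my-etl-job')
print(response['JobRunId'])
```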

Related

How to automate cleaning of data in AWS using Jupyter Notebook

I have a Jupyter Notebook file that cleans the data file (.csv) in S3. The cleaning process is taken care of...
However, I want to be able to automatically apply this cleaning process to every file that is uploaded to the S3 bucket. Each file will have exactly the same data format. I am thinking maybe of using AWS Glue, but I'm not sure where to start. If we could skip the upload to S3 and go straight into Glue, that would be interesting to explore...
The end goal is to load the clean data in Quick Sight and also AWS Sage Maker for ML applications.
Any advice on how to approach this?
Thanks
A simple AWS Lambda function with an AWS EventBridge rule can do this. Check out this link: https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html
Use the above link to set up the cron schedule or event (whatever fits your use case), write your S3 clean-up code in the Lambda function (Lambda supports multiple programming languages), and then let the Lambda take care of it.
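As one possible shape for this, here is a minimal sketch of a Python Lambda handler triggered by S3 ObjectCreated events; the destination bucket name and the clean() function are hypothetical placeholders for whatever the notebook currently does.

```python
# Minimal sketch of a Lambda handler triggered by S3 ObjectCreated events.
# The destination bucket name and the clean() logic are hypothetical placeholders.
import csv
import io
import boto3

s3 = boto3.client('s3')
CLEAN_BUCKET = 'my-clean-data-bucket'  # hypothetical destination bucket


def clean(rows):
    """Placeholder for the cleaning logic currently in the notebook."""
    return [row for row in rows if any(field.strip() for field in row)]


def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # Read the uploaded CSV from the source bucket
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
        rows = list(csv.reader(io.StringIO(body)))

        # Apply the cleaning step and write the result to the clean bucket
        out = io.StringIO()
        csv.writer(out).writerows(clean(rows))
        s3.put_object(Bucket=CLEAN_BUCKET, Key=key, Body=out.getvalue())
```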

AWS: extract tar.gz on S3

As I'm new to AWS and a little confused by all the similar services, I would like some leads and to know if I am heading in the right direction.
I have tar.gz archives stored in AWS S3 Glacier Deep Archive. I would like that, when a restore is requested, the archive is automatically extracted and the folders and files it contains are put in S3 (with an expiration date).
These archives are too big to be extracted via Lambda (300 GB or more).
My idea would be to trigger a Lambda function when the restore is complete and use that Lambda function to start another AWS service that does the extraction. I was thinking of either AWS Batch or Fargate. Which service do you think is the most suitable? For this kind of simple task, is it preferable to use an ARM architecture?
If someone has already done this before and has code to share, I'm interested (if not, I'll try to put my final solution here for others).
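As a starting point for the Lambda piece of the approach described above, here is a minimal sketch assuming an s3:ObjectRestore:Completed trigger and a pre-existing AWS Batch job queue and job definition (both names hypothetical) that perform the actual extraction.

```python
# Minimal sketch: Lambda triggered by s3:ObjectRestore:Completed that hands the
# extraction off to AWS Batch. The job queue and job definition names are
# hypothetical; the Batch container would stream the tar.gz from S3 and extract it.
import re
import boto3

batch = boto3.client('batch')


def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # Batch job names only allow letters, numbers, hyphens and underscores
        safe_name = 'extract-' + re.sub(r'[^A-Za-z0-9_-]', '-', key)[:100]

        batch.submit_job(
            jobName=safe_name,
            jobQueue='archive-extraction-queue',  # hypothetical queue
            jobDefinition='extract-targz',        # hypothetical job definition
            containerOverrides={
                'command': ['python', 'extract.py',
                            '--bucket', bucket, '--key', key],
            },
        )
```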

What is the most optimal way to automate data (csv file) transfer from s3 to Redshift without AWS Pipeline?

I am trying to take SQL data stored in a CSV file in an S3 bucket, transfer the data to AWS Redshift, and automate that process. Would writing ETL scripts with Lambda/Glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the most optimal way to pipeline data from S3 to Redshift?
I tried using AWS Data Pipeline, but it is not available in my region. I also tried the AWS documentation for Lambda and Glue, but I don't know where to find the exact solution to the problem.
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (e.g. psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
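A minimal sketch of such a Lambda function is below; the cluster endpoint, credentials, table name and IAM role are hypothetical placeholders, and psycopg2 must be packaged with the function (for example as a Lambda layer).

```python
# Minimal sketch: Lambda that connects to Redshift and issues a COPY command.
# Cluster endpoint, credentials, table name and IAM role ARN are hypothetical
# placeholders; psycopg2 must be packaged with the function (e.g. as a layer).
import psycopg2

COPY_SQL = """
    COPY my_schema.my_table
    FROM 's3://my-input-bucket/incoming/data.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""


def lambda_handler(event, context):
    # With the S3 Events option, you would build the FROM path from
    # event['Records'][0]['s3'] instead of hard-coding it.
    conn = psycopg2.connect(
        host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
        port=5439,
        dbname='mydb',
        user='copy_user',
        password='***',  # better fetched from Secrets Manager
    )
    try:
        with conn.cursor() as cur:
            cur.execute(COPY_SQL)
        conn.commit()
    finally:
        conn.close()
```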
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simple Python-based christianhxc/aws-lambda-redshift-copy: an AWS Lambda function that runs the COPY command into Redshift
A more fully-featured Node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog

Best way to develop application with multiple instances in AWS

I am working on an AWS POC that uses different AWS components; below are the details of each individual component.
1. A Java function that generates data; I am calling it from a Lambda function through a CloudWatch scheduler.
2. Data Pipeline to copy data from RDS to S3.
3. Hive scripts run using Athena over the S3 data.
4. QuickSight for visualization.
I am done creating the individual components but am not able to understand what the best way would be to connect all these components so they can run in one go.
One thought is to use Lambda as a connector for each step, but I have no template to connect Lambda with Athena.
Can anyone kindly suggest the best way to connect all the above components so that they can run in one go?
I am not familiar with Hive scripts or QuickSight, but a CloudFormation stack or a Terraform stack should help you connect the various AWS components as your workflow demands.

Automation of on-demand AWS EMR cluster - Using Python (boto3) over AWS CLI

We are in the process of automating the launch of on-demand EMR clusters. This will be triggered upon the arrival of certain files in AWS S3. In this regard, we are evaluating two options:
1. A shell script that will invoke the AWS CLI to launch the desired EMR cluster
2. A Python script that will invoke methods for EMR start/stop using boto3
Is there any preference for using one option over the other?
The former appears easier, as we can take the CLI commands from the manually created EMR clusters in the AWS console and package them into a shell script, while the latter option has intricacies, doesn't have such a starting point, and the methods would have to be written from scratch.
Appreciate your inputs in this regard.
While both can achieve what you want, I would suggest going with Lambda (Python).
Create an event trigger on the S3 location where data is expected - this will invoke your Lambda (Python code), and the Lambda can in turn launch your EMR cluster.
S3 -> Lambda -> EMR
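A minimal sketch of that Lambda is below; the EMR release, instance types, roles and the Spark step script are hypothetical placeholders.

```python
# Minimal sketch: S3-triggered Lambda that launches an on-demand EMR cluster
# with boto3. Release label, instance types, roles and the step script are
# hypothetical placeholders.
import boto3

emr = boto3.client('emr', region_name='us-east-1')


def lambda_handler(event, context):
    response = emr.run_job_flow(
        Name='on-demand-cluster',
        ReleaseLabel='emr-6.9.0',
        LogUri='s3://my-emr-logs/',
        Instances={
            'MasterInstanceType': 'm5.xlarge',
            'SlaveInstanceType': 'm5.xlarge',
            'InstanceCount': 3,
            'KeepJobFlowAliveWhenNoSteps': False,  # terminate when steps finish
        },
        Steps=[{
            'Name': 'process-new-file',
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit', 's3://my-scripts/process.py'],
            },
        }],
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
    )
    return response['JobFlowId']
```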
Another option could be to trigger a Data Pipeline from Lambda, which will create the EMR cluster for you.
S3 -> Lambda -> Data Pipeline -> EMR
Advantages of using Data Pipeline vs. Lambda to create the EMR cluster:
GUI-based: you can pick and choose the components needed, like resources, activities, schedules, etc.
Minimal Python: in the Lambda you just configure the pipeline to be triggered; you don't need to implement error handling, retries, success or failure emails, etc. All of this is built into the pipelines (see the sketch after the link below).
Flexible: since the pipeline components are modular and configurable, you can change any configuration quickly, whereas code changes often take more time.
You can read more about it here - https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
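To illustrate the "minimal Python" point, here is a sketch of the Lambda in the second option, which only has to activate a pre-built pipeline; the pipeline ID is a hypothetical placeholder.

```python
# Minimal sketch: Lambda that only activates a pre-defined Data Pipeline, which
# then handles the EMR cluster creation, retries, notifications, etc.
# The pipeline ID is a hypothetical placeholder.
import boto3

datapipeline = boto3.client('datapipeline')


def lambda_handler(event, context):
    datapipeline.activate_pipeline(pipelineId='df-0123456789ABCDEF')
```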