How to differentiate Glue and Athena for running Apache Spark in AWS? - amazon-web-services

In November 2022, Amazon Athena started supporting Apache Spark.
https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-athena-now-supports-apache-spark/?nc1=h_ls
I think it looks very similar to Glue.
How can we differentiate Athena and Glue when using serverless Spark on AWS?
Thanks in advance.
I looked for information comparing the two on the internet, but could not find any.
I would like to decide which to use depending on the situation, especially for streaming processes.

Related

RDS (dynamic schema) -> AWS OpenSearch by using AWS Glue

I am using AWS RDS (MySQL) and I would like to sync this data to AWS Elasticsearch in real time.
I am thinking that the best solution for this is AWS Glue, but I am not sure whether it can achieve what I want.
This is the information for my RDS database:
■ RDS
・I would like to sync several tables (MySQL) to OpenSearch (1 table to 1 index).
・The schema of the tables will change dynamically.
・New columns may be added or existing columns removed since the previous sync
(so I also have to sync these schema changes).
Could you tell me roughly whether I could do these things with AWS Glue?
I wonder if AWS Glue can deal with dynamic schema changes and syncing in (near) real time.
Thank you in advance.
Glue now has an OpenSearch connector, but Glue is an ETL tool that does batch-style operations very well; event-based or very frequent loads into Elasticsearch might not be the best fit, and the cost can also be high.
https://docs.aws.amazon.com/glue/latest/ug/tutorial-elastisearch-connector.html
DMS can help, but not completely, since, as you mentioned, the schema keeps changing.
Logstash solution
Since Elasticsearch 1.5, Logstash has had a jdbc input plugin that can sync MySQL data into Elasticsearch.
AWS native solution
You can have a Lambda function fire on a MySQL event: Invoking a Lambda function from an Amazon Aurora MySQL DB cluster.
The Lambda will write JSON to Kinesis Data Firehose, and Firehose can load it into OpenSearch.
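For illustration only, here is a minimal sketch of that Lambda (not from the original answer). It assumes the Aurora MySQL trigger passes the row change as the event payload via lambda_async, and that a Firehose delivery stream with an OpenSearch destination already exists; the stream name and event shape are hypothetical.

    import json
    import boto3

    firehose = boto3.client("firehose")
    STREAM_NAME = "mysql-to-opensearch"  # hypothetical delivery stream name

    def lambda_handler(event, context):
        # `event` is assumed to be the JSON the MySQL trigger passes to lambda_async().
        # Forward it to Firehose as a newline-delimited JSON record; Firehose then
        # delivers it to the configured OpenSearch index.
        firehose.put_record(
            DeliveryStreamName=STREAM_NAME,
            Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
        )
        return {"status": "ok"}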

Calling a stored procedure in Redshift from AWS Glue

Is there an easier way than using Lambda/Boto to call a stored procedure in Redshift from AWS Glue?
I have everything set up in a Glue job and need to call a stored procedure in Redshift from the Spark script. I have a connection to Redshift made in Glue.
This question does not have the answer: Calling stored procedure from aws Glue Script
Please share any guidance on this.
Thank you.
You can do that using py4j and Java JDBC. The best part is that you do not even have to install anything, as Glue comes with JDBC connectors for many supported databases.
How to run arbitrary / DDL SQL statements or stored procedures using AWS Glue
You can zip pg8000 as an additional library and use it to establish a connection to Redshift, then trigger the stored procedure; see the sketch after this answer.
It is not possible to trigger a stored procedure through the Spark JDBC data source.
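For illustration, a rough sketch of the pg8000 approach as it might look inside the Glue job's Python script (this is not from the original answer). The endpoint, credentials, and procedure name are placeholders; pg8000 has to be supplied to the job as an additional Python library, and in practice the credentials would come from the Glue connection or Secrets Manager.

    import pg8000

    # Hypothetical Redshift endpoint and credentials
    conn = pg8000.connect(
        host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
        port=5439,
        database="dev",
        user="awsuser",
        password="REPLACE_ME",
    )
    cur = conn.cursor()
    # Call a hypothetical stored procedure; procedures that COMMIT internally
    # may need to run outside an explicit transaction.
    cur.execute("CALL my_schema.refresh_reporting()")
    conn.commit()
    cur.close()
    conn.close()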

What is drawback of AWS Glue Data Catalog?

What is the major drawback of the AWS Glue Data Catalog? I was asked this in an interview.
That could be answered in a number of ways depending on the wider context. For example:
It's an AWS managed service, so using it locks you into the AWS ecosystem (instead of using a standalone Hive metastore, for example).
It's limited to the data sources supported by the Glue Data Catalog.
It doesn't integrate with third-party authentication and authorisation tools like Apache Ranger (as far as I am aware).

AWS: best approach to process data from S3 to RDS

I'm trying to implement what I think is a very simple process, but I don't really know what the best approach is.
I want to read a big CSV file (around 30 GB) from S3, make some transformations, and load it into RDS MySQL, and I want this process to be replicable.
I thought the best approach was AWS Data Pipeline, but I've found that this service is designed more for loading data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I found Coursera's dataduct wrapper, but after some research it seems that the project has been abandoned (the last commit was one year ago).
So I don't know if I should keep trying with AWS Data Pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know if they are simpler.
Then I saw a video about AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little bit confused; can anyone enlighten me?
Thanks in advance
If you are trying to get the data into RDS so you can query it, there are other options that do not require moving the data from S3 to RDS in order to run SQL-like queries.
You can use Redshift spectrum to read and query information from S3 now.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables
Step 1: Create an IAM Role for Amazon Redshift
Step 2: Associate the IAM Role with Your Cluster
Step 3: Create an External Schema and an External Table
Step 4: Query Your Data in Amazon S3 (a rough sketch of Steps 3 and 4 follows below)
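For illustration (not part of the original answer), here is a minimal sketch of what Steps 3 and 4 can look like when driven from Python through boto3's Redshift Data API. The cluster identifier, database, IAM role ARN, S3 location, and table names are all hypothetical placeholders.

    import boto3

    client = boto3.client("redshift-data")

    # Step 3: register an external schema backed by the Glue Data Catalog
    ddl = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
    FROM DATA CATALOG DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
    """
    client.execute_statement(
        ClusterIdentifier="my-cluster",   # hypothetical cluster
        Database="dev",
        DbUser="awsuser",
        Sql=ddl,
    )

    # Step 4: once an external table (e.g. spectrum_schema.sales pointing at
    # s3://my-bucket/sales/) has been created the same way, query it like any
    # Redshift table:
    client.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql="SELECT count(*) FROM spectrum_schema.sales",
    )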
Or you can use Athena to query the data in S3 as well, if Redshift is too much horsepower for the job at hand.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
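As a rough sketch of that option (again, not from the original answer): once the CSV prefix has been described as a table, for example with a Glue crawler, a query can be submitted from Python with boto3. The database, table, and bucket names below are hypothetical.

    import boto3

    athena = boto3.client("athena")

    resp = athena.start_query_execution(
        QueryString="SELECT col_a, count(*) FROM my_db.my_csv_table GROUP BY col_a",
        QueryExecutionContext={"Database": "my_db"},           # hypothetical database
        ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/athena/"},
    )
    # Poll get_query_execution(QueryExecutionId=...) until the query finishes,
    # then read the results from the output location or via get_query_results().
    print(resp["QueryExecutionId"])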
You could use an ETL tool to do the transformations on your CSV data and then load it into your RDS database. There are a number of open-source tools that do not require large licensing costs. That way you can pull the data into the tool, do your transformations, and then the tool will load the data into your MySQL database. For example, there are Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages) and has JDBC/ODBC compliant drivers. With this you could create a script that performs your transformations and then loads the data into your MySQL database. And you would be using familiar SQL (I'm assuming you can already create SQL scripts), so there isn't a big learning curve.

ETL onto EMR for Impala

We have an EMR cluster running Impala.
We have lots of data in DynamoDB and S3.
What is the best/recommended way of getting data from DynamoDB into our EMR cluster's HDFS (so that I can get it into Impala afterwards)? Should I write a Python script that imports boto and some HDFS library to do it, should I learn Pig directly, or is there a better solution?
My recommendation would be to take on a small learning curve and get familiar with AWS Data Pipeline. By itself it is a very good service; the best thing is that it is fully managed and interoperates really well.
So without involving an additional third-party ETL tool suite, and by extension without running additional EC2 instances, you get to link, schedule, and transfer data from DynamoDB to EMR.
This link has the necessary information in bits and pieces, but you can pick up ideas from here and there and create your DynamoDB-to-EMR link: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-part2.html
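For illustration only (not the Data Pipeline approach the answer recommends), here is a minimal boto3 sketch of the raw export step that the managed DynamoDB-to-S3 template automates: scan a table and stage it in S3 as JSON lines, from where EMR can load it. The table and bucket names are hypothetical, and a full-table Scan like this is only practical for small tables.

    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")

    table = dynamodb.Table("my_table")          # hypothetical DynamoDB table
    lines, kwargs = [], {}
    while True:
        page = table.scan(**kwargs)
        # default=str handles Decimal and other non-JSON-native DynamoDB types
        lines.extend(json.dumps(item, default=str) for item in page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    s3.put_object(
        Bucket="my-staging-bucket",             # hypothetical staging bucket
        Key="dynamodb/my_table/export.json",
        Body="\n".join(lines).encode("utf-8"),
    )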
I use Alteryx for ETL. I would recommend using it. It has a pretty cool analytics package as well.