JDBC Connectivity for Data Source in AWS Glue - amazon-web-services

We are trying to bring oracle table catalogs into glue . But unable to read the data from source
We have tried to give all the possibilities in include path parameter but unable to bring the data
Anyone tried oracle as JDBC data store for AWS Glue ? Please help us to fix the issue

Related

Migrate and Transform MySQL rows from one table to another

I need to migrate all records 3 billions from one MySQL Aurora table to 5 different tables in same cluster .
There are transformation of 2 columns is also has to happen .
So when we migrate we need to convert xml to json and then json will be stored in one of the destination table .
We are looking for best way to migrate this data from one MySQL table to another and we are on AWS so we have flexibility to use any services which can help us achieve this .
So far this is what we have planned
MySQL TABLE ----DMS------>S3 ------LAMBDA to convert XML to JSON and create 5 types of files ---->Lambda on file create and Load data local to 5 Different MySQL table .
But one thing we would like to know how can we handle if Load data local fails in between ?So Lambda will submit the query for load data local from s3 to MySQL but how can we track in Lambda that Load data local success or failure ?
We can not use any direct way because we need to transform data in between .
Is there any better way we can use here ?
Can we use Data pipeline in place of Lambda function for load data local?
Or can we use DMS which will upload file from S3 to MySQL ?
Please suggest what can be the best way which will capability to handle failure scenario
What you are basically doing is an ETL process. I would advise you to look into either AWS EMR or AWS Glue. Since you don't seem to have that much experience, I would use Glue.
With Glue you could basically read from MySQL, do the transformation and write back directly to MySQL. Also, since Glue is running Spark in the background, you can leverage it's distributed computing, which will speed up your process instead of using a single thread lambda function.

AWS Athena Javascript SDK - Create table from query result (CTAS) - Specifiy otuput format

I am trying using the AWS JavaScript Node.JS SDK to make a query using AWS Athena and store the results in a table in AWS Glue with Parquet format (not just a CSV file)
If I am using the conosle, it is pretty simple with a CTAS query :
CREATE TABLE tablename
WITH (
external_location = 's3://bucket/tablename/',
FORMAT = 'parquet')
AS
SELECT *
FROM source
But with AWS Athena JavaScript SDK I am only able to set an output file destination using the Workgoup or Output parameters and make a basic select query, the results would output to a CSV file and would not be indexed properly in AWS Glue so it breaks a bigger process it is part of, if I try to call that query using the JavaScript SDK I get :
Table properties [FORMAT] are not supported.
I would be able to call that DDL statement using the Java SDK JDBC driver connection option.
Is anyone familiar with a solution or workaround with the Javascript SDK for Node.JS?
There is no difference between running the SQL you posted in the Athena web console, AWS SDK for JavaScript, AWS SDK for Java, or the JDBC driver, none of these will process the SQL, so if the SQL works in one of these it will work in all of them. It's only the Athena service that reads the SQL.
Check your SQL and make sure you really use the same in your code as you have tried in the web console. If they are indeed the same, the error is somewhere else in your code, so post that too.
Update the problem is the upper case FORMAT. If you paste the code you posted into the Athena web console, it bugs out and doesn't run the query, but if you run it with the CLI or an SDK you get the error you posted. You did not run the same SQL in the console as in the SDK, if you had you would have gotten the same error in both.
Use lower case format and it will work.
This is definitely a bug in Athena, these properties should not be case sensitive.

How data retrieved from metadata created tables in Glue Script

In AWS Glue, Although I read documentation, but I didn't get cleared one thing. Below is what I understood.
Regarding Crawlers: This will create a metadata table for either S3 or DynamoDB table. But what I don't understand is: how does Scala/Python script able to retrieve data from Actual Source (say DynamoDB or S3) using Metadata created tables.
val input = glueContext
.getCatalogSource(database = "my_data_base", tableName = "my_table")
.getDynamicFrame()
Does above line retrieve data from actual source via metadata tables?
I will be glad if someone can able to explain me behind the scenes of retrieving data in Glue script via metadata tables.
When you run a Glue crawler it will fetch metadata from S3 or JDBC (depends on your requirement) and creates tables in AWS Glue Data Catalog.
Now if you want to connect to this data/tables from Glue ETL job then you can do it in multiple ways depending on your requirement:
[from_options][1] : if you want to load directly from S3/JDBC with out connecting to Glue catalog.
[from_catalog][1] : If you want to load data from Glue catalog then you need to link it with catalog using getCatalogSource method as shown in your code. As the name infers it will use Glue data catalog as source and load particular table that you pass to this method.
Once it looks at your table definition which is pointed to a location then it will make a connection and load the data present in the source.
Yes you need to use getCatalogSource if you want to load tables from Glue catalog.
Does Catalog look into Crawler and refer to actual source and load data?
Check out the diagram in this [link][2] . It will give you an idea about the flow.
What if crawler deleted before I run getCatalogSource, then will I can able to load data in this case?
Crawler and Table are two different components. It all depends on when the table is deleted. If you delete the table after your job start to execute then there will not be any problem. If you delete it before execution starts then you will encounter an error.
What if my Source has lots of million of records? then will this load all records or how in this case?
It is good to have large files to be present in source so it will avoid most of the small files problem. Glue based on Spark and it will read files which can be fit in memory and then do the computations. Check this [answer][3] and [this][4] for best practices while reading larger files in AWS Glue.
[1]: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.html
[2]: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
[3]: https://stackoverflow.com/questions/46638901/how-spark-read-a-large-file-petabyte-when-file-can-not-be-fit-in-sparks-main
[4]: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/#:~:text=Incremental%20processing:%20Processing%20large%20datasets

AWS Athena: HIVE_UNKNOWN_ERROR: Unable to create input format

I've crawled a couple of XML files on S3 using AWS Glue, using a simple XML classifier:
However, when I try running any query on that data using AWS Athena, I get the following error (note that it's the simplest possible query I'm doing here):
HIVE_UNKNOWN_ERROR: Unable to create input format
Note that Athena can see my tables and it can see the columns, it just can't query them:
I noticed that there is someone with the same problem on the AWS Discussion forums: Athena XML Query Give HIVE Unknown Error but it got no love from anyone.
I know there is a similar question here about this error but the query in question targeted an RDS database, unlike an S3 bucket like I have here.
Has anyone got a solution for this?
Sadly at this time 12/2018 Athena cannot query XML input which is hard to understand when you may hear that Athena along with AWS Glue can query xml.
What output you are seeing from the AWS crawler is correct though, just not what you think its doing! For example after your crawler has run and you see the tables, but cannot execute any Athena queries. Go into your AWS Glue Catalog and at the right click tables, click your table, edit properties it will look something like this:
Notice how input format is null? If you have any other tables you can look at their properties or refer back to the input formatters documentation for Athena. This is the error you recieve.
Solutions:
convert your data to text/json/avro/other supported formats prior to upload
create a AWS glue job which converts a source to target from xml to target supported Athena format(compressed hopefully with ORC/Parquet)

How can i see metadata, lineage of data stored in AWS redshift?

I am using solutions like cloudera navigator, atlas and Wherehows
to get Hadoop, HDFS, HIVE, SQOOP, MAPREDUCE metadata and lineage.
Now we have a data warehouse in AWS redshift as well. Is there a way to extract metadata or lineage or both information out of redshift.
So far i have not found anything on this.
Is there a way to integrate the same to wherehows as a crawled solution?
I found only one post which gives some information about how to get some information from redshift assuming it will be similar to postgresql. I am sure someone would have written some open source solution to this problem.
Or is it just matter of writing a simple single script to extract this information?
I am looking for a enterprise level solution. I hope someone will point me in right direction.
AWS Glue Data catalog is a fully managed metadata management service.It has AWS Glue crawler which automatically crawls through your source(for you its redshift) and creates a centralized metadata repository which can be accessed by other AWS services.
Refer:
https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
https://aws.amazon.com/glue/
You can access metadata by querying the system tables in Redshift:
https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_system-tables.html
The system tables are on the leader node in each cluster (see this guide on the Redshift Architecture that I wrote)
Redshift deletes the content of the system tables on a rolling basis, so you need to store that data in your cluster, or another separate cluster, to get a history. With the data in the system tables, you have a baseline of information about your queries and what tables they are touching.
You can put a dashboard like Kibana or Periscope Data on top of that data to visualize it. Plaid has done a write-up of how they've built an in-house monitoring solution that has some information about data lineage:
https://blog.plaid.com/managing-your-amazon-redshift-performance-how-plaid-uses-periscope-data/
But go get true data lineage, you need to understand how queries relate to your workflows, i.e. for an Airflow DAG. To get that information, you need to "tag" your queries so you can trace them in the context of transformations / workflows, vs. looking at the individual query.
This is something we've built into our product - heads up that it's a commercial solution:
https://www.intermix.io/blog/announcing-query-insights/
Unlike the raw logs from the system tables, we give you the context of what apps / workflows are triggering queries, which users are running them, and what tables they are touching.
Lars