It seems that AWS Glue "Add Connection" can only create a connection specific to a single database, but I need to connect to all of the databases on an MS SQL Server instance. Is it possible to cover multiple databases with one AWS Glue "Add Connection", or do we need a new connection for every database?
The JDBC connection string is limited to one database at a time.
From Glue's documentation:
For JDBC to connect to the data store, a db_name in the data store is required. The db_name is used to establish a network connection with the supplied username and password. When connected, AWS Glue can access other databases in the data store to run a crawler or run an ETL job.
https://docs.aws.amazon.com/glue/latest/dg/console-connections.html?icmpid=docs_glue_console
The database name is part of the JDBC URL. Since you can have only one URL per Glue connection, a connection can point to only one database. You can, however, still use every schema under that database.
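If you do end up needing one connection per database, you can script that with boto3. This is a minimal sketch, not code from the question: the host name, database list, credentials, and connection names are all placeholders.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# One Glue connection per database, since each JDBC URL can name only one database.
for db_name in ["sales", "inventory", "hr"]:  # hypothetical database names
    glue.create_connection(
        ConnectionInput={
            "Name": f"mssql-{db_name}",
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": f"jdbc:sqlserver://mssql.example.com:1433;databaseName={db_name}",
                "USERNAME": "glue_user",
                "PASSWORD": "secret",  # in practice, better kept in Secrets Manager
            },
        }
    )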
I have a Cloud SQL managed DB, and I have a read replica attached to it.
I would like to connect BigQuery to Cloud SQL. Is it possible to connect Google BigQuery to a Cloud SQL read replica?
Yes, it is possible.
To query data residing in Cloud SQL from BigQuery, you can use federated queries, which are queries against data that does not live in BigQuery but is registered as an external data source.
To perform these queries you can use the following syntax:
SELECT * FROM EXTERNAL_QUERY(<CONNECTION_ID>, <EXTERNAL_DATABASE_QUERY>);
The CONNECTION_ID is the one given in BigQuery when creating the external data source connection with the following steps:
Go to the BigQuery console
Click on +Add Data and select External data source
A menu will appear on the right side of your window; fill in the form with the details of your Cloud SQL read replica instance
For the connection ID, choose a string you can remember, as it will be the one used in the federated queries
Create Connection
These steps will allow you to create the connection between BigQuery and Cloud SQL. Once the connection is created, you can perform federated queries to consult data from Cloud SQL instances.
The EXTERNAL_DATABASE_QUERY is the query you would have run in Cloud SQL to get this data.
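As a concrete sketch of running such a federated query from Python with the BigQuery client library: the connection ID "us.my_cloudsql_replica" and the "orders" table here are hypothetical placeholders, not values from the question.

from google.cloud import bigquery

client = bigquery.Client()

# The inner query runs against the Cloud SQL read replica; the outer query runs in BigQuery.
sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'us.my_cloudsql_replica',
  'SELECT id, total FROM orders WHERE created_at > NOW() - INTERVAL 1 DAY'
)
"""

for row in client.query(sql).result():
    print(dict(row))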
You can use Cloud SQL as an external data source in BigQuery.
I'm trying to access a database in a private subnet from an AWS Glue job script. As far as I can see in the documentation, one can create a data source using different "connection types" and appropriate "connection options", but those don't support VPC settings.
The only thing that supports VPC settings is an AWS Glue Connection, but I cannot find a way to create a Spark data source from an AWS Glue Connection.
Or maybe there is some workaround?
See step 8 in this guide: after you add your Glue JDBC connection, create a crawler to import table metadata from the source database into the AWS Glue Data Catalog.
Then you can access the table within a Glue job like this:
df = glueContext.create_dynamic_frame.from_catalog(database = "db1", table_name = "table1")
Or with Spark:
df = spark.sql("SELECT * FROM db1.table1")
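For completeness, here is a minimal sketch of a full Glue job script built around that call, assuming the same "db1"/"table1" catalog entries and a hypothetical S3 output bucket; the Glue Connection attached to the job (or its crawler) is what carries the VPC settings.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table that the crawler registered in the Data Catalog.
df = glueContext.create_dynamic_frame.from_catalog(database="db1", table_name="table1")

# Write it out to S3 as Parquet (bucket and path are placeholders).
glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
)

job.commit()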
Can we execute a SQL query inside a DMS task so that it fetches only the required data and not the whole DB?
If that is not possible, which AWS service can be used to fetch query-based data from an on-prem data source into AWS S3?
You can use filters and/or exclude fields: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TableMapping.html
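As a rough sketch of what a filtered table mapping can look like when you create the task with boto3: the schema, table, column names, and ARNs below are placeholders, and only rows matching the filter condition are replicated.

import json
import boto3

table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "orders-since-2020",
            "object-locator": {"schema-name": "dbo", "table-name": "orders"},
            "rule-action": "include",
            "filters": [
                {
                    "filter-type": "source",
                    "column-name": "order_date",
                    "filter-conditions": [
                        {"filter-operator": "gte", "value": "2020-01-01"}
                    ],
                }
            ],
        }
    ]
}

dms = boto3.client("dms")
dms.create_replication_task(
    ReplicationTaskIdentifier="filtered-orders-task",
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",  # placeholder
    MigrationType="full-load",
    TableMappings=json.dumps(table_mappings),
)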
Contact me if you have problems.
As an alternative to DMS, you can use AWS Glue, pulling the data from the on-prem DB into a PySpark DataFrame and writing it to either S3 or AWS RDS. This works very well; the only downside is the cost.
This solution supports both a table name and a SQL query as the input for data extraction, as sketched below.
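A minimal PySpark sketch of that approach, assuming a SQL Server JDBC driver on the classpath and placeholder host, credentials, query, and bucket names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("onprem-to-s3").getOrCreate()

# Pull data over JDBC; use .option("dbtable", "dbo.orders") instead of "query"
# if you want the whole table rather than a SQL query.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://onprem-host:1433;databaseName=sales")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("query", "SELECT id, total FROM dbo.orders WHERE total > 100")
    .option("user", "etl_user")
    .option("password", "secret")
    .load()
)

# Land the result in S3 as Parquet (works as-is on Glue/EMR where s3:// is configured).
df.write.mode("overwrite").parquet("s3://my-bucket/orders/")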
I'm attempting to use AWS Glue to ETL a MySQL database in RDS to S3 so that I can work with the data in services like SageMaker or Athena. At this time I don't care about transformations; this is a prototype and I simply want to dump the DB to S3 to start testing the various toolchains.
I've set up a Glue database and tested the connection to RDS successfully
I am using the AWS-provided Glue IAM service role
My S3 bucket has the correct prefix of aws-glue-*
I created a crawler using the Glue database, AWSGlue service role, and S3 bucket above with the options:
Schema updates in the data store: Update the table definition in the data catalog
Object deletion in the data store: Delete tables and partitions from the data catalog.
When I run the crawler, it completes in ~60 seconds but it does not create any tables in the database.
I've tried adding the Admin policy to the Glue service role to eliminate IAM access issues, and the result is the same.
Also, CloudWatch logs are empty. Log groups are created for the test connection and the crawler but neither contains any entries.
I'm not sure how to further troubleshoot this, info on AWS Glue seems pretty sparse.
Figured it out. I had a syntax error in my "include path" for the crawler. Make sure the connection is the data source (RDS in this case) and the include path lists the data target you want, e.g. mydatabase/% (I forgot the /%).
You can substitute the percent (%) character for a schema or table. For databases that support schemas, type MyDatabase/MySchema/% to match all tables in MySchema within MyDatabase. Oracle and MySQL don't support schemas in the path; instead, type MyDatabase/%. For information about which JDBC data stores support schemas, see Cataloging Tables with a Crawler.
Ryan Fisher is correct in the sense that it's an error; I wouldn't categorize it as a syntax error, though. When I ran into this, it was because the "Include path" didn't include the default schema that SQL Server lovingly provides for you.
I had this: database_name/table_name
When it needed to be: database_name/dbo/table_name
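If you create the crawler with boto3, that include path goes into the JDBC target. A minimal sketch, assuming hypothetical connection, role, and database names:

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="rds-crawler",
    Role="AWSGlueServiceRoleDefault",        # placeholder IAM role
    DatabaseName="my_catalog_db",            # Glue Data Catalog database
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "my-rds-connection",  # placeholder Glue connection
                "Path": "database_name/dbo/%",          # database/schema/% include path
            }
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DELETE_FROM_DATABASE",
    },
)
glue.start_crawler(Name="rds-crawler")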
The tool below is a batch import method of copying data from SQL Server RDS into Redshift.
AWS Schema Conversion Tool Exports from SQL Server to Amazon Redshift
Is there a more streamlined method, such as streaming data from MS SQL Server into Redshift with Kinesis Firehose? I know we can move Amazon Aurora data directly into Redshift with Kinesis.
If your goal is to move data from Microsoft SQL Server into Amazon Redshift, you could consider using AWS Database Migration Service. It can copy data as a one-off job, but it can also migrate on an ongoing basis.
See:
Using a Microsoft SQL Server Database as a Source for AWS DMS - AWS Database Migration Service
Using an Amazon Redshift Database as a Target for AWS Database Migration Service - AWS Database Migration Service
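A minimal boto3 sketch of that DMS setup (all hosts, credentials, identifiers, and ARNs are placeholders), using a full-load-and-cdc task so replication continues after the initial copy:

import boto3

dms = boto3.client("dms")

source = dms.create_endpoint(
    EndpointIdentifier="sqlserver-source",
    EndpointType="source",
    EngineName="sqlserver",
    ServerName="sqlserver.example.com",
    Port=1433,
    DatabaseName="sales",
    Username="dms_user",
    Password="secret",
)

# Note: a Redshift target typically also needs an intermediate S3 bucket and a
# service access role configured; omitted here to keep the sketch short.
target = dms.create_endpoint(
    EndpointIdentifier="redshift-target",
    EndpointType="target",
    EngineName="redshift",
    ServerName="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    Port=5439,
    DatabaseName="analytics",
    Username="dms_user",
    Password="secret",
)

dms.create_replication_task(
    ReplicationTaskIdentifier="sqlserver-to-redshift",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",  # placeholder
    MigrationType="full-load-and-cdc",
    TableMappings='{"rules":[{"rule-type":"selection","rule-id":"1","rule-name":"1","object-locator":{"schema-name":"%","table-name":"%"},"rule-action":"include"}]}',
)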