(Image: crawler snapshot showing the tables created.)
I am unable to see the tables under the Databases tab in the AWS data lake/Glue UI, even though the crawler log states that 2 tables have been created.
2020-09-05T15:16:45.020+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] BENCHMARK : Running Start Crawl for Crawler db1
2020-09-05T15:17:02.149+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] BENCHMARK : Classification complete, writing results to database db1
2020-09-05T15:17:02.150+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] INFO : Crawler configured with SchemaChangePolicy {"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DEPRECATE_IN_DATABASE"}.
2020-09-05T15:17:23.963+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] INFO : Created table customers in database db1
2020-09-05T15:17:23.965+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] INFO : Created table sales in database db1
2020-09-05T15:17:24.674+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] BENCHMARK : Finished writing to Catalog
2020-09-05T15:18:30.608+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] BENCHMARK : Crawler has finished running and is in state READY
The role has all admin policies attached. I have tried refreshing many times, but I still cannot see the tables.
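For reference, one way to double-check whether the tables actually made it into the Data Catalog, independently of the console, is to query them from Athena. The database name db1 is taken from the crawler log above; the rest is just an illustrative sketch:

-- List the Data Catalog tables in database db1 (name taken from the crawler log)
SHOW TABLES IN db1;

-- If they are listed, a quick sanity query should also work
SELECT * FROM db1.customers LIMIT 10;

If the tables appear here, the catalog entries exist and the problem is only with what the console is displaying (for example, a different region being selected).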
Related
I'm trying to ingest data from different tables within the same database into BigQuery tables, using the Data Fusion Multiple Database Tables plugin together with the BigQuery Multi Table sink. I wrote 3 different custom SQL statements and added them in the plugin section under "Data Section Mode" > "Custom SQL Statements".
The problem is that when I preview, or deploy and run, the pipeline, I get the error "BigQuery Multi Table has no outputs. Please check that the sink calls addOutput at some point."
What I have tried to figure out this problem:
Ran the custom SQL directly on the database; it worked properly.
Created pipelines specific to each custom SQL, i.e. single-table ingestion from SQL Server to a BigQuery table sink; they worked properly.
Tried the other Data Section Mode of the Multiple Database Tables plugin, Table Allow List. It works, but it just inserts all the data, with no option to transform any column or apply filtering. I did that one to see whether the plugin can reach the database and read data; it can.
Data Pipeline - Multiple Database Tables Plugin Config - 1
Data Pipeline - Multiple Database Tables Plugin Config - 2
In conclusion, I would like to ingest data from multiple tables in one database within one data pipeline. If possible, I would like to do it by writing a custom SQL statement for each table.
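To make that concrete, a custom SQL statement in that mode is essentially one query per source table; the sketch below is purely illustrative, with made-up table and column names:

-- Hypothetical per-table statement added under "Data Section Mode" > "Custom SQL Statements":
-- it selects and filters a single source table for the multi-table sink
SELECT customer_id, customer_name, created_at
FROM dbo.customers
WHERE created_at >= '2020-01-01';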
I am open to any advice and willing to try it.
Thank you.
Hi, we want to replicate data from MySQL (source) to Google BigQuery (destination). We adopted the method described in the Google docs using a Data Fusion replication pipeline, as mentioned in this link:
https://cloud.google.com/data-fusion/docs/tutorials/replicating-data/mysql-to-bigquery
Brief of what we are doing:
Enabling binlog in MySQL for CDC (change data capture)
Creating a replication pipeline in Data Fusion
Starting the pipeline and syncing the data
We were able to create the MySQL database on Compute Engine, enable binlog for CDC, and grant the user all necessary permissions for the data replication pipeline in MySQL.
We successfully created a Data Fusion instance and were able to create a replication pipeline.
The replication pipeline is able to fetch our MySQL database details, and the target BigQuery dataset is also set.
On starting the pipeline, it tracks the changes successfully (insert, update, and delete), and the table schema is also created automatically in BigQuery.
But the PROBLEM we are getting is that no data is transferred to the BigQuery table. What I have seen in the log is "loading batch of 1 event into staging bucket".
Sharing the screenshots also:
Able to fetch every change from MySQL, but data is not transferring to BigQuery.
Table schema was created, but data is not transferred.
"Loading batch of 1 event into staging bucket." We are using developer mode and have waited for more than 90 minutes.
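For completeness, the binlog prerequisites on the MySQL side can be verified with statements along these lines (a generic sketch; 'replication_user' is a placeholder, and row-based binlog is what CDC-style replication expects):

-- Confirm the binary log is enabled and row-based
SHOW VARIABLES LIKE 'log_bin';        -- expected: ON
SHOW VARIABLES LIKE 'binlog_format';  -- expected: ROW

-- Check which grants the replication user actually has
SHOW GRANTS FOR 'replication_user'@'%';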
The issue might be happening because there is a schema/data type mismatch between the BigQuery table and the source MySQL table on some of the columns.
For example, a column might be of INT64 data type with a length of 19 in BigQuery, while in the source database table it is an Integer type with a length of 10; in that case you need to update the column lengths to match your data size.
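A quick way to put the two column definitions side by side is to query the metadata views on both ends; the database, dataset, and table names below are placeholders:

-- MySQL: column types and precision of the source table
SELECT column_name, data_type, numeric_precision
FROM information_schema.columns
WHERE table_schema = 'source_db' AND table_name = 'my_table';

-- BigQuery: column types of the replicated table
SELECT column_name, data_type
FROM `my_project.my_dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'my_table';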
Are there any system tables in Google BigQuery to check all the currently running queries? I am looking for something similar to the V$SQL and V$SESSION tables in Oracle.
You can query the INFORMATION_SCHEMA.JOBS_BY_* views to retrieve real-time metadata about BigQuery jobs. These views contain currently running jobs as well as the last 180 days of history of completed jobs.
For example:
SELECT
job_id,
creation_time,
query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE state != "DONE"
Note: Valid states include PENDING, RUNNING, and DONE.
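If you want a closer analogue of V$SESSION across all users, rather than just your own jobs, the project-level view can be queried the same way (it needs broader permissions than JOBS_BY_USER):

-- Currently pending or running jobs for the whole project, newest first
SELECT job_id, user_email, state, creation_time, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE state != "DONE"
ORDER BY creation_time DESC;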
I am trying to connect AWS SCT to Teradata to migrate some tables to Redshift. However, while connecting to Teradata, I am getting an error which says:
"The specified account does not have sufficient privileges for working with the following object(s):
Database 'DBC' : [SELECT]"
Here is a snapshot of the error (some connection details removed):
What permissions should I request from the Teradata admin for this user so that I am able to access the required DB?
The user connecting to Teradata should have SELECT access on DBC in order to pull the object metadata to be converted into Redshift DDL.
Following the docs, you'll need these permissions:
SELECT ON DBC
SELECT ON SYSUDTLIB
SELECT ON SYSLIB
SELECT ON <source_database>
CREATE PROCEDURE ON <source_database>
In the preceding example, replace the <source_database> placeholder with the name of the source database.
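In Teradata SQL, that request to the admin would translate into grants along these lines; sct_user and source_db are placeholders for the SCT login and the database being converted:

-- sct_user and source_db are placeholders
GRANT SELECT ON DBC TO sct_user;
GRANT SELECT ON SYSUDTLIB TO sct_user;
GRANT SELECT ON SYSLIB TO sct_user;
GRANT SELECT ON source_db TO sct_user;
GRANT CREATE PROCEDURE ON source_db TO sct_user;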
In AWS DMS, how does the migration happen internally? Does it export the entire data set from the source table and import it into the destination table? Or does it migrate table records one by one to the destination table? I am new to AWS DMS and don't have much idea of how things work there.
AWS publishes how DMS works in its documentation and blog posts. This is the list I wish I had when I started with DMS:
For a high-level understanding, see: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.html
A task can consist of three major phases:
The full load of existing data
The application of cached changes
Ongoing replication
During a full load migration, where existing data from the source is moved to the target, AWS DMS loads data from tables on the source data store to tables on the target data store. While the full load is in progress, any changes made to the tables being loaded are cached on the replication server; these are the cached changes.
...
When the full load for a given table is complete, AWS DMS immediately begins to apply the cached changes for that table. When all tables have been loaded, AWS DMS begins to collect changes as transactions for the ongoing replication phase. After AWS DMS applies all cached changes, tables are transactionally consistent. At this point, AWS DMS moves to the ongoing replication phase, applying changes as transactions.
From: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.Components.html
Look at the headings:
Replication Tasks
Ongoing replication, or change data capture (CDC)
To gain a detailed understanding of how DMS works internally, read through the following blogs from AWS:
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 1)
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 2)
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong? (Part 3)
Finally, work through the blogs particular to your source and target databases at https://aws.amazon.com/blogs/database/category/migration/aws-database-migration-service-migration/
When I first used DMS I had the same question, so I simply enabled CloudWatch logs and created one migration task from Oracle to Aurora PostgreSQL.
First, the DMS task runs on the replication instance (RI), which connects to the source and target databases.
The RI then connects to the source database and, based on the selection rules, identifies the table and column details, since it has a lot of special access on the source and target DBs.
After that, it starts reading the source table(s) in parallel and creates a "SELECT col1, col2, col3 ... FROM ..." kind of query to fetch data from the source.
Then it writes files to a temp location on the RI, organized by table: 1 file per table and approximately 10,000 rows per commit.
While all this is happening, another process creates a connection to the target DB and checks whether the tables already exist; if yes, it checks which option we selected (Do Nothing, Truncate Table, etc.) and takes action accordingly.
By now we have the data from the source tables in files on the RI, plus a connection and tables created on the target DB. The RI then just reads the file records from its temp location and creates insert queries.
Once the last commit is successful, it deletes the temp file from the RI.
Once the source table and target table counts match, it closes the connections, in the case of a one-time load.
In the case of ongoing changes, it keeps the connection alive and reads the redo logs or other logs in the source DB, then follows the same process described above for CDC.
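Schematically, and purely as an illustration of the shape of the generated statements (table and column names are invented), the unload and load queries look like:

-- On the source: a plain projection per selected table
SELECT order_id, customer_id, order_total FROM sales.orders;

-- On the target: the staged file rows are replayed as batched inserts
INSERT INTO sales.orders (order_id, customer_id, order_total)
VALUES (1001, 42, 99.95);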
Here's a doc that provides some more information on how DMS Ongoing Replication works internally: https://aws.amazon.com/blogs/database/introducing-ongoing-replication-from-amazon-rds-for-sql-server-using-aws-database-migration-service/
The short of it is:
(following some initial steps) AWS DMS does not use any replication artifacts. When all the required information is available in the transaction log or transaction log backup, AWS DMS uses the fn_dblog() and fn_dump_dblog() functions to read changes directly from the transaction logs or transaction log backups using the log sequence number (LSN).
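As an illustration of what those functions expose (not something you normally run yourself; DMS drives it), a raw read of the active transaction log on SQL Server looks like this:

-- Read log records from the active transaction log (NULL, NULL = no LSN bounds)
SELECT [Current LSN], Operation, [Transaction ID], AllocUnitName
FROM fn_dblog(NULL, NULL);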
In addition to the above answers: DMS uses Attunity underneath, and there are public documents on how the latter works in detail.