Data Fusion replication pipeline is not syncing data into Google BigQuery - google-cloud-platform

Hi, we want to replicate data from MySQL (source) to Google BigQuery (destination). We adopted the method described in the Google docs using a Data Fusion replication pipeline, as mentioned in this link:
https://cloud.google.com/data-fusion/docs/tutorials/replicating-data/mysql-to-bigquery
Brief of what we are doing:
Enabling binlog in MySQL for CDC (change data capture)
Creating a replication pipeline in Data Fusion
Starting the pipeline and syncing the data
We were able to create the MySQL database on Compute Engine, enable binlog for CDC, and grant the user all necessary permissions for the data replication pipeline in MySQL.
We successfully created a Data Fusion instance and were able to create a replication pipeline.
The replication pipeline is able to fetch our MySQL database details, and the target BigQuery dataset is also set.
On starting the pipeline, it tracks the changes successfully (insert, update and delete), and the table schema is also created automatically in BigQuery.
But the PROBLEM is that no data is getting transferred to the BigQuery table. In the log, what I see is "loading batch of 1 event into staging bucket".
Sharing the screenshots as well:
Able to fetch every change from MySQL, but data is not transferring to BigQuery
Table schema was created, but data is not transferred
"Loading batch of 1 event into staging bucket" - we are using the developer mode and have waited for more than 90 minutes.

The issue might be happening because of a schema/data type mismatch between the BigQuery table and the source MySQL table on some columns.
For example: a column may be of INT64 data type with a length of 19 in BigQuery, while in the source database table it is an Integer with a length of 10; in that case you need to update the column lengths/types to match your data size.
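One quick way to check the BigQuery side is to dump the replicated table's schema with the BigQuery Python client and compare it column by column with the MySQL table definition. A minimal sketch, assuming the google-cloud-bigquery library; the project, dataset and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical fully qualified name of the table created by the replication pipeline.
table = client.get_table("my-project.my_dataset.my_table")
for field in table.schema:
    print(field.name, field.field_type, field.mode)

Any column whose type or mode does not line up with the source definition is a candidate for the mismatch described above.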

Related

"BigQuery Multi Table has no outputs. Please check that the sink calls addOutput at some point" error from the Multiple Database Tables plugin

I'm trying to ingest data from different tables within the same database into BigQuery tables, using the Data Fusion Multiple Database Tables plugin as the source and the BigQuery Multi Table sink. I wrote 3 different custom SQL statements and added them in the plugin section under "Data Section Mode" > "Custom SQL Statements".
The problem is that when I preview, or deploy and run, the pipeline, I get the error "BigQuery Multi Table has no outputs. Please check that the sink calls addOutput at some point."
What I tried in order to figure out the problem:
Ran the custom SQL directly on the database; it worked properly.
Created pipelines specific to each custom SQL, each being a single-table ingestion from SQL Server to a BigQuery table sink; they worked properly.
Tried the other Data Section Mode of the Multiple Database Tables plugin, "Table Allow List". It works, but it just inserts all the data with no option to transform any column or filter. I did that to see whether the plugin can reach the database and read data; it can.
Data Pipeline - Multiple Database Tables Plugin Config - 1
Data Pipeline - Multiple Database Tables Plugin Config - 2
In conclusion, I would like to ingest data from one database with multiple tables within one data pipeline, ideally by writing a custom SQL statement for each table.
Open to any advice.
Thank you.

Does GCP Data Loss Prevention support publishing its results to Data Catalog for external BigQuery tables?

I was trying to auto-tag InfoTypes like PhoneNumber and EmailId on data in a GCS bucket and BigQuery external tables using the Data Loss Prevention (DLP) tool in GCP, so that I can have those tags in Data Catalog and subsequently in Dataplex. Now the problems are:
If I select any source other than a BigQuery table (GCS, Datastore, etc.), the option to publish GCP DLP inspection results to Data Catalog is disabled.
If I select a BigQuery table, the Data Catalog publish option is enabled, but when I try to run the inspection job, it errors out saying "External tables are not supported for inspection". Surprisingly, it supports only internal BigQuery tables.
The question is: is my understanding correct that the GCP DLP - Data Catalog integration works only for internal BigQuery tables? Am I doing something wrong here? GCP documentation doesn't mention these things either!
Also, while configuring the inspection job from the DLP UI console, I had to provide a BigQuery table ID mandatorily. Is there a way I can run a DLP inspection job against a BQ dataset or a group of tables?
Regarding Data Loss Prevention services in Google Cloud, your understanding is correct: data cannot be exfiltrated by copying to services outside the perimeter, e.g., a public Google Cloud Storage (GCS) bucket or an external BigQuery table. Visit this URL for more reference.
Now, about how to run a DLP inspection job against a group of BQ tables, there are two ways to do it:
Programmatically fetch the BigQuery tables, query each table and call the DLP Streaming Content API. It operates in real time, but it is expensive. Here I share the concept in a Java example:
// Build the JDBC connection string for the Simba BigQuery driver.
String url = String.format(
    "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=3;ProjectId=%s;",
    projectId);
DataSource ds = new com.simba.googlebigquery.jdbc42.DataSource();
ds.setURL(url);
Connection conn = ds.getConnection();
// List every table visible through this connection's catalog.
DatabaseMetaData databaseMetadata = conn.getMetaData();
ResultSet tablesResultSet =
    databaseMetadata.getTables(conn.getCatalog(), null, "%", new String[]{"TABLE"});
while (tablesResultSet.next()) {
    // Query your table data and call the DLP Streaming Content API here.
}
Here is a tutorial for this method.
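For reference, the "call the DLP Streaming Content API" step inside that loop goes through the DLP content inspection API. A minimal Python sketch, assuming the google-cloud-dlp client library; project_id, row_value and the infoTypes are placeholders:

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = f"projects/{project_id}/locations/global"
# Inspect a single value (e.g. a row or cell serialized as text) for a couple of infoTypes.
response = dlp.inspect_content(
    request={
        "parent": parent,
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]},
        "item": {"value": row_value},
    }
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)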
Programmatically fetch the BigQuery tables and then trigger one inspect job for each table. It is the cheapest method, but you need to consider that it is a batch operation, so it doesn't execute in real time. Here is the concept in a Python example:
from google.cloud import bigquery

client = bigquery.Client()
datasets = list(client.list_datasets(project=project_id))
if datasets:
    for dataset in datasets:
        tables = client.list_tables(dataset.dataset_id)
        for table in tables:
            # Create an inspect job for table.table_id here.
            pass
Use this thread for more reference on running a DLP inspection job against a group of BQ tables.
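To make the second approach concrete, the inspect-job creation inside that loop could look roughly like the sketch below, assuming the google-cloud-dlp client library; the infoTypes and the Data Catalog publish action are illustrative placeholders:

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = f"projects/{project_id}/locations/global"
inspect_job = {
    "storage_config": {
        "big_query_options": {
            "table_reference": {
                "project_id": project_id,
                "dataset_id": dataset.dataset_id,
                "table_id": table.table_id,
            }
        }
    },
    "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]},
    # Publish findings to Data Catalog; as discussed above, this works only for native BigQuery tables.
    "actions": [{"publish_findings_to_cloud_data_catalog": {}}],
}
dlp.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})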

How to export an Oracle DB table with complex CLOB data into BigQuery through batch upload?

We are currently using Apache Sqoop once daily to export an Oracle DB table containing a CLOB column into HDFS. As part of this, we first map the CLOB column to a Java String (using --map-column-java) and save the imported data in Parquet format. We have this scheduled as an Oozie workflow.
There is a plan to move from Apache Hive to BigQuery. I am not able to find a way to get this table into BigQuery and would like help on the best approach to get this done.
If we go with real-time streaming from the Oracle DB into BigQuery using Google Datastream, can you tell me whether the CLOB column will be streamed correctly? It contains some malformed XML data (close to XML structure, but with possible discrepancies in obeying the structure).
Another option I read about is to extract the table as a CSV file, transfer it to GCS, and have the BigQuery table refer to it there. But since my data in the CLOB column is very large and messy, with multiple commas and special characters in between, I think there will be issues with parsing or exporting. Are there any options to do it in Parquet or ORC formats?
The preferred approach is a scheduled batch upload performed daily from Oracle to BigQuery. I appreciate any input on how to achieve this.
We can convert CLOB data from Oracle DB to a desired format such as ORC, Parquet, TSV or Avro files through Enterprise Flexter.
Also, you can refer to this on how to ingest on-premises Oracle data with Google Cloud Dataflow via JDBC, using the Hybrid Data Pipeline On-Premises Connector.
For your other query about moving from Apache Hive to BigQuery:
The fastest way to import into BQ is using GCP resources. Dataflow is a scalable solution for reading and writing. Dataproc is another, more flexible option where you can use more open-source stacks to read from the Hive cluster.
You can also use this Dataflow template, which would require a connection to be established directly between the Dataflow workers and the Apache Hive nodes.
There is also a plugin for moving data from Hive into BigQuery which uses GCS as temporary storage and the BigQuery Storage API to move the data into BigQuery.
You can also use Cloud SQL to migrate your Hive data to BigQuery.
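If you take the daily batch route and land Parquet files in GCS (which avoids the CSV delimiter and special-character problems for the CLOB column), the load step itself is straightforward with the BigQuery Python client. A sketch, assuming the google-cloud-bigquery library; the bucket path and table name are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    # Overwrite the table each day; use WRITE_APPEND for incremental loads instead.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/oracle_export/*.parquet",   # placeholder GCS path
    "my-project.my_dataset.oracle_table",       # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

BigQuery load jobs also accept ORC (bigquery.SourceFormat.ORC), so either of the formats you mentioned avoids the CSV parsing issues.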

Create tables in Glue Data Catalog for data in S3 and unknown schema

My current use case is: in an ETL-based service (NOTE: the ETL service is not using Glue ETL; it is an independent service), I am getting some data from AWS Redshift clusters into S3. The data in S3 is then fed into the T and L jobs. I want to populate the metadata into the Glue Data Catalog. The most basic solution for this is to use a Glue crawler, but the crawler runs for approximately 1 hour and 20 minutes (there are a lot of S3 partitions). The other solution I came across is to use the Glue APIs. However, I am facing the issue of data type definition with those.
Is there any way I can create/update the Glue Catalog tables when I have data in S3 and the data types are known only during the extraction process?
Also, when the T and L jobs are run, the data types should be readily available in the catalog.
In order to create or update the Data Catalog during your ETL process, you can make use of the following:
Update:
additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "UPDATE_IN_DATABASE"}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]

sink = glueContext.write_dynamic_frame_from_catalog(
    frame=last_transform,
    database=<dst_db_name>,
    table_name=<dst_tbl_name>,
    transformation_ctx="write_sink",
    additional_options=additionalOptions,
)
job.commit()
The above can be used to update the schema. You also have the option to set updateBehavior, choosing between LOG and UPDATE_IN_DATABASE (the default).
Create:
To create new tables in the Data Catalog during your ETL, you can follow this example:
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://path/to/data",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["partition_key0", "partition_key1"],
)
sink.setFormat("<format>")
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)
You can specify the database and new table name using setCatalogInfo.
You also have the option to update the partitions in the Data Catalog by setting the enableUpdateCatalog argument and then specifying the partitionKeys.
A more detailed explanation on the functionality can be found here.
I found a solution to the problem: I ended up utilising the Glue Catalog APIs to make it seamless and fast.
I created an interface which interacts with the Glue Catalog and overrode its methods for the various data sources. Right after the data has been loaded into S3, I fire a query to get the schema from the source, and then the interface does its work.
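For anyone looking for the shape of that Glue Catalog call, here is a hedged boto3 sketch of registering a table once the column types are known at extraction time; the database, table, S3 path, columns and partition key are all placeholders:

import boto3

glue = boto3.client("glue")
# Column names and types discovered during the extraction step.
columns = [{"Name": "id", "Type": "bigint"}, {"Name": "payload", "Type": "string"}]

glue.create_table(
    DatabaseName="my_catalog_db",        # placeholder database
    TableInput={
        "Name": "my_extracted_table",    # placeholder table name
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
        "StorageDescriptor": {
            "Columns": columns,
            "Location": "s3://my-bucket/extracted/my_extracted_table/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)

An existing table can be refreshed the same way with glue.update_table, which takes the same TableInput shape.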

How AWS DMS works internally

In AWS DMS, how does the migration happen internally? Is it like exporting the entire data from the source table and importing it into the destination table? Or is it migrating table records one by one to the destination table? I am new to AWS DMS and don't have much idea of how things work there.
AWS publishes how DMS works in their documentation and blog posts. This is the list I wish I had when I started with DMS:
For a high-level understanding, see: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.html
A task can consist of three major phases:
The full load of existing data
The application of cached changes
Ongoing replication
During a full load migration, where existing data from the source is moved to the target, AWS DMS loads data from tables on the source data store to tables on the target data store. While the full load is in progress, any changes made to the tables being loaded are cached on the replication server; these are the cached changes.
...
When the full load for a given table is complete, AWS DMS immediately begins to apply the cached changes for that table. When all tables have been loaded, AWS DMS begins to collect changes as transactions for the ongoing replication phase. After AWS DMS applies all cached changes, tables are transactionally consistent. At this point, AWS DMS moves to the ongoing replication phase, applying changes as transactions.
From: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Introduction.Components.html
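Those three phases map onto the task's migration type. As a rough illustration (not from the quoted docs), creating a full-load-plus-CDC task with boto3 might look like the sketch below; all ARNs and the table mapping are placeholders:

import json
import boto3

dms = boto3.client("dms")
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}
dms.create_replication_task(
    ReplicationTaskIdentifier="example-task",          # placeholder
    SourceEndpointArn="arn:aws:dms:...:endpoint:SRC",  # placeholder ARN
    TargetEndpointArn="arn:aws:dms:...:endpoint:TGT",  # placeholder ARN
    ReplicationInstanceArn="arn:aws:dms:...:rep:RI",   # placeholder ARN
    MigrationType="full-load-and-cdc",  # full load + cached changes + ongoing replication
    TableMappings=json.dumps(table_mappings),
)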
Look at the headings:
Replication Tasks
Ongoing replication, or change data capture (CDC)
To gain a detailed understanding of how DMS works internally, read through the following blogs from AWS:
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 1)
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 2)
Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong? (Part 3)
Finally, work through the blogs particular to your source and target databases at https://aws.amazon.com/blogs/database/category/migration/aws-database-migration-service-migration/
When I first used DMS I had the same question, so I simply enabled CloudWatch logs and created one migration task from Oracle to Aurora PostgreSQL.
First, the DMS task runs on the replication instance (RI), which connects to the source and target databases.
The RI then connects to the source database and, based on the selection rules, identifies table and column details, since it has a lot of special access on the source and target DBs.
After that it starts reading the source table(s) in parallel and builds a "SELECT col1, col2, col3 ... FROM ..." kind of query to fetch data from the source.
Then it writes files to a temp location on the RI, organized by table: one file per table and approximately 10,000 rows per commit.
While all this is happening, another process creates a connection to the target DB and checks whether the tables already exist; if they do, it checks which option we selected (Do Nothing, Truncate Table, etc.) and takes action based on that.
At this point we have data from the source tables in files on the RI, and connections and tables created on the target DB. Now the RI just reads the file records from the RI temp location and creates insert queries.
Once the last commit is successful, it deletes the temp files from the RI.
Once the source table and target table counts match, it closes the connections in the case of a one-time load.
In the case of ongoing changes, it keeps the connection alive and reads the redo logs or other logs in the source DB, then follows the same process mentioned above for CDC.
Here's a doc that provides some more information on how DMS Ongoing Replication works internally: https://aws.amazon.com/blogs/database/introducing-ongoing-replication-from-amazon-rds-for-sql-server-using-aws-database-migration-service/
The short of it is:
(following some initial steps) AWS DMS does not use any replication artifacts. When all the required information is available in the transaction log or transaction log backup, AWS DMS uses the fn_dblog() and fn_dump_dblog() functions to read changes directly from the transaction logs or transaction log backups using the log sequence number (LSN).
In addition to the above answers: DMS uses Attunity underneath, and there are public documents on how the latter works in detail.