Move BigQuery Data Transfer Service (DCM) data to another project

I have the BigQuery Data Transfer Service for Campaign Manager set up in dataset A in GCP project A. I would like to move this to dataset B located in project B. How can I move the existing data and set up the BigQuery transfer without any loss of data and without duplicates?

I'm afraid you would have to:
Copy the relevant tables from dataset A to dataset B
Set up the transfer service again for dataset B (assuming it can be done if the tables already exist in the target dataset)
De-dup the data yourself.
A workaround that achieves something similar, but not exactly what you asked, is to create views in dataset B of the relevant tables in dataset A.
These views will behave like proper tables in dataset B, and you don't have to worry about de-duplication, data loss, or setting up the data transfer again. The downside is that you will have to keep dataset A around.
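A minimal sketch of such a view; the project, dataset and table names below are placeholders, and both datasets need to be in the same location for the view to be queryable:

-- View in project B / dataset B that reads the DCM table still living in project A / dataset A.
CREATE VIEW `project-b.dataset_b.dcm_impressions` AS
SELECT *
FROM `project-a.dataset_a.dcm_impressions`;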

Here's how I migrated the transfer service:
The transfer service was enabled in project B.
Once data started to arrive in dataset B, the historical data (from the beginning up to MIN(partition_date) - 1) was copied from dataset A into the appropriate partitions in dataset B (see the sketch below).
The transfer service in project A was stopped after verifying the partition counts and row counts.
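A rough sketch of that backfill, assuming the table is partitioned on a DATE column called partition_date and that both datasets are in the same location; the names and cutoff date are placeholders (for ingestion-time partitioned tables you would copy partition by partition instead):

-- Backfill history from the old dataset into the matching partitions of the new one.
INSERT INTO `project-b.dataset_b.dcm_table`
SELECT *
FROM `project-a.dataset_a.dcm_table`
WHERE partition_date < DATE '2020-01-01';  -- placeholder: MIN(partition_date) already loaded in dataset B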

Related

Data Warehouse and connection to Power Bi on AWS Structure

I work for a startup where I have created several dashboards in Power BI using tables that are stored in an AWS RDS that I connect to using MySQL. To create additional columns for visualizations, I created views of the tables in MySQL and used DAX to add some extra columns for the Power BI visualizations.
However, my team now wants to use the AWS structure and build a data lake to store the raw data and a data warehouse to store the transformed data. After researching, I believe I should create the additional columns in either Athena or Redshift. My question is, which option is best for our needs?
I think the solution is to connect to the RDS using AWS Glue to perform the necessary transformations and deposit the transformed data in either Athena or Redshift. Then, we can connect to the chosen platform using Power BI. Please let me know if I am misunderstanding anything.
To give an idea of the number of records I'm handling: the fact tables get about 10 thousand new records every month.
Thank you in advance!

Can I trace back the version of the data my model was trained on in VertexAI?

Let's suppose I have a table in BigQuery and I create a dataset on VertexAI based on it. I train my model. A while later, the data gets updated several times in BigQuery.
But can I simply go to my model and get redirected to the exact version of the data it was trained on?
Using time travel, I can still access the historical data in BigQuery. But I didn't manage to go to my model and figure out on which version of the data it was trained and look at that data.
In Vertex AI, when creating a dataset from BigQuery, there is this statement:
The selected BigQuery table will be associated with your dataset. Making changes to the referenced BigQuery table will affect the dataset before training.
So there is no copy or clone of the table prepared automatically for you.
Another fact is that you usually don't need the whole base table to create the dataset; you probably sub-select based on date or other WHERE conditions. Essentially, you filter your base table, and your new dataset is only a subset of it.
The recommended way is to create a BigQuery dataset where you will drop your table sources; let's call it vertex_ai_dataset. In this dataset you will store all the tables that are part of a Vertex AI dataset. Make sure to version them and not update them.
So BASETABLE -> SELECT -> WRITE AS vertex_ai_dataset.dataset_for_model_v1 (use the latter in Vertex AI).
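A minimal sketch of that versioned write; the table names and the WHERE filter are placeholders:

-- Materialize the filtered training data as a versioned table, never updated afterwards.
CREATE TABLE vertex_ai_dataset.dataset_for_model_v1 AS
SELECT *
FROM mydataset.basetable
WHERE event_date BETWEEN DATE '2022-01-01' AND DATE '2022-06-30';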
Another option is that whenever you issue a TRAIN action, you also SNAPSHOT the base table. But be aware this needs to be maintained and cleaned up as well.
CREATE SNAPSHOT TABLE dataset_to_store_snapshots.mysnapshotname
CLONE dataset.basetable;
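If you go this route, an expiration on the snapshot helps with the cleanup; a sketch with placeholder names:

-- Snapshot that expires automatically (here after 90 days); dataset and table names are placeholders.
CREATE SNAPSHOT TABLE dataset_to_store_snapshots.mysnapshotname_20220630
CLONE dataset.basetable
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 90 DAY));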
Other parameters and a guide are here.
You could also automate this by observing the Vertex AI train event (it should be documented here) and using Eventarc to start a Cloud Workflow that will automatically create a BigQuery table snapshot for you.

Snowflake: Data Split from AWS US to AWS Australia

There is 10 TB of data in a Snowflake database in the AWS US region. The requirement is to split out a subset of the data, identified by a flag in a column, to the AWS Australia region.
After the split, the US data will be around 6 TB and the Australia data around 4 TB.
There are 10 applications containing this mix of data.
I could think of 3 options to do this split.
1. Replicate the entire database from A to B. Then pause the applications before breaking the replication. In B, delete the rows that belong to A; in A, repeat the delete for the rows that belong to B. Clone the application set and configure the new set to read/write to B.
2. Use CTAS in B with data from A
3. Use SSIS to push data from A to B. For this option, the applications need not be stopped.
Please advise on these options and whether there are any other options with which this data split can be achieved.
Regards,
Mani
The whole setup of how these 10 applications access your Snowflake tables is unclear, but it is important for providing a solution.
Your best option to sync data across two Snowflake accounts is database replication and failover:
https://docs.snowflake.net/manuals/user-guide/database-replication-failover.html
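A rough sketch of the replication setup; the organization, account and database names are placeholders:

-- On the primary (US) account: allow the database to be replicated to the Australian account.
ALTER DATABASE appdb ENABLE REPLICATION TO ACCOUNTS myorg.account_au;

-- On the secondary (Australia) account: create the replica and refresh it.
CREATE DATABASE appdb AS REPLICA OF myorg.account_us.appdb;
ALTER DATABASE appdb REFRESH;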
Splitting the data based on a field can easily be done with materialized views that have a WHERE clause on this field. https://docs.snowflake.net/manuals/user-guide/views-materialized.html
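For example, a materialized view limited to the Australian rows might look like this (schema, table, columns and flag value are placeholders; Snowflake materialized views require Enterprise Edition and only allow a single-table query):

-- Materialized view containing only the rows flagged for Australia.
CREATE MATERIALIZED VIEW app_schema.orders_au_mv AS
SELECT order_id, customer_id, amount, region_flag
FROM app_schema.orders
WHERE region_flag = 'AU';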

BigQuery: save clusters of a clustered table to Cloud Storage

I have a BigQuery table that's clustered by several columns; let's call them client_id and attribute_id.
What I'd like is to submit one job or command that exports that table data to Cloud Storage, but saves each cluster (so each combination of client_id and attribute_id) to its own object. The final URIs might be something like this:
gs://my_bucket/{client_id}/{attribute_id}/object.avro
I know I could pull this off by iterating over all the possible combinations of client_id and attribute_id, using a client library to query the relevant data into a BigQuery temp table, and then exporting that data to a correctly named object, and I could do so asynchronously.
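A rough sketch of one iteration of that approach, using an EXPORT DATA statement per combination (the dataset, table and values are placeholders, and EXPORT DATA requires a wildcard in the URI rather than a fixed object name):

-- Export the rows for one (client_id, attribute_id) combination to its own GCS prefix.
EXPORT DATA OPTIONS (
  uri = 'gs://my_bucket/123/456/export-*.avro',  -- placeholder client_id / attribute_id
  format = 'AVRO',
  overwrite = true
) AS
SELECT *
FROM mydataset.mytable
WHERE client_id = 123 AND attribute_id = 456;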
But.... I imagine all the clustered data is already stored in a format somewhat like what I'm describing, and I'd love to avoid the unnecessary cost and headache of writing the script to do it myself.
Is there a way to accomplish this already without requesting a new feature to be added?
Thanks!

Azure SQL DWH: delete and restore it when required

Is there an option to restore a deleted database in SQL DWH at a later time (more than a year)?
The documentation clearly indicates that when an Azure SQL Data Warehouse is dropped it keeps the final snapshot for seven days:
When you drop a data warehouse, SQL Data Warehouse creates a final snapshot and saves it for seven days. You can restore the data warehouse to the final restore point created at deletion.
The same article also mentions the fact you can vote for this feature here:
https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/35114410-user-defined-retention-periods-for-restore-points
Even if you could do this, you are basically leaving it up to someone else to be in charge of your warehouse backups. What you could do instead is take control:
Store your Azure SQL Data Warehouse schema in source code control (e.g. git, Azure DevOps, formerly VSTS, etc.). If it isn't there already, you can reverse-engineer the schema using SQL Server Management Studio (SSMS) version 17.x onwards, or even use the SSDT preview feature.
Export your data to Data Lake or Azure Blob Storage using CREATE EXTERNAL TABLE AS SELECT (CETAS); see the sketch after this list. This will export your data as flat files to storage where it won't be deleted. Alternatively, use Azure Data Factory to export the data and zip it up to save space.
When you need to recreate the warehouse, simply redeploy the schema from source code control and reload the data, e.g. via CTAS into staging tables, or use Azure Data Factory to re-import. If you saved your external tables in the schema you keep in source code control, they will just be there when you redeploy. INSERT back into the main tables from the external tables.
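A sketch of the CETAS export; the storage location, names and table are placeholders, and depending on your storage account you may also need a database-scoped credential:

-- One-off setup (placeholders): where the files go and how they are written.
CREATE EXTERNAL DATA SOURCE backup_storage
WITH (TYPE = HADOOP, LOCATION = 'wasbs://backups@mystorageaccount.blob.core.windows.net');

CREATE EXTERNAL FILE FORMAT parquet_format
WITH (FORMAT_TYPE = PARQUET);

-- Export a table as flat files to storage; repeat per table you want preserved.
CREATE EXTERNAL TABLE dbo.FactSales_backup
WITH (LOCATION = '/FactSales/', DATA_SOURCE = backup_storage, FILE_FORMAT = parquet_format)
AS SELECT * FROM dbo.FactSales;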
In this way you are in charge of your warehouse schema and your data, and they can be recreated at any point you require, whether it be a day, a month or years later.