I want to move a dataset to another region, but there are some Pub/Sub subscriptions with Dataflow templates loading the tables within the dataset. How can I do this without interrupting the Dataflow jobs, or at least interrupting them as little as possible?
Is it possible to do this with the following steps?
Create a temporary dataset with a temporary name in a new region
Copy original dataset to the temporary dataset
Delete old original dataset
Create a new dataset with the original name in the new region
Copy the temporary dataset to the new dataset with the original name
I'm open to suggestions :D
You can use the copy dataset feature, which is in preview for now. One interesting capability that comes with it is the cross-region copy.
You can perform the same process, but more easily!
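For reference, the cross-region dataset copy can also be scripted with the Data Transfer Service client. A minimal sketch based on the cross_region_copy data source (project and dataset names are placeholders; the destination dataset must already exist in the target region):

from google.cloud import bigquery_datatransfer

project_id = "my-project"  # placeholder
transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# One transfer config = one dataset copy (it runs one copy job per table).
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="temp_dataset",        # placeholder, already created in the new region
    display_name="cross-region dataset copy",
    data_source_id="cross_region_copy",
    params={
        "source_project_id": project_id,
        "source_dataset_id": "original_dataset",  # placeholder
    },
    schedule="every 24 hours",  # you can also trigger runs manually and disable the schedule
)

transfer_config = transfer_client.create_transfer_config(
    parent=f"projects/{project_id}",
    transfer_config=transfer_config,
)
print("Created transfer config:", transfer_config.name)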
As for your Dataflow pipeline, I think it won't keep working as-is. Indeed, the location is important information when you write to BigQuery. Give it a try, but I'm pretty sure you will have to update it.
Let's suppose I have a table in BigQuery and I create a dataset in Vertex AI based on it. I train my model. A while later, the data gets updated several times in BigQuery.
But can I simply go to my model and get redirected to the exact version of the data it was trained on?
Using time travel, I can still access the historical data in BigQuery. But I didn't manage to go to my model, figure out which version of the data it was trained on, and look at that data.
In Vertex AI, when creating a dataset from BigQuery, there is this statement:
The selected BigQuery table will be associated with your dataset. Making changes to the referenced BigQuery table will affect the dataset before training.
So there is no copy or clone of the table prepared automatically for you.
Another fact is that you usually don't need the whole base table to create the dataset; you probably subselect based on date or other WHERE conditions. Essentially, the point here is that you filter your base table, and your new dataset is only a subselect of it.
The recommended way is to create a dataset where you will drop your table sources; let's call it vertex_ai_dataset. In this dataset you will store all the tables that are part of a Vertex AI dataset. Make sure to version them, and not to update them.
So BASETABLE -> SELECT -> WRITE AS vertex_ai_dataset.dataset_for_model_v1 (use the latter in Vertex AI).
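A minimal sketch of that BASETABLE -> SELECT -> WRITE AS step with the Python BigQuery client (project, dataset and table names and the filter are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Materialize a frozen, versioned copy of the filtered base table.
# Point Vertex AI at this table and never update it afterwards.
version = "v1"  # bump this for every new training run
client.query(f"""
    CREATE TABLE `my_project.vertex_ai_dataset.dataset_for_model_{version}` AS
    SELECT *
    FROM `my_project.my_dataset.basetable`
    WHERE event_date >= '2023-01-01'  -- placeholder filter
""").result()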
Another option is that whenever you issue a TRAIN action, you also SNAPSHOT the base table. But be aware that this needs to be maintained and cleaned up as well.
CREATE SNAPSHOT TABLE dataset_to_store_snapshots.mysnapshotname
CLONE dataset.basetable;
Other params and a guide can be found here.
You could also automate this by observing the Vertex AI train event (it should be documented here) and using Eventarc to start a Cloud Workflow that automatically creates a BigQuery table snapshot for you.
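If you would rather create the snapshot directly from the code that launches the training job (instead of wiring up Eventarc and a Workflow), a minimal sketch with the Python BigQuery client, using a timestamped snapshot name (dataset and table names are placeholders):

from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()

# Snapshot the base table right before kicking off training, so the model
# can always be traced back to the exact data it was trained on.
stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
client.query(f"""
    CREATE SNAPSHOT TABLE `dataset_to_store_snapshots.basetable_{stamp}`
    CLONE `dataset.basetable`
""").result()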
The BigQuery dataset copy documentation (see more here: https://cloud.google.com/bigquery/docs/copying-datasets) mentions that it uses the BigQuery Data Transfer Service. What is unclear to me right now is whether the BigQuery Data Transfer Service uses slots or not. I tried to find any information about this in the documentation but failed. Slots are used for ingestion jobs, so I was curious whether this is also the case for the Data Transfer Service.
As per this video from Google Cloud Tech, PIPELINE type jobs (extract, load and copy jobs) do take up slots if they are assigned reservations. The dataset copy documentation states that each table in the intended dataset gets its own copy job, so it seems that copy jobs do take up slots.
Quoting from the documentation.
Copying a dataset requires one copy job for each table in the dataset.
If no slots are assigned to PIPELINE type jobs, they use the shared pool of free slots by default.
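For reference, this is roughly what each of those per-table copy jobs looks like when submitted with the Python client (table IDs are placeholders); whether it consumes reserved slots or the shared free pool depends on whether a reservation is assigned to PIPELINE jobs:

from google.cloud import bigquery

client = bigquery.Client()

# A dataset copy boils down to one of these copy jobs per table.
copy_job = client.copy_table(
    "my_project.source_dataset.my_table",
    "my_project.destination_dataset.my_table",
)
copy_job.result()  # wait for the copy job to finish
print("Copy job", copy_job.job_id, "done")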
I'm having trouble trying to copy a particular table (as an example) from a dataset in the US region to a dataset in the asia-south1 region.
But when I try to copy the table using the "Copy" button in the UI, an error appears saying that no such dataset is found (presumably it is looking for the target table in the US region, or the source in asia-south1).
I don't need to copy whole datasets anywhere, as answers to other questions suggested, just a couple of tables.
I couldn't find compelling answers to that problem on SO yet. Thanks!
Table copy only works when source and destination tables are in the same region. A workaround solution could be:
create a temp_source dataset in the same region as the source table
copy the source table to the temp_source dataset
create a temp_destination dataset in the same region as the wanted destination (asia-south1 in your case)
use the BigQuery Data Transfer Service (Data transfers in the BigQuery Cloud console) to copy the temp_source dataset (containing your one table) to temp_destination
copy temp_destination.your_table to your wanted destination dataset (asia-south1)
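If you want to script everything except the Data Transfer Service dataset copy itself, a rough sketch with the Python client (project, dataset and table names are placeholders):

from google.cloud import bigquery

project = "my-project"  # placeholder
client = bigquery.Client(project=project)

# temp dataset in the source region (US)
temp_source = bigquery.Dataset(f"{project}.temp_source")
temp_source.location = "US"
client.create_dataset(temp_source, exists_ok=True)

# copy the single table into it (same region, so a plain copy job works)
client.copy_table(f"{project}.us_dataset.your_table",
                  f"{project}.temp_source.your_table").result()

# temp dataset in the destination region
temp_destination = bigquery.Dataset(f"{project}.temp_destination")
temp_destination.location = "asia-south1"
client.create_dataset(temp_destination, exists_ok=True)

# next, copy temp_source -> temp_destination with the Data Transfer Service
# (cross-region dataset copy), e.g. via "Data transfers" in the console

# finally, copy the table into the wanted asia-south1 dataset
client.copy_table(f"{project}.temp_destination.your_table",
                  f"{project}.asia_south1_dataset.your_table").result()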
I have a bigquery table that's clustered by several columns, let's call them client_id and attribute_id.
What I'd like is to submit one job or command that exports that table data to Cloud Storage, but saves each cluster (so each combination of client_id and attribute_id) to its own object. So the final URIs might be something like this:
gs://my_bucket/{client_id}/{attribute_id}/object.avro
I know I could pull this off by iterating over all the possible combinations of client_id and attribute_id, using a client library to query the relevant data into a BigQuery temp table, and then exporting that data to a correctly named object, and I could do so asynchronously.
But... I imagine all the clustered data is already stored in a format somewhat like what I'm describing, and I'd love to avoid the unnecessary cost and headache of writing the script to do it myself.
Is there a way to accomplish this already without requesting a new feature to be added?
Thanks!
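In case it helps until something built-in exists, a minimal sketch of the iterate-and-export approach described above, using one EXPORT DATA statement per combination (project, bucket and table names are placeholders, and client_id / attribute_id are assumed to be strings):

from google.cloud import bigquery

client = bigquery.Client()
table = "my_project.my_dataset.clustered_table"  # placeholder

# one export per (client_id, attribute_id) combination
combos = client.query(
    f"SELECT DISTINCT client_id, attribute_id FROM `{table}`"
).result()

for row in combos:
    # EXPORT DATA requires a single wildcard in the URI, hence *.avro
    uri = f"gs://my_bucket/{row.client_id}/{row.attribute_id}/*.avro"
    client.query(f"""
        EXPORT DATA OPTIONS (uri = '{uri}', format = 'AVRO') AS
        SELECT *
        FROM `{table}`
        WHERE client_id = '{row.client_id}' AND attribute_id = '{row.attribute_id}'
    """).result()  # drop .result() here if you want the exports to run concurrently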
I have the BigQuery Data Transfer Service for Campaign Manager set up in dataset A in GCP project A. I would like to move this to dataset B located in project B. How can I move the existing data and set up the BigQuery transfer without any loss of data or duplicates?
I'm afraid you would have to:
Copy the relevant tables from dataset A to dataset B
Set up the transfer service again for dataset B (assuming it can be done if the tables already exist in the target dataset)
De-dup the data yourself.
A workaround that achieves something similar, but not exactly what you asked, is to create views of the relevant tables from dataset A in dataset B.
These views will behave like proper tables in dataset B, and you don't have to worry about de-duplication, data loss, or setting up the data transfer again. The downside is that you will have to keep dataset A around.
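If it helps, that workaround is just one statement per table. A minimal sketch with the Python client (project, dataset and table names are placeholders; note that a view and the tables it references must be in the same location):

from google.cloud import bigquery

client = bigquery.Client(project="project-b")  # placeholder

# expose a table from dataset A (project A) as a view in dataset B (project B)
client.query("""
    CREATE OR REPLACE VIEW `project-b.dataset_b.campaign_table` AS
    SELECT * FROM `project-a.dataset_a.campaign_table`
""").result()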
Here's how I migrated the transfer service:
The transfer service was enabled in project B.
Once the data started to arrive in dataset B, the historical data (from the start until MIN(partition_date) - 1) was copied from dataset A to the appropriate partitions in dataset B.
The transfer service in project A was stopped after verifying the partition counts and row counts.
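For what it's worth, a rough sketch of that historical backfill, assuming day-partitioned tables in the same region and using partition decorators to copy one day at a time (projects, datasets, table name and date range are placeholders):

from datetime import date, timedelta
from google.cloud import bigquery

client = bigquery.Client()
src = bigquery.DatasetReference("project-a", "dataset_a")  # placeholders
dst = bigquery.DatasetReference("project-b", "dataset_b")

day = date(2022, 1, 1)    # placeholder: first partition in dataset A
last = date(2022, 6, 30)  # placeholder: MIN(partition_date) - 1 in dataset B
while day <= last:
    part = day.strftime("%Y%m%d")
    # copy a single partition using the $YYYYMMDD partition decorator
    client.copy_table(src.table(f"campaign_table${part}"),
                      dst.table(f"campaign_table${part}")).result()
    day += timedelta(days=1)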