Due to a change in the business I need to copy a whole BigQuery project from one account to another. The accounts are not related and it is not possible to link them in any way.
Through the CLI I was able to export a table in a dataset to Cloud Storage. Listing the tables in a dataset also looks possible, so looping over them shouldn't be a problem.
But I can't find any suitable way to manage the datasets themselves, neither for exporting them nor for creating them in the new account, so a lot of manual work is left.
Am I missing something? Is there a way to export the whole project with all its datasets, or will manual work always be required?
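For context, the per-table export I can already script looks roughly like this (just a sketch; the project, dataset and bucket names are placeholders):

from google.cloud import bigquery

# Sketch only: project, dataset and bucket names are placeholders.
client = bigquery.Client(project="source-project")

for table in client.list_tables("source-project.my_dataset"):
    destination_uri = "gs://my-export-bucket/my_dataset/" + table.table_id + "-*.csv"
    extract_job = client.extract_table(table.reference, destination_uri)
    extract_job.result()  # wait for each export to finish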
The data structure is not complex at all:
Project -> dataset -> table
                   -> table
                   -> ...
        -> dataset -> table
                   -> table
                   -> ...
        -> ...
You can't copy the whole project at once, but you can try to automate the copy with a Python script like this:
from google.cloud import bigquery
import os

source_project = "<your source project>"
new_project = "<your new project>"

# This assumes the credentials used in the new project also have access to the source project
client = bigquery.Client(project=source_project)

# List all the datasets in the source project and save their IDs in a list
datasets = []
for dataset in client.list_datasets():
    datasets.append(dataset.dataset_id)

# For each dataset, build the commands and then execute them
for dataset_id in datasets:
    # Create the dataset in the new project, then copy it with the BigQuery Data Transfer Service
    create_command = "bq mk -d " + new_project + ":" + dataset_id
    copy_command = (
        "bq mk --transfer_config --project_id=" + new_project
        + " --data_source=cross_region_copy --target_dataset=" + dataset_id
        + " --display_name='My Dataset Copy'"
        + " --params='{\"source_dataset_id\":\"" + dataset_id
        + "\",\"source_project_id\":\"" + source_project
        + "\",\"overwrite_destination_table\":\"true\"}'"
    )
    os.system(create_command)
    os.system(copy_command)
You can use the BigQuery Data Transfer Service for this. You can't copy your whole project at once, only dataset by dataset, but you can script this if you have a lot of datasets.
Be careful: you don't export from the source project to a target project, you import into the target project from the source project (I mean you have to define the transfer in the destination project).
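For illustration, a minimal sketch that creates the same kind of copy transfer with the Data Transfer Service Python client (google-cloud-bigquery-datatransfer) instead of the bq CLI; the project and dataset names are placeholders:

from google.cloud import bigquery_datatransfer

# Sketch only: project and dataset IDs are placeholders.
transfer_client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("destination-project"),
    transfer_config=bigquery_datatransfer.TransferConfig(
        destination_dataset_id="my_dataset",  # must already exist in the destination project
        display_name="My Dataset Copy",
        data_source_id="cross_region_copy",
        params={
            "source_project_id": "source-project",
            "source_dataset_id": "my_dataset",
            "overwrite_destination_table": True,
        },
    ),
)
print("Created transfer config:", transfer_config.name)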
To copy a dataset from one project to another, you can use the command below to create the transfer job:
bq mk --transfer_config --project_id=[PROJECT_ID] --data_source=[DATA_SOURCE] --target_dataset=[DATASET] --display_name=[NAME] --params='[PARAMETERS]'
where PROJECT_ID : the destination project ID
DATA_SOURCE : cross_region_copy
DATASET : the target dataset
NAME : the display name of your job
PARAMETERS : the source project ID, source dataset ID and other parameters can be defined (overwrite destination table, etc.)
You can go through the BigQuery documentation on copying datasets for a detailed explanation.
I have a file structure such as:
gs://BUCKET/Name/YYYY/MM/DD/Filename.csv
Every day my Cloud Functions create another path with another file in it, corresponding to the date of the day (so for today, 5th of August, we would have gs://BUCKET/Name/2022/08/05/Filename.csv).
I need to find a way to query this data in BigQuery automatically, so that if I want to query it for 'manual inspection' I can select, for example, data from all 3 months in one query, doing a CREATE TABLE with gs://BUCKET/Name/2022/{06,07,08}/*/*.csv
How can I achieve this? I know that BigQuery does not support more than one wildcard, but maybe there is a way to do it.
To query data in GCS from BigQuery you can use an external table.
The problem is that the following will fail, because you cannot have a comma (,) as part of a URI in the list:
CREATE EXTERNAL TABLE `bigquerydevel201912.foobar`
OPTIONS (
  format='CSV',
  uris = ['gs://bucket/2022/{1,2,3}/data.csv']
)
You have to specify the 3 CSV file locations like this:
CREATE EXTERNAL TABLE `bigquerydevel201912.foobar`
OPTIONS (
  format='CSV',
  uris = [
    'gs://inigo-test1/2022/1/data.csv',
    'gs://inigo-test1/2022/2/data.csv',
    'gs://inigo-test1/2022/3/data.csv'
  ]
)
Since you're using this sporadically, it probably makes more sense to create a temporary external table.
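A minimal sketch of that approach with the BigQuery Python client follows; the bucket paths are taken from the example above, and the table name 'foobar' only exists for the duration of the query:

from google.cloud import bigquery

# Sketch only: bucket paths and the dataset are placeholders.
client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [
    "gs://inigo-test1/2022/1/data.csv",
    "gs://inigo-test1/2022/2/data.csv",
    "gs://inigo-test1/2022/3/data.csv",
]
external_config.autodetect = True

# The external data is only defined for this query; no permanent table is created
job_config = bigquery.QueryJobConfig(table_definitions={"foobar": external_config})
results = client.query("SELECT * FROM foobar", job_config=job_config).result()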
I found a solution that works, at least for my use case, without using an external table.
When creating the table in the BigQuery dataset, use 'Create table from: Google Cloud Storage' and for the URI pattern use gs://BUCKET/Name/2022/*. As long as the filename is the same in each subfolder and the schema is identical, BigQuery will load everything, and then you can perform date operations directly in BigQuery (I have a column with the ingestion date).
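If you prefer to script the same load instead of using the console, a rough sketch with the Python client could look like this; my_dataset.my_table and the bucket path are placeholders:

from google.cloud import bigquery

# Sketch only: bucket, dataset and table names are placeholders.
client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# The single trailing wildcard matches every file under the 2022 prefix
load_job = client.load_table_from_uri(
    "gs://BUCKET/Name/2022/*",
    "my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete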
I am using the BigQuery EXPORT DATA statement to create files in Cloud Storage for another team to pick up for further reprocessing. I am using the statement below (not pasting the SELECT query as it's huge).
EXPORT DATA OPTIONS(
uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter='|') AS
SELECT
I see the below files getting created in my Cloud Storage bucket:
radhika_sharma_ibm#cloudshell:~ (whr-asia-datalake-nonprod)$ gsutil ls gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_000000000000.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_000000000001.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_000000000002.csv
I cannot remove the suffix part as BigQuery creates it, but I am wondering if I can create files with the date in the file name so the other team can identify what date they were created for.
That is, something like:
Customer_Master_04022021_000000000000_.csv
I need to have a date in my file name. Any help or inputs, please?
Is there a workaround, or will I have to use a Dataflow job to extract the data from the table into a file?
You can use the uri value as:
'gs://bucket/folder/your_filename-'||current_datetime()||'-*.csv'
Either CURRENT_DATE() or CURRENT_DATETIME() can be used.
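Putting it together with the statement from the question, a sketch run through the Python client might look like this; the SELECT is only a placeholder for the omitted query, and CURRENT_DATE() is cast to STRING before concatenation:

from google.cloud import bigquery

# Sketch only, following the approach above: the SELECT is a placeholder for the
# omitted query, and the bucket path is copied from the question.
client = bigquery.Client()

export_sql = """
EXPORT DATA OPTIONS(
  uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'
      || CAST(CURRENT_DATE() AS STRING) || '_*.csv',
  format='CSV',
  overwrite=true,
  header=true,
  field_delimiter='|') AS
SELECT * FROM `my_dataset.customer_master`  -- placeholder for the real query
"""

client.query(export_sql).result()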
Thanks
Is it possible to do BigQuery scripting in the Airflow BigQueryOperator (Airflow 1.10.12)? Did someone manage to do it?
I tried something like this:
test = BigQueryOperator(
    task_id='test',
    sql="""DECLARE aaa STRING;
           SET aaa = 'data';
           CREATE OR REPLACE TABLE `project-id.dataset-id.TEST_DDL` as select aaa as TEST;""",
    use_legacy_sql=False,
    create_disposition=False,
    write_disposition=False,
    schema_update_options=False,
    location='EU')
But all I get is a 'Not found: Dataset was not found in location US at [3:9]'
Actually I found the issue, and it IS related to the BigQueryOperator. When scripting, there are no referenced tables and no destination table in the BigQuery insert job, and in that case BigQuery sets the job location to US by default. In my case, as my datasets are in the EU, the job fails. There is a location parameter in the BigQueryOperator, but it is wrongly passed by the operator in the configuration object of the job instead of in the job reference object, which makes it useless. The issue is fixed in Airflow 2.
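For reference, a rough sketch of the same script under Airflow 2, where BigQueryInsertJobOperator passes the location correctly; the project, dataset and task ids are placeholders:

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Sketch only: project, dataset and task ids are placeholders.
test = BigQueryInsertJobOperator(
    task_id='test_scripting',
    configuration={
        "query": {
            "query": """DECLARE aaa STRING;
                        SET aaa = 'data';
                        CREATE OR REPLACE TABLE `project-id.dataset-id.TEST_DDL` AS SELECT aaa AS TEST;""",
            "useLegacySql": False,
        }
    },
    location='EU',
)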
I have a Dataprep flow configured. The dataset is a GCS folder (all files from it). The target is a BigQuery table.
Since the data comes from multiple files, I want to have the filename as one of the columns in the resulting data.
Is that possible?
UPDATE: There's now a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps. (If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface)
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
Original Answer
This is not currently possible out of the box. If you're manually merging datasets with UNION, you could first process them to add a column with the source, so that it's then present in the combined output.
If you're bulk-ingesting files, that doesn't help, but there is an open feature request that you can comment on and/or follow for updates:
https://issuetracker.google.com/issues/74386476
After uploading some files to my project and creating a catalog, I can see the list of files in the Find and Add Data section. However, there is no 'Insert to code' link. This is true for files of type csv, json, and tar.gz, as well as for a data set from a catalog. What am I doing wrong?
The Insert to Code option is only available for data that you upload to the Object Storage service.
I see that you are using the Catalog for storage in DSX.
The Catalog is still in beta, and Insert to code is currently not supported for Catalog data assets.
Feel free to add an enhancement request here:-
https://datascix.uservoice.com/forums/387207-general
If you create a project with Object Storage as the storage, you will see Insert to code for CSV files.
For reading from the Catalog, you will need to use ProjectUtil.
A Catalog data asset is considered a resource of the project, so to access it you need an access token.
So the first step is to generate the token to access the Catalog resource.
Go to Project Settings and create an access token, then clear the next cell,
click 'Insert project token' from the three dots above in the notebook, and
you will see code generated as below.
The generated code just creates the project context.
import com.ibm.analytics.projectNotebookIntegration._
val pc = ProjectUtil.newProjectContext(sc, "994b03fa-XXXXXX", "p-XXXXXXXXXX")
Let's make a list of the available files.
val fileList = ProjectUtil.listAvailableFilesData(pc)
fileList.indices.foreach( i => println(i + ": " + fileList(i)))
So fileList contains your filenames.
You can directly use the name of the file as the second argument.
val df = ProjectUtil.loadDataFrameFromFile(pc, fileList(1))
or
val df1 = ProjectUtil.loadDataFrameFromFile(pc, "co2.csv")
You will see below:-
"Creating DataFrame, this will take a few moments...
DataFrame created."
Run df.show() and you will see the content.
Full Notebook:-
https://github.com/charles2588/bluemixsparknotebooks/blob/master/scala/Read_Write_Catalog_Scala.ipynb
The doc below also has Python and R examples.

Ref for projectUtil:- https://datascience.ibm.com/docs/content/local/notebookfunctionsload.html
Thanks,
Charles.