DSX: Insert to code link is missing - data-science-experience

After uploading some files to my project and creating a catalog, I can see the list of files in the Find and Add Data section. However, there is no link Insert to code. This is true for files of type csv, json, tar.gz as well as for a data set from a catalog. What am I doing wrong?

Insert to Code Option is only available for data that you upload in Object Storage service.
I see that you are using Catalog for storage in DSX.
Catalog is still in beta state and currently insert to code is not added or supported for Catalog data assets.
Feel free to add enhancement request here:-
https://datascix.uservoice.com/forums/387207-general
If you create a project with Object storage as storage , you will see the insert to code for csv files.
For reading from catalog , you will need to use projectUtil.
Catalog data asset is considered as a resource of project so to access it you would need access token.
So first step, generate the token to access the catalog resource.
Go to Project Settings and create access token and then clear next cell and
click insert project token from those 3 dots above in notebook and
you will see code generated as below
The generated code just creates project context.
import com.ibm.analytics.projectNotebookIntegration._
val pc = ProjectUtil.newProjectContext(sc, "994b03fa-XXXXXX", "p-XXXXXXXXXX")
Lets make list of available files.
val fileList = ProjectUtil.listAvailableFilesData(pc)
fileList.indices.foreach( i => println(i + ": " + fileList(i)))
So the fileList contains your filenames.
You can directly use the name of the file as second argument.
val df = ProjectUtil.loadDataFrameFromFile(pc, fileList(1))
or
val df1 = ProjectUtil.loadDataFrameFromFile(pc, "co2.csv")
You will see below:-
"Creating DataFrame, this will take a few moments...
DataFrame created."
df.show() and you will see content.
Full Notebook:-
https://github.com/charles2588/bluemixsparknotebooks/blob/master/scala/Read_Write_Catalog_Scala.ipynb
The below doc also has python and R examples.

Ref for projectUtil:- https://datascience.ibm.com/docs/content/local/notebookfunctionsload.html
Thanks,
Charles.

Related

How to create files having date in the file name using big query export data statement

I am using BIG QUERY EXPORT DATA statement to create files in cloud storage for an another team to extract for further reprocessing. I am using below statement, not pasting the select query as its huge.
EXPORT DATA OPTIONS(
uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter='|') AS
SELECT
I see below files getting created in my cloud storage bucket
radhika_sharma_ibm#cloudshell:~ (whr-asia-datalake-nonprod)$ gsutil ls gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_000000000000.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_000000000001.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_000000000002.csv
I cannot remove the suffix part as BIG QUERY creates it, but I am wondering if I can create files with DATE in the file name for the other team to identify what date it is created for??
That is like
Customer_Master_04022021_000000000000_.csv
I need to have a date in my file. Any help or inputs please?
Is there a work around or I will have to go with a data flow here that is using a data flow job to extract data from table in a file.
You can use the uri value as:
'gs://bucket/folder/your_filename-'||current_datetime()||'-*.csv'
Either Current_date() or current_datetime() can be used.
Thanks

Big query EXPORT DATA statement creating mutiple files with no data and just header record

I have read similar issue here but not able to understand if this is fixed.
Google bigquery export table to multiple files in Google Cloud storage and sometimes one single file
I am using below big query EXPORT DATA OPTIONS to export the data from 2 tables in a file. I have written select query for the same.
EXPORT DATA OPTIONS(
uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'||CURRENT_DATE()||'*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter='|') AS
SELECT
I have only 2 rows returning from my select query and I assume that only one file should be getting created in google cloud storage. Multiple files are created only when data is more than 1 GB. thats what I understand.
However, 3 files got created in cloud storage where 2 files just had the header record and the third file has 3 records(one header and 2 actual data record)
radhika_sharma_ibm#cloudshell:~ (whr-asia-datalake-nonprod)$ gsutil ls gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000000.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000001.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000002.csv
Why empty files are getting created?
Can anyone please help? We don't want to create empty files. I believe only one file should be created when it is 1 GB. more than 1 GB, we should have multiple files but NOT empty.
You have to force all data to be loaded into one worker. In this way you will be exporting only one file (if <1Gb).
My workaround: add a select distinct * on top of the Select statement.
Under the hood, BigQuery utilizes multiple workers to read and process different sections of data and when we use wildcards, each worker would create a separate output file.
Currently BigQuery produces empty files even if no data is returned and thus we get multiple empty files. The Bigquery product team is aware of this issue and they are working to fix this, however there is no ETA which can be shared.
There is a public issue tracker that will be updated with periodic progress. You can STAR the issue to receive automatic updates and give it traction by referring to this link.
However for the time being I would like to provide a workaround as follows:
If you know that the output will be less than 1GB, you can specify a single URI to get a single output file. However, the EXPORT DATA statement doesn’t support Single URI.
You can use the bq extract command to export the BQ table.
bq --location=location extract \
--destination_format format \
--compression compression_type \
--field_delimiter delimiter \
--print_header=boolean \
project_id:dataset.table \
gs://bucket/filename.ext
In fact bq extract should not have the empty file issue like the EXPORT DATA statement even when you use Wildcard URI.
I faced the same empty files issue when using EXPORT DATA.
After doing a bit of R&D found the solution. Put LIMIT xxx in your SELECT SQL and it will do the trick.
You can find the count, and put that as LIMIT value.
SELECT ....
FROM ...
WHERE ...
LIMIT xxx
It turns out you need to enforce multiple files, wildcard syntax. Either a file for CSV or folder for other like AVRO.
The uri option must be a single-wildcard URI as described
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements
Specifying a wildcard seems to start several workers to work on the extract, and as per the documentation, size of the exported files will vary.
Zero-length files is unusual but technically possible if the first worker is done before any other really get started. Hence why the wildcard is expected to be used only when you think your exported data will be larger than the 1 GB
I have just faced the same with Parquet but found out that bq CLI works, which should do for any format.
See (and star for traction) https://issuetracker.google.com/u/1/issues/181016197

Clone BigQuery Project to another account

Due to a change in the business I need to copy a whole BigQuery project from one account to another, also, the accounts are not related and is not possible to link it in any way.
Throughout the CLI I was able to export a table to Cloud Storage in a dataset. Also, list tables in a dataset looks possible so loop over it shouldn't be a problem.
But I can't find any suitable way to manage the datasets neither for exporting or creating in the new account so it left a lot of manual task.
I'm missing something? There is a way to export the whole project with all datasets or a manual task will be always required?
The data structure is not complex at all:
Project -> dataset -> table
-> table
-> ...
-> dataset -> table
-> table
-> ...
-> ...
You can't copy the whole project at once but you can try to automate the copy using a script in Python like this:
from google.cloud import bigquery
import os
source_project = "<your source project>"
new_project = "<your new project>"
#I suppose that you have access to the source project in your new project
client = bigquery.Client(project=source_project)
datasets = []
#List all the datasets in the source project and save it in a list
for i in client.list_datasets():
datasets.append(i.dataset_id)
#For all the datasets, build the commands and then execute them
for i in datasets:
create_command = "bq mk -d " + i
copy_command = "bq mk --transfer_config --project_id=" + new_project + " --data_source=cross_region_copy --target_dataset=" + i + " --display_name='My Dataset Copy' --params='{\"source_dataset_id\":\"" + i + "\",\"source_project_id\":\"" + source_project + "\",\"overwrite_destination_table\":\"true\"}'"
os.system(create_command)
os.system(copy_command)
You can use the Bigquery Data Transfer service for this. You can't copy all your project, but dataset per dataset. You can script this if you have a lot of dataset.
Be careful, you don't export from the source project to a target project, you import into the target project from the source project (I mean you have to define the transfert in the destination project)
To copy the dataset from one project to another project then you can use the below command to make the transfer job:
bq mk --transfer_config --project_id=[PROJECT_ID] --data_source=[DATA_SOURCE] --target_dataset=[DATASET] --display_name=[NAME] --params='[PARAMETERS]'
where PROJECT_ID : The destination project_ID
DATA_SOURCE : cross_region_copy
DATASET : Target dataset
NAME : Display name of your job.
PARAMETERS : Source project ID, Source Dataset ID and other parameteres can be defined( overwrite destination table etc.)
You can go through this link for detailed explanation.

Google Dataprep: Save GCS file name as one of the column

I have a Dataprep flow configured. The Dataset is a GCS folder (all files from it). Target is BigQuery table.
Since data is coming from multiple files, I want to have filename as of the columns in the resulting data.
Is that possible?
UPDATE: There's now a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps. (If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface)
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
Original Answer
This is not currently possible out of the box. IF you're manually merging datasets with UNION, you could first process them to add a column with the source so that it's then present in the combined output.
If you're bulk-ingesting files, that doesn't help—but there is an open feature request open that you can comment on and/or follow for updates:
https://issuetracker.google.com/issues/74386476

DataPrep: access to source filename

Is there a way to create a column with filename of the source that created each row ?
Use-Case: I would like to track which file in a GCS bucket resulted in the creation of which row in the resulting dataset. I would like a scheduled transformation of the files contained in a specific GCS bucket.
I've looked at the "metadata article" on GCP but it is pretty useless for my use-case.
UPDATED: I have opened a feature request with Google.
While they haven't closed that issue yet, this was part of the update last week.
There's now a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps.
There are some caveats, such as it not returning a value for BigQuery sources and not persisting through pivot, join, or unnest . . . but it covers the vast majority of use cases handily, and in other cases you just need to materialize it before some of those destructive transforms.
NOTE: If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148