DataPrep: access to source filename - google-cloud-platform

DataPrep: access to source filename - google-cloud-platform

Is there a way to create a column with filename of the source that created each row ?
Use-Case: I would like to track which file in a GCS bucket resulted in the creation of which row in the resulting dataset. I would like a scheduled transformation of the files contained in a specific GCS bucket.
I've looked at the "metadata article" on GCP but it is pretty useless for my use-case.
UPDATED: I have opened a feature request with Google.

While they haven't closed that issue yet, this was part of the update last week.
There's now a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps.
There are some caveats, such as it not returning a value for BigQuery sources and not persisting through pivot, join, or unnest . . . but it covers the vast majority of use cases handily, and in other cases you just need to materialize it before some of those destructive transforms.
NOTE: If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148

Related

What is AWS S3 dataset?

Looking at documentation of awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter.
From testing, it looks like setting dataset=True allows, among other things, to append new data to an already existing set. It also looks like when dataset=True, I can't specify the file name and AWS autogenerates the names for the files which are added to the specified path.
Apart from that, I can't find more information on what dataset means. Is it just referring to the general concept or is there a specific meaning within the context of AWS? What exactly is dataset and when should it be set to True?

The dataset=True option allows you to store the entire dataset, including all metadata, indexes, etc.
The dataset parameter documentation:
dataset (bool) – If True store as a dataset instead of ordinary file(s) If True, enable all follow arguments: partition_cols, mode, database, table, description, parameters, columns_comments, concurrent_partitioning, catalog_versioning, projection_enabled, projection_types, projection_ranges, projection_values, projection_intervals, projection_digits, catalog_id, schema_evolution.
Note all those extra things that get saved when you save a dataset. All that information, like columns_comments, concurrent_partitioning, projection_values, will be lost when you save to CSV or Parquet. But on the other hand, those values are probably only useful if you plan to do further manipulation of the data via awswrangler/pandas at some later date.
Also note that if you set dataset=True you have to give it a file name prefix instead of a single file name, because the output generated will be spread across multiple files.
If you want to use the data in any other tool besides Pandas, such as loading the CSV into Excel, then you most likely want to set dataset=False and output to a single file.

Why are tables segmented when exporting to parquet from AWS RDS

We use Python's boto3 library to execute start_export_task to trigger a RDS snapshot export (to S3). This successfully generates a directory in S3 that has a predicable, expected structure. Traversing down through that directory to any particular table directory (as in export_identifier/database_name/schema_name.table_name/) I see several .parquet files.
I download several of these files and convert them to pandas dataframes so I can look at them. They are all structured the same and seem to clearly be pieces of the same table. But they range in size from 100KB to 8MB in seemingly unpredictable size segments. Do these files/'pieces' of the table account for all its rows? Do they repeat/overlap at all? Why are they segmented so (seemingly) randomly? What parameters control this segmenting?
Ultimately I'm looking for documentation on this part of parquet folder/file structure. I've found plenty of information on how individual files are structured and partitioning. But I think this falls slightly outside of those topics.

You're not going to like this, but from AWS' perspective this is an implementation detail and according to the docs:
The file naming convention is subject to change. Therefore, when reading target tables we recommend that you read everything inside the base prefix for the table.
— docs
Most of the tools that work with Parquet don't really care about the number or file names of the parquet files. You just point something like Spark or Athena to the prefix of the table and it will read all the files and figure out how they fit together.
In the API there are also no parameters to influence this behavior. If you prefer a single file for aesthetic reasons or others, you could use something like a Glue Job to read the table prefixes, coalesce the data per table in a single file and write it to S3.

Big query EXPORT DATA statement creating mutiple files with no data and just header record

I have read similar issue here but not able to understand if this is fixed.
Google bigquery export table to multiple files in Google Cloud storage and sometimes one single file
I am using below big query EXPORT DATA OPTIONS to export the data from 2 tables in a file. I have written select query for the same.
EXPORT DATA OPTIONS(
uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'||CURRENT_DATE()||'*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter='|') AS
SELECT
I have only 2 rows returning from my select query and I assume that only one file should be getting created in google cloud storage. Multiple files are created only when data is more than 1 GB. thats what I understand.
However, 3 files got created in cloud storage where 2 files just had the header record and the third file has 3 records(one header and 2 actual data record)
radhika_sharma_ibm#cloudshell:~ (whr-asia-datalake-nonprod)$ gsutil ls gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000000.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000001.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000002.csv
Why empty files are getting created?
Can anyone please help? We don't want to create empty files. I believe only one file should be created when it is 1 GB. more than 1 GB, we should have multiple files but NOT empty.

You have to force all data to be loaded into one worker. In this way you will be exporting only one file (if <1Gb).
My workaround: add a select distinct * on top of the Select statement.

Under the hood, BigQuery utilizes multiple workers to read and process different sections of data and when we use wildcards, each worker would create a separate output file.
Currently BigQuery produces empty files even if no data is returned and thus we get multiple empty files. The Bigquery product team is aware of this issue and they are working to fix this, however there is no ETA which can be shared.
There is a public issue tracker that will be updated with periodic progress. You can STAR the issue to receive automatic updates and give it traction by referring to this link.
However for the time being I would like to provide a workaround as follows:
If you know that the output will be less than 1GB, you can specify a single URI to get a single output file. However, the EXPORT DATA statement doesn’t support Single URI.
You can use the bq extract command to export the BQ table.
bq --location=location extract \
--destination_format format \
--compression compression_type \
--field_delimiter delimiter \
--print_header=boolean \
project_id:dataset.table \
gs://bucket/filename.ext
In fact bq extract should not have the empty file issue like the EXPORT DATA statement even when you use Wildcard URI.

I faced the same empty files issue when using EXPORT DATA.
After doing a bit of R&D found the solution. Put LIMIT xxx in your SELECT SQL and it will do the trick.
You can find the count, and put that as LIMIT value.
SELECT ....
FROM ...
WHERE ...
LIMIT xxx

It turns out you need to enforce multiple files, wildcard syntax. Either a file for CSV or folder for other like AVRO.
The uri option must be a single-wildcard URI as described
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements

Specifying a wildcard seems to start several workers to work on the extract, and as per the documentation, size of the exported files will vary.
Zero-length files is unusual but technically possible if the first worker is done before any other really get started. Hence why the wildcard is expected to be used only when you think your exported data will be larger than the 1 GB
I have just faced the same with Parquet but found out that bq CLI works, which should do for any format.
See (and star for traction) https://issuetracker.google.com/u/1/issues/181016197

Is there a way to pass data source connection string as a parameter to power bi embedded?

I have a pbix file that takes an Azure Storage account as a parameter and reads data from there accordingly. The next step is to be able to embed this powerbi dashboard on a webpage and let the end user specify the storage account. I see a lot of questions and answers surrounding passing in filter query parameters--this is different, we're trying to read from a completely different data source and not filtering on a static data source.
Another way to ask this question is: is there a way to embed powerbi template files, if not, is there a feature request somewhere we can upvote?

The short answer is no.
There is a reason to use filters in this case instead of parameters. Parameters are something that is part of the report itself. Each users that looks at your reports will get the same parameter values as the others. If one of them changes some parameter, this will affect all other users. Filters on the other hand, is something local for your session. You can filter the report the way you like, and this will not affect other users experience in any way.
You can't embed templates, because template is simply a state of the report on the disk. When you open it, it's not a template anymore, but becomes a report.
You can either combine the data from all of your data sources in a single report, adding one more column to indicate from where this data comes from, and then filter on this new column. Or create/modify ETL process (for example dataflows can be used for this) to combine these data sources into a single one.

Google Dataprep: Save GCS file name as one of the column

I have a Dataprep flow configured. The Dataset is a GCS folder (all files from it). Target is BigQuery table.
Since data is coming from multiple files, I want to have filename as of the columns in the resulting data.
Is that possible?

UPDATE: There's now a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps. (If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface)
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
Original Answer
This is not currently possible out of the box. IF you're manually merging datasets with UNION, you could first process them to add a column with the source so that it's then present in the combined output.
If you're bulk-ingesting files, that doesn't help—but there is an open feature request open that you can comment on and/or follow for updates:
https://issuetracker.google.com/issues/74386476

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js