How to create files having date in the file name using big query export data statement


I am using the BigQuery EXPORT DATA statement to create files in Cloud Storage for another team to pick up for further processing. I am using the statement below; I have not pasted the SELECT query because it is huge.
EXPORT DATA OPTIONS(
uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter='|') AS
SELECT
I see the following files being created in my Cloud Storage bucket:
radhika_sharma_ibm#cloudshell:~ (whr-asia-datalake-nonprod)$ gsutil ls gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_000000000000.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_000000000001.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_000000000002.csv
I cannot remove the suffix because BigQuery generates it, but I am wondering whether I can create files with the date in the file name, so that the other team can identify which date each file was created for.
For example:
Customer_Master_04022021_000000000000_.csv
I need to have a date in the file name. Any help or input would be appreciated.
Is there a workaround, or will I have to use a Dataflow job to extract the data from the table into a file?

You can build the uri value by concatenation:
'gs://bucket/folder/your_filename-'||current_datetime()||'-*.csv'
Either CURRENT_DATE() or CURRENT_DATETIME() can be used.
Thanks
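If the statement is submitted from a script, another option is to compute the date on the client and embed it in the URI literal. The following is a minimal Python sketch of that idea, not something taken from the answer above: the function names and the bucket path are hypothetical, and the DDMMYYYY layout is assumed from the example file name in the question.

```python
from datetime import date


def build_export_uri(prefix: str, run_date: date, extension: str = "csv") -> str:
    """Return a single-wildcard URI with the date placed before the shard suffix."""
    stamp = run_date.strftime("%d%m%Y")  # day, month, year, e.g. 04022021
    return f"{prefix}_{stamp}_*.{extension}"


def build_export_statement(uri: str, select_sql: str) -> str:
    """Assemble an EXPORT DATA statement using the options shown in the question."""
    return (
        "EXPORT DATA OPTIONS(\n"
        f"  uri='{uri}',\n"
        "  format='CSV',\n"
        "  overwrite=true,\n"
        "  header=true,\n"
        "  field_delimiter='|') AS\n"
        f"{select_sql}"
    )


if __name__ == "__main__":
    # Hypothetical bucket path and query, for illustration only.
    uri = build_export_uri("gs://bucket/folder/Customer_Master", date(2021, 2, 4))
    print(build_export_statement(uri, "SELECT 1 AS placeholder_column"))
```

BigQuery still appends its own numeric shard suffix where the * is, so the date only changes the part of the name before it.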

Related

Loading multiple files from multiple paths to Big Query

I have a file structure such as:
gs://BUCKET/Name/YYYY/MM/DD/Filename.csv
Every day my Cloud Functions create another path, with another file in it, corresponding to that day's date. For example, for today (5 August) we would have gs://BUCKET/Name/2022/08/05/Filename.csv
I need a way to make this data queryable from BigQuery automatically, so that for manual inspection I can select, for example, data from three months in one query, by doing a CREATE TABLE with gs://BUCKET/Name/2022/{06,07,08}/*/*.csv
How can I replicate this? I know that BigQuery does not support more than one wildcard, but maybe there is a way to do so.

To query data in GCS from BigQuery, you can use an external table.
The problem is that the following will fail, because you cannot have a comma (,) inside a URI in the list:
CREATE EXTERNAL TABLE `bigquerydevel201912.foobar`
OPTIONS (
format='CSV',
uris = ['gs://bucket/2022/{1,2,3}/data.csv']
)
You have to specify the 3 CSV file locations like this:
CREATE EXTERNAL TABLE `bigquerydevel201912.foobar`
OPTIONS (
format='CSV',
uris = [
'gs://inigo-test1/2022/1/data.csv',
'gs://inigo-test1/2022/2/data.csv',
'gs://inigo-test1/2022/3/data.csv']
)
Since you are using this only sporadically, it probably makes more sense to create a temporary external table.
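Because brace patterns such as {06,07,08} are not expanded, the list of URIs has to be written out explicitly. A small Python sketch of generating that list and the matching DDL could look like this. The helper names, bucket and table are hypothetical, and it assumes that a single trailing * per URI also matches the day subfolders, which should be checked against your own bucket layout.

```python
from __future__ import annotations


def month_uris(base: str, year: int, months: list[int]) -> list[str]:
    """Expand a set of months into one single-wildcard URI per month."""
    return [f"{base}/{year}/{month:02d}/*" for month in months]


def external_table_ddl(table: str, uris: list[str]) -> str:
    """Render a CREATE EXTERNAL TABLE statement with an explicit uris list."""
    uri_lines = ",\n".join(f"    '{uri}'" for uri in uris)
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        "OPTIONS (\n"
        "  format='CSV',\n"
        f"  uris = [\n{uri_lines}\n  ]\n"
        ")"
    )


if __name__ == "__main__":
    # Hypothetical bucket and table names, for illustration only.
    uris = month_uris("gs://BUCKET/Name", 2022, [6, 7, 8])
    print(external_table_ddl("my_dataset.foobar", uris))
```

Each generated URI contains exactly one wildcard, which keeps it within the one-wildcard limit mentioned in the question.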

I found a solution that works, at least for my use case, without using an external table.
When creating the table in the BigQuery dataset, choose "create table from: GCS" and use the URI pattern gs://BUCKET/Name/2022/*. As long as the filename is the same in each subfolder and the schema is identical, BigQuery will load everything, and you can then perform date operations directly in BigQuery (I have a column with the ingestion date).

BigQuery EXPORT DATA statement creating multiple files with no data and just a header record

I have read about a similar issue here, but I cannot tell whether it has been fixed:
Google bigquery export table to multiple files in Google Cloud storage and sometimes one single file
I am using the BigQuery EXPORT DATA statement below to export data from two tables into a file. I have written a SELECT query for this.
EXPORT DATA OPTIONS(
uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'||CURRENT_DATE()||'*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter='|') AS
SELECT
My SELECT query returns only 2 rows, so I assumed that only one file would be created in Cloud Storage. As I understand it, multiple files are created only when the data is larger than 1 GB.
However, 3 files were created in Cloud Storage: 2 files contained only the header record, and the third contained 3 records (one header and 2 data records).
radhika_sharma_ibm#cloudshell:~ (whr-asia-datalake-nonprod)$ gsutil ls gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000000.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000001.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000002.csv
Why are empty files being created?
Can anyone please help? We don't want to create empty files. I believe only one file should be created when the data is up to 1 GB; above 1 GB we should get multiple files, but not empty ones.

You have to force all the data to be loaded into a single worker. That way only one file is exported (if the data is under 1 GB).
My workaround: add a SELECT DISTINCT * on top of the SELECT statement.

Under the hood, BigQuery uses multiple workers to read and process different sections of the data, and when a wildcard is used, each worker creates a separate output file.
Currently, BigQuery produces output files even if no data is returned, which is why we get multiple empty files. The BigQuery product team is aware of this issue and is working on a fix, but there is no ETA that can be shared.
There is a public issue tracker entry that will be updated periodically with progress. You can star the issue to receive automatic updates and give it traction by referring to this link.
In the meantime, I would like to offer the following workaround:
If you know that the output will be less than 1 GB, you can specify a single URI to get a single output file. However, the EXPORT DATA statement does not support a single URI.
Instead, you can use the bq extract command to export the BigQuery table:
bq --location=location extract \
--destination_format format \
--compression compression_type \
--field_delimiter delimiter \
--print_header=boolean \
project_id:dataset.table \
gs://bucket/filename.ext
In fact, bq extract should not have the empty-file issue that the EXPORT DATA statement has, even when you use a wildcard URI.
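When the export is scripted, the bq extract template above can be assembled as an argument list so that values such as the | delimiter do not need manual shell quoting. This is only a sketch: the function name and all argument values are hypothetical placeholders, and only the flags shown in the template are used.

```python
from __future__ import annotations

import shlex


def build_bq_extract_args(
    location: str,
    source_table: str,
    destination_uri: str,
    destination_format: str = "CSV",
    field_delimiter: str = "|",
    print_header: bool = True,
    compression: str | None = None,
) -> list[str]:
    """Build the argv list for the bq extract template shown above."""
    args = [
        "bq",
        f"--location={location}",
        "extract",
        "--destination_format",
        destination_format,
    ]
    if compression is not None:
        # The compression flag is only added when a type is given.
        args += ["--compression", compression]
    args += [
        "--field_delimiter",
        field_delimiter,
        f"--print_header={'true' if print_header else 'false'}",
        source_table,
        destination_uri,
    ]
    return args


if __name__ == "__main__":
    # Placeholder location, table and bucket; replace with real values.
    args = build_bq_extract_args(
        "US", "project_id:dataset.table", "gs://bucket/filename.csv"
    )
    # Print a shell-quoted form for review; pass `args` to subprocess.run to execute.
    print(shlex.join(args))
```

Note that bq extract works on a table rather than a query, so the SELECT result would first need to be written to a table.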

I faced the same empty-files issue when using EXPORT DATA.
After a bit of R&D, I found a solution: put LIMIT xxx in your SELECT statement and it will do the trick.
You can find the row count first and use that as the LIMIT value.
SELECT ....
FROM ...
WHERE ...
LIMIT xxx

It turns out you have to use the wildcard syntax, which allows multiple files: either a file pattern for CSV, or a folder for other formats such as Avro.
The uri option must be a single-wildcard URI, as described here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements

Specifying a wildcard seems to start several workers on the export and, as per the documentation, the size of the exported files will vary.
Zero-length files are unusual but technically possible if the first worker finishes before the others have really started. This is why the wildcard is expected to be used only when you think your exported data will be larger than 1 GB.
I have just faced the same issue with Parquet, but found that the bq CLI works, which should do for any format.
See (and star for traction) https://issuetracker.google.com/u/1/issues/181016197

Power BI dynamic csv file path change

I need to change the CSV file path dynamically every day on refresh. For example, today it would be path/filename_01_Dec_20.csv and tomorrow it should be path/filename_02_Dec_20.csv, changing daily in this way. Please let me know if this can be done.

Write the query for one of the CSV files, but modify the code where the file path is specified, from e.g.
.../filename_01_Dec_20.csv"...
To
.../filename_" & Date.ToText(Date.From(DateTime.LocalNow()), "dd_MMM_yy") & ".csv"...
This just puts the current date DateTime.LocalNow() in the format you're looking for using the Date.ToText formatting function.
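To check which file name a given day should resolve to, the same dd_MMM_yy pattern can be reproduced outside Power BI. The Python sketch below is only an illustration with hypothetical names; it uses a fixed English month table so the result does not depend on the machine's locale.

```python
from datetime import date

# English month abbreviations, kept explicit to avoid locale-dependent output.
MONTH_ABBR = (
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
)


def dated_csv_path(folder: str, stem: str, day: date) -> str:
    """Return folder/stem_dd_MMM_yy.csv for the given day."""
    stamp = f"{day.day:02d}_{MONTH_ABBR[day.month - 1]}_{day.year % 100:02d}"
    return f"{folder}/{stem}_{stamp}.csv"


if __name__ == "__main__":
    print(dated_csv_path("path", "filename", date.today()))
```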

Athena - CTAS file name

I used Athena's CTAS and INSERT commands, and Avro files were created at the external_location.
But the file names are very strange, and the filename extension is missing. (The files have no extension at all, only a strange name that looks like a hash code.)
How can I define a naming rule for the files Athena writes?
Thank you.

As stated on page 20 of AWS Athena's manual, ..."This location in Amazon S3 comprises all of the files representing your table. For more information, see Using Folders in the Amazon Simple Storage Service Console User Guide."...
Reference:
https://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf
So, no, you can't define the name of the file (or files, because more than one may be needed to represent a table). The right way to think about it is that the BUCKET/PATH is what represents the file name, or the output table.
It is easy to get confused because you're generating an Avro file, which really is a file, like Parquet, but remember that Athena can also output other formats, which may be multi-file.

Google Dataprep: Save GCS file name as one of the column

I have a Dataprep flow configured. The dataset is a GCS folder (all files in it). The target is a BigQuery table.
Since the data comes from multiple files, I want to have the filename as one of the columns in the resulting data.
Is that possible?

UPDATE: There's now a source metadata reference called $filepath, which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps. (If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface.)
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
Original Answer
This is not currently possible out of the box. If you're manually merging datasets with UNION, you could first process them to add a column with the source, so that it's then present in the combined output.
If you're bulk-ingesting files, that doesn't help, but there is an open feature request that you can comment on and/or follow for updates:
https://issuetracker.google.com/issues/74386476
