Google Cloud Data Fusion - Dynamic arguments based on functions

Google Cloud Data Fusion - Dynamic arguments based on functions - google-cloud-platform

Good morning all,
I'm looking in Google Data Fusion for a way to make dynamic the name of a source file stored on GCS. The files to be processed are named according to their value date, example: 2020-12-10_data.csv
My need would be to set the filename dynamically so that the pipeline uses the correct file every day (something like this: ${ new Date(). Getfullyear()... }_data.csv
I managed to use the arguments in runtime by specifying the date as a string (2020-12-10) but not with a function.
And more generally is there any documentation on how to enter dynamic parameters with ready-made or custom "functions" (I couldn't find it)
Thanks in advance for your help.

There is a readymade workaround, you can give a try "BigQuery Execute" plugin.
Steps:
Put below query in SQL
select cast(current_date as string) ||'_data.csv' as filename
--for output '2020-12-15_data.csv'
Row As Arguments to 'true'
Now use the above arguments via ${filename} wherever you want to.

Related

Passing dynamic arguments into another pipeline Googel Data Fusion

I am trying to query big query to get max date of two columns in google data fusion and pass the result into another pipeline as run time arguments.
select max(datecolumn1) as passargmnt1, max(datecolumn2) as passargmnt2 from dummy table;
upon research, it looks like Bigquery Argument setter might help...but the documentation is not much of a help.
can anyone provide some detail on how to achieve this ? Any better alternative solution is also preferred.
DK
I tried big query execute plugin and choose RUN AS ARGUMENTS but the didn't help

What is AWS S3 dataset?

Looking at documentation of awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter.
From testing, it looks like setting dataset=True allows, among other things, to append new data to an already existing set. It also looks like when dataset=True, I can't specify the file name and AWS autogenerates the names for the files which are added to the specified path.
Apart from that, I can't find more information on what dataset means. Is it just referring to the general concept or is there a specific meaning within the context of AWS? What exactly is dataset and when should it be set to True?

The dataset=True option allows you to store the entire dataset, including all metadata, indexes, etc.
The dataset parameter documentation:
dataset (bool) – If True store as a dataset instead of ordinary file(s) If True, enable all follow arguments: partition_cols, mode, database, table, description, parameters, columns_comments, concurrent_partitioning, catalog_versioning, projection_enabled, projection_types, projection_ranges, projection_values, projection_intervals, projection_digits, catalog_id, schema_evolution.
Note all those extra things that get saved when you save a dataset. All that information, like columns_comments, concurrent_partitioning, projection_values, will be lost when you save to CSV or Parquet. But on the other hand, those values are probably only useful if you plan to do further manipulation of the data via awswrangler/pandas at some later date.
Also note that if you set dataset=True you have to give it a file name prefix instead of a single file name, because the output generated will be spread across multiple files.
If you want to use the data in any other tool besides Pandas, such as loading the CSV into Excel, then you most likely want to set dataset=False and output to a single file.

Is there a way to pass data source connection string as a parameter to power bi embedded?

I have a pbix file that takes an Azure Storage account as a parameter and reads data from there accordingly. The next step is to be able to embed this powerbi dashboard on a webpage and let the end user specify the storage account. I see a lot of questions and answers surrounding passing in filter query parameters--this is different, we're trying to read from a completely different data source and not filtering on a static data source.
Another way to ask this question is: is there a way to embed powerbi template files, if not, is there a feature request somewhere we can upvote?

The short answer is no.
There is a reason to use filters in this case instead of parameters. Parameters are something that is part of the report itself. Each users that looks at your reports will get the same parameter values as the others. If one of them changes some parameter, this will affect all other users. Filters on the other hand, is something local for your session. You can filter the report the way you like, and this will not affect other users experience in any way.
You can't embed templates, because template is simply a state of the report on the disk. When you open it, it's not a template anymore, but becomes a report.
You can either combine the data from all of your data sources in a single report, adding one more column to indicate from where this data comes from, and then filter on this new column. Or create/modify ETL process (for example dataflows can be used for this) to combine these data sources into a single one.

Google Dataflow: create templates with runtime parameters

In Data flow, I need to pass start Date and end date as runtime arguments and query bigquery for that date range and write output to day wise folders.
When we use ValueProvider, getStartDate().get() method is throwing java.lang.RuntimeException: Not called from a runtime context. If I hardcode some value when getStartDate().get().isAccessible() is false, template is being generated but the runtime arguments are not reflecting in job. It is always running with the hardcoded value during template creation.
Any suggestions ?

BigQueryIO takes a ValueProvider of the query. The easiest way to do this is to pass the query text as the runtime value.
NestedValueProvider could help you create the query string from another value provider, alas, NestedValueProvider only support one input ValueProvider at a time. So you could concatenate your start and end dates into a single value and then do the split.

libpq: get data type

I am coding a cpp project with the database "postgreSQL".
I created a table in my database its type is character varying(40).
Now I need to SELECT these data FROM the table in my cpp project. I knew that I should use the library libpq, this is the interface of "postgreSQL" for c/cpp.
I have succeeded in selecting data from the table. Now I am considering if it's possible to get the data type of this table. For example, here I want to get character varying(40).

You need to use PQftype.
As described here: http://www.idiap.ch/~formaz/doc/postgreSQL/libpq-chapter17861.htm
And just take a look here about decoding return values: http://www.postgresql.org/message-id/da7021e0608040738l3b0880a1q5a76b838937f8c78#mail.gmail.com
You must also use PQfsize to get field size.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Google Cloud Data Fusion - Dynamic arguments based on functions - google-cloud-platform

There is a readymade workaround, you can give a try "BigQuery Execute" plugin. Steps: Put below query in SQL select cast(current_date as string) ||'_data.csv' as filename --for output '2020-12-15_data.csv' Row As Arguments to 'true' Now use the above arguments via ${filename} wherever you want to.

Related

Passing dynamic arguments into another pipeline Googel Data Fusion

What is AWS S3 dataset?

Is there a way to pass data source connection string as a parameter to power bi embedded?

Google Dataflow: create templates with runtime parameters

libpq: get data type

Categories

Resources