Passing dynamic arguments into another pipeline in Google Data Fusion - google-cloud-platform

I am trying to query BigQuery to get the max date of two columns in Google Data Fusion and pass the result into another pipeline as runtime arguments.
select max(datecolumn1) as passargmnt1, max(datecolumn2) as passargmnt2 from dummy_table;
Upon research, it looks like the BigQuery Argument Setter plugin might help, but the documentation is not much help.
Can anyone provide some detail on how to achieve this? Any better alternative solution would also be appreciated.
I tried the BigQuery Execute plugin and chose Row As Arguments, but that didn't help.
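The first related answer below describes exactly this Row As Arguments mechanism; as a sketch, the query above adapted to it might look like this (the cast to string mirrors that answer, and each column alias becomes a runtime argument name):
select cast(max(datecolumn1) as string) as passargmnt1, cast(max(datecolumn2) as string) as passargmnt2 from dummy_table;
Downstream pipeline properties could then reference ${passargmnt1} and ${passargmnt2}.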

Related

Google Cloud Data Fusion - Dynamic arguments based on functions

Good morning all,
I'm looking for a way in Google Data Fusion to make the name of a source file stored on GCS dynamic. The files to be processed are named according to their value date, for example: 2020-12-10_data.csv
My need is to set the filename dynamically so that the pipeline uses the correct file every day (something like this: ${new Date().getFullYear()...}_data.csv).
I managed to use runtime arguments by specifying the date as a string (2020-12-10), but not with a function.
More generally, is there any documentation on how to supply dynamic parameters with ready-made or custom "functions"? (I couldn't find any.)
Thanks in advance for your help.
There is a ready-made workaround: you can try the "BigQuery Execute" plugin.
Steps:
1. Put the query below in the SQL field:
select cast(current_date as string) || '_data.csv' as filename
--for output '2020-12-15_data.csv'
2. Set Row As Arguments to 'true'.
Now use the argument via ${filename} wherever you want to.
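For example, a GCS source's path property could then be set to something like gs://my-bucket/${filename} (the bucket name here is hypothetical); the macro is resolved at runtime from the argument the query produced.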

Django: SQL way (i.e. using annotation, aggregate, subquery, when, case, window, etc.) vs Python way

The SQL way means using annotate, aggregate, Subquery, When, Case, Window, etc., i.e. getting all the extra calculated columns from SQL.
Until now, if I wanted some additional information derived from the data of each row (calculating something, etc.), I would fetch the table as a queryset, loop over each object, store the desired results in a list of dicts, and pass that to the template. Of course I was using prefetching so that I could avoid N+1 queries.
But since Django 1.11 we can do the same thing in a more expressive way using SQL (i.e. using annotate, aggregate, Subquery, When, Case, Window, etc.) and no Python.
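For concreteness, a minimal sketch of the contrast (the Order model and its price and quantity fields are made up for illustration):
from django.db.models import DecimalField, ExpressionWrapper, F

# Python way: pull rows into memory and compute per object
rows = []
for order in Order.objects.all():
    rows.append({"id": order.id, "total": order.price * order.quantity})

# SQL way: the same calculation pushed into the query as an annotation
orders = Order.objects.annotate(
    total=ExpressionWrapper(F("price") * F("quantity"),
                            output_field=DecimalField())
)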
Disadvantage:
One disadvantage I found with the SQL way, compared to the Python way, is debugging.
I have to do complex calculations on the data of each queryset object, and in Python I can check the result step by step.
If I do it the SQL way, I can only see the final result and cannot trace the intermediate steps.
Advantage:
I haven't tried it, but I've heard from "The Dramatic Benefits of Django Subqueries and Annotations" that it's quite fast.
At present most of my SQL takes less than 100 ms, and most of the time goes into DOM loading. So will the SQL way help me in any way?
I do appreciate that Django has added more functions that help write SQL expressively.

Google Dataflow: create templates with runtime parameters

In Dataflow, I need to pass a start date and an end date as runtime arguments, query BigQuery for that date range, and write the output to day-wise folders.
When we use ValueProvider, the getStartDate().get() call throws java.lang.RuntimeException: Not called from a runtime context. If I hardcode some value for the case when getStartDate().isAccessible() is false, the template is generated, but the runtime arguments are not reflected in the job; it always runs with the value hardcoded at template creation.
Any suggestions?
BigQueryIO takes a ValueProvider of the query. The easiest way to do this is to pass the query text as the runtime value.
NestedValueProvider could help you create the query string from another value provider; alas, NestedValueProvider only supports one input ValueProvider at a time. So you could concatenate your start and end dates into a single value and then do the split, as in the sketch below.
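A minimal Java sketch of that approach (the dateRange option and the table name are hypothetical):
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;

// options.getDateRange() is a hypothetical ValueProvider<String>
// carrying e.g. "2020-01-01,2020-01-31"
ValueProvider<String> query = NestedValueProvider.of(
    options.getDateRange(),
    (String range) -> {
      String[] dates = range.split(",");
      return "SELECT * FROM `my_project.my_dataset.my_table` "
          + "WHERE event_date BETWEEN '" + dates[0] + "' AND '" + dates[1] + "'";
    });

pipeline.apply(BigQueryIO.readTableRows()
    .fromQuery(query)
    .usingStandardSql());
Because the query string is built inside the NestedValueProvider, get() is only called at runtime, which avoids the "Not called from a runtime context" error during template creation.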

BigQuery tabledata:list output into a BigQuery table

I know there is a way to place the results of a query into a table; there is a way to copy a whole table into another table; and there is a way to list a table piecemeal (tabledata:list using startIndex, maxResults and pageToken).
However, what I want to do is go over an existing table with tabledata:list and output the results piecemeal into other tables. I want to use this as an efficient way to shard a table.
I cannot find a reference to such a functionality, or any workaround to it for that matter.
Important to realize: the tabledata.list API is not part of BigQuery SQL but rather the BigQuery REST API, which you can use from the client of your choice.
That said, the logic you outlined in your question can be implemented in many ways; below is an example (high-level steps):
1. Call tabledata.list in a loop, using pageToken either to fetch the next page or to exit the loop.
2. In each iteration, process the response from tabledata.list, extract the actual data, and insert it into the destination table as streaming inserts via the tabledata.insertAll API. You can also add an inner loop over the rows extracted in a given iteration to decide which row goes to which table/shard.
This is very generic logic, and the particular implementation depends on the client you use; a rough sketch follows below.
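A rough Python sketch of those steps with the google-cloud-bigquery client (all table names are placeholders; the iterator pages through tabledata.list for you):
from google.cloud import bigquery

client = bigquery.Client()
source = client.get_table("my_project.my_dataset.source_table")  # placeholder

batch = []
# list_rows wraps tabledata.list and follows pageToken automatically
for row in client.list_rows(source, page_size=500):
    batch.append(dict(row))  # values may need converting to JSON-safe types
    if len(batch) == 500:
        # streaming insert, i.e. tabledata.insertAll under the hood
        errors = client.insert_rows_json("my_project.my_dataset.shard_0", batch)
        if errors:
            raise RuntimeError(errors)
        batch = []

if batch:
    errors = client.insert_rows_json("my_project.my_dataset.shard_0", batch)
    if errors:
        raise RuntimeError(errors)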
Hope this helps
For what you describe, I'd suggest you use the batch version of Cloud Dataflow:
https://cloud.google.com/dataflow/
Dataflow already supports BigQuery tables as sources and sinks, and will keep all data within Google's network. This approach also scales to arbitrarily large tables.
TableData.list-ing your entire table might work fine for small tables, but network overhead aside, it is definitely not recommended for anything of moderate size.

Custom Date Aggregate Function

I want to sort my Store models by their opening times. The Store model contains an is_open method that checks the store's opening time ranges and returns a boolean indicating whether it is open. The problem is that I don't want to sort my queryset manually, for efficiency reasons. I thought that if I wrote a custom annotation I could filter the query more efficiently.
So I googled and found that I can extend Django's aggregate class. From what I understood, I have to use pre-defined SQL functions like MAX, AVG, etc. The thing is, I want to check whether today's date falls within a given list of time intervals. Can anyone tell me which SQL function I should use?
Edit
I'd like to put the code here, but it's really spaghetti. A page of code does nothing but generate time intervals and check for the suitable one.
I want to avoid:
alg = lambda s: not (s.is_open() and s.reachable)
sorted(stores, key=alg)
and replace it with:
Store.objects.annotate(is_open=CheckOpen(datetime.today())).order_by('is_open')
But I'm totally lost at how to write CheckOpen...
Have a look at the docs for QuerySet.extra.
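As a sketch of an alternative that avoids raw SQL, a conditional annotation (Case/When) can emulate CheckOpen for the simple case of a single daily interval (the opening_time and closing_time fields are hypothetical):
from datetime import datetime

from django.db.models import BooleanField, Case, Value, When

now = datetime.now().time()
stores = Store.objects.annotate(
    is_open=Case(
        # hypothetical fields holding the daily opening window
        When(opening_time__lte=now, closing_time__gte=now, then=Value(True)),
        default=Value(False),
        output_field=BooleanField(),
    )
).order_by("-is_open")
This keeps the ordering in the database, which addresses the efficiency concern with sorting in Python.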