How to use query parameters in GCP BigQuery federated queries - google-cloud-platform

I have a GCP-based environment. I use standard SQL scripting in GCP BigQuery and a federated query to Cloud SQL MySQL. The federated query selects data from the Cloud SQL MySQL database. I need to select data from the Cloud SQL MySQL database based on a condition that depends on data in BigQuery. I use variables in standard SQL scripting in BigQuery to store the value that I select from BigQuery, and I want to use the value of this variable in the WHERE clause of the MySQL query. See the following example, where I select a date from BigQuery and store it in a variable "BQ_LAST_DATETIME":
DECLARE BQ_LAST_DATETIME DATETIME;
SET BQ_LAST_DATETIME = (select max(date_created) from bq_my_dataset.bq_my_table);
I am using a BigQuery federated query to read data out of the Cloud SQL database (https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries) as shown below, and I want to use the value stored in the variable "BQ_LAST_DATETIME" in the MySQL query's WHERE clause:
SELECT * FROM EXTERNAL_QUERY("my-gcp-project.my-region.my-connection2-cloudsql", "select * from mysqlschema.mysql_table where date_created = #BQ_LAST_DATETIME;" );
Please note that in the above query I have used "#BQ_LAST_DATETIME" as a placeholder to show what I want to achieve. I am not sure if I can directly use a BigQuery scripting variable as a query parameter in the "external" query part of the federated query.
Any suggestions on how to parameterize the external query in a federated query, or on how I could achieve a similar effect?
I actually tried the following: I used the BigQuery scripting variable as a query parameter in the "external" query part of the federated query. The only nuance is that, since I was dealing with dates, I performed a cast, and since the date variable is treated as a string, I formatted it back to a date using MySQL's STR_TO_DATE, as follows:
DECLARE BQ_LAST_DATETIME DATETIME;
DECLARE BQ_LAST_DATE DATE;
SET BQ_LAST_DATETIME = (select max(date_created) from bq_my_dataset.bq_my_table);
SET BQ_LAST_DATE = CAST(BQ_LAST_DATETIME AS DATE);
SELECT * FROM EXTERNAL_QUERY("my-gcp-project.my-region.my-connection2-cloudsql", "select * from mysqlschema.mysql_table where date_created = STR_TO_DATE(#BQ_LAST_DATE,'%Y-%m-%d');" );
While this query is accepted by the parser, it is NOT giving the expected result.
Basically, the value of the variable #BQ_LAST_DATE does not seem to reach the MySQL query as expected.
Does anyone know what I am missing?
Thanks a lot for your help.

You can try EXECUTE IMMEDIATE:
DECLARE BQ_LAST_DATETIME STRING;
DECLARE DSQL STRING;
SET BQ_LAST_DATETIME = 'SELECT max(date_created) from bq_my_dataset.bq_my_table';
SET DSQL = '"select * from mysqlschema.mysql_table where date_created = (' || BQ_LAST_DATETIME || ')"';
EXECUTE IMMEDIATE 'SELECT * FROM EXTERNAL_QUERY("my-gcp-project.my-region.my-connection2-cloudsql",' || DSQL || ');'
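As a variation on the same EXECUTE IMMEDIATE idea, you could also resolve the value on the BigQuery side first and splice its literal into the MySQL query text with FORMAT. This is only a sketch, reusing the dataset, table, and connection names from the question:
DECLARE BQ_LAST_DATETIME DATETIME;
DECLARE DSQL STRING;
-- resolve the value in BigQuery first
SET BQ_LAST_DATETIME = (SELECT MAX(date_created) FROM bq_my_dataset.bq_my_table);
-- build the federated query with the literal value spliced into the MySQL text (%t renders a canonical text form)
SET DSQL = FORMAT("""
SELECT * FROM EXTERNAL_QUERY(
  "my-gcp-project.my-region.my-connection2-cloudsql",
  "SELECT * FROM mysqlschema.mysql_table WHERE date_created = '%t'")""", BQ_LAST_DATETIME);
EXECUTE IMMEDIATE DSQL;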

Related

How to query DynamoDB where I have to fetch records based on a list of key values?

I have a DynamoDB table on which a GSI is defined with a partition key and a sort key.
Let's say the partition key is name and the sort key is ssn for the GSI.
I have to fetch based upon a name and ssn; below is the query I am using, and it works fine.
table.query(IndexName='lookup-by-name',
            KeyConditionExpression=Key('name').eq(name) & Key('ssn').eq(ssn))
Now, I have to query based upon a name and a list of ssns.
For example:
ssns = ['ssn1', 'ssn2', 'ssn3', 'ssn4']
name = 'Alex'
Query all records which have name as 'Alex' and whose ssn is present in the ssns list.
How do I implement something like this?
While the native DynamoDB SDK does not provide the functionality to do this, you can achieve it using PartiQL, which provides a SQL-like interface for interacting with DynamoDB.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-gettingstarted.html
import boto3

client = boto3.client('dynamodb', region_name="eu-west-1")

name = 'Alex'
ssns = ['ssn1', 'ssn2', 'ssn3', 'ssn4']

# PartiQL statement against the GSI; the Python list renders as ['ssn1', 'ssn2', ...],
# which matches PartiQL's list literal syntax for IN
response = client.execute_statement(
    Statement="Select * from \"MyTableTest\".\"lookup-by-name\" where \"name\" = '%s' AND \"ssn\" IN %s" % (name, ssns)
)
print(response['Items'])
It would also require you to use the lower-level Client instead of the Table-level resource which you are using above.
You would have to do multiple queries.
I ended up using just the name as the key condition and then filtering on ssn in the Python code.
The below worked for me, as the number of records was not large.
ssns = ['ssn1', 'ssn2', 'ssn3', 'ssn4']
response = table.query(IndexName='lookup-by-name',
                       KeyConditionExpression=Key('name').eq(name))
data = response['Items']
# keep only the records whose ssn is in the list
data = list(filter(lambda record: record['ssn'] in ssns, data))
return data

PowerBI Query contains transformations that can't be used for DirectQuery

I am using PowerBI Desktop (2.96.1061.0) to connect to a local MS SQL server so I can prepare some visualizations. It is important to mention that all data connections (Tables, SQL queries) are using the DirectQuery option.
It's been quite a smooth experience so far. No issues at all. Now I am trying to get some new data, again, through a direct SQL query:
SELECT BillId, string_agg(PGroupName, ', ')
FROM
(SELECT bm.ImportedBillsId as BillId, pg.Name as PGroupName
FROM [BillMp] bm
JOIN [Mps] m on bm.ImportersId = m.Id
JOIN [PGroups] pg on m.PoliticalGroupId = pg.Id
GROUP BY bm.ImportedBillsId, pg.Name) t
GROUP BY BillId
but for some reason it is not letting me re-create the model and apply the new changes, even though the import wizard is able to visualize the actual data prior to the update. The error I am getting is the one from the title: "Query contains transformations that can't be used for DirectQuery".
I have also tried to import only the data from the internal/nested query
SELECT bm.ImportedBillsId as BillId, pg.Name as PGroupName
FROM [BillMp] bm
JOIN [Mps] m on bm.ImportersId = m.Id
JOIN [PGroups] pg on m.PoliticalGroupId = pg.Id
GROUP BY bm.ImportedBillsId, pg.Name
and process (according to this article) the other/outer query through PowerBI, but I am still getting the same error.

List GCP BigQuery tables older than 90 days

I am using the following Standard SQL query in BigQuery to list tables older than 90 days using table metadata.
DECLARE projects ARRAY<STRING>;
DECLARE schema_list ARRAY<STRING>;
DECLARE dt_list ARRAY<STRING>;
DECLARE i INT64 DEFAULT 0;
DECLARE query_string STRING;
SET projects = ['my-project-1', 'my-project-2', ..., 'my-project-n'];
# List dataset of current project
SET schema_list = (
  SELECT ARRAY_AGG(schema_name)
  FROM INFORMATION_SCHEMA.SCHEMATA
); #1
#List Datasets of a Project
WHILE
i < ARRAY_LENGTH(projects) DO
SET dt_list = ( "SELECT ARRAY_AGG(schema_name) FROM UNNSET(projects) as proj,"|| proj[OFFSET(iter)] ||".INFORMATION_SCHEMA.SCHEMATA" #2
);
/*SET dt_list = ( " SELECT ARRAY_AGG(schema_name) FROM "
|| projects[OFFSET(iter)] ||".INFORMATION_SCHEMA.SCHEMATA" #3
);*/
SET i = i+1;
END WHILE;
#List tables of a Dataset
WHILE
i < ARRAY_LENGTH(dt_list) DO
SET query_string = " SELECT dataset_id, table_id, ROUND(size_bytes/POW(10,9),2) AS size_gb, TIMESTAMP_MILLIS(creation_time) AS creation_time, TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time, row_count, type FROM "
|| dt_list[OFFSET(i)] || ".__TABLES__";
EXECUTE IMMEDIATE query_string;
SET i = i + 1;
END WHILE;
I am able to get the list of tables of the current GCP project with the last modified date using the '#1' query.
When I try to get the same result using an array of projects (projects), I get errors like "Query error: Unrecognized name: proj" (for #2) and "Query error: Cannot coerce expression " SELECT ARRAY_AGG(schema_name) FROM " || projects[OFFSET(iter)] ||".INFORMATION_SCHEMA.SCHEMATA" to type ARRAY" (for #3).
My purpose is to list BigQuery tables older than 90 days (long-term storage) across an array of projects using standard SQL, as we currently have multiple projects and plan to run this query from a single project instead of running it in each project individually.
Please help.
In BigQuery, data moves to long-term storage after 90 days without updates, and it costs half as much. This rule is enforced at the partition level.
That's why I recommend you have a look at the partitions INFORMATION_SCHEMA view:
SELECT * FROM `projectID.dataset.INFORMATION_SCHEMA.PARTITIONS`
There is a column that tells you whether a partition is in long-term storage or not, so you immediately know whether you can delete that partition (and not the whole table). You can thus improve your storage optimisation that way.
You still have the last modified time if you don't want to stick to the long-term storage rule and want a purge threshold different from 90 days without updates.
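For example, a minimal sketch that lists the partitions already in long-term storage; it assumes the STORAGE_TIER and LAST_MODIFIED_TIME columns are available in your region's PARTITIONS view:
SELECT
  table_schema,
  table_name,
  partition_id,
  total_logical_bytes,
  last_modified_time
FROM `projectID.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE storage_tier = 'LONG_TERM'  -- partitions not modified for ~90 days
ORDER BY last_modified_time;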

Date ranges in where clause of a proc SQL statement

There is a large table containing among other fields the following:
ID, effective_date, Expiration_date.
expiration_date is in DATETIME20. format and can be NULL.
I'm trying to extract rows that expire after Dec 31, 2014 or do not expire (NULL).
Adding the following WHERE clause to the PROC SQL query gives me no results:
where coalesce(datepart(expiration_date),input('31/Dec/2020',date11.))
> input('31/Dec/2014',date11.);
However, when I only select NULL expiration dates and add the following fields:
put(coalesce(datepart(expiration_date),input('31/Dec/2020',date11.)),date11.) as value,
put(input('31/Dec/2014',date11.),date11.) as threshold,
case when coalesce(datepart(expiration_date),input('31/Dec/2020',date11.)) > input('31/Dec/2014',date11.)
then 'pass' else 'fail' end as tag
It shows 'pass' under TAG and all the other fields are correct.
This is an effort to duplicate what I used in SQL Server
where isnull(expiration_date,'9999-12-31') > '2014-12-31'
I am using SAS Enterprise Guide 7.1, and while trying to figure it out I've been using
proc sql inobs=100;
What am I doing wrong? Thank you.
Some Expiration Dates:
30OCT2015:00:00:00
30OCT2015:00:00:00
29OCT2015:00:00:00
30OCT2015:00:00:00
I would recommend using a date constant ("31DEC2014"d) rather than date functions, or else either use explicit pass-through or disable implicit pass-through. Date functions are challenging when going between databases, so avoiding them when possible is best.
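For example, a sketch of the WHERE clause rewritten with date constants instead of INPUT(); DATEPART() is still needed because expiration_date is a datetime value, and '31DEC9999'd stands in for the "never expires" case:
where coalesce(datepart(expiration_date), '31DEC9999'd) > '31DEC2014'd;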

Declare a variable in RedShift

SQL Server has the ability to declare a variable, then call that variable in a query like so:
DECLARE #StartDate date;
SET #StartDate = '2015-01-01';
SELECT *
FROM Orders
WHERE OrderDate >= #StartDate;
Does this functionality work in Amazon Redshift? From the documentation, it looks like DECLARE is used solely for cursors. SET looks to be the command I am looking for, but when I attempt to use it, I get an error:
set session StartDate = '2015-01-01';
[Error Code: 500310, SQL State: 42704] [Amazon](500310) Invalid operation: unrecognized configuration parameter "startdate";
Is it possible to do this in RedShift?
Slavik Meltser's answer is great. As a variation on this theme, you can also use a WITH construct:
WITH tmp_variables AS (
    SELECT
        '2015-01-01'::DATE AS StartDate,
        'some string'      AS some_value,
        5556::BIGINT       AS some_id
)
SELECT *
FROM Orders
WHERE OrderDate >= (SELECT StartDate FROM tmp_variables);
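If you need several of these values at once, a variation is to cross join the CTE instead of repeating scalar subqueries (a sketch against the same hypothetical Orders table):
WITH tmp_variables AS (
    SELECT
        '2015-01-01'::DATE AS StartDate,
        'some string'      AS some_value,
        5556::BIGINT       AS some_id
)
SELECT o.*
FROM Orders o
CROSS JOIN tmp_variables v
WHERE o.OrderDate >= v.StartDate;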
Actually, you can simulate a variable using a temporary table: create one, set the data, and you are good to go.
Something like this:
CREATE TEMP TABLE tmp_variables AS SELECT
    '2015-01-01'::DATE AS StartDate,
    'some string'      AS some_value,
    5556::BIGINT       AS some_id;
SELECT *
FROM Orders
WHERE OrderDate >= (SELECT StartDate FROM tmp_variables);
The temp table will be deleted at the end of the session.
Temp tables are bound per session (connection), and therefore cannot be shared across sessions.
No, Amazon Redshift does not have the concept of variables. Redshift presents itself as PostgreSQL, but is highly modified.
There was mention of User Defined Functions at the 2014 AWS re:Invent conference, which might meet some of your needs.
Update in 2016: Scalar User Defined Functions can perform computations but cannot act as stored variables.
Note that if you are using the psql client to query, psql variables can still be used as always with Redshift:
$ psql --host=my_cluster_name.clusterid.us-east-1.redshift.amazonaws.com \
--dbname=your_db --port=5432 --username=your_login -v dt_format=DD-MM-YYYY
# select current_date;
date
------------
2015-06-15
(1 row)
# select to_char(current_date,:'dt_format');
to_char
------------
15-06-2015
(1 row)
# \set
AUTOCOMMIT = 'on'
...
dt_format = 'DD-MM-YYYY'
...
# \set dt_format 'MM/DD/YYYY'
# select to_char(current_date,:'dt_format');
to_char
------------
06/15/2015
(1 row)
You can now use user-defined functions (UDFs) to do what you want:
CREATE FUNCTION my_const()
RETURNS VARCHAR IMMUTABLE AS
$$ return 'my_string_constant' $$ LANGUAGE plpythonu;
Unfortunately, this does require certain access permissions on your Redshift database.
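You can then call the function wherever the constant is needed, for example:
SELECT my_const();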
Not an exact answer, but in DBeaver you can set up variables to use in your local queries in the IDE. Our team has found this helpful for testing before we put code into production.
From this answer: https://stackoverflow.com/a/58308439/220997
You should then be able to do:
#set date = '2019-10-09'
SELECT ${date}::DATE, ${date}::TIMESTAMP WITHOUT TIME ZONE
which produces:
| date | timestamp |
|------------|---------------------|
| 2019-10-09 | 2019-10-09 00:00:00 |
Again note: this only works in the DBeaver IDE. This SQL won't work when integrated into stored procedures or called from other tools.