I am using the following Standard SQL query in BigQuery to list tables older than 90 days using table metadata.
DECLARE projects ARRAY<STRING>;
DECLARE dt_list ARRAY<STRING>;
DECLARE i INT64 DEFAULT 0;
DECLARE query_string STRING;
SET projects = ['my-project-1', 'my-project-2,...,'my-project-n'];
# List dataset of current project
SET schema_list = (
SELECT
ARRAY_AGG(schema_name)
FROM
INFORMATION_SCHEMA.SCHEMATA #1
#List Datasets of a Project
WHILE
i < ARRAY_LENGTH(projects) DO
SET dt_list = ( "SELECT ARRAY_AGG(schema_name) FROM UNNSET(projects) as proj,"|| proj[OFFSET(iter)] ||".INFORMATION_SCHEMA.SCHEMATA" #2
);
/*SET dt_list = ( " SELECT ARRAY_AGG(schema_name) FROM "
|| projects[OFFSET(iter)] ||".INFORMATION_SCHEMA.SCHEMATA" #3
);*/
SET i = i+1;
END WHILE;
#List tables of a Dataset
WHILE
i < ARRAY_LENGTH(dt_list) DO
SET query_string = " SELECT dataset_id, table_id, ROUND(size_bytes/POW(10,9),2) AS size_gb, TIMESTAMP_MILLIS(creation_time) AS creation_time, TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time, row_count, type FROM "
|| dt_list[OFFSET(i)] || ".__TABLES__";
EXECUTE IMMEDIATE query_string;
SET i = i + 1;
END WHILE;
I am able to get list of tables of current GCP project with last modified date using '#1' query.
When I am trying to get the same result using a Array of project(projects), I am getting errors like "Query error: Unrecognized name: proj"(for #2) and "Query error: Cannot coerce expression " SELECT ARRAY_AGG(schema_name) FROM " || projects[OFFSET(iter)] ||".INFORMATION_SCHEMA.SCHEMATA" to type ARRAY" (for #3).
My purpose is to list BigQuery tables older than 90 days(long term storage) using a array of projects( as currently we have multiple projects and planning to run this query in single project instead of running in each project individually) using standard SQL.
Please help.
In BigQuery, the data pass to long term storage after 90 days without update, and it cost twice less. This rule is enforced at the partition level.
That's why, I recommend you to have a look to the partition information schema
SELECT * FROM `projectID.dataset.INFORMATION_SCHEMA.PARTITIONS`
You have a column that tell you in your partition is in long term storage or not, and therefore, immediately, you can know if you can delete, or not the partition (and not the whole table). You can thus improve your storage optimisation like that.
You still have the last modified date, if you don't want to stick to the long term storage rule and have a purge different to 90 days without updates.
Related
The challenge is to update a table by scanning that same table for information. In this case, I want to find how many entries I received in an Upload dataset that have the same key (effectively duplicate instructions).
I tried the obvious code:
UPDATE Base AS TAR
INNER JOIN (select cdKey, count(*) as ct FROM Base GROUP BY cdKey) AS CHK
ON TAR.cdKey = CHK.cdKey
SET ctReferences = CHK.ct
This resulted in a non-updateable complaint. Some workarounds talked about adding DISTINCTROW, but that made no difference.
I tried creating a view (query in Ms/Access parlance); same failure.
Then I projected the set (SELECT cdKey, count(*) INTO TEMP FROM Base GROUP BY cdKey), and substituted TEMP for the INNER JOIN which worked.
Conclusion: reflexive updates are also non-updateable.
An initial thought was to embed a sub-select in the update, for example:
UPDATE Base TAR SET TAR.ctReferences = (select count(*) from Base CHK where CHK.cd = TAR.cd)
This also failed.
As this is part of a job I am calling, this SQL (like the other statements) are all strings executed by CurrentDb.Execute statements. I thought maybe I could make this a DLookup, I found that as cd is a string, I had a gaggle of double- and triple-quoted elements that was too messy to read (and maintain).
Best solution was to write a function so I could avoid having to do any sort of string manipulation. Hence, in a module there's a function:
Public Function PassHelperCtOccurs(ByRef cdX As String) As Long
PassHelperCtOccurs = DLookup("count(*)", "Base", "cd='" & cdX & "'")
End Function
And the call is:
CurrentDb().Execute ("UPDATE Base SET ctOccursCd =PassHelperCtOccurs(cd)")
I have a very large (3.5B records) table that I want to update/insert (upsert) using the MERGE statement in BigQuery. The source table is a staging table that contains only the new data, and I need to check if the record with a corresponding ID is in the target table, updating the row if so or inserting if not.
The target table is partitioned by an integer field called IdParent, and the matching is done on IdParent and another integer field called IdChild. My merge statement/script looks like this:
declare parentList array<int64>;
set parentList = array(select distinct IdParent from dataset.Staging);
merge into dataset.Target t
using dataset.Staging s
on
-- target is partitioned by IdParent, do this for partition pruning
t.IdParent in unnest(parentList)
and t.IdParent = s.IdParent
and t.IdChild = s.IdChild
when matched and t.IdParent in unnest(parentList) then
update
set t.Column1 = s.Column1,
t.Column2 = s.Column2,
...<more columns>
when not matched and IdParent in unnest(parentList) then
insert (<all the fields>)
values (<all the fields)
;
So I:
Pull the IdParent list from the staging table to know which partitions to prune
limit the partitions of the target table in the join predicate
also limit the partitions of the target table in the match/not matched conditions
The total size of dataset.Target is ~250GB. If I put this script in my BQ editor and remove all the IdParent in unnest(parentList) then it shows ~250GB to bill in the editor (as expected since there's no partition pruning). If I add the IdParent in unnest(parentList) back in so the script is exactly like you see it above i.e. attempting to partition prune, the editor shows ~97MB to bill. However, when I look at the query results, I see that it actually billed ~180GB:
The target table is also clustered on the two fields being matched, and I'm aware that the benefits of clustering are typically not shown in the editor's estimate. However, my understanding is that that should only make the bytes billed smaller... I can't think of any reason why this would happen.
Is this a BQ bug, or am I just missing something? BigQuery doesn't even say "the script is estimated to process XX MB", it says "This will process XX MB" and then it processes way more.
That's very interesting. What you did seems totally correct.
It seems BQ query planner could interpret your SQL correctly and know the partition pruning is provided, but when it executes. it failed to do so.
try removing t.IdParent in unnest(parentList) from both when matched clauses to see if the issue still happens, that is,
declare parentList array<int64>;
set parentList = array(select distinct IdParent from dataset.Staging);
merge into dataset.Target t
using dataset.Staging s
on
-- target is partitioned by IdParent, do this for partition pruning
t.IdParent in unnest(parentList)
and t.IdParent = s.IdParent
and t.IdChild = s.IdChild
when matched then
update
set t.Column1 = s.Column1,
t.Column2 = s.Column2,
...<more columns>
when not matched then
insert (<all the fields>)
values (<all the fields)
;
It would be a good idea to submit a bug to BigQuery if it couldn't be resolved.
I have a gcp based environment. I use standard SQL scripting in gcp BigQuery and federated query to cloudsql MySql. Federated query selects data from cloudsql mysql database. I need to select data from cloudsql mysql database based on condition that depends on data in BigQuery. I use variables in standard sql scriping in gcp bigquery to store the value that I select from bigquery. I want to value of this variable in the where clause of mysql query. See following example where I select a date from BigQuery and store it in a variable "BQ_LAST_DATETIME".
DECLARE BQ_LAST_DATETIME DATETIME
SET BQ_LAST_DATETIME = (select max(date_created) from bq_my_dataset.bq_my_table);
Since I am using bigquery federated query to read data out of cloudsql database (https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries) as shown below and I want to use value that I stored in the variable "BQ_LAST_DATETIME" in the mysql query where clause
SELECT * FROM EXTERNAL_QUERY("my-gcp-project.my-region.my-connection2-cloudsql", "select * from mysqlschema.mysql_table where where date_created = #BQ_LAST_DATETIME;" );
Please note that in above query I have used "#BQ_LAST_DATETIME" as a placeholder to show what I want to achieve. I am not sure if I can directly use bigquery scripting variable as query parameter in the "external" query part of federated query.
Any suggestions on how to achieve parametrization of external queries in federated query, or if you know how I could achieve effect similar to what my intent is?
I actually tried following as depicted . I used bigquery scripting variable as query parameter in the "external" query part of federated query. only nuance here is that since the I was dealing with dates I performed a cast and also since the date variable actually is treated as a string I formatted it back to date using mysql STR_TO_DATE as follows
DECLARE BQ_LAST_DATETIME DATETIME
SET BQ_LAST_DATETIME = (select max(date_created) from bq_my_dataset.bq_my_table);
SET BQ_LAST_DATE= CAST(BQ_LAST_DATETIME AS DATE);
SELECT * FROM EXTERNAL_QUERY("my-gcp-project.my-region.my-connection2-cloudsql", "select * from mysqlschema.mysql_table where where date_created = STR_TO_DATE(#BQ_LAST_DATE,'%Y-%m-%d') ;" );
While this query is accepted by parser it is NOT giving expected result.
Basically the value of the variable #BQ_LAST_DATE does not seem to get to MySQL query as expected.
Does anyone know what am I missing ?
Thanks a lot for your help
You can try EXECUTE IMMEDIATE:
DECLARE BQ_LAST_DATETIME STRING;
DECLARE DSQL STRING;
SET BQ_LAST_DATETIME = 'SELECT max(date_created) from bq_my_dataset.bq_my_table';
SET DSQL = '"select * from mysqlschema.mysql_table where date_created = (' || BQ_LAST_DATETIME || ')"';
EXECUTE IMMEDIATE 'SELECT * FROM EXTERNAL_QUERY("my-gcp-project.my-region.my-connection2-cloudsql",' || DSQL || ');'
I have been trying to make an AWS Athena query and got enough work done to get my data. However, my data needs to identify some patterns and change it in an uniform way in order to group those "similars". So I'm trying to make a regex_replacement, but how can i do multiple replacements to a same column in the same column?
Here's my query:
with q as (SELECT r.key,
r.otherid,
r.complexString,
minute(date_trunc('minute', from_iso8601_timestamp(r.time) AT TIME ZONE 'America/New_York')) AS minute,
hour(from_iso8601_timestamp(r.time) AT TIME ZONE 'America/New_York') AS hour,
day(from_iso8601_timestamp(r.time) AT TIME ZONE 'America/New_York') AS day
FROM requests0918 t
JOIN requests0918 t1 ON t.id = t1.id
WHERE t1.msg = 'response_written' AND t1.code = '200'
and t.otherid is not null
and t.key is not null
and t.path is not null
limit 10)
Select q.key, q.otherid, REGEXP_REPLACE(q.complexString, '\/accounts\/[0-9]+\/balances', '/accounts/.../balances' ) as path, q.minute, q.hour, q.day from q
So I'm successfully changing this strings to that ones, but I need to set more patterns and to replace under the same column name. So I'm looking on how to do it. I could add more layers of with q as {Query} to add more rules, but that sounds pretty wrong.
I am using Java DB (Java DB is Oracle's supported version of Apache Derby and contains the same binaries as Apache Derby. source: http://www.oracle.com/technetwork/java/javadb/overview/faqs-jsp-156714.html#1q2).
I am trying to update a column in one table, however I need to join that table with 2 other tables within the same database to get accurate results (not my design, nor my choice).
Below are my three tables, ADSID is a key linking Vehicles and Customers and ADDRESS and ZIP in Salesresp are used to link it to Customers. (Other fields left out for the sake of brevity.)
Salesresp(address, zip, prevsale)
Customers(adsid, address, zipcode)
Vehicles(adsid, selldate)
The goal is to find customers in the SalesResp table that have previously purchased a vehicle before the given date. They are identified by address and adsid in Customers and Vechiles respectively.
I have seen updates to a column with a single join and in fact asked a question about one of my own update/joins here (UPDATE with INNER JOIN). But now I need to take it that one step further and use both tables to get all the information.
I can get a multi-JOIN SELECT statement to work:
SELECT * FROM salesresp
INNER JOIN customers ON (SALESRESP.ZIP = customers.ZIPCODE) AND
(SALESRESP.ADDRESS = customers.ADDRESS)
INNER JOIN vehicles ON (Vehicles.ADSId =Customers.ADSId )
WHERE (VEHICLES.SELLDATE<'2013-09-24');
However I cannot get a multi-JOIN UPDATE statement to work.
I have attempted to try the update like this:
UPDATE salesresp SET PREVSALE = (SELECT SALESRESP.address FROM SALESRESP
WHERE SALESRESP.address IN (SELECT customers.address FROM customers
WHERE customers.adsid IN (SELECT vehicles.adsid FROM vehicles
WHERE vehicles.SELLDATE < '2013-09-24')));
And I am given this error: "Error code 30000, SQL state 21000: Scalar subquery is only allowed to return a single row".
But if I change that first "=" to a "IN" it gives me a syntax error for having encountered "IN" (Error code 30000, SQL state 42X01).
I also attempted to do more blatant inner joins, but upon attempting to execute this code I got the the same error as above: "Error code 30000, SQL state 42X01" with it complaining about my use of the "FROM" keyword.
update salesresp set prevsale = vehicles.selldate
from salesresp sr
inner join vehicles v
on sr.prevsale = v.selldate
inner join customers c
on v.adsid = c.adsid
where v.selldate < '2013-09-24';
And in a different configuration:
update salesresp
inner join customer on salesresp.address = customer.address
inner join vehicles on customer.adsid = vehicles.ADSID
set salesresp.SELLDATE = vehicles.selldate where vehicles.selldate < '2013-09-24';
Where it finds the "INNER" distasteful: Error code 30000, SQL state 42X01: Syntax error: Encountered "inner" at line 3, column 1.
What do I need to do to get this multi-join update query to work? Or is it simply not possible with this database?
Any advice is appreciated.
If I were you I would:
1) Turn off autocommit (if you haven't already)
2) Craft a select/join which returns a set of columns that identifies the record you want to update E.g. select c1, c2, ... from A join B join C... WHERE ...
3) Issue the update. E.g. update salesrep SET CX = cx where C1 = c1 AND C2 = c2 AND...
(Having an index on C1, C2, ... will boost performance)
4) Commit.
That way you don't have worry about mixing the update and the join, and doing it within a txn ensures that nothing can change the result of the join before your update goes through.