How to find missing dates in a BigQuery table using SQL - google-cloud-platform

How do you get a list of missing dates from a BigQuery table? For example, a table (test_table) is populated every day by some job, but on a few days the job fails and data isn't written into the table.

Use Case:
We have a table (test_table) which is populated every day by some job (a scheduled query or Cloud Function). Sometimes those jobs fail and data isn't available for those particular dates in the table.
How do you find those dates without scrolling through thousands of rows?
The query below returns a list of dates and ad_ids where data wasn't uploaded (NULL).
Note: I used MIN(Date)/MAX(Date) because I knew dates were missing between my boundary dates. To be safe, you can also specify starting_date and ending_date explicitly, in case data hasn't been populated at all in the last few days.
WITH Date_Range AS (
  -- anchor for the date range
  SELECT
    MIN(date) AS starting_date,
    MAX(date) AS ending_date
  FROM `project_name.dataset_name.test_table`
),
day_series AS (
  -- anchor to get all the dates within the range
  SELECT *
  FROM Date_Range,
    UNNEST(GENERATE_TIMESTAMP_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) AS days
  -- other options depending on your date type (mine was TIMESTAMP):
  -- GENERATE_DATETIME_ARRAY or GENERATE_DATE_ARRAY
)
SELECT
  day_series.days,
  original_table.ad_id
FROM day_series
-- left join against the source table
LEFT JOIN `project_name.dataset_name.test_table` AS original_table
  ON original_table.date = day_series.days
-- keep only the records where data is not available (empty/missing)
WHERE original_table.ad_id IS NULL
GROUP BY 1, 2
ORDER BY 1
The final output is a list of the missing dates, each with a NULL ad_id.
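If you would rather pin the boundaries yourself, the anchor CTE can use literals instead of MIN/MAX (a minimal sketch; the dates are placeholders):
WITH Date_Range AS (
  -- explicit boundaries instead of deriving them from the table
  SELECT TIMESTAMP('2020-01-01') AS starting_date,
         TIMESTAMP('2020-12-31') AS ending_date
)
-- the rest of the query stays the same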

Alternatively, you can try the following query to get the desired output:
WITH t AS (
  SELECT 1 AS id, CAST('2020-12-25' AS TIMESTAMP) AS Days UNION ALL
  SELECT 1 AS id, CAST('2020-12-26' AS TIMESTAMP) AS Days UNION ALL
  SELECT 1 AS id, CAST('2020-12-27' AS TIMESTAMP) AS Days UNION ALL
  SELECT 1 AS id, CAST('2020-12-31' AS TIMESTAMP) AS Days UNION ALL
  SELECT 1 AS id, CAST('2021-01-01' AS TIMESTAMP) AS Days UNION ALL
  SELECT 1 AS id, CAST('2021-01-04' AS TIMESTAMP) AS Days
)
SELECT *
FROM (
  SELECT
    TIMESTAMP_ADD(Days, INTERVAL 1 DAY) AS Days,
    TIMESTAMP_SUB(next_days, INTERVAL 1 DAY) AS next_days
  FROM (
    SELECT
      t.Days,
      (CASE
         WHEN LAG(Days) OVER (PARTITION BY id ORDER BY Days) = Days THEN NULL
         WHEN LAG(Days) OVER (PARTITION BY id ORDER BY Days) IS NULL THEN NULL
         ELSE LEAD(Days) OVER (PARTITION BY id ORDER BY Days)
       END) AS next_days
    FROM t
  )
  WHERE next_days IS NOT NULL
    AND Days <> TIMESTAMP_SUB(next_days, INTERVAL 1 DAY)
),
UNNEST(GENERATE_TIMESTAMP_ARRAY(Days, next_days, INTERVAL 1 DAY)) AS days
For this sample data, the days column of the output contains the missing dates: 2020-12-28, 2020-12-29, 2020-12-30, 2021-01-02 and 2021-01-03.

I used the code above but had to restructure it for BigQuery:
-- anchor for the date range: selects dates from the source table (i.e. the table your query runs off of)
WITH day_series AS (
  SELECT *
  FROM (
    SELECT
      MIN(date) AS starting_date,
      MAX(date) AS ending_date
    FROM --enter source table here--
    -- OPTIONAL: filter for a specific date range
    WHERE date BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
  ),
  UNNEST(GENERATE_DATE_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) AS days
  -- other options depending on your date type:
  -- GENERATE_TIMESTAMP_ARRAY or GENERATE_DATETIME_ARRAY
)
SELECT
  day_series.days,
  output_table.date
FROM day_series
-- left join on the output table (i.e. the table you are searching for the missing dates)
LEFT JOIN `project_name.dataset_name.test_table` AS output_table
  ON output_table.date = day_series.days
-- keep only the records where data is not available (empty/missing)
WHERE output_table.date IS NULL
GROUP BY 1, 2
ORDER BY 1

Related

Retrieving the row with the greatest timestamp in questDB

I'm currently running QuestDB 6.1.2 on Linux. How do I get the row with the maximum timestamp from a table? I have tried the following on a test table with around 5 million rows:
(1) select * from table where cast(timestamp as symbol) in (select cast(max(timestamp) as symbol) from table);
(2) select * from table inner join (select max(timestamp) mm from table) on timestamp >= mm;
(3) select * from table where timestamp = max(timestamp);
(4) select * from table where timestamp = (select max(timestamp) from table);
Query (1) is correct but runs in ~5s; (2) is correct and runs in ~500ms but looks unnecessarily verbose for such a query; (3) compiles but returns an empty table; and (4) is invalid syntax, although that's how SQL usually does it.
select * from table limit -1 works. QuestDB returns rows sorted by the designated timestamp by default, and limit -1 takes the last row, which is therefore the row with the greatest timestamp. To be explicit about ordering by timestamp, select * from table order by timestamp limit -1 can be used instead. This query runs in around 300-400ms on the same table.
As a side note, the third query using timestamp = max(timestamp) doesn't work because QuestDB does not yet support subqueries in WHERE (as of QuestDB 6.1.2).
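To see the pattern end to end, here is a minimal sketch (the table, column names, and sample rows are made up; the designated-timestamp DDL follows standard QuestDB syntax):
CREATE TABLE readings (ts TIMESTAMP, value DOUBLE) TIMESTAMP(ts);
INSERT INTO readings VALUES ('2021-11-01T00:00:00.000000Z', 1.0);
INSERT INTO readings VALUES ('2021-11-02T00:00:00.000000Z', 2.0);
-- returns the 2021-11-02 row, i.e. the one with the greatest timestamp
SELECT * FROM readings ORDER BY ts LIMIT -1;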

Convert two datetimes to timestamps and keep the larger one

I work with Oracle Database 19c and I would like to convert two datetimes (DD/MM/YY HH24:MI:SS) into timestamps and keep only the larger one.
I tried several scripts for the conversion, like this one:
SELECT
  CODE_ACT_PROD,
  LIB,
  CAST(DAT_CRE AS TIMESTAMP) AS DATE_CRE_TIMESTAMP,
  CAST(DAT_MOD AS TIMESTAMP) AS DATE_MOD_TIMESTAMP
FROM ACTI
WHERE CODE_ACT_PROD IN (
  SELECT CODE_ACT_PROD
  FROM ART_COM
  WHERE ETAT = 0
)
but the result is not what I want: the datetimes are not converted, and I don't know how to keep the larger one.
Use GREATEST:
SELECT CODE_ACT_PROD,
LIB,
CAST (DAT_CRE AS TIMESTAMP) AS DATE_CRE_TIMESTAMP,
CAST (DAT_MOD AS TIMESTAMP) AS DATE_MOD_TIMESTAMP,
CAST(GREATEST(DAT_CRE, DAT_MOD) AS TIMESTAMP) AS greatest_timestamp
FROM ACTI
WHERE CODE_ACT_PROD IN (
SELECT CODE_ACT_PROD
FROM ART_COM
WHERE ETAT = 0
)
Which, for the sample data:
CREATE TABLE acti (
code_act_prod INT,
lib INT,
dat_cre DATE,
dat_mod DATE
);
CREATE TABLE art_com (
code_act_prod INT,
etat INT
);
INSERT INTO acti (code_act_prod, lib, dat_cre, dat_mod)
SELECT 1, 2, SYSDATE - 1, SYSDATE FROM DUAL UNION ALL
SELECT 3, 4, TRUNC(SYSDATE), SYSDATE - 2 FROM DUAL;
INSERT INTO art_com (code_act_prod, etat)
SELECT 1, 0 FROM DUAL UNION ALL
SELECT 3, 0 FROM DUAL;
Outputs:
CODE_ACT_PROD | LIB | DATE_CRE_TIMESTAMP         | DATE_MOD_TIMESTAMP         | GREATEST_TIMESTAMP
1             | 2   | 2021-09-01 08:38:21.000000 | 2021-09-02 08:38:21.000000 | 2021-09-02 08:38:21.000000
3             | 4   | 2021-09-02 00:00:00.000000 | 2021-08-31 08:38:21.000000 | 2021-09-02 00:00:00.000000
Oracle does not have a datetime data type. It has date, which holds a day and a time to the second, and it has timestamp, which also holds a day and a time to the second, with optional fractional seconds and time zone. Converting a date to a timestamp just adds fractional seconds that are always 0. Neither the date nor the timestamp data type has a format; a varchar2 would have a format. If the columns are date data types, your code is syntactically valid. I'm not sure how the results you are getting differ from the results you want, since you're not showing us your sample data or expected results, and you're not telling us what you mean when you say that something isn't converted.
Assuming the two columns are actually of type date, your code appears to be fine, and you just want to use the greatest function to get the latest date:
with cte as (
select sysdate dat_cr, sysdate + 1 dat_mod
from dual
)
select cast(dat_cr as timestamp) ts_cr,
cast(dat_mod as timestamp) ts_mod,
cast( greatest( dat_cr, dat_mod ) as timestamp ) ts_greatest
from cte;
TS_CR                        | TS_MOD                       | TS_GREATEST
02-SEP-21 08.25.38.000000 AM | 03-SEP-21 08.25.38.000000 AM | 03-SEP-21 08.25.38.000000 AM
Note that the conversion of the three timestamps to strings to be displayed to humans is controlled by your session's nls_timestamp_format.
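For example, that display format can be changed per session (a sketch; the format mask is just an illustration):
ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'YYYY-MM-DD HH24:MI:SS.FF6';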
If you want to handle null dates by returning whichever date is not null, you can use a coalesce and a case expression:
with cte as (
select sysdate dat_cr, sysdate + 1 dat_mod
from dual
union all
select null, sysdate from dual
union all
select sysdate, null from dual
)
select cast(dat_cr as timestamp) ts_cr,
cast(dat_mod as timestamp) ts_mod,
cast( case when dat_cr is null or dat_mod is null
then coalesce( dat_mod, dat_cr )
else greatest( dat_cr, dat_mod )
end
as timestamp ) ts_greatest
from cte;

I am looking to get the date diff between two or more rows, taking the first row's serviceperiodto date minus the next row's serviceperiodfrom date.

My data looks like this:
userid                               | completedat              | serviceperiodfrom        | serviceperiodto
00002cd9-94eb-4c06-a2c4-75253fd541b9 | 2020-11-25T14:20:04.293Z | 2020-11-25T14:20:04.200Z | 2021-02-25T14:20:04.200Z
00002cd9-94eb-4c06-a2c4-75253fd541b9 | 2021-03-21T10:27:34.842Z | 2021-03-21T10:27:34.800Z | 2022-03-21T10:27:34.800Z
00002cd9-94eb-4c06-a2c4-75253fd541b9 | 2020-07-24T11:22:12.410Z | 2020-07-24T11:22:12.300Z | 2020-10-24T11:22:12.300Z
I need the date diff between the serviceperiodto date of one row and the serviceperiodfrom date of the next row, for as many rows as each userid has.
I tried joining the table to itself with subqueries and tried to create a pivot table, but none of that worked for me.
You can use lag/lead to access the previous/next row:
WITH dataset AS (
  SELECT *
  FROM (
    VALUES
      (1, from_iso8601_timestamp('2020-11-25T14:20:04.200Z'), from_iso8601_timestamp('2021-02-25T14:20:04.200Z')),
      (1, from_iso8601_timestamp('2021-03-21T10:27:34.800Z'), from_iso8601_timestamp('2022-03-21T10:27:34.800Z')),
      (1, from_iso8601_timestamp('2020-07-24T11:22:12.300Z'), from_iso8601_timestamp('2020-10-24T11:22:12.300Z'))
  ) AS t (userid, serviceperiodfrom, serviceperiodto)
)
SELECT date_diff(
    'hour',
    serviceperiodto,
    lead(serviceperiodfrom, 1) OVER (PARTITION BY userid ORDER BY serviceperiodfrom))
FROM dataset
Output:
_col0
770
572
(NULL for the last row, which has no following row)
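If you want the gap in days rather than hours, the same query works with a different unit; only the first argument changes (a sketch of the changed expression):
date_diff(
    'day',
    serviceperiodto,
    lead(serviceperiodfrom, 1) OVER (PARTITION BY userid ORDER BY serviceperiodfrom))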

Pivot with dynamic DATE columns

I have a query that I created from a table.
Example:
select
  pkey,
  trunc(createdformat) business_date,
  regexp_substr(statistics, 'business_\w*') business_statistics
from business_data
where statistics like '%business_%'
group by regexp_substr(statistics, 'business_\w*'), trunc(createdformat)
This works great thanks to your help.
Now I want to show that in a crosstab/pivot.
That means the first column holds the business_statistics values, and the column headings are the dynamic days from business_date.
I've tried the following, but it doesn't quite work yet:
SELECT *
FROM (
select
pkey,
trunc(createdformat) business_date,
regexp_substr(statistics, 'business_\w*') business_statistics
from business_data
where statistics like '%business_%'
)
PIVOT(
count(pkey)
FOR business_date
IN ('17.06.2020','18.06.2020')
)
ORDER BY business_statistics
If I specify the dates, like 17.06.2020 and 18.06.2020 here, it works: 3 columns (business_statistics, 17.06.2020, 18.06.2020). But from column 2 onwards it should be dynamic: it should show me the days (dates) that are actually present in the query/table. So the result is X columns (business_statistics, Date1, Date2, Date3, Date4, ...), dynamic based on the table data.
For example, this does not work:
...
IN (SELECT DISTINCT trunc(createdformat) FROM BUSINESS_DATA WHERE statistics like '%business_%' order by trunc(createdformat))
...
The PIVOT clause doesn't work with dynamic values.
But there are some workarounds discussed here: How to Convert Rows to Columns and Back Again with SQL (aka PIVOT and UNPIVOT)
You may find one workaround that suits your requirements.
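One common workaround is conditional aggregation, which sidesteps PIVOT entirely; the column list still has to be written out (or generated with dynamic SQL), so this is a sketch with hard-coded dates:
select
  regexp_substr(statistics, 'business_\w*') as business_statistics,
  count(case when trunc(createdformat) = date '2020-06-17' then pkey end) as day_17_06_2020,
  count(case when trunc(createdformat) = date '2020-06-18' then pkey end) as day_18_06_2020
from business_data
where statistics like '%business_%'
group by regexp_substr(statistics, 'business_\w*')
order by business_statistics;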
Unfortunately, I am not very familiar with PL/SQL. But could I still feed the user's start and end date into the query?
For example, the user enters StartDate: June 17, 2020 and EndDate: June 20, 2020 in the APEX environment.
Then the PL/SQL block calculates the difference in days, and a loop fills a variable with the dates of the entered period.
Example (just an idea, I'm not that fit in PL/SQL yet):
DECLARE
  startdate DATE := :P9999_StartDate; -- example: 17.06.2020
  enddate   DATE := :P9999_EndDate;   -- example: 20.06.2020
BEGIN
  LOOP -- from the start date to the end date, day by day
    businessdate := businessdate ...; -- example: 17.06.2020,18.06.2020,19.06.2020,...
  END LOOP;
  SELECT *
  FROM (
    select
      pkey,
      trunc(createdformat) business_date,
      regexp_substr(statistics, 'business_\w*') business_statistics
    from business_data
    where statistics like '%business_%'
  )
  PIVOT (
    count(pkey)
    FOR business_date
    IN (businessdate)
  )
  ORDER BY business_statistics;
END;
That would be my idea, but I'm failing to implement it. Is that possible? I hope you understand what I mean.
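One way the idea could be implemented is a sketch along these lines (not an authoritative answer: the APEX item names and the business_data table come from the question above, everything else is assumed; the pivot IN list is assembled as a string and the statement is opened as a ref cursor with dynamic SQL):
DECLARE
  v_startdate DATE := TO_DATE(:P9999_StartDate, 'DD.MM.YYYY'); -- assumed item format
  v_enddate   DATE := TO_DATE(:P9999_EndDate, 'DD.MM.YYYY');
  v_in_list   VARCHAR2(4000);
  v_sql       VARCHAR2(32767);
  v_cursor    SYS_REFCURSOR;
BEGIN
  -- build: '17.06.2020' AS "17.06.2020", '18.06.2020' AS "18.06.2020", ...
  FOR i IN 0 .. (v_enddate - v_startdate) LOOP
    IF i > 0 THEN
      v_in_list := v_in_list || ', ';
    END IF;
    v_in_list := v_in_list
      || '''' || TO_CHAR(v_startdate + i, 'DD.MM.YYYY') || ''''
      || ' AS "' || TO_CHAR(v_startdate + i, 'DD.MM.YYYY') || '"';
  END LOOP;
  -- pivot on the date rendered as text so the IN list literals match exactly
  v_sql := 'SELECT * FROM ('
        || '  select pkey,'
        || '         to_char(trunc(createdformat), ''DD.MM.YYYY'') business_date,'
        || '         regexp_substr(statistics, ''business_\w*'') business_statistics'
        || '  from business_data'
        || '  where statistics like ''%business_%'''
        || ') PIVOT (count(pkey) FOR business_date IN (' || v_in_list || '))'
        || ' ORDER BY business_statistics';
  OPEN v_cursor FOR v_sql; -- fetch from v_cursor in the caller
END;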

SHOW PARTITIONS with order by in Amazon Athena

I have this query:
SHOW PARTITIONS tablename;
Result is:
dt=2018-01-12
dt=2018-01-20
dt=2018-05-21
dt=2018-04-07
dt=2018-01-03
This gives the list of partitions per table. The partition field for this table is dt which is a date column. I want to see the partitions ordered.
The documentation doesn't explain how to do it:
https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html
I tried to add order by:
SHOW PARTITIONS tablename order by dt;
But it gives:
AmazonAthena; Status Code: 400; Error Code: InvalidRequestException;
AWS currently (as of Nov 2020) supports two versions of the Athena engines. How one selects and orders partitions depends upon which version is used.
Version 1:
Use the information_schema table. Assuming you have year, month as partitions (with one partition key, this is of course simpler):
WITH a AS (
  SELECT partition_number AS pn, partition_key AS key, partition_value AS val
  FROM information_schema.__internal_partitions__
  WHERE table_schema = 'my_database'
    AND table_name = 'my_table'
)
SELECT year, month
FROM (
  SELECT val AS year, pn FROM a WHERE key = 'year'
) y
JOIN (
  SELECT val AS month, pn FROM a WHERE key = 'month'
) m ON m.pn = y.pn
ORDER BY year, month
which outputs:
year | month
2018 | 10
2018 | 11
2018 | 12
2019 | 01
...
Version 2:
Use the built-in $partitions functionality, where the partitions are explicitly available as columns and the syntax is much simpler:
SELECT year, month FROM my_database."my_table$partitions" ORDER BY year, month
year | month
2018 | 10
2018 | 11
2018 | 12
2019 | 01
...
For more information, see:
https://docs.aws.amazon.com/athena/latest/ug/querying-glue-catalog.html#querying-glue-catalog-listing-partitions
From your comment it sounds like you're looking to sort the partitions as a way to figure out whether or not a specific partition exists. For this purpose I suggest you use the Glue API instead of querying Athena. Run aws glue get-partition help or check your preferred SDK's documentation for how it works.
There is also a variant to list all partitions of a table; run aws glue get-partitions help to read more about that. I don't think it returns the partitions in alphabetical order, but it has operators for filtering.
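For example (a sketch; the database name is a placeholder, and the --expression filter comes from the GetPartitions API):
aws glue get-partitions \
  --database-name my_database \
  --table-name tablename \
  --expression "dt >= '2018-01-01'"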
The SHOW PARTITIONS command will not allow you to order the result, since this command does not produce a resultset to sort. This command only produces a string output.
You can on the other hand query the partition column and then order the result by value.
select distinct dt from tablename order by dt asc;