Replace nulls with the previous non-null value - amazon-athena

I am using Amazon Athena engine version 1, which is based on Presto 0.172.
Consider the example data set:
id | date_column | col1
-- | ----------- | ----
1  | 01/03/2021  | NULL
1  | 02/03/2021  | 1
1  | 15/03/2021  | 2
1  | 16/03/2021  | NULL
1  | 17/03/2021  | NULL
1  | 30/03/2021  | NULL
1  | 30/03/2021  | 1
1  | 31/03/2021  | NULL
I would like to replace all NULLs in the table with the last non-NULL value, i.e. I want to get:
id | date_column | col1
-- | ----------- | ----
1  | 01/03/2021  | NULL
1  | 02/03/2021  | 1
1  | 15/03/2021  | 2
1  | 16/03/2021  | 2
1  | 17/03/2021  | 2
1  | 30/03/2021  | 1
1  | 30/03/2021  | 1
1  | 31/03/2021  | 1
I was thinking of using the lag() function with the IGNORE NULLS option but, unfortunately, IGNORE NULLS is not supported by Athena engine version 1 (it is also not supported by Athena engine version 2, which is based on Presto 0.217).
How can I achieve the desired result without using the IGNORE NULLS option?
Here is a template for generating the example table:
WITH source1 AS (
    SELECT *
    FROM (
        VALUES
        (1, date('2021-03-01'), NULL),
        (1, date('2021-03-02'), 1),
        (1, date('2021-03-15'), 2),
        (1, date('2021-03-16'), NULL),
        (1, date('2021-03-17'), NULL),
        (1, date('2021-03-30'), NULL),
        (1, date('2021-03-30'), 1),
        (1, date('2021-03-31'), NULL)
    ) AS t (id, date_col, col1)
)
SELECT
    id
    , date_col
    , col1
    -- This doesn't work, as IGNORE NULLS is not supported:
    -- CASE
    --     WHEN col1 IS NOT NULL THEN col1
    --     ELSE lag(col1) IGNORE NULLS OVER (PARTITION BY id ORDER BY date_col)
    -- END AS col1_lag_nulls_ignored
FROM
    source1
ORDER BY
    date_col

After reviewing similar questions on SO (here and here), the solution below works for all column types (including strings and dates):
WITH source1 AS (
    SELECT *
    FROM (
        VALUES
        (1, date('2021-03-01'), NULL),
        (1, date('2021-03-02'), 1),
        (1, date('2021-03-15'), 2),
        (1, date('2021-03-16'), NULL),
        (1, date('2021-03-17'), NULL),
        (1, date('2021-03-30'), 1),
        (1, date('2021-03-31'), NULL)
    ) AS t (id, date_col, col1)
)
, grouped AS (
    SELECT
        id
        , date_col
        , col1
        -- If a row has a value in the column, then that row and all subsequent
        -- rows with a NULL (before the next non-NULL value) fall into the same group.
        , sum(CASE WHEN col1 IS NULL THEN 0 ELSE 1 END) OVER (
            PARTITION BY id ORDER BY date_col) AS grp
    FROM
        source1
)
SELECT
    id
    , date_col
    , col1
    -- max is used instead of first_value because, when there are multiple
    -- records with NULL on the same date, first_value may still return NULL.
    , max(col1) OVER (PARTITION BY id, grp ORDER BY date_col) AS col1_filled
    , grp
FROM
    grouped
ORDER BY
    date_col
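For the sample data in this query (which, unlike the question's table, has a single row for 30/03), the result should look like this, with the helper grp column shown for illustration:
id | date_col   | col1 | col1_filled | grp
-- | ---------- | ---- | ----------- | ---
1  | 2021-03-01 | NULL | NULL        | 0
1  | 2021-03-02 | 1    | 1           | 1
1  | 2021-03-15 | 2    | 2           | 2
1  | 2021-03-16 | NULL | 2           | 2
1  | 2021-03-17 | NULL | 2           | 2
1  | 2021-03-30 | 1    | 1           | 3
1  | 2021-03-31 | NULL | 1           | 3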

DAX column count latest record for each set of group with a condition

I want to get the latest updated record, which is a bit tricky to retrieve with a DAX calculated column in Power BI.
Count -> order count, based on Modified On (datetime) in ascending order
Deleted -> a flag set to True for a deleted record
Id | Name      | Modified On         | Deleted | Count | Result
-- | --------- | ------------------- | ------- | ----- | ------
1  | Charles   | 09-11-2022 15:09:40 |         | 1     |
1  |           | 09-11-2022 15:46:33 | True    | 2     |
1  | Charles M | 09-11-2022 20:39:40 |         | 3     | True
2  | Charles   | 09-11-2022 15:09:40 |         | 1     |
2  |           | 09-11-2022 15:46:33 | True    | 2     |
2  | Charles M | 09-11-2022 20:39:40 |         | 3     |
2  |           | 09-11-2022 21:16:33 | True    | 4     |
2  | charles m | 09-11-2022 21:18:33 |         | 5     |
3  | Dani      | 09-11-2022 15:46:33 |         | 1     | True
3  |           | 09-11-2022 21:16:33 | True    | 2     |
4  | George    | 09-11-2022 15:46:33 |         | 1     |
4  | George K  | 09-11-2022 21:16:33 |         | 2     |
In the above example I want the Result column values exactly as shown in the table.
Explanation:
Id 1: the record was created as well as deleted more than once, so the record's history has several rows. I want the last updated record, which is the 3rd row, rather than a deletion row (where the Deleted flag is True and there is no Name).
The same logic applies to Id 2, except that the last insert on that record is not deleted, so the Result column should not return anything for that Id.
Id 3: there is no update on the record in this history table; the first row is the creation and the second is the deletion, so we have to retrieve the first record, which is the only one with data in the Name field.
Id 4: no deletion happened, so we don't want to flag any record; the Result column should be empty.
Thanks in advance
I have tried to get the latest record with:
LatestDeletedRecord =
VAR latest =
    CALCULATE ( MAX ( 'Table'[Modified On] ), ALLEXCEPT ( 'Table', 'Table'[Id] ) )
RETURN
    IF ( 'Table'[Modified On] = latest && 'Table'[Deleted] = TRUE (), TRUE () )
Beyond that I could not get anywhere; I am new to DAX calculations.
Edit: If your requirements change, you should perhaps post a new question instead of editing your existing question :-)
With your altered requirements you can use this calculated column:
Result =
VAR _max =
    CALCULATE (
        MAX ( 'Table'[Modified On] ),
        ALLEXCEPT ( 'Table', 'Table'[Id] )
    )
VAR _max_is_deleted =
    CALCULATE (
        SELECTEDVALUE ( 'Table'[Deleted] ),
        ALLEXCEPT ( 'Table', 'Table'[Id] ),
        'Table'[Modified On] = _max
    )
VAR _max_mod =
    // Calculate the maximum modified date where the name is not deleted
    CALCULATE (
        MAX ( 'Table'[Modified On] ),
        ALLEXCEPT ( 'Table', 'Table'[Id] ),
        'Table'[Name] <> ""
    )
RETURN
    IF (
        // For rows where the ID has an associated deletion AND Modified On is
        // the latest modification that still has a Name
        _max_is_deleted
            && [Modified On] = _max_mod,
        // Return "True"
        "True"
    )
This gives your desired result (the Result column shown in the table above).

SQL for nested WITH CLAUSE - RESULTS OFFSET in Oracle 19c

Please suggest a way to implement the nesting of (temp - results - select) shown below. I see that Oracle 19c does not allow nesting of WITH clauses.
with temp2 as
(
    with temp1 as
    (
        __
        __
    ),
    results(..fields..) as
    (
        select ..<calc part>.. from temp1, results where __
    )
    select ..<calc part>.. from temp1 join results where __
),
results(..fields..) as
(
    select ..<calc part>.. from temp2, results where __
)
select ..<calc part>.. from temp2 join results where __
For instance (DB Fiddle):
I need to calculate CALC3 in a similar recursive way to CALC.
CREATE TABLE TEST ( DT DATE, NAME VARCHAR2(10), VALUE NUMBER(10,3));
insert into TEST values ( to_date( '01-jan-2021'), 'apple', 198.95 );
insert into TEST values ( to_date( '02-jan-2021'), 'apple', 6.15 );
insert into TEST values ( to_date( '03-jan-2021'), 'apple', 4.65 );
insert into TEST values ( to_date( '06-jan-2021'), 'apple', 20.85 );
insert into TEST values ( to_date( '01-jan-2021'), 'banana', 80.5 );
insert into TEST values ( to_date( '02-jan-2021'), 'banana', 9.5 );
insert into TEST values ( to_date( '03-jan-2021'), 'banana', 31.65 );
-- Existing working code:
with t as
(
    select
        test.*,
        row_number() over ( partition by name order by dt ) as seq
    from test
),
results(name, dt, value, calc, seq) as
(
    select name, dt, value, value/5 calc, seq
    from t
    where seq = 1
    union all
    select t.name, t.dt, t.value, ( 4 * results.calc + t.value ) / 5, t.seq
    from t, results
    where t.seq - 1 = results.seq
      and t.name = results.name
)
select results.*, calc*3 as calc2 -- Some xyz complex logic as calc2
from results
order by name, seq;
Desired output:
CALC3, grouped by name and dt:
CALC3 = ((CALC3 of previous day's record * 4) + CALC2 of current record) / 5
i.e. for APPLE:
for 1-jan-21, CALC3 = ((0*4) + 119.37)/5 = 23.87 -------> since it is the 1st record, 0 is taken as the previous day's CALC3
for 2-jan-21, CALC3 = ((23.87*4) + 99.19)/5 = 38.93 -----> the previous CALC3 (23.87) comes from 1-jan-21, and 99.19 is CALC2 from the current row
for 3-jan-21, CALC3 = ((38.93*4) + 82.14)/5 = 47.57, and so on
For BANANA:
1-jan-21, CALC3 = ((0*4) + 48.30)/5 = 9.66
2-jan-21, CALC3 = ((9.66*4) + 44.34)/5 = 16.60
etc.
You do not need nesting; you can just do it all in one level:
with temp1(...fields...) as
(
    __
    __
    __
),
results1(...fields...) as
(
    select ...<calc part>... from temp1 where __
),
temp2(...fields...) as
(
    select ...<calc part>... from temp1 join results1 where __
),
results2(...fields...) as
(
    select ...<calc part>... from temp2 where __
)
select ...<calc part>... from temp2 join results2 where __
For your actual problem, you can use a MODEL clause:
SELECT dt,
       name,
       amount,
       calc,
       seq,
       calc2,
       calc3
FROM (
    -- VALUE is aliased to AMOUNT so it can be referenced as a measure below
    SELECT t.dt,
           t.name,
           t.value AS amount,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY dt) AS seq
    FROM test t
)
MODEL
    PARTITION BY (name)
    DIMENSION BY (seq)
    MEASURES (dt, amount, 0 AS calc, 0 AS calc2, 0 AS calc3)
    RULES (
        calc[1] = amount[1] / 5,
        calc[seq > 1] = (amount[cv(seq)] + 4 * calc[cv(seq) - 1]) / 5,
        calc2[seq] = 3 * calc[cv(seq)],
        calc3[1] = calc2[1] / 5,
        calc3[seq > 1] = (calc2[cv(seq)] + 4 * calc3[cv(seq) - 1]) / 5
    )
Which outputs:
DT        | NAME   | AMOUNT | CALC     | SEQ | CALC2    | CALC3
:-------- | :----- | -----: | -------: | --: | -------: | --------:
01-JAN-21 | banana | 80.5   | 16.1     | 1   | 48.3     | 9.66
02-JAN-21 | banana | 9.5    | 14.78    | 2   | 44.34    | 16.596
03-JAN-21 | banana | 31.65  | 18.154   | 3   | 54.462   | 24.1692
01-JAN-21 | apple  | 198.95 | 39.79    | 1   | 119.37   | 23.874
02-JAN-21 | apple  | 6.15   | 33.062   | 2   | 99.186   | 38.9364
03-JAN-21 | apple  | 4.65   | 27.3796  | 3   | 82.1388  | 47.57688
06-JAN-21 | apple  | 20.85  | 26.07368 | 4   | 78.22104 | 53.705712
db<>fiddle here
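If you would rather keep the recursive WITH approach from your existing code, CALC3 can be carried along in the same recursion. A minimal sketch (untested, and assuming CALC2 really is just CALC * 3 as in your current query):
with t as
(
    select
        test.*,
        row_number() over ( partition by name order by dt ) as seq
    from test
),
results(name, dt, value, calc, calc3, seq) as
(
    select name, dt, value,
           value / 5,
           (value / 5) * 3 / 5, -- first row: calc3 = calc2 / 5 = (calc * 3) / 5
           seq
    from t
    where seq = 1
    union all
    select t.name, t.dt, t.value,
           ( 4 * results.calc + t.value ) / 5,
           -- calc3 = (calc2 of current row + 4 * previous calc3) / 5,
           -- where calc2 of the current row is 3 * its calc
           ( 3 * ( 4 * results.calc + t.value ) / 5 + 4 * results.calc3 ) / 5,
           t.seq
    from t, results
    where t.seq - 1 = results.seq
      and t.name = results.name
)
select results.*, calc * 3 as calc2
from results
order by name, seq;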

BigQuery Arrays - check if Array contains specific values

I'm trying to see if a certain set of items exists within a BigQuery array.
The query below works (checking whether a single item exists within an array):
WITH sequences AS
(
    SELECT 1 AS id, [10,20,30,40] AS some_numbers
    UNION ALL
    SELECT 2 AS id, [20,30,40,50] AS some_numbers
    UNION ALL
    SELECT 3 AS id, [40,50,60,70] AS some_numbers
)
SELECT id, some_numbers
FROM sequences
WHERE 20 IN UNNEST(some_numbers)
What I'm not able to do is the following (checking whether more than one item exists within an array); this query errors:
WITH sequences AS
(
    SELECT 1 AS id, [10,20,30,40] AS some_numbers
    UNION ALL
    SELECT 2 AS id, [20,30,40,50] AS some_numbers
    UNION ALL
    SELECT 3 AS id, [40,50,60,70] AS some_numbers
)
SELECT id, some_numbers
FROM sequences
WHERE (20,30) IN UNNEST(some_numbers)
I managed to find the workaround below, but I feel there's a better way to do this:
WITH sequences AS
(
    SELECT 1 AS id, [10,20,30,40] AS some_numbers
    UNION ALL
    SELECT 2 AS id, [20,30,40,50] AS some_numbers
    UNION ALL
    SELECT 3 AS id, [40,50,60,70] AS some_numbers
)
SELECT id, some_numbers
FROM sequences
WHERE (
    SELECT COUNT(1)
    FROM UNNEST(some_numbers) s
    WHERE s IN (20,30)
) > 1
Any suggestions are appreciated.
Not much to suggest... the official docs suggest using EXISTS:
WHERE EXISTS (SELECT *
              FROM UNNEST(some_numbers) AS s
              WHERE s IN (20,30));
Assuming you are looking for rows where ALL elements of the match array [20, 30] are found in the target array (some_numbers), and also assuming there are no duplicate numbers in either (match or target) array:
select id, some_numbers
from sequences a,
unnest([struct([20, 30] as match)]) b
where (
    select count(1) = array_length(match)
    from a.some_numbers num
    join b.match num
    using(num)
)
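Another way to express the same ALL-elements check, sketched here as an alternative (not from the answers above) that also tolerates duplicates in the target array, is to assert that no element of the match array is missing:
SELECT id, some_numbers
FROM sequences
WHERE NOT EXISTS (
    -- keep the row only if no element of the match array
    -- is absent from some_numbers
    SELECT m
    FROM UNNEST([20, 30]) AS m
    WHERE m NOT IN UNNEST(some_numbers)
)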

Replacing everything but a specific pattern in a string (Oracle)

I want to replace everything in a string with '' except for a given pattern, using Oracle's regexp_replace.
In my case the pattern refers to German licence plates. The pattern is contained in the usage column (verwendungszweck_bez) of a revenue table (of a bank). The pattern can be matched by ([a-z]{1,3})[- ]([a-z]{1,2}) ?([0-9]{1,4}). Now I'd like to invert the matching pattern in order to match everything except the pattern.
The usage column looks like this:
ALLIANZ VERSICHERUNGS-AG VERTRAG AS-9028000568 KFZ-VERSICHERUNG KFZ-VERS. XX-Y 427 01.01.19 - 31.12.19
XX-Y 427 would be the pattern I'm interested in. The string can contain more than one license plate:
AXA VERSICHERUNG AG 40301089910 KFZ HAFTPFLICHT ABC-RM10 37,35 + 40330601383 KFZ HAFTPFLIVHT ABC-LX 283 21,19
In this case I need ABC-RM10 and ABC-LX 283.
So far I just strip everything up to 'kfz' from the string with regexp_replace:
regexp_replace(lower(a.verwendungszweck_bez), '^(.*?)kfz', '')
because there's always 'kfz' in the string and the licence plate information follows (though not necessarily directly) after that. Then:
upper(regexp_replace(regexp_substr(regexp_replace(lower(a.verwendungszweck_bez), '^(.*?)kfz', ''), '([a-z]{1,3})[- ]([a-z]{1,2}) ?([0-9]{1,4})', 1, 1), '([a-z]{1,3})[- ]([a-z]{1,2}) ?([0-9]{1,4})', '\1-\2 \3'))
This works but I'm sure there's a better solution.
The result should be a list of customers, licence plates and a count of cars, like this:
Customer | licence plates         | count
-------- | ---------------------- | -----
1234567  | XX-Y 427               | 1
1255599  | ABC-RM 10 + ABC-LX 283 | 2
You can use a recursive sub-query to find the items. Also, you can use UPPER and TRANSLATE to normalise the data to remove the optional separators in the number plates and convert it into a single case:
Test Data:
CREATE TABLE test_data ( value ) AS
SELECT 'ALLIANZ VERSICHERUNGS-AG VERTRAG AS-9028000568 KFZ-VERSICHERUNG KFZ-VERS. XX-Y 427 01.01.19 - 31.12.19' FROM DUAL UNION ALL
-- UNG AG 4030 should not match
SELECT 'AXA VERSICHERUNG AG 40301089910 KFZ HAFTPFLICHT ABC-RM10 37,35 + 40330601383 KFZ HAFTPFLIVHT ABC-LX 283 21,19' FROM DUAL UNION ALL
-- Multiple matches adjacent to each other
SELECT 'AA-A1BB-BB222CC C3333' FROM DUAL UNION ALL
-- Duplicate values with different separators and cases
SELECT 'AA-A1 AA-A 1 aa a1' FROM DUAL
Query:
WITH items ( value, item, next_pos ) AS (
    SELECT value,
           TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 'i', 2 ) ), '_ -', '_' ),
           REGEXP_INSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 1, 'i', 2 ) - 1
    FROM test_data
    UNION ALL
    SELECT value,
           TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 'i', 2 ) ), '_ -', '_' ),
           REGEXP_INSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 1, 'i', 2 ) - 1
    FROM items
    WHERE next_pos > 0
)
SELECT item,
       COUNT(*)
FROM items
WHERE item IS NOT NULL AND NEXT_POS > 0
GROUP BY item
Output:
ITEM | COUNT(*)
:------- | -------:
CCC3333 | 1
AAA1 | 4
XXY427 | 1
ABCRM10 | 1
ABCLX283 | 1
BBBB222 | 1
db<>fiddle here
The result should be a list of customers ...
You haven't given any information on how customers relate to this; that part is left as an exercise to the reader (who hopefully has the client values somewhere and can correlate them to the input).
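For illustration only: if the source table also carried a customer number, say a hypothetical kunde_id column (not in the question's data), it could be threaded through the recursive CTE and the final step might look like:
-- Hypothetical sketch: assumes a kunde_id column has been carried through
-- the items CTE alongside value.
SELECT kunde_id,
       LISTAGG( item, ' + ' ) WITHIN GROUP ( ORDER BY item ) AS licence_plates,
       COUNT(*) AS cnt
FROM (
    SELECT DISTINCT kunde_id, item
    FROM items
    WHERE item IS NOT NULL AND next_pos > 0
)
GROUP BY kunde_id;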
Update:
If you want the count of unique number plates per row then:
WITH items ( rid, value, item, next_pos ) AS (
    SELECT ROWID,
           value,
           TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 'i', 2 ) ), '_ -', '_' ),
           REGEXP_INSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', 1, 1, 1, 'i', 2 ) - 1
    FROM test_data
    UNION ALL
    SELECT rid,
           value,
           TRANSLATE( UPPER( REGEXP_SUBSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 'i', 2 ) ), '_ -', '_' ),
           REGEXP_INSTR( value, '([^a-z]|^)([a-z]{1,3}[- ][a-z]{1,2} ?\d{1,4})(\D|$)', next_pos, 1, 1, 'i', 2 ) - 1
    FROM items
    WHERE next_pos > 0
)
SELECT LISTAGG( item, ' + ' ) WITHIN GROUP ( ORDER BY item ) AS items,
       COUNT(*)
FROM (
    SELECT DISTINCT
           rid,
           item
    FROM items
    WHERE item IS NOT NULL AND NEXT_POS > 0
)
GROUP BY rid;
Which outputs:
ITEMS | COUNT(*)
:----------------------- | -------:
XXY427 | 1
ABCLX283 + ABCRM10 | 2
AAA1 + BBBB222 + CCC3333 | 3
AAA1 | 1
db<>fiddle here

Regular expression to remove duplicates from comma separated string

I have the following string:
'C,2,1,2,3,1'
I need a regular expression to remove the duplicates, so that the result string looks like this:
'C,2,1,3'
If your input data is more than one string, I assume there is some kind of id column you can use to distinguish the strings from each other. If no such column exists, it can be created in the first factored subquery, for example by using rownum.
with
    inputs ( id, str ) as (
        select 1, 'C,2,1,2,3,1' from dual union all
        select 2, 'A,ZZ,3,A,3,ZZ' from dual
    ),
    unwrapped ( id, str, lvl, token ) as (
        select id, str, level, regexp_substr(str, '[^,]+', 1, level)
        from inputs
        connect by level <= 1 + regexp_count(str, ',')
               and prior id = id
               and prior sys_guid() is not null
    ),
    with_rn ( id, str, lvl, token, rn ) as (
        select id, str, lvl, token,
               row_number() over (partition by id, token order by lvl)
        from unwrapped
    )
select id, str, listagg(token, ',') within group (order by lvl) as new_str
from with_rn
where rn = 1
group by id, str
order by id
;
  ID  STR                NEW_STR
----  -----------------  --------------------
   1  C,2,1,2,3,1        C,2,1,3
   2  A,ZZ,3,A,3,ZZ      A,ZZ,3
Try this:
with
    -- your input data
    t_in as (select 'C,2,1,2,3,1' as s from dual),
    -- your string split into a table, one row per list item
    t_split as (
        select regexp_substr(s, '(\w+)(,|$)', 1, rownum, 'c', 1) as s,
               level as n
        from t_in
        connect by level <= regexp_count(s, '(\w+)(,|$)') + 1
    ),
    -- that table grouped to obtain distinct values with
    -- minimum levels for sorting
    t_grouped as (
        select s, min(n) n from t_split group by s
    )
select listagg(s, ',') within group (order by n)
from t_grouped;
Depending on your Oracle version you might have to replace listagg with wm_concat (it's googlable)
Here is another, shorter solution:
select listagg(val, ',') within group (order by min(id))
from (select rownum as id,
             trim(regexp_substr(str, '[^,]+', 1, level)) as val
      from (select 'C,2,1,2,3,1' as str from dual)
      connect by regexp_substr(str, '[^,]+', 1, level) is not null)
group by val;