BigQuery: compare all the columns (100+) from two rows in a single table - google-cloud-platform

I have an input table as below:
id   col1   col2   time
01   abc    001    12:00
01   def    002    12:10
Required output table:
id   col1   col2   time    diff_field
01   abc    001    12:00   null
01   def    002    12:10   col1,col2
I need to compare both rows and find all the columns whose values differ, then keep those column names in a new column, diff_field.
I need an optimized solution for this, as my table has more than 100 columns (and all of them need to be compared).

You might consider the approach below:
WITH sample_table AS (
  SELECT '01' id, 'abc' col1, '001' col2, '12:00' time UNION ALL
  SELECT '01' id, 'def' col1, '002' col2, '12:10' time UNION ALL
  SELECT '01' id, 'def' col1, '002' col2, '12:20' time UNION ALL
  SELECT '01' id, 'ddf' col1, '002' col2, '12:30' time
)
SELECT * EXCEPT(curr, prev),
  (SELECT STRING_AGG('col' || offset)
   FROM UNNEST(SPLIT(curr)) c WITH offset
   JOIN UNNEST(SPLIT(prev)) p WITH offset USING (offset)
   WHERE c <> p AND offset < ARRAY_LENGTH(SPLIT(curr)) - 1
  ) diff_field
FROM (
  SELECT *, FORMAT('%t', t) AS curr, LAG(FORMAT('%t', t)) OVER w AS prev
  FROM sample_table t
  WINDOW w AS (PARTITION BY id ORDER BY time)
);
Query results
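The key trick above is FORMAT('%t', t), which renders the whole row struct as a single text value that SPLIT can break back into per-column strings, so no column has to be listed by name. A minimal illustration of what those two functions produce, as a sketch using one row of the sample data:
-- A sketch: what FORMAT('%t', ...) and SPLIT produce for a single row
SELECT FORMAT('%t', t) AS row_as_text,        -- "(01, abc, 001, 12:00)"
       SPLIT(FORMAT('%t', t)) AS row_as_array -- ["(01", " abc", " 001", " 12:00)"]
FROM (SELECT '01' AS id, 'abc' AS col1, '001' AS col2, '12:00' AS time) AS t;
Joining the current and previous arrays on offset then yields one 'col' || offset label per differing position, and the offset < ARRAY_LENGTH(SPLIT(curr)) - 1 filter keeps the time column out of the comparison.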

The approach below has no dependency on the actual column names or any naming convention, other than the id and time columns:
create temp function extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
select t.*,
( select string_agg(col)
from unnest(extract_keys(cur)) as col with offset
join unnest(extract_values(cur)) as cur_val with offset using(offset)
join unnest(extract_values(prev)) as prev_val with offset using(offset)
where cur_val != prev_val and col != 'time'
) as diff_field
from (
select t, to_json_string(t) cur, to_json_string(ifnull(lag(t) over(win), t)) prev
from your_table t
window win as (partition by id order by time)
)
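For intuition: to_json_string(t) turns each row into a JSON object keyed by the column names, so extract_keys and extract_values return parallel arrays that the with offset joins line up position by position. A quick way to see the intermediate value, as a sketch for one row of the sample data:
-- a sketch: the json string built for one row; extract_keys/extract_values
-- would return ["id","col1","col2","time"] and ["01","abc","001","12:00"]
select to_json_string(t) as cur
from (select '01' id, 'abc' col1, '001' col2, '12:00' time) t;
-- cur: {"id":"01","col1":"abc","col2":"001","time":"12:00"}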
If applied to the sample data in your question (or rather the extended version of it that I borrowed from Jaytiger's answer), the output is:

Related

Fetch values from another table for each column written in an array for each ID

I have a dataset that looks like this
sample_data_1
select 'Alice' AS ID, 2 AS col1, 5 AS col2, 6 AS col3, 0 AS col4
union all
select 'Bob' AS ID, 1 AS col1, 4 AS col2, -2 AS col3, 7 AS col4
and a dataset that looks like this. This provides a list of important columns for each ID
sample_data_2
select 'Alice' AS ID, [STRUCT('col1' AS column, 1 AS rank), STRUCT('col4' AS column, 2 AS rank), STRUCT('col3' AS column, 3 AS rank)] AS important_columns
union all
select 'Bob' AS ID, [STRUCT('col4' AS column, 1 AS rank), STRUCT('col2' AS column, 2 AS rank), STRUCT('col1' AS column, 3 AS rank)]
I would like to add the values of the important columns for each ID from sample_data_1, so I can have an output that looks like this:
select 'Alice' AS ID, [STRUCT('col1' AS column, 1 AS rank, 2 AS value), STRUCT('col4' AS column, 2 AS rank, 0 AS value), STRUCT('col3' AS column, 3 AS rank, 6 AS value)] AS important_columns
union all
select 'Bob' AS ID, [STRUCT('col4' AS column, 1 AS rank, 7 AS value), STRUCT('col2' AS column, 2 AS rank, 4 AS value), STRUCT('col1' AS column, 3 AS rank, 1 AS value)]
output
I would like the code to be dynamic, so it will work even if the column names change or the number of columns changes.
Consider the query below.
EXECUTE IMMEDIATE FORMAT("""
  CREATE TEMP TABLE sample_data1_unpivoted AS
  SELECT ID, ARRAY_AGG(STRUCT(column, value)) outputs
  FROM sample_data1 UNPIVOT (value FOR column IN (%s))
  GROUP BY 1
""", ARRAY_TO_STRING(
  REGEXP_EXTRACT_ALL(
    TO_JSON_STRING((SELECT AS STRUCT * EXCEPT (ID) FROM sample_data1 LIMIT 1)),
    r'"([^,{]+)":'),
  ',')
);
SELECT ID, ARRAY (
SELECT AS STRUCT column, rank, value
FROM t1.important_columns LEFT JOIN t2.outputs USING (column)
) important_columns
FROM sample_data2 t1
LEFT JOIN sample_data1_unpivoted t2 USING (ID);
Query results
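For reference, the statement that FORMAT assembles and EXECUTE IMMEDIATE runs for this sample data would look roughly like the following sketch (the col1,...,col4 list is what REGEXP_EXTRACT_ALL pulls out of the JSON form of one row):
-- roughly the statement EXECUTE IMMEDIATE ends up running here (a sketch)
CREATE TEMP TABLE sample_data1_unpivoted AS
SELECT ID, ARRAY_AGG(STRUCT(column, value)) outputs
FROM sample_data1
UNPIVOT (value FOR column IN (col1,col2,col3,col4))
GROUP BY 1;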

Conditional Merging in SAS

Hi all, I have the below two datasets, in which I need to map Date1 with Date2 within a range of +/- 7 days within an ID.
Data set1;
input ID Date1 ddmmyy8.;
Format Date1 Date11.;
Datalines;
001 02-08-15
001 04-08-15
001 06-08-15
002 11-09-15
002 14-09-15
002 17-09-15
;
run;
Data set2;
input ID TYPE $ Date2 ddmmyy8.;
Format Date2 Date11.;
Datalines;
001 TYPE1 02-08-15
001 TYPE2 11-08-15
001 TYPE3 06-08-15
002 TYPE1 07-09-15
002 TYPE2 04-09-15
002 TYPE3 09-08-15
;
run;
Proc sql;
create table out as select a.ID, a.Date1, b.Date2,
intck('days', Date1, Date2) as Diff
from set1 as a full join set2 as b
on a.ID = b.ID and (Date1 + 7 >= Date2 >= Date1 - 7)
group by a.ID, Date1 having diff = min(diff);
quit;
I get the below output, but I need the output shown under Expected Output.
The output I get is highlighted in yellow when I map using the minimum of Diff, but the output I need is highlighted in green, because I have to keep the values in Date2 distinct, without repeats.
(i.e. because 02-Aug-2015 is already mapped to 02-Aug-2015, and 09-Aug-2015 is mapped to 09-Aug-2015 of Date1, I need 04-Aug-2015 of Date1 to be mapped to the remaining 11-Aug-2015)
So, judging from the comments above, the rules are quite complex:
* Join based on ID.
* The absolute difference between Date1 and Date2 should be less than 7.
* Preference for matching dates should be given to dates that are equal.
* In case dates aren't equal, the solution should return as many combinations as possible.
* Date1 and Date2 need to be unique.
I'm not sure whether this can be done in one PROC SQL step.
You can't just work with a simple rule, as it is not always the result with the lowest difference between dates that needs to be returned. Next to that, you need to 'remember' which dates per ID you have already selected for the output, since you can't select them again further on.
First I made an inner join on ID where the difference in dates was <= 7. This gives all possible valid combinations. Then I put all the dates where the difference in dates was 0 into a macro variable, so I could build a table of all possible combinations excluding the dates in the macro variable. In the last step you then want the solution that returns the most possible combinations of Date1 - Date2 where both are distinct.
Data set1;
input ID Date1 ddmmyy8.;
Format Date1 Date11.;
Datalines;
001 02-08-15
001 04-08-15
001 06-08-15
002 11-09-15
002 14-09-15
002 17-09-15
;
run;
Data set2;
input ID TYPE $ Date2 ddmmyy8.;
Format Date2 Date11.;
Datalines;
001 TYPE1 02-08-15
001 TYPE2 11-08-15
001 TYPE3 06-08-15
002 TYPE1 07-09-15
002 TYPE2 04-09-15
002 TYPE3 09-08-15
;
run;
Proc sql noprint;
/*Create a macro variable with all the dates for which the difference between date1
and date2 = 0 */
select distinct put(Date1,yymmddn8.) into :dates separated by ','
from set1 as a inner join set2 as b
on a.ID = b.ID and abs(intck('days', Date1, Date2)) = 0;
/*Create table with all lines where difference <= 7 but date is not in the ones with
difference = 0 */
create table out as select a.ID, a.Date1, b.Date2,
intck('days', Date1, Date2) as Diff
from set1 as a inner join set2 as b
on a.ID = b.ID where abs(intck('days', Date1, Date2)) <= 7 and
find("&dates.",put(Date1,yymmddn8.)) = 0 and find("&dates.",put(Date2,yymmddn8.)) = 0;
/* Check the number of possible combinations */
create table out as
select a.*,
b.cnt1 + c.cnt2 as combos
from out a left join (select distinct id,
date1,
count(*) as cnt1
from out
group by id, date1) b on a.date1 = b.date1 and a.id = b.id
left join (select distinct id,
date2,
count(*) as cnt2
from out
group by id, date2) c on a.date2 = c.date2 and a.id = c.id
order by id, combos;
quit;
/* Select unique dates per date1, date2 */
data out(keep = id date1 date2);
retain mem1 mem2;
length mem1 mem2 $100.;
set out;
by id combos;
if first.id then do;
mem1 = "0";
mem2 = "0";
end;
date10 = put(date1,yymmddn8.);
date20 = put(date2,yymmddn8.);
if find(mem1,date10) = 0 and find(mem2,date20) = 0 then do;
mem1 = catx(',',mem1,date10);
mem2 = catx(',',mem2,date20);
output;
end;
run;
/* Create a union between lines with no difference in date and lines with difference
in date*/
proc sql;
create table final as
select * from out
union
select a.ID, a.Date1, b.Date2
from set1 as a inner join set2 as b
on a.ID = b.ID and abs(intck('days', Date1, Date2)) = 0;
quit;
So this gives a table like:
Final Table

SELECT MAX PARTITION TABLE

I have a table partitioned on DATE(transaction_time), and I have a problem with a SELECT MAX.
I'm trying to get the row with the highest timestamp when I get more than one row in the result for one ID.
Example of data:
1. ID = 1 , Transaction_time = "2018-12-10 12:00:00"
2. ID = 1 , Transaction_time = "2018-12-09 12:00:00"
3. ID = 2 , Transaction_time = "2018-12-10 12:00:00"
4. ID = 2 , Transaction_time = "2018-12-09 12:00:00"
Result that I want:
1. ID = 1 , Transaction_time = "2018-12-10 12:00:00"
2. ID = 2 , Transaction_time = "2018-12-10 12:00:00"
This is my query
SELECT ID, TRANSACTION_TIME FROM `table1` AS T1
WHERE TRANSACTION_TIME = (SELECT MAX(TRANSACTION_TIME)
FROM `table1` AS T2
WHERE T2.ID = T1.ID )
The error I receive:
Error: Cannot query over table 'table1' without a filter over
column(s) 'TRANSACTION_TIME' that can be used for partition
elimination
It looks like BigQuery does not support the correlated subquery in the WHERE clause here. I don't know how to fix your current approach, but you might be able to just use ROW_NUMBER here:
SELECT t.ID, t.TRANSACTION_TIME
FROM
(
SELECT ID, TRANSACTION_TIME,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY TRANSACTION_TIME DESC) rn
FROM table1
) t
WHERE rn = 1;
It can also be done this way:
SELECT id, MAX(transaction_time) FROM `table1` GROUP BY id;
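Note that the original error is about partition elimination: if the table was created with require_partition_filter, both queries above will still be rejected until they include a constant filter on the partitioning column. A sketch, assuming you only care about a known date range:
-- a sketch (assumed date range): a constant predicate on the partitioning
-- column lets BigQuery do partition elimination
SELECT id, MAX(transaction_time) AS transaction_time
FROM `table1`
WHERE DATE(transaction_time) BETWEEN DATE '2018-12-01' AND DATE '2018-12-31'
GROUP BY id;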

PowerBI Dax Query - Match a Value and get its Maximum Value

I want to match the value in Column1 of one table with another and get the maximum value of column2.
For Example I have two tables,
Table 1
Col1 Col2
AA 17
AA 20
AB 10
AB 21
Table 2
Col1 Col2
AA ?
AB ?
I want my output to look like this,
Col1 Col2
AA 20
AB 21
I have tried,
Col2 = Max(Table1[col2])
but it didn't help. Thanks. Please share your thoughts.
You can use the following DAX:
Col2 =
CALCULATE(
MAX(Table1[Col2]),
FILTER(
Table1,
Table1[Col1] = Table2[Col1]
)
)
Result:

Extract string from a large string with Oracle regexp

I have a string as below:
select b.col1,a.col2,lower(a.col3) from table1 a inner join table2 b on a.col = b.col and a.col = b.col
inner join (select col1, col2, col3,col4 from tablename ) c on a.col1=b.col2
where
a.col = 'value'
The output needs to be table1, table2 and tablename from the above string. Please let me know the regex to get this result.
Should be a simple one :-)
WITH DATA AS (
  SELECT q'[select b.col1,a.col2,lower(a.col3) from table1 a inner join table2 b on
a.col = b.col and a.col = b.col inner join (select col1, col2, col3,col4 from tablename )
c on a.col1=b.col2 where a.col = 'value']' str
  FROM DUAL
)
SELECT LISTAGG(TABLE_NAMES, ' , ') WITHIN GROUP (ORDER BY val) table_names
FROM
  (SELECT 1 val,
          regexp_substr(str, 'table[[:alnum:]]+', 1, level) table_names
   FROM DATA
   CONNECT BY level <= regexp_count(str, 'table')
  );

TABLE_NAMES
--------------------------------------------------------------------------------
table1 , table2 , tablename
A brief explanation, so that the OP and others might find it useful:
* REGEXP_SUBSTR looks for the word 'table', which may be followed by a number or a string such as 1, 2, name, etc.
* To find all such words, I used the CONNECT BY LEVEL technique, but that gives the output in different rows.
* Finally, to put them in a single row as comma-separated values, I used LISTAGG.
* Oh yes, and that q'[]' is the quoted string literal technique.
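If the table names don't conveniently all start with the literal 'table', a more general variant of the same CONNECT BY LEVEL idea is to capture the identifier that follows each FROM or JOIN keyword. A sketch (not part of the original answer), reusing the same sample string:
-- A sketch: extract the identifier after each FROM/JOIN instead of relying
-- on the names starting with 'table'
WITH data AS (
  SELECT q'[select b.col1,a.col2,lower(a.col3) from table1 a inner join table2 b on
a.col = b.col and a.col = b.col inner join (select col1, col2, col3,col4 from tablename )
c on a.col1=b.col2 where a.col = 'value']' str
  FROM dual
)
SELECT LISTAGG(tab, ' , ') WITHIN GROUP (ORDER BY tab) AS table_names
FROM (
  SELECT DISTINCT
         REGEXP_SUBSTR(str, '(from|join)\s+(\w+)', 1, LEVEL, 'i', 2) AS tab
  FROM data
  CONNECT BY LEVEL <= REGEXP_COUNT(str, '(from|join)', 1, 'i')
)
WHERE tab IS NOT NULL;
-- expected: table1 , table2 , tablename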