I have data like this:
id start_date end_date col1 col2 col3 col4 col5
issue_2017-09 2017-09-18 2017-09-30 true true true true false
I want to convert the data into the following format:
id start_date end_date new_col
issue_2017-09 2017-09-18 2017-09-30 {'col1', 'col2', 'col3', 'col4'}
new_col is built from the names of the columns among [col1, col2, col3, col4, col5] whose value is true.
I am using Redshift.
I was able to resolve this using the following query:
select id, start_date , end_date, listagg(col_name, ', ') as new_col
from (
select id, start_date, end_date, col1 as val, 'col1' as col_name
from table
union all
select id, start_date, end_date, col2 as val, 'col2' as col_name
from table
union all
select id, start_date, end_date, col3 as val, 'col3' as col_name
from table
union all
select id, start_date, end_date, col4 as val, 'col4' as col_name
from table
union all
select id, start_date, end_date, col5 as val, 'col5' as col_name
from table
) t
where val is True
group by id, start_date, end_date
Here is an alternative method:
select
id, start_date, end_date,
'{' +
case when col1 then '''col1''' else '' end +
case when col2 then case when col1 then ', ''col2''' else '''col2''' end else '' end +
case when col3 then case when (col1 or col2) then ', ''col3''' else '''col3''' end else '' end +
case when col4 then case when (col1 or col2 or col3) then ', ''col4''' else '''col4''' end else '' end +
case when col5 then case when (col1 or col2 or col3 or col4) then ', ''col5''' else '''col5''' end else '' end +
'}' as new_col
from table01;
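For readers who want to check the expected result outside the database, here is a minimal Python sketch of the same idea: keep the names of the flag columns that are true and render them in the {'col1', ...} format from the question. The row dictionary and the helper names (true_columns, format_as_set) are made up for illustration and are not part of either query above.

```python
# Hypothetical row mirroring the sample data in the question.
row = {
    "id": "issue_2017-09",
    "start_date": "2017-09-18",
    "end_date": "2017-09-30",
    "col1": True, "col2": True, "col3": True, "col4": True, "col5": False,
}

FLAG_COLUMNS = ["col1", "col2", "col3", "col4", "col5"]


def true_columns(record, flag_columns):
    """Return the names of the flag columns whose value is true, in order."""
    return [name for name in flag_columns if record[name]]


def format_as_set(names):
    """Render names as {'a', 'b'}, the format requested in the question."""
    return "{" + ", ".join(f"'{name}'" for name in names) + "}"


new_col = format_as_set(true_columns(row, FLAG_COLUMNS))
print(new_col)  # {'col1', 'col2', 'col3', 'col4'}
```

Like the CASE-based query, this yields an empty pair of braces when no flag is true.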
I have an input table as below:
id col1 col2 time
01 abc 001 12:00
01 def 002 12:10
Required output table:
id col1 col2 time diff_field
01 abc 001 12:00 null
01 def 002 12:10 col1,col2
I need to compare the rows, find all the columns whose values differ, and keep those column names in a new column, diff_field.
I need an optimized solution, as my table has more than 100 columns (all of which need to be compared).
You might consider the approach below:
WITH sample_table AS (
SELECT '01' id, 'abc' col1, '001' col2, '12:00' time UNION ALL
SELECT '01' id, 'def' col1, '002' col2, '12:10' time UNION ALL
SELECT '01' id, 'def' col1, '002' col2, '12:20' time UNION ALL
SELECT '01' id, 'ddf' col1, '002' col2, '12:30' time
)
SELECT * EXCEPT(curr, prev),
(SELECT STRING_AGG('col' || offset)
FROM UNNEST(SPLIT(curr)) c WITH offset
JOIN UNNEST(SPLIT(prev)) p WITH offset USING (offset)
WHERE c <> p AND offset < ARRAY_LENGTH(SPLIT(curr)) - 1
) diff_field
FROM (
SELECT *, FORMAT('%t', t) AS curr, LAG(FORMAT('%t', t)) OVER w AS prev
FROM sample_table t
WINDOW w AS (PARTITION BY id ORDER BY time)
);
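To make the intended result easier to verify, here is a small Python sketch of the same row-to-previous-row comparison. It is only an illustration under assumptions: the rows are the extended sample data from the query above, the function name add_diff_field is invented, and id and time are excluded from the comparison.

```python
from itertools import groupby
from operator import itemgetter

# Extended sample data, same values as the sample_table CTE above.
rows = [
    {"id": "01", "col1": "abc", "col2": "001", "time": "12:00"},
    {"id": "01", "col1": "def", "col2": "002", "time": "12:10"},
    {"id": "01", "col1": "def", "col2": "002", "time": "12:20"},
    {"id": "01", "col1": "ddf", "col2": "002", "time": "12:30"},
]


def add_diff_field(records, key="id", order="time"):
    """Compare each row with the previous row of the same key (by order column)."""
    ignored = {key, order}
    result = []
    ordered = sorted(records, key=itemgetter(key, order))
    for _, group in groupby(ordered, key=itemgetter(key)):
        prev = None
        for rec in group:
            if prev is None:
                diff = None  # first row of the partition has nothing to compare with
            else:
                changed = [c for c in rec if c not in ignored and rec[c] != prev[c]]
                diff = ",".join(changed) or None
            result.append({**rec, "diff_field": diff})
            prev = rec
    return result


for rec in add_diff_field(rows):
    print(rec["time"], rec["diff_field"])
```

Because the loop walks over whatever keys each row has, it does not need the column names spelled out, which matters when the table has 100+ columns.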
The approach below does not depend on the actual column names or on any naming convention, other than id and time:
create temp function extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
select t.*,
( select string_agg(col)
from unnest(extract_keys(cur)) as col with offset
join unnest(extract_values(cur)) as cur_val with offset using(offset)
join unnest(extract_values(prev)) as prev_val with offset using(offset)
where cur_val != prev_val and col != 'time'
) as diff_field
from (
select t, to_json_string(t) cur, to_json_string(ifnull(lag(t) over(win), t)) prev
from your_table t
window win as (partition by id order by time)
)
If applied to the sample data in your question (or rather the extended version of it that I borrowed from Jaytiger's answer), the output is:
proc tabulate data=D.Arena out=work.Arena ;
class Row1 Column1/ order=freq ;
table Row1,Column1 ;
run;
After running this I received these results, and now I want to restrict the columns to only 5 variables.
Use a 'where' statement to restrict the col1 values being tabulated.
You can restrict based on a value property such as starts with the letter A
where col1 =: 'A';
You can restrict based on a value list:
where col1 in ('Apples', 'Lentils', 'Oranges', 'Sardines', 'Cucumber');
Sample data:
data have;
call streaminit(123);
array col1s[50] $20 _temporary_ (
"Apples" "Avocados" "Bananas" "Blueberries" "Oranges" "Strawberries" "Eggs" "Lean beef" "Chicken breasts" "Lamb" "Almonds" "Chia seeds" "Coconuts" "Macadamia nuts" "Walnuts" "Asparagus" "Bell peppers" "Broccoli" "Carrots" "Cauliflower" "Cucumber" "Garlic" "Kale" "Onions" "Tomatoes" "Salmon" "Sardines" "Shellfish" "Shrimp" "Trout" "Tuna" "Brown rice" "Oats" "Quinoa" "Ezekiel bread" "Green beans" "Kidney beans" "Lentils" "Peanuts" "Cheese" "Whole milk" "Yogurt" "Butter" "Coconut oil" "Olive oil" "Potatoes" "Sweet potatoes" "Vinegar" "Dark chocolate"
);
do row1 = 1 to 20;
do _n_ = 1 to 1000;
col1 = col1s[ceil(rand('uniform',50))];
x = ceil(rand('uniform',250));
output;
end;
end;
run;
Frequency tabulation, also showing ALL counts
* col1 values shown in order by value;
proc tabulate data=have;
class row1 col1;
table ALL row1,col1;
run;
* col1 values shown in order by ALL frequency;
proc tabulate data=have;
class row1;
class col1 / order=freq;
table ALL row1,col1;
run;
* Letter T col1 values shown in order by ALL frequency;
proc tabulate data=have;
where col1 =: 'T';
class row1;
class col1 / order=freq;
table ALL row1,col1;
run;
A top-5-only list of col1s requires a step that determines which col1 values meet that criterion. The list of those values can then be used as part of a WHERE ... IN clause.
* determine the 5 col1s with highest frequency count;
proc sql noprint outobs=5;
select
quote(col1) into :top5_col1_list separated by ' '
from
( select col1, count(*) as N from have
group by col1
)
order by N descending;
quit;
proc tabulate data=have;
where col1 in (&top5_col1_list);
class row1;
class col1 / order=freq;
table ALL row1,col1;
run;
[Output screenshots: Col1s in order of value; Col1s in order of frequency; T Col1s; Top 5 Col1s]
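The two-step idea (first find the five most frequent col1 values, then filter on that list) can also be sketched in Python. This is only an illustration with made-up data; top_n_values and the records list are hypothetical names, not anything from the SAS code.

```python
from collections import Counter


def top_n_values(values, n=5):
    """Return the n most frequent values, most frequent first."""
    return [value for value, _ in Counter(values).most_common(n)]


# Hypothetical data: each food appears a different number of times.
foods = (["Apples"] * 6 + ["Lentils"] * 5 + ["Oranges"] * 4
         + ["Sardines"] * 3 + ["Cucumber"] * 2 + ["Kale"] * 1)
records = [{"row1": i % 3 + 1, "col1": food} for i, food in enumerate(foods)]

# Step 1: determine the five col1 values with the highest frequency.
top5 = top_n_values(rec["col1"] for rec in records)

# Step 2: keep only the records whose col1 is in that list (the "where in" step).
top5_set = set(top5)
filtered = [rec for rec in records if rec["col1"] in top5_set]

print(top5)
print(len(filtered))
```

The filtered records are what would then be passed on to the tabulation step.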
I have multiple columns that depend on this formula.
I have instances where an employee can have multiple assignments for the same project, and I use the formula to consolidate the rows, adding the value that would otherwise cause an extra row to the corresponding row:
I first perform =UNIQUE(A3:D) to extract the list and then:
=IF($A3<>"",join(", ",filter(Sheet1!E$3:E,Sheet1!$A$3:$A=$G3)),"")
How can I make this an ArrayFormula?
I tried it like this but the result is incorrect:
=arrayformula(IF($A2:A<>"",join(", ",filter(Sheet1!E$2:E,Sheet1!$A$2:$A=$G2:G)),""))
Here's an example spreadsheet:
https://docs.google.com/spreadsheets/d/1cLXNidk6FSZbUeU0CK3XlMPWdpBMbnKMXC5gzTfMvY0/edit?usp=sharing
Do it all in one go:
=ARRAYFORMULA(SPLIT(REGEXREPLACE(SUBSTITUTE(TRIM(TRANSPOSE(QUERY(TRANSPOSE({
QUERY(QUERY(IF(A3:A<>"", {A3:A&"♦"&B3:B&"♦"&C3:C&"♦"&D3:D, E3:E}, ),
"select Col1,count(Col1) where Col1 is not null group by Col1 pivot Col2", 0),
"select Col1 offset 1", 0),
IF(QUERY(QUERY(IF(A3:A<>"", {A3:A&"♦"&B3:B&"♦"&C3:C&"♦"&D3:D, E3:E}, ),
"select count(Col1) where Col1 is not null group by Col1 pivot Col2", 0), "offset 1",0)<>"",
QUERY(QUERY(IF(A3:A<>"", {A3:A&"♦"&B3:B&"♦"&C3:C&"♦"&D3:D, "♦"&E3:E&","}, ),
"select count(Col1) where Col1 is not null group by Col1 pivot Col2", 0), "limit 0",1), )})
,,999^99))), ", ♦", ", "), ",$", ), "♦"))
Only column K:
=QUERY(ARRAYFORMULA(SPLIT(REGEXREPLACE(SUBSTITUTE(TRIM(TRANSPOSE(QUERY(TRANSPOSE({
QUERY(QUERY(IF(A3:A<>"", {A3:A&"♦"&B3:B&"♦"&C3:C&"♦"&D3:D, E3:E}, ),
"select Col1,count(Col1) where Col1 is not null group by Col1 pivot Col2", 0),
"select Col1 offset 1", 0),
IF(QUERY(QUERY(IF(A3:A<>"", {A3:A&"♦"&B3:B&"♦"&C3:C&"♦"&D3:D, E3:E}, ),
"select count(Col1) where Col1 is not null group by Col1 pivot Col2", 0), "offset 1",0)<>"",
QUERY(QUERY(IF(A3:A<>"", {A3:A&"♦"&B3:B&"♦"&C3:C&"♦"&D3:D, "♦"&E3:E&","}, ),
"select count(Col1) where Col1 is not null group by Col1 pivot Col2", 0), "limit 0",1), )})
,,999^99))), ", ♦", ", "), ",$", ), "♦")), "select Col5", 0)
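For checking what the consolidated result should look like, here is a rough Python sketch of the goal (not of the formula's mechanics): rows that share the same first four columns are merged, and their fifth-column values are joined with ", ". The sample rows and the function name consolidate are invented for illustration, and the ordering of groups and values may differ from what the QUERY pivot produces.

```python
def consolidate(rows, key_len=4, sep=", "):
    """Merge rows sharing the first key_len cells and join the next cell's values."""
    grouped = {}
    for row in rows:
        key = tuple(row[:key_len])
        grouped.setdefault(key, []).append(str(row[key_len]))
    # dicts keep insertion order, so groups appear in order of first occurrence
    return [list(key) + [sep.join(values)] for key, values in grouped.items()]


# Hypothetical sheet rows: columns A:D identify the assignment, column E is the value.
sheet_rows = [
    ["Ann", "Project X", "Dev", "2020", "Task 1"],
    ["Bob", "Project Y", "QA", "2020", "Task 2"],
    ["Ann", "Project X", "Dev", "2020", "Task 3"],
]

for merged in consolidate(sheet_rows):
    print(merged)
```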
I want to join a SAS dataset with a lookup table, but the column (key) to join on is stored as a value in the lookup table.
Dataset: table4
ID lev1 lev2 lev3 lev4 lev5
1 12548 14589 85652 45896 45889
2 12548 14589 85652 45896 45890
3 12548 14547 85685 45845 45825
4 66588 24647 55255 30895 15764
Look up table:
context table_name column operator value
extract table1 col1 equals xyd
asset table2 var1 equals 11111
asset table2 var2 equals 25858
prod table3 x1 equals 87999
unprod table4 lev2 equals 14589
unprod table4 lev2 equals 14589
unprod table4 lev3 equals 55255
Now I want to join table4 with the lookup table, but that is only possible through the fields lev2 and lev3. The fields are dynamic and could change in the future, so I don't want to hardcode them.
I have tried the code below, but I don't want to hardcode the fields because they are dynamic (someone might add lev4 in the future).
proc sql ;
create table want as
select ID
from table4 as a
inner join lookup as b
on a.lev2 = input(value,12.) or a.lev3=input(value,12.)
where Context="unprod";
quit;
Thanks heaps in advance.
That does not look like a lookup table. It appears to be a set of rules. You could use it to generate code. Let's simplify the process by making the table contain actual code instead of three columns. But you could easily write the code to convert from your current format into code strings.
data rules ;
infile cards truncover ;
input context $ table_name $ rule $100. ;
cards;
extract table1 col1 = xyd
asset table2 var1 = 11111
asset table2 var2 = 25858
prod table3 x1 = 87999
unprod table4 lev2 = 14589
unprod table4 lev2 = 14589
unprod table4 lev3 = 55255
;
So now it looks like you want to take the rules that have a specific value of CONTEXT and use that to generate a new dataset from the dataset named in TABLE_NAME. Not sure what name you want to use for the generated table or what you want to do when more than one table is mentioned in the same "context".
%let context=unprod ;
filename code temp;
data _null_;
set rules ;
where context=symget('context');
by table_name ;
file code ;
if first.table_name then table_no+1;
if first.table_name then put
'data want' table_no ';'
/ ' set ' table_name ';'
/ ' where 1=0'
;
put ' or (' rule ')' ;
if last.table_name then put
';'
/ 'run;'
;
run;
%include code / source2 ;
Which results in code like this:
130 +data want1 ;
131 + set table4 ;
132 + where 1=0
133 + or (lev2 = 14589 )
134 + or (lev2 = 14589 )
135 + or (lev3 = 55255 )
136 +;
137 +run;
NOTE: There were 3 observations read from the data set WORK.TABLE4.
WHERE (lev2=14589) or (lev3=55255);
NOTE: The data set WORK.WANT1 has 3 observations and 6 variables.
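As a cross-check of the logic, here is a hedged Python sketch that treats the table as rules in the same way, but evaluates them directly instead of generating code. It assumes only the "equals" operator is in use and compares values as strings; matching_rows and the constant names are hypothetical.

```python
# Rules copied from the question: (context, table_name, column, operator, value).
RULES = [
    ("extract", "table1", "col1", "equals", "xyd"),
    ("asset", "table2", "var1", "equals", "11111"),
    ("asset", "table2", "var2", "equals", "25858"),
    ("prod", "table3", "x1", "equals", "87999"),
    ("unprod", "table4", "lev2", "equals", "14589"),
    ("unprod", "table4", "lev2", "equals", "14589"),
    ("unprod", "table4", "lev3", "equals", "55255"),
]

COLUMNS = ["ID", "lev1", "lev2", "lev3", "lev4", "lev5"]
TABLE4 = [dict(zip(COLUMNS, values)) for values in [
    (1, 12548, 14589, 85652, 45896, 45889),
    (2, 12548, 14589, 85652, 45896, 45890),
    (3, 12548, 14547, 85685, 45845, 45825),
    (4, 66588, 24647, 55255, 30895, 15764),
]]


def matching_rows(rows, rules, context, table_name):
    """Keep rows that satisfy at least one 'equals' rule for the context and table."""
    wanted = {(column, value) for ctx, tbl, column, op, value in rules
              if ctx == context and tbl == table_name and op == "equals"}
    return [row for row in rows
            if any(str(row.get(column)) == value for column, value in wanted)]


want = matching_rows(TABLE4, RULES, "unprod", "table4")
print([row["ID"] for row in want])  # [1, 2, 4]
```

The three selected rows agree with the three observations reported in the log above. Adding a lev4 rule to the table would be picked up without touching the function.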
Here is sample code that does what I understood you are trying to do. It is based on the comment by @Reeza. If this is not what you are after, please post a sample of the desired output.
data table4;
input ID $ lev1 $ lev2 $ lev3 $ lev4 $ lev5 $;
datalines;
1 12548 14589 85652 45896 45889
2 12548 14589 85652 45896 45890
3 12548 14547 85685 45845 45825
4 66588 24647 55255 30895 15764
;
run;
data look_up;
input context $ table_name $ column $ operator $ value $;
datalines;
extract table1 col1 equals xyd
asset table2 var1 equals 11111
asset table2 var2 equals 25858
prod table3 x1 equals 87999
unprod table4 lev2 equals 14589
unprod table4 lev2 equals 14589
unprod table4 lev3 equals 55255
;
run;
PROC transpose DATA=work.table4 out=temp1 prefix=value;
by ID;
VAR lev1-lev5;
run;
proc sql;
create table want as
select a.*, b.ID
from look_up as a
inner join temp1 as b
on a.value=b.value1 and a.column=_Name_;
quit;
I have a string as below.
select b.col1,a.col2,lower(a.col3) from table1 a inner join table2 b on a.col = b.col and a.col = b.col
inner join (select col1, col2, col3,col4 from tablename ) c on a.col1=b.col2
where
a.col = 'value'
The output needs to be table1, table2 and tablename from the above string. Please let me know the regex to get this result.
Should be a simple one :-)
WITH DATA AS (
  SELECT q'[select b.col1,a.col2,lower(a.col3) from table1 a inner join table2 b on
a.col = b.col and a.col = b.col inner join (select col1, col2, col3,col4 from tablename )
c on a.col1=b.col2 where a.col = 'value']' str
  FROM DUAL)
SELECT LISTAGG(TABLE_NAMES, ' , ') WITHIN GROUP (
         ORDER BY val) table_names
FROM
  (SELECT 1 val,
          regexp_substr(str,'table[[:alnum:]]+',1,level) table_names
   FROM DATA
   CONNECT BY level <= regexp_count(str,'table')
  )
/
TABLE_NAMES
--------------------------------------------------------------------------------
table1 , table2 , tablename
A brief explanation, so that the OP and others might find it useful:
REGEXP_SUBSTR looks for words starting with 'table', which may be followed by digits or letters such as 1, 2, name, etc.
To find all such words, I used the CONNECT BY LEVEL technique, but it returns the matches in separate rows.
Finally, to put them in a single row as comma-separated values, I used LISTAGG.
Oh yes, and q'[...]' is the alternative-quoting string literal technique.
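The same pattern can be tried quickly in Python to see what it matches. This is just a sketch using the standard re module, with [A-Za-z0-9] standing in for the POSIX class [[:alnum:]]; note that it only works here because every table name happens to start with "table", so it is not a general SQL parser.

```python
import re

# The query string from the question.
query = """select b.col1,a.col2,lower(a.col3) from table1 a inner join table2 b on a.col = b.col and a.col = b.col
inner join (select col1, col2, col3,col4 from tablename ) c on a.col1=b.col2
where
a.col = 'value'"""

# 'table' followed by one or more letters or digits, like table[[:alnum:]]+ in Oracle.
table_names = re.findall(r"table[A-Za-z0-9]+", query)

# Join the matches into one comma-separated value, like LISTAGG does.
print(" , ".join(table_names))  # table1 , table2 , tablename
```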