I need to aggregate about ten different vars on different groupings using Proc SQL;
Is there a way to achieve SUM () OVER ( [ partition_by_clause ] order_by_clause) in one sql query with different partition by clauses.
I've made an example here
data have;
infile cards;
input a b c d e f;
cards;
1 2 3 4 5
2 2 4 5 6
1 4 3 4 7
3 4 4 5 8
;
run;
proc sql;
create table want as
select *,
sum a over partiton by (b,c) as a1,
sum b over partiton by (c,d) as b1
sum c over partiton by (d,e) as c1
sum d over partiton by (a,c) as d1
from have
;
quit;
I don't want to wirte multiple sql queries and grouping on different vars and calculating one var in each step.
Hope that makes sense.
Proc SQL does not implement windowing functions and thus partition syntax therein as found in other SQL implementations. You can only do partition by with passthrough SQL to a connection that allows such syntax.
You could perform such a computation in DATA step using hashes.
data have;
infile cards;
input a b c d e ;
cards;
1 2 3 4 5
2 2 4 5 6
1 4 3 4 7
3 4 4 5 8
;
run;
data want;
if 0 then set have;
length a1 b1 c1 d1 8;
declare hash a1s();
a1s.defineKey('b', 'c');
a1s.defineData('a1');
a1s.defineDone();
declare hash b1s();
b1s.defineKey('c', 'd');
b1s.defineData('b1');
b1s.defineDone();
declare hash c1s();
c1s.defineKey('d', 'e');
c1s.defineData('c1');
c1s.defineDone();
declare hash d1s();
d1s.defineKey('a', 'c');
d1s.defineData('d1');
d1s.defineDone();
do while (not end);
set have end=end;
if a1s.find() = 0 then a1+a; else a1=a; a1s.replace();
if b1s.find() = 0 then b1+b; else b1=b; b1s.replace();
if c1s.find() = 0 then c1+c; else c1=c; c1s.replace();
if d1s.find() = 0 then d1+d; else d1=d; d1s.replace();
end;
do while (not last);
set have end=last;
a1s.find();
b1s.find();
c1s.find();
d1s.find();
output;
end;
format _numeric_ 4.;
stop;
run;
I want to summarize a dataset by creating a vector that gives information on what departments the id is found in. For example,
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
I want
id dept_vect
1 1111
2 0010
3 0001
4 1000
5 1001
The position of the elements of the dept_vect is organized alphabetically. So a '1' in the first position means that the id is found in deptartment A and a '1' in the second position means that the id is found in department B. A '0' means the id is not found in the department.
I can solve this problem using a brute force approach
proc transpose data = test out = test1(drop = _NAME_);
by id;
var dept;
run;
data test2;
set test1;
array x[4] $ col1-col4;
array d[4] $ d1-d4;
do i = 1 to 4;
if not missing(x[i]) then do;
if x[i] = 'A' then d[1] = 1;
else if x[i] = 'B' then d[2] = 1;
else if x[i] = 'C' then d[3] = 1;
else if x[i] = 'D' then d[4] = 1;
end;
else leave;
end;
do i = 1 to 4;
if missing(d[i]) then d[i] = 0;
end;
dept_id = compress(d1) || compress(d2) || compress(d3) || compress(d4);
keep id dept_id;
run;
This works but there are a couple of problems. For col4 to appear, I need at least one id to be found on all departments but that could be fixed by creating a dummy id so that id is found on all departments. But the main problem is that this code is not robust. Is there a way to code this so that it would work for any number of departments?
Add a 1 to get a count variable
Transpose using PROC TRANSPOSE
Replace missing with 0
Use CATT() to create desired results.
data have;
input id dept $;
count = 1;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transpose data=test out=wide prefix=dept;
by id;
id dept;
var count;
run;
data want;
set wide;
array _d(*) dept:;
do i=1 to dim(_d);
if missing(_d(i)) then _d(i) = 0;
end;
want = catt(of _d(*));
run;
Maybe TRANSREG can help with this.
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transreg;
id id;
model class(dept / zero=none);
output design out=dummy(drop=dept);
run;
proc print;
run;
proc summary nway;
class id;
output out=want(drop=_type_) max(dept:)=;
run;
proc print;
run;
I have the following data set:
data data_one;
length X 3
Y $ 20;
input x y ;
datalines;
1 test
2 test
3 test1
4 test1
5 test
6 test
7 test1
run;
data data_two;
length Z 3
A $ 20;
input Z A;
datalines;
1 test
2 test1
3 test2
run;
What I would like to have is a data set which tells me how often column Y in data_one contains the same string of column A in data_two. The result should look like this one:
Obs test test1 test2
1 4 3 0
Thanks in advance!
First we need the counts for those values of Y present in data_one.
Then we create a sorted (for the next merge) list of the values present in data_two.
The data_one Y counts from 1. are merged with the list from 2.
The Y values present in data_two but not in data_one (b and not a) are assigned count=0, the Y values not present in data_two are discarded (if b).
The last passage transposes the vertical list of counts in an horizontal set of variables.
proc freq data=data_one noprint;
table y / out=count_one (keep=y count);
run;
proc sort data=data_two out=list_two (keep=a rename=(a=y)) nodupkey;
by a;
run;
data count_all;
merge count_one (in=a) list_two (in=b);
by y;
if (b and not a) then count=0;
if b;
run;
proc transpose data=count_all out=final (drop=_name_ _label_);
id y;
run;
The first 3 steps can be replaced with one proc SQL:
proc sql;
create table count_all as
select distinct
coalesce(t1.y,t2.a) as y,
case
when missing(t1.y) then 0
else count(t1.y)
end as N
from data_one as t1
right join data_two as t2
on t1.y=t2.a
group by 1
order by 1;
quit;
proc transpose data=count_all out=final (drop=_name_);
id y;
run;
Suppose a data are as follows:
A B C
1 3 2
1 4 9
2 6 0
2 7 3
where A B and C are the variable names.
Is there a way to transform the table to
A 1
A 1
A 2
A 2
B 3
B 4
B 6
B 7
C 2
C 9
C 0
C 3
Expanding on the advice from #donPablo, here's how you would code it. Create an array to read across the data, then output each iteration of that array so you end up with the number of rows being the rows * columns from the original dataset. The VNAME function enables you to store the variable name (A, B, C) as a value in a separate variable.
data have;
input A B C;
datalines;
1 3 2
1 4 9
2 6 0
2 7 3
;
run;
data want;
set have;
length var1 $10;
array vars{*} _numeric_;
do i=1 to dim(vars);
var1=vname(vars{i});
var2=vars{i};
keep var1 var2;
output;
end;
run;
proc sort data=want;
by var1;
run;
The least amount of (expensive) development time might be --
Read and store the first row
For each subsequent row
Read the row
Create three records
Until end
Sort
How many times will this be run? Per day/ per year?
What number of rows are there?
Might we save 1 hr / month? 1 min / year? Something will need to read the entire file. Optomize last. Make it work first.
tkx
It should work correctly:
DATA A(keep A);
new_var = 'A';
SET your_data;
RUN;
DATA B(keep B);
new_var = 'B';
SET your_data;
RUN;
DATA C(keep C);
new_var = 'C';
SET your_data;
RUN;
PROC APPEND base=A data=B FORCE;
RUN;
PROC APPEND base=A data=C FORCE;
RUN;
Data A is a result data set.
I have a dataset similar to the one below
ID A B C D E
1 1
1 1
1 1
2 1
2 1
3 1
3 1
4 1
5 1
I want to condense the data into one row for each ID. So the dataset would look like the one below.
ID A B C D E
1 1 1 1
2 1 1
3 1 1
4 1
5 1
Well I created another table and removed the duplicate ID's. So I have two tables--A and B. I then tried merging the two datasets together. I was playing around with following SAS code.
data C;
merge A B;
by ID;
run;
Here's a neat trick I picked up from another forum. There's no need to split up the original dataset, the first update statement creates the structure and the second updates the values. The BY statement ensures you only get 1 record per ID.
data have;
infile datalines dsd;
input ID A B C D E;
datalines;
1,1,,,,,
1,,,1,,,
1,,1,,,,
2,,1,,,,
2,,,,1,,
3,,,,,1,
3,1,,,,,
4,,,1,,,
5,,1,,,
;
run;
data want;
update have (obs=0) have;
by id;
run;
This could be solved using the retain statement.
data B(rename=(A2=A B2=B C2=C D2=D));
set A;
by id;
retain A2 B2 C2 D2;
if first.id then do;
A2 = .;
B2 = .;
C2 = .;
D2 = .;
end;
if A ne . then A2=A;
if B ne . then B2=B;
if C ne . then C2=C;
if D ne . then D2=D;
if last.id then output;
drop A B C D;
run;
There are other ways to solve this, but hopefully this is helpful.
PROC MEANS is a great tool for something like this. PROC SQL would also give you a reasonable solution, but MEANS is faster.
proc means data=yourdata;
var a b c d e;
class id;
types id; *to avoid the 'overall' row;
output out=yourdata max=; *output the maximum of each var for each ID - use SUM instead if you want more than 1;
run;