I have the following dataset, created for this example:
/*sample data*/
data have;
input subj param value visit$ base;
cards;
1 1 50 scr .
1 1 55 rand 55
1 1 . 1 55
1 1 . 2 55
1 2 120 scr .
1 2 125 rand 125
1 2 . 1 125
1 2 . 2 125
;
run;
I want the base value on the scr row to equal 'base' from the visit='rand' row, so that the data looks like the following:
/*desired output*/
data want;
input subj param value visit$ base;
cards;
1 1 50 scr 55
1 1 55 rand 55
1 1 . 1 55
1 1 . 2 55
1 2 120 scr 125
1 2 125 rand 125
1 2 . 1 125
1 2 . 2 125
;
run;
Double do-loop:
data want;
do until(last.param);
set have;
by subj param notsorted;
if visit='rand' then temp=base;
end;
do until(last.param);
set have;
by subj param notsorted;
if visit='scr' then base=temp;
output;
end;
drop temp;
run;
Here's a different approach, which uses SQL and joins the table with itself.
I'm not sure how to get back your original order; if that's important, you may need to add a sort variable ahead of time.
proc sql;
create table want as
select t1.subj, t1.param, t1.value, t1.visit,
case when visit='scr' then t2.base
else t1.base
end as base
from have as t1
left join (select subj, param, base from have where visit = 'rand') as t2
on t1.subj=t2.subj
and t1.param=t2.param
order by 1, 2, 3, 4;
quit;
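If the original row order matters, one option is to record it before the join. This is only a sketch: the data set name have_seq and the variable seq are hypothetical helpers, and it assumes there is at most one 'rand' row per subj/param.

```sas
/* Record the original row order in a helper variable (seq is hypothetical) */
data have_seq;
  set have;
  seq = _n_;
run;

proc sql;
  create table want as
  select t1.subj, t1.param, t1.value, t1.visit,
         case when t1.visit='scr' then t2.base
              else t1.base
         end as base
  from have_seq as t1
  left join (select subj, param, base from have where visit = 'rand') as t2
    on t1.subj=t2.subj
   and t1.param=t2.param
  order by t1.seq;  /* restore the original order */
quit;
```

PROC SQL writes a note to the log because seq is not in the SELECT list, but the rows still come back in their original order.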
Lots of ways to do this. Assuming the data is not huge in size, use a hash table to update the values.
Get the %create_hash() macro from here
data have;
input subj param value visit$ base;
cards;
1 1 50 scr .
1 1 55 rand 55
1 1 . 1 55
1 1 . 2 55
1 2 120 scr .
1 2 125 rand 125
1 2 . 1 125
1 2 . 2 125
;
run;
data want;
set have;
if _n_ =1 then do;
%create_hash(lk,subj param,base,"have(where=(visit='rand'))");
/* The macro above generates the following code
declare hash lk(dataset:"have(where=(visit='rand'))");
rc = lk.definekey( "subj", "param" );
rc = lk.definedata("base");
rc = lk.definedone();
*/
end;
rc = lk.find();
drop rc;
run;
This reads the data set into memory and gives you a hash-based lookup of the values. The where= data set option restricts the load to the rand records only.
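If you don't have the macro to hand, the same step can be written out directly. This is a sketch that inlines the generated code shown in the comment above, run against the same have data set:

```sas
data want;
  set have;
  if _n_ = 1 then do;
    /* Load only the 'rand' rows, keyed by subj and param */
    declare hash lk(dataset: "have(where=(visit='rand'))");
    rc = lk.definekey("subj", "param");
    rc = lk.definedata("base");
    rc = lk.definedone();
  end;
  /* On a key match, base is overwritten with the 'rand' value;
     if there is no match, base keeps the value read from HAVE */
  rc = lk.find();
  drop rc;
run;
```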
My goal is to combine these tables into one without having to manually run my macro each time for each column.
The code I currently have with me is the following:
%macro task_Oct(set,col_name);
data _type_;
set &set;
call symputx('col_type', vtype(&col_name));
run;
proc sql;
create table work.oct27 as
select "&col_name" as variable,
"&col_type" as type,
nmiss(&col_name) as missing_val,
count(distinct &col_name) as distinct_val
from &set;
quit;
%mend task_Oct;
%task_Oct(sashelp.cars,Origin)
The above code gives me the following output:
|Var |Type |missing_val|distinct_val|
|Origin|Character|0 | 3 |
But the sashelp.cars data sheet has 15 columns and so I would like to output a new data sheet which has 15 rows with 4 columns.
I would like to get the following combined table as the output of my code:
|Var |Type |missing_val|distinct_val|
|Make |Character|0 | 38 |
|Model |Character|0 | 425 |
|Type |Character|0 | 6 |
|Origin|Character|0 | 3 |
...
...
...
Since I'm using a macro, I could run my code 15 different times by manually entering the names of the columns and then merging the tables into 1; and it wouldn't be a problem. But what if I have a table with 100s of columns? I could use some loop statement but I'm not sure how to go about that in this case. Help would be appreciated. Thank you.
The main output you appear to want can be generated directly by PROC FREQ with the NLEVELS option. If you want to add in the variable type then just merge it with the output of PROC CONTENTS.
ods exclude all;
ods output nlevels=nlevels;
proc freq data=sashelp.cars nlevels;
tables _all_ / noprint ;
run;
ods select all;
proc contents data=sashelp.cars noprint out=contents;
run;
proc sql;
create table want(drop=table:) as
select c.varnum,c.name,c.type,n.*
from contents c inner join nlevels n
on c.name=n.TableVar
order by varnum
;
quit;
Result
Obs VARNUM NAME TYPE NLevels NMissLevels NNonMissLevels
1 1 Make 2 38 0 38
2 2 Model 2 425 0 425
3 3 Type 2 6 0 6
4 4 Origin 2 3 0 3
5 5 DriveTrain 2 3 0 3
6 6 MSRP 1 410 0 410
7 7 Invoice 1 425 0 425
8 8 EngineSize 1 43 0 43
9 9 Cylinders 1 8 1 7
10 10 Horsepower 1 110 0 110
11 11 MPG_City 1 28 0 28
12 12 MPG_Highway 1 33 0 33
13 13 Weight 1 348 0 348
14 14 Wheelbase 1 40 0 40
15 15 Length 1 67 0 67
The NMissLevels variable counts the number of distinct types of missing values.
If instead you want to count the number of observations with (any) missing value you will need to use code generation. So use the CONTENTS data to generate an SQL query to generate all of the counts you want into a single observation with only one pass through the data. You can then transpose that to make it usable for re-merging with the CONTENTS data.
filename code temp;
data _null_;
set contents end=eof;
length nliteral $65 dsname $80;
nliteral=nliteral(name);
dsname = catx('.',libname,nliteral(memname));
file code;
if _n_=1 then put 'create table counts as select ' / ' ' @ ;
else put ',' @;
put 'nmiss(' nliteral ') as missing_' varnum
/',count(distinct ' nliteral ') as distinct_' varnum
;
if eof then put 'from ' dsname ';';
run;
proc sql;
%include code /source2;
quit;
proc transpose data=counts out=count2 name=name ;
run;
proc sql ;
create table want as
select c.varnum, c.name, c.type
, m.col1 as nmissing
, d.col1 as ndistinct
from contents c
left join count2 m on m.name like 'missing%' and c.varnum=input(scan(m.name,-1,'_'),32.)
left join count2 d on d.name like 'distinct%' and c.varnum=input(scan(d.name,-1,'_'),32.)
order by varnum
;
quit;
Result
Obs VARNUM NAME TYPE nmissing ndistinct
1 1 Make 2 0 38
2 2 Model 2 0 425
3 3 Type 2 0 6
4 4 Origin 2 0 3
5 5 DriveTrain 2 0 3
6 6 MSRP 1 0 410
7 7 Invoice 1 0 425
8 8 EngineSize 1 0 43
9 9 Cylinders 1 2 7
10 10 Horsepower 1 0 110
11 11 MPG_City 1 0 28
12 12 MPG_Highway 1 0 33
13 13 Weight 1 0 348
14 14 Wheelbase 1 0 40
15 15 Length 1 0 67
Say I have a dataset like this
day product sales
1 a 1 48
2 a 2 55
3 a 3 88
4 b 2 33
5 b 3 87
6 c 1 97
7 c 2 95
On day "b" there were no sales for product 1, so there is no row where day = b and product = 1. Is there an easy way to add a row with day = b, product = 1 and sales = 0, and similar "missing" rows to get a dataset like this?
day product sales
1 a 1 48
2 a 2 55
3 a 3 88
4 b 1 0
5 b 2 33
6 b 3 87
7 c 1 97
8 c 2 95
9 c 3 0
In R you can do complete(df, day, product, fill = list(sales = 0)). I realize you can accomplish this with a self-join in proc sql, but I'm wondering if there is a procedure for this.
In this particular example you can also use the SPARSE option in PROC FREQ. It tells SAS to generate every combination of the DAY and PRODUCT values that occur in the data, similar to a cross join between those values. It cannot add a value that does not already appear somewhere in the table; you would need a different method in that case.
data have;
input n day $ product sales;
datalines;
1 a 1 48
2 a 2 55
3 a 3 88
4 b 2 33
5 b 3 87
6 c 1 97
7 c 2 95
;;;;
run;
proc freq data=have noprint;
table day*product / out=want sparse;
weight sales;
run;
proc print data=want;run;
There are, as usual in SAS, about a dozen ways to do this. Here's my favorite.
data have;
input n day $ product sales;
datalines;
1 a 1 48
2 a 2 55
3 a 3 88
4 b 2 33
5 b 3 87
6 c 1 97
7 c 2 95
;;;;
run;
proc means data=have completetypes;
class day product;
types day*product;
var sales;
output out=want sum=;
run;
completetypes tells SAS to output a row for every combination of class values, including combinations that don't occur in the data. You could then use proc stdize to set the missing sums to 0 (if you need them to be 0). It might be possible to do the whole thing with proc stdize in the first place, but unfortunately I'm not as familiar with that proc.
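A minimal sketch of that proc stdize step, assuming the want data set produced by the proc means above (the output name want_zero is hypothetical):

```sas
/* REPONLY replaces missing values without standardizing anything;
   MISSING=0 is the replacement value */
proc stdize data=want out=want_zero reponly missing=0;
  var sales;
run;
```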
You can do this with proc freq using the sparse option.
Code:
proc freq data=have noprint;
table day*product /sparse out=freq (drop=percent);
run;
Output:
day=a product=1 COUNT=1
day=a product=2 COUNT=1
day=a product=3 COUNT=1
day=b product=1 COUNT=0
day=b product=2 COUNT=1
day=b product=3 COUNT=1
day=c product=1 COUNT=1
day=c product=2 COUNT=1
day=c product=3 COUNT=0
I have observations with column ID, a, b, c, and d. I want to count the number of unique values in columns a, b, c, and d. So:
I want:
I can't figure out how to count distinct values within each row. I can do it across multiple rows, but not across the columns within a row.
Any help would be appreciated. Thank you
********************************************UPDATE*******************************************************
Thank you to everyone that has replied!!
I used a different method (that is less efficient) that I felt I understood more. I am still going to look into the ways listed below however to learn the correct method. Here is what I did in case anyone was wondering:
I created four tables, and in each one I created a variable named, for example, ‘abcd’ and copied one of the columns into it.
So it was something like this:
PROC SQL;
CREATE TABLE table1_a AS
SELECT
*,
a as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table2_b AS
SELECT
*,
b as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table3_c AS
SELECT
*,
c as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table4_d AS
SELECT
*,
d as abcd
FROM table_I_have_with_all_columns
;
QUIT;
Then I stacked them (this means I have duplicate rows, but that's ok because I just want all of the variables in one column so I can do a distinct count).
data ALL_STACK;
set
table1_a
table2_b
table3_c
table4_d
;
run;
Then I counted all unique values in ‘abcd’ grouped by ID
PROC SQL ;
CREATE TABLE count_unique AS
SELECT
My_id,
COUNT(DISTINCT abcd) as Count_customers
FROM ALL_STACK
GROUP BY my_id
;
QUIT;
Obviously, it’s not efficient to replicate a table four times just to put each variable under the same name and then stack the copies. But my tables were small enough that I could do it and then delete them immediately after the stack. If you have a very large dataset, this method would almost certainly be troublesome. I used this method over the others because I was trying to use procs rather than loops.
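For anyone who also prefers procs, the four copies and the stack could probably be replaced by a single PROC TRANSPOSE. This is a sketch under assumptions: the data set name have and the key id are placeholders for the real names, and id is assumed to be unique per row.

```sas
/* BY-group processing needs the data sorted by the key */
proc sort data=have out=have_sorted;
  by id;
run;

/* One output row per id and source column; the value lands in abcd */
proc transpose data=have_sorted out=long(rename=(col1=abcd));
  by id;
  var a b c d;
run;

proc sql;
  create table count_unique as
  select id,
         count(distinct abcd) as count_distinct  /* missing values are not counted */
  from long
  group by id;
quit;
```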
A linear search for duplicates in an array is O(n^2) and perfectly fine for small n. Here n is four (a, b, c and d).
The search evaluates every pair in the array and has a flow very similar to a bubble sort.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
run;
The linear search for duplicates will occur on every row, and the count_distinct will be initialized automatically in each row to a missing (.) value. The sum function is used to increment the count when a non-missing value is not found in any prior array indices.
* linear search O(N**2);
data want;
set have;
array x a b c d;
do i = 1 to dim(x) while (missing(x(i)));
end;
if i <= dim(x) then count_distinct = 1;
do j = i+1 to dim(x);
if missing(x(j)) then continue;
do k = i to j-1 ;
if x(k) = x(j) then leave;
end;
if k = j then count_distinct = sum(count_distinct,1);
end;
drop i j k;
run;
Try transposing the dataset so that each ID becomes one column. Then run PROC FREQ with the NLEVELS option on those ID columns, which counts the number of distinct values in each, and merge the counts back with the original dataset.
Proc transpose data=have prefix=ID out=temp;
id ID;
run;
Proc freq data=temp nlevels;
table ID:;
ods output nlevels=count(keep=TableVar NNonMisslevels);
run;
data count;
set count;
ID=input(compress(TableVar,,'kd'),32.); /* numeric, to match id in HAVE (assumed numeric) */
drop TableVar;
run;
data want;
merge have count;
by id;
run;
One more way, using call sortn and a few conditions.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
77 . 3 . 4
88 . 9 5 .
99 . . 2 2
76 . . . 2
58 1 1 . .
50 2 . 2 .
66 2 . 7 .
89 1 1 1 .
75 1 2 3 .
76 . 5 6 7
88 . 1 1 1
43 1 . . 1
31 1 . . 2
;
data want;
set have;
_a=a; _b=b; _c=c; _d=d;
array hello(*) _a _b _c _d;
call sortn(of hello(*));
if a=. and b = . and c= . and d =. then count=0;
else count=1;
do i = 1 to dim(hello)-1;
if hello(i) = . then count+ 0;
else if hello(i)-hello(i+1) = . then count+0;
else if hello(i)-hello(i+1) = 0 then count+ 0;
else if hello(i)-hello(i+1) ne 0 then count+ 1;
end;
drop i _:;
run;
You could just put the unique values into a temporary array. Let's convert your photograph into data.
data have;
input id a b c d;
datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
;
So make an array of the input variables and another temporary array to hold the unique values. Then loop over the input variables and save the unique values. Finally count how many unique values there are.
data want ;
set have ;
array unique (4) _temporary_;
array values a b c d ;
call missing(of unique(*));
do _n_=1 to dim(values);
if not missing(values(_n_)) then
if not whichn(values(_n_),of unique(*)) then
unique(_n_)=values(_n_)
;
end;
count=n(of unique(*));
run;
Output:
Obs id a b c d count
1 11 2 3 4 4 3
2 22 1 8 1 1 2
3 33 6 . 1 2 3
4 44 . 1 1 . 1
I want to insert a record holding the mean into the data set, according to the identifier variables. The data set looks like DS1, and I want to insert a mean row wherever an a-b pair has more than one record, so that the target data set looks like DS2. Thanks, my friends.
data DS1;
input a b c;
cards;
1 2 23
1 2 43
1 2 23
1 3 55
1 4 48
2 1 43
2 1 56
2 2 34
;
run;
data DS2;
input a b c;
cards;
1 2 23
1 2 43
1 2 23
1 2 29.67
1 3 55
1 4 48
2 1 43
2 1 56
2 1 49.5
2 2 34
;
run;
Why SQL? If you're going to request a specific solution, it's good to know why. Here are two methods, one using a data step and the other SQL. Essentially, the SQL solution calculates the means and appends them to the data set with UNION ALL. The data step calculates the means as it passes through the data set, so it needs only one pass through the data and maintains the order of the original data set.
data want_datastep;
set ds1;
by a b;
retain sum count;
drop sum count; /* keep only a, b and c in the output */
if first.b then do;
sum=0;
count=0;
end;
sum=sum+c;
count+1;
if last.b and not first.b then do;
output;
c=sum/count;
output;
end;
else output;
run;
proc sql;
create table want_sql as
select * from
(select ds1.* from ds1 as ds1
union all
(select ds1_x.a, ds1_x.b, mean(ds1_x.c) as c
from ds1 as ds1_x
group by ds1_x.a, ds1_x.b
having count(ds1_x.b)>1))
order by a, b, c;
quit;
I am trying to detect groups in which the difference between the first age and the second age is greater than 5. For example, if I have the following data, the difference between the ages in grp=1 is 39, so I want to output that group to a separate data set. Same goes for grp 4.
id grp age sex
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
My initial idea was to sort them by grp and then get the absolute difference between the ages using something like if first.grp then do;. But I don't know how to get the absolute difference between the first age and the second age by group, or really how I should start.
Thanks in advance.
Here's one way that I think works.
data have;
input id $ grp $ age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
proc sort data=have ;
by grp descending age;
run;
data temp(keep=grp);
retain old;
set have;
by grp descending age;
if first.grp then old=age;
if last.grp then do;
diff=old-age;
if diff>5 then output ;
end;
run;
Data want;
merge temp(in=a) have(in=b);
by grp ;
if a and b;
run;
I would use PROC TRANSPOSE so the values in each group can easily be compared. For example:
data groups1;
input id $ grp age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
run;
proc sort data=groups1;
by grp; /* This maintains age order */
run;
proc transpose data=groups1 out=groups2;
by grp;
var age;
run;
With the transposed data you can do whatever comparison you like (I can't tell from your question what exactly you want, so I just compare first two ages):
/* With all ages of a particular group in a single row, it is easy to compare */
data outgroups1(keep=grp);
set groups2;
if abs(col1-col2)>5 then output;
run;
In this instance this would be my preferred method for creating a separate data set for each group that satisfies whatever condition is applied (generate and include code dynamically):
/* A separate data set per GRP value in OUTGROUPS1 */
filename dynacode catalog "work.dynacode.mycode.source";
data _null_;
set outgroups1;
file dynacode;
put "data grp" grp ";";
put " set groups1(where=(grp=" grp "));";
put "run;" /;
run;
%inc dynacode;
If you are after the difference between just the 1st and 2nd ages, then the following code is a fairly straightforward way of extracting these. It reads through the dataset to identify the groups, then uses the direct access method, POINT=, to extract the relevant records. I put in an extra condition, grp=lag(grp), just in case you have any groups with only 1 record.
data want;
set have;
by grp;
if first.grp then do;
num_grp=0;
outflag=0;
end;
outflag+ifn(lag(first.grp)=1 and grp=lag(grp) and abs(dif(age))>5,1,0) /* set flag to determine if group meets criteria */;
if not first.grp then num_grp+1; /* count number of records in group */
if last.grp and outflag=1 then do i=_n_-num_grp to _n_;
set have point=i; /* extract required group records */
drop num_grp outflag;
output;
end;
run;
Here's an SQL approach (using CarolinaJay's code to create the dataset):
data groups1;
input id grp age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
run;
proc sql noprint;
create table xx as
select a.*
from groups1 a
where grp in (select b.grp
from groups1 b
join groups1 c on c.id = b.id+1
and c.grp = b.grp
and abs(c.age - b.age) > 5
left join groups1 d on d.id = b.id-1
and d.grp = b.grp
where d.id eq .
)
;
quit;
The join on C finds all occurrences where the subsequent record in the same group has an absolute age difference > 5. The join on D (and the where clause) makes sure we only consider the results from the C join if the record is the very first record in the group.
data have;
input id $ grp $ age sex $;
datalines;
1 1 60 M
2 1 21 M
3 2 30 M
4 2 25 F
5 3 45 F
6 3 30 F
7 3 18 M
8 4 32 M
9 4 18 M
10 4 16 M
;
data want;
do i = 1 by 1 until(last.grp);
set have;
by grp notsorted;
if first.grp then cnt = 0;
cnt + 1;
if cnt = 1 then age1 = age;
if cnt = 2 then age2 = age;
if cnt >= 2 then diff = abs(age1 - age2); /* absolute difference; stays missing for single-record groups */
end;
do until(last.grp);
set have;
by grp notsorted;
if diff > 5 then output;
end;
run;