data test;
input A B C D$ E;
datalines;
. 70 . Mike 2
1 80 21 Tony 3
2 10 0 . 4
3 . 0 Lew .
3 9 4 . .
;
run;
data test2;
set test;
Total=A+B+C;
run;
data test3;
set test;
if A=. then A=0;
if B=. then B=0;
if C=. then C=0;
Total=A+B+C;
run;
I want to sum columns A, B and C, but they contain missing values, so I first have to replace every missing value with 0 (in test3) to get a number. Is there a more elegant way to do it? My method looks very awkward. Thanks for any help.
Use functions so you don't have to replace the missing values.
SUM function should do it - notice the difference in your output via the two methods shown below.
data out;
set test;
sum_func = sum(A,B,C);
sum = A+B+C;
keep sum: A B C;
run;
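One caveat: when every argument is missing, SUM itself returns missing rather than 0. A minimal sketch, assuming you want 0 in that case (the dataset name out2 and the variable names total and total0 are only illustrative, not from the question):

```sas
data out2;
  set test;
  /* SUM ignores missing arguments; the OF keyword accepts a variable list */
  total = sum(of A B C);
  /* adding a literal 0 makes the result 0 when A, B and C are all missing */
  total0 = sum(0, of A B C);
run;
```

With the sample data no row has A, B and C all missing, so total and total0 agree there; they only differ on a row where all three are missing.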
I have observations with column ID, a, b, c, and d. I want to count the number of unique values in columns a, b, c, and d. So:
I want:
I can't figure out how to count distinct values within each row. I can do it across multiple rows, but I don't know how to do it across the columns of a single row.
Any help would be appreciated. Thank you.
********************************************UPDATE*******************************************************
Thank you to everyone that has replied!!
I used a different method (that is less efficient) that I felt I understood more. I am still going to look into the ways listed below however to learn the correct method. Here is what I did in case anyone was wondering:
I created four tables, and in each one I added a variable named, for example, ‘abcd’ that holds one of the four columns.
So it was something like this:
PROC SQL;
CREATE TABLE table1_a AS
SELECT
*,
a as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table2_b AS
SELECT
*,
b as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table3_c AS
SELECT
*,
c as abcd
FROM table_I_have_with_all_columns
;
QUIT;
PROC SQL;
CREATE TABLE table4_d AS
SELECT
*,
d as abcd
FROM table_I_have_with_all_columns
;
QUIT;
Then I stacked them (this means I have duplicate rows, but that's OK because I just want all of the variables in one column so that I can do a distinct count).
data ALL_STACK;
set
table1_a
table2_b
table3_c
table4_d
;
run;
Then I counted all unique values in ‘abcd’ grouped by ID
PROC SQL ;
CREATE TABLE count_unique AS
SELECT
My_id,
COUNT(DISTINCT abcd) as Count_customers
FROM ALL_STACK
GROUP BY my_id
;
QUIT;
Obviously, it’s not efficient to replicate a table 4 times just to put the variables under the same name and then stack them. But my tables were small enough that I could do it and then immediately delete them after the stack. If you have a very large dataset, this method would almost certainly be troublesome. I used this method over the others because I was trying to use Procs more than loops, etc.
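One possible way to avoid the four intermediate tables is sketched below. It assumes that a, b, c and d all have the same type and that the key is called my_id, as in the query above. A single DATA step writes one output row per column value, producing the same stacked shape:

```sas
data ALL_STACK;
  set table_I_have_with_all_columns;
  array cols a b c d;        /* assumes the four columns share one type */
  do i = 1 to dim(cols);
    abcd = cols(i);          /* copy each column value into the common name */
    output;                  /* one output row per column */
  end;
  keep my_id abcd;
run;
```

The COUNT(DISTINCT abcd) query can then run against ALL_STACK unchanged.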
A linear search for duplicates in an array is O(n^2) and perfectly fine for small n. The n for a b c d is four.
The search evaluates every pair in the array and has a flow very similar to a bubble sort.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
;
run;
The linear search for duplicates runs on every row, and count_distinct is automatically initialized to a missing (.) value at the start of each row. The sum function increments the count whenever a non-missing value is not found at any earlier array index.
* linear search O(N**2);
data want;
set have;
array x a b c d;
do i = 1 to dim(x) while (missing(x(i)));
end;
if i <= dim(x) then count_distinct = 1;
do j = i+1 to dim(x);
if missing(x(j)) then continue;
do k = i to j-1 ;
if x(k) = x(j) then leave;
end;
if k = j then count_distinct = sum(count_distinct,1);
end;
drop i j k;
run;
Try transposing the dataset so that each ID becomes one column, run PROC FREQ on each ID column with the NLEVELS option, which counts the number of distinct values, and then merge the result back with the original dataset.
Proc transpose data=have prefix=ID out=temp;
id ID;
run;
Proc freq data=temp nlevels;
table ID:;
ods output nlevels=count(keep=TableVar NNonMisslevels);
run;
data count;
set count;
ID=input(compress(TableVar,,'kd'),best12.); /* numeric, so it matches id in have for the merge */
drop TableVar;
run;
data want;
merge have count;
by id;
run;
One more way, using sortn and conditions.
data have;
input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
77 . 3 . 4
88 . 9 5 .
99 . . 2 2
76 . . . 2
58 1 1 . .
50 2 . 2 .
66 2 . 7 .
89 1 1 1 .
75 1 2 3 .
76 . 5 6 7
88 . 1 1 1
43 1 . . 1
31 1 . . 2
;
data want;
set have;
_a=a; _b=b; _c=c; _d=d;
array hello(*) _a _b _c _d;
call sortn(of hello(*));
if a=. and b = . and c= . and d =. then count=0;
else count=1;
do i = 1 to dim(hello)-1;
if hello(i) = . then count+ 0;
else if hello(i)-hello(i+1) = . then count+0;
else if hello(i)-hello(i+1) = 0 then count+ 0;
else if hello(i)-hello(i+1) ne 0 then count+ 1;
end;
drop i _:;
run;
You could just put the unique values into a temporary array. Let's convert your photograph into data.
data have;
input id a b c d;
datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
;
So make an array of the input variables and another temporary array to hold the unique values. Then loop over the input variables and save the unique values. Finally count how many unique values there are.
data want ;
set have ;
array unique (4) _temporary_;
array values a b c d ;
call missing(of unique(*));
do _n_=1 to dim(values);
if not missing(values(_n_)) then
if not whichn(values(_n_),of unique(*)) then
unique(_n_)=values(_n_)
;
end;
count=n(of unique(*));
run;
Output:
Obs id a b c d count
1 11 2 3 4 4 3
2 22 1 8 1 1 2
3 33 6 . 1 2 3
4 44 . 1 1 . 1
I have a series of string variables with missing observations, and I would like to fill them by flat (uniform) substitution. For instance, if variable x has 3 available values, a missing value should have a 33.333% chance of being assigned each of them. How would I do this?
DATA have;
INPUT id a $ b $ c $ x;
CARDS;
1 Y Male . 5
2 Y Female . 4
3 . Female Tall 4
4 Y . Short 2
5 N Male Tall 1
;
Run;
You could use temporary arrays to store the possible values. Then generate a random index into the array.
DATA have;
INPUT id a $ b $ c $ x;
CARDS;
1 Y Male . 5
2 Y Female . 4
3 . Female Tall 4
4 Y . Short 2
5 N Male Tall 1
;
data want ;
set have ;
array possible_b (2) $8 _temporary_ ('Male','Female') ;
if missing(b) then b=possible_b(1+int(rand('uniform')*dim(possible_b)));
run;
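If the draws need to be repeatable, the random stream can be seeded. A minimal sketch of the same idea, where the seed value 12345 and the dataset name want_seeded are arbitrary choices for illustration:

```sas
data want_seeded;
  set have;
  array possible_b (2) $8 _temporary_ ('Male','Female');
  /* seed once, before the first RAND call, so reruns give the same draws */
  if _n_ = 1 then call streaminit(12345);
  /* rand('uniform') lies strictly between 0 and 1, so CEIL yields 1..dim */
  if missing(b) then b = possible_b(ceil(dim(possible_b) * rand('uniform')));
run;
```

Each element of the array is picked with equal probability, which is the flat substitution asked for.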
I did this with generating random numbers and hard coding the limits. There should be an easier way to do this, but for the purposes of the question this should work.
option missing='';
data begin;
input a $;
cards;
a
.
b
c
.
e
.
f
g
h
.
.
j
.
;
run;
data intermediate;
set begin;
if a EQ '' then help= rand("uniform");
else help=.;
run;
data wanted;
set intermediate;
if a EQ '' then do;
if help<1/3 then a='V1';
else if help<2/3 then a='V2';
else a='V3';
end;
drop help;
run;
I am trying to do a recursive lag in SAS; the problem, as I just learned, is that x = lag(x) does not work in SAS.
The data I have is similar in format to this:
id date count x
a 1/1/1999 1 10
a 1/1/2000 2 .
a 1/1/2001 3 .
b 1/1/1997 1 51
b 1/1/1998 2 .
What I want is that given x for the first count, I want each successive x by id to be the lag(x) + some constant.
For example, lets say: if count > 1 then x = lag(x) + 3.
The output that I would want is:
id date count x
a 1/1/1999 1 10
a 1/1/2000 2 13
a 1/1/2001 3 16
b 1/1/1997 1 51
b 1/1/1998 2 54
Yes, the lag function in SAS requires some understanding. You should read through the documentation on it (http://support.sas.com/documentation/cdl/en/lefunctionsref/67398/HTML/default/viewer.htm#n0l66p5oqex1f2n1quuopdvtcjqb.htm)
When you have conditional statements with a lag inside the "then", I tend to use a retained variable.
data test;
informat date anydtdte.;
input id $ date count x;
format date date9.;
datalines;
a 1/1/1999 1 10
a 1/1/2000 2 .
a 1/1/2001 3 .
b 1/1/1997 1 51
b 1/1/1998 2 .
;
run;
data test(drop=last);
set test;
by id;
retain last;
if ^first.id then do;
if count > 1 then
x = last + 3;
end;
last = x;
run;
How can I perform a calculation on the last n observations in a data set?
For example, if I have 10 observations, I would like to create a variable that sums the last 5 values of another variable. Please do not suggest that I lag 5 times or use modulo (N). I need a somewhat more elegant solution than that.
In the code below, alpha is the data set that I have and bravo is the one I need.
data alpha;
input lima @@ ;
cards ;
3 1 4 21 3 3 2 4 2 5
;
run ;
data bravo;
input lima juliet;
cards;
3 .
1 .
4 .
21 .
3 32
3 32
2 33
4 33
2 14
5 16
;
run;
thank you in advance!
You can do this in the data step or using PROC EXPAND from SAS/ETS if available.
For the data step the idea is that you start with a cumulative sum (summ), but keep track of the number of values that were added so far (ninsum). Once that reaches 5, you start outputting the cumulative sum to the target variable (juliet), and from the next step you start subtracting the lagged-5 value to only store the sum of the last five values.
data beta;
set alpha;
retain summ ninsum 0;
summ + lima;
ninsum + 1;
l5 = lag5(lima);
if ninsum = 6 then do;
summ = summ - l5;
ninsum = ninsum - 1;
end;
if ninsum = 5 then do;
juliet = summ;
end;
run;
proc print data=beta;
run;
However, there is a procedure that can do all kinds of cumulative and moving-window calculations: PROC EXPAND, in which this is really just one line. We just tell it to calculate the backward moving sum in a window of width 5 and set the first 4 observations to missing (by default it pads the series with 0s on the left).
proc expand data=alpha out=gamma;
convert lima = juliet / transformout=(movsum 5 trimleft 4);
run;
proc print data=gamma;
run;
Edit
If you want to do more complicated calculations, you need to carry the previous values in retained variables. I thought you wanted to avoid that, but here it is:
data epsilon;
set alpha;
array lags {5};
retain lags1 - lags5;
/* shift over lagged values, and add self at the beginning */
do i=5 to 2 by -1;
lags{i} = lags{i-1};
end;
lags{1} = lima;
/* do whatever calculation is needed; the result stays missing until 5 values are held */
juliet = 0;
do i=1 to 5;
juliet = juliet + lags{i};
end;
drop i;
run;
proc print data=epsilon;
run;
I can offer a rather ugly solution:
1. Run a data step and add an increasing number within each group.
2. Run an SQL step and add a column with max(group).
3. Run another data step and check whether the value from (2)-(1) is less than 5. If so, assign the value that you want to sum to a _num_to_sum_ variable (for example); otherwise leave it blank or assign 0.
4. Finally, do an SQL step with sum(_num_to_sum_) and group the results by the grouping variable from (1).
EDIT: I have added a live example of the concept in a bit more compacted way.
data sourcetable;
input var1 $ var2;
cards;
aaa 3
aaa 5
aaa 7
aaa 1
aaa 11
aaa 8
aaa 6
bbb 3
bbb 2
bbb 4
bbb 6
;
run;
data step1;
set sourcetable;
by var1;
retain obs 0;
if first.var1 then obs = 0;
else obs = obs+1;
if obs >=5 then to_sum = var2;
run;
proc sql;
create table rezults as
select distinct var1, sum(to_sum) as needed_summs
from step1
group by var1;
quit;
In case anyone reads this :)
I solved it the way I needed it to be solved. Now I am more curious which of the two (the retain and my solution) is more efficient in terms of computing/processing time.
Here is my solution:
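data bravo(keep = var1 summ);
set alpha;
/* only look back once 5 observations exist, so point= never gets a value below 1 */
if _n_ >= 5 then do i=_n_ to _n_-4 by -1;
set alpha(rename=(var1=var2)) point=i;
summ=sum(summ,var2);
end;
run;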
I have a dataset similar to the one below
ID  A  B  C  D  E
1   1  .  .  .  .
1   .  .  1  .  .
1   .  1  .  .  .
2   .  1  .  .  .
2   .  .  .  1  .
3   .  .  .  .  1
3   1  .  .  .  .
4   .  .  1  .  .
5   .  1  .  .  .
I want to condense the data into one row for each ID. So the dataset would look like the one below.
ID  A  B  C  D  E
1   1  1  1  .  .
2   .  1  .  1  .
3   1  .  .  .  1
4   .  .  1  .  .
5   .  1  .  .  .
Well, I created another table and removed the duplicate IDs, so I have two tables, A and B. I then tried merging the two datasets together. I was playing around with the following SAS code.
data C;
merge A B;
by ID;
run;
Here's a neat trick I picked up from another forum. There's no need to split up the original dataset: in the UPDATE statement, the first reference (have with obs=0) only creates the structure, and the second supplies the values. UPDATE does not overwrite existing values with missing ones, and the BY statement ensures you only get 1 record per ID.
data have;
infile datalines dsd;
input ID A B C D E;
datalines;
1,1,,,,,
1,,,1,,,
1,,1,,,,
2,,1,,,,
2,,,,1,,
3,,,,,1,
3,1,,,,,
4,,,1,,,
5,,1,,,
;
run;
data want;
update have (obs=0) have;
by id;
run;
This could be solved using the retain statement.
data B(rename=(A2=A B2=B C2=C D2=D E2=E));
set A;
by id;
retain A2 B2 C2 D2 E2;
/* reset the carried values at the start of each ID */
if first.id then do;
A2 = .;
B2 = .;
C2 = .;
D2 = .;
E2 = .;
end;
if A ne . then A2=A;
if B ne . then B2=B;
if C ne . then C2=C;
if D ne . then D2=D;
if E ne . then E2=E;
if last.id then output;
drop A B C D E;
run;
There are other ways to solve this, but hopefully this is helpful.
PROC MEANS is a great tool for something like this. PROC SQL would also give you a reasonable solution, but MEANS is faster.
proc means data=yourdata;
var a b c d e;
class id;
types id; *to avoid the 'overall' row;
output out=want max=; *output the maximum of each var for each ID to a new dataset - use SUM instead if you want more than 1;
run;
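For comparison, the PROC SQL route mentioned above could look roughly like this. It is a sketch that assumes the same yourdata input; the output name want_sql is made up for illustration:

```sas
proc sql;
  create table want_sql as
  select id,
         max(a) as a,   /* MAX as an aggregate ignores missing values */
         max(b) as b,
         max(c) as c,
         max(d) as d,
         max(e) as e
  from yourdata
  group by id;
quit;
```

As with the MEANS version, swap MAX for SUM if a column can legitimately hold more than one non-missing value per ID.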