I hope this message finds you well. I am very new to SAS programming and I am trying to write some code that counts the number of unique entries across multiple columns for each observation. There are also columns in between that I would like to disregard. Below is an example dataset:
|ID | Var1 | NotNeededVar2 | Var3 | Var4 | Var5 |
| 1 | String1 | StringSomething | String2 | String3 | String3 |
| 2 | String1 | StringSomething | String2 | String1 | String2 |
| 3 | String1 | StringSomething | String1 | String1 | String1 |
| 4 | String1 | StringSomething | . | String2 | String2 |
The desired outcome is a new dataset with a newly added column containing the count of unique non-missing entries in Var1, Var3, Var4 and Var5:
|ID | Var1 | NotNeededVar2 | Var3 | Var4 | Var5 | Unique(Var1, 3, 4, 5) |
| 1 | String1 | StringSomething | String2 | String3 | String3 | 3 |
| 2 | String1 | StringSomething | String2 | String1 | String2 | 2 |
| 3 | String1 | StringSomething | String1 | String1 | String1 | 1 |
| 4 | String1 | StringSomething | . | String2 | String2 | 2 |
So far all I can think of is using multiple if/then statements to test if the columns are unique and not missing, but this seems like a sure way to make some errors and make it very complicated.
Any and all help would be much appreciated!
EDIT: Changed the example to reflect string/character values rather than numeric values. Not sure if it makes a difference or not, but this is closer to my actual situation.
EDIT2: Inserted unwanted column to better reflect my dataset.
You could use the WHICHC() function to check if the current value appears earlier in the list.
data have ;
input ID (Var1 NotNeededVar2 Var3 Var4 Var5) (:$20.);
cards;
1 String1 StringSomething String2 String3 String3
2 String1 StringSomething String2 String1 String2
3 String1 StringSomething String1 String1 String1
4 String1 StringSomething . String2 String2
5 . . . . .
;
data want;
set have;
array list var1 var3-var5 ;
count=0;
do index=1 to dim(list);
if not missing(list[index]) and whichc(list[index],of list[*])=index then count+1;
end;
drop index;
run;
Result
Obs  ID  Var1     NotNeededVar2    Var3     Var4     Var5     count
1    1   String1  StringSomething  String2  String3  String3  3
2    2   String1  StringSomething  String2  String1  String2  2
3    3   String1  StringSomething  String1  String1  String1  1
4    4   String1  StringSomething           String2  String2  2
5    5                                                        0
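To see why comparing the result against index works, here is a minimal sketch using the literal values from observation 1 above. WHICHC() returns the position of the first argument in the list that equals the search value:

```sas
data _null_;
  /* search for 'String3' among the four candidate values */
  pos = whichc('String3', 'String1', 'String2', 'String3', 'String3');
  /* writes pos=3 to the log: the first match is the third candidate */
  put pos=;
run;
```

Because the first match is at position 3, only index 3 satisfies whichc(...)=index in the loop, and the repeated value at index 4 is not counted again.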
I'm going to assume the order of the values in the variables isn't important, since CALL SORTC rearranges them in place. If it is, copy them to a different array first and sort that instead.
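data want;
set have;
array _myvars(*) var1 var3 var4 var5;
*sorts array values alphabetically in place - you may not want this step;
call sortc(of _myvars(*));
count = 0;
*after sorting, missing values come first and duplicates are adjacent;
do i=1 to dim(_myvars);
if not missing(_myvars(i)) then do;
*count the first element, or any value that differs from the previous one;
if i=1 then count+1;
else if _myvars(i) ne _myvars(i-1) then count+1;
end;
end;
drop i;
run;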
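If the original order of the values matters, a sketch along these lines sorts a copy instead. The scratch variables _s1-_s4 and the $20 length are assumptions chosen to match the example data. The loop starts at 1 and skips blanks so that the first non-missing value is counted:

```sas
data want;
  set have;
  array _myvars(*) var1 var3 var4 var5;
  /* hypothetical scratch variables; $20 matches the length used in HAVE */
  array _sorted(4) $20 _s1-_s4;
  /* copy the values so the original variables keep their order */
  do i = 1 to dim(_myvars);
    _sorted(i) = _myvars(i);
  end;
  call sortc(of _sorted(*));
  count = 0;
  do i = 1 to dim(_sorted);
    /* blanks sort first; count each non-missing value once */
    if not missing(_sorted(i)) then do;
      if i = 1 then count + 1;
      else if _sorted(i) ne _sorted(i-1) then count + 1;
    end;
  end;
  drop i _s1-_s4;
run;
```

The nested IF avoids referencing _sorted(0) when i is 1, which would be an out-of-range subscript.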
Related
I have a table in SAS dataset that looks like this:
proc sql;
create table my_table
(id char(1),
my_date num format=date9.,
my_col num);
insert into my_table
values('A','01JAN2010'd,.)
values('A','02JAN2010'd,0)
values('A','03DEC2009'd,1)
values('A','04NOV2009'd,1)
values('B','01JAN2010'd,.)
values('B','02NOV2009'd,2)
values('C','01JAN2010'd,.)
values('C','02OCT2009'd,3)
values('D','01JAN2010'd,.)
values('D','02NOV2009'd,2)
values('D','03OCT2009'd,1)
values('D','04AUG2009'd,2)
values('D','05MAY2009'd,3)
values('D','06APR2009'd,1);
quit;
I am trying to create a new column desired that, for each group of id column, flags the row with a value of 1 if the value in my_col is missing or less than 3.
The part I'm having trouble with is that when there is a my_col value that is greater than 2, I need the desired value for that row to be missing and also stop flagging any remaining rows in the id group with a value of 1.
The resulting dataset should look like this:
+----+-----------+--------+---------+
| id | my_date | my_col | desired |
+----+-----------+--------+---------+
| A | 01JAN2010 | . | 1 |
| A | 02JAN2010 | 0 | 1 |
| A | 03DEC2009 | 1 | 1 |
| A | 04NOV2009 | 1 | 1 |
| B | 01JAN2010 | . | 1 |
| B | 02NOV2009 | 2 | 1 |
| C | 01JAN2010 | . | 1 |
| C | 02OCT2009 | 3 | . |
| D | 01JAN2010 | . | 1 |
| D | 02NOV2009 | 2 | 1 |
| D | 03OCT2009 | 1 | 1 |
| D | 04AUG2009 | 2 | 1 |
| D | 05MAY2009 | 3 | . |
| D | 06APR2009 | 1 | . |
+----+-----------+--------+---------+
Looks like a simple application of a retained variable. Set the flag to 1 when you start a new group and then set it to missing when the value of MY_COL is larger than 2.
data want;
set my_table ;
by id;
if first.id then desired=1;
if my_col>2 then desired=.;
retain desired;
run;
Also it is not clear why you used such complicated code to create your example data. Why not a simple data step?
data my_table;
input id :$1. my_date :date. my_col;
format my_date date9.;
cards;
A 01JAN2010 .
A 02JAN2010 0
A 03DEC2009 1
A 04NOV2009 1
B 01JAN2010 .
B 02NOV2009 2
C 01JAN2010 .
C 02OCT2009 3
D 01JAN2010 .
D 02NOV2009 2
D 03OCT2009 1
D 04AUG2009 2
D 05MAY2009 3
D 06APR2009 1
;
I can't think of a simpler way to do it, but this works. You will need to have your data sorted by id.
data my_table2;
set my_table;
by id;
format gt2flag $1.;
retain gt2flag;
if first.id then gt2flag='';
if my_col gt 2 then gt2flag='Y';
if gt2flag = 'Y' then desired=.;
else desired=1;
drop gt2flag;
run;
id my_date my_col desired
A 01JAN2010 . 1
A 02JAN2010 0 1
A 03DEC2009 1 1
A 04NOV2009 1 1
B 01JAN2010 . 1
B 02NOV2009 2 1
C 01JAN2010 . 1
C 02OCT2009 3 .
D 01JAN2010 . 1
D 02NOV2009 2 1
D 03OCT2009 1 1
D 04AUG2009 2 1
D 05MAY2009 3 .
D 06APR2009 1 .
I am trying to run a code that should work on tables created considering different factors. As these factors can be more than 1, I decided to create a macro %let to list them:
%let list= factor1 factor2 ...;
What I would like to do is run a code to create these tables using different factors. For each factor, I computed using proc means the mean and the standard deviation, so I should have the variables &list._mean and &list._stddev in the table created by the proc means for each factor. This table is labelled as t2 and I need to join to another table, t1. From t1 I am considering all the variables.
My main difficulties are, therefore, in the proc sql:
proc sql;
create table new_table as
select t1.*
, t2.&list._mean as mean
, t2.&list._stddev as stddev
from table1 as t1
left join table2 as t2
on t1.time=t2.time
order by t2.&list.
quit;
This code is returning an error, and I think it is because &list. resolves to factor1 factor2, so the t2. prefix is only applied to the first factor, not to the second one.
What I would expect is the following:
proc sql;
create table new_table as
select t1.*
, t2.factor1._mean as mean
, t2.factor1._stddev as stddev
from table1 as t1
left join table2 as t2
on t1.time=t2.time
order by t2.factor1.
quit;
and another one for factor2.
UPDATE CODE:
%macro test_v1(
_dtb
,_input
,_output
,_time
,_factor
);
data &_input.;
set &_dtb..&_input.;
keep &_col_period. &_factor.;
run;
proc sort data = work.&_input.
out = &_input._1;
by &_factor. &_time.;
run;
%put ERROR: 2
proc means data=&_input._1 nonobs mean stddev;
class &_time.;
var &_factor.;
output out=&_input._n (drop=_TYPE_) mean= stddev= /autoname ;
run;
%put ERROR: 3
proc sql;
create table work.&_input._data as
select t1.*
,t2.&_factor._mean as mean
,t2.&_factor._stddev as stddev
from &_input. as t1
left join &_input._n as t2
on t1.&_time.=t2.&_time.
order by &_factor.;
quit;
%mend test_v1;
My question, then, is how I can handle multiple factors, defined as a list in a macro variable, both as columns of tables and as input to a macro (for example: %test(dataset, tablename, list)).
I suspect that trying to use PROC SQL is what is making the problem hard. If you stick to just using normal SAS syntax your space delimited list of variable names is easy to use.
So taking your code and tweaking it a little:
%macro test_v1
(_dtb /* Input libref */
,_input /* Input member name */
,_output /* Output dataset */
,_time /* Class/By variable(s) */
,_factor /* Analysis variable(s) */
);
proc sort data= &_dtb..&_input. out=_temp1;
by &_time. ;
run;
proc means data=_temp1 nonobs mean stddev;
by &_time.;
var &_factor.;
output out=_temp2 (drop=_TYPE_) mean= stddev= /autoname ;
run;
data &_output. ;
merge _temp1 _temp2 ;
by &_time.;
run;
%mend test_v1;
We can then test it using SASHELP.CLASS by using SEX as the "time" variable and HEIGHT and WEIGHT as the analysis variables.
%test_v1(_dtb=sashelp,_input=class,_output=want,_time=sex,_factor=height weight);
You can add a macro loop to your macro that scans the list of factors. It could look like this:
%macro test(list);
%do i=1 %to %sysfunc(countw(&list,%str( )));
%let factorname=%scan(&list,&i,%str( ));
/* if macro variable list equals factor1 factor2 then there would be
two iterations in loop, i=1 factorname=factor1 and i=2 factorname=factor2 */
/*your code here*/
%end;
%mend test;
UPDATE:
%macro test(_input, _output, factors_list); %macro d; %mend d;
%do i=1 %to %sysfunc(countw(&factors_list,%str( )));
%let tfactor=%scan(&factors_list,&i,%str( ));
proc sort data = work.&_input.
out = &_input._1;
by &factors_list. time;
run;
proc means data=&_input._1 nonobs mean stddev;
class time;
var &tfactor.;
output out=&_input._num (drop=_TYPE_) mean= stddev= /autoname ;
run;
proc sql;
create table &_output._&tfactor as
select t1.*
, t2.&tfactor._mean as mean
, t2.&tfactor._stddev as stddev
from &_input as t1
left join &_input._num as t2
on t1.time=t2.time
order by t1.&tfactor;
quit;
%end;
%mend test;
%test(have,newdata,factor1 factor2);
Have dataset:
+------+---------+---------+
| time | factor1 | factor2 |
+------+---------+---------+
| 1 | 12345 | 1234 |
| 2 | 123 | 12 |
| 3 | 1 | -1 |
| 4 | -12 | -123 |
| 5 | -1234 | -12345 |
| 6 | 9876 | 987 |
| 7 | 98 | 8 |
| 8 | 9 | 7 |
| 1 | 1234 | 123 |
| 2 | 12 | 1 |
| 3 | 12 | -12 |
| 4 | -123 | -1234 |
| 5 | -12345 | -123456 |
| 6 | 987 | 98 |
| 7 | 9 | -9 |
| 8 | 1234 | 1234 |
+------+---------+---------+
NEWDATA_FACTOR1:
+------+---------+---------+---------+--------------+
| time | factor1 | factor2 | mean | stddev |
+------+---------+---------+---------+--------------+
| 5 | -12345 | -123456 | -6789.5 | 7856.6634458 |
| 5 | -1234 | -12345 | -6789.5 | 7856.6634458 |
| 4 | -123 | -1234 | -67.5 | 78.488852712 |
| 4 | -12 | -123 | -67.5 | 78.488852712 |
| 3 | 1 | -1 | 6.5 | 7.7781745931 |
| 7 | 9 | -9 | 53.5 | 62.932503526 |
| 8 | 9 | 7 | 621.5 | 866.20580695 |
| 3 | 12 | -12 | 6.5 | 7.7781745931 |
| 2 | 12 | 1 | 67.5 | 78.488852712 |
| 7 | 98 | 8 | 53.5 | 62.932503526 |
| 2 | 123 | 12 | 67.5 | 78.488852712 |
| 6 | 987 | 98 | 5431.5 | 6285.472178 |
| 1 | 1234 | 123 | 6789.5 | 7856.6634458 |
| 8 | 1234 | 1234 | 621.5 | 866.20580695 |
| 6 | 9876 | 987 | 5431.5 | 6285.472178 |
| 1 | 12345 | 1234 | 6789.5 | 7856.6634458 |
+------+---------+---------+---------+--------------+
NEWDATA_FACTOR2:
+------+---------+---------+----------+--------------+
| time | factor1 | factor2 | mean | stddev |
+------+---------+---------+----------+--------------+
| 5 | -12345 | -123456 | -67900.5 | 78567.341564 |
| 5 | -1234 | -12345 | -67900.5 | 78567.341564 |
| 4 | -123 | -1234 | -678.5 | 785.5956339 |
| 4 | -12 | -123 | -678.5 | 785.5956339 |
| 3 | 12 | -12 | -6.5 | 7.7781745931 |
| 7 | 9 | -9 | -0.5 | 12.02081528 |
| 3 | 1 | -1 | -6.5 | 7.7781745931 |
| 2 | 12 | 1 | 6.5 | 7.7781745931 |
| 8 | 9 | 7 | 620.5 | 867.62002052 |
| 7 | 98 | 8 | -0.5 | 12.02081528 |
| 2 | 123 | 12 | 6.5 | 7.7781745931 |
| 6 | 987 | 98 | 542.5 | 628.61792847 |
| 1 | 1234 | 123 | 678.5 | 785.5956339 |
| 6 | 9876 | 987 | 542.5 | 628.61792847 |
| 1 | 12345 | 1234 | 678.5 | 785.5956339 |
| 8 | 1234 | 1234 | 620.5 | 867.62002052 |
+------+---------+---------+----------+--------------+
I have a dataset like this:
string_var | var1 | var2 | var3
var2 | 8 | 8 | 4
var3 | 7 | 5 | 7
var2 | 10 | 10 | 5
I need to test whether var1 is equal to either var2 or var3, depending on the string that is contained in string_var. My problem is to convert these strings into variable names to do something like:
gen test=1 if var1==string_var
where, after the ==, I need some kind of conversion function to let Stata read the string e.g. var2 as the following:
gen test=1 if var1==var2
With just two possibilities, you can branch on the choice. (With several possibilities, I think you'd need a loop.)
clear
input str4 string_var var1 var2 var3
var2 8 8 4
var3 7 5 7
var2 10 10 5
end
gen test = cond(string_var == "var2", var1 == var2, var1 == var3)
list
+--------------------------------------+
| string~r var1 var2 var3 test |
|--------------------------------------|
1. | var2 8 8 4 1 |
2. | var3 7 5 7 1 |
3. | var2 10 10 5 1 |
+--------------------------------------+
EDIT:
Here is a more general solution. (If anyone else thinks of a neater solution, be sure to post.)
gen test = .
levelsof string_var, local(names)
quietly foreach name of local names {
replace test = var1 == `name' if string_var == "`name'"
}
I'm working with an edge list in Stata, of the type:
var1 var2
a 1
a 2
a 3
b 1
b 2
1 a
2 b
I want to remove non-unique pairs such as 1a and 2b (which are same as a1 and b2 for me). How can I go about this?
. clear
. input str1 (var1 var2)
var1 var2
1. a 1
2. a 2
3. a 3
4. b 1
5. b 2
6. 1 a
7. 2 b
8. end
. gen first = cond(var1 <= var2, var1, var2)
. gen second = cond(var1 <= var2, var2, var1)
. list
+------------------------------+
| var1 var2 first second |
|------------------------------|
1. | a 1 1 a |
2. | a 2 2 a |
3. | a 3 3 a |
4. | b 1 1 b |
5. | b 2 2 b |
|------------------------------|
6. | 1 a 1 a |
7. | 2 b 2 b |
+------------------------------+
. duplicates list first second
Duplicates in terms of first second
+--------------------------------+
| group: obs: first second |
|--------------------------------|
| 1 1 1 a |
| 1 6 1 a |
| 2 5 2 b |
| 2 7 2 b |
+--------------------------------+
. duplicates drop first second, force
Duplicates in terms of first second
(2 observations deleted)
. list
+------------------------------+
| var1 var2 first second |
|------------------------------|
1. | a 1 1 a |
2. | a 2 2 a |
3. | a 3 3 a |
4. | b 1 1 b |
5. | b 2 2 b |
+------------------------------+
The easy part of the answer is to use duplicates drop. But how to get the data so that 1 a and a 1 are seen to be duplicates? This is all documented here. We can sort the values in each observation so that (in this case) both sort to 1 a. The linked paper says much more, but that's the main idea, and cond() helps.
I have a set of multiple choice responses from a survey with 45 questions, and I've placed the correct responses as my first observation in the dataset.
In my DATA step I would like to set values to 0 or 1 depending on whether the variable in each observation matches the same variable in the first observation. I want to replace the response letter (A-D) with the 0 or 1 in the dataset. How do I go about doing that comparison?
I'm not doing any grouping, so I believe I can access the first row using First.x, but I'm not sure how to compare that across each variable(answer1-answer45).
| Id | answer1 | answer2 | ... through answer45 |
|:---|:--------|:--------|:---------------------|
| KEY | A | B |
| 2 | A | C |
| 3 | C | D |
| 4 | A | B |
| 5 | D | C |
| 6 | B | B |
Should become:
| Id | answer1 | answer2 | ... through answer45 |
|:---|:--------|:--------|:---------------------|
| KEY | A | B |
| 2 | 1 | 0 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 0 | 0 |
| 6 | 0 | 1 |
Current code for reading in the data:
DATA TEST(drop=name fill answer0);
INFILE SCORES DSD firstobs=2;
length id $4;
length answer1-answer150 $1;
INPUT name $ fill id $ (answer0-answer150) ($);
RUN;
Thanks in advance!
Here's how I might do it. Create a data set to PROC COMPARE the KEY to the observed. Then you have X for not matching key and missing for matched. You can then use PROC TRANSREG to score the 'X.' to 01. PROC TRANSREG also creates macro variables which contain the names of the new variables and the number.
From log NOTE: _TRGINDN=2 _TRGIND=answer1D answer2D
data questions;
input id:$3. (answer1-answer2)(:$1.);
cards;
KEY A B
2 A C
3 C D
4 A B
5 D C
6 B B
;;;;
run;
data key;
if _n_ eq 1 then set questions(obs=1);
set questions(keep=id firstobs=2);
run;
proc compare base=key compare=questions(firstobs=2) out=comp outdiff noprint;
id id;
run;
options validvarname=v7;
proc transreg design data=comp(drop=_type_ type=data);
id id;
model class(answer:) / noint;
output out=scored(drop=intercept _:);
run;
%put NOTE: &=_TRGINDN &=_TRGIND;
I don't have my SAS license here at home, so I can't actually test this code. I'll give it my best shot, though ...
First, I'd keep my correct answers in a separate table, and then merge it with the answers from the respondents. That also makes the solution scalable, should you have more multiple choice solutions and answers in the same table, since you'd be joining on the assignment ID as well.
Now, import all your correct answers to a table answers_correct with column names answer_correct1-answer_correct45.
Then, combine the two tables and determine the outcome for each question.
DATA outcome;
* A MERGE without BY would supply the key on the first observation only;
* Reading the single key row once with SET keeps its values for every respondent;
* If you later add more questionnaires, MERGE BY the questionnaire ID instead;
SET answers;
IF _N_ = 1 THEN SET answers_correct;
ARRAY answer(*) answer1-answer45;
ARRAY answer_correct(*) answer_correct1-answer_correct45;
LENGTH result1-result45 $1;
ARRAY result(*) result1-result45;
DROP i;
DO i = 1 TO DIM(answer);
IF answer(i) = answer_correct(i) THEN result(i) = '1';
ELSE result(i) = '0';
END;
RUN;
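If you would rather keep the key as the first observation, as in the question, a minimal sketch is to copy the key row into a temporary array (which SAS retains across observations) and overwrite the answers in place. It assumes the two-question QUESTIONS data set from the other answer; widen the variable list and the array dimension to 45 for the real data. KEY and SCORED are hypothetical names:

```sas
data scored;
  set questions;
  array answer(*) $1 answer1-answer2;
  /* temporary arrays are retained automatically and are not written to the output */
  array key(2) $1 _temporary_;
  if _n_ = 1 then do;
    /* first observation holds the correct answers: remember them, leave the row as is */
    do i = 1 to dim(answer);
      key(i) = answer(i);
    end;
  end;
  else do i = 1 to dim(answer);
    /* replace each response letter with '1' (match) or '0' (no match) */
    if answer(i) = key(i) then answer(i) = '1';
    else answer(i) = '0';
  end;
  drop i;
run;
```

The scored values stay character ('0'/'1') because the answer variables are character; convert them in a later step if numeric scores are needed.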