My first time posting. I'm pretty new to SAS programming (actually all programming). This seems like a simple problem but can't figure it out. I have some crosstab output and I'm trying to get it into shape for easy output to tables. I want to retain the first observation in a by group if there are only 3 observations in that group. If there are more than 3 observations, I want to retain all observations but the last. So, for example, here's what I have:
Group1 Group2 Percent
var1 1 0.25
var1 1 0.75
var1 1 1
var1 2 0.4
var1 2 0.6
var1 2 1
var1 3 0.7
var1 3 0.3
var1 3 0.6
var2 1 0.1
var2 1 0.2
var2 1 0.4
var2 1 0.3
var2 1 1
var2 2 0.2
var2 2 0.2
var2 2 0.2
var2 2 0.2
var2 2 1
var2 3 0.7
var2 3 0.1
var2 3 0.05
var2 3 0.05
var2 3 0.1
and here's what I want in a new dataset
Group1 Group2 Percent
var1 1 0.25
var1 2 0.4
var1 3 0.7
var2 1 0.1
var2 1 0.2
var2 1 0.4
var2 1 0.3
var2 2 0.2
var2 2 0.2
var2 2 0.2
var2 2 0.2
var2 3 0.7
var2 3 0.1
var2 3 0.05
var2 3 0.05
Hopefully, that's clear but please let me know if more information is needed.
I've broken it out in a few steps to help you see the logic and have used both data steps and SQL. Basically you want to count how many are in each group and keep all counts (the count within the group and the total count) around so you can use them to make your final logic.
data test;
length GROUP1 $5 GROUP2 PERCENT 8;
input GROUP1 $ GROUP2 PERCENT;
datalines;
var1 1 0.25
var1 1 0.75
var1 1 1
var1 2 0.4
var1 2 0.6
var1 2 1
var1 3 0.7
var1 3 0.3
var1 3 0.6
var2 1 0.1
var2 1 0.2
var2 1 0.4
var2 1 0.3
var2 1 1
var2 2 0.2
var2 2 0.2
var2 2 0.2
var2 2 0.2
var2 2 1
var2 3 0.7
var2 3 0.1
var2 3 0.05
var2 3 0.05
var2 3 0.1
;
run;
** count the number of obs per group **;
data test_ct; set test;
by GROUP1 GROUP2;
COUNT + 1;
if first.GROUP2 then COUNT = 1;
run;
** count the total number of obs per group and output on each row **;
proc sql noprint;
create table test_ct_all as
select *, count(*) as COUNT_TOTAL
from test_ct group by GROUP1,GROUP2
order by GROUP1, GROUP2, COUNT;
quit;
** logic to keep records **;
data keep_flags; set test_ct_all;
if COUNT=1 and COUNT_TOTAL=3 then KEEP=1;
*the last record will have COUNT and COUNT_TOTAL equal;
if COUNT_TOTAL > 3 and (COUNT_TOTAL ne COUNT) then KEEP=1;
run;
** output only the keep records **;
data keepers; set keep_flags;
if KEEP=1;
run;
Related
I have a dataset with one id column and three variables:
data have;
input id var1 var2 var3;
datalines;
1 0 1 0
2 1 1 0
3 0 0 2
4 0 4 1
;
run;
I want to use some osrt or data or proc sql step over var1 to var3 to keep as 0 if it is 0, and 1 if it is greater than 0. It should ideally use an array var1 -- var3 as the actual dataset has many more variables.
Try this
data have;
input id var1 var2 var3;
datalines;
1 0 1 0
2 1 1 0
3 0 0 2
4 0 4 1
;
run;
data want;
set have;
array v var1 -- var3;
do over v;
v = v > 0;
end;
run;
I have a dataset as follows:
variable level value
-----------------------
Age_group 1 0.1
Age_group 2 0.3
Age_group 3 0.2
Age_group 4 0.5
Sex 1 0.9
0 0.6
I would like to reformat it to get,
variable value
------------------------
Age_group
1 0.1
2 0.3
3 0.2
4 0.5
Sex
1 0.9
0 0.6
Is there any way to perform this?
The 'reformat' is more appropriately an output report versus a data set.
Example:
data have;
length variable $32;
input variable $ level value;
datalines;
Age_group 1 0.1
Age_group 2 0.3
Age_group 3 0.2
Age_group 4 0.5
Sex 1 0.9
Sex 0 0.6
;
ods html file='report.html' style=plateau;
proc report data=have;
define variable / order order=data noprint;
define level / 'variable';
compute before variable;
line variable $32.;
endcomp;
run;
ods html close;
Output
If you really want the output as a dataset and not a report (which you should perhaps consider), and you do not want to change the sorting of the input dataset, the following should work:
data have;
length variable $32;
input variable $ level value;
datalines;
Age_group 1 0.1
Age_group 2 0.3
Age_group 3 0.2
Age_group 4 0.5
Sex 1 0.9
Sex 0 0.6
;
data newdata;
set have(rename = value = old_val);
if lag(variable) ne variable then output;
variable = left(put(level,best.));
value = old_val;
output;
keep variable value;
run;
Sorry i don't have Sas installed and i didn't programmed for 3 years. But this could be helpful.
BTW it's not good to have your key field (variable) with spaces
libname class 'SAS-library';
proc sort data= (yourdataset);
by variable;
run;
data newdata ;
set yourdataset;
by variable;
retain variable;
if first.variable then newvariable=variable;
else newvariable=level;
run;
your will need to remove variable and rename the newvariable.
documentation
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214163.htm
In SAS, how can I create an identifier for each unique combination of a set of variables?
I have, for example, a several thousand observations with a dichotomous value for six variables. There are 2^6 unique combinations for the values of these variables for each observation. I would like to create an identifier for each unique combination, and eventually group my observations according to this value.
Have:
SubjectID Var1 Var2 Var3 Var4 Var5 Var6
---------------------------------------------------------------
ID1 1 1 1 1 1 1
ID2 1 0 1 1 1 1
ID3 0 1 1 1 1 1
ID4 0 0 1 1 1 0
... ... ... ... ... ... ...
ID3000 1 1 0 1 0 0
Want:
SubjectID Var1 Var2 Var3 Var4 Var5 Var6 Identifier
------------------------------------------------------------------------------
ID1 1 1 1 1 1 1 A
ID2 1 1 1 1 1 1 A
ID3 0 1 1 1 1 1 B
ID4 0 0 1 1 1 0 C
... ... ... ... ... ... ...
ID3000 1 1 0 1 0 0 Z
A would represent 1, 1, 1, 1, 1, 1 as a unique combination and B would represent 0, 1, 1, 1, 1, 1 etc.
I have thought about creating a dummy variable based on 64 Var1-Var6 conditional statements. I've also thought about concatenating the values from Var1-Var6 into a new row to create a unique identifier.
Is there a more straightforward way of going about this?
I prefer an approach that assigns a specific identifier to a specific combination of the values, rather than one that just generates some arbitrary unique string whenever a new combination comes up.
Proc summary works well with the LEVELS option. This technique works for any values of the group variables numeric or character.
data have;
input (v1-v6)(1.);
cards;
111111
111110
111101
111011
110111
;;;;
proc print;
proc summary data=have nway;
class v1-v6;
output out=unique(drop=_type_) / levels;
run;
Why not just concatenate the values?
So your combinations are:
111111
111110
111101
111011
110111
....
You can use PROC FREQ to check the number of each type.
proc freq data=have;
table var1*var2*var3*var4*var5*var6 / out=want list;
run;
By using the unique values of the given variables' combinations and then creating an alphabetical List of Ids, you can create the result
data inp;
length combined $6.;
input subjectid $4. v1 1. v2 1. v3 1. v4 1. v5 1. v6 1.;
combined=compress(v1||v2||v3||v4||v5||v6);
datalines;
ID1 111111
ID2 011111
ID3 001111
ID4 111110
ID5 000111
ID6 111111
ID7 000111
;
run;
proc sql;
create table uniq
as
select distinct combined from inp order by combined desc;
quit;
data uniq1;
set uniq;
retain alphabet 65;
Id=byte(alphabet) ;
alphabet+1;
drop alphabet;
run;
proc sql;
create table final_ds
as
select subjectid, v1, v2, v3, v4, v5, v6, Id
from inp a
left join uniq1 b
on a.combined=b.combined;
quit;
Assuming the data is sorted by your grouping variables then just use BY group processing.
data want;
set have;
by var1-var6 ;
groupid + first.var6 ;
run;
Or you could just convert the 6 binary variables into a single unique value.
group2 = input(cats(of var1-var6),binary6.);
This has the added value of not requiring that you sort the data, but it does need for none of the grouping variables to be missing.
Result
SubjectID Var1 Var2 Var3 Var4 Var5 Var6 Identifier Want groupno group2
ID4 0 0 1 1 1 0 C 1 14
ID3 0 1 1 1 1 1 B 2 31
ID1 1 1 1 1 1 1 A 3 63
ID2 1 1 1 1 1 1 A 3 63
Suppose my dataset includes the following variables:
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
input double(id var5 var6)
1 10000 0.4
2 22000 0.55
3 25000 0.5
4 40000 1
end
I need to delete rows of ids that have an increased value of var5 and reduced value of var6 compared with at least one other id. In the first example, number 4 with 2028 and 17.396 should be deleted. In the second example, number 3 with 25000 and 0.5 should be deleted. After the elimination, the observations of the three variables should look like this:
1 1052 17.348
2 1288 17.378
3 1536 17.387
5 1810 17.402
6 2034 17.407
1 10000 0.4
2 22000 0.55
4 40000 1
while var1 and var2 should remain intact.
How can I do this?
This is very odd because you appear to say that you have a dataset with completely unrelated variables. You have an initial dataset of 100 observations with variables var1 and var2 and then a secondary dataset with 6 observations with variables var5 and var6. Your objective appears to be to remove observations, but only for values contained in variables var5 and var6. This looks like spreadsheet thinking as Stata only has a single dataset in memory at any given time.
The task of identifying observations to drop requires that you compare each observations with values for var5 and var6 with all other observations with values for those variables. This can be done in Stata by forming all pairwise combinations using the cross command.
Here's a solution that starts with data organized exactly as you presented it and separates the two datasets in order to perform the task of dropping the observations based on var5 and var6 values. Since the datasets appear completely unrelated, an unmatched merge is used to recombine the data.
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
tempfile main
save "`main'"
* extract secondary dataset
keep id var5 var6
keep if !mi(id)
tempfile data2
save "`data2'"
* form all pairwise combinations
rename * =_0
cross using "`data2'"
* identify cases where there's an increase in var5 and decrease in var6
gen todrop = var5_0 > var5 & var6_0 < var6
* drop id if there's at least one case, reduce to original obs and vars
bysort id_0 (todrop): keep if !todrop[_N]
keep if id == id_0
keep id var5 var6
list
* now merge back with original data, use unmatched merge since
* secondary data is unrelated
sort id
tempfile newdata2
save "`newdata2'"
use "`main'", clear
drop id var5 var6
merge 1:1 _n using "`newdata2'", nogen
Here's one way to do this without separating the datasets. The task of identifying the observations to drop require a double-loop to make all pairwise comparisons. There is however no command in Stata to drop observations for just a few variables. In the following example, I switch to Mata to load the observations to preserve and then clear out values and save the observations back into the Stata variables:
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
* an observation index
gen obsid = _n if !mi(id)
* identify observations to drop
gen todrop = 0 if !mi(id)
sum obsid, meanonly
local n = r(N)
quietly forvalues i = 1/`n' {
forvalues j = 1/`n' {
replace id = . if var5[`i'] > var5[`j'] & var6[`i'] < var6[`j'] & _n == `i'
}
}
* take a trip to Mata to load the data to keep and store it back from there
mata:
// load data, ignore observations with missing values
X = st_data(., ("id","var5","var6"), 0)
// set all obs to missing
st_store(., ("id","var5","var6") ,J(st_nobs(),3,.))
// store non-missing values back into the variables
st_store((1,rows(X)), ("id","var5","var6") ,X)
end
drop obsid todrop
Alternatively, you can manually move up values by doing some observation index gymnastics:
clear
set obs 100
generate var1 = rnormal()
generate var2 = rnormal()
input double(id var5 var6)
1 1052 17.348
2 1288 17.378
3 1536 17.387
4 2028 17.396
5 1810 17.402
6 2034 17.407
end
* an observation index
gen obsid = _n if !mi(id)
* identify observations to drop
gen todrop = 0 if !mi(id)
sum obsid, meanonly
local n = r(N)
quietly forvalues i = 1/`n' {
forvalues j = 1/`n' {
replace id = . if var5[`i'] > var5[`j'] & var6[`i'] < var6[`j'] & _n == `i'
}
}
* move observations up
local j 0
quietly forvalues i = 1/`n' {
if !mi(id[`i']) {
local ++j
replace id = id[`i'] in `j'
replace var5 = var5[`i'] in `j'
replace var6 = var6[`i'] in `j'
}
}
local ++j
replace id = . in `j'/l
replace var5 = . in `j'/l
replace var6 = . in `j'/l
drop obsid todrop
My dataset looks like this:
ID Var1 Var2 Var3
A 1 0 1
B 0 0 1
B 1 1 0
A 0 0 0
A 1 1 1
My expected output will be:
ID Var1 Var2 Var3
A 2 1 3
B 1 1 1
Can someone help with this?
I've not access to SAS right now but try the following:-
proc tabulate data = in;
class id;
var var:;
table id, sum=''*(var1 var2 var3);
run;