SAS: get the first value where a condition is verified by group

I have this dataset:
data temp;
input ID date :ddmmyy10. type ;
format date ddmmyy10.;
datalines;
1 10/11/2006 1
1 10/12/2006 2
1 15/01/2007 2
1 20/01/2007 3
2 10/08/2008 1
2 11/09/2008 1
2 17/10/2008 1
2 12/11/2008 2
2 10/12/2008 3
;
I would like to create a new column that holds, for each ID, the first date on which the variable type changes from 1 to 2, as follows:
data temp;
input ID date :ddmmyy10. type date_change_type1to2 :ddmmyy10.;
format date date_change_type1to2 ddmmyy10.;
datalines;
1 10/11/2006 1 .
1 10/12/2006 2 10/12/2006
1 15/01/2007 2 .
1 20/01/2007 3 .
2 10/08/2008 1 .
2 11/09/2008 1 .
2 17/10/2008 1 .
2 12/11/2008 2 12/11/2008
2 10/12/2008 3 .
;
I have tried this code but it doesn't work:
data temp;
set temp;
if first.type= 2 then date_change_type1to2=date;
by ID;
run;
Thank you in advance for your help!

Solution (the input data must be sorted by ID!):
data temp;
input ID date $10. type ;
datalines;
1 10/11/2006 1
1 10/12/2006 2
1 15/01/2007 2
1 20/01/2007 2
2 10/08/2008 1
2 11/09/2008 1
2 17/10/2008 1
2 12/11/2008 2
2 10/12/2008 2
;
run;
data temp(drop=type_store);
set temp;
by ID;
retain type_store;
if first.id then type_store = type;
if type ne type_store and type = 2 then do;
date_change_type1to2=date;
type_store = type;
end;
run;
Output:
+----+------------+------+----------------------+
| ID | date | type | date_change_type1to2 |
+----+------------+------+----------------------+
| 1 | 10/11/2006 | 1 | |
+----+------------+------+----------------------+
| 1 | 10/12/2006 | 2 | 10/12/2006 |
+----+------------+------+----------------------+
| 1 | 15/01/2007 | 2 | |
+----+------------+------+----------------------+
| 1 | 20/01/2007 | 2 | |
+----+------------+------+----------------------+
| 2 | 10/08/2008 | 1 | |
+----+------------+------+----------------------+
| 2 | 11/09/2008 | 1 | |
+----+------------+------+----------------------+
| 2 | 17/10/2008 | 1 | |
+----+------------+------+----------------------+
| 2 | 12/11/2008 | 2 | 12/11/2008 |
+----+------------+------+----------------------+
| 2 | 10/12/2008 | 2 | |
+----+------------+------+----------------------+
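The step above records the date whenever type becomes 2 after a different stored value. If the previous row must be exactly 1, a minimal sketch using LAG could look like the following. The names want, prev_type and found are hypothetical, and the new column simply inherits whatever type date has in temp.

```sas
/* Sketch only: want, prev_type and found are illustrative names. */
data want(drop=prev_type found);
  set temp;
  by ID;
  retain found;
  /* Call LAG on every row: it returns the value stored by the previous
     call, so calling it inside a condition would skip rows. */
  prev_type = lag(type);
  if first.ID then found = 0;   /* reset the marker for each new ID */
  /* Skip the first row of an ID, where prev_type belongs to the previous ID. */
  if not first.ID and not found and prev_type = 1 and type = 2 then do;
    date_change_type1to2 = date;
    found = 1;                  /* keep only the first 1-to-2 change */
  end;
run;
```

Like the solution above, this assumes temp is sorted by ID and that the rows within each ID are already in date order.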

The variable first.type will not be created unless you include type in a BY statement. And even if it did exist, its value could never be 2; it is always either 1 (true) or 0 (false).
If you just want to set the date once and keep its value for the rest of the observations for that ID, then you could RETAIN the value. Make sure to clear it when starting a new ID value.
data temp;
set temp;
by ID;
if first.id then date_change_type1to2=.;
retain date_change_type1to2 ;
if type=2 and missing(date_change_type1to2) then date_change_type1to2=date;
run;
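To see first.type in action, here is a minimal sketch that lists type on the BY statement. It assumes the rows for each ID are contiguous and in date order, and the output name want is hypothetical. The NOTSORTED option lets SAS treat each run of equal type values as its own BY group even if type is not in ascending order.

```sas
/* Sketch only: want is an illustrative output name. */
data want;
  set temp;
  by ID type notsorted;
  /* first.type is 1 on the first row of each run of equal type values
     within an ID, and 0 on every other row. */
  if first.type and not first.ID and type = 2
    then date_change_type1to2 = date;
run;
```

Unlike the RETAIN version, this leaves the new column missing on every row except the one where a run of type = 2 begins. If an ID can return to type 2 more than once, every such run would be flagged, so a retained marker would still be needed to keep only the first.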

Related

In SAS, how do you stop flagging a group of rows if a specific condition is met?

I have a table in SAS dataset that looks like this:
proc sql;
create table my_table
(id char(1),
my_date num format=date9.,
my_col num);
insert into my_table
values('A','01JAN2010'd,.)
values('A','02JAN2010'd,0)
values('A','03DEC2009'd,1)
values('A','04NOV2009'd,1)
values('B','01JAN2010'd,.)
values('B','02NOV2009'd,2)
values('C','01JAN2010'd,.)
values('C','02OCT2009'd,3)
values('D','01JAN2010'd,.)
values('D','02NOV2009'd,2)
values('D','03OCT2009'd,1)
values('D','04AUG2009'd,2)
values('D','05MAY2009'd,3)
values('D','06APR2009'd,1);
quit;
I am trying to create a new column, desired, that flags each row within an id group with a value of 1 if the value in my_col is missing or less than 3.
The part I'm having trouble with is that when there is a my_col value that is greater than 2, I need the desired value for that row to be missing and also stop flagging any remaining rows in the id group with a value of 1.
The resulting dataset should look like this:
+----+-----------+--------+---------+
| id | my_date | my_col | desired |
+----+-----------+--------+---------+
| A | 01JAN2010 | . | 1 |
| A | 02JAN2010 | 0 | 1 |
| A | 03DEC2009 | 1 | 1 |
| A | 04NOV2009 | 1 | 1 |
| B | 01JAN2010 | . | 1 |
| B | 02NOV2009 | 2 | 1 |
| C | 01JAN2010 | . | 1 |
| C | 02OCT2009 | 3 | . |
| D | 01JAN2010 | . | 1 |
| D | 02NOV2009 | 2 | 1 |
| D | 03OCT2009 | 1 | 1 |
| D | 04AUG2009 | 2 | 1 |
| D | 05MAY2009 | 3 | . |
| D | 06APR2009 | 1 | . |
+----+-----------+--------+---------+

Looks like a simple application of a retained variable. Set the flag to 1 when you start a new group and then set it to missing when the value of MY_COL is larger than 2.
data want;
set my_table ;
by id;
if first.id then desired=1;
if my_col>2 then desired=.;
retain desired;
run;
Also it is not clear why you used such complicated code to create your example data. Why not a simple data step?
data my_table;
input id :$1. my_date :date. my_col;
format my_date date9.;
cards;
A 01JAN2010 .
A 02JAN2010 0
A 03DEC2009 1
A 04NOV2009 1
B 01JAN2010 .
B 02NOV2009 2
C 01JAN2010 .
C 02OCT2009 3
D 01JAN2010 .
D 02NOV2009 2
D 03OCT2009 1
D 04AUG2009 2
D 05MAY2009 3
D 06APR2009 1
;

I can't think of a simpler way to do it, but this works. You will need to have your data sorted by id.
data my_table2;
set my_table;
by id;
format gt2flag $1.;
retain gt2flag;
if first.id then gt2flag='';
if my_col gt 2 then gt2flag='Y';
if gt2flag = 'Y' then desired=.;
else desired=1;
drop gt2flag;
run;
id my_date my_col desired
A 01JAN2010 . 1
A 02JAN2010 0 1
A 03DEC2009 1 1
A 04NOV2009 1 1
B 01JAN2010 . 1
B 02NOV2009 2 1
C 01JAN2010 . 1
C 02OCT2009 3 .
D 01JAN2010 . 1
D 02NOV2009 2 1
D 03OCT2009 1 1
D 04AUG2009 2 1
D 05MAY2009 3 .
D 06APR2009 1 .

data missing for the selected date range in SAS

I have an issue finding the missing months in a data set in SAS. Since I am new to SAS, I need some help with it. My data set is shown below. In this example the date range runs from 201810 to 201906 (9 months of sample data, although I need a 15-month range to check for missing data).
+----+------------+
| ID | Elig Month |
+----+------------+
| 1 | 201810 |
| 1 | 201811 |
| 1 | 201901 |
| 1 | 201902 |
| 1 | 201903 |
| 1 | 201904 |
| 1 | 201905 |
| 1 | 201906 |
| 2 | 201811 |
| 2 | 201901 |
| 2 | 201903 |
| 2 | 201904 |
| 2 | 201905 |
| 2 | 201906 |
| 3 | 201901 |
| 3 | 201902 |
| 3 | 201903 |
| 3 | 201904 |
| 3 | 201905 |
| 3 | 201906 |
| 4 | 201810 |
| 4 | 201903 |
| 4 | 201904 |
| 4 | 201905 |
| 4 | 201906 |
| 5 | 201906 |
| 6 | 201810 |
+----+------------+
I want to check whether data is present for every month in a 15-month date range. My dates are stored in yearmonth form, for example 201901. I want to find the missing months and create groups based on how many months are missing:
1. if only one month is missing then I want to group as "1 month missing"
2. if two consecutive months are missing then name the group "2 month missing"
3. if 3 months then "3 month missing"
4. if 4 - 6 months missing then "4-6 months missing"
5. if months are missing intermittently, for example available in one month, not available in the next, then available in the next two months, then I want to group them as "Chaos"
6. if 7-12 months are missing then "7-12 months missing"
7. if more than 12 months are missing then "12+ months missing"
8. if the ID is seen only once, at the end of the period, name it "reborn"
9. if the ID is seen at the start of the period and then has no data for the rest of the 15 months then "dead"
The expected result is shown below:
+----+-------+--------------------+
| ID | Group | Group description |
+----+-------+--------------------+
| 1 | 1 | 1 months missing |
| 2 | 5 | chaos |
| 3 | 2 | 2 months missing |
| 4 | 4 | 4-6 months missing |
| 5 | 8 | Reborn |
| 6 | 9 | Dead |
+----+-------+--------------------+

First, to replicate your data:
/***********************************************************************/
/* ORIGINAL DATA */
/***********************************************************************/
Data have;
Input ID date yymmn6.;
datalines;
1 201810
1 201811
1 201901
1 201902
1 201903
1 201904
1 201905
1 201906
2 201811
2 201901
2 201903
2 201904
2 201905
2 201906
3 201901
3 201902
3 201903
3 201904
3 201905
3 201906
4 201810
4 201903
4 201904
4 201905
4 201906
5 201906
6 201810
;
Run;
Data Have;
Set Have;
format date yymmn6.;
Run;
Then you can use the following code, driven by two macro variables, to create a master list of all the possible year-months between the dates you want:
/***********************************************************************/
/* How to Find out IF Your Data is Missing a Date in sequence */
/***********************************************************************/
%let start_date=01OCT2018; /*Change this to your starting date*/
%let end_date=01jun2019; /*Change this to your Ending date*/
data month;
want=1;
date="&start_date"d;
do while (date<="&end_date"d);
output;
date=intnx('month', date, 1, 's');
end;
format date yymmn6.;
run;
Lastly, you just merge the two. Any row that has a null/missing ID marks a month with no data, and that is what you will group by for your categorization logic.
Proc sort data= have;
by date;
Proc sort data=month;
by date;
Run;
Data Want;
merge Have month;
by date;
Run;
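Note that merging by date alone only exposes months that are absent for every ID. To check each ID separately, one option is to build the full ID-by-month grid first and then left join the data onto it. The sketch below assumes the have and month datasets created above; the names grid, want_by_id and present are hypothetical.

```sas
/* Sketch only: grid, want_by_id and present are illustrative names. */
proc sql;
  /* Every ID paired with every month in the master list. */
  create table grid as
  select i.ID, m.date
  from (select distinct ID from have) as i, month as m
  order by i.ID, m.date;

  /* present = 1 when the ID has a row for that month, 0 otherwise. */
  create table want_by_id as
  select g.ID, g.date,
         case when h.ID is null then 0 else 1 end as present
  from grid as g
  left join have as h
    on g.ID = h.ID and g.date = h.date
  order by g.ID, g.date;
quit;
```

The rows with present = 0 are the missing months for each ID, which is the input the grouping rules in the question need.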

The dataset you have provided does not let you create all the categories, so I am assuming 9 months instead of 15 here. You will have to make some changes to make it work for 15 months. Here is one way of doing this:
I am reading the months as just numbers:
data have;
input id month;
datalines;
1 201810
1 201811
1 201901
;
run;
If you read month as a date field, then you need to do so when creating the allmonths dataset below also.
/* Create a dataset that contains all months for all IDs */
proc sort data=have(keep=id) nodupkey out=ids;
by id;
run;
/* Very lazy way of populating the months. There are elegant ways to do this */
data allmonths;
set ids;
do month = 201810 to 201812;
output;
end;
do month = 201901 to 201906;
output;
end;
run;
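One less manual way to populate the months, sketched here under the assumption that month stays a plain yyyymm number as in the step above, is to let INTNX walk the calendar. The start and end dates are the ones from the question's sample, and the helper variables i and d are hypothetical.

```sas
/* Sketch only: i and d are illustrative helper variables. */
data allmonths;
  set ids;
  /* INTCK counts the month boundaries between the two dates (8 here),
     so the loop produces 9 months: 201810 through 201906. */
  do i = 0 to intck('month', '01OCT2018'd, '01JUN2019'd);
    d = intnx('month', '01OCT2018'd, i);   /* first day of each month */
    month = year(d) * 100 + month(d);      /* for example 201810 */
    output;
  end;
  drop i d;
run;
```

Widening the range to 15 months then only means changing the two date literals.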
/* Merge the full combination with what you have and put a marker to indicate if a particular month is present for the ID or not */
data merged;
merge allmonths have (in=a);
by id month;
if a then present=1;
else present =0;
run;
/* Form a bit pattern and use that to categorize your cases */
data want;
set merged;
by id;
retain pattern counter cnt0;
if first.id then do;
pattern = repeat('0',8);
counter = 1;
cnt0 = 0;
end;
if present then substr(pattern,counter,1) = '1';
else cnt0 + 1;
counter + 1;
/* You could use a macro to auto generate these combinations if you expect to have very many categories */
length desc $50;
if last.id then do;
if pattern = "000000001" then desc = "Reborn";
else if pattern = "100000000" then desc = "Dead";
else if cnt0 = 1 then desc = "1 months missing";
else if cnt0 = 2 and index(pattern, '00') then desc = "2 months missing";
else if cnt0 = 2 then desc = "chaos";
else if cnt0 = 3 and index(pattern, '000') then desc = "3 months missing";
else if cnt0 = 3 then desc = "chaos";
else if cnt0 = 4 and index(pattern, '0000') then desc = "4 - 6 months missing";
else if cnt0 = 4 then desc = "chaos";
else if cnt0 = 5 and index(pattern, '00000') then desc = "4 - 6 months missing";
else if cnt0 = 5 then desc = "chaos";
else if cnt0 = 6 and index(pattern, '000000') then desc = "4 - 6 months missing";
else if cnt0 = 6 then desc = "chaos";
output;
end;
keep id desc;
run;

Removing observations before 'beginning' and after 'ending' - SAS code

My table has some leading and trailing observations that I am trying to remove. I want to remove the rows that come before every 'begin' event and after every 'end' event for every single group. The table resembles the below:
| Time | Group | Event | Value |
| 1 | 1 | NA | 0 |
| 2 | 1 | NA | 0 |
| 3 | 1 | Begin | 1.1 |
| 4 | 1 | NA | 1.2 |
| 5 | 1 | NA | 1.3 |
| 6 | 1 | End | 1.4 |
| 7 | 1 | NA | 0 |
| 1 | 2 | NA | 0 |
| 2 | 2 | Begin | 1.1 |
| 3 | 2 | NA | 1.2 |
| 4 | 2 | End | 1.3 |
| 5 | 2 | NA | 1.4 |

On the presumption that the incoming data is already sorted and that there are zero or more serially bounded ranges of Begin to End within each group:
data want;
do until (last.group);
set have;
by group time;
if event = 'Begin' then _keeprow = 1;
if _keeprow then output;
if event = 'End' then _keeprow = 0;
end;
drop _keeprow;
run;
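The same idea can also be sketched with the ordinary implicit data step loop and a retained flag instead of the DO UNTIL loop. This assumes, as above, that have is sorted by group and time; the output name want_retain is hypothetical.

```sas
/* Sketch only: want_retain is an illustrative output name. */
data want_retain;
  set have;
  by group;
  retain _keeprow;
  if first.group then _keeprow = 0;        /* nothing is kept until a Begin is seen */
  if event = 'Begin' then _keeprow = 1;    /* start keeping with the Begin row */
  if _keeprow then output;
  if event = 'End' then _keeprow = 0;      /* stop keeping after the End row */
  drop _keeprow;
run;
```

Because the flag is switched off only after the OUTPUT statement, the End row itself is kept, and the approach does not depend on how many rows sit between Begin and End.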

I have come up with an easy way, but it is limited by the actual data size.
data have;
input Time Group Event $ Value ;
datalines;
1 1 NA 0
2 1 NA 0
3 1 Begin 1.1
4 1 NA 1.2
5 1 NA 1.3
6 1 End 1.4
7 1 NA 0
1 2 NA 0
2 2 Begin 1.1
3 2 NA 1.2
4 2 End 1.3
5 2 NA 1.4
;
run;
proc sort data = have;
by group time;
run;
data have1;
set have;
count + 1;
by group;
if first.group then count = -100;
if event = 'Begin' then count = 0;
if event = 'End' then count = 100;
if count < 0 or count >100 then delete;
run;
This code works on small data, provided there are fewer than 100 observations between 'Begin' and 'End' and fewer than 100 observations before 'Begin' in each group. You can adjust the initial count value to suit the true data size.

One way to do it is:
data have;
input Time Group Event $ Value ;
datalines;
1 1 NA 0
2 1 NA 0
3 1 Begin 1.1
4 1 NA 1.2
5 1 NA 1.3
6 1 End 1.4
7 1 NA 0
1 2 NA 0
2 2 Begin 1.1
3 2 NA 1.2
4 2 End 1.3
5 2 NA 1.4
;
data have2(keep= Group min_var max_var);
set have;
by group;
retain min_var max_var;
if trim(Event)= "Begin" then min_var =_n_ ;
if trim(Event)= "End" then max_var =_n_;
if last.group;
run;
data want;
merge have have2;
by group;
if _n_ ge min_var and _n_ le max_var ;
drop min_var max_var;
run;

Ranking variables according to their percent contribution to total

Consider the following example data:
psu | sumsc sumst sumobc sumother sumcaste
-------|-----------------------------------------------
10018 | 3 2 0 4 9
|
10061 | 0 0 2 5 7
|
10116 | 1 1 2 4 8
|
10121 | 3 0 1 2 6
|
20002 | 4 1 0 1 6
-------------------------------------------------------
I want to rank the variables sumsc, sumst, sumobc, and sumother according to their percent contribution to sumcaste (this is the total of all variables) within psu.
Could anyone help me do this in Stata?

First we enter the data:
clear all
set more off
input psu sumsc sumst sumobc sumother sumcaste
10018 3 2 0 4 9
10061 0 0 2 5 7
10116 1 1 2 4 8
10121 3 0 1 2 6
20002 4 1 0 1 6
end
Second, we prepare the reshape:
local j=1
foreach var of varlist sumsc sumst sumobc sumother {
gen temprl`j' = `var' / sumcaste
ren `var' addi`j'
local ++j
}
reshape long temprl addi, i(psu) j(ord)
lab def ord 1 "sumsc" 2 "sumst" 3 "sumobc" 4 "sumother"
lab val ord ord
Third, we order before presenting:
gsort psu -temprl
by psu: gen nro=_n
drop temprl
order psu nro ord
Fourth, presenting the data:
br psu nro ord addi
EDIT:
This is a combination of Aron's solution with mine (@PearlySpencer):
clear
input psu sumsc sumst sumobc sumother sumcaste
10018 3 2 0 4 9
10061 0 0 2 5 7
10116 1 1 2 4 8
10121 3 0 1 2 6
20002 4 1 0 1 6
end
local i = 0
foreach var of varlist sumsc sumst sumobc sumother {
local ++i
generate pct`i' = 100 * `var' / sumcaste
rename `var' temp`i'
local rvars "`rvars' r`i'"
}
rowranks pct*, generate("`rvars'") field lowrank
reshape long pct temp r, i(psu) j(name)
label define name 1 "sumsc" 2 "sumst" 3 "sumobc" 4 "sumother"
label values name name
keep psu name pct r
bysort psu (r): replace r = sum(r != r[_n-1])
Which gives you the desired output:
list, sepby(psu) noobs
+---------------------------------+
| psu name pct r |
|---------------------------------|
| 10018 sumother 44.44444 1 |
| 10018 sumsc 33.33333 2 |
| 10018 sumst 22.22222 3 |
| 10018 sumobc 0 4 |
|---------------------------------|
| 10061 sumother 71.42857 1 |
| 10061 sumobc 28.57143 2 |
| 10061 sumsc 0 3 |
| 10061 sumst 0 3 |
|---------------------------------|
| 10116 sumother 50 1 |
| 10116 sumobc 25 2 |
| 10116 sumst 12.5 3 |
| 10116 sumsc 12.5 3 |
|---------------------------------|
| 10121 sumsc 50 1 |
| 10121 sumother 33.33333 2 |
| 10121 sumobc 16.66667 3 |
| 10121 sumst 0 4 |
|---------------------------------|
| 20002 sumsc 66.66666 1 |
| 20002 sumst 16.66667 2 |
| 20002 sumother 16.66667 2 |
| 20002 sumobc 0 3 |
+---------------------------------+
This approach will be useful if you need the variables for further analysis as opposed to just displaying the results.

First you need to calculate percentages:
clear
input psu sumsc sumst sumobc sumother sumcaste
10018 3 2 0 4 9
10061 0 0 2 5 7
10116 1 1 2 4 8
10121 3 0 1 2 6
20002 4 1 0 1 6
end
foreach var of varlist sumsc sumst sumobc sumother {
generate pct_`var' = 100 * `var' / sumcaste
}
egen pcttotal = rowtotal(pct_*)
list pct_* pcttotal, abbreviate(15) noobs
+--------------------------------------------------------------+
| pct_sumsc pct_sumst pct_sumobc pct_sumother pcttotal |
|--------------------------------------------------------------|
| 33.33333 22.22222 0 44.44444 100 |
| 0 0 28.57143 71.42857 100 |
| 12.5 12.5 25 50 100 |
| 50 0 16.66667 33.33333 100 |
| 66.66666 16.66667 0 16.66667 99.99999 |
+--------------------------------------------------------------+
Then you need to get the ranks and do some gymnastics:
rowranks pct_*, generate(r_sumsc r_sumst r_sumobc r_sumother) field lowrank
mkmat r_*, matrix(A)
matrix A = A'
svmat A, names(row)
local matnames : rownames A
quietly generate name = " "
forvalues i = 1 / `: word count `matnames'' {
quietly replace name = substr(`"`: word `i' of `matnames''"', 3, .) in `i'
}
ds row*
foreach var in `r(varlist)' {
sort `var' name
generate `var'b = sum(`var' != `var'[_n-1])
drop `var'
rename `var'b `var'
list name `var' if name != " ", noobs
display ""
}
The above will give you what you want:
+-----------------+
| name row1 |
|-----------------|
| sumother 1 |
| sumsc 2 |
| sumst 3 |
| sumobc 4 |
+-----------------+
+-----------------+
| name row2 |
|-----------------|
| sumother 1 |
| sumobc 2 |
| sumsc 3 |
| sumst 3 |
+-----------------+
+-----------------+
| name row3 |
|-----------------|
| sumother 1 |
| sumobc 2 |
| sumsc 3 |
| sumst 3 |
+-----------------+
+-----------------+
| name row4 |
|-----------------|
| sumsc 1 |
| sumother 2 |
| sumobc 3 |
| sumst 4 |
+-----------------+
+-----------------+
| name row5 |
|-----------------|
| sumsc 1 |
| sumother 2 |
| sumst 2 |
| sumobc 3 |
+-----------------+
Note that you will first need to install the community-contributed command rowranks before you execute the above code:
net install pr0046.pkg

Creating complete data in stata

I have the following purchasing data
clear
input id productid purchase
1 1 1
2 1 1
3 2 1
1 3 1
end
I want to add a row for every id-productid combo to create the following dataset
id productid purchase
1 1 1
2 1 1
3 1 0
1 2 0
2 2 0
3 2 1
1 3 1
2 3 0
3 3 0
end
I have tried a lot of things that have not worked. This is my latest attempt:
qui sum id, d
local obs = r(N)
expand = `obs'
levelsof productid, local(id)
local j = 1
foreach i of local id {
replace productid = `i' if `j' == id
local j = `j' + 1
}

The fillin command (see help fillin) is the tool for this task.
Starting with your sample data in memory:
fillin id productid
replace purchase = 0 if _fillin
drop _fillin
sort productid id
list, sepby(productid) abbreviate(12)
produces
+---------------------------+
| id productid purchase |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 1 |
3. | 3 1 0 |
|---------------------------|
4. | 1 2 0 |
5. | 2 2 0 |
6. | 3 2 1 |
|---------------------------|
7. | 1 3 1 |
8. | 2 3 0 |
9. | 3 3 0 |
+---------------------------+
