Select the minimum over blocks of observations - stata

I am trying to make Stata select the minimum value of ice_cream eaten by every person (Amanda, Christian, Paola) so that I end up with just 3 rows:
person ice_cream
Amanda 16
Amanda 27
Amanda 29
Amanda 40
Amanda 96
Amanda 97
Christian 19
Christian 23
Christian 26
Christian 27
Christian 28
Christian 34
Christian 62
Christian 70
Christian 78
Paola 5
Paola 11
Paola 28
Paola 97

A one-line solution
collapse (min) ice_cream, by(person)
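Note that collapse replaces the dataset in memory with the collapsed one. If the original observations are still needed afterwards, one option is to wrap the call in preserve and restore. A minimal sketch, assuming the example data are already in memory:

```stata
* Take a temporary copy of the data in memory
preserve
* Reduce to one row per person holding the minimum of ice_cream
collapse (min) ice_cream, by(person)
list, noobs
* Bring back the original observations
restore
```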

An answer that avoids creating a new variable:
sort person ice_cream
by person: keep if _n == 1

This should work:
* Generate a variable with the group minimums
sort person
by person: egen Min = min(ice_cream)
* Only keep observations with same value as group minimums
keep if Min == ice_cream
* Delete minimum variable
drop Min
Note: This keeps only the observations whose ice_cream equals the group minimum. If several observations in a group tie at the minimum, all of them are kept, so that group will have more than one row (there are no ties in the data above, but they are likely if, for instance, ice_cream takes only a few distinct values). If you want exactly one observation per group, you could then add:
duplicates drop person, force

If you want to simply display the minimum value of ice_cream eaten
by Amanda, Christian and Paola, but without altering your dataset, you can
use the summarize command instead:
clear
input str20 person ice_cream
Amanda 16
Amanda 27
Amanda 29
Amanda 40
Amanda 96
Amanda 97
Christian 19
Christian 23
Christian 26
Christian 27
Christian 28
Christian 34
Christian 62
Christian 70
Christian 78
Paola 5
Paola 11
Paola 28
Paola 97
end
bysort person: summarize ice_cream
---------------------------------------------------------------------------
-> person = Amanda

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   ice_cream |          6    50.83333    36.18517         16         97

---------------------------------------------------------------------------
-> person = Christian

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   ice_cream |          9    40.77778    22.63171         19         78

---------------------------------------------------------------------------
-> person = Paola

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   ice_cream |          4       35.25    42.30347          5         97
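
If only the minimum is of interest, tabstat can report that single statistic per group, again without changing the data. A minimal sketch, assuming the same example data are in memory:

```stata
* One row per person showing only the minimum of ice_cream;
* nototal suppresses the overall row across all persons
tabstat ice_cream, by(person) statistics(min) nototal
```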

Related

Is there a way to make a dummy variable in SAS for a Country in my SAS Data Set?

I am looking to create a dummy variable for Latin American Countries in my data set which I need to make for a log-log model. I know how to log all of them for my later regression. Any suggestion or help on how to make a dummy variable for the Latin American countries with my data would be appreciated.
data HW6;
input country : $25. midyear sancts lprots lfrac ineql pop;
cards;
CHILE 1955 58 44 65 57 6.743
CHILE 1960 19 34 65 57 7.785
CHILE 1965 27 24 65 57 8.510
CHILE 1970 36 29 65 57 9.369
CHILE 1975 38 58 65 57 10.214
COSTA_RICA 1955 16 7 54 60 1.024
COSTA_RICA 1960 6 1 54 60 1.236
COSTA_RICA 1965 2 1 54 60 1.482
COSTA_RICA 1970 3 1 54 60 1.732
COSTA_RICA 1975 2 3 54 60 1.965
INDIA 1955 81 134 47 52 404.478
INDIA 1960 101 190 47 52 445.857
INDIA 1965 189 845 47 52 494.882
INDIA 1970 133 915 47 52 553.619
INDIA 1975 132 127 47 52 616.551
JAMICA 1955 11 12 47 62 1.542
JAMICA 1960 9 2 47 62 1.629
JAMICA 1965 8 6 47 62 1.749
JAMICA 1970 1 1 47 62 1.877
JAMICA 1975 7 1 47 62 2.043
PHILIPPINES 1955 26 123 48 56 24.0
PHILIPPINES 1960 20 38 48 56 27.898
PHILIPPINES 1965 9 5 48 56 32.415
PHILIPPINES 1970 79 25 48 56 37.540
SRI_LANKA 1955 29 2 73 52 8.679
SRI_LANKA 1960 75 35 73 52 9.879
SRI_LANKA 1965 25 63 73 52 11.202
SRI_LANKA 1970 34 14 73 52 12.532
TURKEY 1955 79 1 67 61 24.145
TURKEY 1960 138 19 67 61 28.217
TURKEY 1965 36 51 67 61 31.951
TURKEY 1970 51 22 67 61 35.743
URUGUAY 1955 8 4 57 48 2.372
URUGUAY 1960 12 1 57 48 2.538
URUGUAY 1965 16 14 57 48 2.693
URUGUAY 1970 21 19 57 48 2.808
URUGUAY 1975 24 45 57 48 2.829
VENEZUELA 1955 38 14 76 65 6.110
VENEZUELA 1960 209 23 76 65 7.632
VENEZUELA 1965 100 162 76 65 9.119
VENEZUELA 1970 9 27 76 65 10.709
VENEZUELA 1975 4 12 76 65 12.722
;
data newData;
set HW6;
sancts = log (sancts);
lprots = log (lprots);
lfrac = log (lfrac);
ineql = log (ineql);
pop = log (pop);
run;
The GLMSELECT procedure is one simple way of creating dummy variables. There is a nice article about how to use it to generate dummy variables.
data newData;
set HW6;
sancts = log (sancts);
lprots = log (lprots);
lfrac = log (lfrac);
ineql = log (ineql);
pop = log (pop);
Y = 0; *-- Create a fake response variable --*
run;
proc glmselect data=newData noprint outdesign(addinputvars)=want(drop=Y);
class country;
model Y = country / noint selection=none;
run;
If needed in a later step, use the macro variable &_GLSMOD, which the procedure creates and which contains the names of the dummy variables.
The real question here is not about SAS; it is about how to get the region of a country from its name.
I would try ISO 3166, which lists all countries and their geographical location.
Getting that list is straightforward: import it into SAS, merge it with your data by country, and finally flag the countries in Latin America.

Sas base: one-to-one reading by biggest table or getting data from next row

I'm new to Base SAS and need help.
I have 2 tables with different data and I need to merge them.
But in one step I need data from the next row.
Example of what I need:
ID Fdate Tdate NFdate NTdate
id1 date1 date1 date2 date2
id2 date2 date2 date3 date3
....
I did it by 2 merges:
data result;
merge table1 table2 by ...;
merge table1(firstobs=2) table2(firstobs=2) by...;
run;
I expected 10 rows but got 9 because one-to-one reading stopped at the last row of the smallest table (merge). How can I get the last row (that is, do one-to-one reading driven by the biggest table)?
Most simple data steps stop not at the bottom of the step but in the middle, when they read past the end of the input. You are getting N-1 observations because the second input has one fewer record, so you need to keep the step from reading past its end.
One simple way is to skip the second read when you are processing the last observation of the first input. The END= option creates a boolean variable that tells you when that happens.
Here is a simple example using SASHELP.CLASS.
data test;
set sashelp.class end=eof;
if not eof then set sashelp.class(firstobs=2 keep=name rename=(name=next_name));
else call missing(next_name);
run;
Results:
Obs    Name       Sex    Age    Height    Weight    next_name
  1    Alfred      M      14     69.0      112.5    Alice
  2    Alice       F      13     56.5       84.0    Barbara
  3    Barbara     F      13     65.3       98.0    Carol
  4    Carol       F      14     62.8      102.5    Henry
  5    Henry       M      14     63.5      102.5    James
  6    James       M      12     57.3       83.0    Jane
  7    Jane        F      12     59.8       84.5    Janet
  8    Janet       F      15     62.5      112.5    Jeffrey
  9    Jeffrey     M      13     62.5       84.0    John
 10    John        M      12     59.0       99.5    Joyce
 11    Joyce       F      11     51.3       50.5    Judy
 12    Judy        F      14     64.3       90.0    Louise
 13    Louise      F      12     56.3       77.0    Mary
 14    Mary        F      15     66.5      112.0    Philip
 15    Philip      M      16     72.0      150.0    Robert
 16    Robert      M      12     64.8      128.0    Ronald
 17    Ronald      M      15     67.0      133.0    Thomas
 18    Thomas      M      11     57.5       85.0    William
 19    William     M      15     66.5      112.0

SAS Restructure Data

I need help restructuring the data. My table looks like this:
NameHead  Department   Per_test  Per_Delta  Per_DB  Per_Vul
Nancy     Health       55        33.2       33      63
Jim       Air          25        22.8       23      11
Shu       Water        26        88.3       44      12
Dick      Electricity  77        55.9       66      10
Elena     General      88        22         67      9
Nancy     Internet     66        12         44      79
And I want my table to look like this:
NameHead    Nancy   Jim   Shu    Dick         Elena    Nancy
Department  Health  Air   Water  Electricity  General  Internet
Per_test    55      25    26     77           88       66
Per_Delta   33.2    22.8  88.3   55.9         22       12
Per_DB      33      23    44     66           67       44
Per_Vul     63      11    12     10           9        79
I tried proc transpose but couldn't get the desired result. Please help!
Thanks!
PROC TRANSPOSE does exactly what you want. You must include a VAR statement if you want to include the character variables.
proc transpose data=have out=want;
var _all_;
run;
Note that you cannot have variables that do not have names. Here is what the dataset looks like.
Obs _NAME_ COL1 COL2 COL3 COL4 COL5 COL6
1 NameHead Nancy Jim Shu Dick Elena Nancy
2 Department Health Air Water Electricity General Internet
3 Percent_test 55 25 26 77 88 66
4 Percent_Delta 33.2 22.8 88.3 55.9 22 12
5 Percent_DB 33 23 44 66 67 44
6 Percent_Vul 63 11 12 10 9 79

Removing entire panel with missing values

I'm working on a panel dataset that has missing values for four variables (at the start, at the end, and in the middle of panels). I would like to remove every panel that has any missing values.
This is the code I have tried to use so far:
bysort BvD_ID YEAR: drop if sum(!missing(REV_LAY,EMP_LAY,FX_ASSET_LAY,MATCOST_LAY))==0
This code removes every observation with a missing value in any of the four variables, but it keeps the remaining observations of the same panel.
Example data:
Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
In the above sample data, I want to drop panel Firm_ID = 001 completely.
You can do something like:
clear
input Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
end
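* Remember the original order so that sorting within Firm_ID is reproducible
generate index = _n
* Running count, within each firm, of observations with any missing value
bysort Firm_ID (index): generate todrop = sum(missing(REV_LAY, EMP_LAY, FX_ASSET_LAY))
* The last value is the firm's total; drop the whole firm if it is nonzero
by Firm_ID: drop if todrop[_N]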
list Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
     +-----------------------------------------------+
     | Firm_ID   Year   REV_LAY   EMP_LAY   FX_ASS~Y |
     |-----------------------------------------------|
  1. |       2   2001        40        15         45 |
  2. |       2   2002        42        18         48 |
  3. |       2   2003        45        20         50 |
     +-----------------------------------------------+
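
The same result can probably be reached without the running sum by letting egen compute the per-firm total directly. A minimal sketch, assuming the example data above are in memory before any rows are dropped; nmiss is a hypothetical variable name chosen for illustration:

```stata
* Per firm, count the observations that have a missing value in any variable
egen nmiss = total(missing(REV_LAY, EMP_LAY, FX_ASSET_LAY)), by(Firm_ID)
* Drop every observation of a firm that has at least one such row
drop if nmiss > 0
* Remove the helper variable
drop nmiss
```

Because the total is constant within a firm, no sort order or index variable is needed here.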

Filling in missing age values in NLSY79 data with Stata

The NLSY79 is a panel survey that records an individual's age but in some years it will skip the question and code it as -5. The dataset is annual between 1979 and 1994 and then becomes biennial through 2012. So sample data for an individual may look like this:
caseid_1979 year age
73 1988 25
73 1989 26
73 1990 -5
73 1991 -5
73 1992 -5
73 1993 30
73 1994 30
73 1996 32
73 1998 -5
73 2000 36
So my question is how can I program Stata so that the missing age values are filled in? I realize that in some years the person's birthday has not yet occurred so the age might be repeated in consecutive years and am not sure what to do (if anything) about that.
In this case, we can be confident that age is, or should be, linear in year for each individual. So we have an exercise in interpolation.
clonevar age2 = age
replace age2 = . if age2 == -5
ipolate age2 year, generate(age3) by(caseid_1979) epolate
bysort caseid_1979 (year) : assert age3 == age3[1] + (year - year[1])
If you want to allow a tolerance of 1 year, say
bysort caseid_1979 (year) : assert inlist((year - year[1]) - (age3 - age3[1]), 0, 1)
See also this FAQ, which discusses replacing missing values in sequences.
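
Since assert stops at the first failure, it can help to list the offending rows instead. A minimal sketch, assuming age3 has been created as above and that the identifier is named caseid_1979 as in the sample listing; gap is a hypothetical variable name:

```stata
* Years elapsed minus age elapsed since each person's first observation;
* 0 means age tracks the calendar exactly
bysort caseid_1979 (year): generate gap = (year - year[1]) - (age3 - age3[1])
* Show only the rows that depart from a strict one-year-per-year line
list caseid_1979 year age age3 gap if gap != 0, sepby(caseid_1979)
```

For the sample individual, gap is 0 through 1993 and 1 from 1994 onward, which reflects the repeated age of 30 in 1993 and 1994.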