The NLSY79 is a panel survey that records an individual's age, but in some years the question is skipped and age is coded as -5. The dataset is annual from 1979 to 1994 and then becomes biennial through 2012. So sample data for an individual may look like this:
caseid_1979 year age
73 1988 25
73 1989 26
73 1990 -5
73 1991 -5
73 1992 -5
73 1993 30
73 1994 30
73 1996 32
73 1998 -5
73 2000 36
So my question is: how can I program Stata so that the missing age values are filled in? I realize that in some years the person's birthday has not yet occurred, so the age might be repeated in consecutive years, and I am not sure what to do (if anything) about that.
In this case, we can be confident that age is, or should be, linear in year for each individual. So we have an exercise in interpolation.
clonevar age2 = age
replace age2 = . if age2 == -5
ipolate age2 year, generate(age3) by(caseid_1979) epolate
bysort caseid_1979 (year) : assert age3 == age3[1] + (year - year[1])
If you want to allow a tolerance of 1 year, say
bysort caseid_1979 (year) : assert abs(age3 - age3[1] - (year - year[1])) <= 1
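For completeness, here is a self-contained sketch that runs those steps on the sample records above; the only assumption is that the identifier variable is named caseid_1979, as in the listing.
clear
input caseid_1979 year age
73 1988 25
73 1989 26
73 1990 -5
73 1991 -5
73 1992 -5
73 1993 30
73 1994 30
73 1996 32
73 1998 -5
73 2000 36
end
clonevar age2 = age
replace age2 = . if age2 == -5
ipolate age2 year, generate(age3) by(caseid_1979) epolate
list caseid_1979 year age age3, sepby(caseid_1979)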
See also this FAQ, which discusses replacing missing values in sequences.
I have data that looks like this:
Date        Name  SurveyID  Score  Error
2022-02-17  Jack  10        95     Name
2022-02-17  Jack  10        95     Address
2022-02-16  Tom   9         100
2022-02-16  Carl  8         93     Zip
2022-02-16  Carl  8         93     Email
2022-02-15  Dan   7         72     Zip
2022-02-15  Dan   7         72     Email
2022-02-15  Dan   7         72     Name
2022-02-15  Dan   6         90     Phone
2022-02-14  Tom   5         98     Gender
I want to build segmentation data using the average score per individual.
Segment
A: 98%-100%
B: 95%-97%
C: 90%-94%
D: 80%-89%
E: 0%-79%
I wrote this ifelse formula:
ifelse({Score} >= 98,'A',ifelse({Score} >= 95,'B',ifelse({Score} >= 90,'C',ifelse({Score} >= 80,'D','E'))))
This is now the output of what I did:
Date        Name  SurveyID  Score  Error    Segment
2022-02-17  Jack  10        95     Name     B
2022-02-17  Jack  10        95     Address  B
2022-02-16  Tom   9         100             A
2022-02-16  Carl  8         93     Zip      C
2022-02-16  Carl  8         93     Email    C
2022-02-15  Dan   7         72     Zip      E
2022-02-15  Dan   7         72     Email    E
2022-02-15  Dan   7         72     Name     E
2022-02-15  Dan   6         90     Phone    C
2022-02-14  Tom   5         98     Gender   A
I realized that the calculation I did only applies to each row's score, not to the average per person. I was expecting an output like this:
Name  Average Score  Total Survey  Segment
Jack  95             1             B
Tom   99             2             A
Carl  93             1             C
Dan   81             2             D
I have tried to create another calculated field for Average Score which is:
avgOver({Score}, [Name], PRE_AGG)
I believe I am missing a distinct count of survey IDs in that formula, but I do not know where to place it. As for the segmentation calculation, I cannot for the life of me figure that part out without getting aggregation errors in QuickSight. Please help, thank you.
I got the answer from the QuickSight Community. Pasting it here.
For segmentation, you can use the calculated field which you created for average score.
avg_score = avgOver({Score}, [Name], PRE_AGG)
Segment:
ifelse(
{avg_score} >= 98, 'A',
{avg_score} >= 95, 'B',
{avg_score} >= 90, 'C',
{avg_score} >= 80, 'D',
'E'
)
The survey id can be used to get the distinct count per individual.
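For the Total Survey column, a minimal sketch could be the calculated field below (Total_Survey is an assumed name; it uses distinctCountOver with the same partition and PRE_AGG level as the average above):
Total_Survey = distinctCountOver({SurveyID}, [Name], PRE_AGG)
Because it counts distinct SurveyID values per person, repeated rows for the same survey (one per error type) do not inflate the total.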
I'm new to SAS Base and need help.
I have 2 tables with different data and I need to merge them.
But in one step I need data from the next row.
Example of what I need:
ID Fdate Tdate NFdate NTdate
id1 date1 date1 date2 date2
id2 date2 date2 date3 date3
....
I did it with 2 merge statements:
data result;
  merge table1 table2;
  by ...;
  merge table1(firstobs=2) table2(firstobs=2);
  by ...;
run;
I expected 10 rows but got 9, because one-to-one reading stopped at the last row of the smaller table (merge). How can I get the last row (i.e., do one-to-one reading driven by the bigger table)?
Most simple data steps stop not at the bottom of the step but in the middle, when they read past the end of the input. The reason you are getting N-1 observations is that the second input has one fewer record. So you need to do something to stop that.
One simple way is to not execute the second read when you are processing the last observation read by the first one. You can use the END= option to create a boolean variable that will let you know when that happens.
Here is a simple example using SASHELP.CLASS.
data test;
  set sashelp.class end=eof;   /* EOF is 1 when reading the last observation */
  /* read one observation ahead, except when already at the end */
  if not eof then set sashelp.class(firstobs=2 keep=name rename=(name=next_name));
  else call missing(next_name);
run;
Results:
next_
Obs Name Sex Age Height Weight name
1 Alfred M 14 69.0 112.5 Alice
2 Alice F 13 56.5 84.0 Barbara
3 Barbara F 13 65.3 98.0 Carol
4 Carol F 14 62.8 102.5 Henry
5 Henry M 14 63.5 102.5 James
6 James M 12 57.3 83.0 Jane
7 Jane F 12 59.8 84.5 Janet
8 Janet F 15 62.5 112.5 Jeffrey
9 Jeffrey M 13 62.5 84.0 John
10 John M 12 59.0 99.5 Joyce
11 Joyce F 11 51.3 50.5 Judy
12 Judy F 14 64.3 90.0 Louise
13 Louise F 12 56.3 77.0 Mary
14 Mary F 15 66.5 112.0 Philip
15 Philip M 16 72.0 150.0 Robert
16 Robert M 12 64.8 128.0 Ronald
17 Ronald M 15 67.0 133.0 Thomas
18 Thomas M 11 57.5 85.0 William
19 William M 15 66.5 112.0
I'm working on a panel dataset that has missing values for four variables (at the start, at the end, and in the middle of panels). I would like to remove entirely any panel that has missing values.
This is the code I have tried to use so far:
bysort BvD_ID YEAR: drop if sum(!missing(REV_LAY,EMP_LAY,FX_ASSET_LAY,MATCOST_LAY))==0
This piece of code successfully removes all observations with missing values in any of the four variables, but it retains those panels' remaining observations with non-missing values.
Example data:
Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
In the above sample data, I want to drop panel Firm_ID = 001 completely.
You can do something like:
clear
input Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
end
generate index = _n  // remember the original order
bysort Firm_ID (index): generate todrop = sum(missing(REV_LAY, EMP_LAY, FX_ASSET_LAY))  // running count of rows with any missing value
by Firm_ID: drop if todrop[_N]  // last value is the panel total; drop the whole panel if it is nonzero
list Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
+-----------------------------------------------+
| Firm_ID Year REV_LAY EMP_LAY FX_ASS~Y |
|-----------------------------------------------|
1. | 2 2001 40 15 45 |
2. | 2 2002 42 18 48 |
3. | 2 2003 45 20 50 |
+-----------------------------------------------+
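A slightly more compact variant of the same idea (a sketch using the three variables from the example; add MATCOST_LAY for the full set) would be:
bysort Firm_ID: egen anymiss = max(missing(REV_LAY, EMP_LAY, FX_ASSET_LAY))  // 1 for every observation of a panel with any missing value
drop if anymiss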
Hello,
I want to write a dynamic program that flags the start and end dates of events nested within the consolidated dates shown at the top of each Pt. ID in the attached example. I can easily do this if there is only one such consolidated period per Pt. ID. However, there can be more than one consolidated period per Pt. ID (as shown for the second Pt. ID, 1002). As shown in the example, events that fall within the consolidated period(s) are flagged as "Y" in the flag variable, and events that do not fall within a consolidated period are flagged as "N". How can I write a program that accounts for all of the consolidated periods per Pt. ID, compares them with the dates of the rest of that patient's events, and flags events that fall within any of those consolidated periods?
Thank you.
So join the event records with the period records and calculate whether the event is within the period. Then you could take the MAX over all periods.
For example, here is code for your sample that creates a binary 1/0 flag variable called INCLUDED.
data Sample;
infile datalines missover;
input Pt_ID Event_ID Category $ Start_Date : mmddyy10.
Start_Day End_date : mmddyy10. End_day Duration
;
format Start_date End_date mmddyy10.;
datalines;
1001 . Moderate 8/5/2016 256 9/3/2016 285 30
1001 1 Moderate 3/8/2016 106 3/16/2016 114 9
1001 2 Moderate 8/5/2016 256 8/14/2016 265 10
1001 3 Moderate 8/21/2016 272 8/24/2016 275 4
1001 4 Moderate 8/23/2016 274 9/3/2016 285 12
1002 . Severe 11/28/2016 13 12/19/2016 34 22
1002 . Severe 2/6/2017 83 2/28/2017 105 23
1002 1 Severe 11/28/2016 13 12/5/2016 20 8
1002 2 Severe 12/12/2016 27 12/19/2016 34 8
1002 3 Severe 1/9/2017 55 1/12/2017 58 4
1002 4 Severe 2/6/2017 83 2/13/2017 90 8
1002 5 Severe 2/20/2017 97 2/28/2017 105 9
1002 6 Severe 3/17/2017 122 3/24/2017 129 8
1002 7 Severe 5/4/2017 170 5/13/2017 179 10
1002 8 Severe 5/24/2017 190 5/30/2017 196 7
1002 9 Severe 6/9/2017 206 6/13/2017 210 5
;
proc sql;
  create table want as
  select a.*
       /* 1 if this event lies within any consolidated period of the same patient */
       , max(b.start_date <= a.start_date and b.end_date >= a.end_date) as Included
  from sample a
  left join sample b
    on a.pt_id = b.pt_id and missing(b.event_id)  /* period records have missing Event_ID */
  group by 1,2,3,4,5,6,7,8  /* group by all of a.'s columns: one output row per event record */
  order by a.pt_id, a.event_id, a.start_date, a.end_date
  ;
quit;
I have a transaction-level dataset, and I want to collapse it and calculate the weekly average price. The dataset can be simplified as follows:
clear
input str9 date quantity price id
"01jan2010" 50 70 1
"02jan2010" 60 80 2
"02jan2010" 70 90 3
"04jan2010" 70 95 4
"08jan2010" 60 81 5
"09jan2010" 70 88 6
"12jan2010" 55 87 7
"13jan2010" 52 88 8
end
gen date2=date(date,"DMY")
format date2 %td
drop date
I want to create a variable date3. For every transaction that happened in a given week, date3 should be the Monday of that week.
Here's the code I have:
sort date2
gen date3=date2 if dow(date2)==1
replace date3=date3[_n-1] if missing(date3)
format date3 %td
However, there are Mondays with no transactions even though the rest of the week has transactions. In those cases, date3 is not the Monday of that week but the Monday of an earlier week.
My data becomes the following using the above code:
quantity price id date2 date3
50 70 1 01jan2010
60 80 2 02jan2010
70 90 3 02jan2010
70 95 4 04jan2010 04jan2010
60 81 5 08jan2010 04jan2010
70 88 6 09jan2010 04jan2010
55 87 7 12jan2010 04jan2010
52 88 8 13jan2010 04jan2010
To me, it does not matter that id = 1, 2, 3 have no date3. What concerns me is that id = 7 and id = 8 should have a date3 of 11jan2010, but because there is no transaction on that day, the date becomes 04jan2010. Is there a way to fix this?
(I was thinking of constructing a new dataset with consecutive dates from 01jan2010, merging it with the one above, and then dropping observations with missing quantity or price. But I was wondering if there's a more efficient way.)
In addition, I have weekly index data that report on every Friday since 01jan2010. If I use the wofd command, Stata will generate 53 weeks in 2010 (or, more precisely, two 2010w52 weeks). How can I get just 52 weeks in Stata?
(I found this http://www.stata.com/statalist/archive/2012-02/msg01030.html but I still cannot figure out how this can help solve my problem.)
Your weeks start on Mondays. Everything you need follows from using dow() to exploit the fact that in every one of your weeks, the day of week function dow() yields 1, 2, 3, 4, 5, 6, 0 for the days from Monday to Sunday.
The present or previous Monday for a daily date variable daily is just
gen Monday = cond(dow(daily) == 0, daily - 6, daily - dow(daily) + 1)
The branch works like this. If it's a Sunday, the previous Monday was 6 days ago. Otherwise, the Monday that starts the week was today if it's Monday (dow() yields 1), yesterday if it's Tuesday (dow() yields 2), and so forth. Here the variable Monday is just the dates of the Mondays that define the weeks.
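Applied to the example data, a minimal sketch of the weekly collapse is then (date2 is the daily date variable created above; avg_price is just an assumed name):
gen Monday = cond(dow(date2) == 0, date2 - 6, date2 - dow(date2) + 1)
format Monday %td
collapse (mean) avg_price=price, by(Monday)  // simple mean; use [aw=quantity] for a quantity-weighted mean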
Important detail: There are no assumptions here about dates being complete in the data or even in order.
Small note: Arbitrary names like date2 and date3 mean nothing much. Use evocative names in your questions (and your practice).
There was a sequel to the article mentioned by Robert Ferrer. Type search week, sj in Stata to get the references.
Do not use Stata's weeks, and in particular do not use the wofd() function (not a command), as they can't help you. Stata's weeks will not map onto your weeks. The article mentioned by Robert Ferrer really is worthwhile reading to understand this (even though I wrote it).
(This is all explained in the Statalist threads you link to.)