Generate a new variable from row conditions in Stata

I am using Stata 13 on Windows 7. I have a dataset with repeated observations of age and education in a single row for each id: the variables q9p1educ and q9p1age are the education and age of person 1, q9p2educ and q9p2age are the education and age of person 2, and so on. I want to extract the education level of the person with the highest age. I have managed to extract the maximum age, maxage, using egen maxage = rowmax(q9p1age - q9p9age). How can I get the education of the person with the maximum age?
The sample data is here

I would start by reshaping your data into long format
reshape long q9@educ q9@age, i(id maxage) j(pid) string
Then the answer depends on what you want to do if the maximum age is not unique. Perhaps you could average the education values of the tied persons?
bysort id (q9age): gen temp=q9educ if q9age==maxage
bysort id: egen educmaxage=mean(temp)
drop temp
Then if you want it wide again, you could simply reshape wide.
reshape wide q9@educ q9@age, i(id maxage educmaxage) j(pid) string
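If reshaping is inconvenient, the same idea can be sketched directly on the wide data with a loop over the nine person slots implied by q9p1age - q9p9age in the question. This is only a sketch under assumptions: it assumes the education variables are numeric and named q9p1educ through q9p9educ, and, unlike the averaging above, it keeps the education of the first person found when several share the maximum age.

```stata
* Sketch: assumes wide variables q9p1age-q9p9age and q9p1educ-q9p9educ exist
* and that maxage was created with egen ... rowmax() as in the question.
generate educmaxage = .
forvalues p = 1/9 {
    * fill in only where nothing has been found yet, so ties go to the lowest p
    replace educmaxage = q9p`p'educ if q9p`p'age == maxage & !missing(maxage) & missing(educmaxage)
}
```

Use this in place of the reshape steps, not after them, because both create a variable called educmaxage.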

Related

How to create a new variable of age based upon an existing numeric date born variable in sas?

I want to create a numeric age variable from an existing numeric date-of-birth variable in SAS. This "BORN" variable is numeric with a length of 8, and its format is MMDDYY10. I assume I should use age = today's date - BORN date. However, the BORN values look like -15226, -8803, and so on, and I don't understand why there is a minus sign in front of these numbers. How do I subtract the patient's date of birth from today's date to get the actual age?
SAS stores dates and times as numbers. A date is the number of days between January 1, 1960 and the specified date, so dates before 1960 are negative. To display it in a human-readable form, you have to apply a format (for example MMDDYY10.).
Similarly time is a number of seconds since midnight of the current day. SAS time values are between 0 and 86400.
Your code would look like this:
data have;
input born MMDDYY10.;
format born MMDDYY10.;
datalines;
03/17/2000
11/11/1988
08/11/1923
;
run;
data want;
set have;
age = floor((DATE()-born) / 365.25);
run;
SAS will correctly translate your input (if you correctly used your formats) into numbers, which are easy for a program to calculate with.

Finding string value associated with max value of record subset in long format

For non-longitudinal analysis using long-formatted data, when subjects have multiple visits or records, I will typically hunt down a record within each subject using bysort ID, and set a temporary variable to hold the integer or real value that I found, and then egen max() to find the max value for all records found, then set a final value in record _n==1 for that subject. This is so I can have the values I want from different visits percolate to a single record for each subject. Each single record per subject will then be used during analysis (but not longitudinal, maybe cross-sectional or regression, ANOVA, etc.)
Let's say I want the highest cholesterol (ldl) value for the 3rd year of a trial, where ldl is measured quarterly (every 3 months) for all subjects, which can be accomplished using the code below:
cap drop ldl3tmp
cap drop ldl3max
cap drop ldl3
bysort id (visitdate): gen ldl3tmp = ldl if trialyear==3
bysort id (visitdate): egen ldl3max = max(ldl3tmp)
bysort id (visitdate): gen ldl3 = ldl3max if _n==1
Suppose there are initials for the lab technician or phlebotomist that did the blood draw. How can I percolate a string value to record _n==1 that's associated with the greatest ldl value among the subset of records for the 3rd year of the trial? String values can't be sorted, so I am guessing the answer might be to eliminate records for which ldl is not the greatest value in year 3, then the string will be in that record?
In this case, how can I find out what _n is for the maximum value? If I know that, I could use
bysort id (visitdate): drop if _n!=6 //if _n==6 has the max value of ldl
Here is how to find the record number associated with the greatest ldl value within 4 quarterly ldl values in year 3 of a trial. The result is a variable called recmax, which will only be filled in for the specific record where the greatest value was found (among all records for each subject).
cap drop tmpldl3
cap drop maxldl3
cap drop recmax
cap drop visitdate
gen long visitdate = date(dateofvisit, "MDY") //You have to convert date ("MM/DD/YYYY") to a long integer format - based on #days since Jan 1, 1960
bysort id (visitdate): gen tmpldl3 = ldl if trialyear ==3
bysort id (visitdate): egen maxldl3 = max(tmpldl3)
bysort id (visitdate): gen recmax = _n if tmpldl3==maxldl3 & tmpldl3!=. & maxldl3!=.
You can then analyze all the other data (such as string data) in that record cross-sectionally (ANOVA, correlation, regression) by specifying if recmax!=. as the trailing if qualifier of any analysis command. If you are careful, you could also drop all other records, with the extraneous ldl values not of interest, using the command drop if recmax==. provided you realize you have dropped data and, if you save, save to a filename with "_reduced" or "_dropped" in it.
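To percolate the string itself to the first record of each subject, as the question asks, one possible sketch is to sort the year-3 maximum to the top within each id and copy the string from there. The variable tech is a hypothetical name for the technician's initials, and notmax and tech3 are names introduced only for this illustration; tmpldl3 and maxldl3 are the variables created above.

```stata
* Sketch: tech is a hypothetical string variable with the technician's initials
cap drop notmax
cap drop tech3
* 0 for the record(s) holding the year-3 maximum, 1 for everything else
generate byte notmax = !(tmpldl3 == maxldl3 & !missing(tmpldl3))
* within id the maximum sorts first (earliest visit if tied); copy its string
bysort id (notmax visitdate): generate tech3 = tech[1] if notmax[1] == 0
* keep the value only on the first visit record of each subject
bysort id (visitdate): replace tech3 = "" if _n > 1
```

Subjects with no ldl value in year 3 are left with an empty tech3.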

Calculate the number of firms at a given month

I'm working on a dataset in Stata
The first column is the name of the firm. the second column is the start date of this firm and the third column is the expiration date of this firm. If the expdate is missing, this firm is still in business. I want to create a variable that will record the number of firms at a given time. (preferably to be a monthly variable)
I'm really lost here. Please help!
Next time, try using dataex (ssc install dataex) rather than a screenshot. This is recommended in the Stata tag wiki, and will help others help you!
Here is an example of how to count the number of firms that are alive in each period (I'll use years, but point out where you can switch to months). This example borrows from Nick Cox's Stata Journal article on this topic.
First, load the data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input long(firmID dt_start dt_end)
3923155 20080123 99991231
2913168 20070630 99991231
3079566 20000601 20030212
3103920 20020805 20070422
3357723 20041201 20170407
4536020 20120201 20170407
2365954 20070630 20190630
4334271 20110721 20191130
4334338 20110721 20170829
4334431 20110721 20190429
end
Note that in my example data the dates are not in Stata format, so I'll convert them here:
tostring dt_start, replace
generate startdate=date(dt_start, "YMD")
tostring dt_end, replace
generate enddate=date(dt_end, "YMD")
format startdate enddate %td
Next make a variable with the time interval you'd like to count within:
generate startyear = year(startdate)
generate endyear = year(enddate)
In my dataset I have missing end dates that begin with '9999' while you have them as '.' I'll set these to the current year, the assumption being that the dataset is current. You'll have to decide whether this is appropriate in your data.
replace endyear = year(date("$S_DATE","DMY")) if endyear == 9999
Next create an observation for the first and last years (or months) that the firm is alive:
expand 2
by firmID, sort: generate year = cond(_n == 1, startyear, endyear)
keep firmID year
duplicates drop // keeps one observation for firms that die in the period they were born
Now expand the dataset to have an observation for every period between the start and end date. For this I use tsfill.
xtset firmID year
tsfill
Now I have one observation per existing firm in each period. All that remains is to count the observations by year:
egen entities = count(firmID), by(year)
drop firmID
duplicates drop
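For monthly counts, as the question prefers, the same steps can be sketched with monthly dates built by mofd(). This is a sketch under the same assumptions as the yearly example (end dates coded 9999 mean the firm is still alive, and the dataset is current); it is meant to be run right after the date conversion, in place of the yearly steps, and the names startmonth, endmonth and month are introduced only for illustration.

```stata
* Sketch: monthly variant, starting from startdate and enddate created above
generate startmonth = mofd(startdate)
generate endmonth = mofd(enddate)
* treat the 9999 placeholder as "alive until the current month"
replace endmonth = mofd(date("$S_DATE", "DMY")) if year(enddate) == 9999
* one observation for the first and one for the last month of each firm
expand 2
by firmID, sort: generate month = cond(_n == 1, startmonth, endmonth)
keep firmID month
duplicates drop
* fill in every month between the first and the last
xtset firmID month, monthly
tsfill
* count firms alive in each month
egen entities = count(firmID), by(month)
drop firmID
duplicates drop
```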

How to calculate month between two dates in stata

I want to create a variable with the age of each credit. The data only has the start date of the credit.
I created a default date variable (e.g. 2017-12-31). Now I want to calculate the age relative to the credit's start date.
I tried to search for an article about that, but no luck.
Thanks.
It seems like your data is daily. In that case what you need is:
gen current_date=date("20171231","YMD")
format current_date %td //this will be the variable from which age will be calculated
gen age=current_date-start_credit_date //again, assuming the start credit variable is daily
This gives the age variable as the number of days. If you want it as the number of months, you need to add:
gen current_month=mofd(current_date)
format current_month %tm
gen start_month=mofd(start_credit_date)
format start_month %tm
gen age_month=current_month-start_month
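Note that a difference of monthly dates counts how many calendar-month boundaries lie between the two dates, not completed months. A small check with literal dates (chosen here purely for illustration) shows the effect: the first line displays 1 and the second displays 0.

```stata
* one day apart, but in different calendar months
display mofd(td(01feb2017)) - mofd(td(31jan2017))
* 30 days apart, but in the same calendar month
display mofd(td(31jan2017)) - mofd(td(01jan2017))
```

If completed months are what you need, the day difference in age is the safer starting point.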

How to replace a zero-valued answer by its respective average value?

I have a household data set which includes expenditures for various foods. I categorized them into main food groups, and price is obtained by dividing the expenditure value by quantity. For some households the price comes out as zero, since their consumption of the corresponding food group is zero. In such cases, I want to use the average price of the city, district and province from which that non-consuming household was selected.
How could I do it using STATA?
The mean of the positive values is
egen mean_price = mean(price / (price > 0)), by(province district city)
and you can replace zeros in a clone by
gen price2 = cond(price > 0, price, mean_price)
The division trick can be explained like this. If price > 0 is true, then that expression evaluates to 1; and if false to 0. Dividing by 1 clearly leaves values unchanged. Dividing by 0 creates missings, which egen's mean() function will ignore, which is precisely what is wanted.
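A toy example, with made-up prices and no grouping variables, may make the trick concrete. The zero is excluded from the mean, so mean_price is 3 in every observation and price2 becomes 3, 2 and 4.

```stata
* Sketch with invented data: one zero price and two positive prices
clear
input price
0
2
4
end
* 0/0 is missing and is ignored by mean(); 2/1 and 4/1 are unchanged
egen mean_price = mean(price / (price > 0))
generate price2 = cond(price > 0, price, mean_price)
list
```

A missing price counts as greater than zero in Stata, so under this sketch it stays missing in price2 and does not enter the mean.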
There is more discussion of related technique in the article referred to in http://www.stata-journal.com/article.html?article=dm0055
P.S. Stata is the correct spelling. It is an invented word, and was never an acronym.
P.S. You have yet to acknowledge an answer at How to get the difference of two variables, when there are missing values?
LATER:
In this case another way is
egen total = total(price), by(province district city)
egen number = total(price > 0), by(province district city)
gen price2 = cond(price > 0, price, total/number)
as zero prices make no difference to the total. Use doubles throughout.