I have a set of household data with more than 20,000 records for 4200 households. The data set has no column for household ID, and every record is for one household member. There is a column for the person's serial number, and each "1" in that column marks the start of a new household. (That is, if we number households starting from the very first record, where the person's serial number is 1, the corresponding HH_ID should be 1; at the next record where the person's serial number is 1, HH_ID should become 2.) So I want to add a column named HH_ID and number it from 1 to 4200. How could I write a program for this using Stata?
What you want is (assuming a variable personid for person identifier)
. gen hhid = sum(personid == 1)
That's it. The explanation is longer than the code. The expression personid == 1 evaluates as 1 when true and 0 when false. For the first household, first person, this will be 1, and for the other persons in the same household 0. For the second household, first person, this will be 1, and so on. The function sum() gives the cumulative or running sum, so that you should end with something that goes 1,1,1,2,2,2,2,3,3,3,... Clearly the actual numbers of 1s, 2s, 3s etc. will depend on the numbers of persons in the households.
On true and false in Stata see
http://www.stata.com/support/faqs/data-management/true-and-false/index.html
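A minimal sketch of the running-sum idea on toy data may help. The nine records below are invented for illustration (three households of 3, 2 and 4 persons); only personid and hhid come from the answer above.

```stata
* toy data: person serial numbers for three made-up households
clear
input personid
1
2
3
1
2
1
2
3
4
end

* each personid == 1 adds 1 to the running sum, starting a new household
gen hhid = sum(personid == 1)

* hhid runs 1 1 1 2 2 3 3 3 3
list, sepby(hhid)

* the last observation carries the number of households found
assert hhid[_N] == 3
```

Note that the result depends on the current order of the observations, so the data should not be re-sorted before hhid is created. On the real data, hhid in the last observation should equal the number of households expected.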
For non-longitudinal analysis using long-formatted data, when subjects have multiple visits or records, I will typically hunt down a record within each subject using bysort ID, and set a temporary variable to hold the integer or real value that I found, and then egen max() to find the max value for all records found, then set a final value in record _n==1 for that subject. This is so I can have the values I want from different visits percolate to a single record for each subject. Each single record per subject will then be used during analysis (but not longitudinal, maybe cross-sectional or regression, ANOVA, etc.)
Let's say I want the highest cholesterol (ldl) value for the 3rd year of a trial, where ldl is measured quarterly (every 3 months) for all subjects, which can be accomplished using the code below:
cap drop ldl3tmp
cap drop ldl3max
cap drop ldl3
bysort id (visitdate): gen ldl3tmp = ldl if trialyear==3
bysort id (visitdate): egen ldl3max = max(ldl3tmp)
bysort id (visitdate): gen ldl3 = ldl3max if _n==1
Suppose there are initials for the lab technician or phlebotomist that did the blood draw. How can I percolate a string value to record _n==1 that's associated with the greatest ldl value among the subset of records for the 3rd year of the trial? String values can't be sorted, so I am guessing the answer might be to eliminate records for which ldl is not the greatest value in year 3, then the string will be in that record?
In this case, how can I find out what _n is for the maximum value? If I know that, I could use
bysort id (visitdate): drop if _n!=6 //if _n==6 has the max value of ldl
Here is how to find the record number associated with the greatest ldl value within 4 quarterly ldl values in year 3 of a trial. The result is a variable called recmax, which will only be filled in for the specific record where the greatest value was found (among all records for each subject).
cap drop tmpldl3
cap drop maxldl3
cap drop recmax
cap drop visitdate
gen long visitdate = date(dateofvisit, "MDY") //You have to convert date ("MM/DD/YYYY") to a long integer format - based on #days since Jan 1, 1960
bysort id (visitdate): gen tmpldl3 = ldl if trialyear ==3
bysort id (visitdate): egen maxldl3 = max(tmpldl3)
bysort id (visitdate): gen recmax = _n if tmpldl3==maxldl3 & tmpldl3!=. & maxldl3!=.
You can then analyze all the other data in that record (such as string data) cross-sectionally (ANOVA, correlation, regression) by adding the qualifier if recmax!=. to any analysis command. If you are careful, you could also drop all the other records, whose ldl values are not of interest, with the command drop if recmax==. provided that you realize you have dropped data and that, if you save, you save to a filename with "_reduced" or "_dropped" in it.
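As for carrying the technician's initials themselves to every record of a subject, Stata can sort string variables, with empty strings sorting first. The following is only a sketch under assumptions: the variable names tech, visitno, techtmp and techmax and the toy values are hypothetical, not taken from the data described above.

```stata
* toy data: two subjects, four quarterly visits in trial year 3
clear
input id trialyear visitno ldl str3 tech
1 3 1 110 "AB"
1 3 2 140 "CD"
1 3 3 125 "EF"
1 3 4 130 "AB"
2 3 1 95 "GH"
2 3 2 90 "CD"
2 3 3 99 "EF"
2 3 4 97 "GH"
end

* greatest year-3 ldl within each subject
bysort id: egen maxldl3 = max(cond(trialyear == 3, ldl, .))

* keep the initials only on the record that holds that maximum
gen str3 techtmp = tech if trialyear == 3 & ldl == maxldl3 & !missing(ldl)

* empty strings sort first, so the last record within id holds the initials
bysort id (techtmp): gen str3 techmax = techtmp[_N]

list id visitno ldl tech techmax, sepby(id)
```

If two year-3 records tie for the maximum, this picks the alphabetically last initials, so ties may need separate handling. The final bysort also changes the sort order within id, so re-sort by visit date afterwards if later code relies on it.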
I was confused by the following SAS code. The SAS data set named WORK.SALARY contains 10 observations for each department, and is currently ordered by Department. The following SAS program is submitted:
data WORK.TOTAL;
set WORK.SALARY(keep=Department MonthlyWageRate);
by Department;
if First.Department=1 then Payroll=0;
Payroll+(MonthlyWageRate*12);
if Last.Department=1;
run;
So, what exactly is First.Department and Last.Department? Many thanks for your time and attention.
Your data step calculates the total PAYROLL for each DEPARTMENT.
The FIRST. and LAST. variables are generated automatically when you use a BY statement. They are true when the current observation is the first (or last) observation in the BY group. See "How the DATA Step Identifies BY Groups" in the SAS documentation.
The sum statement (Syntax: var+expression;) for PAYROLL means that the value of PAYROLL is retained (or carried over) to the next observation.
The IF/THEN statement initializes the value to zero when a new group starts.
The subsetting IF statement will make sure that only the final observation for each department is output.
As explained, it is calculating payroll for each department.
First.department is set to 1 when the first record for a particular department is read. Last.department is set to 1 when the last record for that department is read.
So if you have :
Department  Wage
1           100
1           200
1           300
2           1000
2           2000
2           3000
With the first. and last. assigned, it will look like this:
Department  Wage  first.department  last.department
1           100   1                 0
1           200   0                 0
1           300   0                 1
2           1000  1                 0
2           2000  0                 0
2           3000  0                 1
Now you can follow your logic as to what happens when first.department = 1.
By the way, note that if Last.Department=1; has no THEN clause: it is a subsetting IF, so only the last observation of each department is written to WORK.TOTAL.
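For readers more at home in Stata, as in the other threads here, a rough analogue of the data step can be sketched as follows. It is not a translation of the SAS internals: the variable names obs, first, last and payroll are made up for illustration, and the toy data are the six records from the table above.

```stata
* toy data from the table above
clear
input department wage
1 100
1 200
1 300
2 1000
2 2000
2 3000
end

* remember the original order so that first and last are well defined
gen long obs = _n

* analogues of FIRST.Department and LAST.Department
bysort department (obs): gen byte first = (_n == 1)
bysort department (obs): gen byte last = (_n == _N)

* running total within department, like the SAS sum statement
bysort department (obs): gen payroll = sum(wage * 12)

* analogue of the subsetting IF: keep one record per department
keep if last
list department payroll
```

With these toy values the two remaining records hold payroll totals of 7200 and 72000.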
I am working with the CES diary data from 2006. I have a file which for each household has an entry for each item bought during a week long period. I have the following variables
newid id of household
cost dollar cost of item
ucc a code denoting the type of item
I am interested in restaurant expenditures, which are covered by ucc 190111, 190112, ... . I want to collapse my data so that for each newid I have the sum of restaurant expenditures for the household during the week. I used the command
collapse (sum) cost if ucc=="190111".... , by (newid)
However, I would like to have a zero when there are no restaurant expenditures and Stata simply removes those entries.
You need an intermediate variable with some zeros for non-restaurant expenditures:
gen rest_exp = cond(inlist(ucc,"190111","190112"),cost,0)
collapse (sum) rest_exp, by(newid)
One caveat is that inlist() has a constraint of 9 possible values for strings, but you probably have fewer than that or should destring, in which case the limit is 254. You can also hitch a few inlist()s together with |.
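A small sketch on invented data shows that households with no restaurant spending survive the collapse with a zero. Only the codes 190111 and 190112 come from the question; the other ucc values and all costs below are hypothetical.

```stata
* toy data: household 2 has no restaurant items
clear
input newid str6 ucc cost
1 "190111" 12.5
1 "200111" 30
2 "300000" 8
end

* restaurant items keep their cost, everything else contributes 0
gen rest_exp = cond(inlist(ucc, "190111", "190112"), cost, 0)

* one record per household; household 2 gets a total of 0
collapse (sum) rest_exp, by(newid)
list
```

Because the if qualifier is no longer applied before collapse, every newid is still present afterwards.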
In a panel data set I have 3 variables: name, week, and income.
I would like to make an indicator variable that indicates initial weeks where income is 0. So say a person X has 0 income in the first 13 weeks, the indicator takes the value 1 the first 13 weeks, and is otherwise 0. The same procedure for person Y and so on.
I have tried using by groups, but I can't get it to work.
Any suggestions?
One solution is
bysort name (week) : gen no_income = sum(income) == 0
The function sum() yields the cumulative or running sum. So, as long as income is 0, its cumulative sum remains 0 too. As soon as a person earns something, the cumulative sum becomes positive. The code presumes that cumulative income cannot return to zero later on, which could happen only if income were negative in some week. To exclude that possibility, use an appropriate extra condition, such as
bysort name (week) : gen no_income = sum(income) == 0 & income == 0
For a problem with very similar flavour, see this FAQ. A meta-lesson is to look at the StataCorp FAQs as one of several resources.
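If negative incomes are possible, the running sum can return to zero and a later week with zero income would then be flagged as well. One alternative, offered here as a sketch rather than as part of the answer above, is to count nonzero weeks instead of summing income. The toy data are invented for illustration.

```stata
* toy data: person X earns 50, loses 50, then earns nothing
clear
input str1 name week income
"X" 1 0
"X" 2 0
"X" 3 50
"X" 4 -50
"X" 5 0
"Y" 1 20
"Y" 2 0
end

* running count of weeks with nonzero income; 0 only in the initial spell
bysort name (week): gen no_income = sum(income != 0) == 0

* X is flagged in weeks 1 and 2 only; Y is never flagged
list, sepby(name)
```

Missing income counts as nonzero here, because a missing value is not equal to 0, so a missing week ends the initial spell. Whether that is wanted depends on the data.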
I have a household data set which includes expenditures for various foods. I categorized them into main food groups, and price is obtained by dividing the expenditure value by quantity. For some households the price comes out as zero, since their consumption of the corresponding food group is zero. In such cases, I want to set the price to the average price for the city, district and province in which that non-consuming household is located.
How could I do it using STATA?
The mean of the positive values is
egen mean_price = mean(price / (price > 0)), by(province district city)
and you can replace zeros in a clone by
gen price2 = cond(price > 0, price, mean_price)
The division trick can be explained like this. If price > 0 is true, then that expression evaluates to 1; and if false to 0. Dividing by 1 clearly leaves values unchanged. Dividing by 0 creates missings, which egen's mean() function will ignore, which is precisely what is wanted.
There is more discussion of related techniques in the article referred to in http://www.stata-journal.com/article.html?article=dm0055
P.S. Stata is the correct spelling. It is an invented word, and was never an acronym.
P.S. You have yet to acknowledge an answer at How to get the difference of two variables, when there are missing values?
LATER:
In this case another way is
egen total = total(price), by(province district city)
egen number = total(price > 0), by(province district city)
gen price2 = cond(price > 0, price, total/number)
as zero prices make no difference to the total. Use doubles throughout.
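A toy check of the division trick, using invented prices for two cities, might look like this. The values are hypothetical; the variable names follow the code above.

```stata
* toy data: one zero price in each city
clear
input province district city double price
1 1 1 10
1 1 1 20
1 1 1 0
1 1 2 5
1 1 2 0
end

* dividing by (price > 0) turns zero prices into missings, which mean() ignores
egen double mean_price = mean(price / (price > 0)), by(province district city)

* keep positive prices, fill zeros with the group mean
gen double price2 = cond(price > 0, price, mean_price)

* city 1: 10, 20, 15; city 2: 5, 5
list, sepby(city)
```

A missing price stays missing in price2, because a missing value counts as greater than zero in Stata, so cond() returns the original missing price.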