Creating indicator variable in Stata

In a panel data set I have 3 variables: name, week, and income.
I would like to make an indicator variable that indicates initial weeks where income is 0. Say person X has 0 income in the first 13 weeks: then the indicator takes the value 1 in those first 13 weeks and 0 otherwise. The same applies to person Y, and so on.
I have tried using by groups, but I can't get it to work.
Any suggestions?

One solution is
bysort name (week) : gen no_income = sum(income) == 0
The function sum() yields the cumulative or running sum. So, as long as income is 0, its cumulative sum remains 0 too. As soon as a person earns something, the cumulative sum becomes positive. The code presumes that the cumulative sum cannot return to zero later, which could only happen if income were negative in some week. To rule out that possibility, add an appropriate extra condition, such as
bysort name (week) : gen no_income = sum(income) == 0 & income == 0
For a problem with very similar flavour, see this FAQ. A meta-lesson is to look at the StataCorp FAQs as one of several resources.
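As a small illustration with invented names and values, the cumulative-sum trick plays out like this:
clear
input str1 name week income
"X" 1  0
"X" 2  0
"X" 3 50
"X" 4  0
"Y" 1 20
"Y" 2  0
end
bysort name (week) : gen no_income = sum(income) == 0 & income == 0
list, sepby(name)
Person X gets 1 in weeks 1 and 2 only; the week-4 zero that comes after positive income stays 0, which is exactly the behaviour asked for.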

Related

Count by groups and collapse

I have a dataset in Stata and want to count by group (loc_ID) and year. I used the following two lines of code:
egen count_obsv = tag(loc_ID year)
This adds a variable (count_obsv) to my dataset which is 1 for every new combination of loc_ID and year, and 0 for every further observation with the same combination.
Then I use:
collapse (sum) count_obsv, by(loc_ID year)
According to various Stata forum posts, this should result in, e.g.:
loc_ID  year  count_obsv
1       2000         342
1       2001          23
2       2008          23
...
But my output is:
loc_ID  year  count_obsv
1       2000           1
1       2001           1
2       2008           1
...
What am I summarizing wrong?
When you call up the tag() function of the egen command, you assign the value 1 to just one of any number of observations with the same distinct values for the specified variables, and 0 to all the others. Then when you ask for the sum of those values in the same groups of observations, you get the group sums of one 1 and any number of 0s, and each sum is thus necessarily 1.
Your question is probably abstracted from some other calculations that worked as you expected, but if all you wanted was a dataset with frequencies, then
contract loc_ID year
would do that for you. If you wanted a dataset with summaries of other variables too, you would need something more like
collapse (count) count=foo (mean) mean=foo (sd) sd=foo, by(loc_ID year)
I doubt that any Statalist posts state otherwise. (I wrote tag() in 1999, and I am not aware of this as a misunderstanding.) There is a related but so to speak distinct problem where tag() comes in useful, which is counting distinct values (often called unique values).
sysuse auto, clear
egen tag = tag(foreign rep78)
egen distinct = total(tag), by(foreign)
tabdisp foreign, c(distinct)
would be a way to get at the number of distinct values of rep78 within categories of foreign.
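As a quick check, continuing with the auto data loaded above (purely for illustration; the frequency variable name below is my own choice), contract gives the group frequencies directly:
preserve
contract foreign rep78, freq(count_obsv)
list, sepby(foreign)
restore
Each row of the contracted dataset is one combination of foreign and rep78 together with its frequency, which is the kind of result the question was after.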

What is this program doing exactly? (SAS)

I was confused by the following SAS code. Here, the SAS data set named WORK.SALARY contains 10 observations for each department and is currently ordered by Department. The following SAS program is submitted:
data WORK.TOTAL;
    set WORK.SALARY(keep=Department MonthlyWageRate);
    by Department;
    if First.Department=1 then Payroll=0;
    Payroll+(MonthlyWageRate*12);
    if Last.Department=1;
run;
So, what exactly is First.Department and Last.Department? Many thanks for your time and attention.
Your data step calculates the total PAYROLL for each DEPARTMENT.
The FIRST. and LAST. variables are generated automatically when you use a BY statement. They are true when the current observation is the first (or last) observation in the BY group; see the SAS documentation topic "How the DATA Step Identifies BY Groups".
The sum statement (Syntax: var+expression;) for PAYROLL means that the value of PAYROLL is retained (or carried over) to the next observation.
The IF/THEN statement initializes the value to zero when a new group starts.
The subsetting IF statement will make sure that only the final observation for each department is output.
As explained, it is calculating payroll for each department.
First.Department takes the value 1 when the first record for a department is read; Last.Department takes the value 1 when the last record for that department is read.
So if you have:
Department  Wage
1            100
1            200
1            300
2           1000
2           2000
2           3000
With the first. and last. assigned, it will look like this:
Department  Wage  first.department  last.department
1            100                 1                0
1            200                 0                0
1            300                 0                1
2           1000                 1                0
2           2000                 0                0
2           3000                 0                1
Now you can follow your logic as to what happens when first.department = 1.
By the way, in your code, I don't see them doing anything with if Last.Department=1; other than subsetting to the last record of each department.
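Since the rest of this page is Stata-centric, here is a rough Stata sketch of the same logic, assuming variables Department and MonthlyWageRate: the running sum mirrors the SAS sum statement, and keeping the last observation per group mirrors the subsetting IF.
bysort Department : gen Payroll = sum(MonthlyWageRate*12)
by Department : keep if _n == _N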

Stata - Generating rolling average variable

I'd like to generate a rolling average variable from a basketball dataset. So if the first observation is 25 points on January 1, the generated variable will show 25. If the second observation is 30 points on January 2, the variable generated will show 27.5. If the third observation is 35 points, the variable generated will show 30, etc.
For a variable y ordered by some time t, at its simplest the average of values to date is
gen yave = sum(y) / _n
which is the cumulative sum divided by the number of observations. If there are occasional missing values, they are ignored by sum() but the denominator needs to be fixed, say
gen yave = sum(y) / sum(y < .)
This generalises easily to panel structure
bysort id (t) : gen yave = sum(y) / sum(y < .)
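A toy panel (values invented) shows both the expanding average and the treatment of a missing value:
clear
input id t y
1 1 25
1 2 30
1 3 35
2 1 10
2 2  .
2 3 20
end
bysort id (t) : gen yave = sum(y) / sum(y < .)
list, sepby(id)
For id 1 the running averages are 25, 27.5, and 30, matching the numbers in the question; for id 2 the missing value in period 2 adds nothing to either numerator or denominator.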
Here is the solution I came up with. I had to create three variables: a cumulative point total (the numerator), a running game count (the denominator), and their ratio, player points per game:
* assumes the data are already sorted by player and date
gen player_pts = points if player != player[_n-1]
replace player_pts = points + player_pts[_n-1] if player == player[_n-1] & _n != 1
by player : gen player_games = _n
gen ppg = player_pts / player_games

Count number of living firms efficiently

I have a list of companies with start and end dates for each. I want to count the number of companies alive over time. I have the following code but it runs slowly on my large dataset. Is there a more efficient way to do this in Stata?
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        * display date("01-`m'-`y'","DMY")
        count if start_dt <= date("01-`m'-`y'","DMY") & date("01-`m'-`y'","DMY") <= end_dt
    }
}
One way is to use the inrange() function. In Stata, date variables are just integers, so you can easily operate on them.
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        local d = date("01-`m'-`y'","DMY")
        count if inrange(`d', start_dt, end_dt)
    }
}
This alone will save you a huge amount of time. For 50,000 observations (and made-up data):
. timer list 1
1: 3.40 / 1 = 3.3980
. timer list 2
2: 18.61 / 1 = 18.6130
timer 1 is with inrange, timer 2 is your original code. Results are in seconds. Run help inrange and help timer for details.
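For reference, timings like those above can be produced with the timer command along these lines; the loop bodies are the two versions already shown, abbreviated to comments here:
timer clear
timer on 1
* ... loop using inrange(), as above ...
timer off 1
timer on 2
* ... original loop with repeated date() calls ...
timer off 2
timer list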
That said, maybe someone can suggest an overall better strategy.
Assuming a firm identifier firmid, this is another way to think about the problem, but with a different data structure. Make sure you have a saved copy of your dataset before you do this.
expand 2
bysort firmid : gen eitherdate = cond(_n == 1, start_dt, end_dt)
by firmid : gen score = cond(_n == 1, 1, -1)
sort eitherdate
gen living = sum(score)
by eitherdate : replace living = living[_N]
So,
We expand each observation to 2 and put both dates in a new variable, the start date in one observation and the end date in the other observation.
We assign a score that is 1 when a firm starts and -1 when it ends.
The number of firms is increased by 1 every time a firm starts and decreased by 1 every time one ends. We just need to sort by date and the number of firms is the cumulative sum of those scores. (EDIT: There is a fix for changes on the same date.)
This new data structure could be useful for other purposes.
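A tiny worked example (two firms with invented dates) makes the mechanics visible:
clear
input firmid str9 sdate str9 edate
1 "01jan1990" "30jun1995"
2 "15mar1993" "31dec2000"
end
gen start_dt = date(sdate, "DMY")
gen end_dt = date(edate, "DMY")
expand 2
bysort firmid : gen eitherdate = cond(_n == 1, start_dt, end_dt)
by firmid : gen score = cond(_n == 1, 1, -1)
sort eitherdate
gen living = sum(score)
by eitherdate : replace living = living[_N]
format eitherdate %td
list firmid eitherdate score living, noobs
The count rises to 2 when the second firm starts in 1993 and falls back by 1 as each firm ends.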
There is a write-up at http://www.stata-journal.com/article.html?article=dm0068
EDIT:
Notes in response to @Roberto Ferrer (and anyone else who reads this):
I fixed a bad bug, which made this too difficult to understand. Sorry about that.
The dates used here are just the dates at which firms start and end. There is no evident point in evaluating the number of firms at any other date as it would just be the same number as the previous date used. If you needed, however, to interpolate to a grid of dates, copying the previous count would be sufficient.
It is important not to confuse the Stata function sum(), which returns the cumulative sum, with any egen function. The impression that egen's total() is an alternative here was a side-effect of my bug.

How to replace a zero-valued answer by its respective average value?

I have a household data set which includes expenditures for various foods. I categorized them into main food groups, and price is obtained by dividing the expenditure value by quantity. For some households the price comes out as zero, since their consumption of the corresponding food group is zero. In such cases, I want to use as the price the average price in the corresponding city, district, and province from which that non-consuming household was selected.
How could I do it using STATA?
The mean of the positive values is
egen mean_price = mean(price / (price > 0)), by(province district city)
and you can replace zeros in a clone by
gen price2 = cond(price > 0, price, mean_price)
The division trick can be explained like this. If price > 0 is true, then that expression evaluates to 1; if it is false, to 0. Dividing by 1 clearly leaves values unchanged. Dividing by 0 creates missings, which egen's mean() function ignores, which is precisely what is wanted.
There is more discussion of related technique in the article at http://www.stata-journal.com/article.html?article=dm0055
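A toy example (made-up prices, simplified to a single grouping variable) shows the trick at work:
clear
input city price
1 10
1  0
1 20
2  0
2  8
end
egen mean_price = mean(price / (price > 0)), by(city)
gen price2 = cond(price > 0, price, mean_price)
list, sepby(city)
The zero price in city 1 is replaced by 15, the mean of 10 and 20; the zero in city 2 is replaced by 8.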
P.S. Stata is the correct spelling. It is an invented word, and was never an acronym.
P.S. You have yet to acknowledge an answer at How to get the difference of two variables, when there are missing values?
LATER:
In this case another way is
egen total = total(price), by(province district city)
egen number = total(price > 0), by(province district city)
gen price2 = cond(price > 0, price, total/number)
as zero prices make no difference to the total. Use doubles throughout.
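Continuing the toy example above, the second route with explicit doubles would be:
egen double total = total(price), by(city)
egen double number = total(price > 0), by(city)
gen double price2b = cond(price > 0, price, total/number)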