How to rank without skipping values? [duplicate] - stata

I have some data in Stata which look like the first two columns of:
group_id var_to_rank desired_rank
____________________________________
1 10 1
1 20 2
1 30 3
1 40 4
2 10 1
2 20 2
2 20 2
2 30 3
I'd like to create a rank of each observation within group (group_id) according to one variable (var_to_rank). Usually, for this purpose I used:
gen id = _n
However some of my observations (group_id = 2 in my small example) have the same values of ranking variable and this approach doesn't work.
I have also tried using:
egen rank
command with different options, but cannot make my rank variables make to look like desired_rank.
Could you point me to a solution to this problem?

The following works for me:
bysort group_id: egen desired_rank=rank(var_to_rank)

I'd say this question is posed the wrong way round for best understanding. The aim is to group observations, those with the lowest value all being assigned a grade 1, the next lowest being all assigned 2 and so forth. This isn't ranking in most senses that I have seen discussed, but Stata's egen, rank() does get you part of the way.
But the direct way, which was mentioned in the Statalist thread cited elewhere in this thread (start here) is simpler in spirit than any solution quoted:
bysort group_id (var_to_rank): gen desired_rank = sum(var_to_rank != var_to_rank[_n-1])
Once data are sorted on var_to_rank then when values differ from previous values at the start of each block of distinct values a value of 1 is the result of var_to_rank != var_to_rank[_n-1]; otherwise 0 is the result. Summing those 1s and 0s cumulatively gives the desired variable. The prefix command bysort does the sorting required and ensures that this is all done separately within the groups defined by group_id. No need for egen at all (a command that many people who only use Stata occasionally often find bizarre).
Declaration of interest: The Statalist thread cited shows that when asked a similar question I too did not think of this solution in one.

Stumbled upon such solution on the Statalist:
bysort group_id (var_to_rank) : gen rank = var_to_rank != var_to_rank[_n-1]
by group_id : replace rank = sum(rank)
Seems to sort out this issue.

#radek: you surely got it sorted out in the meantime ... but this would have been an easy (though not very elegant) solution:
bysort group_id: egen desired_rank_HELP =rank(var_to_rank), field
egen desired_rank =group(grup_id desired_rank_HELP)
drop desired_rank_HELP

Way too much work. Easy and elegant. Try this one.
gen desired_rank=int(var_to_rank/10)

try this command, it works for me so well: egen newid=group(oldid)

Related

identify groups with few observations in paneldata models (stata)

How can I identify groups with few observations in panel-data models?
I estimated using xtlogit several random effects models. On average I have 26 obs per group but some groups only record 1 observation. I want to identify them and exclude them from the models... any suggestion how?
My panel data is set using: xtset countrycode year
Let's suppose your magic number for a big enough panel is 7 and that you fit a first model.
bysort countrycode : egen n_used = total(e(sample))
then gives you a count of how many observations were available and can be used, after which your criterion for a later model is if n_used >= 7
You could just go
bysort countrycode : gen n_available = _N
regardless of a model fit.
The differences are two-fold:
That last statement would disregard any missing values in the variables used in a model fit.
If you also used if and/or in to restrict model fit to particular subsets of observations, then e(sample) knows about that, but the last statement does not.

Count by groups and collapse

I have a dataset in Stata and want to count by group (loc_ID) and year. I used the following two lines of code:
egen count_obsv = tag(loc_ID year)
This adds a counter to my dataset (count_obsv) which is 1 (and 0 for every element that has the same combination of loc_ID and year) for every new combination.
Then I use:
collapse (sum) count_obsv, by(loc_ID year)
according to various Stata forum posts this should result in eg.:
loc_ID year count_obsv
1 2000 342
1 2001 23
2 2008 23
...
But my output is:
loc_ID year count_obsv
1 2000 1
1 2001 1
2 2008 1
...
What am I summarizing wrong?
When you call up the tag() function of the egen command, you assign the value 1 to just one of any number of observations with the same distinct values for the specified variables, and 0 to all the others. Then when you ask for the sum of those values in the same groups of observations, you get the group sums of one 1 and any number of 0s, and each sum is thus necessarily 1.
Your question is probably abstracted from some other calculations that worked as you expected, but if all you wanted was a dataset with frequencies, then
contract loc_ID year
would do that for you. If you wanted a dataset with summaries of other variables too, you would need something more like
collapse (count) count=foo (mean) mean=foo (sd) sd=foo, by(loc_ID year)
I doubt that any Statalist posts state otherwise. (I wrote tag() in 1999, and I am not aware of this as a misunderstanding.) There is a related but so to speak distinct problem where tag() comes in useful, which is counting distinct values (often called unique values).
sysuse auto, clear
egen tag = tag(foreign rep78)
egen distinct = total(tag), by(foreign)
tabdisp foreign, c(distinct)
would be a way to get at the number of distinct values of rep78 within categories of foreign.

EU-SILC database about education and experience

I am using EU-SILC database for 2008 for Greece. Firstly, I would like to use PE040 so as to create three dummies: primeduc for education on pre-primary AND primary school seceduc on lower secondary education +(upper) secondary + post-secondary non tertiary education and tereduc on 1st + 2nd tertiary stage.
Secondly, I would like to make a variable about working experience based on the idea exper=age-educ-6 where educ I would like sth about the years (generally) spent in education.
Any ideas of which commands I should use on stata???
What I've tried so far
About stata syntax:
tabulate PE040, gen(educ)
gen primeduc=educ1+educ2
gen seceduc=educ3+educ4+educ5
gen tereduc=educ6
Having defined lnwage as =log(PY010N/(PL060+PL070)) and age as =2008-PB140, I've tried to regress and it takes only into account 191 obs.
For your first question, I think you want a 0-1 indicator, equal to 1 if either of the indicated educational categories was recorded.
gen primeduc=educ1 | educ2
gen seceduc =educ3 |educ4 |educ5
The "|" stands for logical "or". For example, primeduc will be 1 if educ1 is 1 or educ2 is 1.

Creating indicator variable in Stata

In a panel data set I have 3 variables: name, week, and income.
I would like to make an indicator variable that indicates initial weeks where income is 0. So say a person X has 0 income in the first 13 weeks, the indicator takes the value 1 the first 13 weeks, and is otherwise 0. The same procedure for person Y and so on.
I have tried using by groups, but I can't get it to work.
Any suggestions?
One solution is
bysort name (week) : gen no_income = sum(income) == 0
The function sum() yields cumulative or running sum. So, as long as income is 0, its cumulative sum remains 0 too. As soon as a person earns something, the cumulative sum becomes positive. The code is based on the presumption that cumulative income can not cross zero again because in a given week, income is negative. To exclude that possibility use an appropriate extra condition, such as
bysort name (week) : gen no_income = sum(income) == 0 & income == 0
For a problem with very similar flavour, see this FAQ. A meta-lesson is to look at the StataCorp FAQs as one of several resources.

Count number of living firms efficiently

I have a list of companies with start and end dates for each. I want to count the number of companies alive over time. I have the following code but it runs slowly on my large dataset. Is there a more efficient way to do this in Stata?
forvalues y = 1982/2012 {
forvalues m = 1/12 {
*display date("01-`m'-`y'","DMY")
count if start_dt <= date("01-`m'-`y'","DMY") & date("01-`m'-`y'","DMY") <= end_dt
}
}
One way is to use the inrange function. In Stata, Date variables are just integers so you can easily operate on them.
forvalues y = 1982/2012 {
forvalues m = 1/12 {
local d = date("01-`m'-`y'","DMY")
count if inrange(`d', start_dt, end_dt)
}
}
This alone will save you a huge amount of time. For 50.000 observations (and made-up data):
. timer list 1
1: 3.40 / 1 = 3.3980
. timer list 2
2: 18.61 / 1 = 18.6130
timer 1 is with inrange, timer 2 is your original code. Results are in seconds. Run help inrange and help timer for details.
That said, maybe someone can suggest an overall better strategy.
Assuming a firm identifier firmid, this is another way to think about the problem, but with a different data structure. Make sure you have a saved copy of your dataset before you do this.
expand 2
bysort firmid : gen eitherdate = cond(_n == 1, start_dt, end_dt)
by firmid : gen score = cond(_n == 1, 1, -1)
sort eitherdate
gen living = sum(score)
by eitherdate : replace living = living[_N]
So,
We expand each observation to 2 and put both dates in a new variable, the start date in one observation and the end date in the other observation.
We assign a score that is 1 when a firm starts and -1 when it ends.
The number of firms is increased by 1 every time a firm starts and decreased by 1 every time one ends. We just need to sort by date and the number of firms is the cumulative sum of those scores. (EDIT: There is a fix for changes on the same date.)
This new data structure could be useful for other purposes.
There is a write-up at http://www.stata-journal.com/article.html?article=dm0068
EDIT:
Notes in response to #Roberto Ferrer (and anyone else who read this):
I fixed a bad bug, which made this too difficult to understand. Sorry about that.
The dates used here are just the dates at which firms start and end. There is no evident point in evaluating the number of firms at any other date as it would just be the same number as the previous date used. If you needed, however, to interpolate to a grid of dates, copying the previous count would be sufficient.
It is important not to confuse the Stata function sum() which returns the cumulative sum with any egen function. The impression that egen's total() is an alternative here was a side-effect of my bug.