Consider the following example data:
  psu  | sumsc  sumst  sumobc  sumother  sumcaste
-------|------------------------------------------
 10018 |   3      2       0        4        9
 10061 |   0      0       2        5        7
 10116 |   1      1       2        4        8
 10121 |   3      0       1        2        6
 20002 |   4      1       0        1        6
I want to rank the variables sumsc, sumst, sumobc, and sumother within each psu according to their percent contribution to sumcaste (the row total of the four variables).
Could anyone help me do this in Stata?
First we enter the data:
clear all
set more off
input psu sumsc sumst sumobc sumother sumcaste
10018 3 2 0 4 9
10061 0 0 2 5 7
10116 1 1 2 4 8
10121 3 0 1 2 6
20002 4 1 0 1 6
end
Second, we prepare the reshape:
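// compute each variable's share of sumcaste and give the originals numbered stub names for the reshape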
local j=1
foreach var of varlist sumsc sumst sumobc sumother {
gen temprl`j' = `var' / sumcaste
ren `var' addi`j'
local ++j
}
reshape long temprl addi, i(psu) j(ord)
lab def ord 1 "sumsc" 2 "sumst" 3 "sumobc" 4 "sumother"
lab val ord ord
Third, we order the observations before presenting:
gsort psu -temprl
by psu: gen nro=_n
drop temprl
order psu nro ord
Fourth, we present the data:
br psu nro ord addi
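If you prefer printed output to the browser window, something like list gives the same picture:
list psu nro ord addi, sepby(psu) noobs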
EDIT:
This is a combination of Aron's solution with mine (@PearlySpencer):
clear
input psu sumsc sumst sumobc sumother sumcaste
10018 3 2 0 4 9
10061 0 0 2 5 7
10116 1 1 2 4 8
10121 3 0 1 2 6
20002 4 1 0 1 6
end
local i = 0
foreach var of varlist sumsc sumst sumobc sumother {
local ++i
generate pct`i' = 100 * `var' / sumcaste
rename `var' temp`i'
local rvars "`rvars' r`i'"
}
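// rank the percentages within each row: the largest share gets rank 1, and ties share a rank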
rowranks pct*, generate("`rvars'") field lowrank
reshape long pct temp r, i(psu) j(name)
label define name 1 "sumsc" 2 "sumst" 3 "sumobc" 4 "sumother"
label values name name
keep psu name pct r
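// convert the field-style ranks to dense ranks so that no rank number is skipped after ties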
bysort psu (r): replace r = sum(r != r[_n-1])
Which gives you the desired output:
list, sepby(psu) noobs
+---------------------------------+
| psu name pct r |
|---------------------------------|
| 10018 sumother 44.44444 1 |
| 10018 sumsc 33.33333 2 |
| 10018 sumst 22.22222 3 |
| 10018 sumobc 0 4 |
|---------------------------------|
| 10061 sumother 71.42857 1 |
| 10061 sumobc 28.57143 2 |
| 10061 sumsc 0 3 |
| 10061 sumst 0 3 |
|---------------------------------|
| 10116 sumother 50 1 |
| 10116 sumobc 25 2 |
| 10116 sumst 12.5 3 |
| 10116 sumsc 12.5 3 |
|---------------------------------|
| 10121 sumsc 50 1 |
| 10121 sumother 33.33333 2 |
| 10121 sumobc 16.66667 3 |
| 10121 sumst 0 4 |
|---------------------------------|
| 20002 sumsc 66.66666 1 |
| 20002 sumst 16.66667 2 |
| 20002 sumother 16.66667 2 |
| 20002 sumobc 0 3 |
+---------------------------------+
This approach will be useful if you need the variables for further analysis as opposed to just displaying the results.
First you need to calculate percentages:
clear
input psu sumsc sumst sumobc sumother sumcaste
10018 3 2 0 4 9
10061 0 0 2 5 7
10116 1 1 2 4 8
10121 3 0 1 2 6
20002 4 1 0 1 6
end
foreach var of varlist sumsc sumst sumobc sumother {
generate pct_`var' = 100 * `var' / sumcaste
}
egen pcttotal = rowtotal(pct_*)
list pct_* pcttotal, abbreviate(15) noobs
+--------------------------------------------------------------+
| pct_sumsc pct_sumst pct_sumobc pct_sumother pcttotal |
|--------------------------------------------------------------|
| 33.33333 22.22222 0 44.44444 100 |
| 0 0 28.57143 71.42857 100 |
| 12.5 12.5 25 50 100 |
| 50 0 16.66667 33.33333 100 |
| 66.66666 16.66667 0 16.66667 99.99999 |
+--------------------------------------------------------------+
Then you need to get the ranks and do some gymnastics:
rowranks pct_*, generate(r_sumsc r_sumst r_sumobc r_sumother) field lowrank
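// move the ranks into a matrix and transpose it, so that each observation (psu) becomes a column row1-row5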
mkmat r_*, matrix(A)
matrix A = A'
svmat A, names(row)
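// recover the original variable names from the matrix row names, stripping the r_ prefix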
local matnames : rownames A
quietly generate name = " "
forvalues i = 1 / `: word count `matnames'' {
quietly replace name = substr(`"`: word `i' of `matnames''"', 3, .) in `i'
}
ds row*
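// for each psu column, convert the field ranks to dense ranks and list the result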
foreach var in `r(varlist)' {
sort `var' name
generate `var'b = sum(`var' != `var'[_n-1])
drop `var'
rename `var'b `var'
list name `var' if name != " ", noobs
display ""
}
The above will give you what you want:
+-----------------+
| name row1 |
|-----------------|
| sumother 1 |
| sumsc 2 |
| sumst 3 |
| sumobc 4 |
+-----------------+
+-----------------+
| name row2 |
|-----------------|
| sumother 1 |
| sumobc 2 |
| sumsc 3 |
| sumst 3 |
+-----------------+
+-----------------+
| name row3 |
|-----------------|
| sumother 1 |
| sumobc 2 |
| sumsc 3 |
| sumst 3 |
+-----------------+
+-----------------+
| name row4 |
|-----------------|
| sumsc 1 |
| sumother 2 |
| sumobc 3 |
| sumst 4 |
+-----------------+
+-----------------+
| name row5 |
|-----------------|
| sumsc 1 |
| sumother 2 |
| sumst 2 |
| sumobc 3 |
+-----------------+
Note that you will first need to install the community-contributed command rowranks before you execute the above code:
net install pr0046.pkg
I have a panel dataset in Stata with several countries, each containing groups. I would like to rank the groups within each country according to the variable var1.
The structure of my dataset is as follows (the rank column is what I would like to achieve). Note that var1 is indeed constant within groups (it is just the within-group average of another variable).
--country--|--groupId--|---time----|---var1----|---rank---
1 | 1 | 1 | 50 | 3
1 | 1 | 2 | 50 | 3
1 | 1 | 3 | 50 | 3
1 | 2 | 1 | 90 | 1
1 | 2 | 2 | 90 | 1
1 | 2 | 3 | 90 | 1
1 | 3 | 1 | 60 | 2
1 | 3 | 2 | 60 | 2
1 | 3 | 3 | 60 | 2
2 | 4 | 1 | 15 | 2
2 | 4 | 2 | 15 | 2
2 | 4 | 3 | 15 | 2
2 | 5 | 1 | 10 | 3
2 | 5 | 2 | 10 | 3
2 | 5 | 3 | 10 | 3
2 | 6 | 1 | 80 | 1
2 | 6 | 2 | 80 | 1
2 | 6 | 3 | 80 | 1
One of the options I have tried is:
sort country groupId
by country (groupId): egen rank = rank(var1)
However, I cannot achieve the desired result.
Thanks for the data example. There are two problems with your code. One is that as you want to rank from highest to lowest, you need to negate the argument to rank(). The second is that given the repetitions, you need to rank on one time only and then copy those ranks to other times.
This works with your data example, here edited to be input code. (See also the Stata tag wiki for that principle.)
clear
input country groupId time var1 rank
1 1 1 50 3
1 1 2 50 3
1 1 3 50 3
1 2 1 90 1
1 2 2 90 1
1 2 3 90 1
1 3 1 60 2
1 3 2 60 2
1 3 3 60 2
2 4 1 15 2
2 4 2 15 2
2 4 3 15 2
2 5 1 10 3
2 5 2 10 3
2 5 3 10 3
2 6 1 80 1
2 6 2 80 1
2 6 3 80 1
end
bysort country : egen wanted = rank(-var1) if time == 1
bysort country groupId (time) : replace wanted = wanted[1]
assert rank == wanted
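If some country and group combination happened to lack an observation at time == 1, a variant of the same idea is to tag one observation per group with egen's tag() function (a sketch along the same lines; pick and wanted2 are just illustrative names):
egen pick = tag(country groupId)
bysort country : egen wanted2 = rank(-var1) if pick
bysort country groupId (wanted2) : replace wanted2 = wanted2[1]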
I have this database:
data temp;
input ID date type ;
datalines;
1 10/11/2006 1
1 10/12/2006 2
1 15/01/2007 2
1 20/01/2007 3
2 10/08/2008 1
2 11/09/2008 1
2 17/10/2008 1
2 12/11/2008 2
2 10/12/2008 3
;
I would like to create a new column containing, for each ID, the first date on which the variable type changes from 1 to 2, as follows:
data temp;
input ID date type date_change_type1to2;
datalines;
1 10/11/2006 1 .
1 10/12/2006 2 10/12/2006
1 15/01/2007 2 .
1 20/01/2007 3 .
2 10/08/2008 1 .
2 11/09/2008 1 .
2 17/10/2008 1 .
2 12/11/2008 2 12/11/2008
2 10/12/2008 3 .
;
I have tried this code but it doesn't work:
data temp;
set temp;
if first.type= 2 then date_change_type1to2=date;
by ID;
run;
Thank you in advance for your help!
Solution (the input data must be sorted by ID!):
data temp;
input ID date $10. type ;
datalines;
1 10/11/2006 1
1 10/12/2006 2
1 15/01/2007 2
1 20/01/2007 2
2 10/08/2008 1
2 11/09/2008 1
2 17/10/2008 1
2 12/11/2008 2
2 10/12/2008 2
;
run;
data temp(drop=type_store);
set temp;
by ID;
retain type_store;
if first.id then type_store = type;
if type ne type_store and type = 2 then do;
date_change_type1to2=date;
type_store = type;
end;
run;
Output:
+----+------------+------+----------------------+
| ID | date | type | date_change_type1to2 |
+----+------------+------+----------------------+
| 1 | 10/11/2006 | 1 | |
+----+------------+------+----------------------+
| 1 | 10/12/2006 | 2 | 10/12/2006 |
+----+------------+------+----------------------+
| 1 | 15/01/2007 | 2 | |
+----+------------+------+----------------------+
| 1 | 20/01/2007 | 2 | |
+----+------------+------+----------------------+
| 2 | 10/08/2008 | 1 | |
+----+------------+------+----------------------+
| 2 | 11/09/2008 | 1 | |
+----+------------+------+----------------------+
| 2 | 17/10/2008 | 1 | |
+----+------------+------+----------------------+
| 2 | 12/11/2008 | 2 | 12/11/2008 |
+----+------------+------+----------------------+
| 2 | 10/12/2008 | 2 | |
+----+------------+------+----------------------+
The variable first.type will not be created unless you include type in a BY statement. And even if it did exist, its value could never be 2; it is either 1 (true) or 0 (false).
If you just want to set the new variable once and keep its value for the rest of the observations for that ID, then you can RETAIN it. Make sure to clear it when starting a new ID value.
data temp;
set temp;
by ID;
if first.id then date_change_type1to2=.;
retain date_change_type1to2 ;
if type=2 and missing(date_change_type1to2) then date_change_type1to2=date;
run;
I am using Stata 13 to stack several variables into one variable using
stack stand1-stand10, into(all)
However, I need to do it for each unique id, keeping the id alongside all, something like:
bysort familyid: stack stand1-stand10,into(all) keep familyid
We can use a simpler analogue of your data example.
clear
set obs 3
gen familyid = _n
forval j = 1/3 {
gen stand`j' = _n * `j'
}
list
+-------------------------------------+
| familyid stand1 stand2 stand3 |
|-------------------------------------|
1. | 1 1 2 3 |
2. | 2 2 4 6 |
3. | 3 3 6 9 |
+-------------------------------------+
save original
To stack with an identifier, just repeat the identifier variable name. For more than a few variables, it's easiest to prepare a call using a loop.
forval j = 1/3 {
local call `call' familyid stand`j'
}
di "`call'"
familyid stand1 familyid stand2 familyid stand3
stack `call', into(familyid stand)
sort familyid _stack
list, sepby(familyid)
+---------------------------+
| _stack familyid stand |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 2 |
3. | 3 1 3 |
|---------------------------|
4. | 1 2 2 |
5. | 2 2 4 |
6. | 3 2 6 |
|---------------------------|
7. | 1 3 3 |
8. | 2 3 6 |
9. | 3 3 9 |
+---------------------------+
That said, it's easier to use reshape long.
use original, clear
reshape long stand, i(familyid) j(which)
list, sepby(familyid)
+--------------------------+
| familyid which stand |
|--------------------------|
1. | 1 1 1 |
2. | 1 2 2 |
3. | 1 3 3 |
|--------------------------|
4. | 2 1 2 |
5. | 2 2 4 |
6. | 2 3 6 |
|--------------------------|
7. | 3 1 3 |
8. | 3 2 6 |
9. | 3 3 9 |
+--------------------------+
Suppose I have the following dataset:
clear
input SubjectID DecisionID AltID my_alpha
1 1 1 0.4
1 1 2 0.4
1 2 1 0.6
1 2 2 0.6
2 1 1 0.8
2 1 2 0.8
2 2 1 0.5
2 2 2 0.5
end
I want to create a new variable for each value of AltID that depends on the value of my_alpha. In this scenario, there would now be alpha_AltID_1 and alpha_AltID_2. alpha_AltID_1 would be equal to my_alpha when AltID is 1, and equal to 0 otherwise. Similarly, alpha_AltID_2 would be equal to my_alpha when AltID is equal to 2, and equal to 0 otherwise. That is, it should look like this:
| SubjectID | DecisionID | AltID | my_alpha | alpha_AltID_1 | alpha_AltID_2 |
|-----------|------------|-------|----------|---------------|---------------|
|     1     |     1      |   1   |   0.4    |      0.4      |       0       |
|     1     |     1      |   2   |   0.4    |      0        |      0.4      |
|     1     |     2      |   1   |   0.6    |      0.6      |       0       |
|     1     |     2      |   2   |   0.6    |      0        |      0.6      |
|     2     |     1      |   1   |   0.8    |      0.8      |       0       |
|     2     |     1      |   2   |   0.8    |      0        |      0.8      |
|     2     |     2      |   1   |   0.5    |      0.5      |       0       |
|     2     |     2      |   2   |   0.5    |      0        |      0.5      |
The problem is that in my actual data, I have well over a million observations, 5151 values for AltID, and need to create variables for both my_alpha and my_beta (at a minimum). I need a way to do this "quickly".
I tried using a foreach loop to create the variables, but I had to cut it off after it had been running for 20 hours (on my desktop that has 24 GB of RAM). I was able to use the command quietly tab AltID, gen(alpha_AltID_) to get 0's in the proper places and 1's elsewhere, which took only a few seconds, but I then need a loop that replaces all the 1's with proper values, which seems to be taking roughly two hours (at the current pace). Does anyone have a more time-efficient solution?
Here are two ways to do this with your example.
clear
input SubjectID DecisionID AltID my_alpha
1 1 1 0.4
1 1 2 0.4
1 2 1 0.6
1 2 2 0.6
2 1 1 0.8
2 1 2 0.8
2 2 1 0.5
2 2 2 0.5
end
gen alpha_AltID_1 = cond(AltID == 1, my_alpha, 0)
gen alpha_AltID_2 = cond(AltID == 2, my_alpha, 0)
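// separate creates my_alpha1 and my_alpha2, with missing values where AltID differs; mvencode recodes those missings to 0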
separate my_alpha, by(AltID)
mvencode my_alpha?, mv(0)
list AltID *alpha*, sep(0)
+--------------------------------------------------------------+
| AltID my_alpha alpha_~1 alpha_~2 my_alp~1 my_alp~2 |
|--------------------------------------------------------------|
1. | 1 .4 .4 0 .4 0 |
2. | 2 .4 0 .4 0 .4 |
3. | 1 .6 .6 0 .6 0 |
4. | 2 .6 0 .6 0 .6 |
5. | 1 .8 .8 0 .8 0 |
6. | 2 .8 0 .8 0 .8 |
7. | 1 .5 .5 0 .5 0 |
8. | 2 .5 0 .5 0 .5 |
+--------------------------------------------------------------+
So, what about your real case? The separate/mvencode method should work. So should this if your ids go from 1 to 5151:
forval j = 1/5151 {
gen alpha_AltID_`j' = cond(AltID == `j', my_alpha, 0)
}
If your ids are not so well behaved, then tell us how so.
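If, for example, the values were arbitrary integers rather than 1 to 5151, one sketch along the same lines loops over whatever distinct values levelsof reports:
levelsof AltID, local(levels)
foreach j of local levels {
    gen alpha_AltID_`j' = cond(AltID == `j', my_alpha, 0)
}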
It's difficult to believe that any loop code would take as long as you report, but since you don't show us the code, detailed comment is difficult.
All that said, why do you need one variable to be mapped to thousands?
I am trying to create a table of summary statistics (mean, sd) for a DV when there are three dichotomous IVs. Using the command tab IV1 IV2, sum(DV) I can create a summary statistics table for only two IVs, but not for three. However, I need the summary stats for the three IVs and their interactions. Is there any way around this? An alternative command? Thanks!
You can make an interaction variable like this:
webuse nlswork
egen interaction = group(race nev_mar union), label
tab interaction, sum(ln_wage)
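tabstat gives much the same summary in a more compact layout:
tabstat ln_wage, by(interaction) statistics(n mean sd)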
Here I use the same well-chosen sandbox as Dimitriy.
webuse nlswork, clear
quietly statsby n=r(N) mean=r(mean) sd=r(sd), by(race nev_mar union) subsets clear: summarize ln_wage
egen nvars = rownonmiss(race nev_mar union )
sort nvars race nev_mar union
format mean sd %4.3f
l race-sd, sepby(nvars) noobs
+-------------------------------------------------+
| race nev_mar union n mean sd |
|-------------------------------------------------|
| . . . 28534 1.675 0.478 |
|-------------------------------------------------|
| white . . 13590 1.796 0.464 |
| black . . 5426 1.647 0.458 |
| other . . 211 1.890 0.510 |
| . 0 . 15509 1.758 0.466 |
| . 1 . 3718 1.740 0.477 |
| . . 0 14720 1.702 0.466 |
| . . 1 4507 1.927 0.432 |
|-------------------------------------------------|
| white 0 . 11399 1.794 0.462 |
| white 1 . 2191 1.808 0.474 |
| white . 0 10774 1.753 0.465 |
| white . 1 2816 1.961 0.422 |
| black 0 . 3955 1.651 0.455 |
| black 1 . 1471 1.634 0.467 |
| black . 0 3779 1.551 0.432 |
| black . 1 1647 1.867 0.440 |
| other 0 . 155 1.893 0.553 |
| other 1 . 56 1.881 0.369 |
| other . 0 167 1.865 0.510 |
| other . 1 44 1.983 0.507 |
| . 0 0 11936 1.707 0.464 |
| . 0 1 3573 1.930 0.429 |
| . 1 0 2784 1.682 0.474 |
| . 1 1 934 1.914 0.444 |
|-------------------------------------------------|
| white 0 0 9071 1.751 0.462 |
| white 0 1 2328 1.961 0.423 |
| white 1 0 1703 1.766 0.479 |
| white 1 1 488 1.958 0.420 |
| black 0 0 2745 1.556 0.433 |
| black 0 1 1210 1.867 0.429 |
| black 1 0 1034 1.536 0.430 |
| black 1 1 437 1.866 0.469 |
| other 0 0 120 1.856 0.550 |
| other 0 1 35 2.020 0.553 |
| other 1 0 47 1.888 0.391 |
| other 1 1 9 1.842 0.235 |
+-------------------------------------------------+
So you get compactly all the three-way combinations, all the two-way, all the one-way and the overall summary. Moreover, this summary set is now the dataset in memory, so you can manipulate it further, export it, and so forth.
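For example, to keep this summary as a file (the filenames here are only illustrative):
save threeway_summary, replace
export delimited using threeway_summary.csv, replace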