I would like to retain the variable labels after reshaping the dataset from long to wide. I have a problem in inputting the dataset (I keyed them in Excel then imported).
clear
set obs 7
input id a1 a2 a3
"s001" "John" 23 "Primary"
"s002" "Mary" 32 "Secondary"
"s002" "Anna" 23 "Tertiary"
"s003" "Joseph" 34 "Secondary"
"s003" "Oganyo" 23 "Primary"
"s004" "Manyoya" 34 "Tertiary"
"s005" "Makbuti" 45 "Primary"
end
*======= Label the variables
label var a1 "partners name"
label var a2 "partners age"
label var a3 "partners education"
foreach variable of varlist a*{
local varlabel : variable label `variable'
di "`varlabel'"
bys id: gen index = _n
renvars a* , postfix(_)
reshape wide a*, i(id) j(index)
label var `variable'* "`varlabel'"
}
The code here is very confused and just won't work well without major surgery.
At the outset your input statement fails to declare string variables when needed.
The set obs 7 is compatible with an end to the input.
A big problem is that you are looping over a bunch of commands including reshape, but there is just one reshape to carry out.
Your code for saving variable labels needs to save then separately.
renvars is from the Stata Journal (2005), and doesn't seem needed here any way. As from Stata 12 (2011!) it should still work but is essentially superseded by extensions to rename.
Here's my best guess at the example you should have given and the code you need.
clear
input str4 id str7 a1 a2 str9 a3
"s001" "John" 23 "Primary"
"s002" "Mary" 32 "Secondary"
"s002" "Anna" 23 "Tertiary"
"s003" "Joseph" 34 "Secondary"
"s003" "Oganyo" 23 "Primary"
"s004" "Manyoya" 34 "Tertiary"
"s005" "Makbuti" 45 "Primary"
end
label var a1 "partner's name"
label var a2 "partner's age"
label var a3 "partner's education"
local j = 0
foreach variable of varlist a* {
local ++j
local varlabel`j' : variable label `variable'
di "`varlabel`j''"
}
bys id: gen index = _n
reshape wide a*, i(id) j(index)
forval j = 1/3 {
foreach v of var a`j'* {
label var `v' "`varlabel`j''"
}
}
list
describe
Here's the output of the last two commands.
. list
+------------------------------------------------------------+
| id a11 a21 a31 a12 a22 a32 |
|------------------------------------------------------------|
1. | s001 John 23 Primary . |
2. | s002 Mary 32 Secondary Anna 23 Tertiary |
3. | s003 Joseph 34 Secondary Oganyo 23 Primary |
4. | s004 Manyoya 34 Tertiary . |
5. | s005 Makbuti 45 Primary . |
+------------------------------------------------------------+
.
. describe
Contains data
Observations: 5
Variables: 7
------------------------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------------------------------
id str4 %9s
a11 str7 %9s partner's name
a21 float %9.0g partner's age
a31 str9 %9s partner's education
a12 str7 %9s partner's name
a22 float %9.0g partner's age
a32 str9 %9s partner's education
------------------------------------------------------------------------------------------------------------------
Sorted by: id
Note: reshape wide makes most things more difficult in Stata.
Note: Please use dataex to create reproducible data examples in Stata.
Related
I am working on modyfinyg, renaming and arranging estimation results for final publication in Stata. What I have so far is:
sysuse auto, clear
* interaction model
gen int_mpg_mpg = mpg*mpg // generate interaction manually
qui regress price weight mpg int_mpg_mpg foreign
mat b=e(b) // store estimation results in matrix b
matrix list b // see the problem, colum 3 "int_mpg_mpg"
b[1,5]
weight mpg int_mpg_mpg foreign _cons
y1 3.0379251 -298.14637 5.8617627 3420.2156 -527.04589
* rename interaction with additional package erepost
local coln "b:weight b:mpg b:c.mpg#c.mpg b:foreign b:_cons"
mat colnames b= `coln'
capt prog drop replace_b
program replace_b, eclass
erepost b= b, rename
end
replace_b
eststo my_model
esttab my_model, label
This gives a nice version of the interaction term:
------------------------------------
(1)
Price
------------------------------------
b
Weight (lbs.) 3.038***
(3.84)
Mileage (mpg) -298.1
(-0.82)
Mileage (mpg) # Mi~) 5.862
(0.90)
Car type 3420.2***
(4.62)
Constant -527.0
(-0.08)
------------------------------------
Observations 74
------------------------------------
But imagine that I have many more models and variables. Then it becomes very unhandy to list all variables, all interactions etc. in this local coln
local coln "b:weight b:mpg b:c.mpg#c.mpg b:foreign b:_cons"
mat colnames b= `coln'
I am trying to only fix particular column names in the estimation matrix:
local colfix "b:c.mpg#c.mpg"
mat colnames b[1,3]= `colfix'
How can I access a particular column of a matrix in Stata?
Your intended new name seems the same as the existing name.
I don't know a way to change individual column names, but this may help:
. matrix foo = J(3, 4, 42)
. matrix colname foo = frog toad DRAGON griffin
. mat li foo
foo[3,4]
frog toad DRAGON griffin
r1 42 42 42 42
r2 42 42 42 42
r3 42 42 42 42
. local colnames : colnames foo
. local colnames : subinstr local colnames "DRAGON" "newt", all
. matrix colname foo = `colnames'
. matrix li foo
foo[3,4]
frog toad newt griffin
r1 42 42 42 42
r2 42 42 42 42
r3 42 42 42 42
It's a modest exercise to make that into a subroutine. The bigger deal is whether there is a way to do this in the estout suite, wonderful commands I never use.
for simplification, let's assume following script to create a simple regression table:
sysuse auto
eststo clear
qui regress price weight mpg
esttab using "table.rtf", cells(t) mtitles onecell nogap ///
stats(N, labels("Observations")) label ///
compress replace
eststo clear
Output:
(1)
.
t
Weight (lbs.) 2.723238
Mileage (mpg) -.5746808
Constant .541018
Observations 74
Question:
Would it be possible to mark every t-value above 0.5 or below 0.5 with an asterisk? (= greater than absolute value 0.5)
Please note: In the specific application case, I can't work with given p-values, and need a custom solution that works with thresholds of t.
Desired outcome:
(1)
.
t
Weight (lbs.) 2.723238*
Mileage (mpg) -.5746808*
Constant .541018*
Observations 74
Crossposting can be found here:
Thank you for your help!
You cannot do this directly with estout but the following works all the same:
sysuse auto, clear
regress price weight mpg
quietly esttab, mtitles onecell nogap stats(N, labels("Observations")) label ///
compress replace star staraux
matrix A = r(coefs)
matrix A = A[1...,2]
svmat A
generate A2 = "*" if abs(A1) >= 0.5
generate A4 = string(A1) + A2
local names : rownames A
generate A3 = ""
forvalues i = 1 / `: word count `names'' {
replace A3 = `"`: word `i' of `names''"' in `i'
}
list A3 A4 if !missing(A3)
+---------------------+
| A3 A4 |
|---------------------|
1. | weight 2.723238* |
2. | mpg -.5746808* |
3. | _cons .541018* |
+---------------------+
preserve
keep if !missing(A3)
export delimited A3 A4 using table.txt, delimiter(" ") novarnames
restore
You will have to do some more gymnastics to get the variable labels etc.
The questionnaire I have data from asked respondents to rank 20 items on a scale of importance to them. The lower end of the scale contained a "bin" in which respondents could throw away any of the 20 items that they found completely unimportant to them. The result is a dataset with 20 variables (1 for every item). Every variable receives a number between 1 and 100 (and 0 if the item was thrown in the bin)
I would like to recode the entries into a ranking of the variables for every respondent. So all variables would receive a number between 1 and 20 relative to where that respondent ranked it.
Example:
Current:
item1 item2 item3 item4 item5 item6 item7 item8 etc.
respondent1 67 44 29 7 0 99 35 22
respondent2 0 42 69 50 12 0 67 100
etc.
What I want:
item1 item2 item3 item4 item5 item6 item7 item8 etc.
respondent1 7 6 4 2 1 8 5 3
respondent2 1 4 7 5 3 1 6 8
etc.
As you can see with respondent2, I would like items that received the same value, to get the same rank and the ranking to then skip a number.
I have found a lot of info on how to rank observations but I have not found out how to rank variables yet. Is there anyone that knows how to do this?
Here is one solution using reshape:
/* Create sample data */
clear *
set obs 2
gen respondant = "respondant1"
replace respondant = "respondant2" in 2
set seed 123456789
forvalues i = 1/10 {
gen item`i' = ceil(runiform()*100)
}
replace item2 = item1 if respondant == "respondant2"
list
+----------------------------------------------------------------------------------------------+
| respondant item1 item2 item3 item4 item5 item6 item7 item8 item9 item10 |
|----------------------------------------------------------------------------------------------|
1. | respondant1 14 56 69 62 56 26 43 53 22 27 |
2. | respondant2 65 65 11 7 88 5 90 85 57 95 |
+----------------------------------------------------------------------------------------------+
/* reshape long first */
reshape long item, i(respondant) j(itemNum)
/* Rank observations, accounting for ties */
by respondant (item), sort : gen rank = _n
replace rank = rank[_n-1] if item[_n] == item[_n-1] & _n > 1
/* reshape back to wide format */
drop item // optional, you can keep and just include in reshape wide
reshape wide rank, i(respondant) j(itemNum)
I have a dataset in Stata of the following form
id | year
a | 1950
b | 1950
c | 1950
d | 1950
.
.
.
y | 1950
-----
a | 1951
b | 1951
c | 1951
d | 1951
.
.
.
y | 1951
-----
...
I'm looking for a quick way to rewrite the following code
gen dummya=1 if id=="a"
gen dummyb=1 if id=="b"
gen dummyc=1 if id=="c"
...
gen dummyy=1 if id=="y"
and
gen dummy50=1 if year==1950
gen dummy51=1 if year==1951
...
Note that all your dummies would be created as 1 or missing. It is almost always more useful to create them directly as 1 or 0. Indeed, that is the usual definition of dummies.
In general, it's a loop over the possibilities using forvalues or foreach, but the shortcut is too easy not to be preferred in this case. Consider this reproducible example:
. sysuse auto, clear
(1978 Automobile Data)
. tab rep78, gen(rep78)
Repair |
Record 1978 | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 2.90 2.90
2 | 8 11.59 14.49
3 | 30 43.48 57.97
4 | 18 26.09 84.06
5 | 11 15.94 100.00
------------+-----------------------------------
Total | 69 100.00
. d rep78?
storage display value
variable name type format label variable label
------------------------------------------------------------------------------
rep781 byte %8.0g rep78== 1.0000
rep782 byte %8.0g rep78== 2.0000
rep783 byte %8.0g rep78== 3.0000
rep784 byte %8.0g rep78== 4.0000
rep785 byte %8.0g rep78== 5.0000
That's all the dummies (some prefer to say "indicators") in one fell swoop through an option of tabulate.
For completeness, consider an example doing it the loop way. We imagine that years 1950-2015 are represented:
forval y = 1950/2015 {
gen byte dummy`y' = year == `y'
}
Two digit identifiers dummy50 to dummy15 would be unambiguous in this example, so here they are as a bonus:
forval y = 1950/2015 {
local Y : di %02.0f mod(`y', 100)
gen byte dummy`y' = year == `y'
}
Here byte is dispensable unless memory is very short, but it's good practice any way.
If anyone was determined to write a loop to create indicators for the distinct values of a string variable, that can be done too. Here are two possibilities. Absent an easily reproducible example in the original post, let's create a sandbox. The first method is to encode first, then loop over distinct numeric values. The second method is find the distinct string values directly and then loop over them.
clear
set obs 3
gen mystring = word("frog toad newt", _n)
* Method 1
encode mystring, gen(mynumber)
su mynumber, meanonly
forval j = 1/`r(max)' {
gen dummy`j' = mynumber == `j'
label var dummy`j' "mystring == `: label (mynumber) `j''"
}
* Method 2
levelsof mystring
local j = 1
foreach level in `r(levels)' {
gen dummy2`j' = mystring == `"`level'"'
label var dummy2`j' `"mystring == `level'"'
local ++j
}
describe
Contains data
obs: 3
vars: 8
size: 96
------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------
mystring str4 %9s
mynumber long %8.0g mynumber
dummy1 float %9.0g mystring == frog
dummy2 float %9.0g mystring == newt
dummy3 float %9.0g mystring == toad
dummy21 float %9.0g mystring == frog
dummy22 float %9.0g mystring == newt
dummy23 float %9.0g mystring == toad
------------------------------------------------------------------------------
Sorted by:
Use i.<dummy_variable_name>
For example, in your case, you can use following command for regression:
reg y i.year
I also recommend using
egen year_dum = group(year)
reg y i.year_dum
This can be generalized arbitrarily, and you can easily create, e.g., year-by-state fixed effects this way:
egen year_state_dum = group(year state)
reg y i.year_state_dum
I have a dataset that can be simplified in the following format:
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
In the dataset, there is Date, ID, VarA, and VarB. Each ID represents a unique set of transactions. I want to collapse (sum) VarA VarB, by(Date) in Stata. However, I want to keep the date of the first observation for each ID number.
Essentially, I want the above dataset to become the following:
+--------------------------------+
| Date ID Var1 Var2 |
|--------------------------------|
| 12jan2010 5 21 42 |
| 12jan2010 6 41 17 |
| 15jan2010 10 7 68 |
+--------------------------------+
12jan2010 17jan2010 and 19jan2010 have the same ID, so I want to collapse (sum) Var1 Var2 for these three observations. I want to keep the date 12jan2010 because it is the date for the first observation. The other two observations are dropped.
I know it might be possible to collapse by ID first and then merge with the original dataset and then subset. I was wondering if there is an easier way to make this work. Thanks!
collapse allows you to calculate a variety of statistics, so you can convert your string date into a numerical date, then take the minimum of the numerical date to get the first occurrence.
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2010" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY")
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+
In response to the comment: You can generate the formatted date for only observations where VarA is > 0 (and not missing). (Assuming that, per your comment, VarA & VarB always have the same sign.)
// now assume ID 6 has an earliest date of 17jan2005 (obs.4)
// but you want to return your 'first date' as the
// first date where varA & varB are both positive
clear
input str9 Date ID VarA VarB
"12jan2010" 5 21 42
"12jan2010" 6 47 21
"15jan2010" 10 7 68
"17jan2005" 6 -5 -3
"19jan2010" 6 -1 -1
end
gen Date2 = date(Date, "DMY") if VarA > 0 & !missing(VarA)
format Date2 %td
collapse (sum) VarA VarB (min) Date2 , by(ID)
order Date2, first
li
yielding
+------------------------------+
| Date2 ID VarA VarB |
|------------------------------|
1. | 12jan2010 5 21 42 |
2. | 12jan2010 6 41 17 |
3. | 15jan2010 10 7 68 |
+------------------------------+