I have a dataset like this:
year CNMubiBeijing CNMubiTianjing CNMubiShanghai ··· ··· Wulumuqi
1998 . . . .
1999 . . . .
····
2013 . . . .
As you can see, the first row is a list of Chinese city names, such as Beijing and Shanghai, each combined with the prefix "CNMubi" (which is redundant). The first column is the year, and the observations are values of another variable (such as local government tax revenue). The data are in "wide" form, and I want to convert them to a long panel dataset like
city year tax_rev
Beijing 1998
···
Beijing 2013
Shanghai 1998
···
Shanghai 2013
Two immediate solutions come to mind. One is to use the reshape command directly, like reshape long CNMubi, i(year) j(city_eng), but it turns out to give me a column of missing values (the city_eng column).
The second possible solution is to use a loop, like
foreach var of varlist _all {
replace city_eng="`var'"
}
It doesn't work either (in fact, the newly generated city_eng equals the name of the last variable in the varlist). I need to "expand" the data from a mn to a mnm matrix. How can I achieve my goal? Thank you.
This works:
clear
set more off
*----- example data -----
input ///
year CNMubiBeijing CNMubiTianjing
1998 . .
1999 . .
2000
2001
2002
2003
end
set seed 259376
replace CNMubiBeijing = runiform()
replace CNMubiTianjing = runiform()
*----- what you want -----
reshape long CNMubi, i(year) j(city) string
sort city year
list, sepby(city)
Notice the string option, since j() contains string values.
The result is:
. sort city year
. list, sepby(city)
+----------------------------+
| year city CNMubi |
|----------------------------|
1. | 1998 Beijing .658855 |
2. | 1999 Beijing .494634 |
|----------------------------|
3. | 1998 Tianjing .0204465 |
4. | 1999 Tianjing .0454614 |
+----------------------------+
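The question asks for columns named city, year and tax_rev. A minimal sketch of that last step, assuming made-up revenue figures (10, 11, 20, 21) purely for illustration: once the data are long, the stub variable CNMubi can be renamed and the columns reordered.

```stata
clear
* Illustrative values only; they are not from the original data
input year CNMubiBeijing CNMubiTianjing
1998 10 20
1999 11 21
end

* j(city) receives the part of each variable name after the stub CNMubi
reshape long CNMubi, i(year) j(city) string

* Give the value column the name asked for in the question
rename CNMubi tax_rev
order city year tax_rev
sort city year
list, sepby(city)
```

The prefix never needs to be stripped by hand: reshape treats CNMubi as the stub and stores the remainder of each name in city.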
I'm trying to scale one variable by the lag of another:
income (IB) scaled by lagged total assets (AT) = ROA
I've tried the two methods below.
xtset companyid fyear, year
gen at1 = l.at
gen roa=ib/at1
and
xtset gvkey year
gen roa=(ib)/(at[_n-1])
The first one came back with all zeros for 1.ta
The second one seems to generate values on the previous entry, even if it's a different company. I think this is true because only the first row has a missing value. I would assume there should be a missing value for the first year of each company.
Additionally I've tried this code below but it said invalid syntax.
xtset gvkey year
foreach gvkey {
gen roa = (ib)/(at[_n-1]) }
I'm using Compustat, so the data look similar to this:
gvkey|Year |Ticker | at | ib |
-------|-----|--------|------|------|
001111| 2006| abc |1000 |50 |
001111| 2007| abc |1100 |60 |
001111| 2008| abc |1200 |70 |
001111| 2009| abc |1300 |80 |
001112| 2008| www |28777 |1300 |
001112| 2009| www |26123 |870 |
001113| 2009| ttt |550 |-1000 |
001114| 2010| vvv |551 |-990 |
This is hard to follow. 1.ta may, or may not, be a typo for L.at.
Is gvkey string? At the Stata tag, there is really detailed advice about how to give Stata data examples, which you are not following.
In principle, your first approach is correct, so it is hard to know what went wrong, except that
"The second one seems to generate values on the previous entry, even if it's a different company."
That's exactly correct. The previous observation is the previous observation, and nothing in that command refers or alludes to the panel structure or xtset or tsset information.
Your foreach statement is just wild guessing and nothing to do with any form supported by foreach. foreach isn't needed here at all: the lag operator implies working within panels automatically.
I did this, which may help.
clear
input str6 gvkey Year str3 Ticker at ib
001111 2006 abc 1000 50
001111 2007 abc 1100 60
001111 2008 abc 1200 70
001111 2009 abc 1300 80
001112 2008 www 28777 1300
001112 2009 www 26123 870
001113 2009 ttt 550 -1000
001114 2010 vvv 551 -990
end
egen id = group(gvkey), label
xtset id Year
gen wanted = at/L.ib
list, sepby(gvkey)
+------------------------------------------------------------+
| gvkey Year Ticker at ib id wanted |
|------------------------------------------------------------|
1. | 001111 2006 abc 1000 50 001111 . |
2. | 001111 2007 abc 1100 60 001111 22 |
3. | 001111 2008 abc 1200 70 001111 20 |
4. | 001111 2009 abc 1300 80 001111 18.57143 |
|------------------------------------------------------------|
5. | 001112 2008 www 28777 1300 001112 . |
6. | 001112 2009 www 26123 870 001112 20.09462 |
|------------------------------------------------------------|
7. | 001113 2009 ttt 550 -1000 001113 . |
|------------------------------------------------------------|
8. | 001114 2010 vvv 551 -990 001114 . |
+------------------------------------------------------------+
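The listing above divides at by lagged ib. The question defines ROA the other way round, as ib over lagged at, so a sketch of that version may be useful. It also shows the subscript form done within panels; the names roa and roa_by are just illustrative.

```stata
clear
input str6 gvkey Year str3 Ticker at ib
001111 2006 abc 1000 50
001111 2007 abc 1100 60
001111 2008 abc 1200 70
001111 2009 abc 1300 80
001112 2008 www 28777 1300
001112 2009 www 26123 870
001113 2009 ttt 550 -1000
001114 2010 vvv 551 -990
end

* xtset needs a numeric panel identifier, hence group()
egen id = group(gvkey)
xtset id Year

* ROA as defined in the question: ib over lagged total assets
gen roa = ib / L.at

* Subscript version: by: restarts _n within each firm,
* so [_n-1] never reaches back into another company
bysort id (Year): gen roa_by = ib / at[_n-1]

list gvkey Year at ib roa roa_by, sepby(gvkey)
```

The two variables agree here because no firm has a gap in its years. With gaps they would differ: L.at is missing when the previous year is absent, whereas at[_n-1] simply takes the previous observation for that firm.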
I have a dataset with a month variable and a year variable. I want to destring the month variable so that I can ultimately create one date variable, but the months are all stored as names (January, February, etc.). How do I destring the variable?
You don't. That's a job for date functions. All are documented, e.g. via help datetime.
destring is for numbers that happen to be read as string variables so that typical entries might be "42" and "666". Import as string usually arises when the variable includes metadata (e.g. header lines), or non-Stata flags for missings (e.g. "NA"), or some other non-numeric characters, often in as few as one observation. Import from MS Excel is a common cause, as spreadsheet users tend to be loose on sprinkling text in numeric data columns.
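For contrast, here is a small sketch of the kind of variable destring is meant for. The variable name revenue and its values are hypothetical; the force option turns entries that cannot be read as numbers into missing values.

```stata
clear
* Hypothetical string variable: numbers plus a non-Stata missing flag
input str4 revenue
"42"
"666"
"NA"
end

* force converts nonnumeric entries such as "NA" to missing
destring revenue, generate(revenue_num) force
list
```

Using force silently discards whatever was in the nonnumeric entries, so it is worth inspecting them first, for example with tabulate.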
A variable with values such as "January" doesn't qualify. It's in your mind that month names map on to month numbers, but destring doesn't share that knowledge.
Date functions have this job:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 month float year
"January" 2017
"February" 1942
end
gen mdate = monthly(month + string(year), "MY")
list
+-------------------------+
| month year mdate |
|-------------------------|
1. | January 2017 684 |
2. | February 1942 -215 |
+-------------------------+
format mdate %tm
list
+--------------------------+
| month year mdate |
|--------------------------|
1. | January 2017 2017m1 |
2. | February 1942 1942m2 |
+--------------------------+
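If a numeric month (1 to 12) or a daily date is wanted as well, both can be derived from the monthly date. A sketch using the same example data; the names ddate and monthnum are illustrative.

```stata
clear
input str8 month float year
"January" 2017
"February" 1942
end

gen mdate = monthly(month + string(year), "MY")
format mdate %tm

* dofm() returns the daily date of the first day of that month
gen ddate = dofm(mdate)
format ddate %td

* month() expects a daily date and returns 1 to 12
gen monthnum = month(ddate)
list
```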
(Declaration of interest: original author of destring.)
See also this thread.
Based on the image, I would like to loop through the columns to find where the text "mo" appears, and then fill a variable mo with the corresponding result rather than the text "mo" itself. The challenge has been how to select the result, which sits in the column next to the one containing "mo".
Your answer to my comment above suggests to me that the question you ask reflects the wrong approach to the larger problem. Your description suggests that you have observations with a varying number of testname/testvalue pairs, such as
+----------------------------------------+
| id day test1 val1 test2 val2 |
|----------------------------------------|
| A 1 mo 11 . |
| A 2 mo 12 df 98.2 |
|----------------------------------------|
| B 1 df 98.3 mo 23 |
| B 2 mo 14 . |
+----------------------------------------+
and your objective is to produce observations that look like this
+----------------------+
| id day df mo |
|----------------------|
| A 1 . 11 |
| A 2 98.2 12 |
|----------------------|
| B 1 98.3 23 |
| B 2 . 14 |
+----------------------+
If that is the case, here is a reproducible example that you can copy, paste into Stata's Do-file Editor window, and execute. Examine the output to see how the technique avoids all the complexity you introduce by trying to use loops to accomplish the task. The reshape command is one of Stata's most powerful data management tools and it will benefit you to learn how to use it.
clear
input str8 id int day str8 test1 float val1 str8 test2 float val2
A 1 "mo" 11 "" .
A 2 "mo" 12 "df" 98.2
B 1 "df" 98.3 "mo" 23
B 2 "mo" 14 "" .
end
list, sepby(id) noobs
reshape long test val, i(id day) j(num)
drop if missing(test)
drop num
list, sepby(id) noobs
reshape wide val, i(id day) j(test) str
rename val* *
list, sepby(id) noobs
I want to create new variables, by country group (iso_o/iso_d), that hold the values of the variable indepdate found within each group.
So far I have been typing:
gen include=1 if heg_o != 1
egen iso_o_indepdate1=min(indepdate * include), by(iso_o)
egen iso_o_indepdate2=max(indepdate * include), by(iso_o)
replace iso_o_indepdate2=. if iso_o_indepdate1==iso_o_indepdate2
drop include
*
gen include=1 if heg_d !=1
egen iso_d_indepdate1=min(indepdate * include), by(iso_d)
egen iso_d_indepdate2=max(indepdate * include), by(iso_d)
replace iso_d_indepdate2=. if iso_d_indepdate1==iso_d_indepdate2
drop include
The problem is that I can combine min() and max() to create two new variables for the values within indepdate, but if there are more than three I haven't been able to find a solution. Here is a small table.
iso_o group indepdate new1 new2 new3
FRA 1 1960 1960 1980 1999
FRA 1 1980 1960 1980 1999
FRA 1 1999 1960 1980 1999
FRA 1 . 1960 1980 1999
USA 2 1955 1955 . .
USA 2 . 1955 . .
USA 2 . 1955 . .
For this small example I could try to work with intervals; however, the dataset is very large, so I cannot tell for sure how many values fall in one interval.
Any hint on another approach for this?
You can reshape and then merge:
clear all
set more off
*----- example data ---
input ///
str3 iso_o group indepdate new1 new2 new3
FRA 1 1960 1960 1980 1999
FRA 1 1980 1960 1980 1999
FRA 1 1999 1960 1980 1999
FRA 1 . 1960 1980 1999
USA 2 1955 1955 . .
USA 2 . 1955 . .
USA 2 . 1955 . .
end
drop new*
list, sepby(group)
tempfile orig
save "`orig'"
*----- what you want -----
bysort group (indepdate) : gen j = _n
reshape wide indepdate, i(group) j(j)
keep group indepdate*
merge 1:m group using "`orig'", assert(match) nogenerate
// list
sort group indepdate
order iso_o group indepdate indepdate*
list, sepby(group)
See help dropmiss to drop variables that have only missing values.
But the bigger question is why do you want to do this?
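On dropping variables that are entirely missing: if dropmiss is not installed, a short loop does the same job. This is a sketch on a hypothetical reshaped result in which indepdate4 is missing for every group.

```stata
clear
* Hypothetical wide result: indepdate4 is missing everywhere
input str3 iso_o group indepdate1 indepdate2 indepdate3 indepdate4
FRA 1 1960 1980 1999 .
USA 2 1955 . . .
end

* Drop each indepdate variable that has no nonmissing values at all
foreach v of varlist indepdate* {
    quietly count if !missing(`v')
    if r(N) == 0 {
        drop `v'
    }
}
describe
```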
I am pretty new to Stata programming.
My question: I need to reorder/reshape a dataset through (I guess) a macro.
I have a dataset of individuals, with a variable birthyear (year of birth) and variables each containing weight in a given CALENDAR year, e.g.
birthyear | w_1990 | w_1991 | w_1992 | ... | w_2000
1989 | 7.2 | 9.3 | 10.2 | ... | 35.2
1981 | 33.2 | 35.3 | ...
I would like to obtain new variables containing weight at different ages, e.g. Weight_age_1, Weight_age_2, etc.: this means take for instance first obs of example, leave Weight_age_1 blank, put 7.2 in Weight_age_2, and so on.
I have tried something like...
forvalues i = 1/10{
capture drop weight_age_`i'
capture drop birth`i
gen birth_`i'=birthyear-1+`i'
tostring birth_`i', replace
gen weight_age_`i'= w_birth_`i'
}
.. but it doesn't work.
Can you please help me?
Experienced Stata users wouldn't try to write a self-contained program here: they would see that the heart of the problem is a reshape.
clear
input birthyear w_1990 w_1991 w_1992
1989 7.2 9.3 10.2
1981 33.2 35.3 37.6
end
gen id = _n
reshape long w_, i(id)
rename _j year
gen age = year - birthyear
l, sepby(id)
+-----------------------------------+
| id year birthy~r w_ age |
|-----------------------------------|
1. | 1 1990 1989 7.2 1 |
2. | 1 1991 1989 9.3 2 |
3. | 1 1992 1989 10.2 3 |
|-----------------------------------|
4. | 2 1990 1981 33.2 9 |
5. | 2 1991 1981 35.3 10 |
6. | 2 1992 1981 37.6 11 |
+-----------------------------------+
To get the variables you say you want, you could reshape wide, but this long structure is by far the more convenient way to store these data for future Stata work.
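For completeness, a sketch of that reshape wide step, using the same example data. The stub name weight_age_ is an assumption modelled on the names in the question, and age follows the definition used above (calendar year minus birth year).

```stata
clear
input birthyear w_1990 w_1991 w_1992
1989 7.2 9.3 10.2
1981 33.2 35.3 37.6
end

gen id = _n
reshape long w_, i(id) j(year)
gen age = year - birthyear

* age becomes the new j variable, so the calendar year is dropped
drop year
rename w_ weight_age_
reshape wide weight_age_, i(id) j(age)
list
```

Only ages that occur in the data get a variable, here weight_age_1 to weight_age_3 and weight_age_9 to weight_age_11, and each person is missing on the ages they were not observed at.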
P.S. The heart of your programming problem is that you are getting confused between the names of variables and their contents.
But this is a "look-up" approach made to work:
clear
input birthyear w_1990 w_1991 w_1992
1989 7.2 9.3 10.2
1981 33.2 35.3 37.6
end
quietly forval j = 1/10 {
gen weight_`j' = .
forval k = 1990/1992 {
replace weight_`j' = w_`k' if (`k' - birthyear) == `j'
}
}
The essential trick is to do name manipulation using local macros. In Stata, variables are mainly for holding data; single-valued constants are better held in local macros and scalars. (Your sense of the word "macro" as meaning script or program is not how the term is used in Stata.)
As above: this is the data structure you ask for, but it is likely to be more problematic than that produced by reshape long.