I'm running into some issues while trying to reshape a data set from long to wide. Here's an example, since I think that explains it best:
Say I wanted to take this long data set...
|study_id |event_date |code |
|---------|-----------|-----|
|1        |09 June 15 |546  |
|1        |09 June 15 |643  |
|2        |23 May 13  |324  |
|2        |12 May 13  |435  |
And shape it into a wide one like this...
|study_id |event_date_1 |event_date_1_code1 |event_date_1_code2 |event_date_2 |event_date_2_code1 |event_date_2_code2 |
|---------|-------------|-------------------|-------------------|-------------|-------------------|-------------------|
|1        |09 June 15   |546                |643                |             |                   |                   |
|2        |23 May 13    |324                |                   |12 May 13    |435                |                   |
What would be the best method of doing this? I imagine I would have to create some sort of j variable, but I am not certain how to make it so that each event_date can have multiple codes, and each study_id multiple event_dates.
I already tried making a j variable and reshaping, using the following code:
//Sort by id (just in case)
sort study_id event_date code
//Create j variable
quietly by study_id: gen code_num = cond(_N==1, 1, _n)
//Reshape data
reshape wide event_date code, i(study_id) j(code_num)
This, however, did not account for each event_date having multiple potential codes.
I am attempting to convert the data to wide so that I can merge it with another wide data set and then run analyses over both. An observation in either set is a unique study_id.
Let me start by saying that I would not ever choose to organize my data in the requested fashion, so this should not be taken as support for doing so.
Having said that, something like the following seems to do the trick. The data are similar to yours, but I'm too lazy to deal with full dates, so I just read in the day of the month. I'm posting this as a curiosity, because I've never before seen a need to run reshape wide twice in succession.
clear
input study_id date code
1 09 546
1 09 643
2 23 324
2 12 435
end
list
// number the codes within each study_id and date
bysort study_id date (code): generate codenum = _n
// first reshape: one row per study_id-date pair, with codes spread wide
reshape wide code, i(study_id date) j(codenum)
// add underscores so the next reshape can append its own suffix cleanly
rename code* code_*_
list
// number the dates within each study_id
bysort study_id (date): generate eventnum = _n
// second reshape: one row per study_id, with dates and codes spread wide
reshape wide date code_*, i(study_id) j(eventnum)
list
I am attempting to identify which observations are below 15 seconds for a set of timer variables. First, I generated the variable speed, and then attempted to replace speed with 1 for observations where the variables starting with timer are below 15.
gen speed = 0
replace speed = 1 if timer* <15
However, Stata is telling me,
timer* invalid name
r(198);
What might be going on? I'm not sure how to attach my data from Stata here, any insight into that would be appreciated too.
The Stata tag wiki has massively detailed information on how to post a data example.
What is going on is simply that Stata doesn't support your syntax. Indeed, it is not even clear what it might mean.
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen timer1 = 10
. gen timer2 = 20
. list
+-----------------+
| timer1 timer2 |
|-----------------|
1. | 10 20 |
+-----------------+
. gen wanted1 = min(timer1, timer2) < 15
. gen wanted2 = max(timer1, timer2) < 15
. l
+-------------------------------------+
| timer1 timer2 wanted1 wanted2 |
|-------------------------------------|
1. | 10 20 1 0 |
+-------------------------------------+
.
One guess is that you want an indicator which is 1 if any of the variables timer* is less than 15, in which case you need to compute the minimum over those variables in an observation and compare it with 15. Another guess is that you want an indicator which is 1 if all of the variables timer* are less than 15, in which case you first need to compute the maximum in an observation. For the simple example above, the functions min() and max() serve well. For a dataset with many more variables in timer*, you will find it more convenient to reach for the egen functions rowmin() and rowmax().
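For example, here is a minimal sketch of that egen route; the new variable names rowminimum, rowmaximum, wanted_any, and wanted_all are illustrative:
egen rowminimum = rowmin(timer*)
egen rowmaximum = rowmax(timer*)
// 1 if any timer variable is below 15
gen wanted_any = rowminimum < 15
// 1 if all timer variables are below 15
gen wanted_all = rowmaximum < 15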
There are other ways to do it, and other wilder guesses at what you want, but I will stop there.
I am using Stata and investigating the variable household net wealth (NetWealth).
I want to construct quintiles of this variable and use the following command. As you can see, I use survey data and thus apply survey weights:
xtile Quintile = NetWealth [pw=surveyweight], nq(5)
Then I give the following command to check what I have obtained:
tab Quintile, sum(NetWealth)
This is the result:
Means, Standard Deviations and Frequencies of DN3001 Net wealth

5 quantiles |
  of dn3001 |       Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |  1519.4221   43114.959         154
          2 |  135506.67   74360.816         179
          3 |  396712.16    69715.49         161
          4 |  669065.69   111102.02         182
          5 |  2552620.5   3872350.9         274
------------+------------------------------------
      Total |  957419.29   2323329.8         950
Why do I get a different number of households in each quintile? In particular in the last quintile?
The only explanation that I can come up with is that when Stata constructs quintiles with xtile, it excludes from the computation observations that have duplicate values of NetWealth. I got this impression also while consulting the Stata documentation.
What do you think?
Your problem is not fully reproducible insofar as you don't give a self-contained example, but in general there is no puzzle here.
Often people seeking such binnings have a small problem in that their number of observations is not an exact multiple of the number of quantile-based bins they want, but in your case that does not bite, as the calculation
. di 154 + 179 + 161 + 182 + 274
950
shows that you have 950 observations, which is 5 x 190.
The bigger deal -- here and almost always -- arises from Stata's rule that identical values in different observations must be assigned to the same bin. So, ties are likely to be the problem here.
You have perhaps three possible solutions. Only one involves direct coding.
1. Live with it.
2. Do something else. For example, why are you doing this anyway? Why not use the original data?
3. Try a different boundary condition. To do that, just negate the variable and bin that version. Then values on the boundary will jump differently.
Adding random noise to separate ties is utterly indefensible in my view. It's not reproducible (except trivially using the same program and the same settings) and it will have different implications in terms of the same observations' values on other variables.
Here's an example where #3 doesn't help, but it sometimes does:
. sysuse auto, clear
(1978 Automobile Data)
. xtile bin5 = mpg, nq(5)
. gen negmpg = -mpg
. xtile bin5_2 = negmpg, nq(5)
. tab bin5
5 quantiles |
of mpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 18 24.32 24.32
2 | 17 22.97 47.30
3 | 13 17.57 64.86
4 | 12 16.22 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
. tab bin5_2
5 quantiles |
of negmpg | Freq. Percent Cum.
------------+-----------------------------------
1 | 19 25.68 25.68
2 | 12 16.22 41.89
3 | 16 21.62 63.51
4 | 13 17.57 81.08
5 | 14 18.92 100.00
------------+-----------------------------------
Total | 74 100.00
See also some discussion within Section 4 of this paper.
I see no hint whatsoever in the documentation that xtile would omit observations in the way that you imply. You give no precise quotation supporting that. It would be perverse to exclude any non-missing values unless so instructed.
I don't comment directly on the use of pweights, except to note that they might be a complicating factor here.
I am working with a dataset that has purchases per date (called ItemNum) on multiple dates across 2800 individuals. Each item is given its own line, so if an individual purchased two items on a date, that date will appear twice. I don't care how many items were purchased on a date (each date representing one trip), but rather want the mean number of trips made across the 2800 individuals (about 18230 lines of data). My data look like this:
+----+------------+---------+---------------------------+
| ID |    Date    | ItemNum |       ItemDescript        |
+----+------------+---------+---------------------------+
| 1  | 01/22/2010 |    1    | Description of the item   |
| 1  | 01/22/2010 |    2    | Description of other item |
| 1  | 07/19/2013 |    1    |                           |
| 2  | 06/04/2012 |    1    |                           |
| 2  | 02/02/2013 |    1    |                           |
| 2  | 11/13/2013 |    1    |                           |
+----+------------+---------+---------------------------+
In the above table, person 1 made two trips (because two distinct dates are shown) and three item purchases, while person 2 made three trips. I am interested in the average number of trips across all people, but first I need to collapse the data down to unique dates. So I know I need to collapse on the date, but when I do
collapse (mean) ItemNum (first) Date, by(ID)
it just takes the first date on which the ID shows up, not the first occurrence of each unique date.
The next issue is that once the data are collapsed, I need to take the mean of the count of dates, not of the dates themselves, which is also where I seem to be getting tripped up.
Or perhaps something like
clear
input ID str16 dt ItemNum
1 "01/22/2010" 1
1 "01/22/2010" 2
1 "07/19/2013" 1
end
// convert the string date to a Stata daily date
generate Date = daily(dt,"MDY")
// tag exactly one observation per distinct ID-Date pair
egen trip = tag(ID Date)
// sum the tags: the number of distinct dates (trips) per person
collapse (sum) trip, by(ID)
summarize trip
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
trip | 1 2 . 2 2
if what you are looking for is found in "Mean" - a single number giving the average number of trips made by the 2800 individuals (one individual in the limited sample data given).
Are you trying to do the following?
collapse (mean) ItemNum, by(ID Date) fast
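If the end goal is the mean number of trips per person, a sketch of one way to finish after that collapse (each remaining row is then one ID-Date pair, i.e. one trip; the variable name trips is illustrative):
// count the rows (trips) per person, keep one row per person, then average
bysort ID: gen trips = _N
bysort ID: keep if _n == 1
summarize trips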
This is the Stata code I used to divide a Winsorised & centred variable (num_exp, denoting number of experienced managers) into 4 quartiles and thereafter to generate dummies for the highest and lowest quartiles:
egen quartile_num_exp = xtile(WC_num_exp), n(4)
gen high_quartile_numexp = 1 if quartile_num_exp==4
(1433 missing values generated)
gen low_quartile_num_exp = 1 if quartile_num_intlexp==1
(1062 missing values generated)
Thanks everybody - here's the link
https://dl.dropboxusercontent.com/u/64545449/No%20of%20expeienced%20managers.dta
I did try both Aspen Chen's and Roberto's suggestions. Chen's way of creating the high-quartile dummy gives the same results as I had earlier, and with Roberto's, both quartile dummies show 1 for the same rows - how is that possible?
I forgot to mention here that there are indeed many ties: the range of the original variable W_num_exp is from 0 to 7, the mean being 2.126618; I subtracted that from each observation of W_num_exp to get WC_num_exp.
tab high_quartile_numexp shows the same problem I originally had:
le_numexp | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,433 80.64 80.64
1 | 344 19.36 100.00
------------+-----------------------------------
Total | 1,777 100.00
Also, I checked that egenmore is already installed in my Stata (version 13.1).
What I fail to understand is why the dummy variable based on the highest quartile doesn't have 75% of the observations below it (I've got 1777 total observations). To my understanding, this dummy should mark the cut-off above which exactly 25% of the total number of observations lie, yet as we can see it contains only 19.36% of observations.
Am I doing anything wrong in writing the Stata code for the high-quartile and low-quartile dummy variables?
Consider the following code:
clear
set more off
sysuse auto
keep make mpg
*-----
// your way (kind of)
egen mpg4 = xtile(mpg), nq(4)
gen lowq = mpg4 == 1
gen highq = mpg4 == 4
*-----
// what you want
summarize mpg, detail
gen lowq2 = mpg < r(p25)
gen highq2 = mpg > r(p75)
*-----
summarize high* low*
list
Now check the listing to see what's going on.
See help stored results.
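For instance, a quick sketch of how stored results work, using the auto data:
sysuse auto, clear
quietly summarize mpg, detail
return list                  // shows r(p25), r(p50), r(p75), among others
display r(p25) "  " r(p75)   // the quartile cut-offs just stored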
The dataset provided answers the question. Consider the tabulation:
. tab W_num_exp
num_execs_i |
ntl_exp, |
Winsorized |
fraction |
.01 | Freq. Percent Cum.
------------+-----------------------------------
0 | 297 16.71 16.71
1 | 418 23.52 40.24
2 | 436 24.54 64.77
3 | 282 15.87 80.64
4 | 171 9.62 90.26
5 | 109 6.13 96.40
6 | 34 1.91 98.31
7 | 30 1.69 100.00
------------+-----------------------------------
Total | 1,777 100.00
Exactly equal numbers in each of 4 quartile-based bins can be provided if, and only if, there are values with cumulative percents 25, 50, 75. No such values exist. You have to make do with approximations. The approximations can be lousy, but the only alternative, of arbitrarily assigning observations with the same value to different bins to even up frequencies, is statistically indefensible.
(The number of observations needing to be a multiple of 4 for 4 bins, etc., for exactly equal frequencies is also a complication, which bites hard for small datasets, but that is not the major issue here.)
I am pretty new to Stata programming.
My question: I need to reorder/reshape a dataset through (I guess) a macro.
I have a dataset of individuals, with a variable birthyear (year of birth) and variables each containing weight at a given calendar year, e.g.
BIRTHYEAR | W_1990 | W_1991 | W_1992 | ... | W_2000
1989      | 7.2    | 9.3    | 10.2   | ... | 35.2
1981      | 33.2   | 35.3   | ...
I would like to obtain new variables containing weight at different ages, e.g. Weight_age_1, Weight_age_2, etc.: this means, taking for instance the first observation of the example, leave Weight_age_1 blank, put 7.2 in Weight_age_2, and so on.
I have tried something like...
forvalues i = 1/10 {
    capture drop weight_age_`i'
    capture drop birth_`i'
    gen birth_`i' = birthyear - 1 + `i'
    tostring birth_`i', replace
    gen weight_age_`i' = w_birth_`i'
}
...but it doesn't work.
Can you please help me?
Experienced Stata users wouldn't try to write a self-contained program here: they would see that the heart of the problem is a reshape.
clear
input birthyear w_1990 w_1991 w_1992
1989 7.2 9.3 10.2
1981 33.2 35.3 37.6
end
gen id = _n                // reshape needs an identifier variable
reshape long w_, i(id)     // one row per id and calendar year
rename _j year
gen age = year - birthyear
l, sepby(id)
+-----------------------------------+
| id year birthy~r w_ age |
|-----------------------------------|
1. | 1 1990 1989 7.2 1 |
2. | 1 1991 1989 9.3 2 |
3. | 1 1992 1989 10.2 3 |
|-----------------------------------|
4. | 2 1990 1981 33.2 9 |
5. | 2 1991 1981 35.3 10 |
6. | 2 1992 1981 37.6 11 |
+-----------------------------------+
To get the variables you say you want, you could reshape wide, but this long structure is by far the more convenient way to store these data for future Stata work.
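If you really do want the wide layout, here is a sketch of the follow-up step, continuing the example above (year must be dropped first, since it varies within id):
drop year
reshape wide w_, i(id) j(age)
rename w_* weight_age_*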
P.S. The heart of your programming problem is that you are getting confused between the names of variables and their contents.
But this is a "look-up" approach made to work:
clear
input birthyear w_1990 w_1991 w_1992
1989 7.2 9.3 10.2
1981 33.2 35.3 37.6
end
// for each age, look up the weight recorded in the matching calendar year
quietly forval j = 1/10 {
    gen weight_`j' = .
    forval k = 1990/1992 {
        replace weight_`j' = w_`k' if (`k' - birthyear) == `j'
    }
}
The essential trick is to do name manipulation using local macros. In Stata, variables are mainly for holding data; single-valued constants are better held in local macros and scalars. (Your sense of the word "macro" as meaning script or program is not how the term is used in Stata.)
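A toy illustration of that distinction, continuing the example data above (the new variable name w_check is illustrative):
local k 1991
display "w_`k'"       // the macro expands to build the variable *name* w_1991
gen w_check = w_`k'   // expands to: gen w_check = w_1991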
As above: this is the data structure you ask for, but it is likely to be more problematic than that produced by reshape long.