Matching values in a variable by year - stata

I have the following minimal example:
input str5 name year match1 match2 match3
Alice 2000 . . .
Alice 2000 . . .
Bob 2000 . . .
Carol 2001 0 . .
Alice 2002 0 1 .
Carol 2002 1 0 .
Bob 2003 0 0 1
Bob 2003 0 0 1
end
I have data on name and year, and I want to create binary variables match1, match2, match3, where matchk equals 1 if the same name appears in the data k years earlier. For example, for the first observation, match1 equals 1 if Alice appears in 1999, match2 equals 1 if Alice appears in 1998, and so on.
If the earlier year does not occur in the data at all (here there is no 1999 or 1998), the binary variable should be missing.
How can I construct these match variables? Note that I have millions of unique names, so levelsof name, local(match) fails with a "macro substitution results in line that is too long" error. Also note that a name is sometimes duplicated within a year, and some names are absent in some years.

Thanks for the data example. Here is some technique using rangestat from SSC. I don't understand your rule on which values should be 0 and which missing.
* Example generated by -dataex-. For more info, type help dataex
clear
input str5 name float year
"Alice" 2000
"Alice" 2000
"Alice" 2002
"Bob" 2000
"Bob" 2003
"Bob" 2003
"Carol" 2001
"Carol" 2002
end
gen one = 1
* each match variable is 1 if the same name appears exactly j years earlier
forval j = 1/3 {
    rangestat (max) match`j'=one, int(year -`j' -`j') by(name)
}
drop one
sort name year
list, sepby(year)
+-----------------------------------------+
| name year match1 match2 match3 |
|-----------------------------------------|
1. | Alice 2000 . . . |
2. | Alice 2000 . . . |
|-----------------------------------------|
3. | Alice 2002 . 1 . |
|-----------------------------------------|
4. | Bob 2000 . . . |
|-----------------------------------------|
5. | Bob 2003 . . 1 |
6. | Bob 2003 . . 1 |
|-----------------------------------------|
7. | Carol 2001 . . . |
|-----------------------------------------|
8. | Carol 2002 1 . . |
+-----------------------------------------+
As the original author of levelsof I find it a little melancholy to see it pressed into service where it is of little or no help.
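If the intended rule is that a match variable should be 0 when the earlier year occurs somewhere in the data but not for this name, and missing when the earlier year does not occur at all, the rangestat approach can be extended along these lines. This is only a sketch under that assumed reading of the question; the helper variables any1 to any3 are hypothetical names introduced here for illustration.

```stata
* Sketch only; assumes rangestat is installed (ssc install rangestat)
clear
input str5 name float year
"Alice" 2000
"Alice" 2000
"Alice" 2002
"Bob"   2000
"Bob"   2003
"Bob"   2003
"Carol" 2001
"Carol" 2002
end

gen one = 1
forval j = 1/3 {
    * 1 if this name appears j years earlier, missing otherwise
    rangestat (max) match`j'=one, interval(year -`j' -`j') by(name)
    * 1 if any name at all appears j years earlier (assumed rule for 0)
    rangestat (max) any`j'=one, interval(year -`j' -`j')
    * the earlier year is in the data, but this name is not in it
    replace match`j' = 0 if missing(match`j') & any`j' == 1
    drop any`j'
}
drop one
sort name year
list, sepby(name)
```

Because nothing here loops over names, the number of distinct names should not matter for macro length.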

Here is an alternative approach using frames:
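* frames require Stata 16 or later
keep name year
* lookup frame: one row per name and year
frame copy default prev
frame prev: duplicates drop
frame prev: rename year myear
gen myear=.
forvalues i=1/3 {
    * link to the same name i years earlier, if present
    replace myear = year-`i'
    frlink m:1 name myear, frame(prev) generate(match`i')
    * frlink stores the linked row number; recode it to 1
    replace match`i' = 1 if match`i'!=.
}
drop myear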
Output:
name year match1 match2 match3
1. Alice 2000 . . .
2. Alice 2000 . . .
3. Bob 2000 . . .
4. Carol 2001 . . .
5. Alice 2002 . 1 .
6. Carol 2002 1 . .
7. Bob 2003 . . 1
8. Bob 2003 . . 1

Related

destring 18 digit number returns rounding error

I have IDs with a length of 18 as strings and I want to transform them to a numeric variable.
I tried the following code:
destring var, replace
That returns a numeric variable, but the last digit of the ID is changed by rounding, e.g. 123456789123456000 --> 123456789123456001
How can I destring my values without any change in the ID?
I can't reproduce that as this shows:
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen test = "123456789123456000"
. destring, gen(check)
test: all characters numeric; check generated as double
. l
+--------------------------------+
| test check |
|--------------------------------|
1. | 123456789123456000 1.235e+17 |
+--------------------------------+
. format check %23.0f
. l
+-----------------------------------------+
| test check |
|-----------------------------------------|
1. | 123456789123456000 123456789123456000 |
+-----------------------------------------+
The other way round does produce a discrepancy: the string "123456789123456001" maps to 123456789123456000. In essence, you are bumping up against what can be held exactly in a double, which uses 8 bytes per number and represents every integer exactly only up to 2^53 (about 9.007e15).
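The limit can be checked directly. The following is a minimal sketch, not part of the original answer, that shows where consecutive integers stop being distinguishable in a double and that two different 18-digit strings can convert to the same number.

```stata
* Sketch: a double holds every integer exactly only up to 2^53
display %21.0f (2^53)
display %21.0f (2^53 + 1)

* two distinct 18-digit strings can map to the same double;
* this displays 1 if the two converted values are equal
display real("123456789123456001") == real("123456789123456000")
```

If the IDs are only labels and never used in arithmetic, keeping them as strings avoids the problem altogether.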

Formatting a Stata table like a table in SAS

I have a 3-way table in Stata that looks like this:
I would like to format this 3-way crosstab like a table in SAS that looks like this:
The actual output in the table isn't important, I just want to know how I can change the formatting of the Stata table. Any help is appreciated!
The groups command from the Stata Journal will get you most of the way. This reproducible example doesn't exhaust the possibilities.
. webuse nlswork, clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. groups union race , show(f F p P) sepby(union)
+--------------------------------------------------+
| union race Freq. #<= Percent %<= |
|--------------------------------------------------|
| 0 white 10777 10777 56.02 56.02 |
| 0 black 3784 14561 19.67 75.69 |
| 0 other 167 14728 0.87 76.56 |
|--------------------------------------------------|
| 1 white 2817 17545 14.64 91.20 |
| 1 black 1649 19194 8.57 99.77 |
| 1 other 44 19238 0.23 100.00 |
+--------------------------------------------------+
The command must be installed before you can use it. groups is a lousy search term, but this search will find the 2017 write-up and later updates of the software (at the time of writing, just one in 2018).
. search st0496, entry
Search of official help files, FAQs, Examples, and Stata Journals
SJ-18-1 st0496_1 . . . . . . . . . . . . . . . . . Software update for groups
(help groups if installed) . . . . . . . . . . . . . . . . N. J. Cox
Q1/18 SJ 18(1):291
groups exited with an error message if weights were specified;
this has been corrected
SJ-17-3 st0496 . . . . . Speaking Stata: Tables as lists: The groups command
(help groups if installed) . . . . . . . . . . . . . . . . N. J. Cox
Q3/17 SJ 17(3):760--773
presents command for listing group frequencies and percents and
cumulations thereof; for various subsetting and ordering by
frequencies, percents, and so on; for reordering of columns;
and for saving tabulated data to new datasets

Group together records that share either one of two variables

I am working with Stata and I have a large data set where I need to group records together if they share one of two variables.
For example, take the following three observations:
Observation | matching var1 | matching var2
1 | xxx | aaa
2 | xxx | bbb
3 | yyy | bbb
If I were to group the records by var1, the first two observations would be in the same group and the last observation would be in a separate group. Similarly, if I were to group by var2, observations two and three would be in the same group and observation one would be in a separate group. However, if I were to group the records based on a match on either var1 or var2, all observations would be in the same group.
I would like to create a 'group id' variable that will take the same value across all these records.
Any suggestions on how I should go about it?
The community-contributed group_twoway (available in SSC) can match two variables:
ssc install group_twoway
Using an extended version of your example:
clear
input str3(var1 var2)
"xxx" "aaa"
"yyy" "bbb"
"mmm" "ccc"
"nnn" "ccc"
"mmm" "ddd"
"ooo" "ff"
"pp" "eee"
"qq" "ff"
"rr" "u"
"xxx" "bbb"
end
group_twoway var1 var2, generate(group_id)
Result # of obs.
-----------------------------------------
not matched 0
matched 10
-----------------------------------------
list, sepby(group_id) constant
+------------------------+
| var1 var2 group_id |
|------------------------|
1. | xxx aaa 1 |
2. | yyy bbb 1 |
|------------------------|
3. | mmm ccc 2 |
4. | nnn ccc 2 |
5. | mmm ddd 2 |
|------------------------|
6. | ooo ff 3 |
|------------------------|
7. | pp eee 4 |
|------------------------|
8. | qq ff 3 |
|------------------------|
9. | rr u 5 |
|------------------------|
10. | xxx bbb 1 |
+------------------------+
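For readers who cannot install the package, the same kind of grouping can be approximated with built-in commands by repeatedly passing the smallest group number along each variable until nothing changes. This is a rough sketch of that idea, not a description of how group_twoway works internally; newid is a hypothetical scratch variable, and empty strings would be treated as a shared value.

```stata
* Sketch only: propagate the smallest id along var1 and var2 in turn
clear
input str3(var1 var2)
"xxx" "aaa"
"xxx" "bbb"
"yyy" "bbb"
"zzz" "ccc"
end

gen long group_id = _n
local changed = 1
while `changed' {
    local changed = 0
    foreach v in var1 var2 {
        * smallest current id among observations sharing this value
        bysort `v' (group_id): gen long newid = group_id[1]
        count if newid != group_id
        if r(N) > 0 local changed = 1
        replace group_id = newid
        drop newid
    }
}
list, sepby(group_id)
```

The resulting ids are not consecutive; egen's group() function can renumber them if that matters. On very large data the repeated sorting may be slow.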

Calculate compounded annual growth rates

I have a panel dataset with values of companies between 2006-2015.
This looks something like the example below:
I want to calculate three-year compounded annual growth rates:
2006-2009
2007-2010
...
2012-2015
I have already tried to use the following command:
bys tina: generate SalesGrowth=(Sales/L3.Sales)^(1/3) - 1 if mod(ano, 5) == 0
However, although Stata generates the new variable, all values are missing.
As an alternative to the compounded annual growth rate, I could simply use a growth rate between 2006 and 2009. But the same problem arises: no values are generated.
Consider this toy example:
clear
input tina ano Sales
500000069 2006 15000
500000069 2007 17000
500000069 2008 19000
500000069 2009 24000
500000069 2010 22000
500000069 2011 28000
500000069 2012 26000
500000069 2013 29000
500000069 2014 31000
500000069 2015 33000
500000087 2006 40000
500000087 2007 42000
500000087 2008 44000
500000087 2009 46000
500000087 2010 48000
500000087 2011 50000
500000087 2012 52000
500000087 2013 54000
500000087 2014 56000
500000087 2015 58000
end
format tina %9.0f
The following solution:
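* r(N) comes from the last firm summarized, so every firm is assumed to have the same number of years
bysort tina: summarize ano
forvalues i = 1 / `= `r(N)' - 3' {
    * growth from the i-th to the (i+3)-th year of each firm
    bysort tina (ano): generate SalesGrowth`i' = (Sales[`i'+3]/Sales[`i'])^(1/3) - 1
    * keep the value only on the end-year observation
    bysort tina (ano): replace SalesGrowth`i' = . if ano != ano[`i'+3]
}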
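gives the growth rates you need. Note that tina is listed below as 500000064 and 500000096 because input stores numeric variables as float by default, and a float cannot hold these nine-digit identifiers exactly; input long tina ano Sales would preserve them: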
. list
+-------------------------------------------------------------------------------------------------------+
| tina ano Sales SalesG~1 SalesG~2 SalesG~3 SalesG~4 SalesG~5 SalesG~6 SalesG~7 |
|-------------------------------------------------------------------------------------------------------|
1. | 500000064 2006 15000 . . . . . . . |
2. | 500000064 2007 17000 . . . . . . . |
3. | 500000064 2008 19000 . . . . . . . |
4. | 500000064 2009 24000 .1696071 . . . . . . |
5. | 500000064 2010 22000 . .0897442 . . . . . |
|-------------------------------------------------------------------------------------------------------|
6. | 500000064 2011 28000 . . .1379805 . . . . |
7. | 500000064 2012 26000 . . . .02704 . . . |
8. | 500000064 2013 29000 . . . . .0964574 . . |
9. | 500000064 2014 31000 . . . . . .0345097 . |
10. | 500000064 2015 33000 . . . . . . .0827134 |
|-------------------------------------------------------------------------------------------------------|
11. | 500000096 2006 40000 . . . . . . . |
12. | 500000096 2007 42000 . . . . . . . |
13. | 500000096 2008 44000 . . . . . . . |
14. | 500000096 2009 46000 .0476896 . . . . . . |
15. | 500000096 2010 48000 . .0455159 . . . . . |
|-------------------------------------------------------------------------------------------------------|
16. | 500000096 2011 50000 . . .043532 . . . . |
17. | 500000096 2012 52000 . . . .041714 . . . |
18. | 500000096 2013 54000 . . . . .0400419 . . |
19. | 500000096 2014 56000 . . . . . .0384988 . |
20. | 500000096 2015 58000 . . . . . . .0370703 |
+-------------------------------------------------------------------------------------------------------+
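If a single growth variable is enough, declaring the panel structure and using the lag operator is a shorter route. The sketch below assumes one observation per firm and year and uses only the first five rows of the toy data; storing tina as long is a choice made here so that the identifier is held exactly.

```stata
* Sketch only: assumes one observation per firm and year
clear
input long tina int ano double Sales
500000069 2006 15000
500000069 2007 17000
500000069 2008 19000
500000069 2009 24000
500000069 2010 22000
end

* declare the panel so that L3. means "three years earlier"
xtset tina ano
generate SalesGrowth = (Sales / L3.Sales)^(1/3) - 1
list, sep(0)
```

Each value sits on the end year of its window, so the 2006-2009 rate appears in the 2009 row, and rows without an observation three years earlier are missing.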

Stata Regular expressions extracting numerical values

I have some data that looks like this
var1
h 01 .00 .0 abc
d 1.0 .0 14.0abc
1,0.0 0.0 .0abc
It should be noted that the last three alphabetic characters are the same in every row, and I am hoping to extract all the numerical values within the string. The code that I'm using looks like this
gen x1=regexs(1) if regexm(var1,"([0-9]+) [ ]*(abc)*$")
However, this code only extracts the digits immediately before the abc term and stops at a space or a period. For example, only the 0 before abc is extracted from the first row. Is there a way to extract all the numerical values that precede the alphabetic characters?
As @Roberto Ferrer points out, your question isn't very clear, but here is an example using moss from SSC:
. clear
. input str16 var1
var1
1. "h 01 .00 .0 abc"
2. "d 1.0 .0 14.0abc"
3. "1,0.0 0.0 .0abc"
4. end
. moss var1, regex match("([0-9]+\.*[0-9]*|\.[0-9]+)")
. l _match*
+---------------------------------------+
| _match1 _match2 _match3 _match4 |
|---------------------------------------|
1. | 01 .00 .0 |
2. | 1.0 .0 14.0 |
3. | 1 0.0 0.0 .0 |
+---------------------------------------+
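The variables created by moss are strings. If numeric variables are wanted, one possible follow-up, offered here as a sketch rather than as part of the original answer, is to pass them through destring.

```stata
* Sketch only; assumes moss is installed (ssc install moss)
clear
input str16 var1
"h 01 .00 .0 abc"
"d 1.0 .0 14.0abc"
"1,0.0 0.0 .0abc"
end

moss var1, regex match("([0-9]+\.*[0-9]*|\.[0-9]+)")
* _match1, _match2, ... are strings; convert them to numeric variables
destring _match*, replace
list _match*
```

Empty matches become numeric missing values, and leading zeros such as the one in "01" are lost in the conversion.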