Using summation to create a new variable - stata

I have data that look like this:
| Country | Year | Firm | Profit |
|---------|------|------|--------|
| A | 1 | 1 | 10 |
| A | 1 | 2 | 20 |
| A | 1 | 3 | 30 |
| A | 1 | 4 | 40 |
I want to create a new variable for each firm i that calculates the following: the sum, over all firms j in the same country and year, of max(Profit_j - Profit_i, 0).
For example, the value of the variable for firm 1 would be:
max(20 - 10, 0) + max(30 - 10, 0) + max(40 - 10, 0)
How can I do this in Stata by country and year?

Below is a direct solution to your problem (note the use of dataex for providing example data):
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Country float(Year Firm Profit)
"A" 1 1 10
"A" 1 2 20
"A" 1 3 30
"A" 1 4 40
end
generate Wanted = -Profit
bysort Country Year (Wanted): replace Wanted = sum(Profit) - _n * Profit
list
+-----------------------------------------+
| Country Year Firm Profit Wanted |
|-----------------------------------------|
1. | A 1 4 40 0 |
2. | A 1 3 30 10 |
3. | A 1 2 20 30 |
4. | A 1 1 10 60 |
+-----------------------------------------+
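The logic behind it is the following: within each Country-Year group sorted by descending Profit, every firm at or above the current row has a profit at least as large as the current firm's, while every firm below it contributes max(...) = 0. The wanted sum is therefore the running total of Profit down to the current row minus _n times the current firm's Profit.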

Note: This was the first answer posted. It fell into the trap of taking the OP's algebra literally and implementing the calculation in terms of maxima within groups. I realised after posting that there must be a much simpler way of doing it, and @Romalpa Akzo got there, which is excellent. I undeleted this on request because it does show some machinery for looping over groups and applying a customised Mata function to each group.
Here I write a Mata function to return the wanted result for a group and then loop over the groups to populate a pre-defined variable.
To test the code for a dataset with more than one group, I use mpg from Stata's auto toy dataset.
mata :
void wanted (string scalar varname, string scalar usename, string scalar resultname) {
    real scalar i
    real colvector x, result, zero
    // read the observations selected by usename
    result = x = st_data(., varname, usename)
    zero = J(rows(x), 1, 0)
    for(i = 1; i <= rows(x); i++) {
        // excess of every value over x[i], floored at zero, then summed
        result[i] = sum(rowmax((x :- x[i], zero)))
    }
    // write the results back to the same observations
    st_store(., resultname, usename, result)
}
end
sysuse auto, clear
sort foreign rep78 mpg
egen group = group(foreign rep78), label
summarize group, meanonly
local G = r(max)
generate wanted = .
generate touse = 0
quietly forvalues g = 1 / `G' {
replace touse = group == `g'
mata : wanted("mpg", "touse", "wanted")
}
How did that work out? Here are some results:
. list mpg wanted group if foreign, sepby(group)
+--------------------------+
| mpg wanted group |
|--------------------------|
53. | 21 7 Foreign 3 |
54. | 23 3 Foreign 3 |
55. | 26 0 Foreign 3 |
|--------------------------|
56. | 21 35 Foreign 4 |
57. | 23 19 Foreign 4 |
58. | 23 19 Foreign 4 |
59. | 24 13 Foreign 4 |
60. | 25 8 Foreign 4 |
61. | 25 8 Foreign 4 |
62. | 25 8 Foreign 4 |
63. | 28 2 Foreign 4 |
64. | 30 0 Foreign 4 |
|--------------------------|
65. | 17 84 Foreign 5 |
66. | 17 84 Foreign 5 |
67. | 18 77 Foreign 5 |
68. | 18 77 Foreign 5 |
69. | 25 42 Foreign 5 |
70. | 31 18 Foreign 5 |
71. | 35 6 Foreign 5 |
72. | 35 6 Foreign 5 |
73. | 41 0 Foreign 5 |
|--------------------------|
74. | 14 . . |
+--------------------------+
So, how would that be applied to your data?
clear
input str1 Country Year Firm Profit
A 1 1 10
A 1 2 20
A 1 3 30
A 1 4 40
end
egen group = group(Country Year), label
summarize group, meanonly
local G = r(max)
generate wanted = .
generate touse = 0
quietly forvalues g = 1/`G' {
replace touse = group == `g'
mata: wanted("Profit", "touse", "wanted")
}
Results:
. list Firm Profit wanted, sepby(group)
+------------------------+
| Firm Profit wanted |
|------------------------|
1. | 1 10 60 |
2. | 2 20 30 |
3. | 3 30 10 |
4. | 4 40 0 |
+------------------------+

How to recode separate variables from a multiple response survey question into one variable

I am trying to recode a variable that indicates total number of responses to a multiple response survey question. Question 4 has options 1, 2, 3, 4, 5, 6, and participants may choose one or more options when submitting a response. The data is currently coded as binary outputs for each option: var Q4___1 = yes or no (1/0), var Q4___2 = yes or no (1/0), and so forth.
This is the tabstat of all yes (1) responses to the 6 Q4___* variables
Variable | Sum
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
I would like to create a new variable that encapsulates these values.
Can someone help me figure out how to create this variable, and if coding a variable in this manner for a multiple option survey question is valid?
When I used the replace command the total number of responses were not adding up, as shown below
gen q4=.
replace q4 =1 if q4___1 == 1
replace q4 =2 if q4___2 == 1
replace q4 =3 if q4___3 == 1
replace q4 =4 if q4___4 == 1
replace q4 =5 if q4___5 == 1
replace q4 =6 if q4___6 == 1
label values q4 primarysource
q4 | Freq. Percent Cum.
------------+-----------------------------------
1 | 46 48.94 48.94
2 | 31 32.98 81.91
3 | 6 6.38 88.30
4 | 1 1.06 89.36
5 | 3 3.19 92.55
6 | 7 7.45 100.00
------------+-----------------------------------
Total | 94 100.00
UPDATE
To be specific, I am trying to create a new variable that captures the column sum of each question, not the row total across all questions. I know that 63 participants responded yes to question 4 a) and 33 to question 4 b), so I want my new variable to reflect that.
This is what I want my new variable's values to look like.
q4
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
The fallacy here is ignoring the possibility of multiple 1s as answers to the various Q4???? variables. For example if someone answers 1 1 1 1 1 1 to all questions, they appear in your final variable only in respect of their answer to the 6th question. Otherwise put, your code overwrites and so ignores all positive answers before the last positive answer.
What is likely to be more useful are
(1) the total across all 6 questions which is just
egen Q4_total = rowtotal(Q4????)
where the 4 instances of ? mean that by eye I count 3 underscores and 1 numeral.
(2) a concatenation of responses that is just
egen Q4_concat = concat(Q4????)
(3) a variable that is a concatenation of questions with positive responses, so 246 if those questions were answered 1 and the others were answered 0.
gen Q4_pos = ""
forval j = 1/6 {
    replace Q4_pos = Q4_pos + "`j'" if Q4___`j' == 1
}
EDIT
Here is a test script giving concrete examples.
clear
set obs 6
forval j = 1/6 {
gen Q`j' = _n <= `j'
}
list
egen rowtotal = rowtotal(Q?)
su rowtotal, meanonly
di r(sum)
* install from tab_chi on SSC
tabm Q?
Results:
. list
+-----------------------------+
| Q1 Q2 Q3 Q4 Q5 Q6 |
|-----------------------------|
1. | 1 1 1 1 1 1 |
2. | 0 1 1 1 1 1 |
3. | 0 0 1 1 1 1 |
4. | 0 0 0 1 1 1 |
5. | 0 0 0 0 1 1 |
|-----------------------------|
6. | 0 0 0 0 0 1 |
+-----------------------------+
. egen rowtotal = rowtotal(Q?)
. su rowtotal, meanonly
. di r(sum)
21
. tabm Q?
| values
variable | 0 1 | Total
-----------+----------------------+----------
Q1 | 5 1 | 6
Q2 | 4 2 | 6
Q3 | 3 3 | 6
Q4 | 2 4 | 6
Q5 | 1 5 | 6
Q6 | 0 6 | 6
-----------+----------------------+----------
Total | 15 21 | 36

Stacking variables for each unique ID

I am using Stata 13 to stack several variables into one variable using
stack stand1-stand10, into(all)
However, I need to do it for each unique id, with the id stored alongside all, something like:
bysort familyid: stack stand1-stand10,into(all) keep familyid
We can use a simpler analogue of your data example.
clear
set obs 3
gen familyid = _n
forval j = 1/3 {
gen stand`j' = _n * `j'
}
list
+-------------------------------------+
| familyid stand1 stand2 stand3 |
|-------------------------------------|
1. | 1 1 2 3 |
2. | 2 2 4 6 |
3. | 3 3 6 9 |
+-------------------------------------+
save original
To stack with an identifier, just repeat the identifier variable name. For more than a few variables, it's easiest to prepare a call using a loop.
forval j = 1/3 {
local call `call' familyid stand`j'
}
di "`call'"
familyid stand1 familyid stand2 familyid stand3
stack `call', into(familyid stand)
sort familyid _stack
list, sepby(familyid)
+---------------------------+
| _stack familyid stand |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 2 |
3. | 3 1 3 |
|---------------------------|
4. | 1 2 2 |
5. | 2 2 4 |
6. | 3 2 6 |
|---------------------------|
7. | 1 3 3 |
8. | 2 3 6 |
9. | 3 3 9 |
+---------------------------+
That said, it's easier to use reshape long.
use original, clear
reshape long stand, i(familyid) j(which)
list, sepby(familyid)
+--------------------------+
| familyid which stand |
|--------------------------|
1. | 1 1 1 |
2. | 1 2 2 |
3. | 1 3 3 |
|--------------------------|
4. | 2 1 2 |
5. | 2 2 4 |
6. | 2 3 6 |
|--------------------------|
7. | 3 1 3 |
8. | 3 2 6 |
9. | 3 3 9 |
+--------------------------+

Row-wise count/sum of values in Stata

I have a dataset where each person (row) has values 0, 1 or . in a number of variables (columns).
I would like to create two variables. One that includes the count of all the 0 and one that has the count of all the 1 for each person (row).
In my case, there is no pattern in the variable names. For this reason I create a varlist of all the existing variables, excluding the ones that should not be counted.
+--------+--------+------+------+------+------+------+----------+--------+
| ID | region | Qa | Qb | C3 | C4 | Wa | count 0 | count 1|
+--------+--------+------+------+------+------+------+----------+--------+
| 1 | A | 1 | 1 | 1 | 1 | . | 0 | 4 |
| 2 | B | 0 | 0 | 0 | 1 | 1 | 3 | 2 |
| 3 | C | 0 | 0 | . | 0 | 0 | 4 | 0 |
| 4 | D | 1 | 1 | 1 | 1 | 0 | 1 | 4 |
+--------+--------+------+------+------+------+------+----------+--------+
The following works, however, I cannot add an if statement
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist')
If I change the last line with the one below, I get an error of invalid syntax.
egen count_of_1s = rowtotal(`varlist') if `v' == 1
I turned from count to summing because I thought this is a sneaky way out of the problem. I could change the values from 0,1 to 1, 2, then sum all the two values separately in two different variables and then divide accordingly in order to get the actual count of 1 or 2 per row.
I found the question "Stata: Using egen, anycount() when values vary for each observation", but Stata freezes with that approach because my dataset is quite large (100,000 rows and 3,000 columns).
Any help will be very appreciated :-)
Solution based on the response of William
* total number of valid responses (0s and 1s, excluding . )
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_nonmiss = rownonmiss(`varlist') // this counts all the 0s and 1s (namely, the nonmissing values)
* total number of 1s per row
ds ID region count_of_nonmiss, not // CAUTION: count_of_nonmiss must be excluded here!
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist')
How about
egen count_of_nonmiss = rownonmiss(`varlist')
generate count_of_0s = count_of_nonmiss - count_of_1s
When the value of the macro varlist is substituted into your if clause, the command expands to
egen count_of_1s = rowtotal(`varlist') if Qa Qb C3 C4 Wa == 1
Clearly a syntax error.
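As a minimal sketch of one alternative (not the method from the answer above), the two counts can also be accumulated in a plain loop over the variables, which avoids both the if clause and anycount(). The toy data mirror the table in the question; the names count_of_1s and count_of_0s follow the question, and how this scales to 3,000 columns is untested here.

```stata
clear
input ID str1 region Qa Qb C3 C4 Wa
1 "A" 1 1 1 1 .
2 "B" 0 0 0 1 1
3 "C" 0 0 . 0 0
4 "D" 1 1 1 1 0
end

* all variables apart from the identifiers
ds ID region, not
local varlist `r(varlist)'

* start both counters at zero and add 1 whenever a variable matches
generate int count_of_1s = 0
generate int count_of_0s = 0
foreach v of local varlist {
    * (`v' == 1) evaluates to 1 or 0; missing values match neither test
    quietly replace count_of_1s = count_of_1s + (`v' == 1)
    quietly replace count_of_0s = count_of_0s + (`v' == 0)
}
list
```

Because the counters are created after ds is run, they are not part of the variable list and cannot contaminate the counts.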
I had the same problem of counting the occurrences of specified values in each observation across a set of variables.
I resolved it with egen's anycount() function. For example, to count the occurrences of 0 across x1-x3:
clear
input id x1 x2 x3
1 1 0 2
2 2 0 2
3 2 0 3
end
egen count2 = anycount(x1-x3), values(0)

Rolling sum with unbalanced panel with non-even times in Stata

I have an unbalanced daily panel where entries occur at uneven times. I would like to generate the rolling sum of some variable x over the past 365 days. I can think of two ways to do this, but the first is memory hungry and the second is processor hungry. Is there a third alternative that avoids these problems?
Here are my two solutions.
clear
set obs 200
set seed 2001
/* panel variables */
generate id = 1 + int(2*runiform())
generate time = mdy(1, 1, 2000) + int(10*365*runiform())
format time %td
duplicates drop
xtset id time
/* data */
generate x = runiform()
/* first approach is to fill the panel with `tsfill` */
/* then remove "seasonality" with `s.` */
tsfill
generate sx = sum(x)
generate ssx = s365.sx
/* second approach without `tsfill` */
/* but nested loop is fairly slow */
drop if missing(x)
generate double ssx_alt = 0
forvalues i = 1/`= _N' {
local j = `i'
local delta = time[`i'] - time[`j']
while ((`j' > 0) & (`delta' < 365) & (id[`i'] == id[`j'])) {
local x = cond(missing(x[`j']), 0, x[`j'])
replace ssx_alt = ssx_alt + `x' in `i'
local j = `j' - 1
local delta = time[`i'] - time[`j']
}
}
The sum over the last # days is the difference between two cumulative sums, the cumulative sum to now and the cumulative sum to # days ago. The extension to panel data is easy, but not shown here. I don't think gaps disturb this principle once you have applied tsfill.
. set obs 20
obs was 0, now 20
. gen t = _n
. gen y = 100 + _n
. gen sumy = sum(y)
. tsset t
time variable: t, 1 to 20
delta: 1 unit
. gen diff = sumy - L10.sumy
(10 missing values generated)
. l
+------------------------+
| t y sumy diff |
|------------------------|
1. | 1 101 101 . |
2. | 2 102 203 . |
3. | 3 103 306 . |
4. | 4 104 410 . |
5. | 5 105 515 . |
|------------------------|
6. | 6 106 621 . |
7. | 7 107 728 . |
8. | 8 108 836 . |
9. | 9 109 945 . |
10. | 10 110 1055 . |
|------------------------|
11. | 11 111 1166 1065 |
12. | 12 112 1278 1075 |
13. | 13 113 1391 1085 |
14. | 14 114 1505 1095 |
15. | 15 115 1620 1105 |
|------------------------|
16. | 16 116 1736 1115 |
17. | 17 117 1853 1125 |
18. | 18 118 1971 1135 |
19. | 19 119 2090 1145 |
20. | 20 120 2210 1155 |
+------------------------+
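The panel extension mentioned above might look like the following sketch. It assumes tsfill is acceptable, uses made-up toy data, and uses a hypothetical 3-day window so that the results can be checked by hand (the question would use 365). The variable names cumx and roll are illustrative only.

```stata
clear
input id time x
1 1 10
1 2 20
1 5 30
2 1 1
2 4 2
end
xtset id time
tsfill

* cumulative sum within each panel; sum() treats missing x as zero
bysort id (time): generate double cumx = sum(x)

* window length in days (365 in the question)
local w 3
generate double roll = cumx - L`w'.cumx

* fewer than `w' days of history: the rolling sum is the cumulative sum so far
replace roll = cumx if missing(L`w'.cumx)

* drop the rows added by tsfill (assumes x has no missing values of its own)
drop if missing(x)
list, sepby(id)
```

For id 1 on day 5, for example, the window covers days 3 to 5, so the rolling sum is just the value 30 observed on day 5.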

Stata: Cumulative number of new observations

I would like to check if a value has appeared in some previous row of the same column.
At the end I would like to have a cumulative count of the number of distinct observations.
Is there any other solution than concatenating all _n rows and using regular expressions? I'm getting there with concatenating the rows, but given the limit of 244 characters for string variables (in Stata <13), this is sometimes not applicable.
Here's what I'm doing right now:
gen tmp=x
replace tmp = tmp[_n-1]+ "," + tmp if _n > 1
gen cumu=0
replace cumu=1 if regexm(tmp[_n-1],x+"|"+x+",|"+","+x+",")==0
replace cumu= sum(cumu)
Example
+-----+
| x |
|-----|
1. | 12 |
2. | 32 |
3. | 12 |
4. | 43 |
5. | 43 |
6. | 3 |
7. | 4 |
8. | 3 |
9. | 3 |
10. | 3 |
+-----+
becomes
+-------------------------------+
| x | tmp |
|-----|--------------------------
1. | 12 | 12 |
2. | 32 | 12,32 |
3. | 12 | 12,32,12 |
4. | 43 | 12,32,12,43 |
5. | 43 | 12,32,12,43,43 |
6. | 3 | 12,32,12,43,43,3 |
7. | 4 | 12,32,12,43,43,3,4 |
8. | 3 | 12,32,12,43,43,3,4,3 |
9. | 3 | 12,32,12,43,43,3,4,3,3 |
10. | 3 | 12,32,12,43,43,3,4,3,3,3|
+--------------------------------+
and finally
+-----------+
| x | cumu|
|-----|------
1. | 12 | 1 |
2. | 32 | 2 |
3. | 12 | 2 |
4. | 43 | 3 |
5. | 43 | 3 |
6. | 3 | 4 |
7. | 4 | 5 |
8. | 3 | 5 |
9. | 3 | 5 |
10. | 3 | 5 |
+-----------+
Any ideas how to avoid the 'middle step'? For me that becomes very important when x holds strings instead of numbers.
Thanks!
Regular expressions are great, but here as often elsewhere simple calculations suffice. With your sample data
. input x
x
1. 12
2. 32
3. 12
4. 43
5. 43
6. 3
7. 4
8. 3
9. 3
10. 3
11. end
end of do-file
you can identify first occurrences of each distinct value:
. gen long order = _n
. bysort x (order) : gen first = _n == 1
. sort order
. l
+--------------------+
| x order first |
|--------------------|
1. | 12 1 1 |
2. | 32 2 1 |
3. | 12 3 0 |
4. | 43 4 1 |
5. | 43 5 0 |
|--------------------|
6. | 3 6 1 |
7. | 4 7 1 |
8. | 3 8 0 |
9. | 3 9 0 |
10. | 3 10 0 |
+--------------------+
The number of distinct values seen so far is then just a cumulative sum of first using sum(). This works with string variables too. In fact this problem is one of several discussed within
http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
which is accessible to all as a .pdf. search distinct would have pointed you to this article.
Becoming fluent with what you can do with by:, sort, _n and _N is an important skill in Stata. See also
http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
for another article accessible to all.
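Putting the steps above together, a minimal sketch of the full calculation on the sample data could be the following. The variable name cumu is taken from the question; order and first are as in the log above.

```stata
clear
input x
12
32
12
43
43
3
4
3
3
3
end

* remember the original row order
generate long order = _n

* flag the first occurrence of each distinct value
bysort x (order): generate byte first = _n == 1

* restore the original order and accumulate the flags
sort order
generate long cumu = sum(first)
list x first cumu
```

Nothing in this sketch depends on x being numeric, so the same commands apply when x is a string variable.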