Rolling sum with unbalanced panel with non-even times in Stata

I have an unbalanced daily panel where entries occur at uneven times. I would like to generate the rolling sum of some variable x over the past 365 days. I can think of two ways to do this, but the first is memory hungry and the second is processor hungry. Is there a third alternative that avoids these problems?
Here are my two solutions.
clear
set obs 200
set seed 2001
/* panel variables */
generate id = 1 + int(2*runiform())
generate time = mdy(1, 1, 2000) + int(10*365*runiform())
format time %td
duplicates drop
xtset id time
/* data */
generate x = runiform()
/* first approach is to fill the panel with `tsfill` */
/* then remove "seasonality" with `s.` */
tsfill
generate sx = sum(x)
generate ssx = s365.sx
/* second approach without `tsfill` */
/* but nested loop is fairly slow */
drop if missing(x)
generate double ssx_alt = 0
forvalues i = 1/`= _N' {
    local j = `i'
    local delta = time[`i'] - time[`j']
    while ((`j' > 0) & (`delta' < 365) & (id[`i'] == id[`j'])) {
        local x = cond(missing(x[`j']), 0, x[`j'])
        replace ssx_alt = ssx_alt + `x' in `i'
        local j = `j' - 1
        local delta = time[`i'] - time[`j']
    }
}

The sum over the last # days is the difference between two cumulative sums, the cumulative sum to now and the cumulative sum to # days ago. The extension to panel data is easy, but not shown here. I don't think gaps disturb this principle once you have applied tsfill.
. set obs 20
obs was 0, now 20
. gen t = _n
. gen y = 100 + _n
. gen sumy = sum(y)
. tsset t
time variable: t, 1 to 20
delta: 1 unit
. gen diff = sumy - L10.sumy
(10 missing values generated)
. l
+------------------------+
| t y sumy diff |
|------------------------|
1. | 1 101 101 . |
2. | 2 102 203 . |
3. | 3 103 306 . |
4. | 4 104 410 . |
5. | 5 105 515 . |
|------------------------|
6. | 6 106 621 . |
7. | 7 107 728 . |
8. | 8 108 836 . |
9. | 9 109 945 . |
10. | 10 110 1055 . |
|------------------------|
11. | 11 111 1166 1065 |
12. | 12 112 1278 1075 |
13. | 13 113 1391 1085 |
14. | 14 114 1505 1095 |
15. | 15 115 1620 1105 |
|------------------------|
16. | 16 116 1736 1115 |
17. | 17 117 1853 1125 |
18. | 18 118 1971 1135 |
19. | 19 119 2090 1145 |
20. | 20 120 2210 1155 |
+------------------------+
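The same principle can be applied to unevenly spaced times without filling the panel first. The following is a minimal sketch in Python rather than Stata, not the answerer's code: the function name `rolling_sum` and the sample data are hypothetical, and it assumes the times of one panel unit are sorted and contain no duplicates. It keeps one cumulative sum and uses a binary search to find where the window starts. Unlike the `L10.` difference above, it returns a partial sum instead of missing when less than a full window of history is available.

```python
from bisect import bisect_right
from itertools import accumulate

def rolling_sum(times, values, window):
    """For each i, sum values[j] with times[i] - window < times[j] <= times[i].

    Assumes `times` is sorted ascending with no duplicates (one panel unit).
    """
    prefix = [0.0] + list(accumulate(values))  # prefix[k] = sum of first k values
    out = []
    for i, t in enumerate(times):
        # index of the first observation strictly later than t - window
        lo = bisect_right(times, t - window)
        out.append(prefix[i + 1] - prefix[lo])
    return out

# Regular case mirroring the listing above: t = 1..20, y = 101..120, window 10
t = list(range(1, 21))
y = [100 + k for k in t]
regular = rolling_sum(t, y, 10)

# Irregular case: gaps in time, no filling needed (hypothetical data)
days = [0, 100, 364, 365, 800]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
irregular = rolling_sum(days, x, 365)
```

For panel data the function would be called once per id. The work per observation is one binary search, so memory stays proportional to the number of actual observations.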

Related

How to recode separate variables from a multiple response survey question into one variable

I am trying to recode a variable that indicates total number of responses to a multiple response survey question. Question 4 has options 1, 2, 3, 4, 5, 6, and participants may choose one or more options when submitting a response. The data is currently coded as binary outputs for each option: var Q4___1 = yes or no (1/0), var Q4___2 = yes or no (1/0), and so forth.
This is the tabstat of all yes (1) responses to the 6 Q4___* variables
Variable | Sum
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
I would like to create a new variable that encapsulates these values.
Can someone help me figure out how to create this variable, and tell me whether coding a variable in this manner for a multiple option survey question is valid?
When I used the replace command, the total number of responses did not add up, as shown below:
gen q4=.
replace q4 =1 if q4___1 == 1
replace q4 =2 if q4___2 == 1
replace q4 =3 if q4___3 == 1
replace q4 =4 if q4___4 == 1
replace q4 =5 if q4___5 == 1
replace q4 =6 if q4___6 == 1
label values q4 primarysource
q4 | Freq. Percent Cum.
------------+-----------------------------------
1 | 46 48.94 48.94
2 | 31 32.98 81.91
3 | 6 6.38 88.30
4 | 1 1.06 89.36
5 | 3 3.19 92.55
6 | 7 7.45 100.00
------------+-----------------------------------
Total | 94 100.00
UPDATE
To clarify: I am trying to create a new variable that captures the column sum of each question, not the row total across all questions. I know that 63 participants responded yes to question 4 a) and 33 to question 4 b), so I want my new variable to reflect that.
This is what I want my new variable's values to look like.
q4
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
The fallacy here is ignoring the possibility of multiple 1s as answers to the various Q4???? variables. For example if someone answers 1 1 1 1 1 1 to all questions, they appear in your final variable only in respect of their answer to the 6th question. Otherwise put, your code overwrites and so ignores all positive answers before the last positive answer.
What is likely to be more useful are
(1) the total across all 6 questions which is just
egen Q4_total = rowtotal(Q4????)
where the 4 instances of ? mean that by eye I count 3 underscores and 1 numeral.
(2) a concatenation of responses that is just
egen Q4_concat = concat(Q4????)
(3) a variable that is a concatenation of questions with positive responses, so 246 if those questions were answered 1 and the others were answered 0.
gen Q4_pos = ""
forval j = 1/6 {
    replace Q4_pos = Q4_pos + "`j'" if Q4___`j' == 1
}
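To make the overwriting problem concrete, here is a small sketch in Python rather than Stata. The three respondents are hypothetical data, and the helper names `last_positive` and `positive_concat` are made up for this illustration. It contrasts the chain of replace commands with the row total and the concatenation of positive responses.

```python
# Each row is one respondent's 0/1 answers to the six options (hypothetical data)
rows = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
]

def last_positive(row):
    """Mimic the chain of replace commands: later options overwrite earlier ones."""
    q = None
    for j, answer in enumerate(row, start=1):
        if answer == 1:
            q = j
    return q

def positive_concat(row):
    """Concatenate the numbers of the options answered 1, e.g. '16'."""
    return "".join(str(j) for j, answer in enumerate(row, start=1) if answer == 1)

column_sums = [sum(col) for col in zip(*rows)]      # yes-count per option
row_totals = [sum(row) for row in rows]             # yes-count per respondent
overwritten = [last_positive(row) for row in rows]  # what the replace chain keeps
concatenated = [positive_concat(row) for row in rows]
```

Here the column sums add up to 5 positive answers, but the overwritten variable holds only 3 values, one per respondent, which is the same kind of shortfall as 94 against 115 in the question.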
EDIT
Here is a test script giving concrete examples.
clear
set obs 6
forval j = 1/6 {
    gen Q`j' = _n <= `j'
}
list
egen rowtotal = rowtotal(Q?)
su rowtotal, meanonly
di r(sum)
* install from tab_chi on SSC
tabm Q?
Results:
. list
+-----------------------------+
| Q1 Q2 Q3 Q4 Q5 Q6 |
|-----------------------------|
1. | 1 1 1 1 1 1 |
2. | 0 1 1 1 1 1 |
3. | 0 0 1 1 1 1 |
4. | 0 0 0 1 1 1 |
5. | 0 0 0 0 1 1 |
|-----------------------------|
6. | 0 0 0 0 0 1 |
+-----------------------------+
. egen rowtotal = rowtotal(Q?)
. su rowtotal, meanonly
. di r(sum)
21
. tabm Q?
| values
variable | 0 1 | Total
-----------+----------------------+----------
Q1 | 5 1 | 6
Q2 | 4 2 | 6
Q3 | 3 3 | 6
Q4 | 2 4 | 6
Q5 | 1 5 | 6
Q6 | 0 6 | 6
-----------+----------------------+----------
Total | 15 21 | 36

How to make moving average across groups?

| month | year | amount|
|-------|--------|-------|
| 1 | 2010 | 26 |
| 1 | 2010 | 26 |
| 2 | 2010 | 30 |
| 3 | 2010 | 35 |
| 3 | 2010 | 35 |
I need to create another variable that takes the prior month's amount and the current month's amount (_n-1 and _n) and divides their sum by 2, something like a moving average. The problem is that I need to do it by month and year, since the same month and year can occur more than once. There are other variables as well that are irrelevant here, but they are the reason I can't just delete duplicates.
For example, for observation 5, I would need it to be (35+30+26) / 3
Your prescription and your example don't match at all. Your example is a mean of 3 monthly means, this month and the two previous. Your prescription is a month and the month previous.
Here is some technique that focuses on two possible meanings of your prescription.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte month int year byte amount
1 2010 26
1 2010 26
2 2010 30
3 2010 35
3 2010 35
end
gen mdate = ym(year, month)
format mdate %tm
foreach w in total mean count {
    egen `w' = `w'(amount), by(mdate)
}
gen wanted1 = (mean + mean[_n-1]) / 2 if mdate == mdate[_n-1] + 1
bysort mdate (wanted1) : replace wanted1 = wanted1[_n-1] if missing(wanted1)
gen wanted2 = (total + total[_n-1]) / (count + count[_n-1]) if mdate == mdate[_n-1] + 1
bysort mdate (wanted2) : replace wanted2 = wanted2[_n-1] if missing(wanted2)
list, sepby(mdate)
+----------------------------------------------------------------------------+
| month year amount mdate total mean count wanted1 wanted2 |
|----------------------------------------------------------------------------|
1. | 1 2010 26 2010m1 52 26 2 . . |
2. | 1 2010 26 2010m1 52 26 2 . . |
|----------------------------------------------------------------------------|
3. | 2 2010 30 2010m2 30 30 1 28 27.33333 |
|----------------------------------------------------------------------------|
4. | 3 2010 35 2010m3 70 35 2 32.5 33.33333 |
5. | 3 2010 35 2010m3 70 35 2 32.5 33.33333 |
+----------------------------------------------------------------------------+
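The two readings differ only in how duplicate rows within a month are weighted. As a cross-check of the listing, here is a minimal sketch in Python rather than Stata. The running month index `year * 12 + month` is an assumption of the sketch that plays the role of `mdate`, and the dictionary names are hypothetical.

```python
# (month, year, amount) rows from the example data
data = [(1, 2010, 26), (1, 2010, 26), (2, 2010, 30), (3, 2010, 35), (3, 2010, 35)]

# Monthly total and count, keyed by a running month index
stats = {}
for month, year, amount in data:
    key = year * 12 + month
    total, count = stats.get(key, (0, 0))
    stats[key] = (total + amount, count + 1)

wanted = {}
for key, (total, count) in stats.items():
    if key - 1 not in stats:
        # no data for the previous month: leave both results undefined
        wanted[key] = (None, None)
        continue
    prev_total, prev_count = stats[key - 1]
    # reading 1: mean of the two monthly means
    mean_of_means = (total / count + prev_total / prev_count) / 2
    # reading 2: pooled mean over all rows in the two months
    pooled_mean = (total + prev_total) / (count + prev_count)
    wanted[key] = (mean_of_means, pooled_mean)
```

For February 2010 this gives 28 and 82/3 (about 27.33), and for March 2010 it gives 32.5 and 100/3 (about 33.33), matching wanted1 and wanted2 above.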

SAS - Combine like values within rows, then add new variable for non like value(s)

I have a large dataset and am trying to run an analysis on each customer (same account and routing #), each of which has hundreds of transactions within the dataset. I was able to add a SEQ # for like acct #s and routing #s. How would I run an analysis for, say, SEQ #1 that gives the total # of deposits (Amount), the max and min of deposits, and potentially some other helpful data?
+-----------+--------+---------+--------+-------+
| Routing#  | Acct#  | AMOUNT  | TOTAL  | SEQ # |
+-----------+--------+---------+--------+-------+
| 518       | 0      | 490.50  | 3777.5 | 1     |
| 518       | 0      | 170.00  | 3777.5 | 1     |
| 518       | 0      | 3117.00 | 3777.5 | 1     |
| 518       | 99     | 875.00  | 875    | 2     |
| 518       | 999    | 499.00  | 499    | 3     |
| 519       | 2      | 100.00  | 200.00 | 4     |
| 519       | 2      | 100.00  | 200.00 | 4     |
+-----------+--------+---------+--------+-------+
Thanks
There are multiple ways to do this, but here is a data step way
data have;
input Routing Acct AMOUNT;
datalines;
518 0 490.50
518 0 170.00
518 0 3117.00
518 99 875.00
518 999 499.00
519 2 100.00
519 2 100.00
;
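data want;
    /* first pass over the BY group: accumulate the group total */
    do until (last.Acct);
        set have;
        by Routing Acct notsorted;
        total+amount;
    end;
    /* one sequence number per Routing/Acct group */
    seq+1;
    /* second pass over the same group: write each row with total and seq */
    do until (last.Acct);
        set have;
        by Routing Acct notsorted;
        output;
    end;
    /* reset the retained total before the next group */
    total=0;
run;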
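The two loops read each group twice: once to compute the total, once to write the rows out. The same two-pass idea is sketched below in Python rather than SAS, purely as an illustration. The `summary` dictionary and its keys are hypothetical additions that cover the count, max and min asked about in the question. `itertools.groupby` groups consecutive rows only, which is comparable to BY processing with `notsorted`.

```python
from itertools import groupby

# (Routing, Acct, AMOUNT) rows from the example data
have = [
    (518, 0, 490.50), (518, 0, 170.00), (518, 0, 3117.00),
    (518, 99, 875.00), (518, 999, 499.00),
    (519, 2, 100.00), (519, 2, 100.00),
]

want = []     # every input row, extended with the group total and seq
summary = {}  # one entry per seq with a few group statistics
seq = 0
for _, group in groupby(have, key=lambda row: (row[0], row[1])):
    rows = list(group)
    amounts = [amount for _, _, amount in rows]
    total = sum(amounts)  # first pass: group total
    seq += 1
    summary[seq] = {"n": len(amounts), "total": total,
                    "max": max(amounts), "min": min(amounts)}
    for routing, acct, amount in rows:  # second pass: write rows out
        want.append((routing, acct, amount, total, seq))
```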

Using summation to create a new variable

I have data that look like this:
| Country | Year | Firm | Profit |
|---------|------|------|--------|
| A | 1 | 1 | 10 |
| A | 1 | 2 | 20 |
| A | 1 | 3 | 30 |
| A | 1 | 4 | 40 |
I want to create a new variable that, for each firm i, calculates the sum over all firms j in the same country and year of max(Profit_j - Profit_i, 0).
For example, the value of the variable for firm 1 would be:
max(20 - 10, 0) + max(30 - 10, 0) + max(40 - 10, 0)
How can I do this in Stata by country and year?
Below is a direct solution to your problem (note the use of dataex for providing example data):
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 Country float(Year Firm Profit)
"A" 1 1 10
"A" 1 2 20
"A" 1 3 30
"A" 1 4 40
end
generate Wanted = -Profit
bysort Country Year (Wanted): replace Wanted = sum(Profit) - _n * Profit
list
+-----------------------------------------+
| Country Year Firm Profit Wanted |
|-----------------------------------------|
1. | A 1 4 40 0 |
2. | A 1 3 30 10 |
3. | A 1 2 20 30 |
4. | A 1 1 10 60 |
+-----------------------------------------+
The logic behind it is the following:
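One way to read the one-liner, offered as an interpretation rather than the answerer's own wording: within each group the rows are sorted from highest to lowest Profit, so at row _n the running sum covers exactly the _n firms whose profit is at least as large as the current one. Every term max(Profit_j - Profit_i, 0) for a firm with a lower profit is zero, so the wanted sum reduces to the running sum minus _n times the current Profit. The Python sketch below (hypothetical function names, not Stata) checks that shortcut against the literal definition, including a case with ties.

```python
def brute_force(profits):
    """Literal definition: for each i, sum of max(p_j - p_i, 0) over all j."""
    return [sum(max(pj - pi, 0) for pj in profits) for pi in profits]

def sorted_cumsum(profits):
    """Shortcut: sort descending, then running sum minus rank times profit."""
    order = sorted(range(len(profits)), key=lambda i: -profits[i])
    out = [0] * len(profits)
    running = 0
    for rank, i in enumerate(order, start=1):
        running += profits[i]
        out[i] = running - rank * profits[i]
    return out

example = [10, 20, 30, 40]  # the Profit column from the question
with_ties = [30, 40, 30]    # tied values contribute zero either way
```

For the example data both functions return 60, 30, 10 and 0 for firms 1 to 4, in line with the listing above.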
Note: This was the first answer posted. It didn't avoid the pitfall of taking the OP's algebra literally and wanting to implement the calculation in terms of maxima within groups. But I realised after posting that there must be a much simpler way of doing it and @Romalpa Akzo got there, which is excellent. I undeleted this on request because it does show some machinery for looping over groups and implementing a calculation for each group with a customised Mata function.
Here I write a Mata function to return the wanted result for a group and then loop over the groups to populate a pre-defined variable.
To test the code for a dataset with more than one group, I use mpg from Stata's auto toy dataset.
mata :
void wanted (string scalar varname, string scalar usename, string scalar resultname) {
    real scalar i
    real colvector x, result, zero
    result = x = st_data(., varname, usename)
    zero = J(rows(x), 1, 0)
    for(i = 1; i <= rows(x); i++) {
        result[i] = sum(rowmax((x :- x[i], zero)))
    }
    st_store(., resultname, usename, result)
}
end
sysuse auto, clear
sort foreign rep78 mpg
egen group = group(foreign rep78), label
summarize group, meanonly
local G = r(max)
generate wanted = .
generate touse = 0
quietly forvalues g = 1 / `G' {
    replace touse = group == `g'
    mata : wanted("mpg", "touse", "wanted")
}
How did that work out? Here are some results:
. list mpg wanted group if foreign, sepby(group)
+--------------------------+
| mpg wanted group |
|--------------------------|
53. | 21 7 Foreign 3 |
54. | 23 3 Foreign 3 |
55. | 26 0 Foreign 3 |
|--------------------------|
56. | 21 35 Foreign 4 |
57. | 23 19 Foreign 4 |
58. | 23 19 Foreign 4 |
59. | 24 13 Foreign 4 |
60. | 25 8 Foreign 4 |
61. | 25 8 Foreign 4 |
62. | 25 8 Foreign 4 |
63. | 28 2 Foreign 4 |
64. | 30 0 Foreign 4 |
|--------------------------|
65. | 17 84 Foreign 5 |
66. | 17 84 Foreign 5 |
67. | 18 77 Foreign 5 |
68. | 18 77 Foreign 5 |
69. | 25 42 Foreign 5 |
70. | 31 18 Foreign 5 |
71. | 35 6 Foreign 5 |
72. | 35 6 Foreign 5 |
73. | 41 0 Foreign 5 |
|--------------------------|
74. | 14 . . |
+--------------------------+
So, how would that be applied to your data?
clear
input str1 Country Year Firm Profit
A 1 1 10
A 1 2 20
A 1 3 30
A 1 4 40
end
egen group = group(Country Year), label
summarize group, meanonly
local G = r(max)
generate wanted = .
generate touse = 0
quietly forvalues g = 1/`G' {
    replace touse = group == `g'
    mata: wanted("Profit", "touse", "wanted")
}
Results:
. list Firm Profit wanted, sepby(group)
+------------------------+
| Firm Profit wanted |
|------------------------|
1. | 1 10 60 |
2. | 2 20 30 |
3. | 3 30 10 |
4. | 4 40 0 |
+------------------------+

Proc sql subquery based on nonexisitng column returns not null

Here is sample code derived from an actual application. There are two datasets: "aa" for the query and "bb" for the subquery. Column "m" in "aa" matches column "y" in "bb". Table "aa" also has a column "yy" whose value is always 30. Column "m" in "aa" contains the value 30 in one of its rows; column "y" in "bb" does not.
The first proc sql block uses the values in column "y" of "bb" to subset "aa" on matching values in column "m". It is a correct query and produces the expected results.
The second proc sql block is identical except that column "y" is intentionally misspelled as "yy" in the subquery, on the line that starts with the where clause. Given that there is no column "yy" in "bb", I would expect an error message and the whole query to fail. However, it returns one row without failing or issuing an error. A closer look suggests that it actually uses column "yy" from table "aa" (see the tree in the log output).
I do not think this is correct behavior. If you have any comments or explanations, I would greatly appreciate them. Otherwise, maybe I should report it to SAS as a bug. Thank you!
Here is the code:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
yy=30;
output;
end;
run;
data bb;
do i=10 to 20;
y=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Correct sql command*/
proc sql _method
_tree
;
create table cc as
select *
from aa
where m in (select y from bb)
;quit;
/*Incorrect sql command - column "yy" in not on "bb" table"*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select yy from bb)
;quit;
Here is log with sql tree:
119 options
120 msglevel = I
121 ;
122 data aa;
123 do i=1 to 20;
124 m=i*5;
125 yy=30;
126 output;
127 end;
128 run;
NOTE: The data set WORK.AA has 20 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
129
130 data bb;
131 do i=10 to 20;
132 y=i*5;
133 output;
134 end;
135 run;
NOTE: The data set WORK.BB has 11 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
136 option DEBUG=JUNK ;
137
138 /*Correct sql command*/
139 proc sql _method
140 _tree
141 ;
142 create table cc as
143 select *
144 from aa
145 where m in (select y from bb)
146 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-V-(bb.y:2 flag=0001)
| | /-OBJ----|
| | /-SRC----|
| | | \-TABL[WORK].bb opt=''
| \-SUBC---|
--SSEL---|
NOTE: Table WORK.CC created, with 11 rows and 3 columns.
146! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
147
148
149 /*Incorrect sql command - column "yy" in not on "bb" table"*/
150 proc sql _method
151 _tree;
152 create table dd as
153 select *
154 from aa
155 where m in (select yy from bb)
156 ;
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxsrc( WORK.AA )
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps
sqxsrc( WORK.BB )
Tree as planned.
/-SYM-V-(aa.i:1 flag=0001)
/-OBJ----|
| |--SYM-V-(aa.m:2 flag=0001)
| \-SYM-V-(aa.yy:3 flag=0001)
/-FIL----|
| | /-SYM-V-(aa.i:1 flag=0001)
| | /-OBJ----|
| | | |--SYM-V-(aa.m:2 flag=0001)
| | | \-SYM-V-(aa.yy:3 flag=0001)
| |--SRC----|
| | \-TABL[WORK].aa opt=''
| | /-SYM-V-(aa.m:2)
| \-IN-----|
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag=0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
--SSEL---|
NOTE: Table WORK.DD created, with 1 rows and 3 columns.
156! quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
Here are datasets:
aa:
i m yy
1 5 30
2 10 30
3 15 30
4 20 30
5 25 30
6 30 30
7 35 30
8 40 30
9 45 30
10 50 30
11 55 30
12 60 30
13 65 30
14 70 30
15 75 30
16 80 30
17 85 30
18 90 30
19 95 30
20 100 30
bb:
i y
10 50
11 55
12 60
13 65
14 70
15 75
16 80
17 85
18 90
19 95
20 100
I agree, this looks pretty weird and may well be a bug. I was able to reproduce this from the code you provided in SAS 9.4 and in SAS 9.1.3, which would make it at least ~12 years old.
In particular, I'm interested in this bit of the output you got from the _method option when creating the DD table but not when creating the CC table:
NOTE: SQL subquery execution methods chosen are:
sqxsubq
sqxreps <--- What is this doing?
sqxsrc( WORK.BB )
Similarly, the corresponding section from the _tree output is highly obscure:
| | /-SYM-A-(#TEMA001:1 flag=0035)
| | /-OBJ----|
| | /-REPS---|
| | | |--empty-
| | | |--empty-
| | | | /-OBJ----|
| | | |--SRC----|
| | | | \-TABL[WORK].bb opt=''
| | | |--empty-
| | | |--empty-
| | | | /-SYM-A-(#TEMA001:1 flag=0035)
| | | | /-ASGN---|
| | | | | \-SUBP(1)
| | | \-OBJE---|
| \-SUBC---|
| \-SYM-V-(aa.yy:3)
I have never seen sqxreps or reps in the respective bits of output before. Neither of them is listed in any of the papers I was able to find based on a brief bit of googling (in fact, this question is currently the only hit on Google for sas + sqxreps):
http://support.sas.com/resources/papers/proceedings10/139-2010.pdf
http://www2.sas.com/proceedings/sugi30/101-30.pdf
Quoting the first of these:
Codes Description
sqxcrta Create table as Select
Sqxslct Select
sqxjsl Step loop join (Cartesian)
sqxjm Merge join
sqxjndx Index join
sqxjhsh Hash join
sqxsort Sort
sqxsrc Source rows from table
sqxfil Filter rows
sqxsumg Summary stats with GROUP BY
sqxsumn Summary stats with no GROUP BY
Based on a bit of quick testing, this seems to happen regardless of the variable and tables names used, provided that the variable name from AA is repeated multiple times in the subquery referencing table BB. It also happens if you have a variable named e.g. YYY in AA but one named YY in BB, or more generally whenever you have a variable in BB whose name is initially the same as the name of the corresponding variable in AA but then continues for one or more characters.
From this, I'm guessing at some point in the SQL parser, someone used a like operator rather than checking for equality of variable names, and somehow as a result this syntax is triggering an undocumented or incomplete 'feature' in proc sql.
An example of the more general case:
options
msglevel = I
;
data aa;
do i=1 to 20;
m=i*5;
myvar_plus_suffix=30;
output;
end;
run;
data bb;
do i=10 to 20;
myvar=i*5;
output;
end;
run;
option DEBUG=JUNK ;
/*Incorrect sql command - column "myvar_plus_suffix" is not on "bb" table*/
proc sql _method
_tree;
create table dd as
select *
from aa
where m in (select myvar_plus_suffix from bb)
;quit;
Here is a response from SAS support.
What you are seeing is related to column scoping in PROC SQL.
PROC SQL supports Correlated Subqueries. A Correlated Subquery references a column in the "outer" table which can then be compared to columns in the "inner" table. PROC SQL does not require that a fully qualified column name is used. As a result, if it sees a column in the subquery that does not exist in the inner table (the table referenced in the subquery), it looks for that column in the "outer" table and uses the value if it finds one.
If a fully qualified column name is used, the error you are expecting will occur such as the following:
proc sql;
create table dd as
select *
from aa as outer
where outer.m in (select inner.yyy from bb as inner);
quit;
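This scoping rule is not specific to SAS. As an illustration only, the sketch below reproduces the three cases with Python's built-in sqlite3 module. It uses SQLite, not PROC SQL, so it shows the general SQL name-resolution behaviour and says nothing about SAS internals. The table and column names follow the question; the Python variable names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE aa (i INTEGER, m INTEGER, yy INTEGER)")
con.execute("CREATE TABLE bb (i INTEGER, y INTEGER)")
con.executemany("INSERT INTO aa VALUES (?, ?, ?)",
                [(i, i * 5, 30) for i in range(1, 21)])
con.executemany("INSERT INTO bb VALUES (?, ?)",
                [(i, i * 5) for i in range(10, 21)])

# Correct subquery: y is a column of bb
correct = con.execute(
    "SELECT * FROM aa WHERE m IN (SELECT y FROM bb)").fetchall()

# Unqualified yy is not in bb, so it resolves to the outer table's aa.yy
unqualified = con.execute(
    "SELECT * FROM aa WHERE m IN (SELECT yy FROM bb)").fetchall()

# Qualifying the column with the inner table forces the expected error
try:
    con.execute("SELECT * FROM aa WHERE m IN (SELECT bb.yy FROM bb)")
    qualified_error = None
except sqlite3.OperationalError as exc:
    qualified_error = str(exc)
```

The unqualified query returns the single row of aa where m equals 30, the same outcome as table DD in the question, while the qualified query fails because bb has no column yy.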