How to calculate a time-varying historical mean with Stata

How can I calculate the mean of X using an expanding window with at least four observations?
Here is a numeric example:
clear
input X
50.735469
48.278413
42.807671
49.247854
52.20223
49.726689
50.823169
49.099351
48.949562
47.410434
46.654168
44.924652
43.807024
45.679814
48.366395
49.883396
48.230502
49.869179
53.942757
56.167884
56.226512
56.25608
58.765728
62.077038
62.780799
61.858235
61.167646
60.671859
60.480263
60.226433
61.65349
60.769882
61.497553
60.146182
60.292934
60.173739
58.60077
58.445601
60.404868
end

A time-varying mean in an expanding window is, put another way, the mean of all values from the start of the records up to the current date. You don't give a time variable, so I assume the data are in order and supply a time variable.
The community-contributed command rangestat (to be installed from SSC using ssc install rangestat) can give the mean of all values to date in this way:
clear
input X
50.735469
48.278413
42.807671
49.247854
52.20223
49.726689
50.823169
49.099351
48.949562
47.410434
end
gen t = _n
rangestat (count) X (mean) X, int(t . 0)
list
+-------------------------------------+
| X t X_count X_mean |
|-------------------------------------|
1. | 50.73547 1 1 50.73547 |
2. | 48.27841 2 2 49.506941 |
3. | 42.80767 3 3 47.273851 |
4. | 49.24785 4 4 47.767351 |
5. | 52.20223 5 5 48.654327 |
|-------------------------------------|
6. | 49.72669 6 6 48.833054 |
7. | 50.82317 7 7 49.117356 |
8. | 49.09935 8 8 49.115105 |
9. | 48.94956 9 9 49.096711 |
10. | 47.41043 10 10 48.928084 |
+-------------------------------------+
Evidently you can ignore results for small counts as you please.
The syntax is explained in the help for rangestat: suffice it to say here that the option -- namely interval(t . 0) -- has three parts:
the time variable t
a lower offset: system missing . here means arbitrarily far back
an upper offset: just 0, the current time
In mathematical terms, the mean is taken over a window from minus infinity (or as far back as the data reach) to an offset of 0, the present.
The count result is the number of observations in the window with non-missing values on X. Here, as the time variable runs from 1 upwards, the count is trivially the same as the time variable, but in real problems the time variable is much more likely to be a date of some kind. Unlike some other commands, rangestat doesn't have an option to insist on a minimum number of points with non-missing values in a window, but you can count how many there are and ignore results based on too few data. That is left to the user here.
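For example, to enforce the minimum of four observations asked about, you could blank out means computed from fewer points, using the X_count and X_mean variables created by the rangestat call above:
replace X_mean = . if X_count < 4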
Incidentally, you could make a good start on this kind of problem by working out a cumulative sum and then dividing by the number of values so far. That needs care with (e.g.) gaps in the data, irregularly spaced data, or missing values; a virtue of rangestat is that all such difficulties are handled.
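Here is a minimal sketch of that cumulative-sum approach, assuming data already in time order with no gaps; the variable names runsum, runcount, and runmean are my own:
gen double runsum = sum(X)               // sum() is Stata's running sum; missings count as 0
gen double runcount = sum(!missing(X))   // running count of non-missing values
gen double runmean = runsum / runcount
replace runmean = . if runcount < 4      // enforce the minimum of four observations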

Related

How can I replace observations for multiple variables with the same prefix?

I am attempting to identify which observations are below 15 seconds for the timer variables. First, I generated the variable speed and attempted to set it to 1 for all observations where the variables that start with timer are below 15.
gen speed = 0
replace speed = 1 if timer* <15
However, Stata is telling me,
timer* invalid name
r(198);
What might be going on? I'm not sure how to attach my data from Stata here, any insight into that would be appreciated too.
The Stata tag wiki has massively detailed information on how to post a data example.
What is going on is simply that Stata doesn't support your syntax. Indeed, it is not even clear what it might mean.
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen timer1 = 10
. gen timer2 = 20
. list
+-----------------+
| timer1 timer2 |
|-----------------|
1. | 10 20 |
+-----------------+
. gen wanted1 = min(timer1, timer2) < 15
. gen wanted2 = max(timer1, timer2) < 15
. l
+-------------------------------------+
| timer1 timer2 wanted1 wanted2 |
|-------------------------------------|
1. | 10 20 1 0 |
+-------------------------------------+
.
One guess is that you want an indicator that is 1 if any of the variables timer* is less than 15, in which case you need to compute the minimum over those variables within an observation and compare it with 15. Another guess is that you want an indicator that is 1 if all of the variables timer* are less than 15, in which case you first need to compute the maximum within an observation. For the simple example above, the functions min() and max() serve well. For a dataset with many more variables in timer*, you will find it more convenient to reach for the egen functions rowmin() and rowmax(), as sketched below.
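Here is a minimal sketch with egen, assuming the timer variables all share the prefix timer; the names mintimer, maxtimer, any_fast, and all_fast are made up for illustration:
egen mintimer = rowmin(timer*)
egen maxtimer = rowmax(timer*)
gen any_fast = mintimer < 15     // 1 if any timer variable is below 15
gen all_fast = maxtimer < 15     // 1 if all timer variables are below 15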
There are other ways to do it, and other wilder guesses at what you want, but I will stop there.

Collapse with weights -- how to get the sample count, not the population count?

I'm collapsing my data using weight, but I only want the weight to apply to my median and sum, not my count. I want my count to only be the sample size, not the population size.
Example:
. input outcome group weight
outcome group weight
1. 1 1 3
2. 1 2 3
3. 1 3 3
4. end
Running collapse (sum) outcome (count) n = outcome [pweight = weight], by(group) gives
. list
+---------------------+
| group outcome n |
|---------------------|
1. | 1 3 3 |
2. | 2 3 3 |
3. | 3 3 3 |
+---------------------+
Both the sum and count are using the weight. I want the count to be the sample size, i.e. 1 for each group.
Unfortunately it is not possible to have different weights when using collapse.
A few solutions come to mind:
create the weights yourself in the data and compute your weighted statistics yourself (see the sketch below)
have a look at user-written variants of collapse, which might include this feature, for instance collapse2 or xcollapse
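A minimal sketch of the first route, under the example data above: run collapse twice, weighting only the statistics that should be weighted, and combine the results (counts is a hypothetical tempfile name):
preserve
collapse (count) n = outcome, by(group)              // unweighted sample count
tempfile counts
save `counts'
restore
collapse (sum) outcome [pweight = weight], by(group) // weighted statistics
merge 1:1 group using `counts', nogenerate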

Show all types of missing values for a variable (sysmiss and extended missing values)

I have data containing three different types of missing values, the "usual" ones . and extended missing values .a and .b.
As I am working with numeric questionnaires, the sysmiss . are not interesting to me since they mean that the respondent just didn't reach this question (for a filtered question).
Extended missing values .a .b are "real" missing values (didn't answer/didn't know).
I would like to present a table showing the number of missing values for each kind, for example
Variable | (.) | .a | .b
__________________________________________________
Income | 9 | 15 | 2
Any ideas on how to create such a table? I looked at different commands in Stata (tabmiss, missings, missing sum) without finding a clear answer so far.
Here is an example that may point you in a useful direction.
clear
input x y z
1 1 1
. . .
3 .a .b
.b 4 .a
.a .a 5
end
list, clean
gen seqno = _n                      // observation identifier for the reshape
rename (x y z) (vbl=)               // add a common stub so reshape long can find them
reshape long vbl, i(seqno) j(variable) string
list, clean
rename vbl value
drop if !missing(value)             // keep only the missing values
tab variable value, missing         // counts of ., .a, .b for each variable
| value
variable | . .a .b | Total
-----------+---------------------------------+----------
x | 1 1 1 | 3
y | 1 2 0 | 3
z | 1 1 1 | 3
-----------+---------------------------------+----------
Total | 3 4 2 | 9

To create highest & lowest quartiles of a variable in Stata

This is the Stata code I used to divide a Winsorised & centred variable (num_exp, denoting number of experienced managers) based on 4 quartiles & thereafter to generate the highest & lowest quartile dummies thereof:
egen quartile_num_exp = xtile(WC_num_exp), n(4)
gen high_quartile_numexp = 1 if quartile_num_exp==4
(1433 missing values generated);
gen low_quartile_num_exp = 1 if quartile_num_intlexp==1
(1062 missing values generated);
Thanks everybody - here's the link
https://dl.dropboxusercontent.com/u/64545449/No%20of%20expeienced%20managers.dta
I did try both Aspen Chen's & Roberto's suggestions. Chen's way of creating the high-quartile dummy gives the same results as I had earlier, and with Roberto's, both quartile dummies show 1 for the same rows - how is that possible?
I forgot to mention here that there are indeed many ties - the range of the original variable W_num_exp is from 0 to 7, the mean being 2.126618; I subtracted that from each observation of W_num_exp to get WC_num_exp.
tab high_quartile_numexp shows the same problem I originally had
le_numexp | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,433 80.64 80.64
1 | 344 19.36 100.00
------------+-----------------------------------
Total | 1,777 100.00
Also, I checked that egenmore is already installed in my Stata 13.1.
What I fail to understand is why the dummy variable based on the highest quartile doesn't have 75% of observations below it (I have 1,777 total observations): to my understanding, this dummy should mark a cut-off above which exactly 25% of the total number of observations lie, yet it flags only 19.36% of observations.
Am I doing anything wrong in writing the correct Stata code for high_quartile low_quartile dummy variables?
Consider the following code:
clear
set more off
sysuse auto
keep make mpg
*-----
// your way (kind of)
egen mpg4 = xtile(mpg), nq(4)    // xtile() is an egenmore function
gen lowq = mpg4 == 1
gen highq = mpg4 == 4
*-----
// what you want
summarize mpg, detail            // leaves percentiles behind as r(p25), r(p75), ...
gen lowq2 = mpg < r(p25)
gen highq2 = mpg > r(p75)        // note >, not <: the top quartile lies above r(p75)
*-----
summarize high* low*
list
Now check the listing to see what's going on.
See help stored results.
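To see what summarize leaves behind, you can list its stored results; a quick illustration with the auto data:
sysuse auto, clear
summarize mpg, detail
return list                      // shows r(p25), r(p75), and the rest
display r(p25) " " r(p75)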
The dataset provided answers the question. Consider the tabulation:
. tab W_num_exp
num_execs_i |
ntl_exp, |
Winsorized |
fraction |
.01 | Freq. Percent Cum.
------------+-----------------------------------
0 | 297 16.71 16.71
1 | 418 23.52 40.24
2 | 436 24.54 64.77
3 | 282 15.87 80.64
4 | 171 9.62 90.26
5 | 109 6.13 96.40
6 | 34 1.91 98.31
7 | 30 1.69 100.00
------------+-----------------------------------
Total | 1,777 100.00
Exactly equal numbers in each of 4 quartile-based bins can be obtained if, and only if, there are values with cumulative percents of exactly 25, 50, and 75. No such values exist here. You have to make do with approximations. The approximations can be lousy, but the only alternative, of arbitrarily assigning observations with the same value to different bins to even up frequencies, is statistically indefensible.
(The number of observations needing to be a multiple of 4 for 4 bins, etc., for exactly equal frequencies is also a complication, which bites hard for small datasets, but that is not the major issue here.)
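Here is a small demonstration of the point with made-up data (it assumes egenmore's xtile() is installed):
clear
set obs 8
gen x = cond(_n <= 5, 1, _n)    // five tied values of 1, then 6 7 8
egen q4 = xtile(x), nq(4)
tab q4                          // bin frequencies are unavoidably unequal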

Lead and Lag functions in Informatica

How can we use Lead and Lag functions in Informatica?
Name | No.
------------
X | 100
Y | 200
Z | 300
I have to convert it to:
Name | No. | Lead(No.)
-----------------------------
X | 100 | 200
Y | 200 | 300
Z | 300 | 100
Name | No. | Lag(No.)
----------------------------
X | 100 | 0
Y | 200 | 100
Z | 300 | 200
The logic I used was:
EXP Transformation
Name (input & Output Port)
No. (input)
O_No.(VAR)=IIF(Prv_no IS NULL,0,No.)
Prv_no.(VAR)=No.
This was for Lag Function.
I've never done that, but I would use the order of evaluation. In an Expression transformation, input ports are evaluated before variable ports, which in turn are evaluated before output ports. Also, the rows are read one at a time.
If you send sorted data to the Expression transformation, you can simulate the lag() function. For lead(), reverse the sort.
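A hedged sketch of the port setup for lag, with made-up port names; variable ports are evaluated top to bottom, so v_LAG_NO must read v_PREV_NO before that port is overwritten:
No (input)
v_LAG_NO (variable) = v_PREV_NO -- still holds the previous row's No
v_PREV_NO (variable) = No -- now updated to the current row
o_LAG_NO (output) = IIF(ISNULL(v_LAG_NO), 0, v_LAG_NO)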
Multiple ways:
1. Use a Rank transformation: check the rank (R) option for No.; set Top/Bottom to Top with Number of Ranks 1 to get the lead value, and Top/Bottom to Bottom with Number of Ranks 1 to get the lag value. Then, using these values in an Expression transformation, you can implement the lead/lag function.
2. Use a SQL query in the Source Qualifier, selecting MAX() and MIN() to get the respective lead and lag values.
You can use Informatica expressions to achieve lead and lag. For every incoming row, make two output ports, one for the lead and one for the lag calculation. Use a mapping parameter to configure the lead and lag variables for the run. After the Expression transformation, connect the ports to two different targets.