Iterate through variables in a regression in SAS/JMP - sas

I'm trying to take a set of independent variables and test if they are (statistically significantly) differently-correlated to two groups of data.
I've been advised that the way to do this in JMP is to make a series of linear regressions like the following,
result = group + varA + group*varA
and then examine the significance of the interaction effect, e.g., the "Prob > F" column in this "Country*Displacement" example: http://i.stack.imgur.com/EcCdd.png (I don't have the reputation to post an image.)
Now, I need to be able to switch out one of these variables; that is, for a list of ~350 variables, say varA, varB, etc., I need to run the following regressions,
result = group + varA + group*varA
result = group + varB + group*varB
result = group + varC + group*varC
...
and get the significance of that interaction effect. Previous attempts at scripting have resulted in ~350 results windows or ~350 model dialogs. Any advice would be appreciated.
Edit:
For example, using the Airline Delays JMP sample data set, this is the result from one of the steps: http://i.stack.imgur.com/HVFL8.png. I need to extract the significance of the interaction effect (the 0.1397 under Effect Tests) for each of a set of variables; for example, interchanging the "Distance" variable with "Elapsed Time". But I need to interchange this variable for each in a set of ~350.

Assuming you know how to loop through these variables, the following will get you the effect p-values for one model:
fit = Fit Model(
Y( :Arrival Delay ),
Effects( :Distance, :Day of Week, :Distance * :Day of Week ),
Personality( Standard Least Squares ),
Emphasis( Minimal Report ),
Run(
:Arrival Delay << {Lack of Fit( 0 ), Plot Actual by Predicted( 0 ),
Plot Residual by Predicted( 0 ), Plot Effect Leverage( 0 )}
)
);
hash = associative array(fit<<Get Effect Names, fit<<Get Effect PValues);
value = hash["Distance*Day of Week"];
Then close the report with fit << Close Window; and move on to the next variable.
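The bookkeeping for that loop can be sketched outside JMP. The Python snippet below is only an illustration, not JSL: for each candidate variable it builds the three effects of one model and the interaction name to look up in the effect table. The helper name `effects_for` and the candidate list are hypothetical; the `Distance*Day of Week` naming follows the key used in the script above.

```python
def effects_for(candidate, group):
    """Return the effect list for one model: main effect, group, interaction."""
    return [candidate, group, f"{candidate}*{group}"]

# Hypothetical candidates; in practice this would be the ~350 column names.
candidates = ["Distance", "Elapsed Time"]
group = "Day of Week"

# One entry per model: which interaction p-value to pull from the fit.
lookup_keys = {var: effects_for(var, group)[2] for var in candidates}

for var, key in lookup_keys.items():
    print(f"model for {var}: read p-value of '{key}'")
```

Inside JMP, the same loop would fit the model, read the p-value under that key, store it, and close the report before the next iteration.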

Related

How to take maximum of rolling window without SSC packages

How can I create a variable in Stata that contains the maximum of a dynamic rolling window applied to another variable? The rolling window must be able to change iteratively within a loop.
max(numvar, L1.numvar, L2.numvar) will give me what I want for a single window size, but how can I change the window size iteratively within the loop?
My current code for calculating the rolling sum (credit to @Nick Cox for the algorithm):
generate var1lagged = 0
forval k = -2(1)2 {
if `k' < 0 {
local k1 = -(`k')
replace var1lagged = var1lagged + L`k1'.var1
}
else replace var1lagged = var1lagged + F`k'.var1
}
How would I achieve the same sort of flexibility but with the maximum, minimum, or average of the window?
In the simplest case, suppose the local macro K (at least 1) holds the number of lags in the window:
local arg L1.numvar
forval k = 2/`K' {
local arg `arg', L`k'.numvar
}
gen wanted = max(`arg')
If the window includes the present value, that is just a twist
local arg numvar
forval k = 1/`K' {
local arg `arg', L`k'.numvar
}
gen wanted = max(`arg')
More generally, numvar would not be a specific variable name, but would be a local macro including such a name.
EDIT 1
This returns missing as a result only if all arguments are missing. If you wanted to insist on missing as a result if any argument is missing, then go
gen wanted = cond(missing(`arg'), ., max(`arg'))
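To make the two missing-value rules concrete, here is a small Python sketch (not Stata) of the same window logic for a single time series. `None` stands in for Stata's missing value, and the function name and the `strict` and `include_current` flags are made up for this illustration.

```python
def rolling_max(values, lags, include_current=False, strict=False):
    """Maximum over lags 1..lags (or 0..lags if include_current).

    strict=False: missing only if every value in the window is missing.
    strict=True:  missing if any value in the window is missing.
    """
    first = 0 if include_current else 1
    out = []
    for t in range(len(values)):
        # A lag that reaches before the start of the series is missing.
        window = [values[t - k] if t - k >= 0 else None
                  for k in range(first, lags + 1)]
        present = [v for v in window if v is not None]
        if not present or (strict and len(present) < len(window)):
            out.append(None)
        else:
            out.append(max(present))
    return out

x = [3, 1, None, 5, 2]
print(rolling_max(x, 2))                        # lags 1 and 2 only
print(rolling_max(x, 2, strict=True))           # any missing -> missing
print(rolling_max(x, 2, include_current=True))  # window includes present value
```

Changing `lags` inside a loop plays the role of rebuilding the local macro `arg` for a new K.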
EDIT 2
Check out rolling more generally. Otherwise, for a rolling mean that you calculate directly, you need to work out (1) the sum, as in the question, and (2) the number of non-missing values; the mean is their ratio.
The working context of the OP evidently rules out installing community-contributed commands; otherwise I would recommend rangestat and rangerun (SSC). Note that many community-contributed commands have been published via the Stata Journal, GitHub or user sites.
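The sum-and-count recipe for a rolling mean can be sketched in Python (again, not Stata). The window here runs from `half_width` periods back to `half_width` periods ahead, mirroring the -2 to 2 loop in the question; `None` marks a missing value, and the function name is hypothetical.

```python
def centered_mean(values, half_width):
    """Mean of the non-missing values in a window of +/- half_width periods."""
    n = len(values)
    out = []
    for t in range(n):
        total, count = 0.0, 0
        for k in range(-half_width, half_width + 1):
            # Skip positions outside the series and missing values.
            if 0 <= t + k < n and values[t + k] is not None:
                total += values[t + k]   # (1) running sum
                count += 1               # (2) number of non-missing values
        out.append(total / count if count else None)
    return out

series = [1.0, 2.0, None, 4.0]
print(centered_mean(series, 1))
```

A window with no non-missing values yields `None` rather than a division by zero.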

PowerBI empty values row not displayed

I have a confusing mystery...
Simple DIVIDE formula works correctly. However blank rows are not displayed.
I attempted a different method using IF, and now the blank row is correctly displayed.
However this line is only displayed if I include the IF formula (which gives a zero value I don't want).
Formula 1:
Completion % =
DIVIDE(SUM(Courses[Completed]),SUM(Courses[Attended]),BLANK())
Formula 2:
Completion % with IF =
IF(SUM(Courses[Attended])=0,0,DIVIDE(SUM(Courses[Completed]),SUM(Courses[Attended])))
With only the DIVIDE formula:
Including the IF formula:
It appears that Power BI is capable of showing this row without error, but only if I include the additional IF formula. I'm guessing it's because there is now a value (0) to display.
However, I want to be able to show all courses, including those that have no values, without the inaccurate zero value.
I don't understand why the table doesn't include these lines. Can anyone explain/help?
The point is simple: by default, Power BI shows only the items for which at least one measure is non-blank.
Under the hood, DIVIDE behaves roughly like the following:
IF(B = 0, BLANK(), A / B)
(In DAX a blank B also satisfies B = 0, so both a zero and a blank denominator return BLANK.)
You can change this behaviour by supplying the optional third parameter so that it returns 0 instead of BLANK:
DIVIDE(A, B, 0) behaves roughly like the following:
IF(B = 0, 0, A / B)
Proposed solution
The approaches mentioned above might all solve your problem; however, my personal suggestion is simply to enable the "Show items with no data" option in your visualization.
While DIVIDE(A, B, 0) will return zero when B is zero or blank, I think a blank A will still return a blank.
One possibility is to simply append +0 (or prepend 0+) to your measure so that it always returns a numeric value.
DIVIDE ( SUM ( Courses[Completed] ), SUM ( Courses[Attended] ) ) + 0
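The blank-handling rules discussed in these answers can be summarized with a toy Python model. This is not Power BI's implementation, only a sketch of the behaviour described above: `None` stands in for BLANK, and `divide` and `plus_zero` are hypothetical names.

```python
BLANK = None  # stand-in for DAX BLANK()

def divide(numerator, denominator, alternate=BLANK):
    """Toy model of DIVIDE: zero or blank denominator returns the alternate."""
    if denominator is BLANK or denominator == 0:
        return alternate
    if numerator is BLANK:
        return BLANK  # blank numerator stays blank, even with alternate=0
    return numerator / denominator

def plus_zero(value):
    """Toy model of 'measure + 0': a blank becomes 0."""
    return 0 if value is BLANK else value

print(divide(BLANK, BLANK))             # blank, so the row is hidden by default
print(divide(BLANK, BLANK, 0))          # 0, so the row is shown
print(divide(BLANK, 5, 0))              # still blank
print(plus_zero(divide(BLANK, 5)))      # 0
```

Only the last form guarantees a number for every row. Keeping the blank and still showing the row is what the "Show items with no data" option is for.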

plsql if control different calculation

I am using PL/SQL to implement the logic below.
I have a table of piece-level data with each piece's weight. Basically, I want the following:
if the piece weight is over 1 LB, group by ceil(weight) (next LB)
if the piece weight is less than 1 LB, group by ceil(weight*16) (next OZ)
I am curious how to do that in PL/SQL. I feel I need an IF statement, but I am not sure how to write it.
(Weight is already a column in that table; do I need to declare it here?)
begin
if weight <1 then
select ceil(weight*16),sum(weight)
from ops_owner.track_mail_item
where manifestdate = '24-aug-2016'
group by ceil(weight*16)
else select ceil(weight),sum(weight)
from ops_owner.track_mail_item
where manifestdate = '24-aug-2016'
end if,
end;
Thank you very much!
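I would adjust the weight value in an inline view, I think. Carry the unit along as well, so that an 8 OZ bucket is not merged with an 8 LB bucket:
select
weight_unit,
ceil(adjusted_weight) as weight_bucket,
sum(weight) as total_weight_lb
from
(
select
weight,
-- pieces under 1 LB are bucketed in ounces, everything else in pounds
case when weight < 1 then 'OZ' else 'LB' end weight_unit,
case
when weight < 1 then weight * 16
else weight
end adjusted_weight
from
ops_owner.track_mail_item
where
manifestdate = '24-aug-2016'
)
group by
weight_unit,
ceil(adjusted_weight);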
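One thing to watch with any bucketing of this kind is that an ounce bucket and a pound bucket can share the same number (8 OZ and 8 LB, for example), so the unit belongs in the grouping key. The Python sketch below illustrates the rule on made-up weights; the function names are hypothetical and the data are not from the table in the question.

```python
from collections import defaultdict
from math import ceil

def bucket(weight_lb):
    """Next ounce for pieces under 1 LB, next pound otherwise."""
    if weight_lb < 1:
        return ("OZ", ceil(weight_lb * 16))
    return ("LB", ceil(weight_lb))

def summarize(weights_lb):
    """Total weight (in pounds) per (unit, bucket) group."""
    totals = defaultdict(float)
    for w in weights_lb:
        totals[bucket(w)] += w
    return dict(totals)

# 0.5 LB is 8 OZ; 7.5 LB rounds up to 8 LB. Same number, different groups.
print(summarize([0.5, 7.5, 0.5]))
```

Grouping on the number alone would add the two half-pound pieces to the 7.5 LB piece.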

Generating rolling z-scores of panel data in Stata

I have an unbalanced panel data set (countries and years). For simplicity let's say I have one variable, x, that I am measuring. The panel data are sorted first by country (a 3-digit numeric country code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of the variable x. The variable should be standardized by subtracting the mean of the preceding (exclusive) m time periods, and then dividing by the standard deviation of those same time periods. If this is not possible, return a missing value.
Currently, the code I am using to accomplish this is the following (edited now for clarity)
xtset weocountrycode year
sort weocountrycode year
local win_len = 5 // Defining rolling window length.
quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 (weocountrycode year) using stats_x
replace sd_x = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n] // This and next line are for deleting values that rolling calculates when I actually want missing values.
replace mean_`x' = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_`x' = (`x' - mean_`x'[_n-1])/sd_`x'[_n-1] // calculate z-score
UPDATE:
My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1,2,3,4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average out). In fact, it does this in general - if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ..... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.
I tried adding an if clause to the -rolling- statement:
quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x
But it did not fix the problem and the output is "weird" in the sense that
1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9.
2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc. which makes no sense to me.
Which leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.
Does this help?
Thanks!
I'm not sure I understand completely but I'll try to answer based on what I think your problem is, and based on a comment by @NickCox.
You say:
... when rolling is set up to use a window length 5 rolling mean...
if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ...
This is expected. help rolling states:
The window size refers to calendar periods, not the number of observations. If there are missing data (for example, because of weekends), the actual number of observations used by command may be less than window(#).
It's not actually doing a "length 1 rolling average", but I get to that later.
Below are some examples to see what rolling does:
clear all
set more off
*-------------------------- example data -----------------------------
set obs 92
gen dat = _n - 1
format dat %tq
egen seq = fill(1 1 1 1 2 2 2 2)
tsset dat
tempfile main
save "`main'"
list in 1/12, separator(4)
*------------------- Example 1. None missing ------------------------
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------- Example 2. All but one value, missing in first window ------
use "`main'", clear
replace seq = . in 1/3
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------------- Example 3. All missing in first window --------------
use "`main'", clear
replace seq = . in 1/4
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
Note that I use the stepsize() option to make things much easier to follow. Because the date variable is in quarters, I set window(4) and stepsize(4) so rolling is just computing averages by year. I hope that's easy to see.
Example 1 does as expected. No problem here.
Example 2 on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).
Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:
rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)
This says: if the number of non-missings for the window (r(N), also computed by summarize), is not the same as the window size, then replace it with missing.
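The same "count the non-missing values, then blank out incomplete windows" idea carries over to the z-score the question asks for. The Python sketch below is not Stata and handles a single country's series only; `None` marks a missing value, `rolling_z` is a hypothetical name, and the sample standard deviation (n - 1 denominator) is used, as summarize reports.

```python
from statistics import mean, stdev

def rolling_z(x, m):
    """z-score of x[t] against the m preceding values (x[t] itself excluded).

    Returns None unless all m preceding values exist and are non-missing.
    """
    out = []
    for t, value in enumerate(x):
        window = x[max(0, t - m):t]
        complete = len(window) == m and all(v is not None for v in window)
        if value is None or not complete:
            out.append(None)
            continue
        sd = stdev(window)
        # A constant window has sd 0, so the z-score is undefined.
        out.append((value - mean(window)) / sd if sd > 0 else None)
    return out

series = [1.0, 2.0, 3.0, 4.0, None, 6.0, 7.0, 8.0, 11.0]
print(rolling_z(series, 3))
```

With a panel, the function would be applied to each country separately so that a window never spans two countries.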

Perform Fisher Exact Test from aggregated using Stata

I have a set of data like below:
A B C D
1 2 3 4
2 3 4 5
They are aggregated data in which A, B, C, and D in each row constitute a 2x2 table. I need to do a Fisher exact test on each row and add a new column holding the p-value of the test for that row.
I can use fisher.test and a loop to do it in R, but I can't find a command in Stata for the Fisher exact test.
You are thinking in R terms, and that is often fruitless in Stata (just as it is impossible for a Stata guy to figure out how to do by ... : regress in R; every package has its own paradigm and its own strengths).
There are no objects to add columns to. Maybe you could say a little bit more as to what you need to do, eventually, with your p-values, so as to find an appropriate solution that your Stata collaborators would sympathize with.
If you really want to add a new column (generate a new variable, speaking Stata), then you might want to look at tabulate and its returned values:
clear
input x y f1 f2
0 0 5 10
0 1 7 12
1 0 3 8
1 1 9 5
end
I assume that your A B C D stand for two binary variables, and the numbers are frequencies in the data. You have to clear the memory, as Stata thinks about one data set at a time.
Then you could tabulate the results and generate new variables containing p-values, although that would be a major waste of memory to create variables that contain a constant value:
tabulate x y [fw=f1], exact
return list
generate p1 = r(p_exact)
tabulate x y [fw=f2], exact
generate p2 = r(p_exact)
Here, [fw=variable] is a way to specify frequency weights; I typed return list to find out what kind of information Stata stores as the result of the procedure. THAT'S the object-like thing Stata works with. R would return the test results in the fisher.test()$p.value component, and Stata creates returned values, r(component) for simple commands and e(component) for estimation commands.
If you want a loop solution (if you have many sets), you can do this:
forvalues k=1/2 {
tabulate x y [fw=f`k'], exact
generate p`k' = r(p_exact)
}
That's the scripting capacity in which Stata, IMHO, is way stronger than R (although it can be argued that this is an extremely dirty programming trick). The local macro k takes values from 1 to 2, and this macro is substituted as `k' everywhere in the curly bracketed piece of code.
Alternatively, you can keep the results in Stata short term memory as scalars:
tabulate x y [fw=f1], exact
scalar p1 = r(p_exact)
tabulate x y [fw=f2], exact
scalar p2 = r(p_exact)
However, the scalars are not associated with the data set, so you cannot save them with the data.
The immediate commands like cci suggested here would also have returned values that you can similarly retrieve.
HTH, Stas
Have a look at the cci command with the exact option:
cci 10 15 30 10, exact
It is one of the so-called "immediate" commands. They allow you to do computations directly from the arguments rather than from data stored in memory. Have a look at help immediate.
Each observation in the poster's original question apparently consisted of the four counts in one traditional 2 x 2 table. Stas's code applied to data of individual observations. Nick pointed out that -cci- can analyze a b c d data. Here's code that applies -cci- to each table and, like Stas's code, adds the p-values to the data set. The forvalues i = 1/`=_N' statement tells Stata to run the loop from the first to the last observation. a[`i'] refers to the value of the variable a in the i-th observation.
clear
input a b c d
10 2 8 4
5 8 2 1
end
gen exactp1 = .
gen exactp2 =.
label var exactp1 "1-sided exact p"
label var exactp2 "2-sided exact p"
forvalues i = 1/`=_N'{
local a = a[`i']
local b = b[`i']
local c = c[`i']
local d = d[`i']
qui cci `a' `b' `c' `d', exact
replace exactp1 = r(p1_exact) in `i'
replace exactp2 = r(p_exact) in `i'
}
list
Note that there is no problem in giving a local macro the same name as a variable.
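For readers who want to see what is being computed for each row, here is a minimal Python sketch of a two-sided Fisher exact p-value for one 2 x 2 table of counts. It uses one common definition (sum the hypergeometric probabilities of all tables no more likely than the observed one); that this matches Stata's r(p_exact) digit for digit is an assumption, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k.
        return comb(row1, k) * comb(row2, col1 - k) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

# One p-value per row, using the two example tables from the Stata code above.
tables = [(10, 2, 8, 4), (5, 8, 2, 1)]
p_values = [fisher_exact_two_sided(*t) for t in tables]
print(p_values)
```

The list comprehension plays the role of the forvalues loop: one call per observation, with the result stored alongside the counts.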