Individual score contributions in ML estimation - stata

I've estimated a model via maximum likelihood in Stata and was surprised to find that estimated standard errors for one particular parameter are drastically smaller when clustering observations. I take it from the Stata manual on robust standard error estimation in ML that this can happen if the contributions of individual observations to the score (the derivative of the log-likelihood) tend to cancel each other within clusters.
I would now like to dig a little deeper into what exactly is happening and would therefore like to have a look at these score contributions. As far as I can see, however, Stata only gives me the total sum as e(gradient). Is there any way to pry the individual summands out of Stata?

If you have written your own command, you can create a new variable containing these scores using the ml score command. Official Stata commands and most finished user written commands will often have score as an option for predict, which does the same thing but with an easier syntax.
These will give you the score of the log likelihood ($\ell$) with respect to the linear predictor, $x\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \elipses$. To get the derivative of the log likelihood with respect to an individual parameter, say $\beta_1$, you just use the chain rule:
$\frac{\partial \ell}{\partial \beta_1} = \frac{\partial \ell }{\partial x\beta} \frac{\partial x\beta}{\partial \beta_1}$
The scores returned by Stata are $ \frac{\partial \ell }{\partial x\beta}$, and $\frac{\partial x\beta}{\partial \beta_1} = x_1$.
So, to get the score for $\beta_1$ you just multiply the score returned by Stata and $x_1$.

Related

Transition between PCA and factor analysis Eviews

I have 8 financial variables that vary across sectors (see attached file). In total I have 30 observations.
I want to run a PCA analysis in order two scores (one score for profitability and the other for indebtness). In my financial variables, I have Income, Capex, Ebitda.... that represent the profitability of the industry and Gearing and CFO coeff represent the indebtness of the industry.
I first run the correlation matrix. Then, view==> principal components. I want, now, to keep the first three principal components. So, if i understand well, in order to run the factor specification, I have to save the components.
So, I go to proc->make principal components. However, from that moment on, I am lost. How can eviews knows that I want to keep just three principal components. Is it under the "scores series"?

Clarification re Principle Component Analysis

I do understand the principle component analysis. I know how to do it and what it actually does. I have applied PCA and my best result has shown to be two components. I do understand that each of my inputs are now contributing partially in each component. What I do not understand is how to feed the result of PCA (in my case 2 components ) to a machine learning model?
How do we input them?
For example when I want to run a NN on my features, I just can navigate to where they are stored and import them, but my PCA analysis has been run in SPSS and all it shows me is the contribution of my features on each component.
What should I import to my NN model?
PCA is a method of feature extraction, which is used to avoid the problem of co-linearity. For example, if several variables are highly correlated because "they measure the same thing", then PCA can extract a measure of "that thing" (technically: a component), which is called a score. Your data set of, say, 100 measured variables may reduce to, say, 10 significant components. Then you can use the scores your test persons have achieved in those 10 components to do for example a multi-dimensional regression, a cluster analysis or a discriminance analysis. This will result in more valid results than performing the analysis directly on the 100 variables.
So the procedure is to sort the eigenvalues (and -vectors) by size, identify the number of significant components p (e.g., by scree-plot), set up the projection matrix F (eigenvectors corresponding to the largest q eigenvalues in columns) and multiply it with the data matrix D. This will give you the score matrix C (dimension n times q, with n the number of test persons), which you can use as input for whatever method you want to use next.

Finding a correlation between variable and class variable

I have a dataset which contains 7 numerical attributes and one nominal which is the class variable. I was wondering how I can the best attribute that can be used to predict the class attribute. Would finding the largest information gain by each attribute be the solution?
So the problem you are asking about falls under the domain of feature selection, and more broadly, feature engineering. There is a lot of literature online regarding this, and there are definitely a lot of blogs/tutorials/resources online for how to do this.
To give you a good link I just read through, here is a blog with a tutorial on some ways to do feature selection in Weka, and the same blog's general introduction on feature selection. Naturally there are a lot of different approaches, as knb's answer pointed out.
To give a short description though, there are a few ways to go about it: you can assign a score to each of your features (like information gain, etc) and filter out features with 'bad' scores; you can treat finding the best parameters as a search problem, where you take different subsets of the features and assess the accuracy in turn; and you can use embedded methods, which kind of learn which features contribute most to the accuracy as the model is being built. Examples of embedded methods are regularization algorithms like LASSO and ridge regression.
Do you just want that attribute's name, or do you also want a quantifiable metric (like a t-value) for this "best" attribute?
For a qualitative approach, you can generate a classification tree with just one split, two leaves.
For example, weka's "diabetes.arff" sample-dataset (n = 768), which has a similar structure as your dataset (all attribs numeric, but the class attribute has only two distinct categorical outcomes), I can set the minNumObj parameter to, say, 200. This means: create a tree with minimum 200 instances in each leaf.
java -cp $WEKA_JAR/weka.jar weka.classifiers.trees.J48 -C 0.25 -M 200 -t data/diabetes.arff
Output:
J48 pruned tree
------------------
plas <= 127: tested_negative (485.0/94.0)
plas > 127: tested_positive (283.0/109.0)
Number of Leaves : 2
Size of the tree : 3
Time taken to build model: 0.11 seconds
Time taken to test model on training data: 0.04 seconds
=== Error on training data ===
Correctly Classified Instances 565 73.5677 %
This creates a tree with one split on the "plas" attribute. For interpretation, this makes sense, because indeed, patients with diabetes have an elevated concentration of glucose in their blood plasma. So "plas" is the most important attribute, as it was chosen for the first split. But this does not tell you how important.
For a more quantitative approach, maybe you can use (Multinomial) Logistic Regression. I'm not so familiar with this, but anyway:
In the Exlorer GUI Tool, choose "Classify" > Functions > Logistic.
Run the model. The odds ratio and the coefficients might contain what you need in a quantifiable manner. Lower odds-ratio (but > 0.5) is better/more significant, but I'm not sure. Maybe read on here, this answer by someone else.
java -cp $WEKA_JAR/weka.jar weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -t data/diabetes.arff
Here's the command line output
Options: -R 1.0E-8 -M -1
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
Class
Variable tested_negative
============================
preg -0.1232
plas -0.0352
pres 0.0133
skin -0.0006
insu 0.0012
mass -0.0897
pedi -0.9452
age -0.0149
Intercept 8.4047
Odds Ratios...
Class
Variable tested_negative
============================
preg 0.8841
plas 0.9654
pres 1.0134
skin 0.9994
insu 1.0012
mass 0.9142
pedi 0.3886
age 0.9852
=== Error on training data ===
Correctly Classified Instances 601 78.2552 %
Incorrectly Classified Instances 167 21.7448 %

Calculating p-value by hand from Stata table

I want to ask a question on how to compute the p-value without a t-stat table, just by looking at the table, like on the first page of the pdf in the following link http://faculty.arts.ubc.ca/dwhistler/326UBC/stataHILL.pdf . Like if I don't know the value 0.062, how can I know it is 0.062 by looking at other information from the table?
You need to use the ttail() function, which returns the reverse cumulative Student's t distribution, aka the probability T > t:
display ttail(38,abs(_b[_cons]/_se[_cons]))*2
The first argument, 38, is the degrees of freedom (sample size less number of parameters), while the second, 1.92, is the absolute value of the coefficient of interest divided by its standard error, or the t-stat. The factor of two comes from the fact that Stata is doing a two-tailed test. You can also use the stored DoF with
display ttail(e(df_r),abs(_b[_cons]/_se[_cons]))*2
You can also do the integration of the t density by "hand" using Adrian Mander's integrate:
ssc install integrate
integrate, f(tden(38,x)) l(-1.92) u(1.92)
This gives you 0.93761229, but you want Pr(T>|t|), which is 1-0.93761229=0.06238771.
If you look at many statistics textbooks, you will find a table called the Z-table which will give you the probability that Z is beyond your test statistic. The table is actually a cumulative distribution function of the normal curve.
When people went to school with 4-function calculators, one or more of the questions on the statistics test would include a copy of this Z-table, and the dear students would have to interpolate columns of numbers to find the p-value. In your example, you would see the test statistic between .06 and .07 and those fingers would tap out that it was closer to .06 and do a linear interpolation to come up with .062.
Today, the p-value is something that Stata or SAS will calculate for you.
Here is another SO question that may be of interest: How do I calculate a p-value if I have the t-statistic and d.f. (in Perl)?
Here is a basic page on how to determine p-value "by hand": http://www.dummies.com/how-to/content/how-to-determine-a-pvalue-when-testing-a-null-hypo.html
Here is how you can determine p-value using Excel: http://ms-office.wonderhowto.com/how-to/find-p-value-with-excel-346366/
===EDIT===
My Stata text ("Microeconometrics using Stata", Revised Ed, Cameron & Trivedi) says the following on p. 402.
* p-values for t(30), F(1,30), Z, and chi(1) at y=2
. scalar y=2
. scalar p_t30 = 2 * ttail(30,y)
. scalar p_f1and30 = Ftail(1,30,y^2)
. scalar p_z = 2 * (1 - normal(y))
. scalar p_chi1 = chi2tail(1,y^2)
. display "p-values" " t(30)=" %7.4f p_t30
p-values t(30) = 0.0546

Stata seems to be ignoring my starting values in maximum likelihood estimation

I am trying to estimate a maximum likelihood model and it is running into convergence problems in Stata. The actual model is quite complicated, but it converges with no troubles in R when it is supplied with appropriate starting values. I however cannot seem to get Stata to accept the starting values I provide.
I have included a simple example below estimating the mean of a poisson distribution. This is not the actual model I am trying to estimate, but it demonstrates my problem. I set the trace variable, which allows you to see the parameters as Stata searches the likelihood surface.
Although I use init to set a starting value of 0.5, the first iteration still shows that Stata is trying a coefficient of 4.
Why is this? How can I force the estimation procedure to use my starting values?
Thanks!
generate y = rpoisson(4)
capture program drop mypoisson
program define mypoisson
args lnf mu
quietly replace `lnf' = $ML_y1*ln(`mu') - `mu' - lnfactorial($ML_y1)
end
ml model lf mypoisson (mean:y=)
ml init 0.5, copy
ml maximize, iterations(2) trace
Output:
Iteration 0:
Parameter vector:
mean:
_cons
r1 4
Added: Stata doesn't ignore the initial value. If you look at the output of the ml maximize command, the first line in the listing will be titled
initial: log likelihood =
Following the equal sign is the value of the likelihood for the parameter value set in the init statement.
I don't know how the search(off) or search(norescale) solutions affect the subsequent likelihood calculations, so these solution might still be worthwhile.
Original "solutions":
To force a start at your initial value, add the search(off) option to ml maximize:
ml maximize, iterate(2) trace search(off)
You can also force a use of the initial value with search(norescale). See Jeff Pitblado's post at http://www.stata.com/statalist/archive/2006-07/msg00499.html.