I am using syncmrates in Stata. What should I add to the following command to get mortality estimates by educational group (suppose the variable for education is v008, which holds categories 1, 2 and 3)? Can I get estimates for all the educational categories with a single addition to the command below?
syncmrates v008 b3 b7 [iw=v005]
syncmrates is from SSC. It's always helpful to be told the provenance of community-contributed commands (see the Stata tag wiki).
Your example is from the help file, where it's implied that v008 is (must be) a variable holding the date of interview.
But given a further variable, say education, my reading of the help implies that you need to repeat the command with qualifiers
... if education == 1
... if education == 2
... if education == 3
to get separate estimates.
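If you want all three sets of estimates from one block of code, the closest thing to a single command is a loop over the categories. A minimal sketch, using the placeholder name education for the education variable and assuming syncmrates accepts if qualifiers (check its help file to confirm):

forvalues g = 1/3 {
    display as text _n "education group `g'"
    syncmrates v008 b3 b7 if education == `g' [iw=v005]
}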
I have a group of treated firms in a country, and for each firm I would like to find the closest match in terms of industry, size and profitability in the rest of the country. I am working in Stata. All I need is to form a control group; could anybody guide me with the code? That would be greatly appreciated! I currently have the following, which doesn't get me what I need:
psmatch2 (logpension) (treated sector logassets logebitda), logit ate
Here's how you might match on x1 and x2 using Mahalanobis distance as a metric, to get the effect on y from treatment t:
use http://ssc.wisc.edu/sscc/pubs/files/psm, clear
psmatch2 t, mahalanobis(x1 x2) outcome(y) ate
The variable _n1 stores the observation number of the matched control observation for every treatment observation.
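For instance, to eyeball a few matched pairs you can list the identifier variables that psmatch2 adds to the dataset (a sketch; _id, _treated and _n1 are variables psmatch2 creates):

sort _id
list _id _n1 if _treated == 1 & !missing(_n1)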
The following is a full set of code you can run to find the average treatment effect on the treated (your most important result) and to check whether the data are balanced (i.e. whether your result is valid). Before you run it, make sure your treated variable is coded so that 0 marks the control group and 1 marks the experimental/treatment group. The option neighbor(1) requests nearest-neighbor matching: it pairs each treated observation with the control observation whose propensity score is closest in absolute value.
psmatch2 treated sector logassets logebitda, outcome(logpension) neighbor(1) common
After running psmatch2, you need to make sure your data are balanced, so run this:
pstest sector logassets logebitda, treated(treated)
If your t-test shows a p-value below 0.05 for any covariate, your data are not balanced. To check the balance of your data visually, you can also run
psgraph
right after your psmatch2 command.
Good luck!
I have a dataset which contains 7 numerical attributes and one nominal attribute, which is the class variable. I was wondering how I can find the best attribute to use to predict the class attribute. Would computing the information gain of each attribute and picking the largest be the solution?
So the problem you are asking about falls under the domain of feature selection, and more broadly, feature engineering. There is a lot of literature online regarding this, and there are definitely a lot of blogs/tutorials/resources online for how to do this.
To give you a good link I just read through, here is a blog with a tutorial on some ways to do feature selection in Weka, and the same blog's general introduction on feature selection. Naturally there are a lot of different approaches, as knb's answer pointed out.
To give a short description though, there are a few ways to go about it: you can assign a score to each of your features (such as information gain) and filter out features with 'bad' scores; you can treat finding the best feature subset as a search problem, where you take different subsets of the features and assess the accuracy of each in turn; and you can use embedded methods, which learn which features contribute most to the accuracy as the model is being built. Examples of embedded methods are regularization algorithms like LASSO and ridge regression.
Do you just want that attribute's name, or do you also want a quantifiable metric (like a t-value) for this "best" attribute?
For a qualitative approach, you can generate a classification tree with just one split and two leaves.
For example, take Weka's "diabetes.arff" sample dataset (n = 768), which has a structure similar to your dataset (all attributes numeric, but the class attribute has only two distinct categorical outcomes). Here I set the minNumObj parameter to, say, 200. This means: create a tree with a minimum of 200 instances in each leaf.
java -cp $WEKA_JAR/weka.jar weka.classifiers.trees.J48 -C 0.25 -M 200 -t data/diabetes.arff
Output:
J48 pruned tree
------------------
plas <= 127: tested_negative (485.0/94.0)
plas > 127: tested_positive (283.0/109.0)
Number of Leaves : 2
Size of the tree : 3
Time taken to build model: 0.11 seconds
Time taken to test model on training data: 0.04 seconds
=== Error on training data ===
Correctly Classified Instances 565 73.5677 %
This creates a tree with one split on the "plas" attribute. For interpretation, this makes sense, because patients with diabetes do indeed have an elevated concentration of glucose in their blood plasma. So "plas" is the most important attribute, as it was chosen for the first split. But this does not tell you how important it is.
For a more quantitative approach, maybe you can use (Multinomial) Logistic Regression. I'm not so familiar with this, but anyway:
In the Explorer GUI tool, choose "Classify" > Functions > Logistic.
Run the model. The odds ratios and the coefficients might contain what you need in a quantifiable manner. A lower odds ratio (but > 0.5) is better/more significant, but I'm not sure; maybe read the related answer by someone else on this.
java -cp $WEKA_JAR/weka.jar weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -t data/diabetes.arff
Here's the command line output
Options: -R 1.0E-8 -M -1
Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
Class
Variable tested_negative
============================
preg -0.1232
plas -0.0352
pres 0.0133
skin -0.0006
insu 0.0012
mass -0.0897
pedi -0.9452
age -0.0149
Intercept 8.4047
Odds Ratios...
Class
Variable tested_negative
============================
preg 0.8841
plas 0.9654
pres 1.0134
skin 0.9994
insu 1.0012
mass 0.9142
pedi 0.3886
age 0.9852
=== Error on training data ===
Correctly Classified Instances 601 78.2552 %
Incorrectly Classified Instances 167 21.7448 %
I want to run block bootstrap, where the blocks are countries, and include country indicator variables. I thought the following would work.
regress mvalue kstock i.country, vce(bootstrap, cluster(country))
But I get the following error.
. regress mvalue kstock i.country, vce(bootstrap, cluster(country))
(running regress on estimation sample)
Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxx 50
insufficient observations to compute bootstrap standard errors
no results will be saved
r(2000);
It seems that this should work. If the block bootstrap picks the same country for every block, then it seems it should just drop the intercept.
Is my error coding or conceptual? Here is some code using the grunfeld data.
webuse grunfeld, clear
xtset, clear
generate country = int((company - 1) / 2) + 1
regress mvalue kstock i.country, vce(bootstrap, cluster(country))
The problem here is not with your coding; it is conceptual. You cannot identify each coefficient in each regression in each bootstrap sample, because not all "countries" are included in the dataset for each bootstrap repetition. You can diagnose what is going on with the vce(, noisily) suboption:
. regress mvalue kstock i.country, vce(bootstrap, cluster(country) noisily)
Errors are generated because some coefficients are missing when the regression runs on particular bootstrap samples. In each regression you can see that some country dummies are omitted due to collinearity. This should be expected and makes a lot of sense: a country dummy will be 0 for every observation in the bootstrap sample if that country was not drawn!
If you are really trying to estimate the coefficients on the country dummies, you are going to have to find an approach other than bootstrapping with K clusters, where K is the number of countries. If you don't care about the dummy coefficients, you could use another command that simply absorbs the fixed effects and only reports the coefficients on the other independent variables (e.g., areg or xtreg). One way to think about what is going on is that it is analogous to this:
. bootstrap, cluster(country) idcluster(bscountry) noisily: regress mvalue kstock i.bscountry
With the idcluster() option, each country that is drawn in a bootstrap sample is given its own ID number. If a country is drawn twice, there are two dummies. (The coefficients on the two dummies naturally turn out to be identical or near-identical.) However, the coefficients in this output are completely meaningless, because bscountry "2" will be a different country in different bootstrap iterations. Since you would ignore any output on the dummies, you might as well use a model like areg or xtreg, since they run more quickly.
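As a sketch of that alternative, using the grunfeld example from the question (and assuming you only care about the coefficient on kstock, not the country effects):

* absorb the country effects instead of estimating their coefficients
areg mvalue kstock, absorb(country) vce(bootstrap, cluster(country))

* or use the fixed-effects estimator; for xt commands the bootstrap resamples panels
xtset country
xtreg mvalue kstock, fe vce(bootstrap)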
Although there are many applications where bootstrapping with clusters works fine, the problem here is the inclusion of cluster dummies in the regression. This all raises the question of whether the exercise makes any sense at all. If you are trying to estimate the coefficients on the country dummies, it certainly does not. Otherwise, the solutions above might be OK, but it is hard to say without knowing your research question.
I have been using the estout commands for quite some time now, but am now working on a couple of computers that are not connected to the internet. I have been getting the following error from the esttab command:
current estimation results do not have e(b) and e(V)
To test this with a simpler example, I tried to replicate the example here: http://repec.org/bocode/e/estout/estpost.html#estpost101
I created a sample dataset as below:
price mpg rep78 foreign
1 3 1 1
2 3 1 1
3 3 1 2
4 3 2 3
5 3 2 5
6 3 2 8
7 3 3 13
8 3 3 21
9 3 3 34
And then ran the following commands as per the example:
estpost summarize price mpg rep78 foreign, listwise
esttab, cells("mean sd min max") nomtitle nonumber
I got the expected output from the estpost command, but got the aforementioned error when running esttab. I have uninstalled and reinstalled this package several times, using both a version downloaded onto another computer with ssc install and a version downloaded from http://repec.org/bocode/e/estout/installation.html. I feel that I must be missing something obvious... any help would be appreciated.
EDIT: I just checked with a colleague of mine, and it seems they have been inexplicably getting the same error recently.
This works for me, so I suspect a faulty installation. Despite your best efforts, previously installed versions of some or all of the programs in estout may still be present on your system.
In Stata, you need to check what Stata is seeing by typing commands such as
. which esttab, all
in every directory in which you work. If necessary, repeat for all the other command files in the package as named by ssc desc estout. You should be seeing one (and only one) version of each command, dates mostly in 2009 (eststo in 2008).
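For example, the estout package also installs estout, estpost, estadd and eststo alongside esttab, so check each of those, and ado dir lists the packages Stata thinks are installed:

which esttab, all
which estout, all
which estpost, all
which estadd, all
which eststo, all
ado dir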
A wilder hypothesis is that estout has been broken by recent changes in Stata. This seems unlikely to me, but check the above first.
(UPDATE) OP reply reveals a nightmare scenario: old versions have been found that are difficult to remove. What to do?
Look at the adopath command. In the first instance, just type adopath. Stata wants to install the estout files in a subdirectory e off whatever directory it labels PLUS. (If you are in the US, you probably want to say "off of".)
One possibility is that you use adopath to reset PLUS to somewhere you have write access, preferably somewhere dedicated to programs alone. That has implications for programs already in PLUS, which are no longer visible, so you need to add what was previously PLUS as an extra place to look. The help for adopath explains how to do this.
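A sketch of that idea (the paths here are hypothetical examples; sysdir set is the companion command that actually re-points PLUS, as the adopath help explains):

adopath                              // show the current search order, including PLUS
sysdir set PLUS "H:/mystata/plus"    // point PLUS at a folder you can write to
adopath + "C:/ado/plus"              // keep the old PLUS directory as an extra place to look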
Another possibility relies on the fact that Stata always looks in the current directory or folder for a user-written program before it looks in PLUS (unless you mess with that order, which I advise against). So, as long as the programs you want are in the current directory, that should work. However, this would usually be considered poor style. Worse, as you change the working directory, you need to copy the programs to that directory.
You may need to approach the administrators to delete the old stuff. It is hard to work with user-written programs for Stata if you don't have the right privileges.
Here is a sample program .do file, sampleprog.do:
program sampleprog
egen newVar = group(`1' `2')
end
How can I post it on my website (or dropbox), so that other people could install it to their Stata like this?
net from http://www.mywebsite.com/sampleprog.do
*** or maybe like this:
ssc install ...
I read the documentation about stata.toc... but I did not quite get it. What files should I upload, and should everything go in one folder?
(PS: I definitely can simply email the .do file but this is not an option in my case.)
Here is a full explanation of how to share program or data files with others using your own website. I tried using Dropbox, but Stata 12 appears to have issues with https, which is the protocol for all Dropbox public links. If you want to use Dropbox, I recommend creating a shared folder that will sync on your collaborators' machines. The rest of this answer assumes you have a website serving pages over http or are using Stata 13, which supports https.
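If you do go the shared-folder route instead, one way to make the synced programs visible to Stata is to add that folder to the ado-path (a sketch; the Dropbox path is a made-up example):

adopath + "C:/Users/yourname/Dropbox/stata-ado"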
If this is a one-time thing, you can skip the rest of this answer by putting the file on your website and telling your collaborator to type:
. copy http://your-site.com/ado/program.ado program.ado
That will copy the ado-file at the specified URL into the user's current directory. If you want to provide information about your files, plan on sharing with multiple people, and need to maintain/document a set of files, read on!
Step 1 Create a folder on your website to hold the programs. I will call mine ado/
Step 2 Add the program files, help files, and data files you want to share. For this example, I have created a simple ado file called unique.ado with the following contents:
********************************************** unique.ado
capture program drop unique
program define unique
*! Count and number observations within group defined by varlist
* Example: unique person_id, obs(prow) tobs(pcount) sortby(time)
* to count and number rows by a variable called person_id
syntax varlist, obs(name) tobs(name) [sortby(varlist)]
bys `varlist' (`sortby') : gen long `obs' = _n
bys `varlist' (`sortby') : gen long `tobs' = _N
la var `obs' "Number of this row within `varlist' group."
la var `tobs' "Total number of rows with identical `varlist' values."
end
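Once installed, the program would be called as the example comment at the top of the file suggests, e.g.:

unique person_id, obs(prow) tobs(pcount) sortby(time)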
Step 3 Create a file called stata.toc to describe the files you wish to share. Here is mine:
********************************************** stata.toc
v 3
d Program to count observations by group
p unique [The unique.ado program for counting observations by group]
These files can be complicated. There are many features I won't cover here, but you can read this documentation to learn more.
Step 4 Create a package file for each of the packages defined by the lines in stata.toc that start with the letter p. Here is my package file for the unique package defined above:
********************************************** unique.pkg
v 3
d unique
d Program to count observations by group
d Distribution-Date: 28 June 2012
f unique.ado
Your directory now looks like this:
ado/
    stata.toc
    unique.ado
    unique.pkg
Step 5 Use the site! Here are the commands to enter.
. net from http://example.com/ado/
. net describe unique
. net install unique
Here is what you'll see after entering the first command:
-----------------------------------------------------------------------------------
http://www.example.com/ado/
Program to count observations by group
-----------------------------------------------------------------------------------
PACKAGES you could -net describe-:
unique [The unique.ado program for counting observations by group]
-----------------------------------------------------------------------------------
The second command, net describe unique, will tell you more about the package:
---------------------------------------------------------------------------------------
package unique from http://www.example.com/ado
---------------------------------------------------------------------------------------
TITLE
unique
DESCRIPTION/AUTHOR(S)
Program to count observations by group
Distribution-Date: 28 June 2012
INSTALLATION FILES (type net install unique)
unique.ado
---------------------------------------------------------------------------------------
The third command, net install unique, will install the package:
checking unique consistency and verifying not already installed...
installing into /Users/cpoliquin/Library/Application Support/Stata/ado/plus/...
installation complete.
EDIT
See Nick's comments in the answer below. I intended this example to be simple, and I don't expect other people to use this program. If you plan on submitting things to the Stata Journal or SSC, then his comments certainly apply! I hope this answer can serve as a decent tutorial for those confused by the official documentation.
This will be too long for a comment, so it is going to be an extra answer.
Your example uses the program name unique. If you type search unique, all (or in Stata 13, just search unique) you will find that a user-written program with the same name has been available from SSC since 1998. This will create a clash of names for your users if (and only if) they attempt to use your program and that earlier program as well. The more general advice is to search first to see whether a program name is already in use, to try to avoid these problems.
Specifically, although you may just be using your unique as an arbitrary example, note that it contains bugs. An int doesn't contain enough bits to hold observation numbers exactly in large datasets. Also, as a matter of style, unique can change the sort order of your data, which is widely considered poor data management practice.
Your example concerns dissemination of a program file without an accompanying help file. Suffice it to say that the SSC site would never accept such a program and the Stata Journal would not even review a paper based on such a submission before a help file was written to accompany it. Including explanatory comments with the code may be sufficient for your personal practices, but it falls below general Stata standards.
Stata 13 now supports https. See http://www.stata.com/manuals13/u.pdf, Section 3.6.
In short, I appreciate that you are trying to explain how to do something, but it is already well documented, and explicitly and implicitly some of your recommendations are below community standards.