Questions about PSM code for Stata - stata

I am having trouble understanding PSM (propensity score matching) time-varying DID code in Stata.
I found some code online (posted as an image), but I have trouble understanding it.
I don't have the data on hand, but I am wondering how the person generated the variable "myblock" and which variable I should enter in place of "comsup" when I run my own code.
I am also wondering how many variables are appropriate for "xlist".
Thank you so much for the help and support!

Related

How to compute p-value when number of degrees of freedom and chi-square value are known?

I know how to do this in R: e.g. pchisq(18.98, df=2, lower.tail=FALSE)
However, I've no idea about how to write Stata code to solve this problem.
In case you're interested in more general post-estimation tests check out: help test.
Otherwise it seems like chi2(2,18.98) or chi2tail(2,18.98) is what you're after. In R, lower.tail=FALSE requests the upper-tail probability P[X > x], which corresponds to chi2tail(); chi2() returns the lower tail.
Note that in Stata you'll probably want to put this into a "local" in order to do other things with the output.
For example if you say the following to Stata:
local pchi2 = chi2(2,18.98)
display "chi2: `pchi2'"
Stata should reply:
chi2: .9999243958967154
For more detail and links to the Stata manual section on statistical functions, see:
help chi2
help chi2tail
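As a cross-check outside Stata, here is a minimal Python sketch of the upper-tail probability. It only covers an even number of degrees of freedom, for which the chi-square survival function has a closed form. The function name chi2_upper_tail is made up for this illustration and is not a Stata or R function.

```python
import math

def chi2_upper_tail(df, x):
    """P[X > x] for a chi-square variable with even df (closed form only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    # Survival function for even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
    total = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * total

p_upper = chi2_upper_tail(2, 18.98)   # same quantity as chi2tail(2,18.98)
p_lower = 1.0 - p_upper               # same quantity as chi2(2,18.98)
print(p_upper, p_lower)
```

The lower-tail value agrees with the chi2(2,18.98) result shown above, and the upper-tail value is what R's lower.tail=FALSE returns.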

SAS DI Stop job if dataset is populated

I'm quite new to SAS and really can't get my head around its code, so I'm asking here for help.
I have a job that reads an external csv file, and a macro created by a colleague that validates the data in this external file and writes error messages to a work table.
What I'd like to do, either in the precode of the file reader or in another user written code transformation, is read the work table, check whether any observations exist, and abort the job if they do. From googling, and between here and the SAS community, I can find how to read a dataset and count observations, but I'm having real difficulty figuring out how to implement it, so any guidance would be really appreciated.
Can anyone please help me on this?
Thanks

Stata: Esttab of xtreg with time fixed effects

I'm trying to save output from several hundred eststo's, which store the results of bivariate probability models, into one Excel file using esttab. It works for xtlogit (both ,re and ,pa), xtprobit (both ,re and ,pa), and the linear probability model xtreg (both standard and ,fe). However, when I use xtreg y x i.year, fe I get the error message too many base levels specified. Google doesn't help me much.
I've been trying for an hour to create a reproducible example, but the Stata datasets all work fine. It does not seem to be due to the number of years or the fact that different specifications have data for different years. Still, the normal xtreg, fe works; the problem only appears with time dummies. The weirdest thing is that it works for all subsets of my variables but not for the whole list (again, just the time fixed effects specifications).
Does anyone have an idea how to proceed? Using drop(*.year) works whenever the problem does not arise (so in specifications where it works, I get outputs without the year dummies) but does not prevent the too many base levels specified error; ,nobaselevels has no apparent effect either. Is there a way to remove the time fixed effects from eststo before I pass those on to esttab? Any workaround would be appreciated as well.
You may be running into Stata choosing different base levels for the factor variable year in different regressions.
Try fixing the factor variable base level beforehand with fvset:
fvset base <some_number> year
Check help fvset and the manual entry for details. Also, read the source given below, which contains more information.
Source: two posts from Statalist; one from Tim Wade and another by Jeff Pitblado.

SAS - Keeping only observations with all variables

I have a dataset of membership information, and I want to keep only the people who have been continuously enrolled for the entire year. There are 12 variables for each person, one for each month of the year with how many days during that month they were enrolled. Is there a way to make a subset of the data for just those with a value >1 for each of the month variables?
Thanks!
SAS has various summary functions that might well be what you're looking for. See min() (minimum) in particular, as it will allow you to find the minimum of several variables. You may also want to consider nmiss() (number of missing values) and n() (number of non-missing values) if you have to deal with missing values in your data.
Summary functions can be passed lists of variables like this (in a data step):
minimum = min(var1, var2, var3);
However, that can become long-winded if you need to use a lot of variables. Fortunately, SAS provides several ways to reference lists of variables to make things neater; these are described in the SAS documentation on variable lists. To use them in a summary function, use the of qualifier:
minimum = min(of var1-var12);
maximum = max(of var:);
blanks = nmiss(of _NUMERIC_);
Finally, you will want to use your newly computed values to decide which observations to include. To do this in a data step, look at the output statement (user guide):
if min(of var:) > 1 then output;
Or if you feel like learning a bit more about SAS's syntax you could try using an implicit output by reading through the last link.
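The same filter can be sketched in Python to show the logic, including the missing-value case. The member records and month values below are made up for illustration. SAS's min() ignores missing values, so this sketch checks for missing months explicitly, which is the role nmiss() would play in a data step.

```python
# Hypothetical records: member id -> days enrolled in each of 12 months (None = missing).
members = {
    "m1": [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
    "m2": [31, 28, 0, 30, 31, 30, 31, 31, 30, 31, 30, 31],
    "m3": [31, 28, 31, 30, None, 30, 31, 31, 30, 31, 30, 31],
}

def continuously_enrolled(days, threshold=1):
    """True when no month is missing and every month exceeds the threshold."""
    if any(d is None for d in days):   # analogous to nmiss(of var:) > 0
        return False
    return min(days) > threshold       # analogous to min(of var:) > 1

kept = [mid for mid, days in members.items() if continuously_enrolled(days)]
print(kept)
```

Only m1 is kept here: m2 has a month with zero days and m3 has a missing month.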
In general it's preferred to ask specific questions and show your current work on SO, and I'd advise using google to answer your basic questions while you're learning the fundamentals. There is plenty of great documentation available to help you out there.

How can I select Yes/No questionID dynamically in Weka J48 App

I'm developing a Weka app like Akinator using the J48 method.
Sample:
http://jbossews-vdoctor.rhcloud.com/doctor
The following is the app's table definition and sample data
qa means question id (please refer to the master, which can be set by the user) + answer (1: Yes, 2: I don't know, 3: No).
There is one line per question and answer.
id,qa,class
A,13,1
A,23,1
B,13,2
B,21,2
The point is to find a way to select the question which can maximize the entropy.
Currently the app treats the first node of the decision tree as the best question.
It then narrows down the options by elimination.
But the accuracy was too poor for it to work correctly, so I'd like to improve it.
I noticed that the qa column was identified as numeric, so it could not build the correct decision tree.
I am unsure what I should change to improve it: the dataset, the table definition, or the logic?
This is quite a broad question that you are asking, and without code or a clear understanding of the problem it is quite difficult to answer, but I'll give some tips for improvement:
Table Definition
What may make more sense here is to have an attribute for each question, instead of using a single instance per question. For example, instead of id, qa and class, you could have A, B, C, D, E, F and Disease. (I believe there were six questions; giving each attribute a descriptive name would be better than A-F.)
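A minimal Python sketch of that reshaping, using the four sample rows from the question. It assumes qa is the question id followed by a single answer digit; the column names q1, q2 and the pivot function are made up for illustration and are not part of Weka.

```python
# Sample rows from the question: (id, qa, class), where qa = question id + answer digit.
rows = [("A", "13", "1"), ("A", "23", "1"), ("B", "13", "2"), ("B", "21", "2")]

def pivot(rows):
    """Return one record per id with a column per question plus the class."""
    wide = {}
    for case_id, qa, label in rows:
        question, answer = qa[:-1], qa[-1]   # assumes the last digit is the answer
        record = wide.setdefault(case_id, {"class": label})
        record["q" + question] = answer      # kept as a string so it stays nominal
    return wide

wide = pivot(rows)
for case_id, record in wide.items():
    print(case_id, record)
```

Each id then becomes a single instance, with one nominal attribute per question and the class as the target.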
Dataset
You will need at least as many cases as there are diseases, if not more for defining multiple subsets of the problem space for the same disease. There are likely cases where some questions are irrelevant or missing, and the model may need to handle such situations.
Logic
In such a case, you might be able to do the questionnaire by starting with the root node and asking questions until you reach the estimated class. This way, you can ask from node to node until a class is reached.
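The node-to-node questioning could look roughly like the Python sketch below. The tree, its questions, and its class labels are all hypothetical and hand-written here; in the real app the tree would come from the trained J48 model.

```python
# Hypothetical tree: internal nodes ask a question, leaves hold a class label.
tree = {
    "question": "q1",
    "branches": {
        "1": {"label": "class 1"},
        "3": {
            "question": "q2",
            "branches": {"1": {"label": "class 2"}, "3": {"label": "class 1"}},
        },
    },
}

def classify(node, answer_for):
    """Walk from the root, asking one question per node, until a leaf is reached."""
    asked = []
    while "label" not in node:
        question = node["question"]
        answer = answer_for(question)
        asked.append(question)
        child = node["branches"].get(answer)
        if child is None:              # answer not covered by the tree
            return None, asked
        node = child
    return node["label"], asked

answers = {"q1": "3", "q2": "1"}       # stand-in for the user's replies
label, asked = classify(tree, answers.get)
print(label, asked)
```

Returning None for an uncovered answer makes the gap visible instead of silently guessing a class.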
I hope this helps in improving your existing model.
NOTE: I tried your questionnaire and answered No to all of your questions, and I strangely ended up with Trichomoniasis. Perhaps there could be a 'No Disease' category for your training data also.
My nominal qa data builds a decision tree like this by binary split.
Actually, this structure doesn't make sense because the tree grows on only one side. When qa equals 23, the answer would always be '3'. That seems irrational.
http://www.fastpic.jp/viewer.php?file=2693704973.jpg
You should first reformat your features so that all possible questions A, B, C, D... are binary features and your final answer (i.e., what to guess) is the target class, if you want your tree to produce a sequence of questions leading to your answer. Your data will certainly be sparse (many questions without data or an answer).
By the way, a binary tree is not the right ML structure and algorithm for building an Akinator-like or 20Q/Guess Who game. Please see the suggestions here: https://stats.stackexchange.com/questions/6074/akinator-com-and-naive-bayes-classifier