I am trying to develop a spatiotemporal logistic regression model to predict the presence/absence of a disease in counties of the contiguous U.S. based on climatological variables, with one data point per county per year from 2007 to 2014. Ideally, I would also like the model to be able to score additional datasets, e.g., use the model developed on the 2007-2014 data to predict disease probability under future climate scenarios. The model needs to account for spatial autocorrelation and, again ideally, for repeated measures (each county has one data point per year). Unfortunately, my SAS abilities are not up to the task. Would anyone have suggestions for developing the model? The data, in CSV format, take the form:
countyFIPS year outcome predictor1 predictor2 predictor3 latitude longitude
where
countyFIPS = unique 5-digit identifier for U.S. counties
outcome = at least one case in the county for the given year, coded 0/1
latitude and longitude denote the centroid of the county
I'm really bad at this, so please be gentle and use small words...
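One possible shape for such a model in SAS, shown purely as an untested sketch, is PROC GLIMMIX with a county-level random intercept for the repeated yearly measures and an R-side exponential spatial covariance among county centroids within each year. The dataset name disease is hypothetical (the imported CSV), and with roughly 3,100 counties per year this may be computationally demanding, so treat it as a starting point rather than a recommended specification:
proc glimmix data=disease;                              /* default pseudo-likelihood estimation */
  class countyFIPS year;
  model outcome(event='1') = predictor1 predictor2 predictor3
        / dist=binary link=logit solution;
  random intercept / subject=countyFIPS;                /* repeated measures: random intercept per county */
  random _residual_ / subject=year
         type=sp(exp)(latitude longitude);              /* spatial correlation among county centroids within a year */
run;
For scoring future climate scenarios, one common trick is to append the new rows with a missing outcome so they are ignored during fitting but still receive predicted probabilities from the OUTPUT statement; the STORE statement plus PROC PLM is another route.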
Related
Given a multiple linear regression (hypothetical example: lm(income ~ age*gender + experience)), can I estimate a percentile score for a new observation (one not used in fitting the model)?
For example, based on the residual between the model's income prediction for the new observation and the true income value for this observation, can I meaningfully compute something like: 'Based on our model and given the individual's gender, age, and experience level, we estimate they are currently earning more than 55% of individuals with the same characteristics'?
I assume that to do this, I would need to compute a distribution of predictions. Then I could use these to compute a quantile score.
My questions are:
Is this approach logically/statistically sound? And if not, what are the alternatives?
Are there any existing implementations/packages to do this?
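For concreteness, here is a minimal R sketch of the calculation I have in mind, assuming roughly normal errors (train and new_obs are placeholder names; the percentile compares the observed income with the model's predictive distribution for that individual):
fit <- lm(income ~ age * gender + experience, data = train)  # fitted on training data only
pred <- predict(fit, newdata = new_obs, se.fit = TRUE)       # prediction for the new individual
sd_pred <- sqrt(pred$se.fit^2 + summary(fit)$sigma^2)        # sd of the predictive distribution
pnorm(new_obs$income, mean = pred$fit, sd = sd_pred)         # e.g. 0.55 -> "earns more than 55% of similar people"
The second term under the square root is the residual variance, so the distribution reflects prediction uncertainty, not just estimation uncertainty.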
I am using SAS proc surveyfreq with jackknife replicate weights to describe frequencies across variables in a survey that used address-based sampling. Some of the variables are coded by individual selection; for example, one survey question asks respondents to pick their three top choices, so the dataset records each individual choice as a separate variable with a Yes/No (0/1) response. Which SAS procedure that incorporates jackknife replicate weights should I use in this case to describe the frequencies for the entire question across the three top choices?
I have a dataset from a survey, consisting of unique rows (each representing a respondent who has answered a series of questions).
The following columns are in the dataset:
Respondent ID (unique)
Business Unit (categorical)
Level (categorical)
Gender (categorical)
Columns for each question with an answer from 1 to 9
I transposed the data to be able to calculate the average, standard deviation, etc. for each of the questions, and make visuals for each of the categories. So far so good.
But now I want to take it one step further:
I want to build a comparison visual, where I would like users of the dashboard to be able to choose which "groups" they want to compare, but from different categories (BU, Gender, Level).
E.g., compare the average scores for all questions between Business Unit = X (=group 1), and Gender = Male (=group 2).
It would be even better if people could "compose" their comparison groups from all of the categories.
E.g., compare the average score for all questions between BU=X-Gender=Male-Level=Manager and BU=X (so in essence compare a subgroup with the BU total), or any of the combinations.
Any insights on how to do this would be much appreciated!
Thanks in advance 🙂
Dieter from Brussels
I have a rather simple question regarding the output of the tabstat command in Stata.
To be more specific, I have a large panel dataset containing several hundred thousand observations over a 9-year period.
The context:
bysort year industry: egen total_expenses=total(expenses)
This line should create total expenses by year and industry (i.e., the sum of expenses across all ids in one particular year for one particular industry).
Then I'm using:
tabstat total_expenses, by(country)
As far as I understand, tabstat should show, in table format, the means of total_expenses. Please do note that ids are different from countries.
In this case, does tabstat calculate the mean across all 9 years and all industries for a particular country, or just the mean of one year and one industry for each country in my panel data?
What would happen if this command is used in the following context:
bysort year industry: egen mean_expenses=mean(expenses)
tabstat mean_expenses, by(country)
Does tabstat create means of means? This is a little bit confusing.
I don't know what is confusing you about what tabstat does, but you need to be clear about what calculating means implies. Your dataset is far too big to post here, but for your sake as well as ours creating a tiny sandbox dataset would help you see what is going on. You should experiment with examples where the correct answer (what you want) is obvious or at least easy to calculate.
As a detail, your explanation that ids are different from countries is itself confusing. My guess is that your data are on firms and the identifier concerned identifies the firm. Then you have aggregations by industry and by country and separately by year.
bysort year industry: egen total_expenses = total(expenses)
This does calculate totals and assign them to every observation. Thus if there are 123 observations for industry A in 2013, there will be 123 identical values of the total in the new variable.
tabstat total_expenses, by(country)
The important detail is that tabstat by default calculates and shows a mean. It just works on all the observations available, unless you specify otherwise. Stata has no memory or understanding of how total_expenses was just calculated. The mean will take no account of the different numbers of observations in each (industry, year) combination. There is no selection of individual values for (industry, year) combinations.
Your final question really has the same flavour. What your command asks for is a brute force calculation using all available data. In effect your calculations are weighted by the numbers of observations in whatever combinations of industry, country and year are being aggregated.
I suspect that you need to learn about two commands: (1) collapse and (2) egen, specifically its tag() function. If you are using Stata 16, frames may also be useful to you; that should apply to any future reader using a later version.
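To make that concrete, here is a tiny sandbox of the kind suggested above (made-up numbers; one country, two industries, one year), showing the difference between the brute-force tabstat mean and a mean taken over one value per (industry, year) combination:
clear
input str1 industry int year double expenses str2 country
"A" 2013  10 "US"
"A" 2013  20 "US"
"B" 2013 100 "US"
end
bysort year industry: egen total_expenses = total(expenses)   // 30, 30, 100: repeated within each group
tabstat total_expenses, by(country)                            // mean over all rows: (30 + 30 + 100)/3 = 53.3
egen pick = tag(industry year)                                 // flags exactly one observation per combination
tabstat total_expenses if pick, by(country)                    // one total per combination: (30 + 100)/2 = 65
* collapse (sum) expenses, by(country industry year) is another route to one row per combination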
I am implementing a logit model on a database of households, using as the dependent variable the classification of the household as poor or not poor (1 if it is poor, 0 if it is not):
proc logistic data=regression;
model poor(event="1") = variable1 variable2 variable3 variable4;
run;
Using proc logistic in SAS, I obtained the "Association of Predicted Probabilities and Observed Responses" table, which gives me the percent concordant. However, I need detailed information on how many households are correctly classified as poor, i.e., something like a classification table.
I would appreciate your help with this issue.
Add the CTABLE option to your MODEL statement.
model poor(event="1") = variable1 variable2 variable3 variable4 / ctable;
CTABLE classifies the input binary response observations according to whether the predicted event probabilities are above or below some cutpoint value z in the range (0, 1). An observation is predicted as an event if the predicted event probability exceeds or equals z. You can supply a list of cutpoints other than the default list by specifying the PPROB= option. Also, you can compute positive and negative predictive values as posterior probabilities by using Bayes' theorem. You can use the PEVENT= option to specify prior probabilities for computing these statistics. The CTABLE option is ignored if the data have more than two response levels. This option is not available with the STRATA statement.
For more information, see the section Classification Table.
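For illustration, applying that to the model from the question and also requesting a list of cutpoints via PPROB= might look like this:
proc logistic data=regression;
  model poor(event="1") = variable1 variable2 variable3 variable4
        / ctable pprob=(0.1 to 0.9 by 0.1);  /* classification table at cutpoints 0.1, 0.2, ..., 0.9 */
run;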