Weka Logistic Regression Output (Gender) - data-mining

I'm running logistic regression in Weka against a dataset which contains an attribute "Gender" whose members can be {Male,Female}.
I understand the other coefficients in the output, but what does it mean when "Gender" is +0.7647? Is it favoring male or female?
On the preprocess screen, Male comes before Female in the attribute information. Does this mean I am to interpret that LR says being Female increases the probability of class membership?
Edit
I added an example of the output hiding some of the other attribute names.
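Whichever level the coefficient is attached to, its size can be read on the odds scale. A minimal Python sketch, assuming the +0.7647 is the log-odds coefficient of a 0/1 indicator for one Gender level (which level Weka codes as 1 is exactly what the question asks, and is not assumed here):

```python
import math

def odds_ratio(coef: float) -> float:
    """Convert a logistic-regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coef)

gender_coef = 0.7647  # value quoted in the question
ratio = odds_ratio(gender_coef)

# A positive coefficient gives a ratio above 1: the level coded as 1 has
# higher odds of class membership than the level coded as 0.
print(f"odds ratio: {ratio:.3f}")
```

The sign alone therefore says "the level coded 1 is favored"; identifying that level still requires checking how the nominal attribute was converted.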

Related

Any way to sync one parameter to another (same field names in two different tables)

I am building a dashboard to compare baseline demographics to those of a population of interest.
I am visualizing demographic composition with a 100% bar chart, with bars over time and the Legend set to a parameter containing the 4 field options. I then set up a similar visual adjacent to the first: a one-bar chart across the top showing the overall baseline group composition, as a comparison to the population of interest over time.
Data on the population of interest is at the individual person level:
Name  Race   Ethnicity  Gender  Degree
xxxx  Asian  non-Hisp   Female  Doctorate
xxxx  White  Hispanic   Female  Masters
xxxx  White  non-Hisp   Male    Doctorate
Baseline data was obtained from a self-serve table builder, so the data was not at the record level but at the "smallest subgroup" level, i.e., a hierarchical table ultimately broken into individual subgroups of unique Race-Ethnicity-Gender-Degree value combinations, plus their respective counts (e.g., one record could be for 5000 Asian non-Hispanic women with a Doctorate):
Race   Ethnicity  Gender  Degree     Count
Asian  non-Hisp   Female  Doctorate  5000
White  Hispanic   Female  Masters    3000
White  non-Hisp   Male    Doctorate  200
I grouped the first table across those four fields to align the data for comparison. My initial plan was to combine the tables so that the same field parameter could be used for both visuals, with each visual being just a filtered subset of the combined table (a simple "Source" field accomplishes this).
The wrinkle is that I can't utilize the grouped target data in the bar chart over time, as I am contextualizing it with other record-level dimension tables in the data model (including the date variable used to stratify the bars over time). This means that the target data's bar chart and the baseline bar chart must rely on two different parameters, one for each table's fields.
Especially since the fields are identically named, I was hoping for a way to command one field parameter to just take the current value of another so they can be synced and only one parameter needs to be manipulated.
Is this possible? Or can this use case be framed and implemented differently? The only workable solutions I can find are clunky and undesirable (e.g., requiring users to manage two identical slicers). Other avenues I explored were not fruitful: creating a Measure equal to the current value of the param, and manipulating the param table to reference the second table, either in an additional field or embedded as two-element lists in the <Param> Field column. I was also hopeful about Sync Slicers; however, that just imposes the same parameter onto multiple slicers and does not address this use case.
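For reference, the grouping step described above (collapsing record-level rows into one count per unique Race-Ethnicity-Gender-Degree combination) can be sketched outside the dashboard tool. This is a minimal Python illustration, not a Power BI solution; the rows are hypothetical, and the duplicated fourth row is invented purely to show a count above 1:

```python
from collections import Counter

# Hypothetical record-level rows: (Name, Race, Ethnicity, Gender, Degree).
records = [
    ("xxxx", "Asian", "non-Hisp", "Female", "Doctorate"),
    ("xxxx", "White", "Hispanic", "Female", "Masters"),
    ("xxxx", "White", "non-Hisp", "Male", "Doctorate"),
    ("xxxx", "White", "non-Hisp", "Male", "Doctorate"),  # invented duplicate
]

def group_counts(rows):
    """Count rows per unique (Race, Ethnicity, Gender, Degree) combination."""
    # row[1:] drops the Name column so only the four grouping fields remain.
    return Counter(row[1:] for row in rows)

target_counts = group_counts(records)
for subgroup, count in target_counts.items():
    print(subgroup, count)
```

The result has the same shape as the baseline table, which is what makes the two sources comparable; it also shows why the date dimension is lost once rows are collapsed this way.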

Removing characters before a certain value in variable names in stata

EDIT: the issue with this question was resolved as Stata changed the variable names in Excel to variable "labels" upon importing the data, and generated the variable "names" that I needed automatically. So the question is unnecessary.
I have a dataset in Stata that has a handful of variable names, some of which begin with a number and a period. Like so:
name of car    62. color of car    145. year of sale    state of sale
Accord         Red                 1995                 GA
Corvette       Pink                2010                 FL
...
How can I remove the numbers from the variable names that contain them so that I wind up with:
name of car    color of car    year of sale    state of sale
Accord         Red             1995            GA
Corvette       Pink            2010            FL
...
I have some familiarity with the substr() function, but the number of characters I need to remove is not consistent. Instead, I need to remove everything up to and including the period that follows the number.
All those "names" are illegal as variable names, because Stata variable names just can't include spaces or periods or start with a number.
So either your Stata is corrupted beyond belief or you're misunderstanding what you have.
My best guess is that you have read in metadata so that text that could and should be variable labels is in fact making up the first observation (row) in your dataset. If so, the best advice is to go back and repeat the import so that metadata is not read into the dataset. The commands concerned have options to choose that.
In any case, it is immensely better to show data examples using dataex: see the tag wiki for Stata.
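If the text does end up stored as labels, the string operation the question describes (drop a leading number and its period, whatever the number's length) is a single anchored pattern. A minimal sketch in Python rather than Stata, purely to illustrate the logic; the label list is copied from the example in the question:

```python
import re

# Matches a leading run of digits, the period after it, and any spaces that follow.
PREFIX = re.compile(r"^\d+\.\s*")

def strip_numeric_prefix(label: str) -> str:
    """Remove a leading 'NN. ' prefix; labels without one are returned unchanged."""
    return PREFIX.sub("", label)

labels = ["name of car", "62. color of car", "145. year of sale", "state of sale"]
cleaned = [strip_numeric_prefix(s) for s in labels]
print(cleaned)
```

Because the pattern is anchored at the start and matches any number of digits, the inconsistent prefix length does not matter.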

Details of predictions in proc logistic

I am fitting a logit model to a database of households, using as the dependent variable the classification of each household as poor or not poor (1 if it is poor, 0 if it is not):
proc logistic data=regression;
model poor(event="1") = variable1 variable2 variable3 variable4;
run;
Using proc logistic in SAS, I obtained the table "Association of predicted probabilities and observed responses", which gives me the percent concordant. However, I need detailed information on how many households are correctly classified as poor, in this way:
I will appreciate your help with this issue.
Add the CTABLE option to your MODEL statement.
model poor(event="1") = variable1 variable2 variable3 variable4 / ctable;
CTABLE classifies the input binary response observations according to
whether the predicted event probabilities are above or below some
cutpoint value z in the range [0, 1]. An observation is predicted as an
event if the predicted event probability exceeds or equals z. You can
supply a list of cutpoints other than the default list by specifying
the PPROB= option. Also, you can compute positive and negative
predictive values as posterior probabilities by using Bayes’ theorem.
You can use the PEVENT= option to specify prior probabilities for
computing these statistics. The CTABLE option is ignored if the data
have more than two response levels. This option is not available with
the STRATA statement.
For more information, see the section Classification Table.
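The counting rule quoted above can be sketched in a few lines. This is a minimal Python illustration, not SAS output: the outcomes and probabilities are hypothetical, and the sketch classifies the probabilities exactly as given, so its counts need not match what CTABLE reports for a fitted model.

```python
def classification_table(y_true, p_event, cutpoint=0.5):
    """Cross-tabulate observed 0/1 outcomes against predictions at one cutpoint."""
    table = {"correct_event": 0, "correct_nonevent": 0,
             "incorrect_event": 0, "incorrect_nonevent": 0}
    for observed, prob in zip(y_true, p_event):
        predicted_event = prob >= cutpoint  # "exceeds or equals z"
        if predicted_event and observed == 1:
            table["correct_event"] += 1
        elif predicted_event:
            table["incorrect_event"] += 1      # predicted poor, actually not poor
        elif observed == 0:
            table["correct_nonevent"] += 1
        else:
            table["incorrect_nonevent"] += 1   # predicted not poor, actually poor
    return table

# Hypothetical outcomes (1 = poor) and predicted probabilities, for illustration only.
poor = [1, 0, 1, 0, 1]
p_hat = [0.8, 0.3, 0.5, 0.6, 0.2]
result = classification_table(poor, p_hat, cutpoint=0.5)
print(result)
```

Here "correct_event" is the number of households correctly classified as poor at the chosen cutpoint, which is the quantity the question asks for.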

Linear Regression: Finding Significant Class Variables Using SAS

I'm attempting to use SAS to do a pretty basic regression problem but I'm having trouble getting the full set of results.
I'm using a data set that includes professors' overall quality (the dependent variable) and has the following independent variables: gender, numYears, pepper, discipline, easiness, and raterInterest.
I'm using the code below to generate the analysis of the data set:
proc glm data=WORK.IMPORT;
class gender pepper discipline;
model quality = gender numYears pepper discipline easiness raterInterest;
run;
I get the following results, which is mostly what I need, EXCEPT that I would like to see exactly which responses from the class variables (gender, pepper, discipline) are significant.
From these results, I can see that easiness, raterInterest, pepper, and discipline are significant; however, I'd like to see which specific values of pepper and discipline are significant. For example, pepper was answered as a 'yes' or 'no' by the student. I'd like to see whether quality correlates specifically with pepper = yes or pepper = no. Can anyone give me some advice about how to alter my code to return a breakdown of the class variables?
Here is also a link to the dataset, in case it's needed for reference:
https://drive.google.com/file/d/1Kc9cb_n-l7qwWRNfzXtZi5OsiY-gsYZC/view?usp=sharing (Rateprof)
I really, truly appreciate any assistance!
Add the solution option to your model statement to break out statistics of each class variable; however, reference parameterization is not available in proc glm, and will cause biased estimates. There are ways around this to continue using proc glm, but the simplest solution is to use proc glmselect instead. proc glmselect allows you to specify reference parameterization. Use the selection=none option to disable variable selection.
proc glmselect data=WORK.IMPORT;
class gender pepper discipline / param=reference;
model quality = gender numYears pepper discipline easiness raterInterest / selection=none;
run;
The interpretation of this would be:
All other variables held constant, females affect the quality rating by
-0.046782 units compared to males. This variable is not statistically significant.
The breakdown of each class level is a comparison to a reference value. By default, the reference value selected is the last level after all class values are internally sorted. You can specify a reference using the ref= option after each class variable. For example, if you wanted to use females as a reference value instead of males:
proc glmselect data=WORK.IMPORT;
class gender(ref='female') pepper discipline / param=reference;
model quality = gender numYears pepper discipline easiness raterInterest / selection=none;
run;
Note that you can also do this with proc mixed. For this specific purpose, the choice is up to you, based on the output style that you like. proc mixed is a more flexible way to run regressions, but would be a bit overkill here.
proc mixed data=import;
class gender pepper discipline;
model quality = gender numYears pepper discipline easiness raterInterest / solution;
run;
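To make the reference coding described above concrete, here is a minimal Python sketch (not SAS) of how a class variable becomes 0/1 columns with one level left out. It assumes the rule stated earlier, that the default reference is the last level after sorting; the function name and the sample values are hypothetical.

```python
def reference_code(values, ref=None):
    """Return (column_levels, rows) of 0/1 indicators, omitting the reference level."""
    levels = sorted(set(values))
    if ref is None:
        ref = levels[-1]  # assumed default: last level after sorting
    columns = [lv for lv in levels if lv != ref]
    rows = [[1 if v == lv else 0 for lv in columns] for v in values]
    return columns, rows

gender = ["female", "male", "female"]  # hypothetical sample values

# Default: 'male' is the reference, so the single column compares female to male.
default_cols, default_rows = reference_code(gender)

# Explicit reference: the column now compares male to female.
female_ref_cols, female_ref_rows = reference_code(gender, ref="female")

print(default_cols, default_rows)
print(female_ref_cols, female_ref_rows)
```

Each remaining column's coefficient is the difference from the omitted level, which is why changing the reference flips which group the estimate describes.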

How do I perform spatial logistic regression in SAS?

I am trying to develop a spatiotemporal logistic regression model to predict the presence/absence of a disease in U.S. counties (contiguous U.S.) based on climatologic variables, with data points for each year between 2007 and 2014; ideally, I would like a model with functionality to score additional datasets, e.g., use the model developed for 2006-2014 to predict disease probability in future climate scenarios. The model needs to account for spatial autocorrelation, and (again, ideally) repeated measures (each county has one data point per year). Unfortunately, my SAS abilities are not up to the task. Would anyone have suggestions for developing the model? The data, in csv format, take the form of:
countyFIPS year outcome predictor1 predictor2 predictor3 latitude longitude
where
countyFIPS = unique 5-digit identifier for U.S. counties
outcome = at least one case in the county for the given year, coded 0/1
latitude and longitude denote the centroid of the county
I'm really bad at this, so please be gentle and use small words...