Is there a way to quantify impact of independent variables with gradient boosting? - sas

I've been asked to run a model using gradient boosting or random forest. So far so good, however, the only output that comes back in terms of variable importance is based on the number of times a variable was used as a branch rule. I've now been asked to basically get coefficients or somehow quantify the impact that the variables have on the target.
Is there a way to do this with a gradient boosting model? My other thoughts were to either use only the variables that were showed to be sued as branch rules in a regular decision tree or in a GLM or regular regression model.
Any help or ides would be appreciated!! Thanks so much!

Just to make certain there is not a misunderstanding: SAS implementation of decision tree/gradient boosting (at least in EM) uses Split-Based variable Importance.
Split-Based Importance does NOT count the number splits made.
It is the ratio of the reduction of sum-of-squares by one variable (specific the sum over all splits by this variable) in relation to the reduction of sum-of-squares achieved by all splits in the model.
If you are using surrogate rules, highly correlated variables will receive roughly the same value.

Related

Is there a simple way to reduce the value of positive slack variables in MILP?

Recently, I have been learning optimization, and my optimization problem, (minimization), is encoded in a MILP solver which tells me it's infeasible for my model. Hence, I introduced a few positive/negative slack variables.
Now, I get a feasible solution, but the positive slack variables are way bigger than what I can accept.
So, I gave penalties/weights to those variables (multiplied by large numbers), hoping that the MILP solver would reduce the variables, but that didn't work (I got the same solution)
Is there any approach to follow, in general, when the slack is too large?
Is there a better way to pick the slack variables, in general?
A common pitfall for people new to mathematical programming/optimization is that variables are non-negative by default, that is, they always have an implied lower bound of 0. Your mathematical model may not specify this explicitly, so those variables might need to be declared as free (with a lower bound of -infinity).
In general, you should double-check your model (as LP file) and compare it to the mathematical formulation.
Add both to the objective with a penalty coefficient.
Or add some upper bounds to the slacks.

When applying word2vec, should I standardize the cosine values within each year, when comparing them across years?

I'm a researcher, and I'm trying to apply NPL to understand the temporal changes of the meaning of some words.
So far I have obtained the trained embeddings (word2vec, sgn) of several years with identical parameters in the training.
For example, if I want to test the change of cosine similarity of word A and word B over 5 years, should I just compute them and plot the cosine values?
The reason I'm asking this is that I found the overall cosine values (mean of all possible pairs within that year) differ across the 5 years. **For example, 1990:0.21, 1991:0.19, 1992:0.31, 1993:0.22, 1994:0.31. Does it mean in some years, all words are more similar to each other than other years??
Base on my limited understanding, I think the vectors are odds in logistic functions, so they shouldn't be significantly affected by the size of the corpus? Is it necessary for me to standardize the cosine values (of all pairs within each year) so I can compare the relative ranking change across years? Or just trust the raw cosine values and compare them across years?
In general you should not think of cosine-similarities as an absolute measure that'd be comparable between models. That is, you should not think of "0.7" cosine-similarity as anything like "70%" similar, and choose some arbitrary "70%" threshold to be used across models.
Instead, it's only a measure within a single model's induced space - with its effective 'scale' affected by all the parameters & the training data.
One small exercise that may help illustrate this: with the exact same data, train a 100d model, then a 200d model. Then look at some word pairs, or words alongside their nearest-neighbors ranked by cosine-similarity.
With enough training/data, generally the same highly-related words will be nearest-neighbors of each other. But the effective ranges of cosine-similarity values will be very different. If you chose a specific threshold in one model as meaning, "close enough to feed some other analysis", the same threshold would not be sufficient in the other. Every model is its own world, induced by the training data & parameters, as well as some sources of explicit or implicit randomness during training. (Several parts of the word2vec algorithm use random sampling, but also any efficient multi-threaded training will encounter arbitray differences in training-order via host OS thread-scheduling vagaries.)
If your parameters are identical, & the corpora very-alike in every measurable internal proportion, these effects might be minimized, but never eliminated.
For example, even if people's intended word meanings were perfectly identical, one year's training data might include more discussion of 'war' or 'politics' or some medical-topic, than another. In that case, the iterative, interleaved tug-of-war in training updates will mean words from that overrepresented domain have far more push-pull influence on the final model word positions – essentially warping subregions of the final space for finer distinctions some places, and thus *coarser distinctions in the less-updated zones.
That is, you shouldn't expect any global-per-model scaling factor (as you've implied might apply) to correct for any model-to-model differences. The influences of different data & training runs are far more subtle, and might affect different 'neighborhoods' of words differently.
Instead, when comparing different models, a more stable grounds for comparison is relative rankings or relative-proportions of words with respect to their closeness-to-others. Did words move into, or out of, each others' top-N neighbors? Did A move more closely to B than C did to D? etc.
Even there, you might want to be careful about differences in the full vocabulary: if A & B were each others' closest neighbors year 1, but 5 other words squeeze between them in year 2, did any word's meaning really change? Or might it simply be because those other words weren't even suitably represented in year 1 to receive any position, or previously had somewhat 'noisier' positions nearby? (As words get rarer their positions from run to run will be more idiosyncratic, based on their few usage examples, and the influences of those other sources of run-to-run 'noise'.)
Limiting all such analyses to very-well-represented words will minimize misinterpreting noise-in-the-models as something meaningful. Re-running models more than once, either with same parameters or slightly-different ones, or slightly-different training data subsets, and seeing which comparisons hold up across such changes, may also help determine which observed changes are robust, versus methodological artifacts such as jitter from run-to-run, or other sampling effects.
A few previous answers on similar questions about comparing word-vectors across different source corpora may have other useful ideas or caveats for you:
how calculate distance between 2 node2vec model
Word embeddings for the same word from two different texts
How to compare cosine similarities across three pretrained models?

Influence diagrams / Decision models in Stan and PyMC3

Is it possible to write decision-making models in either Stan or PyMC3? By that I mean: we define not only the distribution of random variables, but also the definition of decision and utility variables, and determine the decisions maximizing expected utility.
My understanding is that Stan is more of a general optimizer than PyMC3, so that suggests decision models would be more directly implemented in it, but I would like to hear what people have to say.
Edit: While it is possible to enumerate all decisions and compute their corresponding expected utility, I am wondering about more efficient methods since the number of decisions could be combinatorially too many (for example, how many items to buy from a list with thousands of products). Influence diagram algorithms exploit factorizations in the model to identify independences that allow computing of the decisions on only a smaller set of relevant random variables. I wonder if either Stan or PyMC3 do that kind of thing.
The basic steps for Bayesian decision theory are:
Enumerate a finite set of decisions that could be made
Specify a utility function of the decision and perhaps other things
Draw from the posterior distribution of all the unknowns given the known data
Evaluate the utility function for each possible decision and each posterior draw
Make the decision with the highest expected utility, averaging over the posterior draws.
You can do those five steps with any software --- Stan and PyMC3 included --- that produces (valid) draws from the posterior distribution. In Stan, the utility function should be evaluated in the generated quantities block.

how to use tf-idf with Naive Bayes?

As per my search regarding the query, that I am posting here, I have got many links which propose solution but haven't mentioned exactly how this is to be done. I have explored, for example, the following links :
Link 1
Link 2
Link 3
Link 4
etc.
Therefore, I am presenting my understanding as to how the Naive Bayes formula with tf-idf can be used here and it is as follows:
Naive-Bayes formula :
P(word|class)=(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_all_classes(basically vocabulary of words in the entire training set))
tf-idf weighting can be employed in the above formula as:
word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.
total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)
total_unique_words_in_all_classes : as is.
This question has been posted multiple times on stack overflow but nothing substantial has been answered so far. I want to know that the way I am thinking about the problem is correct or not i.e. implementation that I have shown above. I need to know this as I am implementing the Naive Bayes myself without taking help of any Python library which comes with the built-in functions for both Naive Bayes and tf-idf. What I actually want is to improve the accuracy(currently 30%) of the model which was using Naive Bayes trained classifier. So, if there are better ways to achieve good accuracy, suggestions are welcome.
Please suggest me. I am new to this domain.
It would be better if you actually gave us the exact features and class you would like to use, or at least give an example. Since none of those have been concretely given, I'll just assume the following is your problem:
You have a number of documents, each of which has a number of words.
You would like to classify documents into categories.
Your feature vector consists of all possible words in all documents, and has values of number of counts in each document.
Your Solution
The tf idf you gave is the following:
word_count_in_class : sum of(tf-idf_weights of the word for all the documents belonging to that class) //basically replacing the counts with the tfidf weights of the same word calculated for every document within that class.
total_words_in_class : sum of (tf-idf weights of all the words belonging to that class)
Your approach sounds reasonable. The sum of all probabilities would sum to 1 independent of the tf-idf function, and the features would reflect tf-idf values. I would say this looks like a solid way to incorporate tf-idf into NB.
Another potential Solution
It took me a while to wrap my head around this problem. The main reason for this was having to worry about maintaining probability normalization. Using a Gaussian Naive Bayes would help ignore this issue entirely.
If you wanted to use this method:
Compute mean, variation of tf-idf values for each class.
Compute the prior using a gaussian distribution generated by the above mean and variation.
Proceed as normal (multiply to prior) and predict values.
Hard coding this shouldn't be too hard since numpy inherently has a gaussian function. I just prefer this type of generic solution for these type of problems.
Additional methods to increase
Apart from the above, you could also use the following techniques to increase accuracy:
Preprocessing:
Feature reduction (usually NMF, PCA, or LDA)
Additional features
Algorithm:
Naive bayes is fast, but inherently performs worse than other algorithms. It may be better to perform feature reduction, and then switch to a discriminative model such as SVM or Logistic Regression
Misc.
Bootstrapping, boosting, etc. Be careful not to overfit though...
Hopefully this was helpful. Leave a comment if anything was unclear
P(word|class)=(word_count_in_class+1)/(total_words_in_class+total_unique_words_in_all_classes
(basically vocabulary of words in the entire training set))
How would this sum up to 1? If using the above conditional probabilities, I assume the SUM is
P(word1|class)+P(word2|class)+...+P(wordn|class) =
(total_words_in_class + total_unique_words_in_class)/(total_words_in_class+total_unique_words_in_all_classes)
To correct this, I think the P(word|class) should be like
(word_count_in_class + 1)/(total_words_in_class+total_unique_words_in_classes(vocabulary of words in class))
Please correct me if I am wrong.
I think there are two ways to do it:
Round down tf-idf as integers, then use the multinomial distribution for the conditional probabilities. See this paper https://www.cs.waikato.ac.nz/ml/publications/2004/kibriya_et_al_cr.pdf.
Use Dirichlet distribution which is a continuous version of the multinomial distribution for the conditional probabilities.
I am not sure if Gaussian mixture will be better.

What is a proper method for minimizing st deviation of dependent variable (e.g. clustering?)

I'm stuck with minimizing the st deviation of a dependent variable being time difference in days. The mean is OK, but the deviation is terrible. Tried clustering by independent variables and noticed quite dissimilar clusters. Now, wondering:
1) How I can actually apply this knowledge from clustering to the independent variable? The fact is that it was not included in initial clustering analysis, as I know it's dependent on the others.
2) Given that I know the variable of time difference is dependent, should I run clustering of it with the variable of cluster number being the result of my initial clustering analysis? Would it help?
3) Is there any other technique apart from clustering that can help me somehow categorize observation groups so that for each group I would have a separate mean of the independent variable with low st deviation?
Any help highly appreciated!
P.S. I was using Stata and SPSS, though I can also use SAS if you can share the code.
It sounds like you're going about this all wrong. Here are some relevant points to consider.
It's more important for the variance to be consistent across groups than it is to be low.
Clustering is (generally) going to organize individuals based on similar patterns of the clustering variables.
Fewer observations will generally not decrease the size of your standard deviation.
Any time you take continuous variables (either IV or DVs) and convert them into categorical variables, you are removing variance from the equation, and including more measurement error. Sometimes there are good reasons to do this, often times there are not.
Analysis should be theory-driven whenever possible, as data driven analysis (like what you're trying to accomplish here) is more likely to produce results that can't be reproduced or generalized to other data sets, samples, or populations.